Tidy, analyze, and plot causal directed acyclic graphs (DAGs). ggdag uses the powerful dagitty package to create and analyze structural causal models and to plot them with ggplot2 and ggraph in a consistent and easy manner.
You can install ggdag with:
install.packages("ggdag")
Or you can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("malcolmbarrett/ggdag")
ggdag makes it easy to use dagitty in the context of the tidyverse. You can directly tidy dagitty objects or use convenience functions to create DAGs using a more R-like syntax:
library(ggdag)
library(ggplot2)
# example from the dagitty package
dag <- dagitty::dagitty("dag {
y <- x <- z1 <- v -> z2 -> y
z1 <- w1 <-> w2 -> z2
x <- w1 -> y
x <- w2 -> y
x [exposure]
y [outcome]
}")
tidy_dag <- tidy_dagitty(dag)
tidy_dag
#> # A DAG with 7 nodes and 12 edges
#> #
#> # Exposure: x
#> # Outcome: y
#> #
#> # A tibble: 13 × 8
#>    name       x     y direction to      xend  yend circular
#>    <chr>  <dbl> <dbl> <fct>     <chr>  <dbl> <dbl> <lgl>
#>  1 v     0.496  3.40  ->        z1    1.83   2.92  FALSE
#>  2 v     0.496  3.40  ->        z2    0.0188 2.08  FALSE
#>  3 w1    1.73   1.94  ->        x     2.07   1.42  FALSE
#>  4 w1    1.73   1.94  ->        y     1.00   0.944 FALSE
#>  5 w1    1.73   1.94  ->        z1    1.83   2.92  FALSE
#>  6 w1    1.73   1.94  <->       w2    0.873  1.56  FALSE
#>  7 w2    0.873  1.56  ->        x     2.07   1.42  FALSE
#>  8 w2    0.873  1.56  ->        y     1.00   0.944 FALSE
#>  9 w2    0.873  1.56  ->        z2    0.0188 2.08  FALSE
#> 10 x     2.07   1.42  ->        y     1.00   0.944 FALSE
#> 11 y     1.00   0.944 <NA>      <NA>  NA     NA    FALSE
#> 12 z1    1.83   2.92  ->        x     2.07   1.42  FALSE
#> 13 z2    0.0188 2.08  ->        y     1.00   0.944 FALSE
# using more R-like syntax to create the same DAG
tidy_ggdag <- dagify(
y ~ x + z2 + w2 + w1,
x ~ z1 + w1 + w2,
z1 ~ w1 + v,
z2 ~ w2 + v,
w1 ~~ w2, # bidirected path
exposure = "x",
outcome = "y"
) %>%
tidy_dagitty()
tidy_ggdag
#> # A DAG with 7 nodes and 12 edges
#> #
#> # Exposure: x
#> # Outcome: y
#> #
#> # A tibble: 13 × 8
#>    name      x     y direction to     xend  yend circular
#>    <chr> <dbl> <dbl> <fct>     <chr> <dbl> <dbl> <lgl>
#>  1 v      3.58  3.30 ->        z1     4.05  4.63 FALSE
#>  2 v      3.58  3.30 ->        z2     2.23  3.74 FALSE
#>  3 w1     3.03  5.74 ->        x      3.20  5.14 FALSE
#>  4 w1     3.03  5.74 ->        y      1.98  5.22 FALSE
#>  5 w1     3.03  5.74 ->        z1     4.05  4.63 FALSE
#>  6 w1     3.03  5.74 <->       w2     2.35  4.72 FALSE
#>  7 w2     2.35  4.72 ->        x      3.20  5.14 FALSE
#>  8 w2     2.35  4.72 ->        y      1.98  5.22 FALSE
#>  9 w2     2.35  4.72 ->        z2     2.23  3.74 FALSE
#> 10 x      3.20  5.14 ->        y      1.98  5.22 FALSE
#> 11 y      1.98  5.22 <NA>      <NA>  NA    NA    FALSE
#> 12 z1     4.05  4.63 ->        x      3.20  5.14 FALSE
#> 13 z2     2.23  3.74 ->        y      1.98  5.22 FALSE
ggdag also provides functionality for analyzing DAGs and plotting them in ggplot2:
ggdag(tidy_ggdag) +
theme_dag()
ggdag_adjustment_set(tidy_ggdag, node_size = 14) +
theme(legend.position = "bottom")
As well as geoms and other functions for plotting them directly in ggplot2:
dagify(m ~ x + y) %>%
tidy_dagitty() %>%
node_dconnected("x", "y", controlling_for = "m") %>%
ggplot(aes(
x = x,
y = y,
xend = xend,
yend = yend,
shape = adjusted,
col = d_relationship
)) +
geom_dag_edges(end_cap = ggraph::circle(10, "mm")) +
geom_dag_collider_edges() +
geom_dag_point() +
geom_dag_text(col = "white") +
theme_dag() +
scale_adjusted() +
expand_plot(expand_y = expansion(c(0.2, 0.2))) +
scale_color_viridis_d(
name = "d-relationship",
na.value = "grey85",
begin = .35
)
And common structures of bias:
ggdag_equivalent_dags(confounder_triangle())
ggdag_butterfly_bias(edge_type = "diagonal")
Author: malcolmbarrett
Source Code: https://github.com/malcolmbarrett/ggdag
License: Unknown, MIT licenses found
An R package for classifying Twitter accounts as bot or not.
Uses machine learning to classify Twitter accounts as bots or not bots. The default model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. The fast model is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots.
Overall, the default model is correct 93.8% of the time.
Overall, the fast model is correct 91.9% of the time.
Install from CRAN:
## install from CRAN
install.packages("tweetbotornot")
Install the development version from Github:
## install remotes if not already
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
## install tweetbotornot from github
remotes::install_github("mkearney/tweetbotornot")
Users must be authorized in order to interact with Twitter's API. To set up your machine to make authorized requests, you'll either need to be signed into Twitter and working in an interactive session of R (the browser will open asking you to authorize the rtweet client, rstats2twitter), or you'll need to create an app (and have a developer account) and your own API token. The latter has the benefit of (a) having sufficient permissions for write-access and DM (direct message) read-access levels and (b) more stability if Twitter decides to shut down [@kearneymw](https://twitter.com/kearneymw)'s access to Twitter (I try to be very responsible these days, but Twitter isn't always friendly to academic use cases). To create an app and your own Twitter token, see the instructions provided in the rtweet package.
There's one function, tweetbotornot() (technically there's also botornot(), but it does the exact same thing). Give it a vector of screen names or user IDs and let it go to work.
## load package
library(tweetbotornot)
## select users
users <- c("realdonaldtrump", "netflix_bot",
"kearneymw", "dataandme", "hadleywickham",
"ma_salmon", "juliasilge", "tidyversetweets",
"American__Voter", "mothgenerator", "hrbrmstr")
## get botornot estimates
data <- tweetbotornot(users)
## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#> screen_name user_id prob_bot
#> <chr> <chr> <dbl>
#> 1 hadleywickham 69133574 0.00754
#> 2 realDonaldTrump 25073877 0.00995
#> 3 kearneymw 2973406683 0.0607
#> 4 ma_salmon 2865404679 0.150
#> 5 juliasilge 13074042 0.162
#> 6 dataandme 3230388598 0.227
#> 7 hrbrmstr 5685812 0.320
#> 8 netflix_bot 1203840834 0.978
#> 9 tidyversetweets 935569091678691328 0.997
#> 10 mothgenerator 3277928935 0.998
#> 11 American__Voter 829792389925597184 1.000
The botornot() function also accepts data returned by rtweet functions.
## get most recent 100 tweets from each user
tmls < get_timelines(users, n = 100)
## pass the returned data to botornot()
data <- botornot(tmls)
## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#> screen_name user_id prob_bot
#> <chr> <chr> <dbl>
#> 1 hadleywickham 69133574 0.00754
#> 2 realDonaldTrump 25073877 0.00995
#> 3 kearneymw 2973406683 0.0607
#> 4 ma_salmon 2865404679 0.150
#> 5 juliasilge 13074042 0.162
#> 6 dataandme 3230388598 0.227
#> 7 hrbrmstr 5685812 0.320
#> 8 netflix_bot 1203840834 0.978
#> 9 tidyversetweets 935569091678691328 0.997
#> 10 mothgenerator 3277928935 0.998
#> 11 American__Voter 829792389925597184 1.000
fast = TRUE
The default [gradient boosted] model uses both user-level (bio, location, number of followers and friends, etc.) and tweet-level (number of hashtags, mentions, capital letters, etc. in a user's most recent 100 tweets) data to estimate the probability that users are bots. For larger data sets, this method can be quite slow. Due to Twitter's REST API rate limits, users are limited to only 180 estimates per 15-minute interval.
To maximize the number of estimates per 15 minutes (at the cost of being less accurate), use the fast = TRUE argument. This method uses only user-level data, which increases the maximum number of estimates per 15 minutes to 90,000! Due to the loss in accuracy, this method should be used with caution!
## get botornot estimates
data <- botornot(users, fast = TRUE)
## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#> screen_name user_id prob_bot
#> <chr> <chr> <dbl>
#> 1 hadleywickham 69133574 0.00185
#> 2 kearneymw 2973406683 0.0415
#> 3 ma_salmon 2865404679 0.0661
#> 4 dataandme 3230388598 0.0965
#> 5 juliasilge 13074042 0.112
#> 6 hrbrmstr 5685812 0.121
#> 7 realDonaldTrump 25073877 0.368
#> 8 netflix_bot 1203840834 0.978
#> 9 tidyversetweets 935569091678691328 0.998
#> 10 mothgenerator 3277928935 0.999
#> 11 American__Voter 829792389925597184 0.999
In order to avoid confusion, the package was renamed from “botrnot” to “tweetbotornot” in June 2018. This package should not be confused with the botornot application.
Author: mkearney
Source Code: https://github.com/mkearney/tweetbotornot
License: Unknown, MIT licenses found
Assertive programming for R analysis pipelines.
The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.
This package does not need to be used with the magrittr/dplyr piping mechanism, but the examples in this README use them for clarity.
You can install the latest version on CRAN like this:
install.packages("assertr")
or you can install the bleeding-edge development version like this:
install.packages("devtools")
devtools::install_github("ropensci/assertr")
This package offers five assertion functions, assert, verify, insist, assert_rows, and insist_rows, that are designed to be used shortly after data loading in an analysis pipeline...
Let's say, for example, that R's built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might want to first confirm our assumptions about the data: that it has the columns we need and a sensible number of rows, that mpg is positive and free of wild outliers, that am and vs are binary, and that no row is excessively incomplete, duplicated, or a flagrant multivariate outlier.
This could be written (in order) using assertr like this:
library(dplyr)
library(assertr)
mtcars %>%
verify(has_all_names("mpg", "vs", "am", "wt")) %>%
verify(nrow(.) > 10) %>%
verify(mpg > 0) %>%
insist(within_n_sds(4), mpg) %>%
assert(in_set(0,1), am, vs) %>%
assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
insist_rows(maha_dist, within_n_mads(10), everything()) %>%
group_by(cyl) %>%
summarise(avg.mpg=mean(mpg))
If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated early.
Let's see what the error messages look like when you chain a bunch of failing assertions together.
> mtcars %>%
+ chain_start %>%
+ assert(in_set(1, 2, 3, 4), carb) %>%
+ assert_rows(rowMeans, within_bounds(0,5), gear:carb) %>%
+ verify(nrow(.)==10) %>%
+ verify(mpg < 32) %>%
+ chain_end
There are 7 errors across 4 verbs:

verb redux_fn predicate column index value
1 assert <NA> in_set(1, 2, 3, 4) carb 30 6.0
2 assert <NA> in_set(1, 2, 3, 4) carb 31 8.0
3 assert_rows rowMeans within_bounds(0, 5) ~gear:carb 30 5.5
4 assert_rows rowMeans within_bounds(0, 5) ~gear:carb 31 6.5
5 verify <NA> nrow(.) == 10 <NA> 1 NA
6 verify <NA> mpg < 32 <NA> 18 NA
7 verify <NA> mpg < 32 <NA> 20 NA
Error: assertr stopped execution
What does assertr give me?
verify takes a data frame (its first argument is provided by the %>% operator above) and a logical (boolean) expression. Then, verify evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression's result are FALSE, verify will raise an error that terminates any further processing of the pipeline.
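For instance, a minimal sketch using the built-in mtcars data (the expressions here are illustrative, not part of the README's pipeline):

```r
library(assertr)

# passes: every mpg value is positive, so verify() returns the data
# frame unchanged and a pipeline could continue
verify(mtcars, mpg > 0)

# fails: no car gets over 100 mpg, so this call raises an error
# verify(mtcars, mpg > 100)
```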
assert takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and will raise an error if it finds any violations. Internally, the assert function uses dplyr's select function to extract the columns to test the predicate function on.
insist takes a data frame, a predicate-generating function, and an arbitrary number of columns. For each column, the predicate-generating function is applied, returning a predicate. The predicate is then applied to every element of the columns selected, and will raise an error if it finds any violations. The reason for using a predicate-generating function to return a predicate to use against each value in each of the selected columns is so that, for example, bounds can be dynamically generated based on what the data look like; this is the only way to, say, create bounds that check if each datum is within x z-scores, since the standard deviation isn't known a priori. Internally, the insist function uses dplyr's select function to extract the columns to test the predicate function on.
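To make the distinction concrete, here is a sketch using mtcars (the column choices and bounds are illustrative): assert applies a fixed predicate, while insist builds the predicate from the column's own statistics:

```r
library(assertr)

# assert: a fixed, known-in-advance predicate; am and vs must each be 0 or 1
assert(mtcars, in_set(0, 1), am, vs)

# insist: the bounds are generated from the data itself; every mpg value
# must fall within 4 standard deviations of the mean of mpg
insist(mtcars, within_n_sds(4), mpg)
```

Both calls pass on mtcars and return the data frame, so they can sit silently in the middle of a pipeline.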
assert_rows takes a data frame, a row reduction function, a predicate function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate function is then applied to every element of the vector returned from the row reduction function, and will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the num_row_NAs() function to ensure that there is below a certain number of missing values in each row. Internally, the assert_rows function uses dplyr's select function to extract the columns to test the predicate function on.
insist_rows takes a data frame, a row reduction function, a predicate-generating function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate-generating function is then applied to the vector returned from the row reduction function and the resultant predicate is applied to each element of that vector. It will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the maha_dist() function to ensure that there are no flagrant outliers. Internally, the insist_rows function uses dplyr's select function to extract the columns to test the predicate function on.
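A short sketch of both row-wise verbs, using the built-in airquality and mtcars datasets (the bounds are illustrative):

```r
library(dplyr)
library(assertr)

# assert_rows: no row of airquality may have more than two missing values
airquality %>%
  assert_rows(num_row_NAs, within_bounds(0, 2), everything())

# insist_rows: no row of mtcars may be a flagrant multivariate outlier,
# i.e. each row's Mahalanobis distance must lie within 10 median
# absolute deviations of the others
mtcars %>%
  insist_rows(maha_dist, within_n_mads(10), everything())
```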
assertr also offers four (so far) predicate functions designed to be used with the assert and assert_rows functions:

not_na - checks if an element is not NA
within_bounds - returns a predicate function that checks if a numeric value falls within the bounds supplied
in_set - returns a predicate function that checks if an element is a member of the set supplied (also allows the inverse, "not in set")
is_uniq - checks to see if each element appears only once

and predicate generators designed to be used with the insist and insist_rows functions:

within_n_sds - used to dynamically create bounds to check vector elements with, based on standard z-scores
within_n_mads - a better method for dynamically creating bounds to check vector elements with, based on 'robust' z-scores (using the median absolute deviation)

and the following row reduction functions designed to be used with assert_rows and insist_rows:

num_row_NAs - counts the number of missing values in each row
maha_dist - computes the Mahalanobis distance of each row (for outlier detection); it will coerce categorical variables into numerics if it needs to
col_concat - concatenates all rows into strings
duplicated_across_cols - checks if a row contains a duplicated value across columns

and, finally, some other utilities for use with verify:

has_all_names - checks if the data frame or list has all the supplied names
has_only_names - checks that a data frame or list has only the names requested
has_class - checks if the passed data has a particular class

For more info, check out the assertr vignette:
> vignette("assertr")
Or read it here
Author: ropensci
Source Code: https://github.com/ropensci/assertr
License: View license
How to install
visdat is available on CRAN
install.packages("visdat")
If you would like to use the development version, install from github with:
# install.packages("devtools")
devtools::install_github("ropensci/visdat")
What does visdat do?
Initially inspired by csv-fingerprint, vis_dat helps you visualise a dataframe and "get a look at the data" by displaying the variable classes in a dataframe as a plot with vis_dat, and getting a brief look into missing data patterns using vis_miss.
visdat has 8 functions:
vis_dat()
visualises a dataframe showing you what the classes of the columns are, and also displaying the missing data.
vis_miss()
visualises just the missing data, and allows for missingness to be clustered and columns rearranged. vis_miss() is similar to missing.pattern.plot from the mi package. Unfortunately missing.pattern.plot is no longer in the mi package (as of 14/02/2016).
vis_compare()
visualise differences between two dataframes of the same dimensions
vis_expect()
visualise where certain conditions hold true in your data
vis_cor()
visualise the correlation of variables in a nice heatmap
vis_guess()
visualise the individual class of each value in your data
vis_value()
visualise the value class of each cell in your data
vis_binary()
visualise the occurrence of binary values in your data
You can read more about visdat in the vignette, “using visdat”.
Please note that the visdat project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Examples
vis_dat()
Let's see what's inside the airquality dataset from base R, which contains information about daily air quality measurements in New York from May to September 1973. More information about the dataset can be found with ?airquality.
library(visdat)
vis_dat(airquality)
The plot above tells us that R reads this dataset as having numeric and integer values, with some missing data in Ozone and Solar.R. The classes are represented on the legend, and missing data represented by grey. The column/variable names are listed on the x axis.
vis_miss()
We can explore the missing data further using vis_miss():
vis_miss(airquality)
Percentages of missing/complete in vis_miss are accurate to 1 decimal place.
You can cluster the missingness by setting cluster = TRUE:
vis_miss(airquality,
cluster = TRUE)
Columns can also be arranged in order of missingness by setting sort_miss = TRUE:
vis_miss(airquality,
sort_miss = TRUE)
vis_miss indicates when there is a very small amount of missing data, at <0.1% missingness:
test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))
vis_miss(test_miss_df)
vis_miss will also indicate when there is no missing data at all:
vis_miss(mtcars)
To further explore the missingness structure in a dataset, I recommend the naniar package, which provides more general tools for graphical and numerical exploration of missing values.
vis_compare()
Sometimes you want to see what has changed in your data. vis_compare() displays the differences in two dataframes of the same size. Let's look at an example.
Let's make some changes to the chickwts dataset, and compare this new dataset to the original:
set.seed(201904031105)
chickwts_diff <- chickwts
chickwts_diff[sample(1:nrow(chickwts), 30), sample(1:ncol(chickwts), 2)] <- NA
vis_compare(chickwts_diff, chickwts)
Here the differences are marked in blue.
If you try and compare differences when the dimensions are different, you get an ugly error:
chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight * 2
vis_compare(chickwts, chickwts_diff_2)
# Error in vis_compare(chickwts, chickwts_diff_2) :
# Dimensions of df1 and df2 are not the same. vis_compare requires dataframes of identical dimensions.
vis_expect()
vis_expect visualises where certain conditions hold true in your data. For example, if you are not sure whether your data (airquality) contains values greater than or equal to 25, you could write vis_expect(airquality, ~.x >= 25) and see where the values in your data meet that condition:
vis_expect(airquality, ~.x >= 25)
This shows the proportion of times that there are values greater than 25, as well as the missings.
vis_cor()
To make it easy to plot correlations of your data, use vis_cor:
vis_cor(airquality)
vis_value
vis_value() visualises the values of your data on a 0 to 1 scale.
vis_value(airquality)
It only works on numeric data, so you might get strange results if you are using factors:
library(ggplot2)
vis_value(iris)
data input can only contain numeric values, please subset the data to the numeric values you would like. dplyr::select_if(data, is.numeric) can be helpful here!
So you might need to subset the data beforehand like so:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
select_if(is.numeric) %>%
vis_value()
vis_binary()
vis_binary() visualises binary values. See below for use with the example data, dat_bin:
vis_binary(dat_bin)
If your data contains anything other than binary values, an error will be shown:
vis_binary(airquality)
Error in test_if_all_binary(data) :
data input can only contain binary values  this means either 0 or 1, or NA. Please subset the data to be binary values, or see ?vis_value.
vis_guess()
vis_guess() takes a guess at what each cell is. It's best illustrated using some messy data, which we'll make here:
messy_vector <- c(TRUE,
T,
"TRUE",
"T",
"01/01/01",
"01/01/2001",
NA,
NaN,
"NA",
"Na",
"na",
"10",
10,
"10.1",
10.1,
"abc",
"$%TG")
set.seed(201904031106)
messy_df <- data.frame(var1 = messy_vector,
var2 = sample(messy_vector),
var3 = sample(messy_vector))
vis_guess(messy_df)
vis_dat(messy_df)
So here we see that there are many different kinds of data in your dataframe. As an analyst this might be a depressing finding. We can see this comparison above.
Thank yous
Thank you to Ivan Hanigan, who first commented with this suggestion after I made a blog post about an initial prototype, ggplot_missing, and to Jenny Bryan, whose tweet got me thinking about vis_dat, and for her code contributions that removed a lot of errors.
Thank you to Hadley Wickham for suggesting the use of the internals of readr to make vis_guess work. Thank you to Miles McBain for his suggestions on how to improve vis_guess; this resulted in making it at least 2-3 times faster. Thanks to Carson Sievert for writing the code that combined plotly with visdat, and to Noam Ross for suggesting this in the first place. Thank you also to Earo Wang and Stuart Lee for their help with capturing expressions in vis_expect.
Finally, thank you to rOpenSci and its amazing onboarding process; this process has made visdat a much better package. Thanks to the editor Noam Ross (@noamross) and the reviewers Sean Hughes (@seaaan) and Mara Averick (@batpigandme).
Author: ropensci
Source Code: https://github.com/ropensci/visdat
License: View license
Advanced Image-Processing in R
Bindings to ImageMagick: the most comprehensive open-source image processing library available. Supports many common formats (png, jpeg, tiff, pdf, etc) and manipulations (rotate, scale, crop, trim, flip, blur, etc). All operations are vectorized via the Magick++ STL, meaning they operate either on a single frame or a series of frames for working with layers, collages, or animation. In RStudio, images are automatically previewed when printed to the console, resulting in an interactive editing environment.
About the R package:
About the underlying library:
Run examples in RStudio to see live previews of the images! If you do not use RStudio, use image_browse
to open images. On Linux you can also use image_display
to get an X11 preview.
library(magick)
frink <- image_read("https://jeroen.github.io/images/frink.png")
image_trim(frink)
image_scale(frink, "200x200")
image_flip(frink)
image_rotate(frink, 45) ## <-- result of this is shown
image_negate(frink)
frink %>%
image_background("green") %>%
image_flatten() %>%
image_border("red", "10x10")
image_rotate(frink, 45) %>% image_write("man/figures/frinkrotated.png")
Effects
image_oilpaint(frink)
image_implode(frink)
image_charcoal(frink) ## <-- result of this is shown
image_blur(frink)
image_edge(frink)
image_charcoal(frink) %>% image_write("man/figures/frinkcharcoal.png")
Create GIF animation:
# Download images
oldlogo <- image_read("https://developer.r-project.org/Logo/Rlogo2.png")
newlogo <- image_read("https://jeroen.github.io/images/Rlogoold.png")
logos <- c(oldlogo, newlogo)
logos <- image_scale(logos, "400x400")
# Create GIF
(animation1 <- image_animate(logos))
image_write(animation1, "man/figures/anim1.gif")
# Morph effect  <-- result of this is shown
(animation2 <- image_animate(image_morph(logos, frames = 10)))
image_write(animation2, "man/figures/anim2.gif")
Read GIF animation frames. See the rotating earth example GIF.
earth < image_read("https://upload.wikimedia.org/wikipedia/commons/2/2c/Rotating_earth_%28large%29.gif")
length(earth)
earth[1]
earth[1:3]
earth1 <- rev(image_flip(earth)) ## How Australians see earth
image_write(earth1, "man/figures/earth1.gif") ## <-- result of this is shown
R logo with dancing banana
logo <- image_read("https://www.r-project.org/logo/Rlogo.png")
banana <- image_read("https://jeroen.github.io/images/banana.gif")
front <- image_scale(banana, "300")
background <- image_scale(logo, "400")
frames <- lapply(as.list(front), function(x) image_flatten(c(background, x)))
image_write(image_animate(image_join(frames)), "man/figures/Rlogobanana.gif")
This demo application shows how to use magick with shiny: https://github.com/jeroen/shinymagick
Binary packages for macOS or Windows can be installed directly from CRAN:
install.packages("magick")
Installation from source on Linux or OSX requires the imagemagick Magick++ library. On Debian or Ubuntu install libmagick++-dev:
sudo apt-get install -y libmagick++-dev
If you are on Ubuntu 14.04 (trusty) or 16.04 (xenial) you can get a more recent backport from the PPA:
sudo add-apt-repository -y ppa:cran/imagemagick
sudo apt-get update
sudo apt-get install -y libmagick++-dev
On Fedora, CentOS or RHEL we need ImageMagick-c++-devel. However, on CentOS the system version of ImageMagick is quite old. More recent versions are available from the ImageMagick downloads website.
sudo yum install ImageMagick-c++-devel
On macOS use imagemagick@6 from Homebrew.
brew install imagemagick@6
The unversioned homebrew formula imagemagick can also be used; however, it has some unsolved OpenMP problems.
There is also a fork of imagemagick called graphicsmagick, but this doesn't work for this package.
Author: ropensci
Source Code: https://github.com/ropensci/magick
License: View license
Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plaintext one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or paywalled search engines.
pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.
On Windows and Mac the binary packages can be installed directly from CRAN:
install.packages("pdftools")
Installation on Linux requires the poppler development library. For Ubuntu 18.04 (Bionic) and Ubuntu 20.04 (Focal) we provide backports of poppler version 22.02 to support the latest functionality:
sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev
On other versions of Debian or Ubuntu simply use:
sudo apt-get install libpoppler-cpp-dev
If you want to install the package from source on MacOS you need brew:
brew install poppler
On Fedora:
sudo yum install poppler-cpp-devel
Update: It is now recommended to use the backport PPA mentioned above. If you really want to build from source, follow the instructions in this askubuntu.com answer.
On CentOS the libpoppler-cpp library is not included with the system, so we need to build it from source. Note that recent versions of poppler require C++11, which is not available on CentOS, so we build a slightly older version of libpoppler.
# Build dependencies
yum install wget xz libjpeg-devel openjpeg2-devel
# Download and extract
wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
tar -Jxvf poppler-0.47.0.tar.xz
cd poppler-0.47.0
# Build and install
./configure
make
sudo make install
By default libraries get installed in /usr/local/lib
and /usr/local/include
. On CentOS this is not a default search path so we need to set PKG_CONFIG_PATH
and LD_LIBRARY_PATH
to point R to the right directory:
export LD_LIBRARY_PATH="/usr/local/lib"
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
We can then start R and install pdftools
.
The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.
library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")
# first page text
cat(txt[1])
# second page text
cat(txt[2])
In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:
# Table of contents
toc <- pdf_toc("1403.2805.pdf")
# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.
# Author, version, etc
info <- pdf_info("1403.2805.pdf")
# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")
A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page
to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.
# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)
# save bitmap image
png::writePNG(bitmap, "page.png")
webp::write_webp(bitmap, "page.webp")
This feature is still experimental and currently does not work on Windows.
Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have a notion of a table (unlike, for example, HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with pdftools.
txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")
# some tables
cat(txt[18])
cat(txt[19])
The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps, or may be impossible to use on systems where Java cannot be installed.
It is possible to use pdftools with some creativity to parse tables from PDF documents, which does not require Java to be installed.
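As a rough sketch of that approach (pure base R, with a toy string standing in for one page of pdf_text() output), you can split the page into lines and split each line on runs of two or more spaces:

```r
# Toy stand-in for one page of pdf_text() output: a single string
# with embedded newlines and whitespace-aligned columns
page <- "model        mpg  cyl\nMazda RX4   21.0    6\nDatsun 710  22.8    4"

rows <- strsplit(page, "\n")[[1]]
# split on runs of 2+ spaces so multi-word cells like "Mazda RX4" survive
cells <- strsplit(trimws(rows), "\\s{2,}")
tab <- as.data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(tab) <- cells[[1]]
tab$mpg <- as.numeric(tab$mpg)
tab
```

Real pages are messier (multi-line cells, page headers and footers), so in practice you would first subset the lines to the table region, but the splitting idea is the same.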
If you want to extract text from scanned text present in a pdf, you'll need to use OCR (optical character recognition). Please refer to the rOpenSci tesseract package, which provides bindings to the Tesseract OCR engine. In particular, read the section of its vignette about reading from PDF files using pdftools and tesseract.
Author: ropensci
Source Code: https://github.com/ropensci/pdftools
License: Unknown, MIT licenses found
Tabulizer provides R bindings to the Tabula java library, which can be used to computationally extract tables from PDF documents.
Note: tabulizer is released under the MIT license, as is Tabula itself.
tabulizer depends on rJava, which implies a system requirement for Java. This can be frustrating, especially on Windows. The preferred Windows workflow is to use Chocolatey to obtain, configure, and update Java. You need to do this before installing rJava or attempting to use tabulizer. More on this and troubleshooting below.
To install the latest CRAN version:
install.packages("tabulizer")
To install the latest development version:
if (!require("remotes")) {
install.packages("remotes")
}
# on 64-bit Windows
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))
The main function, extract_tables(), provides an R clone of the Tabula command line application:
library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)
str(out1)
## List of 4
## $ : chr [1:32, 1:10] "mpg" "21.0" "21.0" "22.8" ...
## $ : chr [1:7, 1:5] "Sepal.Length " "5.1 " "4.9 " "4.7 " ...
## $ : chr [1:7, 1:6] "" "145 " "146 " "147 " ...
## $ : chr [1:15, 1] "supp" "VC" "VC" "VC" ...
By default, it returns the most table-like R structure available: a matrix. It can also write the tables to disk or attempt to coerce them to data.frames using the output argument. It is also possible to select tables from only specified pages using the pages argument.
out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")
str(out2)
## List of 1
## $ :'data.frame': 33 obs. of 13 variables:
## ..$ X : chr [1:33] "Mazda RX4 " "Mazda RX4 Wag " "Datsun 710 " "Hornet 4 Drive " ...
## ..$ mpg : num [1:33] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## ..$ cyl : num [1:33] 6 6 4 6 8 6 8 4 4 6 ...
## ..$ X.1 : int [1:33] NA NA NA NA NA NA NA NA NA NA ...
## ..$ disp: num [1:33] 160 160 108 258 360 ...
## ..$ hp : num [1:33] 110 110 93 110 175 105 245 62 95 123 ...
## ..$ drat: num [1:33] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## ..$ wt : num [1:33] 2.62 2.88 2.32 3.21 3.44 ...
## ..$ qsec: num [1:33] 16.5 17 18.6 19.4 17 ...
## ..$ vs : num [1:33] 0 0 1 1 0 1 0 1 1 1 ...
## ..$ am : num [1:33] 1 1 1 0 0 0 0 0 0 0 ...
## ..$ gear: num [1:33] 4 4 4 3 3 3 3 4 4 4 ...
## ..$ carb: int [1:33] 4 4 1 1 2 1 4 2 2 4 ...
It is also possible to manually specify smaller areas within pages to look for tables using the area and columns arguments to extract_tables(). This facilitates extraction from smaller portions of a page, such as when a table is embedded in a larger section of text or graphics.
Another function, extract_areas(), implements this through an interactive style in which each page of the PDF is loaded as an R graphic and the user can use their mouse to specify upper-left and lower-right bounds of an area. Those areas are then extracted automagically (and the return value is the same as for extract_tables()). Here's a shot of it in action:
locate_areas() handles the area identification process without performing the extraction, which may be useful as a debugger.
extract_text() simply returns text, possibly separately for each (specified) page:
out3 <- extract_text(f, page = 3)
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3
Note that for large PDF files, it is possible to run up against Java memory constraints, leading to a java.lang.OutOfMemoryError: Java heap space error message. Memory can be increased by setting options(java.parameters = "-Xmx16000m") (or some other reasonable amount of memory) before loading the package.
Some other utility functions are also provided (and made possible by the Java Apache PDFBox library):
- extract_text() converts the text of an entire file or specified pages into an R character vector.
- split_pdf() and merge_pdfs() split and merge PDF documents, respectively.
- extract_metadata() extracts PDF metadata as a list.
- get_n_pages() determines the number of pages in a document.
- get_page_dims() determines the width and height of each page in pt (the unit used by the area and columns arguments).
- make_thumbnails() converts specified pages of a PDF file to image files.
In command prompt, install Chocolatey if you don't already have it:
@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin
Then, install java using Chocolatey's choco install command:
choco install jdk7 -y
You may also need to then set the JAVA_HOME environment variable to the path to your Java installation (e.g., C:\Program Files\Java\jdk1.8.0_92). This can be done:
- within R, using Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_92") (note slashes), or
- from the command line, using the setx command: setx JAVA_HOME C:\Program Files\Java\jdk1.8.0_92, or
- from PowerShell, using [Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Java\jdk1.8.0_92", "User"), or
- through Control Panel » System » Advanced » Environment Variables (instructions here).
You should now be able to safely open R, and use rJava and tabulizer. Note, however, that some users report that rather than setting this variable, they instead need to delete it (e.g., with Sys.setenv(JAVA_HOME = "")), so if the above instructions fail, that is the next step in troubleshooting.
Some notes for troubleshooting common installation problems:
Run R CMD javareconf on the command line (possibly with sudo, etc., depending on your system setup).
You can get citation information for tabulizer in R by running citation(package = 'tabulizer').
Author: ropensci
Source Code: https://github.com/ropensci/tabulizer
License: MIT license
Become a Bayesian master you will
⚠️ We changed the default CI width! Please make an informed decision and set it explicitly (ci = 0.89, ci = 0.95, or anything else that you decide) ⚠️
Existing R packages allow users to easily fit a large variety of models and extract and visualize the posterior draws. However, most of these packages only return a limited set of indices (e.g., point-estimates and CIs). bayestestR provides a comprehensive and consistent set of functions to analyze and describe posterior distributions generated by a variety of model objects, including popular modeling packages such as rstanarm, brms or BayesFactor.
You can reference the package and its documentation as follows:
The bayestestR package is available on CRAN, while its latest development version is available on R-universe (from rOpenSci).
| Type        | Source     | Command                                                                    |
|-------------|------------|----------------------------------------------------------------------------|
| Release     | CRAN       | install.packages("bayestestR")                                             |
| Development | R-universe | install.packages("bayestestR", repos = "https://easystats.r-universe.dev") |
Once you have downloaded the package, you can then load it using:
library("bayestestR")
Tip
Instead of library(bayestestR), use library(easystats). This will make all features of the easystats ecosystem available.
To stay updated, use easystats::install_latest().
Access the package documentation and check out these vignettes:
Features
In the Bayesian framework, parameters are estimated in a probabilistic fashion as distributions. These distributions can be summarised and described by reporting four types of indices:
- mean(), median(), or map_estimate() for an estimation of the mode; point_estimate() can be used to get them at once and can be run directly on models.
- p_direction() for a Bayesian equivalent of the frequentist p-value (see Makowski et al., 2019).
- p_pointnull() represents the odds of the null hypothesis (h0 = 0) compared to the most likely hypothesis (the MAP).
- bf_pointnull() for a classic Bayes Factor (BF) assessing the likelihood of effect presence against its absence (h0 = 0).
- p_rope() is the probability of the effect falling inside a Region of Practical Equivalence (ROPE).
- bf_rope() computes a Bayes factor against the null as defined by a region (the ROPE).
- p_significance() combines a region of equivalence with the probability of direction.
describe_posterior() is the master function with which you can compute all of the indices cited above at once.
describe_posterior(
rnorm(10000),
centrality = "median",
test = c("p_direction", "p_significance")
)
## Summary of Posterior Distribution
##
## Parameter |   Median |        95% CI |     pd |   ps
## -----------------------------------------------------
## Posterior | 4.19e-03 | [-1.91, 1.98] | 50.18% | 0.46
describe_posterior() works for many objects, including more complex brmsfit models. For better readability, the output is separated by model components:
zinb <- read.csv("http://stats.idre.ucla.edu/stat/data/fish.csv")
set.seed(123)
model <- brm(
  bf(
    count ~ child + camper + (1 | persons),
    zi ~ child + camper + (1 | persons)
  ),
  data = zinb,
  family = zero_inflated_poisson(),
  chains = 1,
  iter = 500
)
describe_posterior(
model,
effects = "all",
component = "all",
test = c("p_direction", "p_significance"),
centrality = "all"
)
## Summary of Posterior Distribution
##
## # Fixed effects (conditional)
##
## Parameter   | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## --------------------------------------------------------------------------------------
## (Intercept) |   0.96 |  0.96 |  0.96 | [-0.81,  2.51] | 90.00% | 0.88 | 1.011 | 110.00
## child       |  -1.16 | -1.16 | -1.16 | [-1.36, -0.94] |   100% | 1.00 | 0.996 | 278.00
## camper      |   0.73 |  0.72 |  0.73 | [ 0.54,  0.91] |   100% | 1.00 | 0.996 | 271.00
##
## # Fixed effects (zero-inflated)
##
## Parameter   | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## --------------------------------------------------------------------------------------
## (Intercept) |  -0.48 | -0.51 | -0.22 | [-2.03,  0.89] | 78.00% | 0.73 | 0.997 | 138.00
## child       |   1.85 |  1.86 |  1.81 | [ 1.19,  2.54] |   100% | 1.00 | 0.996 | 303.00
## camper      |  -0.88 | -0.86 | -0.99 | [-1.61, -0.07] | 98.40% | 0.96 | 0.996 | 292.00
##
## # Random effects (conditional) Intercept: persons
##
## Parameter | Median   |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## --------------------------------------------------------------------------------------
## persons.1 |    -0.99 | -1.01 | -0.84 | [-2.68,  0.80] | 92.00% | 0.90 | 1.007 | 106.00
## persons.2 | 4.65e-03 |  0.04 |  0.03 | [-1.63,  1.66] | 50.00% | 0.45 | 1.013 | 109.00
## persons.3 |     0.69 |  0.66 |  0.69 | [-0.95,  2.34] | 79.60% | 0.78 | 1.010 | 114.00
## persons.4 |     1.57 |  1.56 |  1.56 | [-0.05,  3.29] | 96.80% | 0.96 | 1.009 | 114.00
##
## # Random effects (zero-inflated) Intercept: persons
##
## Parameter | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## ------------------------------------------------------------------------------------
## persons.1 |   1.10 |  1.11 |  1.08 | [-0.23,  2.72] | 94.80% | 0.93 | 0.997 | 166.00
## persons.2 |   0.18 |  0.18 |  0.22 | [-0.94,  1.58] | 63.20% | 0.54 | 0.996 | 154.00
## persons.3 |  -0.30 | -0.31 | -0.54 | [-1.79,  1.02] | 64.00% | 0.59 | 0.997 | 154.00
## persons.4 |  -1.45 | -1.46 | -1.44 | [-2.90, -0.10] | 98.00% | 0.97 | 1.000 | 189.00
##
## # Random effects (conditional) SD/Cor: persons
##
## Parameter   | Median | Mean |  MAP |        95% CI |   pd |   ps |  Rhat |    ESS
## ---------------------------------------------------------------------------------
## (Intercept) |   1.42 | 1.58 | 1.07 | [ 0.71, 3.58] | 100% | 1.00 | 1.010 | 126.00
##
## # Random effects (zero-inflated) SD/Cor: persons
##
## Parameter   | Median | Mean |  MAP |        95% CI |   pd |   ps |  Rhat |    ESS
## ---------------------------------------------------------------------------------
## (Intercept) |   1.30 | 1.49 | 0.99 | [ 0.63, 3.41] | 100% | 1.00 | 0.996 | 129.00
bayestestR also includes many other features useful for your Bayesian analyses. Here are some more examples:
library(bayestestR)
posterior <- distribution_gamma(10000, 1.5) # Generate a skewed distribution
centrality <- point_estimate(posterior) # Get indices of centrality
centrality
## Point Estimate
##
## Median | Mean |  MAP
## ---------------------
## 1.18   | 1.50 | 0.51
As for other easystats packages, plot() methods are available from the see package for many functions:
While the median and the mean are available through base R functions, map_estimate() in bayestestR can be used to directly find the Highest Maximum A Posteriori (MAP) estimate of a posterior, i.e., the value associated with the highest probability density (the "peak" of the posterior distribution). In other words, it is an estimation of the mode for continuous parameters.
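The "peak" described above can be sketched with base R alone; this is a rough stand-in for map_estimate() (which uses a more careful density estimator), not its actual implementation:

```r
# A MAP estimate is the mode of the posterior: locate the peak of a kernel
# density estimate of the draws
set.seed(123)
posterior <- rgamma(10000, shape = 1.5, rate = 1)  # skewed; true mode = 0.5
d <- density(posterior)
map <- d$x[which.max(d$y)]  # close to 0.5 for this distribution
```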
hdi() computes the Highest Density Interval (HDI) of a posterior distribution, i.e., the interval such that all points within it have a higher probability density than points outside it. The HDI can be used in the context of Bayesian posterior characterization as a Credible Interval (CI).
Unlike equal-tailed intervals (see eti()) that typically exclude 2.5% from each tail of the distribution, the HDI is not equal-tailed and therefore always includes the mode(s) of posterior distributions.
posterior <- distribution_chisquared(10000, 4)
hdi(posterior, ci = 0.89)
## 89% HDI: [0.18, 7.63]
eti(posterior, ci = 0.89)
## 89% ETI: [0.75, 9.25]
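The 89% ETI can be reproduced with base R's quantile(), since an equal-tailed interval simply cuts 5.5% from each tail (the HDI has no such one-liner, as it searches for the narrowest interval containing the requested mass):

```r
set.seed(42)
posterior <- rchisq(10000, df = 4)
# 89% equal-tailed interval: the 5.5% and 94.5% sample quantiles
eti89 <- unname(quantile(posterior, probs = c(0.055, 0.945)))
```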
p_direction() computes the Probability of Direction (pd, also known as the Maximum Probability of Effect, MPE). It varies between 50% and 100% (i.e., 0.5 and 1) and can be interpreted as the probability (expressed in percentage) that a parameter (described by its posterior distribution) is strictly positive or negative (whichever is the most probable). It is mathematically defined as the proportion of the posterior distribution that is of the median's sign. Although differently expressed, this index is fairly similar (i.e., is strongly correlated) to the frequentist p-value.
Relationship with the p-value: In most cases, it seems that the pd corresponds to the frequentist one-sided p-value through the formula p-value = (1 - pd/100) and to the two-sided p-value (the most commonly reported) through the formula p-value = 2 * (1 - pd/100). Thus, a pd of 95%, 97.5%, 99.5% and 99.95% corresponds approximately to a two-sided p-value of respectively .1, .05, .01 and .001. See the reporting guidelines.
posterior <- distribution_normal(10000, 0.4, 0.2)
p_direction(posterior)
## Probability of Direction: 0.98
rope() computes the proportion (in percentage) of the HDI (defaulting to the 89% HDI) of a posterior distribution that lies within a region of practical equivalence.
Statistically, the probability of a posterior distribution being different from 0 does not make much sense (the probability of it being different from a single point being infinite). Therefore, the idea underlying ROPE is to let the user define an area around the null value enclosing values that are equivalent to the null value for practical purposes (Kruschke, 2018).
Kruschke suggests that such a null value could be set, by default, to the -0.1 to 0.1 range of a standardized parameter (negligible effect size according to Cohen, 1988). This could be generalized: for instance, for linear models, the ROPE could be set as 0 +/- 0.1 * sd(y). This ROPE range can be automatically computed for models using the rope_range() function.
Kruschke suggests using the proportion of the 95% (or 90%, considered more stable) HDI that falls within the ROPE as an index for “nullhypothesis” testing (as understood under the Bayesian framework, see equivalence_test).
posterior <- distribution_normal(10000, 0.4, 0.2)
rope(posterior, range = c(-0.1, 0.1))
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
##
## inside ROPE
## 
## 4.40 %
bayesfactor_parameters() computes Bayes factors against the null (either a point or an interval), based on prior and posterior samples of a single parameter. This Bayes factor indicates the degree by which the mass of the posterior distribution has shifted further away from or closer to the null value(s) (relative to the prior distribution), thus indicating whether the null value has become less or more likely given the observed data.
When the null is an interval, the Bayes factor is computed by comparing the prior and posterior odds of the parameter falling within or outside the null; when the null is a point, a Savage-Dickey density ratio is computed, which is also an approximation of a Bayes factor comparing the marginal likelihoods of the model against a model in which the tested parameter has been restricted to the point null (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).
prior <- distribution_normal(10000, mean = 0, sd = 1)
posterior <- distribution_normal(10000, mean = 1, sd = 0.7)
bayesfactor_parameters(posterior, prior, direction = "two-sided", null = 0)
## Bayes Factor (Savage-Dickey density ratio)
##
## BF
## 
## 1.94
##
## * Evidence Against The Null: 0
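With the normal prior and posterior used above, the Savage-Dickey ratio can be checked by hand from the two densities at the null; this is roughly what bayesfactor_parameters() does, except that it estimates the densities from the samples rather than using the exact dnorm() formulas:

```r
# BF10 = prior density at the null / posterior density at the null
prior_at_null <- dnorm(0, mean = 0, sd = 1)
posterior_at_null <- dnorm(0, mean = 1, sd = 0.7)
bf10 <- prior_at_null / posterior_at_null  # ~1.94, matching the output above
```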
The lollipops represent the density of a point-null on the prior distribution (the blue lollipop on the dotted distribution) and on the posterior distribution (the red lollipop on the yellow distribution). The ratio between the two, the Savage-Dickey ratio, indicates the degree by which the mass of the parameter distribution has shifted away from or closer to the null.
For more info, see the Bayes factors vignette.
rope_range(): This function attempts to automatically find suitable "default" values for the Region Of Practical Equivalence (ROPE). Kruschke (2018) suggests that such a null value could be set, by default, to a range from -0.1 to 0.1 of a standardized parameter (negligible effect size according to Cohen, 1988), which can be generalised for linear models to -0.1 * sd(y), 0.1 * sd(y). For logistic models, the parameters expressed in log odds ratio can be converted to standardized difference through the formula sqrt(3)/pi, resulting in a range of -0.05 to 0.05.
rope_range(model)
estimate_density(): This function is a wrapper over different methods of density estimation. By default, it uses the base R density() function, but with a different smoothing bandwidth ("SJ") from the legacy default implemented in base R ("nrd0"). However, Deng & Wickham suggest that method = "KernSmooth" is the fastest and the most accurate.
distribution(): Generate a sample of size n with near-perfect distributions.
distribution(n = 10)
## [1] -1.55 -1.00 -0.66 -0.38 -0.12 0.12 0.38 0.66 1.00 1.55
density_at(): Compute the density at a given point of a distribution.
density_at(rnorm(1000, 1, 1), 1)
## [1] 0.45
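A base-R approximation of what density_at() computes: interpolate a kernel density estimate at the requested point (density_at() itself may use a different estimator):

```r
# Density of the draws evaluated at x = 1
set.seed(1)
draws <- rnorm(1000, mean = 1, sd = 1)
d <- density(draws)
at_one <- approx(d$x, d$y, xout = 1)$y  # near the true dnorm(1, 1, 1) = 0.399
```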
Please note that the bayestestR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
References
Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280. https://doi.org/10.1177/2515245918771304
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60(3), 158–189.
Author: easystats
Source Code: https://github.com/easystats/bayestestR
License: GPL-3.0 license
“From R to your manuscript”
report’s primary goal is to bridge the gap between R’s output and the formatted results contained in your manuscript. It automatically produces reports of models and data frames according to best practices guidelines (e.g., APA’s style), ensuring standardization and quality in results reporting.
library(report)
model <- lm(Sepal.Length ~ Species, data = iris)
report(model)
# We fitted a linear model (estimated using OLS) to predict Sepal.Length with
# Species (formula: Sepal.Length ~ Species). The model explains a statistically
# significant and substantial proportion of variance (R2 = 0.62, F(2, 147) =
# 119.26, p < .001, adj. R2 = 0.61). The model's intercept, corresponding to
# Species = setosa, is at 5.01 (95% CI [4.86, 5.15], t(147) = 68.76, p < .001).
# Within this model:
#
# - The effect of Species [versicolor] is statistically significant and positive
# (beta = 0.93, 95% CI [0.73, 1.13], t(147) = 9.03, p < .001; Std. beta = 1.12,
# 95% CI [0.88, 1.37])
# - The effect of Species [virginica] is statistically significant and positive
# (beta = 1.58, 95% CI [1.38, 1.79], t(147) = 15.37, p < .001; Std. beta = 1.91,
# 95% CI [1.66, 2.16])
#
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and p-values were
# computed using a Wald t-distribution approximation.
The package is available on CRAN and can be downloaded by running:
install.packages("report")
If you would instead like to experiment with the development version, you can download it from GitHub:
install.packages("remotes")
remotes::install_github("easystats/report") # You only need to do that once
Load the package every time you start R:
library("report")
Tip
Instead of library(report), use library(easystats). This will make all features of the easystats ecosystem available.
To stay updated, use easystats::install_latest().
The package documentation can be found here.
The report package works in a two-step fashion. First, you create a report object with the report() function. Then, this report object can be displayed either textually (the default output) or as a table, using as.data.frame(). Moreover, you can also access a more digestible and compact version of the report using summary() on the report object.
The report() function works on a variety of models, as well as other objects such as data frames:
report(iris)
# The data contains 150 observations of the following 5 variables:
#
# - Sepal.Length: n = 150, Mean = 5.84, SD = 0.83, Median = 5.80, MAD = 1.04,
# range: [4.30, 7.90], Skewness = 0.31, Kurtosis = -0.55, 0% missing
# - Sepal.Width: n = 150, Mean = 3.06, SD = 0.44, Median = 3.00, MAD = 0.44,
# range: [2, 4.40], Skewness = 0.32, Kurtosis = 0.23, 0% missing
# - Petal.Length: n = 150, Mean = 3.76, SD = 1.77, Median = 4.35, MAD = 1.85,
# range: [1, 6.90], Skewness = -0.27, Kurtosis = -1.40, 0% missing
# - Petal.Width: n = 150, Mean = 1.20, SD = 0.76, Median = 1.30, MAD = 1.04,
# range: [0.10, 2.50], Skewness = -0.10, Kurtosis = -1.34, 0% missing
# - Species: 3 levels, namely setosa (n = 50, 33.33%), versicolor (n = 50,
# 33.33%) and virginica (n = 50, 33.33%)
These reports nicely work within the tidyverse workflow:
library(dplyr)
iris %>%
  select(-starts_with("Sepal")) %>%
  group_by(Species) %>%
  report() %>%
  summary()
# The data contains 150 observations, grouped by Species, of the following 3
# variables:
#
# - setosa (n = 50):
#   - Petal.Length: Mean = 1.46, SD = 0.17, range: [1, 1.90]
#   - Petal.Width: Mean = 0.25, SD = 0.11, range: [0.10, 0.60]
#
# - versicolor (n = 50):
#   - Petal.Length: Mean = 4.26, SD = 0.47, range: [3, 5.10]
#   - Petal.Width: Mean = 1.33, SD = 0.20, range: [1, 1.80]
#
# - virginica (n = 50):
#   - Petal.Length: Mean = 5.55, SD = 0.55, range: [4.50, 6.90]
#   - Petal.Width: Mean = 2.03, SD = 0.27, range: [1.40, 2.50]
Reports can be used to automatically format tests like t-tests or correlations.
report(t.test(mtcars$mpg ~ mtcars$am))
# Effect sizes were labelled following Cohen's (1988) recommendations.
#
# The Welch Two Sample t-test testing the difference of mtcars$mpg by mtcars$am
# (mean in group 0 = 17.15, mean in group 1 = 24.39) suggests that the effect is
# negative, statistically significant, and large (difference = -7.24, 95% CI
# [-11.28, -3.21], t(18.33) = -3.77, p = 0.001; Cohen's d = -1.41, 95% CI [-2.26,
# -0.53])
As mentioned, you can also create tables with the as.data.frame() function, like for example with this correlation test:
cor.test(iris$Sepal.Length, iris$Sepal.Width) %>%
report() %>%
as.data.frame()
# Pearson's productmoment correlation
#
# Parameter1        |       Parameter2 |     r |        95% CI | t(148) |     p
# ------------------------------------------------------------------------------
# iris$Sepal.Length | iris$Sepal.Width | -0.12 | [-0.27, 0.04] |  -1.44 | 0.152
#
# Alternative hypothesis: two.sided
This works great with ANOVAs, as it includes effect sizes and their interpretation.
aov(Sepal.Length ~ Species, data = iris) %>%
report()
# The ANOVA (formula: Sepal.Length ~ Species) suggests that:
#
# - The main effect of Species is statistically significant and large (F(2, 147)
# = 119.26, p < .001; Eta2 = 0.62, 95% CI [0.54, 1.00])
#
# Effect sizes were labelled following Field's (2013) recommendations.
Reports are also compatible with GLMs, such as this logistic regression:
model <- glm(vs ~ mpg * drat, data = mtcars, family = "binomial")
report(model)
# We fitted a logistic model (estimated using ML) to predict vs with mpg and drat
# (formula: vs ~ mpg * drat). The model's explanatory power is substantial
# (Tjur's R2 = 0.51). The model's intercept, corresponding to mpg = 0 and drat =
# 0, is at -33.43 (95% CI [-77.90, 3.25], p = 0.083). Within this model:
#
# - The effect of mpg is statistically non-significant and positive (beta = 1.79,
# 95% CI [-0.10, 4.05], p = 0.066; Std. beta = 3.63, 95% CI [-1.36, 7.50])
# - The effect of drat is statistically non-significant and positive (beta =
# 5.96, 95% CI [-3.75, 16.26], p = 0.205; Std. beta = -0.36, 95% CI [-1.96,
# 0.98])
# - The interaction effect of drat on mpg is statistically non-significant and
# negative (beta = -0.33, 95% CI [-0.83, 0.15], p = 0.141; Std. beta = -1.07, 95%
# CI [-2.66, 0.48])
#
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and pvalues were
# computed using a Wald zdistribution approximation.
Mixed models, whose popularity and usage is exploding, can also be reported:
library(lme4)
model <- lme4::lmer(Sepal.Length ~ Petal.Length + (1 | Species), data = iris)
report(model)
# We fitted a linear mixed model (estimated using REML and nloptwrap optimizer)
# to predict Sepal.Length with Petal.Length (formula: Sepal.Length ~
# Petal.Length). The model included Species as random effect (formula: ~1 |
# Species). The model's total explanatory power is substantial (conditional R2 =
# 0.97) and the part related to the fixed effects alone (marginal R2) is of 0.66.
# The model's intercept, corresponding to Petal.Length = 0, is at 2.50 (95% CI
# [1.19, 3.82], t(146) = 3.75, p < .001). Within this model:
#
# - The effect of Petal Length is statistically significant and positive (beta =
# 0.89, 95% CI [0.76, 1.01], t(146) = 13.93, p < .001; Std. beta = 1.89, 95% CI
# [1.63, 2.16])
#
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and pvalues were
# computed using a Wald tdistribution approximation.
Bayesian models can also be reported using the new SEXIT framework, which combines clarity, precision and usefulness.
library(rstanarm)
model <- stan_glm(mpg ~ qsec + wt, data = mtcars)
report(model)
# We fitted a Bayesian linear model (estimated using MCMC sampling with 4 chains
# of 1000 iterations and a warmup of 500) to predict mpg with qsec and wt
# (formula: mpg ~ qsec + wt). Priors over parameters were set as normal (mean =
# 0.00, SD = 8.43) distributions. The model's explanatory power is substantial
# (R2 = 0.81, 95% CI [0.70, 0.90], adj. R2 = 0.79). The model's intercept,
# corresponding to qsec = 0 and wt = 0, is at 19.72 (95% CI [9.18, 29.63]).
# Within this model:
#
# - The effect of qsec (Median = 0.92, 95% CI [0.42, 1.46]) has a 99.90%
# probability of being positive (> 0), 99.00% of being significant (> 0.30), and
# 0.15% of being large (> 1.81). The estimation successfully converged (Rhat =
# 1.000) and the indices are reliable (ESS = 2411)
# - The effect of wt (Median = -5.04, 95% CI [-6.00, -4.02]) has a 100.00%
# probability of being negative (< 0), 100.00% of being significant (< -0.30),
# and 100.00% of being large (< -1.81). The estimation successfully converged
# (Rhat = 1.000) and the indices are reliable (ESS = 2582)
#
# Following the Sequential Effect eXistence and sIgnificance Testing (SEXIT)
# framework, we report the median of the posterior distribution and its 95% CI
# (Highest Density Interval), along the probability of direction (pd), the
# probability of significance and the probability of being large. The thresholds
# beyond which the effect is considered as significant (i.e., nonnegligible) and
# large are 0.30 and 1.81 (corresponding respectively to 0.05 and 0.30 of the
# outcome's SD). Convergence and stability of the Bayesian sampling has been
# assessed using Rhat, which should be below 1.01 (Vehtari et al., 2019), and
# Effective Sample Size (ESS), which should be greater than 1000 (Burkner, 2017).
One can, for complex reports, directly access the pieces of the reports:
model < lm(Sepal.Length ~ Species, data = iris)
report_model(model)
report_performance(model)
report_statistics(model)
# linear model (estimated using OLS) to predict Sepal.Length with Species (formula: Sepal.Length ~ Species)
# The model explains a statistically significant and substantial proportion of
# variance (R2 = 0.62, F(2, 147) = 119.26, p < .001, adj. R2 = 0.61)
# beta = 5.01, 95% CI [4.86, 5.15], t(147) = 68.76, p < .001; Std. beta = -1.01, 95% CI [-1.18, -0.84]
# beta = 0.93, 95% CI [0.73, 1.13], t(147) = 9.03, p < .001; Std. beta = 1.12, 95% CI [0.88, 1.37]
# beta = 1.58, 95% CI [1.38, 1.79], t(147) = 15.37, p < .001; Std. beta = 1.91, 95% CI [1.66, 2.16]
This can be useful to complete the Participants paragraph of your manuscript.
data <- data.frame(
"Age" = c(22, 23, 54, 21),
"Sex" = c("F", "F", "M", "M")
)
paste(
report_participants(data, spell_n = TRUE),
"were recruited in the study by means of torture and coercion."
)
# [1] "Four participants (Mean age = 30.0, SD = 16.0, range: [21, 54]; Sex: 50.0% females, 50.0% males, 0.0% other) were recruited in the study by means of torture and coercion."
Report can also help you create a sample description table (also referred to as Table 1).
| Variable               | setosa (n=50) | versicolor (n=50) | virginica (n=50) | Total (n=150) |
|------------------------|---------------|-------------------|------------------|---------------|
| Mean Sepal.Length (SD) | 5.01 (0.35)   | 5.94 (0.52)       | 6.59 (0.64)      | 5.84 (0.83)   |
| Mean Sepal.Width (SD)  | 3.43 (0.38)   | 2.77 (0.31)       | 2.97 (0.32)      | 3.06 (0.44)   |
| Mean Petal.Length (SD) | 1.46 (0.17)   | 4.26 (0.47)       | 5.55 (0.55)      | 3.76 (1.77)   |
| Mean Petal.Width (SD)  | 0.25 (0.11)   | 1.33 (0.20)       | 2.03 (0.27)      | 1.20 (0.76)   |
Finally, report includes some functions to help you write the data analysis paragraph about the tools used.
report(sessionInfo())
# Analyses were conducted using the R Statistical language (version 4.2.1; R Core
# Team, 2022) on macOS Monterey 12.6, using the packages lme4 (version 1.1.30;
# Bates D et al., 2015), Matrix (version 1.5.1; Bates D et al., 2022), Rcpp
# (version 1.0.9; Eddelbuettel D, François R, 2011), rstanarm (version 2.21.3;
# Goodrich B et al., 2022), report (version 0.5.5.2; Makowski D et al., 2021) and
# dplyr (version 1.0.10; Wickham H et al., 2022).
#
# References
# ----------
# - Bates D, Mächler M, Bolker B, Walker S (2015). "Fitting Linear Mixed-Effects
# Models Using lme4." _Journal of Statistical Software_, *67*(1), 1-48.
# doi:10.18637/jss.v067.i01 <https://doi.org/10.18637/jss.v067.i01>.
# - Bates D, Maechler M, Jagan M (2022). _Matrix: Sparse and Dense Matrix Classes
# and Methods_. R package version 1.5-1,
# <https://CRAN.R-project.org/package=Matrix>.
# - Eddelbuettel D, François R (2011). "Rcpp: Seamless R and C++ Integration."
# _Journal of Statistical Software_, *40*(8), 1-18. doi:10.18637/jss.v040.i08
# <https://doi.org/10.18637/jss.v040.i08>. Eddelbuettel D (2013). _Seamless R and
# C++ Integration with Rcpp_. Springer, New York. doi:10.1007/978-1-4614-6868-4
# <https://doi.org/10.1007/978-1-4614-6868-4>, ISBN 978-1-4614-6867-7.
# Eddelbuettel D, Balamuta JJ (2018). "Extending R with C++: A Brief Introduction
# to Rcpp." _The American Statistician_, *72*(1), 28-36.
# doi:10.1080/00031305.2017.1375990
# <https://doi.org/10.1080/00031305.2017.1375990>.
# - Goodrich B, Gabry J, Ali I, Brilleman S (2022). "rstanarm: Bayesian applied
# regression modeling via Stan." R package version 2.21.3,
# <https://mc-stan.org/rstanarm/>. Brilleman S, Crowther M, Moreno-Betancur M,
# Buros Novik J, Wolfe R (2018). "Joint longitudinal and time-to-event models via
# Stan." StanCon 2018. 10-12 Jan 2018. Pacific Grove, CA, USA.,
# <https://github.com/stan-dev/stancon_talks/>.
# - Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). "Automated Results
# Reporting as a Practical Tool to Improve Reproducibility and Methodological
# Best Practices Adoption." _CRAN_. <https://github.com/easystats/report>.
# - R Core Team (2022). _R: A Language and Environment for Statistical
# Computing_. R Foundation for Statistical Computing, Vienna, Austria.
# <https://www.R-project.org/>.
# - Wickham H, François R, Henry L, Müller K (2022). _dplyr: A Grammar of Data
# Manipulation_. R package version 1.0.10,
# <https://CRAN.R-project.org/package=dplyr>.
If you like it, you can put a star on this repo, and cite the package as follows:
citation("report")
To cite in publications use:
Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020).
Automated Results Reporting as a Practical Tool to Improve
Reproducibility and Methodological Best Practices Adoption. CRAN.
Available from https://github.com/easystats/report. doi: .
A BibTeX entry for LaTeX users is
@Article{,
title = {Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption},
author = {Dominique Makowski and Mattan S. Ben-Shachar and Indrajeet Patil and Daniel Lüdecke},
year = {2021},
journal = {CRAN},
url = {https://github.com/easystats/report},
}
report is a young package in need of affection. You can easily become part of the developing community of this open-source software and improve science! Don’t be shy: try to code and submit a pull request (see the contributing guide). Even if it’s not perfect, we will help you make it great!
Please note that the report project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Author: Easystats
Source Code: https://github.com/easystats/report
License: GPL-3.0 license
geobr is a computational package to download official spatial data sets of Brazil. The package includes a wide range of geospatial data in geopackage format (like shapefiles but better), available at various geographic scales and for various years with harmonized attributes, projection and topology (see detailed list of available data sets below).
The package is currently available in R and Python.
# From CRAN
install.packages("geobr")
library(geobr)
# or use the development version with latest features
utils::remove.packages('geobr')
devtools::install_github("ipeaGIT/geobr", subdir = "rpackage")
library(geobr)
Note: if you use Linux, you need to install a couple of dependencies before installing the sf and geobr libraries. More info here.
pip install geobr
Windows users:
conda create -n geo_env
conda activate geo_env
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install python=3 geopandas
pip install geobr
Basic Usage
The syntax of all geobr functions operates on the same logic, so it becomes intuitive to download any data set using a single line of code. Like this:
Read an sf object:
library(geobr)
# Read specific municipality at a given year
mun <- read_municipality(code_muni = 1200179, year = 2017)
# Read all municipalities of a given state at a given year
mun <- read_municipality(code_muni = 33, year = 2010) # or
mun <- read_municipality(code_muni = "RJ", year = 2010)
# Read all municipalities in the country at a given year
mun <- read_municipality(code_muni = "all", year = 2018)
More examples in the intro Vignette
Read a geopandas object:
from geobr import read_municipality
# Read specific municipality at a given year
mun = read_municipality(code_muni=1200179, year=2017)
# Read all municipalities of given state at a given year
mun = read_municipality(code_muni=33, year=2010) # or
mun = read_municipality(code_muni="RJ", year=2010)
# Read all municipalities in the country at a given year
mun = read_municipality(code_muni="all", year=2018)
More examples here
Available datasets:
:point_right: All datasets use geodetic reference system "SIRGAS2000", CRS(4674).
| Function | Geographies available | Years available | Source |
|---|---|---|---|
| read_country | Country | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_region | Region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_state | States | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_meso_region | Meso region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_micro_region | Micro region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_intermediate_region | Intermediate region | 2017, 2019, 2020 | IBGE |
| read_immediate_region | Immediate region | 2017, 2019, 2020 | IBGE |
| read_municipality | Municipality | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2005, 2007, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE |
| read_municipal_seat | Municipality seats (sedes municipais) | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2010 | IBGE |
| read_weighting_area | Census weighting area (área de ponderação) | 2010 | IBGE |
| read_census_tract | Census tract (setor censitário) | 2000, 2010, 2017, 2019, 2020 | IBGE |
| read_statistical_grid | Statistical Grid of 200 x 200 meters | 2010 | IBGE |
| read_metro_area | Metropolitan areas | 1970, 2001, 2002, 2003, 2005, 2010, 2013, 2014, 2015, 2016, 2017, 2018 | IBGE |
| read_urban_area | Urban footprints | 2005, 2015 | IBGE |
| read_amazon | Brazil's Legal Amazon | 2012 | MMA |
| read_biomes | Biomes | 2004, 2019 | IBGE |
| read_conservation_units | Environmental Conservation Units | 201909 | MMA |
| read_disaster_risk_area | Disaster risk areas | 2010 | CEMADEN and IBGE |
| read_indigenous_land | Indigenous lands | 201907, 202103 | FUNAI |
| read_semiarid | Semi Arid region | 2005, 2017 | IBGE |
| read_health_facilities | Health facilities | 2015 | CNES, DataSUS |
| read_health_region | Health regions and macro regions | 1991, 1994, 1997, 2001, 2005, 2013 | DataSUS |
| read_neighborhood | Neighborhood limits | 2010 | IBGE |
| read_schools | Schools | 2020 | INEP |
| read_comparable_areas | Historically comparable municipalities, aka áreas mínimas comparáveis (AMCs) | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2010 | IBGE |
| read_urban_concentrations | Urban concentration areas (concentrações urbanas) | 2015 | IBGE |
| read_pop_arrangements | Population arrangements (arranjos populacionais) | 2015 | IBGE |
| Function | Action |
|---|---|
| list_geobr | List all datasets available in the geobr package |
| lookup_muni | Look up municipality codes by their name, or the other way around |
| grid_state_correspondence_table | Loads a correspondence table indicating which quadrants of IBGE's statistical grid intersect with each state |
| cep_to_state | Determine the state of a given CEP postal code |
| ... | ... |
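The support functions combine naturally with the read_* functions. A hedged sketch (the municipality code below is shown for illustration; check lookup_muni() output for the correct code):

```r
library(geobr)

# List all datasets shipped with geobr
datasets <- list_geobr()

# Find the IBGE code of a municipality by name...
lookup_muni(name_muni = "Fortaleza")

# ...then download its geometry using that code
fortaleza <- read_municipality(code_muni = 2304400, year = 2018)
```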
Note 1. Data sets and Functions marked with "dev" are only available in the development version of geobr
.
Note 2. Most data sets are available at scale 1:250,000 (see documentation for details).
| Geography | Years available | Source |
|---|---|---|
| read_census_tract | 2007 | IBGE |
| Longitudinal Database* of micro regions | various years | IBGE |
| Longitudinal Database* of Census tracts | various years | IBGE |
| ... | ... | ... |
'*' Longitudinal Database refers to áreas mínimas comparáveis (AMCs)
Contributing to geobr
If you would like to contribute to geobr and add new functions or data sets, please check this guide to propose your contribution.
As of today, there is another R package with similar functionality: simplefeaturesbr. The geobr package has a few advantages when compared to simplefeaturesbr, including, for example, a wider range of data sets, availability for various years, and harmonized attributes, projection and topology.
Credits
Original shapefiles are created by official government institutions. The geobr package is developed by a team at the Institute for Applied Economic Research (Ipea), Brazil. If you want to cite this package, you can cite it as:
Author: ipeaGIT
Source Code: https://github.com/ipeaGIT/geobr
The goal of paletteer is to be a comprehensive collection of color palettes in R using a common interface. Think of it as the “caret of palettes”.
Notice: this version is not backwards compatible with versions <= 0.2.1. Please refer to the end of the README for breaking changes.
You can install the released version of paletteer from CRAN with:
install.packages("paletteer")
If you want the development version instead then install directly from GitHub:
# install.packages("devtools")
devtools::install_github("EmilHvitfeldt/paletteer")
The palettes are divided into two groups: discrete and continuous. For discrete palettes you can choose between fixed-width palettes and dynamic palettes. The more common of the two are fixed-width palettes, which have a set number of colors that doesn't change when the number of colors requested varies, like the following palettes:
On the other hand, we have the dynamic palettes, where the colors of the palette depend on the number of colors you need, like the green.pal palette from the cartography package.
Lastly, we have the continuous palettes, which provide as many colors as you need for a smooth transition of color:
This package includes 2569 palettes from 68 different packages, and information about them can be found in the following data frames: palettes_c_names, palettes_d_names and palettes_dynamic_names. Additionally, this GitHub repo showcases all the palettes included in the package and more.
All the palettes can be accessed from the 3 functions paletteer_c(), paletteer_d() and paletteer_dynamic() using the syntax packagename::palettename.
paletteer_c("scico::berlin", n = 10)
#> <colors>
#> #9EB0FFFF #5AA3DAFF #2D7597FF #194155FF #11181DFF #270C01FF #501802FF #8A3F2AFF #C37469FF #FFACACFF
paletteer_d("nord::frost")
#> <colors>
#> #8FBCBBFF #88C0D0FF #81A1C1FF #5E81ACFF
paletteer_dynamic("cartography::green.pal", 5)
#> <colors>
#> #B8D9A9FF #8DBC80FF #5D9D52FF #287A22FF #17692CFF
All of the functions now also support tab completion to easily access the hundreds of choices.
Lastly, the package also includes scales for ggplot2 using the same standard interface:
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
geom_point() +
scale_color_paletteer_d("nord::aurora")
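The same interface covers continuous scales via scale_color_paletteer_c(); a minimal sketch mapping a continuous variable to the scico::berlin palette shown earlier:

```r
library(ggplot2)
library(paletteer)

# Continuous colour scale using the packagename::palettename syntax
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Petal.Length)) +
  geom_point() +
  scale_color_paletteer_c("scico::berlin")
```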
In versions <= 0.2.1, a palette was selected by specifying package and palette arguments, like so:
paletteer_c(package = "nord", palette = "frost")
After version 0.2.1 palettes are selected using the syntax "packagename::palettename"
inside the palette functions.
paletteer_c("nord::frost")
paletteer includes palettes from the following packages:
Author: EmilHvitfeldt
Source Code: https://github.com/EmilHvitfeldt/paletteer
License: View license
Use Twitter from R. Get started by reading vignette("rtweet").
To get the current released version from CRAN:
install.packages("rtweet")
You can install the development version of rtweet from r-universe with:
install.packages("rtweet", repos = 'https://ropensci.r-universe.dev')
All users must be authenticated to interact with Twitter’s APIs. The easiest way to authenticate is to use your personal Twitter account: this will happen automatically (via a browser popup) the first time you use an rtweet function. See auth_setup_default() for details. Using your personal account is fine for casual use, but if you are trying to collect a lot of data it’s a good idea to authenticate with your own Twitter “app”. See vignette("auth", package = "rtweet") for details.
library(rtweet)
rtweet should be used in strict accordance with Twitter’s developer terms.
Search for up to 1000 tweets containing #rstats, the common hashtag used to refer to the R language, excluding retweets:
rt <- search_tweets("#rstats", n = 1000, include_rts = FALSE)
Twitter rate limits cap the number of search results returned to 18,000 every 15 minutes. To request more than that, set retryonratelimit = TRUE and rtweet will wait for rate limit resets for you.
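For example, a request that exceeds a single rate-limit window might look like this (a sketch; the call pauses until the rate limit resets, so it can take a while):

```r
library(rtweet)

# Ask for more results than one 15-minute window allows;
# rtweet sleeps through rate-limit resets as needed
rt_big <- search_tweets("#rstats", n = 50000, retryonratelimit = TRUE)
```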
Search for 200 users with the #rstats in their profile:
useRs <- search_users("#rstats", n = 200)
Randomly sample (approximately 1%) from the live stream of all tweets:
random_stream <- stream_tweets("")
Stream all geolocated tweets from London for 60 seconds:
stream_london <- stream_tweets(location = lookup_coords("london"), timeout = 60)
Get all accounts followed by a user:
## get user IDs of accounts followed by R Foundation
R_Foundation_fds <- get_friends("_R_Foundation")
## lookup data on those accounts
R_Foundation_fds_data <- lookup_users(R_Foundation_fds$to_id)
Get all accounts following a user:
## get user IDs of accounts following R Foundation
R_Foundation_flw <- get_followers("_R_Foundation", n = 100)
R_Foundation_flw_data <- lookup_users(R_Foundation_flw$from_id)
If you want all followers, you'll need to set n = Inf and retryonratelimit = TRUE, but be warned that this might take a long time.
Get the most recent 100 tweets from R Foundation:
## get the most recent tweets posted by R Foundation
tmls <- get_timeline("_R_Foundation", n = 100)
Get the 10 most recently favorited statuses by R Foundation:
favs <- get_favorites("_R_Foundation", n = 10)
Communicating with Twitter’s APIs relies on an internet connection, which can sometimes be inconsistent.
If you have questions, need an example, or want to share a use case, you can post them on rOpenSci’s discuss, where you can also browse uses of rtweet.
That said, if you encounter an obvious bug for which there is not already an active issue, please create a new issue with all code used (preferably a reproducible example) on GitHub.
Code of Conduct
Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Author: ropensci
Source Code: https://github.com/ropensci/rtweet
License: View license
Data Science in a Box contains the materials required to teach (or learn from) an introductory data science course using R, all of which are freely available and open-source. They include course materials such as slide decks, homework assignments, guided labs, sample exams, and a final project assignment, as well as materials for instructors such as pedagogical tips, information on computing infrastructure, technology stack, and course logistics.
See datasciencebox.org for everything you need to know about the project!
Note that all materials are released with Creative Commons Attribution Share Alike 4.0 International license.
You can file an issue to get help, report a bug, or make a feature request.
Before opening a new issue, be sure to search issues and pull requests to make sure the bug hasn't been reported and/or already fixed in the development version. By default, the search will be pre-populated with is:issue is:open. You can edit the qualifiers (e.g. is:pr, is:closed) as needed. For example, you'd simply remove is:open to search all issues in the repo, open or closed.
If your issue involves R code, please make a minimal reproducible example using the reprex package. If you haven't heard of or used reprex before, you're in for a treat! Seriously, reprex will make all of your R-question-asking endeavors easier (which is a pretty insane ROI for the five to ten minutes it'll take you to learn what it's all about). For additional reprex pointers, check out the Get help! section of the tidyverse site.
Please note that the datasciencebox project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Author: rstudioeducation
Source Code: https://github.com/rstudioeducation/datasciencebox
License: View license
Themes for ggplot2. The idea of this package is that you can just set the theme and then forget about it. You shouldn't have to change any of your existing code. There are several parts to a theme:
There are a number of preset palettes and layouts, and methods to create your own colour schemes.
This package is still under development, but can be installed using devtools.
devtools::install_github('MikataProject/ggthemr')
We plan to submit to CRAN soon, but that is currently blocked by an upstream issue.
To just set the colour scheme:
ggthemr('dust')
That's it. Any ggplot you create from then on will have the theme applied. You can clear the theme and return to ggplot2's default using:
ggthemr_reset()
The palette determines the colours of everything in a plot, including the background, layers, gridlines, title text, axes lines, axes text and axes titles. The swatch is the name given to the set of colours strictly used in styling the geoms/layer elements (e.g. the points in geom_point(), bars in geom_bar(), etc.). At least six colours have been supplied in each palette's swatch.
There are a wide variety of themes in this package (and more on the way). Some of them are serious business... others are deliberately stylish and might not be that good for use in proper publications.
define_palette() lets you make your own themes that can be passed to ggthemr() just like any of the palettes above. Here's an example of a (probably ugly) palette using random colours:
# Random colours that aren't white.
set.seed(12345)
random_colours <- sample(colors()[-c(1, 253, 361)], 10L)
ugly <- define_palette(
  swatch = random_colours,
  gradient = c(lower = random_colours[1L], upper = random_colours[2L])
)
ggthemr(ugly)
example_plot + ggtitle(':(')
You can define all elements of a palette using define_palette(), including colours for the background, text, axes lines, swatch and gradients.
The layout of a theme controls the appearance and position of the axes, gridlines and text. Some folk prefer both major and minor gridlines, others prefer none or something in between.
Plot margins and space between axes titles and lines etc. is controlled with the spacing parameter. Lower values will make plots more compact, higher values will give them more padding. Compare the plots below where the spacing has been set to 0, 1 and 2 respectively.
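The spacing value is passed when setting the theme; a minimal sketch comparing a compact and a padded layout (assuming the spacing parameter of ggthemr() as documented at the time of writing):

```r
library(ggthemr)

# Compact layout: minimal padding around the plot
ggthemr('dust', spacing = 0)

# More generous padding between axes, titles and plot edges
ggthemr('dust', spacing = 2)
```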
The type parameter can be set to either inner or outer. When inner, the background colour of a plot will not extend past the plot area. outer will colour the entire plot and background.
ggthemr('earth', type = 'inner')
example_plot
ggthemr('earth', type = 'outer')
example_plot
Squinting at a chart? Low on printer ink? ggthemr includes some methods to tweak charts to make them lighter or darker. Here's a standard theme:
ggthemr('dust')
example_plot
Maybe that plot comes out a bit pale looking when you print it. Here's how you can add a bit more contrast to the swatch:
darken_swatch(amount = 0.3)
example_plot
The amount parameter to darken_swatch() controls the degree to which the colours are made darker. Full list of methods with similar functionality:
darken_swatch() / lighten_swatch(): darker/lighter swatch colours.
darken_gradient() / lighten_gradient(): darker/lighter gradient colours.
darken_palette() / lighten_palette(): darker/lighter everything.
I'll add methods to darken/lighten the axes lines and text soon too.
Most of the time you'll probably just want to set the theme and not worry about it. There may be times though where you'll want to make some small adjustment, or manually change what items appear as what colour in a plot.
ggthemr('dust')
mpg_plot <- ggplot(mpg[mpg$drv != '4', ], aes(factor(cyl), cty, fill = drv)) +
geom_boxplot() + labs(x = 'Cylinders', y = 'City MPG', fill = 'Drive Type') +
theme(legend.position = 'bottom')
mpg_plot
For some reason you decide you want to change those colours. Front-wheel drive vehicles should be orange. Rear-wheelers should be that red colour. You could change the order of the levels of your fill variable, but you shouldn't have to do that. You just want to switch those colours, but you have no idea what they are. swatch() will give you the colours in the currently active ggthemr palette.
swatch()
## [1] "#555555" "#db735c" "#EFA86E" "#9A8A76" "#F3C57B" "#7A6752" "#2A91A2"
## [8] "#87F28A" "#6EDCEF"
## attr(,"class")
## [1] "ggthemr_swatch"
So you can manually swap the two colours around.
to_swap <- swatch()[2:3]
mpg_plot + scale_fill_manual(values = rev(to_swap))
Note: the first colour in a swatch is a special one. It is reserved for outlining boxplots, text etc. So that's why the second and third colours were swapped.
ggthemr does three different things while setting a theme:
1. sets the theme using the ggplot2::theme_set() function,
2. sets geom aesthetic defaults using the ggplot2::update_geom_defaults() function,
3. sets the colour scales.
If you do not want to set the theme this way, use the set_theme = FALSE option when calling the ggthemr function. An example of setting the theme, geom aesthetic defaults and scales manually:
ggthemr_reset()
dust_theme <- ggthemr('dust', set_theme = FALSE)
example_plot
example_plot + dust_theme$theme
example_plot + dust_theme$theme + scale_fill_manual(values = dust_theme$palette$swatch)
do.call(what = ggplot2::update_geom_defaults, args = dust_theme$geom_defaults$new$bar)
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 850) + dust_theme$theme
Mikata Project took over ggthemr and will be the primary maintainer of this wonderful package. We would like to thank @cttobin for creating this package, and we appreciate that he agreed to pass the repo ownership to Mikata Project. As a first step, the Mikata team plans to resolve backlog issues and make ggthemr available on CRAN.
Author: MikataProject
Source Code: https://github.com/MikataProject/ggthemr
License: GPL-3
easystats is a collection of R packages, which aims to provide a unifying and consistent framework to tame, discipline, and harness the scary R statistics and their pesky models.
However, there is not (yet) a unique “easystats” way of doing data analysis. Instead, start with one package and, when you face a new challenge, check whether there is an easystats answer for it in another package. You will slowly discover how using them together makes your life easier. And, who knows, you might even end up using them all.
| Type | Source | Command |
|---|---|---|
| Release | CRAN | install.packages("easystats") |
| Development | r-universe | install.packages("easystats", repos = "https://easystats.r-universe.dev") |
| Development | GitHub | remotes::install_github("easystats/easystats") |
Finally, easystats sometimes depends on additional packages for specific functions that are not downloaded by default. If you want to benefit from the full easystats experience without any hiccups, simply run the following:
easystats::install_suggested()
To cite the package, run the following command:
citation("easystats")
To cite easystats in publications use:
Lüdecke, Patil, Ben-Shachar, Wiernik, & Makowski (2022). easystats:
Framework for Easy Statistical Modeling, Visualization, and
Reporting. CRAN. Available from
https://easystats.github.io/easystats/
A BibTeX entry for LaTeX users is
@Article{,
title = {easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting},
author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Brenton M. Wiernik and Dominique Makowski},
journal = {CRAN},
year = {2022},
note = {R package},
url = {https://easystats.github.io/easystats/},
}
If you want to do this only for certain packages in the ecosystem, have a look at this article on how you can do so! https://easystats.github.io/easystats/articles/citation.html
Each easystats package has a different scope and purpose. This means your best way to start is to explore and pick the one(s) that you feel might be useful to you. However, as they are built with a “bigger picture” in mind, you will realize that using more of them creates a smooth workflow, as these packages are meant to work together. Ideally, these packages work in unison to cover all aspects of statistical analysis and data visualization.
How is easystats different from the tidyverse?
You’ve probably already heard about the tidyverse, another very popular collection of packages (ggplot, dplyr, tidyr, …) that also makes using R easier. So, should you pick the tidyverse or easystats? Pick both!
Indeed, these two ecosystems have been designed with very different goals in mind. The tidyverse packages are primarily made to create a new R experience, where data manipulation and exploration is intuitive and consistent. On the other hand, easystats focuses more on the final stretch of the analysis: understanding and interpreting your results and reporting them in a manuscript or a report, while following best practices. You can definitely use the easystats functions in a tidyverse workflow!
easystats + tidyverse = ❤️
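As a small illustration of that combined workflow (a sketch, assuming the dplyr and report packages are installed): wrangle data with the tidyverse, fit a model, then hand the result to easystats for reporting.

```r
library(dplyr)
library(report)

# Wrangle with the tidyverse, model with base R, report with easystats
iris %>%
  filter(Species != "setosa") %>%
  lm(Sepal.Length ~ Species, data = .) %>%
  report()
```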
Can easystats be useful to advanced users and/or developers?
Yes, definitely! easystats is built in terms of modules that are general enough to be used inside other packages. For instance, the insight package is made to easily implement support for postprocessing of pretty much all regression model packages under the sun. We use it in all the easystats packages, but it is also used in other noneasystats packages, such as ggstatsplot, modelsummary, ggeffects, and more.
So why not in yours?
Moreover, the easystats packages are very lightweight, with a minimal set of dependencies, which again makes it great if you want to rely on them.
Each easystats package has a dedicated website. For example, the website for parameters is https://easystats.github.io/parameters/.
In addition to the websites containing documentation for these packages, you can also read posts from the easystats blog: https://easystats.github.io/blog/posts/.
In addition to these websites and blog posts, you can also check out the following presentations and talks to learn more about this ecosystem:
https://easystats.github.io/easystats/articles/resources.html
easystats packages are designed to be lightweight, i.e., they don’t have any third-party hard dependencies other than base R packages or other easystats packages! If you develop R packages, this means that you can safely use easystats packages as dependencies in your own packages, without the risk of entering dependency hell.
library(deepdep)
plot_dependencies("easystats", depth = 2, show_stamp = FALSE)
As we can see, the only exception is the {see} package, which is responsible for plotting and creating figures and relies on {ggplot2}, which does have a substantial number of dependencies.
| Total | insight | bayestestR | parameters | performance | datawizard | effectsize | correlation | see | modelbased | report | easystats |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10,736,657 | 3,301,526 | 1,449,269 | 1,440,805 | 1,361,451 | 1,340,181 | 1,119,639 | 296,725 | 267,469 | 99,435 | 55,558 | 4,599 |
We are happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. Pull Requests for contributions are encouraged.
Here are some simple ways in which you can contribute (in the increasing order of commitment):
Please note that the ‘easystats’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Author: easystats
Source Code: https://github.com/easystats/easystats
License: GPL-3.0 license