Nat Grady

An R Package for Working with Causal Directed Acyclic Graphs (DAGs)

ggdag: An R Package for visualizing and analyzing causal directed acyclic graphs

Tidy, analyze, and plot causal directed acyclic graphs (DAGs). ggdag uses the powerful dagitty package to create and analyze structural causal models and plot them using ggplot2 and ggraph in a consistent and easy manner.

Installation

You can install ggdag with:

install.packages("ggdag")

Or you can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("malcolmbarrett/ggdag")

Example

ggdag makes it easy to use dagitty in the context of the tidyverse. You can directly tidy dagitty objects or use convenience functions to create DAGs using a more R-like syntax:

library(ggdag)
library(ggplot2)

#  example from the dagitty package
dag <- dagitty::dagitty("dag {
    y <- x <- z1 <- v -> z2 -> y
    z1 <- w1 <-> w2 -> z2
    x <- w1 -> y
    x <- w2 -> y
    x [exposure]
    y [outcome]
  }")

tidy_dag <- tidy_dagitty(dag)

tidy_dag
#> # A DAG with 7 nodes and 12 edges
#> #
#> # Exposure: x
#> # Outcome: y
#> #
#> # A tibble: 13 × 8
#>    name       x      y direction to       xend   yend circular
#>    <chr>  <dbl>  <dbl> <fct>     <chr>   <dbl>  <dbl> <lgl>   
#>  1 v     0.496  -3.40  ->        z1     1.83   -2.92  FALSE   
#>  2 v     0.496  -3.40  ->        z2     0.0188 -2.08  FALSE   
#>  3 w1    1.73   -1.94  ->        x      2.07   -1.42  FALSE   
#>  4 w1    1.73   -1.94  ->        y      1.00   -0.944 FALSE   
#>  5 w1    1.73   -1.94  ->        z1     1.83   -2.92  FALSE   
#>  6 w1    1.73   -1.94  <->       w2     0.873  -1.56  FALSE   
#>  7 w2    0.873  -1.56  ->        x      2.07   -1.42  FALSE   
#>  8 w2    0.873  -1.56  ->        y      1.00   -0.944 FALSE   
#>  9 w2    0.873  -1.56  ->        z2     0.0188 -2.08  FALSE   
#> 10 x     2.07   -1.42  ->        y      1.00   -0.944 FALSE   
#> 11 y     1.00   -0.944 <NA>      <NA>  NA      NA     FALSE   
#> 12 z1    1.83   -2.92  ->        x      2.07   -1.42  FALSE   
#> 13 z2    0.0188 -2.08  ->        y      1.00   -0.944 FALSE

#  using more R-like syntax to create the same DAG
tidy_ggdag <- dagify(
  y ~ x + z2 + w2 + w1,
  x ~ z1 + w1 + w2,
  z1 ~ w1 + v,
  z2 ~ w2 + v,
  w1 ~ ~w2, # bidirected path
  exposure = "x",
  outcome = "y"
) %>%
  tidy_dagitty()

tidy_ggdag
#> # A DAG with 7 nodes and 12 edges
#> #
#> # Exposure: x
#> # Outcome: y
#> #
#> # A tibble: 13 × 8
#>    name      x     y direction to     xend  yend circular
#>    <chr> <dbl> <dbl> <fct>     <chr> <dbl> <dbl> <lgl>   
#>  1 v     -3.58  3.30 ->        z1    -4.05  4.63 FALSE   
#>  2 v     -3.58  3.30 ->        z2    -2.23  3.74 FALSE   
#>  3 w1    -3.03  5.74 ->        x     -3.20  5.14 FALSE   
#>  4 w1    -3.03  5.74 ->        y     -1.98  5.22 FALSE   
#>  5 w1    -3.03  5.74 ->        z1    -4.05  4.63 FALSE   
#>  6 w1    -3.03  5.74 <->       w2    -2.35  4.72 FALSE   
#>  7 w2    -2.35  4.72 ->        x     -3.20  5.14 FALSE   
#>  8 w2    -2.35  4.72 ->        y     -1.98  5.22 FALSE   
#>  9 w2    -2.35  4.72 ->        z2    -2.23  3.74 FALSE   
#> 10 x     -3.20  5.14 ->        y     -1.98  5.22 FALSE   
#> 11 y     -1.98  5.22 <NA>      <NA>  NA    NA    FALSE   
#> 12 z1    -4.05  4.63 ->        x     -3.20  5.14 FALSE   
#> 13 z2    -2.23  3.74 ->        y     -1.98  5.22 FALSE

ggdag also provides functionality for analyzing DAGs and plotting them in ggplot2:

ggdag(tidy_ggdag) +
  theme_dag()

ggdag_adjustment_set(tidy_ggdag, node_size = 14) +
  theme(legend.position = "bottom")

It also provides geoms and other functions for plotting DAGs directly in ggplot2:

dagify(m ~ x + y) %>%
  tidy_dagitty() %>%
  node_dconnected("x", "y", controlling_for = "m") %>%
  ggplot(aes(
    x = x,
    y = y,
    xend = xend,
    yend = yend,
    shape = adjusted,
    col = d_relationship
  )) +
  geom_dag_edges(end_cap = ggraph::circle(10, "mm")) +
  geom_dag_collider_edges() +
  geom_dag_point() +
  geom_dag_text(col = "white") +
  theme_dag() +
  scale_adjusted() +
  expand_plot(expand_y = expansion(c(0.2, 0.2))) +
  scale_color_viridis_d(
    name = "d-relationship",
    na.value = "grey85",
    begin = .35
  )

And common structures of bias:

ggdag_equivalent_dags(confounder_triangle())


ggdag_butterfly_bias(edge_type = "diagonal")

Download Details:

Author: malcolmbarrett
Source Code: https://github.com/malcolmbarrett/ggdag 
License: Unknown, MIT licenses found

#r #rstats 

Nat Grady

R Package for Detecting Twitter Bots Via Machine Learning

Tweetbotornot

An R package for classifying Twitter accounts as bot or not.

Features

Uses machine learning to classify Twitter accounts as bots or not bots. The default model is 93.53% accurate when classifying bots and 95.32% accurate when classifying non-bots. The fast model is 91.78% accurate when classifying bots and 92.61% accurate when classifying non-bots.

Overall, the default model is correct 93.8% of the time.

Overall, the fast model is correct 91.9% of the time.

Install

Install from CRAN:

## install from CRAN
install.packages("tweetbotornot")

Install the development version from Github:

## install remotes if not already
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

## install tweetbotornot from github
devtools::install_github("mkearney/tweetbotornot")

API authorization

Users must be authorized in order to interact with Twitter’s API. To set up your machine to make authorized requests, you’ll either need to be signed into Twitter and working in an interactive session of R (the browser will open asking you to authorize the rtweet client, rstats2twitter), or you’ll need to create an app (and have a developer account) and your own API token. The latter has the benefit of (a) having sufficient permissions for write-access and DM (direct messages) read-access levels and (b) more stability if Twitter decides to shut down [@kearneymw](https://twitter.com/kearneymw)’s access to Twitter (I try to be very responsible these days, but Twitter isn’t always friendly to academic use cases). To create an app and your own Twitter token, see the instructions provided in the rtweet package.
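
If you go the create-your-own-token route, here is a minimal sketch of what that looks like with rtweet’s create_token() helper (the app name and keys below are placeholders, not real values; see the rtweet documentation for the full walkthrough):

## a sketch of creating your own token with rtweet (placeholder values)
library(rtweet)

token <- create_token(
  app             = "my_twitter_app",      # the name of the app you created
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)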

Usage

There’s one function tweetbotornot() (technically there’s also botornot(), but it does the same exact thing). Give it a vector of screen names or user IDs and let it go to work.

## load package
library(tweetbotornot)

## select users
users <- c("realdonaldtrump", "netflix_bot",
  "kearneymw", "dataandme", "hadleywickham",
  "ma_salmon", "juliasilge", "tidyversetweets", 
  "American__Voter", "mothgenerator", "hrbrmstr")

## get botornot estimates
data <- tweetbotornot(users)

## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#>    screen_name     user_id            prob_bot
#>    <chr>           <chr>                 <dbl>
#>  1 hadleywickham   69133574            0.00754
#>  2 realDonaldTrump 25073877            0.00995
#>  3 kearneymw       2973406683          0.0607 
#>  4 ma_salmon       2865404679          0.150  
#>  5 juliasilge      13074042            0.162  
#>  6 dataandme       3230388598          0.227  
#>  7 hrbrmstr        5685812             0.320  
#>  8 netflix_bot     1203840834          0.978  
#>  9 tidyversetweets 935569091678691328  0.997  
#> 10 mothgenerator   3277928935          0.998  
#> 11 American__Voter 829792389925597184  1.000  

Integration with rtweet

The botornot() function also accepts data returned by rtweet functions.

## get most recent 100 tweets from each user
tmls <- get_timelines(users, n = 100)

## pass the returned data to botornot()
data <- botornot(tmls)

## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#>    screen_name     user_id            prob_bot
#>    <chr>           <chr>                 <dbl>
#>  1 hadleywickham   69133574            0.00754
#>  2 realDonaldTrump 25073877            0.00995
#>  3 kearneymw       2973406683          0.0607 
#>  4 ma_salmon       2865404679          0.150  
#>  5 juliasilge      13074042            0.162  
#>  6 dataandme       3230388598          0.227  
#>  7 hrbrmstr        5685812             0.320  
#>  8 netflix_bot     1203840834          0.978  
#>  9 tidyversetweets 935569091678691328  0.997  
#> 10 mothgenerator   3277928935          0.998  
#> 11 American__Voter 829792389925597184  1.000  

fast = TRUE

The default [gradient boosted] model uses both user-level (bio, location, number of followers and friends, etc.) and tweet-level (number of hashtags, mentions, capital letters, etc. in a user’s most recent 100 tweets) data to estimate the probability that users are bots. For larger data sets, this method can be quite slow. Due to Twitter’s REST API rate limits, users are limited to only 180 estimates per 15 minutes.

To maximize the number of estimates per 15 minutes (at the cost of being less accurate), use the fast = TRUE argument. This method uses only user-level data, which increases the maximum number of estimates per 15 minutes to 90,000! Due to losses in accuracy, this method should be used with caution!

## get botornot estimates
data <- botornot(users, fast = TRUE)

## arrange by prob ests
data[order(data$prob_bot), ]
#> # A tibble: 11 x 3
#>    screen_name     user_id            prob_bot
#>    <chr>           <chr>                 <dbl>
#>  1 hadleywickham   69133574            0.00185
#>  2 kearneymw       2973406683          0.0415 
#>  3 ma_salmon       2865404679          0.0661 
#>  4 dataandme       3230388598          0.0965 
#>  5 juliasilge      13074042            0.112  
#>  6 hrbrmstr        5685812             0.121  
#>  7 realDonaldTrump 25073877            0.368  
#>  8 netflix_bot     1203840834          0.978  
#>  9 tidyversetweets 935569091678691328  0.998  
#> 10 mothgenerator   3277928935          0.999  
#> 11 American__Voter 829792389925597184  0.999  

NOTE

In order to avoid confusion, the package was renamed from “botrnot” to “tweetbotornot” in June 2018. This package should not be confused with the botornot application.

Download Details:

Author: mkearney
Source Code: https://github.com/mkearney/tweetbotornot 
License: Unknown, MIT licenses found

#r #machinelearning #twitter #rstats 

Nat Grady

Assertr: Assertive Programming for R analysis Pipelines

Assertr

Assertive Programming for R analysis Pipelines.

What is it?

The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline so that data errors are spotted early and can be addressed quickly.

This package does not need to be used with the magrittr/dplyr piping mechanism, but the examples in this README use it for clarity.

Installation

You can install the latest version on CRAN like this

    install.packages("assertr")

or you can install the bleeding-edge development version like this:

    install.packages("devtools")
    devtools::install_github("ropensci/assertr")

What does it look like?

This package offers five assertion functions, assert, verify, insist, assert_rows, and insist_rows, that are designed to be used shortly after data-loading in an analysis pipeline...

Let’s say, for example, that R’s built-in car dataset, mtcars, was not built-in but rather procured from an external source that was known for making errors in data entry or coding. Pretend we wanted to find the average miles per gallon for each number of engine cylinders. We might first want to confirm

  • that it has the columns "mpg", "vs", and "am"
  • that the dataset contains more than 10 observations
  • that the column for 'miles per gallon' (mpg) is a positive number
  • that the column for ‘miles per gallon’ (mpg) does not contain a datum that is outside 4 standard deviations from its mean, and
  • that the am and vs columns (automatic/manual and v/straight engine, respectively) contain 0s and 1s only
  • each row contains at most 2 NAs
  • each row is unique jointly between the "mpg", "am", and "wt" columns
  • each row's mahalanobis distance is within 10 median absolute deviations of all the distances (for outlier detection)

This could be written (in order) using assertr like this:

    library(dplyr)
    library(assertr)

    mtcars %>%
      verify(has_all_names("mpg", "vs", "am", "wt")) %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert_rows(col_concat, is_uniq, mpg, am, wt) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

If any of these assertions were violated, an error would have been raised and the pipeline would have been terminated early.

Let's see what the error messages look like when you chain a bunch of failing assertions together.

    > mtcars %>%
    +   chain_start %>%
    +   assert(in_set(1, 2, 3, 4), carb) %>%
    +   assert_rows(rowMeans, within_bounds(0,5), gear:carb) %>%
    +   verify(nrow(.)==10) %>%
    +   verify(mpg < 32) %>%
    +   chain_end
    There are 7 errors across 4 verbs:
    -
             verb redux_fn           predicate     column index value
    1      assert     <NA>  in_set(1, 2, 3, 4)       carb    30   6.0
    2      assert     <NA>  in_set(1, 2, 3, 4)       carb    31   8.0
    3 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    30   5.5
    4 assert_rows rowMeans within_bounds(0, 5) ~gear:carb    31   6.5
    5      verify     <NA>       nrow(.) == 10       <NA>     1    NA
    6      verify     <NA>            mpg < 32       <NA>    18    NA
    7      verify     <NA>            mpg < 32       <NA>    20    NA

    Error: assertr stopped execution

What does assertr give me?

verify - takes a data frame (its first argument is provided by the %>% operator above), and a logical (boolean) expression. Then, verify evaluates that expression using the scope of the provided data frame. If any of the logical values of the expression's result are FALSE, verify will raise an error that terminates any further processing of the pipeline.

assert - takes a data frame, a predicate function, and an arbitrary number of columns to apply the predicate function to. The predicate function (a function that returns a logical/boolean value) is then applied to every element of the columns selected, and will raise an error if it finds any violations. Internally, the assert function uses dplyr's select function to extract the columns to test the predicate function on.

insist - takes a data frame, a predicate-generating function, and an arbitrary number of columns. For each column, the predicate-generating function is applied, returning a predicate. The predicate is then applied to every element of the columns selected, and will raise an error if it finds any violations. The reason for using a predicate-generating function to return a predicate to use against each value in each of the selected columns is so that, for example, bounds can be dynamically generated based on what the data look like; this is the only way to, say, create bounds that check if each datum is within x z-scores, since the standard deviation isn't known a priori. Internally, the insist function uses dplyr's select function to extract the columns to test the predicate function on.
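
To make the calling pattern concrete, here is a minimal sketch of these three column-wise verbs as standalone calls (the data frame is always the first argument; the specific checks are arbitrary examples, and each call returns the data so it can be chained with %>%, or raises an error on a violation):

    library(assertr)

    verify(mtcars, mpg > 0)                    # logical expression, evaluated in the data frame's scope
    assert(mtcars, within_bounds(0, Inf), mpg) # predicate applied to every element of mpg
    insist(mtcars, within_n_sds(3), mpg)       # predicate generated from mpg itself, then applied to mpg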

assert_rows - takes a data frame, a row reduction function, a predicate function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate function is then applied to every element of the vector returned from the row reduction function, and will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the num_row_NAs() function to ensure that the number of missing values in each row stays below a certain threshold. Internally, the assert_rows function uses dplyr's select function to extract the columns to test the predicate function on.

insist_rows - takes a data frame, a row reduction function, a predicate-generating function, and an arbitrary number of columns to apply the predicate function to. The row reduction function is applied to the data frame, and returns a value for each row. The predicate-generating function is then applied to the vector returned from the row reduction function and the resultant predicate is applied to each element of that vector. It will raise an error if it finds any violations. This functionality is useful, for example, in conjunction with the maha_dist() function to ensure that there are no flagrant outliers. Internally, the insist_rows function uses dplyr's select function to extract the columns to test the predicate function on.
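
And a short sketch of the two row-wise verbs, showing both a passing and a failing case (the failing example uses a made-up data frame just to trigger the error behaviour described above):

    library(dplyr)
    library(assertr)

    # passes: mtcars has no missing values and no rows more than 10 MADs out
    mtcars %>%
      assert_rows(num_row_NAs, within_bounds(0, 2), everything()) %>%
      insist_rows(maha_dist, within_n_mads(10), everything())

    # fails: the second row has two NAs, violating within_bounds(0, 1),
    # so assert_rows raises an error and stops execution
    df <- data.frame(a = c(1, NA, 3), b = c(4, NA, 6))
    df %>%
      assert_rows(num_row_NAs, within_bounds(0, 1), everything())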

assertr also offers four (so far) predicate functions designed to be used with the assert and assert_rows functions:

  • not_na - that checks if an element is not NA
  • within_bounds - that returns a predicate function that checks if a numeric value falls within the bounds supplied
  • in_set - that returns a predicate function that checks if an element is a member of the set supplied (also allows inverse for "not in set"), and
  • is_uniq - that checks to see if each element appears only once

and predicate generators designed to be used with the insist and insist_rows functions:

  • within_n_sds - used to dynamically create bounds to check vector elements with based on standard z-scores
  • within_n_mads - better method for dynamically creating bounds to check vector elements with based on 'robust' z-scores (using median absolute deviation)

and the following row reduction functions designed to be used with assert_rows and insist_rows:

  • num_row_NAs - counts number of missing values in each row
  • maha_dist - computes the mahalanobis distance of each row (for outlier detection). It will coerce categorical variables into numerics if it needs to.
  • col_concat - concatenates all rows into strings
  • duplicated_across_cols - checks if a row contains a duplicated value across columns

and, finally, some other utilities for use with verify

  • has_all_names - check if the data frame or list has all supplied names
  • has_only_names - check that a data frame or list has only the names requested
  • has_class - checks if passed data has a particular class

More info

For more info, check out the assertr vignette

    > vignette("assertr")

Or read it here


Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/assertr 
License: View license

#r #assertion #rstats #functions 

Nat Grady

Visdat: Preliminary Exploratory Visualisation of Data

visdat 

How to install

visdat is available on CRAN

install.packages("visdat")

If you would like to use the development version, install from github with:

# install.packages("devtools")
devtools::install_github("ropensci/visdat")

What does visdat do?

Initially inspired by csv-fingerprint, visdat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat, and by giving a brief look into missing data patterns using vis_miss.

visdat has 8 functions:

vis_dat() visualises a dataframe showing you what the classes of the columns are, and also displaying the missing data.

vis_miss() visualises just the missing data, and allows for missingness to be clustered and columns rearranged. vis_miss() is similar to missing.pattern.plot from the mi package. Unfortunately missing.pattern.plot is no longer in the mi package (as of 14/02/2016).

vis_compare() visualise differences between two dataframes of the same dimensions

vis_expect() visualise where certain conditions hold true in your data

vis_cor() visualise the correlation of variables in a nice heatmap

vis_guess() visualise the individual class of each value in your data

vis_value() visualise the value class of each cell in your data

vis_binary() visualise the occurrence of binary values in your data

You can read more about visdat in the vignette, “using visdat”.

Code of Conduct

Please note that the visdat project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Examples

Using vis_dat()

Let’s see what’s inside the airquality dataset from base R, which contains information about daily air quality measurements in New York from May to September 1973. More information about the dataset can be found with ?airquality.

library(visdat)

vis_dat(airquality)

The plot above tells us that R reads this dataset as having numeric and integer values, with some missing data in Ozone and Solar.R. The classes are represented on the legend, and missing data represented by grey. The column/variable names are listed on the x axis.

Using vis_miss()

We can explore the missing data further using vis_miss():

vis_miss(airquality)

Percentages of missing/complete in vis_miss are accurate to 1 decimal place.

You can cluster the missingness by setting cluster = TRUE:

vis_miss(airquality, 
         cluster = TRUE)

Columns can also be arranged by columns with most missingness, by setting sort_miss = TRUE:

vis_miss(airquality,
         sort_miss = TRUE)

vis_miss indicates when there is a very small amount of missing data at <0.1% missingness:

test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))

vis_miss(test_miss_df)

vis_miss will also indicate when there is no missing data at all:

vis_miss(mtcars)

To further explore the missingness structure in a dataset, I recommend the naniar package, which provides more general tools for graphical and numerical exploration of missing values.

Using vis_compare()

Sometimes you want to see what has changed in your data. vis_compare() displays the differences in two dataframes of the same size. Let’s look at an example.

Let’s make some changes to the chickwts dataset and compare this altered version to the original:

set.seed(2019-04-03-1105)
chickwts_diff <- chickwts
chickwts_diff[sample(1:nrow(chickwts), 30),sample(1:ncol(chickwts), 2)] <- NA

vis_compare(chickwts_diff, chickwts)

Here the differences are marked in blue.

If you try and compare differences when the dimensions are different, you get an ugly error:

chickwts_diff_2 <- chickwts
chickwts_diff_2$new_col <- chickwts_diff_2$weight*2

vis_compare(chickwts, chickwts_diff_2)
# Error in vis_compare(chickwts, chickwts_diff_2) : 
#   Dimensions of df1 and df2 are not the same. vis_compare requires dataframes of identical dimensions.

Using vis_expect()

vis_expect visualises certain conditions or values in your data. For example, if you are not sure whether to expect values greater than or equal to 25 in your data (airquality), you could write vis_expect(airquality, ~.x >= 25), and you can see if there are cases where the values in your data are greater than or equal to 25:

vis_expect(airquality, ~.x >= 25)

This shows the proportion of times that there are values greater than 25, as well as the missings.

Using vis_cor()

To make it easy to plot correlations of your data, use vis_cor:

vis_cor(airquality)

Using vis_value

vis_value() visualises the values of your data on a 0 to 1 scale.

vis_value(airquality)

It only works on numeric data, so you might get strange results if you are using factors:

library(ggplot2)
vis_value(iris)
data input can only contain numeric values, please subset the data to the numeric values you would like. dplyr::select_if(data, is.numeric) can be helpful here!

So you might need to subset the data beforehand like so:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  select_if(is.numeric) %>%
  vis_value()

Using vis_binary()

vis_binary() visualises binary values. See below for use with example data, dat_bin

vis_binary(dat_bin)

If you don’t have only binary values a warning will be shown.

vis_binary(airquality)
Error in test_if_all_binary(data) : 
  data input can only contain binary values - this means either 0 or 1, or NA. Please subset the data to be binary values, or see ?vis_value.

Using vis_guess()

vis_guess() takes a guess at what each cell is. It’s best illustrated using some messy data, which we’ll make here:

messy_vector <- c(TRUE,
                  T,
                  "TRUE",
                  "T",
                  "01/01/01",
                  "01/01/2001",
                  NA,
                  NaN,
                  "NA",
                  "Na",
                  "na",
                  "10",
                  10,
                  "10.1",
                  10.1,
                  "abc",
                  "$%TG")

set.seed(2019-04-03-1106)
messy_df <- data.frame(var1 = messy_vector,
                       var2 = sample(messy_vector),
                       var3 = sample(messy_vector))

vis_guess(messy_df)
vis_dat(messy_df)

So here we see that there are many different kinds of data in your dataframe. As an analyst this might be a depressing finding. We can see this comparison above.

Thank yous

Thank you to Ivan Hanigan who first commented this suggestion after I made a blog post about an initial prototype ggplot_missing, and Jenny Bryan, whose tweet got me thinking about vis_dat, and for her code contributions that removed a lot of errors.

Thank you to Hadley Wickham for suggesting the use of the internals of readr to make vis_guess work. Thank you to Miles McBain for his suggestions on how to improve vis_guess. This resulted in making it at least 2-3 times faster. Thanks to Carson Sievert for writing the code that combined plotly with visdat, and for Noam Ross for suggesting this in the first place. Thank you also to Earo Wang and Stuart Lee for their help in getting capturing expressions in vis_expect.

Finally thank you to rOpenSci and its amazing onboarding process; this process has made visdat a much better package, thanks to the editor Noam Ross (@noamross), and the reviewers Sean Hughes (@seaaan) and Mara Averick (@batpigandme).


Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/visdat 
License: View license

#r #dataanalysis #visualising #rstats 

Nat Grady

Magick: Magic, Madness, Heaven, Sin

rOpenSci: The magick package 

Advanced Image-Processing in R

Bindings to ImageMagick: the most comprehensive open-source image processing library available. Supports many common formats (png, jpeg, tiff, pdf, etc) and manipulations (rotate, scale, crop, trim, flip, blur, etc). All operations are vectorized via the Magick++ STL meaning they operate either on a single frame or a series of frames for working with layers, collages, or animation. In RStudio images are automatically previewed when printed to the console, resulting in an interactive editing environment.

Documentation

About the R package:

About the underlying library:

Hello World

Run examples in RStudio to see live previews of the images! If you do not use RStudio, use image_browse to open images. On Linux you can also use image_display to get an X11 preview.

library(magick)
frink <- image_read("https://jeroen.github.io/images/frink.png")
image_trim(frink)
image_scale(frink, "200x200")
image_flip(frink)
image_rotate(frink, 45) ## <-- result of this is shown
image_negate(frink)
frink %>% 
  image_background("green") %>% 
  image_flatten() %>%
  image_border("red", "10x10")
image_rotate(frink, 45) %>% image_write("man/figures/frink-rotated.png")

Effects

image_oilpaint(frink)
image_implode(frink)
image_charcoal(frink) ## <-- result of this is shown
image_blur(frink)
image_edge(frink)
image_charcoal(frink) %>% image_write("man/figures/frink-charcoal.png")

Create GIF animation:

# Download images
oldlogo <- image_read("https://developer.r-project.org/Logo/Rlogo-2.png")
newlogo <- image_read("https://jeroen.github.io/images/Rlogo-old.png")
logos <- c(oldlogo, newlogo)
logos <- image_scale(logos, "400x400")

# Create GIF
(animation1 <- image_animate(logos))
image_write(animation1, "man/figures/anim1.gif")

# Morph effect  <-- result of this is shown
(animation2 <- image_animate(image_morph(logos, frames = 10)))
image_write(animation2, "man/figures/anim2.gif")

Read GIF animation frames. See the rotating earth example GIF.

earth <- image_read("https://upload.wikimedia.org/wikipedia/commons/2/2c/Rotating_earth_%28large%29.gif")
length(earth)
earth[1]
earth[1:3]
earth1 <- rev(image_flip(earth)) ## How Australians see earth
image_write(earth1, "man/figures/earth1.gif") ## <-- result of this is shown

R logo with dancing banana

logo <- image_read("https://www.r-project.org/logo/Rlogo.png")
banana <- image_read("https://jeroen.github.io/images/banana.gif")
front <- image_scale(banana, "300")
background <- image_scale(logo, "400")
frames <- lapply(as.list(front), function(x) image_flatten(c(background, x)))
image_write(image_animate(image_join(frames)), "man/figures/Rlogo-banana.gif")

Use magick in Shiny Apps

This demo application shows how to use magick with shiny: https://github.com/jeroen/shinymagick

Installation

Binary packages for macOS or Windows can be installed directly from CRAN:

install.packages("magick")

Installation from source on Linux or OSX requires the imagemagick Magick++ library. On Debian or Ubuntu install libmagick++-dev:

sudo apt-get install -y libmagick++-dev

If you are on Ubuntu 14.04 (trusty) or 16.04 (xenial) you can get a more recent backport from the ppa:

sudo add-apt-repository -y ppa:cran/imagemagick
sudo apt-get update
sudo apt-get install -y libmagick++-dev 

On Fedora, CentOS or RHEL we need ImageMagick-c++-devel. However on CentOS the system version of ImageMagick is quite old. More recent versions are available from the ImageMagick downloads website.

sudo yum install ImageMagick-c++-devel

On macOS use imagemagick@6 from Homebrew.

brew install imagemagick@6

The unversioned homebrew formula imagemagick can also be used; however, it has some unsolved OpenMP problems.

There is also a fork of imagemagick called graphicsmagick, but this doesn't work for this package.

Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/magick 
License: View license

#r #image #processing #rstats 

Nat Grady

PDFtools: Text Extraction, Rendering and Converting Of PDF Documents

pdftools

Introduction

Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from pdf files in R. From the extracted plain-text one could find articles discussing a particular drug or species name, without having to rely on publishers providing metadata, or pay-walled search engines.

pdftools slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installation

On Windows and Mac the binary packages can be installed directly from CRAN:

install.packages("pdftools")

Installation on Linux requires the poppler development library. For Ubuntu 18.04 (Bionic) and Ubuntu 20.04 (Focal) we provide backports of poppler version 22.02 to support the latest functionality:

sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev

On other versions of Debian or Ubuntu simply use:

sudo apt-get install libpoppler-cpp-dev

If you want to install the package from source on MacOS you need brew:

brew install poppler

On Fedora:

sudo yum install poppler-cpp-devel

Building from source

On Ubuntu

Update: It is now recommended to use the backport PPA mentioned above. If you really want to build from source, follow the instructions of this askubuntu.com answer.

On CentOS

On CentOS the libpoppler-cpp library is not included with the system so we need to build from source. Note that recent versions of poppler require C++11 which is not available on CentOS, so we build a slightly older version of libpoppler.

# Build dependencies
yum install wget xz libjpeg-devel openjpeg2-devel

# Download and extract
wget https://poppler.freedesktop.org/poppler-0.47.0.tar.xz
tar -Jxvf poppler-0.47.0.tar.xz
cd poppler-0.47.0

# Build and install
./configure
make
sudo make install

By default libraries get installed in /usr/local/lib and /usr/local/include. On CentOS this is not a default search path so we need to set PKG_CONFIG_PATH and LD_LIBRARY_PATH to point R to the right directory:

export LD_LIBRARY_PATH="/usr/local/lib"
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

We can then start R and install pdftools.

Getting started

The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text which returns a character vector of length equal to the number of pages in the pdf. Each string in the vector contains a plain text version of the text on that page.

library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")

Bonus feature: rendering pdf

A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
webp::write_webp(bitmap, "page.webp")

This feature is still experimental and currently does not work on Windows.

Limitations and related packages

Tables

Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have a notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data with pdftools.

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])

The tabulizer package is dedicated to extracting tables from PDF, and includes interactive tools for selecting tables. However, tabulizer depends on rJava and therefore requires additional setup steps or may be impossible to use on systems where Java cannot be installed.

It is possible to use pdftools with some creativity to parse tables from PDF documents, which does not require Java to be installed.
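
A minimal sketch of that idea: split the text of a page into lines, then split lines on runs of whitespace to recover columns. Whether this yields a clean table depends entirely on the layout of the particular PDF; the page number and the two-space separator below are assumptions you would adjust per document.

library(pdftools)

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# split one page into lines and drop empty ones
lines <- strsplit(txt[18], "\n")[[1]]
lines <- trimws(lines)
lines <- lines[nzchar(lines)]

# split each line on runs of 2+ spaces; lines with the same number of
# cells are likely rows of the same table
cells <- strsplit(lines, "\\s{2,}")
n_cells <- vapply(cells, length, integer(1))
rows <- cells[n_cells == max(n_cells)]
tab <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
head(tab)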

Scanned text

If you want to extract text from scanned text present in a pdf, you'll need to use OCR (optical character recognition). Please refer to the rOpenSci tesseract package that provides bindings to the Tesseract OCR engine. In particular read the section of its vignette about reading from PDF files using pdftools and tesseract.

Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/pdftools 
License: Unknown, MIT licenses found

#r #text #rstats #pdf 

Nat Grady

Tabulizer: Bindings for Tabula PDF Table Extractor Library

Extract Tables from PDFs

Tabulizer provides R bindings to the Tabula java library, which can be used to computationally extract tables from PDF documents.

Note: tabulizer is released under the MIT license, as is Tabula itself.

Installation

tabulizer depends on rJava, which implies a system requirement for Java. This can be frustrating, especially on Windows. The preferred Windows workflow is to use Chocolatey to obtain, configure, and update Java. You need to do this before installing rJava or attempting to use tabulizer. More on this and troubleshooting below.

To install the latest CRAN version:

install.packages("tabulizer")

To install the latest development version:

if (!require("remotes")) {
    install.packages("remotes")
}
# on 64-bit Windows
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

Code Examples

The main function, extract_tables(), provides an R clone of the Tabula command-line application:

library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")
out1 <- extract_tables(f)
str(out1)
## List of 4
##  $ : chr [1:32, 1:10] "mpg" "21.0" "21.0" "22.8" ...
##  $ : chr [1:7, 1:5] "Sepal.Length " "5.1 " "4.9 " "4.7 " ...
##  $ : chr [1:7, 1:6] "" "145 " "146 " "147 " ...
##  $ : chr [1:15, 1] "supp" "VC" "VC" "VC" ...

By default, it returns the most table-like R structure available: a matrix. It can also write the tables to disk or attempt to coerce them to data.frames using the output argument. It is also possible to select tables from only specified pages using the pages argument.

out2 <- extract_tables(f, pages = 1, guess = FALSE, output = "data.frame")
str(out2)
## List of 1
##  $ :'data.frame':       33 obs. of  13 variables:
##   ..$ X   : chr [1:33] "Mazda RX4 " "Mazda RX4 Wag " "Datsun 710 " "Hornet 4 Drive " ...
##   ..$ mpg : num [1:33] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##   ..$ cyl : num [1:33] 6 6 4 6 8 6 8 4 4 6 ...
##   ..$ X.1 : int [1:33] NA NA NA NA NA NA NA NA NA NA ...
##   ..$ disp: num [1:33] 160 160 108 258 360 ...
##   ..$ hp  : num [1:33] 110 110 93 110 175 105 245 62 95 123 ...
##   ..$ drat: num [1:33] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##   ..$ wt  : num [1:33] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:33] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:33] 0 0 1 1 0 1 0 1 1 1 ...
##   ..$ am  : num [1:33] 1 1 1 0 0 0 0 0 0 0 ...
##   ..$ gear: num [1:33] 4 4 4 3 3 3 3 4 4 4 ...
##   ..$ carb: int [1:33] 4 4 1 1 2 1 4 2 2 4 ...

It is also possible to manually specify smaller areas within pages to look for tables using the area and columns arguments to extract_tables(). This facilitates extraction from smaller portions of a page, such as when a table is embedded in a larger section of text or graphics.
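
For example, a sketch of restricting extraction to a single rectangle on one page (the coordinates are illustrative placeholders given in points as c(top, left, bottom, right); use locate_areas() or get_page_dims() to find real values for your document):

library("tabulizer")
f <- system.file("examples", "data.pdf", package = "tabulizer")

# only look for a table inside one rectangle on page 2
out_area <- extract_tables(
  f,
  pages = 2,
  area = list(c(126, 149, 212, 462)),
  guess = FALSE
)
str(out_area)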

Another function, extract_areas(), implements this through an interactive style in which each page of the PDF is loaded as an R graphic and the user can use their mouse to specify the upper-left and lower-right bounds of an area. Those areas are then extracted auto-magically (and the return value is the same as for extract_tables()). Here’s a shot of it in action:

[Screenshot: extract_areas() interactive area selection]

locate_areas() handles the area identification process without performing the extraction, which may be useful as a debugger.

extract_text() simply returns text, possibly separately for each (specified) page:

out3 <- extract_text(f, page = 3)
cat(out3, sep = "\n")
## len supp dose
## 4.2 VC 0.5
## 11.5 VC 0.5
## 7.3 VC 0.5
## 5.8 VC 0.5
## 6.4 VC 0.5
## 10.0 VC 0.5
## 11.2 VC 0.5
## 11.2 VC 0.5
## 5.2 VC 0.5
## 7.0 VC 0.5
## 16.5 VC 1.0
## 16.5 VC 1.0
## 15.2 VC 1.0
## 17.3 VC 1.0
## 22.5 VC 1.0
## 3

Note that for large PDF files, it is possible to run up against Java memory constraints, leading to a java.lang.OutOfMemoryError: Java heap space error message. Memory can be increased by setting options(java.parameters = "-Xmx16000m") (or some other reasonable amount of memory).
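
Because the java.parameters option only takes effect when the JVM is initialized, it has to be set before the package (and rJava) is loaded, for example:

## set the JVM heap size *before* loading tabulizer/rJava,
## otherwise the option has no effect
options(java.parameters = "-Xmx16000m")
library("tabulizer")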

Some other utility functions are also provided (and made possible by the Java Apache PDFBox library):

  • extract_text() converts the text of an entire file or specified pages into an R character vector.
  • split_pdf() and merge_pdfs() split and merge PDF documents, respectively.
  • extract_metadata() extracts PDF metadata as a list.
  • get_n_pages() determines the number of pages in a document.
  • get_page_dims() determines the width and height of each page in pt (the unit used by area and columns arguments).
  • make_thumbnails() converts specified pages of a PDF file to image files.

Installing Java on Windows with Chocolatey

In command prompt, install Chocolatey if you don’t already have it:

@powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin

Then, install java using Chocolatey’s choco install command:

choco install jdk7 -y

You may also need to then set the JAVA_HOME environment variable to the path to your Java installation (e.g., C:\Program Files\Java\jdk1.8.0_92). This can be done:

  1. within R using Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_92") (note slashes), or
  2. from command prompt using the setx command: setx JAVA_HOME C:\Program Files\Java\jdk1.8.0_92, or
  3. from PowerShell, using the .NET framework: [Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Java\jdk1.8.0_92", "User"), or
  4. from the Start Menu, via Control Panel » System » Advanced » Environment Variables (instructions here).

You should now be able to safely open R, and use rJava and tabulizer. Note, however, that some users report that rather than setting this variable, they instead need to delete it (e.g., with Sys.setenv(JAVA_HOME = "")), so if the above instructions fail, that is the next step in troubleshooting.

Troubleshooting

Some notes for troubleshooting common installation problems:

  • On Mac OS, you may need to install a particular version of Java prior to attempting to install tabulizer.
  • On a Unix-like system, you need to ensure that R has been installed with Java support. This can often be fixed by running R CMD javareconf on the command line (possibly with sudo, etc. depending on your system setup).
  • On Windows, make sure you have permission to write to and install packages to your R directory before trying to install the package. This can be changed from “Properties” on the right-click context menu. Alternatively, you can ensure write permission by choosing “Run as administrator” when launching R (again, from the right-click context menu).

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for tabulizer in R by running citation(package = 'tabulizer')


Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/tabulizer 
License: View license

#r #java #pdf #rstats 

Nat Grady

Utilities for analyzing Bayesian Models & Posterior Distributions

BayestestR 

Become a Bayesian master you will


⚠️ We changed the default CI width! Please make an informed decision and set it explicitly (ci = 0.89, ci = 0.95 or anything else that you decide) ⚠️


Existing R packages allow users to easily fit a large variety of models and extract and visualize the posterior draws. However, most of these packages only return a limited set of indices (e.g., point-estimates and CIs). bayestestR provides a comprehensive and consistent set of functions to analyze and describe posterior distributions generated by a variety of model objects, including popular modeling packages such as rstanarm, brms or BayesFactor.

You can reference the package and its documentation as follows:

  • Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework. Journal of Open Source Software, 4(40), 1541. 10.21105/joss.01541
  • Makowski, D., Ben-Shachar, M. S., Chen, S. H. A., & Lüdecke, D. (2019). Indices of Effect Existence and Significance in the Bayesian Framework. Frontiers in Psychology 2019;10:2767. 10.3389/fpsyg.2019.02767

Installation

The bayestestR package is available on CRAN, while its latest development version is available on R-universe (from rOpenSci).

Type         Source       Command
Release      CRAN         install.packages("bayestestR")
Development  R-universe   install.packages("bayestestR", repos = "https://easystats.r-universe.dev")

Once you have downloaded the package, you can then load it using:

library("bayestestR")

Tip

Instead of library(bayestestR), use library(easystats). This will make all features of the easystats-ecosystem available.

To stay updated, use easystats::install_latest().

Documentation

Access the package documentation and check-out these vignettes:

Tutorials

Articles

Features

In the Bayesian framework, parameters are estimated in a probabilistic fashion as distributions. These distributions can be summarised and described by reporting four types of indices:

describe_posterior() is the master function with which you can compute all of the indices cited below at once.

describe_posterior(
  rnorm(10000),
  centrality = "median",
  test = c("p_direction", "p_significance")
)
## Summary of Posterior Distribution
## 
## Parameter |    Median |        95% CI |     pd |   ps
## -----------------------------------------------------
## Posterior | -4.19e-03 | [-1.91, 1.98] | 50.18% | 0.46

describe_posterior() works for many objects, including more complex brmsfit-models. For better readability, the output is separated by model components:

zinb <- read.csv("http://stats.idre.ucla.edu/stat/data/fish.csv")
set.seed(123)
model <- brm(
  bf(
    count ~ child + camper + (1 | persons),
    zi ~ child + camper + (1 | persons)
  ),
  data = zinb,
  family = zero_inflated_poisson(),
  chains = 1,
  iter = 500
)

describe_posterior(
  model,
  effects = "all",
  component = "all",
  test = c("p_direction", "p_significance"),
  centrality = "all"
)
## Summary of Posterior Distribution
## 
## Parameter   | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## --------------------------------------------------------------------------------------
## (Intercept) |   0.96 |  0.96 |  0.96 | [-0.81,  2.51] | 90.00% | 0.88 | 1.011 | 110.00
## child       |  -1.16 | -1.16 | -1.16 | [-1.36, -0.94] |   100% | 1.00 | 0.996 | 278.00
## camper      |   0.73 |  0.72 |  0.73 | [ 0.54,  0.91] |   100% | 1.00 | 0.996 | 271.00
## 
## # Fixed effects (zero-inflated)
## 
## Parameter   | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## --------------------------------------------------------------------------------------
## (Intercept) |  -0.48 | -0.51 | -0.22 | [-2.03,  0.89] | 78.00% | 0.73 | 0.997 | 138.00
## child       |   1.85 |  1.86 |  1.81 | [ 1.19,  2.54] |   100% | 1.00 | 0.996 | 303.00
## camper      |  -0.88 | -0.86 | -0.99 | [-1.61, -0.07] | 98.40% | 0.96 | 0.996 | 292.00
## 
## # Random effects (conditional) Intercept: persons
## 
## Parameter |    Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## ---------------------------------------------------------------------------------------
## persons.1 |     -0.99 | -1.01 | -0.84 | [-2.68,  0.80] | 92.00% | 0.90 | 1.007 | 106.00
## persons.2 | -4.65e-03 | -0.04 |  0.03 | [-1.63,  1.66] | 50.00% | 0.45 | 1.013 | 109.00
## persons.3 |      0.69 |  0.66 |  0.69 | [-0.95,  2.34] | 79.60% | 0.78 | 1.010 | 114.00
## persons.4 |      1.57 |  1.56 |  1.56 | [-0.05,  3.29] | 96.80% | 0.96 | 1.009 | 114.00
## 
## # Random effects (zero-inflated) Intercept: persons
## 
## Parameter | Median |  Mean |   MAP |         95% CI |     pd |   ps |  Rhat |    ESS
## ------------------------------------------------------------------------------------
## persons.1 |   1.10 |  1.11 |  1.08 | [-0.23,  2.72] | 94.80% | 0.93 | 0.997 | 166.00
## persons.2 |   0.18 |  0.18 |  0.22 | [-0.94,  1.58] | 63.20% | 0.54 | 0.996 | 154.00
## persons.3 |  -0.30 | -0.31 | -0.54 | [-1.79,  1.02] | 64.00% | 0.59 | 0.997 | 154.00
## persons.4 |  -1.45 | -1.46 | -1.44 | [-2.90, -0.10] | 98.00% | 0.97 | 1.000 | 189.00
## 
## # Random effects (conditional) SD/Cor: persons
## 
## Parameter   | Median | Mean |  MAP |         95% CI |   pd |   ps |  Rhat |    ESS
## ----------------------------------------------------------------------------------
## (Intercept) |   1.42 | 1.58 | 1.07 | [ 0.71,  3.58] | 100% | 1.00 | 1.010 | 126.00
## 
## # Random effects (zero-inflated) SD/Cor: persons
## 
## Parameter   | Median | Mean |  MAP |         95% CI |   pd |   ps |  Rhat |    ESS
## ----------------------------------------------------------------------------------
## (Intercept) |   1.30 | 1.49 | 0.99 | [ 0.63,  3.41] | 100% | 1.00 | 0.996 | 129.00

bayestestR also includes many other features useful for your Bayesian analyses. Here are some more examples:

Point-estimates

library(bayestestR)

posterior <- distribution_gamma(10000, 1.5) # Generate a skewed distribution
centrality <- point_estimate(posterior) # Get indices of centrality
centrality
## Point Estimate
## 
## Median | Mean |  MAP
## --------------------
## 1.18   | 1.50 | 0.51

As for other easystats packages, plot() methods are available from the see package for many functions:

While the median and the mean are available through base R functions, map_estimate() in bayestestR can be used to directly find the Highest Maximum A Posteriori (MAP) estimate of a posterior, i.e., the value associated with the highest probability density (the “peak” of the posterior distribution). In other words, it is an estimation of the mode for continuous parameters.
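
For example, a quick sketch using the same skewed posterior generated above:

posterior <- distribution_gamma(10000, 1.5) # the skewed distribution from above

# returns the value with the highest probability density
# (roughly 0.51 for this posterior, matching the MAP column of point_estimate() above)
map_estimate(posterior)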

Uncertainty (CI)

hdi() computes the Highest Density Interval (HDI) of a posterior distribution, i.e., the interval containing all points that have a higher probability density than points outside the interval. The HDI can be used in the context of Bayesian posterior characterization as a Credible Interval (CI).

Unlike equal-tailed intervals (see eti()) that typically exclude 2.5% from each tail of the distribution, the HDI is not equal-tailed and therefore always includes the mode(s) of posterior distributions.

posterior <- distribution_chisquared(10000, 4)

hdi(posterior, ci = 0.89)
## 89% HDI: [0.18, 7.63]

eti(posterior, ci = 0.89)
## 89% ETI: [0.75, 9.25]

Existence and Significance Testing

Probability of Direction (pd)

p_direction() computes the Probability of Direction (pd, also known as the Maximum Probability of Effect - MPE). It varies between 50% and 100% (i.e., 0.5 and 1) and can be interpreted as the probability (expressed in percentage) that a parameter (described by its posterior distribution) is strictly positive or negative (whichever is the most probable). It is mathematically defined as the proportion of the posterior distribution that is of the median’s sign. Although differently expressed, this index is fairly similar (i.e., is strongly correlated) to the frequentist p-value.

Relationship with the p-value: In most cases, it seems that the pd corresponds to the frequentist one-sided p-value through the formula p-value = (1-pd/100) and to the two-sided p-value (the most commonly reported) through the formula p-value = 2*(1-pd/100). Thus, a pd of 95%, 97.5%, 99.5% and 99.95% corresponds approximately to a two-sided p-value of respectively .1, .05, .01 and .001. See the reporting guidelines.

posterior <- distribution_normal(10000, 0.4, 0.2)
p_direction(posterior)
## Probability of Direction: 0.98
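
As a worked example of the formulas above, the pd of 0.98 (i.e., 98%) reported here translates to approximate p-values as follows:

pd <- 98 # probability of direction, in percent

1 - pd / 100       # approximate one-sided p-value
## [1] 0.02

2 * (1 - pd / 100) # approximate two-sided p-value (the most commonly reported)
## [1] 0.04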

ROPE

rope() computes the proportion (in percentage) of the HDI (by default the 89% HDI) of a posterior distribution that lies within a region of practical equivalence.

Statistically, the probability of a posterior distribution being different from 0 does not make much sense (the probability of it being different from a single point being infinite). Therefore, the idea underlying ROPE is to let the user define an area around the null value enclosing values that are equivalent to the null value for practical purposes (Kruschke, 2018).

Kruschke suggests that such null value could be set, by default, to the -0.1 to 0.1 range of a standardized parameter (negligible effect size according to Cohen, 1988). This could be generalized: For instance, for linear models, the ROPE could be set as 0 +/- .1 * sd(y). This ROPE range can be automatically computed for models using the rope_range function.

Kruschke suggests using the proportion of the 95% (or 90%, considered more stable) HDI that falls within the ROPE as an index for “null-hypothesis” testing (as understood under the Bayesian framework, see equivalence_test).

posterior <- distribution_normal(10000, 0.4, 0.2)
rope(posterior, range = c(-0.1, 0.1))
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
## 
## inside ROPE
## -----------
## 4.40 %

Bayes Factor

bayesfactor_parameters() computes Bayes factors against the null (either a point or an interval), based on prior and posterior samples of a single parameter. This Bayes factor indicates the degree by which the mass of the posterior distribution has shifted further away from or closer to the null value(s) (relative to the prior distribution), thus indicating if the null value has become less or more likely given the observed data.

When the null is an interval, the Bayes factor is computed by comparing the prior and posterior odds of the parameter falling within or outside the null; When the null is a point, a Savage-Dickey density ratio is computed, which is also an approximation of a Bayes factor comparing the marginal likelihoods of the model against a model in which the tested parameter has been restricted to the point null (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

prior <- distribution_normal(10000, mean = 0, sd = 1)
posterior <- distribution_normal(10000, mean = 1, sd = 0.7)

bayesfactor_parameters(posterior, prior, direction = "two-sided", null = 0)
## Bayes Factor (Savage-Dickey density ratio)
## 
## BF  
## ----
## 1.94
## 
## * Evidence Against The Null: 0

The lollipops represent the density of a point-null on the prior distribution (the blue lollipop on the dotted distribution) and on the posterior distribution (the red lollipop on the yellow distribution). The ratio between the two - the Savage-Dickey ratio - indicates the degree by which the mass of the parameter distribution has shifted away from or closer to the null.

For more info, see the Bayes factors vignette.

Utilities

Find ROPE’s appropriate range

rope_range(): This function attempts to automatically find suitable “default” values for the Region Of Practical Equivalence (ROPE). Kruschke (2018) suggests that such null value could be set, by default, to a range from -0.1 to 0.1 of a standardized parameter (negligible effect size according to Cohen, 1988), which can be generalised for linear models to -0.1 * sd(y), 0.1 * sd(y). For logistic models, the parameters expressed in log odds ratio can be converted to standardized difference through the formula sqrt(3)/pi, resulting in a range of -0.05 to 0.05.

rope_range(model)

Density Estimation

estimate_density(): This function is a wrapper over different methods of density estimation. By default, it uses the base R density function, but with a different smoothing bandwidth ("SJ") than that function's legacy default ("nrd0"). However, Deng & Wickham suggest that method = "KernSmooth" is the fastest and the most accurate.
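
A minimal sketch (the "KernSmooth" method assumes the KernSmooth package is installed):

posterior <- distribution_normal(10000, 0.4, 0.2)

# default kernel density estimate (bandwidth "SJ"); returns a data frame of x/y coordinates
head(estimate_density(posterior))

# the method suggested above as the fastest and most accurate
head(estimate_density(posterior, method = "KernSmooth"))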

Perfect Distributions

distribution(): Generate a sample of size n with near-perfect distributions.

distribution(n = 10)
##  [1] -1.55 -1.00 -0.66 -0.38 -0.12  0.12  0.38  0.66  1.00  1.55

Probability of a Value

density_at(): Compute the density of a given point of a distribution.

density_at(rnorm(1000, 1, 1), 1)
## [1] 0.45

Code of Conduct

Please note that the bayestestR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

References

Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280. https://doi.org/10.1177/2515245918771304

Kruschke, J. K., & Liddell, T. M. (2018). The bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206.

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the savage–dickey method. Cognitive Psychology, 60(3), 158–189.

Download Details:

Author: easystats
Source Code: https://github.com/easystats/bayestestR 
License: GPL-3.0 license

#r #map #rstats #bayesian #hacktoberfest 

Utilities for analyzing Bayesian Models & Posterior Distributions
Nat Grady

1667429460

Report: Automated Reporting Of Objects in R

Report 

“From R to your manuscript”

report’s primary goal is to bridge the gap between R’s output and the formatted results contained in your manuscript. It automatically produces reports of models and data frames according to best practices guidelines (e.g., APA’s style), ensuring standardization and quality in results reporting.

library(report)

model <- lm(Sepal.Length ~ Species, data = iris)
report(model)
# We fitted a linear model (estimated using OLS) to predict Sepal.Length with
# Species (formula: Sepal.Length ~ Species). The model explains a statistically
# significant and substantial proportion of variance (R2 = 0.62, F(2, 147) =
# 119.26, p < .001, adj. R2 = 0.61). The model's intercept, corresponding to
# Species = setosa, is at 5.01 (95% CI [4.86, 5.15], t(147) = 68.76, p < .001).
# Within this model:
# 
#   - The effect of Species [versicolor] is statistically significant and positive
# (beta = 0.93, 95% CI [0.73, 1.13], t(147) = 9.03, p < .001; Std. beta = 1.12,
# 95% CI [0.88, 1.37])
#   - The effect of Species [virginica] is statistically significant and positive
# (beta = 1.58, 95% CI [1.38, 1.79], t(147) = 15.37, p < .001; Std. beta = 1.91,
# 95% CI [1.66, 2.16])
# 
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and p-values were
# computed using a Wald t-distribution approximation.

Installation

The package is available on CRAN and can be downloaded by running:

install.packages("report")

If you would instead like to experiment with the development version, you can download it from GitHub:

install.packages("remotes")
remotes::install_github("easystats/report") # You only need to do that once

Load the package every time you start R

library("report")

Tip

Instead of library(report), use library(easystats). This will make all features of the easystats-ecosystem available.

To stay updated, use easystats::install_latest().

Documentation

The package documentation can be found here.

Report all the things

General Workflow

The report package works in a two-step fashion. First, you create a report object with the report() function. Then, this report object can be displayed either textually (the default output) or as a table, using as.data.frame(). Moreover, you can also access a shorter, more compact version of the report by calling summary() on the report object.
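
A minimal sketch of that two-step workflow (the model is only an example):

model <- lm(Sepal.Length ~ Species, data = iris)
r <- report(model) # step 1: create the report object
r                  # step 2a: display it as text (default)
summary(r)         # step 2b: shorter, more compact text
as.data.frame(r)   # step 2c: display it as a table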

workflow

The report() function works on a variety of models, as well as on other objects such as data frames:

report(iris)
# The data contains 150 observations of the following 5 variables:
# 
#   - Sepal.Length: n = 150, Mean = 5.84, SD = 0.83, Median = 5.80, MAD = 1.04,
# range: [4.30, 7.90], Skewness = 0.31, Kurtosis = -0.55, 0% missing
#   - Sepal.Width: n = 150, Mean = 3.06, SD = 0.44, Median = 3.00, MAD = 0.44,
# range: [2, 4.40], Skewness = 0.32, Kurtosis = 0.23, 0% missing
#   - Petal.Length: n = 150, Mean = 3.76, SD = 1.77, Median = 4.35, MAD = 1.85,
# range: [1, 6.90], Skewness = -0.27, Kurtosis = -1.40, 0% missing
#   - Petal.Width: n = 150, Mean = 1.20, SD = 0.76, Median = 1.30, MAD = 1.04,
# range: [0.10, 2.50], Skewness = -0.10, Kurtosis = -1.34, 0% missing
#   - Species: 3 levels, namely setosa (n = 50, 33.33%), versicolor (n = 50,
# 33.33%) and virginica (n = 50, 33.33%)

These reports nicely work within the tidyverse workflow:

iris %>%
  select(-starts_with("Sepal")) %>%
  group_by(Species) %>%
  report() %>%
  summary()
# The data contains 150 observations, grouped by Species, of the following 3
# variables:
# 
# - setosa (n = 50):
#   - Petal.Length: Mean = 1.46, SD = 0.17, range: [1, 1.90]
#   - Petal.Width: Mean = 0.25, SD = 0.11, range: [0.10, 0.60]
# 
# - versicolor (n = 50):
#   - Petal.Length: Mean = 4.26, SD = 0.47, range: [3, 5.10]
#   - Petal.Width: Mean = 1.33, SD = 0.20, range: [1, 1.80]
# 
# - virginica (n = 50):
#   - Petal.Length: Mean = 5.55, SD = 0.55, range: [4.50, 6.90]
#   - Petal.Width: Mean = 2.03, SD = 0.27, range: [1.40, 2.50]

t-tests and correlations

Reports can be used to automatically format tests like t-tests or correlations.

report(t.test(mtcars$mpg ~ mtcars$am))
# Effect sizes were labelled following Cohen's (1988) recommendations.
# 
# The Welch Two Sample t-test testing the difference of mtcars$mpg by mtcars$am
# (mean in group 0 = 17.15, mean in group 1 = 24.39) suggests that the effect is
# negative, statistically significant, and large (difference = -7.24, 95% CI
# [-11.28, -3.21], t(18.33) = -3.77, p = 0.001; Cohen's d = -1.41, 95% CI [-2.26,
# -0.53])

As mentioned, you can also create tables with the as.data.frame() function, as in this correlation test:

cor.test(iris$Sepal.Length, iris$Sepal.Width) %>%
  report() %>%
  as.data.frame()
# Pearson's product-moment correlation
# 
# Parameter1        |       Parameter2 |     r |        95% CI | t(148) |     p
# -----------------------------------------------------------------------------
# iris$Sepal.Length | iris$Sepal.Width | -0.12 | [-0.27, 0.04] |  -1.44 | 0.152
# 
# Alternative hypothesis: two.sided

ANOVAs

This works great with ANOVAs, as it includes effect sizes and their interpretation.

aov(Sepal.Length ~ Species, data = iris) %>%
  report()
# The ANOVA (formula: Sepal.Length ~ Species) suggests that:
# 
#   - The main effect of Species is statistically significant and large (F(2, 147)
# = 119.26, p < .001; Eta2 = 0.62, 95% CI [0.54, 1.00])
# 
# Effect sizes were labelled following Field's (2013) recommendations.

General Linear Models (GLMs)

Reports are also compatible with GLMs, such as this logistic regression:

model <- glm(vs ~ mpg * drat, data = mtcars, family = "binomial")

report(model)
# We fitted a logistic model (estimated using ML) to predict vs with mpg and drat
# (formula: vs ~ mpg * drat). The model's explanatory power is substantial
# (Tjur's R2 = 0.51). The model's intercept, corresponding to mpg = 0 and drat =
# 0, is at -33.43 (95% CI [-77.90, 3.25], p = 0.083). Within this model:
# 
#   - The effect of mpg is statistically non-significant and positive (beta = 1.79,
# 95% CI [-0.10, 4.05], p = 0.066; Std. beta = 3.63, 95% CI [1.36, 7.50])
#   - The effect of drat is statistically non-significant and positive (beta =
# 5.96, 95% CI [-3.75, 16.26], p = 0.205; Std. beta = -0.36, 95% CI [-1.96,
# 0.98])
#   - The interaction effect of drat on mpg is statistically non-significant and
# negative (beta = -0.33, 95% CI [-0.83, 0.15], p = 0.141; Std. beta = -1.07, 95%
# CI [-2.66, 0.48])
# 
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and p-values were
# computed using a Wald z-distribution approximation.

Mixed Models

Mixed models, whose popularity and usage are exploding, can also be reported:

library(lme4)

model <- lme4::lmer(Sepal.Length ~ Petal.Length + (1 | Species), data = iris)

report(model)
# We fitted a linear mixed model (estimated using REML and nloptwrap optimizer)
# to predict Sepal.Length with Petal.Length (formula: Sepal.Length ~
# Petal.Length). The model included Species as random effect (formula: ~1 |
# Species). The model's total explanatory power is substantial (conditional R2 =
# 0.97) and the part related to the fixed effects alone (marginal R2) is of 0.66.
# The model's intercept, corresponding to Petal.Length = 0, is at 2.50 (95% CI
# [1.19, 3.82], t(146) = 3.75, p < .001). Within this model:
# 
#   - The effect of Petal Length is statistically significant and positive (beta =
# 0.89, 95% CI [0.76, 1.01], t(146) = 13.93, p < .001; Std. beta = 1.89, 95% CI
# [1.63, 2.16])
# 
# Standardized parameters were obtained by fitting the model on a standardized
# version of the dataset. 95% Confidence Intervals (CIs) and p-values were
# computed using a Wald t-distribution approximation.

Bayesian Models

Bayesian models can also be reported using the new SEXIT framework, which combines clarity, precision and usefulness.

library(rstanarm)

model <- stan_glm(mpg ~ qsec + wt, data = mtcars)

report(model)
# We fitted a Bayesian linear model (estimated using MCMC sampling with 4 chains
# of 1000 iterations and a warmup of 500) to predict mpg with qsec and wt
# (formula: mpg ~ qsec + wt). Priors over parameters were set as normal (mean =
# 0.00, SD = 8.43) distributions. The model's explanatory power is substantial
# (R2 = 0.81, 95% CI [0.70, 0.90], adj. R2 = 0.79). The model's intercept,
# corresponding to qsec = 0 and wt = 0, is at 19.72 (95% CI [9.18, 29.63]).
# Within this model:
# 
#   - The effect of qsec (Median = 0.92, 95% CI [0.42, 1.46]) has a 99.90%
# probability of being positive (> 0), 99.00% of being significant (> 0.30), and
# 0.15% of being large (> 1.81). The estimation successfully converged (Rhat =
# 1.000) and the indices are reliable (ESS = 2411)
#   - The effect of wt (Median = -5.04, 95% CI [-6.00, -4.02]) has a 100.00%
# probability of being negative (< 0), 100.00% of being significant (< -0.30),
# and 100.00% of being large (< -1.81). The estimation successfully converged
# (Rhat = 1.000) and the indices are reliable (ESS = 2582)
# 
# Following the Sequential Effect eXistence and sIgnificance Testing (SEXIT)
# framework, we report the median of the posterior distribution and its 95% CI
# (Highest Density Interval), along the probability of direction (pd), the
# probability of significance and the probability of being large. The thresholds
# beyond which the effect is considered as significant (i.e., non-negligible) and
# large are |0.30| and |1.81| (corresponding respectively to 0.05 and 0.30 of the
# outcome's SD). Convergence and stability of the Bayesian sampling has been
# assessed using R-hat, which should be below 1.01 (Vehtari et al., 2019), and
# Effective Sample Size (ESS), which should be greater than 1000 (Burkner, 2017).

Other types of reports

Specific parts

For complex reports, one can directly access specific pieces of the report:

model <- lm(Sepal.Length ~ Species, data = iris)

report_model(model)
report_performance(model)
report_statistics(model)
# linear model (estimated using OLS) to predict Sepal.Length with Species (formula: Sepal.Length ~ Species)
# The model explains a statistically significant and substantial proportion of
# variance (R2 = 0.62, F(2, 147) = 119.26, p < .001, adj. R2 = 0.61)
# beta = 5.01, 95% CI [4.86, 5.15], t(147) = 68.76, p < .001; Std. beta = -1.01, 95% CI [-1.18, -0.84]
# beta = 0.93, 95% CI [0.73, 1.13], t(147) = 9.03, p < .001; Std. beta = 1.12, 95% CI [0.88, 1.37]
# beta = 1.58, 95% CI [1.38, 1.79], t(147) = 15.37, p < .001; Std. beta = 1.91, 95% CI [1.66, 2.16]

Report participants’ details

This can be useful to complete the Participants paragraph of your manuscript.

data <- data.frame(
  "Age" = c(22, 23, 54, 21),
  "Sex" = c("F", "F", "M", "M")
)

paste(
  report_participants(data, spell_n = TRUE),
  "were recruited in the study by means of torture and coercion."
)
# [1] "Four participants (Mean age = 30.0, SD = 16.0, range: [21, 54]; Sex: 50.0% females, 50.0% males, 0.0% other) were recruited in the study by means of torture and coercion."

Report sample

Report can also help you create a sample description table (also referred to as Table 1).
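
One way to produce such a table is report_sample(); this is a hedged sketch, as the exact grouping argument name may differ across package versions:

report_sample(iris, group_by = "Species")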

Variable               | setosa (n=50) | versicolor (n=50) | virginica (n=50) | Total (n=150)
Mean Sepal.Length (SD) | 5.01 (0.35)   | 5.94 (0.52)       | 6.59 (0.64)      | 5.84 (0.83)
Mean Sepal.Width (SD)  | 3.43 (0.38)   | 2.77 (0.31)       | 2.97 (0.32)      | 3.06 (0.44)
Mean Petal.Length (SD) | 1.46 (0.17)   | 4.26 (0.47)       | 5.55 (0.55)      | 3.76 (1.77)
Mean Petal.Width (SD)  | 0.25 (0.11)   | 1.33 (0.20)       | 2.03 (0.27)      | 1.20 (0.76)

Report system and packages

Finally, report includes some functions to help you write the data analysis paragraph about the tools used.

report(sessionInfo())
# Analyses were conducted using the R Statistical language (version 4.2.1; R Core
# Team, 2022) on macOS Monterey 12.6, using the packages lme4 (version 1.1.30;
# Bates D et al., 2015), Matrix (version 1.5.1; Bates D et al., 2022), Rcpp
# (version 1.0.9; Eddelbuettel D, François R, 2011), rstanarm (version 2.21.3;
# Goodrich B et al., 2022), report (version 0.5.5.2; Makowski D et al., 2021) and
# dplyr (version 1.0.10; Wickham H et al., 2022).
# 
# References
# ----------
#   - Bates D, Mächler M, Bolker B, Walker S (2015). "Fitting LinearMixed-Effects
# Models Using lme4." _Journal of Statistical Software_,*67*(1), 1-48.
# doi:10.18637/jss.v067.i01<https://doi.org/10.18637/jss.v067.i01>.
#   - Bates D, Maechler M, Jagan M (2022). _Matrix: Sparse and Dense MatrixClasses
# and Methods_. R package version
# 1.5-1,<https://CRAN.R-project.org/package=Matrix>.
#   - Eddelbuettel D, François R (2011). "Rcpp: Seamless R and C++Integration."
# _Journal of Statistical Software_, *40*(8), 1-18.doi:10.18637/jss.v040.i08
# <https://doi.org/10.18637/jss.v040.i08>.Eddelbuettel D (2013). _Seamless R and
# C++ Integration with Rcpp_.Springer, New York.
# doi:10.1007/978-1-4614-6868-4<https://doi.org/10.1007/978-1-4614-6868-4>, ISBN
# 978-1-4614-6867-7.Eddelbuettel D, Balamuta JJ (2018). "Extending extitR with
# extitC++: ABrief Introduction to extitRcpp." _The American Statistician_,
# *72*(1),28-36.
# doi:10.1080/00031305.2017.1375990<https://doi.org/10.1080/00031305.2017.1375990>.
#   - Goodrich B, Gabry J, Ali I, Brilleman S (2022). "rstanarm: Bayesianapplied
# regression modeling via Stan." R package version
# 2.21.3,<https://mc-stan.org/rstanarm/>.Brilleman S, Crowther M, Moreno-Betancur
# M, Buros Novik J, Wolfe R(2018). "Joint longitudinal and time-to-event models
# via Stan." StanCon2018. 10-12 Jan 2018. Pacific Grove, CA,
# USA.,<https://github.com/stan-dev/stancon_talks/>.
#   - Makowski D, Ben-Shachar M, Patil I, Lüdecke D (2021). "AutomatedResults
# Reporting as a Practical Tool to Improve Reproducibility andMethodological Best
# Practices Adoption." _CRAN_.<https://github.com/easystats/report>.
#   - R Core Team (2022). _R: A Language and Environment for StatisticalComputing_.
# R Foundation for Statistical Computing, Vienna,
# Austria.<https://www.R-project.org/>.
#   - Wickham H, François R, Henry L, Müller K (2022). _dplyr: A Grammar ofData
# Manipulation_. R package version
# 1.0.10,<https://CRAN.R-project.org/package=dplyr>.

Credits

If you like it, you can put a star on this repo, and cite the package as follows:

citation("report")

To cite in publications use:

  Makowski, D., Ben-Shachar, M.S., Patil, I. & Lüdecke, D. (2020).
  Automated Results Reporting as a Practical Tool to Improve
  Reproducibility and Methodological Best Practices Adoption. CRAN.
  Available from https://github.com/easystats/report. doi: .

A BibTeX entry for LaTeX users is

  @Article{,
    title = {Automated Results Reporting as a Practical Tool to Improve Reproducibility and Methodological Best Practices Adoption},
    author = {Dominique Makowski and Mattan S. Ben-Shachar and Indrajeet Patil and Daniel Lüdecke},
    year = {2021},
    journal = {CRAN},
    url = {https://github.com/easystats/report},
  }

Contribute

report is a young package in need of affection. You can easily be a part of the developing community of this open-source software and improve science! Don’t be shy, try to code and submit a pull request (See the contributing guide). Even if it’s not perfect, we will help you make it great!

Code of Conduct

Please note that the report project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Download Details:

Author: Easystats
Source Code: https://github.com/easystats/report 
License: GPL-3.0 license

#r #report #models #rstats 

Report: Automated Reporting Of Objects in R
Nat Grady

1667397730

Geobr: Download Official Spatial Data Sets Of Brazil

geobr: Download Official Spatial Data Sets of Brazil

geobr is a computational package to download official spatial data sets of Brazil. The package includes a wide range of geospatial data in geopackage format (like shapefiles but better), available at various geographic scales and for various years with harmonized attributes, projection and topology (see detailed list of available data sets below).

The package is currently available in R and Python.

Installation R

# From CRAN
install.packages("geobr")
library(geobr)

# or use the development version with latest features
utils::remove.packages('geobr')
devtools::install_github("ipeaGIT/geobr", subdir = "r-package")
library(geobr)

Note: if you use Linux, you need to install a couple of dependencies before installing the sf and geobr libraries. More info here.

Installation Python

pip install geobr

Windows users:

conda create -n geo_env
conda activate geo_env  
conda config --env --add channels conda-forge  
conda config --env --set channel_priority strict  
conda install python=3 geopandas  
pip install geobr

Basic Usage

All geobr functions follow the same syntax and logic, so it becomes intuitive to download any data set with a single line of code. Like this:

R, reading the data as an sf object

library(geobr)

# Read specific municipality at a given year
mun <- read_municipality(code_muni=1200179, year=2017)

# Read all municipalities of given state at a given year
mun <- read_municipality(code_muni=33, year=2010) # or
mun <- read_municipality(code_muni="RJ", year=2010)

# Read all municipalities in the country at a given year
mun <- read_municipality(code_muni="all", year=2018)

More examples in the intro Vignette

Python, reading the data as a geopandas object

from geobr import read_municipality

# Read specific municipality at a given year
mun = read_municipality(code_muni=1200179, year=2017)

# Read all municipalities of given state at a given year
mun = read_municipality(code_muni=33, year=2010) # or
mun = read_municipality(code_muni="RJ", year=2010)

# Read all municipalities in the country at a given year
mun = read_municipality(code_muni="all", year=2018)

More examples here

Available datasets:

:point_right: All datasets use geodetic reference system "SIRGAS2000", CRS(4674).

Function | Geographies available | Years available | Source
read_country | Country | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_region | Region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_state | States | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_meso_region | Meso region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_micro_region | Micro region | 2000, 2001, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_intermediate_region | Intermediate region | 2017, 2019, 2020 | IBGE
read_immediate_region | Immediate region | 2017, 2019, 2020 | IBGE
read_municipality | Municipality | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2001, 2005, 2007, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 | IBGE
read_municipal_seat | Municipality seats (sedes municipais) | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2010 | IBGE
read_weighting_area | Census weighting area (área de ponderação) | 2010 | IBGE
read_census_tract | Census tract (setor censitário) | 2000, 2010, 2017, 2019, 2020 | IBGE
read_statistical_grid | Statistical Grid of 200 x 200 meters | 2010 | IBGE
read_metro_area | Metropolitan areas | 1970, 2001, 2002, 2003, 2005, 2010, 2013, 2014, 2015, 2016, 2017, 2018 | IBGE
read_urban_area | Urban footprints | 2005, 2015 | IBGE
read_amazon | Brazil's Legal Amazon | 2012 | MMA
read_biomes | Biomes | 2004, 2019 | IBGE
read_conservation_units | Environmental Conservation Units | 201909 | MMA
read_disaster_risk_area | Disaster risk areas | 2010 | CEMADEN and IBGE
read_indigenous_land | Indigenous lands | 201907, 202103 | FUNAI
read_semiarid | Semi Arid region | 2005, 2017 | IBGE
read_health_facilities | Health facilities | 2015 | CNES, DataSUS
read_health_region | Health regions and macro regions | 1991, 1994, 1997, 2001, 2005, 2013 | DataSUS
read_neighborhood | Neighborhood limits | 2010 | IBGE
read_schools | Schools | 2020 | INEP
read_comparable_areas | Historically comparable municipalities, aka Areas minimas comparaveis (AMCs) | 1872, 1900, 1911, 1920, 1933, 1940, 1950, 1960, 1970, 1980, 1991, 2000, 2010 | IBGE
read_urban_concentrations | Urban concentration areas (concentrações urbanas) | 2015 | IBGE
read_pop_arrangements | Population arrangements (arranjos populacionais) | 2015 | IBGE

Other functions:

Function | Action
list_geobr | List all datasets available in the geobr package
lookup_muni | Look up municipality codes by their name, or the other way around
grid_state_correspondence_table | Loads a correspondence table indicating what quadrants of IBGE's statistical grid intersect with each state
cep_to_state | Determine the state of a given CEP postal code
... | ...
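
For example, lookup_muni() from the table above can be used to find a municipality's IBGE code from its name (a hedged sketch; argument names are taken from the package documentation):

library(geobr)

# look up the 7-digit IBGE code of a municipality by name (or the reverse)
lookup_muni(name_muni = "Rio de Janeiro")
# lookup_muni(code_muni = 3304557)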

Note 1. Data sets and Functions marked with "dev" are only available in the development version of geobr.

Note 2. Most data sets are available at scale 1:250,000 (see documentation for details).

Coming soon:

Geography | Years available | Source
read_census_tract | 2007 | IBGE
Longitudinal Database* of micro regions | various years | IBGE
Longitudinal Database* of Census tracts | various years | IBGE
... | ... | ...

'*' Longitudinal Database refers to áreas mínimas comparáveis (AMCs)

Contributing to geobr

If you would like to contribute to geobr and add new functions or data sets, please check this guide to propose your contribution.


Related projects

As of today, there is another R package with similar functionalities: simplefeaturesbr. The geobr package has a few advantages when compared to simplefeaturesbr, including for example:

  • The same syntax structure across all functions, making the package very easy and intuitive to use
  • Access to a wider range of official spatial data sets, such as states and municipalities, but also macro-, meso- and micro-regions, weighting areas, census tracts, urbanized areas, etc
  • Access to shapefiles with updated geometries for various years
  • Harmonized attributes and geographic projections across geographies and years
  • Option to download geometries with simplified borders for fast rendering
  • Stable version published on CRAN for R users, and on PyPI for Python users

Similar packages for other countries/continents


Credits

Original shapefiles are created by official government institutions. The geobr package is developed by a team at the Institute for Applied Economic Research (Ipea), Brazil. If you want to cite this package, you can cite it as:

  • Pereira, R.H.M.; Gonçalves, C.N.; et al. (2019) geobr: Loads Shapefiles of Official Spatial Data Sets of Brazil. GitHub repository - https://github.com/ipeaGIT/geobr.

Download Details:

Author: ipeaGIT
Source Code: https://github.com/ipeaGIT/geobr 

#r #python #rstats 

Geobr: Download Official Spatial Data Sets Of Brazil
Nat Grady

1666986420

Paletteer: Collection Of Most Color Palettes in A Single R Package

Paletteer 

The goal of paletteer is to be a comprehensive collection of color palettes in R using a common interface. Think of it as the “caret of palettes”.

Notice: This version is not backwards compatible with versions <= 0.2.1. Please refer to the end of the readme for breaking changes.

Installation

You can install the released version of paletteer from CRAN with:

install.packages("paletteer")

If you want the development version instead then install directly from GitHub:

# install.packages("devtools")
devtools::install_github("EmilHvitfeldt/paletteer")

Palettes

The palettes are divided into 2 groups: discrete and continuous. For discrete palettes you have the choice between fixed-width palettes and dynamic palettes. The more common of the two are the fixed-width palettes, which have a set number of colors that doesn't change when the number of colors requested varies, like the following palettes:

On the other hand, we have the dynamic palettes, where the colors of the palette depend on the number of colors you need, like the green.pal palette from the cartography package:

Lastly, we have the continuous palettes, which provide as many colors as you need for a smooth transition of color:

This package includes 2569 palettes from 68 different packages; information about these can be found in the following data frames: palettes_c_names, palettes_d_names and palettes_dynamic_names. Additionally, this GitHub repo showcases all the palettes included in the package and more.
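
For instance, a quick way to browse the discrete palettes (a minimal sketch using the bundled metadata):

head(paletteer::palettes_d_names)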

Examples

All the palettes can be accessed from the 3 functions paletteer_c(), paletteer_d() and paletteer_dynamic() using the syntax packagename::palettename.

paletteer_c("scico::berlin", n = 10)
#> <colors>
#> #9EB0FFFF #5AA3DAFF #2D7597FF #194155FF #11181DFF #270C01FF #501802FF #8A3F2AFF #C37469FF #FFACACFF
paletteer_d("nord::frost")
#> <colors>
#> #8FBCBBFF #88C0D0FF #81A1C1FF #5E81ACFF
paletteer_dynamic("cartography::green.pal", 5)
#> <colors>
#> #B8D9A9FF #8DBC80FF #5D9D52FF #287A22FF #17692CFF

All of the functions now also support tab completion to easily access the hundreds of choices:

paletteer-demo.gif

ggplot2 scales

Lastly, the package also includes scales for ggplot2 using the same standard interface:

library(ggplot2)

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  scale_color_paletteer_d("nord::aurora")
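
A continuous scale works the same way; for example (a minimal sketch mapping a continuous variable):

ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Petal.Length)) +
  geom_point() +
  scale_color_paletteer_c("scico::berlin")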

Breaking changes

In versions <= 0.2.1, a palette was selected by specifying a package and palette argument, like so:

paletteer_c(package = "nord", palette = "frost")

After version 0.2.1 palettes are selected using the syntax "packagename::palettename" inside the palette functions.

paletteer_c("nord::frost")

Special thanks

Included packages

paletteer includes palettes from the following packages:

NameGithubCRAN
awtoolsawhstin/awtools - 0.2.1-
basethemekaroliskoncevicius/basetheme - 0.1.20.1.2
beyoncedill/beyonce - 0.1-
calecopalan-bui/calecopal - 0.1.0-
cartographyriatelab/cartography - 3.0.13.0.1
colorBlindness-0.1.9
colorblindrclauswilke/colorblindr - 0.1.0-
colRozjacintak/colRoz - 0.2.2-
dichromat-2.0-0.1
DresdenColorkatiesaund/DresdenColor - 0.0.0.9000-
dutchmastersEdwinTh/dutchmasters - 0.1.0-
feathersshandiya/feathers - 0.0.0.9000-
fishualizenschiett/fishualize - 0.2.30.2.3
futurevisionsJoeyStanley/futurevisions - 0.1.1-
ggpomologicalgadenbuie/ggpomological - 0.1.2-
ggprismcsdaw/ggprism - 1.0.3.90001.0.3
ggscinanxstats/ggsci - 2.92.9
ggthemesjrnold/ggthemes - 4.2.24.2.2
ggthemrMikata-Project/ggthemr - 1.1.0-
ghibliewenme/ghibli - 0.3.3.90000.3.3
grDevices-4.2.1
harrypotteraljrico/harrypotter - 2.1.12.1.1
impressionist.colors-1.0
IslamicArtlambdamoses/IslamicArt - 0.1.0-
jcolorsjaredhuling/jcolors - 0.0.40.0.4
khromatesselle/khroma - 1.9.0.90001.9.0
LaCroixColoRjohannesbjork/LaCroixColoR - 0.1.0-
lisatyluRp/lisa - 0.1.2.90000.1.2
ManuG-Thomson/Manu - 0.0.2-
MapPalettesdisarm-platform/MapPalettes - 0.0.2-
MetBrewerBlakeRMills/MetBrewer - 0.2.00.2.0
miscpalettesEmilHvitfeldt/miscpalettes - 0.0.0.9000-
musculusColorsdawnbarlow/musculusColors - 0.1.0-
nationalparkcolorskatiejolly/nationalparkcolors - 0.1.0-
NatParksPaletteskevinsblake/NatParksPalettes - 0.2.00.2.0
nbapalettesmurrayjw/nbapalettes - 0.1.0.90000.1.0
NineteenEightyRm-clark/NineteenEightyR - 0.1.0-
nordjkaupp/nord - 1.0.01.0.0
ochRehollylkirk/ochRe - 1.0.0-
oompaBase-3.2.9
palettesForRfrareb/palettesForR - 0.1.20.1.2
palettetowntimcdlucas/palettetown - 0.1.1.900000.1.1
palrAustralianAntarcticDivision/palr - 0.3.00.3.0
palskwstat/pals - 1.71.7
peRReojbgb13/peRReo - 0.1.0-
PNWColorsjakelawlor/PNWColors - 0.1.0-
Polychrome-1.5.1
popthemesjohnmackintosh/popthemes - 0.0.0.9000-
rcartocolorNowosad/rcartocolor - 2.1.02.1.0
RColorBrewer-1.1-3
Redmonder-0.2.0
rockthemesjohnmackintosh/rockthemes - 0.0.0.9000-
RSkittleBreweralyssafrazee/RSkittleBrewer - 1.1-
rtisttomasokal/rtist - 1.0.01.0.0
scicothomasp85/scico - 1.3.1.90001.3.1
severanceivelasq/severance - 0.0.0.9000-
soilpaletteskaizadp/soilpalettes - 0.1.0-
suffrageralburezg/suffrager - 0.1.0-
tayloRswiftasteves/tayloRswift - 0.1.0-
tidyquantbusiness-science/tidyquant - 1.0.5.90001.0.5
trekcolorsleonawicz/trekcolors - 0.1.30.1.3
tvthemesRyo-N7/tvthemes - 1.3.11.3.1
uniknhneth/unikn - 0.6.0.90060.6.0
vapeplotseasmith/vapeplot - 0.1.0-
vapoRwavemoldach/vapoRwave - 0.0.0.9000-
viridissjmgarnier/viridis - 0.6.20.6.2
visiblym-clark/visibly - 0.2.9-
werpalssciencificity/werpals - 0.1.0-
wesandersonkarthik/wesanderson - 0.3.6.90000.3.6
yarrrndphillips/yarrr - 0.1.6NA

Download Details:

Author: EmilHvitfeldt
Source Code: https://github.com/EmilHvitfeldt/paletteer 
License: View license

#r #rstats 

Paletteer: Collection Of Most Color Palettes in A Single R Package
Nat Grady

1666974180

Rtweet: R Client for interacting with Twitter's [stream and REST] APIs

rtweet 

Use Twitter from R. Get started by reading vignette("rtweet").

Installation

To get the current released version from CRAN:

install.packages("rtweet")

You can install the development version of rtweet from GitHub with:

install.packages("rtweet", repos = 'https://ropensci.r-universe.dev')

Usage

All users must be authenticated to interact with Twitter’s APIs. The easiest way to authenticate is to use your personal twitter account - this will happen automatically (via a browser popup) the first time you use an rtweet function. See auth_setup_default() for details. Using your personal account is fine for casual use, but if you are trying to collect a lot of data it’s a good idea to authenticate with your own Twitter “app”. See vignette("auth", package = "rtweet") for details.

library(rtweet)

rtweet should be used in strict accordance with Twitter’s developer terms.

Search tweets or users

Search for up to 1000 tweets containing #rstats, the common hashtag used to refer to the R language, excluding retweets:

rt <- search_tweets("#rstats", n = 1000, include_rts = FALSE)

Twitter rate limits cap the number of search results returned to 18,000 every 15 minutes. To request more than that, set retryonratelimit = TRUE and rtweet will wait for rate limit resets for you.
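
For example (a sketch based on the note above; expect the full request to take a while):

rt <- search_tweets("#rstats", n = 100000, retryonratelimit = TRUE)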

Search for 200 users with the #rstats in their profile:

useRs <- search_users("#rstats", n = 200)

Stream tweets

Randomly sample (approximately 1%) from the live stream of all tweets:

random_stream <- stream_tweets("")

Stream all geo-located tweets from London for 60 seconds:

stream_london <- stream_tweets(location = lookup_coords("london"), timeout = 60)

Get friends and followers

Get all accounts followed by a user:

## get user IDs of accounts followed by R Foundation
R_Foundation_fds <- get_friends("_R_Foundation")

## lookup data on those accounts
R_Foundation_fds_data <- lookup_users(R_Foundation_fds$to_id)

Get all accounts following a user:

## get user IDs of accounts following R Foundation
R_Foundation_flw <- get_followers("_R_Foundation", n = 100)
R_Foundation_flw_data <- lookup_users(R_Foundation_flw$from_id)

If you want all followers, you'll need to set n = Inf and retryonratelimit = TRUE but be warned that this might take a long time.
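
For example (a sketch based on the note above):

R_Foundation_all_flw <- get_followers("_R_Foundation", n = Inf, retryonratelimit = TRUE)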

Get timelines

Get the 100 most recent tweets from the R Foundation:

## get the most recent tweets posted by the R Foundation account
tmls <- get_timeline("_R_Foundation", n = 100)

Get favorites

Get the 10 most recently favorited statuses by R Foundation:

favs <- get_favorites("_R_Foundation", n = 10)

Contact

Communicating with Twitter’s APIs relies on an internet connection, which can sometimes be inconsistent.

If you have questions, need an example, or want to share a use case, you can post them on rOpenSci's discuss forum, where you can also browse uses of rtweet.

With that said, if you encounter an obvious bug for which there is not already an active issue, please create a new issue with all code used (preferably a reproducible example) on Github.

Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/rtweet 
License: View license

#r #twitter #rest 

Rtweet: R Client for interacting with Twitter's [stream and REST] APIs
Nat Grady

1666953900

Datascience-box: Data Science Course in A Box

Data Science Course in a Box

Data Science in a Box contains the materials required to teach (or learn from) an introductory data science course using R, all of which are freely-available and open-source. They include course materials such as slide decks, homework assignments, guided labs, sample exams, a final project assignment, as well as materials for instructors such as pedagogical tips, information on computing infrastructure, technology stack, and course logistics.

See datasciencebox.org for everything you need to know about the project!

Note that all materials are released with Creative Commons Attribution Share Alike 4.0 International license.

Questions, bugs, feature requests

You can file an issue to get help, report a bug, or make a feature request.

Before opening a new issue, be sure to search issues and pull requests to make sure the bug hasn't been reported and/or already fixed in the development version. By default, the search will be pre-populated with is:issue is:open. You can edit the qualifiers (e.g. is:pr, is:closed) as needed. For example, you'd simply remove is:open to search all issues in the repo, open or closed.

If your issue involves R code, please make a minimal reproducible example using the reprex package. If you haven't heard of or used reprex before, you're in for a treat! Seriously, reprex will make all of your R-question-asking endeavors easier (which is a pretty insane ROI for the five to ten minutes it'll take you to learn what it's all about). For additional reprex pointers, check out the Get help! section of the tidyverse site.
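
A minimal sketch of creating a reprex (the wrapped code is arbitrary):

# install.packages("reprex")
reprex::reprex({
  x <- c(1, 2, 3, NA)
  mean(x, na.rm = TRUE)
})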

Code of Conduct

Please note that the datascience-box project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Download Details:

Author: rstudio-education
Source Code: https://github.com/rstudio-education/datascience-box 
License: View license

#r #education #datascience #rstats 

Datascience-box: Data Science Course in A Box
Nat Grady

1666949640

GGthemr: Themes for ggplot2

ggthemr

Themes for ggplot2. The idea of this package is that you can just set the theme and then forget about it. You shouldn't have to change any of your existing code. There are several parts to a theme:

  • Colour palette for the background, axes, gridlines, text etc.
  • Layout of axes lines and gridlines.
  • Spacing around plot and between elements (i.e. axes titles to axes lines etc). You can set the spacing to determine how compact or spread out a plot is.
  • Text size.

There are a number of preset palettes and layouts, and methods to create your own colour schemes.

Installation

This package is still under development, but can be installed using devtools.

devtools::install_github('Mikata-Project/ggthemr')

We plan to submit to CRAN soon, but that is currently blocked by an upstream issue.

Usage

To just set the colour scheme:

ggthemr('dust')

That's it. Any ggplot you create from then on will have the theme applied. You can clear the theme and return to ggplot2's default using:

ggthemr_reset()

Known issues

Palettes

The palette determines the colours of everything in a plot including the background, layers, gridlines, title text, axes lines, axes text and axes titles. The swatch is the name given to the set of colours strictly used in styling the geoms/layer elements (e.g. the points in geom_point(), bars in geom_bar() etc.). At least six colours have been supplied in each palette's swatch.

There are a wide variety of themes in this package (and more on the way). Some of them are serious business... others are deliberately stylish and might not be that good for use in proper publications.

flat

Base 16

flat dark

Base 16

camouflage

chalk

copper

dust

earth

fresh

grape

grass

greyscale

light

lilac

pale

sea

sky

solarized

Custom Palettes

define_palette() lets you make your own themes that can be passed to ggthemr() just like any of the palettes above. Here's an example of a (probably ugly) palette using random colours:

# Random colours that aren't white.
set.seed(12345)
random_colours <- sample(colors()[-c(1, 253, 361)], 10L)

ugly <- define_palette(
  swatch = random_colours,
  gradient = c(lower = random_colours[1L], upper = random_colours[2L])
)

ggthemr(ugly)

example_plot + ggtitle(':(')

You can define all elements of a palette using define_palette() including colours for the background, text, axes lines, swatch and gradients.

Layouts

The layout of a theme controls the appearance and position of the axes, gridlines and text. Some folk prefer both major and minor gridlines, others prefer none or something in between.

Clean

Clear (default)

Minimal

Plain

Scientific

Spacing

Plot margins and the space between axes titles and lines etc. are controlled with the spacing parameter. Lower values will make plots more compact, higher values will give them more padding. Compare the plots below, where the spacing has been set to 0, 1 and 2 respectively.
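
For example (a sketch; example_plot stands for any existing ggplot, as elsewhere in this readme):

ggthemr('dust', spacing = 2)
example_plot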

Type

The type parameter can be set to either inner or outer. When inner, the background colour of a plot will not extend past the plot area. outer will colour the entire plot and background.

ggthemr('earth', type = 'inner')
example_plot

ggthemr('earth', type = 'outer')
example_plot

Tweaking Themes

Squinting at a chart? Low on printer ink? ggthemr includes some methods to tweak charts to make them lighter or darker. Here's a standard theme:

ggthemr('dust')
example_plot

Maybe that plot comes out a bit pale looking when you print it. Here's how you can add a bit more contrast to the swatch:

darken_swatch(amount = 0.3)
example_plot

The second parameter to darken_swatch() controls the degree to which the colours are made darker. Full list of methods with similar functionality:

  • darken_swatch() / lighten_swatch(): darker/lighter swatch colours.
  • darken_gradient() / lighten_gradient(): darker/lighter gradient colours.
  • darken_palette() / lighten_palette(): darker/lighter everything.

I'll add methods to darken/lighten the axes lines and text soon too.

Plot Adjustments

Most of the time you'll probably just want to set the theme and not worry about it. There may be times though where you'll want to make some small adjustment, or manually change what items appear as what colour in a plot.

ggthemr('dust')
mpg_plot <- ggplot(mpg[mpg$drv != '4', ], aes(factor(cyl), cty, fill = drv)) + 
  geom_boxplot() + labs(x = 'Cylinders', y = 'City MPG', fill = 'Drive Type') +
  theme(legend.position = 'bottom')
mpg_plot

For some reason you decide you want to change those colours. Front-wheel drive vehicles should be orange. Rear-wheelers should be that red colour. You could change the order of the levels of your fill variable, but you shouldn't have to do that. You just want to switch those colours but you have no idea what they are. swatch() will give you the colours in the currently active ggthemr palette.

swatch()
## [1] "#555555" "#db735c" "#EFA86E" "#9A8A76" "#F3C57B" "#7A6752" "#2A91A2"
## [8] "#87F28A" "#6EDCEF"
## attr(,"class")
## [1] "ggthemr_swatch"

So you can manually swap the two colours around.

to_swap <- swatch()[2:3]
mpg_plot + scale_fill_manual(values = rev(to_swap))

Note: the first colour in a swatch is a special one. It is reserved for outlining boxplots, text etc. So that's why the second and third colours were swapped.

A note about theme setting

ggthemr does three different things while setting a theme.

  1. It updates the default ggplot2 theme with the specified ggthemr theme by using the ggplot2::theme_set() function.
  2. It modifies the aesthetic defaults for all geoms using the ggplot2::update_geom_defaults() function.
  3. It creates functions for all the different scales in the global environment.

If you do not want to set the theme this way, use the set_theme = FALSE option when calling the ggthemr function. An example of setting the theme, geom aesthetic defaults and scales manually:

ggthemr_reset()
dust_theme <- ggthemr('dust', set_theme = FALSE)
example_plot

example_plot + dust_theme$theme

example_plot + dust_theme$theme + scale_fill_manual(values = dust_theme$palette$swatch)

do.call(what = ggplot2::update_geom_defaults, args = dust_theme$geom_defaults$new$bar)
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 850) + dust_theme$theme

Mikata Project took over ggthemr and will be the primary maintainer of this wonderful package. We would like to thank @cttobin for creating this package. We also appreciate that he agreed to pass the repo ownership to Mikata Project. The Mikata Team plans to resolve backlog issues and make ggthemr available on CRAN as the first step.

Download Details:

Author: Mikata-Project
Source Code: https://github.com/Mikata-Project/ggthemr 
License: GPL-3

#r #visualization #datavisualization #rstats 

GGthemr: Themes for ggplot2
Nat Grady

1666945380

Easystats: The R Easystats-project

Easystats: An R Framework for Easy Statistical Modeling, Visualization, and Reporting

What is easystats?

easystats is a collection of R packages, which aims to provide a unifying and consistent framework to tame, discipline, and harness the scary R statistics and their pesky models.

However, there is not (yet) a unique “easystats” way of doing data analysis. Instead, start with one package and, when you face a new challenge, check whether there is an easystats answer for it in another package. You will slowly uncover how using them together facilitates your life. And, who knows, you might even end up using them all.

Installation

Type        | Source     | Command
Release     | CRAN       | install.packages("easystats")
Development | r-universe | install.packages("easystats", repos = "https://easystats.r-universe.dev")
Development | GitHub     | remotes::install_github("easystats/easystats")

Finally, easystats sometimes depends on additional packages for specific functions that are not downloaded by default. If you want to benefit from the full easystats experience without any hiccups, simply run the following:

easystats::install_suggested()

Citation

To cite the package, run the following command:

citation("easystats")

To cite easystats in publications use:

  Lüdecke, Patil, Ben-Shachar, Wiernik, & Makowski (2022). easystats:
  Framework for Easy Statistical Modeling, Visualization, and
  Reporting. CRAN. Available from
  https://easystats.github.io/easystats/

A BibTeX entry for LaTeX users is

  @Article{,
    title = {easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting},
    author = {Daniel Lüdecke and Mattan S. Ben-Shachar and Indrajeet Patil and Brenton M. Wiernik and Dominique Makowski},
    journal = {CRAN},
    year = {2022},
    note = {R package},
    url = {https://easystats.github.io/easystats/},
  }

If you want to do this only for certain packages in the ecosystem, have a look at this article on how you can do so! https://easystats.github.io/easystats/articles/citation.html

Getting started

Each easystats package has a different scope and purpose. This means your best way to start is to explore and pick the one(s) that you feel might be useful to you. However, as they are built with a “bigger picture” in mind, you will realize that using more of them creates a smooth workflow, as these packages are meant to work together. Ideally, these packages work in unison to cover all aspects of statistical analysis and data visualization.

  • report: 📜 🎉 Automated statistical reporting of objects in R
  • correlation: 🔗 Your all-in-one package to run correlations
  • modelbased: 📈 Estimate effects, group averages and contrasts between groups based on statistical models
  • bayestestR: 👻 Great for beginners or experts of Bayesian statistics
  • effectsize: 🐉 Compute, convert, interpret and work with indices of effect size and standardized parameters
  • see: 🎨 The plotting companion to create beautiful results visualizations
  • parameters: 📊 Obtain a table containing all information about the parameters of your models
  • performance: 💪 Models’ quality and performance metrics (R2, ICC, LOO, AIC, BF, …)
  • insight: 🔮 For developers, a package to help you work with different models and packages
  • datawizard: 🧙 Magic potions to clean and transform your data

Frequently Asked Questions

How is easystats different from the tidyverse?

You’ve probably already heard about the tidyverse, another very popular collection of packages (ggplot, dplyr, tidyr, …) that also makes using R easier. So, should you pick the tidyverse or easystats? Pick both!

Indeed, these two ecosystems have been designed with very different goals in mind. The tidyverse packages are primarily made to create a new R experience, where data manipulation and exploration is intuitive and consistent. On the other hand, easystats focuses more on the final stretch of the analysis: understanding and interpreting your results and reporting them in a manuscript or a report, while following best practices. You can definitely use the easystats functions in a tidyverse workflow!

easystats + tidyverse = ❤️

Can easystats be useful to advanced users and/or developers?

Yes, definitely! easystats is built in terms of modules that are general enough to be used inside other packages. For instance, the insight package is made to easily implement support for post-processing of pretty much all regression model packages under the sun. We use it in all the easystats packages, but it is also used in other non-easystats packages, such as ggstatsplot, modelsummary, ggeffects, and more.

So why not in yours?
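
A small sketch of that kind of model-agnostic post-processing with insight (function names taken from the insight package):

library(insight)

model <- lm(Sepal.Length ~ Species, data = iris)

find_response(model)  # "Sepal.Length"
get_parameters(model) # data frame of parameter names and estimates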

Moreover, the easystats packages are very lightweight, with a minimal set of dependencies, which again makes it great if you want to rely on them.

Documentation

Websites

Each easystats package has a dedicated website.

For example, website for parameters is https://easystats.github.io/parameters/.

Blog

In addition to the websites containing documentation for these packages, you can also read posts from easystats blog: https://easystats.github.io/blog/posts/.

Other learning resources

In addition to these websites and blog posts, you can also check out the following presentations and talks to learn more about this ecosystem:

https://easystats.github.io/easystats/articles/resources.html

Dependencies

easystats packages are designed to be lightweight, i.e., they don’t have any third-party hard dependencies, other than base-R packages or other easystats packages! If you develop R packages, this means that you can safely use easystats packages as dependencies in your own packages, without the risk of entering the dependency hell.

library(deepdep)

plot_dependencies("easystats", depth = 2, show_stamp = FALSE)

As we can see, the only exception is the {see} package, which is responsible for plotting and creating figures and relies on {ggplot2}, which does have a substantial number of dependencies.

Usage

Total downloads

Package     | Total downloads
insight     | 3,301,526
bayestestR  | 1,449,269
parameters  | 1,440,805
performance | 1,361,451
datawizard  | 1,340,181
effectsize  | 1,119,639
correlation | 296,725
see         | 267,469
modelbased  | 99,435
report      | 55,558
easystats   | 4,599
Total       | 10,736,657

Trend

 

Contributing

We are happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. Pull Requests for contributions are encouraged.

Here are some simple ways in which you can contribute (in the increasing order of commitment):

  • Read and correct any inconsistencies in the documentation
  • Raise issues about bugs or wanted features
  • Review code
  • Add new functionality

Code of Conduct

Please note that the ‘easystats’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Download Details:

Author: easystats
Source Code: https://github.com/easystats/easystats 
License: GPL-3.0 license

#r #statistics #models #datascience #rstats 

Easystats: The R Easystats-project