Nat Grady

1666904520

A Frictionless, Pipeable Approach to Dealing with Summary Statistics

skimr

skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.

Note: skimr version 2 has major changes when skimr is used programmatically. Upgraders should review this document, the release notes and vignettes carefully.

Installation

The current released version of skimr can be installed from CRAN with install.packages("skimr"). To install the current build of the next release instead, use the following:

# install.packages("devtools")
devtools::install_github("ropensci/skimr")

The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.

To install the version with the most recent changes, which have not yet been incorporated into the main branch (and may never be):

devtools::install_github("ropensci/skimr", ref = "develop")

Do not rely on APIs from the develop branch, as they are likely to change.

Skim statistics in the console

skimr:

  • provides a larger set of statistics than summary(), including missing, complete, n, and sd
  • reports each data type separately
  • handles dates, logicals, and a variety of other types
  • supports spark-bar and spark-line output based on the pillar package

Separates variables by class:

skim(chickwts)

## ── Data Summary ────────────────────────
##                            Values  
## Name                       chickwts
## Number of rows             71      
## Number of columns          2       
## _______________________            
## Column type frequency:             
##   factor                   1       
##   numeric                  1       
## ________________________           
## Group variables            None    
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0  p25 p50  p75 p100 hist 
## 1 weight                0             1 261. 78.1 108 204. 258 324.  423 ▆▆▇▇▃

Presentation is in a compact horizontal format:

skim(iris)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts               
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

Built-in support for strings, lists, and other column classes

skim(dplyr::starwars)

## ── Data Summary ────────────────────────
##                            Values         
## Name                       dplyr::starwars
## Number of rows             87             
## Number of columns          14             
## _______________________                   
## Column type frequency:                    
##   character                8              
##   list                     3              
##   numeric                  3              
## ________________________                  
## Group variables            None           
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 name                  0         1       3  21     0       87          0
## 2 hair_color            5         0.943   4  13     0       12          0
## 3 skin_color            0         1       3  19     0       31          0
## 4 eye_color             0         1       3  13     0       15          0
## 5 sex                   4         0.954   4  14     0        4          0
## 6 gender                4         0.954   8   9     0        2          0
## 7 homeworld            10         0.885   4  14     0       48          0
## 8 species               4         0.954   3  14     0       37          0
## 
## ── Variable type: list ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       17          0          5
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist 
## 1 height                6         0.931 174.   34.8 66 167   180 191    264 ▁▁▇▅▁
## 2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
## 3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

Has a useful summary function

skim(iris) %>%
  summary()

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None

Individual columns can be selected using tidyverse-style selectors

skim(iris, Sepal.Length, Petal.Length)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
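
Selection helpers can stand in for bare column names. The sketch below assumes that skim() passes its column arguments through tidyselect, as the "tidyverse-style selectors" wording suggests, and uses dplyr::starts_with() to pick both sepal measurements. The variable name sepal_summary is only illustrative.

```r
library(skimr)

# Select every column whose name begins with "Sepal".
sepal_summary <- skim(iris, dplyr::starts_with("Sepal"))

# The result has one row per skimmed column.
sepal_summary$skim_variable
```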

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by().

iris %>%
  dplyr::group_by(Species) %>%
  skim()

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  4         
## ________________________             
## Group variables            Species   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50  p75 p100 hist 
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5    5.2   5.8 ▃▃▇▅▁
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9  6.3   7   ▂▇▆▃▃
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5  6.9   7.9 ▁▃▇▃▂
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4  3.68  4.4 ▁▃▇▅▂
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8  3     3.4 ▁▅▆▇▂
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3    3.18  3.8 ▂▆▇▅▁
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5  1.58  1.9 ▁▃▇▃▁
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35 4.6   5.1 ▂▂▇▇▆
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55 5.88  6.9 ▃▇▇▃▂
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2  0.3   0.6 ▇▂▂▁▁
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3  1.5   1.8 ▅▇▃▆▁
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2    2.3   2.5 ▂▇▆▅▇

Behaves nicely in pipelines

iris %>%
  skim() %>%
  dplyr::filter(numeric.sd > 1)

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            None      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist 
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂
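
The filter above works because the skimmed result is itself a data frame whose statistic columns are prefixed with the column type. As a rough sketch of working with that structure, the example below also uses skimr's partition() and yank() helpers to split the result by type; the variable names are illustrative.

```r
library(skimr)

skimmed <- skim(iris)

# Statistic columns carry the type as a prefix, for example numeric.sd.
names(skimmed)

# partition() splits the result into one data frame per column type.
by_type <- partition(skimmed)
names(by_type)

# yank() extracts a single type; its columns drop the type prefix.
numeric_stats <- yank(skimmed, "numeric")
high_spread <- numeric_stats$skim_variable[numeric_stats$sd > 1]
high_spread
```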

Knitted results

Simply skimming a data frame will produce the horizontal print layout shown above. We provide a knit_print method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the skimmed object is the last item in your code chunk.

faithful %>%
  skim()

Data summary

Name                       Piped data
Number of rows             272
Number of columns          2
_______________________
Column type frequency:
  numeric                  2
________________________
Group variables            None

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd    p0    p25  p50    p75  p100  hist
eruptions              0              1   3.49   1.14   1.6   2.16    4   4.45   5.1  ▇▂▂▇▇
waiting                0              1  70.90  13.59  43.0  58.00   76  82.00  96.0  ▃▃▂▇▂

Customizing skimr

Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.

Specify your own statistics and classes

Users can specify their own statistics using a list combined with the skim_with() function factory. skim_with() returns a new skim function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.

Assignment within a call to skim_with() relies on a helper function, sfl (short for skimr function list). By default, functions in the sfl call are appended to the default skimmers, and their names are generated automatically.

my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)
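
To see where an appended statistic ends up, the sketch below repeats the call above and inspects the result. It assumes the generated column name follows the type-prefix pattern used by the other numeric statistics (numeric.sd and so on), giving numeric.mad here.

```r
library(skimr)

my_skim <- skim_with(numeric = sfl(mad))
result <- my_skim(iris, Sepal.Length)

# The appended statistic is named after the function and sits next to the defaults.
result$numeric.mad
result$numeric.mean
```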

You can also use helpers from the tidyverse to create anonymous functions that set particular function arguments. The behavior is the same as in purrr or dplyr, with both . and .x as acceptable pronouns. Setting append = FALSE uses only the functions that you’ve provided.

my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,
    p01 = ~ quantile(.x, probs = .01),
    p99 = ~ quantile(., probs = .99)
  ),
  append = FALSE
)
my_skim(iris, Sepal.Length)

And you can remove default skimmers by setting them to NULL.

my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
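
A quick way to confirm that a default was removed is to look at the column names of the result. This is a minimal sketch; no_hist is an illustrative name, and the numeric.hist and numeric.mean column names assume the same type-prefix pattern seen in numeric.sd.

```r
library(skimr)

no_hist <- skim_with(numeric = sfl(hist = NULL))
result <- no_hist(iris, Sepal.Length)

# Check which statistic columns are present after dropping the histogram.
has_hist <- "numeric.hist" %in% names(result)
has_mean <- "numeric.mean" %in% names(result)
c(hist = has_hist, mean = has_mean)
```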

Skimming other objects

skimr has summary functions for the following types of data by default:

  • numeric (which includes both double and integer)
  • character
  • factor
  • logical
  • complex
  • Date
  • POSIXct
  • ts
  • AsIs

skimr also provides a small API for packages that supply their own default summary functions for data types not covered above. It relies on R S3 methods for the get_skimmers function. A method should return an sfl, as in customization within skim_with(), and should also set the skim_type argument to the name of the class. Statistics written as expressions need a leading ~, just as in skim_with(). Here’s an example.

get_skimmers.my_data_type <- function(column) {
  sfl(
    skim_type = "my_data_type",
    p99 = ~ quantile(., probs = .99)
  )
}

Limitations of current version

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.

Support for spark histograms

There are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames. Some cases have been addressed, but the issue is still reported in Emacs ESS, for example. Although this is a deep issue, there is ongoing work to address it in base R.

This means that while skimr can render the histograms in the console and in RMarkdown documents, it cannot do so in some other circumstances. These include:

  • converting a skimr data frame to a vanilla R data frame (tibbles render correctly)
  • rendering to a PDF using an engine that does not support UTF-8

One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). The helper function fix_windows_histograms() does this for you.

And last but not least, we provide skim_without_charts() as a fallback. This makes it easy to still get summaries of your data, even if unicode issues continue.
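
As a minimal sketch of the fallback, the call below skims iris without the inline charts. The check assumes the histogram would otherwise appear as a numeric.hist column, following the numeric.sd naming used earlier.

```r
library(skimr)

# Same summary as skim(), minus the spark histograms.
plain <- skim_without_charts(iris)

# Check whether a histogram column is present.
has_hist <- "numeric.hist" %in% names(plain)
has_hist
```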

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by switching to a font with good support for block characters (for histograms) and Braille characters (for line graphs). For example, the open font “DejaVu Sans” from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.

Displays in documents of different types will vary. For example, one user found that the font “Yu Gothic UI Semilight” produced consistent results for Microsoft Word and LibreOffice Writer.

Inspirations

TextPlots for use of Braille characters.

spark for use of block characters.

The earliest use of unicode characters to generate sparklines appears to be from 2009.

Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.

Contributing

We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr’s flexibility to add their own customized classes. Please see the contributing and conduct documents.

Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/skimr 

Nat  Grady

Nat Grady

1666904520

A Frictionless, Pipeable Approach to Dealing with Summary Statistics

skimr

skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.

Note: skimr version 2 has major changes when skimr is used programmatically. Upgraders should review this document, the release notes and vignettes carefully.

Installation

The current released version of skimr can be installed from CRAN. If you wish to install the current build of the next release you can do so using the following:

# install.packages("devtools")
devtools::install_github("ropensci/skimr")

The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.

To install the version with the most recent changes that have not yet been incorporated in the main branch (and may not be):

devtools::install_github("ropensci/skimr", ref = "develop")

Do not rely on APIs from the develop branch, as they are likely to change.

Skim statistics in the console

skimr:

  • Provides a larger set of statistics than summary(), including missing, complete, n, and sd.
  • reports each data types separately
  • handles dates, logicals, and a variety of other types
  • supports spark-bar and spark-line based on the pillar package.

Separates variables by class:

skim(chickwts)

## ── Data Summary ────────────────────────
##                            Values  
## Name                       chickwts
## Number of rows             71      
## Number of columns          2       
## _______________________            
## Column type frequency:             
##   factor                   1       
##   numeric                  1       
## ________________________           
## Group variables            None    
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts                        
## 1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0  p25 p50  p75 p100 hist 
## 1 weight                0             1 261. 78.1 108 204. 258 324.  423 ▆▆▇▇▃

Presentation is in a compact horizontal format:

skim(iris)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None  
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts               
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

Built in support for strings, lists and other column classes

skim(dplyr::starwars)

## ── Data Summary ────────────────────────
##                            Values         
## Name                       dplyr::starwars
## Number of rows             87             
## Number of columns          14             
## _______________________                   
## Column type frequency:                    
##   character                8              
##   list                     3              
##   numeric                  3              
## ________________________                  
## Group variables            None           
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 name                  0         1       3  21     0       87          0
## 2 hair_color            5         0.943   4  13     0       12          0
## 3 skin_color            0         1       3  19     0       31          0
## 4 eye_color             0         1       3  13     0       15          0
## 5 sex                   4         0.954   4  14     0        4          0
## 6 gender                4         0.954   8   9     0        2          0
## 7 homeworld            10         0.885   4  14     0       48          0
## 8 species               4         0.954   3  14     0       37          0
## 
## ── Variable type: list ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       17          0          5
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist 
## 1 height                6         0.931 174.   34.8 66 167   180 191    264 ▁▁▇▅▁
## 2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
## 3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

Has a useful summary function

skim(iris) %>%
  summary()

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   factor                   1     
##   numeric                  4     
## ________________________         
## Group variables            None

Individual columns can be selected using tidyverse-style selectors

skim(iris, Sepal.Length, Petal.Length)

## ── Data Summary ────────────────────────
##                            Values
## Name                       iris  
## Number of rows             150   
## Number of columns          5     
## _______________________          
## Column type frequency:           
##   numeric                  2     
## ________________________         
## Group variables            None  
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist 
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by().

iris %>%
  dplyr::group_by(Species) %>%
  skim()

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  4         
## ________________________             
## Group variables            Species   
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50  p75 p100 hist 
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5    5.2   5.8 ▃▃▇▅▁
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9  6.3   7   ▂▇▆▃▃
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5  6.9   7.9 ▁▃▇▃▂
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4  3.68  4.4 ▁▃▇▅▂
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8  3     3.4 ▁▅▆▇▂
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3    3.18  3.8 ▂▆▇▅▁
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5  1.58  1.9 ▁▃▇▃▁
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35 4.6   5.1 ▂▂▇▇▆
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55 5.88  6.9 ▃▇▇▃▂
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2  0.3   0.6 ▇▂▂▁▁
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3  1.5   1.8 ▅▇▃▆▁
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2    2.3   2.5 ▂▇▆▅▇

Behaves nicely in pipelines

iris %>%
  skim() %>%
  dplyr::filter(numeric.sd > 1)

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             150       
## Number of columns          5         
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            None      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist 
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂

Knitted results

Simply skimming a data frame will produce the horizontal print layout shown above. We provide a knit_print method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the skimmed object is the last item in your code chunk.

faithful %>%
  skim()
NamePiped data
Number of rows272
Number of columns2
_______________________ 
Column type frequency: 
numeric2
________________________ 
Group variablesNone

Data summary

Variable type: numeric

skim_variablen_missingcomplete_ratemeansdp0p25p50p75p100hist
eruptions013.491.141.62.1644.455.1▇▂▂▇▇
waiting0170.9013.5943.058.007682.0096.0▃▃▂▇▂

Customizing skimr

Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.

Specify your own statistics and classes

Users can specify their own statistics using a list combined with the skim_with() function factory. skim_with() returns a new skim function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.

Assignment within a call to skim_with() relies on a helper function, sfl or skimr function list. By default, functions in the sfl call are appended to the default skimmers, and names are automatically generated as well.

my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)

But you can also helpers from the tidyverse to create new anonymous functions that set particular function arguments. The behavior is the same as in purrr or dplyr, with both . and .x as acceptable pronouns. Setting the append = FALSE argument uses only those functions that you’ve provided.

my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,
    p01 = ~ quantile(.x, probs = .01)
    p99 = ~ quantile(., probs = .99)
  ),
  append = FALSE
)
my_skim(iris, Sepal.Length)

And you can remove default skimmers by setting them to NULL.

my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)

Skimming other objects

skimr has summary functions for the following types of data by default:

  • numeric (which includes both double and integer)
  • character
  • factor
  • logical
  • complex
  • Date
  • POSIXct
  • ts
  • AsIs

skimr also provides a small API for writing packages that provide their own default summary functions for data types not covered above. It relies on R S3 methods for the get_skimmers function. This function should return a sfl, similar to customization within skim_with(), but you should also provide a value for the class argument. Here’s an example.

get_skimmers.my_data_type <- function(column) {
  sfl(
    .class = "my_data_type",
    p99 = quantile(., probs = .99)
  )
}

Limitations of current version

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.

Support for spark histograms

There are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing dataframes. While some cases have been addressed, there are, for example, reports of this issue in Emacs ESS. While this is a deep issue, there is ongoing work to address it in base R.

This means that while skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances. This includes:

  • converting a skimr data frame to a vanilla R data frame, but tibbles render correctly
  • in the context of rendering to a pdf using an engine that does not support utf-8.

One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). The helper function fix_windows_histograms() does this for you.

And last but not least, we provide skim_without_charts() as a fallback. This makes it easy to still get summaries of your data, even if unicode issues continue.

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by switching to a font with good support for block elements (for histograms) and Braille characters (for line graphs). For example, the open font “DejaVu Sans” from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.

Displays in documents of different types will vary. For example, one user found that the font “Yu Gothic UI Semilight” produced consistent results for Microsoft Word and LibreOffice Writer.

Inspirations

TextPlots for use of Braille characters

spark for use of block characters.

The earliest use of unicode characters to generate sparklines appears to be from 2009.

Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.

Contributing

We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr’s flexibility to add their own customized classes. Please see the contributing and conduct documents.

Download Details:

Author: ropensci
Source Code: https://github.com/ropensci/skimr 

#r #rstats 

Factors That Can Contribute to Faulty Statistical Inference

Hypothesis testing is a procedure in which researchers make a precise statement based on their findings or data, and then collect evidence to falsify that statement. This precise statement or claim is called the null hypothesis. If the evidence against the null hypothesis is strong enough, we can reject it and adopt the alternative hypothesis. This is the basic idea of hypothesis testing.

Error Types in Statistical Testing

There are two distinct types of errors that can occur in formal hypothesis testing. They are:

Type I: A Type I error occurs when the null hypothesis is true but the test finds enough evidence to reject it. This is called a false positive.

Type II: A Type II error occurs when the null hypothesis is not true but the test fails to reject it. This is called a false negative.

Most hypothesis testing procedures do a good job of controlling the Type I error rate (at 5%) under ideal conditions. That may give the false impression that there is only a 5% probability that a reported finding is wrong. But it’s not that simple. The probability can be much higher than 5%.
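One way to see why the share of wrong findings can exceed 5% is a small simulation. The sketch below is only an illustration under assumed numbers: 2000 hypothetical studies, 10% of which test a real effect of 0.5 standard deviations, with 20 observations per group. None of these values come from a real data set.

```r
set.seed(42)

n_tests   <- 2000  # hypothetical number of studies
prop_real <- 0.10  # assumed share of studies with a true effect
effect    <- 0.5   # assumed true difference in means (in SD units)
n_per_arm <- 20    # assumed sample size per group
alpha     <- 0.05

# TRUE where the null hypothesis is actually false (a real effect exists)
has_effect <- runif(n_tests) < prop_real

# Run one two-sample t-test per simulated study
p_values <- vapply(has_effect, function(real) {
  x <- rnorm(n_per_arm)
  y <- rnorm(n_per_arm, mean = if (real) effect else 0)
  t.test(x, y)$p.value
}, numeric(1))

significant <- p_values < alpha

# Type I error rate: share of true nulls that were rejected
type1_rate <- mean(significant[!has_effect])

# Share of "significant" findings that are in fact false positives
false_finding_share <- mean(!has_effect[significant])

cat("Type I error rate:", round(type1_rate, 3), "\n")
cat("False findings among significant results:",
    round(false_finding_share, 3), "\n")
```

The Type I error rate is computed only over the studies where the null is true, while the second number is computed over everything that came out significant. When real effects are rare and the tests have low power, the second number can be far larger than the first.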

Normality of the Data

Non-normal data can break down a statistical test. If the dataset is small, normality of the data is very important for some statistical procedures, such as computing confidence intervals or p-values. But if the dataset is large enough, departures from normality have much less impact, because the sampling distribution of the mean becomes approximately normal (the central limit theorem).
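The effect of sample size can be checked with a rough simulation. The sketch below draws skewed data from an exponential distribution (an assumed example of non-normal data) and measures how often a nominal 95% t-interval actually contains the true mean. The helper name coverage and the sample sizes 5 and 200 are illustrative choices.

```r
set.seed(1)

# Share of nominal 95% t-intervals that contain the true mean
# when the data come from a skewed Exponential(rate = 1) distribution.
coverage <- function(n, reps = 5000) {
  true_mean <- 1  # mean of Exponential(rate = 1)
  hits <- replicate(reps, {
    x  <- rexp(n, rate = 1)
    ci <- t.test(x)$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(hits)
}

cov_small <- coverage(5)    # small sample
cov_large <- coverage(200)  # large sample

cat("Coverage with n = 5:  ", round(cov_small, 3), "\n")
cat("Coverage with n = 200:", round(cov_large, 3), "\n")
```

With the small sample, the interval should cover the true mean noticeably less often than the promised 95%. With the large sample, the coverage should be close to the nominal level.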

Correlation

If the variables in the dataset are correlated with each other, that may result in poor statistical inference. Look at this picture below:

[Image: plot of two variables that appear strongly correlated]

In this graph, two variables seem to have a strong correlation. Similarly, if a series of data is observed as a sequence, values may be correlated with their neighbors, and there may be some clustering or autocorrelation in the data. This kind of behavior in the dataset can adversely impact statistical tests.
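To illustrate the impact of autocorrelation, the sketch below runs a one-sample t-test many times on data whose true mean is zero, once with independent values and once with an AR(1) series. The AR coefficient 0.7, the series length 50, and the helper name rejection_rate are assumptions made for this example.

```r
set.seed(7)

# Share of one-sample t-tests (H0: mean = 0) that reject at level alpha
# when the true mean really is 0 and the data follow an AR(1) process
# with coefficient phi (phi = 0 means independent observations).
rejection_rate <- function(phi, n = 50, reps = 2000, alpha = 0.05) {
  rejections <- replicate(reps, {
    x <- if (phi == 0) {
      rnorm(n)
    } else {
      as.numeric(arima.sim(model = list(ar = phi), n = n))
    }
    t.test(x, mu = 0)$p.value < alpha
  })
  mean(rejections)
}

rate_independent <- rejection_rate(0)
rate_correlated  <- rejection_rate(0.7)

cat("False positive rate, independent data:", round(rate_independent, 3), "\n")
cat("False positive rate, AR(1) with phi = 0.7:", round(rate_correlated, 3), "\n")
```

The t-test assumes independent observations. Positively correlated neighbors carry less information than the same number of independent values, so the test understates the uncertainty and rejects a true null far more often than 5% of the time.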

Correlation and Causation

This is especially important when interpreting the result of a statistical test. “Correlation does not mean causation”. Here is an example. Suppose you have study data showing that people without a college education are more likely to believe that women should be paid less than men in the workplace. You may have conducted a sound hypothesis test and found statistical support for that. But care must be taken over what conclusion is drawn from it. There probably is a correlation between college education and the belief that ‘women should get paid less’. But it is not fair to say that not having a college degree is the cause of such a belief. This is a correlation, not a direct cause and effect relationship.

A clearer example comes from medical data. Studies have shown that people with fewer cavities are less likely to get heart disease. You may have enough data to support that statistically, but you cannot say that dental cavities cause heart disease. There is no medical theory like that.
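One common way a correlation can arise without direct causation is a shared underlying factor. The simulation below is a sketch of that situation with made-up variables: z is a hypothetical common cause, while x and y each depend on z but not on each other. It is not a model of the cavity or education examples.

```r
set.seed(123)

n <- 5000
z <- rnorm(n)            # hypothetical common cause
x <- 0.8 * z + rnorm(n)  # depends on z only
y <- 0.8 * z + rnorm(n)  # depends on z only, not on x

# Raw correlation between x and y
raw_cor <- cor(x, y)

# Partial correlation: correlate what is left of x and y
# after removing the linear effect of z from each
partial_cor <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

cat("Correlation of x and y:", round(raw_cor, 3), "\n")
cat("Correlation after adjusting for z:", round(partial_cor, 3), "\n")
```

Here x and y are clearly correlated even though neither influences the other, and the correlation largely disappears once z is accounted for.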

#statistical-analysis #statistics #statistical-inference #math #data analysis