1599193380
Whether or not you are a fan of the tidyverse, there is no doubt that this collection of R packages offers some neat and attractive ways of wrangling data that are often very intuitive to users. In the earlier versions of tidyverse packages, some elements of user control of output were sacrificed in favor of simpler functions which could be picked up and easily used by newbies. In recent updates to dplyr and tidyr, there has been significant progress toward restoring some of this control.
This means that there are new functions and methods available in the tidyverse that you may not be aware of. They allow you to better transform your data how you want, and to perform operations more flexibly. They also provide new and alternative ways to perform tasks like nesting, modeling or graphing in a way where your code is more readable and understandable to many. In fact, I am convinced that users are only just scratching the surface of what can be done with the latest updates to this important set of packages.
It’s incumbent on any programmer to stay up to date with methods. Here are ten examples of new approaches to common data tasks that are offered by the latest tidyverse updates. For these examples, I will use the new Palmer Penguins dataset, an alternative to the controversial Iris dataset, which is known to have been used by Fisher in his work around eugenics. As we will see, it’s actually a better all-round dataset for teaching and illustrating data wrangling, and I’d encourage you to use and explore it.
First let’s load our tidyverse packages and the Palmer Penguins dataset and take a quick look at it. I’d encourage you to install the latest versions of these packages before you try to replicate the work in this article.
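The code that accompanied this step is missing from this copy. A minimal sketch of the setup, assuming the tidyverse and palmerpenguins packages are already installed from CRAN, might look like this:

```r
# Assumes install.packages(c("tidyverse", "palmerpenguins")) has been run.
library(tidyverse)
library(palmerpenguins)

# Show each column's name, type, and first few values.
glimpse(penguins)
```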
We can see that the dataset presents several measurements of various anatomical features of penguins of different species, sexes and native locations, as well as the year in which the measures were taken.
tidyselect helper functions are now built in to allow you to save time by selecting columns using dplyr::select() based on common conditions. In this case, if I want to reduce the dataset to just bill measurements I can use this (noting that all measurement columns contain an underscore):
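The snippet itself did not survive in this copy. One way such a selection could be written, as a sketch rather than the author's exact code, is to keep every column without an underscore and then add back only the bill columns (bill_data is a name chosen here for illustration):

```r
library(dplyr)
library(palmerpenguins)

# Keep the columns without an underscore (the non-measurement columns),
# plus the measurement columns whose names start with "bill".
bill_data <- penguins %>%
  select(!contains("_"), starts_with("bill"))

names(bill_data)
```

The columns come back in the order the selections are written, so the bill measurements end up at the end.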
A full set of tidyselect helper functions can be found in the tidyselect documentation.
dplyr::relocate() allows a new way to reorder specific columns or sets of columns. For example, if I want to make sure that all of my measurement columns are at the end of the dataset, I can use this (noting that my last column is year):
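The original snippet is missing here as well. A sketch of the call being described, assuming the measurement columns are the ones containing an underscore (reordered is a name chosen here for illustration):

```r
library(dplyr)
library(palmerpenguins)

# Move every column whose name contains an underscore so it follows `year`.
reordered <- penguins %>%
  relocate(contains("_"), .after = year)

names(reordered)
```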
Similar to .after you can also use .before as an argument here.
You’ll note in the penguins dataset that there are no unique identifiers for each penguin. This can be problematic when you have multiple penguins of the same species, island, sex and year in the dataset. To address this and prepare for later examples, let’s add a unique identifier using dplyr::mutate(), and here we can illustrate how mutate() now allows you to position your new column in a similar way to relocate():
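The code for this step is not present in this copy. A minimal sketch, in which the column name penguinid and its placement before species are assumptions made for illustration:

```r
library(dplyr)
library(palmerpenguins)

# Add a running row number as an identifier and place it before `species`.
# `penguinid` is a hypothetical column name, not taken from the dataset.
penguins_id <- penguins %>%
  mutate(penguinid = row_number(), .before = species)

names(penguins_id)
```

The .before and .after arguments of mutate() accept the same kind of column selection as relocate(), so a helper such as contains("_") could be used in place of a single column name.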
#data-science #learning #development #analytics #programming
1649209980
A cross-platform command line REPL for the rapid experimentation and exploration of C#. It supports intellisense, installing NuGet packages, and referencing local .NET projects and assemblies.
C# REPL provides the following features:
C# REPL is a .NET 6 global tool, and runs on Windows 10, Mac OS, and Linux. It can be installed via:
dotnet tool install -g csharprepl
If you're running on Mac OS Catalina (10.15) or later, make sure you follow any additional directions printed to the screen. You may need to update your PATH variable in order to use .NET global tools.
After installation is complete, run csharprepl to begin. C# REPL can be updated via dotnet tool update -g csharprepl.
Run csharprepl from the command line to begin an interactive session. The default colorscheme uses the color palette defined by your terminal, but these colors can be changed using a theme.json file provided as a command line argument.
Type some C# into the prompt and press Enter to run it. The result, if any, will be printed:
> Console.WriteLine("Hello World")
Hello World
> DateTime.Now.AddDays(8)
[6/7/2021 5:13:00 PM]
To evaluate multiple lines of code, use Shift+Enter to insert a newline:
> var x = 5;
var y = 8;
x * y
40
Additionally, if the statement is not a "complete statement" a newline will automatically be inserted when Enter is pressed. For example, in the below code, the first line is not a syntactically complete statement, so when we press enter we'll go down to a new line:
> if (x == 5)
| // caret position, after we press Enter on Line 1
Finally, pressing Ctrl+Enter will show a "detailed view" of the result. For example, for the DateTime.Now expression below, on the first line we pressed Enter, and on the second line we pressed Ctrl+Enter to view more detailed output:
> DateTime.Now // Pressing Enter shows a reasonable representation
[5/30/2021 5:13:00 PM]
> DateTime.Now // Pressing Ctrl+Enter shows a detailed representation
[5/30/2021 5:13:00 PM] {
Date: [5/30/2021 12:00:00 AM],
Day: 30,
DayOfWeek: Sunday,
DayOfYear: 150,
Hour: 17,
InternalKind: 9223372036854775808,
InternalTicks: 637579915804530992,
Kind: Local,
Millisecond: 453,
Minute: 13,
Month: 5,
Second: 0,
Ticks: 637579915804530992,
TimeOfDay: [17:13:00.4530992],
Year: 2021,
_dateData: 9860951952659306800
}
A note on semicolons: C# expressions do not require semicolons, but statements do. If a statement is missing a required semicolon, a newline will be added instead of trying to run the syntactically incomplete statement; simply type the semicolon to complete the statement.
> var now = DateTime.Now; // assignment statement, semicolon required
> DateTime.Now.AddDays(8) // expression, we don't need a semicolon
[6/7/2021 5:03:05 PM]
Use the #r command to add assembly or NuGet references:
#r "AssemblyName" or #r "path/to/assembly.dll" to reference an assembly.
#r "path/to/project.csproj" to reference a project. Solution files (.sln) can also be referenced.
#r "nuget: PackageName" to install the latest version of a package, or #r "nuget: PackageName, 13.0.5" to install a specific version (13.0.5 in this case).
To run ASP.NET applications inside the REPL, start the csharprepl application with the --framework parameter, specifying the Microsoft.AspNetCore.App shared framework. Then, use the above #r command to reference the application DLL. See the Command Line Configuration section below for more details.
csharprepl --framework Microsoft.AspNetCore.App
The C# REPL supports multiple configuration flags to control startup, behavior, and appearance:
csharprepl [OPTIONS] [response-file.rsp] [script-file.csx] [-- <additional-arguments>]
Supported options are:
-r <dll> or --reference <dll>: Reference an assembly, project file, or NuGet package. Can be specified multiple times. Uses the same syntax as #r statements inside the REPL. For example, csharprepl -r "nuget:Newtonsoft.Json" "path/to/myproj.csproj"
-u <namespace> or --using <namespace>: Add a using statement. Can be specified multiple times.
-f <framework> or --framework <framework>: Reference a shared framework. The available shared frameworks depend on the local .NET installation, and can be useful when running an ASP.NET application from the REPL. One example is Microsoft.AspNetCore.App.
-t <theme.json> or --theme <theme.json>: Read a theme file for syntax highlighting. This theme file associates C# syntax classifications with colors. The color values can be full RGB, or ANSI color names (defined in your terminal's theme). The NO_COLOR standard is supported.
--trace: Produce a trace file in the current directory that logs CSharpRepl internals. Useful for CSharpRepl bug reports.
-v or --version: Show version number and exit.
-h or --help: Show help and exit.
response-file.rsp: A filepath of an .rsp file, containing any of the above command line options.
script-file.csx: A filepath of a .csx file, containing lines of C# to evaluate before starting the REPL. Arguments to this script can be passed as <additional-arguments>, after a double hyphen (--), and will be available in a global args variable.
If you have dotnet-suggest enabled, all options can be tab-completed, including values provided to --framework and .NET namespaces provided to --using.
C# REPL is a standalone software application, but it can be useful to integrate it with other developer tools:
To add C# REPL as a menu entry in Windows Terminal, add the following profile to Windows Terminal's settings.json configuration file (under the JSON property profiles.list):
{
"name": "C# REPL",
"commandline": "csharprepl"
},
To get the exact colors shown in the screenshots in this README, install the Windows Terminal Dracula theme.
To use the C# REPL with Visual Studio Code, simply run the csharprepl command in the Visual Studio Code terminal. To send commands to the REPL, use the built-in Terminal: Run Selected Text In Active Terminal command from the Command Palette (workbench.action.terminal.runSelectedText).
To add the C# REPL to the Windows Start Menu for quick access, you can run the following PowerShell command, which will start C# REPL in Windows Terminal:
$shell = New-Object -ComObject WScript.Shell
$shortcut = $shell.CreateShortcut("$env:appdata\Microsoft\Windows\Start Menu\Programs\csharprepl.lnk")
$shortcut.TargetPath = "wt.exe"
$shortcut.Arguments = "-w 0 nt csharprepl.exe"
$shortcut.Save()
You may also wish to add a shorter alias for C# REPL, which can be done by creating a .cmd file somewhere on your path. For example, put the following contents in C:\Users\username\.dotnet\tools\csr.cmd:
wt -w 0 nt csharprepl
This will allow you to launch C# REPL by running csr from anywhere that accepts Windows commands, like the Windows Run dialog.
This project is far from being the first REPL for C#. Here are some other projects; if this project doesn't suit you, another one might!
Visual Studio's C# Interactive pane is full-featured (it has syntax highlighting and intellisense) and is part of Visual Studio. This deep integration with Visual Studio is both a benefit from a workflow perspective, and a drawback as it's not cross-platform. As far as I know, the C# Interactive pane does not support NuGet packages or navigating to documentation/source code. Subjectively, it does not follow typical command line keybindings, so can feel a bit foreign.
csi.exe ships with C# and is a command line REPL. It's great because it's a cross platform REPL that comes out of the box, but it doesn't support syntax highlighting or autocompletion.
dotnet script allows you to run C# scripts from the command line. It has a REPL built-in, but the predominant focus seems to be as a script runner. It's a great tool, though, and has a strong community following.
dotnet interactive is a tool from Microsoft that creates a Jupyter notebook for C#, runnable through Visual Studio Code. It also provides a general framework useful for running REPLs.
Download Details:
Author: waf
Source Code: https://github.com/waf/CSharpRepl
License: MPL-2.0 License
1620466520
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition
1620629020
The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.
This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.
As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).
This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.
#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management
1593801840
A data scientist or analyst in the making needs to format and clean data before performing any kind of exploratory data analysis, because raw data has numerous problems that need fixing.
So when we say we are cleaning data into a tidy data set to be used for analysis later, we are actually (among many other things):
1. Removing duplicate values
2. Removing null values
3. Changing column names to readable, understandable, formatted names
4. Removing commas from numeric values (e.g., 1,000,657 to 1000657)
5. Converting data types into their appropriate types for analysis
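The five steps above can be sketched in base R on a tiny data frame. The data and column names here are invented purely for illustration and are not taken from the course project:

```r
# Hypothetical raw data: messy names, a duplicate row, a missing value,
# and numbers stored as text with thousands separators.
raw <- data.frame(
  "Cust Name"   = c("Ann", "Bob", "Bob", NA),
  "total.SALES" = c("1,000,657", "2,500", "2,500", "300"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

clean <- unique(raw)                                   # 1. remove duplicate rows
clean <- clean[complete.cases(clean), ]                # 2. remove rows with missing values
names(clean) <- c("customer_name", "total_sales")      # 3. readable column names
clean$total_sales <- gsub(",", "", clean$total_sales)  # 4. strip commas
clean$total_sales <- as.numeric(clean$total_sales)     # 5. convert to numeric

str(clean)
```

The order matters for the last two steps: converting "1,000,657" to numeric before removing the commas would produce NA.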
This article is based upon a brief course project I recently completed in my Data Science Specialization, focused on retrieving raw data, combining it into one dataset and getting it ready for later analysis (not covered in this article). The language used is R, in RStudio.
The Experiment:
The experiment used here comes from the UCI Machine Learning Repository, where a group of 30 volunteers (age bracket of 19–48 years) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a Samsung Galaxy S smartphone. The data collected from the embedded accelerometers was divided into test and training data. More information regarding the experiment can be found at this link.
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The first step required is to obtain the data. Often, to avoid the headache of manually downloading thousands of files, they are downloaded using small code snippets. Since this was a zipped folder, I used the following commands to get started.
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "files", method = "curl", mode = "wb")
The download.file function takes the URL as the first argument and saves the download on your local machine under the name you assign to destfile.
unzip("files")
This function just unzips the zipped folder.
# Lookup tables: feature (column) names and activity labels
features <- read.table("UCI HAR Dataset/features.txt", col.names = c("serial", "Functions"))
activities <- read.table("UCI HAR Dataset/activity_labels.txt", col.names = c("serial", "Activity"))
# Test set: measurements, activity codes, and subject ids
x_test <- read.table("UCI HAR Dataset/test/X_test.txt", col.names = features$Functions)
y_test <- read.table("UCI HAR Dataset/test/y_test.txt", col.names = "serial")
subject_test <- read.table("UCI HAR Dataset/test/subject_test.txt", col.names = "subject")
# Training set: the same three pieces
subject_train <- read.table("UCI HAR Dataset/train/subject_train.txt", col.names = "subject")
x_train <- read.table("UCI HAR Dataset/train/X_train.txt", col.names = features$Functions)
y_train <- read.table("UCI HAR Dataset/train/y_train.txt", col.names = "serial")
Note: It might be difficult to understand at first what the data means and what column names to use, but after a while it will start making sense. For example, it is important to note that the x_test and x_train files hold the values for the columns named in features.txt (hence I’ve linked them up using features$Functions).
Making sense of the Data:
After actually looking at the files, I found they were a mess: one .txt file held nothing but hundreds of column names, others held the row values, and one held the activity labels. After spending hours trying to understand the logical layout of the data, I was able to picture it as a single table: the subject, measurement, and activity files sit side by side as columns, with the test and training rows stacked on top of each other.
This clearly implies two things:
I had to merge the training and test sets by row binding them
I had to merge the different attributes of the subjects by column binding them.
This is where step 3 comes into play.
First, I used the rbind() function to stack the test and training rows into one large dataset.
binded_x <- rbind(x_test, x_train)
binded_y <- rbind(y_test, y_train)
subject <- rbind(subject_test, subject_train)
Next, I used the cbind() function to complete attaching the columns as well.
raw_data_combined <- cbind(subject, binded_x, binded_y)
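To see why the order of these calls matters, here is a small sketch on made-up stand-ins for the real files (all names ending in _toy are hypothetical). It shows that rbind() grows the row count while cbind() grows the column count, and that cbind() only lines up correctly if every piece was stacked in the same test-then-train order:

```r
# Hypothetical stand-ins: 2 test rows and 3 training rows.
x_test_toy  <- data.frame(f1 = c(0.1, 0.2), f2 = c(1.1, 1.2))
x_train_toy <- data.frame(f1 = c(0.3, 0.4, 0.5), f2 = c(1.3, 1.4, 1.5))
y_test_toy  <- data.frame(serial = c(1, 2))
y_train_toy <- data.frame(serial = c(3, 1, 2))

# Row binding stacks the pieces: 2 + 3 = 5 rows, same columns.
x_all <- rbind(x_test_toy, x_train_toy)
y_all <- rbind(y_test_toy, y_train_toy)

# Column binding requires matching row counts; rows are paired by position.
stopifnot(nrow(x_all) == nrow(y_all))
combined <- cbind(x_all, y_all)

dim(combined)
```

Because cbind() pairs rows purely by position, binding test before train for one piece and train before test for another would silently attach the wrong labels.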
#r #data-science-tools #data-analytics #data-science #data-cleaning #data analysis
1594236360
My journey into the vast world of data has been a fun and enthralling ride. I have been glued to my courses, waiting to finish one so I can proceed to the next. After completing introductory courses, I made my way over to data cleaning. It is no secret that most of the effort in any data science project goes into cleaning the data set and tidying it up for analysis. Therefore, it is crucial to have substantial knowledge about this topic.
Firstly, to understand the need for clean data, we need to look at the workflow for a typical data science project. Data is first accessed, followed by manipulation and analysis of the data. Afterward, insights are extracted, and finally, visualized and reported.
Typical Project Workflow
Errors and mistakes in data, if present, could end up generating errors throughout the entire workflow. Ultimately, the insights generated that are used to make critical business decisions are incorrect, which may lead to monetary and business losses. Thus, if untidy data is not tackled and corrected in the first step, the compounding effect can be immense.
This guide will serve as a quick onboarding tool for data cleaning by compiling all the necessary functions and actions that should be taken. I will briefly describe three types of common data errors and then explain how these can be identified in data sets and corrected. I will also be introducing some powerful cleaning and manipulation libraries including dplyr, stringr, and assertive. These can be installed by simply writing the following code in RStudio:
install.packages("tidyverse")
install.packages("assertive")
When data is imported, it is possible that R interprets a column's type incorrectly, or that the column was wrongly labeled during extraction. For example, a common error is a column of numbers being read in as character type.
a) Identification
Firstly, to identify incorrect data type errors, the glimpse function is used to check the data types of all columns. The glimpse function is part of the dplyr package, which needs to be installed before glimpse can be used. Glimpse will return all the columns with their respective data types.
library(dplyr)
glimpse(dataset)
Another logical check is the family of is functions. There is one for each common data type, and each returns a logical value (TRUE/FALSE). I have only listed the most common ones here. If a numeric column is passed to the is.numeric function, the output is TRUE, while if a character column is passed to is.numeric, the output is FALSE.
is.numeric(column_name)
is.character(column_name)
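As a small illustration, assuming a made-up data frame in which a price column was imported as text, the checks behave like this:

```r
# Hypothetical data: `price` was read in as character, `quantity` as integer.
dataset <- data.frame(
  price    = c("10.5", "20", "31"),
  quantity = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)

is.numeric(dataset$price)     # FALSE: the numbers are stored as text
is.character(dataset$price)   # TRUE
is.numeric(dataset$quantity)  # TRUE: integers count as numeric

# Check every column at once.
sapply(dataset, class)
```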
b) Correction
After all the incorrect data type columns have been identified, they can simply be converted to the correct data type by using the as functions. For example, if a numeric data type has been incorrectly imported as a character data type, the as.numeric function will convert it to numeric data type.
as.numeric(column_name)
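One detail worth showing: as.numeric returns a new vector and does not change the column in place, so the result has to be assigned back. A short sketch on made-up data (the price_num column is a hypothetical name):

```r
# Hypothetical data: prices imported as text, one with a thousands separator.
dataset <- data.frame(price = c("10.5", "20", "1,000"), stringsAsFactors = FALSE)

# Text that is not a plain number becomes NA (with a warning).
is.na(suppressWarnings(as.numeric("1,000")))   # TRUE

# Remove the commas first, then convert and store the result in a new column.
dataset$price_num <- as.numeric(gsub(",", "", dataset$price))

str(dataset)
```

Counting the NA values after a conversion is a quick way to spot entries that could not be parsed.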
#data-analysis #data-scientist #data #data-cleaning #r #data analysis