Level-up with Semi-joins in R

Introduction

Assuming you already have some background with the more common types of joins (inner, left, right, and outer), adding semi- and anti-joins to your repertoire can prove incredibly useful, saving you what might otherwise take multiple steps.

In this post, I’ll focus on semi-joins; that said, there is a lot of overlap between semi- and anti-joins, so get ready to learn a bit about both.

Filtering Joins

Semi- and anti-joins are quite different from the four join types I just highlighted; the number one difference is that they are classified as what’s known as filtering joins.

Syntactically, a filtering join feels very similar to any other join, but the intent is not to enhance a dataset with additional columns or rows; the intent is to filter.

A filtering join is not characterized by the addition of new columns of information; rather, it facilitates keeping or dropping records in a given dataset.
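
To make the mutating-versus-filtering distinction concrete, here is a minimal dplyr sketch on two toy tibbles (the data is invented purely for illustration): a mutating join like left_join adds columns from the second table, while a filtering join like semi_join returns only the columns of the first.

library(dplyr)

# Toy data, invented purely for illustration
x <- tibble(id = 1:3, value = c("a", "b", "c"))
y <- tibble(id = 2:3, extra = c("yes", "no"))

left_join(x, y, by = "id")  # mutating join: all 3 rows of x, plus the new extra column
semi_join(x, y, by = "id")  # filtering join: only the 2 matching rows, columns of x only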

Semi Join

The semi join takes a dataset and filters it down based on whether a common identifier appears in a second dataset.

A good way to drill this idea home is to code the alternative.

Opportunity Dataset Example

Let’s say we have a dataset from Salesforce that contains all the deals, or opportunities, that we’ve worked on or are currently working on.

This opportunity dataset gives us a lot of really good information about the deal itself. Let’s suppose it looks something like this:

| opp_id | account_id | created_date | close_date | amount | stage |
| --- | --- | --- | --- | --- | --- |

Now let’s say we need to filter this dataset down such that we’re only including enterprise accounts. It just so happens that there is no segment field on our opportunity dataset that we can use to filter… we’ll have to leverage information from elsewhere.

Let’s pretend that the only place enterprise accounts have been tracked is a random Excel file, and that the dataset looks like this:

| account_id | customer_segment |
| --- | --- |
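
To make the examples below runnable, here is a minimal sketch of both datasets; the column names come from the schemas above, but every value is invented for illustration.

library(dplyr)

# Hypothetical opportunities data matching the schema above
opportunities <- tribble(
  ~opp_id, ~account_id, ~created_date, ~close_date, ~amount, ~stage,
  1L, "A1", "2021-01-04", "2021-03-01", 50000, "closed_won",
  2L, "A2", "2021-02-11", "2021-04-15", 12000, "negotiation",
  3L, "A3", "2021-02-20", "2021-05-30", 8000, "prospecting"
)

# Hypothetical lookup of enterprise accounts (our “random Excel file”)
enterprise_accounts <- tribble(
  ~account_id, ~customer_segment,
  "A1", "enterprise",
  "A3", "enterprise"
)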

In a World Without Semi-joins

Based on what we know about left joins, we could do the following:

opportunities %>%
  left_join(enterprise_accounts, by = "account_id") %>%
  filter(!is.na(customer_segment))

As you can see, we can left join the enterprise accounts dataset onto our main opportunities dataset; wherever there is no matching account_id, customer_segment comes back as NA, so we add a filter statement keeping only the non-missing cases.

This works, and it effectively performs the same job I described above. One annoying thing is that it leaves you with a new field, customer_segment, that is the same for every record.

We could also throw on a select statement afterward to drop that field, which just adds another line of code to achieve the same result.

opportunities %>%
  left_join(enterprise_accounts, by = "account_id") %>%
  filter(!is.na(customer_segment)) %>%
  select(-customer_segment)

Assuming you’ve learned about inner joins as well, we could achieve similar functionality with slightly simpler code.

opportunities %>%
  inner_join(enterprise_accounts, by = "account_id") %>%
  select(-customer_segment)
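
One subtle caveat with the inner-join version, sketched here with the hypothetical toy data from above: if the lookup table ever listed the same account_id twice, the inner join would repeat the matching opportunity rows, whereas the semi-join we’re about to meet never duplicates.

# Hypothetical: the lookup table accidentally lists an account twice
dup_accounts <- bind_rows(enterprise_accounts, slice(enterprise_accounts, 1))

opportunities %>%
  inner_join(dup_accounts, by = "account_id") %>%
  nrow()  # 3: the A1 opportunity is repeated for each duplicate

opportunities %>%
  semi_join(dup_accounts, by = "account_id") %>%
  nrow()  # 2: each matching opportunity appears exactly once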

Simplifying with Semi-joins

Now let’s simplify things even more with a semi-join.

opportunities %>%
  semi_join(enterprise_accounts, by = "account_id")

This gets us to the exact same output as each of the examples above: it filters out any opportunity record with no matching account_id in the enterprise accounts table. It won’t add columns or rows to your dataset (and, unlike an inner join, it won’t duplicate rows when the lookup table has repeated keys). It exists exclusively to filter.
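
Since the intro promised a bit about anti-joins too: anti_join is the mirror image of semi_join, keeping only the rows without a match. And as a sanity check on the intuition, for a single-column key a semi-join behaves like filtering with %in%; a quick sketch using the toy data above:

# Anti-join: opportunities whose account_id is NOT in enterprise_accounts
opportunities %>%
  anti_join(enterprise_accounts, by = "account_id")

# For a single-column key, this is roughly equivalent to the semi-join
opportunities %>%
  filter(account_id %in% enterprise_accounts$account_id)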

Conclusion

There you have it: in just a few minutes we’ve covered a lot and unlocked a bit of dplyr functionality that can simplify both your code and your workflow.

We’ve learned:

  • The difference between mutating joins and filtering joins
  • How to execute a “filtering join” in the absence of semi-joins
  • The specific output and intent for a semi-join
  • How to use a semi-join

#data-science #r #data-analysis #towards-data-science #machine-learning
