Connor Mills

Connor Mills

1573010602

The 3 Algorithms Concepts Data Scientists Need to Know

Algorithms are an integral part of data science. While most of us data scientists don’t take a proper algorithms course while studying, they are important all the same.

Many companies ask data structures and algorithms as part of their interview process for hiring data scientists.

Now the question that many people ask here is what is the use of asking a data scientist such questions. The way I like to describe it is that a data structure question may be thought of as a coding aptitude test.

We all have given aptitude tests at various stages of our life, and while they are not a perfect proxy to judge someone, almost nothing ever really is. So, why not a standard algorithm test to judge people’s coding ability.

But let’s not kid ourselves, they will require the same zeal to crack as your Data Science interviews, and thus, you might want to give some time for the study of algorithms.

This post is about fast-tracking that study and panning some essential algorithms concepts for the data scientists in an easy to understand way.

1. Recursion/Memoization

Recursion is where a function being defined is applied within its own definition. Put simply; recursion is when a function calls itself. Google does something pretty interesting when you search for recursion there.
3 Programming Concepts for Data Scientists

Hope you get the joke. While recursion may seem a little bit daunting to someone just starting, it is pretty simple to understand. And it is a beautiful concept once you know it.

The best example I find for explaining recursion is to calculate the factorial of a number.

def factorial(n):
    if n==0:
        return 1
    return n*factorial(n-1)

We can see how factorial is a recursive function quite easily.

Factorial(n) = n*Factorial(n-1)

So how does it translates to programming?

A function for a recursive call generally consists of two things:

  • A base case — The case where the recursion ends.
  • A recursive formulation- a formulaic way to move towards the base case.

A lot of problems you end up solving are recursive. It applies to data science, as well.

For example, A decision tree is just a binary tree, and tree algorithms are generally recursive. Or, we do use sort in a lot of times. The algorithm responsible for that is called mergesort, which in itself is a recursive algorithm. Another one is binary search, which includes finding an element in an array.

Now we have got a basic hang of recursion, let us try to find the nth Fibonacci Number. Fibonacci series is a series of numbers in which each number ( Fibonacci number ) is the sum of the two preceding numbers. The simplest is the series 1, 1, 2, 3, 5, 8, etc. The answer is:

def fib(n):
    if n<=1:
        return 1
    return fib(n-1) + fib(n-2)

But do you spot the problem here?

If you try to calculate fib(n=7) it runs fib(5) twice, fib(4) thrice, fib(3) five times. As n becomes larger, a lot of calls are made for the same number, and our recursive function calculates it again and again.
3 Programming Concepts for Data Scientists

Can we do better? Yes, we can. We can change our implementation a little bit an add a dictionary to add some storage to our method. Now, this memo dictionary gets updated any time a number has been calculated. If that number appears again, we don’t calculate it again but give results from the memo dictionary. This addition of storage is called Memoization.

memo = {}
def fib_memo(n):
    if n in memo:
        return memo[n]
    if n<=1:
        memo[n]=1
        return 1
    memo[n] = fib_memo(n-1) + fib_memo(n-2)
    return memo[n]

Usually, I like to write the recursive function first, and if it is making a lot of calls to the same parameters again and again, I add a dictionary to memorize solutions.

How much does it help?
3 Programming Concepts for Data Scientists

This is the run time comparison for different values of n. We can see the runtime for Fibonacci without Memoization increases exponentially, while for the memoized function, the time is linear.

2. Dynamic programming

Recursion is essentially a top-down approach. As in when calculating Fibonacci number n we start from n and then do recursive calls for n-2 and n-1 and so on.

In Dynamic programming, we take a bottom-up approach. It is essentially a way to write recursion iteratively. We start by calculating fib(0) and fib(1) and then use previous results to generate new results.

def fib_dp(n):
    dp_sols = {0:1,1:1}
    for i in range(2,n+1):
        dp_sols[i] = dp_sols[i-1] + dp_sols[i-2] 
    return dp_sols[n]

3 Programming Concepts for Data Scientists

Above is the comparison of runtimes for DP vs. Memoization. We can see that they are both linear, but DP still is a little bit faster.

Why? Because Dynamic Programming made only one call exactly to each subproblem in this case.

There is an excellent story on how Bellman who developed Dynamic Programming framed the term_:_

Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. What title, what name, could I choose? In the first place, I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.

3. Binary Search

Let us say we have a sorted array of numbers, and we want to find out a number from this array. We can go the linear route that checks every number one by one and stops if it finds the number. The problem is that it takes too long if the array contains millions of elements. Here we can use a Binary search.
3 Programming Concepts for Data Scientists

# Returns index of target in nums array if present, else -1 
def binary_search(nums, left, right, target):   
    # Base case 
    if right >= left: 
        mid = int((left + right)/2)
        # If target is present at the mid, return
        if nums[mid] == target: 
            return mid 
        # Target is smaller than mid search the elements in left
        elif nums[mid] > target: 
            return binary_search(nums, left, mid-1, target) 
        # Target is larger than mid, search the elements in right
        else: 
            return binary_search(nums, mid+1, right, target) 
    else: 
        # Target is not in nums 
        return -1nums = [1,2,3,4,5,6,7,8,9]
print(binary_search(nums, 0, len(nums)-1,7))

This is an advanced case of a recursion based algorithm where we make use of the fact that the array is sorted. Here we recursively look at the middle element and see if we want to search in the left or right of the middle element. This makes our searching space go down by a factor of 2 every step.

And thus the run time of this algorithm is O(logn) as opposed to O(n) for linear search.

How much does that matter? Below is a comparison in run times. We can see that the Binary search is pretty fast compared to Linear search.
3 Programming Concepts for Data Scientists

Conclusion

In this post, I talked about some of the most exciting algorithms that form the basis for programming.

These algorithms are behind some of the most asked questions in Data Science interviews, and a good understanding of these might help you land your dream job.

And while you can go a fair bit in data science without learning them, you can learn them just for a little bit of fun and maybe to improve your programming skills.

#data-science #machine-learning #deep-learning

What is GEEK

Buddha Community

The 3 Algorithms Concepts Data Scientists Need to Know
Siphiwe  Nair

Siphiwe Nair

1620466520

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Java Questions

Java Questions

1599137520

50 Data Science Jobs That Opened Just Last Week

Our latest survey report suggests that as the overall Data Science and Analytics market evolves to adapt to the constantly changing economic and business environments, data scientists and AI practitioners should be aware of the skills and tools that the broader community is working on. A good grip in these skills will further help data science enthusiasts to get the best jobs that various industries in their data science functions are offering.

In this article, we list down 50 latest job openings in data science that opened just last week.

(The jobs are sorted according to the years of experience r

1| Data Scientist at IBM

**Location: **Bangalore

Skills Required: Real-time anomaly detection solutions, NLP, text analytics, log analysis, cloud migration, AI planning, etc.

Apply here.

2| Associate Data Scientist at PayPal

**Location: **Chennai

Skills Required: Data mining experience in Python, R, H2O and/or SAS, cross-functional, highly complex data science projects, SQL or SQL-like tools, among others.

Apply here.

3| Data Scientist at Citrix

Location: Bangalore

Skills Required: Data modelling, database architecture, database design, database programming such as SQL, Python, etc., forecasting algorithms, cloud platforms, designing and developing ETL and ELT processes, etc.

Apply here.

4| Data Scientist at PayPal

**Location: **Bangalore

Skills Required: SQL and querying relational databases, statistical programming language (SAS, R, Python), data visualisation tool (Tableau, Qlikview), project management, etc.

Apply here.

5| Data Science at Accenture

**Location: **Bibinagar, Telangana

Skills Required: Data science frameworks Jupyter notebook, AWS Sagemaker, querying databases and using statistical computer languages: R, Python, SLQ, statistical and data mining techniques, distributed data/computing tools such as Map/Reduce, Flume, Drill, Hadoop, Hive, Spark, Gurobi, MySQL, among others.


#careers #data science #data science career #data science jobs #data science news #data scientist #data scientists #data scientists india

Gerhard  Brink

Gerhard Brink

1620629020

Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.

Introduction

As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).


This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Ian  Robinson

Ian Robinson

1623175620

Data Science: Advice for Aspiring Data Scientists | Experfy Insights

Around once a month, I get emailed by a student of some type asking how to get into Data Science, I’ve answered it enough that I decided to write it out here so I can link people to it. So if you’re one of those students, welcome!

I’ll segment this into basic advice, which can be found quite easily if you just google ‘how to get into data science’ and advice that is less common, but advice that I’ve found very useful over the years. I’ll start with the latter, and move on to basic advice. Obviously take this with a grain of salt as all advice comes with a bit of survivorship bias.

Less Basic Advice:

1. Find a solid community

2. Apply Data Science to Things you Enjoy

3. Minimize the ‘Clicks to Proof of Competence’

4. Learn Through Research or Entry Level Jobs

#big data & cloud #data science #data scientist #statistics #aspiring data scientist #advice for aspiring data scientists

5 Indian Companies Recruiting Data Scientists In Large Numbers

According to a recent study on analytics and data science jobs, the number of vacancies for data science-related jobs in India has increased by 53 per cent, since India eased the lockdown restrictions. Moreover, India’s share of open data science jobs in the world has seen a steep rise from 7.2 per cent in January to 9.8 per cent in August.

Here is a list of 5 such companies, in no particular order, in India that are currently recruiting Data Scientists in bulk.

#careers #data science #data science career #data science jobs #data science recruitment #data scientist #data scientist jobs