Docker Vs Virtual Machine: Understand the differences

Virtual machines and Docker containers both help you get the most out of the computing resources available in your hardware and software.

Docker containers are relatively new on the block, while virtual machines (VMs) have been around for years and will continue to be popular in data centres of all sizes. If you are looking for the best way to run your services in the cloud, you should first understand these virtualization technologies. Learn about the differences between the two, how best to use each, and the capabilities each one offers.

Most organizations have either moved, or are planning to move, from on-premise computing to cloud computing services. Cloud computing gives you access to a large pool of configurable, shareable resources, for example computer networks, servers, storage, applications, and services. Traditionally, cloud computing has been implemented with virtual machines. These days, however, Docker containers have gained a lot of popularity thanks to their features and because they are lightweight compared to the heavier virtual machines.

According to industry reports, the use of application containers was expected to grow by 40% by the end of 2020. Docker containers have gained popularity because they facilitate rapid and agile development. But how are Docker containers different from virtual machines? The most important thing to know is that Docker containers are not virtual machines, lightweight virtual machines, or trimmed-down virtual machines. Let us compare the two and understand the major differences.

What Exactly is a Virtual Machine?

Virtual machines were born when server processing power and capacity increased but bare-metal applications could not exploit the new abundance of resources. VMs are created by running software on top of physical servers to emulate a particular hardware system. A virtual machine monitor, or hypervisor, is the firmware, software or hardware that creates and runs virtual machines. It is the component that virtualizes the server, and it sits between the virtual machine and the hardware. As cloud computing services became available and virtualization became affordable, many large and small IT departments adopted virtual machines to reduce costs and increase efficiency.

Understanding Virtual Machines

Let us understand how virtual machines work starting from the bottom-most layer:

  • Infrastructure: This can be anything, from your PC or laptop to a dedicated server running in a data centre or a private virtual server in the cloud such as an Amazon EC2 instance.
  • Host Operating System: On top of the infrastructure layer sits the host, which runs an operating system. On your laptop this will likely be Windows, macOS or Linux. Since we are discussing virtual machines, it is commonly called the host operating system.
  • Hypervisor: Also called a virtual machine monitor. You can think of a virtual machine as a self-contained computer packed into a single file, but something is needed to run that file. Type 1 and Type 2 hypervisors do this. Examples of Type 1 hypervisors are Hyper-V for Windows, HyperKit for macOS and KVM for Linux; popular Type 2 hypervisors include VirtualBox and VMware.
  • Guest Operating System: Suppose you want to run three applications on your server in total isolation. You will need three guest operating systems, each controlled by the hypervisor. Each guest operating system takes around 700 MB of disk space, so the three together use about 2.1 GB, and it gets worse because each guest OS also needs its own CPU and memory resources. This is what makes virtual machines heavy.
  • BINS/LIBS: Each guest operating system uses its own set of binaries and libraries to run applications. For example, if you are using Python or Node.js, you will install the corresponding packages in this layer. Since each application is different, each one is expected to have its own set of library requirements.
  • Application Layer: This is the layer where you have your source code for the magical application you have developed. If you want each of these applications to be isolated, you will have to run each application inside its own guest operating system.
Types of Virtual Machines

There are different types of virtual machines, each offering various functions:

System Virtual Machines

A system virtual machine allows multiple instances of an operating system to run on a host system, sharing the physical resources. System VMs emulate an existing architecture and are built to provide a platform for running programs where the real hardware is not available. Some of the advantages of system virtual machines are:

  • Multiple OS environments can coexist on the same primary hard drive, with a virtual partition that allows files generated in either the “guest” virtual environment or the “host” operating system to be shared.
  • Application provisioning, high availability, maintenance and disaster recovery are inherent in the virtual machine software selected.

Some of the disadvantages of system virtual machines are mentioned below:

  • Because a virtual machine accesses the host's hardware indirectly, it is less efficient than the actual machine.
  • Malware protection for virtual machines is not always compatible with the "host" and sometimes requires separate software.

Process Virtual Machines

A process virtual machine is also known as an application virtual machine, or Managed Runtime Environment (MRE). It is used to execute a computer program inside a host OS and it supports a single process. A process virtual machine is created when the process starts and is destroyed as soon as you exit the process. The main purpose of this type of virtual machine is to provide a platform-independent programming environment.

Benefits of Virtual Machines

Virtualization provides you with a number of advantages such as centralized network management, reducing dependency on additional hardware and software, etc. Apart from these, virtual machines offer a few more benefits:

  • Multiple OS environments can be used simultaneously on the same machine, although isolated from each other.
  • Virtual machines can provide an instruction set architecture that differs from that of the real computer.
  • They offer easy maintenance, application provisioning, availability and convenient recovery.
Popular VM Providers

Here is a selection of software we think is best suited for people who want to keep things real, virtually.

Oracle VM Virtualbox

Oracle VM VirtualBox is free of cost, supports Windows, Mac and Linux, and has a community of around 100,000 registered users. If you are not sure which operating system to choose, Oracle VM VirtualBox is a really good choice to go with. It supports a wide range of host and client combinations: operating systems from Windows XP onward, any Linux kernel from version 2.4 upward, Solaris, OpenSolaris and even OpenBSD Unix. It also runs on Apple’s macOS and can host a client Mac VM session.

VMware Fusion and Workstation

VMware Workstation and VMware Fusion are the industry leaders in virtualization. They are among the few hosts to support DirectX 10 and OpenGL 3.3, which allows CAD and other GPU-accelerated applications to run under virtualization.

Red Hat Virtualization

Red Hat Virtualization is geared more towards enterprise users, with powerful bare-metal options. It comes in two versions: a basic version included in Red Hat Enterprise Linux, which allows up to four distinct VMs on a single host, and the more sophisticated standalone Red Hat Virtualization edition.

Important features of virtual machines

A typical virtual machine has the following hardware features.

  • The virtual machine's hardware configuration is based on the default hardware configuration settings.
  • There is one processor, with one core per processor. The execution mode for the virtualization engine is selected based on the host CPU and the guest operating system.
  • A single IDE CD/DVD drive is available, configured at power-on to auto-detect and connect to the physical drive on the host system.
  • A virtual network adapter is configured at power-on and uses network address translation (NAT). With NAT networking, the virtual machine shares the IP address of the host system.
  • It has one USB controller.
  • It has a sound card configured to use the default sound card on the host system.
  • It has one display configured to use the display settings on the host computer.

Some of the software features include:

  • The virtual machine is not encrypted.
  • Drag-and-drop, cut and paste features are available.
  • Remote access by VNC clients and shared folders are disabled.
What are Containers?

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, libraries, and settings.

Understanding Docker Container

Docker containers carry a lot less baggage than virtual machines. Let us understand each layer, starting from the bottom-most one.

  • Infrastructure: Similar to virtual machines, the infrastructure used in Docker containers can be your laptop or a server in the cloud.
  • Host Operating System: This can be anything which is capable of running Docker. You can run Docker on MacOS, Windows and Linux.
  • Docker Daemon: This is the replacement for the hypervisor. The Docker daemon is a service that runs in the background of the host operating system and manages the execution of, and interaction with, Docker containers.
  • BINS/LIBS: Similar to the equivalent layer in virtual machines, except that instead of running on a guest operating system, the binaries and libraries are built into special packages called Docker images, which the Docker daemon then runs.
  • Application: This is the ultimate destination of the Docker images, where applications are managed independently. Each application is packed with its library dependencies into its own Docker image and remains isolated.
Types of Container

Linux Containers (LXC) — LXC is the original Linux container technology. It is a Linux operating system level virtualization method which is used to run multiple isolated Linux systems on a single host.

Docker — Docker started as a project to build single-application LXC containers, making containers more flexible and portable to use. Docker acts as a Linux utility at a higher level and can efficiently create, ship, and run containers.

Benefits of Containers
  • It reduces IT management resources
  • It reduces the size of snapshots
  • It reduces and simplifies security updates
  • Needs less code in order to migrate, transfer, and upload workloads
Popular Container Providers
  1. Linux Containers
    • LXC
    • LXD
    • CGManager
  2. Docker
  3. Windows Server Containers
Docker vs Virtual Machines

How is a Docker Container different from a Virtual Machine?

  • Containers virtualize the user space of the operating system. Docker is a container-based technology built for running applications, and Docker containers share the host operating system's kernel (as the example below shows).
  • Virtual machines are not based on container technology. They are made up of the kernel space as well as the user space of an operating system. The server's hardware is virtualized, and each virtual machine runs its own operating system and apps while sharing hardware resources from the host.
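
A quick way to see this kernel sharing for yourself is to compare the kernel reported by the host with the one reported inside a container (a minimal sketch, assuming Docker is installed; on macOS and Windows the kernel shown is that of the lightweight Linux VM Docker Desktop runs containers in):

$ uname -r
$ docker container run --rm alpine:latest uname -r

On a Linux host both commands print the same kernel version, because the container reuses the host kernel rather than booting its own.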

Both virtual machines and Docker containers come with merits and demerits. Within a container environment, multiple workloads can run on a single operating system, which reduces IT management overhead, shrinks snapshot sizes, lets apps spin up more quickly, and means less code to transfer and simpler, smaller updates. Within a virtual machine environment, however, each workload needs a complete operating system.

Basic Differences between Virtual Machines and Containers

Uses for VMs vs Uses for Containers

Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs, but there are some general rules of thumb.

  • VMs are a better choice for running apps that require all of the operating system’s resources and functionality, when you need to run multiple applications on a server, or when you have a wide variety of operating systems to manage.
  • Containers are a better choice when your biggest priority is maximizing the number of applications running on a minimal number of servers.
Who wins amongst the two?

When To Use a Container vs. When to Use a Virtual Machine

Containers and virtual machines each thrive in different use cases. Let us look at some cases and see when a container is the better choice and when a virtual machine is.

  • Virtual machines take a good amount of time to boot and shut down: Spinning machines up and down frequently, or cloning them, is common in development and testing environments. If you have to do this regularly, Docker containers are the better choice over virtual machines.
  • Containers are geared towards Linux: Virtual machines are a better choice when you want to virtualize another operating system.
  • Docker does not have many automation and security features: Most fully fledged virtual machine management platforms provide a variety of automation features along with built-in security, from the kernel level up to the network switches.
Virtual Machine and Container Use Cases

There is a fundamental difference between how containers and virtual machines are used. Virtual machines are suited to virtualized environments, whereas containers use the underlying operating system and do not require a hypervisor.

Let us see some use cases:

Virtualized Environments

In a virtualized environment, multiple operating systems run on a hypervisor that manages the I/O of one particular machine. In a containerized environment, by contrast, the hardware is not virtualized and no hypervisor is used. That does not mean you cannot run a container in a virtual machine.

Indeed, you can run containers inside a virtual machine. Containers run on a single operating system, and since one physical system can host many containers, this is like mini-virtualization without a hypervisor. Hypervisors impose certain performance limitations and can also block access to certain server components, such as the networking controller.

DevOps

Containers are used in DevOps environments for develop-test-build cycles. Containers perform much faster than virtual machines, spin up and down more quickly, and have better access to system resources.

Containers are smaller in size, so a single server can host many of them, far more than it could virtual machines. This gives containers greater modularity than virtual machines. Using microservices, an app can be split across multiple containers, a combination that helps you avoid potential crashes and isolate problems.

Older Systems

Virtual machines are capable of hosting an older version of an operating system. Suppose an application was built for an operating system many years ago and is unlikely to run on a newer-generation operating system. In such cases, you can run the old operating system in a virtual machine and run the app on it without any changes.

More Secure Environments

Because containers interact frequently with the underlying operating system and with other containers, there is an associated security risk. In comparison to containers, virtual machines are considered the more secure environment.

Learn Data Science | How to Learn Data Science for Free

In this post, I describe a learning path and the free online courses and tutorials that will enable you to learn data science for free.

The average cost of obtaining a master’s degree at a traditional brick-and-mortar institution will set you back anywhere between $30,000 and $120,000. Even online data science degree programs don’t come cheap, costing a minimum of $9,000. So what do you do if you want to learn data science but can’t afford to pay this?

I trained into a career as a data scientist without taking any formal education in the subject. In this article, I am going to share with you my own personal curriculum for learning data science if you can’t or don’t want to pay thousands of dollars for more formal study.

The curriculum consists of 3 main parts: technical skills, theory and practical experience. I will include links to free resources for every element of the learning path, and will also include some links to additional ‘low cost’ options. So if you want to spend a little money to accelerate your learning, you can add these resources to the curriculum. I will include the estimated costs for each of these.

Technical skills

The first part of the curriculum focuses on technical skills. I recommend learning these first so that you can take a practical-first approach rather than, say, learning the mathematical theory first. Python is by far the most widely used programming language for data science. In the Kaggle Machine Learning and Data Science survey carried out in 2018, 83% of respondents said that they used Python on a daily basis. I would therefore recommend focusing on this language, but also spending a little time on other languages such as R.

Python Fundamentals

Before you can start to use Python for data science you need a basic grasp of the fundamentals of the language, so you will want to take an introductory Python course. There are lots of free ones out there, but I like the Codecademy ones best as they include hands-on in-browser coding throughout.

I would suggest taking the introductory course to learn Python. This covers basic syntax, functions, control flow, loops, modules and classes.
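
As a rough illustration of what that material covers (purely a sketch, not tied to any particular course), a few lines of Python touching functions, control flow, loops and classes look like this:

def describe(numbers, label="sample"):
    # A function with a default argument
    return f"{label}: {len(numbers)} values, total {sum(numbers)}"

# A loop written as a list comprehension, plus simple control flow
evens = [n for n in range(10) if n % 2 == 0]
if evens:
    print(describe(evens, label="even numbers"))

class Counter:
    # A small class with state and a method
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

c = Counter()
print(c.increment())  # prints 1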

Data analysis with Python

Next, you will want to get a good understanding of using Python for data analysis. There are a number of good resources for this.

To start with, I suggest taking at least the free parts of the data analyst learning path on dataquest.io. Dataquest offers complete learning paths for data analyst, data scientist and data engineer roles. Quite a lot of the content, particularly on the data analyst path, is available for free. If you do have some money to put towards learning, then I strongly suggest putting it towards a few months of the premium subscription. I took this course and it provided a fantastic grounding in the fundamentals of data science. It took me 6 months to complete the data scientist path. The price varies from $24.50 to $49 per month depending on whether you pay annually or not; the annual subscription is better value if you can afford it.

The Dataquest platform

Python for machine learning

If you have chosen to pay for the full data science course on Dataquest, then you will already have a good grasp of the fundamentals of machine learning with Python. If not, there are plenty of other free resources. To start with, I would focus on scikit-learn, which is by far the most commonly used Python library for machine learning.

When I was learning, I was lucky enough to attend a two-day workshop run by Andreas Mueller, one of the core developers of scikit-learn. He has, however, published all the material from this course, and others, in this GitHub repo. It consists of slides, course notes and notebooks that you can work through. I would definitely recommend working through this material.

Then I would suggest taking some of the tutorials in the scikit-learn documentation. After that, I would suggest building some practical machine learning applications and learning the theory behind how the models work — which I will cover a bit later on.
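
To give a flavour of what those tutorials cover, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the dataset and model are just illustrative choices) that trains and evaluates a classifier on the built-in iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small toy dataset and hold out a quarter of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit a model on the training data and score it on the held-out data
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))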

SQL

SQL is a vital skill to learn if you want to become a data scientist, as one of the fundamental steps in data modelling is extracting the data in the first place. This will more often than not involve running SQL queries against a database. Again, if you haven't opted to take the full Dataquest course, here are a few free resources for learning this skill.

Codecademy has a free introduction to SQL course. Again, this is very practical, with in-browser coding all the way through. If you also want to learn about cloud-based database querying, then Google Cloud BigQuery is very accessible. There is a free tier so you can try queries at no cost, an extensive range of public datasets to explore, and very good documentation.

Codecademy SQL course
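
If you would like to practise SQL without setting up a database server, Python’s built-in sqlite3 module is enough to try out the basics; the table and figures below are made up purely for illustration:

import sqlite3

# An in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 45.0)],
)

# A typical aggregation query: total sales per region
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)

conn.close()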

R

To be a well-rounded data scientist it is a good idea to diversify a little beyond Python. I would therefore suggest also taking an introductory course in R. Codecademy has an introductory course on their free plan. It is worth noting here that, similar to Dataquest, Codecademy also offers a complete data science learning plan as part of their pro account (this costs from $15.99 to $31.99 per month depending on how many months you pay for up front). I personally found the Dataquest course to be much more comprehensive, but this may work out a little cheaper if you are looking to follow a learning path on a single platform.

Software engineering

It is a good idea to get a grasp of software engineering skills and best practices. This will help your code to be more readable and extensible both for yourself and others. Additionally, when you start to put models into production you will need to be able to write good quality well-tested code and work with tools like version control.

There are two great free resources for this. Python Like You Mean It covers things like the PEP8 style guide and documentation, and also covers object-oriented programming really well.

The scikit-learn contribution guidelines, although written to facilitate contributions to the library, actually cover best practices really well. They cover topics such as GitHub, unit testing and debugging, all written in the context of a data science application.

Deep learning

For a comprehensive introduction to deep learning, I don’t think that you can get any better than the totally free and totally ad-free fast.ai. This course includes an introduction to machine learning, practical deep learning, computational linear algebra and a code-first introduction to natural language processing. All their courses have a practical first approach and I highly recommend them.

Fast.ai platform

Theory

Whilst you are learning the technical elements of the curriculum, you will encounter some of the theory behind the code you are implementing. I recommend learning the theoretical elements alongside the practical ones. The way I do this is to learn the code needed to implement a technique, let’s take KMeans as an example, and once I have something working, to look deeper into concepts such as inertia. Again, the scikit-learn documentation contains all the mathematical concepts behind the algorithms.
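
As a concrete version of that workflow, the sketch below (assuming NumPy and scikit-learn are installed, with made-up data) fits KMeans and then inspects inertia_, the within-cluster sum of squared distances that the theory explains:

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of points, made up for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Fit the model first, then dig into the concept behind it: inertia
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", kmeans.cluster_centers_)
print("inertia (within-cluster sum of squared distances):", kmeans.inertia_)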

In this section, I will introduce the key foundational elements of theory that you should learn alongside the more practical elements.

Khan Academy covers almost all of the concepts listed below for free. You can tailor the subjects you would like to study when you sign up, and you then have a nicely tailored curriculum for this part of the learning path. Checking all of the relevant boxes will give you an overview of most of the elements listed below.

Maths

Calculus

Calculus is defined by Wikipedia as “the mathematical study of continuous change.” In other words, calculus can find patterns between functions; for example, in the case of derivatives, it can help you understand how a function changes over time.

Many machine learning algorithms use calculus to optimise the performance of models. If you have studied even a little machine learning, you will probably have heard of gradient descent, which works by iteratively adjusting the parameter values of a model to find the values that minimise the cost function. Gradient descent is a good example of how calculus is used in machine learning, as the sketch below shows.
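
Here is a toy sketch of that idea in plain NumPy (the data and learning rate are made up for illustration): gradient descent fits the slope and intercept of a line by repeatedly stepping the parameters in the direction that reduces the mean squared error.

import numpy as np

# Synthetic data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100)
y = 2 * x + 1 + rng.normal(0, 0.05, size=x.shape)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.5          # learning rate (step size)
for _ in range(1000):
    error = (w * x + b) - y
    # Partial derivatives of the mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # should end up close to 2 and 1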

What you need to know:

Derivatives

  • Geometric definition
  • Calculating the derivative of a function
  • Nonlinear functions

Chain rule

  • Composite functions
  • Composite function derivatives
  • Multiple functions

Gradients

  • Partial derivatives
  • Directional derivatives
  • Integrals

Linear Algebra

Many popular machine learning methods, including XGBoost, use matrices to store inputs and process data. Matrices, alongside vector spaces and linear equations, form the mathematical branch known as linear algebra. To understand how many machine learning methods work, it is essential to get a good grounding in this field; the lists below cover the essentials, and a short NumPy example follows them.

What you need to learn:

Vectors and spaces

  • Vectors
  • Linear combinations
  • Linear dependence and independence
  • Vector dot and cross products

Matrix transformations

  • Functions and linear transformations
  • Matrix multiplication
  • Inverse functions
  • Transpose of a matrix
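
To connect these topics to code, here is a short, illustrative NumPy session (NumPy assumed to be installed) covering dot and cross products, matrix multiplication, transposes and inverses:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print("dot product:", np.dot(a, b))      # 32.0
print("cross product:", np.cross(a, b))  # [-3.  6. -3.]

A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print("matrix product:\n", A @ B)
print("transpose:\n", A.T)
print("inverse:\n", np.linalg.inv(A))    # A @ inv(A) gives the identity matrix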

Statistics

Here is a list of the key concepts you need to know:

Descriptive/Summary statistics

  • How to summarise a sample of data
  • Different types of distributions
  • Skewness, kurtosis, central tendency (e.g. mean, median, mode)
  • Measures of dependence, and relationships between variables such as correlation and covariance (see the short example below)
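
A small pandas sketch (with made-up numbers) shows how most of these summary statistics are computed in practice:

import pandas as pd

# A tiny made-up sample: hours studied and exam scores
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 74, 78, 85],
})

print(df.describe())                  # count, mean, std, min, quartiles, max
print("median hours:", df["hours"].median())
print("skewness of scores:", df["score"].skew())
print("correlation:\n", df.corr())    # relationship between the two variables
print("covariance:\n", df.cov())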

Experiment design

  • Hypothesis testing
  • Sampling
  • Significance tests
  • Randomness
  • Probability
  • Confidence intervals and two-sample inference

Machine learning

  • Inference about slope
  • Linear and non-linear regression
  • Classification

Practical experience

The third section of the curriculum is all about practice. In order to truly master the concepts above, you will need to use these skills on projects that ideally closely resemble real-world applications. By doing this you will encounter the problems you only meet in practice, such as missing and erroneous data, and develop a deep level of expertise in the subject. In this last section, I will list some good places where you can get this practical experience for free.

“With deliberate practice, however, the goal is not just to reach your potential but to build it, to make things possible that were not possible before. This requires challenging homeostasis — getting out of your comfort zone — and forcing your brain or your body to adapt.”, Anders Ericsson, Peak: Secrets from the New Science of Expertise

Kaggle, et al

Machine learning competitions are a good place to practise building machine learning models. They give access to a wide range of data sets, each with a specific problem to solve, and have a leaderboard. The leaderboard is a good way to benchmark how good you actually are at developing models and where you may need to improve further.

In addition to Kaggle, there are other platforms for machine learning competitions including Analytics Vidhya and DrivenData.

Driven data competitions page

UCI Machine Learning Repository

The UCI Machine Learning Repository is a large source of publicly available data sets. You can use these data sets to put together your own data projects; this could include data analysis and machine learning models, and you could even try building a deployed model with a web front end. It is a good idea to store your projects somewhere public, such as GitHub, as this creates a portfolio showcasing your skills to use for future job applications.


UCI repository

Contributions to open source

One other option to consider is contributing to open source projects. There are many Python libraries that rely on the community to maintain them, and there are often hackathons held at meetups and conferences where even beginners can join in. Attending one of these events would certainly give you some practical experience and an environment where you can learn from others whilst giving something back at the same time. NumFOCUS is a good example of an organisation supporting projects like this.

In this post, I have described a learning path and free online courses and tutorials that will enable you to learn data science for free. Showcasing what you are able to do in the form of a portfolio is a great tool for future job applications in lieu of formal qualifications and certificates. I really believe that education should be accessible to everyone and, certainly, for data science at least, the internet provides that opportunity. In addition to the resources listed here, I have previously published a recommended reading list for learning data science available here. These are also all freely available online and are a great way to complement the more practical resources covered above.

Thanks for reading!

Data Science vs Data Analytics vs Big Data

When we talk about data processing, Data Science, Big Data, and Data Analytics are the terms that come to mind, and there has always been confusion between them. In this article on Data Science vs Big Data vs Data Analytics, we will look at the similarities and differences between them.

We live in a data-driven world. In fact, the amount of digital data that exists is growing at a rapid rate, doubling every two years, and changing the way we live. Now that Hadoop and other frameworks have resolved the problem of storage, the main focus has shifted to processing this huge amount of data. When we talk about data processing, Data Science, Big Data, and Data Analytics are the terms that come to mind, and there has always been confusion between them.

In this article on Data Science vs Data Analytics vs Big Data, I will be covering the following topics in order to make you understand the similarities and differences between them:

  • Introduction to Data Science, Big Data & Data Analytics
  • What does a Data Scientist, Big Data Professional & Data Analyst do?
  • Skill-set required to become a Data Scientist, Big Data Professional & Data Analyst
  • What is the salary prospect?
  • Real-time use-case

Introduction to Data Science, Big Data, & Data Analytics

Let’s begin by understanding the terms Data Science vs Big Data vs Data Analytics.

What Is Data Science?

Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data.


It involves solving a problem in various ways to arrive at a solution, and it also involves designing and constructing new processes for data modeling and production using prototypes, algorithms, predictive models, and custom analysis.

What is Big Data?

Big Data refers to the huge volumes of data pouring in from various data sources in different formats. It can be analyzed for insights that lead to better decisions and strategic business moves.


What is Data Analytics?

Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information. It is all about discovering useful information from the data to support decision-making. This process involves inspecting, cleansing, transforming & modeling data.


What Does Data Scientist, Big Data Professional & Data Analyst Do?

What does a Data Scientist do?

Data Scientists perform an exploratory analysis to discover insights from the data. They also use various advanced machine learning algorithms to identify the occurrence of a particular event in the future. This involves identifying hidden patterns, unknown correlations, market trends and other useful business information.

Roles of Data Scientist

What do Big Data Professionals do?

The responsibilities of a big data professional revolve around dealing with huge amounts of heterogeneous data, gathered from various sources and arriving at high velocity.

Roles of Big Data Professional

Big data professionals describe the structure and behavior of a big data solution and how it can be delivered using big data technologies such as Hadoop, Spark, and Kafka, based on the requirements.

What does a Data Analyst do?

Data analysts translate numbers into plain English. Every business collects data, like sales figures, market research, logistics, or transportation costs. A data analyst’s job is to take that data and use it to help companies to make better business decisions.

Roles of Data Analyst

Skill-Set Required To Become Data Scientist, Big Data Professional, & Data Analyst

What Is The Salary Prospect?

The below figure shows the average salary structure of Data Scientists, Big Data Specialists, and Data Analysts.

A Scenario Illustrating The Use Of Data Science vs Big Data vs Data Analytics.

Now, let’s try to understand how we can garner benefits by combining all three of them.

Let’s take an example of Netflix and see how they join forces in achieving the goal.

First, let’s understand the role of the Big Data Professional in the Netflix example.

Netflix generates a huge amount of unstructured data in the form of text, audio, video files and more. If we try to process this dark (unstructured) data using the traditional approach, it becomes a complicated task.

Approach in Netflix

Traditional Data Processing

Hence a Big Data Professional designs and creates an environment using Big Data tools to ease the processing of Netflix Data.

Big Data approach to process Netflix data

Now, let’s see how a Data Scientist optimizes the Netflix streaming experience.

Role of Data Scientist in Optimizing the Netflix streaming experience

1. Understanding the impact of QoE on user behavior

User behavior refers to the way a user interacts with the Netflix service, and data scientists use the data to both understand and predict behavior. For example, how would a change to the Netflix product affect the number of hours that members watch? To improve the streaming experience, data scientists look at QoE metrics that are likely to have an impact on user behavior. One metric of interest is the rebuffer rate, which is a measure of how often playback is temporarily interrupted. Another metric is bitrate, which refers to the quality of the picture that is served and seen; a very low bitrate corresponds to a fuzzy picture.

2. Improving the streaming experience

How do Data Scientists use data to provide the best user experience once a member hits “play” on Netflix?

One approach is to look at the algorithms that run in real-time or near real-time once playback has started, which determine what bitrate should be served, what server to download that content from, etc.

For example, a member with a high-bandwidth connection on a home network could have very different expectations and experience compared to a member with low bandwidth on a mobile device on a cellular network.

By determining all these factors one can improve the streaming experience.

3. Optimize content caching

A set of big data problems also exists on the content delivery side.

The key idea here is to locate the content closer (in terms of network hops) to Netflix members to provide a great experience. By viewing the behavior of the members being served and the experience, one can optimize the decisions around content caching.

4. Improving content quality

Another approach to improving user experience involves looking at the quality of content, i.e. the video, audio, subtitles, closed captions, etc. that are part of the movie or show. Netflix receives content from the studios in the form of digital assets that are then encoded and quality checked before they go live on the content servers.

In addition to the internal quality checks, data scientists also receive feedback from members when they discover issues while viewing.

By combining member feedback with intrinsic factors related to viewing behavior, they build models to predict whether a particular piece of content has a quality issue. Machine learning models, along with natural language processing (NLP) and text mining techniques, can be used both to improve the quality of content before it goes live and to use the information provided by Netflix users to close the loop on quality, replacing content that does not meet users’ expectations.

So this is how Data Scientist optimizes the Netflix streaming experience.

Now let’s understand how Data Analytics is used to drive Netflix’s success.

Role of Data Analyst in Netflix

The above figure shows the different types of users who watch videos on Netflix. Each of them has their own choices and preferences.

So what does a Data Analyst do?

The data analyst creates a user stream based on the preferences of users. For example, if user 1 and user 2 have the same preference or choice of video, then the data analyst creates a user stream for those choices. The data analyst also:

  • Orders the Netflix collection for each member profile in a personalized way.
  • Accounts for the fact that the same genre row for each member has an entirely different selection of videos.
  • Picks out the top personalized recommendations from the entire catalog, focusing on the titles that rank highest.
  • Surfaces trending videos by capturing all events and user activities on Netflix.
  • Sorts the recently watched titles and estimates whether the member will continue to watch, rewatch, or stop watching.
I hope you now understand the differences and similarities between Data Science, Big Data, and Data Analytics.

How to Build a Python Data Science Container Using Docker

In this article we will build a Python data science container - let's get started...

Artificial Intelligence (AI) and Machine Learning (ML) are literally on fire these days, powering a wide spectrum of use-cases ranging from self-driving cars to drug discovery and beyond. AI and ML have a bright and thriving future ahead of them.

On the other hand, Docker revolutionized the computing world through the introduction of ephemeral, lightweight containers. Containers package all the software required to run inside an image (a bunch of read-only layers), with a COW (copy-on-write) layer to persist the data.

Python Data Science Packages

Our Python data science container makes use of the following super cool Python packages (a quick smoke test of these packages follows the list):

  1. NumPy: NumPy or Numeric Python supports large, multi-dimensional arrays and matrices. It provides fast precompiled functions for mathematical and numerical routines. In addition, NumPy optimizes Python programming with powerful data structures for efficient computation of multi-dimensional arrays and matrices.

  2. SciPy: SciPy provides useful functions for regression, minimization, Fourier-transformation, and many more. Based on NumPy, SciPy extends its capabilities. SciPy’s main data structure is again a multidimensional array, implemented by Numpy. The package contains tools that help with solving linear algebra, probability theory, integral calculus, and many more tasks.

  3. Pandas: Pandas offer versatile and powerful tools for manipulating data structures and performing extensive data analysis. It works well with incomplete, unstructured, and unordered real-world data — and comes with tools for shaping, aggregating, analyzing, and visualizing datasets.

  4. SciKit-Learn: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. It is one of the best-known machine learning libraries for Python. The scikit-learn package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. The primary emphasis is on ease of use, performance, documentation, and API consistency. With minimal dependencies and easy distribution under the simplified BSD license, scikit-learn is widely used in academic and commercial settings. Scikit-learn exposes a concise and consistent interface to the common machine learning algorithms, making it simple to bring ML into production systems.

  5. Matplotlib: Matplotlib is a Python 2D plotting library, capable of producing publication quality figures in a wide variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

  6. NLTK: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
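
Once the container described below is running, a quick way to confirm that these packages are importable and working is a short smoke-test script along these lines (the exact version numbers you see will of course vary):

import matplotlib
import nltk
import numpy as np
import pandas as pd
import scipy
import sklearn

print("numpy", np.__version__, "scipy", scipy.__version__)
print("pandas", pd.__version__, "scikit-learn", sklearn.__version__)
print("matplotlib", matplotlib.__version__, "nltk", nltk.__version__)

# A tiny end-to-end check: build a DataFrame and compute a mean with NumPy
df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
print("mean of x:", np.mean(df["x"].to_numpy()))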

Building the Data Science Container

Python is fast becoming the go-to language for data scientists and for this reason we are going to use Python as the language of choice for building our data science container.

The Base Alpine Linux Image

Alpine Linux is a tiny Linux distribution designed for power users who appreciate security, simplicity and resource efficiency.

As claimed by Alpine:

Small. Simple. Secure. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

The Alpine image is surprisingly tiny, with a size of no more than 8 MB, and it ships with minimal packages installed to reduce the attack surface of the container. This makes Alpine the image of choice for our data science container.

Downloading and Running an Alpine Linux container is as simple as:

$ docker container run --rm alpine:latest cat /etc/os-release

In our Dockerfile, we can simply use the Alpine base image as:

FROM alpine:latest

Talk is cheap, let’s build the Dockerfile

Now let’s work our way through the Dockerfile.

The FROM directive is used to set alpine:latest as the base image. Using the WORKDIR directive we set /var/www as the working directory for our container. The ENV PACKAGES line lists the system packages required by our container, such as git, blas and libgfortran, while the Python packages for our data science container are defined in a similar ENV line.

We have combined all the commands under a single Dockerfile RUN directive to reduce the number of layers, which in turn helps reduce the resulting image size.
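
As a rough sketch of the structure just described (the exact Alpine package names, the PYTHON_PACKAGES variable name, and the use of Python 3 rather than 2.7 are assumptions of mine; very recent Alpine and pip releases may also need pip's --break-system-packages flag, and building SciPy and friends from source on Alpine's musl libc can take a while), a Dockerfile of this shape looks like:

FROM alpine:latest

WORKDIR /var/www

# System packages needed to build the scientific Python stack
# (names are illustrative and may differ between Alpine releases)
ENV PACKAGES="\
    build-base \
    python3 \
    python3-dev \
    py3-pip \
    openblas-dev \
    gfortran \
    freetype-dev \
    git \
    "

# Python data science packages to install with pip
ENV PYTHON_PACKAGES="\
    numpy \
    scipy \
    pandas \
    scikit-learn \
    matplotlib \
    nltk \
    "

# A single RUN directive keeps the layer count, and hence the image size, down
RUN apk add --no-cache $PACKAGES && \
    pip3 install --no-cache-dir $PYTHON_PACKAGES

CMD ["python3"]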

Building and tagging the image

Now that we have our Dockerfile defined, navigate to the folder with the Dockerfile using the terminal and build the image using the following command:

$ docker build -t faizanbashir/python-datascience:2.7 -f Dockerfile .

The -t flag is used to name and tag the image in the 'name:tag' format. The -f flag is used to specify the name of the Dockerfile (the default is 'PATH/Dockerfile').

Running the container

We have successfully built and tagged the docker image, now we can run the container using the following command:

$ docker container run --rm -it faizanbashir/python-datascience:2.7 python

Voila, we are greeted by the sight of a python shell ready to perform all kinds of cool data science stuff.

Python 2.7.15 (default, Aug 16 2018, 14:17:09) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>

Our container comes with Python 2.7, but don’t be sad if you want to work with Python 3.6. Lo and behold, the Dockerfile for Python 3.6:

Build and tag the image like so:

$ docker build -t faizanbashir/python-datascience:3.6 -f Dockerfile .

Run the container like so:

$ docker container run --rm -it faizanbashir/python-datascience:3.6 python

With this, you have a ready to use container for doing all kinds of cool data science stuff.

Serving Puddin’

This assumes you have the time and resources to set all this stuff up yourself. In case you don’t, you can pull the existing images that I have already built and pushed to Docker’s registry, Docker Hub, using:

# For Python 2.7 pull
$ docker pull faizanbashir/python-datascience:2.7
# For Python 3.6 pull
$ docker pull faizanbashir/python-datascience:3.6

After pulling the images you can use an image as-is, extend it in your own Dockerfile, or use it as a base image in your docker-compose or stack file, for example:
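
For instance, a minimal Dockerfile that builds on the pulled image might look like this (requirements.txt and analysis.py are hypothetical placeholders for your own project files, and this assumes pip is available in the base image):

FROM faizanbashir/python-datascience:3.6

WORKDIR /app

# Layer your own dependencies and code on top of the data science base image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
CMD ["python", "analysis.py"]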

Aftermath

The world of AI and ML is getting pretty exciting these days and will only become more exciting. Big players are investing heavily in these domains. It’s about time you started harnessing the power of data; who knows, it might lead to something wonderful.

You can check out the code here.