Nina Diana

Quality Control Your Next Pyspark Dataframe

When it comes to data consistency and quality control in data engineering, it is always one of the most troublesome tasks. I suppose, as data engineers, we have all experienced data inconsistency issues: matching text strings like NYC, new york, and new york city; trying to clip the values of some columns into a specific range without knowing the business context in the first place; or an ML model deployed into production that underperforms because of unexpected drift in the input dataset that you had not observed before training. Normally, in an organization, a data catalog can increase data transparency and help us understand these issues. However, that service does not come for free, and it works best with enterprise data sources like ERP, CRM, or standard RDBMS systems. Unfortunately, those are not the only types of data sources a data engineer has to deal with on a daily basis, especially when we use PySpark for data processing.

What are the alternatives? I found two great projects that can help statistically summarize a dataframe: Deequ from AWS and Great Expectations. Both tools can perform rule-based data profiling operations on a dataframe and generate data validation reports. Deequ is meant to be used mainly in a Scala environment: you define an "AnalysisRunner" object and add a series of predefined analyzers such as compliance, size, completeness, uniqueness, etc. A sample analyzer setup is sketched below.
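
Since this post works with PySpark, here is a minimal sketch of that kind of analyzer setup using PyDeequ, the Python companion of the Scala Deequ library, rather than the original Scala API. The input file customers.parquet and the column names city and customer_id are hypothetical placeholders, and the session configuration may vary with your Spark and PyDeequ versions.

import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness, Uniqueness
from pyspark.sql import SparkSession

# PyDeequ needs the Deequ jar on the Spark classpath; the package exposes the Maven coordinates.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("customers.parquet")  # hypothetical input dataset

# Chain a series of predefined analyzers and run them against the dataframe.
result = (AnalysisRunner(spark)
          .onData(df)
          .addAnalyzer(Size())                        # row count
          .addAnalyzer(Completeness("city"))          # fraction of non-null values
          .addAnalyzer(Uniqueness(["customer_id"]))   # fraction of unique values
          .run())

# Collect the computed metrics as a Spark dataframe, ready for a validation report.
AnalyzerContext.successMetricsAsDataFrame(spark, result).show()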

#data-engineering #data-profiling #pyspark #python


Kasey Turcotte

Pandas DataFrame vs. Spark DataFrame: When Parallel Computing Matters

With Performance Comparison Analysis and Guided Example of Animated 3D Wireframe Plot

Python is famous for its vast selection of libraries and resources from the open-source community. As a Data Analyst/Engineer/Scientist, one might be familiar with popular packages such as NumPy, Pandas, Scikit-learn, Keras, and TensorFlow. Together these modules help us extract value out of data and propel the field of analytics. As data continues to become larger and more complex, one other element to consider is a framework dedicated to processing Big Data, such as Apache Spark. In this article, I will demonstrate the capabilities of distributed/cluster computing and present a comparison between the Pandas DataFrame and the Spark DataFrame. My hope is to provide more conviction on choosing the right implementation.

Pandas DataFrame

Pandas has become very popular for its ease of use. It utilizes DataFrames to present data in a tabular format, like a spreadsheet with rows and columns. Importantly, it has very intuitive methods to perform common analytical tasks and a relatively flat learning curve. It loads all of the data into memory on a single machine (one node) for rapid execution. While the Pandas DataFrame has proven to be tremendously powerful in manipulating data, it does have its limits. With data growing at an exponential rate, complex data processing becomes expensive to handle and causes performance degradation. These operations require parallelization and distributed computing, which the Pandas DataFrame does not support.

Introducing Cluster/Distribution Computing and Spark DataFrame

Apache Spark is an open-source cluster computing framework. With cluster computing, data processing is distributed and performed in parallel by multiple nodes. This is recognized as the MapReduce framework because the division of labor can usually be characterized by sets of the map, shuffle, and reduce operations found in functional programming. Spark's implementation of cluster computing is unique because processes 1) are executed in memory and 2) build up a query plan which does not execute until necessary (known as lazy execution). Although Spark's cluster computing framework has a broad range of utility, we only look at the Spark DataFrame for the purposes of this article. Similar to those found in Pandas, the Spark DataFrame has intuitive APIs, making it easy to implement. A minimal sketch of this eager-versus-lazy difference follows.
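
As a rough illustration (not taken from the original article), the snippet below contrasts the two models: Pandas executes each operation immediately in memory on one machine, while the Spark DataFrame only records transformations in a query plan until an action such as count() triggers distributed execution. The column names and the application name are arbitrary placeholders.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("eager-vs-lazy").getOrCreate()

# Pandas: the multiplication runs immediately, in memory, on a single node.
pdf = pd.DataFrame({"x": range(1_000_000)})
pdf["doubled"] = pdf["x"] * 2

# Spark: withColumn only adds a step to the lazy query plan; nothing executes yet.
sdf = spark.createDataFrame(pdf)
sdf = sdf.withColumn("doubled", F.col("x") * 2)

# The action below forces Spark to execute the whole plan across the cluster.
print(sdf.count())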

#pandas dataframe vs. spark dataframe: when parallel computing matters #pandas #pandas dataframe #pandas dataframe vs. spark dataframe #spark #when parallel computing matters

Aarna Davis

Top Software Testing/ QA Company | Software Quality Assurance Services

We are a top-rated software quality assurance and testing company, leveraging our profound expertise to deliver quality-tested applications to businesses.

In the past 16 years, we have delivered over 4200 quality-assured software solutions to a global clientele, catering to various industries such as healthcare, adtech, eLearning, and more.

Planning to outsource software QA services? Or would you like to hire an offshore software testing team?

Visit: https://www.valuecoders.com/software-quality-assurance-testing-services-company

#software quality assurance testing services #software quality assurance services #quality assurance testing services #quality assurance software testing company #quality assurance software testing

Callum Slater

PySpark Cheat Sheet: Spark DataFrames in Python

This PySpark SQL cheat sheet is your handy companion to Apache Spark DataFrames in Python and includes code samples.

You'll probably already know about Apache Spark, the fast, general, open-source engine for big data processing; it has built-in modules for streaming, SQL, machine learning, and graph processing. Spark can run analytic applications up to 100 times faster than other technologies on the market today. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

Now, it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is not only available in Python, but also in Scala, Java, and R.

Without further ado, here's the cheat sheet:

PySpark SQL cheat sheet

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, and grouping, filtering or sorting data. You'll also see how to run SQL queries programmatically, how to save your data to Parquet and JSON files, and how to stop your SparkSession.

Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession 
 

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
     .builder\
     .appName("Python Spark SQL basic example") \
     .config("spark.some.config.option", "some-value") \
     .getOrCreate()

Creating DataFrames
 

From RDDs

>>> from pyspark.sql import Row
>>> from pyspark.sql.types import *

Infer Schema

>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)

Specify Schema

>>> people = parts.map(lambda p: Row(name=p[0],
               age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()

 

From Spark Data Sources
JSON

>>> df = spark.read.json("customer.json")
>>> df.show()

>>> df2 = spark.read.load("people.json", format="json")

Parquet files

>>> df3 = spark.read.load("users.parquet")

TXT files

>>> df4 = spark.read.text("people.txt")

Filter 

#Filter entries of age, only keep those records of which the values are >24
>>> df.filter(df["age"]>24).show()

Duplicate Values 

>>> df = df.dropDuplicates()

Queries 
 

>>> from pyspark.sql import functions as F

Select

>>> df.select("firstName").show() #Show all entries in firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName", #Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) #Show all entries in firstName and age, .show() add 1 to the entries of age
>>> df.select(df['age'] > 24).show() #Show all entries where age >24

When

>>> df.select("firstName", #Show firstName and 0 or 1 depending on age >30
               F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
#Show firstName if in the given options
>>> df[df.firstName.isin("Jane","Boris")] \
      .collect()

Like 

>>> df.select("firstName", #Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
     .show()

Startswith - Endswith 

>>> df.select("firstName", #Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
                .startswith("Sm")) \
      .show()
#Show last names ending in th
>>> df.select(df.lastName.endswith("th")) \
      .show()

Substring 

#Return substrings of firstName
>>> df.select(df.firstName.substr(1, 3) \
                          .alias("name")) \
      .collect()

Between 

#Show age: values are TRUE if between 22 and 24
>>> df.select(df.age.between(22, 24)) \
      .show()

Add, Update & Remove Columns 

Adding Columns

>>> df = df.withColumn('city',df.address.city) \
           .withColumn('postalCode',df.address.postalCode) \
           .withColumn('state',df.address.state) \
           .withColumn('streetAddress',df.address.streetAddress) \
           .withColumn('telePhoneNumber', F.explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', F.explode(df.phoneNumber.type))

Updating Columns

>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')

Removing Columns

  >>> df = df.drop("address", "phoneNumber")
 >>> df = df.drop(df.address).drop(df.phoneNumber)
 

Missing & Replacing Values 
 

>>> df.na.fill(50).show() #Replace null values
>>> df.na.drop().show() #Return new df omitting rows with null values
#Return new df replacing one value with another
>>> df.na \
      .replace(10, 20) \
      .show()

GroupBy 

>>> df.groupBy("age")\ #Group by age, count the members in the groups
      .count() \
      .show()

Sort 
 

>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1])\
     .collect()

Repartitioning 

#df with 10 partitions
>>> df.repartition(10) \
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions() #df with 1 partition

Running Queries Programmatically 
 

Registering DataFrames as Views

>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")

Query Views

>>> df5 = spark.sql("SELECT * FROM customer")
>>> df5.show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")
>>> peopledf2.show()

Inspect Data 
 

>>> df.dtypes #Return df column names and data types
>>> df.show() #Display the content of df
>>> df.head() #Return first n rows
>>> df.first() #Return first row
>>> df.take(2) #Return the first n rows
>>> df.schema #Return the schema of df
>>> df.describe().show() #Compute summary statistics
>>> df.columns #Return the columns of df
>>> df.count() #Count the number of rows in df
>>> df.distinct().count() #Count the number of distinct rows in df
>>> df.printSchema() #Print the schema of df
>>> df.explain() #Print the (logical and physical) plans

Output

Data Structures 
 

>>> rdd1 = df.rdd #Convert df into an RDD
>>> df.toJSON().first() #Convert df into an RDD of strings
>>> df.toPandas() #Return the contents of df as a Pandas DataFrame

Write & Save to Files 

>>> df.select("firstName", "city")\
       .write \
       .save("nameAndCity.parquet")
 >>> df.select("firstName", "age") \
       .write \
       .save("namesAndAges.json",format="json")

Stopping SparkSession 

>>> spark.stop()

Have this Cheat Sheet at your fingertips

Original article source at https://www.datacamp.com

#pyspark #cheatsheet #spark #dataframes #python #bigdata

Madelyn Frami

10 Open Source/Commercial Control Panels for Virtual Machine (VM) Management

Automatic creation and management of virtual machines is a topical issue for any company that provides VPS services. If you manage a large number of machines, the command line is definitely not the only tool you need to perform various operations, including client tasks, because such operations can be time-consuming.

In order to simplify routine tasks of server administrators and users, various companies develop control panels for virtual machines management, including interface-based solutions.

Don’t Miss: 20 Open Source/Commercial Control Panels to Manage Linux Servers

A control panel empowers you to perform any operation with a mouse click, whereas it would take you a good deal of time to complete the same task in the console. With a control panel, you will save your time and effort. However, it’s not all that simple.

Nowadays, VMmanager is the most popular software product for small and medium-sized businesses. VMware, in its turn, is a leading solution for large organizations. Both software products are commercial and rather expensive.

They deliver a large number of functions; however, some companies, especially startups, may not need all of them. Besides, many of them cannot afford such an expensive product. For example, startups and companies in times of crisis may experience financial difficulties. Moreover, one can find interesting, outstanding solutions integrated with billing systems, including tools for VM management.

How do you avoid getting lost among such a great number of offers? We decided to help our users and wrote the following article, in which they will find answers to this question.

In this article, we will describe control panels for virtual machine management, both commercial and open source, and help you choose the right solution to meet your needs.

1. VMmanager

VMmanager is one of the most popular commercial server virtualization platforms, based on QEMU/KVM technology. The solution has a rich feature set that can suit the needs of both IT infrastructure owners and VPS service providers.

Virtual servers can be created within 2 minutes. Many routine tasks are performed automatically, including migration, cloning, reinstalling the OS, backups, adding and deleting interfaces, virtual server image creation, monitoring, statistics collection, server provisioning, etc.

The main advantages of VMmanager are:

  • Centralized management of various clusters.
  • Fault tolerance due to a microservice architecture.
  • Overselling, which helps to improve VPS provider’s equipment efficiency.
  • Complete control of the infrastructure thanks to a robust system of metrics collection.
  • A modern and intuitive interface.

VMmanager – Virtualization Management Platform

2. VMware vSphere

VMware vSphere is the world’s leading server virtualization platform for building cloud infrastructure. With its many powerful features, vSphere is a truly state-of-the-art virtual machine management platform. It is an ideal solution for large VPS providers with appropriate budgets and professional staff.

VMWare vSphere - Server Virtualization Platform

#control panels #virtualization #hosting control panel #linux control panels #virtual machine control panels #linux