Alfie Kemp

1575987286

Compare CuDF and Modin

With Pandas, by default we can only use a single CPU core at a time. This is fine for small datasets, but when working with larger files this can create a bottleneck.

It’s possible to speed things up with parallel processing, but if you have never written a multithreaded program, don’t worry: you don’t need to learn how. Some new libraries can do it for us. Today we’re going to compare two of them: cuDF and Modin. Both expose pandas-like APIs, so in many cases we can start using them just by changing the import statement.
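To make the "just change the import" idea concrete, here is a minimal sketch that picks whichever pandas-like library is installed. The helper name `pick_backend` and the preference order are assumptions made for illustration; only the module names (`cudf`, `modin.pandas`, `pandas`) come from the libraries themselves.

```python
import importlib

# Candidate modules in order of preference. The ordering is only an
# illustration, not a recommendation from either project.
DEFAULT_BACKENDS = ("cudf", "modin.pandas", "pandas")


def pick_backend(candidates=DEFAULT_BACKENDS):
    """Return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Library not installed in this environment; try the next one.
            continue
    raise ImportError("none of {} could be imported".format(candidates))
```

With this in place, `pd = pick_backend()` followed by `pd.read_csv(...)` would run on whichever library is available. The sketch only catches `ImportError`, so a library that is installed but fails for another reason (for example, no usable GPU) would still raise.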

cuDF

cuDF is a GPU DataFrame library that provides a pandas-like API, allowing us to accelerate our workflows without going into the details of CUDA programming. The library is part of RAPIDS, a suite of open-source libraries that use GPU acceleration and integrate with popular data science libraries and workflows to speed up machine learning.

The RAPIDS suite

The API is really similar to pandas, so in most cases, we just need to change one line of code to start using it:

import cudf as pd

# Build a Series and a DataFrame (None becomes a null value)
s = pd.Series([1, 2, 3, None, 4])
df = pd.DataFrame({'a': list(range(20)),
                   'b': list(reversed(range(20))),
                   'c': list(range(20))})

# Inspect, sort and select, as in pandas
df.head(2)
df.sort_values(by='b')
df['a']
df.loc[2:5, ['a', 'b']]

# Replace missing values
s.fillna(999)

# Write to CSV, then read the file back
df.to_csv('example_output/foo.csv', index=False)
df = pd.read_csv('example_output/foo.csv')

cuDF is a single-GPU library. For multi-GPU work, RAPIDS uses Dask and the dask-cudf package, which can scale cuDF across multiple GPUs on a single machine, or across many machines in a cluster [cuDF Docs].

Modin

Modin also provides a pandas-like API, and it uses Ray or Dask as its distributed execution engine. With Modin you can use all of the CPU cores on your machine. According to its documentation, it provides speed-ups of up to 4x on a laptop with 4 physical cores [Modin Docs].

Modin

The Environment

We’ll be using the Maingear VYBE PRO Data Science PC, running the scripts in Jupyter. Here are the technical specs:

Maingear VYBE PRO Data Science PC

  • 125 GB RAM
  • Intel Core i9-7980XE, 18 cores / 36 threads
  • 2x TITAN RTX 24 GB

The dataset used for the benchmarks is the Brazilian 2018 higher education census.

Benchmark 1: READ a CSV file

Let’s go ahead and read a 3 GB CSV file using Pandas, cuDF and Modin. We’ll run each read 30 times and take the mean.

loading_data.py

import time

import pandas as pd
import modin.pandas as pd_modin
import cudf as pd_cudf

results_loading = []

### Read in the data with Pandas
for run in range(0,30):
	s = time.time()
	df = pd.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
	e = time.time()
	results_loading.append({"lib":"Pandas","time":float("{}".format(e-s))})
	print("Pandas Loading Time = {}".format(e-s))

### Read in the data with Modin
for run in range(0,30):
	s = time.time()
	df = pd_modin.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
	e = time.time()
	results_loading.append({"lib":"Modin","time":float("{}".format(e-s))})
	print("Modin Loading Time = {}".format(e-s))

### Read in the data with cudf
for run in range(0,30):
	s = time.time()
	df = pd_cudf.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
	e = time.time()
	results_loading.append({"lib":"Cudf","time":float("{}".format(e-s))})
	print("Cudf Loading Time = {}".format(e-s))
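The three loops above repeat the same timing pattern, so it can be factored into a small helper. This is a sketch, not the script that produced the results: the names `time_call` and `summarize` are hypothetical, and it swaps `time.time()` for `time.perf_counter()`, a monotonic clock that is generally better suited to measuring short intervals.

```python
import statistics
import time


def time_call(func, runs=30):
    """Call `func` `runs` times and return the list of elapsed seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return the mean and standard deviation of a list of timings."""
    spread = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return {"mean": statistics.mean(timings), "stdev": spread}


# Stand-in workload so the sketch runs anywhere; in the benchmark this would
# be something like `lambda: pd.read_csv(path)`.
timings = time_call(lambda: sum(range(10_000)), runs=5)
stats = summarize(timings)
```

Reporting the standard deviation alongside the mean helps show whether a difference between libraries is larger than the run-to-run noise.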

Modin is the winner, taking less than 4 s on average. It automatically distributes the computation across all of the system’s available CPU cores, and this machine exposes 36 logical cores, so maybe this is the reason why 🤔?

Benchmark 2: missing values

In this benchmark we will fill the NaN values of the DataFrame.
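As a quick reminder of what `fillna` does, here is a small sketch in plain pandas with made-up values. It fills with the number 0, whereas the benchmark scripts pass the string "0"; a string fill on a numeric column may change its dtype, so the numeric form here is an assumption about what is usually wanted, not what was benchmarked.

```python
import pandas as pd

# Made-up values; None is stored as NaN in a float column.
s = pd.Series([1.0, None, 3.0])

# Fill with a numeric 0 so the column stays float64.
# fillna returns a new Series and leaves `s` unchanged.
filled = s.fillna(0)
```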

fillna.py

import time

import pandas as pd
import modin.pandas as pd_modin
import cudf as pd_cudf

results_fillna = []

### Read in the data with Pandas, then time fillna
for run in range(0,30):
	df = pd.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
    
	s = time.time()
	df = df.fillna(value="0")
	e = time.time()
	results_fillna.append({"lib":"Pandas","time":float("{}".format(e-s))})
	print("Pandas Fillna Time = {}".format(e-s))

### Read in the data with Modin, then time fillna
for run in range(0,30):
	df = pd_modin.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
    
	s = time.time()
	df = df.fillna(value="0")
	e = time.time()
	results_fillna.append({"lib":"Modin","time":float("{}".format(e-s))})
	print("Modin Fillna Time = {}".format(e-s))

### Read in the data with cudf, then time fillna
for run in range(0,30):
	df = pd_cudf.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV")
    
	s = time.time()
	df = df.fillna(value="0")
	e = time.time()
	results_fillna.append({"lib":"Cudf","time":float("{}".format(e-s))})
	print("Cudf Fillna Time = {}".format(e-s))

Filling missing values

Modin wins this benchmark too, and cuDF takes the longest to run on average.

Benchmark 3: groupby

Let’s group the rows to see how each library behaves.
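For readers unfamiliar with the operation being timed, this sketch shows `groupby(...).size()` on a tiny made-up frame in plain pandas. `CO_IES` is the column used in the benchmark (presumably the institution code); the `student` column and all values are invented for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same grouping column as the benchmark.
df = pd.DataFrame({"CO_IES": [1, 1, 2, 3, 3, 3], "student": list("abcdef")})

# size() counts the rows in each group and returns a Series indexed by CO_IES.
rows_per_institution = df.groupby("CO_IES").size()
```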

groupby.py

import time

import pandas as pd
import modin.pandas as pd_modin
import cudf as pd_cudf

results_groupby = []

### Read in the data with Pandas, then time groupby
for run in range(0,30):
	df = pd.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV",
                	delimiter="|",
                	encoding="latin-1")
    
	s = time.time()
	df = df.groupby("CO_IES").size()
	e = time.time()
	results_groupby.append({"lib":"Pandas","time":float("{}".format(e-s))})
	print("Pandas Groupby Time = {}".format(e-s))

### Read in the data with Modin, then time groupby
for run in range(0,30):
	df = pd_modin.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV",
                      	delimiter="|",
                      	encoding="latin-1")
    
	s = time.time()
	df = df.groupby("CO_IES").size()
	e = time.time()
	results_groupby.append({"lib":"Modin","time":float("{}".format(e-s))})
	print("Modin Groupby Time = {}".format(e-s))

### Read in the data with cudf, then time groupby
for run in range(0,30):
	df = pd_cudf.read_csv("../inep/dados/microdados_educacao_superior_2018//microdados_ed_superior_2018/dados/DM_ALUNO.CSV",
                     	delimiter="|",
                     	encoding="latin-1")
    
	s = time.time()
	df = df.groupby("CO_IES").size()
	e = time.time()
	results_groupby.append({"lib":"Cudf","time":float("{}".format(e-s))})
	print("Cudf Groupby Time = {}".format(e-s))

Here cuDF is the winner and Modin has the worst performance.
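Each script collects a list of {"lib", "time"} records, but the averaging step itself is not shown. One way to compute the per-library means is sketched here; the function name `mean_time_per_lib` is hypothetical and the sample timings are made up purely to show the shape of the data, not measured results.

```python
from collections import defaultdict
from statistics import mean


def mean_time_per_lib(results):
    """Group {"lib", "time"} records by library and average the times."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["lib"]].append(record["time"])
    return {lib: mean(times) for lib, times in grouped.items()}


# Made-up timings in the same format as results_groupby above.
sample_results = [
    {"lib": "Pandas", "time": 2.0},
    {"lib": "Pandas", "time": 4.0},
    {"lib": "Modin", "time": 1.0},
    {"lib": "Modin", "time": 3.0},
]
averages = mean_time_per_lib(sample_results)
```

The same function can be applied to `results_loading`, `results_fillna` and `results_groupby` without changes, since all three share the record format.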

Cool, so which library should I use?

every technical debate
https://twitter.com/swyx/status/1202202923385536513

To answer this question, I think we have to consider the methods we use most in our workflows. In today’s benchmark, reading the file was much faster with Modin, but how many times do we call read_csv() in an ETL job? By contrast, we would in theory call groupby() more frequently, and in that case cuDF had the best performance.

Modin is pretty easy to install (we just need to use pip) and cuDF is harder (you’ll need to update the NVIDIA drivers, install CUDA and then install cuDF using conda). Or you can skip all these steps and get a Data Science PC because it comes with all RAPIDS libraries and software fully installed.

Also, both Modin and cuDF are still in their early stages and do not yet cover the entire Pandas API.

Getting started

If you want to dive deep into cuDF, the “10 Minutes to cuDF and Dask-cuDF” guide is a good place to start.

For more info about Modin, this blog post explains more about parallelizing Pandas with Ray. If you want to dive even deeper there is the technical report on Scaling Interactive Data Science Transparently with Modin, which does a great job of explaining the technical architecture of Modin.

#pandas #data-science #machine-learning
