
How to Process Large Files in Parallel in Python

Learn several techniques to reduce data processing time using multiprocessing, joblib, and concurrent tqdm.

For parallel processing, we divide our task into sub-units. This increases the number of jobs the program processes at once and reduces the overall processing time.

For example, suppose you are working with a large CSV file and want to modify a single column. We will feed the data as an array to the function, which will process multiple values in parallel at once, based on the number of available workers. The number of workers depends on the number of cores in your processor.

Note: using parallel processing on a small dataset will not improve processing time.

In this blog, we will learn how to reduce the processing time of large files using the Python packages multiprocessing, joblib, and tqdm. It is a simple tutorial that can be applied to any file, database, image, video, or audio.

Note: we are using a Kaggle notebook for the experiments. Processing time may vary from machine to machine.

Getting Started

We will use the US Accidents (2016 - 2021) dataset from Kaggle, which consists of 2.8 million records and 47 columns.

We will import multiprocessing, joblib, and tqdm for parallel processing, pandas for data ingestion, and re, nltk, and string for text processing.

# Parallel Computing
import multiprocessing as mp
from joblib import Parallel, delayed
from tqdm.notebook import tqdm

# Data Ingestion
import pandas as pd

# Text Processing
import re
from nltk.corpus import stopwords
import string

Before we start, let's set n_workers by doubling cpu_count(). As you can see, we have 8 workers available.

n_workers = 2 * mp.cpu_count()

print(f"{n_workers} workers are available")

>>> 8 workers are available

In the next step, we will ingest the large CSV file using the pandas read_csv function. Then we print the shape of the dataframe, the column names, and the processing time.

Note: Jupyter's `%%time` magic command displays the CPU times and wall time at the end of the cell.

%%time
file_name="../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)

print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")

Output

Shape:(2845342, 47)

Column Names:

Index(['ID', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight'],
dtype='object')

CPU times: user 33.9 s, sys: 3.93 s, total: 37.9 s
Wall time: 46.9 s
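Note: if the whole file does not fit comfortably in memory, pandas read_csv also accepts a chunksize parameter that yields the file in pieces. A minimal sketch (the 500,000-row chunk size here is an illustrative choice, not part of the original experiment):

n_rows = 0
for chunk in pd.read_csv(file_name, chunksize=500_000):
    n_rows += len(chunk)  # process or filter each piece here instead of holding the full file
print(f"Rows read: {n_rows}")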

Cleaning the Text

clean_text is a straightforward function for processing and cleaning the text. We get the English stop words using nltk.corpus and use them to filter stop words out of the line of text. After that, we remove special characters and extra spaces from the sentence. It will be the baseline function used to measure the processing time of serial, parallel, and batch processing.

def clean_text(text):
    # Remove stop words
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    # Remove special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra spaces
    text = re.sub(' +', ' ', text)
    return text
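Note: stopwords.words("english") assumes the NLTK stop-word corpus is already present on the machine. On a fresh environment, a one-time download may be needed first:

import nltk
nltk.download("stopwords")  # one-time download of the corpus used by clean_text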

Serial Processing

For serial processing, we can use the pandas .apply() function, but if you want to see a progress bar, you need to activate tqdm for pandas and then use the .progress_apply() function.
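Without the progress bar, the equivalent plain pandas call would simply be:

df['Description'] = df['Description'].apply(clean_text)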

We will process the 2.8 million records and save the result back to the "Description" column.

%%time
tqdm.pandas()

df['Description'] = df['Description'].progress_apply(clean_text)

Output

It took the high-end processor 9 minutes and 5 seconds to serially process 2.8 million rows.

100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [09:05<00:00, 5724.25it/s]

CPU times: user 8min 14s, sys: 53.6 s, total: 9min 7s
Wall time: 9min 5s

Multiprocessing

There are several ways to process a file in parallel, and we are going to learn about all of them. `multiprocessing` is a built-in Python package that is commonly used for parallel processing of large files.

We will create a multiprocessing Pool with 8 workers and use the map function to start the process. To display progress bars, we are using tqdm.

The map function takes two parts: the first is the function, and the second is an argument or a list of arguments.

Learn more by reading the documentation.

%%time
p = mp.Pool(n_workers)

df['Description'] = p.map(clean_text, tqdm(df['Description']))

p.close()  # release the worker processes once the work is done
p.join()

Output

We improved our processing time by roughly 2.4X. Processing time dropped from 9 minutes and 5 seconds to 3 minutes and 51 seconds.

100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [02:58<00:00, 135646.12it/s]

CPU times: user 5.68 s, sys: 1.56 s, total: 7.23 s
Wall time: 3min 51s
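Note: because map consumes the whole iterable up front, the tqdm bar above measures how fast values are fed to the pool, not how fast they are processed. If you want the bar to track completed work, Pool.imap is one alternative; a minimal sketch (the chunksize of 1000 is an illustrative choice):

with mp.Pool(n_workers) as p:
    output = list(tqdm(p.imap(clean_text, df['Description'], chunksize=1000), total=len(df)))
df['Description'] = output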

Parallel

Now let's learn about another Python package for performing parallel processing. In this section, we will use joblib's Parallel and delayed to replicate the map function.

  • Parallel requires two arguments: n_jobs = 8 and backend = multiprocessing.
  • Then we wrap clean_text in the delayed function.
  • Create a loop to feed one value at a time.

The process below is quite generic, and you can modify the function and the array to suit your needs. I have used it to process thousands of audio and video files without any problem.

Recommended: add exception handling using `try:` and `except:`, as sketched below.
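A minimal fault-tolerant wrapper might look like this (returning the original text on failure is an assumption; adapt it to your needs):

def safe_clean_text(text):
    try:
        return clean_text(text)
    except Exception:
        return text  # keep the original value if cleaning fails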

def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text)
        for text in tqdm(array)
    )
    return result

Pass the "Description" column to text_parallel_clean():

%%time
df['Description'] = text_parallel_clean(df['Description'])

Output

Our function took 13 seconds longer than the multiprocessing Pool. Even so, Parallel is 5 minutes and 1 second faster than serial processing.

100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [04:03<00:00, 10514.98it/s]

CPU times: user 44.2 s, sys: 2.92 s, total: 47.1 s
Wall time: 4min 4s

Parallel Batch Processing

There is an even better way to process large files: splitting them into batches and processing them in parallel. Let's start by creating a batch function that runs clean_text on a single batch of values.

Batch processing function

def proc_batch(batch):
    return [clean_text(text) for text in batch]

Splitting the file into batches

The function below splits the file into multiple batches based on the number of workers. In our case, we get 8 batches.

def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [
        array[ix:ix+batch_size]
        for ix in tqdm(range(0, file_len, batch_size))
    ]
    return batches

batches = batch_file(df['Description'], n_workers)

>>> 100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 8/8 [00:00<00:00, 280.01it/s]

Running parallel batch processing

Finally, we will use Parallel and delayed to process the batches.

Note: to get back a single flat array of values, we have to run a list comprehension, as shown below.

%%time
batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch)
    for batch in tqdm(batches)
)

df['Description'] = [j for i in batch_output for j in i]

Output

We again improved on the serial processing time. This technique is widely used for processing complex data and for training deep learning models.

100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 8/8 [00:00<00:00, 2.19it/s]

CPU times: user 3.39 s, sys: 1.42 s, total: 4.81 s
Wall time: 3min 56s
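The flattening step could equally be written with itertools.chain.from_iterable, which avoids the nested list comprehension:

from itertools import chain
df['Description'] = list(chain.from_iterable(batch_output))  # same result as the nested comprehension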

Concurrent tqdm

tqdm takes multiprocessing to the next level. It is simple and powerful. I would recommend it to every data scientist.

Check out the documentation to learn more about multiprocessing.

process_map requires:

  1. The function name
  2. The dataframe column
  3. max_workers
  4. chunksize, which is similar to batch size. We will calculate it using the number of workers, or you can pick a number based on your preference.

%%time
from tqdm.contrib.concurrent import process_map
batch = round(len(df) / n_workers)

df['Description'] = process_map(clean_text, df['Description'], max_workers=n_workers, chunksize=batch)

Output

With a single line of code, we match the best processing time (3 minutes and 51 seconds).

100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [03:48<00:00, 1426320.93it/s]

CPU times: user 7.32 s, sys: 1.97 s, total: 9.29 s
Wall time: 3min 51s
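Note: tqdm.contrib.concurrent also provides thread_map with the same call signature. Threads will not speed up a CPU-bound function like clean_text because of the GIL, but the variant can be handy for I/O-bound work:

from tqdm.contrib.concurrent import thread_map
df['Description'] = thread_map(clean_text, df['Description'], max_workers=n_workers)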

Conclusion

You need to find a balance and select the technique that works best for your case. It may be serial, parallel, or batch processing. Parallel processing can backfire if you are working with a smaller, less complex dataset.

In this mini-tutorial, we learned about several Python packages and techniques that let us run our data-processing functions in parallel.

If you are working only with tabular datasets and want to improve processing performance, I suggest you try Dask, datatable, and RAPIDS.
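As a taste of what that looks like, here is a minimal Dask sketch (assuming dask is installed; the apply call mirrors the pandas API but needs a meta hint describing the output):

import dask.dataframe as dd

ddf = dd.read_csv(file_name)  # lazy, partitioned read
ddf['Description'] = ddf['Description'].apply(clean_text, meta=('Description', 'object'))
result = ddf.compute()  # triggers the parallel computation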

Source: https://www.kdnuggets.com
