Learn several techniques to reduce data processing time using multiprocessing, joblib, and concurrent tqdm.
For parallel processing, we divide our task into sub-units. It increases the number of jobs processed by the program and reduces the overall processing time.
For example, suppose you are working with a large CSV file and want to modify a single column. We will feed the data as an array to the function, and it will process multiple values in parallel at once based on the number of available workers. These workers are based on the number of cores in your processor.
Note: using parallel processing on a smaller dataset will not improve processing time.
In this blog, we will learn how to reduce processing time on large files using the multiprocessing, joblib, and tqdm Python packages. It is a simple tutorial that can apply to any file, database, image, video, and audio.
Note: we are using a Kaggle notebook for the experiments. Processing time can vary from machine to machine.
We will use the US Accidents (2016 - 2021) dataset from Kaggle, which consists of 2.8 million records and 47 columns.
We will import multiprocessing, joblib, and tqdm for parallel processing, pandas for data ingestion, and re, nltk, and string for text processing.
# Parallel Computing
import multiprocessing as mp
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
# Data Ingestion
import pandas as pd
# Text Processing
import re
from nltk.corpus import stopwords
import string
Before we get started, let's set n_workers by doubling cpu_count(). As you can see, we have 8 workers.
n_workers = 2 * mp.cpu_count()
print(f"{n_workers} workers are available")
>>> 8 workers are available
In the next step, we will ingest a large CSV file using the pandas read_csv function. Then we will print out the shape of the dataframe, the column names, and the processing time.
Note: Jupyter's magic function `%%time` displays CPU times and wall time at the end of the process.
%%time
file_name="../input/us-accidents/US_Accidents_Dec21_updated.csv"
df = pd.read_csv(file_name)
print(f"Shape:{df.shape}\n\nColumn Names:\n{df.columns}\n")
Output
Shape:(2845342, 47)
Column Names:
Index(['ID', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
'End_Lat', 'End_Lng', 'Distance(mi)', 'Description', 'Number', 'Street',
'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone',
'Airport_Code', 'Weather_Timestamp', 'Temperature(F)', 'Wind_Chill(F)',
'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Direction',
'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition', 'Amenity',
'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
'Astronomical_Twilight'],
dtype='object')
CPU times: user 33.9 s, sys: 3.93 s, total: 37.9 s
Wall time: 46.9 s
clean_text is a straightforward function for processing and cleaning text. We will get English stopwords using nltk.corpus and use them to filter out stopwords from the line of text. After that, we will remove special characters and extra spaces from the sentence. It will be the baseline function for determining the processing time of serial, parallel, and batch processing.
def clean_text(text):
    # Remove stop words
    stops = stopwords.words("english")
    text = " ".join([word for word in text.split() if word not in stops])
    # Remove special characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove the extra spaces
    text = re.sub(' +', ' ', text)
    return text
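As a quick sanity check, here is what clean_text does to an illustrative sentence (not taken from the dataset); it assumes the NLTK stopwords corpus has already been downloaded via nltk.download("stopwords"):
clean_text("There was a car accident on the highway!")
# -> 'There car accident highway'
# ("was", "a", "on", "the" are stopwords; "!" is stripped as punctuation)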
For serial processing, we can use the pandas .apply() function, but if you want to see a progress bar, you need to activate tqdm for pandas and then use the .progress_apply() function.
We will process the 2.8 million records and save the result back to the "Description" column.
%%time
tqdm.pandas()
df['Description'] = df['Description'].progress_apply(clean_text)
Output
It took the high-end processor 9 minutes and 5 seconds to serially process 2.8 million rows.
100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [09:05<00:00, 5724.25it/s]
CPU times: user 8min 14s, sys: 53.6 s, total: 9min 7s
Wall time: 9min 5s
There are multiple ways to process a file in parallel, and we are going to learn about all of them. `multiprocessing` is a built-in Python package that is commonly used for parallel processing of large files.
We will create a multiprocessing Pool with 8 workers and use the map function to initiate the process. To display progress bars, we are using tqdm.
The map function consists of two sections. The first requires the function, and the second requires an argument or list of arguments.
Learn more by reading the documentation.
%%time
p = mp.Pool(n_workers)
df['Description'] = p.map(clean_text,tqdm(df['Description']))
Output
We have improved our processing time by almost 3X. The processing time dropped from 9 minutes 5 seconds to 3 minutes 51 seconds.
100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [02:58<00:00, 135646.12it/s]
CPU times: user 5.68 s, sys: 1.56 s, total: 7.23 s
Wall time: 3min 51s
Now we will learn about another Python package for performing parallel processing. In this section, we will use joblib's Parallel and delayed to replicate the map function.
The process below is quite generic, and you can modify your function and array according to your needs. I have used it to process thousands of audio and video files without any problem.
Recommended: add exception handling using `try:` and `except:` (see the sketch after the function below).
def text_parallel_clean(array):
    result = Parallel(n_jobs=n_workers, backend="multiprocessing")(
        delayed(clean_text)(text)
        for text in tqdm(array)
    )
    return result
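As recommended above, here is a minimal sketch of the same function with exception handling added; the wrapper name safe_clean_text and the fallback of returning the row unchanged are assumptions, not part of the original code:
def safe_clean_text(text):
    try:
        return clean_text(text)
    except Exception as e:
        # Assumption: on failure (e.g., a NaN instead of a string),
        # report the error and keep the original value untouched
        print(f"Skipping a row due to: {e}")
        return text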
Pass the "Description" column to text_parallel_clean().
%%time
df['Description'] = text_parallel_clean(df['Description'])
Output
Our function took 13 seconds longer than the multiprocessing Pool. Even so, Parallel is 4 minutes and 59 seconds faster than serial processing.
100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [04:03<00:00, 10514.98it/s]
CPU times: user 44.2 s, sys: 2.92 s, total: 47.1 s
Wall time: 4min 4s
There is a better way to process large files: split them into batches and process the batches in parallel. Let's start by creating a batch function that will run the clean_text function on a single batch of values.
def proc_batch(batch):
    return [
        clean_text(text)
        for text in batch
    ]
The function below will split the file into multiple batches based on the number of workers. In our case, we get 8 batches.
def batch_file(array, n_workers):
    file_len = len(array)
    batch_size = round(file_len / n_workers)
    batches = [
        array[ix:ix+batch_size]
        for ix in tqdm(range(0, file_len, batch_size))
    ]
    return batches
batches = batch_file(df['Description'],n_workers)
>>> 100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 8/8 [00:00<00:00, 280.01it/s]
Finally, we will use Parallel and delayed to process the batches.
Note: To get back a single array of values, we have to run a list comprehension to flatten the batches, as shown below.
%%time
batch_output = Parallel(n_jobs=n_workers, backend="multiprocessing")(
    delayed(proc_batch)(batch)
    for batch in tqdm(batches)
)
df['Description'] = [j for i in batch_output for j in i]
Output
We have improved the processing time. This technique is well known for processing complex data and training deep learning models.
100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 8/8 [00:00<00:00, 2.19it/s]
CPU times: user 3.39 s, sys: 1.42 s, total: 4.81 s
Wall time: 3min 56s
tqdm takes multiprocessing to the next level. It is simple and powerful. I recommend it to every data scientist.
Check out the documentation to learn more about multiprocessing.
The process_map requires the function name, the dataframe column, max_workers, and a chunksize (similar to the batch size), as the code below shows.
%%time
from tqdm.contrib.concurrent import process_map
batch = round(len(df)/n_workers)
df['Description'] = process_map(clean_text,df['Description'], max_workers=n_workers, chunksize=batch)
Output
With a single line of code, we get the best result.
100% 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 2845342/2845342 [03:48<00:00, 1426320.93it/s]
CPU times: user 7.32 s, sys: 1.97 s, total: 9.29 s
Wall time: 3min 51s
You need to find a balance and select the technique that works best for your case: it can be serial, parallel, or batch processing. Parallel processing can backfire if you are working with a smaller, less complex dataset.
In this mini-tutorial, we have learned about various Python packages and techniques that allow us to parallel-process our data functions.
If you are only working with a tabular dataset and want to improve processing performance, I suggest you try Dask, datatable, and RAPIDS.
Source: https://www.kdnuggets.com
Few programming languages are as versatile as Python. It makes building cutting-edge applications straightforward, and developers are still exploring the full potential of end-to-end Python development services in various sectors.
By sectors, we mean FinTech, HealthTech, InsureTech, Cybersecurity, and more. These are New Economy sectors, and Python has the ability to serve every one of them. Most of them require massive computational power, and Python's code is dynamic and robust, capable of handling heavy traffic and substantial algorithmic workloads.
Software development is multidimensional today. Enterprise software requires intelligent applications with AI and ML capabilities. Consumer-facing applications require data analysis to deliver a better customer experience. Netflix, Trello, and Amazon are real examples of such applications, and Python helps build them with ease.
Python can do so many things that developers can't find enough reasons to admire it. Python application development isn't restricted to web and enterprise applications; it is highly adaptable and excellent for a wide range of uses.
Robust frameworks
Python is known for its tools and frameworks; there is a framework for everything. Django is useful for building web applications, enterprise applications, scientific applications, and numerical computing. Flask is another web development framework, a lightweight one with minimal dependencies.
Web2Py, CherryPy, and Falcon offer powerful capabilities for customizing Python development services. Most of them are open-source frameworks that allow rapid development.
Simple to read and write
Python has a streamlined syntax, one that resembles the English language. New Python developers can easily understand where they stand in the development process, and the ease of writing allows for rapid application building.
The motivation behind building Python, as stated by its creator Guido van Rossum, was to let even novice engineers understand the programming language. Its simple style also lets developers make quick changes without getting lost in unnecessary details.
Used by the best
Alright, Python isn't just another programming language; it must offer something, which is why industry giants use it, and for many different purposes. Developers at Google use Python to build system administration tools, parallel data pushers, code review, testing and QA, and much more. Netflix uses Python for its recommendation algorithm and media player.
Massive community support
Python has a steadily growing community that offers enormous support, from beginners to experts. There are plenty of tutorials, documentation, and guides available for Python web development.
Today, many universities start with Python, adding to the number of people in the community. Python developers frequently collaborate on projects and help each other with algorithmic, functional, and application problem-solving.
Progressive applications
Python is the biggest contributor to data science, Machine Learning, and Artificial Intelligence at any enterprise software development company. Its use cases in cutting-edge applications are the most compelling reason for its success. Python is the second most popular tool after R for data analytics.
The ease of organizing, managing, and visualizing data through dedicated libraries makes it ideal for data-driven applications. TensorFlow for neural networks and OpenCV for computer vision are two of Python's most popular use cases in Machine Learning applications.
Considering the advances in software and technology, Python is a yes for a diverse range of applications: game development, web application development services, GUI development, ML and AI development, and enterprise and consumer applications all use Python to its full potential.
The disadvantages of Python web development are often overlooked by developers and organizations because of the advantages it provides. They prioritize quality over speed and performance over errors. That is why it makes sense to use Python for building the applications of the future.
Python is awesome; it's one of the easiest languages, with simple and intuitive syntax. But wait, have you ever thought that there might be ways to write your Python code even more simply?
In this tutorial, you're going to learn a variety of Python tricks that you can use to write your Python code in a more readable and efficient way, like a pro.
Swapping values in Python
Instead of creating a temporary variable to hold one of the values while swapping, you can do this instead:
>>> FirstName = "kalebu"
>>> LastName = "Jordan"
>>> FirstName, LastName = LastName, FirstName
>>> print(FirstName, LastName)
Jordan kalebu
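This works through tuple packing and unpacking, and it generalizes to more than two variables:
a, b, c = 1, 2, 3
a, b, c = c, a, b    # rotate the three values in one line
print(a, b, c)
# 3 1 2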
Today you're going to learn how to use Python in a way that can ultimately save a lot of space on your drive by removing all the duplicate files.
In many situations you may find yourself having duplicate files on your disk, but tracking and checking them manually can be tedious.
Here's a solution:
Instead of combing through your disk to see if there is a duplicate, you can automate the process by writing a program that recursively scans the disk and removes all the duplicates it finds, and that's what this article is about.
But how do we do it?
If we were to read each whole file and then compare it to all the other files recursively through the given directory, it would take a very long time.
The answer is hashing. With hashing, we can generate a fixed string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.
There's a variety of hashing algorithms out there, such as MD5, SHA-1, and SHA-256.
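To make the idea concrete, here is a minimal sketch of a hashing-based duplicate remover using Python's built-in hashlib and os modules; the function names file_hash and remove_duplicates are illustrative, and since deleting files is destructive, test it on a copy of your data first:
import hashlib
import os

def file_hash(path, chunk_size=65536):
    # Hash the file in chunks so large files don't exhaust memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def remove_duplicates(root_dir):
    # Map each content hash to the first path seen with that content
    seen = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            if digest in seen:
                print(f"Duplicate of {seen[digest]}: removing {path}")
                os.remove(path)
            else:
                seen[digest] = path

remove_duplicates(".")    # scan the current directory recursively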
Magic methods are special methods that give us the ability to hook into built-in syntactic features such as '<', '>', '==', '+', etc.
You must have worked with such methods without knowing them to be magic methods. Magic methods can be identified by their names, which start and end with a double underscore, like __init__, __call__, and __str__. They are also called dunder methods because of the double underscores (dunder) that frame their names.
There are a number of such special methods in Python, which you might have come across too. We will take a few of them as examples to understand how they work and how we can use them.
class AnyClass:
    def __init__(self):
        print("Init called on its own")

obj = AnyClass()
The first example is __init__, and as the name suggests, it is used for initializing objects. The init method is called on its own; that is, whenever an object is created for the class, the init method is invoked automatically.
The output of the above code is given below. Note how we did not call the init method, yet it got invoked as we created an object of class AnyClass.
Init called on its own
Let's move to another example: __add__ gives us the ability to hook into the built-in syntax feature of the + character. Let's see how.
class AnyClass:
    def __init__(self, var):
        self.some_var = var

    def __add__(self, other_obj):
        print("Calling the add method")
        return self.some_var + other_obj.some_var

obj1 = AnyClass(5)
obj2 = AnyClass(6)
obj1 + obj2
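Running this prints "Calling the add method", and the expression obj1 + obj2 evaluates to 11. Since __str__ was also mentioned above, here is a minimal sketch of it as well; the class name Point is illustrative:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        # print() and str() call __str__ automatically
        return f"Point({self.x}, {self.y})"

print(Point(2, 3))
# Point(2, 3)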
At the end of 2019, Python is one of the fastest-growing programming languages; more than 10% of developers have opted for Python development.
In the programming world, data types play an important role. Each variable is stored as a particular data type and is responsible for various functions. Python has two different kinds of objects: mutable and immutable objects.
Mutable objects are those whose size, declared value, and sequence can be modified.
Mutable data types are list, dict, set, and bytearray.
Immutable objects are those whose size, declared value, and sequence cannot be modified.
Immutable data types are int, float, complex, str, tuple, bytes, and frozenset.
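A short example of the difference, using standard Python behavior:
nums = [1, 2, 3]      # list: mutable
nums[0] = 10          # fine, the object is modified in place
print(nums)
# [10, 2, 3]

point = (1, 2, 3)     # tuple: immutable
point[0] = 10
# TypeError: 'tuple' object does not support item assignment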
id() and type() are used to find the identity and the data type of an object.
a = 25 + 85j
type(a)
# Output: <class 'complex'>

b = {1: 10, 2: "Pinky"}
id(b)
# Output: 238989244168
a = str("Hello python world")  # str
b = int(18)  # int
c = float(20482.5)  # float
d = complex(5 + 85j)  # complex
e = list(("python", "fast", "growing", "in", 2018))  # list
f = tuple(("python", "easy", "learning"))  # tuple
g = range(10)  # range
h = dict(name="Vidu", age=36)  # dict
i = set(("python", "fast", "growing", "in", 2018))  # set
j = frozenset(("python", "fast", "growing", "in", 2018))  # frozenset
k = bool(18)  # bool
l = bytes(8)  # bytes
m = bytearray(8)  # bytearray
n = memoryview(bytes(18))  # memoryview
Numbers are stored as numeric types. When a number is assigned to a variable, Python creates a Number object.
# signed integer
age = 18
print(age)
# Output: 18
Python supports 3 types of numeric data.
int (signed integers like 20, 2, 225, etc.)
float (float is used to store floating-point numbers like 9.8, 3.1444, 89.52, etc.)
complex (complex numbers like 8.94j, 4.0 + 7.3j, etc.)
A complex number contains an ordered pair, i.e., a + ib, where a and b denote the real and imaginary parts, respectively.
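For example, the real and imaginary parts can be read back from the complex object:
z = 4.0 + 7.3j
print(z.real)    # 4.0
print(z.imag)    # 7.3
print(type(z))   # <class 'complex'>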
A string can be represented as a sequence of characters enclosed in quotation marks. In Python, we can define strings using single, double, or triple quotes.
# String handling
'Hello Python'        # single (') quoted string
"Hello Python"        # double (") quoted string
"""Hello Python"""    # triple (""") quoted string
'''Hello Python'''    # triple (''') quoted string
In Python, string handling is a straightforward task, and Python provides various built-in functions and operators for manipulating strings.
The "+" operator is used to concatenate strings, and "*" is used to repeat a string.
"Hello " + "python"
# Output: 'Hello python'

"python " * 2
# Output: 'python python '