Kasey  Turcotte

Kasey Turcotte

1623947400

One Line of Code for a Common Text Pre-Processing Step in Pandas

A quick look at splitting text columns for use in machine learning and data analysis

ometimes you’ll want to do some processing to create new variables out of your existing data. This can be as simple as splitting up a “name” column into “first name” and “last name”.

Whatever the case may be, Pandas will allow you to effortlessly work with text data through a variety of in-built methods. In this piece, we’ll go specifically into parsing text columns for the exact information you need either for further data analysis or for use in a machine learning model.

If you’d like to follow along, go ahead and download the ‘train’ dataset here. Once you’ve done that, make sure it’s saved to the same directory as your notebook and then run the code below to read it in:

import pandas as pd
df = pd.read_csv('train.csv')

Let’s get to it!

#programming #python #one line of code for a common text pre-processing step in pandas #pandas #one line of code for a common text pre-processing #text pre-processing

What is GEEK

Buddha Community

One Line of Code for a Common Text Pre-Processing Step in Pandas
Kasey  Turcotte

Kasey Turcotte

1623947400

One Line of Code for a Common Text Pre-Processing Step in Pandas

A quick look at splitting text columns for use in machine learning and data analysis

ometimes you’ll want to do some processing to create new variables out of your existing data. This can be as simple as splitting up a “name” column into “first name” and “last name”.

Whatever the case may be, Pandas will allow you to effortlessly work with text data through a variety of in-built methods. In this piece, we’ll go specifically into parsing text columns for the exact information you need either for further data analysis or for use in a machine learning model.

If you’d like to follow along, go ahead and download the ‘train’ dataset here. Once you’ve done that, make sure it’s saved to the same directory as your notebook and then run the code below to read it in:

import pandas as pd
df = pd.read_csv('train.csv')

Let’s get to it!

#programming #python #one line of code for a common text pre-processing step in pandas #pandas #one line of code for a common text pre-processing #text pre-processing

Tyrique  Littel

Tyrique Littel

1604008800

Static Code Analysis: What It Is? How to Use It?

Static code analysis refers to the technique of approximating the runtime behavior of a program. In other words, it is the process of predicting the output of a program without actually executing it.

Lately, however, the term “Static Code Analysis” is more commonly used to refer to one of the applications of this technique rather than the technique itself — program comprehension — understanding the program and detecting issues in it (anything from syntax errors to type mismatches, performance hogs likely bugs, security loopholes, etc.). This is the usage we’d be referring to throughout this post.

“The refinement of techniques for the prompt discovery of error serves as well as any other as a hallmark of what we mean by science.”

  • J. Robert Oppenheimer

Outline

We cover a lot of ground in this post. The aim is to build an understanding of static code analysis and to equip you with the basic theory, and the right tools so that you can write analyzers on your own.

We start our journey with laying down the essential parts of the pipeline which a compiler follows to understand what a piece of code does. We learn where to tap points in this pipeline to plug in our analyzers and extract meaningful information. In the latter half, we get our feet wet, and write four such static analyzers, completely from scratch, in Python.

Note that although the ideas here are discussed in light of Python, static code analyzers across all programming languages are carved out along similar lines. We chose Python because of the availability of an easy to use ast module, and wide adoption of the language itself.

How does it all work?

Before a computer can finally “understand” and execute a piece of code, it goes through a series of complicated transformations:

static analysis workflow

As you can see in the diagram (go ahead, zoom it!), the static analyzers feed on the output of these stages. To be able to better understand the static analysis techniques, let’s look at each of these steps in some more detail:

Scanning

The first thing that a compiler does when trying to understand a piece of code is to break it down into smaller chunks, also known as tokens. Tokens are akin to what words are in a language.

A token might consist of either a single character, like (, or literals (like integers, strings, e.g., 7Bob, etc.), or reserved keywords of that language (e.g, def in Python). Characters which do not contribute towards the semantics of a program, like trailing whitespace, comments, etc. are often discarded by the scanner.

Python provides the tokenize module in its standard library to let you play around with tokens:

Python

1

import io

2

import tokenize

3

4

code = b"color = input('Enter your favourite color: ')"

5

6

for token in tokenize.tokenize(io.BytesIO(code).readline):

7

    print(token)

Python

1

TokenInfo(type=62 (ENCODING),  string='utf-8')

2

TokenInfo(type=1  (NAME),      string='color')

3

TokenInfo(type=54 (OP),        string='=')

4

TokenInfo(type=1  (NAME),      string='input')

5

TokenInfo(type=54 (OP),        string='(')

6

TokenInfo(type=3  (STRING),    string="'Enter your favourite color: '")

7

TokenInfo(type=54 (OP),        string=')')

8

TokenInfo(type=4  (NEWLINE),   string='')

9

TokenInfo(type=0  (ENDMARKER), string='')

(Note that for the sake of readability, I’ve omitted a few columns from the result above — metadata like starting index, ending index, a copy of the line on which a token occurs, etc.)

#code quality #code review #static analysis #static code analysis #code analysis #static analysis tools #code review tips #static code analyzer #static code analysis tool #static analyzer

Houston  Sipes

Houston Sipes

1604088000

How to Find the Stinky Parts of Your Code (Part II)

There are more code smells. Let’s keep changing the aromas. We see several symptoms and situations that make us doubt the quality of our development. Let’s look at some possible solutions.

Most of these smells are just hints of something that might be wrong. They are not rigid rules.

This is part II. Part I can be found here.

Code Smell 06 - Too Clever Programmer

The code is difficult to read, there are tricky with names without semantics. Sometimes using language’s accidental complexity.

_Image Source: NeONBRAND on _Unsplash

Problems

  • Readability
  • Maintainability
  • Code Quality
  • Premature Optimization

Solutions

  1. Refactor the code
  2. Use better names

Examples

  • Optimized loops

Exceptions

  • Optimized code for low-level operations.

Sample Code

Wrong

function primeFactors(n){
	  var f = [],  i = 0, d = 2;  

	  for (i = 0; n >= 2; ) {
	     if(n % d == 0){
	       f[i++]=(d); 
	       n /= d;
	    }
	    else{
	      d++;
	    }     
	  }
	  return f;
	}

Right

function primeFactors(numberToFactor){
	  var factors = [], 
	      divisor = 2,
	      remainder = numberToFactor;

	  while(remainder>=2){
	    if(remainder % divisor === 0){
	       factors.push(divisor); 
	       remainder = remainder/ divisor;
	    }
	    else{
	      divisor++;
	    }     
	  }
	  return factors;
	}

Detection

Automatic detection is possible in some languages. Watch some warnings related to complexity, bad names, post increment variables, etc.

#pixel-face #code-smells #clean-code #stinky-code-parts #refactor-legacy-code #refactoring #stinky-code #common-code-smells

Samanta  Moore

Samanta Moore

1621137960

Guidelines for Java Code Reviews

Get a jump-start on your next code review session with this list.

Having another pair of eyes scan your code is always useful and helps you spot mistakes before you break production. You need not be an expert to review someone’s code. Some experience with the programming language and a review checklist should help you get started. We’ve put together a list of things you should keep in mind when you’re reviewing Java code. Read on!

1. Follow Java Code Conventions

2. Replace Imperative Code With Lambdas and Streams

3. Beware of the NullPointerException

4. Directly Assigning References From Client Code to a Field

5. Handle Exceptions With Care

#java #code quality #java tutorial #code analysis #code reviews #code review tips #code analysis tools #java tutorial for beginners #java code review

Daron  Moore

Daron Moore

1598404620

Hands-on Guide to Pattern - A Python Tool for Effective Text Processing and Data Mining

Text Processing mainly requires Natural Language Processing( NLP), which is processing the data in a useful way so that the machine can understand the Human Language with the help of an application or product. Using NLP we can derive some information from the textual data such as sentiment, polarity, etc. which are useful in creating text processing based applications.

Python provides different open-source libraries or modules which are built on top of NLTK and helps in text processing using NLP functions. Different libraries have different functionalities that are used on data to gain meaningful results. One such Library is Pattern.

Pattern is an open-source python library and performs different NLP tasks. It is mostly used for text processing due to various functionalities it provides. Other than text processing Pattern is used for Data Mining i.e we can extract data from various sources such as Twitter, Google, etc. using the data mining functions provided by Pattern.

In this article, we will try and cover the following points:

  • NLP Functionalities of Pattern
  • Data Mining Using Pattern

#developers corner #data mining #text analysis #text analytics #text classification #text dataset #text-based algorithm