The Updated Guide to Unicode, UTF-8 and Strings in Python

Strings are one of the most common data types in Python. They are used to handle text data of any kind, and the field of Natural Language Processing is built on text and string processing. It is therefore important to know how strings work in Python. Strings are usually easy to deal with when they are made up of English ASCII characters, but “problems” appear once non-ASCII characters enter the picture, and those are increasingly common today, especially with the advent of emojis.

For the longest time I was confused about how to deal with them and, like many programmers, threw encode and decode at strings in the hope of getting rid of the dreaded UnicodeDecodeError. Hopefully this blog will help you overcome the dread of dealing with strings. Below I use a Q and A format to get to the answers to the questions you might have, which I also had before I started learning about strings.

1. What are strings made of?

In Python (2 or 3), strings can be represented either as bytes or as Unicode code points.
A byte is a unit of information made up of 8 bits, and bytes are used to store every file on a hard disk. So all of the CSV and JSON files on your computer are made of bytes. We can all agree that we need bytes, but then what about Unicode code points? We will get to them in the next question.

2. What is Unicode, and unicode code points?

When reading bytes from a file, a reader needs to know what those bytes mean. So if you write a JSON file and send it to your friend, your friend needs to know how to interpret the bytes in it. For the first 20 years or so of computing, upper and lower case English letters, some punctuation and digits were enough. These were all encoded in a 128-character list called ASCII (codes 0 to 127). 7 bits of information are enough to encode every ASCII character, so each one fits comfortably in 1 byte. You could tell your friend to decode your JSON file using the ASCII encoding, and voila, she would be able to read what you sent her.

This was fine for the first few decades, but slowly we realized that there are far more characters than just the English ones. We tried extending the 128 characters to 256 (via Latin-1, also known as ISO-8859-1) to fully utilize the 8-bit space, but that was not enough. We needed an international standard that we all agreed on to deal with the many thousands of non-English characters.

In came Unicode!

Unicode is an international standard that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering scripts such as English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is represented by a Unicode code point. So Unicode code points refer to the actual characters that are displayed.
These code points are encoded to bytes and decoded from bytes back to code points. Examples: the Unicode code point for the letter a is U+0061, for the emoji 🖐 it is U+1F590, and for Ω it is U+03A9.
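
If you want to check these numbers yourself, here is a small Python 3 sketch using the built-in ord and chr functions. The helper name code_point_label is my own, made up for illustration, and the emoji is written as an escape sequence so the example does not depend on your editor's font.

```python
# ord() gives the code point of a single character; chr() goes the other way.
examples = ["a", "Ω", "你", "\U0001F590"]  # the last one is the 🖐 emoji


def code_point_label(char):
    """Return the conventional U+XXXX label for a single character (hypothetical helper)."""
    return "U+{:04X}".format(ord(char))


labels = {char: code_point_label(char) for char in examples}
for char, label in labels.items():
    print(char, label)

# chr() turns a code point number back into the character it stands for.
omega = chr(0x03A9)
print(omega)
```

Note that nothing here involves bytes yet: code points are just numbers, and how they are stored is a separate question.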

Three of the most popular encoding standards defined by Unicode are UTF-8, UTF-16 and UTF-32.

3. What are Unicode encodings UTF-8, UTF-16, and UTF-32?

We now know that Unicode is an international standard that maps every known character to a unique number. The next question is how we move these unique numbers around the internet. You already know the answer: using bytes of information!

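UTF-8 uses 1, 2, 3 or 4 bytes to encode every code point. It is backwards compatible with ASCII. All English characters need just 1 byte, which is quite efficient. We only need more bytes when we are sending non-ASCII characters.
It is the most popular encoding, and it is the default in Python 3: source files are assumed to be UTF-8, and str.encode() and bytes.decode() use UTF-8 when no argument is given. The default used by open() in text mode can depend on the platform, so it is safest to pass encoding="utf-8" explicitly. In Python 2, the default encoding is ASCII (unfortunately).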

UTF-16 is variable width and uses 2 or 4 bytes per code point. This encoding is great for Asian text, as most of it can be encoded in 2 bytes per character. It is bad for English, as every English character also needs 2 bytes here.

UTF-32 is fixed width and uses 4 bytes per code point. Every character takes 4 bytes, so it needs a lot of memory. It is not used very often.
[You can read more in this StackOverflow post.]
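
To make the size differences concrete, here is a rough Python 3 sketch that counts the bytes each encoding needs for a few characters. The helper encoded_sizes is a name I made up. I use the "-le" (little-endian) codec variants on purpose, because the plain "utf-16" and "utf-32" codecs prepend a byte order mark that would inflate the counts.

```python
# Compare how many bytes one character needs in each encoding.
samples = ["a", "Ω", "你", "\U0001F590"]  # the last one is the 🖐 emoji
encodings = ["utf-8", "utf-16-le", "utf-32-le"]


def encoded_sizes(char):
    """Map each encoding name to the byte length of char (hypothetical helper)."""
    return {name: len(char.encode(name)) for name in encodings}


sizes = {char: encoded_sizes(char) for char in samples}
for char, result in sizes.items():
    print(char, result)
# a  needs 1 / 2 / 4 bytes
# Ω  needs 2 / 2 / 4 bytes
# 你 needs 3 / 2 / 4 bytes
# 🖐 needs 4 / 4 / 4 bytes
```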

We need the encode method to convert Unicode code points to bytes. This typically happens when writing string data to a file, for example a CSV or JSON file.
We need the decode method to convert bytes to Unicode code points. This typically happens when reading data from a file into strings.
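
As a minimal sketch of that round trip in Python 3 (the variable names are mine, and everything stays in memory rather than touching a real file):

```python
text = "你好"                     # str: two Unicode code points

data = text.encode("utf-8")       # str -> bytes, as you would do before writing
restored = data.decode("utf-8")   # bytes -> str, as you would do after reading

print(data)              # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(restored == text)  # True
```

The round trip only gives back the original text because the same encoding is used in both directions.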

[Figure: Why are encode and decode methods needed?]

4. What data types in Python handle Unicode code points and bytes?

As we discussed earlier, strings in Python can be represented either as bytes or as Unicode code points.
The main takeaways in Python are:
1. Python 2 uses the **str** type to store bytes and the **unicode** type to store Unicode code points. All strings are of type str by default, which means bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 might fail because ASCII cannot represent them. In this case, we need to remember to use decode("utf-8") when reading files. This is inconvenient.
2. Python 3 came and fixed this. Strings are still of type **str** by default, but they now hold Unicode code points instead, so we carry what we see. If we want to store these str strings in files, we encode them into the **bytes** type. The default encoding is UTF-8 instead of ASCII. Perfect!
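
Here is a short Python 3 sketch of the two types side by side, using a Cyrillic word as the sample text. The variable names are my own. It also shows that Python 3 refuses to mix str and bytes silently, which is what saves you from the accidental conversions that Python 2 allowed.

```python
text = "Привет"               # str: 6 Unicode code points
data = text.encode("utf-8")   # bytes: each Cyrillic letter takes 2 bytes in UTF-8

print(type(text).__name__, len(text))  # str 6
print(type(data).__name__, len(data))  # bytes 12

# Mixing the two types is an error in Python 3 instead of a silent conversion.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc
print(type(mixing_error).__name__)     # TypeError
```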

5. Any code examples to compare the different data types?

Yes, let’s look at “你好”, which is Chinese for hello. Encoded in UTF-8, it takes 6 bytes to store this string made of 2 Unicode code points. Let’s take the popular len function as an example to see how things differ in Python 2 and 3, and what you need to keep in mind.

>>> print(len("你好"))   # Python 2 - str is bytes
6
>>> print(len(u"你好"))  # Python 2 - Add 'u' for unicode code points
2
>>> print(len("你好"))   # Python 3 - str is unicode code points
2

So, prefixing a u in Python 2 can make a complete difference to your code functioning correctly or not — which can be confusing! Python 3 fixed this by using unicode code points by default — so len will work as you would expect giving length of 2 in the example above.

Let’s look at more examples in Python 3 for dealing with strings:

# Strings are made of Unicode code points by default
>>> print(len("你好"))
2
# Manually encode a string into bytes
>>> print(len("你好".encode("utf-8")))
6
# You don't need to pass an argument, as the default encoding is "utf-8"
>>> print(len("你好".encode()))
6
# Print the Unicode code points as escape sequences instead of characters
>>> print("你好".encode("unicode_escape"))
b'\\u4f60\\u597d'
# Print the bytes encoded in UTF-8 for this string
>>> print("你好".encode())
b'\xe4\xbd\xa0\xe5\xa5\xbd'

6. It’s a lot of information! Can you summarize?

Sure! Let’s see all we have covered so far visually.
By default in Python 3, we are on the left side, in the world of Unicode code points for strings. We only need to go back and forth with bytes when writing or reading data. The default encoding for this conversion is UTF-8, but other encodings can also be used. When decoding, we need to know which encoding was used to produce the bytes, otherwise we might get errors or gibberish!

[Figure: Visual diagram of how encoding and decoding work for strings]

This diagram holds true for both Python 2 and Python 3! We might be getting UnicodeDecodeErrors because:

  1. We are trying to use ASCII to decode bytes that contain non-ASCII characters. This happens especially in Python 2, where the default encoding is ASCII. So you should explicitly encode and decode using UTF-8.
  2. We might be using the wrong decoder completely. If the Unicode code points were encoded in UTF-16 instead of UTF-8, you might run into bytes that are gibberish in UTF-8 land, so the UTF-8 decoder might fail completely to understand the bytes.
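
Both situations are easy to reproduce in Python 3. The sketch below is only an illustration with names I picked myself. It also adds a third case, decoding UTF-8 bytes as Latin-1, which is arguably worse because it raises no error at all and just hands you gibberish.

```python
text = "你好"

# Cause 1: decoding non-ASCII bytes with the ASCII codec.
utf8_bytes = text.encode("utf-8")
ascii_error = None
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc

# Cause 2: decoding with the wrong codec (UTF-16 bytes read as UTF-8).
utf16_bytes = text.encode("utf-16")
wrong_codec_error = None
try:
    utf16_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    wrong_codec_error = exc

# Silent failure: Latin-1 accepts any byte, so we get 6 wrong characters
# instead of an exception.
gibberish = utf8_bytes.decode("latin-1")

print(type(ascii_error).__name__)        # UnicodeDecodeError
print(type(wrong_codec_error).__name__)  # UnicodeDecodeError
print(len(gibberish), gibberish == text) # 6 False
```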

A good practice is to decode your bytes using UTF-8 (or whichever encoding was used to create those bytes) as soon as they are loaded from a file. Run all your processing on Unicode code points inside your Python code, and then encode back into bytes using UTF-8 at the very end, when writing to a file. This is called the Unicode Sandwich. Read/watch the excellent talk by Ned Batchelder (@nedbat) about this.
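
Here is one way the sandwich could look in Python 3. This is a sketch under my own assumptions, not code from the talk: the file names, the sample text and the shout function are all made up, and a temporary folder stands in for your real data directory.

```python
import os
import tempfile


def shout(text):
    """Stand-in for real processing; it only ever sees str, never bytes."""
    return text.upper()


with tempfile.TemporaryDirectory() as folder:
    source_path = os.path.join(folder, "input.txt")
    target_path = os.path.join(folder, "output.txt")

    # Set up some UTF-8 bytes on disk to play the role of incoming data.
    with open(source_path, "wb") as handle:
        handle.write("héllo 你好".encode("utf-8"))

    # Bottom slice of bread: decode bytes to str as soon as they come in.
    with open(source_path, "r", encoding="utf-8") as handle:
        text = handle.read()

    # The filling: work on Unicode code points only.
    processed = shout(text)

    # Top slice of bread: encode str to bytes only when writing out.
    with open(target_path, "w", encoding="utf-8") as handle:
        handle.write(processed)

    with open(target_path, "rb") as handle:
        written_bytes = handle.read()

print(processed)
print(written_bytes)
```

Passing encoding="utf-8" to open() does the decoding and encoding for you, so the rest of the program never has to call decode or encode by hand.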

If you want to add more information about strings in Python, please mention in the comments below as it will help others. This concludes my blog on the guide to Unicode, UTF-8 and strings. Good luck in your own explorations with text!

Guide to Python Programming Language

Description
The course will lead you from beginner level to advanced level in the Python programming language. You do not need any prior knowledge of Python, of any other programming language, or even of programming in general to join the course and become an expert on the topic.

The course is being continuously developed, with new lectures added regularly.

Please see the Promo and free sample video to get to know more.

Hope you will enjoy it.

Basic knowledge
An Enthusiast Mind
A Computer
Basic Knowledge To Use Computer
Internet Connection

What will you learn
Will Be Expert On Python Programming Language
Build Application On Python Programming Language

Python Programming Tutorials For Beginners

Description
Hello and welcome to a brand new series from wiredwiki. In this series I will teach you all you need to know about Python. This series is designed for beginners, but that doesn't mean that I will not talk about the advanced stuff as well.

As you may all know by now, my approach to teaching is very simple and straightforward. In this series I will talk about all the things you need to know to jump-start your Python programming skills. This series is designed for noobs who are totally new to programming, so if you don't know anything about programming then this is the way to go, guys. Here are the links to all the videos that I will upload in this whole series.

In this video I will cover the basic introduction you need to know about Python: which Python version to choose, how to install Python, how to get around the interface, and how to code your first program. Then we will talk about operators, expressions, numbers, strings, booleans, lists, dictionaries, tuples and then inputs in Python. With lots of exercises and more fun stuff, let's get started.

Download free Exercise files.

Dropbox: https://bit.ly/2AW7FYF

Who is the target audience?

First time Python programmers
Students and Teachers
IT pros who want to learn to code
Aspiring data scientists who want to add Python to their tool arsenal

Basic knowledge
Students should be comfortable working in the PC or Mac operating system

What will you learn
Know basic programming concepts and skills
Build 6 text-based applications using Python
Be able to learn other programming languages
Be able to build sophisticated systems using Python in the future

Learn Python Programming

Description
Learn Python programming and improve your Python programming skills with Coder Kovid.

Python is one of the fastest-growing programming languages today. You can use Python for almost everything: web development, software development, cognitive development, machine learning, artificial intelligence, and more. You should learn Python programming to improve your programming skills.

You don't need any prior programming knowledge for this course. Any beginner can start with it.

Basic knowledge
No prior knowledge needed to learn this course

What will you learn
Write Basic Syntax of Python Programming
Create Basic Real World Application
Program in a fluent manner
Get Familiar in Programming Environment