Python: All About Unicode
January 05, 2020
The main goal of this cheat sheet is to collect common snippets related to Unicode. In Python 3, strings are represented as Unicode instead of bytes. Further information can be found in PEP 3100.
ASCII is the best-known standard that assigns numeric codes to characters. It originally defines only 128 codes, so it covers little more than control codes, digits, lowercase letters, uppercase letters, and punctuation. That is not enough to represent accented characters, Chinese characters, or emoji used around the world, so Unicode was developed to solve this problem. Like ASCII, it assigns a code point to each character, but it has room for up to 1,111,998 characters.
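As a rough illustration of the difference in range, the sketch below checks which characters fit into ASCII and reads the largest code point the interpreter accepts. It assumes Python 3.3 or later; the variable names are made up for this example.

```python
import sys

ascii_text = "Cafe"
unicode_text = "Café"

# Every ASCII character has a code point below 128.
ascii_only = all(ord(ch) < 128 for ch in ascii_text)

# 'é' (U+00E9) lies outside the ASCII range, so encoding to ASCII fails.
try:
    unicode_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The highest code point a Python 3 str can hold is U+10FFFF.
max_code_point = sys.maxunicode

print(ascii_only, ascii_failed, hex(max_code_point))
```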
String
In Python 2, strings are represented in bytes, not Unicode. Python provides different types of string such as Unicode strings, raw strings, and so on. To declare a Unicode string, we add the u prefix to the string literal.
>>> s = 'Café' # byte string
>>> s
'Caf\xc3\xa9'
>>> type(s)
<type 'str'>
>>> u = u'Café' # unicode string
>>> u
u'Caf\xe9'
>>> type(u)
<type 'unicode'>
In Python 3, strings are represented in Unicode. If we want to represent a byte string, we add the b prefix to the string literal. Note that the early Python 3 versions (3.0-3.2) do not support the u prefix. To ease the pain of migrating Unicode-aware applications from Python 2, Python 3.3 once again supports the u prefix for string literals. Further information can be found in PEP 414.
>>> s = 'Café'
>>> type(s)
<class 'str'>
>>> s
'Café'
>>> s.encode('utf-8')
b'Caf\xc3\xa9'
>>> s.encode('utf-8').decode('utf-8')
'Café'
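The short sketch below, which assumes Python 3.3 or later and uses illustrative variable names, shows that the u prefix produces an ordinary str and that str and bytes are never mixed implicitly.

```python
text = "Café"
data = text.encode("utf-8")  # str -> bytes

# The u prefix is accepted again since Python 3.3 and yields a plain str.
same_literal = (u"Café" == text)

# bytes and str are distinct types, so they do not compare equal.
mixed_equal = (data == text)

# Combining them raises TypeError instead of converting silently.
try:
    text + data
    concat_error = False
except TypeError:
    concat_error = True

print(same_literal, mixed_equal, concat_error)
```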
Characters
Python 2 treats every character of a plain string as a byte, so the length of a string may differ from the number of characters. For example, the length of Café is 5, not 4, because é is encoded in UTF-8 as a 2-byte character.
>>> s = 'Café'
>>> print([_c for _c in s])
['C', 'a', 'f', '\xc3', '\xa9']
>>> len(s)
5
>>> s = u'Café'
>>> print([_c for _c in s])
[u'C', u'a', u'f', u'\xe9']
>>> len(s)
4
Python 3 treats every character of a string as a Unicode code point. The length of a string always equals the number of code points it contains.
>>> s = 'Café'
>>> print([_c for _c in s])
['C', 'a', 'f', 'é']
>>> len(s)
4
>>> bs = bytes(s, encoding='utf-8')
>>> print(bs)
b'Caf\xc3\xa9'
>>> len(bs)
5
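To see why the byte length and the string length differ, the following sketch counts the UTF-8 bytes for a few sample characters. The sample list is an arbitrary choice for illustration.

```python
# One character each from ASCII, Latin-1, CJK, and the emoji range.
# "\U0001F600" is the escape for the grinning-face emoji.
samples = ["C", "é", "中", "\U0001F600"]

# Each entry is a single code point, so len(ch) is 1 for all of them,
# while the UTF-8 encoding needs between 1 and 4 bytes.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, width in widths.items():
    print("U+%04X takes %d byte(s) in UTF-8" % (ord(ch), width))
```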
Porting unicode(s, 'utf-8')
The unicode() built-in function was removed in Python 3, so what is the best way to convert the expression unicode(s, 'utf-8') so that it works in both Python 2 and 3?
In Python 2:
>>> s = 'Café'
>>> unicode(s, 'utf-8')
u'Caf\xe9'
>>> s.decode('utf-8')
u'Caf\xe9'
>>> unicode(s, 'utf-8') == s.decode('utf-8')
True
In Python 3:
>>> s = 'Café'
>>> s.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
So, the real answer is…
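The text above leaves the answer open. One commonly used approach, offered here as an assumption rather than as the author's answer, is to decode only when the value is a byte string. The helper name to_text is hypothetical.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string (hypothetical helper)."""
    if isinstance(value, bytes):
        # Python 2 `str` and Python 3 `bytes` both take this branch,
        # because `bytes` is an alias of `str` on Python 2.
        return value.decode(encoding)
    # Already text: `unicode` on Python 2, `str` on Python 3.
    return value


print(to_text(b"Caf\xc3\xa9"))
print(to_text(u"Caf\xe9"))
```

Checking the type first avoids the AttributeError shown above, since str has no decode method on Python 3.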
Unicode Code Point
ord is a built-in function that returns the Unicode code point of a given character. If we want to check the code point of a character, we can use ord.
>>> s = u'Café'
>>> for _c in s: print('U+%04x' % ord(_c))
...
U+0043
U+0061
U+0066
U+00e9
>>> u = '中文'
>>> for _c in u: print('U+%04x' % ord(_c))
...
U+4e2d
U+6587
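As a complement to ord, the sketch below uses the built-in chr to go from a code point back to a character and unicodedata.name to look up the official character name. The rows list is only an illustrative container.

```python
import unicodedata

rows = []
for ch in "Café":
    code_point = ord(ch)
    # chr is the inverse of ord: it maps a code point back to a character.
    assert chr(code_point) == ch
    rows.append(("U+%04X" % code_point, unicodedata.name(ch)))

for label, name in rows:
    print(label, name)
```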
Encoding
Encoding is the conversion of Unicode code points into a byte string.
>>> s = u'Café'
>>> type(s.encode('utf-8'))
<class 'bytes'>
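The same string produces different bytes depending on the codec. The sketch below compares three standard codecs and shows the errors argument on the encoding side; the choice of codecs is arbitrary.

```python
s = "Café"

# Encode the same text with three different codecs.
encoded = {name: s.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}
for name, data in encoded.items():
    print(name, data, len(data))

# ASCII cannot represent 'é'; errors="replace" substitutes '?'.
ascii_lossy = s.encode("ascii", errors="replace")
print(ascii_lossy)
```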
Decoding
Decoding is the conversion of a byte string back into Unicode code points.
>>> s = bytes('Café', encoding='utf-8')
>>> s.decode('utf-8')
'Café'
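Decoding only gives the original text back when the codec matches the one used for encoding. The sketch below decodes UTF-8 bytes as Latin-1 to show the resulting garbled text, and then reverses the mistake; the variable names are illustrative.

```python
data = "Café".encode("utf-8")

right = data.decode("utf-8")    # matches the encoding
wrong = data.decode("latin-1")  # each byte becomes one character

print(right, len(right))
print(wrong, len(wrong))

# Because Latin-1 maps every byte to exactly one code point, the mistake
# can be undone by re-encoding and decoding with the correct codec.
repaired = wrong.encode("latin-1").decode("utf-8")
print(repaired)
```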
Unicode Normalization
Some characters can be represented in more than one form. For example, the character é can be written as e followed by a combining acute accent (U+0301), which is the canonical decomposition, or as the single code point é (U+00E9), which is the canonical composition. As a result, comparing two strings may give unexpected results even though they look alike. We can normalize both strings to the same Unicode form to solve the issue.
# python 3
>>> u1 = 'Café' # unicode string
>>> u2 = 'Cafe\u0301'
>>> u1, u2
('Café', 'Café')
>>> len(u1), len(u2)
(4, 5)
>>> u1 == u2
False
>>> u1.encode('utf-8') # get u1 byte string
b'Caf\xc3\xa9'
>>> u2.encode('utf-8') # get u2 byte string
b'Cafe\xcc\x81'
>>> from unicodedata import normalize
>>> s1 = normalize('NFC', u1) # get u1 NFC format
>>> s2 = normalize('NFC', u2) # get u2 NFC format
>>> s1 == s2
True
>>> s1.encode('utf-8'), s2.encode('utf-8')
(b'Caf\xc3\xa9', b'Caf\xc3\xa9')
>>> s1 = normalize('NFD', u1) # get u1 NFD format
>>> s2 = normalize('NFD', u2) # get u2 NFD format
>>> s1, s2
('Café', 'Café')
>>> s1 == s2
True
>>> s1.encode('utf-8'), s2.encode('utf-8')
(b'Cafe\xcc\x81', b'Cafe\xcc\x81')
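The comparison pattern above can be wrapped in a small function. This is a minimal sketch; nfc_equal is a hypothetical name, and NFC is chosen arbitrarily since NFD would work equally well as long as both sides use the same form.

```python
from unicodedata import normalize


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return normalize("NFC", a) == normalize("NFC", b)


composed = "Caf\u00e9"     # é as a single code point
decomposed = "Cafe\u0301"  # e followed by a combining acute accent

print(composed == decomposed)
print(nfc_equal(composed, decomposed))
```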
Avoid UnicodeDecodeError
Python raises UnicodeDecodeError when a byte string cannot be decoded into Unicode code points. To avoid this exception, we can pass replace, backslashreplace, or ignore as the errors argument of decode.
>>> u = b"\xff"
>>> u.decode('utf-8', 'strict')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
>>> # use U+FFFD, REPLACEMENT CHARACTER
>>> u.decode('utf-8', "replace")
'\ufffd'
>>> # inserts a \xNN escape sequence
>>> u.decode('utf-8', "backslashreplace")
'\\xff'
>>> # leave the character out of the Unicode result
>>> u.decode('utf-8', "ignore")
''
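When the encoding of the input is unknown, another option is to try several codecs in turn before falling back to a lossy decode. The sketch below is one possible arrangement, not a recommendation from the source; decode_with_fallback and its default codec list are assumptions made for this example.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each codec in order; replace bad bytes if none fits (hypothetical helper)."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # No codec accepted the bytes: decode with the first one and
    # substitute U+FFFD for every undecodable byte.
    return data.decode(encodings[0], errors="replace")


print(decode_with_fallback(b"Caf\xc3\xa9"))  # valid UTF-8
print(decode_with_fallback(b"Caf\xe9"))      # invalid UTF-8, valid cp1252
```

Note that a permissive codec can accept bytes it was never meant for, so the result is a best guess rather than a guarantee.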
Long String
The following snippet shows common ways to split a long string literal across several source lines in Python.
# original long string
s = 'This is a very very very long python string'
# Adjacent string literals joined with a line-continuation backslash
s = "This is a very very very " \
    "long python string"
# Using brackets
s = (
    "This is a very very very "
    "long python string"
)
# Using ``+``
s = (
    "This is a very very very " +
    "long python string"
)
# Using triple-quote with an escaping backslash
s = '''This is a very very very \
long python string'''
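As a quick check, the sketch below confirms that these forms produce the same string, and points out one pitfall of the triple-quoted form: any indentation on the continuation line becomes part of the string. The variable names are illustrative.

```python
expected = "This is a very very very long python string"

with_backslash = "This is a very very very " \
                 "long python string"

in_parens = (
    "This is a very very very "
    "long python string"
)

triple_quoted = '''This is a very very very \
long python string'''

# Indenting the continuation line adds the spaces to the string itself.
triple_indented = '''This is a very very very \
    long python string'''

print(with_backslash == expected, in_parens == expected)
print(triple_quoted == expected, triple_indented == expected)
```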