Strings¶
Usage¶
String objects are very common in Python. They are especially useful when reading and writing text files.
They are delimited by single quotes (') or double quotes (").
Special characters¶
Python defines a set of escape sequences for special characters. The most common ones are listed below (source: python.org).
Character | Definition |
---|---|
\a | ASCII Bell (BEL) |
\b | ASCII Backspace (BS) |
\f | ASCII Formfeed (FF) |
\n | ASCII Linefeed (LF) |
\r | ASCII Carriage Return (CR) |
\t | ASCII Horizontal Tab (TAB) |
\v | ASCII Vertical Tab (VT) |
To prevent these escape sequences from being interpreted, add r before the opening quote (this creates a raw string):
str1 = '@this is \n@' # \n interpreted as line break
print(str1)
str2 = r'%this is \n%' # raw string: \n is kept as a backslash followed by n
print(str2)
@this is
@
%this is \n%
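As a quick check of the difference, the sketch below compares the length of a normal string and a raw string. The variable names are illustrative only.

```python
# In a normal string, \n is a single line-feed character
normal = 'a\nb'
# In a raw string, the backslash and the n are kept as two separate characters
raw = r'a\nb'

print(len(normal))      # 3
print(len(raw))         # 4
print(raw == 'a\\nb')   # True: a raw string is equivalent to doubling the backslash
```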
String manipulation¶
String objects provide many methods to manipulate them.
Since strings are immutable, these methods return new string objects. This differs from list methods such as sort or append, which modify the list in place.
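The contrast between string and list methods can be sketched as follows. The variable names are made up for this illustration.

```python
name = 'python'
upper_name = name.upper()   # returns a new string, name is left untouched
print(name)                 # python
print(upper_name)           # PYTHON

values = [3, 1, 2]
result = values.sort()      # sorts the list in place and returns None
print(values)               # [1, 2, 3]
print(result)               # None
```

A common consequence is that the result of a string method must be assigned to a variable, otherwise it is lost.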
Extracting characters¶
string1 = 'char1 char2 char3'
print(len(string1)) # length of a string
chars = list(string1) # returns a list of char
print(chars)
17
['c', 'h', 'a', 'r', '1', ' ', 'c', 'h', 'a', 'r', '2', ' ', 'c', 'h', 'a', 'r', '3']
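Individual characters and substrings can also be extracted directly by indexing and slicing, without converting the string to a list. A minimal sketch, reusing the same example string:

```python
string1 = 'char1 char2 char3'

print(string1[0])      # c  (first character)
print(string1[-1])     # 3  (last character)
print(string1[6:11])   # char2  (characters at index 6 to 10)
print(string1[::-1])   # 3rahc 2rahc 1rahc  (reversed copy)
```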
Changing case¶
# change case
# converts the string to lower case
stringl = string1.lower()
print(stringl)
# converts the string to upper case
stringu = string1.upper()
print(stringu)
# capitalizes the first character and lowercases the rest
stringc = string1.capitalize()
print(stringc)
# note that, unlike list methods such as sort(), these methods do not modify string1:
# each one returns a new string object, since strings are immutable
char1 char2 char3
CHAR1 CHAR2 CHAR3
Char1 char2 char3
Replacement and word splitting¶
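# replace() is case sensitive: 'Char2' does not occur in string1, so the string is returned unchanged
string2 = string1.replace('Char2', 'toto')
print(string2)
words = string1.split(' ') # ['char1', 'char2', 'char3']
print(words)
words = string1.split(',') # no comma in string1: ['char1 char2 char3']
print(words)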
char1 char2 char3
['char1', 'char2', 'char3']
['char1 char2 char3']
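The sketch below shows replace with the matching case, together with a few split variants. The extra example string 'a,b,,c' is only an illustration.

```python
text = 'char1 char2 char3'

print(text.replace('Char2', 'toto'))  # char1 char2 char3 (case differs, nothing replaced)
print(text.replace('char2', 'toto'))  # char1 toto char3
print(text.split())                   # ['char1', 'char2', 'char3'] (splits on any whitespace)
print('a,b,,c'.split(','))            # ['a', 'b', '', 'c'] (empty fields are kept)
```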
String formatting¶
sep = ',\n'
# joins a list of strings into one string, inserting the separator between items
string3 = sep.join(['toto1', 'toto2','toto3'])
print(string3)
toto1,
toto2,
toto3
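Note that join only accepts strings. A minimal sketch of how a list of mixed values might be joined, with illustrative values:

```python
values = [1, 2.5, 'three']

# convert each item to a string before joining
line = ', '.join(str(v) for v in values)
print(line)  # 1, 2.5, three

# joining non-string items directly raises a TypeError
try:
    ', '.join(values)
except TypeError as err:
    print('TypeError:', err)
```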
# strings behave like lists here: + and * do not perform arithmetic
# + concatenates strings and * repeats them
string4 = 2 * 'toto1' + '\t' + 'toto2' + \
'\n' + 'toto3 ' + str(10)
print(string4)
toto1toto1 toto2
toto3 10
# variables to display
x = 10
y = 0.5
z = 0.005
# string formatting. There should be as many format specifiers (%d, %f, %e, %s...) as variables to display
string5 = '%04d, %.5f, %.3e\n' %(x, y, z)
# string5 = '%04d, %.5f\n' %(x, y, z) # fails (TypeError): 2 format specifiers for 3 variables
print(string5)
0010, 0.50000, 5.000e-03
# To write a literal percent sign, pass '%' through %s (writing %% in the format string also works)
string5 = 'Percentage:\n%d%s' %(x, '%')
print(string5)
Percentage:
10%
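The same output can also be produced with str.format or with f-strings (Python 3.6 and later), which use the same format codes after a colon. A short sketch with the same variables:

```python
x = 10
y = 0.5
z = 0.005

old_style = '%04d, %.5f, %.3e' % (x, y, z)
new_style = '{:04d}, {:.5f}, {:.3e}'.format(x, y, z)
f_string = f'{x:04d}, {y:.5f}, {z:.3e}'

print(old_style)  # 0010, 0.50000, 5.000e-03
print(new_style)  # 0010, 0.50000, 5.000e-03
print(f_string)   # 0010, 0.50000, 5.000e-03

# literal percent sign: %% with the % operator, a plain % in an f-string
print('%d%%' % x)  # 10%
print(f'{x}%')     # 10%
```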
string5 = ' test string '
# removing leading and trailing whitespace (useful when reading a file)
string6 = string5.strip()
print('#' + string5 + '#')
print('#' + string6 + '#')
# test string #
#test string#
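When only one side should be cleaned, lstrip and rstrip can be used instead of strip. The sample line below is hypothetical, and repr is used to make the remaining whitespace visible.

```python
line = '\t value = 42 \n'

print(repr(line.strip()))    # 'value = 42'
print(repr(line.lstrip()))   # 'value = 42 \n'
print(repr(line.rstrip()))   # '\t value = 42'

# strip() also accepts the set of characters to remove
print('--title--'.strip('-'))  # title
```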
Regular expressions¶
A very powerful feature is regular expressions, which match strings against given patterns (re library).
Creating regular expressions¶
# load the regular expression package
import re
# match string that starts (^) with a number ranging from 0 to 9
pattern1 = r'^[0-9]'
reg1 = re.compile(pattern1) # creates an object that will be used to match string
# match string that ends ($) with a number ranging from 0 to 9
pattern2 = r'\d$' # \d is a shortcut for [0-9]
reg2 = re.compile(pattern2) # creates an object that will be used to match string
Matching regular expressions¶
string1 = r'2-start'
string2 = r'end-3'
reg1.match(string1) # match: returns a re.Match object
<re.Match object; span=(0, 1), match='2'>
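reg2.match(string2) # returns None: match() only looks at the start of the string, and 'e' is not a digit
reg2.match(string1) # returns None: '2' is a digit, but it is not at the end of the string
reg1.match(string2) # returns None: string2 does not start with a digit
# note: compile a regular expression if it will be tested
# often. For isolated cases, you can use: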
re.match(pattern1, string1)
<re.Match object; span=(0, 1), match='2'>
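Because match only tries the pattern at the start of the string, a pattern anchored at the end is usually tested with search instead. The sketch below compares match, search and fullmatch from the re module. The variable names are illustrative.

```python
import re

ends_with_digit = re.compile(r'\d$')

# match() tries the pattern at position 0 only
print(ends_with_digit.match('end-3'))    # None

# search() scans the whole string for the first position where the pattern matches
m = ends_with_digit.search('end-3')
print(m.group())   # 3
print(m.span())    # (4, 5)

# fullmatch() requires the whole string to match the pattern
print(re.fullmatch(r'\d+', '2024') is not None)   # True
print(re.fullmatch(r'\d+', '2024-start'))         # None
```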
Extracting values¶
To extract the values matched by a regular expression, use the groups method of the match object. The parts of the pattern to extract must be enclosed in parentheses ().
string1 = r'2-start'
string2 = r'04304-end'
string3 = r'04304-END'
string4 = r'-END'
# Function that prints the groups if there is a match, and 'none' otherwise
def test(match):
    if match:
        print(match.groups())
    else:
        print('none')
# to get the captured values, use the groups() method of the match object
# use () to enclose the elements you want to extract
pattern1 = r'^([0-9]+)-([a-z]+)$' # + = 1 or more match of the preceding pattern
reg1 = re.compile(pattern1)
test(reg1.match(string1)) # Match
test(reg1.match(string2)) # Match
test(reg1.match(string3)) # No match (END is upper case, the pattern only accepts lower case)
test(reg1.match(string4)) # No match (doesn't start with a number)
('2', 'start')
('04304', 'end')
none
none
pattern2 = r'^([0-9]+)-([a-zA-Z]+)$' # + = 1 or more match of the preceding pattern
reg2 = re.compile(pattern2)
test(reg2.match(string1)) # Match
test(reg2.match(string2)) # Match
test(reg2.match(string3)) # Match (the pattern now accepts both lower and upper case letters)
test(reg2.match(string4)) # No match (doesn't start with a number)
('2', 'start')
('04304', 'end')
('04304', 'END')
none
pattern3 = r'^([0-9]*)-([a-zA-Z]+)$' # * = 0 or more match of the preceding pattern
reg3 = re.compile(pattern3)
# All matches
test(reg3.match(string1))
test(reg3.match(string2))
test(reg3.match(string3))
test(reg3.match(string4))
('2', 'start')
('04304', 'end')
('04304', 'END')
('', 'END')
pattern4 = r'^([0-9]?)-([a-zA-Z]+)$' # ? = 0 or 1 match of the preceding pattern
reg4 = re.compile(pattern4)
test(reg4.match(string1)) # Match
test(reg4.match(string2)) # No match (more than one digit at the beginning)
test(reg4.match(string3)) # No match (more than one digit at the beginning)
test(reg4.match(string4)) # Match (zero digits at the beginning)
('2', 'start')
none
none
('', 'END')
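Captured groups are always returned as strings, so numeric values must be converted explicitly. The sketch below also uses named groups, written (?P<name>...), which is an optional alternative to positional groups. The group names number and label are chosen for this illustration.

```python
import re

reg = re.compile(r'^(?P<number>[0-9]+)-(?P<label>[a-zA-Z]+)$')
m = reg.match('04304-end')

if m:
    number = int(m.group('number'))   # convert the captured text to an integer
    label = m.group('label')
    print(number, label)              # 4304 end
    print(m.groups())                 # ('04304', 'end')
    print(m.groupdict())              # {'number': '04304', 'label': 'end'}
```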
Splitting using regular expressions¶
# How to split this string into the three names when they are separated by runs of spaces?
string1 = r'lala     toto                      titi'
sp1 = string1.split(' ')
print(sp1)
reg = re.compile(' +') # split based on regular expressions: splits with separator = 1 or more spaces
sp2 = reg.split(string1)
print(sp2)
['lala', '', '', '', '', 'toto', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'titi']
['lala', 'toto', 'titi']
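For plain whitespace, calling split without arguments gives the same result as the compiled pattern. A regular expression becomes necessary when the separator is more complex. A minimal sketch, with example strings made up for the illustration:

```python
import re

text = 'lala \t toto\n titi'

# split() without arguments splits on any run of whitespace
print(text.split())              # ['lala', 'toto', 'titi']

# equivalent regular expression: \s+ means one or more whitespace characters
print(re.split(r'\s+', text))    # ['lala', 'toto', 'titi']

# a comma surrounded by optional whitespace as separator
print(re.split(r'\s*,\s*', 'a , b,c'))   # ['a', 'b', 'c']
```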
string1 = r'01 0304 02 45 509 2950 204' # 302 01 2030 39393 50505 s0304 43df'
# list all the digits of the string
reg = re.compile(r'[0-9]')
print(string1)
print(reg.findall(string1))
01 0304 02 45 509 2950 204
['0', '1', '0', '3', '0', '4', '0', '2', '4', '5', '5', '0', '9', '2', '9', '5', '0', '2', '0', '4']
# List all runs of 2 to 3 digits. However, 0304 is matched as 030 and 2950 as 295
reg = re.compile(r'[0-9]{2,3}')
print(string1)
print(reg.findall(string1))
01 0304 02 45 509 2950 204
['01', '030', '02', '45', '509', '295', '204']
# better but 0304 is still matched as 304, and 204 is not matched
reg = re.compile(r'[0-9]{2,3} ') # adding white space at the end
print(string1)
print(reg.findall(string1))
01 0304 02 45 509 2950 204
['01 ', '304 ', '02 ', '45 ', '509 ', '950 ']
# closer with alternatives (|), but the matches include white spaces, and 45 is missed because its leading space was already consumed by ' 02 '
reg = re.compile(r' [0-9]{2,3} |^[0-9]{2,3} | [0-9]{2,3}$')
print(string1)
print(reg.findall(string1))
01 0304 02 45 509 2950 204
['01 ', ' 02 ', ' 509 ', ' 204']
# Solution: use \b (word boundary), which matches a position without consuming any character
reg = re.compile(r'\b[0-9]{2,3}\b')
print(string1)
print(reg.findall(string1))
01 0304 02 45 509 2950 204
['01', '02', '45', '509', '204']
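Once the tokens are matched, they can be converted to integers, and finditer can be used when the position of each match is needed. A short sketch based on the same string and pattern:

```python
import re

text = '01 0304 02 45 509 2950 204'
reg = re.compile(r'\b[0-9]{2,3}\b')

# findall() returns strings, so convert them explicitly
numbers = [int(tok) for tok in reg.findall(text)]
print(numbers)   # [1, 2, 45, 509, 204]

# finditer() yields match objects, which also give the position of each match
starts = [m.start() for m in reg.finditer(text)]
print(starts)    # [0, 8, 11, 14, 23]
```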