This is an example of using Python's re module to build a regular expression. We will match words that contain spaces, fields separated by one or more spaces, and numbers. These are some of the most basic tasks you may face in daily development. First, let's create a sample text file, e.g.:
test.txt
    4569619 3116511 Mondego
    1322303 3116511 San Julián
    370339 615620 Babushera
    4100133 891211 Gatooma
    629834 1214031 Poelo Roesa
    3210208 917742 Choma Estate
    513719 917742 Choma Extension
    3210820 920160 Chibuye Machila
    375787 618323 Manubeyevka
    4119676 2093597 Kombuglpagl Number 1
    4124183 2747559 Schimperbos
    4574844 3122918 Espadañedo
    4575493 2518539 Estación El Cobujón
    1100597 2285521 Kiohan
    642481 1253033 Wadhwan
    796506 1631407 Poelau Penjoesoe
    4681301 3198522 Rt Kabala
    1369593 3198522 Rt Ðurov Kam
    4577204 3125885 Punta del Castello
    1370155 3199460 Otok Gustač
    149490 222620 Hendi
    378801 623256 Polyashchitsy
    4175160 625251 Mezhëvo
    1496179 3695055 Morro de Malabrigo
    873711 1788861 Hsinhuang T’ung Nationality Autonomous Hsien
    873712 1788861 Hsin-huang T’ung Autonomous Hsien
    4184736 1791575 Weiyuanbu
    5390395 1791575 Weiyuan Zhen
    5390396 1791575 威远镇
    6150442 1869931 석가촌
    967307 1870841 Samga
    4740195 814059 Morozovo
    4740683 3224917 Koparflua
    4742360 816101 Bratskiy
    6994437 163418 Tall Yastī
    6646665 1896869 Gyegol
    6964971 165131 تل سبعين
    2545748 1896869 계골
    4659906 3706786 Cerro La Primitiva
    1499960 3706786 Cerro Primitivo
    4717527 1885842 Bali
    4039016 2673331 Stora Eken
    1505215 3732758 Hacienda Guayacai
The first thing we do is identify the common pattern in the data. Before that, recall these common Python regex building blocks:
Number : \d+
Letters : [a-zA-Z]+
Whitespace : \s+
Exactly 4 repetitions : {4}
One or more repetitions : +
Start of line : ^
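As a quick check, each building block above can be tried on a single line of the sample data. This is only a minimal sketch; the variable names are illustrative and do not come from the original script. Note that the digit and whitespace classes are written with a backslash (\d and \s) and that the patterns are raw strings, so Python does not interpret the backslashes itself.

```python
import re

sample = "4577204 3125885 Punta del Castello"

numbers = re.findall(r"\d+", sample)        # one or more digits
words = re.findall(r"[a-zA-Z]+", sample)    # one or more ASCII letters
fields = re.split(r"\s+", sample)           # split on runs of whitespace
four_digits = re.findall(r"\d{4}", sample)  # exactly four digits at a time
first = re.search(r"^\d+", sample)          # digits anchored at the start

print(numbers)         # ['4577204', '3125885']
print(words)           # ['Punta', 'del', 'Castello']
print(fields)          # ['4577204', '3125885', 'Punta', 'del', 'Castello']
print(four_digits)     # ['4577', '3125']
print(first.group(0))  # 4577204
```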
Now we write a simple Python script that reads the file line by line:

    import re

    def parsing_text():
        # Read the sample file and print each line unchanged.
        with open("test.txt") as f:
            for data in f:
                print(data)

    if __name__ == '__main__':
        parsing_text()
Then we can build the regex pattern by analyzing the text, e.g.:

    4577204 3125885 Punta del Castello

It can be interpreted as:

    (number)(spaces or tabs)(number)(spaces or tabs)(letters with spaces)
Then you can replace each number with (\d+), each run of spaces or tabs with \s+, and the letters-with-spaces part with ([a-zA-Z ]+):

    pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')

Put into the script, it looks like this:

    import re

    def parsing_text():
        pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
        with open("test.txt") as f:
            for data in f:
                result = pattern.search(data)
                if result:
                    # group(0) is the full text matched by the pattern
                    print(result.group(0))

    if __name__ == '__main__':
        parsing_text()
group(0) returns the complete text that the pattern matched. Because we separated the pattern into parts using ():

    ..(\d+)…(\d+)…([a-zA-Z ]+)

each part is captured as its own group: group(1) is the first number, group(2) is the second number, and group(3) is the alphabetic string.
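The group numbering can be checked on one line of the sample data. The following is a small sketch that uses an inline string instead of test.txt, so it runs on its own; the variable names are illustrative.

```python
import re

pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")
line = "4577204 3125885 Punta del Castello\n"  # one line as read from a file

result = pattern.search(line)
if result:
    print(result.group(0))  # 4577204 3125885 Punta del Castello
    print(result.group(1))  # 4577204
    print(result.group(2))  # 3125885
    print(result.group(3))  # Punta del Castello
    print(result.groups())  # ('4577204', '3125885', 'Punta del Castello')
```

The trailing newline is not part of any group because it is neither a letter nor a space in the character class [a-zA-Z ].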
Let's print only the name of the city:

    import re

    def parsing_text():
        pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
        with open("test.txt") as f:
            for data in f:
                result = pattern.search(data)
                # search() returns None when a line does not match,
                # so check the result before calling group().
                if result:
                    print(result.group(3))

    if __name__ == '__main__':
        parsing_text()
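The sample file also contains names that go beyond the letters A to Z, so it is worth checking what the character class [a-zA-Z ] does with them. The sketch below runs the pattern on three lines taken from the sample. The second pattern, which captures the rest of the line with (.+), is an alternative that does not appear in the original script and is shown only for comparison; the helper name city is hypothetical.

```python
import re

ascii_pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")
# Alternative (an assumption, not from the original script): capture
# everything up to the end of the line instead of ASCII letters and spaces.
rest_pattern = re.compile(r"^(\d+)\s+(\d+)\s+(.+)$")

lines = [
    "4577204 3125885 Punta del Castello",
    "1322303 3116511 San Julián",
    "5390396 1791575 威远镇",
]

def city(pattern, line):
    """Return the third group, or None when the line does not match."""
    result = pattern.search(line)
    return result.group(3) if result else None

for line in lines:
    print(repr(city(ascii_pattern, line)), repr(city(rest_pattern, line)))
# 'Punta del Castello' 'Punta del Castello'
# 'San Juli' 'San Julián'
# None '威远镇'
```

With [a-zA-Z ] the match stops at the first character outside the class, and a name that starts with such a character does not match at all. This is why the result must be checked for None before calling group().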
We can also limit the length of the city name, e.g.:
Length exactly 6:

    ([a-zA-Z ]{6})

Length 6 or more (note that there must be no space inside the braces):

    ([a-zA-Z ]{6,})

Length in the range 1-6:

    ([a-zA-Z ]{1,6})

The same quantifiers can be applied to the number regex, e.g.:

    (\d{3})
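The length limits can be compared side by side on a few names from the sample. This is a sketch under one assumption: the anchors ^ and $ are added here so that the whole string has to fit the limit. Without them, a pattern such as [a-zA-Z ]{6} would also match the first six characters of a longer name. The helper name check is hypothetical.

```python
import re

# ^ and $ are added for this demo so the entire string is measured,
# not just a part of it.
exact_six = re.compile(r"^[a-zA-Z ]{6}$")
six_or_more = re.compile(r"^[a-zA-Z ]{6,}$")
one_to_six = re.compile(r"^[a-zA-Z ]{1,6}$")
three_digits = re.compile(r"^\d{3}$")

def check(pattern, text):
    """Return True when the pattern matches the text."""
    return pattern.search(text) is not None

for name in ["Bali", "Kiohan", "Gatooma", "Choma Estate"]:
    print(name, check(exact_six, name), check(six_or_more, name),
          check(one_to_six, name))
# Bali False False True
# Kiohan True True True
# Gatooma False True False
# Choma Estate False True False

print(check(three_digits, "123"), check(three_digits, "1234"))  # True False
```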
Further reading:
http://www.regular-expressions.info/reference.html