Regex: matching a list of numbers and words separated by spaces, including words that contain spaces, using Python

This is an example of using Python's re module to build a regular expression. We will match lines made of numbers and words separated by one or more spaces, where the words themselves may contain spaces. This is a basic task that comes up often in daily development. First, let's create a sample text file, e.g.:

test.txt

4569619 3116511 Mondego
1322303 3116511 San Julián
370339 615620 Babushera
4100133 891211 Gatooma
629834 1214031 Poelo Roesa
3210208 917742 Choma Estate
513719 917742 Choma Extension
3210820 920160 Chibuye Machila
375787 618323 Manubeyevka
4119676 2093597 Kombuglpagl Number 1
4124183 2747559 Schimperbos
4574844 3122918 Espadañedo
4575493 2518539 Estación El Cobujón
1100597 2285521 Kiohan
642481 1253033 Wadhwan
796506 1631407 Poelau Penjoesoe
4681301 3198522 Rt Kabala
1369593 3198522 Rt Ðurov Kam
4577204 3125885 Punta del Castello
1370155 3199460 Otok Gustač
149490 222620 Hendi
378801 623256 Polyashchitsy
4175160 625251 Mezhëvo
1496179 3695055 Morro de Malabrigo
873711 1788861 Hsinhuang T’ung Nationality Autonomous Hsien
873712 1788861 Hsin-huang T’ung Autonomous Hsien
4184736 1791575 Weiyuanbu
5390395 1791575 Weiyuan Zhen
5390396 1791575 威远镇
6150442 1869931 석가촌
967307 1870841 Samga
4740195 814059 Morozovo
4740683 3224917 Koparflua
4742360 816101 Bratskiy
6994437 163418 Tall Yastī
6646665 1896869 Gyegol
6964971 165131 تل سبعين
2545748 1896869 계골
4659906 3706786 Cerro La Primitiva
1499960 3706786 Cerro Primitivo
4717527 1885842 Bali
4039016 2673331 Stora Eken
1505215 3732758 Hacienda Guayacai

The first step is to identify the common pattern. Before that, keep in mind these common Python regex tokens:

Number : \d+
String : [a-zA-Z]+
Spaces : \s+
Exactly 4 times : {4}
One or more times : +
Start of line : ^
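
As a quick check of these building blocks, here is a minimal sketch that applies each token to one line of the sample file. The variable names are illustrative only and not part of the original script.

```python
import re

# One sample line from test.txt (chosen here for illustration).
line = "4577204 3125885 Punta del Castello"

# \d+ : one or more digits
numbers = re.findall(r"\d+", line)

# [a-zA-Z]+ : one or more ASCII letters
words = re.findall(r"[a-zA-Z]+", line)

# \s+ : one or more whitespace characters; splitting on it yields the fields
fields = re.split(r"\s+", line)

# ^ anchors at the start of the line; {4} repeats the previous token exactly 4 times
first_four = re.match(r"^\d{4}", line).group(0)

print(numbers)
print(words)
print(fields)
print(first_four)
```

Raw strings (the r"..." prefix) keep the backslashes intact, which is why they are commonly used for regex patterns in Python.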

Now we create a simple Python script to read the file:

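import re

def parsing_text():
    # Read test.txt line by line and print each line.
    # The file contains non-ASCII names, so open it as UTF-8.
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            print(data)

if __name__ == '__main__':
    parsing_text()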

Then we can build the regex pattern by analyzing the text, e.g.:

4577204 3125885 Punta del Castello

It can be interpreted as:

(number)(spaces)(number)(spaces or tabs)(letters with spaces)

Then you can replace each number with (\d+), the spaces with \s+, and the letters with spaces with ([a-zA-Z ]+):

pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')

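import re

def parsing_text():
    # Two numbers followed by a name made of ASCII letters and spaces
    pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            result = pattern.search(data)
            if result:
                # group(0) is the whole matched text
                print(result.group(0))

if __name__ == '__main__':
    parsing_text()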

group(0) returns the entire text matched by the pattern. Because we split the pattern into groups using ():

..(\d+)…(\d+)…([a-zA-Z ]+)

group(1) is the first number, group(2) is the second number, and group(3) is the name made of letters and spaces.
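
To make the group numbering concrete, here is a minimal sketch on a single line copied from test.txt. The variable names (whole, first_id, second_id, name) are chosen here for illustration.

```python
import re

pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")
sample = "4577204 3125885 Punta del Castello"  # one line from test.txt

result = pattern.search(sample)
if result:
    whole = result.group(0)      # everything the pattern matched
    first_id = result.group(1)   # first (\d+)
    second_id = result.group(2)  # second (\d+)
    name = result.group(3)       # ([a-zA-Z ]+)
    # groups() returns groups 1..3 together as a tuple
    print(result.groups())
```

Groups are numbered by the position of their opening parenthesis, counting from 1. Group 0 always refers to the whole match.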

Let's print only the city name:

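import re

def parsing_text():
    pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            result = pattern.search(data)
            # search() returns None when a line does not match,
            # so check the result before reading a group
            if result:
                print(result.group(3))

if __name__ == '__main__':
    parsing_text()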
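
Note that [a-zA-Z ] covers ASCII letters and spaces only, so some lines of test.txt are cut short or not matched at all. The sketch below compares it with a looser alternative. The names any_pattern and extract_name are introduced here for illustration, and treating everything after the second number as the place name is an assumption about the data, not something the original script does.

```python
import re

ascii_pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")
# Assumption: everything after the second number is the place name.
any_pattern = re.compile(r"^(\d+)\s+(\d+)\s+(.+)$")

# Three lines taken from test.txt
samples = [
    "1322303 3116511 San Julián",
    "5390396 1791575 威远镇",
    "4119676 2093597 Kombuglpagl Number 1",
]

def extract_name(pattern, line):
    """Return group(3) of the match, or None when the line does not match."""
    result = pattern.search(line)
    return result.group(3) if result else None

for line in samples:
    # The ASCII-only pattern stops at the first accented letter or digit,
    # and returns None when the name has no ASCII letters at all.
    print(repr(extract_name(ascii_pattern, line)),
          repr(extract_name(any_pattern, line)))
```

Which pattern fits depends on the goal: the strict class is useful for filtering ASCII-only names, while the looser one keeps every name intact.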

We can limit the length of the city name, e.g.:

Length exactly 6:

([a-zA-Z ]{6})

Length 6 or more:

([a-zA-Z ]{6,})

Length in the range 1-6:

([a-zA-Z ]{1,6})

The same quantifiers apply to the number patterns, e.g.:

(\d{3})
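
Here is a minimal sketch of these length limits, using a few names from test.txt. It uses fullmatch() so that the whole string has to fit the pattern. With search() and no end anchor, a pattern such as [a-zA-Z ]{6} would also match the first six characters of a longer name. The variable names are illustrative.

```python
import re

# A few names from test.txt with lengths 4, 5, 6, 7 and 10
names = ["Bali", "Samga", "Kiohan", "Gatooma", "Stora Eken"]

exactly_6 = re.compile(r"[a-zA-Z ]{6}")
at_least_6 = re.compile(r"[a-zA-Z ]{6,}")   # no space inside the braces
up_to_6 = re.compile(r"[a-zA-Z ]{1,6}")

# fullmatch() succeeds only when the entire string fits the pattern
len_6 = [n for n in names if exactly_6.fullmatch(n)]
len_6_plus = [n for n in names if at_least_6.fullmatch(n)]
len_1_to_6 = [n for n in names if up_to_6.fullmatch(n)]

# The same quantifier on digits: exactly three of them
three_digits = bool(re.fullmatch(r"\d{3}", "123"))
four_digits = bool(re.fullmatch(r"\d{3}", "1234"))

print(len_6, len_6_plus, len_1_to_6, three_digits, four_digits)
```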

Further reading:
http://www.regular-expressions.info/reference.html

Yodi Aditya
