Regex match / search for numbers, words, and words with spaces, separated by spaces, using Python


This is an example of using Python's re module to build a regular expression. We will match lines that contain numbers and words separated by one or more spaces, including names that themselves contain spaces. This is a basic task that comes up often in day-to-day development. First, let's create a sample text file, e.g.:

test.txt

4569619 3116511         Mondego
1322303 3116511         San Julián
370339  615620          Babushera
4100133 891211          Gatooma
629834  1214031         Poelo Roesa
3210208 917742          Choma Estate
513719  917742          Choma Extension
3210820 920160          Chibuye Machila
375787  618323          Manubeyevka
4119676 2093597         Kombuglpagl Number 1
4124183 2747559         Schimperbos
4574844 3122918         Espadañedo
4575493 2518539         Estación El Cobujón
1100597 2285521         Kiohan
642481  1253033         Wadhwan
796506  1631407         Poelau Penjoesoe
4681301 3198522         Rt Kabala
1369593 3198522         Rt Ðurov Kam
4577204 3125885         Punta del Castello
1370155 3199460         Otok Gustač
149490  222620          Hendi
378801  623256          Polyashchitsy
4175160 625251          Mezhëvo
1496179 3695055         Morro de Malabrigo
873711  1788861         Hsinhuang T’ung Nationality Autonomous Hsien
873712  1788861         Hsin-huang T’ung Autonomous Hsien
4184736 1791575         Weiyuanbu
5390395 1791575         Weiyuan Zhen
5390396 1791575         威远镇
6150442 1869931         석가촌
967307  1870841         Samga
4740195 814059          Morozovo
4740683 3224917         Koparflua
4742360 816101          Bratskiy
6994437 163418          Tall Yastī
6646665 1896869         Gyegol
6964971 165131    تل سبعين
2545748 1896869    계골
4659906 3706786         Cerro La Primitiva
1499960 3706786         Cerro Primitivo
4717527 1885842         Bali
4039016 2673331         Stora Eken
1505215 3732758         Hacienda Guayacai

The first thing to do is identify the common pattern in the data. Before that, keep these common Python regex building blocks in mind:

Number : \d+
String : [a-zA-Z]+
Spaces : \s+
Exact count : {4}
One or more : +
Start of line : ^
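
To get a feel for these building blocks, here is a small sketch that tries each one on a line from the sample data. The variable names are only illustrative and not part of the script we build later.

```python
import re

line = "4577204 3125885         Punta del Castello"

# \d+ : one or more digits
numbers = re.findall(r"\d+", line)

# [a-zA-Z]+ : one or more ASCII letters
words = re.findall(r"[a-zA-Z]+", line)

# \s+ : one or more whitespace characters (spaces, tabs, newlines)
gaps = re.findall(r"\s+", line)

# {4} : exactly four repetitions of the preceding token
first_four_digits = re.search(r"\d{4}", line).group(0)

# ^ : anchors the match at the start of the line
starts_with_digit = re.search(r"^\d", line) is not None

print(numbers)
print(words)
print(len(gaps))
print(first_four_digits)
print(starts_with_digit)
```

Note the raw string prefix (r"..."): it keeps the backslashes in \d and \s from being interpreted by Python before the regex engine sees them.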

Now we write a simple Python script to read the file:

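import re  # not used yet; needed once we add the pattern

def parsing_text():
    # Read the sample file line by line (assuming it is saved as UTF-8)
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            print(data, end="")  # each line already ends with a newline

if __name__ == '__main__':
    parsing_text()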

Then we can build the regex pattern by analyzing the text, e.g.:

4577204 3125885         Punta del Castello

It can be interpreted as :

(number)(spaces)(number)(spaces or tabs)(alphanumeric with spaces)

Then replace each number with (\d+), the spaces with \s+, and the name (letters and spaces) with ([a-zA-Z ]+):

pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')

Use it in the script:
import re

def parsing_text():
    # Two numeric columns, then a name made of ASCII letters and spaces
    pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            result = pattern.search(data)
            if result:
                print(result.group(0))

if __name__ == '__main__':
    parsing_text()

group(0) returns the complete text that matched. Because we split the pattern into parts using () :

(\d+) ... (\d+) ... ([a-zA-Z ]+)

each pair of parentheses becomes its own group: group(1) is the first number, group(2) is the second number, and group(3) is the name.
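
As a quick check of the group numbering, here is a minimal sketch that runs the pattern on one sample line and unpacks the groups. The variable names first_id, second_id, and name are only illustrative.

```python
import re

pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")

# One line from test.txt, including the trailing newline a file read would give us
line = "4577204 3125885         Punta del Castello\n"

result = pattern.search(line)

# group(0) is the whole matched text; the newline is not part of it,
# because the character class [a-zA-Z ] does not include "\n"
whole = result.group(0)

# groups() returns all numbered groups as a tuple: (group(1), group(2), group(3))
first_id, second_id, name = result.groups()

print(repr(whole))
print(first_id, second_id, name)
```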

Let's print only the name of the city:

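import re

def parsing_text():
    pattern = re.compile(r'^(\d+)\s+(\d+)\s+([a-zA-Z ]+)')
    with open("test.txt", encoding="utf-8") as f:
        for data in f:
            result = pattern.search(data)
            # search() returns None when a line does not match, so test the result first
            if result and result.group(3):
                print(result.group(3))

if __name__ == '__main__':
    parsing_text()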
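
One thing to watch with this sample data: [a-zA-Z ] only covers ASCII letters and spaces, so names with accents or digits are cut short, and names in other scripts come back as blank. A possible workaround, offered as a sketch rather than the only way to do it, is to capture the rest of the line with (.+). The helper name city_name and the second pattern are assumptions added for illustration.

```python
import re

# The pattern from the article: name limited to ASCII letters and spaces
ascii_pattern = re.compile(r"^(\d+)\s+(\d+)\s+([a-zA-Z ]+)")

# Hypothetical alternative: capture everything after the second number
any_pattern = re.compile(r"^(\d+)\s+(\d+)\s+(.+)")

def city_name(pattern, line):
    """Return the third group with surrounding spaces removed, or None if no match."""
    result = pattern.search(line)
    return result.group(3).strip() if result else None

samples = [
    "1322303 3116511         San Julián",
    "4119676 2093597         Kombuglpagl Number 1",
    "5390396 1791575         威远镇",
]

for line in samples:
    # Left: ASCII-only pattern, right: capture-the-rest pattern
    print(repr(city_name(ascii_pattern, line)), "|", repr(city_name(any_pattern, line)))
```

With the ASCII-only pattern the third sample still matches, but group(3) contains only spaces, which is why the result is empty after strip().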

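We can also limit the length of the city name, e.g.:

Length of exactly 6:

([a-zA-Z ]{6})

Length of 6 or more (no space inside the braces, otherwise Python treats them as literal text):

([a-zA-Z ]{6,})

Length between 1 and 6:

([a-zA-Z ]{1,6})

The same quantifiers work for the number pattern, e.g. exactly three digits:

(\d{3})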
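
To see how the length limits behave, here is a small sketch that tests a few city names from the sample data against each quantifier. It uses fullmatch() so that the whole name has to fit the pattern; the variable names are only illustrative.

```python
import re

names = ["Hendi", "Samga", "Kiohan", "Mondego", "Bali"]

exactly_6 = re.compile(r"[a-zA-Z ]{6}")
at_least_6 = re.compile(r"[a-zA-Z ]{6,}")
one_to_6 = re.compile(r"[a-zA-Z ]{1,6}")

# fullmatch() succeeds only if the entire string fits the pattern
names_exactly_6 = [n for n in names if exactly_6.fullmatch(n)]
names_at_least_6 = [n for n in names if at_least_6.fullmatch(n)]
names_one_to_6 = [n for n in names if one_to_6.fullmatch(n)]

# The same quantifier on digits: search() finds the first run of three digits
first_three_digits = re.search(r"\d{3}", "149490  222620").group(0)

print(names_exactly_6)
print(names_at_least_6)
print(names_one_to_6)
print(first_three_digits)
```

With search() instead of fullmatch(), a pattern like [a-zA-Z ]{6} would also match inside longer names, so pick the method that fits what you want to limit.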

Further reading:
http://www.regular-expressions.info/reference.html

