Python: How to Handle Unstructured Data From a Text File

jainsaniya

New Member
Aug 27, 2020
I have a file that I want to parse with Python. It's formatted like this:

# Jon Doe
# 27212000-C
# Calorina, 06/03 1993
# South Calorina Jaka Km 1
# Num 009.006
# Calorina. 11710, Tp.108437347343
# joe.st'a gmail.com
# 20-09-2016 Akn

# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend, Soeprapto Gang Siaga
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info

# Jenny Doe
# 5641141 2/E.15263
# Zimbabwe, 05/06/1993
# Mujair Street Iv No.185
# Mujair, 15116. Tp.04545454
# jenny@gmail.com
# 22-09-2016/T Info

# Igor Kart
# 36412777/E,15264
# Kongo, 30/10/1994
# Kp. Pintu Air Kel. Pabuaran Kec.Boj
# onggede Kab.Bogor RT 04/09
# Bogor, 16320. Tp,107262626
# igor.@gmail.com
# 22-09-2016T Info


How do I get well-structured data out of this? I want a result CSV like this (Good_format.csv):

Name,Code,Bday,Address,Phone,Email,Info
"Jon Doe","27212000-C","Calorina, 06/03 1993","South Calorina Jaka Km 1 Num 009.006 Calorina. 11710","108437347343","joe.st'a gmail.com","20-09-2016 Akn"
"Jenny Doe","5641141 2/E.15263","Zimbabwe, 05/06/1993","Mujair Street Iv No.185 Mujair, 15116.","04545454","jenny@gmail.com","22-09-2016/T Info"
"Igor Kart","36412777/E,15264","Kongo, 30/10/1993","Kp. Pintu Air Kel. Pabuaran Kec.Bojonggede Kab.Bogor RT 04/09 Bogor, 16320.","107262626","igor.@gmail.com","22-09-2016T Info"
And I want the badly formatted records written to log.txt, so I can fix them by hand later:

# 36412506/E.15262
# Jakarta, 13/10/1994
# II, Let.jend,
# V RT 005/03
# Jakarta, 10640. Tp.
# 22-09-2016/T Info
 

alessio

New Member
Aug 14, 2020
I don't think there is an easy, beautiful solution to your problem: even the format of the "good" records varies too much. E.g. Jenny Doe uses seven lines while Jon Doe and Igor Kart use eight, yet all three should be considered good. Then Tp (telephone?) can be written either "Tp," or "Tp.", the birthdate format differs, etc.

Your best bet seems to be to write a custom parser with some heuristic for the fields and iteratively refine it with the output of your "log.txt".

Something to get you started:

Python:
import csv  # or use pandas

def parse_all(text, columns):
    good = []
    bad = []
    # The "rows" in your final csv
    records = text.split('\n\n')  # records are separated by an empty line

    for record in records:
        fields = parse(record)
        if len(fields) == len(columns):
            good.append(fields)
        else:
            bad.append(record)

    return dict(good=good, bad=bad)


with open('unstructured.txt', 'r') as src:
    raw = src.read()

# What you want to find
columns = 'Name', 'Code', 'CoB', 'DoB', 'Address', 'Phone', 'Email', 'Info'
result = parse_all(raw, columns)

# "Bad" records
with open('log.txt', 'w') as log:
    log.write('\n\n'.join(result['bad']))

# "Good" records
with open('Good_format.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    for row in result['good']:
        writer.writerow(dict(zip(columns, row)))

For starters, you could use something simple for parse(), such as
Python:
def parse(record):
    """
    Separate fields by newline + hashtag.
    Alas, not many records use this format...
    """
    return ('\n' + record).split('\n# ')[1:]
To refine this, perhaps try to search for the email. This should be fairly easy to detect (at-sign?) and perhaps help to find the number of address rows. Or try to match number formats for telephone or DoB.
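As a sketch of that refinement (the patterns below are guesses based on the sample records, not a final schema, so treat them as assumptions to iterate on), one could tag each line by the kind of field it looks like, then use those tags to anchor the parse:

```python
import re

# Heuristic patterns; refine these against whatever lands in log.txt.
EMAIL = re.compile(r'\S+@\S+')                # at-sign is a strong signal
PHONE = re.compile(r'Tp[.,]\s*(\d{5,})')      # "Tp." or "Tp," then digits
DOB   = re.compile(r'\b\d{2}/\d{2}/\d{4}\b')  # e.g. 13/10/1994

def classify(line):
    """Return a rough label for one '# '-prefixed line."""
    if EMAIL.search(line):
        return 'email'
    if PHONE.search(line):
        return 'phone'
    if DOB.search(line):
        return 'dob'
    return 'other'

record = """# Jenny Doe
# 5641141 2/E.15263
# Zimbabwe, 05/06/1993
# jenny@gmail.com"""

labels = [classify(l) for l in record.splitlines()]
```

Once a line is tagged as email or phone, the lines before and between them can be assigned to name/code/address by position, which also tells you how many address lines a record has.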
 

larrysb

Active Member
Nov 7, 2018
Welcome to data science. Which is mostly the unglamorous practice of, "how do I clean up this mess?"
 

larrysb

On the assumption that your data set was generated by machine, one approach might be:

1. Determine the delimiter(s) which separate one record from another.
2. Try to find patterns in the fields in the records, with records of similar format being a "species."
3. Identify the hallmarks of each species, then sort the records into groups of species with automation.
4. Validate the separated records meet the expected formats.
5. Repeat if sub-species are discovered. Iterate until all species of record are discovered.
6. Map the discovered species of record formats into the desired format.

If the data was machine created, then it will have certain field formats, even if humans entered the data with lots of errors. If you have access to the software that wrote the data, then you can easily write rules to ferret it out. If you don't, you'll have to be creative and iterate until you find the patterns.
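To make steps 2 and 3 above concrete, here is a rough sketch; the per-line "shape" signature is just one possible hallmark of a species (an assumption, not the only choice), but it groups machine-generated records surprisingly well:

```python
import re
from collections import defaultdict

def line_shape(line):
    """Reduce a line to a crude shape: digits -> 9, letters -> A."""
    s = re.sub(r'\d', '9', line)
    return re.sub(r'[A-Za-z]', 'A', s)

def signature(record):
    """A record's 'species' is the tuple of its line shapes."""
    return tuple(line_shape(l) for l in record.strip().splitlines())

def group_by_species(text):
    """Sort blank-line-separated records into groups of similar shape."""
    groups = defaultdict(list)
    for record in text.split('\n\n'):
        if record.strip():
            groups[signature(record)].append(record)
    return groups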

If it was all jotted down by humans without any structure, good luck. It may become an entirely human effort.