'Line feeder in unwanted places causing problem in reading data

My data has over 50,000+ observations. The problem is line feeder LF is inserted all over the place causing import into statistical software like STATA a nightmare. I have tried many different options in STATA and finally given up on STATA. Now, after spending half a day in Notepad++, I found out that manually deleting "LF" wherever the data is pushed into a new row fixes the problem.

I tried using 'replace all' 'LF' with nothing(empty cell) but it results in all data clumping in one single line (softwares(Excel, STATA) interprets this as a lengthy variables list).

I am hoping someone might have encountered issues like this and have found a solution. If anyone could share those solutions it would be awesome!

Messy data in notepad++ displays as:

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF

I want a clean data as:

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF

Please help!



Solution 1:[1]

This will remove linefeed when the next line doesn't contain a pipe character.

  • Ctrl+H
  • Find what: \n(?!.*\|)
  • Replace with: LEAVE EMPTY
  • CHECK Wrap around
  • CHECK Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

\n          # linefeed
(?!         # negative lookahead, make we haven't, after:
    .*          # 0 or more any character but newline
    \|          # a pipe character
)           # end lookahead

Screenshot (before):

enter image description here

Screenshot (after):

enter image description here

Solution 2:[2]

Assuming you have Excel, Word or other Office products on your computer, use the method below (since I have the code written for that; I'll write Python at the tail end in a bit).

Add a Module by right clicking on VBA Project > Insert > Module

enter image description here

Go to Tools > References > check Microsoft Scripting Runtime

Then, in the white space you can paste the code below:

Option Explicit

Sub CorrectFileFormat()

    Dim InFile As FileSystemObject
    Set InFile = New FileSystemObject
    
    Dim OutFile As FileSystemObject
    Set OutFile = New FileSystemObject
    
    Dim InFilePath As String
    InFilePath = "c:\test.txt"
    
    Dim OutFilePath As String
    OutFilePath = "c:\test2.txt"
    
    Dim InStream As TextStream
    Set InStream = InFile.OpenTextFile(InFilePath, ForReading, False)
    
    Dim OutStream As TextStream
    Set OutStream = InFile.OpenTextFile(OutFilePath, ForWriting, True)

    Dim Line As String
    Dim OutputLine As String
    OutputLine = ""
    
    Dim LineNo As Integer
    LineNo = 1
    
    While Not InStream.AtEndOfStream
LoopStart:
        Line = InStream.ReadLine
        
        ' if line is empty, skip it
        If Len(Trim(Line)) = 0 Then
            GoTo LoopStart
        End If
        
        ' write the header line to the output file as-is
        If InStr(Line, """id""|") = 1 Then
            OutStream.WriteLine (Line)
            GoTo LoopStart
        End If
        
        ' if line contains a separator, we should store this line in
        ' outputline buffer. Before doing that, check if outputline has information in
        ' it. If it does, write it to the file first and then overwrite
        ' its contents
        If InStr(Line, "|") > 0 Then
            If Len(Trim(OutputLine)) > 0 Then
                OutStream.WriteLine (OutputLine)
            End If
            OutputLine = Line
            
        Else
            OutputLine = OutputLine + Line
        End If
        
        LineNo = LineNo + 1
    Wend
    
    ' write whatever's in outputline to the output file
    OutStream.WriteLine (OutputLine)
    
    InStream.Close
    OutStream.Close
    
    MsgBox "Done"
    
End Sub

Put your cursor anywhere inside the subroutine and click on the run button.

enter image description here

You will see a message box Done.

Original text file

"id"|"sex"|"choice"
"1aef2"|"M"|"burger"
"pizza"
"pasta"
"B2qwX"|"F"|"salad"
"keto diet",
""

enter image description here

New text file

"id"|"sex"|"choice"
"1aef2"|"M"|"burger""pizza""pasta"
"B2qwX"|"F"|"salad""keto diet",""

enter image description here

You can adapt the code to your requirements.

Python example

inpath = r'c:\test.txt'
outpath = r'c:\test2.txt'

infile = open(inpath, 'r')
outfile = open(outpath, 'w')

indata = infile.readlines()

outstring = ''

for line in indata:
    
    if len(line.strip()) == 0:
        continue
    
    if line.startswith('"id"|'):
        outfile.write(line)
        continue
        
    if '|' in line:
        if len(outstring) > 0:
            outfile.write(outstring + '\n')
        
        outstring = line.strip()
    
    else:
        outstring += line.strip()
    
outfile.write(outstring + '\n')

infile.close()
outfile.close()

Adapt this code to your needs.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Toto
Solution 2