'Line feeder in unwanted places causing problem in reading data
My data has over 50,000+ observations. The problem is line feeder LF
is inserted all over the place causing import into statistical software like STATA a nightmare. I have tried many different options in STATA and finally given up on STATA. Now, after spending half a day in Notepad++, I found out that manually deleting "LF" wherever the data is pushed into a new row fixes the problem.
I tried using 'replace all' 'LF' with nothing(empty cell) but it results in all data clumping in one single line (softwares(Excel, STATA) interprets this as a lengthy variables list).
I am hoping someone might have encountered issues like this and have found a solution. If anyone could share those solutions it would be awesome!
Messy data in notepad++ displays as:
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF
I want a clean data as:
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF
Please help!
Solution 1:[1]
This will remove linefeed when the next line doesn't contain a pipe character.
- Ctrl+H
- Find what:
\n(?!.*\|)
- Replace with:
LEAVE EMPTY
- CHECK Wrap around
- CHECK Regular expression
- UNCHECK
. matches newline
- Replace all
Explanation:
\n # linefeed
(?! # negative lookahead, make we haven't, after:
.* # 0 or more any character but newline
\| # a pipe character
) # end lookahead
Screenshot (before):
Screenshot (after):
Solution 2:[2]
Assuming you have Excel, Word or other Office products on your computer, use the method below (since I have the code written for that; I'll write Python at the tail end in a bit).
- Open Excel
- Use this video to add Developer workbook: https://youtu.be/_oZGdg1aiEQ?t=26
- Go to Developer tab and click Visual Basic
Add a Module by right clicking on VBA Project > Insert > Module
Go to Tools > References > check Microsoft Scripting Runtime
Then, in the white space you can paste the code below:
Option Explicit
Sub CorrectFileFormat()
Dim InFile As FileSystemObject
Set InFile = New FileSystemObject
Dim OutFile As FileSystemObject
Set OutFile = New FileSystemObject
Dim InFilePath As String
InFilePath = "c:\test.txt"
Dim OutFilePath As String
OutFilePath = "c:\test2.txt"
Dim InStream As TextStream
Set InStream = InFile.OpenTextFile(InFilePath, ForReading, False)
Dim OutStream As TextStream
Set OutStream = InFile.OpenTextFile(OutFilePath, ForWriting, True)
Dim Line As String
Dim OutputLine As String
OutputLine = ""
Dim LineNo As Integer
LineNo = 1
While Not InStream.AtEndOfStream
LoopStart:
Line = InStream.ReadLine
' if line is empty, skip it
If Len(Trim(Line)) = 0 Then
GoTo LoopStart
End If
' write the header line to the output file as-is
If InStr(Line, """id""|") = 1 Then
OutStream.WriteLine (Line)
GoTo LoopStart
End If
' if line contains a separator, we should store this line in
' outputline buffer. Before doing that, check if outputline has information in
' it. If it does, write it to the file first and then overwrite
' its contents
If InStr(Line, "|") > 0 Then
If Len(Trim(OutputLine)) > 0 Then
OutStream.WriteLine (OutputLine)
End If
OutputLine = Line
Else
OutputLine = OutputLine + Line
End If
LineNo = LineNo + 1
Wend
' write whatever's in outputline to the output file
OutStream.WriteLine (OutputLine)
InStream.Close
OutStream.Close
MsgBox "Done"
End Sub
Put your cursor anywhere inside the subroutine and click on the run button.
You will see a message box Done
.
Original text file
"id"|"sex"|"choice"
"1aef2"|"M"|"burger"
"pizza"
"pasta"
"B2qwX"|"F"|"salad"
"keto diet",
""
New text file
"id"|"sex"|"choice"
"1aef2"|"M"|"burger""pizza""pasta"
"B2qwX"|"F"|"salad""keto diet",""
You can adapt the code to your requirements.
Python example
inpath = r'c:\test.txt'
outpath = r'c:\test2.txt'
infile = open(inpath, 'r')
outfile = open(outpath, 'w')
indata = infile.readlines()
outstring = ''
for line in indata:
if len(line.strip()) == 0:
continue
if line.startswith('"id"|'):
outfile.write(line)
continue
if '|' in line:
if len(outstring) > 0:
outfile.write(outstring + '\n')
outstring = line.strip()
else:
outstring += line.strip()
outfile.write(outstring + '\n')
infile.close()
outfile.close()
Adapt this code to your needs.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Toto |
Solution 2 |