Camelot scraping issue for non-English (Tamil) PDF

Python Camelot works like a charm on English PDFs, but it does not scrape Tamil text properly. The output is more or less junk: characters that are close to the expected ones but not the same. I would like to understand what the issue is and how Camelot captures non-English data.

Work done so far: I am trying to scrape data from a PDF published by the Tamil Nadu Election Commission. Sample single-page data is linked in the original post. For example, the word பெயர் is getting scraped as ெபயர்.
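One way to see exactly what differs is to list the Unicode code points of both strings. The sketch below is only a diagnostic and does not involve Camelot. The helper name `describe` is hypothetical, and the expected spelling பெயர் is taken from the expected header quoted later in this question.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Written with escapes so the stored order is unambiguous.
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"  # expected word, consonant first
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"   # scraped word, vowel sign first

for label, word in (("expected", expected), ("scraped", scraped)):
    print(label)
    for code, name in describe(word):
        print(" ", code, name)
```

For this particular word both strings contain the same five characters. Only the position of TAMIL VOWEL SIGN E differs, which is consistent with the text being extracted in the order the glyphs are drawn rather than the order Unicode stores them.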

Reference: the CSV output for just the first table is shown below.

"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"

Code used for scraping:

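# coding: utf8
import camelot

# Read every page of the PDF; the result is a list of detected tables
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

x = tables.n  # number of tables found
print("No of tables", x)

# Write each table to CSV, using 'ariyalur.csv' as the base file name
tables.export('ariyalur.csv', f='csv')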

Addition / edit for clarity, as pointed out by @tripleee, for non-Tamil readers: this is the header of the table.

The expected output is:
வ.எண் பெயர்‌ பாலினம்‌ பெயர்‌ கட்சி வாக்குகள்‌ % முடிவு

But the output that came out is:
"வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம் %",""
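Part of the difference between the two headers looks like an ordering problem: the vowel signs that are drawn to the left of a consonant (U+0BC6 to U+0BC8) come out before it instead of after it. The following is a minimal post-processing sketch under that assumption, not a Camelot feature; `reorder_prebase_signs` is a hypothetical helper name. It cannot restore characters that are missing altogether, such as the final vowel sign in கட்சி, which was scraped as கட்ச.

```python
import re
import unicodedata

# A pre-base vowel sign (U+0BC6..U+0BC8) immediately followed by a Tamil
# consonant (U+0B95..U+0BB9) indicates visual rather than logical order.
_VISUAL_ORDER = re.compile("([\u0bc6-\u0bc8])([\u0b95-\u0bb9])")

def reorder_prebase_signs(text):
    """Move a leading vowel sign behind the consonant that follows it.

    Assumption: `text` is in visual (glyph) order. Applying this to text
    that is already in logical order would corrupt it.
    """
    swapped = _VISUAL_ORDER.sub(r"\2\1", text)
    # NFC recombines split two-part vowels, e.g. U+0BC7 + U+0BBE -> U+0BCB.
    return unicodedata.normalize("NFC", swapped)

scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"  # the scraped header word
print(reorder_prebase_signs(scraped))
```

This only treats the symptom in the exported text. Why the characters are extracted in this order, and why some are dropped, is still the open question here.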

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
