Getting encoded output when I print Hindi text from a TensorFlow dataset
I'm using this corpus for an NLP task. When I read the files and store the Hindi and English lines in separate lists, I get ordinary string literals, like so:
def extract_lines(fp):
    # Read as UTF-8 explicitly and close the file when done.
    with open(fp, encoding="utf-8") as f:
        return [line.strip() for line in f]
inp, target = extract_lines(train_hi), extract_lines(train_en)
sample: ['अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें', 'एक्सेर्साइसर पहुंचनीयता अन्वेषक'] ['Give your application an accessibility workout', 'Accerciser Accessibility Explorer']
I then create a TensorFlow dataset from the two lists:
buffer_size = len(inp)
batch_size = 64
dataset = tf.data.Dataset.from_tensor_slices((inp,target)).shuffle(buffer_size)
dataset = dataset.batch(batch_size)
The output I get from
for input_sample, target_sample in dataset.take(1):
    print(input_sample)
is something like:
tf.Tensor( [b'\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa5\x8b\xe0\xa4\x82\xe0\xa4\x95\xe0\xa5\x80 \xe0\xa4\x95\xe0\xa5\x8b\xe0\xa4\x9f\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xbe\xe0\xa4\x81'
I'm pretty new to dealing with text data (especially in TensorFlow). What is happening here?
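For context, the `b'\xe0\xa4...'` values in the printed tensor are byte strings. TensorFlow string tensors hold raw bytes, and here those bytes appear to be the UTF-8 encoding of the original Hindi text, so nothing looks lost. Below is a minimal sketch of decoding them back to readable text in plain Python. The `decode_batch` helper is a hypothetical name introduced for illustration, and `raw` is copied from the output shown above.

```python
# Bytes copied from the printed tensor above (one element of the batch).
raw = (
    b"\xe0\xa4\xb5\xe0\xa5\x8d\xe0\xa4\xaf\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xa4"
    b"\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa5\x8b\xe0\xa4\x82\xe0\xa4\x95\xe0\xa5\x80 "
    b"\xe0\xa4\x95\xe0\xa5\x8b\xe0\xa4\x9f\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa4\xbe"
    b"\xe0\xa4\x81"
)


def decode_batch(byte_strings, encoding="utf-8"):
    """Hypothetical helper: decode an iterable of bytes objects into str."""
    return [b.decode(encoding) for b in byte_strings]


decoded = decode_batch([raw])
print(decoded[0])  # prints the text in Devanagari script instead of escapes
```

In eager mode, the same helper could presumably be applied to a batch with `decode_batch(input_sample.numpy())`, since calling `.numpy()` on a string tensor yields an array of `bytes` objects.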
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
