PySpark word count: filtering out stopwords with a broadcast variable is not working

I am new to PySpark and am writing a word count program in Google Colab. I need to use a broadcast variable for my stopwords. Everything in the program works except filtering out the stopwords. I'm fairly sure the problem is that the stopwords are not being looked up correctly in the broadcast variable, for reasons I'll explain below.

This is how I declare my stopwords:

stopwordsPath = '/content/drive/MyDrive/Assignment3/stopwords-en.txt'
file2 = sc.textFile(stopwordsPath)
stopwordsRDD = file2.flatMap(lambda line: line.split(" "))
stopwords = sc.broadcast([stopwordsRDD.collect()])

I added the line split recently as a test. Before that the step was not there, and having it or not doesn't change anything. It's probably unnecessary, since the stopwords file has one word per line. I also tried splitting on the newline character '\n', but that did not change anything either, unless I did it incorrectly.

This is how I'm trying to filter the stopwords out of the text I'm processing:
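For reference, here is a minimal plain-Python sketch (no Spark involved) of why the split step makes no difference for a one-word-per-line file. The sample lines are made up for illustration, and the sketch assumes each element is already a single line with its newline removed, which is how sc.textFile delivers lines.

```python
# Hypothetical sample of lines from a one-word-per-line stopwords file.
# Newlines are assumed to be stripped already, as sc.textFile does.
lines = ["a", "about", "couldn't", "the"]

# Splitting each line on a space leaves single-word lines unchanged.
split_words = [word for line in lines for word in line.split(" ")]

# Splitting on "\n" changes nothing either, because no newline remains.
newline_split = [word for line in lines for word in line.split("\n")]

print(split_words == lines)
print(newline_split == lines)
```

Both comparisons come out equal here, which matches the observation that adding or removing the split does not change the result.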

file = '/content/drive/MyDrive/Assignment3/tsk1Data.txt'
fileRDD = sc.textFile(file)
wordsRDD= fileRDD.flatMap(lambda line: line.split(" "))
wordsRDD = wordsRDD.map(lower_clean_str)
wordsRDD = wordsRDD.filter(lambda x:x not in stopwords.value)

As you can see, I take the file path, turn the file into an RDD, and split it into words. I then clean up the words (the function is defined elsewhere) and try to filter the stopwords out of the cleaned words. But the filter does not remove anything.

When I run these commands to check whether the stopwords are even being accessed correctly:

print('the' in stopwords.value)
print('the' in wordsRDD.collect())

I get:

False
True

So the word "the" is not found in the broadcast stopwords variable, but it is found in the document I am processing. I assure you that "the" is in the stopwords file. I have tried other stopwords such as "an" and "a" in place of "the" and I still get False, so something is going on.

This is what stopwords.value and wordsRDD.collect() look like:

wordsRDD.collect()

['the',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'pride',
 'and',
 'prejudice',
 'by',
 'jane',
 'austen',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',

stopwords.value

[["'ll",
  "'tis",
  "'twas",
  "'ve",
  '10',
  '39',
  'a',
  "a's",
  'able',
  'ableabout',
  'about',
  'above',
  'abroad',
  'abst',
  'accordance',
  'according',
  'accordingly',
  'across',
  'act',
  'actually',
  'ad',
  'added',
  'adj',
  'adopted',
  'ae',
  'af',
  'affected',
  'affecting',
  'affects',
  'after',
  'afterwards',

As you can see, they look mostly similar, except that stopwords.value starts with a double bracket [[ while wordsRDD.collect() starts with a single bracket [. Also, some of the stopwords have double quotes " around them. If you scroll down, most words have single quotes ' around them, but here and there a word has double quotes instead, as in this snippet from the output:
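The double bracket can be reproduced in plain Python. The following is only a sketch with made-up words, assuming the outer bracket comes from wrapping the collected list in another list, as [stopwordsRDD.collect()] does. In that case the outer list has exactly one element, the inner list, so a membership test for a word against the outer list fails.

```python
# A flat list of words, like the one collect() returns (sample data).
flat = ["a", "about", "the"]

# Wrapping it in [...] produces a list whose only element is that list.
nested = [flat]

print("the" in flat)       # True
print("the" in nested)     # False: nested contains a list, not words
print(flat in nested)      # True
print("the" in nested[0])  # True: the words live one level down
```

Under that assumption, broadcasting the collected list directly (or a set built from it) rather than a list containing it would give a value that the in test can search word by word.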

  'could',
  "could've",
  'couldn',
  "couldn't",
  'couldnt',
  'course',
  'cr',
  'cry',
  'cs',
  'cu',
  'currently',
  'cv',
  'cx',
  'cy',
  'cz',
  'd',
  'dare',
  "daren't",
  'darent',
  'date',

Somehow I need to clean the contents of the broadcast variable so that I can access them properly. Is some whitespace interfering? Could that be what causes the occasional double quotes? I don't know.

Looking at it again, though, the double quotes appear only around contractions that contain a single quote '. That is probably so the single quote inside the word is not mistaken for the end of the string.

A picture of the rest of my output and code was attached in case it is needed (I don't think I need to show the header code for imports, mounting the drive, etc.).

Here is my text cleaning function, just in case:
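A short plain-Python sketch of the quoting behaviour, using three words from the output above. The quotes are only part of how Python displays a string; they are not stored in the string itself, so they should not affect comparisons.

```python
words = ["could", "could've", "couldn't"]

# repr() is what the notebook uses when it displays a list of strings.
# Python picks double quotes when the string itself contains a single quote.
shown = [repr(w) for w in words]
for text in shown:
    print(text)

# The stored strings contain no surrounding quote characters.
lengths = [len(w) for w in words]
print(lengths)  # [5, 8, 8]
```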

def lower_clean_str(x):
  punc='!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
  lowercased_str = x.lower()
  for ch in punc:
    lowercased_str = lowercased_str.replace(ch, '')
  return lowercased_str
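To show what this function does to individual tokens, here is a small self-contained sketch that repeats the function and applies it to a few made-up sample tokens. Two effects may matter for the stopword comparison: apostrophes are removed, so "couldn't" becomes "couldnt", and a token made only of punctuation becomes an empty string.

```python
def lower_clean_str(x):
    # Same logic as the function above: lowercase, then drop punctuation.
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    lowercased_str = x.lower()
    for ch in punc:
        lowercased_str = lowercased_str.replace(ch, '')
    return lowercased_str

# Hypothetical sample tokens, as they might come out of line.split(" ").
samples = ["The", "Prejudice,", "couldn't", "--"]
cleaned = [lower_clean_str(s) for s in samples]
print(cleaned)  # ['the', 'prejudice', 'couldnt', '']
```

Since the stopword list shown earlier contains both "couldn't" and 'couldnt', the cleaned form would still match in that case, but a stopword that appears only with an apostrophe would not match a cleaned token.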

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
