How to save an inverted index RDD in HBase using PySpark

I have created an inverted index RDD using PySpark. It looks like this:

[('adage',
  [('file:/notebooks/Data/shakespeare/tragedies/macbeth', 1),
   ('file:/notebooks/Data/shakespeare/histories/3kinghenryvi', 1)]),
 ('adelaide', [('file:/notebooks/Data/Hugo/Miserables.txt', 1)]),
 ('adoration',
  [('file:/notebooks/Data/Hugo/Miserables.txt', 22),
   ('file:/notebooks/Data/Tolstoy/anna_karenhina.txt', 4),
   ('file:/notebooks/Data/Tolstoy/war_and_peace.txt', 4),
   ('file:/notebooks/Data/Hugo/NotreDame_De_Paris.txt', 1),
   ('file:/notebooks/Data/shakespeare/histories/kinghenryv', 1),
   ('file:/notebooks/Data/shakespeare/comedies/asyoulikeit', 1)]),
.
.
.
]

Each element x of the RDD has the form (x[0], [(x[1][i][0], x[1][i][1]), ...]): a word followed by a list of (file path, count) pairs.
I am trying to save this inverted index RDD to an HBase database.

What should the HBase table look like?
How can I connect to an HBase database using PySpark?
What are the steps and the code needed to save the RDD to HBase?
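
To make the first question concrete, here is a minimal sketch of one possible layout, not a confirmed answer: use the word as the row key and store each (file path, count) pair as one cell, with the file path as the column qualifier and the count as the value. The column family name "docs" and the helper to_hbase_cells are hypothetical names chosen for illustration. The sketch is plain Python and only shows the flattening step; it does not connect to HBase.

```python
# Assumption: "docs" is a hypothetical column family name, and
# to_hbase_cells is an illustrative helper, not part of any library.
COLUMN_FAMILY = "docs"


def to_hbase_cells(entry):
    """Flatten one (word, [(path, count), ...]) entry into cells.

    Each cell is a tuple (row_key, column_family, qualifier, value),
    with the count converted to a string.
    """
    word, postings = entry
    return [(word, COLUMN_FAMILY, path, str(count)) for path, count in postings]


# A small sample taken from the inverted index shown in the question.
sample = [
    ("adage", [("file:/notebooks/Data/shakespeare/tragedies/macbeth", 1),
               ("file:/notebooks/Data/shakespeare/histories/3kinghenryvi", 1)]),
    ("adelaide", [("file:/notebooks/Data/Hugo/Miserables.txt", 1)]),
]

# Flatten every entry into a single list of cells.
cells = [cell for entry in sample for cell in to_hbase_cells(entry)]
for cell in cells:
    print(cell)
```

On the real RDD the same helper could be applied with rdd.flatMap(to_hbase_cells), which yields one record per cell. Writing those records to HBase still requires a connector and a table created with the chosen column family, and that part is not shown here.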

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
