How to save an inverted index RDD in HBase using PySpark

I have created an inverted index RDD using PySpark which looks like this:

[('adage',
  [('file:/notebooks/Data/shakespeare/tragedies/macbeth', 1),
   ('file:/notebooks/Data/shakespeare/histories/3kinghenryvi', 1)]),
 ('adelaide', [('file:/notebooks/Data/Hugo/Miserables.txt', 1)]),
 ('adoration',
  [('file:/notebooks/Data/Hugo/Miserables.txt', 22),
   ('file:/notebooks/Data/Tolstoy/anna_karenhina.txt', 4),
   ('file:/notebooks/Data/Tolstoy/war_and_peace.txt', 4),
   ('file:/notebooks/Data/Hugo/NotreDame_De_Paris.txt', 1),
   ('file:/notebooks/Data/shakespeare/histories/kinghenryv', 1),
   ('file:/notebooks/Data/shakespeare/comedies/asyoulikeit', 1)]),
.
.
.
]

Each element of the inverted index RDD has the form (word, [(file_path, count), ...]).
I am trying to save this RDD to an HBase database.
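
For readers without the data at hand, the element shape can be reproduced in plain Python. The sketch below is not the original Spark pipeline (which is not shown here); `build_inverted_index` and the sample documents are hypothetical, and sorting each posting list by descending count is an assumption based on the sample output above.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Return [(word, [(path, count), ...]), ...] from (path, text) pairs."""
    postings = defaultdict(Counter)
    for path, text in documents:
        for word in text.lower().split():
            postings[word][path] += 1
    # Sort words alphabetically and each posting list by descending count
    # (ties broken by path) so the output is deterministic.
    return [
        (word, sorted(counts.items(), key=lambda pc: (-pc[1], pc[0])))
        for word, counts in sorted(postings.items())
    ]


# Hypothetical sample documents, not the files from the question.
docs = [
    ("file:/tmp/a.txt", "adage adoration adoration"),
    ("file:/tmp/b.txt", "adoration"),
]
index = build_inverted_index(docs)
print(index)
```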

  • What should the HBase table look like?
  • How can I connect to the HBase database using PySpark?
  • What are the steps and the code needed to save the RDD to HBase?
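
One possible approach, offered as a sketch under assumptions rather than a confirmed answer: store one HBase row per word, using the word as the row key and one column per file inside a single column family, with the count as the cell value. The write below goes through the `happybase` client (which talks to the HBase Thrift server) from inside `foreachPartition`, so each partition opens one connection. The table name `inverted_index`, the column family `docs`, the host `localhost`, the RDD name `inverted_index_rdd`, and all helper names are assumptions for illustration; the table is assumed to exist already.

```python
def to_hbase_row(element, family=b"docs"):
    """Map (word, [(path, count), ...]) to (row_key, {column: value}) in bytes."""
    word, postings = element
    columns = {
        family + b":" + path.encode("utf-8"): str(count).encode("utf-8")
        for path, count in postings
    }
    return word.encode("utf-8"), columns


def default_connect():
    # Imported lazily so the import happens on the executors, where
    # happybase must be installed. Host and port are assumptions.
    import happybase
    return happybase.Connection("localhost", port=9090)


def write_partition(elements, connect=default_connect, table_name="inverted_index"):
    """Write one partition using a single connection and a batched put."""
    connection = connect()
    try:
        table = connection.table(table_name)
        with table.batch() as batch:
            for element in elements:
                row_key, columns = to_hbase_row(element)
                batch.put(row_key, columns)
    finally:
        connection.close()


# With a live cluster and the RDD from the question (assumed name), the
# call would be:
# inverted_index_rdd.foreachPartition(write_partition)

# Local check of the row layout for one element from the sample output.
row_key, columns = to_hbase_row(
    ("adelaide", [("file:/notebooks/Data/Hugo/Miserables.txt", 1)])
)
print(row_key, columns)
```

Under these assumptions the table could be created beforehand in the HBase shell with `create 'inverted_index', 'docs'`. Storing counts as text keeps them readable in the shell; a fixed-width binary encoding would be an alternative if the values need to be compared or incremented server-side.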


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source