How to save an inverted index RDD in HBase using PySpark
I have created an inverted index RDD using PySpark, which looks like this:
[('adage',
[('file:/notebooks/Data/shakespeare/tragedies/macbeth', 1),
('file:/notebooks/Data/shakespeare/histories/3kinghenryvi', 1)]),
('adelaide', [('file:/notebooks/Data/Hugo/Miserables.txt', 1)]),
('adoration',
[('file:/notebooks/Data/Hugo/Miserables.txt', 22),
('file:/notebooks/Data/Tolstoy/anna_karenhina.txt', 4),
('file:/notebooks/Data/Tolstoy/war_and_peace.txt', 4),
('file:/notebooks/Data/Hugo/NotreDame_De_Paris.txt', 1),
('file:/notebooks/Data/shakespeare/histories/kinghenryv', 1),
('file:/notebooks/Data/shakespeare/comedies/asyoulikeit', 1)]),
.
.
.
]
Each element of the inverted index RDD has the form (word, [(document_path, count), ...]).
I am trying to save this inverted index RDD to an HBase database.
- What should the HBase table look like?
- How can I connect to HBase from PySpark?
- What are the steps and the code needed to save the RDD to HBase?
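To make the three questions concrete, here is a rough sketch of one possible layout and write path, not a verified answer. It assumes one HBase row per word, a single column family (hypothetically named `docs`), the document path as the column qualifier and the count as the cell value. The write step assumes the converter classes shipped in the Spark 1.x examples jar are on the classpath and that the table already exists. The function names, `zookeeper_host` and `table_name` are placeholders invented for illustration.

```python
def to_hbase_cells(entry, column_family="docs"):
    """Flatten (word, [(path, count), ...]) into one record per HBase cell.

    Each record is (row_key, [row_key, family, qualifier, value]) with
    every field a string, which is the shape the Spark examples'
    StringListToPutConverter is assumed to expect.
    """
    word, postings = entry
    return [(word, [word, column_family, path, str(count)])
            for path, count in postings]


def save_index(index_rdd, zookeeper_host, table_name):
    """Write the inverted index RDD to an existing HBase table.

    Assumption: the Spark examples jar and the HBase client jars are
    available to the driver and the executors.
    """
    conf = {
        "hbase.zookeeper.quorum": zookeeper_host,
        "hbase.mapred.outputtable": table_name,
        "mapreduce.outputformat.class":
            "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class":
            "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class":
            "org.apache.hadoop.io.Writable",
    }
    key_conv = ("org.apache.spark.examples.pythonconverters."
                "StringToImmutableBytesWritableConverter")
    value_conv = ("org.apache.spark.examples.pythonconverters."
                  "StringListToPutConverter")
    # One (row_key, cell) pair per (word, document) combination.
    index_rdd.flatMap(to_hbase_cells).saveAsNewAPIHadoopDataset(
        conf=conf, keyConverter=key_conv, valueConverter=value_conv)
```

With this layout, looking up a word is a single row read, and each document that contains the word appears as one column in the `docs` family. Whether that suits the workload depends on how many documents a word can appear in.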
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
