Fetching data from Solr and loading it into a pandas DataFrame

I have imported around 50k rows from Oracle 11g into Solr.

Now I want to fetch the same data from Solr into a pandas DataFrame.

I used the following:

import pandas as pd
import pysolr

r = pysolr.Solr('http://localhost:8983/solr/db')

result = r.search('*')  

docs = pd.DataFrame(result.docs)

The call result = r.search('*') throws an error:

SolrError: Solr responded with an error (HTTP 504): [Reason: None]504Gateway Timeout

Gateway Timeout

Server error - server 127.0.0.1 is unreachable at this moment.

Please retry the request or contact your administrator.

I am new to Solr. Thanks in advance.



Solution 1:[1]

The search returns an object of type pysolr.Results.

Use its attributes to get what you need, such as qtime, docs, and facets.

results.docs holds all returned documents as a list; you can check this with type(results.docs).

results.docs[0] gives you the first document in the results, which is a dict.

To get the documents into a DataFrame, import pandas and pass the list to DataFrame():

import pandas as pd
import pysolr
solrcon = pysolr.Solr('http://localhost:8983/solr/db', timeout=10)
results = solrcon.search('*:*')
docs = pd.DataFrame(results.docs)
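
One thing to keep in mind with the snippet above: Solr returns only 10 rows per request unless a rows parameter is given, so a single search('*:*') will not bring back all 50k documents. A minimal sketch of paging through the whole result set with start and rows follows. The helper names fetch_all_docs and solr_to_dataframe, the page size of 1000, and the 60 second timeout are assumptions for illustration, not part of pysolr.

```python
def fetch_all_docs(solr, query="*:*", page_size=1000):
    """Collect every matching document by paging with start/rows.

    `solr` is any object with a pysolr-style search() method.
    """
    docs = []
    start = 0
    while True:
        # Extra keyword arguments are forwarded to Solr as query parameters.
        results = solr.search(query, start=start, rows=page_size)
        docs.extend(results.docs)
        start += page_size
        # results.hits is the total number of matches reported by Solr.
        if start >= results.hits or not results.docs:
            break
    return docs


def solr_to_dataframe(url, **kwargs):
    """Hypothetical convenience wrapper: connect, page, build a DataFrame."""
    # Imported here so the paging helper itself has no hard dependencies.
    import pandas as pd
    import pysolr

    solr = pysolr.Solr(url, timeout=60)  # timeout value is an assumption
    return pd.DataFrame(fetch_all_docs(solr, **kwargs))
```

With a running Solr instance, solr_to_dataframe('http://localhost:8983/solr/db') should return one row per document. Smaller pages keep each request short, which may help if the timeout is caused by one very large response.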

Solution 2:[2]

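For a large collection, fetch the data page by page through Solr's CSV response format and combine the pages into one DataFrame: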

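import time
import pandas as pd
from requests.utils import requote_uri
start_num = 0
rows_num = 50000        # page size
total_docs = 7591467    # total number of documents in the collection
chunks = []             # one DataFrame per page
while start_num < total_docs:
    print('start row is', start_num)
    print('row number is', rows_num)
    time.sleep(2.4)     # pause between requests
    # The URL needs an explicit scheme, otherwise read_csv treats it as a local file path
    url = "http://localhost:8080/solr/collection_name/select?q=*:*&sort=xyz desc&wt=csv&start={}&rows={}".format(start_num, rows_num)
    encoded_URL = requote_uri(url)  # percent-encodes the space in the sort clause
    print(encoded_URL)
    chunks.append(pd.read_csv(encoded_URL))
    start_num += rows_num
# DataFrame.append was removed in pandas 2.0, so concatenate all pages once at the end
df = pd.concat(chunks, ignore_index=True)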
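
The paged URLs can also be built with the standard library instead of string formatting plus a separate quoting step. This is only a sketch: build_select_url and page_urls are hypothetical helper names, and the base URL and the xyz sort field are the placeholders from the snippet above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, start, rows, sort="xyz desc"):
    """Return a Solr /select URL that requests one page as CSV."""
    params = {
        "q": "*:*",
        "sort": sort,  # placeholder sort field from the snippet above
        "wt": "csv",
        "start": start,
        "rows": rows,
    }
    # urlencode percent-encodes every value, so no extra quoting is needed.
    return "{}/select?{}".format(base_url.rstrip("/"), urlencode(params))


def page_urls(base_url, total_docs, rows):
    """Yield one URL per page, covering documents 0 .. total_docs - 1."""
    for start in range(0, total_docs, rows):
        yield build_select_url(base_url, start, rows)
```

Each URL from page_urls('http://localhost:8080/solr/collection_name', total_docs, rows_num) can be passed to pd.read_csv, and the resulting frames combined with pd.concat.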

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2