How can I split lines from a set of text input files that contain movie titles (no spaces and no commas), followed by whitespace, followed by a rating?
The lines in the files are in no particular order, and there may be multiple lines for the same movie. This is what the input might look like:
batman_returns 4.0
The_Dark_Knight 4.0
batman_returns 5.0
Captain_America 5.0
Captain_America 4.0
I need to compute the average rating for each movie and produce the output in sorted order. For example, the output should look like this:
[(4.5, batman_returns), (4.5, Captain_America), (4.0, The_Dark_Knight)]
I need to solve this problem using Python Spark (PySpark) code.
Solution 1:[1]
I don't know PySpark, but if it's similar to plain Python, then something like this pandas approach should work:
import pandas as pd
# Import text file; split on any run of whitespace, not just a single space
df = pd.read_csv('movies.txt', sep=r'\s+', header=None, names=['movie', 'rating'])
# Find mean rating per movie
avg = df.groupby('movie', as_index=False)['rating'].mean()
# Create upper case dummy column to handle lower/upper case sorting
avg['upper_movies'] = avg['movie'].str.upper()
# Sort values by dummy column (title only), drop dummy column.
avg = avg.sort_values('upper_movies').drop(columns='upper_movies')
# Create list of (rating, movie) tuples with plain Python floats
output = list(zip(avg['rating'].tolist(), avg['movie'].tolist()))
print(output)
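Since the question asks for Spark specifically, here is a rough sketch of the same idea with the PySpark RDD API. It is not from the original answer. The helper names (`parse_line`, `add_pairs`, `to_average`, `sort_key`, `average_ratings`) are made up for illustration, and the sort order (average rating descending, ties broken by case-insensitive title) is an assumption inferred from the sample output in the question.

```python
def parse_line(line):
    """Split 'title rating' on any whitespace into (title, (rating, 1))."""
    title, rating = line.split()
    return title, (float(rating), 1)


def add_pairs(a, b):
    """Combine two (rating_sum, count) pairs for the same title."""
    return a[0] + b[0], a[1] + b[1]


def to_average(pair):
    """Turn (title, (rating_sum, count)) into (average, title)."""
    title, (total, count) = pair
    return total / count, title


def sort_key(item):
    """Assumed ordering: highest average first, then title ignoring case."""
    average, title = item
    return -average, title.lower()


def average_ratings(sc, path):
    """Compute sorted (average, title) tuples.

    sc is an existing SparkContext; path may be a file, a directory,
    a glob, or a comma-separated list of paths.
    """
    return (
        sc.textFile(path)
        .filter(lambda line: line.strip())  # skip blank lines
        .map(parse_line)
        .reduceByKey(add_pairs)             # sum ratings and counts per title
        .map(to_average)
        .sortBy(sort_key)
        .collect()
    )
```

To try it, create a context with `from pyspark import SparkContext` and `sc = SparkContext("local[*]", "movie-averages")`, then call `average_ratings(sc, "movies.txt")`. Carrying a `(sum, count)` pair through `reduceByKey` avoids collecting all ratings for a title in one place, which `groupByKey` would do.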
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | d789w |
