Counting word frequency in a string across columns in R
I am trying to count how many times each word appears in every row of a column, across my whole data set. The data can be found here: https://www.kaggle.com/tovarischsukhov/southparklines
My code is as follows:
SP = read.csv("All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Cartman = SP %>% group_by(Character) %>%
  arrange(Season, Episode) %>%
  filter(Character == "Cartman")
Cartman_text_tbl <- as_tibble(data.frame(uniqueID = 1:length(Cartman$Season), Cartman[1:length(Cartman$Season), ]))
Cartman_text_tbl_words <- Cartman_text_tbl %>% select(uniqueID, Cartman$Line) %>%
  unnest_tokens(word, Cartman$Line) %>% filter(str_detect(word, "^[a-z']+$")) %>%
  group_by(uniqueID) %>% count(word)
When I run the last line of code I get this error:
Error in `select()`:
! Can't subset columns that don't exist.
x Columns `Yeah, go home you little dildo.\n`, `I know what it means!\n`, `I'm not telling you.\n`, `He-yeah, that's what Kyle's little brother is all right! Ow! \n`, `That's 'cause I was having these... bogus nightmares.\n`, etc. don't exist.
I did a project for a class a couple of years ago where the professor provided some similar code, and I am trying to model this code on what was provided then. If there is a better way to get the counts, that would be great to know; otherwise, a way to fix the error would be helpful. Additionally, each line ends with a "\n". Is it possible to remove those from every line? Thanks!
Solution 1:[1]
If I understand you correctly, this should help. The output gives the count of each word said by Cartman in each season and episode. For other characters, use the same code and change the filter and the name of the object the output is assigned to. If you need to remove stop words, add anti_join(stop_words, by = "word") %>% after the unnest_tokens() call. Because count() is called with sort = TRUE, the words are sorted in descending order of frequency; change this argument to sort as needed.
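As a small illustration of the stop-word and sorting options just described, here is a minimal sketch on a made-up two-line tibble. The object names `toy_lines` and `toy_counts` and the dialogue rows are assumptions for illustration only, not part of the data set; the full code for the real file is given under "Code:".

```r
library(dplyr)
library(tidytext)

# Made-up lines standing in for the real data (illustrative only)
toy_lines <- tibble(
  Season  = c(1, 1),
  Episode = c(1, 1),
  Line    = c("Give me the cheesy poofs.\n", "The cheesy poofs are mine!\n")
)

toy_counts <- toy_lines %>%
  group_by(Season, Episode) %>%
  unnest_tokens(output = word, input = Line) %>%
  anti_join(stop_words, by = "word") %>%  # drop common words such as "the"
  count(word, sort = TRUE)                # sort = FALSE skips the frequency sort
```

`stop_words` is the stop-word table that ships with tidytext; a custom list would work the same way as long as it has a `word` column.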
Code:
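library(tidyverse)  # dplyr verbs and readr::read_csv()
library(tidytext)   # unnest_tokens()

df <- read_csv("All-seasons.csv")

cartman <- df %>%
  filter(Character == "Cartman") %>%              # keep only Cartman's lines
  group_by(Season, Episode) %>%                   # count within each episode
  unnest_tokens(output = word, input = Line) %>%  # one lowercase word per row
  count(word, sort = TRUE)                        # most frequent words first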
Output Example:
> head(cartman)
# A tibble: 6 x 4
# Groups:   Season, Episode [6]
  Season Episode word      n
   <dbl>   <dbl> <chr> <int>
1      7      11 you      73
2     11       8 i        73
3      5       4 you      66
4     16       7 you      63
5     14       8 i        61
6     11       2 i        60
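On the original error and the trailing "\n": the error appears to come from passing `Cartman$Line` to `select()` and `unnest_tokens()`. `Cartman$Line` evaluates to the character vector of dialogue lines, which `select()` then tries to use as column names; referring to the column by its bare name `Line` avoids this. The trailing newline can be stripped with `stringr::str_trim()`, although `unnest_tokens()` already discards whitespace and punctuation when it splits words. The following is a rough sketch on a made-up tibble (`cartman_lines`, `word_counts`, and the rows are illustrative assumptions, not the real data):

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up stand-in for the filtered Cartman rows (illustrative only)
cartman_lines <- tibble(
  Season  = c(1, 1, 2),
  Episode = c(1, 1, 3),
  Line    = c("I know what it means!\n",
              "I'm not telling you.\n",
              "I know, I know.\n")
)

word_counts <- cartman_lines %>%
  mutate(uniqueID = row_number(),       # one ID per spoken line
         Line = str_trim(Line)) %>%     # drop the trailing "\n"
  select(uniqueID, Line) %>%            # bare column name, not Cartman$Line
  unnest_tokens(word, Line) %>%         # lowercases and splits into words
  filter(str_detect(word, "^[a-z']+$")) %>%
  count(uniqueID, word)                 # word counts per line ID
```

`count(uniqueID, word)` gives the same counts as `group_by(uniqueID) %>% count(word)` but returns an ungrouped tibble, which is usually easier to work with afterwards.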
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pipeline-technician |
