How can I delete half of the words contained in .txt files using R?
I want to compare the first half of each text with the full text. I have already run several analyses on the full texts, which I loaded into R with the readtext() function (plus some functions to attach variables such as the session number). I have now loaded the texts again with the same function(s) and want to delete the second half of each one.
My idea was to count the words in each string first, which I did using str_count() from the stringr package:
dataframe$numwords <- stringr::str_count(dataframe$text, "\\w+")
The next step would be to use a for-loop to delete half of "numwords" words from each row in the text column. However, I don't know how to do this. Is there a better way?
My data frame looks like this (note: the texts in my real data contain about 6000 words per row on average):
| text | session_no | patient_code | numwords |
|---|---|---|---|
| I do not feel well today. | 05 | 2006X | 6 |
| My anxiety is getting worse. Why? | 05 | 2007X | 6 |
| I can not do anything right, as always. | 10 | 2006X | 8 |
Edit: Is there a way to keep the punctuation? I am searching the text for specific ngrams. Doing this without punctuation may lead to false alarms, as the detection tool may find a match in text originally coming from two separate sentences.
Solution 1:[1]
With the following, we take the text column and split it into words using strsplit().
Then we use lapply() to calculate how many words make up half of each text.
Finally, we return only the first half of each text, but we lose all punctuation in the process.
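lapply(strsplit(dataframe$text, split = "\\W+"), function(words) {
  # number of words to keep (half of the text)
  half <- round(length(words) / 2, 0)
  # head() returns nothing when half is 0, whereas words[1:half] would return the first word
  paste(head(words, half), collapse = " ")
})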
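To store the result back in the data frame, the list returned by lapply() can be flattened into a character vector. The following is a minimal sketch, assuming a small data frame built from the example rows in the question; the helper name first_half_words and the column name text_half are made up for illustration.

```r
# Example data taken from the question (numwords column omitted).
dataframe <- data.frame(
  text = c("I do not feel well today.",
           "My anxiety is getting worse. Why?",
           "I can not do anything right, as always."),
  session_no = c("05", "05", "10"),
  patient_code = c("2006X", "2007X", "2006X"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the first half of the words of one text.
# Punctuation is lost because we split on runs of non-word characters.
first_half_words <- function(text) {
  words <- strsplit(text, split = "\\W+")[[1]]
  words <- words[words != ""]          # drop the empty token a leading delimiter creates
  half <- round(length(words) / 2, 0)  # number of words to keep
  paste(head(words, half), collapse = " ")
}

# vapply() returns a plain character vector, so it can be assigned as a column.
dataframe$text_half <- vapply(dataframe$text, first_half_words,
                              character(1), USE.NAMES = FALSE)
dataframe$text_half
```

Note that round() rounds a value ending in .5 to the nearest even number, so a text with 5 words keeps 2 and a text with 7 words keeps 4. Use floor() or ceiling() instead if you need a fixed rule for odd word counts.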
Edit
If we want to keep punctuation, then we need to make some adjustments.
Our regex now keeps the delimiters, but as a side effect some lone spaces show up as separate "words", so we have to exclude them. We also use trimws() to remove trailing whitespace.
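lapply(strsplit(dataframe$text, split = "(?<=\\W)", perl = TRUE), function(tokens) {
  # only tokens containing a word character count as words (lone spaces do not)
  is_word <- grepl("\\w", tokens)
  half <- round(sum(is_word) / 2, 0)
  # keep every token up to the half-th word; lone spaces stay in place,
  # so the space after "," or "." is not lost when pasting
  keep <- cumsum(is_word) <= half
  trimws(paste(tokens[keep], collapse = ""))
})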
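If a "word" may simply be anything separated by whitespace, a simpler variant is to split on whitespace, so that punctuation stays attached to the word before it. This is a different technique from the lookbehind split above and is only a sketch: first_half_keep_punct is a hypothetical helper name, the second example text is made up, and the word count can differ slightly from str_count(text, "\\w+") (for example, "don't" counts as one word here).

```r
# Hypothetical helper: keep the first half of the whitespace-separated tokens.
first_half_keep_punct <- function(text) {
  tokens <- strsplit(trimws(text), split = "\\s+")[[1]]
  half <- round(length(tokens) / 2, 0)       # number of tokens to keep
  paste(head(tokens, half), collapse = " ")  # runs of whitespace become one space
}

texts <- c("My anxiety is getting worse. Why?",      # from the question
           "Well, I tried. It did not work again.")  # made-up example

halves <- vapply(texts, first_half_keep_punct, character(1), USE.NAMES = FALSE)
halves
# "My anxiety is"  "Well, I tried. It"
```

Because the text is rebuilt with single spaces, line breaks and repeated spaces inside the original text are not preserved.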
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
