How to add a target variable showing whether a sentence belongs to data 1 or data 2?

I am working on a project, which I will summarize with a similar case: I need to collect n tweets for each of several different hashtags.

Here is similar code:

library(rtweet)  # provides search_tweets()
library(tm)
# tweets from the first hashtag
tweets_set1 <- search_tweets("barcelona", n = 100, include_rts = FALSE)
tweets_set1$text
corpus1 <- VCorpus(VectorSource(tweets_set1$text))
corpus1

# tweets from the second hashtag
tweets_set2 <- search_tweets("realmadrid", n = 100, include_rts = FALSE)
tweets_set2$text
corpus2 <- VCorpus(VectorSource(tweets_set2$text))
corpus2

# Then I need to merge the two corpora
merge.corpus <- c(corpus1, corpus2)

# Then I did pre-processing: lowercasing and removing punctuation, numbers, and extra whitespace.

# Output
inspect(merge.corpus[1:50])

<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 51

[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 41

[[2]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 59

# Question: I need to create a target variable that shows whether each text belongs to hashtag 1 or hashtag 2. Any ideas? Note: I can't convert the corpus to a data frame to do this.




Solution 1:[1]

Add a tag to the meta structure of each VCorpus before merging them together, like this:

tagged_corpus1 <- as.VCorpus(lapply(corpus1, function(x) { x$meta$tag <- "barcelona"; return(x) }))
tagged_corpus2 <- as.VCorpus(lapply(corpus2, function(x) { x$meta$tag <- "realmadrid"; return(x) }))
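
The same tagging can also be written with tm's meta() accessor instead of assigning to x$meta directly. The following is a minimal sketch, assuming tm is installed; tag_corpus is a hypothetical helper name, and the two short text vectors are made-up stand-ins for the tweets returned by search_tweets():

```r
library(tm)

# Made-up stand-in texts; in the question these come from search_tweets().
texts1 <- c("Barca wins AGAIN", "Great match at Camp Nou")
texts2 <- c("Hala Madrid", "Real Madrid signs a new player", "Bernabeu tonight")

corpus1 <- VCorpus(VectorSource(texts1))
corpus2 <- VCorpus(VectorSource(texts2))

# Hypothetical helper: store `tag` in the metadata of every document.
tag_corpus <- function(corpus, tag) {
  as.VCorpus(lapply(corpus, function(doc) {
    meta(doc, "tag") <- tag
    doc
  }))
}

tagged_corpus1 <- tag_corpus(corpus1, "barcelona")
tagged_corpus2 <- tag_corpus(corpus2, "realmadrid")

# Read the tag back from a single document.
meta(tagged_corpus2[[1]], "tag")
```

Wrapping the step in a helper keeps the tagging logic in one place if more than two hashtags are collected.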

The tags should remain with your data during downstream processing. For example, if you merge and transform the corpus to lower case:

merged_corpus <- c(tagged_corpus1, tagged_corpus2)
cleaned_corpus <- tm_map(merged_corpus, content_transformer(tolower))

You can then look at the tags individually:

> cleaned_corpus[[1]]$meta$tag
[1] "barcelona"

Or together:

mytags <- unlist(lapply(cleaned_corpus, function(x) { x$meta$tag }))

> table(mytags)
mytags
 barcelona realmadrid
       100        100
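
If the tags are meant to serve as a label for modelling, the tag vector can be turned into a factor whose order matches the rows of a document-term matrix, without converting the corpus itself to a data frame. A minimal sketch, assuming tm is installed; the texts are made-up stand-ins for real tweets, and the names target and dtm are illustrative only:

```r
library(tm)

# Made-up stand-in texts; in the question these come from search_tweets().
texts1 <- c("Barca wins AGAIN", "Great match at Camp Nou")
texts2 <- c("Hala Madrid", "Real Madrid signs a new player", "Bernabeu tonight")

# Tag each document before merging, as in the answer above.
tagged1 <- as.VCorpus(lapply(VCorpus(VectorSource(texts1)),
                             function(x) { x$meta$tag <- "barcelona"; x }))
tagged2 <- as.VCorpus(lapply(VCorpus(VectorSource(texts2)),
                             function(x) { x$meta$tag <- "realmadrid"; x }))

merged  <- c(tagged1, tagged2)
cleaned <- tm_map(merged, content_transformer(tolower))

# One label per document, in corpus order.
mytags <- unlist(lapply(cleaned, function(x) x$meta$tag))
target <- factor(mytags, levels = c("barcelona", "realmadrid"))

# Rows of the document-term matrix follow the same document order as `target`.
dtm <- DocumentTermMatrix(cleaned)
stopifnot(nrow(dtm) == length(target))
```

Because both target and dtm are derived from the same corpus without reordering, row i of dtm corresponds to target[i].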

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Emmanuel