Group words (from defined list) into themes in R

I am new to Stack Overflow and trying to learn R.
I want to find a set of predefined words in a text and return the count of each word in a table, together with the theme I have assigned to it.
Here is my attempt:
text <- c("Green fruits are such as apples, green mangoes and avocados are good for high blood pressure. Vegetables range from greens like lettuce, spinach, Swiss chard, and mustard greens are great for heart disease. When researchers combined findings with several other long-term studies and looked at coronary heart disease and stroke separately, they found a similar protective effect for both. Green mangoes are the best.")
library(qdap)
# Own defined lists
fruit <- c("apples", "green mangoes", "avocados")
veg <- c("lettuce", "spinach", "Swiss chard", "mustard greens")
# Splitting into sentences
stext <- strsplit(text, split="\\.")[[1]]
# Obtain and count occurrences
library(plyr)
fruitres <- laply(fruit, function(x) grep(x, stext))
vegres <- laply(veg, function(x) grep(x, stext))
# Quick check: this does not return 2 results for "green mangoes"
grep("green mangoes", stext)
# Trying with the stringr package
tag_ex <- paste0('(', paste(fruit, collapse = '|'), ')')
tag_ex
library(dplyr)
library(stringr)
themes = sapply(str_extract_all(stext, tag_ex), function(x) paste(x, collapse=','))[[1]]
themes
#Create data table
library(data.table)
data.table(fruit,fruitres)
Using the qdap and stringr packages, I am unable to obtain the result I want.
Desired result for fruits and veg combined in one table:
apples fruit 1
green mangoes fruit 2
avocados fruit 1
lettuce veg 1
spinach veg 1
Swiss chard veg 1
mustard greens veg 1
Any help will be appreciated. Thank you

I tried to generalize this to any number of vectors.
A tidyverse and stringr solution:
library(tidyverse)
library(stringr)
Create a data.frame of your vectors
data <- c("fruit","veg") # vector names
L <- map(data, ~get(.x))
names(L) <- data
long <- map_df(1:length(L), ~data.frame(category=rep(names(L)[.x]), type=L[[.x]]))
# You may receive warnings about coercing to characters
# category type
# 1 fruit apples
# 2 fruit green mangoes
# 3 fruit avocados
# etc
To count instances of each
long %>%
  mutate(count = str_count(tolower(text), tolower(type)))
Output
category type count
1 fruit apples 1
2 fruit green mangoes 2
3 fruit avocados 1
4 veg lettuce 1
# etc
Extra stuff
We can add another vector easily
health <- c("blood", "heart")
data <- c("fruit","veg", "health")
# code as above
Extra output (tail)
6 veg Swiss chard 1
7 veg mustard greens 1
8 health blood 1
9 health heart 2
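For comparison, the same count can be sketched in base R without the tidyverse. This is a minimal sketch rather than the answer's own code: the names themes, count_hits and result are made up for illustration, and it assumes that case-insensitive, literal substring matching is what is wanted. It also hints at why the grep attempt in the question found "green mangoes" only once: grep is case-sensitive and reports which sentences match, not how many matches there are.

```r
text <- "Green fruits are such as apples, green mangoes and avocados are good for high blood pressure. Vegetables range from greens like lettuce, spinach, Swiss chard, and mustard greens are great for heart disease. When researchers combined findings with several other long-term studies and looked at coronary heart disease and stroke separately, they found a similar protective effect for both. Green mangoes are the best."

# Hypothetical container: one named list entry per theme
themes <- list(
  fruit = c("apples", "green mangoes", "avocados"),
  veg   = c("lettuce", "spinach", "Swiss chard", "mustard greens")
)

# stack() flattens the named list into a data frame with columns values/ind
result <- stack(themes)
names(result) <- c("type", "category")
result$type <- as.character(result$type)
result$category <- as.character(result$category)

# Count literal, case-insensitive occurrences of one word in one string
count_hits <- function(pattern, haystack) {
  hits <- gregexpr(tolower(pattern), tolower(haystack), fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr() returns -1 when there is no match
}

result$count <- vapply(result$type, count_hits, integer(1),
                       haystack = text, USE.NAMES = FALSE)
result
```

Adding another theme only means adding another entry to the themes list.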

Related

How to present the frequencies of each of the choices of multiple-choices questions that are presented in different ways?

I have this example dataframe (my real dataframe is larger and this one includes all the cases I am facing with my big dataframe)
df = data.frame(ingridents = c('bread', 'BREAD', 'Bread orange juice',
                               'orange juice', 'Apple', 'apple bread, orange juice',
                               'bread Apple ORANGE JUICE'),
                Frequency = c(10,3,5,4,2,3,1) )
In this df data frame we can see that:
the ingredient bread is written as bread, BREAD and Bread (alone or with other answers). The same goes for the ingredient apple.
the ingredient orange juice is written in multiple forms; in one group of responses it is preceded by a comma and in another it is not. Also, I want R to recognize the whole expression orange juice, not orange alone and juice alone.
The objective is to create another data frame with each of these 3 ingredients and their frequencies, as follows:
ingridents Frequency
1 BREAD 22
2 ORANGE JUICE 13
3 APPLE 6
How can I write code in R that recognizes each response and computes its total frequency (whether it is written in capital or small letters, and whether it is a two-word expression such as orange juice)?
Here is one way to do it. First, we'll do some string preprocessing (i.e. convert all strings to upper case, remove commas and join ORANGE JUICE into a single token), then split by space and do the summing:
library(tidyr)
library(dplyr)
library(stringr)
df |>
  mutate(ingridents = ingridents |>
           toupper() |>
           str_remove_all(",") |>
           str_replace_all("ORANGE JUICE", "ORANGE_JUICE")) |>
  separate_rows(ingridents, sep = " ") |>
  count(ingridents, wt = Frequency) |>
  arrange(desc(n)) |>
  mutate(ingridents = str_replace_all(ingridents, "ORANGE_JUICE", "ORANGE JUICE"))
Output:
# A tibble: 3 × 2
ingridents n
<chr> <dbl>
1 BREAD 22
2 ORANGE JUICE 13
3 APPLE 6
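If the list of ingredients is known in advance, a base R variant can skip the splitting step. The following is only a sketch under that assumption: targets and totals are illustrative names, and grepl does substring matching here, so an entry such as "pineapple" would also be counted as apple.

```r
df <- data.frame(
  ingridents = c("bread", "BREAD", "Bread orange juice",
                 "orange juice", "Apple", "apple bread, orange juice",
                 "bread Apple ORANGE JUICE"),
  Frequency = c(10, 3, 5, 4, 2, 3, 1)
)

# Hypothetical list of the ingredients to look for
targets <- c("bread", "orange juice", "apple")

# For each ingredient, sum the Frequency of every row that mentions it
totals <- vapply(
  targets,
  function(p) sum(df$Frequency[grepl(p, df$ingridents, ignore.case = TRUE)]),
  numeric(1)
)

result <- data.frame(ingridents = toupper(targets), Frequency = unname(totals))
result <- result[order(result$Frequency, decreasing = TRUE), ]
result
```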

How can I rename multiple string in same column with another name in R

The following names are in a column. I want to retain just five distinct names and replace the rest with "Other". How do I go about that?
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
If I am understanding you correctly, use %in% and ifelse. Here, I chose the first five names as an example. I also created it in a new column, but you could just overwrite the column as well or create a vector:
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
fivenamez <- c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm')
df$names_transformed <- ifelse(df$names %in% fivenamez, df$names, "Other")
# names names_transformed
# 1 Marvel Comics Marvel Comics
# 2 Dark Horse Comics Dark Horse Comics
# 3 DC Comics DC Comics
# 4 NBC - Heroes NBC - Heroes
# 5 Wildstorm Wildstorm
# 6 Image Comics Other
# 7 <NA> Other
# 8 Icon Comics Other
# 9 SyFy Other
If you want to keep NA values as NA, just use df$names_transformed <- ifelse(df$names %in% fivenamez | is.na(df$names), df$names, "Other")
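The same NA-preserving behaviour can be sketched with plain logical indexing. The vectors below are a shortened, illustrative subset of the data, and publishers, keep and out are made-up names:

```r
publishers <- c("Marvel Comics", "Image Comics", NA, "DC Comics")
keep <- c("Marvel Comics", "DC Comics")

out <- publishers
# Overwrite only entries that are present (not NA) and not in the keep list
out[!is.na(publishers) & !(publishers %in% keep)] <- "Other"
out
```

Because the NA entry is never selected for replacement, it stays NA.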
You can also use something like case_when from dplyr. The following code keeps Marvel Comics, Dark Horse Comics, DC Comics, J. K. Rowling and George Lucas as they are and changes all others to "Other". It is functionally the same as jpsmith's answer, but (in my humble opinion) offers a little more flexibility, because you can change several values more easily or give different publishers the same name should you choose to do so.
library(dplyr)
df = df %>%
  mutate(new_names = case_when(names == 'Marvel Comics' ~ 'Marvel Comics',
                               names == 'Dark Horse Comics' ~ 'Dark Horse Comics',
                               names == 'DC Comics' ~ 'DC Comics',
                               names == 'George Lucas' ~ 'George Lucas',
                               names == 'J. K. Rowling' ~ 'J. K. Rowling',
                               TRUE ~ "Other"))
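The point about giving different entries the same name can also be sketched in base R with a named lookup vector. The groupings and the names lookup, publishers and new_names below are purely illustrative and not taken from the question:

```r
# Hypothetical mapping: several original names may share one new label
lookup <- c("Marvel Comics" = "Marvel",
            "Icon Comics"   = "Marvel",
            "DC Comics"     = "DC",
            "Wildstorm"     = "DC")

publishers <- c("Marvel Comics", "Icon Comics", "Wildstorm", "SyFy", NA)

# Indexing by name returns NA for anything missing from the lookup
new_names <- unname(lookup[publishers])
new_names[is.na(new_names)] <- "Other"
new_names
```

As with the TRUE fallback in case_when, NA inputs end up as "Other" in this sketch.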

match products in a list in R

I have to classify a list of products like these:
product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast','noodles','salad','chicken salad with egg'))
Based on the words included in each element of this vector:
product_to_match<-c('cow meat','deer meat','cow milk','chicken breast','chicken egg salad','anana')
I need to check whether all the words of each entry in product_to_match appear in each element of the data frame.
I am not sure what the best way to do this is. I want to classify each product in a new column, to get something like this:
product_list<-data.frame(product=c('banana from ecuador 1 unit', 'argentinian meat (1 kg) cow','chicken breast','noodles','salad','chicken salad with egg'),
class=c(NA,'cow meat','chicken breast',NA,NA,'chicken egg salad'))
Notice that 'anana' did not match 'banana': even though the characters are contained in the string, the whole word is not. I am not sure how to do this.
Thank you.
Perhaps this could help
q <- outer(
  strsplit(product_to_match, "\\s+"),
  strsplit(product_list$product, "\\s+"),
  FUN = Vectorize(function(x, y) all(x %in% y))
)
product_list$class <- product_to_match[replace(colSums(q * row(q)), colSums(q) == 0, NA)]
such that
> product_list
product class
1 banana from ecuador 1 unit <NA>
2 argentinian meat (1 kg) cow cow meat
3 chicken breast chicken breast
4 noodles <NA>
5 salad <NA>
6 chicken salad with egg chicken egg salad
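The same whole-word idea can be written as an explicit helper, which some may find easier to read than the outer/colSums version. This is a sketch under one stated assumption: when several candidates match a product, it returns the first one. The names candidate_words and match_class are made up for illustration.

```r
product_list <- data.frame(
  product = c("banana from ecuador 1 unit", "argentinian meat (1 kg) cow",
              "chicken breast", "noodles", "salad", "chicken salad with egg"),
  stringsAsFactors = FALSE
)
product_to_match <- c("cow meat", "deer meat", "cow milk",
                      "chicken breast", "chicken egg salad", "anana")

# Split each candidate into its words once, up front
candidate_words <- strsplit(product_to_match, "\\s+")

# Return the first candidate whose words all occur as whole words in `product`
match_class <- function(product) {
  words <- strsplit(product, "\\s+")[[1]]
  hit <- vapply(candidate_words, function(cw) all(cw %in% words), logical(1))
  if (any(hit)) product_to_match[which(hit)[1]] else NA_character_
}

product_list$class <- vapply(product_list$product, match_class,
                             character(1), USE.NAMES = FALSE)
product_list
```

Because the comparison is made on whole words with %in%, 'anana' is never matched against 'banana'.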
Fuzzy matching based on stringdist (via the fuzzyjoin package) could also find some matches:
library(fuzzyjoin)
library(tibble)
stringdist_left_join(product_list, tibble(product = product_to_match),
                     method = 'soundex')

Combining Two Data Frames Horizontally in R

I would like to combine two data frames horizontally in R.
These are my two data frames:
dataframe 1:
veg loc quantity
carrot sak three
pepper lon two
tomato apw five
dataframe 2:
seller quantity veg
Ben eleven eggplant
Nour six potato
Loni four zucchini
Ahmed two broccoli
I want the outcome to be one data frame that looks like this:
veg quantity
carrot three
pepper two
tomato five
eggplant eleven
potato six
zucchini four
broccoli two
The question says "horizontally" but from the sample output it seems that what you meant was "vertically".
Now, assuming the input shown reproducibly in the Note at the end, rbind them like this. No packages are used and no objects are overwritten.
sel <- c("veg", "quantity")
rbind( df1[sel], df2[sel] )
If you like you could replace the first line of code with the following which picks out the common columns giving the same result for sel.
sel <- intersect(names(df1), names(df2))
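One detail that makes this work smoothly: for data frames, rbind lines columns up by name rather than by position. A small sketch with the question's data typed in directly (combined is an illustrative name):

```r
df1 <- data.frame(veg = c("carrot", "pepper", "tomato"),
                  loc = c("sak", "lon", "apw"),
                  quantity = c("three", "two", "five"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(seller = c("Ben", "Nour", "Loni", "Ahmed"),
                  quantity = c("eleven", "six", "four", "two"),
                  veg = c("eggplant", "potato", "zucchini", "broccoli"),
                  stringsAsFactors = FALSE)

# The second data frame has its columns in a different order on purpose;
# rbind() still aligns them by column name
combined <- rbind(df1[c("veg", "quantity")], df2[c("quantity", "veg")])
combined
```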
Note
Lines1 <- "veg loc quantity
carrot sak three
pepper lon two
tomato apw five"
Lines2 <- "seller quantity veg
Ben eleven eggplant
Nour six potato
Loni four zucchini
Ahmed two broccoli"
df1 <- read.table(text = Lines1, header = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, strip.white = TRUE)
You can do it like this:
library(tidyverse)
df1 <- df1 %>% select(veg, quantity)
df2 <- df2 %>% select(veg, quantity)
df3 <- rbind(df1, df2)

Ordering data frame in R

I have the following data frame structure:
Animal Food
1 cat fish, milk, shrimp
2 dog steak, poo
3 fish seaweed, shrimp, krill, insects
I would like to reorganize it so that the rows are in descending order of the number of items in the "Food" column:
Animal Food
1 fish seaweed, shrimp, krill, insects
2 cat fish, milk, shrimp
3 dog steak, poo
Is there an R function that can help me with that?
Thanks
You can use count.fields to figure out how many items there are in each "food" row and order by that.
count.fields(textConnection(mydf$Food), ",")
# [1] 3 2 4
Assuming your data.frame is called "mydf":
mydf[order(count.fields(textConnection(mydf$Food), ","), decreasing=TRUE),]
# Animal Food
# 3 fish seaweed, shrimp, krill, insects
# 1 cat fish, milk, shrimp
# 2 dog steak, poo
Make a new variable holding the number of items in each row and sort by that (edit: thanks to Ananda and alexis for the corrected count):
df$nFood <- sapply(strsplit(as.character(df$Food), ","), length)
df[order(df$nFood, decreasing = TRUE), ]
You can order the frame according to the results of your counting function:
animals = data.frame( rbind(c("cat","fish, milk, shrimp"),
                            c("dog","steak, poo"),
                            c("fish","seaweed, shrimp, krill, insects")))
colnames(animals) = c("Animal","Food")
# decreasing = TRUE gives the descending order asked for in the question
animals[order(sapply(animals$Food, function(x) { length(strsplit(as.character(x),split=",")[[1]]) }), decreasing = TRUE), ]
I added as.character because, in R versions before 4.0, the column defaults to a factor; on newer versions you probably don't need it. Alternatively, you can use stringsAsFactors=FALSE when creating the data frame.
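A compact variant of the same idea uses lengths(), which returns the length of each element of a list. This is a sketch assuming the Food column is a character vector; n_items and sorted are illustrative names.

```r
mydf <- data.frame(
  Animal = c("cat", "dog", "fish"),
  Food = c("fish, milk, shrimp", "steak, poo", "seaweed, shrimp, krill, insects"),
  stringsAsFactors = FALSE
)

# Number of comma-separated items in each row
n_items <- lengths(strsplit(mydf$Food, ",", fixed = TRUE))

# Reorder the rows from most items to fewest
sorted <- mydf[order(n_items, decreasing = TRUE), ]
sorted
```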

Resources