Mutate Column with Match Against List Variable Values - r

I would like to create a new column in my dataframe that corresponds to values in list variables.
My dataframe includes many rows with a 'product names' column. My intention is to create a new column that allows me to sort products into categories.
Sample code -
library(dplyr)
products <- c('Apple', 'orange', 'pear',
'carrot', 'cabbage',
'strawberry', 'blueberry')
df <- data.frame(products)
ls <- list(Fruit = c('Apple', 'orange', 'pear'),
Veg = c('carrot', 'cabbage'),
Berry = c('strawberry', 'blueberry'))
test <- df %>%
mutate(category = products %in% ls)
I hope that illustrates what I'm trying to do. By creating the list, I've basically got a register of products and their categories which could change over time.
Is there a solution to this using a list, or am I over-complicating it and not seeing the wood for the trees?
edit - It might help to let you know that I'm working with 100s of products.

Use stack() to turn the list into a two-column data frame (values, ind), then join it with the data frame:
df %>%
left_join(stack(ls), by = c('products' = 'values')) %>%
rename(category = ind)
# products category
#1 Apple Fruit
#2 orange Fruit
#3 pear Fruit
#4 carrot Veg
#5 cabbage Veg
#6 strawberry Berry
#7 blueberry Berry
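For comparison, here is a minimal base-R sketch of the same lookup without a join. It assumes the register is stored under the hypothetical name `categories` (the question calls it `ls`, which shares its name with the base function `ls()`), and it adds a made-up product, 'kiwi', to show what happens to items that are not in the register.

```r
# Register of products by category (same contents as the question's list)
categories <- list(Fruit = c("Apple", "orange", "pear"),
                   Veg   = c("carrot", "cabbage"),
                   Berry = c("strawberry", "blueberry"))

# stack() returns a data frame with columns `values` (product) and `ind` (category)
lookup <- stack(categories)

# 'kiwi' is a hypothetical product that is absent from the register
df <- data.frame(products = c("Apple", "carrot", "blueberry", "kiwi"),
                 stringsAsFactors = FALSE)

# match() finds each product's row in the lookup; unmatched products get NA
df$category <- as.character(lookup$ind[match(df$products, lookup$values)])
df
```

With a register of hundreds of products, only the `categories` list needs to change; the matching line stays the same.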

Related

How to present the frequencies of each of the choices of multiple-choices questions that are presented in different ways?

I have this example dataframe (my real dataframe is larger and this one includes all the cases I am facing with my big dataframe)
df = data.frame(ingridents = c('bread', 'BREAD', 'Bread orange juice',
'orange juice', 'Apple', 'apple bread, orange juice',
'bread Apple ORANGE JUICE'),
Frequency = c(10,3,5,4,2,3,1) )
In this df dataframe we can see that:
the ingredient bread is written as bread, BREAD and Bread (alone or with other answers). The same goes for the ingredient apple.
the ingredient orange juice is written in multiple forms; in one of the response groups there is a comma and in another there is none. Also, I want R to recognize the whole expression orange juice, not orange alone and juice alone.
The objective is to create another dataframe with each of these 3 ingredients and their frequencies as follows:
ingridents Frequency
1 BREAD 22
2 ORANGE JUICE 13
3 APPLE 6
How can I program an algorithm in R so that it recognises each response with its total frequency (whether it includes capital or small letters, or whether it is formed of two-word expressions such as orange juice)?
Here is one way to do it. First, we'll do some string preprocessing (i.e. convert all strings to upper case, remove commas and join ORANGE JUICE into a single token), then split by space and sum the frequencies:
library(tidyr)
library(dplyr)
library(stringr)
df |>
mutate(ingridents = ingridents |>
toupper() |>
str_remove_all(",") |>
str_replace_all("ORANGE JUICE", "ORANGE_JUICE")) |>
separate_rows(ingridents, sep = " ") |>
count(ingridents, wt = Frequency) |>
arrange(desc(n)) |>
mutate(ingridents = str_replace_all(ingridents, "ORANGE_JUICE", "ORANGE JUICE"))
Output:
# A tibble: 3 × 2
ingridents n
<chr> <dbl>
1 BREAD 22
2 ORANGE JUICE 13
3 APPLE 6
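If the set of ingredients is known in advance, a base-R sketch like the one below avoids the placeholder trick. The vector `known` is an assumption introduced for illustration; it is not part of the original answer. Note that this uses plain substring matching, so an ingredient such as PINEAPPLE would also be counted under APPLE.

```r
df <- data.frame(ingridents = c('bread', 'BREAD', 'Bread orange juice',
                                'orange juice', 'Apple', 'apple bread, orange juice',
                                'bread Apple ORANGE JUICE'),
                 Frequency = c(10, 3, 5, 4, 2, 3, 1))

# Hypothetical list of ingredients to look for (multi-word entries are fine)
known <- c("BREAD", "ORANGE JUICE", "APPLE")

# For each ingredient, sum the Frequency of every response that mentions it
totals <- sapply(known, function(k) {
  sum(df$Frequency[grepl(k, toupper(df$ingridents), fixed = TRUE)])
})

data.frame(ingridents = names(totals), Frequency = unname(totals))
```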

R string match between two dataframe columns

I am trying to extract texts based on a match in a character column of a dataframe with a column of another dataframe. Here is an example of reproducible dataframes.
productlist <- data.frame(prod_tg=c('Milk', 'Soybean', 'Pig meat'),
nomencl=c('milk|SMP|dairy|MK', 'Soybean|Soyabean', 'Pigmeat|PK|Pork|pigmeat') )
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'))
I would like to match the strings case-insensitively. In productlist, I have listed all alternatives in the nomencl column separated by '|', so that a match on any of them maps to the corresponding entry of prod_tg, such as Milk, Pig meat or Soybean.
My expected dataframe would look like this:
finaldf = data.frame(policy_label=c('Market Milk', 'dairy products', 'OCHA - MK', 'pig meat', 'Soybeans'), prod_match=c('milk', 'dairy', 'MK','pig', 'Soybean'), product_tag=c('Milk', 'Milk', 'Milk', 'Pig meat', 'Soybean'))
I have been thinking of grepl function in base R but open to any other function. Grateful for your suggestions.
Here's a way using stringr::str_extract
library(stringr)
cbind(tctdf,t(sapply(tctdf$policy_label, function(x) {
v <- str_extract(x, regex(productlist$nomencl, ignore_case = TRUE))
c(prod_match = toString(na.omit(v)),
product_tag = toString(productlist$prod_tg[!is.na(v)]))
}))) |> `rownames<-`(NULL)
# policy_label prod_match product_tag
#1 Market Milk Milk Milk
#2 dairy products dairy Milk
#3 OCHA - MK MK Milk
#4 pigmeat pigmeat Pig meat
#5 Soybeans Soybean Soybean
data
Changed <= to <- for tctdf and replaced 'pig meat' with 'pigmeat' so that it actually matches an entry in productlist.
tctdf <- data.frame(policy_label=c('Market Milk', 'dairy products',
'OCHA - MK', 'pigmeat', 'Soybeans'))
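Since the question mentions grepl, here is a minimal base-R sketch that returns only the product tag. It assumes the first matching row of productlist should win when several match, which the original question does not specify; the names `labels` and `tag` are introduced here for illustration.

```r
productlist <- data.frame(prod_tg = c('Milk', 'Soybean', 'Pig meat'),
                          nomencl = c('milk|SMP|dairy|MK', 'Soybean|Soyabean',
                                      'Pigmeat|PK|Pork|pigmeat'),
                          stringsAsFactors = FALSE)
labels <- c('Market Milk', 'dairy products', 'OCHA - MK', 'pigmeat', 'Soybeans')

tag <- vapply(labels, function(x) {
  # Test this label against every alternation pattern, ignoring case
  hit <- vapply(productlist$nomencl, grepl, logical(1),
                x = x, ignore.case = TRUE)
  # Assumption: keep the first matching product, NA if nothing matches
  if (any(hit)) productlist$prod_tg[which(hit)[1]] else NA_character_
}, character(1), USE.NAMES = FALSE)

data.frame(policy_label = labels, product_tag = tag)
```

Short codes such as MK or PK are matched anywhere in the label, so adding word boundaries (\\b) to the patterns may be worth considering for real data.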

How can I rename multiple string in same column with another name in R

The following names are in a column. I want to retain just five distinct names and replace the rest with "Other". How do I go about that?
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
If I am understanding you correctly, use %in% and ifelse. Here, I chose the first five names as an example. I also created it in a new column, but you could just overwrite the column as well or create a vector:
df <- data.frame(names = c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm',
'Image Comics',NA,'Icon Comics',
'SyFy','Hanna-Barbera','George Lucas','Team Epic TV','South Park',
'HarperCollins','ABC Studios','Universal Studios','Star Trek','IDW Publishing',
'Shueisha','Sony Pictures','J. K. Rowling','Titan Books','Rebellion','Microsoft',
'J. R. R. Tolkien'))
fivenamez <- c('Marvel Comics','Dark Horse Comics','DC Comics','NBC - Heroes','Wildstorm')
df$names_transformed <- ifelse(df$names %in% fivenamez, df$names, "Other")
# names names_transformed
# 1 Marvel Comics Marvel Comics
# 2 Dark Horse Comics Dark Horse Comics
# 3 DC Comics DC Comics
# 4 NBC - Heroes NBC - Heroes
# 5 Wildstorm Wildstorm
# 6 Image Comics Other
# 7 <NA> Other
# 8 Icon Comics Other
# 9 SyFy Other
If you want to keep NA values as NA, just use df$names_transformed <- ifelse(df$names %in% fivenamez | is.na(df$names), df$names, "Other")
You can also use case_when(). The following code keeps Marvel Comics, Dark Horse Comics, DC Comics, J. K. Rowling and George Lucas unchanged and changes all others (including NA) to "Other". It is functionally the same as jpsmith's answer, but (in my humble opinion) offers a little more flexibility, because you can change multiple things more easily or give different publishers the same name should you choose to do so.
library(dplyr)
df = df %>%
mutate(new_names = case_when(names == 'Marvel Comics' ~ 'Marvel Comics',
names == 'Dark Horse Comics' ~ 'Dark Horse Comics',
names == 'DC Comics' ~ 'DC Comics',
names == 'George Lucas' ~ 'George Lucas',
names == 'J. K. Rowling' ~ 'J. K. Rowling',
TRUE ~ "Other"))
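The two answers can also be combined. The sketch below uses %in% inside case_when() so the names to keep live in one vector, and adds an explicit NA branch. The vector name `keep` and the shortened sample data are assumptions made for illustration.

```r
library(dplyr)

# Shortened sample of the question's data, including an NA
df <- data.frame(names = c('Marvel Comics', 'SyFy', NA, 'J. K. Rowling'),
                 stringsAsFactors = FALSE)

# Hypothetical vector holding the names to retain
keep <- c('Marvel Comics', 'Dark Horse Comics', 'DC Comics',
          'George Lucas', 'J. K. Rowling')

df <- df %>%
  mutate(new_names = case_when(
    is.na(names)    ~ NA_character_,  # keep missing values missing
    names %in% keep ~ names,          # retain the chosen names as they are
    TRUE            ~ "Other"         # everything else becomes "Other"
  ))
df
```

Dropping the is.na() branch sends NA values to "Other" instead, which matches the behaviour of the case_when() code above.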

subtracting values in dataframe with condition(s) from another dataframe

I have two dataframes that contain sales data from a fruit store.
1st Data frame has sales data from 'Store A',
and the 2nd data frame has that data gathered from 'Store A + Store B'
StoreA = data.frame(
Fruits = c('Apple', 'Banana', 'Blueberry'),
Customer = c('John', 'Peter', 'Jenny'),
Quantity = c(2, 3, 1)
)
Total = data.frame(
Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
Customer = c('Jenny' , 'John', 'Peter', 'John', 'Peter'),
Quantity = c(4, 7, 3, 5, 3)
)
StoreA
Total
I wish to subtract the sales data of 'StoreA' from 'Total' to get sales data for 'StoreB'.
At the end, I wish to have something like
Great Question! There is a simple and graceful way of achieving exactly what you want.
The title of this question is: "subtracting values in dataframe with condition(s) from another dataframe"
This subtraction can be accomplished just like the title says. But there is a better way than using subtraction. Turning a subtraction problem into an addition problem is often the easiest way of solving a problem.
To make this into an addition problem, just convert one of the data frames (StoreA$Quantity) into negative values. Only convert the Quantity variable into negative values. And then rename the other data frame (Total) into StoreB.
Once that is done, it's easy to finish. Just use the join function with the two data frames (StoreA & StoreB). Doing that brings the negative and positive values together and the data is more understandable. When there are the same things with positive and negative values, then it's obvious these things need to be combined.
To combine those similar items, use the group_by() function and pipe it into a summarize() function. Doing the coding this way makes the code easy to read and easy to understand. The code can almost be read like a book.
Create data frames:
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
StoreB = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
Convert StoreA$Quantity to negative values:
StoreA_ <- StoreA
StoreA_$Quantity <- StoreA$Quantity * -1
StoreA_
StoreA_ now looks like this:
Fruits Customer Quantity
<fct> <fct> <dbl>
Apple John -2
Banana Peter -3
Blueberry Jenny -1
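Now combine StoreA_ and StoreB. Use the full_join() function to join the two stores. Joining on all three columns keeps every row from both data frames, because the negative and positive quantities never match each other:
library(dplyr)
Total <- full_join(StoreA_, StoreB, by = c("Fruits", "Customer", "Quantity"))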
Total
The last thing is accomplished using the group_by function. This will combine the positive and negative values together.
Total %>% group_by(Fruits, Customer) %>% summarize(s = sum(Quantity))
It's done!
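A compact sketch of the sign-flip idea is shown below. It swaps in dplyr's bind_rows() to stack the two data frames rather than a join, and it keeps the question's original name `Total` for the combined sales; both choices are assumptions about how one might write it, not the answer author's exact code.

```r
library(dplyr)

StoreA <- data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'),
                     Customer = c('John', 'Peter', 'Jenny'),
                     Quantity = c(2, 3, 1))
Total <- data.frame(Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
                    Customer = c('Jenny', 'John', 'Peter', 'John', 'Peter'),
                    Quantity = c(4, 7, 3, 5, 3))

StoreB <- bind_rows(
  Total,
  mutate(StoreA, Quantity = -Quantity)  # Store A sales enter with a negative sign
) %>%
  group_by(Fruits, Customer) %>%
  summarise(Quantity = sum(Quantity), .groups = "drop")  # positives and negatives cancel

StoreB
```

Rows where Store B sold nothing, such as Banana for Peter, remain in the result with a quantity of 0 and can be filtered out if they are not wanted.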
You could do a full join first, then rename the columns, fill the missing values resulting from the join and then compute the difference.
library(tidyverse)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
Total = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
full_join(StoreA %>%
rename(Qty_A = Quantity),
Total %>%
rename(Qty_Total = Quantity), by = c("Fruits", "Customer")) %>%
# fill NAs with zero
replace_na(list(Qty_A = 0)) %>%
# compute the difference
mutate(Qty_B = Qty_Total - Qty_A)
#> Fruits Customer Qty_A Qty_Total Qty_B
#> 1 Apple John 2 7 5
#> 2 Banana Peter 3 3 0
#> 3 Blueberry Jenny 1 4 3
#> 4 Blueberry John 0 5 5
#> 5 Pineapple Peter 0 3 3
Created on 2020-09-28 by the reprex package (v0.3.0)

Combining Two Data Frames Horizontally in R

I would like to combine two data frames horizontally in R.
These are my two data frames:
dataframe 1:
veg loc quantity
carrot sak three
pepper lon two
tomato apw five
dataframe 2:
seller quantity veg
Ben eleven eggplant
Nour six potato
Loni four zucchini
Ahmed two broccoli
I want the outcome to be one data frame that looks like this:
veg quantity
carrot three
pepper two
tomato five
eggplant eleven
potato six
zucchini four
broccoli two
The question says "horizontally" but from the sample output it seems that what you meant was "vertically".
Now, assuming the input shown reproducibly in the Note at the end, rbind them like this. No packages are used and no objects are overwritten.
sel <- c("veg", "quantity")
rbind( df1[sel], df2[sel] )
If you like you could replace the first line of code with the following which picks out the common columns giving the same result for sel.
sel <- intersect(names(df1), names(df2))
Note
Lines1 <- "veg loc quantity
carrot sak three
pepper lon two
tomato apw five"
Lines2 <- "seller quantity veg
Ben eleven eggplant
Nour six potato
Loni four zucchini
Ahmed two broccoli"
df1 <- read.table(text = Lines1, header = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, strip.white = TRUE)
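As a self-contained check of the rbind approach, the sketch below rebuilds the two inputs with data.frame() instead of read.table() and stores the result under the hypothetical name `combined`. Note that rbind() on data frames matches columns by name, so the different column order in the second data frame does not matter once the same columns are selected.

```r
df1 <- data.frame(veg = c("carrot", "pepper", "tomato"),
                  loc = c("sak", "lon", "apw"),
                  quantity = c("three", "two", "five"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(seller = c("Ben", "Nour", "Loni", "Ahmed"),
                  quantity = c("eleven", "six", "four", "two"),
                  veg = c("eggplant", "potato", "zucchini", "broccoli"),
                  stringsAsFactors = FALSE)

# Columns present in both data frames, in df1's order: "veg", "quantity"
sel <- intersect(names(df1), names(df2))

# Stack the rows of df2 under those of df1, keeping only the shared columns
combined <- rbind(df1[sel], df2[sel])
combined
```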
You can do it like this:
library(tidyverse)
df1 <- df1 %>% select(veg, quantity)
df2 <- df2 %>% select(veg, quantity)
df3 <- rbind(df1, df2)
