R: Conditional replacement using two data frames

I have a data frame df and a lookup data frame df_rep like this:
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"))
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
esp = c("manzana", "naranja", "uva"))
>df
fruits
apple
orange
pineapple
banana
grape
>df_rep
eng esp
apple manzana
orange naranja
grape uva
I want to replace the values in the fruits column of df by looking them up in df_rep. If a value in the fruits column of df appears in the eng column of df_rep, I want to replace it with the corresponding value in the esp column of df_rep. So the result should look like this:
>df
fruits
manzana
naranja
pineapple
banana
uva
Point: I don't want to use ifelse, because my real replacement list has more than 100 entries (the example here is simplified for easy understanding). I don't want a for loop either, as my data frame contains more than 40,000 rows. I am looking for a simple, single-step solution.
Thank you very much for your help!

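We can use the merge function (to simulate a SQL left join) and then the ifelse function to replace the fruits with the non-NA esp values. Note that merge sorts the result by the join key by default, so the rows come back in alphabetical order of the original fruits rather than in the original row order: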
df2 <- merge(df, df_rep, by.x = 'fruits', by.y = 'eng', all.x = TRUE)
df2$fruits <- ifelse(is.na(df2$esp), df2$fruits, df2$esp)
# fruits esp
# 1 manzana manzana
# 2 banana <NA>
# 3 uva uva
# 4 naranja naranja
# 5 pineapple <NA>
Data
It's important to set stringsAsFactors = FALSE when creating the data:
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"),
stringsAsFactors = FALSE)
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
esp = c("manzana", "naranja", "uva"),
stringsAsFactors = FALSE)

Another option is coalesce from dplyr to replace the NAs that result from match with the respective values from df$fruits.
library(dplyr)
df$fruits2 <- coalesce(df_rep$esp[match(df$fruits, df_rep$eng)], df$fruits)
df
# fruits fruits2
#1 apple manzana
#2 orange naranja
#3 pineapple pineapple
#4 banana banana
#5 grape uva
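The same match-based lookup can also be written in base R without any package. The following is a minimal sketch that reuses the example data from the question; the helper variable idx is just an illustrative name. It overwrites only the rows that have a translation and keeps the original row order.

```r
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"),
                 stringsAsFactors = FALSE)
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
                     esp = c("manzana", "naranja", "uva"),
                     stringsAsFactors = FALSE)

# Position of each fruit in the lookup table (NA when there is no translation)
idx <- match(df$fruits, df_rep$eng)

# Replace only the rows that were found; all other rows stay untouched
found <- !is.na(idx)
df$fruits[found] <- df_rep$esp[idx[found]]

df$fruits
# "manzana" "naranja" "pineapple" "banana" "uva"
```

Because match is vectorised, this is one pass over the column regardless of how many replacement pairs df_rep holds.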

Related

R grepl with dynamic search pattern

I have a data frame, df, with one column of different names. I also have several data frames, e.g. search_df or search_df1, containing search words which I would like to look for via regex in the name column.
If a word is found, it should be written into a new column, e.g. df_final$which_word_search_df.
If more than one word is found, I would like to paste the results together.
The result should look like df_final.
# load packages
pacman::p_load(tidyverse)
# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))
# how I expect the final result to look like
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))
This is my current solution, but as you can see it is not dynamic: I type in every search word manually instead of automatically going through all the search words.
df_trial <- df %>%
mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = T), "apple", ""),
which_search_word_trial = ifelse(grepl("peach", name, ignore.case = T),
paste(which_search_word_trial, "peach", sep = ";"), which_search_word_trial)
)
The example I am sharing is just a minimal one. For the actual use case df will have ~200k rows and my search_df will have ~1k rows.
We can do the following.
library(dplyr)
library(stringr)
df %>%
mutate(which_word_search_df = str_extract_all(name,str_c(search_df$search_words, collapse = '|')),
which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))
# name which_word_search_df which_word_search_df1
# 1 apple123 apple
# 2 applepeach apple, peach peach
# 3 peachtime peach peach
# 4 peachab peach peach
# 5 bananarrr banana
# 6 bananaxy banana
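The approach above returns list columns. To get a single string per row, with the matches pasted together and NA where nothing matched, the matches can be collapsed afterwards. The following is a rough sketch in base R, using regmatches and gregexpr in place of str_extract_all; the helper name collapse_matches is hypothetical. It assumes the search words contain no regex metacharacters and do not overlap each other, since an alternation pattern only returns non-overlapping matches.

```r
search_df <- data.frame(search_words = c("apple", "peach"),
                        stringsAsFactors = FALSE)
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: find all search words in each string and
# paste the distinct hits together, returning NA when there are none
collapse_matches <- function(x, words, sep = "; ") {
  pattern <- paste(words, collapse = "|")
  hits <- regmatches(x, gregexpr(pattern, x))
  vapply(hits, function(h) {
    if (length(h) == 0) NA_character_ else paste(unique(h), collapse = sep)
  }, character(1))
}

df$which_word_search_df <- collapse_matches(df$name, search_df$search_words)
df$which_word_search_df
# "apple" "apple; peach" "peach" "peach" NA NA
```

Calling the helper once per search data frame gives one new column per word list without typing the words by hand.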
Using your df as input (not the df_final): Here is an "automatic" way to do it by providing the name of the search dataframes:
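n <- c('search_df', 'search_df1')
for (i in n) {
  # for each search word, find the rows of df$name that contain it
  a <- lapply(get(i)$search_words, function(j) grep(j, df$name))
  # turn the list into a two-column table: row index (values) and word (ind)
  a <- stack(setNames(a, get(i)$search_words))
  df[, paste0('which_word_', i)] <- NA
  df[a$values, paste0('which_word_', i)] <- as.character(a$ind)
}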
The output is stored directly in df, but you can change this easily by copying df to final_df and then using that copy in the last two lines of the loop.
output:
name which_word_search_df which_word_search_df1
1 apple123 apple <NA>
2 applebum apple <NA>
3 peachtime peach peach
4 peachab peach peach
5 bananarrr <NA> banana
6 bananaxy <NA> banana
Lemme know if it worked for you

Cross table in R to find the relationship of two variables

I am trying to form a cross table for two items in my data frame, but they are not conveniently laid in two columns, rather they are elements inside the columns that have to be filtered out to continue with the crosstables.
e.g.
column titles: Gender, Favourite Fruit
column 1: F,M,M,M,F,M,F,M,M,F
column 2: apple, pear, pear, grapes, apple, banana, peach, apple, pear, grapes
I would like to make a cross-table for female and apple, to see if there is a relationship. How should I go about doing this?
Thank you!
Emmy
There are lots of ways to do this, but the workhorse is the table() function.
Here is some fake data:
set.seed(123)
df <- data.frame(gender = sample(c("M", "F"), 1000, replace = T ),
fruit = sample(c("apple", "grapes", "banana", "pear"), 1000, replace = T) )
The table() function is a great way to create cross tabulations. For example:
table(df)
fruit
gender apple banana grapes pear
F 134 122 128 109
M 114 131 127 135
You can do a lot with this function. To get something like what you want, you can create named logical vectors right in the arguments of the function:
table(Female = df$gender == "F", Apple = df$fruit == "apple")
Apple
Female FALSE TRUE
FALSE 393 114
TRUE 359 134
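Applied to the ten rows given in the question, the same call looks as follows. This is a small sketch; the object names survey, tab and test_result are made up for illustration. The fisher.test call is an added assumption about how one might check for a relationship in a 2x2 table this small, not something the original answer proposes.

```r
# The ten observations listed in the question
survey <- data.frame(
  gender = c("F", "M", "M", "M", "F", "M", "F", "M", "M", "F"),
  fruit  = c("apple", "pear", "pear", "grapes", "apple",
             "banana", "peach", "apple", "pear", "grapes"),
  stringsAsFactors = FALSE
)

# 2x2 cross table: is the respondent female, is the favourite fruit apple
tab <- table(Female = survey$gender == "F", Apple = survey$fruit == "apple")
tab
#        Apple
# Female  FALSE TRUE
#   FALSE     5    1
#   TRUE      2    2

# Exact test of association for a small 2x2 table
test_result <- fisher.test(tab)
test_result$p.value
```

With only ten observations any test has very little power, so the table itself is probably the more useful output here.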

Loop through 2 dataframes to identify common columns

I have 2 reproducible data frames here. I am trying to identify which columns of one contain values that also occur in a column of the other. My code should take each column of df1 and compare it against every single column in df2. The code below works, but it needs fine-tuning to allow one column to match multiple columns.
df1 <- data.frame(fruit=c("Apple", "Orange", "Pear"), location = c("Japan", "China", "Nigeria"), price = c(32,53,12))
df2 <- data.frame(grocery = c("Durian", "Apple", "Watermelon"),
place=c("Korea", "Japan", "Malaysia"),
name = c("Mark", "John", "Tammy"),
favourite.food = c("Apple", "Wings", "Cakes"),
invoice = c("XD1", "XD2", "XD3"))
df <- sapply(names(df1), function(x) {
temp <- sapply(names(df2), function(y)
if(any(match(df1[[x]], df2[[y]], nomatch = FALSE))) y else NA)
ifelse(all(is.na(temp)), NA, temp[which.max(!is.na(temp))])
}
)
t1 <- data.frame(lapply(df, type.convert), stringsAsFactors=FALSE)
t1 <- data.frame(t(t1))
t1 <- cbind(newColName = rownames(t1), t1)
rownames(t1) <- 1:nrow(t1)
colnames(t1) <- c("Columns from df1", "Columns from df2")
df1
fruit location price
1 Apple Japan 32
2 Orange China 53
3 Pear Nigeria 12
df2
grocery place name favourite.food invoice
1 Durian Korea Mark Apple XD1
2 Apple Japan John Wings XD2
3 Watermelon Malaysia Tammy Cakes XD3
t1 #(OUTPUT FROM CODE ABOVE)
Columns from df1 Columns from df2
1 fruit grocery
2 location place
3 price <NA>
This is the output I hope to obtain instead:
Columns from df1 Columns from df2
1 fruit grocery, favourite.food
2 location place
3 price <NA>
Notice that the columns "grocery" and "favourite.food" both match the column "fruit", whereas my code only returns one column.
We can change the code to return all the matches instead and wrap them in one string using toString:
vec <- sapply(names(df1), function(x) {
temp <- sapply(names(df2), function(y)
if(any(match(df1[[x]], df2[[y]], nomatch = FALSE))) y else NA)
ifelse(all(is.na(temp)), NA, toString(temp[!is.na(temp)]))
}
)
vec
# fruit location price
#"grocery, favourite.food" "place" NA
To convert it into a data frame, we can do:
data.frame(columns_from_df1 = names(vec), columns_from_df2 = vec, row.names = NULL)
# columns_from_df1 columns_from_df2
#1 fruit grocery, favourite.food
#2 location place
#3 price <NA>
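The same idea can be written a little more compactly with %in%, iterating over the columns directly instead of over their names. This is only a sketch of an alternative formulation, using the data from the question with stringsAsFactors = FALSE; the names hit and matches are illustrative.

```r
df1 <- data.frame(fruit = c("Apple", "Orange", "Pear"),
                  location = c("Japan", "China", "Nigeria"),
                  price = c(32, 53, 12),
                  stringsAsFactors = FALSE)
df2 <- data.frame(grocery = c("Durian", "Apple", "Watermelon"),
                  place = c("Korea", "Japan", "Malaysia"),
                  name = c("Mark", "John", "Tammy"),
                  favourite.food = c("Apple", "Wings", "Cakes"),
                  invoice = c("XD1", "XD2", "XD3"),
                  stringsAsFactors = FALSE)

# For every column of df1, list all columns of df2 sharing at least one value
matches <- sapply(df1, function(x) {
  hit <- vapply(df2, function(y) any(x %in% y), logical(1))
  if (any(hit)) toString(names(df2)[hit]) else NA_character_
})

data.frame(columns_from_df1 = names(matches),
           columns_from_df2 = unname(matches))
#   columns_from_df1        columns_from_df2
# 1            fruit grocery, favourite.food
# 2         location                   place
# 3            price                    <NA>
```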

Mutate Column with Match Against List Variable Values

I would like to create a new column in my dataframe that corresponds to values in list variables.
My dataframe includes many rows with a 'product names' column. My intention is to create a new column that allows me to sort products into categories.
Sample code -
library(dplyr)
products <- c('Apple', 'orange', 'pear',
'carrot', 'cabbage',
'strawberry', 'blueberry')
df <- data.frame(products)
ls <- list(Fruit = c('Apple', 'orange', 'pear'),
Veg = c('carrot', 'cabbage'),
Berry = c('strawberry', 'blueberry'))
test <- df %>%
mutate(category = products %in% ls)
I hope that illustrates what I'm trying to do. By creating the list, I've basically got a register of products and their categories which could change over time.
Is there a solution to this using a list, or am I over-complicating it and not seeing the wood for the trees?
edit - It might help to let you know that I'm working with 100s of products.
You can stack the list into a two-column data frame (values and ind) and then join it with the data frame:
df %>%
left_join(stack(ls), by = c('products' = 'values')) %>%
rename(category = ind)
# products category
#1 Apple Fruit
#2 orange Fruit
#3 pear Fruit
#4 carrot Veg
#5 cabbage Veg
#6 strawberry Berry
#7 blueberry Berry
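If a join feels heavy, the list can also be flattened into a named lookup vector in base R. The following is a minimal sketch; the list is called categories here (instead of ls, which shadows the base function ls()) and lookup is an illustrative name. Note the as.character call: if products were a factor, indexing by it would use the integer codes rather than the labels.

```r
df <- data.frame(products = c("Apple", "orange", "pear",
                              "carrot", "cabbage",
                              "strawberry", "blueberry"),
                 stringsAsFactors = FALSE)
categories <- list(Fruit = c("Apple", "orange", "pear"),
                   Veg   = c("carrot", "cabbage"),
                   Berry = c("strawberry", "blueberry"))

# Named vector: names are the products, values are their categories
lookup <- setNames(rep(names(categories), lengths(categories)),
                   unlist(categories))

# Products missing from the list get NA
df$category <- unname(lookup[as.character(df$products)])
df$category
# "Fruit" "Fruit" "Fruit" "Veg" "Veg" "Berry" "Berry"
```

When the register of products changes, only the categories list needs to be edited.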

Count elements in dataframe column, then create separate columns in R

I have been struggling for a few days to find a solution myself. I hope you can help.
I checked the following already:
Counting the number of elements with the values of x in a vector
Split strings in a matrix column and count the single elements in a new vector
http://tidyr.tidyverse.org/reference/separate.html
Count of Comma separate values in r
I have a dataframe as follows:
df<-list(column=c("apple juice,guava-peach juice,melon apple juice","orange juice,pineapple strawberry lemon juice"))
df<-data.frame(df)
I want to put each element separated by "," into its own column. The number of columns must be based on the maximum number of elements found in any row of column:
column1 column2 column3
apple juice guava-peach juice melon apple juice
orange juice pineapple strawberry lemon juice NA
I tried using
library(tidyverse)
library(stringr)
#want to calculate number of columns needed and the sequence
x<-str_count(df$column)
results<-df%>%separate(column,x,",")
Unfortunately I am not getting what I wish to.
Thank you for your help.
Do you mean this?
library(splitstackshape)
library(dplyr)
df %>%
cSplit("column", ",")
Output is:
column_1 column_2 column_3
1: apple juice guava-peach juice melon apple juice
2: orange juice pineapple strawberry lemon juice <NA>
Sample data:
df <- structure(list(column = structure(1:2, .Label = c("apple juice,guava-peach juice,melon apple juice",
"orange juice,pineapple strawberry lemon juice"), class = "factor")), .Names = "column", row.names = c(NA,
-2L), class = "data.frame")
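For comparison, the split can also be done in base R without splitstackshape. This is a rough sketch that assumes the column is character (or is converted with as.character first); the names parts, n_cols, padded and out are illustrative. The number of output columns is taken from the row with the most elements, and shorter rows are padded with NA.

```r
df <- data.frame(
  column = c("apple juice,guava-peach juice,melon apple juice",
             "orange juice,pineapple strawberry lemon juice"),
  stringsAsFactors = FALSE
)

# Split every row at the commas
parts <- strsplit(as.character(df$column), ",", fixed = TRUE)

# Widest row decides how many columns are needed
n_cols <- max(lengths(parts))

# Extending a vector's length pads it with NA
padded <- lapply(parts, function(p) {
  length(p) <- n_cols
  p
})

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(out) <- paste0("column", seq_len(n_cols))
out
```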
