Combining two dataframes based on presence of values of various different columns - r

I have a question about creating new columns in my dataset by checking whether a value is present in one of the columns of my dataframe, and assigning the columns of a different dataframe based on that presence. As this description is quite vague, see the example dataset below:
newDf <- data.frame(c("Juice 1", "Juice 2", "Juice 3", "Juice 4","Juice 5"),
c("Banana", "Banana", "Orange", "Pear", "Apple"),
c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
c("Kale", NA, "Cherry", NA, "Peach"))
colnames(newDf) <- c("Juice", "Fruit 1", "Fruit 2", "Fruit 3")
dfChecklist <- data.frame(c("Banana", "Cherry"),
c("100", "80"),
c("5", "3"),
c("4", "5"))
colnames(dfChecklist) <- c("FruitCheck", "NutritionalValue", "Deliciousness", "Difficulty")
This gives the following dataframes:
Juice Fruit 1 Fruit 2 Fruit 3
1 Juice 1 Banana Blueberry Kale
2 Juice 2 Banana Mango <NA>
3 Juice 3 Orange Rasberry Cherry
4 Juice 4 Pear Spinach <NA>
5 Juice 5 Apple Pear Peach
FruitCheck NutritionalValue Deliciousness Difficulty
1 Banana 100 5 4
2 Cherry 80 3 5
I want to combine the two and make the result to be like this:
Juice Fruit 1 Fruit 2 Fruit 3 FruitCheck NutritionalValue Deliciousness Difficulty
1 Juice 1 Banana Blueberry Kale Banana 100 5 4
2 Juice 2 Banana Mango <NA> Banana 100 5 4
3 Juice 3 Orange Rasberry Cherry Cherry 80 3 5
4 Juice 4 Pear Spinach <NA> <NA> <NA> <NA> <NA>
5 Juice 5 Apple Pear Peach <NA> <NA> <NA> <NA>
The dataset above is only an example; my own dataset is much larger and more complex.
Thanks so much in advance for your help!

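First, find the matching fruit for each row (if several fruits in a row match, the one listed earliest in dfChecklist wins):
tmp <- unlist(
  apply(
    # look only at the fruit columns, one row at a time
    newDf[, grepl("Fruit", colnames(newDf))],
    1,
    function(x) {
      y <- as.vector(x)
      # which.min() ignores NA, so fruits missing from the checklist are skipped
      y <- y[which.min(match(y, dfChecklist$FruitCheck))]
      # no match at all: return NA so the row is still kept
      ifelse(length(y) == 0, NA, y)
    }
  )
)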
Add this to your original data frame, then do a simple merge:
newDf$FruitCheck <- tmp
merge(
  newDf,
  dfChecklist,
  by = "FruitCheck",
  all.x = TRUE  # keep juices that have no match in the checklist
)
Resulting in:
FruitCheck Juice Fruit 1 Fruit 2 Fruit 3 NutritionalValue Deliciousness Difficulty
1 Banana Juice 1 Banana Blueberry Kale 100 5 4
2 Banana Juice 2 Banana Mango <NA> 100 5 4
3 Cherry Juice 3 Orange Rasberry Cherry 80 3 5
4 <NA> Juice 4 Pear Spinach <NA> <NA> <NA> <NA>
5 <NA> Juice 5 Apple Pear Peach <NA> <NA> <NA>
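Note that merge() sorts by the key and moves FruitCheck to the first column. If the original row and column order matters, a base R lookup with match() is one alternative. This is only a minimal sketch: it assumes the first fruit from left to right that appears in dfChecklist should win, which can differ from the answer above when a row contains several checklist fruits. The names fruit_cols, hit, extra and result are illustrative.

```r
newDf <- data.frame(
  Juice = c("Juice 1", "Juice 2", "Juice 3", "Juice 4", "Juice 5"),
  `Fruit 1` = c("Banana", "Banana", "Orange", "Pear", "Apple"),
  `Fruit 2` = c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
  `Fruit 3` = c("Kale", NA, "Cherry", NA, "Peach"),
  check.names = FALSE, stringsAsFactors = FALSE
)
dfChecklist <- data.frame(
  FruitCheck = c("Banana", "Cherry"),
  NutritionalValue = c("100", "80"),
  Deliciousness = c("5", "3"),
  Difficulty = c("4", "5"),
  stringsAsFactors = FALSE
)

fruit_cols <- grep("^Fruit", names(newDf), value = TRUE)

# For each row, keep the first fruit (left to right) found in the checklist
hit <- apply(newDf[fruit_cols], 1, function(x) {
  m <- x[x %in% dfChecklist$FruitCheck]
  if (length(m) > 0) m[[1]] else NA_character_
})

# Look up the checklist row for each hit; an NA hit yields an all-NA row
extra <- dfChecklist[match(hit, dfChecklist$FruitCheck), ]
rownames(extra) <- NULL

# Column-bind so the original row and column order is preserved
result <- cbind(newDf, extra)
result
```

Because nothing is sorted or re-keyed, the rows stay in the order Juice 1 to Juice 5 and the checklist columns are appended on the right, as in the desired output of the question.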

Related

Collapse rows with same identifier and columns and retain all values in r

I have a dataframe that contains several fields related to an identifier but some are disjointed:
id store manager fruit vegetable
1 Grocery1 Joe apple NA
1 Grocery1 Joe lemon NA
1 Grocery1 Joe NA zucchini
2 Grocery2 Amy orange NA
2 Grocery2 Amy NA asparagus
2 Grocery2 Amy NA spinach
3 Grocery3 Bill NA NA
I want the dataframe to look like:
id store manager fruit vegetable
1 Grocery1 Joe apple zucchini
1 Grocery1 Joe lemon zucchini
2 Grocery2 Amy orange asparagus
2 Grocery2 Amy orange spinach
3 Grocery3 Bill NA NA
Is there a way to easily do this?
You can use tidyr::fill() to fill in the NA values within each group, and then keep only the non-duplicated rows with distinct().
library(dplyr)
library(tidyr)
df %>%
group_by(store, manager) %>%
fill(fruit, vegetable, .direction = "updown") %>%
distinct()
# A tibble: 5 × 5
# Groups: store, manager [3]
id store manager fruit vegetable
<int> <chr> <chr> <chr> <chr>
1 1 Grocery1 Joe apple zucchini
2 1 Grocery1 Joe lemon zucchini
3 2 Grocery2 Amy orange asparagus
4 2 Grocery2 Amy orange spinach
5 3 Grocery3 Bill NA NA
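fill() copies the nearest non-missing value within a group, which reproduces the desired output for this data. If the intent is instead to pair every fruit with every vegetable recorded for the same id, a base R sketch with merge() is another option. This is a different technique from the answer above, and the object names keys, fruits, vegs and out are assumptions made for illustration.

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3),
  store = c("Grocery1", "Grocery1", "Grocery1",
            "Grocery2", "Grocery2", "Grocery2", "Grocery3"),
  manager = c("Joe", "Joe", "Joe", "Amy", "Amy", "Amy", "Bill"),
  fruit = c("apple", "lemon", NA, "orange", NA, NA, NA),
  vegetable = c(NA, NA, "zucchini", NA, "asparagus", "spinach", NA),
  stringsAsFactors = FALSE
)

# One row per identifier, plus the non-missing fruits and vegetables per id
keys   <- unique(df[c("id", "store", "manager")])
fruits <- unique(df[!is.na(df$fruit), c("id", "fruit")])
vegs   <- unique(df[!is.na(df$vegetable), c("id", "vegetable")])

# Left joins keep ids that have no fruit or no vegetable (here id 3)
out <- merge(merge(keys, fruits, by = "id", all.x = TRUE),
             vegs, by = "id", all.x = TRUE)
out
```

For the example data both approaches give the same five rows; they diverge once an id has several fruits and several vegetables, where the merge() version returns every combination.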

How does ggplot2 evaluate factor levels?

Consider this example dataframe:
library(tidyverse)
df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
"Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
category = c(rep("Fruit", 5), rep("Vegetable", 5)),
n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))
First, I order the factor levels of item by n. I also add a variable item_fct that shows the integer code of each level of item in the output table, to make my problem easier to follow:
df %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 8
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 7
9 Carrot Vegetable 31 6
10 Eggplant Vegetable 23 5
This output table and the factor levels make sense to me: Item is labeled from 1-10 by n in an ascending order.
But if I group it by category, I get the following output:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
# Groups: category [2]
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 5
7 Cabbage Vegetable 70 4
8 Broccoli Vegetable 33 3
9 Carrot Vegetable 31 2
10 Eggplant Vegetable 23 1
Here, the factor levels are rather confusing to me. The items, depending on the group, share the same factor levels (1-5). I would still have expected levels from 1-10, but ordered by n depending on the respective group. Like this:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
My Question:
Ultimately, it's hard for me to understand why ggplot2 produces the following plot with this code:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item)) %>%
ggplot(aes(n, item, fill = category)) +
geom_col()
If the items are factored from 1-5 in their respective groups and thus share the same overall factor levels, then why does ggplot2 display the bars by the grouped variable?
This plot is actually my desired outcome, but I don't understand the logic behind it. Considering the factor levels, I would expect bars of the same factor level to be next to each other (basically starting with "Eggplant", then "Orange", then "Carrot", then "Blueberry" etc.).
I would've assumed that the following table would produce the plot, because here the factor levels are ordered in the same way the bars are ordered in the plot:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
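One plausible explanation, offered as a sketch rather than a definitive answer: inside a grouped mutate(), fct_reorder() and as.numeric() both run once per group, so each group sees its own five-level factor and the codes run from 1 to 5. When dplyr (version 1.0 or later is assumed here) stitches the groups back together, factors with different levels are combined by taking the union of their levels in order of appearance, so the stored item column ends up with ten levels: the Fruit levels followed by the Vegetable levels. ggplot2 then orders the bars by those ten levels, not by the per-group codes saved in item_fct. The column names code_in_group and code_overall below are illustrative.

```r
library(dplyr)
library(forcats)

df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
                      "Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
             category = c(rep("Fruit", 5), rep("Vegetable", 5)),
             n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))

out <- df %>%
  group_by(category) %>%
  mutate(item = fct_reorder(item, n),
         code_in_group = as.integer(item)) %>%  # evaluated on the per-group factor
  ungroup() %>%
  mutate(code_overall = as.integer(item))       # evaluated on the combined factor

# Levels of the combined factor: this is the order ggplot2 uses for the axis
levels(out$item)

# Compare the per-group codes with the codes of the combined factor
out %>% select(item, category, code_in_group, code_overall)
```

If this reading is right, code_overall matches the table the question expected, while code_in_group reproduces the confusing 1 to 5 values.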

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
DF1 %>%
  inner_join(DF2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But because of the semicolons, the join does not capture those data points. The end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
I called the first table fruit and the second one stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
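With base R, we can use strsplit with merge:
# as.character() guards against ID being stored as a factor
lst1 <- strsplit(as.character(DF2$ID), ";")
# repeat each DF2 row once per ID it lists, then merge on the shared ID column
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)), lengths(lst1)), 1:2],
                     ID = unlist(lst1)))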
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
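For anyone who wants to try this end to end, here is a small reproducible sketch of the base R route. The data frames are rebuilt from the tables in the question; storing DF2$ID as character and the helper names ids, long and joined are assumptions made for illustration.

```r
DF1 <- data.frame(
  X = c("banana", "orange", "pineapple", "guava", "grapes"),
  Y = c(14, 20, 1, 300, 1),
  ID = 1:5,
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Store = c("Walmart", "Sears", "Target", "Old Navy", "Popeye's", "Footlocker"),
  State = c("NY", "AL", "DC", "PA", "HA", "NJ"),
  ID = c("1", "1;2", "3", "3", "5", "4;5"),
  stringsAsFactors = FALSE
)

# Split "1;2" into c("1", "2"), one list element per store
ids <- strsplit(DF2$ID, ";", fixed = TRUE)

# Repeat each store row once per ID it lists, then attach the single IDs
long <- DF2[rep(seq_len(nrow(DF2)), lengths(ids)), c("Store", "State")]
long$ID <- as.integer(unlist(ids))  # same type as DF1$ID

joined <- merge(DF1, long, by = "ID")
joined
```

Converting the split IDs to integer keeps the key the same type on both sides, which matters more for dplyr joins than for merge() but is a reasonable habit either way.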

Collapsing group of strings into one string using an if statement within a for loop in R

I have a dataframe with a column "Food."
dataframe <- data.frame(Color = c("red","red","red","red","red","blue","blue","blue","blue","blue","green","green","green","green","green","orange","orange","orange","orange","orange"),
Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter","kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"), Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit."
I've tried making a vector of the fruit names. Then I go through each row in the dataframe and where the string matches the fruit, I want to replace the fruit name with "fruit."
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?
With dplyr:
library(dplyr)
dataframe %>%
  mutate(Food = as.character(Food),
         Food = ifelse(Food %in% fruit_list, "Fruit", Food))  # use "fruit" for lower case
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4
Base R only:
dataframe$Food <-
  sapply(dataframe$Food,
         function(x, fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x)),
         fruit_list = fruit_list)
You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
You can do this in one line with the data.table package:
library(data.table)
setDT(dataframe)[, Food := ifelse(Food %in% fruit_list, "fruit", as.character(Food))]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4
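As for why the original loop fills the column with NA: a likely cause (an assumption, since the question does not show the column type) is that Food was a factor, which was the data.frame() default before R 4.0.0. Assigning a string that is not an existing level to a factor element produces NA with a warning. The sketch below shows the effect and one fix, using the hypothetical vectors food and food_chr and a shortened fruit_list.

```r
fruit_list <- c("banana", "apple", "orange")

# Factor case: "fruit" is not one of the levels, so the assignment gives NA
food <- factor(c("banana", "potato", "orange"))
suppressWarnings(food[1] <- "fruit")  # would warn: invalid factor level, NA generated
is.na(food[1])

# Character case: the same replacement works, and needs no loop
food_chr <- c("banana", "potato", "orange")
food_chr[food_chr %in% fruit_list] <- "fruit"
food_chr
```

This is why the answers above wrap the column in as.character() before replacing values.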

Count most frequent word in row by R [duplicate]

There is a table shown below
Name Mon Tue Wed Thu Fri Sat Sun
1 John Apple Orange Apple Banana Apple Apple Orange
2 Ricky Banana Apple Banana Banana Banana Banana Apple
3 Alex Apple Orange Orange Apple Apple Orange Orange
4 Robbin Apple Apple Apple Apple Apple Banana Banana
5 Sunny Banana Banana Apple Apple Apple Banana Banana
I want to find the most frequent fruit for each person and add its name and count as new columns.
For example:
Name Mon Tue Wed Thu Fri Sat Sun Max_Acc Count
1 John Apple Orange Apple Banana Apple Apple Orange Apple 4
2 Ricky Banana Apple Banana Banana Banana Banana Apple Banana 5
3 Alex Apple Orange Orange Apple Apple Orange Orange Orange 4
4 Robbin Apple Apple Apple Apple Apple Banana Banana Apple 5
5 Sunny Banana Banana Apple Apple Apple Banana Banana Banana 4
I am having trouble doing this by row. I can find the frequency within a column using the table() function:
table(df$Mon)
Apple Banana
3 2
But here I want the name of the most frequent fruit in a new column.
If we need the "Count" and the "Names" corresponding to the maximum count, we can loop through the rows of the dataset (apply with MARGIN = 1), use table to get the frequencies, extract the maximum value and the name that goes with it, rbind the per-row results, and cbind them to the original dataset.
cbind(df1, do.call(rbind, apply(df1[-1], 1, function(x) {
  x1 <- table(x)  # frequency of each fruit in this row
  data.frame(Count = max(x1), Names = names(x1)[which.max(x1)])
})))
# Name Mon Tue Wed Thu Fri Sat Sun Count Names
#1 John Apple Orange Apple Banana Apple Apple Orange 4 Apple
#2 Ricky Banana Apple Banana Banana Banana Banana Apple 5 Banana
#3 Alex Apple Orange Orange Apple Apple Orange Orange 4 Orange
#4 Robbin Apple Apple Apple Apple Apple Banana Banana 5 Apple
#5 Sunny Banana Banana Apple Apple Apple Banana Banana 4 Banana
Or we can use data.table
library(data.table)
setDT(df1)[, c("Names", "Count") := {tbl <- table(unlist(.SD))
.(names(tbl)[which.max(tbl)], max(tbl))}, by = Name]
Another approach would be to loop over all unique fruits as follows
fruits_unique <- unique(unlist(dat[-1]))
occurence <- sapply(fruits_unique, function(x) rowSums(dat[,-1] == x))
# Using this data to create the resulting columns
ind <- apply(occurence,1,which.max)
dat$Names <- fruits_unique[ind]
dat$Count <- occurence[cbind(seq_along(ind), ind)]
Result:
Name Mon Tue Wed Thu Fri Sat Sun Names Count
1 John Apple Orange Apple Banana Apple Apple Orange Apple 4
2 Ricky Banana Apple Banana Banana Banana Banana Apple Banana 5
3 Alex Apple Orange Orange Apple Apple Orange Orange Orange 4
4 Robbin Apple Apple Apple Apple Apple Banana Banana Apple 5
5 Sunny Banana Banana Apple Apple Apple Banana Banana Banana 4
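To make the row-wise approach easy to try, here is a self-contained base R sketch. The data frame is rebuilt from the table in the question and the output columns use the question's names Max_Acc and Count; the helper names days and counts are illustrative. One caveat: which.max() returns the first maximum, so a tie resolves to whichever fruit table() lists first.

```r
df1 <- data.frame(
  Name = c("John", "Ricky", "Alex", "Robbin", "Sunny"),
  Mon = c("Apple", "Banana", "Apple", "Apple", "Banana"),
  Tue = c("Orange", "Apple", "Orange", "Apple", "Banana"),
  Wed = c("Apple", "Banana", "Orange", "Apple", "Apple"),
  Thu = c("Banana", "Banana", "Apple", "Apple", "Apple"),
  Fri = c("Apple", "Banana", "Apple", "Apple", "Apple"),
  Sat = c("Apple", "Banana", "Orange", "Banana", "Banana"),
  Sun = c("Orange", "Apple", "Orange", "Banana", "Banana"),
  stringsAsFactors = FALSE
)

# Every column except Name holds a day's fruit
days <- setdiff(names(df1), "Name")

# One frequency table per row
counts <- lapply(seq_len(nrow(df1)), function(i) table(unlist(df1[i, days])))

# Name and count of the most frequent fruit in each row
df1$Max_Acc <- vapply(counts, function(tb) names(tb)[which.max(tb)], character(1))
df1$Count   <- vapply(counts, function(tb) as.integer(max(tb)), integer(1))
df1
```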
