Collapsing group of strings into one string using an if statement within a for loop in R - r

I have a dataframe with a column "Food."
dataframe <- data.frame(Color = c("red","red","red","red","red","blue","blue","blue","blue","blue","green","green","green","green","green","orange","orange","orange","orange","orange"),
Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter","kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"), Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit."
I've tried making a vector of the fruit names. Then I go through each row in the dataframe and where the string matches the fruit, I want to replace the fruit name with "fruit."
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?

With dplyr:
library(dplyr)
dataframe %>%
mutate(Food=as.character(Food),
Food=ifelse(Food%in%fruit_list,"Fruit",Food))#can change to fruit
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4

Base R only:
dataframe$Food <-
sapply(dataframe$Food,
function(x,fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x) ),
fruit_list = fruit_list )

You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
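The as.character() call is what avoids the NA values. In R versions before 4.0, data.frame() turns string columns into factors by default, and assigning a value that is not an existing factor level produces NA. Below is a minimal sketch of the original loop made to work under that assumption; the four-row data frame is a shortened stand-in for the question's data, and stringsAsFactors = TRUE is set explicitly to reproduce the old default.

```r
# Shortened stand-in for the question's data; stringsAsFactors = TRUE
# reproduces the pre-4.0 default that turns Food into a factor.
dataframe <- data.frame(
  Color = c("red", "red", "blue", "green"),
  Food  = c("banana", "potato", "kiwi", "kale"),
  stringsAsFactors = TRUE
)
fruit_list <- c("banana", "apple", "orange", "strawberry", "kiwi", "blueberry", "lemon")

# Convert the factor to character once, before the loop.
dataframe$Food <- as.character(dataframe$Food)

for (r in seq_len(nrow(dataframe))) {
  # %in% tests exact membership, so the inner loop and grep() are not needed.
  if (dataframe$Food[r] %in% fruit_list) {
    dataframe$Food[r] <- "fruit"
  }
}

dataframe$Food
```

Using %in% instead of grep() also avoids partial matches: grep("apple", "pineapple") would count as a hit.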

You can do this in one line with the data.table package:
> library(data.table)
> setDT(dataframe)[,Food:=ifelse(Food %in% fruit_list,"fruit",as.character(Food))][]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4

Related

Combining two dataframes based on presence of values of various different columns

I have a question about creating new columns in my dataset by checking whether a value is present in one of the columns of my dataframe, and assigning the columns of a different dataframe based on that presence. As this description is quite vague, see the example dataset below:
newDf <- data.frame(c("Juice 1", "Juice 2", "Juice 3", "Juice 4","Juice 5"),
c("Banana", "Banana", "Orange", "Pear", "Apple"),
c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
c("Kale", NA, "Cherry", NA, "Peach"))
colnames(newDf) <- c("Juice", "Fruit 1", "Fruit 2", "Fruit 3")
dfChecklist <- data.frame(c("Banana", "Cherry"),
c("100", "80"),
c("5", "3"),
c("4", "5"))
colnames(dfChecklist) <- c("FruitCheck", "NutritionalValue", "Deliciousness", "Difficulty")
This gives the following dataframes:
Juice Fruit 1 Fruit 2 Fruit 3
1 Juice 1 Banana Blueberry Kale
2 Juice 2 Banana Mango <NA>
3 Juice 3 Orange Rasberry Cherry
4 Juice 4 Pear Spinach <NA>
5 Juice 5 Apple Pear Peach
FruitCheck NutritionalValue Deliciousness Difficulty
1 Banana 100 5 4
2 Cherry 80 3 5
I want to combine the two and make the result to be like this:
Juice Fruit 1 Fruit 2 Fruit 3 FruitCheck NutritionalValue Deliciousness Difficulty
1 Juice 1 Banana Blueberry Kale Banana 100 5 4
2 Juice 2 Banana Mango <NA> Banana 100 5 4
3 Juice 3 Orange Rasberry Cherry Cherry 80 3 5
4 Juice 4 Pear Spinach <NA> <NA> <NA> <NA> <NA>
5 Juice 5 Apple Pear Peach <NA> <NA> <NA> <NA>
The dataset above is only an example; my own dataset is much larger and more complex.
Thanks so much in advance for your help!
First, find the first match for each row:
tmp=unlist(
apply(
newDf[,grepl("Fruit",colnames(newDf))],
1,
function(x){
y=as.vector(x)
y=y[which.min(match(y,dfChecklist$FruitCheck))]
ifelse(length(y)==0,NA,y)
}
)
)
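To see what the function inside apply() does, here is a small sketch that runs the same match() and which.min() steps on two single rows. The vectors checklist, row3 and row4 are illustrative copies of values from the question, not new data.

```r
checklist <- c("Banana", "Cherry")

# Juice 3: only the third fruit is on the checklist.
row3 <- c("Orange", "Rasberry", "Cherry")
pos3 <- match(row3, checklist)   # position in the checklist, NA if absent
hit3 <- row3[which.min(pos3)]    # which.min() skips NA values

# Juice 4: nothing matches, so which.min() returns integer(0).
row4 <- c("Pear", "Spinach", NA)
hit4 <- row4[which.min(match(row4, checklist))]

hit3
length(hit4)
```

Note that which.min() picks the fruit that comes earliest in the checklist, which is not necessarily the first matching fruit in the row. The zero-length result for rows without a match is why the original function ends with the length(y)==0 check.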
Add this to your original data frame, then do a simple merge:
newDf$FruitCheck=tmp
merge(
newDf,
dfChecklist,
by="FruitCheck",
all.x=TRUE
)
resulting in
  FruitCheck   Juice Fruit 1   Fruit 2 Fruit 3 NutritionalValue Deliciousness Difficulty
1     Banana Juice 1  Banana Blueberry    Kale              100             5          4
2     Banana Juice 2  Banana     Mango    <NA>              100             5          4
3     Cherry Juice 3  Orange  Rasberry  Cherry               80             3          5
4       <NA> Juice 4    Pear   Spinach    <NA>             <NA>          <NA>       <NA>
5       <NA> Juice 5   Apple      Pear   Peach             <NA>          <NA>       <NA>

How does ggplot2 evaluate factor levels?

Consider this example dataframe:
library(tidyverse)
df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
"Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
category = c(rep("Fruit", 5), rep("Vegetable", 5)),
n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))
First, I attribute factor levels to item by n. I also add a variable item_fct in order to display the factor levels of item in the output table for a better understanding of my problem:
df %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 8
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 7
9 Carrot Vegetable 31 6
10 Eggplant Vegetable 23 5
This output table and the factor levels make sense to me: item is assigned the levels 1-10 in ascending order of n.
But if I group it by category, I get the following output:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item))
# A tibble: 10 x 4
# Groups: category [2]
item category n item_fct
<fct> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 5
7 Cabbage Vegetable 70 4
8 Broccoli Vegetable 33 3
9 Carrot Vegetable 31 2
10 Eggplant Vegetable 23 1
Here, the factor levels are rather confusing to me. The items, depending on the group, share the same factor levels (1-5). I would still have expected levels from 1-10, but ordered by n depending on the respective group. Like this:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
My Question:
Ultimately, it's hard for me to understand why ggplot2 produces the following plot with this code:
df %>%
group_by(category) %>%
mutate(item = fct_reorder(item, n),
item_fct = as.numeric(item)) %>%
ggplot(aes(n, item, fill = category)) +
geom_col()
If the items are factored from 1-5 in their respective groups and thus share the same overall factor levels, then why does ggplot2 display the bars by the grouped variable?
This plot is actually my desired outcome, but I don't understand the logic behind it. Considering the factor levels, I would expect bars of the same factor level to be next to each other (basically starting with "Eggplant", then "Orange", then "Carrot", then "Blueberry" etc.).
I would've assumed that the following table would produce the plot, because here the factor levels are ordered in the same way the bars are ordered in the plot:
# A tibble: 10 x 4
item category n item_fct
<chr> <chr> <dbl> <dbl>
1 Banana Fruit 57 5
2 Ananas Fruit 19 4
3 Apple Fruit 14 3
4 Blueberry Fruit 11 2
5 Orange Fruit 8 1
6 Spinach Vegetable 318 10
7 Cabbage Vegetable 70 9
8 Broccoli Vegetable 33 8
9 Carrot Vegetable 31 7
10 Eggplant Vegetable 23 6
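One way to investigate this is to compare item_fct, which is computed inside each group, with the levels of the combined item column that ggplot2 actually receives. The sketch below assumes dplyr 1.0 or later and forcats are installed; the name grouped is introduced here for illustration only.

```r
library(dplyr)
library(forcats)

df <- tibble(item = c("Banana", "Ananas", "Apple", "Blueberry", "Orange",
                      "Spinach", "Cabbage", "Broccoli", "Carrot", "Eggplant"),
             category = c(rep("Fruit", 5), rep("Vegetable", 5)),
             n = c(57, 19, 14, 11, 8, 318, 70, 33, 31, 23))

grouped <- df %>%
  group_by(category) %>%
  mutate(item = fct_reorder(item, n),
         item_fct = as.numeric(item)) %>%   # evaluated per group
  ungroup()

# Levels and integer codes of the combined factor, after the groups
# have been put back together.
levels(grouped$item)
as.integer(grouped$item)

# Per-group codes stored during mutate(), for comparison.
grouped$item_fct
```

If the combined factor has ten levels rather than five, then item_fct only records the codes each item had inside its own group, and the axis order in the plot follows levels(grouped$item) instead.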

How would I be able to track change based on time? Identify when the condition first appeared - R

If I have a data set with different cases over time, what would be the best way to track change?
Let's say I tracked whether the color of walls changed over time; think weather damage. I was never consistent about the day or time at which I started tracking or took a measurement. Walls appear in multiple rows because each row represents the measurement at a specific time.
What would be the best way for me to identify which walls changed at some point?
Wall Color Date
A yellow 2019-08
A white 2019-02
A yellow 2019-05
A yellow 2019-05
A white 2019-04
A white 2019-03
A yellow 2019-08
B yellow 2019-09
B white 2019-05
B yellow 2020-09
B yellow 2020-05
c white 2019-05
c white 2018-01
c white 2020-06
c white 2019-02
c white 2020-03
c yellow 2020-09
c white 2020-06
c yellow 2020-05
c white 2019-01
c yellow 2020-1
You can try:
library(dplyr)
df %>%
tidyr::separate(Date, c('Year', 'Month'), convert = TRUE) %>%
arrange(Wall, Year, Month) %>%
group_by(Wall) %>%
mutate(change = Color != lag(Color, default = first(Color)),
change = row_number() == which(change)[1])
# Wall Color Year Month change
# <chr> <chr> <int> <int> <lgl>
# 1 A white 2019 2 FALSE
# 2 A white 2019 3 FALSE
# 3 A white 2019 4 FALSE
# 4 A yellow 2019 5 TRUE
# 5 A yellow 2019 5 FALSE
# 6 A yellow 2019 8 FALSE
# 7 A yellow 2019 8 FALSE
# 8 B white 2019 5 FALSE
# 9 B yellow 2019 9 TRUE
#10 B yellow 2020 5 FALSE
# … with 11 more rows
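The two mutate() steps can be followed by hand on a single wall. This is a base R sketch of the same logic, using the sorted colors of wall A from the output above; the variable names are made up for the illustration.

```r
# Colors for one wall, already sorted by date (wall A above).
color <- c("white", "white", "white", "yellow", "yellow", "yellow", "yellow")

# Base R counterpart of lag(Color, default = first(Color)):
# shift the vector by one and pad the front with the first value.
previous <- c(color[1], head(color, -1))

# TRUE wherever the color differs from the previous measurement.
changed <- color != previous

# Keep only the first TRUE, mirroring row_number() == which(change)[1].
first_change <- seq_along(color) == which(changed)[1]

first_change
```

For a wall whose color never changes, which(changed)[1] is NA, so the comparison yields NA for every row of that wall rather than FALSE.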
I haven't figured out what is desired, but this is a bit of R code that shows how to do some of the operations that I think you do not yet have a handle on:
> dat <- scan( text="A yellow 2019-08
+ A white 2019-02
+ A yellow 2019-05
+ A yellow 2019-05
+ A white 2019-04
+ A white 2019-03
+ A yellow 2019-08
+ B yellow 2019-09
+ B white 2019-05
+ B yellow 2020-09
+ B yellow 2020-05
+ c white 2019-05
+ c white 2018-01
+ c white 2020-06
+ c white 2019-02
+ c white 2020-03
+ c yellow 2020-09
+ c white 2020-06
+ c yellow 2020-05
+ c white 2019-01
+ c yellow 2020-1", what=list(grp="",color="",mon=""))
Read 21 records
> dat=data.frame(dat)
> dat
grp color mon
1 A yellow 2019-08
2 A white 2019-02
3 A yellow 2019-05
4 A yellow 2019-05
5 A white 2019-04
6 A white 2019-03
7 A yellow 2019-08
8 B yellow 2019-09
9 B white 2019-05
10 B yellow 2020-09
11 B yellow 2020-05
12 c white 2019-05
13 c white 2018-01
14 c white 2020-06
15 c white 2019-02
16 c white 2020-03
17 c yellow 2020-09
18 c white 2020-06
19 c yellow 2020-05
20 c white 2019-01
21 c yellow 2020-1
> library(zoo)
> dat$mon<-as.yearmon(dat$mon)
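> dat$color<-factor(dat$color)  # as.numeric() needs factor codes; R >= 4.0 keeps these columns as character
> dat$grp<-factor(dat$grp)
> plot(as.numeric(dat$color)~dat$mon)
> plot(as.numeric(dat$color)~dat$mon, type="b")
> png();
> plot(jitter(as.numeric(dat$color)) ~dat$mon, type="b", col=as.numeric(dat$grp));
> dev.off()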
I think it shows that you have not yet described the problem in sufficient detail to allow a coherent response.

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
df1 %>%
inner_join(df2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But because of the semicolons, the join misses those rows. The end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
I called the first table fruit and the other stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge:
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
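The key step in the base R version is repeating each row of DF2 once per ID it contains. A small sketch of that indexing on its own, with a made-up three-element ids vector standing in for DF2$ID:

```r
ids <- c("1", "1;2", "3")      # stand-in for DF2$ID, stored as character

parts <- strsplit(ids, ";")    # one character vector per original row

# Row numbers repeated once per ID found in that row.
row_index <- rep(seq_along(ids), lengths(parts))

# All IDs flattened in the same order as row_index.
flat_ids <- unlist(parts)

row_index
flat_ids
```

Indexing the data frame with row_index duplicates the rows that held several IDs, and flat_ids then supplies one ID per duplicated row. strsplit() requires a character vector, so a factor ID column would need as.character() first.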

Use count() in more than one column and order results

I have a dataframe called table like this:
a m g c1 c2 c3 c4
1 2015 5 13 bread wine <NA> <NA>
2 2015 8 30 wine eggs rice cake
3 2015 1 21 wine rice eggs <NA>
...
I want to count the elements in columns c1 to c4 and order them by frequency.
I tried to use:
library(plyr)
c<-count(table,"c1")
But I don't know how to count across more than one column.
Then I want to use arrange(c,desc(freq)) to order them, but when I try this with one column, the NA value is always on top. I only want the top 3 elements, like this:
c freq
1 wine 3
2 eggs 2
3 rice 2
Can someone please suggest a solution for this? Thanks.
Use melt (from reshape2) and table:
library(reshape2)  # provides melt()
df1 <- read.table(text="a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header=TRUE, stringsAsFactors=FALSE)
c_col <- melt(as.matrix(df1[,4:7]))
sort(table(c_col$value),decreasing=TRUE)
wine eggs rice bread cake
3 2 2 1 1
With qdapTools, using the example dataframe provided (named table):
library(qdapTools)
counts <- data.frame(count=sort(colSums(mtabulate(table[,4:7])), decreasing=TRUE))
subset(counts,rownames(counts)!='<NA>')[1:3,1,drop=FALSE] #remove <NA>, select top 3 elements
# count
# wine 3
# eggs 2
# rice 2
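If no extra package is wanted, the same count can be sketched in base R by flattening the four columns with unlist(). This is an alternative to the answers above, not a restatement of them; the names counts and top3 are made up for the example.

```r
df1 <- read.table(text = "a m g c1 c2 c3 c4
2015 5 13 bread wine NA NA
2015 8 30 wine eggs rice cake
2015 1 21 wine rice eggs NA", header = TRUE, stringsAsFactors = FALSE)

# unlist() flattens the four columns into one character vector;
# table() drops NA values by default.
counts <- sort(table(unlist(df1[, c("c1", "c2", "c3", "c4")])), decreasing = TRUE)

# Keep the three most frequent items.
top3 <- counts[1:3]
top3
```

Because table() ignores NA unless useNA is set, there is no NA row to filter out before taking the top three.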
