Replace zeros with NA conditionally in R

In the df below there are only 2 farms, so each fruit appears twice. I'd like to replace zeros with NA, as follows:
df[df==0] <- NA
However, whenever either farm has a nonzero value for a fruit in a given year, such as pear in y2019 with values c(0, 7), I'd like to keep the 0. A dplyr solution would be great.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(0,0,3,12,0,7,4,6),
'y2018' = c(5,3,0,0,8,2,0,0),'y2017' = c(4,5,7,15,0,0,0,0) )
> df
fruit farm y2019 y2018 y2017
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 7 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0

library(dplyr)
df %>%
group_by(fruit) %>%
mutate_at(vars(starts_with("y20")), ~ if (any(. != 0)) . else NA_real_) %>%
ungroup()
# # A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
# 1 apple 1 NA 5 4
# 2 apple 2 NA 3 5
# 3 peach 1 3 NA 7
# 4 peach 2 12 NA 15
# 5 pear 1 0 8 NA
# 6 pear 2 7 2 NA
# 7 lime 1 4 NA NA
# 8 lime 2 6 NA NA
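
For readers without dplyr, the same rule (blank a year column only where every farm of that fruit reports 0) can be sketched in base R with ave(). This is an illustrative alternative, not the answer's method; the names year_cols, masked and any_nonzero are made up for the example.

```r
df <- data.frame(
  fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
  farm  = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
  y2019 = c(0, 0, 3, 12, 0, 7, 4, 6),
  y2018 = c(5, 3, 0, 0, 8, 2, 0, 0),
  y2017 = c(4, 5, 7, 15, 0, 0, 0, 0)
)

# Year columns to process (hypothetical helper name)
year_cols <- grep("^y20", names(df), value = TRUE)

masked <- df
masked[year_cols] <- lapply(df[year_cols], function(x) {
  # TRUE for every row whose fruit has at least one nonzero value in this column
  any_nonzero <- ave(x != 0, df$fruit, FUN = any)
  # Only groups that are entirely zero become NA
  x[!any_nonzero] <- NA
  x
})
masked
```

The pear rows keep their 0 in y2019 because the other pear row holds 7.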

Related

How can I split sentence into new variables in R (with zero-one encoding)?

I have a data like below:
V1 V2
1 orange, apple
2 orange, lemon
3 lemon, apple
4 orange, lemon, apple
5 lemon
6 apple
7 orange
8 lemon, apple
I want to split the V2 variable like this:
I have three categories of the V2 column: "orange", "lemon", "apple"
for each of the categories I want to create a new column (variable) that will inform about whether such a name appeared in V2 (0,1)
I tried this
df %>% separate(V2, into = c("orange", "lemon", "apple"))
.. and I got this result, but it's not what I expect.
V1 orange lemon apple
1 1 orange apple <NA>
2 2 orange lemon <NA>
3 3 lemon apple <NA>
4 4 orange lemon apple
5 5 lemon <NA> <NA>
6 6 apple <NA> <NA>
7 7 orange <NA> <NA>
8 8 lemon apple <NA>
The result I mean is below.
V1 orange lemon apple
1 1 0 1
2 1 1 0
3 0 1 1
4 1 1 1
5 0 1 0
6 0 0 1
7 1 0 0
8 0 1 1
You could try pivoting:
library(dplyr)
library(tidyr)
df |>
separate_rows(V2, sep = ", ") |>
mutate(ind = 1) |>
pivot_wider(names_from = V2,
values_from = ind,
values_fill = 0)
Output is:
# A tibble: 8 × 4
V1 orange apple lemon
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1 0 1
3 3 0 1 1
4 4 1 1 1
5 5 0 0 1
6 6 0 1 0
7 7 1 0 0
8 8 0 1 1
data I used:
V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon",
"lemon, apple", "orange, lemon, apple",
"lemon", "apple", "orange",
"lemon, apple")
df <- tibble(V1, V2)
We may use dummy_cols from fastDummies:
library(stringr)
library(fastDummies)
library(dplyr)
dummy_cols(df, "V2", split = ",\\s+", remove_selected_columns = TRUE) %>%
rename_with(~ str_remove(.x, '.*_'))
Output:
# A tibble: 8 × 4
V1 apple lemon orange
<int> <int> <int> <int>
1 1 1 0 1
2 2 0 1 1
3 3 1 1 0
4 4 1 1 1
5 5 0 1 0
6 6 1 0 0
7 7 0 0 1
8 8 1 1 0
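
Both answers above rely on add-on packages. As a rough sketch, the same 0/1 indicator columns can also be built in base R by splitting each string and testing membership. The names parts, categories and indicators are illustrative, and the category list is assumed to be known in advance.

```r
V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon",
        "lemon, apple", "orange, lemon, apple",
        "lemon", "apple", "orange",
        "lemon, apple")
df <- data.frame(V1, V2, stringsAsFactors = FALSE)

# Split every row of V2 into its individual fruit names
parts <- strsplit(df$V2, ",\\s*")

# Categories assumed known up front (taken from the question)
categories <- c("orange", "lemon", "apple")

# One integer column per category: 1 if the row mentions it, else 0
indicators <- sapply(categories, function(f) {
  as.integer(vapply(parts, function(p) f %in% p, logical(1)))
})

res <- cbind(df["V1"], indicators)
res
```

Row 4 ("orange, lemon, apple") gets a 1 in all three columns.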

dplyr to calculate fraction by group

There are only 2 farms but many fruits. I'm trying to see which farm has performed better over 3 years, where performance is simply farm_i / (farm1 + farm2); for example, for fruit == peach in y2019, farm 1's share was 20% vs. 80% for farm 2.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(0,0,3,12,0,7,4,6),
'y2018' = c(5,3,0,0,8,2,0,0),'y2017' = c(4,5,7,15,0,0,0,0) )
> df
fruit farm y2019 y2018 y2017
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 7 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
>
desired output:
out
fruit farm y2019 y2018 y2017
1 apple 1 0.0 0.625 0.444444
2 apple 2 0.0 0.375 0.555556
3 peach 1 0.2 0.000 0.318182
4 peach 2 0.8 0.000 0.681818
5 pear 1 0.0 0.800 0.000000
6 pear 2 1.0 0.200 0.000000
7 lime 1 0.4 0.000 0.000000
8 lime 2 0.6 0.000 0.000000
>
This is as far as I could get:
df %>%
group_by(fruit) %>%
summarise(across(where(is.numeric), sum))
We can group by 'fruit' and mutate across the columns that start with 'y', dividing each element by the group sum of that column; if all values in the group are 0, we return 0.
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(starts_with('y'), ~ if(all(. == 0)) 0 else ./sum(.)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
NOTE: This uses only the dplyr package and is done in a single step.
Or another option is adorn_percentages from janitor:
library(janitor)
library(purrr)
df %>%
group_split(fruit) %>%
map_dfr(adorn_percentages, denominator = "col") %>%
as_tibble
Or using data.table
library(data.table)
setDT(df)[, (3:5) := lapply(.SD, function(x) if(all(x == 0)) 0
else x/sum(x, na.rm = TRUE)), .SDcols = 3:5, by = fruit][]
Or using base R
grpSums <- rowsum(df[3:5], df$fruit)
df[3:5] <- df[3:5]/grpSums[match(df$fruit, row.names(grpSums)),]
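
Note that the rowsum() division above yields NaN wherever a fruit's column total is 0 (0/0). A hedged base R variant that returns 0 in that case, using ave() for the group totals, might look like the following; shares and year_cols are illustrative names.

```r
df <- data.frame(
  fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
  farm  = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
  y2019 = c(0, 0, 3, 12, 0, 7, 4, 6),
  y2018 = c(5, 3, 0, 0, 8, 2, 0, 0),
  y2017 = c(4, 5, 7, 15, 0, 0, 0, 0)
)

year_cols <- c("y2019", "y2018", "y2017")

shares <- df
shares[year_cols] <- lapply(df[year_cols], function(x) {
  # Per-row total of this column within the row's fruit
  total <- ave(x, df$fruit, FUN = sum)
  # Avoid 0/0: an all-zero group gets a share of 0
  ifelse(total == 0, 0, x / total)
})
shares
```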
We can use prop.table to calculate the proportions for each fruit.
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), prop.table),
#to replace `NaN` with 0 (an all-zero group gives 0/0)
across(where(is.numeric), ~ tidyr::replace_na(.x, 0)))
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
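
Why the second across() step is needed can be seen on plain vectors. This small base R sketch (the name all_zero is illustrative) shows prop.table() on a normal group and on an all-zero group:

```r
# Shares within one group: each value divided by the group total
prop.table(c(3, 12))   # 0.2 0.8

# An all-zero group divides 0 by 0, which R reports as NaN
all_zero <- prop.table(c(0, 0))
all_zero

# is.na() is TRUE for NaN as well, so NA-replacement logic catches it
is.na(all_zero)
all_zero[is.na(all_zero)] <- 0
all_zero
```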

Replace NA conditionally

This is an augmented version of my own question, as I could not explain it clearly in the comments.
There are only 2 farms, so each fruit appears twice in the df below. I'd like to replace NA with 0 only if the other farm has a value for that fruit in that year; for example, for pear in y2019 with values c(NA, 7), I'd like to output c(0, 7) instead.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(NA,NA,3,12,NA,7,4,6),
'y2018' = c(5,3,NA,NA,8,2,NA,NA),'y2017' = c(4,5,7,15,NA,NA,1,NA))
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 NA 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA NA
This is close:
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if (any(is.na(.))) 0 else .)) %>%
ungroup()
but:
7 gets wiped out in pear, producing c(0, 0).
I'd like to leave NA in place when both farms are NA.
#A tibble: 8 x 5
fruit farm y2019 y2018 y2017
<chr> <fct> <dbl> <dbl> <dbl>
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 0 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
desired outcome:
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 0 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA 0
You can try:
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.)))
replace(., is.na(.), 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
So we replace NA with 0 only if the group contains at least one non-NA value.
We can use replace_na from tidyr: if the group has any non-NA elements, the NAs are replaced with 0; otherwise the values are returned unchanged.
library(dplyr)
library(tidyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.))) replace_na(., 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
Or another option without if/else: after grouping by 'fruit', pass two combined logical conditions to replace.
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric),
~ replace(., sum(!is.na(.)) > 0 & is.na(.), 0)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
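
The same condition (fill an NA only when the fruit has at least one non-NA value in that column) can be sketched in base R with tapply(). This is an illustrative alternative to the dplyr answers; filled, all_missing and fill_here are made-up names.

```r
df <- data.frame(
  fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
  farm  = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
  y2019 = c(NA, NA, 3, 12, NA, 7, 4, 6),
  y2018 = c(5, 3, NA, NA, 8, 2, NA, NA),
  y2017 = c(4, 5, 7, 15, NA, NA, 1, NA)
)

year_cols <- grep("^y", names(df), value = TRUE)

filled <- df
filled[year_cols] <- lapply(df[year_cols], function(x) {
  # Named logical per fruit: TRUE when every farm is NA in this column
  all_missing <- tapply(is.na(x), df$fruit, all)
  # Fill only NAs that belong to a fruit with at least one real value
  fill_here <- is.na(x) & as.vector(!all_missing[as.character(df$fruit)])
  x[fill_here] <- 0
  x
})
filled
```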

Ranking observations within groups that are tied

I'm trying to rank certain groups by their counts using dense_rank, but it doesn't assign distinct ranks to groups that are tied. And any ranking function I try that has some sort of ties.method doesn't give me the rankings in consecutive 1, 2, 3 order. Example:
library(dplyr)
id <- c(rep(1, 8),
rep(2, 8))
fruit <- c(rep('apple', 4), rep('orange', 1), rep('banana', 2), 'orange',
rep('orange', 4), rep('banana', 1), rep('apple', 2), 'banana')
df <- data.frame(id, fruit, stringsAsFactors = FALSE)
df2 <- df %>%
mutate(counter = 1) %>%
group_by(id, fruit) %>%
mutate(fruitCnt = sum(counter)) %>%
ungroup() %>%
group_by(id) %>%
mutate(fruitCntRank = dense_rank(desc(fruitCnt))) %>%
select(id, fruit, fruitCntRank)
df2
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 2
7 1 banana 2
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 2
15 2 apple 2
16 2 banana 2
It doesn't matter which of orange or banana is ranked 3, and it doesn't even need to be consistent. I just need the groups to be ranked 1, 2, 3.
Desired result:
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 3
7 1 banana 3
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 3
15 2 apple 3
16 2 banana 2
We can add a count for each id and fruit combination, arrange the rows in descending order of count, and get the rank using match.
library(dplyr)
df %>%
add_count(id, fruit) %>%
arrange(id, desc(n)) %>%
group_by(id) %>%
mutate(n = match(fruit, unique(fruit)))
#Another option with cumsum and duplicated
#mutate(n = cumsum(!duplicated(fruit)))
# id fruit n
# <dbl> <chr> <int>
# 1 1 apple 1
# 2 1 apple 1
# 3 1 apple 1
# 4 1 apple 1
# 5 1 orange 2
# 6 1 banana 3
# 7 1 banana 3
# 8 1 orange 2
# 9 2 orange 1
#10 2 orange 1
#11 2 orange 1
#12 2 orange 1
#13 2 banana 2
#14 2 apple 3
#15 2 apple 3
#16 2 banana 2
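
The count-then-match idea can also be sketched in base R without reordering the data frame. The sketch relies on order() leaving unresolved ties in their original order, so tied fruits are ranked by first appearance; the column names n and rank are illustrative.

```r
id <- c(rep(1, 8), rep(2, 8))
fruit <- c(rep("apple", 4), rep("orange", 1), rep("banana", 2), "orange",
           rep("orange", 4), rep("banana", 1), rep("apple", 2), "banana")
df <- data.frame(id, fruit, stringsAsFactors = FALSE)

# Number of rows for each id/fruit combination, repeated on every row
df$n <- ave(seq_along(df$fruit), df$id, df$fruit, FUN = length)

# Within each id: sort rows by descending count, then rank each fruit by
# the position of its first appearance in that sorted order
df$rank <- ave(seq_len(nrow(df)), df$id, FUN = function(i) {
  ord <- i[order(-df$n[i])]
  match(df$fruit[i], unique(df$fruit[ord]))
})
df
```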

Add missing values based on the value of a column in R

This is my sample dataset:
vector1 <-
data.frame(
"name" = "a",
"age" = 10,
"fruit" = c("orange", "cherry", "apple"),
"count" = c(1, 1, 1),
"tag" = c(1, 1, 2)
)
vector2 <-
data.frame(
"name" = "b",
"age" = 33,
"fruit" = c("apple", "mango"),
"count" = c(1, 1),
"tag" = c(2, 2)
)
vector3 <-
data.frame(
"name" = "c",
"age" = 58,
"fruit" = c("cherry", "apple"),
"count" = c(1, 1),
"tag" = c(1, 1)
)
list <- list(vector1, vector2, vector3)
print(list)
This is my test:
default <- c("cherry",
"orange",
"apple",
"mango")
for (num in 1:length(list)) {
#print(list[[num]])
list[[num]] <- rbind(
list[[num]],
data.frame(
"name" = list[[num]]$name,
"age" = list[[num]]$age,
"fruit" = setdiff(default, list[[num]]$fruit),#add missed value
"count" = 0,
"tag" = 1 #not found solutions
)
)
print(paste0("--------------", num, "--------"))
print(list)
}
#print(list)
I'm trying to find which fruits are missing from each data frame, where the check depends on the value of tag. For example, the first data frame has tags 1 and 2. If tag 1 lacks any of the default fruits, such as apple and mango, the missing default fruit should be added to the data frame with a count of 0. The expected format looks like the following:
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 apple 0 1
6 a 10 mango 0 2
7 a 10 orange 0 2
8 a 10 cherry 0 2
When I step through the loop, I also find that the first iteration adds mango 3 times, and I can't work out why it doesn't add the missing value just once. The overall output looks like the following:
[[1]]
name age fruit count tag
1 a 10 orange 1 1
2 a 10 cherry 1 1
3 a 10 apple 1 2
4 a 10 mango 0 1
5 a 10 mango 0 1
6 a 10 mango 0 1
[[2]]
name age fruit count tag
1 b 33 apple 1 2
2 b 33 mango 1 2
3 b 33 cherry 0 1
4 b 33 orange 0 1
[[3]]
name age fruit count tag
1 c 58 cherry 1 1
2 c 58 apple 1 1
3 c 58 orange 0 1
4 c 58 mango 0 1
Can anyone help me with a simple method or another approach? Should I use the sqldf function to add the 0 values? Is there a simple way to solve my problem?
Consider base R methods (lapply, expand.grid, transform, rbind, aggregate) that append all possible fruit and tag options to each data frame and keep the max counts.
new_list <- lapply(list, function(df) {
fruit_tag_df <- transform(expand.grid(fruit=c("apple", "cherry", "mango", "orange"),
tag=c(1,2)),
name = df$name[1],
age = df$age[1],
count = 0)
aggregate(.~name + age + fruit + tag, rbind(df, fruit_tag_df), FUN=max)
})
Output
new_list
# [[1]]
# name age fruit tag count
# 1 a 10 apple 1 0
# 2 a 10 cherry 1 1
# 3 a 10 orange 1 1
# 4 a 10 mango 1 0
# 5 a 10 apple 2 1
# 6 a 10 cherry 2 0
# 7 a 10 orange 2 0
# 8 a 10 mango 2 0
# [[2]]
# name age fruit tag count
# 1 b 33 apple 1 0
# 2 b 33 mango 1 0
# 3 b 33 cherry 1 0
# 4 b 33 orange 1 0
# 5 b 33 apple 2 1
# 6 b 33 mango 2 1
# 7 b 33 cherry 2 0
# 8 b 33 orange 2 0
# [[3]]
# name age fruit tag count
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 1 0
# 4 c 58 orange 1 0
# 5 c 58 apple 2 0
# 6 c 58 cherry 2 0
# 7 c 58 mango 2 0
# 8 c 58 orange 2 0
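
As for why the loop in the question adds mango three times: data.frame() recycles shorter arguments to the length of the longest one. In the first data frame, name and age each have length 3, so the single missing fruit is repeated three times. A minimal sketch, using the illustrative names existing, recycled and fixed:

```r
existing <- data.frame(name = "a", age = 10,
                       fruit = c("orange", "cherry", "apple"),
                       stringsAsFactors = FALSE)
default <- c("cherry", "orange", "apple", "mango")

# Only one default fruit is absent from the first data frame
missing_fruit <- setdiff(default, existing$fruit)
missing_fruit

# existing$name has length 3, so the single fruit is recycled to 3 rows
recycled <- data.frame(name = existing$name, fruit = missing_fruit)
nrow(recycled)

# Taking just the first name yields one row per missing fruit
fixed <- data.frame(name = existing$name[1], fruit = missing_fruit)
nrow(fixed)
```

This is also why the answer above uses df$name[1] and df$age[1] rather than the full columns.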
The OP has requested to complete each data.frame in list so that all combinations of default fruit and tags 1:2 appear in the result, with count set to 0 for the additional rows. In the end, each data.frame should consist of at least 4 x 2 = 8 rows.
I want to propose two different approaches:
1. Use lapply() and the CJ() (cross join) function from data.table to return a list.
2. Combine the separate data.frames in list into one large data.table using rbindlist() and apply the required transformations to the whole data.table.
Using lapply() and CJ()
library(data.table)
lapply(list, function(x) setDT(x)[
CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
on = .(name, age, fruit, tag)][
is.na(count), count := 0][order(-count, tag)]
)
[[1]]
name age fruit count tag
1: a 10 cherry 1 1
2: a 10 orange 1 1
3: a 10 apple 1 2
4: a 10 apple 0 1
5: a 10 mango 0 1
6: a 10 cherry 0 2
7: a 10 mango 0 2
8: a 10 orange 0 2
[[2]]
name age fruit count tag
1: b 33 apple 1 2
2: b 33 mango 1 2
3: b 33 apple 0 1
4: b 33 cherry 0 1
5: b 33 mango 0 1
6: b 33 orange 0 1
7: b 33 cherry 0 2
8: b 33 orange 0 2
[[3]]
name age fruit count tag
1: c 58 apple 1 1
2: c 58 cherry 1 1
3: c 58 mango 0 1
4: c 58 orange 0 1
5: c 58 apple 0 2
6: c 58 cherry 0 2
7: c 58 mango 0 2
8: c 58 orange 0 2
Ordering by count and tag is not required but helps to compare the result with OP's expected output.
Creating one large data.table
Instead of a list of data.frames with identical structure we can use one large data.table where the origin of each row can be identified by an id column.
Indeed, the OP has asked other questions ("using lapply function and list in r" and "how to loop the dataframe using sqldf?") where he asked for help in handling a list of data.frames. G. Grothendieck had already suggested rbinding the rows together.
The rbindlist() function has the idcol parameter which identifies the origin of each row:
library(data.table)
rbindlist(list, idcol = "df")
df name age fruit count tag
1: 1 a 10 orange 1 1
2: 1 a 10 cherry 1 1
3: 1 a 10 apple 1 2
4: 2 b 33 apple 1 2
5: 2 b 33 mango 1 2
6: 3 c 58 cherry 1 1
7: 3 c 58 apple 1 1
Note that df contains the number of the source data.frame in list (or the names of the list elements if list is named).
Now, we can apply the above solution by grouping over df:
rbindlist(list, idcol = "df")[, .SD[
CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE),
on = .(name, age, fruit, tag)], by = df][
is.na(count), count := 0][order(df, -count, tag)]
df name age fruit count tag
1: 1 a 10 cherry 1 1
2: 1 a 10 orange 1 1
3: 1 a 10 apple 1 2
4: 1 a 10 apple 0 1
5: 1 a 10 mango 0 1
6: 1 a 10 cherry 0 2
7: 1 a 10 mango 0 2
8: 1 a 10 orange 0 2
9: 2 b 33 apple 1 2
10: 2 b 33 mango 1 2
11: 2 b 33 apple 0 1
12: 2 b 33 cherry 0 1
13: 2 b 33 mango 0 1
14: 2 b 33 orange 0 1
15: 2 b 33 cherry 0 2
16: 2 b 33 orange 0 2
17: 3 c 58 apple 1 1
18: 3 c 58 cherry 1 1
19: 3 c 58 mango 0 1
20: 3 c 58 orange 0 1
21: 3 c 58 apple 0 2
22: 3 c 58 cherry 0 2
23: 3 c 58 mango 0 2
24: 3 c 58 orange 0 2
df name age fruit count tag
A solution using dplyr and tidyr. We can use complete to expand the data frame and specify the fill values as 0 to count.
Notice that I changed your list name from list to fruit_list, because it is bad practice to give an object the same name as a base R function such as list(). Also notice that when I created the example data frames I set stringsAsFactors = FALSE because I don't want to create factor columns. Finally, I used lapply instead of a for loop to iterate over the list elements.
library(dplyr)
library(tidyr)
fruit_list2 <- lapply(fruit_list, function(x){
x2 <- x %>%
complete(name, age, fruit = default, tag = c(1, 2), fill = list(count = 0)) %>%
select(name, age, fruit, count, tag) %>%
arrange(tag, fruit) %>%
as.data.frame()
return(x2)
})
fruit_list2
# [[1]]
# name age fruit count tag
# 1 a 10 apple 0 1
# 2 a 10 cherry 1 1
# 3 a 10 mango 0 1
# 4 a 10 orange 1 1
# 5 a 10 apple 1 2
# 6 a 10 cherry 0 2
# 7 a 10 mango 0 2
# 8 a 10 orange 0 2
#
# [[2]]
# name age fruit count tag
# 1 b 33 apple 0 1
# 2 b 33 cherry 0 1
# 3 b 33 mango 0 1
# 4 b 33 orange 0 1
# 5 b 33 apple 1 2
# 6 b 33 cherry 0 2
# 7 b 33 mango 1 2
# 8 b 33 orange 0 2
#
# [[3]]
# name age fruit count tag
# 1 c 58 apple 1 1
# 2 c 58 cherry 1 1
# 3 c 58 mango 0 1
# 4 c 58 orange 0 1
# 5 c 58 apple 0 2
# 6 c 58 cherry 0 2
# 7 c 58 mango 0 2
# 8 c 58 orange 0 2
DATA
vector1 <-
data.frame(
"name" = "a",
"age" = 10,
"fruit" = c("orange", "cherry", "apple"),
"count" = c(1, 1, 1),
"tag" = c(1, 1, 2),
stringsAsFactors = FALSE
)
vector2 <-
data.frame(
"name" = "b",
"age" = 33,
"fruit" = c("apple", "mango"),
"count" = c(1, 1),
"tag" = c(2, 2),
stringsAsFactors = FALSE
)
vector3 <-
data.frame(
"name" = "c",
"age" = 58,
"fruit" = c("cherry", "apple"),
"count" = c(1, 1),
"tag" = c(1, 1),
stringsAsFactors = FALSE
)
fruit_list <- list(vector1, vector2, vector3)
default <- c("cherry", "orange", "apple", "mango")
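
The idea behind complete() (build every fruit/tag combination, left-join the observed rows, then fill the gaps with 0) can also be sketched in base R with expand.grid() and merge(). This is an illustration under the assumption that name and age are constant within each data frame; complete_one is a hypothetical helper name, and only the first two data frames are used to keep the example short.

```r
vector1 <- data.frame(name = "a", age = 10,
                      fruit = c("orange", "cherry", "apple"),
                      count = c(1, 1, 1), tag = c(1, 1, 2),
                      stringsAsFactors = FALSE)
vector2 <- data.frame(name = "b", age = 33,
                      fruit = c("apple", "mango"),
                      count = c(1, 1), tag = c(2, 2),
                      stringsAsFactors = FALSE)
fruit_list <- list(vector1, vector2)
default <- c("cherry", "orange", "apple", "mango")

# Hypothetical helper: complete one data frame to all fruit/tag combinations
complete_one <- function(x) {
  # Every combination of default fruit and tag
  grid <- expand.grid(fruit = default, tag = c(1, 2), stringsAsFactors = FALSE)
  grid$name <- x$name[1]
  grid$age  <- x$age[1]
  # Left join on the shared columns; unmatched combinations get count = NA
  out <- merge(grid, x, all.x = TRUE)
  out$count[is.na(out$count)] <- 0
  out[order(out$tag, out$fruit), c("name", "age", "fruit", "count", "tag")]
}

completed <- lapply(fruit_list, complete_one)
completed
```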
