This is an augmented version of my earlier question, as I could not explain it clearly in the comments.
There are only 2 farms, so each fruit appears twice in the df below. I'd like to replace NA with 0 only if the other farm has a value for that fruit. For example, pear at y2019 has values c(NA, 7), and I'd like to output c(0, 7) instead.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(NA,NA,3,12,NA,7,4,6),
'y2018' = c(5,3,NA,NA,8,2,NA,NA),'y2017' = c(4,5,7,15,NA,NA,1,NA))
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 NA 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA NA
This is close:
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if (any(is.na(.))) 0 else .)) %>%
ungroup()
But the 7 gets wiped out in pear, producing c(0, 0).
I'd also like to leave NA in place when both farms are NA.
#A tibble: 8 x 5
fruit farm y2019 y2018 y2017
<chr> <fct> <dbl> <dbl> <dbl>
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 0 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
Desired outcome:
> df
fruit farm y2019 y2018 y2017
1 apple 1 NA 5 4
2 apple 2 NA 3 5
3 peach 1 3 NA 7
4 peach 2 12 NA 15
5 pear 1 0 8 NA
6 pear 2 7 2 NA
7 lime 1 4 NA 1
8 lime 2 6 NA 0
You can try:
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.)))
replace(., is.na(.), 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
So we replace NA with 0 only if the group has at least one non-NA value.
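The rule can be checked on plain vectors before applying it per group. Below is a minimal base R sketch; the helper name `fill_na_if_any` is hypothetical and only used for illustration, and `ave()` stands in for the grouped `mutate()`.

```r
# Hypothetical helper: fill NA with 0 only when the vector has
# at least one non-NA value; otherwise return it unchanged.
fill_na_if_any <- function(x) {
  if (any(!is.na(x))) replace(x, is.na(x), 0) else x
}

pear_2019  <- fill_na_if_any(c(NA, 7))                # one farm has a value
apple_2019 <- fill_na_if_any(c(NA_real_, NA_real_))   # both farms are NA

# Applying the helper per fruit with ave(), using the y2019 column from the question
fruit <- c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime")
y2019 <- c(NA, NA, 3, 12, NA, 7, 4, 6)
y2019_filled <- ave(y2019, fruit, FUN = fill_na_if_any)
```

Here `pear_2019` becomes c(0, 7), `apple_2019` stays all NA, and only the pear NA in `y2019_filled` is turned into 0.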
We can use replace_na from tidyr: if the group has any non-NA elements, replace NA with 0; otherwise return the values unchanged.
library(dplyr)
library(tidyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), ~ if(any(!is.na(.))) replace_na(., 0) else .)) %>%
ungroup()
# A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
Or another option without if/else: after grouping by 'fruit', combine two logical conditions inside replace.
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric),
~ replace(., sum(!is.na(.)) > 0 & is.na(.), 0)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 NA 5 4
#2 apple 2 NA 3 5
#3 peach 1 3 NA 7
#4 peach 2 12 NA 15
#5 pear 1 0 8 NA
#6 pear 2 7 2 NA
#7 lime 1 4 NA 1
#8 lime 2 6 NA 0
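The combined condition works because `sum(!is.na(.)) > 0` is a single TRUE/FALSE for the whole group, which R recycles against the element-wise `is.na(.)`. A small sketch on plain vectors (the variable names are for illustration only):

```r
x <- c(NA, 7)                 # group with one value present
y <- c(NA_real_, NA_real_)    # group with no values at all

# Scalar "group has a value" AND element-wise "this entry is NA"
idx_x <- sum(!is.na(x)) > 0 & is.na(x)   # TRUE FALSE
idx_y <- sum(!is.na(y)) > 0 & is.na(y)   # FALSE FALSE

res_x <- replace(x, idx_x, 0)   # NA replaced because the group has a 7
res_y <- replace(y, idx_y, 0)   # nothing selected, so y is unchanged
```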
I have data like this:
V1 V2
1 orange, apple
2 orange, lemon
3 lemon, apple
4 orange, lemon, apple
5 lemon
6 apple
7 orange
8 lemon, apple
I want to split the V2 variable like this:
V2 has three categories: "orange", "lemon", "apple".
For each category I want to create a new column (variable) indicating whether that name appears in V2 (0 or 1).
I tried this:
df %>% separate(V2, into = c("orange", "lemon", "apple"))
...and I got this result, but it's not what I expect.
V1 orange lemon apple
1 1 orange apple <NA>
2 2 orange lemon <NA>
3 3 lemon apple <NA>
4 4 orange lemon apple
5 5 lemon <NA> <NA>
6 6 apple <NA> <NA>
7 7 orange <NA> <NA>
8 8 lemon apple <NA>
The result I want is below.
V1 orange lemon apple
1 1 0 1
2 1 1 0
3 0 1 1
4 1 1 1
5 0 1 0
6 0 0 1
7 1 0 0
8 0 1 1
You could try pivoting:
library(dplyr)
library(tidyr)
df |>
separate_rows(V2, sep = ", ") |>
mutate(ind = 1) |>
pivot_wider(names_from = V2,
values_from = ind,
values_fill = 0)
Output is:
# A tibble: 8 × 4
V1 orange apple lemon
<int> <dbl> <dbl> <dbl>
1 1 1 1 0
2 2 1 0 1
3 3 0 1 1
4 4 1 1 1
5 5 0 0 1
6 6 0 1 0
7 7 1 0 0
8 8 0 1 1
data I used:
V1 <- 1:8
V2 <- c("orange, apple", "orange, lemon",
"lemon, apple", "orange, lemon, apple",
"lemon", "apple", "orange",
"lemon, apple")
df <- tibble(V1, V2)
We may use dummy_cols
library(stringr)
library(fastDummies)
library(dplyr)
dummy_cols(df, "V2", split = ",\\s+", remove_selected_columns = TRUE) %>%
rename_with(~ str_remove(.x, '.*_'))
Output:
# A tibble: 8 × 4
V1 apple lemon orange
<int> <int> <int> <int>
1 1 1 0 1
2 2 0 1 1
3 3 1 1 0
4 4 1 1 1
5 5 0 1 0
6 6 1 0 0
7 7 0 0 1
8 8 1 1 0
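For comparison, the same indicator columns can be built without extra packages. This is only a base R sketch under the assumption that the three category names are known in advance; `categories`, `parts`, and `ind` are illustrative names.

```r
V2 <- c("orange, apple", "orange, lemon",
        "lemon, apple", "orange, lemon, apple",
        "lemon", "apple", "orange",
        "lemon, apple")

categories <- c("orange", "lemon", "apple")

# Split each cell on a comma followed by optional whitespace
parts <- strsplit(V2, ",\\s*")

# One row per cell, one 0/1 column per category
ind <- t(vapply(parts,
                function(p) as.integer(categories %in% p),
                integer(length(categories))))
colnames(ind) <- categories

out <- data.frame(V1 = seq_along(V2), ind)
```

Because the columns are taken from `categories`, their order is fixed by that vector rather than by the order in which values first appear in the data.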
There are only 2 farms, but tons of fruit. I'm trying to see which farm has performed better over 3 years, where performance is simply farm_i / (farm1 + farm2). For example, for fruit == peach in y2019, farm1's performance was 20% vs. 80% for farm2.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(0,0,3,12,0,7,4,6),
'y2018' = c(5,3,0,0,8,2,0,0),'y2017' = c(4,5,7,15,0,0,0,0) )
> df
fruit farm y2019 y2018 y2017
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 7 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
desired output:
out
fruit farm y2019 y2018 y2017
1 apple 1 0.0 0.625 0.444444
2 apple 2 0.0 0.375 0.555556
3 peach 1 0.2 0.000 0.318182
4 peach 2 0.8 0.000 0.681818
5 pear 1 0.0 0.800 0.000000
6 pear 2 1.0 0.200 0.000000
7 lime 1 0.4 0.000 0.000000
8 lime 2 0.6 0.000 0.000000
This is as far as I could go:
df %>%
group_by(fruit) %>%
summarise(across(where(is.numeric), sum))
We can group by 'fruit', then mutate across the columns that start with 'y', dividing each element by the group sum for that column; if all values in the group are 0, return 0.
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(starts_with('y'), ~ if(all(. == 0)) 0 else ./sum(.)))
# A tibble: 8 x 5
# Groups: fruit [4]
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
NOTE: This uses only the dplyr package and is done in a single step.
Or another option is adorn_percentages from janitor
library(janitor)
library(purrr)
df %>%
group_split(fruit) %>%
map_dfr(adorn_percentages, denominator = "col") %>%
as_tibble
Or using data.table
library(data.table)
setDT(df)[, (3:5) := lapply(.SD, function(x) if(all(x == 0)) 0
else x/sum(x, na.rm = TRUE)), .SDcols = 3:5, by = fruit][]
Or using base R
grpSums <- rowsum(df[3:5], df$fruit)
df[3:5] <- df[3:5]/grpSums[match(df$fruit, row.names(grpSums)),]
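One caveat with the base R snippet: for a group whose column sum is 0 (for example apple in y2019) it computes 0/0, which R evaluates to NaN rather than the 0 shown in the desired output. A possible base R variant that keeps the zeros is sketched below; the helper name `share` and the vector `ycols` are assumptions made for illustration.

```r
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
                 farm = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
                 y2019 = c(0, 0, 3, 12, 0, 7, 4, 6),
                 y2018 = c(5, 3, 0, 0, 8, 2, 0, 0),
                 y2017 = c(4, 5, 7, 15, 0, 0, 0, 0))

# Hypothetical helper: proportions within a group, or zeros if the group sums to 0
share <- function(x) if (all(x == 0)) rep(0, length(x)) else x / sum(x)

ycols <- c("y2019", "y2018", "y2017")

# ave() applies share() within each fruit and returns values in the original row order
df[ycols] <- lapply(df[ycols], function(col) ave(col, df$fruit, FUN = share))
```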
We can use prop.table to calculate the proportions for each fruit.
library(dplyr)
df %>%
group_by(fruit) %>%
mutate(across(where(is.numeric), prop.table),
#to replace `NaN` with 0 (written as a lambda rather than passing 0 through across's `...`)
across(where(is.numeric), ~ tidyr::replace_na(.x, 0)))
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
#1 apple 1 0 0.625 0.444
#2 apple 2 0 0.375 0.556
#3 peach 1 0.2 0 0.318
#4 peach 2 0.8 0 0.682
#5 pear 1 0 0.8 0
#6 pear 2 1 0.2 0
#7 lime 1 0.4 0 0
#8 lime 2 0.6 0 0
In the df below, there are only 2 farms, so each fruit appears twice. I'd like to replace zeros with NA as follows:
df[df==0] <- NA
However, whenever either farm has a nonzero value for a fruit, such as pear at y2019 with values c(0, 7), I'd like to keep the 0. A dplyr solution would be great.
Sample data:
df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
farm = as.factor(c(1,2,1,2,1,2,1,2)), 'y2019' = c(0,0,3,12,0,7,4,6),
'y2018' = c(5,3,0,0,8,2,0,0),'y2017' = c(4,5,7,15,0,0,0,0) )
> df
fruit farm y2019 y2018 y2017
1 apple 1 0 5 4
2 apple 2 0 3 5
3 peach 1 3 0 7
4 peach 2 12 0 15
5 pear 1 0 8 0
6 pear 2 7 2 0
7 lime 1 4 0 0
8 lime 2 6 0 0
library(dplyr)
df %>%
group_by(fruit) %>%
mutate_at(vars(starts_with("y20")), ~ if (any(. != 0)) . else NA_real_) %>%
ungroup()
# # A tibble: 8 x 5
# fruit farm y2019 y2018 y2017
# <chr> <fct> <dbl> <dbl> <dbl>
# 1 apple 1 NA 5 4
# 2 apple 2 NA 3 5
# 3 peach 1 3 NA 7
# 4 peach 2 12 NA 15
# 5 pear 1 0 8 NA
# 6 pear 2 7 2 NA
# 7 lime 1 4 NA NA
# 8 lime 2 6 NA NA
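Since `mutate_at()` is superseded in dplyr 1.0.0 and later, the same logic can also be written with `across()`. This is a sketch of an equivalent call, not a different method; it assumes dplyr 1.0.0 or newer is installed.

```r
library(dplyr)

df <- data.frame(fruit = c("apple", "apple", "peach", "peach", "pear", "pear", "lime", "lime"),
                 farm = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2)),
                 y2019 = c(0, 0, 3, 12, 0, 7, 4, 6),
                 y2018 = c(5, 3, 0, 0, 8, 2, 0, 0),
                 y2017 = c(4, 5, 7, 15, 0, 0, 0, 0))

out <- df %>%
  group_by(fruit) %>%
  # Keep the column as is if any farm has a nonzero value, otherwise set the group to NA
  mutate(across(starts_with("y20"), ~ if (any(.x != 0)) .x else NA_real_)) %>%
  ungroup()
```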
I'm trying to rank certain groups by their counts using dense_rank, but it doesn't assign distinct ranks to groups that are tied. And every ranking function I try that has some sort of ties.method fails to give me the rankings in consecutive 1, 2, 3 order. Example:
library(dplyr)
id <- c(rep(1, 8),
rep(2, 8))
fruit <- c(rep('apple', 4), rep('orange', 1), rep('banana', 2), 'orange',
rep('orange', 4), rep('banana', 1), rep('apple', 2), 'banana')
df <- data.frame(id, fruit, stringsAsFactors = FALSE)
df2 <- df %>%
mutate(counter = 1) %>%
group_by(id, fruit) %>%
mutate(fruitCnt = sum(counter)) %>%
ungroup() %>%
group_by(id) %>%
mutate(fruitCntRank = dense_rank(desc(fruitCnt))) %>%
select(id, fruit, fruitCntRank)
df2
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 2
7 1 banana 2
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 2
15 2 apple 2
16 2 banana 2
It doesn't matter which of orange or banana is ranked 3, and it doesn't even need to be consistent. I just need the groups to be ranked 1, 2, 3.
Desired result:
id fruit fruitCntRank
1 1 apple 1
2 1 apple 1
3 1 apple 1
4 1 apple 1
5 1 orange 2
6 1 banana 3
7 1 banana 3
8 1 orange 2
9 2 orange 1
10 2 orange 1
11 2 orange 1
12 2 orange 1
13 2 banana 2
14 2 apple 3
15 2 apple 3
16 2 banana 2
We can add the count for each id and fruit combination, arrange the rows in descending order of count, and get the rank using match.
library(dplyr)
df %>%
add_count(id, fruit) %>%
arrange(id, desc(n)) %>%
group_by(id) %>%
mutate(n = match(fruit, unique(fruit)))
#Another option with cumsum and duplicated; this needs each fruit's rows to be
#contiguous, so sort with arrange(id, desc(n), fruit) first
#mutate(n = cumsum(!duplicated(fruit)))
# id fruit n
# <dbl> <chr> <int>
# 1 1 apple 1
# 2 1 apple 1
# 3 1 apple 1
# 4 1 apple 1
# 5 1 orange 2
# 6 1 banana 3
# 7 1 banana 3
# 8 1 orange 2
# 9 2 orange 1
#10 2 orange 1
#11 2 orange 1
#12 2 orange 1
#13 2 banana 2
#14 2 apple 3
#15 2 apple 3
#16 2 banana 2
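The `match(fruit, unique(fruit))` step numbers each fruit by the order of its first appearance, which is why sorting by count first turns it into a tie-free rank. A small sketch on the id == 1 rows as they stand after `arrange()` (which keeps tied rows in their original order) also shows where the `cumsum(!duplicated())` variant differs:

```r
# Fruits of id == 1 after arrange(id, desc(n)): the tied fruits stay interleaved
fruit <- c("apple", "apple", "apple", "apple", "orange", "banana", "banana", "orange")

# Position of each fruit among the distinct fruits, in order of first appearance
rank_match <- match(fruit, unique(fruit))

# Counts how many distinct fruits have been seen so far; this matches the result
# above only when equal fruits sit next to each other
rank_cumsum <- cumsum(!duplicated(fruit))
```

Here `rank_match` is 1 1 1 1 2 3 3 2, while `rank_cumsum` is 1 1 1 1 2 3 3 3, so the last orange row would be mislabeled unless the rows are also sorted by fruit.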
I have a large data frame that looks like this:
ID GroupID a b ...
1 001 2 3
2 001 2 2
3 001 2 2
4 001 2 0
5 001 0 1
6 002 1 1
7 002 2 1
8 002 0 1
9 002 0 1
10 002 2 1
11 002 3 0
...
Now I want to set a column to NA for the whole group when a single value makes up more than 75% of that column's values in the group (because I assume the values are erroneous).
The result should look like this:
ID GroupID a b ...
1 001 NA 3
2 001 NA 2
3 001 NA 2
4 001 NA 0
5 001 NA 1
6 002 1 NA
7 002 2 NA
8 002 0 NA
9 002 0 NA
10 002 2 NA
11 002 3 NA
...
I know that's quite a specific question, but maybe you can help me.
In case you need the dataset above:
ID <- c(1:11)
GroupID <- c('001','001','001','001','001','002','002','002','002','002','002')
a <- c(2,2,2,2,0,1,2,0,0,2,3)
b <- c(3,2,2,0,1,1,1,1,1,1,0)
DF <- data.frame(ID, GroupID, a,b)
One approach would be:
library(dplyr)
DF %>% group_by(GroupID) %>%
mutate_at(c("a", "b"), function(x) if(any(table(x) > length(x) * 0.75)) NA else x)
# A tibble: 11 x 4
# Groups: GroupID [2]
# ID GroupID a b
# <int> <fct> <dbl> <dbl>
# 1 1 001 NA 3
# 2 2 001 NA 2
# 3 3 001 NA 2
# 4 4 001 NA 0
# 5 5 001 NA 1
# 6 6 002 1 NA
# 7 7 002 2 NA
# 8 8 002 0 NA
# 9 9 002 0 NA
# 10 10 002 2 NA
# 11 11 002 3 NA
We can also use replace as follows.
library(dplyr)
anyPer <- function(x, threshold = 0.75){
a <- table(x)
b <- a/sum(a)
result <- any(b > threshold)
return(result)
}
dat2 <- dat %>%
group_by(GroupID) %>%
# funs() is deprecated, so a formula-style lambda is used instead
mutate_at(vars(-ID, -GroupID), ~ replace(., anyPer(.), NA)) %>%
ungroup()
dat2
# # A tibble: 11 x 4
# ID GroupID a b
# <int> <int> <int> <int>
# 1 1 1 NA 3
# 2 2 1 NA 2
# 3 3 1 NA 2
# 4 4 1 NA 0
# 5 5 1 NA 1
# 6 6 2 1 NA
# 7 7 2 2 NA
# 8 8 2 0 NA
# 9 9 2 0 NA
# 10 10 2 2 NA
# 11 11 2 3 NA
DATA
dat <- read.table(text = "ID GroupID a b
1 001 2 3
2 001 2 2
3 001 2 2
4 001 2 0
5 001 0 1
6 002 1 1
7 002 2 1
8 002 0 1
9 002 0 1
10 002 2 1
11 002 3 0",
header = TRUE)
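Both answers give the same result on this data, but the two threshold checks use different denominators: `length(x)` counts every row, while `sum(table(x))` ignores NA because `table()` drops missing values by default. A short sketch of the difference follows; the vector `with_na` is made up for illustration and is not part of the question's data.

```r
# Column a of group 001: the value 2 appears 4 times out of 5 (80%)
a_001 <- c(2, 2, 2, 2, 0)
flag_length <- any(table(a_001) > length(a_001) * 0.75)   # 4 > 3.75
flag_share  <- any(table(a_001) / sum(table(a_001)) > 0.75)

# Hypothetical vector with missing values: 1 appears 3 times, plus 2 NAs
with_na <- c(1, 1, 1, NA, NA)
na_length <- any(table(with_na) > length(with_na) * 0.75)       # 3 > 3.75 is FALSE
na_share  <- any(table(with_na) / sum(table(with_na)) > 0.75)   # 3/3 > 0.75 is TRUE
```

Which behavior is preferable depends on whether missing values should count toward the group size.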