Error while using the forcats relevel function - r

I have a dataframe with X, Y coordinate values and corresponding ID values in Val.
df1 <- data.frame(X=rnorm(1000,0,1), Y=rnorm(1000,0,1),
ID=paste(rep("ID", 1000), 1:1000, sep="_"),
Type=rep("ID",1000),
Val=c(rep(c('Type1','Type2'),300),
rep(c('Type3','Type4'),200)))
Adding the missing IDs for the existing X,Y values in df1.
dat2 <- data.frame(Type=rep('D',80),
Val=paste(rep("D", 80),
sample(1:2,80,replace=TRUE), sep="_"))
dat2 <- cbind(df1[sample(1:1000,80),1:3],dat2)
df1 <- rbind(df1, dat2)
Looking at the frequency of ID values.
df1 %>% count(Val)
# # A tibble: 6 x 2
# Val n
# <fctr> <int>
# 1 Type1 300
# 2 Type2 300
# 3 Type3 200
# 4 Type4 200
# 5 D_1 60
# 6 D_2 20
I am interested in only two of the Val levels for further analysis; the rest can be grouped into a single catch-all level. With the fct_other function I recoded them into Other, and the frequencies look as expected.
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 D_1 60
# 2 D_2 20
# 3 Other 1000
Because fct_other puts "Other" last among the factor levels and I want it first, I used fct_relevel from the same package.
df1 %>% mutate(Val=fct_other(Val,keep=c('Type5','Type6'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>%
count(Val)
# # A tibble: 1 x 2
# Val n
# <fctr> <int>
# 1 Other 1080
But it is giving unexpected results. Any idea on what might have gone wrong?
Update:
The error was that I tried to keep levels ('Type5', 'Type6') that do not exist in Val, so every value was lumped into Other.
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 Other 1000
# 2 D_1 30
# 3 D_2 50
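The behaviour described in the update can be reproduced on a tiny factor. This is only a minimal sketch: the vector val is a hypothetical stand-in for the Val column, not the data from the question.

```r
library(forcats)

# Hypothetical toy factor standing in for the Val column
val <- factor(c("Type1", "Type2", "D_1", "D_2", "D_1"))

# Keeping levels that do not exist lumps every value into "Other"
all_other <- fct_other(val, keep = c("Type5", "Type6"))
levels(all_other)
# "Other"

# Keeping existing levels, then moving "Other" to the front
releveled <- fct_relevel(fct_other(val, keep = c("D_1", "D_2")), "Other")
levels(releveled)
# "Other" "D_1" "D_2"
```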
When I then tried to keep only unique X, Y pairs, the selected levels went missing:
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
mutate(Val=fct_relevel(Val,'Other'))%>%
arrange(Val) %>% filter(!duplicated(.[,c("X","Y")])) %>% count(Val)
# # A tibble: 1 x 2
# Val n
# <fctr> <int>
# 1 Other 1000
Relevelling after the extraction of unique values does the job:
df1 %>% mutate(Val=fct_other(Val,keep=c('D_1','D_2'))) %>%
arrange(Val) %>% filter(!duplicated(.[,c("X","Y")])) %>%
mutate(Val=fct_relevel(Val,'Other')) %>%
arrange(Val) %>% count(Val)
# # A tibble: 3 x 2
# Val n
# <fctr> <int>
# 1 Other 920
# 2 D_1 30
# 3 D_2 50
Is this an efficient way of doing it?
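The selected levels vanish in the earlier attempt because arrange(Val) sorts by level order: once Other is the first level, its rows come first and duplicated() keeps them. One possible alternative, sketched here on hypothetical toy data (the data frame toy is an assumption, not the question's df1), is to sort on whether the value is Other, so the order of the levels no longer matters for de-duplication:

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: the point (1, 1) appears twice,
# once as "Type1" and once as "D_1"
toy <- data.frame(
  X = c(1, 2, 1),
  Y = c(1, 2, 1),
  Val = factor(c("Type1", "Type2", "D_1"))
)

result <- toy %>%
  mutate(Val = fct_relevel(fct_other(Val, keep = "D_1"), "Other")) %>%
  arrange(Val == "Other") %>%           # rows of kept levels sort ahead of "Other"
  distinct(X, Y, .keep_all = TRUE) %>%  # keep the first row per coordinate pair
  arrange(Val)                          # final ordering follows the level order

count(result, Val)
```

Here the duplicated point (1, 1) survives as D_1, and Other is still the first level in the result.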

Related

How to filter out groups empty for 1 column in Tidyverse

tibble(
A = c("A","A","B","B"),
x = c(NA,NA,NA,1),
y = c(1,2,3,4),
) %>% group_by(A) -> df
desired output:
tibble(
A = c("B","B"),
x = c(NA,1),
y = c(3,4),
)
I want to find all groups in which every element of x (and only x) is NA, then remove those groups. "B" is kept because it has at least one non-NA element.
I tried:
df %>%
filter(all(!is.na(x)))
but that seems to drop a group as soon as it contains a single NA. I need the right condition, and all(!is.na(x)) is not it.
This will remove groups of column A if all elements of x are NA:
library(dplyr)
df %>%
group_by(A) %>%
filter(! all(is.na(x)))
# A tibble: 2 × 3
# Groups: A [1]
# A x y
# <chr> <dbl> <dbl>
#1 B NA 3
#2 B 1 4
Note that group "A" was removed because both cells in the column x are not defined.
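The difference between the two conditions is easiest to see on plain vectors. A minimal sketch, where x_a and x_b are hypothetical names for the x values of groups "A" and "B":

```r
x_a <- c(NA, NA)  # group "A": every value is missing
x_b <- c(NA, 1)   # group "B": one value is present

# "no value is missing": fails for B as soon as one NA is present
all(!is.na(x_b))
# FALSE

# "not every value is missing": keeps B, drops A
!all(is.na(x_b))
# TRUE
!all(is.na(x_a))
# FALSE

# The same condition written with any()
any(!is.na(x_b))
# TRUE
```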
We can use any with complete.cases
library(dplyr)
df %>%
group_by(A) %>%
filter(any(complete.cases(x))) %>%
ungroup
Output:
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4
Since dplyr 1.1.0 (the devel version when this was written), filter accepts a .by argument, so we don't need group_by/ungroup:
df %>%
filter(any(complete.cases(x)), .by = 'A')
# A tibble: 2 × 3
A x y
<chr> <dbl> <dbl>
1 B NA 3
2 B 1 4

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a data frame/tibble that results from a grouping. The index should follow a defined order. If I use row_number(), it does enumerate, but within each group. I want it to enumerate without considering the former grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping, which was not my intention. This can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
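One way to see which grouping layers are still active at each step is group_vars(). This is a minimal sketch on a hypothetical toy data frame (toy) with fixed y values rather than the random df0; .groups = "drop_last" spells out what summarise does by default here.

```r
library(dplyr)

# Hypothetical toy data with fixed values
toy <- data.frame(
  x1 = c("A", "A", "B", "B"),
  x2 = c("a", "b", "a", "b"),
  y  = c(12, 24, 0, 12)
)

grouped <- group_by(toy, x1, x2)
group_vars(grouped)
# "x1" "x2"

# summarise() peels off the last grouping layer (x2)
summarised <- summarise(grouped, y = sum(y), .groups = "drop_last")
group_vars(summarised)
# "x1"

# ungroup() removes whatever grouping is left
group_vars(ungroup(summarised))
# character(0)
```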
A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, here I ungroup to show the shares of the entire data frame instead.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)

dplyr: passing a grouped tibble to a custom function

(The following scenario simplifies my actual situation)
My data comes from villages, and I would like to summarize an outcome variable by a village variable.
> data
village A Z Y
<chr> <int> <int> <dbl>
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700
For example, I would like to calculate the mean of Y only using Z==z by villages. In this case, I want to have (500 + 400)/2 = 450 for village "a" and 700 for village "b".
Please note that the actual situation is more complicated and I cannot directly use this answer, but the point is I need to pass a grouped tibble and a global variable (z) to my function.
z <- 1 # z takes 0 or 1
data %>%
group_by(village) %>% # grouping by village
summarize(Y_village = Y_hat_village(., z)) # pass a part of tibble and a global variable
Y_hat_village <- function(data_village, z){
# This function takes a part of tibble (`data_village`) and a variable `z`
# Calculate the mean for a specific z in a village
data_z <- data_village %>% filter(Z==get("z"))
return(mean(data_z$Y))
}
However, I found that . passes the entire tibble rather than the current group, so the code above returns the same value for every group.
There are a couple things you can simplify. One is in your function: since you're passing in a value z to the function, you don't need to use get("z"). You have a z in the global environment that you pass in; or, more safely, assign your z value to a variable with some other name so you don't run into scoping issues, and pass that in to the function. In this case, I'm calling it z_val.
library(tidyverse)
z_val <- 1
Y_hat_village2 <- function(data, z) {
data_z <- data %>% filter(Z == z)
return(mean(data_z$Y))
}
You can make the function call on each group using do, which will get you a list-column, and then unnesting that column. Again note that I'm passing in the variable z_val to the argument z.
df %>%
group_by(village) %>%
do(y_hat = Y_hat_village2(., z = z_val)) %>%
unnest()
#> # A tibble: 2 x 2
#> village y_hat
#> <chr> <dbl>
#> 1 a 450
#> 2 b 700
However, do is being deprecated in favor of purrr::map, which I am still having trouble getting the hang of. In this case, you can group and nest, which gives a column of data frames called data, then map over that column and again supply z = z_val. When you unnest the y_hat column, you still have the original data as a nested column, since you wanted access to the rest of the columns still.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = z_val))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 450
#> 2 b <tibble [2 × 3]> 700
Just to check that everything works okay, I also passed in z = 0 to check for 1. scoping issues, and 2. that other values of z work.
df %>%
group_by(village) %>%
nest() %>%
mutate(y_hat = map(data, ~Y_hat_village2(., z = 0))) %>%
unnest(y_hat)
#> # A tibble: 2 x 3
#> village data y_hat
#> <chr> <list> <dbl>
#> 1 a <tibble [3 × 3]> 800
#> 2 b <tibble [2 × 3]> 300
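Another option that hands each group's rows to a custom function is dplyr::group_modify(). This is only a sketch under assumptions: village_data is a hypothetical copy of the sample data, y_hat_village is an illustrative rewrite of the function above, and the function passed to group_modify() has to return a data frame.

```r
library(dplyr)

# Hypothetical copy of the sample data from the question
village_data <- tibble(
  village = c("a", "a", "a", "b", "b"),
  A = c(1, 1, 1, 1, 1),
  Z = c(1, 1, 0, 0, 1),
  Y = c(500, 400, 800, 300, 700)
)

# Illustrative helper: mean of Y among rows with Z == z
y_hat_village <- function(data_village, z) {
  data_z <- filter(data_village, Z == z)
  mean(data_z$Y)
}

z_val <- 1

result <- village_data %>%
  group_by(village) %>%
  # .x is the data frame of the current group (without the grouping column)
  group_modify(~ tibble(y_hat = y_hat_village(.x, z = z_val))) %>%
  ungroup()

result
# village a: y_hat 450; village b: y_hat 700
```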
As an extension/modification to @patL's answer, you can also wrap the tidyverse solution within purrr::map to return a list of two tibbles, one for each z value:
z <- c(0, 1);
map(z, ~df %>% filter(Z == .x) %>% group_by(village) %>% summarise(Y.mean = mean(Y)))
#[[1]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 800.
#2 b 300.
#
#[[2]]
## A tibble: 2 x 2
# village Y.mean
# <fct> <dbl>
#1 a 450.
#2 b 700.
Sample data
df <- read.table(text =
" village A Z Y
1 a 1 1 500
2 a 1 1 400
3 a 1 0 800
4 b 1 0 300
5 b 1 1 700 ", header = T)
You can use dplyr to accomplish it:
library(dplyr)
df %>%
group_by(village) %>%
filter(Z == 1) %>%
summarise(Y_village = mean(Y))
## A tibble: 2 x 2
# village Y_village
# <chr> <dbl>
#1 a 450
#2 b 700
To get all columns:
df %>%
group_by(village) %>%
filter(Z == 1) %>%
mutate(Y_village = mean(Y)) %>%
distinct(village, A, Z, Y_village)
## A tibble: 2 x 4
## Groups: village [2]
# village A Z Y_village
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 450
#2 b 1 1 700
data
df <- tibble(village = c("a", "a", "a", "b", "b"),
A = rep(1, 5),
Z = c(1, 1, 0, 0, 1),
Y = c(500, 400, 800, 300, 700))

R dplyr: summarise complete cases by group for all variables

I want to summarise variables by group for every variable in a dataset using dplyr. The summarised variables should be stored under a new name.
An example:
df <- data.frame(
group = c("A", "B", "A", "B"),
a = c(1,1,NA,2),
b = c(1,NA,1,1),
c = c(1,1,2,NA),
d = c(1,2,1,1)
)
df %>% group_by(group) %>%
mutate(complete_a = sum(complete.cases(a))) %>%
mutate(complete_b = sum(complete.cases(b))) %>%
mutate(complete_c = sum(complete.cases(c))) %>%
mutate(complete_d = sum(complete.cases(d))) %>%
group_by(group, complete_a, complete_b, complete_c, complete_d) %>% summarise()
results in my expected output:
# # A tibble: 2 x 5
# # Groups: group, complete_a, complete_b, complete_c [?]
# group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
# A 1 2 2 2
# B 2 1 1 2
How can I generate the same output without duplicating the mutate statements per variable?
I tried:
df %>% group_by(group) %>% summarise_all(funs(sum(complete.cases(.))))
which works but does not rename the variables.
You are almost there. You have to add rename_all. Note that this also prefixes the grouping column, which becomes complete_group:
library(dplyr)
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", colnames(df)))
# A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
Edit
Or, as pointed out by @symbolrush, more directly without colnames:
df %>%
group_by(group) %>%
summarise_all(funs(sum(complete.cases(.)))) %>%
rename_all(~paste0("complete_", .))
## A tibble: 2 x 5
# complete_group complete_a complete_b complete_c complete_d
# <fct> <int> <int> <int> <int>
#1 A 1 2 2 2
#2 B 2 1 1 2
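If the grouping column should keep its name, as in the expected output of the question, one possible route in dplyr 1.0.0 or later is across() with its .names argument. A minimal sketch, where df_cc is a hypothetical copy of the example data and sum(!is.na(.x)) replaces sum(complete.cases(.)), which counts the same thing for a single vector:

```r
library(dplyr)

# Hypothetical copy of the example data from the question
df_cc <- data.frame(
  group = c("A", "B", "A", "B"),
  a = c(1, 1, NA, 2),
  b = c(1, NA, 1, 1),
  c = c(1, 1, 2, NA),
  d = c(1, 2, 1, 1)
)

result <- df_cc %>%
  group_by(group) %>%
  summarise(
    # count non-missing values per column; the grouping column is left untouched
    across(everything(), ~ sum(!is.na(.x)), .names = "complete_{.col}")
  )

names(result)
# "group" "complete_a" "complete_b" "complete_c" "complete_d"
```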

Avoiding the use of for loop for cumsum

First generating some sample data:
doy <- rep(1:365,times=2)
year <- rep(2000:2001,each=365)
set.seed(1)
value <-runif(min=0,max=10,365*2)
doy.range <- c(40,50,60,80)
thres <- 200
df <- data.frame(cbind(doy,year,value))
What I want to do is the following:
For df$year == 2000, starting from doy == 40 (the first value in doy.range), start adding up
df$value and find the df$doy at which the cumulative sum of df$value is >= thres. Repeat this for every starting day in doy.range and for every year.
Here's my long for loop to achieve this:
# create a matrix to store results
mat <- matrix(, nrow = length(doy.range)*length(unique(year)),ncol=3)
mat[,1] <- rep(unique(year),each=4)
mat[,2] <- rep(doy.range,times=2)
for(i in unique(df$year)){
dat <- df[df$year== i,]
for(j in doy.range){
dat1 <- dat[dat$doy >= j,]
dat1$cum.sum <-cumsum(dat1$value)
day.thres <- dat1[dat1$cum.sum >= thres,"doy"][1] # gives me the doy of the year where cumsum of df$value becomes >= thres
mat[mat[,2] == j & mat[,1] == i,3] <- day.thres
}
}
This loop gives me, in the third column of the matrix, the doy at which the cumulative sum of df$value first reaches thres.
However, I really want to avoid the loops. Is there any way I can do it using less code?
If I understand correctly you can use dplyr. Assume a threshold of 200:
library(dplyr)
df %>% group_by(year) %>%
filter(doy >= 40) %>%
mutate(CumSum = cumsum(value)) %>%
filter(CumSum >= 200) %>%
top_n(n = -1, wt = CumSum)
which yields
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
The verbs used are self-explanatory I guess. If not, let me know.
For different doy create a function and use lapply:
f <- function(doy.range) {
df %>% group_by(year) %>%
filter(doy >= doy.range) %>%
mutate(CumSum = cumsum(value)) %>%
filter(CumSum >= 200) %>%
top_n(n = -1, wt = CumSum)
}
lapply(doy.range, f)
[[1]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 78 2000 3.899895 201.4864
2 75 2001 9.205178 204.3171
[[2]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 89 2000 2.454885 200.2998
2 91 2001 6.578281 200.6544
[[3]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 98 2000 4.100841 200.5048
2 102 2001 7.158333 200.3770
[[4]]
# A tibble: 2 x 4
# Groups: year [2]
doy year value CumSum
<dbl> <dbl> <dbl> <dbl>
1 120 2000 6.401010 204.9951
2 120 2001 5.884192 200.8252
The idea is to create a function that based on a given (starting) doy and threshold gets you the relevant info. Then apply this function to different combinations of starting doys and thresholds and get a dataset back with all relevant info:
# create example data
doy <- rep(1:365,times=2)
year <- rep(2000:2001,each=365)
set.seed(1)
value <-runif(min=0,max=10,365*2)
df <- data.frame(doy,year,value)
library(dplyr)
library(purrr)
# function (inputs: dr for doy range and t for threshold)
f = function(dr, t) {
df %>%
filter(doy >= dr) %>% # keep rows at or after the given starting doy
group_by(year) %>% # for each year
mutate(CumSumValue = cumsum(value)) %>% # get the cumulative sum of value
filter(CumSumValue >= t) %>% # keep rows equal or above a given threshold
slice(1) %>% # keep the first row
ungroup() %>% # forget the grouping
select(-value) %>% # remove unnecessary variable
mutate(doy_input=dr, thres_input=t) %>% # add the input info as columns
select(doy_input, thres_input, year, doy, CumSumValue) # re arrange columns
}
# input doy and threshold
doy.range <- c(40,50,60,80)
thres <- 200
# map those vectors to the function
map2_df(doy.range, thres, f)
# # A tibble: 8 x 5
# doy_input thres_input year doy CumSumValue
# <dbl> <dbl> <int> <int> <dbl>
# 1 40 200 2000 78 201.4864
# 2 40 200 2001 75 204.3171
# 3 50 200 2000 89 200.2998
# 4 50 200 2001 91 200.6544
# 5 60 200 2000 98 200.5048
# 6 60 200 2001 102 200.3770
# 7 80 200 2000 120 204.9951
# 8 80 200 2001 120 200.8252
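map2_df pairs its two inputs element by element (a length-one thres is recycled), so it does not by itself try every combination of starting doy and threshold. If all combinations are wanted, one option is to build the grid first with expand.grid. A minimal sketch on hypothetical toy data (toy, first_doy_over and the grid values are assumptions chosen so the result is easy to check by hand):

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: a single year in which every day contributes 1
toy <- data.frame(doy = 1:10, year = 2000, value = 1)

# First doy (per year) at which the cumulative sum from dr onwards reaches t
first_doy_over <- function(dr, t) {
  toy %>%
    filter(doy >= dr) %>%
    group_by(year) %>%
    mutate(CumSumValue = cumsum(value)) %>%
    filter(CumSumValue >= t) %>%
    slice(1) %>%
    ungroup() %>%
    transmute(doy_input = dr, thres_input = t, year, doy)
}

# Every combination of starting doy and threshold
grid <- expand.grid(dr = c(2, 4), t = c(3, 5))

result <- map2_df(grid$dr, grid$t, first_doy_over)
result$doy
# 4 6 6 8
```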
