Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable amount of columns. E.g. suppose I want store for each row the column index of the column(s) that contain the minimum per row. E.g. in row 1, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)
test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
ungroup() %>%
unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the named_repair argument within the unnest_wider, i.e. unnest_wider(code, names_repair = ~paste0("code", .x)) but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution. But it is really ugly and requires to stop half way through. Problem here is that the rename_with function doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
length_code = length(code)) %>%
ungroup() %>%
unnest_wider(code) %>%
test3 <- test2 %>%
rename_with(.cols = starts_with("..."), .fn = ~paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
Related
I have questionnaire data (rows=individuals, cols=scores on questions)and would like to compute a sumscore for individuals if they answered a given number of questions, otherwise the sumscore variable should be NA. The code below computes row sums, counts the number of NA's, assigns an otherwise not occurring value to the row sum variable in case the number of NA's is large, and then replaces that with an NA. The code works but I bet there is a more elegant way...Suggestions much appreciated.
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum<-dum %>%
mutate(sumsum = rowSums(select(., x:z), na.rm = TRUE))
dum<-dum %>%
mutate(countna=rowSums(is.na(select(.,x:z))))
dum<-dum %>%
mutate(sumsum=case_when(countna>=2 ~ 100,TRUE~sumsum))
dum<-dum %>%
mutate(sumsum = na_if(sumsum, 100))
You may combine your code in one statement -
library(dplyr)
dum <- tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum <- dum %>%
mutate(sumsum = replace(rowSums(select(., x:z), na.rm = TRUE),
rowSums(is.na(select(., x:z))) >= 2, NA))
dum
# A tibble: 5 × 4
# x y z sumsum
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 3
#2 NA 2 NA NA
#3 2 3 2 7
#4 3 NA 3 6
#5 4 5 4 13
You can also try this:
dum<-tibble(x=c(1,NA,2,3,4),y=c(1,2,3,NA,5),z=c(1,NA,2,3,4))
dum2 <- dum %>% mutate(sumsum = ifelse(rowSums(is.na(select(.,x:z)))>=2, NA,rowSums(select(., x:z), na.rm = TRUE)))
dum2
# A tibble: 5 × 4
x y z sumsum
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 NA 2 NA NA
3 2 3 2 7
4 3 NA 3 6
5 4 5 4 13
I'm trying to use the following function to iterate through a dataframe and return the counts from each row:
library(dplyr)
library(tidyr)
row_freq <- function(df_input,row_input){
print(df_input)
vec <- unlist(df_input %>%
select(-1) %>%
slice(row_input), use.names = FALSE)
r <- data.frame(table(vec)) %>%
pivot_wider(values_from = Freq, names_from = vec)
return(r)
}
This works fine if I use a single row from the dataframe:
sample_df <- data.frame(id = c(1,2,3,4,5), obs1 = c("A","A","B","B","B"),
obs2 = c("B","B","C","D","D"), obs3 = c("A","B","A","D","A"))
row_freq(sample_df, 1)
id obs1 obs2 obs3
1 1 A B A
2 2 A B B
3 3 B C A
4 4 B D D
5 5 B D A
# A tibble: 1 × 2
A B
<int> <int>
1 2 1
But when iterating over rows using purrr::map_dfr, it seems to reduce df_input to only the id column instead of using the entire dataframe as the argument, which I found quite strange:
purrr::map_dfr(sample_df, row_freq, 1:5)
[1] 1 2 3 4 5
Error in UseMethod("select") :
no applicable method for 'select' applied to an object of class "c('double', 'numeric')"
I'm looking for help with regards to 1) why this is happening, 2) how to fix it, and 3) any alternative approaches or functions that may already perform what I'm trying to do in a more efficient manner.
Specify the order of the arguments correctly if we are not passing with named arguments
purrr::map_dfr(1:5, ~ row_freq(sample_df, .x))
-output
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
Or use a named argument
purrr::map_dfr(df_input = sample_df, .f = row_freq, .x = 1:5)
-output
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
The reason is that map first argument is .x
map(.x, .f, ...)
and if we are providing the 'sample_df' as the argument, it takes the .x as sample_df and loops over the columns of the data (as data.frame/tibble/data.table - unit is column as these are list with additional attributes)
I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach using dplyr lead,lag to get next and previous value respectively and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could create your own rleid function then apply the logic provided by Ronak(Many thanks. Upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
I have a tibble containing time series of various blood parameters like CRP over the course of several days. The tibble is tidy, with each time series in one column, as well as a column for the day of measurement. The tibble contains another column with a day of infection. I want to replace each blood parameter with NA if the Day variable is greater-equal than the InfectionDay. Since I have a lot of variables, I'd like to have a function which accepts the column name dynamically and creates a new column name by appending "_censored" to the old one. I've tried the following:
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !!colname, NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
Running this, I expected
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 3
2 2 3 2 2
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
but I get
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 CRP
2 2 3 2 CRP
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
You can add sym() to the column name in mutate to convert to symbol before evaluating
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !! sym(colname), NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
We can select columns on which we want to apply the function (cols) and use mutate_at which will also automatically rename the columns. Added an extra column in the data to show renaming.
library(dplyr)
cols <- c("CRP", "CRP1")
data %>%
mutate_at(cols, list(censored = ~replace(., Day >= InfectionDay, NA)))
# A tibble: 5 x 6
# Day InfectionDay CRP CRP1 CRP_censored CRP1_censored
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 3 3 3 3
#2 2 3 2 2 2 2
#3 3 3 5 5 NA NA
#4 4 3 4 4 NA NA
#5 5 3 1 1 NA NA
data
data <- tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1), CRP1 = c(3,2,5,4,1))
I have a grouped df and I would like to add additional rows to the top of the groups that match with a variable (item_code) from the df.
The additional rows do not have an id column. The additional rows should not be duplicated within the groups of df.
Example data:
df <- as.tibble(data.frame(id=rep(1:3,each=2),
item_code=c("A","A","B","B","B","Z"),
score=rep(1,6)))
additional_rows <- as.tibble(data.frame(item_code=c("A","Z"),
score=c(6,6)))
What I tried
I found this post and tried to apply it:
Add row in each group using dplyr and add_row()
df %>% group_by(id) %>% do(add_row(additional_rows %>%
filter(item_code %in% .$item_code)))
What I get:
# A tibble: 9 x 3
# Groups: id [3]
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 Z 6
3 1 NA NA
4 2 A 6
5 2 Z 6
6 2 NA NA
7 3 A 6
8 3 Z 6
9 3 NA NA
What I am looking for:
# A tibble: 6 x 3
id item_code score
<int> <fct> <dbl>
1 1 A 6
2 1 A 1
3 1 A 1
4 2 B 1
5 2 B 1
6 3 B 1
7 3 Z 6
8 3 Z 1
This should do the trick:
library(plyr)
df %>%
join(subset(df, item_code %in% additional_rows$item_code, select = c(id, item_code)) %>%
join(additional_rows) %>%
subset(!duplicated(.)), type = "full") %>%
arrange(id, item_code, -score)
Not sure if its the best way, but it works
Edit: to get the score in the same order added the other arrange terms
Edit 2: alright, there should now be no duplicated rows added from the additional rows as per your comment