I have a huge list samplelist containing various tibbles (in the simplified sample called Alpha, Beta and Gamma). Each tibble contains various elements (in the sample called sample_0 and sample_1). However, not each tibble contains each element (Gamma contains only sample_0, but not sample_1). What I would like to do is to rename the elements based on a condition: if there is an element sample_1 in a tibble, rename it to sampling. However, if the tibble does not contain sample_1, rename sample_0 to sampling (so that the list now contains an element called sampling for each tibble).
samplelist <- list(Alpha = structure(list(sample_0 = c(3, NA, 7, 9, 2),
sample_1 = c(NA, 8, 5, 4, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Beta = structure (list(sample_0 = c(2, 9, NA, 3, 7),
sample_1 = c(3, 7, 9, 3, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Gamma = structure(list(sample_0 = c(NA, NA, 4, 6, 3)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")))
Does anybody know how to get the following desired output?
samplelist
$Alpha
# A tibble: 5 x 2
sample_0 sampling
<dbl> <dbl>
1 3 NA
2 NA 8
3 7 5
4 9 4
5 2 NA
$Beta
# A tibble: 5 x 2
sample_0 sampling
<dbl> <dbl>
1 2 3
2 9 7
3 NA 9
4 3 3
5 7 NA
$Gamma
# A tibble: 5 x 1
sampling
<dbl>
1 NA
2 NA
3 4
4 6
5 3
EDIT
With the code provided by @akrun:
map(samplelist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_0')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_1')))
I got the desired output for my samplelist. However, if there is more than one column in Gamma, the (adjusted) code only works for Alpha and Beta and leaves Gamma unchanged (Delta was already part of the list before this edit):
errorlist <- list(Alpha = structure(list(sample_0 = c(3, NA, 7, 9, 2),
sample_1 = c(NA, 8, 5, 4, NA),
sample_2 = c(7, 3, 5, NA, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Beta = structure (list(sample_0 = c(2, 9, NA, 3, 7),
sample_1 = c(3, 7, 9, 3, NA),
sample_2 = c(4, 2, 6, 4, 6)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Gamma = structure(list(sample_0 = c(NA, NA, 4, 6, 3),
sample_1 = c(3, 7, 3, NA, 8)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")),
Delta = structure (list(error = c(3, 7, 9, 3, NA)),
row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame")))
map(errorlist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_1')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_2')))
Output:
$Alpha
# A tibble: 5 x 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 3 NA 7
2 NA 8 3
3 7 5 5
4 9 4 NA
5 2 NA NA
$Beta
# A tibble: 5 x 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 2 3 4
2 9 7 2
3 NA 9 6
4 3 3 4
5 7 NA 6
$Gamma
# A tibble: 5 x 2
sample_0 sample_1
<dbl> <dbl>
1 NA 3
2 NA 7
3 4 3
4 6 NA
5 3 8
$Delta
# A tibble: 5 x 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
Here is an option: loop over the list with map and apply the changes with a condition (if/else). Here we are using errorlist as it is more general; it also works with samplelist.
library(dplyr)
library(purrr)
map(errorlist, ~ if(ncol(.x) == 1 && names(.x) == 'sample_0')
setNames(.x, 'sampling') else
rename_with(.x, ~ 'sampling', matches('sample_1')))
-output
$Alpha
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 3 NA
2 NA 8
3 7 5
4 9 4
5 2 NA
$Beta
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 2 3
2 9 7
3 NA 9
4 3 3
5 7 NA
$Gamma
# A tibble: 5 × 1
sampling
<dbl>
1 NA
2 NA
3 4
4 6
5 3
$Delta
# A tibble: 5 × 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
Update
Based on the OP's comments
lapply(errorlist, \(x) {
nm1 <- stringr::str_subset(names(x), "^sample_\\d+$")
i1 <- which.max(as.numeric(stringr::str_extract(nm1,
"(?<=sample_)\\d+")))
if(length(i1) > 0) names(x)[names(x) == nm1[i1]] <- "sampling"
x})
-output
$Alpha
# A tibble: 5 × 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 3 NA 7
2 NA 8 3
3 7 5 5
4 9 4 NA
5 2 NA NA
$Beta
# A tibble: 5 × 3
sample_0 sample_1 sampling
<dbl> <dbl> <dbl>
1 2 3 4
2 9 7 2
3 NA 9 6
4 3 3 4
5 7 NA 6
$Gamma
# A tibble: 5 × 2
sample_0 sampling
<dbl> <dbl>
1 NA 3
2 NA 7
3 4 3
4 6 NA
5 3 8
$Delta
# A tibble: 5 × 1
error
<dbl>
1 3
2 7
3 9
4 3
5 NA
I have a large dataset in which the answers to one question are distributed among various columns. Columns that belong together share the same prefix. I wonder how I can create one subset dataset per question, based on that prefix.
Here is an example dataset. I would like an efficient and easily adaptable solution that creates a dataset containing only the values of question one, two or three.
df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1,
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))
You can use sapply and a function:
list_data <- sapply(c("Question1", "Question2", "Question3"),
function(x) df[startsWith(names(df),x)], simplify = FALSE)
This will store everything in a list. To get the individual data sets in the global environment as individual objects, use:
list2env(list_data, globalenv())
Output
# $Question1
# # A tibble: 8 × 3
# Question1a Question1b Question1c
# <dbl> <dbl> <dbl>
# 1 1 NA 1
# 2 1 1 1
# 3 NA NA NA
# 4 NA 1 NA
# 5 1 NA 1
# 6 1 1 NA
# 7 1 NA NA
# 8 NA 1 NA
#
# $Question2
# # A tibble: 8 × 2
# Question2a Question2b
# <dbl> <dbl>
# 1 1 NA
# 2 NA 1
# 3 NA NA
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 NA NA
# 8 NA NA
#
# $Question3
# # A tibble: 8 × 2
# Question3a Question3b
# <dbl> <dbl>
# 1 NA NA
# 2 NA NA
# 3 NA 1
# 4 NA 1
# 5 1 NA
# 6 1 NA
# 7 1 NA
# 8 NA NA
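If the question prefixes should not be typed by hand, one possible sketch derives them from the column names and splits the columns in a single step with base R's split.default. It assumes that every non-ID column is named as a prefix followed by a lowercase letter suffix; df_demo is a shortened, made-up stand-in for the data above.

```r
# Illustrative data only: a shortened version of the survey table
df_demo <- data.frame(
  ID = 1:3,
  Question1a = c(1, 1, NA),
  Question1b = c(NA, 1, NA),
  Question2a = c(1, NA, NA),
  Question3a = c(NA, NA, 1)
)

answers <- df_demo[-1]                        # drop the ID column
prefix <- sub("[a-z]+$", "", names(answers))  # "Question1a" -> "Question1"

# split.default() treats the data frame as a list of columns and
# groups those columns by prefix, returning one data frame per question
by_question <- split.default(answers, prefix)
```

The result is a named list, so by_question$Question1 holds only the Question1 columns; new questions in the data are picked up without changing the code.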
I believe the underlying question is about data-formats.
Here's a few:
library(tidyverse)
structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8),
Question1a = c(1,
1, NA, NA, 1, 1, 1, NA),
Question1b = c(NA, 1, NA, 1, NA, 1,
NA, 1),
Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
Question2a = c(1,
NA, NA, NA, 1, 1, NA, NA),
Question2b = c(NA, 1, NA, 1, NA, NA,
NA, NA),
Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
Question3b = c(NA,
NA, 1, 1, NA, NA, NA, NA)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -8L)
) -> square_df
square_df %>%
pivot_longer(-ID,
names_to = c("Question", "Item"),
names_pattern = "Question(\\d+)(\\w+)") ->
long_df
long_df
#> # A tibble: 56 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 b NA
#> 3 1 1 c 1
#> 4 1 2 a 1
#> 5 1 2 b NA
#> 6 1 3 a NA
#> 7 1 3 b NA
#> 8 2 1 a 1
#> 9 2 1 b 1
#> 10 2 1 c 1
#> # … with 46 more rows
long_df %>%
drop_na(value) ->
sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#> ID Question Item value
#> <dbl> <chr> <chr> <dbl>
#> 1 1 1 a 1
#> 2 1 1 c 1
#> 3 1 2 a 1
#> 4 2 1 a 1
#> 5 2 1 b 1
#> 6 2 1 c 1
#> 7 2 2 b 1
#> 8 3 3 b 1
#> 9 4 1 b 1
#> 10 4 2 b 1
#> # … with 12 more rows
sparse_long_df %>%
nest(data = c(ID, Item, value)) ->
nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#> Question data
#> <chr> <list>
#> 1 1 <tibble [12 × 3]>
#> 2 2 <tibble [5 × 3]>
#> 3 3 <tibble [5 × 3]>
Created on 2022-05-12 by the reprex package (v2.0.1)
You could also use map to store each dataframe in a list, e.g.
library(dplyr) # for select() and starts_with()
library(purrr)
# 3 = number of questions
map(c(1:3),
function(x){
quest <- paste0("Question",x)
select(df, ID, starts_with(quest))
})
Output:
[[1]]
# A tibble: 8 x 4
ID Question1a Question1b Question1c
<dbl> <dbl> <dbl> <dbl>
1 1 1 NA 1
2 2 1 1 1
3 3 NA NA NA
4 4 NA 1 NA
5 5 1 NA 1
6 6 1 1 NA
7 7 1 NA NA
8 8 NA 1 NA
[[2]]
# A tibble: 8 x 3
ID Question2a Question2b
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 NA 1
3 3 NA NA
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 NA NA
8 8 NA NA
[[3]]
# A tibble: 8 x 3
ID Question3a Question3b
<dbl> <dbl> <dbl>
1 1 NA NA
2 2 NA NA
3 3 NA 1
4 4 NA 1
5 5 1 NA
6 6 1 NA
7 7 1 NA
8 8 NA NA
I found a really intuitive solution with the dplyr package, using the select and starts_with functions. Alternatively, you can replace starts_with with contains if you are not identifying the related variables by a prefix but by some other common feature.
Q1 <- Survey %>%
select(
starts_with("Question1")
)
Q2 <- Survey %>%
select(
starts_with("Question2")
)
Q3 <- Survey %>%
select(
starts_with("Question3")
)
I have a set of dataframes in a list and have to create an extra column for each dataframe (which I've done), then apply one formula to the first row and a different one from the second row onwards, taking lags from the same column:
Let's say the list name is "CNRRF_list"
Creation of the fourth column (X4)
CNRRF_list<- mapply(cbind, CNRRF_list, "X4"=NA,SIMPLIFY=F)
one of the resulting dataframes
x1 x2 x3 x4
1 1 1 1 NA
2 2 2 2 NA
3 3 3 3 NA
4 4 4 4 NA
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
First formula first row
for (i in seq_along(CNRRF_list)) {
CNRRF_list[[i]]$X4[1]<-(1+CNRRF_list[[i]]$X3[1])
}
Resulting data
x1 x2 x3 x4
1 1 1 1 2 ===> "formula (1+X3)=(1+1)=2"
2 2 2 2 NA
3 3 3 3 NA
4 4 4 4 NA
5 5 5 5 NA
6 6 6 6 NA
7 7 7 7 NA
8 8 8 8 NA
now it gets tricky, from the second row onwards the formula is:
lag(X4)*(1+X3)
so the resulting data should look like this for each dataframe in the list:
x1 x2 x3 x4
1 1 1 1 2
2 2 2 2 6 ===> "formula lag(X4)*(1+x3)=2*(1+2)=6"
3 3 3 3 24 ===> "formula 6*(1+3)"
4 4 4 4 120 ===> "formula 24*(1+4)"
5 5 5 5 720 ===> "formula 120*(1+5)"
6 6 6 6 5040 ===> "formula 720*(1+6)"
7 7 7 7 40320 ===> "formula 5040*(1+7)"
8 8 8 8 362880 ===> "formula 40320*(1+8)"
But I haven't been able to create a good enough formula.
some of my attempts
for (i in seq_along(CNRRF_list)) {
CNRRF_list[[i]] <- mutate(CNRRF_list[[i]], X4 = (ifelse(is.na(CNRRF_list[[i]]$X4),lag(CNRRF_list[[i]]$X4)*(1+CNRRF_list[[i]]$X3), 1*(1+CNRRF_list[[i]]$X3))))
}
Not working... any help will be appreciated.
Thanks
How about this:
dat <- tibble::tribble(
~x1, ~x2, ~x3, ~x4,
1, 1, 1, NA,
2, 2, 2, NA,
3, 3, 3, NA,
4, 4, 4, NA,
5, 5, 5, NA,
6, 6, 6, NA,
7, 7, 7, NA,
8, 8, 8, NA)
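for(i in seq_len(nrow(dat))){
# c(NA, dat$x4)[i] is the previous row's x4 (NA for row 1, dropped by na.rm);
# this avoids lag(), whose behaviour differs between stats and dplyr
dat$x4[i] <- prod(c(NA, dat$x4)[i], (1+dat$x3[i]), na.rm=TRUE)
}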
dat
#> # A tibble: 8 × 4
#> x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2
#> 2 2 2 2 6
#> 3 3 3 3 24
#> 4 4 4 4 120
#> 5 5 5 5 720
#> 6 6 6 6 5040
#> 7 7 7 7 40320
#> 8 8 8 8 362880
Created on 2022-04-05 by the reprex package (v2.0.1)
Edit: Apply to a list of data frames
Here's how you could apply this to a list of data frames.
dat <- tibble::tribble(
~x1, ~x2, ~x3, ~x4,
1, 1, 1, NA,
2, 2, 2, NA,
3, 3, 3, NA,
4, 4, 4, NA,
5, 5, 5, NA,
6, 6, 6, NA,
7, 7, 7, NA,
8, 8, 8, NA)
dat_list <- list(dat, dat, dat)
res <- lapply(dat_list, function(x){
for(i in seq_len(nrow(x))){
# c(NA, x$x4)[i] is the previous row's x4 (NA for row 1, dropped by na.rm)
x$x4[i] <- prod(c(NA, x$x4)[i], (1+x$x3[i]), na.rm=TRUE)
}
x
})
res
#> [[1]]
#> # A tibble: 8 × 4
#> x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2
#> 2 2 2 2 6
#> 3 3 3 3 24
#> 4 4 4 4 120
#> 5 5 5 5 720
#> 6 6 6 6 5040
#> 7 7 7 7 40320
#> 8 8 8 8 362880
#>
#> [[2]]
#> # A tibble: 8 × 4
#> x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2
#> 2 2 2 2 6
#> 3 3 3 3 24
#> 4 4 4 4 120
#> 5 5 5 5 720
#> 6 6 6 6 5040
#> 7 7 7 7 40320
#> 8 8 8 8 362880
#>
#> [[3]]
#> # A tibble: 8 × 4
#> x1 x2 x3 x4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 2
#> 2 2 2 2 6
#> 3 3 3 3 24
#> 4 4 4 4 120
#> 5 5 5 5 720
#> 6 6 6 6 5040
#> 7 7 7 7 40320
#> 8 8 8 8 362880
Created on 2022-04-05 by the reprex package (v2.0.1)
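Since the first row is 1 + x3 and every later row multiplies the previous x4 by (1 + x3), the recurrence is a running product of (1 + x3). Under that reading, a loop-free sketch with base R's cumprod could look like this; the demo data frames are plain data.frame stand-ins for the list in the question.

```r
# Illustrative data only: three identical data frames in a list
dat <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8)
dat_list <- list(dat, dat, dat)

# x4[1] = 1 + x3[1]; x4[i] = x4[i - 1] * (1 + x3[i])
# which is the cumulative product of (1 + x3)
res <- lapply(dat_list, function(d) {
  d$x4 <- cumprod(1 + d$x3)
  d
})
```

This relies on the formula having no other terms; if the real formula adds anything besides the multiplication, the explicit loop is the safer choice.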
Another option is to use the base function Reduce
using data.table
library(data.table)
setDT(dt) # make it a data.table if it is not already
# init = 1 makes the first row 1 * (1 + x3[1]); [-1] drops the init value again
dt[, x4 := Reduce(f = function(a, b) { a * (b + 1) }, x = x3, accumulate = TRUE, init = 1)[-1]]
using dplyr
dt %>%
mutate(x4 = Reduce(f = function(a, b) { a * (b + 1) }, x = x3, accumulate = TRUE, init = 1)[-1])
output
# x1 x2 x3 x4
# 1: 1 1 1 2
# 2: 2 2 2 6
# 3: 3 3 3 24
# 4: 4 4 4 120
# 5: 5 5 5 720
# 6: 6 6 6 5040
# 7: 7 7 7 40320
# 8: 8 8 8 362880
data
dt <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8)
With accumulate:
library(tidyverse)
dat %>%
mutate(x4 = accumulate(1 + x3, ~ .x * .y))
# A tibble: 8 x 4
x1 x2 x3 x4
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 2
2 2 2 2 6
3 3 3 3 24
4 4 4 4 120
5 5 5 5 720
6 6 6 6 5040
7 7 7 7 40320
8 8 8 8 362880
For multiple dataframes:
dat_list <- list(dat, dat, dat)
dat_list %>%
map(~ .x %>%
mutate(x4 = purrr::accumulate(1 + x3, ~ .x * .y))
)
I need to flag an id when it has different grade values across the grade columns. Here is what my sample dataset looks like:
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df$Grade <- paste0(df$grade.1, df$grade.2, df$grade.3, df$grade.4)
> df
id grade.1 grade.2 grade.3 grade.4 Grade
1 11 3 3 4 NA 334NA
2 22 4 4 4 NA 444NA
3 33 5 5 6 NA 556NA
4 44 6 NA 5 5 6NA55
5 55 7 7 7 7 7777
When an id has different grade values in grade.1, grade.2, grade.3 and grade.4, that row needs to be flagged. An NA in one of those columns does not affect the flagging.
In other words, if the Grade column at the end contains differing numbers, that id needs to be flagged.
My desired output should look like this:
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA Not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 Not_flagged
Any ideas?
Thanks!
A base R solution using rle omitting NA values.
df$flag <- apply(df[,2:5], 1, function(x)
ifelse(length(rle(x[!is.na(x)])$lengths)==1, "not_flagged", "flagged"))
df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
Data
df <- structure(list(id = c(11, 22, 33, 44, 55), grade.1 = c(3, 4,
5, 6, 7), grade.2 = c(3, 4, 5, NA, 7), grade.3 = c(4, 4, 6, 5,
7), grade.4 = c(NA, NA, NA, 5, 7)), class = "data.frame", row.names = c(NA,
-5L))
Here is a base R approach.
df$flag <- c("not_flagged", "flagged")[
apply(df[-1L], 1L, \(x) length( (ux <- unique(x))[!is.na(ux)] ) > 1L) + 1L
]
Output
> df
id grade.1 grade.2 grade.3 grade.4 flag
1 11 3 3 4 NA flagged
2 22 4 4 4 NA not_flagged
3 33 5 5 6 NA flagged
4 44 6 NA 5 5 flagged
5 55 7 7 7 7 not_flagged
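A row needs flagging exactly when its smallest and largest non-missing grades differ, so the check can also be written without apply. The sketch below uses pmin and pmax with na.rm = TRUE and selects the grade columns by their "grade." prefix, which also keeps a pasted Grade column out of the comparison. It assumes every row has at least one non-missing grade; an all-NA row would yield an NA flag.

```r
df <- data.frame(id = c(11, 22, 33, 44, 55),
                 grade.1 = c(3, 4, 5, 6, 7),
                 grade.2 = c(3, 4, 5, NA, 7),
                 grade.3 = c(4, 4, 6, 5, 7),
                 grade.4 = c(NA, NA, NA, 5, 7))

# Only the columns whose names start with "grade." (case-sensitive)
grades <- df[startsWith(names(df), "grade.")]

# Row-wise minimum and maximum, ignoring NA
row_min <- do.call(pmin, c(grades, na.rm = TRUE))
row_max <- do.call(pmax, c(grades, na.rm = TRUE))

df$flag <- ifelse(row_min != row_max, "flagged", "not_flagged")
```

pmin and pmax are vectorised over rows, which tends to matter more on large tables than on this five-row example.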
A possible solution:
library(tidyverse)
df <- data.frame(id = c(11,22,33,44,55),
grade.1 = c(3,4,5,6,7),
grade.2 = c(3,4,5,NA,7),
grade.3 = c(4,4,6,5,7),
grade.4 = c(NA,NA,NA, 5, 7 ))
df %>%
rowwise %>%
mutate(flag = if_else(length(unique(na.omit(c_across(2:5)))) == 1,
"not-flagged", "flagged")) %>% ungroup
#> # A tibble: 5 × 6
#> id grade.1 grade.2 grade.3 grade.4 flag
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 11 3 3 4 NA flagged
#> 2 22 4 4 4 NA not-flagged
#> 3 33 5 5 6 NA flagged
#> 4 44 6 NA 5 5 flagged
#> 5 55 7 7 7 7 not-flagged
Using data.table::uniqueN, that counts the number of unique elements in a vector (and that allows for NA removal):
library(data.table)
library(dplyr)
df %>%
rowwise %>%
mutate(flag = if_else(uniqueN(c_across(2:5), na.rm = TRUE) == 1,
"not-flagged", "flagged")) %>% ungroup
n_distinct from dplyr is very helpful. Here is a version using a combination of pivot_longer and pivot_wider:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(id, Grade),
names_to = "name",
values_to = "value"
) %>%
group_by(id) %>%
mutate(flag = ifelse(n_distinct(value, na.rm = TRUE)==1, "Not flagged", "Flagged")) %>%
pivot_wider(
names_from = name,
values_from = value
)
id Grade flag grade.1 grade.2 grade.3 grade.4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 334NA Flagged 3 3 4 NA
2 22 444NA Not flagged 4 4 4 NA
3 33 556NA Flagged 5 5 6 NA
4 44 6NA55 Flagged 6 NA 5 5
5 55 7777 Not flagged 7 7 7 7
I have two data frames named "df" and "df1". I want to merge df with df1 based on gender and district (dis) so that after merging there is only a single "prob.dis" column. More precisely, if dis is 1, the value from prob.dis1 should be used, and if dis is 5, the value from prob.dis5. Any help would be appreciated.
df<-
age gender dis
10 1 1
11 2 5
10 1 4
11 2 2
10 1 1
11 2 2
10 1 4
11 2 5
10 1 3
11 2 3
df1<-
age gender prob.dis1 prob.dis2 prob.dis3 prob.dis4 prob.dis5
10 1 0.0099 0.0124 0.0037 0.0176 0.1
11 2 0.0021 0.802 0.005 0.0029 0.2
Transform df1 into long format and join it with df on gender and dis:
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(
cols = matches("^prob.dis"),
names_to = c(".value", "dis"),
names_pattern = "([^0-9]+)([0-9]+)"
) %>%
mutate(dis = as.integer(dis)) %>%
select(-age) %>%
left_join(df, ., by = c("gender", "dis"))
age gender dis prob.dis
<dbl> <dbl> <dbl> <dbl>
1 10 1 1 0.0099
2 11 2 5 0.2
3 10 1 4 0.0176
4 11 2 2 0.802
5 10 1 1 0.0099
6 11 2 2 0.802
7 10 1 4 0.0176
8 11 2 5 0.2
9 10 1 3 0.0037
10 11 2 3 0.005
data:
df <- structure(list(age = c(10, 11, 10, 11, 10, 11, 10, 11, 10, 11
), gender = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2), dis = c(1, 5, 4,
2, 1, 2, 4, 5, 3, 3)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
df1 <- structure(list(age = c(10, 11), gender = c(1, 2), prob.dis1 = c(0.0099,
0.0021), prob.dis2 = c(0.0124, 0.802), prob.dis3 = c(0.0037,
0.005), prob.dis4 = c(0.0176, 0.0029), prob.dis5 = c(0.1, 0.2
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame")
)
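Update:
Reshape df2 (here the wide probability table, called df1 in the question) with pivot_longer, then right_join it with df1 (here the table that contains dis, called df in the question).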
library(dplyr)
library(tidyr)
df1 <- df2 %>%
pivot_longer(
cols = starts_with("prob"),
names_to = "dis",
values_to = "prob.dis"
) %>%
mutate(dis = as.numeric(gsub("\\D", "", dis))) %>%
right_join(df1, by = c("age", "gender", "dis"))
Output:
age gender dis prob.dis
<int> <int> <dbl> <dbl>
1 10 1 1 0.0099
2 10 1 1 0.0099
3 10 1 3 0.0037
4 10 1 4 0.0176
5 10 1 4 0.0176
6 11 2 2 0.802
7 11 2 2 0.802
8 11 2 3 0.005
9 11 2 5 0.2
10 11 2 5 0.2
If dis only ranges from 1 to 5 and the prob.dis columns of df1 are sorted in that order, you can use match to find the row and 2 + df$dis to get the column, and then subset with the two-column index matrix produced by cbind. as.matrix keeps the matrix indexing independent of whether df1 is a plain data frame or a tibble.
df$prop.dis <- as.matrix(df1)[cbind(match(df$gender, df1$gender), 2 + df$dis)]
df
# age gender dis prop.dis
#1 10 1 1 0.0099
#2 11 2 5 0.2000
#3 10 1 4 0.0176
#4 11 2 2 0.8020
#5 10 1 1 0.0099
#6 11 2 2 0.8020
#7 10 1 4 0.0176
#8 11 2 5 0.2000
#9 10 1 3 0.0037
#10 11 2 3 0.0050
or using the names of df1 to match the columns:
df$prop.dis <- as.matrix(df1)[cbind(match(df$gender, df1$gender)
, match(paste0("prob.dis", df$dis), names(df1)))]
If age should also be matched, additionally use interaction:
M <- c("age", "gender")
df$prop.dis <- as.matrix(df1)[cbind(match(interaction(df[M]), interaction(df1[M])), 2 + df$dis)]
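To make the matrix-indexing idea concrete, here is a small self-contained sketch. It looks up one (row, column) pair per row of df: the row comes from matching age and gender, the column from matching the name "prob.dis" followed by dis. The first three rows of df are used as demo data, and paste is used for the key here instead of interaction purely for illustration.

```r
# Illustrative data: first three rows of df, full df1
df <- data.frame(age = c(10, 11, 10), gender = c(1, 2, 1), dis = c(1, 5, 4))
df1 <- data.frame(age = c(10, 11), gender = c(1, 2),
                  prob.dis1 = c(0.0099, 0.0021), prob.dis2 = c(0.0124, 0.802),
                  prob.dis3 = c(0.0037, 0.005), prob.dis4 = c(0.0176, 0.0029),
                  prob.dis5 = c(0.1, 0.2))

# Row of df1 that matches each row of df on age and gender
rows <- match(paste(df$age, df$gender), paste(df1$age, df1$gender))
# Column of df1 that holds the probability for each row's dis value
cols <- match(paste0("prob.dis", df$dis), names(df1))

# A two-column matrix index picks exactly one cell per (row, col) pair
probs <- as.matrix(df1)
df$prob.dis <- probs[cbind(rows, cols)]
```

Matching columns by name rather than by position means the lookup does not depend on the order of the prob.dis columns in df1.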
I have a dataframe with several columns containing list columns that I want to unnest (or unchop). BUT, they are different lengths, so the resulting error is Error: No common size for...
Here is a reprex to show what works and doesn't work.
library(tidyr)
library(vctrs)
# This works as expected
df_A <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9)))
)
unchop(df_A, cols = c(A))
# A tibble: 7 x 2
ID A
<int> <dbl>
1 1 9
2 1 8
3 1 5
4 2 7
5 2 6
6 3 6
7 3 9
# This works as expected as the lists are the same lengths
df_AB_1 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2, 3), c(4, 5), c(7, 8)))
)
unchop(df_AB_1, cols = c(A, B))
# A tibble: 7 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 3
4 2 7 4
5 2 6 5
6 3 6 7
7 3 9 8
# This does NOT work as the lists are different lengths
df_AB_2 <- tibble(
ID = 1:3,
A = as_list_of(list(c(9, 8, 5), c(7,6), c(6, 9))),
B = as_list_of(list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0)))
)
unchop(df_AB_2, cols = c(A, B))
# Error: No common size for `A`, size 3, and `B`, size 2.
The output that I would like to achieve for df_AB_2 above is as follows where each list is unchopped and missing values are filled with NA:
# A tibble: 10 x 3
ID A B
<dbl> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
I have referenced this issue on GitHub and StackOverflow here.
Any ideas how to achieve the result above?
Versions
> packageVersion("tidyr")
[1] ‘1.0.0’
> packageVersion("vctrs")
[1] ‘0.2.0.9001’
Here is an idea via dplyr that you can generalise to as many columns as you want,
library(tidyverse)
df_AB_2 %>%
pivot_longer(c(A, B)) %>%
mutate(value = lapply(value, `length<-`, max(lengths(value)))) %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(cols = c(A, B)) %>%
filter(rowSums(is.na(.[-1])) != 2)
which gives,
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Defining a helper function to update the lengths of the element and proceeding with dplyr:
foo <- function(x, len_vec) {
lapply(
seq_len(length(x)),
function(i) {
length(x[[i]]) <- len_vec[i]
x[[i]]
}
)
}
df_AB_2 %>%
mutate(maxl = pmax(lengths(A), lengths(B))) %>%
mutate(A = foo(A, maxl), B = foo(B, maxl)) %>%
unchop(cols = c(A, B)) %>%
select(-maxl)
# A tibble: 10 x 3
ID A B
<int> <dbl> <dbl>
1 1 9 1
2 1 8 2
3 1 5 NA
4 2 7 4
5 2 6 5
6 2 NA 6
7 3 6 7
8 3 9 8
9 3 NA 9
10 3 NA 0
Using data.table:
library(data.table)
setDT(df_AB_2)
df_AB_2[, maxl := pmax(lengths(A), lengths(B))]
df_AB_2[, .(A = unlist(A)[seq_len(maxl)], B = unlist(B)[seq_len(maxl)]), by = ID]
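The padding step that all of the answers above share can also be sketched in base R alone: extend each pair of vectors to the longer of the two lengths (assigning to length() fills with NA), then repeat each ID that many times. The vectors below copy the question's example; pad is a hypothetical helper name.

```r
ID <- 1:3
A <- list(c(9, 8, 5), c(7, 6), c(6, 9))
B <- list(c(1, 2), c(4, 5, 6), c(7, 8, 9, 0))

# Target length per ID: the longer of the two list elements
n <- pmax(lengths(A), lengths(B))

# Hypothetical helper: extend a vector to `len`, filling with NA
pad <- function(v, len) {
  length(v) <- len
  v
}

out <- data.frame(
  ID = rep(ID, times = n),
  A = unlist(Map(pad, A, n)),
  B = unlist(Map(pad, B, n))
)
```

For more than two list columns, n would become the row-wise maximum over all of their lengths, and the same pad call would be applied to each column.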