Adding a column based on a list dplyr - r

I am trying to summarise a list of dataframes. Here is some test data
noms <- list('A', 'B')
A_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(3, 2, 6, 7))
B_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(2, 6, 3, 6))
frames <- list(A_data, B_data)
I want to produce a list of data frames where 'adj' is summed for each 'Dis' group, and then add a column with the relevant name from 'noms', so that I can later combine the data frames into a single dataframe.
So far I have this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)))
,paste0(unlist(noms)))
But I can't figure out how to add a column with the relevant name. I know I need to use the mutate function, something like so:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)) %>%
dplyr::mutate(nom = )
,paste0(unlist(noms)))
but I can't figure out how to add the correct name.
The expected output would be a list of two dataframes one for 'A' and one for 'B'. Here is the expected output for 'A':
  Dis total Nom
1   1     5   A
2   2    13   A
How do I do this?

A base R option where we use Map instead of lapply
out <- Map(function(x, y) {
transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)
out
#[[1]]
# Dis adj Nom
#1 1 5 A
#2 2 13 A
#[[2]]
# Dis adj Nom
#1 1 8 B
#2 2 9 B
The same idea with tidyverse functions
library(purrr); library(dplyr)
map2(.x = frames, .y = noms, ~ .x %>%
group_by(Dis) %>%
summarise(adj = sum(adj)) %>%
mutate(Nom = .y))
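The question mentions combining the per-frame summaries into a single data frame later. A minimal base R sketch of that last step, reusing the Map() approach from the answer above. The names per_frame and combined are only illustrative and are not part of the original answer:

```r
noms <- list("A", "B")
frames <- list(
  data.frame(Dis = c(1, 1, 2, 2), adj = c(3, 2, 6, 7)),
  data.frame(Dis = c(1, 1, 2, 2), adj = c(2, 6, 3, 6))
)

# Summarise each frame and tag it with its name (same idea as the Map() answer)
per_frame <- Map(function(x, y) {
  transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)

# Stack the per-frame summaries into one data frame
combined <- do.call(rbind, per_frame)
rownames(combined) <- NULL
combined
```

With the tidyverse version, dplyr::bind_rows() on the list returned by map2() should play the same role as do.call(rbind, ...) here.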

Related

Merging different data frames in R to eliminate NAs

I'm currently working on a longitudinal database in R. It has a lot of missing values, because variables whose values have not changed since the last interview are not recorded again in the new wave. For example, sex is recorded as boy or girl in the first wave and does not change between the first and second wave, so it is not given again in the second wave.
Basically, I would like to merge the data I have selected for the second wave with the data from the first wave in order to eliminate some NAs, while keeping only the columns I selected from the second wave. So far, after looking on the internet, I have been able to merge the two datasets, but I'm not able to keep only the data from the second wave.
Here is my code:
library("rqdatatable")
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
final <- natural_join(x, y,
by = "ID",
jointype = "LEFT")
What I would like to get after my merge is:
z = data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, 5,0),
S2 = c(2, 0, 2,2)
)
Do you have any idea of how I can solve my problem?
It would be very time consuming to merge everything and to select the variables I want again.
Many thanks and best regards!
Here is a base R function that joins the data as described in the question. It can also be called via a pipe, in this case R's native pipe operator |>, introduced in R 4.1.
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
joinSpecial <- function(x, y, idcol = "ID"){
idcolx <- which(names(x) == idcol)
idcoly <- which(names(y) == idcol)
idx <- which(names(x) %in% names(y))
idy <- which(names(y) %in% names(x))
idx <- idx[idx != idcolx]
idy <- idy[idy != idcoly]
i <- match(x[[idcolx]], y[[idcoly]])
x[idx] <- mapply(\(a, b, i){
na <- is.na(a)
a[na] <- b[i][na]
a
}, x[idx], y[idy], MoreArgs = list(i = i), SIMPLIFY = FALSE)
x
}
joinSpecial(x, y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
x |> joinSpecial(y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
Created on 2022-03-18 by the reprex package (v2.0.1)
We could use inner_join in combination with coalesce
library(dplyr)
x %>%
inner_join(y, by="ID") %>%
mutate(S1 = coalesce(S1.x, S1.y),
S2 = coalesce(S2.x, S2.y)) %>%
select(ID, S1, S2)
ID S1 S2
1 1 1 2
2 2 3 0
3 3 5 2
4 4 0 2
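The coalesce() call above names S1 and S2 by hand, which the question wanted to avoid for wide data. A minimal base R sketch of the same "take x, fall back to y" idea applied to every shared column. The helper name fill_from is hypothetical, and it assumes the ID values are unique in y:

```r
x <- data.frame(ID = c(1, 2, 3, 4),
                S1 = c(1, 3, NA, 0),
                S2 = c(2, NA, 2, 2))
y <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                S1 = c(1, 2, 5, 1, 3, 6, 8, 2),
                S3 = c(3, 3, 3, 3, 7, 1, 6, 9),
                S2 = c(0, 0, 0, 0, 0, 0, 0, 0),
                S4 = c(0, 0, 0, 0, 0, 0, 0, 0))

# Hypothetical helper: fill NAs in x from the matching row and column of y
fill_from <- function(x, y, id = "ID") {
  shared <- setdiff(intersect(names(x), names(y)), id)  # columns in both
  rows <- match(x[[id]], y[[id]])                       # row of y for each row of x
  for (col in shared) {
    gaps <- is.na(x[[col]])
    x[[col]][gaps] <- y[[col]][rows[gaps]]
  }
  x
}

z <- fill_from(x, y)
z
```

Only the columns of x are kept, and an ID that does not appear in y simply leaves the NA in place.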

Creating group ids by comparing values of two variables across rows: in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines, starting with enframe, use tibble/tidyverse. To match the group numbers, I re-sorted the results too.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
vals <- groups[ind]
groups <- c(
setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
groups[!ind]
)
}
min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)
lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
enframe() %>%
unnest(col = value) %>%
select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17
The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
x <- array(0L, dim = nrow(df))
count <- 0L
my_seq <- seq_len(nrow(df))
for (i in my_seq) {
if(!df[i,]$start %in% df$end) {
x[i] <- my_seq[i] + count
} else {
x[i] <- my_seq[i]-1L + count
count <- count - 1L
}
}
x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
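For comparison, here is a minimal base R sketch that treats the problem as merging linked rows into groups: two rows are linked when the start of one equals the end of the other. This is not the method of either answer above, and it assumes that different groups never share the same smallest start value, since groups are numbered by that value:

```r
toy_data <- data.frame(start = c(1, 5, 6, 10, 16),
                       end   = c(10, 9, 11, 15, 17))
n <- nrow(toy_data)
group <- seq_len(n)  # every row starts out in its own group

for (i in seq_len(n)) {
  # rows (other than i) whose end equals the start of row i
  linked <- setdiff(which(toy_data$end == toy_data$start[i]), i)
  for (j in linked) {
    group[group == group[i]] <- group[j]  # merge row i's group into row j's
  }
}

# Renumber the groups by their smallest start value, then sort for display
first_start <- ave(toy_data$start, group, FUN = min)
toy_data$NEW_VAR <- match(first_start, sort(unique(first_start)))
result <- toy_data[order(toy_data$NEW_VAR, toy_data$start), ]
rownames(result) <- NULL
result
```

On the toy data this reproduces the NEW_VAR column from the question (1, 1, 2, 3, 4 after sorting).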

dplyr select column based on string match

I want to order the columns of a data frame by string matches.
library(dplyr)
data <- data.frame(start_a = 1,
start_f = 3,
end_a = 5,
end_f = 7,
middle_a= 9,
middle_f = 11)
For example I want to select start_f, start_a, middle_f, middle_a, end_f ,end_a
I am attempting to do so with data %>% select(matches("(start|middle|end)_(f|a)")), so that the order I have typed within matches is the order in which the columns are selected.
Desired output would be data[c(2,1,6,5,4,3)]
You can construct the columns in the order that you want with outer.
order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
cols <- c(t(outer(order1, order2, paste, sep = '_')))
cols
#[1] "start_f" "start_a" "middle_f" "middle_a" "end_f" "end_a"
data[cols]
# start_f start_a middle_f middle_a end_f end_a
#1 3 1 11 9 7 5
If not all combinations of order1 and order2 are present in the data we can use any_of which will select only the columns present in data without giving any error.
library(dplyr)
data %>% select(any_of(cols))
To select based on pattern in names.
order1 <- c('start', 'middle', 'end')
order2 <- c('f', 'a')
pattern <- c(t(outer(order1, order2, function(x, y) sprintf('^%s_%s.*', x, y))))
pattern
#[1] "^start_f.*" "^start_a.*" "^middle_f.*" "^middle_a.*" "^end_f.*" "^end_a.*"
cols <- names(data)
data[sapply(pattern, function(x) grep(x, cols))]
# start_f start_a middle_f middle_a end_f end_a
#1 3 1 11 9 7 5
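Another way to look at it is to sort the existing column names by the position of their prefix and suffix in the desired order, instead of building the names. A minimal sketch, assuming every column name has the form prefix_suffix with a single underscore. The variable names prefix, suffix and ord are only illustrative:

```r
data <- data.frame(start_a = 1, start_f = 3,
                   end_a = 5, end_f = 7,
                   middle_a = 9, middle_f = 11)

order1 <- c("start", "middle", "end")
order2 <- c("f", "a")

prefix <- sub("_.*$", "", names(data))  # part before the underscore
suffix <- sub("^.*_", "", names(data))  # part after the underscore

# Sort columns by prefix rank first, then by suffix rank
ord <- order(match(prefix, order1), match(suffix, order2))
reordered <- data[ord]
names(reordered)
#[1] "start_f"  "start_a"  "middle_f" "middle_a" "end_f"    "end_a"
```

Columns whose prefix or suffix is not listed get NA from match() and are placed last by order().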

How to iterate across multiple columns from different tibbles and assign unique names to result using tidy/purrr

I'm trying to make 4 objects with unique names that hold the output from multiple rbinom runs. Each set of rbinom runs uses different probabilities taken from a specific column of a different tibble. This is how I did it manually and I just want to recreate this with an iterative method.
collection_var_A <- rbinom(size = 20, n = 4 ,prob = (probs_tbl_A$value))
collection_var_B <- rbinom(size = 20, n = 4, prob = (probs_tbl_B$value))
collection_var_C <- rbinom(size = 20, n = 4, prob = (probs_tbl_C$value))
collection_var_D <- rbinom(size = 20, n = 4, prob = (probs_tbl_D$value))
The tibbles that contain the different $value columns look something like this:
probs_tbl_A <- tibble(
value = c(.56, .76, .85, .68), other = "other_stuff")
probs_tbl_B <- tibble(
value = c(.66, .72, .45, .39), other = "other_stuff")
probs_tbl_C <- tibble(
value = c(.56, .76, .85, .68), other = "other_stuff")
probs_tbl_D <- tibble(
value = c(.66, .72, .45, .39), other = "other_stuff")
I can get map() or map2() to iterate over the rbinom portion properly but it stores the output as a single list. I can't get it to iterate and simultaneously assign unique object names. I have tried supplying a list of the desired object names in different ways.
I feel like I'm missing a very easy way to do this.
You can put the tibbles in a list and then use map
list_df <- mget(ls(pattern = "probs_tbl_.*"))
list_output <- purrr::map(list_df, ~rbinom(size = 20, n = 4 ,prob = (.$value)))
list_output
#$probs_tbl_A
#[1] 9 14 18 11
#$probs_tbl_B
#[1] 10 16 11 7
#$probs_tbl_C
#[1] 16 15 16 15
#$probs_tbl_D
#[1] 13 15 8 8
It will return a list of vectors. If you want them as separate objects, you can do
names(list_output) <- paste0("collection_var_", LETTERS[seq_along(list_output)])
list2env(list_output, .GlobalEnv)
map can also be replaced with lapply to keep it in base R
lapply(list_df, function(x) rbinom(size = 20, n = 4 ,prob = (x$value)))
We can use tidyverse methods
library(tidyverse)
map(mget(paste0("probs_tbl_", LETTERS[1:4])), ~ .x %>%
pull(value) %>%
rbinom(size = 20, n = 4, prob = .))
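If separate global objects are not strictly required, keeping the draws in one named list is often easier to work with. A minimal base R sketch, using plain data frames in place of the tibbles and only two of the four tables to stay short. The names probs and draws are illustrative, and set.seed() is only there to make the run repeatable:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Probability tables collected in a named list instead of separate objects
probs <- list(
  A = data.frame(value = c(.56, .76, .85, .68), other = "other_stuff"),
  B = data.frame(value = c(.66, .72, .45, .39), other = "other_stuff")
)

# One rbinom() call per table, one draw per probability
draws <- lapply(probs, function(tbl) {
  rbinom(n = nrow(tbl), size = 20, prob = tbl$value)
})
names(draws) <- paste0("collection_var_", names(probs))

draws$collection_var_A  # access one result by name
```

Each element is then reachable by name, and list2env(draws, .GlobalEnv) can still be called afterwards if individual objects are needed.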

Select data frame values row-wise using a variable of column names

Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")
My favourite old matrix-indexing will take care of this. Just pass a 2-column matrix with the respective row/column index:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
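To see why this works, it can help to print the index matrix itself: each of its rows is one (row, column) pair to extract. A minimal sketch of the same technique. The names idx and picked are illustrative, and the stopifnot() guard is an addition that fails early if colname contains a name that is not a column of dframe:

```r
dframe <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname <- c("x", "y", "x")

# Two-column index matrix: row number, column position
idx <- cbind(seq_len(nrow(dframe)), match(colname, names(dframe)))
idx
#     [,1] [,2]
#[1,]    1    1
#[2,]    2    2
#[3,]    3    1

stopifnot(!anyNA(idx))  # every name in colname must exist in dframe

picked <- dframe[idx]
picked
#[1] 1 5 3
```

Note that indexing a data frame with a matrix goes through as.matrix(), so columns of mixed types would be coerced to a common type first.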
You can use mapply to pass the row number (of dframe) and the column name (for each row) to a function that returns the corresponding value:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x,y)dframe[x,y],1:nrow(dframe), colname)
#[1] 1 5 3
The next option may not be very intuitive, but if you want a solution in a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
left_join(dframe %>% rownames_to_column() %>%
gather(colname, value, -rowname),
by = c("rowname", "colname" )) %>%
select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3
