Merging different data frames in R to eliminate NAs - r

I'm currently working on a longitudinal database in R. It contains a lot of missing values, because variables whose values have not changed since the last interview are not recorded again in the new wave. For example, sex is recorded as boy or girl in the first wave and does not change between the first and second waves, so it is not recorded again in the second wave.
Basically, I would like to merge the data I have selected for the second wave with the data from the first wave, in order to eliminate some NAs. However, I would like to keep only the columns I have selected from the second wave. So far, after searching the internet, I have only been able to merge the two datasets; I have not managed to keep only the columns from the second wave.
Here is my code:
library("rqdatatable")
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
final <- natural_join(x, y,
by = "ID",
jointype = "LEFT")
What I would like to get after my merge is:
z = data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, 5,0),
S2 = c(2, 0, 2,2)
)
Do you have any idea of how I can solve my problem?
It would be very time consuming to merge everything and to select the variables I want again.
Many thanks and best regards!

Here is a base R function that joins the data as in the question. It can also be called via a pipe, in this case R's native pipe operator, introduced in R 4.1.
x <- data.frame(
ID = c(1,2,3,4),
S1 = c(1, 3, NA,0),
S2 = c(2, NA, 2,2)
)
y <- data.frame(
ID = c(1, 2, 3, 4,5,6,7,8),
S1 = c(1, 2, 5, 1,3,6,8,2),
S3 = c(3, 3, 3, 3,7,1,6,9),
S2 = c(0,0,0,0,0,0,0,0),
S4 = c(0,0,0,0,0,0,0,0)
)
joinSpecial <- function(x, y, idcol = "ID"){
idcolx <- which(names(x) == idcol)
idcoly <- which(names(y) == idcol)
idx <- which(names(x) %in% names(y))
idy <- which(names(y) %in% names(x))
idx <- idx[idx != idcolx]
idy <- idy[idy != idcoly]
i <- match(x[[idcolx]], y[[idcoly]])
x[idx] <- mapply(\(a, b, i){
na <- is.na(a)
a[na] <- b[i][na]
a
}, x[idx], y[idy], MoreArgs = list(i = i), SIMPLIFY = FALSE)
x
}
joinSpecial(x, y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
x |> joinSpecial(y)
#> ID S1 S2
#> 1 1 1 2
#> 2 2 3 0
#> 3 3 5 2
#> 4 4 0 2
Created on 2022-03-18 by the reprex package (v2.0.1)

We could use inner_join in combination with coalesce
library(dplyr)
x %>%
inner_join(y, by="ID") %>%
mutate(S1 = coalesce(S1.x, S1.y),
S2 = coalesce(S2.x, S2.y)) %>%
select(ID, S1, S2)
ID S1 S2
1 1 1 2
2 2 3 0
3 3 5 2
4 4 0 2
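The coalesce step above names each column by hand, which gets tedious with many variables. As a minimal sketch in base R (the helper name fill_from is made up for illustration and is not from any package), the same fill can be looped over every column the two data frames share, assuming the ID values are unique in y:

```r
x <- data.frame(ID = c(1, 2, 3, 4),
                S1 = c(1, 3, NA, 0),
                S2 = c(2, NA, 2, 2))
y <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                S1 = c(1, 2, 5, 1, 3, 6, 8, 2),
                S3 = c(3, 3, 3, 3, 7, 1, 6, 9),
                S2 = rep(0, 8),
                S4 = rep(0, 8))

# Hypothetical helper: fill NAs in x from the same-named columns of y,
# matching rows on the id column and keeping only the columns of x.
fill_from <- function(x, y, id = "ID") {
  shared <- setdiff(intersect(names(x), names(y)), id)
  rows <- match(x[[id]], y[[id]])  # row of y for each row of x (NA if absent)
  for (col in shared) {
    missing <- is.na(x[[col]])
    x[[col]][missing] <- y[[col]][rows[missing]]
  }
  x
}

filled <- fill_from(x, y)
filled
```

Rows of x whose ID does not occur in y simply keep their NA, and columns that exist only in y (S3 and S4 here) are never added.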

Related

In R, x and y are arrays (time series data), how to calculate cor() for trend correlation?

I want to see the correlation of two time series datasets.
df <- data.frame(
row.names = paste0("s", 1:5),
R_T1 = 1:5, R_T2 = 2:6, R_T3 = 3:7,
P_T1 = 4:8, P_T2 = 5:9, P_T3 = 6:10
)
That looks like Table 1
R_T1 R_T2 R_T3 P_T1 P_T2 P_T3
s1 1 2 3 4 5 6
s2 2 3 4 5 6 7
s3 3 4 5 6 7 8
s4 4 5 6 7 8 9
s5 5 6 7 8 9 10
or Table 2
R P
s1 T1 1 4
s1 T2 2 5
s1 T3 3 6
s2 T1 2 5
s2 T2 3 6
s2 T3 4 7
s1-s5 are sample names; R and P are two variables, and each variable has 3 observations (one per time point).
What I want to calculate is cor(c(R_T1,R_T2,R_T3), c(P_T1,P_T2,P_T3)) for each sample.
For example, for s1: cor(c(1,2,3), c(4,5,6)), but not cor(R_T1,P_T1), cor(R_T2,P_T2), ... Is the second table clearer?
The purpose is to calculate the trend correlation of R and P.
How can I achieve this?
1) The question indicates you want the correlation between R and P for each sample so we are looking for 5 correlations corresponding to the 5 samples. Create two data frames each of which has one column per sample with rows corresponding to time. Then use mapply to get the correlations of the first column of R with the first column of P, the second column of R with the second column of P, etc.
isR <- startsWith(names(df), "R")
R <- as.data.frame(t(df[isR]))
P <- as.data.frame(t(df[!isR]))
mapply(cor, R, P)
## s1 s2 s3 s4 s5
## 1 1 1 1 1
2) It could also be written like this:
spl <- split(as.data.frame(t(df)), startsWith(names(df), "R"))
do.call("mapply", c(cor, unname(spl)))
## s1 s2 s3 s4 s5
## 1 1 1 1 1
3) or using pipes:
df |>
t() |>
as.data.frame() |>
list(. = _) |>
with(split(., startsWith(rownames(.), "R"))) |>
unname() |>
c(FUN = cor) |>
do.call(what = "mapply")
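The three variants above all pair the R series of a sample with its P series. A row-wise sketch of the same idea with apply may be easier to read; it assumes the third P column is named P_T3, as in the formula in the question, and that the R columns can be recognised by the "R_" prefix:

```r
df <- data.frame(
  row.names = paste0("s", 1:5),
  R_T1 = 1:5, R_T2 = 2:6, R_T3 = 3:7,
  P_T1 = 4:8, P_T2 = 5:9, P_T3 = 6:10
)

# TRUE for the R columns, FALSE for the P columns
is_r <- startsWith(names(df), "R_")

# One correlation per sample: R over time against P over time
trend_cor <- apply(as.matrix(df), 1, function(row) cor(row[is_r], row[!is_r]))
trend_cor
```

This relies on the R and P columns being in the same time order. A sample whose series is constant over time has zero variance, so cor() returns NA for it (with a warning).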
First of all, hi!
I should first point out that this is not the proper way to present your data or your question!
If I understood your data structure, it should look like the following,
df<-structure(list(R_T1 = c(1, 2, 3, 4, 5),
R_T2 = c(2, 3, 4, 5, 6),
R_T3 = c(3, 4, 5, 6, 7),
P_T1 = c(4, 5, 6, 7, 8),
P_T2 = c(5, 6, 7, 8, 9),
P_T3 = c(6, 7, 8, 9, 10)),
row.names = c("s1", "s2", "s3", "s4", "s5"),
class = "data.frame")
Then (again if I understood correctly), you want the correlation of R vs. P for each time point!
CorrelationT1 <- cor(df$R_T1,df$P_T1)
CorrelationT2 <- cor(df$R_T2,df$P_T2)
CorrelationT3 <- cor(df$R_T3,df$P_T3)
The problem here is that your data are highly correlated, so to give you some more random data to check against, please see below,
dfrnorm<-structure(list(R_T1 = rnorm(5),
R_T2 = rnorm(5),
R_T3 = rnorm(5),
P_T1 = rnorm(5),
P_T2 = rnorm(5),
P_T3 = rnorm(5)),
row.names = c("s1", "s2", "s3", "s4", "s5"),
class = "data.frame")
With the respective correlations,
CorrelationT1 <- cor(dfrnorm$R_T1,dfrnorm$P_T1)
CorrelationT2 <- cor(dfrnorm$R_T2,dfrnorm$P_T2)
CorrelationT3 <- cor(dfrnorm$R_T3,dfrnorm$P_T3)
And a simple plot could be like this,
plot(1:3,
c(CorrelationT1,CorrelationT2,CorrelationT3),
xlab="Time Points", ylab="Correlation")
I hope this will help you,
Cheers

Creating group ids by comparing values of two variables across rows: in R

I have a dataframe with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the dataframe.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines starting with enframe use tibble/tidyverse. To match the group numbers, I re-sorted the results too.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
vals <- groups[ind]
groups <- c(
setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
groups[!ind]
)
}
min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)
lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
enframe() %>%
unnest(col = value) %>%
select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17
The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
x <- array(0L, dim = nrow(df))
count <- 0L
my_seq <- seq_len(nrow(df))
for (i in my_seq) {
if(!df[i,]$start %in% df$end) {
x[i] <- my_seq[i] + count
} else {
x[i] <- my_seq[i]-1L + count
count <- count - 1L
}
}
x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
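The loop in identifier() amounts to taking the row number and subtracting a running count of the rows whose start appears somewhere in end. As a sketch, that can be written without a loop (identifier_vec is a name chosen here for illustration); like the loop version, it depends on the rows already being in the desired order:

```r
# Vectorised equivalent of the loop: row number minus the cumulative
# number of rows (up to and including this one) whose start is found in end.
identifier_vec <- function(df) {
  seq_len(nrow(df)) - cumsum(df$start %in% df$end)
}

toy_data <- data.frame(start = c(1, 10, 5, 6, 16),
                       end = c(10, 15, 9, 11, 17))
id1 <- identifier_vec(toy_data)
id1
# [1] 1 1 2 3 4

other <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
                    end = c(16, 2, 21, 2, 2, 2, 3, 1))
id2 <- identifier_vec(other)
id2
# [1] 0 0 0 1 1 1 2 2
```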

Bind dataframes in a list two by two (or by name) - R

Let's say I have this list of dataframes:
DF1_A<- data.frame (first_column = c("A", "B","C"),
second_column = c(5, 5, 5),
third_column = c(1, 1, 1)
)
DF1_B <- data.frame (first_column = c("A", "B","E"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_A <- data.frame (first_column = c("E", "F","G"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_B <- data.frame (first_column = c("K", "L","B"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
mylist <- list(DF1_A, DF1_B, DF2_A, DF2_B)
names(mylist) = c("DF1_A", "DF1_B", "DF2_A", "DF2_B")
mylist = lapply(mylist, function(x){
x[, "first_column"] <- as.character(x[, "first_column"])
x
})
I want to bind them by their name (all DF1, all DF2, etc.) or, equivalently, two by two in this ordered named list. Keeping the "named list structure" of the list is important to keep track of things (for example, DF1_A and DF1_B = DF1, or something similar, in names(mylist)).
Some rows have duplicated values, and I want to keep them (which will introduce some duplicated entries, such as first_column value A).
I have tried finding clues here on Stack Overflow, but most people want to bind dataframes irrespective of their names or order.
Final result would look something like this:
mylist
DF1
DF2
DF1
first_column second_column third_column
A 1 1
A 5 1
B 1 1
B 5 1
C 5 1
E 5 1
Do you mean something like this?
lapply(
split(mylist, gsub("_.*", "", names(mylist))),
function(v) `row.names<-`((out <- do.call(rbind, v))[do.call(order, out), ], NULL)
)
which gives
$DF1
first_column second_column third_column
1 A 1 1
2 A 5 1
3 B 1 1
4 B 5 1
5 C 5 1
6 E 5 1
$DF2
first_column second_column third_column
1 B 5 1
2 E 1 1
3 F 1 1
4 G 5 1
5 K 1 1
6 L 1 1
Here is a solution with Map, but it only works for two suffixes. If you want to merge, use the first Map instruction; if you want to keep duplicates, use the 2nd, rbind solution.
sp <- split(mylist, sub("^DF.*_", "", names(mylist)))
res1 <- Map(function(x, y)merge(x, y, all = TRUE), sp[["A"]], sp[["B"]])
res2 <- Map(function(x, y)rbind(x, y), sp[["A"]], sp[["B"]])
names(res1) <- sub("_.*$", "", names(res1))
names(res2) <- sub("_.*$", "", names(res2))
One of many obligatory tidyverse solutions can be this.
library(purrr)
library(stringr)
library(dplyr) # for bind_rows()
# find the unique DF names
unique_df <- set_names(unique(str_split_fixed(names(mylist), "_", 2)[,1]))
# loop over each unique name, extracting the matching elements and stacking their rows
purrr::map(unique_df, ~ keep(mylist, str_starts(names(mylist), .x))) %>%
map(bind_rows)
Also for things like this, bind_rows() from dplyr has a .id argument which will add a column with the list element name, and stack the rows. That can also be a helpful way. You can bind, manipulate the name how you'd like, and then split().
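The bind, rename, then split idea can be sketched in base R as well. This is an analogue of the bind_rows(.id = ...) approach rather than that function itself; the column names source and group are chosen here for illustration:

```r
mylist <- list(
  DF1_A = data.frame(first_column = c("A", "B", "C"),
                     second_column = c(5, 5, 5), third_column = c(1, 1, 1)),
  DF1_B = data.frame(first_column = c("A", "B", "E"),
                     second_column = c(1, 1, 5), third_column = c(1, 1, 1)),
  DF2_A = data.frame(first_column = c("E", "F", "G"),
                     second_column = c(1, 1, 5), third_column = c(1, 1, 1)),
  DF2_B = data.frame(first_column = c("K", "L", "B"),
                     second_column = c(1, 1, 5), third_column = c(1, 1, 1))
)

# Tag every data frame with the name of its list element, then stack them
tagged <- Map(function(d, nm) data.frame(source = nm, d), mylist, names(mylist))
stacked <- do.call(rbind, tagged)
rownames(stacked) <- NULL

# Derive the group from the name (everything before the first underscore)
stacked$group <- sub("_.*$", "", stacked$source)

# Split back into one data frame per group, dropping the helper columns
keep_cols <- setdiff(names(stacked), c("source", "group"))
by_group <- split(stacked[keep_cols], stacked$group)
names(by_group)
```

Duplicated rows are kept, since nothing here merges or deduplicates. The intermediate stacked data frame is also a convenient place to sort or filter before splitting.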

Adding a column based on a list dplyr

I am trying to summarise a list of dataframes. Here is some test data
noms <- list('A', 'B')
A_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(3, 2, 6, 7))
B_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(2, 6, 3, 6))
frames <- list(A_data, B_data)
I want to produce a list of data frames where 'adj' is summed for each 'Dis' group, and then add a column for the relevant name from 'noms' so I can then combine the data frames together to form a single dataframe in the future.
So far I have this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)))
,paste0(unlist(noms)))
But I can't figure out how to add a column with the relevant name. I know I need to use the mutate function, something like so:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)) %>%
dplyr::mutate(nom = )
,paste0(unlist(noms)))
but I can't figure out how to add the correct name.
The expected output would be a list of two dataframes one for 'A' and one for 'B'. Here is the expected output for 'A':
Dis total Nom
1 1 5 A
2 2 13 A
How do I do this?
A base R option where we use Map instead of lapply
out <- Map(function(x, y) {
transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)
out
#[[1]]
# Dis adj Nom
#1 1 5 A
#2 2 13 A
#[[2]]
# Dis adj Nom
#1 1 8 B
#2 2 9 B
The same idea with tidyverse functions
library(purrr); library(dplyr)
map2(.x = frames, .y = noms, ~ .x %>%
group_by(Dis) %>%
summarise(adj = sum(adj)) %>%
mutate(Nom = .y))
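Since the question mentions combining the per-name results into a single data frame later, here is a short sketch of that last step in base R, reusing the Map/aggregate approach from above (the object name combined is chosen here):

```r
noms <- list("A", "B")
A_data <- data.frame(Dis = c(1, 1, 2, 2), adj = c(3, 2, 6, 7))
B_data <- data.frame(Dis = c(1, 1, 2, 2), adj = c(2, 6, 3, 6))
frames <- list(A_data, B_data)

# Sum adj within each Dis group and tag each result with its name
out <- Map(function(x, y) {
  transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, frames, noms)

# Stack the tagged summaries into one data frame
combined <- do.call(rbind, out)
combined
```

Because every summary already carries its Nom column, the stacked result keeps track of which rows came from which input.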

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.
Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
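The step of 3 above is hard-coded to this example. As a sketch, the block width can instead be derived from the number of distinct non-id column names, assuming the first column is the id and every set of columns repeats in the same order the same number of times:

```r
input <- data.frame(
  id = 1:2,
  question = c("a", "b"),
  points = 0,
  max_points = c(3, 5),
  question = c("c", "d"),
  points = c(0, 20),
  max_points = c(5, 20),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Number of columns in one repeated block (question, points, max_points)
width <- length(unique(names(input)[-1]))

# First column index of each block
starts <- seq(2, ncol(input), by = width)

# Take id plus one block at a time, then stack the pieces
blocks <- lapply(starts, function(i) input[, c(1, i:(i + width - 1))])
long <- do.call(rbind, blocks)
long
```

If the blocks are not all the same width or order, the column names of the pieces will not line up and rbind will stop with an error rather than silently misalign the data.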
The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.
Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20
You might need to clarify how you want the ID column to be handled but perhaps something like this ?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
I'm not sure how robust it is with respect to handling different numbers of repeated columns, but it works for your test data and should work if your columns are each repeated the same number of times.
