How to merge a single measurement into a dataframe of multiple measurements in R

I have a long dataframe of multiple measurements per ID, at different time points for variables BP1 and BP2.
ID <- c(1,1,1,2,2,2,3,3,4)
Time <- c(56,57,58,61,62,64,66,67,72)
BP1 <- c(70,73,73,74,75,76,74,74,70)
BP2 <- c(122,122,123,126,124,121,130,132,140)
df1 <- data.frame(ID, Time, BP1, BP2)
I would like to merge another dataframe (df2), which contains a single measurement for BP1 and BP2 per ID.
ID <- c(1,2,3,4)
Time <- c(55, 60, 65, 70)
BP1 <- c(70, 72, 73, 74)
BP2 <- c(120, 124, 130, 134)
df2 <- data.frame(ID, Time, BP1, BP2)
How do I combine these dataframes so that the Time variable is in order, and the dataframe looks like this:
Any help greatly appreciated, thank you!

In base R, use rbind() to combine and order() to sort, then clean up the rownames:
df3 <- rbind(df1, df2)
df3 <- df3[order(df3$ID, df3$Time), ]
rownames(df3) <- seq(nrow(df3))
df3
Or, using dplyr:
library(dplyr)
bind_rows(df1, df2) %>%
  arrange(ID, Time)
Result from either approach:
ID Time BP1 BP2
1 1 55 70 120
2 1 56 70 122
3 1 57 73 122
4 1 58 73 123
5 2 60 72 124
6 2 61 74 126
7 2 62 75 124
8 2 64 76 121
9 3 65 73 130
10 3 66 74 130
11 3 67 74 132
12 4 70 74 134
13 4 72 70 140

Related

Average over rows pairs and paste the value based on condition

In R, I have a df such as:
a b c
1 124 70 aa
2 129 67 aa
3 139 71 aa
4 125 77 aa
5 125 82 aa
6 121 69 aa
7 135 68 bb
8 137 72 bb
9 137 78 bb
10 140 86 bb
I want to iterate along the rows within columns (a, b), computing the mean of each pair of consecutive rows, and paste this mean into the same two rows of new columns (a_new, b_new) if the difference between the two rows is >= 12. Otherwise, just copy the old value. This behaviour should be restricted to the groups marked by another column (c), i.e. it should not happen if the two rows are from different groups.
In this example, it happens in row 3 (because in column a the difference with the next (4th) row is 14) and in row 5 (because in column b the difference with the next row is 13). However, it should not happen with row 6, because row 7 is in another c group.
Thus, resulting df would look like:
a b c a_new b_new
1 124 70 aa 124 70
2 129 67 aa 129 67
3 139 71 aa 132 71
4 125 77 aa 132 77
5 125 82 aa 125 75.5
6 121 69 aa 121 75.5
7 135 68 bb 135 68
8 137 72 bb 137 72
9 137 78 bb 137 78
10 140 86 bb 140 86
I've been struggling to do this for a while, figured out that perhaps lag function could be used, but no success. Help would be much appreciated (be it base R, or dplyr, or whatever)
Dput:
structure(list(a = c(124, 129, 139, 125, 125, 121, 135, 137,
137, 140), b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86), c = c("aa",
"aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
We can write a function which works for one chunk.
apply_fun <- function(x) {
  inds <- which(abs(diff(x)) >= 12)
  if(length(inds))
    x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
      rep(mean(x[c(i, i + 1)]), 2)))
  return(x)
}
and then apply it for multiple columns by group.
library(dplyr)
df %>% group_by(c) %>% mutate_at(vars(a, b), list(new = apply_fun))
# a b c a_new b_new
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 124 70 aa 124 70
# 2 129 67 aa 129 67
# 3 139 71 aa 132 71
# 4 125 77 aa 132 77
# 5 125 82 aa 125 75.5
# 6 121 69 aa 121 75.5
# 7 135 68 bb 135 68
# 8 137 72 bb 137 72
# 9 137 78 bb 137 78
#10 140 86 bb 140 86
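For readers who would rather stay in base R, the same per-group step can probably be sketched with ave(), which applies a function within each level of a grouping vector and returns the results in the original row order. This is only a sketch: it repeats the apply_fun definition from above and rebuilds the question's data as a plain data frame under the made-up name dat.

```r
# Pair-averaging helper, as defined in the answer above
apply_fun <- function(x) {
  inds <- which(abs(diff(x)) >= 12)
  if (length(inds))
    x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
      rep(mean(x[c(i, i + 1)]), 2)))
  x
}

# The question's data, rebuilt as a plain data.frame (hypothetical name `dat`)
dat <- data.frame(
  a = c(124, 129, 139, 125, 125, 121, 135, 137, 137, 140),
  b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86),
  c = rep(c("aa", "bb"), times = c(6, 4)),
  stringsAsFactors = FALSE
)

# ave() runs apply_fun separately within each level of dat$c
dat$a_new <- ave(dat$a, dat$c, FUN = apply_fun)
dat$b_new <- ave(dat$b, dat$c, FUN = apply_fun)
dat
```

Because the grouping is handled by ave(), rows 6 and 7 are never compared with each other, which is the restriction the question asks for.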
As I understand it, the procedure commented in the code below should be applied to each group defined by the indicator column "c":
pairAverage <- function(x) {
  # x should be a numeric vector of length > 1
  if (is.vector(x) && is.numeric(x) && length(x) > 1) {
    # copy data to an aux vector
    aux <- x
    # get differences of lag 1
    dh <- diff(x, 1)
    # get means of consecutive pairs
    med <- x[2:length(x)] - dh/2
    # get positions (index) of pairs with abs(difference) >= 12
    idx <- which(abs(dh) >= 12)
    # need 2 reps of each mean to replace consecutive values of x
    valToRepl <- med[sort(rep(idx, 2))]
    # ordered indexes pairs of consecutive elements of x to be replaced
    idxToRepl <- sort(c(idx, idx + 1))
    # replace pairs of values
    aux[idxToRepl] <- valToRepl
    return(aux)
  } else {
    # do nothing
    warning("parameter x should be a numeric vector of length > 1")
    return(NULL)
  }
}
pairAverageByGroups <- function(x, gr) {
  if (is.vector(x) && is.numeric(x) && length(x) == length(gr)) {
    x.ls <- split(x, as.factor(gr))
    output <- unlist(lapply(x.ls, pairAverage))
    names(output) <- NULL
    output
  } else {
    # do nothing
    warning("parameter x should be a numeric vector with the same length as gr")
    return(NULL)
  }
}
pairAverageByGroups(dd$a, dd$c)  # dd holds the data from the question
[1] 124 129 132 132 125 121 135 137 137 140

If() statement in R

I am not very experienced with if statements and loops in R.
Perhaps you can help me solve my problem.
My task is to add 1 to df$fz if sum(df$fz) < 450, but at the same time I have to add 1 only to the max values in df$fz, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we can probably do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>%
  top_n(450 - sum(fz), z) %>%       # subset by top z, if sum(fz) == 450 -> NULL
  mutate(fz = fz + 1) %>%           # increase fz by 1 for the subset
  bind_rows(
    anti_join(vru, ., by = "id"),   # take rows from vru which are not in subset
    .                               # take subset with transformed fz
  ) %>%                             # bind those subsets
  arrange(id)                       # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
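The same idea can be sketched in base R without the pipeline. This is a hedged sketch, not the answer's own code: it assumes, as the pipeline above does, that the rows to bump are the ones with the largest z, and the names target, deficit and top_rows are made up for illustration.

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

target  <- 450
# How many +1 steps are needed to reach the target (never negative)
deficit <- max(target - sum(df$fz), 0)

# Row indices of the `deficit` largest z values; head() returns nothing
# when deficit is 0 and at most nrow(df) indices when deficit is large
top_rows <- head(order(df$z, decreasing = TRUE), deficit)

df$new_value <- df$fz
df$new_value[top_rows] <- df$new_value[top_rows] + 1
df$new_value
# [1] 134  68  71  88  89
```

Note that each row gains at most 1, so if the deficit is larger than the number of rows, the sum will still fall short of the target.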
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(df$fz), ]
>
> leftover <- 450 - cumsum(df$fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

Find Nth largest Across Columns (NOT in a vector)

Consider the following example:
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
In R, functions already exist to find the element-wise max and min, so I could easily create a new variable that is equal to the largest (or smallest) value across the columns of interest:
> DF$Max <- pmax(Var_A,Var_B,Var_C,Var_D)
> DF
Var_A Var_B Var_C Var_D Max
1 44 33 6 72 72
2 29 66 51 12 66
3 35 29 47 79 79
4 39 79 47 65 79
5 97 60 36 81 97
But what if I need to create a variable that captures, say, the second largest value in each row (i.e., across the columns)?
In the real data set that I'm working with, I have 600+ columns and about 28 million records. I need to create variables that will identify and store the largest, second largest, third largest, etc. values found when looking across the variables (columns) for each record, much like pmax would do, but for other ordinals.
The only way that I have been able to functionally make it work on a subset of the data is to do a loop, but that loop won't finish in my lifetime if I run it on the entire data set. I also considered using the apply function, but my understanding is that apply will convert the data set to a matrix first, which my data set won't take kindly to.
Any suggestions on a non-loop way to do this? And with this amount of data, the faster the better...
This may be a solution...
Var_A <- sample(1:100,5,replace=TRUE)
Var_B <- sample(1:100,5,replace=TRUE)
Var_C <- sample(1:100,5,replace=TRUE)
Var_D <- sample(1:100,5,replace=TRUE)
DF <- as.data.frame(cbind(Var_A,Var_B,Var_C,Var_D))
result <- sapply(1:nrow(DF), function(x) {
  df <- as.data.frame(DF[x,])
  ord <- df[order(-DF[x,])]
})
result <- t(result)
output <- cbind(DF,result)
for (i in (ncol(DF)+1):ncol(output) ) {
  colnames(output)[i]<-paste0("Max",i-ncol(DF))
}
output
Var_A Var_B Var_C Var_D Max1 Max2 Max3 Max4
1 42 12 64 9 64 42 12 9
2 67 22 47 4 67 47 22 4
3 80 56 82 94 94 82 80 56
4 31 62 88 73 88 73 62 31
5 91 67 15 41 91 67 41 15
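Since the question asks for something closer to pmax than to a row-by-row loop, here is a hedged sketch of another route: loop over the columns instead of the rows and keep k running "top" vectors, updating them with pmax() and pmin(). The function name top_k_by_row is made up, and the sketch assumes numeric columns with no NA values; it has not been timed on data of the size mentioned in the question.

```r
# Keep the k largest values per row by scanning columns one at a time.
# Each column is "inserted" into the running top-k vectors: the larger
# value stays at position j, the smaller one is carried to position j + 1.
top_k_by_row <- function(data, k) {
  top <- replicate(k, rep(-Inf, nrow(data)), simplify = FALSE)
  for (col in data) {               # iterates over the columns of a data.frame
    carry <- col
    for (j in seq_len(k)) {
      smaller  <- pmin(top[[j]], carry)
      top[[j]] <- pmax(top[[j]], carry)
      carry    <- smaller
    }
  }
  names(top) <- paste0("Max", seq_len(k))
  as.data.frame(top)
}

# The values printed in the answer above, entered by hand
DF <- data.frame(
  Var_A = c(42, 67, 80, 31, 91),
  Var_B = c(12, 22, 56, 62, 67),
  Var_C = c(64, 47, 82, 88, 15),
  Var_D = c(9, 4, 94, 73, 41)
)
top_k_by_row(DF, 2)
```

The work is k vectorised passes per column, so the data frame is never converted to a matrix and only k extra vectors of length nrow(DF) are held in memory.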

combine two data frame based on cell value in R

I have two data frames. One holds the baseline data for different test types and the other holds my experiment data. I would like to combine these two data frames, but it is not a simple merge or rbind. Can any R professionals help me solve this? Thank you.
Here is an example of the two data frames:
experiment data:
experiment_num timepoint type value
50 10 7a,b4 90
50 20 7a,b4 89
50 20 10a,b4 93
50 10 7a,b6 85
50 20 7a,b6 87
50 20 10a,b6 88
baseline data:
experiment_num timepoint type value
50 0 0,b4 85
50 0 0,b6 90
Here is the output I would like to have:
experiment_num timepoint type value
50 0 7a,b4 85
50 10 7a,b4 90
50 20 7a,b4 89
50 0 10a,b4 85
50 20 10a,b4 93
50 0 7a,b6 90
50 10 7a,b6 85
50 20 7a,b6 87
50 0 10a,b6 90
50 20 10a,b6 88
This should do the job. You first need to install a couple of packages:
install.packages("dplyr")
install.packages("tidyr")
* Data *
ed <- data.frame(experiment_num=rep(50, 6), timepoint=rep(c(10, 20, 20), 2),
type=c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
value=c(90, 89, 93, 85, 87, 88))
db <- data.frame(experiment_num=rep(50, 2), timepoint=rep(0, 2), type=c("0,b4", "0,b6"),
value=c(85, 90))
* Code *
library(tidyr)
library(dplyr)
final <- rbind(separate(ed, type, into=c("typea", "typeb")),
               left_join(ed %>% select(type) %>% unique %>%
                           separate(type, into=c("typea", "typeb")),
                         separate(db, type, into=c("zero", "typeb"))) %>%
                 select(experiment_num, timepoint, typea, typeb, value)
               ) %>%
  arrange(typeb, typea, timepoint) %>% mutate(type=paste(typea, typeb, sep=",")) %>%
  select(experiment_num, timepoint, type, value)
The logic is as follows.
Separate type into two columns, typea and typeb, then "create" the missing typea for the baseline data by joining it to the distinct experiment types on typeb, and finally bind the result to the experimental data.
final is the data set you are looking for.
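The logic described above can also be sketched in base R, which may make the individual steps easier to follow. This is an illustrative sketch rather than a drop-in replacement: the helper column suffix and the objects types, baseline and keep are made-up names, and the rows come out sorted by type as text, so the order differs from the one in the question.

```r
ed <- data.frame(
  experiment_num = rep(50, 6),
  timepoint = rep(c(10, 20, 20), 2),
  type = c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
  value = c(90, 89, 93, 85, 87, 88),
  stringsAsFactors = FALSE
)
db <- data.frame(
  experiment_num = rep(50, 2),
  timepoint = rep(0, 2),
  type = c("0,b4", "0,b6"),
  value = c(85, 90),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: the part of `type` after the comma ("b4", "b6")
ed$suffix <- sub(".*,", "", ed$type)
db$suffix <- sub(".*,", "", db$type)

# One baseline row per distinct experiment type, matched on the suffix
types <- unique(ed[c("type", "suffix")])
baseline <- merge(types,
                  db[c("experiment_num", "timepoint", "value", "suffix")],
                  by = "suffix")

# Stack experiment rows and the expanded baseline rows, then sort
keep <- c("experiment_num", "timepoint", "type", "value")
final <- rbind(ed[keep], baseline[keep])
final <- final[order(final$type, final$timepoint), ]
rownames(final) <- NULL
final
```

Each of the four experiment types ends up with its own timepoint 0 row carrying the baseline value of the matching b4 or b6 group.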

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that the instances belonging to the 0 group come first, followed by the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean showing a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there's unbalanced numbers of 0's and 1's, it will leave them at the end.
library(dplyr)
df %>%
  arrange(addme) %>%
  mutate(n0 = sum(addme == 0),
         orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
  arrange(orderme) %>%
  select(-n0, -orderme)
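A base R take on the same interleaving idea, offered as a sketch on a small made-up data frame: number the rows within each addme group, then sort by that within-group position first and by addme second. The name pos_in_group is hypothetical.

```r
# Small made-up example; `addme` is the 0/1 grouping column from the question
df <- data.frame(x = 1:7, addme = c(0, 0, 1, 1, 1, 0, 1))

# Position of each row within its own group (1st zero, 2nd zero, 1st one, ...)
pos_in_group <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# Sort by within-group position, then by group, to alternate 0s and 1s
interleaved <- df[order(pos_in_group, df$addme), ]
interleaved$addme
# [1] 0 1 0 1 0 1 1
```

As with the approaches above, any surplus rows from the larger group end up at the bottom.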
