Resampling and merging datasets in R

Consider the following dataset:
d1 <- c(2, 3, 8)
d2 <- data.frame(d1)
d1 <- c(1, 7, 9, 10)
d3 <- data.frame(d1)
Now I want to randomly draw 3 observations (without replacement) from d3, 3 times, and each time merge the draw with d2, so I end up with three merged data frames of 6 observations each.
I have tried with:
for (r in 1:3) {
  sam <- sample(1:4, 3, replace = FALSE)
  merge <- rbind(d2, d3[sam])
}
But this does not work. Can anyone help me?

You can try
library(tidyverse)
d1 <- data.frame(d1=c(2, 3, 8))
d2 <- data.frame(d1=c(1, 7, 9, 10))
1:3 %>%
  map(~ sample_n(d2, 3) %>%
        bind_cols(d1))
[[1]]
  d1 d11
1  1   2
2  7   3
3  9   8

[[2]]
  d1 d11
1 10   2
2  1   3
3  7   8

[[3]]
  d1 d11
1  9   2
2  7   3
3 10   8
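The answer above binds the sampled rows as columns; if you instead want the row-bound result the question describes (three data frames of 6 rows each), the original loop needs two fixes: d3[sam] selects columns rather than rows, and merge is overwritten on every pass. A minimal base-R sketch (my suggestion, not from the original thread):

d2 <- data.frame(d1 = c(2, 3, 8))
d3 <- data.frame(d1 = c(1, 7, 9, 10))

merged <- vector("list", 3)
for (r in 1:3) {
  sam <- sample(nrow(d3), 3, replace = FALSE)        # row indices, not columns
  merged[[r]] <- rbind(d2, d3[sam, , drop = FALSE])  # 3 + 3 = 6 rows each
}
merged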

Related

Add empty rows at specific positions of dataframe

I want to add empty rows at specific positions of a dataframe. Let's say we have this dataframe:
df <- data.frame(var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
In which I want to add an empty row after rows 1, 3 and 5 (I know this is not best practice in most cases; ultimately I want to create a table using flextable here). These row numbers are saved in a vector:
rows <- c(1,3,5)
Now I want to use a for loop that runs through the rows vector and adds an empty row after each of those positions using tibble::add_row():
for (i in rows) {
  df <- add_row(df, .after = i)
}
The problem is that, while the first iteration works flawlessly, the later empty rows end up misplaced, since each insertion makes the data frame longer. To fix this I tried adding 1 to the vector after each iteration:
for (i in rows) {
  df <- add_row(df, .after = i)
  rows <- rows + 1
}
This does not work either; I assume the rows vector only gets evaluated once. Anyone got any ideas?
Do it all at once, no need for looping. Make a sequence of row numbers, add the new rows in, sort, then replace the duplicated row numbers with NA:
s <- sort(c(seq_len(nrow(df)), rows))  # row numbers, with the target rows duplicated
out <- df[s,]                          # duplicated numbers repeat those rows
out[duplicated(s),] <- NA              # blank out the second copy of each
#     var1 var2
# 1      1    9
# 1.1   NA   NA
# 2      2    8
# 3      3    7
# 3.1   NA   NA
# 4      4    6
# 5      5    5
# 5.1   NA   NA
# 6      6    4
# 7      7    3
# 8      8    2
# 9      9    1
This will be much more efficient than looping or loop-like code, even for moderately sized data:
df <- df[rep(1:9, 1e4), ]
rows <- seq(1, 9e4, 100)
system.time({
  s <- sort(c(seq_len(nrow(df)), rows))
  out <- df[s, ]
  out[duplicated(s), ] <- NA
})
#   user  system elapsed
#   0.01    0.00    0.02

df <- df[rep(1:9, 1e4), ]
rows <- seq(1, 9e4, 100)
system.time({
  Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)
})
#   user  system elapsed
#  26.03    0.00   26.03

df <- df[rep(1:9, 1e4), ]
rows <- seq(1, 9e4, 100)
system.time({
  for (i in rev(rows)) {
    df <- tibble::add_row(df, .after = i)
  }
})
#   user  system elapsed
#  25.05    0.00   25.04
You could achieve your result by looping in the reverse direction:
df <- data.frame(
  var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1)
)
rows <- c(1, 3, 5)

for (i in rev(rows)) {
  df <- tibble::add_row(df, .after = i)
}
df
#>    var1 var2
#> 1     1    9
#> 2    NA   NA
#> 3     2    8
#> 4     3    7
#> 5    NA   NA
#> 6     4    6
#> 7     5    5
#> 8    NA   NA
#> 9     6    4
#> 10    7    3
#> 11    8    2
#> 12    9    1
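If you'd rather keep the forward loop from the question, you can offset each insertion point by the number of rows already added. A minimal sketch (my suggestion, not from the original answers; it assumes rows is sorted ascending and starts from a fresh copy of the original df):

library(tibble)
rows <- c(1, 3, 5)
for (k in seq_along(rows)) {
  # every earlier insertion pushes later targets down by one row
  df <- add_row(df, .after = rows[k] + (k - 1))
}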

Randomly sample contiguous rows from a data frame or matrix

I want to sample a number of contiguous rows from a data frame df.
df <- data.frame(C1 = c(1, 2, 4, 7, 9), C2 = c(2, 4, 6, 8, 10))
I am trying to get something similar to the following which allows me to sample 3 random rows and repeat the process 100 times.
test <- replicate(100, df[sample(1:nrow(df), 3, replace=T),], simplify=F)
By contiguous I mean the result should be something like:
[[1]]
  C1 C2
2  2  4
3  4  6
4  7  8

[[2]]
  C1 C2
1  1  2
2  2  4
3  4  6
.
.
.
How could I achieve this?
We just need to sample the starting row index for a chunk.
sample.block <- function(DF, chunk.size) {
  if (chunk.size > nrow(DF)) return(NULL)
  # pick a starting row such that the whole chunk fits
  start <- sample.int(nrow(DF) - chunk.size + 1, 1)
  DF[start:(start + chunk.size - 1), ]
}

replicate(100, sample.block(df, 3), simplify = FALSE)
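A vectorized variant (my sketch, not part of the original answer): draw all 100 starting indices in one call, then slice with lapply. It assumes a fixed chunk size of 3, as in the question:

n <- 3
starts <- sample.int(nrow(df) - n + 1, 100, replace = TRUE)
test <- lapply(starts, function(s) df[s:(s + n - 1), ])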

Replacing NAs in a data frame with values from a different column

I would like to replace NAs in my data frame with values from another column. For example:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
> df
  a1 b1 c1 a2 b2 c2
1  1  3 NA  2  1  3
2  2 NA  3  3  2  3
3  4  4  3  5  4  2
4 NA  4  4  5  5  3
5  2  4  2  3  6  4
6 NA  3  3  4  3  3
I would like to replace the NAs in df$a1 with the values from the corresponding rows of df$a2, the NAs in df$b1 with the values from df$b2, and the NAs in df$c1 with the values from df$c2, so that the new data frame looks like:
> df
  a1 b1 c1
1  1  3  3
2  2  2  3
3  4  4  3
4  5  4  4
5  2  4  2
6  4  3  3
How can I do this? I have a large data frame with many columns, so it would be great to find an efficient way to do this (I've already seen Replace missing values with a value from another column). Thank you!
An extensible option:
df2 <- df[c('a1', 'b1', 'c1')]
df2[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                df[c('a1', 'b1', 'c1')], df[c('a2', 'b2', 'c2')],
                SIMPLIFY = FALSE)
df2
df2
#   a1 b1 c1
# 1  1  3  3
# 2  2  2  3
# 3  4  4  3
# 4  5  4  4
# 5  2  4  2
# 6  4  3  3
It's easy enough to extend this to arbitrary column pairs: the first column of the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset, the second with the second, and so on. It can even be generalized with df[grepl('1$', colnames(df))] and df[grepl('2$', colnames(df))], assuming the two sets don't mis-match; a sketch of that generalization follows.
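Here, a sketch of that generalized pairing (my addition; it assumes every '1' column has a matching '2' partner in the same order):

left  <- df[grepl('1$', colnames(df))]
right <- df[grepl('2$', colnames(df))]
left[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                 left, right, SIMPLIFY = FALSE)
left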
coalesce in dplyr is meant to do exactly this: replace NAs in a first vector with non-NA elements of a later one. For example:
coalesce(df$a1,df$a2)
[1] 1 2 4 5 2 4
It can be used with sapply to do the whole dataset in an efficient and easily extensible manner:
sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
     a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
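Note that sapply returns a matrix and drops the original column names; if you want a data frame back, a Map() variant of the same idea works (my sketch, not from the original answer):

library(dplyr)
out <- df[c('a1', 'b1', 'c1')]
out[] <- Map(coalesce, out, df[c('a2', 'b2', 'c2')])  # pair a1/a2, b1/b2, c1/c2
out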
dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)
as.data.frame(dfnew)
This is just for the a1 column; you'll have to run it for each of a, b and c and cbind the results. If there are too many columns, running a loop will be the best option imo.
You can use hutils::coalesce. It should be slightly faster, especially when it can 'cheat' because some columns have no NAs and so don't need to change; coalesce will skip them:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)

s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}

df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))
library(microbenchmark)
library(hutils)
library(data.table)

dt <- as.data.table(df)
old <- paste0(letters[1:3], "1")  # you will need to specify
new <- paste0(letters[1:3], "2")

dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}

hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}

microbenchmark(dplyr  = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#>    expr      min       lq     mean   median       uq       max neval cld
#>   dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800   100   b
#>  hutils 36.48602 46.76336 63.46643 52.95736 64.53066  252.5608   100  a
Created on 2018-03-29 by the reprex package (v0.2.0).
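As an aside, the benchmark above builds dt but never uses it. For completeness, a data.table sketch (my addition; fcoalesce requires data.table >= 1.12.4):

library(data.table)
for (j in seq_along(old)) {
  # update each '1' column in place with values from its '2' partner
  set(dt, j = old[j], value = fcoalesce(dt[[old[j]]], dt[[new[j]]]))
}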

Using cumsum by columns

I have a matrix, for example:
A = [ 1 2 3 4
      3 5 6 6
      4 1 2 3 ]
I want to get the cumulative sum of the columns in the form of another matrix (or data frame). For example, this matrix would give:
B = [ 1 2  3  4
      4 7  9 10
      8 8 11 13 ]
If A is a matrix, use apply:
A <- structure(c(1, 3, 4, 2, 5, 1, 3, 6, 2, 4, 6, 3), .Dim = 3:4)
B <- apply(A, 2, cumsum)
B
#      [,1] [,2] [,3] [,4]
# [1,]    1    2    3    4
# [2,]    4    7    9   10
# [3,]    8    8   11   13
If A is a data.frame, use lapply:
B <- as.data.frame(A)
B[] <- lapply(B, cumsum)
B
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  4  7  9 10
# 3  8  8 11 13
Using dplyr for data frames:
library(dplyr)
m <- matrix(c(1, 3, 4, 2, 5, 1, 3, 6, 2, 4, 6, 3), 3, 4)
as.data.frame(m) %>% mutate_each(funs(cumsum))
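mutate_each() and funs() have since been deprecated; on current dplyr (>= 1.0) the equivalent uses across():

library(dplyr)
as.data.frame(m) %>% mutate(across(everything(), cumsum))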
With sqldf:
library(sqldf)
df <- as.data.frame(m)
sqldf('SELECT SUM(b.V1) V1, SUM(b.V2) V2, SUM(b.V3) V3, SUM(b.V4) V4
       FROM df AS a, df AS b WHERE b.rowid <= a.rowid
       GROUP BY a.rowid ORDER BY a.rowid')
Output:
  V1 V2 V3 V4
1  1  2  3  4
2  4  7  9 10
3  8  8 11 13
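For large numeric matrices there is also a dedicated routine in the matrixStats package (an extra dependency, not used elsewhere in this thread):

library(matrixStats)
B <- colCumsums(A)  # column-wise cumulative sums, same result as apply(A, 2, cumsum)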

How to calculate the difference between different data frames with common column names

I have three data frames with common column names, and I am trying to calculate the difference between two of them (Df2 and Df3) conditioned on the first (Df1). As the example below shows, in the first step I want to compare, row by row, the values of the Standard column in Df1 against the Das, Dss and Tri columns; wherever a value of Das, Dss or Tri is higher than Standard, I want to take the difference between the values at the same positions in Df2 and Df3 and put that difference in a separate column.
Df1
Names Standard Das Dss Tri
Aa           3   3   6   2
Ab           4   6   4   3
Ac           2   5   2   4
Ad           4   3   3   8
Ae           6   4   5   7
Af           4   5   7   5
Ag           2   6   8   2
Ah           9   7   6   2

Df2
Names Das Dss Tri
Aa      4   2   5
Ab      7   5   4
Ac      5   7   2
Ad      6   4   3
Ae      5   3   5
Af      3   2   6
Ag      2   5   4
Ah      4   6   3

Df3
Names Das Dss Tri
Aa      5   3   5
Ab      8   5   4
Ac      6   7   2
Ad      6   4   7
Ae      5   3   8
Af      4   5   6
Ag      1   5   4
Ah      4   6   3
Final Output
Df3
Names Das Dss Tri Difference
Aa      5   3   5         -1
Ab      8   5   4         -1
Ac      6   7   2         -1
Ad      6   4   7         -4
Ae      5   3   8         -3
Af      4   5   6         -4
Ag      1   5   4          1
Ah      4   6   3          0
Here's a script that, for each row of df1, takes the indices of all columns whose value exceeds the standard and sums the corresponding df2 - df3 differences; if no column exceeds the standard, NA is returned.
df1 <- structure(list(standard = c(3, 4, 2, 4, 6, 4, 2, 9), das = c(3,
6, 5, 3, 4, 5, 6, 7), dss = c(6, 4, 2, 3, 5, 7, 8, 6), tri = c(2,
3, 4, 8, 7, 5, 2, 2)), .Names = c("standard", "das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
df2 <- structure(list(das = c(4, 7, 5, 6, 5, 3, 2, 4), dss = c(2,
5, 7, 4, 3, 2, 5, 6), tri = c(5,4,2,3,5,6,4,3)), .Names = c("das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
df3 <- structure(list(das = c(5, 8, 6, 6, 5, 4, 1, 4), dss = c(3,
5, 7, 4, 3, 5, 5, 6), tri = c(5,4,2,7,8,6,4,3)), .Names = c("das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
# get indices: run through every row of df1 and collect the
# column indices (das, dss, tri) whose value exceeds the standard
idx.v <- sapply(1:nrow(df1), function(idx) {
  which(df1[idx, 2:4] > df1[idx, 1])
})

# sum the df2 - df3 differences over those columns; NA if none qualify
df3$result <- sapply(1:length(idx.v), function(ix) {
  col.idx <- idx.v[[ix]]
  if (length(col.idx) > 0) {
    sum(df2[ix, col.idx] - df3[ix, col.idx])
  } else {
    NA
  }
})
Output:
> df3
  das dss tri result
1   5   3   5     -1
2   8   5   4     -1
3   6   7   2     -1
4   6   4   7     -4
5   5   3   8     -3
6   4   5   6     -4
7   1   5   4      1
8   4   6   3     NA
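For reference, the same result can be computed without sapply, fully vectorized (my sketch, using the dput objects above, before df3$result is added):

excess <- df1[2:4] > df1$standard            # TRUE where a column beats the standard
diffs  <- df2 - df3[c('das', 'dss', 'tri')]  # elementwise df2 - df3
res    <- rowSums(diffs * excess)            # sum only the qualifying cells
res[rowSums(excess) == 0] <- NA              # no qualifying column -> NA
res
## [1] -1 -1 -1 -4 -3 -4  1 NA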
Thanks for the chat; this is what you require. I think this is the correct result, but note that the seventh value differs. Using the max value of the three columns (an easier task) produces a result that differs in even more slots. (Note that this snippet indexes the data frames as laid out in the question's tables, with Names as the first column, rather than the dput objects above, which omit Names.)
df1.w <- sapply(seq(1, nrow(df1)),
                function(idx) min(c(Inf, which(df1[-(1:2)][idx, ] > df1[idx, 2]))))
df1.mat <- matrix(c(seq(1, nrow(df1)), df1.w), ncol = 2)
df1.mat[is.infinite(df1.mat)] <- 1
ifelse(is.infinite(df1.w), 0,
       df2[-1][df1.mat] - df3[-1][df1.mat])
## [1] -1 -1 -1 -4 -3 -1 1 0
If you actually do want to use the index of the max value in df1[-(1:2)], replace the definition of df1.w (the sapply call) with this:
df1.w <- apply(df1[-(1:2)], 1, which.max)
Using the rest of the code above then gives this result:
## [1] -1 -1 -1 -4 -3 -3 0 0
