R: Combine multiple columns as pairs of column cells in the same row

I'd like to combine/pair multiple columns in a data frame as pairs of column cells in the same row. As an example, df1 should be transformed to df2.
df1
col1 col2 col3
1 2 3
0 0 1
df2
c1 c2
1 2
1 3
2 3
0 0
0 1
0 1
The solution should be scalable for df1s with (way) more than three columns.
I thought about melt/reshape/dcast but found no solution yet. There are no NAs in the data frame. Thank you!
EDIT: Reshape just produced errors, so I thought about
comb1 <- combn(df1[1,], 2)
comb2 <- t(comb1)
and looping and appending through all rows. This is inefficient, considering 2 million rows.

Here's the approach I would take.
Create a function that uses rbindlist from "data.table" and combn from base R. The function looks like this:
library(data.table)

lengthener <- function(indf) {
  # stack every 2-column combination of indf; idcol = TRUE tags each stack with .id
  temp <- rbindlist(
    combn(names(indf), 2, FUN = function(x) indf[x], simplify = FALSE),
    use.names = FALSE, idcol = TRUE)
  # overwrite .id with the original row index within each combination, sort by it
  # so all pairs coming from the same original row sit together, then drop it
  setorder(temp[, .id := sequence(.N), by = .id], .id)[, .id := NULL][]
}
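To see what the combn() call inside the function produces before rbindlist() stacks everything, here is a small sketch on the question's three-column example (a reading aid only; the toy object is named df_q here to avoid clashing with the df1 defined below):
df_q <- data.frame(col1 = c(1, 0), col2 = c(2, 0), col3 = c(3, 1))
combn(names(df_q), 2, FUN = function(x) df_q[x], simplify = FALSE)
# a list of three 2-column data frames: (col1, col2), (col1, col3), (col2, col3);
# rbindlist(..., idcol = TRUE) stacks them, and the .id bookkeeping
# (sequence(.N) by .id, then setorder) regroups the pairs by original row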
Here's the sample data from the other answer, and the application of the function on it:
df1 = as.data.frame(matrix(c(1,2,3,4,0,0,1,1), byrow = TRUE, nrow = 2))
lengthener(df1)
# V1 V2
# 1: 1 2
# 2: 1 3
# 3: 1 4
# 4: 2 3
# 5: 2 4
# 6: 3 4
# 7: 0 0
# 8: 0 1
# 9: 0 1
# 10: 0 1
# 11: 0 1
# 12: 1 1
Test it out on some larger data too:
set.seed(1)
M <- as.data.frame(matrix(sample(100, 100*100, TRUE), 100))
system.time(out <- lengthener(M))
# user system elapsed
# 0.19 0.00 0.19
out
# V1 V2
# 1: 27 66
# 2: 27 27
# 3: 27 68
# 4: 27 66
# 5: 27 56
# ---
# 494996: 33 13
# 494997: 33 66
# 494998: 80 13
# 494999: 80 66
# 495000: 13 66
System time for the other approach:
funAMK <- function(indf) {
  nrow_combn = nrow(t(combn(indf[1, ], m = 2)))
  nrow_df = nrow(indf) * nrow_combn
  df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
  for (i in 1:nrow(indf)) {
    df2[(((i - 1) * nrow_combn) + 1):(i * (nrow_combn)), ] = data.frame(t(combn(indf[i, ], m = 2)))
  }
  df2
}
> system.time(funAMK(M))
user system elapsed
16.03 0.16 16.37

Your edit is very similar to my answer below; you just need to rbind the result on each iteration over the rows of df1. Using data.table is a good way to speed up rbind, see this answer for more.
EDIT: Unfortunately, when I switched to the data.table approach, it turned out that the rbindlist() call produced a wrong answer (as pointed out in the comment below). Therefore, although it may be slightly slower, I think that preallocating a data frame and filling it by index may be the best option.
EDIT2: switched the preallocated df to a more general number of rows.
df1 = as.data.frame(matrix(c(1, 2, 3, 4, 0, 0, 1, 1), byrow = TRUE, nrow = 2))
nrow_combn = nrow(t(combn(df1[1, ], m = 2)))
nrow_df = nrow(df1) * nrow_combn
df2 = data.frame(V1 = rep(0, nrow_df), V2 = rep(0, nrow_df))
for (i in 1:nrow(df1)) {
  df2[(((i - 1) * nrow_combn) + 1):(i * (nrow_combn)), ] = data.frame(t(combn(df1[i, ], m = 2)))
}

Related

How to use functions to do a recursive calculation in data.table/R?

I am new to programming and got stuck. I want to calculate the hourly temperature variation of an object throughout the year using some variables, which change every hour. The original data contains 60 columns and 8760 rows for the calculation.
I got the desired output using the for loop, but the model is taking a lot of time for the calculation. I wonder if there is any way to replace the loop with functions, which I suspect could also speed up the calculation.
Here is a small reproducible example to show what I did.
library(data.table)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table
A B C
1: 1 1 10
2: 1 2 10
3: 1 3 10
4: 1 4 10
5: 1 5 10
The for loop:
for (j in 2:nrow(table)) {
  table$A[j] = (table$A[j-1] + table$B[j-1]) * table$B[j]
  table$C[j] = table$B[j] * table$A[j]
}
I got the output as I desired:
A B C
1: 1 1 10
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
but it took 15 minutes to run the whole program in my real case (not this toy example!)
So I tried to use a function instead of the for loop.
I tried this:
library(dplyr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
myfun <- function(df){
  df = df %>% mutate(A = (lag(A) + lag(B)) * B,
                     C = B * A)
  return(df)
}
myfun(table)
But the output was
A B C
1 NA 1 NA
2 4 2 8
3 9 3 27
4 16 4 64
5 25 5 125
It seems that the function refers to the rows of the original table, not the updated rows after each calculation. Is there any way to obtain the desired output using functions? It is my first R project; any help is very much appreciated. Thank you.
A much faster alternative using data.table. Note that the calculation of C can be separated from the calculation of A, so we can do less within the loop:
for (i in 2:nrow(table)) {
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
table[-1, C := A * B]
table
# A B C
# <num> <int> <num>
# 1: 1 1 10
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200
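For convenience, here is the same approach as a self-contained snippet (a sketch that simply restates the loop above, adding the library call and the question's toy table):
library(data.table)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
# A must be filled sequentially because row i depends on row i - 1;
# set() updates by reference, avoiding the per-iteration overhead of table$A[j] <- ...
for (i in 2:nrow(table)) {
  set(table, i = i, j = "A", value = with(table, (A[i-1] + B[i-1]) * B[i]))
}
# C has no recursive dependency, so it can be computed vectorised afterwards
table[-1, C := A * B]
table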
You can try Reduce like below
dt[, A := Reduce(function(x, Y) (x + Y[2]) * Y[1],
                 asplit(embed(B, 2), 1),
                 init = A[1],
                 accumulate = TRUE)
][, C := A * B]
which updates dt as
> dt
A B C
1: 1 1 1
2: 4 2 8
3: 18 3 54
4: 84 4 336
5: 440 5 2200
data
dt <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
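To make the Reduce call easier to read, here is a small base-R sketch of what embed() and asplit() feed into it (a reading aid only; it uses just the B column from the toy data):
B <- 1:5
embed(B, 2)
#      [,1] [,2]
# [1,]    2    1
# [2,]    3    2
# [3,]    4    3
# [4,]    5    4
asplit(embed(B, 2), 1)
# a list of rows: c(2, 1), c(3, 2), c(4, 3), c(5, 4)
# so at step i the function receives Y = c(B[i], B[i-1]) and computes
# (x + Y[2]) * Y[1] = (A[i-1] + B[i-1]) * B[i], exactly the question's recursion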
Here's a solution using purrr::accumulate2 which lets you use the result of the previous computation as the input to the next one:
library(data.table)
library(purrr)
library(magrittr)
table <- data.table("A" = c(1), "B" = c(1:5), "C" = c(10))
table$A <- accumulate2(
table$A,
seq(table$A),
~ (..1 + table$B[..3]) * table$B[..3 + 1],
.init = table$A[1]
) %>%
unlist() %>%
extract(1:nrow(table))
table$C <- table$B * table$A
table
# A B C
# 1: 1 1 1
# 2: 4 2 8
# 3: 18 3 54
# 4: 84 4 336
# 5: 440 5 2200

Find values in data frame 2 that are also found in data frame 1, within a certain range

I want to find which values in df2 are also present in df1, within a certain range. A value here means the pair of a and b in one row (a & b can't be split up). For example, can I find 9,1 (the a/b pair in row 1 of df1) in df2? It doesn't have to be at the same position. Also, we can allow a difference of, for example, 1 for "a" and 1 for "b"; that is, I want to find all values 9±1, 1±1 in df2. "a" & "b" always go together; each row sticks together. Does anyone have a suggestion for how to code this? Many many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
> df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
> df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
Not correct, but I'm imagining this can be done with a for loop somehow!
for(i in df1[,1]) {
  for(j in df1[,2]){
    s <- c(s, (df1[i,1] & df1[j,2] == df2[,1] & df2[,2])) # how to add certain allowed diff levels?
  }
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I loooove data.table, which is (imo) the weapon of choice for this kind of problem.
library( data.table )
#make df1 and df2 a data.table
setDT(df1, key = "feature"); setDT(df2)
#now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[ df1, c("match", "feature_df2") := {
  val = df2[ a %between% c(i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1), ]
  unique_val = sort( unique( val$feature ) )
  num_val = length( unique_val )
  list( num_val, paste0( unique_val, collapse = ";" ) )
}, by = .EACHI ][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
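To see what the by = .EACHI join is doing for a single row of df1, here is the on-the-fly subset it builds for feature A (a = 9, b = 1); this is just an illustration of the mechanism, with row A's values written out literally:
df2[ a %between% c(9 - 1, 9 + 1) & b %between% c(1 - 1, 1 + 1), ]
#    feature  a b
# 1:       E 10 2
# so feature A of df1 matches one row of df2 (E), giving match = 1 and feature_df2 = "E"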
One way to go about this in base R would be to split the data.frames into lists of row vectors, calculate the absolute difference between the row vectors, and then evaluate whether each absolute difference is larger than a given value.
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
  lapply(l2, function(y){
    abs(x - y)
  })
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
  do.call(rbind, x)
})
# Finally flag pairs whose absolute difference exceeds the threshold in either column
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
  x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1 <- lapply(1:nrow(df1), function(x){
  df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
  df2[x, ]
})

Merge a data.table and a list

I want to add a list into my data.table. Let's consider this data.table:
library(data.table)
dt = data.table(id = 1:3)
lst <- list()
lst[[2]] <- cbind(a=10:12, b=5:7)
dt[-nrow(dt), lst:=lst]
dt
# id lst
#1: 1
#2: 2 10,11,12, 5, 6, 7
#3: 3
Is it possible to "unlist" the lst, so that the data.table will look like this?
id a b
1.0: 1
2.0: 2
2.1: 2 10 5
2.2: 2 11 6
2.3: 2 12 7
3.0: 3
There is also a speed issue, as the data I am working with consists of billions of rows.
An option, as mentioned in a comment:
rbindlist(list(dt,
rbindlist(lapply(lst, as.data.table), idcol='id')),
use.names=TRUE, fill=TRUE)[order(id)]
output:
id a b
1: 1 NA NA
2: 2 NA NA
3: 2 10 5
4: 2 11 6
5: 2 12 7
6: 3 NA NA
You can run lapply across the list and append the rows to an empty id row whenever the item in the list is non-empty:
dt <- data.table(id = 1:3)
lst <- list()
lst[[2]] <- cbind(a=10:12, b=5:7)
create_table <- function(x, lst) {
  if (!is.null(lst[[x]])) {
    # Empty row plus items in list
    rbindlist(
      list(data.table(id = x), data.table(id = x, lst[[x]])),
      use.names = TRUE, fill = TRUE
    )
  } else {
    data.table(id = x)
  }
}
aux_lst <- rbindlist(
  lapply(seq(lst), create_table, lst = lst),
  use.names = TRUE, fill = TRUE
)
aux_lst[dt, on = .(id)] # Keeps all IDs in dt
If the list is named and the id column relates to those names, then replace seq with names (see the sketch below).
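A hedged sketch of that named-list case (the names x/y/z and the object dt_chr are invented for illustration; it reuses create_table from above):
lst_named <- list(x = NULL, y = cbind(a = 10:12, b = 5:7), z = NULL)
dt_chr <- data.table(id = c("x", "y", "z"))
aux_named <- rbindlist(
  lapply(names(lst_named), create_table, lst = lst_named),
  use.names = TRUE, fill = TRUE
)
aux_named[dt_chr, on = .(id)]  # keeps all ids in dt_chr, as before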
Some reformatting is needed, but you can use rbindlist:
# create all entries in lst
length(lst) <- nrow(dt)
# identify table sizes
lens = sapply(lst, NROW)
# use data.tables instead of matrices
# fill empty tables with a blank template
template = data.table(a=NA_real_, b=NA_real_)
dtlist = replace(lapply(lst, as.data.table), lens == 0, list(template))
# expand dt to match tables
replens = pmax(lens, 1L)
cbind(dt[rep(1:.N, replens)], rbindlist(dtlist))
id a b
1: 1 NA NA
2: 2 10 5
3: 2 11 6
4: 2 12 7
5: 3 NA NA
library(data.table)
dt = data.table(id = 1:3)
lst <- list()
lst[[2]] <- cbind(a=10:12, b=5:7)
unique(rbindlist(lapply(1:length(lst), function(i) {
  data.table(id = i, lst[[i]])[dt, on = .(id)]
}), fill = TRUE))[order(id)]
id a b
1: 1 NA NA
2: 2 NA NA
3: 2 10 5
4: 2 11 6
5: 2 12 7
6: 3 NA NA

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.
I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:
df <- data.frame(ID = rep(1:2, each = 5),
                 Response = sample(LETTERS[1:4], 10, replace = TRUE),
                 Weight = rep(c(2,3), each = 5))
> df
ID Response Weight
1 1 D 2
2 1 C 2
3 1 D 2
4 1 D 2
5 1 B 2
6 2 D 3
7 2 C 3
8 2 B 3
9 2 D 3
10 2 B 3
I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:
ID Response Weight
1 11 D 2
2 11 C 2
3 11 D 2
4 11 D 2
5 11 B 2
6 12 D 2
7 12 C 2
8 12 D 2
9 12 D 2
10 12 B 2
11 21 D 3
12 21 C 3
13 21 B 3
14 21 D 3
15 21 B 3
16 22 D 3
17 22 C 3
18 22 B 3
19 22 D 3
20 22 B 3
21 23 D 3
22 23 C 3
23 23 B 3
24 23 D 3
25 23 B 3
The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.
Here's my code:
df.expanded <- NULL
for(i in unique(df$ID)) {
  x <- df[df$ID == i, ]
  y <- x[rep(seq_len(nrow(x)), x$Weight), 1:3]
  y$order <- rep(1:max(x$Weight), nrow(x))
  y <- y[with(y, order(order)), ]
  y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
  df.expanded <- rbind(df.expanded, y)
}
Is there a faster way to do this?
There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.
df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
# sort the rows
df2 <- df2[order(df2$ID), ]
Is this method faster? Let's see:
library(microbenchmark)
microbenchmark(
  m1 = {
    df.expanded <- NULL
    for(i in unique(df$ID)) {
      x <- df[df$ID == i, ]
      y <- x[rep(seq_len(nrow(x)), x$Weight), 1:3]
      y$order <- rep(1:max(x$Weight), nrow(x))
      y <- y[with(y, order(order)), ]
      y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
      df.expanded <- rbind(df.expanded, y)
    }
  },
  m2 = {
    df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
    df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')
    # sort the rows
    df2 <- df2[order(df2$ID), ]
  }
)
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730 100
# m2 171.731 194.199 245.7246 214.3725 283.145 506.184 100
There might be other more efficient ways.
Another approach would be to use data.table.
Assuming you're starting with "DT" as your data.table, try:
library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
I haven't pasted the ID columns together; instead, I created a secondary column. That seems a little more flexible to me.
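If you do want a single combined ID like the one in the question, one option (a sketch, just pasting ID and the block counter together) is:
out <- DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]
out[, ID := paste0(ID, .id)][, .id := NULL]  # e.g. "11", "12", "21", ...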
Data for testing. Change n to create a larger dataset to play with.
set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
                 Response = sample(LETTERS[1:5], sum(weights), TRUE),
                 Weight = rep(weights, weights))
DT <- as.data.table(df)

Removal of constant columns in R

I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never been across any function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which column of a data frame is constant this way, for example:
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use:
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
I guess this Q&A is a popular Google search result, but the answer is a bit slow for a large matrix, and I do not have enough reputation to comment on the first answer. Therefore I am posting a new answer to the question.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the test. It reduces the runtime by more than 90% compared to the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the first answer is a bit slow for a large matrix and @raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First, build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose=FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is 10 times faster than the other approaches.
Plus, the more rows you have, the more interesting it is to use.
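For reference, basic usage of the function outside the benchmark looks like this (a minimal sketch; it mirrors the time4 call above and assumes the dataPreparation package is installed):
library(dataPreparation)
library(data.table)
dt <- data.table(x = 1:5, y = rep(1, 5))
constant_cols <- which_are_constant(dt, verbose = FALSE)  # indices of constant columns
dt[, -constant_cols, with = FALSE]                        # drop them, keeping only x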
The janitor library has the function remove_constant that can help delete constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
then the function remove_constant can help delete the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Using either of the above two calls, we will get:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that any column having one value and NA will also be deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
this is the test_dat_with_NA we get:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm = TRUE)
does delete column A, which holds only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
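As an aside (not part of the original answer): in dplyr 1.0.0 and later, the same selection is usually written with where() instead of select_if():
library(dplyr)
df %>% select(where(function(v) var(v, na.rm = TRUE) != 0))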
A tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)
