vectorize combining two matrices in R

I have a data frame A as follows (the numbers are totally made up)
ID statistic p.value
1 4 .1
2 5 .3
3 3 .4
4 2 .4
5 1 .5
6 7 .8
and data frame B as follows:
ID Info1 Info2 ....
4 A1 B1
5 A2 B2
2 A3 ..
3 A4
1 A5
6 A6
7 A7
9 A8
8 A9
How would I cbind data frame A to data frame B in the correct order, without a loop? I know I need to do something like:
cbind(A, B[something in here, ])
but how do I get the ordering? Do I use a which statement? Something else?

Too long for a comment.
So if I understand you correctly (from your question and all the comments), A and B are extremely large data frames. A has an ID column, and B has the IDs in the row names.
You should definitely use data.tables for this. Assuming you are pulling in the data from some kind of text file, read up on fread(...) in the data.table package. This will read the file directly into a data.table. fread(...) is extremely fast: 10 - 100 times faster than read.table(...) or read.csv(...) for large datasets.
Below is a comparison of the data frame approach with merge(...) and the data.table join approach.
data.frame approach
N <- 1e7 # 10 million rows; big enough??
set.seed(1) # for reproducible example
A <- data.frame(ID=1:N,statistic=sample(1:10,N,replace=T),pvalue=runif(N),stringsAsFactors=F)
B <- data.frame(info1=sample(LETTERS,N,replace=T),info2=sample(letters,N,replace=T),stringsAsFactors=F)
rownames(B) <- sample(1:N,N) # row names in random order in B
system.time({
# this does the work...
B$ID <- as.integer(rownames(B))
result <- merge(B,A,by="ID")
})
# user system elapsed
# 285.75 3.15 289.33
data.table approach
set.seed(1)
A <- data.frame(ID=1:N,statistic=sample(1:10,N,replace=T),pvalue=runif(N),stringsAsFactors=F)
B <- data.frame(info1=sample(LETTERS,N,replace=T),info2=sample(letters,N,replace=T),stringsAsFactors=F)
rownames(B) <- sample(1:N,N)
library(data.table)
system.time({
# this does the work...
IDs <- as.integer(rownames(B))
setDT(A)
setDT(B)
B[,ID:=IDs]
setkey(A,ID)
setkey(B,ID)
B[A,c("statistic","pvalue"):=list(statistic,pvalue=pvalue)]
})
# user system elapsed
# 122.46 0.40 122.87
So the data.table approach is twice as fast in this example. But most of the time is spent converting the rownames to a column, so if you can read them into a column to begin with, and especially if you can read the data directly into data.tables using fread(...), this will be much faster.
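For illustration, here is a minimal sketch of that suggestion, assuming the data live in CSV files (the file names "A.csv" and "B.csv" are hypothetical) and that B already has its IDs in an ID column rather than in the row names:
library(data.table)

A <- fread("A.csv")   # hypothetical file; columns: ID, statistic, p.value
B <- fread("B.csv")   # hypothetical file; columns: ID, Info1, Info2, ...

setkey(A, ID)
setkey(B, ID)

result <- B[A]        # keyed join: B's columns lined up against A's IDs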


Subset Data Frame Rows by value in row.names in R

I have seen this question, Subsetting a data frame based on a logical condition on a subset of rows, and this page: https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
I want to subset a data.frame according to a specific value in the row.names.
data <- data.frame(x1 = c(3, 7, 1, 8, 5), # Create example data
x2 = letters[1:5],
group = c("ga1", "ga2", "gb1", "gc3", "gb1"))
data # Print example data
# x1 x2 group
# 3 a ga1
# 7 b ga2
# 1 c gb1
# 8 d gc3
# 5 e gb1
I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?
The result should look like this
data.a
# x1 x2 group
# 3 a ga1
# 7 b ga2
data.b
# x1 x2 group
# 1 c gb1
# 5 e gb1
data.c
# 8 d gc3
I would be interested in how to subset one of these output examples, or perhaps a loop would work too.
I modified the example from here https://statisticsglobe.com/filter-data-frame-rows-by-logical-condition-in-r
Extract the values which you want to split on:
sub('\\d+', '', data$group)
#[1] "ga" "ga" "gb" "gc" "gb"
and use the above in split to divide the data into groups.
new_data <- split(data, sub('\\d+', '', data$group))
new_data
#$ga
# x1 x2 group
#1 3 a ga1
#2 7 b ga2
#$gb
# x1 x2 group
#3 1 c gb1
#5 5 e gb1
#$gc
# x1 x2 group
#4 8 d gc3
It is better to keep the data in a list; however, if you want separate data frames for each group, you can use list2env.
list2env(new_data, .GlobalEnv)
We can use group_split with str_remove from the tidyverse:
library(dplyr)
library(stringr)

data %>%
  group_split(grp = str_remove(group, "\\d+$"), .keep = FALSE)
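The result is a list of data frames, one per group prefix. A small usage sketch (assuming the data object from the question), pulling them out by position:
out <- data %>%
  group_split(grp = str_remove(group, "\\d+$"), .keep = FALSE)

data.a <- out[[1]]  # rows whose group starts with "ga"
data.b <- out[[2]]  # rows whose group starts with "gb"
data.c <- out[[3]]  # rows whose group starts with "gc"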
Good question. This solution uses inputs and outputs that closely match the request: "I want to subset data according to group. One subset should be the rows containing a in their group, one containing b in their group and one c. Maybe something with grepl?".
The code below uses the data frame that was provided (named data), uses grep(), and subsets by group.
code:
ga <- grep("ga", data$group) # separate the data by group type
gb <- grep("gb", data$group)
gc <- grep("gc", data$group)
ga1 <- data[ga,] # subset ga
gb1 <- data[gb,] # subset gb
gc1 <- data[gc,] # subset gc
print(ga1)
print(gb1)
print(gc1)
Windows and Jupyter Lab were used; the output closely matches the output shown above.
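Since the question mentions grepl(), the same subsets can also be obtained with logical indexing, a small variation on the grep() code above:
data.a <- data[grepl("ga", data$group), ]  # rows whose group contains "ga"
data.b <- data[grepl("gb", data$group), ]
data.c <- data[grepl("gc", data$group), ]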

Unroll R data.frame list column retaining the other values in the row [duplicate]

This question already has answers here:
Unlisting columns by groups
(3 answers)
Closed 7 years ago.
I need to efficiently "unroll" a list column in an R data.frame. For example, if I have a data.frame defined as:
dbt <- data.frame(values=c(1,1,1,1,2,3,4),
parm1=c("A","B","C","A","B","C","B"),
parm2=c("d","d","a","b","c","a","a"))
Then, assume an analysis that generates one column as a list, similar to the following output:
agg <- aggregate(values ~ parm1 + parm2, data=dbt,
FUN=function(x) {return(list(x))})
The aggregated data.frame looks like (where class(agg$values) == "list"):
parm1 parm2 values
1 B a 4
2 C a 1, 3
3 A b 1
4 B c 2
5 A d 1
6 B d 1
I'd like to unroll the "values" column, repeating the parm1 and parm2 values (adding more rows) in an efficient manner for each element of the list, over all the data.frame rows.
At the top level I wrote a function that does the unroll in a for loop, called from an apply. It's really inefficient (the aggregated data.frame takes about an hour to create and nearly 24 hours to unroll; the fully unrolled data has ~500k records). The top level I'm using is:
unrolled.data <- do.call(rbind, apply(agg, 1, FUN=unroll.data))
The function just calls unlist() on the values column, then builds a data.frame in a for loop and returns it.
The environment is somewhat restricted: the tidyr, data.table and splitstackshape libraries are unavailable to me, so the solution must use only functions found in base::, and moreover only those available in v3.1.1 and before. Thus the answers in this (not really a duplicate) question do not apply.
Any suggestions on something faster?
Thanks!
With base R, you could try
with(agg, {
  data.frame(
    lapply(agg[, 1:2], rep, times = lengths(values)),
    values = unlist(values)
  )
})
# parm1 parm2 values
# 1.2 B a 4
# 1.31 C a 1
# 1.32 C a 3
# 2.1 A b 1
# 3.2 B c 2
# 4.1 A d 1
# 4.2 B d 1
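One caveat for the version constraint in the question: lengths() was only added to base R in 3.2.0, so on R 3.1.1 and earlier sapply(values, length) is a drop-in replacement:
# Same approach using only functions available in R 3.1.1
with(agg, {
  data.frame(
    lapply(agg[, 1:2], rep, times = sapply(values, length)),
    values = unlist(values)
  )
})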
Timings for an alternative (thanks @thelatemail)
library(dplyr)
agg %>%
sample_n(1e7, replace=T) -> bigger
system.time(
with(bigger, { data.frame(lapply(bigger[,1:2], rep, times=lengths(values)), values=unlist(values)) })
)
# user system elapsed
# 3.78 0.14 3.93
system.time(
with(bigger, { data.frame(bigger[rep(rownames(bigger), lengths(values)), 1:2], values=unlist(values)) })
)
# user system elapsed
# 11.30 0.34 11.64

Creating combination of sequences

I am trying to solve following problem:
Consider 5 simple sequences: 0:100, 100:0, rep(0,101), rep(50,101), rep(100,101)
I need sets of 3 numeric variables which have the above sequences in all combinations. Since there are 5 sequences and 3 variables, there are 5*5*5 = 125 combinations, hence a total of 12625 (125*101) numbers in each variable (101 for each sequence).
These can be grouped in a data.frame of 12625 rows and 4 columns. The first column (V) will simply have seq(1:12625) (row numbers can be used in its place). The other 3 columns (A, B, C) will have the above 5 sequences in different combinations. For example, the first 101 rows will have 0:100 in all 3 of A, B and C. The next 101 rows will have 0:100 in A and B, and 100:0 in C. And so on...
I can create sequences as:
s = list()
s[[1]] = 0:100
s[[2]] = 100:0
s[[3]] = rep(0,101)
s[[4]] = rep(50,101)
s[[5]] = rep(100,101)
But how to proceed further? I do not really need the data frame but I need a function that returns a list containing the values of c(A,B,C) for the number (first or V column) sent to it. The number can obviously vary from 1 to 12625.
How can I create such a function? I would prefer a vectorized solution, or one using apply-family functions, to optimize for speed.
You asked for a vectorized solution, so here's one using only data.table (similar to @SimonG's methodology):
library(data.table)
grd <- CJ(A = seq_len(5), B = seq_len(5), C = seq_len(5))
res <- grd[, lapply(.SD, function(x) unlist(s[x]))]
res
# A B C
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 2
# 4: 3 3 3
# 5: 4 4 4
# ---
# 12621: 100 100 100
# 12622: 100 100 100
# 12623: 100 100 100
# 12624: 100 100 100
# 12625: 100 100 100
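And since the question asks for a function that returns the A, B, C values for a given row number V, a small helper on top of res (assuming the res object built above) could look like:
# Look up the A, B, C values for a single row number v (1..12625)
getABC <- function(v) as.list(res[v])

getABC(1)      # first row: A = 0, B = 0, C = 0
getABC(12625)  # last row:  A = 100, B = 100, C = 100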
I came up with two solutions. I find this hard to do with apply and the like, since they tend to give an output that is not so nice to handle (maybe someone can "tame" them better than I can :D)
The first solution uses separate calls to lapply, the second one uses a for loop and some programming no-no's. Personally I prefer the second one; the first one is faster, though...
grd <- expand.grid(a=1:5,b=1:5,c=1:5)
# apply-ish
A <- lapply(grd[,1], function(z){ s[[z]] })
B <- lapply(grd[,2], function(z){ s[[z]] })
C <- lapply(grd[,3], function(z){ s[[z]] })
dfr <- data.frame(A=do.call(c,A), B=do.call(c,B), C=do.call(c,C))
# for-ish
mat <- NULL
for(i in 1:nrow(grd)){
  cur <- grd[i,]
  tmp <- cbind(s[[cur[,1]]], s[[cur[,2]]], s[[cur[,3]]])
  mat <- rbind(mat, tmp)
}
The output of both dfr and mat seems to be what you describe.
Cheers!

Merge csv files in R

I have 3 .csv files that I need to analyse in R. File one contains columns with user id and signupdate. File two contains columns with user id, purchase date and amount of purchases. File three contains columns with user id, message date and number of messages.
Please note that the order of the user IDs is not the same in each of the three files.
Would love some help merging these files so that the merged dataset has the columns user id, signupdate, purchase date, amount of purchases, message date and number of messages, in that order. Can't seem to find code to do this in R.
Thanks in advance
While merge only joins two data frames at a time, Reduce is made for the task of iterating over a list and passing pairs to a function. Here's an example of a three-way merge:
d1 <- data.frame(id=letters[1:3], x=2:4)
d2 <- data.frame(id=letters[3:1], y=5:7)
d3 <- data.frame(id=c('b', 'c', 'a'), z=c(5,6,8))
Reduce(merge, list(d1, d2, d3))
## id x y z
## 1 a 2 7 8
## 2 b 3 6 5
## 3 c 4 5 6
Note that the order of the id values is not the same in each input, but merge matches the rows on those values.
In the case where you have non-matching data and want to keep all possible rows, you need an outer join, by supplying all=TRUE to merge. As Reduce does not have a way to pass additional arguments to the function, a new function must be created to call merge:
d1 <- data.frame(id=letters[1:3], x=2:4)
d2 <- data.frame(id=letters[3:1], y=5:7)
d3 <- data.frame(id=c('b', 'c', 'd'), z=c(5,6,8))
Reduce(function(x,y) merge(x,y,all=TRUE), list(d1, d2, d3))
## id x y z
## 1 a 2 7 NA
## 2 b 3 6 5
## 3 c 4 5 6
## 4 d NA NA 8
NA is used to indicate data in non-matched rows.
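Applied to the question, a sketch might look like the following (the file names and column names are assumptions; the shared key is the user id column):
# Hypothetical file and column names; each file is assumed to share a "user_id" column
users     <- read.csv("users.csv")      # user_id, signupdate
purchases <- read.csv("purchases.csv")  # user_id, purchase_date, amount_of_purchases
messages  <- read.csv("messages.csv")   # user_id, message_date, number_of_messages

# Outer join keeps users that appear in only some of the files
combined <- Reduce(function(x, y) merge(x, y, by = "user_id", all = TRUE),
                   list(users, purchases, messages))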

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table)  # for as.data.table, chmatch and set

dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1), names(dt2)))
  stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)              # position of each dt2 key in dt1 (NA = new key)
for (i in 2:ncol(dt2)) {
  nna = !is.na(dt2[[i]])               # only non-NA values in dt2 should overwrite
  set(dt1, w[nna], i, dt2[[i]][nna])   # update matching rows of column i by reference
}
dt1 = rbind(dt1, dt2[is.na(w)])        # append rows whose keys were not in dt1
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
                         FUN = decider)) # replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
  colnames(LS[[i]]) <- colnames(LS[[2]])
  return(LS[[i]])
})
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT: Please ignore this answer. It is a bad idea to loop by row; it works but is very slow. Left for posterity! See my second attempt as a separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
  k = K[i]
  p = unlist(dt2[i, -1, with=FALSE])
  p = p[!is.na(p)]
  dt1[J(k), names(p):=as.list(p), with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
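For what that could look like, here is a minimal sketch using the df1 and df2 fabricated in the question, restricted to the intersecting keys (the new rows such as "D" would still be rbind-ed in separately):
# Turn the P columns into matrices keyed by K
m1 <- as.matrix(df1[-1]); rownames(m1) <- df1$K
m2 <- as.matrix(df2[-1]); rownames(m2) <- df2$K

common <- intersect(rownames(m2), rownames(m1))        # keys present in both
upd <- m2[common, , drop = FALSE]

idx <- which(!is.na(upd), arr.ind = TRUE)              # row/col numbers of non-NA cells
B <- cbind(match(common[idx[, "row"]], rownames(m1)),  # translate to m1's row numbers
           idx[, "col"])
m1[B] <- upd[idx]                                      # single vectorized assignment
m1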
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
