How to merge matrices in R with different numbers of rows

I would like to merge several matrices using their row names.
These matrices do not all have the same number of rows.
For instance:
m1 <- matrix(c(1, 2, 3, 4, 5, 6), 3, 2)
rownames(m1) <- c("a","b","c")
m2 <- matrix(c(1, 2, 3, 5, 4, 5, 6, 2), 4, 2)
rownames(m2) <- c("a", "b", "c", "d")
m3 <- matrix(c(1, 2, 3, 4), 2,2)
rownames(m3) <- c("d", "e")
mlist <- list(m1, m2, m3)
For them I would like to get:
Row.names V1.x V2.x V1.y V2.y V1.z V2.z
        a    1    4    1    4   NA   NA
        b    2    5    2    5   NA   NA
        c    3    6    3    6   NA   NA
        d   NA   NA    5    2    1    3
        e   NA   NA   NA   NA    2    4
I have tried to use lapply with the function merge:
M <- lapply(mlist, merge, mlist, by = "row.names", all = TRUE)
However, it did not work:
Error in data.frame(c(1, 2, 3, 4, 5, 6), c(1, 2, 3, 5, 4, 5, 6, 2), c(1, :
arguments imply differing number of rows: 3, 4, 2
Is there an elegant way to merge these matrices?

You are trying to apply a reduction (?Reduce) to the list of matrices, where the reduction step is basically merge. The problem is that merge(m1, m2, by = "row.names", all = TRUE) doesn't give you a new merged matrix with row names; instead it returns the row names in the first column. This is why we need additional logic in the reduction function.
Reduce(function(a, b) {
         res <- merge(a, b, by = "row.names", all = TRUE)
         rn <- res[, 1]        # Row.names column of merge
         res <- res[, -1]      # actual data
         row.names(res) <- rn  # assign row names
         res                   # return the merged data with proper row names
       },
       mlist[-1],         # reduce (left-to-right) by applying function(a, b) repeatedly
       init = mlist[[1]]  # start with the first matrix
)

Or alternatively:
df <- mlist[[1]]
for (i in 2:length(mlist)) {
  df <- merge(df, mlist[[i]], by = "row.names", all = TRUE)
  rownames(df) <- df$Row.names
  df <- df[, !(names(df) %in% "Row.names")]
}
#   V1.x V2.x V1.y V2.y V1 V2
# a    1    4    1    4 NA NA
# b    2    5    2    5 NA NA
# c    3    6    3    6 NA NA
# d   NA   NA    5    2  1  3
# e   NA   NA   NA   NA  2  4
(The m3 columns keep their bare V1/V2 names in the final merge because the earlier columns were already suffixed, so there is no longer a name clash.)

This could also be conceptualised as a reshape operation if the right long-form data.frame is passed to the function:
tmp <- do.call(rbind, mlist)
tmp <- data.frame(tmp, id = rownames(tmp),
                  time = rep(seq_along(mlist), sapply(mlist, nrow)))
reshape(tmp, direction = "wide")
#   id X1.1 X2.1 X1.2 X2.2 X1.3 X2.3
# a  a    1    4    1    4   NA   NA
# b  b    2    5    2    5   NA   NA
# c  c    3    6    3    6   NA   NA
# d  d   NA   NA    5    2    1    3
# e  e   NA   NA   NA   NA    2    4
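If a plain matrix like the desired output is wanted rather than the data frame that reshape returns, here is a short follow-up sketch (wide is an illustrative name for the reshape result; its row names already carry the ids):
wide <- reshape(tmp, direction = "wide")
m <- as.matrix(wide[, -1])  # drop the id column; row names carry over
m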


Replacing NAs in a data frame with values from a different column

I would like to replace NAs in my data frame with values from another column. For example:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
  a1 b1 c1 a2 b2 c2
1  1  3 NA  2  1  3
2  2 NA  3  3  2  3
3  4  4  3  5  4  2
4 NA  4  4  5  5  3
5  2  4  2  3  6  4
6 NA  3  3  4  3  3
I would like to replace the NAs in df$a1 with the values from the corresponding rows of df$a2, the NAs in df$b1 with the values from df$b2, and the NAs in df$c1 with the values from df$c2, so that the new data frame looks like:
> df
  a1 b1 c1
1  1  3  3
2  2  2  3
3  4  4  3
4  5  4  4
5  2  4  2
6  4  3  3
How can I do this? I have a large data frame with many columns, so it would be great to find an efficient way to do this (I've already seen Replace missing values with a value from another column). Thank you!
An extensible option:
df2 <- df[c('a1', 'b1', 'c1')]
df2[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                df[c('a1', 'b1', 'c1')], df[c('a2', 'b2', 'c2')],
                SIMPLIFY = FALSE)
df2
#   a1 b1 c1
# 1  1  3  3
# 2  2  2  3
# 3  4  4  3
# 4  5  4  4
# 5  2  4  2
# 6  4  3  3
It's easy enough to extend this to arbitrary column pairs: the first column in the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset, the second column with the second column, and so on. It can even be generalized with df[grepl('1$', colnames(df))] and df[grepl('2$', colnames(df))], assuming the two subsets don't mismatch.
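For instance, a sketch of that grepl generalization (assuming every column ending in 1 has a same-position partner ending in 2; left and right are just illustrative names):
left  <- df[grepl('1$', colnames(df))]   # columns that may contain NAs
right <- df[grepl('2$', colnames(df))]   # fallback columns, in matching order
left[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                 left, right, SIMPLIFY = FALSE)
left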
coalesce in dplyr is meant to do exactly this (replace NAs in a first vector with non-NA elements of a later one), e.g.
library(dplyr)
coalesce(df$a1, df$a2)
[1] 1 2 4 5 2 4
It can be used with sapply to do the whole dataset in an efficient and easily extensible manner:
sapply(c("a", "b", "c"),
       function(x) coalesce(df[, paste0(x, 1)], df[, paste0(x, 2)]))
     a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)
as.data.frame(dfnew)
This is just for the a1 column; you'll have to run it for each of a, b and c and cbind the results. If there are too many columns, running a loop is the best option, in my opinion.
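A minimal sketch of such a loop, assuming the 1/2 suffix pairing from the question (stems is an illustrative name):
stems <- c("a", "b", "c")
dfnew <- as.data.frame(lapply(stems, function(s) {
  x <- df[[paste0(s, "1")]]  # column with NAs
  y <- df[[paste0(s, "2")]]  # fallback column
  ifelse(is.na(x), y, x)
}))
names(dfnew) <- paste0(stems, "1")
dfnew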
You can use hutils::coalesce. It should be slightly faster, especially if it can 'cheat' -- if any columns have no NAs and so don't need to change, coalesce will skip them:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))
library(microbenchmark)
library(hutils)
library(data.table)
dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # you will need to specify
new <- paste0(letters[1:3], "2")
dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}
hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}
microbenchmark(dplyr = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#>    expr      min       lq     mean   median       uq       max neval cld
#>   dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800   100   b
#>  hutils 36.48602 46.76336 63.46643 52.95736 64.53066  252.5608   100   a
Created on 2018-03-29 by the reprex package (v0.2.0).

data.table - efficiently manipulate large data set

I am amazed by the blazing speed of data.table. The code below does exactly what I need; however, when executed on a large table it does not perform very well. I am convinced that this can be done faster with data.table, but I do not see how.
Output
The output needs to be a matrix with the rownames a regular sequence of days.
For each column separately:
All values before the first value need to be NA.
All values after the last value need to be NA.
Between the first and the last value, 0s need to be inserted where they do not exist in the input table.
The following coding shows how the result should look like:
M <- matrix(c(NA, NA, NA, 2, 0, 1, 3, 0, 2, NA,
              NA, NA, 3, 1, 3, 2, 1, 2, NA, NA),
            ncol = 2,
            dimnames = list(as.character(Sys.Date() + 0:9),
                            c("E1", "E2")))
##            E1 E2
## 2017-01-27 NA NA
## 2017-01-28 NA NA
## 2017-01-29 NA  2
## 2017-01-30  2  2
## 2017-01-31  0  2
## 2017-02-01  3  1
## 2017-02-02  1  3
## 2017-02-03  0  3
## 2017-02-04  2 NA
## 2017-02-05 NA NA
Input
The following table shows the source/input for the coding/function:
library(data.table)
DS <- data.table(
  E = c(rep("E1", 4), rep("E2", 6)),
  C = c(Sys.Date() + c(3, 5, 6, 8),
        Sys.Date() + c(2, 3, 4, 5, 6, 7)),
  S = round(runif(n = 10, min = 1, max = 3), 0),
  key = c("E", "C"))
##      E          C S
##  1: E1 2017-01-30 3
##  2: E1 2017-02-01 1
##  3: E1 2017-02-02 2
##  4: E1 2017-02-04 1
##  5: E2 2017-01-29 3
##  6: E2 2017-01-30 2
##  7: E2 2017-01-31 3
##  8: E2 2017-02-01 1
##  9: E2 2017-02-02 2
## 10: E2 2017-02-03 3
Working code
The following few lines do exactly what I need and are simple. However, they are not efficient: the real table has 700 unique C values and 2 million unique E values.
# Create the regular time line per day
CL <- c(C = (Sys.Date() + 0:9))
# Determine first and last per E
DM <- DS[, .(MIN = min(C), MAX = max(C)), by = .(E)]
# Generate all combinations
CJ <- CJ(E = DS$E, C = CL, unique = TRUE)
# Join
DC <- DS[CJ, on = .(E, C)][!is.na(E)]
# Replace NA by 0
DC[is.na(S), S := 0]
# Lead-in
DC[DM, on = .(E, C < MIN), S := NA]
# Lead-out
DC[DM, on = .(E, C > MAX), S := NA]
# Cast to wide format
DC2 <- dcast(data = DC, formula = C ~ E,
             fun.aggregate = sum, value.var = "S")
# Coerce to matrix
M3 <- as.matrix(DC2[, -1])
# Add row names
rownames(M3) <- format(CL, "%Y-%m-%d")
I wrote some long, unreadable, clumsy code that creates the matrix of 1.2 billion cells in 35 seconds. This must be possible just as fast, but far more elegantly, with data.table; however, not like this.
A data.table, like a data.frame, is underneath everything a list (with length equal to the number of columns). 200 million columns is a lot of columns, and that will make anything slow. The conversion to "wide" you describe will bloat the data with a large number of NA values. You can almost certainly perform the analysis you need on the "long" form, using keys.
It isn't clear from your question exactly what you need, but you can calculate the various sums:
# convert to an IDate
DT[, CALDAY := as.IDate(CALDAY)]
# get the range of dates
rangeDays <- DT[, range(CALDAY)]
all_days <- as.IDate(seq(rangeDays[1], rangeDays[2], by = 1))
# create sums
DT_sum <- DT[, list(VALUE = sum(VALUE)), keyby = list(ENTITY, CALDAY)]
and then index using entity and dates.
DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days)]
and if you need to replace NA with 0
na_replace <- function(x, repl = 0) { x[is.na(x)] <- repl; x }
DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days), na_replace(VALUE)]
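If the wide matrix is still required at the very end, here is a hedged sketch building it from DT_sum, reusing the dcast pattern from the question's working code (wide and M are illustrative names; memory permitting):
wide <- dcast(DT_sum, CALDAY ~ ENTITY, value.var = "VALUE")
M <- as.matrix(wide[, -1])                # drop the CALDAY column
rownames(M) <- as.character(wide$CALDAY)  # dates as row names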
This does the trick, but the performance is still not good. It takes DS as an input parameter. The result is a data.table, which can be coerced to a matrix by:
M <- as.matrix(build_timeseries_DT(DS))
Function
build_timeseries_DT <- function(DS) {
  # regular time series for the complete range, with a row index
  dtC <- data.table(
    CAL = seq(min(DS$CAL), max(DS$CAL), by = "day"))[, idx := 1:.N]
  # add the row index (idx) to the sales
  DQ <- dtC[DS, on = "CAL"]
  setkey(DQ, "ENT")
  # calculate min and max index per ENT
  DM <- DQ[, .(MIN = min(idx), MAX = max(idx)), by = .(ENT)]
  # allocate memory, assign 0, and set row names by reference
  DT <- dtC[, .(CAL)][, (DM[, ENT]) := 0L][, CAL := NULL]
  setattr(DT, "row.names", format(dtC$CAL, "%Y-%m-%d"))
  # set NA for the lead-in and lead-out, then populate values by reference
  for (j in colnames(DT)) {
    set(x = DT,
        i = c(1L:(DM[j, MIN]), (DM[j, MAX]):DT[, .N]),
        j = j,
        value = NA)
    set(x = DT,
        i = DQ[j, idx],
        j = j,
        value = DQ[j, SLS])
  }
  return(DT)
}
Test Data
DS <- data.table(
  ENT = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  CAL = c(Sys.Date() + c(0, 5, 6, 3, 8, 1, 2, 9, 3, 5)),
  SLS = as.integer(c(1, 2, 1, 2, 3, 1, 2, 3, 2, 1)),
  key = c("ENT", "CAL"))
    ENT        CAL SLS
 1:   A 2017-01-28   1
 2:   A 2017-02-02   2
 3:   A 2017-02-03   1
 4:   B 2017-01-31   2
 5:   B 2017-02-05   3
 6:   C 2017-01-29   1
 7:   C 2017-01-30   2
 8:   C 2017-02-06   3
 9:   D 2017-01-31   2
10:   D 2017-02-02   1
Result
as.matrix(build_timeseries_DT(DS))
       A  B  C  D
 [1,]  1 NA NA NA
 [2,]  0 NA  1 NA
 [3,]  0 NA  2 NA
 [4,]  0  2  0  2
 [5,]  0  0  0  0
 [6,]  2  0  0  1
 [7,]  1  0  0 NA
 [8,] NA  0  0 NA
 [9,] NA  3  0 NA
[10,] NA NA  3 NA

Merging two vectors with an 'or'

I have 2 vectors, each of which has some NA values.
a <- c(1, 2, NA, 3, 4, NA)
b <- c(NA, 6, 7, 8, 9, NA)
I'd like to combine these two with a result that uses the value from a if it is non-NA, otherwise the value from b.
So the result would look like:
c <- c(1, 2, 7, 3, 4, NA)
How can I do this efficiently in R?
How about:
> c <- ifelse(is.na(a), b, a)
> c
[1] 1 2 7 3 4 NA
Try
a[is.na(a)] <- b[is.na(a)]
a
## [1] 1 2 7 3 4 NA
Or, if you don't want to overwrite a, just do
c <- a
c[is.na(c)] <- b[is.na(c)]
c
## [1] 1 2 7 3 4 NA
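As a side note, this "a-or-b" merge is exactly the coalesce operation shown in an earlier answer in this collection; with dplyr loaded:
library(dplyr)
coalesce(a, b)
## [1]  1  2  7  3  4 NA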

Insert counts of elements in columns into a table in R

I'm working in R and I've got a matrix with A, B and NA values, and I would like to count the number of A, B and NA values in every column and insert the results into a table. I used the code below to count the A, B and NA values.
mydata <- matrix(c(rep("A", 8), rep("B", 2), rep(NA, 2), rep("A", 4),
                   rep(c("B", "A", "A", "A"), 2), rep("A", 4)),
                 ncol = 4, byrow = TRUE)
myFun <- function(x) {
  data.frame(n.A = sum(x == "A", na.rm = TRUE),
             n.B = sum(x == "B", na.rm = TRUE),
             n.NA = sum(is.na(x)))
}
count <- apply(mydata, 2, myFun)
Now I need to combine the results from count (count <- apply(mydata, 2, myFun)) into a single data frame with one header row.
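A minimal sketch of that last step, assuming count is the list of one-row data frames returned by apply above:
# apply() with a data.frame-returning function gives a list of
# one-row data frames; rbind them into a single table
do.call(rbind, count)
#   n.A n.B n.NA
# 1   4   3    0
# 2   6   1    0
# 3   6   0    1
# 4   6   0    1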
Almost identical in concept to mnel's answer, you can also try the following in base R:
sapply(as.data.frame(mydata),
       function(x) table(factor(x, levels = unique(as.vector(mydata))),
                         useNA = "always"))
#      V1 V2 V3 V4
# A     4  6  6  6
# B     3  1  0  0
# <NA>  0  0  1  1
Here, rather than manually specifying the factor levels, I've made use of the data in mydata.
I think the easiest is using plyr with adply or ldply. You can replace myFun with a call to table.
library(plyr)
adply(mydata, 2, function(x) table(factor(x, levels = c('A', 'B')),
                                   useNA = 'always'))
#   X1 A B NA
# 1  1 4 3  0
# 2  2 6 1  0
# 3  3 6 0  1
# 4  4 6 0  1
If you have large data, then plyr isn't the way to go; apply will work nicely:
apply(mydata, 2, function(x) {
  xx <- table(factor(x, levels = c('A', 'B')), useNA = 'always')
  names(xx) <- c('nA', 'nB', 'nNA')
  xx
})
    [,1] [,2] [,3] [,4]
nA     4    6    6    6
nB     3    1    0    0
nNA    0    0    1    1

Replace row of NAs with previous row in R

I was wondering if anyone had a quick and dirty solution to the following problem: I have a matrix that contains rows of NAs, and I would like to replace each row of NAs with the previous row (provided it is not also a row of NAs).
Assume that the first row is not a row of NAs.
Thanks!
Adapted from an answer to this question: Idiomatic way to copy cell values "down" in an R vector
f <- function(x) {
  idx <- !apply(is.na(x), 1, all)  # TRUE for rows that are not all NA
  x[idx, ][cumsum(idx), ]          # repeat the most recent non-NA row over all-NA rows
}
x <- data.frame(a = c(1, 2, NA, 3, NA, NA), b = c(4, 5, NA, 6, NA, 7))
> x
   a  b
1  1  4
2  2  5
3 NA NA
4  3  6
5 NA NA
6 NA  7
> f(x)
    a b
1   1 4
2   2 5
2.1 2 5
4   3 6
4.1 3 6
6  NA 7
This tries to account for cases where two all-NA rows occur in a row.
# create a data set like you describe (in the future, please do this yourself)
set.seed(14)
x <- matrix(rnorm(10), nrow = 2)
y <- rep(NA, 5)
v <- do.call(rbind.data.frame, sample(list(x, x, y), 10, TRUE))
One approach:
NArows <- which(apply(v, 1, function(x) all(is.na(x))))           # find all-NA rows
notNA <- which(!seq_len(nrow(v)) %in% NArows)                     # find non-NA rows
rep.row <- sapply(NArows, function(x) tail(notNA[x > notNA], 1))  # replacement rows
v[NArows, ] <- v[rep.row, ]  # assign
v                            # view
This would not work if your first row is all NAs.
You can always use a loop; here assuming, as indicated, that the first value is not NA:
fill <- data.frame(x = c(1, NA, 3, 4, 5))
for (i in 2:nrow(fill)) {
  if (is.na(fill[i, 1])) fill[i, 1] <- fill[(i - 1), 1]
}
If m is your matrix, this is your quick and dirty solution:
sapply(2:nrow(m), function(i) { if (is.na(m[i, 1])) m[i, ] <<- m[(i - 1), ] })
Note it uses the ugly (and dangerous) <<- operator.
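A side-effect-free sketch of the same idea with a plain loop; unlike the sapply version it checks the whole row, and it copes with consecutive all-NA rows because each row is fixed before the next one is read:
for (i in 2:nrow(m)) {
  if (all(is.na(m[i, ]))) m[i, ] <- m[i - 1, ]  # copy the previous (already fixed) row
}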
Matthew's example:
x <- data.frame(a = c(1, 2, NA, 3, NA, NA), b = c(4, 5, NA, 6, NA, 7))
na.rows <- which(apply(x, 1, function(z) all(is.na(z))))
x[na.rows, ] <- x[na.rows - 1, ]
x
#---
   a b
1  1 4
2  2 5
3  2 5
4  3 6
5  3 6
6 NA 7
Obviously a first row with all NA's would present problems.
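A hedged guard for that edge case: drop a leading all-NA row from the replacement set (consecutive all-NA rows would still need the cumsum approach shown earlier):
na.rows <- na.rows[na.rows > 1]  # leave an all-NA first row untouched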
Here is a straightforward and conceptually perhaps the simplest one-liner:
x <- data.frame(a = c(1, 2, NA, 3, NA, NA), b = c(4, 5, NA, 6, NA, 7))
   a  b
1  1  4
2  2  5
3 NA NA
4  3  6
5 NA NA
6 NA  7
x1 <- t(sapply(1:nrow(x), function(y) ifelse(is.na(x[y, ]), x[y - 1, ], x[y, ])))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    2    5
[4,]    3    6
[5,]    3    6
[6,]   NA    7
To put the column names back, just use colnames(x1) <- colnames(x).
