Binding matrices in R with unequal rows - r

I have two matrices that may have the same number of columns but different numbers of rows (ideally, a solution would generalize to differing numbers of both).
I would like the following behavior (demonstrated using data.frames):
x = data.frame(z = c(8, 9), w = c(10, 11))
y = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
> x
z w
1 8 10
2 9 11
> y
x y
1 1 4
2 2 5
3 3 6
And I would like to do something like
magic_cbind(x, y)
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
I found a perverse solution using rbind.fill from the plyr package:
> x = data.frame(t(x))
> y = data.frame(t(y))
> x
X1 X2
z 8 9
w 10 11
> y
X1 X2 X3
x 1 2 3
y 4 5 6
> rbind.fill(x, y)
X1 X2 X3
1 8 9 NA
2 10 11 NA
3 1 2 3
4 4 5 6
> rbind.fill(x, y) %>% t %>% as.matrix %>% unname
[,1] [,2] [,3] [,4]
[1,] 8 10 1 4
[2,] 9 11 2 5
[3,] NA NA 3 6
But I was wondering whether there is a more elegant solution. I don't know the final size of the matrix in advance, which is a problem, and it grows inside a loop (which is terrible practice, but it won't grow large enough to actually be a concern). That is, given a matrix, I'm trying to bind additional columns obtained through a loop to it in the way described above.
I cobbled my solution up using the following questions:
Bind list with unequal columns
Combining (cbind) vectors of different length
R: column binding with unequal number of rows

We can use cbind.fill from rowr
rowr::cbind.fill(x, y, fill = NA)
# z w x y
#1 8 10 1 4
#2 9 11 2 5
#3 NA NA 3 6

Here's a way in base R:
as.data.frame(lapply(c(x,y),`length<-`,max(nrow(x),nrow(y))))
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
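For the loop case in the question, where pieces keep arriving and the final size is unknown, the same padding idea can be wrapped in a small helper. This is only a sketch under my own assumptions: cbind_fill is a hypothetical name (not the rowr function), and it pads by indexing rows past the end of each data frame, which base R fills with NA.

```r
# Hypothetical helper (illustrative name): column-bind any number of
# data frames or matrices, padding the shorter ones with NA rows.
cbind_fill <- function(...) {
  parts <- lapply(list(...), as.data.frame)
  n <- max(vapply(parts, nrow, integer(1)))
  # Indexing rows beyond nrow(d) yields all-NA rows.
  padded <- lapply(parts, function(d) d[seq_len(n), , drop = FALSE])
  out <- do.call(cbind, padded)
  rownames(out) <- NULL
  out
}

x <- data.frame(z = c(8, 9), w = c(10, 11))
y <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
res <- cbind_fill(x, y)

# In a loop, collect the pieces in a list and bind once at the end
# instead of growing the result on every iteration.
pieces <- list(x, y, data.frame(v = 1:4))
res_all <- do.call(cbind_fill, pieces)
```

Binding once from a list avoids re-padding the accumulated result each time a longer piece shows up.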

data.frame( sapply(c(x,y), '[', seq(max(lengths(c(x, y))))))
z w x y
1 8 10 1 4
2 9 11 2 5
3 NA NA 3 6
Or use
library(magrittr)
library(purrr)
map_df(c(x, y), extract, seq(max(lengths(c(x, y)))))
or
map_df(c(x,y), `[`, seq(max(lengths(c(x, y)))))
# A tibble: 3 x 4
z w x y
<dbl> <dbl> <dbl> <dbl>
1 8.00 10.0 1.00 4.00
2 9.00 11.0 2.00 5.00
3 NA NA 3.00 6.00

Related

using R rowmeans to get mean regardless of any missing values [duplicate]

I would like to get the average for certain columns for each row.
I have this data:
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(1,2,3)
length(y)=4
z=data.frame(w,x,y)
Which returns:
w x y
1 5 1 1
2 6 2 2
3 7 3 3
4 8 4 NA
I would like to get the mean for certain columns, not all of them. My problem is that there are a lot of NAs in my data. So if I wanted the mean of x and y, this is what I would like to get back:
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
I guess I could do something like z$mean=(z$x+z$y)/2 but the last row for y is NA so obviously I do not want the NA to be calculated and I should not be dividing by two. I tried cumsum but that returns NAs when there is a single NA in that row. I guess I am looking for something that will add the selected columns, ignore the NAs, get the number of selected columns that do not have NAs and divide by that number. I tried ??mean and ??average and am completely stumped.
ETA: Is there also a way I can add a weight to a specific column?
Here are some examples:
> z$mean <- rowMeans(subset(z, select = c(x, y)), na.rm = TRUE)
> z
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
weighted mean
> z$y <- rev(z$y)
> z
w x y mean
1 5 1 NA 1
2 6 2 3 2
3 7 3 2 3
4 8 4 1 4
>
> weight <- c(1, 2) # x * 1/3 + y * 2/3
> z$wmean <- apply(subset(z, select = c(x, y)), 1, function(d) weighted.mean(d, weight, na.rm = TRUE))
> z
w x y mean wmean
1 5 1 NA 1 1.000000
2 6 2 3 2 2.666667
3 7 3 2 3 2.333333
4 8 4 1 4 2.000000
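The apply call above runs weighted.mean once per row. As a rough alternative, assuming the weights are named after the columns they apply to (my assumption, not something the answer states), the same NA-aware weighted mean can be computed with matrix operations: sum the weighted values that are present, then divide by the sum of the weights actually used in that row.

```r
z <- data.frame(w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(NA, 3, 2, 1))
weight <- c(x = 1, y = 2)  # illustrative weights, keyed by column name

m <- as.matrix(z[, names(weight)])
# Weighted sum per row, skipping missing cells.
num <- rowSums(sweep(m, 2, weight, `*`), na.rm = TRUE)
# Sum of the weights that belong to non-missing cells in each row.
den <- rowSums(sweep(!is.na(m), 2, weight, `*`))
z$wmean <- num / den
```

A row in which every selected value is missing ends up as 0/0, i.e. NaN.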
Try using rowMeans:
z$mean=rowMeans(z[,c("x", "y")], na.rm=TRUE)
w x y mean
1 5 1 1 1
2 6 2 2 2
3 7 3 3 3
4 8 4 NA 4
Here is a tidyverse solution using c_across which is designed for row-wise aggregations. This makes it easy to refer to columns by name, type or position and to apply any function to the selected columns.
library("tidyverse")
w <- c(5, 6, 7, 8)
x <- c(1, 2, 3, 4)
y <- c(1, 2, 3, NA)
z <- data.frame(w, x, y)
z %>%
rowwise() %>%
mutate(
mean = mean(c_across(c(x, y)), na.rm = TRUE),
max = max(c_across(x:y), na.rm = TRUE)
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> w x y mean max
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 1 1 1
#> 2 6 2 2 2 2
#> 3 7 3 3 3 3
#> 4 8 4 NA 4 4
Created on 2022-06-25 by the reprex package (v2.0.1)

Add column to each data frame within list with function rowSums and range of columns

The following might serve as a small example of the real list.
a <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
b <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
c <- data.frame(
x = c("A","A","A","A","A"),
y = c(1,2,3,4,5),
z = c(1,2,3,4,5))
l <- list(a,b,c)
From the second column to the last column of every data frame, I want to add the row sums as a new column to each data frame.
I tried:
lapply(l, function(x) rowSums(x[2:ncol(x)]))
which returns the correct sums, but doesn't add them to the data frames.
I also tried:
lapply(l, transform, sum = y + z)
which gives me the correct results but is not flexible enough, because I don't always know how many columns each data frame has or what they are named. The only thing I know is that I have to sum from the second column to the end. I tried to combine these two approaches, but I can't figure out exactly how to do it.
Thanks
Try this. You can index the columns so that the first variable is excluded; that way it does not matter how many additional variables each data frame has when computing the row sums. Here is the code:
#Compute rowsums
l1 <- lapply(l, function(x) {x$RowSum <- rowSums(x[, -1], na.rm = TRUE); x})
Output:
l1
[[1]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z RowSum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
Here's how to combine your attempts. I used data[-1] instead of data[2:ncol(data)] because it seems simpler, but either should work.
lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
Unfortunately, transform will be confused if the name of the argument to your anonymous function is the same as a column name: data[-1] needs to look at the data frame, not a particular column. (I originally used function(x) instead of function(data), and this caused an error because there is a column named x. From this perspective, Duck's answer is a little safer.)
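The name clash can be reproduced on a small list. This is only an illustrative sketch with made-up two-row data: inside transform(), a bare x resolves to the column x (a character vector), so rowSums() fails, whereas an argument name that matches no column falls through to the function's own variable.

```r
l <- list(data.frame(x = c("A", "A"), y = c(1, 2), z = c(3, 4)))

# Argument named like a column: x[-1] is evaluated on the character column,
# so rowSums() raises an error, captured here as a message string.
clash <- tryCatch(
  lapply(l, function(x) transform(x, sum = rowSums(x[-1]))),
  error = function(e) conditionMessage(e)
)

# Argument name that is not a column: data[-1] is the data frame minus column 1.
ok <- lapply(l, function(data) transform(data, sum = rowSums(data[-1])))
```

Here clash holds the error message, while ok[[1]] gains a sum column computed from y and z.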
Does this work:
> add_col <- function(df){
+ df[(ncol(df)+1)] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z V4
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
>
With sum as column name:
> add_col <- function(df){
+ df['sum'] = rowSums(df[2:ncol(df)])
+ df
+ }
> lapply(l, add_col)
[[1]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[2]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
[[3]]
x y z sum
1 A 1 1 2
2 A 2 2 4
3 A 3 3 6
4 A 4 4 8
5 A 5 5 10
use tidyverse
library(tidyverse)
map(l, ~.x %>% mutate(Sum := apply(.x[-1], 1, sum)))
#> [[1]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[2]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
#>
#> [[3]]
#> x y z Sum
#> 1 A 1 1 2
#> 2 A 2 2 4
#> 3 A 3 3 6
#> 4 A 4 4 8
#> 5 A 5 5 10
Created on 2020-09-30 by the reprex package (v0.3.0)
We can use map with mutate
library(purrr)
library(dplyr)
map(l, ~ .x %>%
mutate(sum = rowSums(select(., -1))))
Or with c_across
map(l, ~ .x %>%
rowwise() %>%
mutate(sum = sum(c_across(-1), na.rm = TRUE)) %>%
ungroup)

Extract a data frame using model.frame and formula

I want to extract a data frame using a formula that specifies which columns to select and which crossed (interaction) terms to build from them.
I know the model.frame function. However, it does not give me the crossed terms:
For example:
df <- data.frame(x = c(1,2,3,4), y = c(2,3,4,7), z = c(5,6, 9, 1))
f <- formula('z~x*y')
model.frame(f, df)
output:
> df
x y z
1 1 2 5
2 2 3 6
3 3 4 9
4 4 7 1
> f <- formula('z~x*y')
> model.frame(f, df)
z x y
1 5 1 2
2 6 2 3
3 9 3 4
4 1 4 7
I hope to get:
z x y x*y
1 5 1 2 2
2 6 2 3 6
3 9 3 4 12
4 1 4 7 28
Is there a package that could achieve this functionality? (It would be perfect if I can get the resulting matrix as a sparse matrix because the crossed columns will be highly sparse)
You can use model.matrix:
> model.matrix(f, df)
(Intercept) x y x:y
1 1 1 2 2
2 1 2 3 6
3 1 3 4 12
4 1 4 7 28
attr(,"assign")
[1] 0 1 2 3
If you want to save the result as a sparse matrix, you can use the Matrix package:
> mat <- model.matrix(f, df)
> library(Matrix)
> Matrix(mat, sparse = TRUE)
4 x 4 sparse Matrix of class "dgCMatrix"
(Intercept) x y x:y
1 1 1 2 2
2 1 2 3 6
3 1 3 4 12
4 1 4 7 28
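If the goal is the exact layout shown in the question (response first, no intercept column), one possible way is to drop the intercept from the model matrix and bind the response back on. A minimal base R sketch; the names mm and out are mine:

```r
df <- data.frame(x = c(1, 2, 3, 4), y = c(2, 3, 4, 7), z = c(5, 6, 9, 1))
f <- z ~ x * y

mm <- model.matrix(f, df)               # columns: (Intercept), x, y, x:y
out <- cbind(model.frame(f, df)["z"],   # response column
             mm[, -1, drop = FALSE])    # predictors without the intercept
```

The Matrix package also provides sparse.model.matrix(), which builds a sparse matrix directly from the formula rather than converting a dense one; that may be preferable when the crossed columns are large.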

repeatedly applying ave for computing group means in a data frame

The following code separately produces the group means of x and y according to group. Suppose that I have a number of variables for which I need to repeat the same operation.
How would you suggest proceeding in order to obtain the same result through a single command? (I suppose it is necessary to use tapply, but I am not really sure about it.)
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- data.frame(cbind(group, x, y))
dat$m_x <- ave(dat$x, dat$group)
dat$m_y <- ave(dat$y, dat$group)
dat
Many thanks.
Alternative solutions using data.table and plyr packages:
1) Using data.table
require(data.table)
dt <- data.table(dat, key="group")
# Following @Matthew's comment, edited:
dt[, `:=`(m_x = mean(x), m_y = mean(y)), by=group]
Output:
group x y m_x m_y
1: 1 1 2 3 4
2: 1 3 4 3 4
3: 1 5 6 3 4
4: 2 7 8 9 10
5: 2 9 10 9 10
6: 2 11 12 9 10
2) using plyr and transform:
require(plyr)
ddply(dat, .(group), transform, m_x=mean(x), m_y=mean(y))
output:
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
3) using plyr and numcolwise (note the reduced output):
ddply(dat, .(group), numcolwise(mean))
Output:
group x y
1 1 3 4
2 2 9 10
Assuming you have more than just two columns, you would want to use apply to apply ave to every column in the matrix.
x=seq(1,11,by=2); y=seq(2,12,by=2); group=rep(1:2, each=3)
dat <- cbind(x, y)
ave.dat <- apply(dat, 2, function(column) ave(column, group))
# x y
# [1,] 3 4
# [2,] 3 4
# [3,] 3 4
# [4,] 9 10
# [5,] 9 10
# [6,] 9 10
You can also use aggregate():
dat2 <- data.frame(dat, aggregate(dat[,-1], by=list(dat$group), mean)[group, -1])
dat2
group x y x.1 y.1
1 1 1 2 3 4
1.1 1 3 4 3 4
1.2 1 5 6 3 4
2 2 7 8 9 10
2.1 2 9 10 9 10
2.2 2 11 12 9 10
row.names(dat2) <- rownames(dat)
colnames(dat2) <- gsub("(.)\\.1", "m_\\1", colnames(dat2))
dat2
group x y m_x m_y
1 1 1 2 3 4
2 1 3 4 3 4
3 1 5 6 3 4
4 2 7 8 9 10
5 2 9 10 9 10
6 2 11 12 9 10
If the variable names are more than a single character, you would need to modify the gsub() call.
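A base R variant that sidesteps the renaming step: since ave() returns a vector as long as its input, it can be applied to several columns at once with lapply() and assigned straight into prefixed names. A sketch, assuming the columns to average are listed in a vector called vars (my name); multi-character column names work the same way.

```r
x <- seq(1, 11, by = 2); y <- seq(2, 12, by = 2); group <- rep(1:2, each = 3)
dat <- data.frame(group, x, y)

vars <- c("x", "y")   # columns to average; could be setdiff(names(dat), "group")
# ave(col, dat$group) uses FUN = mean by default, giving per-row group means.
dat[paste0("m_", vars)] <- lapply(dat[vars], ave, dat$group)
```

After this, dat has the extra columns m_x and m_y holding the group means repeated on each row.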

R: create a data frame out of a rolling window

Let's say I have a data frame with the following structure:
DF <- data.frame(x = 0:4, y = 5:9)
> DF
x y
1 0 5
2 1 6
3 2 7
4 3 8
5 4 9
What is the most efficient way to turn 'DF' into a data frame with the following structure:
w x y
1 0 5
1 1 6
2 1 6
2 2 7
3 2 7
3 3 8
4 3 8
4 4 9
Where w is a length 2 window rolling through the data frame 'DF'. The length of the window should be arbitrary, i.e. a length of 3 yields
w x y
1 0 5
1 1 6
1 2 7
2 1 6
2 2 7
2 3 8
3 2 7
3 3 8
3 4 9
I am a bit stumped by this problem, because the data frame can also contain an arbitrary number of columns, i.e. w,x,y,z etc.
/edit 2: I've realized edit 1 is a bit unreasonable, as xts doesn't seem to deal with multiple observations per data point
My approach would be to use the embed function. The first thing to do is to create a rolling sequence of indices into a vector. Take a data-frame:
df <- data.frame(x = 0:4, y = 5:9)
nr <- nrow(df)
w <- 3 # window size
i <- 1:nr # indices of the rows
iw <- embed(i,w)[, w:1] # matrix of rolling-window indices of length w
> iw
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 2 3 4
[3,] 3 4 5
wnum <- rep(1:nrow(iw),each=w) # window number
inds <- i[c(t(iw))] # the indices flattened, to use below
dfw <- sapply(df, '[', inds)
dfw <- transform(data.frame(dfw), w = wnum)
> dfw
x y w
1 0 5 1
2 1 6 1
3 2 7 1
4 1 6 2
5 2 7 2
6 3 8 2
7 2 7 3
8 3 8 3
9 4 9 3
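The steps above can be folded into one function for an arbitrary window length and any number of columns. This is a sketch rather than the answer's own code: rolling_windows is a hypothetical name, and it builds the indices with outer() instead of embed(), then indexes the data frame by row so that column types are kept instead of being coerced through a matrix.

```r
# Hypothetical helper: stack all length-w rolling windows of df,
# labelling each window in a leading column named w.
rolling_windows <- function(df, w) {
  n_win <- nrow(df) - w + 1
  stopifnot(w >= 1, n_win >= 1)
  starts <- seq_len(n_win)
  # Column j of the outer() result holds the row indices of window j.
  inds <- as.vector(outer(seq_len(w) - 1L, starts, `+`))
  out <- cbind(w = rep(starts, each = w), df[inds, , drop = FALSE])
  rownames(out) <- NULL
  out
}

DF <- data.frame(x = 0:4, y = 5:9)
win2 <- rolling_windows(DF, 2)
win3 <- rolling_windows(DF, 3)
```

With DF from the question, win2 has 8 rows (4 windows of 2) and win3 has 9 rows (3 windows of 3).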
