R: Mean of subsets of dataframe on both row and column labels

Let's say I have:
set.seed(42)
d = data.frame(replicate(6,rnorm(10)))
col_labels = c("a", "a", "b", "b", "c", "c")
row_labels = c(1,1,1,2,2,3,3,4,4,4)
I now want to calculate the mean value of a subset of d corresponding to each combination of col_labels and row_labels, i.e.:
s = subset(d, row_labels==1, select=col_labels=="a")
s_mean = mean(as.matrix(s))
In the end, I would like a dataframe, with rows corresponding to row_labels and columns corresponding to col_labels and values the mean value of the subset. How do I do this without a large number of for-loops?

Here's another option:
res <- lapply(split.default(d, col_labels), FUN=by, INDICES=list(row_labels), function(x) mean(unlist(x)))
do.call(rbind, res)
# 1 2 3 4
# a 0.56201 0.1563 0.4393 -0.3193
# b -0.01075 0.7515 -0.7973 -0.8620
# c 0.28615 -0.3406 0.1443 -0.1583

Try:
set.seed(42)
d <- data.frame(replicate(6,rnorm(10)))
indx <- expand.grid(unique(row_labels), unique(col_labels))
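# note: apply() coerces indx to a character matrix, so x[1] and x[2] below are characters and the == comparisons work by coercion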
val1 <- apply(indx, 1, function(x)
mean(as.matrix(subset(d, row_labels==x[1], select=col_labels==x[2]))))
val1
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or
fun1 <- function(x,y) mean(as.matrix(subset(d, row_labels==x, select=col_labels==y)))
mapply(fun1, indx[,1], indx[,2])
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or using outer
outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
# [,1] [,2] [,3]
#[1,] 0.5620072 -0.01074557 0.2861531
#[2,] 0.1562552 0.75147423 -0.3405815
#[3,] 0.4392737 -0.79730155 0.1443161
#[4,] -0.3192931 -0.86200887 -0.1583452
cbind the indx and val1:
res <- cbind(indx, val1)
head(res,3)
#Var1 Var2 val1
#1 1 a 0.5620072
#2 2 a 0.1562552
#3 3 a 0.4392737
mean(as.matrix(subset(d, row_labels==1, select=col_labels=="a")))
#[1] 0.5620072
mean(as.matrix(subset(d, row_labels==2, select=col_labels=="a")))
#[1] 0.1562552
Update
You can also change the formatting
res1 <- outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
dimnames(res1) <- list(unique(row_labels), unique(col_labels))
res1
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452
Or you could use reshape2
library(reshape2)
acast(res, Var1~Var2, value.var="val1")
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452

You're going to need to change the data to long format. You should consider why you imported the data in this format, and whether there are better ways of cleaning it.
Firstly, set the column names
colnames(d) <- col_labels
Secondly, you cannot have duplicate rownames, so you can't simply do rownames(d) <- row_labels.
Instead, we're going to have to split them up another way. You could use
split(d, row_labels)
Now we're going to get it all into long format. The melt function in the package reshape2 is commonly used for this.
require(reshape2)
dMelt <- melt(split(d, row_labels))
Now look at dMelt. Is there any reason you couldn't have organised the data in this way?
In order to find the subsetted means, use the function aggregate()
aggregate(dMelt$value, FUN=mean, by=list(dMelt$variable, dMelt$L1))
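If you then want the wide layout from the question (rows corresponding to row_labels, columns to col_labels), one possible finishing step, sketched here under the assumption that dMelt has the columns variable, value and L1 produced above, is to reshape the aggregate result with dcast:
agg <- aggregate(dMelt$value, FUN = mean, by = list(col = dMelt$variable, row = dMelt$L1))
dcast(agg, row ~ col, value.var = "x")  # aggregate names its value column "x"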

Here is an option using data.table. It should be very fast and works without any loop:
library(data.table)
library(reshape2)
set.seed(42)
merge(
  setkey(data.table(variable = colnames(d), x = col_labels), variable),
  setkey(melt(setDT(d)[, row := row_labels], id.vars = "row"), variable)
)[, mean(value), c("row", "x")]
row x V1
1: 1 a 0.56200717
2: 2 a 0.15625521
3: 3 a 0.43927374
4: 4 a -0.31929307
5: 1 b -0.01074557
6: 2 b 0.75147423
7: 3 b -0.79730155
8: 4 b -0.86200887
9: 1 c 0.28615306
10: 2 c -0.34058148
11: 3 c 0.14431610
12: 4 c -0.15834522
The idea is to:
put the d data.frame in long format after adding the row labels as a column
merge it with another data.table to map the original column names onto your repeated column labels
compute the mean by group of row and x (the label column resulting from the merge); a step-by-step sketch follows
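Here is that sketch. It assumes a fresh d, row_labels and col_labels as defined in the question (setDT() in the chained call above modifies d by reference, so re-create d first), and it uses copy() so this version leaves d untouched:
library(data.table)
dt <- setDT(copy(d))[, row := row_labels]                  # add the row labels as a column
long <- melt(dt, id.vars = "row", variable.factor = FALSE) # long format: row, variable, value
map <- data.table(variable = colnames(d), x = col_labels)  # original column name -> column label
merged <- merge(map, long, by = "variable")                # attach the label to every value
merged[, mean(value), by = c("row", "x")]                  # mean for each (row, x) combination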

Related

Replace values in a data.table based on row values in another table

I have two data.tables:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
right_table <- data.table(record = sample(LETTERS, 9))
I would like to replace the numeric entries in left_table by the values associated with the corresponding row numbers in right_table, e.g. all instances of 4 in left_table are replaced by whatever letter (or set of characters in my real data) is on row 4 of right_table, and so on.
I have this solution, but I feel it's a bit cumbersome and a simpler solution must be possible:
right_table <- data.table(row_n = as.character(seq_along(1:9)), right_table)
for (i in seq_along(left_table)){
  cols <- colnames(left_table)
  current_col <- cols[i]
  # convert numbers to character to allow := to work for matching records
  left_table[, (current_col) := lapply(.SD, as.character), .SDcols = current_col]
  #right_table[, (current_col) := lapply(.SD, as.character), .SDcols = current_col]
  # set keys for quick joins
  setkeyv(left_table, current_col)
  setkeyv(right_table, "row_n")
  # replace matching records
  left_table[right_table, (current_col) := record]
}
You can create the new columns by fetching the letters from right_table, using the original variables as row indices.
left_table[, c("newa","newb","newc") :=
.(right_table[a,record],right_table[b,record],right_table[c,record])]
# a b c newa newb newc
# 1: 1 4 8 Y A R
# 2: 2 5 9 D B W
# 3: 3 6 10 G K <NA>
# 4: 4 7 11 A N <NA>
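Here right_table[a, record] uses the numeric column a as a row index into right_table, so values larger than nrow(right_table) (10 and 11 in column c) come back as NA, which is what the <NA> entries above show.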
Edit:
To make it more generic:
columnNames <- names(left_table)
left_table[, (columnNames) :=
lapply(columnNames, function(x) right_table[left_table[,get(x)],record])]
Although there is probably a better way to do this without needing to call left_table inside lapply().
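One possibility (a sketch, assuming left_table and right_table as originally defined, i.e. numeric columns and a single record column) is to loop over .SD instead of over the column names:
left_table[, names(left_table) := lapply(.SD, function(v) right_table[v, record]),
           .SDcols = names(left_table)]
Each column v is then used as a row index into right_table, just like right_table[a, record] above, so nothing inside lapply() needs to refer to left_table.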
Using mapvalues from plyr:
library(plyr)
corresp <- function(x) mapvalues(x,seq(right_table$record),right_table$record)
left_table[,c(names(left_table)) := lapply(.SD,corresp),.SDcols = names(left_table)]
a b c
1: N K X
2: U Q V
3: Z I 10
4: K G 11
Here is my attempt. When we replace the numeric values with character values, we get NAs, as we see in some other answers, so I decided to take another way. First, I created a vector using unlist(). Then, I used fifelse() from the data.table package: I used foo as indices and replaced the numbers in foo with characters, converting the numbers that have no matching record (i.e., 10 and 11 in the sample data) to character instead. Then, I created a matrix and converted it to a data.table object. Finally, I assigned column names to the object.
library(data.table)
foo <- unlist(left_table)
temp <- fifelse(test = foo <= nrow(right_table),
                yes = right_table$record[foo],
                no = as.character(foo))
res <- as.data.table(matrix(data = temp, nrow = nrow(left_table)))
setnames(res, names(left_table))
# a b c
#1: B G J
#2: Y D I
#3: P T 10
#4: G S 11
I think it might be easier to just keep record as a vector and access it via indexing:
left_table <- data.table(a = c(1,2,3,4), b = c(4,5,6,7), c = c(8,9,10,11))
# a b c
#1: 1 4 8
#2: 2 5 9
#3: 3 6 10
#4: 4 7 11
set.seed(0L)
right_table <- data.table(record = sample(LETTERS, 9))
record <- right_table$record
#[1] "N" "Y" "D" "G" "A" "B" "K" "Z" "R"
left_table[, names(left_table) := lapply(.SD, function(k) fcoalesce(record[k], as.character(k)))]
left_table
# a b c
# 1: N G Z
# 2: Y A R
# 3: D B 10
# 4: G K 11

Combine/match/merge vectors by row names

I have a vector of variable names and several single-column matrices.
I want to create a new matrix, built by matching/merging the row names of the single-column matrices.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_1) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but just to exactly obtain the desired output I want:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5

R lapply: check if data frame contains a column. If not, create this column

I have a list of dataframes.
I would like to check every column name of the dataframes. If a column is missing, I want to add it to the dataframe and fill it with NA values.
Dummy data:
d1 <- data.frame(a=1:2, b=2:3, c=4:5)
d2 <- data.frame(a=1:2, b=2:3)
l<-list(d1, d2)
# Check the columns names of the dataframes
# If column is missing, add new column, add NA as values
lapply(l, function(x) if(!("c" %in% colnames(x)))
{
c<-rep(NA, nrow(x))
cbind(x, c) # does not work!
})
What I get:
[[1]]
NULL
[[2]]
a b c
1 1 2 NA
2 2 3 NA
What I want instead:
[[1]]
a b c
1 1 2 4
2 2 3 5
[[2]]
a b c
1 1 2 NA
2 2 3 NA
Thanks for your help!
You could use dplyr::mutate with an ifelse:
library(dplyr)
lapply(l, function(x) mutate(x, c = ifelse("c" %in% names(x), c, NA)))
[[1]]
a b c
1 1 2 4
2 2 3 4
[[2]]
a b c
1 1 2 NA
2 2 3 NA
You have some good answers, but if you want to stick to base R:
lapply(l, function(x) {
  if (!("c" %in% colnames(x))) {
    c <- rep(NA, nrow(x))
    return(cbind(x, c))
  } else {
    return(x)
  }
})
Your code was returning NULL for the first df because you had no else statement to handle the case where c already exists (i.e. FALSE in the if statement).
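A quick illustration of why that happens (independent of this data):
x <- if (FALSE) 1   # an if() without an else evaluates to NULL when the condition is FALSE
is.null(x)
#[1] TRUE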
One way is to use dplyr::bind_rows to bind data.frames in the list and fill entries from missing columns with NA, and then split the resulting data.frame again to produce a list of data.frames:
df <- dplyr::bind_rows(l, .id = "id");
lapply(split(df, df$id), function(x) x[, -1])
#$`1`
# a b c
#1 1 2 4
#2 2 3 5
#
#$`2`
# a b c
#3 1 2 NA
#4 2 3 NA
Or the same as a tidyverse/magrittr chain
bind_rows(l, .id = "id") %>% split(., .$id) %>% lapply(function(x) x[, -1])
library(purrr)
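# add a c column of NAs only when it is missing (length(.x$c) is 0), then return the data frame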
map(l, ~{if(!length(.x$c)) .x$c <- NA; .x})

Insert nonexistent columns in matrix or dataframe in given order

I am on the lookout for a function in R that would check for the presence of particular columns, e.g.
cols=c("a","b","c","d")
in a matrix or dataframe, and would insert a column of NAs for any column that does not exist (in the position in which that column appears in the vector cols). Say you had a matrix or dataframe with named columns "a" and "d": it should insert columns "b" and "c" filled with NAs before column "d", and any columns not listed in cols (e.g. column "e") should be deleted. What would be the easiest and fastest way to achieve this (I am dealing with a fairly large dataset of ca. 1 million rows)? Or is there already some function that does this?
I would separate the creation step and the ordering step. Here is an example:
cols <- letters[1:4]
## initialize test data set
my.df <- data.frame(a = rnorm(100), d = rnorm(100), e = rnorm(100))
## exclude columns not in cols
my.df <- my.df[ , colnames(my.df) %in% cols]
## add missing columns filled with NA
my.df[, cols[!(cols %in% colnames(my.df))]] <- NA
## reorder
my.df <- my.df[, cols]
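Since the question also asks whether a ready-made function exists, the three steps can be wrapped into a small helper for data frames; this is only a sketch and complete_cols is just an illustrative name:
complete_cols <- function(df, cols) {
  df <- df[, intersect(cols, colnames(df)), drop = FALSE]  # drop columns not in cols
  missing <- setdiff(cols, colnames(df))
  if (length(missing)) df[, missing] <- NA                 # add missing columns filled with NA
  df[, cols, drop = FALSE]                                 # reorder to match cols
}
head(complete_cols(my.df, cols), 3)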
Another approach I just discovered uses match, but it only works for matrices:
# original matrix
matrix=cbind(a = 1:2, d = 3:4)
# required columns
coln=c("a","b","c","d")
colnmatrix=colnames(matrix)
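# match() returns NA for names not present; indexing a matrix with an NA column index yields a column of NAs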
matrix=matrix[,match(coln,colnmatrix)]
colnames(matrix)=coln
matrix
a b c d
[1,] 1 NA NA 3
[2,] 2 NA NA 4
Another possibility if your data is in a matrix
# original matrix
m1 <- cbind(a = 1:2, d = 3:4)
m1
# a d
# [1,] 1 3
# [2,] 2 4
# matrix with all columns, filled with NA
all.cols <- letters[1:4]
m2 <- matrix(nrow = nrow(m1), ncol = length(all.cols), dimnames = list(NULL, all.cols))
m2
# a b c d
# [1,] NA NA NA NA
# [2,] NA NA NA NA
# replace columns in 'NA matrix' with values from original matrix
m2[ , colnames(m1)] <- m1
m2
# a b c d
# [1,] 1 NA NA 3
# [2,] 2 NA NA 4

What does "df[] <-" do in R

Pretty simple question, and I have had a quick search on Google and Stack Overflow.
I found this in another post: In aggregate: sum not meaningful for factors.
df[] <- lapply(df, function(x) type.convert(as.character(x)))
How does df[] work?
To add to what Roland wrote in his comment: the point is that using DF[] retains the existing object DF with its attributes, in this case the fact that it has two dimensions and the names a and b.
Rgames> foo<- matrix(1:6,2,3)
Rgames> foo[]<-7:12
Rgames> foo
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Rgames> foo<-7:12
Rgames> foo
[1] 7 8 9 10 11 12
It invokes [<-.data.frame (i.e., the data.frame method for [<-). That way you assign a list to a data.frame. You could also do
df <- as.data.frame(lapply(df, function(x) type.convert(as.character(x))))
Example:
DF <- data.frame(a=1:2, b=3:4)
DF[] <- list(c=10:11, d=12:13)
# a b
# 1 10 12
# 2 11 13
But compare with this:
DF <- `[<-.data.frame`(DF, , , list(c=c("a", "b"), d=c("d", "e")))
# c d
# 1 a d
# 2 b e
VS. this:
DF <- `[<-.data.frame`(DF, 1:2, 1:2, list(c=c("a", "b"), d=c("d", "e")))
# a b
#1 a d
#2 b e
There is also this:
DF <- as.data.frame(list(c=10:11, d=12:13))
# c d
# 1 10 12
# 2 11 13
