What does "df[] <-" do in R

Pretty simple question, and I have already done a quick search on Google and Stack Overflow.
I found this in another post (In aggregate: sum not meaningful for factors):
df[] <- lapply(df, function(x) type.convert(as.character(x)))
How does df[] work?

To add to what Roland wrote (he beat me to it with his comment): the point is that assigning to DF[] keeps the existing object DF together with its attributes, in this case the fact that it has two dimensions and the names a and b. The same thing happens with a matrix:
Rgames> foo<- matrix(1:6,2,3)
Rgames> foo[]<-7:12
Rgames> foo
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Rgames> foo<-7:12
Rgames> foo
[1] 7 8 9 10 11 12

It invokes [<-.data.frame (i.e., the data.frame method for [<-). That way you assign a list to a data.frame. You could also do
df <- as.data.frame(lapply(df, function(x) type.convert(as.character(x))))
Example:
DF <- data.frame(a=1:2, b=3:4)
DF[] <- list(c=10:11, d=12:13)
# a b
# 1 10 12
# 2 11 13
But compare with this:
DF <- `[<-.data.frame`(DF, , , list(c=c("a", "b"), d=c("d", "e")))
# c d
# 1 a d
# 2 b e
VS. this:
DF <- `[<-.data.frame`(DF, 1:2, 1:2, list(c=c("a", "b"), d=c("d", "e")))
# a b
#1 a d
#2 b e
There is also this:
DF <- as.data.frame(list(c=10:11, d=12:13))
# c d
# 1 10 12
# 2 11 13
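To see why the empty brackets matter for the lapply() call from the question, here is a small sketch. The data frame and the names convert, kept and lost are made up for illustration, and as.is = TRUE is an added assumption so that strings stay character instead of becoming factors.

```r
# Hypothetical data frame with every column stored as character
df <- data.frame(a = c("1", "2"), b = c("x", "y"), stringsAsFactors = FALSE)

convert <- function(x) type.convert(as.character(x), as.is = TRUE)

# With df[], the list returned by lapply() is written into the existing data frame
kept <- df
kept[] <- lapply(kept, convert)

# Without the brackets, the name is simply bound to the plain list
lost <- lapply(df, convert)

class(kept)    # "data.frame"
class(lost)    # "list"
class(kept$a)  # "integer"
```

The values are the same in both cases; only the container differs.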

Related

How to organize the output of the list of list in R

Suppose this is my list of lists. I would like to reorganize the results automatically, because my real data contains more than 40 results and it is difficult to organize them manually.
s <- c(1,2,3)
ss <- c(4,5,6)
S <- list(s,ss)
h <- c(4,8,7)
hh <- c(0,3,4)
H <- list(h,hh)
HH <- list(S,H)
names1 <- c("First","Second")
lapply(setNames(HH, paste0(names1, '_Model')), function(x)
setNames(x, paste0('Res_', seq_along(x))))
#$First_Model
#$First_Model$Res_1
#[1] 1 2 3
#$First_Model$Res_2
#[1] 4 5 6
#$Second_Model
#$Second_Model$Res_1
#[1] 4 8 7
#$Second_Model$Res_2
#[1] 0 3 4
I would like to have the result similar to the following:
#$First_Model
#$First_Model$Res_1
#[1] 1 2 3
#$Second_Model
#$Second_Model$Res_1
#[1] 4 8 7
#$First_Model$Res_2
#[1] 4 5 6
#$Second_Model$Res_2
#[1] 0 3 4
The problem in question is how to rearrange the nested list from "Model No. > Results No." to "Results No. > Model No."
I was going for something similar to Wimpel's answer.
Res_no <- seq_along(HH[[1]]) # results elements
lapply(setNames(Res_no, paste0("Res_", Res_no)), function(x)
lapply(setNames(HH, paste0(names1, '_Model')), `[[`, x)
)
Output
#$Res_1
#$Res_1$First_Model
#[1] 1 2 3
#
#$Res_1$Second_Model
#[1] 4 8 7
#
#
#$Res_2
#$Res_2$First_Model
#[1] 4 5 6
#
#$Res_2$Second_Model
#[1] 0 3 4
The core of this solution is extracting the x-th element of each nested list (done by the inner lapply() call). You can do this with lapply or purrr::map, as described here.
The outer lapply() function lets you repeat it for all the "Results No."
Something like this perhaps?
# From your code, create a list L
L <- lapply(setNames(HH, paste0(names1, '_Model')), function(x)
setNames(x, paste0('Res_', seq_along(x))))
# get all x-th elements from the list, and add them to new list L2
L2 <- lapply( 1:length(L[[1]]), function(x) {
lapply(L, "[[", x)
})
# set names of L2
names(L2) <- names(L[[1]])
output
# $Res_1
# $Res_1$First_Model
# [1] 1 2 3
#
# $Res_1$Second_Model
# [1] 4 8 7
#
#
# $Res_2
# $Res_2$First_Model
# [1] 4 5 6
#
# $Res_2$Second_Model
# [1] 0 3 4
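For comparison, the same "Model > Result" to "Result > Model" swap can be sketched with Map() in base R. This is not taken from either answer; it assumes every model holds the same number of results, since Map() would otherwise recycle the shorter lists.

```r
s <- c(1, 2, 3); ss <- c(4, 5, 6)
h <- c(4, 8, 7); hh <- c(0, 3, 4)
HH <- list(list(s, ss), list(h, hh))
names1 <- c("First", "Second")

# Named nested list, built as in the question
L <- lapply(setNames(HH, paste0(names1, "_Model")), function(x)
  setNames(x, paste0("Res_", seq_along(x))))

# Map() walks the inner lists in parallel; list() collects the
# i-th result of every model under the model's name
L2 <- do.call(Map, c(f = list, L))

names(L2)              # "Res_1" "Res_2"
L2$Res_2$Second_Model  # 0 3 4
```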

R Dataframe comparison which, scaling bad

The idea is to find the position of each character value of a data frame column within a reference vector. Example:
L<-LETTERS[1:25]
A<-c(1:25)
df<-data.frame(L,A)
Compare<-c(LETTERS[sample(1:25, 25)])
df[] <- lapply(df, as.character)
for (i in 1:nrow(df)){
df[i,1]<-which(df[i,1]==Compare)
}
head(df)
L A
1 14 1
2 12 2
3 2 3
This works, but it scales very badly, like all for loops. Any ideas using apply or dplyr?
Thanks
Just use match
Your data (use set.seed when providing data using sample)
df <- data.frame(L = LETTERS[1:25], A = 1:25)
set.seed(1)
Compare <- LETTERS[sample(1:25, 25)]
Solution
df$L <- match(df$L, Compare)
head(df)
# L A
# 1 10 1
# 2 23 2
# 3 12 3
# 4 11 4
# 5 5 5
# 6 21 6
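A quick way to convince yourself that match() does the same job as the loop is to run both on the same input and compare. The names x, loop_pos and match_pos below are only for illustration.

```r
set.seed(1)
Compare <- sample(LETTERS[1:25])
x <- LETTERS[1:25]

# Loop version from the question: one which() call per element
loop_pos <- integer(length(x))
for (i in seq_along(x)) {
  loop_pos[i] <- which(x[i] == Compare)
}

# Vectorized version: a single match() call
match_pos <- match(x, Compare)

identical(loop_pos, match_pos)  # TRUE
```

Note that match() returns NA for values missing from Compare, whereas the loop would fail on them, so the two only agree when every value is present exactly once.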

R: Mean of subsets of dataframe on both row and column labels

Let's say I have:
set.seed(42)
d = data.frame(replicate(6,rnorm(10)))
col_labels = c("a", "a", "b", "b", "c", "c")
row_labels = c(1,1,1,2,2,3,3,4,4,4)
I now want to calculate the mean value of a subset of d corresponding to each combination of col_labels and row_labels, ie:
s = subset(d, row_labels==1, select=col_labels=="a")
s_mean = mean(as.matrix(s))
In the end, I would like a dataframe, with rows corresponding to row_labels and columns corresponding to col_labels and values the mean value of the subset. How do I do this without a large number of for-loops?
Here's another option:
res <- lapply(split.default(d, col_labels), FUN=by, INDICES=list(row_labels), function(x) mean(unlist(x)))
do.call(rbind, res)
# 1 2 3 4
# a 0.56201 0.1563 0.4393 -0.3193
# b -0.01075 0.7515 -0.7973 -0.8620
# c 0.28615 -0.3406 0.1443 -0.1583
Try:
set.seed(42)
d <- data.frame(replicate(6,rnorm(10)))
indx <- expand.grid(unique(row_labels), unique(col_labels))
val1 <- apply(indx, 1, function(x)
mean(as.matrix(subset(d, row_labels==x[1], select=col_labels==x[2]))))
val1
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or
fun1 <- function(x,y) mean(as.matrix(subset(d, row_labels==x, select=col_labels==y)))
mapply(fun1, indx[,1], indx[,2])
#[1] 0.56200717 0.15625521 0.43927374 -0.31929307 -0.01074557 0.75147423
#[7] -0.79730155 -0.86200887 0.28615306 -0.34058148 0.14431610 -0.15834522
Or using outer
outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
# [,1] [,2] [,3]
#[1,] 0.5620072 -0.01074557 0.2861531
#[2,] 0.1562552 0.75147423 -0.3405815
#[3,] 0.4392737 -0.79730155 0.1443161
#[4,] -0.3192931 -0.86200887 -0.1583452
Combine indx and val1 with cbind
res <- cbind(indx, val1)
head(res,3)
#Var1 Var2 val1
#1 1 a 0.5620072
#2 2 a 0.1562552
#3 3 a 0.4392737
mean(as.matrix(subset(d, row_labels==1, select=col_labels=="a")))
#[1] 0.5620072
mean(as.matrix(subset(d, row_labels==2, select=col_labels=="a")))
#[1] 0.1562552
Update
You can also change the formatting
res1 <- outer(unique(row_labels), unique(col_labels), Vectorize(fun1))
dimnames(res1) <- list(unique(row_labels), unique(col_labels))
res1
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452
Or you could use reshape2
library(reshape2)
acast(res, Var1~Var2, value.var="val1")
# a b c
#1 0.5620072 -0.01074557 0.2861531
#2 0.1562552 0.75147423 -0.3405815
#3 0.4392737 -0.79730155 0.1443161
#4 -0.3192931 -0.86200887 -0.1583452
You're going to need to change the data to long format. You should consider why you imported the data in this format, and better ways of cleaning it.
Firstly, set the column names
colnames(d) <- col_labels
Secondly, you cannot have duplicate row names, so you can't simply do rownames(d) <- row_labels.
Instead, we're going to have to split them up another way. You could use
split(d, row_labels)
Now we're going to get it all into long format. The melt function in the package reshape2 is commonly used for this.
require(reshape2)
dMelt <- melt(split(d, row_labels))
Now look at dMelt. Is there any reason you couldn't have organised the data in this way?
In order to find the subsetted means, use the function aggregate()
aggregate(dMelt$value, FUN=mean, by=list(dMelt$variable, dMelt$L1))
Here is an option using data.table. It should be very fast, and it works without any loop.
library(data.table)
library(reshape2)
set.seed(42)
merge(
setkey(data.table(variable=colnames(d),x=col_labels),variable),
setkey(melt(setDT(d)[,row:=row_labels,],id.vars="row"),variable))[
,mean(value),c("row","x")]
row x V1
1: 1 a 0.56200717
2: 2 a 0.15625521
3: 3 a 0.43927374
4: 4 a -0.31929307
5: 1 b -0.01074557
6: 2 b 0.75147423
7: 3 b -0.79730155
8: 4 b -0.86200887
9: 1 c 0.28615306
10: 2 c -0.34058148
11: 3 c 0.14431610
12: 4 c -0.15834522
The idea is to:
put the d data.frame into long format after adding the row labels as a column named row
merge it with another data.table to get the correspondence between the original column names and your repeated column labels
compute the mean by group of row and x (the column that results from the merge)
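As a cross-check on the answers above, the same table of means can be sketched in base R with a single tapply() call. The helper names m, cell_row and cell_col are made up for this example.

```r
set.seed(42)
d <- data.frame(replicate(6, rnorm(10)))
col_labels <- c("a", "a", "b", "b", "c", "c")
row_labels <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4)

m <- as.matrix(d)

# Label every cell with the row group and column group it belongs to
cell_row <- row_labels[as.vector(row(m))]
cell_col <- col_labels[as.vector(col(m))]

# Average the cells within each (row group, column group) combination
res <- tapply(as.vector(m), list(cell_row, cell_col), mean)

# Spot check against the subset() approach from the question
check <- mean(as.matrix(subset(d, row_labels == 1, select = col_labels == "a")))
isTRUE(all.equal(unname(res["1", "a"]), check))  # TRUE
```

res is a matrix with one row per row label and one column per column label; wrap it in as.data.frame() if a data frame is required.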

Difference between `names(df[1]) <- ` and `names(df)[1] <- `

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first did not change the column name, while the second did.
I thought it had something to do with the fact that I'm operating on only a subset of df, but then why does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2
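The two forms can be run side by side on the data frame from the question. This is a minimal sketch; after_first and df2 are names introduced only for the illustration.

```r
df <- data.frame(a = 1, b = 2, c = 3)

# First method: the rename happens on a temporary copy of df[1]; when that
# copy is written back with `[<-`, only its values are used
names(df[1]) <- "d"
after_first <- names(df)
after_first  # "a" "b" "c"

# Second method: the names vector itself is modified and assigned back
names(df)[1] <- "d"
names(df)    # "d" "b" "c"

# Spelled-out form of the second method, on a fresh copy
df2 <- data.frame(a = 1, b = 2, c = 3)
df2 <- `names<-`(df2, `[<-`(names(df2), 1, "d"))
identical(names(df), names(df2))  # TRUE
```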

How to assign multiple columns to data.frame without repeating function call

Why doesn't the following example work? Every row ends up with the same value, and there is a warning as well.
data <- data.frame(id = 1:10)
slowCall <- function(id) data.frame(b = rep(id, 3), c = runif(3))
data[,c("d", "e")] <- sapply(data$id, function(id) {
tmp <- slowCall(id)
list(sum(tmp$b), min(tmp$c))
})
Warning message:
In `[<-.data.frame`(`*tmp*`, , c("d", "e"), value = list(3L, 0.104784948984161, :
provided 20 variables to replace 2 variables
print(data)
id d e
1 1 3 0.1047849
2 2 3 0.1047849
3 3 3 0.1047849
4 4 3 0.1047849
5 5 3 0.1047849
6 6 3 0.1047849
7 7 3 0.1047849
8 8 3 0.1047849
9 9 3 0.1047849
10 10 3 0.1047849
You could try something like this. First, vectorize the assign function (per @Joran's answer here), then modify your code slightly.
# vectorize
assignVec <- Vectorize("assign",c("x","value"))
library(plyr)
set.seed(1) # this is just here for reproducibility
data <- data.frame(id = 1:10)
slowCall <- function(id) data.frame(b = rep(id, 3), c = runif(3))
# I store this as `tmp` just to make the code below look cleaner
tmp <- mlply(sapply(data$id, function(id) {
tmp <- slowCall(id)
list(sum(tmp$b), min(tmp$c))
}), c)
# here's the key part:
data <- within(data, assignVec(c('d','e'), tmp, envir=environment()))
Output:
> data
id e d
1 1 0.26550866 3
2 2 0.20168193 6
3 3 0.62911404 9
4 4 0.06178627 12
5 5 0.38410372 15
6 6 0.49769924 18
7 7 0.38003518 21
8 8 0.12555510 24
9 9 0.01339033 27
10 10 0.34034900 30
Note: I invoke plyr::mlply to get your sapply output into a list.
The simpler answer, though, is to change the righthand side of your operation into:
data[,c("d", "e")] <- as.data.frame(t(sapply(data$id, function(id) {
tmp <- slowCall(id)
list(sum(tmp$b), min(tmp$c))
})))
which would give you the same result.
The problem here is that the matrix returned by your sapply contains one-element lists instead of numeric values. Change your list to a c and transpose the output, then it will work.
data[, c("d", "e")] <- t(sapply(data$id, function(id) {
tmp <- slowCall(id)
c(sum(tmp$b), min(tmp$c))
}))
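A slightly stricter variant of the same idea uses vapply() instead of sapply(), so that the shape and type of each per-id result are checked. This is only a sketch under the assumption that both outputs are numeric; the name res is made up.

```r
set.seed(1)  # only for reproducibility
data <- data.frame(id = 1:10)
slowCall <- function(id) data.frame(b = rep(id, 3), c = runif(3))

# One slowCall() per id; vapply() insists on two doubles per call
# and returns a 2 x 10 numeric matrix
res <- vapply(data$id, function(id) {
  tmp <- slowCall(id)
  c(sum(tmp$b), min(tmp$c))
}, FUN.VALUE = numeric(2))

# Transpose so that each id is a row, then assign both columns at once
data[, c("d", "e")] <- t(res)
head(data, 3)
```

Because c() coerces both values to double, this does not preserve an integer type for d; the list-based approach is the one to use when the two columns must keep different types.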
Here's a generic method to add two columns of different data types (e.g. character and numeric). It uses lists and transposes lists (via this answer).
Here, this answer would preserve the integer and numeric types of the two outputs.
rowwise <- lapply(data$id, function(id) {
tmp <- slowCall(id)
list(sum(tmp$b), min(tmp$c))
})
colwise <- lapply(seq_along(rowwise[[1]]), function(i) lapply(rowwise, "[[", i))
data[,c("d", "e")] <- colwise
