I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
Look up the identifier ID in row i (ID(i))
Look up the value of T2 in row i (T2(i))
Calculate the average over the Data1 values in all rows j, which meet these two criteria: ID(j) = ID(i) and T1(j) = T2(i)
Enter the calculated average in the column Data2 of row i
DF = data.frame(ID=rep(c("a","b"),each=6),
T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
DT = data.table(DF)
DT[ , Data2:=NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure if it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but will be fast). It's planned to add drop which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in calling scope (i.e. no self join) which would avoid the JT= bit which is needed here because ID is in both agg and DT.
keyby has been added to v1.8.0 on R-Forge so that avoids the need for the setkey, too.
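For readers on a recent data.table release, the same aggregate-then-join idea can be sketched as an update join. This is only an illustration, assuming a data.table version that supports the on= argument and the .() alias; the column name avg is made up for the example, and T2 is built as an integer here so that it has the same type as T1.

```r
library(data.table)

# Same toy data as in the question; T2 is created as integer to match T1
DT <- data.table(ID    = rep(c("a", "b"), each = 6),
                 T1    = rep(1:2, each = 3),
                 T2    = rep(1:3, times = 4),
                 Data1 = 1:12)

# Step 1: aggregate first (one mean per ID/T1 combination)
agg <- DT[, .(avg = mean(Data1)), by = .(ID, T1)]

# Step 2: join the aggregate back, matching DT's T2 against agg's T1.
# Rows of DT without a match (T2 == 3) are left as NA.
DT[agg, Data2 := i.avg, on = .(ID, T2 = T1)]

print(DT)
```

The i. prefix refers to the column coming from agg, which sidesteps the name clash that the JT= workaround above deals with.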
A somewhat faster alternative to iterating over rows would be a solution which employs vectorization.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
Also, I'd prefer to preprocess the data before getting it into R. I.e. if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job in this.
R is actually not very good at number crunching, for several reasons. But it's excellent when doing statistics on the already-preprocessed data.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF<-cbind(DF,Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2. Anything speedier?
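As a possible base R variant of the same "aggregate over ID and T1, then look up by ID and T2" trick, here is a small sketch using aggregate and match. The pasted lookup key is an assumption of this example (it relies on ID values not containing spaces), and it has not been timed against the tapply version.

```r
DF <- data.frame(ID = rep(c("a", "b"), each = 6), T1 = rep(1:2, each = 3),
                 T2 = c(1, 2, 3), Data1 = c(1:12))

# One mean of Data1 per ID/T1 combination
agg <- aggregate(Data1 ~ ID + T1, data = DF, FUN = mean)

# Look up each row's (ID, T2) pair among the aggregated (ID, T1) pairs;
# pairs that never occur (T2 == 3) give NA
idx <- match(paste(DF$ID, DF$T2), paste(agg$ID, agg$T1))
DF$Data2 <- agg$Data1[idx]

print(DF)
```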
Is there a best practice means of "tidying" a matrix/array? By "tidy" in this context I mean
one row per element of the matrix
one column per dimension. the elements of these columns give you the "coordinates" of the matrix element which is stored on that row
I have an example here for a 2d matrix, but ideally this would work with an array also (This example works for mm <- array(1:18, c(3,3,3)), but I thought that would be too much to paste in here)
mm <- matrix(1:9, nrow = 3)
mm
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
inds <- which(mm > -Inf, arr.ind = TRUE)
cbind(inds, value = mm[inds])
#> row col value
#> [1,] 1 1 1
#> [2,] 2 1 2
#> [3,] 3 1 3
#> [4,] 1 2 4
#> [5,] 2 2 5
#> [6,] 3 2 6
#> [7,] 1 3 7
#> [8,] 2 3 8
#> [9,] 3 3 9
as.data.frame.table
One way to convert from wide to long is the following. See ?as.data.frame.table for more information. No packages are used.
mm <- matrix(1:9, 3)
long <- as.data.frame.table(mm)
The code gives this data.frame:
> long
Var1 Var2 Freq
1 A A 1
2 B A 2
3 C A 3
4 A B 4
5 B B 5
6 C B 6
7 A C 7
8 B C 8
9 C C 9
numbers
If you prefer row and column numbers:
long[1:2] <- lapply(long[1:2], as.numeric)
giving:
> long
Var1 Var2 Freq
1 1 1 1
2 2 1 2
3 3 1 3
4 1 2 4
5 2 2 5
6 3 2 6
7 1 3 7
8 2 3 8
9 3 3 9
names
Note that above it used A, B, C, ... because there were no row or column names. They would have been used if present. That is, had there been row and column names and dimension names the output would look like this:
mm2 <- array(1:9, c(3, 3), dimnames = list(A = c("a", "b", "c"), B = c("x", "y", "z")))
as.data.frame.table(mm2, responseName = "Val")
giving:
A B Val
1 a x 1
2 b x 2
3 c x 3
4 a y 4
5 b y 5
6 c y 6
7 a z 7
8 b z 8
9 c z 9
3d
Here is a 3d example:
as.data.frame.table(array(1:8, c(2,2,2)))
giving:
Var1 Var2 Var3 Freq
1 A A A 1
2 B A A 2
3 A B A 3
4 B B A 4
5 A A B 5
6 B A B 6
7 A B B 7
8 B B B 8
2d only
For 2d one can alternately use row and col:
sapply(list(row(mm), col(mm), mm), c)
or
cbind(c(row(mm)), c(col(mm)), c(mm))
Either of these gives this matrix:
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 2
[3,] 3 1 3
[4,] 1 2 4
[5,] 2 2 5
[6,] 3 2 6
[7,] 1 3 7
[8,] 2 3 8
[9,] 3 3 9
Another method is to use arrayInd together with cbind like this.
# a 3 X 3 X 2 array
mm <- array(1:18, dim=c(3,3,2))
Similar to your code, but with the more natural arrayInd function, we have
# get array in desired format
myMat <- cbind(c(mm), arrayInd(seq_along(mm), .dim=dim(mm)))
# add column names
colnames(myMat) <- c("values", letters[24:26])
which returns
myMat
values x y z
[1,] 1 1 1 1
[2,] 2 2 1 1
[3,] 3 3 1 1
[4,] 4 1 2 1
[5,] 5 2 2 1
[6,] 6 3 2 1
[7,] 7 1 3 1
[8,] 8 2 3 1
[9,] 9 3 3 1
[10,] 10 1 1 2
[11,] 11 2 1 2
[12,] 12 3 1 2
[13,] 13 1 2 2
[14,] 14 2 2 2
[15,] 15 3 2 2
[16,] 16 1 3 2
[17,] 17 2 3 2
[18,] 18 3 3 2
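The arrayInd approach can be wrapped into a small helper that works for a matrix or an array of any number of dimensions. This is only a sketch; the function name tidy_array and the column names dim1, dim2, ... are made up for illustration.

```r
# Hypothetical helper: one row per element, one column per dimension
tidy_array <- function(x) {
  idx <- arrayInd(seq_along(x), .dim = dim(x))        # coordinates of each element
  colnames(idx) <- paste0("dim", seq_len(ncol(idx)))  # dim1, dim2, ...
  data.frame(idx, value = c(x))
}

res <- tidy_array(array(1:18, dim = c(3, 3, 2)))
head(res)
```

Returning a data.frame rather than a matrix lets the value column keep its own type, which matters if the array holds something other than numbers.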
I want to separate a data.table into groups based on the continuity of one variable.
So to speak, from this data.table:
DT <- data.table(Var1 = c(1:5, 7:10))
I want it to be grouped like this:
# Var1 group
# 1: 1 1 # 1 to 5 is continuous with a maximal difference of 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 7 2 # 7 to 10 is continuous again
# 7: 8 2
# 8: 9 2
# 9: 10 2
The difference of Var1 should not be limited to one like in this minimal example, but be adjustable, so that DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2))) will also be separated into two groups when given a maximal difference of 2.
EDIT:
I should clarify that a 'maximal difference' of 2 or more is meant in a way that differences in Var1 smaller than two should be treated as 'continuous'. Furthermore the variable Var1 should not be limited to integer values. The last thing could be avoided by multiplying e.g. 0.14 by 100 to get 14 and also multiplying 'maximal difference' by 100.
DT[, group := rleid(cumprod(c(1, diff(Var1))))]
# Var1 group
#1: 1 1
#2: 2 1
#3: 3 1
#4: 4 1
#5: 5 1
#6: 7 2
#7: 8 2
#8: 9 2
#9: 10 2
step <- 2
DT <- data.table(Var1 = c(seq(1,10, 2), seq(13,30, 2)))
DT[, group := rleid(cumsum(c(FALSE, diff(Var1) != step)))]
# Var1 group
# 1: 1 1
# 2: 3 1
# 3: 5 1
# 4: 7 1
# 5: 9 1
# 6: 13 2
# 7: 15 2
# 8: 17 2
# 9: 19 2
#10: 21 2
#11: 23 2
#12: 25 2
#13: 27 2
#14: 29 2
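If the requirement from the EDIT is read as "start a new group whenever the gap to the previous value exceeds a threshold", a threshold-based variant can be sketched as follows. The helper name group_by_gap and the choice of a strict > comparison are assumptions of this example; it also assumes Var1 is sorted and has at least one element.

```r
# Hypothetical helper: a new group starts wherever the gap exceeds max_diff
group_by_gap <- function(x, max_diff) {
  cumsum(c(TRUE, diff(x) > max_diff))
}

# Steps of 2 with one gap of 4: two groups
g1 <- group_by_gap(c(seq(1, 10, 2), seq(13, 30, 2)), max_diff = 2)

# Non-integer values work without rescaling
g2 <- group_by_gap(c(0.10, 0.14, 0.18, 0.50, 0.54), max_diff = 0.05)

g1
g2
```

Inside a data.table this could be used as DT[, group := group_by_gap(Var1, 2)].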
A base R solution.
foo <- function(x){
gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
if(length(gr) == 1){
cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr))))
}else{
cbind(Var1=x,group=rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr))))
}
}
All kinds of differences work.
foo(c(seq(1,10, 2), seq(13,30, 2)))
Var1 group
[1,] 1 1
[2,] 3 1
[3,] 5 1
[4,] 7 1
[5,] 9 1
[6,] 13 2
[7,] 15 2
[8,] 17 2
[9,] 19 2
[10,] 21 2
[11,] 23 2
[12,] 25 2
[13,] 27 2
[14,] 29 2
Three groups work as well.
foo(c(1:5, 7:10, 13:20))
Var1 group
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 7 2
[7,] 8 2
[8,] 9 2
[9,] 10 2
[10,] 13 3
[11,] 14 3
[12,] 15 3
[13,] 16 3
[14,] 17 3
[15,] 18 3
[16,] 19 3
[17,] 20 3
For a data.table you can try:
foo <- function(x){
gr <- which(!(duplicated(diff(x)) | duplicated(diff(x), fromLast = TRUE)))
if(length(gr) == 1){
rep(1:(length(gr)+1), c(min(gr),length(x)-max(gr)))
}else{
rep(1:(length(gr)+1), c(min(gr), diff(gr),length(x)-max(gr)))
}
}
DT[, group := foo(Var1)]
I have a list which contains vectors that I would like to export as a single .csv file containing all vectors as named columns.
For instance, suppose I have four vectors of cluster memberships from hypothetical cluster analyses of four models, each with a different number of data points, created by
veglist=list.files(pattern="TXT") #create list of files
veg=lapply(veglist,read.csv,header=T,row.names=1) #read list of files
vegbc=lapply(veg,vegdist,method="bray") #create dissimilarity matrix from each file
av=lapply(vegbc,agnes,method="average") #do clustering analysis with each dissimilarity mat
av2=lapply(av,cutree,k=2) #cut the hierarchical analysis at 2 groups level
when I type in fix(av2) I would see:
list(c(1,1,1,1,1,1,2,2,2,2,2,2),c(1,1,1,1,1,2,2,2,2,2),c(1,1,1,2,1,2,2,2,2,2),c(1,1,1,1,2,1,2,2,2,2,2,2,2))
If I type in av2 I see
[[1]]
[1] 1 1 1 1 1 1 2 2 2 2 2 2
[[2]]
[1] 1 1 1 1 1 2 2 2 2 2
[[3]]
[1] 1 1 1 2 1 2 2 2 2 2
[[4]]
[1] 1 1 1 1 2 1 2 2 2 2 2 2 2
I have tried following this example How to read every .csv file in R and export them into single large file. This did not work.
I think the underlying problem is that my vectors are not the same size. What I want to do is output the vectors into a single table that looks something like:
a b c d
1 1 1 1
1 1 1 1
1 1 1 1
1 1 2 1
1 1 1 1
1 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2
2 2
2
Where a,b,c,d are in place of my actual names. Preferably it would look prettier than this, but I could work with it.
I apologize for the very long question, but I was trying to provide enough of an example to go by. I am also sorry if this has a very easy answer, but I am not yet good with R. Thanks in advance.
Here is one way you can do it:
l <- list(c(1,1,1,1,1,1,2,2,2,2,2,2),c(1,1,1,1,1,2,2,2,2,2),c(1,1,1,2,1,2,2,2,2,2),c(1,1,1,1,2,1,2,2,2,2,2,2,2))
maxlength <- max(sapply(l, length))
df <- data.frame(sapply(l, function(x) c(x, rep(NA, (maxlength - length(x))))))
df
X1 X2 X3 X4
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 2 1
5 1 1 1 2
6 1 2 2 1
7 2 2 2 2
8 2 2 2 2
9 2 2 2 2
10 2 2 2 2
11 2 NA NA 2
12 2 NA NA 2
13 NA NA NA 2
You would first need to extend each vector to the length of the longest vector, and then you could cbind them together so that write.csv would send them out as "columns":
> maxlength <- max(sapply(l, length))
> mat <- cbind(sapply(l, `length<-`, maxlength))
> mat
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 1 1 2 1
[5,] 1 1 1 2
[6,] 1 2 2 1
[7,] 2 2 2 2
[8,] 2 2 2 2
[9,] 2 2 2 2
[10,] 2 2 2 2
[11,] 2 NA NA 2
[12,] 2 NA NA 2
[13,] NA NA NA 2
> write.csv(mat, file="mycsv.csv")
Which looks like this in a text editor (and would get imported into Excel properly):
"","V1","V2","V3","V4"
"1",1,1,1,1
"2",1,1,1,1
"3",1,1,1,1
"4",1,1,2,1
"5",1,1,1,2
"6",1,2,2,1
"7",2,2,2,2
"8",2,2,2,2
"9",2,2,2,2
"10",2,2,2,2
"11",2,NA,NA,2
"12",2,NA,NA,2
"13",NA,NA,NA,2
This can be done with stri_list2matrix from stringi
library(stringi)
m1 <- stri_list2matrix(l)
mode(m1) <- "integer"
m1
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
# [4,] 1 1 2 1
# [5,] 1 1 1 2
# [6,] 1 2 2 1
# [7,] 2 2 2 2
# [8,] 2 2 2 2
# [9,] 2 2 2 2
#[10,] 2 2 2 2
#[11,] 2 NA NA 2
#[12,] 2 NA NA 2
#[13,] NA NA NA 2
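To get named columns and blank cells instead of NA in the exported file, as in the table sketched in the question, the padding step can be combined with the na argument of write.csv. This is a sketch only; the names a, b, c, d stand in for the real names, and a temporary file is used in place of an actual output path.

```r
# Named list of vectors with unequal lengths (names are placeholders)
l <- list(a = c(1,1,1,1,1,1,2,2,2,2,2,2),
          b = c(1,1,1,1,1,2,2,2,2,2),
          c = c(1,1,1,2,1,2,2,2,2,2),
          d = c(1,1,1,1,2,1,2,2,2,2,2,2,2))

# Pad every vector with NA up to the longest length
n  <- max(lengths(l))
df <- as.data.frame(lapply(l, `length<-`, n))

# Write without row names and with empty cells for the padding
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE, na = "")

csv_lines <- readLines(out)
csv_lines
```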
I have a bizarre problem where I've combined several data frames that have different species abundance data. I used rbind.fill() to collate the data frames, but some of the column names for the same species are spelled slightly differently, so for several species I have 2-3 columns.
Does anyone know of a way I can merge the data from these columns together?
Simple example:
dat <- matrix(data=c(
Sp.a=c(1,2,3,4,5,NA,NA,NA,NA,NA),
Sp.b=c(3,4,5,6,7,5,4,6,3,4),
Sp.c=c(4,4,4,3,2,NA,NA,NA,NA,NA),
Spp.A=c(NA,NA,NA,NA,NA,2,3,4,2,3),
Spp.C=c(NA,NA,NA,NA,NA,3,4,2,5,4)
), 10,5)
colnames(dat)<- c("sp.a", "sp.b", "sp.c", "Spp.A", "Spp.C")
dat
sp.a sp.b sp.c Spp.A Spp.C
[1,] 1 3 4 NA NA
[2,] 2 4 4 NA NA
[3,] 3 5 4 NA NA
[4,] 4 6 3 NA NA
[5,] 5 7 2 NA NA
[6,] NA 5 NA 2 3
[7,] NA 4 NA 3 4
[8,] NA 6 NA 4 2
[9,] NA 3 NA 2 5
[10,] NA 4 NA 3 4
How can I get sp.a and Spp.A into a single column? (same for sp.c and Spp.C).
Thanks for any help,
Paul
Using reshape2 and going from wide --> long --> wide (again) format:
library(reshape2)
## long format
dat.m <- melt(dat)
## remove missing values
dat.m <- dat.m[!is.na(dat.m$value),]
## rename names
dat.m$Var2 <- tolower(sub("Spp","Sp", dat.m$Var2) )
## wide format
dcast(Var1~Var2,data=dat.m)
Var1 sp.a sp.b sp.c
1 1 1 3 4
2 2 2 4 4
3 3 3 5 4
4 4 4 6 3
5 5 5 7 2
6 6 2 5 3
7 7 3 4 4
8 8 4 6 2
9 9 2 3 5
10 10 3 4 4
Here's one way. This is pretty general, and would even work if you had one series divided over three or more columns.
dat <- data.frame(dat)
# get the last letter of each column and make it lowercase,
# we'll be grouping the columns by this
ns <- tolower(gsub('^.+\\.', '', names(dat)))
# group the columns by their last letter, and run each group through pmax
result <- lapply(split.default(dat, ns), function(x) do.call(function(...) pmax(..., na.rm=TRUE), x))
do.call(cbind, result)
# a b c
# [1,] 1 3 4
# [2,] 2 4 4
# [3,] 3 5 4
# [4,] 4 6 3
# [5,] 5 7 2
# [6,] 2 5 3
# [7,] 3 4 4
# [8,] 4 6 2
# [9,] 2 3 5
# [10,] 3 4 4
ColsToMerge <- c("sp.a", "Spp.A")
dat[["A.merged"]] <-
apply(dat[, ColsToMerge], 1, function(rr) ifelse(is.na(rr[[1]]), rr[[2]], rr[[1]]))
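The row-wise apply above can also be written in a vectorized way, which may matter for larger tables. The following is a sketch under the assumption that at most one of the columns holds a value in any given row; the helper name coalesce_cols is made up, and only the two columns being merged are reproduced here.

```r
# Only the two columns to be merged, as in the question
dat <- data.frame(
  sp.a  = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  Spp.A = c(NA, NA, NA, NA, NA, 2, 3, 4, 2, 3)
)

# Hypothetical helper: take the first non-NA value across any number of columns
coalesce_cols <- function(...) {
  Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
}

dat$A.merged <- coalesce_cols(dat$sp.a, dat$Spp.A)
dat
```

Because Reduce folds over all arguments, the same helper works when one species is spread over three or more columns.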
Using R, I'm trying to construct a dataframe of the row and col numbers of a given matrix. E.g., if
a <- matrix(c(1:15), nrow=5, ncol=3)
then I'm looking to construct a dataframe that gives:
row col
1 1
1 2
1 3
. .
5 1
5 2
5 3
What I've tried:
row <- matrix(row(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
col <- matrix(col(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
out <- cbind(row, col)
colnames(out) <- c("row", "col")
results in:
row col
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 1 2
[7,] 2 2
[8,] 3 2
[9,] 4 2
[10,] 5 2
[11,] 1 3
[12,] 2 3
[13,] 3 3
[14,] 4 3
[15,] 5 3
Which isn't what I'm looking for, as the sequence of rows and cols is suddenly reversed, even though I specified "byrow=T". I don't see if and where I'm making a mistake but would hugely appreciate suggestions to overcome this problem. Thanks in advance!
I'd use expand.grid on the vectors 1:ncol and 1:nrow, then flip the columns with [,2:1] to get them in the order you want:
> expand.grid(seq(ncol(a)),seq(nrow(a)))[,2:1]
Var2 Var1
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Use row and col, but more directly manipulate their output ordering since they return corresponding indices in place for the input array. Use t to get the non-default order you want in the end:
data.frame(row = as.vector(t(row(a))), col = as.vector(t(col(a))))
row col
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Or, as a matrix not a data.frame:
cbind(as.vector(t(row(a))), as.vector(t(col(a))))
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 2 1
[5,] 2 2
[6,] 2 3
[7,] 3 1
[8,] 3 2
[9,] 3 3
[10,] 4 1
[11,] 4 2
[12,] 4 3
[13,] 5 1
[14,] 5 2
[15,] 5 3
You may want to have a look at ?expand.grid, which does just about exactly what you want to achieve.
Since there are many ways to skin a cat, I'll chip in with yet another variant based on rep:
data.frame(row=rep(seq(nrow(a)), each=ncol(a)), col=rep(seq(ncol(a)), nrow(a)))
...but to announce a "winner", I think you need to time the solutions:
# Make up a huge matrix...
a <- matrix(runif(1e7), 1e4)
system.time( a1<-data.frame(row = as.vector(t(row(a))),
col = as.vector(t(col(a)))) ) # 0.68 secs
system.time( a2<-expand.grid(col = seq(ncol(a)),
row = seq(nrow(a)))[,2:1] ) # 0.49 secs
system.time( a3<-data.frame(row=rep(seq(nrow(a)), each=ncol(a)),
col=rep(seq(ncol(a)), nrow(a))) ) # 0.59 secs
identical(a1, a2) && identical(a1, a3) # TRUE
...so it seems @Spacedman has the speediest solution!