Exporting Items from List into a Single .csv File in R

I have a list which contains vectors that I would like to export as a single .csv file containing all vectors as named columns.
For instance, suppose I have four vectors of roughly ten items each, from hypothetical cluster analyses of four models containing a variable number of data points, created by
library(vegan)    # for vegdist
library(cluster)  # for agnes
veglist = list.files(pattern = "TXT")                        # create list of files
veg = lapply(veglist, read.csv, header = T, row.names = 1)   # read list of files
vegbc = lapply(veg, vegdist, method = "bray")                # create dissimilarity matrix from each file
av = lapply(vegbc, agnes, method = "average")                # do clustering analysis with each dissimilarity matrix
av2 = lapply(av, cutree, k = 2)                              # cut each hierarchical analysis at the 2-group level
When I type fix(av2), I see:
list(c(1,1,1,1,1,1,2,2,2,2,2,2),c(1,1,1,1,1,2,2,2,2,2),c(1,1,1,2,1,2,2,2,2,2),c(1,1,1,1,2,1,2,2,2,2,2,2,2))
If I type in av2 I see
[[1]]
[1] 1 1 1 1 1 1 2 2 2 2 2 2
[[2]]
[1] 1 1 1 1 1 2 2 2 2 2
[[3]]
[1] 1 1 1 2 1 2 2 2 2 2
[[4]]
[1] 1 1 1 1 2 1 2 2 2 2 2 2 2
I have tried following this example: "How to read every .csv file in R and export them into single large file". It did not work.
I think the underlying problem is that my vectors are not the same size. What I want to do is output the vectors into a single table that looks something like:
a b c d
1 1 1 1
1 1 1 1
1 1 1 1
1 1 2 1
1 1 1 1
1 2 2 2
2 2 2 2
2 2 2 2
2 2 2 2
2 2
2 2
2
Where a,b,c,d are in place of my actual names. Preferably it would look prettier than this, but I could work with it.
I apologize for the very long question, but I was trying to provide enough of an example to go by. I am also sorry if this has a very easy answer, but I am not yet good with R. Thanks in advance.

Here is one way you can do it:
l <- list(c(1,1,1,1,1,1,2,2,2,2,2,2),c(1,1,1,1,1,2,2,2,2,2),c(1,1,1,2,1,2,2,2,2,2),c(1,1,1,1,2,1,2,2,2,2,2,2,2))
maxlength <- max(sapply(l, length))
df <- data.frame(sapply(l, function(x) c(x, rep(NA, (maxlength - length(x))))))
df
   X1 X2 X3 X4
1   1  1  1  1
2   1  1  1  1
3   1  1  1  1
4   1  1  2  1
5   1  1  1  2
6   1  2  2  1
7   2  2  2  2
8   2  2  2  2
9   2  2  2  2
10  2  2  2  2
11  2 NA NA  2
12  2 NA NA  2
13 NA NA NA  2
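If you then want your own column headers in the CSV (the names a, b, c, d below are hypothetical stand-ins for your real model names), you could rename the columns and write the file with blanks instead of NA:
names(df) <- c("a", "b", "c", "d")  # hypothetical column names
write.csv(df, file = "mycsv.csv", row.names = FALSE, na = "")  # na = "" leaves the short columns blank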

You would first need to extend each vector to the length of the longest one, and then you could cbind them together so that write.csv would write them out as "columns":
> maxlength <- max(sapply(l, length))
> mat <- cbind(sapply(l, `length<-`, maxlength))
> mat
      [,1] [,2] [,3] [,4]
 [1,]    1    1    1    1
 [2,]    1    1    1    1
 [3,]    1    1    1    1
 [4,]    1    1    2    1
 [5,]    1    1    1    2
 [6,]    1    2    2    1
 [7,]    2    2    2    2
 [8,]    2    2    2    2
 [9,]    2    2    2    2
[10,]    2    2    2    2
[11,]    2   NA   NA    2
[12,]    2   NA   NA    2
[13,]   NA   NA   NA    2
> write.csv(mat, file="mycsv.csv")
Which looks like this in a text editor (and would be imported into Excel properly):
"","V1","V2","V3","V4"
"1",1,1,1,1
"2",1,1,1,1
"3",1,1,1,1
"4",1,1,2,1
"5",1,1,1,2
"6",1,2,2,1
"7",2,2,2,2
"8",2,2,2,2
"9",2,2,2,2
"10",2,2,2,2
"11",2,NA,NA,2
"12",2,NA,NA,2
"13",NA,NA,NA,2

This can also be done with stri_list2matrix from the stringi package. It pads the shorter vectors with NA but returns a character matrix, so the mode is converted back to integer:
library(stringi)
m1 <- stri_list2matrix(l)
mode(m1) <- "integer"
m1
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 1
# [2,] 1 1 1 1
# [3,] 1 1 1 1
# [4,] 1 1 2 1
# [5,] 1 1 1 2
# [6,] 1 2 2 1
# [7,] 2 2 2 2
# [8,] 2 2 2 2
# [9,] 2 2 2 2
#[10,] 2 2 2 2
#[11,] 2 NA NA 2
#[12,] 2 NA NA 2
#[13,] NA NA NA 2
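If you would like the columns labelled after the source files, then (assuming the veglist vector from the question is still around and in the same order as the list) something like this sketch should work:
colnames(m1) <- veglist  # assumption: veglist holds the file names in matching order
write.csv(m1, file = "mycsv.csv", row.names = FALSE)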

Related

R - Counting through a vector and refreshing at a certain value?

Let's say I have a vector
vec <- c(3,0,1,1,0,3,0,1,3,0,0,0,3)
And I want to be able to count through this vector using the value 3 as the refresh point. So, the output I want is
      vec out
 [1,]   3   1
 [2,]   0   2
 [3,]   1   3
 [4,]   1   4
 [5,]   0   5
 [6,]   3   1
 [7,]   0   2
 [8,]   1   3
 [9,]   3   1
[10,]   0   2
[11,]   0   3
[12,]   0   4
[13,]   3   1
How would I do this in R, preferably without using loops?
With base R, you can use ave(), grouping on cumsum(vec == 3), a counter that starts a new group at every 3:
ave(vec, cumsum(vec == 3), FUN = seq_along)
[1] 1 2 3 4 5 1 2 3 1 2 3 4 1
An option using data.table::rowid:
data.table::rowid(cumsum(vec==3L))
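# [1] 1 2 3 4 5 1 2 3 1 2 3 4 1  (same output as the ave() approach above)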
As another idea, we can locate, for each element of vec, the index of the most recent value of 3:
last3 = cummax((vec == 3) * seq_along(vec))
last3
# [1] 1 1 1 1 1 6 6 6 9 9 9 9 13
And subtract from their respective indices in vec:
seq_along(vec) - last3 + 1 ## `.. - pmax(last3, 1) ..` if `vec[1] != 3`
# [1] 1 2 3 4 5 1 2 3 1 2 3 4 1
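For instance, a quick check of that caveat on a made-up vector that does not start with a 3:
vec2 <- c(0, 1, 3, 0)                           # no leading 3
last3 <- cummax((vec2 == 3) * seq_along(vec2))  # 0 0 3 3
seq_along(vec2) - pmax(last3, 1) + 1
# [1] 1 2 1 2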

How do I generate all the vectors by swapping two positions?

Suppose I have a column vector of [1 1 1 2 2 2 3 3 3] and I want to generate all the different column vectors only by switching two positions. For an example, one such vector would be
[1 1 3 2 2 2 1 3 3].
Try this (it gives you a matrix, each row of which is a unique vector with two elements swapped from the original vector; there are 28 such unique vectors, including the original one):
v <- c(1,1,1,2,2,2,3,3,3)
unique(t(apply(t(combn(1:length(v), 2)), 1, function(x) {v[x] <- v[rev(x)]; v})))
with output:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 1 1 2 2 2 3 3 3 # original one
[2,] 2 1 1 1 2 2 3 3 3 # swap 1st & 4th elements
[3,] 2 1 1 2 1 2 3 3 3 # swap 1st & 5th
[4,] 2 1 1 2 2 1 3 3 3 # ...
[5,] 3 1 1 2 2 2 1 3 3
[6,] 3 1 1 2 2 2 3 1 3
[7,] 3 1 1 2 2 2 3 3 1
[8,] 1 2 1 1 2 2 3 3 3
[9,] 1 2 1 2 1 2 3 3 3
[10,] 1 2 1 2 2 1 3 3 3
[11,] 1 3 1 2 2 2 1 3 3
[12,] 1 3 1 2 2 2 3 1 3
[13,] 1 3 1 2 2 2 3 3 1
[14,] 1 1 2 1 2 2 3 3 3
[15,] 1 1 2 2 1 2 3 3 3
[16,] 1 1 2 2 2 1 3 3 3
[17,] 1 1 3 2 2 2 1 3 3
[18,] 1 1 3 2 2 2 3 1 3
[19,] 1 1 3 2 2 2 3 3 1
[20,] 1 1 1 3 2 2 2 3 3
[21,] 1 1 1 3 2 2 3 2 3
[22,] 1 1 1 3 2 2 3 3 2
[23,] 1 1 1 2 3 2 2 3 3
[24,] 1 1 1 2 3 2 3 2 3
[25,] 1 1 1 2 3 2 3 3 2
[26,] 1 1 1 2 2 3 2 3 3
[27,] 1 1 1 2 2 3 3 2 3
[28,] 1 1 1 2 2 3 3 3 2 # swap 6th & 9th
First, we can use the utils function combn to generate all of the possible pairs of positions to swap. Here, I am assuming you don't want to swap two equal values (e.g. a 1 with a 1), so I check that the values at the two positions differ:
startVec <- c(1,1,1,2,2,2,3,3,3)  # the starting vector from the question

allCombo <- combn(seq_along(startVec), 2)
toKeep <- apply(allCombo, 2, function(x) {
  startVec[x[1]] != startVec[x[2]]
})
Then, apply along those that you are keeping, and swap the positions.
outVecs <- apply(allCombo[, toKeep], 2, function(x) {
  temp <- startVec
  temp[x] <- startVec[rev(x)]
  return(temp)
})
This returns a matrix (one swapped vector per column), but you can convert it to a list, which may be easier to manage, like so:
outVecsInList <- as.list(as.data.frame(outVecs))
head(outVecsInList) shows:
$V1
[1] 2 1 1 1 2 2 3 3 3
$V2
[1] 2 1 1 2 1 2 3 3 3
$V3
[1] 2 1 1 2 2 1 3 3 3
$V4
[1] 3 1 1 2 2 2 1 3 3
$V5
[1] 3 1 1 2 2 2 3 1 3
$V6
[1] 3 1 1 2 2 2 3 3 1
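As a quick sanity check against the first answer: the list should contain the 27 swapped vectors, which together with the original vector account for the 28 unique rows shown above.
length(outVecsInList)
# [1] 27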

Subsetting non NA elements from matrix to a vector

I have a matrix like this:
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]   NA   NA   NA    2   NA   NA
[2,]   NA   NA   NA    7    5    4
[3,]   NA    2    2    2    2    2
[4,]   NA    4    4   32    1    1
[5,]    9   NA   NA   NA   NA   NA
[6,]   NA    2    1    1    1    1
Is there any way to subset (maybe column-wise) the elements which are not NA and then store all the numbers in one numeric vector, so that I can plot them as numeric values?
Thanks
You can use apply with na.omit (because the columns keep different numbers of values, apply returns a list, which unlist then flattens):
unlist(apply(mat, 2, na.omit))
# [1] 9 2 4 2 2 4 1 2 7 2 32 1 5 2 1 1 4 2 1 1
You can also use
na.omit(as.vector(mat))
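Note that na.omit() attaches an "na.action" attribute recording which positions were dropped; if you want a plain numeric vector, you can strip it, for example:
as.vector(na.omit(as.vector(mat)))  # as.vector() drops the "na.action" attribute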
Try this:
#dummy data
mat <- matrix(rep(c(1,2,3,NA),7),ncol=4)
mat
# [,1] [,2] [,3] [,4]
# [1,] 1 NA 3 2
# [2,] 2 1 NA 3
# [3,] 3 2 1 NA
# [4,] NA 3 2 1
# [5,] 1 NA 3 2
# [6,] 2 1 NA 3
# [7,] 3 2 1 NA
mat[!is.na(mat)]
# [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

"Loop through" data.table to calculate conditional averages

I want to "loop through" the rows of a data.table and calculate an average for each row. The average should be calculated based on the following mechanism:
1. Look up the identifier ID in row i (ID(i)).
2. Look up the value of T2 in row i (T2(i)).
3. Calculate the average of the Data1 values over all rows j that meet both criteria: ID(j) = ID(i) and T1(j) = T2(i).
4. Enter the calculated average in the column Data2 of row i.
library(data.table)
DF = data.frame(ID = rep(c("a","b"), each = 6),
                T1 = rep(1:2, each = 3), T2 = c(1,2,3), Data1 = 1:12)
DT = data.table(DF)
DT[, Data2 := NA_real_]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 NA
[2,] a 1 2 2 NA
[3,] a 1 3 3 NA
[4,] a 2 1 4 NA
[5,] a 2 2 5 NA
[6,] a 2 3 6 NA
[7,] b 1 1 7 NA
[8,] b 1 2 8 NA
[9,] b 1 3 9 NA
[10,] b 2 1 10 NA
[11,] b 2 2 11 NA
[12,] b 2 3 12 NA
For this simple example the result should look like this:
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
I think one way of doing this would be to loop through the rows, but I think that is inefficient. I've had a look at the apply() function, but I'm not sure it would solve my problem. I could also use data.frame instead of data.table if this would make it much more efficient or much easier. The real dataset contains approximately 1 million rows.
The rule of thumb is to aggregate first, and then join to that.
agg = DT[,mean(Data1),by=list(ID,T1)]
setkey(agg,ID,T1)
DT[,Data2:={JT=J(ID,T2);agg[JT,V1][[3]]}]
ID T1 T2 Data1 Data2
[1,] a 1 1 1 2
[2,] a 1 2 2 5
[3,] a 1 3 3 NA
[4,] a 2 1 4 2
[5,] a 2 2 5 5
[6,] a 2 3 6 NA
[7,] b 1 1 7 8
[8,] b 1 2 8 11
[9,] b 1 3 9 NA
[10,] b 2 1 10 8
[11,] b 2 2 11 11
[12,] b 2 3 12 NA
As you can see it's a bit ugly in this case (but will be fast). It's planned to add drop which will avoid the [[3]] bit, and maybe we could provide a way to tell [.data.table to evaluate i in calling scope (i.e. no self join) which would avoid the JT= bit which is needed here because ID is in both agg and DT.
keyby has been added to v1.8.0 on R-Forge so that avoids the need for the setkey, too.
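For readers on current data.table versions: the on= argument (added well after this answer) lets you write the same aggregate-then-join as an update join directly; a sketch, where avg is just a name chosen here for the aggregate column:
agg = DT[, .(avg = mean(Data1)), by = .(ID, T1)]
DT[agg, on = .(ID, T2 = T1), Data2 := i.avg]  # match DT's T2 against agg's T1; unmatched rows keep NA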
A somewhat faster alternative to iterating over rows would be a solution which employs vectorization.
R> d <- data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
R> d
ID T1 T2 Data1
1 a 1 1 1
2 a 1 2 2
3 a 1 3 3
4 a 2 1 4
5 a 2 2 5
6 a 2 3 6
7 b 1 1 7
8 b 1 2 8
9 b 1 3 9
10 b 2 1 10
11 b 2 2 11
12 b 2 3 12
R> rowfunction <- function(i) with(d, mean(Data1[which(T1==T2[i] & ID==ID[i])]))
R> d$Data2 <- sapply(1:nrow(d), rowfunction)
R> d
ID T1 T2 Data1 Data2
1 a 1 1 1 2
2 a 1 2 2 5
3 a 1 3 3 NaN
4 a 2 1 4 2
5 a 2 2 5 5
6 a 2 3 6 NaN
7 b 1 1 7 8
8 b 1 2 8 11
9 b 1 3 9 NaN
10 b 2 1 10 8
11 b 2 2 11 11
12 b 2 3 12 NaN
Also, I'd prefer to preprocess the data before getting it into R. That is, if you are retrieving the data from an SQL server, it might be a better choice to let the server calculate the averages, as it will very likely do a better job of this.
R is actually not very good at number crunching, for several reasons. But it's excellent when doing statistics on the already-preprocessed data.
Using tapply and part of another recent post:
DF = data.frame(ID=rep(c("a","b"),each=6), T1=rep(1:2,each=3), T2=c(1,2,3), Data1=c(1:12))
EDIT: Actually, most of the original function is redundant and was intended for something else. Here, simplified:
ansMat <- tapply(DF$Data1, DF[, c("ID", "T1")], mean)
i <- cbind(match(DF$ID, rownames(ansMat)), match(DF$T2, colnames(ansMat)))
DF <- cbind(DF, Data2 = ansMat[i])
# ansMat<-tapply(seq_len(nrow(DF)), DF[, c("ID", "T1")], function(x) {
# curSub <- DF[x, ]
# myIndex <- which(DF$T2 == curSub$T1 & DF$ID == curSub$ID)
# meanData1 <- mean(curSub$Data1)
# return(meanData1 = meanData1)
# })
The trick was doing tapply over ID and T1 instead of ID and T2; the two-column matrix i then uses matrix indexing to pick one cell of ansMat for each row of DF. Anything speedier?

Create dataframe of all array indices in R

Using R, I'm trying to construct a dataframe of the row and col numbers of a given matrix. E.g., if
a <- matrix(c(1:15), nrow=5, ncol=3)
then I'm looking to construct a dataframe that gives:
row col
1 1
1 2
1 3
. .
5 1
5 2
5 3
What I've tried:
row <- matrix(row(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
col <- matrix(col(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
out <- cbind(row, col)
colnames(out) <- c("row", "col")
results in:
row col
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 1 2
[7,] 2 2
[8,] 3 2
[9,] 4 2
[10,] 5 2
[11,] 1 3
[12,] 2 3
[13,] 3 3
[14,] 4 3
[15,] 5 3
Which isn't what I'm looking for, as the sequence of rows and cols is suddenly reversed, even though I specified "byrow=T". I don't see if and where I'm making a mistake, but I would hugely appreciate suggestions to overcome this problem. Thanks in advance!
I'd use expand.grid on the vectors 1:ncol and 1:nrow, then flip the columns with [,2:1] to get them in the order you want:
> expand.grid(seq(ncol(a)),seq(nrow(a)))[,2:1]
Var2 Var1
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Use row and col, but manipulate their output ordering directly, since they return the corresponding indices in place for the input array; t then gives the non-default row-major order you want in the end:
data.frame(row = as.vector(t(row(a))), col = as.vector(t(col(a))))
row col
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Or, as a matrix not a data.frame:
cbind(as.vector(t(row(a))), as.vector(t(col(a))))
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 2 1
[5,] 2 2
[6,] 2 3
[7,] 3 1
[8,] 3 2
[9,] 3 3
[10,] 4 1
[11,] 4 2
[12,] 4 3
[13,] 5 1
[14,] 5 2
[15,] 5 3
You may want to have a look at ?expand.grid, which does just about exactly what you want to achieve.
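For example, naming the arguments up front gives the same result as the accepted approach while avoiding the Var1/Var2 headers (a cosmetic variation, nothing more):
expand.grid(col = seq(ncol(a)), row = seq(nrow(a)))[, c("row", "col")]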
Since there are many ways to skin a cat, I'll chip in with yet another variant based on rep:
data.frame(row=rep(seq(nrow(a)), each=ncol(a)), col=rep(seq(ncol(a)), nrow(a)))
...but to announce a "winner", I think you need to time the solutions:
# Make up a huge matrix...
a <- matrix(runif(1e7), 1e4)
system.time( a1 <- data.frame(row = as.vector(t(row(a))),
                              col = as.vector(t(col(a)))) )      # 0.68 secs
system.time( a2 <- expand.grid(col = seq(ncol(a)),
                               row = seq(nrow(a)))[,2:1] )       # 0.49 secs
system.time( a3 <- data.frame(row = rep(seq(nrow(a)), each = ncol(a)),
                              col = rep(seq(ncol(a)), nrow(a))) ) # 0.59 secs
identical(a1, a2) && identical(a1, a3) # TRUE
...so it seems @Spacedman has the speediest solution!
