I have been struggling for a while with this issue in R:
Let's say that I have a for loop which runs over 9 data files. In this loop, I extract a short vector from a longer vector. Easy.
After that, still during the loop, I want to fill a matrix with the freshly extracted short vector and save the matrix to a file. Not easy, because the short vectors all have different lengths. To avoid the problem, I tried saving the vectors directly (still in the loop) to a .csv file with write.table() and append=TRUE, but R appends all 9 vectors in the same column, when I would like to have 9 columns. Does anything like by.column=TRUE exist?
One way or another, I only want to have my 9 short vectors in 9 columns in a data file. It seems to me that I am not too far off with the second approach.
Does anyone have an idea how to finish this little program? I would greatly appreciate it.
Thank you for your time
Michel
I'd suggest that you adjust your workflow slightly, doing something like this:
First collect all of the vectors in a single R list object.
Still within R, massage the list of vectors into a matrix or data.frame. (This will be easier once you have extracted all of the vectors, and know the length of the longest one.)
Only then write out the resultant table with a single call to write.table or write.csv.
Here's an example:
## Set up your for() loop or lapply() call so that it returns a list structure
## like 'l'
l <- list(a=1:9, b=1:2, c=1:5)
## Make a matrix from the list, with shorter vectors filled out with ""s
n <- max(sapply(l, length))
ll <- lapply(l, function(X) {
c(as.character(X), rep("", times = n - length(X)))
})
out <- do.call(cbind, ll)
## Then output it however you'd like. (Here are three options.)
# write.csv(out, file="summary.csv")
# write.csv(out, file="summary.csv", quote=FALSE)
# write.table(out, file="summary.txt", row.names=FALSE, quote=FALSE, sep="\t")
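A variation on the example above, as a minimal sketch: if the columns should stay numeric rather than become character, the shorter vectors can be padded with NA instead of "" and the NAs written out as empty cells. The file path here is a temporary file chosen purely for illustration.

```r
## Same toy list as above
l <- list(a = 1:9, b = 1:2, c = 1:5)

## Pad every vector with NA up to the length of the longest one.
## Assigning to length() extends a vector with NA.
n <- max(lengths(l))
out <- sapply(l, function(v) { length(v) <- n; v })

## Write NAs as empty cells so the file shows 3 ragged columns
csv_path <- tempfile(fileext = ".csv")   # illustrative path only
write.csv(out, file = csv_path, na = "", row.names = FALSE)
```

The result is still a numeric matrix in R, so it can be used for further computation before it is written.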
I need to create a loop that converts several lists into data frames and then writes each data frame as a CSV. That is, I want to (i) run a loop that unlists all my lists and converts them into data.frames, and (ii) write each one as a CSV.
I ran the following script, which works for one of my lists, but I need to do the same for many of them.
Script to convert a nested list (e.g., list1) in data frame, and write as CSV
data <- as.data.frame(t(do.call(rbind,unlist(list1,recursive = FALSE))))
write.csv(data,"list1.csv")
Please note that "list1" is just one of my lists, used here as an example. I wrote a line (done <- ls(pattern="list")) to get a vector with the names of all the lists loaded in the R environment, so I need to apply steps (i) and (ii) to every name in the "done" vector. Is that clearer now?
I would really appreciate it if you could help me create the loop.
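## 'done' is a character vector of object names, so iterate over its positions
for(i in seq_along(done)){
list_name <- done[i]
## get() looks up the object whose name is stored in list_name
data <- as.data.frame(t(do.call(rbind,unlist(get(list_name),recursive = FALSE))))
write.csv(data,paste0(list_name,".csv"))
}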
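fun <- function(x){
## get() fetches the list named "list<x>"; paste0() alone only builds the name
data <- as.data.frame(t(do.call(rbind,unlist(get(paste0("list",x)),recursive = FALSE))))
write.csv(data,paste0("list",x,".csv"))
}
## call fun once per index, where n is the number of lists
invisible(lapply(1:n, fun))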
I believe this is the most efficient way.
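Another way to approach this, sketched under assumptions: mget() can fetch all the named lists at once, after which the conversion and the writing are two short steps. The two toy lists, the helper name to_df, and the temporary output directory are hypothetical stand-ins; in the question, done would come from ls(pattern="list").

```r
## Hypothetical nested lists standing in for the real ones
list1 <- list(g1 = list(a = 1:3, b = 4:6), g2 = list(c = 7:9))
list2 <- list(g1 = list(a = 10:11), g2 = list(b = 12:13))
done <- c("list1", "list2")   # in the question: ls(pattern = "list")

## Conversion step taken from the question, wrapped in a helper
to_df <- function(x) {
  as.data.frame(t(do.call(rbind, unlist(x, recursive = FALSE))))
}

## mget() returns a named list holding the objects themselves
dfs <- lapply(mget(done), to_df)

## Write one CSV per data frame, named after the original object
out_dir <- tempdir()          # illustrative location only
for (nm in names(dfs)) {
  write.csv(dfs[[nm]], file.path(out_dir, paste0(nm, ".csv")))
}
```

Keeping the data frames in the named list dfs also makes them available for inspection before anything is written to disk.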
I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
for( i in 1:length(data_list))
{
data_list[[i]]$x <- max_x
}
Is there any way to write the final part without a for loop?
Answers to some of the questions I'm getting:
What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.
What's the problem with a for loop? I just would rather iterate directly over a list than have to introduce a counter. I don't like counters.
Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.
Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length 3 "list" vector, each containing a reference to the numeric vectors of the underlying columns. If we modify/replace the first column, R creates a new length 3 data.frame "list" containing references to the new(ly modified) column and the other two unmodified columns.
Let's take a look using the address function*
set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
before <- rapply(data_list,address)
Now you want to replace the first column with
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
How you do this doesn't much matter, but here's one way without an explicit loop-with-counter
data_list <- lapply(data_list,`[<-`,"x",value=max_x)
after <- rapply(data_list,address)
Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!
address(max_x)
[1] "05660600"
cbind(before,after)
  before     after
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"
This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure will have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3 cell data.frame "list" and populating it with 3 references**. This, however, could start to become inefficient if you are iteratively "banging on" changes or working with subvectors rather than entire columns. These are use cases for data.table that are not applicable to this example.
* The address function used here is exported from the data.table package.
** And, of course, in this example, the 3 cell outer list "list" containing the 3 data.frames themselves.
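As a quick sanity check in base R only (no address lookup), the sketch below compares the loop-free replacement with the explicit for loop from the question. The names by_loop and by_lapply are introduced here for illustration; the check covers values only and says nothing about memory addresses.

```r
set.seed(1)
data_list <- lapply(1:3, function(i) {
  data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
})
max_x <- do.call(pmax, lapply(data_list, function(d) d$x))

## Version 1: explicit loop with a counter, as in the question
by_loop <- data_list
for (i in seq_along(by_loop)) {
  by_loop[[i]]$x <- max_x
}

## Version 2: loop-free, calling the replacement function `[<-` directly
by_lapply <- lapply(data_list, `[<-`, "x", value = max_x)

## Both versions should hold the same data
same_result <- isTRUE(all.equal(by_loop, by_lapply))

## The untouched columns keep their original values
y_unchanged <- identical(by_lapply[[1]]$y, data_list[[1]]$y)
```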
I have to write an R script where I want to load different numbers of files at different times. The files are loaded into data frames and certain columns of the data frames are extracted. The columns are then merged with the cbind function. My problem is that I do not know how I can adapt to varying numbers of files that are loaded from time to time, because there may be 3 vectors for cbind at one time or 5 vectors at another time. So how can I give cbind a number of vectors so that it doesn't output errors when it doesn't get all vectors? This happens when I give it a fixed number.
raw1 <- read.table()
raw2 <- read.table()
vec1 <- raw1[,2]
vec2 <- raw2[,2]
cbind(vec1,vec2,vec3)
I know I'd better write something interactive, such as a tcltk dialog, and then some kind of loop. Maybe you could give me an idea of how an effective loop could be structured.
You can store the extracted columns in a list and then cbind them using do.call(). This is a good way to cbind lists of arbitrary length.
datalist <- lapply(filenames, function(i) read.table(i)[, 2])
# ... where filenames are the names of the files you want to read, and
# passing any additional parameters to read.table that are needed
# Then cbind all the entries of datalist
do.call(cbind, datalist)
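To make the pattern concrete, here is a self-contained sketch that first creates a few small input files and then reads and binds them. The file names, the column names id and value, and the use of a temporary directory are all assumptions made for the sake of a runnable example.

```r
## Create three small example files (stand-ins for the real data files)
tmp <- tempdir()
filenames <- file.path(tmp, paste0("raw", 1:3, ".txt"))
for (f in filenames) {
  write.table(data.frame(id = 1:4, value = runif(4)), f, row.names = FALSE)
}

## Read each file and keep only its second column
datalist <- lapply(filenames, function(f) read.table(f, header = TRUE)[, 2])
names(datalist) <- basename(filenames)   # these become the column names

## Bind however many columns there happen to be
merged <- do.call(cbind, datalist)
dim(merged)
```

The same code runs unchanged with 3 or 5 files, because only the length of filenames changes.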
I have a list containing 4 matrices, each with 21 random numbers in 3 columns and 7 rows.
I want to create a new list using the lapply function in which each matrix is sorted by its first column.
I tried:
#example data
set.seed(1)
list.a <- replicate(4, list(matrix(sample(1:99, 21), nrow=7)))
ordered <- order(list.a[,1])
lapply(list.a, function(x){[ordered,]})
but at the first step R gives me the error "incorrect number of dimensions". I don't know what to do. It works with one matrix, though.
Please help me. Thanks!
You were almost there - but you would need to iterate through the list to reorder each matrix.
It's easier to do this in one lapply statement:
lapply(list.a, function(x) x[order(x[,1]),])
Note that x in the function call represents the matrices in the list.
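A small runnable check of this approach, as a sketch: it rebuilds example data like that in the question (using simplify = FALSE so replicate() returns a plain list) and confirms that every first column ends up sorted. The names sorted and all_sorted are introduced here for illustration.

```r
## Example data: a list of four 7 x 3 matrices
set.seed(1)
list.a <- replicate(4, matrix(sample(1:99, 21), nrow = 7), simplify = FALSE)

## Reorder the rows of each matrix by its own first column.
## drop = FALSE keeps the result a matrix even if it has a single row.
sorted <- lapply(list.a, function(x) x[order(x[, 1]), , drop = FALSE])

## Verify: the first column of every matrix is now in increasing order
all_sorted <- all(vapply(sorted, function(m) !is.unsorted(m[, 1]), logical(1)))
```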
The top post of this question helped me equally divide a vector into an even set of chunks:
Split a vector into chunks in R
My problem now is that I would like to construct data frames out of the output. Here is the problem in R syntax:
d <- rpois(73,5)
solution1 <- split(d, ceiling(seq_along(d)/20))
ERROR <- as.data.frame(solution1)
The error that you should see is "arguments imply differing number of rows." I'm especially confused because I thought that the as.data.frame() function could handle this problem, as evident here:
http://www.r-bloggers.com/converting-a-list-to-a-data-frame-2/
Thanks for all your help!
EDIT 1:
I am close to a solution with this line; however, NA values are being introduced that distort the output I am looking for:
ldply(solution1,data.frame)
ldply is from the plyr package
Did you read the ?split help page? Did you notice the unsplit() function? That sounds like exactly what you're trying to do here.
d <- rpois(73,5)
f <- ceiling(seq_along(d)/20) #factor for splitting
solution1 <- split(d, f)
unsplit(solution1 , f)
I'm not sure what you expected your data.frame to look like, but the error message you got was because as.data.frame() was trying to create a new column in your data.frame for each item in solution1. And since each of those vectors in the list has a different number of elements, you cannot make a data.frame from that. A data.frame requires that every column has the same number of rows.
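If a data.frame is still wanted, one option that avoids both the error and any NA padding is a long format with one row per value and a column recording the chunk. This is only a sketch of one possible target shape, since the intended layout is not stated; the column names chunk and value are made up for the example.

```r
d <- rpois(73, 5)
chunks <- split(d, ceiling(seq_along(d) / 20))

## Long format: every value gets its own row, labelled by chunk.
## lengths() gives the size of each chunk, so unequal sizes are fine.
long <- data.frame(
  chunk = rep(names(chunks), lengths(chunks)),
  value = unlist(chunks, use.names = FALSE)
)

nrow(long)          # one row per element of d
table(long$chunk)   # chunk sizes, including the shorter last chunk
```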