I have to write an R script where I want to load different numbers of files at different times. The files are loaded into data frames and certain columns of the data frames are extracted. The columns are then merged with the cbind function. My problem is that I do not know how I can adapt to varying numbers of files that are loaded from time to time, because there may be 3 vectors for cbind at one time or 5 vectors at another time. So how can I give cbind a number of vectors so that it doesn't output errors when it doesn't get all vectors? This happens when I give it a fixed number.
raw1 <- read.table()
raw2 <- read.table()
vec1 <- raw1[,2]
vec2 <- raw2[,2]
cbind(vec1,vec2,vec3)
I know I would be better off writing something interactive, such as a tcltk dialog, and then some kind of loop. Maybe you could give me an idea of how an effective loop could be structured.
You can store the extracted columns in a list and then cbind them using do.call(). This works for lists of arbitrary length, so the number of files no longer matters.
datalist <- lapply(filenames, function(i) read.table(i)[, 2])
# ... where filenames are the names of the files you want to read, and
# passing any additional parameters to read.table that are needed
# Then cbind all the entries of datalist
do.call(cbind, datalist)
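As a minimal, self-contained sketch of the approach above: the file names and contents below are made up for illustration (they are written to a temporary directory), and the second column is assumed to be the one of interest, as in the question. Note that cbind() expects the columns to have the same length; shorter ones would be recycled.

```r
# Hypothetical setup: write three small example files to a temp directory
tmp <- tempfile("cbind_demo_")
dir.create(tmp)
for (i in 1:3) {
  write.table(data.frame(id = 1:4, value = i * (1:4)),
              file = file.path(tmp, paste0("raw", i, ".txt")),
              row.names = FALSE)
}

# Collect however many files are present; nothing below depends on the count
filenames <- list.files(tmp, pattern = "\\.txt$", full.names = TRUE)

# Read each file and keep only its second column
datalist <- lapply(filenames, function(f) read.table(f, header = TRUE)[, 2])
names(datalist) <- basename(filenames)  # these become the column names

combined <- do.call(cbind, datalist)
dim(combined)  # one row per observation, one column per file
```

Adding or removing files in the directory changes the number of columns in the result without any change to the code.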
I want to combine all rows of several data sets. The names of all the data sets start with test, and all data sets have the same number of observations. I know I can combine them with rbind(), but typing the name of every data set takes a lot of time. Can you suggest a better approach?
rbind(test1,test2,test3,test4)
Try first obtaining a vector of all matching objects using ls() with the pattern ^test:
library(data.table)  # rbindlist() comes from the data.table package
dfs <- lapply(ls(pattern="^test"), function(x) get(x))
result <- rbindlist(dfs)
I am taking the suggestion by @Rohit to use rbindlist, which makes it easier to rbind a list of data frames together.
The second line of the code above works only if the data sets are data.tables or data frames. If the data sets are in xts/zoo format, a slight change is needed: use do.call() with rbind() instead.
## First make a list of all your data sets as suggested above
list_xts <- lapply(ls(pattern="^test"), function(x) get(x))
## then use do call and rbind()
xts_results<-do.call(rbind,list_xts)
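For plain data frames the same idea also works in base R without any package. The sketch below uses three made-up data frames named test1 to test3; mget() fetches all matching objects at once as a named list, which do.call() then passes to rbind().

```r
# Hypothetical example objects whose names start with "test"
test1 <- data.frame(id = 1:2, x = c("a", "b"))
test2 <- data.frame(id = 3:4, x = c("c", "d"))
test3 <- data.frame(id = 5:6, x = c("e", "f"))

# mget() returns a named list of the objects themselves;
# the stricter pattern avoids picking up unrelated names such as "tester"
dfs <- mget(ls(pattern = "^test[0-9]+$"))

# Stack them row-wise; unname() drops the list names so that
# rbind() does not prefix the row names with "test1.", "test2.", ...
result <- do.call(rbind, unname(dfs))
nrow(result)
```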
I am trying to isolate one point in the same location (same column and row) from 1000 data frames. Each data frame has the same 8 columns and a varying number of rows (at least one), and I only need the points from the first row for now. These data frames are in a list created with the lapply function. Here is how I did that:
list <- list.files(pattern=".aei")
files <- lapply(list, read.table, ...)
Now, I need to isolate points from each data frame in Row 1 and Column 2. I was able to do this for one data frame with the following code:
a <- data.frame(files[1])[1,2]
However, I can't get this to work for all 1000 files. I've tried several pieces of code, such as:
all <- data.frame(files[1:999])[1,2]
all<- lapply(files data.frame)[1,2]
all<- lapply(files, data.frame[1,2])
and even two different for loops:
for(i in files [[1:999]]) {
list(files[1:999])[1,2]
}
for(i in files [[1:999]]) {
data.frame(files[1:999])[1,2]
}
Are any of these methods on the right track or are they completely wrong? I've been stuck on this for awhile and seem to have hit a complete dead end regarding any other ideas. Please let me know of any suggestions you may have!
We can use an anonymous function (lambda function) to extract the element:
lapply(files, function(x) x[1,2])
read.table already returns a data.frame, so there is no need to wrap the result in data.frame().
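If a plain vector is more convenient than a list, vapply() does the same extraction and checks the type of each result. This is only a sketch: the three small data frames below are hypothetical stand-ins for the list of files, and column 2 is assumed to be numeric.

```r
# Stand-in for the list returned by lapply(list, read.table, ...):
# three hypothetical data frames with differing numbers of rows
files <- list(
  data.frame(a = 1:3, b = c(10.5, 11.5, 12.5)),
  data.frame(a = 1L,  b = 20.5),
  data.frame(a = 1:2, b = c(30.5, 31.5))
)

# One value per data frame (row 1, column 2), returned as a numeric vector
first_vals <- vapply(files, function(x) x[1, 2], numeric(1))
first_vals
```

vapply() stops with an error if any element does not yield exactly one numeric value, which can help catch a malformed file early.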
I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
for( i in 1:length(data_list))
{
data_list[[i]]$x <- max_x
}
Is there any way to write the final part without a for loop?
Answers to some of the questions I'm getting:
What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.
What's the problem with a for loop? I just would rather iterate directly over a list than have to introduce a counter. I don't like counters.
Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.
Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length 3 "list" vector, each containing a reference to the numeric vectors of the underlying columns. If we modify/replace the first column, R creates a new length 3 data.frame "list" containing references to the new(ly modified) column and the other two unmodified columns.
Let's take a look using the address function*
library(data.table)  # provides address()
set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
before <- rapply(data_list,address)
Now you want to replace the first column with
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
How you do this doesn't much matter, but here's one way without an explicit loop-with-counter
data_list <- lapply(data_list,`[<-`,"x",value=max_x)
after <- rapply(data_list,address)
Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!
address(max_x)
[1] "05660600"
cbind(before,after)
  before     after
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"
This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure will have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3 cell data.frame "list" and populating it with 3 references**. This, however, could start to become inefficient if you are iteratively "banging on" changes or working with subvectors rather than entire columns. These are use cases for data.table that are not applicable to this example.
* The address function used here is exported from the data.table package.
** And, of course, in this example, the 3-cell outer "list" containing the 3 data.frames themselves.
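As a small base-R check (a sketch only, it does not inspect memory addresses), the loop from the question and the loop-free lapply() form can be compared directly. The names looped and applied are introduced here purely for illustration.

```r
set.seed(1)
data_list <- lapply(1:3, function(i) {
  data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
})
max_x <- do.call(pmax, lapply(data_list, function(d) d$x))

# Version 1: explicit loop with a counter, as in the question
looped <- data_list
for (i in seq_along(looped)) {
  looped[[i]]$x <- max_x
}

# Version 2: lapply() with the replacement function `[<-`
applied <- lapply(data_list, `[<-`, "x", value = max_x)

# Both versions should hold the same data
all.equal(looped, applied)

# The x column was replaced everywhere; y was left untouched
vapply(applied, function(d) identical(d$x, max_x), logical(1))
identical(applied[[1]]$y, data_list[[1]]$y)
```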
I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
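Applied to the example data from the question, this one-liner can be checked as follows. One caveat worth stating: !colSums(x == '') drops every column that contains at least one empty cell, whereas the all(x == "") version in the question drops only columns that are entirely empty. For this example both give the same result. The name trimmed is introduced here for illustration.

```r
# Rebuild the example data from the question
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))))
myDF[, 1] <- as.factor(myDF[, 1])

# For each group, keep only the columns that contain no empty cells
trimmed <- lapply(split(myDF, myDF$V1), function(x) x[!colSums(x == "")])

sapply(trimmed, ncol)  # number of columns kept per group
```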
I have been struggling for a while with this issue in R:
let's say that I have a for loop which runs over 9 data files. In this loop, I extract a short vector from a longer vector. Easy.
After that, still inside the loop, I want to fill a matrix with the freshly extracted short vector and save the matrix to a file. Not easy, because the short vectors do not all have the same length. To avoid the problem, I tried to save the vectors directly to a .csv file (still in the loop) with write.table() and append=TRUE, but R appends all 9 vectors in the same column, when I would like to have 9 columns. Does anything like by.column=TRUE exist?
One way or another, I only want to have my 9 short vectors in 9 columns in a data file. It looks to me as if I am not too far off with the second solution.
Does anyone have an idea how to finish this little program? I would greatly appreciate it.
Thank you for your time
Michel
I'd suggest that you adjust your workflow slightly, doing something like this:
First collect all of the vectors in a single R list object.
Still within R, massage the list of vectors into a matrix or data.frame. (This will be easier once you have extracted all of the vectors, and know the length of the longest one.)
Only then write out the resultant table with a single call to write.table or write.csv.
Here's an example:
## Set up your for() loop or lapply() call so that it returns a list structure
## like 'l'
l <- list(a=1:9, b=1:2, c=1:5)
## Make a matrix from the list, with shorter vectors filled out with ""s
n <- max(sapply(l, length))
ll <- lapply(l, function(X) {
c(as.character(X), rep("", times = n - length(X)))
})
out <- do.call(cbind, ll)
## Then output it however you'd like. (Here are three options.)
# write.csv(out, file="summary.csv")
# write.csv(out, file="summary.csv", quote=FALSE)
# write.table(out, file="summary.txt", row.names=FALSE, quote=FALSE, sep="\t")
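A variation on the same idea, offered as a sketch rather than a replacement: instead of converting to character and padding with "", the shorter vectors can be padded with NA, which keeps the matrix numeric. The output path below is a temporary file chosen only for illustration.

```r
l <- list(a = 1:9, b = 1:2, c = 1:5)

# Indexing past the end of a vector yields NA, so asking every vector
# for positions 1..n pads the shorter ones with NA
n <- max(lengths(l))
out_na <- sapply(l, `[`, seq_len(n))

# na = "" writes the NA padding as empty cells in the file
tmp_csv <- tempfile(fileext = ".csv")  # hypothetical output path
write.csv(out_na, file = tmp_csv, row.names = FALSE, na = "")
```

Keeping the values numeric means the matrix can still be used for calculations in R before it is written out.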