removing rows from dataframes in a list using R - r

I have several files in one folder that I read into R as a list. Each element in the list is a data frame, and I need to remove a random run of 6000 consecutive rows from each data frame. I can unlist the data frames and pull out the rows, but ideally I would like to keep everything in the list and just go through each element and remove the rows I need. I thought a for loop or an apply function would work, but the individual elements don't seem to be recognized as data frames when they're in the list.
Here is what I have so far
files <- list.files('file location')
fs <- lapply(files, read.table, sep=',',skip=3,header=TRUE)
##separates the list into individual data frames
for (i in seq(fs))
assign(paste("df", i, sep = ""), fs[[i]])
##selects a random 6000 rows to remove from a dataframe
n <- nrow(df1)
samp <- sample(1:(n-6000),1)
rmvd <- df1[-seq(from = samp, to =samp+5999),]
I would either like to apply the last part to each dataframe individually and put those back into a list or be able to apply it to the list. I want it in a list in the end because it will be easier to write each dataframe to its own csv file.

If you stick with the list of data.frames, fs, instead of assigning them, you can do something like
lapply(fs, function(x) x[-(sample(nrow(x)-6000,1)+0:5999), ])
If n=nrow(x) is ever under 6000, you're in trouble, of course.
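To guard against short data frames, one option is to wrap the row removal in a small helper that checks the row count first. This is only a sketch: drop_block and the toy list fs_demo are made-up names for illustration, and k = 3 stands in for 6000 so the example runs on tiny data.

```r
# Hypothetical helper: drop a random run of k consecutive rows from one data frame.
drop_block <- function(df, k = 6000) {
  n <- nrow(df)
  if (n < k) stop("data frame has fewer than k rows")
  # The block may start anywhere from row 1 to row n - k + 1
  start <- sample.int(n - k + 1, 1)
  df[-seq(start, length.out = k), , drop = FALSE]
}

# Toy stand-in for the list produced by lapply(files, read.table, ...)
fs_demo <- list(data.frame(a = 1:10), data.frame(a = 1:8))
trimmed <- lapply(fs_demo, drop_block, k = 3)
sapply(trimmed, nrow)
```

The result is still a list, so each element can be written to its own csv file afterwards.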

Related

Extracting a Point From Multiple Data Frames Within a List

I am trying to isolate one point in the same location (same column and row) from 1000 data frames. Each data frame has the same 8 columns with a varying number of rows (at least one), and I only need the points from the first row for now. These data frames are within a list created with the lapply function. Here is how I did that:
list <- list.files(pattern=".aei")
files <- lapply(list, read.table, ...)
Now, I need to isolate points from each data frame in Row 1 and Column 2. I was able to do this for one data frame with the following code:
a <- data.frame(files[1])[1,2]
However, I can't get this to work for all 1000 files. I've tried several pieces of code, such as:
all <- data.frame(files[1:999])[1,2]
all<- lapply(files data.frame)[1,2]
all<- lapply(files, data.frame[1,2])
and even two different for loops:
for(i in files [[1:999]]) {
list(files[1:999])[1,2]
}
for(i in files [[1:999]]) {
data.frame(files[1:999])[1,2]
}
Are any of these methods on the right track or are they completely wrong? I've been stuck on this for a while and seem to have hit a complete dead end regarding any other ideas. Please let me know of any suggestions you may have!
We can use an anonymous function (lambda function) to extract the element
lapply(files, function(x) x[1,2])
The read.table already gives a data.frame, so there is no need to wrap with data.frame
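If a plain vector is more convenient than a list, vapply can do the same extraction while checking the type of each result. A minimal sketch, assuming the value in row 1, column 2 is numeric in every file; files_demo is a made-up stand-in for the real list:

```r
# Toy stand-in for the list returned by lapply(list, read.table, ...)
files_demo <- list(
  data.frame(x = 1:2, y = c(10.5, 20.5)),
  data.frame(x = 3L, y = 30.5)
)

# One numeric value per data frame, taken from row 1, column 2
vals <- vapply(files_demo, function(x) x[1, 2], numeric(1))
vals
```

vapply stops with an error if any element does not return a single numeric value, which makes a malformed file easier to spot than with lapply.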

Split dataframe in R by date

I have a data.frame that contains one Date type variable. I want to export 4 files, one containing a subset corresponding to each week. The following will divide my data in 4 however I don't know how to store each of this in a new data.frame.
split(DataAir, sample(rep(1:4)))
Thanks
If you save your split data frames in a variable, you can access the elements with double-bracket subsetting (e.g. s[[1]]). To save them, create a vector of file names as you'd like and write each one to a file.
s <- split(iris, iris$Species)
filenames <- paste0("my_path/file", 1:3, ".csv")
for(i in 1:length(s)) write.csv(s[[i]], filenames[i])
And for R users that get unnecessarily bugged out by for loops:
mapply(function(x,y) write.csv(x,y), s, filenames)
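For the weekly split in the question, one possibility is to build the grouping variable from the Date column with cut() and then reuse the same split-and-write pattern. This is a sketch under assumptions: air_demo is made-up data, the column is assumed to be called Date, and the files are written to a temporary directory.

```r
# Made-up data: 28 consecutive days starting on a Monday
air_demo <- data.frame(
  Date = as.Date("2024-01-01") + 0:27,
  value = seq_len(28)
)

# cut() on a Date with breaks = "week" labels each row by the Monday that starts its week
week_id <- cut(air_demo$Date, breaks = "week")
by_week <- split(air_demo, week_id)

# One csv per week, named after the week's starting date
paths <- file.path(tempdir(), paste0("week_", names(by_week), ".csv"))
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE), by_week, paths))
```

Here by_week is a named list with one data frame per week, so by_week[[1]] is the first week's subset.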

Count the number of data frames beginning with prefix in R

I have a collection of data frames that I have generated in R. I need to count the number of data frames whose names begin with "entry_". I'd like to generate a number to then use for a function that rbinds all of these data frames and these data frames only.
So far, I have tried using grep to identify the data frames, however, this just returns where they are indexed in my object list (e.g., 16:19 --- objects 16-19 begin with "entry_"):
count_entry <- (grep("entry_", objects()))
Eventually I would like to rbind all of these data frames like so:
list.make <- function() {
sapply(paste('entry_', seq(1:25), sep=''), get, environment(), simplify = FALSE)
}
all.entries <- list.make()
final.data <- rbind.fill(all.entries)
I don't want to have to enter the sequence manually every time (for example (1:25) in the code above), which is why I'm hoping to be able to automatically count the data frames beginning with "entry_".
If anyone has any ideas of how to solve this, or how to go about this in a better way, I'm all ears!
Per comment by docendo: The ls function will list objects in an environment that match a regex pattern. You can then use mget to retrieve those objects as a list:
mylist <- mget(ls(pattern = "^entry_"))
That will then work with rbind.fill. You can then remove the original objects using something similar: rm(list = ls(pattern = "^entry_"))
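Put together, the whole round trip might look like the sketch below. The entry_1, entry_2 and other objects are made up for the example, and base rbind is used in place of rbind.fill because the toy data frames share the same columns.

```r
# Made-up objects: two that match the prefix and one that does not
entry_1 <- data.frame(id = 1, v = "a")
entry_2 <- data.frame(id = 2, v = "b")
other <- data.frame(id = 99, v = "z")

nms <- ls(pattern = "^entry_")   # names of the matching objects
n_entries <- length(nms)         # how many data frames start with "entry_"
combined <- do.call(rbind, mget(nms))

# Remove the originals once they are combined; rm() needs the names via list =
rm(list = nms)
```

Counting the objects is not strictly needed for the rbind step, since mget already returns all of them as a list, but length(nms) gives the number if it is wanted elsewhere.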

Change column names after merging multiple data frames into one in R

After merging multiple data frames into one, I would like to know how to change the column headers in the master data frame to represent the original files that they came from. I merged a large number of data frames into one using the code below:
library(plyr)
dflist = list.files(path=dir, pattern="csv$", full.names=TRUE, recursive=FALSE)
import.list = llply(dflist, read.csv)
Master = Reduce(function(x, y) merge(x, y, by="Hours"), import.list)
I would like the columns that belonged to each original data frame to be named by the unique ID that the original data frame/csv file is named by (i.e. aa, ab, ac). The unique IDs in the filenames come immediately before a low line ("_"), so I can isolate them using the code below. However, I am now having trouble applying this to the column headers. Any help would be much appreciated.
filename = dflist[1]
unqID = strsplit(filename,"_")[[1]][1]
You could define a function in your llply call and have it assign the names as each file is read,
or just rename them after reading them in and before merging, as @joran suggested:
#First get the names
filenames = dflist
#I am unsure about the line below, as I
#basename() drops the directory part so that only the file name is split
unqID = sapply(basename(filenames), function(x) strsplit(x, "_")[[1]][1], USE.NAMES = FALSE)
names(import.list) <- unqID #renaming the list items
And then merge using your code
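Naming the list items does not by itself change the column headers, so one way to carry the IDs into the merged result is to prefix every non-key column before merging. A rough sketch with made-up data; import_demo stands in for import.list after its items have been named by ID:

```r
# Toy stand-ins for the imported csv files, already named by their unique IDs
import_demo <- list(
  aa = data.frame(Hours = 1:3, temp = c(10, 11, 12)),
  ab = data.frame(Hours = 1:3, temp = c(20, 21, 22))
)

# Prefix every column except the merge key with the ID of its source file
renamed <- Map(function(df, id) {
  is_key <- names(df) == "Hours"
  names(df)[!is_key] <- paste(id, names(df)[!is_key], sep = ".")
  df
}, import_demo, names(import_demo))

master_demo <- Reduce(function(x, y) merge(x, y, by = "Hours"), renamed)
names(master_demo)
```

Renaming before the merge also avoids the .x/.y suffixes that merge adds when two inputs share a column name.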

Appending a row to a dataframe while reading from multiple csv files in R

I'm reading from multiple csv files in a loop, and performing some calculations on each file's data, and then I wish to add that new row to a data frame:
for (i in csvFiles) {
fileToBeRead<-paste(directory, i, sep="/")
dataframe<-read.csv(paste(fileToBeRead, "csv", sep="."))
file <- i
recordsOK <- sum(complete.cases(dataframe))
record.data <- data.frame(monitorID, recordsOK)
}
So, I want to add file and recordsOK as a new row to the data frame. This just overwrites data frame every time, so I'd end up with the data from the latest csv file. How can I do this while preserving the data from the last iteration?
Building a data.frame one row at a time is almost always the wrong way to do it. Here's a more R-like solution
OKcount<-sapply(csvFiles, function(i) {
fileToBeRead<-paste(directory, i, sep="/")
dataframe<-read.csv(paste(fileToBeRead, "csv", sep="."))
sum(complete.cases(dataframe))
})
record.data <- data.frame(monitorID=seq_along(csvFiles), recordsOK=OKcount)
The main idea is that you generally build your data column-wise, not row-wise, and then bundle it together in a data.frame when you're all done. Because R has so many vectorized operations, this is usually pretty easy.
But if you really want to add rows to a data.frame, you can rbind (row bind) additional rows in. So instead of overwriting record.data each time, you would do
record.data <- rbind(record.data, data.frame(monitorID, recordsOK))
But that means you will need to define record.data outside of your loop and initialize it with the correct column names and data types since only matching data.frames can be combined. You can initialize it with
record.data <- data.frame(monitorID=numeric(), recordsOK=numeric())
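To see the column-wise approach run end to end, here is a self-contained sketch that writes two small csv files to a temporary folder and then counts the complete rows in each. The folder, file names and contents are made up for the example, and the file name is stored alongside the count, as the question asks.

```r
# Made-up input: two small csv files in a temporary folder
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),
          file.path(demo_dir, "m1.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2), b = c(NA, NA)),
          file.path(demo_dir, "m2.csv"), row.names = FALSE)

csv_paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Build the whole column of counts first, then assemble the data frame once
ok_counts <- vapply(csv_paths,
                    function(p) sum(complete.cases(read.csv(p))),
                    integer(1))
record_demo <- data.frame(file = basename(csv_paths),
                          recordsOK = unname(ok_counts),
                          stringsAsFactors = FALSE)
record_demo
```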
