I have a fairly large number of heavy datasets. I would like to extract a subset from each of them and save it to a separate csv file (one per dataset). These are the commands I would like to loop over all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would really appreciate help with this!
EDIT:
Following @SimonShine's suggestion, I have run this code and it works!
You don't specify whether you are trying to collect the subsets into one dataset, or trying to make one file per subset. You refer to OPP_1985, which appears to be out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
  df <- read.csv(csvfile, header=FALSE, stringsAsFactors=TRUE, sep="\t")
  df_short <- df[df$V6 == "OPP", ]
  csvfile_short <- gsub(".csv", "_short.csv", csvfile, fixed=TRUE)
  write.csv(df_short, file=csvfile_short, row.names=FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
# csvfiles <- c("OPP_1985.csv", "OPP_1986.csv", ...)
csvfiles <- list.files("/path/to/my/csvfiles", full.names = TRUE)
for (csvfile in csvfiles) {
  extract_and_save_from_dataset(csvfile)
}
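If you instead wanted to collect all the subsets into a single dataset rather than one file per input, a minimal sketch along the same lines (the output name combined_OPP.csv is just an illustrative choice, and csvfiles is the vector built above):
# Sketch: combine the "OPP" rows from every file into one data frame
all_short <- do.call(rbind, lapply(csvfiles, function(csvfile) {
  df <- read.csv(csvfile, header=FALSE, stringsAsFactors=TRUE, sep="\t")
  df[df$V6 == "OPP", ]
}))
write.csv(all_short, file="combined_OPP.csv", row.names=FALSE)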
The data.table approach is probably the fastest option, especially if you have large datasets. The fwrite() function from data.table writes in parallel across multiple CPUs, making it extremely fast.
Here is how you can split your original data into subgroups defined by the values of df$V6 and save each subset to a separate .csv file.
library(data.table)
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
P.S. The files will be named output_*.csv, where * is the corresponding V6 value.
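For completeness, a minimal sketch of the full flow with data.table, reading the raw file with fread first (the filename "1985.csv" and the tab separator are taken from the question; with header = FALSE the columns are named V1, V2, ... as in the read.csv version):
library(data.table)

# Read the raw file (fast, multi-threaded), then write one csv per V6 group
df <- fread("1985.csv", header = FALSE, sep = "\t")
df[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]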
Related
I've just started learning R so forgive me for my ignorance! I'm reading in lots of .csv files, each of which corresponds to a different year (2010-2019). I then filter down the .csv files based on a variable within one of the columns (because the datasets are very large). Currently I am using the code below to do this, and then repeating it for each year:
data_2010 <- data.table::fread("//Project/2010 data/2010 data.csv", select = c("date", "id", "type"))
data_b_2010 <- data_2010[which(data_2010$type=="ABC123")]
rm(data_2010)
What I would like to do is use a for loop to create a new object data_20xx for each year, and then read in the .csv file (applying the "type" filter) for each year too.
I think I know how to create the objects in a for loop, but I'm not entirely sure how I would also read in the .csv files and change the filepath string so that it updates with each year (i.e. from "//Project/2010 data/2010 data.csv" to "//Project/2011 data/2011 data.csv").
Any help would be greatly appreciated!
Next time please provide a reproducible example so we can help you.
I would use data.table, which contains specialized functions for what you want to do.
library(data.table)
setwd("Project")
allfiles <- list.files(recursive = TRUE, full.names = TRUE)
allcsv <- allfiles[grepl("\\.csv$", allfiles)]

data_list <- list()
for (i in seq_along(allcsv)) {
  # print progress as the fraction of files read so far
  print(paste(round(i / length(allcsv), 2)))
  data_list[[i]] <- fread(allcsv[i])
}
data_list_filtered <- lapply(data_list, function(x) {
  y <- data.frame(x)
  return(y[which(y$type == "ABC123"), ])
})
result <- rbindlist(data_list_filtered)
First, list.files lists all the files in your working directory (here recursively and with full paths).
Second, read each csv file into the data_list list using the fast and efficient fread function.
Third, do the filtering within a loop, as requested.
Fourth, use rbindlist from data.table to rbind all of these data.tables into one.
Finally, if you are not familiar with the data.table syntax, you can run setDF(result) to convert your results back to a data.frame.
I strongly encourage you to learn the data.table syntax as it is quite powerful and efficient for tabular data manipulations. These vignettes will get you started.
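If you specifically want one filtered object per year (data_b_2010, data_b_2011, ...), with the filepath built from the year as in the question, here is a minimal sketch of that approach (it assumes the files really are named "<year> data/<year> data.csv" under //Project, and the object names follow the question's convention):
library(data.table)

for (yr in 2010:2019) {
  # Build the path for this year, e.g. "//Project/2010 data/2010 data.csv"
  path <- sprintf("//Project/%d data/%d data.csv", yr, yr)
  dat <- fread(path, select = c("date", "id", "type"))
  dat <- dat[type == "ABC123"]
  # Create data_b_2010, data_b_2011, ... in the global environment
  assign(paste0("data_b_", yr), dat)
  rm(dat)
}
That said, keeping the results in a named list rather than separate objects is usually easier to work with afterwards.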
I have 30 sas-files (dataset1.sas7bdat through dataset30.sas7bdat, approx. 10 GB per file) in a folder, and need to analyse a subset of rows in these data files (all rows where the character variable A begins with 10).
Thus, I need to read each of the sas-files into R, filter a subset with grep on variable A and then save each of these filtered datasets as a .rds-file.
I'm trying to achieve this using a for loop over list.files() and the haven package to read the sas files. To avoid going out of memory, I need to remove the imported dataset on each iteration after the subset has been filtered and saved as .rds.
Though neither elegant nor satisfying, I could hard-code it manually 30 times over like this, copy/pasting and incrementing the suffixes by 1 each time:
dt1 <- haven::read_sas("~/folder/dataset1.sas7bdat")
dt1 <- data.table::as.data.table(dt1)
dt1 <- dt1[grep("^10", A)]
saveRDS(dt1, "~/folder/subset1.rds")
dt2 <- haven::read_sas("~/folder/dataset2.sas7bdat")
dt2 <- data.table::as.data.table(dt2)
dt2 <- dt2[grep("^10", A)]
saveRDS(dt2, "~/folder/subset2.rds")
etc.
While the following for loop technically works to read the files into memory, it will never finish because it massively runs out of memory, so it does not let me filter the data:
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")
for (i in 1:length(file_list)) {
  assign(file_list[i], haven::read_sas(paste(folder, file_list[i], sep='')))
}
Is there a way to - on each iteration in the loop - filter the dataset, remove the unfiltered dataset and save the subset in a .rds-file?
I can't seem to come up with a way to incorporate this into my approach of using the assign() function.
Is there a better way to go about this?
You could potentially free up memory by clearing it after each load-and-filter step, via
rm(list=ls())
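A caveat: rm(list=ls()) removes everything in the workspace, including the file list driving the loop. A minimal sketch of a gentler pattern, removing only the large object and triggering garbage collection on each iteration (the folder, file pattern, and column A are the ones from the question):
folder <- "~/folder/"
file_list <- list.files(path = folder, pattern = "^dataset")

for (i in seq_along(file_list)) {
  dt <- data.table::as.data.table(haven::read_sas(paste0(folder, file_list[i])))
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste0(folder, "subset", i, ".rds"))
  rm(dt)  # drop the large object ...
  gc()    # ... and release the memory before the next file
}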
I slept on it, and woke up with a working solution using a function to do the work, and a loop to cycle through filenames. This also enables me to save the output in a different folder (my raw data folder is read-only):
library(haven)
library(data.table)
fromFolder <- "~/folder_with_input_data/"
toFolder <- "~/folder_with_output_data/"
import_sas <- function(filename) {
  dt <- read_sas(paste(fromFolder, filename, sep=''), NULL)
  dt <- as.data.table(dt)
  dt <- dt[grep("^10", A)]
  saveRDS(dt, paste(toFolder, filename, '.rds', sep=''), compress = FALSE)
  remove(dt)
}
file_list <- list.files(path = fromFolder, pattern="^dataset")
for (filename in file_list) {
import_sas(filename)
}
I haven't tested this with the full 30 files yet. I'll do that tonight.
If I encounter problems, I will post an update tomorrow. Otherwise, this question can be closed in 48 hours.
Update: It worked without a hitch and completed the 297 GB conversion in around 13 hours. I don't think the task can be accomplished much faster; the vast majority of computing time is spent opening the sas files, and I don't think that can be done faster by means other than haven. Unless someone has an idea for optimizing the process, this question can be closed.
I am using sapply(tk_choose.files) to produce an interactive window where I can choose which .csv files (multiple) to import. I then do some basic data manipulation so that the mean of one particular column can be plotted using ggplot.
So far my code looks something like this:
tfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose T files (hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
rfiles <- data.frame(sapply(sapply(tk_choose.files(caption="Choose R files (hold CTRL to select multiple files)"), read.table, header=TRUE, sep=","), c))
I have then calculated the mean of a particular column for both tfiles and rfiles so that I could plot 100-tfiles-rfiles.
While this is working fine for one set of data, I would now like to import more sets of data, preferably also using sapply(tk_choose.files). Essentially I need to get t/rfiles1, t/rfiles2, ... and repeat the data manipulation process after that, so that I can get a plot of multiple sets of data. I have no idea how to do this without copying and pasting my code!
Sorry if this is a stupid question; I am very new to R and really stuck, so your help is greatly appreciated!
Assuming that the files in the working directory are as follows:
all.files<-list.files(pattern="\\.csv")
all.files
[1] "R01.csv" "R02.csv" "R03.csv" "R04.csv" "T01.csv" "T02.csv" "T03.csv" "T04.csv"
And you wish tfiles1 to be the merged data of T01 and T02, and tfiles2 the merged data of T03 and T04:
library(plyr)  # ldply() below comes from plyr

T <- grep("T", all.files, value=TRUE)
T
[1] "T01.csv" "T02.csv" "T03.csv" "T04.csv"
t.list <- list(T[1:2], T[3:4])
all.T <- lapply(t.list, function(x)ldply(x, read.csv))
for (i in 1:length(all.T)) assign(paste0("tfiles", i), all.T[[i]]) #this will produce tfiles1 and tfiles2 in your R environment.
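The R files can be handled the same way; a minimal sketch mirroring the code above (the pairing into groups of two, and the names rfiles1/rfiles2, are the same assumptions as for the T files):
R.files <- grep("R", all.files, value=TRUE)
r.list  <- list(R.files[1:2], R.files[3:4])
all.R   <- lapply(r.list, function(x) ldply(x, read.csv))
for (i in 1:length(all.R)) assign(paste0("rfiles", i), all.R[[i]])  # produces rfiles1 and rfiles2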
I have a simple question regarding a loop that I wrote. I want to access different files in different directories, extract data from these files, and combine it into one table. My problem is that my loop is not accumulating the results from the different files, but only keeping the species currently in the loop. Here is my code:
for (i in 1:length(splist.par))
{
  results <- read.csv(paste(getwd(), "/ResultsR10arcabiotic/", splist.par[i], "/", "maxentResults.csv", sep=""), h=T)
  species <- splist.par[i]
  AUC <- results$Test.AUC[1:10]
  AUC_SD <- results$AUC.Standard.Deviation[1:10]
  Variable <- "a"
  Resolution <- "10arc"
  table <- cbind(species, AUC, AUC_SD, Variable, Resolution)
}
This is probably an easy question, but I am not an experienced programmer. Thanks for your attention.
Gabriel
I'd use lapply to get the desired data from each file and add the Species information, and then combine with rbind. Something like this (untested):
do.call(rbind, lapply(splist.par, function(x) {
  d <- read.csv(file.path("ResultsR10arcabiotic", x, "maxentResults.csv"))
  d <- d[1:10, c("Test.AUC", "AUC.Standard.Deviation")]
  names(d) <- c("AUC", "AUC_SD")
  cbind(Species=x, d, stringsAsFactors=FALSE)
}))
@Aaron's lapply answer is good, and clean. But to debug your code: you put a bunch of data into table but overwrite table every time. You need to initialize it before the loop (e.g. table <- NULL) and append to it on each iteration:
table <- rbind(table, cbind(species, AUC, AUC_SD, Variable, Resolution))
BTW, since table is a function in R, I'd avoid using it as a variable name. Imagine:
table(table)
:-)
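Putting that together, a minimal sketch of the corrected loop (same objects as in the question, with the accumulator renamed to results_table to avoid shadowing table()):
results_table <- NULL
for (i in 1:length(splist.par)) {
  results <- read.csv(file.path(getwd(), "ResultsR10arcabiotic", splist.par[i], "maxentResults.csv"), header=TRUE)
  new_rows <- cbind(species = splist.par[i],
                    AUC = results$Test.AUC[1:10],
                    AUC_SD = results$AUC.Standard.Deviation[1:10],
                    Variable = "a",
                    Resolution = "10arc")
  results_table <- rbind(results_table, new_rows)  # append instead of overwrite
}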
I've noticed I encounter this task quite often when programming in R, yet I don't think my implementation of it is pretty.
I get a list of file names, each containing a table or a simple vector. I want to read all the files into some construct (list of tables?) so I can later manipulate them in simple loops.
I know how to read each file into a table/vector, but I do not know how to put all these objects together in one structure (list?).
Anyway, I guess this is VERY routine so I'll be happy to hear about your tricks.
Do all the files have the same # of columns? If so, I think this should work to put them all into one dataframe.
library(plyr)
x <- c(FILENAMES)
df <- ldply(x, read.table, sep = "\t", header = T)
If they don't all have the same columns, then use llply() instead.
Or, without plyr:
filenames <- c("file1.txt", "file2.txt", "file3.txt")
mydata <- list()
for (i in 1:length(filenames))
{
  mydata[[i]] <- read.table(filenames[i])
}
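The same thing can be written without an explicit loop; a minimal sketch using lapply (same hypothetical filenames as above), which also keeps the filenames attached to the list elements:
mydata <- lapply(filenames, read.table)
names(mydata) <- filenames  # so you can refer to mydata[["file1.txt"]]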
You can have a look at my answer here: Merge several data.frames into one data.frame with a loop.