I want to construct a data frame by reading in a csv file for each day in the month. My daily csv files contain columns of characters, doubles, and integers, all with the same number of rows, and the number of columns is the same for every file. I know the maximum number of rows for any given month. I loop through each day of a month with fileListing, which contains the list of csv file names (say, for January):
output <- matrix(ncol = 18, nrow = 2976)
for (i in 1:length(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, sep = ',', stringsAsFactors = FALSE, row.names = NULL)
  # each df is a data frame with 96 rows and 18 columns
  # now insert the data from the ith date for all its rows, appending as you go
  for (j in 1:18) {
    output[, j] <- df[[j]]
  }
}
Sorry for having revised my question as I figured out part of it (duh). Should I use rbind to progressively insert data at the bottom of the data frame, or is that slow?
You can read them into a list with lapply, then combine them all at once:
data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE, row.names = NULL)
df <- do.call(rbind.data.frame, data)
First define a master data frame to hold all of the data. Then, as each file is read, append its data onto the master.
masterdf <- data.frame()
for (i in 1:length(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, sep = ',', stringsAsFactors = FALSE, row.names = NULL)
  # each df is a data frame with 96 rows and 18 columns
  masterdf <- rbind(masterdf, df)
}
At the end of the loop, masterdf will contain all of the data. This code can be improved, but for a dataset of this size it should be quick enough.
If the data is fairly small relative to your available memory, just read it all in and don't worry about it. After you have read in all the data and done some cleaning, save the result using save() and have your analysis scripts read it back in with load(). Separating the reading/cleaning scripts from the analysis scripts is a good way to keep this problem manageable.
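For example, a minimal sketch of that split, with placeholder file names:
# cleaning script: read every daily file, combine, and save the result once
data <- lapply(fileListing, read.csv, header = FALSE, stringsAsFactors = FALSE)
output <- do.call(rbind, data)
save(output, file = "january_clean.RData")
# analysis script: load the prepared object instead of re-reading the csv files
load("january_clean.RData")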
One way to speed up read.csv is to use the nrows and colClasses arguments. Since you say that you know the number of rows in each file, telling R this will help speed up the reading. You can extract the column classes using
colClasses <- sapply(read.csv(file, nrows = 100), class)
then pass the result to the colClasses argument.
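Putting both hints together might look like this (a sketch; header = FALSE and the 96 rows per file come from the question above):
# learn the column classes from a small sample of the first file
colClasses <- sapply(read.csv(fileListing[1], header = FALSE, nrows = 100), class)
# reuse them, and tell read.csv how many rows to expect
df <- read.csv(fileListing[1], header = FALSE, stringsAsFactors = FALSE,
               colClasses = colClasses, nrows = 96)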
If the data is getting close to being too large, you may consider processing individual files and saving intermediate versions. There are a number of related discussions about managing memory on the site that cover this topic.
On memory usage tricks:
Tricks to manage the available memory in an R session
On using the garbage collector function:
Forcing garbage collection to run in R with the gc() command
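Tying those ideas together, a rough sketch of processing files one at a time, saving intermediate results, and releasing memory (the output file names are placeholders):
for (i in seq_along(fileListing)) {
  df <- read.csv(fileListing[i], header = FALSE, stringsAsFactors = FALSE)
  # ... clean or summarise df here ...
  save(df, file = paste0("intermediate_", i, ".RData"))
  rm(df)  # remove the object from the workspace
  gc()    # ask R to return the freed memory
}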
I'm attempting to loop through multiple CSV files and complete the same task for each file to save myself time. First, I ran list.files to list all files in the folder (e.g., GPS_Collar33800_13.csv, GPS_Collar33801_13.CSV, etc.). I then developed a loop, but I'm struggling with how to structure the other parts of the code to work through each individual file. My end goal is to have 24 files that all look the same structurally, and then I need to merge them all together into a master file.
Another issue is that I need to list a unique ID for each file (add a column for collar ID, e.g., 33800, 33801, 33802, etc.), but I don't know how to easily do this without adding each unique ID by hand (if I knew it was bringing in file GPS_Collar33800_13.csv first, I could make the AnimalID column value 33800, do the same for GPS_Collar33801_13.csv with AnimalID value 33801, and so on). The unique IDs are based on the file name. Any suggestions would be much appreciated!
## List CSV files in folder
files <- list.files()
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
## Read table
tmp<-read.table(files[i],header=FALSE,sep=" ")
## Keep certain columns
tmp1 <- tmp[c(2:5,9,10,12,13)]
#Name the remaining columns
names(tmp1) <-
c("GMT_Date","GMT_Time","LMT_Date","LMT_Time","Latitude","Longitude","PDOP","2D_3D")
#Add column for collar ID
tmp1$AnimalID<-33800
#Cleanup dataframe by removing records with NAs
tmp1[tmp1 == "N/A"] <- NA
tmp2<-na.omit(tmp1)
You can give this a try:
library(stringr)
## List CSV files in folder
files<-list.files()
big.df <- vector('list',length(files))
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
## Read table
tmp<-read.table(files[i],header=FALSE,sep=" ")
## Keep certain columns
tmp1 <- tmp[c(2:5,9,10,12,13)]
#Name the remaining columns
names(tmp1) <-
c("GMT_Date","GMT_Time","LMT_Date","LMT_Time","Latitude","Longitude","PDOP","2D_3D")
#Add column for collar ID
tmp1$AnimalID<-str_match(files[i], 'Collar(\\d+)_')[,2]
#Cleanup dataframe by removing records with NAs
tmp1[tmp1 == "N/A"] <- NA
tmp2<-na.omit(tmp1)
big.df[[i]] <- tmp2
}
final.df <- do.call('rbind', big.df)
It will require the stringr package and assumes your filenames all look like 'GPS_Collar33801_13.csv', etc. It then reads in each file, stores it in a large list, moves to the next file... and when it's done, it mashes them all together in a data.frame called final.df.
So let me make sure before I begin that I understand the ask:
For each file in the folder,
Import the file as a data frame
Drop some columns
Rename the remaining columns
Set a column in the data frame to a value obtained from the file name
Remove cases containing the string "N/A" anywhere
Then, combine each of the resulting data frames into one data frame by UNION-ing them (that is, adding the rows together because the columns should be the same).
It's critically important that you provide your data with any such question. If you can't provide your specific data, create some fake data that still demonstrates the problem at hand. Then, provide an example of what it should look like once the operations are complete. This reduces guesswork by the people answering your question.
So with all that said, let's get cracking.
Let's abstract away the sub-parts of task #1 by pretending that we have a function called process_a_file that will do steps 1-5 for each individual file and return a data frame. I can explain how that function works later.
For the "for each file" part, you need lapply. lapply runs a given function on each element of a list you provide, and returns a list of what the function returns:
results_list <- lapply(files, process_a_file)
This will return a list, where each element of the list is a data frame returned by process_a_file. Then you need a function to combine them - I recommend bind_rows from the package dplyr:
results_df <- dplyr::bind_rows(results_list)
And that's all you need to do!
So, now, what do we put in process_a_file? This is pretty easy - your code is mostly complete for doing this, but there are some different ways to do it that I prefer :)
process_a_file <- function(filename) {
#???????
}
Step 1 is to import the file as a data frame. For this I recommend read_delim from the readr package - it's much faster than the default R methods, has nice defaults, and lets us tackle Step 5 at the same time by specifying that "N/A" means NA:
df <- readr::read_delim(filename, delim = " ", col_names = FALSE, na = "N/A")
For step 2, your way works, but I also recommend the select function from dplyr:
dplyr::select(df, 2:5, 9, 10, 12, 13)
You can also index columns with unquoted names, and drop columns with -5 or -column_name too - and you can do step 3 at the same time!
df <- dplyr::select(
df,
GMT_Date = 2,
GMT_Time = 3,
LMT_Date = 4,
LMT_Time = 5,
Latitude = 9,
Longitude = 10,
PDOP = 12,
`2D_3D` = 13
)
Your way of renaming the columns is fine, too. By the way, if you start a column name with a number, you have to use this `backtick` syntax everywhere, so it's quite inconvenient and you should probably avoid it if you can.
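For example, with the names used above, every later reference to the 2D_3D column needs the backticks:
df$`2D_3D`                  # base R access needs backticks
dplyr::select(df, `2D_3D`)  # and so does dplyr
# renaming it up front avoids the issue (the new name is just an example):
df <- dplyr::rename(df, dim_2d_3d = `2D_3D`)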
Then finally, I recommend getting the ID from the file name using regular expressions. I'll assume you can write that regular expression since that's really out of scope - so you can use basename(tools::file_path_sans_ext(filename)) to return the filename without the path or extension, and use stringr::str_extract to pop out the ID, which you then add to a column using dplyr::mutate:
dplyr::mutate(df, animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
So now, putting this all together - using dplyr's piping syntax %>% to make it look nice:
process_a_file <- function(filename) {
readr::read_delim(filename,
delim = " ",
col_names = FALSE,
na = "N/A") %>%
dplyr::select(
GMT_Date = 2,
GMT_Time = 3,
LMT_Date = 4,
LMT_Time = 5,
Latitude = 9,
Longitude = 10,
PDOP = 12,
`2D_3D` = 13
) %>%
dplyr::mutate(animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
}
results_list <- lapply(files, process_a_file)
results_df <- dplyr::bind_rows(results_list)
I'm very new to R but I do program. I'm probably just getting fed up with my own progress at this stage, so here's my issue:
Lots of large (6 MB) .csv files with spectrum data that I need to analyse afterwards. I'm trying to read in the data - two columns, Frequency and Voltage (V as dB values), with 500,000 data points per file. I would like to "merge" the data from the 2nd column into a new data set for every 10 files.
E.g.: 10 files gives ten Frequency columns (all the same for each file, so they can be ignored for the moment) and ten Voltage columns. Take the Voltage data from the 2nd column of each file and merge it into one data set. With 10 files I end up with one data set; with 100 files, 10 data sets. Hopefully in the end each data set will have 11 columns: | Frequency | V1 | V2 | ... | V10 |. It would be nice to do an Index-Match on each file but I'm not sure my PC will be up to it until I upgrade resources.
This might seem quite convoluted; all suggestions are welcome. Memory seems to be an issue when trying to sort through 1200 .csv files, or even just reading 100 of them. Thanks for your time!
I haven't tested this since I obviously don't have your data, but something like the code below should work. Basically, you create a vector of all the file names and then read, combine, and write 10 of them at a time.
library(reshape2)
library(dplyr)
# Get the names of all the csv files
files = list.files(pattern="csv$")
# Read, combine, and save ten files at a time in each iteration of the loop
for (i in unique((1:length(files) - 1) %/% 10)) {
# Read ten files at a time into a list
dat = lapply(files[(1:length(files) - 1) %/% 10 == i], function(f) {
d=read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Add file name as a column
d$file = gsub("(.*)\\.csv$", "\\1", f)
return(d)
})
# Combine the ten files into a single data frame
dat = bind_rows(dat)
# Reshape from long to wide format
dat = dcast(dat, Frequency ~ file, value.var="Voltage")
# Write to csv
write.csv(dat, paste0("Files_", i, ".csv"), row.names=FALSE)
}
On the other hand, if you want to just combine them all into a single file in long format, which will make analysis easier (if you have enough memory of course):
# Read all files into a list
dat = lapply(files, function(f) {
d = read.csv(f, header=TRUE, stringsAsFactors=FALSE)
# Add file name as a column
d$file = gsub("(.*)\\.csv$", "\\1", f)
return(d)
})
# Combine into a single data frame
dat = bind_rows(dat)
# Save to csv
write.csv(dat, "All_files_combined.csv", row.names=FALSE)
I have a very large .csv file (~4GB) which I'd like to read, then subset.
The problem comes at reading (memory allocation error). The file is so large that reading it crashes, so what I'd like is a way to subset it before or while reading, so that I only get the rows for one city (Cambridge).
f:
id City Value
1 London 17
2 Coventry 21
3 Cambridge 14
......
I've already tried the usual approaches:
f <- read.csv(f, stringsAsFactors=FALSE, header=T, nrows=100)
f.colclass <- sapply(f,class)
f <- read.csv(f,sep = ",",nrows = 3000000, stringsAsFactors=FALSE,
header=T,colClasses=f.colclass)
which seem to work for up to 1-2M rows, but not for the whole file.
I've also tried subsetting at the reading itself using pipe:
f<- read.table(file = f,sep = ",",colClasses=f.colclass,stringsAsFactors = F,pipe('grep "Cambridge" f ') )
and this also seems to crash.
I thought the sqldf or data.table packages would have something, but no success yet!
I think this was alluded to already, but just in case it wasn't completely clear: the sqldf package creates a temporary SQLite DB on your machine based on the csv file and allows you to write SQL queries to subset the data before saving the results to a data.frame.
library(sqldf)
query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)
#or rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
sql = "select sum(Value) from file where City=='Cambridge' " )
One solution to this type of error is to convert your csv file to an Excel file first. Then you can map the Excel file into a MySQL table using Toad for MySQL; it is easy, just check the data types of the variables. Then, using the RODBC package, you can access such a large dataset. I work with datasets of more than 20 GB this way.
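A rough sketch of the RODBC side, assuming the data has already been loaded into MySQL and an ODBC data source called "my_dsn" has been set up (the DSN, credentials, and table name below are placeholders):
library(RODBC)
ch <- odbcConnect("my_dsn", uid = "user", pwd = "password")
# pull only the subset you need rather than the whole table
cambridge <- sqlQuery(ch, "SELECT * FROM mytable WHERE City = 'Cambridge'")
odbcClose(ch)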
Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (Here's an example from one of the multitude of similar questions/answers).
The only difference is that, unlike most of the answers that load the whole file, you read it chunk by chunk and only keep the subset you need at each iteration.
# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")
# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go
i = 0 # set i to 0 as it'll increase as we loop through the chunks
# loop through the chunks and select rows that contain cambridge
repeat {
# things to do only on the first read-through
if(i==0){
# read in columns only on the first go
grab_header=TRUE
# load the chunk
tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header)
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# initiate container for desired data
df = tmp_chunk[cond,] # save desired subset in initial container
cols = colnames(df) # save column names to re-use on next chunks
}
# things to do on all subsequent non-first chunks
else if(i>0){
grab_header=FALSE
tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header,col.names = cols)
# set stopping criteria for the loop
# when it reads in 0 rows, exit loop
if(nrow(tmp_chunk)==0){break}
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# append to existing dataframe
df = rbind(df, tmp_chunk[cond,])
}
# add 1 to i to avoid the things needed to do on the first read-in
i=i+1
}
close(con) # close connection
# check out the results
head(df)
Hi, so I have data in the following format:
101,20130826T155649
------------------------------------------------------------------------
3,1,round-0,10552,180,yellow
12002,1,round-1,19502,150,yellow
22452,1,round-2,28957,130,yellow,30457,160,brake,31457,170,red
38657,1,round-3,46662,160,yellow,47912,185,red
and I have been reading them in and cleaning/formatting them with this code:
b <- read.table("sid-101-20130826T155649.csv", sep = ',', fill=TRUE, col.names=paste("V", 1:18,sep="") )
b$id <- b[1, 1]
b <- b[-1, ]
b <- b[-1, ]
b$yellow <- b$V6
and so on
There are about 300 files like this, and ideally they would all be compiled without the first two lines, since the first line is just the id and I made a separate column to identify these data. Does anyone know how to read these tables quickly, clean and format them the way I want, then compile them into one large file and export it?
You can use lapply to read all the files, clean and format them, and store the resulting data frames in a list. Then use do.call to combine all of the data frames into a single large data frame.
# Get vector of files names to read
files.to.load = list.files(pattern="csv$")
# Read the files
df.list = lapply(files.to.load, function(file) {
df = read.table(file, sep = ',', fill=TRUE, col.names=paste("V", 1:18,sep=""))
... # Cleaning and formatting code goes here
df$file.name = file # In case you need to know which file each row came from
return(df)
})
# Combine into a single data frame
df.combined = do.call(rbind, df.list)
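As for the cleaning placeholder above, a sketch based on the steps in your own code might look like this (assuming every file has the same two-line preamble):
# inside the lapply function, in place of the "..." placeholder
df$id <- df[1, 1]    # the first line holds the id
df <- df[-(1:2), ]   # drop the id line and the dashed separator line
df$yellow <- df$V6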
I would like to write a list of data frames, "neighbours_dataframe", to a single CSV file.
I use this line to write the multiple data frames to multiple files:
for(i in 1:vcount(karate)){
write.csv(neighbours_dataframe[[i]], file = as.character(V(karate3)$name[i]),row.names=FALSE)}
If I use this code:
for(i in 1:vcount(karate)){
write.csv(neighbours_dataframe[[i]], file = "karate3.csv",row.names=FALSE)}
this gives me just the last data frame in the csv file.
I was wondering: how could I get a single CSV file that contains all of the data frames, with the column header from the first data frame written once and every other data frame appended below it?
Two methods; the first is likely to be a little faster if neighbours_dataframe is a long list (though I haven't tested this).
Method 1: Convert the list of data frames to a single data frame first
As suggested by jbaums.
library(dplyr)
neighbours_dataframe_all <- bind_rows(neighbours_dataframe)
write.csv(neighbours_dataframe_all, "karate3.csv", row.names = FALSE)
Method 2: use a loop, appending
As suggested by Neal Fultz.
for(i in seq_along(neighbours_dataframe))
{
write.table(
neighbours_dataframe[[i]],
"karate3.csv",
append = i > 1,
sep = ",",
row.names = FALSE,
col.names = i == 1
)
}