I have a series of csv files like this:
dataframe_1 <- read.csv('C:filepath/data_1.csv', header = T, skip = 1)
Where each file is a year of records. The number of files varies from run to run: sometimes only a few, other times dozens. What I've been doing is creating individual data frames and stripping out the columns I want using:
cutout_1 <- dataframe_1[c(1:365), c(2, 4, 6, 8, 10)]
and then binding them with rbind() as follows:
total <- rbind(cutout_1, cutout_2, cutout_3, cutout_4)
as.data.frame(total)
However, this is clunky and I need to rewrite it every time I change something about the model I am using, such as the number of years (and thus the number of files it produces), which wastes a lot of time.
I have tried indexing through the data files, but can't find a way to extract only the files I want, nor a way to skip the first row, which is essential because of how the data is produced (something I have no control over).
Assuming the working directory is the directory where the files can be found, the code below first gets the filenames, then reads them in an lapply loop, creating a cutout column containing each file's base name without the directory path or extension, and finally rbinds them into one data.frame.
filenames <- list.files(pattern = "Day_Climate_.*\\.csv")
cols_to_keep <- c(2, 4, 6, 8, 10)
rows_to_keep <- 1:365
cutout_list <- lapply(filenames, \(x) {
  dftmp <- read.csv(x, skip = 1L)
  dftmp <- dftmp[rows_to_keep, cols_to_keep]
  # these instructions create in each file
  # a column telling where they came from
  # (this might not be needed)
  cutout <- basename(x)
  cutout <- tools::file_path_sans_ext(cutout)
  dftmp$cutout <- cutout
  # need to return the anonymous function value
  dftmp
})
total <- do.call(rbind, cutout_list)
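If the files live somewhere other than the working directory, or the folder contains CSVs you don't want, the path, pattern and full.names arguments of list.files() cover both cases. A minimal variation, assuming a hypothetical subfolder named "output":
# Variation: read only the matching files from a subfolder called "output"
# ("output" is just an example name, not from the original question)
filenames <- list.files(path = "output",
                        pattern = "Day_Climate_.*\\.csv",
                        full.names = TRUE)  # keep the path so read.csv() can find the files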
I'm having a lot of trouble reading/writing to CSV files. Say I have over 300 CSV's in a folder, each being a matrix of values.
If I wanted to find out a characteristic of each individual CSV file, such as which rows had an exact number of 3's, and write the result to another CSV file for each test, how would I go about iterating this over 300 different CSV files?
For example, say I have this code I am running for each file:
values_4 <- read.csv(file = 'values_04.csv', header = FALSE)  # read CSV in as its own DF
values_4$howMany3s <- apply(values_4, 1, function(x) length(which(x == 3)))  # compute number of 3's per row
values_4$exactly4 <- apply(values_4[50], 1, function(x) length(which(x == 4)))  # show 1/0 on each column that has exactly four 3's
values_4  # print new data frame
I am then continuously copying and pasting this code, changing the "4" to a 5, 6, etc., and noting the values. This seems wildly inefficient to me, but I'm not experienced enough at R to know exactly what my options are. Should I look at adding all 300 CSV files to a single list and somehow looping through them?
Appreciate any help!
Here's one way you can read all the files and process them. Untested code, as you haven't given us anything to work on.
# Get a list of CSV files. Use the path argument to point to a folder
# other than the current working directory
files <- list.files(pattern=".+\\.csv")
# For each file, work your magic
# lapply runs the function defined in the second argument on each
# value of the first argument
everything <- lapply(
  files,
  function(f) {
    values <- read.csv(f, header = FALSE)
    apply(values, 1, function(x) length(which(x == 3)))
  }
)
# And returns the results in a list. Each element consists of
# the results from one function call.
# Make sure you can access the elements of the list by filename
names(everything) <- files
# The return value is a list. Access all of it with
everything
# Or a single element with
everything[["values_04.csv"]]
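The question also mentions writing the result of each test to another CSV file. A minimal untested sketch of one way to do that, assuming the output names are simply the input names with a "counts_" prefix (my assumption, not part of the question):
# Write each per-file result to its own CSV, named after the input file
for (f in names(everything)) {
  out <- data.frame(howMany3s = everything[[f]])
  write.csv(out, file = paste0("counts_", f), row.names = FALSE)
}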
I'm attempting to loop through multiple CSV files and complete the same task for each file to save myself time. First, I ran 'list.files' to list all files in the folder (e.g., GPS_Collar33800_13.csv, GPS_Collar33801_13.CSV, etc.). I then developed a loop, but I'm struggling with how to structure the other parts of the code to work through each individual file. My end goal is to have 24 files that all look the same structurally, which I then need to merge into a master file.
Another issue is that I need to give each file a unique ID (a column for the collar ID, e.g., 33800, 33801, 33802, etc.), but I don't know how to do this easily without adding each ID by hand (if I knew it was bringing in file GPS_Collar33800_13.csv first, I could set the AnimalID column value to 33800, do the same for GPS_Collar33801_13.csv with AnimalID value 33801, and so on). The unique IDs are based on the file names. Any suggestions would be much appreciated!
## List CSV files in folder
files <- list.files()
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
  ## Read table
  tmp <- read.table(files[i], header = FALSE, sep = " ")
  ## Keep certain columns
  tmp1 <- tmp[c(2:5, 9, 10, 12, 13)]
  # Name the remaining columns
  names(tmp1) <-
    c("GMT_Date", "GMT_Time", "LMT_Date", "LMT_Time", "Latitude", "Longitude", "PDOP", "2D_3D")
  # Add column for collar ID
  tmp1$AnimalID <- 33800
  # Cleanup dataframe by removing records with NAs
  tmp1[tmp1 == "N/A"] <- NA
  tmp2 <- na.omit(tmp1)
You can give this a try:
library(stringr)
## List CSV files in folder
files <- list.files()
big.df <- vector('list', length(files))
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)){
  ## Read table
  tmp <- read.table(files[i], header = FALSE, sep = " ")
  ## Keep certain columns
  tmp1 <- tmp[c(2:5, 9, 10, 12, 13)]
  # Name the remaining columns
  names(tmp1) <-
    c("GMT_Date", "GMT_Time", "LMT_Date", "LMT_Time", "Latitude", "Longitude", "PDOP", "2D_3D")
  # Add column for collar ID, extracted from the file name
  tmp1$AnimalID <- str_match(files[i], 'Collar(\\d+)_')[,2]
  # Cleanup dataframe by removing records with NAs
  tmp1[tmp1 == "N/A"] <- NA
  tmp2 <- na.omit(tmp1)
  big.df[[i]] <- tmp2
}
final.df <- do.call('rbind', big.df)
It will require the stringr package and assumes your filenames all look like 'GPS_Collar33801_13.csv', etc. It then reads in each file, stores it in a large list, moves to the next file... and when it's done, it mashes them all together in a data.frame called final.df.
Edit: Just fixed the str_match argument.
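For a quick check of what that str_match() call extracts, you can run it on one of the filenames from the question; column 2 of the returned match matrix is the captured group, which is why the code indexes with [,2]:
library(stringr)
# The pattern captures the digits between "Collar" and the underscore
str_match("GPS_Collar33800_13.csv", 'Collar(\\d+)_')
#>      [,1]           [,2]
#> [1,] "Collar33800_" "33800"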
So let me make sure before I begin that I understand the ask:
For each file in the folder:
1. Import the file as a data frame
2. Drop some columns
3. Rename the remaining columns
4. Set a column in the data frame to a value obtained from the file name
5. Remove cases containing the string "N/A" anywhere
Then, combine each of the resulting data frames into one data frame by UNION-ing them (that is, adding the rows together, because the columns should be the same).
It's critically important that you provide your data with any such question. If you can't provide your specific data, create some fake data that still demonstrates the problem at hand. Then, provide an example of what it should look like once the operations are complete. This reduces guesswork by the people answering your question.
So with all that said, let's get cracking.
Let's abstract away the per-file sub-steps by pretending that we have a function called process_a_file that will perform steps 1-5 on each individual file and return a data frame. I'll explain how that function works later.
For the "for each file" part, you need lapply. lapply runs a given function on each element of a list you provide, and returns a list of what the function returns:
results_list <- lapply(files, process_a_file)
This will return a list, where each element of the list is a data frame returned by process_a_file. Then you need a function to combine them - I recommend bind_rows from the package dplyr:
results_df <- dplyr::bind_rows(results_list)
And that's all you need to do!
So, now, what do we put in process_a_file? This is pretty easy - your code is mostly complete for doing this, but there are some different ways to do it that I prefer :)
process_a_file <- function(filename) {
  # ???????
}
Step 1 is to import the file as a data frame. For this I recommend read_delim from the readr package - it's much faster than the default R methods, has nice defaults, and lets us tackle Step 5 at the same time by specifying that "N/A" means NA:
df <- readr::read_delim(filename, delim = " ", col_names = FALSE, na = "N/A")
For step 2, your way works, but I also recommend the select function from dplyr:
dplyr::select(df, 2:5, 9, 10, 12, 13)
You can also index columns with unquoted names, and drop columns with -5 or -column_name too - and you can do step 3 at the same time!
df <- dplyr::select(
  df,
  GMT_Date = 2,
  GMT_Time = 3,
  LMT_Date = 4,
  LMT_Time = 5,
  Latitude = 9,
  Longitude = 10,
  PDOP = 12,
  `2D_3D` = 13
)
Your way of renaming the columns is fine, too. By the way, if you start a column name with a number, you have to use this `backtick` syntax everywhere, so it's quite inconvenient and you should probably avoid it if you can.
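For example (a small illustration of the inconvenience; fix_type is an arbitrary replacement name I'm using, not something from the question):
# Any later reference to the column needs backticks:
table(df$`2D_3D`)
# Renaming it up front avoids that:
df <- dplyr::rename(df, fix_type = `2D_3D`)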
Then, finally, I recommend getting the ID from the file name using regular expressions. I'll assume you can write that regular expression, since that's really out of scope. You can use basename(tools::file_path_sans_ext(filename)) to get the filename without the path or extension, use stringr::str_extract to pop out the ID, and then add it to a column using dplyr::mutate:
dplyr::mutate(df, animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
So now, putting this all together - using dplyr's piping syntax %>% to make it look nice:
process_a_file <- function(filename) {
  readr::read_delim(filename,
                    delim = " ",
                    col_names = FALSE,
                    na = "N/A") %>%
    dplyr::select(
      GMT_Date = 2,
      GMT_Time = 3,
      LMT_Date = 4,
      LMT_Time = 5,
      Latitude = 9,
      Longitude = 10,
      PDOP = 12,
      `2D_3D` = 13
    ) %>%
    dplyr::mutate(animal_id = stringr::str_extract(
      basename(tools::file_path_sans_ext(filename)),
      "THE REGEX GOES HERE"
    ))
}
results_list <- lapply(files, process_a_file)
results_df <- dplyr::bind_rows(results_list)
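The regular expression above is deliberately left as a placeholder. Since the filenames in the question look like GPS_Collar33800_13, one possibility (just a suggestion, not part of the original answer) is to grab the digits that follow "Collar" with a lookbehind, so no capture group is needed:
# One possible regex for the placeholder above
stringr::str_extract("GPS_Collar33800_13", "(?<=Collar)\\d+")
#> [1] "33800"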
I'm trying to use readLines() to scrape .txt files hosted by the Census and compile them into one .txt/.csv file. I am able to use it to read individual pages, but I'd like to be able to just run a function that calls readLines() on every URL listed in a CSV.
My knowledge of looping and function properties isn't great, but here are the pieces of my code that I'm trying to incorporate:
Here is how I build my matrix of URLs, which I can add to and/or turn into a CSV and have a function read that way:
MasterList <- matrix( data = c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt"), ncol = 1)
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
Here's the function (riddled with problems) I started writing:
Scrape <- function(x){
  for (i in x){
    URLS <- i
    headers <- readLines(URLS, n = 2)
    bod <- readLines(URLS)
    bodclipped <- bod[-c(1, 2, 3)]
    Totes <- c(headers, bodclipped)
    write(Totes, file = "[Directory]/ScrapeTest.txt")
    return(head(Totes))
  }
}
The idea is that I would run Scrape(urls), which would generate a combination of the 3 URLs I have in my "urls" matrix/CSV, with the Census's built-in headers removed from all files except the first one (headers vs. bodclipped).
I've tried running lapply() over "urls" with readLines, but that only generates text based on the last URL and not all three, and each text file still has its headers, which I could just remove and then reattach at the end.
Any help would be appreciated!
As all of these documents are CSV files with 38 columns, you can combine them very easily using:
MasterList <- c("%20Region/ne0001y.txt", "%20Region/ne0002y.txt", "%20Region/ne0003y.txt")
urls <- sprintf("http://www2.census.gov/econ/bps/Place/Northeast%s", MasterList)
raw_dat <- lapply(urls, read.csv, skip = 3, header = FALSE)
dat <- do.call(rbind, raw_dat)
What happens here and how is this looping?
The lapply function basically creates a list with 3 (= length(urls)) entries and populates them with read.csv(urls[i], skip = 3, header = FALSE). So raw_dat is a list with 3 data.frames containing your data, and do.call(rbind, raw_dat) binds them together.
The header row seems to be somehow broken; that's why I use skip = 3, header = FALSE, which is equivalent to your bod[-c(1,2,3)].
If all the scraped data fits into memory you can combine it this way and in the end write it into a file using:
write.csv(dat, "[Directory]/ScrapeTest.txt")
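If you also want to keep the two raw header lines from the first file on top of the combined output (as the Scrape() attempt in the question did), here is a minimal untested sketch; the output path is the same placeholder used above:
# Take the raw header lines from the first URL only, then append the data
header_lines <- readLines(urls[1], n = 2)
writeLines(header_lines, "[Directory]/ScrapeTest.txt")
write.table(dat, "[Directory]/ScrapeTest.txt",
            append = TRUE, sep = ",",
            row.names = FALSE, col.names = FALSE)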
I have 100 files, and from each I want to extract the 4th column (total_volume), which contains 100,000 rows, and put them together in 1 big file that then contains 100 columns of 100,000 rows each. I was trying something with the following script:
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
#read file in fileNames
for (fileName in fileNames) {
  dataDf <- read.delim(fileName, header = FALSE)
  # remove columns with only example values
  dataDf <- dataDf[, -(7:14)]
  # convert data frame to data table
  dataDt <- data.table(dataDf)
  # set column names
  setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
  # new file with only total volume
  total_volume <- dataDt$total_volume
  # export file
  write.table(dataDt$total_volume, file = "total_volume20.csv")
}
But what I get then is that the columns overwrite each other, with the result being a .csv file containing the 4th column of only the last file. I would like the columns to be next to each other instead of overwriting one another. How could I do that?
Thanks in advance!
P.S. Obviously the overwriting thing happens because I used a loop. However, I am not sure how else to combine everything, so suggestions are very welcome!
You haven't given us a reproducible example, so I can't test this properly, but this should give you a table with one column for total volume from each of the files you get from the call to Sys.glob(). The idea is to make a function that does what you want for one file; use lapply() to make a list with the results of that function for each file in your target environment; then cbind the columns in that list into one big table.
setwd("/run/media/mydirectory")
library(data.table)
fileNames <- Sys.glob("*.txt.csv")
# For the function, I'm reproducing your code. You could do in fewer lines and without
# data.table if you like, but maybe there's a reason you chose this approach.
extractor <- function(fileName) {
  require(data.table)
  dataDf <- read.delim(fileName, header = FALSE)
  dataDf <- dataDf[, -(7:14)]
  dataDt <- data.table(dataDf)
  setnames(dataDt, c("mcs", "cell_type", "cell_number", "total_volume"))
  total_volume <- dataDt$total_volume
  return(total_volume)
}
total.list <- lapply(fileNames, extractor)
total.table <- Reduce(cbind, total.list)
write.table(total.table, file = "total_volume20.csv")
Or do that last bit in one line if you like:
write.table(Reduce(cbind, lapply(Sys.glob("*.txt.csv"), extractor)), file="total_volume20.csv")
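The combined table produced by Reduce(cbind, ...) has no meaningful column names. If it helps, you can label each column with the file it came from before writing. A small untested addition on top of the code above (it assumes more than one file, so the result is a matrix):
# Label each column of the combined table with its source file name
total.table <- Reduce(cbind, total.list)
colnames(total.table) <- fileNames
write.table(total.table, file = "total_volume20.csv")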
I would like to know how I can solve the following problem using higher-order functions like ddply, ldply and dlply, and avoid using problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the data.frame to open the files ("file1.txt", "file2.txt") in that directory, merge them, then combine the merged files from each entry into one large data frame.
something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
What I would like to do is apply a function to the data frame that takes the directory, appends the two file names "file1.txt" and "file2.txt", then merges the two files together based on a given field.
for example file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
for example file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
the output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons would like to use a higher order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file = "allDataEntries.csv", header = TRUE)
numberOfEntries <- dim(df)[1]
for(i in 1:numberOfEntries){
  dir <- df$dir[i]
  file1String <- paste(dir, "/file1.txt", sep = '')
  file2String <- paste(dir, "/file2.txt", sep = '')
  file1.df <- read.csv(file = file1String, header = TRUE)
  file2.df <- read.csv(file = file2String, header = TRUE)
  localMerged <- merge(file1.df, file2.df, by = "value")
  allCombined <- rbind(allCombined, localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file1.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
  merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
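As a quick illustration (using a cut-down version of the two sample files from the question), the merge keeps entry and subEntry once and suffixes the value columns, which is essentially the requested output:
f1 <- read.csv(text = "entry,subEntry,value\n1,A,2\n1,B,3")
f2 <- read.csv(text = "entry,subEntry,value\n1,A,8\n1,B,7")
merge_by(f1, f2)
#>   entry subEntry value.x value.y
#> 1     1        A       2       8
#> 2     1        B       3       7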
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df, reads in the two associated files, and merges them together by value. Using ddply will take these data frames and make a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df; if it is, you can add a unique ID column to group over instead.
library(plyr)
ddply(df, .(entryName), function(DF) {
  dir <- DF$dir
  file1String <- paste(dir, "/file1.txt", sep = '')
  file2String <- paste(dir, "/file2.txt", sep = '')
  file1.df <- read.csv(file = file1String, header = TRUE)
  file2.df <- read.csv(file = file2String, header = TRUE)
  merge(file1.df, file2.df, by = "value")
})