I'm attempting to loop through multiple CSV files and complete the same task for each file to save myself time. First, I ran list.files to list all files in the folder (e.g., GPS_Collar33800_13.csv, GPS_Collar33801_13.csv, etc.). I then developed a loop, but I'm struggling with how to structure the other parts of the code to work through each individual file. My end goal is to have 24 files that all look the same structurally, and then I need to merge them all together into a master file.
Another issue is that I need a unique ID for each file (a column for collar ID, e.g., 33800, 33801, 33802, etc.), and I don't know how to do this without adding each ID by hand: if I knew that it was bringing in GPS_Collar33800_13.csv first, then I could set the AnimalID column value to 33800, do the same for GPS_Collar33801_13.csv with 33801, and so on. The unique IDs are based on the file names. Any suggestions would be much appreciated!
## List CSV files in folder
files <- list.files()
## Run a for loop to complete the same tasks for each
for (i in 1:length(files)) {
  ## Read table
  tmp <- read.table(files[i], header = FALSE, sep = " ")
  ## Keep certain columns
  tmp1 <- tmp[c(2:5, 9, 10, 12, 13)]
  # Name the remaining columns
  names(tmp1) <- c("GMT_Date", "GMT_Time", "LMT_Date", "LMT_Time",
                   "Latitude", "Longitude", "PDOP", "2D_3D")
  # Add column for collar ID
  tmp1$AnimalID <- 33800
  # Cleanup dataframe by removing records with NAs
  tmp1[tmp1 == "N/A"] <- NA
  tmp2 <- na.omit(tmp1)
}
You can give this a try:
library(stringr)

## List CSV files in folder
files <- list.files()
big.df <- vector('list', length(files))

## Run a for loop to complete the same tasks for each
for (i in 1:length(files)) {
  ## Read table
  tmp <- read.table(files[i], header = FALSE, sep = " ")
  ## Keep certain columns
  tmp1 <- tmp[c(2:5, 9, 10, 12, 13)]
  # Name the remaining columns
  names(tmp1) <- c("GMT_Date", "GMT_Time", "LMT_Date", "LMT_Time",
                   "Latitude", "Longitude", "PDOP", "2D_3D")
  # Add column for collar ID
  tmp1$AnimalID <- str_match(files[i], 'Collar(\\d+)_')[, 2]
  # Cleanup dataframe by removing records with NAs
  tmp1[tmp1 == "N/A"] <- NA
  tmp2 <- na.omit(tmp1)
  big.df[[i]] <- tmp2
}

final.df <- do.call('rbind', big.df)
It will require the stringr package and assumes your filenames all look like 'GPS_Collar33801_13.csv', etc. It then reads in each file, stores it in a large list, moves to the next file... and when it's done, it mashes them all together in a data.frame called final.df.
Edit: Just fixed the str_match argument.
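One extra note: str_match returns character matches, so AnimalID will come out as a string. If you'd rather have it numeric, one line at the end converts it:
final.df$AnimalID <- as.numeric(final.df$AnimalID)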
So let me make sure before I begin that I understand the ask. For each file in the folder:

1. Import the file as a data frame
2. Drop some columns
3. Rename the remaining columns
4. Set a column in the data frame to a value obtained from the file name
5. Remove cases containing the string "N/A" anywhere

Then, combine each of the resulting data frames into one data frame by UNION-ing them (that is, adding the rows together, because the columns should be the same).
It's critically important that you provide your data with any such question. If you can't provide your specific data, create some fake data that still demonstrates the problem at hand. Then, provide an example of what it should look like once the operations are complete. This reduces guesswork by the people answering your question.
So with all that said, let's get cracking.
Let's abstract away the per-file steps by pretending that we have a function called process_a_file that performs steps 1-5 on a single file and returns a data frame. I'll explain how that function works later.
For the "for each file" part, you need lapply. lapply runs a given function on each element of a list you provide, and returns a list of what the function returns:
results_list <- lapply(files, process_a_file)
This will return a list, where each element of the list is a data frame returned by process_a_file. Then you need a function to combine them - I recommend bind_rows from the package dplyr:
results_df <- dplyr::bind_rows(results_list)
And that's all you need to do!
So, now, what do we put in process_a_file? This is pretty easy - your code is mostly complete for doing this, but there are some different ways to do it that I prefer :)
process_a_file <- function(filename) {
#???????
}
Step 1 is to import the file as a data frame. For this I recommend read_delim from the readr package - it's much faster than the default R methods, has nice defaults, and lets us tackle Step 5 at the same time by specifying that "N/A" means NA:
df <- readr::read_delim(filename, delim = " ", col_names = FALSE, na = "N/A")
For step 2, your way works, but I also recommend the select function from dplyr:
dplyr::select(df, 2:5, 9, 10, 12, 13)
You can also index columns with unquoted names, and drop columns with -5 or -column_name too - and you can do step 3 at the same time!
df <- dplyr::select(
df,
GMT_Date = 2,
GMT_Time = 3,
LMT_Date = 4,
LMT_Time = 5,
Latitude = 9,
Longitude = 10,
PDOP = 12,
`2D_3D` = 13
)
Your way of renaming the columns is fine, too. By the way, if you start a column name with a number, you have to use this `backtick` syntax everywhere, so it's quite inconvenient and you should probably avoid it if you can.
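For instance, a quick hedged sketch using the renamed df from above (the "3D" value is made up for illustration):
dplyr::select(df, -5)                 # drop the 5th column (Latitude here)
dplyr::select(df, -Latitude)          # same thing, by unquoted name
df$`2D_3D`                            # backticks needed: the name starts with a digit
dplyr::filter(df, `2D_3D` == "3D")    # ...and needed again in every verb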
Then finally, I recommend getting the ID from the file name using regular expressions. I'll assume you can write that regular expression, since that's really out of scope. You can use basename(tools::file_path_sans_ext(filename)) to return the filename without the path or extension, use stringr::str_extract to pop out the ID, and then add it to a column using dplyr::mutate:
dplyr::mutate(df, animal_id = stringr::str_extract(basename(tools::file_path_sans_ext(filename)), "THE REGEX GOES HERE"))
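Just to make that concrete without writing it for you: given filenames like GPS_Collar33800_13.csv from the question, one regex that would plausibly work is a lookbehind for "Collar" followed by digits (a hedged sketch; test it against your own names):
stringr::str_extract(
  basename(tools::file_path_sans_ext("GPS_Collar33800_13.csv")),
  "(?<=Collar)\\d+"
)
# [1] "33800"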
So now, putting this all together, using the %>% piping syntax (from magrittr, re-exported by dplyr) to make it look nice:
library(dplyr)  # attaches %>%; everything else below uses explicit :: prefixes

process_a_file <- function(filename) {
  readr::read_delim(filename,
                    delim = " ",
                    col_names = FALSE,
                    na = "N/A") %>%
    dplyr::select(
      GMT_Date = 2,
      GMT_Time = 3,
      LMT_Date = 4,
      LMT_Time = 5,
      Latitude = 9,
      Longitude = 10,
      PDOP = 12,
      `2D_3D` = 13
    ) %>%
    dplyr::mutate(
      animal_id = stringr::str_extract(
        basename(tools::file_path_sans_ext(filename)),
        "THE REGEX GOES HERE"
      )
    )
}
results_list <- lapply(files, process_a_file)
results_df <- dplyr::bind_rows(results_list)
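As a side note (my addition, not something this answer depends on): if you have purrr installed, map_dfr collapses those last two lines into one, mapping over the files and row-binding the results in a single step:
results_df <- purrr::map_dfr(files, process_a_file)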
Related
Apologies if this may seem simple, but I can't find a workable answer anywhere on the site.
My data is in the form of a CSV with the filename being a name and number, so it's not quite as simple as having files with a generic word and an increasing number...
I've achieved exactly what I want to do with just one file, but the issue is that there are a couple of hundred to do, so changing the name each time is quite tedious.
Posting my original single-batch code here in the hopes someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns from factor to numeric
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three-part issue is as follows:

1. I can create a list or vector to read through the file names, but not a data frame for each, each time (if that's even a good way to do it).
2. The variable names throughout the code will need to change: instead of just being "Person1", "Person2", they'll be more like "Johnny1", "Lou23".
3. I need to export each resulting data frame to its own CSV file with the original name.

Taking any and all suggestions on board - struggling with this one.
Cheers!
Consider using one list of the ~200 dataframes instead of separate named objects flooding the global environment (though list2env is still shown below). Use lapply() to iterate through all the CSV files in the working directory, then simply name each element of the list after its file's basename:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern="\\.csv$")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub("\\.csv$", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)
I have a folder with hundreds of CSV files each containing data for a particular postal code.
Each CSV file contains two columns and thousands of rows. Descriptors are in Column A; values are in Column B.
I need to extract two pieces of information from each file and create a new table or dataframe using the values in [Column A, Row 2] (which is the postal code) and [Column B, Row 1585] (which is the median income).
The end result should be a table/dataframe with two columns: one for postal code, the other for median income.
Any help or advice would be appreciated.
Disclaimer: this question is pretty vague. Next time, be sure to add a reproducible example that we can run on our machines. It will help you, the people answering your questions, and future users.
You might try something like:
files = list.files("~/Directory")
my_df = data.frame(PostalCode = character(length(files)),
                   MedianIncome = numeric(length(files)))
for(i in 1:length(files)){
  path = file.path("~/Directory", files[i])
  # Postal code: row 2, column 1 (skip row 1, read a single row)
  row2 = read.csv(path, header = FALSE, skip = 1, nrows = 1)
  # Median income: row 1585, column 2 (skip the 1584 rows above it)
  row1585 = read.csv(path, header = FALSE, skip = 1584, nrows = 1)
  my_df$PostalCode[i] = as.character(row2[1, 1])
  my_df$MedianIncome[i] = row1585[1, 2]
}
# Note on interpreting indexing syntax:
Read my_df$PostalCode[i] as "my_df's ($) PostalCode column, element i ([i])".
You can use the list.files function to get the paths to all your files and then use read.csv and rbind in a for loop to create one data.frame.
Something like this:
direct <- list.files("directory_to_your_files", full.names = TRUE)
df <- NULL
for(i in 1:length(direct)){
  df <- rbind(df, read.csv(direct[i]))
}
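A design note on the loop above: rbind-ing onto df inside the loop re-copies the accumulated data frame on every iteration, which gets slow with many files. A hedged sketch of the usual alternative, which binds everything once at the end (same placeholder directory):
direct <- list.files("directory_to_your_files", full.names = TRUE)
df <- do.call(rbind, lapply(direct, read.csv))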
So here is the code which does what I want it to do. If there are more elegant solutions, please feel free to point them out.
# set the working directory to where the data files are stored
setwd("/foo")
# list the files
files = list.files("/foo")
#create an empty dataframe and name the columns
dataMatrix = data.frame(matrix(NA, nrow = length(files), ncol = 2))
colnames(dataMatrix)=c("Postal Code", "Median Income")
# create a for loop to get the information in R2/C1 and R1585/C2 of each data file
# Data in R2/C1 is a string, but is interpreted as a number unless specifically declared a string
for(i in 1:length(files)) {
getData = read.csv(files[i],header=F)
dataMatrix[i,1]=toString(getData[2,1])
dataMatrix[i,2]=(getData[1585,2])
}
Thank you to all those who helped me figure this out, especially Nancy.
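Since more elegant solutions were invited: here is one hedged sketch (untested against the real files) that reads each file once, builds one small data frame per file, and binds them all at the end instead of filling in a preallocated frame:
files <- list.files("/foo", full.names = TRUE)
rows <- lapply(files, function(f) {
  getData <- read.csv(f, header = FALSE)
  data.frame(`Postal Code`   = toString(getData[2, 1]),
             `Median Income` = getData[1585, 2],
             check.names = FALSE)
})
dataMatrix <- do.call(rbind, rows)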
I have read multiple questionnaire files into DFs in R. Now I want to create new DFs based on them, but with only specific rows in them, via looping over all of them. The loop appears to work fine. However, the selection of the rows does not seem to work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I tried it with subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
# Read in a csv file from the files vector
df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
# Add a column telling us the name of the csv file that the data came from
df$SourceFile = file
# Select only the rows we want
df = df[c(3:22, 44:63, 93:112, 140:159, 180:199, 227:246), ]
# Return the subsetted data frame explicitly
df
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
I have a folder with several files. From each file I would like to select a cell (third row, 5th column) and bind them into one single column. Here's what I've got so far:
fnames1 <- scan(file.choose(), what = "character", quiet = TRUE)
print(fnames1)
for (i in fnames1)
{
  date.time <- read.table(paste("...", i, sep = ""), skip = 2, nrows = 1)
  timecol <- paste(date.time[, 5])
  time <- cbind(timecol)
}
but I'm still just getting several individual cells instead of one row with all the cells in it.
Any help will be very much appreciated!
EDIT: Answers to MrFlick: when prompted, I choose a file that contains the names of all the files in the folder from which I want to extract the cell I need. This is where fnames1 comes from. Time is a variable I'm creating to try to concatenate all the cells together (which is obviously not working). Adding that paste after the read.table is the only way I've managed to get the loop working... I'm very new to R and I've been working by trial and error.
I'm still not sure I completely understand, but what about:
fnames1 <- scan(file.choose(), what = "character", quiet = TRUE)
print(fnames1)
time <- sapply(fnames1, function(fn) {
date.time <- read.table(fn, skip = 2, nrows = 1)
date.time[, 5]
})
print(time)
Here we use sapply to loop over the file names rather than a for loop. sapply also automatically creates a vector of the correct size for us, filled with the date.time[, 5] values (that is, row 3, column 5 of each file, since we skip the first two rows and read one).
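A tiny toy illustration of that automatic simplification (nothing to do with the real files):
sapply(c("a", "b", "c"), toupper)
#   a   b   c
# "A" "B" "C"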
I would like to know how to solve the following problem using higher-order functions like ddply, ldply, and dlply, and avoid using problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the data.frame to open the files ("file1.txt", "file2.txt") in that directory, merge them, then combine the merged files from each entry into one large dataframe.
Something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
What I would like to do is apply a function to the dataframe that takes the directory, appends the two file names "file1.txt" and "file2.txt", then merges the two files together based on a given field.
For example, file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
And file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
The output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons would like to use a higher order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file="allDataEntries.csv", header=TRUE)
numberOfEntries <- dim(df)[1]
for(i in 1:numberOfEntries){
  dir <- df$dir[i]
  file1String <- paste(dir, "/file1.txt", sep='')
  file2String <- paste(dir, "/file2.txt", sep='')
  file1.df <- read.csv(file=file1String, header=TRUE)
  file2.df <- read.csv(file=file2String, header=TRUE)
  localMerged <- merge(file1.df, file2.df, by="value")
  allCombined <- rbind(allCombined, localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file1.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
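To see what the fold is doing, here is a small self-contained toy (made-up numbers, not the real files). Reduce applies merge_by pairwise, so with two data frames it is just merge_by(a, b); note how merge suffixes the clashing value columns as value.x and value.y:
a <- data.frame(entry = 1, subEntry = c("A", "B"), value = c(2, 3))
b <- data.frame(entry = 1, subEntry = c("A", "B"), value = c(8, 7))
merge_by(a, b)
#   entry subEntry value.x value.y
# 1     1        A       2       8
# 2     1        B       3       7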
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df, reads in the two associated files, and merges them together by value. ddply then takes these per-row data frames and makes a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df; if it is, you can add a unique row ID column to group over instead (see the sketch after the code).
library(plyr)

ddply(df, .(entryName), function(DF) {
  dir <- DF$dir
  file1String <- paste(dir, "/file1.txt", sep='')
  file2String <- paste(dir, "/file2.txt", sep='')
  file1.df <- read.csv(file=file1String, header=TRUE)
  file2.df <- read.csv(file=file2String, header=TRUE)
  merge(file1.df, file2.df, by="value")
})
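And if entryName does repeat, a hedged sketch of that unique-row-ID workaround (rowid is a hypothetical helper column, not in the original data):
df$rowid <- seq_len(nrow(df))
ddply(df, .(rowid), function(DF) {
  file1.df <- read.csv(paste(DF$dir, "/file1.txt", sep=''), header=TRUE)
  file2.df <- read.csv(paste(DF$dir, "/file2.txt", sep=''), header=TRUE)
  merge(file1.df, file2.df, by="value")
})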