I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column you want to calculate the mean (inside the data frames) and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could procede?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv")[10:15]#here [10:15] ... in production use your function parameter here
file_list <- vector('list', length=length(csvFiles))
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles #OPTIONAL: if you want to rename (later rows) to the csv list
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be possible to pimp and incorprate into your routine.
You can aggregate your csv files into one big table like this :
for(i in 100:250)
{
infile<-paste("C:/Users/cw/Documents/",i,".csv",sep="")
newtable<-read.csv(infile)
newtable<-cbind(newtable,rep(i,dim(newtable)[1]) # if you want to be able to identify tables after they are aggregated
bigtable<-rbind(bigtable,newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't works for files 001 to 099, you'll have to distinguish those from the others because of the "0" but it's fixable with little treatment.
Why do you have lapply inside a for loop? Just do lapply(files[files %in% paste0(id, ".csv")], read.csv, header=T).
They should also teach you to never use <<-.
Related
I have several .csv files of data stored in a directory, and I need to import all of them into R.
Each .csv has two columns when imported into R. However, the 1001st row needs to be stored as a separate variable for each of the .csv files (it corresponds to an expected value which was stored here during the simulation; I want it to be outside of the main data).
So far I have the following code to import my .csv files as matrices.
#Load all .csv in directory into list
dataFiles <- list.files(pattern="*.csv")
for(i in dataFiles) {
#read all of the csv files
name <- gsub("-",".",i)
name <- gsub(".csv","",name)
i <- paste(".\\",i,sep="")
assign(name,read.csv(i, header=T))
}
This produces several matrices with the naming convention "sim_data_L_mu" where L and mu are parameters from the simulation. How can I remove the 1001st row (which has a number in the first column, and the second column is null) from each matrix and store it as a variable named "sim_data_L_mu_EV"? The main problem I have is that I do not know how to call all of the newly created matrices in my for loop.
Couldn't post long code in comments so am writing here:
# Use dialog to select folder
# Full names are required to access files that are not in the current working directory
file_list <- list.files(path = choose.dir(), pattern = "*.csv", full.names = T)
big_list <- lapply(file_list, function(z){
df <- read.csv(z)
scalar <- df[1000,1]
return(list(df, scalar))
})
To access the scalar value from the third file, you can use
big_list[[3]][2]
The elements in big_list follow the order of file_list so you always know which file the data comes from.
If you use data.table::fread() instead of read.csv, you can play around with assigning column names, selecting which rows/columns to read etc. It's also considerably faster for large datafiles.
Hope this helps!
Apologies if this may seem simple, but I can't find a workable answer anywhere on the site.
My data is in the form of a csv with the filename being a name and number. Not quite as simple as having file with a generic word and increasing number...
I've achieved exactly what i want to do with just one file, but the issue is there are a couple of hundred to do, so changing the name each time is quite tedious.
Posting my original single-batch code here in the hopes someone may be able to ease the growing tension of failed searches.
# set workspace
getwd()
setwd(".../Desktop/R Workspace")
# bring in original file, skipping first four rows
Person_7<- read.csv("PersonRound7.csv", header=TRUE, skip=4)
# cut matrix down to 4 columns
Person7<- Person_7[,c(1,2,9,17)]
# give columns names
colnames(Person7) <- c("Time","Spare", "Distance","InPeriod")
# find the empty rows, create new subset. Take 3 rows away for empty lines.
nullrow <- (which(Person7$Spare == "Velocity"))-3
Person7 <- Person7[(1:nullrow), ]
#keep 3 needed columns from matrix
Person7<- Person7[,c(1,3,4)]
colnames(Person7) <- c("Time","Distance","InPeriod")
#convert distance and time columns to factors
options(digits=9)
Person7$Distance <- as.numeric(as.character(Person7$Distance))
Person7$Time <- as.numeric(as.character(Person7$Time))
#Create the differences column for distance
Person7$Diff <- c(0, diff(Person7$Distance))
...whole heap of other stuff...
#export Minutes to an external file
write.csv(Person7_maxs, ".../Desktop/GPS Minutes/Person7.csv")
So the three part issue is as follows:
I can create a list or vector to read through the file names, but not a dataframe for each, each time (if that's even a good way to do it).
The variable names throughout the code will need to change instead of just being "Person1" "Person2", they'll be more like "Johnny1" "Lou23".
Need to export each resulting dataframe to it's own csv file with the original name.
Taking any and all suggestions on board - s.t.ruggling with this one.
Cheers!
Consider using one list of the ~200 dataframes. No need for separate named objects flooding global environment (though list2env still shown below). Hence, use lapply() to iterate through all csv files of working directory, then simply name each element of list to basename of file:
setwd(".../Desktop/R Workspace")
files <- list.files(path=getwd(), pattern=".csv")
# CREATE DATA FRAME LIST
dfList <- lapply(files, function(f) {
df <- read.csv(f, header=TRUE, skip=4)
df <- setNames(df[c(1,2,9,17)], c("Time","Spare","Distance","InPeriod"))
# ...same code referencing temp variable, df
write.csv(df_max, paste0(".../Desktop/GPS Minutes/", f))
return(df)
})
# NAME EACH ELEMENT TO CORRESPONDING FILE'S BASENAME
dfList <- setNames(dfList, gsub(".csv", "", files))
# REFERENCE A DATAFRAME WITH LIST INDEXING
str(dfList$PersonRound7) # PRINT STRUCTURE
View(dfList$PersonRound7) # VIEW DATA FRAME
dfList$PersonRound7$Time # OUTPUT ONE COLUMN
# OUTPUT ALL DFS TO SEPARATE OBJECTS (THOUGH NOT NEEDED)
list2env(dfList, envir = .GlobalEnv)
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or less - some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, function(x){
a = read.table(x, header=TRUE);
a$ID_Column=x
return(a)
}
)
I want to automate the extraction of certain information from text files using grep, grepl and regexpr. I have a code that works when I do it for each individual file, however I cannot get the loop to work, to automate the process for all files in my working directory.
I am reading in the txt files as strings because of the structure of the data. The loop seems to iterate through the first file numerous times corresponding to the number of files in the directory, obviously because of the length(txtfiles)command in the for statement.
txtfiles = list.files(pattern="*.txt")
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
#select hours of operation
hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(hours_op, regexpr("[0-9]{1,9}.[0-9]{1,9}",hours_op))
}
I would be grateful if someone could point me in the right direction to repeat this routine for each file, rather than the same file multiple times over. I want to end up with a list of the file names and the corresponding hours_op.
you need to either add an index ([i]) to every one of your reference to hours_op[i], as in:
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(hours_op[i], regexpr("[0-9]{1,9}.[0-9]{1,9}",hours_op[i]))
}
or better yet, use a temporary variable:
for (i in 1:length(txtfiles)){
all_data <- readLines(txtfiles[i])
temp <- all_data[hours_of_operation <- grep("Annual Hours of Operation:",all_data)]
hours_op[i] <-regmatches(temp, regexpr("[0-9]{1,9}.[0-9]{1,9}",temp))
}
I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
df<-rbind(df.[i])
}
The first part of the code creates a series of dataframes labled df.1, df.2, etc. I would like them to end up in one final dataframe called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = df) should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other things named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data.frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
#Read the data in
to_return <- read.csv(x)[!is.na(30),][c(9:104,657:752),c(15,27,28,29,30,33,35)])
#Return it. You could save lines of code (and processor time!) by just reading
#straight into return(), but it would be a lot less clear.
return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)