I want to automate the extraction of certain information from text files using grep, grepl and regexpr. I have code that works when I run it on each individual file, but I cannot get a loop to work so that the process runs over all files in my working directory.
I am reading the txt files in as strings because of the structure of the data. The loop seems to process the first file repeatedly, as many times as there are files in the directory, obviously because of the length(txtfiles) call in the for statement.
txtfiles = list.files(pattern="*.txt")
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  # select hours of operation
  hours_op[i] <- all_data[hours_of_operation <- grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(hours_op, regexpr("[0-9]{1,9}.[0-9]{1,9}", hours_op))
}
I would be grateful if someone could point me in the right direction so that this routine runs once per file, rather than over the same file multiple times. I want to end up with a list of the file names and the corresponding hours_op values.
You need to either add an index ([i]) to every reference to hours_op, as in:
hours_op <- character(length(txtfiles))  # pre-allocate the result vector
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  hours_op[i] <- all_data[grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(hours_op[i], regexpr("[0-9]{1,9}.[0-9]{1,9}", hours_op[i]))
}
or better yet, use a temporary variable:
hours_op <- character(length(txtfiles))  # pre-allocate the result vector
for (i in 1:length(txtfiles)){
  all_data <- readLines(txtfiles[i])
  temp <- all_data[grep("Annual Hours of Operation:", all_data)]
  hours_op[i] <- regmatches(temp, regexpr("[0-9]{1,9}.[0-9]{1,9}", temp))
}
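Since the goal was a list of file names with the corresponding hours, a minimal follow-up sketch (assuming the loop above has run):
# pair each file with the value extracted from it
result <- data.frame(file = txtfiles, hours_op = hours_op)
result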
I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each file has 5 rows (or fewer; some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern="\\.txt$")
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header=TRUE)
  a$ID_Column <- x  # overwrite the (possibly wrong) ID with the file name
  return(a)
}))
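If the IDs should not carry the .txt extension (the file names are e.g. ABC001.txt while the IDs look like ABC001), you could strip it inside the function; tools::file_path_sans_ext is part of base R's tools package:
data <- do.call(rbind, lapply(all_files, function(x){
  a <- read.table(x, header=TRUE)
  a$ID_Column <- tools::file_path_sans_ext(basename(x))  # "ABC001.txt" -> "ABC001"
  a
}))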
I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, I need to automate it: I need to perform the "ProcessingSection" steps for up to thirty individual text files. How can I do this? Can I end up with a table or data frame containing thirty occurrences of text.v, one for each .txt file read with scan()?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help, as your example is not reproducible. Assuming you're happy with the result of mars.word.v, you can turn this portion of the code into a function that accepts a single argument, the result of scan:
processing_section <- function(x){
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""]  # drop the empty strings left behind by strsplit
}
Then, if all .txt files are in the current working directory, you should be able to list them and apply this function with:
lf <- list.files(pattern="\\.txt$")
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
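If you want to keep track of which result came from which file, you could also name the list by file name (a small addition to the answer above):
results <- lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
names(results) <- lf  # results[["pupil-14.txt"]] is then that file's word vector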
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that takes three arguments: the location of the files, the name of the column whose mean you want to calculate (inside the data frames), and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint on how I could proceed?
Based on your example, e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, ..., 025.csv.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv$")[10:25]  # [10:25] hard-coded here; in production use your id parameter
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles  # OPTIONAL: lets you trace rows back to their csv file later
df <- do.call("rbind", df_list)
mean(df[, "columnName"])
These code snippets should be easy to adapt and incorporate into your routine.
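Putting the pieces together, a sketch of the full function might look like this (sprintf handles the zero-padded file names; na.rm = TRUE is an assumption about how missing values should be treated):
pm <- function(directory, pollutant, id = 1:332) {
  # build the zero-padded file names, e.g. "010.csv" for id 10
  files <- file.path(directory, sprintf("%03d.csv", id))
  df <- do.call(rbind, lapply(files, read.csv, header = TRUE))
  mean(df[[pollutant]], na.rm = TRUE)
}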
You can aggregate your csv files into one big table like this:
bigtable <- NULL  # rbind() with NULL is fine for the first iteration
for(i in 100:250)
{
  infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep="")
  newtable <- read.csv(infile)
  newtable <- cbind(newtable, file_id = rep(i, dim(newtable)[1]))  # so you can identify tables after they are aggregated
  bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099: you'll have to treat those differently because of the leading zeros, but it's fixable with a little extra handling.
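For completeness, a sketch of that fix: sprintf can zero-pad the counter so the same loop covers 001.csv through 250.csv.
# sprintf("%03d", i) left-pads with zeros, so i = 7 gives "007"
infile <- paste("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv", sep="")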
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE) (sprintf takes care of the leading zeros in the file names).
They should also teach you to never use <<-.
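To illustrate that last point: <<- assigns outside the function's own environment, which creates side effects that are easy to miss.
x <- 1
f <- function() x <<- 2  # silently overwrites the global x
f()
x  # now 2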
I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
df<-rbind(df.[i])
}
The first part of the code creates a series of dataframes labeled df.1, df.2, etc. I would like them to end up in one final dataframe called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = "df") call should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other objects named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data.frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
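For instance, a stricter pattern that matches only names like "df.1", "df.2", ... (a suggested refinement, not part of the original answer):
do.call(rbind, mget(ls(pattern = "^df\\.[0-9]+$")))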
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
#Read the data in and subset as before (note: !is.na(30) is always TRUE; kept from the question's code)
to_return <- read.csv(x)[!is.na(30),][c(9:104,657:752), c(15,27,28,29,30,33,35)]
#Return it. You could save lines of code (and processor time!) by just reading
#straight into return(), but it would be a lot less clear.
return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)
Hope I can explain my question well enough to obtain an answer - any help will be appreciated.
I have a number of data files which I need to merge into one. I use a for loop to do this, adding a column that indicates which file each row came from.
In this case there are 6 files with up to 100 data entries in each.
When there are 6 files I have no problem in getting this to run.
But when there are less I have a problem.
What I would like to do is use the for loop to test for the files and use the for loop variable to assemble a vector which references the files that exist.
I can't seem to get the new variable to accumulate the values of the for-loop variable as it goes through the loop.
Here is the sample code I have written so far.
for (rloop1 in 1:6) {
  ReadFile <- paste(rloop1, SampleName, "_", FileName, "_Stats.csv", sep="")
  if (file.exists(ReadFile))
    files_found <- c(rloop1)
}
What I am looking for is that files_found will contain those files where 1...6 are valid for the files found.
Regards
Steve
It would probably be better to list the files you want to load, and then loop over that list to load them. list.files is your friend here. We can use a regular expression to list only those files that end in "_Stats.csv". For example, in my current working directory I have the following files:
$ ls | grep Stats
bar_Stats.csv
foobar_Stats.csv
foobar_Stats.csv.txt
foo_Stats.csv
Only three of them are csv files I want to load (the .txt file doesn't match the pattern you showed). We can get these file names using list.files():
> list.files(pattern = "_Stats\\.csv$")
[1] "bar_Stats.csv" "foo_Stats.csv" "foobar_Stats.csv"
You can then loop over that and read the files in. Something like:
fnames <- list.files(pattern = "_Stats\\.csv$")
for(i in seq_along(fnames)) {
  assign(paste("file_", i, sep = ""), read.csv(fnames[i]))
}
That will create a series of objects file_1, file_2, file_3 etc in the global workspace. If you want the files in a list, you could instead lapply over the fnames:
lapply(fnames, read.csv)
and if suitable, do.call might help combine the files from the list:
do.call(rbind, lapply(fnames, read.csv))
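Since the original goal included a column indicating which file each row came from, one way to fold that in (the column name source_file is just an illustration):
do.call(rbind, lapply(fnames, function(f) {
  d <- read.csv(f)
  d$source_file <- f  # record the file of origin
  d
}))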
There's a much shorter way to do this using list.files(), as Henrik showed. In case you're not familiar with regular expressions (see ?regex), you could do:
n <- 6
Fnames <- paste(1:n, SampleName, "_", FileName, "_Stats.csv", sep="")
Filelist <- Fnames[file.exists(Fnames)]
which is perfectly equivalent. Both paste and file.exists are vectorized functions, so you may as well make use of that. There's no need for a for-loop whatsoever.
To extract the leading number from the filenames (assuming those are the only digits at the start), you can do:
gsub("^([[:digit:]]+).*$", "\\1", Filelist)
See also ?regex
I think there are better solutions (e.g., you could use list.files() to scan the folder and then loop over the length of the returned object), but this should (I didn't try it) do the trick (using your sample code):
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, rloop1)
}
Alternatively, you could collect the file names (rather than their indices) via:
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, ReadFile)
}
Finally, in your case list.files could look something like this:
files.found <- list.files(pattern = "^[[:digit:]]+SampleName_FileName_Stats\\.csv$")
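If you ultimately need the numeric indices rather than the file names, a small follow-up sketch (assuming the names begin with those digits, as above):
as.integer(sub("^([[:digit:]]+).*", "\\1", files.found))  # e.g. 1 3 6 for the files that exist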