rbind.fill based on common pattern R - r

I have a lot of text files in R that are written in the following format:
building_000000.txt
building_window_roof_000123.txt
building_window_roof_000126.txt
...
which I have listed using this command
files_list <- list.files(pattern="txt")
What I wanted to do is to bind all files (dataframes) which have this pattern "building_roof_window_\\\\\d+" into a single .txt file by using mget(ls). I also wanted to use "rbind.fill" because not all dataframes have the same number of columns. So this is what I tried to do:
building_roof_window <- do.call("rbind.fill", mget(ls(pattern="^building[_]roof[_]window[_]\\\\\\d+")))
But the result is an empty dataframe.
What am I missing? Is it perhaps due to the sloppy use of regex?

The main task is to select filenames using correct regex. We can use the regex as below :
files_list <- list.files(pattern= 'building_roof_window_\\d+.*\\.txt$')
building_roof_window <- do.call(plyr::rbind.fill, mget(files_list))

Related

Merging Dataframes containing Strings with Different Row in R langguage

Currently having problem binding two sets of dataframes together.
Folder1 <-list.files(path[1],pattern=".csv")
Folder2 <-list.files(path[2],pattern=".csv")
File <-rbind(Folder1,Folder2)
Error:SQL logic error missing database near "AS":syntax error
You are not understanding exactly what list.files does. It creates a list of all the filenames that match your pattern and/or path. That does not mean however that anything has been imported yet.
This is the construction I usually use:
library(data.table) #for fread and rbindlist
Folder1_reads <- list()
Folder1_list <- list.files(path[1],pattern=".csv")
for (i in 1:length(Folder1_list)) {
Folder1_reads[[i]] <- fread(paste(path[1], Folder1_list[i], sep = "/")) #maybe you won't need the "/" depending on what is in path[1]
}
Folder1 <- rbindlist(Folder1_reads)

Extracting a single cell value from multiple csv files in R and

I have 500 csv. files with data that looks like:
sample data
I want to extract one cell (e.g. B4 or 0.477) per a csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this
all.fi <- list.files("/path/to/csvfiles", pattern=".csv", full.names=TRUE) # store names of csv files in path as a string vector
library(readr) # package for read_lines and write_lines
ans <- sapply(all.fi, function(i) { eachline <- read_lines(i, n=4) # read only the 4th line of the file
ans <- unlist(strsplit(eachline, ","))[2] # split the string on commas, then extract the 2nd element of the resulting vector
return(ans) })
write_lines(ans, "/path/to/output.csv")
I can not add a comment. So, I will write my comment here.
Since your data is very large and it is very difficult to load it individually, then try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For second part, try this:
You can save your data as a data.frame (as with the comment of #Bruno Zamengo) and then you can use select and merge functions in R. Then, you can easily combine them in single csv file. With select and merge functions you can select all the values you need and them combine them. I used this idea in my project. Do not forget to use lapply.

Replace values within dataframe with filename while importing using read.table (R)

I am trying to clean up some data in R. I have a bunch of .txt files: each .txt file is named with an ID (e.g. ABC001), and there is a column (let's call this ID_Column) in the .txt file that contains the same ID. Each column has 5 rows (or less - some files have missing data). However, some of the files have incorrect/missing IDs (e.g. ABC01). Here's an image of what each file looks like:
https://i.stack.imgur.com/lyXfV.png
What I am trying to do here is to import everything AND replace the ID_Column with the filename (which I know to all be correct).
Is there any way to do this easily? I think this can probably be done with a for loop but I would like to know if there is any other way. Right now I have this:
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, read.table, header=TRUE))
So, basically, I want to know if it is possible to use lapply (or any other function) to replace data$ID_Column with the filenames in all_files. I am having trouble as each filename is only represented once in all_files, while each ID_Column in data is represented 5 times (but not always, due to missing data). I think the solution is to create a function and call it within lapply, but I am having trouble with that.
Thanks in advance!
I would just make a function that uses read.table and adds the file's name as a column.
all_files <- list.files(pattern=".txt")
data <- do.call(rbind, lapply(all_files, function(x){
a = read.table(x, header=TRUE);
a$ID_Column=x
return(a)
}
)

Combine several data frames in the global environment by row (rbind)

I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
df<-rbind(df.[i])
}
The first part of the code creates a series of dataframes labled df.1, df.2, etc. I would like them to end up in one final dataframe called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = df) should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other things named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data.frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
#Read the data in
to_return <- read.csv(x)[!is.na(30),][c(9:104,657:752),c(15,27,28,29,30,33,35)])
#Return it. You could save lines of code (and processor time!) by just reading
#straight into return(), but it would be a lot less clear.
return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)

For loop question in R

Hope I can explain my question well enough to obtain an answer - any help will be appreciated.
I have a number if data files which I need to merge into one. I use a for loop to do this and add a column which indicates which file it is.
In this case there are 6 files with up to 100 data entries in each.
When there are 6 files I have no problem in getting this to run.
But when there are less I have a problem.
What I would like to do is use the for loop to test for the files and use the for loop variable to assemble a vector which references the files that exist.
I can't seem to get the new variable to combine the new value of the for loop variable as it goes through the loop.
Here is the sample code I have written so far.
for ( rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
**files_found <- c(rloop1)**
}
What I am looking for is that files_found will contain those files where 1...6 are valid for the files found.
Regards
Steve
It would probably be better to list the files you want to load, and then loop over that list to load them. list.files is your friend here. We can use a regular expression to list only those files that end in "_Stats.csv". For example, in my current working directory I have the following files:
$ ls | grep Stats
bar_Stats.csv
foobar_Stats.csv
foobar_Stats.csv.txt
foo_Stats.csv
Only three of them are csv files I want to load (the .txt file doesn't match the pattern you showed). We can get these file names using list.files():
> list.files(pattern = "_Stats.csv$")
[1] "bar_Stats.csv" "foo_Stats.csv" "foobar_Stats.csv"
You can then loop over that and read the files in. Something like:
fnames <- list.files(pattern = "_Stats.csv$")
for(i in seq_along(fnames)) {
assign(paste("file_", i, sep = ""), read.csv(fnames[i]))
}
That will create a series of objects file_1, file_2, file_3 etc in the global workspace. If you want the files in a list, you could instead lapply over the fnames:
lapply(fnames, read.csv)
and if suitable, do.call might help combine the files from the list:
do.call(rbind, lapply(fnames, read.csv))
There's a much shorter way to do this using list.files() as Henrik showed. In case you're not familiar with regular expressions (see ?regex), you could do.
n <- 6
Fnames <- paste(1:n,SampleName,"_",FileName,"Stats.csv",sep="")
Filelist <- Fnames[file.exists(Fnames)]
which is perfectly equivalent. Both paste and file.exists are vectorized functions, so you better make use of that. There's no need for a for-loop whatsoever.
To get the number of the filenames (assuming that's the only digits), you can do:
gsub("^[:digit:]","", Filelist)
See also ?regex
I think there are better solutions (e.g., you could use list.files() to scan the folder and then loop over the length of the returned object), but this should (I didn't try it) do the trick (using your sample code):
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, rloop1)
}
Alternatively, you could get the fileNames (other than their index) via:
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, ReadFile)
}
Finally, in your case list.files could look something like this:
files.found <- list.files(pattern = "[[:digit:]]_SampleName_FileName_Stats.csv")

Resources