I'm currently having a problem binding two sets of data frames together.
Folder1 <-list.files(path[1],pattern=".csv")
Folder2 <-list.files(path[2],pattern=".csv")
File <-rbind(Folder1,Folder2)
Error:SQL logic error missing database near "AS":syntax error
You are misunderstanding what list.files() does. It returns a character vector of the filenames that match your pattern and/or path; it does not import anything by itself.
This is the construction I usually use:
library(data.table) #for fread and rbindlist
Folder1_reads <- list()
Folder1_list <- list.files(path[1],pattern=".csv")
for (i in 1:length(Folder1_list)) {
Folder1_reads[[i]] <- fread(paste(path[1], Folder1_list[i], sep = "/")) #maybe you won't need the "/" depending on what is in path[1]
}
Folder1 <- rbindlist(Folder1_reads)
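If you want both folders read and stacked in one go, here is a minimal sketch along the same lines (assuming path holds your two folder paths and the csv files are fread-able):
library(data.table) # fread() and rbindlist()
all_files <- list.files(path[1:2], pattern = "\\.csv$", full.names = TRUE) # list both folders at once
File <- rbindlist(lapply(all_files, fread)) # read each file and stack the results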
I have a lot of text files in R that are written in the following format:
building_000000.txt
building_window_roof_000123.txt
building_window_roof_000126.txt
...
which I have listed using this command
files_list <- list.files(pattern="txt")
What I wanted to do is to bind all files (dataframes) which have this pattern "building_roof_window_\\\\\d+" into a single .txt file by using mget(ls). I also wanted to use "rbind.fill" because not all dataframes have the same number of columns. So this is what I tried to do:
building_roof_window <- do.call("rbind.fill", mget(ls(pattern="^building[_]roof[_]window[_]\\\\\\d+")))
But the result is an empty dataframe.
What am I missing? Is it perhaps due to the sloppy use of regex?
The main task is to select the filenames with a correct regex. We can use a regex like the one below:
files_list <- list.files(pattern= 'building_roof_window_\\d+.*\\.txt$')
building_roof_window <- do.call(plyr::rbind.fill, mget(files_list))
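Note that mget() only retrieves objects that already exist in your workspace; it does not read files. If the data frames have not been created yet, a rough sketch that reads the matched files from disk instead (assuming the .txt files are plain delimited tables that read.table() can parse) would be:
files_list <- list.files(pattern = 'building_roof_window_\\d+.*\\.txt$')
building_roof_window <- do.call(plyr::rbind.fill, lapply(files_list, read.table, header = TRUE))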
I've used a lot of posts to get me this far (such as R list files with multiple conditions and How can I read multiple files from multiple directories into R for processing?), but I can't accomplish what I need in R.
I have many .csv files distributed in multiple subdirectories that I want to read in and then save as separate objects to the corresponding basename. The end result will be to rbind each of those files together. Here's sample dir structure and some of what I've tried:
./DATA/Cat_Animal/animal1.csv
./DATA/Dog_Animal/animal2.csv
./DATA/Dog_Animal/animal3.csv
./DATA/Dog_Animal/animal3.1.csv
#read in all csv files
files <- list.files(path="./DATA", pattern="*.csv", full.names=TRUE, recursive=TRUE)
But this results in all files in all subdirectories. I want to match only specific files (animalX.csv) in specific subdirectories matching the pattern (X_Animal), such as this:
files <- dir(path=paste0("./DATA/", pattern="*+_Animal"), recursive=TRUE, full.names=TRUE, pattern="animal+.*csv")
Once I get my list of files, I want to read each of them in and save each to the corresponding file's basename. So the file named animal1.csv
would be saved to an object called animal1. I think I need to use basename() somewhere in a loop, but I'm not sure how.
Help is very much appreciated; I've spent a lot of time trying out various options with little progress.
This question is really two questions; consider splitting them up. On the last part of your question (how to rbind a list full of data.frames together), try:
finalDf = do.call(rbind, result)
You'll likely need str_split() from the stringr package to extract the parts of the file path you need. You could also use str_extract() with regular expressions.
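For instance, a rough, untested sketch that lists the matching files recursively, reads them, and names each data frame after its file's basename (str_detect() and str_remove() are stringr helpers):
library(stringr)
files <- list.files(path = "./DATA", pattern = "animal.*\\.csv$", full.names = TRUE, recursive = TRUE)
files <- files[str_detect(files, "_Animal/")] # keep only files sitting in a *_Animal subdirectory
result <- lapply(files, read.csv)
names(result) <- str_remove(basename(files), "\\.csv$") # e.g. "animal1", "animal3.1"
finalDf <- do.call(rbind, result)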
I think I found a work-around for the short term because luckily I only have a few subdirectories currently.
myFiles1 <- list.files(path = "./DATA/Cat_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
df <- read.csv(file = paste0("./DATA/Cat_Animal/", f ))
}
result1 <- lapply(myFiles1, processFile) # lapply() keeps the results as a list of data frames
#then do it again for the next subdir:
myFiles2 <- list.files(path = "./DATA/Dog_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
df <- read.csv(file = paste0("./DATA/Dog_Animal/", f ))
}
result2 <- lapply(myFiles2, processFile)
finalDf <- do.call(rbind, c(result1, result2)) # combine the two lists before rbind-ing
I know there is a better way but can't figure out the pattern matching for the subdirectories! It's so easy in unix for example
You can simply call list.files() twice.
a <- list.files(path="./DATA", pattern="*_Animal", full.names=T, recursive=F)
a
#[1] "./DATA/Cat_Animal" "./DATA/Dog_Animal"
files <- list.files(path=a, pattern="*animal*", full.names=T)
files
#[1] "./DATA/Cat_Animal/animal1.txt" "./DATA/Dog_Animal/animal2.txt" #"./DATA/Dog_Animal/animal3.txt"
#[4] "./DATA/Dog_Animal/animal4.txt"
In the first step, make sure to use full.names = TRUE and recursive = FALSE. You need full.names = TRUE to get the file path, not just the file name; otherwise you might lose the path to animal*.csv in the second step. And recursive = TRUE would return nothing, since Dog_Animal and Cat_Animal are folders, not files.
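From there, a small sketch (untested) of reading everything in, creating one object per file named after its basename, and stacking the lot:
dat <- lapply(files, read.csv)
names(dat) <- tools::file_path_sans_ext(basename(files)) # e.g. "animal1"
for (nm in names(dat)) assign(nm, dat[[nm]]) # optional: one object per file, named after its basename
finalDf <- do.call(rbind, dat)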
I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column (inside the data frames) you want to calculate the mean of, and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
setwd("C:/Users/cw/Documents")
setwd(directory)
files <<- list.files()
First of all, set the wd and get a list of all files
x <- id[1]
x
get the starting point of the user-specified ID.
Problem
for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint how I could proceed?
Based on your example e.g. 16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern="\\.csv$")[10:25] # hard-coded indices for this example; in production use your id parameter here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles # OPTIONAL: name the list elements (and hence the row names of the result) after the csv files
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
You should be able to adapt these snippets and incorporate them into your function.
You can aggregate your csv files into one big table like this:
bigtable <- NULL # start with an empty object so the first rbind() works
for(i in 100:250)
{
infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep="")
newtable <- read.csv(infile)
newtable <- cbind(newtable, file_id = rep(i, nrow(newtable))) # if you want to be able to identify tables after they are aggregated
bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to distinguish those from the others because of the leading zeros, but it's fixable with a little extra treatment.
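For example, zero-padding the index with sprintf() keeps the constructed name in step with files 001.csv to 099.csv; only the line that builds the path changes (a small sketch):
infile <- paste0("C:/Users/cw/Documents/", sprintf("%03d", i), ".csv") # sprintf("%03d", 7) gives "007"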
Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header = TRUE).
They should also teach you to never use <<-.
I am working on a project that imports all csv files from a given folder and merges them into one file. I was able to import the rows and columns I wanted from each of the files from the folder but now need help merging them all into one file. I do not know how many files I will eventually end up with (probably around 120) so I do not want to merge them 1 by 1.
Here is what I have so far:
# Import All files
rowsToUse <- c(9:104,657:752)
colsToUse <- c(15,27,28,29,30,33,35)
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
for (i in seq_along(filenames)) {
assign(paste("df", i, sep = "."), read.csv(filenames[i])[!is.na(30),][rowsToUse,colsToUse])
}
# Merge into one file
for (i in seq_along(filenames)) {
df<-rbind(df.[i])
}
The first part of the code creates a series of data frames labeled df.1, df.2, etc. I would like them to end up in one final dataframe called df. All files are identical in structure.
I would really appreciate some help if someone has a few extra minutes! Thank you!
Since you have already read the files in, you can try the following:
do.call(rbind, mget(ls(pattern = "df")))
The ls(pattern = "df") should capture all of your "df.1", "df.2", and so on. Hopefully you don't have other things named with the same pattern, but if you do, experiment with a stricter pattern until the command lists just your data.frames.
mget() will bring all of these into a list on which you can use do.call(rbind, ...).
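For example, since your loop creates objects named df.1, df.2, and so on, a stricter pattern could be (just a sketch):
do.call(rbind, mget(ls(pattern = "^df\\.[0-9]+$")))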
Those all seem complicated ;). The answers above seem to be operating on "we have a list of objects with very similar names, how do we handle that". Answer: they don't need to have very similar names. They don't even have to be different objects.
If you read the files in not through a for loop, but through lapply(), you get a single object that contains all of the data frames - each one as a single element. These can then trivially be extracted. So you'd have something that looks like...
#Grab a list of filenames
filenames <- list.files("save", pattern="*.csv", full.names=TRUE)
#Iterate through that list of names, using lapply(), reading the data in.
list_of_data_frames <- lapply(filenames, function(x){
#Read the data in
to_return <- read.csv(x)[!is.na(30), ][c(9:104, 657:752), c(15, 27, 28, 29, 30, 33, 35)]
#Return it. You could save lines of code (and processor time!) by just reading
#straight into return(), but it would be a lot less clear.
return(to_return)
})
#Now use do.call to turn it into a single data frame.
data.df <- do.call("rbind", list_of_data_frames)
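If the files ever stop having identical columns, a fill-aware binder can stand in for do.call("rbind", ...); for instance, with the same list_of_data_frames, either of these pads missing columns with NA:
data.df <- dplyr::bind_rows(list_of_data_frames)
data.df <- data.table::rbindlist(list_of_data_frames, fill = TRUE) # returns a data.table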
Hope I can explain my question well enough to obtain an answer - any help will be appreciated.
I have a number of data files which I need to merge into one. I use a for loop to do this and add a column which indicates which file it is.
In this case there are 6 files with up to 100 data entries in each.
When there are 6 files I have no problem in getting this to run.
But when there are less I have a problem.
What I would like to do is use the for loop to test for the files and use the for loop variable to assemble a vector which references the files that exist.
I can't seem to get the new variable to accumulate the values of the for-loop variable as it goes through the loop.
Here is the sample code I have written so far.
for ( rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile))
files_found <- c(rloop1)
}
What I am looking for is for files_found to end up containing the indices 1...6 for which the files actually exist.
Regards
Steve
It would probably be better to list the files you want to load, and then loop over that list to load them. list.files is your friend here. We can use a regular expression to list only those files that end in "_Stats.csv". For example, in my current working directory I have the following files:
$ ls | grep Stats
bar_Stats.csv
foobar_Stats.csv
foobar_Stats.csv.txt
foo_Stats.csv
Only three of them are csv files I want to load (the .txt file doesn't match the pattern you showed). We can get these file names using list.files():
> list.files(pattern = "_Stats.csv$")
[1] "bar_Stats.csv" "foo_Stats.csv" "foobar_Stats.csv"
You can then loop over that and read the files in. Something like:
fnames <- list.files(pattern = "_Stats.csv$")
for(i in seq_along(fnames)) {
assign(paste("file_", i, sep = ""), read.csv(fnames[i]))
}
That will create a series of objects file_1, file_2, file_3 etc in the global workspace. If you want the files in a list, you could instead lapply over the fnames:
lapply(fnames, read.csv)
and if suitable, do.call might help combine the files from the list:
do.call(rbind, lapply(fnames, read.csv))
There's a much shorter way to do this using list.files(), as Henrik showed. In case you're not familiar with regular expressions (see ?regex), you could do:
n <- 6
Fnames <- paste(1:n,SampleName,"_",FileName,"_Stats.csv",sep="")
Filelist <- Fnames[file.exists(Fnames)]
which is perfectly equivalent. Both paste and file.exists are vectorized functions, so you better make use of that. There's no need for a for-loop whatsoever.
To get the numbers out of the filenames (assuming the leading index is the only digits in them), you can do:
gsub("[^[:digit:]]", "", Filelist)
See also ?regex
I think there are better solutions (e.g., you could use list.files() to scan the folder and then loop over the length of the returned object), but this should (I didn't try it) do the trick (using your sample code):
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, rloop1)
}
Alternatively, you could collect the file names themselves (rather than their indices) via:
files.found <- ""
for (rloop1 in 1 : 6) {
ReadFile=paste(rloop1,SampleName,"_",FileName,"_Stats.csv", sep="")
if (file.exists(ReadFile)) files_found <- c(files.found, ReadFile)
}
Finally, in your case list.files() could look something like this:
files_found <- list.files(pattern = paste0("^[[:digit:]]+", SampleName, "_", FileName, "_Stats\\.csv$"))
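From there, reading the files in, tagging each row with the file it came from, and stacking everything could look something like this rough sketch (read_one is just a hypothetical helper; it assumes all files share the same columns):
read_one <- function(f) {
  d <- read.csv(f)
  d$source_file <- f # column indicating which file each row came from
  d
}
combined <- do.call(rbind, lapply(files_found, read_one))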