Append files based on their names - R

I am new to R and I have a lot of climate data files in text format with long names, all in the same folder, for example "tasmax_SAM-44_ICHEC-EC-EARTH_rcp26_r12i1p1_SMHI-RCA4_v3_day_20060101-20101231.txt", where each term separated by "_" corresponds to a characteristic such as the variable, domain, institute, scenario, etc.
What I want is code that selects all the files in my folder that share the same model name, scenario name and GCM name in their file names, and appends them by rows.
What I tried is to first create a list of the files and assign a variable to each part of their names, like model_name, gcm_name, etc.
Then I created a condition that compares those variables across the files with a loop.
file <- list.files(pattern = '*.txt')
group <- function(input){
  index = which(file == input)
  df = read.table(input, header = FALSE, sep = "")
  fname = unlist(strsplit(input, "_"), use.names = FALSE)
  model_name = fname[3]
  sce_name = fname[4]
  gcm_name = fname[6]
  m = 1
  for (m in 1:length(file)) {
    if (model_name[m] == model_name[m+1] & sce_name[m] == sce_name[m+1] & gcm_name[m] == gcm_name[m+1]) {
      data = rbind(df[m], df[m+1])
    } else {}
  }
}
for (i in 1:length(file)) {
  group(file[i])
}
The error I had with my code is this:
Error in if (model_name[m] == model_name[m + 1] & sce_name[m] ==
sce_name[m + : missing value where TRUE/FALSE needed
In the end, the code should append the files that meet the condition, for example making one file out of these two files:
tasmax_SAM-44_ICHEC-EC-EARTH_rcp26_r12i1p1_SMHI-RCA4_v3_day_20060101-20101231.txt
tasmax_SAM-44_ICHEC-EC-EARTH_rcp26_r12i1p1_SMHI-RCA4_v3_day_20110101-20151231.txt
Any help and suggestions are very welcome!

I would suggest a completely different approach:
Get the list of all txt files:
file <- list.files(pattern = '*.txt')
Read all the files into a single dataframe:
library(dplyr)
library(readr)
df <- suppressMessages(do.call(bind_rows,lapply(file, read_csv, col_names = FALSE)))
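Note: if your files are whitespace-delimited rather than comma-delimited (the read.table(..., sep = "") in the question suggests they are), the same line works with readr's read_table instead of read_csv; this is a hedged adjustment, not part of the original answer:
df <- suppressMessages(do.call(bind_rows, lapply(file, read_table, col_names = FALSE)))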
Then group_by the fields you want and write each frame into a separate csv file
df %>%
  group_by(X3, X4, X6) %>%
  do(write_csv(., paste0(.$X3[1], "_", .$X4[1], "_", .$X6[1], ".csv")))
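In more recent dplyr versions do() is superseded; here is a minimal equivalent sketch using group_split() and purrr::walk() (assuming the same X3/X4/X6 columns, not verbatim from the original answer):
library(purrr)
df %>%
  group_split(X3, X4, X6) %>%
  walk(~ write_csv(.x, paste0(.x$X3[1], "_", .x$X4[1], "_", .x$X6[1], ".csv")))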

Not sure if I get your question completely, but this may help.
The code works as follows:
Read the values of the file you give as input.
Loop over all other files and append them if they match your conditions.
The if condition takes the name parts of your input file and compares them with the name parts of file[m] in each iteration; if they match, that file's data gets appended to your data. Another fix: you have to use return(data) at the end of your function.
file <- list.files(pattern = '*.txt')
group <- function(input){
  # read the input file and split its name into its "_"-separated parts
  data = read.table(input, header = FALSE, sep = "")
  fname = unlist(strsplit(input, "_"), use.names = FALSE)
  model_name = fname[3]
  sce_name = fname[4]
  gcm_name = fname[6]
  for (m in 2:length(file)) {
    df_new = read.table(file[m], header = FALSE, sep = "")
    # split the name of the m-th file (not the input) before comparing
    fname_m = unlist(strsplit(file[m], "_"), use.names = FALSE)
    if (model_name == fname_m[3] & sce_name == fname_m[4] & gcm_name == fname_m[6]) {
      data = rbind(data, df_new)
    }
  }
  return(data)
}
group(file[1])
Problems which still have to be solved: you have to handle the case where you don't input the first file, since this code uses the file you pass to group() as the reference, but the for loop starts at the second file. So if you call group(file[3]), the first file will be skipped and the third file will be doubled. You could use another if condition, something like if (file[m] == input) skip (not actual syntax, just the idea; also make sure you get your loop range right). A sketch of that fix is below.
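A minimal sketch of that fix (hypothetical, not from the original answer): loop over every file and skip the one given as input.
group <- function(input) {
  data = read.table(input, header = FALSE, sep = "")
  fname = unlist(strsplit(input, "_"), use.names = FALSE)
  for (m in seq_along(file)) {
    if (file[m] == input) next  # skip the input file itself
    fname_m = unlist(strsplit(file[m], "_"), use.names = FALSE)
    if (fname[3] == fname_m[3] & fname[4] == fname_m[4] & fname[6] == fname_m[6]) {
      data = rbind(data, read.table(file[m], header = FALSE, sep = ""))
    }
  }
  return(data)
}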

Related

R function to get directory name of a file as characters

I can create a list of csv files in folder_A:
list1 <- dir_ls("path to folder_A")
I can define a function to add a column with filenames and combine these files into one dataframe:
read_and_save_combo <- function(fileX){
  read_csv(fileX) %>%
    mutate(fileX = path_file(fileX))
}
combo_df <- map_df(list1, read_and_save_combo)
I want to add another column with the enclosing folder name (which would be the same for all files, folder_A). If I use dirname() on an individual file, I get the full parent directory path down to folder_A; I only want the characters "folder_A". If I use dirname() as part of the function, I get another column, but it's filled with ".". Less importantly, I don't know why I get the "." instead of the full path; more importantly, is there a function like path_parentfoldername that would let me add a new column, containing only the name of the folder holding each file, to each row of the combined dataframe?
Thanks!
Edit:
New function for clarity after answers:
read_and_save_combo <- function(fileX){
  read_csv(fileX) %>%
    mutate(filename = path_file(fileX),
           foldername = dirname(fileX) %>% str_replace(pattern = ".*/", replacement = ""))
}
This works because . is the wildcard for any single character and * means zero or more of the preceding, so ".*/" greedily matches everything up to and including the last /. Gregor said this, but now I understand it.
Also, I was getting the column filled with "." because, in the function, I was reading one file but then trying to mutate with dirname() operating on the whole list, which is a vector with multiple elements (more than one file).
You can use dirname + basename:
list1 <- list.files('folder_A_path', full.names = TRUE)
read_and_save_combo <- function(fileX) {
  readr::read_csv(fileX) %>%
    dplyr::mutate(fileX = basename(dirname(fileX)))
}
combo_df <- purrr::map_df(list1, read_and_save_combo)
If your file is at the path 'Users/Downloads/FolderA/Filename.csv' :
dirname('Users/Downloads/FolderA/Filename.csv')
#[1] "Users/Downloads/FolderA"
basename(dirname('Users/Downloads/FolderA/Filename.csv'))
#[1] "FolderA"
"path to folder_A" is a bad example, use "path/to/folder_A". You need to delete everything from the start through the last /:
library(stringr)
str_replace("path/to/folder_A", pattern = ".*/", replacement = "")
# [1] "folder_A"
If you're worried about \\ or other non-standard things, use dirname() as the input.
Here are two ways to do what I wanted, using the helpful answers above:
read_and_save_combo <- function(file){
  read_csv(file) %>%
    mutate(filename = path_file(file), foldername = basename(dirname(file)))
}
read_and_save_combo <- function(file){
  read_csv(file) %>%
    mutate(filename = path_file(file),
           foldername = dirname(file) %>% str_replace(pattern = ".*/", replacement = ""))
}
Other basic things I learned that could be helpful for other beginners:
(1) While writing the function, point all the functions (read_csv(), dirname(), etc.) at one uniform variable (here written as "file", but it could be just a letter "g" or whatever you choose). Then you avoid the problem I had, where part of the function acts on one file and another part acts on the whole list.
(2) filex and fileX appear far too similar to each other in certain fonts, which can mess you up (capitalization).

Can I automate an increasing value in a file name in R?

So I have .csv's of nesting data that I need to trim. I wrote a series of functions in R that spit out the new, pretty .csv. The issue is that I need to do this with 59 .csv's, and I would like to automate the naming.
data1 <- read.csv("Nest001.csv", skip = 3, header = FALSE)
# functions functions functions
write.csv(edit, file.path(out.path, "Nest001_NEW.csv"), row.names = FALSE)  # 'edit' stands for the trimmed data
So...is there any way for me to loop the names Nest001 to Nest059 so that I don't have to delete and retype the name for every .csv?
EDIT to incorporate Gregor's suggestion:
One option:
filenames_in <- sprintf("Nest%03d.csv", 1:59)
filenames_out <- sub(pattern = "(\\d{3})(\\.)", replacement = "\\1_NEW\\2", filenames_in)
all_files <- matrix(c(filenames_in, filenames_out), ncol = 2)
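For example, the first pair produced by the code above is (just an illustration of its output):
all_files[1, ]
# [1] "Nest001.csv"     "Nest001_NEW.csv"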
And then loop through them:
for (i in 1:nrow(all_files)) {
  temp <- read.csv(all_files[[i, 1]], skip = 3, header = FALSE)
  # do stuff
  write.csv(temp, all_files[[i, 2]], row.names = FALSE)
}
To do this purrr-style, you would create two vectors similar to the above, and then write a custom function to read in the file, perform all the functions, and then output it.
e.g.
purrr::walk2(
  .x = filenames_in,
  .y = filenames_out,
  .f = ~ my_function(.x, .y)
)
Consider .x and .y as the i in the for loop; it goes through both lists simultaneously, and performs the function on each item.
More info is available here.
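A hypothetical my_function for this pattern (the body is an assumption standing in for your own trimming functions, not from the original post):
my_function <- function(file_in, file_out) {
  temp <- read.csv(file_in, skip = 3, header = FALSE)
  # ... your trimming functions go here ...
  write.csv(temp, file_out, row.names = FALSE)
}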
Your best bet is to put all of these CSVs into one folder, without any other CSVs in that folder. Then, you can write a loop to go over every file in that folder, and read them in.
library(dplyr)
setwd("path to the folder with CSV's goes here")
combinedData = data.frame()
files = list.files()
for (file in files)
{
  df = read.csv(file)
  combinedData = bind_rows(combinedData, df)
}
EDIT: if there are other files in the folder that you don't want to read, you can add this line of code to only read in files that contain the word "Nest" in the title:
files = files[grepl("Nest", files)]
I don't remember off the top of my head if that is case-sensitive or not (by default grepl is case-sensitive; pass ignore.case = TRUE if you need it not to be).

Sort list into sub-lists based on pattern matching R

I have a number of TIFF files (each belonging to an image date) in one folder and want to make a list for each unique date, then populate those lists with the appropriate files. Ideally, I'd like a function where a user would just change the list of dates, though I haven't been able to get a function that loops through my list of dates to work. Instead, I've tried to make a function that I would run for each unique date.
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
# Make a list of all files in the data directory
allFilesDataDirectory <- list.files(path = dataDirectory, pattern = 'TIF$')
# allFilesDataDirectory is a list of 60 TIFF files with the same naming convention along the lines of LC08_L1TP_038037_20180810_20180815_01_T1_B9
allDateLists <- NULL
for (d in dates){
  fileFolderDate <- NULL
  dynamicDateNames <- paste0('fileListL8', d)
  assign(dynamicDateNames, fileFolderDate)
  allDateLists <- c(allDateLists, dynamicDateNames)
}
myFunction <- function(date, fileNameList){
  # files first
  for (i in allFilesDataDirectory){
    # Create a list out of the file name by splitting up the name wherever there is a _ in the name
    splitFileName <- unlist(strsplit(i, "[_]"))
    if(grepl(splitFileName[4], date) & (grepl('B', splitFileName[8]))){
      fileNameList <- c(fileNameList, i)
      print(i)
    }
    else {
      print('no')
    }
  }
}
myFunction(date = '20180623', fileNameList = 'fileListL820180623')
The function runs, but fileListL820180623 is NULL.
When hard-coding this, everything works, and I am not sure of the difference. I tried using assign() (not shown here), but it did nothing.
for (i in allFilesDataDirectory){
  # Create a list out of the file name by splitting up the name wherever there is a _ in the name
  splitFileName <- unlist(strsplit(i, "[_]"))
  if(grepl(splitFileName[4], '20180623') & (grepl('B', splitFileName[8]))){
    fileListL820180623 <<- c(fileListL820180623, i)
  }
  else {
    print('no')
  }
}
For some reason grepl wasn't working well in this case, but glob2rx worked great.
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
for (d in dates){
  listLandsatFiles <- list.files(path = dataDirectory, pattern = glob2rx(paste0('*', d, '*B*TIF')))
  dynamicFileListName <- paste0('fileListL8', d)
  assign(dynamicFileListName, listLandsatFiles)
}
P.S. This might be helpful if you have multiple Landsat images saved in one directory and want to make lists, by image date, of only the TIFF files (and perhaps want to make a raster brick later on).
I am not exactly sure what you want to achieve, but it seems you are making it too difficult, and you are using poor choices with the shortcuts <<- and assign (there are very few cases where their use is warranted).
I would suggest something along these lines:
getTiffPattern <- function(pattern = '', folder = '.') {
  ff <- list.files(folder, pattern = pattern, full.names = TRUE)
  grep('\\.tif$', ff, ignore.case = TRUE, value = TRUE)
}
getTiffPattern('20180420')
Or for a vector of dates
dates <- list('20180420', '20180522', '20180623', '20180725', '20180810')
x <- lapply(dates, getTiffPattern)
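If you want the result keyed by date (a small addition, not from the original answer):
names(x) <- unlist(dates)
x[["20180623"]]  # the files for that date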

looping over all files in the same directory in R

I want to run the following code in R on all the files in a directory. I actually made a for loop for that, but when I run it, it is applied to only one file, not all of them. BTW, my files do not have a header.
You use [[ to subset something from peaks. However, after reading the file in by name, peaks is a data frame with no further reference to the file name, so you just have to get rid of the [[i]].
for (i in filelist.coverages) {
  peaks <- read.delim(i, sep = '', header = FALSE)
  PeakSizes <- c(PeakSizes, peaks$V3 - peaks$V2)
}
By using the iterator i within read.delim(), which holds a new file name on each pass, peaks gets the content of a new file every time R goes through the loop.
In your code, i references a file name; use indices instead.
And, by the way, don't use setwd; use the full.names = TRUE option in list.files. And since each file yields a vector of peak sizes, preallocate PeakSizes as a list: PeakSizes <- vector("list", length(filelist.coverages)).
So do:
filelist.coverages <- list.files('K:/prostate_cancer_porto/H3K27me3_ChIPseq/',
                                 pattern = 'island.bed', full.names = TRUE)
## all 97 bed files
PeakSizes <- vector("list", length(filelist.coverages))
for (i in seq_along(filelist.coverages)) {
  peaks <- read.delim(filelist.coverages[i], sep = '', header = FALSE)
  PeakSizes[[i]] <- peaks$V3 - peaks$V2
}
PeakSizes <- unlist(PeakSizes)  # one flat vector of peak sizes, as in the loop above
Or you could simply use sapply (or purrr::map) and unlist the result:
sapply(filelist.coverages, function(file) {
  peaks <- read.delim(file, sep = '', header = FALSE)
  peaks$V3 - peaks$V2
})

creating a function which extracts a user specified column from a set of files

I have a set of csv files, all with the same structure. I want to create a function that extracts a particular column from all the files, finds the mean of all the values in that column, and stores it in a vector. The column name should be passed by the user.
I have written the following program. Somehow it cannot identify "pollutant", which contains the name of a column.
pollutantmean <- function(pollutant)
{
  file_names <- dir("C:/Users/Keval/Desktop/Project R/R_courseera_programming_exercise/specdata", pattern = glob2rx("*.csv"))
  for(file_name in file_names)
  {
    file_reader <- read.csv(file_name)
    pollutant_data <- file_reader$pollutant
  }
  pollutant_data
  pollutant
}
Use a string, e.g., call your function with
pollutantmean(pollutant = "mercury")
and use [ (which accepts strings) instead of $, which doesn't:
# replace the line
pollutant_data <- file_reader$pollutant
# with this:
pollutant_data <- file_reader[, pollutant]
This won't error out, but you still need to take a mean and store it. I'm also pretty sure you want list.files, not dir.
pollutantmean <- function(pollutant) {
  file_names <- list.files("C:/Users/Keval/Desktop/ProjectR/R_courseera_programming_exercise/specdata",
                           pattern = glob2rx("*.csv"), full.names = TRUE)
  # initialize mean vector at correct length
  my_means = numeric(length(file_names))
  # make the loop indexed by number
  for(i in seq_along(file_names)) {
    file_reader <- read.csv(file_names[i])
    pollutant_data <- file_reader[, pollutant]
    # using the number index
    my_means[i] = mean(pollutant_data)
  }
  return(my_means)
}
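A hypothetical call (the column name "sulfate" is an assumption; use whatever column your csv files actually contain):
means <- pollutantmean("sulfate")
head(means)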
