Changing the name of a dataframe inside of a dataframe - r

I am working with a folder of csv files. These files were imported using the following code:
data_frame <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data <- lapply(data_frame, read.csv)
names(csv_data) <- gsub(".csv","",
list.files("path", pattern = ".csv", all.files = TRUE, full.names = FALSE),
fixed = TRUE)
After this runs, each data frame in the list is named after its csv file. Since I have over 3000 csv files, I was wondering how to change those names to keep track of them better.
For example, instead of 'City, State, US', it will generate 'City-State-US'.
I apologize if this has already been asked, but I cannot find anything that could help.

So, if I understand your question correctly, you have CSV files named "City, State, US.csv" (so "Chicago, IL, USA.csv", etc.), you are reading them into R and storing them in a list where each list element's name is the CSV name, and you want to make some changes to that element name?
You can access the names of the list items using names(csv_data), as you did above, and then transform them however you like and write them back to the same place.
For instance, the example you gave:
names(csv_data) <- gsub(", ", "-", names(csv_data), fixed = TRUE)
This should do what you need. If you need to do something else, just change the gsub parameters or function to something else - the key is that you can extract and write back the list item names in one shot.
You are already sort of doing this with the third line, where you name the items - you could even apply the transformation before you assign the names.
Edit: Also, a quick note: You are already storing the output of list.files in the data_frame variable - you could just reuse that variable in the third line instead of calling list.files again.
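Putting it together, a minimal sketch (assuming the same "path" folder and the "City, State, US.csv" naming above; I've renamed the path vector to csv_paths for clarity, and basename() plus tools::file_path_sans_ext() are just one way to strip the directory and extension):
# list the CSV files once and reuse the result
csv_paths <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data <- lapply(csv_paths, read.csv)
# name the elements from the paths, replacing ", " with "-"
# e.g. "City, State, US" becomes "City-State-US"
names(csv_data) <- gsub(", ", "-", tools::file_path_sans_ext(basename(csv_paths)), fixed = TRUE)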

Related

loop (for) in R to read separately many files (from list of files)

I have a list of Excel files that I want to read.
As they have sometimes different columns I want to read them separately, giving them their location as name.
I'm new to loops, and I can't work out how to name them differently. This is what I tried, but it reads only one (of 40) files.
myfiles = list.files(path="Invoices/GLS/",
pattern = "0_*", full.names = TRUE, recursive = T)
myfiles
for(val in myfiles) {
val <- read_excel(val)
}
As mentioned by Ronak, you're probably better off with the sapply approach because of verbosity and ease of reading, as well as to keep your environment from being cluttered with objects (which would only make processing more labour intensive down the line). For the sake of illustration, I'll show an approach using a for loop first, followed by sapply() and lapply().
The reason you only have one dataframe in your environment is because you're assigning data to the same object in each iteration. As a result, your object is overwritten whenever that line is called in your loop. At the end of the loop, val will contain the last file that was loaded via read_excel(val). To solve this, you could use the assign() approach.
for(val in seq_along(myfiles)) {
# create name of object to assign file to
obj_name <- paste0("df_", val)
# Import file
df_file <- read_excel(myfiles[val])
# Assign it to a separate object in global environment
assign(obj_name, df_file)
# Ensure no df_file remains in environment when loop ends
rm(df_file)
}
Naturally, you could give your R objects other names than df_1, df_2 and so forth. You could also write the above in fewer lines of code, but that would defeat the purpose of illustrating the different steps.
As for sapply(simplify = FALSE) and lapply(), your data would not be loaded into separate objects in your environment. Instead your datasets would be loaded into a list. As an extra step, you can easily make 1 single dataframe out of that list if you so desire. Regardless, I'd opt for having 1 list of 40 dataframes over 40 objects sitting in my environment.
A convenient feature of sapply() is the USE.NAMES argument, which defaults to TRUE. This will set the names of the elements of the resulting list to your X argument (your input vector of filenames). The simplify = FALSE prevents the function from trying to coerce your list to another structure (in this case, it would become a matrix).
# load excel files into a list of dataframes, retaining filenames
df_list <- sapply(myfiles, read_excel, simplify = FALSE)
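If you later want that single dataframe mentioned above, a hedged sketch building on df_list (assuming the files share compatible columns; dplyr is just one convenient option here):
library(dplyr)
# bind_rows() stacks the list into one dataframe and stores each element's name in an id column
all_data <- bind_rows(df_list, .id = "source_file")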
However, in my opinion, the names of your list elements can become quite messy if you're dealing with FULL filenames (as a result of full.names = TRUE in list.files()). In the two-step approach below, you first load your dataframes into a list, and then set the names of said list using some regular expression. This expression keeps only the last part of the filepath, which contains the actual name of your file (e.g. invoice542.xlsx).
# load excel files into a list of dataframes
df_list <- lapply(myfiles, read_excel)
# name list elements using regex
names(df_list) <- sub(r"{[\w\/]+(?:\/)}", "", myfiles, perl = TRUE)
Bear in mind that regex isn't my strong suit, so the above expression could probably be written more clearly/concise.
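If the regex feels brittle, basename() does the same job without a pattern - a hedged alternative:
# basename() drops everything up to and including the last path separator
names(df_list) <- basename(myfiles)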
And that's it for my first ever SO contribution!
Read them in a list using sapply -
myfiles = list.files(path="Invoices/GLS/", pattern = "0_*",
full.names = TRUE, recursive = TRUE)
val <- sapply(myfiles, readxl::read_excel, simplify = FALSE)
names(val) should return the paths of the files.

Load multiple .csv files in R, edit them and save as new .csv files named by a list of character strings

I am pretty new to R and programming so I do apologise if this question has been asked elsewhere.
I'm trying to load multiple .csv files, edit them and save them again, but I cannot work out how to handle more than one .csv file or how to name the new files based on a list of character strings.
So I have a .csv file and can do:
species_name<-'ace_neg'
species<-read.csv('species_data/ace_neg.csv')
species_1_2<-species[,1:2]
species_1_2$species<-species_name
species_3_2_1<-species_1_2[,c(3,1,2)]
write.csv(species_3_2_1, file='ace_neg.csv',row.names=FALSE)
But I would like to run this code for all .csv files in the folder and add text to a new column based on .csv file name.
So I can load all .csv files and make a list of character strings for use as a new column text and as new file names.
NDOP_files <- list.files(path="species_data", pattern="*.csv$", full.names=TRUE, recursive=FALSE)
short_names<- substr(NDOP_files, 14,20)
Then I tried:
lapply(NDOP_files, function(x){
species<-read.csv(x)
species_1_2<-species[,1:2]
species_1_2$species<-'name' #don't know how to insert the first character string of short_names instead of 'name', then the second character string from short_names for the second csv file etc.
Then continue in the code to change the order of columns
species_3_2_1<-species_1_2[,c(3,1,2)]
And then write all the new, modified .csv files, naming them again from the list of short_names.
I'm sorry if the text is somewhat confusing.
Any help or suggestions would be great.
You are actually quite close, and using lapply() is a really good idea.
As you state, the issue is that it only takes one list as an argument, but you want to work with two. mapply() is a function in base R that you can feed multiple lists into and cycle through in parallel. lapply() and mapply() are both designed to create/manipulate objects in R, but here you want to write files and are not interested in the output within R. The purrr package has the walk*() functions, which are useful when you want to cycle through lists and are only interested in creating side effects (in your case, saving files).
purrr::walk2() takes two lists, so you can provide the data and the
file names at the same time.
library(purrr)
First I create some example data (I’m basically already using the same concept here as I will below):
test_data <- map(1:5, ~ data.frame(
a = sample(1:5, 3),
b = sample(1:5, 3),
c = sample(1:5, 3)
))
walk2(test_data,
paste0("species_data/", 1:5, "test.csv"),
~ write.csv(.x, .y))
Instead of getting the file paths and then stripping away the path
to get the file names, I just call list.files(), once with full.names = TRUE and once with full.names = FALSE.
NDOP_filepaths <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = TRUE,
recursive = FALSE
)
NDOP_filenames <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = FALSE,
recursive = FALSE
)
Now I feed the two lists into purrr::walk2(). Using the ~ before the curly brackets, I can define the anonymous function a bit more elegantly and then use .x and .y to refer to the entries of the first and the second list.
walk2(NDOP_filepaths,
NDOP_filenames,
~ {
species <- read.csv(.x)
species <- species[, 1:2]
species$species <- gsub(".csv", "", .y)
write.csv(species, .x)
})
Learn more about purrr at purrr.tidyverse.org.
Alternatively, you could just extract the file name in the loop and stick to lapply() or use purrr::map()/purrr::walk(), like this:
lapply(NDOP_filepaths,
function(x) {
species <- read.csv(x)
species <- species[, 1:2]
species$species <- gsub("species_data/|.csv", "", x)
write.csv(species, gsub("species_data/", "", x))
})
NDOP_files <- list.files(path="species_data", pattern="*.csv$",
full.names=TRUE, recursive=FALSE)
# Get name of each file (without the extension)
# basename() removes all of the path up to and including the last path separator
# file_path_sans_ext() removes the .csv extension
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))
Then, I would write a function that takes in 1 csv file, does some manipulation to the file, and outputs a data frame. Since you have a list of csv files from using list.files, you can use the map function in the purrr package to apply your function to each csv file.
doSomething <- function(NDOP_file){
# your code here to manipulate NDOP_file to your liking
return(NDOP_file)
}
NDOP_files <- map(NDOP_files, ~doSomething(.x))
Lastly, you can manipulate the file names when you write the new csv files using csvFileNames and a custom function you write to change the file name to your liking. Essentially, use the same architecture of defining your custom function and using map to apply to each of your files.
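For that write step, a minimal hedged sketch using purrr::walk2() (the "output/" folder and "_edited" suffix are placeholders, not part of the original question):
library(purrr)
# NDOP_files now holds the manipulated dataframes, csvFileNames the bare file names
walk2(NDOP_files, csvFileNames,
~ write.csv(.x, file = paste0("output/", .y, "_edited.csv"), row.names = FALSE))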

How to merge many databases in R?

I have this huge database from a telescope at the institute where I am currently working. The telescope saves each day to a file; it records values for each of its 8 channels every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for one single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file, and .sn1, .sn2, .sn3... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 formats of databases: one has a sort of header that uses the first 5 lines of the file, the other one doesn't.
Every month has its own folder containing its days, and these folders are saved under the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than what I've used before and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database so I can get a better sample from it.
I have an Excel extension that would list all the file locations in a specific folder, can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focussing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using notepad or something similar, it should work for you too. Need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or have different extensions (appears to be the case). Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
dat <- fread(fname, sep = " ", skip = 5)
dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
dat # return the data.table; without this the function would return fname invisibly
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
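For instance, a hedged sketch of that wrapper (read_one_dir() is a name I made up; it reuses read_func from above):
# read and bind every ".sn1" file found in one directory
read_one_dir <- function(dir_path){
file_names <- paste(dir_path, dir(path = dir_path, pattern = ".sn1"), sep = "/")
rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
}
# one dataset per directory listed in paths.csv
data_by_dir <- lapply(all_directories$paths, read_one_dir)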
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
ptrns <- paste(sapply(1:5, function(q) paste(".sn",q,sep = "")), collapse = "|")
inter <- dir(z, pattern = ptrns)
return(paste(z,inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)

Run separate functions on multiple elements of list based on regex criteria in data.frame

The following works, but I'm missing a functional programming technique, indexing, or a better way of structuring my data. After a month away, it will take a while to remember exactly how this works, rather than it being easy to maintain. It seems like a workaround when it shouldn't be. I want to use regex to decide which function to use for the expected groups of files. When a new file format comes along, I can write the read function, then add the function along with its regex to the data.frame to run it alongside all the rest.
I have different formats of Excel and csv files that need to be read in and standardized. I want to maintain a list or data.frame of the file name regex and the appropriate function to use. Sometimes there will be new file formats that won't be matched, and old formats without new files. But then it gets complicated, which is something I would prefer to avoid.
# files to read in based on filename
fileexamples <- data.frame(
filename = c('notanyregex.xlsx','regex1today.xlsx','regex2today.xlsx','nomatch.xlsx','regex1yesterday.xlsx','regex2yesterday.xlsx','regex3yesterday.xlsx'),
readfunctionname = NA
)
# regex and corresponding read function
filesourcelist <- read.table(header = T,stringsAsFactors = F,text = "
greptext readfunction
'.*regex1.*' 'readsheettype1'
'.*nonematchthis.*' 'readsheetwrench'
'.*regex2.*' 'readsheettype2'
'.*regex3.*' 'readsheettype3'
")
# list of grepped files
fileindex <- lapply(filesourcelist$greptext,function(greptext,files){
grepmatches <- grep(pattern = greptext,x = data.frame(files)[,1],ignore.case = T)
},files = fileexamples$filename)
# run function on files based on fileindex from grep
for(i in 1:length(fileindex)){
fileexamples[fileindex[[i]],'readfunctionname'] <- filesourcelist$readfunction[i]
}
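Once readfunctionname is filled in, one hedged way to actually dispatch on it (assuming readsheettype1() etc. exist as functions in your session):
# call the matched reader for each file that got one; unmatched files (NA) are skipped
matched <- !is.na(fileexamples$readfunctionname)
results <- Map(function(fun, file) do.call(fun, list(file)),
fileexamples$readfunctionname[matched],
fileexamples$filename[matched])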

Creating a new file with both a subset of data and file names from a group of .csv files

My issue is likely with how I'm exporting the data from the for loop, but I'm not sure how to fix it.
I've got over 200 files in a folder, all structured in the same way, from which I'd like to pull the maximum number from a single column. I've made a for loop to do this based off of code from here http://www.r-bloggers.com/looping-through-files/
What I have running so far looks like this:
fileNames<-Sys.glob("*.csv")
for(i in 1:length(fileNames)){
data<-read.csv(fileNames[i])
VelM = max(data[,8],na.rm=TRUE)
write.table(VelM, "Summary", append=TRUE, sep=",",
row.names=FALSE,col.names=FALSE)
}
This works, but I need to figure out a way to have a second column in my summary file that contains the original file name the data in that row came from for reference.
I tried making both a matrix and a data frame instead of going straight to the table writing, but in both cases I wasn't able to append the data and ended up with values from only the last file.
Any ideas would be greatly appreciated!
Here's what I would recommend to improve your current method, also going with fread() because it's very fast and has the select argument. Notice I have moved the write.table() call outside the for() loop. This allows a cleaner way of adding the new column of file names alongside the max column, and eliminates the need to append to the file on every iteration.
library(data.table)
fileNames <- Sys.glob("*.csv")
VelM <- numeric(length(fileNames))
for(i in seq_along(fileNames)) {
VelM[i] <- max(fread(fileNames[i], select = 8)[[1L]], na.rm = TRUE)
}
write.table(data.frame(VelM, fileNames), "Summary", sep = ",",
row.names = FALSE, col.names = FALSE)
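If you would rather have a self-describing summary, a small hedged variant with a header row (the column names here are made up):
write.table(data.frame(max_velocity = VelM, file = fileNames), "Summary",
sep = ",", row.names = FALSE, col.names = TRUE)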
If you want to quickly read files, you should consider using data.table::fread or readr::read_csv instead of base read.csv.
For example:
fileNames <- list.files(path = your_path, pattern='\\.csv') # instead of Sys.glob
library('data.table')
dt <- rbindlist(sapply(fileNames, fread, select = 8, simplify = FALSE), idcol = "id")
dt_max <- dt[, .(max_val = max(your_var, na.rm = TRUE)), by = id]
write.table(dt_max, 'yourfile.csv', sep = ',', row.names = FALSE, col.names = FALSE)
Explanation: data.table::fread reads in only the select = 8th column from each file (via sapply over fileNames, which returns a named list of data.tables). Then data.table::rbindlist combines this list of data.tables (of one column each) into a single data.table, and its idcol argument adds a column identifying which file each row came from. From ?rbindlist, note that
If input is a named list, ids are generated using them
Because sapply(..., simplify = FALSE) returns a named list with each name being an element of fileNames, this is an easy way of carrying the file names along for grouping.
The rest is data.table syntax. It wasn't clear from your question if there is a header row and whether you know the heading in advance. If so, you can either keep header=TRUE and use the header name for your_var, or you can do skip=1, header=FALSE, col.names = 'your_var'.
