Import Multiple CSV Files as Data Frames in R - r

I want to import multiple csv files as data frames. I tried the code below, but the elements of my list are still of class character. Thanks for your help!
new_seg <-(list.files (path=csv, pattern="^new.*?\\.csv",recursive = T))
for (i in 1:length(new_seg))
assign(new_seg[i], data.frame(read.csv(new_seg[i])))
new_seg
[1] "new_ Seg_grow_1mm.csv" "new_ Seg_grow_3mm.csv" "new_ Seg_resample.csv"
class('new_ Seg_grow_1mm.csv')
[1] "character"

You need to use full.names = T in the list.files function so that the returned names include the file path. Then I typically use lapply to read the files in. Also, in my code below I use pattern = "\\.csv" because that's what I needed for this to work with my files.
csv <- getwd()
new_seg <- (list.files(path=csv, pattern="\\.csv", recursive = T, full.names = T))
new_seg_dfs <- lapply(new_seg, read.csv)
Now, new_seg_dfs is a list of data frames.
P.S. It seems you may have set your working directory beforehand, since your files are showing up, but it's always good practice to show every step you took in these examples.
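If you also want each data frame reachable by its file name, a small optional follow-up (the names are just the file names with path and extension stripped; the file name shown is taken from the question):
names(new_seg_dfs) <- tools::file_path_sans_ext(basename(new_seg))
new_seg_dfs[["new_ Seg_grow_1mm"]]  # one of the data frames from the question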

Related

Read data from data files into R data frames

I would like to read data from several files into separated data frames. The files are in a different folder than the script.
I have used list with filenames.
users_list <- list.files(path = "Data_Eye/Jazz/data/first",
pattern = "*.cal", full.names = F)
I tried to use the functions map and read_delim but without success. It is important for me to read each file into a different data frame; it would be best to have a list of data frames.
You can do something like this, though I don't have any .cal files to test it on. So, it's possible you might need a different function for reading in those files.
library(devtools)
devtools::install_github("KirtOnthank/OTools")
library(OTools)
# Give full path to your files (if different than working directory).
temp = list.files(path="../Data/original_data", pattern="*.cal", full.names = TRUE)
# Then, apply the read.cal function to the list of files.
myfiles = lapply(temp, OTools::read.cal)
# Then, set the name of each list element (each dataframe) to its respective file name.
names(myfiles) <- gsub(".cal","",
list.files("../Data/original_data",full.names = FALSE),
fixed = TRUE)
# Now, put all of those individual dataframes from your list into the global environment as separate dataframes.
list2env(myfiles,envir=.GlobalEnv)
Alternatively, just use lapply() (base R) with readr's read_delim() to generate a list of data frames; make sure users_list was built with full.names = TRUE, since the files are in a different folder than the script:
list_of_dfs <- lapply(users_list, read_delim)
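Since the question mentions map and read_delim, here is a minimal sketch of that route as well; it assumes the .cal files are plain delimited text (adjust delim to whatever the files actually use):
library(purrr)
library(readr)
users_list <- list.files(path = "Data_Eye/Jazz/data/first",
                         pattern = "\\.cal$", full.names = TRUE)
list_of_dfs <- map(users_list, ~ read_delim(.x, delim = "\t"))
names(list_of_dfs) <- tools::file_path_sans_ext(basename(users_list))  # one named data frame per file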

Load multiple .csv files in R, edit them and save as new .csv files named by a list of character strings

I am pretty new to R and programming, so I apologise if this question has been asked elsewhere.
I'm trying to load multiple .csv files, edit them and save them again, but I cannot figure out how to manage more than one .csv file or how to name the new files based on a list of character strings.
So I have a .csv file and can do:
species_name <- 'ace_neg'
species <- read.csv('species_data/ace_neg.csv')
species_1_2 <- species[, 1:2]
species_1_2$species <- species_name
species_3_2_1 <- species_1_2[, c(3, 1, 2)]
write.csv(species_3_2_1, file = 'ace_neg.csv', row.names = FALSE)
But I would like to run this code for all .csv files in the folder and add text to a new column based on .csv file name.
So I can load all .csv files and make a list of character strings for use as a new column text and as new file names.
NDOP_files <- list.files(path="species_data", pattern="*.csv$", full.names=TRUE, recursive=FALSE)
short_names<- substr(NDOP_files, 14,20)
Then I tried:
lapply(NDOP_files, function(x){
  species <- read.csv(x)
  species_1_2 <- species[, 1:2]
  species_1_2$species <- 'name' # don't know how to insert the first character string of short_names instead of 'name', then the second string of short_names for the second .csv file, etc.
Then I would continue in the code to change the order of the columns:
  species_3_2_1 <- species_1_2[, c(3, 1, 2)]
And then write all the new, modified .csv files and name them again by the list of short_names.
I'm sorry if the text is somewhat confusing.
Any help or suggestions would be great.
You are actually quite close, and using lapply() is a really good idea.
As you state, the issue is that it only takes one list as an argument, but you want to work with two. mapply() is a base R function that you can feed multiple lists into and cycle through in sync. However, lapply() and mapply() are both designed to create/manipulate objects in R, whereas you want to write files and are not interested in the output within R. The purrr package has the walk*() functions, which are useful when you want to cycle through lists and are only interested in creating side effects (in your case, saving files).
purrr::walk2() takes two lists, so you can provide the data and the file names at the same time.
library(purrr)
First I create some example data (I’m basically already using the same concept here as I will below):
test_data <- map(1:5, ~ data.frame(
a = sample(1:5, 3),
b = sample(1:5, 3),
c = sample(1:5, 3)
))
walk2(test_data,
paste0("species_data/", 1:5, "test.csv"),
~ write.csv(.x, .y))
Instead of getting the file paths and then stripping away the path
to get the file names, I just call list.files(), once with full.names = TRUE and once with full.names = FALSE.
NDOP_filepaths <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = TRUE,
recursive = FALSE
)
NDOP_filenames <-
list.files(
path = "species_data",
pattern = "*.csv$",
full.names = FALSE,
recursive = FALSE
)
Now I feed the two lists into purrr::walk2(). Using the ~ before the curly brackets I can define the anonymous function a bit more elegantly and then use .x and .y to refer to the entries of the first and the second list.
walk2(NDOP_filepaths,
NDOP_filenames,
~ {
species <- read.csv(.x)
species <- species[, 1:2]
species$species <- gsub(".csv", "", .y)
write.csv(species, .x)
})
Learn more about purrr at purrr.tidyverse.org.
Alternatively, you could just extract the file name in the loop and stick to lapply() or use purrr::map()/purrr::walk(), like this:
lapply(NDOP_filepaths,
function(x) {
species <- read.csv(x)
species <- species[, 1:2]
species$species <- gsub("species///|.csv", "", x)
write.csv(species, gsub("species///", "", x))
})
NDOP_files <- list.files(path="species_data", pattern="*.csv$",
full.names=TRUE, recursive=FALSE)
# Get the name of each file (without the extension)
# basename() removes all of the path up to and including the last path separator
# file_path_sans_ext() removes the .csv extension
csvFileNames <- tools::file_path_sans_ext(basename(NDOP_files))
Then, I would write a function that takes in one csv file, does some manipulation, and outputs a data frame. Since you have a list of csv files from list.files(), you can use the map() function in the purrr package to apply your function to each csv file.
library(purrr)
doSomething <- function(NDOP_file){
  # your code here to read and manipulate NDOP_file to your liking,
  # returning a data frame
  return(NDOP_file)
}
NDOP_files <- map(NDOP_files, ~doSomething(.x))
Lastly, you can manipulate the file names when you write the new csv files using csvFileNames and a custom function you write to change the file name to your liking. Essentially, use the same architecture of defining your custom function and using map to apply to each of your files.
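As a minimal sketch of that last step, assuming doSomething() has already added the species column and reordered the columns (the output folder "species_data_new" is hypothetical):
library(purrr)
dir.create("species_data_new", showWarnings = FALSE)  # hypothetical output folder
walk2(NDOP_files, csvFileNames, function(df, nm) {
  # write each processed data frame out, named after its original (extension-free) file name
  write.csv(df, file.path("species_data_new", paste0(nm, ".csv")), row.names = FALSE)
})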

Read files matching subdirectory patterns in R

I've used a lot of posts to get me this far (such as here R list files with multiple conditions and here How can I read multiple files from multiple directories into R for processing?), but I can't accomplish what I need in R.
I have many .csv files distributed in multiple subdirectories that I want to read in and then save as separate objects named after the corresponding basename. The end result will be to rbind each of those files together. Here's a sample dir structure and some of what I've tried:
./DATA/Cat_Animal/animal1.csv
./DATA/Dog_Animal/animal2.csv
./DATA/Dog_Animal/animal3.csv
./DATA/Dog_Animal/animal3.1.csv
#read in all csv files
files <- list.files(path="./DATA", pattern="*.csv", full.names=TRUE, recursive=TRUE)
But this results in all files in all subdirectories. I want to match specific files (animalsX.csv) in specific subdirectories matching the pattern (X_Animal) such as this:
files <- dir(path=paste0("./DATA/", pattern="*+_Animal"), recursive=TRUE, full.names=TRUE, pattern="animal+.*csv")
Once I get my list of files, I want to read each of them in and save each to the corresponding file's basename. So the file named animal1.csv
would be saved to animal1. I think I need to use the function basename() somewhere in a loop but not sure how.
Help would be very much appreciated; I've spent a lot of time trying out various options with little progress.
This question is really two questions; consider splitting them up. On the last part of your question, how to rbind a list full of data.frames together, try:
finalDf = do.call(rbind, result)
You'll likely need to use str_split() from the stringr package to extract the parts of the file path you need. You could also use str_extract() with regular expressions.
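For example, a small sketch of that extraction, using the paths from the question (str_extract() pulls out the *_Animal folder and basename() gives the file name to save each object under):
library(stringr)
files <- c("./DATA/Cat_Animal/animal1.csv", "./DATA/Dog_Animal/animal2.csv")
str_extract(files, "[A-Za-z]+_Animal")        # "Cat_Animal" "Dog_Animal"
tools::file_path_sans_ext(basename(files))    # "animal1"    "animal2"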
I think I found a work-around for the short term because luckily I only have a few subdirectories currently.
myFiles1 <- list.files(path = "./DATA/Cat_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
df <- read.csv(file = paste0("./DATA/Cat_Animal/", f ))
}
result1 <- lapply(myFiles1, processFile)  # lapply (rather than sapply) keeps a plain list of data frames
#then do it again for the next subdir:
myFiles2 <- list.files(path = "./DATA/Dog_Animal/", pattern="animal+.*csv")
processFile <- function(f) {
df <- read.csv(file = paste0("./DATA/Dog_Animal/", f ))
}
result2 <- lapply(myFiles2, processFile)
finalDf <- do.call(rbind, c(result1, result2))
I know there is a better way but can't figure out the pattern matching for the subdirectories! It's so easy in unix for example
You can simply do it in two steps.
a <- list.files(path="./DATA", pattern="*_Animal", full.names=T, recursive=F)
a
#[1] "./DATA/Cat_Animal" "./DATA/Dog_Animal"
files <- list.files(path=a, pattern="*animal*", full.names=T)
files
#[1] "./DATA/Cat_Animal/animal1.txt" "./DATA/Dog_Animal/animal2.txt" #"./DATA/Dog_Animal/animal3.txt"
#[4] "./DATA/Dog_Animal/animal4.txt"
In the first step, please make sure to use full.names = T and recursive = F. You need full.names = T to get the file path, not just the file name; otherwise you might lose the path to animal*.csv in the second step. And recursive = T would return nothing, since Dog_Animal and Cat_Animal are folders, not files.
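If you would rather do it in a single pass, a hedged alternative is to list everything recursively and then keep only the paths whose parent folder matches the *_Animal pattern (base R only, using the directory layout from the question):
all_csv <- list.files("./DATA", pattern = "animal.*\\.csv$", recursive = TRUE, full.names = TRUE)
files <- all_csv[grepl("_Animal$", basename(dirname(all_csv)))]  # keep only files inside *_Animal folders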

Using R to merge many large CSV files across sub-directories

I have over 300 large CSV files with the same filename, each in a separate sub-directory, that I would like to merge into a single dataset using R. I'm asking for help on how to remove columns I don't need in each CSV file, while merging in a way that breaks the process down into smaller chunks that my memory can more easily handle.
My objective is to create a single CSV file that I can then import into STATA for further analysis using code I have already written and tested on one of these files.
Each of my CSVs is itself rather large (about 80 columns, many of which are unnecessary, and each file has tens to hundreds of thousands of rows), and there are almost 16 million observations in total, or roughly 12GB.
I have written some code which manages to do this successfully for a test case of two CSVs. The challenge is that neither my work nor my personal computers have enough memory to do this for all 300+ files.
The code I have tried is here:
library(here) ##loads package used to locate files
( allfiles = list.files(path = here("data"), ##creates a list of the files, read as [1], [2], ... [n]
pattern = "candidates.csv", ##[identifies the relevant files]
full.names = TRUE, ##identifies the full file name
recursive = TRUE) ) ##searches in sub-directories
read_fun = function(path) {
test = read.csv(path,
header = TRUE )
test
} ###reads all the files
(test = read.csv(allfiles[1],
header = TRUE ) )###tests that file [1] has been read
library(purrr) ###loads package to unlock map_dfr
library(dplyr) ###loads package to unlock map_dfr
( combined_dat = map_dfr(allfiles, read_fun) )
I expect the result to be a single RDS file, and this works for the test case. Unfortunately, the amount of memory this process requires when looking at 15.5m observations across all my files causes RStudio to crash, and no RDS file is produced.
I am looking for help on how to 1) reduce the load on my memory by stripping out some of the variables in my CSV files I don't need (columns with headers junk1, junk2, etc); and 2) how to merge in a more manageable way that merges my CSV files in sequence, either into a few RDS files to themselves be merged later, or through a loop cumulatively into a single RDS file.
However, I don't know how to proceed with these - I am still new to R, and any help on how to proceed with both 1) and 2) would be much appreciated.
Thanks,
Twelve GB is quite a bit for one object. It's probably not practical to use a single RDS or CSV unless you have far more than 12GB of RAM. You might want to look into using a database, a technology that is made for this kind of thing. I'm sure Stata can also interact with databases. You might also want to read up on how to interact with large CSVs using various strategies and packages.
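As a hedged sketch of that database route (the SQLite file name and the table name "candidates" are assumptions), each CSV can be appended to an on-disk table so the full 12GB never has to sit in RAM at once:
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "candidates.sqlite")
for (path in allfiles) {                    # allfiles as built in the question above
  df <- read.csv(path, header = TRUE)
  df <- df[, !grepl("^junk", names(df))]    # drop the unneeded junk* columns
  dbWriteTable(con, "candidates", df, append = TRUE)
}
dbDisconnect(con)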
Creating a large CSV isn't at all difficult. Just remember that you have to work with said giant CSV sometime in the future, which probably will be difficult. To create a large CSV, just process each component CSV individually and then append them to your new CSV. The following reads in each CSV, removes unwanted columns, and then appends the resulting dataframe to a flat file:
library(dplyr)
library(readr)
library(purrr)
load_select_append <- function(path) {
# Read in CSV. Let every column be of class character.
df <- read_csv(path, col_types = paste(rep("c", 82), collapse = ""))
# Remove variables beginning with "junk"
df <- select(df, -starts_with("junk"))
# If file exists, append to it without column names, otherwise create with
# column names.
if (file.exists("big_csv.csv")) {
write_csv(df, "big_csv.csv", col_names = F, append = T)
} else {
write_csv(df, "big_csv.csv", col_names = T)
}
}
# Get the paths to the CSVs.
csv_paths <- list.files(path = "dir_of_csvs",
pattern = "\\.csv.*",
recursive = T,
full.names = T
)
# Apply function to each path.
walk(csv_paths, load_select_append)
When you're ready to work with your CSV you might want to consider using something like the ff package, which enables interaction with on-disk objects. You are somewhat restricted in what you can do with an ffdf object, so eventually you'll have to work with samples:
library(ff)
df_ff <- read.csv.ffdf(file = "big_csv.csv")
df_samp <- df_ff[sample.int(nrow(df_ff), size = 100000),]
df_samp <- mutate(df_samp, ID = factor(ID))
summary(df_samp)
#### OUTPUT ####
values ID
Min. :-9.861 17267 : 6
1st Qu.: 6.643 19618 : 6
Median :10.032 40258 : 6
Mean :10.031 46804 : 6
3rd Qu.:13.388 51269 : 6
Max. :30.465 52089 : 6
(Other):99964
As far as I know, chunking and on-disk interactions are not possible with RDS or RDA, so you are stuck with flat files (or you go with one of the other options I mentioned above).
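If memory is still the bottleneck, a hedged alternative sketch: data.table::fread() can drop the unwanted columns at read time via its drop argument, so only the columns you keep ever enter memory. The junk* names come from the question, and the combined table still has to fit in RAM, so the flat-file or database approach above may remain necessary for the full 12GB:
library(data.table)
# Read just the header of one file to find the junk* columns to drop.
junk_cols <- grep("^junk", names(fread(csv_paths[1], nrows = 0)), value = TRUE)
# Read every CSV without the junk columns and stack them.
combined <- rbindlist(lapply(csv_paths, fread, drop = junk_cols))
fwrite(combined, "big_csv.csv")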

Using rbind() to combine multiple data frames into one larger data.frame within lapply()

I'm using R-Studio 0.99.491 and R version 3.2.3 (2015-12-10). I'm a relative newbie to R, and I'd appreciate some help. I'm doing a project where I'm trying to use the server logs on an old media server to identify which folders/files within the server are still being accessed and which aren't, so that my team knows which files to migrate. Each log is for a 24 hour period, and I have approximately a year's worth of logs, so in theory, I should be able to see all of the access over the past year.
My ideal output is to get a tree structure or plot that will show me the folders on our server that are being used. I've figured out how to read one log (one day) into R as a data.frame and then use the data.tree package in R to turn that into a tree. Now, I want to recursively go through all of the files in the directory, one by one, and add them to that original data.frame, before I create the tree. Here's my current code:
#Create the list of log files in the folder
files <- list.files(pattern = "*.log", full.names = TRUE, recursive = FALSE)
#Create a new data.frame to hold the aggregated log data
uridata <- data.frame()
#My function to go through each file, one by one, and add it to the 'uridata' df, above
lapply(files, function(x){
uriraw <- read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
#print(nrow(uriraw))
uridata <- rbind(uridata, uriraw)
#print(nrow(uridata))
})
The problem is that, no matter what I try, the value of 'uridata' within the lapply loop seems to not be saved/passed outside of the lapply loop, but is somehow being overwritten each time the loop runs. So instead of getting one big data.frame, I just get the contents of the last 'uriraw' file. (That's why there are those two commented print commands inside the loop; I was testing how many lines there were in the data frames each time the loop ran.)
Can anyone clarify what I'm doing wrong? Again, I'd like one big data.frame at the end that combines the contents of each of the (currently seven) log files in the folder.
do.call() is your friend.
big.list.of.data.frames <- lapply(files, function(x){
read.table(x, skip = 3, header = TRUE, stringsAsFactors = FALSE)
})
or more concisely (but less-tinkerable):
big.list.of.data.frames <- lapply(files, read.table,
skip = 3,header = TRUE,
stringsAsFactors = FALSE)
Then:
big.data.frame <- do.call(rbind,big.list.of.data.frames)
This is a recommended way to do things, because "growing" a data frame dynamically in R is painful: slow and memory-expensive, since a new frame gets built at each iteration.
You can use map_df from the purrr package instead of lapply to get all the results combined directly into one data frame.
map_df(files, read.table, skip = 3, header = TRUE, stringsAsFactors = FALSE)
Another option is fread from data.table:
library(data.table)
rbindlist(lapply(files, fread, skip=3))
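If you also want to keep track of which log each row came from (useful when the logs cover different days), a small sketch: name the list elements by file and let rbindlist() add an id column (the column name "log_file" is arbitrary).
library(data.table)
logs <- lapply(files, fread, skip = 3)
names(logs) <- basename(files)
uridata <- rbindlist(logs, idcol = "log_file")  # adds a column holding each row's source log name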
