How can I use do.call() to read specific rows?

I'm using do.call() to read a list of CSV files so that all data points end up in one dataset. I have been using the following:
files = list.files(path = "G:/SafeGraph201708MidWest",
pattern = "*.csv",
recursive = TRUE,
full.names = TRUE)
library(data.table)
DT = do.call(rbind, lapply(files, fread))
Instead of reading all the rows in each file, I want to read only specific rows, namely the ones within this range:
Data <- filter(DT, longitude >= -86.97 & longitude <= -86.78,
latitude >= 40.35 & latitude <= 40.49)
Is there a way I can do that using do.call()? Thank you!

There are several strategies to tackle this. You can import all the data into a list using lapply() and then apply your filter to each list element, using data.table::rbindlist() to make the final data.table. Another option is to do it in one step, e.g. (not tested, obviously):
library(data.table)
library(dplyr)  # for filter()

files = list.files(path = "G:/SafeGraph201708MidWest",
                   pattern = "*.csv",
                   recursive = TRUE,
                   full.names = TRUE)

xy <- lapply(files, FUN = function(x) {
  out <- fread(x)
  out <- filter(out, longitude >= -86.97 & longitude <= -86.78,
                latitude >= 40.35 & latitude <= 40.49)
  out
})
xy <- rbindlist(xy)
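If you prefer to stay entirely within data.table (and skip dplyr), the same filter can be written with data.table's bracket syntax; a sketch under the same column-name assumptions as above:

xy <- lapply(files, function(f) {
  out <- fread(f)
  # keep only rows inside the longitude/latitude window
  out[longitude >= -86.97 & longitude <= -86.78 &
        latitude >= 40.35 & latitude <= 40.49]
})
xy <- rbindlist(xy)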

Assuming you use a Windows PC and have at least Microsoft Office 2007+ installed, consider directly querying the CSV files with the JET/ACE SQL engine (.dll files), which is the very engine behind MS Access.
Below are two connection strings, one using Access and one using Excel. Either version works; the referenced files do need to exist, but they are never used except to connect to ACE. Once connected, CSV files are queried from the same or a different path.
library(odbc)
# VERIFY AVAILABLE DSNs AND DRIVERS
odbcListDataSources()
# DSN VERSIONS
conn <- dbConnect(odbc::odbc(), DSN ="MS Access Database;DBQ=C:\\Path\\To\\Access.accdb;");
conn <- dbConnect(odbc::odbc(), DSN ="Excel Files;DBQ=C:\\Path\\To\\Excel.xlsx;");
# DRIVER VERSIONS
conn <- dbConnect(odbc::odbc(),
.connection_string = "Driver={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\\Path\\To\\Access.accdb;");
conn <- dbConnect(odbc::odbc(),
.connection_string ="Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};DBQ=C:\\Path\\To\\Excel.xlsx;");
# CSV QUERY
df <- dbGetQuery(conn, "SELECT t.*
FROM [text;database=C:\\Path\\To\\CSV_Folder].Name_of_File.csv AS t
WHERE t.longitude BETWEEN -86.97 AND -86.78
AND t.latitude BETWEEN 40.35 AND 40.49;")
head(df)
dbDisconnect(conn)
And in a loop:
files = list.files(path = "G:/SafeGraph201708MidWest",
pattern = "*.csv",
recursive = TRUE,
full.names = TRUE)
df_list <- lapply(files, function(f) {
  # each file may sit in a different subfolder (recursive = TRUE), so build the
  # text-driver database path from the file's own directory and query by file name only
  folder <- normalizePath(dirname(f), winslash = "\\")
  dbGetQuery(conn,
             paste0("SELECT t.* ",
                    " FROM [text;database=", folder, "].", basename(f), " AS t ",
                    " WHERE t.longitude BETWEEN -86.97 AND -86.78",
                    " AND t.latitude BETWEEN 40.35 AND 40.49;"))
})
final_dt <- rbindlist(df_list)
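When the loop is done, release the ODBC handle just as in the single-file example:

dbDisconnect(conn)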

You can use the ability of data.table::fread() to execute a command and read the results.
I assume you are using Windows, so you have access to the findstr function in your command prompt.
So, if you can build a regex that 'hits' on the lines you want to extract, you can filter the wanted lines before reading the entire file into R. This is (potentially) a huge memory saver on larger files and may speed up your workflow considerably.
sample data
Let's say coords.csv looks like this:
id,latitude,longitude
1,10,11
2,11,12
3,12,13
4,13,14
5,14,15
In this example, you want to extract the lines with latitudes between 12 and 14 AND longitudes between 11 and 13.
code
#build a list of files (I created only one)
#make sure you use the full path (not relative)
x <- list.files( path = "C:/folder", pattern = "coords.csv", full.names = TRUE )

#build a regex that only hits on rows with:
# latitude 12-14
# longitude 11-13
pattern = "^[0-9],1[2-4],1[1-3]$"

#read the file(s), extract the lines that match the regex pattern
#and bind the result to a data.table
rbindlist( lapply( x, function(x) {
  fread( cmd = paste0( "findstr /R ", pattern, " ", x ), header = FALSE )
} ) )
output
V1 V2 V3
1: 3 12 13
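On non-Windows systems the same trick works with grep instead of findstr (a sketch, assuming grep is on the PATH and the paths in x point at your files; -E enables the extended regex syntax used above):

rbindlist( lapply( x, function(f) {
  # let grep pre-filter the rows before fread parses them
  fread( cmd = paste0( "grep -E '", pattern, "' ", f ), header = FALSE )
} ) )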

Related

Read nested folder and file name, export to Excel file

So I am tasked with building an excel spreadsheet cataloging a drive with various nested folders and files.
This SO post gets me somewhat there, but I am confused about how to get my desired output.
I know that there might be a command to get file info and I can break that into these columns.
Apart from splitting the directories into subdirectories, an adaptation of the function from the question's link (Stibu's answer) might be of help.
rfl <- function(path) {
folders <- list.dirs(path, recursive = FALSE, full.names = FALSE)
if (length(folders)==0) {
files <- list.files(path, full.names = TRUE)
finfo <- file.info(files)
Filename <- basename(files)
FileType <- tools::file_ext(files)
DateModified <- finfo$mtime
FullFilePath <- dirname(files)
size <- finfo$size
data.frame(Filename, FileType, DateModified, FullFilePath, size)
} else {
sublist <- lapply(paste0(path,"/",folders),rfl)
setNames(sublist,folders)
}
}
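Note that rfl() returns a data frame for a folder with no subfolders but a nested list otherwise, so you may want to flatten the result before writing it out. A small sketch (flatten_rfl is a helper name made up here, and the path is just an example from the question):

flatten_rfl <- function(x) {
  if (is.data.frame(x)) return(list(x))
  do.call(c, lapply(x, flatten_rfl))   # recursively collect the leaf data frames
}
catalog <- do.call(rbind, flatten_rfl(rfl("I:/Administration")))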
If you have the full path and file names then you can loop through that and parse it into these columns. You can get more file info with file.info:
files <- c("I:/Administration/Budget/2015-BUDGET DOCUMENT.xlsx",
"I:/Administration/Budget/2014-2015 Budget/BUDGET DOCUMENT.xlsx")
# files <- list.files("I:", recursive = T, full.names = T) # this could take a while to run
file_info <- vector("list", length(files))  # preallocate one slot per file
for (i in seq_along(files)){
fullpath <- dirname(files[i])
fullname <- basename(files[i])
file_ext <- unlist(strsplit(fullname, ".", fixed = T))
file_meta <- file.info(files[i])[c("size", "mtime")]
path <- unlist(strsplit(fullpath, "/", fixed = T))[-1]
file_info[[i]] <- unlist(c(file_ext, file_meta, fullpath, path))
}
l <- lapply(file_info, `length<-`, max(lengths(file_info)))
df <- data.frame(do.call(rbind, l))
names(df) <- c("filename", "extension", "size", "modified", paste0("sub", 1:(ncol(df) - 4)))
rownames(df) <- NULL
df$modified <- as.POSIXct.numeric(as.numeric(df$modified), origin = "1970-01-01")
df$size <- as.numeric(df$size)
If you do not have the files you can recursively search the drive using list.files() with recursive = T: list.files("I:", recursive = T, full.names = T)
Note:
l <- lapply(file_info, `length<-`, max(lengths(file_info))) sets the vector length of each list element to be the same. This is necessary because otherwise when the vectors are stacked with unequal lengths values get recycled. A simple example of this is: rbind(1:3, 1:5)
The output of unlist(c(file_ext, file_meta, fullpath, path)) is a vector and vectors in R are atomic, meaning all elements have to be the same class. That means everything gets converted to character in this case, which is why we have the lines df$modified <- ... and df$size <- ... at the end to convert them to their appropriate type.
If you want to output this data frame to excel check out xlsx::write.xlsx or openxlsx::write.xlsx. If you don't have those libraries installed you'll need to use install.packages() first.
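For example, a minimal call (the output file name here is just a placeholder):

openxlsx::write.xlsx(df, "file_catalog.xlsx")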
Output
Because these files/locations don't actually exist on my computer there are NA values in the size and date modified fields:
filename extension size modified sub1 sub2 sub3 sub4
1 2015-BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget Administration Budget <NA>
2 BUDGET DOCUMENT xlsx NA <NA> I:/Administration/Budget/2014-2015 Budget Administration Budget 2014-2015 Budget

Import multiple csv files into postgresql database using r (memory error)

I am trying to import a dataset (with many csv files) into r and afterwards write the data into a table in a postgresql database.
I successfully connected to the database, created a loop to import the csv files and tried to import.
R then returns an error, because my pc runs out of memory.
My question is:
Is there a way to create a loop, which imports the files one after another, writes them into the postgresql table and deletes them afterwards?
That way I would not run out of memory.
Code which returns the memory error:
#load the packages used below
library(RPostgreSQL)  # dbDriver(), dbConnect(), dbWriteTable()
library(readr)        # read_csv()
library(dplyr)        # %>% and bind_rows()

#connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db <- 'localhost'
db_port <- '5432'
db_user <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host=host_db,
port=db_port, user=db_user, password=db_password)
#check if the connection was successful
dbExistsTable(con, "prices")
#create function to load multiple csv files
import_csvfiles <- function(path){
files <- list.files(path, pattern = "*.csv",recursive = TRUE, full.names = TRUE)
lapply(files,read_csv) %>% bind_rows() %>% as.data.frame()
}
#import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
Thanks in advance for the feedback!
If you change the lapply() to include an anonymous function, you can read each file and write it to the database, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.
import_csvfiles <- function(path){
files <- list.files(path, pattern = "*.csv",recursive = TRUE, full.names = TRUE)
lapply(files,function(x){
prices <- read.csv(x)
dbWriteTable(con, "prices", prices , append = TRUE, row.names = FALSE)
})
}
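A quick usage sketch (the path below is the question's own placeholder and con is the connection opened earlier); because each file is written inside lapply(), only one file's worth of data sits in memory at a time:

import_csvfiles("path...")
dbDisconnect(con)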
I assume the CSV files you are importing into your database are very large? As far as I know, with the code you have written R first wants to store all the data in a data frame, holding it in memory. The alternative is to read each CSV file in chunks, as you would with Python's pandas.
When calling ?read.csv I saw the following output:
nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.
skip : the number of lines of the data file to skip before beginning to read data.
Why don't you read 5,000 rows at a time into a data frame, write them to the PostgreSQL database, and then repeat this for each file?
For example, for each file do the following:
number_of_lines = 5000                        # number of data rows to read per chunk
row_skip = 0                                  # number of data rows already written
col_names <- names(read.csv(x, nrows = 1))    # read the header once and reuse it for every chunk
keep_reading = TRUE                           # flipped to FALSE to stop the while loop
while (keep_reading) {
  # the "1 +" skips the header line; header = FALSE keeps later chunks from
  # swallowing a data row as a header
  # (assumes the file does not end on an exact multiple of number_of_lines rows)
  my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                      header = FALSE, col.names = col_names)
  dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)  # write the chunk to the DB
  row_skip = row_skip + nrow(my_data)
  # exit once a chunk comes back shorter than requested, i.e. the end of the file was reached
  if (nrow(my_data) < number_of_lines) {
    keep_reading = FALSE
  } # end-if
} # end-while
By doing this you break the CSV up into smaller parts. You can play around with the number_of_lines variable to reduce the number of loop iterations. It may seem a bit hacky with a loop involved, but I'm sure it will work.
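To apply this to every file, one option is to wrap the chunked read in a helper and lapply() over the file list; a sketch (import_in_chunks is a made-up name, con is the connection from the question, and "path..." is the question's own placeholder):

import_in_chunks <- function(x, con, number_of_lines = 5000) {
  col_names <- names(read.csv(x, nrows = 1))       # read the header once and reuse it
  row_skip <- 0
  repeat {
    my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                        header = FALSE, col.names = col_names)
    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
    row_skip <- row_skip + nrow(my_data)
    if (nrow(my_data) < number_of_lines) break     # last (short) chunk reached
  }
}
files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
invisible(lapply(files, import_in_chunks, con = con))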

How do I run the same code for multiple .csv files with a for loop in R?

I have written code to filter, group, and sort my large data files. I have multiple text files to analyze. I know I can copy the code and run it with new data, but I was wondering whether there is a way to put this in a for loop that opens the text files one by one, runs the code, and stores the results. I use the following to load all my text files. In the next steps I select columns and filter them to find the desired values, but at the moment it only reads one file; I want to obtain results from all the data files.
Samples <- Sys.glob("*.csv")
for (filename in Samples) {
try <- read.csv(filename, sep = ",", header = FALSE)
shear <- data.frame(try[,5],try[,8],try[,12])
lane <- shear[which(shear$Load == "LL-1"),]
Ext <- subset(lane, Girder %in% c("Left Ext","Right Ext"))
Max.Ext <- max(Ext$Shear)
}
You can put everything that you want to apply to each file in a function:
apply_fun <- function(filename) {
  try <- read.csv(filename, sep = ",", header = FALSE)
  # name the three extracted columns so the filters below can refer to them
  # (adjust the column positions / names to match your files)
  shear <- data.frame(Load = try[, 5], Girder = try[, 8], Shear = try[, 12])
  lane <- shear[which(shear$Load == "LL-1"), ]
  Ext <- subset(lane, Girder %in% c("Left Ext", "Right Ext"))
  return(max(Ext$Shear, na.rm = TRUE))
}
Since it seems we want only one number (the max) from each file, we can use sapply() to apply the function to each file.
Samples <- Sys.glob("*.csv")
sapply(Samples, apply_fun)
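If you also want to keep each file name next to its value, the same call can feed a small results data frame (max_shear is just a column name chosen here):

results <- data.frame(file = Samples,
                      max_shear = sapply(Samples, apply_fun),
                      row.names = NULL)
results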

specify the path to the common folder with json files in R

To parse JSON, I can use this approach:
library("rjson")
json_file <- "https://api.coindesk.com/v1/bpi/currentprice/USD.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))
But what if I want to work with a set of JSON files located in
json_file <- "C:/myfolder/"
How do I parse all the JSON files in this folder into a data.frame? (There are 1000 files.)
A lot of missing info, but this will probably work.
I used pblapply() to get a nice progress bar (since you mention >1000 files).
I have never used the solution below for JSON files (no experience with JSON), but it works flawlessly on .csv and .xls files (of course with different read functions), so I expect it to work with JSON as well.
library(data.table)
library(pbapply)
library(rjson)
folderpath <- "C:\\myfolder\\"
filefilter <- "*.json$"
#set parameters as needed
f <- list.files( path = folderpath,
pattern = filefilter,
full.names = TRUE,
recursive = FALSE )
#read all files to a list
f.list <- pblapply( f, function(x) fromJSON( file = x ) )
#join lists together
dt <- data.table::rbindlist( f.list )
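If the JSON files do not all contain exactly the same fields, rbindlist() will error; fill = TRUE pads the missing columns with NA instead:

dt <- data.table::rbindlist( f.list, fill = TRUE )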

Speed up R script looping through files/folders to check thresholds, calculate averages, and plot

I'm trying to speed up some code in R. I think my looping methods can be replaced (maybe with some form of lapply or using sqldf) but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (a total of 10,000 CSVs). Each of those CSV files contains ~86,400 lines (data is daily by the second).
The goal of the script is to calculate the mean and stdev for two intervals of time from each file, and then make one summary plot for each subdirectory as follows:
library(timeSeries)
library(ggplot2)
# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- (length(dir))
# iterate through all subdirectories
for (idx in 1:num){
# declare empty vectors to fill for each subdirectory
DayVal <- c()
DayStd <- c()
NightVal <- c()
NightStd <- c()
date <- as.Date(character())
setwd(dir[idx])
filenames <- list.files(path=getwd())
numfiles <- length(filenames)
# for each file in the subdirectory
for (i in c(1:numfiles)){
day <- read.csv(filenames[i], sep = ',')
today <- as.Date(day$time[1], "%Y-%m-%d")
# setting interval for times of day we care about <- SQL seems like it may be useful here but I couldn't get read.csv.sql to recognize hourly intervals
nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
nightInt <- day[(as.POSIXct(day$time) >= nightThreshold & as.POSIXct(day$time) <= (nightThreshold + 3600)) , ]
dayInt <- day[(as.POSIXct(day$time) >= dayThreshold & as.POSIXct(day$time) <= (dayThreshold + 3600)) , ]
#check some thresholds in the data for that time period
if (sum(nightInt$val, na.rm=TRUE) < 5){
NightMean <- mean(nightInt$val, na.rm =TRUE)
NightSD <-sd(nightInt$val, na.rm =TRUE)
} else {
NightMean <- NA
NightSD <- NA
}
if (sum(dayInt$val, na.rm=TRUE) > 5){
DayMean <- mean(dayInt$val, na.rm =TRUE)
DaySD <-sd(dayInt$val, na.rm =TRUE)
} else {
DayMean <- NA
DaySD <- NA
}
NightVal <- c(NightVal, NightMean)
NightStd <- c(NightStd, NightSD)
DayVal <- c(DayVal, DayMean)
DayStd <- c(DayStd, DaySD)
date <-c(date, as.Date(today))
}
df<-data.frame(date,DayVal,DayStd,NightVal, NightStd)
# plot for the subdirectory
p1 <- ggplot() +
  geom_point(data = df, aes(x = date, y = DayVal, color = "Day Average")) +
  geom_point(data = df, aes(x = date, y = DayStd, color = "Day Standard Dev")) +
  geom_point(data = df, aes(x = date, y = NightVal, color = "Night Average")) +
  geom_point(data = df, aes(x = date, y = NightStd, color = "Night Standard Dev")) +
  scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))
print(p1)  # inside a for loop the plot has to be printed (or saved) explicitly
}
Thanks very much for any advice you can offer!
Consider an SQL database solution, since you manage quite a bit of data in flat files. A relational database management system (RDBMS) can easily handle millions of records and even aggregate as needed with its scalable engine, rather than processing everything in memory in R. If not for speed and efficiency, databases provide security, robustness, and organization as a central repository. You could even set up a script to import each daily CSV directly into the database thereafter.
Fortunately, practically every RDBMS has a CSV handler and can load multiple files in bulk. Below are open-source solutions: SQLite (a file-level database), MySQL, and PostgreSQL (both server-level databases), all of which have corresponding libraries in R. Each example recursively imports the CSV files from the directory listing into a database table named timeseriesdata (with the same field names/data types as the CSV files). At the end is one SQL call to pull an aggregation of night- and day-interval mean and standard deviation (adjust as needed). The only challenge is designating a file and subdirectory indicator (which may or may not exist in the actual data) and appending it alongside the CSV files (possibly, after each iteration, run an update query against a FileID column).
dir <- list.dirs(path = "/ParentDirectory",
full.names = TRUE, recursive = FALSE)
# SQLITE DATABASE
library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED DUE TO CMD LINE LOAD BELOW)
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT VIA COMMAND LINE OR BASH (ASSUMES SQLITE3 IS PATH VARIABLE)
cmd <- paste0("(echo .separator ,; echo .import ", csvfile, " timeseriesdata)",
              " | sqlite3 /path/to/database.db")
system(cmd)
}
}
# CLOSE CONNNECTION
dbDisconnect(sqconn)
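If you would rather stay inside R than shell out to sqlite3, the same import can be done with RSQLite's dbWriteTable() (a sketch, not the command-line approach above; the database path is a placeholder):

library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
for (d in dir){
  for (f in list.files(d, full.names = TRUE)){
    # append each CSV to the same table
    dbWriteTable(sqconn, "timeseriesdata", read.csv(f), append = TRUE)
  }
}
dbDisconnect(sqconn)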
# MYSQL DATABASE
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING LOAD DATA INFILE COMMAND
sql <- paste0("LOAD DATA INFILE '", csvfile, "'
INTO TABLE timeseriesdata
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(col1, col2, col3, col4, col5);")
dbSendQuery(myconn, sql)
dbCommit(myconn)
}
}
# CLOSE CONNECTION
dbDisconnect(myconn)
# POSTGRESQL DATABASE
library(RPostgreSQL)
pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
user= "postgres", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING COPY COMMAND
sql <- paste0("COPY timeseriesdata(col1, col2, col3, col4, col5)
               FROM '", csvfile, "' DELIMITER ',' CSV;")
dbSendQuery(pgconn, sql)
}
}
# CLOSE CONNNECTION
dbDisconnect(pgconn)
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (FOR NIGHT AND DAY, ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT ng.FileID, NightMean, NightSTD, DayMean, DaySTD
           FROM
              (SELECT nt.FileID, AVG(nt.val) AS NightMean, STDDEV(nt.val) AS NightSTD
               FROM timeseriesdata nt
               WHERE nt.time >= '03:00:00' AND nt.time <= '04:00:00'
               GROUP BY nt.FileID
               HAVING SUM(nt.val) < 5) AS ng
           INNER JOIN
              (SELECT dt.FileID, AVG(dt.val) AS DayMean, STDDEV(dt.val) AS DaySTD
               FROM timeseriesdata dt
               WHERE dt.time >= '15:00:00' AND dt.time <= '16:00:00'
               GROUP BY dt.FileID
               HAVING SUM(dt.val) > 5) AS dy
           ON ng.FileID = dy.FileID;"
res <- dbSendQuery(myconn, strSQL)
df <- dbFetch(res)
dbClearResult(res)
dbDisconnect(myconn)
One thing would be to convert day$time once instead of the several times you do it now. Also consider the lubridate package: when you have a large number of timestamps to convert, it is much faster than as.POSIXct().
Also preallocate the vectors you store results in, e.g. DayVal and DayStd, to the appropriate length (DayVal <- numeric(numfiles)) and then assign each result into its index instead of growing the vector with c().
If the CSV files are large, consider using the fread() function from the data.table package.
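A minimal sketch of those suggestions combined (the path is a placeholder and the aggregation is simplified; in the real script you would keep the day/night interval and threshold logic):

library(data.table)
library(lubridate)

filenames <- list.files("/ParentDirectory/SomeSubdir", full.names = TRUE)  # placeholder path
numfiles  <- length(filenames)

# preallocate the result vectors instead of growing them with c()
DayVal <- numeric(numfiles)
DayStd <- numeric(numfiles)

for (i in seq_len(numfiles)){
  day <- fread(filenames[i])                 # fread is much faster than read.csv on large files
  day[, time := ymd_hms(time)]               # convert the timestamp column once, up front
  DayVal[i] <- mean(day$val, na.rm = TRUE)   # write into the preallocated slot
  DayStd[i] <- sd(day$val, na.rm = TRUE)
}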
