Looping code for multiple db3 files in R

I'm trying to import a number of .db3 files and rbind them together for further analysis. I'm having no trouble importing a single .db3 file, but my rbind won't work, despite working fine for .csv files. Where have I gone wrong?
df <- c()
for (x in list.files(pattern="*.db3")){
  sqlite <- dbDriver("SQLite")
  mydb <- dbConnect(sqlite, x)
  dbListTables(mydb)
  results <- dbSendQuery(mydb, "SELECT * FROM gps_data")
  data = fetch(results, n = -1)
  data$Label <- factor(x)
  data <- rbind(df, data)
}
Any help you can offer would be great!

Let's have a close look at that rbind call at the end of your loop:
df <- c()
for (x in list.files(pattern="*.db3")){
  sqlite <- dbDriver("SQLite")
  mydb <- dbConnect(sqlite, x)
  dbListTables(mydb)
  results <- dbSendQuery(mydb, "SELECT * FROM gps_data")
  data = fetch(results, n = -1)
  data$Label <- factor(x)
  data <- rbind(df, data)
}
You've created the object df, then you're binding data onto the end of it and assigning the result back to data (note that df hasn't changed). Great. Now your loop starts again, creating a new data object from the next file, and binding it onto... a still-empty df. Doh! It's a simple error: you're assigning the result to the wrong object. Try changing that last line to:
df <- rbind( df, data )
and see how it goes.
What you'll be doing differently is adding to df over and over, making it bigger each time through the loop. When you overwrote data instead, the next iteration recreated data from scratch and threw away what you'd just built.
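For completeness, here's a minimal sketch of the corrected loop, assuming the DBI and RSQLite packages are installed and that every file really does contain a gps_data table; the dbClearResult() and dbDisconnect() calls just tidy up each connection before moving on to the next file:
library(DBI)
library(RSQLite)

df <- data.frame()
for (x in list.files(pattern = "\\.db3$")) {   # match files ending in .db3
  mydb <- dbConnect(RSQLite::SQLite(), x)
  results <- dbSendQuery(mydb, "SELECT * FROM gps_data")
  data <- dbFetch(results, n = -1)   # pull all rows of the result set
  dbClearResult(results)             # release the result set
  dbDisconnect(mydb)                 # close this file's connection
  data$Label <- factor(x)            # tag rows with their source file
  df <- rbind(df, data)              # grow df, not data
}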

Related

Looping over lists, extracting certain elements and delete the list?

I am trying to write efficient code that opens data files containing a list, extracts one element from the list, stores it in a data frame, and then deletes the object before opening the next one.
My idea is to do this using loops. Unfortunately, I am quite new to loops and don't know how to write the code.
I have managed to open the data-sets using the following code:
for(i in 1995:2015){
  objects = paste("C:/Users/...", i, "agg.rda", sep=" ")
  load(objects)
}
The problem is that each data-set is extremely large and R cannot hold all of them in memory at once. Therefore, I am now trying to extract one element from each list, named tab_<<i value>>_agg[["A"]] (for example tab_1995_agg[["A"]]), then delete the object and iterate over each i (the different years).
I have tried using the following code, but it does not work:
for(i in unique(1995:2015)){
  objects = paste("C:/Users/...", i, "agg.rda", sep=" ")
  load(objects)
  tmp = cat("tab", i, "_agg[[\"A\"]]", sep = "")
  y <- rbind(y, tmp)
  rm(list=objects)
}
I apologize for any silly mistake (or question) and greatly appreciate any help.
Here’s a possible solution using a function to rename the object you’re loading in. I got loadRData from here. The loadRData function makes this a bit more approachable because you can load in the object with a different name.
Create some data for a reproducible example.
tab2000_agg <-
  list(
    A = 1:5,
    b = 6:10
  )
tab2001_agg <-
  list(
    A = 1:5,
    d = 6:10
  )
save(tab2000_agg, file = "2000_agg.rda")
save(tab2001_agg, file = "2001_agg.rda")
rm(tab2000_agg, tab2001_agg)
Using your loop idea.
loadRData <- function(fileName){
  load(fileName)
  get(ls()[ls() != "fileName"])
}

y <- list()
for(i in 2000:2001){
  objects <- paste("", i, "_agg.rda", sep="")
  data_list <- loadRData(objects)
  tmp <- data_list[["A"]]
  y[[i]] <- tmp
  rm(data_list)
}
y <- do.call(rbind, y)
You could also turn it into a function rather than use a loop.
getElement <- function(year){
  objects <- paste0("", year, "_agg.rda")
  data_list <- loadRData(objects)
  tmp <- data_list[["A"]]
  return(tmp)
}
y <- lapply(2000:2001, getElement)
y <- do.call(rbind, y)
Created on 2022-01-14 by the reprex package (v2.0.1)
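Mapped back onto the question's own naming scheme, the same idea might look roughly like the sketch below. The folder path and the "<year>_agg.rda" file-name pattern are placeholders for the truncated ones in the question; loadRData doesn't care what the object inside each .rda is called (tab_1995_agg, tab_1996_agg, ...), it just returns it:
y <- lapply(1995:2015, function(i) {
  obj <- loadRData(file.path("C:/Users/...", paste0(i, "_agg.rda")))  # placeholder path and pattern
  obj[["A"]]
})
y <- do.call(rbind, y)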

Memory and optimization problems in a big loop over JSTOR XML files

I'm having memory and optimization problems when looping over 200,000 documents of JSTOR data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code I transform an XML file into a tidy data frame in the following manner:
Transform <- function(x)
{
  a <- xmlParse(x)
  aTop <- xmlRoot(a)
  Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
  Publisher <- xmlValue(aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
  Title <- xmlValue(aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
  Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
  Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
  Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
  df <- data.frame(Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
  df
}
Next, I use this first function to transform a series of XML files into a single data frame:
TransformFiles <- function(pathFiles)
{
  files <- list.files(pathFiles, "*.xml")
  i = 2
  df2 <- Transform(paste(pathFiles, files[i], sep="/", collapse=""))
  while (i <= length(files))
  {
    df <- Transform(paste(pathFiles, files[i], sep="/", collapse=""))
    df2[i,] <- df
    i <- i + 1
  }
  data.frame(df2)
}
When I have more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or gets too slow over time. Even on small sets you can see it getting slower as it goes. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing an object inside a loop, as your assignment df2[i,] <- df does (and which, by the way, only works if df has exactly one row), and avoid the bookkeeping required by while with the iterator i.
Instead, consider building a list of data frames with lapply that you can then rbind together in one call outside the loop.
TransformFiles <- function(pathFiles)
{
  files <- list.files(pathFiles, "*.xml", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))

  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)

  final_df
}
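Hypothetical usage, assuming all the XML files sit in one folder (the path below is just a placeholder):
df <- TransformFiles("path/to/jstor_xml_folder")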

How to get better performance in R: one big file or several smaller files?

I had about 200 different files (all of them were big matrices, 465x1080) (that is huge for me). I then used cbind2 to make them all one bigger matrix (465x200000).
I did that because I needed to create one separate file for each row (465 files) and I thought it would be easier for R to load the data into memory only ONCE and then just go row by row, creating a separate file for each one, instead of opening and closing 200 different files for every row.
Is this really the faster way? (I am wondering because it is taking quite a long time.) When I check the Task Manager in Windows, it shows the RAM used by R going from 700MB to 1GB and back to 700MB all the time (twice every second). It seems like the main file isn't loaded just once, but is being loaded and erased from memory on every iteration (which could be the reason why it is a bit slow?).
I am a beginner so all of this that I wrote might not make any sense.
Here is my code (those +1 and -1 are because the original data has 1 extra column that I don't need in the new files):
extractStationData <- function(OriginalData, OutputName = "BCN-St") {
  for (i in 1:nrow(OriginalData)) {
    OutputData <- matrix(NA, nrow = ncol(OriginalData) - 1, 3)
    colnames(OutputData) <- c("Time", "Bikes", "Slots")
    for (j in 1:(ncol(OriginalData) - 1)) {
      OutputData[j, 1] <- colnames(OriginalData[j + 1])
      OutputData[j, 2] <- OriginalData[i, j + 1]
    }
    write.table(OutputData, file = paste(OutputName, i, ".txt", sep = ""))
    print(i)
  }
}
Any thoughts? Maybe I should just create an object (the huge file) before the first for loop and then it would be loaded just once?
Thanks in advance.
Let's assume you have already created the 465x200000 matrix and that only the extractStationData function is in question. Then we can modify it, for example, like this:
require(data.table)
extractStationData <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1]  # remove the column you do not need
  # create the empty matrix outside the loop:
  emptyMat <- matrix(NA, nrow = ncol(d2), 3)
  colnames(emptyMat) <- c("Time", "Bikes", "Slots")
  emptyMat[, 1] <- colnames(d2)
  for (i in 1:nrow(d2)) {
    OutputData <- emptyMat
    OutputData[, 2] <- d2[i, ]
    fwrite(OutputData, file = paste(OutputName, i, ".txt", sep = "")) # use fwrite for speed
  }
}
V2:
If your OriginalData is in matrix format, this approach for creating the list of new data.tables looks quite fast:
extractStationData2 <- function(d, OutputName = "BCN-St") {
  d2 <- d[, -1]               # remove the column you don't need
  ds <- split(d2, 1:nrow(d2)) # one list element per row
  r <- lapply(ds, function(x) {
    k <- data.table(colnames(d2), x, NA)
    setnames(k, c("Time", "Bikes", "Slots"))
    k
  })
  r
}
dl <- extractStationData2(d) # list of new data objects
# write to files:
OutputName <- "BCN-St"       # prefix used in the file names below
for (i in seq_along(dl)) {
  fwrite(dl[[i]], file = paste(OutputName, i, ".txt", sep = ""))
}
This should also work for a data.frame with minor changes:
k <- data.table(colnames(d2), t(x), NA)

Deleting rows in a sequence for MULTIPLE lists in R

I know how to delete rows in a sequence for a SINGLE list:
data <- data.table('A' = c(1,2,3,4), 'B' = c(900,6,'NA',2))
row.remove <- data[!(data$A %in% seq(from=1, to=4, by=2))]
However, I would like to know how to do so with MULTIPLE lists.
Code I've tried:
file.number <- c(1:5)
data <- setNames(lapply(paste(file.number, ".csv"), read.csv), paste(file.number)) # this line imports the lists from csv files - works
data.2 <- lapply(data, data.table) # seems to work
row.remove <- lapply(data.2, function(x) x[!(data.2$A %in% seq(from=1, to=4, by=2))]) # no error message, but deletes all the rows
I feel like I'm missing something obvious, any help will be greatly appreciated.
Solution:
for (i in 1:5){
  file.number = i
  data <- setNames(lapply(paste(file.number, ".csv"), read.csv), paste(file.number))
  data <- as.data.table(data)
  row.remove <- data[!(data$A %in% seq(from=1, to=4, by=2))]
}
Instead of analyzing the lists simultaneously, this analyzes them one by one. It's not a full solution, more of a workaround.
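For what it's worth, here is a minimal sketch of the lapply approach the question seems to be aiming for, assuming each csv really has a column A and using placeholder file names; the key fix is indexing with x$A inside the anonymous function instead of reaching back out to data.2:
library(data.table)

file.number <- 1:5
files <- paste0(file.number, ".csv")   # placeholder file names
data.2 <- setNames(lapply(files, function(f) as.data.table(read.csv(f))), file.number)

rows.to.drop <- seq(from = 1, to = 4, by = 2)
row.remove <- lapply(data.2, function(x) x[!(x$A %in% rows.to.drop)])  # x$A, not data.2$A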

undefined columns selected error - works with 1 csv file, gives error with more

I have a few functions that I created to help me analyze some data. My main function starts by binding all the .csv files in a folder and then calls other functions to perform various tasks. It looks like this:
x <- function(directory){
  files <- list.files(directory, full.names = TRUE)
  num_files <- length(files)
  options(stringsAsFactors = TRUE)
  df <- data.frame()
  for (i in 1:num_files) {
    df_data <- read.csv(files[i])
    df <- rbind(df, df_data)
  }
  df$Status <- "ba"
  ab_cid <- input() # simple input function, see below for the input function code
  df$Status[df$cid %in% ab_cid] <- "ab"
  df$Status <- as.factor(df$Status)
  bad_var_list <- prep_dataset(df)
  df <- df[, !(names(df) %in% bad_var_list)]
  df
}
Here is the input function:
input <- function(){
  x <- readline("Enter a comma separated list of cids with ab status: ")
  x <- as.numeric(unlist(strsplit(x, ",")))
  x
}
Another function is later called to clean up the data to meet some requirements that I have.
The code in the prep_dataset function starts out like this; it gives me an error on the last line shown here:
prep_dataset <- function(df){
  df <- subset(df, Status == 'ab')
  listfactors <- sapply(df, is.factor)
  df_factors <- df[, listfactors]
  df_bad <- df_factors[, (colSums(df_factors == "") >= nrow(df_factors) * .20)]
  ......
}
When I run my function x('Folder Name') with one .csv file in the folder, it runs fine and I get the desired results. However, if there is more than one file I get this:
Error in `[.data.frame`(df_factors, , (colSums(df_factors == :
undefined columns selected
Called from: `[.data.frame`(df_factors, , (colSums(df_factors == "") >= nrow(df_factors)*0.2))
I took two csv files and manually combined them into one, then compared the data frame created that way against the one built by the for loop - they look identical. I have no clue what's going on or why this error message keeps popping up.
So I discovered that, for some reason, when read.csv ran on a folder with two or more csv files, columns that were entirely empty ended up as NA instead of "", and this showed up once the prep_dataset() function ran df <- subset(df, Status == 'ab'). It only happened when the folder had multiple csv files, never with a single one - I'm not really sure why.
But to fix the issue I had to get rid of the NAs by doing the following:
char <- sapply(df_factors, as.character)  # convert every factor column to character
char[is.na(char)] <- ""                   # turn the NAs back into empty strings
df_char <- as.data.frame(char)
Now when the function continues and runs
df_bad <- df_char[,(colSums(df_char == "") >= nrow(df_char) * .20)]
The undefined columns selected error does not happen anymore.
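As an aside, a defensive variant of the problematic line is sketched below. It assumes you want NA cells counted the same as empty strings, which keeps NA out of the logical column index (an NA in that index is what produces the undefined columns selected error):
empty_share <- colSums(df_factors == "" | is.na(df_factors)) / nrow(df_factors)
df_bad <- df_factors[, empty_share >= 0.20, drop = FALSE]  # drop = FALSE keeps it a data frame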
