I am working in R and learning how to code. I have written a piece of code that uses a for loop, and I find it very slow. I was wondering if I could get some assistance converting it to use either sapply or lapply. Here is my working R code:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE) # creates a list of files
  dat <- data.frame() # creates an empty data frame
  for (i in seq_along(files_list)) {
    # loops through the files, rbinding them together
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- filter(dat, dat$ID %in% id) # subsets the rows that match the 'id' argument
  mean(dat_subset[, pollutant], na.rm = TRUE) # identifies the mean of a pollutant
}
pollutantmean("specdata", "sulfate", 1:10)
This code takes almost 20 seconds to return, which seems unacceptable for 332 files. Imagine if I had a dataset with 10,000 files and wanted the mean of those variables.
You can rbind all elements in a list using do.call, and you can read in all the files into that list using lapply:
mean(
  filter( # here's the filter that will be applied to the rbind-ed data
    do.call("rbind", # call "rbind" on all elements of a list
      lapply( # create a list by reading in the files from list.files()
        # add any extra read.csv() arguments here if you need them:
        list.files("[::DIR_PATH::]", full.names = TRUE), function(x) read.csv(file = x)
      )
    ),
    ID %in% id # make sure id is replaced with what you want
  )[[pollutant]], # [[pollutant]] selects the column named by the pollutant argument
  na.rm = TRUE
)
The reason your code is slow is that you are incrementally growing your data frame inside the loop. One way to do this using dplyr and map_df from purrr is:
library(dplyr)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)
  purrr::map_df(files_list, read.csv) %>%
    filter(ID %in% id) %>%
    summarise_at(pollutant, mean, na.rm = TRUE)
}
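This version is called exactly like the original, for example with the sample call from the question:
pollutantmean("specdata", "sulfate", 1:10)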
I need to bind_rows() 27 Excel files. Although I can do it manually, I want to do it with a loop. The problem with my loop is that each pass binds the empty data frame to file i, then to file i+1, so the result from file i is lost. How can I fix this?
nm <- list.files(path = "sample/sample/sample/")
df <- data.frame()
for (i in 1:27) {
  my_data <- bind_rows(df, read_excel(path = nm[i]))
}
We could loop over the files with map, read the data with read_excel, and row-bind everything in one step with map_dfr:
library(purrr)
library(readxl)
my_data <- map_dfr(nm, read_excel)
In the OP's code, the issue is that each iteration overwrites the temporary object 'my_data' instead of binding the newly read file onto the 'df' that was already created:
for(i in 1:27){
df <- rbind(df, read_excel(path = nm[i]))
}
We could also use sapply with simplify = FALSE, which returns a list that bind_rows can combine (using the question's nm vector of file names; readxl and dplyr need to be loaded). Because sapply names the list after the file names, the id column will record which file each row came from:
library(readxl)
library(dplyr)
result <- sapply(nm, read_excel, simplify = FALSE) %>%
  bind_rows(.id = "id")
You can still use your for loop:
my_data <- vector('list', 27)
for (i in 1:27) {
  my_data[[i]] <- read_excel(path = nm[i]) # note [[i]], not [i], so each data frame is stored whole
}
do.call(rbind, my_data)
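Note that the final do.call() line only prints the combined result; assign it if you want to keep working with it (the name my_data_all is just an example):
my_data_all <- do.call(rbind, my_data)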
I'm trying to write a function called complete that takes a file directory (which has csv files titled 1-332) and the number of a file, and prints the number of rows with no NA in the sulfate or nitrate columns. I am trying to use mutate to add a column called nob that is 1 when neither column is NA, and then take the sum of nob as my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if(!file.exists("rprog-data-specdata.zip")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip",temp)
unzip(temp)
unlink(temp)
}
complete <- function(directory, id = 1:332){
  #create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  for(i in id){
    dat <- rbind(dat, read.csv(files_full[i]))
  }
  mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
  x <- summarise(dat, sum = sum(nob))
  return(x)
}
When one runs the following code the result should be 117, but I get an error message instead:
complete("specdata", 1)
Error: object 'nob' not found
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. As for the error itself: mutate() returns a new data frame rather than modifying dat in place, and since that result is never assigned back to dat, the nob column does not exist when summarise() runs.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)[id] # keep only the files requested via id
# cycle over each file to get the number of nonmissing rows
purrr::map_int(
files_full,
~ read.csv(.x) %>% # read in datafile
dplyr::select(sulfate, nitrate) %>% # select two columns of interest
tidyr::drop_na %>% # drop missing observations
nrow() # get the number of rows with no missing data
) %>%
sum() # sum the total number of rows not missing among all files
}
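For completeness, the smallest fix to the original loop-based version is simply to assign the mutate() result back to dat before summarising; a sketch of just those two lines:
dat <- mutate(dat, nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
summarise(dat, sum = sum(nob))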
As mentioned, avoid building objects in a loop. Instead, consider building a list of data frames from each csv and then calling rbind once. In fact, even consider base R (i.e., tinyverse) for all your needs:
complete <- function(directory, id = 1:332){
# create a list of files
files_full <- list.files(directory, full.names = TRUE)
# create a list of data frames
df_list <- lapply(files_full[id], read.csv)
# build a single data frame with nob column
dat <- transform(do.call(rbind, df_list),
nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0)
)
return(sum(dat$nob))
}
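Either version can be checked against the value expected in the question:
complete("specdata", 1)
# should return 117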
I'm having memory and optimization problems when looping over 200,000 documents of JSTOR's data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code I transform an XML file into a tidy data frame in the following manner:
library(XML)

Transform <- function(x) {
  a <- xmlParse(x)
  aTop <- xmlRoot(a)
  Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
  Publisher <- xmlValue(aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
  Title <- xmlValue(aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
  Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
  Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
  Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
  df <- data.frame(Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
  df
}
Next, I use this first function to transform a series of XML files into a single data frame:
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml")
  i = 2
  df2 <- Transform(paste(pathFiles, files[i], sep = "/", collapse = ""))
  while (i <= length(files)) {
    df <- Transform(paste(pathFiles, files[i], sep = "/", collapse = ""))
    df2[i, ] <- df
    i <- i + 1
  }
  data.frame(df2)
}
When I have more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or gets too slow over time. Even on small sets, you can see it run slower and slower as it goes. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing an object in a loop with your assignment df2[i,] <- df (which, by the way, only works if df has exactly one row), and avoid the bookkeeping required by while and its iterator i.
Instead, consider building a list of data frames with lapply that you can then rbind together in one call outside of the loop.
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))
  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)
  final_df # return the combined data frame explicitly
}
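A hypothetical call, assuming the XML files sit in a folder named jstor_xml (the folder name is only an example):
jstor_df <- TransformFiles("jstor_xml")
str(jstor_df)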
New to R and to programming, I reviewed all the relevant threads on SO about this Coursera assignment but couldn't figure out what the issue was. I know this function can be optimized using lapply and much more, but I would like to know why this particular function does not work. I realize that questions about this function have slightly irritated some users; to be honest, I reviewed the relevant posts and I still don't see what I can do about this particular bug.
pollutantmean <- function(directory, pollutant, id) {
  #Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
    if (pollutant == "nitrate") {
      #Create a subset for nitrate values of df
      df_nitrate <- df[df$ID == id[i], "nitrate"]
      #Take mean of df_nitrate
      mean(df_nitrate, na.rm = TRUE)
    } else {
      #Create a subset for sulfate values of df
      df_sulfate <- df[df$ID == id[i], "sulfate"]
      #Take mean of df_sulfate
      mean(df_sulfate, na.rm = TRUE)
    }
  }
}
For those of you who have not heard of this assignment: I have 332 csv files (named 001.csv, 002.csv, and so on) in my working directory. The task is to get all of them into one data frame and to be able to compute the mean of a column for one file (given by the "id" argument that corresponds to that file) or across multiple files (some examples of the function and its output can be found here).
I tried calling the traceback and debug functions to pinpoint the problem, but to no avail:
pollutantmean(getwd(), "nitrate", 23)
> traceback()
No traceback available
> debug(pollutantmean)
>
The OS is Windows 10.
Any suggestions or comments are welcome. Thanks in advance.
Your for loop is wrapped around your if block. A for loop returns NULL, so the mean() values computed inside it are discarded and the function returns nothing (unless you use the return function, which is not what you want to do here). Close the loop right after the rbind so the if block runs once on the complete data frame:
pollutantmean <- function(directory, pollutant, id) {
  #Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
  } # close the for loop HERE
  if (pollutant == "nitrate") {
    #Create a subset for nitrate values of df
    # (use %in% id rather than == id[i]: outside the loop, i is stuck at its last value)
    df_nitrate <- df[df$ID %in% id, "nitrate"]
    #Take mean of df_nitrate
    mean(df_nitrate, na.rm = TRUE)
  } else {
    #Create a subset for sulfate values of df
    df_sulfate <- df[df$ID %in% id, "sulfate"]
    #Take mean of df_sulfate
    mean(df_sulfate, na.rm = TRUE)
  }
  # not HERE
  # }
}
I am new to R. I created the function below to calculate the mean of a dataset contained in 332 csv files. I'm seeking advice on how I could improve this code; it takes about 38 seconds to run, which makes me think it is not very efficient.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE) # creates a list of files
  dat <- data.frame() # creates an empty data frame
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i])) # combine all the monitor data together
  }
  good <- complete.cases(dat) # flag the rows with no NA values
  mean(dat[good, pollutant]) # calculate the mean
} # run time ~37 sec - NEED TO OPTIMISE THE CODE
Instead of creating an empty data.frame and rbind-ing to it on every pass of a for loop, you can store all the data frames in a list and combine them in one shot. You can also use the na.rm option of mean() so that NA values are not taken into account.
pollutantmean <- function(directory, pollutant, id = 1:332)
{
  files_list = list.files(directory, full.names = TRUE)[id]
  df = do.call(rbind, lapply(files_list, read.csv))
  mean(df[[pollutant]], na.rm = TRUE)
}
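One assumption worth flagging: list.files(...)[id] relies on the alphabetical file order matching the monitor IDs. That holds here because the assignment names the files 001.csv through 332.csv, but you could also build the paths from the ids directly; a sketch:
files_list = file.path(directory, sprintf("%03d.csv", id))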
Optionally, I would increase readability with magrittr (extract2 is magrittr's pipe-friendly alias for [[, so it pulls the pollutant column as a vector before mean is applied):
library(magrittr)
pollutantmean <- function(directory, pollutant, id = 1:332)
{
  list.files(directory, full.names = TRUE)[id] %>%
    lapply(read.csv) %>%
    do.call(rbind, .) %>%
    extract2(pollutant) %>%
    mean(na.rm = TRUE)
}
You can improve it by using data.table's fread function (see "Quickly reading very large tables as dataframes in R").
Also binding the result using data.table::rbindlist is way faster.
require(data.table)
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list = list.files(directory, full.names = TRUE)[id]
  DT = rbindlist(lapply(files_list, fread))
  mean(DT[[pollutant]], na.rm = TRUE)
}
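If the files are wide, fread's select argument can cut the work further by reading only the column you actually need (here just the pollutant column, since the id filtering is already handled through the file list); a sketch:
DT = rbindlist(lapply(files_list, fread, select = pollutant))
mean(DT[[pollutant]], na.rm = TRUE)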