I am writing a function that takes a directory and a threshold for complete cases (default 0) as arguments and then runs cor on two of the columns. The function runs, kind of, but it seems to skip everything in the body and returns numeric(0). I can't figure out what causes that. I ran each line in the console and confirmed that each one works on its own, but once it comes time to put it all together it falls apart. Below is what I'm working with.
corr <- function(directory, threshold = 0){
  files <- list.files(path = directory, pattern = "*.csv", full.names = T)
  cr <- numeric()
  for(i in seq(files)){
    data <- read.csv(files[i])
    cleandata <- complete.cases
    data <- data[cleandata, ]
    if (nrow(data) > threshold){
      correlation <- cor(data$sulfate, data$nitrate)
      cr <- append(cr, correlation)
    }
  }
  return(cr)
}
Edit: removed the last part of the question. After re-reading, I realized that I would not get what I wanted from return(files) even if it worked correctly. I should also note that adding return(files) changed my output from numeric(0) to character(0).
How about something like this? You put all the files in a data frame and then map out the correlations. As per @r2evans's comment, this adds an NA when the threshold condition is not met, and as per @Limey's comment, it is more efficient than a for loop with append. The benefit here is that we keep each correlation with its file name, so you can see which file has what correlation.
library(tidyverse)

corr <- function(directory, threshold = 0){
  tibble(file = list.files(path = directory,
                           pattern = "*.csv",
                           full.names = TRUE)) |>
    mutate(data = map(file, read_csv),
           comp = map(data, ~ .x[complete.cases(.x), ]),  # keep only complete rows
           nrw  = map_dbl(comp, nrow),
           cr   = case_when(
             nrw > threshold ~ map_dbl(comp, ~ cor(.x$sulfate, .x$nitrate)),
             TRUE ~ NA_real_
           ))
}
corr_df <- corr(directory = "user/AndS./stack_questions")
corr_vec <- pull(corr_df, cr)
I have written a function that is meant to extract the means of particle quantities in the air. The code works, but if the id argument is set to anything below 1:279, the function returns NaN.
library(dplyr)  # for %>% and bind_rows

Pollutantmean <- function(directory, pollutant, id = 1:332)
{
  Airpoll <- list.files(path = directory, pattern = "*.csv", full.names = TRUE) %>%
    lapply(read.csv) %>%
    bind_rows()
  Sig_data <- Airpoll[id, pollutant]
  mean(Sig_data, na.rm = TRUE)
}
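The likely cause of the NaN: Airpoll[id, pollutant] selects rows by position, not by monitor ID, so id = 1:279 grabs the first 279 rows of the combined data (all from the first file), and if those happen to be all NA for the chosen pollutant, the mean of an empty set is NaN. A minimal base R sketch of the fix, assuming each file carries an ID column as elsewhere in this thread:

```r
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  airpoll <- do.call(rbind, lapply(files, read.csv))
  # filter on the monitor ID column rather than indexing rows by position
  sig_data <- airpoll[airpoll$ID %in% id, pollutant]
  mean(sig_data, na.rm = TRUE)
}
```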
I'm trying to write a function called complete that takes a file directory (containing csv files titled 1-332) and a file number, and prints the number of rows with no NA in the sulfate or nitrate columns. I am using mutate to add a column titled nobs that is 1 when neither column is NA, then taking the sum of nobs for my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if(!file.exists("rprog-data-specdata.zip")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip",temp)
unzip(temp)
unlink(temp)
}
complete <- function(directory, id = 1:332){
  #create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  for(i in id){
    dat <- rbind(dat, read.csv(files_full[i]))
  }
  mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
  x <- summarise(dat, sum = sum(nob))
  return(x)
}
When one runs the following code, nobs should be 117, but I get an error message instead:
complete("specdata", 1)
Error: object 'nob' not found
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. As for where your code went wrong: mutate() builds a new data frame rather than modifying dat in place, and its result is never assigned back to dat, so the nob column does not exist when summarise() is called.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332){
  #create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  # cycle over each requested file to get the number of nonmissing rows
  purrr::map_int(
    files_full[id],                        # subset by id so only the requested files are counted
    ~ read.csv(.x) %>%                     # read in datafile
      dplyr::select(sulfate, nitrate) %>%  # select the two columns of interest
      tidyr::drop_na() %>%                 # drop rows with missing observations
      nrow()                               # number of complete rows in this file
  ) %>%
    sum()                                  # total complete rows across the requested files
}
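For completeness, the smallest change that makes the original loop-based function work is simply assigning mutate()'s result back to dat:

```r
library(dplyr)

complete <- function(directory, id = 1:332){
  files_full <- list.files(directory, full.names = TRUE)
  dat <- data.frame()
  for (i in id) {
    dat <- rbind(dat, read.csv(files_full[i]))
  }
  # the fix: keep the mutated data frame so summarise() can see `nob`
  dat <- mutate(dat, nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
  summarise(dat, sum = sum(nob))
}
```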
As mentioned, avoid building objects in a loop. Instead, consider building a list of data frames from each csv then call rbind once. In fact, even consider base R (i.e., tinyverse) for all your needs:
complete <- function(directory, id = 1:332){
  # create a list of files
  files_full <- list.files(directory, full.names = TRUE)
  # create a list of data frames
  df_list <- lapply(files_full[id], read.csv)
  # build a single data frame with nob column
  dat <- transform(do.call(rbind, df_list),
                   nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
  return(sum(dat$nob))
}
I am a newbie in R and have to calculate the mean of the column sulf from 332 files. The mean formula below works well with one file. The problem comes when I attempt to calculate across all the files.
Perhaps the reading of all the files and storing them in mydata does not work well? Could you help me out?
Many thanks
pollutantmean <- function(specdata, pollutant = xor(sulf, nit), i = 1:332){
  specdata <- getwd()
  pollutant <- c(sulf, nit)
  for(i in 1:332){
    mydata <- read.csv(file_list[i])
  }
  sulfate <- (subset(mydata, select = c("sulfate")))
  sulf <- sulfate[!is.na(sulfate)]
  y <- mean(sulf)
  print(y)
}
This is not tested, but the steps are as follows. Note also that this kind of question is asked over and over again (e.g. here). Try searching for "work on multiple files", "batch processing", "import many files" or something akin to that.
lx <- list.files(pattern = ".csv", full.names = TRUE)
# gives you a character vector of file names

xy <- sapply(lx, FUN = function(x) {
  out <- read.csv(x)
  out <- out[, "sulfate", drop = FALSE]                 # keep it a data.frame, do not drop to a vector
  out <- out[!is.na(out[, "sulfate"]), , drop = FALSE]  # keep only the non-missing rows
  out
}, simplify = FALSE)

xy <- do.call(rbind, xy) # combine the result for all files into one big data.frame

mean(xy[, "sulfate"]) # calculate the mean
# or
summary(xy)
If you are short on RAM, this can be optimized a bit.
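One way to do that, for instance: instead of holding every row from every file in memory, keep only a running sum and non-NA count per file and divide at the end (the function name here is made up; the sulfate column is as in the thread):

```r
mean_sulfate <- function(directory) {
  lx <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  # one (sum, n) pair per file; the full data never sits in memory at once
  totals <- sapply(lx, function(x) {
    s <- read.csv(x)[["sulfate"]]
    c(sum = sum(s, na.rm = TRUE), n = sum(!is.na(s)))
  })
  sum(totals["sum", ]) / sum(totals["n", ])
}
```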
Thank you for your help.
I have sorted it out. The key was to use full.names = TRUE in list.files and rbind(mydata, ... ), as otherwise it reads the files one by one and does not append them after each other, which was my aim.
See below. I am not sure it is the most "R" solution, but it works.
pollutantmean <- function(directory, pollutant, id = 1:332){
  files_list <- list.files(directory, full.names = TRUE)
  mydata <- data.frame()
  for (i in id) {
    mydata <- rbind(mydata, read.csv(files_list[i]))
  }
  if (pollutant %in% "sulfate") {
    mean(mydata$sulfate, na.rm = TRUE)
  } else if (pollutant %in% "nitrate") {
    mean(mydata$nitrate, na.rm = TRUE)
  } else {
    "wrong pollutant"
  }
}
New to R and to programming, I reviewed all the possible threads on SO about this Coursera assignment but couldn't figure out what the issue is. I know this function can be optimized using lapply and much more, but I would like to know why this particular function does not work. I felt like some earlier questions on this function slightly irritated some users; to be honest, I reviewed the relevant posts on that and I don't see what else I can do about this particular bug.
pollutantmean <- function (directory, pollutant, id) {
  #Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
    if (pollutant == "nitrate"){
      #Create a subset for nitrate values of df
      df_nitrate <- df[df$ID == id[i], "nitrate"]
      #Take mean of df_nitrate
      mean(df_nitrate, na.rm = TRUE)
    } else {
      #Create a subset for sulfate values of df
      df_sulfate <- df[df$ID == id[i], "sulfate"]
      #Take mean of df_sulfate
      mean(df_sulfate, na.rm = TRUE)
    }
  }
}
For those of you who have not heard of this assignment: I have 332 csv files (named 001.csv, 002.csv and so on) in my working directory. The task is to get all of them into one data frame and to be able to compute the mean of a column for one file (given by the id that corresponds to that file) or across multiple files (some examples of the function and its output can be found here).
I tried calling the traceback and debug functions to locate the problem, but to no avail:
pollutantmean(getwd(), "nitrate", 23)
> traceback()
No traceback available
> debug(pollutantmean)
>
The OS is Windows 10.
Any suggestions or comments are welcome. Thanks in advance.
Your for loop is wrapped around your if block, so the mean is computed and then discarded on every iteration. Nothing ever comes back to the caller, because a for loop itself evaluates to NULL and that loop is the function's last expression (unless you use the return function, which is not what you want to do here). Close the loop before the if block:
pollutantmean <- function (directory, pollutant, id) {
  #Create the data frame with the data from the 332 files
  files <- list.files(getwd())
  df <- data.frame()
  id <- 1:332
  for (i in 1:length(id)) {
    df <- rbind(df, read.csv(files[i]))
  # close for loop HERE
  }
  if (pollutant == "nitrate"){
    #Create a subset for nitrate values of df (use %in% since id is a vector)
    df_nitrate <- df[df$ID %in% id, "nitrate"]
    #Take mean of df_nitrate
    mean(df_nitrate, na.rm = TRUE)
  } else {
    #Create a subset for sulfate values of df
    df_sulfate <- df[df$ID %in% id, "sulfate"]
    #Take mean of df_sulfate
    mean(df_sulfate, na.rm = TRUE)
  }
  # not HERE
  #}
}
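The point about no value coming back from a loop is easy to demonstrate with a toy function: the mean is evaluated on every pass, but the for loop itself evaluates to NULL, so that is what the function returns:

```r
f <- function() {
  for (i in 1:3) {
    mean(i)   # computed, then discarded on each iteration
  }
}
f()           # invisibly returns NULL, just like the original pollutantmean
```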
I'd like to know how to include every object that fulfils certain naming requirements in my arguments in R. Let's say the objects are all called something like this
var01 var02 var03 var04 varnn
What I would do in Stata for instance would be simply this
tab1 var*
and it would tabulate every variable with the first 3 letters "var".
In an earlier version of this post I was quite vague about what I actually wanted to do in my R project, so here goes. I've got a for loop that iterates over 650 instances, with the goal of appending 6 datasets for each of these instances. However, for some instances (I don't know which), not all 6 datasets exist, which is why an rbind command written like this fails:
rbind(data01, data02, data03, data04, data05, data06)
I'd therefore like to run something like this
rbind(data*)
So as to account for missing datasets.
Sorry for the confusion, I wasn't being clear enough when I originally wrote the question.
Just for reference, here is the whole loop:
for(i in 1:650){
try(part1 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/09July/",MP.ID[i],".csv")))
try(part2 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/08July/",MP.ID[i],".csv")))
try(part3 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/16July/",MP.ID[i],".csv")))
try(part4 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/17July/",MP.ID[i],".csv")))
try(part5 <- read.csv(file = paste0("Twitter Scrapes/userTimeline/24July/",MP.ID[i],".csv")))
try(part6 <- read.csv(file = paste0("Twitter Scrapes/searchTwitter/24July/",MP.ID[i],".csv")))
allParts <- ls(pattern = "^part*")
allNames <- paste(allParts, collapse = ", ") # this is just what I tried just now, didn't work though
combined.df <- rbind(ALL THE DATASETS WITH PART)
}
Data
var01 <- sample(2, 10, TRUE)
var02 <- sample(2, 10, TRUE)
var03 <- sample(2, 10, TRUE)
vvv01 <- sample(2, 10, TRUE) # variable which should not be tabulated
Code
allV <- ls(pattern = "^var.*") # vector of all variables starting with 'var'
lapply(allV, function(.) table(get(.)))
Explanation
With ls you get all variables which are named according to the pattern you provide. Then, you loop over all these variables, retrieve the variable by its name and tabulate it.
Update
With your recent changes what I would do is the following:
allV <- lapply(ls(pattern = "^part.*"), get) #stores all part variables in a list
combined.df <- do.call(rbind, allV) # rbinds all of them
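One caveat when this runs inside the question's try() loop: if a file fails to read for the current MP, the part objects from the previous iteration are still in the environment, and ls(pattern = "^part.*") will happily pick them up. A small sketch of the hygiene step (the two data frames here are stand-ins for the real reads):

```r
# iteration k: only some of the part objects were (re)created
part1 <- data.frame(x = 1)
part2 <- data.frame(x = 2)

# mget fetches every matching object by name in one go
combined.df <- do.call(rbind, mget(ls(pattern = "^part.*")))

# clear them before iteration k + 1, so a failed try() cannot
# silently re-append last iteration's data
rm(list = ls(pattern = "^part.*"))
```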