R Programming: Difficulty removing NAs from frame when using lapply - r

Full disclosure: I am taking a Data Science course on Coursera. For this particular question, we need to calculate the mean of some pollutant data that is being read in from multiple files.
The main function I need help with also references a couple other functions that I wrote in the script. For brevity, I'm just going to list them and their purpose:
boundIDs: I use this to bound the input so that out-of-range IDs won't be accepted (the range is 1:332, so if someone enters 1:400 this trims it to 1:332)
pollutantToCode: converts the pollutant string entered to that pollutant's column number in the data file
fullFilePath: creates the file name and appends it to the full file path. So if someone states they need the file for ID 1 in directory "curse/your/sudden/but/inevitable/betrayal/", the function will return "curse/your/sudden/but/inevitable/betrayal/001.csv" to be added to the file list vector.
After all that, the main function I'm working with is:
pollutantmean <- function(directory = "", pollutant, id = 1:332){
    id <- boundIDs(id)
    pollutant <- pollutantToCode(pollutant)
    numberOfIds <- length(id)
    fileList <- character(numberOfIds)
    for (i in 1:numberOfIds){
        if (id[i] > 332){
            next
        }
        fileList[i] <- fullFilePath(directory, id[i])
    }
    data <- lapply(fileList, read.csv)
    print(data[[1]][[pollutant]])
}
Right now, I'm intentionally printing only the first frame of data to see what my output looks like. To remove the NAs I've tried using:
data <- lapply(fileList, read.csv)
data <- data[!is.na(data)]
But the NAs remained, so then I tried computing the mean directly and using the na.rm parameter:
print(mean(data[[1]][[pollutant]], na.rm = TRUE))
But the mean was still "NA". Then I tried na.omit:
data <- lapply(fileList, na.omit(read.csv))
...and unfortunately the problem persisted.
Can someone please help? :-/
(PS: Right now I'm just focusing on the first frame of whatever is read in, i.e. data[[1]], since I figure if I can't get it for the first frame there's no point in iterating over the rest.)
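For what it's worth, the third attempt hands lapply the *result* of na.omit(read.csv) rather than a function to apply. A hedged sketch of the per-element form, using made-up in-memory data frames in place of the files read from disk:

```r
# na.omit must be applied to each data frame, e.g. via an anonymous function;
# the two data frames below stand in for lapply(fileList, read.csv).
df1 <- data.frame(sulfate = c(1.2, NA, 3.4), nitrate = c(NA, 0.5, 0.7))
df2 <- data.frame(sulfate = c(2.0, 2.5), nitrate = c(0.1, NA))
data <- list(df1, df2)
data <- lapply(data, na.omit)   # with files: lapply(fileList, function(f) na.omit(read.csv(f)))
mean(data[[1]][["sulfate"]])    # 3.4 -- only the one complete row of df1 survives
```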

Related

RMSE on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#These packages are assumed from the calls below
library(readxl)      # read_excel
library(data.table)  # rbindlist

#This collects the matching file names
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
    dat <- read_excel(z, skip = 0)
    return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = TRUE)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1, sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis to each of the "splits" created above, but I don't know what to call the "splits". Here I have used names, but that is the name of a base R function and won't work. I tried the bigdata object, but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are created by the split() call and are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over the list with sapply/lapply as in the OP's code, but the names$ is incorrect: the lambda function's argument is x, which signifies each element of the list (i.e. a data.frame). Therefore, instead of names$, use x$:
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
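A self-contained version of that fix, with dummy data; rmse is defined inline as the square root of the mean squared error so the sketch runs without the Metrics package:

```r
# Each element of the split list is a data.frame, bound to x inside the lambda.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
bigdata <- data.frame(HUC = c("A", "A", "B", "B"),
                      Predicted = c(1, 2, 3, 4),
                      Actual = c(1, 2, 3, 6))
splitByHUCs <- split(bigdata, bigdata$HUC)
rMSEs <- sapply(splitByHUCs, function(x) rmse(x$Actual, x$Predicted))
rMSEs  # named vector: A = 0, B = sqrt(2)
```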

Saving output of for-loop for every iteration

I am currently working on an imputation project where I need to evaluate my imputation methods. I have an incomplete data frame with NAs, from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases, which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the frame containing the complete cases. The data frame with the generated NAs gets stored in the object "result", as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop which generates the NAs inside another loop that uses the replicate() command, counts from 1:100, and saves the 100 replicated data frames, but it didn't work at all.
result = data.frame(res0 = rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
    dat = comp_cas[, i]
    missRate = Z32_miss_item$miss_per_item[i]
    cat(i, " ", paste0(dat, collapse = ","), " ", missRate, "!\n")
    df <- data.frame("res" = GenMiss(x = dat, missrate = missRate), stringsAsFactors = FALSE)
    colnames(df) = gsub("res", paste0("Var", i), colnames(df))
    result = cbind(result, df)
}
result = result[, -1]
I expect that every data frame of the 100 runs get saved in a separate .rda file in my project folder.
Also: are imputation and the evaluation of its fitness beginner-level topics in R, or at what level of proficiency am I, judging from the code that I posted?
It is difficult to guess what exactly you are doing without some dummy data, but it is fine to have loops within loops and to save data.frames. Firstly, I would avoid the replicate function here, as it has a strange syntax, and stick with plain loops. Secondly, make sure the nested loops use different indexes (a loop over i should be surrounded by, say, a loop over j), since loop variables are not scoped to their loop in R. Finally, use saveRDS rather than save, as you can then have each object (data.frame) saved in a separate .rds file. The save function is designed for saving your whole workspace so that you can pick up where you left off.
fun <- function(i){
    df <- data.frame(x = rnorm(5))
    names(df) <- paste0("x", i)
    df
}

for(j in 1:100){
    res <- data.frame(id = 1:5)
    for(i in 1:10){
        res <- cbind(res, fun(i))
    }
    saveRDS(res, sprintf("replication_%s.rds", j))
}
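Each saved replication can then be read back individually with readRDS. A minimal round-trip sketch, using a temporary file so it is self-contained (the file name is made up):

```r
# saveRDS writes a single object; readRDS returns it unchanged.
path <- file.path(tempdir(), "replication_1.rds")
res <- data.frame(id = 1:5, x1 = rnorm(5))
saveRDS(res, path)
res_back <- readRDS(path)
identical(res, res_back)  # TRUE
```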

Getting error 'incorrect number of dimensions' when executing a function in R 3.1.2. What does it mean and how do I fix it?

Let me first describe the function and what I have to process.
Basically there's this folder containing some 300 comma-separated-value files. Each file has an ID associated with it, as in 200.csv has ID 200 in it and contains some data pertaining to sulphate and nitrate pollutants. What I have to do is calculate the mean of these pollutants for either one ID or a range of IDs. For example, calculating the mean of sulphate for ID 5, or the same thing for IDs 5:10.
This is my procedure for processing the data, but I'm getting a silly error at the end. I have:
A list vector of these .csv files.
A master data frame combining all these files; I used the data.table package for this.
Time to describe the function:
pollutantmean <- function(spectate, pollutant, id){
    specdata <- rbindlist(filelist)
    setkey(specdata, 'ID') ## because ID needs to be sorted out
    for(i in id){
        if(pollutant == 'sulphate'){
            return(mean(specdata[, 'sulphate'], na.rm = TRUE))
        } else if(pollutant == 'nitrate'){
            return(mean(specdata[, 'nitrate'], na.rm = TRUE))
        } else {
            print('NA')
        }
    }
}
Now the function is very simple. I defined spectate, and I defined the for loop to calculate data for each ID. I get no error when the function is defined, but one error remains as the last obstacle:
'Error in "specdata"[, "sulfate"] : incorrect number of dimensions'
when I execute the function. Could someone elaborate?
I think this is the kind of task where the plyr package will be really helpful.
More specifically, I think you should use ldply and a bespoke function.
ldply: takes a list as the input and returns a data frame as the output. The list should be the directory contents, and the output will be the summary values from each of the csv files.
Your example isn't fully reproducible so the code below is just an example structure:
require(plyr)
require(stringr)

dir_with_files <- "dir_with_files"
files_to_extract <- list.files(dir_with_files, pattern = ".csv$")
files_to_extract <- str_replace(files_to_extract, ".csv$", "")

fn <- function(this_file_name){
    file_loc <- paste0(dir_with_files, "/", this_file_name, ".csv")
    full_data <- read.csv(file_loc)
    out <- data.frame(
        variable_name = this_file_name,
        variable_summary = mean(full_data$variable_to_summarise)
    )
    return(out)
}

summarised_output <- ldply(files_to_extract, fn)
Hope that helps. It probably won't work first time, and you might want to add some additional conditions and so on to handle files that don't have the expected contents. Happy to discuss what it's doing, but it's probably best to read up on plyr's split-apply-combine approach, as once you understand it, all kinds of tasks become much easier.
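If plyr isn't available, ldply(files, fn) behaves, for this use, like do.call(rbind, lapply(files, fn)). A self-contained base-R sketch of the same pattern, using throwaway CSVs in a temp directory (the file names and the variable_to_summarise column are made up to mirror the answer):

```r
# Write two tiny CSVs to a fresh temp dir so the sketch runs anywhere.
dir_with_files <- file.path(tempdir(), "csv_demo")
dir.create(dir_with_files, showWarnings = FALSE)
write.csv(data.frame(variable_to_summarise = 1:3),
          file.path(dir_with_files, "a.csv"), row.names = FALSE)
write.csv(data.frame(variable_to_summarise = 4:6),
          file.path(dir_with_files, "b.csv"), row.names = FALSE)

# One summary row per file, as in the answer's fn().
fn <- function(this_file_name){
    full_data <- read.csv(file.path(dir_with_files, paste0(this_file_name, ".csv")))
    data.frame(variable_name = this_file_name,
               variable_summary = mean(full_data$variable_to_summarise))
}

files_to_extract <- sub("\\.csv$", "", list.files(dir_with_files, pattern = "\\.csv$"))
# ldply(files_to_extract, fn) is equivalent here to:
summarised_output <- do.call(rbind, lapply(files_to_extract, fn))
summarised_output  # one row per file: means 2 and 5
```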

How to pass an R function argument to subset a column

First I am new here, this is my first post so my apologies in advance if I am not doing everything correct. I did take the time to search around first but couldn't find what I am looking for.
Second, I am pretty sure I am breaking a rule, in that this question is related to a coursera.org R programming course I am taking (this was part of an assignment). The due date has lapsed and I have failed for now; I will repeat the subject next month and try again, but I am currently in damage control, trying to find out what went wrong.
Basically below is my code:
What I am trying to do is read in data from a series of files. These files are four columns wide with the titles: Date, nitrate, sulfate and id and contain various rows of data.
The function I am trying to write should take as arguments the directory of the files, the pollutant (so either nitrate or sulfate), and the set of numbered files, e.g. files 1 and 2, or files 1 through 4, etc. The return of the function should be the average value of the selected pollutant across the selected files.
I would call the function using a call like this
pollutantmean("datafolder", "nitrate", 1:3)
and the return should just be a number, which in this case is the average of nitrate across data files 1 through 3.
OK, I hope I have provided enough information. Other stuff that may be useful is:
Operating system :Ubuntu
Language: R
Error message received:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
As I say, the data files are a series of files located in a folder and are four columns wide and vary as to the number of rows.
My function code is as follows:
pollutantmean <- function(directory, pollutant, id = 1:5) { # content of the function
    # create a list of files (a character vector)
    files_list <- dir(directory, full.names = TRUE)
    # now create an empty data frame
    dat <- data.frame()
    # next, execute a loop to read all the selected data files into the data frame
    for (i in 1:5) {
        dat <- rbind(dat, read.csv(files_list[i]))
    }
    # subset the rows matching the selected monitor numbers
    dat_subset <- dat[dat[, "ID"] == id, ]
    # identify the median of the pollutant, ignoring the NA values
    median(dat_subset$pollutant, na.rm = TRUE)
}
OK, that is it. Through trial and error I am pretty sure the final line of code, median(dat_subset$pollutant, na.rm = TRUE), is the problem. I pass the function a pollutant argument which should be either sulfate or nitrate, but it seems the dat_subset$pollutant bit of code is what is not working: somehow the passed pollutant argument does not come into play there. The dat_subset$pollutant bit should ideally be equivalent to either dat_subset$nitrate or dat_subset$sulfate, depending on the argument fed to the function.
You cannot subset with the $ operator if you pass the column name in an object, as in your example (where it is stored in pollutant). So try to subset using [], which in your case would be:
median(dat_subset[,pollutant], na.rm = TRUE)
or
median(dat_subset[[pollutant]], na.rm = TRUE)
Does that work?
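A minimal, self-contained illustration of the difference (column names and values invented):

```r
# $ looks up the literal name "pollutant" (which doesn't exist -> NULL),
# while [[ ]] and [ , ] evaluate the variable and use its value as the column name.
dat_subset <- data.frame(nitrate = c(1, 2, NA), sulfate = c(4, NA, 6))
pollutant <- "nitrate"
dat_subset$pollutant                           # NULL -- the source of the is.na() warning
median(dat_subset[[pollutant]], na.rm = TRUE)  # 1.5
median(dat_subset[, pollutant], na.rm = TRUE)  # 1.5 as well
```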

Building a mean across several csv files

I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.
I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column whose mean you want to calculate (inside the data frames), and the files you want to use in the calculation (id).
I have tried to keep it as simple as possible:
pm <- function(directory, pollutant, id = 1:332) {
    setwd("C:/Users/cw/Documents")
    setwd(directory)
    files <<- list.files()
First of all, set the wd and get a list of all files
    x <- id[1]
    x
get the starting point of the user-specified ID.
Problem
    for (i in x:length(id)) {
        df <- rep(NA, length(id))
        df[i] <- lapply(files[i], read.csv, header = T)
        result <- do.call(rbind, df)
        return(df)
    }
}
So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.
So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.
I can get single .csv files and put them into a dataframe, but not several.
Does anybody have a hint on how I could proceed?
Based on your example (16 files for 10:25, i.e. 010.csv, 011.csv, 012.csv, etc.):
Under the assumption that your naming convention follows the order of the files in the directory, you could try:
csvFiles <- list.files(pattern = "\\.csv$")[10:25]  # [10:25] here; in production, use your function parameter instead
file_list <- vector('list', length=length(csvFiles))
df_list <- lapply(X=csvFiles, read.csv, header=TRUE)
names(df_list) <- csvFiles  # OPTIONAL: if you want to trace rows back to the csv they came from
df <- do.call("rbind", df_list)
mean(df[ ,"columnName"])
These code snippets should be easy to adapt and incorporate into your routine.
You can aggregate your csv files into one big table like this:
bigtable <- data.frame()
for(i in 100:250)
{
    infile <- paste("C:/Users/cw/Documents/", i, ".csv", sep = "")
    newtable <- read.csv(infile)
    newtable <- cbind(newtable, rep(i, dim(newtable)[1])) # if you want to be able to identify tables after they are aggregated
    bigtable <- rbind(bigtable, newtable)
}
(you will have to replace 100:250 with the user-specified input).
Then, calculating what you want shouldn't be very hard.
That won't work for files 001 to 099; you'll have to distinguish those from the others because of the leading "0", but it's fixable with a little treatment.
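One way to do that little treatment is zero-padding the counter with sprintf (a sketch; the path mirrors the answer's):

```r
# sprintf("%03d", i) left-pads with zeros to three digits, so the same loop
# covers 001.csv through 332.csv without special-casing 1:99.
sprintf("%03d", c(1, 42, 250))  # "001" "042" "250"
infile <- paste0("C:/Users/cw/Documents/", sprintf("%03d", 7), ".csv")
infile                          # ends in "007.csv"
```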
Why do you have lapply inside a for loop? Just do lapply(files[files %in% paste0(id, ".csv")], read.csv, header=T).
They should also teach you to never use <<-.
