How to pass an R function argument to subset a column

First, I am new here; this is my first post, so my apologies in advance if I am not doing everything correctly. I did take the time to search around first but couldn't find what I am looking for.
Second, I am pretty sure I am breaking a rule in that this question is related to a coursera.org R programming course I am taking (this was part of an assignment). The due date has lapsed and I have failed for now; I will repeat the subject next month and try again, but I am now in damage control, trying to find out what went wrong.
Basically below is my code:
What I am trying to do is read in data from a series of files. These files are four columns wide, with the titles Date, nitrate, sulfate and ID, and contain varying numbers of rows of data.
The function I am trying to write should take as arguments the directory of the files, the pollutant (either nitrate or sulfate), and the set of numbered files, e.g. files 1 and 2, files 1 through 4, etc. The return of the function should be the average value of the selected pollutant across the selected files.
I would call the function using a call like this
pollutantmean("datafolder", "nitrate", 1:3)
and the return should just be a number, which in this case is the average of nitrate across data files 1 through 3.
OK, I hope I have provided enough information. Other stuff that may be useful is:
Operating system: Ubuntu
Language: R
Error message received:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
As I say, the data files are a series of files located in a folder and are four columns wide and vary as to the number of rows.
My function code is as follows:
pollutantmean <- function(directory, pollutant, id = 1:5) { # content of the function
  # create a list of files, a vector I think
  files_list <- dir(directory, full.names = TRUE)
  # now create an empty data frame
  dat <- data.frame()
  # next step is to execute a loop to read all the selected data files into the data frame
  for (i in 1:5) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # subset the rows matching the selected monitor numbers
  dat_subset <- dat[dat[, "ID"] == id, ]
  # identify the median of the pollutant and ignore the NA values
  median(dat_subset$pollutant, na.rm = TRUE)
}
OK, that is it. Through trial and error I am pretty sure the final line of code, "median(dat_subset$pollutant, na.rm = TRUE)", is the problem. I pass a pollutant argument to the function, which should be either sulfate or nitrate, but the dat_subset$pollutant bit of code is what is not working. Somehow the passed pollutant argument does not make it into the function body: the dat_subset$pollutant bit should ideally be equivalent to either dat_subset$nitrate or dat_subset$sulfate, depending on the argument fed to the function.

You cannot subset with the $ operator when the column name is passed in an object, as in your example (where it is stored in pollutant). So subset using [ ] instead; in your case that would be:
median(dat_subset[,pollutant], na.rm = TRUE)
or
median(dat_subset[[pollutant]], na.rm = TRUE)
Does that work?
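For completeness, here is a minimal sketch of what the whole function could look like with that change. This assumes the ID column in the files is called ID and that you want the mean, as your description says, rather than the median:
pollutantmean <- function(directory, pollutant, id = 1:5) {
  files_list <- dir(directory, full.names = TRUE)
  dat <- data.frame()
  # read only the selected files instead of a hard-coded 1:5
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  # %in% keeps every row whose ID is in the id vector
  dat_subset <- dat[dat[, "ID"] %in% id, ]
  # [[ ]] looks the column up by the name stored in pollutant
  mean(dat_subset[[pollutant]], na.rm = TRUE)
}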

Related

R - Error in colMeans(wind.speed, na.rm = T) : 'x' must be numeric

I am trying to import a single column from each file in a text-file data set, where each file is a single day of data. I want to take the mean of each day's wind speed. Here is the code I have written for that:
daily.wind.speed <- c()
file.names <- dir("C:\\Users\\User Name\\Desktop\\R project\\Sightings Data\\Weather Data", pattern = ".txt")
for(i in 1:length(file.names))
{
  ## import data file into data frame
  weather.data <- read.delim(file.names[i])
  ## extract wind speed column
  wind.speed <- weather.data[3]
  ## attempt to fix numeric error
  ## wind.speed.num <- as.numeric(wind.speed)
  ## take the column mean of wind speed
  daily.avg <- colMeans(wind.speed, na.rm = T)
  ## add daily average to list
  daily.wind.speed <- c(daily.wind.speed, daily.avg)
  ## print for troubleshooting and progress
  print(daily.wind.speed)
}
This code seems to work on some files in my data set, but others give me this error during this section of the code:
> daily.avg <- colMeans(wind.speed,na.rm=T)
Error in colMeans(wind.speed, na.rm = T) : 'x' must be numeric
I am also having trouble converting these values to numeric, and am looking for options either to convert my data to numeric or to take the mean in a different way that doesn't encounter this issue.
> as.numeric(wind.speed.df)
Error: (list) object cannot be coerced to type 'double'
weather.data example (screenshot of the data frame omitted)
Even though this is not a reproducible example, the problem is that you are applying a matrix function to a vector, so it won't work. Just change the colMeans to mean.
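A minimal sketch of that change, assuming the wind speed really is in the third column of weather.data:
## take the column as a vector with [[ ]] instead of [ ], then average it
wind.speed <- weather.data[[3]]
daily.avg <- mean(wind.speed, na.rm = TRUE)
## if the column was read in as a factor, go through character first
## daily.avg <- mean(as.numeric(as.character(wind.speed)), na.rm = TRUE)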

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below, essentially a long sheet of rows with values in three columns. What I need is for a way to scan the sheet looking for particular combinations of values in the first and second column - combinations that are specified in a second spreadsheet of targets (see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of column 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
  for(i in targets){
    if(SID_IA[[n]] %in% targets){
      print(SID_IA[[n]], "in sentence", data[n,1], "is ", data[n,3])
    }else{
      print(FALSE)
    }
    SID_IA[[n]] <- NULL
  }
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
OK, I'm attempting an answer that should require a minimum of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly the same in data and target, then it is as simple as a merge:
merge(target,data,all = F)
If, on the other hand, you want to keep the data table with the target rows marked, the easiest way is to make an index column:
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialogue to open data files; check out file.choose().
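For example, as a hypothetical replacement for the hard-coded paths above:
# opens an interactive file picker and returns the selected path
target_file <- file.choose()
data_file <- file.choose()
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")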
(Posted on behalf of the OP).
Following from #R.S.'s suggestion that didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines, and then how to remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets, data, all = F)
write.xlsx(extracted, output_file1)
combined <- rbind(data, extracted)
minus.target <- combined[!duplicated(combined, fromLast = FALSE) & !duplicated(combined, fromLast = TRUE), ]
write.xlsx(minus.target, output_file2)

getting error: 'incorrect number of dimensions' when executing a function in R 3.1.2. What does it mean and how to fix it?

Let me first describe the function and what I have to process.
Basically, there's this folder containing some 300 comma-separated value files. Each file has an ID associated with it, as in 200.csv has ID 200 in it, and contains some data pertaining to sulphate and nitrate pollutants. What I have to do is calculate the mean of these particles for either one ID or a range of IDs. For example, calculating the mean of sulphate for ID 5, or calculating the same thing for IDs 5:10.
This is my procedure for processing the data, but I'm getting a silly error at the end.
I have a list vector of these .csv files.
A master data frame combining all these files; I used the data.table package for this.
Time to describe the function:
pollutantmean <- function(spectate, pollutant, id){
  specdata <- rbindlist(filelist)
  setkey(specdata, 'ID') ## because ID needs to be sorted out
  for(i in id){
    if(pollutant == 'sulphate'){
      return(mean(specdata[, 'sulphate'], na.rm = TRUE))
    }else{
      if(pollutant == 'nitrate'){
        return(mean(specdata[, 'nitrate'], na.rm = TRUE))
      }else{
        print('NA')
      }
    }
  }
}
Now the function is very simple. I defined spectate, and I defined the for loop to calculate data for each id. I get no error when the function is created, but there is one error that remains the last obstacle:
'Error in "specdata"[, "sulfate"] : incorrect number of dimensions'
This appears when I execute the function. Could someone elaborate?
I think this is the kind of task where the plyr package will be really helpful.
More specifically, I think you should use ldply and a bespoke function.
ldply: a list as the input and a data frame as the output. The list should be the directory contents, and the output will be the summary values from each of the csv files.
Your example isn't fully reproducible so the code below is just an example structure:
require(plyr)
require(stringr)
dir_with_files <- "dir_with_files"  # directory containing the csv files
files_to_extract <- list.files(dir_with_files, pattern=".csv$")
files_to_extract <- str_replace(files_to_extract, ".csv$", "")
fn <- function(this_file_name){
file_loc <- paste0(dir_with_files, "/", this_file_name, ".csv")
full_data <- read.csv(file_loc)
out <- data.frame(
variable_name=this_file_name,
variable_summary= mean(full_data$variable_to_summarise)
)
return(out)
}
summarised_output <- ldply(files_to_extract, fn)
Hope that helps. It probably won't work first time, and you might want to add some additional conditions and so on to handle files that don't have the expected contents. Happy to discuss what it's doing, but it's probably best to read this, as once you understand the approach it makes all kinds of tasks much easier.

R for loop index issue

I am new to R and I am practicing writing R functions. I have 100 separate csv data files stored in my directory, and each is labeled by its id, e.g. "1" to "100".
I would like to write a function that reads some selected files into R, calculates the number of complete cases in each data file, and arranges the results into a data frame.
Below is the function that I wrote. First I read all the file names into "dat". Then, using the rbind function, I read the selected files I want into a data frame. Lastly, I compute the number of complete cases using sum(complete.cases()). This seems straightforward, but the function does not work. I suspect there is something wrong with the index but have not figured out why. I searched through various topics but could not find a useful answer. Many thanks!
complete = function(directory, id) {
  dat = list.files(directory, full.name = T)
  dat.em = data.frame()
  for (i in id) {
    dat.ful = rbind(dat.em, read.csv(dat[i]))
    obs = numeric()
    obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i, ]))
  }
  data.frame(ID = id, count = obs)
}
complete("envi", c(1, 3, 5))
I get an error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs ends up with only one value (the number of complete cases in the last file in dat).
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
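For what it's worth, a minimal fix that keeps your loop structure would move obs outside the loop and read each file on its own. This is just a sketch, assuming (as in your filter) that each csv only contains rows for its own ID:
complete = function(directory, id) {
  dat = list.files(directory, full.names = T)
  obs = numeric(length(id))           # one slot per requested file
  for (j in seq_along(id)) {
    dat.ful = read.csv(dat[id[j]])    # read just this file
    obs[j] = sum(complete.cases(dat.ful))
  }
  data.frame(ID = id, count = obs)
}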
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {
  # Read each csv file in turn and store its name and number of complete cases
  # in a list
  obs.list = lapply(files, function(x) {
    dat = read.csv(paste0(directory, "/", x))
    data.frame(fileName = x, count = sum(complete.cases(dat)))
  })
  # Return a data frame with the number of complete cases for each file
  return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern=".csv")
complete(getwd(), filesToRead)

R Programming: Difficulty removing NAs from frame when using lapply

Full disclosure: I am taking a Data Science course on Coursera. For this particular question, we need to calculate the mean of some pollutant data that is being read in from multiple files.
The main function I need help with also references a couple other functions that I wrote in the script. For brevity, I'm just going to list them and their purpose:
boundIDs: I use this to bound the input so that inputs won't be accepted that are out of range. (range is 1:332, so if someone enters 1:400 this changes the range to 1:332)
pollutantToCode: converts the pollutant string entered to that pollutant's column number in the data file
fullFilePath: creates the file name and appends it to the full file path. So if someone states they need the file for ID 1 in directory "curse/your/sudden/but/inevitable/betrayal/", the function will return "curse/your/sudden/but/inevitable/betrayal/001.csv" to be added to the file list vector.
After all that, the main function I'm working with is:
pollutantmean <- function(directory = "", pollutant, id = 1:332){
  id <- boundIDs(id)
  pollutant <- pollutantToCode(pollutant)
  numberOfIds <- length(id)
  fileList <- character(numberOfIds)
  for (i in 1:numberOfIds){
    if (id[i] > 332){
      next
    }
    fileList[i] <- fullFilePath(directory, id[i])
  }
  data <- lapply(fileList, read.csv)
  print(data[[1]][[pollutant]])
}
Right now, I'm intentionally printing only the first frame of data to see what my output looks like. To remove the NAs I've tried using:
data <- lapply(fileList, read.csv)
data <- data[!is.na(data)]
But the NAs remained, so then I tried computing the mean directly and using the na.rm parameter:
print(mean(data[[1]][[pollutant]], na.rm = TRUE))
But the mean was still "NA". Then I tried na.omit:
data <- lapply(fileList, na.omit(read.csv))
...and unfortunately the problem persisted.
Can someone please help? :-/
(PS: Right now I'm just focusing on the first frame of whatever is read in, i.e. data[[1]], since I figure if I can't get it for the first frame there's no point in iterating over the rest.)
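A minimal sketch of the two NA-handling routes being attempted here (assuming pollutant has already been converted to a column index or name by pollutantToCode, and that the column really is numeric): na.omit has to be wrapped in a function when used with lapply, and is.na() has to be applied to a single column, not to the list of data frames.
# option 1: drop incomplete rows in each file (na.omit wrapped in a function)
data <- lapply(fileList, function(f) na.omit(read.csv(f)))

# option 2: keep the NAs and remove them only when averaging one column
data <- lapply(fileList, read.csv)
values <- data[[1]][[pollutant]]
mean(values[!is.na(values)])   # same result as mean(values, na.rm = TRUE)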
