I would like to create a generic function naTrans that replaces 'NA' and '' by NA.
The problem is that I can't replace the dataframe test in the global environment with the modified dataframe (mydf) created within the function. Here's my best try.
# Example dataframe containing 'NA'
test <- as.data.frame(matrix(sample(c('NA', 1:9), 10*10, TRUE), 10))
# My function
naTrans <- function (mydf) {
mydf[mydf == 'NA' | mydf ==''] <- NA
assign(deparse(substitute(mydf))[1], mydf, envir = globalenv())
}
test <- naTrans(test)
any(is.na(test))
# [1] FALSE
Surely the problem lies in the last line of code assign(print(deparse(substitute(mydf))), mydf, envir = globalenv())
Any idea?
I hope the comments in the code are clear enough.
test <- as.data.frame(matrix(sample(c('NA', 1:9), 10*10, TRUE), 10))
naTrans <- function (mydf) {
  mydf[mydf == 'NA' | mydf == ''] <- NA # use the "or" operator; %in% works on vectors, not on data frames
  return(mydf) # return the modified mydf (return() is optional; a bare mydf as the last line also works)
}
test <- naTrans(test) # overwrite the original object in the calling environment
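Since %in% works on vectors, a column-wise variant is also possible. The sketch below is one way to do it, assuming the columns are character or factor; the name naTransCols and the small sample data are made up for illustration.

```r
# Hypothetical column-wise variant of naTrans (not from the original answer)
naTransCols <- function(mydf) {
  # mydf[] <- keeps the data.frame structure while replacing every column
  mydf[] <- lapply(mydf, function(col) {
    col[col %in% c("NA", "")] <- NA  # %in% is applied to one column (a vector) at a time
    col
  })
  mydf
}

# Small deterministic example
test2 <- data.frame(a = c("1", "NA", "3"), b = c("", "5", "6"),
                    stringsAsFactors = FALSE)
test2 <- naTransCols(test2)
any(is.na(test2))
```

As in the answer above, the function returns the modified copy and the caller decides where to store it, so no assign() into the global environment is needed.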
I am trying to write a function with an unspecified number of arguments using ... but I am running into issues where those arguments are column names. As a simple example, if I want a function that takes a data frame and uses within() to make a new column that is several other columns pasted together, I would intuitively write it as
example.fun <- function(input,...){
res <- within(input,pasted <- paste(...))
res}
where input is a data frame and ... specifies column names. This gives an error saying that the column names cannot be found (they are treated as objects). e.g.
df <- data.frame(x = c(1,2),y=c("a","b"))
example.fun(df,x,y)
This returns "Error in paste(...) : object 'x' not found "
I can use attach() and detach() within the function as a workaround:
example.fun2 <- function(input,...){
attach(input)
res <- within(input,pasted <- paste(...))
detach(input)
res}
This works, but it's clunky and runs into issues if there happens to be an object in the global environment that is called the same thing as a column name, so it's not my preference.
What is the correct way to do this?
Thanks
1) Wrap the code in eval(substitute(...code...)) like this:
example.fun <- function(data, ...) {
eval(substitute(within(data, pasted <- paste(...))))
}
# test
df <- data.frame(x = c(1, 2), y = c("a", "b"))
example.fun(df, x, y)
## x y pasted
## 1 1 a 1 a
## 2 2 b 2 b
1a) A variation of that would be:
example.fun.2 <- function(data, ...) {
data.frame(data, pasted = eval(substitute(paste(...)), data))
}
example.fun.2(df, x, y)
2) Another possibility is to convert each argument to a character string and then use indexing.
example.fun.3 <- function(data, ...) {
vnames <- sapply(substitute(list(...))[-1], deparse)
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.3(df, x, y)
3) Other possibilities are to change the design of the function and pass the variable names as a formula or character vector.
example.fun.4 <- function(data, formula) {
data.frame(data, pasted = do.call("paste", get_all_vars(formula, data)))
}
example.fun.4(df, ~ x + y)
example.fun.5 <- function(data, vnames) {
data.frame(data, pasted = do.call("paste", data[vnames]))
}
example.fun.5(df, c("x", "y"))
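One practical difference from the attach() workaround is how name clashes are handled. As a small check, using variant 1a from above and a made-up global object x, the column in the data frame should be found first, because the expression is evaluated with the data frame as its environment:

```r
# Variant 1a, repeated here so the example is self-contained
example.fun.2 <- function(data, ...) {
  data.frame(data, pasted = eval(substitute(paste(...)), data))
}

df <- data.frame(x = c(1, 2), y = c("a", "b"))
x <- "global object"   # deliberately clashes with the column name x

res <- example.fun.2(df, x, y)
res$pasted   # built from df$x and df$y, not from the global x
```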
Consider the following data.frame:
df <- setNames(data.frame(rep("text_2010"),rep(1,5)), c("id", "value"))
I only want to keep the last 4 characters of the cells in the column "id". Therefore, I can use the following code:
df$id <- substr(df$id,nchar(df$id)-3,nchar(df$id))
However, I want to create a function that does the same. Therefore, I create the following function and apply it:
testfunction <- function(x) {
x$id <- substr(x$id,nchar(x$id)-3,nchar(x$id))
}
df <- testfunction(df)
But I do not get the same result. Why is that?
Add return(x) in your function to return the changed object.
testfunction <- function(x) {
x$id <- substr(x$id,nchar(x$id)-3,nchar(x$id))
return(x)
}
df <- testfunction(df)
However, you don't always need an explicit return statement (although it is clearer to have one). By default, R returns the value of the last expression evaluated in the function, so here you can also do
testfunction <- function(x) {
transform(x, id = substring(id, nchar(id)-3))
}
df <- testfunction(df)
which should work the same.
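To see why the original version gives a different result: the last expression in that function is an assignment, and the value of an assignment is the assigned value, returned invisibly. A minimal sketch (the name no_return is only for illustration):

```r
df <- data.frame(id = rep("text_2010", 5), value = rep(1, 5),
                 stringsAsFactors = FALSE)

# Same body as the original testfunction, without return(x)
no_return <- function(x) {
  x$id <- substr(x$id, nchar(x$id) - 3, nchar(x$id))
}

res <- no_return(df)
class(res)    # a character vector, not a data.frame
length(res)
```

So df <- testfunction(df) overwrites the data frame with just the shortened id values.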
We can also write a function that takes the number of characters n as an argument (otherwise n is hard-coded and the function is only flexible with respect to the data) and builds a regex pattern to be used with sub:
testfunction <- function(x, n) {
pat <- sprintf(".*(%s)$", strrep(".", n))
x$id <- sub(pat, "\\1", x$id)
return(x)
}
Testing:
testfunction(df, n = 4)
# id value
#1 2010 1
#2 2010 1
#3 2010 1
#4 2010 1
#5 2010 1
Base R solution attempting to mirror Excel's RIGHT() function:
# Function to extract the right n characters from each element of a provided vector:
right <- function(char_vec, n = 1){
# Check if vector provided isn't of type character:
if(!is.character(char_vec)){
# Coerce it, if not: char_vec => character vector
char_vec <- vapply(char_vec, as.character, "character")
}
# Store the number of characters in each element of the provided vector:
# num_chars => integer vector
num_chars <- nchar(char_vec)
# Return the right hand n characters of the string: character vector => Global Env()
return(substr(char_vec, (num_chars + 1) - n, num_chars))
}
# Application:
right(df$id, 4)
Data:
df <- setNames(data.frame(rep("text_2010"),rep(1,5)), c("id", "value"))
I need to add rows to a data frame. I have many files with many rows, so I have converted the code to a function. When I step through each part of the code it works fine. When I wrap everything in a function, each row from my first loop gets added twice.
My code looks for a string (xx or x). If xx is present, it replaces the xx with the numbers 00-99 (one row for each number) and 0-9. If x is present, it replaces it with the numbers 0-9.
Create DF
a <- c("1.x", "2.xx", "3.1")
b <- c("single", "double", "nothing")
df <- data.frame(a, b, stringsAsFactors = FALSE)
names(df) <- c("code", "desc")
My dataframe
code desc
1 1.x single
2 2.xx double
3 3.1 nothing
My function
newdf <- function(df){
# If I run through my code chunk by chunk it works as I want it.
df$expanded <- 0 # a variable to let me know if the loop was run on the row
emp <- function(){ # This function creates empty vectors for my loop
assign("codes", c(), envir = .GlobalEnv)
assign("desc", c(), envir = .GlobalEnv)
assign("expanded", c(), envir = .GlobalEnv)
}
emp()
# I want to expand xx with numbers 00 - 99 and 0 - 9.
#Note: 2.0 is different than 2.00
# Identifies the rows to be expanded
xd <- grep("xx", df$code)
# I used chr vs. numeric so I wouldn't lose the trailing zero
# Create a vector to loop through
tens <- formatC(c(0:99)); tens <- tens[11:100]
ones <- c("00","01","02","03","04","05","06","07","08","09")
single <- as.character(c(0:9))
exp <- c(single, ones, tens)
# This loop appears to run twice when I run the function: newdf(df)
# Each row is there twice: 2.00, 2.00, 2.01 2.01...
# It runs as I want it to if I just highlight the code.
for (i in xd){
for (n in exp) {
codes <- c(codes, gsub("xx", n, df$code[i])) #expanding the number
desc <- c(desc, df$desc[i]) # repeating the description
expanded <- c(expanded, 1) # assigning 1 to indicate the row has been expanded
}
}
# Binds the df with the new expansion
df <- df[-xd, ]
df <- rbind(as.matrix(df),cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
# Empties the vector to begin another expansion
emp()
xs <- grep("x", df$code) # This is for the single digit expansion
# Expands the single digits. This part of the code works fine inside the function.
for (i in xs){
for (n in 0:9) {
codes <- c(codes, gsub("x", n, df$code[i]))
desc <- c(desc, df$desc[i])
expanded <- c(expanded, 1)
}
}
df <- df[-xs,]
df <- rbind(as.matrix(df), cbind(codes,desc,expanded))
df <- as.data.frame(df, stringsAsFactors = FALSE)
assign("out", df, envir = .GlobalEnv) # This is how I view my dataframe after I run the function.
}
Calling my function
newdf(df)
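A likely explanation for the duplicated rows, judging from the code above: emp() resets codes, desc and expanded in the global environment, but the first loop creates local copies inside newdf(), because codes <- c(codes, ...) assigns locally. The second emp() call does not clear those local copies, so the second rbind() appends the xx expansion again. At top level everything lives in the global environment, which would explain why the code works chunk by chunk. Below is a minimal sketch that avoids global state altogether, assuming the column names code and desc from above; expand_codes and expand_one are hypothetical names.

```r
expand_codes <- function(df) {
  # Replacements for "xx": "0".."9" plus "00".."99"; for "x": "0".."9"
  two <- c(as.character(0:9), sprintf("%02d", 0:99))
  one <- as.character(0:9)

  # Replace every row whose code contains `pattern` by one row per replacement
  expand_one <- function(d, pattern, repl) {
    hit <- grep(pattern, d$code, fixed = TRUE)
    if (length(hit) == 0) return(d)
    new_rows <- do.call(rbind, lapply(hit, function(i) {
      data.frame(
        code = vapply(repl, function(r) sub(pattern, r, d$code[i], fixed = TRUE),
                      character(1), USE.NAMES = FALSE),
        desc = d$desc[i],
        expanded = 1,
        stringsAsFactors = FALSE
      )
    }))
    rbind(d[-hit, , drop = FALSE], new_rows)
  }

  df$expanded <- 0
  df <- expand_one(df, "xx", two)  # double-digit placeholders first
  expand_one(df, "x", one)         # then the remaining single placeholders
}

df <- data.frame(code = c("1.x", "2.xx", "3.1"),
                 desc = c("single", "double", "nothing"),
                 stringsAsFactors = FALSE)
out <- expand_codes(df)
nrow(out)
```

All intermediate vectors are local to the function, and the result is returned rather than assigned into .GlobalEnv, so running it twice or on several files cannot carry state over between calls.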
When I use the aggregate function on a data.frame that contains both character and numeric columns, aggregate fails and returns only NAs for all columns. How can I solve this? My first idea was to check the class of the values, but it did not work.
name <- rep(LETTERS[1:5],each=2)
feat <- paste0("Feat",name)
valuesA <- runif(10)*10
valuesB <- runif(10)*10
daf <- data.frame(ID=name,feature=feat,valueA=valuesA,valueB=valuesB, stringsAsFactors = FALSE)
aggregate(.~ID, data=daf,FUN=mean)
aggregate(.~ID, data=daf,FUN=function(x){
if(is.character(x)){
return(NA)
}else{ return(mean(x))}
})
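One probable cause is that, with . on the left-hand side, the formula method combines the remaining columns with cbind(), which turns all of them into character when one of them is character, so FUN only ever sees character input. The sketch below sidesteps this by aggregating only the numeric columns through the data-frame interface; set.seed() and the name num_cols are additions for illustration.

```r
set.seed(1)  # only to make the example reproducible
name <- rep(LETTERS[1:5], each = 2)
daf <- data.frame(ID = name, feature = paste0("Feat", name),
                  valueA = runif(10) * 10, valueB = runif(10) * 10,
                  stringsAsFactors = FALSE)

# Pick the numeric columns and aggregate them by ID
num_cols <- names(daf)[vapply(daf, is.numeric, logical(1))]
res <- aggregate(daf[num_cols], by = list(ID = daf$ID), FUN = mean)
res
```

If the numeric columns are known in advance, naming them explicitly, as in aggregate(cbind(valueA, valueB) ~ ID, data = daf, FUN = mean), should have the same effect.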
pollutantmean <- function(directory, pollutant, ID = 1:332){
  files_list <- list.files("specdata", full.names = TRUE)
  dat <- data.frame()
  for (i in 1:332){
    dat <- rbind(dat, read.csv(files_list[i]))
  }
  dat_subset <- subset(dat, dat$ID == ID)
  mean(dat_subset$nitrate, na.rm = TRUE)
  mean(dat_subset$sulfate, na.rm = TRUE)
}
pollutantmean(specdata, sulfate, 1:10)
[1] 3.189369
pollutantmean(specdata, nitrate, 70:72)
[1] 3.189369
pollutantmean("specdata", "sulfate", 1:10)
[1] 3.189369
I think the issue is that dat does not exist yet, so we can try:
if (!exists("dat")) {
  dat <- read.csv(files_list[i])
} else {
  dat <- rbind(dat, read.csv(files_list[i]))
}
The behavior you are observing is caused by the line
dat_subset <- subset(dat, dat$ID == ID)
subset will look for any variables in the second argument within the first argument. So when you say dat$ID == ID, it looks for ID first within dat. Since dat has an element named ID, it is equivalent to saying dat$ID == dat$ID. Thus, you aren't actually getting a subset of your data; you're using the complete data set.
To illustrate, let's use the mtcars data set. This data set has 32 rows. One of the columns is named am and is made up of the values 0 and 1, indicating whether the vehicle has an automatic (0) or manual (1) transmission.
nrow(mtcars)
[1] 32
Let's define a vector to use for subsetting mtcars on am:
am <- 0
When we subset against am, we want to get only those rows in mtcars where am == 0.
nrow(subset(mtcars, mtcars$am %in% am))
[1] 32
That didn't seem to work, though. Notice that we get the same result if we use the next line
nrow(subset(mtcars, am %in% am))
[1] 32
Now let's change the name of our subsetting vector and watch what happens.
am_list <- 0
nrow(subset(mtcars, am %in% am_list))
[1] 19
In this case, subset did not find a column named am_list in mtcars, so it looked outside of mtcars to find an object called am_list. This behavior is one of the reasons behind the warning in ?subset
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
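With standard [ indexing there is no such ambiguity, because am is looked up as an ordinary variable and the column has to be written out as mtcars$am. A short sketch with the same mtcars example (the name rows is just for illustration):

```r
am <- 0

# Logical row index built with ordinary evaluation: `am` is the vector defined above
rows <- mtcars[mtcars$am %in% am, ]
nrow(rows)
```

This keeps only the rows where the am column equals 0, even though the subsetting vector has the same name as the column.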
I think you would be better off writing your function this way (I took some liberty with rewriting it)
pollutantmean <- function(directory, pollutant, ID = 1:332){
# Get file names from directory argument
files_list <- list.files(directory, full.names = TRUE)
# Create a single data frame from all of the files
dat <- lapply(files_list,
read.csv)
dat <- do.call("rbind", dat)
# Subset the data appropriately
dat_subset <- dat[dat$ID %in% ID, ]
# Get the mean of the two columns, return in a list
lapply(dat_subset[c("nitrate", "sulfate")],
mean,
na.rm = TRUE)
}
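If a single mean for the requested pollutant is wanted, as the pollutant argument suggests, the column can be selected by name with [[. This is only a sketch under that assumption: pollutantmean2 is a hypothetical name, and the small CSV files are throwaway data written to a temporary directory so the example runs on its own.

```r
pollutantmean2 <- function(directory, pollutant, ID = 1:332) {
  files_list <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  dat <- do.call(rbind, lapply(files_list, read.csv))
  keep <- dat$ID %in% ID
  # pollutant is a character string such as "sulfate" or "nitrate"
  mean(dat[[pollutant]][keep], na.rm = TRUE)
}

# Throwaway demo data: three monitors with two readings each
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(ID = i, sulfate = c(i, NA), nitrate = c(10 * i, 20 * i)),
            file.path(demo_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}

s <- pollutantmean2(demo_dir, "sulfate", 1:2)
n <- pollutantmean2(demo_dir, "nitrate", 3)
c(sulfate = s, nitrate = n)
```

Passing the directory and pollutant as character strings matches the third call in the question, pollutantmean("specdata", "sulfate", 1:10).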