No error, but no results either, in R

I am trying to write a function in R that calculates the means of nitrate, sulfate and ID. My original dataframe has 4 columns (date, nitrate, sulfate, ID), so I wrote the following code:
prueba<-read.csv("C:/Users/User/Desktop/coursera/001.csv",header=T)
columnmean<-function(y, removeNA=TRUE){ #y will be a matrix
whichnumeric<-sapply(y, is.numeric)#which columns are numeric
onlynumeric<-y[ , whichnumeric] #selecting just the numeric columns
nc<-ncol(onlynumeric) #number of columns in onlynumeric
means<-numeric(nc)#empty vector for the means
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
}
columnmean(prueba)
When I run the steps line by line on my data, without wrapping them in function(), I get the mean values. But when I call the function so that it does all the steps itself, it doesn't raise an error and it doesn't compute any value either: my environment contains only the dataframe 'prueba' and the columnmean function.
What am I doing wrong?

A reproducible example would be nice (although not absolutely necessary in this case).
You need a final line return(means) at the end of your function. (Some old-school R users maintain that means alone is OK - R automatically returns the value of the last expression evaluated within the function whether return() is specified or not - but I feel that using return() explicitly is better practice.)
colMeans(y[sapply(y, is.numeric)], na.rm=TRUE)
is a slightly more compact way to achieve your goal (although there's nothing wrong with being a little more verbose if it makes your code easier for you to read and understand).
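As a minimal sketch of the fix described above, here is the function with an explicit return(means), checked against the colMeans one-liner. The small data frame is a made-up stand-in for the CSV file from the question, and the column values are assumptions chosen only for illustration.

```r
# Hypothetical stand-in for the CSV file read in the question
prueba <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03"),
  sulfate = c(1.0, NA, 3.0),
  nitrate = c(0.5, 1.5, NA),
  ID      = c(1L, 1L, 1L)
)

columnmean <- function(y, removeNA = TRUE) {
  onlynumeric <- y[sapply(y, is.numeric)]   # keep the numeric columns only
  means <- numeric(ncol(onlynumeric))       # pre-allocate the result vector
  for (i in seq_along(onlynumeric)) {
    means[i] <- mean(onlynumeric[[i]], na.rm = removeNA)
  }
  names(means) <- names(onlynumeric)
  return(means)                             # explicitly return the vector of means
}

result <- columnmean(prueba)
print(result)

# The compact version should give the same numbers
compact <- colMeans(prueba[sapply(prueba, is.numeric)], na.rm = TRUE)
print(all.equal(result, compact))
```

Assigning the result (result <- columnmean(prueba)) is what makes it show up in the environment; the function itself never creates objects there.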

The result of an R function is the value of the last expression. Your last expression is:
for(i in 1:nc){
means[i]<-mean(onlynumeric[,i], na.rm = TRUE)
}
It may seem strange that the value of that expression is NULL, but that's the way it is with for-loops in R. The means vector does get changed sequentially, but it is never returned, which means that BenBolker's advice to use return(.) is correct (as his advice almost always is). For-loops in R are a notable exception to the functional programming paradigm. They provide a mechanism for looping (as do the various *apply functions) but the commands inside the loop exert their effects in the calling environment via side effects (unlike the apply functions).
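A short sketch of that behaviour, using made-up function names (no_return, with_return) purely for illustration: the loop itself evaluates to NULL, so a function that ends with a loop hands back NULL even though the vector was filled in.

```r
# The value of a for loop is NULL, whatever happens inside it
loop_value <- (for (i in 1:3) i * 2)
print(is.null(loop_value))        # TRUE

# Hypothetical example: the loop is the last expression, so NULL comes back
no_return <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
}

# Same body, but the filled vector is the last expression
with_return <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

print(is.null(no_return(3)))      # TRUE
print(with_return(3))             # 1 4 9
```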

Related

How to loop through columns of a data.frame and use a function

This has probably been answered already and in that case, I am sorry to repeat the question, but unfortunately, I couldn't find an answer to my problem. I am currently trying to work on the readability of my code and trying to use functions more frequently, yet I am not that familiar with it.
I have a data.frame and some columns contain NA's that I want to interpolate with, in this case, a simple kalman filter.
require(imputeTS)
#some test data
col <- c("Temp","Prec")
df_a <- data.frame(c(10,13,NA,14,17),
c(20,NA,30,NA,NA))
names(df_a) <- col
#this is my function I'd like to use
gapfilling <- function(df,col){
print(sum(is.na(df[,col])))
df[,col] <- na_kalman(df[,col])
}
#this is my for-loop to loop through the columns
for (i in col) {
gapfilling(df_a, i)
}
I have two problems:
My for loop works, yet it doesn't overwrite the data.frame. Why?
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
How can I achieve this without a for-loop? As far as I am aware you should avoid for-loops if possible and I am sure it's possible in my case, I just don't know how.
You most definitely do not have to avoid for loops. What you should avoid is using a loop to perform actions that could be vectorized. Loops are in general just fine; they are (much) slower than loops in compiled languages such as C++, but comparable to loops in languages such as Python.
My for loop works, yet it doesn't overwrite the data.frame. Why?
This is a problem with overwriting values within a function, or what is referred to as scope. Basically any assignment is restricted to its current environment (or scope). Take the example below:
f <- function(x){
a <- x
cat("a is equal to ", a, "\n")
return(3)
}
x <- 4
f(x)
a is equal to 4
[1] 3
print(a)
Error in print(a) : object 'a' not found
As you can see, "a" definitely exists, but it stops existing after the function call has been fulfilled. It is restricted to the environment (or scope) of the function. Here the scope is basically the time at which the function is run.
To get around this, assign the function's result back to the data frame in the calling (here, global) environment:
for (i in col) {
df_a[, i] <- gapfilling(df_a, i)
}
Now for readability (not speed) one could change this to an lapply call:
df_a[, col] <- lapply(df_a[, col], na_kalman)
I want to stress that this is not faster than using a loop. lapply iterates over each column, just as a loop would. A speed-up would only be possible if, say, na_kalman were written to take multiple columns at once and saved time by using optimized C or C++ code.
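One way to make the scoping point concrete is to have the function return the whole modified data frame and reassign it in the loop. This is only a sketch: fill_with_mean is a hypothetical stand-in for na_kalman (simple mean imputation, not a Kalman filter), used so the example runs without the imputeTS package.

```r
# Hypothetical stand-in for imputeTS::na_kalman: replaces NAs with the column mean
fill_with_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

df_a <- data.frame(Temp = c(10, 13, NA, 14, 17),
                   Prec = c(20, NA, 30, NA, NA))

gapfilling <- function(df, col) {
  df[[col]] <- fill_with_mean(df[[col]])  # modifies only the function's local copy
  df                                      # so hand the modified copy back
}

for (i in names(df_a)) {
  df_a <- gapfilling(df_a, i)             # reassign in the calling environment
}
print(df_a)
```

Without the reassignment inside the loop, df_a in the global environment would keep its NA values, because each call works on its own copy.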

Using values from a dataframe to apply a function to a vector

I'll start off by admitting that I'm terrible at the apply functions, and function writing in general, in R. I am working on a course project to clean and model some text data, and I would like to include a step that cleans up contractions.
The qdapDictionaries package includes a contractions data frame with two columns, the first column is the contraction and the second is the expanded version. For example:
contraction expanded
5 aren't are not
I want to use the values in here to run a gsub function on my text, which I still have in a large character element. Something like gsub(contr,expd,text).
Here's an example vector that I am using to test things out:
vct <- c("I've got a problem","it shouldn't be that hard","I'm having trouble 'cause I'm dumb")
I'm stumped on how to loop through the data frame (without actually writing a loop, because it seems like the least efficient way to do it) so I can run all the gsubs that I need.
There's probably a simple answer, but here's what I tried: first, I created a function that would return the expanded version if passed a contraction:
expand <- function(contr) {
expd <- contractions[which(contractions[1]==contr),2]
}
I can use sapply with this and it does work, more or less; looping over the first column in contractions, sapply(contractions[,1],expand) returns a named vector of characters with the expanded phrases.
I can't figure out how to combine this vector with gsub though. I tried writing a second function gsub_expand and changing the expand function to return both the contraction and the expansion:
gsub_expand <- function(list, text) {
text <- gsub(list[[1]],list[[2]],text)
return(text)
}
When I ran gsub_expand(sapply(contractions[,1],expand),vct) it only corrected a portion of my vector.
[1] "I've got a problem" "it shouldn't be that hard" "I'm having trouble because I'm dumb"
The first entry in the contractions data frame is 'cause and because, so the interior sapply doesn't seem to actually be looping. I'm stuck in the logic of what I want to pass to what, and what I'm supposed to loop over.
Thanks for any help.
Two options:
stringr::str_replace_all
The stringr package does mostly the same things you can do with base regex functions, but sometimes in a dramatically simpler way. This is one of those times. You can pass str_replace_all a named list or character vector, and it will use the names as patterns and the values as replacements, so all you need is
library(stringr)
contractions <- c("I've" = 'I have', "shouldn't" = 'should not', "I'm" = 'I am')
str_replace_all(vct, contractions)
and you get
[1] "I have got a problem" "it should not be that hard"
[3] "I am having trouble 'cause I am dumb"
No muss, no fuss, just works.
lapply/mapply/Map and gsub
You can, of course, use lapply or a for loop to repeat gsub. You can formulate this call in a few ways, depending on how your data is stored, and how you want to get it out. Let's first make a copy of vct, because we're going to overwrite it:
vct2 <- vct
Now we can use any of these three:
lapply(1:length(contractions),
function(x){vct2 <<- gsub(names(contractions[x]), contractions[x], vct2)})
# `mapply` is a multivariate version of `sapply`
mapply(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
# `Map` is a multivariate version of `lapply`
Map(function(x, y){vct2 <<- gsub(x, y, vct2)}, names(contractions), contractions)
each of which will return slightly different useless data, but will also save the changes to vct2, which now looks the same as the results of str_replace_all above.
These are a little complicated, mostly because you need to save the updated version of vct2 after each change is made. The vct2 <<- assignment writes to the vct2 initialized outside the function's environment, allowing us to capture the successive changes. Be a little careful with <<-; it's powerful. See ?assignOps for more info.
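If the <<- assignment feels uncomfortable, base R's Reduce can carry the text along from one replacement to the next instead. This is an alternative sketch, not the approach from the answer above; it reuses the three-entry contractions vector from the stringr example and passes fixed = TRUE so the patterns are treated as literal strings.

```r
vct <- c("I've got a problem", "it shouldn't be that hard",
         "I'm having trouble 'cause I'm dumb")
contractions <- c("I've" = "I have", "shouldn't" = "should not", "I'm" = "I am")

# Reduce feeds the result of each gsub call into the next one
expanded <- Reduce(
  function(text, i) {
    gsub(names(contractions)[i], contractions[[i]], text, fixed = TRUE)
  },
  seq_along(contractions),  # indices of the pattern/replacement pairs
  vct                       # starting value of the accumulated text
)
print(expanded)
```

Here vct itself is left untouched and the fully expanded text is returned as an ordinary value.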

Use multiple variables in a function in R

I have this function
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
ANN<-neuralnet(x~y,hidden=10,algorithm='rprop+')
return(ANN)
}
I need the function run like
formula=X1+X2
ANN(DV,formula)
and get the result of the function. The problem is how to tell the function to use the objects that are created while the function runs. I need to run more combinations of x and y through lapply, so I need it to work this way. Any advice on how to achieve this? Thanks
I've edited my answer, this still works for me. Does it work for you? Can you be specific about what sort of errors you are getting?
New response:
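library(neuralnet)
ANN<-function (y){ # y is a one-sided formula, e.g. ~X1+X2
X1<-c(1:10)
DV<-rep(c(0:1),5)
X2<-c(2:11)
dat <- data.frame(DV,X1,X2) # the response must be in the data as well
ANN<-neuralnet(update(DV~.,y),hidden=10,algorithm='rprop+',data=dat)
return(ANN)
}
formula<- ~X1+X2 # a formula is not evaluated here, so X1 and X2 need not exist yet
ANN(formula)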
If you want to specify the two parts of the formula separately, you should still pass them as formulas.
library(neuralnet)
ANN<-function (x,y){
DV<-rep(c(0:1),5)
X1<-c(1:10)
X2<-c(2:11)
formula<-update(x,y)
ANN<-neuralnet(formula,data=data.frame(DV,X1,X2),
hidden=10,algorithm='rprop+')
return(ANN)
}
ANN(DV~., ~X1+X2)
And assuming you're using neuralnet() from the neuralnet library, it seems the data= is required so you'll need to pass in a data.frame with those columns.
Formulas are special because they are not evaluated unless explicitly requested. This is different from just using a symbol, which is evaluated in the proper frame as soon as you use it. This means there's a big difference between DV (a "name") and DV~. (a formula). The latter is safer for passing around to functions and evaluating in a different context. Things get much trickier with symbols/names.
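The difference can be seen without neuralnet at all. The sketch below uses only base R: the formulas are created before DV, X1 or X2 exist, combined with update(), and only looked up in a data frame when model.frame() asks for them. The data frame dat mirrors the values used inside the function above.

```r
# Nothing named DV, X1 or X2 exists yet; creating the formulas raises no error
response   <- DV ~ .
predictors <- ~ X1 + X2

full <- update(response, predictors)
print(full)                      # DV ~ X1 + X2
print(all.vars(full))            # "DV" "X1" "X2"

# reformulate() builds the same formula from character strings
same <- reformulate(c("X1", "X2"), response = "DV")
print(same)

# The variables are only looked up when the formula is evaluated against data
dat <- data.frame(DV = rep(0:1, 5), X1 = 1:10, X2 = 2:11)
mf <- model.frame(full, data = dat)
print(dim(mf))                   # 10 rows, 3 columns
```

Because reformulate() takes plain strings, it can be convenient when the combinations of x and y are generated programmatically, for example inside lapply.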

Remove values from a dataset based on a vector of those values

I have a dataset that looks like this, except it's much longer and with many more values:
dataset <- data.frame(grps = c("a","b","c","a","d","b","c","a","d","b","c","a"), response = c(1,4,2,6,4,7,8,9,4,5,0,3))
In R, I would like to remove all rows containing the values "b" or "c" using a vector of values to remove, i.e.
remove<-c("b","c")
The actual dataset is very long with many hundreds of values to remove, so removing values one-by-one would be very time consuming.
Try:
dataset[!(dataset$grps %in% remove),]
There's also subset:
subset(dataset, !(grps %in% remove))
... which is really just a wrapper around [ that lets you skip writing dataset$ over and over when there are multiple subset criteria. But, as the help page warns:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
‘[’, and in particular the non-standard evaluation of argument
‘subset’ can have unanticipated consequences.
I've never had any problems, but the majority of my R code is scripting for my own use with relatively static inputs.
2013-04-12
I have now had problems. If you're building a package for CRAN, R CMD check will throw a NOTE if you use subset in this way in your code: it will wonder if grps is a global variable, even though subset is evaluating it within dataset's environment (not the global one). So if there's any possibility your code will end up in a package and you feel squeamish about NOTEs, stick with Rcoster's method.
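For code that may end up in a function or package, the [ form can be wrapped so that the column is passed as a string and nothing relies on non-standard evaluation. A minimal sketch, where drop_groups and to_remove are hypothetical names chosen for this example:

```r
dataset <- data.frame(
  grps     = c("a","b","c","a","d","b","c","a","d","b","c","a"),
  response = c(1,4,2,6,4,7,8,9,4,5,0,3)
)
to_remove <- c("b", "c")

# Hypothetical helper: the column name is an ordinary character string
drop_groups <- function(data, column, values) {
  data[!(data[[column]] %in% values), , drop = FALSE]
}

kept <- drop_groups(dataset, "grps", to_remove)
print(kept)
```

The drop = FALSE argument keeps the result a data frame even if only one column is left.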

Subsetting within a function

I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing<-function(Data,GroupVar,condition){
g=Data$c+3
h=Data$c+1
NewD<-data.frame(a,b,g,h)
subset(NewD,select=c(a,b,GroupVar),GroupVar%in%condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the ease of the function there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored how it's probably best to use the good old data[colnames()=="g",]. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
Thank you
It's happening right as the function is trying to define GroupVar in the beginning. R is looking for the object h by itself (not within the dataframe).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
....
DF <- subset(Data, select=c("a","b", GroupVar))
DF[DF[,3] %in% condition,] # GroupVar is the third column; return the matching rows
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
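A self-contained sketch of the same idea, with the grouping column passed as a string and plain [ indexing used throughout. How g and h are computed is taken from the question; building NewD from Data's own columns (rather than from global a and b) is an assumption made here to keep the function independent of the global environment.

```r
D <- data.frame(a = c(1, 2, 3, 4), b = c(2, 2, 3, 5), c = c(1, 1, 2, 2))

subbing <- function(Data, GroupVar, condition) {
  # g and h are created inside the function, as in the question
  NewD <- data.frame(a = Data$a, b = Data$b, g = Data$c + 3, h = Data$c + 1)
  keep <- NewD[[GroupVar]] %in% condition      # GroupVar is a column name string
  NewD[keep, c("a", "b", GroupVar), drop = FALSE]
}

res_g <- subbing(D, GroupVar = "g", condition = 5)
res_h <- subbing(D, GroupVar = "h", condition = 2)
print(res_g)
print(res_h)
```

This covers the plain data frame case only; whether the same pattern carries over to subsetting survey design objects is not shown here.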

Resources