How can I use apply properly in R on this dataframe column?

I have a dataframe column that contains NA values, and I want to know how I can use apply (or lapply, sapply, ...) on that column.
I've tried apply and lapply, but both return an error.
The function I want to apply to the column is:
a.b <- function(x, y = 165){
if (x < y)
return('Good')
else if (x > y)
return('Bad')
}
the column of the dataframe is:
data$col = 180 170 NA NA 185 185
When I use apply I get:
apply(data$col, 2, a.b)
Error in apply(data$col, 2, a.b) :
dim(X) must have a positive length
I have tried dim(data$col) and it returns NULL; I think that is because of the NAs.
I also use lapply and I get:
lapply(data$col, a.b)
Error in if (x < y) return("Good") else if (x > y) return("Bad") :
missing value where TRUE/FALSE needed
This is for a beginners' R course that I am taking, so I am sorry if I made some mistakes. Thanks for taking the time to read this and trying to help.

apply is used on a matrix, not a vector. Try:
a.b <- function(x, y = 165){
  if (is.na(x)){
    return("NA")
  } else if (x < y){
    return('Good')
  } else if (x > y){
    return('Bad')
  }
}
data$col <- sapply(data$col, a.b)

You should be able to solve this with mapply by specifying the values to pass into your parameters:
mapply(a.b, x = data[,'col'], y = 165)
Note that you may need to modify your a.b() function in order to handle the NAs.

There are a few issues going on here:
apply is meant to run on something with dimensions to act over, which is what the MARGIN argument selects. A column, which is what you're passing to apply, has no dimensions. See below:
> dim(mtcars)
[1] 32 11
> dim(mtcars$cyl)
NULL
apply and lapply are meant to run a function over every column of a data frame (or every row, if you use that margin with apply). If you only want to replace one column, you should not use apply. Do something like data$my_col <- my_func(data$my_col) to replace my_col with the result of passing it to my_func.
Comparisons involving NA do not return TRUE or FALSE. Note that 7 < NA returns NA. Your if statement is looking for a TRUE or FALSE value but getting an NA value, hence the error in your second attempt. If you want to handle NA values, you may need to incorporate that into your function with is.na.
Your function should be vectorized. See circle 3 of the R-Inferno. Currently, it only returns a length 1 vector of "Good" or "Bad". My hunch is that what you want is similar to the following (although not exactly the same when x == y, where this version returns "Bad"):
a.b <- function(x, y = 165){
ifelse(x < y, "Good", "Bad")
}
I believe the above info should get you where you want to be.
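To make the points above concrete, here is a minimal sketch of a vectorized version called directly on a column, with no apply involved. The vector col is a stand-in for data$col with the values from the question, and returning NA when x equals y is an assumption, since the original function does not say what should happen in that case.

```r
# Stand-in for data$col, using the values shown in the question
col <- c(180, 170, NA, NA, 185, 185)

# Vectorized: ifelse() works element by element and leaves NA where the
# comparison itself is NA, so no explicit is.na() branch is needed.
# Returning NA for x == y is an assumption made for this sketch.
a.b <- function(x, y = 165) {
  ifelse(x < y, "Good", ifelse(x > y, "Bad", NA_character_))
}

result <- a.b(col)
result
# "Bad" "Bad" NA NA "Bad" "Bad"
```

With a real data frame the call would simply be data$col <- a.b(data$col).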

Related

R: mapply with vector argument

I have a function of this form:
foo<-function(x,y){
if(length(y)==1){
return(x*y)
}
else{
return(x-y[1]*y[2])
}
}
and for the y argument I pass either a number or a vector of numbers:
> #test function:
> foo(1,2)
[1] 2
> foo(1,c(1,2))
[1] -1
Now I wish to use mapply with this function, but I run into problems when I pass a vector for the y argument:
df<-data.frame(
"a"<-floor(runif(6, 1,10)),
"b"<-floor(runif(6, 18,80)),
"c"<-floor(runif(6, 1,80)),
"d"<-floor(runif(6, 100,800)),
"e"<-floor(runif(6, 1000,4000)),
"f"<-floor(runif(6, 1,10)),
"g"<-floor(runif(6, 5,80))
)
names(df)=c("a","b","c","d","e","f","g")
The following works fine:
> mapply(FUN=foo,df["a"],df["b"])
But I run into trouble when I try the following:
> mapply(FUN=foo,df["a"],cbind(df["b"],df["c"]))
I'm very grateful for tips on how to better use an argument that has varying length, or how to pass the argument to mapply!
There are a lot of possible fixes here. Fundamentally, you need to turn the second input to mapply into a list in which each element holds two values. One way to achieve that is:
tmp <- as.data.frame(t(df[c('b', 'c')]))
result <- mapply(FUN=foo,df["a"], tmp)
since a data frame is a list. This is going to run the function on all combinations of df["a"] and tmp. The elements you want will be along the diagonal (1st element of df['a'] with the first element of tmp, and so on), so the final answer is
diag(result)
BTW, when you are inside a function call such as data.frame, use = to name the arguments instead of <-. You also do not need the quotes around the letters (they are being ignored). So your call to data.frame should look like
df<-data.frame(
a = floor(runif(6, 1,10)),
b = floor(runif(6, 18,80)),
c = floor(runif(6, 1,80)),
d = floor(runif(6, 100,800)),
e = floor(runif(6, 1000,4000)),
f = floor(runif(6, 1,10)),
g = floor(runif(6, 5,80))
)
Which allows you to avoid having to name the data frame after you define it.
Update: a version without the diag call
f1 <- function(x) {
if(length(x) ==2 ) x[1] * x[2]
else x[1] - x[2]*x[3]
}
apply(df[,c("a","b", "c")], 1, f1)
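Another way to avoid computing all combinations is to let mapply walk the three columns in parallel and rebuild the two-element y inside a small wrapper. This is only a sketch: the fixed data frame and the wrapper argument names y1 and y2 are made up for illustration, while foo is the function from the question.

```r
# foo as defined in the question
foo <- function(x, y) {
  if (length(y) == 1) {
    return(x * y)
  } else {
    return(x - y[1] * y[2])
  }
}

# Small fixed data frame so the result is easy to check by hand
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(2, 3, 4))

# mapply passes the i-th element of each column to the wrapper,
# which packs b and c into the length-2 vector that foo expects
res <- mapply(function(x, y1, y2) foo(x, c(y1, y2)), df$a, df$b, df$c)
res
# -19 -58 -117
```

Each call sees exactly one row, so no diag step is needed afterwards.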

Retain NA in custom winsorization function with sapply

I am attempting to winsorize a data frame with a lot of NA entries, and I need to retain the NA entries after winsorization. The columns within the data frame that require winsorization are fine on the lower end (i.e. small values), but require some winsorization on the upper end (i.e. large values). I have created a function that almost does the job, but I can't seem to get the function to return NA entries where they occur. Here is an example.
# vector with an NA entry & upper-end value needs winsorization
a <- c(1:3,NA,90)
# my function
winsor <- function(x)
{ y <- quantile(x,probs=.95,na.rm=T)
sapply(x, function(x) {
if ( x>=y ){
x <- y
}
else { x <- x}
})
# returned vector after apply winsor to my object, a.
z <- as.data.frame(winsor(a))
The result I get when doing this gives the following error:
Error in if (x >= y) { : missing value where TRUE/FALSE needed
But it nevertheless returns the following vector (which is what I want, except it does not return the NA entry as needed).
a
1.000
2.000
3.000
85.824
Any help will be greatly appreciated, as well as an extension to operationalize this function over an entire data frame. Hopefully this all makes sense. This is my very first post (and hopefully my example is reproducible!).
The winsor function needs a closing } at the end. I took the liberty of tweaking the code:
winsor <- function(x){
  y <- quantile(x, probs=.95, na.rm=TRUE)
  sapply(x, function(x) {
    if (is.na(x)) {
      NA    # return NA explicitly so missing entries are kept
    } else if (x >= y) {
      y     # cap large values at the 95th percentile
    } else {
      x
    }
  })
}
z <- data.frame(winsor=as.numeric(unlist(winsor(a))))
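As an alternative sketch, the same upper-end capping can be written without sapply, because pmin() is vectorized and returns NA wherever its input is NA. The argument name p and the example data frame dat are assumptions made for illustration; the lapply line is one possible way to extend this to every column of a numeric data frame, as the question asks.

```r
# Cap values above the p-th quantile; NA entries stay NA because
# pmin() propagates missing values by default (na.rm = FALSE)
winsor <- function(x, p = 0.95) {
  cap <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

a <- c(1:3, NA, 90)
z <- winsor(a)   # length 5, with NA still in position 4

# Hypothetical data frame with all-numeric columns
dat <- data.frame(u = c(1, 2, 3, NA, 90), v = c(5, NA, 7, 8, 500))
dat[] <- lapply(dat, winsor)   # dat[] keeps the data frame structure
```

Assigning into dat[] rather than dat keeps the result a data frame instead of a plain list.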

Apply function in data frame

I have a data frame named Cat with multiple columns. One column, named Jan.15_Transaction, holds values. I want to apply a condition: if the value is greater than 0 then 1, else 0. I do not want to write an if/else condition by hand, as there are 42 similar columns to which I want to apply the same logic.
Jan.15_Transaction Feb.15_Transaction
1 1
2 2
3 3
4 4
Hence I build this function
myfunc <- function(x){
if(x > 0){
x=1
}
else {
x=0
}
return(x)
}
This is getting applied to first element only when I use this code.
Cat$Jan.15_Transaction.1<-myfunc(Cat$Jan.15_Transaction)
Warning message:
In if (x > 0) { :
the condition has length > 1 and only the first element will be used
So I tried sapply and got this error below
sapply(Cat$Jan.15_Transaction.1, myfunction(Cat))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
You can use the ifelse function to vectorise (= apply across a vector) an if statement:
myfunc = function (x)
ifelse(x > 0, 1, 0)
Alternatively, you could use the following which is more efficient (but less readable):
myfunc = function (x)
as.integer(x > 0)
Coming back to your original function, your way of writing it is very un-R-like. A more R-like implementation would look like this:
myfunc = function (x)
if (x > 0) 1 else 0
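There is no need for a temporary variable, assignments, or the return statement. Note that, like the original, this version only handles a single value, because if is not vectorized.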
I am assuming you want to apply the function on columns which have names ending with '_Transaction'. This can be done with the base function grepl.
vars <- grepl('_Transaction', names(df))
df[, vars] <- ifelse(df[, vars] > 0, 1, 0)
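You could also use dplyr as shown below. This generalizes to more complicated functions too. Note that mutate_each() and funs() are deprecated in current dplyr releases, where mutate(across(...)) is the replacement.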
binarizer <- function(x) ifelse(x > 0, 1, 0)
df <- bind_cols(
df %>% select(-ends_with('_Transaction')),
df %>% select(ends_with('_Transaction')) %>%
mutate_each(funs(binarizer))
)
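For comparison, here is a base R sketch that applies the vectorized function to every matching column in one assignment. The data frame below is invented for illustration (the id column and the values are assumptions); only the column naming pattern comes from the question.

```r
# Made-up data following the question's column naming
Cat <- data.frame(
  id = 1:4,
  Jan.15_Transaction = c(0, 2, -1, 4),
  Feb.15_Transaction = c(1, 0, 3, 0)
)

# Logical index of the columns whose names end in "_Transaction"
vars <- grepl("_Transaction$", names(Cat))

# lapply runs the function once per selected column; each call receives
# the whole column, so the function itself must be vectorized
Cat[vars] <- lapply(Cat[vars], function(x) as.integer(x > 0))
```

Columns that do not match the pattern, such as id here, are left untouched.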

R loop missing value where TRUE/FALSE needed

I've got this code in R:
j <- 1
k <- nrow(group_IDs)
while (j <= k)
{
d_clust <- Mclust(Customers_Attibutes_s[which (Customers_Attibutes_s$Group_ID == group_IDs$Group_ID[j]),3:7], G=2:7)
temp <- cbind(Customers_Attibutes[which (Customers_Attibutes$Group_ID == group_IDs$Group_ID[j]),], as.data.frame (predict.Mclust(d_clust, Customers_Attibutes[which(Customers_Attibutes$Group_ID == group_IDs$Group_ID[j]), 3:7]))[1])
temp_ <- rbind(temp,temp_)
j <- j+1
}
j <= k in the while statement is returning this error:
missing value where TRUE/FALSE needed.
group_IDs is not null and it actually contains the value 8 in this case.
It seems to get into the loop and crash at the second round.
You can get around the indexing issues using for, e.g.:
for (ID in group_IDs) {}
This, of course, assumes that group_IDs is a vector of values.
Note: Your code shows the following inside the loop group_IDs$Group_ID[j] which implies something other than a vector; perhaps you meant group_IDs[j]?
Since group_IDs is a vector, try length(group_IDs) instead of nrow. A vector doesn't have rows, so the equivalent is length.
Here's what I suspect is happening:
> group_IDs <- 8
> nrow(group_IDs)
NULL
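A short sketch of the alternatives, assuming (as suspected above) that group_IDs is a plain vector holding the single value 8. NROW() and seq_along() are base R functions; the visited variable is only there to show what the loop touches.

```r
group_IDs <- 8             # assumption: a plain vector, not a data frame

nrow(group_IDs)            # NULL, because a vector has no dim attribute
NROW(group_IDs)            # 1, since NROW treats a vector as one column
length(group_IDs)          # 1

# seq_along() yields valid positions only, so there is no counter to
# initialise, compare, or increment by hand
visited <- c()
for (j in seq_along(group_IDs)) {
  visited <- c(visited, group_IDs[j])
}
visited
```

If group_IDs turns out to be a data frame after all, seq_len(nrow(group_IDs)) plays the same role as seq_along() here.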

How to compute weighted means of a vector within factor levels?

I am able to successfully get a simple mean of a given vector within factor levels, but in attempting to take it to the next step of weighting the observations, I can't get it to work. This works:
> tapply(exp.f,part.f.p.d,mean)
1 2 3 4 5 6 7 8 9 10
0.8535996 1.1256058 0.6968142 1.4346451 0.8136110 1.2006801 1.6112160 1.9168835 1.5135006 3.0312460
But this doesn't:
> tapply(exp.f,part.f.p.d,weighted.mean,b.pct)
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
>
In the code below, I am trying to find the weighted mean of exp.f, within levels of the factor part.f.p.d, weighted by the observations within b.pct that are in each level.
b.exp <- tapply(exp.f,part.f.p.d,weighted.mean,b.pct)
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
I am thinking I must be supplying the incorrect syntax, as all 3 of these vectors are the same length:
> length(b.pct)
[1] 978
> length(exp.f)
[1] 978
> length(part.f.p.d)
[1] 978
What is the correct way to do this? Thank you in advance.
Now I do it like this (thanks to Gavin):
sapply(split(Data,Data$part.f.p.d), function(x) weighted.mean(x$exp.f,x$b.pct))
Others likely use ddply from the plyr package:
ddply(Data, "part.f.p.d", function(x) weighted.mean(x$exp.f, x$b.pct))
Your problem is that tapply does not "split" the extra arguments supplied (through its ... arguments) to the function, as it does for the main argument X. See the 'Note' on the help page for tapply (?tapply):
"Optional arguments to FUN supplied by the ... argument are not divided into cells. It is therefore inappropriate for FUN to expect additional arguments with the same length as X."
Here is a hacky solution.
exp.f <- rnorm(10)
part.f.p.d <- factor(sample(1:5, size = 10, replace = T))
b.pct <- rnorm(10)
a <- split(exp.f, part.f.p.d)
b <- split(b.pct, part.f.p.d)
lapply(seq_along(a), function(i){
weighted.mean(a[[i]], b[[i]])
})
I've recreated the error with some dummy data. I'm assuming that part.f.p.d is some kind of factor that you're using to separate the other vectors.
b.pct <- sample(1:100, 10) / 100
exp.f <- sample(1:1000, 10)
part.f.p.d <- factor(rep(letters[1:5], 2))
tapply(exp.f, part.f.p.d, mean) # this works
tapply(exp.f, part.f.p.d, weighted.mean, w = b.pct) # this doesn't
A call to traceback() helps to uncover the problem. The second call doesn't work because the INDEX argument (i.e. part.f.p.d) that you passed to tapply() is used to split the X argument (i.e. exp.f) into smaller vectors. Each of these pieces is passed to weighted.mean() together with the w argument (i.e. b.pct), which was not split.
EDIT: This should do what you want.
sapply(levels(part.f.p.d),
function(whichpart) weighted.mean(x = exp.f[part.f.p.d == whichpart],
w = b.pct[part.f.p.d == whichpart]))
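Two more compact variants of the same idea, sketched on the dummy data above: split both vectors by the factor and pair the pieces with mapply, or let tapply split the row positions instead of the values so the weights can be subset inside the function. The set.seed call and the names by_split and by_index are additions made for this sketch.

```r
set.seed(1)  # only so the dummy data is reproducible
b.pct <- sample(1:100, 10) / 100
exp.f <- sample(1:1000, 10)
part.f.p.d <- factor(rep(letters[1:5], 2))

# Variant 1: split values and weights the same way, then pair them up;
# mapply passes the first piece as x and the second as w
by_split <- mapply(weighted.mean,
                   split(exp.f, part.f.p.d),
                   split(b.pct, part.f.p.d))

# Variant 2: tapply splits the positions, and the function uses them
# to subset both the values and the weights
by_index <- tapply(seq_along(exp.f), part.f.p.d,
                   function(i) weighted.mean(exp.f[i], w = b.pct[i]))
```

Both return one weighted mean per factor level, named by level.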
