Incorporating external function in R's apply - r

Given this data.frame
x y z
1 1 3 5
2 2 4 6
I'd like to add the values of columns x and z plus a coefficient of 10, for every row in dat.
The intended result is this
x y z result
1 1 3 5 16 #(1+5+10)
2 2 4 6 18 #(2+6+10)
But why doesn't this code produce the desired result?
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(v1,v2,cf) {
return(v1+v2+cf)
}
# It breaks here
sm <- apply(dat[,c('x','z')], 1, process.xz(dat$x,dat$y,Coeff ))
# Later I'd do this:
# cbind(dat,sm);

I wouldn't use an apply here. Since the addition + operator is vectorized, you can get the sum using
> process.xz(dat$x, dat$z, Coeff)
[1] 16 18
To write this in your data.frame, don't use cbind, just assign it directly:
dat$result <- process.xz(dat$x, dat$z, Coeff)

The reason it fails is that apply doesn't work like that: its third argument must be a function (not the result of calling one), followed by any additional parameters. Each row of the data frame is then passed (as a single vector) as the first argument to that function.
dat <- data.frame(x=c(1,2), y=c(3,4), z=c(5,6))
Coeff <- 10
# Function
process.xz <- function(x,cf) {
return(x[1]+x[2]+cf)
}
sm <- apply(dat[,c('x','z')], 1, process.xz,cf=Coeff)
I completely agree that there's no point in using apply here though - but it's good to understand anyway.
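To make the calling convention concrete, here is a minimal sketch using the same dat and Coeff as above. The anonymous functions and the names sm_apply, sm_mapply and sm_vec are illustrative only and are not part of the original answers.

```r
dat <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))
Coeff <- 10

# apply() hands each row to the function as one named vector;
# extra arguments (here cf) follow the function itself.
sm_apply <- apply(dat[, c("x", "z")], 1,
                  function(v, cf) v[["x"]] + v[["z"]] + cf,
                  cf = Coeff)

# mapply() walks the two columns in parallel, so a three-argument
# function like the original process.xz can be kept as it is.
sm_mapply <- mapply(function(v1, v2, cf) v1 + v2 + cf,
                    dat$x, dat$z, MoreArgs = list(cf = Coeff))

# Plain vectorised arithmetic gives the same values with no apply at all.
sm_vec <- dat$x + dat$z + Coeff
```

All three vectors should hold the same two sums; the vectorised form is the simplest and usually the fastest.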

Related

Referring to previous row in calculation

I'm new to R and can't seem to get to grips with how to call a previous value of "self", in this case previous "b" b[-1].
b <- ( ( 1 / 14 ) * MyData$High + (( 13 / 14 )*b[-1]))
Obviously I need a NA somewhere in there for the first calculation, but I just couldn't figure this out on my own.
Adding example of what the sought after result should be (A=MyData$High):
A b
1 5 NA
2 10 0.7142...
3 15 3.0393...
4 20 4.6079...
1) for loop Normally one would just use a simple loop for this:
MyData <- data.frame(A = c(5, 10, 15, 20))
MyData$b <- 0
n <- nrow(MyData)
if (n > 1) for(i in 2:n) MyData$b[i] <- ( MyData$A[i] + 13 * MyData$b[i-1] )/ 14
MyData$b[1] <- NA
giving:
> MyData
A b
1 5 NA
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
2) Reduce It would also be possible to use Reduce. One first defines a function f that carries out the body of the loop and then we have Reduce invoke it repeatedly like this:
f <- function(b, A) (A + 13 * b) / 14
MyData$b <- Reduce(f, MyData$A[-1], 0, acc = TRUE)
MyData$b[1] <- NA
giving the same result.
This gives the appearance of being vectorized but in fact if you look at the source of Reduce it does a for loop itself.
3) filter Noting that the form of the problem is a recursive filter with coefficient 13/14 operating on A/14 (but with A[1] replaced with 0) we can write the following. Since filter returns a time series we use c(...) to convert it back to an ordinary vector. This approach actually is vectorized as the filter operation is performed in C.
MyData$b <- c(filter(replace(MyData$A, 1, 0)/14, 13/14, method = "recursive"))
MyData$b[1] <- NA
again giving the same result.
Note: All solutions assume that MyData has at least 1 row.
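As a quick sanity check, the three approaches can be run side by side on the example column. This is only a sketch: the names b_loop, b_reduce and b_filter are made up for the comparison, and filter is called as stats::filter on the assumption that another attached package (dplyr, for instance) might otherwise mask it.

```r
A <- c(5, 10, 15, 20)

# 1) explicit loop, starting from b[1] = 0
b_loop <- numeric(length(A))
for (i in seq_along(A)[-1]) b_loop[i] <- (A[i] + 13 * b_loop[i - 1]) / 14

# 2) Reduce with an initial value of 0, keeping every intermediate result
b_reduce <- Reduce(function(b, a) (a + 13 * b) / 14, A[-1], 0, accumulate = TRUE)

# 3) recursive filter on A/14 with the first element replaced by 0
b_filter <- as.numeric(stats::filter(replace(A, 1, 0) / 14, 13 / 14,
                                     method = "recursive"))

# Compare the three results up to floating point tolerance
all.equal(b_loop, b_reduce)
all.equal(b_loop, b_filter)
```

If a leading NA is wanted, as in the question, it can be set afterwards on whichever vector is kept.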
There are a couple of ways you could do this.
The first method is a simple loop
df <- data.frame(A = seq(5, 25, 5))
df$b <- 0
for(i in 2:nrow(df)){
df$b[i] <- (1/14)*df$A[i]+(13/14)*df$b[i-1]
}
df
A b
1 5 0.0000000
2 10 0.7142857
3 15 1.7346939
4 20 3.0393586
5 25 4.6079758
This doesn't give the exact values given in the expected answer, but it's close enough that I've assumed you made a transcription mistake. Note that we have to assume that we can take the NA in df$b[1] as being zero or we get NA all the way down.
If you have heaps of data or need to do this a bunch of times, the speed could be improved by implementing the code in C++ and calling it from R.
The second method uses the R function sapply
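The form you present the problem in,
b_i = \frac{1}{14} A_i + \frac{13}{14} b_{i-1},
is recursive, which makes it impossible to vectorise directly; however, we can do some maths and find that, taking the starting value b_0 = 0, it is equivalent to
b_n = \frac{1}{14} \sum_{i=1}^{n} \left(\frac{13}{14}\right)^{n-i} A_i.
We can then write a function which calculates b_n and use sapply to calculate each element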
calc_b <- function(n,A){
(1/14)*sum((13/14)^(n-1:n)*A[1:n])
}
df2 <- data.frame(A = seq(10,25,5))
df2$b <- sapply(seq_along(df2$A), calc_b, df2$A)
df2
A b
1 10 0.7142857
2 15 1.7346939
3 20 3.0393586
4 25 4.6079758
Note: We need to drop the first row (where A = 5) in order for the calculation to perform correctly.
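One possible way to avoid dropping the first row is to replace the first value of A with 0 before applying the closed form, which mirrors starting the loop at b[1] = 0. The sketch below assumes that interpretation and compares the two; A0, b_closed and b_loop are illustrative names.

```r
# Closed form: weights (13/14)^(n-1), ..., (13/14)^0 applied to A[1..n]
calc_b <- function(n, A) {
  (1 / 14) * sum((13 / 14)^(n - seq_len(n)) * A[seq_len(n)])
}

A  <- seq(5, 25, 5)
A0 <- replace(A, 1, 0)   # first value contributes nothing, as in the loop
b_closed <- sapply(seq_along(A0), calc_b, A = A0)

# Reference: the simple loop from the first method
b_loop <- numeric(length(A))
for (i in seq_along(A)[-1]) b_loop[i] <- (1 / 14) * A[i] + (13 / 14) * b_loop[i - 1]

all.equal(b_closed, b_loop)
```

Note that the closed form recomputes the whole sum for every n, so it does quadratic work overall; for long vectors the loop is likely to be the cheaper option.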

Changing dataframe values with a function

My problem is that I want to use a function to change a random value to NA in a global data frame.
df is a dataframe with 230 rows and 2 columns.
abstract code:
emptychange<- function(x){
placenumber <- round(runif(1,min= min(1),max=max(nrow(x))))
x[placenumber,2] <<- NA
}
emptychange(df)
The error is: "Error in x[placenumber, 2] <<- NA : object 'x' not found".
I think the mistake is that R searches for the global variable 'x' and doesn't use the function's x value (in this case df). How can I fix this? Thanks!
This works. The problem was the <<- in x[placenumber,2] <<- NA. The double arrow is used when you want to assign a value to an object outside the function; in your case, x is inside the function.
df1 <-data.frame(x = 1, y = 1:10)
emptychange<- function(x){
placenumber <- round(runif(1,min= min(1),max=max(nrow(x))))
x[placenumber,2] <- NA
return(x)
}
emptychange(df1)
If you want this to be done at the console, you can just sample from the row count inside the [<- call:
> df1 <-data.frame(x = 1, y = 1:10)
> df1[sample(nrow(df1), 1) , 2] <- NA
> df1
x y
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 NA
7 1 7
8 1 8
9 1 9
10 1 10
If you want to destructively change the dataframe argument given to a function you should instead assign the value which is returned back to the original name:
> randNA.secCol <- function(df) {df[sample(nrow(df), 1) , 2] <- NA; df}
> df1 <-data.frame(x = 1, y = 1:10)
> df1 <- randNA.secCol(df1)
Best practice in R is avoidance of the use of the <<- function.
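The return-and-reassign pattern can be sketched as a small helper. The function name set_random_na and its col argument are hypothetical, and set.seed is used only to make the run repeatable.

```r
# Returns a copy of df with one randomly chosen row of column `col` set to NA.
# The caller's data frame is not touched until the result is assigned back.
set_random_na <- function(df, col = 2) {
  i <- sample(nrow(df), 1)   # one row index, each row equally likely
  df[i, col] <- NA
  df
}

df1 <- data.frame(x = 1, y = 1:10)
set.seed(42)
df1 <- set_random_na(df1)    # reassign to keep the change
sum(is.na(df1$y))            # number of NA values now in column y
```

Unlike round(runif(...)), sample() picks every row with the same probability; rounding a uniform draw gives the first and last rows only half the chance of the others.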

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which equals 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which equals 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using the sapply or lapply functions in R.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
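The warning in the question arises because the : operator uses only the first element of a vector operand, so sum(1:d$numberofevents) evaluates sum(1:3) once for the whole column. Since sum(1:n) equals n(n+1)/2, the column can also be computed with plain vectorised arithmetic. A minimal sketch, where z_check is an illustrative name:

```r
d <- data.frame(ID = c("A", "A", "A", "B", "B"),
                eventcounter = c(1, 2, 3, 1, 2),
                numberofevents = c(3, 3, 3, 2, 2))

# Closed form for 1 + 2 + ... + n, applied to the whole column at once
d$z <- d$numberofevents * (d$numberofevents + 1) / 2

# Element-by-element version for comparison
z_check <- vapply(d$numberofevents,
                  function(n) as.numeric(sum(seq_len(n))),
                  numeric(1))
all.equal(d$z, z_check)
```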

Replicate variable based off match of two other variables in R

I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function.
match(y, x)
returns the indices of the matches; then use that to build your new weight vector
weight[match(y, x)]
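Spelled out step by step, with one extra case showing what happens when a value of y has no counterpart in x. The vector c(2, 9) and the names idx and unmatched are made up for the illustration.

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)        # position in x of each element of y
y_weight <- weight[idx]   # look up the weight at those positions

# A value that does not occur in x yields NA rather than an error
unmatched <- weight[match(c(2, 9), x)]
```

Here y_weight reproduces the vector asked for in the question, and the second element of unmatched is NA because 9 is not in x.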
#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6

How can I use ddply with varying .variables?

I use ddply to summarize some data.frame by various categories, like this:
# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
sumGroupSize = sum(someValue))
This works smoothly, but often I like to calculate ratios which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?
Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure, I could do this with two ddply calls, but using a single call would be more comfortable. Is there a way to do so?
EDIT:
I did not mean to ask an overly specific question, but I realize I was confusing people here. So here's my specific problem. In fact I do have an example that works, but I don't consider it really nifty. Plus it has a shortcoming that I need to overcome: it does not work correctly with apply.
library(plyr)
# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# someValue is summarized !
# note we have another, varying category, hence we need the a parameter
calcShares <- function(a, data) {
# !is.na needs to be specific!
tempres1 <- eval(substitute(ddply(data[!is.na(a),],.(group,size,a),summarize,
sumTest = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
tempres2 <- eval(substitute(ddply(data[!is.na(a),],.(group,size),summarize,
sumTestTotal = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
res <- merge(tempres1,tempres2,by=c("group","size"))
res$share <- res$sumTest/res$sumTestTotal
return(res)
}
test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)
head(test)
head(test2)
As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice, but somehow it does not work correctly.
applytest <- head(apply(mydata[grep("^cat",
names(mydata),value=T)],2,calcShares,data=mydata))
.. returns a warning message and a strange name (newX[, i] ) for the category var.
So how can I do THIS a) more elegantly and b) fix the apply issue?
This seems simple, so I may be missing some aspect of your question.
First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.
library(plyr)
# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
# A function that calculates the proportional contribution of each size class
# to the sum of someValue within a level of group
getProps <- function(df) {
with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}
# The call to ddply()
res <- ddply(mydata, .(group),
.fun = function(X) transform(X, PROPS=getProps(X)))
head(res, 12)
# someValue group size PROPS
# 1 26 A L 0.4785203
# 2 30 A L 0.4785203
# 3 54 A L 0.4785203
# 4 25 A L 0.4785203
# 5 70 A L 0.4785203
# 6 52 A L 0.4785203
# 7 51 A L 0.4785203
# 8 26 A L 0.4785203
# 9 67 A L 0.4785203
# 10 18 A M 0.2577566
# 11 21 A M 0.2577566
# 12 29 A M 0.2577566
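For the part of the question about running the calculation over several category columns, one option is to pass the column name as a string and loop over the names with lapply instead of apply. The sketch below swaps plyr for base R (aggregate and ave); calc_shares and its arguments are hypothetical names, and it assumes the formula interface of aggregate dropping rows with NA in the grouping columns is the behaviour wanted.

```r
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category  <- c(1, 2, 3)                       # recycled over 54 rows
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")  # recycled over 54 rows
mydata$category[c(8, 10, 19)] <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

# Share of each category level within every group/size cell.
# `category` is the column name given as a character string.
calc_shares <- function(data, category) {
  f <- reformulate(c("group", "size", category), response = "someValue")
  part <- aggregate(f, data = data, FUN = sum)   # rows with NA are omitted
  total <- ave(part$someValue, part$group, part$size, FUN = sum)
  part$share <- part$someValue / total
  part
}

# One result per category column, collected in a list
res <- lapply(c("category", "categoryA"), calc_shares, data = mydata)
head(res[[1]])
```

Passing names as strings avoids the eval(substitute(...)) construction, and lapply over the names sidesteps the newX[, i] naming issue seen with apply, since each call receives the full data frame plus one column name.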
