Using approx in ddply? - r

I am trying to use approx() to predict points on curves inside of ddply, but it does not seem to be working as I expect it to once it is handed to ddply.
This all works:
#Fake Data, V3 is my index variable
df<-data.frame(V1=rep(0:10,3), V2=c(exp(0:10), 2*exp(0:10), 3*exp(0:10)), V3=rep(1:3,each=11))
approxy<-function(i){
estim<-approx(x=i$V1, y=i$V2, xout=c(1.1,5.1,9.1))$y
return(data.frame(ex1=estim[1], ex5=estim[2], ex9=estim[3]))
}
approxy(df[df$V3==1,])
This does not:
ddply(df, c("V3"), fun=approxy)
It just spits the original dataframe back out. Any thoughts on this problem would be appreciated.

Your syntax is incorrect: the argument is named .fun (with a leading dot), not fun.
library(plyr)
ddply(df, c("V3"), .fun=approxy)
gives
  V3      ex1      ex5       ex9
1  1 3.185359 173.9147  9495.422
2  2 6.370719 347.8294 18990.844
3  3 9.556078 521.7442 28486.266
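If plyr is not available, the same per-group interpolation can probably be done in base R with split() and lapply(). This is only a sketch that reuses the df and approxy from the question; the names pieces and res are made up for illustration.

```r
# Data and helper copied from the question
df <- data.frame(V1 = rep(0:10, 3),
                 V2 = c(exp(0:10), 2 * exp(0:10), 3 * exp(0:10)),
                 V3 = rep(1:3, each = 11))

approxy <- function(i) {
  estim <- approx(x = i$V1, y = i$V2, xout = c(1.1, 5.1, 9.1))$y
  data.frame(ex1 = estim[1], ex5 = estim[2], ex9 = estim[3])
}

# Split by the index variable, apply approxy to each piece, then stack the rows
pieces <- lapply(split(df, df$V3), approxy)
res <- cbind(V3 = as.integer(names(pieces)), do.call(rbind, pieces))
rownames(res) <- NULL
res
```

This should give the same three rows as the ddply call, one per value of V3.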

Related

Equivalent of Stata's expand in R

Whilst reviewing a colleague's Stata code I came across the command expand.
I would really love to be able to do the same thing simply in my own R code.
Essentially expand duplicates a dataset n times but has the option to create a new variable which is 0 if the observation originally appeared
in the dataset and 1 if the observation is a duplicate.
Does anyone know of a quick way of implementing this in R? Or is it a case of writing my own function?
rep_r<-function(x,n){if(n<=1){rep(x,times=1)}else{rep(x,times=n)}}
expand_r<-function(x,n){
Reduce(function(x,y)
{c(x,y)},mapply(rep_r,x,n))
}
expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
EDIT: Thanks to the suggestion from @nicola, the above functionality can be achieved with the following one-liner.
expand_r<-function(x,n) rep(x,replace(n,n<1,1))
#>expand_r(c(2,3,4,1,5),c(-1,0,1,2,3))
#[1] 2 3 4 1 1 5 5 5
This function expands the rows of a data.frame like the Stata expand command does. I got the idea from the R mefa package.
expand_r <- function(df, ...) {
as.data.frame(lapply(df, rep, ...))
}
df <- data.frame(x = 1:2, y = c("a", "b"))
expand_r(df, times = 3)
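None of the versions above creates the 0/1 indicator the question mentions (0 for an original observation, 1 for a duplicate). A minimal sketch of one way to add it, assuming the copies should be appended after the originals; expand_flag and the column name dup are hypothetical names, not part of any package.

```r
# Hypothetical helper: repeat the rows of df n times and flag the copies
expand_flag <- function(df, n) {
  idx <- rep(seq_len(nrow(df)), times = n)   # row indices, originals first
  out <- df[idx, , drop = FALSE]
  out$dup <- as.integer(duplicated(idx))     # 0 = original row, 1 = duplicate
  rownames(out) <- NULL
  out
}

df <- data.frame(x = 1:2, y = c("a", "b"))
res <- expand_flag(df, 3)
res
```

The first nrow(df) rows are the originals with dup equal to 0, and every later copy has dup equal to 1.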

binding rows from a function into a dataframe

I wrote a function that does operations on a list. Now I am trying to bind the results into a data.frame, but nothing seems to work. Can someone explain how to fix this, but more importantly, why I am having this problem?
getVals<-function(x,y,z){
rbind(x,y,z)
}
ret<-lapply(1:3,function(x){getVals(x,x+1,x+2)})
as.data.frame(ret)
as.matrix(ret,ncol=3)
Desired output is:
1,2,3
2,3,4
3,4,5
You can get the result as a data frame by doing something like this:
as.data.frame(do.call(cbind, ret))
  V1 V2 V3
x  1  2  3
y  2  3  4
z  3  4  5
ret is a list of arrays. There are several ways of working with these lists. I prefer to unlist, convert to a matrix and then to a data frame:
df<-data.frame(matrix(unlist(ret),ncol=3, byrow=TRUE))
df
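As for the "why": each element of ret is a 3 x 1 matrix (rbind stacks x, y and z as rows), so it has to be turned on its side before the pieces can be stacked as rows. A small sketch of that idea, using the getVals from the question; rows and out are names made up here.

```r
getVals <- function(x, y, z) {
  rbind(x, y, z)
}
ret <- lapply(1:3, function(x) getVals(x, x + 1, x + 2))

dim(ret[[1]])   # each element is a 3 x 1 matrix, not a row

# Transpose each piece into a 1 x 3 row, then stack the rows
rows <- do.call(rbind, lapply(ret, t))
out <- as.data.frame(rows)
out
```

Here the columns keep the names x, y and z, and each list element becomes one row of the result.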

Getting corresponding values from data.frame

My problem is that I can't really put it into words, which makes it hard to google, so I am asking here. I hope you can shed some light on my issue.
I have a data.frame like this:
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you can see, in the first column 0 appears twice, 1 appears twice, and so on. What I would like to do is get all the corresponding values in the second column for one number, say 0 (in this example 7 and 2), preferably as a data.frame.
I know the approach with df$V2[which(df$V1==0)]; however, since the first column might have over 100 rows, I can't really use this. Do you have a good solution?
Maybe some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0's in the first columns, or get min/max values.
Regards
Here is a solution using dplyr:
library(dplyr)
df %>% group_by(V1) %>% summarize(ME=mean(V2))
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate() with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
For example:
> aggregate(Var2 ~ Var1, data = df, mean)
  Var1 Var2
1    0  4.5
2    1  2.0
3    3  6.0
4    5  2.0
5    6  2.0
Or, using the default interface:
> with(df, aggregate(Var2, list(Var1), FUN = mean))
  Group.1   x
1       0 4.5
2       1 2.0
3       3 6.0
4       5 2.0
5       6 2.0
But the output is nicer from the formula interface.
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action: applying a function to subsets of the data defined by some grouping variable. As @teucer points out, dplyr (and its ancestor, plyr) is designed for exactly this, as is data.table. In vanilla R, you would tend to use by or aggregate (or split and sapply for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
[1] The code can be simplified to df$V2[df$V1 == 0] or df[df$V1 == 0,] as well.
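If several summaries per group are needed at once (mean, min and max, as the question mentions), the split/apply/combine steps can be written out by hand. A rough sketch in base R, with the question's data typed in and the default column names V1 and V2 assumed; stats is a name invented for this example.

```r
# The data from the question, with read.table's default column names
df <- data.frame(V1 = c(6, 5, 3, 0, 0, 1, 6, 1),
                 V2 = c(4, 2, 6, 7, 2, 3, 0, 1))

# split: one vector of V2 values per distinct V1
# apply: summarise each vector
# combine: stack the one-row data frames
stats <- do.call(rbind, lapply(split(df$V2, df$V1), function(v) {
  data.frame(mean = mean(v), min = min(v), max = max(v))
}))
stats
```

The row names of stats are the distinct values of V1, so stats["0", ] holds the summaries for all rows where V1 is 0.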
Thanks all for your replies. I decided to go for the dplyr solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3 to group_by).
Since some are asking what's wrong with df$V2[which(df$V1==0)]: I may have been a bit unclear when saying "rows"; what I actually meant was "values". If I had n distinct values in the first column, I would have to run the command n times, once per distinct value, and store the n resulting vectors.

Create and process several columns with loop in R

I'm quite new to R and I would like to learn how to write a Loop to create and process several columns.
I imported a table into R that contains data with 23 variables. For each of these variables I want to calculate the per capita value, multiply it by 1000, and write the result either into a new table or into the same table as the original data.
To do this for a single column, my operation looked like this:
agriculture<-cbind(agriculture,"Total_value_per_capita"=agriculture$Total/agriculture$Total.Population*1000)
Now I'm asking how to do this in a Loop for the 23 variables so that I won't have to write 23 similar lines of code.
I think the solution might look quite similar to the code pasted in this thread:
loop to create several matrix in R (maybe using paste)
but I didn't get it working with my code.
So any suggestion would be very helpful.
I would always favor an appropriate *ply function over loops in R. In this case sapply could be your friend:
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
df.per.capita <- as.data.frame(
sapply(
df[ colnames(df) != "c" ], function(x){ x/df$c *1000 }
)
)
For more complicated cases, you should definitely have a look at the plyr package.
This can be done using sweep function. Using Beasterfield's data generation but setting the seed you can obtain the same results
set.seed(001)
df <- data.frame( a=sample(10), b=sample(10), c=sample(10) )
per.capita <- sweep(df[,colnames(df) != "c"], 1, STATS=df$c, FUN='/')*1000
per.capita
           a          b
1   300.0000   300.0000
2  2000.0000  1000.0000
3   833.3333  1000.0000
4  7000.0000 10000.0000
5   222.2222   555.5556
6  1000.0000   875.0000
7  1285.7143  1142.8571
8  1200.0000   800.0000
9  3333.3333   333.3333
10  250.0000  2250.0000
Comparing with Beasterfield's results:
all.equal(df.per.capita, per.capita)
[1] TRUE
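To get the new columns into the same table with names like the Total_value_per_capita from the question, something along these lines might work. It is only a sketch: the agriculture data below is made up (two value columns instead of 23), and value_cols is a helper name introduced here.

```r
# Made-up stand-in for the imported table
agriculture <- data.frame(Total = c(10, 20),
                          Crops = c(4, 5),
                          Total.Population = c(1000, 4000))

# Every column except the population column gets a per capita version
value_cols <- setdiff(names(agriculture), "Total.Population")

agriculture[paste0(value_cols, "_value_per_capita")] <-
  lapply(agriculture[value_cols],
         function(v) v / agriculture$Total.Population * 1000)

agriculture
```

The original columns are kept, and one new column per variable is appended at the end.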

mapping over the rows of a data frame

Suppose I have a data frame with columns c1, ..., cn, and a function f that takes in the columns of this data frame as arguments.
How can I apply f to each row of the data frame to get a new data frame?
For example,
x = data.frame(letter=c('a','b','c'), number=c(1,2,3))
# x is
# letter | number
# a | 1
# b | 2
# c | 3
f = function(letter, number) { paste(letter, number, sep='') }
# desired output is
# a1
# b2
# c3
How do I do this? I'm guessing it's something along the lines of {s,l,t}apply(x, f), but I can't figure it out.
As @greg points out, paste() can do this. I suspect your example is a simplification of a more general problem. After struggling with this in the past, as illustrated in this previous question, I ended up using the plyr package for this type of thing. plyr does a LOT more, but for these things it's easy:
> require(plyr)
> adply(x, 1, function(x) f(x$letter, x$number))
  X1 V1
1  1 a1
2  2 b2
3  3 c3
You'll want to rename the output columns, I'm sure.
So while I was typing this, @joshua showed an alternative method using ddply. The difference in my example is that adply treats the input data frame as an array, so it does not need the "group by" variable row that @joshua created. How he did it is exactly how I was doing it until Hadley tipped me to the adply() approach in the aforementioned question.
paste(x$letter, x$number, sep = "")
I think you were thinking of something like this, but note that the apply family of functions do not return data.frames. They will also attempt to coerce your data.frame to a matrix before applying the function.
apply(x,1,function(x) paste(x,collapse=""))
So you may be more interested in ddply from the plyr package.
> x$row <- 1:NROW(x)
> ddply(x, "row", function(df) paste(df[[1]],df[[2]],sep=""))
  row V1
1   1 a1
2   2 b2
3   3 c3
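For a general f that takes the columns as named arguments, base R's mapply may also be worth a look. A minimal sketch, assuming the argument names of f match the column names of x; res and out are names chosen for this example.

```r
x <- data.frame(letter = c("a", "b", "c"), number = c(1, 2, 3),
                stringsAsFactors = FALSE)
f <- function(letter, number) {
  paste(letter, number, sep = "")
}

# Pass each column of x to f as a named argument, one row at a time
res <- do.call(mapply, c(list(FUN = f), x))

# mapply returns a vector; wrap it to get a data frame back
out <- data.frame(result = unname(res))
out
```

Because the columns are matched to f by name, the order of the columns in x does not matter here.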
