How to use variable names as arguments in R

For a homework assignment, I wrote a function that performs forward step-wise regression. It takes three arguments: the dependent variable, a list of potential independent variables, and the data frame in which these variables are found. Currently all of my inputs except the data frame, including the list of independent variables, are strings.
Many built-in functions, as well as functions from high-profile packages, allow variable inputs that are not strings. Which approach is best practice, and why? If non-string arguments are best practice, how can I implement this, given that one of the arguments is a list of variables in the data frame rather than a single variable?

Personally, I don't see any problem with using strings if they accomplish what you need. If you want, you could rewrite your function to take a formula as input rather than strings to designate the independent and dependent variables. In that case your function calls would look like this:
fitmodel(x ~ y + z, data)
rather than this:
fitmodel("x", list("y", "z"), data)
Using formulas would allow you to specify simple algebraic combinations of variables to use in your regression, like x ~ y + log(z). If you go this route, then you can build the data frame specified by the formula by calling model.frame and then use this new data frame to run your algorithm. For example:
> df <- data.frame(x = 1:10, y = 10:1, z = sqrt(1:10))
> model.frame(x ~ y + z, df)
    x  y        z
1   1 10 1.000000
2   2  9 1.414214
3   3  8 1.732051
4   4  7 2.000000
5   5  6 2.236068
6   6  5 2.449490
7   7  4 2.645751
8   8  3 2.828427
9   9  2 3.000000
10 10  1 3.162278
> model.frame(x ~ y + z + I(x^2) + log(z) + I(x*y), df)
    x  y        z I(x^2)    log(z) I(x * y)
1   1 10 1.000000      1 0.0000000       10
2   2  9 1.414214      4 0.3465736       18
3   3  8 1.732051      9 0.5493061       24
4   4  7 2.000000     16 0.6931472       28
5   5  6 2.236068     25 0.8047190       30
6   6  5 2.449490     36 0.8958797       30
7   7  4 2.645751     49 0.9729551       28
8   8  3 2.828427     64 1.0397208       24
9   9  2 3.000000     81 1.0986123       18
10 10  1 3.162278    100 1.1512925       10
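Building on the model.frame approach, a minimal sketch of what a formula-based version of the asker's function could look like. The name stepwise_fit and the placeholder selection step are illustrative, not the actual homework code:

```r
# A minimal sketch, assuming the step-wise logic operates on the columns
# of the model frame; `stepwise_fit` is a hypothetical name.
stepwise_fit <- function(formula, data) {
  mf <- model.frame(formula, data)
  response   <- mf[[1]]  # first column is the dependent variable
  predictors <- mf[-1]   # remaining columns are the candidate variables
  # ... the forward step-wise selection over `predictors` would go here ...
  list(response = response, predictors = names(predictors))
}

df <- data.frame(x = 1:10, y = 10:1, z = sqrt(1:10))
stepwise_fit(x ~ y + log(z), df)$predictors
# [1] "y"      "log(z)"
```

Because model.frame evaluates each term, transformed predictors such as log(z) arrive as ready-made columns, so the selection loop never has to parse strings.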

Related

R Help - na.approx similar to SuperTrend in Excel

Raw Data  na.approx  desired result
       1          1               1
      NA          3               4
       5          5               5
       6          6               6
       7          7               7
      NA          8               4
      NA          9               7
      10         10              10
      13         11              13
      14         12              14
By default, I believe na.approx in R interpolates each NA between the two nearest known values, one before and one after the NA (giving the "na.approx" column above). Is there a way to change this function to interpolate based on the next two known values? For example, the first NA would be computed from 5 and 6, not from 1 and 5.
I am not sure there is an exact equivalent to what you want to do, but you can achieve similar results the following way:
> data <- c(1, NA, 5, 6, 7, NA, NA, 10, 13, 14)
> ind <- which(is.na(data))
> sapply(rev(ind), function(i) data[i] <<- data[i + 1] - 1)
> data
 [1]  1  4  5  6  7  8  9 10 13 14
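The snippet above fills each NA with the next value minus one, which only reproduces the desired output when the series rises in steps of one. A closer match to "interpolate from the next two known values" is backward extrapolation using the slope of the two following entries. A sketch, assuming every run of NAs is followed by at least two filled values:

```r
back_extrap <- function(x) {
  # Walk the NA positions from right to left, so values filled in later
  # gaps are already available when extrapolating earlier ones.
  for (i in rev(which(is.na(x)))) {
    x[i] <- x[i + 1] - (x[i + 2] - x[i + 1])
  }
  x
}

back_extrap(c(1, NA, 5, 6, 7, NA, NA, 10, 13, 14))
# [1]  1  4  5  6  7  4  7 10 13 14
```

This matches the "desired result" column exactly: the first NA becomes 5 - (6 - 5) = 4, and the pair of NAs before 10 become 4 and 7.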

Sum certain values from changing dataframe in R

I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into a function that generates a value x, which is put into the output data frame.
   cluster year      lambda           v            e   x
1        1    1 -0.12160997 -0.31105287 -0.253391178  15
2        1    2 -0.12160997 -1.06313732 -0.300349972  10
3        1    3 -0.12160997 -0.06704185  0.754397069  40
4        2    1 -0.07378295 -0.31105287 -1.331764904   4
5        2    2 -0.07378295 -1.06313732  0.279413039  19
6        2    3 -0.07378295 -0.06704185 -0.004581941  23
7        3    1 -0.02809310 -0.31105287  0.239647063  28
8        3    2 -0.02809310 -1.06313732  1.284568047  38
9        3    3 -0.02809310 -0.06704185 -0.294881283  18
10       4    1  0.33479251 -0.31105287 -0.480496125  15
11       4    2  0.33479251 -1.06313732 -0.380251626  12
12       4    3  0.33479251 -0.06704185 -0.078851036  34
13       5    1  0.27953088 -0.31105287  1.435456851 100
14       5    2  0.27953088 -1.06313732 -0.795435607   0
15       5    3  0.27953088 -0.06704185 -0.166848530   0
16       6    1  0.29409366 -0.31105287  0.126647655  44
17       6    2  0.29409366 -1.06313732  0.162961658  18
18       6    3  0.29409366 -0.06704185 -0.812316265  13
To aggregate, I then add up the x values for cluster 1 across all three years with seroconv.cluster1 <- sum(data.all[c(1:3), 6]) and repeat for each cluster.
Every time I change the number of clusters, I currently have to change the indices in these sums by hand. I would like to be able to say n.vec <- seq(6, 12, by = 2), feed n.vec into the function that generates x, and have R add up the x values per cluster automatically as the number of clusters changes: first 6 clusters with their per-cluster sums, then 8, and so on.
It seems you are asking for an easy way to split your data up, apply a function (sum in this case), and then combine it all back together. Split-apply-combine is a common data strategy, and R has several implementations, the most popular being base functions such as ave and tapply, the dplyr package, and the data.table package.
Here's an example for your data using dplyr:
library(dplyr)
df %>% group_by(cluster) %>% summarise(x = sum(x))
To get the sum of x for each cluster as a vector, you can use tapply:
tapply(df$x, df$cluster, sum)
# 1 2 3 4 5 6
# 65 46 84 61 100 75
If you instead wanted to output as a data frame, you could use aggregate:
aggregate(x ~ cluster, data = df, FUN = sum)
# cluster x
# 1 1 65
# 2 2 46
# 3 3 84
# 4 4 61
# 5 5 100
# 6 6 75
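For the varying-cluster-count part of the question, the same per-cluster sum can be wrapped in a loop over the desired cluster counts. Here make_cluster_data is a hypothetical stand-in for whatever function actually generates the x values; only its output columns matter:

```r
# Hypothetical generator standing in for the asker's own simulation:
# returns a data frame with `cluster` and `x` columns for n clusters.
make_cluster_data <- function(n) {
  data.frame(cluster = rep(seq_len(n), each = 3),
             x = seq_len(3 * n))
}

n.vec <- seq(6, 12, by = 2)
sums.by.n <- lapply(n.vec, function(n) {
  d <- make_cluster_data(n)
  tapply(d$x, d$cluster, sum)   # one sum per cluster, no manual indexing
})
names(sums.by.n) <- n.vec
sums.by.n[["6"]]
#  1  2  3  4  5  6
#  6 15 24 33 42 51
```

No indices need editing when the cluster count changes; tapply derives the groups from the data itself.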

Generating a repeating sequence an increasingly truncated number of times in R

Consider some vector x in R:
x <- 1:10
I'd like to create a repeating sequence of x, with the first element of each sequence truncated with each repetition, yielding the same output as would be given by issuing the following command in R:
c(1:10,2:10,3:10,4:10,5:10,6:10,7:10,8:10,9:10,10:10)
Can this be done? In reality, I'm working with a much larger vector for x. I've been playing with numerous combinations of the rep() function, to no avail.
Here's an alternative using mapply:
unlist(mapply(":", 1:10, 10))
# [1] 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 3 4 5 6 7
# [25] 8 9 10 4 5 6 7 8 9 10 5 6 7 8 9 10 6 7 8 9 10 7 8 9
# [49] 10 8 9 10 9 10 10
A bit of a hack, but you can decompose what you are trying to do into two sequences:
rep(0:9, 10:1) + sequence(10:1)
You can see what each part does by running it separately. I don't know whether there is a way to feed the parameters to rep() or seq() as you would in a Python expansion.
unlist(sapply(1:10, function(x) { x:10 }))
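The rep()/sequence() trick generalizes to any vector length; here is a small helper (the name tail_sequences is made up for illustration):

```r
tail_sequences <- function(n) {
  # Offsets 0, ..., n-1 repeated n, n-1, ..., 1 times, added to the
  # runs 1:n, 1:(n-1), ..., 1:1 produced by sequence().
  rep(0:(n - 1), n:1) + sequence(n:1)
}

identical(tail_sequences(10),
          c(1:10, 2:10, 3:10, 4:10, 5:10, 6:10, 7:10, 8:10, 9:10, 10:10))
# [1] TRUE
```

Being fully vectorized, this avoids the per-element function calls of the sapply approach, which matters for large x.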

R Create Column as Running Average of Another Column

I want to create a column in R that is simply the average of all previous values of another column. For example:
D
    X
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
I would like D$Y to be the prior average of D$X; that is, D$Y is the average of all previous observations of D$X. I know how to do this using a for loop moving through every row, but is there a more efficient way?
I have a large dataset and hardware that is not up to the task!
You can generate cumulative means of a vector like this:
set.seed(123)
x <- sample(20)
x
## [1]  6 15  8 16 17  1 18 12  7 20 10  5 11  9 19 13 14  4  3  2
xmeans <- cumsum(x) / seq_along(x)
xmeans
## [1] 6.000000 10.500000 9.666667 11.250000 12.400000 10.500000 11.571429
## [8] 11.625000 11.111111 12.000000 11.818182 11.250000 11.230769 11.071429
## [15] 11.600000 11.687500 11.823529 11.388889 10.947368 10.500000
So D$Y <- cumsum(D$X) / seq_len(nrow(D)) should work.
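One caveat: the cumulative mean above includes the current observation. If "previous" should exclude the current row, a sketch of the strictly-prior version shifts the cumulative mean down by one row:

```r
D <- data.frame(X = 1:10)

# Mean of all strictly earlier rows; the first row has no prior
# observations, so its value is NA by construction.
D$Y <- c(NA, head(cumsum(D$X) / seq_len(nrow(D)), -1))
D$Y
# [1]  NA 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
```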

lag in apply statement doesn't work in R

I'm trying to apply a function that performs lags on zoo objects in R.
The function works correctly if I pass a single zoo vector: it applies the lag and everything works.
However, if I apply(data, 1, function) then the lag doesn't work. There is no error, just the equivalent of a zero lag.
This is also the case with a simple apply(data, 1, lag).
Can anyone explain why this is the case? Is there anything I can do to make the lag occur?
Here's some data:
> x <- zoo(matrix(1:12, 4, 3), as.Date("2003-01-01") + 0:3)
> x
2003-01-01 1 5  9
2003-01-02 2 6 10
2003-01-03 3 7 11
2003-01-04 4 8 12
If you want to lag this multivariate time series, just call lag (i.e. no need for apply):
> lag(x)
2003-01-01 2 6 10
2003-01-02 3 7 11
2003-01-03 4 8 12
If you want to apply a function across the rows, it needs to be one that makes sense row-wise. For instance, to get the mean of the row values:
> apply(x, 1, mean)
2003-01-01 2003-01-02 2003-01-03 2003-01-04
         5          6          7          8
You can't apply over a zoo object and get a zoo object back. The output of apply is "a vector or array or list of values". In the example above:
> class(apply(x, 1, mean))
[1] "numeric"
You need to recreate it as a zoo object and then lag it:
> lag(zoo(apply(coredata(x), 1, mean), index(x)))
2003-01-01 2003-01-02 2003-01-03
         6          7          8
You need to be slightly careful about the orientation of your output, but you can transpose it if necessary with the t() function. For instance:
> zoo(t(apply(coredata(x), 1, quantile)), index(x))
           0% 25% 50% 75% 100%
2003-01-01  1   3   5   7    9
2003-01-02  2   4   6   8   10
2003-01-03  3   5   7   9   11
2003-01-04  4   6   8  10   12
You could also wrap this in a function. Alternatively, you can use one of the apply functions in the xts time-series package (this will retain the time-series object in the process):
> x <- as.xts(x)
> apply.daily(x, mean)
           [,1]
2003-01-01    5
2003-01-02    6
2003-01-03    7
2003-01-04    8
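The "wrap this in a function" idea mentioned above might look like the following sketch. apply_zoo is a made-up name, and the sketch assumes the supplied function returns one value per row:

```r
library(zoo)

# Apply a row-wise summary and rebuild the zoo object so that lag()
# works again; apply() strips the index, so we reattach it afterwards.
apply_zoo <- function(z, f, ...) {
  zoo(apply(coredata(z), 1, f, ...), index(z))
}

x <- zoo(matrix(1:12, 4, 3), as.Date("2003-01-01") + 0:3)
lag(apply_zoo(x, mean))
# 2003-01-01 2003-01-02 2003-01-03
#          6          7          8
```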
Why don't you try the quantmod::Lag function for generating a matrix of lagged versions of a series at different lag values? For example,
> quantmod::Lag(1:10, k = c(0, 5, 2))
will return
      Lag.0 Lag.5 Lag.2
 [1,]     1    NA    NA
 [2,]     2    NA    NA
 [3,]     3    NA     1
 [4,]     4    NA     2
 [5,]     5    NA     3
 [6,]     6     1     4
 [7,]     7     2     5
 [8,]     8     3     6
 [9,]     9     4     7
[10,]    10     5     8
@Marek - lag(data) does do what I want, but I wanted to be able to use it as part of an apply construct to make the vector-to-matrix abstraction a little easier.
