Basic use of approx() for a lookup table with linear interpolation in R

I have a vector, my_points, and a data frame that describes the almost linear relationship between points and values.
How do I use linear interpolation to obtain the vector my_values from my_points and the relationship described in the data frame?
Assume the relationship beyond the last point in the data frame remains linear.
my_points <- c(4400, 8800, 13200, 37600, 42000, 46400, 50800, 55200, 59600,
64000, 68400, 72800, 77200, 81600, 86000, 90400, 94800, 99200,
103600, 108000, 112400, 116800, 121200, 125600)
df <- structure(list(points = c(3000, 4500, 7500, 11000, 14500, 21500,
43000, 71500), values = c(20, 30, 50, 75, 100, 150, 300, 500),
points_per_value = c(150, 150, 150, 146.666666666667, 145,
143.333333333333, 143.333333333333, 143)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))

You said "interpolation", in which case you can get:
cbind(
  data.frame(my_points),
  lapply(df[-1], function(z) approx(df$points, z, xout = my_points)$y)
)
# my_points values points_per_value
# 1 4400 29.33333 150.0000
# 2 8800 59.28571 148.7619
# 3 13200 90.71429 145.6190
# 4 37600 262.32558 143.3333
# 5 42000 293.02326 143.3333
# 6 46400 323.85965 143.2936
# 7 50800 354.73684 143.2421
# 8 55200 385.61404 143.1906
# 9 59600 416.49123 143.1392
# 10 64000 447.36842 143.0877
# 11 68400 478.24561 143.0363
# 12 72800 NA NA
# 13 77200 NA NA
# 14 81600 NA NA
# 15 86000 NA NA
# 16 90400 NA NA
# 17 94800 NA NA
# 18 99200 NA NA
# 19 103600 NA NA
# 20 108000 NA NA
# 21 112400 NA NA
# 22 116800 NA NA
# 23 121200 NA NA
# 24 125600 NA NA
But you also said "beyond the last point", suggesting you want "extrapolation":
cbind(
  data.frame(my_points),
  lapply(df[-1], function(z) Hmisc::approxExtrap(df$points, z, xout = my_points)$y)
)
# my_points values points_per_value
# 1 4400 29.33333 150.0000
# 2 8800 59.28571 148.7619
# 3 13200 90.71429 145.6190
# 4 37600 262.32558 143.3333
# 5 42000 293.02326 143.3333
# 6 46400 323.85965 143.2936
# 7 50800 354.73684 143.2421
# 8 55200 385.61404 143.1906
# 9 59600 416.49123 143.1392
# 10 64000 447.36842 143.0877
# 11 68400 478.24561 143.0363
# 12 72800 509.12281 142.9848
# 13 77200 540.00000 142.9333
# 14 81600 570.87719 142.8819
# 15 86000 601.75439 142.8304
# 16 90400 632.63158 142.7789
# 17 94800 663.50877 142.7275
# 18 99200 694.38596 142.6760
# 19 103600 725.26316 142.6246
# 20 108000 756.14035 142.5731
# 21 112400 787.01754 142.5216
# 22 116800 817.89474 142.4702
# 23 121200 848.77193 142.4187
# 24 125600 879.64912 142.3673
If all you need is the vector for one of these columns (your my_values corresponds to df$values here), then
Hmisc::approxExtrap(df$points, df$values, xout = my_points)$y
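If you would rather avoid the Hmisc dependency, linear extrapolation past the last point can also be done in base R by extending the slope of the final segment. A minimal sketch (lin_extrap is just an illustrative name, not an existing function):
lin_extrap <- function(x, y, xout) {
  # interpolate within the range of x; beyond it approx() returns NA
  out <- approx(x, y, xout = xout)$y
  # fill values beyond max(x) by extending the last segment's slope
  n <- length(x)
  slope <- (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  beyond <- xout > x[n]
  out[beyond] <- y[n] + slope * (xout[beyond] - x[n])
  out
}
lin_extrap(df$points, df$values, my_points)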

Related

Making a "Race" Between Two Variables

I would like to make two variables ("a" and "b") that keep:
taking a random value that is ALWAYS less than their current value (i.e. a1 > a2 > a3 > ... > an, and b1 > b2 > b3 > ... > bn ALWAYS)
until one of them is less than or equal to 0.
I showed a demo below:
#iteration 1
a1 = 100 - rnorm(1,5,10)
b1 = 100 -rnorm(1,5,10)
a2 = a1 - rnorm(1,5,10)
b2 = b1 -rnorm(1,5,10)
a3 = a2 - rnorm(1,5,10)
b3 = b2 -rnorm(1,5,10)
#etc.
I would then like to repeat this many times. In the end, this would look something like this:
Currently, I am doing this manually, and then using the bind_rows() command to "pile" each iteration on top of each other. Can someone please show me a faster way to do this?
Thank you!
You could write a small recursive function:
fun <- function(x){
  if(any(x < 0)) x
  else rbind(x, fun(x - abs(rnorm(length(x), 5, 10))))
}
Now for 1 draw of A and B:
set.seed(1)
fun(c(A=100, B=100))
A B
x 100.00000 100.000000
x 98.73546 93.163567
x 95.37918 72.210759
x 87.08410 69.006075
x 77.20981 56.622828
x 66.45199 54.676712
x 46.33418 45.778279
x 45.12178 28.631280
x 28.87247 24.080617
x 24.03437 9.642254
10.82216 -1.296759
We can use this within a function to replicate the process. I'll stick to base R, although this could be simplified in the tidyverse:
random_seq <- function(n, start){
  fun <- function(x){
    if(any(x < 0)) c(x)
    else rbind(x, fun(x - abs(rnorm(length(x), 5, 10))))
  }
  R <- replicate(n, data.frame(fun(start), row.names = NULL), simplify = FALSE)
  S <- do.call(rbind, Map(cbind, id = seq(R), R))
  U <- transform(S, time = ave(id, id, FUN = seq_along))
  reshape(U, dir = 'wide', idvar = 'id', sep = '')
}
set.seed(1)
random_seq(4, c(A=20,B=20))
id A1 B1 A2 B2 A3 B3 A4 B4
1 1 20 20 18.7354619 13.163567 15.379176 -7.789241 NA NA
4 2 20 20 11.7049223 16.795316 1.830632 4.412069 -8.927182 2.465953
8 3 20 20 -0.1178117 11.101568 NA NA NA NA
10 4 20 20 18.7875942 2.853001 2.538285 -1.697663 NA NA
BONUS:
If you're interested, fun can also produce the incremented names directly:
fun <- function(x){
  nms <- as.numeric(sub('\\D+', '', names(x))) + 1
  names(x) <- paste0(sub("\\d+", '', names(x)), nms)
  if(any(x < 0)) c(x)
  else c(x, Recall(x - abs(rnorm(length(x), 5, 10))))
}
fun(c(A0=20, B0=30))
A1 B1 A2 B2 A3 B3
20.000000 30.000000 11.234808 23.323201 -9.611483 1.544311
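One caveat with recursion: a run that takes very many steps to cross zero could, in principle, hit R's recursion limit. An iterative variant with the same behavior is a straightforward rewrite (a sketch; fun_iter is just an illustrative name):
fun_iter <- function(x) {
  out <- list(x)
  # keep drawing decrements until any component goes negative
  while (all(x >= 0)) {
    x <- x - abs(rnorm(length(x), 5, 10))
    out[[length(out) + 1]] <- x
  }
  do.call(rbind, out)
}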
Here's a nicely configurable function that runs a single start down to 0, and we can use replicate to run it as many times as needed, returning a list.
to_0 = function(start = 100, fun = runif, ..., n = 1000) {
  if(start <= 0) stop("Must start greater than 0")
  result = start - c(0, cumsum(fun(n, ...)))
  if(all(result > 0)) stop("Didn't reach 0, set a higher n or check inputs.")
  first_0 = match(TRUE, result < 0)
  result[seq_len(first_0)]
}
I used runif as the default instead of your rnorm because you say you want the series to be strictly decreasing, but rnorm is sometimes positive and sometimes negative so it will sometimes lead to increases.
I cut off the series at the first negative value. Since the lengths of the runs differ, a data.frame seems like a bad choice; keeping them in a list is better. We can use lengths() to see how long each vector in the list is.
The function is parametrized, so you can easily try out other distributions or custom functions, e.g., to_0(start = 100, fun = rexp, rate = 0.1). Below I demonstrate with the uniform distribution starting at 10.
set.seed(47)
race = replicate(n = 100, to_0(start = 10))
head(race)
# [[1]]
# [1] 10.00000000 9.02303800 8.64912196 7.88761993 7.06512831 6.49158390 5.80017147 5.41110962 4.94216364 4.39885390 3.47396185
# [12] 3.33516427 2.63317707 2.47098343 1.87167641 1.36564030 0.46366678 0.06316398 0.03221901 -0.03913915
#
# [[2]]
# [1] 10.00000000 9.27320918 8.54814801 7.77974923 7.34440424 7.27499236 6.76825217 6.75134855 6.20214287 5.43031741 4.56633348
# [12] 3.59288910 3.24547860 2.60269295 1.75639299 1.73279651 1.72371866 1.38211688 0.71933800 0.04916749 -0.40714758
#
# [[3]]
# [1] 10.00000000 9.08923490 9.06189460 8.69397353 8.30179409 8.11077841 7.96295850 7.49701585 6.52812608 6.26480567 5.34558158
# [12] 5.31801508 4.90573089 3.98774633 3.89046321 3.70358854 3.61482042 3.53824450 3.36900151 2.86522484 2.23295349 1.80544403
# [23] 0.82311022 0.73664857 -0.09385818
#
# [[4]]
# [1] 10.0000000 9.2172681 8.4175584 8.1672679 7.3683421 7.3373712 7.0319788 6.6512214 5.7210315 5.2732412 4.6817849 4.1065416
# [13] 3.9452541 3.4009742 2.5018050 1.5316136 0.7175295 0.4410275 -0.1859260
#
# [[5]]
# [1] 10.00000000 9.91914621 9.90238843 9.82993154 9.33156028 8.90827720 8.44160294 7.46348397 6.76539075 6.27298443 5.97401412
# [12] 5.03395592 4.55537992 3.75737919 2.82175869 2.75045000 2.70081885 2.67523320 2.20266408 2.12695183 1.25880525 0.57011279
# [23] 0.03173135 -0.79275633
#
# [[6]]
# [1] 10.0000000 9.9292630 9.6154147 9.0754730 8.7814754 8.5273701 7.6998567 6.8127609 5.9944598 5.6232599 5.1505038 4.8676191
# [13] 4.6337121 4.5868438 4.0435219 3.0981151 2.2621741 1.9925101 1.2104707 0.9334569 0.7574446 0.1643009 -0.5220925
lengths(race)
# [1] 20 21 25 19 24 23 21 24 23 22 25 24 19 19 23 17 19 23 25 21 24 25 18 22 24 25 19 19 23 22 19 26 20 23 24 24 22 21 25 23 21 28 19 20 16 20
# [47] 22 25 20 22 23 23 24 22 19 23 23 23 22 18 22 23 24 21 21 23 21 22 20 25 22 23 21 17 20 20 16 25 21 21 21 20 20 19 24 19 23 24 26 25 20 21
# [93] 23 17 27 18 30 24 21 23
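To stage an actual "race" between two variables with this approach, one option (a sketch) is to run to_0() once per racer and compare run lengths; the shorter series crossed 0 first:
set.seed(47)
a <- to_0(start = 10)
b <- to_0(start = 10)
# fewer steps means that racer reached 0 sooner
if (length(a) < length(b)) "a wins" else if (length(b) < length(a)) "b wins" else "tie"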

Delete all columns in which all rows have a value below 10000

I have a big dataset (2300 obs x 9700 var) in which I only want to keep the variables (columns) for which all samples (rows) have values above 10000.
I tried a lot of code, for example:
subset <- df[, apply(df, 1, function(z) !all(z < 10000))]
subset <- df[, df > 10000]
But these do not work. rowMeans doesn't help either, since a mean can be < 10000 even when some individual values are > 10000.
Does anybody have tips on how to tackle this?
Try this to keep all rows in which all values are > 10000:
df[rowSums(df>10000)==ncol(df),]
# a b c
#4 14139 127746 10911
#7 11582 73952 10821
To keep all columns in which all values are > 10000:
df[,colSums(df>10000)==nrow(df)]
# [1] 40004 105808 70261 127746 60177 134365 73952 86584 87551 67781
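For what it's worth, the apply() attempt in the question was close; the margin just needs to be 2 (columns) rather than 1, with the condition matching the goal:
# equivalent to the colSums() approach above
df[, apply(df, 2, function(z) all(z > 10000))]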
data
df <- structure(list(a = c(13773, 8680, 4854, 14139, 3106, 3044, 11582,
9475, 728, 4666), b = c(40004, 105808, 70261, 127746, 60177,
134365, 73952, 86584, 87551, 67781), c = c(1913, 2092, 14468,
10911, 14414, 8015, 10821, 12636, 12320, 1266)), .Names = c("a",
"b", "c"), row.names = c(NA, -10L), class = "data.frame")
# a b c
# 1 13773 40004 1913
# 2 8680 105808 2092
# 3 4854 70261 14468
# 4 14139 127746 10911
# 5 3106 60177 14414
# 6 3044 134365 8015
# 7 11582 73952 10821
# 8 9475 86584 12636
# 9 728 87551 12320
# 10 4666 67781 1266
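A tidyverse equivalent, for reference (a sketch; where() requires a reasonably recent dplyr/tidyselect):
library(dplyr)
df %>% select(where(~ all(.x > 10000)))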

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row in the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
  age = c(exp$age1[i], exp$age2[i])
  h = c(exp$h[i], exp$h2[i])
  model = lm(age ~ h)
  exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
It works well but takes some time with very large files. I'd be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try rowwise() and do(). Inside the do(), we concatenate (c) 'age1' and 'age2' to create 'age'; likewise we create 'h', apply lm, and extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table, i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), creating a column 'rn' with the option keep.rownames=TRUE. We melt the dataset, specifying the patterns in measure, to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp'), while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table)#v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE), measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {
                 model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]
               }, rn]$V1
exp[, modelh:= modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, since you mentioned "it's an example of a bigger problem": if you are really interested in building models rowwise, you'll create more and more columns as your age and h observations increase. With N observations you'd need 2 x N columns for those two variables alone.
I'd suggest using a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # what your first row (model-building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
big.exp$modelh <- unlist(lst)},
akdplyr = {big.exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )}
,times=5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its rounding (sprintf's "%f" keeps only six decimal places) may be an issue. Improvements may be possible, perhaps by using apply after converting to a matrix to increase speed and efficiency.
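In fact, since each row's "regression" uses exactly two points, the fitted line is simply the line through those points, so lm() can be bypassed entirely with vectorized arithmetic (a sketch; exact, and far faster than any lm-based approach):
# two points determine the line, so slope and intercept have closed forms
slope <- (exp$age2 - exp$age1) / (exp$h2 - exp$h)
intercept <- exp$age1 - slope * exp$h
exp$modelh <- intercept + 100 * slope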

How to fill a stacked barplot with patterns or textures in R

I have used ggplot2 to draw a stacked barplot, and I want to fill the bars with patterns. But that problem seems very complicated to solve with ggplot2.
So is there a way to fill a stacked barplot with patterns or textures in base R or with another R package?
My plot is similar to this barplot:
and I want the barplot to look like this, filled with patterns or textures:
My data is from my previous post:
plant group n percentage
1 Cucumber-1 [3.19e-39,2] 14729 0.8667686695
2 Cucumber-1 (2,4] 1670 0.0982757606
3 Cucumber-1 (4,6] 447 0.0263049491
4 Cucumber-1 (6,8] 131 0.0077090567
5 Cucumber-1 (8,10] 16 0.0009415642
6 Cucumber-2 [3.19e-39,2] 20206 0.9410394933
7 Cucumber-2 (2,4] 1155 0.0537909836
8 Cucumber-2 (4,6] 90 0.0041915052
9 Cucumber-2 (6,8] 16 0.0007451565
10 Cucumber-2 (8,10] 5 0.0002328614
11 Eggplant-1 [3.19e-39,2] 11273 0.9012631916
12 Eggplant-1 (2,4] 960 0.0767508794
13 Eggplant-1 (4,6] 181 0.0144707387
14 Eggplant-1 (6,8] 31 0.0024784138
15 Eggplant-1 (8,10] 63 0.0050367765
16 Eggplant-2 [3.19e-39,2] 16483 0.9493721921
17 Eggplant-2 (2,4] 725 0.0417578620
18 Eggplant-2 (4,6] 140 0.0080635871
19 Eggplant-2 (6,8] 12 0.0006911646
20 Eggplant-2 (8,10] 2 0.0001151941
21 Pepper-1 [3.19e-39,2] 4452 0.9763157895
22 Pepper-1 (2,4] 97 0.0212719298
23 Pepper-1 (4,6] 11 0.0024122807
24 Pepper-2 [3.19e-39,2] 23704 0.9560763119
25 Pepper-2 (2,4] 905 0.0365022385
26 Pepper-2 (4,6] 184 0.0074214496
Most of the required work is to get your data in shape. The function ?barplot is simple to use, but you want to feed it a matrix. You can use vectors for the density= and angle= arguments to distinguish the elements of the stacked bar plot.
d = read.table(text="plant ...
... 184 0.0074214496", header=T)
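# note: the factor manipulation below assumes plant and group were read in as
# factors (stringsAsFactors = TRUE, the default before R 4.0)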
d$group <- factor(d$group, levels=c(levels(d$group)[c(5,1:4)]),
labels=c("(0,2]", levels(d$group)[1:4]))
levels(d$group)
# [1] "(0,2]" "(2,4]" "(4,6]" "(6,8]" "(8,10]"
tab <- table(d$group, d$plant)
tab
# output omitted
d <- rbind(d,
c("Pepper-1", "(6,8]", 0, 0),
c("Pepper-1", "(8,10]", 0, 0),
c("Pepper-2", "(6,8]", 0, 0),
c("Pepper-2", "(8,10]", 0, 0) )
d <- d[order(d$plant, d$group),]
d
# output omitted
mat <- matrix(as.numeric(d$percentage), nrow=5, ncol=6)
rownames(mat) <- levels(d$group)
colnames(mat) <- levels(d$plant)
names(dimnames(mat)) <- c("group", "plant")
mat
# plant
# group Cucumber-1 Cucumber-2 Eggplant-1 Eggplant-2 Pepper-1 Pepper-2
# (0,2] 0.8667686695 0.9410394933 0.901263192 0.9493721921 0.976315789 0.95607631
# (2,4] 0.0982757606 0.0537909836 0.076750879 0.0417578620 0.021271930 0.03650224
# (4,6] 0.0263049491 0.0041915052 0.014470739 0.0080635871 0.002412281 0.00742145
# (6,8] 0.0077090567 0.0007451565 0.002478414 0.0006911646 0.000000000 0.00000000
# (8,10] 0.0009415642 0.0002328614 0.005036777 0.0001151941 0.000000000 0.00000000
barplot(mat, density=5:9, angle=seq(40, 90, 10), cex.names=.8)
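As an aside: this answer predates it, but the ggpattern package now adds pattern fills to ggplot2, in case you'd rather stay in that framework. A minimal sketch, using the data as originally read in (with percentage still numeric):
library(ggplot2)
library(ggpattern)  # install.packages("ggpattern")
ggplot(d, aes(x = plant, y = percentage, pattern = group)) +
  geom_col_pattern(fill = "white", colour = "black")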

Outcome variable as argument in regression function

I have a data-setup function which currently has 2 arguments: testData and ID1. I want to include the outcome variable as an argument.
Suppose outcomevar = c(y1, y2, y3); then the function should create lagged and differenced versions of my outcome variable.
preparedata <- function(testData, ID1, outcomevar){
  # Order temp data by firm and date
  testData <- testData[order(testData$firm, testData$date),]
  # Create lagged outcomevar for each firm
  testData <- ddply(testData, .(firm), transform,
                    ly1 = c(NA, y1[-length(y1)]))
  # Create differenced variable
  testData$dy1 <- (testData$y1 - testData$ly1)
}
where the "l" and "d" in front of y1 stand for lagged and differenced.
How can I include the outcome variable?
Thanks
T
Here's a solution using data.table:
# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))
preparedata <- function(testData, ID1, outcomevar){
  require(data.table)
  DT <- as.data.table(testData)
  setkey(DT, firm, date)
  DT[, lag := c(NA, unlist(.SD)[-.N]), by = firm, .SDcols = outcomevar]
  DT[, diff := c(NA, diff(unlist(.SD))), by = firm, .SDcols = outcomevar]
  setnames(DT, c("lag", "diff"), paste0(c("l", "d"), outcomevar))
  return(DT)
}
result <- preparedata(df,1,outcomevar="y1")
head(result)
# firm date y1 y2 y3 ly1 dy1
# 1: A 2014-01-02 27 48 66 NA NA
# 2: A 2014-01-03 37 86 35 27 10
# 3: A 2014-01-04 57 43 27 37 20
# 4: A 2014-01-05 89 24 97 57 32
# 5: A 2014-01-06 20 7 61 89 -69
# 6: A 2014-01-07 86 10 21 20 66
This assumes you pass the name of the column containing the "outcomevar", not the column itself.
You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can specify which columns to include using .SDcols=.... The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g. ly1 and dy1.
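If you need the lagged and differenced columns for all three outcome variables, one way (a usage sketch) is simply to apply the function once per variable:
result <- df
for (v in c("y1", "y2", "y3")) result <- preparedata(result, 1, v)
head(result)  # now has ly1/dy1, ly2/dy2, ly3/dy3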
Here is an outline of a function that relies more heavily on your example:
preparedata <- function(testData, outcomevar){
  require(plyr)
  testData <- testData[order(testData$firm, testData$date),]
  testData$tmp.var <- with(testData, eval(parse(text = outcomevar)))
  testData <- ddply(testData, .(firm), transform,
                    lvar = c(NA, tmp.var[-length(tmp.var)]))
  testData$tmp.var <- NULL
  testData <- within(testData, assign(paste("d", outcomevar, sep = ""),
                                      testData[, outcomevar] - testData$lvar))
  colnames(testData)[grep("lvar", colnames(testData))] <- paste("l", outcomevar, sep = "")
  return(testData)
}
Using the df defined in jihoward's answer, we get
> head(preparedata(df,"y1"))
firm date y1 y2 y3 lvar dy1
1 A 2014-01-02 27 48 66 NA NA
2 A 2014-01-03 37 86 35 27 10
3 A 2014-01-04 57 43 27 37 20
4 A 2014-01-05 89 24 97 57 32
5 A 2014-01-06 20 7 61 89 -69
6 A 2014-01-07 86 10 21 20 66
This function returns a dataframe where ly1 is the lagged variable, and dy1 is the differenced variable that was specified with the second argument outcomevar. Note that in this function, you pass the name (i.e. a character) to the function. That is, do not write y1, but "y1" when you call the function.
You could process all outcome variables simultaneously by first gathering them into a key-value column pair:
set.seed(1)
df <- data.frame(
firm = rep(LETTERS[1:5], each = 10),
date = as.Date("2014-01-01") + 1:10,
y1 = sample(100, 50),
y2 = sample(100, 50),
y3 = sample(100, 50)
)
library(dplyr)
library(tidyr)
df %>%
gather(key, value, y1:y3) %>%
group_by(firm, key) %>%
mutate(lag = lag(value), diff = lag - value)
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#>
#> firm date key value lag diff
#> 1 A 2014-01-02 y1 27 NA NA
#> 2 A 2014-01-03 y1 37 27 -10
#> 3 A 2014-01-04 y1 57 37 -20
#> 4 A 2014-01-05 y1 89 57 -32
#> 5 A 2014-01-06 y1 20 89 69
#> 6 A 2014-01-07 y1 86 20 -66
#> 7 A 2014-01-08 y1 97 86 -11
#> 8 A 2014-01-09 y1 62 97 35
#> 9 A 2014-01-10 y1 58 62 4
#> 10 A 2014-01-11 y1 6 58 52
#> .. ... ... ... ... ... ...
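For current tidyr, note that gather() has since been superseded; the reshaping step would now be written with pivot_longer() (a sketch, with the same grouping and mutate afterwards):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(y1:y3, names_to = "key", values_to = "value") %>%
  group_by(firm, key) %>%
  mutate(lag = lag(value), diff = lag - value)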
