I have some data that has a name, date, and two factors (x, y). I would like to calculate a rolling correlation of x and y for each name. Here is some sample data:
dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1 <- data.frame("ABC", dt, rnorm(20, 0, 3), rnorm(20, 2, 4))
names(df1) <- c("name", "date", "x", "y")
df2 <- data.frame("XYZ", dt, rnorm(20, 2, 5), rnorm(20, 3, 10))
names(df2) <- c("name", "date", "x", "y")
df <- rbind(df1, df2)
I would like to add a column named "Correl" that, for each date, holds the correlation of x and y over the previous 5 periods. However, when the name changes, I would like it to restart with NAs.
As you can see below, when the data switches from ABC to XYZ, the correlation is NA for the first 4 periods; once there are 5 data points, the correlation begins again.
name date x y Correl
ABC 1/1/2013 -3.59 -5.13 NA
ABC 1/2/2013 -8.69 4.22 NA
ABC 1/3/2013 2.80 -0.59 NA
ABC 1/4/2013 0.54 5.06 NA
ABC 1/5/2013 1.13 3.49 -0.03
ABC 1/6/2013 0.52 5.16 -0.38
ABC 1/7/2013 -0.24 -5.40 0.08
ABC 1/8/2013 3.26 -2.75 -0.16
ABC 1/9/2013 1.33 5.94 -0.04
ABC 1/10/2013 2.24 1.14 -0.01
ABC 1/11/2013 0.01 9.87 -0.24
ABC 1/12/2013 2.29 1.28 -0.99
ABC 1/13/2013 1.03 -6.30 -0.41
ABC 1/14/2013 0.62 4.82 -0.47
ABC 1/15/2013 1.08 -1.17 -0.50
ABC 1/16/2013 2.43 8.86 0.45
ABC 1/17/2013 -3.43 9.38 -0.35
ABC 1/18/2013 -5.73 7.59 -0.38
ABC 1/19/2013 1.77 3.13 -0.44
ABC 1/20/2013 -0.97 -0.77 -0.24
XYZ 1/1/2013 2.12 10.22 NA
XYZ 1/2/2013 -0.81 0.22 NA
XYZ 1/3/2013 -1.55 -2.25 NA
XYZ 1/4/2013 -4.53 3.63 NA
XYZ 1/5/2013 2.95 -1.51 0.13
XYZ 1/6/2013 6.76 24.16 0.69
XYZ 1/7/2013 3.33 7.31 0.66
XYZ 1/8/2013 -1.47 -4.23 0.67
XYZ 1/9/2013 3.89 -0.43 0.81
XYZ 1/10/2013 5.63 17.95 0.86
XYZ 1/11/2013 3.29 -7.09 0.63
XYZ 1/12/2013 6.03 -9.03 0.29
XYZ 1/13/2013 -5.63 6.96 -0.19
XYZ 1/14/2013 1.70 13.59 -0.18
XYZ 1/15/2013 -1.19 -16.79 -0.29
XYZ 1/16/2013 4.76 4.91 -0.11
XYZ 1/17/2013 9.02 25.16 0.57
XYZ 1/18/2013 4.56 6.48 0.84
XYZ 1/19/2013 5.30 11.81 0.99
XYZ 1/20/2013 -0.60 3.38 0.84
UPDATE: I have tried all of your suggestions and have run into problems using the actual data. I have attached a subset of the data below:
https://www.dropbox.com/s/6k4xhwuinlu0p1f/TEST_SUBSET.csv?dl=0
I cannot get this to work. I've tried removing the NAs, renaming the rows, reading the data in differently, and formatting the date differently. Nothing is working for me. Can you see if what you are running works for this dataset? Thank you very much, folks!
Apply ave to the row indexes of df to process by name and use rollapplyr to perform the rolling computations. Note that i is a vector of indexes:
library(zoo)
corx <- function(x) cor(x[, 1], x[, 2])
df$Correl <- ave(1:nrow(df), df$name, FUN = function(i)
  rollapplyr(df[i, c("x", "y")], 5, corx, by.column = FALSE, fill = NA))
Update: Changed rollapply to rollapplyr to be consistent with the output shown in the question. If you want centred correlations, change it back to rollapply.
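For anyone unsure of the difference, here is a minimal illustration (mine, not from the original answer) of how the two alignments place their results:
library(zoo)
x <- 1:6
rollapply(x, 3, mean, fill = NA)   # centred:       NA  2  3  4  5 NA
rollapplyr(x, 3, mean, fill = NA)  # right-aligned: NA NA  2  3  4  5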
This is a little late to the party, but the below is a pretty compact solution using dplyr and rollapply (from the zoo package).
library(dplyr)
library(zoo)
dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1 <- data.frame("ABC", dt, rnorm(20, 0, 3), rnorm(20, 2, 4))
names(df1) <- c("name", "date", "x", "y")
df2 <- data.frame("XYZ", dt, rnorm(20, 2, 5), rnorm(20, 3, 10))
names(df2) <- c("name", "date", "x", "y")
df <- rbind(df1, df2)
df <- df %>%
  group_by(name) %>%
  arrange(date) %>%
  do({
    correl <- rollapply(.[-(1:2)], width = 5, function(a) cor(a[, 1], a[, 2]),
                        by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })
which returns...
> df
Source: local data frame [40 x 5]
Groups: name
name date x y correl
1 ABC 2013-01-01 -0.61707785 -0.7299461 NA
2 ABC 2013-01-02 1.35353618 9.1314743 NA
3 ABC 2013-01-03 2.60815932 0.2511828 NA
4 ABC 2013-01-04 -2.89619789 -1.2586655 NA
5 ABC 2013-01-05 2.23750886 4.6616034 0.52013407
6 ABC 2013-01-06 -1.97573999 3.6800832 0.37575664
7 ABC 2013-01-07 1.70360813 2.2621718 0.32390612
8 ABC 2013-01-08 0.02017797 2.5088032 0.64020507
9 ABC 2013-01-09 0.96263256 1.6711756 -0.00557611
10 ABC 2013-01-10 -0.62400803 5.2011656 -0.66040650
.. ... ... ... ... ...
checking that the other group responds correctly...
> df %>%
+ filter(name=="XYZ")
Source: local data frame [20 x 5]
Groups: name
name date x y correl
1 XYZ 2013-01-01 3.4199729 5.0866361 NA
2 XYZ 2013-01-02 4.7326297 -5.4613465 NA
3 XYZ 2013-01-03 3.8983329 11.1635903 NA
4 XYZ 2013-01-04 1.5235936 3.9077184 NA
5 XYZ 2013-01-05 -5.4885373 7.8961020 -0.3755766
6 XYZ 2013-01-06 0.2311371 2.0157046 -0.3754510
7 XYZ 2013-01-07 2.6903306 -3.2940181 -0.1808097
8 XYZ 2013-01-08 -0.2584268 3.6047800 -0.8457930
9 XYZ 2013-01-09 -0.2897795 2.1029431 -0.9526992
10 XYZ 2013-01-10 5.9571558 18.5810947 0.7025559
11 XYZ 2013-01-11 -7.5250647 -8.0858699 0.7949917
12 XYZ 2013-01-12 2.8438336 -8.4072829 0.6563161
13 XYZ 2013-01-13 7.2295030 -0.1236801 0.5383666
14 XYZ 2013-01-14 -0.7579570 -0.2830291 0.5542751
15 XYZ 2013-01-15 4.3116507 -6.5291051 0.3894343
16 XYZ 2013-01-16 1.4334510 0.5957465 -0.1480032
17 XYZ 2013-01-17 -2.6444881 6.1261976 -0.6183805
18 XYZ 2013-01-18 0.8517223 0.5587499 -0.9243050
19 XYZ 2013-01-19 6.2140131 -3.0944259 -0.8939475
20 XYZ 2013-01-20 11.2871086 -0.1187153 -0.6845300
Hope this helps!
FOLLOW UP
I just ran the following on your actual data set:
library(dplyr)
library(zoo)
import <- read.csv("TEST_SUBSET.CSV", header=TRUE, stringsAsFactors = FALSE)
str(head(import))
import_df <- import %>%
  group_by(id) %>%
  arrange(asof_dt) %>%
  do({
    correl <- rollapply(.[-(1:2)], width = 5, function(a) cor(a[, 1], a[, 2]),
                        by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })
import_df
and received the following:
> import_df
Source: local data frame [15,365 x 5]
Groups: id
id asof_dt x y correl
1 DC1123 1/10/1990 -0.003773632 NA NA
2 DC1123 1/10/1991 0.014034992 NA NA
3 DC1123 1/10/1992 -0.004109765 NA NA
4 DC1123 1/10/1994 0.006369326 0.012176085 NA
5 DC1123 1/10/1995 0.014900600 0.001241080 NA
6 DC1123 1/10/1996 0.005763689 -0.013112491 NA
7 DC1123 1/10/1997 0.006949765 0.010737034 NA
8 DC1123 1/10/2000 0.044052805 0.003346296 0.02724175
9 DC1123 1/10/2001 0.009452785 0.017582638 0.01362101
10 DC1123 1/10/2002 -0.018876970 0.004346372 0.01343657
.. ... ... ... ... ...
so it feels like it's working.
The cor() function only returns a value once it has 5 complete input points, which doesn't happen until row 8.
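For reference, that is standard cor() behaviour: with the default use = "everything", any NA inside the window yields NA. A minimal illustration, with a hedged alternative if dropping incomplete pairs were acceptable in your analysis:
x <- c(1, 2, NA, 4, 5)
y <- c(2, 1, 3, 5, 4)
cor(x, y)                        # NA: missing values propagate by default
cor(x, y, use = "complete.obs")  # drops the incomplete pair first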
Here is a solution using base R; note that it requires the data set to be sorted by name and date, in that order.
dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1 <- data.frame("ABC", dt, rnorm(20, 0, 3), rnorm(20, 2, 4))
names(df1) <- c("name", "date", "x", "y")
df2 <- data.frame("XYZ", dt, rnorm(20, 2, 5), rnorm(20, 3, 10))
names(df2) <- c("name", "date", "x", "y")
df <- rbind(df1, df2)
rollcorr <- function(df, lag = 4) {
  out <- numeric(nrow(df) - lag)
  for (i in seq_along(out)) {
    window <- i:(i + lag)
    out[i] <- cor(df$x[window], df$y[window])
  }
  out <- c(rep(NA, lag), out)
  return(out)
}
df$Correl <- do.call(c, by(df[, -1], df[, 1], rollcorr))
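If the frame might not already be ordered that way, a quick sort beforehand keeps rollcorr happy (a one-line sketch, assuming the column names above):
# Sort by name, then date, since rollcorr's windows assume contiguous, ordered groups
df <- df[order(df$name, df$date), ]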
I have a large dataset with a continuous variable "Cholesterol" measured at two visits for each participant (each participant has two rows: first visit = Before, second visit = After). I'd like to standardise cholesterol, but the Before and After visits are merged, which makes the standardisation inaccurate because the mean and SD are computed over both visits at once.
Using base R, how can I create a new standardised cholesterol variable based on Visit in the same data set? Standardisation should be done twice (once for Before and once for After), but the output (standardised values) should end up in one variable again, following the same structure of this DF:
DF$Cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
DF$Visit< -c(Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before,After,Before, After,Before,After,Before,After)
# the standardisation function I want to apply
standardise <- function(x) {return((x-min(x,na.rm = T))/sd(x,na.rm = T))}
thank you in advance
Let's make your data, fix the DF$Visit assignment, change the standardise function to use the mean rather than the min, assume each new occurrence of "before" is the next person, pivot to wide format, and then mutate our before and after standardised variables:
library(dplyr)
library(tidyr)  # for pivot_wider()

df <- data.frame(x = rep(1, 30))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df <- df %>%
  mutate(person = cumsum(visit == "before")) %>%
  pivot_wider(names_from = visit, id_cols = person, values_from = cholesterol) %>%
  mutate(before_std = standardise(before),
         after_std = standardise(after))
gives:
person before after before_std after_std
<int> <dbl> <dbl> <dbl> <dbl>
1 1 0.986 2.92 -1.16 1.33
2 2 3.93 2.95 1.63 1.36
3 3 4.22 2.48 1.91 0.842
4 4 0.997 0.388 -1.15 -1.49
5 5 1.18 1.41 -0.979 -0.356
6 6 1.05 1.28 -1.10 -0.503
7 7 2.85 0.437 0.609 -1.44
8 8 2.24 0.757 0.0300 -1.08
9 9 3.04 1.73 0.788 0.00940
10 10 1.92 2.45 -0.271 0.814
11 11 2.85 0.843 0.611 -0.985
12 12 1.70 2.40 -0.483 0.749
13 13 2.65 1.22 0.420 -0.567
14 14 2.34 1.90 0.126 0.199
15 15 1.17 2.73 -0.986 1.12
If you actually want min in your standardise function rather than mean, editing it should be simple enough.
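For completeness, that min-based variant would just swap the centring term (a sketch reusing the asker's original definition):
# Centre on the group minimum instead of the mean, still scaling by SD
standardise_min <- function(x) (x - min(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)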
Edited to add a base R solution, but with a cautionary tale that there's probably a much neater way:
df <- data.frame(id = rep(c(seq(1, 15, 1)), each = 2))
df$cholesterol<- c( 0.9861551,2.9154158, 3.9302373,2.9453085, 4.2248018,2.4789901, 0.9972635, 0.3879830, 1.1782336, 1.4065341, 1.0495609,1.2750138, 2.8515144, 0.4369885, 2.2410429, 0.7566147, 3.0395565,1.7335131, 1.9242212, 2.4539439, 2.8528908, 0.8432039,1.7002653, 2.3952744,2.6522959, 1.2178764, 2.3426695, 1.9030782,1.1708246,2.7267124)
df$visit <- rep(c("before", "after"), 15)
df <- reshape(df, direction = "wide", idvar = "id", timevar = "visit")
standardise <- function(x) {return((x-mean(x,na.rm = T))/sd(x,na.rm = T))}
df$before_std <- round(standardise(df$cholesterol.before), 2)
df$after_std <- round(standardise(df$cholesterol.after), 2)
gives:
i id cholesterol.before cholesterol.after before_std after_std
1 1 0.9861551 2.9154158 -1.16 1.33
3 2 3.9302373 2.9453085 1.63 1.36
5 3 4.2248018 2.4789901 1.91 0.84
7 4 0.9972635 0.3879830 -1.15 -1.49
9 5 1.1782336 1.4065341 -0.98 -0.36
11 6 1.0495609 1.2750138 -1.10 -0.50
13 7 2.8515144 0.4369885 0.61 -1.44
15 8 2.2410429 0.7566147 0.03 -1.08
17 9 3.0395565 1.7335131 0.79 0.01
19 10 1.9242212 2.4539439 -0.27 0.81
21 11 2.8528908 0.8432039 0.61 -0.99
23 12 1.7002653 2.3952744 -0.48 0.75
25 13 2.6522959 1.2178764 0.42 -0.57
27 14 2.3426695 1.9030782 0.13 0.20
29 15 1.1708246 2.7267124 -0.99 1.12
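As an aside, if the goal is only grouped standardisation while keeping the long structure, base R's ave() does it in one line, run on the long-format df before the reshape step (a sketch using the standardise function above):
# Standardise within each visit group; results stay in the original row order
df$cholesterol_std <- ave(df$cholesterol, df$visit, FUN = standardise)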
It's a bit complicated to explain, so I hope it is clear enough; if not, I'll try to expand more.
So I have a data-frame like this:
df <- data.frame(index=sort(runif(300, -10,10)), v1=runif(300, -2,-1), v2=runif(300, 1,2))
It gives us a 3-column 300-row df. The first column ("index") contains sorted values from -10 to 10, and the next two columns ("v1"/"v2") contain random numeric values that are not important for this issue.
Now I classify my df rows into deciles according to the index column, (e.g. decile 1: places 1-30, decile 2: places 31-60) and I want to swap randomly between the rows such that all the 1st decile values are swapped randomly with the 6th decile, all 2nd decile values are swapped randomly with the 7th decile, and so on. When I say swapped I mean that the index value remains in its place but the v1 and v2 values are swapped (still coupled) with the v1 and v2 of a random row in the appropriate decile.
For example, the v1 and v2 of the first row in the df (and thus from the 1st decile), will be swapped with the v1 and v2 of the 160th row in the df (6th decile), the v1 and v2 of the second row in the df (1st decile) will be swapped with the v1 and v2 of the 175th row in the df (also 6th decile), the v1 and v2 of the 31st row in the df (2nd decile) will be swapped with the v1 and v2 of the 186th row in the df (7th decile) and so on so all of the v1+v2 values have changed places randomly to their appropriate new decile.
Hope it's clear. I've been trying to solve it for hours and couldn't figure it out.
Thanks
Using order() to sort by two indices: one for the rearranged deciles, the other random within each decile.
set.seed(123)
dtf <- data.frame(round(cbind(index=sort(runif(20, -10, 10)),
v1=runif(20, 0, 5),
v2=runif(20, 5, 10)), 2))
ea <- nrow(dtf)/10
# Deciles shifted by 5
d <- rep(((1:10 + 4) %% 10) + 1, each=ea)
# Random index within decile
r <- c(replicate(10, sample(ea)))
cbind(dtf, z=dtf[order(d, r), -1])
# index v1 v2 z.v1 z.v2
# 12 -9.16 4.45 5.71 4.51 7.21
# 11 -9.09 3.46 7.07 4.82 5.23
# 14 -7.94 3.20 7.07 3.98 5.61
# 13 -5.08 4.97 6.84 3.45 8.99
# 15 -4.25 3.28 5.76 0.12 7.80
# 16 -3.44 3.54 5.69 2.39 6.03
# 17 -1.82 2.72 6.17 3.79 5.64
# 18 -0.93 2.97 7.33 1.08 8.77
# 19 -0.87 1.45 6.33 1.59 9.48
# 20 0.56 0.74 9.29 1.16 6.87
# 2 1.03 4.82 5.23 3.46 7.07
# 1 1.45 4.51 7.21 4.45 5.71
# 3 3.55 3.45 8.99 3.20 7.07
# 4 5.77 3.98 5.61 4.97 6.84
# 6 7.66 0.12 7.80 3.54 5.69
# 5 7.85 2.39 6.03 3.28 5.76
# 8 8.00 3.79 5.64 2.97 7.33
# 7 8.81 1.08 8.77 2.72 6.17
# 10 9.09 1.59 9.48 0.74 9.29
# 9 9.14 1.16 6.87 1.45 6.33
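Scaled up to the 300-row frame from the question, the same idea can be applied in place (a sketch; the in-place assignment of v1 and v2 is my assumption about the desired output shape):
set.seed(123)
df <- data.frame(index = sort(runif(300, -10, 10)),
                 v1 = runif(300, -2, -1),
                 v2 = runif(300, 1, 2))
ea <- nrow(df) / 10                            # 30 rows per decile
d  <- rep(((1:10 + 4) %% 10) + 1, each = ea)   # target decile, shifted by 5
r  <- c(replicate(10, sample(ea)))             # random order within each decile
# index stays put; v1 and v2 arrive (still coupled) from the paired decile
df[, c("v1", "v2")] <- df[order(d, r), c("v1", "v2")]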
I think that this is what you need.
swapByBlocks <- function(df, blockSize = 30, nblocks = 10) {
  if ((nrow(df) != blockSize * nblocks) || nblocks %% 2) stop("Undefined behaviour")
  # Move the second half of the rows in front of the first half
  swappedDF <- df[c((nrow(df)/2 + 1):nrow(df), 1:(nrow(df)/2)), ]
  # One random within-block permutation per pair of swapped blocks
  ndxMat <- sapply(1:(nblocks/2), function(dummy) sample(1:blockSize))
  for (i in 1:ncol(ndxMat)) {
    # Shuffle block i with the permutation, and its partner block with the inverse
    swappedDF[(i-1)*blockSize + 1:blockSize, ] <-
      swappedDF[((i-1)*blockSize + 1:blockSize)[ndxMat[, i]], ]
    swappedDF[(i + nblocks/2 - 1)*blockSize + 1:blockSize, ] <-
      swappedDF[((i + nblocks/2 - 1)*blockSize + 1:blockSize)[order(ndxMat[, i])], ]
  }
  return(swappedDF)
}
A small case where you can check how it works:
res <- swapByBlocks(df[1:18,], blockSize = 3, nblocks = 6)
> df[1:18,]
index v1 v2
1 -9.859624 -1.657779 1.954094
2 -9.774898 -1.015825 1.006341
3 -9.624402 -1.713754 1.527065
4 -9.441129 -1.891834 1.803793
5 -9.424195 -1.125674 1.581199
6 -8.890537 -1.142044 1.219111
7 -8.838012 -1.173445 1.013408
8 -8.296938 -1.780396 1.570550
9 -8.172076 -1.789056 1.178596
10 -7.671897 -1.988539 1.690468
11 -7.655868 -1.095662 1.876414
12 -7.450011 -1.337443 1.632104
13 -7.204528 -1.880350 1.408944
14 -7.085862 -1.232293 1.593247
15 -7.030691 -1.087031 1.924306
16 -6.989892 -1.639967 1.495058
17 -6.978945 -1.395340 1.872944
18 -6.930379 -1.841031 1.061046
> res
index v1 v2
10 -7.450011 -1.337443 1.632104
11 -7.655868 -1.095662 1.876414
12 -7.671897 -1.988539 1.690468
13 -7.030691 -1.087031 1.924306
14 -7.085862 -1.232293 1.593247
15 -7.204528 -1.880350 1.408944
16 -6.989892 -1.639967 1.495058
17 -6.930379 -1.841031 1.061046
18 -6.978945 -1.395340 1.872944
1 -9.624402 -1.713754 1.527065
2 -9.774898 -1.015825 1.006341
3 -9.859624 -1.657779 1.954094
4 -8.890537 -1.142044 1.219111
5 -9.424195 -1.125674 1.581199
6 -9.441129 -1.891834 1.803793
7 -8.838012 -1.173445 1.013408
8 -8.172076 -1.789056 1.178596
9 -8.296938 -1.780396 1.570550
Here there are 18 rows in six blocks of three rows each. Rows 1 to 3 get swapped with rows 10 to 12, rows 4 to 6 with rows 13 to 15, and rows 7 to 9 with rows 16 to 18.
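On the full 300-row frame from the question, the defaults already match its layout, so the call reduces to (a usage sketch):
res <- swapByBlocks(df)  # blockSize = 30, nblocks = 10 by default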
I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = TRUE), digits = 0) # start at the minimum value
Tmax <- round(max(TAir, na.rm = TRUE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where am I failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 20 and is thus given the value 18.5, similar to 10.88 being between 10 and 11.99 and thus assigned the value 10.5.
I'm a newbie in R. I have a data set with 3 sets of lung function measurements, and 3 corresponding dates, for each observation, given below. I would like to extract the slope for each observation (the decline in lung function) using R and insert it in a new column.
1. How should I approach the problem?
2. Is my data set arranged in right format?
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 9/12/1991 3/11/1993
18303 0.79 4/23/1992
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991
Example for the 1st observation: slope = 0.0003.
Thanks.
If I understood the question, I think you want the slope between each set of visits:
library(dplyr)
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
do(data_frame(slope=diff(unlist(.[,2:4]))/diff(unlist(.[,5:7])),
after_visit=1+(1:length(slope))))
## Source: local data frame [18 x 3]
## Groups: ID [9]
##
## ID slope after_visit
## <int> <dbl> <dbl>
## 1 18105 -2.309469e-04 2
## 2 18105 -2.830189e-04 3
## 3 18200 -3.663004e-05 2
## 4 18200 NA 3
## 5 18303 NA 2
## 6 18303 NA 3
## 7 24204 -3.484321e-04 2
## 8 24204 6.745363e-05 3
## 9 28102 -5.639098e-04 2
## 10 28102 -2.359882e-04 3
## 11 34104 2.594810e-04 2
## 12 34104 -2.747253e-05 3
## 13 43108 -4.433498e-04 2
## 14 43108 -1.098901e-04 3
## 15 103114 -3.937008e-04 2
## 16 103114 -4.123711e-04 3
## 17 114101 -3.448276e-04 2
## 18 114101 2.604167e-05 3
Alternate munging:
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
do(data_frame(date=as.Date(unlist(.[,5:7]), origin="1970-01-01"), # in the event you wanted to keep the data less awful and have one observation per row, this preserves the Date class
reading=unlist(.[,2:4]))) %>%
do(data_frame(slope=diff(.$reading)/unclass(diff(.$date))))
This is a bit of a "hacky" solution, but if I understand your question correctly (some clarification may be needed), it should work in your case. Note that this is somewhat specific to your case, since the column pairs are expected to be in the order you specified.
library(dplyr)
library(lubridate)
### Load Data
tdf <- read.table(header=TRUE, stringsAsFactors = FALSE, text = '
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 NA 9/12/1991 3/11/1993 NA
18303 0.79 NA NA 4/23/1992 NA NA
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991') %>% tbl_df
#####################################
### Reshape the data by column pairs.
#####################################
### Function to reshape a single column pair
xform_data <- function(x) {
  df <- data.frame(tdf[, 'ID'],
                   names(tdf)[x],
                   tdf[, names(tdf)[x]],
                   tdf[, names(tdf)[x + 3]], stringsAsFactors = FALSE)
  names(df) <- c('ID', 'DateKey', 'Val', 'Date')
  df
}
### Create a new data frame with the data in a deep format (i.e. reshaped)
### 'lapply' is used to reshape each pair of columns (date and value).
### 'lapply' returns a list of data frames (on df per pair) and 'bind_rows'
### combines them into one data frame.
newdf <-
  bind_rows(lapply(2:4, function(x) xform_data(x))) %>%
  mutate(Date = mdy(Date, tz = 'utc'))
#####################################
### Calculate the slopes per ID
#####################################
slopedf <-
  newdf %>%
  arrange(DateKey, Date) %>%
  group_by(ID) %>%
  do(slope = lm(Val ~ Date, data = .)$coefficients[[2]]) %>%
  mutate(slope = as.vector(slope)) %>%
  ungroup
slopedf
## # A tibble: 9 x 2
## ID slope
## <int> <dbl>
## 1 18105 -3.077620e-09
## 2 18200 -4.239588e-10
## 3 18303 NA
## 4 24204 -5.534095e-10
## 5 28102 -4.325210e-09
## 6 34104 1.690414e-09
## 7 43108 -2.490139e-09
## 8 103114 -4.645589e-09
## 9 114101 -1.924497e-09
##########################################
### Adding slope column to original data.
##########################################
> tdf %>% left_join(slopedf, by = 'ID')
## # A tibble: 9 x 8
## ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13 slope
## <int> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993 -3.077620e-09
## 2 18200 0.87 0.85 NA 9/12/1991 3/11/1993 <NA> -4.239588e-10
## 3 18303 0.79 NA NA 4/23/1992 <NA> <NA> NA
## 4 24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994 -5.534095e-10
## 5 28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992 -4.325210e-09
## 6 34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994 1.690414e-09
## 7 43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995 -2.490139e-09
## 8 103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996 -4.645589e-09
## 9 114101 0.73 0.59 0.60 6/25/1989 8/5/1990 8/24/1991 -1.924497e-09
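A side note when comparing with the diff()-based answer above: mdy(Date, tz = 'utc') yields POSIXct, so lm() regresses against seconds and these slopes are per second. Rescaling to per-day units is one line (86,400 seconds per day):
# lm() slopes against POSIXct are per second; rescale to per day
slopedf$slope_per_day <- slopedf$slope * 86400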
I have:
DT = data.table(ID=rep(1:2,each = 2), Index=rep(1:2,times = 2), Close=3:6, Open=7:10)
My algorithm has earlier determined that the DT holds the time information in the column with name Index, hence the algorithm stores the following mapping:
time.col <- "Index"
Now the algorithm wants to perform a calculation that would be equivalent to:
DT[, list(Index, Value=cumsum(Close)),by=ID]
ID Index Value
1: 1 1 3
2: 1 2 7
3: 2 1 5
4: 2 2 11
How to rewrite the line and plug the time.col variable in?
Neither of the following works:
DT[, list(time.col, Value=cumsum(Close)),by=ID]
DT[, list(substitute(time.col), Value=cumsum(Close)),by=ID]
You can create an expression for all of j in DT:
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
EDIT
Or, if you store "Index" as a name, you can evaluate time.col within the environment of .SD:
time.col <- as.name("Index")
DT[,list(eval(time.col,envir=.SD), Value=cumsum(Close)),by=ID]
Very similar question here: In R data.table, how do I pass variable parameters to an expression?
Also, this question helps to understand the mystery of non-standard evaluation in data.table:
eval and quote in data.table
It turns out that the fastest of the eval-based solutions mentioned above is
e <- parse(text = paste0("list(", time.col,",", "Value=cumsum(Close))"))
DT[, eval(e),by=ID]
However, the := solution is even faster. See also Arun's note regarding copying.
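For the record, on the toy DT from the question the := route (mirroring fun2 in the benchmark below) might look like this; the ..time.col prefix requires data.table 1.10.2 or later:
library(data.table)
DT[, Value := cumsum(Close), by = ID]      # add the column by reference, no copy
out <- DT[, c("ID", ..time.col, "Value")]  # select via the stored column name
DT[, Value := NULL]                        # remove the helper column again
out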
Dataset
dim(DT); object.size(DT); DT
[1] 1354402 8
81291568 bytes
Instrument Date Open High Low Close Volume Adjusted Close
1: GOOG/AMEX_ABI 1981-03-11 NA NA 6.56 6.75 217200 NA
2: GOOG/AMEX_ABI 1981-03-12 NA NA 6.66 6.88 616400 NA
3: GOOG/AMEX_ABI 1981-03-13 NA NA 6.81 6.84 462000 NA
4: GOOG/AMEX_ABI 1981-03-16 NA NA 6.81 7.00 306400 NA
5: GOOG/AMEX_ABI 1981-03-17 NA NA 6.88 6.88 925600 NA
---
1354398: YAHOO/TSX_AMM_TO 2014-04-24 1.56 1.58 1.56 1.58 2700 1.58
1354399: YAHOO/TSX_AMM_TO 2014-04-25 1.60 1.62 1.59 1.62 11000 1.62
1354400: YAHOO/TSX_AMM_TO 2014-04-28 1.59 1.61 1.54 1.54 7200 1.54
1354401: YAHOO/TSX_AMM_TO 2014-04-29 1.58 1.60 1.58 1.59 500 1.59
1354402: YAHOO/TSX_AMM_TO 2014-04-30 1.55 1.55 1.50 1.52 36800 1.52
Benchmarking
time.col <- "Date"
fun <- function() {
  out <- DT[, list(get(time.col), Value = cumsum(Close)), by = Instrument]
  setnames(out, "V1", time.col)
}

fun2 <- function() {
  DT[, Value := cumsum(Close), by = Instrument]
  out <- DT[, c("Instrument", ..time.col, "Value")]
  DT[, Value := NULL]  # cleanup
  out
}

fun2. <- function() {
  DT[, Value := cumsum(Close), by = Instrument]
  # out <- DT[, c("Instrument", ..time.col, "Value")]
  # DT[, Value := NULL]  # cleanup
  # out
}

fun3 <- function() {
  DT[, list(eval(as.name(time.col), envir = .SD), Value = cumsum(Close)), by = Instrument]
}

fun4 <- function() {
  e <- parse(text = paste0("list(", time.col, ",", "Value=cumsum(Close))"))
  DT[, eval(e), by = Instrument]
}
Result
library(rbenchmark)
benchmark(fun(),
          fun2(),
          fun2.(),
          fun3(),
          fun4(),
          replications = 200)
     test replications elapsed relative user.self sys.self user.child sys.child
1   fun()          200    5.40    2.000      5.29     0.11         NA        NA
2  fun2()          200    5.18    1.919      4.72     0.45         NA        NA
3 fun2.()          200    2.70    1.000      2.70     0.00         NA        NA
4  fun3()          200    4.12    1.526      3.90     0.22         NA        NA
5  fun4()          200    4.07    1.507      3.91     0.16         NA        NA