Use previous calculated row value in R (continued 2)

I have a data.table that looks like this:
library(data.table)
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT
A B C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
...
20: 20 200 2000
I want to be able to calculate a new column "R" that has the first value as
DT$R[1]<-tanh(DT$B[1]/400000)
, and then I want to use the first row of column R to help calculate the next row value of R.
DT$R[2] <- 0.5*tanh(DT$B[2]/400000) + DT$R[1]*0.6
DT$R[3] <- 0.5*tanh(DT$B[3]/400000) + DT$R[2]*0.6
DT$R[4] <- 0.5*tanh(DT$B[4]/400000) + DT$R[3]*0.6
This will then look a bit like this
A B C R
1: 1 10 100 2.5e-05
2: 2 20 200 4e-05
3: 3 30 300 6.15e-05
4: 4 40 400 8.69e-05
5: 5 50 500 0.00011464
...
20: 20 200 2000 0.0005781274
Any ideas on how this could be done?

Is this what you are looking for?
DT <- data.table(A=1:20, B=1:20*10, C=1:20*100)
DT$R = 0
DT$R[1]<-tanh(DT$B[1]/400000)
for(i in 2:nrow(DT)) {
DT$R[i] <- 0.5*tanh(DT$B[i]/400000) + DT$R[i-1]*0.6
}
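The same recurrence can also be written without indexing into the table on every iteration. This is only a minimal sketch, assuming base R's Reduce with accumulate = TRUE is acceptable here; the names x and step are illustrative and not part of the answer above.

```r
library(data.table)

DT <- data.table(A = 1:20, B = 1:20 * 10, C = 1:20 * 100)

# Per-row input term (illustrative helper name)
x <- tanh(DT$B / 400000)

# One step of the recurrence: R[i] = 0.5 * x[i] + 0.6 * R[i - 1]
step <- function(prev, cur) 0.5 * cur + 0.6 * prev

# Seed with R[1] = x[1], fold over the remaining rows,
# and keep every intermediate value
DT[, R := Reduce(step, x[-1], x[1], accumulate = TRUE)]
```

Reduce still walks the rows one at a time, so this is not truly vectorised, but it computes the whole column once and assigns it by reference instead of modifying the table inside the loop.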

Related

Calculate percentage change in dataframe from first row

I want to calculate the per cent change in my dataframe using the first row as the reference. For example my dataframe
Set rate field
A 3 10
B 2 17
C 5 4
Using row A as the reference, I want to calculate the percentage change from row A to every other row for all columns in the dataframe.
which will result in
Set rate field
A 3 10
B -33 70
C 66.66 -60
or
Set rate field pct_rate pct_field
A 3 10 0 0
B 2 17 -33 70
C 5 4 66.66 -60
My code:
z %>%
mutate(pct_rate = (rate - lag(rate)/ rate ) * 100)
which doesn't give me the desired result
df <- fread("Set rate field
A 3 10
B 2 17
C 5 4")
Solution using dplyr: We can use dplyr's first function to refer to the first element of a vector (your attempt with lag is very close to this solution). I also used first(rate) in the denominator to calculate the percentage difference, which gives the numbers in your example.
library(dplyr)
df %>%
mutate(pct_rate = (rate - first(rate)) / first(rate) * 100,
pct_field = (field - first(field)) / first(field) * 100)
Returns:
Set rate field pct_rate pct_field
1: A 3 10 0.00000 0
2: B 2 17 -33.33333 70
3: C 5 4 66.66667 -60
You can use z$rate[1] or z$field[1] to get the first element and then do the calculations with all values.
z$pct_rate <- 100 * (z$rate - z$rate[1]) / z$rate[1]
z$pct_field <- 100 * (z$field - z$field[1]) / z$field[1]
z
# Set rate field pct_rate pct_field
#1 A 3 10 0.00000 0
#2 B 2 17 -33.33333 70
#3 C 5 4 66.66667 -60
or for many columns:
rbind(z[1,], do.call(cbind.data.frame, c(z[1],
lapply(z[-1], function(x) 100 * (x - x[1]) / x[1])))[-1,])
# Set rate field
#1 A 3.00000 10
#2 B -33.33333 70
#3 C 66.66667 -60
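Since the question's data was read with fread, the many-column case can also be sketched in data.table itself. This is a hedged example rather than one of the original answers; the names cols and new_cols are illustrative.

```r
library(data.table)

df <- data.table(Set = c("A", "B", "C"),
                 rate = c(3, 2, 5),
                 field = c(10, 17, 4))

# Columns to convert, and the names of the new columns (illustrative)
cols <- c("rate", "field")
new_cols <- paste0("pct_", cols)

# Percentage change of every row relative to the first row of each column
df[, (new_cols) := lapply(.SD, function(x) 100 * (x - x[1]) / x[1]),
   .SDcols = cols]
```

Adding another column to cols is enough to extend this; the first row of each new column is always 0.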

Column-specific arguments to lapply in data.table .SD when applying rbinom

I have a data.table for which I want to add columns of random binomial numbers based on one column as number of trials and multiple probabilities based on other columns:
require(data.table)
DT = data.table(
ID = letters[sample.int(26,10, replace = T)],
Quantity=as.integer(100*runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns, Quantity_A, Quantity_B, Quantity_C, Quantity_D and Quantity_E,
which apply rbinom with the corresponding probability and the quantity from the second column.
So for example the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
> 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? but I cannot seem to make it work. My try:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?
I seem to have found a workaround, though I am not quite sure why it works (it probably has something to do with the function rbinom not being vectorized in both arguments):
First define an index:
DT[,Index:=.I]
and then do it by index:
DT[,(paste0("Quantity_", prob.vecs)):= lapply(.SD,function(x){sum(rbinom(Quantity, 1 , x))}), .SDcols = prob.vecs, by=Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me.
A solution without the index would still be appreciated.
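One possible index-free variant is sketched below; treat it as a suggestion, not a confirmed answer from the thread. It relies on two facts about rbinom(n, size, prob): when n has length greater than one, only its length is used as the number of draws (which is why rbinom(Quantity, 1, x) needs one row at a time), and size and prob are recycled across the n draws. The sum of Quantity Bernoulli trials is a single binomial draw with size = Quantity, so one call per column is enough. The name out_cols is illustrative.

```r
library(data.table)

set.seed(123)
DT <- data.table(ID = letters[sample.int(26, 10, replace = TRUE)],
                 Quantity = as.integer(100 * runif(10)))
prob.vecs <- LETTERS[1:5]
DT[, (prob.vecs) := lapply(seq_along(prob.vecs), function(i) runif(.N, 0, 0.2))]

# One binomial draw per row: size comes from Quantity, prob from the column
out_cols <- paste0("Quantity_", prob.vecs)
DT[, (out_cols) := lapply(.SD, function(p) rbinom(.N, Quantity, p)),
   .SDcols = prob.vecs]
```

The draws will not reproduce the numbers of the by=Index version for a given seed, because the random numbers are consumed in a different order, but each value follows the same distribution.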

Comparing each element in two columns and set another column

I have a data frame (read from a file with fread) with two columns (dep and label). I want to set another column (mark) to an id value depending on the match. If the 'dep' entry matches a 'label' entry, mark gets the 'id' of the matched 'label'. If there is no match, mark gets the value of its own 'id'. Currently I have a workaround using loops, but I know there should be a neater, more idiomatic way to do it in R.
trace <- data.table(id=seq(1:7),dep=c(-1,45,40,47,0,45,43),
label=c(99,40,43,45,47,42,48), mark=rep("",7))
id dep label mark
1: 1 -1 99 1
2: 2 45 40 2
3: 3 40 43 2
4: 4 47 45 4
5: 5 0 47 5
6: 6 45 42 4
7: 7 43 48 3
I know loops are slow in R. Just to give an example, the following naive for/while loop works for small sizes, but my data set is huge.
trace$mark <- trace$id
for (i in 1:length(trace$id)){
val <- trace$dep[i]
j <- 1
while(j<=i && val !=-1 && val!=0){ # don't compare if val is -1/0
if(val==trace$label[j]){
trace$mark[i] <- trace$id[j]
}
j <-j +1
}
}
I have also tried using the following approach but it works only if there is a single match.
match <- which(trace$dep %in% trace$label)
match_to <- which(trace$label %in% trace$dep)
trace$mark[match] <- trace$mark[match_to]
This solution might help:
trace[trace[,.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
# id dep label mark
# 1: 1 -1 99 1
# 2: 2 45 40 4
# 3: 3 -1 43 3
# 4: 4 47 45 5
# 5: 5 -1 47 5
# 6: 6 45 42 4
# 7: 7 43 48 3
Update:
To make sure you are not matching dep with 0 or -1 values you can just add another line.
trace[dep %in% c(0,-1), mark:= as.character(id)]
OR
Try this:
trace[trace[!dep %in% c(0,-1),.(id,dep=label)],mark:=as.character(i.id),on="dep"]
trace[mark=="",mark:=as.character(id)]
The solution that worked
trace[trace[,.(id,dep=label)],on=.(id<=id,dep),mark:=as.character(i.id),allow.cartesian=TRUE]
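For comparison, the rule implemented by the loop in the question (a dep value only matches a label in the same or an earlier row, and -1/0 are never matched) can be sketched with match(). This is an alternative illustration, not one of the posted answers, and it assumes every label value is unique, because match() returns only the first hit. The names m, ok and new_mark are illustrative.

```r
library(data.table)

trace <- data.table(id = 1:7,
                    dep = c(-1, 45, 40, 47, 0, 45, 43),
                    label = c(99, 40, 43, 45, 47, 42, 48))

# Row position of the label equal to each dep (NA if there is none)
m <- match(trace$dep, trace$label)

# Keep a match only if it sits in the same or an earlier row
# and dep is not one of the special values -1 / 0
ok <- !is.na(m) & m <= seq_len(nrow(trace)) & !(trace$dep %in% c(0, -1))

# Default to the row's own id, then overwrite the matched rows
new_mark <- trace$id
new_mark[ok] <- trace$id[m[ok]]
trace[, mark := new_mark]
```

On the example data this gives mark = 1, 2, 2, 4, 5, 4, 3, which is the output shown in the question.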

Conditional data.table merge with .EACHI

I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.
Set up environment
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
Create two fake datasets
First, create a "big" table with 50 rows (10 values for each ID).
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
Merge
Next, I want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the output differs at all. It seems like Method 1 is just returning the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
Results
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table way of grabbing the max value in another data.table with multiple matches?
Next, I want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works, just making sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
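To see both variants side by side on data small enough to check by hand, here is a minimal sketch. The tables big and small and the column names r_last and r_max are made up for illustration; the join syntax is the same as in the answer above.

```r
library(data.table)

big <- data.table(ID = c("A", "A", "A", "B", "B"),
                  ValueBig = c(10, 20, 30, 5, 50))
big[, Rank := frank(ValueBig, ties.method = "first"), by = ID]

small <- data.table(ID = c("A", "B"), ValueSmall = c(25, 7))

# Variant 1: sort big, then take the last matching row for each row of small
setorder(big, ID, ValueBig)
small[, r_last := big[.SD, on = .(ID, ValueBig <= ValueSmall),
                      mult = "last", x.Rank]]

# Variant 2: aggregate all matching rows for each row of small
small[, r_max := big[.SD, on = .(ID, ValueBig <= ValueSmall),
                     max(x.Rank), by = .EACHI]$V1]
```

For ID A, the values 10 and 20 are at most 25, so the highest matching rank is 2. For ID B, only 5 is at most 7, so the rank is 1. Both columns agree here because the rank increases with ValueBig within each ID; for an aggregate that does not follow the sort order, only the by = .EACHI form applies.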

Calculate mean of a proportion of the data.frame

I'm working with data that looks similar to this:
cat value n
1 100 18
2 0 19
3 -100 15
4 100 13
5 0 17
6 -100 18
In the real data, there are many cats and value can be any number between -100 and 100 (no NA).
What I want to do is to calculate the mean of value based on terciles defined by n
So, for example, since sum(n)=100 what I want to do is to get n's as close as possible to 33 and calculate the mean of value. So for the first tercile, 18 isn't quite 33, so I need to take 15 values from cat=2. So the mean for the first tercile should be (100*18+0*15)/(18+15). The second tercile would be the remaining ns from cat=2, then as many as are needed to get to 33: (0*4+-100*15+100*13+0*1)/(4+15+13+1). Similar for the last tercile.
I got started writing this, but ended up with lots of nasty for loops and if statements. I'm hoping that you see an easier way to deal with this than I do. Thanks in advance!
A solution with data.table:
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))
][,.(mean_val_indx=mean(value)),by=indx]
this gives:
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
Which are the means of value for the three parts of the data.
Broken down in the intermediate steps:
1: replicate the rows according to n
setDT(df)[rep(1:.N,n)]
this gives (shortened):
cat value n
1: 1 100 18
2: 1 100 18
....
17: 1 100 18
18: 1 100 18
19: 2 0 19
20: 2 0 19
....
36: 2 0 19
37: 2 0 19
38: 3 -100 15
....
99: 6 -100 18
100: 6 -100 18
2: create an index with [,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
this gives:
> dt
cat value n indx
1: 1 100 18 a
2: 1 100 18 a
....
17: 1 100 18 a
18: 1 100 18 a
19: 2 0 19 a
20: 2 0 19 a
....
32: 2 0 19 a
33: 2 0 19 a
34: 2 0 19 b
35: 2 0 19 b
....
99: 6 -100 18 c
100: 6 -100 18 c
3: summarise value by indx with [,.(mean_val_indx=mean(value)),by=indx]
You could try something like this, data being your example dataframe:
longData<-unlist(apply(data[,c("value","n")],1,function(x){
rep(x["value"],x["n"])
}))
aggregate(longData,list(cut(seq_along(longData),breaks=3,right=FALSE)),mean)
longData will be a vector of length 100 with, using your example, 18 repetitions of 100, 19 repetitions of 0, etc.
The cut in the aggregate will divide longData into three groups, and the mean of each group will be calculated.
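The expansion idea can be checked end to end on the six example rows from the question. This is a minimal base R sketch with the data typed in by hand; the names long and grp are illustrative, and the 33/33/34 split is hard-coded because sum(n) is 100 here.

```r
df <- data.frame(cat = 1:6,
                 value = c(100, 0, -100, 100, 0, -100),
                 n = c(18, 19, 15, 13, 17, 18))

# One element per observation: each value repeated n times
long <- rep(df$value, df$n)

# Positions 1-33, 34-66 and 67-100 form the three groups
grp <- cut(seq_along(long), breaks = c(0, 33, 66, 100),
           labels = c("a", "b", "c"))

tercile_means <- tapply(long, grp, mean)
```

The first group contains the 18 values of cat 1 and 15 of the 19 values of cat 2, so its mean is (100*18 + 0*15)/33, as described in the question.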
If the data is already very long, repeating rows by "n" may be undesirable.
The following solution avoids this. Moreover, 1/3 of the sum of the
"n"-values is not rounded to the nearest integer.
"i" is the vector of row numbers where terciles end. Since it is possible
that several terciles end at the same row, those row numbers are replicated.
The result is the vector "k".
For each index "j" the cumulative sum of "data$value"*"data$n" up to row "k[j]"
covers "ms[k[j]]" terciles, so "ms[k[j]]-j" terciles have to be subtracted
to get the cumulative sum up to the "j"th tercile.
m <- 3
sn <- sum(data$n)
ms <- m * cumsum(data$n) / sn
d <- diff(c(0,floor(ms)))
i <- which(d>0)
k <- rep(i,d[i])
vn <- data$value * data$n
sums <- cumsum(vn)[k] - (ms[k]-(1:m))*data$value[k]*sn/m
means <- m*diff(c(0,sums))/sn
The means of the terciles are:
> means
[1] 54 -6 -54
In this example "i" is equal to "k". But if terciles are replaced by deciles,
i.e. "m" is not 3 but 10, they are distinct:
> m
[1] 10
> i
[1] 1 2 3 4 5 6
> k
[1] 1 2 2 3 3 4 5 5 6 6
> means
[1] 100 80 0 -30 -100 60 50 0 -80 -100
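The same steps can be wrapped in a function so that the tercile and decile cases are easy to compare. The wrapper name exact_group_means and the hand-typed example data are assumptions for illustration; the body follows the code above line by line.

```r
# Means of m equally weighted groups, without expanding the rows
exact_group_means <- function(value, n, m) {
  sn <- sum(n)
  ms <- m * cumsum(n) / sn        # groups completed after each row
  d  <- diff(c(0, floor(ms)))     # number of groups ending in each row
  i  <- which(d > 0)
  k  <- rep(i, d[i])              # row in which the j-th group ends
  # cumulative weighted sum up to the end of each group
  sums <- cumsum(value * n)[k] - (ms[k] - seq_len(m)) * value[k] * sn / m
  m * diff(c(0, sums)) / sn
}

value <- c(100, 0, -100, 100, 0, -100)
n     <- c(18, 19, 15, 13, 17, 18)

terciles <- exact_group_means(value, n, 3)
deciles  <- exact_group_means(value, n, 10)
```

Because each group has weight sum(n)/m exactly (33.33 for terciles here), the first tercile mean is 54 rather than the 54.545455 obtained from a 33/33/34 split. The floor() call assumes the group boundaries are not distorted by floating-point rounding, which holds for this example.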
I compared the speed of the 4 answers, using our small example with 8 rows:
> ##### "longData"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A1 <- f1(data) } )
user system elapsed
3.48 0.00 3.49
> ##### "sapply"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A2 <- f2(data) } )
user system elapsed
1.00 0.00 0.99
> ##### "data.table"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A3 <- f3(data) } )
user system elapsed
4.73 0.00 4.79
> ##### this Answer #####
>
> system.time( for ( i in 1:1000 ) { A4 <- f4(data) } )
user system elapsed
0.43 0.00 0.44
The "sapply"-Answer even gives a wrong result:
> A1
Group.1 x
1 [0.901,34) 54.545455
2 [34,67) -6.060606
3 [67,100) -52.941176
> A2
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
> A3
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
> A4
[1] 54 -6 -54
>
This is basically the same as NicE's answer, although perhaps useful as a different way of assembling the rep and cutting operations:
sapply(split( sort(unlist( mapply(rep, res$value, res$n) )),
cut(seq(sum(res$n)), breaks=c(0,33,67,100) )),
mean)
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939

Resources