Is there a way to "auto-name" expression in J - r

I have a few questions/suggestions concerning data.table.
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15)
R) X[,list(sum(y)),by=list(x)]
x V1
1: q 6
2: w 9
3: e 6
I think it is too bad that one has to write
R) X[,list(y=sum(y)),by=list(x)]
x y
1: q 6
2: w 9
3: e 6
It should default to keeping the same column name (i.e. y) when the function call involves only one column. This would be a massive gain in most cases, typically in finance, where we usually look at weighted sums, last times, and so on.
=> Is there any variable I can set to default to this behaviour?
When doing a select, I might want to do a calculation on a few columns and apply another operation to all the other columns.
It is too bad that when I want this:
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15,t=20:25,u=30:35)
R) X
x y z t u
1: q 1 10 20 30
2: q 2 11 21 31
3: q 3 12 22 32
4: w 4 13 23 33
5: w 5 14 24 34
6: e 6 15 25 35
R) X[,list(y=sum(y),z=last(z),t=last(t),u=last(u)),by=list(x)] #LOOOOOOOOOOONGGGG EXPR
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
I cannot write it like...
R) X[,list(sum(y)),by=list(x),defaultFn=last] #defaultFn would be applied to all remaining columns
=> Can I do this somehow (maybe by setting an option)?
Thanks

On part 1, that's not a bad idea. We already do that for expressions in by, and something close is already on the list for j:
FR#2286 Inferred naming could apply to j=colname[...]
Find max per group and return another column
But if we did do that it would probably need to be turned on via an option, to maintain backwards compatibility. I've added a link in that FR back to this question.
On the second part, how about:
X[,c(y=sum(y),lapply(.SD,last)[-1]),by=x]
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
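A variant of the same idea, as a minimal sketch: listing the "everything else" columns in .SDcols avoids relying on y being the first column of .SD. The variable names other_cols and res are introduced here only for illustration.

```r
library(data.table)

X <- data.table(x = c("q", "q", "q", "w", "w", "e"),
                y = 1:6, z = 10:15, t = 20:25, u = 30:35)

# Columns that should get the "default" function (illustrative name)
other_cols <- setdiff(names(X), c("x", "y"))

# sum() for y, last() for every column listed in .SDcols
res <- X[, c(list(y = sum(y)), lapply(.SD, last)), by = x, .SDcols = other_cols]
res
```

If more columns are added to X later, they are picked up by other_cols automatically and get last() applied.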
Please ask multiple questions separately, though. Each question on S.O. is supposed to be a single question.

Related

Write a function in R - calculate value from historical records and add to future records

I have the following dataset
Name<-c('A','A','B','C','B','C','D','B','C','A','D','C','B','C','A','D','C','B','A','D','C','B')
Rate<-c(12,13,4,8,7,3,6,8,5,4,7,5,9,4,7,2,7,3,9,13,14,12)
Date<-c('1998-11-11', '1992-12-01','2010-06-17', '2001-10-3','2019-4-01', '2020-4-23','2021-2-01', '1995-12-01',
'1994-7-11', '2023-3-01','2022-06-17', '1982-10-3','1898-4-01', '2027-4-23','1927-2-01', '2028-12-01',
'1993-5-21', '2013-2-09','2020-01-17', '1987-4-3','1881-5-01', '2024-5-23')
df<-cbind.data.frame(Name,Rate, Date)
df
Name Rate Date
1 A 12 1998-11-11
2 A 13 1992-12-01
3 B 4 2010-06-17
4 C 8 2001-10-3
5 B 7 2019-4-01
6 C 3 2020-4-23
7 D 6 2021-2-01
8 B 8 1995-12-01
9 C 5 1994-7-11
10 A 4 2023-3-01
11 D 7 2006-06-17
12 C 5 1982-10-3
13 B 9 1898-4-01
14 C 4 2027-4-23
15 A 7 1927-2-01
16 D 2 2028-12-01
17 C 7 1993-5-21
18 B 3 2013-2-09
19 A 9 2020-01-17
20 D 13 1987-4-3
21 C 14 1881-5-01
22 B 12 2024-5-23
I want to write a function in R to do the following :
For each Name (A, B, C, D), find the standard deviation of Rate over the historical data, where historical data means any record with Date before Dec 2018. Future records must not be used when calculating the SD. I then want to add each Name's historical SD to the future Rates of the same Name, where future Rates are those with Date after Dec 2018. Could anyone please help me write this function?
Below is the function I am working on
with(mutate(df,timediff = as.yearmon(Date) - as.yearmon(Sys.Date()) ),
tapply(df$Rate, Name, function(x){
ifelse(timediff < 0 ,
x + sd(x),
x)
}, simplify=FALSE) )
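One possible way to approach this in base R, offered only as a sketch. The helper name add_hist_sd, the output column AdjRate, and the cutoff date of 2018-12-31 are all assumptions, since the question does not pin down exactly where "Dec 2018" ends. The toy data frame is made up to keep the arithmetic easy to check.

```r
# Hypothetical helper: add each Name's historical SD to its future Rates
add_hist_sd <- function(df, cutoff = as.Date("2018-12-31")) {
  d <- as.Date(df$Date)
  is_hist <- d <= cutoff                       # assumed definition of "historical"
  # SD of Rate per Name, computed from historical rows only
  sds <- tapply(df$Rate[is_hist], df$Name[is_hist], sd)
  # Look up the SD that belongs to each row's Name
  row_sd <- unname(sds[as.character(df$Name)])
  # Historical rows keep their Rate; future rows get Rate + historical SD
  df$AdjRate <- ifelse(is_hist, df$Rate, df$Rate + row_sd)
  df
}

# Made-up toy data, not the data from the question
toy <- data.frame(
  Name = c("A", "A", "A", "B", "B", "B"),
  Rate = c(1, 3, 10, 2, 6, 5),
  Date = c("1998-11-11", "1992-12-01", "2020-01-17",
           "2010-06-17", "1995-12-01", "2019-4-01")
)
res <- add_hist_sd(toy)
res
```

A Name with fewer than two historical records has no defined SD, so its future rows would come out as NA in this sketch.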

R ranges: 1:0 - illogical behavior

I have an array X of length N, and I'd like to compute sum(X[(i+1):N]) - sum(X[1:(i-1)]). This works fine if my index, i, is within 2..(N-1). If it's equal to 1, the second term will return the first element of the array rather than exclude it. If it's equal to N, the first term will return the last element of the array rather than exclude it. seq_len is the only function I'm aware of that does the job, but only for the 2nd term (it indexes 1:n). What I need is a range function that will return NULL (rather than throw an exception like seq) when its 2nd argument is below its first. The sum function will do the rest. Is anyone aware of one, or do I have to write one myself?
I suggest an alternate path for generating indexing sequences: seq_len, which reacts intuitively in the extremes.
Bottom Line Up Front: use sum(X[-seq_len(i)]) - sum(X[seq_len(i-1)]) instead.
First, some sample data:
X <- 1:10
N <- length(X)
Your approach, at the two extremes:
i <- 1
X[(i+1):N]
# [1] 2 3 4 5 6 7 8 9 10
X[1:(i-1)] # oops
# [1] 1
That should return "nothing", I believe. (More to the point, sum(...) should return 0. For the record, sum(integer(0)) is 0.)
i <- 10
X[(i+1):N] # oops
# [1] NA 10
X[1:(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
There's your other error, where you'd expect "nothing" in the first subset.
Instead, I suggest you use seq_len:
i <- 1
X[-seq_len(i)]
# [1] 2 3 4 5 6 7 8 9 10
X[seq_len(i-1)]
# integer(0)
i <- 10
X[-seq_len(i)]
# integer(0)
X[seq_len(i-1)]
# [1] 1 2 3 4 5 6 7 8 9
Both seem fine, and something in the middle makes sense.
i <- 5
X[-seq_len(i)]
# [1] 6 7 8 9 10
X[seq_len(i-1)]
# [1] 1 2 3 4
In this contrived example, what we're looking for at any value of i:
1: sum(2:10) - 0 = 54 - 0 = 54
2: sum(3:10) - sum(1:1) = 52 - 1 = 51
3: sum(4:10) - sum(1:2) = 49 - 3 = 46
...
10: 0 - sum(1:9) = 0 - 45 = -45
And we now get that:
func <- function(i, x) sum(x[-seq_len(i)]) - sum(x[seq_len(i-1)])
sapply(c(1,2,3,10), func, X)
# [1] 54 51 46 -45
Edit:
李哲源's answer got me thinking that you don't need to re-sum the numbers before and after every time. Just do it once and reuse the result. This method could easily be faster if your vector is large.
Xb <- c(0, cumsum(X)[-N])
Xb
# [1] 0 1 3 6 10 15 21 28 36 45
Xa <- c(rev(cumsum(rev(X)))[-1], 0)
Xa
# [1] 54 52 49 45 40 34 27 19 10 0
sapply(c(1,2,3,10), function(i) Xa[i] - Xb[i])
# [1] 54 51 46 -45
So this suggests that your summed value at any value of i is
Xs <- Xa - Xb
Xs
# [1] 54 51 46 39 30 19 6 -9 -26 -45
where you can find the specific value with Xs[i]. No repeated summing required.
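If a general "ascending-only" range function is still wanted, as the question asks, a small wrapper is one option. This is a sketch; the name up_to is made up for illustration and is not a base R function.

```r
# Hypothetical helper: like from:to, but empty when to < from
up_to <- function(from, to) {
  if (to < from) integer(0) else seq.int(from, to)
}

X <- 1:10
N <- length(X)

# Same quantity as in the question, now safe at i = 1 and i = N
f <- function(i) sum(X[up_to(i + 1, N)]) - sum(X[up_to(1, i - 1)])
vals <- sapply(c(1, 2, 3, 10), f)
vals
```

Indexing with integer(0) selects nothing, and sum() of an empty vector is 0, so both edge cases fall out without special handling.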

Adding extreme value distributed noise (with µ=0,σ=10) to a vector of numbers in R

I have the following matrix
Measurement Treatment
38 A
14 A
54 A
69 A
20 B
36 B
35 B
10 B
11 C
98 C
88 C
14 C
I want to add extreme value distributed noise (with mean=0 and sd=10) to the Measurement values. How can I achieve that in R?
I found revd in extRemes package, but it does not work as expected. Does devd from the same package do what I want to do? (but it does not allow for mean and sd to be defined)
If you want to use your measure as the mean for the noise, then you can do this:
measure = round(runif(10,0,30),0)
data = data.frame(measure)
for(i in 1:nrow(data)){
data$measure1[i] = rnorm(1,data$measure[i],10)
}
data
measure measure1
1 6 6.281557
2 12 -5.780177
3 18 13.529773
4 26 33.665584
5 14 12.666614
6 24 41.146132
7 5 -1.850390
8 14 16.728703
9 13 26.082601
10 13 14.066475
EDIT: You can avoid the for loop with this instead:
data$measure1 = data$measure + rnorm(nrow(data),0,10) # one independent draw per row
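The answer above draws normal noise. If "extreme value distributed" is taken to mean the Gumbel (type I extreme value) distribution, which is an assumption here, the noise can be sampled in base R by inverting the CDF. A Gumbel distribution with location mu and scale beta has mean mu + beta * gamma (gamma being Euler's constant) and standard deviation beta * pi / sqrt(6), so both parameters follow from the requested mean and sd. The helper name rgumbel_noise is made up for this sketch.

```r
set.seed(1)  # only for reproducibility

# Hypothetical helper: Gumbel noise with a requested mean and sd
rgumbel_noise <- function(n, mean = 0, sd = 10) {
  beta <- sd * sqrt(6) / pi          # scale from the requested sd
  mu   <- mean + beta * digamma(1)   # location; digamma(1) is minus Euler's constant
  mu - beta * log(-log(runif(n)))    # inverse-CDF sampling
}

dat <- data.frame(
  Measurement = c(38, 14, 54, 69, 20, 36, 35, 10, 11, 98, 88, 14),
  Treatment   = rep(c("A", "B", "C"), each = 4)
)
dat$Noisy <- dat$Measurement + rgumbel_noise(nrow(dat))

# Sanity check on a large sample: mean close to 0, sd close to 10
big <- rgumbel_noise(1e5)
c(mean = mean(big), sd = sd(big))
```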

2 variables in a for loop in R

I have two vectors that I would like to reference in a for loop, but they have different lengths.
n=1:50
m=letters[1:14]
I tried a single loop to read it
for (i in c(11:22,24,25)){
cat (paste(n[i],m[i],sep='\t'),sep='\n')
}
and ended up with:
11 k
12 l
13 m
14 n
15 NA
16 NA
17 NA
18 NA
19 NA
20 NA
21 NA
22 NA
24 NA
25 NA
but I would like to obtain:
11 a
12 b
13 c
...
25 n
Is there a way to have a double variable declaration?
for (i in c(11:22,24,25) and j in 1:14){
cat (paste(n[i],m[j],sep='\t'),sep='\n')
}
or something similar to get the result I want?
No, there isn't. But you can do this:
ind_j <- c(11:22,24,25)
ind_k <- 1:14
for (i in seq_along(ind_j)){
cat (paste(n[ind_j[i]],m[ind_k[i]],sep='\t'),sep='\n')
}
Of course, it's very probable that you shouldn't use a for loop for your actual problem.
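As a sketch of the loop-free version: paste is vectorised, so both index vectors can be applied at once. The name out is introduced here only for illustration.

```r
n <- 1:50
m <- letters[1:14]
ind_j <- c(11:22, 24, 25)
ind_k <- 1:14

# Pair the selected elements of n and m position by position
out <- paste(n[ind_j], m[ind_k], sep = "\t")
cat(out, sep = "\n")
```

This relies on ind_j and ind_k having the same length, as they do here.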
If you want m to start over when it has reached the end, you can take advantage of recycling in R.
cat(paste(n, m, sep='\t', collapse='\n'), '\n')
When the end of m is reached, it will start over until all elements of n have been iterated over. If you need this in a loop, replace cat with a for loop.
Your problem lies in assigning the values to i in for (i in c(11:22,24,25)): this assigns the values 11, 12, 13, 14, 15, ... to i.
You then look up m[i].
But remember: m has only 14 items (indices 1 to 14), so for index 15 and above you'll get NAs.
Maybe this is what you wanted. There are more robust answers here, and @Roland's is far better, but IMHO this fixes your problem without changing your initial approach:
for (i in c(1:12,14,15)){cat (paste(n[i],m[i],sep='\t'),sep='\n')}
If you just subtract 10 from your sequence, the indexing problem goes away for every index up to 14 (m still has no 15th element), and you'll get:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
14 n
15 NA

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
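The same steps can be wrapped in a small helper, sketched here with setdiff. The function name move_last is made up for illustration.

```r
# Hypothetical helper: move one named column to the end of a data frame
move_last <- function(df, col) {
  df[, c(setdiff(names(df), col), col), drop = FALSE]
}

z <- data.frame(x = 1:10, y = 21:30)
z$total <- z$x + z$y
z$w <- 11:20
z$total <- z$x + z$y + z$w

z <- move_last(z, "total")
names(z)
```

Calling move_last again after each new column is added keeps total at the right-hand edge.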
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
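A minimal sketch of the "append it last" approach, assuming every column present at that point is numeric and should be included in the total:

```r
z <- data.frame(x = 1:10, y = 21:30)
z$w <- 11:20

# Compute the total only once all other columns exist;
# it is appended on the right, so it ends up last
z$total <- rowSums(z)
names(z)
```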

Resources