I have a data.table
library(data.table)
dt2 <- data.table(urn = 1:10, freq = 0, freqband = "")
dt2$freqband = NA
dt2$freq <- 1:7 # gives a recycling warning (7 values into 10 rows)
## urn freq freqband
## 1: 1 1 NA
## 2: 2 2 NA
## 3: 3 3 NA
## 4: 4 4 NA
## 5: 5 5 NA
## 6: 6 6 NA
## 7: 7 7 NA
## 8: 8 1 NA
## 9: 9 2 NA
## 10: 10 3 NA
I also have a function that I want to use to group my freq column:
fn_GetFrequency <- function(numgifts) {
  if (numgifts < 5) return("<5")
  if (numgifts >= 5) return("5+")
  return("ERROR")
}
I want to set the freqband column based on this function. In some cases it will apply to all records, in others to a subset. My current approach (for a subset) is:
dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
using this approach I get the warning:
Warning message:
In if (numgifts < 5) return("<5") :
the condition has length > 1 and only the first element will be used
It then sets all the records to "<5" rather than the correct value. I figure I need some sort of lapply/sapply/etc. function, but I haven't yet grasped how they work well enough to use them here.
Any help would be greatly appreciated.
EDIT: How might you do this if you use a function that requires 2 parameters?
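For the two-parameter case asked about in the EDIT, one hedged sketch (the function and its threshold argument here are made up for illustration): mapply() calls the function once per element, so the scalar-only if still works, and a constant second argument is recycled.
# hypothetical two-argument banding function
fn_GetBand <- function(numgifts, threshold) {
  if (numgifts < threshold) return(paste0("<", threshold))
  paste0(threshold, "+")
}
# mapply() walks freq element-wise, recycling the constant 5
dt2[urn < 9, freqband := mapply(fn_GetBand, freq, 5)]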
UPDATED: to include the output of dt2 after my attempted update
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 <5
6: 6 6 <5
7: 7 7 <5
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
UPDATE: I tried this code and it delivered the desired output, while still giving me a function I can call elsewhere in my code:
dt2[dt2$urn < 9, freqband := sapply(freq, fn_GetFrequency)]
> fn_GetFrequency <- function(numgifts) {
+ ifelse (numgifts <5, "<5", "5+")
+ }
> dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
> dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
For multiple bands (which I'm sure has been asked before) you should use the findInterval function. And I'm doing it the data.table way rather than the data.frame way:
dt2[ urn==8, freq := -1 ] # and something to test the <0 condition
dt2[ urn <= 8, freqband := c("ERROR", "<5", "5+")[
findInterval(freq,c(-Inf, 0, 5 ,Inf))] ]
dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 -1 ERROR
9: 9 2 NA
10: 10 3 NA
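To see what the breakpoints are doing, here is the mapping findInterval performs on a few sample values (my illustration, not part of the original answer):
findInterval(c(-1, 0, 4, 5, 9), c(-Inf, 0, 5, Inf))
# [1] 1 2 2 3 3
# indexing c("ERROR", "<5", "5+") with this gives ERROR, <5, <5, 5+, 5+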
I am trying to call different columns of a data.table inside a loop, to get unique values of each column.
Consider the simple data.table below.
> df <- data.table(var_a = rep(1:10, 2),
+ var_b = 1:20)
> df
var_a var_b
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
6: 6 6
7: 7 7
8: 8 8
9: 9 9
10: 10 10
11: 1 11
12: 2 12
13: 3 13
14: 4 14
15: 5 15
16: 6 16
17: 7 17
18: 8 18
19: 9 19
20: 10 20
My code works when I call for a specific column outside a loop,
> unique(df$var_a)
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, var_a])
[1] 1 2 3 4 5 6 7 8 9 10
> unique(df[, "var_a"])
var_a
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
but not when I do so within a loop that goes through different columns of the data.table.
> for(v in c("var_a","var_b")){
+ print(v)
+ df$v
+ unique(df[, .v])
+ unique(df[, "v"])
+ }
[1] "var_a"
Error in `[.data.table`(df, , .v) :
j (the 2nd argument inside [...]) is a single symbol but column name '.v' is not found. Perhaps you intended DT[, ...v]. This difference to data.frame is deliberate and explained in FAQ 1.1.
>
> unique(df[, ..var_a])
Error in `[.data.table`(df, , ..var_a) :
Variable 'var_a' is not found in calling scope. Looking in calling scope because you used the .. prefix.
For the first problem, when you're referencing a column name indirectly, you can either use double-dot ..v syntax, or add with=FALSE in the data.table::[ construct:
for (v in c("var_a", "var_b")) {
print(v)
print(df$v)
### either one of these will work:
print(unique(df[, ..v]))
# print(unique(df[, v, with = FALSE]))
}
# [1] "var_a"
# NULL
# var_a
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# [1] "var_b"
# NULL
# var_b
# <int>
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# 6: 6
# 7: 7
# 8: 8
# 9: 9
# 10: 10
# 11: 11
# 12: 12
# 13: 13
# 14: 14
# 15: 15
# 16: 16
# 17: 17
# 18: 18
# 19: 19
# 20: 20
# var_b
But this just prints it without changing anything. If all you want to do is look at unique values within each column (and not change the underlying frame), then I'd likely go with
lapply(df[,.(var_a, var_b)], unique)
# $var_a
# [1] 1 2 3 4 5 6 7 8 9 10
# $var_b
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
which shows the name and unique values. The use of lapply (whether on df as a whole or a subset of columns) is also preferable to another recommendation to use apply(df, 2, unique), since apply() coerces the table to a matrix first, though in this case it returns the same results.
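A quick illustration of why (my example, not from the original answer): apply() runs as.matrix() first, so a mixed-type table is coerced to character before unique() ever runs.
dfm <- data.table(a = 1:2, b = c("x", "y")) # hypothetical mixed-type table
apply(dfm, 2, unique)  # columns coerced to character: a becomes "1", "2"
lapply(dfm, unique)    # a stays integer, b stays character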
Use .subset2 to refer to a column by its name:
for(v in c("var_a","var_b")) {
print(unique(.subset2(df, v)))
}
Following the information in the first error message, this would be the correct way to call it in a loop:
for(v in c("var_a","var_b")){
print(unique(df[, ..v]))
}
# won't print all the lines
As for the second error: you have not declared a variable called var_a; it looks like you want to select by name.
# works as you have shown
unique(df[, "var_a"])
# works once the variable is declared
var_a <- "var_a"
unique(df[, ..var_a])
You may also be interested in the env param of data.table (see the development version); here is an illustration, and you could use this in a loop too.
v="var_a"
df[, v, env=list(v=v)]
Output:
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
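Since the answer mentions it works in a loop too, a minimal sketch (assuming a data.table version with env= support, which turns a length-1 character value into a column name):
for (v in c("var_a", "var_b")) {
  print(unique(df[, col, env = list(col = v)])) # col is substituted by the name in v
}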
I am using frollsum with adaptive = TRUE to calculate the rolling sum over a window of 26 weeks, where for weeks < 26 the window is simply the number of available weeks.
Is there anything similar that identifies the most common value instead of computing a rolling sum? I basically need the median of the past 26 (or fewer) weeks. I realize that frollapply does not allow adaptive = TRUE, so it does not work in my case, as I need values for the weeks before week 26 as well.
Here is an example (I added the "desired" column as column four):
week product sales desired
1: 1 1 8 8
2: 2 1 8 8
3: 3 1 7 8
4: 4 1 4 8
5: 5 1 7 7.5
6: 6 1 4 7.5
7: 7 1 8 8
8: 8 1 9 and
9: 9 1 4 so
10: 10 1 7 on
11: 11 1 5 ...
12: 12 1 3
13: 13 1 8
14: 14 1 10
Here is some example code:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
data[,desired:=frollapply(sales,26,median,adaptive=TRUE)] #This only starts at week 26
Thank you very much for your help!
Here is an option using RcppRoll with data.table:
library(RcppRoll)
data[, med_sales :=
fifelse(is.na(x <- roll_medianr(sales, 26L)),
c(sapply(1L:25L, function(n) median(sales[1L:n])), rep(NA, .N - 25L)),
x)]
or using replace instead of fifelse:
data[, med_sales := replace(roll_medianr(sales, 26L), 1L:25L,
sapply(1L:25L, function(n) median(sales[1L:n])))]
output:
week product sales med_sales
1: 1 1 9 9
2: 2 1 3 6
3: 3 1 4 4
4: 4 1 6 5
5: 5 1 9 6
---
996: 96 10 2 5
997: 97 10 8 5
998: 98 10 7 5
999: 99 10 4 5
1000: 100 10 3 5
data:
library(data.table)
set.seed(0L)
week <- seq(1:100)
products <- seq(1:10)
sales <- round(runif(1000,1,10),0)
data <- as.data.table(cbind(merge(week,products,all=T),sales))
names(data) <- c("week","product","sales")
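If you'd rather avoid the extra dependency, here is a dependency-free sketch of the same adaptive median using sapply (grouping by product is my assumption; drop the by if the roll is meant to run over all rows, as in the answer above):
# for row i the window is the last min(26, i) sales values,
# so the early weeks are filled in as well
data[, desired := sapply(seq_len(.N), function(i)
  median(sales[max(1L, i - 25L):i])), by = product]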
Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question).
But doing this line by line for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is fine on small data, but not on big data):
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval
y[ , last.x := findInterval(V1, x)]
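Note that findInterval returns 0 (not NA) for values below the first breakpoint; if you want the NA rows of the original output, one way (my addition) is:
y[, last.x := {v <- findInterval(V1, x); replace(v, v == 0L, NA_integer_)}]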
Slightly more convoluted using cut, but on the other hand you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
Pretty simple in base R
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5
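One caveat I'd add: the cumsum trick relies on y being sorted and on every breakpoint actually occurring in y, which findInterval does not require. For example:
x2 <- c(5.5, 15.5)   # breakpoints that never appear in y
cumsum(y %in% x2)    # all zeros -- the trick silently fails
findInterval(y, x2)  # still gives the correct interval index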
I've found several options on how to generate IDs by groups using the data.table package in R, but none of them fit my problem exactly. Hopefully someone can help.
In my problem, I have 160 markets that fall within 21 regions in a country. These markets are numbered 1:160 and there may be multiple observations documented within each market. I would like to restructure my market ID variable so that it represents unique markets within each region, and starts counting over again with each new region.
Here's some code to represent my problem:
require(data.table)
dt <- data.table(region = c(1,1,1,1,2,2,2,2,3,3,3,3),
market = c(1,1,2,2,3,3,4,4,5,6,7,7))
> dt
region market
1: 1 1
2: 1 1
3: 1 2
4: 1 2
5: 2 3
6: 2 3
7: 2 4
8: 2 4
9: 3 5
10: 3 6
11: 3 7
12: 3 7
Currently, my data is set up to represent the result of
dt[, market_new := .GRP, by = .(region, market)]
But what I'd like get is
region market market_new
1: 1 1 1
2: 1 1 1
3: 1 2 2
4: 1 2 2
5: 2 3 1
6: 2 3 1
7: 2 4 2
8: 2 4 2
9: 3 5 1
10: 3 6 2
11: 3 7 3
12: 3 7 3
This seems to return what you want:
dt[, market_new:=as.numeric(factor(market)), by=region]
Here we split the data by region and then give each market within a region a unique ID via factor(), extracting the underlying numeric index.
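Note that factor() numbers its levels in sorted order; if the IDs should instead follow first-appearance order within each region, match() against unique() is a common alternative (an option I'm adding, not part of the original answer):
dt[, market_new := match(market, unique(market)), by = region]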
From 1.9.5+, you can use frank() (or frankv()) with ties.method = "dense" as follows:
dt[, market_new := frankv(market, ties="dense"), by=region]
I have a question I really need your help with:
set.seed(1111)
s<-rep(seq(1,4),5)
a<-runif(20,0.2,0.6)
b<-runif(20,0.4,0.7)
b[6:8]<-NA
c<-runif(20,4,7)
d<-data.table(s,a,b,c)
setkey(d,s)
The data is as following:
s a b c
1: 1 0.3862011 0.4493240 6.793058
2: 1 0.4955267 0.4187441 4.708561
3: 1 0.4185155 0.5916827 6.810053
4: 1 0.5003833 0.5403744 5.948629
5: 1 0.5667312 0.5634135 6.880848
6: 2 0.3651699 0.5263655 5.721908
7: 2 0.5905308 NA 6.863213
8: 2 0.2560464 0.4649180 5.745656
9: 2 0.4533625 0.5077432 5.958526
10: 2 0.4228027 0.4340407 5.115065
11: 3 0.5628013 0.6517352 6.252962
12: 3 0.5519840 NA 4.875669
13: 3 0.2006761 0.6418540 5.452210
14: 3 0.5472671 0.4503713 6.962282
15: 3 0.5601675 0.5195013 6.666593
16: 4 0.2548422 0.6962112 5.535579
17: 4 0.2467137 NA 6.680080
18: 4 0.4995830 0.6793684 6.334579
19: 4 0.2637452 0.4078512 6.076039
20: 4 0.5063548 0.4055017 5.287291
If I do a simple sum, using s as the key, it returns a nice table summarizing the result:
d[,sum(c),by=s]
s V1
1: 1 31.14115
2: 2 29.40437
3: 3 30.20972
4: 4 29.91357
However, if my data.table command contains an ifelse statement, I do not get a similar table:
d2<-d[,ifelse(a<b,"NA",sum(c)),by=s]
d2
s V1
1: 1 NA
2: 1 31.1411493057385
3: 1 NA
4: 1 NA
5: 1 31.1411493057385
6: 2 NA
7: 2 NA
8: 2 NA
9: 2 NA
10: 2 NA
11: 3 NA
12: 3 NA
13: 3 NA
14: 3 30.2097161230631
15: 3 30.2097161230631
16: 4 NA
17: 4 NA
18: 4 NA
19: 4 NA
20: 4 29.9135677714366
Is it possible for the ifelse statement to return a result like the simple-sum table, i.e. a single non-NA value for each key value?
Thanks a lot!
I am not entirely certain what you are looking for, but I think you just want to use the a<b condition as the row selector in your data.table, which is done by using it as the first argument in the brackets:
> d[a<b, sum(c), by = s]
s V1
1: 1 19.6
2: 2 22.5
3: 3 11.7
4: 4 17.9
library(plyr)
ddply(d[a<b], .(s), summarize, tot=sum(c))
There is a simple and fast solution based on a conditional sum using which:
d[, .( sum_c = sum(c[which( a < b)]) ), by=s]
# s sum_c
# 1: 1 19.552
# 2: 2 22.541
# 3: 3 11.705
# 4: 4 17.946
The advantage of this structure over the other answers presented so far is that it allows you to calculate different aggregations in the same call using different conditions, for example:
d[, .( sum_c = sum(c[which( a < b)]),
sum_a = sum(c[which( c < 6)]) ), by=s]
# s sum_c sum_a
# 1: 1 19.552 10.657
# 2: 2 22.541 22.541
# 3: 3 11.705 10.328
# 4: 4 17.946 10.823
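Part of why which() helps here (my note): logical subsetting keeps an NA element wherever the condition evaluates to NA, and since b contains NAs in this data, sum(c[a < b]) would come back NA for the affected groups, while which() simply drops those positions:
v <- c(1, 2, 3)
cond <- c(TRUE, NA, TRUE)
v[cond]         # 1 NA 3 -- the NA leaks through
v[which(cond)]  # 1 3    -- NA positions dropped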
There is a benchmark of the speed of this solution compared to other approaches in a similar question, here.