findInterval() with varying intervals in data.table R

I asked this question a long time ago but never found an answer, so I am reposting it (I hope that is acceptable on Stack Overflow).
I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
library(data.table)
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
So far I have tried:
merging the DT and intervals.dt data.tables by month/year,
creating a new intervalsstring column that pastes all the V* columns into one string (not very elegant, I admit), and finally
splitting that string back into a vector so that I can use it in findInterval(), but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for row 3380 the price ~64.19 should fall in the 5th interval, not the 7th. I guess my mistake is that in my last command findInterval() uses only the first row of intervalsstring.
Thank you!

Your main problem is that you just didn't do findInterval for each group. But I also don't see the point of making that large merged data.table, or the paste/strsplit business. This is what I would do:
DT[, interval := findInterval(price, intervals.dt[.BY][, V1:V10]),
   by = .(year, month)][]
# year month price interval
# 1: 2000 1 30.776611 2
# 2: 2000 1 62.499648 6
# 3: 2000 1 53.581115 6
# 4: 2000 1 48.830599 5
# 5: 2000 1 33.066053 2
# ---
#3376: 2009 10 33.635924 1
#3377: 2009 10 38.993769 1
#3378: 2009 10 75.065820 7
#3379: 2009 10 6.277403 0
#3380: 2009 10 64.189162 5
Note that intervals.dt[.BY] is a keyed subset.
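If the column-range lookup inside findInterval() feels opaque, the same grouped lookup can be written against a long-format table of breakpoints. This is only a sketch on the same data; breaks.long and interval2 are illustrative names introduced here, not part of the answer above:
# Sketch: reshape the breakpoints to long format, then look them up per group
breaks.long <- melt(intervals.dt, id.vars = c("year", "month"),
                    measure.vars = paste0("V", 1:10), value.name = "brk")
setkey(breaks.long, year, month)
DT[, interval2 := findInterval(price, sort(breaks.long[.BY, brk])),
   by = .(year, month)]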

Related

Select rows based on a specified value (difference of 19+ between the max score and the second-highest value)

I have a list of 50 meditation techniques that I am classifying into one of 3 categories based on ratings by 92 people. I have calculated the difference between the 'max' value in each row and the ratings of the other 2 categories.
I now want to select the specific rows where the difference between the 2nd-highest rating and the max value is greater than 19 (so 20+).
Looking at the table below, for MATKO_NEWBERG_01 the highest rating is for the CDM category with 64 and the second-highest rating is the NDM category with 12. This gives a difference of 52 (Value2_NDM), which is clearly above my desired threshold of 20, so I would like to keep the MATKO_NEWBERG_01 row in the dataframe as it satisfies this criterion. For MATKO_NEWBERG_07 you can see that the second-highest rating (NDM = 20) differs from the max (CDM = 23) by only 3, well below my desired threshold of 20, so I would like to remove it. The same is true for MATKO_NEWBERG_03 and _05.
Med_Technique      NDM  CDM  ADM  Value2_NDM  Value2_CDM  Value2_ADM
MATKO_NEWBERG_01    12   64    8          52          NA          56
MATKO_NEWBERG_02     5   76    9          71          NA          67
MATKO_NEWBERG_03    20   45   27          25          NA          18
MATKO_NEWBERG_04     6   73   12          67          NA          61
MATKO_NEWBERG_05     6   37   47          41          10          NA
MATKO_NEWBERG_06     6    6   78          72          72          NA
MATKO_NEWBERG_07    20   23   18           3          NA           5
Desired output:
Med_Technique      NDM  CDM  ADM  Value2_NDM  Value2_CDM  Value2_ADM
MATKO_NEWBERG_01    12   64    8          52          NA          56
MATKO_NEWBERG_02     5   76    9          71          NA          67
MATKO_NEWBERG_04     6   73   12          67          NA          61
MATKO_NEWBERG_06     6    6   78          72          72          NA
Thanks for reading
Using your Value2 columns, you could do:
dat[apply(dat[5:7], 1, min, na.rm = T) >= 20,]
#or
dat[do.call(pmin, c(dat[5:7], list(na.rm = TRUE))) >= 20,]
Med_Technique NDM CDM ADM Value2_NDM Value2_CDM Value2_ADM
1 MATKO_NEWBERG_01 12 64 8 52 NA 56
2 MATKO_NEWBERG_02 5 76 9 71 NA 67
4 MATKO_NEWBERG_04 6 73 12 67 NA 61
6 MATKO_NEWBERG_06 6 6 78 72 72 NA
Here's one other way that does not use the Value2 columns. For each row (apply with MARGIN = 1), compute the absolute difference (dist) between the highest and second-highest value (sort(x, decreasing = T)[1:2]) and check whether it is >= 20.
idx = apply(dat[2:4], 1, \(x) dist(sort(x, decreasing = T)[1:2])) >= 20
# [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
dat[idx, ]
Med_Technique NDM CDM ADM Value2_NDM Value2_CDM Value2_ADM
1 MATKO_NEWBERG_01 12 64 8 52 NA 56
2 MATKO_NEWBERG_02 5 76 9 71 NA 67
4 MATKO_NEWBERG_04 6 73 12 67 NA 61
6 MATKO_NEWBERG_06 6 6 78 72 72 NA
#Also works (maybe less intuitive, but shorter)
idx = apply(dat[2:4], 1, \(x) diff(sort(x))[2]) >= 20
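A fully vectorized variant (a sketch, not taken from either approach above) uses the fact that with exactly three rating columns the second-highest value is the row sum minus the max and the min:
# Sketch: vectorized filter, valid only for exactly three rating columns (dat as above)
top <- do.call(pmax, dat[2:4])          # highest rating per row
bot <- do.call(pmin, dat[2:4])          # lowest rating per row
sec <- rowSums(dat[2:4]) - top - bot    # the remaining, i.e. second-highest, value
dat[top - sec >= 20, ]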

R, obtain pointed column value from another table (faster)

I have two data frames: one, A, with all the data, and a smaller one, B, that contains a unique identifier of A and column names of A. I am trying to add a column to A based on what B points to. In other words, I need to get the data from A that B points to.
For example
A<-airquality
B<-data.frame(Month=unique(A$Month),col=c("Ozone","Solar.R", "Wind", "wind","Solar.R"))
This would give me the following
> head(A)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> B
Month col
1 5 Ozone
2 6 Solar.R
3 7 Wind
4 8 wind
5 9 Solar.R
The result should be something like
> head(A)
Ozone Solar.R Wind Temp Month Day ADDED
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
4 18 313 11.5 62 5 4 18
5 NA NA 14.3 56 5 5 NA
6 28 NA 14.9 66 5 6 28
> tail(A)
Ozone Solar.R Wind Temp Month Day ADDED
148 14 20 16.6 63 9 25 20
149 30 193 6.9 70 9 26 193
150 NA 145 13.2 77 9 27 145
151 14 191 14.3 75 9 28 191
152 18 131 8.0 76 9 29 131
153 20 223 11.5 68 9 30 223
The only way I can do it is
for(i in 1:nrow(B)) {
  j <- A$Month==B$Month[i]
  k <- subset(A, select=B$col[i])[j,]
  A$ADDED[j] <- k
}
While this does work, it becomes extremely slow because I have a big dataset. I feel like I am doing it the dumb way. What is a good way of doing this?
Thanks
You could do this with the sapply or lapply functions.
ADDED <- sapply(1:nrow(B), function(i){
  A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED <- unlist(ADDED)
If B only covers part of A (partial matching), you would have to filter the data to get only the rows where B has values and then assign to those. But before that we have to assign a value to all rows of the ADDED column; in this case NA.
A$ADDED = NA
A[A$Month %in% B$Month,]$ADDED <- unlist(ADDED)
That already takes only about a third of the time compared to a for-loop.
appl <- function(){
  ADDED <- sapply(1:nrow(B), function(i){
    A[A$Month==B$Month[i], (B$col[i])]
  })
  A$ADDED1 <- unlist(ADDED)
}
lappl <- function(){
  ADDED <- lapply(1:nrow(B), function(i){
    A[A$Month==B$Month[i], (B$col[i])]
  })
  A$ADDED1 <- unlist(ADDED)
}
forlo <- function(){
  for(i in 1:nrow(B)) {
    j <- A$Month==B$Month[i]
    k <- subset(A, select=B$col[i])[j,]
    A$ADDED[j] <- k
  }
}
library(microbenchmark)
mc <- microbenchmark(times = 1000,
  sapply  = appl(),
  lapply  = lappl(),
  forloop = forlo()
)
mc
Unit: microseconds
expr min lq mean median uq max neval cld
sapply 337.478 359.2125 378.6964 369.7775 385.474 2324.913 1000 a
lapply 319.367 340.7990 366.8448 349.2510 362.532 9051.828 1000 a
forloop 964.136 1013.6415 1074.5584 1032.5070 1059.825 5116.802 1000 b
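For larger data you can also avoid looping over B entirely with a vectorized lookup via matrix indexing. The following is a sketch, not part of the benchmark above; it assumes every entry of B$col names a real column of A (the lowercase "wind" from the example is corrected here) and that all looked-up columns are numeric:
# Sketch: vectorized lookup via a (row, column) index matrix
A2 <- airquality
B2 <- data.frame(Month = unique(A2$Month),
                 col   = c("Ozone", "Solar.R", "Wind", "Wind", "Solar.R"))
col_per_row <- as.character(B2$col)[match(A2$Month, B2$Month)]  # column name for each row of A2
A2$ADDED <- as.matrix(A2)[cbind(seq_len(nrow(A2)),
                                match(col_per_row, names(A2)))]
head(A2$ADDED)   # 41 36 12 18 NA 28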

Binning differences in dates as time-unit-aware numeric vector across years

I need to calculate "how many x units apart" each element in a vector of POSIX dates is from a given reference date, where
x is a "typical" time unit like month, week, quarter etc.
the date vector can span multiple years
the result needs to be a numeric vector
I have something, but it doesn't feel like a consistent approach that could be generalized (two different approaches for month and week).
Possibly worth noting: I'm generally looking for solutions that comply with ISO 8601.
EDIT
"Consistent" in the sense that I would ideally, say, a solution that is always leverages as.numeric(dates) with some clever "time unit binning" afterwards. But for months I wouldn't see how this could be achieved as each month contains a different number of days (works for weeks as we can always safely say "a week contains 7 days").
In other words: for months I'd like to use something like (as.numeric(.x) / (<something>)) just as I use (as.numeric(.x) / (60 * 60 * 24 * 7)) for weeks. It's that <something> that I'm looking for to have a generic way of binning differences in dates.
Solution draft
Function defs:
library(magrittr)
library(purrr)
normalize_time_distance_month <- function(dates) {
  dates %>%
    as.POSIXct() %>%
    purrr::map_dbl(function(.x)
      as.numeric(format(.x, "%y")) * 12 + as.numeric(format(.x, "%m")))
}
normalize_time_distance_week <- function(dates) {
  dates %>%
    as.POSIXct() %>%
    purrr::map_dbl(function(.x)
      (as.numeric(.x) / (60 * 60 * 24 * 7)) %>%
        round())
}
Months:
# Months ------------------------------------------------------------------
dates <- seq(as.POSIXct("2018-03-01"), length.out = 24, by = "month")
origin <- as.POSIXct("2018-05-01")
dates_norm <- normalize_time_distance_month(dates)
origin_norm <- normalize_time_distance_month(origin)
(time_diffs <- dates_norm - origin_norm)
#> [1] -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> [24] 21
Weeks:
# Weeks -------------------------------------------------------------------
dates <- seq(as.POSIXct("2018-05-07"), length.out = 104, by = "week")
origin <- as.POSIXct("2018-05-21")
dates_norm <- normalize_time_distance_week(dates)
origin_norm <- normalize_time_distance_week(origin)
(time_diffs <- dates_norm - origin_norm)
#> [1] -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
#> [18] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
#> [35] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
#> [52] 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
#> [69] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
#> [86] 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
#> [103] 100 101
Created on 2018-05-25 by the reprex package (v0.2.0).
One option would be to pass an expression as an argument and then parse it
library(tidyverse)
library(rlang)
normalize_time_distance <- function(dates, expr) {
  dates %>%
    as_tibble %>%
    mutate(value = as.POSIXct(value)) %>%
    mutate(value = !! parse_expr(expr)) %>%
    pull(value)
}
expr1 <- 'as.numeric(format(value, "%y")) * 12 + as.numeric(format(value, "%m"))'
normalize_time_distance(dates, expr1)
#[1] 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
#[20] 238 239 240 241 242
expr2 <- 'round((as.numeric(value) / (60 * 60 * 24 * 7)))'
normalize_time_distance(dates, expr2)
#[1] 2513 2517 2522 2526 2530 2535 2539 2544 2548 2552 2557 2561 2565 2570 2574
#[16] 2578 2583 2587 2591 2596 2600 2604 2609 2613
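An alternative that avoids string parsing altogether is to pass a function rather than an expression. This is just a sketch with illustrative names, not part of the answer above:
# Sketch: pass the transformation as a function instead of a parsed string
normalize_time_distance2 <- function(dates, f) f(as.POSIXct(dates))
normalize_time_distance2(dates, function(v)
  as.numeric(format(v, "%y")) * 12 + as.numeric(format(v, "%m")))
normalize_time_distance2(dates, function(v)
  round(as.numeric(v) / (60 * 60 * 24 * 7)))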
If you are interested in intervals that are multiples of a day, there is no point in using the POSIXt class. It only gives rise to the possibility of time zone errors, which you can entirely prevent by using the Date class, so from here on we will assume Date class. as.Date can be used to convert a POSIXct object to a Date object.
There are two distinct cases in your question. Intervals that are multiples of a day (day, week) and intervals that are multiples of a month (month, quarter, year). These will have to be dealt with separately because there is not a fixed number of days in a month.
Case 1 - interval is multiple of days
If the interval length is d days then if x and y are Date class objects the
number of intervals is
# x and y are Date class
(as.numeric(y) - as.numeric(x)) / d
where d is 1 for days and 7 for weeks.
Case 2 - interval is multiple of months
If the interval length is m months then if x and y are Date class objects:
library(zoo)
date2ym <- function(x) {
  ym <- as.yearmon(x)
  b <- as.numeric(as.Date(ym))
  e <- as.numeric(as.Date(ym, frac = 1))
  12 * as.numeric(ym) + (as.numeric(x) - b) / (e - b + 1)
}
# x and y are Date class
(date2ym(y) - date2ym(x)) / m
where m is 1 for months, 3 for quarters and 12 for years.
EDIT
Fix (2).
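As a usage sketch of the two cases (variable names here are illustrative, chosen to mirror the question):
library(zoo)
dates  <- seq(as.Date("2018-03-01"), length.out = 24, by = "month")
origin <- as.Date("2018-05-01")
# Case 1: weeks apart (d = 7)
weeks_apart <- (as.numeric(dates) - as.numeric(origin)) / 7
# Case 2: months apart (m = 1), using date2ym() defined above
months_apart <- date2ym(dates) - date2ym(origin)
round(months_apart)   # -2 -1 0 1 2 ... 21, matching the month example in the question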

Filter data frame by results from tapply function

I'm trying to apply a tapply function I wrote to filter a dataset. Here is a sample data frame (df) below to describe what I'm trying to do.
I want to keep in my data frame the rows where the value of df$Cumulative_Time is closest to the value of 14. It should do this for each factor level in df$ID (keep the row closest to the value 14 for each ID factor).
ID Date Results TimeDiff Cumulative_Time
A 7/10/2015 71 0 0
A 8/1/2015 45 20 20
A 8/22/2015 0 18 38
A 9/12/2015 79 17 55
A 10/13/2015 44 26 81
A 11/27/2015 98 37 118
B 7/3/2015 75 0 0
B 7/24/2015 63 18 18
B 8/21/2015 98 24 42
B 9/26/2015 70 30 72
C 8/15/2015 77 0 0
C 9/2/2015 69 15 15
C 9/4/2015 49 2 17
C 9/8/2015 88 2 19
C 9/12/2015 41 4 23
C 9/19/2015 35 6 29
C 10/10/2015 33 18 47
C 10/14/2015 31 3 50
D 7/2/2015 83 0 0
D 7/28/2015 82 22 22
D 8/27/2015 100 26 48
D 9/17/2015 19 17 65
D 10/8/2015 30 18 83
D 12/9/2015 96 51 134
D 1/6/2016 30 20 154
D 2/17/2016 32 36 190
D 3/19/2016 42 27 217
I got as far as the following:
spec_day = 14 # value I want to compare df$Cumulative_Time to
# applying function to calculate closest value to spec_day
tapply(df$Cumulative_Time, df$ID, function(x) which(abs(x - spec_day) == min(abs(x - spec_day))))
Question: how do I include this tapply function as a means to do the filtering of my data frame df? Am I approaching this problem the right way, or is there some simpler way to accomplish this that I'm not seeing? Any help would be appreciated--thanks!
Here's a way you can do it, note that I didn't use tapply:
spec_day <- 14
new_df <- do.call('rbind',
                  by(df, df$ID,
                     FUN = function(x) x[which.min(abs(x$Cumulative_Time - spec_day)), ]))
new_df
ID Date Results TimeDiff Cumulative_Time
A A 8/1/2015 45 20 20
B B 7/24/2015 63 18 18
C C 9/2/2015 69 15 15
D D 7/28/2015 82 22 22
which.min (and its sibling which.max) is a very useful function.
Here's a more concise and faster alternative using data.table:
library(data.table)
setDT(df)[, .SD[which.min(abs(Cumulative_Time - 14))], by = ID]
# ID Date Results TimeDiff Cumulative_Time
#1: A 8/1/2015 45 20 20
#2: B 7/24/2015 63 18 18
#3: C 9/2/2015 69 15 15
#4: D 7/28/2015 82 22 22
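If you prefer dplyr, a comparable one-row-per-group filter could be sketched like this (slice_min() needs dplyr >= 1.0.0; this is not part of the answers above):
# Sketch: dplyr equivalent of the grouped "closest to spec_day" filter
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(abs(Cumulative_Time - spec_day), n = 1, with_ties = FALSE) %>%
  ungroup()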

Computing normalized Euclidean distance in R

The data frame I have is as follows:
Binning_data[1:4,]
person_id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 312 74 80 NA 87 90.0 85 88 98 96.5 99 94 95 90 90 93 106
2 316 NA NA 116 106 105.0 110 102 105 105.0 102 98 101 98 92 89 91
3 318 71 61 61 61 60.5 68 62 67 64.0 60 59 60 62 59 63 63
4 319 64 NA 80 80 83.0 84 87 83 85.0 88 87 95 74 70 63 83
I would like to compute the Euclidean distance of a given 'index_person_id' (say 312) with all the other person_id while omitting all NAs.
For example: the normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1, V2, V3) because at least one of the two rows has NAs. It should just compute the Euclidean distance from the 4th bin to the 16th bin and divide by 13 (the number of non-empty bins).
Dimension of Binning_Data is 10000*17.
The output file should be of size 10000*2 with the first column being the person_id and the second column being the 'normalized Euclidean distance'.
I am currently using sapply for this purpose:
index_person<-binning_data[which(binning_data$person_id==index_person_id),]
non_empty_index_person=which(is.na(index_person[2:ncol(index_person)])==FALSE)
distance[,2] <- sapply(seq_along(binning_data$person_id), function(j) {
  compare_person <- binning_data[j,]
  non_empty_compare_person = which(is.na(compare_person[2:ncol(compare_person)])==FALSE)
  non_empty = intersect(non_empty_index_person, non_empty_compare_person)
  distance_temp = (index_person[non_empty+1]-compare_person[non_empty+1])^2
  as.numeric(mean(distance_temp))
})
This seems to take a considerable amount of time. Is there a better way to do this?
If I run your code I get:
0.0000 146.0192 890.9000 200.8750
If you convert your data frame into a matrix and transpose it, you can subtract columns and then use na.rm=TRUE in the mean to get the distances you want. This can be done over columns using colMeans. Here for row II of your sample data:
> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
1 2 3 4
0.0000 146.0192 890.9000 200.8750
Your 10000x2 matrix is then (where here 10000==4):
> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
II
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
If you want to compute this for a given list of indexes, loop it, perhaps like this with an lapply and an rbind putting it all back together again as a data frame for a change:
II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
i d
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
11 2 146.0192
21 2 0.0000
31 2 1595.0179
41 2 456.7143
12 1 0.0000
22 1 146.0192
32 1 890.9000
42 1 200.8750
13 4 200.8750
23 4 456.7143
33 4 420.8833
43 4 0.0000
14 4 200.8750
24 4 456.7143
34 4 420.8833
44 4 0.0000
That's a data frame with 4 * length(II) rows.
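Note that, like the original sapply code, this computes the mean of squared differences. If a root normalized distance is actually wanted, a final step could look like this sketch (reusing m and II from above):
# Sketch: square root of the per-person mean squared difference
norm_euclid <- sqrt(colMeans((m - m[, II])^2, na.rm = TRUE))
data.frame(person_id = binning_data$person_id, distance = norm_euclid)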
