R: Error: vectors are not the same length in a for loop

I am starting to learn R and I got an error in my for loop.
Here is the error I got:
Error in names(data1)[5] <- "TR" :
'names' attribute [5] must be the same length as the vector [4]
Here is my code:
#read data
data1<-read.table("C_0.txt",header = T,sep=",")
#name the 5th col TR
names(data1)[5]<-"TR"
#calculate the length of data1
n<-nrow(data1)
#initialize first TR value to NA
data1[1,5]<-NA
for (i in 1:(n-1)){
if (data1[i,3]==data1[i+1,3]) {data1[i+1,5]<-data1[i,5]}
if (data1[i,3]< data1[i+1,3]) {data1[i+1,5]<- 1}
if(data1[i,3]> data1[i+1,3]) {data1[i+1,5]<- -1}
}
Here is the algorithm I am trying to codify:
if the current price is above the previous price, mark +1 in the column named TR
if the current price is below the previous price, mark -1 in the column named TR
if the current price is the same as the previous price, mark the same thing as in the previous price in the column named TR
I marked the first TR row as NA because there is no price to compare it to.
here is the data from C_0.txt:
Date,Time,Price,Size
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:14,49.6,200
02/18/2014,05:06:14,49.6,193
02/18/2014,05:44:41,49.62,100
02/18/2014,06:26:36,49.52,100
02/18/2014,06:26:36,49.52,500
02/18/2014,07:09:29,49.6,100
02/18/2014,07:56:40,49.56,300
02/18/2014,07:56:40,49.55,400
02/18/2014,07:56:41,49.54,200
02/18/2014,07:56:43,49.55,100
02/18/2014,07:56:43,49.55,100
02/18/2014,07:56:50,49.55,100
02/18/2014,07:57:12,49.53,100
02/18/2014,07:57:12,49.51,2200
02/18/2014,07:57:12,49.51,100
02/18/2014,07:57:12,49.5,200
Thanks a lot!

Adding a column to a data frame is fairly basic stuff, but you can't do it by changing the length of the names attribute vector. That is what names(data1)[5] <- "TR" attempts on a data frame that only has four columns, hence the error.
You have to create the column first, for example with data1$TR <- NA, and only then fill it in.
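A minimal sketch of the loop from the question with the column created up front. It uses a shortened, illustrative price series rather than the full file, and the if/else chain is just one way to write the three cases:

```r
# Shortened, illustrative price series (not the full C_0.txt file)
data1 <- data.frame(Price = c(49.6, 49.6, 49.62, 49.52, 49.52, 49.6))

# Create the column first; assigning only to names(data1)[2] would fail here too
data1$TR <- NA_real_

for (i in seq_len(nrow(data1) - 1)) {
  prev <- data1$Price[i]
  cur  <- data1$Price[i + 1]
  data1$TR[i + 1] <- if (cur > prev) {
    1
  } else if (cur < prev) {
    -1
  } else {
    data1$TR[i]  # unchanged price: carry the previous mark forward
  }
}

data1$TR
```

For this input the TR column comes out as NA, NA, 1, -1, -1, 1.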
As for the other part of your question, let's use diff and rle instead of a for-loop.
d <- read.table(sep = ",", text =
"Date,Time,Price,Size
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:13,49.6,200
02/18/2014,05:06:14,49.6,200
02/18/2014,05:06:14,49.6,193
02/18/2014,05:44:41,49.62,100
02/18/2014,06:26:36,49.52,100
02/18/2014,06:26:36,49.52,500
02/18/2014,07:09:29,49.6,100
02/18/2014,07:56:40,49.56,300
02/18/2014,07:56:40,49.55,400
02/18/2014,07:56:41,49.54,200
02/18/2014,07:56:43,49.55,100
02/18/2014,07:56:43,49.55,100
02/18/2014,07:56:50,49.55,100
02/18/2014,07:57:12,49.53,100
02/18/2014,07:57:12,49.51,2200
02/18/2014,07:57:12,49.51,100
02/18/2014,07:57:12,49.5,200", header = TRUE)
tmp <- c(NA, diff(d$Price))
tmp <- sign(tmp)
tmp <- rle(tmp)
l <- tmp$lengths
v <- tmp$values
idx <- which(v == 0L)
l[idx - 1] <- l[idx - 1] + l[idx]
l <- l[-idx]
v <- v[-idx]
d$TR <- inverse.rle(list(lengths = l, values = v))
# Date Time Price Size TR
# 1 02/18/2014 05:06:13 49.60 200 NA
# 2 02/18/2014 05:06:13 49.60 200 NA
# 3 02/18/2014 05:06:13 49.60 200 NA
# 4 02/18/2014 05:06:14 49.60 200 NA
# 5 02/18/2014 05:06:14 49.60 193 NA
# 6 02/18/2014 05:44:41 49.62 100 1
# 7 02/18/2014 06:26:36 49.52 100 -1
# 8 02/18/2014 06:26:36 49.52 500 -1
# 9 02/18/2014 07:09:29 49.60 100 1
# 10 02/18/2014 07:56:40 49.56 300 -1
# 11 02/18/2014 07:56:40 49.55 400 -1
# 12 02/18/2014 07:56:41 49.54 200 -1
# 13 02/18/2014 07:56:43 49.55 100 1
# 14 02/18/2014 07:56:43 49.55 100 1
# 15 02/18/2014 07:56:50 49.55 100 1
# 16 02/18/2014 07:57:12 49.53 100 -1
# 17 02/18/2014 07:57:12 49.51 2200 -1
# 18 02/18/2014 07:57:12 49.51 100 -1
# 19 02/18/2014 07:57:12 49.50 200 -1
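One caveat with the rle code above: if the price never stays flat, idx is empty, and l[-idx] then selects nothing instead of everything. As a hedged alternative, here is a sketch that carries the last non-zero sign forward with cumsum() instead of rle(). The name tick_rule is a hypothetical helper, not something from the original answer:

```r
# tick_rule() is a hypothetical helper name used for illustration only
tick_rule <- function(price) {
  s   <- c(NA, sign(diff(price)))   # +1 / -1 / 0 per row, NA for the first row
  nz  <- !is.na(s) & s != 0         # rows where the price actually moved
  idx <- cumsum(nz)                 # index of the most recent move (0 = none yet)
  c(NA, s[nz])[idx + 1]             # look up that move; NA before the first one
}

tick_rule(c(49.6, 49.6, 49.62, 49.52, 49.52, 49.6))
```

With the full data frame from the answer, d$TR <- tick_rule(d$Price) would be the equivalent call.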

Related

How to know if a number is in a given interval in R

I have a dataset with 3 columns: Default, Height and Weight.
I binned the variables and stored the result in a list (I have to do it this way). Every bin has an associated WOE (weight of evidence) value, and now I want to add those WOE values to the original data frame according to the bucket each observation falls into:
For example, the data frame
df1 <- data.frame(default=sample(c(0,1), replace=TRUE, size=100, prob=c(0.9,0.1)),
height=sample(150:180, 100, replace=T),
weight=sample(50:80,100,replace=T))
> head(df1)
# default height weight
# 1 0 172 54
# 2 0 169 71
# 3 0 164 61
# 4 0 156 55
# 5 0 180 66
# 6 0 162 63
The bins (I will just show the first one)
bins <- lapply(c("height","weight"), function(x) woe.binning(df1, "default", x,
min.perc.total=0.05,
min.perc.class=0.05,event.class=1,
stop.limit = 0.05)[2])
# [[1]]
# [[1]][[1]]
# woe cutpoints.final cutpoints.final[-1] iv.total.final 0 1 col.perc.a col.perc.b iv.bins
# (-Inf,156] -46.58742 -Inf 156 0.1050725 21 5 0.24137931 0.38461538 0.0667299967
# (156,168] 23.91074 156 168 0.1050725 34 4 0.39080460 0.30769231 0.0198727638
# (168,169] -10.91993 168 169 0.1050725 6 1 0.06896552 0.07692308 0.0008689599
# (169, Inf] 25.85255 169 Inf 0.1050725 26 3 0.29885057 0.23076923 0.0176007627
# Missing NA Inf Missing 0.1050725 0 0 0.00000000 0.00000000
Now I want to see which bin each observation falls into.
My desired output is something similar to this
# default height weight woe_height woe_weight
# 1 0 160 54 23.91074 -8.180032
# 2 0 140 71 -46.58742 -7.640947
Is there any way to do it? The main problem I see here is that the intervals (a,b) are stored as strings. I was thinking about using substr() or something similar to split the strings into logical conditions, but I don't think that would work, and it's not very elegant.
Any help will be welcome, thanks in advance.
Does this work fine for you?
library(woeBinning)  # provides woe.binning()
apply_woe_binning <- function(df, x){
# woe binning
w <- woe.binning(df, "default", x,
min.perc.total=0.05,
min.perc.class=0.05,
event.class=1,
stop.limit = 0.05)[[2]]
# create new column name
new_col <- paste("woe", x, sep = "_")
# define cuts
cuts <- cut(df[[x]], w$cutpoints.final)
# add new column
df[[new_col]] <- w[cuts, "woe", drop = TRUE]
df
}
# one by one
df2 <- apply_woe_binning(df1, "height")
df2 <- apply_woe_binning(df2, "weight")
# in a functional
df2 <- Reduce(function(y, x) apply_woe_binning(df = y, x = x),
c("height","weight"),
init = df1)
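To see the lookup step in isolation, here is a small base R sketch that needs no binning package. The cutpoints and WOE values are copied by hand from the height table in the question, and the height vector is made up:

```r
# Cutpoints and WOE values copied from the height table in the question
breaks <- c(-Inf, 156, 168, 169, Inf)
woe    <- c(-46.58742, 23.91074, -10.91993, 25.85255)

height <- c(160, 140, 169, 180)   # made-up observations

bin <- cut(height, breaks)        # right-closed intervals such as (156,168]
result <- data.frame(height, bin, woe_height = woe[as.integer(bin)])
result
```

Because cut() returns a factor whose integer codes follow the order of the breaks, indexing the WOE vector by those codes avoids parsing the interval strings at all.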

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like
1                      USA                             China
2                      Dollars         Volume          Dollars          Volume
3 Category   Brand     CY2016  CY2017  CY2016  CY2017  CY2016  CY_2017  CY2016  CY2017
4 Chocolate  Snickers  100     120     15      18      100     80       20      22
5 Chocolate  Twix      70      80      8       10      75      50       55      20
I would like to import the data into R while retaining the headers in rows 1 & 2. An added challenge is that some headers are implied: if a header cell is blank, I would like it to use the nearest non-blank cell to its left. Here is an example of what I'd like it to import as:
1 Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016 China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
2 Chocolate Snickers 100 120 15 18 100 80 20 22
3 Chocolate Twix 70 80 8 10 75 50 55 20
My current method is to import the file, skipping rows 1 & 2, and then rename the columns based on known position. However, I was hoping code existed that would save me this step. Thank you!!
I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:
header <- read.csv("data.csv", header=FALSE, colClasses="character", nrows=3)
dat <- read.csv("data.csv", header=FALSE, skip=3)
The tricky part is the header. This function should do it:
construct_colnames <- function(header) {
f <- function(x) {
x <- as.character(x)
c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
}
res <- apply(header, 1, f)
res <- apply(res, 1, paste0, collapse="_")
sub("^_*", "", res)
}
colnames(dat) <- construct_colnames(header)
dat
Result:
Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers 100 120 15 18 100
2 Chocolate Twix 70 80 8 10 75
China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1 80 20 22
2 50 55 20
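The core of construct_colnames() is the inner function f, which fills each blank header cell with the nearest non-blank cell to its left. A small sketch of that step on its own, using the first header row from the question (fill_right is a hypothetical name for illustration):

```r
# fill_right() is a hypothetical name; it mirrors the inner function f above
fill_right <- function(x) {
  filled <- !is.na(x) & x != ""          # TRUE where a header cell has text
  # cumsum(filled) counts the non-blank cells seen so far; index 0 maps to ""
  c("", x[filled])[cumsum(filled) + 1]
}

row1 <- c("", "", "USA", "", "", "", "China", "", "", "")
fill_right(row1)
```

The first two cells stay empty because nothing precedes them, which is why the final sub() call strips the leading underscores.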

Aggregate column intervals into new columns in data.table

I would like to aggregate a data.table based on intervals of a column (time). The idea here is that each interval should be a separate column with a different name in the output.
I've seen a similar question on SO but I couldn't get my head around the problem. Any help?
reproducible example
library(data.table)
# sample data
set.seed(1L)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
# simple summary by `id`
dt[, .(total = sum(points)), by=id]
> id total
> 1: J 2058
> 2: T 1427
> 3: C 1020
In the desired output, each column would be named after the interval it originates from. For example, with three intervals, say time < 10, time < 20, time < 30, the head of the output should be:
id | total | subtotal_under10 | subtotal_under20 | subtotal_under30
Exclusive Subtotal Categories
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
cuts <- cut(dt$time,breaks,labels=paste0('subtotal_under',breaks[-1L]),right=F);
res <- dcast(dt[,.(subtotal=sum(points)),.(id,cut=cuts)],id~cut,value.var='subtotal');
res <- res[dt[,.(total=sum(points)),id]][order(id)];
res;
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
## 1: A NA NA 176 NA NA 512 688
## 2: B NA NA 599 NA NA NA 599
## 3: C 527 NA NA NA NA NA 527
## 4: D NA NA 174 NA NA NA 174
## 5: E NA 732 643 NA NA NA 1375
## 6: F 634 NA NA NA NA 1473 2107
## 7: G NA NA 1410 NA NA NA 1410
## 8: I NA NA NA NA NA 596 596
## 9: J 447 NA 640 NA NA 354 1441
## 10: K 508 NA NA NA NA 454 962
## 11: M NA 14 1358 NA NA NA 1372
## 12: N NA NA NA NA 730 NA 730
## 13: O NA NA 271 NA NA 259 530
## 14: P NA NA NA NA 78 NA 78
## 15: Q 602 NA 485 NA 925 NA 2012
## 16: R NA 599 357 479 NA NA 1435
## 17: S NA 986 716 865 NA NA 2567
## 18: T NA NA NA NA 105 NA 105
## 19: U NA NA NA 239 1163 641 2043
## 20: V NA 683 NA NA 929 NA 1612
## 21: W NA NA NA NA 229 NA 229
## 22: X 214 993 NA NA NA NA 1207
## 23: Y NA 130 992 NA NA NA 1122
## 24: Z NA NA NA NA 104 NA 104
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
Cumulative Subtotal Categories
I've come up with a new solution based on the requirement of cumulative subtotals.
My objective was to avoid looping operations such as lapply(), since I realized that it should be possible to compute the desired result using only vectorized operations such as findInterval(), vectorized/cumulative operations such as cumsum(), and vector indexing.
I succeeded, but I should warn you that the algorithm is fairly intricate, in terms of its logic. I'll try to explain it below.
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
ints <- findInterval(dt$time,breaks);
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
res;
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
## 1: A 688 NA NA 176 176 176 688
## 2: B 599 NA NA 599 599 599 599
## 3: C 527 527 527 527 527 527 527
## 4: D 174 NA NA 174 174 174 174
## 5: E 1375 NA 732 1375 1375 1375 1375
## 6: F 2107 634 634 634 634 634 2107
## 7: G 1410 NA NA 1410 1410 1410 1410
## 8: I 596 NA NA NA NA NA 596
## 9: J 1441 447 447 1087 1087 1087 1441
## 10: K 962 508 508 508 508 508 962
## 11: M 1372 NA 14 1372 1372 1372 1372
## 12: N 730 NA NA NA NA 730 730
## 13: O 530 NA NA 271 271 271 530
## 14: P 78 NA NA NA NA 78 78
## 15: Q 2012 602 602 1087 1087 2012 2012
## 16: R 1435 NA 599 956 1435 1435 1435
## 17: S 2567 NA 986 1702 2567 2567 2567
## 18: T 105 NA NA NA NA 105 105
## 19: U 2043 NA NA NA 239 1402 2043
## 20: V 1612 NA 683 683 683 1612 1612
## 21: W 229 NA NA NA NA 229 229
## 22: X 1207 214 1207 1207 1207 1207 1207
## 23: Y 1122 NA 130 1122 1122 1122 1122
## 24: Z 104 NA NA NA NA 104 104
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
Explanation
breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
breaks <- seq(0,ceiling(max(dt$time)/10)*10,10); ## old derivation, for reference
First, we derive breaks as before. I should mention that I realized there was a subtle bug in my original derivation algorithm. Namely, if the maximum time value is a multiple of 10, then the derived breaks vector would've been short by 1. Consider if we had a maximum time value of 60. The original calculation of the upper limit of the sequence would've been ceiling(60/10)*10, which is just 60 again. But it should be 70, since the value 60 technically belongs in the 60 <= time < 70 interval. I fixed this in the new code (and retroactively amended the old code) by adding 1 to the maximum time value when computing the upper limit of the sequence. I also changed two of the literals to integers and added an as.integer() coercion to preserve integerness.
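To make the off-by-one concrete, here is a small sketch comparing the two derivations on a made-up time vector whose maximum is a multiple of 10. The wrapper names old_breaks and new_breaks are hypothetical:

```r
# Hypothetical wrappers around the two derivations, for comparison only
old_breaks <- function(time) seq(0, ceiling(max(time) / 10) * 10, 10)
new_breaks <- function(time) seq(0L, as.integer(ceiling((max(time) + 1L) / 10) * 10), 10L)

time <- c(5L, 37L, 60L)   # made-up values; the maximum is a multiple of 10

old_breaks(time)   # ends at 60, so there is no interval for 60 <= time < 70
new_breaks(time)   # ends at 70
```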
ints <- findInterval(dt$time,breaks);
Second, we precompute the interval indexes into which each time value falls. We can precompute this once for the entire table, because we'll be able to index out each id group's subset within the j argument of the subsequent data.table indexing operation. Note that findInterval() behaves perfectly for our purposes using the default arguments; we don't need to mess with rightmost.closed, all.inside, or left.open. This is because findInterval() by default uses lower <= value < upper logic, and it's impossible for values to fall below the lowest break (which is zero) or on or above the highest break (which must be greater than the maximum time value because of the way we derived it).
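A quick illustration of that default lower <= value < upper behaviour, using made-up time values that sit on and around the break boundaries:

```r
breaks <- seq(0L, 70L, 10L)
time   <- c(1L, 9L, 10L, 59L, 60L)   # made-up values near the boundaries

ints <- findInterval(time, breaks)
ints
```

The values 1 and 9 land in interval 1, 10 in interval 2, 59 in interval 6 and 60 in interval 7, so a value equal to a break always belongs to the interval that starts there.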
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
Third, we compute the aggregation using a data.table indexing operation, grouping by id. (Afterward we sort by id using a chained indexing operation, but that's not significant.) The j argument consists of 6 statements executed in a braced block which I will now explain one at a time.
y <- ints[.I];
This pulls out the interval indexes for the current id group in input order.
o <- order(y);
This captures the order of the group's records by interval. We will need this order for the cumulative summation of points, as well as for deriving which indexes in that cumulative sum represent the desired interval subtotals. Note that the within-interval orders (i.e. ties) are irrelevant, since we're only going to extract the final subtotal of each interval, which will be the same regardless of whether and how order() breaks ties.
y <- y[o];
This actually reorders y to interval order.
w <- which(c(y[-length(y)]!=y[-1L],T));
This computes the endpoints of each interval sequence, in other words the indexes of only those elements that are the final element of an interval. This vector will always contain at least one index, it will never contain more indexes than there are intervals, and its elements will be unique.
v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks))));
This repeats each element of w according to its distance (as measured in intervals) from its following element. We use diff() on y[w] to compute these distances, requiring an appended length(breaks) element to properly treat the final element of w. We also need to cover the case where the first interval (and zero or more subsequent intervals) is not represented in the group, in which case we must pad with NAs. This requires prepending an NA to w and prepending a 1 to the argument vector to diff().
c(sum(points),as.list(cumsum(points[o])[v]));
Finally, we can compute the group aggregation result. Since you want a total column and then separate subtotal columns, we need a list starting with the total aggregation, followed by one list component per subtotal value. points[o] gives us the target summation operand in interval order, which we then cumulatively sum, and then index with v to produce the correct sequence of cumulative subtotals. We must coerce the vector to a list using as.list(), and then prepend the list with the total aggregation, which is simply the sum of the entire points vector. The resulting list is then returned from the j expression.
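The six statements can be traced by hand on a single made-up group outside of data.table. In this sketch the group's interval indexes are computed directly with findInterval() rather than pulled out of ints with .I, and the times and points are invented for illustration:

```r
breaks <- seq(0L, 60L, 10L)        # six intervals
time   <- c(52L, 25L, 5L)          # one made-up id group
points <- c(354L, 640L, 447L)

y <- findInterval(time, breaks)    # 6 3 1: interval index of each record
o <- order(y)                      # 3 2 1: record order by interval
y <- y[o]                          # 1 3 6
w <- which(c(y[-length(y)] != y[-1L], TRUE))            # 1 2 3: last record of each interval
v <- rep(c(NA, w), diff(c(1L, y[w], length(breaks))))   # 1 1 2 2 2 3
subtotals <- cumsum(points[o])[v]
subtotals
```

The result is 447, 447, 1087, 1087, 1087, 1441: intervals with no records of their own (here the second, fourth and fifth) repeat the subtotal of the last interval that had any. If the group had no record in the first interval, the leading entries of v would be NA instead.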
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
Last, we set the column names. It is more performant to set them once after-the-fact, as opposed to having them set repeatedly in the j expression.
Benchmarking
For benchmarking, I wrapped my code in a function, and did the same for Mike's code. I decided to make my breaks variable a parameter with its derivation as the default argument, and I did the same for Mike's my_nums variable, but without a default argument.
Also note that for the identical() proofs-of-equivalence, I coerce the two results to matrix, because Mike's code always computes the total and subtotal columns as doubles, whereas my code preserves the type of the input points column (i.e. integer if it was integer, double if it was double). Coercing to matrix was the easiest way I could think of to verify that the actual data is equivalent.
library(data.table);
library(microbenchmark);
bgoldst <- function(dt,breaks=seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L)) { ints <- findInterval(dt$time,breaks); res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)]; setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L]))); res; };
mike <- function(dt,my_nums) { cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))}); dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]; dt[,total := points]; dt[,lapply(.SD,function(x){ if (all(is.na(x))){ as.numeric(NA) } else{ as.numeric(sum(x,na.rm=TRUE)) } }),by=id, .SDcols=c("total",cols) ][order(id)]; };
## OP's sample input
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 3.281380 3.484301 3.793532 3.588221 3.780023 6.322846 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 3.243746 3.442819 3.731326 3.526425 3.702832 5.618502 100
Mike's code is actually faster (usually) by a small amount for the OP's sample input.
## large input 1
set.seed(1L);
N <- 1e5L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 19.44409 19.96711 22.26597 20.36012 21.26289 62.37914 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 94.35002 96.50347 101.06882 97.71544 100.07052 146.65323 100
For this much larger input, my code significantly outperforms Mike's.
In case you're wondering why I had to add the 70 to Mike's my_nums argument, it's because with so many more records, the probability of getting a 60 in the random generation of dt$time is extremely high, which requires the additional interval. You can see that the identical() call gives TRUE, so this is correct.
## large input 2
set.seed(1L);
N <- 1e6L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));
identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE
microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(copy(dt)) 204.8841 207.2305 225.0254 210.6545 249.5497 312.0077 100
## mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 1039.4480 1086.3435 1125.8285 1116.2700 1158.4772 1412.6840 100
For this even larger input, the performance difference is slightly more pronounced.
I'm pretty sure something like this might work as well:
# sample data
set.seed(1)
dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
time= sample(60,50,replace=TRUE),
points= sample(1000,50,replace=TRUE))
#Input numbers
my_nums <- c(10,20,30)
#Defining columns
cols <- paste0("subtotal_under", my_nums)
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := sum((points)),by=id]
dt[,(cols):= lapply(.SD,sum,na.rm=TRUE),by=id, .SDcols=cols ]
head(dt)
id time points subtotal_under10 subtotal_under20 subtotal_under30 total
1: G 29 655 0 0 1410 1410
2: J 52 354 447 447 1087 1441
3: O 27 271 0 0 271 530
4: X 15 993 214 1207 1207 1207
5: F 5 634 634 634 634 2107
6: X 6 214 214 1207 1207 1207
Edit: To aggregate columns, you can simply change to:
#Defining columns
cols <- paste0("subtotal_under", my_nums)
dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]
dt[,total := points]
dt[,lapply(.SD,function(x){
if (all(is.na(x))){
as.numeric(NA)
} else{
as.numeric(sum(x,na.rm=TRUE))
}
}),by=id, .SDcols=c("total",cols) ]
This should give the expected output of 1 row per ID.
Edit: Per the OP's comment below, changed the code so that 0s become NA, and so that no as.numeric() call is needed when building the columns.
After thinking about this for a while, I think I've arrived at a very simple and fast solution based on conditional sums. The small problem is that I haven't figured out how to automate this code to create a larger number of columns without having to write each of them out. Any help here would be very welcome!
library(data.table)
dt[, .( total = sum(points)
, subtotal_under10 = sum(points[which( time < 10)])
, subtotal_under20 = sum(points[which( time < 20)])
, subtotal_under30 = sum(points[which( time < 30)])
, subtotal_under40 = sum(points[which( time < 40)])
, subtotal_under50 = sum(points[which( time < 50)])
, subtotal_under60 = sum(points[which( time < 60)])), by=id][order(id)]
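One possible way to generate those columns instead of writing each one out is to build the list in j with lapply() over a vector of thresholds. This is a sketch on the question's sample data, not a benchmarked replacement; the variable name thresholds is an assumption, and which() is dropped because time has no NAs here:

```r
library(data.table)

set.seed(1L)
dt <- data.table(id     = sample(LETTERS, 50, replace = TRUE),
                 time   = sample(60, 50, replace = TRUE),
                 points = sample(1000, 50, replace = TRUE))

thresholds <- seq(10, 60, by = 10)   # assumed interval limits

res <- dt[, c(list(total = sum(points)),
              # one conditional sum per threshold, named subtotal_under<limit>
              setNames(lapply(thresholds, function(th) sum(points[time < th])),
                       paste0("subtotal_under", thresholds))),
          by = id][order(id)]
res
```

As with the hand-written version, a group with no records under a threshold gets 0 rather than NA.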
microbenchmark
Using the same benchmark proposed by @bgoldst in another answer, this simple solution is much faster than the alternatives (rafa() is assumed here to be the code above wrapped in a function):
set.seed(1L)
N <- 1e6L
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T))
library(microbenchmark)
microbenchmark(rafa(copy(dt)),bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)))
# expr min lq mean median uq max neval cld
# rafa(copy(dt)) 95.79 102.45 117.25 110.09 116.95 278.50 100 a
# bgoldst(copy(dt)) 192.53 201.85 211.04 207.50 213.26 354.17 100 b
# mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 844.80 890.53 955.29 921.27 1041.96 1112.18 100 c

Filter rows based on values of multiple columns in R

Here is the data set, say name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How can I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row where the value of at least one column is NA.
I have tried a few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1: NA comparisons produce all-NA rows, which na.omit() then drops
DS_Filtered <- na.omit(DS[DS$Abc > 25 & DS$Def > 100, ])
# Method 2: which() also ignores NA comparisons
DS_Filtered <- DS[which(DS$Abc > 25 & DS$Def > 100), ]
# Method 3: same as Method 1, with each condition parenthesized
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using the dplyr package; filter() drops rows where a condition is NA
library(dplyr)
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: subset() by default ignores NA
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
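For a self-contained check, here is a sketch that rebuilds DS from the question and filters it with complete.cases(), so that an NA in any column (including Ghi) excludes the row. It uses the thresholds stated in the question (25 and 100):

```r
# DS rebuilt by hand from the table in the question
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# FALSE & NA is FALSE, so rows with any NA never make it into keep
keep <- complete.cases(DS) & DS$Abc > 25 & DS$Def > 100
DSS  <- DS[keep, ]
DSS
```

Only rows 1 and 2 remain; row 6 has Abc above 25 but is dropped because Def is NA.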

Change the sign of a number to negative if it is below a specific target, and keep it positive if it is above

The section of code under the sign trades comment (marked in bold in the original post) seems to be the problem. Here is the code:
## price impact analysis
rm(list=ls())
### import data from excel spreadsheets
chtr_trades <- read.csv("F:/FRE 6951 Mkt Micro Struc/CHTRTRADES.csv")
chtr_quotes <- read.csv("F:/FRE 6951 Mkt Micro Struc/CHTRQUOTES.csv")
## initialize bid ask
max_bid <- NULL
min_ask <- NULL
### cleans data
maxrm <- function(x) {
max(x, na.rm=TRUE)
}
minrm <- function(x) {
min(x, na.rm=TRUE)
}
## retrieve max bid and ask for each iteration
max_bid<- tapply(chtr_quotes[,4],chtr_quotes[,3], maxrm)
min_ask<- tapply(chtr_quotes[,5],chtr_quotes[,3], minrm)
time <- levels(chtr_quotes[,3])
## calculate previous second midpoint
midpoint <- (min_ask + max_bid)/2
askbidtime <- data.frame(midpoint,time,max_bid,min_ask)
row.names(askbidtime) <- seq(nrow(askbidtime))
askbidtime[,2] <- as.POSIXct(askbidtime[,2], format="%H:%M:%S")
ordered.askbidtime <- askbidtime[order(askbidtime$time),]
row.names(ordered.askbidtime) <- seq(nrow(ordered.askbidtime))
chtr_trades_revised <-chtr_trades[which(as.POSIXct(chtr_trades[,3],format="%H:%M:%S") %in% ordered.askbidtime[,2]),]
midpoint<-NULL
midpoint[1:5] <- NA
for(i in 6:3917) {
midpoint[i] <- as.numeric(ordered.askbidtime[which(ordered.askbidtime[,2]==as.POSIXct(chtr_trades_revised[i,3],format="%H:%M:%S"))-1,1])
}
## sign trades
chtr_trades_revised$midpoint
chtr_trades_revised$midpoint <- midpoint
for(i in 6:3917) {
if((!is.na(chtr_trades_revised$midpoint[i])) & (chtr_trades_revised$midpoint[i] > chtr_trades_revised$PRICE[i])) {
chtr_trades_revised$signed_volume <- -chtr_trades_revised$SIZE
}
if((!is.na(chtr_trades_revised$midpoint[i])) & (chtr_trades_revised$midpoint[i] < chtr_trades_revised$PRICE[i])) {
chtr_trades_revised$signed_volume <- chtr_trades_revised$SIZE
}
}
Here are the results. In the last column rows 4062 and 4054 should be positive but it makes the entire column negative:
SYMBOL DATE TIME PRICE SIZE midpoint signed_volume
4060 CHTR 20130718 2014-08-26 15:59:44 124.46 100 124.485 -100
4061 CHTR 20130718 2014-08-26 15:59:52 124.46 100 124.495 -100
4062 CHTR 20130718 2014-08-26 15:59:55 124.52 100 124.490 -100
4063 CHTR 20130718 2014-08-26 15:59:58 124.53 100 124.410 -100
4064 CHTR 20130718 2014-08-26 16:00:00 124.57 7951 124.550 -7951
4065 CHTR 20130718 2014-08-26 16:00:00 124.53 100 124.550 -100
Here's a cute way:
foo<- 1:10
threshold <- 5
foo<- foo*(-1)^(foo < threshold)
foo
[1] -1 -2 -3 -4 5 6 7 8 9 10
Another method:
foo = 1:10 ; threshold = 5
foo
[1] 1 2 3 4 5 6 7 8 9 10
foo = ifelse(foo>=threshold, foo, -foo)
foo
[1] -1 -2 -3 -4 5 6 7 8 9 10
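Applied to the question itself: the loop there assigns the whole SIZE column (without [i]) on every iteration, so the last iteration decides the sign of every row. A vectorized sketch on the six rows shown in the question's output, using sign() as one possible approach. Note the assumption that a trade exactly at the midpoint gets a signed volume of 0, a case the original loop leaves unassigned:

```r
# The six rows shown in the question's output
trades <- data.frame(
  PRICE    = c(124.46, 124.46, 124.52, 124.53, 124.57, 124.53),
  SIZE     = c(100, 100, 100, 100, 7951, 100),
  midpoint = c(124.485, 124.495, 124.490, 124.410, 124.550, 124.550)
)

# Negative when the price is below the midpoint, positive when above;
# an NA midpoint yields NA, and price == midpoint yields 0 (an assumption)
trades$signed_volume <- trades$SIZE * sign(trades$PRICE - trades$midpoint)
trades
```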
