Working with data.table package in R, I'm trying to get the 'group number' of some data points.
Specifically, my data is trajectories: I have many rows describing a specific observation of the particle I'm tracking, and I want to generate a specific index for the trajectory based on other identifying information I have.
If I do a [, , by] command, I can group my data by this identifying information and isolate each trajectory.
Is there a way, similar to .I or .N, which gives what I would call the index of the subset?
Here's an example with toy data:
dt <- data.table(x1 = c(rep(1,4), rep(2,4)),
                 x2 = c(1,1,2,2,1,1,2,2),
                 z = runif(8))
I need a fast way to get the trajectory indices (here they should be c(1,1,2,2,3,3,4,4) for the observations) -- my real data set is moderately large.
If we only need group numbers based on runs of 'x2', we can use rleid:
dt[, Grp := rleid(x2)]
Or if we need the group numbers based on 'x1' and 'x2', .GRP can be used.
dt[, Grp := .GRP, by = .(x1, x2)]
Or this can be done with rleid alone, without by (as @Frank mentioned):
dt[, Grp := rleid(x1,x2)]
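For the toy data above, any of these approaches gives the expected trajectory indices; for example, after the last call:
dt$Grp
[1] 1 1 2 2 3 3 4 4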
I have about 30,000 rows of data with a Date column in date format. I would like to be able to count the number of rows by month/year and year, but when I aggregate with the below code, I get a vector within the data table for my results instead of a number.
Using the hyperlinked csv file, I have tried the aggregate function.
https://www.dropbox.com/s/a26t1gvbqaznjy0/myfiles.csv?dl=0
short.date <- strftime(myfiles$Date, "%Y/%m")
aggr.stat <- aggregate(myfiles$Date ~ short.date, FUN = count)
Below is a view of the aggr.stat data frame. There are two columns and the second one beginning with "c(" is the one where I'd like to see a count value.
1 1969/01 c(-365, -358, -351, -347, -346)
2 1969/02 c(-323, -320)
3 1969/03 c(-306, -292, -290)
4 1969/04 c(-275, -272, -271, -269, -261, -255)
5 1969/05 c(-245, -240, -231)
6 1969/06 c(-214, -211, -210, -205, -204, -201, -200, -194, -190, -186)
I'm not much into downloading any unknown file from the internet, so you'll have to adapt my proposed solution to your needs.
You can solve the problem with the help of data.table and lubridate.
Imagine your data has at least one column called dates containing actual dates (that is, class(df$dates) returns Date or something similar, such as POSIXct).
# load libraries
library(data.table)
library(lubridate)
# convert df to a data.table
setDT(df)
# count rows per month
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
.N counts the number of rows, by = groups the data. See ?data.table for further details.
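A minimal self-contained sketch, with the dates made up for illustration (the df and dates names follow the assumption above):
library(data.table)
library(lubridate)
# toy data made up for illustration
df <- data.table(dates = as.Date(c("1969-01-05", "1969-01-20", "1969-02-03")))
df[, .N, by = .(monthDate = floor_date(dates, "month"))]
which returns:
    monthDate N
1: 1969-01-01 2
2: 1969-02-01 1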
Consider running everything from data frames. Specifically, add the needed month/year column to the data frame and then run aggregate with the data argument (instead of passing separate vectors). Finally, there is no count() function in base R; use length instead:
# NEW COLUMN
myfiles$short.date <- strftime(myfiles$Date, "%Y/%m")
# AGGREGATE WITH SPECIFIED DATA
aggr.stat <- aggregate(Date ~ short.date, data = myfiles, FUN = length)
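If you also want counts by year alone, the same pattern applies (a sketch; the year column name here is my own choice):
myfiles$year <- strftime(myfiles$Date, "%Y")
aggr.year <- aggregate(Date ~ year, data = myfiles, FUN = length)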
I have a data.table with a column of customer IDs, a column of days on which they made a purchase, and a column with the value of that purchase. What I want to do is to compute the average of the purchase values on each day across customers, filling in missing values with the next available value.
For simplicity's sake, I'll have no duplicate days in my minimal example.
library(data.table)
dat <- data.table(custid=rep(seq(10),5), day=sample(50), val=rnorm(50,0,1))[order(custid,day)]
Now, I know how to solve this, but I don't know how to do it efficiently. One solution is to expand the data.table so that missing values become NA, and then to carry the next observation backward using na.locf() from zoo:
library(zoo)
res <- dat[as.data.table(expand.grid(custid=seq(10), day=seq(50))),
           on=c('custid','day'), allow.cartesian=TRUE, nomatch=NA][order(custid,day)]
res[, val:=na.locf(val, fromLast=TRUE, na.rm=FALSE), by='custid']
res <- res[,list(meanVal=mean(val, na.rm=TRUE)), by='day']
However, this creates a very large table when there are many days and many customers, but most customers only purchased on a handful of days. So I don't want that.
Another solution is to loop over the days, filter and aggregate per day, and then bind the rows into a data.table again:
res2 <- list()
for (dy in seq(max(dat$day))) {
  res2 <- c(res2,
            list(dat[day >= dy, .SD[1], by='custid'][, list(day=dy, meanVal=mean(val, na.rm=T))]))
}
res2 <- rbindlist(res2)
However, this is slow.
Could anyone come up with a data.table solution that neither requires a slow loop, nor the creation of a large intermediate table?
In my limited testing this is faster than either of your options (btw, use CJ instead of as.data.table(expand.grid())), and it doesn't use much memory:
dat[dat, on = .(day >= day), mean(val[!duplicated(custid)]), by = .EACHI]
This assumes the data is sorted by day, as in the OP.
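For reference, the CJ() tip mentioned above would look roughly like this in the first (expand-the-table) approach; a sketch only, with the full name being my own:
# CJ() cross-joins the vectors directly, avoiding as.data.table(expand.grid())
full <- dat[CJ(custid = seq(10), day = seq(50)), on = c('custid', 'day')]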
Sorry if the title is confusing; I wasn't sure how to describe this problem. I have a data frame with one column for sampling site (of which I have many) and one column for sampling method (of which there are only two). Here's a simplified version:
site <- c("X", "Y", "X","Z")
method <- c("A", "B", "B", "A")
data <- data.frame(site, method)
data
site method
1 X A
2 Y B
3 X B
4 Z A
Now some sites got sampled using both sampling method A and method B, and some got sampled by only method A or method B.
I am trying to select only those sites that got sampled using both methods. For example, the output for this data would look like this:
site method
1 X A
2 X B
I don't have sample code because I honestly do not know how to do this. Please help!
We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(data), group by 'site', and if the number of unique values of 'method' is greater than 1, return the Subset of Data.table (.SD).
library(data.table)
setDT(data)[, if(uniqueN(method)>1) .SD , by = site]
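On the example data this returns only site X, the one sampled with both methods:
   site method
1:    X      A
2:    X      B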
Or we can do it with dplyr:
library(dplyr)
data %>%
  group_by(site) %>%
  filter(n_distinct(method) > 1)
A possible base R option would be ave(); note that ave() returns a character vector here, so it needs as.logical() before it can be used as a row index:
data[with(data, as.logical(ave(as.character(method), site,
                               FUN = function(x) length(unique(x)) > 1))), ]
After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.
I am having trouble with a particular task:
For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.
I can do this for the entire data.table using the set command, but I need to do this operation by station.
Example:
# Example data. Assume the columns are ordered by date.
set.seed(1)
DT <- data.table(station=sample.int(n=3, size=1e6, replace=TRUE),
                 wind=rgamma(n=1e6, shape=1.5, rate=1/10),
                 other=rnorm(n=1e6),
                 key="station")
# My attempt
max_rate <- 35
set(DT, i=which(c(NA, abs(diff(DT[['wind']]))) > max_rate),
    j=which(names(DT)=='wind'), value=NA)
# The results
summary(DT)
The trouble with my implementation is that I need to do this by station, and I do not want to get the lagged difference between the last reading in station 1 and the first reading of station 2.
I tried to use the by=station operator within the [ ], but I am not sure how to do this.
One way is to get the row numbers you have to replace using the special variable .I, and then assign NA to those rows by reference using the := operator (or set).
# get the row numbers
idx = DT[, .I[which(c(NA, abs(diff(wind))) > 35)], by=station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]
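The set() route mentioned above is essentially the same operation (a sketch reusing the idx computed above; j can be given as the column name):
set(DT, i = idx, j = "wind", value = NA_real_)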
FR #2793, filed by @eddi, will (when/if implemented) provide a much more natural way to accomplish this task: the expression yielding the corresponding indices goes on the LHS and the replacement value on the RHS. That is, in the future, we should be able to do:
# in the future - a more natural way of doing the same operation shown above.
DT[, wind[which(c(NA, abs(diff(wind))) > 35)] := NA_real_, by=station]
This is really two questions I guess. I'm trying to use the data.table package to summarize a large dataset. Say my original large dataset is df1 and unfortunately df1 has 50 columns (y0... y49) that I want the sum of by 3 fields (segmentfield1, segmentfield2, segmentfield3). Is there a simpler way to do this than typing every y0...y49 column out? Related to this, is there a generic na.rm=T for the data.table instead of typing that with each sum too?
dt1 <- data.table(df1)
setkey(dt1, segmentfield1, segmentfield2, segmentfield3)
dt2 <- dt1[,list( y0=sum(y0,na.rm=T), y1=sum(y1,na.rm=T), y2=sum(y2,na.rm=T), ...
                    y49=sum(y49,na.rm=T) ),
             by=list(segmentfield1, segmentfield2, segmentfield3)]
First, create variables for the column names in use and for the new names of the sums:
colsToSum <- setdiff(names(dt1), key(dt1))  # the y0...y49 columns, or whatever you need
summedNms <- paste0(colsToSum, "_sum")      # new names for the summed columns
If you'd like to put the result in a new data.table:
dt2 <- dt1[, lapply(.SD, sum, na.rm=TRUE),
           by = .(segmentfield1, segmentfield2, segmentfield3), .SDcols = colsToSum]
setnames(dt2, colsToSum, summedNms)
If, alternatively, you'd like to append the summed columns to the original:
dt1[, c(summedNms) := lapply(.SD, sum, na.rm=TRUE),
    by = .(segmentfield1, segmentfield2, segmentfield3), .SDcols = colsToSum]
As far as a general na.rm process, there is not one specific to data.table, but have a look at ?na.omit and ?na.exclude
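A quick toy check of the pattern (the table, column, and group names here are made up), showing that na.rm is simply passed through lapply() to each sum() call:
library(data.table)
toy <- data.table(segmentfield1 = c("a", "a", "b"),
                  y0 = c(1, NA, 3), y1 = c(2, 4, NA))
toy[, lapply(.SD, sum, na.rm = TRUE), by = segmentfield1, .SDcols = c("y0", "y1")]
which returns:
   segmentfield1 y0 y1
1:             a  1  6
2:             b  3  0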