data.table: How to indicate first occurrence of unique column value by group - r

I have a large data.table (~18*10^6 rows) with columns ID and CLASS, and I want to create a new binary column that flags the first occurrence of each new CLASS value within each ID.
DT <- data.table::data.table(ID = c("1", "1", "1", "2", "2"),
                             CLASS = c("a", "a", "b", "c", "b"))
### Starting
ID CLASS
1 a
1 a
1 b
2 c
2 b
### Desired
ID CLASS NEWCLS
1 a 1
1 a 0
1 b 1
2 c 1
2 b 1
I originally initialized the NEWCLS variable and used the data.table::shift() function to lag a 1 by ID and CLASS
DT[,NEWCLS:=0]
DT[,NEWCLS:=data.table::shift(NEWCLS, n = 1L, fill = 1, type = "lag"),by=.(ID,CLASS)]
This creates the desired output but with ~18*10^6 rows it takes quite some time, even for data.table.
Would someone know how to create the NEWCLS variable in a quicker and more efficient way using data.table alone?

One possibility could be:
DT[, NEWCLS := as.integer(!duplicated(CLASS)), by = ID]
ID CLASS NEWCLS
1: 1 a 1
2: 1 a 0
3: 1 b 1
4: 2 c 1
5: 2 b 1
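Another option that avoids per-group iteration entirely is rowid(), which numbers rows within each combination of its arguments; a value of 1 then marks the first occurrence. A minimal sketch (worth benchmarking against the duplicated() approach on your own data):

```r
library(data.table)

DT <- data.table(ID = c("1", "1", "1", "2", "2"),
                 CLASS = c("a", "a", "b", "c", "b"))

# rowid() numbers rows within each (ID, CLASS) combination,
# so a value of 1 marks the first occurrence -- no by= grouping needed
DT[, NEWCLS := as.integer(rowid(ID, CLASS) == 1L)]
```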

Related

Counting group size including zero with R's data.table

I have a small (< 10M row) data.table with a variable that takes on integer values. I would like to generate a count of the number of times the variable takes on each integer value, including zeroes for values it never takes on.
For example, I might have:
dt <- data.table(a = c(1,1,3,3,5,5,5))
My desired output is a data.table with values:
a N
1 2
2 0
3 2
4 0
5 3
This is an extremely basic question, but it is difficult to find data.table specific answers for it. In my example, we can assume that the minimum is always 0, but the maximum variable value is unknown.
dt[, .N, by = .(a)
][data.table(a = seq(min(dt$a), max(dt$a))), on = .(a)
][is.na(N), N := 0][]
# a N
# <int> <int>
# 1: 1 2
# 2: 2 0
# 3: 3 2
# 4: 4 0
# 5: 5 3
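If the values are positive integers and the count should run from 1 up to the maximum, base R's tabulate() does this in a single pass with no join, returning zeros for absent values. A sketch under that assumption:

```r
library(data.table)

dt <- data.table(a = c(1, 1, 3, 3, 5, 5, 5))

# tabulate() counts occurrences of the integers 1..nbins in one pass,
# returning zeros for values that never appear
res <- data.table(a = seq_len(max(dt$a)),
                  N = tabulate(dt$a, nbins = max(dt$a)))
```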

R - How to reformat wide dataset to wider dataset by changing the primary unique row variable

Sorry if this has been asked already and if the title is confusing. I did look, and only found questions on reformatting where the values of one column were used as column headings in the output dataset.
My dataset is organized so that the filter is the unique value for each row. I want to change it so that the id of the individual within each sampling season is unique for each row, since individuals had multiple filters. Basically, I want to reformat Table 1 so it looks like Table 2.
Table 1
id season FilterI
1: 1 1 A
2: 1 1 B
3: 2 1 C
4: 2 1 D
5: 1 2 E
6: 1 2 F
Table 2
id season FilterI1 FilterI2
1: 1 1 A B
2: 1 2 E F
3: 2 1 C D
Reshape does not seem to work because none of the columns in the first dataset contain the column headings for the second dataset.
Using dcast with rowid, counting rows within each id/season pair, change from 'long' to 'wide' (assuming the example data is a data.table)
library(data.table)
dcast(Table1, id + season ~ paste0("FilterI", rowid(id, season)), value.var = "FilterI")
# id season FilterI1 FilterI2
#1: 1 1 A B
#2: 1 2 E F
#3: 2 1 C D
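A self-contained sketch of this approach, reconstructing the example data from the question (column names assumed from Table 1):

```r
library(data.table)

Table1 <- data.table(id      = c(1, 1, 2, 2, 1, 1),
                     season  = c(1, 1, 1, 1, 2, 2),
                     FilterI = c("A", "B", "C", "D", "E", "F"))

# rowid(id, season) restarts the counter within each id/season pair,
# so the filters spread across FilterI1, FilterI2, ... in each output row
Table2 <- dcast(Table1, id + season ~ paste0("FilterI", rowid(id, season)),
                value.var = "FilterI")
```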

Wrapping cumulative sum from a set starting row in R

I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
start = c(rep(2,5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since the function cares about row order and not any particular factor.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle such as hours in a day - and for example, if the start time is 2, that means observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time >= start, result := seq_len(.N), by = region]
wt[, Max := max(result, na.rm = TRUE), by = region]
wt[is.na(result), result := Max + seq_len(.N), by = region][, Max := NULL][]
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time >= start, result := cumsum(value), by = region]
wt[, Tot := sum(value[time >= start]), by = region]
wt[is.na(result), result := Tot + cumsum(value), by = region][, Tot := NULL][]
This swaps cumsum in for the calculated sequence, and offsets the pre-start rows by the total of the post-start values (using the max, as in the sequence version, would fail if value contained negative numbers).
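Since runtime is a concern, the whole thing can also be done in one grouped pass, with no physical reordering and no intermediate columns. The idea is to compute the wrapped row order inside each group, take the cumulative sum in that order, and scatter the results back. A sketch, assuming time runs 1..N within each region (as in the hours-in-a-day framing from the edit):

```r
library(data.table)

wt <- data.table(region = c(rep("A", 5), rep("B", 5)),
                 time   = c(1:5, 1:5),
                 start  = c(rep(2, 5), rep(4, 5)),
                 value  = rep(1, 10))

# One grouped pass: order each region's rows so the series starts at
# `start` and wraps around, take cumsum in that order, then scatter the
# results back to the original row positions.
wt[, result := {
  o <- order((time - start[1L]) %% .N)  # wrapped row order
  r <- numeric(.N)
  r[o] <- cumsum(value[o])
  r
}, by = region]
```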

Adding a counter column for a set of similar rows in R [duplicate]

This question already has answers here:
How can I rank observations in-group faster?
(4 answers)
Closed 9 years ago.
I have a data-frame in R with two columns. The first column contains the subjectID and the second column contains the trial ID that subject has done.
A specific subjectID might have done a given trial more than once. I want to add a counter column that starts at 1 for each unique subject-trial combination and increments by 1 until the last row with that occurrence.
More precisely, I have this table:
ID T
A 1
A 1
A 2
A 2
B 1
B 1
B 1
B 1
and I want the following output
ID T Index
A 1 1
A 1 2
A 2 1
A 2 2
B 1 1
B 1 2
B 1 3
B 1 4
I really like the simple syntax of data.table for this (not to mention speed)...
# Load package
require(data.table)
# Turn data.frame into a data.table
dt <- data.table(df)
# Get running count by ID and T
dt[, Index := 1:.N, by = c("ID", "T")]
# ID T Index
#1: A 1 1
#2: A 1 2
#3: A 2 1
#4: A 2 2
#5: B 1 1
#6: B 1 2
#7: B 1 3
#8: B 1 4
.N is an integer equal to the number of rows in each group. The groups are defined by the column names in the by argument, so 1:.N gives a vector as long as the group.
As data.table inherits from data.frame, any function that takes a data.frame as input will also take a data.table, and you can easily convert back if you wish (df <- data.frame(dt)).
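In more recent versions of data.table (1.9.8+), rowid() gives the same running count without an explicit by argument; a minimal sketch on data matching the example:

```r
library(data.table)

dt <- data.table(ID = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 T  = c(1, 1, 2, 2, 1, 1, 1, 1))

# rowid() numbers rows within each (ID, T) combination,
# equivalent to 1:.N grouped by those columns
dt[, Index := rowid(ID, T)]
```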

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next, I would like a count column that lists the number of times the same value occurs per ID. If it occurs more than once, the count will obviously repeat. There are other variables in this data set, but the new count variable needs to be conditional on only 2 of them. I have scoured this site, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by @Arun, you can replace transform with mutate if you are working with a large data.frame.
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
Since := adds the column by reference, the original dimensions are retained, so there is no need to join the result back to your initial data.frame.
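For completeness, here is the data.table version run end to end on the example data (reconstructed from the question's table):

```r
library(data.table)

df <- data.frame(ID    = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))

# .N is the group size; := adds Count by reference,
# so the result keeps the original rows and row order
data <- as.data.table(df)
data[, Count := .N, by = list(ID, Value)]
```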
