New to R and working my way through DataCamp to better understand data.tables. I'm trying to apply := and set() on a large data set to aggregate.
I'm wondering if someone can give me a pointer on including conditionals in a := or set() call for data.tables. I have a large 10-million-row, 20+ column data.table where I am trying to group by an ID and period (using setkey), testing row i-1 against row i for the column "period" to produce a categorical output of 0 or 1 in the "flag" column.
I've tried:
for(i in 1:200)
set(DT, i, .(period[i]-period[i-1]<=1, period[i]-period[i-1]>1), flag = .(0,1))
# error is unused argument flag=.(0,1)
I'm probably mixing up :=, set(), and base R. I haven't seen examples that compare one row to another and use the index in the call to produce two different outputs.
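For reference, here is a minimal sketch of that kind of grouped row-to-row comparison using shift() and :=; the names DT, ID, period, and flag follow the description above, and the first row of each group comes out as NA because it has no previous row:
library(data.table)
setkey(DT, ID, period)   # order rows within each ID by period
DT[, flag := as.integer(period - shift(period) > 1), by = ID]   # 1 if the gap to the previous row exceeds 1, else 0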
Related: Is it possible to assign values to multiple columns using "set"?
Here is an example. For context, I want to create two new sets of columns: one that imputes missing/NA values to 0, and another that indicates whether missing values were imputed. The first set of columns will duplicate an existing set but have 0 instead of NA and carry the suffix "_M0". The second set will be 0/100 and carry the suffix "_MISS".
I will use the iris data frame as a starting point.
## load data.table and create a copy of the iris data frame that I can modify
library(data.table)
local_iris <- copy(iris)
## make the local iris copy a data.table
iris.dt <- setDT(local_iris)
There isn't any missing data in iris, so I will add some for testing.
## make some parts of these columns missing, i.e., set to NA
iris.dt[1:5, Sepal.Width := NA][6:10, Sepal.Length := NA]
I'm using only the "Sepal" columns here, so I want to save those names and create new column names based on it.
## 'grep' returns the positions that match the pattern; 'grepl' returns a logical vector of the same length as the argument
## indexing names() with the result of grep does the trick, even if it seems a tiny bit repetitive/clunky
## (grep("^Sepal", names(iris.dt), value = TRUE) would return the matching names directly)
bert <- names(iris.dt)[grep("^Sepal", names(iris.dt))]
## create lists like the original list with new suffixes
bert_M0 <- paste0(bert, "_M0")
bert_MISS <- paste0(bert, "_MISS")
This part seemed clear to me and went pretty well, but I'm open to suggestions if there are obvious (or not-so-obvious!) ways to streamline it.
Regarding my data.table and other object names: I try to pick unusual names when testing to ensure I'm not duplicating another name.
## the best way to go about this is unclear
## I will settle for 'a' way and worry about the 'best' way later
## one approach is to extend the data.table with the new columns and then modify their values in place later
## create a copy of the relevant columns
M0 <- iris.dt[, .SD, .SDcols = bert]
## rename the columns
setnames(M0, old = bert, new = bert_M0)
## create a new data.table with the copied columns
opus <- cbind(iris.dt, M0)
## this creates the indicator columns and sets all the _MISS columns equal to 0
opus[, (bert_MISS) := 0L]
Then I'm going to use set and loop through my columns to recode missings and set the flags/dummy vars.
BUT, and here is my main question: is it possible to do this with only ONE set? Or do I need one set per column?
## try using "set"
for (j in seq_len(length(bert))) { # seq_len(arg) is an alternative way of writing 1:arg
  set(opus,                          ## the data.table we are operating on
      which(is.na(opus[[bert[j]]])), ## the values of i
      bert_M0[j],                    ## the column
      0                              ## the value
  )
  set(opus,                          ## the data.table we are operating on
      which(is.na(opus[[bert[j]]])), ## the values of i
      bert_MISS[j],                  ## the column
      100                            ## the value
  )
}
Thanks!
I think this addresses your question
for (j in seq_len(length(bert))) set(
  opus,
  which(is.na(opus[[bert[j]]])),
  c(bert_M0[j], bert_MISS[j]),
  list(0, 100)
)
You basically provide the column names as a character vector and the values as a list.
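A loop-free alternative is also possible with := and lapply() over .SD; the sketch below assumes the bert, bert_M0, and bert_MISS name vectors defined above:
## a loop-free sketch: recode NAs to 0 in the _M0 copies and set the _MISS flags in one pass each
opus[, (bert_M0)   := lapply(.SD, function(v) { v[is.na(v)] <- 0; v }), .SDcols = bert]
opus[, (bert_MISS) := lapply(.SD, function(v) 100L * is.na(v)), .SDcols = bert]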
I have a relatively large dataset that I wouldn't qualify as 'big data'. It's around 3 to 5 million rows; because of the size I'm using the data.table library to do analysis.
The dataset (named df, which is a data.table structure) composition can essentially be broken into:
n identifier fields, hereafter ID_1, ID_2, ..., ID_n, some of which are numeric and some of which are character vectors.
m categorical variables, hereafter C_1, ..., C_m, all of which are character vectors and have very few distinct values apiece (2 in one, 3 in another, etc.).
2 measurement variables, M_1, and M_2, both numeric.
A subset of data is identified by ID_1 through ID_n, has a full set of all values of C_1 through C_m, and has a range of values of M_1 and M_2. A subset of data consists of 126 records.
I need to accurately count the unique sets of data and, because of the size of the data, I would like to know if there already exists a much more efficient way to do this. (Why roll my own if other, much smarter, people have done it already?)
I've already done a fair amount of Google work to arrive at the method below.
What I've done is to use the ht package (https://github.com/nfultz/ht) so that I can use a data frame as a hash key (with digest in the background).
I paste together the ID fields to create a new, single column, hereafter referred to as ID, which resembles...
ID = "ID1:ID2:...:IDn"
Then I loop through each unique set of identifiers and, using just the subset data frame of C_1 through C_m, M_1, and M_2 (126 rows of data), hash that subset and record the ID under it.
Afterwards I'm taking that information and putting it back into the data frame.
# Create the hash structure
datasets <- ht()
# Declare the fields which will denote a subset of data
uniqueFields <- c("C_1",..."C_m","M_1","M_2")
# Create the REPETITIONS field in the original data.table structure
df[,REPETITIONS := 0]
# Create the KEY field in the original data.table structure
df[,KEY := ""]
# Use the updateHash function to fill datasets
updateHash <- function(val){
  key <- df[ID == val, uniqueFields, with = FALSE]
  if (is.null(datasets[key])) {
    # If this unique set of data doesn't already exist in datasets...
    datasets[key] <- list(val)
  } else {
    # If this unique set of data does already exist in datasets...
    datasets[key] <- append(datasets[key], val)
  }
}
# Loop through the ID fields. I've explored using apply;
# this vector is around 10-15K long. This version works.
for (id in unique(df$ID)) {
updateHash(id)
}
# Now update the original data.table structure so the analysis can
# be done. Again, I could use the R apply family, this version works.
for (dataset in ls(datasets)) {
  IDS <- unlist(datasets[[dataset]]$val)
  # For this set of information, how many times was it repeated?
  df[ID %in% IDS, REPETITIONS := length(datasets[[dataset]]$val)]
  # For this set, what is a unique identifier?
  df[ID %in% IDS, KEY := dataset]
}
This does what I want, though it's not blindingly fast. I now have the capability to present some neat analysis revolving around variability in datasets to people who care about it. I don't like that it's hacky and, one way or another, I'm going to clean this up and make it better. Before I do that, I want to do my final due diligence and see if it's simply my Google-fu failing me.
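On the due-diligence point, a pure data.table alternative can avoid the explicit hash table by computing one digest per ID group and then counting matching digests. This is only a sketch: setCols uses placeholder names (list every real C_* and M_* column), and it assumes the pasted ID column described above already exists in df.
library(data.table)
library(digest)
setCols <- c("C_1", "C_2", "M_1", "M_2")   # placeholder names: list every C_* and M_* column here
# 1. one signature per ID: fix the row order of the defining columns, then digest each group's columns
#    (as.list() drops data.table internals so that equal content gives equal hashes)
sig <- df[order(C_1, C_2, M_1, M_2),
          .(KEY = digest(as.list(.SD))),
          by = ID, .SDcols = setCols]
# 2. how many IDs share each signature?
sig[, REPETITIONS := .N, by = KEY]
# 3. update-join the signature and its count back onto the original table
df[sig, on = "ID", `:=`(KEY = i.KEY, REPETITIONS = i.REPETITIONS)]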
I am trying to learn the recommended ways to create a new column for a data table when the column of interest is a list (or vector), and the selection is done relative to another column, and there may be a preliminary selection done as part of a chain.
Consider these data (named tmp). We want to find the minimum value of sacStartT greater than stimTime (in the real data, one or the other of these could be empty and no minimum would exist).
tmp = data.table("pid" = c(14,14,9,9),"trialNumber" = c(25,26,25,26),"stimTime" = c(100,200,1,2),"sacStartT" = list(c(98,99,101,102), c(201,202), c(5), c(-2,-3,3)))
This works:
tmp[,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(tmp))]
But if I wanted to first subselect the data I don't know how to get that row number for the row-by-row analysis, e.g.
tmp[pid == 9][,"mintime" := as.integer(min(unlist(sacStartT)[unlist(sacStartT)>stimTime])),by=seq_len(nrow(.N))]
fails because .N refers to the number of rows in tmp, and not the subset in the chain.
In summary, the question is a composition of the following (one possible approach is sketched after the list):
Recommendations for doing this row by row analysis?
How to find the right number for the by argument in a chain?
Recommendations for dealing with data.table elements that contain lists? Do you just have to manually unlist them all?
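On the first two points, one option is to skip the row-number by altogether: put the subsetting condition in i of the same := call and compute row-wise with mapply() over the list column. This is a sketch under the assumption that the result should end up back in tmp; minAfter is a hypothetical helper name.
# helper: smallest sacStartT value strictly greater than stimTime, or NA if none exists
minAfter <- function(sac, stim) {
  cand <- unlist(sac)
  cand <- cand[cand > stim]
  if (length(cand) == 0L) NA_integer_ else as.integer(min(cand))
}
tmp[pid == 9, mintime := mapply(minAfter, sacStartT, stimTime)]
Note that tmp[pid == 9][, mintime := ...] only modifies the temporary table returned by the chain; sub-assigning with the condition in i keeps the result in tmp itself.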
A dummy column for a column c and a given value x equals 1 if c == x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choice, since the last dummy column adds no information beyond the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_') # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, has many more:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate columns in your data.table, setting the number higher than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long table (see the entity-attribute-value (EAV) model). Then you need only one column per data type.
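For illustration, a rough sketch of that reshaping with data.table's melt(), assuming the dummy columns created above all share the d_ prefix used in cols:
long <- melt(myDataTable, measure.vars = patterns("^d_"),
             variable.name = "dummy", value.name = "flag")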
Q1:
Is it possible to search on two different columns in a data.table? I have around 2 million rows of data and I want the option to search on either of two columns. One has names and the other has integers.
Example:
x <- data.table(foo=letters,bar=1:length(letters))
x
want to do
x['c'] : searching on foo column
as well as
x[2] : searching on bar column
Q2:
Is it possible to change the default data types in a data.table? I am reading in a matrix with both character and integer columns; however, everything is being read in as character.
Thanks!
-Abhi
To answer your Q2 first: a data.table is a data.frame, and both are internally lists. Each column of a data.table (or data.frame) can therefore be of a different class, but you can't do that with a matrix. You can use := to change the class (by reference, so no unnecessary copy is made), for example of "bar" here:
x[, bar := as.integer(as.character(bar))]
For Q1, if you want to use the fast subset (binary search) feature of data.table, then you have to set a key using the function setkey.
setkey(x, foo)
allows you to fast-subset on 'foo' like: x['a'] (or x[J('a')]). Similarly, setting the key on 'bar' allows you to fast-subset on that column.
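For example, a small sketch of the numeric case:
setkey(x, bar)   # key on the integer column instead
x[J(2L)]         # binary-search subset: the row where bar == 2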
If you set the key on both 'foo' and 'bar' then you can provide values for both like so:
setkey(x) # or alternatively setkey(x, foo, bar)
x[J('c', 3)]
However, this'll subset those rows where foo == 'c' and bar == 3. Currently, I don't think there is a way to do an OR (|) operation with fast subsetting directly. You'll have to resort to a vector-scan approach in that case.
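For instance, a plain vector-scan version of the OR query would be:
x[foo == 'c' | bar == 2]   # scans both columns; no key required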
Hope this is what your question was about. Not sure.
Your matrix is already character; matrices hold only one data type. You can try x['c'] and x[J(2)] (each requires the corresponding key to be set). You can change data types as x[, col := as.character(col)].