lag row value in R by groups

I want the previous row value for different groups. I have gone through the solution given here and also tried the code shown below.
new_data[,avg_week := shift(.(avg_travel_time),type = "lag"), by = identifier]
This is the error that I am getting.
Error in `[.data.frame`(new_data, , `:=`(avg_week, c(NA, avg_travel_time[-.N])), :
unused argument (by = identifier)

There are two problems in the OP's code: 1) the dataset is a data.frame and not a data.table; 2) the use of .() inside shift, which is not required. We need to first convert to a data.table with setDT(new_data) before applying the data.table syntax:
setDT(new_data)[,avg_week := shift(avg_travel_time, type = "lag"), by = identifier]
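A quick check on toy data (the values below are invented for illustration):
library(data.table)
new_data <- data.frame(identifier = c("A", "A", "A", "B", "B"),
                       avg_travel_time = c(10, 12, 14, 20, 22))
setDT(new_data)[, avg_week := shift(avg_travel_time, type = "lag"), by = identifier]
new_data
#    identifier avg_travel_time avg_week
# 1:          A              10       NA
# 2:          A              12       10
# 3:          A              14       12
# 4:          B              20       NA
# 5:          B              22       20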

Related

Error in creating function with data.table syntax using multiple columns as arguments for "by"

I am having trouble using two columns as arguments for the data.table argument "by" in a function.
The code below creates value-weighted portfolios according to different categories (e.g., state), where DT is a stacked data.frame containing stock prices for firms in the United States.
Here is my current code...
vw_testfunction <- function(bycol1, bycol2){
  vwret_df <- DT[, list(vw_ret = lapply(.SD,
                          function(x) sum(x*mcap/sum(mcap, na.rm = TRUE), na.rm = TRUE))),
                 by = list(bycol1, bycol2), # adjust this line to manage grouping
                 .SDcols = "ret"]
  return(vwret_df)
}
test <- vw_testfunction("state", "datadate")
...where I get this error:
Error in `[.data.table`(DT, , list(vw_ret = lapply(.SD, function(x) sum(x * :
The items in the 'by' or 'keyby' list are length(s) (1,1). Each must be length 527561; the same length as there are rows in x (after subsetting if i is provided).
I have tried the solution here, but using the arguments inside list() is causing the error.
Any advice is greatly appreciated.
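One possible fix, offered as a sketch rather than a definitive answer: data.table's by also accepts a character vector of column names, so the string arguments can be passed with c() instead of list(). The toy table below stands in for the real DT (column values are invented):
library(data.table)
DT <- data.table(state    = c("NY", "NY", "CA", "CA"),
                 datadate = c("2020-01-31", "2020-01-31", "2020-01-31", "2020-01-31"),
                 ret      = c(0.01, 0.02, -0.01, 0.03),
                 mcap     = c(100, 200, 150, 50))
vw_testfunction <- function(bycol1, bycol2){
  # by = c(...) takes the column names as strings, so the function's
  # character arguments can be passed straight through
  vwret_df <- DT[, list(vw_ret = lapply(.SD,
                          function(x) sum(x*mcap/sum(mcap, na.rm = TRUE), na.rm = TRUE))),
                 by = c(bycol1, bycol2),
                 .SDcols = "ret"]
  return(vwret_df)
}
test <- vw_testfunction("state", "datadate")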

Is there a function to invert the number of occurrences of values in a data.table?

Is there a function that can invert the number of occurrences of a value in a data.table, as opposed to sorting by frequency? E.g. say I have this:
install.packages('data.table')
require(data.table)
initially = data.table(initially = c('a,a','b,b','b,b','c,c','c,c','c,c'))
View(initially)
And wish to produce this:
required.inversion = data.table(required.inversion = c('a,a','a,a','a,a','b,b','b,b', 'c,c'))
View(required.inversion)
The way I was thinking of doing this was to produce a frequency table:
initial.frequencies = initially[, .N ,by = initially]
View(initial.frequencies)
Sort it to ensure it's in ascending frequency order:
initial.frequencies = initial.frequencies[,.SD[order(N)]]
View(initial.frequencies)
Store the order of those initial values:
inversion.key = initial.frequencies$initially
View(inversion.key)
Re-sort the data.table so it's in descending frequency order:
initial.frequencies = initial.frequencies[,.SD[order(N, decreasing = TRUE)]]
View(initial.frequencies)
Then insert the original order back into the table:
initial.frequencies$inversion.key = inversion.key
View(initial.frequencies)
I now have a 'key' showing me how many times each initial value would need to be replicated to invert the number of times it occurs, i.e. that I'd need to multiply the number of times 'a,a' occurs by three, 'b,b' by two and 'c,c' by one.
I'm not sure how to actually replicate the values in the original table, and this seems like a bad approach to take anyway, as it would also double the length of the table:
this.approach.would.yield.this.in.the.ram = data.table(this.approach.would.yield.this.in.the.ram = c('a,a','b,b','b,b','c,c','c,c','c,c', 'a,a','a,a','a,a','b,b','b,b', 'c,c'))
View(this.approach.would.yield.this.in.the.ram)
If we use the approach by the OP, then we can just replicate the rows by the reverse of 'N' and assign 'N' to NULL:
initially[, .N, by = initially][rep(seq_len(.N), rev(N))][, N := NULL][]
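Breaking that chain into steps, using the initially table from the question (note this relies on the groups appearing in ascending frequency order, as they do for this input; for arbitrary data you would sort the counts by N first):
counts <- initially[, .N, by = initially]    # frequencies: a,a = 1; b,b = 2; c,c = 3
inverted <- counts[rep(seq_len(.N), rev(N))] # repeat row i rev(N)[i] times: 3, 2, 1
inverted[, N := NULL]                        # drop the helper count column
inverted[]                                   # a,a x3, b,b x2, c,c x1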

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back into the original data.table? Generally, I don't want to add any summary statistic as a separate column, just swap in the transformed versions.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the "scale" function of base R on each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of updating the columns with their per-row-scaled versions:
dt = data.table object
dt[, grep("keyword", colnames(dt), value = T) :=
     as.data.table(t(apply(dt[, grep("keyword", colnames(dt)), with = F], 1, scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute per-row mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = grep("keyword", colnames(DT))]
# scale each column by the per-row mean/sd:
DT[, grep("keyword", colnames(DT), value = TRUE) :=
     lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols = grep("keyword", colnames(DT))]
PART 1: The one line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
.SDcols = grep("corrupt", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("corrupt", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[, (grep("keyword", colnames(DT))) :=
     lapply(.SD, function(x){scale(x, center = F)}),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone who is still searching for a way that:
feels a bit less condensed;
is easier to understand;
is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the names of the columns to transform (value = TRUE returns names, not indices)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply:
# Define the function you wish to apply;
# here, normalize is just an example standardization function:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, the newly created set of columns that contain the transformed values:
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
The new values are stored in the columns whose names are in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
The untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.
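For completeness, here is the whole step-by-step flow run end-to-end on a hypothetical two-column example (the data and names are invented):
library(data.table)
library(magrittr)
df <- data.frame(id = c("r1", "r2", "r3"),
                 keyword.a = c(1, 2, 3),
                 keyword.b = c(10, 20, 30))
DT <- as.data.table(df)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}
# transformed values land in the new columns; the originals stay intact
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
DT[, .SD, .SDcols = Reference.Cols.normalized]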

Error deleting factor column in empty data.table

If I have an empty data.table with a factor column, the factor column can't be removed with the := NULL operator. Integer and character columns have no problems.
library(data.table)
DT <- data.table(numbers = integer(0),
                 char.letters = character(0),
                 factor.letters = factor(character(0)))
DT[, factor.letters := NULL]
I get the following error:
Error in `[.data.table`(DT, , `:=`(factor.letters, NULL)) :
Can't assign to column 'factor.letters' (type 'factor') a value of type 'NULL' (not character, factor, integer or numeric)
Note that DT[, char.letters := NULL] and DT[, numbers := NULL] do not produce errors.
Since factor columns behave differently from character and integer columns, I suspect this is a problem with data.table, but am I doing anything incorrectly?
Edit: Previous example used join to create the empty data.table (which was then called join), but it can be reproduced just as easily by creating it directly.
Thanks for reporting. Now fixed in v1.8.9. From NEWS:
Deleting a (0-length) factor column using := NULL on an empty data.table now works, #4809. Thanks to Frank Pinter for reporting. Test added.

Remove multiple columns from data.table

What's the correct way to remove multiple columns from a data.table? I'm currently using the code below, but was getting unexpected behavior when I accidentally repeated one of the column names. I wasn't sure if this was a bug, or if I shouldn't be removing columns this way.
library(data.table)
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","y") := NULL]
names(DT)
[1] "z"
The above works fine, but
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","x") := NULL]
names(DT)
[1] "z"
This looks like a solid, reproducible bug. It's been filed as Bug #2791.
It appears that repeating the column attempts to delete the subsequent columns.
If no columns remain, then R crashes.
UPDATE: Now fixed in v1.8.11. From NEWS:
Assigning to the same column twice in the same query is now an error rather than a crash in some circumstances; e.g., DT[,c("B","B"):=NULL] (delete by reference the same column twice). Thanks to Ricardo (#2751) and matt_k (#2791) for reporting. Tests added.
This Q has been answered, but regard this as a side note.
I prefer the following syntax to drop multiple columns:
DT[ ,`:=`(x = NULL, y = NULL)]
because it matches the syntax for adding multiple columns (variables):
DT[ ,`:=`(x = letters, y = "Male")]
This also checks for duplicated column names, so trying to drop x twice will throw an error message.
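A quick illustration of the functional form, assuming data.table >= 1.8.11:
library(data.table)
DT <- data.table(x = letters, y = letters, z = letters)
DT[, `:=`(x = NULL, y = NULL)]   # drop x and y in one call
names(DT)
# [1] "z"
# DT[, `:=`(x = NULL, x = NULL)] # per the answer above, naming x twice
#                                # errors instead of silently corrupting DT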
