A particular syntactic construct in R

In
Orders1=Orders[Datecreated<floor_date(send_Date,unit='week',week_start = 7)-weeks(PrevWeek),
.(Previous_Sales=sum(Sales)),
by=.(Category,send_Date=floor_date(send_Date,unit='week',week_start = 7))]
What does the . in .(Previous_Sales=sum(Sales)) mean? This is some syntactic nuance with which I am not familiar.
Also, what does by=.(Category,s....
Can someone help?

Here, . is an alias for list() inside data.table. The expression below creates a summarised output column
.(Previous_Sales=sum(Sales))
Or with list
list(Previous_Sales=sum(Sales))
In dplyr, similar syntax would be
summarise(Previous_Sales = sum(Sales))
and for creating a column or modifying an existing column, use
mutate(Previous_Sales = sum(Sales))
With data.table, updating/creating a column is done with :=
Previous_Sales := sum(Sales)
Similarly, by takes a list of grouping columns or expressions
by = list(Category, send_Date=floor_date(send_Date,unit='week',week_start = 7))
which can also be written with the . alias
by = .(Category, send_Date=floor_date(send_Date,unit='week',week_start = 7))
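A quick way to check that .() and list() are interchangeable is to run both on a toy table. The sketch below assumes a made-up table with columns Category and Sales (illustrative only, not the asker's Orders data) and groups by a plain column plus a named, computed expression, as in the question.

```r
library(data.table)

# Toy data; the names and values are assumptions for illustration
orders <- data.table(Category = c("a", "a", "b", "b"),
                     Sales    = c(10, 20, 30, 40))

# j and by written with the .() alias
res_dot <- orders[, .(Previous_Sales = sum(Sales)),
                  by = .(Category, big = Sales > 15)]

# the same call written with list()
res_list <- orders[, list(Previous_Sales = sum(Sales)),
                   by = list(Category, big = Sales > 15)]

# both spellings should produce the same grouped summary
isTRUE(all.equal(res_dot, res_list))
```

The named expression in by (big = Sales > 15) becomes a grouping column called big in the result, just as send_Date = floor_date(...) does in the original call.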
In data.table, the syntax always follows the same order
dt[i, j, by]
where i holds the row condition for subsetting, j holds the functions applied to one or more columns, and by holds the grouping columns. Using a simple example with iris
as.data.table(iris)[Sepal.Length < 5, .(Sum = sum(Sepal.Width)), by = Species]
Here i is Sepal.Length < 5, so only the rows meeting that condition are used to sum 'Sepal.Width'. Because by is provided, the sum of 'Sepal.Width' is computed for each 'Species', resulting in 3 rows (there are 3 unique 'Species'). We can also do this without i by doing the subsetting in j itself
as.data.table(iris)[, .(Sum = sum(Sepal.Width[Sepal.Length < 5])), by = Species]
For summarisation, both of these give the same result, but with an assignment (:=) the results differ
as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]
This creates a column 'Sum' and fills in the sum values only where Sepal.Length < 5; all other rows will be NA. If we use the second option
as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]
there won't be any NA element because it is subsetting within the j to create a single sum value for each 'Species'
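The difference between the two assignments can be checked directly. This is a small sketch using only the iris calls from the answer; the object names dt1 and dt2 are introduced here for illustration.

```r
library(data.table)

# Option 1: filter in i, then assign; rows outside the filter stay NA
dt1 <- as.data.table(iris)[Sepal.Length < 5, Sum := sum(Sepal.Width), by = Species]

# Option 2: subset inside j; every row of a group receives the group's sum
dt2 <- as.data.table(iris)[, Sum := sum(Sepal.Width[Sepal.Length < 5]), by = Species]

# Only the first result contains NA values
anyNA(dt1$Sum)
anyNA(dt2$Sum)

# Where the first result is filled in, both results agree
filled <- !is.na(dt1$Sum)
all(dt1$Sum[filled] == dt2$Sum[filled])
```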

Related

How do I multiply grouped values inside a column of data.table and return the data.table only with the result rows

How do I multiply the values inside a column by grouping from another column.
Let's say I have :
dt = data.table(group = c(1,1,2,2), value = c(2,3,4,5))
I want to multiply the elements of value with each other, but only those that belong to the same group, so the result would be:
dt=data.table(group=c(1,2), value=c(6,20))
I tried it with cumprod
dt[, new_value := cumprod(value), by = group]
but then that returns
dt=data.table(group=c(1,1,2,2), value=c(2,6,4,20)) and I don't know how to remove the rows that I don't need: those with values 2 and 4.
...
Taking the maximum is not a solution because the values could also be negative.
Updating for visibility with @chinsoon12's solution from the comments.
dt[, .(new_value = prod(value)), by = group]
Here's one option where you first perform the calculation and then take the last row by group.
dt[, .(new_value = cumprod(value)), by = group][,.SD[.N], by = group]
   group new_value
1:     1         6
2:     2        20
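Both answers can be compared on a variant of the question's data. The sketch below deliberately swaps in a negative value (an assumption, not the asker's data) to show that neither approach relies on taking a maximum.

```r
library(data.table)

# Same shape as the question's table, with one negative value added on purpose
dt <- data.table(group = c(1, 1, 2, 2), value = c(2, 3, -4, 5))

# Direct product per group
by_prod <- dt[, .(new_value = prod(value)), by = group]

# Running product per group, keeping only the last row of each group
by_last <- dt[, .(new_value = cumprod(value)), by = group][, .SD[.N], by = group]

by_prod
isTRUE(all.equal(by_prod, by_last))
```

prod() is the more direct choice when only the final product is needed; the cumprod() route is mainly useful if the intermediate running products are wanted as well.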

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".
I am able to use the rle(df) function to get the running length of all the "good" and "bad".
How do I find the 5 largest runs that consist ONLY of "bad"?
How do I get the starting and ending indices of the top 5 largest runs for ONLY "bad"?
Your assistance is much appreciated!
One option would be rleid. Convert the 'data.frame' to a 'data.table' (setDT(df1)), create a grouping column with rleid (which generates a unique id for each run of adjacent equal elements), add the number of elements per group as a column ('n') and the row number as another column ('rn'), subset the rows where 'goodbad' is "bad", order by 'n' in decreasing order and, grouped by 'grp', summarise the first and last row numbers as well as the entry for goodbad
library(data.table)
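setDT(df1)[, grp := rleid(goodbad)   # run id for adjacent equal values
  ][, n := .N, by = grp              # length of each run
  ][, rn := .I                       # row number in the full table
  ][goodbad == 'bad'
  ][order(-n), .(goodbad = first(goodbad), n = n[1],   # n[1] keeps one row per run
                 start = rn[1], last = rn[.N]), by = .(grp)
  ][n %in% head(unique(n), 5)][, grp := NULL][]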
Or we can use rle and other base R methods
rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp, subset(df3[order(-df3$n),],
n %in% head(unique(n), 5)), range))
data
set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
replace = TRUE), stringsAsFactors = FALSE)
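The run-length idea can also be sketched in a few lines of base R, since the end of each run is the cumulative sum of the run lengths. The toy vector x below is an assumption standing in for df1$goodbad. Note that head(bad, 5) keeps at most five runs, whereas the answers above keep every run whose length is among the five largest distinct lengths.

```r
# Toy vector; with the data above this would be df1$goodbad
x <- c("bad", "good", "bad", "bad", "bad", "good", "bad", "bad")

rl    <- rle(x)
end   <- cumsum(rl$lengths)       # last index of every run
start <- end - rl$lengths + 1     # first index of every run

runs <- data.frame(value = rl$values, n = rl$lengths,
                   start = start, end = end)

# Keep only the "bad" runs and sort them from longest to shortest
bad <- runs[runs$value == "bad", ]
bad <- bad[order(-bad$n), ]
head(bad, 5)
```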
The sort(...) function arranges values in increasing or decreasing order. The default is increasing, but you can set decreasing = TRUE. See ?sort for more info.
The which(...) function returns the indices of the values that meet a logical condition. The code below sorts the times column for the rows where goodbad is "good".
sort(your.df$times[which(your.df$goodbad == "good")])
If you wanted to get the top 5 you could do this:
top5_good <- sort(your.df$times[which(your.df$goodbad == "good")], decreasing = TRUE)[1:5]
top5_bad <- sort(your.df$times[which(your.df$goodbad == "bad")], decreasing = TRUE)[1:5]

How to explicitly name the count column generated by the .N function?

I want to group-by a data table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
id N
1: 1 2
2: 2 1
That works, but I want the N column to have a different name (e.g. count). The help says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in
advance and for convenience generally. When grouping by i, .N is the
number of rows in x matched to, for each row of i, regardless of
whether nomatch is NA or 0. It is renamed to N (no dot) in the result
(otherwise a column called ".N" could conflict with the .N variable,
see FAQ 4.6 for more details and example), unless it is explicitly
named; ... .
How to "explicitly name" the N-column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)
You have to wrap the output of your calculation in a list if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
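A minimal check of the naming behaviour, using the table from the question; the object names default_name and custom_name are introduced here for illustration.

```r
library(data.table)

dt <- data.table(id = c(1, 1, 2))

default_name <- dt[, .N, by = id]              # unnamed .N is renamed to "N"
custom_name  <- dt[, .(count = .N), by = id]   # named inside .(), so the name is kept

names(default_name)
names(custom_name)
```

The original attempt dt[, count = .N, by = id] fails because count = .N is passed as a separate, unknown argument to `[.data.table` rather than as part of j.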
If the table has already been created with the default name, rename the column with setnames
setnames(dt_by_id, "N", 'count')
or with dplyr's rename
library(dplyr)
dt_by_id %>%
rename(count = N)
# id count
#1: 1 2
#2: 2 1
With dplyr::count(x, name = "new column"), the default column name n is replaced with a name of your choice.
dt <- data.frame(id = c(1, 1, 2))
dt %>%
dplyr::count(id, name = 'count')

Loop through data.table and create new columns basis some condition

I have a data.table with quite a few columns. I need to loop through them and create new columns using some condition. Currently I am writing separate line of condition for each column. Let me explain with an example. Let us consider a sample data as -
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
tc = rep(c('C','D'), 10),
one = rnorm(20,1,1),
two = rnorm(20,2,1),
three = rnorm(20,3,1),
four = rnorm(20,4,1),
five = rnorm(20,5,2),
six = rnorm(20,6,2),
seven = rnorm(20,7,2),
total = rnorm(20,28,3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, lowlimit for 2 sigma outlier calculation. I am doing this by -
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x), uplimit = mean(x)+1.96*sd(x), lowlimit = mean(x)-1.96*sd(x))))), by = .(town,tc)]
This DTnew data.table I am then merging with my DT
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now to come up with the outliers, I am writing separate set of codes for each variable -
DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]
Can someone help simplify this code so that I don't have to write a separate line of code for each outlier column?
In this example we have only 8 variables, but what if we had 100 variables? Would we end up writing 100 lines of code? Can this be done using a for loop? How?
In general, for data.table, how can we add new columns while retaining the original columns? For example, below I am taking the log of columns 3 to 10. If I don't create a new DTlog, it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns in DT as well?
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then, we use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
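# outer() returns a 4 x length(nm) matrix; c() reads it column by column, matching the per-column order of the unlist() output
nm1 <- c(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_"))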
DT[, (nm1):= unlist(lapply(.SD, function(x) { Mean = mean(x)
SD = sd(x)
uplimit = Mean + 1.96*SD
lowlimit = Mean - 1.96*SD
list(Mean, SD, uplimit, lowlimit) }), recursive=FALSE) ,
.(town, tc)]
The second part of the question involves doing a logical comparison between columns. One option would be to subset the initial columns, the 'lowlimit' and 'uplimit' columns separately and do the comparison (as these have the same dimensions) to get a logical output which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"),
with = FALSE] & DT[, nm, with = FALSE] <= DT[,
paste("uplimit", nm, sep="_"), with = FALSE])
DT[,paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]
Or instead of comparing data.tables, we can also use a for loop with set (which would be more efficient)
nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."),
value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] &
DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
}
The 'log' columns can also be created with :=
DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[,
is_outlier := +(abs(value-mean(value)) > 1.96*sd(value))
, by=.(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier] # this is a handy alternative to table()
# is_outlier N
# 1: 0 160
How it works
melt keeps the id columns and stacks all the rest into two columns:
  variable (column names)
  value (column contents)
+x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0
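The two points above can be seen on a tiny table. The table wide and its values are assumptions made up for this sketch.

```r
library(data.table)

# Tiny wide table: one id column and two measurement columns
wide <- data.table(town = c("A", "B"), one = c(1.5, 2.5), two = c(10, 20))

# Stack every non-id column into variable/value pairs
long <- melt(wide, id.vars = "town")
long

# Unary plus coerces a logical vector to integer 1/0
flags <- +(c(TRUE, FALSE, TRUE))
flags
```

Each of the two measurement columns contributes one row per town, so the long table has four rows.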
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[,
paste0(vjs,".out") := lapply(.SD, function(x) +(abs(x-mean(x)) > 1.96*sd(x)))
, by=.(town, tc), .SDcols=vjs]
For completeness, dplyr's mutate_each provides a handy way of tackling such problems (mutate_each and funs have since been deprecated in favour of across):
library(dplyr)
result <- DT %>%
group_by(town,tc) %>%
mutate_each(funs(mean,sd,
uplimit = (mean(.) + 1.96*sd(.)),
lowlimit = (mean(.) - 1.96*sd(.)),
Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
. <= mean(.) + 1.96*sd(.))),
-town,-tc)

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As @DavidArenburg suggests in the comments above, you could use check.names=TRUE in data.table() or fread().
The .SDcols approach returns a copy of the columns you select. Instead, remove the duplicated columns by reference using :=.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
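A slightly larger sketch of the by-reference removal, with a third, non-duplicated column added (an assumption) to show that only the repeated name is dropped. The helper variable dup is introduced here for readability.

```r
library(data.table)

dt <- data.table(x = 1, x = 2, y = 3)

# Positions of names that have already appeared further left
dup <- which(duplicated(names(dt)))
dup

# Drop those columns in place; the first occurrence of each name is kept
dt[, (dup) := NULL]
names(dt)
```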
Different approaches:
1. Indexing
my.data.table <- my.data.table[ ,-2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
4. Optionally, systematically delete the variables whose names meet some criterion (here, we get rid of all variables whose name ends with ".X", X being a number, starting at 1 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d+$", x = names(my.data.table)))
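The suffixes that make.names() appends can be inspected on a plain character vector before touching any table. This base R sketch uses a made-up vector of names, nms.

```r
# Hypothetical column names with repeats
nms <- c("x", "x", "y", "x")

# Repeated names receive a numeric suffix; first occurrences are left alone
unique_nms <- make.names(nms, unique = TRUE)
unique_nms

# Flag the names that end in a dot followed by digits
is_dup <- grepl("\\.\\d+$", unique_nms)
unique_nms[!is_dup]
```

One caveat: the pattern also matches legitimate column names that happen to end in a dot and a number, so it is worth checking the flagged names before deleting anything.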

Resources