R data.table - aggregate partially within group and perform operation - r

Is there a nice way to make a sub-group within a grouping column in data.table operations?
The result I would like is the output from this:
dt <- data.table(
group = c("a","a","a","b","b","b","c","c"),
value = c(1,2,3,4,5,6,7,8)
)
dt[group!="a", group:="Other"][, sum(value), by=.(group)][]
which gives
group V1
a 6
Other 30
However, this alters the original data.table. I don't know whether there is a different way to do this that wouldn't involve merging two data.tables. I can also imagine a more complicated use case where I want group %in% c("a","b") as one sub-group, group %in% c("c","d") as another, and so on.

I think this is like a SQL right excluding join (using the terminology here)
You can iterate over the groups and, within each group, perform an anti-join against the full table:
#group no longer found in .SD, hence make a copy of the column
dt[, g:=group]
#go through each group, anti-join with other groups, aggregate value
dt[, .(
sumGrpVal=sum(value),
sumNonGrpVal=dt[!.SD, sum(value), on=c("group"="g")]
), by=.(group)]
or an even faster way:
dt[, .(
sumGrpVal=sum(value),
sumNonGrpVal=dt[group!=.BY$group, sum(value)]
), by=.(group)]
output:
group sumGrpVal sumNonGrpVal
1: a 6 30
2: b 15 21
3: c 15 21
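Both variants above report per-group totals next to the total of everything else. For the goal stated in the question (collapsing every group other than "a" into "Other" without modifying dt), one possible sketch is to compute the grouping key inside by, so that nothing is assigned by reference. The names lookup and subgroup, and the labels "ab" and "cd", are assumptions made for illustration and do not come from the original post.

```r
library(data.table)

dt <- data.table(
  group = c("a", "a", "a", "b", "b", "b", "c", "c"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Group by an expression evaluated on the fly; dt itself is not modified
res <- dt[, .(V1 = sum(value)), by = .(group = ifelse(group == "a", "a", "Other"))]
print(res)

# For several sub-groups, map each original group to a label through a
# named lookup vector (hypothetical labels, chosen only for illustration)
lookup <- c(a = "ab", b = "ab", c = "cd", d = "cd")
res2 <- dt[, .(V1 = sum(value)), by = .(subgroup = unname(lookup[group]))]
print(res2)
```

Because the key is built inside by, dt still holds its original three groups afterwards, and the lookup vector can be extended to any number of sub-groups.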

Related

How do I write back results of a count query to a column in R?

I would like to count the instances of an Employee ID in a column and write the results back to a new column in my data frame. So far I am able to count the instances and display the results in the RStudio console, but I'm not sure how to write the results back. Here is what I have tested successfully:
ids<-BAR$`Employee ID`
counts<-data.frame(table(ids))
counts
And here are the returned results:
1 00000018 1
2 00000179 1
3 00001045 1
4 00002729 1
5 00003095 2
6 00003100 1
Thanks!
If we need to create a column, use add_count
library(dplyr)
BAR1 <- BAR %>%
add_count(`Employee ID`)
table returns the summarised output. If we want to create a column in the original data, index the table by each row's ID:
BAR$n <- as.integer(table(ids)[as.character(BAR$`Employee ID`)])
If you use a data.table you will be able to do this quickly, especially with larger datasets, using .N to count number of occurrences per grouping variable given in by.
# Load data.table
library(data.table)
# Convert data to a data.table
setDT(BAR)
# Count and assign counts per level of ID
BAR[, count := .N, by = `Employee ID`]
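As a small self-contained illustration of the .N approach, the sketch below builds a toy stand-in for BAR (the IDs are made up for the example) and writes the per-ID count back as a column.

```r
library(data.table)

# Toy stand-in for BAR; the IDs below are illustrative only
BAR <- data.table(`Employee ID` = c("00000018", "00003095", "00000179", "00003095"))

# .N is the number of rows in the current group; := writes it back as a
# new column, repeated on every row of that group
BAR[, count := .N, by = `Employee ID`]
print(BAR)
```

The row order of BAR is preserved, and no separate counts table has to be merged back in.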

Select groups with more than one distinct value per group [duplicate]

This question already has answers here:
Select groups with more than one distinct value
(3 answers)
Closed 7 years ago.
I have data like below:
ID category class
1 a m
1 a s
1 b s
2 a m
3 b s
4 c s
5 d s
I want to subset the data by only including those "ID" which have several (> 1) different categories.
My expected output:
ID category class
1 a m
1 a s
1 b s
Is there a way to do so?
I tried
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(category, class) > 1)
But it gave me an error:
# Error: expecting a single value
Using data.table
library(data.table) #see: https://github.com/Rdatatable/data.table/wiki for more
setDT(df) #convert to native 'data.table' type by reference
df[ , if(uniqueN(category) > 1) .SD, by = ID]
uniqueN is data.table's fast native equivalent of length(unique()), and .SD is the subset of the data for the current group (all columns except the by column; in more general cases it can be restricted to a subset of columns with the .SDcols argument). So the middle argument (j, the column selection argument) says to return all rows and columns associated with an ID for which there are at least two distinct values of category.
Use the by argument of uniqueN to extend this to a case involving counts of multiple columns.
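A minimal sketch of both cases, using the data from the question. The object names one_col and two_col are assumptions for illustration; the second call passes the group's .SD to uniqueN together with a by vector of column names, so that distinct (category, class) combinations are counted.

```r
library(data.table)

df <- data.table(
  ID       = c(1, 1, 1, 2, 3, 4, 5),
  category = c("a", "a", "b", "a", "b", "c", "d"),
  class    = c("m", "s", "s", "m", "s", "s", "s")
)

# Keep IDs with more than one distinct category
one_col <- df[, if (uniqueN(category) > 1) .SD, by = ID]

# Keep IDs with more than one distinct (category, class) combination;
# uniqueN() accepts a data.table plus a `by` vector of column names
two_col <- df[, if (uniqueN(.SD, by = c("category", "class")) > 1) .SD, by = ID]

print(one_col)
print(two_col)
```

With this data both calls keep only the three rows for ID 1, since every other ID has a single row.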

How to summarize a data frame into a new one that tells means of separate levels? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 7 years ago.
I have a data.frame that looks somewhat like this.
k <- data.frame(id = c(1,2,2,1,2,1,2,2,1,2), act = c('a','b','d','c','d','c','a','b','a','b'), var1 = 25:34, var2= 74:83)
I have to group the data into separate levels by the first 2 columns and write the mean of the next 2 columns (var1 and var2). It should look like this
id act varmean1 varmean2
1 1 a
2 1 c
3 2 a
4 2 b
5 2 d
The values of respective means are filled in varmean1 and varmean2.
My actual dataframe has 88 columns where I have to group the data into separate levels by the first 2 columns and find the respective means of the remaining. Please help me figure this out as soon as possible. Please try to use 'dplyr' package for the solution if possible. Thanks.
You have several options:
base R:
aggregate(. ~ id + act, k, mean)
or
aggregate(cbind(var1, var2) ~ id + act, k, mean)
The first option aggregates all the columns by id and act; the second aggregates only the columns you specify. In this case both give the same result, but the distinction matters when you have more columns and only want to aggregate some of them.
dplyr:
library(dplyr)
k %>%
group_by(id, act) %>%
summarise(across(everything(), mean)) # dplyr >= 1.0; older versions used summarise_each(funs(mean))
If you want to specify the columns for which to calculate the mean, name them explicitly in summarise:
k %>%
group_by(id, act) %>%
summarise(varmean1 = mean(var1), varmean2 = mean(var2))
data.table:
library(data.table)
setDT(k)[, lapply(.SD, mean), by = .(id, act)]
If you want to specify the columns for which to calculate the mean, you can add .SDcols like:
setDT(k)[, lapply(.SD, mean), by = .(id, act), .SDcols=c("var1", "var2")]
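To sanity-check any of these approaches, it can help to compare one group against a hand calculation. The sketch below uses only base R and the example data from the question; the names means and manual are illustrative.

```r
k <- data.frame(
  id   = c(1, 2, 2, 1, 2, 1, 2, 2, 1, 2),
  act  = c("a", "b", "d", "c", "d", "c", "a", "b", "a", "b"),
  var1 = 25:34,
  var2 = 74:83
)

# One row per observed (id, act) combination, with the mean of every other column
means <- aggregate(. ~ id + act, data = k, FUN = mean)
print(means)

# Spot-check one group by hand: id 1 / act "a" covers rows 1 and 9
manual <- colMeans(k[k$id == 1 & k$act == "a", c("var1", "var2")])
print(manual)
```

The example data contains five distinct (id, act) combinations, so the aggregated result has five rows regardless of which of the three packages produces it.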

How to drop columns from data frame with less than 2 unique levels in R

I have a dataset with numeric and categorical variables and ~200,000 rows, but many variables are constants (both numeric and categorical). I am trying to create a new dataset in which every variable with length(unique(data.frame$factor)) <= 1 is dropped.
Example data set and attempts so far:
Temp=c(26:30)
Feels=c("cold","cold","cold","hot","hot")
Time=c("night","night","night","night","night")
Year=c(2015,2015,2015,2015,2015)
DF=data.frame(Temp,Feels,Time,Year)
I would think a loop would work, but something isn't working in my 2 below attempts. I've tried:
for (i in unique(colnames(DF))){
Reduced_DF <- DF[,(length(unique(DF$i)))>1]
}
But I really need a vector of the colnames where length(unique(DF$columns))>1, so I tried the below instead, to no avail.
for (i in unique(DF)){
if (length(unique(DF$i)) >1)
{keepvars <- c(DF$i)}
Reduced_DF <- DF[keepvars]
}
Does anyone out there have experience with this type of subsetting/dropping of columns with less than a certain level count?
You can find out how many unique values are in each column with:
sapply(DF, function(col) length(unique(col)))
# Temp Feels Time Year
# 5 2 1 1
You can use this to subset the columns:
DF[, sapply(DF, function(col) length(unique(col))) > 1]
# Temp Feels
# 1 26 cold
# 2 27 cold
# 3 28 cold
# 4 29 hot
# 5 30 hot
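One caveat with the one-line subset: if only a single column survives, DF[, cond] returns a plain vector rather than a data frame. A minimal sketch of a wrapper that guards against this with drop = FALSE follows; the helper name drop_constant is hypothetical and not part of the original answer.

```r
DF <- data.frame(
  Temp  = 26:30,
  Feels = c("cold", "cold", "cold", "hot", "hot"),
  Time  = rep("night", 5),
  Year  = rep(2015, 5)
)

# Hypothetical helper: keep only columns with more than one distinct value.
# drop = FALSE keeps the result a data.frame even if a single column survives.
drop_constant <- function(d) {
  keep <- vapply(d, function(col) length(unique(col)) > 1, logical(1))
  d[, keep, drop = FALSE]
}

reduced <- drop_constant(DF)
print(names(reduced))

# With one varying column the result is still a data.frame, not a bare vector
single <- drop_constant(DF[, c("Temp", "Time")])
print(class(single))
```

vapply is used instead of sapply here so that the result is guaranteed to be a logical vector, even for a data frame with zero columns.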
Another way with data.table
#Convert object to data.table object
library(data.table)
setDT(DF)
#Drop columns
todrop <- names(DF)[which(sapply(DF,uniqueN)<2)]
DF[, (todrop) := NULL]
One advantage of this method is that it does not make a copy, which can matter when you have as many rows as you do.
If you are using an older version of data.table that lacks uniqueN (such as 1.9.4), you would change it to the following:
#Drop columns
todrop <- names(DF)[which(sapply(DF, function(x) length(unique(x)) < 2))]
DF[, (todrop) := NULL]
Another possible solution, this time in Python with pandas rather than R, drops the categorical columns in two lines: the first builds the list of categorical column names and the second drops them. Here df is the data frame and df.categorical is taken to be its categorical column:
cols = pd.DataFrame(df.categorical).columns
df = df.drop(cols, axis=1)

Get previous observations in a data frame in R

I want to add a new column to a data frame that contains the (i-1)th value. I can do this in a for loop, but I would like to know if there is a more straightforward way to do it. I would also like to do it for other lags.
Example:
Price PrevPrice
23 NA
24 23
25 24
35 25
You can either do
library(dplyr)
mutate(df, PrevPrice=lag(Price))
Or
df$PrevPrice <- c(NA, df$Price[-nrow(df)])
If you have multiple columns to lag, another option is data.table, where you can use shift (see ?shift). By default, the type is lag. For multiple columns, specify the column indices (for example, the first 2 columns here) in .SDcols.
library(data.table) #data.table_1.9.5
setDT(df)[, paste0(names(df)[1:2], 'lag') := shift(.SD), .SDcols=1:2]
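For the "other lags" part of the question, shift also accepts a vector of lags through its n argument. A minimal sketch, assuming a single Price column as in the example; the Price_lag column names are made up for illustration.

```r
library(data.table)

df <- data.table(Price = c(23, 24, 25, 35))

# shift() accepts a vector of lags in `n` and returns one vector per lag,
# so several lag columns can be assigned in a single call
lags <- 1:2
df[, paste0("Price_lag", lags) := shift(Price, n = lags)]
print(df)
```

Each lag k leaves the first k rows as NA, matching the PrevPrice column in the example for k = 1.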
