R: Dplyr Lagging Variables after Grouping by Multiple Columns

I want to calculate the score difference after grouping by Year, State, Tier, Group. A stylised representation of my data would look like:
dat2 <- data.frame(
Year = sample(1990:1996, 10, replace = TRUE),
State = sample(c("AL", "CA", "NY"), 10, replace = TRUE),
Tier = sample(1:2),
Group = sample(c("A", "B"), 10, replace = TRUE),
Score = rnorm(10))
I tried mutate with group_by_ and .dots, but the lagged value is taken from the adjacent row of the whole data frame (i.e. the grouping does not seem to take effect). I am mostly interested in plotting the yearly differences (à la time series, even though some years would be NA), so this can be solved either by lagging or by taking the next year's score.
Edit: So, if the dataset looks like:
Year State Tier Group Score
1990 AL 1 A 75
1990 AL 2 A 100
1990 AL 1 B 5
1990 AL 2 B 10
1991 AL 1 A 95
1991 AL 2 A 80
1991 AL 1 B 5
1991 AL 2 B 15
The desired end result would be:
Year State Tier Group Score Diff
1991 AL 1 A 95 20
1991 AL 1 B 5 0
1991 AL 2 A 80 -20
1991 AL 2 B 15 5

If I understand correctly, you want the year-over-year difference in Score within each combination of State, Tier and Group. The data must be sorted chronologically for the difference to make sense. Your example is too small for these combinations to repeat often, but I believe the solution you are looking for is:
library(dplyr)
dat2 %>%
arrange(Year) %>%
group_by(State, Tier, Group) %>%
mutate(ScoreDiff = Score - lag(Score))
With your sample data, the ScoreDiff column is mostly NA because, with only 10 rows, a given combination of State, Tier and Group rarely occurs more than once. You can see the code working on a larger sample (I've also changed the starting year from 1990 to 1890):
n <- 100
dat2 <- data.frame(
Year = sample(1890:1996, n, replace = TRUE),
State = sample(c("AL", "CA", "NY"), n, replace = TRUE),
Tier = sample(1:2),
Group = sample(c("A", "B"), n, replace = TRUE),
Score = rnorm(n))
dat2 %>%
arrange(Year) %>%
group_by(State, Tier, Group) %>%
mutate(ScoreDiff = Score - lag(Score))
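As a check, here is a minimal sketch that applies the same idea to the eight-row table from the question's edit. The object names `dat` and `result` are mine, and the final `filter()` is an assumption about how the 1990 rows (which have no previous year) should be dropped to match the desired output.

```r
library(dplyr)

# The example table from the question's edit
dat <- data.frame(
  Year  = rep(c(1990, 1991), each = 4),
  State = "AL",
  Tier  = rep(c(1, 2), times = 4),
  Group = rep(c("A", "A", "B", "B"), times = 2),
  Score = c(75, 100, 5, 10, 95, 80, 5, 15)
)

result <- dat %>%
  arrange(State, Tier, Group, Year) %>%   # chronological order within each group
  group_by(State, Tier, Group) %>%
  mutate(Diff = Score - lag(Score)) %>%   # previous row within the same group
  ungroup() %>%
  filter(!is.na(Diff))                    # drop the first year of each group

result
```

Note that lag() takes the previous available row in each group, so if a year is missing the difference spans more than one year.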

Related

Deleting duplicated rows based on condition (position)

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and year, e.g. three different Alpha 1970 values. I would like to keep only one observation per id/year, namely the last one that appears for each id/year.
The final dataset should look like this:
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can approach the problem?
Thanks a lot in advance for your help

If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.

How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
This translates to "within each id-Year group, take only the row where the row number is equal to the size of the group, i.e. the last row under the current ordering."
You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct functions take the first distinct row, so you'd need to reverse the row ordering here first.
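The two alternatives mentioned above could look roughly like this. It is only a sketch: the names `via_filter` and `via_distinct` are mine, and reversing the rows with `slice(n():1)` is one of several ways to do it.

```r
library(dplyr)

df <- data.frame(
  id   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
  Val  = c(2, 3, -2, 5, 2, 5, 3, 5)
)

# Option 1: keep the row whose position equals the group size
via_filter <- df %>%
  group_by(id, Year) %>%
  filter(row_number() == n()) %>%
  ungroup()

# Option 2: reverse the rows so that distinct() keeps the last occurrence,
# then restore a readable order
via_distinct <- df %>%
  slice(n():1) %>%
  distinct(id, Year, .keep_all = TRUE) %>%
  arrange(id, Year)
```

Both objects should hold the same five rows as `final` in the question.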

An option with base R
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row
aggregate(Val ~ ., df, head, 1)
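Another base R route, not taken from the answers above but relying only on the documented `fromLast` argument of `duplicated()`, is to drop every row that has a later duplicate of the same id/Year pair. A minimal sketch (the name `last_rows` is mine):

```r
df <- data.frame(
  id   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
  Val  = c(2, 3, -2, 5, 2, 5, 3, 5)
)

# fromLast = TRUE marks all but the LAST occurrence of each id/Year pair as duplicated
last_rows <- df[!duplicated(df[c("id", "Year")], fromLast = TRUE), ]
last_rows
```

Unlike the aggregate() call, this keeps any other columns of the selected row without extra work.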

Create a new variable that is the average of one variable conditional on two other variables (and maintain all other variables in the data set)

Here is a (shortened) sample from a data set I am working on. The sample represents data from an experiment with 2 sessions (session_number), in each session participants completed 5 trials (trial_number) of a hand grip exercise (so, 10 in total; 2 * 5 = 10). Each of the 5 trials has 3 observations of hand grip strength (percent_of_maximum). I want to get the average (below, I call it mean_by_trial) of these 3 observations for each of the 10 trials.
Finally, and this is what I am stuck on, I want to output a data set that is 20 rows long (one row for each unique trial, there are 2 participants and 10 trials for each participant; 2 * 10 = 20), AND retains all other variables. All the other variables (in the example there are: placebo, support, personality, and perceived_difficulty) will be the same for each unique Participant, trial_number, or session_number (see sample data set below).
I have tried this using ddply, which is pretty much what I want, but the new data set does not contain the other variables in the data set (new_dat only contains trial_number, session_number, Participant and the new mean_by_trial variable). How can I maintain the other variables?
#create sample data frame
dat <- data.frame(
Participant = rep(1:2, each = 30),
placebo = c(replicate(15, "placebo"), replicate(15, "control"), replicate(15, "control"), replicate(15, "placebo")),
support = rep(sort(rep(c("support", "control"), 3)), 10),
personality = c(replicate(30, "nice"), replicate(30, "naughty")),
session_number = c(rep(1:2, each = 15), rep(1:2, each = 15)),
trial_number = c(rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3), rep(1:5, each = 3)),
percent_of_maximum = runif(60, min = 0, max = 100),
perceived_difficulty = runif(60, min = 50, max = 100)
)
#this is what I have tried so far
library(plyr)
new_dat <- ddply(dat, .(trial_number, session_number, Participant), summarise, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)
I want new_dat to contain all the variables in dat, plus the mean_by_trial variable. Thank you!

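We can use mutate instead of summarise to add the mean as a column to the full dataset, and then keep one row per group with slice.
library(dplyr)
# plyr functions are called with the plyr:: prefix so that plyr does not need to be attached alongside dplyr
out <- plyr::ddply(dat, plyr::.(trial_number, session_number, Participant),
plyr::mutate, mean_by_trial = mean(percent_of_maximum), .drop = FALSE)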
out %>%
group_by(trial_number, session_number, Participant) %>%
slice(1)
If we use dplyr, then this can all be inside a chain
newdat <- dat %>%
group_by(trial_number, session_number, Participant) %>%
mutate(mean_by_trial = mean(percent_of_maximum)) %>%
slice(1)
head(newdat)
# A tibble: 6 x 9
# Groups: trial_number, session_number, Participant [6]
# Participant placebo support personality session_number trial_number percent_of_maximum perceived_difficulty mean_by_trial
# <int> <fct> <fct> <fct> <int> <int> <dbl> <dbl> <dbl>
#1 1 placebo control nice 1 1 71.5 95.5 73.9
#2 2 control control naughty 1 1 38.9 63.8 67.7
#3 1 control support nice 2 1 97.1 54.2 68.4
#4 2 placebo support naughty 2 1 62.9 86.2 40.4
#5 1 placebo support nice 1 2 49.0 95.8 65.7
#6 2 control support naughty 1 2 80.9 74.6 68.3

Here’s a tidyverse answer. First you want to group_by the variables of interest. Then calculate the desired mean in a new column using mutate.
As the value in the new mean column will be repeated within each group, use the distinct function to retain unique rows. In other words, select a single row for each combination of Participant, session_number, and trial_number.
This is the answer (https://stackoverflow.com/a/39092166/9941764)
provided in: R - dplyr Summarize and Retain Other Columns
new_dat <- dat %>%
group_by(Participant, session_number, trial_number) %>%
mutate(mean = mean(percent_of_maximum)) %>%
distinct(mean, .keep_all = TRUE)
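If the extra variables really are constant within each group, another option (a sketch, not part of the answers above) is to add them to group_by() and use summarise(). The toy data frame `toy` below is hypothetical and much smaller than the one in the question.

```r
library(dplyr)

# Hypothetical toy data: 2 participants x 2 trials x 3 observations
toy <- data.frame(
  Participant        = rep(1:2, each = 6),
  trial_number       = rep(rep(1:2, each = 3), times = 2),
  personality        = rep(c("nice", "naughty"), each = 6),
  percent_of_maximum = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 15, 25, 35)
)

out <- toy %>%
  # personality is constant per participant, so grouping by it does not split any group
  group_by(Participant, trial_number, personality) %>%
  summarise(mean_by_trial = mean(percent_of_maximum), .groups = "drop")

out
```

This only works for columns that do not vary within a trial; a column that varies from row to row would split the groups, in which case the mutate-then-slice or distinct approach is the safer choice.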

Subset a dataframe, calculate the mean and populate a dataframe in a loop in R

I have a set of 85 possible combinations from two variables, one with five values (years) and one with 17 values (locations). I make a dataframe that has the years in the first column and the locations in the second column. For each combination of year and location I want to calculate the weighted mean value and then add it to the third column, according to the year and location values.
My code is as follows:
for (i in unique(data1$year)) {
for (j in unique(data1$location)) {
data2 <- crossing(data1$year, data1$location)
dataname <- subset(data1, year %in% i & location %in% j)
result <- weighted.mean(dataname$length, dataname$raising_factor, na.rm = T)
}
}
The result puts the last calculated mean in the third column of every row.
How can I get it to add according to matching year and location combination?
thanks.

A base R option would be by
by(df[c('x', 'y')], df[c('group', 'year')],
function(x) weighted.mean(x[,1], x[,2]))
Based on @LAP's example

As @A.Suleiman suggested, we can use dplyr::group_by.
Example data:
df <- data.frame(group = rep(letters[1:5], each = 4),
year = rep(2001:2002, 10),
x = 1:20,
y = rep(c(0.3, 1, 1/0.3, 0.4), each = 5))
library(dplyr)
df %>%
group_by(group, year) %>%
summarise(test = weighted.mean(x, y))
# A tibble: 10 x 3
# Groups: group [?]
group year test
<fctr> <int> <dbl>
1 a 2001 2.000000
2 a 2002 3.000000
3 b 2001 6.538462
4 b 2002 7.000000
5 c 2001 10.538462
6 c 2002 11.538462
7 d 2001 14.000000
8 d 2002 14.214286
9 e 2001 18.000000
10 e 2002 19.000000
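For completeness, the loop from the question can also be made to work. The original overwrites a single `result` on every iteration, so only the last mean survives; the sketch below instead writes each mean into the row of the result table it belongs to. The column names follow the question, but the values in `data1` are invented for illustration.

```r
# Hypothetical data using the question's column names
data1 <- data.frame(
  year           = rep(c(2001, 2002), each = 4),
  location       = rep(c("north", "south"), times = 4),
  length         = c(10, 20, 30, 40, 50, 60, 70, 80),
  raising_factor = c(1, 1, 3, 1, 2, 2, 2, 6)
)

# One row per year/location combination, with an empty third column
data2 <- unique(data1[c("year", "location")])
data2$wmean <- NA_real_

for (k in seq_len(nrow(data2))) {
  # rows of data1 that belong to the k-th combination
  rows <- data1$year == data2$year[k] & data1$location == data2$location[k]
  # store the result in row k instead of overwriting a single variable
  data2$wmean[k] <- weighted.mean(data1$length[rows],
                                  data1$raising_factor[rows],
                                  na.rm = TRUE)
}

data2
```

The grouped summarise() above does the same thing with less code, so the loop is mainly useful for seeing where the original went wrong.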

R: Grouping in a hierarchy

I'm working on a dataset with a six-digit grouping system. The first two digits denote the top-level group, the next two denote sub-groups, and the last two denote the specific type within the sub-group. I want to group the data at the top level of the hierarchy (first two digits only) and count unique names in each group.
An example for the GroupID 010203:
01 denotes BMW
02 denotes 3-series
03 denotes 320i (the exact model)
All I care about in this example is how many of each brand there is.
Toy dataset and wanted output:
library(data.table)
# GroupID is stored as character so that leading zeros are preserved
df <- data.table(Quarter = c('Q4', 'Q4', 'Q4', 'Q4', 'Q3'),
GroupID = c('010203', '150503', '010101', '150609', '010000'),
Name = c('AAAA', 'AAAA', 'BBBB', 'BBBB', 'CCCC'))
Output:
Quarter Group Counts
Q3 01 1
Q4 01 2
Q4 15 2

Using data.table we could do:
library(data.table)
df[, Group := substr(GroupID, 1, 2)][
, Counts := .N, by = list(Group, Quarter)][
, head(.SD, 1), by = .(Quarter, Group, Counts)][
, .(Quarter, Group, Counts)]
Returns:
Quarter Group Counts
1: Q4 01 2
2: Q4 15 2
3: Q3 01 1
With dplyr and stringr we could do something like:
library(dplyr)
library(stringr)
df %>%
mutate(Group = str_sub(GroupID, 1, 2)) %>%
group_by(Group, Quarter) %>%
summarise(Counts = n()) %>%
ungroup()
Returns:
# A tibble: 3 x 3
Group Quarter Counts
<chr> <fct> <int>
1 01 Q3 1
2 01 Q4 2
3 15 Q4 2

Since you are already using data.table, you can do:
# pad to six digits first: a numeric GroupID such as 010203 is stored as 10203
df[, Group := substr(sprintf("%06d", as.integer(GroupID)), 1, 2)]
df <- df[,Counts := .N, .(Group,Quarter)][,.(Group, Quarter, Counts)]
df <- unique(df)
print(df)
Group Quarter Counts
1: 01 Q4 2
2: 15 Q4 2
3: 01 Q3 1

Here's my simple solution with plyr and base R; it is lightning fast.
library(plyr)
df$breakid <- substr(df$GroupID, start = 1, stop = 2)
d <- plyr::count(df, c("Quarter", "breakid"))
Result
Quarter breakid freq
Q3 01 1
Q4 01 2
Q4 15 2

Alternatively, using tapply (and data.table indexing):
df$Brand <- substr(df$GroupID, 1, 2)
tapply(df$Brand, df[, .(Quarter, Brand)], length)
(If you don't care about the output being a matrix).
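Two details are easy to miss here: a numeric literal such as 010203 is stored as 10203, so its first two characters are "10" rather than "01", and counting rows is not the same as counting unique names when a name repeats within a group. The base R sketch below pads the ID back to six digits and counts distinct names; the padding via sprintf() and the object name `counts` are my additions, not part of the answers above.

```r
df <- data.frame(
  Quarter = c("Q4", "Q4", "Q4", "Q4", "Q3"),
  GroupID = c(010203, 150503, 010101, 150609, 010000),  # numeric: leading zeros are lost
  Name    = c("AAAA", "AAAA", "BBBB", "BBBB", "CCCC")
)

# Restore the six-digit form, then take the two top-level digits
df$Group <- substr(sprintf("%06d", as.integer(df$GroupID)), 1, 2)

# Count distinct names per Quarter and top-level group
counts <- aggregate(Name ~ Quarter + Group, data = df,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "Counts"
counts
```

If GroupID is already stored as a character string, the sprintf() step is unnecessary and substr() can be applied directly.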

Summarise unique combinations in data frame

In the example dataset below I need to find the number of unique customers per product summarised per year. The output has to be a data.frame with the headers: year - product - number of customers
Thanks for your help.
year <- c("2009", "2010")
product <- c("a", "b", "c")
df <- data.frame(customer = sample(letters, 50, replace = T),
product = sample(product, 50, replace = T),
year = sample(year, 50, replace = T))

With aggregate() (in the included-with-R stats package):
agdf<-aggregate(customer~product+year,df,function(x)length(unique(x)))
agdf
# product year customer
#1 a 2009 7
#2 b 2009 8
#3 c 2009 10
#4 a 2010 7
#5 b 2010 7
#6 c 2010 6

Using plyr's summarise:
require(plyr)
ddply(df, .(product, year), summarise, customers=length(unique(customer)))
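A dplyr equivalent, sketched here with a small deterministic data frame instead of the random sample from the question, uses n_distinct() inside summarise(). The data values are hypothetical.

```r
library(dplyr)

# Hypothetical, deterministic stand-in for the sampled data
df <- data.frame(
  customer = c("a", "b", "a", "c", "c", "d"),
  product  = c("x", "x", "x", "y", "y", "y"),
  year     = c("2009", "2009", "2010", "2009", "2009", "2010")
)

out <- df %>%
  group_by(year, product) %>%
  summarise(customers = n_distinct(customer), .groups = "drop")

out
```

Customer "c" buys product "y" twice in 2009 but is counted once, which is the difference between n_distinct() and a plain row count.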
