R - Compare values in a dataframe to aggregated dataframe

I'm trying to figure out how to compare each row of a data frame against the value that aggregate() returns for that row's group.
For example:
#create data frame
df <- data.frame(team=c('a', 'a', 'b', 'b', 'b', 'c', 'c'),
pts=c(5, 8, 14, 18, 5, 7, 7),
rebs=c(8, 8, 9, 3, 8, 7, 4))
#view data frame
df
team pts rebs
1 a 5 8
2 a 8 8
3 b 14 9
4 b 18 3
5 b 5 8
6 c 7 7
7 c 7 4
#find mean points scored by team
agg_df = aggregate(df$pts, list(df$team), FUN=mean)
Group.1 x
1 a 6.50000
2 b 12.33333
3 c 7.00000
What I want to do is create a new column in df using a logic similar to the following pseudo-code:
df$pts[i] > agg_df$x[i] then df$performance = 'overperformed' else df$performance = 'underperformed'.
But this is not exactly what I want. I want to compare row 1 and 2's points to the mean points for team a in agg_df. Similarly, rows 3-5 in df should be compared to the mean points for group b in agg_df.
The final result would look like:
> df
team pts rebs performance
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average
I am a little puzzled as to how to achieve this, or if it is even achievable, so any help is very much appreciated.

You can do:
library(tidyverse)
df %>%
group_by(team) %>%
mutate(performance = case_when(pts > mean(pts) ~ "over",
pts == mean(pts) ~ "average",
pts < mean(pts) ~ "under")) %>%
ungroup()
which gives:
# A tibble: 7 x 4
team pts rebs performance
<chr> <dbl> <dbl> <chr>
1 a 5 8 under
2 a 8 8 over
3 b 14 9 over
4 b 18 3 over
5 b 5 8 under
6 c 7 7 average
7 c 7 4 average

Or in base R with merge():
# Merge data
db <- merge(df, agg_df, by.x = "team", by.y = 'Group.1')
db$performance <- ifelse(db$pts == db$x, 'average',
ifelse(db$pts > db$x, 'over', 'under'))
db$x <- NULL
db
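Because merge() sorts its result on the key column by default (sort = TRUE), the row order of db is not guaranteed to match df when df is not already sorted by team. A minimal base R sketch that avoids the merge, assuming the same df as in the question, uses ave() to repeat each group's mean on every row:

```r
# Same data as in the question
df <- data.frame(team = c('a', 'a', 'b', 'b', 'b', 'c', 'c'),
                 pts  = c(5, 8, 14, 18, 5, 7, 7),
                 rebs = c(8, 8, 9, 3, 8, 7, 4))

# ave() returns a vector as long as df$pts holding the mean of each row's team
# (FUN defaults to mean)
grp_mean <- ave(df$pts, df$team)

# Compare each row to its own group's mean; original row order is preserved
df$performance <- ifelse(df$pts > grp_mean, "over",
                  ifelse(df$pts < grp_mean, "under", "average"))
df
```

The name grp_mean is only illustrative. The result is a new column on the original df, so no helper column has to be dropped afterwards.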

Related

Create lag onto next group in R

Hi, I would like to create a lag by one whole group on a data frame in R.
Say the value for group A is 11: I would like a lag column in which all rows of group B are 11, and so on. Below is an example of what I would like to do.
Letter = c('A', 'A', 'A', 'B', 'C', 'C', 'D')
Value = c(11, 11, 11, 4, 8, 8, 10)
data.frame(Letter, Value)
Letter Value
1 A 11
2 A 11
3 A 11
4 B 4
5 C 8
6 C 8
7 D 10
And then have it become after the lag:
Lag = c(NA, NA, NA, 11, 4, 4, 8)
data.frame(Letter, Value, Lag)
Letter Value Lag
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8
(One thing to note is all values of the group will be the same)

Get the unique rows of the data, lag Value and then left join the original data frame with that.
library(dplyr)
# DF is the data frame from the question
DF <- data.frame(Letter, Value)
DF %>%
left_join(mutate(distinct(.), Lag = lag(Value), Value = NULL), by = "Letter")
giving:
Letter Value Lag
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8
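The same idea (take one row per group, shift the values by one group, then map them back) can be sketched in base R. This assumes, as the question states, that Value is constant within each Letter and that the rows are already in the desired group order; the names first_rows and prev_value are illustrative only.

```r
Letter <- c('A', 'A', 'A', 'B', 'C', 'C', 'D')
Value  <- c(11, 11, 11, 4, 8, 8, 10)
DF <- data.frame(Letter, Value)

# One row per group, in order of first appearance
first_rows <- DF[!duplicated(DF$Letter), ]

# Shift the per-group values down by one group; the first group has no predecessor
prev_value <- c(NA, head(first_rows$Value, -1))

# Map each row's Letter back to the previous group's value
DF$Lag <- prev_value[match(DF$Letter, first_rows$Letter)]
DF
```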

You can do the following (see below).
We first group by Letter and assign an id to each row within its group.
Next we ungroup and assign the lagged value only to the first row of each group.
The final step is to fill the missing values downward.
NOTE: all of this assumes your data set is already sorted as you need it, so that taking the lag value from the last row of the previous group is correct.
library(tidyverse)
data.frame(Letter, Value) |>
group_by(Letter) |>
mutate(id = 1:n()) |>
ungroup() |>
mutate(Lag = ifelse(id == 1, lag(Value), NA)) |>
fill(Lag) |>
select(-id)
# A tibble: 7 × 3
Letter Value Lag
<chr> <dbl> <dbl>
1 A 11 NA
2 A 11 NA
3 A 11 NA
4 B 4 11
5 C 8 4
6 C 8 4
7 D 10 8

Subsetting dataframe in grouped data

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine the results into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.

Try:
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
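A sketch of a slightly more general variant, assuming dplyr is installed: filter on the within-group row number, with the step held in a variable (n_step is a name introduced here for illustration), then ungroup and sort to match the order shown in the question.

```r
library(dplyr)

my_df <- data.frame(col1 = 1:12, col2 = rep(c("A", "B", "C"), 4))

n_step <- 2  # keep every n_step-th row within each group

my_new_df <- my_df %>%
  group_by(col2) %>%
  filter(row_number() %% n_step == 0) %>%  # row_number() restarts in each group
  ungroup() %>%
  arrange(col2, col1)                      # order by group, as in the desired output

my_new_df
```

filter() keeps the rows in their original order, so the final arrange() is only needed if the result should be ordered by group.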

Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C

Creating a column based on existing column where new column has values plus or minus certain value of old one

I am trying to create a new column whose values are the old column's values plus or minus some fixed number (or left unchanged). For example, my old column is a and the new column is b.
data = data.frame(a = 2:11)
new_data = data.frame(a = 2:11, b = c(1, 4, 5, 5, 6, 8, 9, 8, 11, 12))
new_data
#> a b
#> 1 2 1
#> 2 3 4
#> 3 4 5
#> 4 5 5
#> 5 6 6
#> 6 7 8
#> 7 8 9
#> 8 9 8
#> 9 10 11
#> 10 11 12

data$b <- data$a + sample(c(0, -1, +1), nrow(data), replace = TRUE)
If the fixed number is, say, x, do this:
x <- 1
data$b <- data$a + sample(c(0, -1*x, x), nrow(data), replace = TRUE)
Edit based on requirements stated in the comments: use pmin and pmax to clamp b to between 2 and 11. The seed is fixed only to make the demonstration reproducible.
library(dplyr)
set.seed(19)
data %>% mutate(b = pmin(11, pmax(2, a + sample(-1:1, nrow(.), TRUE)))) %>% pull(b) %>% cat
2 3 4 6 5 7 7 10 9 11
#otherwise
set.seed(19)
data %>% mutate(b = a + sample(-1:1, nrow(.), T))
a b
1 2 1
2 3 3
3 4 4
4 5 6
5 6 5
6 7 7
7 8 7
8 9 10
9 10 9
10 11 12
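The two steps above (random offset, then clamping with pmin and pmax) can be wrapped in a small base R helper. This is only a sketch: the function name jitter_clamped and its arguments are hypothetical, and the default bounds assume you want b to stay within the observed range of a.

```r
# Hypothetical helper: add -x, 0 or +x at random, then clamp to [lower, upper]
jitter_clamped <- function(a, x = 1, lower = min(a), upper = max(a)) {
  offsets <- sample(c(-x, 0, x), length(a), replace = TRUE)
  pmin(upper, pmax(lower, a + offsets))
}

set.seed(1)  # fixed only so the example is reproducible
data <- data.frame(a = 2:11)
data$b <- jitter_clamped(data$a)
data
```

Passing lower = -Inf and upper = Inf turns the clamping off, which corresponds to the unclamped version shown above.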

Expanding a data.frame based on (group) values from the data.frame

Let's say I have the following data frame:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
# A tibble: 2 x 3
user first last
<chr> <dbl> <dbl>
1 A 1 6
2 B 4 9
And want to create a tibble that looks like:
bind_rows(tibble(user = 'A', weeks = 1:6),
tibble(user = 'B', weeks = 4:9))
# A tibble: 12 x 2
user weeks
<chr> <int>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9
How could I go about doing this? I have tried:
tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9)) %>%
group_by(user) %>%
mutate(weeks = first:last)
I wonder if I should try a combination of complete, map, or nest?

One option is to create a sequence per row with map2 and then unnest it:
library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
transmute(user, weeks = map2(first, last, `:`)) %>%
unnest(weeks)
# A tibble: 12 x 2
# user weeks
# <chr> <int>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
# 7 B 4
# 8 B 5
# 9 B 6
#10 B 7
#11 B 8
#12 B 9
Or another option is rowwise
df1 %>%
rowwise() %>%
transmute(user, weeks = list(first:last)) %>%
unnest(weeks)
Or without any packages
stack(setNames(Map(`:`, df1$first, df1$last), df1$user))
Or otherwise written as
stack(setNames(do.call(Map, c(f = `:`, df1[-1])), df1$user))
data
df1 <- tibble(user = c('A', 'B'), first = c(1,4), last = c(6, 9))
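Note that stack() names its output columns values and ind. If you want the columns called user and weeks without any packages, a minimal sketch (assuming first <= last in every row, and using a plain data.frame instead of a tibble) is to repeat each user by the length of its range:

```r
df1 <- data.frame(user = c('A', 'B'), first = c(1, 4), last = c(6, 9))

# Number of weeks covered by each user
len <- df1$last - df1$first + 1

out <- data.frame(
  user  = rep(df1$user, len),                        # repeat each user len times
  weeks = unlist(Map(seq, df1$first, df1$last))      # first:last for each row
)
out
```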

One option involving dplyr and tidyr (with the data stored in df1) could be:
df1 %>%
uncount(last - first + 1) %>%
group_by(user) %>%
transmute(weeks = first + 1:n() - 1)
user weeks
<chr> <dbl>
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 4
8 B 5
9 B 6
10 B 7
11 B 8
12 B 9

lump factor based on another column

The example shows measurements of the production output of different factories,
where the first column denotes the factory
and the last column the amount produced.
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 C 20
9 D 5
Now I want to lump together the factories into fewer levels, based on their total output in this data set.
With the normal forcats::fct_lump, I can lump them by the number of rows in which they appear, e.g. for making 3 levels:
library(tidyverse)
df %>% mutate(factory=fct_lump(factory,2))
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 Other 20
9 Other 5
but I want to lump them based on the sum(production), retaining the top n=2 factories (by total output) and lump the remaining factories. Desired result:
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Any suggestions?
Thanks!

The key here is to choose a specific rule for grouping factories together based on their total production. Note that the right rule depends on the actual values in your (real) dataset.
Option 1
Here's an example that groups together factories that have a sum production equal to 15 or less. If you want another grouping you can modify the threshold (e.g. use 18 instead of 15)
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = F)
library(dplyr)
df %>%
group_by(factory) %>%
mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
I'm creating factory_new without removing the (original) factory column.
Option 2
Here's an example where you rank the factories by their total production, keep a chosen number of top factories as they are, and group the rest.
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = F)
library(dplyr)
# get ranked factories based on sum production
df %>%
group_by(factory) %>%
summarise(SumProd = sum(production)) %>%
arrange(desc(SumProd)) %>%
pull(factory) -> vec_top_factories
# input how many top factories you want to keep
# rest will be grouped together
n = 2
# apply the grouping based on n provided
df %>%
group_by(factory) %>%
mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other

Just specify the weight argument w:
df %>%
mutate(factory = fct_lump_n(factory, 2, w = production))
factory production
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Note: use forcats::fct_lump_n because the generic fct_lump is no longer recommended.

We could use base R as well by creating a logical condition with ave
df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
df$factory_new[i1] <- df$factory[i1]
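The ave() version above uses a fixed threshold. A base R sketch of the top-n variant the question asks for, assuming factory is a character column (stringsAsFactors = FALSE) and with n_keep and top_factories as illustrative names, could use tapply() to get the totals:

```r
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n_keep <- 2  # how many top factories to keep as they are

# Total production per factory, as a named vector
totals <- tapply(df$production, df$factory, sum)

# Names of the n_keep factories with the largest totals
top_factories <- head(names(sort(totals, decreasing = TRUE)), n_keep)

# Keep the top factories, lump everything else into "Other"
df$factory_new <- ifelse(df$factory %in% top_factories, df$factory, "Other")
df
```

Ties between totals are broken by whatever order sort() returns, so this sketch does not treat ties specially.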
