Extract grouped subset with condition - r

I have following data structure:
Group Count Value
1 1 1000
1 10 2000
2 6 1000
2 7 2000
Each row has a group, a count and a data value. Now I only want those rows where Count > 0.25 * sum(Count) of the group.
For example, group 1 has sum(Count) = 11, so the first row (Count = 1) should not be included in the result.
Result should look like:
Group Count Value
1 10 2000
2 6 1000
2 7 2000
How can I do this in R?
Additionally, my dataset has around 5 million rows, so please consider performance.

With the sample data
dd<-read.table(text="Group Count Value
1 1 1000
1 10 2000
2 6 1000
2 7 2000", header=T)
you can do this with base R
subset(dd, Count>.25*ave(Count, Group, FUN=sum))
or the dplyr library
library(dplyr)
dd %>% group_by(Group) %>% filter(Count > .25 * sum(Count))
perhaps you'll find one more readable. Both return
Group Count Value
2 1 10 2000
3 2 6 1000
4 2 7 2000
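Given the ~5-million-row dataset, the data.table package is also worth benchmarking. A minimal sketch (not from the answer above; treat it as an untested alternative) on the same sample data:

```r
library(data.table)

dd <- data.table(Group = c(1, 1, 2, 2),
                 Count = c(1, 10, 6, 7),
                 Value = c(1000, 2000, 1000, 2000))

# .I[...] collects, per group, the row indices whose Count exceeds
# a quarter of the group's total; indexing dd with them keeps those rows.
res <- dd[dd[, .I[Count > 0.25 * sum(Count)], by = Group]$V1]
res
```

data.table's grouped operations tend to scale better than base R at this size, but it is worth timing all three approaches on the real data.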

Related

How to make multiple plots using data in one column in R?

Lets say I have the following data frame:
ID amount_ID timespan change
3 1 20 2
3 2 40 3
3 3 60 6
3 4 80 4
3 5 100 5
9 1 25 1
9 2 50 -2
9 3 75 0
9 4 100 -1
3 1 33.33 4
3 2 66.67 8
3 3 100 7
9 1 33.33 1
9 2 66.67 3
9 3 100 4
I want to make 2 plots with this data, one for ID 3 and one for ID 9. The timespan should be on the x-axis and the change on the y-axis. As you can see, the maximum of the x-axis is 100 per ID. But I want to make a graph where the change is the average of all the previous changes from that same ID. So essentially I need to add up all the changes per timespan per individual ID and divide that by the number of times that particular ID is present. The problem is that the timespan can differ within a particular ID (here ID 3 first has 5 amounts and then 3 amounts; ID 9 first has 4 amounts and then 3 amounts).
Here is a visual example
I hope you can help me!!! Thanks!
We can use cummean to calculate the running average at each timespan. Facets are also useful to show each ID in a separate plot.
library(ggplot2)
library(dplyr)
df %>%                                    # df: the data frame from the question
  arrange(ID, timespan) %>%
  group_by(ID) %>%
  mutate(change = cummean(change)) %>%    # running mean of change per ID
  ggplot(aes(timespan, change)) +
  geom_line() +
  facet_wrap(~ ID, scales = "free_y")
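As a quick standalone check of what cummean does (illustrative, not part of the original answer): it is the running mean, equivalent to cumsum(x) / seq_along(x).

```r
library(dplyr)

x <- c(2, 3, 6, 4, 5)        # the change values of ID 3's first series
running_avg <- cummean(x)    # running mean after each observation
running_avg                  # 2.0 2.5 3.666667 3.75 4.0
```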

gather() per grouped variables in R for specific columns

I have a long data frame with the decisions of players who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <- data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format per group, but only for some of these variables (in this example specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with the columns for the players' contributions, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
wide <- pivot_wider(df, id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")],
                    wide, by = "group_id")
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
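One detail: the question asked for columns named like player_1_contribution, while the code above produces player_contribution_1. If the exact names matter, names_glue (available in tidyr >= 1.0) can replace names_prefix; a sketch on the same generated data:

```r
library(tidyr)
library(dplyr)

df <- data.frame(group_id = c(rep(1, 3), rep(2, 3)),
                 player_id = rep(seq(1, 3), 2),
                 player_decision = seq(10, 60, 10),
                 player_contribution = seq(6, 1, -1))

# names_glue builds each new column name from the names_from values
wide <- pivot_wider(df, id_cols = group_id,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_glue = "player_{player_id}_contribution")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")],
                    wide, by = "group_id")
names(answer)
```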

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to start a new group whenever the diff exceeds a certain value. I tried the code below; however, it does not work, since the dialog variable is not yet available inside the mutate() call that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
  mutate(t_diff = c(NA, diff(time))) %>%
  # This generates an error, as dialog is not available as a variable at this point
  mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group. Thereby, the groups are distinguished at points at which the diff to the previous element is larger than 500.
Unfortunately, I have not found a clever workaround to achieve this in an efficient way using dplyr. Obviously, iterating over the data.frame with a loop would work, but would be very inefficient.
Is there a way to achieve this in dplyr?
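The question is left open here; a common workaround (a sketch, not an answer from the thread) is the cumsum trick: mark the rows where a new group starts, take the cumulative sum of those marks as a group id, then pick the first id per group.

```r
library(dplyr)

df <- data.frame(time = c(1, 2, 3, 4, 5, 510, 511, 512, 513), id = 1:9)

df <- df %>%
  mutate(t_diff = c(NA, diff(time)),
         # a new dialog starts at row 1 and wherever the gap is >= 500
         grp = cumsum(is.na(t_diff) | t_diff >= 500)) %>%
  group_by(grp) %>%
  mutate(dialog = first(id)) %>%   # first id of each group
  ungroup() %>%
  select(-grp)
```

This avoids referring to dialog inside the mutate() that creates it, and stays vectorised, so it should be far faster than a row-by-row loop.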

RStudio: Summation of two datasets of different lengths

I have two datasets A and B:
Dataset A (called Sales) has the following data:
ID Person Sales
1 1 100
2 2 300
3 3 400
4 4 200
5 5 50
Dataset B (called Account_Scenarios) has the following data (Note- there are a lot more rows in dataset B I have just included the first 6):
ID Scenario Person Upkeep
1 1 1 -10
2 1 2 -200
3 2 1 -150
4 3 4 -50
5 3 3 -100
6 4 5 -500
I want to add a column called 'Profit' in dataset B such that I am able to see the profit per person per scenario (Profit = Sales + Upkeep). For example as below:
ID Scenario Person Upkeep Profit
1 1 1 -10 90
2 1 2 -200 100
3 2 1 -150 -50
4 3 4 -50 150
5 3 3 -100 300
6 4 5 -500 -450
What is the best way to do this? I am new to R and tried to use an aggregate function, but it requires the arguments to be the same length:
Account_Scenarios$Profit <- aggregate(Sales[,c('Sales')], Account_Scenarios[,c('Upkeep')], by=list(Sales$Person), 'sum')
Assuming that Sales$Person has only unique values, you can:
Account_Scenarios$Profit <- Account_Scenarios$Upkeep +
  Sales$Sales[sapply(Account_Scenarios$Person, function(x) which(Sales$Person == x))]
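The sapply/which lookup above can also be written with base R's match(), which performs the same person-by-person lookup in one vectorised call (a sketch using the question's data):

```r
Sales <- data.frame(Person = 1:5, Sales = c(100, 300, 400, 200, 50))
Account_Scenarios <- data.frame(Scenario = c(1, 1, 2, 3, 3, 4),
                                Person   = c(1, 2, 1, 4, 3, 5),
                                Upkeep   = c(-10, -200, -150, -50, -100, -500))

# match() returns, for each Person in B, its row position in Sales
Account_Scenarios$Profit <- Account_Scenarios$Upkeep +
  Sales$Sales[match(Account_Scenarios$Person, Sales$Person)]
```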
I would left_join the two datasets on the Person variable, then calculate the profit:
library(tidyverse)
A <- A %>% select(Person, Sales)  # only the two variables needed for the join
df <- left_join(B, A, by = "Person") %>%
  mutate(Profit = Sales + Upkeep)
Another solution uses the sqldf library (a SQL-style join):
library(sqldf)
A <- data.frame(Person=1:5, Sales=c(100,300,400,200,50))
B <- data.frame(Scenario=c(1,1,2,3,3,4), Person=c(1,2,1,4,3,5), Upkeep=c(-10,-200,-150,-50,-100,-500))
B <- sqldf("SELECT B.*, A.Sales + B.Upkeep as Profit FROM B JOIN A on B.Person = A.Person")

Eliminate observations with the same id that do not actually correspond in R

I am using a national survey to run an econometric analysis in R.
The df is based on a survey conducted every two years: some families have been interviewed more than once, while others appear just one time.
The variable family represents the code number of the family, the variable nord the code number of the member of the family in a given year; the variable nordp represents the code number that the individual had in the previous survey. So when individuals are interviewed more than once, nord and nordp should be the same, but in practice that is not always true.
I need to filter the df so that it keeps only the individuals that appear more than once:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
Then I assign a unique id value to each individual with this command (in different years I have the same id for the same couple of nquest and nord):
df <- transform(df, id = as.numeric(interaction(nquest, nord)))
The problem is that sometimes the data were entered incorrectly, so that in one year the same individual (identified by the same nquest and nordp) is actually not the same person; for example, look at the two lines marked with **: they have the same nquest and nordp, and so the same id, but they are not the same person (nord is not the same, and sex is also different).
year id nquest nord nordp sex
**2000 1 10 1 1 F**
2000 2 20 1 1 M
2000 3 30 1 1 M
2002 1 10 1 1 F
2002 2 20 1 1 M
2002 4 40 1 1 F
**2004 1 10 2 1 M**
2004 2 20 1 1 M
2004 3 30 1 1 M
So my problem is to eliminate the observations that are not really the same person, using sex as a check variable; consider that the df has more than 50k observations, so I can't check each id by hand.
Thank you in advance
You could do
unique_df <- unique(df[, c("id", "nquest", "nordp", "sex")])
unique_df$id[duplicated(unique_df$nquest)]
This returns the ids with multiple different sex annotations.
With summarise_each and n_distinct from dplyr you could do:
library("dplyr")
DF <- read.table(text="year id nquest nord nordp sex
2000 1 10 1 1 F
2000 2 20 1 1 M
2000 3 30 1 1 M
2002 1 10 1 1 F
2002 2 20 1 1 M
2002 4 40 1 1 F
2004 1 10 2 1 M
2004 2 20 1 1 M
2004 3 30 1 1 M", header=TRUE, stringsAsFactors=FALSE)
summaryDF <- DF %>%
  group_by(id) %>%
  summarise_each(funs(n_distinct), everything(), -year, -id) %>%
  filter(sex > 1 & nord > 1 & nquest == 1 & nordp == 1) %>% # filter conditions on the resulting data.frame
  as.data.frame()
summaryDF
#   id nquest nord nordp sex
# 1  1      1    2     1   2
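Note that summarise_each() and funs() are deprecated in current dplyr; an equivalent with across() (assuming dplyr >= 1.0, and reading the table without the ** emphasis markers) would be:

```r
library(dplyr)

DF <- read.table(text = "year id nquest nord nordp sex
2000 1 10 1 1 F
2000 2 20 1 1 M
2000 3 30 1 1 M
2002 1 10 1 1 F
2002 2 20 1 1 M
2002 4 40 1 1 F
2004 1 10 2 1 M
2004 2 20 1 1 M
2004 3 30 1 1 M", header = TRUE, stringsAsFactors = FALSE)

# count distinct values of each check column per individual, then keep
# the ids whose sex and nord conflict across surveys
summaryDF <- DF %>%
  group_by(id) %>%
  summarise(across(c(nquest, nord, nordp, sex), n_distinct)) %>%
  filter(sex > 1 & nord > 1 & nquest == 1 & nordp == 1) %>%
  as.data.frame()
summaryDF
```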
