split-apply-combine R - r

I have a data table with several columns, say:
location, which may include Los Angeles, etc.
age_Group, say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money
in certain intervals, say interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity).
How should I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding up the spending of all people belonging to the same city, age group, year and month.

You can first create a new column (spending_cat), for example with the cut function. You can then add the new variable as a grouping variable and simply count:
library(dplyr)
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = TRUE),
                 spending = rnorm(1000))
df %>%
  mutate(spending_cat = cut(spending, breaks = -5:5)) %>%  # bin spending into unit-wide intervals
  group_by(group, spending_cat) %>%
  summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
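The question asks for the interval of the summed spending per location, age group, year and month, so a variant that sums first and bins afterwards may be closer to the desired output. The following is only a sketch: the break points and interval labels are hypothetical (the question elides most of its 20 intervals), and the data frame name spend is made up for illustration.

```r
library(dplyr)

# Sample rows from the question
spend <- data.frame(
  location  = c("LA", "LA", "LA", "NY", "NY", "NY"),
  age_group = c("child", "teen", "teen", "old", "old", "teen"),
  year      = c(2000, 2000, 2000, 2000, 2010, 2020),
  month     = c(1, 1, 10, 11, 2, 3),
  spending  = c(102, 15, 9, 1000, 1000000, 10)
)

# Hypothetical break points and labels; replace with the real 20 intervals
breaks <- c(0, 100, 1000, Inf)
labels <- c("interval_1", "interval_2", "interval_3")

result <- spend %>%
  group_by(location, age_group, year, month) %>%
  # total spending of everyone in the same group
  summarise(total = sum(spending), .groups = "drop") %>%
  # assign each total to its interval; intervals are right-closed by default
  mutate(interval = cut(total, breaks = breaks, labels = labels))

print(result)
```

Note that cut uses right-closed intervals by default, so a total of exactly 1000 falls into (100, 1000]; pass right = FALSE if the intervals should be left-closed instead.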

Related

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset, and I would like to join data_occ to it based on which band the income falls inside. Where two or more observations (incomes) fall within a range, I would like to take the lowest income.
I was attempting to use data.table but was having trouble. I would also like to keep all columns from both data.frames if possible.
The output dataset should have only 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for ease of understanding. You can omit it and simply chain the two pipelines together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

How do I go about filtering my data by the upper 50th percentile for a separate dependent variable?

I need to split my data so that when I use the facet_wrap I have the top 50 percentile for each year.
Here is a sample of my data:
# A tibble: 10,519 x 3
Species Abundance Year
<chr> <dbl> <chr>
1 Astropecten irregularis 2 2009
2 Asterias rubens 14 2009
3 Echinus esculentus 1 2009
4 Pagurus prideaux 1 2009
5 Raja clavata 1 2009
6 Astropecten irregularis 4 2009
7 Asterias rubens 47 2009
8 Henricia sp. 2 2009
9 Ophiura ophiura 8 2009
10 Solaster endeca 1 2009
# ... with 10,509 more rows
My current strategy is this:
Data <- All_years %>%
group_by(Species, Year) %>%
summarise(Abundance = sum(Abundance, na.rm = TRUE)) %>%
filter(quantile(Abundance, 0.50)<Abundance) %>%
filter(Abundance > 50)
The issue is that this gives me the top 50 percentile for the whole set while I would like it to give me the top 50 for each year so I can then display it with a facet_wrap in ggplot.
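One likely cause: after summarise, dplyr by default drops the last grouping variable (here Year), so the quantile in the following filter is no longer computed within each year. A minimal sketch of regrouping by Year before filtering is shown below; the small All_years data frame is made up for illustration, and the extra filter(Abundance > 50) step from the question is left out to keep the example short.

```r
library(dplyr)

# Made-up data with the same columns as All_years in the question
All_years <- data.frame(
  Species   = c("A", "B", "C", "D", "A", "B", "C", "D"),
  Abundance = c(10, 20, 30, 40, 400, 300, 200, 100),
  Year      = rep(c("2009", "2010"), each = 4)
)

Data <- All_years %>%
  group_by(Species, Year) %>%
  # total abundance per species and year, then drop all grouping
  summarise(Abundance = sum(Abundance, na.rm = TRUE), .groups = "drop") %>%
  # regroup so the median is computed separately for each year
  group_by(Year) %>%
  filter(Abundance > quantile(Abundance, 0.50)) %>%
  ungroup()

print(Data)
```

With this made-up data, species C and D are kept for 2009 and species A and B for 2010, so each facet would show its own upper half.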

Calculating the sum of different columns for every observation based on a time variable

Assume the following time series Dataset:
DF <- data.frame(T0=c(2012, 2016, 2014),
T1=c(2017, NA, 2019),
Duration= c(5,3,5),
val12 =c(15,43,7),
val13 =c(16,44,8),
val14 =c(17,45,9),
val15 =c(18,46,10),
val16 =c(19,47,11),
val17 =c(20,48,12),
val18 =c(21,49,13),
val19 =c(22,50,14),
SumVal =c(105,194,69))
print(DF)
T0 T1 Duration val12 val13 val14 val15 val16 val17 val18 val19 SumVal
1 2012 2017 5 15 16 17 18 19 20 21 22 105
2 2016 NA 3 43 44 45 46 47 48 49 50 194
3 2014 2019 5 7 8 9 10 11 12 13 14 69
To build a duration model, I would like to aggregate the "valXX" variables into one SumVal variable according to their duration, as in the table above. The first SumVal (105) corresponds to val12+...+val17, as this is the given time interval (2012-2017) for the first observation.
NAs in T1 indicate that the event of interest has not occurred yet and the observation is censored. In this case the Duration and SumVal will be based on the interval T0:2019.
I am struggling to implement a function in R that can perform this task on a very large dataframe.
Any help would be much appreciated!
Here's a tidyverse approach.
library(tidyverse)
DF %>%
# Track orig rows, and fill in NA T1's
mutate(row = row_number(),
T1 = if_else(is.na(T1), T0 + Duration, T1)) %>%
# Gather into long form
gather(col, value, val12:val19) %>%
# convert column names into years
mutate(year = col %>% str_remove("val") %>% as.numeric + 2000) %>%
# Only keep the rows within each duration
filter(year >= T0 & year <= T1) %>%
# Count total value by row, equiv to
# group_by(row) %>% summarize(SumVal2 = sum(value))
count(row, wt = value, name = "SumVal2")
# A tibble: 3 x 2
row SumVal2
<int> <dbl>
1 1 105
2 2 194
3 3 69
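Since the question mentions a very large dataframe, a base R variant that avoids reshaping to long form may also be worth trying. This is a sketch under two assumptions taken from the question: the valXX suffix is the two-digit year, and censored rows (NA in T1) run to the last available year column. The helper names val_cols, years, end_year, in_window and sum_val are made up for illustration.

```r
DF <- data.frame(T0 = c(2012, 2016, 2014),
                 T1 = c(2017, NA, 2019),
                 val12 = c(15, 43, 7),  val13 = c(16, 44, 8),
                 val14 = c(17, 45, 9),  val15 = c(18, 46, 10),
                 val16 = c(19, 47, 11), val17 = c(20, 48, 12),
                 val18 = c(21, 49, 13), val19 = c(22, 50, 14))

# Value columns and the calendar year each one stands for
val_cols <- grep("^val[0-9]+$", names(DF), value = TRUE)
years    <- 2000 + as.integer(sub("val", "", val_cols))
vals     <- as.matrix(DF[val_cols])

# Censored observations run to the last available year
end_year <- ifelse(is.na(DF$T1), max(years), DF$T1)

# Logical matrix: TRUE where the column's year lies in [T0, end_year]
in_window <- outer(DF$T0, years, "<=") & outer(end_year, years, ">=")

# Zero out values outside each row's window, then sum across columns
sum_val <- rowSums(vals * in_window)
print(unname(sum_val))  # 105 194 69
```

The whole computation is a handful of vectorised matrix operations, so it needs memory for two matrices the size of the value columns but no per-row loop.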

How can I filter out Duplicated Rows per Group

So this is the data I'm working with:
ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000
What I'm trying to do is create a table that shows the amount of value lost that is grouped by Year, State and Grade. That part I have done but the issue is you can see that there is a duplicated row for ID=1. I need to add a component to my code that removes any duplicated rows like it in my data once I have grouped the data by Year, State and Grade.
The reason I want to remove the duplicates after I have grouped the data is that the ID number may duplicate for a different year but that is OK as that is a new observation. I just want to remove any duplicates if the Year, State and Grade match. Basically if the whole row is a duplicate, it should be removed.
I can't tell if I should use unique() or distinct(), but here is what I have so far:
Answer <- data %>%
group_by(Year, State, Grade) %>%
filter(row_number(ID) == 1) %>% #This is the part to replace
summarise(x = sum(Loss) / sum(Total)) %>%
spread(State, x)
The output should look like this:
Year State Grade x
2016 AZ A 0.05
2016 AZ B 0
2016 AZ C 0
2017 AZ A 0
2017 AZ B 0
2017 AZ C 0.1
A few things. Below, I use distinct to remove duplicate rows. Also, your expected results include an entry for grade C in 2016, which isn't in your original data, so I used complete to add this (and any other missing cases) as a zero. Finally, as @akrun notes above: where does 0.00833 come from? Is it a typo, or have I misunderstood the calculation?
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)
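library(dplyr)
library(tidyr)
Answer <- df %>%
  distinct() %>%  # drop rows that are duplicated in every column
  group_by(Year, State, Grade) %>%
  summarise(x = sum(Loss) / sum(Total)) %>%
  ungroup() %>%   # recent tidyr versions refuse to complete() on grouping columns
  complete(Year, State, Grade, fill = list(x = 0))
# # A tibble: 6 x 4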
# Year State Grade x
# <int> <fct> <fct> <dbl>
# 1 2016 AZ A 0.05
# 2 2016 AZ B 0
# 3 2016 AZ C 0
# 4 2017 AZ A 0
# 5 2017 AZ B 0
# 6 2017 AZ C 0.1
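On the unique() versus distinct() question: for dropping rows that are identical in every column, base R's unique() on a data frame does the same job as distinct() with no arguments. A base R sketch of the same calculation follows, without the step that fills in missing combinations; the names dedup and agg are made up for illustration.

```r
df <- read.table(text = "ID Year State Grade Loss Total
1 2016 AZ A 50 1000
1 2016 AZ A 50 1000
2 2016 AZ B 0 5000
3 2017 AZ A 0 2000
4 2017 AZ C 10 100
2 2017 AZ B 0 3000", header = TRUE)

# Drop rows that are duplicated in every column
dedup <- unique(df)

# Sum Loss and Total within each Year/State/Grade combination
agg <- aggregate(cbind(Loss, Total) ~ Year + State + Grade,
                 data = dedup, FUN = sum)

# Loss ratio per group
agg$x <- agg$Loss / agg$Total
print(agg)
```

Because ID is part of the row comparison, the same ID appearing in a different year is kept, which matches the requirement in the question.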

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates:
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table:
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5
