combine/merge rows with dplyr - r

I have data that looks like
Month Location Money
1 Miami 12
1 Cal 15
2 Miami 5
2 Cal 3
...
12 Miami 6
12 Cal 8
I want to transform it so it looks like
Month Location Money
Spring Miami sum(Money for Month = 1, 2, 3)
Spring Cal sum(Money for Month = 1, 2, 3)
summer...
summer...
fall...
fall...
winter...
winter...
I don't know how to ask the question directly (merging rows? aggregating rows?), but googling it only returns dplyr::group_by and summarize, which collapse the rows based on a single value of a column.
I want to collapse/summarise the data based on multiple row values.
Is there an easy way? Any help would be appreciated. Thanks!

It sounds like you want to
assign season to each record,
group_by season,
summarize.
If that is what you want, you can create a new season column first, compute the season directly inside group_by, or build a separate table that maps month to season and left_join it to your data.
library(dplyr)
## simulate data
df = tibble(
month = rep(1:12, each = 4),
location = rep(c("Cal", "Miami"), times = 24),
money = as.integer(runif(48, 10, 100 ))
)
head(df)
# # A tibble: 6 x 3
# month location money
# <int> <chr> <int>
# 1 1 Cal 69
# 2 1 Miami 84
# 3 1 Cal 38
# 4 1 Miami 44
# 5 2 Cal 33
# 6 2 Miami 64
## Create season based on month in groups of 3
df %>%
mutate(season = (month-1) %/% 3 +1) %>%
group_by(season, location) %>%
summarize(Monthly_Total = sum(money))
# # A tibble: 8 x 3
# # Groups: season [4]
# season location Monthly_Total
# <dbl> <chr> <int>
# 1 1 Cal 360
# 2 1 Miami 265
# 3 2 Cal 392
# 4 2 Miami 380
# 5 3 Cal 348
# 6 3 Miami 278
# 7 4 Cal 358
# 8 4 Miami 411
Using the same data you can skip the column creation and include it in group_by:
df %>%
group_by(season = (month-1) %/% 3 +1, location) %>%
summarize(Monthly_Total = sum(money))
## results identical to above.
It may make more sense to just create a season table:
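seasons = tibble(
  month = 1:12,
  ## order matches the question: months 1-3 are Spring, then Summer, Fall, Winter
  season = rep(c("Spring", "Summer", "Fall", "Winter"), each = 3)
)
df %>%
  left_join(seasons, by = "month") %>%
  group_by(season, location) %>%
  summarize(Monthly_Total = sum(money))
## same totals as above, labelled by season name (rows sort alphabetically by season)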
The latter has the advantage of being more transparent.
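If you want season names without a separate lookup table, one option is to turn the computed season number into a labelled factor. The sketch below is only an illustration under stated assumptions: the data set df_demo and its values are made up, and the mapping of months 1-3 to Spring follows the question rather than any calendar convention.

```r
library(dplyr)

# Made-up, deterministic data: one row per month and location
df_demo <- tibble(
  month    = rep(1:12, times = 2),
  location = rep(c("Cal", "Miami"), each = 12),
  money    = c(1:12, rep(10, 12))
)

season_totals <- df_demo %>%
  # (month - 1) %/% 3 + 1 gives 1..4; the factor attaches names in that order
  mutate(season = factor((month - 1) %/% 3 + 1,
                         levels = 1:4,
                         labels = c("Spring", "Summer", "Fall", "Winter"))) %>%
  group_by(season, location) %>%
  summarize(total = sum(money), .groups = "drop")

season_totals
```

Because season is a factor, the rows come out in Spring, Summer, Fall, Winter order rather than alphabetically.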

You could aggregate after transforming the Month variable:
aggregate(Money ~ Month + Location, transform(data, Month = (Month - 1) %/% 3), sum)

Related

Using dplyr - how can I create a new category for one column when another column has duplicates?

I have a dataframe of coordinates for different studies that have been conducted. The studies are either experiment or observation however at some locations both experiment AND observation occur. For these sites, I would like to create a new study category called both. How can I do this using dplyr?
Example Data
df1 <- data.frame(matrix(ncol = 4, nrow = 6))
colnames(df1)[1:4] <- c("value", "study", "lat","long")
df1$value <- c(1,1,2,3,4,4)
df1$study <- rep(c('experiment','observation'),3)
df1$lat <- c(37.541290,37.541290,38.936604,29.9511,51.509865,51.509865)
df1$long <- c(-77.434769,-77.434769,-119.986649,-90.0715,-0.118092,-0.118092)
df1
value study lat long
1 1 experiment 37.54129 -77.434769
2 1 observation 37.54129 -77.434769
3 2 experiment 38.93660 -119.986649
4 3 observation 29.95110 -90.071500
5 4 experiment 51.50986 -0.118092
6 4 observation 51.50986 -0.118092
Note that the value above is duplicated when study has experiment AND observation.
The ideal output would look like this
value study lat long
1 1 both 37.54129 -77.434769
2 2 experiment 38.93660 -119.986649
3 3 observation 29.95110 -90.071500
4 4 both 51.50986 -0.118092
We can set study to 'both' for every value group that contains both experiment and observation, and then keep only the distinct rows:
library(dplyr)
df1 %>%
group_by(value) %>%
mutate(study = if(all(c("experiment", "observation") %in% study))
"both" else study) %>%
ungroup %>%
distinct
-output
# A tibble: 4 × 4
value study lat long
<dbl> <chr> <dbl> <dbl>
1 1 both 37.5 -77.4
2 2 experiment 38.9 -120.
3 3 observation 30.0 -90.1
4 4 both 51.5 -0.118
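A variant of the same idea collapses each site in a single summarize step instead of mutate plus distinct. This is a sketch, not the answer's method: it assumes a site is identified by value, lat and long together, and that any site with more than one distinct study type should be labelled both.

```r
library(dplyr)

# Same example data as in the question
df1 <- data.frame(
  value = c(1, 1, 2, 3, 4, 4),
  study = rep(c("experiment", "observation"), 3),
  lat   = c(37.541290, 37.541290, 38.936604, 29.9511, 51.509865, 51.509865),
  long  = c(-77.434769, -77.434769, -119.986649, -90.0715, -0.118092, -0.118092)
)

collapsed <- df1 %>%
  group_by(value, lat, long) %>%
  # more than one distinct study type at a site means "both" (assumption)
  summarize(study = if (n_distinct(study) > 1) "both" else first(study),
            .groups = "drop") %>%
  select(value, study, lat, long)

collapsed
```

With only the two study types from the question this gives the same four rows as the ideal output; if a third study type could appear, the all(... %in% study) test from the answer above is the stricter check.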

Count number of outliers by group in r and store count in new dataframe

I have a dataset that has 2 columns; column A is State_Name and has 5 different options of state, and column B is Total_Spend which has the average total spend of that state per day. There are 365 observations for each state.
What I want to do is count the number of outliers PER STATE using the 1.5 IQR rule and save the count of outliers per state to a new df or table.
So I would expect an output something like:
State  Outlier Count
ATL    5
GA     20
MI     11
NY     50
TX     23
I have managed to get it to work by doing it one state at a time but I can't figure out what to do to achieve this in a single go.
Here is my code at the moment (to return the result for a single state):
outlier_df <- daily_agg %>%
  select(State_Name, Total_Spend) %>%
  filter(State_Name == "NY")
outlier_NY <- length(boxplot.stats(outlier_df$Total_Spend)$out)
Any help would be appreciated.
Thanks!
EDIT WITH TEST DATASET
outlier_mtcars <-
  mtcars %>%
  select(cyl, disp) %>%
  filter(cyl == 6)
outliers <- length(boxplot.stats(outlier_mtcars$disp)$out)
The above shows me 1 outlier for 6 cyl cars but I want a table that shows how many outliers for 4, 6, 8 cyl cars
Since I'm not very familiar with the function boxplot.stats, I didn't use it in my solution and instead manually calculated the upper fence, 1.5 * IQR plus the upper quartile. Note that this only checks the upper side; low outliers would need the matching lower fence.
Here mtcars is used as an example. Records that are outliers are flagged as TRUE, so we can sum the flags in summarize.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>%
summarize(Outlier = sum(flag))
# A tibble: 3 × 2
cyl Outlier
<dbl> <int>
1 4 0
2 6 1
3 8 0
Since I don't have your data, I'll make some up with the two columns you mention:
df<-data.frame(state=sample(c("ny","fl"),100, replace=TRUE),
spend=sample(1:100, 100, replace=TRUE))
> head(df)
state spend
1 ny 3
2 fl 87
3 ny 91
4 fl 97
5 ny 47
6 fl 8
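Then set your upper and lower bounds (could be quartiles, absolutes, whatever). Here the bounds are the quartiles themselves, which flags everything outside the middle 50%; for the 1.5 IQR rule, widen them by 1.5 * IQR(spend) on each side.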
df%>%
group_by(state)%>%
mutate(lower_bound=quantile(spend,0.25),
upper_bound=quantile(spend,0.75))%>%
mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))
# A tibble: 10 × 5
# Groups: state [2]
state spend lower_bound upper_bound is_outlier
<chr> <int> <dbl> <dbl> <lgl>
1 ny 3 38 84 TRUE
2 fl 87 26 87 FALSE
3 ny 91 38 84 TRUE
4 fl 97 26 87 TRUE
Then, if you only want the count per state, sum is_outlier within each group:
df%>%
group_by(state)%>%
mutate(lower_bound=quantile(spend,0.25),upper_bound=quantile(spend,0.75))%>%
mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))%>%
summarise(outliers=sum(is_outlier))
state outliers
<chr> <int>
1 fl 19
2 ny 30
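For the 1.5 IQR rule the question asks about, the per-state loop can be replaced by a grouped summarize. The sketch below uses mtcars as a stand-in for the real data, as the question's edit does. count_iqr_outliers is a hypothetical helper written for this illustration; boxplot.stats is base R and builds its fences from hinges, so the two counts may differ slightly on some data.

```r
library(dplyr)

# Hypothetical helper: count values outside [Q1 - k*IQR, Q3 + k*IQR]
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

outlier_counts <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    # same call the question uses for one state, applied per group
    boxplot_rule  = length(boxplot.stats(disp)$out),
    # quantile-based fences on both sides
    quantile_rule = count_iqr_outliers(disp),
    .groups = "drop"
  )

outlier_counts
```

For the original data, replacing mtcars with daily_agg, cyl with State_Name and disp with Total_Spend should give one row per state.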

Sum of elements in a forward looking rolling window by month

I have the following data.frame with columns: Id, Month, have
library(dplyr)
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-2018 1.000000000000000 1.234567901220000
1 01-Feb-2018 0.200000000000000 0.234567901233000
1 01-Mar-2018 0.030000000000000 0.034567901234400
1 01-Apr-2018 0.004000000000000 0.004567901234550
1 01-May-2018 0.000500000000000 0.000567901234566
1 01-Jun-2018 0.000060000000000 0.000067901234566
1 01-Jul-2018 0.000007000000000 0.000007901234566
1 01-Aug-2018 0.000000800000000 0.000000901234566
1 01-Sep-2018 0.000000090000000 0.000000101234566
1 01-Oct-2018 0.000000010000000 0.000000011234566
1 01-Nov-2018 0.000000001100000 0.000000001234566
1 01-Dec-2018 0.000000000120000 0.000000000134566
1 01-Jan-2019 0.000000000013000 0.000000000014566
1 01-Feb-2019 0.000000000001400 0.000000000001566
1 01-Mar-2019 0.000000000000150 0.000000000000166
1 01-Apr-2019 0.000000000000016 0.000000000000016
2 01-Jan-2018 1337.00 1338.00
2 01-Feb-2018 1.00 1.00
3 01-Jan-2018 5.000000000000000000 5.000000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%Y'))
I would like to programmatically calculate the sum of elements in a 12-month forward-looking rolling window by Month, grouped by Id, as demonstrated in column want. If the rolling observation window covers fewer than 12 months, the missing elements should be ignored.
For bonus points, the solution would also allow for missing months, such as in:
dt <- read.table(header = TRUE, text = '
Id Month have want
1 01-Jan-18 1.000000000000000 1.200000000000000
1 01-Dec-18 0.200000000000000 0.230000000000000
1 01-Jan-19 0.030000000000000 0.030000000000000
') %>% mutate(Month=as.Date(Month, format='%d-%b-%y'))
I have tried different solutions, e.g. rollapplyr() of the zoo package and some functions in the runner package, but they don't seem to give me what I need.
You can use zoo's rollapply with partial = TRUE
library(dplyr)
dt %>%
group_by(Id) %>%
tidyr::complete(Month = seq(min(Month), max(Month), "month")) %>%
mutate(result = zoo::rollapply(have, 12, sum, na.rm = TRUE,
align = 'left', partial = TRUE)) -> result
result
If you have data for every month for each Id, as in the example shared, you can remove the complete step.
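I suggest using the runner package in this case. The runner function lets you calculate rolling windows with full control over time: k is the window length, lag is the lag of the window, and idx is the index column the window depends on. k and lag are measured in the units of idx, so with a Date index the integers 12 and -11 mean days; pass the strings "12 months" and "-11 months" to get calendar-month windows.
library(runner)
dt %>%
  group_by(Id) %>%
  mutate(want2 = runner(
    .,
    f = function(x) sum(x$have),
    k = 12, # 12 days with a Date idx; use "12 months" for months
    lag = -11, # 11 days ahead; use "-11 months" for months
    idx = Month)
  )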
# # A tibble: 19 x 5
# # Groups: Id [3]
# Id Month have want want2
# <int> <date> <dbl> <dbl> <dbl>
# 1 1 2018-01-01 1.00e+ 0 1.23e+ 0 1.00e+ 0
# 2 1 2018-02-01 2.00e- 1 2.35e- 1 2.00e- 1
# 3 1 2018-03-01 3.00e- 2 3.46e- 2 3.00e- 2
# 4 1 2018-04-01 4.00e- 3 4.57e- 3 4.00e- 3
# 5 1 2018-05-01 5.00e- 4 5.68e- 4 5.00e- 4
# 6 1 2018-06-01 6.00e- 5 6.79e- 5 6.00e- 5
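If neither package behaves as needed, the forward window can also be written out by hand. The following is a minimal sketch, assuming one row per Id and Month and dates stored as Date; forward_sum is a hypothetical helper name, and dt_gaps reproduces the three-row "missing months" example from the question.

```r
library(dplyr)

# Hypothetical helper: for each date, sum the values whose dates fall in
# [date, date + `months` months). Missing months are simply absent from the sum.
forward_sum <- function(dates, values, months = 12) {
  vapply(seq_along(dates), function(i) {
    window_end <- seq(dates[i], by = paste(months, "months"), length.out = 2)[2]
    sum(values[dates >= dates[i] & dates < window_end], na.rm = TRUE)
  }, numeric(1))
}

# The question's example with gaps between months
dt_gaps <- data.frame(
  Id    = c(1, 1, 1),
  Month = as.Date(c("2018-01-01", "2018-12-01", "2019-01-01")),
  have  = c(1, 0.2, 0.03)
)

res <- dt_gaps %>%
  group_by(Id) %>%
  mutate(want2 = forward_sum(Month, have)) %>%
  ungroup()

res
```

This compares every row with every other row in its group, so it is quadratic in group size; that is fine for monthly data but may be slow for very long series.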

Add sequence of week count aligned to a date column with infrequent dates

I'm building a dataset and am looking to be able to add a week count to a dataset starting from the first date, ending on the last. I'm using it to summarize a much larger dataset, which I'd like summarized by week eventually.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
Week = seq_along(Date)/7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially, there are many days missing between weeks so a straight sequential count won't work.
An ideal end result is an additional "week" column, based upon the actual dates (rather than hard-coded with a seq_along() type of result)
Similar solution to Ronak's but with lubridate:
library(dplyr)  # for tibble() and %>%
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
week = interval(min(Date), Date) %>%
as.duration() %>%
as.numeric("weeks") %>%
floor() + 1))
You could subtract the first Date from all the Date values, express the difference in "weeks" using difftime, floor the values, and add 1 so that the counter starts from 1. This assumes the first row holds the earliest date.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
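Since the real data has gaps between dates and is meant to be summarized by week, here is a small sketch of the same calculation on irregular dates. The tibble obs and its count column are made up for illustration, and min(Date) is used instead of the first row so the result does not depend on row order.

```r
library(dplyr)

# Made-up observations with missing days between them
obs <- tibble(
  Date  = as.Date(c("1944-06-01", "1944-06-06", "1944-06-08",
                    "1944-06-20", "1944-07-04")),
  count = c(2, 3, 5, 7, 11)
)

weekly <- obs %>%
  # whole days since the earliest date, in blocks of 7, counted from 1
  mutate(week = as.numeric(Date - min(Date)) %/% 7 + 1) %>%
  group_by(week) %>%
  summarize(total = sum(count), .groups = "drop")

weekly
```

Weeks with no observations (week 4 here) do not appear in the result; tidyr::complete could add them back if empty weeks are needed.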

Replace missing data by using another data table for multiple columns

I have many columns in a table where there is missing data. I want to be able to pull in the information from another table if the data is missing for a particular record based on ID. I thought about possibly joining the two tables and writing a for loop where if column X is NA then pull in information from column Y, however, I have many columns and would require writing many of these conditions.
I want to create a function or a loop where I can pass in the data column names with the missing data and be able to pass in the column name from another table to get the information from.
Reproducible Example:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,NA,NA,1968,1992)
Month <- c(1,NA,8,12,NA,5)
Day <- c(3,NA,NA,NA,NA,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
ID <- c(2,3,4,5)
Year <- c(NA,1994,1967,NA)
Month <- c(4,NA,NA,10)
Day <- c(23,12,16,9)
Old_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Expected Output:
ID <- c(1,2,3,4,5,6)
Year <- c(1990,1987,1994,1967,1968,1992)
Month <- c(1,4,8,12,10,5)
Day <- c(3,23,12,16,9,30)
New_Data = data.frame(ID=ID,Year=Year,Month=Month,Day=Day)
Use rbind to combine the two data frames, then group_by ID and use summarise_all to keep the first non-missing value in each column:
library(dplyr)
rbind(New_Data, Old_Data) %>%
  group_by(ID) %>%
  dplyr::summarise_all(function(x) x[!is.na(x)][1])
# A tibble: 6 x 4
ID Year Month Day
<dbl> <dbl> <dbl> <dbl>
1 1 1990 1 3
2 2 1987 4 23
3 3 1994 8 12
4 4 1967 12 16
5 5 1968 10 9
6 6 1992 5 30
An option using dplyr::left_join and dplyr::coalesce can be as:
library(dplyr)
New_Data %>% left_join(Old_Data, by="ID") %>%
mutate(Year = coalesce(Year.x, Year.y),
Month = coalesce(Month.x, Month.y),
Day = coalesce(Day.x, Day.y)) %>%
select(ID, Year, Month, Day)
# ID Year Month Day
# 1 1 1990 1 3
# 2 2 1987 4 23
# 3 3 1994 8 12
# 4 4 1967 12 16
# 5 5 1968 10 9
# 6 6 1992 5 30
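To avoid writing one coalesce line per column, the join-and-coalesce step can be wrapped in a function that takes the column names, as the question asks. This is a sketch under assumptions: fill_from and the ".old" suffix are names invented here, and every column listed in cols is assumed to exist in both tables.

```r
library(dplyr)

# Example data from the question
New_Data <- data.frame(ID = c(1, 2, 3, 4, 5, 6),
                       Year = c(1990, 1987, NA, NA, 1968, 1992),
                       Month = c(1, NA, 8, 12, NA, 5),
                       Day = c(3, NA, NA, NA, NA, 30))
Old_Data <- data.frame(ID = c(2, 3, 4, 5),
                       Year = c(NA, 1994, 1967, NA),
                       Month = c(4, NA, NA, 10),
                       Day = c(23, 12, 16, 9))

# Hypothetical helper: fill NAs in `cols` of `new` from the matching rows of `old`
fill_from <- function(new, old, key, cols) {
  joined <- left_join(new, old, by = key, suffix = c("", ".old"))
  for (col in cols) {
    # keep the value from `new`, fall back to `old` where it is NA
    joined[[col]] <- coalesce(joined[[col]], joined[[paste0(col, ".old")]])
  }
  select(joined, all_of(names(new)))
}

filled <- fill_from(New_Data, Old_Data, key = "ID",
                    cols = c("Year", "Month", "Day"))
filled
```

Because this is a left join, rows that exist only in the old table are not added; use a full join instead if those should be kept.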
Here's a solution using only base functions from another SO question
I modified it to your needs (created a function, and made an argument for the key column name):
fill_missing_data = function(df1, df2, keyColumn) {
  # columns present in both tables, excluding the key
  commonNames <- names(df1)[which(colnames(df1) %in% colnames(df2))]
  commonNames <- commonNames[commonNames != keyColumn]
  # merge on the key passed in, not a hard-coded "ID"
  dfmerge <- merge(df1, df2, by = keyColumn, all = TRUE)
  for (i in commonNames) {
    left <- paste(i, ".x", sep = "")
    right <- paste(i, ".y", sep = "")
    # fill NAs in the df1 column from the df2 column, then drop the df2 column
    dfmerge[is.na(dfmerge[left]), left] <- dfmerge[is.na(dfmerge[left]), right]
    dfmerge[right] <- NULL
    colnames(dfmerge)[colnames(dfmerge) == left] <- i
  }
  return(dfmerge)
}
result = fill_missing_data(New_Data, Old_Data, "ID")
