df contains battle events within years and conflicts. I am trying to calculate the average distance (in time) between battles within each conflict-year.
The data looks something like this:
conflictId | year | event_date | event_type
107 1997 1997-01-01 1
107 1997 1997-01-01 1
20 1997 1997-01-01 1
20 1997 1997-01-01 2
20 1997 1997-01-03 1
what I first tried was
time_prev_total <- aggregate (event_date ~ conflictId + year, data, diff)
but I end up with event_date being a list in the new df. Attempts to extract the first index position of the list within the df have been unsuccessful.
Alternatively, it was suggested to me that I could create a time index within each conflict year, lag that index, build a new data frame with conflictId, year, event_date, and the lagged index, and then merge it with the original df by matching the lagged index in the new df to the old index in the original df. I have tried to implement this but am unsure how to index the observations within conflict years, since the data is unbalanced.
You can use ddply to split a data.frame into pieces
(one per year and conflict) and apply a function to each.
# Sample data
n <- 100
d <- data.frame(
conflictId = sample(1:3, n, replace=TRUE),
year = sample(1990:2000, n, replace=TRUE),
event_date = sample(0:364, n, replace=TRUE),
event_type = sample(1:10, n, replace=TRUE)
)
d$event_date <- as.Date(ISOdate(d$year,1,1)) + d$event_date
library(plyr)
# Average distance between battles, within each year and conflict
ddply(
d,
c("year","conflictId"),
summarize,
average = mean(dist(event_date))
)
# Average distance between consecutive battles, within each year and conflict
d <- d[order(d$event_date),]
ddply(
d,
c("year","conflictId"),
summarize,
average = mean(diff(event_date))
)
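For comparison, here is a minimal base R sketch of the same idea using the five rows from the question. The helper name mean_gap and the output column name mean_gap_days are made up for illustration. The original aggregate() call produced a list column because diff returns a vector of varying length per group; wrapping it so that each group yields a single number avoids that.

```r
# Sample rows copied from the question
d <- data.frame(
  conflictId = c(107, 107, 20, 20, 20),
  year = 1997,
  event_date = as.Date(c("1997-01-01", "1997-01-01", "1997-01-01",
                         "1997-01-01", "1997-01-03"))
)

# Hypothetical helper: mean gap in days between consecutive events,
# NA when a group has fewer than two events
mean_gap <- function(x) {
  x <- sort(as.numeric(x))
  if (length(x) < 2) NA_real_ else mean(diff(x))
}

# One number per conflict-year, so the result is an ordinary numeric column
gaps <- aggregate(event_date ~ conflictId + year, data = d, FUN = mean_gap)
names(gaps)[3] <- "mean_gap_days"
gaps
```

Sorting inside the helper means the input does not have to be ordered by date beforehand.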
I am trying to create a monthly average of precipitation values of two different time sets, but I can't get the data to be split into two before making the aggregation.
I have a dataset of daily precipitation data from 01-01-2006 to 31-12-2099 and I want to aggregate per month over the time period of (01-01-2015 to 31-12-2054) and (01-01-2055 to 31-12-2099).
I have used the aggregate function to create an average per month like this. But now I have the average per month over the entire data set (2006-2100) and I want to have two lists (one for 01-01-2015 to 31-12-2054 and one for 01-01-2055 to 31-12-2099). I think I need to make a subset or split the data, but I cannot find how to combine this with the aggregate function. Thank you so much!
months = Alentejo_RCP4.5_Average$Month
Alentejo_RCP4.5_Average.myma = aggregate(x = Alentejo_RCP4.5_Average,
by = list(months), FUN = mean)
I also tried this but it just takes the dates and not the attached values to the date.
df <- data.frame(date=as.Date("2015-01-01")+1:365, x=1:365)
list <- split(df,df$date<as.Date("2055-01-01"))
zz <- " Year Month Day Date Average_P
2006 1 1 2006-01-01 6.5
2007 1 2 2007-01-02 2.8
2055 3 3 2055-03-03 3.5
2058 3 4 2058-03-04 5.1
2060 5 5 2060-05-05 3.2"
Data <- read.table(text=zz, header = TRUE)
Instead of splitting the datasets you can create a new column to distinguish between the two groups and take mean of each group and each month.
library(dplyr)
Data %>%
mutate(Date = as.Date(Date),
group = ifelse(Date < as.Date("2055-01-01"),
'below_2055', 'above_2055'),
month = format(Date, '%m-%Y')) %>%
# group by the derived month, not by the raw Date
group_by(group, month) %>%
summarise(Average_P = mean(Average_P)) -> result
Or in base R :
Data$Date <- as.Date(Data$Date)
aggregate(Average_P~group + month,
transform(Data,
group = ifelse(Date < as.Date("2055-01-01"),
'below_2055', 'above_2055'),
month = format(Date, '%m-%Y')), mean) -> result
If you need final output as list you can then use split.
split(result,result$group)
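If the two windows should be exactly 2015-2054 and 2055-2099 (so that 2006-2014 is left out), one option is to label the years with cut() first. This is only a sketch under that assumption. It reuses the sample rows from above and groups by calendar month across years, which is a guess at what "average per month" means in the question. The column name period and its labels are made up.

```r
zz <- "Year Month Day Date Average_P
2006 1 1 2006-01-01 6.5
2007 1 2 2007-01-02 2.8
2055 3 3 2055-03-03 3.5
2058 3 4 2058-03-04 5.1
2060 5 5 2060-05-05 3.2"
Data <- read.table(text = zz, header = TRUE)
Data$Date <- as.Date(Data$Date)

# Label each row with its window; years outside both windows become NA
yr <- as.integer(format(Data$Date, "%Y"))
Data$period <- cut(yr, breaks = c(2014, 2054, 2099),
                   labels = c("2015_2054", "2055_2099"))

# Drop rows outside both windows, then average per window and calendar month
in_window <- Data[!is.na(Data$period), ]
monthly <- aggregate(Average_P ~ period + Month, data = in_window, FUN = mean)

# One data frame per window that actually has data
monthly_by_period <- split(monthly, monthly$period, drop = TRUE)
monthly_by_period
```

With these five rows the two early observations fall outside both windows, so only the 2055_2099 element appears in the result.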
I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small representation of the table. What I want to find out is, at the same time as each entry, how many others were present and then separate them by gender. So, say for example that at the time as the first presence of person with ID 1, person with ID 6 was present and person with ID 10 was present twice in the same interval. That would mean that at the same time, 2 other overlaps occurred. This also means that person with ID 1 has overlapped with 1 Male and 1 Female.
So its result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps and have managed to solve this with Excel, but I would want to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The recorded arrival (10/04/2017) is about eleven days after the recorded departure (30/03/2017). I had to drop that row in order for foverlaps to run.
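The filtering only makes sense once Arrival and Departure are real date-times rather than text. Here is a small sketch of that step, assuming the columns come in as day/month/year strings like in the question. The three rows are copied from the question's table, and the names fmt and bad are made up.

```r
df <- data.frame(
  ID = c(1, 3, 3),
  Arrival = c("10/04/2015 23:14", "10/04/2017 21:53", "31/03/2017 07:09"),
  Departure = c("11/04/2015 00:21", "30/03/2017 23:37", "31/03/2017 07:57"),
  Gender = c("F", "M", "M")
)

# Parse day/month/year timestamps; UTC is an assumption about the time zone
fmt <- "%d/%m/%Y %H:%M"
df$Arrival <- as.POSIXct(df$Arrival, format = fmt, tz = "UTC")
df$Departure <- as.POSIXct(df$Departure, format = fmt, tz = "UTC")

# Flag rows where the departure is not after the arrival
bad <- df$Departure <= df$Arrival
df[bad, ]
```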
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then filter to keep only the cases where the other person arrived no later than the ID in question and the two IDs differ.
Then add one variable counting the male simultaneous guests and one counting the female simultaneous guests, both grouped by ID and arrival.
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
.(malesEncountered = sum(i.Gender == "M"),
femalesEncountered = sum(i.Gender == "F")),
by = .(ID, Arrival)]
Join the findings of the previous command with our original table on ID and arrival
result <- simultaneous[dt, on = .(ID, Arrival)]
Edit: convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0L][
is.na(femalesEncountered), femalesEncountered := 0L]
Set the column order to something nicer:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
Here's one possibility. It uses lubridate's interval and its int_overlaps function, which tests whether date intervals overlap. There is a drawback, though: Interval objects don't work with dplyr, so this version does all the work manually in a for loop.
It starts by making a 1000-row random dataset that resembles yours: each person arrives within a two-year period and departs one or two days later.
It takes about 24 seconds to run for 1000 rows, so you can expect it to take a while for 50K! The for loop prints the row number so you can see where it is, though.
Any questions about the code, lemme know.
There must be a faster vectorised way, but interval didn't seem to play nice with apply either. Someone else might have something quicker...
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
ID = c(1:1000),
Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
Gender = factor(ifelse(rbinom(1000,1,0.5) == 1,'Male','Female'), levels = c('Male','Female'))#Random Male/Female, 50% probability; a factor so complete() below can fill in zero counts
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
print(paste0("Row ",i))
#exclude self (don't want to check date overlap with myself)
overlapcheck <- x[x$ID != x$ID[i],]
#Find out what dates this person overlaps with - can do all other intervals in one command
overlapcheck$overlaps <- int_overlaps(x$interval[i],overlapcheck$interval)
#Eyeball check that is finding the overlaps we want
#Is this ID date overlapping? Tick
#View(overlapcheck[overlapcheck$overlaps,])
#Use dplyr to find out the number of overlaps for male and female
#Keep only rows where the overlap is TRUE
#Also drop the interval column first tho as dplyr doesn't like it... (not tidy!)
gendercount <- overlapcheck %>%
select(-interval) %>%
filter(overlaps) %>%
group_by(Gender) %>%
summarise(count = n()) %>% #Get count of observations for each overlap for each sex
complete(Gender, fill = list(count = 0))#Need this to keep zero counts: summarise drops them otherwise
#We want count for each gender in their own column, so make wide
gendercount <- gendercount %>%
spread(key = Gender, value = count)
#Store for turning into dataframe shortly
gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>%
select(ID,Arrival,Departure,Gender,Male,Female)
#~24 seconds per thousand
proc.time()-t
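On the "faster vectorised way": one possible sketch compares every interval with every other using outer() on the numeric dates, skipping lubridate entirely. It assumes closed intervals (touching endpoints count as overlapping) and one row per ID, as in the simulated data above. The names a, d and overlap are made up. The full n x n logical matrix is fine for 1000 rows but would be far too large for 50K, so treat this as an illustration rather than a drop-in solution.

```r
set.seed(1)
n <- 1000
x <- data.frame(
  ID = seq_len(n),
  Arrival = as.Date("2017-01-01") + sample(0:730, n, replace = TRUE),
  Gender = sample(c("Male", "Female"), n, replace = TRUE)
)
x$Departure <- x$Arrival + sample(1:2, n, replace = TRUE)

a <- as.numeric(x$Arrival)
d <- as.numeric(x$Departure)

# overlap[i, j] is TRUE when interval i and interval j share at least one day
overlap <- outer(a, d, "<=") & outer(d, a, ">=")
diag(overlap) <- FALSE  # a row never counts as overlapping itself

# Count overlapping rows of each gender for every row
x$Male <- rowSums(overlap[, x$Gender == "Male", drop = FALSE])
x$Female <- rowSums(overlap[, x$Gender == "Female", drop = FALSE])
head(x)
```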
I have an R time series data frame consisting of multiple variables for each day over about 19 years. I would like to compute the mean of only the months which have more than 10 days of values. So, if a month (e.g. Jan for the entire period 1996-2015) has fewer than 10 days of values, I would like to exclude it from the calculation of the monthly mean for the whole time period.
The data frame is as follows:
date val1,val2,val3,val4,val5
1 1996-01-01 5.25,4.20,3.58,6.44,2.66
2 1996-01-02 10.11,9.22,14.25,12.11,13.22
3 1996-01-03 25.11,30.44,45.22,31.24,27.35
..
..
..
7305 2015-12-31 30.54,55.14,63.12,51.22,45.21
Any ideas?
You can first get the number of observations per month with aggregate and then restrict your dataset to those which have at least minDays observations using merge.
x <- read.table(sep=c(","), head=T, as.is = TRUE, text=
"date,val1,val2,val3,val4,val5
1996-01-01,5.25,4.20,3.58,6.44,2.66
1996-01-02,10.11,9.22,14.25,12.11,13.22
1996-01-03,25.11,30.44,45.22,31.24,27.35")
minDays <- 10
x$ym <- substr(x$date,1,nchar(x$date)-3) #get year month out of date
tt <- aggregate(val1 ~ ym, data=x, FUN=length) #Get number of observations per month
aggregate(val1 ~ ym, data=merge(x, tt[tt$val1>=minDays, "ym", drop=FALSE]), FUN=mean) #Calculate mean when n observations are >= minDays
Or using ave:
x <- read.table(sep=c(","), head=T, as.is = TRUE, text=
"date,val1,val2,val3,val4,val5
1996-01-01,5.25,4.20,3.58,6.44,2.66
1996-01-02,10.11,9.22,14.25,12.11,13.22
1996-01-03,25.11,30.44,45.22,31.24,27.35")
minDays <- 10
x$ym <- substr(x$date,1,nchar(x$date)-3) #get year month out of date
x$n <- with(x, ave(val1, ym, FUN=length))
aggregate(val1 ~ ym, data=x[x$n>=minDays,], FUN=mean)
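The three-row sample above never reaches minDays, so the result is empty. Here is a small made-up dataset where one month passes the threshold and one does not, which shows the filter at work. It also averages several value columns at once via cbind(). The values are invented for illustration.

```r
# Made-up data: 12 days in January 1996, only 5 days in February 1996
dates <- c(seq(as.Date("1996-01-01"), by = "day", length.out = 12),
           seq(as.Date("1996-02-01"), by = "day", length.out = 5))
x <- data.frame(date = dates,
                val1 = seq_along(dates),
                val2 = 2 * seq_along(dates))

minDays <- 10
x$ym <- format(x$date, "%Y-%m")             # year-month key
x$n <- ave(x$val1, x$ym, FUN = length)      # observations per month

# Means for all value columns, using only months with at least minDays rows
monthly <- aggregate(cbind(val1, val2) ~ ym,
                     data = x[x$n >= minDays, ], FUN = mean)
monthly
```

February is dropped because it has only five observations.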
Please see the sample data below.
I want to convert the quarterly sale data (with a start date and end date) into monthly sale data.
For example:
Data set A, row 1 will be split into Data set B, rows 1, 2 and 3 for June, July and August separately, and the sale will be pro-rated based on the number of days in that month; all other columns stay the same.
Data set A, row 2 will pick up what was left of row 1 (which ends on 5/9/2017) to form a complete September.
Is there an efficient way to do this? The actual data is a csv file of about 100K x 15, which will be split into approximately 300K x 15 for monthly analysis.
Some key characteristics of the sample data:
The start day of the first quarterly sales record is the day the customer joins, so it could be any day.
All sales are quarterly, covering 90, 91 or 92 days, but incomplete quarterly sale data is also possible when a customer leaves within the quarter.
Sample Question:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 91 7/06/2017 5/09/2017 91
2 1 US Commercial 92 6/09/2017 6/12/2017 92
3 2 US Casual 25 10/07/2017 3/08/2017 25
4 3 UK Commercial 64 7/06/2017 9/08/2017 64
Sample Answer:
Customer.ID Country Type Sale Start.Date End.Date Days
1 1 US Commercial 24 7/06/2017 30/06/2017 24
2 1 US Commercial 31 1/07/2017 31/07/2017 31
3 1 US Commercial 31 1/08/2017 31/08/2017 31
4 1 US Commercial 30 1/09/2017 30/09/2017 30
5 1 US Commercial 31 1/10/2017 31/10/2017 31
6 1 US Commercial 30 1/11/2017 30/11/2017 30
7 1 US Commercial 6 1/12/2017 6/12/2017 6
8 2 US Casual 22 10/07/2017 31/07/2017 22
9 2 US Casual 3 1/08/2017 3/08/2017 3
10 3 UK Commercial 24 7/06/2017 30/06/2017 24
11 3 UK Commercial 31 1/07/2017 31/07/2017 31
12 3 UK Commercial 9 1/08/2017 9/08/2017 9
I just ran CIAndrews' code. It seems to work for the most part, but it is very slow when run on a dataset with 10,000 rows. I eventually cancelled the execution after a few minutes of waiting. There's also an issue with the number of days: For example, July has 31 days, but the days variable only shows thirty. It's true that 31-1 = 30, but the first day should be counted as well.
The code below only takes about 21 seconds on my 2015 MacBook Pro (not including data generation), and takes care of the other problem, too.
library(tidyverse)
library(lubridate)
# generate data -------------------------------------------------------------
set.seed(666)
# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = T)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = T)
type <- sample(c("commercial", "casual", "other"), 10000, replace = T)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = T)
days <- sample(85:105, 10000, replace = T)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)
# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)
# split quarters into months ----------------------------------------------
# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))
# for-loop generates new dates and adds as dataframe to list
for (i in 1:length(list_date_dfs)) {
# transfer dataframe row to variable `row`
row <- df_quarterly[i,]
# correct end date so split successful when interval doesn't cover full month
end_corr <- row$end + day(row$start) - day(row$end)
# use lubridate to compute first and last days of relevant months
m_start <- seq(row$start, end_corr, by = "month") %>%
floor_date(unit = "month")
m_end <- m_start + days_in_month(m_start) - 1
# replace first and last elements with original dates
m_start[1] <- row$start
m_end[length(m_end)] <- row$end
# compute the number of days per month as well as sales per month
# correct difference by adding 1
m_days <- as.integer(m_end - m_start) + 1
m_sale <- (row$sale / sum(m_days)) * m_days
# add tibble to list
list_date_dfs[[i]] <- tibble(customer = row$customer,
country = row$country,
type = row$type,
sale = m_sale,
start = m_start,
end = m_end,
days = m_days
)
}
# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
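As a cross-check on the small sample from the question, here is a different and deliberately simple base R sketch: expand each record to one row per day, give each day an equal share of the sale, and sum by customer and month. This is not the loop above. It assumes the end date is inclusive and that consecutive records of one customer should be merged within a month, as in the sample answer. The names daily, n_days and monthly are made up.

```r
quarterly <- data.frame(
  Customer.ID = c(1, 1, 2, 3),
  Sale = c(91, 92, 25, 64),
  Start.Date = as.Date(c("7/06/2017", "6/09/2017", "10/07/2017", "7/06/2017"),
                       "%d/%m/%Y"),
  End.Date = as.Date(c("5/09/2017", "6/12/2017", "3/08/2017", "9/08/2017"),
                     "%d/%m/%Y")
)

# One row per customer-day, each carrying an equal share of the record's sale
daily <- do.call(rbind, lapply(seq_len(nrow(quarterly)), function(i) {
  q <- quarterly[i, ]
  days <- seq(q$Start.Date, q$End.Date, by = "day")
  data.frame(Customer.ID = q$Customer.ID, day = days,
             sale = q$Sale / length(days), n_days = 1)
}))

# Sum sale and day counts per customer and calendar month
daily$month <- format(daily$day, "%Y-%m")
monthly <- aggregate(cbind(sale, n_days) ~ Customer.ID + month,
                     data = daily, FUN = sum)
monthly[order(monthly$Customer.ID, monthly$month), ]
```

The daily expansion multiplies the row count by roughly 90, so for 100K records it trades memory for simplicity.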
It's not pretty, as it uses multiple functions and loops to carry out several operations:
# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)
# Function to split per month
library(zoo)
addrowFun <- function(y){
temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
CiAndrews,
Thanks for the help and patience. I managed to get the answer with a small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and everything runs smoothly after that.
Please see the head of sample2.csv below
customer country type sale Start Finish days
1 43108181108 US Commercial 3330 17/11/2016 24/02/2017 99
2 43108181108 US Commercial 2753 24/02/2017 23/05/2017 88
3 43108181108 US Commercial 3043 13/02/2018 18/05/2018 94
4 43108181108 US Commercial 4261 23/05/2017 18/08/2017 87
5 43103703637 UK Casual 881 4/11/2016 15/02/2017 103
6 43103703637 UK Casual 1172 26/07/2018 1/11/2018 98
Please see the code below:
library(tidyr)
#read data and convert Start and Finish to Date type
data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)
# Function to split per month
library(zoo)
library(plyr)
addrowFun <- function(y){
temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
loop <- addrowFun(loop)
}
# Calculating the days per month
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))
# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
totalSale <- sum(tempSum$sale)
totalDays <- sum(tempSum$days)
return(x$days / totalDays * totalSale)
}
for(i in 1:length(loop$customer)){
loop$sale[i] <- sumFun(loop[i,])
}
loop
I am creating three types of variables over multiple time periods for individual customer IDs in my data. These new variables include the sum of a price vector, the mean of a price vector, and the mean difference between successive dates in a date vector.
Using data.table, I am looping through multiple time periods, subsetting the data in each period, and calculating those variables for individual customer IDs. Each of these variables is named dynamically as I loop over the time periods. As it stands, these variables are being computed correctly.
Here is where I am getting stuck: after all of these variables are computed, I would like to subset the data to include the new aggregated variables along with the most recent purchase.price and date elements for each customer.
I thought that data.table might replicate the computed totals over all rows corresponding to each customer. However, it only replicates those totals in rows corresponding to the period intervals specified in the table's i index. Since it does not replicate those totals over all rows for each customer, my final dplyr block does not do the trick.
In the second and third code blocks I give the output of the final dplyr code and then the output I would like to achieve.
This question stems from a similar problem noted where we are subsetting over fewer variables that are not being created dynamically.
library(lubridate)
library(data.table)
library(dplyr)
data <- data.frame(custid = c(rep(1, 25), rep(2, 25), rep(1, 25), rep(2, 25)),
purchase.price = seq(1, 200, by=2),
date = seq.Date(from=as.Date("2015-01-01"), to=as.Date("2015-04-10"), by="days"))
period_intervals <- list(period_one = interval(as.Date("2015-01-01"), as.Date("2015-01-30")),
period_two = interval(as.Date("2015-02-01"), as.Date("2015-02-28")),
period_three = interval(as.Date("2015-03-01"), as.Date("2015-03-31")),
period_four = interval(as.Date("2015-04-01"), as.Date("2015-04-28")))
data <- as.data.table(data)
data <- data[order(date)]
setkey(data, custid)
time_periods <- c(1:4)
for(i in time_periods[1]:max(time_periods)){
data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.sum", sep="."):= sum(purchase.price),
by = custid]
data <- data[date %within% period_intervals[[i]],
paste("period", i, "price.mean", sep="."):= mean(purchase.price),
by = custid]
data <- data[date %within% period_intervals[[i]],
paste("period", i, "mean.diff.date", sep="."):= mean(as.numeric(diff(purchase.price))),
by = custid]
}
data_sub <- data %>%
group_by(custid) %>%
arrange(desc(date)) %>%
filter(row_number() == 1)
Current result from dplyr subsetting (showing the first 7 columns):
custid purchase.price date period.1.price.sum period.1.price.mean period.1.mean.diff.date period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 NA NA NA NA ...
2 1 149 2015-03-16 NA NA NA NA ...
Here is what I was hoping for (showing first 7 columns):
custid purchase.price date period.1.price.sum period.1.price.mean period.1.mean.diff.date period.2.price.sum ...
<dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> ...
1 2 199 2015-04-10 625 25 2 981 ...
2 1 149 2015-03-16 275 55 2 1539 ...
Note:
In my complete dataset, I am looping over anywhere between 10-20 time periods. The number of periods to be computed over is subject to change, thus my approach to dynamically create the new variables.
We can use Map as in the previous post
nm1 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.sum")
nm2 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "price.mean")
nm3 <- sprintf("%s.%d.%s", "period", seq_along(period_intervals), "mean.diff.date")
data[, c(rbind(nm1, nm2, nm3)) := unlist(Map(function(x,y) {
x1 <- purchase.price[x %within% y]
list(sum(x1), mean(x1), mean(as.numeric(diff(x1))))},
list(date), period_intervals), recursive = FALSE), by = custid]
data[order(custid, -date)][,.SD[1] , custid]
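To see why the original loop leaves NAs, here is a tiny made-up example. With a subset in i, `:=` only writes to the rows that match, whereas putting the condition inside j (which is what the Map version above effectively does) writes the per-customer value to every row. The table, the column names jan.sum.i and jan.sum.j, and the single January window are all hypothetical.

```r
library(data.table)

dt <- data.table(
  custid = c(1, 1, 2, 2),
  date = as.Date(c("2015-01-05", "2015-02-05", "2015-01-10", "2015-02-10")),
  price = c(10, 20, 30, 40)
)
jan_start <- as.Date("2015-01-01")
jan_end <- as.Date("2015-01-31")

# Condition in i: only the January rows receive the sum, the rest stay NA
dt[date >= jan_start & date <= jan_end, jan.sum.i := sum(price), by = custid]

# Condition inside j: every row of the customer receives the January sum
dt[, jan.sum.j := sum(price[date >= jan_start & date <= jan_end]), by = custid]

# Most recent row per customer
setorder(dt, custid, -date)
latest <- dt[, .SD[1], by = custid]
latest
```

The most recent row of each customer falls in February here, so jan.sum.i is NA while jan.sum.j holds the January total.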