R - Difference between two rows conditional on variables

Lately, whenever I can't manage something on my own, I find I have a point to clarify on this board, so thanks in advance for your help.
The issue is as follows:
library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D", "102D", "102D", "102D"),
  Code_ID_Sell = c("98C", "99C", "98C", "99C", "98C", "96V", "25A", "25A", "25A", "25A"),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)
setDT(Data)  # := needs a data.table, not a plain data.frame
Data[, totalcont := sum(Contract_Size), by = c("Code_ID_Buy", "Month")]
View(Data)
I would like to calculate the difference in total contract size from one period to the next. Does anybody know whether this is possible? I have been working on it for weeks without finding a solution...
Kind regards,

This should solve it:
library(tidyverse)
library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1, 1, 2, 2, 3, 3, 4, 3, 4, 5),
  Code_ID_Buy = c("100D", "100D", "100D", "100D", "100D", "100D", "100D", "102D", "102D", "102D"),
  Code_ID_Sell = c("98C", "99C", "98C", "99C", "98C", "96V", "25A", "25A", "25A", "25A"),
  Contract_Size = c(100, 20, 120, 300, 120, 30, 25, 60, 80, 90)
)
Data %>%
  group_by(Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
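For reference, running this on the data above yields the following (monthly totals worked out by hand from the sample data):
#>   Month total_contract dif_contract
#> 1     1            120           NA
#> 2     2            420          300
#> 3     3            210         -210
#> 4     4            105         -105
#> 5     5             90          -15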
If you need to respect another grouping variable, do
Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(lag = total_contract - lag(total_contract))
If you need to maintain the current structure, a left_join may be right:
result <- Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(lag = total_contract - lag(total_contract))
Data %>%
  left_join(result, by = c("Code_ID_Buy", "Month"))
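For completeness, since the question started from data.table, a data.table version of the grouped calculation is sketched below (shift() is data.table's counterpart to dplyr's lag()):
library(data.table)
setDT(Data)
totals <- Data[, .(total_contract = sum(Contract_Size)), by = .(Code_ID_Buy, Month)]
# difference to the previous period's total within each buy code
totals[, dif_contract := total_contract - shift(total_contract), by = Code_ID_Buy]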

Related

Aggregation and mean calculation with dplyr

I have a chunk of code that aggregates timestamps of a large dataset (see below). Each timestamp represents a tweet. The code aggregates the tweets per week and works fine. Now I also have a column with the sentiment value of each tweet, and I would like to know if it is possible to calculate the mean sentiment of the tweets per week. It would be nice to end up with one dataset holding both the number of tweets per week and the mean sentiment of those aggregated tweets. Please let me know if you've got some hints :)
Kind regards,
Daniel
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  count(weekly_cases) %>%
  tidyr::complete(
    weekly_cases = seq.Date(
      from = min(weekly_cases),
      to = max(weekly_cases),
      by = "week"),
    fill = list(n = 0))
It is difficult to verify the answer since no data has been shared, but based on the description, here is a solution you can try.
library(dplyr)
library(tidyr)
library(lubridate)
weekly_counts_2 <- df_bw %>%
  drop_na(Timestamp) %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n()) %>%
  complete(weekly_cases = seq.Date(min(weekly_cases),
                                   max(weekly_cases), by = "week"),
           fill = list(count = 0))  # the count column here is named `count`, not `n`
I have assumed the column with the sentiment value is called sentiment_value; change it according to your data.
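For illustration, here is a minimal made-up df_bw (the Timestamp values and sentiment_value column are hypothetical stand-ins, not the asker's data) showing the pipeline in action:
library(dplyr)
library(tidyr)
library(lubridate)
# one tweet per day over three weeks, with random sentiment scores
df_bw <- data.frame(
  Timestamp = seq(as.Date("2021-01-01"), as.Date("2021-01-21"), by = "day"),
  sentiment_value = runif(21, -1, 1)
)
df_bw %>%
  mutate(weekly_cases = floor_date(Timestamp, unit = "week")) %>%
  group_by(weekly_cases) %>%
  summarise(mean_sentiment = mean(sentiment_value, na.rm = TRUE),
            count = n())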

Aggregate, but function uses two columns

I'm sure this has been asked 1000 times, but I can't find the question, and can't figure it out.
I have a data.frame, with a location (a factor), a date, and a variable.
I want to find the date on which the variable is maximized, for each location.
df <- data.frame(FAC = factor(rep(c("A", "B", "C"), each = 5)),
                 VAR = runif(15),
                 DATE = rep(as.Date(c("2000-01-01", "2000-01-02", "2000-01-03",
                                      "2000-01-04", "2000-01-05"))))
I can easily (but messily) do this with a for loop:
df_summary <- data.frame(FAC = levels(df$FAC),
                         date = as.Date(NA))  # as.Date(NA) keeps the Date class; as.Date(character(1)) would error
for (i in seq_along(levels(df$FAC))) {
  df_subset <- subset(df, FAC == levels(df$FAC)[i])
  max_date <- df_subset$DATE[which.max(df_subset$VAR)]
  df_summary$date[df_summary$FAC == levels(df$FAC)[i]] <- max_date
}
But I imagine there's a 'nice' way either with aggregate or dplyr, but I can't figure it out.
My (failed) attempts:
aggregate(x = df$DATE, by = list(df$FAC), FUN = function(x) x[which.max(df$VAR)])
This doesn't work, because df$VAR isn't subset in the function.
And I don't really know how to use dplyr because I generally use base R.
Any suggestions?
In dplyr, you can do -
library(dplyr)
df %>% group_by(FAC) %>% summarise(max_date = DATE[which.max(VAR)])
In data.table -
library(data.table)
setDT(df)[, .(max_date = DATE[which.max(VAR)]), FAC]
We can use
library(dplyr)
df %>%
  arrange(FAC, desc(VAR)) %>%
  group_by(FAC) %>%
  slice(1)
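Since the asker mentioned mostly using base R, the failed aggregate() attempt can also be repaired without dplyr: split the whole data frame by FAC so that VAR and DATE stay together within each piece (a sketch):
# split() keeps VAR and DATE in the same piece, unlike aggregate(x = df$DATE, ...)
do.call(rbind, lapply(split(df, df$FAC), function(d)
  data.frame(FAC = d$FAC[1], max_date = d$DATE[which.max(d$VAR)])))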

Count of crashes and injuries?

I have a dataset from dot.gov website that I have to analyze as part of our school project. It contains a lot of information, but I am just focusing on crashes and injuries. How do I count the number of crashes or injuries from the year 2007-2014 for example?
Do I have to subset my data per year or is there a more efficient way to do it? Thank you!
Below is a sample of my dataset:
Without a reproducible example of your dataset on which to test the code, it is difficult to be sure it will work, but using the dplyr and lubridate packages you can try the following (assuming your dataset is called df):
library(dplyr)
library(lubridate)
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
  mutate(Year = year(YEARTXT)) %>%
  filter(Year %in% 2007:2014) %>%
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y"))
To get the count of crashes and injuries per year, you can add group_by to the sequence, such as:
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
  mutate(Year = year(YEARTXT)) %>%
  group_by(Year) %>%
  filter(Year %in% 2007:2014) %>%
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y"))
If this is not working, please provide a reproducible example of your dataset: How to make a great R reproducible example
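Since no data was shared, a toy dataset (hypothetical values under the column names assumed above) can be used to sanity-check the pipeline:
library(dplyr)
library(lubridate)
df <- data.frame(
  YEARTXT = c("2006-05-01", "2008-07-15", "2010-03-20", "2013-11-02"),
  INJURED = c(2, 0, 3, 1),
  CRASH = c("Y", "N", "Y", "Y")
)
df %>% mutate(Year = year(ymd(YEARTXT))) %>%
  filter(Year %in% 2007:2014) %>%
  group_by(Year) %>%
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y"))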

efficient dplyr summarise in one data frame based on intervals in another one

I frequently need to calculate means of many parameters in time series datasets based on intervals defined as "events" in a second dataset.
The example code below illustrates my current approach, which does work nicely.
As my datasets will keep growing, though, I am wondering if there is a more efficient way (the example runs in ~30 s on my PC).
It is important to stay within dplyr/tidyverse (data.table ways are appreciated, but won't really help).
library(tidyverse)
# generate time series data
data <- bind_cols(
  data_frame(td = seq(from = as.POSIXct("2010-01-01 00:00"),
                      to = as.POSIXct("2010-12-31 23:59"),
                      by = 60)),
  as_data_frame(replicate(20, runif(525600))))
# generate events
events <- data_frame(
  event = as.character(1:669),
  start_cet = seq(from = as.POSIXct("2010-01-01 00:00"),
                  to = as.POSIXct("2010-12-01 00:00"),
                  by = 43200),
  stop_cet = seq(from = as.POSIXct("2010-01-01 02:00"),
                 to = as.POSIXct("2010-12-01 02:00"),
                 by = 43200)
)
# calculate means of data columns within event intervals
system.time(
  means <- events %>%
    rowwise() %>%
    mutate(s = list(data %>% select(td) %>% filter(td >= start_cet & td < stop_cet))) %>%
    unnest() %>%
    select(event, td) %>%
    left_join(., data) %>%
    group_by(event) %>%
    summarise_at(vars(V1:V20), funs(mean = mean)) %>%
    ungroup()
)
Here's an efficient way of doing it using the latest devel (1.9.7+) version of data.table, which takes about 10 milliseconds to run on the OP's sample:
library(data.table)
setDT(data); setDT(events)
data[events, on = .(td >= start_cet, td <= stop_cet), lapply(.SD, mean), by = .EACHI]
Answer to myself after ~ 3 yrs...
The mutate step in the dplyr solution above was unnecessarily complicated, as also indicated in the comment by JDLong. I now use
means2 <- events %>%
  rowwise() %>%
  mutate(td = list(seq(start_cet, stop_cet - 60, "min"))) %>%
  unnest() %>%
  select(event, td) %>%
  left_join(., data) %>%
  group_by(event) %>%
  summarise_at(vars(V1:V20), funs(mean = mean)) %>%
  ungroup()
which is ~25 times faster than the old dplyr solution above.
The dt solution is still ~5 times faster than this dplyr chain. However, its output is a bit messed up: instead of a column with the events, I get two columns named td, which hold the start and stop times of the events. Do any data.table experts know how to fix this?
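One possible repair (a sketch, not from the original thread): keep the by = .EACHI result, then re-attach the event ids and rename the two join-bound columns. Rows from by = .EACHI come back in the order of the events table, so a plain assignment lines up:
library(data.table)
setDT(data); setDT(events)
res <- data[events, on = .(td >= start_cet, td <= stop_cet),
            lapply(.SD, mean), by = .EACHI, .SDcols = V1:V20]
# by = .EACHI returns one row per row of `events`, in order
res[, event := events$event]
# the two leading columns both print as td but hold the interval bounds
setnames(res, 1:2, c("start_cet", "stop_cet"))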

R Calculate change using time and not lag

I'm trying to calculate the change from one quarter to the next. It's a little complex as I'm grouping variables, and I'd like to avoid lag if possible. I'm having some difficulty finding a solution. Does anyone have a suggestion?
This is where I am now.
library(dplyr)
library(lubridate)
x <- mutate(x, QTR.YR = quarter(Month.of.Day.Timestamp, with_year = TRUE))
New.DF <- x %>%
  group_by(location_country, Device.Type, QTR.YR) %>% # select grouping columns to summarise
  summarise(Sum.Basket = sum(Baskets.Created), Sum.Visits = sum(Visits),
            Sum.Checkout.Starts = sum(Checkout.Starts), Sum.Converted.Visit = sum(Converted.Visits),
            Sum.Store.Find = sum(store_find), Sum.Time.on.Page = sum(time_on_pages),
            Sum.Time.on.Product.Page = sum(time_on_product_pages),
            Visit.Change = sum(Visits) / sum((lag(Visits)), 4) - 1)
View(New.DF)
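No answer was recorded here, but one way to compute the change from the time value itself rather than from row order is to summarise per quarter and then join each quarter to the one three months earlier; missing quarters then yield NA instead of a silently wrong lag. A sketch, assuming the column names from the question's code:
library(dplyr)
library(lubridate)
qtr_summary <- x %>%
  mutate(QTR.Start = floor_date(Month.of.Day.Timestamp, "quarter")) %>%
  group_by(location_country, Device.Type, QTR.Start) %>%
  summarise(Sum.Visits = sum(Visits), .groups = "drop")
qtr_change <- qtr_summary %>%
  mutate(Prev.QTR.Start = QTR.Start %m-% months(3)) %>%  # the actual previous quarter, not the previous row
  left_join(qtr_summary,
            by = c("location_country", "Device.Type",
                   "Prev.QTR.Start" = "QTR.Start"),
            suffix = c("", ".prev")) %>%
  mutate(Visit.Change = Sum.Visits / Sum.Visits.prev - 1)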
