Rbind Difference of rows - r

I want to determine the difference between the rows of each column and have that total difference rbinded at the end. Below is a sample dataset:
DATE <- as.Date(c('2016-11-28','2016-11-29'))
TYPE <- c('A', 'B')
Revenue <- c(2000, 1000)
Sales <- c(1000, 4000)
Price <- c(5.123, 10.234)
Material <- c(10000, 7342)
df<-data.frame(DATE, TYPE, Revenue, Sales, Price, Material)
df
        DATE TYPE Revenue Sales  Price Material
1 2016-11-28    A    2000  1000  5.123    10000
2 2016-11-29    B    1000  4000 10.234     7342
How do I calculate the difference of each of the columns to produce this total:
        DATE TYPE Revenue Sales  Price Material
1 2016-11-28    A    2000  1000  5.123    10000
2 2016-11-29    B    1000  4000 10.234     7342
3 DIFFERENCE         -1000  3000  5.111    -2658
I can easily do it by columns, but I'm having trouble doing it by row.
Any help would be great, thanks!

Since 'DATE' is of class Date, we need to convert it to character before rbinding with the string "DIFFERENCE". Other than that, subset the numeric columns of 'df', loop over them with lapply, take the difference, concatenate with the 'DATE' and 'TYPE' values, and rbind with the original dataset.
df$DATE <- as.character(df$DATE)
rbind(df, c(DATE = "DIFFERENCE", TYPE = NA, lapply(df[-(1:2)], diff)))
#        DATE TYPE Revenue Sales  Price Material
#1 2016-11-28    A    2000  1000  5.123    10000
#2 2016-11-29    B    1000  4000 10.234     7342
#3 DIFFERENCE <NA>   -1000  3000  5.111    -2658
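For comparison, a minimal dplyr sketch of the same idea (my adaptation, not part of the original answer; it assumes dplyr 1.0+ for across()):
library(dplyr)
df %>%
  mutate(DATE = as.character(DATE)) %>%   # DATE must be character for the label
  bind_rows(
    df %>%
      summarise(across(c(Revenue, Sales, Price, Material), diff)) %>%
      mutate(DATE = "DIFFERENCE")         # bind_rows fills TYPE with NA
  )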


R: Only keeping the first observation of the month in dataset

I have the following kind of dataframe, with thousands of columns and rows. The first column contains dates, and the following columns contain asset return indexes corresponding to each date.
DATE        Asset_1  Asset_2  Asset_3  Asset_4
2000-01-01     1000      300     2900       NA
...
2000-01-31     1100      350     2950       NA
2000-02-02     1200      330     2970      100
...
2000-02-28     1200      360     3000      200
2000-03-01     1200      370     3500      300
I want to make this into a monthly dataset by only keeping the first observation of the month.
I have come up with the following script:
library(dplyr)
library(lubridate)
monthly <- daily %>% filter(day(DATE) == 1)
However, the problem with this is that it doesn't work for months where the first day of the month is not a trading date (i.e., it is missing from the daily dataset).
So when I run the command, months whose first day is missing from the data are excluded from my dataset entirely.
If the data is always ordered, you could group by year/month, then keep (slice) the first record from each group. Like:
df <- data.frame(mydate = as.Date("2023-01-01") + 1:45)
library(tidyverse)
library(lubridate)
df %>%
  group_by(ym = paste(year(mydate), month(mydate))) %>%
  # group_by(year(mydate), month(mydate)) %>%
  slice_head(n = 1)
Use slice_min:
library(dplyr) # version 1.1.0 or later
library(zoo)
daily %>%
  mutate(ym = as.yearmon(DATE)) %>%
  slice_min(DATE, by = ym)
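If you prefer base R, a minimal sketch under the same assumption that daily is ordered by DATE (my addition, not from the original answers):
# duplicated() marks every occurrence of a year-month after the first,
# so negating it keeps the first observation of each month
monthly <- daily[!duplicated(format(daily$DATE, "%Y-%m")), ]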

How to calculate average of adjacent rows?

I have a DF with people's salary data and their job; one row is one person. I need to calculate the average salary of 3 people in the same job and make a new DF out of it. The 3 people must be in the same job, and their wages must be adjacent when the DF is sorted from highest to lowest salary: the average covers the person themselves and the ones directly above and below them, provided they have the same job. The people with the highest and lowest salary in a job are excluded, as they have nobody above or below them.
This is a sample of the data I have:
Job salary
IT 5000
IT 4500
IT 4000
IT 4000
Sales 4500
Sales 4500
Sales 4000
Sales 3000
Sales 2500
HR 3000
HR 2500
HR 2300
This is what I would like to get (where the average salary had decimal places, I rounded it; in the R DF there is no need to do that, decimal places are fine):
Job salary
IT 4500
IT 4167
Sales 4333
Sales 3833
Sales 3167
HR 2600
I'm stuck, as I can't figure out how to calculate the average of the 3 people in the same job and exclude the top and bottom. Hope you can help.
Thank you
You want a rolling average by group. This can be done with zoo::rollmean coupled with dplyr::group_by.
library(dplyr)
library(zoo)
dat %>%
  group_by(Job) %>%
  summarise(mean = rollmean(salary, 3, align = "right"))
Job mean
<fct> <dbl>
1 IT 4500
2 IT 4167.
3 Sales 4333.
4 Sales 3833.
5 Sales 3167.
6 HR 2600
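Note that dplyr 1.1.0 and later warn when summarise() returns more than one row per group; reframe() is the suggested replacement there. A sketch under that assumption:
library(dplyr) # version 1.1.0 or later
library(zoo)
dat %>%
  group_by(Job) %>%
  reframe(mean = rollmean(salary, 3, align = "right"))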
Here are some base R options
> with(df, stack(tapply(salary, Job, function(x) rowMeans(embed(x, 3)))))
values ind
1 2600.000 HR
2 4500.000 IT
3 4166.667 IT
4 4333.333 Sales
5 3833.333 Sales
6 3166.667 Sales
> aggregate(salary ~ ., df, function(x) rowMeans(embed(x, 3)))
Job salary
1 HR 2600
2 IT 4500.000, 4166.667
3 Sales 4333.333, 3833.333, 3166.667
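The workhorse in both versions is embed(): it builds a matrix whose rows are consecutive length-3 windows of the vector, so rowMeans() over it is a rolling mean. A quick illustration on the IT salaries:
> embed(c(5000, 4500, 4000, 4000), 3)
     [,1] [,2] [,3]
[1,] 4000 4500 5000
[2,] 4000 4000 4500
> rowMeans(embed(c(5000, 4500, 4000, 4000), 3))
[1] 4500.000 4166.667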

Remove a column with bad quality data if there is good quality data for the same period

I have the following problem: my data contains good and bad quality data. For example, for the time 2017-12-31, I have a row with good quality data (Quality = a) and the value 800, and a row with bad quality data (Quality = b) and the value 750.
Quality Time Value
1 a 2017-12-31 800
2 a 2018-12-31 500
3 b 2017-12-31 750
4 b 2018-12-31 480
5 b 2019-12-31 200
Sample data frame:
df <- data.frame(Quality = c("a", "a", "b", "b", "b"), Time = c("2017-12-31", "2018-12-31", "2017-12-31", "2018-12-31", "2019-12-31"), Value = c(800, 500, 750, 480, 200))
I want to keep the "bad quality" data (Quality = b) only when there is no "good quality" data (Quality = a) for each period (Time).
Hence, the expected output is:
Quality Time Value
1 a 2017-12-31 800
2 a 2018-12-31 500
3 b 2019-12-31 200
I tried to solve this problem with an if statement, but failed. My real data has over 10000 rows and multiple dates. Any help is appreciated.
You can do this with the help of match, which picks the position of the first 'a' (falling back to 'b') within each Time group:
library(dplyr)
df %>%
  group_by(Time) %>%
  slice(first(na.omit(match(c('a', 'b'), Quality)))) %>%
  ungroup()
# Quality Time Value
# <chr> <chr> <dbl>
#1 a 2017-12-31 800
#2 a 2018-12-31 500
#3 b 2019-12-31 200
You can do this by sorting by quality and then deduplicating by time.
library(dplyr)
df %>%
  arrange(Quality) %>%               # sort by Quality so "a" comes first
  distinct(Time, .keep_all = TRUE)   # keep the first row per Time, all columns
If you'd prefer base R you can do the same thing using order(Quality) and df[which(!duplicated(df$Time)),].
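A minimal sketch of that base R route, using the same df as above:
# Sort so the good quality rows ("a") come first within each Time,
# then keep only the first row seen for each Time value
df_sorted <- df[order(df$Quality), ]
df_sorted[!duplicated(df_sorted$Time), ]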

How to merge two datasets with conditions?

Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one in such a way that the stock price on the month-end date (from the second dataset) is attached to the corresponding month in the revenue dataset.
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the leap year problem.
You could extract the day and month from the Date column and, for each Company and Month, select the row with the max day. Then join this data to the revenue data and select the required columns.
library(dplyr)
stock %>%
  mutate(date = as.integer(substring(Date, 7)),
         Month = as.integer(substring(Date, 5, 6))) %>%
  group_by(Company, Month) %>%
  slice(which.max(date)) %>%
  inner_join(revenue, by = c('Company', 'Month')) %>%
  ungroup() %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506
First, notice that there is no 1999-02-29!
To get the month ends, use ISOdate on the first day of the following month and subtract one day. Then just merge the two datasets.
merge(transform(fi, Date = as.Date(ISOdate(fi$Year, fi$Month + 1, 1)) - 1),
      transform(se, Date = as.Date(as.character(Date), format = "%Y%m%d")))[-2]
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
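The month-end trick in isolation (a quick check of my own, not part of the original answer):
as.Date(ISOdate(1999, 3, 1)) - 1 # first day of March 1999 minus one day
# [1] "1999-02-28"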
Data:
fi <- read.table(header=T, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=T, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!

Consecutive/Unbroken occurrences in time series

I'm working with a data table in R containing quarterly information on the products being sold in grocery stores across the United States. In particular, there is a column for the date, a column for the store, and a column for the product. For example, here is a (very small) subset of the data:
Date StoreID ProductID
2000-03-31 10001 20001
2000-03-31 10001 20002
2000-03-31 10002 20001
2000-06-30 10001 20001
For each product in each store, I want to find out for how many consecutive quarters the product has been sold in that store up to that date. For example, if we restrict to just looking at staplers sold in a specific store, we would have:
Date StoreID ProductID
2000-03-31 10001 20001
2000-06-30 10001 20001
2000-09-30 10001 20001
2000-12-31 10001 20001
2001-06-30 10001 20001
2001-09-30 10001 20001
2001-12-31 10001 20001
Assuming that's all the data for that combination of StoreID and ProductID, I want to assign a new variable as:
Date StoreID ProductID V
2000-03-31 10001 20001 1
2000-06-30 10001 20001 2
2000-09-30 10001 20001 3
2000-12-31 10001 20001 4
2001-06-30 10001 20001 1
2001-09-30 10001 20001 2
2001-12-31 10001 20001 3
2002-03-31 10001 20001 4
2002-06-30 10001 20001 5
2002-09-30 10001 20001 6
2002-12-31 10001 20001 7
2004-03-31 10001 20001 1
2004-06-30 10001 20001 2
Note that we roll over after Q4 2000 because the product was not sold during Q1 of 2001. Additionally, we roll over after Q4 2002 because the product was not sold during Q1 2003. The next time the product was sold was Q1 2004, which was assigned a 1.
The issue I am having is that my actual dataset is quite large (on the order of 10 million rows), so this needs to be done efficiently. The only techniques I've been able to come up with are dreadfully inefficient. Any advice would be greatly appreciated.
You can use a custom function that calculates the difference between consecutive quarters.
# Load data.table
library(data.table)
# Set data as a data.table object
setDT(data)
# Set a key, as this might be big data
setkey(data, StoreID, ProductID)
consecutiveQuarters <- function(date, timeGap = 14) {
  # Calculate the difference between consecutive dates and flag a break
  # whenever the gap exceeds timeGap weeks (14 weeks is roughly one quarter)
  shifts <- cumsum(c(FALSE, abs(difftime(date[-length(date)], date[-1], units = "weeks")) > timeGap))
  # Number the rows 1..n within each unbroken run of quarters
  ave(shifts, shifts, FUN = seq_along)
}
# Calculate consecutive quarters by StoreID and ProductID
data[, V := consecutiveQuarters(Date), .(StoreID, ProductID)]
Create a variable which is 1 if the product is sold in a quarter and 0 if not. Order the variable so it starts at the present and goes backward in time.
Compare the cumulative sum of such a variable to a sequence of identical length. When sales drop to zero, the cumulative sum will no longer equal the sequence. Summing the number of times the cumulative sum equals the sequence gives the number of consecutive quarters in which sales were positive.
data <- data.frame(
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4),
  store = as.factor(c(1, 1, 1, 1, 1, 1, 1, 1)),
  product = as.factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
  numsold = c(5, 6, 0, 1, 7, 3, 2, 14)
)
sortedData <- data[order(-data$quarter), ]
storeValues <- c("1")
productValues <- c("1", "2")
dataConsec <- data.frame(store = NULL, product = NULL, ConsecutiveSales = NULL)
for (storeValue in storeValues) {
  for (productValue in productValues) {
    prodSoldinQuarter <-
      as.numeric(sortedData[sortedData$store == storeValue &
                              sortedData$product == productValue, ]$numsold > 0)
    dataConsec <- rbind(dataConsec,
                        data.frame(
                          store = storeValue,
                          product = productValue,
                          ConsecutiveSales =
                            sum(as.numeric(cumsum(prodSoldinQuarter) ==
                                             seq(1, length(prodSoldinQuarter))))
                        ))
  }
}
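The same cumulative-sum idea can also be vectorized per group instead of looped; a hedged dplyr sketch on the example data above (my adaptation, not the answerer's code):
library(dplyr)
data %>%
  arrange(store, product, desc(quarter)) %>%   # most recent quarter first
  group_by(store, product) %>%
  summarise(ConsecutiveSales = sum(cumsum(numsold > 0) == seq_along(numsold)),
            .groups = "drop")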
As I understand from your question, you really need your V column to be the quarter of the year, not a count of products in each quarter. You can use something like this:
# to_quarters returns the year's quarter of a date given as a character
# string, based on the month part of the string
to_quarters <- function(date_string) {
  month <- as.numeric(substr(date_string, 6, 7))
  as.integer((month - 1) / 3) + 1
}
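# For instance, a quick check of the helper on one of the sample dates (my addition):
to_quarters("2000-03-31") # returns 1, since March falls in the first quarter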
# with the tidyverse
library(tidyverse)
# your data as a tibble
data_set_tibble <- as_tibble(YOUR_DATA)
# here you create your table
data_set_tibble %>% mutate(V = to_quarters(Date) %>% as.integer())
# alternative with the data.table library
library(data.table)
# your data as a data.table
data_set <- as.data.table(YOUR_DATA)
# here you create your table
data_set[, .(Date, StoreID, ProductID, V = to_quarters(Date))]
For tidyverse and data.table the performance is the same; in my case, 5,000,000 rows are processed in about 12 seconds.
