How to calculate average of adjacent rows? - r

I have a data frame with people's salaries and their jobs, one row per person. For each person, I need the average of three salaries: their own and those of the people directly above and below them in the same job, when the data frame is sorted from highest to lowest salary. The results should go into a new data frame. The people with the highest and lowest salary in each job are excluded, as they have nobody above or below them.
This is a sample of the data I have:
Job salary
IT 5000
IT 4500
IT 4000
IT 4000
Sales 4500
Sales 4500
Sales 4000
Sales 3000
Sales 2500
HR 3000
HR 2500
HR 2300
This is what I would like to get. (Where the average salary had decimal places I rounded it here, but there is no need to round in the R data frame; decimal places are fine.)
Job salary
IT 4500
IT 4167
Sales 4333
Sales 3833
Sales 3167
HR 2600
I'm stuck because I can't figure out how to calculate the average of the 3 people on the same job while excluding the top and bottom. I hope you can help.
Thank you

You want a rolling average by group. This can be done with zoo::rollmean coupled with dplyr::group_by.
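library(dplyr)
library(zoo)
# Note: summarise() returning several rows per group is deprecated
# since dplyr 1.1.0; reframe() is the replacement there.
dat %>%
  group_by(Job) %>%
  summarise(mean = rollmean(salary, 3, align = "right"))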
Job mean
<fct> <dbl>
1 IT 4500
2 IT 4167.
3 Sales 4333.
4 Sales 3833.
5 Sales 3167.
6 HR 2600

Here are some base R options
> with(df,stack(tapply(salary, Job, function(x) rowMeans(embed(x, 3)))))
values ind
1 2600.000 HR
2 4500.000 IT
3 4166.667 IT
4 4333.333 Sales
5 3833.333 Sales
6 3166.667 Sales
> aggregate(salary ~ ., df, function(x) rowMeans(embed(x, 3)))
Job salary
1 HR 2600
2 IT 4500.000, 4166.667
3 Sales 4333.333, 3833.333, 3166.667
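The base R options above rely on embed(), which may be less familiar than rollmean. As a minimal sketch of what it does, here are the IT salaries from the question on their own (the names x, windows and window_means are only illustrative):

```r
# IT salaries from the question, already sorted from highest to lowest
x <- c(5000, 4500, 4000, 4000)

# embed(x, 3) puts each run of 3 consecutive values in one row
# (latest value first), giving length(x) - 3 + 1 = 2 rows
windows <- embed(x, 3)
windows
#>      [,1] [,2] [,3]
#> [1,] 4000 4500 5000
#> [2,] 4000 4000 4500

# The mean of each row is the 3-person average
window_means <- rowMeans(windows)
window_means
```

Because only complete windows are produced, the first and last person of each job are dropped automatically, which is what the question asks for.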

Related

Income at different time periods (year, month, week)- I want to standardise the data

I have a dataset that looks a bit like this:
Income  Income period
1500    3
400     2
30000   1
Where 1 is yearly, 2 is weekly, and 3 is monthly.
I want to create a column that will show the income yearly for all rows so that I can compare them more easily.
Apologies if this is a very simple question. I guess I could recode 3 to 12 and 2 to 52 and then multiply the two columns together, but I wanted to see if anyone has a better way, as there are actually multiple columns like this with different codes for time periods that I need to fix.
library(dplyr)
df %>%
  mutate(income_yr = case_when(period == 3 ~ income * 12,
                               period == 2 ~ income * 52,
                               TRUE ~ income))
#> income period income_yr
#> 1 1500 3 18000
#> 2 400 2 20800
#> 3 30000 1 30000
data
df <- data.frame(income = c(1500, 400, 30000),
period = c(3, 2, 1))
Created on 2021-04-13 by the reprex package (v2.0.0)
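Since the question mentions several columns with different period codes, a named lookup vector may be easier to maintain than one case_when per column. This is only a sketch in base R; the name per_year is made up for illustration, and the code-to-multiplier mapping is the one stated in the question (1 = yearly, 2 = weekly, 3 = monthly).

```r
df <- data.frame(income = c(1500, 400, 30000),
                 period = c(3, 2, 1))

# Hypothetical lookup: period code -> number of periods per year
per_year <- c("1" = 1, "2" = 52, "3" = 12)

# Index the lookup by the period code used as a name, not as a position
df$income_yr <- df$income * unname(per_year[as.character(df$period)])
df
```

A column that uses different codes would only need its own lookup vector; the multiplication line stays the same.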

How to merge two datasets with conditions?

Say, I have two datasets:
First - Revenue Dataset
Year Month Sales Company
1988 5 100 A
1999 2 50 B
Second - Stock Price Data Set
Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990229 B 506
I need to merge these two datasets into one so that the stock price on the month-end date (from the second dataset) is attached to the corresponding month in the revenue dataset (the first dataset).
So the output would be:
Year Month Sales Company Stock
1988 5 100 A 201
1999 2 50 B 506
You can ignore the problem with leap year
You could extract the month and day from the Date column and, for each Company and each Month, select the row with the latest day. Then join this data to the revenue data and select the required columns.
library(dplyr)
stock %>%
  # Date is in yyyymmdd form: characters 5-6 are the month, 7-8 the day
  mutate(date = as.integer(substring(Date, 7)),
         Month = as.integer(substring(Date, 5, 6))) %>%
  group_by(Company, Month) %>%
  slice(which.max(date)) %>%
  # Joins on Company and Month only, so the year is not checked
  inner_join(revenue, by = c('Company', 'Month')) %>%
  ungroup() %>%
  select(Year, Month, Sales, Company, Stock)
# Year Month Sales Company Stock
# <int> <int> <int> <chr> <int>
#1 1988 5 100 A 201
#2 1999 2 50 B 506
First, notice that there is no 1999-02-29!
To get the month ends, use ISOdate on the first of the following month and subtract one day. Then just merge them.
# Note: Month + 1 is 13 for December, for which ISOdate returns NA
merge(transform(fi, Date=as.Date(ISOdate(fi$Year, fi$Month + 1, 1)) - 1),
      transform(se, Date=as.Date(as.character(Date), format="%Y%m%d")))[-2]
# Company Year Month Sales Stock
# 1 A 1988 5 100 201
# 2 B 1999 2 50 506
Data:
fi <- read.table(header=TRUE, text="Year Month Sales Company
1988 5 100 A
1999 2 50 B")
se <- read.table(header=TRUE, text="Date Company Stock
19880530 A 200
19880531 A 201
19990225 B 500
19990228 B 506") ## note: date corrected!

Subtracting subset from larger dataset in R

Hi all: I have two variables. The first is entitled WITHOUT_VERANDAS. It is a list of cities, aggregated by average rental prices of homes WITHOUT verandas (there are about 200 rows):
City Price
1 Appleton 5000
2 Ames 9000
3 Lodi 1020
4 Milwaukee 2010
5 Barstow 2000
6 Chicago 2320
7 Champaign 2000
The second variable is entitled WITH_VERANDAS. It's a list of cities, aggregated by average rental prices of homes WITH verandas (there are about 10 rows, this is a subset of the previous dataset, since not every city has rental properties with verandas):
City Price
1 Milwaukee 3000
2 Chicago 2050
3 Lodi 5000
For each city on the WITH_VERANDAS list, I want to subtract that city's WITHOUT_VERANDAS city value from the first list. I want to see which cities have the highest or lowest differential. Essentially, the result should only include the WITH_VERANDAS data.
I've tried this:
difference <- WITH_VERANDAS$Price-WITHOUT_VERANDAS$Price
View(difference)
However, this returns as many rows as the WITHOUT_VERANDAS dataset. I also get a warning:
longer object length is not a multiple of shorter object length
And the result is simply subtracting WITHOUT_VERANDAS's row 1 from WITH_VERANDA's row 1, as seen in the results: (for example, row 1 of the output would be the value of Milwaukee-Appleton, row 2 output would be Chicago - Ames, and so forth)
1. -2000
2. -6950
If I could only filter WITHOUT_VERANDAS to include only the cities included in WITH_VERANDAS, I think it would work. Thanks!
R2evans, thank you! This worked great. Now, I have:
City Price.x Price.y
1 Appleton NA 5000
2 Ames NA 9000
3 Lodi 5000 1020
4 Milwaukee 3000 2010
How would I go about filtering this list to take out any row where Price.x is NA, i.e., all rows that did not match? Thanks again!
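The answer this follow-up refers to is not shown here, so the merge() call below is an assumption chosen to reproduce the Price.x / Price.y layout above. As a minimal sketch with the sample data from the question, the unmatched rows can be dropped with is.na() before taking the difference:

```r
WITHOUT_VERANDAS <- data.frame(
  City = c("Appleton", "Ames", "Lodi", "Milwaukee",
           "Barstow", "Chicago", "Champaign"),
  Price = c(5000, 9000, 1020, 2010, 2000, 2320, 2000))
WITH_VERANDAS <- data.frame(City = c("Milwaukee", "Chicago", "Lodi"),
                            Price = c(3000, 2050, 5000))

# Assumed merge: Price.x is the WITH price, Price.y the WITHOUT price;
# all.y = TRUE keeps cities that have no WITH price (Price.x is NA)
merged <- merge(WITH_VERANDAS, WITHOUT_VERANDAS, by = "City", all.y = TRUE)

# Drop the rows that did not match
matched <- merged[!is.na(merged$Price.x), ]

# Differential per city, largest first
matched$difference <- matched$Price.x - matched$Price.y
matched <- matched[order(-matched$difference), ]
matched
```

Calling merge() without all.y = TRUE keeps only the cities present in both data frames, so the NA filter would not be needed in that case.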

Remove all rows of a category if zero in a % of cases

I have the following data set of weekly retail data ordered by Category (e.g. Chocolate), Brand (e.g. Cadbury's), and Week (1-208). CBX is a unique global identifier for each brand.
Category Brand Week Sales Price CBX
33 2 1 167650. 2.20 33 - 2
33 2 2 168044. 2.18 33 - 2
33 2 3 160770 2.24 33 - 2
I now want to remove the brands that have zero sales in more than 25% of the weeks (thus keeping the brands with positive sales in at least 156 of the 208 weeks).
At first I deleted all brands with any zero sales using dplyr, but it deleted too much of the data. This was the code I used:
library(dplyr)
Final_df_ <- Final_df %>%
group_by(Final_df$CBX) %>%
filter(!any(Sales==0 & Price==0))
Now I'm trying to change the code so it only deletes all rows belonging to a brand (CBX) if the sales of that brand are zero in more than 25% of the cases.
This is how far I've come:
Final_df_ <- Final_df %>%
group_by(Final_df$CBX) %>%
filter(!((Final_df$Sales==0)>0.75))
Thank you!
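One way to express the condition is to compute, per brand, the share of weeks with zero sales and keep only the brands where that share is at most 25%. The sketch below uses base R and a small made-up data set (4 weeks instead of 208, and a second brand "33 - 7" invented for illustration); zero_share and kept are illustrative names.

```r
# Hypothetical toy data: "33 - 2" has no zero weeks,
# "33 - 7" has zero sales in 2 of its 4 weeks
Final_df <- data.frame(
  CBX   = rep(c("33 - 2", "33 - 7"), each = 4),
  Week  = rep(1:4, times = 2),
  Sales = c(167650, 168044, 160770, 159000, 0, 120, 0, 95))

# Share of zero-sales weeks per brand, repeated on every row of the brand
zero_share <- ave(as.numeric(Final_df$Sales == 0), Final_df$CBX, FUN = mean)

# Keep all rows of brands with zero sales in at most 25% of the weeks
kept <- Final_df[zero_share <= 0.25, ]
kept
```

With dplyr, the same idea would presumably read group_by(CBX) followed by filter(mean(Sales == 0) <= 0.25): inside a grouped filter, column names refer to the current group, whereas Final_df$Sales always refers to the whole column.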

Rbind Difference of rows

I want to determine the difference of each row and have that total difference rbinded at the end. Below is a sample dataset:
DATE <- as.Date(c('2016-11-28','2016-11-29'))
TYPE <- c('A', 'B')
Revenue <- c(2000, 1000)
Sales <- c(1000, 4000)
Price <- c(5.123, 10.234)
Material <- c(10000, 7342)
df<-data.frame(DATE, TYPE, Revenue, Sales, Price, Material)
df
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
How do I calculate the difference of each of the columns to produce this total row?
DATE TYPE Revenue Sales Price Material
1 2016-11-28 A 2000 1000 5.123 10000
2 2016-11-29 B 1000 4000 10.234 7342
3 DIFFERENCE -1000 3000 5.111 -2658
I can easily do it by columns, but I am having trouble doing it by row.
Any help would be great thanks!
As 'DATE' is Date class, we may need to change it to character before proceeding with rbinding with string "DIFFERENCE". Other than that, subset the numeric columns of 'df', loop it with lapply, get the difference, concatenate with the 'DATE' and 'TYPE', and rbind with original dataset.
df$DATE <- as.character(df$DATE)
rbind(df, c(DATE = "DIFFERENCE", TYPE= NA, lapply(df[-(1:2)], diff)))
# DATE TYPE Revenue Sales Price Material
#1 2016-11-28 A 2000 1000 5.123 10000
#2 2016-11-29 B 1000 4000 10.234 7342
#3 DIFFERENCE <NA> -1000 3000 5.111 -2658
