Aggregate with a start and end of date - r

I'm new to R, so this may be simple, but I haven't found out how to do it yet.
I'm trying to aggregate my temperature data by day so I have a mean temperature for every day of the year.
Here's an example of my data and the code I wrote:
Date Qobs Ptot Fsol Temp PE X
1 1956-11-01 0.001 14.0 -99 12.0 1.4 NA
2 1956-11-02 0.001 0.0 -99 13.5 1.5 NA
3 1956-11-03 0.001 0.0 -99 13.5 1.5 NA
4 1956-11-04 0.001 0.0 -99 13.0 1.4 NA
5 1956-11-05 0.001 0.0 -99 11.5 1.3 NA
6 1956-11-06 0.001 0.0 -99 11.0 1.2 NA
7 1956-11-07 0.001 2.0 -99 12.5 1.3 NA
8 1956-11-08 0.000 0.0 -99 5.0 0.7 NA
9 1956-11-09 0.000 0.5 -99 0.0 0.4 NA
10 1956-11-10 0.000 0.0 -99 -2.5 0.2 NA
11 1956-11-11 0.000 2.5 -99 5.5 0.8 NA
12 1956-11-12 0.000 0.0 -99 7.5 0.9 NA
reg_T=aggregate(x=tmp_data$Temp, by=list(j=format(tmp_data$Date, "%j")), mean)
But as you can see, my data doesn't start on 1 January; its first day is 01/11, which complicates things later once the data is aggregated.
How can I aggregate starting from 01/01, ignoring the beginning and end of my data because they are not complete years?
Thanks!
dput() of the data:
df <- structure(list(Date = structure(c(-4809, -4808, -4807, -4806, -4805, -4804,
-4803, -4802, -4801, -4800, -4799, -4798, -4797,
-4796, -4795, -4794, -4793, -4792, -4791, -4790,
-4789, -4788, -4787, -4786, -4785, -4784, -4783,
-4782, -4781, -4780), class = "Date"),
Temp = c(12, 13.5, 13.5, 13, 11.5, 11, 12.5, 5, 0, -2.5, 5.5, 7.5,
1.5, 6, 14, 6, 0.5, 0.5, 4, 2, 9, -4.5, -11.5, -10, -4.5,
-2.5, -3.5, -1, -1.5, -7.5)),
.Names = c("Date", "Temp"), row.names = c(NA, 30L), class = "data.frame")

What about something like this:
require(tidyverse)
df %>%
mutate(MonthDay = str_sub(as.character(Date), 6)) %>%
group_by(MonthDay) %>%
summarise(MeanDay = mean(Temp, na.rm = TRUE))
# A tibble: 30 x 2
MonthDay MeanDay
<chr> <dbl>
1 11-01 12.0
2 11-02 13.5
3 11-03 13.5
4 11-04 13.0
5 11-05 11.5
6 11-06 11.0
7 11-07 12.5
8 11-08 5.00
9 11-09 0.
10 11-10 -2.50
# ... with 20 more rows
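The pipeline above averages over every date in the data, including the incomplete first and last years. One way to restrict the aggregation to complete calendar years could look like the base R sketch below. The data is made up (a hypothetical gap-free daily series from 1956-11-01 to 1959-03-31), and treating a year as complete when both its 1 January and 31 December are present is an assumption.

```r
# Hypothetical daily series; the Temp values are made up for illustration.
dates <- seq(as.Date("1956-11-01"), as.Date("1959-03-31"), by = "day")
tmp_data <- data.frame(Date = dates,
                       Temp = as.numeric(format(dates, "%j")) / 10)

yr <- format(tmp_data$Date, "%Y")
all_days <- format(tmp_data$Date, "%Y-%m-%d")
years <- unique(yr)

# Assumption: no gaps, so a year is complete if it contains 1 Jan and 31 Dec.
complete <- years[paste0(years, "-01-01") %in% all_days &
                  paste0(years, "-12-31") %in% all_days]

# Keep only rows from complete years and label each row with its month-day.
full <- tmp_data[yr %in% complete, ]
full$MonthDay <- format(full$Date, "%m-%d")

# Mean temperature per calendar day across the complete years only.
reg_T <- aggregate(Temp ~ MonthDay, data = full, FUN = mean)
head(reg_T, 3)
```

Grouping on "%m-%d" rather than "%j" keeps 29 February as its own group instead of shifting every later day by one in leap years.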

Related

R apply weighting operation by multiple groups

Hi I have a dataset like this:
City = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3)
Area=c("A","B","A","B","A","A","B","B","B","A","A","B","A","A")
Weights=c(2.4,1.9,0.51,0.7,2.2,1.5,1.86,1.66,1.09,2.57,2.4,0.9,3.4,3.7)
Tax=c(16,93,96,44,67,73,12,65,81,22,39,94,41,30)
z = data.frame(City,Area,Weights,Tax)
What I want to do is to obtain the weighted tax for each City and each area.
For example, for row 1 above the computed value is:
2.4*16/(2.40+0.51+2.20) and so on.
I can do that using this function:
cit_data=list()
weighted_tax=function(z){
for (cit in unique(z$City)){
city_data=z[z$City==cit,]
area_new=list()
for (ar in unique(z$Area)){
area_data=city_data[city_data$Area==ar,]
area_data$area_dat_n = (area_data$Weights*area_data$Tax)/sum(area_data$Weights)
area_new=rbind(area_new,area_data)
}
cit_data=rbind(cit_data,area_new)
}
return(cit_data)
}
tax=weighted_tax(z)
Is there an easier/cleaner way to implement this? Thanks in advance.
Using dplyr:
library(dplyr)
z %>%
group_by(City, Area) %>%
mutate(Weighted_tax = Tax*Weights/sum(Weights))
Output:
# A tibble: 14 x 5
# Groups: City, Area [6]
City Area Weights Tax Weighted_tax
<dbl> <fct> <dbl> <dbl> <dbl>
1 1 A 2.4 16 7.51
2 1 B 1.9 93 68.0
3 1 A 0.51 96 9.58
4 1 B 0.7 44 11.8
5 1 A 2.2 67 28.8
6 2 A 1.5 73 26.9
7 2 B 1.86 12 4.84
8 2 B 1.66 65 23.4
9 2 B 1.09 81 19.2
10 2 A 2.57 22 13.9
11 3 A 2.4 39 9.85
12 3 B 0.9 94 94.
13 3 A 3.4 41 14.7
14 3 A 3.7 30 11.7
We could also do this in base R with by:
do.call(rbind, by(z, z[c("City", "Area")], function(x)
cbind(x, area.dat.n=with(x, Weights * Tax / sum(Weights)))))
# City Area Weights Tax area.dat.n
# 1 1 A 2.40 16 7.514677
# 3 1 A 0.51 96 9.581213
# 5 1 A 2.20 67 28.845401
# 6 2 A 1.50 73 26.904177
# 10 2 A 2.57 22 13.891892
# 11 3 A 2.40 39 9.852632
# 13 3 A 3.40 41 14.673684
# 14 3 A 3.70 30 11.684211
# 2 1 B 1.90 93 67.961538
# 4 1 B 0.70 44 11.846154
# 7 2 B 1.86 12 4.841649
# 8 2 B 1.66 65 23.405640
# 9 2 B 1.09 81 19.151844
# 12 3 B 0.90 94 94.000000
or with ave:
cbind(z,
area.dat.n=
apply(cbind(z, w=with(z, ave(Weights, City, Area, FUN=sum))), 1, function(x)
x[3] * x[4] / x[5]))
# City Area Weights Tax area.dat.n
# 1 1 1 2.40 16 7.514677
# 2 1 2 1.90 93 67.961538
# 3 1 1 0.51 96 9.581213
# 4 1 2 0.70 44 11.846154
# 5 1 1 2.20 67 28.845401
# 6 2 1 1.50 73 26.904177
# 7 2 2 1.86 12 4.841649
# 8 2 2 1.66 65 23.405640
# 9 2 2 1.09 81 19.151844
# 10 2 1 2.57 22 13.891892
# 11 3 1 2.40 39 9.852632
# 12 3 2 0.90 94 94.000000
# 13 3 1 3.40 41 14.673684
# 14 3 1 3.70 30 11.684211
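Since ave() already returns the group sum aligned to each row, the row-wise apply() is not strictly needed. A shorter base R sketch, using the data from the question (the column name Weighted_tax is borrowed from the dplyr answer):

```r
# Data as given in the question.
z <- data.frame(
  City    = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  Area    = c("A", "B", "A", "B", "A", "A", "B", "B", "B", "A", "A", "B", "A", "A"),
  Weights = c(2.4, 1.9, 0.51, 0.7, 2.2, 1.5, 1.86, 1.66, 1.09, 2.57, 2.4, 0.9, 3.4, 3.7),
  Tax     = c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94, 41, 30)
)

# ave() returns, for every row, the sum of Weights within its City/Area group,
# so the weighted tax is a plain vectorised expression.
z$Weighted_tax <- with(z, Weights * Tax / ave(Weights, City, Area, FUN = sum))
z
```

This keeps the original row order and works whether Area is a character or a factor column.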
Data
z <- structure(list(City = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3), Area = structure(c(1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 1L), .Label = c("A", "B"), class = "factor"), Weights = c(2.4,
1.9, 0.51, 0.7, 2.2, 1.5, 1.86, 1.66, 1.09, 2.57, 2.4, 0.9, 3.4,
3.7), Tax = c(16, 93, 96, 44, 67, 73, 12, 65, 81, 22, 39, 94,
41, 30)), class = "data.frame", row.names = c(NA, -14L))

sales calculation as per the below method

I have a sales calculation to do and need to derive some basic predicted sales using the formula given.
df1: cut_of_sales
cut-off_sales
1
2
1
3
df2: actual df for data:
Sales
NA
NA
NA
NA
1.2
2.1
1.4
1.1
2.1
1.4
1.1
1.2
2.1
1.4
1.1
1.2
2.1
1.4
1.1
2.3
First 4 quarters are NA. Keep them as they are.
Start with the 5th row by adding the first value of cutoff_sales.
Explanation:
1. cutoff_sales is predefined by the company; 4 values are given, one for each quarter.
2. Add the q1 value of the cutoff sales to 2010q1 = ansq1
3. Add the q2 value of the cutoff sales to 2010q2 = ansq2
4. Do the same for q3 and q4.
The results of the additions above will then be the input for the next year's (2011) quarters.
so ansq1 + 2011q1 = ans...
ansq2 + 2011q2 = ans ....
and so on: the answers for the 2012 quarters will be the input for 2013, and likewise for the rest of the 10 years.
Please help me with this addition.
I was only able to do the addition for the first year.
Please help me write a function or a loop that does this iteratively, as there will be many more years to come.
Thanks.
For updated question
With the updated question, the following is one way to achieve the task. Since this is quarterly data and the first four rows are NA, you can first add the values of cut_off in mydf1 to Sales. Then you create a grouping variable, in which 1 indicates the first quarter. You can sum up Sales with cumsum(), as I suggested in my previous answer. It seems that you want to keep the NAs, so I converted 0 to NA at the end.
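library(dplyr)
# add the cut-off values to the first year that has data
mydf2$Sales[5:8] <- mydf2$Sales[5:8] + mydf1$cut_off
# running total per quarter; NAs count as 0 and are restored afterwards
group_by(mydf2, quarter = rep(1:4, times = n()/4)) %>%
mutate(Sales = cumsum(if_else(is.na(Sales), 0, Sales)),
Sales = na_if(Sales, 0))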
Sales quarter
<dbl> <int>
1 NA 1
2 NA 2
3 NA 3
4 NA 4
5 2.20 1
6 4.10 2
7 2.40 3
8 4.10 4
9 4.30 1
10 5.50 2
11 3.50 3
12 5.30 4
13 6.40 1
14 6.90 2
15 4.60 3
16 6.50 4
17 8.50 1
18 8.30 2
19 5.70 3
20 8.80 4
DATA
mydf2 <- structure(list(Sales = c(NA, NA, NA, NA, 2.2, 4.1, 2.4, 4.1,
2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 2.3)), .Names = "Sales", row.names = c(NA,
-20L), class = "data.frame")
For original question
Here is one approach. I considered cases where you would have NA in your data. First, I added the values of cut_off in mydf1. Then, I created a new variable called quarter and used it to define groups. For each group, I applied cumsum() to sum up the values. If you do not have any NA, the final line would be mutate(sales = cumsum(sales)) in the code below.
library(dplyr)
mydf2 %>%
mutate(sales = if_else(substr(sales_quarter, 1,4) == "2010", sales + mydf1$cut_off, sales)) %>%
group_by(quarter = substr(sales_quarter, 5, 6)) %>%
mutate(sales = cumsum(if_else(is.na(sales), 0, sales)))
sales_quarter sales quarter
<chr> <dbl> <chr>
1 2010Q1 2.20 Q1
2 2010Q2 4.10 Q2
3 2010Q3 2.40 Q3
4 2010Q4 4.10 Q4
5 2011Q1 4.30 Q1
6 2011Q2 5.50 Q2
7 2011Q3 3.50 Q3
8 2011Q4 5.30 Q4
9 2012Q1 6.40 Q1
10 2012Q2 6.90 Q2
11 2012Q3 4.60 Q3
12 2012Q4 6.50 Q4
13 2013Q1 8.50 Q1
14 2013Q2 8.30 Q2
15 2013Q3 5.70 Q3
16 2013Q4 8.80 Q4
DATA
mydf1 <- structure(list(cut_off = c(1, 2, 1, 3)), .Names = "cut_off", row.names = c(NA,
4L), class = "data.frame")
mydf2 <- structure(list(sales_quarter = c("2010Q1", "2010Q2", "2010Q3",
"2010Q4", "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2012Q1", "2012Q2",
"2012Q3", "2012Q4", "2013Q1", "2013Q2", "2013Q3", "2013Q4"),
sales = c(1.2, 2.1, 1.4, 1.1, 2.1, 1.4, 1.1, 1.2, 2.1, 1.4,
1.1, 1.2, 2.1, 1.4, 1.1, 2.3)), .Names = c("sales_quarter",
"sales"), class = "data.frame", row.names = c(NA, -16L))
New sequential answer:
> df
year_quater sales pred_sales
1 2010Q1 1.2 NA
2 2010Q2 2.1 NA
3 2010Q3 1.4 NA
4 2010Q4 1.1 NA
5 2011Q1 2.1 NA
6 2011Q2 1.4 NA
7 2011Q3 1.1 NA
8 2011Q4 1.2 NA
9 2012Q1 2.1 NA
10 2012Q2 1.4 NA
11 2012Q3 1.1 NA
12 2012Q4 1.2 NA
13 2013Q1 2.1 NA
14 2013Q2 1.4 NA
15 2013Q3 1.1 NA
16 2013Q4 2.3 NA
pred <- c(1,2,1,3)
for(i in seq(1, nrow(df), 4)){
df$pred_sales[i:(i+3)] <- df$sales[i:(i+3)] + pred
pred <- df$pred_sales[i:(i+3)]
}
> df
year_quater sales pred_sales
1 2010Q1 1.2 2.2
2 2010Q2 2.1 4.1
3 2010Q3 1.4 2.4
4 2010Q4 1.1 4.1
5 2011Q1 2.1 4.3
6 2011Q2 1.4 5.5
7 2011Q3 1.1 3.5
8 2011Q4 1.2 5.3
9 2012Q1 2.1 6.4
10 2012Q2 1.4 6.9
11 2012Q3 1.1 4.6
12 2012Q4 1.2 6.5
13 2013Q1 2.1 8.5
14 2013Q2 1.4 8.3
15 2013Q3 1.1 5.7
16 2013Q4 2.3 8.8
This answer builds a sequence from the number of rows in your data and loops over it in steps of 4 rows: each iteration calculates pred_sales for one year and then updates pred with those values for use in the next iteration.
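Because each predicted value is just the cut-off for that quarter plus the running total of sales in that quarter, the same numbers can also be obtained without a loop. A minimal sketch, assuming the rows are ordered by year and quarter with exactly four rows per year (the vector names are illustrative):

```r
# Sales and cut-off values as given in the question.
sales   <- c(1.2, 2.1, 1.4, 1.1, 2.1, 1.4, 1.1, 1.2,
             2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 2.3)
cut_off <- c(1, 2, 1, 3)

# Quarter index 1..4 repeated down the rows (assumes complete, ordered years).
quarter <- rep(1:4, length.out = length(sales))

# Running total of sales within each quarter, shifted by that quarter's cut-off.
pred_sales <- ave(sales, quarter, FUN = cumsum) + cut_off[quarter]
pred_sales
```

The first year comes out as 2.2, 4.1, 2.4, 4.1 and the last as 8.5, 8.3, 5.7, 8.8, matching the loop's output.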

Calculate formula over all rows and specific columns of dataframe

I have the following sample dataframe with prices of toys in different shops:
dfData <- data.frame(article = c("Fix", "Foxi", "Stan", "Olli", "Barbie", "Ken", "Hulk"),
priceToys1 = c(10, NA, 10.5, NA, 10.7, 11.2, 12.0),
priceAllToys = c(NA, 11.4, NA, 11.9, 11.7, 11.1, NA),
price123Toys = c(12, 12.4, 12.7, NA, NA, 11.0, 12.1))
Additionally I generate a min price column by adding:
dfData$MinPrice <- apply(dfData[, grep("price", colnames(dfData))], 1, FUN=min, na.rm = TRUE)
So I have this dataframe now:
# article priceToys1 priceAllToys price123Toys MinPrice
#1 Fix 10.0 NA 12.0 10.0
#2 Foxi NA 11.4 12.4 11.4
#3 Stan 10.5 NA 12.7 10.5
#4 Olli NA 11.9 NA 11.9
#5 Barbie 10.7 11.7 NA 10.7
#6 Ken 11.2 11.1 11.0 11.0
#7 Hulk 12.0 NA 12.1 12.0
How do I add columns to the dataframe that give each price as a percentage of the minimum price? The new column names should also include the shop name.
The result should look like this:
# article priceToys1 PercToys1 priceAllToys PercAllToys price123Toys Perc123Toys MinPrice
#1 Fix 10.0 100.0 NA NA 12.0 120.0 10.0
#2 Foxi NA NA 11.4 100.0 12.4 108.8 11.4
#3 Stan 10.5 100.0 NA NA 12.7 121.0 10.5
#4 Olli NA NA 11.9 100.0 NA NA 11.9
#5 Barbie 10.7 100.0 11.7 109.3 NA NA 10.7
#6 Ken 11.2 101.8 11.1 100.9 11.0 100.0 11.0
#7 Hulk 12.0 100.0 NA NA 12.1 100.8 12.0
Two possible solutions:
1) With the data.table-package:
# load the 'data.table'-package
library(data.table)
# get the columnnames on which to operate
cols <- names(dfData)[2:4] # or: grep("price", names(dfData), value = TRUE)
# convert dfData to a 'data.table'
setDT(dfData)
# compute the 'fraction'-columns
dfData[, paste0('Perc', gsub('price','',cols)) := lapply(.SD, function(x) round(100 * x / MinPrice, 1))
, .SDcols = cols][]
which gives:
article priceToys1 priceAllToys price123Toys MinPrice PercToys1 PercAllToys Perc123Toys
1: Fix 10.0 NA 12.0 10.0 100.0 NA 120.0
2: Foxi NA 11.4 12.4 11.4 NA 100.0 108.8
3: Stan 10.5 NA 12.7 10.5 100.0 NA 121.0
4: Olli NA 11.9 NA 11.9 NA 100.0 NA
5: Barbie 10.7 11.7 NA 10.7 100.0 109.3 NA
6: Ken 11.2 11.1 11.0 11.0 101.8 100.9 100.0
7: Hulk 12.0 NA 12.1 12.0 100.0 NA 100.8
2) With base R:
cols <- names(dfData)[2:4] # or: grep("price", names(dfData), value = TRUE)
dfData[, paste0('Perc', gsub('price','',cols))] <- round(100 * dfData[, cols] / dfData$MinPrice, 1)
which will get you the same result.
We can use mutate_at from dplyr; the new columns get a _Perc suffix (e.g. priceToys1_Perc):
library(dplyr)
library(magrittr)
dfData %<>%
mutate_at(vars(matches("^price")), funs(Perc = round(100* ./MinPrice, 1)))
dfData
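The requested result places each Perc column directly after its price column, which none of the answers above does. A base R sketch of one way to get that column order, starting from the data in the question (the helper names price_cols and perc_cols are illustrative):

```r
# Data as given in the question.
dfData <- data.frame(article = c("Fix", "Foxi", "Stan", "Olli", "Barbie", "Ken", "Hulk"),
                     priceToys1 = c(10, NA, 10.5, NA, 10.7, 11.2, 12.0),
                     priceAllToys = c(NA, 11.4, NA, 11.9, 11.7, 11.1, NA),
                     price123Toys = c(12, 12.4, 12.7, NA, NA, 11.0, 12.1))

price_cols <- grep("^price", names(dfData), value = TRUE)
perc_cols  <- sub("^price", "Perc", price_cols)

# Row-wise minimum over the price columns, ignoring NAs.
dfData$MinPrice <- do.call(pmin, c(dfData[price_cols], na.rm = TRUE))

# Each price as a percentage of the row minimum, rounded to one decimal.
dfData[perc_cols] <- round(100 * dfData[price_cols] / dfData$MinPrice, 1)

# Interleave price/Perc pairs: rbind() stacks the two name vectors and
# as.vector() reads them column by column.
dfData <- dfData[c("article", as.vector(rbind(price_cols, perc_cols)), "MinPrice")]
dfData
```

Using pmin() avoids the row-wise apply() call, though either gives the same MinPrice here.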

How to calculate group of sequent nonzero rows in R using 1 row above and 1 row after that group?

I want to create another data frame (df) that lists only events. For the data below, there should be 4 events in df(XX,YY). Each event is a run of consecutive rows with XX greater than zero (runs are separated by zero rows), together with the one row above and the one row after it. Column XX should be the sum of XX over the rows of the event. Column YY should be the max minus the min of YY over the same rows.
XX YY
1 3.0 23.6
2 0.0 23.2
3 0.0 23.7
4 0.0 25.2
5 1.3 24.5
6 4.8 24.2
7 0.2 23.1
8 0.0 23.3
9 0.0 23.9
10 0.0 24.3
11 1.8 24.6
12 3.2 23.7
13 0.0 23.2
14 0.0 23.6
15 0.0 24.1
16 0.2 24.5
17 4.8 24.1
18 3.7 22.1
19 0.0 23.4
20 0.0 23.8
From my table, I would like to get the following results.
Event 1. XX[1] = sum(row1,row2) ; YY[1] = [Max(row1,row2)- Min(row1,row2)]
XX[1]=3, YY[1]=0.4
Event 2. XX[2] = sum(row4,row5,row6,row7,row8) ; YY[2] = [Max(row4,row5,row6,row7,row8)- Min(row4,row5,row6,row7,row8)]
XX[2]=6.3, YY[2]=2.1
Event 3. XX[3] = sum(row10,row11,row12,row13) ; YY[3] = [Max(row10,row11,row12,row13)- Min(row10,row11,row12,row13)]
XX[3]=5, YY[3]=1.4
Event 4. XX[4] = sum(row15,row16,row17,row18,row19) ; YY[4] = [Max(row15,row16,row17,row18,row19)- Min(row15,row16,row17,row18,row19)]
XX[4]=8.7, YY[4]=2.4
XX YY
1 3 0.4
2 6.3 2.1
3 5 1.4
4 8.7 2.4
Method 1 in base R
Split the original data.frame into a list.
lst <- split(df, c(rep(1, 2), 2, rep(3, 5), 4, rep(5, 4), 6, rep(7, 5), 8));
lst <- lst[sapply(lst, function(x) nrow(x) > 1)];
names(lst) <- NULL;
Note that this is exactly the same as your original data, with the only difference that relevant rows are grouped into separate data.frames, and irrelevant rows (row3, row9, row14, row20) have been removed.
Next define a custom function
# Define a custom function that returns
# the sum(column XX) and max(column YY)-min(column YY)
calc_summary_stats <- function(df) {
c(sum(df$XX), max(df$YY) - min(df$YY));
}
Apply the function to your list elements using sapply to get your expected outcome.
# Apply the function to the list of dataframes
m <- t(sapply(lst, calc_summary_stats))
colnames(m) <- c("XX", "YY");
# XX YY
#[1,] 3.0 0.4
#[2,] 6.3 2.1
#[3,] 5.0 1.4
#[4,] 8.7 2.4
Method 2 using tidyverse
Using dplyr, we can first add an idx column by which we group the data; then filter the groups with >1 row, calculate the two summary statistics for every group, and output the ungrouped data with the idx column removed.
library(tidyverse);
df %>%
mutate(idx = c(rep(1, 2), 2, rep(3, 5), 4, rep(5, 4), 6, rep(7, 5), 8)) %>%
group_by(idx) %>%
filter(n() > 1) %>%
summarise(XX = sum(XX), YY = max(YY) - min(YY)) %>%
ungroup() %>%
select(-idx);
## A tibble: 4 x 2
# XX YY
# <dbl> <dbl>
#1 3.00 0.400
#2 6.30 2.10
#3 5.00 1.40
#4 8.70 2.40
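Both methods hard-code the grouping vector for this particular data. A sketch of deriving the event windows from the data itself with rle() is shown below; it assumes an event is a run of XX > 0 padded by one row on each side (clipped at the first and last row), and that events are separated by at least two zero rows so the padded windows do not overlap.

```r
# Data as given in the question.
df <- data.frame(
  XX = c(3.0, 0, 0, 0, 1.3, 4.8, 0.2, 0, 0, 0,
         1.8, 3.2, 0, 0, 0, 0.2, 4.8, 3.7, 0, 0),
  YY = c(23.6, 23.2, 23.7, 25.2, 24.5, 24.2, 23.1, 23.3, 23.9, 24.3,
         24.6, 23.7, 23.2, 23.6, 24.1, 24.5, 24.1, 22.1, 23.4, 23.8)
)

# Run-length encode the "XX is nonzero" indicator.
r      <- rle(df$XX > 0)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep the nonzero runs and pad each by one row before and after.
lo <- pmax(starts[r$values] - 1, 1)
hi <- pmin(ends[r$values] + 1, nrow(df))

# Summarise each padded window.
events <- data.frame(
  XX = mapply(function(a, b) sum(df$XX[a:b]), lo, hi),
  YY = mapply(function(a, b) diff(range(df$YY[a:b])), lo, hi)
)
events
```

For this data the windows are rows 1-2, 4-8, 10-13 and 15-19, the same groups as in the hand-written index.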
Sample data
df <- read.table(text =
"XX YY
1 3.0 23.6
2 0.0 23.2
3 0.0 23.7
4 0.0 25.2
5 1.3 24.5
6 4.8 24.2
7 0.2 23.1
8 0.0 23.3
9 0.0 23.9
10 0.0 24.3
11 1.8 24.6
12 3.2 23.7
13 0.0 23.2
14 0.0 23.6
15 0.0 24.1
16 0.2 24.5
17 4.8 24.1
18 3.7 22.1
19 0.0 23.4
20 0.0 23.8", header = T)

Apply function and create new row

I have a data.frame as such:
X 1976 1977
1 6.4 6.9
2 6.3 7.0
3 6.1 7.1
4 6.0 7.2
I want to create the following:
Qtr Value
1976.00 6.27
1976.25 ...
And so on...
1977.00 7.0
1977.25 ...
And so on.
EDIT: The output is the average of the first 3 values. My apologies.
Can anybody help me out? Thanks in advance.
Robert
Here's an approach.
Your data frame:
dat <- read.table(text = "X 1976 1977
1 6.4 6.9
2 6.3 7.0
3 6.1 7.1
4 6.0 7.2", header = TRUE, check.names = FALSE)
The commands:
agg <- aggregate(dat[-1], by = list((dat$X - 1) %/% 3), mean)
dat2 <- setNames(stack(agg[-1])[2:1], c("Qtr", "Value"))
dat2$Qtr <- agg[[1]] * 0.25 + as.numeric(as.character(dat2$Qtr))
The result:
dat2
# Qtr Value
# 1 1976.00 6.266667
# 2 1976.25 6.000000
# 3 1977.00 7.000000
# 4 1977.25 7.200000
Try:
ddf = structure(list(X = 1:12, `1976` = c(6.4, 6.3, 6.1, 6, 6, 6.3,
6.1, 6, 6.4, 6.8, 6.6, 6), `1977` = c(6.9, 7, 7.1, 7.2, 7.2,
7.1, 7.2, 7.5, 7.2, 7.6, 7.8, 7.2)), .Names = c("X", "1976",
"1977"), class = "data.frame", row.names = c(NA, -12L))
ddf
X 1976 1977
1 1 6.4 6.9
2 2 6.3 7.0
3 3 6.1 7.1
4 4 6.0 7.2
5 5 6.0 7.2
6 6 6.3 7.1
7 7 6.1 7.2
8 8 6.0 7.5
9 9 6.4 7.2
10 10 6.8 7.6
11 11 6.6 7.8
12 12 6.0 7.2
df2 = data.frame(qtr =numeric(), value=numeric())
rr=1; x=0; new=TRUE
for(cc in 2:3)for(i in 1:4){
if(cc==3 & new){
rr = 1; x=0; new=FALSE;
}
df2[nrow(df2)+1,1] = as.numeric(names(ddf)[cc])+x
x = x+0.25
df2[nrow(df2),2] = mean(ddf[rr:(rr+3),cc])
rr = rr+4
if(rr>12) rr = 1
}
df2
qtr value
1 1976.00 6.20
2 1976.25 6.10
3 1976.50 6.45
4 1976.75 6.20
5 1977.00 7.05
6 1977.25 7.25
7 1977.50 7.45
8 1977.75 7.05
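Note that the loop above averages four rows at a time and wraps back to row 1 for the last quarter of each year. If each quarter is instead meant to cover three consecutive monthly rows (an assumption, based on the "average of the first 3 values" remark in the question), a vectorised base R sketch on the same 12-row data could look like this:

```r
# Same 12-row data as in the answer above.
ddf <- data.frame(X = 1:12,
                  `1976` = c(6.4, 6.3, 6.1, 6, 6, 6.3, 6.1, 6, 6.4, 6.8, 6.6, 6),
                  `1977` = c(6.9, 7, 7.1, 7.2, 7.2, 7.1, 7.2, 7.5, 7.2, 7.6, 7.8, 7.2),
                  check.names = FALSE)

# Quarter index 0..3: rows 1-3 -> 0, rows 4-6 -> 1, and so on.
qtr_index <- (ddf$X - 1) %/% 3

# One column of quarterly means per year column (a 4 x 2 matrix).
means <- sapply(ddf[-1], function(v) tapply(v, qtr_index, mean))

# Flatten column by column and rebuild the fractional-year label.
res <- data.frame(
  Qtr   = rep(as.numeric(colnames(means)), each = nrow(means)) +
          0.25 * rep(as.numeric(rownames(means)), times = ncol(means)),
  Value = as.vector(means)
)
res
```

The grouping expression is the same `(X - 1) %/% 3` used in the aggregate() answer, so the first value is again 6.266667 for 1976.00.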
