I have climatic data collected over a whole year along an altitude gradient. It is shaped like this:
clim <- read.table(text="alti year month week day meanTemp maxTemp minTemp
350 2011 aug. 31 213 10 14 6
350 2011 aug. 31 214 12 18 6
350 2011 aug. 31 215 10 11 9
550 2011 aug. 31 213 8 10 6
550 2011 aug. 31 214 10 12 8
550 2011 aug. 31 215 8 9 7
350 2011 sep. 31 244 9 10 8
350 2011 sep. 31 245 11 12 10
350 2011 sep. 31 246 10 11 9
550 2011 sep. 31 244 7.5 9 6
550 2011 sep. 31 245 8 10 6
550 2011 sep. 31 246 8.5 9 8", header=TRUE)
and I am trying to reshape this data so that there is only one row per altitude, with the mean values for each month and for the whole year. It would be great if it could be shaped like this:
alti mean_year(meanTemp) mean_year(maxTemp) mean_aug.(meanTemp) mean_aug.(maxTemp) mean_sep.(meanTemp) [...]
350 10.333 12.667 10.667 14.3 10 ...
550 8.333 9.833 8.667 10.333 7.766 ...
Any idea how to perform this reshaping and calculation?
You can use data.table and dcast:
library(data.table)
setDT(clim)
merge(
clim[, list("mean_temp_mean_year" = mean(meanTemp), "max_temp_mean_year" = mean(maxTemp)), by = alti]
,
dcast(clim[, list("mean_temp_mean" = mean(meanTemp), "max_temp_mean" = mean(maxTemp)), by = c("alti","month")], alti ~ month, value.var = c("mean_temp_mean","max_temp_mean"))
,
by = "alti")
I've switched the names of some of the variables, and your column order is not reproduced exactly, but the columns can be reordered and renamed afterwards.
To get the means of the months or years, you can use aggregate followed by reshape.
The two aggregates can be computed separately, and then merge puts them together:
mon <- aggregate(cbind(meanTemp, maxTemp) ~ month + alti, data=clim, FUN=mean)
mon.wide <- reshape(mon, direction='wide', timevar='month', idvar='alti')
yr <- aggregate(cbind(meanTemp, maxTemp) ~ year + alti, data=clim, FUN=mean)
yr.wide <- reshape(yr, direction='wide', timevar='year', idvar='alti')
Each of these .wide sets has the data that you want. The only common column is alti, so we take the merge defaults:
merge(mon.wide, yr.wide)
## alti meanTemp.aug. maxTemp.aug. meanTemp.sep. maxTemp.sep. meanTemp.2011 maxTemp.2011
## 1 350 10.666667 14.33333 10 11.000000 10.333333 12.666667
## 2 550 8.666667 10.33333 8 9.333333 8.333333 9.833333
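If the column labels should follow the layout asked for in the question (mean_year(meanTemp), mean_aug.(meanTemp), ...), the merged result can be renamed afterwards. The following is only a sketch in base R: it aggregates the yearly means by alti alone, and the ".year" suffix and the regular expression are my own choices for illustration, not part of the answer above.

```r
# Sample data from the question (only the columns needed here)
clim <- read.table(text = "alti year month meanTemp maxTemp
350 2011 aug. 10 14
350 2011 aug. 12 18
350 2011 aug. 10 11
550 2011 aug. 8 10
550 2011 aug. 10 12
550 2011 aug. 8 9
350 2011 sep. 9 10
350 2011 sep. 11 12
350 2011 sep. 10 11
550 2011 sep. 7.5 9
550 2011 sep. 8 10
550 2011 sep. 8.5 9", header = TRUE)

# Monthly means, one row per altitude
mon <- aggregate(cbind(meanTemp, maxTemp) ~ month + alti, data = clim, FUN = mean)
mon.wide <- reshape(mon, direction = "wide", timevar = "month", idvar = "alti")

# Yearly means, aggregated by altitude only (assumption: a single year of data)
yr <- aggregate(cbind(meanTemp, maxTemp) ~ alti, data = clim, FUN = mean)
names(yr)[-1] <- paste0(names(yr)[-1], ".year")

res <- merge(yr, mon.wide, by = "alti")

# Turn "meanTemp.aug." into "mean_aug.(meanTemp)", and so on
names(res) <- sub("^(meanTemp|maxTemp)\\.(.+)$", "mean_\\2(\\1)", names(res))
res
```

Because the new names contain parentheses, they have to be accessed with res[["mean_aug.(meanTemp)"]] or backticks rather than a plain $.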
Here's another variation of data.table solution, but this requires the current devel version, v1.9.5:
require(data.table) # v1.9.5+
setDT(clim)
form = paste("alti", c("year", "month"), sep=" ~ ")
val = c("meanTemp", "maxTemp")
ans = lapply(form, function(x) dcast(clim, x, mean, value.var = val))
Reduce(function(x, y) x[y, on="alti"], ans)
# alti meanTemp_mean_2011 maxTemp_mean_2011 meanTemp_mean_aug. meanTemp_mean_sep. maxTemp_mean_aug. maxTemp_mean_sep.
# 1: 350 10.333333 12.666667 10.666667 10 14.33333 11.000000
# 2: 550 8.333333 9.833333 8.666667 8 10.33333 9.333333
I have a problem with the humans here; they're giving me Citizen Science data in spreadsheets formatted to be attractive and legible. I figured out the right sequence of pivots _longer and _wider to get it into an analyzable format but first I had to do a whole bunch of hand edits to make the column labels usable. I've just been given a corrected spreadsheet so now I have to do the same hand edits all over. Can I avoid this?
reprex <- read_csv("reprex.csv", col_names = FALSE)
gives:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 NA NA 2014 NA NA 2015 NA NA 2016 NA
2 NA Total F M Total F M Total F M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB NA NA NA 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
I want column labels like "2014 Total", "2014 F", ... like so:
Location `2014 Total` `2014 F` `2014 M` `2015 Total` `2015 F` `2015 M` `2016 Total` `2016 F` `2016 M`
1 SiteA 180 92 88 134 40 94 34 20 14
2 SiteB NA NA NA 247 143 104 8 8 0
3 SiteC 237 194 43 220 95 125 62 45 17
...which would allow me to twist it up until I get to something like:
Location date Total F M
1 SiteA 2014 180 92 88
2 SiteB 2014 NA NA NA
3 SiteC 2014 237 194 43
4 SiteA 2015 134 40 94
5 SiteB 2015 247 143 104
6 SiteC 2015 220 95 125
7 SiteA 2016 34 20 14
8 SiteB 2016 8 8 0
9 SiteC 2016 62 45 17
The part from the second table to the third I've got; the problem is in how to get from the first table to the second. It would seem like you could pivot the first and then fill in the missing dates with fill(.direction="updown") except that the dates are the grouping value you need to be following.
For this example, we could do it like this (here df is the data frame read in the question, i.e. reprex):
library(tidyverse)
df_helper <- df %>%
slice(1:2) %>%
pivot_longer(cols= everything()) %>%
fill(value, .direction = "up") %>%
mutate(x = lead(value, 11)) %>%
drop_na() %>%
unite("name", c(value, x), sep = " ", remove = FALSE) %>%
pivot_wider(names_from = name)
df %>%
setNames(names(df_helper)) %>%
rename(Location = x) %>%
slice(-c(1:2))
Location 2014 Total 2014 F 2014 M 2015 Total 2015 F 2015 M 2016 Total 2016 F 2016 M
3 SiteA 180 92 88 134 40 94 34 20 14
4 SiteB <NA> <NA> <NA> 247 143 104 8 8 0
5 SiteC 237 194 43 220 95 125 62 45 17
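The same header can also be built without pivoting, by treating the two header rows as plain vectors. This is a minimal sketch in base R, assuming that every year block starts with a "Total" column and contains exactly one year label; hdr_year, hdr_sub and new_names are names made up for the illustration.

```r
# The two header rows from the example, typed in as vectors
hdr_year <- c(NA, NA, 2014, NA, NA, 2015, NA, NA, 2016, NA)
hdr_sub  <- c(NA, "Total", "F", "M", "Total", "F", "M", "Total", "F", "M")

# A new block starts at every "Total"; the first column is block 0
block <- cumsum(!is.na(hdr_sub) & hdr_sub == "Total")

# Within each block, spread the single non-missing year over all columns
year_filled <- ave(hdr_year, block, FUN = function(y) {
  if (all(is.na(y))) NA_real_ else y[!is.na(y)][1]
})

# Columns without a sub-header (here only the first) become "Location"
new_names <- ifelse(is.na(hdr_sub), "Location", paste(year_filled, hdr_sub))
new_names
```

With the real file, the two vectors would be taken from the first two rows of the data frame, and new_names would be assigned with names() after dropping those rows.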
I have sales data by year, condition and products
Year <- c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012,2012)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- c("New","New","New","Used","Used","Used","New","New","New","Used","Used","Used","New","New","New","Used","Used","Used")
Product <- c("a","b","c","a","b","c","a","b","c","a","b","c","a","b","c","a","b","c")
df <- data.frame(Year,Condition, Product, Sale)
Now I want to calculate the share of each product by condition within each year. I tried the following code, but it calculates the share against the overall total rather than the total for each year and condition:
df$percentage <- df$Sale/sum(df$Sale)*100
First convert Sale from character to numeric with type.convert(as.is = TRUE), then group by the desired columns and apply summarise.
Note that with the data frame you provided, every percentage will be 100, because each combination of Year, Product and Condition occurs only once.
With this fake data
set.seed(123)
Year <- sample(c(2010, 2011, 2012), 18, replace = TRUE)
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12",18,10,17)
Condition <- sample(c("Used","New"), 18, replace = TRUE)
Product <- sample(c("a","b","c"), 18, replace = TRUE)
df <- data.frame(Year,Condition, Product, Sale)
using this code
library(dplyr)
df %>%
type.convert(as.is=TRUE) %>%
group_by(Year, Product, Condition) %>%
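summarise(percentage = Sale/sum(Sale)*100) # with dplyr >= 1.1.0, reframe() is the preferred verb here because each group returns several rows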
you will get:
Year Product Condition percentage
<int> <chr> <chr> <dbl>
1 2010 a Used 83.2
2 2010 a Used 16.8
3 2010 c New 100
4 2011 a New 100
5 2011 a Used 42.9
6 2011 a Used 14.3
7 2011 a Used 42.9
8 2011 b New 100
9 2011 c New 49.2
10 2011 c New 50.8
11 2012 a Used 63.8
12 2012 a Used 36.2
13 2012 b New 100
14 2012 b Used 69.7
15 2012 b Used 30.3
16 2012 c New 100
17 2012 c Used 34.8
18 2012 c Used 65.2
Update: to keep Sale column: replace summarise with mutate
df %>%
type.convert(as.is=TRUE) %>%
group_by(Year, Product, Condition) %>%
mutate(percentage = paste(round(Sale/sum(Sale)*100, 1), "%"))
Year Condition Product Sale percentage
<int> <chr> <chr> <int> <chr>
1 2012 Used a 30 63.8 %
2 2012 New c 45 100 %
3 2012 Used b 23 69.7 %
4 2011 Used a 33 42.9 %
5 2012 Used c 24 34.8 %
6 2011 Used a 11 14.3 %
7 2011 New a 56 100 %
8 2011 New b 19 100 %
9 2012 Used c 45 65.2 %
10 2010 New c 56 100 %
11 2011 Used a 33 42.9 %
12 2011 New c 32 49.2 %
13 2010 Used a 89 83.2 %
14 2011 New c 33 50.8 %
15 2012 New b 12 100 %
16 2010 Used a 18 16.8 %
17 2012 Used b 10 30.3 %
18 2012 Used a 17 36.2 %
Here is a base solution using ave(). You can replace grouping variables in ave with any others you want.
within(df, {
perc1 = ave(as.numeric(Sale), Year, Product, FUN = proportions) * 100
perc2 = sprintf("%.1f %%", perc1)
})
Year Condition Product Sale perc2 perc1
1 2010 New a 30 47.6 % 47.61905
2 2010 New b 45 65.2 % 65.21739
3 2010 New c 23 67.6 % 67.64706
4 2010 Used a 33 52.4 % 52.38095
5 2010 Used b 24 34.8 % 34.78261
6 2010 Used c 11 32.4 % 32.35294
7 2011 New a 56 50.0 % 50.00000
8 2011 New b 19 36.5 % 36.53846
9 2011 New c 45 58.4 % 58.44156
10 2011 Used a 56 50.0 % 50.00000
11 2011 Used b 33 63.5 % 63.46154
12 2011 Used c 32 41.6 % 41.55844
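As an illustration of swapping the grouping variables, here is a sketch that groups by Year and Condition instead, which is how I read the question (share of each product within each year and condition). The data are the question's data with Sale already numeric; the column name share and the check at the end are my own additions.

```r
# Question data, with Sale stored as numbers
Year <- rep(c(2010, 2011, 2012), each = 6)
Sale <- c(30, 45, 23, 33, 24, 11, 56, 19, 45, 56, 33, 32, 89, 33, 12, 18, 10, 17)
Condition <- rep(rep(c("New", "Used"), each = 3), times = 3)
Product <- rep(c("a", "b", "c"), times = 6)
df <- data.frame(Year, Condition, Product, Sale)

# Share of each product within its Year/Condition group, in percent
df$share <- ave(df$Sale, df$Year, df$Condition,
                FUN = function(s) s / sum(s) * 100)

# Sanity check: the shares in every Year/Condition group add up to 100
group_totals <- tapply(df$share, list(df$Year, df$Condition), sum)
group_totals
```

Using an explicit function instead of proportions keeps the sketch runnable on R versions older than 4.0.1.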
I have a data frame called "data", that has "date, month, discharge, and station" columns. Another data frame called "perc" that has "month, W1_Percentile, and B1_Percentile" columns. W1_Percentile and B1_Percentile are the monthly percentile values for each of the gauging stations. I want my final output to have columns same as in df(data) with an additional column for "Percentile" that will have the percentile values for the respective month and gauging station (percentile values of each gauging station for the respective months is stored in df(perc)). What steps should I follow?
Here is the sample of input data:
date <- as.Date(c('1950-03-12','1954-03-23','1991-06-27','1997-09-04','1991-06-27','1987-05-06','1987-05-29','1856-07-08','1993-06-04', '2001-09-19','2001-05-06','2001-05-27'))
month <- c('Mar','Mar','Jun','Sep','Jun','May','May','Jul','Jun','Sep','May','May')
disch <- c(125,1535,1654,154,4654,453,1654,145,423,433,438,6426)
station <- c('W1','W1','W1','W1','W1','W1','B1','B1','B1','B1','B1','B1')
data <- data.frame("Date"= date, "Month" = month,"Discharge"=disch,"station"=station)
Date Month Discharge station
1 1950-03-12 Mar 125 W1
2 1954-03-23 Mar 1535 W1
3 1991-06-27 Jun 1654 W1
4 1997-09-04 Sep 154 W1
5 1991-06-27 Jun 4654 W1
6 1987-05-06 May 453 W1
7 1987-05-29 May 1654 B1
8 1856-07-08 Jul 145 B1
9 1993-06-04 Jun 423 B1
10 2001-09-19 Sep 433 B1
11 2001-05-06 May 438 B1
12 2001-05-27 May 6426 B1
Month <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
W1 <- c(106,313,531.40,164.10,40,23.39,18.30,24,16,16,12,34)
B1 <- c(1330,1550,1948,1880,1260,853.15,680.15,486.10,503,625,738,1070)
perc <- data.frame("Month"=Month,"W1_Percentile"=W1,"B1_Percentile"=B1)
Month W1_Percentile B1_Percentile
1 Jan 106.00 1330.00
2 Feb 313.00 1550.00
3 Mar 531.40 1948.00
4 Apr 164.10 1880.00
5 May 40.00 1260.00
6 Jun 23.39 853.15
7 Jul 18.30 680.15
8 Aug 24.00 486.10
9 Sep 16.00 503.00
10 Oct 16.00 625.00
11 Nov 12.00 738.00
12 Dec 34.00 1070.00
This is how I want the final output to look like:
Date Month Discharge station Percentile
1 1950-03-12 Mar 125 W1 531.40
2 1954-03-23 Mar 1535 W1 531.40
3 1991-06-27 Jun 1654 W1 23.39
4 1997-09-04 Sep 154 W1 16.00
5 1991-06-27 Jun 4654 W1 23.39
6 1987-05-06 May 453 W1 40.00
7 1987-05-29 May 1654 B1 1260.00
8 1856-07-08 Jul 145 B1 680.15
9 1993-06-04 Jun 423 B1 853.15
10 2001-09-19 Sep 433 B1 503.00
11 2001-05-06 May 438 B1 1260.00
12 2001-05-27 May 6426 B1 1260.00
We need to first convert your perc data into a long format so that we have the columns we want to add to data, then it's a simple join:
library(tidyr)
library(dplyr)
# make the column names the same as the values in data
names(perc)[2:3] = c("W1", "B1")
# convert to long format
perc_long = gather(perc, key = "station", value = "percentile", W1, B1)
# join
left_join(data, perc_long)
# Joining, by = c("Month", "station")
# Date Month Discharge station percentile
# 1 1950-03-12 Mar 125 W1 531.40
# 2 1954-03-23 Mar 1535 W1 531.40
# 3 1991-06-27 Jun 1654 W1 23.39
# 4 1997-09-04 Sep 154 W1 16.00
# 5 1991-06-27 Jun 4654 W1 23.39
# 6 1987-05-06 May 453 W1 40.00
# 7 1987-05-29 May 1654 B1 1260.00
# 8 1856-07-08 Jul 145 B1 680.15
# 9 1993-06-04 Jun 423 B1 853.15
# 10 2001-09-19 Sep 433 B1 503.00
# 11 2001-05-06 May 438 B1 1260.00
# 12 2001-05-27 May 6426 B1 1260.00
There are many ways to do these operations; this is essentially a combination of two R-FAQs. For additional reference, see:
Reshaping data.frame from wide to long format
How to join (merge) data frames (inner, outer, left, right)
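One of those other ways, sketched in base R: treat perc as a lookup matrix with months as row names and stations as column names, then index it with a two-column character matrix. The sample below is a cut-down version of the question's data, and lookup is a name invented for the example.

```r
# Cut-down sample of the question's data
data <- data.frame(
  Month = c("Mar", "Jun", "May", "Jul"),
  Discharge = c(125, 1654, 1654, 145),
  station = c("W1", "W1", "B1", "B1")
)
perc <- data.frame(
  Month = c("Mar", "May", "Jun", "Jul"),
  W1_Percentile = c(531.40, 40, 23.39, 18.30),
  B1_Percentile = c(1948, 1260, 853.15, 680.15)
)

# Lookup matrix: rows are months, columns are stations
lookup <- as.matrix(perc[, c("W1_Percentile", "B1_Percentile")])
dimnames(lookup) <- list(as.character(perc$Month), c("W1", "B1"))

# Each row of the index matrix selects one (month, station) cell
idx <- cbind(as.character(data$Month), as.character(data$station))
data$Percentile <- unname(lookup[idx])
data
```

This avoids reshaping altogether, at the cost of hard-coding the station names; the join shown above scales better when there are many stations.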
While doing my data work I have this problem.
Data is as below,
row_number var1 var2
1 1921 16
2 1922 16
3 1921 17
4 1922 17
5 1703 29
6 1704 29
7 1705 29
8 1703 30
9 1704 30
10 1705 30
11 1703 31
12 1704 31
13 1705 31
I want to make pairs by only using unique var1 and unique var2.
In other words, rows 1~4 form one group, and I only need to keep the 1st and 4th rows. Rows 5~13 form another group, and I only need to keep these pairs (1703 29, 1704 30, 1705 31). That is, I want to have this outcome:
row_number var1 var2
1 1921 16
4 1922 17
5 1703 29
9 1704 30
13 1705 31
I have many more observations.
Suppose your data is in a dataframe named d. Then
out <- data.frame(row_number = NA, var1 = NA, var2 = NA)
for (i in seq_len(nrow(d))) {
# keep the row only if neither value has been used in an earlier kept row
if (!(d[i, "var1"] %in% out[, "var1"]) && !(d[i, "var2"] %in% out[, "var2"])) {
out <- rbind(out, d[i,])
}
}
out <- out[-1, ]
out
# row_number var1 var2
# 2 1 1921 16
# 4 4 1922 17
# 5 5 1703 29
# 9 9 1704 30
# 13 13 1705 31
gives you your desired result by iterating through rows of d and extracting only rows where neither var1 nor var2 has previously appeared in the output dataframe.
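Since the question mentions many more observations, growing out with rbind on every hit may become slow. A possible variant, sketched under the same greedy rule, records the decision in a logical vector and subsets once at the end; keep is a name introduced for the illustration.

```r
# Sample data from the question
d <- data.frame(
  row_number = 1:13,
  var1 = c(1921, 1922, 1921, 1922, 1703, 1704, 1705,
           1703, 1704, 1705, 1703, 1704, 1705),
  var2 = c(16, 16, 17, 17, 29, 29, 29, 30, 30, 30, 31, 31, 31)
)

# keep[i] is TRUE when row i is selected
keep <- logical(nrow(d))
for (i in seq_len(nrow(d))) {
  # a row qualifies only if neither value appears in an already kept row
  keep[i] <- !(d$var1[i] %in% d$var1[keep]) &&
             !(d$var2[i] %in% d$var2[keep])
}

out <- d[keep, ]
out
```

The loop is still sequential, because each decision depends on the rows kept before it, but no data frame is copied inside the loop.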
I have a huge dataset similar to the following reproducible sample data.
Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176
I want to aggregate this data to weekly level to get the output similar to the following:
Interval value
1 Week 2, June 2012 *aggregate value for day 10 to day 14 of June 2012*
2 Week 3, June 2012 *aggregate value for day 15 to day 21 of June 2012*
3 Week 4, June 2012 *aggregate value for day 22 to day 28 of June 2012*
4 Week 5, June 2012 *aggregate value for day 29 to day 30 of June 2012*
5 Week 1, July 2012 *aggregate value for day 1 to day 7 of July 2012*
6 Week 2, July 2012 *aggregate value for day 8 to day 10 of July 2012*
How do I achieve this easily without writing a long code?
If you mean the sum of ‘value’ by week, I think the easiest way is to convert the data into an xts object, as GSee suggested:
library(xts)
data <- as.xts(data$value, order.by = as.Date(data$Interval))
weekly <- apply.weekly(data, sum)
[,1]
2012-06-10 552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902
I leave the formatting of the output as an exercise for you :-)
If you were to use week from lubridate, you would only get five weeks to pass to by. Assume dat is your data,
> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24 552 4146 4188 3759 4529 4850
# 25 490 2498 4256 3396 4438 5156
# 26 564 2578 4206 3355 4346 4866
# 27 698 993 4868 3366 5122 5770
# 28 671 1086 3200 3200 5314 5726
This shows a summary for the 24th through 28th week of the year. Similarly, we can get the means with aggregate with
> aggregate(value~week(Interval), data = dat, mean)
# week(Interval) value
# 1 24 3758.667
# 2 25 3396.286
# 3 26 3355.000
# 4 27 3366.429
# 5 28 3199.500
I just came across this old question because it was used as a dupe target.
Unfortunately, all the upvoted answers (except the one by konvas and a now deleted one) present solutions for aggregating the data by week of the year while the OP has requested to aggregate by week of the month.
The definition of week of the year and week of the month is ambiguous as discussed here, here, and here.
However, the OP has indicated that he wants to count days 1 to 7 of each month as week 1 of the month, days 8 to 14 as week 2 of the month, etc. Note that week 5 is a stub in most months, consisting of only 2 or 3 days; in February it has a single day in a leap year and does not exist otherwise.
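This week-of-month rule boils down to integer division of the day of the month. A small base R sketch, with week_of_month being a helper name made up for the illustration:

```r
# Days 1-7 -> week 1, days 8-14 -> week 2, ..., days 29-31 -> week 5
week_of_month <- function(date) {
  (as.integer(format(date, "%d")) - 1L) %/% 7L + 1L
}

dates <- as.Date(c("2012-06-01", "2012-06-07", "2012-06-08",
                   "2012-06-14", "2012-06-29", "2012-07-10"))
week_of_month(dates)
# 1 1 2 2 5 2
```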
Having prepared the ground, here is a data.table solution for this kind of aggregation:
library(data.table)
DT[, .(value = sum(value)),
by = .(Interval = sprintf("Week %i, %s",
(mday(Interval) - 1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012 5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
We can verify that we have picked the correct intervals by
DT[, .(value = sum(value),
date_range = toString(range(Interval))),
by = .(Week = sprintf("Week %i, %s",
(mday(Interval) -1L) %/% 7L + 1L,
format(Interval, "%b %Y")))]
Week value date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012 5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10
which is in line with OP's specification.
Data
library(data.table)
DT <- fread(
"rn Interval value
1 2012-06-10 552
2 2012-06-11 4850
3 2012-06-12 4642
4 2012-06-13 4132
5 2012-06-14 4190
6 2012-06-15 4186
7 2012-06-16 1139
8 2012-06-17 490
9 2012-06-18 5156
10 2012-06-19 4430
11 2012-06-20 4447
12 2012-06-21 4256
13 2012-06-22 3856
14 2012-06-23 1163
15 2012-06-24 564
16 2012-06-25 4866
17 2012-06-26 4421
18 2012-06-27 4206
19 2012-06-28 4272
20 2012-06-29 3993
21 2012-06-30 1211
22 2012-07-01 698
23 2012-07-02 5770
24 2012-07-03 5103
25 2012-07-04 775
26 2012-07-05 5140
27 2012-07-06 4868
28 2012-07-07 1225
29 2012-07-08 671
30 2012-07-09 5726
31 2012-07-10 5176", drop = 1L)
DT[, Interval := as.Date(Interval)]
If you are using a data frame, you can easily do this with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum will get the aggregate by week.
library(tidyquant)
df
#> # A tibble: 31 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-11 4850
#> 3 2012-06-12 4642
#> 4 2012-06-13 4132
#> 5 2012-06-14 4190
#> 6 2012-06-15 4186
#> 7 2012-06-16 1139
#> 8 2012-06-17 490
#> 9 2012-06-18 5156
#> 10 2012-06-19 4430
#> # ... with 21 more rows
df %>%
tq_transmute(select = value,
mutate_fun = apply.weekly,
FUN = sum)
#> # A tibble: 6 x 2
#> Interval value
#> <date> <int>
#> 1 2012-06-10 552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902
When you say "aggregate" the values, you mean take their sum? Let's say your data frame is d and assuming d$Interval is of class Date, you can try
# if d$Interval is not of class Date, convert it first:
# d$Interval <- as.Date(d$Interval)
formatdate <- function(date)
paste0("Week ", (as.numeric(format(date, "%d")) - 1) %/% 7 + 1,
", ", format(date, "%b %Y"))
# change "sum" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), sum)
# Group.1 x
# 1 Week 1, Jul 2012 23579
# 2 Week 2, Jul 2012 11573
# 3 Week 2, Jun 2012 18366
# 4 Week 3, Jun 2012 24104
# 5 Week 4, Jun 2012 23348
# 6 Week 5, Jun 2012 5204