I have a data table here:
row V1 velocity
1 2009-04-06 95.9230769230769
2 2009-04-11 95.0985074626866
3 2009-04-17 95.8064935064935
4 2009-04-22 94.6357142857143
5 2009-04-27 95.3626865671642
6 2009-05-03 95.9101265822785
7 2009-05-08 95.826582278481
8 2009-05-14 94.5126582278481
9 2009-05-20 95.8371428571429
10 2009-05-25 94.6981481481481
11 2009-05-30 96.397619047619
12 2009-06-05 94.8132530120482
13 2009-06-10 96.4558139534884
14 2009-06-16 94.9627906976744
15 2009-06-21 95.2666666666667
16 2009-06-26 95.2919540229885
17 2009-07-01 95.4333333333333
18 2009-07-07 95.3375
19 2009-07-12 95.0534246575343
20 2009-07-18 96.0277777777778
21 2009-07-24 95.6885057471264
22 2009-07-29 93.9375
23 2009-08-03 95.2776315789474
24 2009-08-08 94.9089285714286
25 2009-08-13 96.8906976744186
26 2009-08-19 95.4487804878049
27 2009-08-24 97.2444444444444
28 2009-08-30 95.1174418604651
I want to write R code to find the mean velocity by month. (The months are April, May, June, July, and August.)
What could I do?
Or just:
tapply(df$velocity, months(as.Date(df$V1)), mean)
April August July June May
95.36530 95.81465 95.24634 95.35810 95.53038
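The month names returned by months() depend on the session locale, and tapply() orders the groups alphabetically by name rather than chronologically. Here is a minimal base R sketch that sidesteps both issues by grouping on the numeric month instead. The four-row data frame uses made-up values for illustration only; it is not the table from the question.

```r
# Toy data: made-up velocities, only to illustrate the grouping
df <- data.frame(
  V1 = c("2009-04-06", "2009-04-11", "2009-05-03", "2009-05-08"),
  velocity = c(96, 94, 95, 97)
)

# "%m" yields a zero-padded month number ("04", "05", ...), which is
# locale-independent and sorts chronologically within one year
month_num <- format(as.Date(df$V1), "%m")

monthly_mean <- tapply(df$velocity, month_num, mean)
print(monthly_mean)
```

If the data span more than one year, grouping on format(as.Date(df$V1), "%Y-%m") instead would keep the same month of different years apart.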
Here's how I would do it:
Use lubridate to create a month variable to group by in dplyr, and then compute the means.
library(lubridate)
library(dplyr)
df %>% group_by(month = month(V1)) %>% summarize(mean = mean(velocity))
month mean
1 4 95.36530
2 5 95.53038
3 6 95.35810
4 7 95.24634
5 8 95.81465
If you add label = TRUE, you get this:
df %>% group_by(month = month(V1, label = TRUE)) %>% summarize(mean = mean(velocity))
month mean
1 Apr 95.36530
2 May 95.53038
3 Jun 95.35810
4 Jul 95.24634
5 Aug 95.81465
I would like to calculate the mean of every 5 rows in my df. Here is my df:
Time                 value
03/06/2021 06:15:00  NA
03/06/2021 06:16:00  NA
03/06/2021 06:17:00  20
03/06/2021 06:18:00  22
03/06/2021 06:19:00  25
03/06/2021 06:20:00  NA
03/06/2021 06:21:00  31
03/06/2021 06:22:00  23
03/06/2021 06:23:00  19
03/06/2021 06:24:00  25
03/06/2021 06:25:00  34
03/06/2021 06:26:00  42
03/06/2021 06:27:00  NA
03/06/2021 06:28:00  19
03/06/2021 06:29:00  17
03/06/2021 06:30:00  25
I already have a loop that works well for calculating the mean of each block of 5 rows. My problem is in my "mean function".
The problem is:
- If I put na.rm = FALSE, the mean is NA as soon as there is an NA in a block of 5 values.
- If I put na.rm = TRUE in the mean function, the result gives me averages that are shifted so that each one still takes 5 values. I would like the NAs not to interfere with the average: when there is an NA in a block of 5 values, the average should be computed on the remaining 4 values only.
How can I do this? Thanks for your help!
You can solve your problem by introducing a dummy variable that groups your observations into sets of five and then calculating the mean within each group. Here's an MWE, based on the tidyverse, that assumes your data is in a data.frame named df.
library(tidyverse)
df %>%
  mutate(Group = 1 + floor((row_number() - 1) / 5)) %>%
  group_by(Group) %>%
  summarise(Mean = mean(value, na.rm = TRUE), .groups = "drop")
# A tibble: 4 × 2
Group Mean
<dbl> <dbl>
1 1 22.3
2 2 24.5
3 3 28
4 4 25
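The same grouping idea can be sketched in base R without any packages. This is an illustration rather than the answerer's code: it uses the values from the question, builds the block index with integer division, and also counts how many non-NA values went into each mean, so you can see that a block containing one NA is averaged over 4 values.

```r
# Values from the question's table
value <- c(NA, NA, 20, 22, 25, NA, 31, 23, 19, 25, 34, 42, NA, 19, 17, 25)

# Rows 1-5 -> block 1, rows 6-10 -> block 2, and so on
group <- (seq_along(value) - 1) %/% 5 + 1

# Mean per block, ignoring NAs inside the block only
block_mean <- tapply(value, group, mean, na.rm = TRUE)

# How many values each mean is actually based on
n_used <- tapply(!is.na(value), group, sum)

print(block_mean)
print(n_used)
```

Because the block index is computed from the row position before any NA is dropped, the blocks never shift.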
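A solution based on purrr::map_dfr. Note that this computes a rolling mean over overlapping windows (rows .x to .x+5, i.e. six rows each), not over non-overlapping blocks of five: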
library(purrr)
df <- data.frame(
  stringsAsFactors = FALSE,
  time = c("03/06/2021 06:15:00", "03/06/2021 06:16:00",
           "03/06/2021 06:17:00",
           "03/06/2021 06:18:00", "03/06/2021 06:19:00",
           "03/06/2021 06:20:00", "03/06/2021 06:21:00",
           "03/06/2021 06:22:00", "03/06/2021 06:23:00",
           "03/06/2021 06:24:00", "03/06/2021 06:25:00",
           "03/06/2021 06:26:00",
           "03/06/2021 06:27:00", "03/06/2021 06:28:00",
           "03/06/2021 06:29:00", "03/06/2021 06:30:00"),
  value = c(NA, NA, 20L, 22L,
            25L, NA, 31L, 23L, 19L, 25L, 34L, 42L, NA, 19L, 17L,
            25L)
)
map_dfr(1:(nrow(df) - 5),
        ~ data.frame(Group = .x, Mean = mean(df$value[.x:(.x + 5)], na.rm = TRUE)))
#> Group Mean
#> 1 1 22.33333
#> 2 2 24.50000
#> 3 3 24.20000
#> 4 4 24.00000
#> 5 5 24.60000
#> 6 6 26.40000
#> 7 7 29.00000
#> 8 8 28.60000
#> 9 9 27.80000
#> 10 10 27.40000
#> 11 11 27.40000
If you want to take the average of every 5 minutes, you can use lubridate's floor_date/ceiling_date functions to round the time.
library(dplyr)
library(lubridate)
df %>%
  mutate(time = mdy_hms(time),
         time = floor_date(time, '5 mins')) %>%
  group_by(time) %>%
  summarise(value = mean(value, na.rm = TRUE))
# time value
# <dttm> <dbl>
#1 2021-03-06 06:15:00 22.3
#2 2021-03-06 06:20:00 24.5
#3 2021-03-06 06:25:00 28
#4 2021-03-06 06:30:00 25
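For readers who prefer to avoid extra packages, the flooring step can also be sketched in base R by rounding the underlying number of seconds down to a multiple of 300. This is only an illustration: the timestamps are rebuilt with seq() under the assumption of one reading per minute in UTC, and the bucket labels are formatted as "HH:MM" for brevity.

```r
# Values from the question; timestamps assumed to be one per minute in UTC
value <- c(NA, NA, 20, 22, 25, NA, 31, 23, 19, 25, 34, 42, NA, 19, 17, 25)
time <- seq(as.POSIXct("2021-03-06 06:15:00", tz = "UTC"),
            by = "1 min", length.out = length(value))

# Round each timestamp down to a multiple of 300 seconds (5 minutes)
bucket <- as.POSIXct(floor(as.numeric(time) / 300) * 300,
                     origin = "1970-01-01", tz = "UTC")

# Mean per 5-minute bucket, ignoring NAs within each bucket
five_min_mean <- tapply(value, format(bucket, "%H:%M"), mean, na.rm = TRUE)
print(five_min_mean)
```

Unlike the row-based grouping, this keeps working when readings are missing or irregularly spaced, because the bucket depends on the clock time rather than on the row position.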
I am trying to de-seasonalize my data by dividing my monthly totals by the average seasonality ratio for that month. I have two data frames: avgseasonality, which has 12 rows holding the average seasonality ratio per month, and df1, which holds the order totals. The problem is that avgseasonality has only 12 rows (one averaged ratio per month), while the order-total data frame has 157 rows.
deseasonlize <- transform(avgseasonalityratio, deseasonlizedtotal =
df1$OrderTotal / avgseasonality$seasonalityratio)
This runs, but it does not pair the months correctly. It applies the first ratio (April) to the first order total (December).
> avgseasonality
Month seasonalityratio
1 April 1.0132557
2 August 1.0054602
3 December 0.8316988
4 February 0.9813396
5 January 0.8357475
6 July 1.1181648
7 June 1.0439899
8 March 1.1772450
9 May 1.0430667
10 November 0.9841149
11 October 0.9595041
12 September 0.8312318
> df1
# A tibble: 157 x 3
DateEntLabel OrderTotal `d$Month`
<dttm> <dbl> <chr>
1 2005-12-01 00:00:00 512758. December
2 2006-01-01 00:00:00 227449. January
3 2006-02-01 00:00:00 155652. February
4 2006-03-01 00:00:00 172923. March
5 2006-04-01 00:00:00 183854. April
6 2006-05-01 00:00:00 239689. May
7 2006-06-01 00:00:00 237638. June
8 2006-07-01 00:00:00 538688. July
9 2006-08-01 00:00:00 197673. August
10 2006-09-01 00:00:00 144534. September
# ... with 147 more rows
I need each order total to be divided by the ratio for its own month. For example, for December the calculation would be 512758/0.8316988 = 616518.864762. The results should go into a new column, aligned with the corresponding month and order total. Any help is greatly appreciated!
The easiest way would be to merge() your data first, then do the operation. You can use the base R merge() function, though I will show it here using the tidyverse left_join() function. I see that one of your columns has a strange name, d$Month; renaming it to Month will simplify the merge!
Reproducible example:
library(tidyverse)
df_1 <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1,2))
df_2 <- data.frame(Month = rep(c("Jan", "Feb"),each=2), OrderTotal = 1:4)
df_1 %>%
  left_join(df_2, by = "Month") %>%
  mutate(deseasonalizedtotal = OrderTotal / seasonalityratio)
#> Month seasonalityratio OrderTotal deseasonalizedtotal
#> 1 Jan 1 1 1.0
#> 2 Jan 1 2 2.0
#> 3 Feb 2 3 1.5
#> 4 Feb 2 4 2.0
Created on 2019-01-30 by the reprex package (v0.2.1)
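Since the answer mentions base R merge() without showing it, here is a minimal sketch of that variant on the same toy data. The data frame names ratios and orders and the result column deseasonalizedtotal are names chosen for this illustration. Note that merge() sorts the result by the key column, so the sketch re-orders the rows afterwards; in this toy data OrderTotal happens to encode the original row order.

```r
# Same toy data as in the reproducible example above
ratios <- data.frame(Month = c("Jan", "Feb"), seasonalityratio = c(1, 2))
orders <- data.frame(Month = rep(c("Jan", "Feb"), each = 2), OrderTotal = 1:4)

# Keep every order row and attach the ratio for its month
merged <- merge(orders, ratios, by = "Month", all.x = TRUE)

# merge() sorts by the key column; restore the original order
# (here OrderTotal happens to encode it)
merged <- merged[order(merged$OrderTotal), ]

# Divide each order total by the ratio of its own month
merged$deseasonalizedtotal <- merged$OrderTotal / merged$seasonalityratio
print(merged)
```

With real data, a date column such as DateEntLabel would be the natural thing to sort by after the merge.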
I am looking to aggregate data (mean) in half-year periods by group.
Here is a snapshot of the data:
Date Score Group Score2
01/01/2015 15 A 11
02/01/2015 34 A 33
03/01/2015 16 A 1
04/01/2015 29 A 36
05/01/2015 4 A 28
06/01/2015 10 B 33
07/01/2015 21 B 19
08/01/2015 6 B 47
09/01/2015 40 B 15
10/01/2015 34 B 13
11/01/2015 16 B 7
12/01/2015 8 B 4
I have dfd$mon <- as.yearmon(dfd$Date), then
r <- as.data.frame(dfd %>%
  mutate(month = format(Date, "%m"), year = format(Date, "%Y")) %>%
  group_by(Group, mon) %>%
  summarise(total = mean(Score), total1 = mean(Score2)))
for monthly aggregation, but how would you do this for every 6 months, grouped by Group?
I sense I am overcomplicating a simple issue here!
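Add another mutate after the current one:
mutate(yearhalf = as.integer(as.integer(month) / 7) + 1) %>%
The output is 1 for the first 6 months and 2 for months 7 to 12. (The month column created by format() is a character string, hence the inner as.integer().) You then have to adapt the following calls, such as group_by(), to use the new column name, but that should do the trick.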
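To make the half-year idea concrete, here is a minimal base R sketch using the data from the question. Two things are assumptions: the dates are read as month/day/year ("%m/%d/%Y"), and the half is computed with integer division, (month - 1) %/% 6 + 1, which gives the same 1/2 split as described above. The year is included in the grouping so that halves from different years would not be pooled.

```r
# Data from the question; the date format "%m/%d/%Y" is an assumption
dfd <- data.frame(
  Date = as.Date(sprintf("%02d/01/2015", 1:12), format = "%m/%d/%Y"),
  Score = c(15, 34, 16, 29, 4, 10, 21, 6, 40, 34, 16, 8),
  Group = rep(c("A", "B"), times = c(5, 7)),
  Score2 = c(11, 33, 1, 36, 28, 33, 19, 47, 15, 13, 7, 4)
)

# Months 1-6 map to half 1, months 7-12 to half 2
month_num <- as.integer(format(dfd$Date, "%m"))
dfd$year <- format(dfd$Date, "%Y")
dfd$yearhalf <- (month_num - 1) %/% 6 + 1

# Mean of both score columns per Group, year and half-year
res <- aggregate(cbind(Score, Score2) ~ Group + year + yearhalf,
                 data = dfd, FUN = mean)
print(res)
```

Only combinations that occur in the data appear in the result, so Group A has no row for the second half here.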
diff(seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by="month"))
Time differences in days
[1] 31 31 28
The above code returns the number of days in the months Dec, Jan and Feb.
However, my requirement is as follows:
#Results that I need
#monthly days from date 2016-12-21 to 2017-04-05
11, 31, 28, 31, 5
#i.e 11 days of Dec, 31 of Jan, 28 of Feb, 31 of Mar and 5 days of Apr.
I even tried days_in_month from lubridate, but was not able to get this result:
library(lubridate)
days_in_month(c(as.Date("2016-12-21"), as.Date("2017-04-05")))
Dec Apr
31 30
Try this:
x = rle(format(seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by=1), '%b'))
setNames(x$lengths, x$values)
# Dec Jan Feb Mar Apr
# 11 31 28 31 5
Although we have seen a clever replacement of table by rle and a pure table solution, I want to add two approaches using grouping. All approaches have in common that they create a sequence of days between the two given dates and aggregate by month but in different ways.
aggregate()
This one uses base R:
# create sequence of days
days <- seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by = 1)
# aggregate by month
aggregate(days, list(month = format(days, "%b")), length)
# month x
#1 Apr 5
#2 Dec 11
#3 Feb 28
#4 Jan 31
#5 Mar 31
Unfortunately, the months are ordered alphabetically, as happened with the simple table() approach. In these situations, I prefer the ISO 8601 way of unambiguously naming the months:
aggregate(days, list(month = format(days, "%Y-%m")), length)
# month x
#1 2016-12 11
#2 2017-01 31
#3 2017-02 28
#4 2017-03 31
#5 2017-04 5
data.table
Now that I've got used to the data.table syntax, this is my preferred approach:
library(data.table)
data.table(days)[, .N, .(month = format(days, "%b"))]
# month N
#1: Dec 11
#2: Jan 31
#3: Feb 28
#4: Mar 31
#5: Apr 5
The order of months is kept as they have appeared in the input vector.
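The table() approach referred to above is not shown in this thread. A minimal sketch of what it might look like, assuming "%Y-%m" keys so that the alphabetical ordering used by table() coincides with chronological order:

```r
# Sequence of all days between the two dates, both ends included
days <- seq(as.Date("2016-12-21"), as.Date("2017-04-05"), by = 1)

# Count the days falling into each year-month
days_per_month <- table(format(days, "%Y-%m"))
print(days_per_month)
```

The "%Y-%m" labels are locale-independent, so the result does not change with the language settings of the R session.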
I have this data frame:
Source: local data frame [446,604 x 2]
date pressure
1 2014_01_01_0:01 991
2 2014_01_01_0:02 991
3 2014_01_01_0:03 991
4 2014_01_01_0:04 991
5 2014_01_01_0:05 991
6 2014_01_01_0:06 991
7 2014_01_01_0:07 991
8 2014_01_01_0:08 991
9 2014_01_01_0:09 991
10 2014_01_01_0:10 991
.. ... ...
I want to separate the date column using separate() from tidyr
library(tidyr)
separate(df, date, into = c("year", "month", "day", "time"), sep="_")
But it does not work. I managed to do it using substr() and mutate():
library(dplyr)
df %>%
  mutate(
    year = substr(date, 1, 4),
    month = substr(date, 6, 7),
    day = substr(date, 9, 10),
    time = substr(date, 12, 15))
Update:
It does not work because I have malformed rows. I was able to diagnose using my initial substr() method and I found out that I had weird entries in the dataframe:
df %>%
  select(date) %>%
  mutate(
    year = substr(date, 1, 4),
    month = substr(date, 6, 7),
    day = substr(date, 9, 10),
    time = substr(date, 12, 15)) %>%
  group_by(year) %>%
  summarise(n = n())
And this is what I get:
Source: local data frame [33 x 2]
year n
1 2014 446293
2 4164 9
3 4165 10
4 4166 10
5 4167 10
6 4168 10
7 4169 10
8 4170 10
9 4171 10
10 4172 10
11 4173 10
12 4174 10
13 4175 10
14 4176 10
15 4177 10
16 4178 10
17 4179 10
18 4180 10
19 4181 10
20 4182 10
21 4183 10
22 4184 10
23 4185 10
24 4186 10
25 4187 10
26 4188 10
27 4189 10
28 4190 10
29 4191 10
30 4192 10
31 4193 11
32 4194 10
33 4195 1
Would there be a more efficient way to diagnose the structure of the elements of a column and find the malformed lines before doing separate() ?
The steps would be:
Try to separate() first (no extra)
Notice there are malformed rows (errors in console)
Use separate() with extra = "drop"
Use group_by() and summarise() to explore the data and determine which rows to filter out
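One possible way to spot malformed entries before calling separate() is to check each string against the expected pattern, or to count the "_"-separated fields. The sketch below is only an illustration: the sample vector and its malformed third entry are hypothetical (the question does not show what the bad rows look like), and the regular expression is an assumption based on the well-formed rows shown above.

```r
# Hypothetical sample: the third entry is deliberately malformed
date <- c("2014_01_01_0:01", "2014_01_01_0:02", "4164_bad", "2014_01_01_23:59")

# Expected shape: YYYY_MM_DD_H:MM or YYYY_MM_DD_HH:MM (an assumption based on
# the rows shown in the question)
pattern <- "^[0-9]{4}_[0-9]{2}_[0-9]{2}_[0-9]{1,2}:[0-9]{2}$"
bad_rows <- which(!grepl(pattern, date))
print(date[bad_rows])

# Number of "_"-separated fields per entry; anything other than 4 means
# separate() would have missing or extra pieces for that row
n_fields <- lengths(strsplit(date, "_", fixed = TRUE))
print(table(n_fields))
```

On the real data frame, df[bad_rows, ] would then show the offending rows so you can decide whether to filter or repair them.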