Taking average of dataframe elements sharing same date - r

I am a bit lost in how to take average of a data frame formatted in the following way:
id date quantity product
1 12-05-2015 10 apple
2 21-03-2015 12 orange
3 12-05-2015 15 orange
4 21-03-2015 16 apple
Expected result:
date quantity
21-03-2015 14
12-05-2015 12.5
I tried converting it to zoo object, but then I run into issues as dates are non-unique.

Try
aggregate(quantity~date, df1, mean)
# date quantity
#1 12-05-2015 12.5
#2 21-03-2015 14.0
Or
library(data.table)
setDT(df1)[, list(quantity=mean(quantity)), date]
As @Alex A. mentioned in the comments, list( can be replaced by .( in recent data.table versions.
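Note that the date column holds dd-mm-yyyy strings, so the groups come out in text order (12-05-2015 before 21-03-2015) rather than the chronological order of the expected result. A minimal base R sketch, assuming the column is stored as text in that format and rebuilding df1 from the question purely for illustration, converts it to Date before aggregating:

```r
# Rebuild the question's data (illustrative copy of df1)
df1 <- data.frame(
  id = 1:4,
  date = c("12-05-2015", "21-03-2015", "12-05-2015", "21-03-2015"),
  quantity = c(10, 12, 15, 16),
  product = c("apple", "orange", "orange", "apple")
)

# Parse the day-month-year strings so that groups sort chronologically
df1$date <- as.Date(df1$date, format = "%d-%m-%Y")

# Mean quantity per date
res <- aggregate(quantity ~ date, data = df1, FUN = mean)
res
```

With real dates in place, the March group is listed before the May group, as in the expected result.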

You could also use the dplyr package. Assuming your data frame is called df:
library(dplyr)
df %>%
group_by(date) %>%
summarize(quantity = mean(quantity))
# date quantity
# 1 12-05-2015 12.5
# 2 21-03-2015 14.0
This gets the mean quantity grouped by date.

Related

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on stack overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got the output that I didn't expect.
This time, I want to do slightly different from my previous question.
Merge two data with respect to date and week using R
I have two data. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 both fall in the 1st week of May 2022. If more than one date falls within a year_month_week, I will just include the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it as NA. So my expected output has the same number of rows as df1, which includes the column year_month_week. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into the year_month_week format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
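One detail worth checking is how the week of the month is derived from the day. The question does not define where a week starts, so the following is only a sketch comparing two plausible rules on a few hand-picked days (the vector days is made up for illustration): the integer-division rule day %/% 7 + 1 and the ceiling rule ceiling(day / 7).

```r
# Hypothetical days of the month chosen to hit the week boundaries
days <- c(1, 6, 7, 8, 14, 15, 28, 29)

# Rule 1: integer division, then shift to 1-based weeks
wk_floor <- days %/% 7 + 1

# Rule 2: ceiling of day / 7
wk_ceiling <- ceiling(days / 7)

# The two rules disagree exactly on days 7, 14, 21 and 28
data.frame(day = days, wk_floor, wk_ceiling)
```

Both rules give the same weeks for the sample dates in the question, but they differ on multiples of 7, so it is worth picking the one that matches how year_month_week was built in df1.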
You can build on a previous answer by taking its function for counting the week of the month and using it to generate a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
year_month_week = paste0(
format(date, "%Y%m"), # year plus zero-padded month, so October to December also work
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = T)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
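Both joins above match on year_month_week only. If the same week can occur for several ids, it may be safer to match on id as well. The following is a base R sketch under that assumption, using the ceiling rule for the week of the month; the helper name first_per_week is made up for illustration.

```r
df1 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2),
  year_month_week = c(2022051, 2022052, 2022053, 2022041, 2022042, 2022043, 2022044),
  points = c(65, 58, 47, 21, 25, 27, 43)
)
df2 <- data.frame(
  id = c(1, 1, 1, 2, 2, 2),
  date = c(20220503, 20220506, 20220512, 20220401, 20220408, 20220409),
  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3)
)

# Build the join key: year, zero-padded month, week of the month
d <- as.Date(as.character(df2$date), format = "%Y%m%d")
week <- ceiling(as.numeric(format(d, "%d")) / 7)
df2$year_month_week <- as.numeric(paste0(format(d, "%Y%m"), week))

# Keep only the earliest date per id and week
df2 <- df2[order(df2$id, df2$date), ]
first_per_week <- df2[!duplicated(df2[c("id", "year_month_week")]),
                      c("id", "year_month_week", "temperature")]

# Left join: every row of df1 is kept, unmatched weeks get NA
merged <- merge(df1, first_per_week, by = c("id", "year_month_week"), all.x = TRUE)
merged <- merged[order(merged$id, merged$year_month_week), ]
merged
```

For the sample data this reproduces the temperature column of the expected output, with NA for the weeks that have no measurement.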

How to calculate aggregate statistics on a dataframe in R by applying conditions on time values?

I am working on climate data analysis. After loading file in R, my interest is to subset data based upon hours in a day.
For time analysis, we can use $hour on the variable in which the time vector is stored if we want to work with hours.
I want to subset my data for each hour of the day across 365 days and then take the average of the data at a particular hour throughout the year. Say I am interested in taking the values of irradiation, wind speed, etc. at 12:00 PM for a year and then taking the mean of these values to get the desired result.
I know how to subset a data frame based on conditions. For example, if my data is in a matrix called data that contains 2 columns, say time and wind speed, and I'm interested in the rows of data in which irradiation isn't zero, I can do this using the following code:
my_data <- subset(data, data[,1]>0)
But now, in order to deal with the hour values in the time column, which is a variable stored in data, how can I subset the values?
My data look like this:
I hope I made sense in this question.
Thanks in advance!
Here is a possible solution. You can create an hourly grouping with format(df$time,'%H'), which extracts only the hour from each timestamp. We can then simply group by this new column and calculate the mean for each group.
df = data.frame(time=seq(Sys.time(),Sys.time()+2*60*60*24,by='hour'),val=sample(seq(5),49,replace=T))
library(dplyr)
df %>% mutate(hour=format(time,'%H')) %>% # refer to the column directly so earlier filters are respected
group_by(hour) %>%
summarize(mean_val = mean(val))
To subset the non-zero values first, you can do either:
df = subset(df,val!=0)
or start the dplyr chain with:
df %>% filter(val!=0)
Hope this helps!
df looks as follows:
time val
1 2018-01-31 12:43:33 4
2 2018-01-31 13:43:33 2
3 2018-01-31 14:43:33 2
4 2018-01-31 15:43:33 3
5 2018-01-31 16:43:33 3
6 2018-01-31 17:43:33 1
7 2018-01-31 18:43:33 2
8 2018-01-31 19:43:33 4
... ... ... ...
And the output:
# A tibble: 24 x 2
hour mean_val
<chr> <dbl>
1 00 3.50
2 01 3.50
3 02 4.00
4 03 2.50
5 04 3.00
6 05 2.00
.... ....
This assumes your time column is already of class POSIXct. Otherwise, you'd first have to convert it using, for example, as.POSIXct(x,format='%Y-%m-%d %H:%M:%S').
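The same grouping can be sketched in base R. The example below uses made-up, deterministic data (two days of hourly values, 1 on the first day and 3 on the second) so that the result is easy to check by hand; the time zone is fixed to UTC as an assumption to keep the extracted hours reproducible.

```r
# Two days of hourly timestamps (hypothetical data)
time <- seq(as.POSIXct("2018-01-01 00:00:00", tz = "UTC"),
            by = "hour", length.out = 48)
df <- data.frame(time = time, val = rep(c(1, 3), each = 24))

# Extract the hour of the day as a two-digit string
df$hour <- format(df$time, "%H")

# Mean value for every hour of the day
hourly <- aggregate(val ~ hour, data = df, FUN = mean)

# Mean for one particular hour, here 12:00
noon <- mean(df$val[df$hour == "12"])
```

Each hour occurs once per day, so every hourly mean here is the average of 1 and 3.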

Time intervals between resightings of several individuals

In R, I need to calculate several time interval variables between resightings of marked individuals. I have a dataset similar to this:
ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
.
.
.
b 11.12 8 7
etc
In which each ID represents a different animal marked for individual recognition, and each row contains the date and time in which it was relocated.
For each individual, I need to calculate the number of days the animal was observed, the mean and standard deviation of the number of relocations per day, and the mean and standard deviation of the days elapsed between relocations (including 0 days between observations on the same day).
Ideally, I need to obtain a data frame such as this:
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
a 27 7 4.2 1.1 1.5 0.5
b 32 5 3.4 0.4 3.2 0.7
c 17 6 4.4 0.2 4.5 1.2
d etc
I've been doing it using the tapply function and transferring the results to Excel, but I am sure there must be relatively simple code that could help me speed up the process in R.
The OP has requested to aggregate 6 statistics per ID. Four of them can be aggregated directly by grouping by ID. Two (mean.Obs.per.Day and m.O.D.sd) need to be grouped by date and ID first.
Unfortunately, the time stamps are split up in three different fields, Time, Day, and Month with the year missing. As four of the statistics are based on dates, we need to construct a Date column which combines Day, Month, and a dummy year.
The code below utilises the data.table and lubridate packages for efficiency.
library(data.table)
# coerce to data.table and add Date column
setDT(DF)[, Date := lubridate::make_date(month = Month, day = Day)] # year is left at its default (1970) as the dummy year
# aggregate by ID,
# use temporary variable to hold the day differences between resightings
agg_per_id <- DF[, {
tmp <- as.numeric(diff(Date))
.(N.Obs = .N, N.days = uniqueN(Date),
mean.days.elapsed = mean(tmp),
mde.sd = sd(tmp))
} , by = ID]
# aggregate by Date and ID
agg_per_day_and_id <- DF[, .N, by = .(ID, Date)][
, .(mean.Obs.per.Day = mean(N), m.O.D.sd = sd(N)), by = ID]
# join partial results
result <- agg_per_day_and_id[agg_per_id, on = "ID"]
# reorder columns (for comparison with expected result)
setcolorder(result, c("ID", "N.Obs", "N.days", "mean.Obs.per.Day",
"m.O.D.sd", "mean.days.elapsed", "mde.sd"))
result
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
1: a 5 3 1.666667 0.5773503 0.5 0.5773503
2: b 1 1 1.000000 NA NaN NA
Note that the figures differ from the expected result of the OP due to different input data.
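To see where the figures for animal a come from, the calculation can be retraced by hand in base R. This is only a sketch: the dates are typed in directly with 1970 as the dummy year, and it assumes the rows are already in chronological order.

```r
# Sighting dates of animal a (dummy year 1970)
dates_a <- as.Date(c("1970-06-13", "1970-06-13", "1970-06-14",
                     "1970-06-15", "1970-06-15"))

# Days elapsed between consecutive sightings: 0 1 1 0
gaps <- as.numeric(diff(dates_a))
mean_gap <- mean(gaps)
sd_gap <- sd(gaps)

# Number of sightings on each distinct day: 2 1 2
obs_per_day <- as.vector(table(dates_a))
mean_obs <- mean(obs_per_day)
sd_obs <- sd(obs_per_day)
```

Animal b has a single sighting, so there are no gaps to average and a standard deviation cannot be computed from one value, which is why its row contains NaN and NA.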
Data
As far as provided by the OP
DF <- readr::read_table(
"ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
b 11.12 8 7"
)

Aggregate function in R using two columns simultaneously

Data:-
df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))
Name Year Balance
1 John 2016 100
2 John 2015 150
3 Stacy 2014 65
4 Stacy 2016 75
5 Kat 2006 150
6 Kat 2006 10
Code:-
aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )
Output:-
Name Year Balance
1 John 2016 150
2 Kat 2006 150
3 Stacy 2016 75
I want to aggregate/summarize the above data frame using two columns, Year and Balance. I used the base function aggregate to do this. I need the maximum balance of the latest (most recent) year. In the first row of the output, John has the latest year (2016) but the balance of 2015, which is not what I need: it should output 100 and not 150. Where am I going wrong?
Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd instead do:
library(data.table)
setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
# Name Year Balance
#1: John 2016 100
#2: Stacy 2016 75
#3: Kat 2006 150
I suggest using the dplyr package:
library(dplyr)
data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
Year=c(2016,2015,2014,2016,2006,2006),
Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
as_tibble() %>% #convert it to a tibble
group_by(Name, Year) %>% #group it by Name and Year
summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
group_by(Name) %>% # group the resulting data frame by Name
top_n(1, Year) # keep only the most recent year within each Name
Here is another solution without the data.table package.
First, sort the data frame by descending Year and Balance:
df <- df[order(-df$Year, -df$Balance),]
Then select the first row in each group with the same name:
df[!duplicated(df$Name),]
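The original aggregate call goes wrong because FUN = max is applied to Year and Balance independently, so the two maxima can come from different rows. As a sketch of one way to make aggregate work (not necessarily what the answers above had in mind), the data can first be restricted to each person's latest year with ave, and the maximum balance taken afterwards:

```r
df <- data.frame(
  Name = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10)
)

# Keep only the rows that belong to each person's most recent year
latest <- df[df$Year == ave(df$Year, df$Name, FUN = max), ]

# Maximum balance within that year
result <- aggregate(Balance ~ Name + Year, data = latest, FUN = max)
result
```

John now gets 100 (his 2016 balance) rather than 150, and Kat gets 150, the larger of her two 2006 balances.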

R data frame: add values in common rows

I have a data frame like this:
> df1
portfolio date ticker quantity price
1 port 2010-01-01 AAPL 100 10
2 port 2010-01-01 AAPL 200 10
3 port 2010-01-01 AAPL 400 11
If rows of df1 are the same except for quantity, then add up the quantities of those matching rows.
I mean, I need the following output:
portfolio date ticker quantity price
1 port 2010-01-01 AAPL 300 10
3 port 2010-01-01 AAPL 400 11
How can I do that? Thanks.
Here you go... :-)
For plyr:
library(plyr)
ddply(df, .(portfolio, date, ticker, price), summarize, quantity=sum(quantity))
For data.table:
library(data.table)
dt <- data.table(df)
dt[,list(quantity=sum(quantity)),by=list(portfolio,date,ticker,price)]
There may be a more concise way to express the list of grouping variables. Otherwise, the aggregate solution is much more elegant.
Use aggregate. Assuming your data.frame is called "mydf":
> aggregate(quantity ~ ., mydf, sum)
portfolio date ticker price quantity
1 port 2010-01-01 AAPL 10 300
2 port 2010-01-01 AAPL 11 400
Of course, we should all now wait for the data.table and ddply versions to populate the answers list....
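In the formula quantity ~ ., the dot stands for every column other than quantity, so it is shorthand for listing all grouping columns. A small sketch, rebuilding the question's data under the name mydf for illustration, shows the explicit and the shorthand forms side by side:

```r
mydf <- data.frame(
  portfolio = "port",
  date = "2010-01-01",
  ticker = "AAPL",
  quantity = c(100, 200, 400),
  price = c(10, 10, 11)
)

# Explicit list of grouping columns
explicit <- aggregate(quantity ~ portfolio + date + ticker + price,
                      data = mydf, FUN = sum)

# Shorthand: group by all remaining columns
dot <- aggregate(quantity ~ ., data = mydf, FUN = sum)
```

The explicit form is handy when the data frame has extra columns that should not take part in the grouping.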
