Aggregate data by the unique values of one column - R

I have a data frame yy and I want to aggregate it. There is a time stamp variable, and its values are repeated.
I want to find the unique values of the time stamp and aggregate all the other variables in the data frame with respect to each unique time stamp value, taking the mean of the other variables.
Here is a sample of the data:
temp yield density time
1 54 NA 30.23 2009-12-31 18
2 54 NA 30.22 2009-12-31 19
3 53 NA 30.20 2009-12-31 20
4 53 NA 30.19 2009-12-31 21
5 50 NA 30.18 2009-12-31 22
6 51 3 30.16 2009-12-31 23
.......
I run the following code:
aggdata=aggregate(yy~time, by= list(unique(time)), data =yy, FUN = mean,na.rm=TRUE)
I got this warning
argument is not numeric or logical: returning NA
If I run the aggregation one variable at a time, it works
aggdata=aggregate(temp~time, by= list(unique(time)),data=yy,FUN=mean)
But if I use the whole data frame yy, I get the warning above.
Could someone please explain this?

Using data.table: convert the 'data.frame' to a 'data.table' with setDT(yy), group by 'time', specify the columns to summarise in .SDcols, then loop through them with lapply and take the mean.
library(data.table)
setDT(yy)[, lapply(.SD, mean, na.rm=TRUE), by = time, .SDcols = c("temp", "yield")]
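If you would rather average every numeric column (density included) without listing them by hand, recent data.table versions also accept a selector function in .SDcols (this assumes a reasonably current data.table):
setDT(yy)[, lapply(.SD, mean, na.rm = TRUE), by = time, .SDcols = is.numeric]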

This seems like something that could easily be done using the package dplyr
You could do something as follows:
yy <- yy %>% group_by(time) %>% summarize(meantemp = mean(temp, na.rm = TRUE), meanyield = mean(yield, na.rm = TRUE))
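As for why the original aggregate() call fails: the formula yy ~ time refers to a column named yy, which doesn't exist in the data, so R falls back to the calling environment and evaluates yy as the whole data frame; aggregate then tries to take the mean of non-numeric columns such as time, which produces the "argument is not numeric or logical" warning. The formula interface and the x/by interface also shouldn't be mixed in one call. A minimal sketch of the formula-only version, assuming the columns shown above:
aggdata <- aggregate(cbind(temp, yield) ~ time, data = yy, FUN = mean, na.rm = TRUE, na.action = na.pass)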

Related

How do I create a daily time series using data that isn't taken daily

I have a csv file that is written like this
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50
I'd like R to produce something like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980
1/7/1980 30
Then I would like R to carry the last observation forward, like this
Date Data
1/1/1980
1/2/1980
1/3/1980
1/4/1980
1/5/1980 25
1/6/1980 25
1/7/1980 30
I'd like two separate data.tables created: one with just the actual data, and another with the last observation carried forward.
Thanks for all the help!
Edit: I will also need any NAs that are produced to be changed to 0.
You could also use tidyverse:
library(tidyverse)
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data) %>%
replace(., is.na(.), 0)
First 10 rows:
# A tibble: 104 x 2
Date Data
<date> <dbl>
1 1980-01-01 0
2 1980-01-02 0
3 1980-01-03 0
4 1980-01-04 0
5 1980-01-05 25
6 1980-01-06 25
7 1980-01-07 30
8 1980-01-08 30
9 1980-01-09 30
10 1980-01-10 30
As the starting point I've used the 1st day of the month of the minimum date, and as the end point the maximum date; this can of course be adjusted as needed.
EDIT: #Sotos has an even better suggestion for a more concise approach (by better usage of format argument):
df %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
complete(Date = seq(as.Date(format(min(Date), "%Y-%m-01")), max(Date), by = "day")) %>%
fill(Data)
The solution is:
create a data.frame with the successive dates
merge it with your original data.frame
use the na.locf function from zoo to carry your data forward
Here is the code. I use lubridate to work with dates and zoo for as.yearmon and na.locf.
library(lubridate)
library(zoo)
df$Date <- mdy(df$Date)
successive <- data.frame(Date = seq(as.Date(as.yearmon(df$Date[1])), df$Date[length(df$Date)], by = "days"))
successive is a data frame of consecutive dates. Now the merge:
result <- merge(df, successive, by = "Date", all.y = TRUE)
And the forward propagation with na.locf:
result$Data <- na.locf(result$Data, na.rm = FALSE)
Date Data
1 1980-01-05 25
2 1980-01-06 25
3 1980-01-07 30
4 1980-01-08 30
5 1980-01-09 30
6 1980-01-10 30
7 1980-01-11 30
8 1980-01-12 30
9 1980-01-13 30
10 1980-01-14 30
11 1980-01-15 30
12 1980-01-16 30
13 1980-01-17 30
14 1980-01-18 30
15 1980-01-19 30
16 1980-01-20 30
17 1980-01-21 30
18 1980-01-22 30
19 1980-01-23 30
20 1980-01-24 30
21 1980-01-25 30
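If, as per the edit in the question, any NAs that remain (the days before the first observation) should become 0, one more line does it:
result$Data[is.na(result$Data)] <- 0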
The data:
df <- read.table(text = "Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50", header = T)
Assuming that the result should start at the first of the month of the first date and end at the last date, and that the input data frame is DF as shown reproducibly in the Note at the end: convert DF to a zoo object z, create a grid of dates g, and merge them to give the zoo objects z0 (with zero filling) and zz (with na.locf filling). Optionally convert back to data frames, or just leave them as is so you can use zoo for further processing.
library(zoo)
z <- read.zoo(DF, header = TRUE, format = "%m/%d/%Y")
g <- seq(as.Date(as.yearmon(start(z))), end(z), "day")
z0 <- merge(z, zoo(, g), fill = 0) # zero filled
zz <- na.locf0(merge(z, zoo(, g))) # na.locf filled
# optional
DF0 <- fortify.zoo(z0) # zero filled
DF2 <- fortify.zoo(zz) # na.locf filled
data.table
The question mentions data tables; if that refers to the data.table package, then add:
library(data.table)
DT0 <- data.table(DF0) # zero filled
DT2 <- data.table(DF2) # na.locf filled
Variations
I wasn't clear on whether the question was asking for both a zero-filled answer and an na.locf-filled answer, or just an na.locf-filled answer whose remaining NA values are then zero filled, but I assumed the former. Some variations:
1) If you want to fill the NAs that are left in the na.locf-filled answer, add:
zz[is.na(zz)] <- 0
2) If you want to end at the end of the last month rather than at the last date, replace end(z) with as.Date(as.yearmon(end(z)), frac = 1).
3) If you want to start at the first date rather than the first of the month of the first date, replace as.Date(as.yearmon(start(z))) with start(z).
As an alternative to (3), to start at the first date and end at the last date we could simply convert to ts and back. Note that we need to restore Date class on the second line below since ts class cannot handle Date class directly.
z2.na <- as.zoo(as.ts(z))
time(z2.na) <- as.Date(time(z2.na))
zz20 <- replace(z2.na, is.na(z2.na), 0) # zero filled
zz2 <- na.locf0(z2.na) # na.locf filled
Note
Lines <- "
Date Data
1/5/1980 25
1/7/1980 30
2/13/1980 44
4/13/1980 50"
DF <- read.table(text = Lines, header = TRUE)

How to iterate and make new variables within a function in R? [duplicate]

This question already has answers here:
How to split a data frame?
(8 answers)
Closed 4 years ago.
Is there a way in R to write a function that turns subsets (for example, by date) into their own data frames? For example, I have 30 days' worth of data and I want to break it down into individual days, outputting each day into its own new data frame. I can't figure out how to do it in a function. Any clues?
Example:
Dataframe: df_of_month
Output desired via a loop function of sorts:
df_of_month_day1
df_of_month_day2
df_of_month_day3
df_of_month_day4
df_of_month_day5
df_of_month_day6
etc.? I've been looking at multiple ways and it's not working.
To answer your question, you can achieve this with lapply. For instance, consider the following:
Create some sample data:
df <- data.frame(Day = rep(seq.Date(from = as.Date('2010-01-01'), to = as.Date('2010-01-30'), by =1), 5))
df$somevar <- rnorm(nrow(df))
head(df)
Day somevar
1 2010-01-01 -0.946059466
2 2010-01-02 0.005897001
3 2010-01-03 -0.297566286
4 2010-01-04 -0.637562495
5 2010-01-05 -0.549800912
6 2010-01-06 0.287709994
Now, observe that unique can give you a vector with all unique dates:
unique(df$Day)
[1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15" "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25" "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
This you can pass to lapply to be used for subsetting:
lapply(unique(df$Day), function(x) df[df[,"Day"]==x,])
[[1]]
Day somevar
1 2010-01-01 -0.9460595
31 2010-01-01 -0.3434005
61 2010-01-01 -1.5463641
91 2010-01-01 -0.5192375
121 2010-01-01 -1.1780619
[[2]]
Day somevar
2 2010-01-02 0.005897001
32 2010-01-02 -1.346336688
62 2010-01-02 -0.321702391
92 2010-01-02 -0.384277955
122 2010-01-02 0.058906305
... (output omitted)
where the output of lapply is a list with the corresponding dataframes.
Needless to say, you would assign this to a name to capture all the data frames in a list, as in mylist <- lapply(...). If you want to have them in your global environment, you can first give each data frame a name, for instance with setNames as in setNames(mylist, paste0("df", format(unique(df$Day), format = "%Y%m%d"))), and then use list2env(mylist) to push each list element into the global environment.
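A minimal sketch of those two steps put together (the df_of_month_day names are just illustrative):
mylist <- lapply(unique(df$Day), function(x) df[df$Day == x, ])
mylist <- setNames(mylist, paste0("df_of_month_day", seq_along(mylist)))
list2env(mylist, envir = .GlobalEnv)  # creates df_of_month_day1, df_of_month_day2, ...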
However, as mentioned in the comments, this is probably not a good idea. If you want to do something to each date, consider the group-by solution with dplyr. For instance, imagine you want to get the mean by date:
library(dplyr)
df %>% group_by(Day) %>% summarize(mean_var = mean(somevar))
# A tibble: 30 x 2
Day mean_var
<date> <dbl>
1 2010-01-01 -0.907
2 2010-01-02 -0.398
3 2010-01-03 0.213
4 2010-01-04 -0.142
5 2010-01-05 -0.377
6 2010-01-06 0.404
7 2010-01-07 -0.634
8 2010-01-08 1.00
9 2010-01-09 0.378
10 2010-01-10 -0.0863
# ... with 20 more rows
where each row corresponds to a group-wise mean. This pattern is called split-apply-combine and is worth googling; it will come up again and again.
Just for reference, in base R, you could achieve this using e.g. by, as in
by(df$somevar, df$Day, FUN = mean)
though dplyr or data.table is probably more user-friendly.
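And since the linked duplicate asks how to split a data frame: base R's split() gives the same list of per-day data frames in a single call (a sketch using the sample df above):
daylist <- split(df, df$Day)  # named list with one data frame per day
length(daylist)               # 30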

Filter a data frame by two time series

Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07.03.2003 to 31.12.2016.
I need to split/filter the data by multiple date ranges, as below.
Dates required in the new data frame:
07.03.2003 to 06/03/2005
and
01/01/2013 to 31/12/2016
i.e. the new data frame should not include dates from 07/03/2005 to 31/12/2012.
Let's take the following data frame with dates (using ymd from lubridate and filter/between from dplyr):
library(dplyr)
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd and dplyr::between:
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
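To keep both of the requested windows in a single data frame, the two between() conditions can be combined with | (a sketch using the date ranges from the question):
df_keep <- filter(df,
between(date, ymd("2003-03-07"), ymd("2005-03-06")) |
between(date, ymd("2013-01-01"), ymd("2016-12-31")))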
I would go with lubridate. In particular
library(data.table)
library(lubridate)
set.seed(555)#in order to be reproducible
N <- 1000#number of pseudonumbers to be generated
date1<-dmy("07-03-2003")
date2<-dmy("06-03-2005")
date3<-dmy("01-01-2013")
date4<-dmy("31-12-2016")
Creating data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N), numeric_sample = sample(N, replace = F))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates<-interval(date2+1,date3-1)#create interval that dates should not fall in.
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1<-dmy("08-03-2003")#should not fall in above range
test_date2<-dmy("08-03-2005")#should fall in above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after:
plot(my_dt)  # before
my_dt <- my_dt[!(date_sample %within% forbidden_dates)]  # applying the temporal cut
plot(my_dt)  # after

Divide column values by multiple values based on conditions

I have a csv file that contains indexes for various asset classes and most of them start on different dates. I would like to create new indexes for these asset classes that have the same base year. Below is a subset of the data I have.
library(dplyr)
indexes <- read.csv("AssetClassIndexes.csv")
indexes$Date <- as.Date(indexes$Date, '%m/%d/%Y')
indexes %>%
filter(Date > as.Date('2013-01-01')) %>%
select(Date, Large.Cap.Stocks, Mid.Cap.Stocks, Precious.Metals)
Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
1 2013-01-31 130.9160 58.13547 651.1803
2 2013-02-28 132.6932 58.70621 658.3433
3 2013-03-31 137.6696 61.51427 690.4047
4 2013-04-30 140.3220 61.90042 684.9505
5 2013-05-31 143.6044 63.29899 720.4309
6 2013-06-30 141.6760 62.13056 723.7449
7 2013-07-31 148.8850 65.97987 777.3744
8 2013-08-31 144.5731 63.50743 750.3217
9 2013-09-30 149.1068 66.81690 803.2194
10 2013-10-31 155.9609 69.29937 831.1715
11 2013-11-30 160.7137 70.21606 877.3015
12 2013-12-31 164.7823 72.38485 893.8825
13 2014-01-31 159.0851 70.84785 854.2835
14 2014-02-28 166.3623 74.30846 890.2488
15 2014-03-31 167.7607 74.58250 898.8842
16 2014-04-30 169.0008 73.41721 868.2323
17 2014-05-31 172.9679 74.72066 869.1005
18 2014-06-30 176.5410 77.81163 906.8195
19 2014-07-31 174.1063 74.48576 853.8612
20 2014-08-31 181.0715 78.27180 892.6265
21 2014-09-30 178.5322 74.71220 841.8361
What I would like to do is create multiple base indexes based on various dates.
BaseDates <-
c(
'1973-12-31',
'1981-06-30',
'1984-03-31',
'2001-03-31',
'2007-12-31'
)
I have the following line of code that allows me to create an index based on one date, but I can't figure out how to do all the base dates above. I'm guessing it involves some sort of apply function; any suggestions?
indexes %>%
mutate_each(funs(BaseIdx(.,Date,as.Date('1984-06-30'))),-Date)
BaseIdx <- function(x, column, dte) {x / x[column == dte]}
There are multiple approaches you can take. Your suggested approach moves across each column (mutate_each), dividing each column by its value at a single date. You can iterate this over all your base dates with an apply-family function or a loop.
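A sketch of that column-wise route, iterating the asker's BaseIdx over each base date (mutate_each is superseded in current dplyr, so across() is used here as an assumption about your dplyr version; base dates missing from the data would need extra handling):
library(dplyr)
BaseIdx <- function(x, column, dte) {x / x[column == dte]}
reindexed <- lapply(as.Date(BaseDates), function(d) {
  indexes %>% mutate(across(-Date, ~ BaseIdx(.x, Date, d)))
})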
An alternate approach below uses lapply to iterate across dates, dividing rows by a vector. The tricky part is the division of a data frame by rows: here the data frame is transposed (t), divided by a vector (as.numeric), then transposed back to the original format.
#indexes = the subsetted [21 x 4] data in your example
#Sample some dates based on your example data
BaseDates <- indexes[seq(1, 21, by=5), "Date"]
cols <- setdiff(names(indexes), "Date")  # the index columns to rebase
IndexThemALL <- lapply(BaseDates, function(z) { #z = each BaseDate
data.frame(
IndexDate = z,
Date = indexes$Date,
t(t(indexes[, cols])/as.numeric(indexes[indexes$Date == z, cols]))
)
})
# Optional: collapse the list into one data frame
IndexThemALL <- dplyr::bind_rows(IndexThemALL)  # rbind_all() in older dplyr versions
#Source: local data frame [105 x 5]
#IndexDate Date Large.Cap.Stocks Mid.Cap.Stocks Precious.Metals
#1 2013-01-31 2013-01-31 1.000000 1.000000 1.000000
#2 2013-01-31 2013-02-28 1.013575 1.009817 1.011000
#3 2013-01-31 2013-03-31 1.051587 1.058119 1.060236
#4 2013-01-31 2013-04-30 1.071848 1.064762 1.051860

R: Aggregating Large Data Frame under a Grouping Condition

I'm trying to figure out the fastest way to aggregate a large data frame (about 50M rows) that looks similar to:
>sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
+ "date" = sample(seq(as.Date("2014-01-01"),as.Date("2014-02-13"),by=1),6),
+ "value" = runif(6))
> sample_frame
id date value
1 73 2014-02-11 0.84197491
2 7 2014-01-14 0.08057893
3 73 2014-01-16 0.78521616
4 7 2014-01-24 0.61889286
5 73 2014-02-06 0.54792356
6 7 2014-01-06 0.66484848
Here we have 2 unique IDs with 3 dates and a value assigned to each. I know that I can use ddply, or data.table, or just a lapply to aggregate and find the mean for each ID.
What I'm really looking for is a way to quickly find the mean for each ID for the most recent two dates. For example, with sapply:
> sapply(split(sample_frame,sample_frame$id),function(x){
+ mean(x$value[x$date%in%x$date[order(x$date,decreasing=T)][1:2]])
+ })
7 73
0.3497359 0.6949492
I can't figure out how to get data.table to do this. Thoughts? Hints?
Why not use tail in your "data.table" aggregation step?
set.seed(1)
sample_frame = data.frame("id" = rep(sample(1:100,2,replace=F),3),
"date" = sample(seq(as.Date("2014-01-01"),
as.Date("2014-02-13"),by=1),6),
"value" = runif(6))
library(data.table)
DT <- data.table(sample_frame, key = "id,date")
DT
# id date value
# 1: 27 2014-01-09 0.20597457
# 2: 27 2014-01-26 0.62911404
# 3: 27 2014-02-07 0.68702285
# 4: 37 2014-02-06 0.17655675
# 5: 37 2014-02-09 0.06178627
# 6: 37 2014-02-13 0.38410372
DT[, mean(tail(value, 2)), by = id]
# id V1
# 1: 27 0.6580684
# 2: 37 0.2229450
Since you require the mean of just two values, you can do it directly (without using mean). And you can use the internal variable .N instead of tail to get more speed-up. You just have to take care of the case where there's just 1 date. Basically, this should be much faster.
DT[, (value[.N]+value[max(1L, .N-1)])/2, by=id]
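For comparison, a dplyr sketch of the same "mean of the two most recent dates per id" (slice_max assumes dplyr >= 1.0.0):
library(dplyr)
sample_frame %>%
group_by(id) %>%
slice_max(date, n = 2) %>%   # keep the two most recent dates per id
summarise(mean_value = mean(value))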
