When I run the following code on my dataset, I get output like this (partial output shown):
all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))]
Senegal Muslims Serbia Muslims Seychelles Muslims
1970-01-01 3693807 200000 170
2000-01-01 8936283 529322 730
2010-01-01 11713126 527598 821
2015-01-01 13621382 471414 844
However, when I try to use the function apply.yearly on it to sum across the years, I just get an NA result:
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
1970-01-01 NA
2000-01-01 NA
2010-01-01 NA
2015-01-01 NA
The funny thing is that it works with some inputs but not others. For example, if I use input "Agnostics" instead of "Muslims", I get a good result. There isn't an error, so I can't seem to figure out what exactly is happening here.
all_countries_ts is stored as an xts object. One thing to note is that apply.yearly() always works on a subset of this dataset. I have written a function for this, which you can see below:
sum_by_category <- function(religious_group, dataset) {
  apply.yearly(dataset[, grepl(religious_group, colnames(dataset))], FUN = sum)
}
library(foreach)  # needed for foreach() / %do%
country_search <- function(country_name) {
  foreach(i = seq_along(country_name), .combine = merge.xts) %do% {
    all_countries_ts[, grepl(country_name[i], colnames(all_countries_ts))]
  }
}
When I type in the following, it works perfectly:
sum_by_category("Muslims", country_search("Senegal"))
Senegal Muslims
1970-01-01 3693807
2000-01-01 8936283
2010-01-01 11713126
2015-01-01 13621382
I really can't figure out what's going on since it works with some inputs and not others. Thanks in advance for any help / insights!
xts::apply.yearly expects its x argument to be coercible to an xts object. Perhaps your data.frame is not an xts-compatible data frame.
The help for apply.yearly explains:
Arguments
x a time-series object coercible to xts
FUN an R function
I have created sample data based on the data shared by the OP and converted it to the xts class. apply.yearly works correctly on it.
library(xts)
# Convert data.frame to xts class
all_countries_ts <- xts(df[,-1], order.by = df$Date)
# Now one can use apply.yearly
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
# [,1]
# 1970-01-01 3893977
# 2000-01-01 9466335
# 2010-01-01 12241545
# 2015-01-01 14093640
Edited: A review of the OP's data suggests that it contains NA in many columns, which causes the total sum to be NA. The fix is simple: the OP needs to pass na.rm = TRUE through to sum:
apply.yearly(all_countries_ts[,grepl("Muslims",colnames(all_countries_ts))],
FUN = sum, na.rm = TRUE)
# [,1]
# 1970-01-01 570772699
# 2000-01-01 1292170756
# 2010-01-01 1571250533
# 2015-01-01 1734531709
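One quick way to confirm which columns are responsible (a throwaway check, using a temporary muslims subset variable) is to count the NAs per matched column:
# count missing values in each of the matched columns
muslims <- all_countries_ts[, grepl("Muslims", colnames(all_countries_ts))]
colSums(is.na(muslims))  # any column with a count > 0 turns the unadjusted sum into NA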
Data:
df <- read.table(text =
" Date 'Senegal Muslims' 'Serbia Muslims' 'Seychelles Muslims' Others
1970-01-01 3693807 200000 170 200
2000-01-01 8936283 529322 730 100
2010-01-01 11713126 527598 821 300
2015-01-01 13621382 471414 844 500",
header = TRUE, stringsAsFactors = FALSE)
#convert Date column to Date format
df$Date <- as.Date(df$Date)
I have two data frames: one containing time periods marked with character unique IDs, and another containing events with a second set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UID_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UID_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF, in order to match the ID of each event to the ID of a period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UID_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF into intervals and then using some combination of apply and instant %within% interval, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate; in fact, a solution in base R would be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
library(lubridate)
intervalsP <- interval(periodStart, periodEnd)
Step 2: A function is needed for checking whether an instant is located within a list of intervals. The only reason I created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to use the interval-to-character coercion, because otherwise the apply functions coerced the intervals to their lengths in seconds, making them useless for matching purposes - i.e. all four intervals in this example have the same length.
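As a quick illustration of that coercion (using the intervalsP created above): in a numeric context an interval reduces to its span in seconds, so all four two-hour periods become indistinguishable:
as.numeric(intervalsP)
# [1] 7200 7200 7200 7200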
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- sapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The resulting DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UID_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
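For reference, here is a sketch of a base R alternative (no lubridate), relying on the stated guarantees that periods never overlap and that periodStart is sorted: findInterval() maps each event to the most recent period start, and a check against the matching periodEnd discards events that fall between periods:
idx <- findInterval(eventDF$eventTime, periodDF$periodStart)
idx[idx == 0] <- NA  # events before the first period start
hit <- !is.na(idx) & eventDF$eventTime <= periodDF$periodEnd[idx]
eventDF$periodID <- ifelse(hit, as.character(periodDF$periodID[idx]), NA)
# an outer merge keeps unmatched periods and unmatched events as NA rows
merge(periodDF[, "periodID", drop = FALSE], eventDF[, c("periodID", "eventID")],
      by = "periodID", all = TRUE)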
I am using xts to convert 10-minute data to hourly averages. I am starting with a year's worth of 10-minute data. In some hours, a single 10-minute observation of a variable (such as 'UTSP' in row 229) is NA (missing).
For such hours, I would still like the average of the values that are available; however, in the output I get NA for that variable for the whole hour.
Other hours may have no data (all data are missing). I want these completely missing hours to return NA, but where some data exist for an hour, I want that data to be used.
Here is a reproducible example of what I've been trying:
Lines <- "date,time,UTSP,UPM10,UPM25,UPM1,UWS,UWDT,PTSP,PPM10,PPM25,PPM1,PWS,PWDT
218,2014/10/15,22:00,9.7,4.9,4.66,1.54,6,152.56,102,53.6,33.71,10.34,NA,NA
219,2014/10/15,22:10,9.3,5.1,4.57,1.61,6.4,147.56,106.4,55.1,33.92,10.47,NA,NA
220,2014/10/15,22:20,8.9,5,4.7,1.55,6.4,147.56,108.3,54.8,33.19,10.53,NA,NA
221,2014/10/15,22:30,9.7,5.3,4.93,1.62,6.8,152.56,110.3,57.4,34.97,11.14,NA,NA
222,2014/10/15,22:40,9.1,5.2,4.76,1.54,6.8,152.56,118.9,62.3,37.58,11.63,NA,NA
223,2014/10/15,22:50,9.8,5.5,5.07,1.62,6.7,152.56,120.5,61.8,36.24,11.9,NA,NA
224,2014/10/15,23:00,11.1,5.6,5.2,1.59,6.4,152.56,108.6,57.1,34.93,11.66,NA,NA
225,2014/10/15,23:10,9.8,5.4,4.89,1.63,7.3,152.56,116,59.6,35.08,11.14,NA,NA
226,2014/10/15,23:20,9.1,5,4.95,1.63,7.1,152.56,122.6,63.8,38.28,12.17,NA,NA
227,2014/10/15,23:30,9.7,5.2,4.88,1.58,7.3,147.56,88.1,46.7,29.59,9.78,NA,NA
228,2014/10/15,23:40,9.2,5.2,4.79,1.66,7.1,152.56,92.4,48.8,30.11,9.69,NA,NA
229,2014/10/15,23:50,NA,NA,NA,NA,NA,NA,89.7,48.1,30.53,9.89,NA,NA
230,2014/10/16,00:00,9.8,5.5,5.03,1.6,7,147.56,91.2,47.5,30.09,9.38,NA,NA
231,2014/10/16,00:10,9.7,5.1,4.81,1.57,7.1,152.56,91.2,47.6,29.44,9.4,NA,NA
232,2014/10/16,00:20,9.9,5.4,5.09,1.61,7.4,147.56,91.1,48.3,29.78,9.23,NA,NA
233,2014/10/16,00:30,9.8,5.4,4.82,1.62,6.9,152.56,95.7,48.6,29.47,9.8,NA,NA
234,2014/10/16,00:40,10.6,5.7,4.99,1.58,6.8,147.56,91.3,47.9,29.57,9.94,NA,NA
235,2014/10/16,00:50,10.1,5.4,4.93,1.65,7,147.56,86.3,44.9,27.9,8.93,NA,NA"
conn <- textConnection(Lines)
dframe <- read.csv(conn)
close(conn)
library(xts)
USP_TSP.xts <- xts(dframe$UTSP,
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
I have also tried several variations of na.contiguous, na.omit, na.action.
My resultant output always seems to be the same (excerpt):
[,1]
2014-10-15 22:50:00 9.4166667
2014-10-15 23:50:00 NA
2014-10-16 00:50:00 9.9833333
... with the value for 2014-10-15 hour 23 being NA, even though 5 out of 6 values were non-missing.
Also, I am calculating all the columns separately, then combining them later. Is there an easier way - like calculating all the columns at once?
Calling na.exclude doesn't change the USP_TSP.xts object. You would need to assign the output of na.exclude to USP_TSP.xts to achieve that.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
But if you want to process all the columns in the object at once, using na.exclude is going to remove all rows that have at least one column with a missing value.
xData <- xts(dframe[,-(1:2)],
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(xData)
# UTSP UPM10 UPM25 UPM1 UWS UWDT PTSP PPM10 PPM25 PPM1 PWS PWDT
str(na.exclude(xData))
# An 'xts' object of zero-width
Instead, you should supply na.rm=TRUE to the function applied inside period.apply (extra arguments are passed through to FUN). If you want to process all columns at the same time, you can use colMeans.
xDataMeans <- period.apply(xData, endpoints(xData, "hours"), colMeans, na.rm=TRUE)
xDataMeans
# UTSP UPM10 UPM25 UPM1 UWS UWDT
# 2014-10-15 22:50:00 9.416667 5.166667 4.781667 1.580 6.516667 150.8933
# 2014-10-15 23:50:00 9.780000 5.280000 4.942000 1.618 7.040000 151.5600
# 2014-10-16 00:50:00 9.983333 5.416667 4.945000 1.605 7.033333 149.2267
# PTSP PPM10 PPM25 PPM1 PWS PWDT
# 2014-10-15 22:50:00 111.06667 57.50000 34.93500 11.001667 NaN NaN
# 2014-10-15 23:50:00 102.90000 54.01667 33.08667 10.721667 NaN NaN
# 2014-10-16 00:50:00 91.13333 47.46667 29.37500 9.446667 NaN NaN
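One caveat: when every observation in a column for a given hour is NA (as with PWS and PWDT above), colMeans with na.rm=TRUE returns NaN rather than NA. If you want completely missing hours reported as plain NA, as the question asks, one way is to convert afterwards:
xDataMeans[is.nan(xDataMeans)] <- NA  # replace NaN (all-missing hours) with NA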
Your code works fine. You just need to assign the result: USP_TSP.xts <- na.exclude(USP_TSP.xts). If you merely call na.exclude(USP_TSP.xts), the output without NAs is printed, but it is not stored in any variable.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
# [,1]
#2014-10-15 22:50:00 9.416667
#2014-10-15 23:40:00 9.780000
#2014-10-16 00:50:00 9.983333
Note that after na.exclude, the 23:50 row is gone, so the last observation of hour 23 (and therefore its endpoint label) is now 23:40. Alternatively, you can use period.apply(USP_TSP.xts, ep, mean, na.rm = TRUE) if you don't want to modify the original xts object.
We have a csv file with dates in Excel's numeric format and the NAV for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure: [Date, Manager A NAV, Manager B NAV].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format and don't know how to handle that. Is the problem set up in the correct way?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. For Excel this is usually Jan 0, 1900 (go figure), but be careful, as I don't think this is always the case. You can try this:
# Excel's origin is day 0 on Jan 0, 1900, but Excel treats 1900 as a leap year, so the effective origin is 1899-12-30
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct on the original numeric values instead. Since the numbers are in days, convert them to seconds, and specify the time zone explicitly so the printed times are not shifted to your local zone:
data$Date <- as.POSIXct(data$Date * 86400, origin = "1899-12-30", tz = "UTC")
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
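From there, a minimal sketch of the zoo object the question asks for (assuming the date columns have been converted as above; the trailing empty rows in Manager A's columns read in as NA and are dropped before building its series):
library(zoo)
okA <- !is.na(data$Date)  # drop the padding rows with no Manager A data
zA <- zoo(data$Manager.A[okA], order.by = data$Date[okA])
zB <- zoo(data$Manager.B, order.by = data$Date.1)
nav.zoo <- merge(Manager.A = zA, Manager.B = zB)  # outer join on the union of dates; NA where a series has no value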