I am having trouble applying a function I wrote to a dataframe in order to mutate in a new column.
I want to add a column to a dataframe that calculates the sunrise/sunset time for all rows based on existing columns for Latitude, Longitude and Date. The sunrise/sunset calculation is derived from the "sunriset" function from the maptools package.
Below is my function:
library(maptools)
library(tidyverse)
sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset"), num.days = 1)
{
lat.long <- matrix(c(long, lat), nrow = 1)
day <- as.POSIXct(date, tz = timezone)
sequence <- seq(from = day, length.out = num.days, by = "days")
sunrise <- sunriset(lat.long, sequence, direction = "sunrise",
POSIXct = TRUE)
sunset <- sunriset(lat.long, sequence, direction = "sunset",
POSIXct = TRUE)
ss <- data.frame(sunrise, sunset)
ss <- ss[, -c(1, 3)]
colnames(ss) <- c("sunrise", "sunset")
if (direction == "sunrise") {
return(ss[1,1])
} else {
return(ss[1,2])
}
}
When I run the function for a single input I get the expected output:
sunrise.set2(41.2, -73.2, "2018-12-09 07:34:0", timezone="EST",
direction = "sunset", num.days = 1)
[1] "2018-12-09 16:23:46 EST"
However, when I try to do this on a dataframe object to mutate in a new column like so:
df <- df %>%
mutate(set = sunrise.set2(Latitude, Longitude, LocalDateTime, timezone="UTC", num.days = 1, direction = "sunset"))
I get the following error:
Error in mutate_impl(.data, dots) :
Evaluation error: 'from' must be of length 1.
The dput of my df is below. I suspect I'm not doing something right in order to properly vectorize my function but I'm not sure what.
Thanks
dput(df):
structure(list(Latitude = c(20.666, 20.676, 20.686, 20.696, 20.706,
20.716, 20.726, 20.736, 20.746, 20.756, 20.766, 20.776), Longitude = c(-156.449,
-156.459, -156.469, -156.479, -156.489, -156.499, -156.509, -156.519,
-156.529, -156.539, -156.549, -156.559), LocalDateTime = structure(c(1534318440,
1534404840, 1534491240, 1534577640, 1534664040, 1534750440, 1534836840,
1534923240, 1535009640, 1535096040, 1535182440, 1535268840), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("Latitude", "Longitude",
"LocalDateTime"), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"), spec = structure(list(cols = structure(list(
Latitude = structure(list(), class = c("collector_double",
"collector")), Longitude = structure(list(), class = c("collector_double",
"collector")), LocalDateTime = structure(list(format = "%m/%d/%Y %H:%M"), .Names = "format", class = c("collector_datetime",
"collector"))), .Names = c("Latitude", "Longitude", "LocalDateTime"
)), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
The problem is indeed that your function, as it is now, is not vectorized: it breaks if you give it more than one value. A workaround (as Suliman suggested) is using rowwise() or a variant of apply, but that would give your function a lot of unnecessary work.
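To see where the original body trips over vector input, here is a small base R sketch. It does not call maptools at all; the three coordinates and dates are illustrative values, and the variable names (bad_coords, good_coords, seq_result) are made up for this example.

```r
# Illustrative inputs: three rows, as mutate() would pass whole columns
lat  <- c(20.666, 20.676, 20.686)
long <- c(-156.449, -156.459, -156.469)
days <- as.POSIXct(c("2018-08-15", "2018-08-16", "2018-08-17"), tz = "UTC")

# matrix(..., nrow = 1) squeezes every value into a single row,
# whereas cbind() gives one row per point with two columns
bad_coords  <- matrix(c(long, lat), nrow = 1)
good_coords <- cbind(long, lat)
dim(bad_coords)   # 1 row, 6 columns
dim(good_coords)  # 3 rows, 2 columns

# seq() only accepts a single starting value, so a vector of dates fails here
seq_result <- tryCatch(
  seq(from = days, length.out = 1, by = "days"),
  error = function(e) e
)
inherits(seq_result, "error")
```

Both spots only show up once the inputs have length greater than one, which is why the single-value call appeared to work.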
So it is better to make it vectorized, as maptools::sunriset is also vectorized. First suggestion: debug or rewrite it with vectors as input, and then you easily see the lines where something unexpected happens. Let's go at it line by line. I've commented out your lines where I replace them with something else:
library(maptools)
library(tidyverse)
# sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset"), num.days = 1)
sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset"))
# Why an argument saying how many days? You have the length of your dates
{
#lat.long <- matrix(c(long, lat), nrow = 1)
lat.long <- cbind(long, lat)
day <- as.POSIXct(date, tz = timezone)
# sequence <- seq(from = day, length.out = num.days, by = "days") # Your days object is fine
sunrise <- sunriset(lat.long, day, direction = "sunrise",
POSIXct = TRUE)
sunset <- sunriset(lat.long, day, direction = "sunset",
POSIXct = TRUE)
# I've replaced sequence with day here
ss <- data.frame(sunrise, sunset)
ss <- ss[, -c(1, 3)]
colnames(ss) <- c("sunrise", "sunset")
if (direction == "sunrise") {
#return(ss[1,1])
return(ss[,1])
} else {
#return(ss[1,2])
return(ss[,2])
}
}
But looking at your function, I think there is still a lot of extra work done that doesn't serve any purpose.
You're calculating both sunrise and sunset, only to use one of them. You can just pass on your direction argument, without even looking at it.
Is it useful to ask for a separate date and timezone? When your users give you a POSIXt object, the timezone is included. And it's nice if you can input a string as a date, but that only works if it's in the right format. To keep it simple, I'd just ask for a POSIXct as input (which is what your example data.frame has).
Why are you making a data.frame and assigning names before returning? As soon as you're subsetting, it all gets dropped again.
Which means your function can be a lot shorter:
sunrise.set2 <- function(lat, lon, date, direction = c("sunrise", "sunset")) {
lat.long <- cbind(lon, lat)
sunriset(lat.long, date, direction=direction, POSIXct.out=TRUE)[,2]
}
If you have no control over your input you might need to add some checks, but usually I find it most useful to keep focused on just the thing you want to accomplish.
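If you do need such checks, a minimal sketch could look like the following. The helper name check_sun_inputs and the particular conditions are assumptions made for illustration; they are not part of maptools, and you may want different rules for your data.

```r
# Hypothetical helper (not part of maptools): fail early on unusable input
check_sun_inputs <- function(lat, lon, date) {
  stopifnot(
    is.numeric(lat), is.numeric(lon),
    inherits(date, "POSIXct"),          # the timezone travels with the object
    length(lat) == length(lon),
    length(lat) == length(date),
    all(abs(lat) <= 90, na.rm = TRUE),
    all(abs(lon) <= 180, na.rm = TRUE)
  )
  invisible(TRUE)
}

# Passes: equal-length vectors and a POSIXct date
ok <- check_sun_inputs(
  c(20.666, 20.676),
  c(-156.449, -156.459),
  as.POSIXct(c("2018-08-15 07:34:00", "2018-08-16 07:34:00"), tz = "UTC")
)

# Fails: the date is a plain string, not a POSIXct
bad <- tryCatch(
  check_sun_inputs(20.666, -156.449, "2018-08-15"),
  error = function(e) e
)
```

Calling the helper as the first line of sunrise.set2 keeps the calculation itself as short as above.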
I have a dataframe with the following structure:
df <- structure(list(Name = structure(1:9, .Label = c("task 1", "task 2",
"task 3", "task 4", "task 5", "task 6", "task 7", "task 8", "task 9"
), class = "factor"), Start = structure(c(1479799800, 1479800100,
1479800400, 1479800700, 1479801000, 1479801300, 1479801600, 1479801900,
1479802200), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1479801072,
1479800892, 1479801492, 1479802092, 1479802692, 1479803292, 1479803892,
1479804492, 1479805092), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("Name",
"Start", "End"), row.names = c(NA, -9L), class = "data.frame")
Now I want to count the items in column "Name" over time. They all have start and end datetimes, which are formatted as POSIXct.
With help of this solution here on SO I was able to do so (or at least I think I was) with following code:
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
library(ggplot2)
ggplot(ans[, .N, by = yid], aes(x = yid, y = N)) + geom_line()
Problem now:
How do I match my DateTime-scale to those integer values on the x-axis? Or is there a faster and better solution to solve my problem?
I tried to use x = as.POSIXct(yid, format = "%Y-%m-%dT%H:%M:%S", origin = min(df$Start)) within the aes of the ggplot(). But that didn't work.
EDIT:
When using the solution for this problem, I faced another one. Items where there is no count are displayed with the count of the latest countable item in the plot. This is why we have to merge (left join) the table with the counts (ans) again with a complete sequence of all datetimes and put a 0 for every NA. So we get explicit values for every necessary data point.
Like this:
# The part we use to count and match the right times
df1 <- ans[, .N, by = yid] %>%
mutate(time = min(df$Start) + minutes(yid - 1)) # yid is a 1-based row index into lookup, so yid 1 is the first minute
# The part where we use the sequence from the beginning for a LEFT JOIN with the counting dataframe
df2 <- data.frame(time = dates)
dt <- merge(x = df2, y = df1, by = "time", all.x = TRUE)
dt[is.na(dt)] <- 0
In the tidyverse framework, this is a slightly different task -
Generate the same dates variable you have.
Construct a data frame with all tasks and all times (cartesian join)
Filter out the rows that are not in the interval for each task
Add up the tasks for each minute that remain
Plot.
That looks something like this --
library(tidyverse)
library(lubridate)
dates = seq(min(df$Start), max(df$End), by = "min")
df %>%
mutate(key = 1) %>%
left_join(data_frame(key = 1, times = dates)) %>%
mutate(include = times %within% interval(Start, End)) %>%
filter(include) %>%
group_by(times) %>%
summarise(count = n()) %>%
ggplot(aes(times, count)) +
geom_line()
#> Joining, by = "key"
If you need it to be faster, it will almost certainly be faster using your original data.table code.
Consider this.
library(data.table)
setDT(df)
dates = seq(min(df$Start), max(df$End), by = "min")
lookup = data.table(Start = dates, End = dates, key = c("Start", "End"))
ans = foverlaps(df, lookup, type = "any", which = TRUE)
ans[, .N, by = yid] %>%
mutate(time = min(df$Start) + minutes(yid - 1)) %>% # yid is a 1-based row index into lookup, so yid 1 is the first minute
ggplot(aes(time, N)) +
geom_line()
Now we use data.table to calculate the overlap, and then index time off the starting minute. Once we add a new column with the times, we can plot.
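For comparison, the same per-minute count can be sketched in base R without any join. This is only an illustration on three made-up tasks (the start and end values below are not the question's data), and it will be slower than foverlaps on large inputs. It does have the side effect that minutes with no active task get an explicit 0.

```r
# Three illustrative tasks (made-up values, not the data from the question)
start <- as.POSIXct("2016-11-22 07:30:00", tz = "UTC") + c(0, 300, 600)
end   <- start + c(1272, 792, 1092)

# One entry per minute between the first start and the last end
dates <- seq(min(start), max(end), by = "min")

# For each minute, count the tasks whose interval contains it
counts <- vapply(
  seq_along(dates),
  function(i) sum(start <= dates[i] & end >= dates[i]),
  integer(1)
)

per_minute <- data.frame(time = dates, N = counts)
head(per_minute)
```

The resulting per_minute data frame already has a real datetime column, so it can go straight into ggplot(aes(time, N)) + geom_line().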
I have a seemingly small challenge, but I can't get to an answer. Here is my minimum working example.
fr_nuke <- structure(list(Date = structure(c(1420070400, 1420074000, 1420077600,
1420081200, 1420084800, 1420088400), class = c("POSIXct", "POSIXt"), tzone = ""),
`61` = c(57945, 57652, 57583, 57551, 57465, 57683),
`3244` = c(72666.64, 73508.78, 69749.17, 67080.13, 66357.65, 66524.13),
`778` = c(2.1133, 2.1133, 2.1133, 2.1133, 2.1133, 2.1133),
fcasted_nuke_temp = c(54064.6099092888, 54064.6099092888, 54064.6099092888,
54064.6099092888, 54064.6099092888, 54064.6099092888),
fcasted_nuke_cons = c(55921.043096775, 56319.5688170977, 54540.4094334057,
53277.340242333, 52935.4411965463, 53014.2244890147)),
.Names = c("Date", "61", "3244", "778", "fcasted_nuke_temp", "fcasted_nuke_cons"),
row.names = c(NA, 6L), class = "data.frame")
library(xts)
library(dygraphs)
series1 <- as.xts(fr_nuke$'61', fr_nuke$Date)
series2 <- as.xts(fr_nuke$fcasted_nuke_temp, fr_nuke$Date)
series3 <- as.xts(fr_nuke$fcasted_nuke_cons, fr_nuke$Date)
grp_input <- cbind(series1,series2,series3)
dygraph(grp_input)
The resulting plot does not show the label of the individual series. Specifying the series with
dygraph(grp_input) %>% dySeries("V1", label = "Label1")
Results in:
Error in dySeries(., "V1", label = "Label1") : One or more of the
specified series were not found. Valid series names are: ..1, ..2, ..3
However, it works if I plot only one series (e.g. series1).
dygraph(series1) %>% dySeries("V1", label = "Label1")
Either set the colnames for the grp_input object, or use merge to construct the column names from the object names.
# setting colnames
require(dygraphs)
require(xts)
grp_input <- cbind(series1, series2, series3)
colnames(grp_input) <- c("V1", "V2", "V3")
dygraph(grp_input) %>% dySeries("V1", label = "Label1")
# using merge
require(dygraphs)
require(xts)
grp_input <- merge(series1, series2, series3)
dygraph(grp_input) %>% dySeries("series1", label = "Label1")
I'm trying to format a chart using ggplot2. What I'm trying to accomplish feels like it should be fairly straightforward, but try as I might, no dice. I have the following chart.
#packages
library(ggplot2)
library(plyr)
library(reshape2)
library(PerformanceAnalytics)
library(timeSeries)
library(quantmod)
library(ggthemes)
#stock data
getSymbols(c("^FCHI","^GSPC"), from = "2008-12-31")
stockmarketdata <- cbind(GSPC$GSPC.Close, FCHI$FCHI.Close)
#normalize data
stockmarketdata$CAC40 <- stockmarketdata[,2] / 3217.97
stockmarketdata$SNP <- stockmarketdata[ ,1] / 903.25
#Isolate normalized data
marketdata <- stockmarketdata[,3:4]
GSPC.DF<-data.frame(Date=index(GSPC),coredata(GSPC))
FCHI.DF<-data.frame(Date=index(FCHI),coredata(FCHI))
#format before making ggplot chart
market.df <- data.frame(Date=index(marketdata), marketdata)
market.df.eco <- market.df
colnames(market.df.eco) <- c("Date", "CAC40", "S&P500")
market.df.eco.mlt <- melt(market.df.eco, id = "Date")
#ggplot chart
chart.EQ <-
ggplot(market.df.eco.mlt, aes(x=Date, y=value, colour = variable, group = variable)) +
geom_line() +
labs(title="Equity Market", x= "", y = "", color="December 31st 2008=1", title.vjust=1) +
theme_economist() +
scale_color_economist() +
theme(legend.position = c(0,1),
legend.justification=c(0,1),
legend.direction="horizontal",
plot.title = element_text(vjust=1),
legend.title=element_text(vjust=1),
legend.title.align=0)
which looks like this
Ideally, I would like the title and legend to be flush with the left side. I would like both the legend and the title to be above the chart itself with the legend title directly below the chart title and the labels in the legend below the legend title. More like this.
If someone could help me with this formatting that would be excellent. Whenever I remove the justification argument from the theme(), the legend flies half way off the image. Thanks for any help you can offer!
If you don't have the finance packages, here is a small sample of the data frame 'marketdata'. If you apply the transformations from the code above, you should be able to get a similar chart.
marketdata<-
structure(c(0.999999990988107, 1.04093261932212, 1.04411163621786,
1.05539205492904, 1.03981394730218, 1.03305191720246, 1.02533584837646,
1.00874778726961, 0.993760008017477, 0.948424006438842, 0.930984404142985,
0.937469895617423, 0.929060849231037, 0.909045152378674, 0.902920185085629,
0.891748561049357, 0.885384230741741, 0.918395795175219, 0.918134733698574,
0.955885235101632, 0.935294611198986, 0.92416023828687, 0.910527459547479,
0.926792323421287, 0.953703729369758, 0.952864706321066, 0.970422359127027,
0.9741763027623, 0.938712915285102, 0.94087886804414, 0.921183257768096,
0.931599768487587, 0.920524420985901, 0.893491853559853, 0.893131405202659,
0.892674604797434, 0.854746951960397, 0.847699051575994, 0.841539867991311,
0.838081126300121, 1, 1.03160806864102, 1.02679215278162, 1.03481872349848,
1.00376421145862, 1.0071740714088, 0.985718213119291, 0.963476346526432,
0.965170194298367, 0.932875721007473, 0.934115682258511, 0.94117907002491,
NA, 0.891469660669803, 0.930240786050374, 0.916136174923886,
0.92106284195959, 0.926177699418766, 0.936296730694714, 0.967716608912261,
0.935665668419596, 0.914342657071686, 0.913855523941323, 0.928325502352615,
0.921372798228619, 0.936451675615832, 0.961638500968724, 0.963066720177138,
0.915759726543039, 0.923044550235262, 0.924649877663991, 0.915405510102408,
NA, 0.873700507057847, 0.872870172156103, 0.862374760033213,
0.85253250816496, 0.82295047550512, 0.8559535178522, 0.846830915029062
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", src = "yahoo", updated = structure(1440080950.2661, class = c("POSIXct",
"POSIXt")), class = c("xts", "zoo"), index = structure(c(1230681600,
1230854400, 1231113600, 1231200000, 1231286400, 1231372800, 1231459200,
1231718400, 1231804800, 1231891200, 1231977600, 1232064000, 1232323200,
1232409600, 1232496000, 1232582400, 1232668800, 1232928000, 1233014400,
1233100800, 1233187200, 1233273600, 1233532800, 1233619200, 1233705600,
1233792000, 1233878400, 1234137600, 1234224000, 1234310400, 1234396800,
1234483200, 1234742400, 1234828800, 1234915200, 1235001600, 1235088000,
1235347200, 1235433600, 1235520000), tzone = "UTC", tclass = "Date"), .Dim = c(40L,
2L), .Dimnames = list(NULL, c("CAC40", "SNP")))
My data looks something like this:
There are 10,000 rows, each representing a city and all months since 1998-01 to 2013-9:
RegionName| State| Metro| CountyName| 1998-01| 1998-02| 1998-03
New York| NY| New York| Queens| 1.3414| 1.344| 1.3514
Los Angeles| CA| Los Angeles| Los Angeles| 12.8841| 12.5466| 12.2737
Philadelphia| PA| Philadelphia| Philadelphia| 1.626| 0.5639| 0.2414
Phoenix| AZ| Phoenix| Maricopa| 2.7046| 2.5525| 2.3472
I want to be able to do a plot for all months since 1998 for any city or more than one city.
I tried this but I get an error. I am not sure if I am even attempting this right. Any help will be appreciated. Thank you.
forecl <- ts(forecl, start=c(1998, 1), end=c(2013, 9), frequency=12)
plot(forecl)
Error in plots(x = x, y = y, plot.type = plot.type, xy.labels = xy.labels, :
cannot plot more than 10 series as "multiple"
You might try
require(reshape)
require(ggplot2)
forecl <- melt(forecl, id.vars = c("region","state","city"), variable_name = "month")
forecl$month <- as.Date(paste0(forecl$month, "-01")) # assumes labels like "1998-01"; as.Date() needs a day to parse them
ggplot(forecl, aes(x = month, y = value, color = city)) + geom_line()
To add to @JLLagrange's answer, you might want to pass city through facet_grid() if there are too many cities and the colors will be hard to distinguish.
ggplot(forecl, aes(x = month, y = value, color = city, group = city)) +
geom_line() +
facet_grid( ~ city)
Could you provide an example of your data, e.g. dput(head(forecl)), before converting to a time-series object? The problem might also be with the ts object.
In any case, I think there are two problems.
First, the data are in wide format. I'm not sure about your column names, since they should start with a letter, but in any case, the general idea would be to do something like this:
test <- structure(list(
city = structure(1:2, .Label = c("New York", "Philly"),
class = "factor"), state = structure(1:2, .Label = c("NY",
"PA"), class = "factor"), a2005.1 = c(1, 1), a2005.2 = c(2, 5
)), .Names = c("city", "state", "a2005.1", "a2005.2"), row.names = c(NA,
-2L), class = "data.frame")
test.long <- reshape(test, varying=c(3:4), direction="long")
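If the month columns really are named as in the question ("1998-01", "1998-02", ...), the same wide-to-long step can be sketched in base R while also turning the labels into dates. This assumes the names survive import unchanged (for example by reading with check.names = FALSE); the two-city data frame below is a cut-down illustration, not the real file.

```r
# Cut-down illustration of the question's layout; check.names = FALSE keeps
# the non-syntactic column names such as "1998-01" as they are
wide <- data.frame(
  RegionName = c("New York", "Phoenix"),
  `1998-01`  = c(1.3414, 2.7046),
  `1998-02`  = c(1.344, 2.5525),
  `1998-03`  = c(1.3514, 2.3472),
  check.names = FALSE
)

month_cols <- setdiff(names(wide), "RegionName")

# Stack the month columns: one row per city and month
long <- data.frame(
  RegionName = rep(wide$RegionName, times = length(month_cols)),
  # append a day so as.Date() can parse the "YYYY-MM" labels
  month = as.Date(paste0(rep(month_cols, each = nrow(wide)), "-01")),
  value = unlist(wide[month_cols], use.names = FALSE)
)
long
```

With a proper Date column, one line per city can then be drawn from the long table, for example after subsetting it to the cities of interest.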
Second, I think you are trying to plot too many cities at the same time. Try:
plot(forecl[, 1])
or
plot(forecl[, 1:5])