Related
I am having trouble with a function I wrote when trying to apply it to a dataframe to mutate in a new column
I want to add a column to a dataframe that calculates the sunrise/sunset time for all rows based on existing columns for Latitude, Longitude and Date. The sunrise/sunset calculation is derived from the "sunriseset" function from the maptools package.
Below is my function:
library(maptools)
library(tidyverse)
sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset"), num.days = 1)
{
lat.long <- matrix(c(long, lat), nrow = 1)
day <- as.POSIXct(date, tz = timezone)
sequence <- seq(from = day, length.out = num.days, by = "days")
sunrise <- sunriset(lat.long, sequence, direction = "sunrise",
POSIXct = TRUE)
sunset <- sunriset(lat.long, sequence, direction = "sunset",
POSIXct = TRUE)
ss <- data.frame(sunrise, sunset)
ss <- ss[, -c(1, 3)]
colnames(ss) <- c("sunrise", "sunset")
if (direction == "sunrise") {
return(ss[1,1])
} else {
return(ss[1,2])
}
}
When I run the function for a single input I get the expected output:
sunrise.set2(41.2, -73.2, "2018-12-09 07:34:0", timezone="EST",
direction = "sunset", num.days = 1)
[1] "2018-12-09 16:23:46 EST"
However, when I try to do this on a dataframe object to mutate in a new column like so:
df <- df %>%
mutate(set = sunrise.set2(Latitude, Longitude, LocalDateTime, timezone="UTC", num.days = 1, direction = "sunset"))
I get the following error:
Error in mutate_impl(.data, dots) :
Evaluation error: 'from' must be of length 1.
The dput of my df is below. I suspect I'm not doing something right in order to properly vectorize my function but I'm not sure what.
Thanks
dput(df):
structure(list(Latitude = c(20.666, 20.676, 20.686, 20.696, 20.706,
20.716, 20.726, 20.736, 20.746, 20.756, 20.766, 20.776), Longitude = c(-156.449,
-156.459, -156.469, -156.479, -156.489, -156.499, -156.509, -156.519,
-156.529, -156.539, -156.549, -156.559), LocalDateTime = structure(c(1534318440,
1534404840, 1534491240, 1534577640, 1534664040, 1534750440, 1534836840,
1534923240, 1535009640, 1535096040, 1535182440, 1535268840), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("Latitude", "Longitude",
"LocalDateTime"), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"), spec = structure(list(cols = structure(list(
Latitude = structure(list(), class = c("collector_double",
"collector")), Longitude = structure(list(), class = c("collector_double",
"collector")), LocalDateTime = structure(list(format = "%m/%d/%Y %H:%M"), .Names = "format", class = c("collector_datetime",
"collector"))), .Names = c("Latitude", "Longitude", "LocalDateTime"
)), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
The problem is indeed that your function as it is now is not vectorized, it breaks if you give it more than one value. A workaround (as Suliman suggested) is using rowwise() or a variant of apply, but that would give your function a lot of unnecessary work.
So better to make it vectorized, as maptools::sunriset is also vectorized. First suggestion: Debug or rewrite it with vectors as input, and then you easily see the lines where something unexpected happens. Let's go at it line by line, I've outcommented your lines where I replace it with something else:
library(maptools)
library(tidyverse)
# sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset"), num.days = 1)
sunrise.set2 <- function (lat, long, date, timezone = "UTC", direction = c("sunrise", "sunset")
# Why an argument saying how many days? You have the length of your dates
{
#lat.long <- matrix(c(long, lat), nrow = 1)
lat.long <- cbind(lon, lat)
day <- as.POSIXct(date, tz = timezone)
# sequence <- seq(from = day, length.out = num.days, by = "days") # Your days object is fine
sunrise <- sunriset(lat.long, day, direction = "sunrise",
POSIXct = TRUE)
sunset <- sunriset(lat.long, day, direction = "sunset",
POSIXct = TRUE)
# I've replaced sequence with day here
ss <- data.frame(sunrise, sunset)
ss <- ss[, -c(1, 3)]
colnames(ss) <- c("sunrise", "sunset")
if (direction == "sunrise") {
#return(ss[1,1])
return(ss[,1])
} else {
#return(ss[1,2])
return(ss[,2])
}
}
But looking at your function, I think there is still a lot of extra work done that doesn't serve any purpose.
You're calculating both sunrise and sunset, only to use one of them. And you can just pass one your direction-argument, without even looking at it.
Is it useful to ask for a seperate date and timezone? When your users give you a POSIXt-object, the timezone is included. And it's nice if you can input a string as a date, but that only works if it's in the right format. To keep it simple, I'd just ask for a POSIXct as input (which is in your example-data.frame)
Why are you making a data.frame and assigning names before returning? As soon as you're subsetting, it all gets dropped again.
Which means your function can be a lot shorter:
sunrise.set2 <- function(lat, lon, date, direction = c("sunrise", "sunset")) {
lat.long <- cbind(lon, lat)
sunriset(lat.long, date, direction=direction, POSIXct.out=TRUE)[,2]
}
If you have no control over your input you might need to add some checks, but usually I find it most useful to keep focused on just the thing you want to accomplish.
Intro:
I have a directory full of data from a sensor network. I would like to use each sensor's serial number, located within the filename, to create an id vector. .
Here's some example filenames:
2017-07-18-32058-aqdata.csv
2017-07-18-32033-aqdata.csv.
The serial number for each sensor comes after the timestamp, e.g. 32058 or 32033.
Here's how I am currently reading in the data:
## Load the necessary packages:
if (!require(plyr)){
install.packages('plyr')
library(plyr)
}
if (!require(dplyr)){
install.packages('dplyr')
library(dplyr)
}
## Define a function to read in a single file:
read.data <- function(file_path){
df <- read.csv(file_path, header=TRUE, stringsAsFactors=FALSE)
names(df)<-c("datetime", "co", "co2", "VOC", "RH","ugm3","temp")
df$datetime <- strptime(df$datetime, format="%Y-%m-%d %H:%M")
df$datetime <- as.POSIXct(df$datetime, format="%Y-%m-%d %H:%M:%S")
return(df)
}
## Assign object 'file_path' to my target directory:
file_path <-"~/my_directory/"
## Generate a list of files within this directory:
file_list <- list.files(path = file_path, pattern="\\.csv$", all.files=FALSE, full.names=TRUE, ignore.case=FALSE)
## Apply the data.read function to the list of files:
df_master <- dplyr::ldply(file_list, read.data)
df_master <- plyr::arrange(df_master, datetime)
How can I exploit the serial number in each filename to create corresponding ID vectors within my read.data() function?
Here's some example data:
df_example <- structure(list(datetime = structure(c(1497296520, 1497296580, 1497296640, 1497296700, 1497296760, 1497296820), class = c("POSIXct", "POSIXt"), tzone = ""), co = c(0, 0, 0, 0, 0, 0), co2 = c(1118L, 1508L, 836L, 620L, 529L, 498L), VOC = c(62.1353, 59.7594, 59.1831, 57.9592, 56.4335, 53.6528), RH = c(51.45, 52.18, 50.72, 49.71, 49.21, 48.51), ugm3 = c(2.601, 1.061, 1.901, 1.481, 2.501, 3.261), temp = c(72.27, 72.35, 72.45, 72.55, 72.67, 72.77)), .Names = c("datetime", "co", "co2", "VOC", "RH", "ugm3", "temp"), row.names = c(NA, 6L), class = "data.frame")
Thanks in advance!
This assumes your sensor numbers are all 5+ numbers, which helpfully avoids confusion with the dates. Using stringr:
library(stringr)
read.data <- function(file_path){
df <- read.csv(file_path, header=TRUE, stringsAsFactors=FALSE)
names(df)<-c("datetime", "co", "co2", "VOC", "RH","ugm3","temp")
df$datetime <- strptime(df$datetime, format="%Y-%m-%d %H:%M")
df$datetime <- as.POSIXct(df$datetime, format="%Y-%m-%d %H:%M:%S")
# New code to pull in sensor number
df$sensor <- str_extract(file_path, "[0-9]{5,}")
return(df)
}
I don't know why but I am facing errors when trying to download data from "Oanda" using getSymbols function:
library(quantmod)
require(quantmod)
tickers <- c("USD/EUR","EUR/BRL","USD/ARS","USD/BRL","USD/CLP","USD/COP")
tickers2 <- c("USDEUR","EURBRL","USDARS","USDBRL","USDBRL","USDCLP","USDCOP")
getSymbols(tickers, src = "oanda", env = parent.frame(n=1),
from = "2014-01-01", to = Sys.Date(), auto.assign = TRUE,
warning = FALSE)
Closeprices <- do.call(cbind, lapply(tickers2, function(x) get(x)))
head(Closeprices)
dat1 <- as.data.frame(Closeprices)
dat1$date <- time(Closeprices)
head(dat1)
and it returns the following error:
> getSymbols("USD/EUR", src = "oanda", env = parent.frame(n=1),
+ to = Sys.Date(), auto.assign = TRUE,
+ warning = FALSE)
Error in charToDate(x) :
character string is not in a standard unambiguous format
It used to work well, but now it is crashing.
Can you help me out?
I currently am connecting to a mongodb through the rmongodb package (which isnt necessary for this example), it provides me with a list in the form of "mongo_data" list supplied in the code below. I want to take a list in this format and end up with a single dataframe. I currently use a loop (which I have setup to run in parallel but even still it is quite slow when using large amounts of data. Any ideas on how to do this without using a loop?
Reproducible data
symbols <- c("SPXT", "BCOM", "MXWO")
mongo_data <- list(structure(list(`_id` = "561369b8e756f3e744d9d697", symbol = "BCOM",
year = 2014, field = "PX_LAST", monthly = list(structure(list(
`2014-01-31` = 126.1206, `2014-02-28` = 133.9783, `2014-03-31` = 134.5234,
`2014-04-30` = 137.7964, `2014-05-30` = 133.8324, `2014-06-30` = 134.6268,
`2014-07-31` = 127.9143, `2014-08-29` = 126.5745, `2014-09-30` = 118.6922,
`2014-10-31` = 117.7365, `2014-11-28` = 112.9451, `2014-12-31` = 104.3285), .Names = c("2014-01-31",
"2014-02-28", "2014-03-31", "2014-04-30", "2014-05-30", "2014-06-30",
"2014-07-31", "2014-08-29", "2014-09-30", "2014-10-31", "2014-11-28",
"2014-12-31")))), .Names = c("_id", "symbol", "year", "field",
"monthly")), structure(list(`_id` = "561369b8e756f3e744d9d698",
symbol = "BCOM", year = 2015, field = "PX_LAST", monthly = list(
structure(list(`2015-01-30` = 100.8413, `2015-02-27` = 103.4379,
`2015-03-31` = 98.123, `2015-04-30` = 103.7471, `2015-05-29` = 100.9465,
`2015-06-30` = 102.6892, `2015-07-31` = 91.7827,
`2015-08-31` = 90.9328, `2015-09-30` = 87.8214), .Names = c("2015-01-30",
"2015-02-27", "2015-03-31", "2015-04-30", "2015-05-29",
"2015-06-30", "2015-07-31", "2015-08-31", "2015-09-30"
)))), .Names = c("_id", "symbol", "year", "field", "monthly"
)), structure(list(`_id` = "561353f1e756f3e744d97955", symbol = "MXWO",
year = 2014, field = "PX_LAST", monthly = list(structure(list(
`2014-01-31` = 1598.46, `2014-02-28` = 1675.4, `2014-03-31` = 1673.87,
`2014-04-30` = 1687.74, `2014-05-30` = 1715.18, `2014-06-30` = 1743.42,
`2014-07-31` = 1714.35, `2014-08-29` = 1748.69, `2014-09-30` = 1698.41,
`2014-10-31` = 1708.09, `2014-11-28` = 1739.5, `2014-12-31` = 1709.67), .Names = c("2014-01-31",
"2014-02-28", "2014-03-31", "2014-04-30", "2014-05-30", "2014-06-30",
"2014-07-31", "2014-08-29", "2014-09-30", "2014-10-31", "2014-11-28",
"2014-12-31")))), .Names = c("_id", "symbol", "year", "field",
"monthly")), structure(list(`_id` = "561353f1e756f3e744d97956",
symbol = "MXWO", year = 2015, field = "PX_LAST", monthly = list(
structure(list(`2015-01-30` = 1677.54, `2015-02-27` = 1772.86,
`2015-03-31` = 1740.81, `2015-04-30` = 1778.4, `2015-05-29` = 1779.31,
`2015-06-30` = 1735.61, `2015-07-31` = 1765.6, `2015-08-31` = 1645.43,
`2015-09-30` = 1581.92), .Names = c("2015-01-30",
"2015-02-27", "2015-03-31", "2015-04-30", "2015-05-29",
"2015-06-30", "2015-07-31", "2015-08-31", "2015-09-30"
)))), .Names = c("_id", "symbol", "year", "field", "monthly"
)), structure(list(`_id` = "5613542fe756f3e744d97a69", symbol = "SPXT",
year = 2014, field = "PX_LAST", monthly = list(structure(list(
`2014-01-31` = 3200.95, `2014-02-28` = 3347.3799, `2014-03-31` = 3375.51,
`2014-04-30` = 3400.46, `2014-05-30` = 3480.29, `2014-06-30` = 3552.1799,
`2014-07-31` = 3503.1899, `2014-08-29` = 3643.3401, `2014-09-30` = 3592.25,
`2014-10-31` = 3679.99, `2014-11-28` = 3778.96, `2014-12-31` = 3769.4399), .Names = c("2014-01-31",
"2014-02-28", "2014-03-31", "2014-04-30", "2014-05-30", "2014-06-30",
"2014-07-31", "2014-08-29", "2014-09-30", "2014-10-31", "2014-11-28",
"2014-12-31")))), .Names = c("_id", "symbol", "year", "field",
"monthly")), structure(list(`_id` = "5613542fe756f3e744d97a6a",
symbol = "SPXT", year = 2015, field = "PX_LAST", monthly = list(
structure(list(`2015-01-30` = 3656.28, `2015-02-27` = 3866.4199,
`2015-03-31` = 3805.27, `2015-04-30` = 3841.78, `2015-05-29` = 3891.1799,
`2015-06-30` = 3815.8501, `2015-07-31` = 3895.8,
`2015-08-31` = 3660.75, `2015-09-30` = 3570.1699), .Names = c("2015-01-30",
"2015-02-27", "2015-03-31", "2015-04-30", "2015-05-29",
"2015-06-30", "2015-07-31", "2015-08-31", "2015-09-30"
)))), .Names = c("_id", "symbol", "year", "field", "monthly"
)))
Code Using Currently
library(foreach)
library(doParallel)
cl <- makeCluster((detectCores()))
registerDoParallel(cl)
total_df_list <- foreach(i=(1:length(symbols))) %dopar% {
# pull out the data for symbol i
symbol_data <- mongo_data[which(sapply(lapply(mongo_data, "[[", "symbol"), function(x) x == symbols[i]))]
# pull out the particular frequency of data & put into a single list
freq_data <- lapply(symbol_data, "[[", "monthly")
freq_data <- do.call(Map, c(c, freq_data))
# if the frequency doesnt exist then add symbol to filler vector
if (length(freq_data) > 0) {
# convert NULLs to NAs
freq_data[[1]][which(sapply(freq_data[[1]], is.null))] <- NA
# transform list into a proper dataframe
mongo_df <- data.frame("Date" = names(unlist(freq_data[[1]])),"Value"=as.numeric(unlist(freq_data[[1]])))
mongo_df[,"Date"] <- as.Date(mongo_df[,"Date"])
colnames(mongo_df)[2] <- paste0(symbols[i])
# put dataframe into the master list
results <- mongo_df
} else {
filler_vector <- c(filler_vector,symbols[i])
results <- NULL
}
results
}
I did some benchmarking of my previous solution for this question, and I noticed that nearly all the runtime was spent in the as.Date call, which converts string dates to Date objects. As a result, it seems like optimizing this operation would be needed to further push the efficiency. One observation would be that many of the dates are reported multiple times (for different symbols), so we are wasting computation by performing date conversions for the same string literal multiple times. To address this, we could use the following procedure:
Compute all unique date strings
Convert them to Date objects with as.Date
Look up the converted value for each date string, yielding the final converted vector
In code, we could do this with:
vals <- unlist(lapply(mongo_data, "[[", "monthly"))
s <- unlist(lapply(mongo_data, function(x) rep(x$symbol, length(x$monthly[[1]]))))
dates <- unique(names(vals))
date.map <- as.Date(dates)
names(date.map) <- dates
partial <- data.frame(Date=date.map[names(vals)], val=as.numeric(vals),
symbol=s)
tdl2 <- unname(lapply(split(partial, partial$s), function(x) {
names(x)[2] <- as.character(x$symbol[1])
rownames(x) <- NULL
x[,-3]
})[symbols])
identical(total_df_list, tdl2)
# [1] TRUE
Let's benchmark this on a large instance that is obtained by repeating each row in mongo_data 10,000 times. OP is the function from the original question, josilber1 is from my other answer, and josilber2 is from this answer:
big.mongo <- mongo_data[rep(seq_along(mongo_data), 10000)]
system.time(OP(big.mongo, symbols))
# user system elapsed
# 5.359 0.071 5.427
system.time(josilber1(big.mongo, symbols))
# user system elapsed
# 4.345 0.047 4.385
system.time(josilber2(big.mongo, symbols))
# user system elapsed
# 0.560 0.048 0.625
My other answer yielded a 20% improvement in the runtime, while this solution yielded roughly a 10x speedup.
I see a few opportunities to speed this up:
Don't loop through the list extracting the symbol name for each symbol you process. Instead you can extract all the symbol names at the beginning and use that to quickly identify the part of the list you want to process.
sapply with == to look up matching symbols is calling == separately on each element. It would be more efficient to just call == once on the whole vector.
sapply with is.null is calling is.null separately on each element. It would be more efficient to just call is.null once on the whole vector.
Here's my updated implementation:
mongo.sym <- unlist(lapply(mongo_data, "[[", "symbol"))
tdl <- lapply(symbols, function(sym) {
vals <- unlist(lapply(mongo_data[mongo.sym == sym], function(x) unlist(x$monthly)))
vals[is.null(vals)] <- NA
ret <- data.frame(as.Date(names(vals)), as.numeric(vals))
names(ret) <- c("Date", sym)
ret
})
identical(total_df_list, tdl)
# [1] TRUE
I'd like to display a time series as a highchart interactive graphic. However, in the following R scrip the dates are not displayed correctly all. The numerical value for the date is cut after five digits, making them appear all on the same day and time.
Anyone experienced and solved something similar?
library(plyr)
library(rCharts)
library(rHighcharts)
cs <-c("13-10-30 12:30:00", "13-10-30 12:35:00", "13-10-30 12:40:00",
"13-10-30 12:45:00", "13-10-30 12:50:00", "13-10-30 12:55:00")
x <-strptime(cs, "%y-%m-%d %H:%M:%S")
dfr <-data.frame(date=as.POSIXct(x,origin="1970-01-01"),
value=c(1.5,1.25,.75,2.1,1.3,1.4))
hpl <- hPlot(
value~date,
data = dfr,
type = "scatter"
)
hpl$xAxis(type = "datetime")
hpl$chart(zoomType = "x")
hpl$plotOptions(
line = list(
marker = list(enabled = F)
)
)
hpl