I am running an R script with the following query; the filters argument is the part in question:
query.list <- Init(start.date = as.character(startdate),
end.date = as.character(enddate),
dimensions =
"ga:date,ga:campaign,ga:adwordsCampaignID,ga:adGroup,ga:adDestinationUrl",
metrics = "ga:sessions,ga:bounces",
max.results = 10000,
filters =
c("ga:adwordsCampaignID!=%28not%20set%29;ga:sessions>0"),
table.id = example)
ga.query <- QueryBuilder(query.list)
x <- as.integer(difftime(max(as.Date(query.list$end.date, '%Y-%m-%d')) ,
min(as.Date(query.list$start.date, '%Y-%m-%d')) , units = "days"))
# Daywise split parameter is FALSE when the start and end dates are the same day
daywisesplit <- if(x == 0) {
F
} else {
T
}
# Extract the data and store it in a data-frame
example_camp <- GetReportData(ga.query, token, split_daywise = daywisesplit)
example_camp$date <- as.Date(example_camp$date, '%Y%m%d')
example_camp$brand <- 'Example'
example_camp <- subset(example_camp, sessions > 0)
example_camp <- subset(example_camp, adwordsCampaign > 0)
The script runs with no errors, but when I look at the write.csv output I still see (not set) in the AdwordsCampaignID column. However, when I look at the sessions column, all the rows with zero sessions are excluded, so that part of the filter is working properly.
How can I make the exclude-(not set) filter work properly when pulling the data that gets written by write.csv? Perhaps I need to update the data frame?
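If the API-side filter keeps letting (not set) through, one workaround is simply to drop those rows from the data frame after the pull. A minimal sketch, assuming the returned column is named adwordsCampaignID:
# drop rows where the campaign ID came back as the literal string "(not set)"
example_camp <- subset(example_camp, adwordsCampaignID != "(not set)")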
I'm writing a wrapper for the YouTube Analytics API, and have created a function as follows:
yt_request <- function(dimensions = NULL, metrics = NULL, sort = NULL,
maxResults = NULL, filtr = NULL, startDate = Sys.Date() - 30,
endDate = Sys.Date(), token) {
url <- paste0("https://youtubeanalytics.googleapis.com/v2/reports?",
"&ids=channel%3D%3DMINE",
"&startDate=", startDate,
"&endDate=", endDate)
if(!is.null(dimensions)) url <- paste0(url, "&dimensions=", dimensions)
if(!is.null(metrics)) url <- paste0(url, "&metrics=", metrics)
if(!is.null(sort)) url <- paste0(url, "&sort=", sort)
if(!is.null(maxResults)) url <- paste0(url, "&maxResults=", maxResults)
if(!is.null(filtr)) url <- paste0(url, "&filters=", filtr)
r <- GET(url, token)
return(r)
}
This is meant to be flexible rather than friendly, because I want to build wrapper functions around yt_request() that are much more user friendly. For example:
top_videos <- function(...) {
dim <- "video"
met <- "views,averageViewDuration"
maxRes <- 10
temp <- yt_request(dimensions = dim, metrics = met, maxResults = maxRes, token = myToken)
return(temp)
}
This works fine and dandy so far, but I also want potential users to have a little flexibility with the results. For example, if they want maxResults = 20 instead of 10, or different metrics than the ones I specify, I want them to be able to pass their own arguments in the ... of top_videos(...).
How can I do a check if someone passes an argument in the ellipsis? If they pass a metric, I want it to override the default I specify, otherwise, go with the default.
EDIT
To clarify: I'm hoping that when the user calls the function, they can write something like top_videos(maxResults = 20) and the function will ignore the line maxRes <- 10 and pass maxResults = 20 to yt_request() instead of 10.
We can capture the ... in a list, which converts everything that was passed into key/value pairs, and then extract elements by name. If a particular named element was not passed, extracting it returns NULL. We make use of this behaviour of NULL by concatenating it with the default value (10 for maxRes) and selecting the first element ([1]): if the extraction returned NULL, the default 10 is selected; otherwise the value passed in is used. Do the same for all the arguments the OP wants to be able to override.
top_videos <- function(...) {
nm1 <- list(...)
lst1 <- as.list(nm1)
dim <- c(lst1[["dimensions"]], "video")[1]
met <- c(lst1[["metrics"]], "views,averageViewDuration")[1]
maxRes <- c(lst1[['maxResults']], 10)[1]
#temp <- yt_request(dimensions = dim,
#                   metrics = met, maxResults = maxRes, token = myToken)
#temp
maxRes
}
Testing:
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10
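To see the NULL-concatenation trick in isolation (the values here are made up purely for illustration):
lst1 <- list(maxResults = 20)                         # pretend this came from list(...)
c(lst1[["metrics"]], "views,averageViewDuration")[1]  # not passed, so the default wins
#[1] "views,averageViewDuration"
c(lst1[["maxResults"]], 10)[1]                        # passed, so the user value wins
#[1] 20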
I have a data set where I want to calculate the 6-month return of stocks with tq_get (see the example below).
Dataset called top
ticker 6month
AKO.A
BIG
BGFV
Function
library(tidyverse)
library(dplyr)
library(tidyquant)
library(riingo)
calculate <- function (x) {
(tq_get(x, get = "tiingo", from = yesterday, to = yesterday)$adjusted /
   tq_get(x, get = "tiingo", from = before, to = before)$adjusted) - 1
}
top[2] <- lapply(top[1], function(x) calculate(x))
Unfortunately, for some of the tickers no value exists, which results in an error message when simply using lapply or mutate, because the resulting vector has fewer rows than the existing dataset. Resolving this with tryCatch did not work.
I then wanted to apply a workaround: checking whether the ticker is available with is_supported_ticker(), provided by the riingo package.
calculate <- function (x) {
if (is_supported_ticker(x, type = "tiingo") == TRUE) {
(tq_get(x, get = "tiingo", from = yesterday, to = yesterday)$adjusted /
   tq_get(x, get = "tiingo", from = before, to = before)$adjusted) - 1
}
else {
NA
}
}
top[2] <- lapply(top[1], function(x) calculate(x))
But now I receive the error message x ticker must be length 1, but is actually length 3.
I assume this is because the whole first column of my dataset is used as input for is_supported_ticker() instead of going row by row. How can I resolve this issue?
Glancing at the documentation, it looks like tq_get supports multiple symbols, but is_supported_ticker only goes one at a time. So you should probably check all the tickers to see if they are supported, and then use tq_get once on all the supported ones. Something like this (untested, as I don't have any of these packages):
calculate <- function (x) {
supported = sapply(x, is_supported_ticker, type = "tiingo")
result = rep(NA, length(x))
result[supported] =
(
tq_get(x[supported], get = "tiingo", from = yesterday, to = yesterday)$adjusted /
tq_get(x[supported], get = "tiingo", from = before, to = before)$adjusted
) - 1
return(result)
}
It worries me that before and yesterday aren't function arguments - they're just assumed to be there in the global environment. I'd suggest passing them in as arguments to calculate(), like this:
calculate <- function (x, before, yesterday) {
supported = sapply(x, is_supported_ticker, type = "tiingo")
result = rep(NA, length(x))
result[supported] =
(
tq_get(x[supported], get = "tiingo", from = yesterday, to = yesterday)$adjusted /
tq_get(x[supported], get = "tiingo", from = before, to = before)$adjusted
) - 1
return(result)
}
# then calling it
calculate(top$ticker, before = <...>, yesterday = <...>)
This way you can pass values in for before and yesterday on the fly. If they are objects in your global environment, you can simply call calculate(top$ticker, before, yesterday), but this gives you the freedom to vary those arguments without redefining names in your global environment.
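For instance, for a 6-month return the two dates could be computed up front and passed in. A sketch only; the lubridate call is my assumption, and the 6month column is the one from the dataset above:
library(lubridate)                       # %m-% does calendar-aware month arithmetic
yesterday <- Sys.Date() - 1
before    <- yesterday %m-% months(6)    # six months before "yesterday"
top$`6month` <- calculate(top$ticker, before = before, yesterday = yesterday)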
I'm creating a function, but I need some help with best practices.
Active.Test <- function(date) {
date <- rep(date,length(df$Start.Date))
active <- rep(0,length(df$Start.Date))
active[date > df$Start.Date & date < df$End.Date] <- 1
active[df$Start.Date == df$End.Date ] <- df$Active.Time
return (active)
}
I basically want to check whether a date (which is passed to the function) is between the start and end dates in my data frame. If it is, assign a 1. If the start and end dates are equal, take the result from the same row of the Active.Time column. Everything else gets a default value of 0.
This returns an error because, for the second test, it retrieves a vector of a different length than the subset being replaced.
I can re-write the above as:
Active.Test <- function(date) {
date <- rep(date,length(df$Start.Date))
active <- rep(0,length(df$Start.Date))
active[date > df$Start.Date & date < df$End.Date] <- 1
active[df$Start.Date == df$End.Date] <- df$Active.Time[df$Start.Date == df$End.Date]
return (active)
}
This then gets the correct element from the Active.Time column, but it doesn't seem an elegant way to write it. I'm also guessing it's slower, since I'm performing the same check twice.
Could you please help me re-write this using best practices?
EDIT: Here's some code to create a few rows of data and then test the function by checking whether the start and end dates encompass 25/05/2016.
#Create a data frame
df <- data.frame(End.Date = as.Date(c("1/05/2016", "28/05/2016", "25/05/2016"), format = "%d/%m/%Y"),
                 Start.Date = as.Date(c("20/04/2016 11:00", "20/05/2016 23:00", "25/05/2016 10:00"), format = "%d/%m/%Y"),
                 Active.Time = as.numeric(c(0.5, 0.4, 0.8)))
#Test the function
df$new <- Active.Test(as.Date("25/05/2016", format = "%d/%m/%Y"))
Thanks
# Using the data.table approach
library(data.table)
# Make data table instead of data.frame (you can also do as.data.table(df) to get a data.table)
my_dt <- data.table(Start.Date=as.Date(c("20/04/2016 11:00","20/05/2016 23:00", "25/05/2016 10:00"), format = "%d/%m/%Y" ),
End.Date=as.Date(c("1/05/2016","28/05/2016", "25/05/2016"), format = "%d/%m/%Y"),
Active.Time = as.numeric(c(0.5,0.4,0.8))
)
setkey(my_dt)
# Sample date to test
datte <- as.Date("25/05/2016", format = "%d/%m/%Y")
# Create function with conditions and result to return
Active.Test <- function(datte, Start.Date, End.Date, Active.Time) {
if(datte > Start.Date & datte < End.Date){
return(1)
}
else if(Start.Date==End.Date){
return(Active.Time)
}
else{return(0)}
}
# Test function
my_dt[, res:=Active.Test(datte, Start.Date, End.Date, Active.Time), by=1:nrow(my_dt)]
See the data.table vignette for more on data.table. Also, in your function above, note the warning you get when you run df$new <- Active.Test(as.Date("25/05/2016", format = "%d/%m/%Y"))!
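If you would rather stay in base R, the same logic can also be vectorised with nested ifelse() calls, which avoids doing the equality check twice. A sketch (not part of the answer above), passing df in explicitly:
Active.Test <- function(date, df) {
  # rows with equal start/end dates take Active.Time, rows strictly in between take 1, everything else 0
  ifelse(df$Start.Date == df$End.Date, df$Active.Time,
         ifelse(date > df$Start.Date & date < df$End.Date, 1, 0))
}
df$new <- Active.Test(as.Date("25/05/2016", format = "%d/%m/%Y"), df)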
I'm currently writing a program (full disclosure, it's "homework"). The program is designed to run through a series of files based on a range given, collate them into one large table sans NAs and find the mean of the pollutant provided (which is a column in the table).
I wrote the program previously, but wanted to play around with compartmentalising the functions a bit more, so I rewrote it.
Strangely, some ranges return exactly the same number as the original program, while others return (relatively) radically different results.
For instance:
pollutantmean("specdata", "sulfate", 1:10)
Old Program: 4.064128
New Program: 4.064128
pollutantmean("specdata", "nitrate", 23)
Old Program: 1.280833
New Program: 1.280833
pollutantmean("specdata", "nitrate", 70:72)
Old Program: 1.706047
New Program: 1.732979
In that final example, the old program is producing the expected result, while the new program is producing a result not within the acceptable margin of error at all.
I'm simply at a loss. I've been trying to rewrite my new code to minimise differences with the old code without simply reproducing the old program; the current code is below (along with the original program). But nothing is working, and I continue to receive exactly the same (bad) result despite quite a few changes.
New Program:
concatTables <- function(directory, id, hasHeader = TRUE, keepNAs = FALSE) {
totalTable <- NULL
currentTable <- NULL
for (file in id) {
filename <- paste( sep ="",
directory,"/",formatC(file,width=3,format="d",flag="0"),".csv"
);
currentTable <- read.csv(file = filename, header = hasHeader);
if (!is.null(totalTable)) {
totalTable <- rbind(totalTable, currentTable);
}
else {
totalTable <- currentTable;
}
}
if (!keepNAs) {
totalTable <- completeRows(totalTable);
}
totalTable
}
completeRows <- function(table) {
table <- table[complete.cases(table),]
table
}
pollutantmean <- function(directory = paste(getwd(), "/specdata", sep = ""),
                          pollutant, id = 1:332, hasHeader = TRUE, keepNAs = FALSE) {
table <- NULL
table <- concatTables(directory,id,hasHeader,keepNAs);
tableMean <- mean(table[[pollutant]]);
tableMean
}
Old Program
(Which produces better results)
dataFileName <- NULL
pollutantmean <- function(directory = "specdata", pollutant, id = 1:332, idWidth = 3, fullLoop = TRUE) {
dataFrame <- NULL
dataFrameTotal <- NULL
for (i in id) {
dataFileName <- paste(directory, "/", formatC(i, width = idWidth, flag = 0), ".csv", sep = "")
if (!is.null(dataFileName)) {
dataFileConnection <- file(dataFileName)
dataFrame <- read.csv(dataFileConnection, header = TRUE)
dataFrameTotal <- rbind(dataFrame, dataFrameTotal)
##close(dataFileConnection)
if (fullLoop == FALSE) {
break
}
}
else print("DATAFILENAME IS NULL!")
}
print(mean(dataFrameTotal[[pollutant]], na.rm = TRUE))
}
The difference is that filtering with complete.cases() drops every row in which any of the columns is NA, while the na.rm argument inside mean() only removes NAs from the selected column (vector).
Example:
x <- airquality[1:10, -1]
x[3,3] <- NA
> mean(x[complete.cases(x), "Temp"]) == mean(x[["Temp"]], na.rm = T)
[1] FALSE
Note that complete.cases() returns FALSE on rows 5 and 6, where the Solar.R column is NA, so you lose 2 observations that are not NA in the Temp column.
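So if the goal is to reproduce the old program's numbers, one option is to keep the NAs when concatenating and let mean() drop only the NAs of the selected pollutant. A sketch based on the functions above:
pollutantmean <- function(directory = paste(getwd(), "/specdata", sep = ""),
                          pollutant, id = 1:332, hasHeader = TRUE) {
  # keep the NAs when concatenating, then drop only the NAs of the chosen pollutant
  table <- concatTables(directory, id, hasHeader, keepNAs = TRUE)
  mean(table[[pollutant]], na.rm = TRUE)
}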
I have the following functions. CreateChronVector does exactly what it implies. The resulting vector is in hourly intervals by default. The RoundHour function rounds up a chron vector to the hour.
CreateChronVector <- function(chronFrom, chronTo, frequency = "hourly") {
library(chron)
datesFrom <- dates(chronFrom)
timesFrom <- (chronFrom - dates(chronFrom))
datesTo <- dates(chronTo)
timesTo <- (chronTo - dates(chronTo))
if ((timesFrom != 0 || timesTo != 0) && frequency == "daily") {
print("Error: The indicated dates have hour components while the given frequency is daily.")
}
else {
if (timesTo == 0 && frequency == "hourly") {
timesTo <- 23/24
}
if (frequency == "hourly") {
chronFrom <- chron(dates = datesFrom, times = timesFrom,
format = c(dates = "m/d/y", times = "h:m:s"))
chronTo <- chron(dates = datesTo, times = timesTo,
format = c(dates = "m/d/y", times = "h:m:s"))
dateVector <- seq(chronFrom, chronTo, by = 1/24)
}
else if (frequency == "daily") {
dateVector <- seq(datesFrom, datesTo)
}
return(dateVector)
}
}
RoundHour <- function(x) {
res <- trunc(x,'hours', eps=1e-17)
res <- ifelse((x-res) > 0.5/24, res+1/24, res)
return(as.chron(res))
}
The problem I'm facing is that the intervals are not consistent. As an example, the code below returns two different interval sizes:
unique(diff(CreateChronVector(as.chron('2010-01-01'), as.chron('2010-01-01'))))
Similarly, using my rounding function does not correct the problem:
unique(diff(RoundHour(CreateChronVector(as.chron('2010-01-01'), as.chron('2010-01-01')))))
I'm sure this problem has to do with round-off errors. I have been trying to play with the trunc function and its eps parameter, but no luck.
You can use the xts package. Once you have your data in an xts object, you can use the align.time function to "round up" the time index. Almost all time-series analysis is very convenient in xts.
PS: If you give a reproducible example of your data, I will update the answer with an example.
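Without a reproducible example this is only a rough sketch with made-up data; the idea is that align.time() snaps each timestamp up to the next n-second boundary (n = 3600 for whole hours):
library(xts)
# made-up data: an hourly index with a little sub-second jitter
idx <- as.POSIXct("2010-01-01", tz = "UTC") + 0:23 * 3600 + runif(24, 0, 1)
x   <- xts(rnorm(24), order.by = idx)
# snap every timestamp up to the next full hour (n is in seconds)
x_hourly <- align.time(x, n = 3600)
head(index(x_hourly))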
Taking the point from @G. Grothendieck, you can see what he is talking about if you try this:
hours <- 1:23
dateVector <- sapply(hours, function(x) {
  chron(dates = "01/01/10", times = paste0(x, ":00:00"))
})
head( dateVector )
[1] 14610.04166666666606034 14610.08333333333393966 14610.12500000000000000
[4] 14610.16666666666606034 14610.20833333333393966 14610.25000000000000000
unique(diff(dateVector))
[1] 0.04166666666787932626903 0.04166666666606033686548
So you can't really do it because these numbers can't be represented exactly in floating point, but is there a reason this matters to you?
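In practice that means comparing the increments with a tolerance instead of expecting exact equality, for example:
# the steps are all "one hour" up to floating-point noise
all(abs(diff(dateVector) - 1/24) < 1e-9)
#[1] TRUE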