R Loop over multiple interdependent functions failing to loop - r

I am building a function to download Google analytics data from a long list of profiles and need a loop function that can tolerate a profile returning no data.
The problem is that there are several functions needed between the start of the loop and where the error can occur.
The Paste function is pulling an ID from idsvector and then the API query is constructed in 2 successive steps. This is then sent to the API using GetReportData(). The second ID in the list returns no data from the API. Currently it downloads the data from the first profile, merges it with the master dataset and then stops.
for (v in idsvector){
view.id <- paste("ga:",v,sep="") #the View ID parameter need to have "ga:" in front of the ID
sourcequery.list <- Init(
start.date = start.date,
end.date = end.date,
dimensions = "ga:channelGrouping,ga:campaign,ga:source,ga:medium,ga:date",
metrics = "ga:sessions,ga:bounces",
table.id = view.id,
max.results = 9999999
)
}
ga.sourcequery <- QueryBuilder(sourcequery.list)
data <- GetReportData(ga.sourcequery, token)
error=function(e){dev.off(); return(NULL)}
if (!is.null(data)) {
data$Property <- view.id
final.data<-rbind(sourcequery.data,data)
}
else {
next
}
}
How do I adapt this so that it loops back and tries the next ID?

Not sure if this will solve your problem, but a better approach for this is to use lapply.
It is not clear which library you are using to access GA, so I will make up some code for you:
library(data.table)
ga_load_property_data <- function(property) {
# here goes GA API wrapper magic
}
data <- lapply(properties, ga_load_property_data)
data <- rbindlist(data, idcol = "property")
This way you separate the load logic from your iterations.

Related

In R: Search all emails by subject line, pull comma-separate values from body, then save values in a dataframe

Each day, I get an email with the quantities of fruit sold on a particular day. The structure of the email is as below:
Date of report:,04-JAN-2022
Time report produced:,5-JAN-2022 02:04
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
I'm trying to build some code in R that will search through my emails, find all emails with a particular subject, iterate through each email to find the variables I'm looking for, take the values and place them in a dataframe with the "Date of report" put in a date column.
With the assistance of people in the community, I was able to achieve the desired result in Python. However as my project has developed, I need to now achieve the same result in R if at all possible.
Unfortunately, I'm quite new to R and therefore if anyone has any advice on how to take this forward I would greatly appreciate it.
For those interested, my Python code is below:
#PREP THE STUFF
Fruit_1 = "Apples"
Fruit_2 = "Pears"
searchf = [
Fruit_1,
Fruit_2
]
#DEF THE STUFF
def get_report_vals(report, searches):
dct = {}
for line in report:
term, *value = line
if term.casefold().startswith('date'):
dct['date'] = pd.to_datetime(value[0])
elif term in searches:
dct[term] = float(value[0])
if len(dct.keys()) != len(searches):
dct.update({x: None for x in searches if x not in dct})
return dct
#DO THE STUFF
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items
messages.Sort("[ReceivedTime]", True)
results = []
for message in messages:
if message.subject == 'FRUIT QUANTITIES':
if Fruit_1 in message.body and Fruit_2 in message.body:
data = [line.strip().split(",") for line in message.body.split('\n')]
results.append(get_report_vals(data, searchf))
else:
pass
fruit_vals = pd.DataFrame(results)
fruit_vals.columns = map(str.upper, fruit_vals.columns)
I'm probably going about this the wrong way, but I'm trying to use the steps I took in Python to achieve the same result in R. So for example I create some variables to hold the fruit sales I'm searching for, then I create a vector to store the searchables, and then when I create an equivalent 'get_vals' function, I create an empty vector.
library(RDCOMClient)
Fruit_1 <- "Apples"
Fruit_2 <- "Pears"
##Create vector to store searchables
searchf <- c(Fruit_1, Fruit_2)
## create object for outlook
OutApp <- COMCreate("Outlook.Application")
outlookNameSpace = OutApp$GetNameSpace("MAPI")
search <- OutApp$AdvancedSearch("Inbox", "urn:schemas:httpmail:subject = 'FRUIT QUANTITIES'")
inbox <- outlookNameSpace$Folders(6)$Folders("Inbox")
vec <- c()
for (x in emails)
{
subject <- emails(i)$Subject(1)
if (grepl(search, subject)[1])
{
text <- emails(i)$Body()
print(text)
break
}
}
read.table could be a good start for get_report_vals.
Code below outputs result as a list, exception handling still needs to be implemented :
report <- "
Date of report:,04-JAN-2022
Apples,6
Pears,1
Lemons,4
Oranges,2
Grapes,7
Grapefruit,2
"
get_report_vals <- function(report,searches) {
data <- read.table(text=report,sep=",")
colnames(data) <- c('key','value')
# find date
date <- data[grepl("date",data$key,ignore.case=T),"value"]
# transform dataframe to list
lst <- split(data$value,data$key)
# output result as list
c(list(date=date),lst[searches])
}
get_report_vals(report,c('Lemons','Oranges'))
$date
[1] "04-JAN-2022"
$Lemons
[1] "4"
$Oranges
[1] "2"
The results of various reports can then be concatenated in a data.frame using rbind:
rbind(get_report_vals(report,c('Lemons','Oranges')),get_report_vals(report,c('Lemons','Oranges')))
date Lemons Oranges
[1,] "04-JAN-2022" "4" "2"
[2,] "04-JAN-2022" "4" "2"
The code now functions as intended. Function was written quite a bit differently from those recommended:
get_vals <- function(email) {
body <- email$body()
date <- str_extract(body, "\\d{2}-[:alpha:]{3}-\\d{4}") %>%
as.character()
data <- read.table(text = body, sep = ",", skip = 9, strip.white = T) %>%
row_to_names(1) %>%
mutate("Date" = date)
return(data)
}
In addition I've written this to bind the rows together:
info <- sapply(results, get_vals, simplify = F) %>%
bind_rows()
May this is not what you are expecting to get as an answer, but I must state that here to help other readers to avoid such mistakes in future.
Unfortunately your Python code is not well-written. For example, I've noticed the following code where you iterate over all items in a folder and check the Subject and message bodies for keywords:
for message in messages:
if message.subject == 'FRUIT QUANTITIES':
if Fruit_1 in message.body and Fruit_2 in message.body:
You need to use the Find/FindNext or Restrict methods of the Items class instead. So, you don't need to iterate over all items in a folder. Instead, you get only items that correspond to your conditions. Read more about these methods in the following articles:
How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder
You may combine all your search criteria into a single query. So, you just need to iterate over found items and extract the data.
Also you may find the AdvancedSearch method helpful. The key benefits of using the AdvancedSearch method in Outlook are:
The search is performed in another thread. You don’t need to run another thread manually since the AdvancedSearch method runs it automatically in the background.
Possibility to search for any item types: mail, appointment, calendar, notes etc. in any location, i.e. beyond the scope of a certain folder. The Restrict and Find/FindNext methods can be applied to a particular Items collection (see the Items property of the Folder class in Outlook).
Full support for DASL queries (custom properties can be used for searching too). You can read more about this in the Filtering article in MSDN. To improve the search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
You can stop the search process at any moment using the Stop method of the Search class.
See Advanced search in Outlook programmatically: C#, VB.NET for more information.

Vectorized Operation in R causing problems with custom function

I'm writing out some functions for Inventory management. I've recently wanted to add a "photo url column" to my spreadsheet by using an API I've used successfully while initially building my inventory. My Spreadsheet header looks like the following:
SKU | NAME | OTHER STUFF
I have a getProductInfo function that returns a list of product info from an API I'm calling.
getProductInfo<- function(barcode) {
#Input UPC
#Output List of product info
CallAPI(barcode)
Process API return, remove garbage
return(info)
}
I made a new function that takes my inventory csv as input, and attempts to add a new column with product photo url.
get_photo_url_from_product_info_output <- function(in_list){
#Input GetProductInfo Output. Returns Photo URL, or nothing if
#it doesn't exist
if(in_list$DisplayStockPhotos == TRUE){
return(in_list$StockPhotoURL)
} else {
return("")
}
}
add_Photo_URL <- function(in_csv){
#Input CSV data frame, appends photourl column
#Requires SKU (UPC) assumes no photourl column
out_csv <- mutate(in_csv, photo =
get_photo_url_from_product_info_output(
getProductInfo(SKU)
)
)
}
return (out_csv)
}
#Call it
new <- add_Photo_URL(old)
My thinking was that R would simply input the SKU of the from the row, and put it through the double function call "as is", and the vectorized DPLYR function mutate would just vectorize it. Unfortunately I was running into all sorts of problems I couldn't understand. Eventually I figured out that API call was crashing because the SKU field was all messed up as it was being passed in. I put in a breakpoint and found out that it wasn't just passing in the SKU, but instead an entire list (I think?) of SKUs. Every Row all at once. Something like this:
#Variable 'barcode' inside getProductInfo function contains:
[1] 7.869368e+11 1.438175e+10 1.256983e+10 2.454357e+10 3.139814e+10 1.256983e+10 1.313260e+10 4.339643e+10 2.454328e+10
[10] 1.313243e+10 6.839046e+11 2.454367e+10 2.454363e+10 2.454367e+10 2.454348e+10 8.418870e+11 2.519211e+10 2.454375e+10
[19] 2.454381e+10 2.454381e+10 2.454383e+10 2.454384e+10 7.869368e+11 2.454370e+10 2.454390e+10 1.913290e+11 2.454397e+10
[28] 2.454399e+10 2.519202e+10 2.519205e+10 7.742121e+11 8.839291e+11 8.539116e+10 2.519211e+10 2.519211e+10 2.519211e+10
Obviously my initial getProductInfo function can't handle that, so it'll crash.
How should I modify my code, whether it be in the input or API call to avoid this vectorized operation issue?
Well, it's not totally elegant but it works.
I figured out I need to use lapply, which is usually not my strong suit. Initally I tried to nest them like so:
lapply(SKU, get_photo_url_from_product_info_output(getProductInfo())
But that didn't work. So I just came up with bright idea of making another function
get_photo_url_from_sku <- function(barcode){
return(get_photo_url_from_product_info_output(getProductInfo(barcode)))
}
Call that in the lapply:
out_csv<- mutate(in_csv, photocolumn = lapply(SKU, get_photo_url_from_sku))
And it works great. My speed is only limited by my API calls.

Unable to update data in dataframe

i tried updating data in dataframe but its unable to get updating
//Initialize data and dataframe here
user_data=read.csv("train_5.csv")
baskets.df=data.frame(Sequence=character(),
Challenge=character(),
countno=integer(),
stringsAsFactors=FALSE)
/Updating data in dataframe here
for(i in 1:length((user_data)))
{
for(j in i:length(user_data))
{
if(user_data$challenge_sequence[i]==user_data$challenge_sequence[j]&&user_data$challenge[i]==user_data$challenge[j])
{
writedata(user_data$challenge_sequence[i],user_data$challenge[i])
}
}
}
writedata=function( seqnn,challng)
{
#print(seqnn)
#print(challng)
newRow <- data.frame(Sequence=seqnn,Challenge=challng,countno=1)
baskets.df=rbind(baskets.df,newRow)
}
//view data here
View(baskets.df)
I've modified your code to what I believe will work. You haven't provided sample data, so I can't verify that it works the way you want. I'm basing my attempt here on a couple of common novice mistakes that I'll do my best to explain.
Your writedata function was written to be a little loose with it's scope. When you create a new function, what happens in the function technically happens in its own environment. That is, it tries to look for things defined within the function, and then any new objects it creates are created only within that environment. R also has this neat (and sometimes tricky) feature where, if it can't find an object in an environment, it will try to look up to the parent environment.
The impact this has on your writedata function is that when R looks for baskets.df in the function and can't find it, R then turns to the Global Environment, finds baskets.df there, and then uses it in rbind. However, the result of rbind gets saved to a baskets.df in the function environment, and does not update the object of the same name in the global environment.
To address this, I added an argument to writedata that is simply named data. We can then use this argument to pass a data frame to the function's environment and do everything locally. By not making any assignment at the end, we implicitly tell the function to return it's result.
Then, in your loop, instead of simply calling writedata, we assign it's result back to baskets.df to replace the previous result.
for(i in 1:length((user_data)))
{
for(j in i:length(user_data))
{
if(user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
user_data$challenge[i] == user_data$challenge[j])
{
baskets.df <- writedata(baskets.df,
user_data$challenge_sequence[i],
user_data$challenge[i])
}
}
}
writedata=function(data, seqnn,challng)
{
#print(seqnn)
#print(challng)
newRow <- data.frame(Sequence = seqnn,
Challenge = challng,
countno = 1)
rbind(data, newRow)
}
I'm not sure what you're programming background is, but your loops will be very slow in R because it's an interpreted language. To get around this, many functions are vectorized (which simply means that you give them more than one data point, and they do the looping inside compiled code where the loops are fast).
With that in mind, here's what I believe will be a much faster implementation of your code
user_data=read.csv("train_5.csv")
# challenge_indices will be a matrix with TRUE at every place "challenge" and "challenge_sequence" is the same
challenge_indices <- outer(user_data$challenge_sequence, user_data$challenge_sequence, "==") &
outer(user_data$challenge, user_data$challenge, "==")
# since you don't want duplicates, get rid of them
challenge_indices[upper.tri(challenge_indices, diag = TRUE)] <- FALSE
# now let's get the indices of interest
index_list <- which(challenge_indices,arr.ind = TRUE)
# now we make the resulting data set all at once
# this is much faster, because it does not require copying the data frame many times - which would be required if you created a new row every time.
baskets.df <- with(user_data, data.frame(
Sequence = challenge_sequence[index_list[,"row"]],
challenge = challenge[index_list[,"row"]]
)

data.table and R's ellipsis (i.e. '...'): pass by reference does not seem to work

Im trying to manipulate a large data table (~37 MB) but in a special way: for other (unrelated) reasons I have implemented a 'hook' like structure meaning that the overall process is like
1) load the data.table from disk
2) fire a certain hook
3) the hook structure looks for this name ans checks whether the user (=me :)) has bound a function to this hook and if so, it is called
4) the data is processed further
The functions look like this:
data = readRDS(pathToFile)
data = data.table(data)
fireHook("After_data_read", data, [some other parameters])
some_more_processing(data)
and the region around fireHook looks like
hooksRegistered = list(
"After_data_read" = function(data, ...) {
# do some stuff
}
)
fireHook = function(hookName, ...) {
for (hookNameRegistered in names(hooksRegistered)) {
if (hookName == hookNameRegistered) {
func = .global.hooksRegistered[[hookName]]
func(hookName, ...)
}
}
}
Observe that one needs to cast an object that already is a data.table into it again (otherwise the pass-by-reference does not work), see Adding new columns to a data.table by-reference within a function not always working and Pass by reference bug?
Problem: this line: func(hookName, ...) takes like forever (> 5 minutes).
The debugger never really gets into the function (so its not the code in the function that takes a long time) and I've tested it with small data.tables and it worked. Also, I noted that the following seems to work:
fireHook = function(hookName, ...) {
args = list(...)
for (hookNameRegistered in names(.global.hooksRegistered)) {
if (hookName == hookNameRegistered) {
func = .global.hooksRegistered[[hookName]]
func(hookName, args)
}
}
}
(notice that I substituted ... by list(...)). To me, it seems as if R is trying to copy the whole table when using .... Is this right/desired? Or am I using it wrong?
regards,
FW

R Scripts - Use Output Message as if condition

I am using the RGoogleAnalytics library to get all the data from my Google Analytics Account into R. However, complex queries deliver 0 results.
My code looks like:
query.list <- Init(start.date = paste(c(lastmonth.startdate)),
end.date = paste(c(lastmonth.enddate)),
metrics = "ga:goalCompletionsAll",
dimensions = "ga:countryIsoCode,ga:yearMonth",
filters = "ga:goalCompletionsAll>0",
max.results = 10000,
table.id = sprintf("ga:%s", sites$profile.id[i]))
# Create the Query Builder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)
# Extract the data and store it in a data-frame
ga.countriesConversions1 <- GetReportData(ga.query, token)
Everything is inside a "for", and the script stops if one of the queries end in 0 results, because GetReportData(ga.query, token) cannot create a dataframe if there is no data.
I would like to know if there is a way use the warning message ("Your query matched 0 results. Please verify your query using the Query Feed Explorer and re-run it") fired by the library to the console, assign it to a variable and use this as an if condition. So I could create a dummy data.frame before the next function comes.
Assuming getReportData is throwing an error, then you can try:
ga.countriesConversions1 <- try(GetReportData(ga.query, token), silent=TRUE)
if(inherits(ga.countriesConversions1, "try-error")) {
warning(geterrmessage())
... error handling logic ...
}

Resources