How to speed up code with a loop in R

Problem:
I have two data frames.
DF with payment log:
str(moneyDB)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 53682 obs. of 7 variables:
$ user_id : num 59017170 57859746 58507536 59017667 59017795 ...
$ reg_date: Date, format: "2016-08-06" "2016-07-01" "2016-07-19" ...
$ date : Date, format: "2016-08-06" "2016-07-01" "2016-07-19" ...
$ money : num 0.293 0.05 0.03 0.03 7 ...
$ type : chr "1" "2" "2" "1" ...
$ quality : chr "VG" "no_quality" "no_quality" "VG" ...
$ geo : chr "Canada" "NO GEO" "NO GEO" "Canada" ...
Here is its structure; it's just a log of all transactions.
I also have a second data frame:
str(grPaysDB)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 335591 obs. of 9 variables:
$ reg_date : Date, format: "2016-05-01" "2016-05-01" "2016-05-01" ...
$ date : Date, format: "2016-05-01" "2016-05-01" "2016-05-01" ...
$ type : chr "1" "1" "1" "1" ...
$ quality : chr "VG" "VG" "VG" "VG" ...
$ geo : chr "Australia" "Canada" "Finland" "Canada" ...
$ uniqPayers : num 0 1 0 1 1 0 0 1 0 3 ...
It's grouped data from the first data frame, plus zero transactions. For example, there are many rows in the second data frame with zero payers; that's why the second data frame is larger than the first.
I need to add a column weeklyPayers to the second data frame. Weekly payers is the number of unique payers over the last 7 days. I tried to do it with a loop, but it takes too long. Is there a vectorized way to do this?
weeklyPayers <- vector()
for (i in 1:nrow(grPaysDB)) {
  temp <- moneyDB %>%
    filter(
      geo == grPaysDB$geo[i],
      reg_date == grPaysDB$reg_date[i],
      quality == grPaysDB$quality[i],
      type == grPaysDB$type[i],
      between(date, grPaysDB$date[i] - 6, grPaysDB$date[i])
    )
  weeklyPayers <- c(weeklyPayers, length(unique(temp$user_id)))
}
grPaysDB <- cbind(grPaysDB, weeklyPayers)
In this loop, for each row in the second data frame I find the rows in the first data frame with the matching geo, type, quality and reg_date and the right range of dates. Then I can calculate the number of unique payers.

I may be misunderstanding, but I think this should be fairly simple using filter and summarise in dplyr. However, as @Hack-R mentioned, it would be helpful to have your dataset. It would look something like:
library(dplyr)
weeklyPayers <- grPaysDB %>%
  filter(date > ADD DATE IN QUESTION) %>%
  summarise(sumWeeklyPayers = sum(uniqPayers))
Then again, I may well have misunderstood. If your question involves summing for each week, you may want to investigate daily2weekly in the timeSeries package and then use group_by on the weekly variable that results.

I would try joining your datasets with merge on multiple columns (c('geo', 'reg_date', 'quality', 'type')) and filtering the result on the dates. After that, aggregate using summarise.
But I am not completely sure why you want to add the weekly payers to every transaction. Isn't it more informative, or easier, to aggregate your data by week number (with dplyr)? Like so:
moneyDB %>%
  mutate(week = date - as.POSIXlt(date)$wday) %>%
  group_by(geo, reg_date, quality, type, week) %>%
  summarise(weeklyPayers = n())
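The join-then-filter idea suggested above can be sketched end to end with dplyr. The miniature moneyDB and grPaysDB below are made-up stand-ins (a single group, a handful of dates) just to show the shape of the pipeline; column names follow the question:

```r
library(dplyr)

# Hypothetical miniature stand-ins for moneyDB and grPaysDB
moneyDB <- data.frame(
  user_id  = c(1, 2, 2, 3),
  reg_date = as.Date("2016-08-01"),
  date     = as.Date(c("2016-08-01", "2016-08-03", "2016-08-04", "2016-08-20")),
  geo = "Canada", quality = "VG", type = "1",
  stringsAsFactors = FALSE
)
grPaysDB <- data.frame(
  reg_date = as.Date("2016-08-01"),
  date     = as.Date(c("2016-08-05", "2016-08-21")),
  geo = "Canada", quality = "VG", type = "1",
  stringsAsFactors = FALSE
)

weekly <- grPaysDB %>%
  # join every payment to every report date in the same group
  inner_join(moneyDB, by = c("geo", "reg_date", "quality", "type"),
             suffix = c("", ".pay")) %>%
  # keep payments in the trailing 7-day window of each report date
  filter(date.pay >= date - 6, date.pay <= date) %>%
  group_by(geo, reg_date, quality, type, date) %>%
  summarise(weeklyPayers = n_distinct(user_id), .groups = "drop") %>%
  # restore the zero-payer rows that the inner join dropped
  right_join(grPaysDB, by = c("geo", "reg_date", "quality", "type", "date")) %>%
  mutate(weeklyPayers = coalesce(weeklyPayers, 0L))
```

One caveat: the intermediate join is many-to-many, so on the full 335k-row data frame it can get large; if memory is a concern, a data.table non-equi join would express the same window condition without materialising every pairing.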

Related

Having issues with creating a gt table in R

I get the following error, "Error: Don't know how to select rows using an object of class quosures"
I have narrowed it down to one specific section of code in the R Markdown file. If I highlight from Sum_Adjusted through "seriesttls") and run the selection, there are no errors. It is the tab_row_group item that throws it off.
Sum_Adjusted <- Adjusted %>%
  gt(rowname_col = "seriesttls") %>%
  tab_row_group(group = "Super Sectors",
                rows = vars("Mining and logging", "Construction", "Manufacturing",
                            "Trade, transportation, and utilities", "Information",
                            "Financial activities", "Professional and business services",
                            "Education and health services", "Leisure and hospitality",
                            "Other services", "Government"))
I am hoping that from just looking at the code someone can explain why I am getting this error now and not the last dozen times I have run this exact same code. I have not been able to reproduce this in a smaller example.
str(Adjusted) produces this
tibble [12 x 7] (S3: tbl_df/tbl/data.frame)
$ seriesttls : chr [1:12] "Total nonfarm" "Mining and logging"
"Construction" "Manufacturing" ...
$ empces : num [1:12] 1335900 15000 91000 60200 282200 ...
$ MonthlyDifference: num [1:12] 4800 0 -1000 0 900 600 500 1500 0 1800 ...
$ AnnualDifference : num [1:12] 100900 200 -1900 5200 30600 ...
$ PercentGrowthRate: num [1:12] 0.0817 0.0135 -0.0205 0.0945 0.1216 ...
$ Max : num [1:12] 1442800 15800 146400 60300 282200 ...
$ oftotal : num [1:12] 1 0.0112 0.0681 0.0451 0.2112 ...
The issue is that rows should be a vector, so instead of vars() (which is deprecated), use the standard select helpers or just concatenate with c(), as described in the documentation:
rows -
The rows to be made components of the row group. Can either be a vector of row captions provided in c(), a vector of row indices, or a helper function focused on selections. The select helper functions are: starts_with(), ends_with(), contains(), matches(), one_of(), and everything().
We could do
library(gt)
library(dplyr)
Adjusted %>%
  gt(rowname_col = "seriesttls") %>%
  tab_row_group(label = "Super Sectors",
                rows = c("Mining and logging", "Construction", "Manufacturing",
                         "Trade, transportation, and utilities", "Information",
                         "Financial activities", "Professional and business services",
                         "Education and health services", "Leisure and hospitality",
                         "Other services", "Government"))
Using a reproducible example
gtcars %>%
  dplyr::select(model, year, hp, trq) %>%
  dplyr::slice(1:8) %>%
  gt(rowname_col = "model") %>%
  tab_row_group(
    label = "numbered",
    rows = c("GT", "California"))
Output: (rendered gt table omitted)

R Function behaves differently than the code entered line by line

I am at a loss. Googling has failed me because I'm not sure I know the right question to ask.
I have a data frame (df1) and my goal is to use a function to get a moving average using forecast::ma.
Here is str(df1)
'data.frame': 934334 obs. of 6 variables:
 $ clname  : chr ...
 $ dos     : Date, format: "2011-10-05" ...
 $ subpCode: chr ...
 $ ch1     : chr ...
 $ prov    : chr ...
 $ ledger  : chr ...
I have a function that I am trying to write.
process <- function(df, y, sub, ...) {
  prog <- df %>%
    filter(subpCode == sub) %>%
    group_by(dos, subpCode) %>%
    summarise(services = n())
  prog$count_ts <- ts(prog[ , c('services')])
}
The problem is that when I run the function, my final result is a 1x1798 data object, and it's just a time series. If I run the code line by line I get what I need, but my function, which hypothetically does the same thing, won't work.
Here is my desired result
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1718 obs. of 4 variables:
$ dos : Date, format: "2010-09-21" "2010-11-18" "2010-11-19" "2010-11-30" ...
$ subpCode: chr "CII " "CII " "CII " "CII " ...
$ services: int 1 1 2 2 2 2 1 2 1 3 ...
$ count_ts: Time-Series [1:1718, 1] from 1 to 1718: 1 1 2 2 2 2 1 2 1 3 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "services"
- attr(*, "vars")= chr "dos"
- attr(*, "drop")= logi TRUE
And here is the code that gets it.
CII <- df1 %>%
filter(subpCode == "CII ") %>%
group_by(dos, subpCode) %>%
summarise(services = n())
CII$count_ts <- ts(CII[ , c('services')])
Could someone point me in the right direction? I've exhausted my usual places.
Thanks!
Following the vignette pointed out by @CalumYou, you should use something more like this:
process <- function(df, sub) {
  ## Enquote sub
  sub <- enquo(sub)
  ## Pipe as before
  prog <- df %>%
    filter(subpCode == !!sub) %>%
    group_by(dos, subpCode) %>%
    summarise(services = n())
  prog$count_ts <- ts(prog[ , c('services')])
  ## Explicitly return the prog object: the original function returned the
  ## value of its last expression, which was the ts assignment
  return(prog)
}
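For completeness: since sub is passed as a plain string rather than a bare column name, tidy evaluation is optional here; the essential fix is returning prog explicitly. A minimal sketch, using made-up toy data to show the shape of the result:

```r
library(dplyr)

# sub is a plain character value, so ordinary data masking finds it in the
# calling environment; no enquo()/!! needed for this case.
process <- function(df, sub) {
  prog <- df %>%
    filter(subpCode == sub) %>%
    group_by(dos, subpCode) %>%
    summarise(services = n(), .groups = "drop_last")
  prog$count_ts <- ts(prog[, "services"])
  prog  # return the whole data frame, not the last assignment's value
}

# Hypothetical toy data standing in for df1
df1 <- data.frame(
  dos = as.Date(c("2011-10-05", "2011-10-05", "2011-10-06", "2011-10-06")),
  subpCode = c("CII", "CII", "CII", "XX"),
  stringsAsFactors = FALSE
)
p <- process(df1, "CII")
str(p)  # a grouped data frame with services and count_ts, not a bare ts
```

Note that if df ever gained a column literally named sub, the filter would compare against that column instead; the tidy-eval version above avoids that ambiguity.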

Classification dummy R

In a large dataset of US stocks I have an integer variable containing SIC codes: https://www.sec.gov/info/edgar/siccodes.htm
I would like to create a dummy variable indicating major group 50, i.e. a variable that takes on 1 for durable goods and 0 otherwise.
I tried the code:
data$durable <- as.integer(grepl(pattern = "50", x = data$sic))
But this, of course, does not take the hierarchical structure of SIC into account: I want to match "50" only in the first two digits.
(New to R)
/Alex
Use either division, or pad zeros on the left and check the first two characters.
code <- c(100, 102, 501, 5010)

# approach 1: the integer part of code/100 gives the first two digits of a 4-digit code
as.integer(as.integer(code / 100) == 50)

# approach 2: zero-pad to 4 digits and compare the first two characters
as.integer(substring(sprintf("%04d", code), 1, 2) == "50")
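The same first-two-digits check can also be written with integer division (%/%); like approach 1, this assumes 4-digit codes, so a 3-digit code such as 501 is not counted as major group 50:

```r
# approach 3 (sketch): integer division avoids the double division + truncation
code <- c(100, 102, 501, 5010)
as.integer(code %/% 100 == 50)
#> [1] 0 0 0 1
```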
library(readxl)
library(dplyr)
library(stringi)

data_sic <- read_excel("./sic_example.xlsx")
data_sic$temp1 <- stri_sub(data_sic$SIC, 1, 2)
data_sic <- mutate(data_sic,
                   durable_indicator = ifelse(temp1 == "50", 1, 0))
str(data_sic)
Output:
str(data_sic)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 4 variables:
$ SIC : num 4955 4961 4991 5000 5010 ...
$ Industry Title : chr "HAZARDOUS WASTE MANAGEMENT" "STEAM & AIR-CONDITIONING SUPPLY" "COGENERATION SERVICES & SMALL POWER PRODUCERS" "WHOLESALE-DURABLE GOODS" ...
$ temp1 : chr "49" "49" "49" "50" ...
$ durable_indicator: num 0 0 0 1 1 1
Addendum:
There are multiple ways to approach this problem.
I would suggest reviewing the stringi package documentation for string editing, as well as the caret package documentation for dummification of variables and other statistical transformations.

Dynamically Changing Data Type for a Data Frame

I have a set of data frames belonging to many countries consisting of 3 variables (year, AI, OAD). The example for Zimbabwe is shown as below,
> str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: chr "1955" "1956" "1957" "1958" ...
$ AI : chr "11.61568161" "11.34114927" "11.23639317" "11.18841409" ...
$ OAD : chr "5.740789488" "5.775882473" "5.800441036" "5.822536579" ...
I am trying to change the data types of the variables in the data frame to the below, so that I can model the linear fit using lm(dframe_Zimbabwe_1955_1970$AI ~ dframe_Zimbabwe_1955_1970$year).
> str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: int 1955 1956 1957 1958 ...
$ AI : num 11.61568161 11.34114927 11.23639317 11.18841409 ...
$ OAD : num 5.740789488 5.775882473 5.800441036 5.822536579 ...
The below static code able to change AI from character (chr) to numeric (num).
dframe_Zimbabwe_1955_1970$AI <- as.numeric(dframe_Zimbabwe_1955_1970$AI)
However, when I tried to automate the code as below, AI still remains character (chr):
countries <- c('Zimbabwe', 'Afghanistan', ...)
for (country in countries) {
  assign(paste('dframe_', country, '_1955_1970$AI', sep = ''),
         eval(parse(text = paste('as.numeric(dframe_', country, '_1955_1970$AI)', sep = ''))))
}
Can you advise what I could have done wrong?
Thanks.
42: Your code doesn't work as written, but with some edits it will. In addition to the missing parentheses and wrong sep, you can't use $'column name' in assign, but you don't need it anyway:
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps it a data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
  remove(new_val)
}
Proof it works:
dframe_Zimbabwe_1955_1970 <- data.frame(year = c("1955", "1956", "1957"),
                                        AI = c("11.61568161", "11.34114927", "11.23639317"),
                                        OAD = c("5.740789488", "5.775882473", "5.800441036"),
                                        stringsAsFactors = FALSE)
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: chr "1955" "1956" "1957"
$ AI : chr "11.61568161" "11.34114927" "11.23639317"
$ OAD : chr "5.740789488" "5.775882473" "5.800441036"
countries <- 'Zimbabwe'
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps it a data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
  remove(new_val)
}
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: num 1955 1956 1957
$ AI : num 11.6 11.3 11.2
$ OAD : num 5.74 5.78 5.8
It's going to be considered fairly ugly code by the purists, but perhaps this:
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps it a data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
}
Using the get('obj_name') function is considered cleaner than eval(parse(text = ...)). This would be handled more naturally in R had you assembled these data frames in a list.
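A sketch of that list-based alternative: collect the per-country frames into a named list once, then convert every column of every frame in one pass. The data frame below is a made-up stand-in for the real ones:

```r
# Hypothetical stand-in for one of the per-country data frames
dframe_Zimbabwe_1955_1970 <- data.frame(
  year = c("1955", "1956"), AI = c("11.6", "11.3"), OAD = c("5.74", "5.78"),
  stringsAsFactors = FALSE
)

countries <- c("Zimbabwe")

# gather the existing global objects into a named list
frames <- setNames(
  lapply(countries, function(country) get(paste0("dframe_", country, "_1955_1970"))),
  countries
)

# one lapply converts all columns of all frames to numeric
frames <- lapply(frames, function(df) {
  df[] <- lapply(df, as.numeric)
  df
})

str(frames$Zimbabwe)  # year, AI and OAD are now num
```

From here on you index with frames[["Zimbabwe"]] instead of assembling object names with paste, and the whole fit loop becomes a single lapply over the list.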

R - error expecting a single value using dplyr mutate and if_else

I have a data frame whose str is:
'data.frame': 334 obs. of 6 variables:
$ Patient_ID : int 524451 517060 518025 515768 499994
$ Camp_Start_Date : Date, format: "2003-08-16" "2005-02-15" "2005-02-15" ...
$ Camp_End_Date : Date, format: "2003-08-20" "2005-02-18" "2005-02-18" ...
$ First_Interaction: Date, format: "2003-08-16" "2004-10-03" "2005-02-17" ...
I am using this to create a new column pRegDate:
RegDatelogicLUT <- RegDatelogicLUT %>%
  mutate(pRegDate = if_else(between(First_Interaction, Camp_Start_Date, Camp_End_Date),
                            First_Interaction, Camp_Start_Date))
I am getting this error:
Error: expecting a single value
Any help will be appreciated.
Thanks
There is a nice lubridate solution for this problem:
library(lubridate)
RegDatelogicLUT <- RegDatelogicLUT %>%
  mutate(pRegDate = if_else(First_Interaction %within% (Camp_Start_Date %--% Camp_End_Date),
                            First_Interaction, Camp_Start_Date))
#  Patient_ID Camp_Start_Date Camp_End_Date First_Interaction   pRegDate
#1     524451      2003-08-16    2003-08-20        2003-08-16 2003-08-16
#2     517060      2005-02-15    2005-02-18        2004-10-03 2005-02-15
#3     518025      2005-02-15    2005-02-18        2005-02-17 2005-02-17
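If you'd rather not add a dependency, plain vectorized date comparisons sidestep the error entirely (dplyr's between() historically required scalar lower/upper bounds, which is what triggers "expecting a single value" here). The rows below are made up to mirror the str() in the question:

```r
library(dplyr)

# Hypothetical rows mirroring the question's data frame
RegDatelogicLUT <- data.frame(
  Patient_ID        = c(524451, 517060),
  Camp_Start_Date   = as.Date(c("2003-08-16", "2005-02-15")),
  Camp_End_Date     = as.Date(c("2003-08-20", "2005-02-18")),
  First_Interaction = as.Date(c("2003-08-16", "2004-10-03"))
)

# >= and <= are fully vectorized, so row-wise bounds are fine
RegDatelogicLUT <- RegDatelogicLUT %>%
  mutate(pRegDate = if_else(First_Interaction >= Camp_Start_Date &
                              First_Interaction <= Camp_End_Date,
                            First_Interaction, Camp_Start_Date))
```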
