How do I find the clickthrough rate using dbplyr in R?

Here is the given code:
library(DBI)
library(RSQLite)
library(readr)  # for read_csv()
library(dplyr)  # for copy_to() (dbplyr supplies the database backend)

sqcon <- dbConnect(dbDriver("SQLite"), "data/sqlite.db")
events <- read_csv("events_log.csv")
sqevents <- copy_to(sqcon, events)
sqevents
The sqevents dataframe is like this:
## # Source: table<events> [?? x 9]
## # Database: sqlite 3.35.5 [C:\Users\James\Documents\Work\2021 Sem2\Stats
## # 369\lab4\Data\sqlite.db]
## uuid timestamp session_id group action checkin page_id n_results
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 00000736167~ 2.02e13 78245c2c3f~ b searchR~ NA cbeb66d1~ 5
## 2 00000c69fe3~ 2.02e13 c559c3be98~ a searchR~ NA eb658e87~ 10
## 3 00003bfdab7~ 2.02e13 760bf89817~ a checkin 30 f99a9fc1~ NA
## 4 0000465cd7c~ 2.02e13 fb905603d3~ a checkin 60 e5626962~ NA
## 5 000050cbb4e~ 2.02e13 c2bf5e5172~ a checkin 30 787dd6a4~ NA
## 6 0000a6af2ba~ 2.02e13 f6840a9614~ a checkin 180 6fb7b9ea~ NA
## 7 0000cd61e11~ 2.02e13 51f4d3b6a8~ a checkin 240 8ad97e7c~ NA
## 8 000104fe220~ 2.02e13 485eabe537~ b searchR~ NA 4da9a642~ 15
## 9 00012e37b74~ 2.02e13 91174a537d~ a checkin 180 dfdff179~ NA
## 10 000145fbe69~ 2.02e13 a795756dba~ b checkin 150 ec0bad00~ NA
## # ... with more rows, and 1 more variable: result_position <dbl>
I want to find the clickthrough rate, i.e. the proportion of session_ids that have action == "visitPage".
My current code is this:
sqevents %>%
  group_by(session_id) %>%
  summarise(clickthrough = sum(action == "visitPage")) %>%
  filter(clickthrough == "0") %>%
  collect()
However this doesn't return anything:
## # A tibble: 0 x 2
## # ... with 2 variables: session_id <chr>, clickthrough <lgl>
What did I do wrong? And how do I fix this?

Perhaps we need to unquote the "0": the previous step's sum() returns a numeric summarised column, and comparing it to the string "0" matches nothing (in SQLite a text literal never compares equal to a numeric value). Also, if there are NA elements, specify na.rm = TRUE in sum(); otherwise any missing value in the column makes the sum NA, since na.rm = FALSE by default.
library(dplyr)
sqevents %>%
  group_by(session_id) %>%
  summarise(clickthrough = sum(action == "visitPage", na.rm = TRUE)) %>%
  filter(clickthrough == 0) %>%
  collect()
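To see why the quoted "0" matched nothing, you can ask dbplyr to print the SQL it generates with show_query() (a diagnostic sketch; the exact SQL depends on the backend):
sqevents %>%
  group_by(session_id) %>%
  summarise(clickthrough = sum(action == "visitPage", na.rm = TRUE)) %>%
  filter(clickthrough == "0") %>%
  show_query()
# the generated SQL compares the numeric aggregate against the text literal '0',
# which in SQLite is never equal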
The other possibility is that every session_id has at least one 'visitPage'; in that case the filter step correctly returns 0 rows.

From your description ("[...] the proportion of session_id that have action=="visitPage" [...]"), you might run into trouble further down the pipe using sum(). A nice way to calculate the proportion you described is this:
library(dplyr)
sqevents %>%
  dplyr::group_by(session_id) %>%
  # check if a session has at least one "visitPage" (TRUE or FALSE = 1 or 0)
  dplyr::summarise(yn = any(action == "visitPage")) %>%
  # take the mean of that to get the proportion
  dplyr::summarise(prop = mean(yn))
# and collect() if you like
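A minimal end-to-end sketch that computes the rate inside the database and collects only a single number (assuming the sqevents table from the question; if any() has no SQL translation on your backend, max() over the 0/1 comparison is a portable substitute):
library(dplyr)
clickthrough_rate <- sqevents %>%
  group_by(session_id) %>%
  summarise(visited = max(action == "visitPage", na.rm = TRUE)) %>%  # 1 if the session ever visited a page
  summarise(clickthrough = mean(visited, na.rm = TRUE)) %>%
  collect() %>%
  pull(clickthrough)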

Related

I need help accessing a value in a tibble by name

I need to use the value from count() by name and not by position due to the dynamic nature of the source data.
I am trying to estimate the labor cost to re-ip devices based on existing ip assignment state.
Example:
a device with a state of Active will be $40.00
a device with a state of ActiveReservation will be $100.00
For the first example below:
6,323 * $10
For the second example below:
9 * $10
I can get them by
temp_dhcp_count$quantity[1] * 10
However, I can't guarantee that [1] is always the position of "Active"; I need to be able to call it by name, "Active".
My assumption was that if I could extract the quantities as named values, I could do:
> Active = 6323
> Active * 10
[1] 63230
vs
temp_dhcp_count$quantity[1] * 10
For example:
> temp_dhcp_count
# A tibble: 5 x 2
# Groups: AddressState [5]
AddressState quantity
<chr> <int>
1 Active 6323
2 ActiveReservation 1222
3 Declined 10
4 Expired 12
5 InactiveReservation 287
> temp_dhcp_count$quantity[1]
[1] 6323
and
> temp_dhcp_count
# A tibble: 3 x 2
# Groups: AddressState [3]
AddressState quantity
<chr> <int>
1 Active 9
2 ActiveReservation 46
3 InactiveReservation 642
> temp_dhcp_count$quantity[1]
[1] 9
I tried asking how to extract rows from a tibble as key value pairs and now I am trying to ask this way based on feedback.
How do you change the output of count from a tibble to Name Value pairs?
The source data is a tsv that I import and select based on subnet and count by state.
library(tidyverse)
library(ipaddress)

dhcp <- read_delim("dhcpmerge.tsv.txt",
                   delim = "\t", escape_double = FALSE,
                   trim_ws = TRUE)
dhcp <- distinct(dhcp)

network_in_review <- "10.75.0.0/16"
temp_dhcp <- dhcp %>%
  select(IPAddress, AddressState, HostName) %>%
  filter(is_within(ip_address(IPAddress), ip_network(network_in_review)))

temp_dhcp %>%
  group_by(AddressState) %>%
  count(name = "quantity") -> temp_dhcp_count
temp_dhcp_count
After more digging, I found that
deframe() %>% as.list()
works as well.
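A small sketch of that route (assuming temp_dhcp_count is still grouped from count(), hence the ungroup()):
library(dplyr)
library(tibble)

vals <- temp_dhcp_count %>%
  ungroup() %>%   # deframe() wants a plain two-column frame
  deframe() %>%   # first column becomes names, second becomes values
  as.list()
vals$Active * 10
# [1] 63230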
You can create a named list. With the sample data
temp_dhcp_count <- read.table(text="
AddressState quantity
Active 6323
ActiveReservation 1222
Declined 10
Expired 12
InactiveReservation 287", header=TRUE)
You can create a named list of values to extract them by name
vals <- with(temp_dhcp_count, setNames(as.list(quantity), AddressState))
vals$Active
# [1] 6323
vals$Declined
# [1] 10
And if the vals$ part bothers you, you can use with() again
with(vals, {
Active * 10 - Declined * 2
})
# [1] 63210
If I'm understanding the goal, you can make a table of prices, then merge it in to temp_dhcp_count as needed:
library(tidyverse)
prices <- tribble(
  ~AddressState,         ~price,
  "Active",               40,
  "ActiveReservation",   100,
  "Declined",             50,
  "Expired",              50,
  "InactiveReservation", 120
)
temp_dhcp_count %>%
  left_join(prices) %>%
  mutate(total = quantity * price)
# # A tibble: 5 x 4
# AddressState quantity price total
# <chr> <dbl> <dbl> <dbl>
# 1 Active 6323 40 252920
# 2 ActiveReservation 1222 100 122200
# 3 Declined 10 50 500
# 4 Expired 12 50 600
# 5 InactiveReservation 287 120 34440
This will work regardless of the order of AddressState in temp_dhcp_count.
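If you want the join keys explicit (silencing the Joining, by = "AddressState" message) and a grand total for the labor estimate, here is a small follow-on sketch (the labor_cost name is my own):
temp_dhcp_count %>%
  left_join(prices, by = "AddressState") %>%
  mutate(total = quantity * price) %>%
  ungroup() %>%                        # the tibble may still be grouped from count()
  summarise(labor_cost = sum(total))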

Trying to group data by region and summarize by date in R Studio on COVID19 epidemic

I'm an old FORTRAN, C programmer trying to learn R. I started working with data on the COVID19 epidemic and have run aground.
The data I'm working with started out as wide data and I have converted it to long (row-wise) data. It contains a daily count of cases by ProvinceState, Region/Country, Lat, Long, Date, Cases.
I want to filter the dataframe for Mainland China and summarize cases by date as a first step. The code below generates a NULL data set when I try to group the data.
Thanks for any help!
library(dplyr)
library(dygraphs)
library(lubridate)
library(tidyverse)
library(timeSeries)

# Set current working directory.
setwd("/Users/markmcleod/MarksRepository/Data")

# Read a case csv file.
Covid19ConfirmedWideData <- read.csv("Covid19Deaths.csv", header = TRUE, check.names = FALSE)

# Count the number of days of data.
Covid19ConfirmedDays <- NCOL(Covid19ConfirmedWideData)

# Gather wide-data columns from column 5 through NCOL() into a long data frame.
Covid19ConfirmedRowData <- gather(Covid19ConfirmedWideData, Date, Cases, 5:Covid19ConfirmedDays, na.rm = FALSE, convert = TRUE)
tibble(Covid19ConfirmedRowData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedRowData$ProvinceState $CountryRegion $Lat $Long $Date $Cases
# <fct> <fct> <dbl> <dbl> <chr> <int>
# 1 Anhui Mainland China 31.8 117. 1/22/20 0
# 2 Beijing Mainland China 40.2 116. 1/22/20 0
# 3 Chongqing Mainland China 30.1 108. 1/22/20 0
# Transmute Date from chr to Date.
Covid19ConfirmedFormatedData <- transmute(Covid19ConfirmedRowData, CountryRegion, Date = as.Date(Date, format = "%m/%d/%Y"), Cases)
tibble(Covid19ConfirmedFormatedData)
# # A tibble: 2,204 x 1
# Covid19ConfirmedFormatedData$CountryRegion $Date $Cases
# <fct> <date> <int>
# 1 Mainland China 0020-01-22 0
# 2 Mainland China 0020-01-22 0
Covid19ConfirmedGroupedData <- Covid19ConfirmedFormatedData %>%
  filter(CountryRegion == 'Mainland China')
tibble(Covid19ConfirmedGroupedData)
# A tibble: 2,204 x 1
Covid19ConfirmedGroupedData[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 NA NA NA
It appears that I had a conflict in the libraries I was using.
I fell back to a previous version of the code and used only the following libraries:
library(dygraphs)
library(lubridate)
library(tidyverse)
The code seems to work again.
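Separately, the 0020-01-22 dates in the output point at the format string itself: the file holds two-digit years ("1/22/20"), so as.Date() needs %y rather than %Y. A sketch of that fix:
Covid19ConfirmedFormatedData <- transmute(
  Covid19ConfirmedRowData,
  CountryRegion,
  Date = as.Date(Date, format = "%m/%d/%y"),  # %y matches two-digit years like "1/22/20" -> 2020-01-22
  Cases
)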

Creating a new Data.Frame from variable values

I am currently working on a task that requires me to query a list of stocks from an sql db.
The problem is that it is a list where there are 1:n stocks traded per date. I want to calculate the share of each stock in the portfolio on a given day (see example) and pass it to a new data frame. In other words, date x occurs twice (once for stock A and once for stock B), and I want to pull it together so that date x occurs only once, with the new values.
'data.frame': 1010 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : Date, format: "2019-11-22" "2019-11-21" "2019-11-20" "2019-11-19" ...
$ Close: num 52 51 50.1 50.2 50.2 ...
$ Volume : num 5415 6196 3800 4784 6189 ...
$ Stock_ID : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"), Close=c(50,55,56,10,11,12,200),Volume=c(100,110,150,60,70,80,30),Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
*Note: the Date column cannot be converted to a Date variable in this example.
I would like to have a new dataframe that generates the Value traded per day, the weight of each stock, and the daily returns per day, while keeping the number of stocks variable.
I hope I translated the issue properly so that I can receive help.
Thank you!
I think the easiest way to do this would be with the dplyr package. You may need to read some documentation, but the group_by and mutate functions may be able to do what you want: they let you modify the current data frame by either adding a new column or changing the existing data.
Lets start with a reproducible dataset
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"),
Close=c(50,55,56,10,11,12,200),
Volume=c(100,110,150,60,70,80,30),
Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
library(magrittr)
library(dplyr)
dat2 <- RawInput %>%
  group_by(Date, Stock_ID) %>% # this example has unique Date-Stock_ID pairs, but I imagine you want to group by stock
  mutate(CloseMean = mean(Close),
         CloseSum = sum(Close),
         VolumeMean = mean(Volume),
         VolumeSum = sum(Volume)) # whatever computation you need to do with
                                  # multiple stock values for a given date goes here

dat2 %>%
  select(Stock_ID, Date, CloseMean, CloseSum, VolumeMean, VolumeSum) %>%
  distinct() # dat2 is still the same size as RawInput, so use distinct() to reduce it to unique rows
# A tibble: 7 x 6
# Groups: Date, Stock_ID [7]
Stock_ID Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2017-22-11 50 50 100 100
2 1 2017-22-12 55 55 110 110
3 1 2017-22-13 56 56 150 150
4 2 2017-22-11 10 10 60 60
5 2 2017-22-12 11 11 70 70
6 2 2017-22-13 12 12 80 80
7 3 2017-22-11 200 200 30 30
The data set you provided actually has only unique Stock_ID and Date combinations, so nothing was actually aggregated. However, if you remove Stock_ID from the grouping, you can see how this works:
dat2 <- RawInput %>%
  group_by(Date) %>%
  mutate(CloseMean = mean(Close),
         CloseSum = sum(Close),
         VolumeMean = mean(Volume),
         VolumeSum = sum(Volume))
dat2 %>% select(Date, CloseMean, CloseSum, VolumeMean, VolumeSum) %>% distinct()
# A tibble: 3 x 5
# Groups: Date [3]
Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <dbl> <dbl> <dbl> <dbl>
1 2017-22-11 86.7 260 63.3 190
2 2017-22-12 33 66 90 180
3 2017-22-13 34 68 115 230
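As a side note, the mutate()-plus-distinct() pattern can be collapsed into a single summarise(), which returns one row per group directly (a sketch under the same grouping):
RawInput %>%
  group_by(Date) %>%
  summarise(CloseMean = mean(Close),
            CloseSum = sum(Close),
            VolumeMean = mean(Volume),
            VolumeSum = sum(Volume))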
After reading your first reply: you will have to be specific about how you are trying to calculate the weight, and define your end result.
I'm going to assume weight is just each stock's percentage of the total close value, and that the end result is, for each date, the weight per stock; in other words, a matrix of dates and stock IDs.
library(tidyr)
RawInput %>%
  group_by(Date) %>%
  mutate(weight = Close / sum(Close)) %>%
  select(Date, weight, Stock_ID) %>%
  spread(key = "Stock_ID", value = "weight", fill = 0)
# A tibble: 3 x 4
# Groups: Date [3]
Date `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 2017-22-11 0.192 0.0385 0.769
2 2017-22-12 0.833 0.167 0
3 2017-22-13 0.824 0.176 0
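You also asked for daily returns. A hedged sketch, assuming rows are ordered by date within each Stock_ID (the example dates aren't parseable as real dates, so this is illustrative only, and the return column name is mine):
library(dplyr)
RawInput %>%
  group_by(Stock_ID) %>%
  mutate(return = Close / lag(Close) - 1) %>%  # simple return relative to the previous row
  ungroup()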

Skipping rows until row with a certain value

I need to to read a .txt file from an URL, but would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:
"
... (number of rows)
... (number of rows)
... (number of rows)
DATE VALUE
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
"
I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards (including "DATE // VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!
Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(tidyverse)
library(stringr)

df <- tibble(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
  filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
  mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
         value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
  select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.
Combining fread with unix tools:
> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
DATE VALUE
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
---
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
Using:
library(data.table) # for fread()

file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
                'https://fred.stlouisfed.org/data/DGS10.txt',
                'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1

# option 1
l <- Map(function(x, y) read.table(text = x, skip = y), x = text.list, y = skip.rows)

# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))

will get you a list of data.frame's (option 1) or data.table's (option 2).
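As an aside, fread() can find the header line itself: when skip is given a string, data.table skips ahead to the first line containing that string (assuming the header line really begins with DATE):
library(data.table)
dt <- fread("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt", skip = "DATE")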

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
library(tibble)

id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22","2013-09-16","2012-06-21","2013-10-18","2013-04-21","2012-02-16","2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31","2013-05-17","2012-03-17","2012-04-20")
data1 <- tibble(id, test_date, result_date)
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1 %>%
  group_by(id) %>%
  arrange(id, test_date) %>%
  filter(n() >= 2)
# to keep only students who have taken the exam more than once,
# then merge it back into the original data set using a join
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried mutate() to find cases where the gap is positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this mixes up students with different ids. I tried split(), but mutate() won't work on a list of data frames, so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- tibble(id = c(1, 1, 3, 4))
data2$test_date_result <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)

# first melt so that we can sequence by date
data1m <- data1 %>%
  melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")

# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
  arrange(id, event_date) %>%
  group_by(id) %>%
  mutate(multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
  filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date"))
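After that join, the non-flagged rows carry multi_test = NA; if you want the 0/1 re_test column from the question, a small follow-on sketch:
data1 %>%
  left_join(data1mc %>% select(id, event_date, multi_test),
            by = c("id" = "id", "test_date" = "event_date")) %>%
  mutate(re_test = as.integer(coalesce(multi_test, FALSE))) %>%  # NA -> 0, TRUE -> 1
  select(-multi_test)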
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If a student re-took the test multiple times, it compares the last test to the first, which is a flaw, but I'm unsure whether students can re-test multiple times.
student <- data1 %>%
  group_by(id) %>%
  summarise(retest = test_date[length(test_date)] < result_date[1])
Some re-test values were NA. These were individuals that only took the test once. I set these to FALSE here, but you can retain the NA, as they do contain information.
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison: one line of code.
data1 <- data.frame(
  id = c(1,1,3,4,5,6,7,7,8,9,9),
  test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22","2013-09-16","2012-06-21","2013-10-18","2013-04-21","2012-02-16","2012-03-15"),
  result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31","2013-05-17","2012-03-17","2012-04-20")
)
data1$re_test <- unlist(lapply(split(data1, data1$id), function(x)
  ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
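For completeness, a grouped dplyr sketch of the same shift comparison, which never compares dates across two different students:
library(dplyr)
data1 %>%
  group_by(id) %>%
  arrange(test_date, .by_group = TRUE) %>%
  mutate(re_test = coalesce(
    as.integer(as.Date(test_date) < lag(as.Date(result_date))),
    0L  # the first attempt per student gets 0
  )) %>%
  ungroup()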
