I'm new to both bootstrapping and the quanteda package for text analysis. I have a large corpus of texts organized by document group type that I'd like to obtain readability scores for. I can easily obtain readability scores for each group with the following function:
textstat_readability(texts(mwe, groups = "document"), "Flesch")
I then want to bootstrap the results to obtain a 95% confidence interval by wrapping a function:
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
But "b <-" fails with the error:
"Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts on matrix"
I've wasted two days trying to debug with no luck. What am I doing wrong? Much appreciated for any advice...
MWE:
mwe <- structure(list(
  document = structure(c(1L, 1L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
  text = c("Text 1. Text 1.1", "Text 2."),
  section = structure(2:1, .Label = c("aa", "bb", "cc", "dd", "ee", "ff", "hh", "ii", "jj", "kk"), class = "factor"),
  year = c(1919L, 1944L),
  preamble = structure(8:9, .Label = c("1", "2", "3", " 4 ", "5", "6 ", "7 ", "8 ", "9 ", "10 "), class = "factor"),
  articles = c(43L, 47L),
  pages = c(5.218, 7.666),
  wordcount = c(3503L, 4929L),
  mean_articles = c(45, 45)
), row.names = 1:2, class = "data.frame")
mwe <- corpus(mwe)
b_readability <- function(x, i, groups = NULL, measure = "Flesch")
textstat_readability(texts(x[i], groups = groups), measure)
n <- 10
groups <- factor(mwe[["document"]]$document)
b <- boot(texts(mwe), b_readability, strata = groups, R = n, groups = groups)
colnames(b$t) <- names(b$t0)
apply(b$t, 2, quantile, c(.025, .5, .975))
A good question that involves knowing a lot about the boot package as well as how to index and group corpus texts in quanteda. Here's the best (currently) and safest way to do it. "Safest" here means future-proof, since there are some things that currently work in the internal addressing of a quanteda corpus that will not work in upcoming v2. (We warn about this very clearly in ?corpus but no one seems to heed that warning...) Note also that while this should always work, we are also planning, in future versions, more direct methods for bootstrapping textual statistics that would not require the user to do this sort of deep dive into the boot package.
Let's try a reproducible example from built-in objects first. To "bootstrap" a text, we will construct a new, hypothetical text using sentence-level resampling (with replacement) from the original, and use texts(x, groups = "<groupvar>") to piece this together into a hypothetical kind of text. (This is how I have done it in the two references at the end of this post.) To make this happen, we can exploit the property of texts() that it works to get texts from a corpus object but also works on character objects (but with fast grouping).
To get the sentences, we subset the corpus to simplify our example here, then reshape it into sentences.
First, however, I recorded the original document's name in a new document variable, so that we can use this for grouping later. In this example, we could also have used Year, but doing it this way will work for any example. (There are some internal records about the original docname that we might have used, but doing it this way will be future-proof.)
library("quanteda")
## Package version: 1.4.1
library("boot")
docvars(data_corpus_inaugural, "docnameorig") <- docnames(data_corpus_inaugural)
sent_corpus <- data_corpus_inaugural %>%
corpus_subset(Year > 2000) %>%
corpus_reshape(to = "sentences")
Then we have to define the function to be bootstrapped. We will use the "index" method and call the index i (as you did above). Here, x will be a character and not a corpus, even though we will call texts() on it, again using the grouping variable to reassemble it. The function will also need to return a vector and not a data.frame, which is the normal form of a textstat_*() return, so we extract just the measure column and return it as a vector.
b_readability <- function(x, i, groups = NULL, measure = "Flesch") {
textstat_readability(texts(x[i], groups = groups[i]), measure)[[measure]]
}
We will call our grouping variable simgroups just to distinguish the value from the argument name, and use this for both the groups argument and for strata in the call to boot(). The strata is an argument to boot(), while groups is passed through to our function b_readability(). We need to factorize this grouping variable since the function seems to want that. Then we call boot() and get our answer.
simgroups <- factor(docvars(sent_corpus, "docnameorig"))
boot(texts(sent_corpus), b_readability, R = 10,
strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(sent_corpus), statistic = b_readability, R = 10,
## strata = simgroups, groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 60.22723 -0.01454477 2.457416
## t2* 53.23332 1.24942328 2.564719
## t3* 60.56705 1.07426297 1.996705
## t4* 53.55532 -0.28971190 1.943986
## t5* 58.63471 0.52289051 2.502101
These correspond to the five (original) documents, here distinguished by year, although unfortunately those names have been replaced by t1, t2, ... in the return object from boot().
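If you also want the 95% percentile intervals from your original post, you can compute them from the b$t matrix; a sketch, assuming the boot object is saved and that texts(x, groups = ...) returns the groups in factor-level order:
b <- boot(texts(sent_corpus), b_readability, R = 10,
          strata = simgroups, groups = simgroups)
colnames(b$t) <- levels(simgroups)          # restore the (original) document names
apply(b$t, 2, quantile, c(.025, .5, .975))  # 95% percentile intervals plus the median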
To return to your original example, let's say these form two documents from one stratum (since these are too short to subdivide further). Then:
simgroups <- rep(1, ndoc(mwe))
boot(texts(mwe), b_readability, R = 10, strata = simgroups, groups = simgroups)
##
## STRATIFIED BOOTSTRAP
##
##
## Call:
## boot(data = texts(mwe), statistic = b_readability, R = 10, strata = simgroups,
## groups = simgroups)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 119.19 0.6428333 0.4902916
I am trying to do a propensity matched analysis but am having a lot of trouble. I have a large data set with an exposure coded as 0 (no exposure) and 1 (exposure) and am trying to matched based on a couple of variables. Basically I was trying to follow a tutorial on propensity matching via Coursera but am getting a really weird output. My initial dataset has 2,202 distinct observations. However, once I do the matching, my dataset has 3,074 distinct observations, which is obviously not supposed to happen. It creates a matched sample, but I'm not sure where the additional observations come from...
Does anyone know what I'm doing wrong? I have been trying to troubleshoot for the past week but keep coming up empty handed.
Here is what I'm doing:
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$privateinsurance)
language <- as.numeric(cohort$primarylanguage)
bloodpressure <- as.numeric(cohort$bloodpressure)
bmi <- cohort$bmiatdelivery
exp <- as.numeric(cohort$prechange)
out <- as.numeric(cohort$tdapvaccinedate_yn)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out)
propensity <- data.frame(prop_score)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),] # THIS IS WHERE THE PROBLEM OCCURS SHOWING THAT I HAVE 3074 OBSERVATIONS
It is not easy to tackle your question, as your provided code snippet has some issues before the error you report, so I was not able to reproduce it. Still, I have created below a dummy dataset based on random numbers and followed your steps. I have put down comments where potential errors could arise with your current code. Maybe this already helps!
#Indicate which Packages are needed
library(tableone)
library(Matching)
##Create Reproducible Example Dataset
# I create dummy data with random variables
race = as.numeric(rep(c(1,2),5))
insurance = as.numeric(sample(1:100,10))
language = as.numeric(sample(1:100,10))
bloodpressure = as.numeric(sample(1:100,10))
bmi = sample(1:100,10)
# it would be safer to rename exp since exp() is also a base R function
exp = rep(c(1,0),5)
out = sample(1:100,10)
#You did not include an age variable here but refer to it later
# did you maybe forget to include it?
age= sample(1:50,10)
#merge new dataset
propensity <- cbind(race, insurance, language, bloodpressure, bmi, exp, out,age)
# In your example a new dataset "prop_score" appeared
# I can only guess you meant the just-created dataset propensity
propensity <- data.frame(propensity)
#Input Dimension
dim(propensity)
#covariates to use in matching
xvars <- c("race", "insurance", "language", "bloodpressure", "age", "bmi")
table1 <- CreateTableOne(vars=xvars, strata="exp",data=propensity, test=FALSE)
print(table1, smd=TRUE)
#do matching
greedymatch <- Match(Tr=propensity$exp, M=1, X=propensity[,xvars])
matched <- propensity[unlist(greedymatch[c("index.treated", "index.control")]),]
#Output Dimensions
dim(matched) #The Dimensions are fine
Thanks for responding! I ran your code and it still didn't work. What do you mean by "it would be safer to rename exp since exp() is also a base R function"? I have attached a reprex for my dataset. Does this help?
cohort <- data.frame(
  outcome = c(0, 0, 0, 1, 0),
  exposure = c(0, 0, 0, 1, 1),
  insurance = c(1, 1, 1, 1, 1),
  language = c(3, 1, 1, 1, 1),
  age = c(32, 36, 22, 26, 38),
  bmi = c(23.8407, 25.354099, 29.709999, 26.9098, 36.290401),
  race_eth = as.factor(c("5", "1", "2", "1", "2")),
  nullip = as.factor(c("1", "0", "1", "1", "0"))
)
library(tableone)
library(Matching)
#recode variables to use in matching
race <- as.numeric(cohort$race_eth)
insurance <- as.numeric(cohort$insurance)
language <- as.numeric(cohort$language)
nullip <- as.numeric(cohort$nullip)
age <- cohort$age
bmi <- cohort$bmi
exp <- as.numeric(cohort$exposure)
out <- as.numeric(cohort$outcome)
#create new dataset
prop_score <- cbind(race, insurance, language, nullip, pnc, age, bmi, exp, out)
prop_score <- data.frame(prop_score)
xvars <- c("race", "insurance", "language", "nullip", "pnc", "age", "bmi")
#table 1
table1 <- CreateTableOne(vars=xvars, strata="exp", data=prop_score, test=FALSE)
print(table1, smd=TRUE)
#matching
greedymatch <- Match(Tr=prop_score$exp, M=1, X=prop_score[xvars])
matched <- prop_score[unlist(greedymatch[c("index.treated", "index.control")]),]
Just add replace = FALSE to your Match() call:
greedymatch <- Match(Tr = propensity$exp, M = 1, X = propensity[xvars], replace = FALSE)
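For context on why this fixes the row count: Match() matches with replacement by default (replace = TRUE), so one control row can be matched to several treated rows and then appears repeatedly when you stack index.treated and index.control; replace = FALSE uses each control at most once. A quick check on the match object from above:
# how many control rows were reused under the default replace = TRUE?
sum(duplicated(greedymatch$index.control))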
Using time series data for a single person, I can calculate a first-order probability transition matrix (using library(markovchain)) and calculate its density (using library(statnet)).
This code works:
ds = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8),ncol=3,byrow=TRUE) #create person period data for a single person
colnames(ds) = c("Id", "Time", "Evt")
ds = as.data.frame(ds)
mc = markovchainFit(ds$Evt, name = "mc")$estimate #calculate markovchain
am = mc@transitionMatrix #extract the transitionMatrix slot from the S4 object
em = network(am, matrix.type="adjacency", directed=TRUE, Weighted = TRUE, loops = FALSE) #make network object
gden(em)#calculate density of network, etc
But I am having trouble making it work for data with multiple IDs using tapply. This code doesn't work after line 4, but it is how a solution looks in my head:
ds2 = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8,2,1,3,2,2,5,2,3,7,2,4,9),ncol=3,byrow=TRUE) #create person period data for two people
colnames(ds2) = c("Id", "Time", "Evt")
ds2 = as.data.frame(ds2)
mc2 = tapply(ds2$Evt, ds2$Id, markovchainFit) #it works to here and I am STUCK for days *see below
am2 = mc@transitionMatrix #can't figure out how to integrate these steps from above
em2 = network(am, matrix.type="adjacency", directed=TRUE, Weighted = TRUE, loops = FALSE)
gden(em2)
*For each person in the list I can't figure out:
how to name the markov chain S4 object
how to remove the transition matrix slot from the S4 object
how to pass additional functions after markovchainFit
Does anybody have any suggestions about how to loop my analysis for a single person through an ID vector? It would be very much appreciated.
How about something like below. In the code below, I make a function that does all of the interim work and returns the results of gden() on the appropriate object.
ds2 = matrix(c(1,1,2,1,2,4,1,3,6,1,4,8,2,1,3,2,2,5,2,3,7,2,4,9),ncol=3,byrow=TRUE) #create person period data for two people
colnames(ds2) = c("Id", "Time", "Evt")
ds2 = as.data.frame(ds2)
library(markovchain)
library(statnet)  # provides network() and gden()

mcfun <- function(x){
  mc <- markovchainFit(x, name = "mc")$estimate
  am <- mc@transitionMatrix  # extract the transitionMatrix slot from the S4 object
  em <- network(am, matrix.type = "adjacency", directed = TRUE, Weighted = TRUE, loops = FALSE) #make network object
  gden(em) #calculate density of network, etc
}
tapply(ds2$Evt, ds2$Id, mcfun)
# 1 2
# 0.25 0.25
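If you also need the named chain and the transition matrix per Id (the first two bullet points above), you can return a list per person instead of a single number; a sketch along the same lines, with hypothetical names:
mcfun_full <- function(x, id){
  mc <- markovchainFit(x, name = paste0("mc_", id))$estimate  # a named S4 markovchain
  am <- mc@transitionMatrix                                   # extract the slot
  em <- network(am, matrix.type = "adjacency", directed = TRUE, loops = FALSE)
  list(chain = mc, transition_matrix = am, density = gden(em))
}
ids <- sort(unique(ds2$Id))  # sorted to match the order used by split()
res <- Map(mcfun_full, split(ds2$Evt, ds2$Id), ids)
res[["1"]]$density            # density for person 1
res[["2"]]$transition_matrix  # transition matrix for person 2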
I am trying to compare two percentages/proportions for statistical significance in R, using a Chi-Square test. I am familiar with a SAS method for Chi Square in which I supply a dataset column for a numerator, another column for denominator, and a categorical variable to distinguish distributions (A/B).
However I am getting unexpected values in R using some examples sets. When I test two similar populations, with low sample sizes, I am getting p-values of (approximately) zero, where I would expect the p-values to be very high (~ 1).
My test set is below, where I went with sugar content in a batch of water: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar, where this isn't a pass-fail type test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is, at what sample size does the p-value become low enough to indicate significance?
# CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
library(dplyr)  # tibble(), %>%, group_by(), summarize(), mutate() used below
set.seed(108)
group_A = tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
group_B = tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
batches <- rbind(group_A, group_B)
I then do a summarize to calculate the overall sugar percentage tendency between groups:
# SUMMARY TOTALS
totals <- batches %>%
group_by(group) %>%
summarize(batch_count = n(),
batch_lbs_sum = sum(batch_lbs),
sugar_lbs_sum = sum(sugar_lbs),
sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
glimpse()
I then supply the sugar percentage between groups to a prop.test, expecting a high p-value
# ADD P-VALUE & CONFIDENCE INTERVAL
stats <- totals %>%
rowwise() %>%
summarize(p_val = prop.test(x = sugar_percent_overall, n = batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
mutate(p_val = round(p_val, digits = 3)) %>%
mutate(conf_level = 1 - p_val) %>%
select(p_val, conf_level) %>%
glimpse()
# FINAL SUMMARY TABLE
cbind(totals, stats) %>%
glimpse()
Unfortunately the final table gives me a p-value of 0, suggesting the two nearly-identical sets are independent/different. Shouldn't I get a p-value of ~1?
Observations: 2
Variables: 7
$ group <chr> "A", "B"
$ batch_count <int> 50, 50
$ batch_lbs_sum <dbl> 1475.579, 1475.547
$ sugar_lbs_sum <dbl> 495.4983, 484.6928
$ sugar_percent_overall <dbl> 0.3357992, 0.3284833
$ p_val <dbl> 0, 0
$ conf_level <dbl> 1, 1
From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was expecting a low p-value, since I am using the recommended sample size, but instead get a p-value of ~1.
# COMPARE PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEST
power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?
Am I using prop.test wrong or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator 'batch_lbs' as I do in SAS -- is this possible in R?
(Apologies for any formatting issues as I'm new to posting)
---------------------------------
EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE
I think my choice of using normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask, which is how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, where prop.test appears to be the correct test to use.
My actual problem has an extremely non-normal distribution, but is not easily replicated via code.
STANFORD EXAMPLE (pages 37-50)
- https://web.stanford.edu/class/psych10/schedule/P10_W7L1
df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
mutate(city_total = washed + not_washed,
city_percent = washed / city_total) %>%
select(-washed, -not_washed) %>%
glimpse()
# STANFORD CALCULATION (p = 7.712265e-35)
pchisq(161.74, df = 3, lower.tail = FALSE)
# PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value
The documentation for prop.test says:
Usage:
prop.test(x, n, p = NULL,
          alternative = c("two.sided", "less", "greater"),
          conf.level = 0.95, correct = TRUE)
Arguments:
x: a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.
n: a vector of counts of trials; ignored if x is a matrix or a table.
So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
But from a statistical perspective this is completely the wrong way of doing things. You are directly inducing spurious correlation in a test of the difference between two quantities by dividing by their sum arbitrarily. If the samples (sugar_lbs_sum) are independent, but you divide by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal (1993), "Spurious correlation and the fallacy of the ratio standard revisited", covers this.
The data you generated are independent normal, so don't sum them, rather test for a difference with the t-test.
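A minimal sketch of that suggestion, using the batches data simulated in the question:
# two-sample t-test on the raw (independent, normal) quantities
t.test(sugar_lbs ~ group, data = batches)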
The Stanford link I added to my original post answered my question. I modified the Stanford example to simply rename the variables from city to group, and washed counts to sugar_lbs. I also doubled one batch (akin to comparing a small versus a large city). I now get the expected high p-value (0.65), indicating no statistically significant difference between the proportions.
When I add more groups (for more degrees of freedom) and continue to vary batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percent of any one group, the p-value immediately drops to zero indicating one of the groups is different, as expected.
Finally, when doing the prop.test within a 'dplyr' pipe, I found I should not have used the rowwise() step, which causes my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't yet know which group is different until I compare only 2 groups at a time iteratively (see the pairwise sketch after the code below).
#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE DOUBLE BATCHED
#--------------------------------------------------------
df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
mutate(batch_lbs = sugar_lbs + water_lbs,
sugar_percent = sugar_lbs / batch_lbs) %>%
glimpse()
sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)
df <- df %>%
mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
glimpse()
q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)
# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = F)
# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
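On the point about identifying which group differs: base R's pairwise.prop.test() runs every two-group comparison with a multiplicity correction (Holm by default), so you do not have to loop manually. A sketch on the hand-washing counts from the Stanford example:
washed <- c(1175, 1329, 1169, 1521)
total <- washed + c(413, 180, 334, 215)
pairwise.prop.test(x = washed, n = total)  # Holm-adjusted pairwise p-values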
I'm currently working on a user-item collaborative filtering model.
I have a set of users and places they have shopped at, and have attempted to build a recommender model using R.
There are two aims of this project:
a) Recommend new shops to ALL customers
b) Give a stat to show how accurate the model is.
I have 2 years worth of data.
To answer b), I have subset my data to customers that have purchased in both the first 1.5 years AND in the following 6 months.
I have created a model on the data of transactions in the first 1.5 years, then compared the model predictions to the ACTUAL 6 months of data.
By performing the above, I determined that I should use UBCF with nn=500, and I was able to achieve accuracy of approx 80%.
However, I am now unsure of how to predict for the ENTIRE user base.
I was thinking of applying the ENTIRE dataset to the model I have just created, but that would be biased/inaccurate, as not all shops are represented in this small model I have created.
I have read articles and tutorials where people have done different things.
I have seen one where they input the entire dataset, and apply the [which] subsetting, so that it creates the model in 80% and tests using the remaining 20%.
My question is, if I was to use this process, how would I then get recommendations for ALL users, when the model only gives predictions for 20% of the users?
Is it best to create the model on the entire dataset?
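For reference on the mechanics (a sketch with hypothetical objects, not the data below): a recommenderlab Recommender trained on one binaryRatingMatrix can score any other matrix with the same item columns via predict(), so a model tuned on a split can still produce recommendations for every user:
library(recommenderlab)
# train_matrix: binaryRatingMatrix of the training users (hypothetical)
# all_users_matrix: binaryRatingMatrix of ALL users, same item columns (hypothetical)
rec <- Recommender(data = train_matrix, method = "UBCF",
                   parameter = list(method = "Jaccard", nn = 500))
top10 <- predict(rec, newdata = all_users_matrix, type = "topNList", n = 10)
as(top10, "list")[1:3]  # top-10 recommendations for the first three users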
SUBSET THE DATA
Create period flags
#If in 1.5 years, then 1. If in following 6 months, then 0.
library(dplyr)  # select(), filter(), mutate(), %>% are used throughout
FV$Flag1<-ifelse(FV$Date<="2018-10-01",1,0)
FV$Flag2<-ifelse(FV$Date>"2018-10-01",1,0)
IDENTIFICATION OF CUSTOMERS TO USE IN TRAINING MODEL
#Create SCV
#FV
FV_SCV<-select(FV, Customer, Flag1, Flag2) %>%
group_by(Customer) %>%
summarise_all(funs(sum)) #Sum all variables.
#Determine which customers to use based on if they have purchased both in the first and second years
FV_SCV$Use<-ifelse(FV_SCV$Flag1>0 & FV_SCV$Flag2>0, 1,0)
EXTRACT CUSTOMER LIST FOR TRAINING MODEL
#Training. Where customers have purchased both in the first & second year, but we only run the model on the first.
FV_Train<-FV_SCV %>%
filter(Use==1 )
SUBSET TO CUSTOMERS THAT HAVE PURCHASED IN 1 YEAR AND OF THE CUSTOMERS THAT HAVE PURCHASED IN BOTH YEARS, TO ONLY THOSE THAT HAVE SHOPPED IN THE FIRST YEAR
#FV_SCV$flag_sum<- FV_SCV$Flag1+FV_SCV$Flag2
SCV FOR CUSTOMERS USED IN THE TRAINING MODEL
#Join on the USE flag
FV_Train_Transactions<- FV %>% #Join on the page info
left_join(select(FV_Train, Customer, Use), by=c("Customer"="Customer"))
#Replace NA with 0
FV_Train_Transactions[is.na(FV_Train_Transactions)] <- 0
##Subset to only the users' transactions to be used in training
FV_Train_Transactions<-FV_Train_Transactions %>%
filter(Use==1)
##Create date flag for train and test to use to create the model on the train and comparing the results with the output of the test df
FV_Train_Transactions_Compare<-FV_Train_Transactions %>%
filter(Flag2>0)
##Create SCV for TRAIN
FV_TRAIN_SCV<-FV_Train_Transactions %>%
filter(Flag1>0) %>%
group_by(Customer, Brand)%>%
select(Customer, Brand)
FV_TRAIN_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TRAIN_SCV<-distinct(FV_TRAIN_SCV)
##Create scv for TEST
FV_TEST_SCV<-FV_Train_Transactions_Compare %>%
filter(Flag2>0) %>%
select(Customer, Brand) %>%
group_by(Customer, Brand)
FV_TEST_SCV$Flag<-1
#Not sure why this hasn't selected unique rows, so remove duplicates.
FV_TEST_SCV<-distinct(FV_TEST_SCV)
Transpose to columns
install.packages("reshape")
install.packages("reshape2")
install.packages("tidytext")
library(reshape)
library(reshape2)
library(tidytext)
#Melt data for transposition
#Train
fv_train_md<-melt(FV_TRAIN_SCV, id=(c("Customer", "Brand")))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value.var="value", fun.aggregate = mean)
#Test
fv_test_md<-melt(FV_TEST_SCV, id=(c("Customer" , "Brand")))
#Do the same for the overall transactions table
#Make FV_SCV a binary rating matrix
fv_overall<- FV[,c(1,3)] #The table name is case sensitive. Select only the customer and brand columns
fv_overall<- distinct(fv_overall) #Remove dups
fv_overall$Flag<-1
fv_overall_md<-melt(fv_overall, id=(c("Customer", "Brand")))
fv_overall_2<- dcast(fv_overall_md, Customer~Brand, value="Flag", fun.aggregate = mean)
#fv_test_123<-dcast(FV_TEST_SCV, Customer~Brand, value.var="Brand")
#colnames(fv_test_123)
#fv_test_12345<-which(fv_test_123==1, arr.ind=TRUE)
#fv_test_123<-colnames(fv_test_123)[fv_test_12345]
#fv_test_123
#fv_test_123_df<-as.data.frame((fv_test_123))
FV_TRAIN_SCV2<-dcast(fv_train_md, Customer~Brand, value.var="value", fun.aggregate = mean)
FV_TEST_SCV2<-dcast(fv_test_md, Customer~Brand, value.var="value", fun.aggregate = mean)
#Replace NaN with 0
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
FV_TRAIN_SCV2[is.nan(FV_TRAIN_SCV2)] <- 0
FV_TEST_SCV2[is.nan(FV_TEST_SCV2)] <- 0
fv_overall_2[is.nan(fv_overall_2)] <- 0
#
#
install.packages("recommenderlab")
library(recommenderlab)
row.names(FV_TRAIN_SCV2)<-FV_TRAIN_SCV2$Customer
FV_TRAIN_SCV2$Hawkers<-0
FV_TRAIN_SCV2$Pollini<-0
FV_TRAIN_SCV2$"Twin Set"<-0
FV_TRAIN_SCV2_matrix<-as.matrix(FV_TRAIN_SCV2[,2:ncol(FV_TRAIN_SCV2)])
FV_TRAIN_SCV2_binarymatrix<-as(FV_TRAIN_SCV2_matrix,"binaryRatingMatrix")
similarity_FV_train_trans_items<-similarity(FV_TRAIN_SCV2_binarymatrix, method="jaccard", which="items")
train_col<- data.frame(colnames(FV_TRAIN_SCV2))
#------------------------------------------------------------------------------------
row.names(fv_overall_2)<-fv_overall_2$Customer
#Convert NaN to 0
fv_overall_2[is.nan(fv_overall_2)]<-0
#fv_overall_matrix<- as.matrix(fv_overall_2[,2:(ncol(fv_overall_2)-3)])#Convert to matrix
fv_overall_matrix<- as.matrix(fv_overall_2[,2:ncol(fv_overall_2)])#Convert to matrix
#fv_overall_matrix<- as.matrix(fv_overall_matrix[,1:(ncol(fv_overall_2)-3)])
fv_matrix_binary<- as(fv_overall_matrix, "binaryRatingMatrix") #Make a binary ratings matrix
FV_overall_similarity<-similarity(fv_matrix_binary, method="jaccard", which="items")
overall_col<- data.frame(colnames(fv_overall_2))
#---------------------------------------------------------------------------------------------------------
#
#
#Now, define multiple recommender algorithms to compare them all.
algorithms <- list(`user-based CF 50` = list(name = "UBCF",param = list(method = "Jaccard", nn = 50)),
`user-based CF 100` = list(name = "UBCF",param = list(method = "Jaccard", nn = 100)),
`user-based CF 200` = list(name = "UBCF",param = list(method = "Jaccard", nn = 200)),
`user-based CF 500` = list(name = "UBCF",param = list(method = "Jaccard", nn = 500)),
#
`item-based CF 3` = list(name = "IBCF",param = list(method = "Jaccard", k = 3)),
`item-based CF 5` = list(name = "IBCF",param = list(method = "Jaccard", k = 5)),
`item-based CF 10` = list(name = "IBCF",param = list(method = "Jaccard", k = 10)),
`item-based CF 50` = list(name = "IBCF",param = list(method = "Jaccard", k = 50))
)
scheme <- evaluationScheme(FV_TRAIN_SCV2_binarymatrix, method = "cross", k = 4,given = 1)
scheme <- evaluationScheme(fv_matrix_binary, method = "cross", k = 4,given = 1)
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8))
results <- evaluate(scheme, algorithms, n = c(1,2,3,4,5,6,7,8,50,100,200,500))#Evaluating with n=c(1,2,3.....) being the number of recommendations
#names(results) #Check that all results have run.
#results
#Plot results to help determine which of the above models is best for further analysis
#plot(results, annotate = c(1, 3), legend = "right") #ROC Curve
#plot(results, "prec/rec", annotate = 3) #Precision/Recall Plot
The first of these plots (with FPR on the x-axis) is the ROC curve; the better-performing model is the curve with the largest area under it. Of the tested parameters, that is UBCF with nn=500 (or nn=50).
Based on the precision/recall plot, nn should be set to 500.
MODEL USING UBCF nn = 500
recc_model <- Recommender(data = FV_TRAIN_SCV2_binarymatrix, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
#Running on ENTIRE DATA
recc_model <- Recommender(data = fv_matrix_binary, method = "UBCF",
parameter = list(method = "Jaccard",
nn=500))
model_details <- getModel(recc_model)
model_details
install.packages("plyr")
library(plyr)
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = FV_TRAIN_SCV2_binarymatrix, n = 198, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= FV_TRAIN_SCV2_binarymatrix,type="topNList", n=198)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
#------------------------------------------------------------
#On the overall model:
#Get the scores
recc_predicted <- predict(object = recc_model, newdata = fv_matrix_binary, n = 80, type="ratings")
ibcf_scores<-as(recc_predicted, "list")
ibcf_list_scores<-ldply(ibcf_scores, rbind)
ibcf_list_scores[is.na(ibcf_list_scores)]<-0
#Get the list of brands
recc_predicted_list2<-predict(object=recc_model,newdata= fv_matrix_binary,type="topNList", n=80)
recc_predicted_list2_a<- as(recc_predicted_list2, "list")
recc_predicted_list2_b<-ldply(recc_predicted_list2_a, rbind)
#recc_predicted_list2_b[is.na(recc_predicted_list2_b)]<-0
Reshape the df so all ratings are in one column. Use this to then create a unique table and do a count-if, as this always crashes Excel.
install.packages("data.table")
library(data.table)
df.m1<- melt(ibcf_list_scores, id.vars=c(".id"),
value.name="Rating")
df.m1.unique<- data.frame(df.m1)
df.m1.unique$variable<-NULL
df.m1.unique$.id<-NULL
#df.m1.unique<-distinct(df.m1.unique)
#df.m1.unique<- df.m1.unique[order(df.m1.unique$Rating),] #This comma means it is only ordering based on this one var.
#Using ave
df.m1.unique$count<- ave(df.m1.unique$Rating, df.m1.unique[,c("Rating")], FUN=length)
rownames(df.m1.unique) <- c() #Remove rownames
df.m1.unique<-distinct(df.m1.unique)
df.m1.unique<- df.m1.unique[order(-df.m1.unique$Rating),]#Sort by descending rating
#Plot this
df.m1.unique.plot<- data.frame(df.m1.unique[2:(nrow(df.m1.unique)-1),])
#plot(x=df.m1.unique.plot$Rating, y=df.m1.unique.plot$count)
#Get the cumulative distribution
df.m1.unique.plot2<- df.m1.unique.plot %>%
mutate(Percentage=cumsum(100*(count/sum(count))),
cumsum=cumsum(count))
Remove ratings
#a) Remove values that are less than specific rating
#Using logical indexing with replacement
ibcf_list_scores_removal<- ibcf_list_scores
#Replace low values with 0
ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)][ibcf_list_scores_removal[,2:ncol(ibcf_list_scores_removal)] < 0.0217] <- 0
#To flag whether customer is recommended the brand, replace all values >0 with 1. Keep 0 as is.
ibcf_list_scores_removal_b<- ibcf_list_scores_removal #Call a new df
ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)][ibcf_list_scores_removal_b[,2:ncol(ibcf_list_scores_removal_b)] > 0] <- 1#Create the flag
So basically I'd like to know how to create the model on my ENTIRE dataset?
And how to extract all ratings?
Thank you
I have a data frame with following structure, dput(scoreDF):
scoreDF <- structure(list(ID = c(1, 2), Status = structure(c(2L, 1L),
.Label = c("Fail", "Pass"), class = "factor"), Subject_1_Score = c(100, 25),
Subject_2_Score = c(50, 76)), .Names = c("ID", "Status", "Subject_1_Score",
"Subject_2_Score"), row.names = c(NA, -2L), class = "data.frame")
Now, I need to come up with the percentage of students who passed and failed, the mean score of the students who passed and failed, and the standard error for the same.
For standard error, I have defined a function as follows:
stdErr <- function(x) {sd(x)/ sqrt(length(x))}
where I expect x to be a vector whose standard error needs to be calculated.
I have seen the doc for ddply, but I am not able to figure out how to calculate the percentage, i.e. (number of passes) / (total count), or the standard error for the data frame above.
You can use tapply to calculate group statistics. If your data frame is called students then to calculate mean by pass/fail you would specify:
tapply(students$Subject_1_Score, students$Status, FUN=mean)
For the standard error substitute your stdErr function for mean.
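For example, with the stdErr() function defined in the question:
tapply(students$Subject_1_Score, students$Status, FUN = stdErr)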
tapply only operates on a single vector, so if you want to calculate something across multiple score columns, apply it to each column in turn:
sapply(students[, 3:4], function(col) tapply(col, students$Status, FUN = mean))
To calculate percent of students that passed:
dim(students[students$Status == "Pass" ,])[1] / dim(students)[1]
Or by score:
dim(students[students$Subject_1_Score >= 65 ,])[1] / dim(students)[1]
The above is the data-frame equivalent of this type of vector statement using indexing:
length(x[x == "Pass"]) / length(x)
To calculate a function across rows or columns you can use apply.
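For instance, on the score columns (3 and 4) of the data frame above:
apply(students[, 3:4], 2, mean)    # column means for both subjects
apply(students[, 3:4], 2, stdErr)  # standard error per subject column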