I am using Recommenderlab in R to build a recommendation system to provide craft-beer suggestions to new users.
However, upon running the model, I am receiving the same predictions per user for a majority of the training dataset, or receiving 'character(0)' as the output. How can I receive the predictions that are associated with each user and not duplicated?
The dataset I'm using can be found here: https://www.kaggle.com/rdoume/beerreviews/version/1
I have tried converting the data frame directly into a matrix, then into a realRatingMatrix.
In order to receive any recommendations, I need to use the 'dcast' function from the data.table library before converting the data frame into a matrix.
I have also tried removing the first column from the matrix to drop the user ids.
One thing to note is that when the data is sampled, there can be a few rows where the 'reviewer' is blank, but the rating and beer id are present.
library(dplyr)
library(tidyverse)
library(recommenderlab)
library(reshape2)
library(data.table)
beer <- read.csv('beer.csv', stringsAsFactors = FALSE)
#Take sample of data(1000)
beer_sample <- sample_n(beer, 1000)
#Select relevant columns & rename
beer_ratings <- select(beer_sample, reviewer = review_profilename, beerId = beer_beerid, rating = review_overall)
#Add unique id for reviewers
beer_ratings$userId <- group_indices_(beer_ratings, .dots = 'reviewer')
#Create ratings matrix
rating_matrix <- dcast(beer_ratings, userId ~ beerId, value.var = 'rating')
rating_matrix <- as.matrix(rating_matrix)
rating_matrix <- as(rating_matrix, 'realRatingMatrix')
#UBCF Model
recommender_model <- Recommender(rating_matrix, method = 'UBCF', param=list(method='Cosine',nn=10))
#Predict top 5 beers for first 10 users
recom <- predict(recommender_model, rating_matrix[1:10], n=5)
#Return top recommendations as a list
recom_list<- as(recom,'list')
recom_list
The above code will result in:
[[1]]
[1] "48542" "2042" "6" "10" "19"
[[2]]
[1] "10277" "2042" "6" "10" "19"
[[3]]
[1] "10277" "48542" "6" "10" "19"
[[4]]
[1] "10277" "48542" "2042" "6" "10"
[[5]]
[1] "10277" "48542" "2042" "6" "10"
[[6]]
[1] "10277" "48542" "2042" "6" "10"
Converting the data frame to a matrix then realRatingMatrix without casting first into a table results in the user's recommendation as:
`886093`
`character(0)`
Using the 'dcast' function first then converting the data frame into a matrix and removing the first column, then into a realRatingMatrix returns the same predictions for almost every user:
[[1]]
[1] "6" "7" "10" "12" "19"
[[2]]
[1] "6" "7" "10" "12" "19"
[[3]]
[1] "6" "7" "10" "12" "19"
Any help is greatly appreciated.
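One thing worth checking, offered as a guess rather than a diagnosis: after `dcast`, `userId` is still an ordinary column, so `as.matrix()` carries it into the matrix as if it were a beer. With only 1000 sampled reviews, most users probably also have very few ratings, which would leave UBCF little to distinguish them. The sketch below uses invented toy data (the names and values are assumptions, not the Kaggle data) and base R only. It drops blank reviewers and keeps user ids as row names instead of as a column.

```r
# Toy ratings; the values are invented for illustration only.
beer_ratings <- data.frame(
  reviewer = c("ann", "ann", "bob", "", "cy", "bob"),
  beerId   = c(6, 10, 6, 19, 10, 19),
  rating   = c(4.0, 3.5, 5.0, 2.0, 4.5, 3.0),
  stringsAsFactors = FALSE
)

# Drop rows where the reviewer name is blank
beer_ratings <- beer_ratings[nzchar(beer_ratings$reviewer), ]

# Users become row names, beers become column names, unrated cells are NA
rating_matrix <- with(
  beer_ratings,
  tapply(rating, list(reviewer, beerId), mean)
)

rating_matrix

# Assumption, not run here: with recommenderlab loaded,
# as(rating_matrix, "realRatingMatrix") should treat the NA cells as unrated.
```

Because the ids live in the row names, there is no id column to strip before the conversion.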
I deleted rows from my R dataframe and now the index numbers are out of order. For example, the row-index was 1,2,3,4,5 before but now it is 2,3,4 because I deleted rows 1 and 5.
Do I want to change the index labels from 2,3,4 to 1,2,3 on my new dataframe?
If so, how do I do this?
If not, why not?
library(rvest)
url <- "https://en.wikipedia.org/wiki/Mid-American_Conference"
pg <- read_html(url) # Download webpage
pg
tb <- html_table(pg, fill = TRUE) # Extract HTML tables as data frames
tb
macdf <- tb[[2]]
macdf <- subset(macdf, select=c(1,2,5))
colnames(macdf) <- c("School","Location","NumStudent")
macdf <- macdf[-c(1,8),]
You can change the labels from "2" "3" "4" "5" "6" "7" "9" "10" "11" "12" "13" "14" to "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" using:
row.names(macdf) <- 1:nrow(macdf)
You could also do something like this:
> library(data.table)
> subset(setDT(macdf, keep.rownames = TRUE), select = -rn)
Or simply:
rownames(macdf) <- NULL
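As a small self-contained illustration of the row-name reset (the data frame here is a made-up stand-in for `macdf`):

```r
# Hypothetical stand-in for macdf
df <- data.frame(
  School     = c("A", "B", "C", "D", "E"),
  NumStudent = c(10, 20, 30, 40, 50)
)

# Deleting rows 1 and 5 leaves the old labels behind
df <- df[-c(1, 5), ]
rownames(df)
# [1] "2" "3" "4"

# Resetting to NULL restores the default 1..n labels
rownames(df) <- NULL
rownames(df)
# [1] "1" "2" "3"
```

Whether to renumber is mostly a matter of taste: the labels do not affect positional indexing such as `df[2, ]`, but stale labels can be confusing when printed.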
If I want to represent a set of values in R that are keyed on 3 different dimensions, is there a simple/succinct way of generating this?
Say for example I have the following keys - each dimension must support having a different number of keys. In total the example below will reference 360 values (3*30*4):
rating <- c('AA','AAB','C')
timeInYears <- 1:30
monthsUntilStart <- c(1,3,6,12)
So I want to be able to access, for example, the value with a rating of AA, 7 years from now, starting in 12 months, using something like:
value <- data[rating=='AA',timeInYears==7,monthsUntilStart==12]
To start with I'd like to be able to provide sample generated values for every combination of keys.
In reality they will be read in from a database, but to get started it would be good to provide a dummy structure from a set of dummy values, that can simply be sequentially repeated over the structure.
So say we have
values <- c(2.30,2.32,1.98,2.18,2.29,2.22)
So each (x,y,z) key maps to one of these values.
Any hints or tips on how best to approach this would be much appreciated!
Thanks!
Phil.
You can use an array in R for this task.
First, create a data frame that contains every combination of keys. The six sample values are recycled to fill all 360 rows:
rating <- c('AA','AAB','C')
timeInYears <- 1:30
monthsUntilStart <- c(1,3,6,12)
data <- expand.grid(rating=rating, timeInYears=timeInYears, monthsUntilStart=monthsUntilStart)
data$value <- c(2.30,2.32,1.98,2.18,2.29,2.22) # cycles through
Next, we convert to an array:
dataarray <- unclass(by(data[["value"]], data[c("rating", "timeInYears", "monthsUntilStart")], identity))
Note that the integer keys are converted to character strings in the dimnames:
> dimnames(dataarray)
$rating
[1] "AA" "AAB" "C"
$timeInYears
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
$monthsUntilStart
[1] "1" "3" "6" "12"
You can access the desired element by its dimnames, quoting the numeric keys (this returns the recycled value that was assigned for this example).
> dataarray["AA", "7", "12"]
[1] 2.3
Edit
You can also just use the data frame itself, if you wish.
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12)
rating timeInYears monthsUntilStart value
289 AA 7 12 2.3
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12, value)
value
289 2.3
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12)$value
[1] 2.3
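As a possible shortcut, the same structure can be built directly with `array()`, skipping the intermediate data frame. This is only a sketch: `value_array` and the helper `get_value` are names made up for illustration. It relies on `array()` filling the first dimension fastest, which is the same order `expand.grid` uses.

```r
rating <- c("AA", "AAB", "C")
timeInYears <- 1:30
monthsUntilStart <- c(1, 3, 6, 12)
values <- c(2.30, 2.32, 1.98, 2.18, 2.29, 2.22)

n_cells <- length(rating) * length(timeInYears) * length(monthsUntilStart)

# Recycle the sample values over all 3 * 30 * 4 = 360 cells
value_array <- array(
  rep_len(values, n_cells),
  dim = c(length(rating), length(timeInYears), length(monthsUntilStart)),
  dimnames = list(
    rating           = rating,
    timeInYears      = as.character(timeInYears),
    monthsUntilStart = as.character(monthsUntilStart)
  )
)

value_array["AA", "7", "12"]
# [1] 2.3

# Hypothetical helper so callers can pass numeric keys
get_value <- function(arr, rating, years, months) {
  arr[rating, as.character(years), as.character(months)]
}

get_value(value_array, "AA", 7, 12)
# [1] 2.3
```

Converting the numeric keys with `as.character()` matters: indexing with a bare `12` would be read as position 12, not the key "12".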
I am reading HTML tables, and can do that fine, but I am collecting tables from multiple years. Unfortunately, the columns and rows are different in each year, so I wanted to add them all to a list, one at a time, so that I can later use lapply to do some analysis.
I can download a table and manipulate it into a data frame when I do it once, but when I add it to a list, the list only keeps the first column.
library(XML)
#reg
r=readHTMLTable('http://www.nhl.com/stats/team?season=20132014&gameType=2&viewName=summary#',stringsAsFactors=FALSE)
r=as.data.frame(r[3])
for(i in 3:ncol(r)){
r[,i]=as.numeric(r[,i])
}
This gives me r as something I can manipulate. I want to add it to a list:
> l=as.list(NULL)
> l[1]=r
Warning message:
In l[1] = r :
number of items to replace is not a multiple of replacement length
> l
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
Does anyone know how I can add it to my list so that it keeps its dimensions?
> dim(r)
[1] 30 25
The issue is that I have many other tables that I would like to add. I was able to add them, but each one only kept its first column/element.
Any ideas are greatly appreciated.
Thanks!
A little more research, and I found an answer. I feel guilty about it, but here it is:
l[[1]]=r
adds the whole table r to the list as a single element, and the same assignment works inside a loop for further tables.
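To make the difference concrete, here is a minimal sketch with a made-up data frame standing in for the downloaded table. Single brackets assign list elements, and a data frame is itself a list of columns, so `l[1] <- r` keeps only the first column. Double brackets assign one object to one slot.

```r
# Hypothetical stand-in for the scraped table
r <- data.frame(team = c("A", "B", "C"), wins = c(50, 45, 40))

l <- list()

# Double brackets store the whole data frame in slot 1
l[[1]] <- r

# Single brackets also work if the data frame is wrapped in a list first
l[2] <- list(r)

dim(l[[1]])
# [1] 3 2
dim(l[[2]])
# [1] 3 2
```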
Here is some sample code to generate y values for various test conditions and repeated tests.
library(data.table)
headings = seq(0,180,30)
spds=c(5,10,20,30)
tests=seq(5)
ts=seq(30*60)
dt <- data.table(hd = rep(headings,each=length(spds)*length(tests)*length(ts)),
spd = rep(spds,each=length(tests)*length(ts)),
test = rep(tests,each=length(ts)),
t = ts/60
)
set.seed(1)
dt$y <- with(dt, ((spd/22)^2 * (sin(pi/180*t) +
runif(nrow(dt),-1,1)) * runif(nrow(dt))))
The data for each test needs some processing, and returns a table of values. Here is a simple function to return the index, time, and y values in a range.
findGood<-function(x,ylim) {
gt <- which(ylim[1] <= x$y & x$y <= ylim[2])
dt <- data.table(indx=gt, t=x$t[gt], y=x$y[gt])
return(dt)
}
Using "by" to loop over the test conditions and repeated tests generates a list of results, one for each combination of test conditions.
ylim=c(.3,.45)
t_for_good_y <- by(dt[,c('t','y'),with=FALSE],
dt[,c('hd','spd','test'),with=FALSE],
findGood,ylim)
The following converts the results back into a data table. However, the table does not contain the condition values (hd, spd, test).
tg <- rbindlist(t_for_good_y)
How can I convert the list t_for_good_y into a data table, but add columns for the conditions that each list item represents (hd, spd, and test in this example)?
One possible approach would make use of dimnames:
dn <- dimnames(t_for_good_y)
dn contains the names and values of the dimensions that index the list:
R> str(dn)
List of 3
$ hd : chr [1:7] "0" "30" "60" "90" ...
$ spd : chr [1:4] "5" "10" "20" "30"
$ test: chr [1:5] "1" "2" "3" "4" ...
R> print(dn)
$hd
[1] "0" "30" "60" "90" "120" "150" "180"
$spd
[1] "5" "10" "20" "30"
$test
[1] "1" "2" "3" "4" "5"
R> names(dn)
[1] "hd" "spd" "test"
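One way the dimnames idea might be carried through, sketched in base R on a small made-up data set (`df`, `find_good`, and the numbers are assumptions for illustration): `expand.grid()` on the dimnames lists the condition combinations in the same order as the elements of the `by` result, so each key row can be repeated once per result row and bound to the stacked results.

```r
# Small invented data set with two conditions
df <- data.frame(
  hd  = rep(c(0, 30), each = 4),
  spd = rep(c(5, 10), times = 4),
  y   = c(0.10, 0.40, 0.35, 0.20, 0.31, 0.50, 0.44, 0.33)
)

# Same idea as findGood: positions and values of y inside ylim
find_good <- function(x, ylim) {
  keep <- which(ylim[1] <= x$y & x$y <= ylim[2])
  data.frame(indx = keep, y = x$y[keep])
}

res <- by(df["y"], df[c("hd", "spd")], find_good, ylim = c(0.3, 0.45))

# Plain list array; combinations with no data would be NULL
pieces <- unclass(res)

# One row of keys per combination, in the same order as the list elements
keys <- expand.grid(dimnames(res), stringsAsFactors = FALSE)

# Number of result rows per combination (NROW(NULL) is 0)
n <- vapply(pieces, NROW, integer(1))

# Repeat each key row once per result row, then bind to the stacked results
out <- cbind(
  keys[rep(seq_len(nrow(keys)), n), , drop = FALSE],
  do.call(rbind, pieces)
)
rownames(out) <- NULL
out
```

The key columns come back as character strings because they are taken from dimnames, so something like `as.numeric(out$hd)` would be needed to recover numeric conditions.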
I want to strip some information from the row.names that are created automatically when split and cut2 are used. See the following code:
#Mock data
date_time <- as.factor(c('8/24/07 17:30','8/24/07 18:00','8/24/07 18:30',
'8/24/07 19:00','8/24/07 19:30','8/24/07 20:00',
'8/24/07 20:30','8/24/07 21:00','8/24/07 21:30',
'8/24/07 22:00','8/24/07 22:30','8/24/07 23:00',
'8/24/07 23:30','8/25/07 00:00','8/25/07 00:30'))
U. <- as.numeric(c('0.2355','0.2602','0.2039','0.2571','0.1419','0.0778','0.3557',
'0.3065','0.1559','0.0943','0.1519','0.1498','0.1574','0.1929'
,'0.1407'))
#Mock data frame
test_data <- data.frame(date_time,U.)
#To use cut2
library(Hmisc)
#Splitting the data into categories
sub_data <- split(test_data,cut2(test_data$U.,c(0,0.1,0.2)))
new_data <- do.call("rbind",sub_data)
test_data <- new_data
You will see that "test_data" now has row names such as "[0.000,0.100).6", "[0.000,0.100).10", etc.
How do I remove the "[0.000,0.100)" prefix and keep the number after the ".", such as 6 and 10, so that I can reference these rows by their original row number later?
Is there a better way to do this?
You could also set the names of sub_data to NULL.
names(sub_data) <- NULL
test_data <- do.call('rbind', sub_data)
row.names(test_data)
#[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
You could use a Regular Expression (Regex), as follows:
rownames(test_data) = gsub(".*[]\\)]\\.", "", rownames(test_data))
It's cryptic if you're not familiar with Regular Expressions, but it basically says: match any sequence of characters (.*) that is followed by either a closing square bracket or a closing parenthesis ([]\\)]) and then by a period (\\.), and remove all of it.
The double backslashes are escapes: in an R string, \\ yields a single backslash, which tells the regex engine to interpret the following character literally rather than in its special Regex meaning (e.g., . means "match any single character", but \\. means "this is really just a period").
Just for fun, you can also use regmatches
> Names <- rownames(test_data)
> ( rownames(test_data) <- regmatches(Names, regexpr("[0-9]+$", Names)) )
[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
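A slightly simpler variant, shown here on a few sample labels rather than the real `test_data` (the vector `labels` is made up for illustration): since the original row number always follows the last period, a greedy match up to that period can be removed in one `sub()` call.

```r
# Sample row names in the shape produced by split() + rbind()
labels <- c("[0.000,0.100).6", "[0.000,0.100).10", "[0.200,0.356].1")

# Greedy ".*" runs to the last literal period; everything up to it is dropped
stripped <- sub("^.*\\.", "", labels)
stripped
# [1] "6"  "10" "1"

# Convert to integers if the values will be used as row positions
as.integer(stripped)
# [1]  6 10  1
```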