How does R's recommenderlab calculate the ratings of each item in a ratingMatrix?

Recently, I started using R's recommenderlab package in my studies.
This is the recommenderlab document:
http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
There are some examples in this document, but I have a big question.
First, load recommenderlab package and Jester5k data set.
library("recommenderlab")
data(Jester5k)
Use the first 1000 records (users) of Jester5k for training. The recommendation algorithm is POPULAR.
r <- Recommender(Jester5k[1:1000], method="POPULAR")
Then predict the 1001st user's recommendation list, showing the top 5 items.
recom <- predict(r, Jester5k[1001], n=5)
as(recom, "list")
output:
[1] "j89" "j72" "j47" "j93" "j76"
Then I check the ratings of those 5 items.
rating <- predict(r, Jester5k[1001], type="ratings")
as(rating, "matrix")[,c("j89", "j72", "j47", "j93", "j76")]
output:
      j89       j72       j47       j93       j76
2.6476613 2.1273894 0.5867006 1.2997065 1.2956333
Why is the top 5 list "j89" "j72" "j47" "j93" "j76" when j47's rating is only 0.5867006?
I do not understand.
How does recommenderlab calculate the ratings of each item in ratingMatrix?
And how does it produce the TopN list?

To get a clearer picture of your issue I suggest that you read this:
"recommenderlab: A Framework for Developing and Testing Recommendation Algorithms"
Why is the top 5 list "j89" "j72" "j47" "j93" "j76"
You are using the POPULAR method, which means the top 5 list is chosen by how often each item was rated (counting the number of ratings), not by the highest predicted rating.
How does recommenderlab calculate the ratings of each item in
ratingMatrix? And how does it produce the TopN list?
As for the predicted ratings, recommenderlab calculates them using the usual distance methods (not yet clear whether it is Pearson or cosine; I didn't have the chance to check). It then determines the rating, as suggested by Breese et al. (1998), as the mean rating plus a weighted factor calculated on the neighborhood. You can consider the entire training set as the neighborhood of any user, which is why the predicted ratings for any user on the same item will have the same value.
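The mismatch between the two rankings can be illustrated with a base-R toy sketch (an illustration of the idea only, not recommenderlab's actual code): the top-N list is driven by how many users rated each item, while the predicted ratings come from the user's mean plus each item's average mean-centered rating, so the two orderings need not agree.

```r
# Toy 4-user x 3-item rating matrix, NA = not rated
m <- rbind(c(5, NA, 2),
           c(4,  5, NA),
           c(NA, 4, 1),
           c(3, NA, 2))
colnames(m) <- c("i1", "i2", "i3")

# Popularity = number of ratings per item (this drives the top-N list)
popularity <- colSums(!is.na(m))

# Predicted rating for user 1 on item i2 (this drives type="ratings"):
# the user's mean plus the item's average mean-centered rating
user_means <- rowMeans(m, na.rm = TRUE)       # per-user average
centered   <- m - user_means                  # remove each user's bias
item_dev   <- colMeans(centered, na.rm = TRUE)
pred_u1_i2 <- user_means[1] + item_dev["i2"]  # 3.5 + 1.0 = 4.5
```

Here i2 is the least-rated (least popular) item even though its predicted rating for user 1 is the highest, which mirrors what you observed with j47.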
My best.
L

Related

Finding summary statistics. Struggling with having anything work after importing data into R from Excel

Very new to R here, and also very new to the idea of coding and computer stuff in general.
It's the second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the chart of data and tried to follow along with his verbal instructions during class, but I am one of the only non-computer-science backgrounds in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file, then uploaded it into R, and it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these column names into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get myself started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. And please excuse my ignorance/lack of familiarity compared to most of you here... this is my second week into my masters; I am hoping I begin to pick this up soon, I am just not there yet.
the instructions include:
# * Import the dataset
# * Summarize the dataset, compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers in each age category ?
# * Subset the dataset by the mothers that smoke and weigh less than 100 kg. How many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other interesting properties of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age with height or weight to get those)
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
2 tabulate smokers and age.level: get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset mothers that smoke with weight < 100 (I'm assuming non-pregnant women have NA in the gestation field. Adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 Is the same as 1A.
5 Plot a histogram (you don't specify of which variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.
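If you'd rather stay in base R (your import apparently arrived as a matrix), here is a minimal sketch with made-up numbers, assuming numeric columns named age, height and weight. If your matrix holds characters, convert first with as.data.frame() and as.numeric().

```r
# Toy data standing in for the imported spreadsheet
df <- data.frame(age    = c(25, 31, 40, 28),
                 height = c(165, 170, 158, 172),
                 weight = c(60, 72, 55, 80))

# Mean and standard deviation for several columns at once
sapply(df[c("age", "height", "weight")], mean)
sapply(df[c("age", "height", "weight")], sd)
```

sapply() applies the same function to each column and returns a named vector, which is usually all a summary-statistics assignment asks for.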

How to determine order of occurrence for Tweets (survival analysis?)?

Trying to figure something out for a pet project and am awfully stuck :(
The project centers around order of Tweet occurrences. I collected Tweets on 3 different topics for 2 actors. I want to determine which actor's tweet on each particular topic occurred earlier overall. A friend recommended me look into the package "survival", but I couldn't see how it could work. Any suggestion would be welcome! Thanks so much!
EDIT: Additional information
created_at  name            party  type  topic
1544469754  chicagotribune  M      1     trade
1541550304  chicagotribune  M      1     trade
The variables represent the following information:
-created_at: the time the tweet was sent out
-name: Twitter account name
-party: classification variable of political leaning
-type: binary indicator (1 = media type A, 0 = media type B)
-topic: the topic the tweet belongs to (3 topics total)
I don't think this is a survival analysis problem; you just need to find the earliest timestamp within each topic. I think something like this should work:
library(dplyr)

# Read in example data
df = readr::read_table("created_at name party type topic
1544469754 chicagotribune M 1 trade
1541550304 chicagotribune M 1 trade")

df %>%
  group_by(topic) %>%
  summarise(first_tweeter = name[which.min(created_at)])
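If dplyr isn't available, the same earliest-per-topic lookup can be sketched in base R (the second account name below is made up for illustration):

```r
df <- data.frame(created_at = c(1544469754, 1541550304),
                 name  = c("chicagotribune", "othernews"),
                 topic = c("trade", "trade"),
                 stringsAsFactors = FALSE)

# For each topic, keep the row with the smallest (earliest) timestamp
earliest <- do.call(rbind, lapply(split(df, df$topic),
                    function(d) d[which.min(d$created_at), ]))
earliest[, c("topic", "name")]
```

split() breaks the data frame into one piece per topic, and which.min() picks the row with the earliest created_at inside each piece.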

How to plot data from Excel using the R corrplot function?

I am trying to learn R, and to use the corrplot library to draw a graph with Y: City and X: Population. I wrote the code below:
When you look at the picture above, there are 2 columns, City and Population. When I run the code I get this error message:
Error in cor(Illere_Gore_Nufus) : 'x' must be numeric.
My excel data:
In general, a correlation plot (scatter plot) can be plotted only when you have two continuous variables. Correlation is a value that tells you how strongly two continuous variables are linearly related. The correlation value always falls between -1 and 1: a value of -1 depicts a strong negative (decreasing) linear relationship, and a value of 1 depicts a strong positive (increasing) linear relationship between the two variables. A correlation value of 0 says that there is no linear relationship between the two variables; however, there could still be a curvilinear relationship between them.
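A quick base-R check of the two extremes:

```r
x <- c(1, 2, 3, 4)
cor(x, c(8, 6, 4, 2))     # perfectly decreasing pairs: correlation is -1
cor(x, c(10, 20, 30, 40)) # perfectly increasing pairs: correlation is +1
```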
For example
Area of the land Vs Price of the land
Here is the Data
The correlation value for this data is 0.896, which means that there is a strong linear correlation between Area of the land and Price of the land (Obviously!).
Scatter plot in R would look like this
The R code would be
area<-c(650,785,880,990,1100,1250,1350,1800,2200,2800)
price<-c(250,275,280,290,350,340,400,335,420,460)
cor(area,price)
plot(area,price)
In Excel, for the same example, you can select the two columns, go to Insert > Scatter plot (under charts section)
In your case, the information can be plotted as a bar graph with city on the y axis and population on the x axis, or vice versa!
Hope I have answered your query!
Some assumptions
You are asking how to do this in Excel, but your question is tagged R and Power BI (also RStudio, but that has been edited away), so I'm going to show you how to do this with R and Power BI. I'm also going to show you why you got that error message, and why you would get an error message either way, because your dataset is simply not sufficient to make a correlation plot.
My answer
I'm assuming you would like to make a correlation plot of the population between the cities in your table. In that table you'd need more information than only one year for each city. I would check your data sources and see if you could come up with population numbers for, let's say, the last 10 years. Lacking the exact numbers for the cities in your table, I'm going to use some semi-made-up numbers for the population in the 10 most populous countries (following your data structure):
Country 2017 2016 2015 2014 2013
China 1415045928 1412626453 1414944844 1411445597 1409517397
India 1354051854 1340371473 1339431384 1343418009 1339180127
United States 326766748 324472802 325279622 324521777 324459463
Indonesia 266794980 266244787 266591965 265394107 263991379
Brazil 210867954 210335253 209297939 209860881 209288278
Pakistan 200813818 199761249 200253292 197655630 197015955
Nigeria 195875237 192568158 195757661 191728478 190886311
Bangladesh 166368149 165630262 165936711 166124290 164669751
Russia 143964709 143658415 143146914 143341653 142989754
Mexico 137590740 137486490 136768870 137177870 136590740
Writing and debugging R code in Power BI is a real pain, so I would recommend installing RStudio, writing your little R snippets there, and then pasting them into Power BI.
The reason for your error message is that the function cor() only takes numerical data as arguments, while in your code sample the city names are passed in as well. And there are more potential traps in your code sample: you have to make sure that your dataset is numeric, and that it has a shape that cor() will accept.
Below is an R script that will do just that. Copy the data above, and store it in a file called data.xlsx on your C drive.
The Code
library(corrplot)
library(readxl)
# Read data
setwd("C:/")
data <- read_excel("data.xlsx")
# read_excel returns a tibble, which does not support row names,
# so convert it to a plain data.frame first
data <- as.data.frame(data)
# Set Country names as row index
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
The plot
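A side note on the transpose step: cor() correlates the columns of its input, so with countries stored as rows you must flip the table first. A tiny base-R sketch with made-up numbers:

```r
# Two toy 'countries' observed over three 'years'
m <- rbind(A = c(1, 2, 3),
           B = c(2, 4, 6))
cor(t(m))   # 2 x 2 matrix: correlates the countries across the years
```

Without the transpose, cor(m) would instead correlate the year columns with each other, which is not what the plot is meant to show.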
Power BI
In Power BI, you need to import the data before you use it in an R visual:
Copy this:
Country,2017,2016,2015,2014,2013
China,1415045928,1412626453,1414944844,1411445597,1409517397
India,1354051854,1340371473,1339431384,1343418009,1339180127
United States,326766748,324472802,325279622,324521777,324459463
Indonesia,266794980,266244787,266591965,265394107,263991379
Brazil,210867954,210335253,209297939,209860881,209288278
Pakistan,200813818,199761249,200253292,197655630,197015955
Nigeria,195875237,192568158,195757661,191728478,190886311
Bangladesh,166368149,165630262,165936711,166124290,164669751
Russia,143964709,143658415,143146914,143341653,142989754
Mexico,137590740,137486490,136768870,137177870,136590740
Save it as countries.csv in a folder of your choosing, and pick it up in Power BI using
Get Data | Text/CSV, click Edit in the dialog box, and in the Power Query Editor, click Use First Row as headers so that you have this table in your Power Query Editor:
Click Close & Apply and make sure that you've got the data available under VISUALIZATIONS | FIELDS:
Click R under VISUALIZATIONS:
Select all columns under FIELDS | countries so that you get this setup:
Take the parts of your R snippet that we prepared above:
library(corrplot)
# Set Country names as row index
data <- dataset
rownames(data) <- data$Country
# Remove Country from dataframe
data$Country <- NULL
# Transpose data into a readable format for cor()
data <- data.frame(t(data))
# Plot data
corrplot(cor(data))
And paste it into the Power BI R script Editor:
Click Run R Script:
And you're gonna get this:
That's it!
If you change the procedure to importing data from an Excel file instead of a text file (using Get Data | Excel), you've successfully combined the powers of Excel, Power BI and R to produce a correlation plot!
I hope this is what you were looking for!

How do I optimize association analysis for the rules to make sense?

I have a dataset of customers, and I want to define frequent criteria to paint a picture of an ideal customer.
The dataset has the following fields:
email
fullname
Job (title)
company web domain
company description (string data)
company founded (year)
company employees (number)
company city
company state
company country
linkedin groups followed
created
updated
Except for Company Employees, Company Founded, Created and Updated, there is no numerical data. The dataset has other useful data, like age (interval) and sex, but those fields have too many missing values, so I removed them for the analysis.
I ran the code in R:
data1 <- read.csv("final_account_list.csv")
library(arules)
str(data1)
data1$Company.Founded <- factor(data1$Company.Founded)
rules1 <- apriori(data1)
rules1
inspect(rules1)
options(digits=2)
inspect(rules1[1:5])
I am getting a list of 59 rules, but they don't make much sense. For example,
{Company.Employees=500} => {Company.Country=USA}, lift = 1.176, confidence = 0.083, support = 0.109
The fact that the majority of customers have 500 employees and are in the USA does not bring much value. How do I make my analysis more meaningful?
For example, how do I find association for the title, geographies (city, state) and linkedin groups?
The most non-trivial part is to define what a "meaningful rule" means for you in terms of the right-hand side (rhs) and/or the left-hand side (lhs).
Then, as described in the documentation of the arules package, you can investigate your rules.
For your example
how do I find association for the ... linkedin groups
you can use
# find rules with "linkedin groups followed" in the right-hand side
rulesLinkedIn = subset(rules1, subset = rhs %in% "linkedin groups followed")
# inspect the 3 rules with the highest confidence
inspect(head(sort(rulesLinkedIn, by = "confidence"), n = 3))
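To judge whether a rule is meaningful, it also helps to remember how the three quality measures relate to each other. A base-R toy computation, with made-up 0/1 indicators for two attributes:

```r
# Toy 'transactions': 10 customers, A = "has 500 employees", B = "country is USA"
n <- 10
A <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)   # 4 customers have A
B <- c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0)   # 5 customers have B

supp_AB <- sum(A & B) / n          # support of {A} => {B}: 2/10 = 0.2
conf    <- supp_AB / (sum(A) / n)  # confidence: 0.2 / 0.4 = 0.5
lift    <- conf / (sum(B) / n)     # lift: 0.5 / 0.5 = 1.0
```

A lift near 1 (as in your example rule) means the rhs is roughly as common among customers matching the lhs as in the whole dataset, so the rule adds little information; rules with high lift and reasonable support are the ones worth inspecting.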

HMM text recognition in R with depmixS4

I'm wondering how I would use the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a testing data set?
I have a file of training data, a file of label data, and a test data.
The training data consists of 4620 rows. Each row has 1079 values: 83 windows with 13 values per window, so each row is made up of 83 states with 13 observations each. Each of these rows is a spoken word, so there are 4620 utterances. In total the data contains only 7 distinct words, and each distinct word has 660 different utterances, hence the 4620 rows.
So we have words (0-6)
The label file is a list where each row is labeled 0-6, corresponding to which word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at :
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
       prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
       ntimes=NULL, ...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, test to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise in the data, and convert it into data.frame format for analysis in depmixS4:
indpro.1yr <-na.omit(ROC(INDPRO,12))
indpro.1yr.df <-data.frame(indpro.1yr)
Now, let's run a simple HMM model and choose just 2 states: growth and contraction. The response argument you asked about is a model formula for the observed variable (here INDPRO, modeled with an intercept only, ~ 1). Note that we're only using industrial production to search for signals:
model <- depmix(response = INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition = ~1)
Now let's fit the resulting model, generate posterior states
for analysis, and estimate probabilities of recession. Also, we'll bind the data with dates in an xts format for easier viewing/analysis. (Note the use of set.seed(1), which is used to create a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <-model.prob[,2]
prob.rec.dates <- xts(prob.rec,
                      order.by = as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
High values (above 0.80, say) indicate/suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.
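For intuition about what an HMM-based classifier would do with the 7 words in the original question, here is a minimal base-R sketch (toy numbers, not depmixS4 code) of the forward algorithm, which scores an observation sequence under one HMM. Classification then means fitting one model per word and assigning each test row to the word whose model scores it highest.

```r
# Forward algorithm: likelihood of a discrete observation sequence under an HMM
forward_likelihood <- function(obs, init, trans, emit) {
  # obs:   vector of observation symbol indices
  # init:  initial state probabilities
  # trans: state transition matrix (rows sum to 1)
  # emit:  states x symbols emission probability matrix
  alpha <- init * emit[, obs[1]]
  for (t in 2:length(obs)) {
    alpha <- as.vector(alpha %*% trans) * emit[, obs[t]]
  }
  sum(alpha)
}

# Toy 2-state, 2-symbol model
init  <- c(0.6, 0.4)
trans <- matrix(c(0.7, 0.3,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
emit  <- matrix(c(0.5, 0.5,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
forward_likelihood(c(1, 2, 2), init, trans, emit)
```

With one fitted model per word, the classifier is simply `which.max` over the 7 per-model likelihoods for each test row.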
