I'm experimenting with some movie rating data. Currently I'm doing some hybrid item- and user-based predictions. Mathematically I'm unsure how to implement what I want; maybe the answer is just a straightforward weighted mean, but I feel like there might be some other option.
I have 4 values for now that I want to combine into a single prediction:
item-based prediction
user-based prediction
Global movie average for the given item
Global user average for the given user
As this progresses there will be other values I'll need to add to the mix, such as weighted similarity, genre weighting and I'm sure a few other things.
For now I want to focus on the data available to me as stated above, as much for understanding as anything else.
Here is my theory. To start, I want to weight the item- and user-based predictions equally, with both carrying more weight than the global averages.
My maths is very rusty, but my feeling, after some basic attempts to come up with a less linear solution, is to use something like the harmonic mean, except that instead of naturally tending towards the lowest value it would tend towards the global average.
e.g.
predicted item-based rating: 4.5
predicted user-based rating: 2.5
global movie rating: 3.8
global user rating: 3.6
so the "centre"/global average here would be (3.8 + 3.6) / 2 = 3.7
I may be way off base with this as my maths is quite rusty, but does anyone have any thoughts on how I could mathematically represent what I'm thinking?
OR
do you have any thoughts on a different approach?
I recommend looking into the "Recommender Systems Handbook" by F. Ricci et al., 2011. It summarizes all the common approaches in recommender engines and provides all the necessary formulas.
Here is an excerpt from 4.2.3:
As the number of neighbors used in the prediction increases, the rating predicted by the regression approach will tend toward the mean rating of item i. Suppose item i has only ratings at either end of the rating range, i.e. it is either loved or hated, then the regression approach will make the safe decision that the item’s worth is average. [...] On the other hand, the classification approach will predict the rating as the most frequent one given to i. This is more risky as the item will be labeled as either “good” or “bad”.
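To make the distinction in that excerpt concrete, here is a tiny illustration (the ratings are invented, not taken from the book):

```r
# Neighbours' ratings for an item that is mostly "loved or hated".
neighbour_ratings <- c(1, 1, 5, 5, 5)

# Regression-style prediction: the (possibly weighted) mean of the
# neighbours' ratings, which drifts towards the middle of the scale.
pred_regression <- mean(neighbour_ratings)   # 3.4

# Classification-style prediction: the most frequent rating among the
# neighbours, which commits to one end of the scale.
pred_classification <- as.numeric(names(which.max(table(neighbour_ratings))))  # 5
```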
As a beginner in machine learning, I am faced with a project where I have to find a method that uses both categorical and numerical variables collected from surveys to predict a child's "discretized" GPA values.
For example, the x-variables include yes/no/don't know responses to questions such as "I worry about taking tests", and numerical answers such as household income. The surveys were given to teachers, caregivers, and the children themselves.
The y-variable is GPA, which ranges from 1 to 4 in discrete increments of 0.25.
What I have attempted is to use the Boruta package to pick out the 65 most relevant features out of over 10,000 features (and all of the selected features do make sense---they are often related to the child's behavior in school and/or the child's scores/percentiles on standardized tests). Below is a sample of the features selected by Boruta.
A3D. Your dad misses events or activities that are important to you
G2C. I worry about taking tests
G2D. It's hard for me to pay attention
G2H. It's hard for me to finish my schoolwork
G2I. I worry about doing well in school
G2M. I get in trouble for talking and disturbing others
G19A. Frequency you had 4 or more drinks in one day in past 12 months
E6A. Father could count on someone to co-sign for a bank loan for $5000
i13. how much you earn in that job, before taxes
I19A. Amount earned from all regular jobs in past 12 months
J1. Total household income before taxes/deductions in past 12 months
J4A. Name on bank account
J6B. Amount owed on your vehicle
Then I ran a naive Bayes classifier. I do not know if this is appropriate or if there are better methods for this task, but the results are simply terrible. The model often produces extreme values such as 1 and 4, when the actual value should be somewhere in between. I thought I had relevant features for the task, but somehow the accuracy is very low.
I have also tried a gradient boosting machine from the caret package using the default parameters, but the result isn't very satisfying either.
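Roughly, the pipeline I ran looks like the sketch below (the data frame and column names are placeholders for my actual data):

```r
library(Boruta)
library(e1071)
library(caret)

# 'survey_df' stands in for the cleaned survey data; 'gpa' is the
# discretized outcome, treated as a factor for classification.
survey_df$gpa <- factor(survey_df$gpa)

# Feature selection with Boruta, keeping only the confirmed attributes.
bor      <- Boruta(gpa ~ ., data = survey_df)
keep     <- getSelectedAttributes(bor)
train_df <- survey_df[, c(keep, "gpa")]

# Naive Bayes classifier.
nb_fit <- naiveBayes(gpa ~ ., data = train_df)

# Gradient boosting via caret, default tuning grid.
gbm_fit <- train(gpa ~ ., data = train_df, method = "gbm", verbose = FALSE)
```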
What can I do to improve the model, and are there better methods to try?
Is regression more suited for this if I want to achieve better accuracy/minimize error?
Thanks!
I have thousands of small documents from 100 different authors. Using the quanteda package, I calculated the cosine similarity of each author's texts with one another. For example, author x has 100 texts, so I have come up with a 100 x 100 similarity matrix. Author y has 50 texts, so I have come up with a 50 x 50 similarity matrix.
Now I want to compare these two authors. In other words, which author copies himself more? If I take the average of the columns or rows and then average the resulting vector of means again, I arrive at a single number, so I can compare these two means of means, but I am not sure whether this procedure is right. I hope I made myself clear.
I think the answer depends on what exactly your quantity of interest is. If it is a single summary of how similar an author's documents are to one another, then some distribution across the document similarities, within author, is probably your best means of comparing this quantity between authors.
You could save and plot the cosine similarities across an author's documents as a density, for instance, in addition to your strategy of summarising this distribution using a mean. To capture the spread, I would also report the standard deviation of these similarities.
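For example, a minimal sketch, assuming sim_x is your 100 x 100 cosine-similarity matrix for author x (e.g. the output of quanteda's textstat_simil coerced with as.matrix()):

```r
# Keep only the off-diagonal pairwise similarities (the diagonal is
# always 1: each document compared with itself).
pairwise_x <- sim_x[upper.tri(sim_x)]

mean(pairwise_x)   # single within-author summary, comparable across authors
sd(pairwise_x)     # spread of the within-author similarities

# Compare whole distributions rather than just means.
plot(density(pairwise_x), main = "Within-author document similarity")
```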
I'd be cautious about calling cosine similarity within author as "self-plagiarism". Cosine similarity computes a measure of distance across vector representations of bags of words, and is not viewed as a method for identifying "plagiarism". In addition, there are very pejorative connotations to the term "plagiarism", which means the dishonest representation of someone else's ideas as your own. (I don't even believe that the term "self-plagiarism" makes sense at all, but then I have academic colleagues who disagree.)
Added:
Consider the textreuse package for R; it is designed for the sort of text-reuse analysis you are looking for.
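A rough sketch of the basic textreuse workflow (the directory name and n-gram size are placeholders; see the package vignette for the exact details):

```r
library(textreuse)

# One plain-text file per document for the author of interest.
corpus <- TextReuseCorpus(dir = "author_x_texts",
                          tokenizer = tokenize_ngrams, n = 5)

# Pairwise Jaccard similarity of the n-gram sets; high values between two
# documents suggest re-used passages worth inspecting.
scores <- pairwise_compare(corpus, jaccard_similarity)
```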
I don't think Levenshtein distance is what you are looking for. As the Wikipedia page points out, the LD between kitten and sitting is 3, but this says nothing in substantive terms about their semantic relationship or about one being an example of "re-use" of the other. An argument could be made that LD computed over words might show re-use, but that's not how most plagiarism-detection tools, e.g. http://turnitin.com, implement detection.
I recently started to work with a huge dataset provided by a medical emergency service. I have circa 25,000 spatial points of incidents.
I have been searching books and the internet for quite some time and am getting more and more confused about what to do and how to do it.
The points are, of course, very clustered. I calculated the K, L and G functions for them, and they confirm serious clustering.
I also have a population point dataset - one point for every citizen - that is similarly clustered to the incidents dataset (incidents happen to people, so there is a strong link between these two datasets).
I want to compare these two datasets to figure out whether they are similarly distributed. I want to know if there are places where there are more incidents relative to the population. In other words, I want to use the population dataset to explain the intensity and then figure out whether the incident dataset corresponds to that intensity. The assumption is that incidents should occur randomly with respect to the population.
I want to get a plot of the region showing where there are more or fewer incidents than expected if incidents happened randomly to people.
How would you do it with R?
Should I use Kest or Kinhom to calculate the K function? I read the documentation, but I still don't understand the basic difference between them.
I tried using Kcross, but as I figured out, one of the two datasets used should be CSR - completely spatially random. I also found Kcross.inhom; should I use that one for my data?
How can I get a plot (image) of incident deviations relative to the population?
I hope I asked clearly.
Thank you for your time to read my question and
even more thanks if you can answer any of my questions.
Best regards!
Jernej
I do not have time to answer all your questions in full, but here are some pointers.
DISCLAIMER: I am a coauthor of the spatstat package and the book Spatial Point Patterns: Methodology and Applications with R so I have a preference for using these (and I genuinely believe these are the best tools for your problem).
Conceptual issue: How big is your study region and does it make sense to treat the points as distributed everywhere in the region or are they confined to be on the road network?
For now I will assume they can be located anywhere in the region.
A simple approach would be to estimate the population density using density.ppp and then fit a Poisson model to the incidents with the population density as the intensity using ppm. This would probably be a reasonable null model, and if it fits the data well you can basically say that incidents happen "completely at random in space when controlling for the uneven population density". More info on density.ppp and ppm is in chapters 6 and 9 of [1], respectively, and of course in the spatstat help files.
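A rough sketch of that idea, assuming incidents and population are ppp objects on the same observation window (the object names are placeholders):

```r
library(spatstat)

# Kernel-smooth the population points into an intensity surface;
# the bandwidth choice deserves some care.
pop_dens <- density.ppp(population, sigma = bw.diggle)

# Log intensity image; the small constant avoids log(0) in areas
# where the estimated population density is (numerically) zero.
log_pop <- log(pop_dens + 1e-8)

# Null model: incident intensity proportional to population density,
# i.e. a Poisson model with the log population density as an offset.
fit <- ppm(incidents ~ offset(log_pop))
fit
```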
If you use summary statistics like the K/L/G/F/J-functions, you should always use the inhom versions to take the population density into account. This is covered in chapter 7 of [1].
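Continuing the sketch above, the inhomogeneous K function can take the fitted intensity from the null model directly:

```r
# Inhomogeneous K function using the intensity fitted above, instead of
# assuming a constant intensity as Kest does.
Ki <- Kinhom(incidents, lambda = fit)
plot(Ki)
```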
Also, it could probably be interesting to see the relative risk (relrisk) if you combine all your points into a marked point pattern with two types (background and incidents). See chapter 14 of [1].
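A rough sketch of that, again using the placeholder objects from above:

```r
# Combine the two patterns into one marked point pattern and estimate the
# spatially varying probability that a point is an incident rather than a
# background (population) point.
both <- superimpose(population = population, incident = incidents)
rr   <- relrisk(both)   # see ?relrisk for which type is treated as the "case"
plot(rr, main = "Estimated probability of incident vs. population")
```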
Unfortunately, only chapters 3, 7 and 9 of [1] are available as free-to-download sample chapters, but I hope you have access to it at your library or have the option of buying it.
I have historical purchase data for some 10k customers over 3 months, and I want to use that data to make predictions about their purchases in the next 3 months. I am using Customer ID as an input variable, as I want xgboost to learn each individual's spending across different categories. Is there a way to tweak it so that the emphasis is on learning more from each individual's purchases? Or is there a better way of addressing this problem?
You can use a weight vector, which you pass via the weight argument in xgboost; it is a vector of size equal to nrow(trainingData). However, this is generally used to penalize mistakes in classification (think of sparse data with items that sell, say, only once a month or so; if you want to learn those sales you need to give more weight to the sales instances, or else all predictions will be zero). Apparently you are trying to tweak the weight of an independent variable, which I am not able to understand well.
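For reference, a minimal sketch of passing per-row weights in the R xgboost package (X, y and w are placeholders for your feature matrix, target and weights):

```r
library(xgboost)

# Per-row weights go into the DMatrix alongside the labels.
dtrain <- xgb.DMatrix(data = as.matrix(X), label = y, weight = w)

bst <- xgb.train(params  = list(objective = "reg:squarederror",  # match your task
                                eta = 0.1),
                 data    = dtrain,
                 nrounds = 200)
```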
Learning the behavior of the dependent variable (sales in your case) is what a machine learning model does; you should let it do its job. You should not tweak it to force it to learn from some feature only. For learning purchase behavior, clustering-type unsupervised techniques will be more useful.
To include user-specific behavior, a first take would be to do clustering and identify under-indexed and over-indexed categories for each user. Then you can create some categorical features using these flags.
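A rough sketch of the over-/under-indexing idea (the purchases data frame and its columns customer_id, category and amount are placeholders):

```r
library(dplyr)

# Per-customer share of spend in each category.
user_share <- purchases %>%
  group_by(customer_id, category) %>%
  summarise(spend = sum(amount), .groups = "drop_last") %>%
  mutate(user_share = spend / sum(spend)) %>%
  ungroup()

# Overall share of spend in each category.
overall_share <- purchases %>%
  group_by(category) %>%
  summarise(cat_spend = sum(amount)) %>%
  mutate(overall_share = cat_spend / sum(cat_spend))

# Index > 1 means the customer over-indexes in that category; the flag can
# then be fed back into the model as a categorical feature.
flags <- user_share %>%
  left_join(overall_share, by = "category") %>%
  mutate(index = user_share / overall_share,
         flag  = ifelse(index > 1, "over_indexed", "under_indexed"))
```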
PS: Some data to illustrate your problem would help others to help you better.
This arrived with XGBoost 1.3.0 (released 10 December 2020) under the name feature_weights: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit . I'll edit here when I can work through or find a tutorial for it.
Say I am tracking 2 objects moving in space and time. I know their x, y coordinates and a score (the score being a probabilistic measure of the tracked point being the actual object), and I get several such {x, y, score} samples over time for each object.
What metric would I use to measure the "similarity" of, say, a ball moving across a room vs. a man moving across the room vs. a child moving across the room?
Assume the score is pretty accurate.
Given your description, I'd recommend looking into a Hidden Markov Model or possibly an artificial neural network or some other machine learning approach. However, there are a number of other techniques that might be more appropriate for your situation.
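For example, a rough sketch with the depmixS4 package, assuming each track is a data frame with columns x, y and score (the object and column names are placeholders):

```r
library(depmixS4)

# Multivariate Gaussian HMM over the track, with a small number of hidden
# motion states.
mod <- depmix(list(x ~ 1, y ~ 1, score ~ 1),
              data    = track_df,
              nstates = 3,
              family  = list(gaussian(), gaussian(), gaussian()))
fm <- fit(mod)

posterior(fm)  # most likely hidden state at each time step
logLik(fm)     # fit one model per object class and compare how well each
               # model explains a new track (ball vs. man vs. child)
```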