I am trying to understand the given protocol of recommenderlab library in R.
From the original document https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf :
Testing is perfomed by withholding items (parameter given). Breese et
al. (1998) introduced the four experimental witholding protocols
called Given 2, Given 5, Given 10 and All-but-1. During testing, the
Given x protocol presents the algorithm with only x randomly chosen
items for the test user, and the algorithm is evaluated by how well it
is able to predict the withheld items
I do not understand the fact that this is only applied to the testing users. Why can't we simply apply this to all users, and we need to split between training and testing users AND use given protocol?
Since we give only few items for the algorithm to understand patterns and produce predictions, and then test/measure how good are those predictions based on the rest of the items, why do we need to do this for the test users only, and we can't do it for all? How do the training users serve? When the algorithm is given let's say 10 items to understand the target user's patterns, doesn't it use the whole dataset (both training and testing) to compute the user neighborhood for example (UBCF) in order to make the predictions that will be later evaluated on using the withheld items? And if not, meaning that during this process it only looks at the training users (say 80% for instance) why wouldn't it look to the testing users as well? What is the problem of having a another test user in the neighborhood, and there needs to be only training users? That part I don't get.. Or my assumptions are wrong?
To conclude: Why do we both need the given protocol, as well as the split between training and testing?
I hope my question makes sense, I am really curious to find the solution. Thank you in advance :)
Related
I am trying out a novel hierarchical linear model but the data structure makes me wonder if this is even possible in R. My previous attempts at the model were incorrectly specified (oops) and now I'm not sure how to deal with this piece of work. My coursework in HLM covered multilevel models and cross-classified models, but not a 3-level double cross-classified model.
Level 1:
Responses to dichotomously scored items. Categorical dependent
variable, so I think I'm going to be using glmer(). (~1.5 million responses to items)
Level 2:
Responses are nested within items - An item will have many responses
(from different people), but a single response will not be linked to
multiple items.
Responses are also nested within testing instances - A test instance will have many responses (50), but a response cannot link to multiple test instances.
Items are not nested with test instances and test instances are not nested within items. An item will appear at multiple test instances (every time someone takes Form A) and a test instance will be related to multiple items (Each item on the test form).
Level 3:
Items are nested within test forms - A form can have several items on it, but (in this case) items cannot appear on multiple forms.
Testing instances are nested within people - A person can participate in several test instances but a testing instance can't be executed by multiple people.
Testing instances are also nested within location - A location can have several test instances there, but a test instance can't occur at multiple locations
Test forms are not nested within people, people are not nested within test forms. A person can take multiple test forms and a test form can be taken by multiple people.
People are not nested within location and locations are not nested within people - A person can take a test at multiple locations and several people can take a test at a single location.
Test forms are not nested within location and locations are not nested within test forms - A test form can be used at multiple locations; a location can be used to administer many test forms.
I hypothesize that some location variables may have an impact on performance on particular items, but I think that will be moderated by things like the ability of the person taking the test. I have explanatory variables at the location, student, and item levels that I'm interested in exploring, like noise level, GPA, and subject matter.
Please let me know if you have any questions or suggestions.
I don't see why this is a problem. "Modern" mixed model frameworks, like most of those available in R (nlme, lme4, glmmTMB, etc.) don't incorporate (or require) any explicit statement of nestedness; they only require that you specify the factors that define each grouping variable. I don't offhand see why
(1|student) + (1|location) + (1|instance) + (1|item) + (1|form)
won't work. There are a few things to consider, I don't know whether any of them are coming up in your problem:
if your responses are binary, you need to make sure that you don't have a random effect where every observation falls into a separate random-effect group (this level of variation is unidentifiable). For example, if every student answers a question at most once in every location (i.e. no-one ever takes the test more than once at a location, and they never answer a question more than once per test), then any given student/item/location combination will occur either zero or one times in your data set, and the variance of the (1|location:student:item) would be unidentifiable.
there may be some cases where interactions among your variables are identifiable (e.g., does the difficulty of a test item vary across locations?); this could be specified as (1|location:item), but you do have to be a little bit careful that the levels of your interaction don't uniquely identify observations (see the previous point)
You may find the GLMM FAQ useful, especially this section ...
PS I would definitely recommend experimenting with a subset of your original data (1.5 million items is certainly possible with glmer, but will be slow ...); you may also want to specify control=glmerControl(calc.derivs=FALSE) to skip some of the (slow) diagnostic checks
I have a data set with a set of users and a history of documents they have read, all the documents have metadata attributes (think topic, country, author) associated with them.
I want to cluster the users based on their reading history per one of the metadata attributes associated with the documents they have clicked on. This attribute has 7 possible categorical values and I want to prove a hypothesis that there is a pattern to the users' reading habits and they can be divided into seven clusters. In other words, that users will often read documents based on one of the 7 possible values in the particular metadata category.
Anyone have any advice on how to do this especially in R, like specific packages? I realize that the standard k-means algorithm won't work well in this case since the data is categorical and not numeric.
Cluster analysis cannot be used to prove anything.
The results are highly sensitive to normalization, feature selection, and choice of distance metric. So no result is trustworthy. Most results you get out are outright useless. So it's as reliable as a proof by example.
They should only be used for explorative analysis, i.e. to find patterns that you then need to study with other methods.
I am building a local OLAP cube based on data gathered from several OLTP sources. Please note that I am doing this programmatically and do not have access to tools like SSAS or MDX-based tools.
My requirements are somewhat different than the operational requirements of the OLTP system users. I know that "in theory" it would be preferable to retain the most atomic grain available to me, but I don't see a reason to include the lowest level of data in the cube.
For example (I am simplifying), I have a measure field like "Price". Additionally, each sales fact has a Version attribute with values such as:
List (Original/Initial)
Initial Quote
Adjusted Quote
Sold
These describe the internal development of our pricing and are critical to the reports that I create.
However, for my reporting purposes, I will always want to know the value of all Versions whenever I am referencing a given transaction. Therefore, I am considering pivoting measures like Price by Version in the cube (Version will still be its own entity in the data model), resulting in measures like:
PriceList
PriceQuotedInitial
PriceQuotedAdjusted
PriceSold
Since only one Version is ever effective at a given point in time, we do not need to aggregate across multiple Versions.
Known Advantages
Since this will be a local cube file, it appears this approach would
simplify the creation of several required calculated measures that compare Price
across different Versions (would not be an issue to create calculated measures at various levels of aggregation if I was doing this with MDX)
It would also reduce the number of records by a factor of between 3
and 6, which would significantly boost performance for a local cube.
Known Disadvantages
While the data model will match the business process, the cube would not store the data at the most atomic level. An analyst would need to distinguish between Versions by Measure selection, and could not filter by Version - they would always get all available Versions.
This approach will greatly increase the number of Measures. For
example, there is not just one Price we are tracking, but several
price components and other Measures we track for each transaction.
So if we track a dozen true Measures for each transaction, that
might end up being 50-60 Measures if I take this approach.
I understand that for very large Fact tables, it would be preferable to factor all possible fields out of the Fact table into Dimensions for performance purposes, but I am not sure whether this is the case when using a local cube, as in all likelihood, I will put fewer than 50,000 records into any given cube file, given the limitations of local cubes.
Are there other drawbacks to this approach that I'm missing?
Problem:
I want to suggest the top 10 most compatible matches for a particular user, by comparing his/her 'interests' with interests of all others. I'm building an undirected weighted graph between users, where the weight = match score between the two users.
I already have a set of N users: S. For any user U in S, I have a set of interests I. After a long time (a week?) I create a new user U with a set of interests and add it to S. To generate a graph for this new user, I'm comparing interest set I of the new user with the interest sets of all the users in S, iteratively. The problem is with this "all the users" part.
Let's talk about the function for comparing interests. An interest in a set of interests I is a string. I'm comparing two strings/interests using WikipediaMiner (it uses Wikipedia links to infer how closely related two strings are. eg. Billy Jean & Thriller ==> high match, Brad Pitt & Jamaica ==> low match blah blah). I've asked a question about this too (to see if there's a better solution than the one I'm currently using.
So, the above function takes non-negligible time, and in total, it'll take a HUGE time when we compare thousands (maybe millions?) of users and their hundreds of interests. For 100,000 users, I can't afford to make 100,000 user comparisons in a small time (<30sec) in this way. But, I have to give the top 10 recommendations within 30 secs, possibly a preliminary recommendation, and then improve on it in the next 1 min or so, calculate improved recommendations. Simply comparing 1 user vs the N users sequentially is too slow.
Question:
Please suggest an algorithm, method or tool using which I can improve my situation or solve my problem.
I could think of only an approach to solve the problem, since the outcomes of below stuff
depend on the nature of inter-relation between interests.
=>step:1 As your title says.Build an undirected weighted graph with interests as vertices and the weighted match between them as edges.
=>step:2 - cluster the interests. (Most complex)
Kmeans is a commonly used clustering algo, but works on based on
K-Dimensional vector space.refer wiki to see how K-means works.
it minimizes the sum of (sum of distance^2 for each point and say the center of the cluster) for all clusters. In your case, there are no dimensions available. so try if you can apply the minimizing logic applied there by creating some kind of rule, for distance between two vertices, higher match => lesser distance and vice versa (what are the different matching levels provided by wiki-miner?). chose the Mean of cluster as say the most connected vertex in the chosen set, page ranking sounds to be a good option for "figuring the most connected vertex ".
"Pair-counting F-Measure" sounds like it suit's your need (weighted graph), check for other options available.
(Note: keep modifying this step untill a right clustering algo is found and
the right calibration for distance rule, no of clusters etc are found. )
=>Step:3 - Evaluate the clusters
from here on its like calibrating a couple things to fit your need.
Examine the clusters, reevaluate :
the number of clusters , inter-cluster distance, distance between vertices inside clusters, size of clusters,
time\precision trade-off (compare final - match results without any clustring)
goto: step-2 untill this evaluation is satisfactory.
=>step:4 - Examinie new inerest
iterate thru all clusters, calculate conectivity in each cluster, sort clusters based on high connectivity, for the top x% of sorted clusters
sort and filter out the highly connected interests.
=>step:5 - Match User
reverse look up set of all users using the interests obtained out of step-4, compare all interests for both users, generate a score.
=>step:6 - Apart form the above
you can distribute the load (multiple machines can be used for clusters machine-n clusters) to multiple systems\processors, based on the traffic and stuff.
what is the application for this problem, whats the expected traffic?
Another solution to find the connectivity between the new interest and "set of interests in Cluster" C.
Wiki-Miner runs on a set of wiki documents, let me call it the UNIVERSE.
1:for each cluster fetch and maintain(index, lucene might be handy) the "set of high relevent docs"(I am calling it HRDC) out of the UNIVERSE. so you have 'N' HRDC's if you got 'N' clusters.
2:when a new interest comes find "Conectivity with Cluster" = "Hit ratio of interest in HRDC/Hit ratio of interest in UNIVERSE" for each HRDC.
3:Sort "Conectivity with Cluster"'s and choose the Highly connected clusters.
4:Either compare all the vertices in the cluster with the new interest or the highly connected vertices (using Page Ranking), depending on the time\Precision trade off , that suits you.
One flaw is that your basing your algorithms complexity on the wrong thing. The real issue is that you have to compare each unique interest against every other unique interest (and that interest against itself).
If all of the interests are unique, then there is probably nothing you can do. However, if you have a lot of duplicate interests you can perhaps speed up the algorithm this way by the following.
Create a graph that associates each interest with the users that have that interest. In such a way that allows for fast look-ups.
Create a graph that shows how each interest relates to each other interest, also in such a way that allows for fast look-ups.
Therefore, when a new user is added, their interests are compared to all other interest and stored in a graph. You can then use that information to build to build a list of users with similar interests. That list of users will then need to be filtered somehow to bring it down to the top 10.
Finally, add that user and their interests to the graph of users and interests. This is done last so that the user with the most closely matched interests isn't the user themselves.
Note:
There might be some statistical short cuts that you could do something like this: A is related to B, B is related to C, C is related to D, therefore A is related to B, C, and D. However, to use those kinds of short cuts likely requires a much better understanding of how your comparison function works, which is a bit beyond my expertise.
Approximate solution:
I forgot to mention it earlier, but what your looking when comparing users or interests is a "Nearest neighbor search" in higher dimensions. Meaning, that for exact solutions, a linear search generally works better than data structures. So approximation is probably the best way to go if you need it faster.
To obtain a quick approximate solution (without guarantees as to how close it is), you'll need a data structure that allows for quickly being able to determine which users are likely to be similar to a new user.
One way to build that structure:
Pick 300 random users. These will be the seed users for 300 clusters. Ideally, you'd use the 300 users that are least closely related, but that's probably not practical, still might be wise to ensure that the no seed user is too closely related to the other users (as a sum or average of it's comparison's to other users).
The clusters are then filled by each user joining the cluster whose representative user most closely matches it.
The top ton can then be determined by picking the top 10 users most closely related users from that cluster.
If you ensure that the number of clusters and the users per cluster is always fairly close to sqrt(number of users), then you obtain a fair approximation in O(sqrt(N)) by only checking the points within the cluster. You can improve that approximation by including users in additional clusters and checking the representative users for each cluster. The more clusters you check, the closer you get towards O(N) and an exact solution. Although, there's probably no way to say how close the current solution is to the exact solution. Chances are you start to hit dimishing returns after checking more than a total of log(sqrt(N)) clusters total. Which would put you at O(sqrt(N) log(sqrt(N))).
few thoughts ...
Not exactly a graph theory solution.
assuming a finite set of interests. for each user maintain a bit sequence where each interest is a bit representing whether the user has that interest or not.
For a new user simply multiply the bit sequence with the existing users bit sequence and find the number of bits in the result which gives an idea of how closely their interests match.
Suppose you wanted to estimate the size of a userbase of a site which does not publicize this information.
People are more likely to have acquired different usernames with different probabilities. For instance, if the username 'nick' doesn't exist on the system, it's likely to have an extremely small userbase. If the username 'starbaby' is taken, it's likely to be a much larger site. It seems like a straightforward Bayesian problem.
There is the problem that different sites may have a different space of allowable usernames. The biggest problem would be the legality of common characters such as spaces, I imagine. Another issue that could taint the prior distribution is whether the site suggests names when the one you want is taken, or leaves you to think of a more creative name yourself.
How could you build a training set of the frequency of occurrence of usernames across different sized systems? Is there a way to use Bayes to do numeric estimation rather than classification into fixed-width buckets?
What you need to do is accurately estimate the probability that a certain user name is present given the number of users registered. Lets say N is the number of users and u = 1 if user u is present and 0 if they are absent.
First of all, make the assumption that the probability distributions for each user name are independent of each other. This is not going to be true - and you've already come up with one reason why - but it will probably be necessary since it makes the data collection and the maths a lot easier.
You are going to need a lot of data from sites with registered user names and the total number of users of that site. Now, take any specific user name and imagine your data points on a 2d plot (with N on x and u on y), there's going to be one horizontal line of points at y=0 and another at y=1. You can either bin the x axis as you suggest and take the mean y coordinate of all the data points in the bin to get a discrete function, or you could try to fit the points on the graph to some class of functions. I don't really know what that class of functions that would be - maybe some kind of power law? (I'm thinking of Zipf's law).
You now have the probability distributions to apply Bayes' rule. I don't know what kind of prior for N you would want to use. A uniform distribution (up to some large number) would make no assumptions, but I would guess most sites have a small user base.
I suspect that in order to make this work, when you sample users from a site you will need to do so for a specific set of users. I'm betting that the popularity of user names is going to have a very long tail and so a random sample of users is going to give you a lot of very infrequently used names and therefore a lot of uninformative evidence.
EDIT: I had another thought; in most forums (and on StackOverflow) users have consecutive user ids, so you can use a single site with a large number of users to give you estimates for all smaller N.
I think this is a cool idea!
You may be able to put together a data set by using UserNameCheck.com for some different usernames and cross-referencing the results with the stated userbase sizes of those sites that give them out.
Note: that website does not seem to check if the usernames are valid for the site, so e.g. it thinks Gmail would let you register "nick#gmail.com" even though that's too short.
The only way is to get a large set of taken usernames on systems for which you know the size of the userbase. Data may be skewed in userbases where certain names are more common. Even a tiny userbase from a Lord of the Rings forum will likely contain the username Strider, for example.