Why does the result of ItemSimilarityJob lack the similarities of some itemId pairs?

Given that I have the following ratings.csv
userId,itemId,rating
1,1,1
1,2,2
1,3,3
2,2,4
2,3,2
2,5,4
2,6,5
3,1,5
3,3,1
3,6,2
4,4,4
Using org.apache.mahout.cf.taste.hadoop.item.RecommenderJob, we run:
hadoop jar /mahout-examples-0.13.0-job.jar \
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
--input /item-cf/ratings --output /item-cf/recommend \
--similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE \
--tempDir /item-cf/temp \
--outputPathForSimilarityMatrix /item-cf/similarity-matrix
Then Hadoop gives the following results:
# similarity matrix
1 2 0.13367660240019172
1 3 0.16952084719853724
1 6 0.14459058185587106
2 3 0.28989794855663564
2 5 0.3333333333333333
2 6 0.25
3 5 0.21089672205953397
3 6 0.18660549686337075
5 6 0.3090169943749474
# recommendations (user 4 is missing)
1 [5:2.3875139,6:2.0722904]
2 [1:3.565752]
3 [2:2.1649883,5:1.5943621]
I also used the following Python script to validate Mahout's results:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# user x item rating matrix (rows: users 1-4, columns: items 1-6; 0 = no rating)
ratings = np.array(
    [
        [1, 2, 3, 0, 0, 0],
        [0, 4, 2, 0, 4, 5],
        [5, 0, 1, 0, 0, 2],
        [0, 0, 0, 4, 0, 0],
    ]
)

# item-item similarity: 1 / (1 + Euclidean distance between item columns)
1 / (1 + euclidean_distances(ratings.T))
#
#array([[1, 0.13368, 0.16952, 0.13368, 0.13368, 0.14459],
# [0.13368, 1, 0.2899, 0.14286, 0.33333, 0.25 ],
# [0.16952, 0.2899, 1, 0.15439, 0.2109, 0.18661],
# [0.13368, 0.14286, 0.15439, 1, 0.15022, 0.12973],
# [0.13368, 0.33333, 0.2109, 0.15022, 1, 0.30902],
# [0.14459, 0.25, 0.18661, 0.12973, 0.30902, 1 ]])
However, I think Mahout gives a wrong (or at least incomplete) similarity matrix, so I have the following questions:
Why are the similarities for the item-id pairs (1, 4), (1, 5), (2, 4), (3, 4), (4, 5), and (4, 6) missing? How should Mahout's similarity matrix be explained?
In addition, why do the recommendations contain no result for user 4?
# similarity matrix --similarityClassname=[SIMILARITY_EUCLIDEAN_DISTANCE]
1 2 0.13367660240019172
1 3 0.16952084719853724
1 4 ?
1 5 ?
1 6 0.14459058185587106
2 3 0.28989794855663564
2 4 ?
2 5 0.3333333333333333
2 6 0.25
3 4 ?
3 5 0.21089672205953397
3 6 0.18660549686337075
4 5 ?
4 6 ?
5 6 0.3090169943749474
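One way to investigate the missing pairs is to count, for every pair of items, how many users rated both. Below is a minimal R sketch of that check on the ratings.csv data above; the interpretation (that Mahout's ItemSimilarityJob only emits similarities for item pairs rated by at least one common user) is an assumption of this sketch, not something stated in the output above.
# Re-create ratings.csv as a data frame (same data as above).
ratings <- data.frame(
  userId = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  itemId = c(1, 2, 3, 2, 3, 5, 6, 1, 3, 6, 4)
)

# user x item incidence matrix: 1 if the user rated the item, 0 otherwise
m <- 1 * (table(ratings$userId, ratings$itemId) > 0)

# item x item counts of common raters
cooc <- t(m) %*% m
cooc["1", "4"]; cooc["1", "5"]; cooc["2", "4"]; cooc["4", "6"]  # all 0: no common rater
cooc["1", "2"]                                                  # 1: only user 1 rated both
Every itemId pair marked "?" above has a zero count here, i.e. no user rated both items (item 4 was rated only by user 4, who rated nothing else), while every pair that does receive a similarity has at least one common rater. The same observation would also be consistent with user 4 receiving no recommendations.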

Related

index from one vector to another by closest values

Given two sorted vectors, how can you get the index of the closest values from one onto the other?
For example, given:
a = 1:20
b = seq(from=1, to=20, by=5)
how can I efficiently get the vector
c = (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)
which, for each value in a, provides the index of the largest value in b that is less than or equal to it. But the solution needs to work for unpredictable (though sorted) contents of a and b, and needs to be fast when a and b are large.
You can use findInterval, which constructs a sequence of intervals given by breakpoints in b and returns the interval indices in which the elements of a are located (see also ?findInterval for additional arguments, such as behavior at interval boundaries).
a = 1:20
b = seq(from = 1, to = 20, by = 5)
findInterval(a, b)
#> [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
We can use cut
as.integer(cut(a, breaks = unique(c(b-1, Inf)), labels = seq_along(b)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4

How do I create a simple table in R using for loops?

I was asked to create a table with three columns, A, B and C and eight rows. Column A must go 1, 1, 1, 1, 2, 2, 2, 2. Column B must alternate 1, 2, 1, 2, 1, 2, 1, 2. And column C must go 1, 1, 2, 2, 1, 1, 2, 2. I am able to produce the A column data fine, but don't know how to get B or C. This is the code I have so far:
dataSheet <- matrix(nrow = 0, ncol = 3)
colnames(dataSheet) <- c('A', 'B', 'C')
A <- 1
B <- 1
C <- 1
for (A in 1:4){
  A = 1
  dataSheet <- rbind(dataSheet, c(A, B, C))
}
for (A in 5:8){
  A = 2
  dataSheet <- rbind(dataSheet, c(A, B, C))
}
This seems like a good excuse to get familiar with the rep() function, as it easily handles this question, and many more complicated ones if you're clever enough:
dt <- data.frame(A = rep(1:2, each = 4),
                 B = rep(1:2, times = 4),
                 C = rep(1:2, each = 2))
dt
#> A B C
#> 1 1 1 1
#> 2 1 2 1
#> 3 1 1 2
#> 4 1 2 2
#> 5 2 1 1
#> 6 2 2 1
#> 7 2 1 2
#> 8 2 2 2
Created on 2019-01-26 by the reprex package (v0.2.1)
Simply use R's vectorization for this task, i.e.
A <- c(1, 1, 1, 1, 2, 2, 2, 2)
B <- c(1, 2, 1, 2, 1, 2, 1, 2) # or rep(1:2, 4)
C <- c(1, 1, 2, 2, 1, 1, 2, 2)
cbind(A,B,C)
Maybe something along the lines of the following will be acceptable to your professor.
for (i in 1:8){
  A <- if(i <= 4) 1 else 2
  B <- if(i %% 2) 1 else 2
  C <- if(any(i %% 4 == c(0, 1, 4, 5))) 1 else 2
  dataSheet <- rbind(dataSheet, c(A, B, C))
}
dataSheet
# A B C
#[1,] 1 1 1
#[2,] 1 2 2
#[3,] 1 1 2
#[4,] 1 2 1
#[5,] 2 1 1
#[6,] 2 2 2
#[7,] 2 1 2
#[8,] 2 2 1

Removing certain elements from a list of lists

I am working on a list object containing hundreds of "lists" of random integers in the following format:
assignments <- list(
  as.integer(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3)),
  as.integer(c(1, 1, 1, 0, 0, 0, 3, 3)),
  as.integer(c(1, 3, 3, 3, 3, 3, 3, 2, 2)),
  as.integer(c(1, 2, 0, 3, 2, 3, 2, 2, 2))
)
[[1]]
[1] 1 1 1 1 1 1 2 2 2 3 3
[[2]]
[1] 1 1 1 0 0 0 3 3
[[3]]
[1] 1 3 3 3 3 3 3 2 2
[[4]]
[1] 1 2 0 3 2 3 2 2 2
from which I need to extract the most frequent non-zero integer for each list. However, in some lists of this object, zero is the most frequent integer, as in the second list [[2]], and this created some problems in my analysis.
Is there any way to loop through a list of lists and remove certain elements, such as zero, from each list of this big list?
One method I experimented with earlier was to loop through this list of lists and use != to exclude values that equal zero:
for(i in assignments){i[i != 0]}
but this didn't work.
lapply(assignments,function(x) x[x!=0])
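If the end goal, as described in the question, is the most frequent non-zero integer in each list, one possible follow-up is sketched below (assuming ties may be broken by taking the first maximum):
sapply(assignments, function(x) {
  x <- x[x != 0]                          # drop the zeros first
  as.integer(names(which.max(table(x))))  # most frequent remaining value
})
#[1] 1 1 3 2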

Modification of a Vector based on elements sequence

How is it possible to transform the following vector:
x <- c(0, 0, 0, 1, 0, 3, 2, 0, 0, 0, 5, 0, 0, 0, 8)
into the desired form:
y <- c(1, 1, 1, 1, 3, 3, 2, 5, 5, 5, 5, 8, 8, 8, 8)
Any idea would be highly appreciated.
Here's another approach using only base R:
idx <- x != 0
split(x, cumsum(idx) - idx) <- x[idx]
The x-vector is now:
x
#[1] 1 1 1 1 3 3 2 5 5 5 5 8 8 8 8
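To see why this works, it may help to print the grouping variable (the intermediate values below are just illustration):
x <- c(0, 0, 0, 1, 0, 3, 2, 0, 0, 0, 5, 0, 0, 0, 8)
idx <- x != 0        # TRUE at the non-zero positions
cumsum(idx) - idx
# [1] 0 0 0 0 1 1 2 3 3 3 3 4 4 4 4
# Each run of zeros falls into the same group as the non-zero value that ends it,
# so assigning x[idx] via split<- recycles that value over the whole group.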
You can use zoo to fill NAs via the na.locf function as follows:
zoo::na.locf(replace(x, x==0, NA), fromLast = TRUE)
#[1] 1 1 1 1 3 3 2 5 5 5 5 8 8 8 8
Using rle, you can do the following in base R.
tmp <- rle(x)
tmp$values[which(tmp$values == 0)] <- tmp$values[which(tmp$values == 0) + 1L]
inverse.rle(tmp)
[1] 1 1 1 1 3 3 2 5 5 5 5 8 8 8 8
Note that this assumes the final value is not 0. If this is not the case, you could use head(which(tmp$values == 0), -1) in place of which(tmp$values == 0) to drop the final value.
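For illustration, a guarded version applied to a vector that does end in 0 (the vector x2 here is made up for the example):
x2 <- c(0, 0, 1, 0, 3, 0)
tmp <- rle(x2)
zero_runs <- head(which(tmp$values == 0), -1)   # ignore the trailing zero-run
tmp$values[zero_runs] <- tmp$values[zero_runs + 1L]
inverse.rle(tmp)
#[1] 1 1 1 3 3 0
# (the final 0 is left as-is, since there is no following value to fill with)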

calculating simple retention in R

For the dataset test, my objective is to find out how many unique users carried over from one period to the next on a period-by-period basis.
> test
user_id period
1 1 1
2 5 1
3 1 1
4 3 1
5 4 1
6 2 2
7 3 2
8 2 2
9 3 2
10 1 2
11 5 3
12 5 3
13 2 3
14 1 3
15 4 3
16 5 4
17 5 4
18 5 4
19 4 4
20 3 4
For example, in the first period there were four unique users (1, 3, 4, and 5), two of which were active in the second period. Therefore the retention rate would be 0.5. In the second period there were three unique users, two of which were active in the third period, and so the retention rate would be 0.666, and so on. How would one find the percentage of unique users that are active in the following period? Any suggestions would be appreciated.
The output would be the following:
> output
period retention
1 1 NA
2 2 0.500
3 3 0.666
4 4 0.500
The test data:
> dput(test)
structure(list(user_id = c(1, 5, 1, 3, 4, 2, 3, 2, 3, 1, 5, 5,
2, 1, 4, 5, 5, 5, 4, 3), period = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4)), .Names = c("user_id", "period"
), row.names = c(NA, -20L), class = "data.frame")
How about this? First split the users by period, then write a function that calculates the proportion carryover between any two periods, then loop it through the split list with mapply.
splt <- split(test$user_id, test$period)
carryover <- function(x, y) {
  length(unique(intersect(x, y))) / length(unique(x))
}
mapply(carryover, splt[1:(length(splt) - 1)], splt[2:length(splt)])
1 2 3
0.5000000 0.6666667 0.5000000
Here is an attempt using dplyr, though it also uses some standard syntax in the summarise:
test %>%
  group_by(period) %>%
  summarise(retention = length(intersect(user_id, test$user_id[test$period == (period + 1)])) / n_distinct(user_id)) %>%
  mutate(retention = lag(retention))
This returns:
  period retention
   <dbl>     <dbl>
1      1        NA
2      2 0.5000000
3      3 0.6666667
4      4 0.5000000
This isn't so elegant but it seems to work. Assuming df is the data frame:
# make a list to hold the unique IDs by period
uniques = list()
for(i in 1:max(df$period)){
  uniques[[i]] = unique(df$user_id[df$period == i])
}
# hold the retention rates
retentions = rep(NA, times = max(df$period))
for(j in 2:max(df$period)){
  retentions[j] = mean(uniques[[j-1]] %in% uniques[[j]])
}
Basically the %in% creates a logical of whether or not each element of the first argument is in the second. Taking a mean gives us the proportion.
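As a tiny illustration of that last step, using the period-1 and period-2 users from the example data:
unique_p1 <- c(1, 3, 4, 5)   # unique users in period 1
unique_p2 <- c(1, 2, 3)      # unique users in period 2
unique_p1 %in% unique_p2
#[1]  TRUE  TRUE FALSE FALSE
mean(unique_p1 %in% unique_p2)
#[1] 0.5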