Select unique values from a list of 3 - r

I would like to list all unique combinations of vectors of length 3 where each element of the vector can range from 1 to 9.
First I list all such combinations:
df <- expand.grid(1:9, 1:9, 1:9)
Then I would like to remove the rows that contain repetitions.
For example:
1 1 9
9 1 1
1 9 1
should only be included once.
In other words if two lines have the same numbers and the same number of each number then it should only be included once.
Note that
8 8 8 or
9 9 9 is fine as long as it only appears once.

Based on your approach and the idea to remove repetitions:
df <- expand.grid(1:2, 1:2, 1:2)
# Var1 Var2 Var3
# 1 1 1 1
# 2 2 1 1
# 3 1 2 1
# 4 2 2 1
# 5 1 1 2
# 6 2 1 2
# 7 1 2 2
# 8 2 2 2
df2 <- unique(t(apply(df, 1, sort))) #class matrix
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 2 2
# [4,] 2 2 2
df2 <- as.data.frame(df2) #class data.frame
There are probably more efficient methods, but if I understand you correctly, that is the result you want.
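As a possible shortcut, assuming the goal is one row per combination regardless of order, one could keep only the rows of the grid that are already in non-decreasing order, since every combination has exactly one such ordering. This is only a sketch and not part of the answer above; the name res is illustrative.

```r
# All ordered triples with elements from 1:9
df <- expand.grid(1:9, 1:9, 1:9)

# Keep only rows that are already sorted (Var1 <= Var2 <= Var3);
# each unordered combination has exactly one such row
res <- df[df$Var1 <= df$Var2 & df$Var2 <= df$Var3, ]

nrow(res)  # 165, i.e. choose(9 + 3 - 1, 3)
```

This avoids the row-wise sort, at the cost of hard-coding the three column names.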

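Maybe something like this (your data frame is small, so the row-wise apply calls are cheap):
len <- apply(df, 1, function(x) length(unique(x)))  # number of distinct values per row
two <- df[len == 2, ]                               # rows with exactly two distinct values
res <- rbind(df[len != 2, ], two[!duplicated(apply(two, 1, prod)), ])
Here is what is done:
First, get the number of unique elements per row (len).
The rbind then combines two pieces:
First argument of rbind: rows whose number of unique elements is 1 (e.g. 1 1 1, 7 7 7, etc.) or 3 (e.g. 5 8 7, 2 4 9, etc.) are included in the final result res as they are.
Second argument of rbind: for rows with 2 unique elements (e.g. 1 1 9, 3 5 3, etc.), we take the product per row and keep one row per unique product (because, for example, 3 3 5, 3 5 3 and 5 3 3 all have the same product).
Two caveats: rows with three distinct values are kept in every ordering (5 8 7 as well as 7 8 5), and different combinations can share a product (1 1 4 and 1 2 2 both give 4), so this does not reduce 1:9 to one row per combination. Sorting each row first, as in the other answer, avoids both problems.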

Related

Repeat vector to fill down column in data frame

Seems like this very simple maneuver used to work for me, and now it simply doesn't. A dummy version of the problem:
df <- data.frame(x = 1:5) # create simple dataframe
df
x
1 1
2 2
3 3
4 4
5 5
df$y <- c(1:5) # adding a new column with a vector of the exact same length. Works out like it should
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
df$z <- c(1:4) # trying to add a new column, this time with a vector that has fewer elements than the data frame has rows.
Error in `$<-.data.frame`(`*tmp*`, "z", value = 1:4) :
replacement has 4 rows, data has 5
I was expecting this to work with the following result:
x y z
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 1
I.e. the shorter vector should just start repeating itself automatically. I'm pretty certain this used to work for me (it's in a script that I've been running a hundred times before without problems). Now I can't even get the above dummy example to work like I want to. What am I missing?
If the vector can be evenly recycled into the data.frame, you do not get an error or a warning:
df <- data.frame(x = 1:10)
df$z <- 1:5
This may be what you were experiencing before.
You can get your vector to fit as you mention with rep_len:
df$y <- rep_len(1:3, length.out=10)
This results in
df
x z y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 1
5 5 5 2
6 6 1 3
7 7 2 1
8 8 3 2
9 9 4 3
10 10 5 1
Note that in place of rep_len, you could use the more common rep function:
df$y <- rep(1:3,len=10)
From the help file for rep:
rep.int and rep_len are faster simplified versions for two common cases. They are not generic.
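Applied to the five-row example from the question, a minimal sketch (using nrow(df) as the target length, so the same line should work whatever the number of rows) might look like this:

```r
df <- data.frame(x = 1:5)

# Recycle the shorter vector explicitly to the number of rows
df$z <- rep_len(1:4, nrow(df))

df$z  # 1 2 3 4 1
```

Being explicit about the target length means the assignment does not depend on whether the row count happens to be a multiple of the vector length.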
If the total number of rows is a multiple of the length of your new vector, it works fine. When it is not, recycling does not work everywhere. In particular, you have probably used this type of recycling with matrices:
data.frame(1:6, 1:3, 1:4) # not a multiple
# Error in data.frame(1:6, 1:3, 1:4) :
# arguments imply differing number of rows: 6, 3, 4
data.frame(1:6, 1:3) # a multiple
# X1.6 X1.3
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 1
# 5 5 2
# 6 6 3
cbind(1:6, 1:3, 1:4) # works even with not a multiple
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 2 2 2
# [3,] 3 3 3
# [4,] 4 1 4
# [5,] 5 2 1
# [6,] 6 3 2
# Warning message:
# In cbind(1:6, 1:3, 1:4) :
# number of rows of result is not a multiple of vector length (arg 3)

Match group assignments between columns

I am trying to check the accuracy rate of a clustering algorithm, with a dataframe that looks like the one here. The orig.gp refers to the original grouping, which is the "correct" group assignment. The new.gp refers to the grouping assigned by the clustering algorithm.
df <- data.frame(id = 1:9,
orig.gp = c(rep(1:3, each = 3)),
new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1) )
df
# id orig.gp new.gp
# 1 1 1 2
# 2 2 1 2
# 3 3 1 3
# 4 4 2 3
# 5 5 2 3
# 6 6 2 1
# 7 7 3 1
# 8 8 3 1
# 9 9 3 1
What I am trying to determine is whether the same ids are assigned the same grouping as the orig.gp. The group number itself is not that important, as the number is arbitrary. Ideally, I would like to achieve something like this:
# orig.gp new.gp correct
# 1 1 2 yes
# 2 1 2 yes
# 3 1 3 no
# 4 2 3 yes
# 5 2 3 yes
# 6 2 1 no
# 7 3 1 yes
# 8 3 1 yes
# 9 3 1 yes
To illustrate, in the original grouping, group 1 consists of ids 1, 2, 3; group 2 consists of ids 4, 5, 6; group 3 consists of 7, 8, 9. In the new grouping, ids 1, 2 are correctly assigned into the same group, thus the "yes" in the correct column. I would like to determine whether the same ids are assigned into the same groups as the original groupings.
Any suggestions would be appreciated!
The way I understand your problem, it is basically one of recoding. Namely, you want to identify observations that fall on the diagonal of a crosstabulation of new.gp and orig.gp, but the values of new.gp are mislabeled.
What I propose here is basically recoding the values of new.gp based on a simple crosstabulation (see tab below). The recoding is done by taking the modal value of orig.gp for each possible value of new.gp and assuming that this mode is the correct value label. I then use recode from car to perform the recoding.
library("car")
tab <- with(df, table(new.gp, orig.gp))
tab
## orig.gp
## new.gp 1 2 3
## 1 0 1 3
## 2 2 0 0
## 3 1 2 0
df$recoded <- recode(df$new.gp, paste(rownames(tab),colnames(tab)[max.col(tab)],sep='=',collapse=';'))
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
The result:
> df
orig.gp new.gp recoded correct
1 1 2 1 yes
2 1 2 1 yes
3 1 3 2 no
4 2 3 2 yes
5 2 3 2 yes
6 2 1 3 no
7 3 1 3 yes
8 3 1 3 yes
9 3 1 3 yes
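If you would rather not depend on car, the same modal-value recoding can probably be done in base R with a named lookup vector. This is a sketch under the same assumption as above (the most frequent orig.gp within each new.gp is taken as its true label); the name lookup and the choice of ties.method = "first" are mine, not part of the answer.

```r
df <- data.frame(id = 1:9,
                 orig.gp = rep(1:3, each = 3),
                 new.gp = c(2, 2, 3, 3, 3, 1, 1, 1, 1))

tab <- with(df, table(new.gp, orig.gp))

# For each new.gp label (row of tab), the orig.gp label (column) with the
# highest count; ties.method = "first" makes ties deterministic
lookup <- setNames(colnames(tab)[max.col(tab, ties.method = "first")],
                   rownames(tab))

df$recoded <- as.integer(lookup[as.character(df$new.gp)])
df$correct <- ifelse(df$orig.gp == df$recoded, "yes", "no")
df$correct
```

Note that max.col breaks ties at random by default, so fixing ties.method keeps the result reproducible when two original groups are equally common within a cluster.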

Replace values in a series exceeding a threshold

In a dataframe I'd like to replace values in a series where they exceed a given threshold.
For example, within a group ('ID') in a series designated by 'time', if 'value' ever reaches or exceeds 3, I'd like to make all following entries in that group also equal 3.
ID <- as.factor(c(rep("A", 3), rep("B",3), rep("C",3)))
time <- rep(1:3, 3)
value <- c(c(1,1,2), c(2,3,2), c(3,3,2))
dat <- cbind.data.frame(ID, time, value)
dat
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 2
C 1 3
C 2 3
C 3 2
I'd like it to be:
ID time value
A 1 1
A 2 1
A 3 2
B 1 2
B 2 3
B 3 3
C 1 3
C 2 3
C 3 3
This should be easy, but I can't figure it out. Thanks!
The ave function makes this very easy by allowing you to apply a function to each of the groupings. In this case, we will adapt the cummax (cumulative maximum) function to see if we've seen a 3 yet.
dat$value2<-with(dat, ave(value, ID, FUN=
function(x) ifelse(cummax(x)>=3, 3, x)))
dat;
# ID time value value2
# 1 A 1 1 1
# 2 A 2 1 1
# 3 A 3 2 2
# 4 B 1 2 2
# 5 B 2 3 3
# 6 B 3 2 3
# 7 C 1 3 3
# 8 C 2 3 3
# 9 C 3 2 3
You could also just use FUN=cummax if you want never-decreasing values. I wasn't sure about the sequence c(1,2,1) if you wanted to keep that unchanged or not.
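To make the difference between the two choices concrete, here is a small sketch on single vectors (the names x, y and the result variables are just for illustration):

```r
x <- c(1, 2, 1)
plain  <- cummax(x)                     # 1 2 2: the final 1 is raised to 2
capped <- ifelse(cummax(x) >= 3, 3, x)  # 1 2 1: unchanged, since 3 is never reached

y <- c(2, 3, 2)
capped_y <- ifelse(cummax(y) >= 3, 3, y)  # 2 3 3: everything from the first 3 on becomes 3
```

So plain cummax changes any series that ever decreases, while the ifelse version only touches entries at or after the first value of 3 or more.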
If you can assume your data are sorted by group, then this should be fast, essentially relying on findInterval() behind the scenes:
library(IRanges)
id <- Rle(ID)
three <- which(value>=3L)
ir <- reduce(IRanges(three, end(id)[findRun(three, id)]))
dat$value[as.integer(ir)] <- 3L
This avoids looping over the groups.

Maximum and mean lengths of streaks/runs of identical responses

We have a dataset with ID numbers in the first column and then responses to each of 240 questions in the following 240 columns. We'd like to assess the validity of the responses for each subject by finding the maximum and mean of the lengths of streaks or runs of identical responses. For example, if a subject responded (1, 1, 1, 2, 2, 5, 5, 5, 5, 1) to ten questions, the maximum would be 4 and the mean would be 2.5.
I have tried to solve this problem in R using rle(), but after I apply rle() to every row of the data frame I can't extract the lengths. Once I extract the lengths, I think it would be relatively easy to apply max() and mean(). Any help or advice on getting to that point would be appreciated.
There are two more issues that are minor and don't necessarily need to be answered here. The first is that it would be even more informative to find the maximum and mean per response (there are five possible responses, namely, 1 through 5). In the example above, the maxima and means for 1, 2, and 5 would be, respectively, 3 and 2, 2 and 2, and 4 and 4. The second is that I don't know how to apply rle() to the 240 responses exclusively, i.e. and not also to the ID number. I've been deleting the ID number column before manipulating the data frame in R, which is fine, but will lead to error if I unintentionally rearrange the rows.
Thank you!
The rle function returns a list (of class "rle") with two components, lengths and values. This is not immediately obvious, because R lets authors define what is printed when you type the name of an object, and the authors of rle have made it print a compact summary rather than the raw list. To find out the structure of an object, you can use str, for example
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)
str(codes)
You can get at the lengths by typing codes$lengths and similarly for the corresponding values.
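For the example vector from the question, a short sketch of what those components contain and how the two summary statistics follow from them:

```r
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)

codes$lengths        # 3 2 4 1  (length of each run)
codes$values         # 1 2 5 1  (the value repeated in each run)

max(codes$lengths)   # 4
mean(codes$lengths)  # 2.5
```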
Anyway, notwithstanding the statistical issues, here is how to do what you want. Suppose you have 30 subjects and they have responded to eight questions. Your data might look like this
set.seed(123)
responses <- data.frame(matrix(sample(0:5, 8*30, replace=TRUE), ncol=8))
> head(responses)
X1 X2 X3 X4 X5 X6 X7 X8
1 3 2 4 2 4 1 1 5
2 1 5 2 1 5 3 1 1
3 1 3 1 2 3 5 5 3
4 4 4 5 3 4 2 4 2
5 5 5 2 5 3 1 2 4
6 3 3 3 3 1 1 3 2
You can extract the maximum lengths of the runs for each subject like this:
> max.lengths <- apply(responses, 1, function(x) max(rle(x)$lengths))
> max.lengths
[1] 2 2 2 2 2 4 3 1 1 2 2 1 2 3 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1
The max length was 2 for the first 5 subjects and 4 for the sixth subject, so it looks right.
Similarly for the mean lengths
> mean.lengths <- apply(responses, 1, function(x) mean(rle(x)$lengths))
> head(mean.lengths)
[1] 1.142857 1.142857 1.142857 1.142857 1.142857 2.000000
For example, the mean length for the first person was the mean of $1,1,1,1,1,2,1$ which is $8/7$, which agrees with what R says.
To break down the whole thing by response, you can use the same ideas and the tapply function like this:
bd <- function(x){
means <- tapply(x$lengths, factor(x$values,levels=0:5), mean)
means[is.na(means)] <- 0
maxes <- tapply(x$lengths, factor(x$values,levels=0:5), max)
maxes[is.na(maxes)] <- 0
M <- rbind(means, maxes)
rownames(M) <- c("mean", "max")
M
}
lapply(apply(responses, 1, rle), bd)
This outputs another list. For example, if you scroll up, you will see that for subject 25, it says
[[25]]
0 1 2 3 4 5
mean 0 1 2 1 0 2
max 0 1 2 1 0 2
compare with
> responses[25,]
X1 X2 X3 X4 X5 X6 X7 X8
25 3 5 5 3 2 2 1 3
so it is giving the correct answer. You can give this list a name, for example
break.downs <- lapply(apply(responses, 1, rle), bd)
and then you can access the entry for subject i by typing
break.downs[[i]]
For the problem with the ID number column, if it's included, say as column 1, you can just do the whole analysis to responses[ ,-1] and that should be OK. The $-1$ just deletes the first column.
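One way to keep the IDs attached to the results, so that reordering rows cannot mismatch them, is to compute on responses[, -1] and bind the IDs back on. The toy data and the column names id, q1 to q6, max.run and mean.run below are hypothetical, chosen only to keep the sketch self-contained:

```r
# Hypothetical toy data: an id column followed by six responses
responses <- data.frame(id = c(101, 102),
                        q1 = c(1, 2), q2 = c(1, 2), q3 = c(1, 3),
                        q4 = c(2, 3), q5 = c(2, 3), q6 = c(5, 3))

answers <- responses[, -1]  # drop the id column for the run-length computation

result <- data.frame(
  id       = responses$id,
  max.run  = apply(answers, 1, function(x) max(rle(x)$lengths)),
  mean.run = apply(answers, 1, function(x) mean(rle(x)$lengths))
)
result
```

Because apply works row by row, each row of result corresponds to the same row of responses, and the id column travels with it.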
PS. Sorry, I just noticed that I did it with responses $0$ to $5$ instead of $1$ to $5$, but you just need to change levels=0:5 to levels=1:5 in the bd function and it should work just as well.
I am partial to the data.table package. To use it, first reshape to long format. Then use rle (making sure to take the first list element of the result, using [[1]]), take the max/mean, and group by the respondent ID.
Here is an example with five respondents and 10 questions:
library(data.table)
set.seed(8028)
responses <- data.frame(cbind(id=1:5,matrix(sample(1:5, 10*5, replace=T), nc=10)))
responses
# id V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1 1 3 4 2 5 1 2 4 4 1 3
# 2 2 2 2 4 5 5 2 3 3 3 1
# 3 3 5 1 3 3 4 4 1 4 2 2
# 4 4 3 2 4 5 2 2 1 4 1 3
# 5 5 5 2 4 5 3 1 4 1 2 4
responses.long<-data.table(reshape(responses, idvar="id", varying=list(2:11), direction="long"),key=c("id","time"))
responses.long[,list(run=max(rle(V2)[[1]]), mean=mean(rle(V2)[[1]])), by="id"]
# id run mean
# 1: 1 2 1.111111
# 2: 2 3 1.666667
# 3: 3 2 1.428571
# 4: 4 2 1.111111
# 5: 5 1 1.000000
Wouldn't this question be more appropriate for StackOverflow?

How to transform a list of user ratings into a matrix in R

I am working on a collaborative filtering problem, and I am having problems reshaping my raw data into a user-rating matrix. I am given a rating database with columns 'movie', 'user' and 'rating'. From this database, I would like to obtain a matrix of size #users x #movies, where each row indicates a user's ratings.
Here is a minimal working example:
# given this:
ratingDB <- data.frame(rbind(c(1,1,1),c(1,2,NA),c(1,3,0), c(2,1,1), c(2,2,1), c(2,3,0),
c(3,1,NA), c(3,2,NA), c(3,3,1)))
names(ratingDB) <- c('user', 'movie', 'liked')
#how do I get this?
userRating <- matrix(data = rbind(c(1,NA,0), c(1,1,0), c(NA,NA,1)), nrow=3)
I can solve the problem using two for loops, but this of course doesn't scale well. Can anybody help me with a vectorized solution?
This can be done without any loop. It works with the function matrix:
# sort the 'liked' values (this is not necessary for the example data)
vec <- with(ratingDB, liked[order(user, movie)])
# create a matrix
matrix(vec, nrow = length(unique(ratingDB$user)), byrow = TRUE)
[,1] [,2] [,3]
[1,] 1 NA 0
[2,] 1 1 0
[3,] NA NA 1
This will transform the vector stored in ratingDB$liked to a matrix. The argument byrow = TRUE allows arranging the data in rows (the default is by columns).
Update: What to do if the NA cases are not in the data frame?
(see comment by @steffen)
First, remove the rows containing NA:
subDB <- ratingDB[complete.cases(ratingDB), ]
user movie liked
1 1 1 1
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 0
9 3 3 1
The full data frame can be reconstructed. The function expand.grid is used to generate all combinations of user and movie:
full <- setNames(with(subDB, expand.grid(sort(unique(user)), sort(unique(movie)))),
c("user", "movie"))
user movie
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
Now, the information of the sub data frame subDB and the full combination data frame full can be combined with the merge function:
ratingDB_2 <- merge(full, subDB, all = TRUE)
user movie liked
1 1 1 1
2 1 2 NA
3 1 3 0
4 2 1 1
5 2 2 1
6 2 3 0
7 3 1 NA
8 3 2 NA
9 3 3 1
The result is identical to the original data frame ratingDB. Hence, the same procedure can be applied to transform it to a matrix of liked values:
matrix(ratingDB_2$liked, nrow = length(unique(ratingDB_2$user)), byrow = TRUE)
[,1] [,2] [,3]
[1,] 1 NA 0
[2,] 1 1 0
[3,] NA NA 1
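An alternative that skips the expand.grid and merge steps, sketched here under the assumption that only the observed ratings are available: start from a matrix of NA and fill it by indexing with a two-column matrix of (row, column) positions. The names users, movies and m are illustrative.

```r
ratingDB <- data.frame(rbind(c(1,1,1), c(1,2,NA), c(1,3,0),
                             c(2,1,1), c(2,2,1),  c(2,3,0),
                             c(3,1,NA), c(3,2,NA), c(3,3,1)))
names(ratingDB) <- c("user", "movie", "liked")
subDB <- ratingDB[complete.cases(ratingDB), ]  # only the observed ratings

users  <- sort(unique(subDB$user))
movies <- sort(unique(subDB$movie))

# Start with all NA, then fill the observed (user, movie) cells in one step
m <- matrix(NA_real_, nrow = length(users), ncol = length(movies),
            dimnames = list(user = users, movie = movies))
m[cbind(match(subDB$user, users), match(subDB$movie, movies))] <- subDB$liked
m
```

This does not rely on the rows being sorted or complete, and a user or movie with no observed rating at all simply does not appear in the matrix.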
