Maximum and mean lengths of streaks/runs of identical responses - r

We have a dataset with ID numbers in the first column and then responses to each of 240 questions in the following 240 columns. We'd like to assess the validity of the responses for each subject by finding the maximum and mean of the lengths of streaks or runs of identical responses. For example, if a subject responded (1, 1, 1, 2, 2, 5, 5, 5, 5, 1) to ten questions, the maximum would be 4 and the mean would be 2.5.
I have tried to solve this problem in R using rle(), but after I apply rle() to every row of the data frame I can't extract the lengths. Once I extract the lengths, I think it would be relatively easy to apply max() and mean(). Any help or advice on getting to that point would be appreciated.
There are two more issues that are minor and don't necessarily need to be answered here. The first is that it would be even more informative to find the maximum and mean per response (there are five possible responses, namely, 1 through 5). In the example above, the maxima and means for 1, 2, and 5 would be, respectively, 3 and 2, 2 and 2, and 4 and 4. The second is that I don't know how to apply rle() to the 240 responses exclusively, i.e. and not also to the ID number. I've been deleting the ID number column before manipulating the data frame in R, which is fine, but will lead to error if I unintentionally rearrange the rows.
Thank you!

The rle function returns a list, but this is not immediately obvious because it is possible to make R print whatever you want when you type the name of an object and the authors of rle have made it print something else. In order to find out the structure of an object, you can use str, for example
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)
str(codes)
You can get at the lengths by typing codes$lengths and similarly for the corresponding values.
Anyway, notwithstanding the statistical issues, here is how to do what you want. Suppose you have 30 subjects and they have responded to eight questions. Your data might look like this
set.seed(123)
repsonses <- data.frame(matrix(sample(0:5, 8*30, replace=T), nc=8))
> head(responses)
X1 X2 X3 X4 X5 X6 X7 X8
1 3 2 4 2 4 1 1 5
2 1 5 2 1 5 3 1 1
3 1 3 1 2 3 5 5 3
4 4 4 5 3 4 2 4 2
5 5 5 2 5 3 1 2 4
6 3 3 3 3 1 1 3 2
You can extract the maximum lengths of the runs for each subject like this:
> max.lengths <- apply(responses, 1, function(x) max(rle(x)$lengths))
> max.lengths
[1] 2 2 2 2 2 4 3 1 1 2 2 1 2 3 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1
The max length was 2 for the first 5 subjects and 4 for the sixth subject, so it looks right.
Similarly for the mean lengths
> mean.lengths <- apply(responses, 1, function(x) mean(rle(x)$lengths))
> head(mean.lengths)
[1] 1.142857 1.142857 1.142857 1.142857 1.142857 2.000000
For example, the mean length for the first person was the mean of $1,1,1,1,1,2,1$ which is $8/7$, which agrees with what R says.
To break down the whole thing by response, you can use the same ideas and the tapply function like this:
bd <- function(x){
means <- tapply(x$lengths, factor(x$values,levels=0:5), mean)
means[is.na(means)] <- 0
maxes <- tapply(x$lengths, factor(x$values,levels=0:5), max)
maxes[is.na(maxes)] <- 0
M <- rbind(means, maxes)
rownames(M) <- c("mean", "max")
M
}
lapply(apply(responses, 1, rle), bd)
This outputs another list. For example, if you scroll up, you will see that for subject 25, it says
[[25]]
0 1 2 3 4 5
mean 0 1 2 1 0 2
max 0 1 2 1 0 2
compare with
> responses[25,]
X1 X2 X3 X4 X5 X6 X7 X8
25 3 5 5 3 2 2 1 3
so it is giving the correct answer. You can give this list a name, for example
break.downs <- lapply(apply(responses, 1, rle), bd)
and then you can access the entry for subject i by typing
break.downs[[i]]
For the problem with the ID number column, if it's included, say as column 1, you can just do the whole analysis to responses[ ,-1] and that should be OK. The $-1$ just deletes the first column.
PS. Sorry, I just noticed that I did it with repsonses $0$ to $5$ instead of $1$ to $5$, but you just need to change levels=0:5 to levels=1:5 in the bd function and it should work just as well.

I am partial to the data.table package. To use it, first reshape to long format. Then use rle (making sure to take the first list element of the result, using [[1]]), take the max/mean, and group by the respondent ID.
Here is an example with five respondents and 10 questions:
library(data.table)
set.seed(8028)
responses <- data.frame(cbind(id=1:5,matrix(sample(1:5, 10*5, replace=T), nc=10)))
responses
# id V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1 1 3 4 2 5 1 2 4 4 1 3
# 2 2 2 2 4 5 5 2 3 3 3 1
# 3 3 5 1 3 3 4 4 1 4 2 2
# 4 4 3 2 4 5 2 2 1 4 1 3
# 5 5 5 2 4 5 3 1 4 1 2 4
responses.long<-data.table(reshape(responses, idvar="id", varying=list(2:11), direction="long"),key=c("id","time"))
responses.long[,list(run=max(rle(V2)[[1]]), mean=mean(rle(V2)[[1]])), by="id"]
# id run mean
# 1: 1 2 1.111111
# 2: 2 3 1.666667
# 3: 3 2 1.428571
# 4: 4 2 1.111111
# 5: 5 1 1.000000
Wouldn't this question by more appropriate for StackOverflow?

Related

Reorder (collate) vector elements automatically

It's an easy one, but I can find a simple solution for my problem. I have several vectors look like this one: rep(1:3, each = 3) and I want to convert them to like rep(1:3, times = 3).
So each element is repeated multiple times c(1,1,1,2,2,2,3,3,3) and I want to reorder them to c(1,2,3,1,2,3,1,2,3). How can I achieve that?
You can use a matrix transpose:
as.vector(t(matrix(x, nrow = 3)))
# [1] 1 2 3 1 2 3 1 2 3
v1 <- c(1,1,1,2,2,2,3,3,3)
o1 <- rle(v1)
rep(o1$values, min(o1$length))
[1] 1 2 3 1 2 3 1 2 3
This allows for unknown amount of numbers or strings but expects each value to be present in equal numbers. It only has some flexibility on what you want to do on some values occuring more than others.
Consider:
v2 <- c(1,1,1,2,2,2,3,3,3,3)
o2 <- rle(v2)
rep(o2$values, min(o2$length))
[1] 1 2 3 1 2 3 1 2 3
rep(o2$values, max(o2$length))
[1] 1 2 3 1 2 3 1 2 3 1 2 3

Select unique values from a list of 3

I would like to list all unique combinations of vectors of length 3 where each element of the vector can range between 1 to 9.
First I list all such combinations:
df <- expand.grid(1:9, 1:9, 1:9)
Then I would like to remove the rows that contain repetitions.
For example:
1 1 9
9 1 1
1 9 1
should only be included once.
In other words if two lines have the same numbers and the same number of each number then it should only be included once.
Note that
8 8 8 or
9 9 9 is fine as long as it only appears once.
Based on your approach and the idea to remove repetitions:
df <- expand.grid(1:2, 1:2, 1:2)
# Var1 Var2 Var3
# 1 1 1 1
# 2 2 1 1
# 3 1 2 1
# 4 2 2 1
# 5 1 1 2
# 6 2 1 2
# 7 1 2 2
# 8 2 2 2
df2 <- unique(t(apply(df, 1, sort))) #class matrix
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] 1 1 2
# [3,] 1 2 2
# [4,] 2 2 2
df2 <- as.data.frame(df2) #class data.frame
There are probably more efficient methods, but if I understand you correct, that is the result you want.
Maybe something like this (since your data frame is not large, so it does not pain!):
len <- apply(df,1,function(x) length(unique(x)))
res <- rbind(df[len!=2,], df[unique(apply(df[len==2,],1,prod)),])
Here is what is done:
Get the number of unique elements per row
Comprises two steps:
First argument of rbind: Those with length either 1 (e.g. 1 1 1, 7 7 7, etc) or 3 (e.g. 5 8 7, 2 4 9, etc) are included in the final results res.
Second argument of rbind: For those in which the number of unique elements are 2 (e.g. 1 1 9, 3 5 3, etc), we apply product per row and take whose unique products (cause, for example, the product of 3 3 5 and 3 5 3 and 5 3 3 are the same)

apply conditional numbering to grouped data in R

I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to try to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column and basically renumbering it starting at 1 from least to greatest, but it's important to note that the the number in the ReRank column can be put in more than once depending on the numbers in the Rank column .
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to be able to ReRank the the columns by the grouped columns and the value of the Rank columns.
This has been my best guess so far, but it definitely is not doing what I need it to do
ReRANK = mat.or.vec(length(RANK),1)
ReRANK[1] = counter = 1
for(i in 2:length(RANK)) {
if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
ReRANK[i] = counter
}
Thank you in advance for the help!!
Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The min argument in rank assures that the minimum ranking will occur when there are ties. the default is to take the mean of the ranks.
In the case that you have ties lower down in the groups, rank will count those lower values and then add continue with the next lowest value as the count of the lower values + 1. These values wil still be ordered and distinct. If you really want to have the count be 1, 2, 3, and so on rather than 1, 3, 6 or whatever depending on the number of duplicate values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
as.integer(factor(rank(i, ties.method="min"))))
Here, we use factor to build values counting from upward for each level. We then coerce it to be an integer.
For example,
temp <- c(rep(1, 3), 2,5,1,4,3,7)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")

Using factors in R programming

If I have the code:
x <- c(rnorm(10),runif(10), rnorm(10,1))
f <- gl(3,10)
f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x,f,mean)
1 2 3
0.07368817 0.42992416 0.64212383
How are the 1,2,3's decided? I am assuming they are levels of something.
Furthermore, why is f used in the second argument, I dont see why it is an index and how does it know when to stop running through the index?.
I tried looking up the function definition but to no avail.
If you are asking about how tapply works (rather than gl) consider another simpler example:
> x1 <- c(1,1,2,2,3,3)
> tapply(x1, x1, mean)
1 2 3
1 2 3
> f2 <- c(2,2,2,2,3,3)
> tapply(x1, f2, mean)
2 3
1.5 3.0
In the first case, tapply has picked the first two items (indices), and found their mean
giving 1 for 1, then the next two items (2 and 2) having mean 2 etc.
In the second case, the first 4 items are treated as 2's, having mean (1+1+2+2)/4, and the last two and 3's having mean (3+3)/2
In effect, then "index" is labelling the data, and applying the requested function to each "group"

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1,nyear),rep(2,nyear),rep(3,nyear),rep(4,nyear),
rep(5,nyear),rep(6,nyear),rep(7,nyear),rep(8,nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.numeric(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor. E.g. by number of groups, a list of group sizes, or by having groups start when the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z") the grouping factor would be c(1,1,2,3,3).

Resources