Using factors in R programming

If I have the code:
x <- c(rnorm(10),runif(10), rnorm(10,1))
f <- gl(3,10)
f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
tapply(x,f,mean)
1 2 3
0.07368817 0.42992416 0.64212383
How are the 1,2,3's decided? I am assuming they are levels of something.
Furthermore, why is f used as the second argument? I don't see why it is an index, and how does it know when to stop running through the index?
I tried looking up the function definition but to no avail.

If you are asking about how tapply works (rather than gl) consider another simpler example:
> x1 <- c(1,1,2,2,3,3)
> tapply(x1, x1, mean)
1 2 3
1 2 3
> f2 <- c(2,2,2,2,3,3)
> tapply(x1, f2, mean)
2 3
1.5 3.0
In the first case, tapply groups the first two items (both labelled 1) and finds their mean, giving 1 for group 1; the next two items (2 and 2) have mean 2, and so on.
In the second case, the first four items are treated as 2's, having mean (1+1+2+2)/4 = 1.5, and the last two as 3's, having mean (3+3)/2 = 3.
In effect, the "index" labels the data, and the requested function is applied to each "group".
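One way to see what tapply does with its second argument is to rebuild the result by hand. The following is a minimal sketch (the names groups, by_hand, via_tapply and f are illustrative only): split() partitions the data by label, and the function is then applied to each piece. It also shows that gl(3, 10) simply generates thirty labels, ten per level, so the index holds exactly one label per data element and no separate stopping rule is needed.

```r
x1 <- c(1, 1, 2, 2, 3, 3)
f2 <- c(2, 2, 2, 2, 3, 3)

# Step 1: partition the data by label; one list element per distinct label
groups <- split(x1, f2)

# Step 2: apply the function to each group
by_hand <- sapply(groups, mean)

# tapply performs both steps in a single call
via_tapply <- tapply(x1, f2, mean)

# gl(3, 10) builds a factor with 3 levels, each repeated 10 times
f <- gl(3, 10)
length(f)   # one label for each element of the data vector
levels(f)   # the group names that appear in the tapply output
```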

Related

Reorder (collate) vector elements automatically

It's an easy one, but I can't find a simple solution for my problem. I have several vectors that look like this one: rep(1:3, each = 3) and I want to convert them to look like rep(1:3, times = 3).
So each element is repeated multiple times c(1,1,1,2,2,2,3,3,3) and I want to reorder them to c(1,2,3,1,2,3,1,2,3). How can I achieve that?
You can use a matrix transpose:
as.vector(t(matrix(x, nrow = 3)))
# [1] 1 2 3 1 2 3 1 2 3
v1 <- c(1,1,1,2,2,2,3,3,3)
o1 <- rle(v1)
rep(o1$values, min(o1$lengths))
[1] 1 2 3 1 2 3 1 2 3
This allows for an unknown number of distinct values (numbers or strings) but expects each value to occur equally often. It offers only limited flexibility when some values occur more often than others.
Consider:
v2 <- c(1,1,1,2,2,2,3,3,3,3)
o2 <- rle(v2)
rep(o2$values, min(o2$lengths))
[1] 1 2 3 1 2 3 1 2 3
rep(o2$values, max(o2$lengths))
[1] 1 2 3 1 2 3 1 2 3 1 2 3
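If the counts are unequal and every original element should be kept (rather than truncating or padding as above), one possible alternative is to sort by each element's position within its own group. This is a sketch, not part of the answers above; the names pos and collated are illustrative.

```r
v2 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)

# Position of each element within its own group of equal values:
# first 1, first 2, first 3 all get position 1, and so on
pos <- ave(seq_along(v2), v2, FUN = seq_along)

# Ordering by that position interleaves the groups; ties keep their
# original order, so the values cycle 1, 2, 3, ...
collated <- v2[order(pos)]
collated
```

Here the surplus fourth 3 ends up at the tail of the result instead of being dropped.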

how to compare and select the minimum of two features in R?

Assume I have the following dataset:
dt<-data.frame(X=sample(5),Y=sample(5))
Now I need to compare these two features row by row and select the smaller one.
X Y
1 4 3
2 5 2
3 2 4
4 3 5
5 1 1
Then the expected answer would be
3
2
2
3
1
I know
min(dt[1,])
could be helpful but it only gives me 1
Use pmin, which is the vectorized version of min:
pmin(dt$X,dt$Y)
Like thus:
> dt<-data.frame(X=sample(5),Y=sample(5))
> dt
X Y
1 3 2
2 4 3
3 1 5
4 2 4
5 5 1
> pmin(dt$X,dt$Y)
[1] 2 3 1 2 1
low <- apply(dt[,c("X","Y")], 1, min)
is another implementation (use max instead of min for the row-wise maximum).
A result of integer(0), i.e. a zero-length vector, occurs when one of X or Y has length 0:
For min or max, a length-one vector. For pmin or pmax, a vector of length the longest of the input vectors, or length zero if one of the inputs had zero length.
(from documentation)
max(which(1:3 == 5),10) works but pmax(which(1:3 == 5),10) gives integer(0)
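The difference between min and pmin can be checked on the fixed data from the question (sample() is avoided here so the result is reproducible). This is a small sketch; the variable names are illustrative, and the do.call form is one way to extend the idea to any number of columns.

```r
dt <- data.frame(X = c(4, 5, 2, 3, 1), Y = c(3, 2, 4, 5, 1))

# min() collapses everything into a single value
overall_min <- min(dt$X, dt$Y)

# pmin() compares element by element and returns one value per row
row_min <- pmin(dt$X, dt$Y)

# A data frame is a list of columns, so every column can be passed to pmin at once
row_min_any <- do.call(pmin, dt)

# A zero-length input makes the parallel version return a zero-length result
empty <- pmax(integer(0), 10)
length(empty)
```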

Maximum and mean lengths of streaks/runs of identical responses

We have a dataset with ID numbers in the first column and then responses to each of 240 questions in the following 240 columns. We'd like to assess the validity of the responses for each subject by finding the maximum and mean of the lengths of streaks or runs of identical responses. For example, if a subject responded (1, 1, 1, 2, 2, 5, 5, 5, 5, 1) to ten questions, the maximum would be 4 and the mean would be 2.5.
I have tried to solve this problem in R using rle(), but after I apply rle() to every row of the data frame I can't extract the lengths. Once I extract the lengths, I think it would be relatively easy to apply max() and mean(). Any help or advice on getting to that point would be appreciated.
There are two more issues that are minor and don't necessarily need to be answered here. The first is that it would be even more informative to find the maximum and mean per response (there are five possible responses, namely, 1 through 5). In the example above, the maxima and means for 1, 2, and 5 would be, respectively, 3 and 2, 2 and 2, and 4 and 4. The second is that I don't know how to apply rle() to the 240 responses exclusively, i.e. and not also to the ID number. I've been deleting the ID number column before manipulating the data frame in R, which is fine, but will lead to error if I unintentionally rearrange the rows.
Thank you!
The rle function returns a list, but this is not immediately obvious because it is possible to make R print whatever you want when you type the name of an object and the authors of rle have made it print something else. In order to find out the structure of an object, you can use str, for example
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)
str(codes)
You can get at the lengths by typing codes$lengths and similarly for the corresponding values.
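For the ten-answer example in the question, the pieces can be pulled out as follows. This is a short sketch; the names longest and average are illustrative.

```r
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)

# An rle object is a list with two components of equal length
codes$lengths   # how long each run is
codes$values    # which response each run consists of

# The two summaries asked for in the question
longest <- max(codes$lengths)
average <- mean(codes$lengths)
```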
Anyway, notwithstanding the statistical issues, here is how to do what you want. Suppose you have 30 subjects and they have responded to eight questions. Your data might look like this
set.seed(123)
responses <- data.frame(matrix(sample(0:5, 8*30, replace=TRUE), ncol=8))
> head(responses)
X1 X2 X3 X4 X5 X6 X7 X8
1 3 2 4 2 4 1 1 5
2 1 5 2 1 5 3 1 1
3 1 3 1 2 3 5 5 3
4 4 4 5 3 4 2 4 2
5 5 5 2 5 3 1 2 4
6 3 3 3 3 1 1 3 2
You can extract the maximum lengths of the runs for each subject like this:
> max.lengths <- apply(responses, 1, function(x) max(rle(x)$lengths))
> max.lengths
[1] 2 2 2 2 2 4 3 1 1 2 2 1 2 3 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1
The max length was 2 for the first 5 subjects and 4 for the sixth subject, so it looks right.
Similarly for the mean lengths
> mean.lengths <- apply(responses, 1, function(x) mean(rle(x)$lengths))
> head(mean.lengths)
[1] 1.142857 1.142857 1.142857 1.142857 1.142857 2.000000
For example, the mean length for the first person was the mean of $1,1,1,1,1,2,1$ which is $8/7$, which agrees with what R says.
To break down the whole thing by response, you can use the same ideas and the tapply function like this:
bd <- function(x){
  means <- tapply(x$lengths, factor(x$values,levels=0:5), mean)
  means[is.na(means)] <- 0
  maxes <- tapply(x$lengths, factor(x$values,levels=0:5), max)
  maxes[is.na(maxes)] <- 0
  M <- rbind(means, maxes)
  rownames(M) <- c("mean", "max")
  M
}
lapply(apply(responses, 1, rle), bd)
This outputs another list. For example, if you scroll up, you will see that for subject 25, it says
[[25]]
0 1 2 3 4 5
mean 0 1 2 1 0 2
max 0 1 2 1 0 2
compare with
> responses[25,]
X1 X2 X3 X4 X5 X6 X7 X8
25 3 5 5 3 2 2 1 3
so it is giving the correct answer. You can give this list a name, for example
break.downs <- lapply(apply(responses, 1, rle), bd)
and then you can access the entry for subject i by typing
break.downs[[i]]
For the problem with the ID number column, if it's included, say as column 1, you can just do the whole analysis to responses[ ,-1] and that should be OK. The $-1$ just deletes the first column.
PS. Sorry, I just noticed that I did it with responses $0$ to $5$ instead of $1$ to $5$, but you just need to change levels=0:5 to levels=1:5 in the bd function and it should work just as well.
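To keep the ID attached to the results, so that reordering rows cannot mismatch them, one option is to compute the statistics on the response columns only and then bind the ID back on. The sketch below uses a made-up two-subject data frame; the column names id, q1 to q4 and the helper names answers, run_stats and result are assumptions for illustration.

```r
responses <- data.frame(id = c(101, 102),
                        q1 = c(1, 2), q2 = c(1, 3),
                        q3 = c(1, 2), q4 = c(2, 3))

# Drop the ID column before analysing the answers
answers <- as.matrix(responses[, -1])

# One row of statistics per subject
run_stats <- t(apply(answers, 1, function(r) {
  len <- rle(r)$lengths
  c(max = max(len), mean = mean(len))
}))

# Re-attach the ID so each result stays tied to its subject
result <- data.frame(id = responses$id, run_stats)
result
```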
I am partial to the data.table package. To use it, first reshape to long format. Then use rle (making sure to take the first list element of the result, using [[1]]), take the max/mean, and group by the respondent ID.
Here is an example with five respondents and 10 questions:
library(data.table)
set.seed(8028)
responses <- data.frame(cbind(id=1:5,matrix(sample(1:5, 10*5, replace=TRUE), ncol=10)))
responses
# id V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1 1 3 4 2 5 1 2 4 4 1 3
# 2 2 2 2 4 5 5 2 3 3 3 1
# 3 3 5 1 3 3 4 4 1 4 2 2
# 4 4 3 2 4 5 2 2 1 4 1 3
# 5 5 5 2 4 5 3 1 4 1 2 4
responses.long<-data.table(reshape(responses, idvar="id", varying=list(2:11), direction="long"),key=c("id","time"))
responses.long[,list(run=max(rle(V2)[[1]]), mean=mean(rle(V2)[[1]])), by="id"]
# id run mean
# 1: 1 2 1.111111
# 2: 2 3 1.666667
# 3: 3 2 1.428571
# 4: 4 2 1.111111
# 5: 5 1 1.000000
Wouldn't this question be more appropriate for StackOverflow?

Replace some component value in a vector with some other value

In R, in a vector, i.e. a 1-dim matrix, I would like to change components with value 3 to value 1, and components with value 4 to value 2. How shall I do that? Thanks!
The idiomatic R way is to use [<-, in the form
x[index] <- result
If you are dealing with integers / factors or character variables, then == will work reliably for the indexing,
x <- rep(1:5,3)
x[x==3] <- 1
x[x==4] <- 2
x
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
The car package has a useful function recode (which is a wrapper for [<-) that will let you combine all the recoding in a single call,
e.g.
library(car)
x <- rep(1:5,3)
xr <- recode(x, '3=1; 4=2')
x
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
xr
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
Thanks to @joran for mentioning mapvalues from the plyr package, another wrapper for [<-
library(plyr)
x <- rep(1:5,3)
mapvalues(x, from = c(3,4), to = c(1,2))
plyr::revalue is a wrapper for mapvalues specifically for factor or character variables.
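Without extra packages, a lookup with match() applies all replacements in one pass. This is a sketch, not taken from the answers above; from, to, idx, hit and xr are illustrative names. Unlike a chain of x[x==a] <- b assignments, a value produced by one rule cannot be picked up again by a later rule (which matters for mappings such as 3 to 1 together with 1 to 2).

```r
x <- rep(1:5, 3)
from <- c(3, 4)
to   <- c(1, 2)

# For each element, find which rule (if any) applies; NA means "leave alone"
idx <- match(x, from)
hit <- !is.na(idx)

# Replace only the matched elements, all at once
xr <- x
xr[hit] <- to[idx[hit]]
xr
```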

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. My real ’df’ data frame is much bigger than this one, but I do not want to confuse anybody, so I have simplified things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ (and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
   variable
id  a b c d e
  1 3 8 2 2 4
  2 4 6 3 2 4
  3 4 2 1 5 1
, , value = 2
   variable
id  a b c d e
  1 0 1 4 3 3
  2 3 3 3 6 2
  3 1 4 5 3 4
, , value = 3
   variable
id  a b c d e
  1 7 1 4 5 3
  2 3 1 4 2 4
  3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
(the dimensions are ordered id, variable, value) you could just do
> dftab[1,'a',3]
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a grouping doesn't contain all the values, as with column a in group 1, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
ColTables <- function(df) {
  counts <- list()
  # tabulate every column except the grouping variable
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
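Another base R possibility is to cross-tabulate id against each column, fixing the set of possible values so that zero counts appear and every table has the same shape regardless of how many rows or columns the data frame has. The sketch below uses a small made-up data frame rather than the one in the question, and assumes the possible values are 1 to 3; counts is an illustrative name.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 a  = c(3, 1, 3, 2, 2, 1),
                 b  = c(1, 1, 2, 3, 3, 3))

# One id-by-value table per data column; the fixed levels keep the zero
# counts, so each table has the same columns even if a value is absent
counts <- lapply(df[-1], function(col) {
  table(id = df$id, value = factor(col, levels = 1:3))
})

# How often does the value 3 occur in column a for id 1?
counts$a["1", "3"]
```

Because the result is a named list with one entry per column, it adapts automatically when columns are added to or removed from df.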
