I have a set of motorsport laptime data (mld) of the form:
car lap laptime
1 1 1 138.523
2 1 2 122.373
3 1 3 121.395
4 1 4 137.871
and I want to produce something of the form:
lap car.1 car.1.delta
1 1 138 NA
2 2 122 -16
3 3 121 -1
4 4 127 6
I can use the R command diff(mld$laptime, lag=1) to produce the difference column, but how do I elegantly create the padded difference column in R?
Here are a couple of approaches:
1) zoo
If we represented this as a time series using zoo then the calculation would be particularly simple:
# test data with two cars
Lines <- "car lap laptime
1 1 138.523
1 2 122.373
1 3 121.395
1 4 137.871
2 1 138.523
2 2 122.373
2 3 121.395
2 4 137.871"
cat(Lines, "\n", file = "data.txt")
# read it into a zoo series, splitting it
# on car to give wide form (rather than long form)
library(zoo)
z <- read.zoo("data.txt", header = TRUE, split = 1, index = 2, FUN = as.numeric)
# now that it's in the right form it's simple
zz <- cbind(z, diff(z))
The last statement gives:
> zz
1.z 2.z 1.diff(z) 2.diff(z)
1 138.523 138.523 NA NA
2 122.373 122.373 -16.150 -16.150
3 121.395 121.395 -0.978 -0.978
4 137.871 137.871 16.476 16.476
To plot zz, one column per panel, try this:
plot(zz, type = "o")
To only plot the differences we do not really need zz in the first place as this will do:
plot(diff(z), type = "o")
(Add the screen=1 argument to the plot command to plot everything on the same panel.)
2) ave. Here is a second solution that uses just plain R (except for the plotting) and keeps the output in long form; however, it is a bit more complex:
# assume same input as above
DF <- read.table("data.txt", header = TRUE)
DF$diff <- ave(DF$laptime, DF$car, FUN = function(x) c(NA, diff(x)))
The result is:
> DF
car lap laptime diff
1 1 1 138.523 NA
2 1 2 122.373 -16.150
3 1 3 121.395 -0.978
4 1 4 137.871 16.476
5 2 1 138.523 NA
6 2 2 122.373 -16.150
7 2 3 121.395 -0.978
8 2 4 137.871 16.476
To plot just the differences, one per panel, try this:
library(lattice)
xyplot(diff ~ lap | car, DF, type = "o")
Update
Added info above on plotting since the title of the question mentions this.
I think this is enough:
mld$car.1.delta = c(NA, diff(mld$laptime, lag = 1))
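The same padding idea extends to larger lags: diff() returns `lag` fewer values than it was given, so that many NAs are needed in front. A minimal sketch using the four lap times from the question (the names `k` and `padded` are only for illustration):

```r
# Lap times for car 1, copied from the question
laptime <- c(138.523, 122.373, 121.395, 137.871)

# diff() with lag = k returns length(x) - k values,
# so prepend k NAs to get a column of the original length
k <- 2
padded <- c(rep(NA, k), diff(laptime, lag = k))
print(padded)
```

With k <- 1 this reduces to the c(NA, diff(...)) form shown above.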
In your example you have truncated the lap times but rounded car.1.delta, so it really depends on how you want that to work, but the code below gives output in the form you posted.
Wrap everything in with to simplify, and create a new data.frame based on modifications of the existing columns. Prepend an NA to the diff to pad it out.
with(mld,
data.frame(
lap = lap,
car.1 = trunc(laptime),
car.1.delta = c(NA, round(diff(laptime)))
)
)
lap car.1 car.1.delta
1 1 138 NA
2 2 122 -16
3 3 121 -1
4 4 137 16
I wonder if you want to do this by car, and if so it will need a bit more handling but since you've literally asked for column car.1 I think this works so far as that goes.
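If the by-car version is what is wanted, one possible sketch in base R is to pad the differences within each car using ave() and then reshape to one row per lap. The second car's lap times below are made up for illustration, and the resulting column names (laptime.1, delta.1, ...) are simply what reshape() generates, not the exact names from the question.

```r
# Long-form data: car 1 is from the question, car 2 is hypothetical
mld <- data.frame(
  car     = rep(1:2, each = 4),
  lap     = rep(1:4, times = 2),
  laptime = c(138.523, 122.373, 121.395, 137.871,
              140.100, 123.000, 122.500, 121.900)
)

# Pad each car's lap-to-lap differences with a leading NA
mld$delta <- ave(mld$laptime, mld$car, FUN = function(x) c(NA, diff(x)))

# One row per lap, one laptime/delta pair of columns per car
wide <- reshape(mld, idvar = "lap", timevar = "car", direction = "wide")
print(wide)
```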
Related
I have the data.frame below. I want to add a column 'g' that classifies my data according to consecutive sequences in column h_no. That is, the first sequence of h_no 1, 2, 3, 4 is group 1, the second series of h_no (1 to 7) is group 2, and so on, as indicated in the last column 'g'.
h_no h_freq h_freqsq g
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3
You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.
Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.
my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector
The data.frame method for $, treats x as a list
my.dataframe$new.col <- a.vector
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix
my.dataframe[ , "new.col"] <- a.vector
If you don't specify whether you're working with columns or rows, the data.frame method assumes you mean columns.
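As a quick check, the four assignment forms above can be compared on a toy data frame. This is only a sketch; `start.df` and its contents are invented for the comparison.

```r
a.vector <- c(10, 20, 30)
start.df <- data.frame(x = c("a", "b", "c"))

d1 <- start.df; d1["new.col"]    <- a.vector  # list-style, single bracket
d2 <- start.df; d2[["new.col"]]  <- a.vector  # list-style, double bracket
d3 <- start.df; d3$new.col       <- a.vector  # $ method
d4 <- start.df; d4[, "new.col"]  <- a.vector  # matrix-style, column index

# Compare the results of all four forms
all_same <- isTRUE(all.equal(d1, d2)) &&
  isTRUE(all.equal(d1, d3)) &&
  isTRUE(all.equal(d1, d4))
print(all_same)
```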
For your example, this should work:
# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))
# find where 1 appears, i.e. where each sequence starts
from <- which(your.df$no == 1)
to <- c((from-1)[-1], nrow(your.df)) # up to which point the sequence runs
# generate a sequence (len) and based on its length, repeat a consecutive number len times
get.seq <- mapply(from, to, 1:length(from), FUN = function(x, y, z) {
len <- length(seq(from = x[1], to = y[1]))
return(rep(z, times = len))
})
# when we unlist, we get a vector
your.df$group <- unlist(get.seq)
# and append it to your original data.frame. since this is
# designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)
no h_freq h_freqsq group
1 1 0.40998238 0.06463876 1
2 2 0.98086928 0.33093795 1
3 3 0.28908651 0.74077119 1
4 4 0.10476768 0.56784786 1
5 1 0.75478995 0.60479945 2
6 2 0.26974011 0.95231761 2
7 3 0.53676266 0.74370154 2
8 4 0.99784066 0.37499294 2
9 5 0.89771767 0.83467805 2
10 6 0.05363139 0.32066178 2
11 7 0.71741529 0.84572717 2
12 1 0.10654430 0.32917711 3
13 2 0.41971959 0.87155514 3
14 3 0.32432646 0.65789294 3
15 4 0.77896780 0.27599187 3
16 5 0.06100008 0.55399326 3
Easily: Your data frame is A
b <- A[,1]
b <- b==1
b <- cumsum(b)
Then you get the column b.
If I understand the question correctly, you want to detect when the h_no doesn't increase and then increment the class. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)
Working
We only care about the h_no column for the moment, so we can extract that from the data frame:
> h_no <- data$h_no
We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the diff function which gives us the vector of differences:
> d.h_no <- diff(h_no)
> d.h_no
[1] 1 1 1 -3 1 1 1 1 1 1 -6 1 1 1
Once we have that, it is a simple matter to find the ones that are non-positive:
> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE
In R, TRUE and FALSE are basically the same as 1 and 0, so if we get the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.
> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
But, there are two problems: the numbers are one too small; and, we are missing the first element (there should be four in the first class).
The first problem is simply solved: 1+cumsum(nonpos). And the second just requires adding a 1 to the front of the vector, since the first element is always in class 1:
> classes <- c(1, 1 + cumsum(nonpos))
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
Now, we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the class heading):
> data_w_classes <- cbind(data, class=classes)
And data_w_classes now contains the result.
Final result
We can compress the lines together and wrap it all up into a function to make it easier to use:
classify <- function(data) {
cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}
Or, since it makes sense for the class to be a factor:
classify <- function(data) {
cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}
You use either function like:
> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
(This method of solving this problem is good because it avoids explicit iteration, which is generally recommended for R, and avoids generating lots of intermediate vectors, lists, etc. And also it's kinda neat how it can be written on one line :) )
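To see how the diff-based rule relates to the simpler cumsum(b == 1) idea, here is a small self-contained comparison. The first vector is the h_no column from the question; the second (`h_alt`) is a hypothetical input whose runs do not start at 1, included only to show where the two rules diverge.

```r
# h_no from the question: runs 1:4, 1:7, 1:4
h_no <- c(1:4, 1:7, 1:4)

by_diff  <- c(1, 1 + cumsum(diff(h_no) <= 0))  # new class when the value fails to increase
by_start <- cumsum(h_no == 1)                  # new class at every 1

# Hypothetical input where the runs do not start at 1
h_alt <- c(2:4, 1:3, 3:5)
alt_by_diff  <- c(1, 1 + cumsum(diff(h_alt) <= 0))
alt_by_start <- cumsum(h_alt == 1)

print(rbind(by_diff, by_start))
print(rbind(alt_by_diff, alt_by_start))
```

On the question's data both rules agree; on `h_alt` only the diff-based rule separates the three runs.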
In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.
# Note that I use a global variable here
# normally not advisable, but I liked the
# use here to make the code shorter
index <- 0
new_column <- sapply(df$h_no, function(x) {
  if (x == 1) index <<- index + 1 # <<- updates the global counter; a plain = would only create a local copy
  return(index)
})
The function iterates over the values in h_no and always returns the category that the current value belongs to. If a value of 1 is detected, we increase the global variable index and continue.
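If the global variable is a concern, the same running counter can be kept inside a closure instead. This is a sketch of that alternative, not the answer's original code; `make_counter` is a hypothetical helper name.

```r
# Returns a function that remembers how many 1s it has seen so far
make_counter <- function() {
  index <- 0
  function(x) {
    if (x == 1) index <<- index + 1  # updates index in the enclosing function, not globally
    index
  }
}

h_no <- c(1:4, 1:7, 1:4)  # h_no column from the question
# make_counter() is evaluated once, so all calls share one counter
new_column <- vapply(h_no, make_counter(), numeric(1))
print(new_column)
```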
An approach based on identifying the number of groups (x in mapply) and the length of each group (y in mapply):
mytb<-read.table(text="h_no h_freq h_freqsq group
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL
positionsof1s <- which(mytb$h_no == 1) # exact match; grep(1, ...) would also match 10, 11, ...
mytb$newgroup <- unlist(mapply(function(x, y)
  rep(x, y), # repeat group number x, y times
  x = seq_along(positionsof1s), # x runs from 1 to the number of groups (g1:g3)
  y = c(diff(positionsof1s), # y is the size of groups g1 to the penultimate (g2) = 4, 7
        nrow(mytb) - # this line and the next give the size of the last group (g3)
          (positionsof1s[length(positionsof1s)] - 1) # number of rows minus the rows before the last group starts
  )))
mytb
I believe that using "cbind" is the simplest way to add a column to a data frame in R. Below is an example:
myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
newCol= seq(2,20,2)
myDf = cbind(myDf,newCol)
The data.table function rleid is handy for things like this. We subtract the sequence 1:nrow(data) to transform consecutive sequences to constants, and then use rleid to create the group IDs:
data$g = data.table::rleid(data$h_no - 1:nrow(data))
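For readers without data.table, the same trick can be sketched in base R with rle(): subtracting the row index turns each consecutive run into a constant, and the run lengths then give the group IDs. This is an equivalent written for illustration, not how rleid is implemented.

```r
h_no <- c(1:4, 1:7, 1:4)  # h_no column from the question

# Consecutive sequences become constants: 0 0 0 0 -4 ... -4 -11 ... -11
shifted <- h_no - seq_along(h_no)

# Number the runs of identical values 1, 2, 3, ...
runs <- rle(shifted)
g <- rep(seq_along(runs$lengths), runs$lengths)
print(g)
```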
Data.frame[,'h_new_column'] <- as.integer(Data.frame[,'h_no'], breaks=c(1, 4, 7))
We have a dataset with ID numbers in the first column and then responses to each of 240 questions in the following 240 columns. We'd like to assess the validity of the responses for each subject by finding the maximum and mean of the lengths of streaks or runs of identical responses. For example, if a subject responded (1, 1, 1, 2, 2, 5, 5, 5, 5, 1) to ten questions, the maximum would be 4 and the mean would be 2.5.
I have tried to solve this problem in R using rle(), but after I apply rle() to every row of the data frame I can't extract the lengths. Once I extract the lengths, I think it would be relatively easy to apply max() and mean(). Any help or advice on getting to that point would be appreciated.
There are two more issues that are minor and don't necessarily need to be answered here. The first is that it would be even more informative to find the maximum and mean per response (there are five possible responses, namely, 1 through 5). In the example above, the maxima and means for 1, 2, and 5 would be, respectively, 3 and 2, 2 and 2, and 4 and 4. The second is that I don't know how to apply rle() to the 240 responses exclusively, i.e. and not also to the ID number. I've been deleting the ID number column before manipulating the data frame in R, which is fine, but will lead to error if I unintentionally rearrange the rows.
Thank you!
The rle function returns a list, but this is not immediately obvious because it is possible to make R print whatever you want when you type the name of an object and the authors of rle have made it print something else. In order to find out the structure of an object, you can use str, for example
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)
str(codes)
You can get at the lengths by typing codes$lengths and similarly for the corresponding values.
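Applied to the ten-answer example from the question, the two components give the overall and per-response figures directly. A short sketch (the names `overall`, `per_max` and `per_mean` are only for illustration):

```r
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)

print(codes$lengths)  # run lengths
print(codes$values)   # the response each run consists of

# Overall maximum and mean run length
overall <- c(max = max(codes$lengths), mean = mean(codes$lengths))

# The same statistics split by response value
per_max  <- tapply(codes$lengths, codes$values, max)
per_mean <- tapply(codes$lengths, codes$values, mean)
print(overall)
print(rbind(per_max, per_mean))
```

The results agree with the figures quoted in the question: 4 and 2.5 overall, and 3 and 2, 2 and 2, 4 and 4 for responses 1, 2 and 5.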
Anyway, notwithstanding the statistical issues, here is how to do what you want. Suppose you have 30 subjects and they have responded to eight questions. Your data might look like this
set.seed(123)
responses <- data.frame(matrix(sample(0:5, 8*30, replace=TRUE), ncol=8))
> head(responses)
X1 X2 X3 X4 X5 X6 X7 X8
1 3 2 4 2 4 1 1 5
2 1 5 2 1 5 3 1 1
3 1 3 1 2 3 5 5 3
4 4 4 5 3 4 2 4 2
5 5 5 2 5 3 1 2 4
6 3 3 3 3 1 1 3 2
You can extract the maximum lengths of the runs for each subject like this:
> max.lengths <- apply(responses, 1, function(x) max(rle(x)$lengths))
> max.lengths
[1] 2 2 2 2 2 4 3 1 1 2 2 1 2 3 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1
The max length was 2 for the first 5 subjects and 4 for the sixth subject, so it looks right.
Similarly for the mean lengths
> mean.lengths <- apply(responses, 1, function(x) mean(rle(x)$lengths))
> head(mean.lengths)
[1] 1.142857 1.142857 1.142857 1.142857 1.142857 2.000000
For example, the mean length for the first person was the mean of $1,1,1,1,1,2,1$ which is $8/7$, which agrees with what R says.
To break down the whole thing by response, you can use the same ideas and the tapply function like this:
bd <- function(x){
means <- tapply(x$lengths, factor(x$values,levels=0:5), mean)
means[is.na(means)] <- 0
maxes <- tapply(x$lengths, factor(x$values,levels=0:5), max)
maxes[is.na(maxes)] <- 0
M <- rbind(means, maxes)
rownames(M) <- c("mean", "max")
M
}
lapply(apply(responses, 1, rle), bd)
This outputs another list. For example, if you scroll up, you will see that for subject 25, it says
[[25]]
0 1 2 3 4 5
mean 0 1 2 1 0 2
max 0 1 2 1 0 2
compare with
> responses[25,]
X1 X2 X3 X4 X5 X6 X7 X8
25 3 5 5 3 2 2 1 3
so it is giving the correct answer. You can give this list a name, for example
break.downs <- lapply(apply(responses, 1, rle), bd)
and then you can access the entry for subject i by typing
break.downs[[i]]
For the problem with the ID number column, if it's included, say as column 1, you can just do the whole analysis to responses[ ,-1] and that should be OK. The $-1$ just deletes the first column.
PS. Sorry, I just noticed that I did it with responses $0$ to $5$ instead of $1$ to $5$, but you just need to change levels=0:5 to levels=1:5 in the bd function and it should work just as well.
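To avoid losing track of which row belongs to which subject, the ID column can be carried into the result instead of being deleted beforehand. A minimal sketch with three made-up subjects and ten questions (the ids, answers and the name `run.summary` are all hypothetical):

```r
# Hypothetical data: an id column followed by ten answers per subject
responses <- data.frame(
  id = c(7000, 7001, 7002),
  rbind(
    c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1),
    c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3),
    c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
  )
)

# Work on the answer columns only, selected by name rather than by deleting a column
answers <- responses[, names(responses) != "id"]

# Keep the id next to each subject's statistics
run.summary <- data.frame(
  id       = responses$id,
  max.run  = apply(answers, 1, function(x) max(rle(x)$lengths)),
  mean.run = apply(answers, 1, function(x) mean(rle(x)$lengths))
)
print(run.summary)
```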
I am partial to the data.table package. To use it, first reshape to long format. Then use rle (making sure to take the first list element of the result, using [[1]]), take the max/mean, and group by the respondent ID.
Here is an example with five respondents and 10 questions:
library(data.table)
set.seed(8028)
responses <- data.frame(cbind(id=1:5,matrix(sample(1:5, 10*5, replace=T), nc=10)))
responses
# id V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1 1 3 4 2 5 1 2 4 4 1 3
# 2 2 2 2 4 5 5 2 3 3 3 1
# 3 3 5 1 3 3 4 4 1 4 2 2
# 4 4 3 2 4 5 2 2 1 4 1 3
# 5 5 5 2 4 5 3 1 4 1 2 4
responses.long<-data.table(reshape(responses, idvar="id", varying=list(2:11), direction="long"),key=c("id","time"))
responses.long[,list(run=max(rle(V2)[[1]]), mean=mean(rle(V2)[[1]])), by="id"]
# id run mean
# 1: 1 2 1.111111
# 2: 2 3 1.666667
# 3: 3 2 1.428571
# 4: 4 2 1.111111
# 5: 5 1 1.000000
Wouldn't this question be more appropriate for StackOverflow?
I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id# (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, the 2nd bin to be their next fastest 20 percent of RTs, etc. Each bin should have the same number of trials in it (unless the total # of trials is odd).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.
The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number in each group). For that, you need the cut2 function in Hmisc:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
With the same number in each hists for each id
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.
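Note that cut with breaks = 5 makes five intervals of equal width, not five groups of equal size, which matters for skewed data such as response times. A small sketch with made-up values shows the effect:

```r
# Nine small values and one large outlier (made-up data)
x <- c(1:9, 100)

# Equal-width bins over the range 1..100, returned as integer codes
bins <- cut(x, breaks = 5, labels = FALSE)
print(bins)
print(table(bins))
```

Here nine of the ten values land in bin 1 and the outlier in bin 5, with bins 2 to 4 left empty.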
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], rep(20)) )
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep( unlist( tapply( df$rt, df$id, quantile )), each = 4)
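You'll note that the quantile command can be set to use any quantiles. Its default probs (0, 0.25, 0.5, 0.75, 1) return five values per id, which the code above uses as labels for five bins of four sorted RTs each; if you want deciles then use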
quantile(x, seq(0, 1, 0.1))
in the function above.
The answer above is a bit fragile. It requires equal numbers of RTs per id, and I didn't tell you how to get to the magic number 4 (20 RTs per id divided by 5 bins). But it will also run very fast on a large dataset. If you want a more robust solution, use cut2 from Hmisc:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
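For a version that needs no extra package and no equal number of rows per id, one option is to rank the RTs within each id and scale the ranks to 1..5. This is a sketch under the assumption that ties may be broken by order of appearance (ties.method = "first"); the column names follow the question and the data are made up.

```r
# Made-up data: id 7000 has 10 trials, id 7001 has 7
dat <- data.frame(
  id = rep(c(7000, 7001), times = c(10, 7)),
  RT = c(225, 250, 253, 260, 271, 280, 295, 301, 310, 330,
         189, 201, 225, 230, 244, 251, 266)
)

# Within each id: rank the RTs, then map rank r of n to ceiling(5 * r / n)
dat$Bin <- ave(dat$RT, dat$id, FUN = function(x) {
  ceiling(5 * rank(x, ties.method = "first") / length(x))
})

# Mean RT per id and bin, as the question plans to do next
agg <- aggregate(RT ~ id + Bin, data = dat, FUN = mean)
print(dat)
```

When the number of trials is not a multiple of 5, as for id 7001 here, the bin sizes differ by at most one.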
I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1,nyear),rep(2,nyear),rep(3,nyear),rep(4,nyear),
rep(5,nyear),rep(6,nyear),rep(7,nyear),rep(8,nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
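For comparison, here is a short sketch of how each= differs from times=, including the vector form of times= for groups of unequal size (the variable names are only for illustration):

```r
each_form  <- rep(1:3, each = 2)            # every element repeated in place
times_form <- rep(1:3, times = 2)           # the whole vector repeated
uneven     <- rep(1:3, times = c(2, 3, 1))  # a separate count per element

print(each_form)
print(times_form)
print(uneven)
```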
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.integer(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor, e.g. by number of groups, by a list of group sizes, or by having groups start when the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z"), the grouping factor would be c(1,1,2,3,3)).
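Two of these groupings are easy to sketch in base R as well. The code below is an illustration of the idea only and is not how groupdata2 implements it.

```r
# Start a new group whenever the value differs from the previous row
x <- c("x", "x", "y", "z", "z")
by_change <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
print(by_change)

# Greedy groups of 5 for 27 rows: five full groups and a last group of 2
greedy <- ceiling(seq_len(27) / 5)
print(table(greedy))
```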