How to bin ordered data by percentile for each id in R dataframe

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id # (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins: the 1st bin should be their fastest 20 percent of RTs, the 2nd bin their next fastest 20 percent, and so on. Each bin should have the same number of trials in it (unless the total number of trials isn't evenly divisible by 5).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin.
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, and assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in how to streamline this process. Thanks.
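For reference, a minimal sketch of the split/quantile approach described above (assuming the data frame is called df with columns id and RT); the answers below show more streamlined ways to do the same thing.
#split each person's RTs, compute 20% break points, then map each RT to a bin
rt_list <- split(df$RT, df$id)
bin_list <- lapply(rt_list, function(x) {
  brks <- quantile(x, probs = seq(0, 1, 0.2))     # 20% break points
  findInterval(x, brks, rightmost.closed = TRUE)  # bin 1-5 for each RT
})
df$Bin <- unsplit(bin_list, df$id)  # put the bins back in the original row order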

The answer @Chase gave splits the range into 5 groups of equal length (equal difference between endpoints). What you seem to want is quintiles (5 groups with an equal number of observations in each group). For that, you can use the cut2 function in Hmisc:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
With the same number of observations in each bin (hists) for each id:
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
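Since the question mentions aggregating by id and bin afterwards, a possible follow-up (my addition, using the same tmp data frame from above) would be:
#mean value per id and bin, using plyr again
ddply(tmp, .(id, hists), summarize, mean_value = mean(value))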

Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass labels = FALSE to cut if you want simple integer values returned instead of the interval labels.
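For example, the same call returning integer bin codes:
ddply(dat, "id", transform, hists = cut(value, breaks = 5, labels = FALSE))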

Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], each = 20))
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep( unlist( tapply( df$rt, df$id, quantile )), each = 4)
You'll note that the quantile command can be set to use any quantiles. The default gives five break points (and therefore five bins here), but if you want deciles then use
quantile(x, seq(0, 1, 0.1))
in the function above.
The answer above is a bit fragile. It requires an equal number of RTs per id, and I didn't tell you how to get to the magic number 4. But it will also run very fast on a large dataset. If you want a more robust solution, use cut2 from Hmisc:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
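For completeness, a more modern alternative (my addition, not part of the original answers) is dplyr's ntile(), which assigns equal-sized groups directly; this assumes the same df with columns id and rt.
library(dplyr)
#ntile() splits each person's RTs into 5 (roughly) equal-sized bins
df %>%
  group_by(id) %>%
  mutate(bin = ntile(rt, 5)) %>%
  ungroup()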

Related

R: Creating Random Samples From Entries in Neighboring Row

I am working with the R programming language.
I have the following data set:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
I want to create a new variable that generates a single random integer for each row based on the corresponding value of "n". I tried to do this with the following code:
my_data$rand = sample.int(my_data$n,1)
But this is not working (the same random number is repeated 5 times).
I also tried to define a function to this:
my_function <- function(x){sample.int(x,1)}
transform(my_data, new_column= my_function(my_data$n) )
But this is also not working (the same random number is again repeated 5 times).
In the end, I am trying to achieve something like this :
my_data$rand = c(sample.int(15,1), sample.int(3,1), sample.int(51,1), sample.int(8,1), sample.int(75,1))
Can someone please show me how to do this for larger datasets without having to manually specify each "sample.int" command?
Thanks!
When you say "based on value of n" what do you mean by that exactly? Based on n how?
Guess#1: at each row, you want to draw one random number with possible values being 1 to n.
Guess#2: at each row, you want to draw n random numbers for possible values between 0 and 1.
Second option is harder, but option #1 can be done with a loop:
my_data = data.frame(id = c(1,2,3,4,5), n = c(15,3,51,8,75))
my_data$rand = NA
set.seed(123)
for(i in 1:nrow(my_data)){
my_data$rand[i] = sample(1:(my_data$n[i]), size = 1)
}
my_data
id n rand
1 1 15 15
2 2 3 3
3 3 51 51
4 4 8 6
5 5 75 67
We can use sapply to go over all rows in my_data, and generate one sample.int per iteration.
my_data$rand <- sapply(1:nrow(my_data), function(x) sample.int(my_data[x, 2], 1))
id n rand
1 1 15 7
2 2 3 2
3 3 51 28
4 4 8 6
5 5 75 9
You can do this efficiently by a single call to runif(), multiplying by n, and rounding up:
transform(my_data, rand = ceiling(runif(n) * n))
id n rand
1 1 15 13
2 2 3 1
3 3 51 41
4 4 8 1
5 5 75 9
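Another option (my own addition, not from the answers above) is vapply, which avoids indexing by row number and checks the return type:
#one sample.int() draw per row, with the upper bound taken directly from n
my_data$rand <- vapply(my_data$n, function(n) sample.int(n, 1), integer(1))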

perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, after a group_by and summarize I have a data frame that I want to do some further manipulation on by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First, merge the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate (the compound-assignment pipe %<>% is from magrittr):
df2 %<>% mutate(percent = sum / total)
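Equivalently, as a single dplyr pipeline (my own variant, using dplyr's left_join):
library(dplyr)
#join the per-run totals, then compute the proportion in one step
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total)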
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use tail(mass, 1) in place of mass[length(mass)] (and head(mass, 1) in place of mass[1]).
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and mass_f as the values.
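Note that max() returns the largest mass, which is the final mass only if mass increases over time. If that isn't guaranteed, a hedged variant of the same tapply idea (my addition) picks the mass at the latest time instead:
#mass at the latest time per individual (assumes columns indv, time, mass)
mass_f <- tapply(seq_len(nrow(test)), test$indv,
                 function(i) test$mass[i][which.max(test$time[i])])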

Creating a delta column to plot time series differences in R

I have a set of motorsport laptime data (mld) of the form:
car lap laptime
1 1 1 138.523
2 1 2 122.373
3 1 3 121.395
4 1 4 137.871
and I want to produce something of the form:
lap car.1 car.1.delta
1 1 138 NA
2 2 122 -16
3 3 121 -1
4 4 127 6
I can use the R command diff(mld$laptime, lag=1) to produce the difference column, but how do I elegantly create the padded difference column in R?
Here are a couple of approaches:
1) zoo
If we represented this as a time series using zoo then the calculation would be particularly simple:
# test data with two cars
Lines <- "car lap laptime
1 1 138.523
1 2 122.373
1 3 121.395
1 4 137.871
2 1 138.523
2 2 122.373
2 3 121.395
2 4 137.871"
cat(Lines, "\n", file = "data.txt")
# read it into a zoo series, splitting it
# on car to give wide form (rather than long form)
library(zoo)
z <- read.zoo("data.txt", header = TRUE, split = 1, index = 2, FUN = as.numeric)
# now that it's in the right form it's simple
zz <- cbind(z, diff(z))
The last statement gives:
> zz
1.z 2.z 1.diff(z) 2.diff(z)
1 138.523 138.523 NA NA
2 122.373 122.373 -16.150 -16.150
3 121.395 121.395 -0.978 -0.978
4 137.871 137.871 16.476 16.476
To plot zz, one column per panel, try this:
plot(zz, type = "o")
To only plot the differences we do not really need zz in the first place as this will do:
plot(diff(z), type = "o")
(Add the screen=1 argument to the plot command to plot everything on the same panel.)
2) ave. Here is a second solution that uses just plain R (except for the plotting) and keeps the output in long form; however, it is a bit more complex:
# assume same input as above
DF <- read.table("data.txt", header = TRUE)
DF$diff <- ave(DF$laptime, DF$car, FUN = function(x) c(NA, diff(x)))
The result is:
> DF
car lap laptime diff
1 1 1 138.523 NA
2 1 2 122.373 -16.150
3 1 3 121.395 -0.978
4 1 4 137.871 16.476
5 2 1 138.523 NA
6 2 2 122.373 -16.150
7 2 3 121.395 -0.978
8 2 4 137.871 16.476
To plot just the differences, one per panel, try this:
library(lattice)
xyplot(diff ~ lap | car, DF, type = "o")
Update
Added info above on plotting since the title of the question mentions this.
I think this is enough:
mld$car.1.delta = c(NA, diff(mld$laptime, lag = 1))
In your example you have truncated laptimes but rounded car.1.delta values, so it really depends on how you want that to work, but the code below gives what you posted.
Wrap everything in with() to simplify, and create a new data.frame based on modifications of the existing columns. Prepend an NA to the diff to pad it out.
with(mld,
data.frame(
lap = lap,
car.1 = trunc(laptime),
car.1.delta = c(NA, round(diff(laptime)))
)
)
lap car.1 car.1.delta
1 1 138 NA
2 2 122 -16
3 3 121 -1
4 4 137 16
I wonder if you want to do this by car, and if so it will need a bit more handling but since you've literally asked for column car.1 I think this works so far as that goes.
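If you do need the per-car handling, a minimal sketch (my addition, mirroring the ave() answer earlier in this thread) would be:
#per-car deltas in long form, padding each car's first lap with NA
mld$delta <- ave(mld$laptime, mld$car, FUN = function(x) c(NA, diff(x)))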

How to calculate correlation in R

I want to calculate the correlation coefficient between columns of a subset of a data set x in R.
I have rows for 40 models in each of 200 simulations, 8000 rows in total.
I want to calculate the correlation coefficient between columns for each simulation (40 rows).
cor(x[c(3,5)]) calculates it from all 8000 rows.
I need cor(x[c(3,5)]) but only when x$nsimul == 1, and so on.
Would you help me in this regard?
San
I'm not sure what exactly you're doing with x[c(3,5)] but it looks like you want to do something like the following: You have a data-frame X like this:
set.seed(123)
X <- data.frame(nsimul = rep(1:2, each=5), a = sample(1:10), b = sample(1:10))
> X
nsimul a b
1 1 1 6
2 1 8 2
3 1 9 1
4 1 10 4
5 1 3 9
6 2 4 8
7 2 6 5
8 2 7 7
9 2 2 10
10 2 5 3
And you want to split this data-frame by the nsimul column, and calculate the correlation between a and b in each group. This is a classic split-apply-combine problem for which the plyr package is very well-suited:
require(plyr)
> ddply(X, .(nsimul), summarize, cor_a_b = cor(a,b))
nsimul cor_a_b
1 1 -0.7549232
2 2 -0.5964848
You can use the by function, e.g.:
correlations <- as.list(by(data = x, INDICES = x$nsimul, FUN = function(x) cor(x[3], x[5])))
# now you can access the correlation for each simulation;
# the list names come from the values of nsimul
correlations[["1"]]
correlations[["2"]]
...
correlations[["40"]]
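For what it's worth, the same split-apply idea with sapply (my addition) returns a named vector rather than a list:
#correlation of columns 3 and 5 within each simulation
cors <- sapply(split(x, x$nsimul), function(d) cor(d[[3]], d[[5]]))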
