Group mean by another variable in R

For a sample dataframe:
set.seed(1000)
value <- rnorm(1000)
wave <- rep(1:5, times=20, each=10)
length <- rep(1:10, times=10, each=10)
df <- data.frame(value, length, wave)
I want to create a summary table of the mean for each length (1-10) within each 'wave'. If I just had data from one time point, I would use:
aggregate(df$value, by=list(Category=df$length), FUN=sum)
But how do I calculate this for all my different waves? Can I do this in one command?

Do you mean something like this...:
> aggregate(value~length+wave, data=df, FUN=sum)
   length wave      value
1       1    1 -14.055504
2       6    1 -11.303317
3       2    2 -24.260527
4       7    2   4.307751
5       3    3  -2.128476
6       8    3  11.522721
7       4    4  -1.202818
8       9    4  20.985253
9       5    5  12.848358
10     10    5  -9.189343
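Note that the question asks for means while both snippets use FUN=sum; swapping in mean gives the summary the question actually describes:
aggregate(value~length+wave, data=df, FUN=mean)  # mean of value per length/wave cell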

Related

How to add a date to each row for a column in a data frame?

df <- data.frame(DAY = character(), ID = character())
I'm running a loop (for i in DAYS) that gets the IDs for each day and stores them in a data frame:
df <- rbind(df, data.frame(ID = IDs))
I want to add DAY[i] as a second column across each row within the loop.
How do I do that?
As @Pascal says, this isn't the best way to create a data frame in R. R is a vectorised language, so generally you don't need for loops.
I'm assuming each ID is unique, so you can create a vector of IDs from 1 to 10:
ID <- 1:10
Then you need a vector for your DAYs, which can be the same length as your IDs or can be recycled (i.e. if you only have a certain number of days that repeat in the same order, you can use a shorter vector that gets reused). Use c() to create a vector with more than one value:
DAY <- c(1, 2, 9, 4, 4)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 1
# 2 2 2
# 3 3 9
# 4 4 4
# 5 5 4
# 6 6 1
# 7 7 2
# 8 8 9
# 9 9 4
# 10 10 4
Or with a DAY vector drawn at random (note that replace = TRUE means values may still repeat):
DAY <- sample(1:100, 10, replace = TRUE)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 61
# 2 2 30
# 3 3 32
# 4 4 97
# 5 5 32
# 6 6 74
# 7 7 97
# 8 8 73
# 9 9 16
# 10 10 98
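If the loop really is unavoidable (e.g. the IDs arrive one day at a time), here is a minimal sketch of the asker's original pattern that simply attaches the day at bind time; DAYS and get_ids_for_day are hypothetical placeholders:
df <- data.frame(ID = character(), DAY = character())
for (i in seq_along(DAYS)) {
  IDs <- get_ids_for_day(DAYS[i])  # hypothetical helper returning that day's IDs
  df <- rbind(df, data.frame(ID = IDs, DAY = DAYS[i]))  # DAYS[i] is recycled across all rows
}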

How to reshape a data frame from wide to long format in R?

I am new to R. I am trying to read data from Excel in the following format:
x1 x2 x3 y1 y2 y3 Result
1 2 3 7 8 9
4 5 6 10 11 12
and the data.frame in R should take the data in the following format for the 1st row:
x y
1 7
2 8
3 9
then I want to use lm() and export the result to the Result column.
I want to automate this for n rows, i.e. once the result of the 1st row is exported to Excel, I want to import the data for the second row.
Please help.
library(gdata)
# this spreadsheet is exactly as in your question
df.original <- read.xls("test.xlsx", sheet="Sheet1", perl="C:/strawberry/perl/bin/perl.exe")
#
#
> df.original
x1 x2 x3 y1 y2 y3
1 1 2 3 7 8 9
2 4 5 6 10 11 12
#
# for the above code you'll just need to change the 'perl' argument to the
# path of your perl executable
#
# now the example for the first row
#
library(reshape2)
df <- melt(df.original[1,])                   # wide -> long for the first row
df$variable <- substr(df$variable, 1, 1)      # "x1", "y2", ... -> "x", "y"
df <- as.data.frame(lapply(split(df, df$variable), `[[`, 2))  # one column per letter
> df
x y
1 1 7
2 2 8
3 3 9
Now, at this stage we have automated the import/transformation process (for one row).
First question: how do you want the data to look once every row has been treated?
Second question: what exactly do you want to put in Result: residuals, fitted values? What do you need from lm()?
EDIT:
OK, @kapil, tell me if the final shape of df is what you had in mind:
library(reshape2)
library(plyr)
df <- adply(df.original, 1, melt, .expand = FALSE)  # melt every row, keeping the row index
names(df)[1] <- "rowID"
df$variable <- substr(df$variable, 1, 1)
rows <- df$rowID[df$variable == "x"]  # with "y" it would be the same (same length)
df <- as.data.frame(lapply(split(df, df$variable), `[[`, "value"))
df$rowID <- rows
df <- df[c("rowID", "x", "y")]
> df
rowID x y
1 1 1 7
2 1 2 8
3 1 3 9
4 2 4 10
5 2 5 11
6 2 6 12
Regarding the coefficients, you can calculate them for each rowID (which refers to the actual row in the xls file) this way:
model <- dlply(df, .(rowID), function(z) lm(y ~ x, data = z))
> sapply(model, `[`, "coefficients")
$`1.coefficients`
(Intercept) x
6 1
$`2.coefficients`
(Intercept) x
6 1
So, for each group (or row in the original spreadsheet) you have, as expected, two coefficients: intercept and slope. I therefore can't figure out how you want the coefficients to fit inside the data.frame (especially in the 'long' form it has just above). But if you want the data.frame to stay in 'wide' mode, you can try this:
# obtained the object model, you can put the coeff in the df.original data.frame
#
> ldply(model, `[[`, "coefficients")
rowID (Intercept) x
1 1 6 1
2 2 6 1
df.modified <- cbind(df.original, ldply(model, `[[`, "coefficients"))
> df.modified
x1 x2 x3 y1 y2 y3 rowID (Intercept) x
1 1 2 3 7 8 9 1 6 1
2 4 5 6 10 11 12 2 6 1
# of course, if you don't like it, you can remove rowID with df.modified$rowID <- NULL
Hope this helps, and let me know if you wanted the 'long' version of df.
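As an aside, base R's reshape() can do the wide-to-long step in one call; a minimal sketch assuming the x1..x3/y1..y3 layout above:
# wide -> long with base R; 'rowID' tracks the original spreadsheet row
long <- reshape(df.original, direction = "long",
                varying = list(c("x1", "x2", "x3"), c("y1", "y2", "y3")),
                v.names = c("x", "y"), idvar = "rowID")
long[order(long$rowID), c("rowID", "x", "y")]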

Randomly choose value between 1 and 10 with equal number of instances [duplicate]

I would like to generate 2000 random numbers between 1 and 10 such that each number occurs the same number of times, in this case 200 instances of each.
What should be random is the order in which they are generated.
I have the following problem:
I have an array with 2000 entries, but not each with a unique value; for example, it starts like this:
11112233333333344445667777777777
and consists of 2000 entries.
I would like to generate random numbers and assign each UNIQUE value a separate random number, while keeping an entry for each value.
So my intended result would look like this:
original array: 11112233333333344445667777777777
random numbers: 33334466666666699991778888888888
You could do this in a few steps:
my_numbers <- rep(1:10, each=200)
my_randomizer <- sample(seq_along(my_numbers), length(my_numbers))
my_random_numbers <- my_numbers[my_randomizer]
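Since sample(x) with no further arguments returns a random permutation of x, the same idea collapses to a one-liner:
my_random_numbers <- sample(rep(1:10, each = 200))  # shuffle 200 copies of each value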
Based on the edits:
I would use rle. It sounds like you don't have an array, but instead a vector:
my_array_rled <- rle(my_array)
my_random_numbers <- sample(1:10, length(unique(my_array)))
my_array_rled$values <- factor(my_array_rled$values)
levels(my_array_rled$values) <- my_random_numbers
my_array_randomized <- inverse.rle(my_array_rled)
If I understand you correctly, you can use rep to replicate your random numbers 200 times and sample to randomize the resulting vector:
x <- sample(rep(runif(10, 1, 10), 200))
A non-vectorized version:
# using a seed for reproducible example
set.seed(2)
original_array <- c(1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
random_numbers <- numeric(length=length(original_array))
rdnum <- sample(unique(original_array), length(unique(original_array)))
for (i in 1:length(unique(original_array))) {
  random_numbers[original_array == i] <- rdnum[i]
}
random_numbers
2 2 2 2 5 5 3 3 3 3 3 3 3 3 3 1 1 1 1 6 7 7 4 4 4 4 4 4 4 4 4 4
The table function together with sample comes in quite handy for this scenario:
set.seed(1)
## ASSUMING ORIGINAL IS A VECTOR
original <- c(1, 1, 1, 1, 2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,5,6,6,7,7,7,7,7,7,7,7,7,7)
## CREATE A TABLE OF ALL THE VALUES
tabl <- table(original)
## RNG is the sample range to select from. Assuming 1:10 in this example
RNG <- 1:10
## PICK VALUES RANDOMLY FROM RNG
tabl[] <- sample(RNG, length(tabl), replace=FALSE)
# note that the `names` of `tabl` will contain the values from `original`
# whereas the values of `tabl` will contain the new random value.
## ASSIGN NEW VALUES
randomNums <- original
for (i in seq_along(tabl)) {
  randomNums[original == as.numeric(names(tabl))[[i]]] <- tabl[[i]]
}
Results:
rbind(orig=original, rand=randomNums)
orig: 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7 7
rand: 3 3 3 3 4 4 5 5 5 5 5 5 5 5 5 7 7 7 7 2 8 8 9 9 9 9 9 9 9 9 9 9
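For comparison, a compact base-R sketch of the same value-remapping idea, assuming the original vector and the 1:10 range used above:
set.seed(1)
key <- sort(unique(original))                      # the distinct original values
map <- setNames(sample(1:10, length(key)), key)    # one random replacement per value
randomNums <- unname(map[as.character(original)])  # look each entry up in the map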

How to bin ordered data by percentile for each id in R dataframe

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id # (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins: the 1st bin should be their fastest 20 percent of RTs, the 2nd bin their next fastest 20 percent, and so on. Each bin should have the same number of trials in it (unless the total number of trials is odd).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, and assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and would greatly appreciate any help streamlining this process. Thanks.
The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number in each group). For that, you need the cut2 function from Hmisc:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want:
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
with the same number in each bin for each id:
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.
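For instance, the same call with labels = FALSE returns integer bin codes:
ddply(dat, "id", transform, hists = cut(value, breaks = 5, labels = FALSE))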
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], rep(20)) )
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep(unlist(tapply(df$rt, df$id, quantile)), each = 4)  # 5 quantile labels per id, each covering 4 sorted RTs
You'll note that the quantile command can be set to use any quantiles. By default it returns five values (the quartile break points), which here label five equal groups; if you want deciles instead, use
quantile(x, seq(0, 1, 0.1))
in the code above.
The answer above is a bit fragile. It requires an equal number of RTs per id, and I didn't tell you how to get to the magic number 4 (it's the rows per id divided by the number of quantile labels, 20 / 5 here). But it will run very fast on a large dataset. If you want a more robust solution built on Hmisc's cut2:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply( unique(df$id), function(x) cut2(df$rt[df$id==x], g = 5) ))
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
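If you would rather avoid the explicit lapply, base R's ave() applies a function within each id; a short sketch using the same cut2 from Hmisc:
library("Hmisc")
# per-id quintile bin (1-5), no manual sorting or splitting required
df$bin <- ave(df$rt, df$id, FUN = function(x) as.numeric(cut2(x, g = 5)))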

How to calculate correlation in R

I want to calculate the correlation coefficient between columns of a subset of a data set x in R.
I have rows for 40 models, each with 200 simulations, 8000 rows in total.
I want to calculate the correlation coefficient between columns for each simulation (40 rows).
cor(x[c(3,5)]) calculates it from all 8000 rows.
I need cor(x[c(3,5)]) but only when x$nsimul == 1, and so on.
Would you help me in this regard?
San
I'm not sure exactly what you're doing with x[c(3,5)], but it looks like you want something like the following. You have a data frame X like this:
set.seed(123)
X <- data.frame(nsimul = rep(1:2, each=5), a = sample(1:10), b = sample(1:10))
> X
nsimul a b
1 1 1 6
2 1 8 2
3 1 9 1
4 1 10 4
5 1 3 9
6 2 4 8
7 2 6 5
8 2 7 7
9 2 2 10
10 2 5 3
And you want to split this data-frame by the nsimul column, and calculate the correlation between a and b in each group. This is a classic split-apply-combine problem for which the plyr package is very well-suited:
require(plyr)
> ddply(X, .(nsimul), summarize, cor_a_b = cor(a,b))
nsimul cor_a_b
1 1 -0.7549232
2 2 -0.5964848
You can use the by function, e.g.:
correlations <- as.list(by(data=x,INDICES=x$nsimul,FUN=function(x) cor(x[3],x[5])))
# now you can access the correlation for each simulation
# (the list names come from the values of nsimul)
correlations[["1"]]
correlations[["2"]]
...
correlations[["40"]]
