I want to calculate the correlation coefficient between columns of a subset of a data set x in R.
I have rows for 40 models, each with 200 simulations, 8000 rows in total.
I want to calculate the correlation coefficient between columns for each simulation (40 rows).
cor(x[c(3,5)]) calculates it from all 8000 rows.
I need cor(x[c(3,5)]) but only where x$nsimul == 1, and so on.
Would you help me in this regard?
San
I'm not sure what exactly you're doing with x[c(3,5)] but it looks like you want to do something like the following: You have a data-frame X like this:
set.seed(123)
X <- data.frame(nsimul = rep(1:2, each=5), a = sample(1:10), b = sample(1:10))
> X
nsimul a b
1 1 1 6
2 1 8 2
3 1 9 1
4 1 10 4
5 1 3 9
6 2 4 8
7 2 6 5
8 2 7 7
9 2 2 10
10 2 5 3
And you want to split this data-frame by the nsimul column, and calculate the correlation between a and b in each group. This is a classic split-apply-combine problem for which the plyr package is very well-suited:
require(plyr)
> ddply(X, .(nsimul), summarize, cor_a_b = cor(a,b))
nsimul cor_a_b
1 1 -0.7549232
2 2 -0.5964848
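If you would rather not load plyr, the same split-apply-combine step can be sketched in base R. This is only an illustration on the toy X from above; the name cors is mine and does not come from the question.

```r
# Rebuild the toy data frame from the answer above
set.seed(123)
X <- data.frame(nsimul = rep(1:2, each = 5), a = sample(1:10), b = sample(1:10))

# split() gives one data frame per simulation;
# sapply() then applies cor() to each piece
cors <- sapply(split(X, X$nsimul), function(d) cor(d$a, d$b))

# cors is a numeric vector named by the values of nsimul ("1", "2", ...)
print(cors)
```

On the real data you would replace d$a and d$b with the two columns of interest, for example d[[3]] and d[[5]].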
You can use the by function, e.g.:
correlations <- as.list(by(data=x, INDICES=x$nsimul, FUN=function(d) cor(d[[3]], d[[5]])))
# the list elements are named after the values of nsimul,
# so you can access the correlation for each simulation by name
correlations[["1"]]
correlations[["2"]]
...
correlations[["200"]]
For a sample dataframe:
set.seed (1000)
value <- rnorm(1000)
wave <- rep(1:5, times=20, each=10)
length <- rep(1:10, times=10, each=10)
df <- data.frame(value, length, wave)
I want to create a summary table for the mean for each length (1-10) by each 'wave'. If I just had data from one time point, I would use:
aggregate(df$value, by=list(Category=df$length), FUN=sum)
But how do I calculate this for all my different waves? Can I do this in one command?
Do you mean something like this...:
> aggregate(value~length+wave, data=df, FUN=sum)
length wave value
1 1 1 -14.055504
2 6 1 -11.303317
3 2 2 -24.260527
4 7 2 4.307751
5 3 3 -2.128476
6 8 3 11.522721
7 4 4 -1.202818
8 9 4 20.985253
9 5 5 12.848358
10 10 5 -9.189343
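The question asks for the mean while the commands above use sum. As a sketch, swapping in FUN = mean gives the means in the same long layout, and tapply() gives a length-by-wave table. The names means_long and means_wide are illustrative only.

```r
# Rebuild the sample data from the question
set.seed(1000)
df <- data.frame(
  value  = rnorm(1000),
  length = rep(1:10, times = 10, each = 10),
  wave   = rep(1:5, times = 20, each = 10)
)

# Long format: one row per length/wave combination that occurs in the data
means_long <- aggregate(value ~ length + wave, data = df, FUN = mean)

# Wide format: lengths in rows, waves in columns;
# combinations that never occur are filled with NA
means_wide <- tapply(df$value, list(length = df$length, wave = df$wave), mean)

print(means_long)
print(round(means_wide, 3))
```

In this sample each length only ever appears in one wave, so the wide table is mostly NA.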
I have a big data frame df with over 3000 rows. I am programming in R. It looks like this:
df:
person1 person2 calls
      1       3     5
      1       4     7
      2      11     6
      3       1     5
      3       2     1
      3       4    13
and so on.
What I want to do is to get the total number of calls that each person made and received, in two matrices. This would look like this:
calls:                received:
person madecalls      person receivedcalls
     1        12           1             5
     2         6           2             1
     3        19           3             5
                           4            20
                          11             6
Can anyone help me with this problem?
Thanks!
Use the aggregate function:
made.calls <- aggregate(df$calls, by = list(person = df$person1), FUN = sum)
Or the plyr way:
library(plyr)
ddply(df, .(person1), function(x) data.frame( madecalls = sum(x$calls) ))
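The answers above only cover calls made. A minimal sketch of both tables, using the six rows shown in the question as stand-in data (the object names made, received and both are illustrative):

```r
# The six example rows from the question
df <- data.frame(
  person1 = c(1, 1, 2, 3, 3, 3),
  person2 = c(3, 4, 11, 1, 2, 4),
  calls   = c(5, 7, 6, 5, 1, 13)
)

# Calls made: sum calls by the caller
made <- aggregate(calls ~ person1, data = df, FUN = sum)
names(made) <- c("person", "madecalls")

# Calls received: sum calls by the person called
received <- aggregate(calls ~ person2, data = df, FUN = sum)
names(received) <- c("person", "receivedcalls")

# Optional: one table with both counts; NA where a person has no entry
both <- merge(made, received, by = "person", all = TRUE)

print(made)
print(received)
print(both)
```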
I'm so new to R that I'm having trouble finding what I need in other peoples' questions. I think my question is so easy that nobody else has bothered to ask it.
What would be the simplest code to create a new data frame which excludes data which are univariate outliers (which I'm defining as points which are 3 SDs from their condition's mean), within their condition, on a certain variable?
I'm embarrassed to show what I've tried but here it is
greaterthan <- mean(dat$var2[dat$condition=="one"]) +
2.5*(sd(dat$var2[dat$condition=="one"]))
lessthan <- mean(dat$var2[dat$condition=="one"]) -
2.5*(sd(dat$var2[dat$condition=="one"]))
withoutliersremovedone1 <-dat$var2[dat$condition=="one"] < greaterthan
and I'm pretty much already stuck there.
Thanks
> dat <- data.frame(
var1=sample(letters[1:2],10,replace=TRUE),
var2=c(1,2,3,1,2,3,102,3,1,2)
)
> dat
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
7 a 102 #outlier
8 b 3
9 b 1
10 a 2
Now only return those rows which are not (!) greater than 2 absolute sd's from the mean of the variable in question. Obviously change 2 to however many sd's you want to be the cutoff.
> dat[!(abs(dat$var2 - mean(dat$var2))/sd(dat$var2)) > 2,]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3 # no outlier
8 b 3 # between here
9 b 1
10 a 2
Or more short-hand using the scale function:
dat[!abs(scale(dat$var2)) > 2,]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
8 b 3
9 b 1
10 a 2
edit
This can be extended to looking within groups using by
do.call(rbind,by(dat,dat$var1,function(x) x[!abs(scale(x$var2)) > 2,] ))
This assumes dat$var1 is your variable defining the group each row belongs to.
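One caveat with the group-wise version: in a group of n values, no point can lie more than (n - 1)/sqrt(n) sample SDs from the group mean, which is about 1.79 for n = 5. With the groups printed above, the 102 is therefore kept when the cutoff is 2. The sketch below uses ave() to compute the scores within each group, and also tries a median/MAD score, which is a swapped-in alternative rather than what the answer uses. The var1 values are copied from the printed output, and all object names are illustrative.

```r
# Data as printed above (var1 fixed to the values shown there)
dat <- data.frame(
  var1 = c("b", "a", "a", "a", "b", "b", "a", "b", "b", "a"),
  var2 = c(1, 2, 3, 1, 2, 3, 102, 3, 1, 2)
)

# z-scores computed within each group of var1
z_within <- ave(dat$var2, dat$var1, FUN = function(v) (v - mean(v)) / sd(v))
cleaned_sd <- dat[abs(z_within) <= 2, ]

# Robust alternative: distance from the group median in units of the group MAD.
# Assumes mad(v) > 0 in every group; a zero MAD would give Inf or NaN scores.
mad_score <- ave(dat$var2, dat$var1, FUN = function(v) abs(v - median(v)) / mad(v))
cleaned_mad <- dat[mad_score <= 2, ]

print(round(z_within, 2))
print(cleaned_mad)
```

Here cleaned_sd still contains all ten rows, while cleaned_mad drops only row 7.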
I use the winsorize() function in the robustHD package for this task. Here is its example:
R> example(winsorize)
winsrzR> ## generate data
winsrzR> set.seed(1234) # for reproducibility
winsrzR> x <- rnorm(10) # standard normal
winsrzR> x[1] <- x[1] * 10 # introduce outlier
winsrzR> ## winsorize data
winsrzR> x
[1] -12.070657 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR> winsorize(x)
[1] -3.250372 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR>
This defaults to median +/- 2 mad, but you can set the parameters for mean +/- 3 sd.
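For comparison, clamping at mean +/- k SDs can be sketched by hand in base R. This is not robustHD's implementation, and clamp_sd is a hypothetical helper name; it only illustrates the idea of pulling extreme values back to the limits.

```r
# Hypothetical helper: pull values outside mean +/- k * sd back to those limits
clamp_sd <- function(x, k = 3) {
  centre <- mean(x)
  spread <- sd(x)
  pmin(pmax(x, centre - k * spread), centre + k * spread)
}

# 19 zeros and one extreme value
x <- c(rep(0, 19), 100)
clamped <- clamp_sd(x)

# The zeros are untouched; the extreme value is reduced to mean + 3 * sd
print(clamped)
```

Because the mean and SD are themselves inflated by the outlier, the limits are wide; that is the reason the package defaults to the median and MAD.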
I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use head(mass, 1) and tail(mass, 1) in place of mass[1] and mass[length(mass)].
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
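The same first/last idea also works without plyr. A sketch in base R, using the six rows from the question as the data frame test: order() sorts by individual and time, and ave() copies the group's first and last mass onto every row of that group.

```r
# The six example rows from the question
test <- data.frame(
  indv = c(1, 2, 1, 2, 2, 1),
  time = c(10, 5, 5, 4, 14, 15),
  mass = c(7, 3, 1, 4, 14, 15)
)

# Sort by individual, then by time within individual
sorted <- test[order(test$indv, test$time), ]

# ave() applies the function per individual and recycles the single value
# over all rows of that individual
sorted$mass_i <- ave(sorted$mass, sorted$indv, FUN = function(m) m[1])
sorted$mass_f <- ave(sorted$mass, sorted$indv, FUN = function(m) m[length(m)])

print(sorted)
```

This works for any number of observations per individual, since length(m) is evaluated separately for each group.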
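Once you have this table, it's pretty simple:
max_mass <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and the maximum mass as the values. That equals mass_f only when each individual's mass is largest at its final time point, as it is in the example data.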
# Say I have a situation like this
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
# And I've written a function to calculate the interval
interval <- function(data) {
interval = c(Inf)
for (i in seq(1, length(data$time))) {
intv = data$time[i]-data$time[i-1]
interval = append(interval, intv)
}
data$interval = interval
return (data)
}
#But when I want to get intervals by user_id and bind them to the data.frame,
#I can't find a proper way
#Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function which computes the differences between adjacent indices in a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
To that we can prepend Inf to the differences via c(Inf, diff(x)).
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame with a time component that is itself a matrix, with one row per user_id. We need to convert that matrix to a vector, relying upon the fact that R stores matrices column by column. Note that the result only lines up with the original rows when every user has the same number of visits and the rows cycle through the users in sorted user_id order, as in your example. Finally, we add an interval column to the input data as per your original version of the function.
interval <- function(x) {
diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
diffs <- as.numeric(diffs$time)
x <- within(x, interval <- diffs)
x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
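If users can have different numbers of visits, the matrix trick above no longer applies. One alternative sketch uses ave(), which puts each per-user result back in the original row positions. It assumes the rows are already in time order within each user; the unbalanced data frame is made up for illustration.

```r
# Balanced example from the question: 5 users, 2 visits each
visit_log <- data.frame(user_id = c(1:5, 1:5), time = 1:10)

# Per user: Inf for the first visit, then the gap to the previous visit
visit_log$interval <- ave(visit_log$time, visit_log$user_id,
                          FUN = function(t) c(Inf, diff(t)))
print(visit_log)

# Made-up unbalanced example: user 1 has three visits, user 2 has two
unbalanced <- data.frame(user_id = c(1, 2, 1, 1, 2), time = c(1, 2, 4, 9, 10))
unbalanced$interval <- ave(unbalanced$time, unbalanced$user_id,
                           FUN = function(t) c(Inf, diff(t)))
print(unbalanced)
```

If the rows are not in time order, sort them first, for example with order(user_id, time).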