How to calculate variance in a data.table in R

I am a newbie to R. I have a data.table DT as follows:
id time day type
 1    1   9   10
 2    2   3   10
 1    3   6   12
 3    8   9   10
 6    9   9   10
 8    2   6   18
 9    3   5   10
 9    1   4   12
From this I initially wanted the count grouped by day, time, and type, so I did
DT[, .N, by = list(day, time, type)]
which gives the count for each group.
Now I need to calculate the variance for each group, so I tried
DT[, var(.N), by = list(day, time, type)]
but this gave NA for all fields. Any help is appreciated.

In the example given, most day/time/type combinations contain only a single row, and var() of a single value is NA, so there is no variance for those groups. (Note also that .N is a single number per group, so var(.N) is always NA.)
DT <- data.frame(id   = c(1, 2, 1, 3, 6, 8, 9, 9),
                 time = c(1, 2, 3, 8, 9, 2, 3, 1),
                 day  = c(9, 3, 6, 9, 9, 6, 5, 4),
                 type = c(10, 10, 12, 10, 10, 18, 10, 12))
aggregate(DT, list(DT$id), FUN = var)
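Since the question uses data.table, here is a minimal sketch of the same check in that idiom, assuming the variance of id within each group is what's wanted; every group here is a singleton, so var() comes back NA, reproducing the behaviour above:
library(data.table)
setDT(DT)

# .N is the group size; var(id) is NA whenever a group has only one row
DT[, .(count = .N, var_id = var(id)), by = .(day, time, type)]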

Related

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <- tibble::tibble(StudentID = 1:10,
                         Condition = sample(conditions, size = 10, replace = TRUE),
                         XP = stats::rpois(n = 10, lambda = 15))
This creates the following tibble.
StudentID Condition XP
        1         2  8
        2         3 11
        3         3 16
        4         3 12
        5         1 22
        6         3 16
        7         1 18
        8         3  8
        9         2 14
       10         1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe; in other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
        1         2  8      4
        2         3 11      1
        3         3 16      2
        4         3 12      1
        5         1 22      3
        6         3 16     NA
        7         1 18      3
        8         3  8      2
        9         2 14      4
       10         1 17     NA
Note that two students did not receive a pairing because there was an odd number of students in conditions 1 and 3. If a condition has an odd count, the leftover student's DyadID can be NA.
Thank you for your help with this!
Using match to get a unique id according to Condition and sample for randomness.
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition, sample(unique(Condition))))
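Note that match gives every student in the same Condition the same id, so the code above labels conditions rather than dyads. Here is a sketch of random within-condition pairing, using a hypothetical pair_ids() helper, leaving the odd student out as NA, and offsetting ids by group so they stay unique across conditions (this assumes dplyr 1.0+ for cur_group_id() and fewer than 100 dyads per condition):
library(dplyr)

# Hypothetical helper: random pair ids for a group of size n;
# one student is left NA when n is odd.
pair_ids <- function(n) {
  ids <- rep(seq_len(n %/% 2), each = 2)  # one id per pair of students
  if (n %% 2 == 1) ids <- c(ids, NA)      # leftover student gets NA
  ids[sample.int(length(ids))]            # shuffle so the pairing is random
}

set.seed(111)
df_sim %>%
  group_by(Condition) %>%
  mutate(DyadID = pair_ids(n()) + 100 * (cur_group_id() - 1)) %>%
  ungroup()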

Find historical maximums in time series

I am trying to identify historical maximum records in time-series data. I need to identify records only as they pertain to the data up to that point, not the maximum of the whole vector.
An example:
set.seed(431)
df <- data.frame(time = c(1:10), value = runif(10, 1, 10))
df
time value
1 1 7.758703
2 2 6.262849
3 3 8.281712
4 4 8.243617
5 5 6.781752
6 6 2.078103
7 7 4.455353
8 8 1.339119
9 9 3.635554
10 10 9.084619
What I want to do is produce the vector that identifies the following record high numbers moving forward in time:
time value record
1 1 7.758703 yes
2 2 6.262849 no
3 3 8.281712 yes
4 4 8.243617 no
5 5 6.781752 no
6 6 2.078103 no
7 7 4.455353 no
8 8 1.339119 no
9 9 3.635554 no
10 10 9.084619 yes
The value at time 1 is a record because no values exist prior to it, therefore it is the maximum so far. The value at time 3 is a record because it is higher than the value at time 1. The value at time 10 is a record because it is higher than the value at time 3.
All I have been able to do is test the max value for the whole vector (i.e. identify the value at time 10), rather than the vector up to the time point being considered. I tried mutate from dplyr but couldn't make it work. Then I looked at writing a for loop that would append values to a vector and look for the maximum within that growing vector, but that led me to posts suggesting it was a more Pythonic than R-like way of doing things.
Can anyone help? I imagine this is easy.
An option is to take the cummax() of 'value' and check whether it equals 'value':
library(dplyr)
df %>%
  mutate(record = c('no', 'yes')[(value == cummax(value)) + 1])
# A tibble: 10 x 3
# time value record
# <int> <dbl> <chr>
# 1 1 7.76 yes
# 2 2 6.26 no
# 3 3 8.28 yes
# 4 4 8.24 no
# 5 5 6.78 no
# 6 6 2.08 no
# 7 7 4.46 no
# 8 8 1.34 no
# 9 9 3.64 no
#10 10 9.08 yes
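The same cummax() comparison also works in base R without dplyr, for example:
df$record <- ifelse(df$value == cummax(df$value), "yes", "no")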

Adding NA's where data is missing [duplicate]

This question already has an answer here: Insert missing time rows into a dataframe.
I have a dataset that looks like the following:
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id, cycle, value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
So basically there is a variable called id that identifies the sample, a variable called cycle that identifies the timepoint, and a variable called value that holds the value at that timepoint.
As you see, sample 3 does not have cycle 2 data, and sample 4 is missing data for cycles 1 and 3. What I want to know is whether there is a way, without a loop, to insert NA's where there is no data. I would like my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id, cycle, value)

grid_dat <- expand.grid(id = 1:4, cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id), cycle = unique(dat$cycle))

merge(x = grid_dat, y = dat, by = c('id', 'cycle'), all.x = TRUE)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the tidyverse: tidyr's complete() expands id and cycle to all combinations and fills the missing rows with NA.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join: CJ() builds all unique combinations of id and cycle, and the join fills unmatched rows with NA.
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

Difference between aggregate and table functions

Age <- c(90,56,51,64,67,59,51,55,48,50,43,57,44,55,60,39,62,66,49,61,58,55,45,47,54,56,52,54,50,62,48,52,50,65,59,68,55,78,62,56)
Tenure <- c(2,2,3,4,3,3,2,2,2,3,3,2,4,3,2,4,1,3,4,2,2,4,3,4,1,2,2,3,3,1,3,4,3,2,2,2,2,3,1,1)
df <- data.frame(Age, Tenure)
I'm trying to count the unique values of Tenure, thus I've used the table() function to look at the frequencies
table(df$Tenure)
1 2 3 4
5 15 13 7
However, I'm curious to know what the aggregate() function is showing:
aggregate(Age ~ Tenure, df, function(x) length(unique(x)))
Tenure Age
1 1 3
2 2 13
3 3 11
4 4 7
What's the difference between these two outputs?
The reason for the difference is your inclusion of unique in the aggregate call. You are counting the number of distinct Ages by Tenure, not the count of Ages by Tenure. To get the analogous output with aggregate, try:
aggregate(Age ~ Tenure, df, length)
Tenure Age
1 1 5
2 2 15
3 3 13
4 4 7
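For comparison, base R's tapply() can reproduce both views; a quick sketch:
# Row counts per Tenure, matching table(df$Tenure)
tapply(df$Age, df$Tenure, length)

# Distinct Ages per Tenure, matching the aggregate() call with unique
tapply(df$Age, df$Tenure, function(x) length(unique(x)))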

Is there any way to bind data to data.frame by some index?

# Say I have a situation like this
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
# And I've written a function to calculate intervals
interval <- function(data) {
  interval = c(Inf)
  for (i in seq(1, length(data$time))) {
    intv = data$time[i] - data$time[i-1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return(data)
}
#But when I want to get intervals by user_id and bind them to the data.frame,
#I can't find a proper way
#Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
We can prepend Inf to the differences via c(Inf, diff(x)).
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame with a time component that is itself a matrix. We need to convert that matrix to a vector, relying upon the fact that in R, matrices are filled column-first. Finally, we add an interval column to the input data as per your original version of the function.
interval <- function(x) {
  # per-user differences; aggregate() returns the time component as a matrix
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  # flatten the matrix column-first so values line up with the row order
  diffs <- as.numeric(diffs$time)
  x <- within(x, interval <- diffs)
  x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
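A grouped dplyr version is a more direct sketch of the same idea and avoids relying on the matrix filling order:
library(dplyr)

visit_log %>%
  group_by(user_id) %>%
  mutate(interval = c(Inf, diff(time))) %>%  # first visit per user is Inf
  ungroup()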
