I am analysing my data with R for the first time, which is a bit challenging. I have a data frame with my data that looks like this:
head(data)
subject group age trial cond acc rt
1 S1 2 1 1 1 1 5045
2 S1 2 1 2 2 1 8034
3 S1 2 1 3 1 1 6236
4 S1 2 1 4 2 1 8087
5 S1 2 1 5 3 0 8756
6 S1 2 1 6 1 1 6619
I would like to compute a mean and standard deviation for each subject in each condition for rt, and a sum for each subject in each condition for acc. All the other variables should remain the same (group and age are subject-specific, and trial can be disregarded).
I have tried using aggregate but that seemed kind of complicated because I had to do it in several steps and re-add information...
I'd be thankful for any help =)
Edit: I realise that I wasn't being clear. I want trial to be disregarded and end up with one row per subject per condition:
head(data_new)
subject group age cond rt_mean rt_sd acc_sum
1 S1 2 1 1 7581 100 5
2 S2 2 1 2 8034 150 4
Sorry about the confusion!
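For reference, the aggregate route mentioned above can be kept fairly short: two aggregate calls plus a merge. A minimal base-R sketch, assuming the data frame shown above (the intermediate object names are just illustrative):
rt_stats <- aggregate(rt ~ subject + cond, data = data,
                      FUN = function(x) c(mean = mean(x), sd = sd(x)))
rt_stats <- do.call(data.frame, rt_stats) # split the matrix column into rt.mean / rt.sd
acc_stats <- aggregate(acc ~ subject + cond, data = data, FUN = sum)
subj_info <- unique(data[, c("subject", "group", "age")])
data_new <- merge(merge(rt_stats, acc_stats, by = c("subject", "cond")),
                  subj_info, by = "subject")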
If you don't mind using the data.table package:
library(data.table)
data <- data.table(data)
data[, ':=' (rt_mean = mean(rt), rt_sd = sd(rt), acc_sum = sum(acc)), by = .(subject, cond)]
data
subject group age trial cond acc rt rt_mean rt_sd acc_sum
1: S1 2 1 1 1 1 5045 5966.667 820.83758 3
2: S1 2 1 2 2 1 8034 8060.500 37.47666 2
3: S1 2 1 3 1 1 6236 5966.667 820.83758 3
4: S1 2 1 4 2 1 8087 8060.500 37.47666 2
5: S1 2 1 5 3 0 8756 8756.000 NA 0
6: S1 2 1 6 1 1 6619 5966.667 820.83758 3
Edit:
If you want to get rid of some of the variables and the duplicated rows, you only need a small modification: remove the := assignment operator (instead of adding new columns, it will now create a new data.table), add the variables you want to keep, and use the unique function:
unique(data[, .(group, age, rt_mean = mean(rt), rt_sd = sd(rt), acc_sum = sum(acc)), by = .(subject, cond)])
subject cond group age rt_mean rt_sd acc_sum
1: S1 1 2 1 5966.667 820.83758 3
2: S1 2 2 1 8060.500 37.47666 2
3: S1 3 2 1 8756.000 NA 0
If you additionally want to get rid of rows with missing values, use the na.omit function.
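For example, wrapping the call above in na.omit would drop the condition-3 row, where rt_sd is NA (a small sketch based on the same data.table):
na.omit(unique(data[, .(group, age, rt_mean = mean(rt), rt_sd = sd(rt), acc_sum = sum(acc)),
                    by = .(subject, cond)]))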
The package dplyr is made for this:
library(dplyr)
d %>%
  group_by(subject, cond) %>% # we group by the two variables
  summarise(
    mean_rt = mean(rt, na.rm = TRUE),
    sd_rt = sd(rt, na.rm = TRUE),
    sum_acc = sum(acc, na.rm = TRUE) # here we apply each function to summarise the values
  )
# A tibble: 3 x 5
# Groups: subject [?]
subject cond mean_rt sd_rt sum_acc
<fct> <int> <dbl> <dbl> <int>
1 S1 1 5967. 821. 3
2 S1 2 8060. 37.5 2
3 S1 3 8756 NA 0
# NA for the last sd_rt is because you can't compute sd for a single observation.
Basically, you group_by the columns (one or more) that define the grouping, and then inside summarise you apply each function you need (mean, sd, sum, etc.) to each variable (rt, acc, etc.).
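If you are on dplyr 1.0 or later, across() can apply several functions to one variable in a single step; a sketch of the same summary (across()'s default naming produces rt_mean and rt_sd):
d %>%
  group_by(subject, cond) %>%
  summarise(
    across(rt, list(mean = ~ mean(.x, na.rm = TRUE),
                    sd = ~ sd(.x, na.rm = TRUE))), # creates rt_mean and rt_sd
    acc_sum = sum(acc, na.rm = TRUE),
    .groups = "drop"
  )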
Replace summarise with mutate if you want to keep all the variables:
d %>%
  select(-trial) %>% # use select with -var_name to eliminate columns
  group_by(subject, cond) %>%
  mutate(
    mean_rt = mean(rt, na.rm = TRUE),
    sd_rt = sd(rt, na.rm = TRUE),
    sum_acc = sum(acc, na.rm = TRUE)
  ) %>%
  ungroup()
# A tibble: 6 x 9
subject group age cond acc rt mean_rt sd_rt sum_acc
<fct> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
1 S1 2 1 1 1 5045 5967. 821. 3
2 S1 2 1 2 1 8034 8060. 37.5 2
3 S1 2 1 1 1 6236 5967. 821. 3
4 S1 2 1 2 1 8087 8060. 37.5 2
5 S1 2 1 3 0 8756 8756 NA 0
6 S1 2 1 1 1 6619 5967. 821. 3
Update based on the OP's request; maybe this is what you need:
d %>%
  group_by(subject, cond, group, age) %>%
  summarise(
    mean_rt = mean(rt, na.rm = TRUE),
    sd_rt = sd(rt, na.rm = TRUE),
    sum_acc = sum(acc, na.rm = TRUE)
  )
# A tibble: 3 x 7
# Groups: subject, cond, group [?]
subject cond group age mean_rt sd_rt sum_acc
<fct> <int> <int> <int> <dbl> <dbl> <int>
1 S1 1 2 1 5967. 821. 3
2 S1 2 2 1 8060. 37.5 2
3 S1 3 2 1 8756 NA 0
Data used:
tt <- "subject group age trial cond acc rt
S1 2 1 1 1 1 5045
S1 2 1 2 2 1 8034
S1 2 1 3 1 1 6236
S1 2 1 4 2 1 8087
S1 2 1 5 3 0 8756
S1 2 1 6 1 1 6619"
d <- read.table(text = tt, header = TRUE)
If you want to compute, for example, the mean of rt for subject S1 under condition 1, you can use mean(data[data$subject == "S1" & data$cond == 1, "rt"]) (indexing the rt column by name rather than by position).
I hope this gives you an idea how you can filter your values.
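To get the same kind of mean for every subject/condition pair at once, one base-R option is tapply (a sketch using the same data frame):
with(data, tapply(rt, list(subject, cond), mean)) # matrix of mean rt: subjects in rows, conditions in columns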
Related
I have long-format clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (patients with complete data for 0 or only 1 time point should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable at 2 or 3 of the time points.
We can filter on the sum of non-NA elements, grouped by 'patientid', to keep only the patientids having more than one non-NA 'outcome':
library(dplyr)
Data %>%
  group_by(patientid) %>%
  filter(sum(!is.na(outcome)) > 1) %>%
  ungroup
Output:
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(!is.na(outcome))) %>% # count the non-missing observations per patient
  filter(observation >= 2) %>%
  ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
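If the helper column should not appear in the final result, a small variation of the same pipeline (a sketch) drops it at the end with select():
Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(!is.na(outcome))) %>%
  filter(observation >= 2) %>%
  ungroup() %>%
  select(-observation) # drop the helper column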
I have a data frame with three variables, one of which has some missing values; it looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr library is a great fit and well worth learning; it can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
df2 <- df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean = mean(na.omit(sad)),
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part you can calculate the mean of sad and the percentage of NA values using is.na and mean.
library(dplyr)
df1 %>%
  group_by(subject, part) %>%
  summarise(sad.m = mean(sad, na.rm = TRUE),
            perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
               perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And to combine it with df2 you can use left_join():
#Left join
df3 <- df1 %>%
  group_by(subject, part) %>%
  summarise(N = 100 * length(which(is.na(sad))) / length(sad)) %>%
  left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
I have a dataset in R that looks like the one below. I found similar posts, like Counting number of times a value occurs, but nothing that matches exactly.
id <- c(1,1,1, 2,2,2, 3,3,3,3)
cat.1 <- c("a","a","a","b","b","b","c","c","c","c")
cat.2 <- c("m","m","m","f","f","f","m","m","m","m")
score <- c(-1,0,-1, 1,0,1, -1,0,1,1)
data <- data.frame("id"=id, "cat.1"=cat.1, "cat.2"=cat.2, "score"=score)
data
id cat.1 cat.2 score
1 1 a m -1
2 1 a m 0
3 1 a m -1
4 2 b f 1
5 2 b f 0
6 2 b f 1
7 3 c m -1
8 3 c m 0
9 3 c m 1
10 3 c m 1
I would like to count the number of -1 values in the score variable within each id. Also, I would like to keep the cat.1 and cat.2 variables. The desired output would be:
id cat.1 cat.2 count(-1)
1 1 a m 2
2 2 b f 0
3 3 c m 1
Do you have any suggestions?
Thanks!
This is something we can use dplyr for:
data %>%
  group_by(id, cat.1, cat.2) %>% # or: group_by_at(vars(-score))
  summarise(count_neg_1 = sum(score == -1))
# id cat.1 cat.2 count_neg_1
# 1 1 a m 2
# 2 2 b f 0
# 3 3 c m 1
You can change the name of the calculated column if you so desire. I generally avoid anything other than a letter, number, or underscore in my variable names.
One base R possibility could be:
aggregate(score ~ ., FUN = function(x) sum(x == -1), data = data)
id cat.1 cat.2 score
1 2 b f 0
2 1 a m 2
3 3 c m 1
If you have more variables in your data and you want to group by just these three, you can specify them explicitly: aggregate(score ~ id + cat.1 + cat.2, ...)
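Spelled out in full, with the same FUN and data as above, the explicit grouping would look something like this (a sketch):
aggregate(score ~ id + cat.1 + cat.2, data = data,
          FUN = function(x) sum(x == -1))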
library(data.table)
setDT(data)[ , sum(score == -1), by=c('id', 'cat.1', 'cat.2')]
# id cat.1 cat.2 V1
# 1: 1 a m 2
# 2: 2 b f 0
# 3: 3 c m 1
Another option is count
library(dplyr)
data %>%
  mutate(score = score == -1) %>%
  dplyr::count(id, cat.1, cat.2, wt = score)
# A tibble: 3 x 4
# id cat.1 cat.2 n
# <dbl> <fct> <fct> <int>
#1 1 a m 2
#2 2 b f 0
#3 3 c m 1
I am having trouble working out how to recover the individual values from a running mean in an R data frame.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Where Mean is the mean of the first x measurements for that ID in the data frame.
To find the individual value at each x rather than the mean, I was thinking that I needed to apply a recursive function to the data frame, grouped by ID. How can I do this when an apply function doesn't have access to the previous entry in the data frame?
When completed and appended to the dataframe, I am hoping it to look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
It's much easier to calculate this by going from totals to individual observations, as below:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr); library(magrittr)
df %>%
  group_by(ID) %>%
  mutate(
    total = Mean * x,
    ind_value = total - lag(total, default = 0)
  )
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <int>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
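The same totals-then-difference idea also works in base R with ave(), assuming the rows are ordered by x within each ID (a sketch):
df$total <- df$Mean * df$x
df$ind_value <- ave(df$total, df$ID, FUN = function(t) t - c(0, head(t, -1)))
df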
I need to add a new row to each id group, where the key is "n" and the value is total - (a + b).
x <- data_frame( id = c(1,1,1,2,2,2,2),
key = c("a","b","total","a","x","b","total"),
value = c(1,2,10,4,1,3,12) )
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a+b subtotal and joining that to the total count to get the difference, but after using nine dplyr verbs I seem to be going in the wrong direction. Thanks.
This isn't a join, it's just binding new rows on:
x %>%
  group_by(id) %>%
  summarize(
    value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
    key = 'n'
  ) %>%
  bind_rows(x) %>%
  select(id, key, value) %>% # back to the original column order
  arrange(id, key)           # and a sensible row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12