Calculate various means from data based on values - r

I have a large spreadsheet of team member ratings from which I want to calculate how people rated themselves, how they were rated by everyone else on their team, and how they rated everyone else on their team (all averages). I've been trying to do this with dplyr because I have used it before and I think group_by will simplify these calculations, but I haven't been able to figure it out, so I'm asking for help. I'll try to explain my thinking.
Here's an example dataset:
data <- read.table(text="
Team Rater A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4 A5 B5 C5 A6 B6 C6
1 1 2 4 4 2 1 5 2 2 3 4 4 4 3 2 1 NA NA NA
1 2 4 5 4 4 5 1 1 1 5 5 3 1 4 5 2 NA NA NA
1 3 2 1 4 3 5 5 2 1 5 1 1 4 1 1 4 NA NA NA
1 4 4 3 4 3 5 1 3 1 3 5 5 5 5 2 2 NA NA NA
1 5 3 4 5 4 3 3 5 5 4 1 4 5 5 5 1 NA NA NA
2 1 3 5 3 4 1 1 3 4 3 4 3 2 2 2 3 3 5 3
2 2 3 2 3 1 1 3 5 5 1 5 2 3 2 2 1 3 3 2
2 3 3 2 3 3 5 2 4 1 1 1 4 5 3 5 2 1 1 3
2 4 3 3 5 4 3 5 3 1 4 3 1 1 4 2 4 3 5 2
2 5 5 2 1 2 5 5 3 3 1 4 1 5 5 3 3 4 2 5
2 6 3 2 3 5 4 3 2 1 5 4 3 1 1 1 4 2 2 1",header = TRUE)
Each rater provides input on multiple questions for each other team member. The way it is organized, rater 1 answers A1, B1, and C1 about themselves. Rater 2 answers A2, B2, and C2 about themselves, and so on.
Self Ratings
To get someone's rating of themselves I figured it would be something like:
data %>%
  group_by(Team) %>%
  mutate(self = rowMeans(select(., ends_with(Rater)), na.rm = TRUE))
It'd be convenient if the column selection were dynamic, based on each person's rater number.
From Others
I was thinking of calculating this based on the average overall rating of that person except the self rating:
data %>%
  group_by(Team) %>%
  mutate(from = (mean(ends_with(Rater)) * n() - self) / (n() - 1))
Of Others
For this column calculation I was thinking something along the lines of:
data %>%
  mutate(of = select(A1:C6, -(ends_with(Rater))) %>% rowMeans(na.rm = TRUE))
(similar to this answer)
Results
Here is an example of what I'm looking for as new columns:
Team Rater self from of
1 1 3.33 3.58 2.75
1 2 3.33 3.33 3.33
1 3 2.67 2.92 2.67
1 4 5.00 3.08 3.00
1 5 3.67 2.67 3.83
If you can help with any of these parts I'd appreciate it!

I would recommend first transforming your data into a "tidy" format with tidyr, like so:
library(dplyr)  # mutate, filter, and the joins below come from dplyr
library(tidyr)
tidy <- data %>%
  gather(QV, Rating, -Team, -Rater) %>%
  separate(QV, into = c("Quest", "Rated"), sep = 1) %>%
  mutate(Rated = as.numeric(Rated)) %>%
  filter(!is.na(Rating))
This transforms your data to have the following shape
Team Rater Quest Rated Rating
1 1 1 A 1 2
2 1 2 A 1 4
3 1 3 A 1 2
4 1 4 A 1 4
5 1 5 A 1 3
6 2 1 A 1 3
...
So we turn your data into a long format. Then you can perform each of the queries a bit more directly and merge them together
Reduce(left_join, list(
  tidy %>% group_by(Team, Rater) %>% filter(Rated == Rater) %>% summarize(self = mean(Rating)),
  tidy %>% group_by(Team, Rated) %>% filter(Rated != Rater) %>% summarize(others = mean(Rating)) %>% rename(Rater = Rated),
  tidy %>% group_by(Team, Rater) %>% filter(Rated != Rater) %>% summarize(of = mean(Rating))
))
This returns
Team Rater self others of
(int) (dbl) (dbl) (dbl) (dbl)
1 1 1 3.333333 3.583333 2.750000
2 1 2 3.333333 3.333333 3.333333
3 1 3 2.666667 2.916667 2.666667
4 1 4 5.000000 3.083333 3.000000
5 1 5 3.666667 2.666667 3.833333
6 2 1 3.666667 2.866667 2.866667
7 2 2 1.666667 3.466667 2.800000
8 2 3 2.000000 2.933333 2.866667
9 2 4 1.666667 3.133333 3.400000
10 2 5 3.666667 2.533333 3.200000
11 2 6 1.666667 3.000000 2.800000
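As a side note, on newer versions of tidyr (1.0 and later), gather and separate are superseded by pivot_longer, which can reshape and split the column names in one step. A minimal sketch of an equivalent, assuming the same data as above:
library(dplyr)
library(tidyr)
# Reshape to long, splitting names like "A1" into Quest ("A") and Rated ("1");
# values_drop_na = TRUE plays the role of filter(!is.na(Rating))
tidy <- data %>%
  pivot_longer(-c(Team, Rater),
               names_to = c("Quest", "Rated"),
               names_pattern = "([A-C])([0-9]+)",
               values_to = "Rating",
               values_drop_na = TRUE) %>%
  mutate(Rated = as.numeric(Rated))
The three summarize/join queries above work unchanged on this version of tidy.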

Related

Creating an indexed column in R, grouped by user_id, that does not increase when NA

I want to create a column (in R) that indexes the presence of a number in another column, grouped by a user_id column; when the other column is NA, the new column should not increase.
The example should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                   one = c(1, NA, 3, 2, NA, 0, NA, 4, 3, 4, NA))
user_id tobeindexed
1 1 1
2 1 NA
3 1 3
4 2 2
5 2 NA
6 2 0
7 2 NA
8 3 4
9 3 3
10 3 4
11 3 NA
I want to make a new column looking like "desired" in the following df:
> cbind(data,data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
user_id tobeindexed desired
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 1
5 2 NA 1
6 2 0 2
7 2 NA 2
8 3 4 1
9 3 3 2
10 3 4 3
11 3 NA 3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes...
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(tobeindexed)))
user_id tobeindexed desired
<dbl> <dbl> <int>
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 3
5 2 NA 3
6 2 0 4
7 2 NA 4
8 3 4 5
9 3 3 6
10 3 4 7
11 3 NA 7
Given the sample data you provided (with the one column), your grouped cumsum code works unchanged; it is repeated below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
# user_id one out
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
dplyr
library(dplyr)
data %>%
  group_by(user_id) %>%
  mutate(out = cumsum(!is.na(one))) %>%
  ungroup()
# # A tibble: 11 × 3
# user_id one out
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
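For completeness, a data.table sketch of the same grouped cumulative count (assuming the one column from the sample data):
library(data.table)
# Cumulative count of non-NA values within each user_id, added by reference
setDT(data)[, out := cumsum(!is.na(one)), by = user_id]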

merge/join two long df in R

I have two dataframes, a and b, which I would like to combine:
a <- data.frame(g = c("1","2","2","3","3","3","4","4","4","4"),
                h = c("1","1","2","1","2","3","1","2","3","4"))
b <- data.frame(g = c("1","2","3","3","3","4","4","4","4","4"),
                i = c("1","2","3","2","1","2","3","4","5","6"))
g represents a grouping variable, and h and i are the columns I want to merge/join.
> a
g h
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
7 4 1
8 4 2
9 4 3
10 4 4
> b
g i
1 1 1
2 2 2
3 3 3
4 3 2
5 3 1
6 4 2
7 4 3
8 4 4
9 4 5
10 4 6
a and b should be merged at the level of the grouping variable g: identical values of h and i should be matched with each other (independent of the order in which they appear in h/i), and non-matching values should each appear once (not all possible combinations).
The final df would look like:
g h i
1 1 1 1
2 2 1 <NA>
3 2 2 2
4 3 1 1
5 3 2 2
6 3 3 3
7 4 1 <NA>
8 4 2 2
9 4 3 3
10 4 4 4
11 4 <NA> 5
12 4 <NA> 6
I need that df to perform a correlation analysis.
Sounds like a merge on h == i while retaining i, so create a new variable x to join on, and keep join results from both sides (all = TRUE). With a large hat-tip to @Moody_Mudskipper:
merge(transform(a,x=h), transform(b,x=i), all=TRUE)
# g x h i
#1 1 1 1 1
#2 2 1 1 <NA>
#3 2 2 2 2
#4 3 1 1 1
#5 3 2 2 2
#6 3 3 3 3
#7 4 1 1 <NA>
#8 4 2 2 2
#9 4 3 3 3
#10 4 4 4 4
#11 4 5 <NA> 5
#12 4 6 <NA> 6
We can also do this with dplyr
library(dplyr)
a %>%
  mutate(x = h) %>%
  full_join(mutate(b, x = i)) %>%
  select(-x)
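Since the stated goal is a correlation analysis, note that h and i come out of the merge as factor or character columns. A minimal sketch of converting them to numeric before correlating (the object name m is just illustrative):
m <- merge(transform(a, x = h), transform(b, x = i), all = TRUE)
# as.character() first makes the conversion safe whether the columns are factor or character
cor(as.numeric(as.character(m$h)), as.numeric(as.character(m$i)),
    use = "pairwise.complete.obs")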

Reshaping different variables for selecting values from one column in R

Below is a sample of my data; I have more R and O columns.
A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5
I want to get the following data
A R O Value
1 3 1 3
1 5 2 3
1 6 3 4
2 3 1 3
2 5 2 4
2 7 3 4
3 4 1 4
3 5 2 5
3 6 3 5
I tried the melt function, but was unsuccessful. Any help would be very much appreciated.
A solution using dplyr and tidyr. The key is to use gather to collect all the columns other than A, then use extract to split the column names, and then use spread to convert the data frame back to wide format.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  gather(Column, Number, -A) %>%
  extract(Column, into = c("Column", "ID"), regex = "([A-Z]+)([0-9]+)") %>%
  spread(Column, Number) %>%
  select(A, R, O = ID, Value = O)
dt2
# A R O Value
# 1 1 3 1 3
# 2 1 5 2 3
# 3 1 6 3 4
# 4 2 3 1 3
# 5 2 5 2 4
# 6 2 7 3 4
# 7 3 4 1 4
# 8 3 5 2 5
# 9 3 6 3 5
DATA
dt <- read.table(text = "A R1 O1 R2 O2 R3 O3
1 3 3 5 3 6 4
2 3 3 5 4 7 4
3 4 4 5 5 6 5",
header = TRUE)
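On tidyr 1.0+, the gather/extract/spread sequence can be collapsed into a single pivot_longer call via the special ".value" sentinel; a sketch, assuming the same dt:
library(dplyr)
library(tidyr)
# ".value" keeps R and O as separate columns; the trailing digit becomes ID
dt %>%
  pivot_longer(-A,
               names_to = c(".value", "ID"),
               names_pattern = "([A-Z]+)([0-9]+)") %>%
  select(A, R, O = ID, Value = O)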

R Partial Reshape Data from Long to Wide

I would like to reshape a dataset from long to wide. Specifically, the new wide dataset should consist of rows corresponding to the unique IDs in the long dataset, and the number of columns is a multiple of the number of unique values of another variable.
Let's say this is the original dataset:
ID a b C d e f g
1 1 1 1 1 2 3 4
1 1 1 2 5 6 7 8
2 2 2 1 1 2 3 4
2 2 2 3 9 0 1 2
2 2 2 2 5 6 7 8
3 3 3 3 9 0 1 2
3 3 3 2 5 6 7 8
3 3 3 1 1 2 3 4
In the new dataset, the number of rows is the number of unique IDs, the number of columns is 3 plus a multiple of the number of unique values found in variable C (one set of columns d to g per value), and the values from variables d to g are populated after sorting variable C in ascending order. It should look something like this:
ID a b d1 e1 f1 g1 d2 e2 f2 g2 d3 e3 f3 g3
1 1 1 1 2 3 4 5 6 7 8 NA NA NA NA
2 2 2 1 2 3 4 5 6 7 8 9 0 1 2
3 3 3 1 2 3 4 5 6 7 8 9 0 1 2
You can use dcast from data.table:
data.table::setDT(df)
data.table::dcast(df, ID + a + b ~ C, sep = "", value.var = c("d", "e", "f", "g"), fill=NA)
ID a b d1 d2 d3 e1 e2 e3 f1 f2 f3 g1 g2 g3
1: 1 1 1 1 5 NA 2 6 NA 3 7 NA 4 8 NA
2: 2 2 2 1 5 9 2 6 0 3 7 1 4 8 2
3: 3 3 3 1 5 9 2 6 0 3 7 1 4 8 2
Base reshape version - just have to use C as your time variable and away you go.
reshape(dat, idvar=c("ID","a","b"), direction="wide", timevar="C", sep="")
# ID a b d1 e1 f1 g1 d2 e2 f2 g2 d3 e3 f3 g3
#1 1 1 1 1 2 3 4 5 6 7 8 NA NA NA NA
#3 2 2 2 1 2 3 4 5 6 7 8 9 0 1 2
#6 3 3 3 1 2 3 4 5 6 7 8 9 0 1 2
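If you prefer tidyr, pivot_wider (tidyr 1.0+) handles multiple value columns directly; a sketch, assuming the data frame is named df as in the dcast answer:
library(tidyr)
# One d/e/f/g column per unique value of C; names_sep = "" yields names like d1, e2
# (column order may differ from the desired layout)
pivot_wider(df, id_cols = c(ID, a, b),
            names_from = C,
            values_from = c(d, e, f, g),
            names_sep = "")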

Count with table() and exclude 0's

I am trying to count triplets; for this I use three vectors that are packed into a data frame:
X=c(4,4,4,4,4,4,4,4,1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3)
Y=c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,4,2,2,2,2,3,4,1,1,2,2,3,3,4,4)
Z=c(4,4,5,4,4,4,4,4,6,1,1,1,1,1,1,1,2,2,2,2,7,2,3,3,3,3,3,3,3,3)
Count_Frame=data.frame(matrix(NA, nrow=(length(X)), ncol=3))
Count_Frame[1]=X
Count_Frame[2]=Y
Count_Frame[3]=Z
Counts=data.frame(table(Count_Frame))
There is the following problem: if I increase the value range in the vectors or use even more vectors, the Counts data frame quickly approaches its size limit due to the many 0-counts. Is there a way to exclude the 0-counts while generating Counts?
We can use data.table. Convert the data.frame to a data.table (setDT(Count_Frame)); grouped by all the columns (.(X1, X2, X3), the default names from the matrix construction above), we get the number of rows (.N).
library(data.table)
setDT(Count_Frame)[, .N, .(X1, X2, X3)]
#    X1 X2 X3 N
# 1: 4 1 4 7
# 2: 4 1 5 1
# 3: 1 1 6 1
# 4: 1 1 1 3
# 5: 1 2 1 2
# 6: 1 3 1 1
# 7: 1 4 1 1
# 8: 2 2 2 4
# 9: 2 3 7 1
#10: 2 4 2 1
#11: 3 1 3 2
#12: 3 2 3 2
#13: 3 3 3 2
#14: 3 4 3 2
Instead of naming all the columns, we can use names(Count_Frame) as well (useful if there are many columns):
setDT(Count_Frame)[, .N, names(Count_Frame)]
You can accomplish this with aggregate:
Count_Frame$one <- 1
aggregate(one ~ X1 + X2 + X3, data=Count_Frame, FUN=sum)
This calculates the positive counts that table would produce, but does not list the zero counts.
One solution is to create a combination of the column values and count those instead:
library(tidyr)
as.data.frame(table(unite(Count_Frame, tmp, X1, X2, X3))) %>%
  separate(tmp, c('X1', 'X2', 'X3'))
Resulting output is:
X1 X2 X3 Freq
1 1 1 1 3
2 1 1 6 1
3 1 2 1 2
4 1 3 1 1
5 1 4 1 1
6 2 2 2 4
7 2 3 7 1
8 2 4 2 1
9 3 1 3 2
10 3 2 3 2
11 3 3 3 2
12 3 4 3 2
13 4 1 4 7
14 4 1 5 1
Or using plyr:
library(plyr)
count(Count_Frame, colnames(Count_Frame))
output
# > count(Count_Frame, colnames(Count_Frame))
# X1 X2 X3 freq
# 1 1 1 1 3
# 2 1 1 6 1
# 3 1 2 1 2
# 4 1 3 1 1
# 5 1 4 1 1
# 6 2 2 2 4
# 7 2 3 7 1
# 8 2 4 2 1
# 9 3 1 3 2
# 10 3 2 3 2
# 11 3 3 3 2
# 12 3 4 3 2
# 13 4 1 4 7
# 14 4 1 5 1
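A dplyr equivalent, for comparison: count() likewise returns only the combinations that actually occur. A minimal sketch on the same Count_Frame:
library(dplyr)
# One row per observed (X1, X2, X3) triplet, with frequency column n
Count_Frame %>% count(X1, X2, X3)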
