R: how to create a median and agreement function for multiple groups

I have the following data structure:
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: ...
So there are several players from several teams answering several questions, and there are always 2 rounds of games.
What I am trying to calculate is the median and the agreement coefficient (see the agrmt package), grouped by Team and Question.
The result should look like this:
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 1 1 1
2: 1 2 2 0.83 1 1
3: ...
4: 5 10 4 1 4 1
Does someone know if this is possible? I could not find a solution for this. I can calculate the median and the agreement coefficient on their own, but not combined.
Every hint is welcome. Thank you very much.
UPDATE:
The agreement function returns a coefficient between -1 and 1. The values mean the following:
1 represents full agreement (e.g. every player answers 5).
0 means no agreement (e.g. every player gives a different answer).
-1 represents disagreement (e.g. some players answer 1 and others answer 5).
Unlike the median, the agreement function takes a frequency vector as input.
For example, we have the following answers
Player Team Round Question Answer
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
The function inputs would look like this:
Median input: 4,5,5,5,5 --> Result: 5
Agreement input: 0,0,0,1,4 --> Result: 0.9
UPDATE 2: SOLVED
The calculation of the agreement could be done with the following code:
agreement(table(factor(x, levels=1:5)))
The final solution is based on sandipan's implementation below. I had to add another sorting step in order to combine the right data.frames:
library(agrmt)
df1 <- unique(df[c('Team', 'Question')])
for (df.R in split(df, df$Round)) {
  round <- unique(df.R$Round)
  # aggregate the data.frame of the current Round
  df2 <- as.data.frame(as.list(aggregate(Answer ~ Team + Question + Round, df.R,
    FUN = function(x) c(Median = median(x), Agrmt = agreement(table(factor(x, levels = 1:5)))))))
  # sort it and keep only the median and agreement columns
  df3 <- df2[with(df2, order(Team, Question)), ][4:5]
  names(df3) <- c(paste('Median_R', round, sep = ''), paste('Agrmt_R', round, sep = ''))
  df1 <- cbind.data.frame(df1, df3)
}
df1
Thank you all for the help.

Here are three approaches: base R aggregate, dplyr, and data.table.
With base R aggregate:
library(agrmt)
aggregate(Answer ~ Team + Round + Question, data = dat,
          FUN = function(x) {
            c(Median = median(x),
              Agreement = agreement(table(factor(x, levels = 1:5))))
          })
Team Round Question Answer.Median Answer.Agreement
1 1 1 1 1.0 1.0
2 1 1 2 5.0 0.9
3 1 1 4 4.0 0.7
With dplyr:
library(dplyr)
dat.summary = dat %>%
  group_by(Team, Round, Question) %>%
  summarise(Median = median(Answer),
            Agreement = agreement(table(factor(Answer, levels = 1:5))))
Team Round Question Median Agreement
1 1 1 1 1 1.0
2 1 1 2 5 0.9
3 1 1 4 4 0.7
With data.table:
library(data.table)
dat.summary = setDT(dat)[, list(Median = median(Answer),
                                Agreement = agreement(table(factor(Answer, levels = 1:5)))),
                         by = list(Team, Round, Question)]
Team Round Question Median Agreement
1: 1 1 1 1 1.0
2: 1 1 2 5 0.9
3: 1 1 4 4 0.7
To get a "wide" data frame as the final output:
In the examples above, I've left the output in "long" format. If you want to reshape to "wide" format, so that each Round gets its own set of columns, you can do as follows:
First, let's add a second Round to the sample data by stacking another copy of the sample data:
library(dplyr)
library(reshape2)
library(agrmt)
dat = bind_rows(dat, dat %>% mutate(Round=2))
Now calculate the median and agreement with the same code we used before in the dplyr example:
dat.summary = dat %>%
  group_by(Team, Round, Question) %>%
  summarise(Median = median(Answer),
            Agreement = agreement(table(factor(Answer, levels = 1:5))))
Finally, reshape to wide format. This requires first "melting" the data to stack the Median and Agreement columns into a single column, and then casting to wide format. We also prepend "Round" to each Round value so that we get the column names we want in the wide data frame:
dat.summary = dat.summary %>%
  mutate(Round = paste0("Round", Round)) %>%
  melt(id.var = c("Team", "Question", "Round")) %>%
  dcast(Team + Question ~ variable + Round, value.var = "value")
Team Question Median_Round1 Median_Round2 Agreement_Round1 Agreement_Round2
1 1 1 1 1 1.0 1.0
2 1 2 5 5 0.9 0.9
3 1 4 4 4 0.7 0.7
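An alternative for the reshaping step, if a recent tidyr is available (pivot_wider needs tidyr >= 1.0.0, which this answer does not otherwise use): starting from the long dat.summary computed above, before the melt/dcast step, a sketch would be
library(dplyr)
library(tidyr)
dat.summary %>%
  ungroup() %>%
  mutate(Round = paste0("Round", Round)) %>%
  pivot_wider(id_cols = c(Team, Question),
              names_from = Round,
              values_from = c(Median, Agreement))
This produces columns named Median_Round1, Median_Round2, Agreement_Round1 and Agreement_Round2.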

I guess you want something as follows, right?
df
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: 2 1 2 1 2
17: 5 1 2 1 3
18: 8 1 2 1 4
19: 2 1 2 2 5
20: 5 1 2 2 3
21: 8 1 2 2 1
22: 2 1 2 4 6
23: 5 1 2 4 1
24: 8 1 2 4 5
library(agrmt)
df1 <- unique(df[c('Team', 'Question')])
for (df.R in split(df, df$Round)) {
  round <- unique(df.R$Round)
  df2 <- as.data.frame(as.list(aggregate(Answer ~ Team + Question + Round, df.R,
    FUN = function(x) c(Median = median(x), Agrmt = agreement(x)))))[4:5]
  names(df2) <- c(paste('Median_R', round, sep = ''), paste('Agrmt_R', round, sep = ''))
  df1 <- cbind.data.frame(df1, df2)
}
df1
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 0.00000000 3 0.2222222
6: 1 2 5 0.04166667 3 0.4444444
11: 1 4 4 -0.05882353 5 -0.5833333


How to keep only first value in every sequence of duplicated values in R [duplicate]

I am trying to create a subset where I keep the first value in each run of repeated numbers in a column. I tried to use:
df %>% group_by(x) %>% slice_head(n = 1)
But it only keeps the first row for each distinct value, not for each run.
An example data where x column contains the repeated sequence can be seen below:
x = c(2,2,2,3,3,3,1,1,1,5,5,5,2,2,2,1,1,1,3,3,3)
y = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 2 1
3 2 1
4 3 1
5 3 1
6 3 1
7 1 1
8 1 1
9 1 1
10 5 1
11 5 1
12 5 1
13 2 1
14 2 1
15 2 1
16 1 1
17 1 1
18 1 1
19 3 1
20 3 1
21 3 1
So the end result that I would like to achive is:
x = c(2,3,1,5,2,1,3)
y = c(1,1,1,1,1,1,1)
df= data.frame(x,y)
> df
x y
1 2 1
2 3 1
3 1 1
4 5 1
5 2 1
6 1 1
7 3 1
Could you please help, or point me to any useful existing topics? I haven't managed to find one.
Thanks
You can try rleid from the data.table package:
> library(data.table)
> setDT(df)[!duplicated(rleid(x))]
x y
1: 2 1
2: 3 1
3: 1 1
4: 5 1
5: 2 1
6: 1 1
7: 3 1
Base R.
df[c(1, diff(df$x)) != 0, ]
Or also with helper functions from data.table.
library(data.table)
df[rowid(rleid(df$x)) == 1L, ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 13 2 1
# 16 1 1
# 19 3 1
Using rle and match. Note that match returns the first position of each value in the whole column, so runs of a previously seen value map back to earlier rows (see the duplicated row names below); it happens to give the right values here only because y is constant.
df[match(with(rle(df$x), values), df$x), ]
# x y
# 1 2 1
# 4 3 1
# 7 1 1
# 10 5 1
# 1.1 2 1
# 7.1 1 1
# 4.1 3 1
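A plain dplyr alternative (a sketch, not part of the original answers): keep the first row plus every row whose x differs from the previous row.
library(dplyr)
df %>% filter(row_number() == 1 | x != lag(x))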

Shifting rows up in columns and flushing the remaining ones

I have a problem with moving rows up by one within each group. When rows become completely NA, I would like to flush (drop) them (see the expected output below). My current approach, however, still keeps the extra rows.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
So, using this approach:
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I am still left with the extra rows in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit and then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# the [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), length.out = nrow(data))
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
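For completeness, a base R sketch under the same assumption as the first answer (at most one non-NA value per group and column), starting again from the original data frame in the question; na.action = na.pass keeps the rows that contain NA so they can be collapsed:
aggregate(. ~ gr, data = data, FUN = function(x) na.omit(x)[1], na.action = na.pass)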

R reshaping data - aggregating data on one part of a table to append to another

I have some survey data that I'd like to reshape to be able to interactively slice and dice using filters. However, I'm stuck on how to reshape the data in traditional ways, and I couldn't figure out the appropriate use of the reshape package. Please help!
The data is as follows: each respondent is in a row, along with the responses to each question. Additional columns hold multiple demographic attributes for the respondent.
ID Q1 Q2 Q3 … Q30 Demo1 Demo2 Demo3 Average Score
1 1 2 2 … 2 1 1 1 2.5
2 2 3 1 … 5 1 2 1 2.7
3 4 1 5 … 4 2 3 2 1.6
4 1 5 4 … 3 2 1 2 2.5
5 3 4 4 … 1 1 2 2 1.4
The goal is to reshape the data so that each unique question/demographic combination appears once, with the average score and the number of responses for that combination as values.
Question Demo1 Demo2 Demo3 Average NumResp
1 1 1 1 3.4 2
1 1 1 2 2.3 5
1 1 1 3 3.1 1
… … … … … ...
30 4 5 3 1.3 9
As a part 2 to the question, there are also calculations that map the responses from the 1-5 scale to "positive", "neutral" or "negative". It would be great to add this as a column that shows the % of all respondents in that specific demographic falling into each of the three categories, with the three values adding up to 100%.
Q Sentiment Demo1 Demo2 Demo3 Average
1 Positive 1 1 1 3.4
1 Neutral 1 1 1 2.3
1 Negative 1 1 1 3.1
… … … … …
30 Negative 4 5 3 1.3
Any help is greatly appreciated! Would prefer to do this in R, though Python will work too.
With melt we can specify the id variables (grouping) or the measure variables (to collapse to "long"). The argument variable.name allows us to name the new variable created by collapsing the wide columns, and value.name allows us to name the value column. All of this and more is in the documentation, ?melt.data.frame.
To create the Sentiment variable we use cut to break the value range of scores into thirds. Its labels argument allows us to choose the names of the new values.
library(reshape2)
m <- melt(df, variable.name="Question", value.name="Average", id=c("Demo1", "Demo2", "Demo3"))
m$Question <- gsub("Q", "", m$Question)
a <- aggregate(Average~., m, mean)
a$Sentiment <- cut(a$Average, seq(1,5,length.out=4), labels=c("Negative", "Neutral", "Positive"), include.lowest=T)
# Demo1 Demo2 Demo3 Question Average Sentiment
# 1 1 1 1 1 1 Negative
# 2 1 2 1 1 2 Negative
# 3 2 1 2 1 1 Negative
# 4 1 2 2 1 3 Neutral
# 5 2 3 2 1 4 Positive
# 6 1 1 1 2 2 Negative
# 7 1 2 1 2 3 Neutral
# 8 2 1 2 2 5 Positive
# 9 1 2 2 2 4 Positive
# 10 2 3 2 2 1 Negative
Note below that I deleted the "ID" and "Average.Score" columns as they will be recalculated in the process.
Data
df <- read.table(text="
ID Q1 Q2 Q3 Q30 Demo1 Demo2 Demo3 Average.Score
1 1 2 2 2 1 1 1 2.5
2 2 3 1 5 1 2 1 2.7
3 4 1 5 4 2 3 2 1.6
4 1 5 4 3 2 1 2 2.5
5 3 4 4 1 1 2 2 1.4", header=T)
df <- df[,!names(df) %in% c("ID", "Average.Score")]
Under the assumption that you have a data set like this (made into a data.table):
ID Q1 Q2 ... Demo1 Demo2 Demo3
1: 1 7 8 2 7 3
2: 2 3 7 6 10 1
3: 3 6 1 5 5 8
4: 4 5 9 10 1 7
5: 5 10 4 8 4 6
and a dictionary of answer scores (vals_enc):
value Question Score
1: 7 1 17
2: 3 1 6
3: 6 1 19
Let's transform the data to have Question, Answer, ID, and Demo columns:
d2 <- melt(dt, id.vars = c('ID', 'Demo1', 'Demo2', 'Demo3'),
           measure.vars = grep('^Q[0-9]+$', colnames(dt), value = TRUE))
d2[, c('Question', 'variable') := list(substring(variable, 2), NULL)]
R> d2
ID Demo1 Demo2 Demo3 value Question
1: 1 2 7 3 7 1
2: 2 6 10 1 3 1
3: 3 5 5 8 6 1
Now let's add scores:
d3 <- merge(d2, vals_enc, by=c('Question', 'value'))
And finally get the average score and the number of respondents per Question and demographics:
d3[, list(Avg=mean(Score), Number=.N), .(Question,Demo1,Demo2,Demo3)]
Question Demo1 Demo2 Demo3 Avg Number
1: 1 6 10 1 6 1
2: 1 10 1 7 18 1
3: 1 5 5 8 19 1
Note: each ID has a single demographic status, so the number of respondents should be the same for every Question within a given demographic combination.
As for part 2 of the question: do you already have such calculations, or are you looking for them?
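In case part 2 is what is being asked for, here is a hedged data.table sketch; the table name long, its columns, and the 1-5 cut points are assumptions rather than part of the question:
library(data.table)
# hypothetical long table: one row per respondent and question, with the raw
# Answer still on the 1-5 scale and one demographic combination per row
long[, Sentiment := cut(Answer, breaks = c(0, 2, 3, 5),
                        labels = c("Negative", "Neutral", "Positive"))]
# share of respondents per sentiment within each question/demographic group
shares <- long[, .N, by = .(Question, Demo1, Demo2, Demo3, Sentiment)]
shares[, Share := N / sum(N), by = .(Question, Demo1, Demo2, Demo3)]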

calculate each chunk by group using dplyr?

How can I get the expected calculation using the dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate this for each chunk (rows 1-3, 4-5, 6-7), even though rows 1-3 and 6-7 are labelled with the same group number?
Here is a similar approach. I created a new grouping variable using cumsum: whenever the difference between two consecutive numbers in group is not 0, a new group number is assigned. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(T, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table 1.9.5. The devel version introduced the new functions rleid and shift (the default type is "lag" and the default fill is NA), which are useful here.
library(data.table)
setDT(df)[, expected := value - shift(value), by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
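For reference, newer dplyr (>= 1.1.0, beyond what the answers above used) provides consecutive_id(), which builds the same run id without data.table; a sketch:
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0
df %>%
  group_by(chunk = consecutive_id(group)) %>%
  mutate(expected = value - lag(value)) %>%
  ungroup()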

paste values within categories defined by multiple columns

I want to pivot the result column in df horizontally, creating a data set with a separate row for each region, state, county combination, where the columns are ordered by year and then city.
I also want to identify each row in the new data set by region, state and county, and remove the white space between the four result columns. The code below does all of that, but I suspect it is not very efficient.
Is there a way to do this with reshape2 without creating a unique identifier for each group and numbering observations within each group? Is there a way to use apply in place of the for-loop to remove white space from a matrix? (Matrix here being used in a different manner than a mathematical or programming construct.) I realize those are two separate questions and maybe I should post each question separately.
Given that I can achieve the desired result and am only looking to improve the code I do not know whether I should even post this, but I am hoping to learn. Thanks for any advice.
df <- read.table(text= "
region state county city year result
1 1 1 1 1 1
1 1 1 2 1 2
1 1 1 1 2 3
1 1 1 2 2 4
1 1 2 3 1 4
1 1 2 4 1 3
1 1 2 3 2 2
1 1 2 4 2 1
1 2 1 1 1 0
1 2 1 2 1 NA
1 2 1 1 2 0
1 2 1 2 2 0
1 2 2 3 1 2
1 2 2 4 1 2
1 2 2 3 2 2
1 2 2 4 2 2
2 1 1 1 1 9
2 1 1 2 1 9
2 1 1 1 2 8
2 1 1 2 2 8
2 1 2 3 1 1
2 1 2 4 1 0
2 1 2 3 2 1
2 1 2 4 2 0
2 2 1 1 1 2
2 2 1 2 1 4
2 2 1 1 2 6
2 2 1 2 2 8
2 2 2 3 1 3
2 2 2 4 1 3
2 2 2 3 2 2
2 2 2 4 2 2
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
region state county results
1 1 1 1234
1 1 2 4321
1 2 1 0.00
1 2 2 2222
2 1 1 9988
2 1 2 1010
2 2 1 2468
2 2 2 3322
", header=TRUE, colClasses=c('numeric','numeric','numeric','character'))
# redefine variables for package reshape2 creating a unique id for each
# region, state, county combination and then number observations in
# each of those combinations
library(reshape2)
id.var <- df$region*100000 + df$state*1000 + df$county
obsnum <- sequence(rle(id.var)$lengths)
df2 <- dcast(df, region + state + county ~ obsnum, value.var = "result")
# remove spaces between columns of results matrix
# with a for-loop. How can I use apply to do this?
x <- df2[,4:(4+max(obsnum)-1)]
# use a dot to represent a missing observation
x[is.na(x)] = '.'
x.cat = numeric(nrow(x))
for (i in 1:nrow(x)) {
  x.cat[i] = paste(x[i, ], collapse = "")
}
df3 <- cbind(df2[,1:3],x.cat)
colnames(df3) <- c("region", "state", "county", "results")
df3
df3 == desired.result
EDIT:
Matthew Lundberg's answer below is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, I have posted an apparent way to do that below that modifies Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.
I think this does what you want:
df$result <- as.character(df$result)
df$result[is.na(df$result)] <- '.'
aggregate(result ~ county+state+region, data=df, paste0, collapse='')
county state region result
1 1 1 1 1234
2 2 1 1 4321
3 1 2 1 0.00
4 2 2 1 2222
5 1 1 2 9988
6 2 1 2 1010
7 1 2 2 2468
8 2 2 2 3322
This relies on your data frame being sorted in the proper order (as yours is).
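If the ordering cannot be guaranteed, a small defensive step (not part of the original answer) is to sort explicitly before aggregating, so that paste0 concatenates the results in year-then-city order:
df <- df[order(df$region, df$state, df$county, df$year, df$city), ]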
Matthew Lundberg's answer is excellent. As noted in the edit above, I afterwards realized I also needed to create an output data set in which the four result columns contain numeric, rational numbers separated by a space. Here is an apparent way to do that using a modification of Matthew's answer.
The first two lines are modifications of Matthew's answer.
df$result[is.na(df$result)] <- 'NA'
df2 <- aggregate(result ~ county+state+region, data=df, paste)
Then I specify that NA represents missing observations and use apply to obtain the numeric output.
df2$result[df2$result=='NA'] = NA
new.df <- data.frame(df2[,1:3], apply(df2$result,2,as.numeric))
The output is below except note that I added 0.5 to each value in df shown in the original post.
county state region X1 X2 X3 X4
1 1 1 1.5 2.5 3.5 4.5
2 1 1 4.5 3.5 2.5 1.5
1 2 1 0.5 NA 0.5 0.5
2 2 1 2.5 2.5 2.5 2.5
1 1 2 9.5 9.5 8.5 8.5
2 1 2 1.5 0.5 1.5 0.5
1 2 2 2.5 4.5 6.5 8.5
2 2 2 3.5 3.5 2.5 2.5
In my original post I asked how to remove spaces between columns in a data set using apply. That did not prove necessary thanks to Matthew Lundberg's answer to my larger question. Nevertheless, removing spaces between columns of a data set is something I frequently have to do. For completeness, here I post a way to do that using paste0 and apply that arose, in part, from Matthew's answer.
To remove all spaces from the data set x:
x <- read.table(text= "
A B C D
1 1 1 1
1 1 2 2
1 NA 1 3
1 1 2 4
1 2 1 5
1 2 NA 6
1 2 1 7
1 2 2 8
", header=TRUE, na.strings=NA)
# use a dot to represent a missing observation
x[is.na(x)] = '.'
y <- as.data.frame(apply(x, 1, function(i) paste0(i, collapse='')))
colnames(y) <- 'result'
y
Gives:
result
1 1111
2 1122
3 1.13
4 1124
5 1215
6 12.6
7 1217
8 1228
The following code removes the spaces between just the second and third columns:
z <- as.data.frame(apply(x[,2:3], 1, function(i) paste0(i, collapse='')))
y <- data.frame(x[,1], z, x[,4])
colnames(y) <- c('A','BC','D')
y
Giving:
A BC D
1 1 11 1
2 1 12 2
3 1 .1 3
4 1 12 4
5 1 21 5
6 1 2. 6
7 1 21 7
8 1 22 8
