paste values within categories defined by multiple columns - r

I want to pivot the result column in df horizontally creating a data set with a separate row for each
region, state, county combination where the columns are ordered by year then city.
I also want to identify each row in the new data set by region, state and county and remove the white space between the four results columns. The code below does all of that, but I suspect it is not very efficient.
Is there a way to do this with reshape2 without creating a unique identifier for each group and numbering observations within each group? Is there a way to use apply in place of the for-loop to remove white space from a matrix? (Matrix here being used in a different manner than a mathematical or programming construct.) I realize those are two separate questions and maybe I should post each question separately.
Given that I can achieve the desired result and am only looking to improve the code I do not know whether I should even post this, but I am hoping to learn. Thanks for any advice.
df <- read.table(text= "
region state county city year result
1 1 1 1 1 1
1 1 1 2 1 2
1 1 1 1 2 3
1 1 1 2 2 4
1 1 2 3 1 4
1 1 2 4 1 3
1 1 2 3 2 2
1 1 2 4 2 1
1 2 1 1 1 0
1 2 1 2 1 NA
1 2 1 1 2 0
1 2 1 2 2 0
1 2 2 3 1 2
1 2 2 4 1 2
1 2 2 3 2 2
1 2 2 4 2 2
2 1 1 1 1 9
2 1 1 2 1 9
2 1 1 1 2 8
2 1 1 2 2 8
2 1 2 3 1 1
2 1 2 4 1 0
2 1 2 3 2 1
2 1 2 4 2 0
2 2 1 1 1 2
2 2 1 2 1 4
2 2 1 1 2 6
2 2 1 2 2 8
2 2 2 3 1 3
2 2 2 4 1 3
2 2 2 3 2 2
2 2 2 4 2 2
", header=TRUE, na.strings=NA)
desired.result <- read.table(text= "
region state county results
1 1 1 1234
1 1 2 4321
1 2 1 0.00
1 2 2 2222
2 1 1 9988
2 1 2 1010
2 2 1 2468
2 2 2 3322
", header=TRUE, colClasses=c('numeric','numeric','numeric','character'))
# redefine variables for package reshape2 creating a unique id for each
# region, state, county combination and then number observations in
# each of those combinations
library(reshape2)
id.var <- df$region*100000 + df$state*1000 + df$county
obsnum <- sequence(rle(id.var)$lengths)
df2 <- dcast(df, region + state + county ~ obsnum, value.var = "result")
# remove spaces between columns of results matrix
# with a for-loop. How can I use apply to do this?
x <- df2[,4:(4+max(obsnum)-1)]
# use a dot to represent a missing observation
x[is.na(x)] = '.'
# pre-allocate a character vector, since the pasted results are strings
x.cat = character(nrow(x))
for(i in seq_len(nrow(x))) {
x.cat[i] = paste(x[i,], collapse="")
}
df3 <- cbind(df2[,1:3],x.cat)
colnames(df3) <- c("region", "state", "county", "results")
df3
df3 == desired.result
EDIT:
Matthew Lundberg's answer below is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, I have posted an apparent way to do that below that modifies Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.

I think this does what you want:
df$result <- as.character(df$result)
df$result[is.na(df$result)] <- '.'
aggregate(result ~ county+state+region, data=df, paste0, collapse='')
county state region result
1 1 1 1 1234
2 2 1 1 4321
3 1 2 1 0.00
4 2 2 1 2222
5 1 1 2 9988
6 2 1 2 1010
7 1 2 2 2468
8 2 2 2 3322
This relies on your data frame being sorted in the proper order (as yours is).
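If the rows are not already in that order, one option is to sort before aggregating. The following is a minimal sketch on a small made-up subset of the data (two counties only); the names shuffled, sorted and out are illustrative and do not come from the original post.

```r
# Small made-up subset: one region, one state, two counties
df <- data.frame(
  region = 1, state = 1,
  county = c(1, 1, 1, 1, 2, 2, 2, 2),
  city   = c(1, 2, 1, 2, 3, 4, 3, 4),
  year   = c(1, 1, 2, 2, 1, 1, 2, 2),
  result = c(1, 2, 3, 4, 4, 3, 2, 1)
)

# Scramble the rows to mimic unsorted input
shuffled <- df[c(8, 3, 1, 6, 2, 7, 5, 4), ]

# Restore the order the answer relies on: group columns first, then year, then city
sorted <- shuffled[order(shuffled$region, shuffled$state, shuffled$county,
                         shuffled$year, shuffled$city), ]

# Same steps as in the answer: convert to character, mark missing values with a dot
sorted$result <- as.character(sorted$result)
sorted$result[is.na(sorted$result)] <- "."

out <- aggregate(result ~ county + state + region, data = sorted,
                 FUN = paste0, collapse = "")
out
```

Sorting by year and then city inside each group is what fixes the order of the pasted digits.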

Matthew Lundberg's answer is excellent. Afterwards I realized I also needed to create an output data set in which the four result columns above contain numeric, rational numbers and are separated by a space. So, here I provide an apparent way to do that using a modification of Matthew's answer. I do not know whether this is accepted protocol, but the new scenario seems so immediately related to the original post that I did not think I should post a new question.
The first two lines are modifications of Matthew's answer.
df$result[is.na(df$result)] <- 'NA'
df2 <- aggregate(result ~ county+state+region, data=df, paste)
Then I specify that NA represents missing observations and use apply to obtain the numeric output.
df2$result[df2$result=='NA'] = NA
new.df <- data.frame(df2[,1:3], apply(df2$result,2,as.numeric))
The output is below except note that I added 0.5 to each value in df shown in the original post.
county state region X1 X2 X3 X4
1 1 1 1.5 2.5 3.5 4.5
2 1 1 4.5 3.5 2.5 1.5
1 2 1 0.5 NA 0.5 0.5
2 2 1 2.5 2.5 2.5 2.5
1 1 2 9.5 9.5 8.5 8.5
2 1 2 1.5 0.5 1.5 0.5
1 2 2 2.5 4.5 6.5 8.5
2 2 2 3.5 3.5 2.5 2.5
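The round trip through 'NA' strings and as.numeric may be avoidable. A minimal sketch, assuming a small made-up data frame with a hypothetical grouping column g: na.action = na.pass keeps the missing values, and a FUN that returns its input unchanged lets aggregate build a numeric matrix column directly (this only works when every group has the same number of rows).

```r
# Made-up data: two groups of two observations, one value missing
df <- data.frame(g = c(1, 1, 2, 2), result = c(1.5, NA, 2.5, 3.5))

# FUN returns each group's values unchanged; na.pass keeps rows with NA
agg <- aggregate(result ~ g, data = df, FUN = function(v) v,
                 na.action = na.pass)

# agg$result is a numeric matrix with one column per observation in a group
wide <- data.frame(g = agg$g, agg$result)
wide
```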

In my original post I asked how to remove spaces between columns in a data set using apply. That did not prove necessary thanks to Matthew Lundberg's answer to my larger question. Nevertheless, removing spaces between columns of a data set is something I frequently have to do. For completeness, here I post a way to do that using paste0 and apply that arose, in part, from Matthew's answer.
To remove all spaces from the data set x:
x <- read.table(text= "
A B C D
1 1 1 1
1 1 2 2
1 NA 1 3
1 1 2 4
1 2 1 5
1 2 NA 6
1 2 1 7
1 2 2 8
", header=TRUE, na.strings=NA)
# use a dot to represent a missing observation
x[is.na(x)] = '.'
y <- as.data.frame(apply(x, 1, function(i) paste0(i, collapse='')))
colnames(y) <- 'result'
y
Gives:
result
1 1111
2 1122
3 1.13
4 1124
5 1215
6 12.6
7 1217
8 1228
The following code removes the spaces between just the second and third columns:
z <- as.data.frame(apply(x[,2:3], 1, function(i) paste0(i, collapse='')))
y <- data.frame(x[,1], z, x[,4])
colnames(y) <- c('A','BC','D')
y
Giving:
A BC D
1 1 11 1
2 1 12 2
3 1 .1 3
4 1 12 4
5 1 21 5
6 1 2. 6
7 1 21 7
8 1 22 8
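As an alternative to apply over rows, paste0 is vectorized over columns, so do.call can paste whole columns at once. This is a sketch on a made-up three-row data set, not the poster's method; it also sidesteps the matrix conversion that apply performs on a data frame.

```r
# Made-up data with missing values in columns B and C
x <- data.frame(A = c(1, 1, 1), B = c(1, NA, 2), C = c(1, 1, NA), D = c(1, 3, 6))

# Use a dot to represent a missing observation
x[is.na(x)] <- "."

# Pass each column as a separate (unnamed) argument to paste0
result <- do.call(paste0, unname(as.list(x)))
result

# Paste only the second and third columns
bc <- do.call(paste0, unname(as.list(x[, 2:3])))
bc
```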

Related

R - Count duplicates values for each row

I'm working with a data frame for which I need to calculate Fleiss' kappa for inter-rater agreement. I'm using the 'irr' package for that.
Besides that, I need to count, for each observation, how many of the raters are in agreement.
My data looks like this:
a b c
1 1 1 1
2 1 2 2
3 2 3 2
4 3 3 1
5 4 2 1
I'm expecting something like this, where count stands for the number of raters in agreement:
a b c count
1 1 1 1 3
2 1 2 2 2
3 2 3 2 2
4 3 3 1 2
5 4 2 1 0
Thanks a lot.
Alternative solution if your data is in a data frame called abc:
as.numeric(apply(abc, 1, function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
mode <- ux[tab == max(tab)]
# size of the agreeing group, or NA when there is no unique mode
if (length(mode) == 1) sum(x == mode) else NA_integer_
}))
When you run it gives:
[1] 3 2 2 2 NA
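If a count of 0 is wanted when no two raters agree, as in the expected output in the question, a shorter sketch is to take the largest frequency in each row and replace a value of 1 with 0. The data frame abc below is rebuilt from the question; the name largest is illustrative.

```r
# Data from the question
abc <- data.frame(a = c(1, 1, 2, 3, 4),
                  b = c(1, 2, 3, 3, 2),
                  c = c(1, 2, 2, 1, 1))

# Largest number of identical ratings in each row
largest <- apply(abc, 1, function(x) max(table(x)))

# A largest group of size 1 means nobody agrees, so report 0
abc$count <- ifelse(largest > 1, largest, 0)
abc
```

Note that this treats ties differently from the answer above: a row with two equally large agreeing groups would report the size of those groups rather than NA.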

R: how to create median and agreement function for multiple groups

I have the following data structure:
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: ...
So there are several players from several teams, answering several questions. There are always 2 rounds of games.
What I am trying to calculate is the median and the agreement coefficient (see the agrmt package) from the data, grouped by Team and Question.
The result should look like this:
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 1 1 1
2: 1 2 2 0.83 1 1
3: ...
4: 5 10 4 1 4 1
Does someone know if this is possible? I could not find a solution for this. I can solve the median and the agreement coefficient standalone, but not combined?
Every hint is welcome. Thank you very much.
UPDATE:
The agreement function returns a coefficient between -1 and 1:
1 represents full agreement (e.g. every player answers 5).
0 means every player gives a different answer.
-1 means disagreement (e.g. some players answer 1 and the others answer 5).
Unlike the median, the agreement function takes a frequency vector as input rather than the raw answers.
For example, we have the following answers
Player Team Round Question Answer
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
The function inputs would look like this:
Median input: 4,5,5,5,5 --> Result: 5
Agreement input: 0,0,0,1,4 --> Result: 0.9
UPDATE 2: SOLVED
The calculation of the agreement could be done with the following code:
agreement(table(factor(x, levels=1:5)))
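To see why the factor levels matter, here is a small sketch that builds the frequency vector for the example answers shown earlier. It stops short of calling agreement(), so it runs without the agrmt package; the variable names are illustrative.

```r
# Answers to Question 2 from the example above
answers <- c(4, 5, 5, 5, 5)

# Fixing the levels to 1:5 keeps zero counts for answers nobody gave
freq <- as.vector(table(factor(answers, levels = 1:5)))
freq

# The median works on the raw answers
med <- median(answers)
med
```

Without levels = 1:5, table() would drop the unused categories and return only the counts for 4 and 5.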
The final version is based on @sandipan's implementation. I had to add another sorting step in order to combine the right data.frames:
library(agrmt)
df1 <- unique(df[c('Team', 'Question')])
for (df.R in split(df, df$Round)) {
round <- unique(df.R$Round)
# get the data.frame of the current Round.
df2 <- as.data.frame(as.list(aggregate(Answer ~ Team + Question + Round,
df.R, FUN = function(x) c(Median = median(x), Agrmt = agreement(table(factor(x, levels=1:5)))))))
# sort it and take only the columns of median and agreement
df3 <- df2[with(df2, order(Team, Question)),][4:5]
names(df3) <- c(paste('Median_R', round, sep=''), paste('Agrmt_R', round, sep=''))
df1 <- cbind.data.frame(df1, df3)
}
df1
Thank you all for the help.
Here are three approaches: base R aggregate, dplyr, and data.table.
With base R aggregate:
library(agrmt)
aggregate(Answer ~ Team + Round + Question, data=dat,
FUN = function(x) {
c(Median=median(x),
Agreement=agreement(table(factor(x, levels=1:5))))
})
Team Round Question Answer.Median Answer.Agreement
1 1 1 1 1.0 1.0
2 1 1 2 5.0 0.9
3 1 1 4 4.0 0.7
With dplyr:
library(dplyr)
dat.summary = dat %>% group_by(Team, Round, Question) %>%
summarise(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5))))
Team Round Question Median Agreement
1 1 1 1 1 1.0
2 1 1 2 5 0.9
3 1 1 4 4 0.7
With data.table:
library(data.table)
dat.summary = setDT(dat)[, list(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5)))),
by=list(Team, Round, Question)]
Team Round Question Median Agreement
1: 1 1 1 1 1.0
2: 1 1 2 5 0.9
3: 1 1 4 4 0.7
To get a "wide" data frame as the final output:
In the examples above, I've left the output in "long" format. If you want to reshape to "wide" format, so that each Round gets its own set of columns, you can do as follows:
First, let's add a second Round to the sample data by stacking another copy of the sample data:
library(dplyr)
library(reshape2)
library(agrmt)
dat = bind_rows(dat, dat %>% mutate(Round=2))
Now calculate the median and agreement with the same code we used before in the dplyr example:
dat.summary = dat %>%
group_by(Team, Round, Question) %>%
summarise(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5))))
Finally, reshape to wide format. This requires first "melting" the data to stack the Median and Agreement columns into a single column, and then casting to wide format. We also include the second line of code to add "Round" to each Round so that we get the column names we want in the wide data frame:
dat.summary = dat.summary %>%
mutate(Round = paste0("Round", Round)) %>%
melt(id.var=c("Team","Question","Round")) %>%
dcast(Team + Question ~ variable + Round, value.var="value")
Team Question Median_Round1 Median_Round2 Agreement_Round1 Agreement_Round2
1 1 1 1 1 1.0 1.0
2 1 2 5 5 0.9 0.9
3 1 4 4 4 0.7 0.7
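For comparison, base R's reshape() can produce a similar wide layout without melting first. The sketch below uses a small made-up summary table (the values are illustrative and were not computed with agrmt); with sep = "_R" the column names come out in the Median_R1 / Agreement_R1 style the question asked for.

```r
# Made-up long-format summary: one row per Team, Round and Question
long <- data.frame(
  Team      = 1,
  Round     = c(1, 1, 2, 2),
  Question  = c(1, 2, 1, 2),
  Median    = c(1, 5, 1, 5),
  Agreement = c(1, 0.9, 1, 0.9)
)

# One row per Team/Question, one set of columns per Round
wide <- reshape(long,
                idvar     = c("Team", "Question"),
                timevar   = "Round",
                direction = "wide",
                sep       = "_R")
wide
```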
I guess you want something as follows, right?
df
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: 2 1 2 1 2
17: 5 1 2 1 3
18: 8 1 2 1 4
19: 2 1 2 2 5
20: 5 1 2 2 3
21: 8 1 2 2 1
22: 2 1 2 4 6
23: 5 1 2 4 1
24: 8 1 2 4 5
library(agrmt)
df1 <- unique(df[c('Team', 'Question')])
for (df.R in split(df, df$Round)) {
round <- unique(df.R$Round)
df2 <- as.data.frame(as.list(aggregate(Answer ~ Team + Question + Round,
df.R, FUN = function(x) c(Median = median(x), Agrmt = agreement(x)))))[4:5]
names(df2) <- c(paste('Median_R', round, sep=''), paste('Agrmt_R', round, sep=''))
df1 <- cbind.data.frame(df1, df2)
}
df1
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 0.00000000 3 0.2222222
6: 1 2 5 0.04166667 3 0.4444444
11: 1 4 4 -0.05882353 5 -0.5833333

Select rows of data frame based on a vector with duplicated values

What I want can be described as follows: I am given a data frame that contains all the case-control pairs. In the following example, y is the id of the case-control pair, and there are 3 pairs in my data set. I am resampling with respect to the values of y, so both members of a pair are selected together or not at all.
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
select_y
> select_y
[1] 1 3 3
Now, I have computed a vector containing the pairs I want to resample, which is select_y above. It means that case-control pair number 1 will be in my new sample, and pair number 3 will be as well, but it will occur twice since 3 appears twice in select_y. The desired output is:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find an efficient way to do this other than writing a for loop...
Solution:
Based on @HubertL's answer, with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
sub_sample_df
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3
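A compact variant of the same idea, sketched here with one iteration per selected pair, is to look up the row numbers of each requested y and index the data frame once. The names idx and resampled are illustrative.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y <- c(1, 3, 3)

# Row numbers of every requested pair, in the order requested (repeats included)
idx <- unlist(lapply(select_y, function(k) which(sample_df$y == k)))

# A single subsetting call repeats rows as needed
resampled <- sample_df[idx, ]
rownames(resampled) <- NULL
resampled
```

Unlike the frequency-based versions, this keeps each pair's rows adjacent and follows the order of select_y.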

R reshaping data - aggregating data on one part of a table to append to another

I have some survey data that I'd like to reshape to be able to interactively slice and dice using filters. However, I'm stuck in how to reshape the data in traditional ways, and I couldn't figure out the appropriate use of the reshape package. Please help!
The data is as follows: each respondent is in a row, along with the responses to each question. In additional columns are multiple demographic columns on the respondent.
ID Q1 Q2 Q3 … Q30 Demo1 Demo2 Demo3 Average Score
1 1 2 2 … 2 1 1 1 2.5
2 2 3 1 … 5 1 2 1 2.7
3 4 1 5 … 4 2 3 2 1.6
4 1 5 4 … 3 2 1 2 2.5
5 3 4 4 … 1 1 2 2 1.4
The goal is to reshape the data to have each unique question/demographic combination be unique, and the average/sample of the scores for that combination as values.
Question Demo1 Demo2 Demo3 Average NumResp
1 1 1 1 3.4 2
1 1 1 2 2.3 5
1 1 1 3 3.1 1
… … … … … ...
30 4 5 3 1.3 9
As a part 2 to the question, there are also calculations that change the responses from the 1-5 scale into "positive", "neutral" or "negative". It would be great to add this as a column that shows % of all respondents in that specific demographic that was either one of the three, with all 3 values adding up to 100%.
Q Sentiment Demo1 Demo2 Demo3 Average
1 Positive 1 1 1 3.4
1 Neutral 1 1 1 2.3
1 Negative 1 1 1 3.1
… … … … …
30 Negative 4 5 3 1.3
Any help is greatly appreciated! Would prefer to do this in R, though Python will work too.
With melt we can specify the id variables (grouping) or the measure variables (to collapse to "long"). The argument variable.name allows us to name the new variable created by collapsing the wide columns, and value.name allows us to name the value column. This and more is covered in the documentation at ?melt.data.frame.
To create the Sentiment variable we use cut to break the value range of scores into thirds. There is an argument called labels that allows us to choose the names of the new values.
library(reshape2)
m <- melt(df, variable.name="Question", value.name="Average", id=c("Demo1", "Demo2", "Demo3"))
m$Question <- gsub("Q", "", m$Question)
a <- aggregate(Average~., m, mean)
a$Sentiment <- cut(a$Average, seq(1,5,length.out=4), labels=c("Negative", "Neutral", "Positive"), include.lowest=TRUE)
# Demo1 Demo2 Demo3 Question Average Sentiment
# 1 1 1 1 1 1 Negative
# 2 1 2 1 1 2 Negative
# 3 2 1 2 1 1 Negative
# 4 1 2 2 1 3 Neutral
# 5 2 3 2 1 4 Positive
# 6 1 1 1 2 2 Negative
# 7 1 2 1 2 3 Neutral
# 8 2 1 2 2 5 Positive
# 9 1 2 2 2 4 Positive
# 10 2 3 2 2 1 Negative
Note below that I deleted the "ID" and "Average.Score" columns as they will be recalculated in the process.
Data
df <- read.table(text="
ID Q1 Q2 Q3 Q30 Demo1 Demo2 Demo3 Average.Score
1 1 2 2 2 1 1 1 2.5
2 2 3 1 5 1 2 1 2.7
3 4 1 5 4 2 3 2 1.6
4 1 5 4 3 2 1 2 2.5
5 3 4 4 1 1 2 2 1.4", header=T)
df <- df[,!names(df) %in% c("ID", "Average.Score")]
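For part 2 of the question (the share of each sentiment), one possible sketch combines the same cut() breaks with prop.table(). The scores vector below is made up for illustration, and the thirds-based breaks are this answer's choice rather than a fixed rule.

```r
# Made-up responses on the 1-5 scale
scores <- c(1, 2, 3, 4, 5)

# Same breaks as above: the 1-5 range split into thirds
sentiment <- cut(scores, seq(1, 5, length.out = 4),
                 labels = c("Negative", "Neutral", "Positive"),
                 include.lowest = TRUE)

# Percentage of responses in each category; the three values sum to 100
pct <- 100 * prop.table(table(sentiment))
pct
```

To get the percentages per demographic group, the same two steps could be applied within each group, for example through aggregate or tapply.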
Assuming you have a data set like this (converted to a data.table):
ID Q1 Q2 ... Demo1 Demo2 Demo3
1: 1 7 8 2 7 3
2: 2 3 7 6 10 1
3: 3 6 1 5 5 8
4: 4 5 9 10 1 7
5: 5 10 4 8 4 6
and a dictionary of answer scores (vals_enc in the code below):
value Question Score
1: 7 1 17
2: 3 1 6
3: 6 1 19
Let's transform the data to have Question, Answer (value), ID and Demo columns:
d2 <- melt(dt, id.vars=c('ID', 'Demo1', 'Demo2', 'Demo3'), measure.vars=grep('^Q[0-9]+$', colnames(dt), value=TRUE))
d2[, c('Question', 'variable'):=list(substring(variable,2), NULL)]
R> d2
ID Demo1 Demo2 Demo3 value Question
1: 1 2 7 3 7 1
2: 2 6 10 1 3 1
3: 3 5 5 8 6 1
Now let's add scores:
d3 <- merge(d2, vals_enc, by=c('Question', 'value'))
And finally get average score and respondents for Question and Demographics:
d3[, list(Avg=mean(Score), Number=.N), .(Question,Demo1,Demo2,Demo3)]
Question Demo1 Demo2 Demo3 Avg Number
1: 1 6 10 1 6 1
2: 1 10 1 7 18 1
3: 1 5 5 8 19 1
Note:
Each ID has the same demographic status across all questions, so the number of respondents for a given demographic combination should be the same for every question.
As for part 2 of the question:
do you already have those calculations, or are you looking for them?

Subsequent row summing in dataframe object

I would like to compute a running (cumulative) sum of one column's values within the groups defined by another column, and put the result into a new column without deleting any rows.
Below is some R-code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do since the for loop will be time consuming in my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The goal is to write code that calculates Y2 (as in MyDf.2 above) separately for each ID.
This is what I came up with, and it does the trick (it calculates a TEST column in MyDf that has to be equal to Y2 in MyDf.2):
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
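If the groups were marked only by the FIRST flag rather than by an ID column, a similar sketch would derive the grouping from the flag itself: cumsum(FIRST) increases by one at the start of each group. This variant is an illustration built on the same example data, not part of the answer above.

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6,
                   FIRST = c(1, 0, 0, 1, 0, 0))

# cumsum(FIRST) gives 1,1,1,2,2,2: a group label that changes at each FIRST == 1
MyDf$TEST <- ave(MyDf$Y, cumsum(MyDf$FIRST), FUN = cumsum)
MyDf
```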

Resources