How to create a variable that indicates agreement from two dichotomous variables - r

I d like to create a new variable that contains 1 and 0. A 1 represents agreement between the rater (both raters 1 or both raters 0) and a zero represents disagreement.
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- cbind(rater_A, rater_B)
The new variable would be like the following vector I created manually:
df$agreement <- c(1,0,0,0,1,0,1,1,1,1)
Maybe there's a package or a function I don't know. Any help would be great.

You could create df as a data.frame (instead of using cbind) and use within and ifelse:
rater_A <- c(1,0,1,1,1,0,0,1,0,0)
rater_B <- c(1,1,0,0,1,1,0,1,0,0)
df <- data.frame(rater_A, rater_B)
##
df <- within(df,
agreement <- ifelse(
rater_A==rater_B,1,0))
##
> df
rater_A rater_B agreement
1 1 1 1
2 0 1 0
3 1 0 0
4 1 0 0
5 1 1 1
6 0 1 0
7 0 0 1
8 1 1 1
9 0 0 1
10 0 0 1

Related

Sub-setting or arrange the data in R

As I am new to R, this question may seem to you piece of a cake.
I have a data in txt format. The first column has Cluster Number and the second column has names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data in following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- gsub('^org(.+)\\|.*', '\\1', input[, 2])
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
possibleOrgs %in% input[input[, 1] == x, 3], logical(4)))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

extracting maximum value of cumulative sum into a new column

A sample of data set:
testdf <- data.frame(risk_11111 = c(0,0,1,2,3,0,1,2,3,4,0), risk_11112 = c(0,0,1,2,3,0,1,2,0,1,0))
And I need output data set which would contain new column where only maximum values of cumulative sum will be maintained:
testdf <- data.frame(risk_11111 = c(0,0,1,2,3,0,1,2,3,4,0),
risk_11111_max = c(0,0,0,0,3,0,0,0,0,4,0),
risk_11112 = c(0,0,1,2,3,0,1,2,0,1,0),
risk_11112_max = c(0,0,0,0,3,0,0,2,0,1,0))
I am guessing some logical subseting of vectors colwise with apply and extracting max value with position index, and mutate into new variables.
I dont know how to extract values for new variable.
Thanks
Something like this with base R:
lapply(testdf, function(x) {
x[diff(x) > 0] <- 0
x
})
And to have all in one data.frame:
dfout <- cbind(testdf, lapply(testdf, function(x) {
x[diff(x) > 0] <- 0
x
}))
names(dfout) <- c(names(testdf), 'risk_1111_max', 'risk_1112_max')
Output:
risk_11111 risk_11112 risk_1111_max risk_1112_max
1 0 0 0 0
2 0 0 0 0
3 1 1 0 0
4 2 2 0 0
5 3 3 3 3
6 0 0 0 0
7 1 1 0 0
8 2 2 0 2
9 3 0 0 0
10 4 1 4 1
11 0 0 0 0

Assign value to new column if another column has exact string

I have a df that looks like this.
Date Winner
4/12 Tom
4/13 Abe
4/14 George
4/15 Tom
I would like to add new columns that assign a 1 if if the name appears in the winner column and 0 if the name did not appear and vice versa. Ideally the df would look like this as a result
Date Winner Tom_Win Tom_Lose Abe_Win Abe_Lose George_Win George Lose
4/12 Tom 1 0 0 1 0 1
4/13 Abe 0 1 1 0 0 1
4/14 George 0 1 0 1 1 0
4/15 Tom 1 0 0 1 0 1
Is there an easy way to accomplish this?
This is extremely simple to do if you use the model.matrix functions, it will create N dummy columns with 0 when the name does not appear and one when it does (exactly as you requested), the code below:
(assuming your data is called db)
> winners <- model.matrix(~Winner - 1, data=db)
> winners
WinnerAbe WinnerGeorge WinnerTom
1 0 0 1
2 1 0 0
3 0 1 0
4 0 0 1
This bit is to compute the columns with the losing values
winners <- as.data.frame(winners)
winners$loserAbe <- as.numeric(!winners$WinnerAbe) #naturally you have to
#do this for every column you need
WinnerAbe WinnerGeorge WinnerTom loserAbe
1 0 0 1 1
2 1 0 0 0
3 0 1 0 1
4 0 0 1 1
winners$Date <- db$Date #this last bit so you don't lose the date.
Using mtabulate from qdapTools package we can do the following three steps,
library(qdapTools)
d1 <- mtabulate(d3$Winner)
d2 <- setNames(data.frame(sapply(d1, function(i) ifelse(i == 1, 0, 1))),
paste0(names(d1), '_Lose'))
cbind(d3$Date, d1, d2)
# d3$Date Abe George Tom Abe_Lose George_Lose Tom_Lose
#1 4/12 0 0 1 1 1 0
#2 4/13 1 0 0 0 1 1
#3 4/14 0 1 0 1 0 1
#4 4/15 0 0 1 1 1 0
DATA
str(d3)
'data.frame': 4 obs. of 2 variables:
$ Date : Factor w/ 4 levels "4/12","4/13",..: 1 2 3 4
$ Winner: Factor w/ 3 levels "Abe","George",..: 3 1 2 3
I'm sure there is a better way than this but this works in base R and it's fairly simple:
If your data looks like this:
df <- data.frame(Date = c("4/12","4/13","4/14","4/15"),Winner = c("Tom","Abe","George","Tom"))
Append the extra columns like so:
xcols <- c(paste0(unique(df$Winner), '_Win'), paste0(unique(df$Winner), '_Lose'))
df[ , xcols] <- 0
Now make a character vector with instructions to give the points for every player.
evl <- unlist(lapply(unique(df$Winner), function(x){paste0('df[', which(df$Winner == x), ',', which(names(df) == paste0(x, '_Win')), '] <- 1')}))
And execute the code:
eval(parse(text = evl))
df <- data.frame(
Date = c("4/12", "4/13","4/14", "4/15"),
Winner = c("Tom", "Abe", "George", "Tom")
)
df2 <- do.call(cbind,
lapply(seq_along(levels(df$Winner)), function(x) {
win <- ifelse(df$Winner == levels(df$Winner)[x], 1, 0)
lose <- ifelse(df$Winner == levels(df$Winner)[x], 0, 1)
dat <- cbind(win, lose)
colnames(dat) <- c(paste(levels(df$Winner)[x], "win", sep = "_"), paste(levels(df$Winner)[x], "lose", sep = "_"))
dat
})
)
cbind(df, df2)
> cbind(df, df2)
Date Winner Abe_win Abe_lose George_win George_lose Tom_win Tom_lose
1 4/12 Tom 0 1 0 1 1 0
2 4/13 Abe 1 0 0 1 0 1
3 4/14 George 0 1 1 0 0 1
4 4/15 Tom 0 1 0 1 1 0

Create mutually exclusive dummy variables from categorical variable in R [duplicate]

This question already has answers here:
Generate a dummy-variable
(17 answers)
Closed 7 years ago.
A while ago, I asked a question about creating a categorical variable from mutually exclusive dummy variables. Now, it turns out I want to do the opposite.
How would one go about creating dummy variables in a long-form dataset from a single categorical variable (time)? e.g. the dataframe below...
id time
1 1
1 2
1 3
1 4
would become...
id time time_dummy_1 time_dummy_2 time_dummy_3 time_dummy_4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
I'm sure this is trivial (and please let me know if this question is a duplicate -- I'm not sure it is, but will happily remove if so). Thanks!
You can try the dummies library.
R Code:
# Creating the data frame
# id <- c(1,1,1,1)
# time <- c(1,2,3,4)
# data <- data.frame(id, time)
install.packages("dummies")
library(dummies)
data <- cbind(data, dummy(data$time))
Output:
id time data1 data2 data3 data4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
Further you can rename the newly added dummy variable headers to suit your needs
R Code:
# Rename column headers
colnames(data)[colnames(data)=="data1"] <- "time_dummy_1"
colnames(data)[colnames(data)=="data2"] <- "time_dummy_2"
colnames(data)[colnames(data)=="data3"] <- "time_dummy_3"
colnames(data)[colnames(data)=="data4"] <- "time_dummy_4"
Output:
id time time_dummy_1 time_dummy_2 time_dummy_3 time_dummy_4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
Hope this helps.
If your data is
id <- c(1,1,1,1)
time <- c(1,2,3,4)
df <- data.frame(id,time)
you can try
time <- as.character(time)
unique.time <- as.character(unique(df$time))
# Create a dichotomous dummy-variable for each time
x <- sapply(unique.time, function(x)as.numeric(df$time == x))
or
time.f = factor(time)
dummies = model.matrix(~time.f)

How can I reshape my dataframe using reshape package?

I have a dataframe that looks like this:
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
etc.
Because I need to correlate variabeles a-g (their scores are in score1) with score2 in only step 5 I think i need to schange my dataset into this first:
a b c d e f g score2_step5
0 1 1 0 0 0
1 1 1 0 1 1
1 0
etc.
I am pretty sure that the Reshape package should be able to help me to do the job, but I haven't been able to make it work yet.
Can anyone help me? Many thanks in advance!
Here's another version. In case there is no step = 5, the value for score2_step = 0. Assuming your data.frame is df:
require(reshape2)
out <- do.call(rbind, lapply(seq(1, nrow(df), by=5), function(ix) {
iy <- min(ix+4, nrow(df))
df.b <- df[ix:iy, ]
tt <- dcast(df.b, 1 ~ var1, fill = 0, value.var = "score1", drop=F)
tt$score2_step5 <- 0
if (any(df.b$step == 5)) {
tt$score2_step5 <- df.b$score2[df.b$step == 5]
}
tt[,-1]
}))
> out
a b d e f g score2_step5
2 0 1 1 0 0 0 0
21 1 1 1 0 1 0 1
22 0 0 1 0 0 0 0
It looks like you want 7 correlations between the variables a-g and score2_step5--is that correct? First, you're going to need another variable. I'm assuming that step repeats continuously from 1 to 5; if not, this is going to be more complicated. I'm assuming your data is called df. I also prefer the newer reshape2 package, so I'm using that.
df$block <- rep(1:(nrow(df)/5),each=5)
df.molten <- melt(df,id.vars=c("var1", "step", "block"),measure.vars=c("score1"))
df2 <- dcast(df.molten, block ~ var1)
score2_step5 <- df$score2[df$step==5]
and then finally
cor(df2, score2_step5, use='pairwise')
There's an extra column (block) in df2 that you can get rid of or just ignore.
I added another row to your example data because my code doesn't work unless there is a step-5 observation in every block.
dat <- read.table(textConnection("
step var1 score1 score2
1 a 0 0
2 b 1 1
3 d 1 1
4 e 0 0
5 g 0 0
1 b 1 1
2 a 1 0
3 d 1 0
4 e 0 1
5 f 1 1
1 g 0 1
2 d 1 1
5 a 1 0"),header=TRUE)
Like #JonathanChristensen, I made another variable (I called it rep instead of block), and I made var1 into a factor (since there are no c values in the example data set given and I wanted a placeholder).
dat <- transform(dat,var1=factor(var1,levels=letters[1:7]),
rep=cumsum(step==1))
tapply makes the table of score1 values:
tab <- with(dat,tapply(score1,list(rep,var1),identity))
add the score2, step-5 values:
data.frame(tab,subset(dat,step==5,select=score2))

Resources