Merge data frames for Cohen's kappa - r

I'm trying to analyze some data using R, but I'm not very familiar with R (yet) and therefore I'm totally stuck.
What I am trying to do is manipulate my input data so that I can use it to calculate Cohen's kappa.
Now the problem is that for rater_1 I have several ratings for some of the items, and I need to select one. If rater_1 has given the same rating to an item as rater_2, then that rating should be chosen; if not, any rating from the list can be used.
I tried
unique(merge(rater_1, rater_2, all.x=TRUE))
which brings me close, but if the ratings between the two raters diverge, only one is kept.
So, my question is, how do I get from
item rating_1
1 3
2 5
3 4
item rating_2
1 2
1 3
2 4
2 1
2 2
3 4
3 2
to
item rating_1 rating_2
1 3 3
2 5 4
3 4 4
?

There are some fancy ways to do this, but I thought it might be helpful to combine a few basic techniques to accomplish this task. Usually, in your question, you should include some easy way to generate your data, like this:
# Create some sample data
set.seed(1)
id<-rep(1:50)
rater_1<-sample(1:5,50,replace=TRUE)
df1<-data.frame(id,rater_1)
id<-rep(1:50,each=2)
rater_2<-sample(1:5,100,replace=TRUE)
df2<-data.frame(id,rater_2)
Now, here is one simple technique for doing this.
# Merge together the data frames.
all.merged<-merge(df1,df2)
# id rater_1 rater_2
# 1 1 2 3
# 2 1 2 5
# 3 2 2 3
# 4 2 2 2
# 5 3 3 1
# 6 3 3 1
# Find the ones that are equal.
same.rating<-all.merged[all.merged$rater_2==all.merged$rater_1,]
# Consider id 44, sometimes they match twice.
# So remove duplicates.
same.rating<-same.rating[!duplicated(same.rating),]
# Find the ones that never matched.
not.same.rating<-all.merged[!(all.merged$id %in% same.rating$id),]
# Pick one. I chose to pick the maximum.
picked.rating<-aggregate(rater_2~id+rater_1,not.same.rating,max)
# Stick the two together.
result<-rbind(same.rating,picked.rating)
result<-result[order(result$id),] # Sort
# id rater_1 rater_2
# 27 1 2 5
# 4 2 2 2
# 33 3 3 1
# 44 4 5 3
# 281 5 2 4
# 11 6 5 5
A fancy way to do this would be like this:
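same.or.random<-function(x) {
# Indices of the rows where both raters agree (may be empty)
matched<-which(x$rater_1==x$rater_2)
# Keep the first agreeing row, otherwise pick a random row
if(length(matched)>0) x[matched[1],]
else x[sample(nrow(x),1),]
}
merged<-merge(df1,df2)
do.call(rbind,by(merged,merged$id,same.or.random))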
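Once one rating per item has been picked, the kappa itself can be computed in base R. The following is only a sketch on the example data from the question: `cohen_kappa` is a hypothetical helper written for this illustration (unweighted kappa, observed versus chance agreement), not a function from any package, and the ordering trick is just one more way to prefer agreeing rows.

```r
# Example data copied from the question
rater_1 <- data.frame(item = 1:3, rating_1 = c(3, 5, 4))
rater_2 <- data.frame(item = c(1, 1, 2, 2, 2, 3, 3),
                      rating_2 = c(2, 3, 4, 1, 2, 4, 2))

# Sort agreeing pairs to the top within each item, then keep one row per item
merged <- merge(rater_1, rater_2)
merged <- merged[order(merged$item, merged$rating_1 != merged$rating_2), ]
picked <- merged[!duplicated(merged$item), ]

# Hypothetical helper: unweighted Cohen's kappa from two rating vectors
cohen_kappa <- function(r1, r2) {
  lev <- sort(unique(c(r1, r2)))                 # shared set of categories
  tab <- table(factor(r1, levels = lev), factor(r2, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                    # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

cohen_kappa(picked$rating_1, picked$rating_2)
```

Note that the helper returns NaN when both raters use a single category throughout, because the chance agreement is then 1.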

Related

How to BiCluster with constant values in columns - in R

My Problem in general:
I have a data frame in which I would like to find all biclusters with constant values in columns.
For example, the initial data frame:
> df
v1 v2 v3
1 0 2 1
2 1 3 2
3 2 4 3
4 3 3 4
5 4 2 3
6 5 2 4
7 2 2 3
8 3 1 2
And, for example, I would like to find a cluster like this:
> cluster1
v1 v3
1 2 3
2 2 3
I tried to use the biclust package and tested several functions, but the result was never what I want to achieve.
I figured out that I may be able to use the BCPlaid function with fit.model = y ~ m, but it looks like this also produces different results.
Is there a way to achieve this task efficiently?
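For a fixed set of columns, rows that share identical values can be found in base R without the biclust package. This is only a rough sketch of that one step, not a full biclustering algorithm: `constant_groups` and its `min_rows` argument are hypothetical names made up for this illustration.

```r
# Data frame from the question
df <- data.frame(v1 = c(0, 1, 2, 3, 4, 5, 2, 3),
                 v2 = c(2, 3, 4, 3, 2, 2, 2, 1),
                 v3 = c(1, 2, 3, 4, 3, 4, 3, 2))

# Hypothetical helper: for the chosen columns, return every set of at least
# min_rows rows whose values are identical in all of those columns
constant_groups <- function(data, cols, min_rows = 2) {
  key <- do.call(paste, c(data[cols], sep = "|"))  # one key per row
  groups <- split(seq_len(nrow(data)), key)        # row indices by key
  groups <- groups[lengths(groups) >= min_rows]
  unname(lapply(groups, function(rows) data[rows, cols, drop = FALSE]))
}

found <- constant_groups(df, c("v1", "v3"))
found[[1]]
```

For the columns v1 and v3 this recovers the rows with v1 = 2 and v3 = 3 (rows 3 and 7 of df). To search all column subsets, the helper could be applied to each subset produced by combn(), although that grows quickly with the number of columns.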

In R, generating every possible solution to a model, based on constraints

In R, I’m trying to generate a matrix that shows results from a model and the values used to solve them- all of which are constrained. Every possible solution. An example model:
Model= a^2+b^2+c^2+d^2
Where:
20≤Model≤30
a=1
2 ≤b ≤3
2 ≤c ≤3
3 ≤d ≤4
I’d like the output to look like this:
[a] [b] [c] [d] [Model]
[1] 1 3 2 3 23
[2] 1 2 2 4 25
[3] 1 3 3 3 28
[4] 1 2 3 3 23
Order doesn't matter. I just want the full permutation of feasible [integer] values. Any packages or help you could point my way?
In my example case, I want to generate all possible inputs(a,b,c,d) that hold valid, based on the parameters I set. I only want values from my output equation (Model) between 20 and 30. In this case, only 4 solutions are possible based on the criteria I'm setting.
Assuming you're only looking for integer solutions, you can use expand.grid()
dd <- expand.grid(a=1, b=2:3, c=2:3, d=3:4)
m <- with(dd, a^2+b^2+c^2+d^2)
inside <- function(x, a,b) a<=x & x<=b
cbind(dd, m)[inside(m, 20, 30),]
# a b c d m
# 2 1 3 2 3 23
# 3 1 2 3 3 23
# 4 1 3 3 3 28
# 5 1 2 2 4 25
# 6 1 3 2 4 30
# 7 1 2 3 4 30
(You said you want values <= 30, but you seem to have left out the 30s in your example. You can change the inside() function if you want to exclude the endpoints.)
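As a sketch of that variant, assuming the upper bound is meant to be strict (Model < 30) while the lower bound stays inclusive, the filter can be written directly with subset(). The column name Model is taken from the question; the strict bound is an assumption.

```r
# All integer combinations within the stated ranges
dd <- expand.grid(a = 1, b = 2:3, c = 2:3, d = 3:4)
dd$Model <- with(dd, a^2 + b^2 + c^2 + d^2)

# Assumption: 20 <= Model < 30, i.e. the value 30 itself is excluded
feasible <- subset(dd, Model >= 20 & Model < 30)
feasible
```

With the strict upper bound, the four rows listed in the question remain.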

Data simulation according to specific rules in R

I need help simulating a dataset.
It is supposed to simulate all possible outcomes on a signal detection theory task (participants are presented with trials and have to decide whether or not they detected given signal). Now, I need a dataset of all possible values for varying number of trials.
Say there are 10 trials, 5 with the signal present and 5 with the signal absent. I am only interested in correct detections (hits) and false alarms (Type I errors). A participant can correctly detect between 1 (I don't need 0's) and 5 signals and make the same number of false alarms. With all possible combinations, that would be a dataset containing two variables with 5^2 cases each. To make things more complicated, even the number of trials is variable. The number of both signal and non-signal trials can vary between 1 and 20, but the total number of trials cannot be less than 3 (either 1 S trial and 2 Non-S trials, or the other way around). And for each possible combination of trials, there is a group of possible combinations of hits and false alarms.
What I need is a dataset with 5 variables (total N, N of S trials, N of Non-S trials, N of Hits, and N of False Alarms) with all the possible values.
EXAMPLE
Here are all possible data for a total N of 4. Note that Signal + Noise = N_total, that N_Hit takes the values in seq(1:Signal), and that N_FA takes the values in seq(1:Noise).
N_total Signal Noise N_Hit N_FA
4 1 3 1 1
4 1 3 1 2
4 1 3 1 3
4 2 2 1 1
4 2 2 1 2
4 2 2 2 1
4 2 2 2 2
4 3 1 1 1
4 3 1 2 1
4 3 1 3 1
I'm an R novice so any help at all would be much appreciated!
Hope the description is clear.
I created a function, which uses the number of trials as parameter.
myfunc <- function(n) {
# create a data frame of all combinations
grid <- expand.grid(rep(list(seq_len(n - 1)), 4))
# remove invalid combinations (keep valid ones)
grid <- grid[grid[3] <= grid[1] & # number of hits <= number of signals
grid[4] <= grid[2] & # false alarms <= noise
(grid[1] + grid[2]) == n , ] # signal and noise sum to total n
# remove signal and noise > 20
grid <- grid[!rowSums(grid[1:2] > 20), ]
# sort rows
grid <- grid[order(grid[1], grid[3], grid[4]), ]
# add total number of trials
res <- cbind(n, grid)
# remove row names, add column names and return the object
return(setNames("rownames<-"(res, NULL),
c("N_total", "Signal", "Noise", "N_Hit", "N_FA")))
}
Use the function:
> myfunc(4)
N_total Signal Noise N_Hit N_FA
1 4 1 3 1 1
2 4 1 3 1 2
3 4 1 3 1 3
4 4 2 2 1 1
5 4 2 2 1 2
6 4 2 2 2 1
7 4 2 2 2 2
8 4 3 1 1 1
9 4 3 1 2 1
10 4 3 1 3 1
How to apply this function to the values 3-40:
lapply(3:40, myfunc)
This will return a list of data frames.
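If a single data frame is preferred, the pieces can be stacked with do.call(rbind, ...). The sketch below also builds each piece directly from the Signal/Noise splits instead of filtering a four-way grid, which keeps the intermediate objects small. `sdt_grid` and `max_per_type` are hypothetical names used only for this illustration; the cap of 20 per trial type is taken from the question.

```r
# Hypothetical alternative: enumerate Signal/Noise splits of n, then cross
# the possible hit and false-alarm counts within each split
sdt_grid <- function(n, max_per_type = 20) {
  signal <- seq_len(n - 1)
  noise <- n - signal
  keep <- signal <= max_per_type & noise <= max_per_type
  pieces <- Map(function(s, z) {
    # N_FA varies fastest so the rows are ordered by N_Hit, then N_FA
    combos <- expand.grid(N_FA = seq_len(z), N_Hit = seq_len(s))
    data.frame(N_total = n, Signal = s, Noise = z,
               N_Hit = combos$N_Hit, N_FA = combos$N_FA)
  }, signal[keep], noise[keep])
  do.call(rbind, pieces)
}

# One data frame covering every total from 3 to 40
all_cases <- do.call(rbind, lapply(3:40, sdt_grid))
head(sdt_grid(4), 10)
```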

Create sequence of repeated values, in sequence?

I need a sequence of repeated numbers, i.e. 1 1 ... 1 2 2 ... 2 3 3 ... 3 etc. The way I implemented this was:
nyear <- 20
names <- c(rep(1,nyear),rep(2,nyear),rep(3,nyear),rep(4,nyear),
rep(5,nyear),rep(6,nyear),rep(7,nyear),rep(8,nyear))
which works, but is clumsy, and obviously doesn't scale well.
How do I repeat the N integers M times each in sequence?
I tried nesting seq() and rep() but that didn't quite do what I wanted.
I can obviously write a for-loop to do this, but there should be an intrinsic way to do this!
You missed the each= argument to rep():
R> n <- 3
R> rep(1:5, each=n)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
R>
so your example can be done with a simple
R> rep(1:8, each=20)
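To keep this general, the two sizes can be held in variables. The sketch below uses the hypothetical names `n_values` and `n_repeats`, and also shows the `times=` argument with a vector, which gives a different repeat count for each value.

```r
n_values <- 8    # how many distinct integers (N)
n_repeats <- 20  # how many times each one is repeated (M)

# 1 repeated 20 times, then 2 repeated 20 times, and so on
years <- rep(seq_len(n_values), each = n_repeats)
length(years)

# A vector for times= repeats each value a different number of times
uneven <- rep(1:3, times = c(2, 1, 3))
uneven
```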
Another base R option could be gl():
gl(5, 3)
Where the output is a factor:
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
Levels: 1 2 3 4 5
If integers are needed, you can convert it:
as.integer(gl(5, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
For your example, Dirk's answer is perfect. If you instead had a data frame and wanted to add that sort of sequence as a column, you could also use group from groupdata2 (disclaimer: my package) to greedily divide the datapoints into groups.
# Attach groupdata2
library(groupdata2)
# Create a random data frame
df <- data.frame("x" = rnorm(27))
# Create groups with 5 members each (except last group)
group(df, n = 5, method = "greedy")
x .groups
<dbl> <fct>
1 0.891 1
2 -1.13 1
3 -0.500 1
4 -1.12 1
5 -0.0187 1
6 0.420 2
7 -0.449 2
8 0.365 2
9 0.526 2
10 0.466 2
# … with 17 more rows
There's a whole range of methods for creating this kind of grouping factor, e.g. by number of groups, by a list of group sizes, or by having groups start when the value in some column differs from the value in the previous row (e.g. if a column is c("x","x","y","z","z"), the grouping factor would be c(1,1,2,3,3)).
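For comparison, two of these groupings can also be sketched in plain base R. This is not how groupdata2 implements them; it is just an equivalent result for the simple cases, using the variable names `run_id` and `fixed_size` chosen for this illustration.

```r
# New group whenever the value differs from the previous element
x <- c("x", "x", "y", "z", "z")
run_id <- cumsum(c(TRUE, x[-1] != x[-length(x)]))
run_id

# Groups of 5 consecutive rows for 27 rows (the last group has 2 members)
fixed_size <- ceiling(seq_len(27) / 5)
table(fixed_size)
```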

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the following. Actually, my real 'df' data frame is much bigger than this one, but I really do not want to confuse anybody, so I have tried to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (and regarding only those records which have number '1' in column 'id') we can say that number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column 'a' (and regarding only those observations which have number '2' in column 'id') we can say that number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I'm gonna have to change the input 'df' data frame on a regular basis, and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab['1','a','3']
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a group doesn't contain all the values, as with column a for id 1, the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
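One way to sidestep the ragged result is to count against a fixed set of levels, so that missing values show up as zeros and every group simplifies to a matrix. The following is a sketch on a small made-up data frame (`toy` is invented for this illustration, and the levels 1:3 are assumed to be known in advance).

```r
# Small made-up data: id 1 never has the value 2 in column a
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  a  = c(3, 1, 3, 2, 2, 1),
                  b  = c(1, 1, 2, 3, 3, 3))
values <- 1:3  # assumption: the full set of possible values

# For each id, tabulate every column against the same levels
counts <- lapply(split(toy[-1], toy$id), function(group) {
  sapply(group, function(col) table(factor(col, levels = values)))
})
counts[["1"]]
```

Each element of counts is then a matrix with one row per value and one column per variable, including explicit zero counts.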
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
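A similar long-format result can be obtained without the helper column by converting a table() to a data frame. This is a sketch on made-up data (`toy` is invented here); unlike aggregate(), the table also contains zero-count combinations, so they are filtered out at the end.

```r
# Small made-up data frame
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  a  = c(3, 1, 3, 2, 2, 1))

# Cross-tabulate, then reshape the table into one row per combination
freq <- as.data.frame(table(a = toy$a, id = toy$id), responseName = "freq")

# Drop combinations that never occur to mirror the aggregate() output
freq <- freq[freq$freq > 0, ]
freq
```

Note that the a and id columns of the result are factors rather than numbers.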
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
