Data simulation according to specific rules in R - r

I need help simulating a dataset.
It is supposed to simulate all possible outcomes on a signal detection theory task (participants are presented with trials and have to decide whether or not they detected given signal). Now, I need a dataset of all possible values for varying number of trials.
Say, there are 6 trials, 5 with the signal present, 5 with the signal absent. I am only interested in correct detections (hits) and false alarms (Type I errors). A participant can correctly detect between 1 (I don't need 0's) and 5 and make the same number of false alarms. With all possible combinations, that would be dataset containing two variables with 5^2 cases each. To make things more complicated, even the number of trials is variable. The number of both signal and non-signal trials can vary between 1 and 20 but the total number of trials cannot be less than 3 (either 1 S trial and 2 Non-S trials, or the other way around). And for each possible combination of trials, there is a group of possible combinations of hits and false alarms.
What I need is a dataset with 5 variables (total N, N of S trials, N of Non-S trials, N of Hits, and N of False Alarms) with all the possible values.
EXAMPLE
Here are all possible data for total N of 4. Note that Signal + Noise = N_total and that N_Hit seq(1:Signal) and N_FA seq(1:Noise)
N_total Signal Noise N_Hit N_FA
4 1 3 1 1
4 1 3 1 2
4 1 3 1 3
4 2 2 1 1
4 2 2 1 2
4 2 2 2 1
4 2 2 2 2
4 3 1 1 1
4 3 1 2 1
4 3 1 3 1
I'm an R novice so any help at all would be much appreciated!
Hope the description is clear.

I created a function, which uses the number of trials as parameter.
myfunc <- function(n) {
# create a data frame of all combinations
grid <- expand.grid(rep(list(seq_len(n - 1)), 4))
# remove invalid combinations (keep valid ones)
grid <- grid[grid[3] <= grid[1] & # number of hits <= number of signals
grid[4] <= grid[2] & # false alarms <= noise
(grid[1] + grid[2]) == n , ] # signal and noise sum to total n
# remove signal and noise > 20
grid <- grid[!rowSums(grid[1:2] > 20), ]
# sort rows
grid <- grid[order(grid[1], grid[3], grid[4]), ]
# add total number of trials
res <- cbind(n, grid)
# remove row names, add column names and return the object
return(setNames("rownames<-"(res, NULL),
c("N_total", "Signal", "Noise", "N_Hit", "N_FA")))
}
Use the function:
> myfunc(4)
N_total Signal Noise N_Hit N_FA
1 4 1 3 1 1
2 4 1 3 1 2
3 4 1 3 1 3
4 4 2 2 1 1
5 4 2 2 1 2
6 4 2 2 2 1
7 4 2 2 2 2
8 4 3 1 1 1
9 4 3 1 2 1
10 4 3 1 3 1
How to apply this function to the values 3-40:
lapply(3:40, myfunc)
This will return a list of data frames.

Related

sequential counting with input from more than one variable in r

I want to create a column with sequential values but it gets its value from input from two other columns in the df. I want the value to sequentially count if either Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See example below:
Team Event Value
1 1 a 1
2 1 a 1
3 2 a 2
4 2 x 3
5 2 a 3
6 1 a 4
7 1 x 5
8 1 a 5
9 2 x 6
10 2 a 6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.

apply conditional numbering to grouped data in R

I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to try to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column and basically renumbering it starting at 1 from least to greatest, but it's important to note that the the number in the ReRank column can be put in more than once depending on the numbers in the Rank column .
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to be able to ReRank the the columns by the grouped columns and the value of the Rank columns.
This has been my best guess so far, but it definitely is not doing what I need it to do
ReRANK = mat.or.vec(length(RANK),1)
ReRANK[1] = counter = 1
for(i in 2:length(RANK)) {
if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
ReRANK[i] = counter
}
Thank you in advance for the help!!
Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The min argument in rank assures that the minimum ranking will occur when there are ties. the default is to take the mean of the ranks.
In the case that you have ties lower down in the groups, rank will count those lower values and then add continue with the next lowest value as the count of the lower values + 1. These values wil still be ordered and distinct. If you really want to have the count be 1, 2, 3, and so on rather than 1, 3, 6 or whatever depending on the number of duplicate values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
as.integer(factor(rank(i, ties.method="min"))))
Here, we use factor to build values counting from upward for each level. We then coerce it to be an integer.
For example,
temp <- c(rep(1, 3), 2,5,1,4,3,7)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")

R: Aggregate dataframe if column has less than 3 zeros, else return zero

I have ratings of images by several raters:
data <- as.data.frame(matrix(c(rep(1,6),rep(2,6),rep(1:6,2),
0,2,1,0,1,0,0,0,3,0,0,0),12,3))
colnames(data) <- c("image", "rater", "rating")
print(data)
# image rater rating
# 1 1 1 0
# 2 1 2 2
# 3 1 3 1
# 4 1 4 0
# 5 1 5 1
# 6 1 6 0
# 7 2 1 0
# 8 2 2 0
# 9 2 3 3
# 10 2 4 0
# 11 2 5 0
# 12 2 6 0
I want to aggregate (mean) ratings by images, but only if there less than 3 zero ratings for a given image. Otherwise (=if there are 3 zeros or more), the aggregated rating should be zero. And the counting of zeros should only be for raters 1-5.
So for the above data:
# image rating
# 1 1 0.8
# 2 2 0.0
For image 1 ratings are aggregated because the third zero belongs to rater 6. For image 2, the aggregated rating is zero because there are more than 2 zeros.
On top of that, I want the aggregation to take into account a) only the first 5 ratings for each image, and b) only positive ratings.
I can manage the last 2 conditions using aggregate:
aggregate(rating ~ image, data = data[data$rater <= 5 & data$rating != 0,], mean)
# Result:
# image rating
# 1 1 1.333333
# 2 2 3.000000
But I can't figure out the first condition.
Correct results should be:
# image rating
# 1 1 1.333333
# 2 2 0.000000
Can anyone please help? Thanks.
Here is a nice method using base R:
data$this <- ave(data$rating, data$image,
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
I use i[1:5] to subset each image, so if you have fewer than 5 raters for an image, you will get an error. This returns the mean for each group, if that is of interest. Of course, you can use the same function to produce the aggregation table you mentioned:
aggregate(data$rating, data["image"],
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)

Merge data frames for Cohen's kappa

I'm trying to analyze some date using R but I'm not very familiar with R (yet) and therefore I'm totally stuck.
What I try to do is manipulate my input data so I can use it to calculate Cohen's Kappa.
Now the problem is, that for rater_1, I have several ratings for some of the items and I need to select one. If rater_1 has given the same rate on an item as rater_2, then this rating should be chosen, if not any rating of the list can be used.
I tried
unique(merge(rater_1, rater_2, all.x=TRUE))
which brings me close, but if the ratings between the two raters diverge, only one is kept.
So, my question is, how do I get from
item rating_1
1 3
2 5
3 4
item rating_2
1 2
1 3
2 4
2 1
2 2
3 4
3 2
to
item rating_1 rating_2
1 3 3
2 5 4
3 4 4
?
There are some fancy ways to do this, but I thought it might be helpful to combine a few basic techniques to accomplish this task. Usually, in your question, you should include some easy way to generate your data, like this:
# Create some sample data
set.seed(1)
id<-rep(1:50)
rater_1<-sample(1:5,50,replace=TRUE)
df1<-data.frame(id,rater_1)
id<-rep(1:50,each=2)
rater_2<-sample(1:5,100,replace=TRUE)
df2<-data.frame(id,rater_2)
Now, here is one simple technique for doing this.
# Merge together the data frames.
all.merged<-merge(df1,df2)
# id rater_1 rater_2
# 1 1 2 3
# 2 1 2 5
# 3 2 2 3
# 4 2 2 2
# 5 3 3 1
# 6 3 3 1
# Find the ones that are equal.
same.rating<-all.merged[all.merged$rater_2==all.merged$rater_1,]
# Consider id 44, sometimes they match twice.
# So remove duplicates.
same.rating<-same.rating[!duplicated(same.rating),]
# Find the ones that never matched.
not.same.rating<-all.merged[!(all.merged$id %in% same.rating$id),]
# Pick one. I chose to pick the maximum.
picked.rating<-aggregate(rater_2~id+rater_1,not.same.rating,max)
# Stick the two together.
result<-rbind(same.rating,picked.rating)
result<-result[order(result$id),] # Sort
# id rater_1 rater_2
# 27 1 2 5
# 4 2 2 2
# 33 3 3 1
# 44 4 5 3
# 281 5 2 4
# 11 6 5 5
A fancy way to do this would be like this:
same.or.random<-function(x) {
matched<-which.min(x$rater_1==x$rater_2)
if(length(matched)>0) x[matched,]
else x[sample(1:nrow(x),1),]
}
do.call(rbind,by(merge(df1,df2),id,same.or.random))

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R which is similar to the follows. Actually my real ’df’ dataframe is much bigger than this one here but I really do not want to confuse anybody so that is why I try to simplify things as much as possible.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column ’a’ (and regarding only those records which have number ’1’ in column ’id’) we can say that number '1' occured 3 times and number '3' occured 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column ’a’ and regarding only those observations which have number ’2’ in column ’id’) we can say that number '1' occured 4 times, number '2' occured 3 times and number '3' occured 3 times.
So this is what I would like to do. Calculating the occurrences of numbers for each custom-defined subsets (and then collecting these values into a data frame). I know it is not a difficult task but the PROBLEM is that I’m gonna have to change the input ’df’ dataframe on a regular basis and hence both the overall number of rows and columns might change over time…
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[id==x,-1],2,table))
However, when a grouping doesn't have all the elements in it, as in 1a, the result will be a list for that id group rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
counts <- list()
for(a in names(df)[names(df) != "id"]) {
counts[[a]] <- table(df[a])
}
return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))

Resources