R: Aggregate dataframe if column has less than 3 zeros, else return zero - r

I have ratings of images by several raters:
data <- as.data.frame(matrix(c(rep(1,6),rep(2,6),rep(1:6,2),
0,2,1,0,1,0,0,0,3,0,0,0),12,3))
colnames(data) <- c("image", "rater", "rating")
print(data)
# image rater rating
# 1 1 1 0
# 2 1 2 2
# 3 1 3 1
# 4 1 4 0
# 5 1 5 1
# 6 1 6 0
# 7 2 1 0
# 8 2 2 0
# 9 2 3 3
# 10 2 4 0
# 11 2 5 0
# 12 2 6 0
I want to aggregate (mean) ratings by images, but only if there less than 3 zero ratings for a given image. Otherwise (=if there are 3 zeros or more), the aggregated rating should be zero. And the counting of zeros should only be for raters 1-5.
So for the above data:
# image rating
# 1 1 0.8
# 2 2 0.0
For image 1 ratings are aggregated because the third zero belongs to rater 6. For image 2, the aggregated rating is zero because there are more than 2 zeros.
On top of that, I want the aggregation to take into account a) only the first 5 ratings for each image, and b) only positive ratings.
I can manage the last 2 conditions using aggregate:
aggregate(rating ~ image, data = data[data$rater <= 5 & data$rating != 0,], mean)
# Result:
# image rating
# 1 1 1.333333
# 2 2 3.000000
But I can't figure out the first condition.
Correct results should be:
# image rating
# 1 1 1.333333
# 2 2 0.000000
Can anyone please help? Thanks.

Here is a nice method using base R:
data$this <- ave(data$rating, data$image,
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)
I use i[1:5] to subset each image, so if you have fewer than 5 raters for an image, you will get an error. This returns the mean for each group, if that is of interest. Of course, you can use the same function to produce the aggregation table you mentioned:
aggregate(data$rating, data["image"],
FUN=function(i) if(sum(i[1:5] > 0) > 2) mean(i[1:5]) else 0)

Related

how to subset a data frame up until a point R

i want to subset a data frame and take all observations for each id until the first observation that didn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops with the first observation that didn't meet the condition, moves on to the next id, and returns all observations for this id that meets the condition, and so on.
the only way i can see is to use a lot of loops but loops and that's usually not a god thing.
Hope you guys have an idea
Assume that your condition is to return rows where v < 5 :
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
group_by(id) %>% # for each id
mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>% # check if v < 5 and fill with NA all rows when condition is FALSE and after that
filter(!is.na(flag)) %>% # keep only rows with no NA flags
ungroup() %>% # forget the grouping
select(-flag) # remove flag column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
Easy way:
Find First FALSE by (min(which(condition == F)):
Patientdays<-cbind.data.frame(treatmentDate=c(1:5,4,6:10),date=c(2:5,3,6:10,10),goodThings=c(1:11),badThings=c(0:10))
attach(Patientdays)# Just due to ease of use (optional)
condition<-treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == F))-1),]
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3

In a dataframe, find the index of the next smaller value for each element of a column

Question:
In a dataframe, I want to create a new column as the indices of the next smaller value of an existing column.
For example, the data looks like this. It is already arranged in item, day.
item day val
1 1 2 3
2 1 4 2
3 1 5 1
4 2 1 1
5 2 3 2
6 2 5 3
First I would like to use group_by(item) in dplyr to select the sub-dataframe of each item.
Then for row 1, I look down the rows and find that row 2 has a smaller val. This is what I want, so I record the day corresponding to that row. Similar for row 2.
Note that for row 3 and 6, they are the last rows of corresponding sub-dataframes, so there is no next smaller value. For row 4 and 5, there is no smaller val when I look down the rows.
The dataframe with the new column should look like this.
item day val next.smaller.day
1 1 2 3 4
2 1 4 2 5
3 1 5 1 -1
4 2 1 1 -1
5 2 3 2 -1
6 2 5 3 -1
I wonder if there is any way of using dplyr to implement this, or any codes in r other than a for loop.
I found a thread asking the algorithm of this question. Given an array, find out the next smaller element for each element .
It is relevant, and the proposed algorithm beats mine in terms of time complexity, but I still find it hard to implement in my scenario.
Thank you!
Update:
Here is another example to re-illustrate what I'm looking for.
item day val next.smaller.day
1 1 2 2 5
2 1 4 3 5
3 1 5 1 -1
4 2 1 3 3
5 2 3 1 -1
6 2 5 2 -1
You can group your data by the item, calculate the different between rows using the diff function and check if it is smaller than zero which will then generate a logic vector and you can use the logic vector to pick up the next day. And since you are picking up the next day, you will need the lead function to shift the day column forward so that it can match the rows where you want to place them.
Side note: Since diff function create a vector one element shorter than the original one and you will always leave the last row out per group, we can pad the diff result by a FALSE condition.
library(dplyr);
df %>% group_by(item) %>% mutate(smaller = c(diff(val) < 0, F),
next.smaller.day = ifelse(smaller, lead(day), -1)) %>%
select(-smaller)
# Source: local data frame [6 x 4]
# Groups: item [2]
# item day val next.smaller.day
# <int> <int> <int> <dbl>
# 1 1 2 3 4
# 2 1 4 2 5
# 3 1 5 1 -1
# 4 2 1 1 -1
# 5 2 3 2 -1
# 6 2 5 3 -1
Update:
find.next.smaller <- function(ini = 1, vec) {
if(length(vec) == 1) NA
else c(ini + min(which(vec[1] > vec[-1])),
find.next.smaller(ini + 1, vec[-1]))
} # the recursive function will go element by element through the vector and find out
# the index of the next smaller value.
df %>% group_by(item) %>% mutate(next.smaller.day = day[find.next.smaller(1, val)],
next.smaller.day = replace(next.smaller.day, is.na(next.smaller.day), -1))
# Source: local data frame [6 x 4]
# Groups: item [2]
#
# item day val next.smaller.day
# <int> <int> <dbl> <dbl>
# 1 1 2 2 5
# 2 1 4 3 5
# 3 1 5 1 -1
# 4 2 1 1 -1
# 5 2 3 2 -1
# 6 2 5 3 -1

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
I want to remove the rows which has values for the variable b equal to 0 which occurs after the value equals to 1 for the duplicated variable a values.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
The which(df$b==1) returns the index of the df where b==1. add one to this and intersect with the indexes where b==0.
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a and we test to see if we've seen a 1 yet with cummax.

Data simulation according to specific rules in R

I need help simulating a dataset.
It is supposed to simulate all possible outcomes on a signal detection theory task (participants are presented with trials and have to decide whether or not they detected given signal). Now, I need a dataset of all possible values for varying number of trials.
Say, there are 6 trials, 5 with the signal present, 5 with the signal absent. I am only interested in correct detections (hits) and false alarms (Type I errors). A participant can correctly detect between 1 (I don't need 0's) and 5 and make the same number of false alarms. With all possible combinations, that would be dataset containing two variables with 5^2 cases each. To make things more complicated, even the number of trials is variable. The number of both signal and non-signal trials can vary between 1 and 20 but the total number of trials cannot be less than 3 (either 1 S trial and 2 Non-S trials, or the other way around). And for each possible combination of trials, there is a group of possible combinations of hits and false alarms.
What I need is a dataset with 5 variables (total N, N of S trials, N of Non-S trials, N of Hits, and N of False Alarms) with all the possible values.
EXAMPLE
Here are all possible data for total N of 4. Note that Signal + Noise = N_total and that N_Hit seq(1:Signal) and N_FA seq(1:Noise)
N_total Signal Noise N_Hit N_FA
4 1 3 1 1
4 1 3 1 2
4 1 3 1 3
4 2 2 1 1
4 2 2 1 2
4 2 2 2 1
4 2 2 2 2
4 3 1 1 1
4 3 1 2 1
4 3 1 3 1
I'm an R novice so any help at all would be much appreciated!
Hope the description is clear.
I created a function, which uses the number of trials as parameter.
myfunc <- function(n) {
# create a data frame of all combinations
grid <- expand.grid(rep(list(seq_len(n - 1)), 4))
# remove invalid combinations (keep valid ones)
grid <- grid[grid[3] <= grid[1] & # number of hits <= number of signals
grid[4] <= grid[2] & # false alarms <= noise
(grid[1] + grid[2]) == n , ] # signal and noise sum to total n
# remove signal and noise > 20
grid <- grid[!rowSums(grid[1:2] > 20), ]
# sort rows
grid <- grid[order(grid[1], grid[3], grid[4]), ]
# add total number of trials
res <- cbind(n, grid)
# remove row names, add column names and return the object
return(setNames("rownames<-"(res, NULL),
c("N_total", "Signal", "Noise", "N_Hit", "N_FA")))
}
Use the function:
> myfunc(4)
N_total Signal Noise N_Hit N_FA
1 4 1 3 1 1
2 4 1 3 1 2
3 4 1 3 1 3
4 4 2 2 1 1
5 4 2 2 1 2
6 4 2 2 2 1
7 4 2 2 2 2
8 4 3 1 1 1
9 4 3 1 2 1
10 4 3 1 3 1
How to apply this function to the values 3-40:
lapply(3:40, myfunc)
This will return a list of data frames.

Count and label observations per participant using loop

I have repeated-measures data.
I need to create a loop that will incrementally count each observation, within a participant, and label it.
I am new to writing loops. My logic was to say, for each item in the list of unique ids, count each row in that, and apply some function to that row.
Could someone point our what I am doing wrong?
data$Ob <- 0
for (i in unique(data$id)) {
count <- 1
for (u in data[data$id == i,]) {
data[data$id ==u,]$Ob <- count
count <- count + 1
print(count)
}
}
Thanks!
Justin
You can also use ave:
set.seed(1)
data <- data.frame(id = sample(4, 10, TRUE))
data$Ob = ave(data$id, data$id, FUN=seq_along)
data
id Ob
1 2 1
2 2 2
3 3 1
4 4 1
5 1 1
6 4 2
7 4 3
8 3 2
9 3 3
10 1 2
# Generate some dummy data
data <- data.frame(Ob=0, id=sample(4,20,TRUE))
# Go through every id value
for(i in unique(data$id)){
# Label observations
data$Ob[data$id == i] = 1:sum(data$id == i)
}
Be aware though that for loops are notoriously slow in R. In this simple case they work fine, but should you have millions and millions of rows in your data frame you'd better do something purely vectorized.
But you don't need a loop...
data <- data.frame (id = sample (4, 10, TRUE))
## id
## 1 3
## 2 4
## 3 1
## 4 3
## 5 3
## 6 4
## 7 2
## 8 1
## 9 1
## 10 4
data$Ob [order (data$id)] <- sequence (table (data$id))
## id Ob
## 1 3 1
## 2 4 1
## 3 1 1
## 4 3 2
## 5 3 3
## 6 4 2
## 7 2 1
## 8 1 2
## 9 1 3
## 10 4 3
(works also with character or factor IDs)
(isn't R just cool!?)

Resources