Sampling by group without repetition using data.table

I'll use a hypothetical scenario to illustrate the question. Here's a table of musicians and the instruments they play, and a table with the composition of a band:
library(data.table)

musicians <- data.table(
  instrument = rep(c('bass','drums','guitar'), each = 4),
  musician = c('Chas','John','Paul','Stuart','Andy','Paul','Peter','Ringo',
               'George','John','Paul','Ringo')
)
band.comp <- data.table(
  instrument = c('bass','drums','guitar'),
  n = c(2,1,2)
)
To avoid arguments about who is best with which instrument, the band will be assembled by sortition. Here's how I'm doing it:
musicians[band.comp, on = 'instrument'][, sample(musician, n), by = instrument]

   instrument     V1
1:       bass   Paul
2:       bass   Chas
3:      drums   Andy
4:     guitar   Paul
5:     guitar George
The problem is: since there are musicians who play more than one instrument, it can happen that one person is drawn more than once.
One can build a for loop that, for each instrument in turn, draws musicians and then eliminates those drawn from the rest of the table (a sketch of that loop is shown below). But I would like suggestions on how to do this using data.table, mainly because the kind of problem I need to solve in real life with this logic involves databases with hundreds of thousands of rows, and also because I'm trying to better understand the data.table syntax.
As a reference, I tried some tips from Andrew Brooks' blog, but couldn't come up with a solution.
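For reference, a minimal sketch of the loop approach mentioned above (using the musicians and band.comp tables defined earlier):

picked <- character(0)
for (ins in band.comp$instrument) {
  # pool of musicians who play this instrument and were not drawn yet
  pool <- musicians[instrument == ins & !musician %in% picked, musician]
  picked <- c(picked, sample(pool, band.comp[instrument == ins, n]))
}
picked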

Here is one possible solution: first select a single instrument per musician, then sample the musicians for your band from that reduced pool. Note that after assigning one instrument per musician, a group's pool may turn out smaller than the requested sample size, in which case you will get an error (but in your real data this may not be a problem).
musicians[, .(instrument = sample(instrument, 1)), by = musician][
  band.comp, on = 'instrument'][, sample(musician, n), by = instrument]
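A hypothetical guard for that error case, failing with a clearer message (a sketch, not from the original answer):

one.per <- musicians[, .(instrument = sample(instrument, 1)), by = musician]
one.per[band.comp, on = 'instrument'][, {
  # an NA musician means nobody drew this instrument at all
  if (anyNA(musician) || .N < n[1L])
    stop("not enough musicians left for ", instrument)
  .(musician = sample(musician, n[1L]))
}, by = instrument]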

You could expand the band comp into sum(band.comp$n) distinct positions and keep sampling until you find a feasible composition:
roles = musicians[,
  CJ(posn = 1:band.comp[.BY, on = .(instrument), x.n], musician = musician),
  by = instrument]

set.seed(1)
while (TRUE) {
  roles[sample(1:.N), keep := !duplicated(.SD, by = "musician") &
                              !duplicated(.SD, by = c("instrument", "posn"))][]
  if (sum(roles$keep) == sum(band.comp$n)) break
}
setorder(roles[keep == TRUE, !"keep"])[]

   instrument posn musician
1:       bass    1   Stuart
2:       bass    2     John
3:      drums    1     Andy
4:     guitar    1   George
5:     guitar    2     Paul
There's probably something you could do with linear programming or a bipartite graph to answer the question of whether a feasible composition exists, but it's unclear what "sampling" even means in terms of the distribution over feasible compositions.
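For what it's worth, here is a hedged sketch of that feasibility check via maximum bipartite matching, assuming the igraph package (the slot-naming scheme is invented for illustration):

library(igraph)
# one row per (slot, musician) pair: a musician can fill any slot of an
# instrument they play
pairs <- musicians[band.comp, on = 'instrument'][
  , .(slot = paste(instrument, 1:n[1L], sep = '_')), by = .(instrument, musician)]
g <- graph_from_data_frame(pairs[, .(slot, musician)], directed = FALSE)
V(g)$type <- V(g)$name %in% pairs$musician  # slots FALSE, musicians TRUE
# a feasible composition exists iff every slot can be matched
max_bipartite_match(g)$matching_size == sum(band.comp$n)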

Came across a relevant post, Randomly draw rows from dataframe based on unique values and column values, and eddi's answer there is perfect for this OP:
# keep the number of musicians required per instrument in one data.table
musicians[band.comp, n := n, on = .(instrument)]

# for storing the musicians that have been sampled so far
m <- c()
musicians[, {
  # exclude already-sampled musicians before sampling
  res <- .SD[!musician %chin% m][sample(.N, n[1L])]
  m <- c(m, res$musician)
  res
}, by = .(instrument)]
Sample output:

   instrument musician n
1:       bass   Stuart 2
2:       bass     Chas 2
3:      drums     Paul 1
4:     guitar     John 2
5:     guitar    Ringo 2
Or more succinctly with error handling as well:
m <- c()
musicians[
  band.comp,
  on = .(instrument),
  j = {
    s <- setdiff(musician, m)
    if (length(s) < n) stop(paste("Not enough musicians playing", .BY))
    res <- sample(s, n)
    m <- c(m, res)
    res
  },
  by = .EACHI]

Related

Calculating the mean words produced by a participant in R

I am working in R with a dataset where I need to calculate the average number of words produced by participants within each group. My data looks like this:
Group Participants WORDS
A     John         table
A     John         chair
A     John         house
A     Steph        pool
A     Steph        backyard
A     Carlos       street
B     Pedro        stop
B     Pedro        basket
B     Jeff         dog
B     Alan         cat
B     Alan         river
C     Steve        ocean
C     Steve        boat
C     Steve        hammer
C     Steve        temperature
C     Steve        sun
C     Bryan        outlet
C     Mark         printer
C     Kobe         basket
C     Kobe         internet
C     Kobe         legend
C     Kobe         ball
So, for example, within group A we have only three (3) participants and a total of six (6) words produced, so within that group we have an average of two (2) words produced by each participant.
My problem is that since I am not working with numbers, I don't know how to calculate this in R.
Try this:
library(dplyr)

dft %>%
  group_by(Group) %>%
  summarise(participants = length(unique(Participants)),
            words = length(unique(WORDS)),
            mean_words = words/participants)
or (using only dplyr functions)
library(dplyr)

dft %>%
  group_by(Group) %>%
  summarise(participants = n_distinct(Participants),
            words = n_distinct(WORDS),
            mean_words = words/participants)
Next time, please provide sample data which is copy-pastable (e.g. using dput).
I generated some sample data as:

set.seed(5555)
df <- data.frame(Group = sample(c("A","B","C"), 100, replace = TRUE),
                 Participant = sample(letters[1:6], 100, replace = TRUE),
                 Words = sample(paste0(LETTERS, LETTERS), 100, replace = TRUE),
                 stringsAsFactors = FALSE)
Then you can do
groupStats <- lapply(split(df, df$Group), function(wordsAndParticipantsInGroup) {
  participantCount <- length(unique(wordsAndParticipantsInGroup$Participant))
  wordCount <- length(unique(wordsAndParticipantsInGroup$Words))
  meanWords <- wordCount / participantCount
  data.frame(participants = participantCount, meanWords = meanWords)
})
groupStats <- data.frame(Group = names(groupStats), do.call("rbind", groupStats))

Taking variable names out of column and creating new columns in R

I am trying to take a dataframe like this
name response
1 Phil Exam
2 Terry Test
3 Simmon Exam
4 Brad Quiz
And turn it into this
name response Exam Test Quiz
1 Phil Exam Exam
2 Terry Test Test
3 Simmon Exam Exam
4 Brad Quiz Quiz
I tried to use a for loop, extracting each row, then checking whether the column already existed and creating it if it did not. I couldn't get this approach close to working and am unsure how to do it.
This can be accomplished in a few ways. It might be a good opportunity to get to know the tidyverse:
library(tidyverse)
new.df <- spread(old.df, response, response)
This is an unusual use of tidyr::spread(). In this case, it constructs new column names from the values in "response", and also fills those columns with the values in "response". The fill argument can be used to change what goes in the resulting blank cells.
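For instance, assuming the same old.df, the blanks can be filled with empty strings instead of NA:

library(tidyverse)
new.df <- spread(old.df, response, response, fill = "")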
A base R solution. We can create a function that replaces every word not matching the target word, and then add the new column to the data frame.
# Create example data frame
dt <- read.table(text = " name response
1 Phil Exam
2 Terry Test
3 Simmon Exam
4 Brad Quiz",
header = TRUE, stringsAsFactors = FALSE)
# A function to create a new column based on the word in response
create_Col <- function(word, df, fill = NA) {
  new <- df$response
  new[!new == word] <- fill
  return(new)
}

# Apply this function for each unique response value
for (i in unique(dt$response)) {
  dt[[i]] <- create_Col(word = i, df = dt)
}
dt
    name response Exam Test Quiz
1   Phil     Exam Exam <NA> <NA>
2  Terry     Test <NA> Test <NA>
3 Simmon     Exam Exam <NA> <NA>
4   Brad     Quiz <NA> <NA> Quiz
We can use dcast
library(data.table)
dcast(setDT(df1), name + response ~ response, value.var = 'response', fill = "")
#      name response Exam Quiz Test
# 1:   Brad     Quiz      Quiz
# 2:   Phil     Exam Exam
# 3: Simmon     Exam Exam
# 4:  Terry     Test           Test

How to use R to check data consistency (make sure no contradiction between case and value)?

Let's say I have:
Person Movie   Rating
Sally  Titanic      4
Bill   Titanic      4
Rob    Titanic      4
Sue    Cars         8
Alex   Cars         9
Bob    Cars         8
As you can see, there is a contradiction for Alex: all ratings for the same movie should be identical, but there was a data entry error for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in Excel or something? Is there a command in R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check of whether all the ratings for a movie match that movie's first rating? For every group that returns "no," I could go look at it manually. How would I write this function?
Thanks
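A minimal sketch of that boolean check (assuming the reproducible dat data frame constructed in the answers below):

# 1 if every Rating in a movie's group equals the group's first Rating,
# 0 otherwise; ave() recycles the group-level result across the rows
ok <- ave(dat$Rating, dat$Movie, FUN = function(x) all(x == x[1]))
dat[ok == 0, ]  # all rows of any group that contains a contradiction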
Here's a data.table solution
Define the function
Myfunc <- function(x) {
  temp <- table(x)
  names(temp)[which.max(temp)]
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
#    Person   Movie Rating CorrectRating
# 1:  Sally Titanic      4             4
# 2:   Bill Titanic      4             4
# 3:    Rob Titanic      4             4
# 4:    Sue    Cars      8             8
# 5:   Alex    Cars      9             8
# 6:    Bob    Cars      8             8
Or, if you want to remove the "bad" ratings:
df[Rating == CorrectRating][]
#    Person   Movie Rating CorrectRating
# 1:  Sally Titanic      4             4
# 2:   Bill Titanic      4             4
# 3:    Rob Titanic      4             4
# 4:    Sue    Cars      8             8
# 5:    Bob    Cars      8             8
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
                  Movie = rep(c("Titanic", "Cars"), each = 3),
                  Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see whether all the values within a group are the same (or whether there are differences), this can be a simple application of tapply (or aggregate, etc.) with a function like var (or one that computes the range). If all the values are the same, the variance and range will be 0; any other value (outside of rounding error) means some value must differ. The which function can help identify the group or individual.
# variance is 0 within a group iff all its values are equal
tapply(dat$Rating, dat$Movie, FUN = var)
which(.Last.value > 0.00001)

# the same idea with the range instead of the variance
tapply(dat$Rating, dat$Movie, FUN = function(x) diff(range(x)))
which(.Last.value != 0)

# rows whose rating differs from their group mean
which(abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)

# the single row that deviates most
which.max(abs(dat$Rating - ave(dat$Rating, dat$Movie)))
dat[.Last.value, ]
I would add a variable for the mode so I can see if there is anything weird going on with the data, like missing values, text, or many different answers instead of the rare anomaly, etc. I used x as your dataset.
# one of many functions to find the mode; could use any other
modefunc <- function(x) {
  names(table(x))[table(x) == max(table(x))]
}

# add a variable for the mode, split by Movie
x$mode <- ave(x = x$Rating, x$Movie, FUN = modefunc)

# do whatever you want with the records that differ
x[x$Rating != x$mode, ]
If you want a different implementation of the mode, there are plenty of other mode functions to try.

Perform multiple summary functions and return a dataframe

I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school. How many students from each school are in the data set, how many unique zipcodes per school, average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
                      record.count = length(score),
                      unique.r.zips = length(unique(zip)),
                      mean.dist = mean(score),
                      total.dist = sum(score))
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
If performance is an issue, you can also use data.table
require(data.table)
tab_s <- data.table(students)
setkey(tab_s, school)
tab_s[, list(total = sum(score),
             avg = mean(score),
             unique.zips = length(unique(zip)),
             records = length(score)),
      by = "school"]
     school     total      avg unique.zips records
1:   Hunter  202.7297 202.7297           1       1
2:      NYU  561.9823 280.9911           1       2
3:     CCNY  310.4067 310.4067           1       1
4: Columbia 1674.6916 837.3458           2       2
5:     Lang  673.2803 224.4268           2       3
6:      LIU  373.5384 373.5384           1       1
Comments seem to be in general agreement: this looks good.
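As an aside (not from the original answers), the same summary in current dplyr, where n() directly answers the record-count question:

library(dplyr)
students %>%
  group_by(school) %>%
  summarise(record.count = n(),
            unique.r.zips = n_distinct(zip),
            mean.dist = mean(score),
            total.dist = sum(score))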

Calculation using two subsets of variable column in long data frame with R Reshape

I have a dataframe with two sets of data that I need to multiply for a calculation. A simple version would be:
library(reshape2)

sample <- data.frame(apples = c(10, 20, 25, 30, 40, NA, NA, 15))
sample$oranges <- c(25, 60, 90, 86, 10, 67, 45, 10)
sample$oats <- c(65, 75, 85, 95, 105, 115, 125, 135)
sample$eggs <- c(23, 22, 21, 20, 19, 18, 17, 16)
sample$consumer <- c('john','mark','luke','paul','peter','thomas','matthew','brian')
sample$mealtime <- c('breakfast','lunch','lunch','snack','lunch','breakfast','snack','dinner')

s1 <- melt(sample, id.vars = c(5, 6), measure.vars = c(1:4))
and what I'm trying to do is something along the lines of:

s2 <- dcast(s1, mealtime ~ .,
            function(x) (x[variable == 'oranges'] * x[variable == 'apples']) /
                        sum(x[variable == 'apples']))
In practice it's a much longer data.frame and a more elaborate calculation, but the principle should be the same. Thanks -- first post to SO, so apologies for any errors.
The output would be a data frame with mealtime as the id variable and the apple-weighted average of the orange data as the value for each mealtime.
Something along the lines of
    Group.1         x
1 breakfast  1.785714
2    dinner  1.071429
3     lunch 27.500000
4     snack 18.428571
This was calculated using:

sample$wa <- sample$oranges * sample$apples / sum(sample$apples, na.rm = TRUE)
aggregate(sample$wa, by = list(sample$mealtime), sum, na.rm = TRUE)
which feels off mathematically (it divides by the overall apple total rather than each mealtime's total) but was meant to be a quick, kludgy approximation.
This is a much better task for plyr than it is for reshape.
library(plyr)
s1 <- ddply(sample, .(mealtime), function(x) return(sum(x$apples, x$oranges)))
And now you have clarified the output:
ddply(sample, .(mealtime), summarize,
      wavg.oranges = sum(apples * oranges, na.rm = TRUE) / sum(apples, na.rm = TRUE))

#    mealtime wavg.oranges
# 1 breakfast     25.00000
# 2    dinner     10.00000
# 3     lunch     45.29412
# 4     snack     86.00000
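As an aside (not from the original answers), the same weighted average is compact in data.table as well, which may matter at the larger scale the question mentions:

library(data.table)
setDT(sample)[, .(wavg.oranges = sum(apples * oranges, na.rm = TRUE) /
                    sum(apples, na.rm = TRUE)), by = mealtime]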
