I have data on 23 'players'. Some of them played against each other (but not every possible pair) one or more times. The dataset I have (see dput below) includes the number of times one player won and lost against another player. I use it to fit a Bradley-Terry (BT) model using the BradleyTerry2 package. The issue is that the model gives me coefficients for 22 players, not 23. Can anyone help me figure out what the problem is, please?
Below is the dput of my data (head)
structure(list(player1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a12TTT.pdf",
"a15.pdf", "a17.pdf", "a18.pdf", "a21.pdf", "a2TTT.pdf", "a5.pdf",
"B11.pdf", "B12.pdf", "B13.pdf", "B22.pdf", "B24.pdf", "B4.pdf",
"B7.pdf", "B8.pdf", "cw10-1.pdf", "cw15-1TTT.pdf", "cw17-1.pdf",
"cw18.pdf", "cw3.pdf", "cw4.pdf", "cw7_1TTT.pdf", "cw13-1.pdf"
), class = "factor"), player2 = structure(c(4L, 5L, 8L, 9L, 10L,
12L), .Label = c("a12TTT.pdf", "a15.pdf", "a17.pdf", "a18.pdf",
"a21.pdf", "a2TTT.pdf", "a5.pdf", "B11.pdf", "B12.pdf", "B13.pdf",
"B22.pdf", "B24.pdf", "B4.pdf", "B7.pdf", "B8.pdf", "cw10-1.pdf",
"cw15-1TTT.pdf", "cw17-1.pdf", "cw18.pdf", "cw3.pdf", "cw4.pdf",
"cw7_1TTT.pdf", "cw13-1.pdf"), class = "factor"), win1 = c(0,
1, 1, 1, 2, 0), win2 = c(1, 1, 0, 1, 0, 2)), row.names = c(NA,
6L), class = "data.frame")
The code I am using:
BTm(cbind(win1,win2), player1, player2, data= prep)
I also tried
BTm(cbind(win1,win2), player1, player2, ~player, id="player", data= prep)
And it gives me the same result (i.e. the same player is missing, and the 22 coefficients for the rest are the same).
If that is relevant, I created 'prep' using the below code.
prep<-countsToBinomial(table(ju$winner, ju$loser))
ju$winner and ju$loser are two columns in which rows are individual games and the winner is in the first column.
I also tried the following code to fit the model:
BTm(1, p1, p2, data=ju)
In this case p1 and p2 are the same as the columns winner and loser, but transformed so that both have the same factor levels (so that the function would work). I am not sure I used this alternative correctly, and I mention it because in this case I also have one player missing (although a different one).
After reading more carefully the documentation for the package, I found that when estimating the model the function removes one script/player/contestant as a reference. Its value is always 0. So my understanding is that if you want to do any further analysis, you have to find what player was removed and reintroduce it in the data frame with the value for its ability 0.
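A minimal sketch of that reintroduction step, using a made-up coefficient vector rather than real model output. The player names, the values, and the assumption that the coefficient names carry a leading ".." prefix are illustrative only, so check names(coef(model)) on your own fit. As far as I know, BradleyTerry2 also provides BTabilities(model), which returns the abilities of all players including the reference, so that may be the simpler route.

```r
# Hypothetical stand-in for coef(model): the reference player "A" is absent.
est <- c("..B" = 0.8, "..C" = -0.3)

# All players that appear in the data (assumed known from the factor levels).
players <- c("A", "B", "C")

# Strip the assumed ".." prefix so the names match the player labels.
names(est) <- sub("^\\.\\.", "", names(est))

# Start everyone at 0 (the reference ability), then fill in the estimates.
ability <- setNames(numeric(length(players)), players)
ability[names(est)] <- est

print(ability)
```

The player whose ability stays at 0 is the one the model used as the reference.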
I am currently working on listening data of a music platform in R.
I have a subset (listening.subset) of the total data set. It contains 6 columns (USER, artist, Week, COUNT, user_type, binary).
Each user can either be a focal user, a friend, or a neighbour. There are separate data sets that link focal users to their friends (friend.data) and neighbours (neighbour.data), but I added a column to indicate the type of user.
Now, I have the following for-loop to indicate whether a friend has listened to an artist in the 10 weeks before the focal user has listened to that same artist. If that is the case, the binary column must show a 0, else a 1.
library(dplyr)  # provides %>% and filter(); vlookup() is presumably from the expss package

listening.subset$binary <- NA
for (i in seq_len(nrow(listening.subset))) {
  test_user <- listening.subset[i, ]
  test_week <- test_user$Week
  test_artist <- test_user$artist
  if (test_user$user_type == "friend") {
    # focal user this friend is linked to
    foc <- vlookup(test_user$USER, friend.data, result_column = 1, lookup_column = 2)
    # focal user's plays of the same artist in the 10 weeks after the friend's play
    prior_listen <- listening.subset %>%
      filter(USER == foc, artist == test_artist,
             test_week >= (Week - 10) & test_week <= Week)
    listening.subset$binary[i] <- if (nrow(prior_listen) > 0) 0 else 1
  }
}
The problem with this for-loop is that it takes too long to apply to the full data set. Therefore, I want to vectorize it. However, this concept is vague to me, and after reading up on it online I still do not have a clue as to how I should adjust my code.
I hope someone knows how to use vectorization and could help me.
EDIT1: the total data set contains around 50 million entries. However, I could split it up in 10 data sets of 5 million each.
EDIT2: listening.subset:
"clubanddeform", "HyprMusic", "Peter-182", "komosionmel", "SHHitsKaty",
"Sonik_Villa", "Haalf"), artist = c("Justin Timberlake", "Ediya",
"Lady Gaga", "El Guincho", "Lighthouse Family", "Pidżama Porno",
"The Men", "Modest Mouse", "Com Truise", "April Smith and The Great Picture Show"
), Week = c(197L, 213L, 411L, 427L, 443L, 232L, 431L, 312L, 487L,
416L), COUNT = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 6L, 11L), user_type = c("friend",
"friend", "friend", "friend", "neighbour", "friend", "neighbour",
"friend", "focal", "friend"), binary = c(1, 1, 1, 1, NA, 1, NA,
1, NA, 1)), row.names = c(NA, 10L), class = "data.frame")
Where Week is an indicator for which week the user listened to the particular band (ranging between 1 and 527), and COUNT equals the amount of times the user has listened to that artist in that particular week.
Recap: The binary variable should indicate whether the "friend user" has listened to the same band as the "focal user", in the 10 weeks before the focal user played the band. The social connections can be found in the friend.data, which is depicted below.
structure(list(USER = c("TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner", "TheMariner", "TheMariner", "TheMariner",
"TheMariner", "TheMariner"), FRIEND = c("npetrovay", "marno25",
"lennonstarr116", "sachmonkey", "andrewripp", "daledrops", "Skittlebite",
"Ego_Trippin", "pistolgal5", "jjollett")), row.names = c(NA,
10L), class = "data.frame")
For each of the 190 focal users (first column), the friends are listed next to it, in the second column.
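One way to vectorize the loop is to replace the row-by-row lookup with joins: attach each friend's focal user, join onto the focal user's plays of the same artist, and test the week window on whole columns at once. The sketch below uses base R merge() and a tiny made-up data set (the users f1, f2 and foc are hypothetical). Unlike vlookup(), it considers every focal user a friend is linked to, which is an assumption about what you want. For 50 million rows the same joins would probably need data.table or a database, since the intermediate table can get large.

```r
# Toy stand-ins for listening.subset and friend.data (hypothetical values).
listening <- data.frame(
  USER      = c("f1", "f1", "f2", "foc", "foc"),
  artist    = c("A", "B", "A", "A", "B"),
  Week      = c(95L, 50L, 120L, 100L, 100L),
  user_type = c("friend", "friend", "friend", "focal", "focal"),
  stringsAsFactors = FALSE
)
friend.data <- data.frame(USER = "foc", FRIEND = c("f1", "f2"),
                          stringsAsFactors = FALSE)

listening$row_id <- seq_len(nrow(listening))
is_friend <- listening$user_type == "friend"

# Friend plays, with the focal user attached through the link table.
links   <- data.frame(friend = friend.data$FRIEND, focal = friend.data$USER,
                      stringsAsFactors = FALSE)
friends <- merge(listening[is_friend, c("row_id", "USER", "artist", "Week")],
                 links, by.x = "USER", by.y = "friend")

# Focal plays, renamed so they can be joined on focal user and artist.
focal <- listening[listening$user_type == "focal", c("USER", "artist", "Week")]
names(focal) <- c("focal", "artist", "focal_week")
cand <- merge(friends, focal, by = c("focal", "artist"))

# Same window as the loop: friend's week lies in [focal week - 10, focal week].
hit     <- cand$Week >= cand$focal_week - 10 & cand$Week <= cand$focal_week
hit_ids <- unique(cand$row_id[hit])

# 0 if a matching focal play exists, 1 otherwise; non-friend rows stay NA.
listening$binary <- NA_real_
listening$binary[is_friend] <- ifelse(listening$row_id[is_friend] %in% hit_ids, 0, 1)
print(listening)
```

Here the first friend row gets 0 (week 95 falls within 10 weeks before the focal play in week 100) and the other two friend rows get 1.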
I'm trying to run kmeans clustering analysis on a relatively simple data frame. However,
kmeans(sample_data, centers = 4)
doesn't work, as R states there are "NA/NaN/Inf in foreign function call (arg 1)" (not true). Anyway, I tried
kmeans(na.omit(sample_data), centers = 4)
based on the answers here (and other posts), and that didn't work. The only workaround I found was to exclude the non-numeric column (i.e., the observation names) using
kmeans(sample_data[, 2:5], centers = 4)
Unfortunately, this makes the clusters much less informative, since the points now have numbers instead of names. What's going on? Or how could I get the clustering with the right labels?
Edit: I'm trying to reproduce this procedure / result, but with a different data set. Notice that when the author visualizes the clusters, the points are labelled according to the observations (the states, in that case; or "obs1, obs2, etc." in mine.)
Because of the workaround above (which drops the column with observation names), I get a sequence of numeric labels instead.
Code and dput below:
library(factoextra)
cluster <- kmeans(sample_data, centers = 4) #this doesn't work
cluster <- kmeans(sample_data[, 2:5], centers = 4) #this works
fviz_cluster(cluster, sample_data)
sample_data:
structure(list(name = structure(c(1L, 12L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L,
14L, 15L, 16L, 17L, 18L), .Label = c("obs1", "obs10", "obs11",
"obs12", "obs13", "obs14", "obs15", "obs16", "obs17", "obs18",
"obs19", "obs2", "obs20", "obs21", "obs22", "obs23", "obs24",
"obs25", "obs3", "obs4", "obs5", "obs6", "obs7", "obs8", "obs9"
), class = "factor"), variable1 = c(0, 0.383966783938484, 0.541654398529028,
0.469060314591266, 0.397636449124337, 0.3944696359856, 0.368740430902284,
0.998695171590958, 0.60013559365688, 0.543416096609665, 1, 0.287523586757021,
0.57818096701751, 0.504722587360754, 0.284825226469556, 0.295250085072615,
0.509782836343032, 0.392942062325636, 0.602608457169149, 0.474668174468815,
0.219951650206242, 0.263837738487209, 0.530976492805559, 0.312401708505963,
0.828799458392802), variable2 = c(0, 0.21094954480341, 0.374890541082605,
0.502470003202637, 0.385212751959443, 0.499052863381439, 0.172887314327707,
0.319869014605517, 0.484308813708282, 0.348608342250238, 0.474464311565186,
0.380406312920036, 1, 0.618253544624658, 0.560290273167607, 0.676315913606924,
0.339157532529115, 0.479005841710258, 0.576094917240369, 0.819742646967549,
0.472559283375261, 0.45594685111211, 0.160720270709769, 0.494360626922513,
0.658705091697224), variable3 = c(0, 0.0391726961740698, 0.157000498692027,
0.194883594782107, 0.133290754949737, 0.199085094994071, 0.000551185924636259,
0.418045152251051, 0.434858475480003, 0.443442199844268, 0.257231662911141,
0.195570389942169, 0.46503468971732, 0.358104620337886, 0.391852363829371,
0.39834809992812, 0.258870156344325, 0.38555892877453, 0.480559759927908,
1, 0.15662554228071, 0.279363773961277, 0.11211821625736, 0.180885222092932,
0.339650099009323), variable4 = c(0, 0.0464395032429444, 0.323768557597659,
0.201813172242373, 0.302710768912681, 0.446027132614423, 0.542018940773003,
1, 0.738123811706962, 0.550819613183929, 0.679555989322392, 0.563126171437818,
0.470328070009844, 0.316069092919459, 0.344421820993065, 0.222931758003036,
0.250406547916021, 0.381098780580988, 0.9526031202384, 0.174161621337361,
0.260548409706516, 0.288399563112687, 0.617089845066814, 0.265314653254406,
0.330637996311329)), class = "data.frame", row.names = c(NA,
-25L))
K-means only works on continuous variables.
It probably tried to convert your labels into numbers, and that did not work.
Never include identifier columns in analysis!
Proper data preprocessing is crucial and 90% of the work; you need to understand the requirements precisely. It is not sufficient to just make it run somehow - it is easy to make it run, but return useless results...
The key is to convert the column with the desired labels to row names with
df <- tibble::column_to_rownames(df, var = "labels")
That way the clustering algorithm never sees the labels as data, but the row names are carried along and used to label the points on the cluster plot.
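For reference, the same idea can be sketched in base R without tibble. The data below is a small made-up stand-in for sample_data (two well-separated groups of four observations), so the column values are assumptions. kmeans() carries the row names over to the cluster vector, and as far as I can tell fviz_cluster() labels points by the row names of the data it is given.

```r
set.seed(42)

# Hypothetical data with an identifier column and two numeric variables.
toy <- data.frame(
  name      = paste0("obs", 1:8),
  variable1 = c(0.10, 0.15, 0.12, 0.08, 0.90, 0.95, 0.88, 0.92),
  variable2 = c(0.20, 0.25, 0.18, 0.22, 0.80, 0.85, 0.78, 0.83),
  stringsAsFactors = FALSE
)

# Move the identifier into the row names and drop it from the data.
rownames(toy) <- toy$name
toy$name <- NULL

# Only numeric columns remain, so kmeans() runs without the NA error.
fit <- kmeans(toy, centers = 2, nstart = 10)

# The cluster assignments are named by observation.
print(fit$cluster)

# With factoextra installed, the plot call would then be:
# fviz_cluster(fit, data = toy)
```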
I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, Compound-1 uses IS-1, while IS-1 does not have a reference (IS) itself. Within each sample I am trying to match the IS.Name to the row of the compound with that name, grab its Chrom.1.RT, and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList <- unique(as.character(df1$Filename))
for (s in sampleList) {
  # s is the Filename itself, so compare against it directly
  SampleRows <- which(df1$Filename == s)
  RefRows <- subset(df1, Filename == s)
  df1$IS.RT[SampleRows] <- RefRows$Chrom.1.RT[match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest that in the future you provide your example as the output of dput(df1), as that makes it a lot easier to read into R than the space-delimited table you provided.
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1 <- as_tibble(df1)  # tbl_df() is deprecated
ref <- df1 %>% group_by(Compound) %>% summarise(unique(Chrom.1.RT))  # one RT per compound
left_join(df1, left_join(df1 %>% select(-Compound), ref, c("IS.Name" = "Compound"))) %>%
  select(-IS.RT) %>%
  rename(IS.RT = `unique(Chrom.1.RT)`)
Unless I got it wrong, this is what you need?
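As an alternative to the joins, the lookup can probably be done with a single match() on a combined key, which also keeps the matching inside each Filename. This is only a sketch: the Sample-002 retention times are changed from the question's values so that the per-sample matching is visible, and the "|" separator assumes that character never occurs in a Filename or compound name.

```r
# Example data shaped like the question's table (Sample-002 values are made up).
df1 <- data.frame(
  Filename   = rep(c("Sample-001", "Sample-002"), each = 4),
  Compound   = rep(c("IS-1", "Compound-1", "IS-2", "Compound-2"), times = 2),
  Chrom.1.RT = c(1.32495, 1.344033333, 0.127416667, 0,
                 1.4, 1.41, 0.2, 0),
  IS.Name    = rep(c("NONE", "IS-1", "NONE", "IS-2"), times = 2),
  stringsAsFactors = FALSE
)

# Key identifying each compound row, and the key each row is looking for.
key_compound <- paste(df1$Filename, df1$Compound, sep = "|")
key_wanted   <- paste(df1$Filename, df1$IS.Name, sep = "|")

# match() returns the row of the reference compound, or NA for "NONE".
df1$IS.RT <- df1$Chrom.1.RT[match(key_wanted, key_compound)]

print(df1)
```

Rows whose IS.Name is NONE end up with NA in IS.RT, and the difference Chrom.1.RT - IS.RT can then be computed directly on the columns.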
I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
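A possible starting point in base R, offered as a sketch rather than a tested solution for real genome-sized data: build every pairing of rows from g and l, keep the pairs where either the start or the end position of g lies inside the start-end range of l, and bind the matching rows together. The helper name inside() is made up for this example, and factor columns are replaced by plain character vectors. The pairing grows as nrow(g) * nrow(l), so for large tables something like data.table::foverlaps() or the Bioconductor GenomicRanges package is likely a better fit.

```r
g <- data.frame(
  start_position = c(22926178L, 22887317L, 22876403L, 22862447L, 22822490L),
  end_position   = c(22928035L, 22889471L, 22884442L, 22866319L, 22827551L)
)
l <- data.frame(
  name  = c("GRMZM2G001024", "GRMZM2G575546", "GRMZM2G441511",
            "AC184765.4_FG005", "GRMZM2G396835"),
  start = c(11149187L, 24382534L, 22762447L, 26282236L, 10009264L),
  end   = c(11511198L, 24860958L, 23762447L, 26682919L, 10402790L),
  stringsAsFactors = FALSE
)

# Every combination of a row in g (gi) with a row in l (li).
idx <- expand.grid(gi = seq_len(nrow(g)), li = seq_len(nrow(l)))

# Hypothetical helper: is x inside the closed interval [lo, hi]?
inside <- function(x, lo, hi) x >= lo & x <= hi

# Keep pairs where the start OR the end of the g region falls inside l's range.
hit <- inside(g$start_position[idx$gi], l$start[idx$li], l$end[idx$li]) |
       inside(g$end_position[idx$gi],   l$start[idx$li], l$end[idx$li])

merged <- data.frame(g[idx$gi[hit], , drop = FALSE],
                     l[idx$li[hit], , drop = FALSE],
                     row.names = NULL)
print(merged)
```

With the sample data, all five regions in g fall inside GRMZM2G441511, so merged has five rows carrying that name.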
I am trying to produce a series of box plots in R that is grouped by 2 factors. I've managed to make the plot, but I cannot get the boxes to order in the correct direction.
The data frame I am using looks like this:
Nitrogen Species Treatment
2 G L
3 R M
4 G H
4 B L
2 B M
1 G H
I tried:
boxplot(mydata$Nitrogen~mydata$Species*mydata$Treatment)
This ordered the boxes alphabetically (the first three were the "High" treatments, and within those three they were ordered alphabetically by species name).
I want the box plot ordered Low > Medium > High, and then within each of those groups G > R > B for the species.
So I tried using a factor in the formula:
f = ordered(interaction(mydata$Treatment, mydata$Species),
levels = c("L.G","L.R","L.B","M.G","M.R","M.B","H.G","H.R","H.B"))
then:
boxplot(mydata$Nitrogen~f)
However, the boxes are still showing up in the same order. The labels are now different, but the boxes have not moved.
I have pulled out each set of data and plotted them all together individually:
lg = mydata[mydata$Treatment == "L" & mydata$Species == "G", "Nitrogen"]
mg = mydata[mydata$Treatment == "M" & mydata$Species == "G", "Nitrogen"]
hg = mydata[mydata$Treatment == "H" & mydata$Species == "G", "Nitrogen"]
etc ..
boxplot(lg, lr, lb, mg, mr, mb, hg, hr, hb)
This gives what I want, but I would prefer to do this in a more elegant way, so I don't have to pull each one out individually for larger data sets.
Loadable data:
mydata <-
structure(list(Nitrogen = c(2L, 3L, 4L, 4L, 2L, 1L), Species = structure(c(2L,
3L, 2L, 1L, 1L, 2L), .Label = c("B", "G", "R"), class = "factor"),
Treatment = structure(c(2L, 3L, 1L, 2L, 3L, 1L), .Label = c("H",
"L", "M"), class = "factor")), .Names = c("Nitrogen", "Species",
"Treatment"), class = "data.frame", row.names = c(NA, -6L))
The following commands will create the ordering you need by rebuilding the Treatment and Species factors, with explicit manual ordering of the levels:
mydata$Treatment = factor(mydata$Treatment,c("L","M","H"))
mydata$Species = factor(mydata$Species,c("G","R","B"))
edit 1 : oops I had set it to HML instead of LMH. fixing.
edit 2 : what factor(X,Y) does:
If you run factor(X,Y) on an existing factor, it uses the ordering of the values in Y to enumerate the values present in the factor X. Here are some examples with your data.
> mydata$Treatment
[1] L M H L M H
Levels: H L M
> as.integer(mydata$Treatment)
[1] 2 3 1 2 3 1
> factor(mydata$Treatment,c("L","M","H"))
[1] L M H L M H <-- not changed
Levels: L M H <-- changed
> as.integer(factor(mydata$Treatment,c("L","M","H")))
[1] 1 2 3 1 2 3 <-- changed
It does NOT change what the factor looks like at first glance, but it does change how the data is stored.
What's important here is that many plot functions will plot the lowest enumeration leftmost, followed by the next, etc.
If you create factors simply using factor(X), then the enumeration is usually based on the alphabetical order of the factor levels (e.g. "H","L","M"). If your labels have a conventional ordering different from alphabetical (e.g. "H","M","L"), this can make your graphs seem strange.
At first glance, it may seem like the problem is due to the ordering of data in the data frame - i.e. if only we could place all "H" at the top and "L" at the bottom, then it would work. It doesn't. But if you want your labels to appear in the same order as the first occurrence in the data, you can use this form:
mydata$Treatment = factor(mydata$Treatment, unique(mydata$Treatment))
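To check the resulting box order without drawing anything, you can look at the levels of the interaction of the two factors, which is what I understand the formula Species * Treatment groups by. A small sketch using the data from the question (rebuilt by hand here, so treat the construction as an assumption):

```r
mydata <- data.frame(
  Nitrogen  = c(2, 3, 4, 4, 2, 1),
  Species   = factor(c("G", "R", "G", "B", "B", "G")),
  Treatment = factor(c("L", "M", "H", "L", "M", "H"))
)

# Rebuild both factors with the level order wanted on the plot.
mydata$Treatment <- factor(mydata$Treatment, levels = c("L", "M", "H"))
mydata$Species   <- factor(mydata$Species,   levels = c("G", "R", "B"))

# The first factor varies fastest, so species cycle within each treatment.
group_order <- levels(interaction(mydata$Species, mydata$Treatment))
print(group_order)

# The plot itself would then be:
# boxplot(Nitrogen ~ Species * Treatment, data = mydata)
```

The printed order runs through G, R, B within L, then within M, then within H, which is the left-to-right order of the boxes.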
This earlier StackOverflow question shows how to reorder a boxplot based on a numerical value; what you need here is probably just a switch from factor to the related type ordered. But it is hard to say, as we do not have your data and you didn't provide a reproducible example.
Edit Using the dataset you posted in variable md and relying on the solution I pointed to earlier, we get
R> md$Species <- ordered(md$Species, levels=c("G", "R", "B"))
R> md$Treatment <- ordered(md$Treatment, levels=c("L", "M", "H"))
R> with(md, boxplot(Nitrogen ~ Species * Treatment))
which creates the chart you were looking to create.
This is also equivalent to the other solution presented here.