Fast computation of average proximity in proximity matrix - r

I've got a similarity matrix between all cases and, in a separate data frame, classes of these cases. I want to compute average similarity between cases from the same class, here is the equation for an example n from class j:
We have to compute a sum of all squared proximities between n and all cases k that come from the same class as n. Link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#outliers
I implemented that with 2 for loops, but it is really slow. Is there a faster way to do such thing in R?
Thanks.
//DATA (dput)
Data frame with classes:
structure(list(class = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 1L,
1L, 2L, 3L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = "class", row.names = c(NA,
-10L), class = "data.frame")
Proximity matrix (row m and column m correspond to class in row m of data frame above):
structure(c(1, 0.60996875, 0.51775, 0.70571875, 0.581375, 0.42578125,
0.6595, 0.7134375, 0.645375, 0.468875, 0.60996875, 1, 0.77021875,
0.55171875, 0.540375, 0.53084375, 0.4943125, 0.462625, 0.7910625,
0.56321875, 0.51775, 0.77021875, 1, 0.451375, 0.60353125, 0.62353125,
0.5203125, 0.43934375, 0.6909375, 0.57159375, 0.70571875, 0.55171875,
0.451375, 1, 0.69196875, 0.59390625, 0.660375, 0.76834375, 0.606875,
0.65834375, 0.581375, 0.540375, 0.60353125, 0.69196875, 1, 0.7194375,
0.684, 0.68090625, 0.50553125, 0.60234375, 0.42578125, 0.53084375,
0.62353125, 0.59390625, 0.7194375, 1, 0.53665625, 0.553125, 0.513,
0.801625, 0.6595, 0.4943125, 0.5203125, 0.660375, 0.684, 0.53665625,
1, 0.8456875, 0.52878125, 0.65303125, 0.7134375, 0.462625, 0.43934375,
0.76834375, 0.68090625, 0.553125, 0.8456875, 1, 0.503, 0.6215,
0.645375, 0.7910625, 0.6909375, 0.606875, 0.50553125, 0.513,
0.52878125, 0.503, 1, 0.60653125, 0.468875, 0.56321875, 0.57159375,
0.65834375, 0.60234375, 0.801625, 0.65303125, 0.6215, 0.60653125,
1), .Dim = c(10L, 10L))
Correct result:
c(2.44197227050781, 2.21901680175781, 2.07063155175781, 2.52448621289062,
1.88040830957031, 2.16019295703125, 2.58622273828125, 2.81453253222656,
2.1031745078125, 2.00542063378906)

Should be possible. Your notation does not make clear whether we will find members of like classes in the rows or columns, so this answer presumes in the columns but the obvious modifications would work as well if they were in rows.
colSums(mat^2)) # in R this is element-wise application of ^2 rather than matrix multiplication.
Since both operations are vectorized it would be expected to be much faster than for-loops.
With the modification and assuming the matrix is named 'mat' and the class-dataframe named 'cldf':
sapply( 1:nrow(mat) ,
function(r) sum(mat[r, cldf[['class']][r] == cldf[['class']] ]^2) )
[1] 2.441972 2.219017 2.070632 2.524486 1.880408 2.160193 2.586223 2.814533 2.103175 2.005421

Related

BradleyTerry2 package missing one player in model results

I have data on 23 'players'. Some of them played against each other (but not every possible pair) one or multiple times. The dataset I have (see dput below) includes the number of times one player won and lost against another player. I use it to fit a BT model using BradleyTerry2 package. The issue I have is that the model gives me the coefficients for 22 players not 23. Can anyone help me figure out what the problem is, please?
Below is the dput of my data (head)
structure(list(player1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a12TTT.pdf",
"a15.pdf", "a17.pdf", "a18.pdf", "a21.pdf", "a2TTT.pdf", "a5.pdf",
"B11.pdf", "B12.pdf", "B13.pdf", "B22.pdf", "B24.pdf", "B4.pdf",
"B7.pdf", "B8.pdf", "cw10-1.pdf", "cw15-1TTT.pdf", "cw17-1.pdf",
"cw18.pdf", "cw3.pdf", "cw4.pdf", "cw7_1TTT.pdf", "cw13-1.pdf"
), class = "factor"), player2 = structure(c(4L, 5L, 8L, 9L, 10L,
12L), .Label = c("a12TTT.pdf", "a15.pdf", "a17.pdf", "a18.pdf",
"a21.pdf", "a2TTT.pdf", "a5.pdf", "B11.pdf", "B12.pdf", "B13.pdf",
"B22.pdf", "B24.pdf", "B4.pdf", "B7.pdf", "B8.pdf", "cw10-1.pdf",
"cw15-1TTT.pdf", "cw17-1.pdf", "cw18.pdf", "cw3.pdf", "cw4.pdf",
"cw7_1TTT.pdf", "cw13-1.pdf"), class = "factor"), win1 = c(0,
1, 1, 1, 2, 0), win2 = c(1, 1, 0, 1, 0, 2)), row.names = c(NA,
6L), class = "data.frame")
The code I am using:
BTm(cbind(win1,win2), player1, player2, data= prep)
I also tried
BTm(cbind(win1,win2), player1, player2, ~player, id="player", data= prep)
And it gives me the same result (i.e. the same player is missing, and the 22 coefficients for the rest are the same).
If that is relevant, I created 'prep' using the below code.
prep<-countsToBinomial(table(ju$winner, ju$loser))
ju$winner and ju$loser are two columns in which rows are individual games and the winner is in the first column.
I also tried the following code to fit the model:
BTm(1, p1, p2, data=ju)
In this case p1 and p2 are the same as columns winner and losser, but transformed so as to have the same level factors (so that the function would work). I am not sure I used this alternative correctly, and I mention it because in this case I also have one player missing (although a different one).
After reading more carefully the documentation for the package, I found that when estimating the model the function removes one script/player/contestant as a reference. Its value is always 0. So my understanding is that if you want to do any further analysis, you have to find what player was removed and reintroduce it in the data frame with the value for its ability 0.

Is there a way in R to convert the following character variable?

I have the following dataframe with a character variable that represents the number of lanes on a highway, can I replace this vector with a similar vector that has numbers instead of letter?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))
One way would be to count number of "|" in each string. We subtract it with - 1 since there is an additional "|".
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would to be count exact words in the string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))
Similar to Ronak's solution you could also do:
stringr:str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr:str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr:str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.
Not as succinct as #Ronak Shah's, but another method in Base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))

Counting the rows in a data frame based on integer ranges

I have a data frame that lists a bunch of objects and their values.
Name NumCpu MemoryMB
1 BEAVERTN-SVR-C5 1 3072
2 BEAVERTN-SVR-UK 4 4096
3 BEAVERTN-SVR-JV 1 1024
I want to take my data frame and create a new column that groups these numbers by ranges.
Ranges: 0-1024, 1025-2048, 2049-4096
And then output the counts of those ranges into a new data frame:
Range Count
0-1024 1
1025-2048 0
2049-4096 2
I learn by doing, so this is a real work problem I'm trying to use R to solve. Any help greatly appreciated. Thank you!
Data
DF <- structure(list(Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK",
"BEAVERTN-SVR-JV"), NumCpu = c(1L, 4L, 1L), MemoryMB = c(3072L,
4096L, 1024L), Range = structure(c(3L, 3L, 1L), .Label = c("(0,1.02e+03]",
"(1.02e+03,2.05e+03]", "(2.05e+03,4.1e+03]"), class = "factor")), .Names = c("Name",
"NumCpu", "MemoryMB", "Range"), row.names = c("1", "2", "3"), class = "data.frame")

R: Need to perform multiple matches for each row in data frame

I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, compound-1 is using IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match up the IS Name to the compound row for it to grab the CHrom.1.RT and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList<- unique(df1$Filename)
for (i in sampleList){
SampleRows<-which(df1$Filename == sampleList[i])
RefRows <- subset(df1, Filename== sampleList[i])
df1$IS.RT[SampleRows]<- RefRows$Chrom.1.RT[ match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest in the future you provide your example as the output of dput(df1) as it makes it a lot easier to read it into R instead of the space delimited table you provided
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(df1,left_join(df1%>%select(-Compound),df1%>%group_by(Compound)%>%summarise(unique(Chrom.1.RT)),c("IS.Name"="Compound")))%>%select(-IS.RT)%>%rename(IS.RT=`unique(Chrom.1.RT)`)
Unless I got i wrong, this is what you need?

ddply run in a function looks in the environment outside the function?

I'm trying to write a function to do some often repeated analysis, and one part of this is to count the number of groups and number of members within each group, so ddply to the rescue !, however, my code has a problem....
Here is some example data
> dput(BGBottles)
structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
weight = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94, 14.98,
14.88, 14.87, 15.94, 16.07, 14.91)), .Names = c("Machine",
"weight"), row.names = c(NA, -12L), class = "data.frame")
and here is my code
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, .(with(data, get(exp2))), nrow)
return(datadesc)
}
If I run this function, I get an error
> foo(exp="Machine",exp1="weight",data=BGBottles)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument
However, if I define my exp1, exp2 and data variables int he global environemtn first, it works
> exp1<-"weight"
> exp2<-"Machine"
> data<-BGBottles
> foo(exp="Machine",exp1="weight",data=BGBottles)
with.data..get.exp2.. V1
1 1 3
2 2 3
3 3 3
4 4 3
So, I assume ddply is running outside of the environemtn of the function ? Is there a way to stop this, or am I doing something wrong ?
Thanks
Paul.
You don't need get:
foo<-function(exp1, exp2, data) {
datadesc<-ddply(data, exp2, nrow)
return(datadesc)
}
This is an example of this bug: http://github.com/hadley/plyr/issues#issue/3. But as Marek points out, you don't need get here anyway.

Resources