I want to fit a Bradley-Terry model to many observers' rankings of three objects from within a larger set of objects.
My raw data looks like this:
obs1=c("A","C","D")
obs2=c("B","D","E")
obs3=c("C","B","E")
obs4=c("C","D","E")
obs5=c("C","E","D")
data=cbind(obs1,obs2,obs3,obs4,obs5)
obs1 obs2 obs3 obs4 obs5
1 A B C C C
2 C D B D E
3 D E E E D
but what I need as an input to countsToBinomial{BradleyTerry2} and then to BTm{BradleyTerry2} is a table like below, which contains the number of times the column-name objects were ranked before the row-name object by the five observers.
All information I found begins with the counts as data, but since my number of observations is very high, I wonder if there is a way of automatising this transformation.
A B C D E
A 0 0 0 0 0
B 0 0 1 0 0
C 1 0 0 0 0
D 1 1 3 0 1
E 0 2 3 2 0
Any ideas highly appreciated!
Ok here's the answer using nested for loops and match...
x <- unique(as.vector(data))
x <- sort(x)
cmatrix <- matrix(0,nrow = length(x), ncol = length(x))
colnames(cmatrix) <- x
row.names(cmatrix) <- x
This creates your output matrix as cmatrix with initial values all 0. Then we run the loops as follows...
count <- 0
for (i in 1:ncol(cmatrix) ){
for (j in 1:nrow(cmatrix) ){
for (k in 1:ncol(data)){
if( is.na(match(colnames(cmatrix)[i],data[,k])) == FALSE){
if( is.na(match(row.names(cmatrix)[j],data[,k])) == FALSE){
if( match(colnames(cmatrix)[i],data[,k]) < match(row.names(cmatrix)[j],data[,k]) ){count <- count+1}
}
cmatrix[j,i] <- cmatrix[j,i]+count
}
count <- 0
}
}
}
This will give you the required output table. The solution works for any set of object labels, not just A to E.
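For a very large number of observers, the triple loop can get slow. A minimal sketch of a loop-free alternative, assuming (as in the question) that each column of data lists one observer's objects from best to worst: combn() enumerates every (earlier, later) pair within a ranking, and table() tallies them. The names objs and pairs are only illustrative.

```r
# Raw data from the question: one column per observer, best-ranked object first
data <- cbind(obs1 = c("A", "C", "D"),
              obs2 = c("B", "D", "E"),
              obs3 = c("C", "B", "E"),
              obs4 = c("C", "D", "E"),
              obs5 = c("C", "E", "D"))

objs <- sort(unique(as.vector(data)))

# For each observer, list every (earlier, later) pair implied by the ranking
pairs <- do.call(rbind, lapply(seq_len(ncol(data)), function(k) {
  t(combn(data[, k], 2))   # column 1 = ranked before, column 2 = ranked after
}))

# Rows = object ranked after, columns = object ranked before
cmatrix <- table(factor(pairs[, 2], levels = objs),
                 factor(pairs[, 1], levels = objs))
cmatrix
```

This reproduces the table from the question. countsToBinomial is not called here. Whether it expects the "winner" in the rows or in the columns should be checked in the BradleyTerry2 documentation; t(cmatrix) flips the orientation if needed.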
Related
Ok, I have a list of words with their frequencies. There are many, many thousands of these. Here's a mini example:
w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df
w f
1 abandon 2
2 break 10
3 fuzz 8
4 when 200
What I want to do is count the characters in each word and then aggregate the results. The count_chars function from the ds4psy package can do this for a given vector of strings. I've done this successfully by creating one giant vector of strings from the word list (which has tens of thousands of words), as follows:
library(ds4psy) # for count_chars function
library(dplyr)
w = c("abandon", "break", "fuzz", "when")
f = c(2, 10, 8, 200)
df = data.frame(cbind(w, f))
df$w = as.character(df$w)
df$f = as.integer(df$f)
# repword will repeat wrd frq times with no spaces between
repword <- function(frq, wrd) paste(rep(times=frq, x=wrd), collapse="")
# now we create one giant vector of strings to do the counts on
# CAUTION -- uses lots of memory when you have 10s of 1000s of words
mytext = paste(mapply(repword, df$f, df$w))
# get a table of letter counts
mycounts = count_chars(mytext)
# convert to data frame sorted by character
mycounts.df <- mycounts[order(names(mycounts))] %>%
as.data.frame()
# sort by Freq in descending order
mycounts.df %>%
arrange(desc(Freq))
However, a colleague does not have enough memory for this brute force solution. So I tried to figure out how to do this word-by-word using foreach or mapply, but I am really stuck.
One issue is that you need a vector that has every letter in it to combine them (so far as I can tell). So I create a dummy word with all letters in it, and then do some tweaks to keep it from counting the repeated letters each time.
# create a dummy string that is a-z
dummy = paste0(letters, collapse="")
# now we create a count - it will be all 1s; we will subtract it every time
dummycount = count_chars(dummy)
countword <- function(frq, wrd) {
myword = paste0(dummy, wrd, collapse="")
# subtract 1 from each letter to correct for dummy
mycount = count_chars(myword) - dummycount
mycount = mycount * frq # multiply by frequency
return(mycount)
}
totalcount = dummycount - 1 # set a table to zeroes
foreach(frq = df$f, wrd = df$w) %do% {
totalcount = totalcount + countword(frq, wrd)
}
But this just doesn't work ... I get a weird result:
> totalcount
chars
a b c d e f g h i j k l m n o p q r s t u v w x y z
16 12 10 6 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I would be very grateful for any advice!
If we want the same output with foreach (assuming OP wants to work with foreach), simply loop over the sequence of rows
library(foreach)
library(parallel)
library(doSNOW)
no_of_cores = detectCores()
cl <- makeSOCKcluster(no_of_cores)
registerDoSNOW(cl)
out <- foreach(i = 1:nrow(df), .export = "count_chars",
.combine = `+`) %dopar% {
tmp <- countword(df$f[i], df$w[i])
totalcount[names(tmp)] <- totalcount[names(tmp)] + tmp
totalcount}
stopCluster(cl)
-output
> out
a b c d e f g h i j k l m n o p q r s t u v w x y z
14 12 0 2 210 8 0 200 0 0 10 0 0 204 2 0 0 10 0 0 8 0 200 0 0 16
Can you simply multiply the output of count_chars() by f, and do this by row?
library(data.table)
setDT(df)[, data.table(count_chars(w)*f), by=1:nrow(df)][, .(ct = sum(N)), chars][order(-ct)]
Output:
chars ct
1: e 210
2: n 204
3: h 200
4: w 200
5: z 16
6: a 14
7: b 12
8: k 10
9: r 10
10: f 8
11: u 8
12: d 2
13: o 2
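If memory is the main concern, the same totals can also be computed in base R without building any repeated strings. This is only a sketch under the assumption that each word should contribute its letters weighted by its frequency; it swaps count_chars for strsplit(), and the names chars, weights and totals are illustrative.

```r
w <- c("abandon", "break", "fuzz", "when")
f <- c(2, 10, 8, 200)

# Split each word into single characters (a list with one vector per word)
chars <- strsplit(w, "")

# Repeat each word's frequency once per character of that word
weights <- rep(f, times = lengths(chars))

# Sum the weights letter by letter
totals <- vapply(split(weights, unlist(chars)), sum, numeric(1))

# Sort by count, largest first
totals <- sort(totals, decreasing = TRUE)
totals
```

Each word is split only once and weighted by its frequency, so memory use grows with the total number of characters in the word list rather than with the frequencies.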
I have a number of ordered lists (or sequences, or vectors, or data table columns) 1, 2, 3, with several items, for example
1 2 3
A A B
G G A
F F G
C E
D C
D
How can I efficiently derive the "master" list which contains all elements in the correct order B, A, G, F, E, C, D? I don't even know what keywords to search for. Any hints are much appreciated.
How about a graph-based approach?
Idea
The idea is to translate the sequences into paths in a directed graph (so A G F C D becomes a path A->G->F->C->D). By simplifying the graph we can then identify the longest connected sequence in that graph, which should then correspond to your "master" sequence.
Implementation
Note that I assume your sample data lst to be a list of vectors (see sample data at the end of this answer).
Let's construct an igraph from the different paths; each path is given by the entries in the lst vectors.
library(igraph)
ig <- make_empty_graph(
n = length(unique(unlist(lst))),
directed = TRUE) %>%
set_vertex_attr("name", value = sort(unique(unlist(lst))))
for (i in 1:length(lst)) ig <- ig + path(lst[[i]])
Next we simplify the graph
ig <- simplify(ig)
It's instructive to plot the graph
plot(ig)
We now extract all simple paths; the longest simple path corresponds to the "master" list.
pths <- sapply(V(ig), function(x) {
p <- all_simple_paths(ig, x)
names(unlist(p[which.max(lengths(p))]))
})
pths[which.max(lengths(pths))]
$B
#[1] "B" "A" "G" "F" "E" "C" "D"
The sequence matches your expected output for the master list.
Sample data
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
lst <- list(v1, v2, v3)
Interesting problem. I think I have a working solution.
My thinking was that we can encode the vectors into a matrix to track which letters must come before and after each other letter by logic. Then we should be able to sort that matrix to find a working order.
Here, I take the three vectors and encode their implied ordering using nested loops.
v1 <- c("A","G","F","C","D","D")
v2 <- c("A","G","F","E","C")
v3 <- c("B", "A","G")
vecs <- list(v1, v2, v3)
unique_ltrs <- unique(unlist(vecs))
ltr_len <- length(unique_ltrs)
m <- matrix(0, nrow = ltr_len, ncol = ltr_len,
dimnames = list(unique_ltrs, unique_ltrs))
# Loops to populate m with what we know
for (v in 1:length(vecs)) {
vec <- unique(unlist(vecs[v]))
for (l in 1:length(vec)) {
for (l2 in 1:length(vec)) {
m_pos <- c(match(vec[l], unique_ltrs),
match(vec[l2], unique_ltrs))
compare <- ifelse(l < l2, -1, ifelse(l2 < l, 1, 0))
m[m_pos[1], m_pos[2]] <- compare
}
}
}
Here, 1 indicates the column letter comes before the row letter, while -1 means the row comes first.
> m
A G F C D E B
A 0 -1 -1 -1 -1 -1 1
G 1 0 -1 -1 -1 -1 1
F 1 1 0 -1 -1 -1 0
C 1 1 1 0 -1 1 0
D 1 1 1 1 0 0 0
E 1 1 1 -1 0 0 0
B -1 -1 0 0 0 0 0
Then we sort the matrix (relying on the code here), and a working order appears in the rownames:
m_ord <- m[do.call(order, as.data.frame(m)),]
#> m_ord
# A G F C D E B
#B -1 -1 0 0 0 0 0
#A 0 -1 -1 -1 -1 -1 1
#G 1 0 -1 -1 -1 -1 1
#F 1 1 0 -1 -1 -1 0
#E 1 1 1 -1 0 0 0
#C 1 1 1 0 -1 1 0
#D 1 1 1 1 0 0 0
rownames(m_ord)
#[1] "B" "A" "G" "F" "E" "C" "D"
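Both answers above treat the lists as "comes before" constraints, which is what a topological sort resolves (that may be the keyword to search for). A minimal base R sketch of Kahn's algorithm, assuming the lists never contradict each other; the names edges, nodes, free and master are illustrative.

```r
v1 <- c("A", "G", "F", "C", "D", "D")
v2 <- c("A", "G", "F", "E", "C")
v3 <- c("B", "A", "G")
lst <- list(v1, v2, v3)

# Collect "x comes directly before y" edges from consecutive entries
edges <- do.call(rbind, lapply(lst, function(v) {
  v <- unique(v)                       # drop repeats such as the double "D"
  if (length(v) < 2) return(NULL)
  cbind(from = head(v, -1), to = tail(v, -1))
}))
edges <- unique(edges)

nodes  <- unique(unlist(lst))
master <- character(0)

# Kahn's algorithm: repeatedly take an item with no remaining predecessor
while (length(nodes) > 0) {
  free <- setdiff(nodes, edges[, "to"])
  if (length(free) == 0) stop("the lists contain contradictory orderings")
  master <- c(master, free[1])
  nodes  <- setdiff(nodes, free[1])
  edges  <- edges[edges[, "from"] != free[1], , drop = FALSE]
}
master
```

For the sample data this yields B, A, G, F, E, C, D. When several items are free at the same step, the lists do not fully determine the order, and this sketch simply takes the first one.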
Can you please help me turn the labels in a list into a new data frame in R? For example, x is my list.
head(x) is giving me the following output:
head(x)
[[1]]
A B
0 0
[[2]]
C D E F
0 1 0 0
I want to create a data frame like this:
Any help would be appreciated! TIA!
Indeed, there are no rows to name unless your list is a data.frame. Try x <- as.data.frame(x) just before your cbind.
Answer to updated question :
l1 <- c(A=0, B=1)
l2 <- c(C=0,D=1,E=0,F=0)
x <- list(l1,l2)
x <- as.data.frame(c((x[[1]]), (x[[2]])))
colnames(x) <- c("X")
Outputs:
X
A 0
B 1
C 0
D 1
E 0
F 0
Here you can use your original code:
x <- cbind(EMAIL=rownames(x), x)
Which outputs:
EMAIL X
A A 0
B B 1
C C 0
D D 1
E E 0
F F 0
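For a list with more than two elements, indexing x[[1]] and x[[2]] by hand does not scale. A short sketch, assuming x is an unnamed list of named numeric vectors as in the example, where unlist() flattens the whole list in one step; v and df are illustrative names.

```r
x <- list(c(A = 0, B = 1), c(C = 0, D = 1, E = 0, F = 0))

# unlist() flattens the list and keeps the element names
v <- unlist(x)

# Build the two-column data frame directly from the names and values
df <- data.frame(EMAIL = names(v), X = unname(v), stringsAsFactors = FALSE)
df
```

If the list itself has names, unlist() prefixes them to the element names, so the labels would need to be cleaned up first.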
Here's a problem I couldn't solve at all.
Suppose that we have the following code:
## A data frame named a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
## 1st function calculates all the combinations of the colnames of a; the output is a character vector named items2
items2 <- c()
countI <- 1
while(countI <= ncol(a)){
for(i in countI){
countJ <- countI + 1
while(countJ <= ncol(a)){
for(j in countJ){
items2 <- c(items2, paste(colnames(a[i]), colnames(a[j]), collapse = '', sep = ""))
}
countJ <- countJ + 1
}
countI <- countI + 1
}
}
And here's the code I'm trying to fix (the output is a numeric vector called count_1):
## 2nd function
colnames(a) <- NULL ## just for facilitating the calculation
count_1 <- numeric(ncol(a)*2)
countI <- 1
while(countI <= ncol(a)){
for(i in countI){
countJ <- countI + 1
while(countJ <= ncol(a)){
for(j in countJ){
s <- a[, i]
p <- a[, j]
count_1[i*2] <- as.integer(s[i] == p[j] & s[i] == 1)
}
countJ <- countJ + 1
}
countI <- countI + 1
}
}
But when I execute this code in the RStudio console, it returns an unexpected result:
count_1
[1] 0 0 0 0 0 1 0 1 0 0
However, I am expecting the following result:
count_1
[1] 1 2 2 2 1 1 1 1 2 1
You can visit the following URL for an image on Dropbox with a detailed explanation.
https://www.dropbox.com/s/5ylt8h8wx3zrvy7/IMAG1074.jpg?dl=0
I'll try to explain a little more.
I posted the 1st function (code) only as an example, to show what I'm looking for.
What I'm trying to get from the second function (code) is, for each pair of columns, the number of rows in which both columns equal 1. The counter starts at 0, and for a pair such as AB it increases by 1 for every row where both A and B are 1. This is repeated for every pair of columns (AB, AC, AD, AE, BC, BD, BE, CD, CE and DE), so there are n!/(2!(n-2)!) combinations. For example, if I have the following data frame:
a =
A B C D E
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
Then, the number of occurrences of the number 1 for each row by combining the two first columns is as follows: (Note that I put colnames(a) <- NULL just to facilitate the work and be more clear)
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
### Example 1: #####################################################
so from here I put (for columns A and B (AB))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 1 0 1 0 0
Then I count the rows where both vectors are 1, i.e. a[, i] == 1 & a[, j] == 1 (compared element by element), and for this example the result is [1] 1
### Example 2: #####################################################
From here I put (for columns A and D (AD))
s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 0 0 1 1 0
Then I count the rows where both vectors are 1, i.e. a[, i] == 1 & a[, j] == 1 (compared element by element), and for this example the result is [1] 2
And so on,
I'll have a numeric vector named count_1 equal to:
[1] 1 2 2 2 1 1 1 1 2 1
where each element of count_1 corresponds to one combination of two columns (without the names of the data frame)
AB AC AD AE BC BD BE CD CE DE
1 2 2 2 1 1 1 1 2 1
Not clear what you're after at all.
As to the first code chunk, that is some ugly R coding involving a whole bunch of unnecessary while/for loops.
You can get the same result items2 in one single line.
items2 <- sort(toupper(unlist(sapply(1:4, function(i)
sapply(5:(i+1), function(j)
paste(letters[i], letters[j], sep = ""))))));
items2;
# [1] "AB" "AC" "AD" "AE" "BC" "BD" "BE" "CD" "CE" "DE"
As to the second code chunk, please explain what you're trying to calculate. It's likely that these while/for loops are as unnecessary as in the first case.
Update
Note that this is based on a as defined at the beginning of your post. Your expected output is based on a different a, that you changed further down the post.
There is no need for a for/while loop; each of the two "functions" can be written as a one-liner.
# Your sample dataframe a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
# Function 1
items2 <- toupper(unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
paste(letters[i], letters[j], sep = "")))));
# Function 2
count_1 <- unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i+1), function(j)
sum(a[, i] + a[, j] == 2))));
# Add names and sort
names(count_1) <- items2;
count_1 <- count_1[order(names(count_1))];
# Output
count_1;
#AB AC AD AE BC BD BE CD CE DE
# 1 2 2 2 1 1 1 2 1 1
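The same result can also be sketched with combn(), which enumerates the column pairs directly in the order AB, AC, ..., DE and uses the actual column names instead of letters[i]. This assumes a as defined at the beginning of the question; pairs is an illustrative name.

```r
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0),
                D = c(0,0,1,1,0), E = c(0,1,1,0,1))

# All unordered pairs of column names, one pair per column of the result
pairs <- combn(colnames(a), 2)

# For each pair, count the rows where both columns equal 1
count_1 <- apply(pairs, 2, function(p) sum(a[[p[1]]] == 1 & a[[p[2]]] == 1))

# Label each count with its pair, e.g. "AB"
names(count_1) <- apply(pairs, 2, paste, collapse = "")
count_1
```

Because combn() already returns the pairs in sorted order, no extra sorting step is needed.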
I have a matrix in R which is in the following format:
A B C D E
1 0 0 1 0
0 0 1 0 1
1 1 1 0 1
.
.
.
For each column, I need to compare the value in each row with the corresponding values in columns D and E. For example:
if(mat[1,1] == 1 && mat[1,4]==1)
vec[1]+=1
else if(mat[1,1] == 1 && mat[1,5]==1)
vec[1]-=1
Basically, vec will be positive if number of 1s in column 4 is greater than number of 1s in column 5.
For this I use an apply function that passes the matrix row by row; inside it I loop from 1 to the number of columns and apply the above conditions, as shown below:
outputv = vector(,ncol(mat))
A <- function(vec){
for(i in 1:length(vec)-2)
{
if(vec[i]==1 &&vec[length(vec)-1]==1)
outputv[i] = outputv[i] + 1
else if(vec[i] == 1&& vec[length(vec)-2]==1)
outputv[i] = outputv[i] - 1
}
}
apply(mat,1,A)
I do realize the loop isn't an efficient method, but even with this approach, the values in outputv are all 0.
The expected outputv for the given input matrix would be
0 1 2
For the first column, 1 appears in the 1st and 3rd rows. In the 1st row, the 4th column has a 1, so subtract 1. In the 3rd row, the 5th column has a 1, so add 1.
Total = -1 + 1 = 0
On 0/1 values, the logical operation & is equivalent to multiplication. So you can simplify your condition to column-wise multiplication followed by a sum.
> mat <- as.matrix(read.table(
+ text = "
+ A B C D E
+ 1 0 0 1 0
+ 0 0 1 0 1
+ 1 1 1 0 1", header = TRUE ) )
>
> outputv <- rep(0, ncol(mat)-2 ) # initialize vector with zeros
>
> for (n in 1:(ncol(mat)-2) ) # calculate outputv in loop
+ outputv[n] <- sum( mat[,n]*(-mat[,ncol(mat)-1] +mat[,ncol(mat)]) )
> outputv
[1] 0 1 2
> vec <- sum(outputv)
> vec
[1] 3
But the best (and fastest) solution may be this approach based on matrix multiplication:
- (mat[,ncol(mat)-1] %*% mat[,1:(ncol(mat)-2)] ) +
mat[,ncol(mat)] %*% mat[,1:(ncol(mat)-2)]
It gives desired output:
A B C
[1,] 0 1 2
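The two products can also be folded into one. A small sketch, assuming the last two columns always play the roles of D and E: subtract them first to get a per-row weight, then multiply once. The names n and w are illustrative.

```r
mat <- matrix(c(1, 0, 0, 1, 0,
                0, 0, 1, 0, 1,
                1, 1, 1, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C", "D", "E")))

n <- ncol(mat)

# Per-row weight: +1 where the last column is 1, -1 where the second-to-last is 1
w <- mat[, n] - mat[, n - 1]

# Weighted column sums for every column except the last two
outputv <- drop(w %*% mat[, seq_len(n - 2)])
outputv
```

drop() turns the 1-row result into a plain named vector, so outputv can be indexed by column name.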