I have 8 attendace lists from 8 different conferences. I need to know what persons assisted to at least 7 of the 8 conferences. I don't want to do it checking name by name in each list, so I'm planning to do it using R, but I have no clue about it. Any suggestions?
Might be a more simple way (my R is getting a bit rusty), but this works:
library(dplyr)
unique_attendees <- c('a', 'b', 'c', 'd', 'e')
conf1_attendees <- c('a','b')
conf2_attendees <- c('a','b','c')
conf3_attendees <- c('a','b','c','e')
conf4_attendees <- c('b', 'e')
conf5_attendees <- c('a','d', 'e')
conf6_attendees <- c('a','d', 'e')
conf7_attendees <- c('a','b', 'e')
conf8_attendees <- c('a','b', 'c')
conferences <- list(conf1_attendees, conf2_attendees, conf3_attendees, conf4_attendees, conf5_attendees, conf6_attendees, conf7_attendees,conf8_attendees)
attendance_record <- dplyr::bind_rows(lapply(unique_attendees, function(x){
cat(c('Working with: ', x, '\n'))
attendance <- lapply(conferences, function(y){
attended <- grepl(x, y)
return(attended)
})
number_attended = length(which(unlist(attendance) == TRUE))
result <- data.frame(person=x, number_attended=number_attended)
}))
result <- attendance_record %>%
mutate(attended_at_least_7 = data.table::fifelse(number_attended >= 7, TRUE, FALSE))
print(result)
Output:
person number_attended attended_at_least_7
1 a 7 TRUE
2 b 6 FALSE
3 c 3 FALSE
4 d 2 FALSE
5 e 5 FALSE
Obviously you'll need to adapt it to your problem since we don't know how your records are stored.
Related
I have two time series vectors: complete_data and incomplete_data. the data in the vector consists of 6 possible events which occur randomly throughout the vector. In principle the two should be the same because with every event in complete_data, that same event was then added on to incomplete_data. however in reality there were some anomalies in the system and not all of the events in complete_data were sent to incomplete_data. Thus complete_data is longer than incomplete_data. I need to find the differences in the pattern between the two and mark them. I made an attempt but it assumes that the discrepancy between the two vectors occurs in a single chunk, whereas in reality, there are various "missing events" scattered in incomplete_data.
Here is my attempt:
complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)
findMatch <- function(complete_data, incomplete_data){
matching_inorder <- NULL
matching_reverseorder <- NULL
for (i in 1:length(complete_data)){
matching_inorder[i] <- complete_data[i] == incomplete_data[i]
matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
}
is_match <- ifelse(matching_inorder == FALSE &
rev(matching_reverseorder) == FALSE, 'non_match', 'match')
is_match
}
dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
dfincomplete$incomplete_data)
And here is what I would like to get:
dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non-match', 'match',
'match', 'match', 'non_match', 'match', 'match', 'match')
In reality my data is much larger than these examples with many different discrepancies scattered throughout the vector. Though there aren't necessarily too many discrepancies to make the task meaningless, for example, in one case the complete vector has 320 datapoints whilst the incomplete vector has 309.
Any help that can be offered would be much appreciated.
There are various ways to do this, but here's a recursive one, where x is assumed to be a complete sequence and y incomplete.
compare <- function(x, y) {
if (length(x) > 0) {
if (x[1] == y[1]) {
x[1] <- "match"
c(x[1], compare(x[-1], y[-1]))
} else {
x[1] <- "no match"
c(x[1], compare(x[-1], y))
}
}
}
compare(complete_data, incomplete_data)
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
Another one that perhaps is more readable and uses a simple loop would be
out <- rep(NA, length(incomplete_data))
gap <- 0
for(i in seq_along(complete_data)) {
if (complete_data[i] == incomplete_data[i - gap]) {
out[i] <- "match"
} else {
out[i] <- "no match"
gap <- gap + 1
}
}
out
# [1] "match" "match" "match" "match" "no match" "match"
# [7] "match" "match" "no match" "match" "match" "match"
If you can afford having event names only one letter long, here is a solution using string matching. The trick is to transform the incomplete data to a pattern including places to insert new characters.
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data,stringsAsFactors=FALSE)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
y <- paste0('^(.*)',paste(incomplete_data,collapse='(.*)'),'(.*)$')
x <- paste(complete_data,collapse="")
z <- str_length(str_match(x,y)[-1])
data.frame(incomplete_data=c("",incomplete_data),stringsAsFactors=FALSE) %>%
mutate(n=ifelse(incomplete_data=="",z,z+1)) %>%
filter(n>0) %>%
uncount(n) %>%
mutate(incomplete_data=ifelse(str_detect(rownames(.),"\\."),"",incomplete_data)) %>%
bind_cols(dfcomplete) %>%
mutate(match=complete_data==incomplete_data)
# incomplete_data complete_data match
#1 a a TRUE
#2 b b TRUE
#3 c c TRUE
#4 a a TRUE
#5 B FALSE
#6 c c TRUE
#7 a a TRUE
#8 b b TRUE
#9 C FALSE
#10 a a TRUE
#11 b b TRUE
#12 c c TRUE
I have the following df and use-case, I'd like to find and set something in all rows for which exist another row satisfying a condition e.g.
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'))
> df
X Y
1 a a
2 b c
3 c d
I'd like to find those rows whos Y value is the same as X value in another row. In the example above would be row #2 is true because Y = c and row #3 has X = c. Note that row #1 does not satisfy the condition.
Something like:
df$Flag <- find(df, Y == X_in_another_row(df))
1
For each Y, we check if any value in X (other than in the same row) matches.
sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i])
#[1] FALSE TRUE FALSE
If indices are necessary, wrap the whole thing in which
which(sapply(1:NROW(df), function(i) df$Y[i] %in% df$X[-i]))
#[1] 2
2 (not tested well)
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = FALSE)
temp = outer(df$X, df$Y, "==") #Check equality among values of X and Y
diag(temp) = FALSE #Set diagonal values as FALSE (for same row)
colSums(temp) > 0
#[1] FALSE TRUE FALSE
which(match(df$Y,df$X)!=1:nrow(df))
I think this should work.
df <- data.frame(X= c(1,2,3,4,5,3,2,1), Y = c(1,2,3,4,5,6,7,8))
which(with(df, (X %in% Y) & (X != Y)))
Works on the original data.frame, if we set stringsasfactors=FALSE
df <- data.frame(X=c('a','b','c'), Y=c('a','c','d'), stringsAsFactors = F)
which(with(df, (X %in% Y) & (X != Y)))
Quite convoluted but I'll put it here anyway. This should work even if there are repeated values in X.
For example with the following dataframe df2:
df2 = data.frame(X=c('a','b','c','a','d'), Y=c('a','c','d','e','b'))
X Y
1 a a
2 b c
3 c d
4 a e
5 d b
## Specifying the same factor levels allows us to get a square matrix
df2$X = factor(df2$X,levels=union(df2$X,df2$Y))
df2$Y = factor(df2$Y,levels=union(df2$X,df2$Y))
m = as.matrix(table(df2))
valY = rowSums(m)*colSums(m)-diag(m)
which(df2$Y %in% names(valY)[as.logical(valY)])
[1] 1 2 3 5
Essentially you want to know whether Y is in X but you want the condition to be FALSE when X == Y:
df$Z <- with(df, (Y != X) & (Y %in% X))
# Assume you want to use position 4, value 'c', to find all the rows that Y is 'c'
df <- data.frame(X = c('a', 'b', 'd', 'c'),
Y = c('a', 'c', 'c', 'd'))
row <- 4 # assume the desire row is position 4
val <- as.character( df[(row),'X'] ) # get the character and turn it into character type
df[df$Y == val,]
# Result
# X Y
# 2 b c
# 3 d c
This is an extension of the question asked in Count number of times combination of events occurs in dataframe columns, I will reword the question again so it is all here:
I have a data frame and I want to calculate the number of times each combination of events in two columns occur (in any order), with a zero if a combination doesn't appear.
For example say I have
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
So
x y
a c
b c
c a
c a
c a
c b
a and b do not occur together, a and c 4 times (rows 2, 4, 5, 6) and b and c twice (3rd and 7th rows) so I would want to return
x-y num
a-b 0
a-c 4
b-c 2
I hope this makes sense? Thanks in advance
This should do it:
res = table(df)
To convert to data frame:
resdf = as.data.frame(res)
The resdf data.frame looks like:
x y Freq
1 a a 0
2 b a 0
3 c a 2
4 a b 0
5 b b 0
6 c b 1
7 a c 1
8 b c 1
9 c c 0
Note that this answer takes order into account. If ordering of the columns is unimportant, then modifying the original data.frame prior to the process will remove the effect of ordering (a-c treated the same as c-a).
df1 = as.data.frame(t(apply(df,1,sort)))
As said, you can do this with factor() and expand.grid() (or another way to get all possible combinations)
all.possible <- expand.grid(c('a','b','c'), c('a','b','c'))
all.possible <- all.possible[all.possible[, 1] != all.possible[, 2], ]
all.possible <- unique(apply(all.possible, 1, function(x) paste(sort(x), collapse='-')))
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
table(factor(apply(df , 1, function(x) paste(sort(x), collapse='-')), levels=all.possible))
An alternative, because I was a bit bored. Perhaps a bit more generalised? But probably still uglier than it could be...
df2 <- as.data.frame(table(df))
df2$com <- apply(df2[,1:2],1,function(x) if(x[1] != x[2]) paste(sort(x),collapse='-'))
df2 <- df2[df2$com != "NULL",]
ddply(df2, .(unlist(com)), summarise,
num = sum(Freq))
I would like to create a co-authorship network using igraph .
My data are organized in a data.frame which looks like that:
DF1 <- cbind(Papers = paste('Paper', 1:5, sep = ''),
Author1 = c('A', 'D', 'C', 'C', 'C'),
Author2 = c('B', 'C', 'F', NA, 'F'),
Author3 = c('C', 'E', NA, NA, 'D'))
I would like to create an Edge list which looks like this:
Vertex1 Vertex2
A B
D C
C F
C F
A C
D E
C D
B C
C E
F D
Is there anyway to do this in R (igraph for example)
The following function does the trick but for large dataset (over 5,000 papers) it takes too long to run
Fun_DFtoEdgeList <- function (Inputdataframe)
{
## This function create an edge list to create a network
## Input : Dataframe with UNIQUE VALUES !!!!
ResEdgeList <- data.frame(Vertex1 = c('--'), Vertex2 = c('--'))
for (i in 1 : (ncol(Inputdataframe)-1))
{
for (j in 2: (ncol(Inputdataframe)))
{
if (i !=j)
{
#print(paste(i, j, sep ='--'))
ToAppend <- data.frame(cbind(Inputdataframe[,i], Inputdataframe[,j]))
names(ToAppend) <- names(ResEdgeList)
#print(ToAppend)
ResEdgeList <- rbind(ResEdgeList, ToAppend)
}
}
}
ResEdgeList <- data.frame(ResEdgeList[-1,], stringsAsFactors = FALSE)
ResEdgeList<- subset(ResEdgeList, (is.na(Vertex1) == FALSE ) & (is.na(Vertex2) == FALSE ))
ResEdgeList
}
Fun_DFtoEdgeList (DF1[,-1])
``
Any help appreciated. (I had previously posted this question under different heading but am told that I wasn't clear enough)
Your code does not produce the data you give because it is iterating over the "Paper" column. It will also prove slow because everytime you append to an existing object, R has to take another copy of the entire object...when you do this iteratively, things slow to a crawl. Looking at your output, I think this is does what you want:
#First, creat all combos of the columns you want. I don't think you want to include the "Paper" column?
x <- combn(2:4,2)
#-----
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
#next use apply to go through each pair:
apply(x, 2, function(z) data.frame(Vertex1 = DF1[, z[1]], Vertex2 = DF1[, z[2]]))
#-----
[[1]]
Vertex1 Vertex2
1 A B
2 D C
3 C F
4 C <NA>
5 C F
....
#So use do.call to rbind them together:
out <- do.call("rbind",
apply(x, 2, function(z) data.frame(Vertex1 = DF1[, z[1]], Vertex2 = DF1[, z[2]])))
#Finally, filter out the rows with NA:
out[complete.cases(out),]
#-----
Vertex1 Vertex2
1 A B
2 D C
3 C F
5 C F
6 A C
7 D E
10 C D
11 B C
12 C E
15 F D
Finally, see how this scales to a larger problem:
#Just over a million papers
zz <- matrix(sample(letters, 1000002, TRUE), ncol = 3)
x <- combn(1:3, 2)
system.time(do.call("rbind",
apply(x, 2, function(z) data.frame(Vertex1 = zz[, z[1]], Vertex2 = zz[, z[2]]))))
#-----
user system elapsed
1.332 0.144 1.482
1.5 seconds seems pretty reasonable to me?
There might be a better way to do this, but try combn, it produces all unique combinations:
DF1 <- cbind(Papers = paste('Paper', 1:5, sep = ''),
Author1 = c('A', 'D', 'C', 'C', 'C'),
Author2 = c('B', 'C', 'F', NA, 'F'),
Author3 = c('C', 'E', NA, NA, 'D'))
require(igraph)
l=apply(DF1[,-1],MARGIN=1,function(x) na.omit(data.frame(t(combn(x,m=2)))))
df=do.call(rbind,l)
g=graph.data.frame(df,directed=F)
plot(g)
I am trying to compare sets of variables(X) that are stored in two dataframes (foo, bar). Each X is a unique independent variable that has up to 10 values of Y associated with it. I would like to compare every foo.X with every bar.X by comparing the number of Y values they have in common - so the output could be a matrix with axes of foo.x by bar.x in length.
this simple example of foo and bar would want to return a 2x2 matrix comparing a,b with c,d:
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x= c('c', 'c', 'c', 'd', 'd', 'd'), y=c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
EDIT:
I've left the following code for other newbies to learn from (for loops are effectvie but probably very suboptimal), but the two solutions below are effective. In particular Ramnath's use of data.table is very effective when dealing with very large dataframes.
store the dataframes as lists where the values of y are stored using the stack function
foo.list <- dlply(foo, .(x), function(x) stack(x, select = y))
bar.list <- dlply(bar, .(x),function(x) stack(x, select = y))
write a function for comparing membership in the two stacked lists
comparelists <- function(list1, list2) {
for (i in list1){
for (j in list2){
count <- 0
if (i[[1]] %in% j[[1]]) count <- count + 1
}
}
return count
}
write an output matrix
output.matrix <- matrix(1:length(foo.list), 1:length(bar.list))
for (i in foo.list){
for (j in bar.list){
output.matrix[i,j] <- comparelists(i,j)
}
}
There must be a hundred ways to do this; here is one that feels relatively straightforward to me:
library(reshape2)
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
# Create a function that counts the number of common elements in two groups
nShared <- function(A, B) {
length(intersect(with(foo, y[x==A]), with(bar, y[x==B])))
}
# Enumerate all combinations of groups in foo and bar
(combos <- expand.grid(foo.x=unique(foo$x), bar.x=unique(bar$x)))
# foo.x bar.x
# 1 a c
# 2 b c
# 3 a d
# 4 b d
# Find number of elements in common among all pairs of groups
combos$n <- mapply(nShared, A=combos$foo.x, B=combos$bar.x)
# Reshape results into matrix form
dcast(combos, foo.x ~ bar.x)
# foo.x c d
# 1 a 1 0
# 2 b 0 1
Here is a simpler approach using merge
library(reshape2)
df1 <- merge(foo, bar, by = 'y')
dcast(df1, x.x ~ x.y, length)
x.x c d
1 a 1 0
2 b 0 1
EDIT. The merge can be faster using data.table. Here is the code
foo_dt <- data.table(foo, key = 'y')
bar_dt <- data.table(bar, key = 'y')
df1 <- bar_dt[foo_dt, nomatch = 0]