This is an extension of the question asked in "Count number of times combination of events occurs in dataframe columns". I will restate the question so it is all here:
I have a data frame and I want to count the number of times each combination of events in two columns occurs (in any order), with a zero if a combination doesn't appear.
For example say I have
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
So
x y
a c
b c
c a
c a
c a
c b
a and b do not occur together, a and c 4 times (rows 2, 4, 5, 6) and b and c twice (3rd and 7th rows) so I would want to return
x-y num
a-b 0
a-c 4
b-c 2
I hope this makes sense? Thanks in advance
This should do it:
res = table(df)
To convert to data frame:
resdf = as.data.frame(res)
The resdf data.frame looks like:
x y Freq
1 a a 0
2 b a 0
3 c a 2
4 a b 0
5 b b 0
6 c b 1
7 a c 1
8 b c 1
9 c c 0
Note that this answer takes order into account. If ordering of the columns is unimportant, then modifying the original data.frame prior to the process will remove the effect of ordering (a-c treated the same as c-a).
df1 = as.data.frame(t(apply(df,1,sort)))
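A minimal sketch of how the sorting idea above could be carried through to order-insensitive counts, including zeros for pairs that never occur. It uses only base R; the names `pairs_seen`, `events`, `all_pairs` and `counts` are illustrative and not part of the original answer.

```r
# Five-row example data frame, as defined in the question's code
df <- data.frame(x = c('a', 'b', 'c', 'c', 'c'),
                 y = c('c', 'c', 'a', 'a', 'b'),
                 stringsAsFactors = FALSE)

# Sort each row so that c-a and a-c become the same pair
sorted <- t(apply(df, 1, sort))
pairs_seen <- paste(sorted[, 1], sorted[, 2], sep = "-")

# Enumerate every possible unordered pair of distinct events
events <- sort(unique(unlist(df)))
all_pairs <- apply(combn(events, 2), 2, paste, collapse = "-")

# Using a factor with fixed levels keeps pairs that never occur (count 0)
counts <- table(factor(pairs_seen, levels = all_pairs))
as.data.frame(counts, responseName = "num")
```

Fixing the factor levels up front is what produces the zero rows; a plain `table(pairs_seen)` would silently drop pairs that do not appear.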
As said, you can do this with factor() and expand.grid() (or another way to get all possible combinations)
all.possible <- expand.grid(c('a','b','c'), c('a','b','c'))
all.possible <- all.possible[all.possible[, 1] != all.possible[, 2], ]
all.possible <- unique(apply(all.possible, 1, function(x) paste(sort(x), collapse='-')))
df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'),
'y' = c('c', 'c', 'a', 'a', 'b'))
table(factor(apply(df , 1, function(x) paste(sort(x), collapse='-')), levels=all.possible))
An alternative, because I was a bit bored. Perhaps a bit more generalised? But probably still uglier than it could be...
library(plyr)  # provides ddply() and .()
df2 <- as.data.frame(table(df))
df2$com <- apply(df2[,1:2],1,function(x) if(x[1] != x[2]) paste(sort(x),collapse='-'))
df2 <- df2[df2$com != "NULL",]
ddply(df2, .(unlist(com)), summarise,
num = sum(Freq))
I have two lists:
source <- list(c(5,10,20,30))
source.val <- list(c('A', 'B', 'C', 'D'))
Each element in source has a corresponding value in source.val. I want to create a data frame from the above two lists that looks like this:
source.val_5 source.val_10 source.val_20 source.val_30
A B C D
I did this
tempList <- list()
for(i in 1:lengths(source)){
tempList[[i]] <- data.frame(variable = paste0('source.val_',source[[1]][[i]]),
value = source.val[[1]][[i]])
}
temp.dat <- do.call('rbind', tempList)
temp.dat_wider <- tidyr::pivot_wider(temp.dat, names_from = variable, values_from = value)
Now I want to do this across a bigger list
source <- list(c(5,10,20,30),
c(5,10,20,30),
c(5,10,20,30),
c(5,10,20,30))
source.val <- list(c('A', 'B', 'C', 'D'),
c('B', 'B', 'D', 'D'),
c('C', 'B', 'A', 'D'),
c('D', 'B', 'B', 'D'))
The resulting table will have 4 rows, looking like this:
A tibble: 4 x 4
source.val_5 source.val_10 source.val_20 source.val_30
A B C D
B B D D
C B A D
D B B D
What is the best way to use a function like mapply to achieve my desired result?
For the example shared, where all the elements of source have the same order you can do :
cols <- paste0('source.val_', sort(unique(unlist(source))))
setNames(do.call(rbind.data.frame, source.val), cols)
# source.val_5 source.val_10 source.val_20 source.val_30
#1 A B C D
#2 B B D D
#3 C B A D
#4 D B B D
However, for a general case where the elements of source do not all follow the same order, you can reorder source.val based on source:
source.val <- Map(function(x, y) y[order(x)], source, source.val)
and then use the above code.
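A small sketch combining the two steps above for the general case. The lists `src` and `src.val` are hypothetical stand-ins for `source` and `source.val` (renamed here so the base function `source()` is not masked), and the second element is deliberately given in a different order to show the reordering.

```r
# Hypothetical inputs: the second element lists its keys in a different order
src     <- list(c(5, 10, 20, 30), c(30, 5, 20, 10))
src.val <- list(c("A", "B", "C", "D"), c("D", "B", "D", "B"))

# Reorder each value vector by its keys and name it after the sorted keys
rows <- Map(function(keys, vals) {
  setNames(vals[order(keys)], paste0("source.val_", sort(keys)))
}, src, src.val)

# Stack the named vectors into one data frame, one row per list element
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
result
```

This assumes every element of `src` contains the same set of keys; if the key sets differ, the rows would need to be aligned by name before binding.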
I want to obtain the minimum distance between 2 columns, however the same name may appear in both Column A and Column B. See example below;
Patient1 Patient2 Distance
A B 8
A C 11
A D 19
A E 23
B F 6
C G 25
So the output I need is:
Patient Patient_closest_distance Distance
A B 8
B F 6
C A 11
I have tried using the list function
library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]
However, I just get the minimum distance for each column, i.e. C will have 2 results as it is in both columns rather than showing the closest patient considering both columns. Also, I only get a list of distances, so I can't see which patient is linked to which;
Patient1 SNP
1: A 8
This code below works.
# Create sample data frame
df <- data.frame(
Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variable (instead of factor)
df$Patient1 <- as.character(df$Patient1); df$Patient2 <- as.character(df$Patient2);
# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)
# group pairs by min distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), min(Distance))
# Resort, min to top.
nearest <- df[order(df$`min(Distance)`), ]
# Keep only the first of each group
nearest <- nearest[!duplicated(nearest$Patient1),]
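For comparison, here is a base R sketch of the same mirror-then-deduplicate idea applied to the data from the question, without dplyr. The object names `full_data`, `both` and `nearest`, and the use of the question's desired column names, are assumptions for illustration.

```r
# Data from the question
full_data <- data.frame(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25),
  stringsAsFactors = FALSE
)

# Stack the pairs in both directions so every patient appears in one column
both <- rbind(
  data.frame(Patient = full_data$Patient1,
             Patient_closest_distance = full_data$Patient2,
             Distance = full_data$Distance),
  data.frame(Patient = full_data$Patient2,
             Patient_closest_distance = full_data$Patient1,
             Distance = full_data$Distance)
)

# Sort by patient, then distance, and keep the first (closest) row per patient
both <- both[order(both$Patient, both$Distance), ]
nearest <- both[!duplicated(both$Patient), ]
rownames(nearest) <- NULL
nearest
```

Ties in distance are resolved by whichever row happens to sort first, which may or may not be what is wanted.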
Following the example here: input format for binomial glm in R, I have a dataset with y = cbind(success, failure), with each row representing one treatment.
My question is: How do I convert this to a "binary" format for each observation (e.g., y = 0 or 1 for each observation)?
Working example here:
df1 <- data.frame(time = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
symb = c('a', 'a', 'a', 'b', 'b', 'b','a', 'a', 'a', 'b', 'b', 'b'),
success= c(324,234,123,234,424,323,124,537,435,645,231,234),
failure= c(84,23,20,74,44,73,12,59,41,68,23,34))
Where success = 1 and failure = 0, and the final dataframe will have 4423 rows (sum(df1$success)+sum(df1$failure)). This answer gets to where I'm trying to go.
Here's a way, using gather to reshape the data, and then hints from this answer to do the other heavy lifting.
library(tidyverse)
# convert to long format
df1_long <- df1 %>%
gather(code, count, success, failure)
# function to repeat a data.frame
rep_df <- function(df, n){
do.call('rbind', replicate(n, df, simplify = FALSE))
}
# loop through each row and then rbind together
df1_full <- do.call('rbind',
lapply(1:nrow(df1_long),
FUN = function(i)
rep_df(df1_long[i,], df1_long[i,]$count)))
# create binary_code
df1_full$binary_code <- as.numeric(df1_full$code == 'success')
Here's what the first few rows look like:
# time symb code count binary_code
# 1 1 a success 324 1
# 2 1 a success 324 1
# 3 1 a success 324 1
# 4 1 a success 324 1
# 5 1 a success 324 1
# 6 1 a success 324 1
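Because the row-by-row `rbind` above copies data repeatedly, a vectorised alternative may be worth considering. The following is a base R sketch, not the answer's method: it repeats row indices with `rep()` instead of binding rows in a loop. The names `long` and `full` are illustrative.

```r
df1 <- data.frame(time = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                  symb = c('a', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'b', 'b', 'b'),
                  success = c(324, 234, 123, 234, 424, 323, 124, 537, 435, 645, 231, 234),
                  failure = c(84, 23, 20, 74, 44, 73, 12, 59, 41, 68, 23, 34))

# One row per (treatment, outcome) with its count: success -> 1, failure -> 0
long <- rbind(
  data.frame(df1[c("time", "symb")], binary_code = 1, count = df1$success),
  data.frame(df1[c("time", "symb")], binary_code = 0, count = df1$failure)
)

# Repeat each row index 'count' times, then drop the count column
full <- long[rep(seq_len(nrow(long)), times = long$count),
             c("time", "symb", "binary_code")]
rownames(full) <- NULL

nrow(full)  # equals sum(df1$success) + sum(df1$failure)
```

The rows come out grouped by outcome (all successes, then all failures) rather than by treatment; reorder afterwards if that matters.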
I would like to create a co-authorship network using igraph.
My data are organized in a data.frame which looks like this:
DF1 <- cbind(Papers = paste('Paper', 1:5, sep = ''),
Author1 = c('A', 'D', 'C', 'C', 'C'),
Author2 = c('B', 'C', 'F', NA, 'F'),
Author3 = c('C', 'E', NA, NA, 'D'))
I would like to create an Edge list which looks like this:
Vertex1 Vertex2
A B
D C
C F
C F
A C
D E
C D
B C
C E
F D
Is there any way to do this in R (with igraph, for example)?
The following function does the trick but for large dataset (over 5,000 papers) it takes too long to run
Fun_DFtoEdgeList <- function (Inputdataframe)
{
  ## This function create an edge list to create a network
  ## Input : Dataframe with UNIQUE VALUES !!!!
  ResEdgeList <- data.frame(Vertex1 = c('--'), Vertex2 = c('--'))
  for (i in 1 : (ncol(Inputdataframe)-1))
  {
    for (j in 2: (ncol(Inputdataframe)))
    {
      if (i !=j)
      {
        #print(paste(i, j, sep ='--'))
        ToAppend <- data.frame(cbind(Inputdataframe[,i], Inputdataframe[,j]))
        names(ToAppend) <- names(ResEdgeList)
        #print(ToAppend)
        ResEdgeList <- rbind(ResEdgeList, ToAppend)
      }
    }
  }
  ResEdgeList <- data.frame(ResEdgeList[-1,], stringsAsFactors = FALSE)
  ResEdgeList<- subset(ResEdgeList, (is.na(Vertex1) == FALSE ) & (is.na(Vertex2) == FALSE ))
  ResEdgeList
}
Fun_DFtoEdgeList (DF1[,-1])
Any help appreciated. (I had previously posted this question under different heading but am told that I wasn't clear enough)
Your code does not produce the data you give because it is iterating over the "Paper" column. It will also prove slow because every time you append to an existing object, R has to take another copy of the entire object. When you do this iteratively, things slow to a crawl. Looking at your output, I think this does what you want:
#First, create all combos of the columns you want. I don't think you want to include the "Paper" column?
x <- combn(2:4,2)
#-----
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
#next use apply to go through each pair:
apply(x, 2, function(z) data.frame(Vertex1 = DF1[, z[1]], Vertex2 = DF1[, z[2]]))
#-----
[[1]]
Vertex1 Vertex2
1 A B
2 D C
3 C F
4 C <NA>
5 C F
....
#So use do.call to rbind them together:
out <- do.call("rbind",
apply(x, 2, function(z) data.frame(Vertex1 = DF1[, z[1]], Vertex2 = DF1[, z[2]])))
#Finally, filter out the rows with NA:
out[complete.cases(out),]
#-----
Vertex1 Vertex2
1 A B
2 D C
3 C F
5 C F
6 A C
7 D E
10 C D
11 B C
12 C E
15 F D
Finally, see how this scales to a larger problem:
#Just over a million papers
zz <- matrix(sample(letters, 1000002, TRUE), ncol = 3)
x <- combn(1:3, 2)
system.time(do.call("rbind",
apply(x, 2, function(z) data.frame(Vertex1 = zz[, z[1]], Vertex2 = zz[, z[2]]))))
#-----
user system elapsed
1.332 0.144 1.482
1.5 seconds seems pretty reasonable to me?
There might be a better way to do this, but try combn, it produces all unique combinations:
DF1 <- cbind(Papers = paste('Paper', 1:5, sep = ''),
Author1 = c('A', 'D', 'C', 'C', 'C'),
Author2 = c('B', 'C', 'F', NA, 'F'),
Author3 = c('C', 'E', NA, NA, 'D'))
require(igraph)
l=apply(DF1[,-1],MARGIN=1,function(x) na.omit(data.frame(t(combn(x,m=2)))))
df=do.call(rbind,l)
g=graph.data.frame(df,directed=F)
plot(g)
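A base R sketch of the per-paper `combn` idea, without igraph, that drops missing authors before forming pairs and skips papers with fewer than two authors. The names `edges_per_paper` and `edge_list` are illustrative only.

```r
DF1 <- cbind(Papers = paste('Paper', 1:5, sep = ''),
             Author1 = c('A', 'D', 'C', 'C', 'C'),
             Author2 = c('B', 'C', 'F', NA, 'F'),
             Author3 = c('C', 'E', NA, NA, 'D'))

# For each paper, keep the non-missing authors and list every pair of them
edges_per_paper <- lapply(seq_len(nrow(DF1)), function(i) {
  authors <- unname(DF1[i, -1])
  authors <- authors[!is.na(authors)]
  if (length(authors) < 2) return(NULL)  # single-author paper: no edges
  t(combn(authors, 2))                   # one row per co-author pair
})

# rbind() ignores the NULL entries, leaving one two-column matrix
edge_list <- as.data.frame(do.call(rbind, edges_per_paper),
                           stringsAsFactors = FALSE)
names(edge_list) <- c("Vertex1", "Vertex2")
edge_list
```

Removing the `NA` values before calling `combn` avoids generating pairs that would only be filtered out afterwards. The rows come out grouped by paper, so their order differs from the column-wise edge list shown in the question.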
I am trying to compare sets of variables(X) that are stored in two dataframes (foo, bar). Each X is a unique independent variable that has up to 10 values of Y associated with it. I would like to compare every foo.X with every bar.X by comparing the number of Y values they have in common - so the output could be a matrix with axes of foo.x by bar.x in length.
this simple example of foo and bar would want to return a 2x2 matrix comparing a,b with c,d:
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x= c('c', 'c', 'c', 'd', 'd', 'd'), y=c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
EDIT:
I've left the following code for other newbies to learn from (for loops are effective but probably very suboptimal), but the two solutions below both work. In particular, Ramnath's use of data.table is very efficient when dealing with very large data frames.
Store each data frame as a list with one character vector of y values per level of x (dlply comes from plyr):
library(plyr)
foo.list <- dlply(foo, .(x), function(d) as.character(d$y))
bar.list <- dlply(bar, .(x), function(d) as.character(d$y))
Write a function that counts how many values of one vector also appear in the other:
comparelists <- function(v1, v2) {
  count <- 0
  for (i in v1) {
    if (i %in% v2) count <- count + 1
  }
  return(count)
}
Fill an output matrix with one row per foo group and one column per bar group:
output.matrix <- matrix(0, nrow = length(foo.list), ncol = length(bar.list),
                        dimnames = list(names(foo.list), names(bar.list)))
for (i in names(foo.list)) {
  for (j in names(bar.list)) {
    output.matrix[i, j] <- comparelists(foo.list[[i]], bar.list[[j]])
  }
}
There must be a hundred ways to do this; here is one that feels relatively straightforward to me:
library(reshape2)
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'))
# Create a function that counts the number of common elements in two groups
nShared <- function(A, B) {
length(intersect(with(foo, y[x==A]), with(bar, y[x==B])))
}
# Enumerate all combinations of groups in foo and bar
(combos <- expand.grid(foo.x=unique(foo$x), bar.x=unique(bar$x)))
# foo.x bar.x
# 1 a c
# 2 b c
# 3 a d
# 4 b d
# Find number of elements in common among all pairs of groups
combos$n <- mapply(nShared, A=combos$foo.x, B=combos$bar.x)
# Reshape results into matrix form
dcast(combos, foo.x ~ bar.x)
# foo.x c d
# 1 a 1 0
# 2 b 0 1
Here is a simpler approach using merge
library(reshape2)
df1 <- merge(foo, bar, by = 'y')
dcast(df1, x.x ~ x.y, length)
x.x c d
1 a 1 0
2 b 0 1
EDIT. The merge can be faster using data.table. Here is the code:
library(data.table)
foo_dt <- data.table(foo, key = 'y')
bar_dt <- data.table(bar, key = 'y')
df1 <- bar_dt[foo_dt, nomatch = 0]
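The merge-then-count idea above can also be sketched in base R, using `table()` in place of `dcast()`. This is an illustrative variant, not the answer's code; the names `shared` and `counts` are assumptions.

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)
bar <- data.frame(x = c('c', 'c', 'c', 'd', 'd', 'd'),
                  y = c('ab', 'xy', 'xz', 'xy', 'fx', 'xz'),
                  stringsAsFactors = FALSE)

# Inner join on y: one row per y value shared by a foo group and a bar group
shared <- merge(foo, bar, by = "y")  # columns: y, x.x (foo), x.y (bar)

# Cross-tabulate; fixed factor levels keep groups that share nothing
counts <- table(foo.x = factor(shared$x.x, levels = unique(foo$x)),
                bar.x = factor(shared$x.y, levels = unique(bar$x)))
counts
```

Setting the factor levels from the original data frames means a group with no shared values still appears as a row or column of zeros, which a plain tabulation of the merged result would omit.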