I have a vector that looks something like this.
v <- as.data.frame(list(v=(c("a","b","c",'d','e'))))
v
v
1 a
2 b
3 c
4 d
5 e
My vector has 5 different values. This means I can make 120 permutations of my vector.
Here are some examples of permutations
v v2 v3
1 a a a
2 b b c
3 c c b
4 d e d
5 e d e
I would like to create only 10 different vectors out of the 120 possible ones, but I would like to select the combination of 10 that maximises their covariance. Any idea how I could do this?
Thanks a lot in advance for your help.
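For reference, the figure of 120 is 5!, and a minimal base R sketch for drawing 10 distinct permutations at random could look like the block below. It does not attempt to maximise covariance; the seed, the names perms and res, and the rejection loop are illustrative assumptions, not part of the original question.

```r
v <- c("a", "b", "c", "d", "e")

# Number of possible orderings of 5 distinct values
n_perms <- factorial(length(v))

set.seed(42)  # arbitrary seed, only for reproducibility of the sketch

# Keep sampling until 10 distinct permutations have been collected;
# each permutation is stored under a key built from its own letters
perms <- list()
while (length(perms) < 10) {
  p <- sample(v)
  key <- paste(p, collapse = "")
  if (is.null(perms[[key]])) perms[[key]] <- p
}

# One column per permutation
res <- as.data.frame(perms, stringsAsFactors = FALSE)
names(res) <- paste0("v", seq_along(perms))
```

Choosing the 10 columns that optimise some covariance criterion would need an extra scoring step on top of this, for example after mapping the letters to numbers.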
I have two databases with different numbers of columns. All columns of the second database are included in the first database. The patients in the two databases are also different. I need to merge the two databases. The function merge (or the *_join functions of dplyr) will not work in principle, since I have to overlay the databases (stack one on top of the other). Row binding (rbind) should not work either, because the columns differ. What is a simple way to do it?
mydata<-data.frame(
ID=c(1,1,1,2,2),B=rep("b",5),C=rep("c",5),D=rep("d",5)
)
mydata2<-data.frame(ID=c(3,4),B=c("b2","b2"),C=c("c2","c2"))
The expected dataset is this below:
ID B C D
1 1 b c d
2 1 b c d
3 1 b c d
4 2 b c d
5 2 b c d
6 3 b2 c2 <NA>
7 4 b2 c2 <NA>
A mere merge should suffice
merge(mydata, mydata2, all = TRUE)
ID B C D
1 1 b c d
2 1 b c d
3 1 b c d
4 2 b c d
5 2 b c d
6 3 b2 c2 <NA>
7 4 b2 c2 <NA>
dplyr::full_join(mydata, mydata2)
seems to work.
You can use bind_rows() to combine two data frames that have different numbers of columns.
library(dplyr)
bind_rows(mydata, mydata2)
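If dplyr is not available, the same stacking can be sketched in base R by adding the missing columns as NA before calling rbind(). This is only an illustration under the assumption that every column of mydata2 also exists in mydata; the names missing_cols and combined are made up for the example.

```r
mydata <- data.frame(
  ID = c(1, 1, 1, 2, 2), B = rep("b", 5), C = rep("c", 5), D = rep("d", 5),
  stringsAsFactors = FALSE
)
mydata2 <- data.frame(
  ID = c(3, 4), B = c("b2", "b2"), C = c("c2", "c2"),
  stringsAsFactors = FALSE
)

# Columns present in the first data frame but absent from the second
missing_cols <- setdiff(names(mydata), names(mydata2))

# Fill them with NA so both data frames have the same columns
mydata2[missing_cols] <- NA

# Stack the rows, putting the columns of mydata2 in the order of mydata
combined <- rbind(mydata, mydata2[names(mydata)])
```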
I have an edgelist (2 columns) and I want to create a 3rd column with a weight for each edge, based on the number of times each word is mentioned in my data.
See my data attached.
For example: 'oil', 'bad' and 'gas' appear multiple times, and I would like to add 1 to the weight every time the same one appears (and delete the duplicate rows).
An easy solution for this case would be just to use table
# create some sample data
set.seed(1)
node1 <- rep("oil drilling", 20)
node2 <- sample(c("gas", "frack", "pollute", "good"), 20, replace = TRUE)
edglst <- data.frame(node1, node2)
head(edglst, 10)
node1 node2
1 oil drilling frack
2 oil drilling frack
3 oil drilling pollute
4 oil drilling good
5 oil drilling gas
6 oil drilling good
7 oil drilling good
8 oil drilling pollute
9 oil drilling pollute
10 oil drilling gas
#use table to get a dataframe with one row per combination and its frequency
as.data.frame(table(edglst))
node1 node2 Freq
1 oil drilling frack 5
2 oil drilling gas 4
3 oil drilling good 6
4 oil drilling pollute 5
EDIT: You may also need to remove some 0's if some possible combinations of nodes don't occur in your data, in which case
x <- as.data.frame(table(edglst))
x <- x[x$Freq != 0, ]
I do not want to type your data in so I will illustrate with some generated data.
set.seed(1234)
x = sample(LETTERS[1:6], 20, replace=TRUE)
y = sample(letters[1:6], 20, replace=TRUE)
dat = data.frame(x,y)
You can get the count that you want from the count function in the plyr package.
library(plyr)
count(dat)
x y freq
1 A b 1
2 A d 1
3 B b 4
4 B e 1
5 B f 2
6 D a 3
7 D b 2
8 D e 2
9 E c 1
10 F b 1
11 F d 1
12 F e 1
I am trying to understand the return type of the membership function applied to the result of cluster_edge_betweenness.
Saying
ceb <- cluster_edge_betweenness(g)
data <- membership(ceb)
print(data)
a b c d e f g h i j k l m n o p q r s t
1 2 3 4 5 4 6 6 7 8 9 10 11 3 6 12 5 3 13 6
I want to be able to ask, for node a, which cluster it is a member of.
print(data[2])
gives
b
2
Saying
print(data[[2]])
gives
[1] 2
I want to be able to write something that returns the 'b' part of this strange data type.
class(data)
gives
membership
typeof(data)
gives
double
data[2:10]
gives
b c d e f g h i j
2 3 4 5 4 6 6 7 8
What I was hoping to write was some code that said
vertex f is a member of cluster 4
data[[6]] will give me 4, but how do I get access to the f part?
Just discovered that it's possible to say
data[['a']]
and get the expected answer, so this data type is some kind of array addressed by the node name. I guess my question needs to be: how can I get a list of the keys of such a construct?
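A minimal sketch of one way to list the keys, assuming the membership object behaves like a named numeric vector (as the printed output above suggests). The vector memb is a hand-built stand-in for membership(ceb), not real igraph output, and the names keys and msgs are made up for the example.

```r
# Stand-in for membership(ceb): a named numeric vector (assumption)
memb <- c(a = 1, b = 2, c = 3, d = 4, e = 5, f = 4)

# names() returns the "keys", i.e. the vertex names
keys <- names(memb)

# Build one sentence per vertex from its name and its cluster number
msgs <- paste("vertex", keys, "is a member of cluster", memb)

print(msgs[6])
# [1] "vertex f is a member of cluster 4"
```

The same names() call should apply to the real object whenever the graph has vertex names, since the printed membership already shows them as labels.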
What I want
I have a dataframe whose first two columns are combinations of m string values. Those values are user ids and the dataframe represents relationships between them. Every row includes a user pair (user1, user2) and the number of times user1 has visited user2's profile. There are m different users, but there is not a row for every possible relationship, only for those pairs where one user has visited the other's profile. Mutual pairs can have different values or may not exist at all (if user1 has visited user2's profile but not the other way round).
I want to create reduced versions of my dataframe that are easier to work with. To that end, I want to extract groups of n users (n < m) with at least r relationships between them. I need to develop a function that checks whether any group of n users (n < m) accounts for at least r rows (relationships). If so, the function returns a dataframe formed by them.
That is to say, the function should return a dataframe (let's name it S) that meets the following requirements:
S is a subset of the original dataframe
NROW(S) >= r
length(unique(c(S[[1]], S[[2]]))) == n
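The three requirements above can be sketched as a small checking function. This is only an illustration: the name meets_requirements, the helper key, and the "|" separator (which assumes user ids never contain that character) are hypothetical choices, not part of the original code.

```r
meets_requirements <- function(S, df, n, r) {
  # Encode each (user1, user2) pair as one string; "|" is an assumed separator
  key <- function(d) paste(d$user1, d$user2, sep = "|")

  # 1. Every row of S must also be a row (pair) of the original dataframe
  is_subset <- all(key(S) %in% key(df))

  # 2. and 3. At least r rows, and exactly n distinct users across both columns
  n_users <- length(unique(c(as.character(S$user1), as.character(S$user2))))

  is_subset && nrow(S) >= r && n_users == n
}

# Small data set matching the example in the question
df <- data.frame(
  user1 = c("a", "a", "a", "b", "c", "c", "c", "d", "d", "d"),
  user2 = c("b", "c", "d", "a", "b", "d", "e", "a", "e", "c"),
  views = c(1, 1, 3, 1, 1, 2, 4, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Rows whose two users both belong to the group (a, b, c, d)
grp <- c("a", "b", "c", "d")
S <- df[df$user1 %in% grp & df$user2 %in% grp, ]

ok_8 <- meets_requirements(S, df, n = 4, r = 8)
ok_9 <- meets_requirements(S, df, n = 4, r = 9)
```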
Small example
This is a minimal example, where users ids are just one letter. There are 5 different users (a,b,c,d,e) and 10 relationships between them:
>df
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 c e 4
8 d a 5
9 d e 1
10 d c 2
For r = 8, n = 4, I try to extract a group of 4 users with at least 8 relationships between them. The output should be a dataframe of 8 rows that contains only 4 different user values in the first two columns (if such a group exists). There are 5 different 4-user combinations of the values above: [(a,b,c,d), (a,b,c,e), (a,b,d,e), (a,c,d,e), (b,c,d,e)]. However, only one of them, (a,b,c,d), includes 8 relationships between its members. Output should be:
[1] "a" "b" "c" "d"
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 d a 5
8 d c 2
For r = 6, n = 4, the aim is to find a group of 4 users with at least 6 relationships between them. In this case there are two 4-user combinations that meet the requirement of forming at least 6 rows: [(a,b,c,d), (a,c,d,e)]. Output could be either of those:
[1] "a" "b" "c" "d"
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 d a 5
8 d c 2
[1] "a" "c" "d" "e"
user1 user2 views
1 a c 1
2 a d 3
3 c d 2
4 c e 4
5 d a 5
6 d c 2
7 d e 1
For r > 8, no 4-user combination is valid. There is no 4-user group with more than 8 relationships between its members. In that case, output should be NULL.
Solution (if dataframe was small)
A solution for a small dataframe (that works for this minimal example) could be to get all possible combinations of n users and iterate over them, checking whether one of them meets the requirement of forming r relationships in df. Code could be something like this:
library(dplyr)
library(trotter)
library(gtools)  # assumed source of permutations(); not loaded in the original code
obtainUserGroup <- function(df, n, r) {
  ## Get unique letter _users_ in dataframe
  us <- unique(append(df$user1, df$user2))
  ## All possible combinations of users in _us_ in groups of _n_ users
  users <- cpv(n, us)
  ## Iterate over any possible group of _n_ users
  for (i in 1:NROW(users)) {
    l <- as.character(users[i])
    ## All ordered pairs (user1, user2) that can be formed within the group
    l <- as.data.frame(permutations(n, 2, l), stringsAsFactors = FALSE)
    names(l) <- c("user1", "user2")
    # Check which combinations in l are part of df
    dc <- semi_join(df, l, by = c("user1", "user2"))
    if (NROW(dc) >= r) return(dc)
  }
  return(NULL)
}
df <- data.frame(c("a","a","a","b","c","c","c","d","d","d"), c("b","c","d","a","b","d","e","a","e","c"), c(1,1,3,1,1,2,4,5,1,2), stringsAsFactors = FALSE)
names(df) <- c("user1", "user2", "views")
n <- 4
r <- 8
obtainUserGroup(df, n, r)
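The per-group check could also be sketched in base R without building the ordered pairs, by keeping the rows whose two users both belong to the group. This is a hedged variant rather than a fix for the scaling issue: it still enumerates every combination with combn(), so it is only meant for small inputs, and the function name obtain_user_group_base is made up for the example.

```r
obtain_user_group_base <- function(df, n, r) {
  users <- unique(c(df$user1, df$user2))
  if (length(users) < n) return(NULL)

  # Every group of n users, as a list of character vectors
  groups <- combn(users, n, simplify = FALSE)

  for (grp in groups) {
    # Keep rows where both endpoints are members of the current group
    S <- df[df$user1 %in% grp & df$user2 %in% grp, , drop = FALSE]
    if (nrow(S) >= r) return(S)
  }
  NULL
}

df <- data.frame(
  user1 = c("a", "a", "a", "b", "c", "c", "c", "d", "d", "d"),
  user2 = c("b", "c", "d", "a", "b", "d", "e", "a", "e", "c"),
  views = c(1, 1, 3, 1, 1, 2, 4, 5, 1, 2),
  stringsAsFactors = FALSE
)

res_8 <- obtain_user_group_base(df, n = 4, r = 8)
res_9 <- obtain_user_group_base(df, n = 4, r = 9)
```

On the small example, res_8 holds the 8 rows among users a, b, c and d, and res_9 is NULL, matching the expected outputs described earlier.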
Problem
My user dataframe is quite large. Trying to reduce a dataframe including thousands of users (m > 7000) to a much smaller one of just a hundred or a few hundred users (n << m) is troublesome. Iterating over all possible combinations is not feasible.
Is it possible to check all combinations in a different (much faster) way than the for-loop?
Is there any other way to approach this problem, different than checking all possible combinations?