What I want
I have a dataframe whose first two columns are combinations of m string values. Those values are user ids, and the dataframe represents relationships between them. Every row contains a user pair (user1, user2) and the number of times user1 has visited user2's profile. There are m different users, but there is not a row for every possible relationship, only for users who have actually visited the other's profile. Mutual pairs can have different values, or one of them may not exist at all (if user1 has visited user2's profile but not the other way round).
I want to create reduced versions of my dataframe that are easier to work with. As a result, I want to extract groups of n users (n < m) with at least r relationships between them. I need to develop a function that checks whether there is a group of n users (n < m) that forms at least r rows (relationships). If so, the function returns a dataframe made up of those rows.
That is to say, the function should return a dataframe (let's name it S) that meets the following requirements (a quick check for these is sketched just below the list):
S is a subset of the original dataframe
NROW(S) >= r
NROW(unique(append(S[[1]], S[[2]]))) == n
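For concreteness, the last two checks could be wrapped in a small helper like the sketch below (the name isValidGroup is just illustrative; whether S is a subset of the original dataframe is guaranteed by construction when S is obtained by filtering df):
isValidGroup <- function(S, n, r) {
  ## at least r relationships and exactly n distinct users in the first two columns
  NROW(S) >= r && NROW(unique(append(S[[1]], S[[2]]))) == n
}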
Small example
This is a minimal example, where user ids are just one letter. There are 5 different users (a, b, c, d, e) and 10 relationships between them:
>df
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 c e 4
8 d a 5
9 d e 1
10 d c 2
For r = 8 and n = 4, I try to extract a group of 4 users with at least 8 relationships between them. The output should be a dataframe of 8 rows that contains only 4 different user values in the first two columns (if such a group exists). There are 5 different 4-user combinations with the values above [(a,b,c,d), (a,b,c,e), (a,b,d,e), (a,c,d,e), (b,c,d,e)]. However, only one of them, (a,b,c,d), includes 8 relationships between its members. The output should be:
[1] "a" "b" "c" "d"
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 d a 5
8 d c 2
For r = 6 and n = 4, the aim is to find a group of 4 users with at least 6 relationships between them. In this case there are two 4-user combinations that meet the requirement of forming at least 6 rows [(a,b,c,d), (a,c,d,e)]. The output could be either of these:
[1] "a" "b" "c" "d"
user1 user2 views
1 a b 1
2 a c 1
3 a d 3
4 b a 1
5 c b 1
6 c d 2
7 d a 5
8 d c 2
[1] "a" "c" "d" "e"
user1 user2 views
1 a c 1
2 a d 3
3 c d 2
4 c e 4
5 d a 5
6 d c 2
7 d e 1
For r > 8, no 4-user combination is valid: there is no group of 4 users with more than 8 relationships between them. In that case, the output should be NULL.
Solution (if the dataframe were small)
A solution for a small dataframe (one that works for this minimal example) could be to generate all possible combinations of n users and iterate over them, checking whether any of them meets the requirement of forming at least r relationships in df. The code could be something like this:
library(dplyr)
library(trotter)
library(gtools)   # for permutations()

obtainUserGroup <- function(df, n, r){
  ## Get the unique _users_ in the dataframe
  us <- unique(append(df$user1, df$user2))
  ## All possible combinations of users in _us_ in groups of _n_ users
  users <- cpv(n, us)
  ## Iterate over every possible group of _n_ users
  for(i in 1:NROW(users)){
    l <- as.character(users[i])
    ## All ordered pairs that can be formed within this group
    l <- as.data.frame(permutations(n, 2, l), stringsAsFactors = FALSE)
    names(l) <- c("user1", "user2")
    ## Check which pairs in l are present in df
    dc <- semi_join(df, l, by = c("user1", "user2"))
    if(NROW(dc) >= r) return(dc)
  }
  return(NULL)
}
df <- data.frame(user1 = c("a","a","a","b","c","c","c","d","d","d"),
                 user2 = c("b","c","d","a","b","d","e","a","e","c"),
                 views = c(1,1,3,1,1,2,4,5,1,2),
                 stringsAsFactors = FALSE)
n <- 4
r <- 8
obtainUserGroup(df,n,r)
Problem
My user dataframe is quite large. Trying to reduce a dataframe containing thousands of users (m > 7000) to a much smaller one of just a hundred or a few hundred users (n << m) is troublesome: iterating over the huge number of possible combinations is not feasible.
Is it possible to check all combinations in a different (much faster) way than the for-loop?
Is there any other way to approach this problem, different than checking all possible combinations?
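One direction that might scale better, sketched here purely for illustration and with no guarantee of finding a valid group even when one exists: treat the rows as edges of a graph and greedily prune the user that appears in the fewest rows until only n users remain, then test whether the surviving rows still number at least r. The function name pruneUserGroup is made up for this sketch.
pruneUserGroup <- function(df, n, r) {
  repeat {
    ## how many rows each remaining user appears in
    counts <- table(c(df$user1, df$user2))
    if (length(counts) <= n) break
    ## drop the least-connected user and all of its rows
    weakest <- names(which.min(counts))
    df <- df[df$user1 != weakest & df$user2 != weakest, ]
  }
  if (length(unique(c(df$user1, df$user2))) == n && NROW(df) >= r) df else NULL
}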
Related
So, I'm trying to read an Excel file. What happens is that some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip lines 1, 5, 6, 7, 8 and so on.
There is probably a more elegant way of doing it, but a possible solution is to count, for each row, the number of elements that are not NA and keep only the rows where that count equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
                 B = c(sample(1:10, 5), NA),
                 C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count, for each row, the number of non-NA elements:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
And then keep only the rows where the number of elements equals the number of columns (which corresponds to complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
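As an aside, base R's complete.cases gives a more concise equivalent of the same filter:
df1 <- df[complete.cases(df), ]  # keep only rows with no NA in any column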
Does this answer your question?
I have a dataframe of the form shown below. The cases have been pre-clustered into subgroups of varying populations, including singletons. I am trying to write some code that will sample (without replacement) any specified number of rows from the dataframe, but spread as evenly as possible across clusters.
> testdata
Cluster Name
1 1 A
2 1 B
3 1 C
4 2 D
5 3 E
6 3 F
7 3 G
8 3 H
9 4 I
10 5 J
11 5 K
12 5 L
13 5 M
14 5 N
15 6 O
16 7 P
17 7 Q
For example, if I ask for a sample of 3 rows, I would like to pull one random row from each of 3 random clusters (i.e. not the first rows of clusters 1-3 every time, though that is one valid outcome).
Acceptable examples:
> testdata_subset
Cluster Name
1 1 A
5 3 E
12 5 L
> testdata_subset
Cluster Name
6 3 F
14 5 N
15 6 O
Incorrect example:
> testdata_subset
Cluster Name
6 3 F
8 3 H
13 5 M
The same idea applies up to a sample size of 7 in the example data shown (1 per cluster). For higher sample sizes, I would like to draw from each cluster evenly as far as possible, then evenly across the remaining clusters with unsampled rows, and so on, until the specified number of rows has been sampled.
I know how to sample N rows indiscriminately:
testdata[sample(nrow(testdata), N),]
But this pays no regard to the clusters. I also used plyr to randomly sample N rows per cluster:
ddply(testdata,"Cluster", function(z) z[sample(nrow(z), N),])
But this fails as soon as you ask for more rows than there are in a cluster (i.e. if N > 1). I then added an if/else statement to begin to handle that:
numsamp_per_cluster <- 2
ddply(testdata,"Cluster", function(z) if (numsamp_per_cluster > nrow(z)){z[sample(nrow(z), nrow(z)),]} else {z[sample(nrow(z), numsamp_per_cluster),]})
This effectively caps the requested sample size at the size of each cluster. But in doing so, it loses control of the overall sample size. I am hoping (but starting to doubt) that there is an elegant method using dplyr or a similar package that can do this kind of semi-randomised sampling. Either way, I am struggling to tie these elements together and solve the problem.
The strategy: first, you randomly assign an order inside each cluster. This value is stored in the inside variable below. Next, you randomly select the order of the first choices of each cluster, and so on (the outside variable). Finally, you order your dataframe so that the first choices of each cluster come first, then the second choices, and so on, breaking ties with the outside variable. Something like this:
set.seed(1)
inside <- ave(seq_along(testdata$Cluster), testdata$Cluster,
              FUN = function(x) sample(length(x)))
outside <- ave(inside, inside, FUN = function(x) sample(seq_along(x)))
testdata[order(inside, outside), ]
# Cluster Name
#10 5 J
#15 6 O
#4 2 D
#5 3 E
#9 4 I
#16 7 P
#1 1 A
#13 5 M
#3 1 C
#17 7 Q
#7 3 G
#6 3 F
#14 5 N
#2 1 B
#12 5 L
#8 3 H
#11 5 K
Now, by selecting the first n rows of the resulting data.frame, you get the sample you are looking for.
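For instance, with the ordering computed above, a sample of 3 rows spread across 3 different clusters would be:
head(testdata[order(inside, outside), ], 3)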
Base R option: you can randomly sample from the unique cluster values, and then use those to randomly sample names. Not very elegant, but it can be wrapped in a function. n is the number of samples you want to draw (one row from each of n sampled clusters).
sampler <- function(df, n){
  ## pick n distinct clusters at random
  s <- sample(unique(df[, 1]), n)
  ## for each sampled cluster, pick one of its names at random
  nm <- sapply(s, function(x) sample(df[which(df[, 1] == x), 2], 1, replace = FALSE))
  data.frame(cluster = s, name = nm)
}
> sampler(testdata,6)
cluster name
1 4 I
2 2 D
3 6 O
4 1 A
5 7 Q
6 5 K
Here is a function that will do the sampling for you. First, I collect the row positions of each unique cluster and shuffle the positions within each cluster. Then I shuffle the order of the clusters themselves and interleave them into a long vector (the first choice of every cluster, then the second choices, and so on), dropping the gaps, and finally take as many elements as I want.
sample_df <- function(df, iter){
  ## row positions belonging to each cluster
  l <- unique(df$Cluster)
  cluster_pos <- lapply(l, function(x) which(df$Cluster == x))
  ## shuffle the positions within each cluster
  random_cluster_pos <- lapply(cluster_pos,
                               function(x) if (length(x) > 1) sample(x) else x)
  ## alternative: visit the largest clusters first
  ## index <- random_cluster_pos[rev(order(sapply(random_cluster_pos, length)))]
  ## shuffle the order of the clusters themselves
  index <- sample(random_cluster_pos)
  ## interleave: first choices of every cluster, then second choices, and so on
  index_pos <- c(t(sapply(index, "[", 1:max(lengths(index)))))
  index_pos <- index_pos[!is.na(index_pos)]
  return(df[index_pos[1:iter], ])
}
sample_df(testdata, 3)
This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,3,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a repeated value in vector e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 3
5 5 5 5 5 5
Rows 3 and 4 are exactly the same up to column d (the combination 4,3,4,4), so only one instance of them should be returned, but they have 2 repeated values in column e. I would want a count of those: the combination 4,3,4,4 has 2 repeated values in column e.
The expected output would be how many times a certain combination such as 4,3,4,4 had repeated values in column e. So in this case it would be something like:
a b c d n
4 3 4 4 2
Both R and SQL work, whatever does the job.
Again, see my comments above, but I believe the following gives you a start on your first question. First, create a "key" variable (in this case named key_abcd, built with tidyr::unite by uniting columns a, b, c, and d). Then count e by this key_abcd variable; the group_by is implicit.
library(tidyr)
library(dplyr)

df <- data.frame(a, b, c, d, e)

df %>%
  unite(key_abcd, a, b, c, d) %>%
  count(key_abcd, e)
# key_abcd e n
# (chr) (dbl) (int)
# 1 1_1_1_2 1 1
# 2 1_2_4_2 5 1
# 3 4_3_4_4 3 2
# 4 5_5_5_5 5 1
It appears from how you've worded the question that you are only interested in "more than one" combinations; therefore, you could add %>% filter(n > 1) to the above code.
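That is, the full pipeline would read:
df %>%
  unite(key_abcd, a, b, c, d) %>%
  count(key_abcd, e) %>%
  filter(n > 1)
#   key_abcd     e     n
# 1  4_3_4_4     3     2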
How can one merge two data frames, one column-wise and the other one row-wise? For example, I have two data frames like this:
A: add1 add2 add3 add4
1 k NA NA NA
2 l k NA NA
3 j NA NA NA
4 j l NA NA
B: age size name
1 5 6 x
2 8 2 y
3 1 3 x
4 5 4 z
I want to merge the two data.frames by row name. However, I want to merge data.frame A column-wise instead of row-wise. So, I'm looking for a result data.frame like this:
C:id age size name add
1 5 6 x k
2 8 2 y l
2 8 2 y k
3 1 3 x j
4 5 4 z j
4 5 4 z l
For example, suppose you have information about people in table B, including name, size, etc. This information consists of unique values, so you have one row per person in B. Then suppose that in table A you have up to 5 past addresses per person. The first column is the most recent address; the second is the second most recent address; etc. Now, if someone has fewer than 5 addresses (e.g. 3), you have NA in the 4th and 5th address columns for that person.
What I want to achieve is one data frame (C) that includes all of this information together. So, for a person with two addresses, I'll need two rows in table C, repeating the unique values and differing only in the address column.
I was thinking of repeating the rows of data frame A by the number of non-NA values while keeping the row names the same as they were (like data frame D below) and then merging the new data frame with B. But I'm not sure how to do this.
D: address
1 k
2 l
2 k
3 j
4 j
4 l
Thank you!
Change the first data.frame to long format; then it's easy. Below, df1 is A and df2 is B. I also store the row numbers in a column named id.
require(tidyr)

# wide to long (your example D)
df1tidy <- gather(df1, addname, addval, -id)

# don't need the original add* variable names or the NA's
df1tidy$addname <- NULL
df1tidy <- df1tidy[!is.na(df1tidy$addval), ]

# merge them into the second data.frame
merge(df2, df1tidy, by = 'id', all.x = TRUE)
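For reference, a minimal setup matching the example data above might look like this (an assumption for illustration: the row labels of A and B are stored in an explicit id column, as described, and the empty address columns are typed as character so that gather keeps a single type):
df1 <- data.frame(id   = 1:4,
                  add1 = c("k", "l", "j", "j"),
                  add2 = c(NA, "k", NA, "l"),
                  add3 = NA_character_,
                  add4 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id   = 1:4,
                  age  = c(5, 8, 1, 5),
                  size = c(6, 2, 3, 4),
                  name = c("x", "y", "x", "z"),
                  stringsAsFactors = FALSE)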
I have drawn heat maps from a microarray expression data set, and in the heatmaps I see duplicates and triplicates for many of the genes I am interested in.
I am very new to R. Is there a way to remove these duplicates or triplicates of genes?
For example, I see the name of one gene, say BMP1, 2 or 3 times in the heatmap.
Kindly suggest some solutions.
Regards
Ram
I'll try to guess at an answer, but it would be better if you gave an example of your problem:
> tmp <- data.frame("numbers" = 1:3, "letters" = letters[1:3])
> tmp
numbers letters
1 1 a
2 2 b
3 3 c
> tmp <- rbind(tmp,tmp)
> tmp
numbers letters
1 1 a
2 2 b
3 3 c
4 1 a
5 2 b
6 3 c
> unique(tmp)
numbers letters
1 1 a
2 2 b
3 3 c
From the base R help:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
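Note that unique() only removes rows that are identical across all columns. If the repeated gene names come with different expression values, you might instead keep only the first row per gene, for example (here the letters column of tmp stands in for the gene-name column):
tmp[!duplicated(tmp$letters), ]  # keep the first occurrence of each gene name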