I have two lists: a "master list" without repeats, and a second list that is a subset of it with possible repeats. I would like to count how many times each element of the master list occurs in the second list.
So if I have these lists:
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
I'd like to determine how many times each element of list a appears in list b. My ideal output would be an R table that looks like:
a b c d e f g
2 0 3 2 0 2 1
I've been trying to think through it with %in% and table().
You can use table and match, but first convert the vectors to factors so that levels with no occurrences are still included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
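As a minimal alternative sketch, assuming a and b are plain character vectors: building a factor from b with the master list as its levels lets table() report the zero counts directly, without match. The name counts is only illustrative.

```r
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")

# Use the master list as the factor levels so that values
# absent from b still appear in the table with a count of 0
counts <- table(factor(b, levels = a))
counts
```

The order of the result follows the order of a, so reordering the master list reorders the table.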
If for some reason you want a tidyverse solution, this method preserves the original data type of the lists:
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1
The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, it gives NA by default.
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
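The same lookup can also be sketched with a named vector, assuming the letters in df are unique. Indexing a named vector by a name it does not contain yields NA, which gives the behaviour asked for. The names lookup and res are hypothetical, chosen for illustration.

```r
df <- data.frame(
  letter = c("a", "b", "c", "d", "e", "f"),
  score = seq(1, 6),
  stringsAsFactors = FALSE
)
v1 <- c("c", "o", "f", "f", "e", "e")

# Named vector: names are the letters, values are the scores
lookup <- setNames(df$score, df$letter)

# Index by name; "o" is not a name in lookup, so it yields NA
res <- unname(lookup[v1])
res
```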
If you want to use dplyr I would use a join:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
I would like to sort a data frame on one of its columns, based on a vector which contains all possible elements of the column, but without duplicates. For example a table like this:
A a
B b
C b
D b
E a
F a
G c
H b
And a vector like this: c("b", "c", "a")
So that sorting the table on column 2 based on this vector would produce this table:
B b
C b
D b
H b
G c
A a
E a
F a
We can use match with order
df1[order(match(df1$v2, vec1)),]
# v1 v2
#2 B b
#3 C b
#4 D b
#8 H b
#7 G c
#1 A a
#5 E a
#6 F a
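To see why this works, here is a small sketch that makes the intermediate step visible: match turns each value of the column into its position in the ordering vector, and order then sorts the rows by those positions. The data frame is rebuilt inline so the snippet runs on its own; pos and sorted are illustrative names.

```r
vec1 <- c("b", "c", "a")
df1 <- data.frame(
  v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
  v2 = c("a", "b", "b", "b", "a", "a", "c", "b"),
  stringsAsFactors = FALSE
)

# Position of each v2 value in vec1: "b" -> 1, "c" -> 2, "a" -> 3
pos <- match(df1$v2, vec1)
pos
# [1] 3 1 1 1 3 3 2 1

# Ties keep their original relative order, so B, C, D, H stay in sequence
sorted <- df1[order(pos), ]
sorted
```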
data
vec1 <- c("b", "c", "a")
df1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
v2 = c("a", "b", "b", "b", "a", "a", "c", "b")), .Names = c("v1",
"v2"), class = "data.frame", row.names = c(NA, -8L))
I have a seemingly simple problem but have not yet been able to find a suitably fast, resource-efficient solution. I am working in R.
My data is of format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is, in the input A occurs in col1 alongside q, r, t, y in col2. Of these, only t also occurs with B, so the A-B combination has count 1.
B occurs in col1 alongside e, w, t in col2. Of these, e and w also occur with C, so the B-C combination has count 2.
... and so on for all combinations in col1.
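To make the counting rule concrete, here is a small sketch that checks individual pairs with intersect. It is only meant to pin down the definition, not to be fast; shared_count is a hypothetical helper, and the data frame d is rebuilt from the input above.

```r
d <- data.frame(
  col1 = c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
  col2 = c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct col2 values shared by two col1 units
shared_count <- function(u1, u2) {
  length(intersect(d$col2[d$col1 == u1], d$col2[d$col1 == u2]))
}

shared_count("A", "B")  # t only
shared_count("A", "C")  # q, r, y
shared_count("B", "C")  # e, w
```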
I have done it with a for loop, but it is slow: I pick the unique elements of col1, iterate over all the data for each of them, and then combine the results using rbind. This is slow and resource-intensive.
I am looking for an efficient method. Maybe a library or function exists that I am unaware of. I tried using a co-occurrence matrix, but col1 has on the order of 10,000 distinct elements, so that does not serve my purpose.
Any help is greatly appreciated.
Thanks!
Use merge to join the data frame with itself, drop the rows that pair a unit with itself, and then use aggregate to count within groups. Demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
Here is a similar approach (as shown by @cogitovita), but using data.table. Convert the data.frame to a data.table using setDT, then cross join (CJ) the unique elements of col1, grouped by col2. Keep the rows where the two output columns differ (V1 != V2), get the count (.N) grouped by the new columns (.(V1, V2)), and finally order the rows (order(V1, V2)).
library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
data
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
Here is my small dataset.
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Indvidual, Parent1, Parent2)
Indvidual Parent1 Parent2
1 A <NA> <NA>
2 B <NA> <NA>
3 C A B
4 D A C
5 E C D
6 F C D
7 G C D
8 H E <NA>
9 I A D
10 J <NA> <NA>
Consider only individuals with one or two known parents. I need to derive each individual's score from the scores of its parents.
The rule is that an individual with at least one known parent (a non-NA name in the Parent1 or Parent2 column) gets 1 additional point on top of its parents' score. If both parents are known, the higher of the two parent scores is used.
Here is an example:
Individual "A" has both parents unknown, so it gets a score of 0.
Individual "C" has both parents known (i.e. A, B);
it gets a score of 0 (the maximum of its parents' scores)
plus 1 (as at least one of its parents is known).
Thus expected output from above dataframe (with explanation) is:
Indvidual Parent1 Parent2 Scores Explanation
1 A <NA> <NA> 0 0 (Max of parent scores NA) + 0 (neither parent known)
2 B <NA> <NA> 0 0 (Max of parent scores NA) + 0 (neither parent known)
3 C A B 1 0 (Max of parent scores) + 1 (either parent known)
4 D A C 2 1 (Max of parent scores) + 1 (either parent known)
5 E C D 3 2 (Max of parent scores) + 1 (either parent known)
6 F C D 3 2 (Max of parent scores) + 1 (either parent known)
7 G C D 3 2 (Max of parent scores) + 1 (either parent known)
8 H E <NA> 4 3 (Max of parent scores) + 1 (either parent known)
9 I A D 3 2 (Max of parent scores) + 1 (either parent known)
10 J <NA> <NA> 0 0 (Max of parent scores NA) + 0 (neither parent known)
Explanation: As the loop goes on, it takes into account the Scores already calculated.
Max of parent scores
Edits: based on chase's question
For example:
Individual C has two parents, A and B, whose Scores (rows 1 and 2 of the Scores column)
are 0 and 0, so max(c(0,0)) will be 0.
Individual E has parents C and D, whose Scores (rows 3 and 4 of the Scores column)
are 1 and 2, respectively. So max(c(1,2)) will be 2.
Example using plyr and a recursive function:
library(plyr)
Indvidual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Indvidual, Parent1, Parent2)
scor.fun<-function(x,mydf){
Explanation<-0
P1<-as.character(x$Parent1)
P2<-as.character(x$Parent2)
score<-as.numeric(!(is.na(P1)&&is.na(P2))) # 1 if at least one parent is known, else 0
if(!(is.na(P1)||is.na(P2))){
Explanation<-max(scor.fun(subset(mydf,Indvidual==P1),mydf)[1],scor.fun(subset(mydf,Indvidual==P2),mydf)[1])
score<-score+Explanation
}else{
Explanation<-ifelse(is.na(P1),0,scor.fun(subset(mydf,Indvidual==P1),mydf)[1])
Explanation<-max(Explanation,ifelse(is.na(P2),0,scor.fun(subset(mydf,Indvidual==P2),mydf)[1]))
score<-score+Explanation
}
c(score,Explanation)
}
adply(mydf,1,scor.fun,mydf)
Probably not the best idea with the recursion on a big dataframe.
Individual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)
mydf <- data.frame (Individual, Parent1, Parent2, stringsAsFactors = FALSE)
mydf$Scores <- NA
mydf$Scores[rowSums(is.na(mydf[, c("Parent1", "Parent2")])) == 2] <- 0
while(any(is.na(mydf$Scores))){
KnownScores <- mydf[!is.na(mydf$Scores), c(1, 4)]
ToCalculate <- mydf[
mydf$Parent1 %in% c(KnownScores$Individual, NA) &
mydf$Parent2 %in% c(KnownScores$Individual, NA) &
is.na(mydf$Scores),
-4]
ToCalculate$Score <- apply(
merge(
merge(
ToCalculate,
KnownScores,
by.x = "Parent1",
by.y = "Individual",
all.x = TRUE
),
KnownScores,
by.x = "Parent2",
by.y = "Individual",
all.x = TRUE
)[, 4:5],
1,
max,
na.rm = TRUE) + 1
mydf <- merge(mydf, ToCalculate[, c(1, 4)], all.x = TRUE)
mydf$Scores[!is.na(mydf$Score)] <- mydf$Score[!is.na(mydf$Score)]
mydf$Score <- NULL
}
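For comparison, here is a minimal single-pass sketch of the same rule (0 with no known parent, otherwise 1 plus the highest parent score). It relies on an assumption that holds for this example but is not guaranteed in general: every parent appears in an earlier row than its children. The name scores is illustrative.

```r
Individual <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Parent1 <- c(NA, NA, "A", "A", "C", "C", "C", "E", "A", NA)
Parent2 <- c(NA, NA, "B", "C", "D", "D", "D", NA, "D", NA)

# Named vector so parent scores can be looked up by name
scores <- setNames(rep(NA_real_, length(Individual)), Individual)

# ASSUMPTION: parents are listed before their children,
# so their scores are already filled in when a child is reached
for (i in seq_along(Individual)) {
  parents <- c(Parent1[i], Parent2[i])
  parents <- parents[!is.na(parents)]  # keep only known parents
  scores[i] <- if (length(parents) == 0) 0 else 1 + max(scores[parents])
}

data.frame(Individual, Parent1, Parent2, Scores = unname(scores))
```

If the rows are not ordered this way, a parent's score would still be NA when its child is processed, so the data would need to be sorted first or the pass repeated until no NA remains.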