I have a seemingly simple problem but have not yet been able to find a suitably fast and resource-efficient solution. This is a problem in R.
My data is of format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is, in the input, A occurs in col1 with q, r, t, y in col2. Of these, only t also occurs with B, so the A-B combination has count 1.
B occurs in col1 with e, w, t in col2. Of these, e and w also occur with C, so the B-C combination has count 2.
.... and so on for all combinations in col1.
I have done this using a for loop, but it is slow: I pick the unique elements of col1, iterate over all the data for each element, and then combine the results using rbind. This is slow and resource-costly.
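Here is a minimal sketch of the kind of loop I mean (assuming the input above is in a data frame d; the names are illustrative, not my actual code):
units <- unique(d$col1)
res <- NULL
for (u in units) {
  u_vals <- d$col2[d$col1 == u]                 # col2 values seen with u
  for (v in setdiff(units, u)) {
    v_vals <- d$col2[d$col1 == v]               # col2 values seen with v
    n <- length(intersect(u_vals, v_vals))      # shared col2 values
    if (n > 0)
      res <- rbind(res, data.frame(unit1 = u, unit2 = v, same_col2_freq = n))
  }
}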
I am looking for an efficient method; maybe a library or function exists that I am unaware of. I tried a co-occurrence matrix, but the number of elements in col1 is on the order of ~10,000 and it does not serve my purpose.
Any help is greatly appreciated.
Thanks!
Use merge to join the data frame with itself, then use aggregate to count within groups. Demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"),
               col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")                         # self-join on col2
dm = dm[dm[,'col1.x'] != dm[,'col1.y'],]            # drop pairs of an element with itself
aggregate(col2 ~ col1.x + col1.y, data=dm, length)  # count shared col2 values per pair
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
Here is a similar approach (as shown by @cogitovita), but using data.table. Convert the data.frame to a data.table with setDT, then cross join (CJ) the unique elements of col1, grouped by col2. Subset the rows where the two generated columns differ (V1 != V2), get the count (.N) grouped by those columns (.(V1, V2)), and finally order the result (order(V1, V2)):
library(data.table)
setDT(df)[, CJ(unique(col1), unique(col1)), col2][V1 != V2,
    .N, .(V1, V2)][order(V1, V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
data
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
I have a dataset that looks like this:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
                   Name2 = c("B", "C", "E", "G", "I"))
I would like to add an ID column to help me trace groups of names, i.e. who references whom. So with the example data, the groups would be:
Name1 Name2 GroupID
A B 1
B C 1
D E 2
E G 2
H I 3
Please note that my original data is not ordered as this example is. Thanks in advance for any help!
You can use the igraph package to make a network from your data set and determine clusters:
data <- data.frame(Name1 = c("A", "B", "D", "E", "H"),
                   Name2 = c("B", "C", "E", "G", "I"))
library(igraph)
graph <- graph_from_data_frame(data, directed = FALSE)
clusters <- components(graph)
#data$GroupId <- sapply(data$Name1, function(x) clusters$membership[which(names(clusters$membership) == x)])
# Simpler version
data$GroupId <- clusters$membership[data$Name1]
That gives:
> data
Name1 Name2 GroupId
1 A B 1
2 B C 1
3 D E 2
4 E G 2
5 H I 3
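For clarity, clusters$membership is a named vector mapping each vertex name to its component id, which is why indexing it with Name1 works:
clusters$membership
# A B D E H C G I
# 1 1 2 2 3 1 2 3
(Vertices are listed in order of first appearance in the data frame.)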
I have a data frame like this.
df
Languages Order Machine Company
[1] W,X,Y,Z,H,I D D B
[2] W,X B A G
[3] W,I E B A
[4] H,I B C B
[5] W G G C
I want to get the number of rows where Languages contains at least 2 of the 3 values W, H, I.
The result should be 3, because row 1, row 3, and row 4 each contain at least 2 of the 3 values W, H, I.
You can use strsplit on df$Languages and take the intersect with W, H, I. Then get the lengths of the result and count how many are greater than 1:
sum(lengths(sapply(strsplit(df$Languages, ",", TRUE), intersect, c("W","H","I"))) > 1)
#[1] 3
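Broken into steps, that one-liner does the following:
s <- strsplit(df$Languages, ",", fixed = TRUE)  # split each row into a character vector
i <- sapply(s, intersect, c("W","H","I"))       # per row, the languages shared with W,H,I
lengths(i)
# [1] 3 1 2 2 1
sum(lengths(i) > 1)                             # rows sharing at least 2
# [1] 3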
You can use:
sum(sapply(strsplit(df$Languages, ','), function(x)
       sum(c("W","H","I") %in% x) >= 2))
#[1] 3
data
df<- structure(list(Languages = c("W,X,Y,Z,H,I", "W,X", "W,I", "H,I",
"W"), Order = c("D", "B", "E", "B", "G"), Machine = c("D", "A",
"B", "C", "G"), Company = c("B", "G", "A", "B", "C")),
class = "data.frame", row.names = c(NA, -5L))
A tidyverse approach:
library(tidyverse)
df %>% filter(map_int(str_split(Languages, ','), ~ sum(.x %in% c('W', 'H', 'I'))) >= 2)
Languages Order Machine Company
1 W,X,Y,Z,H,I D D B
2 W,I E B A
3 H,I B C B
df <- data.frame(X = c("a", "b", "c", "a", "b", "c", "a", "b", "c", "d", "a", "b", "c", "d", "e"),
                 Y = c("w", "w", "w", "K", "K", "K", "L", "L", "L", "L", "Z", "Z", "Z", "Z", "Z"))
Note that the first column has 5 levels and the second has 4 levels. My goal is to select the rows of df whose X value occurs with every level of Y. That is, I want to select the rows with levels "a", "b", and "c", since "d" appears with only two levels of Y and "e" appears with only one.
I tried to make a vector of the common levels and keep only those rows by subsetting. However, it doesn't work, because this vector of levels doesn't give the positions of the rows I want to keep. Ex:
common <- c("a", "b", "c")
df2 <- df[c(common),]
In my real df there are 64 common levels, so doing it by hand is not feasible. Can someone help me?
I think this is what you want. Essentially, split X by Y, then find the values that appear in every resulting set:
df[df$X %in% Reduce(intersect, split(df$X, df$Y)),]
# X Y
#1 a w
#2 b w
#3 c w
#4 a K
#5 b K
#6 c K
#7 a L
#8 b L
#9 c L
#11 a Z
#12 b Z
#13 c Z
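To see the intermediate steps of the one-liner:
split(df$X, df$Y)                     # one vector of X values per level of Y
Reduce(intersect, split(df$X, df$Y))  # the values present in every one of those vectors
# [1] "a" "b" "c"
Rows are then kept when their X value is in that common set.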
Another way could be to group_by X and keep the groups that contain all the distinct values of Y (inside the pipe, .$Y refers to the whole ungrouped column):
library(dplyr)
df %>%
  group_by(X) %>%
  filter(n_distinct(Y) == n_distinct(.$Y))
# X Y
# <fct> <fct>
# 1 a w
# 2 b w
# 3 c w
# 4 a K
# 5 b K
# 6 c K
# 7 a L
# 8 b L
# 9 c L
#10 a Z
#11 b Z
#12 c Z
In base R, that would be using ave:
subset(df, as.logical(ave(as.character(Y), X,
       FUN = function(x) length(unique(x)) == length(unique(Y)))))
Using data.table
library(data.table)
setDT(df)[, .SD[uniqueN(Y) == uniqueN(df$Y)], by = X]
So, if I have two lists, one being a "master list" without repeats and the other being a subset with possible repeats, I would like to be able to check how many times each element of the master list appears in the subset list.
So if I have these lists:
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
I'd like to determine how many times each element from list a appears in list b. My ideal output would be an R table, c, that looks like:
a b c d e f g
2 0 3 2 0 2 1
I've been trying to think it through with %in% and table().
You can use table and match - but first make the vectors factors so levels not present are included in the output:
a <- factor(c("a", "b", "c", "d", "e", "f", "g"))
b <- factor(c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c"))
table(a[match(b, a)])
a b c d e f g
2 0 3 2 0 2 1
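An equivalent, slightly more direct spelling is to re-level b with a's levels and tabulate:
table(factor(b, levels = levels(a)))
# a b c d e f g
# 2 0 3 2 0 2 1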
If for some reason you want a tidyverse solution, this method preserves the original data type of the vectors.
library(tidyverse)
a <- c("a", "b", "c", "d", "e", "f", "g")
b <- c("a", "d", "c", "d", "a", "f", "f", "g", "c", "c")
tibble(letters = a, count = unlist(map(a, function(x) sum(b %in% x))))
# A tibble: 7 x 2
letters count
<chr> <int>
1 a 2
2 b 0
3 c 3
4 d 2
5 e 0
6 f 2
7 g 1
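A small simplification, if you prefer: purrr's map_int collapses the unlist(map(...)) step (same output, tidyverse assumed loaded):
tibble(letters = a, count = map_int(a, ~ sum(b %in% .x)))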
I would like to sort a data frame on one of its columns, based on a vector which contains all possible elements of the column, but without duplicates. For example a table like this:
A a
B b
C b
D b
E a
F a
G c
H b
And a vector like this: c("b", "c", "a")
So that sorting the table on column 2 based on this vector would produce this table:
B b
C b
D b
H b
G c
A a
E a
F a
We can use match with order:
df1[order(match(df1$v2, vec1)),]
# v1 v2
#2 B b
#3 C b
#4 D b
#8 H b
#7 G c
#1 A a
#5 E a
#6 F a
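Here match returns, for each value of v2, its position in vec1, and order then sorts the rows by that position:
match(df1$v2, vec1)
# [1] 3 1 1 1 3 3 2 1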
data
vec1 <- c("b", "c", "a")
df1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
v2 = c("a", "b", "b", "b", "a", "a", "c", "b")), .Names = c("v1",
"v2"), class = "data.frame", row.names = c(NA, -8L))