I have a very large dataset with three columns of interest: id, house, and people. Each id can have multiple houses, and each house can have multiple people. I want to create an edge list using what @David Arenburg shared in "Creating edge list with additional variables in R".
However, the edges come out as both 'a;b' and 'b;a'. A large set of a and b values could produce thousands of a;b and b;a combinations.
I would like each pair to appear only once, because I want to count how many times people share a house.
Given the dataset
id=c(rep("ID1",3), rep("ID2",6), "ID3", rep("ID4",5))
house=c(rep("house1",2), "house2", rep("house3",2), rep("house4",4), "house5", rep("house6",3), "house7", "house8")
people=c("a","b","c","d","e","d","e","d","e","f","g","h","h","h","h")
df1 <- data.frame(id,house, people)
The following code by @David Arenburg gives us the edge list:
library(data.table)
df1 = setDT(df1)[, if(.N > 1) tstrsplit(combn(as.character(people),
                                              2, paste, collapse = ";"), ";"),
                 .(id, house)]
The results
id house V1 V2
1: ID1 house1 a b
2: ID2 house3 d e
3: ID2 house4 d e
4: ID2 house4 d d
5: ID2 house4 d e
6: ID2 house4 e d
7: ID2 house4 e e
8: ID2 house4 d e
9: ID4 house6 g h
10: ID4 house6 g h
11: ID4 house6 h h
As you can see, for house4 the V1 and V2 columns contain both 'd;e' and 'e;d', which I would like to avoid. With a large amount of data there could be thousands of such combinations.
Thanks for your help
I'm sure there's a more concise base R way, but here's one dplyr approach, where we sort the two values to make it easier to eliminate repeats.
library(dplyr)
# df1 is the edge list produced above (columns id, house, V1, V2)
df1 %>%
  mutate(V1s = if_else(V1 < V2, V1, V2),
         V2s = if_else(V1 < V2, V2, V1)) %>%
  distinct(id, house, V1s, V2s)
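Since the stated goal is to count how often two people share a house, here is a minimal base R sketch of the same sort-then-deduplicate idea followed by the count. The data frame name edges and the helper columns p1 and p2 are my own; the rows are copied from the edge list printed in the question, and I'm assuming that identical pairs (d;d) should be dropped.

```r
# Edge list as printed in the question
edges <- data.frame(
  id    = c("ID1", rep("ID2", 7), rep("ID4", 3)),
  house = c("house1", "house3", rep("house4", 6), rep("house6", 3)),
  V1    = c("a", "d", "d", "d", "d", "e", "e", "d", "g", "g", "h"),
  V2    = c("b", "e", "e", "d", "e", "d", "e", "e", "h", "h", "h"),
  stringsAsFactors = FALSE
)

# Put each pair in a canonical order so that 'e;d' becomes 'd;e'
edges$p1 <- pmin(edges$V1, edges$V2)
edges$p2 <- pmax(edges$V1, edges$V2)

# Drop self-pairs, then keep each pair once per id and house
edges <- edges[edges$p1 != edges$p2, ]
uniq  <- unique(edges[c("id", "house", "p1", "p2")])

# Number of houses each pair shares
share_counts <- aggregate(house ~ p1 + p2, data = uniq, FUN = length)
share_counts
```

Here the house column of share_counts holds the count of shared houses per pair.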
Here's a possibility that builds on the excellent answer @David Arenburg provided.
The overall strategy:
1. Create a new variable with the ordered edge (that is, convert "e -> d" to "d -> e").
2. Get the unique values of each combination of id, house, and the new variable.
3. Drop the new variable.
library(data.table)
# keep Arenburg's solution and chain a couple of additional commands:
setDT(df1)[,
if(.N > 1) tstrsplit(combn(as.character(people),
2, paste, collapse = ";"), ";"),
.(id, house)][,
edge := apply(.SD,
1,
function(x) paste(sort(c(x[1],
x[2])),
collapse = ",")),
.SDcols = c("V1", "V2")][,
.SD[1, ],
by = .(id, house, edge)][
, edge := NULL][]
id house V1 V2
1: ID1 house1 a b
2: ID2 house3 d e
3: ID2 house4 d e
4: ID2 house4 d d
5: ID2 house4 e e
6: ID4 house6 g h
7: ID4 house6 h h
Notice that you could drop the rows in which V1 == V2 too, as those are irrelevant edges. That could be accomplished with [V1 != V2, ] at the end of the previous chain.
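A different way to get there, sketched in base R, is to avoid generating the duplicates at all: de-duplicate and sort the people within each id and house before calling combn(). This is not the chain above but an alternative under the assumption that repeated people within a house carry no extra information; the names chunks, pair_list, and edge_list are mine.

```r
# Data from the question
id     <- c(rep("ID1", 3), rep("ID2", 6), "ID3", rep("ID4", 5))
house  <- c(rep("house1", 2), "house2", rep("house3", 2), rep("house4", 4),
            "house5", rep("house6", 3), "house7", "house8")
people <- c("a", "b", "c", "d", "e", "d", "e", "d", "e", "f", "g", "h", "h", "h", "h")
df1 <- data.frame(id, house, people, stringsAsFactors = FALSE)

# One chunk per (id, house) combination that actually occurs
chunks <- split(df1, list(df1$id, df1$house), drop = TRUE)

pair_list <- lapply(chunks, function(chunk) {
  members <- sort(unique(chunk$people))    # each person once, in sorted order
  if (length(members) < 2) return(NULL)    # a single person has no edges
  pairs <- t(combn(members, 2))            # one row per unordered pair
  data.frame(id = chunk$id[1], house = chunk$house[1],
             V1 = pairs[, 1], V2 = pairs[, 2], stringsAsFactors = FALSE)
})

# Drop the empty chunks and stack the rest
edge_list <- do.call(rbind, Filter(Negate(is.null), pair_list))
rownames(edge_list) <- NULL
edge_list
```

Because the members are sorted and unique before pairing, V1 is always smaller than V2, so neither 'e;d' nor 'd;d' can appear.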
I have a dataset with origin ("from"), destination ("to") and price as below:
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
I want to sum the price, counting the return route as well. For example, A - B consists of the following rows:
from to price
A B 28109
B A 445789
Then, take the sum of the price (28109+445789). The output will be like this:
route total_price
A - B 473898
A - D 48030
A - E 4216
B - D 2102
I was thinking to run a for loop but my data size is very large (800k rows). Any help will be highly appreciated. Thanks a lot in advance.
You can do this by sorting the from-to pairs, then grouping on that sorted pair and summing.
Edit: See @JasonAizkalns' answer for the tidyverse equivalent
library(data.table)
setDT(df)
df[, .(total_price = sum(price))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price
# 1: A - B 473898
# 2: A - D 48030
# 3: A - E 4216
# 4: B - D 2102
@Frank notes that this result hides the fact that route "A - E" is not complete, in the sense that there is no row of the original data with from == 'E' and to == 'A'. He's offered a good way of capturing that info (and more), and I've added some others below.
df[, .(total_price = sum(price), complete = .N > 1)
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price complete
# 1: A - B 473898 TRUE
# 2: A - D 48030 TRUE
# 3: A - E 4216 FALSE
# 4: B - D 2102 TRUE
df[, .(total_price = sum(price), paths_counted = .(paste(from, '-', to)))
, by = .(route = paste(pmin(from, to), '-', pmax(from, to)))]
# route total_price paths_counted
# 1: A - B 473898 A - B,B - A
# 2: A - D 48030 A - D,D - A
# 3: A - E 4216 A - E
# 4: B - D 2102 B - D,D - B
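If data.table is not available, the sort-then-group idea can be sketched in base R as well. This is only an illustration on the sample data; summary_df and n_legs are names I made up, and the complete flag assumes each direction appears at most once per route.

```r
df <- data.frame(
  from  = c("A", "A", "A", "B", "B", "D", "D"),
  to    = c("B", "D", "E", "A", "D", "A", "B"),
  price = c(28109, 2356, 4216, 445789, 123, 45674, 1979),
  stringsAsFactors = FALSE
)

# Same key as above: the smaller endpoint first, the larger second
df$route <- paste(pmin(df$from, df$to), "-", pmax(df$from, df$to))

total_price <- tapply(df$price, df$route, sum)     # sum per undirected route
n_legs      <- tapply(df$price, df$route, length)  # rows contributing to each route

summary_df <- data.frame(
  route       = names(total_price),
  total_price = as.vector(total_price),
  complete    = as.vector(n_legs) > 1,   # TRUE when both directions are present
  stringsAsFactors = FALSE
)
summary_df
```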
Data used
df <- fread('
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979')
You could do a self-join and then things are pretty straightforward:
library(tidyverse)
df <- readr::read_table("
from to price
A B 28109
A D 2356
A E 4216
B A 445789
B D 123
D A 45674
D B 1979
")
df %>%
inner_join(df, by = c("from" = "to")) %>%
filter(to == from.y) %>%
mutate(
route = paste(from, "-", to),
total_price = price.x + price.y
)
#> # A tibble: 6 x 7
#> from to price.x from.y price.y route total_price
#> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
#> 1 A B 28109 B 445789 A - B 473898
#> 2 A D 2356 D 45674 A - D 48030
#> 3 B A 445789 A 28109 B - A 473898
#> 4 B D 123 D 1979 B - D 2102
#> 5 D A 45674 A 2356 D - A 48030
#> 6 D B 1979 B 123 D - B 2102
Created on 2019-03-20 by the reprex package (v0.2.1)
Because I like @IceCreamToucan's answer better, here's the tidyverse equivalent:
df %>%
group_by(route = paste(pmin(from, to), "-", pmax(from, to))) %>%
summarise(total_price = sum(price))
Also one tidyverse possibility:
df %>%
nest(from, to) %>%
mutate(route = unlist(map(data, function(x) paste(sort(x), collapse = "_")))) %>%
group_by(route) %>%
summarise(total_price = sum(price))
route total_price
<chr> <int>
1 A_B 473898
2 A_D 48030
3 A_E 4216
4 B_D 2102
This first creates a list column holding the "from" and "to" values of each row. It then sorts the elements of each list entry and pastes them together, separated by _. Finally, it groups by the combined key and sums the price.
Or involving a wide-to-long transformation:
df %>%
rowid_to_column() %>%
gather(var, val, -c(rowid, price)) %>%
arrange(rowid, val) %>%
group_by(rowid) %>%
summarise(route = paste(val, collapse = "_"),
price = first(price)) %>%
group_by(route) %>%
summarise(total_price = sum(price))
This first performs a wide-to-long transformation on every column except the row ID and "price". It then arranges the data by row ID and by the values that came from "from" and "to". Next, it groups by row ID and pastes the two values together, separated by _. Finally, it groups by this new variable and sums the price.
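The step all of these variants share is building a key from the row-wise sorted endpoints. A minimal base R sketch of just that step, on the sample data from the question, could look like this; endpoints is a name I introduced, and apply() over rows is only one of several ways to do it.

```r
df <- data.frame(
  from  = c("A", "A", "A", "B", "B", "D", "D"),
  to    = c("B", "D", "E", "A", "D", "A", "B"),
  price = c(28109, 2356, 4216, 445789, 123, 45674, 1979),
  stringsAsFactors = FALSE
)

# Sort the two endpoints within each row and paste them into one key
endpoints <- as.matrix(df[c("from", "to")])
df$route  <- apply(endpoints, 1, function(x) paste(sort(x), collapse = "_"))

# Both directions now share a key, so a plain grouped sum is enough
total_price <- tapply(df$price, df$route, sum)
total_price
```

Unlike pmin()/pmax(), the row-wise sort would also extend to keys built from more than two columns, at the cost of being slower on large data.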
I'd do...
library(data.table)
setDT(df)
pts = df[, unique(c(from, to))]
rDT = CJ(P1 = pts, P2 = pts)[P1 < P2]
rDT[df, on=.(P1 = from, P2 = to), r12 := i.price]
rDT[df, on=.(P2 = from, P1 = to), r21 := i.price]
rDT[, r := r12 + r21]
P1 P2 r12 r21 r
1: A B 28109 445789 473898
2: A D 2356 45674 48030
3: A E 4216 NA NA
4: B D 123 1979 2102
5: B E NA NA NA
6: D E NA NA NA
This makes it clear where the data is incomplete.** You could filter with rDT[!is.na(r)] to keep only the complete records.
** This is also addressed in @JasonAizkalns's and @IceCreamToucan's answers, but contrasts with OP's requested output.
I'm trying to do a simple pivot in R, much like you would in SQL.
I understand this question has been asked before (see "Pivoting data in R"), but I am having trouble with duplicate rows.
Currently the data is in this format (characters are just placeholders for ease of viewing. The actual data is numerical):
V1 V2 V3 V4
A B C Sales
D E F Sales
G H I Technical
J K L Technical
And it needs to be transformed into this format:
Variable Sales Technical
V1 A G
V1 D J
V2 B H
V2 E K
V3 C I
V3 F L
I've tried both reshape and tidyr packages and they either aggregate the data in the case of reshape or throw errors for duplicate row identifiers in the case of tidyr.
I don't care about duplicate row identifiers; in fact, it's necessary to identify them as factors for analysis.
Am I going about this the wrong way? Are these the correct packages to be using or can anyone suggest another method?
I hope this will work:
df %>% gather(Variable, Value, V1:V3) %>%
group_by(V4, Variable) %>%
mutate(g = row_number()) %>%
spread(V4, Value) %>% ungroup() %>%
select(-g)
# # A tibble: 6 x 3
# Variable Sales Technical
# * <chr> <chr> <chr>
# 1 V1 A G
# 2 V1 D J
# 3 V2 B H
# 4 V2 E K
# 5 V3 C I
# 6 V3 F L
Another option is melt/dcast from data.table
library(data.table)
dcast(melt(setDT(df1), id.var = 'V4'), variable + rowid(V4) ~
V4, value.var = 'value')[, V4 := NULL][]
# variable Sales Technical
#1: V1 A G
#2: V1 D J
#3: V2 B H
#4: V2 E K
#5: V3 C I
#6: V3 F L
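For comparison, here is a rough base R sketch that produces the same layout without any package. It relies on an assumption that is mine and not stated in the question: every V4 group has the same number of rows. The names groups, stacked, and wide are made up for the example.

```r
df <- data.frame(
  V1 = c("A", "D", "G", "J"),
  V2 = c("B", "E", "H", "K"),
  V3 = c("C", "F", "I", "L"),
  V4 = c("Sales", "Sales", "Technical", "Technical"),
  stringsAsFactors = FALSE
)

value_cols <- c("V1", "V2", "V3")

# One data frame per V4 level
groups <- split(df[value_cols], df$V4)

# Stack V1, then V2, then V3 within each group (column-major order)
stacked <- lapply(groups, function(g) unlist(g, use.names = FALSE))

# Assumes equal group sizes, so the stacked vectors line up row by row
n_per_group <- nrow(groups[[1]])
wide <- data.frame(
  Variable = rep(value_cols, each = n_per_group),
  stacked,
  stringsAsFactors = FALSE
)
wide
```

If the groups had different sizes, the shorter columns would need padding with NA first, which the gather/spread and melt/dcast versions above handle for you.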
I have two dataframes:
df_1 <- data.frame(c("a_b", "a_c", "a_d"))
df_2 <- data.frame(matrix(ncol = 2))
And I would like to loop over df_1 in order to fill df_2:
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[i*j,] <-str_split_fixed(df_1[i,1], "_", 2)
}
}
I would like df_2 to look like:
col1 col2
a b
a b
a c
a c
a d
a d
But instead I get:
col1 col2
a b
a c
a d
a c
NA NA
a d
I must be doing something wrong, but cannot figure it out.
I would also like to use apply (or something like it), but I am pretty new to R and not yet comfortable with the apply family.
Thanks for your help!
Another way would be
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))
df_2 <- as.data.frame(do.call(rbind, strsplit(as.character(df_1$col1), split = "_", fixed = TRUE)))
df_2[rep(1:nrow(df_2), each = 2), ]
V1 V2
1 a b
1.1 a b
2 a c
2.1 a c
3 a d
3.1 a d
We can use cSplit with data.table approach
library(splitstackshape)
cSplit(df_1, 'col1', '_')[rep(seq_len(.N), each =2)]
# col1_1 col1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or another option is tidyverse
library(tidyverse)
separate(df_1, col1, into=c("col_1", "col_2")) %>%
map_df(~rep(., each = 2))
# A tibble: 6 × 2
# col_1 col_2
# <chr> <chr>
#1 a b
#2 a b
#3 a c
#4 a c
#5 a d
#6 a d
NOTE: Both the answers are one-liners.
data
df_1 <- data.frame(col1 = c("a_b", "a_c", "a_d"))
This would be a combination of two answers. With cSplit we split the column by _ and then repeat each row twice. Assuming your column name as V1.
library(splitstackshape)
df_2 <- cSplit(df_1, "V1", "_")
df_2[rep(seq_len(nrow(df_2)),each = 2), ]
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
Or, as @Sotos mentioned in the comments, we can use expandRows to accommodate everything in one line.
expandRows(cSplit(df_1, "V1", "_"), 2, count.is.col = FALSE)
# V1_1 V1_2
#1: a b
#2: a b
#3: a c
#4: a c
#5: a d
#6: a d
data
df_1 <- data.frame(V1 = c("a_b", "a_c", "a_d"))
OK, I started learning R this week, but if you want the result you presented, you can use your code with this fix:
library(stringr)  # provides str_split_fixed()
for (i in (1:(length(df_1[,1])))){
  for (j in (1:2)) {
    # input row i fills output rows 2*i - 1 and 2*i
    df_2[(i-1)*2+j,] <- str_split_fixed(df_1[i,1], "_", 2)
  }
}
I changed the index of df_2.
I guess there is a better way than two for loops, but that's all I can do for the moment.
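To see why the index matters, it may help to list which row of df_2 each (i, j) pair writes to under both formulas. This is just a small diagnostic sketch; idx, buggy, and fixed are names I chose.

```r
# All (i, j) pairs in the order the nested loops visit them
idx <- expand.grid(j = 1:2, i = 1:3)

idx$buggy <- idx$i * idx$j            # index used in the question
idx$fixed <- (idx$i - 1) * 2 + idx$j  # index used in the fix

idx

# Rows of df_2 that the original index never writes to
setdiff(1:6, idx$buggy)
```

With i * j, row 2 is written twice and row 5 is never written, which matches the overwritten second row and the NA row in the question's output. The corrected formula visits rows 1 to 6 exactly once.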
I tried to post a solution I found right after posting, but it was misunderstood and deleted:
"Sometimes posting a question helps:
I was asking for the right position in df_1, but I was saving the result in the wrong cell.
The answer to my original question should be something like this:
n <- 1
for (i in (1:(length(df_1[,1])))){
for (j in (1:2)) {
df_2[n,] <-str_split_fixed(df_1[i,1], "_", 2)
n <- n+1
}
}"
I have a data set with individuals (ID) that can be part of more than one group.
Example:
library(data.table)
DT <- data.table(
ID = rep(1:5, c(3:1, 2:3)),
Group = c("A", "B", "C", "B",
"C", "A", "A", "C",
"A", "B", "C")
)
DT
# ID Group
# 1: 1 A
# 2: 1 B
# 3: 1 C
# 4: 2 B
# 5: 2 C
# 6: 3 A
# 7: 4 A
# 8: 4 C
# 9: 5 A
# 10: 5 B
# 11: 5 C
I want to know the sum of identical individuals for 2 groups.
The result should look like this:
Group.1 Group.2 Sum
A B 2
A C 3
B C 3
Where Sum indicates the number of individuals the two groups have in common.
Here's my version:
# size-1 IDs can't contribute; skip
DT[ , if (.N > 1)
# simplify = FALSE returns a list;
# transpose turns the 3-length list of 2-length vectors
# into a length-2 list of 3-length vectors (efficiently)
transpose(combn(Group, 2L, simplify = FALSE)), by = ID
][ , .(Sum = .N), keyby = .(Group.1 = V1, Group.2 = V2)]
With output:
# Group.1 Group.2 Sum
# 1: A B 2
# 2: A C 3
# 3: B C 3
As of version 1.9.8 (on CRAN 25 Nov 2016), data.table has gained the ability to do non-equi joins. So, a self non-equi join can be used:
library(data.table) # v1.9.8+
setDT(DT)[, Group:= factor(Group)]
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)][
, .N, by = .(x.Group, i.Group)]
x.Group i.Group N
1: A B 2
2: A C 3
3: B C 3
Explanation
The non-equi join on ID, Group < Group is a data.table version of combn() (but applied group-wise):
DT[DT, on = .(ID, Group < Group), nomatch = 0L, .(ID, x.Group, i.Group)]
ID x.Group i.Group
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 B C
5: 4 A C
6: 5 A B
7: 5 A C
8: 5 B C
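For readers on an older data.table or without it, roughly the same pairing can be sketched in base R with an ordinary self-merge followed by a filter. This is a substitute for the non-equi join, not the same operation: it builds every within-ID combination first and then keeps the ordered ones, so it uses more memory. The names pairs and counts are mine, and Group is kept as character so that < compares alphabetically.

```r
DT <- data.frame(
  ID    = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B", "C", "A", "A", "C", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Every combination of two groups within the same ID
pairs <- merge(DT, DT, by = "ID", suffixes = c(".1", ".2"))

# Keep each unordered pair once (this also drops A-A, B-B, ...)
pairs <- pairs[pairs$Group.1 < pairs$Group.2, ]

# Count the IDs behind each pair of groups
counts <- aggregate(ID ~ Group.1 + Group.2, data = pairs, FUN = length)
names(counts)[3] <- "Sum"
counts
```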
We self-join with the same dataset on 'ID', subset the rows where the 'Group' columns are different, get the nrows (.N), grouped by the 'Group' columns, sort the 'Group.1' and 'Group.2' columns by row using pmin/pmax and get the unique value of 'N'.
library(data.table)#v1.9.6+
DT[DT, on='ID', allow.cartesian=TRUE][Group!=i.Group, .N ,.(Group, i.Group)][,
list(Sum=unique(N)) ,.(Group.1=pmin(Group, i.Group), Group.2=pmax(Group, i.Group))]
# Group.1 Group.2 Sum
#1: A B 2
#2: A C 3
#3: B C 3
Or, as mentioned in the comments by @MichaelChirico and @Frank, we can convert 'Group' to factor class, subset the rows based on as.integer(Group) < as.integer(i.Group), group by 'Group', 'i.Group' and get the nrow (.N)
DT[, Group:= factor(Group)]
DT[DT, on='ID', allow.cartesian=TRUE][as.integer(Group) < as.integer(i.Group), .N,
by = .(Group.1= Group, Group.2= i.Group)]
Great answers above.
Just an alternative using dplyr in case you, or someone else, is interested.
library(dplyr)
cmb = combn(unique(DT$Group),2)
data.frame(g1 = cmb[1,],
g2 = cmb[2,]) %>%
group_by(g1,g2) %>%
summarise(l=length(intersect(DT[DT$Group==g1,]$ID,
DT[DT$Group==g2,]$ID)))
# g1 g2 l
# (fctr) (fctr) (int)
# 1 A B 2
# 2 A C 3
# 3 B C 3
Yet another solution (base R):
tmp <- split(DT, DT[, 'Group'])
ans <- apply(combn(LETTERS[1 : 3], 2), 2, FUN = function(ind){
out <- length(intersect(tmp[[ind[1]]][, 1], tmp[[ind[2]]][, 1]))
c(group1 = ind[1], group2 = ind[2], sum_ = out)
}
)
data.frame(t(ans))
# group1 group2 sum_
#1 A B 2
#2 A C 3
#3 B C 3
First split the data into a list of groups; then, for each unique pairwise combination of two groups, count how many subjects they have in common using length(intersect(...)).
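One more base R possibility, offered as a sketch and not as a replacement for the answers above, is to treat the data as an ID-by-Group incidence matrix and take its cross-product. The off-diagonal cells then hold the number of IDs two groups have in common. This assumes each (ID, Group) combination appears at most once; incidence, shared, and result are names I picked.

```r
DT <- data.frame(
  ID    = rep(1:5, c(3:1, 2:3)),
  Group = c("A", "B", "C", "B", "C", "A", "A", "C", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Rows are IDs, columns are groups, cells are 0/1 membership
incidence <- unclass(table(DT$ID, DT$Group))

# t(incidence) %*% incidence: cell [g1, g2] counts IDs that are in both groups
shared <- crossprod(incidence)

# Read off the upper triangle as a long table
pairs <- which(upper.tri(shared), arr.ind = TRUE)
result <- data.frame(
  Group.1 = rownames(shared)[pairs[, "row"]],
  Group.2 = colnames(shared)[pairs[, "col"]],
  Sum     = shared[pairs],
  stringsAsFactors = FALSE
)
result
```

The diagonal of shared gives the size of each group, which is a handy sanity check.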
I have data like this:
ID=c(rep("ID1",3), rep("ID2",2), "ID3", rep("ID4",2))
item=c("a","b","c","a","c","a","b","a")
data.frame(ID,item)
ID1 a
ID1 b
ID1 c
ID2 a
ID2 c
ID3 a
ID4 b
ID4 a
and I would need it as a list of edges like this:
a;b
b;c
a;c
a;c
b;a
the first three edges coming from ID1, fourth from ID2, ID3 has no edges so nothing from that and fifth from ID4. Any ideas on how to accomplish this? melt/cast?
I'd guess there should be a simple igraph solution for this, but here's a simple solution using the data.table package
library(data.table)
setDT(df)[, if(.N > 1) combn(as.character(item), 2, paste, collapse = ";"), ID]
# ID V1
# 1: ID1 a;b
# 2: ID1 a;c
# 3: ID1 b;c
# 4: ID2 a;c
# 5: ID4 b;a
Try
res <- do.call(rbind,with(df, tapply(item, ID,
FUN=function(x) if(length(x)>=2) t(combn(x,2)))))
paste(res[,1], res[,2], sep=";")
#[1] "a;b" "a;c" "b;c" "a;c" "b;a"
Here is a more scalable solution that uses the same core logic as the other solutions:
library(plyr)
library(dplyr)
ID=c(rep("ID1",3), rep("ID2",2), "ID3", rep("ID4",2))
item=c("a","b","c","a","c","a","b","a")
dfPaths = data.frame(ID, item)
dfPaths2 = dfPaths %>%
group_by(ID) %>%
mutate(numitems = n(), item = as.character(item)) %>%
filter(numitems > 1)
ddply(dfPaths2, .(ID), function(x) t(combn(x$item, 2)))