Transforming directed dyads into undirected [duplicate] - r

This question already has answers here: Pasting elements of two vectors alphabetically (5 answers). Closed 2 years ago.
This seems like such a basic question that I'm almost sure it must be covered somewhere around here, but I've been searching for quite some time now and can't find the right answer.
My data looks like this:
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4))
col1 col2 value
1 A B 1
2 A C 2
3 B A 3
4 B C 4
I want to merge col1 and col2 into a single variable that identifies the unique dyads. It should not matter whether "A" and "B" appear in col1 or in col2: each row that contains "A" and "B" across col1 and col2 should get the same value of the new variable. I tried to use tidyr for this.
unite(data, col1, col2, col="dyad", sep="_")
returns
dyad value
1 A_B 1
2 A_C 2
3 B_A 3
4 B_C 4
Basically, I need dyad to contain the same value for A_B and B_A, because these pairs are equivalent for me. This is what it should look like, for example:
dyad value
1 A_B 1
2 A_C 2
3 A_B 3
4 B_C 4
Is there an easy way to do this? Thanks a lot!

There may be more elegant solutions, but perhaps this helps:
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4),
stringsAsFactors = FALSE)
data$dyad <- apply(data[,c("col1","col2")], 1, FUN= function(x) paste(sort(x), collapse="_"))
So the apply function ensures that the function is applied to each row of the data frame. The function first sorts the input and then pastes them together.
EDIT: I copied stringsAsFactors = FALSE from the other answer, as I used it as well but forgot to include it in my post :)
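For larger data frames, a vectorised variant may be worth considering. The sketch below is not taken from the answers here; it assumes both columns are character vectors and uses base R's pmin() and pmax() to pick the alphabetically smaller and larger label in each row, so no row-wise apply() is needed. String ordering follows the current locale.

```r
# Same example data as in the question, kept as character columns
data <- data.frame(col1 = c("A", "A", "B", "B"),
                   col2 = c("B", "C", "A", "C"),
                   value = c(1, 2, 3, 4),
                   stringsAsFactors = FALSE)

# pmin()/pmax() compare the two columns element by element,
# so the smaller label always ends up on the left of the separator
data$dyad <- paste(pmin(data$col1, data$col2),
                   pmax(data$col1, data$col2),
                   sep = "_")

data$dyad
```

Because the two labels are ordered before pasting, A/B and B/A rows receive the same dyad value.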

A solution using dplyr. Notice that I added stringsAsFactors = FALSE when creating the data frame because it is better to work on character columns in this case.
data <- data.frame(col1 = c("A","A","B","B"), col2 = c("B","C","A","C"), value = c(1,2,3,4),
stringsAsFactors = FALSE)
library(dplyr)
data2 <- data %>%
rowwise() %>%
mutate(dyad = paste(sort(c(col1, col2)), collapse = "_")) %>%
select(dyad, value) %>%
ungroup()
data2
# # A tibble: 4 x 2
# dyad value
# <chr> <dbl>
# 1 A_B 1
# 2 A_C 2
# 3 A_B 3
# 4 B_C 4

Related

How to match multiple columns based on lookup table

I have the following two data frames:
lookup <- data.frame(id = c("A", "B", "C"),
price = c(1, 2, 3))
results <- data.frame(price_1 = c(2,2,1),
price_2 = c(3,1,1))
I now want to go through all columns of results and add the respective matching id from lookup as new columns. So I first want to take the price_1 column and find the ids (here: "B", "B", "A") and add it as a new column to results and then I want to do the same for the price_2 column.
My real-life case would need to match 20+ columns, so I want to avoid a hard-coded manual solution and am looking for a dynamic approach, ideally in the tidyverse.
results <- results %>%
left_join(., lookup, by = c("price_1" = "price"))
would give me the manual solution for the first column and I could repeat this with the second column, but I'm wondering if I can do this automatically for all my results columns.
Expected output:
price_1 price_2 id_1 id_2
2 3 "B" "C"
2 1 "B" "A"
1 1 "A" "A"
We could unlist the dataframe and match directly.
new_df <- results
names(new_df) <- paste0("id", seq_along(new_df))
new_df[] <- lookup$id[match(unlist(new_df), lookup$price)]
cbind(results, new_df)
# price_1 price_2 id1 id2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A
In dplyr, we can do
library(dplyr)
bind_cols(results, results %>% mutate_all(~lookup$id[match(., lookup$price)]))
You can use apply and match to match multiple columns based on lookup table.
cbind(results, t(apply(results, 1, function(i) lookup[match(i, lookup[,2]),1])))
# price_1 price_2 1 2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A
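The match-based answers above can also be written so that the new columns get the id_1, id_2 names from the expected output. This is only a sketch in base R; it assumes every results column is named price_<n>, and the intermediate name ids is made up for illustration.

```r
lookup <- data.frame(id = c("A", "B", "C"),
                     price = c(1, 2, 3))
results <- data.frame(price_1 = c(2, 2, 1),
                      price_2 = c(3, 1, 1))

# For each price column, find the position of each price in the lookup
# table and pull out the corresponding id
ids <- lapply(results, function(p) as.character(lookup$id[match(p, lookup$price)]))

# Assumption: columns follow the pattern price_<n>, so rename to id_<n>
names(ids) <- sub("^price", "id", names(ids))

out <- cbind(results, as.data.frame(ids, stringsAsFactors = FALSE))
out
```

Since lapply() walks over all columns of results, the same code should cover 20+ price columns without changes, as long as the naming assumption holds.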

Collapse redundant rows in data table

I have a data table in the format:
myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)
Col1 Col2
1: A 1
2: A 2
3: A 3
4: B 4
5: B 5
6: B 6
I want to show only the highest result for each category in Col1, then collapse all others and present their sum in Col2. It should look like this:
print(myTable)
Col1 Col2
1: A 3
2: Others 3
3: B 6
4: Others 9
I managed to do it with the following code:
unique <- unique(myTable$Col1) # unique values in Col1
myTable2 <- data.table() # empty data table to populate
for(each in unique){
temp <- myTable[Col1 == each, ] # filter myTable for unique Col1 values
temp <- temp[order(-Col2)] # order filtered table by Col2, in decreasing order
sumCol2 <- sum(temp$Col2) # sum of values in filtered Col2
temp <- temp[1, ] # retain only first element
remSum <- sumCol2 - sum(temp$Col2) # remaining sum in Col2 (without first element)
temp <- rbindlist(list(temp, data.table("Others", remSum))) # rbind first element and remaining elements
myTable2 <- rbindlist(list(myTable2, temp)) # populate data table from beginning
}
This works, but I am trying to shorten a very large data table, so it takes forever.
Is there any better way to approach this?
Thanks.
UPDATE: Actually my procedure is a little bit more complicated. I figured I would be able to develop it myself after the basics were mastered but it seems I will need further help instead. I want to display the 5 highest values in Col1, and collapse the others, but some entries in Col1 do not have 5 values; in these cases, all entries should be displayed, and no "Others" row should be added.
Here the data is split into groups according to the value of Col1 (by = Col1). .N is the number of rows in the current group, so Col2[.N] is the last value in the group, and c(Col2[.N], sum(Col2) - Col2[.N]) gives the last value of Col2 followed by the sum of Col2 minus that last value. This assumes Col2 is sorted in increasing order within each group, as in the example, so that the last value is also the highest. The newly created variables are surrounded by .() because .() is an alias for the list() function when using data.table, and the created columns need to go in a list.
library(data.table)
setDT(df)
df[, .(Col1 = c(Col1, 'Others'),
Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
, by = Col1][, -1]
# Col1 Col2
# 1: A 3
# 2: Others 3
# 3: B 6
# 4: Others 9
If it is just a matter of displaying things, you could use the 'tables' package:
others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)
# Col1 Col2
# A last 3
# others 3
# B last 6
# others 9
do.call(
rbind, lapply(split(myTable, factor(myTable$Col1)), function(x) rbind(x[which.max(x$Col2),], list("Other", sum(x$Col2[-which.max(x$Col2)]))))
)
# Col1 Col2
#1: A 3
#2: Other 3
#3: B 6
#4: Other 9
I did it! I made a new myTable to illustrate. I want to retain only the 4 highest values by category, and collapse the others.
set.seed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)
Col1 Col2
1: A 8
2: A 5
3: A 2
4: B 7
5: B 10
6: B 9
7: B 12
8: B 11
9: C 4
10: C 6
11: C 3
12: C 1
# set key to Col2, it will sort it increasingly
setkey(myTable, Col2)
# if there are more than 4 entries by Col1 category, will return all information, otherwise will return 4 entries completing with NA
myTable <- myTable[,.(Col2 = Col2[1:max(c(4, .N))]) , by = Col1]
# will print in Col1: 4 entries of Col1 category, then "Other"
# will print in Col2: 4 last entries of Col2 in that category, then the remaining sum
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]
# removes rows with NA inserted in first step
myTable <- na.omit(myTable)
# removes rows where Col2 = 0, inserted because that Col1 category had exactly 4 entries
myTable <- myTable[Col2 != 0]
Owooooo!
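The same "keep the top n, collapse the rest" idea can be sketched in base R without padding with NA and filtering afterwards. This is an illustrative alternative, not the code from the answer above: the function name collapse_rest is made up, the data is a plain data frame with fixed values (no sampling), and Col1 is assumed to be a character column.

```r
df <- data.frame(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)),
                 Col2 = c(8, 5, 2, 7, 10, 9, 12, 11, 4, 6, 3, 1),
                 stringsAsFactors = FALSE)

# Hypothetical helper: keep the n highest Col2 values per Col1 group and
# add an "Other" row with the remaining sum only when a group has more than n rows
collapse_rest <- function(df, n) {
  pieces <- lapply(split(df, df$Col1), function(g) {
    g <- g[order(-g$Col2), ]              # highest values first
    if (nrow(g) <= n) return(g)           # small group: show everything
    rest <- data.frame(Col1 = "Other",
                       Col2 = sum(g$Col2[-seq_len(n)]),
                       stringsAsFactors = FALSE)
    rbind(g[seq_len(n), ], rest)
  })
  do.call(rbind, c(pieces, make.row.names = FALSE))
}

res <- collapse_rest(df, 4)
res
```

With n = 4, only group B has more than four rows, so it is the only group that gets an "Other" row. On a very large table the per-group rbind() may be slow, so this is meant to show the logic rather than to be the fastest option.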
Here's a base R solution and the dplyr equivalent:
res <- aggregate(Col2 ~.,transform(
myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
# Col0 Col2
# 1 A 3
# 3 Other 3
# 2 B 6
# 4 Other 9
myTable %>%
group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
summarize_at("Col2",sum) %>%
ungroup %>%
select(-1)
# # A tibble: 4 x 2
# Col1 Col2
# <chr> <int>
# 1 A 3
# 2 Other 3
# 3 B 6
# 4 Other 9

Union dataframes in some way that updates rows with same row.name

I want to do a union of two dataframes, that share some rows with same rowName. For those rows with common rowNames, I would like to take into account the second dataframe values, and not the first one's. For example :
df1 <- data.frame(col1 = c(1,2), col2 = c(2,4), row.names = c("row_1", "row_2"))
df1
# col1 col2
# row_1 1 2
# row_2 2 4
df2 <- data.frame(col1 = c(3,10), col2 = c(6,99), row.names = c("row_3", "row_2"))
df2
# col1 col2
# row_3 3 6
# row_2 10 99
The result I would like to obtain would then be :
someSpecificRBind(df1,df2, takeIntoAccount=df2)
# col1 col2
# row_1 1 2
# row_2 10 99
# row_3 3 6
The function rbind doesn't do the job: it keeps both rows and renames the duplicated row names to make them unique.
I would conceptualize this as only adding to df2 the rows in df1 that aren't already there:
rbind(df2, df1[setdiff(rownames(df1), rownames(df2)), ])
We get the index of duplicated elements and use that to filter
rbind(df2, df1)[!duplicated(c(row.names(df2), row.names(df1))),]

combining values in rows based on matching conditions in R

I have a simple question about aggregating values in R.
Suppose I have a dataframe:
DF <- data.frame(col1=c("Type 1", "Type 1B", "Type 2"), col2=c(1, 2, 3))
which looks like this:
col1 col2
1 Type 1 1
2 Type 1B 2
3 Type 2 3
I notice that I have Type 1 and Type 1B in the data, so I would like to combine Type 1B into Type 1.
So I decide to use dplyr:
filter(DF, col1=='Type 1' | col1=='Type 1B') %>%
summarise(n = sum(col2))
But now I need to keep going with it:
DF2 <- data.frame('Type 1', filter(DF, col1=='Type 1' | col1=='Type 1B') %>%
summarise(n = sum(col2)))
I guess I want to rbind this new DF2 back to the original DF, but that means I have to set the column names to be consistent:
names(DF2) <- c('col1', 'col2')
OK, now I can rbind:
rbind(DF2, DF[3,])
The result? It worked....
col1 col2
1 Type 1 3
3 Type 2 3
...but ugh! That was awful! There has to be a better way to simply combine values.
Here's a possible dplyr approach:
library(dplyr)
DF %>%
group_by(col1 = sub("(.*\\d+).*$", "\\1", col1)) %>%
summarise(col2 = sum(col2))
#Source: local data frame [2 x 2]
#
# col1 col2
#1 Type 1 3
#2 Type 2 3
Using sub() with aggregate(), removing anything other than a digit from the end of col1,
do.call("data.frame",
aggregate(col2 ~ cbind(col1 = sub("\\D+$", "", col1)), DF, sum)
)
# col1 col2
# 1 Type 1 3
# 2 Type 2 3
The do.call() wrapper is there so that the first column after aggregate() is properly changed from a matrix to a vector. This way there aren't any surprises later on down the road.
In my opinion, aggregate() is the perfect function for this purpose, but you shouldn't have to do any text processing (e.g. gsub()). I would do this in a two-step process:
Overwrite col1 with the new desired grouping.
Compute the aggregation using the new col1 to specify the grouping.
DF$col1 <- ifelse(DF$col1 %in% c('Type 1','Type 1B'),'Type 1',as.character(DF$col1));
DF;
## col1 col2
## 1 Type 1 1
## 2 Type 1 2
## 3 Type 2 3
DF <- aggregate(col2~col1, DF, FUN=sum );
DF;
## col1 col2
## 1 Type 1 3
## 2 Type 2 3
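When there are several labels to merge, the first step of this two-step process can be driven by a lookup vector instead of a hard-coded ifelse(). The sketch below is an assumption-laden variant, not part of the original answer: recode_map is a made-up name, and col1 is assumed to be a character column.

```r
DF <- data.frame(col1 = c("Type 1", "Type 1B", "Type 2"),
                 col2 = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: names are the old labels, values are the new ones
recode_map <- c("Type 1B" = "Type 1")

# Step 1: overwrite col1 only where a replacement is defined
hit <- DF$col1 %in% names(recode_map)
DF$col1[hit] <- unname(recode_map[DF$col1[hit]])

# Step 2: aggregate using the recoded grouping
res <- aggregate(col2 ~ col1, DF, FUN = sum)
res
```

Adding another pair to recode_map extends the recoding without touching the rest of the code.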
You can try:
library(data.table)
setDT(transform(DF, col1=gsub("(.*)[A-Z]+$","\\1",DF$col1)))[,list(col2=sum(col2)),col1]
# col1 col2
# 1: Type 1 3
# 2: Type 2 3
Or even more directly:
setDT(DF)[, .(col2 = sum(col2)), by = .(col1 = sub("[[:alpha:]]+$", "", col1))]

Removing rows when flipped in two columns

Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it is clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more test cases though, but the general idea is to use rev to reverse the order of the columns in "df" and paste them together and compare that with the pasted columns from "df".
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
Using a bit of maths - the two rows are the same up to a permutation if the sum and absolute value of difference are the same:
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
edit: should've read the question more carefully, to remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
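The "same up to a permutation" idea can also be expressed with an explicit order-independent key, which may be easier to read than the sum and absolute difference. This is a sketch under the assumption that exact duplicate rows should be dropped as well, since they produce the same key as flipped rows; the names lo, hi and key are illustrative.

```r
df <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))

# Order-independent key: smaller value first, larger value second
lo <- pmin(df$var1, df$var2)
hi <- pmax(df$var1, df$var2)
key <- paste(lo, hi)

# Drop every row whose key occurs more than once (both copies are removed)
res <- df[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]
res
```

Rows 1 and 5 share the key "1 5", so both are removed and rows 2 to 4 remain.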
If creating a copy doesn't cause memory issues then this works as well -
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer using rbindlist from data.table, though it's still not too concise:
df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df) + 1:nrow(df)],]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6
