Create a Group Key across two columns - r

I'm trying to solve the below problem but I find it difficult to explain. I want to assign an incremental value based on the linkage between two columns (Colours & Letters).
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
df
Colours Letters
1 Green X
2 Red C
3 Green Y
4 Green A
5 Blue C
6 Red T
7 Brown P
I'll assign a value to Group so that all identical Colours are in the same Group, along with any other Colour that shares the same Letter(s). For example, Group 2 includes Red & Blue given the shared linkage to Letter C.
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
df
Colours Letters Group
1 Green X 1
2 Red C 2
3 Green Y 1
4 Green A 1
5 Blue C 2
6 Red T 2
7 Brown P 3
If an additional row was added with Colour = Green and Letter = C then the Group column would change to the below. All Greens would be grouped together with any other Colour (e.g. Red) that shared the same Letter (C in the case of Red). Furthermore, any Colour that shared a Letter with Red would likewise be added to the same Group as Green (such is the case for Blue, which shares the Letter C with Red).
Colours Letters Group
1 Green X 1
2 Red C 1
3 Green Y 1
4 Green A 1
5 Blue C 1
6 Red T 1
7 Brown P 2
8 Green C 1
Can anyone help?

As @Frank noted above, you are describing a graph problem: you want the group label to reflect connected components, that is, colours linked directly or indirectly by a shared letter. By converting your columns into a graph object, you can find the separate components and return them as groups:
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
# load the igraph package for working with graphs
library(igraph)
adj.mat <- table(df$Colours, df$Letters) %*% t(table(df$Colours, df$Letters))
# visual inspection makes it clear what the components are
g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected', diag = FALSE)
plot(g)
# we create a dataframe that matches each color to a component
mdf <- data.frame(Group_test = components(g)$membership,
Colours = names(components(g)$membership))
mdf
#> Group_test Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 3 Green
#> Red 1 Red
# Then we just match them together
dplyr::left_join(df, mdf)
#> Joining, by = "Colours"
#> Colours Letters Group Group_test
#> 1 Green X 1 3
#> 2 Red C 2 1
#> 3 Green Y 1 3
#> 4 Green A 1 3
#> 5 Blue C 2 1
#> 6 Red T 2 1
#> 7 Brown P 3 2
The groups are numbered differently, but they partition the colours in the same way.
We can look at the extended case as a sanity check, where we add a linking color that reduces the set of components to 2:
# examining the extended case as a check
df2 <- data.frame(Colours = c(Colours, "Green"), Letters = c(Letters, "C"))
df2
#> Colours Letters
#> 1 Green X
#> 2 Red C
#> 3 Green Y
#> 4 Green A
#> 5 Blue C
#> 6 Red T
#> 7 Brown P
#> 8 Green C
# let's wrap the procedure in a function for convenience
getGroup <- function(col, let, plot = FALSE){
  # colour-by-colour matrix: non-zero where two colours share a letter
  adj.mat <- table(col, let) %*% table(let, col)
  g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected',
                                   diag = FALSE)
  if (plot) {plot(g)}
  # component membership, one entry per colour
  comps <- components(g)$membership
  mdf <- data.frame(Group = comps, Colours = names(comps))
  mdf
}
# we get our desired group key (which we can merge back to the dataframe)
getGroup(df2$Colours, df2$Letters)
#> Group Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 1 Green
#> Red 1 Red
Created on 2018-11-07 by the reprex package (v0.2.1)
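If igraph is not available, the same connected-components idea can be sketched in base R. This is only an illustration and not the method used above: the helper name group_key and its argument names are hypothetical. It repeatedly gives each row the smallest label found among rows sharing its letter or its colour until nothing changes. As a side effect, groups come out numbered by first appearance, as in the question.

```r
# Hypothetical base-R helper: label connected components by propagating
# the smallest label across shared letters and shared colours.
group_key <- function(col_vec, let_vec) {
  # start with one label per distinct colour, numbered by first appearance
  grp <- match(col_vec, unique(col_vec))
  repeat {
    # smallest label among rows sharing the same letter
    by_letter <- ave(grp, let_vec, FUN = min)
    # then smallest label among rows sharing the same colour
    new_grp <- ave(by_letter, col_vec, FUN = min)
    if (identical(new_grp, grp)) break
    grp <- new_grp
  }
  # renumber to consecutive integers in order of first appearance
  match(grp, unique(grp))
}

Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")

g1 <- group_key(Colours, Letters)
g1
# 1 2 1 1 2 2 3

# extended case: the extra Green/C row links Green to Red and Blue
g2 <- group_key(c(Colours, "Green"), c(Letters, "C"))
g2
# 1 1 1 1 1 1 2 1
```

The result is a plain vector aligned with the rows, so it can be assigned directly, for example df$Group <- group_key(df$Colours, df$Letters).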

Related

Ordering rows in a dataframe based on order of rows in another, with different dimensions [duplicate]

I have a categorical data set that looks similar to:
A <- data.frame(animal = c("cat","cat","cat","dog","dog","dog","elephant","elephant","elephant"),
                color = c(rep(c("blue","red","green"), 3)))
animal color
1 cat blue
2 cat red
3 cat green
4 dog blue
5 dog red
6 dog green
7 elephant blue
8 elephant red
9 elephant green
I want to order it so that 'animal' is sorted as dog < elephant < cat, and then the color is sorted green < blue < red. So in the end it would look like
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red
The levels should be specified explicitly:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then you order by the 2 columns simultaneously:
A[order(A$animal,A$color),]
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red
You can also use match. This way you neither change the column class nor convert anything to a factor.
animalOrder = c("dog", "elephant","cat")
colorOrder = c("green", "blue", "red")
A[ order(match(A$animal, animalOrder), match(A$color, colorOrder)), ]
animal color
6 dog green
4 dog blue
5 dog red
9 elephant green
7 elephant blue
8 elephant red
3 cat green
1 cat blue
2 cat red
One other thing worth noting: you don't have to convert the class of the columns to do this. You can simply order by a factor built from the variable on the fly, which leaves the column as, say, character within the existing data structure, if that is desired.
So, using the example above:
A[order(factor(A$animal, levels = c("dog", "elephant","cat")) ,factor(A$color, levels = c("green", "blue", "red"))),]
Which approach to use depends on whether preserving the class is important. This would be a much more typical use case for me personally. HTH
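A quick check of the class-preservation point, as a minimal sketch. The data frame is rebuilt with stringsAsFactors = FALSE so the columns are character regardless of R version, and the names animal_order, color_order and sorted are only illustrative.

```r
A <- data.frame(animal = rep(c("cat", "dog", "elephant"), each = 3),
                color  = rep(c("blue", "red", "green"), 3),
                stringsAsFactors = FALSE)

animal_order <- c("dog", "elephant", "cat")
color_order  <- c("green", "blue", "red")

# order by temporary factors; the columns of A themselves are not modified
sorted <- A[order(factor(A$animal, levels = animal_order),
                  factor(A$color,  levels = color_order)), ]

class(sorted$animal)
# "character"
rownames(sorted)
# "6" "4" "5" "9" "7" "8" "3" "1" "2"
```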
In a similar vein to agstudy's answer, here is the 'tidyverse' way of doing the ordering. First set the factor levels explicitly:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then we load dplyr or the whole tidyverse and can do
arrange(A, animal, color)
or simply
A %>% arrange(animal, color)
where %>% is the 'pipe' operator from the magrittr package (re-exported by dplyr); in RStudio it can be inserted with Ctrl + Shift + M

Is there a possibility to merge 2 dataframes in R, keeping only unique results with one of the columns as the dependency for which results are kept

I know the question is confusing, but I don't know how to formulate it better. I have 11 xls lists that contain GOTerm IDs in the format
ID Category Colour
Altogether these lists contain about 130k IDs; however, only 26k of those are unique.
I built a ranking of the 11 lists by category. What I need now is a way to merge List 11 with List 10 so that, for any ID that is in both lists, I keep the row with the lower category number (in this example, category 10) and the matching colour.
Preferably, I'd like to do this in one script or a few combined lines of code for all lists.
Example
List11
ID Category colour
1 Category11 red
2 Category11 red
3 Category11 red
List10
ID Category colour
1 Category10 blue
2 Category10 blue
4 Category10 blue
And my ideal merged result would look like this:
ID Category colour
1 Category10 blue
2 Category10 blue
3 Category11 red
4 Category10 blue
And then the same thing for the new list vs. List 9, that result vs. List 8, and so on.
Here is a data.table approach to your problem. The functional explanation is in the comments of the code.
In this solution, there is no merging of the different tables. Instead, they are row-bound together and, per ID, only the row that appears first is kept.
I do not know how you have read in your Excel files, but if you read them in using something like L <- lapply(myexcelfiles, readxl::read_excel), then your Excel files are already inside a list and you can skip some of the lines below and start with the line DT <- rbindlist(L, id = "ListId").
library(data.table)
# Sample data
List11 <- fread("ID Category colour
1 Category11 red
2 Category11 red
3 Category11 red")
List10 <- fread("ID Category colour
1 Category10 blue
2 Category10 blue
4 Category10 blue")
# Add the items (in order!) to a list
L <- list(List10, List11)
# Put the list in one large data.table
# Get a ListId from the position in the list
DT <- rbindlist(L, id = "ListId")
# ListId ID Category colour
# 1: 1 1 Category10 blue
# 2: 1 2 Category10 blue
# 3: 1 4 Category10 blue
# 4: 2 1 Category11 red
# 5: 2 2 Category11 red
# 6: 2 3 Category11 red
# Only keep rows with the minimum ListId by ID
# and drop the ListId column afterward and order on ID
final <- setorder(DT[ , .SD[which.min(ListId)], by = ID][, ListId := NULL], ID)
# ID Category colour
# 1: 1 Category10 blue
# 2: 2 Category10 blue
# 3: 3 Category11 red
# 4: 4 Category10 blue
Let me share this example, which should give you an idea of the steps using base R.
First, let's create these two data frames. (A list is something different in R.)
list11 <- data.frame(id = c(1, 2, 3),
category = "Category11",
colour = "red")
list10 <- data.frame(id = c(1, 2, 4),
category = "Category10",
colour = "blue")
This results in:
> list11
id category colour
1 1 Category11 red
2 2 Category11 red
3 3 Category11 red
> list10
id category colour
1 1 Category10 blue
2 2 Category10 blue
3 4 Category10 blue
Next, you could join both data frames by ID:
joined <- merge(x = list10, y = list11, by = "id", all = TRUE)
which will give you:
> joined
id category.x colour.x category.y colour.y
1 1 Category10 blue Category11 red
2 2 Category10 blue Category11 red
3 3 <NA> <NA> Category11 red
4 4 Category10 blue <NA> <NA>
The idea is that we take everything that exists in the x-columns. Only if there's nothing there (= <NA>) do we fall back to the y-columns. This is what we do using the is.na() function:
new10 <- joined[!is.na(joined$category.x), c("id", "category.x", "colour.x")]
new11 <- joined[is.na(joined$category.x), c("id", "category.y", "colour.y")]
> new10
id category.x colour.x
1 1 Category10 blue
2 2 Category10 blue
4 4 Category10 blue
> new11
id category.y colour.y
3 3 Category11 red
The merge function above gave us new column names, so we have to set them back:
colnames(new10) <- c("id", "category", "colour")
colnames(new11) <- c("id", "category", "colour")
Now both data frames have the same column names and we can stick them together again using:
> final <- rbind(new10, new11)
> final
id category colour
1 1 Category10 blue
2 2 Category10 blue
4 4 Category10 blue
3 3 Category11 red
Finally we can sort, if we want to do that:
> final <- final[order(final$id), ]
> final
id category colour
1 1 Category10 blue
2 2 Category10 blue
3 3 Category11 red
4 4 Category10 blue
To process all your XLSs you could either create a loop around it or use a list of data frames and apply over it.
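One possible way to extend this to all the lists, sketched under the assumption that the data frames are stored in a list ordered from the most preferred (lowest category) to the least preferred. The name all_lists is hypothetical, and the approach stacks the frames and keeps the first row per id instead of merging pairwise.

```r
list11 <- data.frame(id = c(1, 2, 3), category = "Category11", colour = "red",
                     stringsAsFactors = FALSE)
list10 <- data.frame(id = c(1, 2, 4), category = "Category10", colour = "blue",
                     stringsAsFactors = FALSE)

# hypothetical list, ordered from highest priority (lowest category) downwards;
# with the real data this would hold all 11 data frames
all_lists <- list(list10, list11)

# stack everything into one data frame, highest-priority rows first
stacked <- do.call(rbind, all_lists)

# duplicated() flags the second and later occurrences of an id,
# so negating it keeps the first, highest-priority row per id
final <- stacked[!duplicated(stacked$id), ]
final <- final[order(final$id), ]
final
```

With the two sample frames this keeps ids 1, 2 and 4 from Category10 and id 3 from Category11.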

Create a new level column based on unique row sets

I want to create a new column with new variables (preferably letters) to count the frequency of each set later on.
Let's say I have a data frame called datatemp, which looks like this:
datatemp = data.frame(colors=rep( c("red","blue"), 6), val = 1:6)
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
7 red 1
8 blue 2
9 red 3
10 blue 4
11 red 5
12 blue 6
And I can see my unique row sets where colors and val columns have identical inputs together, such as:
unique(datatemp[c("colors","val")])
colors val
1 red 1
2 blue 2
3 red 3
4 blue 4
5 red 5
6 blue 6
What I really want to do is create a new column in the same data frame in which each unique row set above gets its own level, such as:
colors val freq
1 red 1 A
2 blue 2 B
3 red 3 C
4 blue 4 D
5 red 5 E
6 blue 6 F
7 red 1 A
8 blue 2 B
9 red 3 C
10 blue 4 D
11 red 5 E
12 blue 6 F
I know that's very basic; however, I couldn't come up with a useful idea for a huge dataset.
To make the question clearer, here is another representation of the desired output:
colA colB newcol
10 11 A
12 15 B
10 11 A
13 15 C
Values in the new column should be based on uniqueness of first two columns before it.
www's solution maps the unique values in your val column to letters in the freq column. If you want to create a factor variable for each unique combination of colors and val, you could do something along these lines:
library(plyr)
datatemp = data.frame(colors=rep( c("red","blue"), 6), val = 1:6)
datatemp$freq <- factor(paste(datatemp$colors, datatemp$val), levels=unique(paste(datatemp$colors, datatemp$val)))
datatemp$freq <- mapvalues(datatemp$freq, from = levels(datatemp$freq), to = LETTERS[1:length(levels(datatemp$freq))])
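I first create a new factor variable with one level for each unique combination of val and colors, and then use plyr::mapvalues to rename the factor levels to letters.
We can concatenate the val and colors columns, convert the result to a factor, and then replace the factor levels with letters. Note that as.factor sorts the levels alphabetically, so the letters follow that sorted order, which here happens to match the order of first appearance.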
datatemp$Freq <- as.factor(paste(datatemp$val, datatemp$colors, sep = "_"))
levels(datatemp$Freq) <- LETTERS[1:length(levels(datatemp$Freq))]
datatemp
# colors val Freq
# 1 red 1 A
# 2 blue 2 B
# 3 red 3 C
# 4 blue 4 D
# 5 red 5 E
# 6 blue 6 F
# 7 red 1 A
# 8 blue 2 B
# 9 red 3 C
# 10 blue 4 D
# 11 red 5 E
# 12 blue 6 F
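For comparison, a minimal base-R sketch that assigns letters in order of first appearance without creating a factor. The variable names key and df are only illustrative, and LETTERS limits this to at most 26 distinct combinations.

```r
# the second example from the question
df <- data.frame(colA = c(10, 12, 10, 13), colB = c(11, 15, 11, 15))

# one string per row identifying the combination of the two columns
key <- paste(df$colA, df$colB, sep = "_")

# match() returns the position of each key among the distinct keys,
# which is then used to index the built-in LETTERS vector
df$newcol <- LETTERS[match(key, unique(key))]
df$newcol
# "A" "B" "A" "C"

# the same idea on datatemp
datatemp <- data.frame(colors = rep(c("red", "blue"), 6), val = 1:6)
key2 <- paste(datatemp$colors, datatemp$val, sep = "_")
datatemp$freq <- LETTERS[match(key2, unique(key2))]
```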

identify and count duplicate values across multiple columns

I have a dataset with multiple columns that look similar to this:
ID1 ID2 ID3 ID4
Blue Grey Fuchsia Green
Black Blue Orange Blue
Green Green Yellow Pink
Pink Yellow NA Orange
What I want to do is count how many times each value is duplicated across the four columns. For example, this is what I'd like to get back from the above:
ID Replicates
Blue 3
Black 1
Green 3
Pink 2
Grey 1
Yellow 2
Fuchsia 1
Orange 2
I'd also like to be able to ask which ID value is present in the dataset at frequency >2. So the expected result would be: Green and Blue.
Any thoughts on how to do this in R?
Thanks!
Just a regular table is all you need for a data set full of factors.
> ( tab <- table(unlist(data)) )
Black Blue Green Pink Grey Yellow Fuchsia Orange
1 3 3 2 1 2 1 2
Add deparse.level = 2 if you want the table to be named.
It's easily subsetted with [ indexing. Just subset tab such that tab is greater than 2. And you can get the colors with names.
> tab[tab > 2]
Blue Green
3 3
> names(tab[tab > 2])
[1] "Blue" "Green"
There's also an as.data.frame method.
> as.data.frame(tab)
Var1 Freq
1 Black 1
2 Blue 3
3 Green 3
4 Pink 2
5 Grey 1
6 Yellow 2
7 Fuchsia 1
8 Orange 2
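The object data is not defined in the answer above. A self-contained sketch, assuming it is a data frame holding the four ID columns from the question, could look like this. The columns are created as character here, so table() lists the values alphabetically and not in the order shown above.

```r
data <- data.frame(ID1 = c("Blue", "Black", "Green", "Pink"),
                   ID2 = c("Grey", "Blue", "Green", "Yellow"),
                   ID3 = c("Fuchsia", "Orange", "Yellow", NA),
                   ID4 = c("Green", "Blue", "Pink", "Orange"),
                   stringsAsFactors = FALSE)

# flatten all columns into one vector and count each value;
# table() leaves out NA by default
tab <- table(unlist(data))

# values that occur more than twice
frequent <- names(tab[tab > 2])
frequent
# "Blue" "Green"
```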
A dplyr / tidyr solution (gather() still works, but it has been superseded by pivot_longer() in tidyr 1.0.0 and later):
ID1<-c("Blue", "Black", "Green", "Pink")
ID2<-c("Grey", "Blue", "Green", "Yellow")
ID3<-c("Fuchsia", "Orange", "Yellow", NA)
ID4<-c("Green", "Blue", "Pink", "Orange")
mydf<-data.frame(ID1,ID2,ID3,ID4)
library(dplyr)
library(tidyr)
mydf %>%
gather(key,value,1:4) %>%
group_by (value) %>%
tally
value n
1 Black 1
2 Blue 3
3 Fuchsia 1
4 Green 3
5 Grey 1
6 Orange 2
7 Pink 2
8 Yellow 2
9 NA 1
To return only those with a frequency higher than 2:
mydf %>%
gather(key,value,1:4) %>%
group_by (value) %>%
tally %>%
filter (n>2)
value n
1 Blue 3
2 Green 3

How can I segment data in R to include only points for which one of my variables is set to a specific value?

I'm working with a data set that has 16 variables and more than 4,000 cases. I'd like to segment the data into a separate data frame that only includes cases for which one of the variables is set to 0.
If I'm unclear, here's a simple example that, hopefully, will help illustrate my question:
Ana = red, 1
Beth = blue, 0
Cate = green, 3
David = yellow, 0
How would I, through R, segment the data set to create a new data frame that omits the cases for which the second variable = 0? In this example, I would have a new data frame that only includes Ana and Cate. Likewise, how would I do the opposite, i.e. create a data frame with only Beth and David?
Thank you for your help!
Assume this is your data.frame
m <- data.frame(names = c("Ana", "Beth", "Cate", "David"),
colors = c("blue", "blue", "green", "yellow"), numbers = c(1, 0, 3, 0))
m
# names colors numbers
#1 Ana blue 1
#2 Beth blue 0
#3 Cate green 3
#4 David yellow 0
If I understood correctly, here are two ways to obtain your result:
id <- which(m[,"numbers"] > 0)
m[id,]
#1 Ana blue 1
#3 Cate green 3
or
subset(m, numbers > 0)
# names colors numbers
#1 Ana blue 1
#3 Cate green 3
subset(m, numbers == 0)
# names colors numbers
#2 Beth blue 0
#4 David yellow 0
An alternative is to use split, which returns a list of two data.frames: one for the rows where numbers == 0 and one for the rows where numbers is non-zero. The trick is in how R treats numbers when you convert them to logical values: anything that is not zero becomes TRUE.
So, using @javlacalle's sample data, try:
out <- split(m, as.logical(m$numbers))
out
# $`FALSE`
# names colors numbers
# 2 Beth blue 0
# 4 David yellow 0
#
# $`TRUE`
# names colors numbers
# 1 Ana blue 1
# 3 Cate green 3
You can access the relevant data.frames by their index position or their names:
out[[1]]
# names colors numbers
# 2 Beth blue 0
# 4 David yellow 0
out[["FALSE"]]
# names colors numbers
# 2 Beth blue 0
# 4 David yellow 0
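If the list names "FALSE" and "TRUE" feel cryptic, a small variation is to split on descriptive labels. This is only a sketch: the labels "zero" and "nonzero" are arbitrary choices and not part of the answer above.

```r
m <- data.frame(names = c("Ana", "Beth", "Cate", "David"),
                colors = c("blue", "blue", "green", "yellow"),
                numbers = c(1, 0, 3, 0),
                stringsAsFactors = FALSE)

# label each row, then split on the label; list elements are named after the labels
out <- split(m, ifelse(m$numbers == 0, "zero", "nonzero"))

out$zero$names
# "Beth" "David"
out$nonzero$names
# "Ana" "Cate"
```

As with the as.logical version, rows where numbers is NA end up in neither data frame, because split drops missing values of the grouping variable.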

Resources