Custom sorting (non-alphabetical) - r

I have a categorical data set that looks similar to:
A <- data.frame(animal = c("cat","cat","cat","dog","dog","dog","elephant","elephant","elephant"),
                color = rep(c("blue","red","green"), 3))
animal color
1 cat blue
2 cat red
3 cat green
4 dog blue
5 dog red
6 dog green
7 elephant blue
8 elephant red
9 elephant green
I want to order it so that 'animal' is sorted as dog < elephant < cat, and then the color is sorted green < blue < red. So in the end it would look like
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red

The levels should be specified explicitly:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then you order by the 2 columns simultaneously:
A[order(A$animal,A$color),]
# animal color
# 6 dog green
# 4 dog blue
# 5 dog red
# 9 elephant green
# 7 elephant blue
# 8 elephant red
# 3 cat green
# 1 cat blue
# 2 cat red

You can also use match. This way you neither alter the column class nor do a factor transformation.
animalOrder = c("dog", "elephant","cat")
colorOrder = c("green", "blue", "red")
A[ order(match(A$animal, animalOrder), match(A$color, colorOrder)), ]
animal color
6 dog green
4 dog blue
5 dog red
9 elephant green
7 elephant blue
8 elephant red
3 cat green
1 cat blue
2 cat red
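One thing to keep in mind with match: a value that is missing from the order vector gets NA, and order() puts NA keys last by default. A minimal sketch, assuming a hypothetical data frame B with an extra animal ("fox") that is not in animalOrder:

```r
# Hypothetical data: "fox" does not appear in the custom order
B <- data.frame(animal = c("cat", "fox", "dog", "elephant"),
                stringsAsFactors = FALSE)
animalOrder <- c("dog", "elephant", "cat")

# match() returns NA for "fox"; order() sorts NA keys last (na.last = TRUE)
idx <- order(match(B$animal, animalOrder))
sorted <- B[idx, , drop = FALSE]
sorted$animal
# [1] "dog"      "elephant" "cat"      "fox"
```

If unlisted values should be dropped instead, order(..., na.last = NA) removes them from the index.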

One other thing worth noting: you don't have to convert the class to do this. You can simply order by a factor built from the variable on the fly, which preserves e.g. the character class within the existing data structure, if that is desired.
So, e.g., using the example above:
A[order(factor(A$animal, levels = c("dog", "elephant","cat")) ,factor(A$color, levels = c("green", "blue", "red"))),]
Which approach to pick depends on whether preserving the class is important. This would be a much more typical use case for me personally. HTH
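To check that the columns really keep their class, here is a small self-contained sketch. It rebuilds the example data with stringsAsFactors = FALSE (an assumption, so that it behaves the same on R versions before 4.0) and orders by factors that are created only inside order():

```r
A <- data.frame(animal = rep(c("cat", "dog", "elephant"), each = 3),
                color  = rep(c("blue", "red", "green"), times = 3),
                stringsAsFactors = FALSE)

# The factors exist only inside order(); A itself is not modified
sortedA <- A[order(factor(A$animal, levels = c("dog", "elephant", "cat")),
                   factor(A$color,  levels = c("green", "blue", "red"))), ]

class(sortedA$animal)   # "character"
rownames(sortedA)       # "6" "4" "5" "9" "7" "8" "3" "1" "2"
```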

In a vein similar to how agstudy did it, I'd show the 'tidyverse' way of doing the ordering:
A$animal <- factor(A$animal, levels = c("dog", "elephant","cat"))
A$color <- factor(A$color, levels = c("green", "blue", "red"))
Then we load dplyr or the whole tidyverse and can do
arrange(A, animal, color)
or simply
A %>% arrange(animal, color)
where %>% is the 'pipe' operator from the magrittr package (re-exported by dplyr), which can be typed in RStudio with Ctrl + Shift + M
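If you would rather not convert the columns to factors, arrange also accepts expressions, so the match idea carries over. A sketch, assuming dplyr is installed; animalOrder and colorOrder are the same order vectors used in the match answer:

```r
library(dplyr)

A <- data.frame(animal = rep(c("cat", "dog", "elephant"), each = 3),
                color  = rep(c("blue", "red", "green"), times = 3),
                stringsAsFactors = FALSE)
animalOrder <- c("dog", "elephant", "cat")
colorOrder  <- c("green", "blue", "red")

# Sort by the position of each value in the custom order vectors;
# the columns themselves stay character
res <- arrange(A, match(animal, animalOrder), match(color, colorOrder))
res
```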

Related

How to (efficiently) perform Cartesian product on a key subset [R]

Suppose I have these data
data1 <- read.delim(textConnection(
"id val1
1 blue
1 green
1 red
2 black
2 brown
2 white"
), sep=' ')
data2 <- read.delim(textConnection(
"id val2
1 cat
1 dog
1 fish
2 hat
2 coat
2 car"
), sep=' ')
I would like to calculate all combinations of blue, green, and red with cat, dog, and fish for id=1, and of brown, black, and white with hat, coat, and car for id=2. I could do it in a for loop with expand.grid, and then "build" the output using rbind. But my actual data have several IDs and several vals, so it runs poorly.
It turns out that merge does this by default
> merge(data1, data2, by='id')
id val1 val2
1 1 blue cat
2 1 blue dog
3 1 blue fish
4 1 green cat
5 1 green dog
6 1 green fish
7 1 red cat
8 1 red dog
9 1 red fish
10 2 black hat
11 2 black coat
12 2 black car
13 2 brown hat
14 2 brown coat
15 2 brown car
16 2 white hat
17 2 white coat
18 2 white car
In base R, we could use split on both datasets to create lists of values by 'id', then apply expand.grid to the corresponding elements of the lists, and rbind the results (if needed)
Map(expand.grid, split(data1$val1, data1$id), split(data2$val2, data2$id))
Or in data.table
library(data.table)
setDT(data1)[data2, on = .(id), allow.cartesian = TRUE]
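For completeness, the per-id expand.grid idea can be collapsed into a single data frame that keeps the id column. This is only a sketch in base R; the data frames are rebuilt inline so the snippet runs on its own, and the names ids, pieces and result are just illustrative:

```r
data1 <- data.frame(id = rep(1:2, each = 3),
                    val1 = c("blue", "green", "red", "black", "brown", "white"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id = rep(1:2, each = 3),
                    val2 = c("cat", "dog", "fish", "hat", "coat", "car"),
                    stringsAsFactors = FALSE)

# Only ids present in both tables produce combinations
ids <- intersect(unique(data1$id), unique(data2$id))

# One expand.grid per id, tagged with that id
pieces <- lapply(ids, function(i) {
  grid <- expand.grid(val1 = data1$val1[data1$id == i],
                      val2 = data2$val2[data2$id == i],
                      stringsAsFactors = FALSE)
  cbind(id = i, grid)
})

# Stack the per-id pieces into one data frame
result <- do.call(rbind, pieces)
nrow(result)   # 18
```

The rows are the same set that merge(data1, data2, by = 'id') returns, though not necessarily in the same order.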

Create a Group Key across two columns

I'm trying to solve the below problem but I find it difficult to explain. I want to assign an incremental value based on the linkage between two columns (Colours & Letters).
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
df
Colours Letters
1 Green X
2 Red C
3 Green Y
4 Green A
5 Blue C
6 Red T
7 Brown P
I'll assign a value to Group so that all identical Colours are in the same Group, along with any other Colour that shares the same Letter(s). For example, Group 2 includes Red & Blue given the shared linkage to Letter C.
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
df
Colours Letters Group
1 Green X 1
2 Red C 2
3 Green Y 1
4 Green A 1
5 Blue C 2
6 Red T 2
7 Brown P 3
If an additional row was added with Colour = Green and Letter = C then the Group column would change to the below. All Greens would be grouped together with any other Colour (e.g. Red) that shared the same Letter (C in the case of Red). Furthermore, any Colour that shared a Letter with Red would likewise be added to the same Group as Green (such is the case for Blue, which shares the Letter C with Red).
Colours Letters Group
1 Green X 1
2 Red C 1
3 Green Y 1
4 Green A 1
5 Blue C 1
6 Red T 1
7 Brown P 2
8 Green C 1
Can anyone help?
As @Frank noted above, you are describing a graph problem in that you want your group label to reflect connected components -- colours that share a letter. By converting your columns into a graph object you can figure out what the separate components are and return these as groups:
Colours <- c("Green","Red","Green","Green","Blue","Red","Brown")
Letters <- c("X","C","Y","A","C","T","P")
df <- data.frame(Colours,Letters)
Group <- c(1,2,1,1,2,2,3)
df <- data.frame(df,Group)
# load the igraph package for working with graphs
library(igraph)
adj.mat <- table(df$Colours, df$Letters) %*% t(table(df$Colours, df$Letters))
# visual inspection makes it clear what the components are
g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected', diag = FALSE)
plot(g)
# we create a dataframe that matches each color to a component
mdf <- data.frame(Group_test = components(g)$membership,
Colours = names(components(g)$membership))
mdf
#> Group_test Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 3 Green
#> Red 1 Red
# Then we just match them together
dplyr::left_join(df, mdf)
#> Joining, by = "Colours"
#> Colours Letters Group Group_test
#> 1 Green X 1 3
#> 2 Red C 2 1
#> 3 Green Y 1 3
#> 4 Green A 1 3
#> 5 Blue C 2 1
#> 6 Red T 2 1
#> 7 Brown P 3 2
The groups are numbered differently, but they split the colours in the same way.
We can look at the extended case as a sanity check, where we add a linking color that reduces the set of components to 2:
# examining the extended case as a check
df2 <- data.frame(Colours = c(Colours, "Green"), Letters = c(Letters, "C"))
df2
#> Colours Letters
#> 1 Green X
#> 2 Red C
#> 3 Green Y
#> 4 Green A
#> 5 Blue C
#> 6 Red T
#> 7 Brown P
#> 8 Green C
# let's wrap the procedure in a function for convenience
getGroup <- function(col, let, plot = FALSE){
  adj.mat <- table(col, let) %*% table(let, col)
  g <- graph_from_adjacency_matrix(adj.mat, mode = 'undirected',
                                   diag = FALSE)
  if (plot) {plot(g)}
  comps <- components(g)$membership
  mdf <- data.frame(Group = comps, Colours = names(comps))
  mdf
}
# we get our desired group key (which we can merge back to the dataframe)
getGroup(df2$Colours, df2$Letters)
#> Group Colours
#> Blue 1 Blue
#> Brown 2 Brown
#> Green 1 Green
#> Red 1 Red
Created on 2018-11-07 by the reprex package (v0.2.1)
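If pulling in igraph is not an option, the same connected-components idea can be sketched in base R by repeatedly spreading the smallest label across rows that share a letter or a colour until nothing changes. This swaps in a simple label-propagation loop, which is not what the answer above uses; the function name group_key and its arguments are hypothetical:

```r
# Hypothetical helper: one group id per row, linked through shared
# colours and shared letters (min-label propagation)
group_key <- function(cols, lets) {
  # start with one label per distinct colour
  grp <- as.integer(factor(cols, levels = unique(cols)))
  repeat {
    old <- grp
    grp <- ave(grp, lets, FUN = min)   # rows sharing a letter
    grp <- ave(grp, cols, FUN = min)   # rows sharing a colour
    if (all(grp == old)) break         # stop once labels are stable
  }
  # renumber 1, 2, ... in order of first appearance
  as.integer(factor(grp, levels = unique(grp)))
}

Colours <- c("Green", "Red", "Green", "Green", "Blue", "Red", "Brown")
Letters <- c("X", "C", "Y", "A", "C", "T", "P")

group_key(Colours, Letters)
# [1] 1 2 1 1 2 2 3
group_key(c(Colours, "Green"), c(Letters, "C"))
# [1] 1 1 1 1 1 1 2 1
```

Because labels are renumbered by first appearance, the result happens to match the Group column in the question. On large data a proper graph library is likely to be faster.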

R - Flatten a data frame within a list

I have received a JSON file which could be read into R as a list using
library(jsonlite)
data <- jsonlite::fromJSON(URL)
The data is a list which contains both plain vectors (data columns) and a data frame. For example
temp = list(id = c(1, 2, 3), name = c("banana", "orange", "apple"), type = data.frame(colour=c("red", "blue", "green", "purple"), shape = c("round", "round", "square", "square")))
> temp
$id
[1] 1 2 3
$name
[1] "banana" "orange" "apple"
$type
colour shape
1 red round
2 blue round
3 green square
4 purple square
How can we convert this list to a data frame without losing information? In this case, I suppose each row of the nested data frame should be repeated for every element of the list vectors. The result should be
id name type.colour type.shape
1 1 banana red round
2 1 banana blue round
3 1 banana green square
4 1 banana purple square
5 2 orange red round
6 2 orange blue round
7 2 orange green square
8 2 orange purple square
9 3 apple red round
10 3 apple blue round
11 3 apple green square
12 3 apple purple square
For this specific case you can use the following code :
DFidxs <- rep(seq_len(nrow(temp$type)),times=length(temp$id))
colidxs <- rep(seq_len(length(temp$id)),each=nrow(temp$type))
DF <- cbind(id = temp$id[colidxs],
name = temp$name[colidxs],
temp$type[DFidxs,])
> DF
id name colour shape
1 1 banana red round
2 1 banana blue round
3 1 banana green square
4 1 banana purple square
1.1 2 orange red round
2.1 2 orange blue round
3.1 2 orange green square
4.1 2 orange purple square
1.2 3 apple red round
2.2 3 apple blue round
3.2 3 apple green square
4.2 3 apple purple square
Assuming that id, name (and possibly other vectors/columns) have the same length, you can reuse this code to replicate the rows of the type data.frame for each element of those columns and bind them together.
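A somewhat more general variant, offered as a sketch: collect every non-data-frame element of the list into one data frame and cross-join it with the nested one using merge(..., by = NULL), which returns the Cartesian product. The names is_df, outer_part and flat are illustrative, and the list is rebuilt here (with the name spelled "orange") so the snippet is self-contained:

```r
temp <- list(id = c(1, 2, 3),
             name = c("banana", "orange", "apple"),
             type = data.frame(colour = c("red", "blue", "green", "purple"),
                               shape = c("round", "round", "square", "square"),
                               stringsAsFactors = FALSE))

# Which list elements are data frames?
is_df <- vapply(temp, is.data.frame, logical(1))

# The plain vectors (assumed to have equal length) become one data frame
outer_part <- as.data.frame(temp[!is_df], stringsAsFactors = FALSE)

# by = NULL makes merge() return every pairing of rows
flat <- merge(outer_part, temp$type, by = NULL)

# Group the rows by id for readability
flat <- flat[order(flat$id), ]
nrow(flat)   # 12
```

This assumes exactly one nested data frame named type; with several nested data frames the merge step would have to be repeated.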

identify and count duplicate values across multiple columns

I have a dataset with multiple columns that look similar to this:
ID1 ID2 ID3 ID4
Blue Grey Fuchsia Green
Black Blue Orange Blue
Green Green Yellow Pink
Pink Yellow NA Orange
What I want to do is count how many times each value is duplicated across the four columns. For example, this is what I'd like to get back from the above:
ID Replicates
Blue 3
Black 1
Green 3
Pink 2
Grey 1
Yellow 2
Fuchsia 1
Orange 2
I'd also like to be able to ask which ID value is present in the dataset at frequency >2. So the expected result would be: Green and Blue.
Any thoughts on how to do this in R?
Thanks!
Just a regular table is all you need for a data set full of factors.
> ( tab <- table(unlist(data)) )
Black Blue Green Pink Grey Yellow Fuchsia Orange
1 3 3 2 1 2 1 2
Add deparse.level = 2 if you want the table to be named.
It's easily subsetted with [ indexing. Just subset tab such that tab is greater than 2. And you can get the colors with names.
> tab[tab > 2]
Blue Green
3 3
> names(tab[tab > 2])
[1] "Blue" "Green"
There's also an as.data.frame method.
> as.data.frame(tab)
Var1 Freq
1 Black 1
2 Blue 3
3 Green 3
4 Pink 2
5 Grey 1
6 Yellow 2
7 Fuchsia 1
8 Orange 2
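Note that table drops NA by default. A self-contained sketch, assuming the data are stored as character columns (so the table names come out alphabetically rather than in factor-level order), that also shows how to keep the NA count with useNA:

```r
data <- data.frame(ID1 = c("Blue", "Black", "Green", "Pink"),
                   ID2 = c("Grey", "Blue", "Green", "Yellow"),
                   ID3 = c("Fuchsia", "Orange", "Yellow", NA),
                   ID4 = c("Green", "Blue", "Pink", "Orange"),
                   stringsAsFactors = FALSE)

tab <- table(unlist(data))                       # NA is silently dropped
tab_na <- table(unlist(data), useNA = "ifany")   # NA gets its own cell

names(tab[tab > 2])
# [1] "Blue"  "Green"

sort(tab, decreasing = TRUE)   # most frequent values first
```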
A dplyr / tidyr solution:
ID1<-c("Blue", "Black", "Green", "Pink")
ID2<-c("Grey", "Blue", "Green", "Yellow")
ID3<-c("Fuchsia", "Orange", "Yellow", NA)
ID4<-c("Green", "Blue", "Pink", "Orange")
mydf<-data.frame(ID1,ID2,ID3,ID4)
library(dplyr)
library(tidyr)
mydf %>%
  gather(key, value, 1:4) %>%
  group_by(value) %>%
  tally()
value n
1 Black 1
2 Blue 3
3 Fuchsia 1
4 Green 3
5 Grey 1
6 Orange 2
7 Pink 2
8 Yellow 2
9 NA 1
To return those with a frequency higher than 2:
mydf %>%
  gather(key, value, 1:4) %>%
  group_by(value) %>%
  tally() %>%
  filter(n > 2)
value n
1 Blue 3
2 Green 3
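In recent tidyr versions gather is superseded by pivot_longer, so the same pipeline could be sketched as below. This assumes tidyr 1.0 or later and dplyr are installed; counts and frequent are illustrative names, and the NA row is filtered out explicitly:

```r
library(dplyr)
library(tidyr)

mydf <- data.frame(ID1 = c("Blue", "Black", "Green", "Pink"),
                   ID2 = c("Grey", "Blue", "Green", "Yellow"),
                   ID3 = c("Fuchsia", "Orange", "Yellow", NA),
                   ID4 = c("Green", "Blue", "Pink", "Orange"),
                   stringsAsFactors = FALSE)

# Stack all four columns into one, drop NA, count each value
counts <- mydf %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  filter(!is.na(value)) %>%
  count(value)

# Values that occur more than twice
frequent <- filter(counts, n > 2)
frequent
```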
