I would like to append dfToAdd to df, where dfToAdd is missing some of df's columns. An important detail is that df has two types of columns. The columns of the first type are correlated with each other,
e.g. group="A" means name="Group A" and color="Blue". There can't be a combination such as A / Group A / Red.
The columns of the second type are likewise correlated among themselves, e.g.
animal="Dog" action="Bark"
I would like to append this second data frame, which lacks the columns of the first type. Those columns should be filled with every existing combination of the first-type columns, as in the following dfResult (the order of rows doesn't matter):
df = data.frame(group = c("A", "A", "A", "B", "B", "B"),
name = c("Group A", "Group A", "Group A", "Group B", "Group B", "Group B"),
color = c("Blue", "Blue", "Blue", "Red", "Red", "Red"),
animal = c("Dog", "Cat", "Mouse", "Dog", "Cat", "Mouse"),
action = c("Bark", "Meow", "Squeak", "Bark", "Meow", "Squeak")
)
dfToAdd = data.frame(animal = c("Lion", "Bird"),
action = c("Roar", "Chirp"))
dfResult = data.frame(group = c("A", "A", "A", "B", "B", "B", "A", "A", "B", "B"),
name = c("Group A", "Group A", "Group A", "Group B", "Group B", "Group B", "Group A", "Group A", "Group B", "Group B"),
color = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Blue", "Blue", "Red", "Red"),
animal = c("Dog", "Cat", "Mouse", "Dog", "Cat", "Mouse", "Lion", "Bird", "Lion", "Bird"),
action = c("Bark", "Meow", "Squeak", "Bark", "Meow", "Squeak", "Roar", "Chirp", "Roar", "Chirp"))
> df
group name color animal action
1 A Group A Blue Dog Bark
2 A Group A Blue Cat Meow
3 A Group A Blue Mouse Squeak
4 B Group B Red Dog Bark
5 B Group B Red Cat Meow
6 B Group B Red Mouse Squeak
> dfToAdd
animal action
1 Lion Roar
2 Bird Chirp
> dfResult
group name color animal action
1 A Group A Blue Dog Bark
2 A Group A Blue Cat Meow
3 A Group A Blue Mouse Squeak
4 B Group B Red Dog Bark
5 B Group B Red Cat Meow
6 B Group B Red Mouse Squeak
7 A Group A Blue Lion Roar
8 A Group A Blue Bird Chirp
9 B Group B Red Lion Roar
10 B Group B Red Bird Chirp
But the first type of columns (group, name, color) is not completely known. I am working with an arbitrary number of grouping variables. You can imagine that there may or may not be a column such as description="Group A is a good group" or date="2020.04.13". The only columns we know for sure are those of the second type: animal and action.
We could do this in a single %>% pipeline: slice the first row from 'df', select the columns that are not in 'dfToAdd', repeat that row once per row of 'dfToAdd' with uncount, bind the result column-wise with 'dfToAdd', row-bind it to 'df', and then use complete
library(dplyr)
library(tidyr)
library(rlang)
library(purrr)
df %>%
slice(1) %>%
select(-names(dfToAdd)) %>%
uncount(nrow(dfToAdd)) %>%
bind_cols(dfToAdd) %>%
bind_rows(df, .) %>%
complete(nesting(!!! syms(names(dfToAdd))),
nesting(!!! syms(setdiff(names(.), names(dfToAdd)))))
# A tibble: 10 x 5
# animal action group name color
# * <fct> <fct> <fct> <fct> <fct>
# 1 Cat Meow A Group A Blue
# 2 Cat Meow B Group B Red
# 3 Dog Bark A Group A Blue
# 4 Dog Bark B Group B Red
# 5 Mouse Squeak A Group A Blue
# 6 Mouse Squeak B Group B Red
# 7 Bird Chirp A Group A Blue
# 8 Bird Chirp B Group B Red
# 9 Lion Roar A Group A Blue
#10 Lion Roar B Group B Red
While writing this I had the idea to use nesting on both sides of tidyr's complete function and to detect the missing columns manually (maybe there is a more elegant solution):
# First find all grouping columns
groupCols = colnames(df)[!(colnames(df) %in% colnames(dfToAdd))]
otherCols = colnames(df)[colnames(df) %in% colnames(dfToAdd)]
# Populate missing columns with first grouping appearing in the df
dfToAdd[groupCols] = df[1, groupCols]
# rbind it to append
dfResult = rbind(df, dfToAdd)
# Now we have obvious missing combinations. tidyr::complete accepts nesting information,
# so combinations are generated only between the two column sets, not within each set.
library(dplyr) # for %>%
library(rlang) # for syms()
dfResult %>% tidyr::complete(tidyr::nesting(!!! syms(otherCols)), tidyr::nesting(!!! syms(groupCols)))
edit: actually realized that I am using unknown column names at the end. This doesn't work really. I need to feed groupCols (character vector) to second nesting call.
edit2: now thanks to akrun's answer, I can correct this one too.
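For comparison, the same result can be sketched in base R without the placeholder row: take the distinct combinations of the grouping columns and cross-join them with dfToAdd. This is only a sketch, under the assumption that every column of df that is absent from dfToAdd is a grouping column; it relies on merge() returning the Cartesian product of its inputs when by = NULL.

```r
df <- data.frame(group  = c("A", "A", "A", "B", "B", "B"),
                 name   = c("Group A", "Group A", "Group A", "Group B", "Group B", "Group B"),
                 color  = c("Blue", "Blue", "Blue", "Red", "Red", "Red"),
                 animal = c("Dog", "Cat", "Mouse", "Dog", "Cat", "Mouse"),
                 action = c("Bark", "Meow", "Squeak", "Bark", "Meow", "Squeak"))
dfToAdd <- data.frame(animal = c("Lion", "Bird"),
                      action = c("Roar", "Chirp"))

# Assumption: any column of df that dfToAdd lacks is a grouping column
groupCols <- setdiff(names(df), names(dfToAdd))

# Distinct grouping combinations that actually occur in df
groups <- unique(df[groupCols])

# by = NULL makes merge() return every group combined with every new row
newRows <- merge(groups, dfToAdd, by = NULL)

# Align the column order with df, then append
dfResult <- rbind(df, newRows[names(df)])
dfResult
```

Because the grouping combinations are taken from the rows of df, impossible combinations such as A / Group A / Red are never generated.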
I'm aware that the question is awkward. If I could phrase it better, I'd probably find the solution in another thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here is an (edited) answer with new data.
Using data.table is going to help a lot. The idea is to split the df into groups and lapply() to each group what we need. We have to take care of a few things along the way.
library(data.table)
# set as data.table
setDT(df)
# to maintain the ordering, convert group to a factor;
# the levels carry the ordering information used by split
df[, ':='(group = factor(group, levels = unique(df$group)))]
# split df into a list, one element per group
df_list <- split(df, df$group, sorted = FALSE)
# now lapply to each element what you need
df_list <- lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group), x$subgroup))))
# bind into one data.table and remove NAs
rbindlist(df_list)[!is.na(group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here row_number) to sort by:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
mutate(r_n = row_number()) %>%
group_by(value) %>% slice(1) %>%
arrange(r_n) %>%
filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
name value r_n
<chr> <chr> <int>
1 group X 1
2 group F 3
3 subgroup camel 4
4 subgroup horse 6
5 subgroup dog 8
6 subgroup cat 10
7 group C 11
8 subgroup orange 12
9 subgroup banana 14
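Both answers boil down to walking the data row by row, emitting the group and then its subgroup, and keeping the first occurrence of each value. A minimal base-R sketch of that idea, assuming (as in the example) that no subgroup value coincides with a group name:

```r
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
                 subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"),
                 stringsAsFactors = FALSE)

# t() turns each original row into a matrix column; c() then reads the
# matrix column by column, i.e. group1, subgroup1, group2, subgroup2, ...
interleaved <- c(t(as.matrix(df)))

# Drop the NAs and keep only the first occurrence of each value
vals <- unique(interleaved[!is.na(interleaved)])

result <- data.frame(group = vals, stringsAsFactors = FALSE)
result
```

No sorting is involved, so the original row order is preserved.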
My data look like this:
Col1 Col2 Col3
A Dog 3
A Cat 5
A Hat 6
B Dog 8
B Cat 3
B Hat 4
Col1 and Col2 are factors, and A is the first level of Col1.
I want to plot Col2 as a bar graph in descending order by Col3 but where the order of Col2 within the level of the factor defined as A in Col1 takes precedence. That is, I want the data to be graphed as follows (I have flipped the axes so that the values in Col2 are on the y axis, so the bars of the graph would be read from top to bottom):
Col1 Col2 Col3
A Hat 6
B Hat 4
A Cat 5
B Cat 3
A Dog 3
B Dog 8
Right now, I can only get ggplot to display the bars as defined by the largest overall value (8) rather than the largest value within factor level A only (6). So it looks like:
Col1 Col2 Col3
A Dog 3
B Dog 8
A Hat 6
B Hat 4
A Cat 5
B Cat 3
I know I can do this manually by re-specifying the levels of the factor in Col2, but my real data have 40 values for Col2, so it would take a lot of typing. I have ordered and cut down the data frame using arrange(Col1, desc(Col3)) %>% select(Col2) to get a vector that contains the correct ordering of Col2 (right_order = "Hat", "Hat", "Cat", "Cat", "Dog", "Dog"), but I cannot figure out how to use this vector to tell ggplot how to arrange the data. I tried using it in reorder but received the error arguments must have the same length. I have read innumerable questions and tutorials on reordering factor levels for graphing in ggplot, but I cannot find guidance on how to just use the order within one level of the factor (A in Col1) to arrange the graph.
We could arrange after converting to factor with the custom order
library(dplyr)
df1 %>%
arrange(Col1, desc(Col3)) %>%
mutate(Col2 = factor(Col2, levels = unique(Col2))) %>%
arrange(Col2, Col1, desc(Col3))
# Col1 Col2 Col3
#1 A Hat 6
#2 B Hat 4
#3 A Cat 5
#4 B Cat 3
#5 A Dog 3
#6 B Dog 8
data
df1 <- structure(list(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = c("Dog",
"Cat", "Hat", "Dog", "Cat", "Hat"), Col3 = c(3L, 5L, 6L, 8L,
3L, 4L)), class = "data.frame", row.names = c(NA, -6L))
You nearly have the answer (as does #akrun), but I think taking it stepwise is key here. Generally, the approach is the same. First, plot your data (df1):
library(ggplot2)
ggplot(df1, aes(Col2, Col3)) + geom_col()
Then do the arrangement as you specify, noting that the output is a data.frame object, here called d. We then use the unique() values of that column (d$Col2) as the levels when re-factoring df1$Col2:
d <- df1 %>% arrange(Col1, desc(Col3)) %>% select(Col2) # returns a dataframe!
df1$Col2 <- factor(df1$Col2, levels=unique(d$Col2)) # unique values of d$Col2 set to levels of df1$Col2 factor
Then you can plot again and see that the columns are reordered.
I think the issue with #akrun's approach was that doing the factoring inside the piped commands did not work. Take it stepwise: (1) arrange, (2) get your unique ordering from that, (3) refactor.
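If the goal is specifically to order Col2 by Col3 within level A only, the levels can also be derived from the Col1 == "A" rows directly. The following is a sketch, assuming each Col2 value occurs exactly once within A; the plotting call is left as a comment because it needs ggplot2.

```r
df1 <- data.frame(Col1 = c("A", "A", "A", "B", "B", "B"),
                  Col2 = c("Dog", "Cat", "Hat", "Dog", "Cat", "Hat"),
                  Col3 = c(3L, 5L, 6L, 8L, 3L, 4L),
                  stringsAsFactors = FALSE)

# Look only at the rows belonging to level A of Col1
a_rows <- df1[df1$Col1 == "A", ]

# Col2 values of those rows, sorted by descending Col3
lvls <- a_rows$Col2[order(a_rows$Col3, decreasing = TRUE)]

# Use that ordering as the factor levels for the whole data frame
df1$Col2 <- factor(df1$Col2, levels = lvls)
levels(df1$Col2)

# Plotting (requires ggplot2), e.g.:
# ggplot(df1, aes(Col2, Col3, fill = Col1)) + geom_col(position = "dodge")
```

The values of level B play no part in the ordering, so Dog ends up last even though its value of 8 is the largest overall.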
Here is a sample of my data
code  group  type    outcome
11    A      red     M*P
11    N      orange  N*P
11    Z      red     R
12    AB A   blue    Z*P
12    AN B   green   Q*P
12    AA A   gray    AB
which can be created by:
df <- data.frame(
code = c(rep(11,3), rep(12,3)),
group = c("A", "N", "Z", "AB A", "AN B", "AA A"),
type = c("red", "orange", "red", "blue", "green", "gray"),
outcome = c("M*P", "N*P", "R", "Z*P", "Q*P", "AB"),
stringsAsFactors = FALSE
)
I want to get the following table
code  group1  group2  group3  type1  type2   type3  outcome
11    A       N       Z       red    orange  red    MNR
12    AB A    AN B    AA A    blue   green   gray   ZQAB
I have used the following code, but it does not work. I also want to remove the "*P" parts from outcome. Thanks for your help.
dcast(df, formula= code +group ~ type, value.var = 'outcome')
Using data.table to hit your expected output:
library(data.table)
setDT(df)
# Clean out the "*P" parts beforehand
df[, outcome := gsub("*P", "", outcome, fixed = TRUE)]
# dcast, but let's leave the outcome for later... (easier)
wdf <- dcast(df, code ~ rowid(code), value.var = c('group', 'type'))
# Now outcome maneuvering separately by code and merge
merge(wdf, df[, .(outcome = paste(outcome, collapse = "")), code])
   code group_1 group_2 group_3 type_1 type_2 type_3 outcome
1:   11       A       N       Z    red orange    red     MNR
2:   12    AB A    AN B    AA A   blue  green   gray    ZQAB
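The same three steps (strip "*P", spread group and type by position within code, collapse outcome per code) can also be sketched in base R. This is an illustrative alternative rather than the answer's method; the helper column idx is an assumption introduced here, and reshape() names the new columns group.1, type.1, ... rather than group_1, type_1.

```r
df <- data.frame(
  code = c(rep(11, 3), rep(12, 3)),
  group = c("A", "N", "Z", "AB A", "AN B", "AA A"),
  type = c("red", "orange", "red", "blue", "green", "gray"),
  outcome = c("M*P", "N*P", "R", "Z*P", "Q*P", "AB"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "*P" as a literal string, not a regular expression
df$outcome <- gsub("*P", "", df$outcome, fixed = TRUE)

# Position of each row within its code (1, 2, 3, 1, 2, 3)
df$idx <- ave(seq_along(df$code), df$code, FUN = seq_along)

# Spread group and type into one column per position
wide <- reshape(df[c("code", "idx", "group", "type")],
                idvar = "code", timevar = "idx",
                v.names = c("group", "type"), direction = "wide")

# Concatenate the outcomes per code, then join the two pieces
out <- aggregate(outcome ~ code, data = df, FUN = paste, collapse = "")
res <- merge(wide, out, by = "code")
res
```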
Say I have two dataframes, each with four columns. One column is a numeric value. The other three are identifying variables. For example:
set1 <- data.frame(label1 = c("a","b", "c"), label2 = c("red", "white", "blue"), name = c("sam", "bob", "drew"), val = c(1, 10, 100))
set2 <- data.frame(label1 = c("b","c", "d"), label2 = c("white", "green", "orange"), name = c("bob", "drew", "collin"), val = c(7, 100, 15))
Which are:
> set1
label1 label2 name val
1 a red sam 1
2 b white bob 10
3 c blue drew 100
> set2
label1 label2 name val
1 b white bob 7
2 c green drew 100
3 d orange collin 15
The first three columns can be combined to form a primary key. What is the most efficient way to combine these two data frames such that all unique values (from columns label1, label2, name) are displayed along with the two val columns:
set3 <- data.frame(label1 = c("a", "b", "c", "c", "d"), label2 = c("red", "white", "blue", "green", "orange"), name = c("sam", "bob", "drew", "drew", "collin"), val.set1 = c(1, 10, 100, NA, NA), val.set2 = c(NA, 7, NA, 100, 15))
> set3
label1 label2 name val.set1 val.set2
1 a red sam 1 NA
2 b white bob 10 7
3 c blue drew 100 NA
4 c green drew NA 100
5 d orange collin NA 15
>
When thinking of efficiency, you should evaluate the data.table package:
library(data.table)
(merge(
setDT(set1, key=names(set1)[1:3]),
setDT(set2, key=names(set2)[1:3]),
all=T,
suffixes=paste0(".set",1:2)
) -> set3)
# label1 label2 name val.set1 val.set2
# 1: a red sam 1 NA
# 2: b white bob 10 7
# 3: c blue drew 100 NA
# 4: c green drew NA 100
# 5: d orange collin NA 15
Since they're in the same format, you could just rowbind them together and then take only the unique values. Using dplyr:
bind_rows(set1, set2) %>% distinct(label1, label2, name)
You just want to make sure that you don't have factors in there, that everything is a character or numeric.
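Note that distinct(label1, label2, name) keeps only those three key columns, so the two val columns are not part of its result. A full outer join keeps both; here is a sketch with base R's merge(), using the same suffixes as the data.table answer above:

```r
set1 <- data.frame(label1 = c("a", "b", "c"),
                   label2 = c("red", "white", "blue"),
                   name = c("sam", "bob", "drew"),
                   val = c(1, 10, 100),
                   stringsAsFactors = FALSE)
set2 <- data.frame(label1 = c("b", "c", "d"),
                   label2 = c("white", "green", "orange"),
                   name = c("bob", "drew", "collin"),
                   val = c(7, 100, 15),
                   stringsAsFactors = FALSE)

# The three identifying columns together act as the key
keys <- c("label1", "label2", "name")

# all = TRUE keeps keys found in either data frame (full outer join);
# suffixes distinguish the two val columns
set3 <- merge(set1, set2, by = keys, all = TRUE,
              suffixes = c(".set1", ".set2"))
set3
```

Keys present in only one data frame get NA in the other data frame's val column.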
Using the following example dataframe:
a <- c(1:5)
b <- c("Cat", "Dog", "Rabbit", "Cat", "Dog")
c <- c("Dog", "Rabbit", "Cat", "Dog", "Dog")
d <- c("Rabbit", "Cat", "Dog", "Dog", "Rabbit")
e <- c("Cat", "Dog", "Dog", "Rabbit", "Cat")
f <- c("Cat", "Dog", "Dog", "Rabbit", "Cat")
df <- data.frame(a,b,c,d,e,f)
I want to investigate how to reorder the columns WITHOUT having to type in all the column names, i.e., df[,c("a","d","e","f","b","c")]
How would I just say I want columns b and c AFTER column f? (only referencing the columns or range of columns that I want to move?).
Many thanks in advance for your help.
To move specific columns to the beginning or end of a data.frame, use select from the dplyr package and its everything() function. In this example we are sending to the end:
library(dplyr)
df %>%
select(-b, -c, everything())
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog
Without the negation, the columns would be sent to the front.
If you're just moving certain columns to the end, you can create a little helper-function like the following:
movetolast <- function(data, move) {
data[c(setdiff(names(data), move), move)]
}
movetolast(df, c("b", "c"))
# a d e f b c
# 1 1 Rabbit Cat Cat Cat Dog
# 2 2 Cat Dog Dog Dog Rabbit
# 3 3 Dog Dog Dog Rabbit Cat
# 4 4 Dog Rabbit Rabbit Cat Dog
# 5 5 Rabbit Cat Cat Dog Dog
I would not recommend getting too into the habit of using column positions, especially not from a programmatic standpoint, since those positions might change.
"For fun" update
Here's an extended interpretation of the above function. It allows you to move columns to either the first or last position, or to be before or after another column.
moveMe <- function(data, tomove, where = "last", ba = NULL) {
temp <- setdiff(names(data), tomove)
x <- switch(
where,
first = data[c(tomove, temp)],
last = data[c(temp, tomove)],
before = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
data[append(temp, values = tomove, after = (match(ba, temp)-1))]
},
after = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
data[append(temp, values = tomove, after = (match(ba, temp)))]
})
x
}
Try it with the following.
moveMe(df, c("b", "c"))
moveMe(df, c("b", "c"), "first")
moveMe(df, c("b", "c"), "before", "e")
moveMe(df, c("b", "c"), "after", "e")
You'll need to adapt it to have some error checking--for instance, if you try to move columns "b" and "c" to "before c", you'll (obviously) get an error.
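One possible shape for that error checking is a small validation step run before the columns are moved. The helper below, checkMove, is hypothetical (it is not part of the function above) and only sketches two checks: that all named columns exist, and that the reference column ba is not itself among the columns being moved.

```r
# Hypothetical helper: validate the arguments before calling moveMe()
checkMove <- function(data, tomove, ba = NULL) {
  # Every column mentioned must exist in the data
  unknown <- setdiff(c(tomove, ba), names(data))
  if (length(unknown) > 0)
    stop("columns not found in data: ", paste(unknown, collapse = ", "))
  # The reference column cannot be one of the columns being moved
  if (!is.null(ba) && any(ba %in% tomove))
    stop("ba must not be one of the columns being moved")
  invisible(TRUE)
}

df <- data.frame(a = 1:2, b = c("Cat", "Dog"),
                 c = c("Dog", "Rabbit"), e = c("Cat", "Dog"))

checkMove(df, c("b", "c"), "e")   # passes silently

# Moving "b" and "c" to "before c" is rejected with a clear message
msg <- tryCatch(checkMove(df, c("b", "c"), "c"),
                error = function(e) conditionMessage(e))
msg
```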
You can refer to columns by position. e.g.
df <- df[ ,c(1,4:6,2:3)]
> df
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog
The function dplyr::relocate, a verb introduced in dplyr 1.0.0, does exactly what you are looking for with highly readable syntax.
df %>% dplyr::relocate(b, c, .after = f)
To generalize the reshuffling of columns into any order using dplyr, for example to go from
df <- data.frame(a,b,c,d,e,f)
to
df[,c("a","d","e","f","b","c")]
use select with column ranges:
df %>% select(a, d:f, b:c)
Use the subset function:
> df <- data.frame(a,b,c,d,e,f)
> df <- subset(df, select = c(a, d:f, b:c))
> df
a d e f b c
1 1 Rabbit Cat Cat Cat Dog
2 2 Cat Dog Dog Dog Rabbit
3 3 Dog Dog Dog Rabbit Cat
4 4 Dog Rabbit Rabbit Cat Dog
5 5 Rabbit Cat Cat Dog Dog
I adapted the previous function to work with a data.table, using the setcolorder function from the data.table package.
moveMeDataTable <-function(data, tomove, where = "last", ba = NULL) {
temp <- setdiff(names(data), tomove)
x <- switch(
where,
first = setcolorder(data,c(tomove, temp)),
last = setcolorder(data,c(temp, tomove)),
before = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
order = append(temp, values = tomove, after = (match(ba, temp)-1))
setcolorder(data,order)
},
after = {
if (is.null(ba)) stop("must specify ba column")
if (length(ba) > 1) stop("ba must be a single character string")
order = append(temp, values = tomove, after = (match(ba, temp)))
setcolorder(data,order)
})
x
}
library(data.table)
DT <- data.table(A=sample(3, 10, TRUE),
B=sample(letters[1:3], 10, TRUE), C=sample(10))
DT <- moveMeDataTable(DT, "C", "after", "A")
Here is another option:
df <- cbind( df[, -(2:3)], df[, 2:3] )