I would like to sort/arrange data by group. That's easy enough. However, I only want to sort values within specific groups, not all groups.
I found one possible instance of a similar question at the link. But I found it to be confusing due to the framing of the question by the OP.
Arrange values within a specific group
Sample data:
df <- data.frame(var = c("apple", "banana", "eggplant", "carrot", "dill", "fava", "garlic"),
grp = c("A", "A", "B", "B", "B", "C", "C"),
val = c(4, 2, 1, 3, 7, 6, 2))
df
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 fava C 6
# 7 garlic C 2
Desired output:
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 garlic C 2
# 7 fava C 6
Partial solution:
library(dplyr)
df %>%
group_by(grp) %>%
arrange(val, .by_group = TRUE)
This of course sorts for all groups. How do I get it to sort for only the groups I would like sorted, which are "B" and "C"? I would like a tidyverse solution but feel free to post a base solution as well.
We can flip the sign of the 'val' elements that belong to group "A", so that they are ordered in the opposite direction from the 'val' elements in the other groups
library(dplyr)
df %>%
arrange(grp, val * c(1, -1)[(grp == 'A') + 1])
-output
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or if the values for 'A' should be kept as they are, then multiply by 0 so that every value is the same for 'A'
df %>%
arrange(grp, val * c(1, 0)[(grp == 'A') + 1])
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
NOTE: This is done without any group_by attribute
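To see why the multiplier trick works, it can help to look at the sort key on its own. The sketch below rebuilds the sample columns as plain vectors (the names `mult`, `key` and `ord` are only for illustration) and uses base R's order(), which leaves tied rows in their original order:

```r
# Sample columns from the question, as plain vectors
grp <- c("A", "A", "B", "B", "B", "C", "C")
val <- c(4, 2, 1, 3, 7, 6, 2)

# (grp == "A") + 1 is 2 for group "A" and 1 otherwise,
# so the multiplier is 0 for "A" and 1 for every other group
mult <- c(1, 0)[(grp == "A") + 1]

# Every "A" row gets the same key (0); other rows keep their value
key <- val * mult

# Sort by group, then by key; tied "A" rows stay in input order
ord <- order(grp, key)
ord
# 1 2 3 4 5 7 6
```

Because grp is the first sort key, the constant key for "A" is never compared with keys from other groups.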
If we want to use the OP's way, i.e. using group_by
df %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ -1 * val, TRUE ~ val),
.by_group = TRUE) %>%
ungroup
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
If the values in 'val' for grp 'A' appear in descending order only by coincidence, then create a sequence column before grouping and use that as the sort key for 'A'
df %>%
mutate(rn = row_number()) %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ as.numeric(rn), TRUE ~ val),
.by_group = TRUE) %>%
ungroup %>%
dplyr::select(-rn)
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or using base R
df[with(df, order(grp, c(1, 0)[(grp == 'A') + 1] * val)),]
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
7 garlic C 2
6 fava C 6
You can filter the groups you want to arrange, sort them and bind to the remaining data.
library(dplyr)
order_groups <- c('B', 'C')
df %>%
filter(grp %in% order_groups) %>%
arrange(grp, val) %>%
bind_rows(df %>%
filter(!grp %in% order_groups)) %>%
arrange(grp)
#  var grp val
#1 apple A 4
#2 banana A 2
#3 eggplant B 1
#4 carrot B 3
#5 dill B 7
#6 garlic C 2
#7 fava C 6
Related
I have a specific filtering question. Here is what my sample dataset looks like:
df <- data.frame(id = c(1,2,3,3,4,5),
cat= c("A","A","A","B","B","B"))
> df
id cat
1 1 A
2 2 A
3 3 A
4 3 B
5 4 B
6 5 B
Grouping by id, when an id has multiple categories in cat, I would like to keep only cat A. So the desired output would be:
> df.1
id cat
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Any ideas?
Thanks!
If there are only two groups in cat, we can use the following logic:
df %>%
group_by(id) %>%
filter(! (n() == 2 & cat == "B"))
# A tibble: 5 x 2
# Groups: id [5]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
When there are multiple other letters possible
df <- data.frame(id = c(1,2,3,3,4,5,6,6,6,7),
cat= c("A","A","A","B","B","B", "A", "B", "C","D"))
df %>%
group_by(id) %>%
filter(! (n() >= 2 & cat %in% LETTERS[2:26]))
# A tibble: 7 x 2
# Groups: id [7]
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
6 6 A
7 7 D
Explanation: n() gives the current group size. When a group has more than one row, we drop the rows that match the unwanted category ("B" in the first example) and keep the rest.
In this example you can take the first item from each group. In other situations you may need to arrange() the rows first.
(using dplyr)
df %>% group_by(id) %>% summarise(cat = first(cat))
Base R:
aggregate(
df$cat,
by = list(id = df$id),
FUN = \(x) {
unx <- unique(x)
if (length(unx) > 1) 'A' else unx
}
)
# id x
# 1 1 A
# 2 2 A
# 3 3 A
# 4 4 B
# 5 5 B
One approach with dplyr. After grouping by id, filter where there is only one row per id or cat is "A".
library(dplyr)
df %>%
group_by(id) %>%
filter(n() == 1 | cat == "A")
Output
id cat
<dbl> <chr>
1 1 A
2 2 A
3 3 A
4 4 B
5 5 B
Also, if it is possible to have the same cat repeated within a single id, you can filter where the number of distinct cat is 1 (or keep if cat is "A"):
df %>%
group_by(id) %>%
filter(n_distinct(cat) == 1 | cat == "A")
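The difference between the two conditions only shows up when a category repeats within an id. A minimal base R sketch on made-up data (the data frame `dup` and the helper vectors are hypothetical, not from the question):

```r
# Hypothetical data: id 1 has the same category twice, id 2 has two categories
dup <- data.frame(id = c(1, 1, 2, 2),
                  cat = c("B", "B", "A", "B"),
                  stringsAsFactors = FALSE)

idx <- seq_len(nrow(dup))

# Rows per id (what n() returns inside a grouped filter)
n_rows <- ave(idx, dup$id, FUN = length)

# Distinct categories per id (what n_distinct(cat) returns)
n_cats <- ave(idx, dup$id, FUN = function(i) length(unique(dup$cat[i])))

# n() == 1 | cat == "A" drops id 1 entirely
dup[n_rows == 1 | dup$cat == "A", ]

# n_distinct(cat) == 1 | cat == "A" keeps both rows of id 1
dup[n_cats == 1 | dup$cat == "A", ]
```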
Using base R
subset(df, cat == 'A'|id %in% names(which(table(id) == 1)))
id cat
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
Consider this df (the one I'm working with is much, much bigger)
set.seed(13)
test <- tibble(A = as.factor(seq(1:10)),
B = as.factor(sample(c("Apple", "Banana"), 10, replace = T)),
C = as.factor(sample(c("Cut", "Mashed"), 10, replace = T)),
D = as.factor(sample(seq(1:3), 10, replace = T)))
I need to create another numeric variable but the data of the new variable needs to be the same where the levels of the other variables are equal. Let me illustrate.
When I do this, or any other method I tried to find
test %>%
group_by(B,C,D) %>%
mutate(E = sample(seq(0.01:100, 0.01), 10, replace = T))
I get an error message,
The result I'm after is the following, I need to use sample or a random creator function
A B C D E
> <fct> <fct> <fct> <fct> <fct>
> 1 1 Banana Mashed 3 0.2
> 2 2 Apple Cut 1 4
> 3 3 Banana Mashed 1 5
> 4 4 Apple Mashed 2 3
> 5 5 Banana Cut 1 1.3
> 6 6 Apple Cut 3 4.7
> 7 7 Banana Mashed 1 5
> 8 8 Banana Mashed 1 5
> 9 9 Banana Cut 3 3.2
> 10 10 Banana Cut 3 3.2
So rows 9 and 10, and rows 3, 7 and 8, need to be exactly the same because the levels are the same across the variables B, C and D.
Any idea how to do this?
If I am understanding correctly, you want something like this. Basically you want to create your new column on the distinct values of your factor groups, and then join it back in so that they all have the same values.
library(dplyr)
new_values <- test %>%
distinct(B, C, D) %>%
mutate(E = sample(seq(0.01, 100, 0.01), n(), replace = T))
test %>%
left_join(new_values, by = c("B", "C", "D"))
# # A tibble: 10 x 5
# A B C D E
# <fct> <fct> <fct> <fct> <dbl>
# 1 1 Banana Mashed 3 68.0
# 2 2 Apple Cut 1 16.4
# 3 3 Banana Mashed 1 80.2
# 4 4 Apple Mashed 2 74.4
# 5 5 Banana Cut 1 1.53
# 6 6 Apple Cut 3 27.8
# 7 7 Banana Mashed 1 80.2
# 8 8 Banana Mashed 1 80.2
# 9 9 Banana Cut 3 83.4
# 10 10 Banana Cut 3 83.4
You can also do something like this with group_modify(), but it will sort your rows and reorder your columns based on the groups. This code will iterate through each group, add a column E based on a sample of size 1, and then restack all of the resulting groups back into a data frame.
test %>%
group_by(B, C, D) %>%
group_modify(~ mutate(.x, E = sample(seq(0.01, 100, 0.01), 1, replace = T)))
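The same idea, one draw per distinct combination that is then looked up for every row, can also be sketched in base R. This is only an illustration on a small made-up data frame (`toy`, `key` and `vals` are hypothetical names), not the code from the answer:

```r
set.seed(13)

# Hypothetical data: rows 1 and 3 share the same B/C/D combination
toy <- data.frame(B = c("Apple", "Banana", "Apple", "Banana"),
                  C = c("Cut", "Cut", "Cut", "Mashed"),
                  D = c(1, 1, 1, 2))

# One factor level per combination that actually occurs
key <- interaction(toy$B, toy$C, toy$D, drop = TRUE)

# Draw one random value per combination ...
vals <- sample(seq(0.01, 100, 0.01), nlevels(key), replace = TRUE)

# ... and look it up for each row via the integer codes of the factor
toy$E <- vals[as.integer(key)]
toy
```

Unlike group_modify(), this keeps the original row and column order.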
I have a list of data frames, and I wish to add a new column whose values are unique across the whole list. For example, I want to add a column named id that counts up from 1.
Naturally, on a data frame I would do this:
dat %>% mutate(id = row_number())
For a list, I would do this:
dat %>% map(~ mutate(., id = row_number()))
And I would get an output like so:
dat <- list(data.frame(x=c("a", "b" ,"c", "d", "e" ,"f" ,"g") ), data.frame(y=c("p", "lk", "n", "m", "g", "f", "t")))
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 1
2 lk 2
3 n 3
4 m 4
5 g 5
6 f 6
7 t 7
However, how would I add a new column id such that the row numbers continue from the first list element?
Expected output:
[[1]]
x id
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
7 g 7
[[2]]
y id
1 p 8
2 lk 9
3 n 10
4 m 11
5 g 12
6 f 13
7 t 14
An option is to bind them into a single dataset, create the 'id' with row_number(), split by 'grp', loop over the list and remove any columns that have all NA values
library(dplyr)
library(purrr)
dat %>%
bind_rows(.id = 'grp') %>%
mutate(id = row_number()) %>%
group_split(grp) %>%
map(~ .x %>%
select(where(~ any(!is.na(.))), -grp))
-output
#[[1]]
# A tibble: 7 x 2
# x id
# <chr> <int>
#1 a 1
#2 b 2
#3 c 3
#4 d 4
#5 e 5
#6 f 6
#7 g 7
#[[2]]
# A tibble: 7 x 2
# y id
# <chr> <int>
#1 p 8
#2 lk 9
#3 n 10
#4 m 11
#5 g 12
#6 f 13
#7 t 14
Or an easier approach is to unlist (assuming single column), get the sequence, add a new column with map2
map2(dat, relist(seq_along(unlist(dat)), skeleton = dat),
~ .x %>% mutate(id = .y))
Or using a for loop
dat[[1]]$id <- seq_len(nrow(dat[[1]]))
for(i in seq_along(dat)[-1]) dat[[i]]$id <-
seq(tail(dat[[i-1]]$id, 1) + 1, length.out = nrow(dat[[i]]), by = 1)
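Another way to think about it is as a running offset: each data frame starts counting where the previous one stopped. A sketch in base R, assuming `dat` is the list from the question (`n`, `offsets` and `res` are names made up here):

```r
dat <- list(data.frame(x = c("a", "b", "c", "d", "e", "f", "g")),
            data.frame(y = c("p", "lk", "n", "m", "g", "f", "t")))

# Number of rows in each data frame
n <- vapply(dat, nrow, integer(1))

# Rows that come before each data frame: 0 for the first, 7 for the second
offsets <- cumsum(c(0L, head(n, -1)))

# Add the offset to the within-data-frame row number
res <- Map(function(d, off) {
  d$id <- seq_len(nrow(d)) + off
  d
}, dat, offsets)

res[[2]]$id
# 8 9 10 11 12 13 14
```

This does not require the data frames to have a single column or the same number of rows.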
I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column indexes of df[grep("abc",colnames(df),fixed=TRUE)] for each row:
sapply(paste0(df$a_or_b,"h"),grep,colnames(df[grep("abc",colnames(df),fixed=TRUE)]))
First we gather your data into a tidy long format, then break out the columns into something useful. After that the filtering is simple, and if necessary we can convert back to a difficult wide format:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
A base R approach is to use lapply: we loop through each row of the data frame, build a pattern from the a_or_b column with paste0 to find the matching columns, and then rbind the values for all rows together.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can then cbind the required columns:
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
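A vectorized variant of the same lookup avoids the row-by-row loop by indexing a numeric matrix with (row, column) pairs. This is a sketch, assuming the column names always follow the abc_<a|b>h_<suffix> pattern; only the abc columns of the question's data are reproduced, and `suffixes`, `abc`, `combo` and `df2` are names chosen here:

```r
df <- data.frame(
  name = c("Fred", "Mark", "Jen", "Simon", "Ed"),
  a_or_b = c("a", "a", "b", "a", "b"),
  abc_ah_one = c(3, 5, 2, 4, 7), abc_bh_one = c(5, 4, 1, 9, 8),
  abc_ah_two = c(2, 1, 3, 7, 6), abc_bh_two = c(3, 6, 8, 8, 5),
  abc_ah_three = c(5, 4, 7, 6, 2), abc_bh_three = c(9, 7, 2, 1, 4),
  stringsAsFactors = FALSE
)

suffixes <- c("one", "two", "three")
abc <- as.matrix(df[grep("^abc_", names(df))])  # numeric matrix of abc columns
rows <- seq_len(nrow(df))

# For each suffix, build the wanted column name per row and pick
# the matching cell with (row, column) matrix indexing
combo <- sapply(suffixes, function(s) {
  wanted <- paste0("abc_", df$a_or_b, "h_", s)
  abc[cbind(rows, match(wanted, colnames(abc)))]
})
colnames(combo) <- paste0("combo_", suffixes)

df2 <- data.frame(df[c("name", "a_or_b")], combo)
df2
```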
It's possible to do this with mutate and if_else, picking the "ah" or "bh" column for each row based on a_or_b:
library(dplyr)
df %>%
  select(name, a_or_b, starts_with("abc")) %>%
  # drop the "abc_" prefix from the value columns
  rename_with(~ sub("abc_", "", .x), starts_with("abc")) %>%
  # take the "ah" value where a_or_b is "a", otherwise the "bh" value
  mutate(combo_one = if_else(a_or_b == "a", ah_one, bh_one),
         combo_two = if_else(a_or_b == "a", ah_two, bh_two),
         combo_three = if_else(a_or_b == "a", ah_three, bh_three)) %>%
  select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4
I have a messy table which has a single column that contains multiple category labels, separated by several delimiters. I want to use R to split that column at each delimiter, and create a new column for each category label. The methods I have seen can only split at one delimiter at a time.
My current table looks like this:
my_table = read.csv("./my_table.csv")
# > my_table
# ID TYPE TEXT
# 1 1 a blue water
# 2 2 a,b,c fresh water
# 3 3 a;b,f cold stream
# 4 4 f, b and c lovely sunset
# 5 5 b;c up there
I want a table that looks like this:
# ID A B C D TEXT
# 1 1 a blue water
# 2 2 a b c fresh water
# 3 3 a b d cold stream
# 4 4 b c d lovely sunset
# 5 5 b c up there
Here is what I have tried:
my_table1 <- my_table %>%
separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
# ID A B TEXT
# 1 1 a <NA> blue water
# 2 2 a b fresh water
# 3 3 a;b f cold stream
# 4 4 f b and c lovely sunset
# 5 5 b;c <NA> up there
my_table2 <- my_table1 %>%
separate(A, c('A', 'C' ), ";")
# > docs2
# ID A C B TEXT
# 1 1 a <NA> <NA> blue water
# 2 2 a <NA> b fresh water
# 3 3 a b f cold stream
# 4 4 f <NA> b and c lovely sunset
# 5 5 b c <NA> up there
my_table3 <- my_table2 %>%
separate(A, c('A', 'D'), "and")
# > docs3
# ID A D C B TEXT
# 1 1 a <NA> <NA> <NA> blue water
# 2 2 a <NA> <NA> b fresh water
# 3 3 a <NA> b f cold stream
# 4 4 f <NA> <NA> b and c lovely sunset
# 5 5 b <NA> c <NA> up there
This gets me close, but the column names are off. Plus, I don't want to have to guess about where the string "b and c" ends up after a couple iterations. I have thousands of rows and maybe five or six categories. My guess is that there is a simpler way to do this.
As an alternative and to extend your tidyverse attempt, here is a solution using strsplit and unnest:
library(dplyr)
library(tidyr)
df %>%
  mutate(
    val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
  unnest(val) %>%
  select(-TYPE) %>%
  group_by(ID, TEXT) %>%
  mutate(n = 1:n()) %>%
  spread(n, val)
## A tibble: 5 x 5
## Groups: ID, TEXT [5]
# ID TEXT `1` `2` `3`
# <int> <fct> <chr> <chr> <chr>
#1 1 blue water a NA NA
#2 2 fresh water a b c
#3 3 cold stream a b f
#4 4 lovely sunset f b c
#5 5 up there b c NA
Note that this is not exactly the same as your expected output. It does, however, match @MKR's output.
Sample data
df <- read.table(text =
"ID TYPE TEXT
1 1 'a' 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'")
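To see what a multi-delimiter split does on its own, here is a small base R sketch on the TYPE values from the question. The `\\b` word boundaries are an addition here, so that "and" inside a longer label is not treated as a delimiter; `type`, `parts`, `width` and `padded` are names made up for the example:

```r
type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")

# Split on ";", "," or the whole word "and", swallowing surrounding spaces
parts <- strsplit(type, "\\s*(;|,|\\band\\b)\\s*", perl = TRUE)

# Pad every set of pieces with NA so all rows have the same width
width <- max(lengths(parts))
padded <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
padded
```

Each row of `padded` holds the labels of one record, which can then be bound back to the ID and TEXT columns.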
The cSplit function from the splitstackshape package can make the problem easier to solve. One approach:
library(splitstackshape)
# First use `gsub` to replace other delimiter and have only ',' delimiter.
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)
Mod_df <- cSplit(my_table, "TYPE", sep = ",")
Mod_df
# ID TEXT TYPE_1 TYPE_2 TYPE_3
# 1: 1 blue water a NA NA
# 2: 2 fresh water a b c
# 3: 3 cold stream a b f
# 4: 4 lovely sunset f b c
# 5: 5 up there b c NA
tidyr::gather and spread can then be used to get the format mentioned by the OP:
library(dplyr)
library(tidyr)
gather(Mod_df, key, value, -ID,-TEXT) %>% mutate_if(is.factor, as.character) %>%
mutate(K = toupper(value)) %>%
select(-key) %>%
filter(!is.na(K)) %>%
spread(K, value)
# ID TEXT A B C F
# 1 1 blue water a <NA> <NA> <NA>
# 2 2 fresh water a b c <NA>
# 3 3 cold stream a b <NA> f
# 4 4 lovely sunset <NA> b c f
# 5 5 up there <NA> b c <NA>
Data
my_table <- read.table(text =
" ID TYPE TEXT
1 1 a 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'",
header = TRUE, stringsAsFactors = FALSE)