dplyr reordering rows by string - r

I have the following data:
library(tidyverse)
d1 <- tibble(Nat = c("UK", "UK", "UK", "NONUK", "NONUK", "NONUK"),
Type = c("a", "b", "c", "a", "b", "c"))
I would like to rearrange the rows so the dataframe looks like this:
d2 <- tibble(
Nat = c("UK", "UK", "UK", "NONUK", "NONUK", "NONUK"),
Type = c("b", "c", "a", "b", "c", "a"))
So the UK and non-UK grouping remains, but the 'Type' rows have shifted. This question is quite like this one: Reorder rows conditional on a string variable
However, the answer there depends on the rows you are reordering being in alphabetical order (excluding London). Is there a way to reorder a string column more specifically, where you select the order of the rows yourself rather than relying on it being alphabetical? Is there a way to do this using dplyr?
Thanks!

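You could use match within each group. Note that this overwrites Type in place rather than moving whole rows, so it is only safe when there are no other columns tied to Type: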
string_order <- c("b", "c", "a")
d1 %>%
group_by(Nat) %>%
mutate(Type = Type[match(string_order, Type)]) %>%
ungroup()
# A tibble: 6 x 2
# Nat Type
# <chr> <chr>
#1 UK b
#2 UK c
#3 UK a
#4 NONUK b
#5 NONUK c
#6 NONUK a

What about making the levels explicit in a dplyr chain, so that you choose the order yourself:
library(dplyr)
d1 %>%
arrange(factor(.$Nat, levels = c("UK", "NONUK")), factor(.$Type, levels = c("c", "b","a")))
# A tibble: 6 x 2
Nat Type
<chr> <chr>
1 UK c
2 UK b
3 UK a
4 NONUK c
5 NONUK b
6 NONUK a
Another example:
d1 %>%
arrange(factor(.$Nat, levels = c("UK", "NONUK")), factor(.$Type, levels = c("b", "c","a")))
# A tibble: 6 x 2
Nat Type
<chr> <chr>
1 UK b
2 UK c
3 UK a
4 NONUK b
5 NONUK c
6 NONUK a
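If you would rather not build factors, a similar result can probably be had in base R by ordering on match() positions. This is only a sketch under assumptions: nat_order, type_order and the extra score column are names made up for the illustration and are not part of the question.

```r
# Sample data from the question, plus a hypothetical `score` column
# to show that whole rows move together.
d1 <- data.frame(
  Nat   = c("UK", "UK", "UK", "NONUK", "NONUK", "NONUK"),
  Type  = c("a", "b", "c", "a", "b", "c"),
  score = 1:6,
  stringsAsFactors = FALSE
)

# The order we want, chosen by hand (assumed names).
nat_order  <- c("UK", "NONUK")
type_order <- c("b", "c", "a")

# match() turns each value into its position in the desired order;
# order() then sorts rows by Nat position first, Type position second.
d2 <- d1[order(match(d1$Nat, nat_order), match(d1$Type, type_order)), ]
rownames(d2) <- NULL
d2
```

Because the rows themselves are reordered, any other columns stay attached to their original Nat/Type pair.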

Related

R How to compute a unique index for two character variables with varying columns? [duplicate]

I'm not sure if I phrased my question properly, so let me give a simplified example:
Given a dataset as follows:
dat <- tibble(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))
How can I compute a pair variable that represents whatever is in X and Y in a given row, but without generating duplicates, as here:
dat$pair <- c("A-B", "A-B", "B-C", "C-A", "C-A")
dat
# A tibble: 5 × 3
X Y pair
<chr> <chr> <chr>
1 A B A-B
2 B A A-B
3 B C B-C
4 C A C-A
5 A C C-A
I can compute a pairing with paste0, but it will introduce duplicates (C-A is the same as A-C for me) that I want to avoid:
> dat <- mutate(dat, pair = paste0(X, "-", Y))
> dat
# A tibble: 5 × 3
X Y pair
<chr> <chr> <chr>
1 A B A-B
2 B A B-A
3 B C B-C
4 C A C-A
5 A C A-C
We can use pmin and pmax to pick the smaller and larger value of each row element-wise, and then paste them.
transform(dat, pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
# X Y pair
#1 A B A-B
#2 B A A-B
#3 B C B-C
#4 C A A-C
#5 A C A-C
If you prefer dplyr this can be written as -
library(dplyr)
dat %>% mutate(pair = paste(pmin(X, Y), pmax(X, Y), sep = '-'))
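As a rough sketch of how this could be wrapped up for reuse: make_pair() below is a hypothetical helper name (not an existing function), and the as.character() coercion is an assumption added so that factor columns are compared as strings.

```r
# make_pair() is a made-up helper for illustration only.
make_pair <- function(x, y, sep = "-") {
  x <- as.character(x)  # coerce in case the columns are factors
  y <- as.character(y)
  # pmin/pmax work element-wise: smaller value first, larger value second
  paste(pmin(x, y), pmax(x, y), sep = sep)
}

dat <- data.frame(X = c("A", "B", "B", "C", "A"),
                  Y = c("B", "A", "C", "A", "C"),
                  stringsAsFactors = TRUE)

dat$pair <- make_pair(dat$X, dat$Y)

# Number of distinct unordered pairs
n_unique <- length(unique(dat$pair))
dat
```

With this, "C-A" and "A-C" both end up as "A-C", so the pair column can be used as a grouping key.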
I sorted the two values within each row before pasting them:
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))
library(dplyr)
dat %>%
rowwise %>%
mutate(pair = paste0(sort(c(as.character(X), as.character(Y)), decreasing = FALSE), collapse = '-')) %>%
ungroup
Output:
X Y pair
<fct> <fct> <chr>
1 A B A-B
2 B A A-B
3 B C B-C
4 C A A-C
5 A C A-C
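The row-wise sort idea also extends to more than two columns, where pmin/pmax no longer give the full ordering. A minimal base R sketch, assuming a hypothetical third column Z and made-up names dat3 and key_cols:

```r
# Hypothetical data with three key columns (Z is not in the question).
dat3 <- data.frame(X = c("A", "B", "C"),
                   Y = c("B", "A", "A"),
                   Z = c("C", "C", "B"),
                   stringsAsFactors = FALSE)

key_cols <- c("X", "Y", "Z")

# For each row: sort the values, then collapse them into one string.
# unname() drops any row names that apply() may carry over.
dat3$key <- unname(apply(dat3[key_cols], 1,
                         function(r) paste(sort(r), collapse = "-")))
dat3
```

All three rows contain the same set of letters, so they receive the same key regardless of column order.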
With dplyr and tidyr you could try:
library(dplyr)
library(tidyr)
dat %>%
rowwise() %>%
mutate(pair = list(c(X, Y)),
pair = list(sort(pair)),
pair = list(paste(pair, collapse = "-"))) %>%
select(pair) %>%
distinct() %>%
unnest(pair)
#> # A tibble: 3 x 1
#> pair
#> <chr>
#> 1 A-B
#> 2 B-C
#> 3 A-C
Created on 2021-08-27 by the reprex package (v2.0.0)
data
dat <- data.frame(X = c("A", "B", "B", "C", "A"),
Y = c("B", "A", "C", "A", "C"))

Look-up table with different values for each column

I've got one table that is a set of all of my columns, their possible corresponding values, and the description for each one of those values. For example, the table looks like this:
ID Column Value Description
1 Age A Age_20-30
2 Age B Age_30-50
3 Age C Age_50-75
4 Geo A Big_City
5 Geo B Suburbs
6 Geo C Rural_Town
And so on. Next, I have my main data frame, which is populated with the column values. What I'd like to do is replace all values in each column with their corresponding description.
Old:
ID Age Geo
1 A B
2 A A
3 C A
4 B C
5 C C
New:
ID Age Geo
1 Age_20-30 Suburbs
2 Age_20-30 Big_City
3 Age_50-75 Big_City
4 Age_30-50 Rural_Town
5 Age_50-75 Rural_Town
Now I know how I can do this for one column using the following (where lookup_df is a table for only one of my columns):
old <- lookup_df$Value
new <- lookup_df$Description
df$Age <- new[match(df$Age, old)]
But I am struggling to do this for all columns. My full set of data has >100 columns so doing this manually for each column isn't really an option (at least in terms of efficiency). Any help or pointers in the right direction would be a huge help.
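We can split the first dataset into a list of named vectors, one per column, and use those to look up and replace the values in the second dataset. Note that Map pairs the columns of df2[-1] with the elements of lst1 by position, so this relies on both being in the same order (here Age, then Geo).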
lst1 <- lapply(split(df1[c('Value', 'Description')], df1$Column),
function(x) setNames(x$Description, x$Value))
df2[-1] <- Map(function(x, y) y[x], df2[-1], lst1)
-output
df2
# ID Age Geo
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town
data
df1 <- structure(list(ID = 1:6, Column = c("Age", "Age", "Age", "Geo",
"Geo", "Geo"), Value = c("A", "B", "C", "A", "B", "C"),
Description = c("Age_20-30",
"Age_30-50", "Age_50-75", "Big_City", "Suburbs", "Rural_Town"
)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = 1:5, Age = c("A", "A", "C", "B", "C"), Geo = c("B",
"A", "A", "C", "C")), class = "data.frame", row.names = c(NA,
-5L))
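Since the positional pairing could silently mismatch when the data columns are not in the same order as the split groups, here is a hedged variant that looks up each lookup vector by column name instead. The names lookups and cols are made up for this sketch, and the column order of df2 is deliberately shuffled to show the difference.

```r
df1 <- data.frame(
  Column = c("Age", "Age", "Age", "Geo", "Geo", "Geo"),
  Value  = c("A", "B", "C", "A", "B", "C"),
  Description = c("Age_20-30", "Age_30-50", "Age_50-75",
                  "Big_City", "Suburbs", "Rural_Town"),
  stringsAsFactors = FALSE
)
# Same data as in the question, but with Geo before Age.
df2 <- data.frame(ID = 1:5,
                  Geo = c("B", "A", "A", "C", "C"),
                  Age = c("A", "A", "C", "B", "C"),
                  stringsAsFactors = FALSE)

# One named vector (Value -> Description) per column name.
lookups <- lapply(split(df1, df1$Column),
                  function(x) setNames(x$Description, x$Value))

# Only touch columns that actually have a lookup table.
cols <- intersect(names(df2), names(lookups))

# Pick the lookup by column *name*; unmatched codes become NA.
df2[cols] <- lapply(cols, function(nm) unname(lookups[[nm]][df2[[nm]]]))
df2
```

Columns without an entry in the lookup table (such as ID) are left untouched.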
To do this on data with a lot of columns, you can reshape the data to long format, join it with the first data frame, and (if needed) reshape it back to wide format.
library(dplyr)
library(tidyr)
df2 %>%
pivot_longer(cols = -ID) %>%
left_join(df1 %>% select(-ID),
by = c('name' = 'Column', 'value' = 'Value')) %>%
select(-value) %>%
pivot_wider(names_from = name, values_from = Description)
# ID Age Geo
# <int> <chr> <chr>
#1 1 Age_20-30 Suburbs
#2 2 Age_20-30 Big_City
#3 3 Age_50-75 Big_City
#4 4 Age_30-50 Rural_Town
#5 5 Age_50-75 Rural_Town

R: group_id by changing row values

1) Firstly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
desired_id=c(1,1,1,2,2,2,3,3,3))
How do I generate the desired_id column?
My groups are assigned by row order.
That is, every time the value column changes, I want the next higher group index to be assigned.
I tried df$desired_id_replicate <- df %>% group_by(value) %>% group_indices
but that doesn't work as all value=="a" will be assigned the same group indices.
2) Secondly, I have this data frame:
df <- data.frame(value=c("a","a","a", "b", "b", "b", "a", "a", "a"),
value2=c("a","a","c", "b", "b", "c", "a", "a", "d"),
desired_id=c(1,1,2,3,3,4,5,5,6))
How do I generate the desired_id from the value and value2 columns?
My groups are assigned by row order again. That is, every time the combination of value and value2 changes, the next higher desired_id should be assigned.
Similar to the above, I tried df$desired_id_replicate <- df %>% group_by(value, value2) %>% group_indices
but that doesn't work as all value=="a"&value2=="a" will be assigned the same group indices.
Thank you!
We can use rleid (run-length-encoding id) from data.table which would basically increment 1 for each element that is not equal to the previous element
library(data.table)
library(dplyr)
df %>%
mutate(newcol = rleid(value))
and for the second dataset, it would be
df %>%
mutate(new = rleid(value, value2))
# value value2 desired_id new
#1 a a 1 1
#2 a a 1 1
#3 a c 2 2
#4 b b 3 3
#5 b b 3 3
#6 b c 4 4
#7 a a 5 5
#8 a a 5 5
#9 a d 6 6
Or with rle from base R
df$newcol <- with(rle(df$value), rep(seq_along(values), lengths))
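For the two-column case without data.table, one possible base R sketch is to flag every row where either column differs from the row before and take a cumulative sum of those flags. The names changed and run_id are made up here.

```r
df <- data.frame(value  = c("a", "a", "a", "b", "b", "b", "a", "a", "a"),
                 value2 = c("a", "a", "c", "b", "b", "c", "a", "a", "d"),
                 stringsAsFactors = FALSE)

n <- nrow(df)

# TRUE for the first row and for every row that differs from the previous
# row in at least one of the two columns.
changed <- c(TRUE,
             df$value[-1]  != df$value[-n] |
             df$value2[-1] != df$value2[-n])

# Each TRUE starts a new run, so the running count is the run id.
df$run_id <- cumsum(changed)
df
```

This assumes the data frame has at least one row and no missing values in the two columns; NA comparisons would need extra handling.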

Organize subgroup strings (text)

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
Using tidyverse, after the group_by use summarise to either paste
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))
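The aggregate attempt from the question can probably be rescued as well: the sketch below swaps FUN = list for toString and keeps words as character (an assumption made here via stringsAsFactors = FALSE) so that no factor codes show up. The name out is only for illustration.

```r
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"),
                 stringsAsFactors = FALSE)

# toString() collapses each group's words into one comma-separated string.
out <- aggregate(words ~ first, data = df, FUN = toString)
out
```

The words keep the order in which they appear within each group.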

dplyr lag across groups

I am trying to do something like a lag, but across and not within groups. Sample data:
df <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"))
The goal for every flag="B" is to create lagvar with the previous var value where flag="A".
This will show the desired output:
df1 <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
var = c("AB123","AC124", "AD125", "AE126",
"AF127", "AG128", "AF129",
"AG130","AH131",
"AHI132", "AJ133"),
lagvar = c("","AB123","","AD125","AD125","AD125","","AF129","AF129","","AHI132")
)
A dplyr solution is preferred, but I'm not picky!
EDIT: I found a solution using the zoo package, but I am interested in whether others have better ideas.
library(zoo)
df$lagvar <- ifelse(df$flag == "A", as.character(df$var), NA)
df <- df %>%
mutate(lagvar = na.locf(lagvar))
Here you go. I used NA instead of blanks, but you can adjust as needed:
df %>% mutate(lagvar = ifelse(flag == "A", as.character(var), NA),
lagvar = zoo::na.locf(lagvar),
lagvar = ifelse(flag == "A", NA, lagvar))
# flag var lagvar
# 1 A AB123 <NA>
# 2 B AC124 AB123
# 3 A AD125 <NA>
# 4 B AE126 AD125
# 5 B AF127 AD125
# 6 B AG128 AD125
# 7 A AF129 <NA>
# 8 B AG130 AF129
# 9 B AH131 AF129
# 10 A AHI132 <NA>
# 11 B AJ133 AHI132
My solution is a bit complicated. The idea is to find out the position of A each B should assign to and then join with a table, which only contains rows with flag A.
df %>%
mutate(pos=cumsum(flag == "A")) %>%
left_join(
df %>%
filter(flag == "A") %>%
mutate(pos=1:n()) %>%
select(pos, lagvar=var),
by="pos") %>%
mutate(lagvar=ifelse(flag == "A", "", as.character(lagvar)))
# flag var pos lagvar
# 1 A AB123 1
# 2 B AC124 1 AB123
# 3 A AD125 2
# 4 B AE126 2 AD125
# 5 B AF127 2 AD125
# 6 B AG128 2 AD125
# 7 A AF129 3
# 8 B AG130 3 AF129
# 9 B AH131 3 AF129
# 10 A AHI132 4
# 11 B AJ133 4 AHI132
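The position idea above can also be sketched without a join or zoo, in plain base R. This is an illustration under assumptions: a_vals, grp and prev_a are made-up names, and NA is used instead of blanks for the "A" rows.

```r
df <- data.frame(flag = c("A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B"),
                 var  = c("AB123", "AC124", "AD125", "AE126",
                          "AF127", "AG128", "AF129",
                          "AG130", "AH131",
                          "AHI132", "AJ133"),
                 stringsAsFactors = FALSE)

# var values of the "A" rows, in order of appearance.
a_vals <- df$var[df$flag == "A"]

# How many "A" rows have been seen up to and including each row.
grp <- cumsum(df$flag == "A")

# Shift by one so that grp == 0 (no "A" seen yet) maps to NA.
prev_a <- c(NA_character_, a_vals)[grp + 1]

# Only "B" rows get the value of the most recent "A" row.
df$lagvar <- ifelse(df$flag == "A", NA_character_, prev_a)
df
```

Unlike a plain na.locf() call with default arguments, this should also cope with data that starts with "B" rows, which simply get NA.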
