So I have a column with values in this structure:
tribble(
~col,
"AA_BB;AA_AA;AA_BB",
"BB_BB;AA_AA",
"AA_BB",
"BB_AA;BB_AA;AA_AA;BB_AA")
)
So each row has items separated by a ";". The first for has items AA_BB, AA_AA and AA_BB. I want the first row to be transformed to "AA_BB;AA_AA" and the last row to be transformed to "BB_AA;AA_AA".
I thought about using separate but I the result didn't really help me (especially since I don't know how many columns there can be at most).
df %>%
separate(col, into = c("A", "B", "C", "D"), sep = ";")
Any tips on how to do this?
We can split the column, get the unique elements and paste
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(col = map_chr(strsplit(col, ";"), ~ str_c(unique(.x), collapse=";")))
-output
# A tibble: 4 x 1
# col
# <chr>
#1 AA_BB;AA_AA
#2 BB_BB;AA_AA
#3 AA_BB
#4 BB_AA;AA_AA
Or split with separate_rows, then do a group by paste after getting the distinct rows
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(col, sep=";") %>%
distinct %>%
group_by(rn) %>%
summarise(col = str_c(col, collapse=";"), .groups = 'drop') %>%
select(col)
In base R, you can split the string on semi-colon, keep only unique strings and paste them together.
df$col1 <- sapply(strsplit(df$col, ';'), function(x)
paste0(unique(x), collapse = ';'))
df
# A tibble: 4 x 2
# col col1
# <chr> <chr>
#1 AA_BB;AA_AA;AA_BB AA_BB;AA_AA
#2 BB_BB;AA_AA BB_BB;AA_AA
#3 AA_BB AA_BB
#4 BB_AA;BB_AA;AA_AA;BB_AA BB_AA;AA_AA
Related
I have two dataframes:
set.seed(1)
df1 <- data.frame(k1 = "AFD(1);Acf(2);Vgr7(2);"
,k2 = "ABC(7);BHG(46);TFG(675);")
df2 <- data.frame(site =c("AFD(1);AFD(2);", "Acf(2);", "TFG(677);",
"XX(275);", "ABC(7);", "ABC(9);")
,p1 = rnorm(6, mean = 5, sd = 2)
,p2 = rnorm(6, mean = 6.5, sd = 2))
The first dataframe is in fact a list of often very long strings, made of 'elements". Each "element" is made of a few letters/numbers, followed by a number in brackets, followed by a semicolon. In this example I only put 3 "elements" into each string, but in my real dataframe there are tens to hundreds of them.
> df1
k1 k2
1 AFD(1);Acf(2);Vgr7(2); ABC(7);BHG(46);TFG(675);
The second dataframe shares some of the "elements" with df1. Its first column, called site, contains some (not all) "elements" from the first dataframe, sometimes the "element" forms the whole string, and sometimes is a part of a longer string:
> df2
site p1 p2
1 AFD(1);AFD(2); 4.043700 3.745881
2 Acf(2); 5.835883 5.670011
3 TFG(677); 7.717359 5.711420
4 XX(275); 4.794425 6.381373
5 ABC(7); 5.775343 8.700051
6 ABC(9); 4.892390 8.026351
I would like to filter the whole df2 using df2$site and each k column from df1 (there are many K columns, not all of them contain k in the names).
The easiest way to explain this is to show how the desired output would look like.
> outcome
k site p1 p2
1 k1 AFD(1);AFD(2): 4.043700 3.745881
2 k1 Acf(2); 5.835883 5.670011
3 k2 ABC(7); 5.775343 8.700051
The first column of the outcome dataframe corresponds to the column names in df1. The second column corresponds to the site column of df2 and contains only sites from df1 columns that were found in df2$sites. Other columns are from df2.
I appreciate that this question is made of two separate "problems", one grepping-related and one related to looping through df1 columns. I decided to show the task in its entirety in case there exists a solution that addresses both in one go.
FAILED SOLUTION 1
I can create a string to grep, but for each column separately:
# this replaces the semicolons with "|", but does not escape the brackets.
k1_pattern <- df1 %>%
select(k1) %>%
deframe() %>%
str_replace_all(";","|")
And then I am not sure how to use it. This (below) didn't work, maybe because I didn't escape brackets, but I am struggling with doing it:
k1_result <- df2 %>%
filter(grepl(pattern = k1_pattern, site))
But even if it did work, it would only deal with a single column from df1, and I have many, and would like to perform this operation on all df1 columns at the same time.
FAILED SOLUTION 2
I can create a list of sites to search in df2 from columns in df1:
k1_sites<- df1 %>%
select(k1) %>%
deframe() %>%
strsplit(., "[;]") %>%
unlist()
but the delimiter is lost here, and %in% cannot be used, as the match will sometimes be partial.
library(dplyr)
df2 %>%
mutate(site_list = strsplit(site, ";")) %>%
rowwise() %>%
filter(length(intersect(site_list,
unlist(strsplit(x = paste0(c(t(df1)), collapse=""),
split = ";")))) != 0) %>%
select(-site_list)
#> # A tibble: 3 x 3
#> # Rowwise:
#> site p1 p2
#> <chr> <dbl> <dbl>
#> 1 AFD(1);AFD(2); 3.75 7.47
#> 2 Acf(2); 5.37 7.98
#> 3 ABC(7); 5.66 9.52
Updated answer:
library(dplyr)
library(tidyr)
df1 %>%
rownames_to_column("id") %>%
pivot_longer(-id, names_to = "k", values_to = "site") %>%
separate_rows(site, sep = ";") %>%
filter(site != "") %>%
select(-id) -> df1_k
df2 %>%
tibble::rownames_to_column("id") %>%
separate_rows(site, sep = ";") %>%
full_join(., df1_k, by = c("site")) %>%
group_by(id) %>%
fill(k, .direction = "downup") %>%
filter(!is.na(id) & !is.na(k)) %>%
summarise(k = first(k),
site = paste0(site, collapse = ";"),
p1 = first(p1),
p2 = first(p2), .groups = "drop") %>%
select(-id)
#> # A tibble: 3 x 4
#> k site p1 p2
#> <chr> <chr> <dbl> <dbl>
#> 1 k1 AFD(1);AFD(2); 3.75 7.47
#> 2 k1 Acf(2); 5.37 7.98
#> 3 k2 ABC(7); 5.66 9.52
Here's a way going to a long format for exact matching (so no regex):
library(dplyr)
library(tidyr)
df1_long = df1 |> stack() |>
separate_rows(values, sep = ";") |>
filter(values != "")
df2 |>
mutate(id = row_number()) |>
separate_rows(site, sep = ";") |>
filter(site != "") |>
left_join(df1_long, by = c("site" = "values")) %>%
group_by(id) |>
filter(any(!is.na(ind))) %>%
summarize(
site = paste(site, collapse = ";"),
across(-site, \(x) first(na.omit(x)))
)
# # A tibble: 3 × 5
# id site p1 p2 ind
# <int> <chr> <dbl> <dbl> <fct>
# 1 1 AFD(1);AFD(2) 3.75 7.47 k1
# 2 2 Acf(2) 5.37 7.98 k1
# 3 5 ABC(7) 5.66 9.52 k2
I have a data.frame where values are repeated in col1.
col1 <- c("A", "A", "B", "B", "C")
col2 <- c(1995, 1997, 1999, 2000, 2005)
df <- data.frame(col1, col2)
I want to combine values in col2 that correspond to the same letter in col1 into one cell, so that col2 shows a range of values for a particular letter in col1. I do this by splitting the data.frame by col1, applying fun, and binding the split data.frames back together.
library(tidyverse)
split_df <- split(df, df$col1)
fun <- function(df) {
if (length(unique(df$col2)) > 1) {
df$col2 <- paste(min(df$col2),
max(df$col2),
sep = "-")
df <- distinct(df)
}
return(df)
}
split_df <- lapply(split_df, fun)
df <- do.call(rbind, split_df)
This works, but I am wondering if there is a more intuitive or more efficient solution?
Base R way using aggregate -
aggregate(col2~col1, df, function(x) paste0(unique(range(x)), collapse = '-'))
# col1 col2
#1 A 1995-1997
#2 B 1999-2000
#3 C 2005
Same can also be written with dplyr -
library(dplyr)
df %>%
group_by(col1) %>%
summarise(col2 = paste0(unique(range(col2)), collapse = '-'))
One option would be the tidyverse, where you can accomplish this a little more succinctly. The basic idea is the same:
library(tidyverse)
new.result <- df %>%
group_by(col1) %>%
summarize(
col2 = ifelse(n() == 1, as.character(col2), paste(min(col2), max(col2), sep = '-'))
)
col1 col2
<chr> <chr>
1 A 1995-1997
2 B 1999-2000
3 C 2005
A different (but possibly overcomplicated) approach assumes that you have at most two years per grouping. We can pivot the start and end years into their own columns, and then paste them together directly. This requires a little more data transformation but avoids having to check explicitly for groups with 1 year:
df %>%
group_by(col1) %>%
mutate(n = row_number()) %>%
pivot_wider(names_from = n, values_from = col2) %>%
rowwise() %>%
mutate(
vec = list(c(`1`, `2`)),
col2 = paste(vec[!is.na(vec)], collapse = '-')
) %>%
select(col1, col2)
I have a dataframe with the following factor variable:
> head(example.df)
path
1 C:/Users/My PC/pinkhipppos/tinyhorsefeet/location1/categoryA/eyoshdzjow_random_image.txt
(made up dirs).
I want to split into separate columns based on a delimiter: /.
I can do this using
library(tidyverse)
example.df <- example.df %>%
separate(path,
into=c("dir",
"ok",
"hello",
"etc...",
"finally...",
"location",
"category",
"filename"),
sep="/")
Although, I am only interested in the last two dirs and the file name or the last 3 results from the separate function. As parent directories (higher than location) may change. My desired output would be:
> head(example.df)
location category filename
1 location1 categoryA eyoshdzjow_random_image.txt
Reproducible:
example.df <- as.data.frame(
c("C:/Users/My PC/pinkhipppos/tinyhorsefeet/location1/categoryA/eyoshdzjow_random_image.txt",
"C:/Users/My PC/pinkhipppos/tinyhorsefeet/location2/categoryB/jdugnbtudg_random_image.txt")
)
colnames(example.df)<-"path"
One way in base R is to split string at "/" and select last 3 elements from each list.
as.data.frame(t(sapply(strsplit(as.character(example.df$path), "/"), tail, 3)))
# V1 V2 V3
#1 location1 categoryA eyoshdzjow_random_image.txt
#2 location2 categoryB jdugnbtudg_random_image.txt
Using tidyverse, we can get the data in long format, select last 3 entries in each row and get the data in wide format.
library(tidyverse)
example.df %>%
mutate(row = row_number()) %>%
separate_rows(path, sep = "/") %>%
group_by(row) %>%
slice((n() - 2) : n()) %>%
mutate(cols = c('location', 'category', 'filename')) %>%
pivot_wider(names_from = cols, values_from = path) %>%
ungroup() %>%
select(-row)
# A tibble: 2 x 3
# location category filename
# <chr> <chr> <chr>
#1 location1 categoryA eyoshdzjow_random_image.txt
#2 location2 categoryB jdugnbtudg_random_image.txt
Or similar concept as base R but using tidyverse
example.df %>%
mutate(temp = map(str_split(path, "/"), tail, 3)) %>%
unnest_wider(temp, names_repair = ~paste0("dir", seq_along(.) - 1)) %>%
select(-dir0)
I am trying to split a column in a data set that has codes separated by "-". This creates two issues. First i have to split the columns, but I also want to impute the values implied by the "-". I was able to split the data using:
separate_rows(df, code, sep = "-")
but I still haven't found a way to impute the implied values.
name <- c('group1', 'group1','group1','group2', 'group1', 'group1',
'group1')
code <- c('93790', '98960 - 98962', '98966 - 98969', '99078', 'S5950',
'99241 - 99245', '99247')
df <- data.frame( name, code)
what I am trying to output would look something like:
group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241,
99242, 99243, 99244, 99245, 99247
group2 99078
in this example, 98961, 98967 and 98968 are imputed and implied from the "-".
Any thoughts on how to accomplish this?
After we split the 'code', one option it to loop through the split elements with map, get a sequence (:), unnest and do a group_by paste
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df %>%
mutate(code = map(strsplit(as.character(code), " - "), ~ {
x <- as.numeric(.x)
if(length(x) > 1) x[1]:x[2] else x})) %>%
unnest(code) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse=", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
# 1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969
# 2 group2 99078
Or another option is before the separate_rows, create a row index and use that for grouping by when we do a complete
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(rn, name) %>%
complete(code = min(code):max(code)) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
Update
If there are non-numeric elements
df %>%
mutate(rn = row_number()) %>%
separate_rows(code, convert = TRUE) %>%
group_by(name, rn) %>%
complete(code = if(any(str_detect(code, '\\D'))) code else
as.character(min(as.numeric(code)):max(as.numeric(code)))) %>%
group_by(name) %>%
summarise(code = str_c(code, collapse =", "))
# A tibble: 2 x 2
# name code
# <fct> <chr>
#1 group1 93790, 98960, 98961, 98962, 98966, 98967, 98968, 98969, S5950, 99241, 99242, 99243, 99244, 99245, 99247
#2 group2 99078
lapply(split(as.character(df$code), df$name), function(y) {
unlist(sapply(y, function(x){
if(grepl("-", x)) {
n = as.numeric(unlist(strsplit(x, "-")))
n[1]:n[2]
} else {
as.numeric(x)
}
}, USE.NAMES = FALSE))
})
#$group1
#[1] 93790 98960 98961 98962 98966 98967 98968 98969
#$group2
#[1] 99078
I've a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation, simple example below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like abin maxcolbelow). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert dataframe from wide to long using gather, then group_by each row we paste column names for max key and then spread the long dataframe to wide again.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(across())) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)