Order column with integers separated by "-" in R [duplicate]

This question already has an answer here:
Order a "mixed" vector (numbers with letters)
(1 answer)
Closed 4 years ago.
I would like to order a column containing characters like this:
K3SG1-105-1051-1
However, using the arrange function will result in this:
K3SG1-105-1051-1
K3SG1-105-1051-10
K3SG1-105-1051-100
K3SG1-105-1051-1000
Instead of what I want:
K3SG1-105-1051-1
K3SG1-105-1051-2
K3SG1-105-1051-3
K3SG1-105-1051-4
Thanks in advance.

Here is a possibility using tidyr::separate and dplyr:
library(dplyr)
library(tidyr)
# Sample data
df <- data.frame(id = paste0("K3SG1-105-1051-", 1:10))
# Split id on "-" and sort by the numeric value of the fourth piece
df %>%
  separate(id, into = paste0("id", 1:4), sep = "-", remove = FALSE) %>%
  arrange(as.numeric(id4)) %>%
  select(id)
# id
#1 K3SG1-105-1051-1
#2 K3SG1-105-1051-2
#3 K3SG1-105-1051-3
#4 K3SG1-105-1051-4
#5 K3SG1-105-1051-5
#6 K3SG1-105-1051-6
#7 K3SG1-105-1051-7
#8 K3SG1-105-1051-8
#9 K3SG1-105-1051-9
#10 K3SG1-105-1051-10
Explanation: Split column id into four separate columns based on "-" as separator; arrange rows based on the fourth column entries, which are converted to numeric to ensure proper ordering.

Data
I created the following example data for this answer:
(char_vec <- paste0("K3SG1-105-1051-", c(1:4, 10, 100, 1000)))
[1] "K3SG1-105-1051-1" "K3SG1-105-1051-2" "K3SG1-105-1051-3"
[4] "K3SG1-105-1051-4" "K3SG1-105-1051-10" "K3SG1-105-1051-100"
[7] "K3SG1-105-1051-1000"
Solution
char_vec[order(as.numeric(sub('.*-', '', char_vec)))]
[1] "K3SG1-105-1051-1" "K3SG1-105-1051-2" "K3SG1-105-1051-3"
[4] "K3SG1-105-1051-4" "K3SG1-105-1051-10" "K3SG1-105-1051-100"
[7] "K3SG1-105-1051-1000"
Explanation
sub('.*-', '', char_vec) strips everything up to and including the last "-", leaving only the trailing digits. Converting these to numeric and passing them to order() gives the index that sorts char_vec numerically.
If you order the strings "1", "2", and "10" directly, the order is 1, 10, 2 because you're ordering strings alphabetically, not ordering numbers.
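Sorting on the last field alone assumes the prefix is identical in every row. If the prefix can vary, one option is to pass two keys to order(). The following is only a minimal sketch under that assumption: the ids vector and the variable names are made up for illustration, and the prefix is still compared as a plain string.

```r
# Hypothetical ids whose prefix differs between rows
ids <- c("K3SG1-105-1051-10", "K3SG1-105-1051-2",
         "K3SG1-105-1050-3", "K3SG1-105-1051-1")

# Everything before the last "-" (compared as text)
prefix <- sub("-[^-]*$", "", ids)
# Everything after the last "-" (compared as a number)
suffix <- as.numeric(sub(".*-", "", ids))

# Sort by prefix first, then by the numeric suffix within each prefix
sorted_ids <- ids[order(prefix, suffix)]
sorted_ids
```

If the other fields also need numeric ordering, they would have to be extracted and converted in the same way before being passed to order().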

Related

Separating a column by the first 3 characters

I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
bm_id
1 popCL20TE
2 agrST20
3 agrST20-09SE
I have tried solutions to a similar question asked on Stack Overflow, but I end up with extra empty columns while my data remains together.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
The same happens with this code:
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)
How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
vec first3 last
1 popCL20TE pop CL20TE
2 agrST20 agr ST20
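The answer above shows two of the three values from the question. As a sketch, the same idea can be applied to all three values with substring(), whose end position defaults to the end of the string, so nchar() is not needed. The column names species and id are taken from the question's attempted code; the data frame is rebuilt here from the values shown in the question.

```r
# Data rebuilt from the values shown in the question
bm_id <- data.frame(bm_id = c("popCL20TE", "agrST20", "agrST20-09SE"),
                    stringsAsFactors = FALSE)

# First three characters go into one column
bm_id$species <- substr(bm_id$bm_id, 1, 3)
# Everything from the fourth character onward goes into the other
bm_id$id <- substring(bm_id$bm_id, 4)

bm_id
```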

How can I obtain the distinct values for a "|" delimited column? [duplicate]

This question already has answers here:
How do keep only unique words within each string in a vector
(3 answers)
Closed 1 year ago.
I have a dataframe that looks like this:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y|21257a|75589y|
|c2|21257a|21257a|21257a|21257a|
|c3|21257a|75589y|75589y|33421v|
However, since there are duplicate values under the grids column, I'd like to keep only the distinct values so that the dataframe becomes:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y |
|c2|21257a |
|c3|21257a|75589y|33421v |
Any help would be appreciated!
Using sapply, split each string on |, keep only the unique values in each row, and paste them back together.
df$grids <- sapply(strsplit(df$grids, '|', fixed = TRUE), function(x)
paste0(unique(x), collapse = '|'))
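The split, unique and paste steps can also be wrapped in a small function. The sketch below is one way to do that, not the only one: the helper name dedupe_grids is made up for illustration, as.character() is added in case the column was read in as a factor, and vapply() is used so the result is guaranteed to be a character vector.

```r
# Rows taken from the question
df <- data.frame(id = c("c1", "c2", "c3"),
                 grids = c("21257a|75589y|21257a|75589y",
                           "21257a|21257a|21257a|21257a",
                           "21257a|75589y|75589y|33421v"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: keep each "|"-separated value once, in order of first appearance
dedupe_grids <- function(x) {
  pieces <- strsplit(as.character(x), "|", fixed = TRUE)
  vapply(pieces, function(p) paste(unique(p), collapse = "|"), character(1))
}

df$grids <- dedupe_grids(df$grids)
df
```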
Here is a base R regex based approach:
df$grids <- gsub("\\b(.+?)(?=\\|.*\\1)", "", df$grids, perl=TRUE)
df$grids <- gsub("^\\|+|\\|+$", "", df$grids)
df$grids <- gsub("\\|{2,}", "|", df$grids)
df
id grids
1 c1 21257a|75589y
2 c2 21257a
3 c3 21257a|75589y|33421v
Data:
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
The regex \b(.+?)(?=\|.*\1) matches any pipe-separated term for which the same term appears later in the grid string; each such match is replaced with an empty string. The cleanup steps then remove pipes left dangling at the beginning or end of the string and collapse runs of multiple pipes into one.
Using the data from @Tim:
library(tidyverse)
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
df %>% mutate(grids = map_chr(str_split(grids, '\\|'),
~paste(unique(.x), collapse = '|')))
#> id grids
#> 1 c1 21257a|75589y
#> 2 c2 21257a
#> 3 c3 21257a|75589y|33421v
Created on 2021-05-27 by the reprex package (v2.0.0)

R- Column match, create new column with another column of corresponding value [duplicate]

This question already has answers here:
dplyr: inner_join with a partial string match
(4 answers)
Closed 2 years ago.
I have two data frames:
df1<- data.frame(place=c("KARACA ADANA","ASIL BOLU","GAZIANTEP","YUKARI/MERSIN"))
df2<- data.frame(city=c("ADANA","BOLU","ANTEP","MERSIN"), neighbor=c("KARACA","ASIL","GAZI","YUKARI"))
I need to match df1$place against df2$neighbor. If df1$place contains the word in df2$neighbor, the corresponding value of df2$city should be copied into a new column, df1$newcol. The expected result would look like this:
df1$newcol <- data.frame(place=c("KARACA ADANA","ASIL BOLU","GAZIANTEP","YUKARI/MERSIN") ,city=c("ADANA","BOLU","ANTEP","MERSIN"))
Here's an approach with sapply from base R:
If you want only whole words to match, you could use a regular expression. \\b looks for a word boundary.
ind <- unlist(sapply(df2$neighbor, function(x) grep(paste0("\\b",x,"\\b"),df1$place)))
ind2 <- rep(1:length(df2$neighbor),
times = sapply(df2$neighbor, function(x) length(grep(paste0("\\b",x,"\\b"),df1$place))))
df1$newcol <- NA
df1$newcol[ind] <- as.character(df2$city[ind2])
df1
# place newcol
#1 KARACA ADANA ADANA
#2 ASIL BOLU BOLU
#3 GAZIANTEP <NA>
#4 YUKARI/MERSIN MERSIN
#5 YUKARI/MERSIN MERSIN
#6 GAZIANTEP <NA>
#7 ASIL BOLU BOLU
#8 KARACA ADANA ADANA
Sample Data
df1<- data.frame(place=c(c("KARACA ADANA","ASIL BOLU","GAZIANTEP","YUKARI/MERSIN"),
rev(c("KARACA ADANA","ASIL BOLU","GAZIANTEP","YUKARI/MERSIN"))))
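With word-boundary matching, "GAZIANTEP" gets NA because "GAZI" is not a separate word there, whereas the question's expected output pairs it with ANTEP. If plain substring matching is acceptable, a sketch along the following lines returns a city for all four rows. It uses the data frames from the question, takes the first matching neighbor when several match, and the variable name first_hit is illustrative.

```r
# Data frames from the question
df1 <- data.frame(place = c("KARACA ADANA", "ASIL BOLU", "GAZIANTEP", "YUKARI/MERSIN"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(city = c("ADANA", "BOLU", "ANTEP", "MERSIN"),
                  neighbor = c("KARACA", "ASIL", "GAZI", "YUKARI"),
                  stringsAsFactors = FALSE)

# For each place, find the index of the first neighbor that occurs anywhere in it
first_hit <- vapply(df1$place, function(p) {
  hits <- which(vapply(df2$neighbor,
                       function(n) grepl(n, p, fixed = TRUE),
                       logical(1), USE.NAMES = FALSE))
  if (length(hits) == 0) NA_integer_ else hits[1]
}, integer(1), USE.NAMES = FALSE)

# Indexing with NA yields NA, so unmatched places stay missing
df1$newcol <- df2$city[first_hit]
df1
```

Substring matching is more permissive than whole-word matching, so it can produce false positives when one neighbor name happens to be contained in an unrelated place name.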
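Try it this way. Note that this matches on df2$city rather than df2$neighbor, and it assumes each place contains exactly one city name: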
library(tidyverse)
df1 %>%
rowwise() %>%
mutate(out = df2$city[str_which(place, df2$city)])

Mark repeated strings in a R object with their correspondent number [duplicate]

This question already has an answer here:
How to make a unique set of names from a vector of strings?
(1 answer)
Closed 5 years ago.
I would like to enumerate the strings in an object so that strings that appear more than once are tagged as "stringX1", "stringX2" and so on.
This would be an input example:
strings <- c("stringQ", "stringW", "stringE", "stringQ")
The expected output would be:
stringOut <- c("stringQ1", "stringW1", "stringE1", "stringQ2")
Note that the "stringQ" is there two times, that's why I expect "stringQ1" and "stringQ2".
We can use ave:
paste0(strings, ave(strings, strings, FUN = seq_along))
Or, if numbering should start only at the second occurrence (the first occurrence stays unnumbered), use make.unique:
make.unique(strings, sep="")
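To make the difference between the two calls concrete, here is a small side-by-side sketch using the input from the question. The variable names numbered_all and numbered_repeats are illustrative only.

```r
strings <- c("stringQ", "stringW", "stringE", "stringQ")

# Number every occurrence within its group of identical strings
numbered_all <- paste0(strings, ave(strings, strings, FUN = seq_along))
numbered_all
# "stringQ1" "stringW1" "stringE1" "stringQ2"

# Leave first occurrences untouched and number only the repeats
numbered_repeats <- make.unique(strings, sep = "")
numbered_repeats
# "stringQ" "stringW" "stringE" "stringQ1"
```

Only the ave version reproduces the expected output from the question, since make.unique leaves the first "stringQ" unchanged.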
You can do this with dplyr as follows:
require(dplyr)
strings <- data.frame(string = c("stringQ", "stringW", "stringE", "stringQ"))
strings %>% group_by(string) %>%
mutate(stringnumber = paste0(string,row_number())) %>%
ungroup() %>%
select(stringnumber)
which results in:
# A tibble: 4 x 1
stringnumber
<chr>
1 stringQ1
2 stringW1
3 stringE1
4 stringQ2

Separate column into three columns with grouping [duplicate]

This question already has answers here:
How to strsplit different number of strings in certain column by do function
(1 answer)
tidyr separate only first n instances [duplicate]
(2 answers)
Closed 5 years ago.
I have a column of full names that should be split into three columns on spaces. The problem is that some full names contain more than three words; the fourth and any later words should not be dropped but appended to the third part.
For instance, "Abdullaeva Mehseti Nuraddin Kyzy" should be separated as:
| Abdullaeva | Mehseti | Nuraddin Kyzy |
I tried to split the column with the tidyr package as follows, but this way the third part contains only the one word after the second space.
df<-df %>%
separate('FULL_NAME', c("1st_part","2d_part","3d_part"), sep=" ")
Any help will be appreciated.
Use the extra argument:
# dummy data
df1 <- data.frame(x = c(
"some name1",
"justOneName",
"some three name",
"Abdullaeva Mehseti Nuraddin Kyzy"))
library(tidyr)
library(dplyr)
df1 %>%
separate(x, c("a1", "a2", "a3"), extra = "merge")
# a1 a2 a3
# 1 some name1 <NA>
# 2 justOneName <NA> <NA>
# 3 some three name
# 4 Abdullaeva Mehseti Nuraddin Kyzy
# Warning message:
# Too few values at 2 locations: 1, 2
From manual:
extra
If sep is a character vector, this controls what happens when
there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times
Since you said this dataset only contains name1, name2 and last name, you can also use str_split_fixed from stringr:
setNames(data.frame(stringr::str_split_fixed(df1$x, ' ', 3)), paste0('a', 1:3))
Which gives,
a1 a2 a3
1 some name1
2 justOneName
3 some three name
4 Abdullaeva Mehseti Nuraddin Kyzy
Note that you can fill the empty slots with NA as usual.
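One way to do that fill, sketched here with the dummy data from the earlier answer: str_split_fixed() pads missing pieces with empty strings, so they can be replaced in the character matrix before it is turned into a data frame. The variable names parts and res are illustrative.

```r
library(stringr)

# Dummy data from the earlier answer
df1 <- data.frame(x = c("some name1", "justOneName", "some three name",
                        "Abdullaeva Mehseti Nuraddin Kyzy"),
                  stringsAsFactors = FALSE)

# Split into at most three pieces; missing pieces come back as ""
parts <- str_split_fixed(df1$x, " ", 3)

# Replace the empty strings with NA while this is still a character matrix
parts[parts == ""] <- NA

res <- setNames(data.frame(parts, stringsAsFactors = FALSE), paste0("a", 1:3))
res
```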
