Separate column into three columns with grouping

I have a column of full names that should be separated into three columns on spaces. The problem is that some full names contain more than three words; the fourth and later words shouldn't be dropped, but appended to the third part.
For instance, "Abdullaeva Mehseti Nuraddin Kyzy" should be separated as:
| Abdullaeva | Mehseti | Nuraddin Kyzy |
I tried to split the column with the tidyr package as follows, but this way the third part contains only the one word after the second space.
df<-df %>%
separate('FULL_NAME', c("1st_part","2d_part","3d_part"), sep=" ")
Any help will be appreciated.

Use the extra argument of separate():
# dummy data
df1 <- data.frame(x = c(
"some name1",
"justOneName",
"some three name",
"Abdullaeva Mehseti Nuraddin Kyzy"))
library(tidyr)
library(dplyr)
df1 %>%
separate(x, c("a1", "a2", "a3"), extra = "merge")
# a1 a2 a3
# 1 some name1 <NA>
# 2 justOneName <NA> <NA>
# 3 some three name
# 4 Abdullaeva Mehseti Nuraddin Kyzy
# Warning message:
# Too few values at 2 locations: 1, 2
From manual:
extra
If sep is a character vector, this controls what happens when
there are too many pieces. There are three valid options:
- "warn" (the default): emit a warning and drop extra values.
- "drop": drop any extra values without a warning.
- "merge": only splits at most length(into) times
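The "Too few values" warning in the output above comes from rows with fewer than three words. As a minimal sketch, assuming the same dummy data, separate() also has a fill argument; fill = "right" pads the missing pieces with NA on the right without warning. The name res is only illustrative.

```r
library(tidyr)

df1 <- data.frame(x = c("some name1",
                        "justOneName",
                        "some three name",
                        "Abdullaeva Mehseti Nuraddin Kyzy"))

# extra = "merge": split at most length(into) times, so the third piece
#                  keeps everything after the second space.
# fill = "right":  rows with too few pieces get NA on the right, silently.
res <- separate(df1, x, into = c("a1", "a2", "a3"), sep = " ",
                extra = "merge", fill = "right")
res
```

Setting sep = " " explicitly restricts splitting to spaces, as the question asks; the default sep splits on any run of non-alphanumeric characters.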

Since you said that in this dataset you only have name1, name2 and last name, you can also use str_split_fixed() from stringr, i.e.
setNames(data.frame(stringr::str_split_fixed(df1$x, ' ', 3)), paste0('a', 1:3))
Which gives,
a1 a2 a3
1 some name1
2 justOneName
3 some three name
4 Abdullaeva Mehseti Nuraddin Kyzy
Note that the missing pieces come back as empty strings, which you can replace with NA as usual.
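A small sketch of that NA replacement, assuming the same dummy strings. str_split_fixed() returns a character matrix with "" in the unused cells, so the replacement can be done on the matrix before it is converted to a data frame. The names parts and res are illustrative.

```r
library(stringr)

x <- c("some name1",
       "justOneName",
       "some three name",
       "Abdullaeva Mehseti Nuraddin Kyzy")

# Split into exactly 3 pieces; the last piece keeps any remaining words.
parts <- str_split_fixed(x, " ", 3)

# Unused cells are empty strings; turn them into NA.
parts[parts == ""] <- NA

res <- setNames(as.data.frame(parts, stringsAsFactors = FALSE),
                paste0("a", 1:3))
res
```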


Looping over a list of patterns to remove them from a string column in R

I have a df with 2 columns, where the second one holds strings that contain special characters and other characters I want to remove.
The problem
I have written a for loop that works but only after being executed Three (03) times!
Libraries & Data
library(tidyverse)
client_id <- 1:10
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(cbind(client_id, client_name))
Patterns to be removed
patterns <- list("-", "--", "[:digit:]", "[:cntrl:]" , "µ" , "²" , "[:punct:]")
What I have done
To remove the unwanted patterns in col 2 client_names I have written the following for loop:
for(ptrn in patterns) {
df <- df %>%
mutate(client_name = str_remove(df$client_name, ptrn))
print(ptrn) # progress
}
The above for loop removes all unwanted patterns, but only after being executed Three (03) times.
How can we fix that in order to remove all unwanted patterns since the first execution?
Should I nest the above for loop with another one in order to iterate over client_names[i]?
Thanks
This is a more straightforward method:
Instead of making a list of all unwanted characters you can str_extract all and only the wanted ones, which, in your case, are the (Roman) alphabetic characters:
library(stringr)
df %>%
mutate(client_name = str_extract(client_name,"[A-Za-z]+"))
client_id client_name
1 1 name
2 2 name
3 3 name
4 4 name
5 5 name
6 6 name
7 7 name
8 8 name
9 9 name
10 10 name
You can collapse the patterns in one regex pattern and use str_remove_all to remove all the occurrences of it.
library(dplyr)
library(stringr)
ptrn <- paste0(patterns, collapse = '|')
df <- df %>% mutate(client_name = str_remove_all(client_name, ptrn))
df
# client_id client_name
#1 1 name
#2 2 name
#3 3 name
#4 4 name
#5 5 name
#6 6 name
#7 7 name
#8 8 name
#9 9 name
data
client_id <- 1:9
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(client_id, client_name)
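To illustrate why the original loop needed several passes: str_remove() deletes only the first match in each string, while str_remove_all() deletes every match. A minimal sketch on three of the sample values, with a simplified pattern chosen just for this illustration:

```r
library(stringr)

x <- c("7name8", "name--", "name514")

# Only the first digit or hyphen in each string is removed.
first_only <- str_remove(x, "[0-9]|-")
first_only
# "name8" "name-" "name14"

# Every digit or hyphen is removed in a single call.
all_matches <- str_remove_all(x, "[0-9]|-")
all_matches
# "name" "name" "name"
```

A value such as "name514" holds three digits, so removing one match per pass is consistent with the loop having to run three times.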

How can I obtain the distinct values for a "|" delimited column?

I have a dataframe that looks like this:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y|21257a|75589y|
|c2|21257a|21257a|21257a|21257a|
|c3|21257a|75589y|75589y|33421v|
However, since there are duplicate characters under the grids column, I'd like to extract only the distinct characters such that the dataframe becomes like this:
+--+---------------------------+
|id|grids |
+--+---------------------------+
|c1|21257a|75589y |
|c2|21257a |
|c3|21257a|75589y|33421v |
Any help would be appreciated!
Using sapply(), split each string on |, keep only the unique values in each row, and paste them back together.
df$grids <- sapply(strsplit(df$grids, '|', fixed = TRUE), function(x)
paste0(unique(x), collapse = '|'))
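The same idea, unrolled into its three steps as a sketch on the sample values (pieces, dedup and result are illustrative names). Note that fixed = TRUE matters: "|" is the alternation operator in a regular expression, so without it strsplit() would split the string into single characters.

```r
grids <- c("21257a|75589y|21257a|75589y",
           "21257a|21257a|21257a|21257a",
           "21257a|75589y|75589y|33421v")

# 1. Split each string on the literal "|" (one character vector per row).
pieces <- strsplit(grids, "|", fixed = TRUE)

# 2. Drop repeated values within each row, keeping first-seen order.
dedup <- lapply(pieces, unique)

# 3. Glue each row back together with "|".
result <- vapply(dedup, paste, character(1), collapse = "|")
result
# "21257a|75589y" "21257a" "21257a|75589y|33421v"
```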
Here is a base R regex based approach:
df$grids <- gsub("\\b(.+?)(?=\\|.*\\1)", "", df$grids, perl=TRUE)
df$grids <- gsub("^\\|+|\\|+$", "", df$grids)
df$grids <- gsub("\\|{2,}", "|", df$grids)
df
id grids
1 c1 21257a|75589y
2 c2 21257a
3 c3 21257a|75589y|33421v
Data:
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
For an explanation of the regex \b(.+?)(?=\|.*\1), it matches any pipe-separated term for which we can find the same term later in the grid string. If so, then we strip it by replacing with empty string. There are also some cleanup steps to remove dangling multiple pipes which might be left behind (or at the beginning/end of the grid string).
Using the data from @Tim's answer:
library(tidyverse)
df <- data.frame(id=c("c1", "c2", "c3"),
grids=c("21257a|75589y|21257a|75589y",
"21257a|21257a|21257a|21257a",
"21257a|75589y|75589y|33421v"))
df %>% mutate(grids = map_chr(str_split(grids, '\\|'),
~paste(unique(.x), collapse = '|')))
#> id grids
#> 1 c1 21257a|75589y
#> 2 c2 21257a
#> 3 c3 21257a|75589y|33421v
Created on 2021-05-27 by the reprex package (v2.0.0)

how to add title of gene to the output in R?

I have strings of length 9 and a list of longer strings with titles.
Example data:
String <- c("ABCDEFGHI", "ACBDGHIEF")
Data in text file contains 'longer strings with titles' like
>name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL
I use library(stringr) to locate the positions of each string.
Code in R
loc <- str_locate(textfile, pattern = String)
write.csv(loc, "locate.csv")
EXPECTED Output:
string | locate | source of longer string
1 | 1-9| name1
2 | 7-15|name2
3 |NA| NA
QUESTION:
I would like to add the name of the longer string in which each "string" is located. How can I do this in R? I want to get the last column of the EXPECTED output (source of longer string).
Thank you for help
Venkata
Here is an option with the tidyverse. After reading the data with readLines(), title lines and value lines alternate, so a recycled logical vector ('i1') can separate them into two columns ('col1' for titles, 'col2' for values). Then apply str_locate() to the values only ('col2'), create a row-number column, and fill 'source_longer_string' with the title wherever 'locate' is not NA.
library(dplyr)
library(stringr)
i1 <- c(TRUE, FALSE)
df1 <- tibble(col1 = textfile[i1], col2 = textfile[!i1])
str_locate(df1$col2, str_c(String, collapse="|")) %>%
as.data.frame %>%
transmute(string = row_number(),
locate = str_c(start, end, sep="-"),
source_longer_string = case_when(is.na(locate) ~ NA_character_,
TRUE ~ df1$col1))
# string locate source_longer_string
#1 1 1-9 >name
#2 2 7-15 >name1
#3 3 <NA> <NA>
data
textfile <- readLines(textConnection(">name
ABCDEFGHIJKLMNOPQRSTUVWXYX
>name1
TUVWXYACBDGHIEFXGHIJKLMIJK
>name2
ABFNOCDEPQRXYXGSTUVWHIMJKL"))
String <- c("ABCDEFGHI", "ACBDGHIEF")
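The answer above reports one row per sequence. If one row per query string is wanted instead, a base R sketch along these lines may help. It assumes titles and sequences strictly alternate in the file; find_one, titles and seqs are hypothetical names, and the third query is a made-up value added only to show the NA case.

```r
textfile <- c(">name",  "ABCDEFGHIJKLMNOPQRSTUVWXYX",
              ">name1", "TUVWXYACBDGHIEFXGHIJKLMIJK",
              ">name2", "ABFNOCDEPQRXYXGSTUVWHIMJKL")
# Third query is hypothetical: it occurs in none of the sequences.
String <- c("ABCDEFGHI", "ACBDGHIEF", "QQQQQQQQQ")

titles <- sub("^>", "", textfile[c(TRUE, FALSE)])  # odd lines, ">" removed
seqs   <- textfile[c(FALSE, TRUE)]                 # even lines

# Look up one query: first sequence containing it, with start-end position.
find_one <- function(query) {
  pos <- as.integer(regexpr(query, seqs, fixed = TRUE))  # -1 when absent
  hit <- which(pos > 0)[1]                               # NA if no match
  if (is.na(hit)) {
    return(data.frame(string = query, locate = NA_character_,
                      source = NA_character_, stringsAsFactors = FALSE))
  }
  data.frame(string = query,
             locate = paste(pos[hit], pos[hit] + nchar(query) - 1, sep = "-"),
             source = titles[hit],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(String, find_one))
result
```

Only the first matching sequence is reported for each query; a query that occurs in several sequences would need which(pos > 0) without the [1].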

Order column with integers separated by "-" in R

I would like to order a column containing characters like this:
K3SG1-105-1051-1
However, using the arrange function will result in this:
K3SG1-105-1051-1
K3SG1-105-1051-10
K3SG1-105-1051-100
K3SG1-105-1051-1000
Instead of what I want:
K3SG1-105-1051-1
K3SG1-105-1051-2
K3SG1-105-1051-3
K3SG1-105-1051-4
Thanks in advance.
Here is a possibility using tidyr::separate and dplyr:
library(dplyr)
library(tidyr)
# Sample data
df <- data.frame(id = paste0("K3SG1-105-1051-", 1:10))
# Using separate
df %>%
  separate(id, into = paste0("id", 1:4), sep = "-", remove = FALSE) %>%
  arrange(as.numeric(id4)) %>%
  select(id)
# id
#1 K3SG1-105-1051-1
#2 K3SG1-105-1051-2
#3 K3SG1-105-1051-3
#4 K3SG1-105-1051-4
#5 K3SG1-105-1051-5
#6 K3SG1-105-1051-6
#7 K3SG1-105-1051-7
#8 K3SG1-105-1051-8
#9 K3SG1-105-1051-9
#10 K3SG1-105-1051-10
Explanation: Split column id into four separate columns based on "-" as separator; arrange rows based on the fourth column entries, which are converted to numeric to ensure proper ordering.
Data
I created the following example data for this answer:
(char_vec <- paste0("K3SG1-105-1051-", c(1:4, 10, 100, 1000)))
[1] "K3SG1-105-1051-1" "K3SG1-105-1051-2" "K3SG1-105-1051-3"
[4] "K3SG1-105-1051-4" "K3SG1-105-1051-10" "K3SG1-105-1051-100"
[7] "K3SG1-105-1051-1000"
Solution
char_vec[order(as.numeric(sub('.*-', '', char_vec)))]
[1] "K3SG1-105-1051-1" "K3SG1-105-1051-2" "K3SG1-105-1051-3"
[4] "K3SG1-105-1051-4" "K3SG1-105-1051-10" "K3SG1-105-1051-100"
[7] "K3SG1-105-1051-1000"
Explanation
sub('.*-', '', char_vec) strips everything up to and including the last hyphen, leaving only the trailing digits, which we then convert to numeric and pass to order() to reorder char_vec.
If you order the characters 1, 2, and 10, the order is 1, 10, 2 because you're alphabetically ordering strings, not ordering numbers.
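A short sketch of that difference, followed by an extension for IDs whose earlier fields also differ. The extension assumes every ID has exactly four "-"-separated fields and that fields 2 to 4 are numeric; the sample IDs are made up for illustration.

```r
# Character sorting compares digit by digit, so "10" lands before "2".
nums <- c("10", "2", "1")
sort(nums)
# "1" "10" "2"
nums[order(as.numeric(nums))]
# "1" "2" "10"

# Hypothetical IDs that differ in more than the last field.
ids <- c("K3SG1-105-1051-10", "K3SG1-105-1051-2",
         "K3SG1-20-7-1",      "K3SG1-105-1051-1")

# One row per ID, one column per field.
parts <- do.call(rbind, strsplit(ids, "-", fixed = TRUE))

# Order by the text prefix, then by each numeric field in turn.
ord <- order(parts[, 1],
             as.numeric(parts[, 2]),
             as.numeric(parts[, 3]),
             as.numeric(parts[, 4]))
sorted_ids <- ids[ord]
sorted_ids
```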

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
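A quick check of that behaviour, as a sketch on a small stand-in for the question's data (the object names are illustrative):

```r
Data <- data.frame(chr    = c("chr1", "chr1"),
                   genome = c("hg19_refGene", "hg19_refGene"),
                   region = c("CDS", "exon"))

# Matrix-style subsetting that leaves one column collapses to a vector.
as_vector <- Data[, -(2:3)]
is.data.frame(as_vector)   # FALSE

# drop = FALSE keeps the data.frame structure.
as_frame <- Data[, -(2:3), drop = FALSE]
is.data.frame(as_frame)    # TRUE

# List-style subsetting (no comma) never drops to a vector.
list_style <- Data[-(2:3)]
is.data.frame(list_style)  # TRUE
```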
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
Data <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
Data <- subset( Data, select = -a )
and to remove the b and d columns you could do
Data <- subset( Data, select = -c(d, b ) )
You can remove all columns from d through b (inclusive) with:
Data <- subset( Data, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, ! names(data) %in% cols.dont.want, drop = FALSE]
Including drop = FALSE ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
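A minimal sketch of such a pattern-based removal, using a made-up data frame and a made-up "tmp_" prefix. It also shows a pitfall of negative indexing with which() or grep(): when nothing matches they return integer(0), and negating that selects no columns at all.

```r
df <- data.frame(id = 1:3, tmp_a = 4:6, tmp_b = 7:9, value = 10:12)

# grepl() gives one TRUE/FALSE per column name, so negating it is safe
# even when nothing matches (every column is then kept).
kept <- df[, !grepl("^tmp_", names(df)), drop = FALSE]
names(kept)
# "id" "value"

# Pitfall: which() returns integer(0) for a name that does not exist,
# and -integer(0) is still integer(0), which selects zero columns.
unsafe <- df[, -which(names(df) == "no_such_column"), drop = FALSE]
ncol(unsafe)
# 0
```

For that reason the logical form (with ! and grepl() or %in%) tends to be the safer choice when the column may be absent.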
