Use extract and/or separate to isolate variable string from dataframe - r

I've looked through the following pages on using regex to isolate a string:
Regular expression to extract text between square brackets
What is a non-capturing group? What does (?:) do?
Split data frame string column into multiple columns
I have a dataframe which contains protein/gene identifiers, and in some cases there are two or more of these strings (separated by a comma) because of multiple matches from a list. The first string is the strongest match, and I'm not necessarily interested in keeping the rest. They represent multiple matches from inferred evidence, and when they cannot be easily discriminated all of the hits get put into a column. I'm only interested in keeping the first because the group will likely have the same type of annotation (i.e. type of protein, gene ontology, similar function, etc.). If I split the multiple entries into more rows, it would appear that I have evidence that they all exist in my dataset, but at the empirical level I don't.
My dataframe:
protein
1 sp|P50213|IDH3A_HUMAN
2 sp|Q9BZ95|NSD3_HUMAN
3 sp|Q92616|GCN1_HUMAN
4 sp|Q9NSY1|BMP2K_HUMAN
5 sp|O75643|U520_HUMAN
6 sp|O15357|SHIP2_HUMAN
523 sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|
524 sp|Q96KB5|TOPK_HUMAN
525 sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN
526 sp|O00299|CLIC1_HUMAN
527 sp|P25940|CO5A3_HUMAN
The output I am trying to create:
uniprot gene
P50213 IDH3A
Q9BZ95 NSD3
Q92616 GCN1
P12277 KCRB
I'm trying to use extract and separate functions to do this:
extract(df, protein, into = c("uniprot", "gene"), regex = c("sp|(.*?)|","(.*?)_"), remove = FALSE)
results in:
Error: is_string(regex) is not TRUE
trying separate to at least break apart the two in multiple steps:
separate(df, protein, into = c("uniprot", "gene"), sep = "|", remove = FALSE)
results in:
Warning message:
Expected 2 pieces. Additional pieces discarded in 528 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
protein uniprot gene
1 sp|P50213|IDH3A_HUMAN s
2 sp|Q9BZ95|NSD3_HUMAN s
3 sp|Q92616|GCN1_HUMAN s
4 sp|Q9NSY1|BMP2K_HUMAN s
5 sp|O75643|U520_HUMAN s
6 sp|O15357|SHIP2_HUMAN s
What is the best way to use regex in this scenario and are extract or separate the best way to go about this? Any suggestion would be greatly appreciated. Thanks!
Update based on feedback:
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
df1 <- separate(df, protein, into = "protein", sep = ",")
#i'm only interested in the first match, because science
df2 <- extract(df1, protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)", remove = FALSE)
#create new columns with uniprot code and gene id, no _HUMAN
#df2
# protein uniprot gene
#1 sp|P50213|IDH3A_HUMAN P50213 IDH3A
#2 sp|Q9BZ95|NSD3_HUMAN Q9BZ95 NSD3
#3 sp|Q92616|GCN1_HUMAN Q92616 GCN1
#4 sp|Q9NSY1|BMP2K_HUMAN Q9NSY1 BMP2K
#5 sp|O75643|U520_HUMAN O75643 U520
#6 sp|O15357|SHIP2_HUMAN O15357 SHIP2
#523 sp|P10599|THIO_HUMAN P10599 THIO
#524 sp|Q96KB5|TOPK_HUMAN Q96KB5 TOPK
#525 sp|P12277|KCRB_HUMAN P12277 KCRB
#526 sp|O00299|CLIC1_HUMAN O00299 CLIC1
#and the answer using %>% pipes (this is what I aspire to)
df_filtered <- df %>%
separate(protein, into = "protein", sep = ",") %>%
extract(protein, into = c("uniprot", "gene"), regex = "sp\\|([^|]+)\\|([^_]+)") %>%
select(uniprot, gene)
#df_filtered
# uniprot gene
#1 P50213 IDH3A
#2 Q9BZ95 NSD3
#3 Q92616 GCN1
#4 Q9NSY1 BMP2K
#5 O75643 U520
#6 O15357 SHIP2
#523 P10599 THIO
#524 Q96KB5 TOPK
#525 P12277 KCRB
#526 O00299 CLIC1

We can capture the pattern as a group ((...)) in extract. Here, we match sp at the beginning (^) of the string followed by a | (metacharacter - escaped \\), followed by one or more characters not a | captured as a group, followed by a | and the second set of characters captured
library(tidyverse)
extract(df, protein, into = c("uniprot", "gene"),
regex = "^sp\\|([^|]+)\\|([^|]+).*")
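A side note on why the pipe needs escaping: in a regular expression `|` means alternation, so it has to be escaped (or matched literally) before it can act as a delimiter. A minimal base R sketch, using one identifier from the question; the variable names are only illustrative:

```r
x <- "sp|P50213|IDH3A_HUMAN"

# Escape the pipe so it is treated as a literal character ...
parts_escaped <- strsplit(x, "\\|")[[1]]

# ... or switch regex interpretation off altogether with fixed = TRUE.
parts_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

parts_escaped
parts_fixed
```

The sep argument of separate() is also interpreted as a regular expression, so the same escaping applies there.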
If there are multiple instances of 'sp', then separate the rows into long format with separate_rows and then use extract
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = c("uniprot", "gene"),
"^sp\\|([^|]+)\\|([^|]*).*")
There is one entry (sp|THIO_HUMAN|) with only two fields, so the accession is missing. To make it work for that case as well:
df %>%
separate_rows(protein, sep=",") %>%
extract(protein, into = "gene", "([^|]*HUMAN)", remove = FALSE) %>%
mutate(uniprot = str_extract(protein, "(?<=sp\\|)[^_]+(?=\\|)")) %>%
select(uniprot, gene)
# uniprot gene
#1 P50213 IDH3A_HUMAN
#2 Q9BZ95 NSD3_HUMAN
#3 Q92616 GCN1_HUMAN
#4 Q9NSY1 BMP2K_HUMAN
#5 O75643 U520_HUMAN
#6 O15357 SHIP2_HUMAN
#7 P10599 THIO_HUMAN
#8 <NA> THIO_HUMAN
#9 Q96KB5 TOPK_HUMAN
#10 P12277 KCRB_HUMAN
#11 P17540 KCRS_HUMAN
#12 P12532 KCRU_HUMAN
#13 O00299 CLIC1_HUMAN
data
df <- structure(list(protein = c("sp|P50213|IDH3A_HUMAN", "sp|Q9BZ95|NSD3_HUMAN",
"sp|Q92616|GCN1_HUMAN", "sp|Q9NSY1|BMP2K_HUMAN", "sp|O75643|U520_HUMAN",
"sp|O15357|SHIP2_HUMAN", "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
"sp|Q96KB5|TOPK_HUMAN", "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN",
"sp|O00299|CLIC1_HUMAN")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "523", "524", "525", "526"))
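If only the first (strongest) hit per row is wanted, as the question describes, a base R sketch without tidyr could look like the following. It assumes every first entry follows the sp|ACCESSION|GENE_SPECIES layout; the names first_hit and parsed are illustrative, not taken from the answers above.

```r
protein <- c("sp|P50213|IDH3A_HUMAN",
             "sp|P10599|THIO_HUMAN,sp|THIO_HUMAN|",
             "sp|P12277|KCRB_HUMAN,sp|P17540|KCRS_HUMAN,sp|P12532|KCRU_HUMAN")

# Keep only the text before the first comma, i.e. the first hit in each row.
first_hit <- sub(",.*$", "", protein)

# regexec() returns the full match plus one element per capture group.
m <- regmatches(first_hit, regexec("^sp\\|([^|]+)\\|([^_|]+)", first_hit))

parsed <- data.frame(
  uniprot = vapply(m, `[`, character(1), 2),  # first capture group
  gene    = vapply(m, `[`, character(1), 3),  # second capture group, without _HUMAN
  stringsAsFactors = FALSE
)
parsed
```

Rows whose first entry does not match the assumed layout would come back as NA in both columns rather than raising an error.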

Related

Remove Last Character in R inplace

I came from a Python background and I am working in R with this data df.
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a
Using R language, how can I remove the a in the last part of the age column and do data modification in place??
(just like Python using inplace=True)
Can I attain it using
df$Age[which(df$Age == `a pattern`)] <- ""
This is a perfect use case for parse_number from the readr package (part of the tidyverse):
library(dplyr)
library(readr)
df %>%
mutate(age = parse_number(age))
name age
1 Anon1 52
2 Anon2 62
3 Anon3 44
4 Anon4 30
5 Anon5 110
data:
df <- structure(list(name = c("Anon1", "Anon2", "Anon3", "Anon4", "Anon5"
), age = c("52a", "62", "44a", "30", "110a")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
You could use sub here:
df$age <- sub("a$", "", df$age)
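One thing to watch with sub here: the $ anchor only works when the pattern is interpreted as a regular expression, so fixed = TRUE must not be set. A small sketch on the ages from the question (the variable names are illustrative):

```r
age <- c("52a", "62", "44a", "30", "110a")

# "a$" as a regular expression: a literal "a" anchored at the end of the string.
age_clean <- sub("a$", "", age)

# With fixed = TRUE the pattern is the two-character literal "a$",
# which never occurs in these values, so nothing is replaced.
age_unchanged <- sub("a$", "", age, fixed = TRUE)

# Convert to integer once the stray letter is gone.
age_num <- as.integer(age_clean)
age_num
```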
#A tidy solution
library(dplyr)
library(stringr)
df <- data.frame(name=c("anon1","anon2"),age=c("52","37a"))
df <- df %>%
mutate(age = str_extract(age,"^\\d+"))
df
name age
1 anon1 52
2 anon2 37
Here are two approaches. No packages are used.
1) We remove all non-digit characters; in a regular expression \D means non-digit. If we knew that only a could appear as a non-digit we could instead use "a" as the first argument to gsub, and if we knew it only appears once we could use sub instead of gsub.
Also it is easier to debug code if you don't overwrite variables since then you always know that a particular variable is in its original state. Instead assign the result to a new variable.
transform(DF, age = as.numeric(gsub("\\D", "", age)))
This could also be written using pipes:
transform(DF, age = age |> gsub(pattern = "\\D", replacement = "") |> as.numeric())
2) We can use scan specifying that a is a comment character.
transform(DF, age = scan(text = age, comment.char = "a", quiet = TRUE))
Note
Lines <- "
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a"
DF <- read.table(text = Lines)
The inplace modifier in Python refers to making a change without creating a copy. The data.table package in R allows for this (called assignment by reference).
df <- read.table(text="
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a")
library(data.table)
library(stringi)
setDT(df)[, age:=stri_extract(age, regex='^\\d+')]
df
The first clause (setDT(df)) converts df to a data.table by reference (i.e., without making a copy), and the second clause ([, age:=...]) replaces the values in column age with ... also by reference.

How to separate a column in R with unequal character lengths and without separators

I want to separate a column which contains dates and items into two columns.
V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent
To this
V1 V2
23/2/2000 shampoo
24/2/2000 flour
21/10/2000 poultry
17/4/2001 laundry detergent
My problem is that there's no separation between the two. The date length isn't uniform (it's in the format of 1/1/2000 instead of 01/01/2000) so I can't separate by character length. The dataset also covers multiple years.
One option would be separate from tidyr. We specify the sep with a regex lookaround to split between digit and a lower case letter
library(dplyr)
library(tidyr)
df1 %>%
separate(V1, into = c("V1", "V2"), sep="(?<=[0-9])(?=[a-z])")
# V1 V2
#1 23/2/2000 shampoo
#2 24/2/2000 flour
#3 21/10/2000 poultry
#4 17/4/2001 laundry detergent
Or with read.csv after creating a delimiter with sub
read.csv(text = sub("(\\d)([a-z])", "\\1,\\2", df1$V1),
header = FALSE, stringsAsFactors = FALSE)
data
df1 <- structure(list(V1 = c("23/2/2000shampoo", "24/2/2000flour",
"21/10/2000poultry",
"17/4/2001laundry detergent")), class = "data.frame", row.names = c(NA,
-4L))
You could also use capture groups with tidyr::extract(). The first group \\d{1,2}/\\d{1,2}/\\d{4} gets the date in the format you posted, and the second group [[:print:]]+ grabs at least one printable character.
library(tidyverse)
df1 %>%
extract(V1, c("V1", "V2"), "(\\d{1,2}/\\d{1,2}/\\d{4})([[:print:]]+)")
V1 V2
1 23/2/2000 shampoo
2 24/2/2000 flour
3 21/10/2000 poultry
4 17/4/2001 laundry detergent
Data:
df1 <- readr::read_csv("V1
23/2/2000shampoo
24/2/2000flour
21/10/2000poultry
17/4/2001laundry detergent")
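The same two-capture-group idea can be sketched in base R with sub(). Converting the date column with as.Date() is an extra step not requested in the question, and it assumes the dates are in day/month/year order; pattern and split_df are illustrative names.

```r
V1 <- c("23/2/2000shampoo", "24/2/2000flour",
        "21/10/2000poultry", "17/4/2001laundry detergent")

# Group 1: the date (1-2 digit day and month, 4 digit year); group 2: everything after it.
pattern <- "^([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})(.*)$"

split_df <- data.frame(
  date = as.Date(sub(pattern, "\\1", V1), format = "%d/%m/%Y"),  # assumes d/m/Y
  item = sub(pattern, "\\2", V1),
  stringsAsFactors = FALSE
)
split_df
```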
You can also use :
data <- data.frame(V1 = c("23-02-2000shampoo", "24-02-2001flour"))
library(stringr)
str_split_fixed(data$V1, "(?<=[0-9])(?=[a-z])", 2)
[,1] [,2]
[1,] "23-02-2000" "shampoo"
[2,] "24-02-2001" "flour"

Splitting column with differing syntax in R

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
Head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I've been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Using separate in tidyr separate the subfields of Address into 3 fields merging anything left over into the last and then use separate again to split off the last 4 digits in the Number column that was generated in the first separate.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).
Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
The stringr package is convenient for regex work in R: for example, str_match returns every capture group, so in this example a single expression could separate each component of the address.
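Capturing every component with one expression, as suggested above, might look like this. The sketch uses base R's regexec() rather than stringr, and it assumes the layout described in the question (road name, street number immediately followed by a 4-digit postal code, then the city); the pattern and column names are illustrative.

```r
address <- c("Petersvej 772900 Hoersholm",
             "Annasvej 121B2900 Hoersholm",
             "Kr\u00e6nsvej 125800 Lyngby C")

# road, spaces, street number, 4-digit postal code, spaces, city
pattern <- "^(.*[^ ]) +([^ ]+)([0-9]{4}) +(.*)$"
m <- regmatches(address, regexec(pattern, address))

# Each element of m holds the full match followed by the four groups;
# bind them into a matrix and drop the full-match column.
parts <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                       stringsAsFactors = FALSE)
names(parts) <- c("Road", "StreetNo", "Postal", "City")
parts
```

Because the postal code group is exactly four digits followed by a space, a two-word city such as "Lyngby C" ends up whole in the last column.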

How to add values based on a specific character and pad to a fixed number of digits in R

The basic width is xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I have to add "0" when either side of the "." does not have enough digits.
Using regexpr to find the location of "[.]" in combination with str_pad can fix the first 4 digits, but I don't know how to pad after the specific character to a fixed number of digits.
(I cannot find a library that can count the location starting from a specified character.)
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand regex such as "[", etc. Some explanation of it would be super helpful.
I also have a combination like this:
df$Category<-ifelse(regexpr("[.]",df$Category)==4,
paste("0",df$Category,sep = ""),df$Category)
df$Category<-str_pad(df$Category,11,side = c("right"),pad="0")
I'd like to know whether there is a better way to do this, especially to count and return the location from the END until a specific character appears.
Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': formatting doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads leading zeros;
digits = 6: the desired number of digits after the decimal point (format = "f");
Input df seems to be character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")
We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
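Note that the %f conversion needs numeric input. If Category is stored as character, as it appears to be in the question, a conversion with as.numeric() would be needed first. A minimal sketch on a few of the values:

```r
category_chr <- c("300.030340", "3400.040290", "700.07011", "1700.0901")

# Total width 11, zero-padded on the left, 6 digits after the decimal point.
padded <- sprintf("%011.6f", as.numeric(category_chr))
padded
```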
There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break the problem down into its component parts. One approach that is, in my opinion, transparent and easy to follow is to use the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
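The question also asks how to count from the end of the string back to the dot. One way to sketch this in base R is to subtract the dot position from the string length. This assumes each value has exactly one dot and never exceeds 4 digits before or 6 digits after it; the variable names are illustrative.

```r
x <- c("300.030340", "3400.040290", "700.07011", "1700.0901")

# "[.]" is a character class containing only a dot, so it matches a literal "."
# (a bare "." in a regex would match any character).
dot_pos <- as.integer(regexpr("[.]", x))

digits_before <- dot_pos - 1L
digits_after  <- nchar(x) - dot_pos  # number of characters between the dot and the end

# Prepend and append as many zeros as are missing on each side.
padded <- paste0(strrep("0", 4L - digits_before), x, strrep("0", 6L - digits_after))
padded
```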
Use separate in tidyr to separate Category on decimal. Use str_pad from stringr to add zeros in the front or back and paste them together.
library(tidyr) # to separate columns on decimal
library(dplyr) # to mutate and pipes
library(stringr) # to strpad
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)

Converting csv values to table in R

I have some data from a poll which looks like this:
Freetime_activities
1 Travelling, On the PC, Clubbing
2 Sports, On the PC, Clubbing
3 Clubbing
4 On the PC
5 Travelling, On the PC, Clubbing
6 On the PC
7 Watching TV, Travelling
I want to get the count of each value (how many times Travelling/On the PC/etc.), but I'm having trouble splitting the values. Is there a function in R that can do for example:
split("A,B,C") ->
1 A
2 B
3 C
Or is there a straight forward solution to counting the values directly from the column?
We can use strsplit to split the column by the delimiter (", "), unlist the list output, and then use table to get the frequency
tbl <- table(unlist(strsplit(as.character(df1$Freetime_activities),
", ")))
as.data.frame(tbl)
# Var1 Freq
#1 Clubbing 4
#2 On the PC 5
#3 Sports 1
#4 Travelling 3
#5 Watching TV 1
NOTE: as.character is used here in case the column is a factor, as strsplit only accepts character vectors.
Or another option would be to use scan to extract the elements, and then with table get the frequency.
table(trimws(scan(text = as.character(df1$Freetime_activities),
what = "", sep = ",")))
Or using read.table with unlist and table
table(unlist(read.table(text = as.character(df1$Freetime_activities),
sep = ",", fill = TRUE, strip.white = TRUE)))
EDIT: Based on @David Arenburg's comments.
data
df1 <- structure(list(Freetime_activities = c("Travelling, On the PC, Clubbing",
"Sports, On the PC, Clubbing", "Clubbing", "On the PC", "Travelling, On the PC, Clubbing",
"On the PC", "Watching TV, Travelling")),
.Names = "Freetime_activities",
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))
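For the literal split("A,B,C") example in the question, strsplit is the closest base R function. A small self-contained sketch that also sorts the counts; splitting on a bare comma and trimming whitespace afterwards is a design choice here so that inconsistent spacing after commas does not matter. The variable names are illustrative.

```r
activities <- c("Travelling, On the PC, Clubbing", "Sports, On the PC, Clubbing",
                "Clubbing", "On the PC", "Travelling, On the PC, Clubbing",
                "On the PC", "Watching TV, Travelling")

# strsplit returns a list with one character vector per input string.
single <- strsplit("A,B,C", ",", fixed = TRUE)[[1]]

# Split on commas, trim surrounding whitespace, then count and sort.
values <- trimws(unlist(strsplit(activities, ",", fixed = TRUE)))
counts <- sort(table(values), decreasing = TRUE)
counts
```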
