Remove unnecessary symbols in the data in R

That's my dataset
1.abc
2.def
3.2354
4.. $.?,
How can I delete the observations that contain only digits, only symbols such as periods and commas, or any mix of symbols and digits (e.g. 1#5??%)? I also want to drop words in the text that have fewer than two letters.

We can use str_count to count the letters in each value and keep only the rows that have more than two:
library(stringr)
library(dplyr)
df1 %>%
filter(str_count(v1, "[[:alpha:]]") > 2)
Or, in base R, use gsub to remove every character that is not a letter, count what remains with nchar, and use the result as a logical index for subsetting:
subset(df1, nchar(gsub("[^[:alpha:]]+", "", v1))>2)
# v1
#1 1.abc
#2 2.def
data
df1 <- structure(list(v1 = c("1.abc", "2.def", "3.2354", "4.. $.?,")),
.Names = "v1", class = "data.frame", row.names = c(NA, -4L))
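To see why the gsub/nchar approach keeps only the first two rows, here is a minimal base R sketch that spells out the intermediate steps. The variable names letters_only, n_letters and kept are made up for illustration; the threshold of two letters is taken from the answer above.

```r
# Sample data from the question
df1 <- data.frame(v1 = c("1.abc", "2.def", "3.2354", "4.. $.?,"),
                  stringsAsFactors = FALSE)

# Step 1: strip everything that is not a letter
letters_only <- gsub("[^[:alpha:]]+", "", df1$v1)

# Step 2: count the letters that remain in each value
n_letters <- nchar(letters_only)

# Step 3: keep rows with more than two letters (threshold from the answer)
kept <- df1[n_letters > 2, , drop = FALSE]
print(kept)
```

Rows that consist only of digits or symbols end up with zero letters, so they fall below the threshold and are dropped.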

Related

Rename columns of R dataframe with tidyselect and regular expression

I have a dataframe whose columns names are combinations of numbering and some complicated texts:
A1. Good day
A1a. Have a nice day
......
Z7d. Some other titles
Now I want to keep only the "A1.", "A1a.", "Z7d." part, removing both the leading number and the trailing text. How can I do this with tidyselect and regex?
You can use this regex:
names(df) <- sub('\\d+\\.\\s+([A-Za-z0-9]+).*', '\\1', names(df))
names(df)
#[1] "A1" "A1a" "Z7d"
The same regex can also be used in rename_with if you want a tidyverse answer.
library(dplyr)
df %>% rename_with(~sub('\\d+\\.\\s+([A-Za-z0-9]+).*', '\\1', .))
# A1 A1a Z7d
#1 0.5755992 0.4147519 -0.1474461
#2 0.1347792 -0.6277678 0.3263348
#3 1.6884930 1.3931306 0.8809109
#4 -0.4269351 -1.2922231 -0.3362182
#5 -2.0032113 0.2619571 0.4496466
data
df <- structure(list(`1. A1. Good day` = c(0.575599213383783, 0.134779160673435,
1.68849296209512, -0.426935114884432, -2.00321125417319), `2. A1a. Have a nice day` = c(0.414751904860513,
-0.627767775889949, 1.39313055331098, -1.29222310608057, 0.261957078465535
), `99. Z7d. Some other titles` = c(-0.147446140558093, 0.326334824433201,
0.880910933597998, -0.336218174873965, 0.449646567320979)),
class = "data.frame", row.names = c(NA, -5L))
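If it helps, the regex can be tried on a plain character vector before touching the data frame. This is only a sketch: the anchored pattern below is a slightly stricter variant of the one in the answer (my assumption, not the answerer's code), and the extra name "id" is hypothetical, added to show that names which do not match are left unchanged.

```r
# Column names from the question, plus a hypothetical non-matching name
old_names <- c("1. A1. Good day", "2. A1a. Have a nice day",
               "99. Z7d. Some other titles", "id")

# number, dot, spaces, then the code (captured), a dot, and the rest
pattern <- "^\\d+\\.\\s+([A-Za-z0-9]+)\\..*$"
new_names <- sub(pattern, "\\1", old_names, perl = TRUE)

# Variant that keeps the trailing dot, as written in the question
pattern_dot <- "^\\d+\\.\\s+([A-Za-z0-9]+\\.).*$"
new_names_dot <- sub(pattern_dot, "\\1", old_names, perl = TRUE)

print(new_names)
print(new_names_dot)
```

The same call can be passed to rename_with in place of the original pattern if the dot should be kept.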
We can use str_extract
library(stringr)
names(df) <- str_extract(names(df), "(?<=\\.\\s)[^.]+")
names(df)
#[1] "A1" "A1a" "Z7d"

How to remove quotation marks before and after a string in R

I've tried several approaches I found here, but I haven't gotten the result I need. I need to remove the " "" that appears in the first column and the " that appears at the end of the last column. Because the database runs to several thousand rows, the number of digits increases.
What is constant is the " "" in the first column and the " in the last column.
db <- structure(list(`"1""Name` = c("\"2\"\"AAFC", "\"3\"\"Adfd",
"\"4\"\"Abbb"), `References"` = c("3\"", "4\"", "4\"")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
If we only need to remove the leading/trailing ", use trimws with the whitespace argument set to the regex pattern to strip
library(dplyr)
db1 <- db %>%
mutate(across(everything(), ~ trimws(., whitespace = '"')))
Or use str_remove_all to remove all the double quotes
library(stringr)
db1 <- db %>%
mutate(across(everything(), ~ str_remove_all(., '"')))
To remove all occurrences of '"' from all the columns, you can use lapply with gsub:
db[] <- lapply(db, function(x) gsub('"', '', x))
db
# A tibble: 3 x 2
# `"1""Name` `References"`
# <chr> <chr>
#1 2AAFC 3
#2 3Adfd 4
#3 4Abbb 4
If there are a lot of columns and you want to do this only for selected ones, subset those columns and pass them to lapply. For example, for the first and last column:
cols <- c(1, ncol(db))
db[cols] <- lapply(db[cols], function(x) gsub('"', '', x))
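The difference between the trimws and gsub approaches is easy to miss, so here is a small base R sketch on the question's values. It assumes R 3.6.0 or later for the whitespace argument of trimws (as far as I know that is when it was added). Cleaning the column names is my own addition; none of the answers above do it.

```r
# Values shaped like the question's data (plain data.frame for simplicity)
db <- data.frame(a = c("\"2\"\"AAFC", "\"3\"\"Adfd", "\"4\"\"Abbb"),
                 b = c("3\"", "4\"", "4\""),
                 stringsAsFactors = FALSE)
names(db) <- c("\"1\"\"Name", "References\"")

# trimws only strips quotes at the two ends; inner quotes survive
trimmed <- trimws(db[[1]], whitespace = "\"")

# gsub removes every double quote, wherever it sits
db[] <- lapply(db, function(x) gsub("\"", "", x, fixed = TRUE))

# Assumption: the header should be cleaned the same way
names(db) <- gsub("\"", "", names(db), fixed = TRUE)

print(trimmed)
print(db)
```

So for this data trimws is enough for the last column, while the first column needs gsub or str_remove_all because of the quotes in the middle.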

Create two columns with multiple separators

I have a dataframe such as
COl1
scaffold_97606_2-BACs_-__SP1_1
UELV01165908.1_2-BACs_+__SP2_2
UXGC01046554.1_9-702_+__SP3_3
scaffold_12002_1087-1579_-__SP4_4
and I would like to separate it into two columns and get:
COL1 COL2
scaffold_97606 2-BACs_-__SP1_1
UELV01165908.1 2-BACs_+__SP2_2
UXGC01046554.1 9-702_+__SP3_3
scaffold_12002 1087-1579_-__SP4_4
As you can see, the separator changes: it can be .Number_ or Number_Number.
So far I wrote:
df2 <- df1 %>%
separate(COL1, paste0('col', 1:2), sep = " the separator patterns ", extra = "merge")
but I do not know what separator to use in the " the separator patterns " part.
You may use
library(dplyr)
library(tidyr)
df1 %>%
  separate(COl1, paste0('col', 1:2), sep = "(?<=\\d)_(?=\\d+-)", extra = "merge")
# col1 col2
#1 scaffold_97606 2-BACs_-__SP1_1
#2 UELV01165908.1 2-BACs_+__SP2_2
#3 UXGC01046554.1 9-702_+__SP3_3
#4 scaffold_12002 1087-1579_-__SP4_4
Pattern details
(?<=\d) - a positive lookbehind that requires a digit immediately to the left of the current location
_ - an underscore
(?=\d+-) - a positive lookahead that requires one or more digits and then a - immediately to the right of the current location.
You can use extract:
tidyr::extract(df, COl1, c('Col1', 'Col2'), regex = '(.*?\\d+)_(.*)')
# Col1 Col2
#1 scaffold_97606 2-BACs_-__SP1_1
#2 UELV01165908.1 2-BACs_+__SP2_2
#3 UXGC01046554.1 9-702_+__SP3_3
#4 scaffold_12002 1087-1579_-__SP4_4
data
df <- structure(list(COl1 = c("scaffold_97606_2-BACs_-__SP1_1",
"UELV01165908.1_2-BACs_+__SP2_2",
"UXGC01046554.1_9-702_+__SP3_3", "scaffold_12002_1087-1579_-__SP4_4"
)), class = "data.frame", row.names = c(NA, -4L))
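For anyone who wants to check the lookaround pattern without tidyr, here is a base R sketch that locates the separating underscore with regexpr and cuts the string around it. It assumes every value contains exactly the kind of separator described above; the stopifnot guard and the names pos, col1 and col2 are my own additions.

```r
x <- c("scaffold_97606_2-BACs_-__SP1_1",
       "UELV01165908.1_2-BACs_+__SP2_2",
       "UXGC01046554.1_9-702_+__SP3_3",
       "scaffold_12002_1087-1579_-__SP4_4")

# Underscore preceded by a digit and followed by digits and a hyphen
pattern <- "(?<=\\d)_(?=\\d+-)"

# Position of the first matching underscore in each string (-1 if none)
pos <- regexpr(pattern, x, perl = TRUE)
stopifnot(all(pos > 0))  # guard: this sketch assumes every row matches

# Everything before the underscore, and everything after it
col1 <- substr(x, 1, pos - 1)
col2 <- substr(x, pos + 1, nchar(x))

print(data.frame(col1, col2))
```

Only the underscore itself is consumed, because the lookbehind and lookahead match positions rather than characters.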

Merge columns that are separated by character in order

I have a table with two columns whose values are separated by ":". The problem is that not all rows have the same number of elements.
Here is an example:
Col1 Col2
AA:BB:CC 1:2:3
AA:DD:BB:CC 4:5:6:7
And I would like a third column that is
Col3
AA=1:BB=2:CC=3
AA=4:DD=5:BB=6:CC=7
I have no idea where to start. I've tried to split them, but it got me nowhere.
We can use strsplit to split the 'Col1', 'Col2' by :, then concatenate the corresponding list elements with str_c to create the 'Col3'
library(dplyr)
library(purrr)
library(stringr)
df1 %>%
mutate(col3 = map2_chr(strsplit(Col1, ":"), strsplit(Col2, ":"),
~ str_c(.x, .y, sep="=", collapse=':')))
# Col1 Col2 col3
#1 AA:BB:CC 1:2:3 AA=1:BB=2:CC=3
#2 AA:DD:BB:CC 4:5:6:7 AA=4:DD=5:BB=6:CC=7
data
df1 <- structure(list(Col1 = c("AA:BB:CC", "AA:DD:BB:CC"), Col2 = c("1:2:3",
"4:5:6:7")), class = "data.frame", row.names = c(NA, -2L))
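The same pairing can be sketched in base R with mapply, in case purrr and stringr are not available. The helper name pair_up is hypothetical, and the length check inside it is an extra guard I added, not part of the answer above.

```r
df1 <- data.frame(Col1 = c("AA:BB:CC", "AA:DD:BB:CC"),
                  Col2 = c("1:2:3", "4:5:6:7"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: pair the i-th key with the i-th value for one row
pair_up <- function(keys, values) {
  k <- strsplit(keys, ":", fixed = TRUE)[[1]]
  v <- strsplit(values, ":", fixed = TRUE)[[1]]
  stopifnot(length(k) == length(v))  # added guard against ragged rows
  paste(paste(k, v, sep = "="), collapse = ":")
}

# Apply the helper row by row; unname drops the names mapply attaches
df1$Col3 <- unname(mapply(pair_up, df1$Col1, df1$Col2))
print(df1)
```

Because each row is split on its own, rows with different numbers of elements are handled independently.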

Removing the special symbols in data.frame column values

I have two data frames, each with a column name
df1:
name
#one2
!iftwo
there_2_go
come&go
df1 = structure(list(name = c("#one2", "!iftwo", "there_2_go", "come&go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
df2:
name
One2
IfTwo#
there-2-go
come.go
df2 = structure(list(name = c("One2", "IfTwo#", "there-2-go", "come.go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
Comparing the two data frames for inequality with %in% is cumbersome because of the special symbols. Removing the special symbols with stringr could help, but how exactly can we combine stringr functions with %in% and display the mismatches between the two?
I have already used mutate() with tolower() to convert everything to lowercase, as follows:
df1<-mutate(df1,name=tolower(df1$name))
df2<-mutate(df2,name=tolower(df2$name))
Current output of comparison:
df2[!(df2 %in% df1),]
[1] "one2" "iftwo#" "there-2-go" "come.go"
Expected output as essentially the contents are same but with special symbols:
df2[!(df2 %in% df1),]
character(0)
Question: how do we ignore the symbols in the contents of the data frames?
Here it is in a function,
f1 <- function(df1, df2){
i1 <- tolower(gsub('[[:punct:]]', '', df1$name))
i2 <- tolower(gsub('[[:punct:]]', '', df2$name))
d1 <- sapply(i1, function(i) grepl(paste(i2, collapse = '|'), i))
return(!d1)
}
f1(df1, df2)
# one2 iftwo there2go comego
# FALSE FALSE FALSE FALSE
#or use it for indexing,
df2[f1(df1, df2),]
#character(0)
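Since the question asked about %in% specifically, here is a minimal sketch that normalizes both columns first and then compares them directly. Note that this does exact matching on the cleaned strings, whereas the function above uses grepl and therefore also accepts partial matches. The helper name normalize and the vector other are hypothetical, added for illustration.

```r
# Hypothetical helper: drop punctuation and lowercase
normalize <- function(x) tolower(gsub("[[:punct:]]", "", x))

df1 <- data.frame(name = c("#one2", "!iftwo", "there_2_go", "come&go"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("One2", "IfTwo#", "there-2-go", "come.go"),
                  stringsAsFactors = FALSE)

# Names in df2 with no counterpart in df1 after normalization
mismatch <- df2$name[!(normalize(df2$name) %in% normalize(df1$name))]
print(mismatch)

# Hypothetical extra vector to show what a real mismatch looks like
other <- c("One2", "brand-new")
other_mismatch <- other[!(normalize(other) %in% normalize(df1$name))]
print(other_mismatch)
```

The original, uncleaned values are returned, which makes the mismatches easier to trace back to the source data.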
