Identify a set of strings and remove them from a column

I'm trying to loop through a column and remove any characters at the start of each value that match one of my predefined set of strings.
Reproducible Example
df <- data.frame(serial = 1:3, name = c("Javier", "Kenneth", "Kasey"))
serial name
1 1 Javier
2 2 Kenneth
3 3 Kasey
Condition Vector
Remove these strings from the front of name only!
vec <- c("Ja", "Ka")
Intended Output
serial name
1 1 vier
2 2 Kenneth
3 3 sey

We could build a single pattern by pasting the elements of vec together, each anchored with ^ and separated by |, and remove the match using sub.
df$name <- sub(paste0("^", vec, collapse = "|"), "", df$name)
df
# serial name
#1 1 vier
#2 2 Kenneth
#3 3 sey
In stringr we can also use str_remove
stringr::str_remove(df$name, paste0("^", vec, collapse = "|"))
#[1] "vier" "Kenneth" "sey"
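One caveat with the pattern-based approaches above: the entries of vec are interpreted as regular expressions, so a prefix such as "Mr. " would also match "Mrs ". A minimal sketch of one way around this, assuming a hypothetical helper escape_regex (not part of base R or stringr) that backslash-escapes the common metacharacters before the pattern is built; titles and prefixes are made-up example data:

```r
# Hypothetical helper: backslash-escape common regex metacharacters
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

titles   <- c("Mr. Smith", "Mrs Jones")
prefixes <- c("Mr. ")

# Unescaped: "." matches any character, so "Mrs " is stripped as well
unescaped <- sub(paste0("^", prefixes, collapse = "|"), "", titles)

# Escaped: only the literal prefix "Mr. " is removed
escaped <- sub(paste0("^", escape_regex(prefixes), collapse = "|"), "", titles)
```

If the prefixes only ever contain letters and digits, as in the question, the escaping step is unnecessary.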

Since we're using fixed length vec strings in this example, it might even be more efficient to use substr replacements. This will only really pay off in the case when df and/or vec is large though, and comes at the price of some flexibility.
df$name <- as.character(df$name)
sel <- substr(df$name, 1, 2) %in% vec
df$name[sel] <- substr(df$name, 3, nchar(df$name))[sel]
# serial name
#1 1 vier
#2 2 Kenneth
#3 3 sey
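If the prefixes do not all have the same length, the same regex-free idea can be kept by testing each prefix with startsWith. The following is only a sketch; strip_prefixes is a hypothetical helper name, and it assumes that each value should lose at most one prefix:

```r
# Hypothetical helper: remove at most one leading prefix per element
strip_prefixes <- function(x, prefixes) {
  x <- as.character(x)
  done <- is.na(x)                      # NA values are left untouched
  for (p in prefixes) {
    hit <- !done & startsWith(x, p)     # elements that start with this prefix
    x[hit] <- substring(x[hit], nchar(p) + 1)
    done <- done | hit                  # do not strip a second prefix
  }
  x
}

res <- strip_prefixes(c("Javier", "Kenneth", "Kasey"), c("Ja", "Ka"))
res
```

Prefixes are tried in the order given, so when one prefix is the beginning of another (for example "Ka" and "Kas"), listing the longer one first removes the longest match.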

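We can also do this with substring. Note that replace_na comes from tidyr, and the pattern is anchored with ^ so that only a match at the start of the string is removed.
library(stringr)
library(tidyr)
df$name <- substring(df$name, replace_na(str_locate(df$name,
paste0("^", vec, collapse="|"))[,2] + 1, 1))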
df$name
#[1] "vier" "Kenneth" "sey"
Or with str_replace
str_replace(df$name, paste0("^", vec, collapse="|"), "")
#[1] "vier" "Kenneth" "sey"
Or using gsubfn
library(gsubfn)
gsubfn("^.{2}", setNames(rep(list(""), length(vec)), vec), as.character(df$name))
#[1] "vier" "Kenneth" "sey"

Related

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?

Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces every value that starts with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
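The equivalence described in the help excerpt can be checked directly. A small sketch on the same sample values; the variable names are chosen here for illustration:

```r
x      <- c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA")
prefix <- "AA"

by_startswith <- startsWith(x, prefix)
by_substring  <- substring(x, 1, nchar(prefix)) == prefix
by_grepl      <- grepl(paste0("^", prefix), x)  # fine here: "AA" has no regex metacharacters

# All three produce the same logical vector for this input
same <- identical(by_startswith, by_substring) && identical(by_startswith, by_grepl)
same
```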

We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
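When several prefixes each need their own replacement, repeating the line above for every prefix gets tedious. One possible alternative is a named lookup vector keyed on the first two characters. This is only a sketch: the codes and the mapping in lookup are made-up examples, and it assumes every prefix of interest is exactly two characters long:

```r
codes  <- c("AABC", "AAUCC", "BBXY", "CDEF")
lookup <- c(AA = "A", BB = "B")          # hypothetical prefix -> group mapping

key     <- substr(codes, 1, 2)           # first two characters of each code
recoded <- ifelse(key %in% names(lookup), lookup[key], codes)
recoded
```

Values whose prefix is not in lookup (here "CDEF") are kept unchanged.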

You can use stringr and dplyr. The pattern "^AA.*" also matches a value that is exactly "AA".
library(dplyr)
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.*", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.*", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))

Match character vector in a dataframe with another character vector and trim character

Here is a dataframe and a vector.
df1 <- tibble(var1 = c("abcd", "efgh", "ijkl", "qrst"))
vec <- c("abcd", "mnop", "ijkl")
Now, for all the values in var1 that match a value in vec, keep only the first 3 characters of var1, so that the desired result is:
df2 <- tibble(var1 = c("abc", "efgh", "ijk", "qrst"))
Since "abcd" matches, we keep only 3 characters, i.e. "abc" in df2, but "efgh" doesn't exist in vec, so we keep it as is, i.e. "efgh" in df2.
How can I use dplyr and/or stringr to accomplish this?

You can just use %in% to check whether the strings are in the vector, and substr to trim the vector:
df1 %>%
mutate(var1 = ifelse(var1 %in% vec, substr(var1, 1, 3), var1))
# A tibble: 4 x 1
# var1
# <chr>
#1 abc
#2 efgh
#3 ijk
#4 qrst
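For comparison, the same logic can be written without dplyr by using logical indexing. A minimal base R sketch on the question's data, with var1 held as a plain character vector rather than a tibble column:

```r
var1 <- c("abcd", "efgh", "ijkl", "qrst")
vec  <- c("abcd", "mnop", "ijkl")

hit <- var1 %in% vec                     # TRUE where the value appears in vec
var1[hit] <- substr(var1[hit], 1, 3)     # trim only the matching values
var1
```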

Spacing vector by regular pattern

I have a vector
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
which generally follows the pattern of alternating between two different sequences of strings (the first sequence being all alphabetical, the second being numerical with the symbol #).
However there are cases where no # term appears: so in the above between mp and jq, and then again after ez. I would like to define a function which "fills the gaps" with the character string #, so that I would have the output:
[1] "ab" "#4" "gw" "#29" "mp" "#" "jq" "#35" "ez" "#"
which I would then convert to a data frame
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
My attempt so far is rather clunky and relies on looping through the vector and filling the gaps. I'd be interested to see more elegant solutions.
My Solution
greplSpace <- function(pattern, replacement, x){
j <- 1
while( j < length(x) ){
if(grepl(pattern, x[j+1]) ){
j <- j+2
} else {
x <- c( x[1:j], replacement, x[(j+1):length(x)] )
j <- j+2
}
}
if( ! grepl(pattern, tail(x,1) ) ){ x <- c(x, replacement) }
return(x)
}
library(magrittr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
vec %>% greplSpace("#", "#", . ) %>%
matrix(ncol = 2, byrow = TRUE) %>%
as.data.frame

Starting with your vec, we can create the expected data frame directly with a few functions from dplyr, tidyr, and stringr.
library(dplyr)
library(tidyr)
library(stringr)
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")
dat <- tibble(Value = vec) # data_frame() is deprecated in favour of tibble()
dat2 <- dat %>%
mutate(String = !str_detect(Value, "#"),
Key = ifelse(String, "V1", "V2"),
Row = cumsum(String)) %>%
select(-String) %>%
spread(Key, Value, fill = "#") %>%
select(-Row)
dat2
# # A tibble: 5 x 2
# V1 V2
# <chr> <chr>
# 1 ab #4
# 2 gw #29
# 3 mp #
# 4 jq #35
# 5 ez #

Here is a base R option with split. Create a logical index by checking for "#" in each of the strings, take the cumulative sum of its negation, and split the original vector by this grouping variable into a list ('lst'). List elements that have fewer than two elements (the maximum length) are padded with NA at the end by assignment with `length<-`. Then rbind the list elements into a two-column matrix. If needed, convert those NA to #.
lst <- split(vec, cumsum(!grepl("#", vec)))
out <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
out[,2][is.na(out[,2])] <- "#" #not recommended though
out
# [,1] [,2]
#1 "ab" "#4"
#2 "gw" "#29"
#3 "mp" "#"
#4 "jq" "#35"
#5 "ez" "#"
Wrap it with as.data.frame if we need a data.frame output
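To make the grouping step easier to follow, here is the same idea spelled out with the intermediate values shown. This is a sketch rather than a replacement for the code above; it pads with "#" directly instead of NA and assumes each alphabetical term is followed by at most one # term:

```r
vec <- c("ab", "#4", "gw", "#29", "mp", "jq", "#35", "ez")

is_label <- !grepl("^#", vec)   # TRUE for the alphabetical terms
grp      <- cumsum(is_label)    # 1 1 2 2 3 4 4 5: each label starts a new group
lst      <- split(vec, grp)     # one list element per output row

# Pad groups that have no "#" term, then bind the rows together
padded <- lapply(lst, function(x) c(x, "#")[1:2])
out    <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
out
```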

You can use Base R:
First collapse the vector into a string, inserting # where needed.
Then just read it using read.csv
vec1=gsub("([a-z]),\\s*([a-z])|$","\\1,#,\\2",toString(vec))
read.csv(text=gsub("(#.*?),","\\1\n",vec1),h=F)
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #
Explanation:
First collapse the vector into a string with toString.
Then, wherever there are letters on both sides of the comma (i.e. [a-z],\s*[a-z]) or at the end of the string (i.e. |$), insert a #.
Then create line breaks after each # term and read in the data as a table.
You can also do:
a=read.csv(h=F,text=toString(sub("([a-z]+)","\n\\1",vec)),na=c(" ",""))[1:2]
a
V1 V2
1 ab #4
2 gw #29
3 mp <NA>
4 jq #35
5 ez <NA>
data.frame(replace(as.matrix(a),is.na(a),"#"))
V1 V2
1 ab #4
2 gw #29
3 mp #
4 jq #35
5 ez #

Another base possibility:
do.call(rbind, tapply(vec, cumsum(!grepl("^#", vec)), FUN = function(x){
if(length(x) == 1) c(x, "#") else x}))
# [,1] [,2]
# 1 "ab" "#4"
# 2 "gw" "#29"
# 3 "mp" "#"
# 4 "jq" "#35"
# 5 "ez" "#"
Explanation:
1. Check whether each element of vec starts with #, and negate the result: !grepl("^#", vec) creates a logical vector.
2. Create a grouping variable by applying cumsum to the logical vector (note: steps 1 and 2 are similar to @akrun's answer).
3. Use tapply to apply a function to each subset of vec, defined by the grouping variable. Check if the length is 1. If so, pad with a trailing #, else just return the subset: if(length(x) == 1) c(x, "#") else x
4. Bind the resulting list together by row: do.call(rbind, ...)
Another one:
# create a row index
ri <- cumsum(!grepl("^#", vec))
# create a column index
ci <- ave(ri, ri, FUN = seq_along)
# create an empty matrix of desired dimensions
m <- matrix(nrow = max(ri), ncol = 2)
# assign 'vec' to matrix at relevant indices
m[cbind(ri, ci)] <- vec
# replace NA with '#'
m[is.na(m)] <- "#"
Using data.table. Create a grouping variable as above, and reshape from long to wide.
library(data.table)
d <- data.table(vec)
d[ , g := cumsum(!grepl("^#", vec))]
dcast(d, g ~ rowid(g), value.var = "vec", fill = "#")
# g 1 2
# 1: 1 ab #4
# 2: 2 gw #29
# 3: 3 mp #
# 4: 4 jq #35
# 5: 5 ez #

R splitting a column of character separated by different number of spaces

I have a data frame with a column consisting of words separated by a varying number of spaces for example:
head(lst)
'fff fffd ddd'
'sss dd'
'de dd'
'dds sssd eew rrr'
'dsds eed'
What I would like to have is 2 columns:
The first column is the part before the first space
And the second column is the part after the last space
meaning it should look like this
V1 V2
'fff' 'ddd'
'sss' 'dd'
'de' 'dd'
'dds' 'rrr'
'dsds' 'eed'
The first column I am able to get but the second one is a problem
This is the code I use.
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst ,`[`, 1)
v2 <- sapply(lst, `[`, 2)
What I get for column v2 is the second word. I know it's because I put 2 inside the sapply. How do I tell it to take only what comes after the last space?

You can use tail to grab the last entry of each vector:
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst, head, 1) # example with head to grab first vector element
v2 <- sapply(lst, tail, 1) # example with tail to grab last vector element
Or perhaps the vapply version since you know your return type should be a character vector:
v2 <- vapply(lst, tail, 1, FUN.VALUE = character(1))
Another approach would be to modify your strsplit split criteria to something like this where you split on a space that can optionally be followed by any character one or more times until a final space is found.
strsplit(df$V1, "\\s(?:.+\\s)?")
#[[1]]
#[1] "fff" "ddd"
#
#[[2]]
#[1] "sss" "dd"
#
#[[3]]
#[1] "de" "dd"
#
#[[4]]
#[1] "dds" "rrr"
#
#[[5]]
#[1] "dsds" "eed"
As Sumedh points out this regex works nicely with tidyr's separate:
tidyr::separate(df, V1, c("V1", "V2"), "\\s(?:.+\\s)?")
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Two stringi based approaches:
library(stringi)
v1 <- stri_extract_first_regex(df$V1, "\\S+")
v2 <- stri_extract_last_regex(df$V1, "\\S+")
Or
stri_extract_all_regex(df$V1, "^\\S+|\\S+$", simplify = TRUE)
# this variant explicitly checks for the spaces with lookarounds:
stri_extract_all_regex(df$V1, "^\\S+(?=\\s)|(?<=\\s)\\S+$", simplify = TRUE)

Maybe this?
lst <- strsplit(athletes.df$V1, "\\s+")
v1 <- sapply(lst ,`[`, 1)
v2 <- sapply(lst, function(x) x[length(x)])
Or
data.frame(t(sapply(strsplit(athletes.df$V1, "\\s+"),
function(x) c(x[1], x[length(x)]))))
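If the words can be separated by runs of several spaces, another possibility is to skip the splitting and delete everything after the first whitespace, or everything up to the last whitespace, with sub. A minimal base R sketch; the vector x is a made-up example that includes double and triple spaces:

```r
x <- c("fff fffd ddd", "sss  dd", "dds   sssd eew rrr")

first_word <- sub("\\s.*$", "", x)   # drop everything from the first whitespace on
last_word  <- sub("^.*\\s", "", x)   # greedy: drop everything up to the last whitespace

data.frame(V1 = first_word, V2 = last_word, stringsAsFactors = FALSE)
```

A value with no whitespace at all is returned unchanged in both columns, which may or may not be what is wanted.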

Without using any packages, this can be done with read.table after creating a delimiter using sub.
read.table(text=sub("^(\\S+)\\s+.*\\s+(\\S+)$", "\\1 \\2", df1$V1),
header=FALSE, stringsAsFactors= FALSE)
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
Another convenient option is word from stringr
library(stringr)
transform(df1, V1 = word(V1, 1), V2 = word(V1, -1))
# V1 V2
#1 fff ddd
#2 sss dd
#3 de dd
#4 dds rrr
#5 dsds eed
data
df1 <- structure(list(V1 = c("fff fffd ddd", "sss dd", "de dd",
"dds sssd eew rrr",
"dsds eed")), .Names = "V1", class = "data.frame", row.names = c(NA,
-5L))

How to remove '.' from column names in a dataframe?

My dataframe which I read from a csv file has column names like this
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, ewfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I can change the column names one by one, but I have a lot of files with 30 or more columns in each file.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help

1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
A.B C.D
1 1 2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
## A.B C.D
## 1 1 2
## 2 3 4
read.csv(text = Lines, check.names = FALSE)
## A B C D
## 1 1 2
## 2 3 4
however, in this example it still leaves a name that would have to be quoted in sqldf since the names have embedded spaces.
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))

UPDATE dplyr 0.8.0
As of dplyr 0.8 funs() is soft deprecated, use formula notation.
A dplyr way to do this using stringr:
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
rename_all(~str_replace_all(.,"\\.","_")) # note we have to escape the '.' character with \\
Make sure you install the packages with install.packages().
Remember that . is a wildcard in regex, which functions like str_replace_all use, so you have to escape it as \\. to match a literal dot.

To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub( ".", "", names(test), fixed = TRUE)
test
abcdef ewfasdfkl qqitvsfaddwcoil
1 NA NA NA
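The difference between the question's attempt and the fix comes down to two things: sub replaces only the first match, and an unescaped "." is a regex wildcard. A short sketch on the column names from the question, kept in a plain character vector nms for illustration:

```r
nms <- c("abc.def", "ewf.asd.fkl", "qqit.vsf.addw.coil")

first_char_gone <- sub(".", "", nms)                # regex "." matches the first character
first_dot_gone  <- sub(".", "", nms, fixed = TRUE)  # literal dot, first occurrence only
all_dots_gone   <- gsub(".", "", nms, fixed = TRUE) # literal dot, every occurrence
same_with_regex <- gsub("\\.", "", nms)             # escaped dot as a regex

all_dots_gone
```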

You can also try:
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE) # without fixed = TRUE, "." matches every character
