Remove a number of characters from a string in a column - r

I have a data frame with a column of strings and I would like to remove the first three characters in each of the strings. As in the following example:
From this:
df <- data_frame(col1 = c('01_A','02_B', '03_C'))
To this:
df <- data_frame(col1 = c('A','B', 'C'))
I have been trying to use the dplyr transmute function but I can't really get it to work.
Any help would be super appreciated!

I think this will work:
library(dplyr)
library(stringr)
df %>%
mutate(col1 = str_remove(col1, "\\d+(_)"))
col1
1 A
2 B
3 C

We could also use substring from base R for position-based extraction, since the OP wants to drop a fixed number of leading characters:
df$col1 <- substring(df$col1, 4)
df$col1
#[1] "A" "B" "C"

You can use sub like below
> df %>%
+ mutate(col1 = sub("^.{3}", "", col1))
# A tibble: 3 x 1
col1
<chr>
1 A
2 B
3 C
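One caveat about the answers above: the position-based versions (substring, "^.{3}") and the pattern-based one ("\\d+(_)") only agree while every prefix is exactly three characters long. A minimal base R sketch, using a made-up vector x with one longer prefix added purely for illustration:

```r
# Hypothetical input: the third value has a four-character prefix
x <- c("01_A", "02_B", "103_C")

# Position-based: always drops exactly the first three characters
by_position <- substring(x, 4)

# Pattern-based: drops the leading digits plus the underscore, whatever their length
by_pattern <- sub("^\\d+_", "", x)

print(by_position)
print(by_pattern)
```

Which one is appropriate depends on whether the prefix is defined by its length or by its shape.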

Related

How do I get the column number from a dataframe which contains specific strings?

I have a data frame df with 7 columns and I have a list z containing multiple strings.
I want a dataframe containing only the columns in df which contain the strings from z.
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
How do I get the column number of the z strings in df? Or how do I get a dataframe with only the columns which contains the z strings.
What I want is:
print(df)
"a_means" "c_m" "f_m"
What I tried:
match(a, names(df)
and
df[,which(colnames(df) %in% colnames(df[ ,grepl(z,names(df)])]
You can use the following, assuming the columns are actually named a_means, b_means, and so on:
df[,match(z, substring(colnames(df), 1, 3))]
With base R:
z <- paste(z, collapse = "|")
df[, grepl(z, names(df))] # you could use grep as well
Combine the search patterns and use that as a pattern for stringr::str_detect() function.
library(dplyr)
library(stringr)
df <- data.frame(a_means = "a_means",
b_means = "b_means",
c_means = "c_means",
d_means = "d_means",
e_means = "e_means",
f_means = "f_means",
g_means = "g_means"
)
z <- c("a_m","c_m","f_m")
z <- paste(z, collapse = "|")
df %>% select_if(str_detect(names(df), z))
#> a_means c_means f_means
#> 1 a_means c_means f_means
You can simply do this:
library(dplyr)
df %>%
select(contains(z))
Check out help("starts_with"). You can also match to a starting prefix with starts_with() among other things.
You can use select and matches to subset the columns based on z
library(dplyr)
df <- data.frame("a_means","b_means","c_means","d_means","e_mean","f_means","g_means")
z <- c("a_m","c_m","f_m")
df %>%
select(matches(z))
#> X.a_means. X.c_means. X.f_means.
#> 1 a_means c_means f_means
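The question also asks for the column numbers themselves. A minimal base R sketch, assuming the columns are named a_means to g_means (as in the str_detect answer, not the unnamed data.frame() call from the question): grep returns the matching positions, which can then be used for subsetting.

```r
# Assumed layout: one column per name, values are placeholders
df <- data.frame(a_means = 1, b_means = 2, c_means = 3, d_means = 4,
                 e_means = 5, f_means = 6, g_means = 7)
z <- c("a_m", "c_m", "f_m")

# Build one alternation pattern: "a_m|c_m|f_m"
pattern <- paste(z, collapse = "|")

# grep() returns the positions of the matching column names
idx <- grep(pattern, names(df))
print(idx)

# drop = FALSE keeps a data frame even if only one column matches
subset_df <- df[, idx, drop = FALSE]
print(names(subset_df))
```

The pattern is unanchored, so "a_m" would also match a name such as "beta_m"; prefixing each string with "^" restricts it to leading matches.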

Combining two columns with character strings into a new column

Below I have two columns of data (column 6 and 7) of genus and species names. I would like to combine those two columns with character string data into a new column with the names combined.
I am quite new to R and the code below does not work! Thank you for the help, wonderful people of Stack Overflow!
#TRYING TO MIX GENUS & SPECIES COLUMN
accepted_genus <- merged_subsets_2[6]
accepted_species <- merged_subsets_2[7]
accepted_genus
accepted_species
merged_subsets_2%>%
bind_cols(accepted_genus, accepted_species)
merged_subsets_2
We can use str_c from stringr
library(dplyr)
library(stringr)
df %>%
mutate(Col3 = str_c(Col1, Col2))
Or with unite
library(tidyr)
df %>%
unite(Col3, Col1, Col2, sep="", remove = FALSE)
If the above doesn't answer your question, take a look at this base R approach with paste0 and paste.
df <- data.frame(Col1 = letters[1:2], Col2=LETTERS[1:2]) # Sample data
> df
Col1 Col2
1 a A
2 b B
df$Col3 <- paste0(df$Col1, df$Col2) # Without spacing
> df
Col1 Col2 Col3
1 a A aA
2 b B bB
df$Col3 <- paste(df$Col1, df$Col2)
> df
Col1 Col2 Col3
1 a A a A
2 b B b B
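Applied to the genus/species case from the question, a sketch might look like this. The data frame taxa, the new column name accepted_name, and the values are made up for illustration; in the original data the two inputs would be columns 6 and 7 of merged_subsets_2.

```r
# Hypothetical stand-in for columns 6 and 7 of merged_subsets_2
taxa <- data.frame(accepted_genus = c("Quercus", "Acer"),
                   accepted_species = c("alba", "rubrum"),
                   stringsAsFactors = FALSE)

# paste() joins element-wise; sep = " " puts a space between genus and species
taxa$accepted_name <- paste(taxa$accepted_genus, taxa$accepted_species, sep = " ")
print(taxa$accepted_name)
```

Unlike bind_cols, which only places columns side by side, paste builds a single new character column from the two inputs.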

Creating another column in R

This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to match the substring "Narrow", followed by a comma and zero or more spaces (\\s*), replace it with a blank (""), and convert the result to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
separate(col = stimulus,
sep = ", ",
into = c("Text","Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(dplyr)
library(tidyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 a d
#> 4 b c
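Since the data set in the question is not shown, here is a self-contained sketch of the sub and regmatches approaches. It assumes the Stimulus column holds values such as "Narrow, 20"; the sample values are invented.

```r
# Invented sample data; the real column may look different
df1 <- data.frame(Stimulus = c("Narrow, 20", "Narrow, 35", "Narrow,5"),
                  stringsAsFactors = FALSE)

# Strip the "Narrow," prefix plus any spaces, then convert to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))

# Alternative: pull out the first run of digits in each string
df1$New2 <- as.numeric(regmatches(df1$Stimulus, regexpr("\\d+", df1$Stimulus)))

print(df1)
```

The regmatches variant silently drops elements without a match, so it only lines up with the rows when every value contains a number.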

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
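Here is another base R solution. Indexing the column itself (rather than whole rows) keeps any other columns in f2016 untouched:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
f2016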
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces the strings that start with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
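One detail worth checking in the answers above: the pattern "^AA.+" requires at least one character after the prefix, so a value that is exactly "AA" is left unchanged, while startsWith and "^AA" also catch it. A small sketch with invented values (the id column is only there to show that other columns are not involved):

```r
# Invented data; the last value is exactly the prefix
f2016 <- data.frame(id = 1:4,
                    SECT16 = c("AABC", "AAUCC", "ABBC", "AA"),
                    stringsAsFactors = FALSE)

# Prefix test without regular expressions
with_startswith <- ifelse(startsWith(f2016$SECT16, "AA"), "A", f2016$SECT16)

# ".+" needs at least one character after "AA"; ".*" also accepts none
with_plus <- sub("^AA.+", "A", f2016$SECT16)
with_star <- sub("^AA.*", "A", f2016$SECT16)

print(with_startswith)
print(with_plus)
print(with_star)
```

With codes that are always 4 or 5 characters long, as described in the question, the three variants should give the same result.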

Extract string from a cell and put it in a new data frame R

In an R project, I want to extract strings from a data frame in which a column looks like this:
"A|B|C"
"B|Z"
"I|P"
...
I want a new data frame with the columns A B C Z I P.
I was thinking of doing it with a for loop and gsub, but it is not easy because the pattern has to handle the | character, and I am not sure this is the best or most elegant way to do this kind of task.
With a combination of strsplit,unlist and unique you can do:
#Steps:
#1) split each element of column with separator as "|"
#2) combine output for all items with unlist
#3) retain unique elements of those
vec = c("A|B|C","B|Z","I|P")
newDF = data.frame(newCol = unique(unlist(lapply(vec,function(x) unlist(strsplit(x,"[|]")) ))),
stringsAsFactors = FALSE)
newDF$newCol
#[1] "A" "B" "C" "Z" "I" "P"
Starting with the dataframe df, with base R we can try the following:
data.frame(col=unique(unlist(strsplit(as.character(df$col), split='\\|'))))
# col
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
or with dplyr and tidyr (unnest comes from tidyr)
library(dplyr)
library(tidyr)
df %>%
mutate(col = strsplit(col, "\\|")) %>%
unnest(col) %>% unique
# col
# (chr)
#1 A
#2 B
#3 C
#4 Z
#5 I
#6 P
data
df <- data.frame(col=c("A|B|C",
"B|Z",
"I|P"), stringsAsFactors = FALSE)
If you want them to be the names of the columns, try this:
symbols <- unique(unlist(strsplit(as.character(df$col), split='\\|')))
df <- data.frame(matrix(vector(), 0, length(symbols),
dimnames=list(c(), symbols)), stringsAsFactors=F)
df
#[1] A B C Z I P
#<0 rows> (or 0-length row.names)
We can use cSplit
library(splitstackshape)
unique(cSplit(df1, "V1", "|", "long"), by = "V1")
data
df1 <- data.frame(V1 = c("A|B|C","B|Z","I|P"))
The scan function with the text parameter input appears suited for this task:
st <- c("A|B|C","B|Z","I|P")
scan(text=st, what="", sep="|")
Read 7 items
[1] "A" "B" "C" "B" "Z" "I" "P"
It wasn't clear to me from your problem description or example how you wanted this to be aligned with the original 3 row dataframe.
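If the goal is to keep the original three rows and add one column per symbol, one possible reading is an indicator table. This is only a sketch under that assumption; the question does not say what the cells should contain, and the names parts, symbols, indicator and result are illustrative.

```r
df <- data.frame(col = c("A|B|C", "B|Z", "I|P"), stringsAsFactors = FALSE)

# fixed = TRUE treats "|" literally, so no regex escaping is needed
parts <- strsplit(df$col, "|", fixed = TRUE)
symbols <- unique(unlist(parts))

# One row per original row, one logical column per symbol
indicator <- t(sapply(parts, function(p) symbols %in% p))
colnames(indicator) <- symbols

result <- cbind(df, as.data.frame(indicator))
print(result)
```

Each cell is TRUE when the symbol occurs in that row's string, which keeps the new columns aligned with the original data frame.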
