extract the first n characters from concatenated strings - r

I've a data frame with two columns: id and value.
id <- seq(1,5)
value <- c("123;21232;78678",
"4575;23423",
"1245;45757;45375;39873",
"787;536453;1121;3232;58684;2434",
"444;44444;4444;444")
df <- data.frame(id, value)
df$value <- as.character(df$value)
I need to extract the first two digits of each number (separated by ;). The output would look like:

We can use str_extract_all from stringr
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(value_2 = map_chr(str_extract_all(value, "\\b(\\d{2})"),
str_c, collapse=";"))
# id value value_2
#1 1 123;21232;78678 12;21;78
#2 2 4575;23423 45;23
#3 3 1245;45757;45375;39873 12;45;45;39
#4 4 787;536453;1121;3232;58684;2434 78;53;11;32;58;24
#5 5 444;44444;4444;444 44;44;44;44
Or with str_replace_all
df %>%
mutate(value_2 = str_replace_all(value, "(^|;)(..)[^;]+", "\\2;") %>%
trimws(whitespace = ";"))
Or using base R
df$value_2 <- sapply(regmatches(df$value,
gregexpr("\\b(\\d{2})", df$value)), paste, collapse=";")
Or with gsub from base R
trimws(gsub("(^|;)(..)[^;]+", "\\2;", df$value, perl = TRUE), whitespace = ";")
#[1] "12;21;78" "45;23" "12;45;45;39"
#[4] "78;53;11;32;58;24" "44;44;44;44"

Using just base R:
df$value_2 = sapply(strsplit(df$value, ';'), function(i){
paste(substr(i, 1, 2), collapse = ';')
})
This goes row by row and separates each string by the semicolons, then pulls out the first two characters and returns them into a new string separated by semicolons.
This will give you:
id value value_2
1 1 123;21232;78678 12;21;78
2 2 4575;23423 45;23
3 3 1245;45757;45375;39873 12;45;45;39
4 4 787;536453;1121;3232;58684;2434 78;53;11;32;58;24
5 5 444;44444;4444;444 44;44;44;44
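The split-and-rejoin approach above can be wrapped in a small helper so that the number of characters becomes a parameter, which is what the question title asks for. This is a minimal base R sketch, and the names `first_n_chars`, `n` and `sep` are made up here for illustration.

```r
# Hypothetical helper: keep the first n characters of each sep-delimited piece.
first_n_chars <- function(x, n = 2, sep = ";") {
  # Split every string on the literal separator (no regex interpretation).
  pieces <- strsplit(as.character(x), sep, fixed = TRUE)
  # Truncate each piece and glue the pieces back together.
  vapply(pieces,
         function(p) paste(substr(p, 1, n), collapse = sep),
         character(1))
}

value <- c("123;21232;78678", "4575;23423")
first_n_chars(value, n = 2)
#[1] "12;21;78" "45;23"
first_n_chars(value, n = 3)
#[1] "123;212;786" "457;234"
```

Pieces shorter than `n` are returned unchanged, because `substr` simply stops at the end of the string.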

Related

add numbers to specific observations

I have several IDs I am working with. I want to add a leading zero to values that have a single digit after the dash in id. Here is sample data.
id
2034-5
1023-12
1042-22
1231-9
I want this:
id
2034-05
1023-12
1042-22
1231-09
I tried this, but it's not working. Any advice?
x <-sprintf("%02d", df$id)
You could actually use sub here for a base R option:
df$id <- sub("-(\\d)$", "-0\\1", df$id)
df
id
1 2034-05
2 1023-12
3 1042-22
4 1231-09
Data:
df <- data.frame(id=c("2034-5", "1023-12", "1042-22", "1231-9"), stringsAsFactors=FALSE)
To use sprintf you have to separate out both the numbers, use sprintf on second number and then combine them again.
library(dplyr)
library(tidyr)
df %>%
separate(id, c('id1', 'id2')) %>%
mutate(id2 = sprintf('%02s', id2)) %>%
unite(id, id1, id2, sep = '-')
# id
#1 2034-05
#2 1023-12
#3 1042-22
#4 1231-09
An option with strsplit and sprintf from base R
df$id <- sapply(strsplit(df$id, "-"), function(x)
do.call(sprintf, c(as.list(x), fmt = "%s-%02s")))
df$id
#[1] "2034-05" "1023-12" "1042-22" "1231-09"
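One caveat with the `%02s` format used in the two sprintf answers: the R documentation for `sprintf` says that the `0` flag zero-pads character strings on some platforms and is ignored on others, so the result may be padded with a space instead. A sketch of a variant that converts the part after the dash to an integer and formats it with `%02d`, which should not depend on the platform. It assumes every id contains exactly one dash followed only by digits; the variable names `left` and `right` are illustrative.

```r
id <- c("2034-5", "1023-12", "1042-22", "1231-9")

# Everything before the dash stays a string.
left <- sub("-.*$", "", id)
# Everything after the dash becomes an integer so that %02d applies.
right <- as.integer(sub("^.*-", "", id))

sprintf("%s-%02d", left, right)
#[1] "2034-05" "1023-12" "1042-22" "1231-09"
```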

How to remove everything from a row except pattern

I have a dataframe that contains one column separated by ; like this
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is to remove everything except the codes that start with AB:
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate(), but I don't know how to continue. Any suggestions?
If your data frame is called df and your column is called V1, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
I thought of using stringr and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
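The stringr call above leaves a list column. If one semicolon-separated string per row is wanted, as in the question, the matches can be pasted back together. A base R sketch of that step follows; the column name `V1` and the `\\b` word-boundary anchor (so that only codes that begin with AB are picked up) are assumptions of this example, not part of the answer above.

```r
df <- data.frame(V1 = c("AB00001;09843;AB00002;GD00001",
                        "AB84375;34",
                        "AB84375;AB84375",
                        "74859375;AB001;4455;FG3455"),
                 stringsAsFactors = FALSE)

# gregexpr finds every match per string; regmatches returns them as a list.
matches <- regmatches(df$V1, gregexpr("\\bAB\\w+", df$V1, perl = TRUE))
# Collapse each list element into a single string.
df$V1 <- vapply(matches, paste, character(1), collapse = ";")
df$V1
#[1] "AB00001;AB00002" "AB84375"         "AB84375;AB84375" "AB001"
```

Rows without any matching code end up as an empty string rather than NA.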
1) Base R. Assuming DF as shown reproducibly in the Note at the end, we prefix each line with a semicolon, use gsub with the pattern shown to drop every field that is not an AB code, and finally remove the semicolon we added. No packages are used.
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
2) dplyr/tidyr. This one is longer than the others in this answer, but it is straightforward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Note
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

R: How to mutate new ID by modifying previous ID?

I previously asked the question "How to mutate a new column by modifying another column?".
Now I have another problem. I have to work with more 'untidy' IDs, like:
df1 <- data.frame(id=c("A-1","A-10","A-100","b-1","b-10","b-100"),n=c(1,2,3,4,5,6))
From these IDs, I want to assign new 'tidy' IDs, like:
df2 <- data.frame(id=c("A0001","A0010","A0100","B0001","B0010","B0100"),n=c(1,2,3,4,5,6))
(now I need capital 'B' instead of 'b')
I tried to use the str_pad function, but I couldn't get it to work.
We can separate the data into different columns based on "-", convert the letters to uppercase, using sprintf pad with 0's and combine the two columns with unite.
library(dplyr)
library(tidyr)
df1 %>%
separate(id, c("id1", "id2"), sep = "-") %>%
mutate(id1 = toupper(id1),
id2 = sprintf('%04s', id2)) %>%
unite(id, id1, id2, sep = "")
# id n
#1 A0001 1
#2 A0010 2
#3 A0100 3
#4 B0001 4
#5 B0010 5
#6 B0100 6
Based on the comment if there are cases where we don't have separator and we want to change certain id1 values we can use the following.
df1 %>%
extract(id, c("id1", "id2"), regex = "([[:alpha:]])-?(\\d+)") %>%
mutate(id1 = case_when(id1 == 'c' ~ 'B',
TRUE ~ id1),
id1 = toupper(id1),id2 = sprintf('%04s', id2)) %>%
unite(id, id1, id2, sep = "")
The str_pad function is handy for this purpose, as you said. But you have to extract out the digits first and then paste it all back together.
library(stringr)
paste0(toupper(str_extract(df1$id, "[A-Za-z]-")),
str_pad(str_extract(df1$id, "\\d+"), width=4, pad="0"))
[1] "A-0001" "A-0010" "A-0100" "B-0001" "B-0010" "B-0100"
Base R solution
df1$id <- sub("^(.)0+?(.{4})$","\\1\\2", sub("-", "0000", toupper(df1$id)))
tidyverse solution
library(tidyverse)
df1$id <- str_to_upper(df1$id) %>%
str_replace("-","0000") %>%
str_replace("^(.)0+?(.{4})$","\\1\\2")
Output
df1
# id n
# 1 A0001 1
# 2 A0010 2
# 3 A0100 3
# 4 B0001 4
# 5 B0010 5
# 6 B0100 6
Data
df1 <- data.frame(id=c("A-1","A-10","A-100","b-1","b-10","b-100"),n=c(1,2,3,4,5,6))
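Several answers above pad with `sprintf('%04s', ...)`, and R's `sprintf` documentation notes that the `0` flag on a string is honoured on some platforms and ignored on others. A base R sketch that pads a number instead, using `formatC` with `flag = "0"`. The function name `tidy_id` and its `width` argument are invented for this example, and it assumes each id is letters, an optional dash, then digits.

```r
# Hypothetical helper: "b-10" -> "B0010"
tidy_id <- function(id, width = 4) {
  id <- as.character(id)
  # Drop the optional dash and trailing digits, keep the letters, upper-case them.
  prefix <- toupper(sub("-?[0-9]+$", "", id))
  # Drop everything before the first digit and convert what is left to integer.
  number <- as.integer(sub("^[^0-9]*", "", id))
  # formatC zero-pads integers to the requested width.
  paste0(prefix, formatC(number, width = width, flag = "0"))
}

tidy_id(c("A-1", "A-10", "A-100", "b-1", "b-10", "b-100"))
#[1] "A0001" "A0010" "A0100" "B0001" "B0010" "B0100"
```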

R replacing words in column by annotation dataframe

I have a dataframe with gene IDs in one column (data1). In another dataframe I have the corresponding gene names (data2). Data1 also contains cells with multiple gene IDs, separated by ':', and a lot of NAs. Preferably, I want to add a column to data1 with the corresponding gene names, also separated by ':' if there are several. An alternative would be to replace all the gene IDs in data1 with the corresponding gene names. Any idea how to go about this? Thanks!
a <- c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA")
data1 <- data.frame(a)
b <- c("ENSG00000150401", "ENSG00000150403", "ENSG00000185294")
c <- c("GeneA", "GeneB", "GeneC")
data2 <- data.frame(b,c)
One option involving stringr could be:
library(stringr)
data1$res <- str_replace_all(data1$a, setNames(data2$c, data2$b))
a res
1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
2 ENSG00000185294 GeneC
3 NA NA
We can get data1 in long format, left_join data2 and paste values together.
library(dplyr)
data1 %>%
mutate(row = row_number()) %>%
tidyr::separate_rows(a, sep = ":") %>%
left_join(data2, by = c('a' = 'b')) %>%
group_by(row) %>%
summarise(a = paste0(a, collapse = ":"),
c = paste0(c, collapse = ":")) %>%
select(-row)
# a c
# <chr> <chr>
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
Here is another option with gsubfn
library(gsubfn)
data1$res <- gsubfn("\\w+", setNames(as.list(as.character(data2$c)),
data2$b), as.character(data1$a))
data1
# a res
#1 ENSG00000150401:ENSG00000150403 GeneA:GeneB
#2 ENSG00000185294 GeneC
#3 NA NA
In base R, this can be also done by splitting the 'a' column with strsplit and then do match with a named vector created from 'b', 'c' columns of second dataset
is.na(data1$a) <- data1$a == "NA" # converting to real NA instead of character
i1 <- !is.na(data1$a)
# create named vector
v1 <- setNames(as.character(data2$c), data2$b)
data1$res[i1] <- sapply(strsplit(as.character(data1$a[i1]), ":"),
function(x) paste(v1[x], collapse=":"))
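The named-vector lookup above can be packaged as a self-contained function that pre-fills the result with NA, so it does not depend on `data1$res` already existing. This is only a sketch: the function name `ids_to_names` and its arguments are hypothetical, and it treats both real NA and the literal string "NA" as missing.

```r
# Hypothetical helper: translate sep-delimited ids through a named lookup vector.
ids_to_names <- function(ids, lookup, sep = ":") {
  ids <- as.character(ids)
  out <- rep(NA_character_, length(ids))
  # Only translate entries that are neither NA nor the text "NA".
  ok <- !is.na(ids) & ids != "NA"
  out[ok] <- vapply(strsplit(ids[ok], sep, fixed = TRUE),
                    function(x) paste(lookup[x], collapse = sep),
                    character(1))
  out
}

lookup <- c(ENSG00000150401 = "GeneA",
            ENSG00000150403 = "GeneB",
            ENSG00000185294 = "GeneC")
ids_to_names(c("ENSG00000150401:ENSG00000150403", "ENSG00000185294", "NA"), lookup)
#[1] "GeneA:GeneB" "GeneC"       NA
```

An id that is absent from `lookup` shows up as the text "NA" inside the pasted result, so it may be worth checking for unmatched ids beforehand.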

How to extract a number from a string in a dataframe and place it in a new column?

I have a simple dataframe:
df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"), value = c(0.51, 0.52, 0.56))
test value
1 test_A_1_1.txt 0.51
2 test_A_2_1.txt 0.52
3 test_A_3_1.txt 0.56
Expected output
I would like to copy the numbers at the end of the string in column 1 and place them in columns three and four respectively, like this:
test value new new
1 test_A_1.txt 0.51 1 1
2 test_A_2.txt 0.52 2 1
3 test_A_3.txt 0.56 3 1
Attempt
Using the following code, I am able to extract the numbers from the string:
library(stringr)
as.numeric(str_extract_all("test_A_3.txt", "[0-9]+")[[1]])[1] # Extracts the first number
as.numeric(str_extract_all("test_A_3.txt", "[0-9]+")[[1]])[2] # Extracts the second number
I would like to apply this code on all the values of the first column:
library(tidyverse)
df %>% mutate(new = as.numeric(str_extract_all(df$test, "[0-9]+")[[1]])[1])
However, this leads to a column new that contains only the number 1.
What am I doing wrong?
We can use parse_number from readr
library(dplyr)
library(purrr)
library(stringr)
df %>%
mutate(new = readr::parse_number(as.character(test)))
Regarding the OP's issue, it is selecting only the first list element ([[1]]) from the str_extract_all (which returns a list). Instead, it is better to use str_extract as we need to extract only the first instance of one or more digits (\\d+)
df %>%
mutate(new = as.numeric(str_extract(test, "[0-9]+")))
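If we need to work from the output of str_extract_all, we can unlist the list to a vector and then apply as.numeric to that vector. This only works when every string contains exactly one number, because otherwise the unlisted vector is longer than the number of rows.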
df %>%
mutate(new = as.numeric(unlist(str_extract_all(test, "[0-9]+"))))
If there are multiple instances, then keep it as a list after converting to numeric by looping through the list elements with map
df %>%
mutate(new = map(str_extract_all(test, "[0-9]+"), as.numeric))
NOTE: The str_extract based solution was first posted here.
In base R, we can use regexpr
df$new <- as.numeric(regmatches(df$test, regexpr("\\d+", df$test)))
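The point about `[[1]]` can be made concrete in base R as well: `gregexpr` with `regmatches` returns one vector of matches per string, and the n-th match has to be picked from every element of that list, not only from the first. A minimal sketch; `nth_number` is a name chosen for this example.

```r
# Hypothetical helper: n-th run of digits in each string, NA if there is none.
nth_number <- function(x, n = 1) {
  x <- as.character(x)
  # One character vector of matches per input string.
  found <- regmatches(x, gregexpr("[0-9]+", x))
  # Indexing past the end of a vector yields NA, which as.numeric keeps as NA.
  vapply(found, function(m) as.numeric(m[n]), numeric(1))
}

test <- c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt")
nth_number(test, 1)
#[1] 1 2 3
nth_number(test, 2)
#[1] 1 1 1
```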
Update
With the updated example, if we need two numbers from each string, the first can be extracted with str_extract. The last one can be extracted with a regex lookahead that matches digits only when they are followed by a . and 'txt' (stri_extract_last from stringi can be used as well).
df %>%
mutate(new1 = as.numeric(str_extract(test, "\\d+")),
new2 = as.numeric(str_extract(test, "\\d+(?=\\.txt)")))
# test value new1 new2
#1 test_A_1_1.txt 0.51 1 1
#2 test_A_2_1.txt 0.52 2 1
#3 test_A_3_1.txt 0.56 3 1
Slightly modifying your existing code:
df %>%
mutate(new = as.integer(str_extract(test, "[0-9]+")))
Or simply
df$new <- as.integer(str_extract(df$test, "[0-9]+"))
Why not a base R solution?
df$new <- as.numeric(gsub("[^[:digit:]]+", "", df$test))
df
# test value new
#1 test_A_1.txt 0.51 1
#2 test_A_2.txt 0.52 2
#3 test_A_3.txt 0.56 3
Edit.
Following the example in user @camille's answer, where the strings may contain different counts of numbers, here is a solution using package stringr.
df1 <- data.frame(test = c("test_A_1.txt", "test_A_2.txt", "test_A_3.txt"), value = c(0.51, 0.52, 0.56))
df2 <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt"), value = c(0.51, 0.52, 0.56))
df3 <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt", "test_A_4_2_1.txt"), value = c(0.51, 0.52, 0.56, 2))
num2cols <- function(DF, col = "test"){
s <- stringr::str_extract_all(DF[[col]], "[[:digit:]]+")
Max <- max(sapply(s, length))
new <- do.call(rbind, lapply(s, function(x){
as.numeric(c(x, rep(NA, Max - length(x))))
}))
names_new <- paste0("new", seq.int(ncol(new)))
setNames(cbind(DF, new), c(names(DF), names_new))
}
num2cols(df1)
num2cols(df2)
num2cols(df3)
Given that they are fixed width, you can:
df$new <- as.integer(substr(df$test, 8, 8))
I recommend using as.integer rather than as.numeric because you are working with integers, not floats.
Seeing as you said you might have multiple numbers in a file name, I'd suggest using a method that is more verbose but would scale up to work for more than 1 or 2 numbers. That way, you aren't hard-coding columns like new1 and new2. To illustrate, I added a third number to one of the file names.
The original problem you ran into was that str_extract_all returns a list, and you then need to extract items from that list. You can unnest that list to get individual rows for each number, add a key that sequences over each file name's numbers, then spread to a wide shape to get one column per number, with NA where no number exists in the file name.
library(dplyr)
library(stringr)
library(tidyr)
df <- data.frame(test = c("test_A_1_1.txt", "test_A_2_1.txt", "test_A_3_1.txt", "test_A_4_2_1.txt"), value = c(0.51, 0.52, 0.56, 2))
df %>%
mutate(nums = str_extract_all(test, "\\d+")) %>%
unnest(nums) %>%
group_by(test) %>%
mutate(key = row_number()) %>%
spread(key, value = nums, sep = "")
#> # A tibble: 4 x 5
#> # Groups: test [4]
#> test value key1 key2 key3
#> <fct> <dbl> <chr> <chr> <chr>
#> 1 test_A_1_1.txt 0.51 1 1 <NA>
#> 2 test_A_2_1.txt 0.52 2 1 <NA>
#> 3 test_A_3_1.txt 0.56 3 1 <NA>
#> 4 test_A_4_2_1.txt 2 4 2 1
We can also use sub or stringi::stri_extract_last_regex:
sapply(df1, function(x) sub('.*(\\d{1}).*', '\\1', x))
or
sapply(df1, function(x) stringi::stri_extract_last_regex(x, "\\d{1}"))
Extract a number from a string in a data frame and place in a new column.
t$new <- substr(t[, 1], 8, 8)
test value new
1 test_A_1.txt 0.51 1
2 test_A_2.txt 0.52 2
3 test_A_3.txt 0.56 3

Resources