I have the data below in a dataframe column:
X_ABC_123_DF
A_NJU_678_PP
J_HH_99_LL
II_00_777_PPP
I want to extract the value between the second and third underscore for each row in the dataframe and store those values in a new column. I found one approach on SO (below), but it isn't written for R, and I am not sure how to write its regex in an R function.
^(?:[^_]+_){2}([^_ ]+)
extract word between 2nd underscore and 3rd underscore or space
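For reference, that pattern can be applied directly from base R; a minimal sketch with sub() and perl = TRUE (assuming the strings sit in a character vector x):
# keep only the capture group between the 2nd and 3rd underscore;
# perl = TRUE enables the (?:...) non-capturing group syntax
x <- c("X_ABC_123_DF", "A_NJU_678_PP", "J_HH_99_LL", "II_00_777_PPP")
sub("^(?:[^_]+_){2}([^_ ]+).*$", "\\1", x, perl = TRUE)
## [1] "123" "678" "99"  "777"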
A few solutions:
df$values = sapply(strsplit(df$V1, "_"), function(x) x[3])
df$values = gsub("(.*_){2}(\\d+)_.+", "\\2", df$V1)
library(dplyr)
library(stringr)
df %>%
mutate(values = str_extract(V1, "\\d+(?=_[a-zA-Z]+.+$)"))
Result:
V1 values
1 X_ABC_123_DF 123
2 A_NJU_678_PP 678
3 J_HH_99_LL 99
4 II_00_777_PPP 777
Data:
df = read.table(text = "X_ABC_123_DF
A_NJU_678_PP
J_HH_99_LL
II_00_777_PPP", stringsAsFactors = FALSE)
1) Assume the input is a data frame df with a single column V1. Read it in using read.table with sep="_" and then pick out the third column. No packages or regular expressions are used. If df$V1 is already character (as opposed to factor) then the as.character could be omitted.
read.table(text = as.character(df$V1), sep = "_")$V3
## [1] 123 678 99 777
2) If the third column is the only one that contains digits (which is the case for the sample data in the question) then it would be sufficient to replace each non-digit with the empty string:
as.numeric(gsub("\\D", "", df$V1))
## [1] 123 678 99 777
Related
I have a large data frame in which I'm trying to separate the values from one column into two. The values are two letters followed by digits, such as AU2847 or AU1824. I want the first column to be AU and the second to be the corresponding 4-digit number.
I am also restricted to base R, so I believe strsplit will be our best bet, but I can't figure out how to make it split after the 2nd character and create 2 columns from it.
I regularly use these two functions:
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
and
substrLeft <- function(x, n){
substr(x, 1,n)
}
These return the leftmost or rightmost n characters of a string, respectively.
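Applied to the question's data, a quick sketch (the new column names here are just placeholders, and substrRight(x, 4) assumes the 4-digit numbers described in the question):
df <- data.frame(col1 = c("AU2847", "AU1824"), stringsAsFactors = FALSE)
df$letters <- substrLeft(df$col1, 2)   # "AU" "AU"
df$digits  <- substrRight(df$col1, 4)  # "2847" "1824"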
There are several options to do this. You can subset by position using substr(), or you can use gsub() with backreferences. Subsetting by position will be faster but inflexible (you would need a huge dataframe to notice a difference in time), while regex via gsub() will be a little slower but is much more flexible. E.g.:
df[c("col2", "col3", "col2b", "col3b")] <- list(substr(df$col1, 1, 2),
substr(df$col1, 3, 6),
gsub("([[:alpha:]]+)(\\d+)", "\\1", df$col1),
gsub("([[:alpha:]]+)(\\d+)", "\\2", df$col1))
df
col1 col2 col3 col2b col3b
1 AU2847 AU 2847 AU 2847
2 AU1824 AU 1824 AU 1824
Data:
df <- data.frame(col1 = c("AU2847", "AU1824"), stringsAsFactors = F)
You could try:
as.data.frame(
do.call(rbind,
strsplit(sub("^(.+?)(\\d+)", "\\1_\\2", df$col),
split="_")
)
)
where df is the name of your data frame and col the name of your column.
This artificially inserts an underscore between the text and the first number, so that you can then pass the underscore to strsplit as the split argument.
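To see what the intermediate sub() step produces, a small illustration:
sub("^(.+?)(\\d+)", "\\1_\\2", c("AU2847", "AU1824"))
## [1] "AU_2847" "AU_1824"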
We can use strsplit() together with a regular expression which uses a lookbehind assertion:
x <- c("AU2847", "AU1824")
strsplit(x, "(?<=[A-Z]{2})", perl = TRUE)
[[1]]
[1] "AU" "2847"
[[2]]
[1] "AU" "1824"
The lookbehind regular expression tells strsplit() to split each string after two capital letters. There is no need to artificially introduce a character to split on as in arg0naut91's answer.
Now, the OP has mentioned that the character vector to be split is a column of a larger data.frame. This requires some additional code to append the list output of strsplit() as new columns to the data.frame:
Let's assume we have this data.frame
DF <- data.frame(x, stringsAsFactors = FALSE)
Now, the new columns can be appended by:
DF[, c("col1", "col2")] <- do.call(rbind, strsplit(DF$x, "(?<=[A-Z]{2})", perl = TRUE))
DF
x col1 col2
1 AU2847 AU 2847
2 AU1824 AU 1824
I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups. For example, I want to replace all the values of the column that start with "AA" (e.g., "AABC" or "AAUCC") with just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016[startsWith(f2016$SECT16, "AA"),] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces the values that have the specified prefix, in this case "AA". An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^", x)
where prefix is not to contain special regular expression characters.
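For example, all three forms agree (a quick sketch, not part of the original answer):
x <- c("AAABBB", "ABBBBC", "BABA")
startsWith(x, "AA")                   # TRUE FALSE FALSE
substring(x, 1, nchar("AA")) == "AA"  # TRUE FALSE FALSE
grepl("^AA", x)                       # TRUE FALSE FALSE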
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = F)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
I know there are some answers here about splitting a string every nth character, such as this one and this one. However, these are pretty question-specific and mostly relate to a single string rather than a data frame of multiple strings.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEFGHI', 'ZABCDJHIA'))
Looks like this:
id seq
1 1 ABCDEFGHI
2 2 ZABCDJHIA
Splitting on every third character
I want to split the string in each row every third character, such that the resulting data frame looks like this:
id 1 2 3
1 ABC DEF GHI
2 ZAB CDJ HIA
What I tried
I have used the splitstackshape package before to split a string on a single character, like so: df %>% cSplit('seq', sep = '', stripWhite = FALSE, type.convert = FALSE). I would love to have a similar function (or perhaps it is possible with cSplit) to split on every third character.
An option would be separate
library(tidyverse)
df %>%
separate(seq, into = paste0("x", 1:3), sep = c(3, 6))
# id x1 x2 x3
#1 1 ABC DEF GHI
#2 2 ZAB CDJ HIA
If we want to create it more generic
n1 <- nchar(as.character(df$seq[1])) - 3
s1 <- seq(3, n1, by = 3)
nm1 <- paste0("x", seq_len(length(s1) +1))
df %>%
separate(seq, into = nm1, sep = s1)
Or using base R: with strsplit, split the 'seq' column after every 3 characters by passing a regex lookaround, then rbind the list elements.
df[paste0("x", 1:3)] <- do.call(rbind,
strsplit(as.character(df$seq), "(?<=.{3})", perl = TRUE))
NOTE: It is better to avoid column names that start with non-standard labels such as numbers. For that reason, 'x' is appended at the beginning of the names.
You can also split a string every n characters in base R with read.fwf (Read Fixed Width Format Files), which needs either a file or a connection.
read.fwf(file=textConnection(as.character(df$seq)), widths=c(3,3,3))
V1 V2 V3
1 ABC DEF GHI
2 ZAB CDJ HIA
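If the number of chunks is not fixed at three, the widths can be computed from the string length; a sketch assuming every string has the same length and that it is a multiple of 3:
widths <- rep(3, nchar(as.character(df$seq[1])) %/% 3)
read.fwf(file = textConnection(as.character(df$seq)), widths = widths)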
I have a data frame like the following:
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values = rnorm(3, 8, 3)
df <- data.frame(sampleid, values)
I also have a vector like the following:
matches <- c("632_CSF_CD8+", "632_CSF").
I want to extract rows in this data frame which contain the matches at the end of the value in the sampleid column. From this example, you can see why the end of string is important,as I have two samples which contain "632_CSF," but they are distinct samples. If I chose to change matches to only:
matches <- c("632_CSF").
Then I want only the third row of the data frame to be outputted, because this is the only one where this matches at the end of the sampleid.
How can this be achieved?
Thanks!
Just use $ in your pattern to indicate that it occurs at the end of the string.
grep("632_CSF$", sampleid, value=TRUE)
[1] "control_sdlkfjd_2632_CSF"
You can do this with stringr and some manipulation.
You need to escape the regex metacharacters, which is done with a quotemeta function.
The next step is to append $ to ensure the match is at the end of the string, and then to concatenate all patterns into one with the regex OR operator, |.
The combined pattern can then be used with str_detect to get logical indices.
library(stringr)
# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')
ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE TRUE TRUE
df[ind, ]
# sampleid values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3 control_sdlkfjd_2632_CSF 7.001628
Suggest making your dataset more regular.
library(tidyverse)
df_regular <- df %>%
separate(
sampleid,
into = c("patient_type",
"test_number",
"patient_group",
"patient_id"),
extra = "merge") %>%
mutate(patient_id = str_pad(patient_id, 9, side = c("left"), pad = "0"))
df_regular
df_regular %>%
filter(patient_group %in% "2632" & patient_id %in% "000000CSF")
My dataframe, which I read from a CSV file, has column names like this:
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, ewfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I could change the column names one by one, but I have a lot of files with 30 or more columns in each.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help
1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
A.B C.D
1 1 2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
## A.B C.D
## 1 1 2
## 2 3 4
read.csv(text = Lines, check.names = FALSE)
## A B C D
## 1 1 2
## 2 3 4
However, in this example it still leaves names that would have to be quoted in sqldf, since they have embedded spaces.
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))
UPDATE dplyr 0.8.0
As of dplyr 0.8, funs() is soft-deprecated; use formula notation instead.
A dplyr way to do this using stringr:
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
rename_all(~str_replace_all(.,"\\.","_")) # note we have to escape the '.' character with \\
Make sure you install the packages with install.packages().
Remember you have to escape the . character with \\ because in regex, which functions like str_replace_all use, . is a wildcard.
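To see why the escape matters, a quick illustration:
str_replace_all("abc.def", ".", "_")    # "_______"  (unescaped . matches every character)
str_replace_all("abc.def", "\\.", "_")  # "abc_def"  (escaped: only the literal dot)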
To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub( ".", "", names(test), fixed = TRUE)
test
abcdef ewfasdfkl qqitvsfaddwcoil
1 NA NA NA
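To illustrate the difference between sub and gsub mentioned above (a quick sketch):
sub(".", "", "abc.def.ghi", fixed = TRUE)   # "abcdef.ghi" - only the first dot removed
gsub(".", "", "abc.def.ghi", fixed = TRUE)  # "abcdefghi"  - all dots removed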
You can also try:
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE)