Text to columns by fixed width in R

I have a large data frame in which I'm trying to separate the values from one column into two. The values are letters followed by digits, such as AU2847 or AU1824. I want the first column to be AU and the second to be the corresponding 4 digit number.
I am also restricted to the base R packages, so I believe strsplit will be our best bet, but I can't figure out how to make it split after the 2nd character and create 2 columns from it.

I regularly use these two functions:
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}
and
substrLeft <- function(x, n) {
  substr(x, 1, n)
}
which extract the last or first n characters of a string.
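For example, applied to the question's values (a quick sketch; the vector x is just for illustration):
x <- c("AU2847", "AU1824")
substrLeft(x, 2)
# [1] "AU" "AU"
substrRight(x, 4)
# [1] "2847" "1824"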

There are several options to do this. You can subset by position using substr(), or you can use gsub() with backreferences. Subsetting by position will be faster but inflexible (though you would need a huge data frame to notice a difference in time), while using a regex with gsub() will be a little slower but much more flexible. E.g.:
df[c("col2", "col3", "col2b", "col3b")] <- list(substr(df$col1, 1, 2),
substr(df$col1, 3, 6),
gsub("([[:alpha:]]+)(\\d+)", "\\1", df$col1),
gsub("([[:alpha:]]+)(\\d+)", "\\2", df$col1))
df
col1 col2 col3 col2b col3b
1 AU2847 AU 2847 AU 2847
2 AU1824 AU 1824 AU 1824
Data:
df <- data.frame(col1 = c("AU2847", "AU1824"), stringsAsFactors = FALSE)
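If numeric values are wanted for the digit columns (an assumption, since the OP asked for a 4 digit number), they can be coerced afterwards:
df$col3 <- as.integer(df$col3)
df$col3b <- as.integer(df$col3b)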

You could try:
as.data.frame(
  do.call(rbind,
          strsplit(sub("^(.+?)(\\d+)", "\\1_\\2", df$col),
                   split = "_"))
)
Whereby df is the name of your data frame and col the name of your column.
This artificially inserts an underscore between the text and the first digit, so that the underscore can then be used as the split argument to strsplit().
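With the question's data this gives (a sketch, assuming the column is named col; V1 and V2 are the default column names assigned by as.data.frame()):
df <- data.frame(col = c("AU2847", "AU1824"), stringsAsFactors = FALSE)
as.data.frame(do.call(rbind, strsplit(sub("^(.+?)(\\d+)", "\\1_\\2", df$col), split = "_")))
#   V1   V2
# 1 AU 2847
# 2 AU 1824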

We can use strsplit() together with a regular expression which uses a lookbehind assertion:
x <- c("AU2847", "AU1824")
strsplit(x, "(?<=[A-Z]{2})", perl = TRUE)
[[1]]
[1] "AU"   "2847"

[[2]]
[1] "AU"   "1824"
The lookbehind regular expression tells strsplit() to split each string after two capital letters. There is no need to artificially introduce a character to split on as in arg0naut91's answer.
Now, the OP has mentioned that the character vector to be split is a column of a larger data.frame. This requires some additional code to append the list output of strsplit() as new columns to the data.frame.
Let's assume we have this data.frame:
DF <- data.frame(x, stringsAsFactors = FALSE)
Now, the new columns can be appended by:
DF[, c("col1", "col2")] <- do.call(rbind, strsplit(DF$x, "(?<=[A-Z]{2})", perl = TRUE))
DF
       x col1 col2
1 AU2847   AU 2847
2 AU1824   AU 1824
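If the letter prefix is not always exactly two characters, a variant worth noting (a sketch, assuming the split should fall at the first letter-to-digit boundary) is to pair a lookbehind with a lookahead:
strsplit(x, "(?<=[A-Za-z])(?=[0-9])", perl = TRUE)
# [[1]]
# [1] "AU"   "2847"
#
# [[2]]
# [1] "AU"   "1824"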

Related

R regexp for odd sorting of a char vector

I have several hundred files that need their columns sorted in a convoluted way. Imagine a character vector x which is the result of names(foo) where foo is a data.frame:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
I'd like to have it ordered according to the following rule: First, alphabetical for anything starting with "Component". Second, alphabetical for anything remaining starting with "C" and a number. Third anything remaining in alphabetical order.
For x that would be:
x[c(3,4,6,5,9,8,2,7,1)]
Is this a regexp kind of task? And does one use match? Each file will have a different number of columns (so x will be of varying lengths). Any tips appreciated.
You can achieve that with the function order from base R:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
order(
  !startsWith(x, "Component"),  # FALSE (0) if it starts with "Component", TRUE (1) otherwise
  !grepl("^C\\d", x),           # FALSE (0) if it starts with C<number>, TRUE (1) otherwise
  x                             # alphabetical
)
# output: 3 4 6 5 9 8 2 7 1
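Since order() returns indices, subsetting x with the result gives the reordered vector directly, matching the OP's expected x[c(3,4,6,5,9,8,2,7,1)]:
x[order(!startsWith(x, "Component"), !grepl("^C\\d", x), x)]
# [1] "Component.1"   "Component.10"  "Component.13"  "Component.143" "C16:1n-7"
# [6] "A"             "i2"            "r4"            "x1"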
A brute-force solution using only base R:
first = sort(x[grepl('^Component', x)])
second = sort(x[grepl('^C\\d', x)])
third = sort(setdiff(x, c(first, second)))
c(first, second, third)
We can split x into the different groups and then use mixedsort from gtools:
v1 <- c(gtools::mixedsort(grep("Component", x, value = TRUE)),
        gtools::mixedsort(grep("^C\\d+", x, value = TRUE)))
c(v1, gtools::mixedsort(x[!x %in% v1]))
#[1] "Component.1" "Component.10" "Component.13" "Component.143" "C16:1n-7" "A" "i2" "r4"
#[9] "x1"
Or another option is in select(), assuming that these are the columns of the data.frame:
library(dplyr)
df1 %>%
  select(mixedsort(starts_with('Component')),
         mixedsort(names(.)[matches("^C\\d+")]),
         gtools::mixedsort(names(.)[everything()]))
If it is just the order of occurrence:
df1 %>%
  select(starts_with('Component'), matches('^C\\d+'), sort(names(.)[everything()]))
Data:
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(5 * 9), ncol = 9,
                            dimnames = list(NULL, x)))

What is the best way to find an element in a regex expression in R?

I have a list of strings in R like so:
"A(123:456)"
"B(23456:345)"
"C(3451:45600)"
I want to parse out the first number and the second number in the parentheses for all these items:
first second
  123    456
23456    345
 3451  45600
What is the best way to do this in a vectorized manner? I've thought of using substrings and index-of, but then heard of regexes, and am wondering about the most "R" way to do this.
You could use regexpr to match the pattern,
and regmatches to extract the matched patterns.
You could define the pattern to match (to be extracted) as \\d+, which means 1 or more digits.
This will match the first run of digits that occurs in each string.
And extract the matches with regmatches, like this:
v <- c("A(123:456)", "B(234:345)", "C(345:456)")
regmatches(v, regexpr('\\d+', v))
The above will give a vector of values:
[1] "123" "234" "345"
To get a data.frame with two columns of the numeric values,
you can use gregexpr() instead of regexpr().
Together with regmatches() that returns a list of character vectors,
from which you can extract the values into separate vectors:
m <- regmatches(v, gregexpr('\\d+', v))
first <- sapply(m, function(x) x[[1]])
second <- sapply(m, function(x) x[[2]])
Or as #RuiBarradas pointed out in a comment, you can simplify the sapply calls like this:
first <- sapply(m, '[[', 1)
second <- sapply(m, '[[', 2)
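Putting the pieces together into the two-column data.frame the OP asked for (the as.integer() coercion is an assumption, in case numeric rather than character values are wanted):
data.frame(first = as.integer(first), second = as.integer(second))
#   first second
# 1   123    456
# 2   234    345
# 3   345    456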
Here's one way with regex:
# Your data
df <- data.frame(obs = c("A(123:456)", "B(234:345)", "C(345:456)"))
# Extraction:
df$first <- gsub(df$obs, pattern = "^.*\\((.*)\\:.*$", replacement = "\\1")
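The second number can be extracted the same way by moving the capture group behind the colon (a sketch following the same pattern):
df$second <- gsub(df$obs, pattern = "^.*\\:(.*)\\).*$", replacement = "\\1")
df
#          obs first second
# 1 A(123:456)   123    456
# 2 B(234:345)   234    345
# 3 C(345:456)   345    456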
Here are two ways.
The first is the simplest and if your strings always have exactly two characters followed by the three digit number of interest, it will work.
The second uses regular expressions.
substr(x, 3, 5)
[1] "123" "234" "345"
sub("^.*\\(([[:digit:]]*).*", "\\1", x)
[1] "123" "234" "345"
Then, if you want numeric results, use as.integer or as.numeric.
DATA.
x <- scan(what = character(), text = '
"A(123:456)"
"B(234:345)"
"C(345:456)"')
EDIT.
After the question's edit by the OP, the solutions above are no longer valid. The following one is. Note that the regex has changed and that I now also use strsplit.
res <- do.call(rbind, strsplit(sub("^.*\\((.*)\\).*$", "\\1", x), ":"))
res <- as.data.frame(res, stringsAsFactors = FALSE)
names(res) <- c("first", "second")
res
#  first second
#1   123    456
#2   234    345
#3   345    456
The columns of this dataframe are both of class character. In order to have numbers, coerce them with
res[] <- lapply(res, as.integer)

Extract value between second and third underscore in R

I have the data below in a dataframe column:
X_ABC_123_DF</n>
A_NJU_678_PP</n>
J_HH_99_LL</n>
II_00_777_PPP</n>
I want to extract the value between the second and third underscore for each row in the dataframe; I am planning to create a new column and store those values in it. I found one way on SO, mentioned below, but it doesn't show how to write this in R, and I am not sure how to write the regex.
^(?:[^_]+_){2}([^_ ]+)
extract word between 2nd underscore and 3rd underscore or space
A few solutions:
df$values = sapply(strsplit(df$V1, "_"), function(x) x[3])
df$values = gsub("(.*_){2}(\\d+)_.+", "\\2", df$V1)

library(dplyr)
library(stringr)
df %>%
  mutate(values = str_extract(V1, "\\d+(?=_[a-zA-Z]+.+$)"))
Result:
                 V1 values
1  X_ABC_123_DF</n>    123
2  A_NJU_678_PP</n>    678
3    J_HH_99_LL</n>     99
4 II_00_777_PPP</n>    777
Data:
df = read.table(text = "X_ABC_123_DF</n>
A_NJU_678_PP</n>
J_HH_99_LL</n>
II_00_777_PPP</n>", stringsAsFactors = FALSE)
1) Assume the input is a data frame df with a single column V1. Read it in using read.table with sep="_" and then pick out the third column. No packages or regular expressions are used. If df$V1 is already character (as opposed to factor) then the as.character could be omitted.
read.table(text = as.character(df$V1), sep = "_")$V3
## [1] 123 678 99 777
2) If the third field is essentially the only one that contributes significant digits, it would be sufficient to replace each non-digit with the empty string (but see the caveat after the output):
as.numeric(gsub("\\D", "", df$V1))
## [1] 123 678 99 777
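Note that the second field of the last row, "00", also consists of digits, so the gsub() actually yields "00777" there; the final result is still 777 only because as.numeric() drops the leading zeros:
gsub("\\D", "", "II_00_777_PPP</n>")
## [1] "00777"
as.numeric("00777")
## [1] 777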

Exclude redundant rows containing different strings

I would like to exclude rows from a data-frame which contain mirrored info. This is my input:
dfin <- 'info
c1-10-20-c2-40-50
c2-1-2-c4-20-25
c4-20-25-c2-1-2
c2-40-50-c1-10-20'
dfin <- read.table(text=dfin, header=T)
In the above example you can see that rows 1 and 3, and rows 2 and 4, represent the same information in a 'mirror'. In my context it does not matter whether I have c1-10-20-c2-40-50 or c2-40-50-c1-10-20, thus I would like to filter one of these rows out (either of them). I don't have more than two redundant rows. Moreover, in my actual data-set these 'mirrored' rows are scattered and do not follow a pattern. My expected output:
dfout <- 'info
c1-10-20-c2-40-50
c2-1-2-c4-20-25'
dfout <- read.table(text=dfout, header=T)
We can split the 'info' column by "-", sort each element, and convert the result to a logical vector with duplicated(), which is then used for subsetting the rows:
dfN <- dfin[!duplicated(lapply(strsplit(as.character(dfin$info), "-"), sort)), , drop = FALSE]
all.equal(dfN, dfout, check.attributes=FALSE)
#[1] TRUE
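To see why this works, the sorted split values form an identical key for each pair of mirrored rows; a sketch of the intermediate step for rows 1 and 4 (assuming digits sort before letters in the current locale):
lapply(strsplit(as.character(dfin$info), "-"), sort)[c(1, 4)]
#[[1]]
#[1] "10" "20" "40" "50" "c1" "c2"
#
#[[2]]
#[1] "10" "20" "40" "50" "c1" "c2"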
Here is an approach which does not keep the original order:
dfin <- 'info-info-info-info-info-info
c1-10-20-c2-40-50
c2-1-2-c4-20-25
c4-20-25-c2-1-2
c2-40-50-c1-10-20'
df <- read.table(text=dfin, header=T, sep = "-", strip.white = T)
dfout <- as.data.frame(unique(t(apply(df, 1, sort))))
I extended your column name to make it work.

How to remove '.' from column names in a dataframe?

My dataframe which I read from a csv file has column names like this
abc.def, ewf.asd.fkl, qqit.vsf.addw.coil
I want to remove the '.' from all the names and convert them to
abcdef, ewfasdfkl, qqitvsfaddwcoil.
I tried using the sub command sub(".","",colnames(dataframe)) but this command took out the first letter of each column name and the column names changed to
bc.def, wf.asd.fkl, qit.vsf.addw.coil
Does anyone know another command to do this? I can change the column names one by one, but I have a lot of files with 30 or more columns in each file.
Again, I want to remove the "." from all the colnames. I am trying to do this so I can use "sqldf" commands, which don't deal well with "."
Thank you for your help
1) sqldf can deal with names having dots in them if you quote the names:
library(sqldf)
d0 <- read.csv(text = "A.B,C.D\n1,2")
sqldf('select "A.B", "C.D" from d0')
giving:
  A.B C.D
1   1   2
2) When reading the data using read.table or read.csv use the check.names=FALSE argument.
Compare:
Lines <- "A B,C D
1,2
3,4"
read.csv(text = Lines)
##   A.B C.D
## 1   1   2
## 2   3   4
read.csv(text = Lines, check.names = FALSE)
##   A B C D
## 1   1   2
## 2   3   4
however, in this example it still leaves a name that would have to be quoted in sqldf since the names have embedded spaces.
3) To simply remove the periods, if DF is a data frame:
names(DF) <- gsub(".", "", names(DF), fixed = TRUE)
or it might be nicer to convert the periods to underscores so that it is reversible:
names(DF) <- gsub(".", "_", names(DF), fixed = TRUE)
This last line could be alternatively done like this:
names(DF) <- chartr(".", "_", names(DF))
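For example, with the column names from the question (nms is just an illustrative stand-in for names(DF)):
nms <- c("abc.def", "ewf.asd.fkl", "qqit.vsf.addw.coil")
chartr(".", "_", nms)
## [1] "abc_def"            "ewf_asd_fkl"        "qqit_vsf_addw_coil"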
UPDATE dplyr 0.8.0
As of dplyr 0.8, funs() is soft-deprecated; use formula notation instead.
A dplyr way to do this using stringr:
library(dplyr)
library(stringr)
data <- data.frame(abc.def = 1, ewf.asd.fkl = 2, qqit.vsf.addw.coil = 3)
renamed_data <- data %>%
  rename_all(~ str_replace_all(., "\\.", "_"))  # note we have to escape the '.' character with \\
Make sure you install the packages with install.packages().
Remember that you have to escape the . character with \\ because in regex, which functions like str_replace_all() use, . is a wildcard.
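To see the difference the escape makes, compare (a minimal sketch):
library(stringr)
str_replace_all("abc.def", ".", "_")    # unescaped: . matches every character
## [1] "_______"
str_replace_all("abc.def", "\\.", "_")  # escaped: only the literal dot
## [1] "abc_def"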
To replace all the dots in the names you'll need to use gsub, rather than sub, which will only replace the first occurrence.
This should work.
test <- data.frame(abc.def = NA, ewf.asd.fkl = NA, qqit.vsf.addw.coil = NA)
names(test) <- gsub(".", "", names(test), fixed = TRUE)
test
  abcdef ewfasdfkl qqitvsfaddwcoil
1     NA        NA              NA
You can also try:
names(df) = gsub(pattern = ".", replacement = "", x = names(df), fixed = TRUE)
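Without fixed = TRUE, the pattern "." is treated as a regex wildcard that matches any single character, so every character would be removed and the names would end up empty:
gsub(pattern = ".", replacement = "", x = "abc.def")
## [1] ""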
