How to remove brackets and keep content inside from Data Frame - r

I have a dataframe pulled from a pdf that looks something like this
library(tidyverse)
Name <- c("A","B","A","A","B","B","C","C","C")
Result <- c("ND","[0.5]","1.2","ND","ND","[0.8]","ND","[1.1]","22")
results2 <- data.frame(Name, Result)
results2
I am trying to get rid of the brackets. I tried using gsub on a subset of the data frame and failed. It looked like
results2$RESULT <-gsub("\\[","",results2$RESULT)
and resulted in an 'unexpected symbol' error message. I would like to get rid of the brackets and turn the Results into a numeric column.

We need to add an OR (|) with ] so that both opening and closing square brackets are matched. Also, the column is named 'Result', not 'RESULT'. As R is case-sensitive, 'RESULT' refers to a column that doesn't exist in the data
gsub("\\[|\\]", "", results2$Result)
[1] "ND" "0.5" "1.2" "ND" "ND" "0.8" "ND" "1.1" "22"
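The question also asks for a numeric column, which the gsub call alone does not give. A minimal sketch of that last step follows, assuming "ND" stands for a missing value and should become NA (that reading is an assumption, not something stated in the question):

```r
Name <- c("A","B","A","A","B","B","C","C","C")
Result <- c("ND","[0.5]","1.2","ND","ND","[0.8]","ND","[1.1]","22")
results2 <- data.frame(Name, Result)

# Remove both opening and closing square brackets
cleaned <- gsub("\\[|\\]", "", results2$Result)

# Assumption: "ND" marks a missing value, so set it to NA explicitly
# (this also avoids the coercion warning from as.numeric)
cleaned[cleaned == "ND"] <- NA

results2$Result <- as.numeric(cleaned)
```

After this, is.numeric(results2$Result) is TRUE and the "ND" rows hold NA.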

As you are using library(tidyverse), you could also do this with stringr (the regex is from akrun's answer):
library(dplyr)
library(stringr)
results2 %>%
mutate(Result = str_replace_all(Result, "\\[|\\]", ""))
Name Result
1 A ND
2 B 0.5
3 A 1.2
4 A ND
5 B ND
6 B 0.8
7 C ND
8 C 1.1
9 C 22

Related

How to order from highest to lowest in R

My goal is to order multiple variables in a list in R:
"+5x^{5}" "-2x^{3}" "5x^{7}" "0" "1"
I want to get this order:
"5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
So exponents from highest to lowest, then numerical order for the numbers.
How can I achieve this? Decreasing order for the numbers alone is clear. For the exponents, it would be necessary to detect whether there is an x in the string, then extract the exponent and order based on it. But I don't know how to do that.
A more verbose option is to extract the exponents and the multipliers and then use arrange. This has the advantage of having these numbers ready if you need to use them.
library(stringr)
library(dplyr)
dat <- data.frame(x = c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1"))
dat |> mutate(m = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d+)\\}")[, 2]),
exp = as.numeric(str_match(x, "([+-]*\\d+)x\\^\\{(\\d+)\\}")[, 3]),
n = as.numeric(x)) |>
arrange(desc(exp), desc(n))
Output
#> x m exp n
#> 1 5x^{7} 5 7 NA
#> 2 +5x^{5} 5 5 NA
#> 3 -2x^{3} -2 3 NA
#> 4 1 NA NA 1
#> 5 0 NA NA 0
Created on 2022-06-13 by the reprex package (v2.0.1)
Base R solution:
x[
order(
gsub(
".*\\{(\\d+)\\}.*",
"\\1",
x
),
decreasing = TRUE
)
]
Input data:
x <- c(
"+5x^{5}",
"-2x^{3}",
"5x^{7}",
"0",
"1"
)
Well... it works
> x=c("+5x^{5}","-2x^{3}","5x^{7}","0","1")
> x[order(gsub("(.*\\^\\{)(.+)(\\}.*)","\\2",x),decreasing=T)]
[1] "5x^{7}" "+5x^{5}" "-2x^{3}" "1" "0"
the regex string (.*\\^\\{)(.+)(\\}.*) looks for three things:
(.*\\^\\{) searches for anything before ^{, this is the first split,
(.+) searches for anything inside curly brackets, second split,
(\\}.*) searches for anything after }, third split,
in the end it returns only \\2, the contents of the second split,
which is what we use to order the elements of the string vector.
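One caveat on the gsub-based answers: they order the extracted exponents as character strings, which happens to work for single digits but would put "10" before "7". A hedged base R sketch that orders numerically instead is below. It treats a constant as having exponent 0, and the extra element "x^{10}" is a hypothetical input added only to show the multi-digit case:

```r
# "x^{10}" is a made-up extra term to exercise a two-digit exponent
x <- c("+5x^{5}", "-2x^{3}", "5x^{7}", "0", "1", "x^{10}")

# TRUE for terms that carry an exponent like ^{7}
has_exp <- grepl("\\^\\{\\d+\\}", x)

# Numeric exponent; constants keep the default of 0
expo <- numeric(length(x))
expo[has_exp] <- as.numeric(sub(".*\\^\\{(\\d+)\\}.*", "\\1", x[has_exp]))

# Numeric value of the constants, used only to break ties among them
const <- numeric(length(x))
const[!has_exp] <- as.numeric(x[!has_exp])

# Highest exponent first, then constants from largest to smallest
sorted <- x[order(expo, const, decreasing = TRUE)]
sorted
```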

Coercing String to Vector

I'm trying to create a calculator that multiplies permutation groups written in cyclic form (the process of which is described in this post, for anyone unfamiliar: https://math.stackexchange.com/questions/31763/multiplication-in-permutation-groups-written-in-cyclic-notation). Although I know this would be easier to do with Python or something else, I wanted to practice writing code in R since it is relatively new to me.
My game plan is to take an input such as "(1 2 3)(2 4 1)" and split it into two separate lists or vectors. However, I am having trouble getting started. From my understanding of character functions (which I researched here: https://www.statmethods.net/management/functions.html), I will ultimately have to use grep() to find the points where ")(" occurs in my string and split from there. But grep only takes vectors as its argument, so I am trying to coerce my string into a vector. In researching this problem, I have mostly seen people suggest as.integer(unlist(str_split())). This doesn't work for me: when I split, not every piece is an integer, so some values become NA, as seen in this example.
library(tidyverse)
x <- "(1 2 3)(2 4 1)"
x <- as.integer(unlist(str_split(x," ")))
x
Is there an alternative way to turn a string into a vector when there are not just integers involved? I also realize that the way I am trying to split up the two permutations is very roundabout, but of the character functions I researched, this seems like the only way. If there are other functions that would make this easier, please let me know.
Thank you!
Comments in the code.
x <- "(1 2 3)(2 4 1)"
out1 <- strsplit(x, split = ")(", fixed = TRUE)[[1]] # split on close and open bracket
out2 <- gsub("[\\(|\\)]", replacement = "", out1) # remove brackets
out3 <- strsplit(out2, " ") # tease out numbers between spaces
lapply(out3, as.integer)
[[1]]
[1] 1 2 3
[[2]]
[1] 2 4 1
There aren't really any scalars in R. Single values like 1, TRUE, and "a" are all 1-element vectors. grep(pattern, x) will work fine on your original string. As a starting point for getting towards your desired goal, I would suggest splitting the groups using:
> str_extract_all(x, "\\([0-9 ]+\\)")
[[1]]
[1] "(1 2 3)" "(2 4 1)"
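To get from those extracted groups to integer vectors, here is a rough base R sketch of the same idea. It uses regmatches and gregexpr in place of str_extract_all (a swap on my part so that no packages are needed), and the variable names are only illustrative:

```r
x <- "(1 2 3)(2 4 1)"

# A length-one string is already a character vector, so regex functions accept it
starts_at <- regexpr(")(", x, fixed = TRUE)  # position where ")(" begins

# Extract each parenthesised group, as in the str_extract_all call above
cycles <- regmatches(x, gregexpr("\\([0-9 ]+\\)", x))[[1]]

# Drop the parentheses, then split on spaces and convert to integers
inner <- gsub("[()]", "", cycles)
perms <- lapply(strsplit(trimws(inner), " +"), as.integer)
perms
```

Each element of perms is one cycle as an integer vector, which should be a convenient shape for the multiplication step.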
If we need to split the strings with the brackets
strsplit(x, "(?<=\\))(?=\\()", perl = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
Or we can use a convenient wrapper from qdapRegex
library(qdapRegex)
ex_round(x, include.marker = TRUE)[[1]]
#[1] "(1 2 3)" "(2 4 1)"
alternative: using library(magrittr)
x <- "(1 2 3)(2 4 1)"
x %>%
gsub("^\\(","c(",.) %>% gsub("\\)\\(","),c(",.) %>% gsub("(?=\\s\\d)",", ",.,perl=T) %>%
paste0("list(",.,")") %>% {eval(parse(text=.))}
result:
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 4 1
You could use chartr with read.table :
read.table(text= chartr("()"," \n",x))
# V1 V2 V3
# 1 1 2 3
# 2 2 4 1

Regular expression that returns numbers following a specific letter until the next letter

I need a regular expression that returns a specific letter and the following (one or two) digits until the next letter.
For example, I would like to extract how many carbons (C) are in a formula using regular expressions in R
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
I need an expression that returns the number of C which can be one or 2 digits and that does not return the number after chlorine (Cl).
substr(strings,regexpr("C[0-9]+",strings) + 1, regexpr("[ABDEFGHIJKLMNOPQRSTUVWXYZ]+",strings) -1)
[1] "16" "C" ""
but the answer I want to be returned is
"16","1","0"
Moreover, I would like the regular expression to automatically locate the next letter and stop before it, instead of having a final position which I specify as a letter not being a C.
makeup in the CHNOSZ package will parse a chemical formula. Here are some alternatives that use it:
1) Create a list L of such fully parsed formulas and then for each one check if it has a "C" component and return its value or 0 if none:
library(CHNOSZ)
L <- Map(makeup, strings)
sapply(L, function(x) if ("C" %in% names(x)) x[["C"]] else 0)
## C16H4ClNO2 CH8O F2Ni
## 16 1 0
Note that L is a list of the fully parsed formulas in case you have other requirements:
> L
$C16H4ClNO2
C H Cl N O
16 4 1 1 2
$CH8O
C H O
1 8 1
$F2Ni
F Ni
2 1
1a) By adding c(C = 0) to each list component we can avoid having to test for the existence of carbon yielding the following shorter version of the sapply line in (1):
sapply(lapply(L, c, c(C = 0)), "[[", "C")
2) This one-line variation of (1) gives the same answer as in (1) except for names. It appends "C0" to each formula to avoid having to test for the existence of carbon:
sapply(lapply(paste0(strings, "C0"), makeup), "[[", "C")
## [1] 16 1 0
2a) Here is a variation of (2) that eliminates the lapply by using the fact that makeup will accept a matrix:
sapply(makeup(as.matrix(paste0(strings, "C0"))), "[[", "C")
## [1] 16 1 0
If I understood your question correctly, you're looking for two things:
C + a number immediately afterwards => match this number
C followed by another UPPERCASE letter (another chemical element, that is) => count C
If you're able to install another library, you might get along with:
library("stringr")
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")
str1 <- str_extract(strings, '(?<=C)\\d+')
str2 <- str_count(strings, 'C[A-Z]')
str2[!is.na(str1)] = str1[!is.na(str1)]
str2
# [1] "16" "1" "0"
This works in two steps: str1 handles the first condition (C followed by digits), while str2 handles the second condition (C followed by an uppercase letter). The last assignment combines the two vectors.
We can do this with base R
sub("C(\\d+).*", "\\1", sub("C([^0-9]+)",
"C1\\1", ifelse(!grepl("C", strings), paste0("C0", strings), strings)))
#[1] "16" "1" "0"
ifelse(str_extract(strings,'(?<=C)(\\d+|)')=='',1,str_extract(strings,'(?<=C)(\\d+|)'))
[1] "16" "1" NA
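For comparison, here is a hedged base R sketch that handles all three cases in one pass. It relies on a negative lookahead so that the C in Cl is skipped. The helper name count_carbon is made up for illustration, and the sketch assumes carbon appears at most once per formula:

```r
strings <- c("C16H4ClNO2", "CH8O", "F2Ni")

# count_carbon is a hypothetical helper, not part of any package
count_carbon <- function(f) {
  # Match a C that is not followed by a lowercase letter (so not Cl, Ca, ...),
  # together with any digits that come right after it
  m <- regmatches(f, regexpr("C(?![a-z])\\d*", f, perl = TRUE))
  if (length(m) == 0) return(0L)          # no carbon in the formula
  digits <- sub("^C", "", m)
  if (digits == "") 1L else as.integer(digits)  # bare C means one carbon
}

carbons <- vapply(strings, count_carbon, integer(1), USE.NAMES = FALSE)
carbons
```

Because the digits are matched with \\d*, the pattern stops at the next letter on its own, without listing which letters may follow.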

Splitting a Large Data File in R using Strsplit and R Connection

Hi, I am trying to read a large data file into R. It is a tab-delimited file; however, the first two columns each contain multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first values in both the first and second columns, as well as the third and fourth columns. In the end I want to end up with a vector for each line that looks like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2\t0.5\t0.4"
then I am using strsplit to split the tab delimited file:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[1] "A|1" "B|2"
[3] "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this?
Thanks!
Since you provided an approach, I will explain the errors in your code, even though you may want to consider a different approach for your problem.
Putting aside personal taste about code, the problems are:
you have to extract the first element of the list with double brackets:
line1[[1]]
the split argument accepts regular expressions. If you supply |, which is a metacharacter, it won't be read literally. You must escape it as \\| or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as is (that is, without its meaning as a metacharacter).
The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal consideration, you might consider an lapply solution:
lapply(linessplit, strsplit, split = "|", fixed = T)
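If only the part before each | is needed, one possible shortcut is to strip everything from the | onward rather than splitting on it. This is a sketch under that assumption, using the two sample lines from the question; keep_first is a hypothetical helper name:

```r
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1")

# Split each line on literal tabs
fields <- strsplit(lines, "\t", fixed = TRUE)

# keep_first: drop the | and everything after it; fields without | are unchanged
keep_first <- function(v) sub("\\|.*$", "", v)

result <- lapply(fields, keep_first)
result[[1]]
```

Each element of result is a character vector for one line, so the numeric fields would still need as.numeric() if numbers are required.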
Here is my solution to your original problem, namely to
split lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load reshape2 library to import function colsplit, then use it with pattern "\t" to split lines into 4 columns named 1,2,3,4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
1 2 3 4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a row, splits it into rows, and selects the row we want.
Take the first row of linessplit into colsplit
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
1 2
1 A 1
2 B 2
3 0.5 NA
4 0.4 NA
Take transpose
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
[,1] [,2] [,3] [,4]
1 "A" "B" "0.5" "0.4"
2 " 1" " 2" NA NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Make above steps a function split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
Use sapply to apply function split_n_select to each row in linessplit
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
1 2 3 4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
1 2 3 4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA
Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[[1]] # double brackets return the character vector, not a one-element list
l1 <- strsplit(line1[1], split = "[|]") # I added square brackets so | is matched literally

Sets in R DataFrame

I have a csv that looks like
Deamon,Host,1:2:4,aaa.03
Pixe,Paradigm,1:3:5,11.us
I need to read this into a dataframe for analysis, but the 3rd column in my data is separated by : and needs to be read like a set or list, i.e. split by : so that it returns (1,2,4). Is it possible to have a column of class list in R? Or how do you think I can best approach this problem?
You can use strsplit to split a character vector into a list of components:
x <- c("1:2:4", "1:3:5")
strsplit(x, split=":")
[[1]]
[1] "1" "2" "4"
[[2]]
[1] "1" "3" "5"
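On the question of whether a data frame column can hold lists: it can. Below is a small sketch, reading the two sample rows inline via read.csv(text = ...) and storing the split values in a new column. The column name V3_set is made up for illustration:

```r
# The two sample rows from the question, read inline instead of from a file
dat <- read.csv(text = "Deamon,Host,1:2:4,aaa.03\nPixe,Paradigm,1:3:5,11.us",
                header = FALSE, stringsAsFactors = FALSE)

# Split the third column on ":" and store the pieces as a list column
# (V3_set is a hypothetical column name)
dat$V3_set <- lapply(strsplit(dat$V3, ":", fixed = TRUE), as.integer)

# Each cell is now an integer vector, so set-style checks work per row
has_four <- vapply(dat$V3_set, function(s) 4L %in% s, logical(1))
has_four
```

Functions such as union() and intersect() can then be applied to individual cells, for example intersect(dat$V3_set[[1]], dat$V3_set[[2]]).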
As noted above, the answer will vary depending on whether the number of separators in the column is consistent or not. The answer is more straightforward if that number is consistent. Here's one way to do that, building off of Andrie's strsplit answer:
dat <- read.csv("yourData.csv", header=FALSE, stringsAsFactors = FALSE)
#If always going to be a consistent number of separators
dat <- cbind(dat, do.call("rbind", strsplit(dat[, 3], ":")))
V1 V2 V3 V4 1 2 3
1 Deamon Host 1:02:04 aaa.03 1 02 04
2 Pixe Paradigm 1:03:05 11.us 1 03 05
Note that the above is essentially how colsplit.character from package reshape is implemented and may be a better option for you as it forces you to give proper names.
If the number of separators is different, then using rbind.fill is an option from package plyr. rbind.fill expects data.frames which was a bit annoying, and I couldn't figure out how to get a one row data.frame without first converting to a matrix, so I imagine this can be made more efficient, but here's the basic idea:
library(plyr)
x <- c("1:2:4", "1:3:5:6:7")
rbind.fill(
lapply(
lapply(strsplit(x, ":"), matrix, nrow = 1)
, as.data.frame)
)
V1 V2 V3 V4 V5
1 1 2 4 <NA> <NA>
2 1 3 5 6 7
Which can then be cbinded as shown above.
Try using gsub to replace that character:
R> str <- "1:2:4"
R> str
[1] "1:2:4"
R> gsub(":", ",", str)
[1] "1,2,4"
Make sure the column is a string not a factor beforehand.
