Regex for variable length - r

I am looking for a regex or another command/workaround to extract all pkA values from a very large list for hundreds of chemicals. So far, I have managed to extract the desired pkA values from a subset of my list.
I wonder, however, whether it is also possible to extract the whole lines that contain the pkAs. I figured that since they all have a rather comparable length, you could extract these with a regex, but I don't know how to implement the length inside the regex in combination with the specific lines containing the pkA values.
The reason why I wonder this is because my regex does not include pkAs that start with a 0. Chemicals like this are uncommon but they do exist. By extracting the whole line, I would also catch the few entries that give a temperature value which my regex is not including.
Down below is a (hopefully) minimal working example with an extract of my list.
library(stringr)
list_pkas <- structure(list(Chemical = c("MCPA", "Aspirin"), pka = c("3.2.13Dissociation Constants\r\npKa= 3.13\r\nCessna AJ, Grover R; J Agric Food Chem 26: 289-92(1978)\r\nHazardous Substances Data Bank (HSDB)",
"3.2.14Dissociation Constants\r\nAcidic pKa\r\n3.47\r\nTested as SID 103164874 in AID 781325: https://pubchem.ncbi.nlm.nih.gov/bioassay/781325#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL\r\nAcidic pKa\r\n3.5\r\nTested as SID 103164874 in AID 781326: https://pubchem.ncbi.nlm.nih.gov/bioassay/781326#sid=103164874\r\nComparison of the accuracy of experimental and predicted pKa values of basic and acidic compounds. Pharm Res. 2014; 31(4):1082-95. DOI:10.1007/s11095-013-1232-z. PMID:24249037\r\nChEMBL; DrugBank\r\npKa = 3.49 at 25 °C\r\nO'Neil, M.J. (ed.). The Merck Index - An Encyclopedia of Chemicals, Drugs, and Biologicals. Whitehouse Station, NJ: Merck and Co., Inc., 2006., p. 140\r\nHazardous Substances Data Bank (HSDB)"
)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
string <- list_pkas$pka[2]
string_sub <- str_sub(string, 7)
pkas <- str_extract_all(string_sub, "([1-9]\\.[0-9]{1,2})")
The expected output should be for MCPA:
3.13
or
pKa=3.13
For Aspirin:
3.47
3.5
pKa = 3.49 at 25 °C
Any help is much appreciated!

You can use the lookbehind assertion (?<=foo):
str_extract_all(list_pkas$pka, "(?<=pKa\\D{0,5})\\d.*")
# [[1]]
# [1] "3.13"
#
# [[2]]
# [1] "3.47" "3.5" "3.49 at 25 °C"
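For anyone who prefers base R, here is a minimal sketch of the same idea. It assumes PCRE via perl = TRUE and uses \\K (drop everything matched so far) instead of the lookbehind; the input string x is a shortened, hypothetical stand-in for the Aspirin entry, not the full data. Because the value only has to start with \\d, entries beginning with 0 and entries with a trailing temperature are both picked up.

```r
# Shortened, hypothetical sample modelled on the Aspirin entry above
x <- paste0(
  "Dissociation Constants\r\nAcidic pKa\r\n3.47\r\nChEMBL\r\n",
  "pKa = 3.49 at 25 \u00b0C\r\nThe Merck Index"
)

# "pKa" plus up to five non-digits is matched but discarded by \\K;
# \\d[^\r\n]* then keeps the value and the rest of that line only
m <- gregexpr("pKa\\D{0,5}\\K\\d[^\r\n]*", x, perl = TRUE)
pkas <- regmatches(x, m)[[1]]
print(pkas)
```

The [^\r\n]* part is used instead of .* so that the match stops at the end of the line regardless of how the regex engine treats \r.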

I think that this expression might do what you need:
"pKa\\D{0,5}((?:\\s*\\d+\\.*\\d*)(?:\\s*at\\s*\\d+\\s*.*?\\w)*)"

Run A For Loop on a Data Frame

I have a dataframe that looks like this:
draftclasses
Name Yards TDs Class
Joe Smith 333.3 34 2017
Carson Mathers 386.2 22 2021
Bo Someome 345.2 22 2022
Im Notgood 170.99 7 2017
What I would like to do is get all of the Yards subset for each value in the Class column. I know how to filter out a subset of the data frame:
year2021 = draftclasses[!is.na(draftclasses$Yards) & draftclasses$Class == 2021,]
I am also aware I would use a for loop but don't know how to design it. I've read around online a bit but I am still unsure how to run a loop to get this output for each year in the Class column.
Ideally I would like each year in Class to label an object with a string of all of the yards pertaining to a class like so:
> year2017
[1] "333.3" "170.99"
Any help would be appreciated. Thanks.
With the help of split we can split the Yards for each Class.
result <- split(draftclasses$Yards, draftclasses$Class)
This returns a list in result. If you want a separate vector for each Class, you can name the list and use list2env.
names(result) <- paste0('year', names(result))
list2env(result, .GlobalEnv)
year2017
#[1] 333.30 170.99
year2021
#[1] 386.2
year2022
#[1] 345.2
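As a self-contained sketch of the same approach, the data frame below is rebuilt by hand from the table in the question, and the result is kept as a named list instead of being copied into the global environment. The name yards_by_class is only illustrative. The explicit for loop at the end is the equivalent of what the question asked about, shown for comparison.

```r
# Data frame rebuilt from the table in the question
draftclasses <- data.frame(
  Name  = c("Joe Smith", "Carson Mathers", "Bo Someome", "Im Notgood"),
  Yards = c(333.3, 386.2, 345.2, 170.99),
  TDs   = c(34, 22, 22, 7),
  Class = c(2017, 2021, 2022, 2017),
  stringsAsFactors = FALSE
)

# One vector of Yards per Class, stored in a named list
yards_by_class <- split(draftclasses$Yards, draftclasses$Class)
names(yards_by_class) <- paste0("year", names(yards_by_class))

# The same result built with an explicit for loop
yards_loop <- list()
for (cl in unique(draftclasses$Class)) {
  keep <- !is.na(draftclasses$Yards) & draftclasses$Class == cl
  yards_loop[[paste0("year", cl)]] <- draftclasses$Yards[keep]
}

print(yards_by_class$year2017)
```

Keeping the vectors in a list means they can be looped over later with lapply, which is harder once they are separate objects.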

Splitting an object in R

I would like to split an object in R according to the suffixes of the barcodes it contains. These end in '-n' where n is a number from 1 to 6. e.g. AAACCGTGCCCTCA-1, GAACCGTGCCCTCA-2, CATGCGTGCCCTCA-5, etc. I would like all the corresponding information about each barcode to be split accordingly as well. Here is some example code of an object, cds.
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
split(cds, cds$barcode)
#not by individual barcodes, but by groups of those ending '-1', '-2',...,'-6'. So 6 new objects in total
Many thanks!
Abigail
Split does not work because you need to subset based on the columns. I am not sure if there is a split method defined for this class. You can try the following:
First to get something like your example:
library(monocle)
library(HSMMSingleCell)
library(Biostrings)
cds = load_HSMM()
class(cds)
[1] "CellDataSet"
attr(,"package")
[1] "monocle"
dim(cds)
Features Samples
47192 271
And to create a barcode for every sample:
bar = paste(names(oligonucleotideFrequency(DNAString("NNNNN"),5))[1:ncol(cds)],
sample(1:6,ncol(cds),replace=TRUE),sep="-")
head(bar)
[1] "AAAAA-3" "AAAAC-6" "AAAAG-5" "AAAAT-1" "AAACA-5" "AAACC-5"
Now we get the group, which is the suffix 1-6 :
cds$barcodes= bar
grp = sub("[A-Z]*[-]","",cds$barcodes)
To get one subset, for example, those with "-1", you can just do:
group1 = cds[,grp==1]
dim(group1)
Features Samples
47192 46
head(group1$barcodes)
[1] "AAAAT-1" "AACGA-1" "AAGCG-1" "AAGGG-1" "AAGTA-1" "AATAG-1"
To get your 6 groups, you can do the below, but check whether your machine has the memory to accommodate this!
subset_obj = lapply(unique(grp),function(i){
cds[,grp==i]
})
names(subset_obj) = unique(grp)
We can use sub to strip everything up to the last "-", which leaves only the numeric suffix, and split the 'cds' based on that
split(cds, sub(".*-", "", cds$barcode))
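The grouping step can be tried on a plain character vector first, without monocle. This is only a sketch: the barcodes below are made up along the lines of the question, and splitting column indices (rather than the object itself) is an assumption that sidesteps the question of whether a split method exists for CellDataSet.

```r
# Hypothetical barcodes in the format described in the question
barcodes <- c("AAACCGTGCCCTCA-1", "GAACCGTGCCCTCA-2",
              "CATGCGTGCCCTCA-5", "TTTCCGTGCCCTCA-1")

# Drop everything up to and including the last "-" to keep the suffix
suffix <- sub(".*-", "", barcodes)

# Group the column positions by suffix; each element could then be
# used as cds[, idx] to subset the real object
idx_by_suffix <- split(seq_along(barcodes), suffix)
print(idx_by_suffix)
```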

How can I get a single value from a read.csv-produced dataframe instead of a list?

I am using read.csv on a datapath. It returns a dataframe. I want to be able to get a single value in this dataframe, but instead I get a list of values displaying the levels.
I have tried several ways to access the value I want. In the next part, I will show you what I tried and the results I got.
Here is my simple dataframe:
"OGM","Nutrient","data3"
"tomato","iron",0.03
"domestic cat","iron",0.02
"zebrafish","zing",0.02
"giraffe","nitrate", 0.09
"common cougar","manganese",0.05
"fawn","nitrogen",0.04
"daim","bromure",0.08
"wild cat","iron",0.05
"domestic cat","calcium",0.02
"muren","calcium",0.07
"jaguar","iron",0.02
"green turtle","sodium",0.01
"dave grohl","metal",0.09
"zebra","nitrates",0.12
"tortoise","sodium",0.16
"dinosaur","calcium",0.08
"apex mellifera","sodium",0.15
Here is how I load the data:
#use read.csv on the datapath contained in file
fileData <- read.csv(file[4][[1]])
print(fileData[1][1])
What I want is to access a single value: for example, "tomato" or "nitrate". The result I want is exactly this:
>[1] tomato
Here is what I tried and the result I got:
print(fileData[1][1])
returns
> OGM
>1 tomato
>2 domestic cat
>3 zebrafish
>4 giraffe...
print(fileData$OGM[1])
returns
> [1] tomato
Levels: apex mellifera common cougar daim...
print(fileData[1][[1]])
returns
> [1] tomato domestic cat zebrafish giraffe common cougar [...]
[15] tortoise dinosaur apex mellifera
Levels: apex mellifera common cougar daim...
print(fileData$OGM[[1]])
returns
Levels: apex mellifera common cougar daim...
All apologies for the stupid question, but I'm a bit lost. All help is appreciated. If you want me to edit my post to be more clear, tell me. Thank you.
Some suggestions
Try readr::read_csv rather than read.csv to read in your data. This will get around the stringsAsFactors problem. Or use the approach suggested by Stewart Macdonald.
Once you have the data in, you can manipulate it as follows
# Make a sample dataframe
library(tidyverse)
df <- tribble(~OGM, ~Nutrient, ~data3,
"tomato","iron",0.03,
"domestic cat","iron",0.02,
"zebrafish","zing",0.02,
"giraffe","nitrate", 0.09,
"common cougar","manganese",0.05,
"fawn","nitrogen",0.04,
"daim","bromure",0.08,
"wild cat","iron",0.05,
"domestic cat","calcium",0.02,
"muren","calcium",0.07,
"jaguar","iron",0.02,
"green turtle","sodium",0.01,
"dave grohl","metal",0.09,
"zebra","nitrates",0.12,
"tortoise","sodium",0.16,
"dinosaur","calcium",0.08,
"apex mellifera","sodium",0.15)
df %>%
select(OGM) %>% # select the OGM column
filter(OGM == 'tomato') %>%
pull # convert to a vector
[1] "tomato"
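If you would rather stay in base R, the following sketch shows the two usual ways around the factor output. It assumes the CSV is read from an inline string (csv_text is a shortened copy of the data above) so that it runs without a file; stringsAsFactors is set explicitly in both calls because its default differs between R versions.

```r
# Shortened copy of the CSV from the question, read from a string
csv_text <- paste(
  '"OGM","Nutrient","data3"',
  '"tomato","iron",0.03',
  '"domestic cat","iron",0.02',
  '"giraffe","nitrate",0.09',
  sep = "\n"
)

# With factors (what the question ran into)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
first_from_factor <- as.character(as_factor$OGM[1])  # drop the levels

# With plain character columns
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
first_from_text <- as_text[1, "OGM"]                  # row 1, column OGM

print(first_from_text)
```

Indexing with [row, "column"] returns the single cell directly, whereas fileData[1][1] selects the first column twice and so still returns a data frame.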

"Zero frequent items" when using the eclat to mine frequent itemsets

So I want to find patterns and "clusters" based on which items are bought together, and according to the wiki for eclat:
The Eclat algorithm is used to perform itemset mining. Itemset mining let us find frequent patterns in data like if a consumer buys milk, he also buys bread. This type of pattern is called association rules and is used in many application domains.
However, when I use eclat in R, I get "zero frequent items" and "NULL" when retrieving the results through tidLists. Can anyone see what I am doing wrong?
The full dataset: https://pastebin.com/8GbjnHK2
Each row is a transaction, containing different items in the columns. Quick snapshot of the data:
3060615;;;;;;;;;;;;;;;
3060612;3060616;;;;;;;;;;;;;;
3020703;;;;;;;;;;;;;;;
3002469;;;;;;;;;;;;;;;
3062800;;;;;;;;;;;;;;;
3061943;3061965;;;;;;;;;;;;;;
The code
trans = read.transactions("Transactions.csv", format = "basket", sep = ";")
f <- eclat(trans, parameter = list(supp = 0.1, maxlen = 17, tidLists = TRUE))
dim(tidLists(f))
as(tidLists(f), "list")
Could it be due to the data structure? In that case, how should I change it? Furthermore, what do I do to get the suggested itemsets? I couldn't figure that out from the wiki.
EDIT: I used 0.004 for supp, as suggested by #hpesoj626. But it seems like the function is grouping the orders/users and not the items. I don't know how to export the data, so here is a picture of the tidLists:
The problem is that you have set your support too high. Try lowering supp, say to supp = 0.001, for which we get
dim(tidLists(f))
# [1] 928 15840
For your data set, the highest support is 0.08239 which is below 0.1. That is why you are getting no results with supp = 0.1.
inspect(head(sort(f, by = "support"), 10))
# items support count
# [1] {3060620} 0.08239 1305
# [2] {3060619} 0.07260 1150
# [3] {3061124} 0.05688 901
# [4] {3060618} 0.05663 897
# [5] {4027039} 0.04975 788
# [6] {3060617} 0.04564 723
# [7] {3061697} 0.04306 682
# [8] {3060619,3060620} 0.03087 489
# [9] {3039715} 0.02727 432
# [10] {3045117} 0.02708 429
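To see what the support threshold does without arules, the single-item supports can be computed by hand. This is a rough sketch in base R: the six transactions are the rows from the snapshot in the question, and the two thresholds are arbitrary values chosen for illustration, so the numbers say nothing about the full data set.

```r
# The six transactions from the snapshot in the question
transactions <- list(
  "3060615",
  c("3060612", "3060616"),
  "3020703",
  "3002469",
  "3062800",
  c("3061943", "3061965")
)

# Support of an item = share of transactions that contain it
item_counts <- table(unlist(lapply(transactions, unique)))
support <- as.numeric(item_counts) / length(transactions)
names(support) <- names(item_counts)

# Count the items that survive a given minimum support
n_frequent_high <- sum(support >= 0.2)  # threshold above every item's support
n_frequent_low  <- sum(support >= 0.1)  # threshold below every item's support
print(c(high = n_frequent_high, low = n_frequent_low))
```

Checking the largest value in support before calling eclat gives a quick upper bound for a sensible supp.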

Difference between data[ , "col"] and data$col

From other answers on this site on similar questions, and e.g. from pages like http://www.r-tutor.com/r-introduction/data-frame/data-frame-column-vector , it seems that when I extract a variable from a data.frame, data[ , "col"] and data$col should yield the same result. But now I have some data in Excel:
LU Urban_LU LU_Index Urban_LU_index
Residential Residential 2 0
Rural residential Residential 3 0
Commercial Commercial 4 1
Public institutions including education Industrial 5 1
Industry Industrial 7 2
and I read it with read_excel from the readxl package:
library(readxl)
data <- read_excel("data.xlsx", "Sheet 1")
Now I extract a single variable from the data frame, using [ or $:
data[ , "LU"]
# Source: local data frame [5 x 1]
#
# LU
# (chr)
# 1 Residential
# 2 Rural residential
# 3 Commercial
# 4 Public institutions including education
# 5 Industry
data$LU
# [1] "Residential" "Rural residential"
# [3] "Commercial" "Public institutions including education"
# [5] "Industry"
length(data[ , "LU"])
# [1] 1
length(data$LU)
# [1] 5
Also, what I find suspicious are the classes of the data obtained from read_excel and the data which results from the two different modes of extraction:
class(data)
# [1] "tbl_df" "tbl" "data.frame"
class(data[ , "LU"])
# [1] "tbl_df" "data.frame"
class(data$LU)
# [1] "character"
So what's the difference between [ , "col"] and $col? Am I missing something from the manuals or is this a special case? Also, what's with the tbl_df and tbl class identifiers? I suspect that they are the cause of my confusion. What do they mean?
More of an extended comment:
The fact that readxl::read_excel returns output of class tbl_df seems poorly documented in ?read_excel. This behaviour was mentioned in the announcement of readxl on the RStudio blog though:
"[read_excel r]eturns output with class c("tbl_df", "tbl", "data.frame")"
To learn more about tbl_df, we need to consult the dplyr help pages. In the Methods section of ?dplyr::tbl_df, we find that
"tbl_df implements two important base methods: [ Never simplifies (drops), so always returns data.frame".
For more background, read about the drop argument in ?`[.data.frame`.
Related Q&A: Extract a dplyr tbl column as a vector and Best practice to get a dropped column in dplyr tbl_df.
See also the 'original' issue on github and the discussion therein.
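The effect of drop can be reproduced with a plain data.frame, without readxl or dplyr. The sketch below uses a hypothetical three-row cut of the data from the question; drop = FALSE is used here to imitate the behaviour described for tbl_df, it is not how tbl_df is implemented.

```r
# Small data.frame modelled on the question's data (hypothetical subset)
df <- data.frame(
  LU = c("Residential", "Commercial", "Industry"),
  LU_Index = c(2, 4, 7),
  stringsAsFactors = FALSE
)

as_vector <- df[, "LU"]                # default drop = TRUE: character vector
as_frame  <- df[, "LU", drop = FALSE]  # stays a one-column data.frame
as_dollar <- df$LU                     # always the column itself

print(length(as_vector))  # 3 elements
print(length(as_frame))   # 1 column
```

On a tbl_df, [[ or $ is therefore the reliable way to get the column as a vector, since [ keeps the data frame structure.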
