pull subject ids after detecting characters matching a string - r

Please help me pull subject IDs after identifying the participants whose codes do not contain the specified characters, e.g.:
data:
df <- structure (list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"), edta_codes = c("4EDTA-3M783316", "4EDTA-3M2897865", "4EDTA-M280934", "4EDTA-3M286549","MCF -3M289684", "NA")), class = "data.frame", row.names = c (NA, -6L))
Code to test if character is in string:
df$edta_codes[!grepl("4EDTA-3", df$edta_codes)]
Different method:
str_detect(df$edta_codes,"4EDTA-3")
Both give me the result I want, but from here I want to show the subject IDs that do not have the specified string, including those with NA (i.e. in this case 191-3457, 191-1245 and 191-2365 all differ from the specified characters). I have tried using pull after each of the snippets above, and neither worked.
Please help.

You can simply do,
df[!grepl("4EDTA-3", df$edta_codes),'subject_id']
#[1] "191-3457" "191-1245" "191-2365"
If you want to return also the codes, then,
df[!grepl("4EDTA-3", df$edta_codes),]
# subject_id edta_codes
#3 191-3457 4EDTA-M280934
#5 191-1245 MCF -3M289684
#6 191-2365 NA
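Note that in the example data the last code is the string "NA", not a missing value. As a minimal sketch, assuming a hypothetical variant of the data in which the code is a true NA, grepl() returns FALSE for missing values, so negating it still keeps those subjects:

```r
# Hypothetical variant of the question's data with a real missing value
df <- data.frame(
  subject_id = c("191-5467", "191-3457", "191-2365"),
  edta_codes = c("4EDTA-3M783316", "4EDTA-M280934", NA),
  stringsAsFactors = FALSE
)

# grepl() gives FALSE for NA input, so !grepl() is TRUE for the NA row
no_match <- !grepl("4EDTA-3", df$edta_codes, fixed = TRUE)

# Subject IDs whose code does not contain the string (NA included)
ids <- df$subject_id[no_match]
ids
```

By contrast, stringr::str_detect() returns NA for a missing input, so that result would need an explicit is.na() check before subsetting.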

Related

Decoding GS1 string using R

In a dataframe, one column includes a GS1 code scanned from barcodes. A GS1 code is a string including different types of information. Application Identifiers (AI) indicate what type of information the next part of the string is.
Here is an example of a GS1 string: (01)8714729797579(17)210601(10)23919374
The AI is indicated between brackets. In this case (01) means 'GTIN', (17) means 'Expiration Date' and (10) means 'LOT'.
What I would like to do in R is create three different columns from the single column, using the AI meanings as the new column names.
I tried using 'separate', but the brackets aren't removed. Why aren't the brackets removed?
df <- data.frame(id =c(1, 2, 3), CODECONTENT = c("(01)871(17)21(10)2391", "(01)579(17)26(10)9374", "(01)979(17)20(10)9193"))
df <- df %>% separate(CODECONTENT, c("GTIN", "Expiration_Date"), "(17)", extra = "merge") %>%
separate(Expiration_Date, c("Expiration Date", "LOT"), "(10)", extra = "merge")
The above returns the following:
  id     GTIN Expiration Date   LOT
1  1 (01)871(            )21( )2391
2  2 (01)579(            )26( )9374
3  3 (01)979(            )20( )9193
I am not sure why the brackets are still there. Besides removing the bracket would there be a smarter way to also remove the first AI (01) in the same code?
Because the parenthesis symbols are special characters, you need to tell the regex to treat them literally. One option is to surround them in square brackets.
df %>%
separate(col = CODECONTENT,
sep = "[(]17[)]",
into = c("gtin", "expiration_date")) %>%
separate(expiration_date,
sep = "[(]10[)]",
into = c("expiration_date", "lot"),
extra = "merge")
id gtin expiration_date lot
1 1 (01)871 21 2391
2 2 (01)579 26 9374
3 3 (01)979 20 9193
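The question also asks about dropping the leading (01). One possible sketch in base R, assuming every code contains the three AIs in the order (01), (17), (10), is to escape the parentheses with \\( and \\) and capture all three fields in a single pattern. The names codes, pattern and decoded are illustrative only:

```r
codes <- c("(01)871(17)21(10)2391", "(01)579(17)26(10)9374", "(01)979(17)20(10)9193")

# \\( and \\) match literal parentheses; each (.*) captures one field
pattern <- "^\\(01\\)(.*)\\(17\\)(.*)\\(10\\)(.*)$"

# regexec() keeps the capture groups; the first element is the full match
parts <- regmatches(codes, regexec(pattern, codes))

# Drop the full match and bind the three captured fields into columns
decoded <- do.call(rbind, lapply(parts, function(p) p[-1]))
colnames(decoded) <- c("gtin", "expiration_date", "lot")
decoded <- as.data.frame(decoded, stringsAsFactors = FALSE)
decoded
```

This sketch does not handle codes with missing or reordered AIs; those would need a more flexible parser.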

pattern matching with multiple options in R

I have a list of filenames all of which come from different providers. The provider names are given inconsistently, for instance "Apple" and "APPL". I am trying to find a way to append the correct name to the file dataframe, if the data has a matching string. For example, if a filename contained "Apple" then the correct name "APPL" would be appended next to it, in the dataframe. Sorry if I haven't included my attempt at it, I just think that would confuse the question as I am a complete beginner. Thanks heaps!
(In reality I have about 1000 filenames, and 30 or so provider names each of which has around 3 possible patterns it could be. Yikes!)
stock_tickers <- data.frame(ticker = c("APPL", "MSFT"))
financial_reports <- data.frame("financial_report_names" = c("AppleFinance.csv", "APPLStock.csv", "financesMICROSOFT.csv", "MSFTstocks.csv", "UberStocks.csv"),
"report_month" = c("202101", "202101", "202102", "202102", "202102"))
APPL_matches <- c("APPL", "Apple")
MSFT_matches <- c("MSFT", "Microsoft")
#expected output
# financial_report_names report month matching ticker
# "AppleFinance.csv" 202101 APPL
# "APPLStock.csv" 202102 APPL
# "financesMICROSOFT.csv" 202102 MSFT
# "MSFTstocks.csv" 202102 MSFT
# "UberStocks.csv" 202102 N/A
You can create a regex pattern that covers all the possible spellings of each ticker and use stringr::str_replace_all:
ptrn <- c(".*(APPL|Apple|APPLE).*" = 'APPL',
".*(MSFT|Microsoft|MICROSOFT).*" = 'MSFT',
".*(Uber).*" = 'UBER')
stringr::str_replace_all(financial_reports$financial_report_names, ptrn)
#[1] "APPL" "APPL" "MSFT" "MSFT" "UBER"
You can also generate this pattern dynamically if the mappings are stored somewhere in a data frame or CSV file.
data
financial_reports <- structure(list(financial_report_names = c("AppleFinance.csv",
"APPLStock.csv", "financesMICROSOFT.csv", "MSFTstocks.csv", "UberStocks.csv"
), report_month = c("202101", "202101", "202102", "202102", "202102"
)), class = "data.frame", row.names = c(NA, -5L))
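As a rough sketch of building the patterns dynamically, assuming the spellings are kept in a named list (lookup and match_ticker are hypothetical names, not part of the answer above), the following base R version also leaves unmatched files as NA, as in the expected output of the question:

```r
files <- c("AppleFinance.csv", "APPLStock.csv", "financesMICROSOFT.csv",
           "MSFTstocks.csv", "UberStocks.csv")

# Assumed lookup: ticker name -> spellings that should map to it
lookup <- list(APPL = c("APPL", "Apple"), MSFT = c("MSFT", "Microsoft"))

match_ticker <- function(files, lookup) {
  ticker <- rep(NA_character_, length(files))
  for (name in names(lookup)) {
    # Join the spellings into one alternation, e.g. "APPL|Apple"
    pattern <- paste(lookup[[name]], collapse = "|")
    # Only fill entries that have not been matched yet
    hit <- is.na(ticker) & grepl(pattern, files, ignore.case = TRUE)
    ticker[hit] <- name
  }
  ticker
}

matching_ticker <- match_ticker(files, lookup)
matching_ticker
```

Using ignore.case = TRUE means "MICROSOFT" and "Microsoft" do not both need to be listed; whether that is safe depends on how distinctive the provider names are.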

Extracting string before a fixed character position

It's a fairly simple question, but I have tried multiple combinations and cannot get what I want.
I have a column whose values are separated by "-". For rows dated from April onward, I want to extract the text before the fourth "-".
The code I am using removes the part before the fourth "-" and returns whatever comes after it.
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"), sub(".?-.?-.?-.?-", "", data$Email), ifelse(data$date <= as.Date("2019-03-31"),data$Email,data$newCol1))
However, I want to extract the portion before the 4th "-". For example, if my string is "19Q1-XYZ-JA-All-OutR-random-key-March", I want only 19Q1-XYZ-JA-All instead of OutR-random-key-March, which is what I currently get.
This is my dataset
Email date
18Q4-ABC-SEA-CO-TM 1/8/2019
19Q1-DEF-ABJPODTSST 1/16/2019
19Q1-ABC-CMJ 2/8/2019
19Q1-APC-CORP 4/9/2019
19Q1-XYZ-ALP-SEA-MOO ABc_1 5/13/2019
19Q1-WXY-All-SF- Coral 01_24 1/27/2019
19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send 3/14/2019
19Q1-XYZ-CN-All-cra-foo world-2901 1/30/2019
19Q1-XYZ-CN-All-get-foo world-2901 1/31/2019
19Q1-XYZ-CN-All-opc-foo world-2901 7/31/2019
19Q1-XYX-FI-AC-DEC-kites 1/21/2019
19Q1-XYZ-JA-All-OutR-random-key-March 7/19/2019
19Q1-XYZ-JA-All-OutR-random-key-March 6/19/2019
19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March 3/29/2019
19Q1-XYZ-unavailable-random-key-balaji 4/20/2019
One option is to match three sets of characters that are not a "-", each followed by a "-", and then the next set of characters that are not a "-" ([^-]+), capture all of that as a group, and replace the whole string with the backreference (\\1) to that captured group.
data$date <- as.Date(data$date, "%m/%d/%Y")
data$newCol1 <- NA
data$newCol1 <- ifelse(data$date >= as.Date("2019-04-01"),
sub("^(([^-]+-){3}[^-]+)-.*", "\\1", data$Email),
ifelse(data$date <= as.Date("2019-03-31"),data$Email,data$newCol1))
data
data <- structure(list(Email = c("18Q4-ABC-SEA-CO-TM", "19Q1-DEF-ABJPODTSST",
"19Q1-ABC-CMJ", "19Q1-APC-CORP", "19Q1-XYZ-ALP-SEA-MOO ABc_1",
"19Q1-WXY-All-SF- Coral 01_24", "19Q1-XYZ-All-SF-Tokyo SF Event 03_14 FINAL Send",
"19Q1-XYZ-CN-All-cra-foo world-2901", "19Q1-XYZ-CN-All-get-foo world-2901",
"19Q1-XYZ-CN-All-opc-foo world-2901", "19Q1-XYX-FI-AC-DEC-kites",
"19Q1-XYZ-JA-All-OutR-random-key-March", "19Q1-XYZ-JA-All-OutR-random-key-March",
"19Q1-XYZ-JA-SF-OutR-RFC_ABS-key-March", "19Q1-XYZ-unavailable-random-key-balaji"
), date = c("1/8/2019", "1/16/2019", "2/8/2019", "4/9/2019",
"5/13/2019", "1/27/2019", "3/14/2019", "1/30/2019", "1/31/2019",
"7/31/2019", "1/21/2019", "7/19/2019", "6/19/2019", "3/29/2019",
"4/20/2019")), class = "data.frame", row.names = c(NA, -15L))
An easy solution is to use the gregexpr function to get the positions of all "-" characters and then extract the substring up to the fourth one.
I use the data from akrun's answer:
result <- sapply(data$Email, function(x)substr(x, 1, gregexpr("-",x)[[1]][4]-1))
result
This returns NA for strings that contain fewer than four "-" characters, because there is no fourth position to cut at; you can add an if condition to handle those cases.
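One way to add that condition, as a minimal sketch (first_four_fields is a hypothetical helper name, and returning the string unchanged is an assumption about the desired behavior):

```r
first_four_fields <- function(x) {
  # Positions of every "-" in the string; -1 if there is none
  pos <- gregexpr("-", x, fixed = TRUE)[[1]]
  # Fewer than four hyphens: nothing to cut, return the string unchanged
  if (pos[1] == -1 || length(pos) < 4) {
    return(x)
  }
  # Keep everything before the fourth hyphen
  substr(x, 1, pos[4] - 1)
}

emails <- c("19Q1-APC-CORP", "19Q1-XYZ-JA-All-OutR-random-key-March")
result <- vapply(emails, first_four_fields, character(1), USE.NAMES = FALSE)
result
```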

odd behavior when substituting parts of a string within a for loop

I'm trying to replace a series of numbers in a character string with information that comes from a dataframe.
My string comes from a text file that I imported using the readr package as follows: read_file("Human.txt")
I've checked the class, it is character. The string contains the following information (I've named it treeString):
"(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
My dataframe (labels.csv) was originally in factor format, but I changed the format of the second column to character using the following command: labels[,2] = as.character(labels[,2]). It looks like this
v1 v2
1 1 name1
2 2 name2
3 3 name3
My goal is to substitute every number in the string with the corresponding name (i.e. v2) in the dataframe. This should result in the following:
"(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Here is the code I am using to accomplish this:
for(i in 1:nrow(labels)){
gsub(as.character(i), labels[i,2], treeString)
}
The weird thing is that if I run the gsub() command on its own (with specified numbers - eg. 2) it does the substitution, however, when I run it in a loop it does not substitute the numbers.
As pointed out by Kumar Manglam in the comments, you forgot to assign the result of gsub() back to treeString.
There is something else you should be aware of: with the regular expression as specified in your question, the loop will also replace patterns like "(241)" with "(name24name1)". To avoid this behaviour, you should check whether the numbers you want to replace are preceded by a comma or opening parenthesis and followed by a comma or closing parenthesis:
# Option1
for(i in 1:nrow(labelnames)){
reg_pattern <- paste0("(?<=[(,])(", i, ")(?=[),])")
treeString <- gsub(reg_pattern, labelnames$v2[i], treeString, perl=T)
}
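Another, nicer, option is to drop the for loop and do it all at once. Note that the character class built below only works for up to nine labels, and the replacement assumes every label is "name" followed by its number: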
# Option2
reg_pattern <- paste0("(?<=[(,])([1-", nrow(labelnames), "])(?=[),])")
treeString <- gsub(reg_pattern, "name\\1", treeString, perl=T)
# Result
treeString
# "(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Data
treeString <- "(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
labelnames <- structure(list(v1 = 1:3, v2 = c("name1", "name2", "name3")), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA, -3L))
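Option 2 depends on the labels following the "name" plus number pattern. A more general sketch, assuming arbitrary labels keyed by v1, is to find every number that sits between delimiters and look each one up, leaving numbers without a label untouched. The shortened treeString and the lookup vector are illustrative only:

```r
treeString <- "(1,2,((((3),884),(519,(516,517)))"
labelnames <- data.frame(v1 = 1:3, v2 = c("name1", "name2", "name3"),
                         stringsAsFactors = FALSE)

# Named vector: number (as text) -> label
lookup <- setNames(labelnames$v2, labelnames$v1)

# Whole numbers preceded by "(" or "," and followed by ")" or ","
m <- gregexpr("(?<=[(,])[0-9]+(?=[),])", treeString, perl = TRUE)
ids <- regmatches(treeString, m)[[1]]

# Use the label when one exists, otherwise keep the original number
replacement <- ifelse(ids %in% names(lookup), lookup[ids], ids)
regmatches(treeString, m) <- list(unname(replacement))
treeString
```

Because each number is matched as a whole, this also avoids the "(241)" problem without looping over the labels.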

How can I do a regular expression loop?

I have a list of files in a physical chemistry dataset that I created from multiple calculations, and I want to run a foreach or while loop through a column named Files in my dataframe, CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.
I have filenames which look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc ...
I want to loop through the column "Files" and, whenever a filename contains "GLU" or "ASP", add that word to a list.
So for the files above, the printing order would be "GLU", "ASP", "GLU", "ASP". My files aren't ordered in any particular way, and this continues all the way down through my 1273 entries. Then I can save this list, put it into a column titled "Residues" in my dataframe, and do some useful exploratory data analysis.
Note: ASP is for the amino acid Aspartate, and GLU is for the amino acid Glutamate.
I know that I can use grep to search the column "Files" for the terms, like so.
Searching for "ASP":
> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-198-A_ASP-197-A.log:"
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"
[4] "1EU1A_TRP-32-A_ASP-33-A.log:"
As you can see I get a few matches. In fact I get 683 matches. But that's not good enough. I need the matches where they occur, not that they occur.
And of course I can grep for "GLU":
> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-16-A_GLU-9-A.log:"
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"
And I get a whole bunch of matches!
I tried a for loop. Of course it failed!!!
> for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("ASP")}
else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("GLU")}}
All it did was print:
[1] "ASP"
[1] "ASP"
[1] "ASP"
...
Even though there is "GLU"!
I mean I can do basic algebraic loops that don't matter to anyone:
> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16
Anyway, I checked the warnings to see what was going wrong:
> warnings()
Warning messages:
1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
As you can see I'm getting the same warning over and over again. I guess that makes sense since this is a loop. But why is this happening, and why can't I grep inside of a loop?
My dataframe that I am trying to parse looks like this:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437
where commas separate columns.
This is what I want the result to look like:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",
...
Any help is appreciated! Thank you!
We can split the dataset into a list of data.frames using the substring extracted with sub
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))
data
df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:",
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:",
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:",
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197,
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055,
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1",
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896,
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145,
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X",
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
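The same sub() call can also produce the Residue column the question asks for, without splitting the data. A minimal sketch on three of the filenames (the vector names files and residue are illustrative):

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:")

# ".*_" is greedy, so it runs up to the last underscore; the three
# capital letters after it are captured and replace the whole string
residue <- sub(".*_([A-Z]{3})-.*", "\\1", files)
residue
```

With the full data frame this would be assigned as df1$Residue <- sub(".*_([A-Z]{3})-.*", "\\1", df1$Files).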
I am not sure I understand your question completely, but suppose your data resides in a data frame called "dat" (which contains rows for GLU and ASP). Use the code below to build a field that contains "ASP" or "GLU" for each row.
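library(stringr)
library(tidyr)
# Extract each residue name where present (NA otherwise)
newvar <- NULL
newvar$GLU <- str_extract(dat$Files,"(GLU)")
newvar$ASP <- str_extract(dat$Files,"(ASP)")
newvar1 <- data.frame(newvar)
newvar1
# Replace NA with "" so the two columns can be pasted together
newvar1[is.na(newvar1)] = ""
new <- unite(newvar1, new, GLU:ASP, sep='')
# Store the united column as a plain character vector
dat$new <- new$new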
Here the field called new would contain your value of GLU and ASP
Answer:
dat
X Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1 1AH7A_TRP-16-A_GLU-9-A.log: -8.497878 CD1 4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log: -7.926482 CD1 3.543075 ASP
3 3 1BGFA_TRP-43-A_GLU-44-A.log: -6.735078 CD1 4.171795 GLU
4 4 1CXQA_TRP-61-A_ASP-82-A.log: -9.398872 CD1 5.298973 ASP
5 5 1D8WA_TRP-17-A_GLU-14-A.log: -9.747203 CD1 3.693986 GLU
6 6 1D8WA_TRP-17-A_GLU-18-A.log: -11.323520 CD1 3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log: -7.468913 CD1 5.411084 GLU
8 8 1E58A_TRP-15-A_GLU-18-A.log: -6.598308 CD1 4.797902 GLU
After a long time I figured out a solution to my problem:
# Save my column as a vector because factors are making the world burn:
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
# Split the Files into three parts along the two underscores, and save it back to my vector, preserving the third cut around the underscore.
Files <- str_split_fixed(Files, "_", 3)[,3]
Result:
[1] "GLU-9-A.log:"
"ASP-197-A.log:"
etc ...
# Split those results along the hyphens, and take what's next to the first hyphen or the first cut:
Residues <- str_split_fixed(Files, "-", 3)[,1]
> Residues
[1] "GLU" "ASP" "GLU", ...
# Add the Residue column to my data.frame:
CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residues
I guess the grep function is overrated. I had to look hard for this function.
Assuming you saved the data you are trying to parse in the file glu_vs_asp.csv, below is an example of how you can create two data frames, one for GLU and one for ASP:
# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)
# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]
dt_asp <- dt[grep("ASP", dt$Files),]
To create a data frame containing both GLU and ASP you can try the following:
dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]
The commands
grep("ASP", dt$Files)
grep("GLU", dt$Files)
give you the indices of the rows that contain respectively 'ASP' and 'GLU' in the Files column.
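This is also why the loop in the question printed "ASP" every time: grep() on the whole column returns a vector of indices, and if() only looked at its first element, which is non-zero whenever any file matches. A corrected loop, as a sketch, tests one filename per iteration with grepl(), which returns a single TRUE or FALSE:

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")

residues <- character(length(files))
for (i in seq_along(files)) {
  # Test only the i-th filename, so the condition has length one
  if (grepl("ASP", files[i], fixed = TRUE)) {
    residues[i] <- "ASP"
  } else if (grepl("GLU", files[i], fixed = TRUE)) {
    residues[i] <- "GLU"
  } else {
    residues[i] <- NA_character_
  }
}
residues
```

The loop is shown only to explain the warning; the vectorized approaches in the answers above do the same job in one call.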
