pattern matching with multiple options in R

pattern matching with multiple options in R - r

I have a list of filenames all of which come from different providers. The provider names are given inconsistently, for instance "Apple" and "APPL". I am trying to find a way to append the correct name to the file dataframe, if the data has a matching string. For example, if a filename contained "Apple" then the correct name "APPL" would be appended next to it, in the dataframe. Sorry if I haven't included my attempt at it, I just think that would confuse the question as I am a complete beginner. Thanks heaps!
(In reality I have about 1000 filenames, and 30 or so provider names each of which has around 3 possible patterns it could be. Yikes!)
stock_tickers <- data.frame("APPL", "MSFT")
financial_reports <- data.frame("financial_report_names" = "AppleFinance.csv", "APPLStock.csv", "financesMICROSOFT.csv", "MSFTstocks.csv", "UberStocks.csv",
"report_month" = "202101", "202101", "202102", "202102", "202102")
APPL_matches <- c("APPL", "Apple")
MSFT_matches <- c("MSFT", "Microsoft")
#expected output
# financial_report_names report month matching ticker
# "AppleFinance.csv" 202101 APPL
# "APPLStock.csv" 202102 APPL
# "financesMICROSOFT.csv" 202102 MSFT
# "MSFTstocks.csv" 202102 MSFT
# "UberStocks.csv" 202102 N/A

You can create a regex pattern which can take all possible combination of patterns of a ticker and use stringr::str_replace_all -
ptrn <- c(".*(APPL|Apple|APPLE).*" = 'APPL',
".*(MSFT|Microsoft|MICROSOFT).*" = 'MSFT',
".*(Uber).*" = 'UBER')
stringr::str_replace_all(financial_reports$financial_report_names, ptrn)
#[1] "APPL" "APPL" "MSFT" "MSFT" "UBER"
You can also generate this pattern dynamically if saved somewhere in dataframe or csv.
data
financial_reports <- structure(list(financial_report_names = c("AppleFinance.csv",
"APPLStock.csv", "financesMICROSOFT.csv", "MSFTstocks.csv", "UberStocks.csv"
), report_month = c("202101", "202101", "202102", "202102", "202102"
)), class = "data.frame", row.names = c(NA, -5L))

Related

Search and replace file names according to strings in dataframe. In R

I have several files in a folder that look like "blabla_A1_bla.txt", "blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd"...and all the way to H12.
Then I have a df that indicates which sample is each one.
well
sample
A1
F32-1
B1
F13-3
C1
B11-4
...
...
I want to rename the files in the folder according to the table. So that A1 gets replaces by F32-1, B1 by F13-3 and so on.
I have created a list of all the files in the directory with files<-list.files(directory). I know how to use the str_replace function of the stringr package to change them one by one, but I don't know how to make it automatic. I guess I need a loop that reads cell 1,1 of the dataframe, searches that string in "files" and replaces it with the value in cell 1,2. And then moves to cell 2,1 and so on. But I don't know how to code this. (Or if there is a better way to do it).
I'll appreciate your help with this.

You can create a named vector of replacement and pattern and use it in str_replace_all
files <- list.files(directory)
files <- stringr::str_replace_all(files, setNames(df$sample, df$well))
Using a reproducible example -
df <- structure(list(well = c("A1", "B1", "C1"), sample = c("F32-1",
"F13-3", "B11-4")), class = "data.frame", row.names = c(NA, -3L))
files <- c("blabla_A1_bla.txt", "blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd")
stringr::str_replace_all(files, setNames(df$sample, df$well))
#[1] "blabla_F32-1_bla.txt" "blabla_F32-1_bla.phd" "blabla_F13-3_bla.txt"
#[4] "blablabla_F13-3_bla.phd"

I would first create a vector of new names and then use the function file.rename:
files = c("blabla_A1_bla.phd","blabla_B1_bla.txt", "blablabla_B1_bla.phd")
patterns = c('A1', 'B1')
replace = c('F22', 'G22')
new.name = c()
for (f in files){
# first identify which pattern corresponds to file f (sis it A1, B1, ...)
which.pattern = which(sapply(patterns, grepl, x = f))
# and then replace it by the correct string
new.name = c(new.name, gsub(patterns[which.pattern], replace[which.pattern], f))
}
file.rename(files, new.name)
replacing patterns and replace by df$well and df$sample should work for your case.

How to change values in unnamed first column

How do I change the entries of the first column in the matrix returned by read_csv if it doesn't have a header?
My variables currently looks like this:
PostFC C1Mean
WBGene00001816 2.475268e-01 415.694457
WBGene00001817 4.808575e+00 2451.018711
and I'd like to rename WBGene0000XXXX to XXXX.

If the first column is actually the rownames do the following
rownames(data) <- gsub(pattern = "WBGene0000", replacement = "", x = rownames(data))
If it isn't consistent, you may want to consider the stringr package and use the substr function
But if it is actually a vector with no header column, I do not know how to reference it without knowing the structure of the data.
run the str function of the data set and see what it returns. Or do the following as a test
colnames(data)[1] <- "test"
Can't exactly help until we know how you have a "zero-length" variable name

If I understand your question correctly the first "unnamed" column you describe are rownames and are not actually in you data.frame
# Example data
df = data.frame(PostFC = c(2.475268e-01, 4.808575e+00), C1Mean = c(415.694457, 2451.018711) )
rownames(df) = c("WBGene00001816", "WBGene00001817")
df
# PostFC C1Mean
# WBGene00001816 0.2475268 415.6945
# WBGene00001817 4.8085750 2451.0187
# change rownames
rownames(df) = c("rowname1", "rowname2")
df
# PostFC C1Mean
# rowname1 0.2475268 415.6945
# rowname2 4.8085750 2451.0187

The entries addressed are actually row names. We can access them with rownames(.).
rownames(df1)
# [1] "WBGene00001816" "WBGene00001817" "WBGene00001818" "WBGene00001819"
# [5] "WBGene00001820" "WBGene00001821" "WBGene00001822"
In R also implemented is rownames<-, i.e. we can assign new rownames by doing rownames(.) <- c(.).
Now in your case it looks like if you want to keep just the last four digits. We may use substring here, which we tell from which digit it should extract. In our case it is the 11th digit to the last, so we do:
rownames(df1) <- substring(rownames(df1), 11)
df1
# PostFC C1Mean
# 1816 0.36250598 2.1073145
# 1817 0.51068402 0.4186838
# 1818 -0.96837330 -0.7239156
# 1819 0.02331745 -0.5902216
# 1820 -0.56927945 1.7540356
# 1821 -0.51252943 0.1343385
# 1822 0.47263180 1.4366233
Note, that duplicated row names are not allowed, i.e. if you obtain duplicates applying this method it will yield an error.
Data used
df1 <- structure(list(PostFC = c(0.362505982864934, 0.510684020059692,
-0.968373302351162, 0.0233174467410604, -0.56927945273647, -0.512529427359891,
0.472631804850333), C1Mean = c(2.10731450148575, 0.418683823183885,
-0.723915648073638, -0.590221641040516, 1.75403562218217, 0.134338480077884,
1.43662329542089)), class = "data.frame", row.names = c("1816",
"1817", "1818", "1819", "1820", "1821", "1822"))

how to combine the result as a data frame in case of vector

gender_value<-function(author){
au<-sub("([^,]+),\\s*(.*)", "\\2 \\1", author)
r<-GET(paste0("https://genderapi.io/api?name=",au))
g<-content(r)$gender
n<-c(au,g)}
author("nandan,vivek","paswan, jyoti")
If I want to pass more than 10000 name in a author chosen from csv file, how could I do this.
I want the write the result in .csv file after bind the result "n".
like a table form with two columns(name,gender) and more than 10000 row(eg. jyoti paswan,female)

Pass only first name to the API as it returns NULL for names with spaces or last name
gender_value<-function(author) {
au <-sub("([^,]+),\\s*(.*)", "\\2 \\1", author)
first_name <- sub("(.*)\\s+.*", "\\1", au)
r <- GET(paste0("https://genderapi.io/api?name=",first_name))
g <- content(r)$gender
g <- if (is.null(g)) NA else g
data.frame(author_name = au, gender = g)
}
Have a vector of names which you want to pass to the function
author_names <- c("nandan,vivek","paswan, jyoti")
and then use lapply to apply the function to each name. If you have names to be passed in a dataframe use dataframe_name$column_name instead of author_names.
df <- do.call(rbind, lapply(author_names, gender_value))
df
# author_name gender
#1 vivek nandan male
#2 jyoti paswan female
You can now write df to csv using write.csv
write.csv(df, "/path/of/file.csv", row.names = FALSE)

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\\n\\\((.*)\\\)','\\\1',df$location1)
I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subset using regex, using scan() function and to many other variations. Now I am all confused. Thx

We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413

You can try strsplit or data.table::tstrsplit(strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to sperate state and zip and cite names may contain spaces, You can try a two-step way:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')

Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341

So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
"[A-Z]{2}",
"\\d+(?=\\n)",
"[\\d-\\.]+(?=,)",
"[\\d-\\.]+(?=\\))")
varNames <- c("city",
"state",
"zip",
"lat",
"long")
map2_dfc(varNames, rgexVec, function(vn, rg) {
extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
names(extractedVal) <- vn
extractedVal %>% as_tibble()
})

\\1 is a back reference in regex. It is similar to a wildcard (*) that will grab all instances of your search term, not just the first one it finds.

How can I do a regular expression loop?

So my situation is that I have a list of files in a physical chemistry dataset which I created from multiple calculations, and I want run a foreach or while loop through a column named Files in my dataframe titled, CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.
I have filenames which look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc ...
I want to run a while or foreach loop through my column "Files", and if exists the word "GLU" or "ASP", and then if I find "GLU" or "ASP", in the file I want to print it to a list.
So in the files above, the printing order would be "GLU", "ASP", "GLU", "ASP". Again, my files aren't order in any particular way, and all the way down through my 1273 entries of files. Then I can save this list and put it into a column title "Residues" in my dataframe, and do some useful exploratory data analysis.
Note: ASP is for the amino acid Aspartate, and GLU is for the amino acid Glutamate.
I know that I can regular expression search grep for the terms in a the column "Files" like so.
Searching for "ASP":
> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-198-A_ASP-197-A.log:"
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"
[4] "1EU1A_TRP-32-A_ASP-33-A.log:"
As you can see I get a few matches. In fact I get 683 matches. But that's not good enough. I need the matches where they occur, not that they occur.
And of course I can grep for "GLU":
> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-16-A_GLU-9-A.log:"
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"
And I get a whole bunch of matches!
I tried a for loop. Of course it failed!!!
> for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("ASP")}
else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("GLU")}}
All it did was print:
[1] "ASP"
[1] "ASP"
[1] "ASP"
...
Even though there is "GLU"!
I mean I can do basic algebraic loops that don't matter to anyone:
> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16
Anyway, I checked the warnings to see what was going wrong:
> warnings()
Warning messages:
1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
As you can see I'm getting the same error over and over again. I guess that makes sense since this is a a loop. But why is this happening, and why can't I grep inside of a loop?
My dataframe that I am trying to parse looks like this:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437
where commas separate columns.
This is what I want the result to look like:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",
...
Any help is appreciated! Thank you!

We can use split the dataset into a list of data.frame using substring derived with sub
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))
data
df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:",
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:",
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:",
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197,
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055,
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1",
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896,
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145,
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X",
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))

I am not sure I get your question completely but consider your data resides in "dat" data(which contains rows for GLU and ASP). Use below to tabulate a field which can contain the data of "ASP" and "GLU".
library(stringr)
newvar <- NULL
newvar$GLU <- str_extract(dat$Files,"(GLU)")
newvar$ASP <- str_extract(dat$Files,"(ASP)")
newvar1 <- data.frame(newvar)
newvar1
library(tidyr)
newvar1[is.na(newvar1)] = ""
new <- unite(newvar1, new, GLU:ASP, sep='')
dat$new <- new
Here the field called new would contain your value of GLU and ASP
Answer:
dat
X Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1 1AH7A_TRP-16-A_GLU-9-A.log: -8.497878 CD1 4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log: -7.926482 CD1 3.543075 ASP
3 3 1BGFA_TRP-43-A_GLU-44-A.log: -6.735078 CD1 4.171795 GLU
4 4 1CXQA_TRP-61-A_ASP-82-A.log: -9.398872 CD1 5.298973 ASP
5 5 1D8WA_TRP-17-A_GLU-14-A.log: -9.747203 CD1 3.693986 GLU
6 6 1D8WA_TRP-17-A_GLU-18-A.log: -11.323520 CD1 3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log: -7.468913 CD1 5.411084 GLU
8 8 1E58A_TRP-15-A_GLU-18-A.log: -6.598308 CD1 4.797902 GLU

After a long time I figured out a solution to my problem:
# Save my column as a vector because factors are making the world burn:
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
# Split the Files into three parts along the two underscores, and save it back to my vector, preserving the third cut around the underscore.
Files <- str_split_fixed(Files, "_", 3)[,3]
Result:
[1] "GLU-9-A.log:"
"ASP-197-A.log:"
etc ...
# Split those results along the hyphens, and take what's next to the first hyphen or the first cut:
Residues <- str_split_fixed(Files, "-", 3)[,1]
> Residues
[1] "GLU" "ASP" "GLU", ...
Add the Residue columns to my data.frame.
CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residue
I guess the grep function is overrated. I had to look hard for this function.

Assuming you saved the data you are trying to parse in the file glu_vs_asp.csv.
Below is an example of how you can create two data frames, one for GLU and one for ASP:
# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)
# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]
dt_asp <- dt[grep("ASP", dt$Files),]
To create a data frame containing both GLU and ASP you can try the following:
dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]
The commands
grep("ASP", dt$Files)
grep("GLU", dt$Files)
give you the indices of the rows that contain respectively 'ASP' and 'GLU' in the Files column.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

pattern matching with multiple options in R - r

Related

Search and replace file names according to strings in dataframe. In R

How to change values in unnamed first column

how to combine the result as a data frame in case of vector

Subset strings in R

How can I do a regular expression loop?

Categories

Resources