Extracting all .com, .in, .co.in from all elements - R

I have data in a CSV which contains the following column:
ARTICLE_URL
http://twitter.com/aviryadsh/statuses/528219883872337920
http://www.ibtimes.co.in/2014
I want to create another column next to this one that contains only the web address, like twitter.com, team-bhp.com, ibtimes.co.in, broadbandforum.co.
I have tried
text$ne=str_extract(Brand$ARTICLE_URL, '\\w+(.com)')
but this only returns URLs ending with .com. How can I fetch all the others as well?

I'd recommend using string replacement as opposed to string extraction in this instance. It's possible to do with string extraction, but the regular expression is kind of messy and not as readable as a two-step string replacement method. Here's how I'd do it:
urls <- c("http://twitter.com/aviryadsh/statuses/528219883872337920", "http://www.ibtimes.co.in/2014", "https://www.ibtimes.co.in/2014")
tmp <- stringr::str_replace_all(urls, "https?://|www\\.", "")
domains <- stringr::str_replace_all(tmp, "/.*", "")
And then looking at our output:
domains
# [1] "twitter.com" "ibtimes.co.in" "ibtimes.co.in"

Related

File renaming (string substitution) without a clear pattern using R

Currently, I am working with a long list of files.
They follow the naming pattern SB_xxx_(part).(extension) with different extensions, where xxx refers to an item code.
SB_19842.png
SB_19842_head.png
SB_19842_hand.png
SB_19842_head.pdf
...
It turns out that many of these codes are incorrect.
I have two columns in hand: one for the old codes and one for the new codes (let's say A and B). I want to change all the old codes in the file names to the new codes.
old new
12154 24124
92482 02425
.....
My first thought is to use file.rename()
However, that is a one-to-one approach, and I cannot use it here because every item has a different number of parts and different file extensions.
Is there a method that can simply go through all the incorrect file names and replace the strings from A with the strings from B? Does anyone have an idea?
A loop solution with purrr::map2 at the end:
library(purrr)
#create files to rename
file.create("SB_19842.png")
file.create("SB_19842_head.png")
file.create("SB_19842_hand.png")
file.create("SB_19842_head.pdf")
file.create("SB_12154.png")
file.create("SB_12154_head.png")
file.create("SB_12154_hand.png")
file.create("SB_12154_head.pdf")
# a data frame with old and new patterns
file_names <- data.frame(
  old = c("19842", "12154"),
  new = c("new1", "new2")
)
# old filenames from the directory, specify path if needed
file_names_SB <- list.files(pattern = "SB_")
# function to substitute one type of code with another
sub_one_code <- function(old_code, new_code, file_names_original){
  gsub(paste0("SB_", old_code), paste0("SB_", new_code), file_names_original)
}
# loop to substitute all codes
new_file_names <- file_names_SB
for (row in 1:nrow(file_names)){
  new_file_names <- sub_one_code(file_names[row, "old"], file_names[row, "new"], new_file_names)
}
# rename all the files
map2(file_names_SB,
     new_file_names,
     file.rename)
#thelatemail provided a link with more elegant solutions for generating new file names.
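As a side note, the loop could also be collapsed into a single call, since stringr::str_replace_all accepts a named vector of pattern = replacement pairs. A sketch under that assumption (not part of the original answer):
library(stringr)
# names are the old patterns, values are the replacements
replacements <- setNames(paste0("SB_", file_names$new), paste0("SB_", file_names$old))
new_file_names <- str_replace_all(file_names_SB, replacements)
file.rename(file_names_SB, new_file_names)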

extracting main URL address

I have a list of URLs and I want to extract the main URL to see how many times each one has been used. As you can imagine, there are many URLs with different notations. I wrote the following code to extract the main URL:
library(stringr)
library(rebus)
# Step 2: creating a pattern for URL extraction
pat<- "//" %R% capture(one_or_more(char_class(WRD,DOT)))
#step 3: Creating a new variable from URL column of df
#(it should be atomic vector)
URL_var<-df[["URLs"]]
#step 4: using rebus to extract main URL
URL_extract<-str_match(URL_var,pattern = pat)
#step 5: changing large vector to dataframe and changing column name:
URL_data<-data.frame(URL_extract[,2])
names(URL_data)[names(URL_data) == "URL_extract...2."] <- "Main_URL"
The result of this code is acceptable in most cases. For example, for //www.google.com it returns www.google.com, and for a website like http://image.google.com/steve it returns image.google.com. However, there are many cases where this code can't recognize the pattern and fails to find the URL. For example, for a URL such as http://my-listing.ca/CommercialDrive.html the code returns my, which is definitely not acceptable, and for a website like http://www.real-data.ca/clients/ur/ it only returns www.real. It seems that handling - is difficult for my code.
Do you have any suggestions on how to improve this code? Or are there any packages that could help me extract URLs faster and better?
Thanks
I think you can simply use
library(stringr)
URL_var <- df[["URLs"]]
URL_data <- data.frame(Main_URL = str_extract(URL_var, "(?<=//)[^\\s/:]+"))
Here, the stringr::str_extract method searches for the first match in the input and returns the matched substring. Unlike stringr::str_match, it cannot return submatches, so a lookbehind, (?<=...), is used in the regex pattern:
(?<=//)[^\s/:]+
It means:
(?<=//) - matches a location in the string that is immediately preceded by the string //
[^\\s/:]+ - one or more (+) occurrences of any character except whitespace, / and :. The colon ensures a port number is not included in the match, / makes sure the match stops before the first /, and \s (whitespace) makes sure the match stops before the first whitespace.
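To illustrate, applying this pattern to the URLs mentioned in the question (a quick check, not from the original answer) returns the intended hosts, hyphens included:
library(stringr)
urls <- c("//www.google.com",
          "http://image.google.com/steve",
          "http://my-listing.ca/CommercialDrive.html",
          "http://www.real-data.ca/clients/ur/")
str_extract(urls, "(?<=//)[^\\s/:]+")
# [1] "www.google.com"   "image.google.com" "my-listing.ca"    "www.real-data.ca"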

R: locating files that their names contain a specific string from a directory and match to my list of wanted files

It's me, the newbie, again with another messy file and folder situation (thanks to us biologists): I have a directory containing a huge number of .txt files (~900,000+), and all the files were previously named in an inconsistent format :(
For example, messy files in directory look like these:
ctrl_S978765_uns_dummy_00_none.txt
ctrl_S978765_3S_Cookie_00_none.txt
S59607_3S_goody_3M_V10.txt
ctrlnuc30-100_S3245678_DMSO_00_none.txt
ctrlRAP_S0846567_3S_Dex_none.txt
S6498432_2S_Fulra_30mM_V100.txt
.....
As you can see, the naming has no reliable consistency. What's important for me is the ID code embedded in each name, such as S978765. I now have a list of 100 ID codes that I want.
The CSV file containing the list is shown below; note that the list does have repeated ID codes across rows due to different CLnumber values in the second column:
ID code CLnumber
S978765 1
S978765 2
S306223 1
S897458 1
S514486 2
....
So I want to achieve the following task: find all the messily named files by matching the ID codes against my list, and copy them into a new directory.
I thought of using list.files() to get all the .txt files and their names, but then I got stuck at the next step of matching the ID codes. I know how to do it with one string, say "S978765", but doing it one by one is almost like manually digging through the folder.
How can I feed the ID codes in column 1 as a list, compare/match them against the messy file names in the directory, and then copy the matching files into a new folder?
Many thanks,
ML
This works:
library(stringr)
# get this via list.files in your actual code
files <- c("ctrl_S978765_uns_dummy_00_none.txt",
           "ctrl_S978765_3S_Cookie_00_none.txt",
           "S59607_3S_goody_3M_V10.txt",
           "ctrlnuc30-100_S3245678_DMSO_00_none.txt",
           "ctrlRAP_S0846567_3S_Dex_none.txt",
           "S6498432_2S_Fulra_30mM_V100.txt")
ids <- data.frame(`ID Code` = c("S978765", "S978765", "S306223", "S897458", "S514486"),
                  CLnumber = c(1, 2, 1, 1, 2),
                  stringsAsFactors = FALSE)
str_subset(files, paste(ids$ID.Code, collapse = "|"))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"
str_subset takes a character vector and returns elements matching some pattern. In this case, the pattern is "S978765|S978765|S306223|S897458|S514486" (created by using paste), which is a regular expression that matches any of the ID codes separated by |. So we take files and keep only the elements that have a match in ID Code.
There are many other ways to do this, which may or may not be more clear. For example, you could pass ids$ID.Code directly to str_subset instead of constructing a regular expression via paste, but that would throw a warning about object lengths every time, which could get confusing (or cause problems if you get used to ignoring it and then ignore it in a different context where it matters). Another method would be to use purrr and keep, but while that might be a little bit more clear to write, it would be a lot more inefficient since it would mean making multiple passes over the files vector -- not relevant in this context, but possibly very relevant if you suddenly need to do this for hundreds of thousands of files and IDs.
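For illustration only, the purrr approach described above might look roughly like this (a sketch of the less efficient alternative, not part of the original answer):
library(purrr)
library(stringr)
# keep a file name if it matches at least one of the ID codes
keep(files, ~ any(str_detect(.x, ids$ID.Code)))
#> [1] "ctrl_S978765_uns_dummy_00_none.txt" "ctrl_S978765_3S_Cookie_00_none.txt"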
You could use regex to extract the ID codes from the file names.
Here, I have used the pattern "S" followed by 5 or more digits. Once we extract the ID codes, we can compare them with the ones we have in the CSV.
Assuming the CSV is read into df and the column name is ID_Codes, we can use %in% to filter them.
We can then use file.copy to copy the files from one folder to another.
all_files <- list.files(path = '/Path/To/Folder', full.names = TRUE)
selected_files <- all_files[sub('.*(S\\d{5,}).*', '\\1', basename(all_files)) %in%
                              unique(df$ID_Codes)]
file.copy(selected_files, 'new_path/for/files')
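One caveat worth adding (not part of the original answer): file.copy does not create the destination folder, so it may need to be created first:
# create the target folder if it does not exist yet, then copy
dir.create('new_path/for/files', recursive = TRUE, showWarnings = FALSE)
file.copy(selected_files, 'new_path/for/files')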

How can I extract a pattern (start and end) in a big string, using R?

I have a big string and I want to match/extract a pattern given a start and an end search pattern. How can this be done in R?
An example of the string:
big_string <- "read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)"
I want to extract the URL components in this instance. The pattern starts with http and ends with .csv (or, if possible, any extension).
http://company.com/students.csv
http://company.com/students_grades.csv
I have had no luck with many attempts using gregexpr to extract the pattern. Can someone help with a way to do this in R?
The stringr package works very well for this type of application:
library(stringr)
big_string <- 'read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)'
results <- unlist(str_extract_all(big_string, "http:.+csv"))
The search pattern matches a string starting with "http:", followed by at least one character, and ending with "csv".
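Printing results on the example string should return the two URLs asked for in the question:
results
# [1] "http://company.com/students.csv"        "http://company.com/students_grades.csv"
Note that .+ is greedy; if a single line could contain more than one URL, a lazy quantifier ("http:.+?csv") would keep the matches from running together.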

Extracting data from corrupted string

Hi, I have a data frame with a column containing email addresses. Unfortunately, something went wrong and several of the email IDs have number prefixes separated by underscores. These are the two patterns I have noticed.
Is there a way to extract the data after the underscore(s), processing from the left? Can some logic be built so that the script is smart enough to check whether there is one underscore or two? I can do this in Excel using the find() and right() functions, but I was wondering how to accomplish this in R.
For example:
product$email
83837_83838_abcd#gmail.com
83837_abcd#gmail.com
output
abcd#gmail.com
abcd#gmail.com
We can use sub
sub('.*_', '', str1)
#[1] "abcd#gmail.com" "abcd#gmail.com"
Or
library(stringr)
str_extract(str1, '[^_]+$')
data
str1 <- c('83837_83838_abcd#gmail.com', '83837_abcd#gmail.com')
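Applied to the question's data frame column (a sketch; product is the asker's data frame), either one-liner can be assigned straight back to a column:
# strip everything up to and including the last underscore
product$email <- sub('.*_', '', product$email)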
