Remove part of a URL string in R [duplicate]

This question already has answers here:
Extracting numbers from vectors of strings
(12 answers)
Closed 5 years ago.
I have a bunch of URLs of the form:
http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt
I want to be left with the movie code (i.e., 0383574). I tried this:
url = "http://www.imdb.com/search/title?genres=action&title_type=feature&sort=moviemeter,asc"
page = read_html(url)
movie.nodes <- html_nodes(page,'.lister-item-header a')
movie.nodes
movie.link <- sapply(html_attrs(movie.nodes),`[[`,'href')
movie.link <- paste0("http://www.imdb.com",movie.link)
movie.id1 <- gsub("http://www.imdb.com/title/tt", "", movie.link)
movie.id <- gsub("/?ref_=adv_li_tt", "", movie.id1)
But calling movie.id returns:
[1] "0451279/?" "2345759/?" "1790809/?" "1469304/?" "0974015/?" "3896198/?"
[7] "3371366/?" "3890160/?" "3315342/?" "4425200/?" "2250912/?" "2406566/?"
[13] "1972591/?" "1825683/?" "2091256/?" "3501632/?" "4630562/?" "1386697/?"
[19] "4154756/?" "4116284/?" "2975590/?" "5884234/?" "5013056/?" "1211837/?"
[25] "0120616/?" "2527336/?" "1082807/?" "0325980/?" "1293847/?" "2034800/?"
[31] "2015381/?" "2911666/?" "1648190/?" "4912910/?" "1298650/?" "1477834/?"
[37] "2334871/?" "3748528/?" "2239822/?" "3469046/?" "2461150/?" "3731562/?"
[43] "1431045/?" "0449088/?" "3385516/?" "2226597/?" "0468569/?" "1219827/?"
[49] "0383574/?" "3498820/?"
How do I get rid of the /? from the output? Thanks in advance.

Considering the movie id as the only part with digits, you can remove any other characters that are not digits, leaving you with the ids as follows:
> gsub("[^[:digit:]]", "", movie.link)
[1] "0451279" "2345759" "1790809" "1469304" "0974015" "3896198" "3371366" "3890160" "3315342" "4425200"
[11] "2250912" "2406566" "1972591" "1825683" "2091256" "3501632" "4630562" "1386697" "4154756" "4116284"
[21] "2975590" "5884234" "5013056" "1211837" "0120616" "2527336" "1082807" "0325980" "1293847" "2034800"
[31] "2015381" "2911666" "1648190" "4912910" "1298650" "1477834" "2334871" "3748528" "2239822" "3469046"
[41] "2461150" "3731562" "1431045" "0449088" "3385516" "2226597" "0468569" "1219827" "0383574" "3498820"

Just found how to do it:
movie.id <- gsub("\\D", "", movie.link)
This works because \\D matches any character that is not a digit, so gsub removes everything except the numbers.
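Stripping every non-digit relies on the id being the only digits in the URL. As a rough sketch of a more targeted alternative (the second URL below is made up for illustration, and the pattern assumes the id always follows /title/tt), a capture group keeps only the digits in that position:

```r
# Hypothetical sample links; the second one has an extra digit in the query string
movie.link <- c("http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt",
                "http://www.imdb.com/title/tt0451279/?ref_=adv_li_2")

# Removing all non-digits also keeps the stray "2" from the query string
strip_all <- gsub("\\D", "", movie.link)

# Capturing the digits right after "/title/tt" ignores digits elsewhere
targeted <- sub(".*/title/tt(\\d+).*", "\\1", movie.link)

print(strip_all)  # "0383574" "04512792"
print(targeted)   # "0383574" "0451279"
```

For the URLs in the question both approaches give the same ids; the difference only shows up when other digits appear in the link.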

gsub accepts a regular expression pattern as its first parameter. In regular expressions, ? is a special character, which signifies that the preceding character may occur zero or one time.
So you are currently searching for a ref_=adv_li_tt that either is or isn't immediately preceded by a /.
You need to escape the ? to indicate that you are searching for a literal question mark. Inside an R string the backslash itself has to be doubled:
gsub("/\\?ref_=adv_li_tt", "", movie.id1)
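To make the difference concrete, here is a small sketch on two sample links in the question's format (the variable names unescaped, escaped and literal are just for illustration). It compares the original pattern, the escaped pattern, and fixed = TRUE, which tells gsub to treat the pattern as plain text rather than a regular expression:

```r
movie.link <- c("http://www.imdb.com/title/tt0383574/?ref_=adv_li_tt",
                "http://www.imdb.com/title/tt0451279/?ref_=adv_li_tt")
movie.id1 <- gsub("http://www.imdb.com/title/tt", "", movie.link, fixed = TRUE)

# Unescaped: "/?" means "an optional slash", so only "ref_=adv_li_tt" is removed
unescaped <- gsub("/?ref_=adv_li_tt", "", movie.id1)

# Escaped: "\\?" matches a literal question mark
escaped <- gsub("/\\?ref_=adv_li_tt", "", movie.id1)

# fixed = TRUE: no regular expression at all, the pattern is matched literally
literal <- gsub("/?ref_=adv_li_tt", "", movie.id1, fixed = TRUE)

print(unescaped)  # "0383574/?" "0451279/?"
print(escaped)    # "0383574" "0451279"
print(literal)    # "0383574" "0451279"
```

When the text to remove is a constant string, fixed = TRUE is often the simpler choice because nothing needs escaping.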

Related

Substitution of strings results in incorrect names

I'd like to change several strings in a vector. In my case, the all.images object contains:
# Original character's list
all.images <-c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
"S2B2A_20181028_124_IndianaIIPR0065820181024_BOA_10.tif",
"S2B2A_20170715_124_SantaMariaCalcasPR0033420170731_BOA_10.tif",
"S2B2A_20180928_124_NSraAparecidaBortolettoPR0042720180912_BOA_10.tif",
"S2A2A_20170610_124_LagoaAmarelaPR0022020170619_BOA_10.tif",
"S2A2A_20160705_124_AguaSumidaPR001320160629_BOA_10.tif",
"S2A2A_20181023_124_SaoPedroGabrielGarciaPR001720181031_BOA_10.tif",
"S2B2A_20180908_124_NSraAparecidaBortolettoPR001920180911_BOA_10.tif",
"S2A2A_20180824_124_NSraAparecidaBortolettoPR0043320180911_BOA_10.tif",
"S2A2A_20170720_124_VoAnaPR001520170802_BOA_10.tif",
"S2B2A_20180322_124_SaoMateusPR0021920180314_BOA_10.tif",
"S2A2A_20181212_124_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180413_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif",
"S2B2A_20170913_124_PerdizesPR0034920170905_BOA_10.tif",
"S2A2A_20170610_124_TresMeninasPR001820170601_BOA_10.tif",
"S2B2A_20180428_081_SantaFeSebastiaoFogacaPR0021020180501_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0022320180427_BOA_10.tif",
"S2A2A_20170809_124_VoAnaPR001620170803_BOA_10.tif",
"S2B2A_20180819_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20181214_081_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180423_081_SantaFeSebastiaoFogacaPR0033920180427_BOA_10.tif",
"S2A2A_20180814_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
"S2A2A_20160615_124_AguaSumidaPR0011220160627_BOA_10.tif",
"S2A2A_20170720_124_SantaMariaCalcasPR0022820170726_BOA_10.tif",
"S2A2A_20180913_124_SantaMariaCalcasPR001620180829_BOA_10.tif",
"S2B2A_20170804_124_NSraAparecidaBortolettoPR0035720170811_BOA_10.tif",
"S2A2A_20170809_124_SantaFeBaracatPR001920170801_BOA_10.tif",
"S2B2A_20180322_124_NSradeFatimaGlebaAPR001320180403_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif")
#
My idea is to: 1) remove S2B2A_ and _BOA_10.tif; 2) convert the 8 digits after S2B2A_ into a date (e.g. 2017-09-05); 3) move the three digits that follow the date (e.g. 124 or 081) to the end;
and 4) split the remaining characters at the capital-letter code and the date (e.g. AguaSumidaPR0011220160627 to AguaSumida-PR00112-2016-06-27).
But when I try to do:
sub("^\\w+_(\\d+)_(\\d+)_([A-Za-z]+)([A-Z]{2}\\d{3})(\\d)(\\d{4})(\\d{2})(\\d+)_.*",
"\\3_\\4_\\5_\\6-\\7-\\8_\\1_\\2", all.images)
[1] "IndianaII_PR009_1_1120-17-0922_20171003_124"
[2] "IndianaII_PR006_5_8201-81-024_20181028_124"
...
[28] "SantaFeBaracat_PR001_9_2017-08-01_20170809_124"
[29] "NSradeFatimaGlebaA_PR001_3_2018-04-03_20180322_124"
[30] "SantaFeSebastiaoFogaca_PR002_1_9201-80-427_20180508_081"
I get incorrect dates (e.g. in [30] 9201-80-427_20180508_081), and my desired output is:
[1] "IndianaII_PR009111_2017-09-22_2017-10-03_124"
[2] "IndianaII_PR00658_2018-10-24_2018-10-28_124"
...
[28] "SantaFeBaracat_PR0019_2017-08-01_2017-08-09_124"
[29] "NSradeFatimaGlebaA_PR0013_2018-04-03_2018-03-22_124"
[30] "SantaFeSebastiaoFogaca_PR00219_2018-04-27_2018-05-08_081"
Can anyone help with this?

I think this handles those exceptions in the comments on your answer using a lookahead (which needs perl = TRUE):
sub("^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_([A-Za-z]+)([A-Z]{2}\\w+)(?=\\d{8})+(\\d{4})(\\d{2})(\\d+)_.*",
"\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", all.images, perl = TRUE)
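As a rough illustration of why the lookahead helps, here is a sketch on two of the file names from the question. The pattern is a lightly simplified variant of the one above (the quantifier after the lookahead is dropped), and the names imgs, pattern and renamed are only for this example. Because \\w+ is greedy, it backtracks until exactly eight digits remain in front of the underscore, so the PR code can have any length:

```r
imgs <- c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
          "S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif")

pattern <- paste0(
  "^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_",  # first date (groups 1-3) and 3-digit code (4)
  "([A-Za-z]+)([A-Z]{2}\\w+)",                # name (5), then two capitals plus the code (6)
  "(?=\\d{8})(\\d{4})(\\d{2})(\\d+)_.*"       # lookahead: 8 digits must follow; second date (7-9)
)

# perl = TRUE is required because the default regex engine has no lookahead
renamed <- sub(pattern, "\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", imgs, perl = TRUE)
print(renamed)
# "IndianaII_PR009111_2017-09-22_2017-10-03_124"
# "VoAna_PR0015A_2017-08-03_2017-07-15_124"
```

The fixed-width ([A-Z]{2}\\d{3}) in the question is what broke the dates: any PR code that is not exactly three digits shifts the digits that follow.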

Element wise concatenation of nested list [duplicate]

This question already has answers here:
Paste multiple columns together
(11 answers)
Closed 4 years ago.
I have a nested list
l1 <- letters
l2 <- 1:26
l3 <- LETTERS
list <- list(l1,l2,l3)
Is there an elegant way to concatenate the elements of the inner vectors position by position to form one character vector (possibly using paste)? Assume that all the inner vectors have the same length.
I would like my final result to be
[1] "a1A"
[2] "b2B"
[3] "c3C"
[4] "d4D"
....
[26] "z26Z"

Try:
apply(sapply(list,paste0),1,paste0,collapse="")
[1] "a1A" "b2B" "c3C" "d4D" "e5E" "f6F" "g7G" "h8H" "i9I" "j10J" "k11K" "l12L" "m13M" "n14N" "o15O"
[16] "p16P" "q17Q" "r18R" "s19S" "t20T" "u21U" "v22V" "w23W" "x24X" "y25Y" "z26Z"
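Another option, not taken from the answer above, is do.call: paste0 is already vectorised over its arguments, so passing the inner vectors as separate arguments gives the element-wise result directly. A minimal sketch (the list is named ll here to avoid masking the list function):

```r
l1 <- letters
l2 <- 1:26
l3 <- LETTERS
ll <- list(l1, l2, l3)

# Equivalent to paste0(l1, l2, l3), but works for any number of inner vectors
res <- do.call(paste0, ll)
print(head(res, 4))  # "a1A" "b2B" "c3C" "d4D"
```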

user20650's solution is probably as elegant as you are going to get. But for what it's worth, here's a quick hack in dplyr:
library(dplyr)
ll <- list(l1,l2,l3) # I try not to use "list" as a name. Gets confusing sometimes.
as.data.frame(ll) %>%
mutate(x = paste0(.[[1]], .[[2]], .[[3]])) %>%
.$x
# returns
[1] "a1A" "b2B" "c3C" "d4D" "e5E" "f6F" "g7G" "h8H" "i9I" "j10J" "k11K" "l12L"
[13] "m13M" "n14N" "o15O" "p16P" "q17Q" "r18R" "s19S" "t20T" "u21U" "v22V" "w23W" "x24X"
[25] "y25Y" "z26Z"

str_split on first and second occurrence of delimiter at different locations in character vector

I have a character list that has weather variables followed by "mean_#" where # is a number between 5 and 10. I want to subset the list to only have the weather variable names themselves. The mean weather variables look like this:
> mean_vars
[1] "dew_mean_10" "dew_mean_5" "dew_mean_6" "dew_mean_7"
[5] "dew_mean_8" "dew_mean_9" "humid_mean_10" "humid_mean_5"
[9] "humid_mean_6" "humid_mean_7" "humid_mean_8" "humid_mean_9"
[13] "rain_mean_10" "rain_mean_5" "rain_mean_6" "rain_mean_7"
[17] "rain_mean_8" "rain_mean_9" "soil_moist_mean_10" "soil_moist_mean_5"
[21] "soil_moist_mean_6" "soil_moist_mean_7" "soil_moist_mean_8" "soil_moist_mean_9"
[25] "soil_temp_mean_10" "soil_temp_mean_5" "soil_temp_mean_6" "soil_temp_mean_7"
[29] "soil_temp_mean_8" "soil_temp_mean_9" "solar_mean_10" "solar_mean_5"
[33] "solar_mean_6" "solar_mean_7" "solar_mean_8" "solar_mean_9"
[37] "temp_mean_10" "temp_mean_5" "temp_mean_6" "temp_mean_7"
[41] "temp_mean_8" "temp_mean_9" "wind_dir_mean_10" "wind_dir_mean_5"
[45] "wind_dir_mean_6" "wind_dir_mean_7" "wind_dir_mean_8" "wind_dir_mean_9"
[49] "wind_gust_mean_10" "wind_gust_mean_5" "wind_gust_mean_6" "wind_gust_mean_7"
[53] "wind_gust_mean_8" "wind_gust_mean_9" "wind_spd_mean_10" "wind_spd_mean_5"
[57] "wind_spd_mean_6" "wind_spd_mean_7" "wind_spd_mean_8" "wind_spd_mean_9"
And this is all I want at the end:
> var_names
"dew" "humid" "rain" "solar" "temp" "soil_moist" "soil_temp" "wind_dir" "wind_gust" "wind_spd"
Now I figured out how to do it, but I feel my method is convoluted because of my limited ability with regular expressions. I will also have to repeat the process 20 times, substituting "mean" with other words.
var_names <- unique(str_split_fixed(mean_vars, "_", n = 3)[c(1:18,31:42),1])
var_names <- unlist(c(var_names, unique(unite(as_tibble(str_split_fixed(mean_vars, "_", n = 3)[c(19:30,43:60), 1:2])))))
I've been trying to stay within the realm of the tidyverse packages as much as possible so I was using stringr::str_split_fixed.
If you have a solution using this same function that would be ideal as I could continue the same programming style, but I'm open to all suggestions.
Thanks.

Use sub and unique. This is shorter and has no package dependencies (or use unique(str_replace(mean_vars, "_mean.*", "")) with stringr):
unique(sub("_mean.*", "", mean_vars))
giving:
[1] "dew" "humid" "rain" "soil_moist" "soil_temp"
[6] "solar" "temp" "wind_dir" "wind_gust" "wind_spd"
If for some reason you really want to use str_split then:
rmMean <- function(x) paste(head(x, -2), collapse = "_")
unique(sapply(str_split(mean_vars, "_"), rmMean))
Note: the input used above is:
mean_vars <- c("dew_mean_10", "dew_mean_5", "dew_mean_6", "dew_mean_7", "dew_mean_8",
"dew_mean_9", "humid_mean_10", "humid_mean_5", "humid_mean_6",
"humid_mean_7", "humid_mean_8", "humid_mean_9", "rain_mean_10",
"rain_mean_5", "rain_mean_6", "rain_mean_7", "rain_mean_8", "rain_mean_9",
"soil_moist_mean_10", "soil_moist_mean_5", "soil_moist_mean_6",
"soil_moist_mean_7", "soil_moist_mean_8", "soil_moist_mean_9",
"soil_temp_mean_10", "soil_temp_mean_5", "soil_temp_mean_6",
"soil_temp_mean_7", "soil_temp_mean_8", "soil_temp_mean_9", "solar_mean_10",
"solar_mean_5", "solar_mean_6", "solar_mean_7", "solar_mean_8",
"solar_mean_9", "temp_mean_10", "temp_mean_5", "temp_mean_6",
"temp_mean_7", "temp_mean_8", "temp_mean_9", "wind_dir_mean_10",
"wind_dir_mean_5", "wind_dir_mean_6", "wind_dir_mean_7", "wind_dir_mean_8",
"wind_dir_mean_9", "wind_gust_mean_10", "wind_gust_mean_5", "wind_gust_mean_6",
"wind_gust_mean_7", "wind_gust_mean_8", "wind_gust_mean_9", "wind_spd_mean_10",
"wind_spd_mean_5", "wind_spd_mean_6", "wind_spd_mean_7", "wind_spd_mean_8",
"wind_spd_mean_9")
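Since the question mentions repeating this for about 20 other words in place of "mean", the sub approach can be wrapped in a small helper. This is only a sketch: strip_suffix is a made-up name, the sample vector (including the "max" entry) is invented for illustration, and the pattern assumes every name ends in _<word>_<number>:

```r
# Hypothetical helper: drop "_<stat>_<number>" from the names that have it
strip_suffix <- function(x, stat) {
  pattern <- paste0("_", stat, "_[0-9]+$")
  matched <- x[grepl(pattern, x)]        # keep only names for this statistic
  unique(sub(pattern, "", matched))
}

vars <- c("dew_mean_10", "soil_moist_mean_5", "soil_moist_mean_6", "wind_dir_max_7")

mean_names <- strip_suffix(vars, "mean")
max_names  <- strip_suffix(vars, "max")
print(mean_names)  # "dew" "soil_moist"
print(max_names)   # "wind_dir"
```

Anchoring the pattern with $ means underscores inside the variable name itself (soil_moist, wind_dir) are left alone.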

R: Extract words from a website

I am attempting to extract all words that start with a particular phrase from a website. The website I am using is:
http://docs.ggplot2.org/current/
I want to extract all the words that start with "stat_". I should get 21 names like "stat_identity" in return. I have the following code:
stats <- readLines("http://docs.ggplot2.org/current/")
head(stats)
grep("stat_{1[a-z]", stats, value=TRUE)
I am returned every line containing the phrase "stat_". I just want to extract the "stat_" words. So I tried something else:
gsub("\b^stat_[a-z]+ ", "", stats)
I think the output I got was an empty string, " ", where a "stat_" phrase would be? So now I'm trying to think of ways to extract all the text and set everything that is not a "stat_" phrase to empty strings. Does anyone have any ideas on how to get my desired output?

rvest & stringr to the rescue:
library(xml2)
library(rvest)
library(stringr)
pg <- read_html("http://docs.ggplot2.org/current/")
unique(str_match_all(html_text(html_nodes(pg, "body")),
"(stat_[[:alnum:]_]+)")[[1]][,2])
## [1] "stat_bin" "stat_bin2dCount"
## [3] "stat_bindot" "stat_binhexBin"
## [5] "stat_boxplot" "stat_contour"
## [7] "stat_density" "stat_density2d"
## [9] "stat_ecdf" "stat_functionSuperimpose"
## [11] "stat_identity" "stat_qqCalculation"
## [13] "stat_quantile" "stat_smooth"
## [15] "stat_spokeConvert" "stat_sum"
## [17] "stat_summarySummarise" "stat_summary_hexApply"
## [19] "stat_summary2dApply" "stat_uniqueRemove"
## [21] "stat_ydensity" "stat_defaults"
Unless you need the links (then you can use other rvest functions), this removes all the markup for you and just gives you the text of the website.
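If you would rather stay in base R, the general idea is to extract the matches instead of filtering lines (grep) or deleting text (gsub). A rough sketch with regmatches and gregexpr follows; the three input lines are invented stand-ins for what readLines might return, not the real page:

```r
# Made-up lines standing in for the downloaded HTML
page_lines <- c("<a href=\"stat_bin.html\">stat_bin</a>",
                "geom_point and stat_identity",
                "no match here")

# gregexpr finds every match per line; regmatches pulls out the matched text
hits <- unique(unlist(regmatches(page_lines,
                                 gregexpr("stat_[a-z0-9_]+", page_lines))))
print(hits)  # "stat_bin" "stat_identity"
```

Restricting the character class to lower-case letters, digits and underscores stops the match at the first capital letter or punctuation mark.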

R: How to remove quotation marks in a vector of strings, but maintain vector format as to call each individual value?

I want to create a vector of names that act as variable names so I can then use them later on in a loop.
years <- 1950:2012
varname <- character(length(years))  # pre-allocate so varname[i] can be assigned
for (i in seq_along(years))
{
varname[i] <- paste("mydata", years[i], sep = "")
}
this gives:
> [1] "mydata1950" "mydata1951" "mydata1952" "mydata1953" "mydata1954" "mydata1955" "mydata1956" "mydata1957" "mydata1958"
[10] "mydata1959" "mydata1960" "mydata1961" "mydata1962" "mydata1963" "mydata1964" "mydata1965" "mydata1966" "mydata1967"
[19] "mydata1968" "mydata1969" "mydata1970" "mydata1971" "mydata1972" "mydata1973" "mydata1974" "mydata1975" "mydata1976"
[28] "mydata1977" "mydata1978" "mydata1979" "mydata1980" "mydata1981" "mydata1982" "mydata1983" "mydata1984" "mydata1985"
[37] "mydata1986" "mydata1987" "mydata1988" "mydata1989" "mydata1990" "mydata1991" "mydata1992" "mydata1993" "mydata1994"
[46] "mydata1995" "mydata1996" "mydata1997" "mydata1998" "mydata1999" "mydata2000" "mydata2001" "mydata2002" "mydata2003"
[55] "mydata2004" "mydata2005" "mydata2006" "mydata2007" "mydata2008" "mydata2009" "mydata2010" "mydata2011" "mydata2012"
All I want to do is remove the quotes and be able to call each value individually.
I want:
>[1] mydata1950 mydata1951 mydata1952 mydata1953, #etc...
stored as a variable such that
varname[1]
> mydata1950
varname[2]
> mydata1951
and so on.
I have played around with
cat(varname[i],"\n")
but this just prints values as one line and I can't call each individual string. And
gsub("'",'',varname)
but this doesn't seem to do anything.
Suggestions? Is this possible in R? Thank you.

There are no quotes in that character vector's values. Use:
cat(varname)
.... if you want to see the unquoted values. The R print mechanism is set to use quotes as a signal to your brain that distinct values are present. You can also use:
print(varname, quote=FALSE)
If there are that many named objects in your workspace, then you desperately need to learn to use lists. There are mechanisms for "promoting" character values to names, but this would be seen as a failure on your part to learn to use the language effectively:
> var <- 2
> eval(as.name('var'))
[1] 2
> eval(parse(text="var"))
[1] 2
> get('var')
[1] 2
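For the list route, here is a minimal sketch of what that could look like. The one-column data frames are placeholders invented for the example; the point is only that the generated names become list names, so each element can be looked up by its string:

```r
years <- 1950:2012
varname <- paste0("mydata", years)   # paste0 is vectorised, no loop needed

# One named list instead of 63 separate objects (placeholder contents)
mydata <- setNames(vector("list", length(years)), varname)
for (i in seq_along(years)) {
  mydata[[varname[i]]] <- data.frame(year = years[i])
}

# Elements are reached by name or by position
first_year <- mydata[["mydata1950"]]$year
print(first_year)        # 1950
print(names(mydata)[2])  # "mydata1951"
```

Looping over names(mydata) or using lapply(mydata, ...) then replaces any need for get or eval.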
