I am trying to retrieve all the lyrics of a band from the web.
I have noticed that they build URLs using ".../firstletter/bandname/songname.html"
Here is an example.
http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html
I was thinking about creating a function that would call read.csv on each URL.
Getting the titles is fairly easy: I can copy and paste them and save them as a .csv. Then I can pass each value of that vector to the function in order to construct the URL.
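A minimal sketch of that URL-building step. The helper name make_lyrics_url and the slug rule (lowercase, drop everything except letters and digits) are assumptions inferred from the single example URL above, not a documented scheme of the site:

```r
# Hypothetical helper: build azlyrics-style URLs from a band name and song titles.
# Assumption: the site lowercases names and strips every non-alphanumeric character.
make_lyrics_url <- function(band, title) {
  slug <- function(x) gsub("[^a-z0-9]", "", tolower(x))
  sprintf("http://www.azlyrics.com/lyrics/%s/%s.html", slug(band), slug(title))
}

titles <- c("It's a Long Way to the Top (If You Wanna Rock 'n' Roll)",
            "Highway to Hell")
urls <- make_lyrics_url("AC/DC", titles)
print(urls)
```

sprintf() is vectorised, so one call produces one URL per title.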
But when I read the first page just to see what it looks like, I found that there would be too much "cleaning the data" if my goal is to build a CSV file with each lyric.
x <-read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))
I think my approach is not the best (or maybe I need a better data cleaning strategy)
The HTML page has a tell on where the lyrics begin:
Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.
Taking advantage of that, you can detect this string, and then read everything up to the end of the div:
m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")
giveaway <- "Sorry about that."
# You can use the full line instead if you think one of the lyrics might contain this sentence.
start <- grep(giveaway, m)[1] + 1 # Line where the lyric starts
# Take the first </div> after the start of the lyric. grep() returns a position
# within the slice m[start:length(m)], so add start - 1 to turn it into an index into m.
end <- grep("</div>", m[start:length(m)])[1] + start - 1
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
# This is just an example of how to clear the remaining tags and join the text.
And then:
> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
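The same idea can be wrapped in a function and tried offline. This is only a sketch: extract_lyrics is a hypothetical name, and the six-line page vector is a made-up stand-in for what readLines() would return, so the real page layout may differ:

```r
# Hypothetical helper: pull the lyric lines out of a page read with readLines().
extract_lyrics <- function(m, giveaway = "Sorry about that.") {
  start <- grep(giveaway, m, fixed = TRUE)[1] + 1
  # grep() positions are relative to the slice, hence the "- 1".
  end_line <- start + grep("</div>", m[start:length(m)], fixed = TRUE)[1] - 1
  body <- gsub("<br>|</div>", "", m[start:end_line])
  # Join the lines and drop the trailing newline left by the </div> line.
  sub("\n+$", "", paste(body, collapse = "\n"))
}

# Made-up stand-in for the downloaded page (assumed layout).
page <- c(
  "<div>",
  "<!-- Usage of azlyrics.com content ... Sorry about that. -->",
  "Ridin' down the highway<br>",
  "Goin' to a show<br>",
  "</div>",
  "<div class=\"footer\">not lyrics</div>"
)
lyrics <- extract_lyrics(page)
cat(lyrics, "\n")
```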
Assuming that "cleaning the data" means parsing through HTML tags, I recommend using a DOM scraping library that extracts only the lyrics text from the page, and then saving those lyrics to CSV, a database, or wherever you like. That way you don't have to do any data cleaning. I don't know what programming language you're using, but a simple Google search will turn up plenty of DOM querying and parsing libraries for any language.
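Here is an example in PHP, using Simple HTML DOM:
http://simplehtmldom.sourceforge.net/manual.htm
include 'simple_html_dom.php'; // provides file_get_html()
$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');
// Take the element right after the second div.ringtone (index 1); it should hold the lyrics
$lyrics = $html->find('div.ringtone',1)->next_sibling();
print($lyrics->innertext);
Now you have the lyrics and can save them (code not tested).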
If you're using R, try the rvest library. It lets you query the DOM and extract the lyrics easily.
https://github.com/hadley/rvest
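A rough rvest sketch of the same idea, run on a made-up HTML string so it works offline. The page structure (the lyrics sitting in the first div after div.ringtone) is an assumption carried over from the PHP example, not something verified against the live site:

```r
library(rvest)

# Made-up page fragment; the real markup may differ.
html <- '<html><body>
<div class="ringtone">ringtone ad</div>
<div>Ridin\' down the highway<br>Goin\' to a show</div>
</body></html>'

page <- read_html(html)
# Assumed location of the lyrics: the first div following div.ringtone.
node <- html_element(page, xpath = "//div[@class='ringtone']/following-sibling::div[1]")
# html_text2() turns <br> tags into line breaks.
lyrics <- html_text2(node)
cat(lyrics, "\n")
```

For a real page you would pass the URL to read_html() and adjust the XPath after inspecting the source.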
Related
I have embedded data that I imported into Qualtrics using a web service block. The data comes from a .json file and the fields are named something like 0.male, 1.male, 2.male, etc.
I have been trying to read this into my survey using the Qualtrics.SurveyEngine.getEmbeddedData method, but without luck.
I'm trying to do something that takes this form:
let n = 2
Qualtrics.SurveyEngine.getEmbeddedData(n + ".male")
but this keeps returning null. Is it possible to read embedded data whose name starts with a number?
Also see:
https://community.qualtrics.com/XMcommunity/discussion/15991/read-in-embedded-variables-using-a-loop#latest
The issue isn't the number, it is the dot. getEmbeddedData() doesn't work when the name contains a dot. See https://stackoverflow.com/a/51802695/4434072 for possible alternatives.
I have data in the following form:
<Status>Active Leave Terminated</Status>
<date>05/06/2014 09/10/2014 01/10/2015</date>
I want to get the data in the following form:
<status>Active</status>
<date>05/06/2014</date>
<status>Leave</status>
<date>09/10/2014</date>
<status>Terminated</status>
<date>01/10/2015</date>
Please help me with a query that retrieves the data as specified above.
Well, you have a string and want to split it at the whitespace. That's what tokenize() is for, and \s matches a whitespace character. To get the corresponding date, you can get the current position in the for loop using at. Together it looks something like this (note that I assume that the input data is the current context item):
let $dates := tokenize(date, "\s+")
for $status at $pos in tokenize(Status, "\s+")
return (
  <status>{$status}</status>,
  <date>{$dates[$pos]}</date>
)
You did not indicate whether your data is on the file system or already loaded into MarkLogic. It's also not clear if this is something you need to do once on a small set of data or on an on-going basis with a lot of data.
If it's on the file system, you can transform it as it is being loaded. For instance, MarkLogic Content Pump can apply a transformation during load.
If you have already loaded the content and you want to transform it in place, you can use Corb2.
If you have a small amount of data, then you can just loop across it using Query Console.
Regardless of how you apply the transformation code, dirkk's answer shows how you need to change it. If you are updating content already in your database, you'll call xdmp:node-delete() on the original Status and date elements and xdmp:node-insert-child() to add the new ones.
I am using the RODBC library to bring data into R. I have a long query that I want to pass a variable to, much like this SO user.
Problem is that R interprets the whitespace/carriage returns in my query as a newline '\n'.
The accepted solution for this question suggests to simply break up the text into chunks and then paste() together - which works, but ideally I'd like to keep the whitespace intact - makes it easier to test/verify the behavior of the query over in the database before pasting into R.
In other languages I'm familiar with there's a simple line continuation character - indeed, several of the comments on the accepted answer are looking for an approach similar to python's \.
I found an aside about a workaround using strwrap deep in the bowels of an R discussion list, so in the interest of making the internet better I will post it here. However, if someone can point me toward a more elegant or straightforward solution, I will happily accept your answer.
I don't know if you will find this helpful or not, but I have eventually gravitated towards keeping my SQL separate from my R scripts. Keeping the query in my R script, except for very very short ones, I find gets unreadable very quickly.
These days, I tend to keep queries that are more than a single line in their own separate .sql file. Then I can keep them nice and formatted and readable in a nice text editor, and read them into R as needed via something like this:
read_sql <- function(path) {
  stopifnot(file.exists(path))
  # Read the whole file into a single string
  sql <- readChar(path, nchars = file.info(path)$size)
  sql
}
For binding parameters into the queries, I just keep a %s where the parameter will go in the .sql file, and then add in the parameters in R using sprintf.
I've been much happier this way, as I was finding that cluttering up my R scripts with really long paste statements and multi-line character objects was making my code really hard to read.
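A small end-to-end sketch of that workflow, using a temporary file in place of a real .sql file. The query text and the term value "Fall 2012" are made up for illustration:

```r
# Same helper as above: read a whole .sql file into one string.
read_sql <- function(path) {
  stopifnot(file.exists(path))
  readChar(path, nchars = file.info(path)$size)
}

# Stand-in for a hand-formatted .sql file with a %s placeholder.
sql_file <- tempfile(fileext = ".sql")
writeLines(c(
  "SELECT map.ps_studentid,",
  "       map.testritscore",
  "  FROM map",
  " WHERE map.termname = '%s'"
), sql_file)

# Bind the parameter with sprintf(); "Fall 2012" is a hypothetical value.
query <- sprintf(read_sql(sql_file), "Fall 2012")
cat(query)
```

Two caveats with this approach: a literal % in the SQL (for example in a LIKE pattern) has to be written as %% so sprintf() leaves it alone, and sprintf() does no escaping, so it is only safe for trusted parameter values.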
R's strwrap will destroy whitespace, including newline characters, per the documentation.
Essentially, you can get the desired behavior by initially letting R introduce line breaks/newline \ns, and then immediately stripping them out.
#make query using PASTE
query_1 <- paste("SELECT map.ps_studentid
,students.first_name || ' ' || students.last_name AS full_name
,map.testritscore
,map.termname
,map.measurementscale
FROM map$comprehensive_with_growth map
JOIN students
ON map.ps_studentid = students.id
WHERE map.termname = '",map_term,"'", sep='')
#remove newline characters introduced above.
#width is an arbitrary big number-
#it just needs to be longer than your string.
query_1 <- strwrap(query_1, width=10000, simplify=TRUE)
#execute the query
map_njask <- sqlQuery(XE, query_1)
Try using sprintf to get variable substitution, and then collapse all newlines and runs of whitespace into single spaces.
query <- gsub(pattern = "\\s+", replacement = " ", x = query)
See my answer to a similar question for details.
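The sprintf-then-collapse idea might look like the sketch below. The query is a shortened, made-up version of the one in the question. Each run of whitespace is replaced by a single space, because replacing it with an empty string would glue the SQL keywords together:

```r
map_term <- "Fall 2012"  # hypothetical parameter value

# Keep the query formatted across several lines, with %s as the placeholder.
query <- sprintf("SELECT map.ps_studentid
      ,map.testritscore
  FROM map
 WHERE map.termname = '%s'", map_term)

# Collapse newlines and indentation into single spaces.
query <- gsub("\\s+", " ", query)
print(query)
```

Note that this also collapses runs of whitespace inside quoted SQL literals, which matters if a string value contains consecutive spaces.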
I have just started with Sphinx4 speech recognition and have it working with the Sphinx4 HelloWorld demo.
What I want now is to create a dynamic dictionary from text given as input.
Right now I have to create a text file and upload it to lmtool, which then gives me a .dict file.
But my requirement is this: the user types any text into a text box and clicks a convert button, and the text is automatically converted into WSJ-dictionary-style pronunciations. For example, if the user types the word they want recognized, say "ANKIT", it is automatically converted into the WSJ dictionary format "AE NG K AH T". Can anyone help me, or offer any suggestions? I hope I have explained this clearly enough.
If anyone provides this as a paid service, that is fine too.
Here is a simple website that lets you create a .dic file: http://www.speech.cs.cmu.edu/tools/lmtool-new.html . Just upload a .txt file with what you would like to add to the dictionary. If you have more than a few thousand sentences, you can do this multiple times and then merge the results into one file.
I'm all new to scraping and I'm trying to understand XPath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r<-htmlTreeParse(e) ## e is after getURL
g.k<-(r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l<-g.k[names(g.k)=="text"]
u<-ldply(l,function(x) {
w<-xmlValue(x)
return(w)
})
However, this is cumbersome and I'd prefer to use XPath. How do I go about referencing the path detailed above? Is there a function for this, or can I somehow submit my path as referenced above?
So far I've come up with:
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me a lot of cleaning up to do and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. The XML document is unfortunately too large to paste. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source. I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want. However, it is still XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page: I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subset in xpath) which leaves a final line that is not a name. One could do the text processing in XML, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
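For comparison, the sub() variant keeps names without a comma, since sub() leaves non-matching strings unchanged. Here is a self-contained sketch on a made-up fragment. The names, the page structure and the //p[2] index are invented for illustration and are not taken from the real page:

```r
library(XML)

# Made-up stand-in for the page: names separated by <br/> inside one paragraph.
html <- "<html><body><p>Intro</p><p>Anna Hansen, Oslo<br/>Per Olsen<br/>Kari Berg, Bergen</p></body></html>"
doc <- htmlParse(html, asText = TRUE)

# One text node per name; the <br/> elements are skipped by text().
raw <- xpathSApply(doc, "//p[2]/text()", xmlValue)

# Drop everything from the first comma on; entries without a comma are kept as-is.
names_only <- trimws(sub(",.*$", "", raw))
print(names_only)
```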
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable which contains the page's source tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
# Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or more prettily with stringr.
library(stringr)
str_split_fixed(all_names, ", ", 2)