unable to fetch the Video source link from this website using JSoup?

I have a website from which I want to fetch a video link using Jsoup, but I'm unable to do so: my program throws an error. Can somebody please help me?
Here is the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class MovMaker {
public static void main(String[] args) {
try {
String url="http://www.tamilyogi.tv/7aum-arivu-2011-hd-720p-tamil-movie-watch-online/";
Document doc = Jsoup.connect(url).get();
Element vid = doc.getElementsByTag("video").get(0);
System.out.println("\nlink: " + vid.attr("src"));
System.out.println("text: " + vid.text());
} catch (IOException e) {
e.printStackTrace();
}
}
}
My Error:
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at MovMaker.main(MovMaker.java:16)
The page source where I want to fetch the data from is: Here
I'm completely new to Java and Jsoup; I would be thankful if someone could give me the code.
Regards,
Bhuvanesh

There is no <video> tag in the directly loaded html of the link you have given. The tag is instead created by some JavaScript in the browser. Since JSoup does not run any JavaScript you are out of luck here.
What you can do is either use something like
ui4j
selenium
HtmlUnit
or you analyze the contents of the HTML, and maybe the network traffic that happens in the browser when you load that site, to find out whether you can construct the link by hand from that information. In your case I had a quick look at the HTML and found that the video tag is generated within an iframe. In the source of the iframe you find this part:
<script type="text/javascript"> jwplayer("vplayer").setup({
sources: [{file:"http://cdn7.vidmad.tv/h7todtdxamlbu3tf6rutlihpzoz4di2fcsaje74hlrcqda7qibjmlb4vblxq/v.mp4",label:"720p"},{file:"http://cdn7.vidmad.tv/h7todtdxamlbu3tf6rutlihpzoz4di2fcsaje74hljcqda7qibjjd3opruyq/v.mp4",label:"360p","default": "true"},{file:"http://cdn7.vidmad.tv/h7todtdxamlbu3tf6rutlihpzoz4di2fcsaje74hlbcqda7qibjgcvfli2eq/v.mp4",label:"240p"}],
image: "http://cdn7.vidmad.tv/i/01/00000/cjwf05thn2vm.jpg",
duration:"9607",
width: "100%",
height: "350",
aspectratio: "16:9",
preload: "none",
androidhls: "true",
startparam: "start"
,tracks: []
,skin: "glow",abouttext:"VidMAD", aboutlink:"http://vidmad.tv"
});
...
</script>
So the URL is part of a <script> tag. You can use regular expressions to get it:
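// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// (in addition to the org.jsoup imports from the question)
Document doc = Jsoup.connect("http://www.tamilyogi.tv/7aum-arivu-2011-hd-720p-tamil-movie-watch-online/")
    .userAgent("Mozilla/5.0")
    .get();
// The player lives in an embedded iframe, so load that document next.
Element iframeEl = doc.select("iframe[src*=embed]").first();
if (iframeEl != null) {
    // absUrl resolves the src against the page URL in case it is relative.
    Document frameDoc = Jsoup.connect(iframeEl.absUrl("src"))
        .userAgent("Mozilla/5.0")
        .get();
    // Captures the first "file" URL of the jwplayer sources array.
    Pattern p = Pattern.compile("sources:\\s*\\[\\{file:\"([^\"]+)");
    Elements scriptEls = frameDoc.select("script");
    for (Element scriptEl : scriptEls) {
        Matcher m = p.matcher(scriptEl.html());
        if (m.find()) {
            String link = m.group(1);
            System.out.println(link);
            break;
        }
    }
}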
Of course my solution above only works for this site and link. You may need to adapt the approach to fit your needs, but the general idea should be clear now.

Document doc = Jsoup.connect("http://www.tamilyogi.tv/7aum-arivu-2011-hd-720p-tamil-movie-watch-online/")
.userAgent("Mozilla/5.0")
.get();
Element iframeEl = doc.select("iframe").first();
System.out.println(iframeEl.absUrl("src"));
Hope this works for you.

Related

Exporting file from gdrive as html and convert into wordpress post

I am working on a WordPress plugin that converts text files and MS Word documents into WordPress posts.
The flow is really simple: you open a dialog box, select files from your PC, and import them.
Now I am trying to integrate the Google Drive Picker so users can also create posts from their documents stored in Google Drive.
I have made some progress by reading the Google Drive Picker documentation,
and I found a really good working example of it too.
So I customized my callback function for the picker, which is:
function pickerCallback(data) {
if (data.action === google.picker.Action.PICKED) {
// document.getElementById('content').innerText = JSON.stringify(data, null, 2);
var docs = data[google.picker.Response.DOCUMENTS];
// var googleSelectedFiles = [];
var allFiles = [];
var singleFile = {};
docs.forEach(function (file) {
gapi.load('client', function () {
gapi.client.load('drive', 'v2', function () {
gapi.client.request({
'path': '/drive/v3/files/' + file.id + '/export?mimeType=text%2Fhtml&key=' + myAjax.google.apikey,
'method': 'GET',
callback: function (responsejs, responsetxt) {
singleFile.id = file.id;
singleFile.name = file.name + ".html";
singleFile.content = JSON.parse(responsetxt).gapiRequest.data.body;
allFiles.push(singleFile);
singleFile = {};
}
});
});
});
});
setTimeout(function () {
gDriveHandleFileProcess(allFiles);
}, 4000);
}
}
Now the problem is that I get each document converted to full HTML, from <html> to the closing </html>, which includes the head and style tags, and of course there are images too.
I could put all of that into the post's post_content when saving it to the database, but I know that is a really bad way to do it. I searched for a solution to this problem but found nothing helpful.
Is there a good way to do this, or another approach I could take, such as exporting in another format and saving that? I also tried the plain-text format, but that is not suitable because the formatting must be preserved.
If anyone can guide me or share an idea I can pursue, that would be really great.
Thanks in advance.

jsdom does not fetch scripts on local file system

This is how I construct it:
var fs = require("fs");
var jsdom = require("jsdom");
var htmlSource = fs.readFileSync("./test.html", "utf8");
var doc = jsdom.jsdom(htmlSource, {
features: {
FetchExternalResources : ['script'],
ProcessExternalResources : ['script'],
MutationEvents : '2.0'
},
parsingMode: "auto",
created: function (error, window) {
console.log(window.b); // always undefined
}
});
jsdom.jQueryify(doc.defaultView, 'https://code.jquery.com/jquery-2.1.3.min.js', function() {
console.log( doc.defaultView.b ); // undefined with local jquery in html
});
the html:
<!DOCTYPE HTML>
<html>
<head></head>
<body>
<script src="./js/lib/vendor/jquery.js"></script>
<!-- <script src="http://code.jquery.com/jquery.js"></script> -->
<script type="text/javascript">
var a = $("body"); // script crashes here
var b = "b";
</script>
</body>
</html>
As soon as I replace the jQuery path in the HTML with an http source, it works. The local path is correct relative to the working directory of the shell and of the Node script. To be honest, I don't even know why I need jQueryify, but without it the window never has jQuery, and even with it, the HTML document still needs the http source.

You're not telling jsdom where the base of your website lies. It has no idea how to resolve the relative path you give it (it tries to resolve from the default about:blank, which just doesn't work). This is also the reason why it works with an absolute (http) URL: it doesn't need to know where to resolve from.
You'll need to provide the url option in your initialization to give it the base URL (which should look like file:///path/to/your/file).
jQueryify just inserts a script tag with the path you give it; once the reference in the HTML works, you don't need it.
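The resolution problem can be reproduced without jsdom. The following is a minimal sketch using only Node's built-in url and path modules; it assumes the HTML file is ./test.html as in the question, and it only illustrates why a file:// base is needed. The resulting baseUrl is the kind of value one would pass as the url option; the jsdom call itself is not shown.

```javascript
const path = require('path');
const { pathToFileURL } = require('url');

// Absolute file:// URL of the HTML document ('./test.html' is the name
// used in the question).
const baseUrl = pathToFileURL(path.resolve('./test.html')).href;

// With a file:// base, the relative script path has something to resolve against.
const scriptUrl = new URL('./js/lib/vendor/jquery.js', baseUrl).href;
console.log(baseUrl);
console.log(scriptUrl);

// With an about:blank base, the same relative path cannot be resolved at all.
let resolvedAgainstBlank = true;
try {
  new URL('./js/lib/vendor/jquery.js', 'about:blank');
} catch (err) {
  resolvedAgainstBlank = false;
}
console.log(resolvedAgainstBlank);
```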

I found out. I'll mark Sebmaster's answer as accepted because it solved one of the two problems. The other cause was that I didn't properly wait for the load event, so the code after the external scripts hadn't been parsed yet.
What I needed to do was add a load listener to doc.defaultView after the jsdom() call.
The reason it worked when using jQueryify was simply that it created enough of a delay for the embedded script to load.
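A minimal sketch of the load-listener idea, assuming the window object supports the standard addEventListener interface. The helper name onWindowLoaded is hypothetical, and a plain EventTarget stands in for doc.defaultView so the snippet runs in a recent Node version without jsdom installed.

```javascript
// Runs `callback` once the given window-like object fires its "load" event.
// `onWindowLoaded` is a hypothetical helper name, not part of jsdom.
function onWindowLoaded(win, callback) {
  win.addEventListener('load', function handleLoad() {
    win.removeEventListener('load', handleLoad);
    callback(win);
  });
}

// Stand-in for doc.defaultView so the sketch runs without jsdom.
const fakeWindow = new EventTarget();
let sawLoad = false;
onWindowLoaded(fakeWindow, function () {
  sawLoad = true; // with jsdom, this is where window.b would be read
});

// jsdom would fire this itself once the external scripts have run.
fakeWindow.dispatchEvent(new Event('load'));
console.log(sawLoad);
```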

I had the same issue when passing a relative path of the jQuery library to the jQueryify function, and I solved it by providing the full path instead.
const jsdom = require('node-jsdom')
// Absolute path to the locally installed jQuery build
const jqueryPath = __dirname + '/node_modules/jquery/dist/jquery.js'
const window = jsdom.jsdom().parentWindow
jsdom.jQueryify(window, jqueryPath, function() {
window.$('body').append('<div class="testing">Hello World, It works</div>')
console.log(window.$('.testing').text())
})

Pop up window not appending text

I am trying to implement a 'Trace Window' pop-up that opens when I enter a website, and then send messages to that window from throughout the site in order to diagnose some of the more awkward issues I have with it.
The problem is that when the page changes and the trace window already exists, all of its content is removed before the new trace text is added.
What I want is a window that can be sent messages from any page inside the website.
I have a JavaScript file, debugger.js, which I include as a script in every screen (shown below). I then call the sendToTracewindow() function to send a message to the window from anywhere in the website. This is currently mostly done from VBScript, due to the issues I am investigating.
I think it happens because I include debugger.js in every screen, which sets the traceWindow variable to null (see code below), but I do not know how to get around this!
Any help much appreciated.
Andrew
code examples:
debugger.js:
var traceWindow = null
function opentraceWindow()
{
traceWindow = window.open('traceWindow.asp','traceWindow','width=400,height=800')
}
function sendToTracewindow(sCaller, pMessage)
{
try
{
if (!traceWindow)
{
opentraceWindow()
}
if (!traceWindow.closed)
{
var currentTrace = traceWindow.document.getElementById('trace').value
var newTrace = sCaller + ":" + pMessage + "\n" + currentTrace
traceWindow.document.getElementById('trace').value = newTrace
}
}
catch(e)
{
var currentTrace = traceWindow.document.getElementById('trace').value
var newTrace = "error tracing:" + e.message + "\n" + currentTrace
traceWindow.document.getElementById('trace').value = newTrace
}
}
traceWindow.asp - just a textarea with id='trace':
<HTML>
<head>
<title>Debug Window</title>
</head>
<body>
<textarea id="trace" rows="50" cols="50"></textarea>
</body>
</HTML>

I don't think there is any way around the fact that your traceWindow variable will be reset on every page load, therefore rendering your handle to the existing window invalid. However, if you don't mind leveraging LocalStorage and some jQuery, I believe you can achieve the functionality you are looking for.
Change your trace window to this:
<html>
<head>
<title>Debug Window</title>
<script type="text/javascript" src="YOUR_PATH_TO/jQuery.js"></script>
<script type="text/javascript" src="YOUR_PATH_TO/jStorage.js"></script>
<script type="text/javascript" src="YOUR_PATH_TO/jquery.json-2.2.js"></script>
<script type="text/javascript">
var traceOutput;
var traceLines = [];
var localStorageKey = "traceStorage";
$(function() {
// document.ready.
// Assign the trace textarea to the global handle.
traceOutput = $("#trace");
// load any cached trace lines from local storage
if($.jStorage.get(localStorageKey, null) != null) {
// fill the lines array
traceLines = $.jStorage.get(localStorageKey);
// populate the textarea
traceOutput.val(traceLines.join("\n"));
}
});
function AddToTrace(data) {
// append the new trace data to the textarea with a line break.
traceOutput.val(traceOutput.val() + "\n" + data);
// add the data to the lines array
traceLines[traceLines.length] = data;
// save to local storage
$.jStorage.set(localStorageKey, traceLines);
}
function ClearTrace() {
// empty the textarea
traceOutput.val("");
// clear local storage
$.jStorage.deleteKey(localStorageKey);
}
</script>
</head>
<body>
<textarea id="trace" rows="50" cols="50"></textarea>
</body>
</html>
Then, in your pages where you want to trace data, you could modify your javascript like so:
var traceWindow = null;
function opentraceWindow() {
traceWindow = window.open('traceWindow.asp','traceWindow','width=400,height=800');
}
function sendToTracewindow(sCaller, pMessage) {
traceWindow.AddToTrace(sCaller + ":" + pMessage);
}
Every time a new page is loaded and the trace window is refreshed, the existing trace data is loaded from local storage and displayed in your textarea. This should achieve the functionality that you are looking for.
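The persistence idea can be sketched without jQuery or jStorage. The snippet below swaps in the browser's native Web Storage interface (getItem, setItem, removeItem) in place of jStorage; createTraceLog and createMemoryStorage are hypothetical helper names, and the in-memory storage only exists so the sketch runs outside a browser. In a page, window.localStorage would be passed instead.

```javascript
// Minimal in-memory stand-in for window.localStorage.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
    removeItem: (key) => { data.delete(key); },
  };
}

// Hypothetical helper: keeps all trace lines in storage under one key.
function createTraceLog(storage, key) {
  const read = () => JSON.parse(storage.getItem(key) || '[]');
  return {
    add(line) {
      const lines = read();
      lines.push(line);
      storage.setItem(key, JSON.stringify(lines));
    },
    lines: read,
    clear() { storage.removeItem(key); },
  };
}

const storage = createMemoryStorage();
createTraceLog(storage, 'traceStorage').add('page1: started');

// A second log object models the trace window after it has been reloaded.
const reloaded = createTraceLog(storage, 'traceStorage');
reloaded.add('page2: clicked');
console.log(reloaded.lines().join('\n'));
```

Because every read goes back to storage, lines written before a reload are still there afterwards, which is the property the trace window needs.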
Please be kind on any errors - I'm winging this on a Monday morning!
Finding the jQuery library should be trivial. You can find the jStorage library here: http://www.jstorage.info/, and you can find jquery-json here: http://code.google.com/p/jquery-json/

parsing html and following a javascript link

I have been asked by an academic colleague to extract information from a website. I need to link the content of a table on a web page (not too hard) with the contents of a text file that is only reachable, as far as I can tell, by clicking on a JavaScript link, e.g.
<a id="tk1" href="javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')">
The table is conveniently inside a table with id='tk1' which is nice... but how do I follow the link which pulls the text file.
Ideally I'd like to do this in R... I can grab the relevant table in text format by saying
u <- the url of interest...
library(XML)
tables = readHTMLTable(u)
interestingTable <- tables[grep('tk1', names(tables))]
And this will give the text in the table, but how do I grab the html for that particular table? and how do I "click" on the button and get the text file behind it?
I note that there is a form with massive hidden values - the site appears to be asp.net driven and uses impenetrable URLs.
Many thanks!

This is somewhat tricky, and not fully integrated in R, but some system()-fiddling will get you started.
Download and install PhantomJS: http://code.google.com/p/phantomjs/
Check the short script on http://menne-biomed.de/uni/JavaButton.html, which emulates your case. When you click the JavaScript anchor, it redirects to http://cran.at.r-project.org/ via doPostBack(inaccessibleJavascriptVar).
Save the following script locally as javabutton.js:
var page = new WebPage();
page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
if (status !== 'success') {
console.log('Unable to access network');
} else {
var ua = page.evaluate(function () {
var t = document.getElementById('tk1').href;
var re = /\((.*)\)/; // capture the expression between the parentheses
return eval(re.exec(t)[1]);
});
console.log(ua);// Outputs http://cran.at.r-project.org/
}
phantom.exit();
});
With phantomjs on path, call
phantomjs javabutton.js
The link will be displayed on the console. Use any method to get it into RCurl.
Not elegant, but maybe someone will wrap PhantomJS into R one day. In case the link to JavaButton.html is lost, here it is as code.
<!DOCTYPE html>
<html>
<head>
<script>
inaccesibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
function doPostBack(myref)
{
window.location.href= myref;
return false;
}
</script>
</head>
<body>
<a id="tk1" href="javascript:doPostBack(inaccesibleJavascriptVar)" >Click here</a>
</body>
</html>
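For the link in the question itself, the regular-expression step can be sketched on its own. This assumes the site follows the usual ASP.NET Web Forms pattern, in which __doPostBack(eventTarget, eventArgument) copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and submits the form. The helper name parsePostBack is hypothetical. The extracted values would then be posted together with the page's other hidden fields by whatever HTTP client is used; that request is not shown here.

```javascript
// Pulls the two string arguments out of an ASP.NET-style postback link.
// Returns null when the href does not have that shape.
function parsePostBack(href) {
  const match = /__doPostBack\('([^']*)','([^']*)'\)/.exec(href);
  if (!match) {
    return null;
  }
  return { eventTarget: match[1], eventArgument: match[2] };
}

// The href from the question.
const href = "javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')";
const parsed = parsePostBack(href);
console.log(parsed.eventTarget);
console.log(JSON.stringify(parsed.eventArgument));
```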

Have a look at the RCurl package:
http://www.omegahat.org/RCurl/

Get Image dimensions using Javascript during file upload

I have a file upload UI element in which the user uploads images. I have to validate the height and width of the image on the client side. Is it possible to find the size of the image in JS with only the file path?
Note: If not, is there any other way to find the dimensions on the client side?

You can do this on browsers that support the new File API from the W3C, using the readAsDataURL function on the FileReader interface and assigning the data URL to the src of an img (after which you can read the height and width of the image). Currently Firefox 3.6 supports the File API, and I think Chrome and Safari either already do or are about to.
So your logic during the transitional phase would be something like this:
Detect whether the browser supports the File API (which is easy: if (typeof window.FileReader === 'function')).
If it does, great, read the data locally and insert it in an image to find the dimensions.
If not, upload the file to the server (probably submitting the form from an iframe to avoid leaving the page), and then poll the server asking how big the image is (or just asking for the uploaded image, if you prefer).
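The branching described above can be sketched as a small function. The strategy names are hypothetical labels for the two branches, and the plain objects stand in for a browser's window so the snippet runs anywhere; in a page one would pass window itself.

```javascript
// Decides how to measure the image. `win` is the browser's window object.
// The returned strings are hypothetical labels for the two branches.
function chooseDimensionStrategy(win) {
  return typeof win.FileReader === 'function'
    ? 'read-locally'
    : 'upload-and-ask-server';
}

// Stand-ins for a browser with and without the File API.
const modernWindow = { FileReader: function FileReader() {} };
const legacyWindow = {};
console.log(chooseDimensionStrategy(modernWindow));
console.log(chooseDimensionStrategy(legacyWindow));
```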
Edit: I've been meaning to work up an example of the File API for some time; here's one:
<!DOCTYPE HTML>
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
<title>Show Image Dimensions Locally</title>
<style type='text/css'>
body {
font-family: sans-serif;
}
</style>
<script type='text/javascript'>
function loadImage() {
var input, file, fr, img;
if (typeof window.FileReader !== 'function') {
write("The file API isn't supported on this browser yet.");
return;
}
input = document.getElementById('imgfile');
if (!input) {
write("Um, couldn't find the imgfile element.");
}
else if (!input.files) {
write("This browser doesn't seem to support the `files` property of file inputs.");
}
else if (!input.files[0]) {
write("Please select a file before clicking 'Load'");
}
else {
file = input.files[0];
fr = new FileReader();
fr.onload = createImage;
fr.readAsDataURL(file);
}
function createImage() {
img = document.createElement('img');
img.onload = imageLoaded;
img.style.display = 'none'; // If you don't want it showing
img.src = fr.result;
document.body.appendChild(img);
}
function imageLoaded() {
write(img.width + "x" + img.height);
// This next bit removes the image, which is obviously optional -- perhaps you want
// to do something with it!
img.parentNode.removeChild(img);
img = undefined;
}
function write(msg) {
var p = document.createElement('p');
p.innerHTML = msg;
document.body.appendChild(p);
}
}
</script>
</head>
<body>
<form action='#' onsubmit="return false;">
<input type='file' id='imgfile'>
<input type='button' id='btnLoad' value='Load' onclick='loadImage();'>
</form>
</body>
</html>
Works great on Firefox 3.6. I avoided using any library there, so apologies for the attribute (DOM0) style event handlers and such.

The previous example is okay, but it is far from perfect.
var reader = new FileReader();
reader.onload = function(e)
{
var image = new Image();
image.onload = function()
{
console.log(this.width, this.height);
};
image.src = e.target.result;
};
// Run this inside the file input's change handler, where `this` is the <input type="file">
reader.readAsDataURL(this.files[0]);
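Once image.onload has reported the width and height, the validation the question asks for is a plain comparison. A minimal sketch follows; the function name, the limits object, and its field names are all hypothetical, and the bounds are example values only.

```javascript
// Checks measured dimensions against upper bounds and returns a list of
// problems (empty when the image is acceptable).
function validateDimensions(width, height, limits) {
  const errors = [];
  if (width > limits.maxWidth) {
    errors.push('width ' + width + ' exceeds ' + limits.maxWidth);
  }
  if (height > limits.maxHeight) {
    errors.push('height ' + height + ' exceeds ' + limits.maxHeight);
  }
  return errors;
}

// Example bounds; choose whatever the upload form requires.
const exampleLimits = { maxWidth: 1024, maxHeight: 768 };
console.log(validateDimensions(800, 600, exampleLimits));
console.log(validateDimensions(2000, 600, exampleLimits));
```

Inside the onload handler this would be called as validateDimensions(this.width, this.height, exampleLimits).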

If you use a Flash-based uploader such as SWFUpload, you can have all the info you want as well as multiple queued uploads.
I recommend SWFUpload and am in no way associated with them other than as a user.
You could also write a Silverlight control to pick your file and upload it.

No, you can't. The filename and file content are sent to the server in the HTTP header/body; JavaScript cannot manipulate those fields.

HTML5 is definitely the correct solution here.
You should always code for the future, not the past.
The best way to deal with HTML4 browsers is to either fall back on degraded functionality or use Flash (but only if the browser does not support the HTML5 File API).
Using the img.onload event will enable you to recover the dimensions of the file.
It's working for an app I'm working on.
