using JSoup to scrape emails and links

I want to use JSoup to extract all of the email addresses and URLs from a website and store them in a HashSet (so there are no duplicates). I have made an attempt, but I am not sure what to pass to select() or whether my approach is right. Here is the code:
Document doc = Jsoup.connect(link).get();
Elements URLS = doc.select("");
Elements emails = doc.select("");
emailSet.add(emails.toString());
linksToVisit.add(URLS.toString());

You can do it like this:
Fetch the HTML document:
Document doc = Jsoup.connect(link).get();
Extract emails into a HashSet, using a regex to extract all the email addresses on the page:
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
emails.add(matcher.group());
}
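One caveat with the pattern above: its final character class allows dots and hyphens, so an address at the end of a sentence is captured with the trailing period attached. Below is a minimal sketch of one way to handle this using only java.util.regex. The class name EmailExtractor and the trimming rule are illustrative assumptions, not part of the original answer.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
final class EmailExtractor {
    // The answer's pattern, written with '@' as the separator and the
    // hyphen moved to the end of the last character class.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+");

    static Set<String> extractEmails(String text) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            String candidate = matcher.group();
            // Assumption: a trailing '.' or '-' is sentence punctuation,
            // not part of the address, so it is trimmed off.
            int end = candidate.length();
            while (end > 0 && (candidate.charAt(end - 1) == '.'
                    || candidate.charAt(end - 1) == '-')) {
                end--;
            }
            emails.add(candidate.substring(0, end));
        }
        return emails;
    }

    public static void main(String[] args) {
        System.out.println(extractEmails(
                "Write to bob@example.com. Or to alice@test.org, thanks."));
    }
}
```

With JSoup, the input would be doc.text() as in the answer. The set still removes duplicates, and trimming first means "bob@example.com" and "bob@example.com." are not stored as two entries.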
Extract links:
Set<String> links = new HashSet<String>();
Elements elements = doc.select("a[href]");
for (Element e : elements) {
links.add(e.attr("href"));
}
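The href values collected above are stored exactly as written in the page, so relative links such as /about or page.html cannot be visited directly. JSoup can return absolute URLs itself via e.attr("abs:href") or e.absUrl("href"). The sketch below shows the same idea using only java.net.URI, so it runs without JSoup. The class name LinkResolver, the method name, and the choice to keep only http and https links are assumptions made for illustration.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not from the original answer.
final class LinkResolver {
    static Set<String> resolveLinks(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue; // nothing to resolve
            }
            try {
                URI resolved = base.resolve(href.trim());
                String scheme = resolved.getScheme();
                // Assumption: only http(s) links are worth crawling;
                // mailto:, javascript: and similar schemes are skipped.
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException ex) {
                // href is not a syntactically valid URI (for example,
                // it contains spaces), so it is skipped
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(resolveLinks("https://example.com/docs/index.html",
                List.of("page.html", "/about", "https://other.org/x", "mailto:a@b.c")));
    }
}
```

Storing resolved URLs also helps deduplication, since /about and https://example.com/about then collapse into a single set entry.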
Complete and working code here: https://gist.github.com/JonasCz/a3b81def26ecc047ceb5
Now don't become a spammer!

This is my working solution. It searches for emails not only in the visible text but also in the page's HTML source:
public Set<String> getEmailsByUrl(String url) {
Document doc;
Set<String> emailSet = new HashSet<>();
try {
doc = Jsoup.connect(url)
.userAgent("Mozilla")
.get();
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.body().html());
while (matcher.find()) {
emailSet.add(matcher.group());
}
} catch (IOException e) {
e.printStackTrace();
}
return emailSet;
}

Related

Xamarin.Forms failing to use EvoHtmlToPdfclient in order to convert html string to a pdf file

I'm using Xamarin.Forms and trying to convert an HTML string to a PDF file with EvoPdfConverter. When execution reaches the line htmlToPdfConverter.ConvertHtmlToFile(htmlData, "", myDir.ToString()); in the snippet below, the app freezes and does nothing. It seems to be trying to connect to the given IP address and failing, yet I get no errors or exceptions, and the catch block is never reached. Does anybody know how to resolve this? Here is my code:
public void ConvertHtmlToPfd(string htmlData)
{
ServerSocket s = new ServerSocket(0);
HtmlToPdfConverter htmlToPdfConverter = new
HtmlToPdfConverter(GetLocalIPAddress(),(uint)s.LocalPort);
htmlToPdfConverter.TriggeringMode = TriggeringMode.Auto;
htmlToPdfConverter.PdfDocumentOptions.CompressCrossReference = true;
htmlToPdfConverter.PdfDocumentOptions.PdfCompressionLevel = PdfCompressionLevel.Best;
if (ContextCompat.CheckSelfPermission(Android.App.Application.Context, Manifest.Permission.WriteExternalStorage) != Permission.Granted)
{
ActivityCompat.RequestPermissions((Android.App.Activity)Android.App.Application.Context, new String[] { Manifest.Permission.WriteExternalStorage }, 1);
}
if (ContextCompat.CheckSelfPermission(Android.App.Application.Context, Manifest.Permission.ReadExternalStorage) != Permission.Granted)
{
ActivityCompat.RequestPermissions((Android.App.Activity)Android.App.Application.Context, new String[] { Manifest.Permission.ReadExternalStorage }, 1);
}
try
{
// create the HTML to PDF converter object
if (Android.OS.Environment.IsExternalStorageEmulated)
{
root = Android.OS.Environment.ExternalStorageDirectory.ToString();
}
htmlToPdfConverter.LicenseKey = "4W9+bn19bn5ue2B+bn1/YH98YHd3d3c=";
htmlToPdfConverter.PdfDocumentOptions.PdfPageSize = PdfPageSize.A4;
htmlToPdfConverter.PdfDocumentOptions.PdfPageOrientation = PdfPageOrientation.Portrait;
Java.IO.File myDir = new Java.IO.File(root + "/Reports");
try
{
myDir.Mkdir();
}
catch (Exception e)
{
string message = e.Message;
}
Java.IO.File file = new Java.IO.File(myDir, filename);
if (file.Exists()) file.Delete();
htmlToPdfConverter.ConvertHtmlToFile(htmlData, "", myDir.ToString());
}
catch (Exception ex)
{
string message = ex.Message;
}
}

Could you try passing a base URL as the second parameter of the ConvertHtmlToFile call? You passed an empty string. The base URL helps resolve relative URLs found in the HTML to full URLs. The converter might stall while trying to retrieve content from invalid resource URLs.

In Skype bot framework Attachments Content null

I'm trying to access the list of attachments sent by the user to the Skype bot that I'm developing.
Here is how I access the attachment details:
public async Task<HttpResponseMessage> Post([FromBody]Activity message)
{
if (message.Attachments != null)
{
if (message.Attachments.Count > 0)
{
List<Attachment> attachmentList = message.Attachments.ToList();
foreach (var item in attachmentList)
{
var name = item.Name;
var content = item.Content;
}
}
}
}
But I get null for the following, even though the attachment count is greater than zero:
var name = item.Name;
var content = item.Content;
Am I doing this right?

Maybe do something like this...
List<Attachment> attachmentList = message?.Attachments?.Where(x => x != null)?.ToList() ?? new List<Attachment>();
This should leave attachmentList as either an empty list or a list containing only non-null items.

Saving/Accessing Internal Storage, Sony Android TV

I'm trying to save an XML document programmatically to the internal storage of my Sony Android TV. Later on I will also need to access this file. Is this possible, and how should I approach it? Any suggestions or solutions?
Code:
public class xmlCreateFile {
Boolean finished = false;
String TAG = "xmlCreateFile";
public Boolean xmlCreate(){
try {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
// root elements
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("company");
doc.appendChild(rootElement);
// staff elements
Element staff = doc.createElement("Staff");
rootElement.appendChild(staff);
// set attribute to staff element
Attr attr = doc.createAttribute("id");
attr.setValue("1");
staff.setAttributeNode(attr);
// shorten way
// staff.setAttribute("id", "1");
// firstname elements
Element firstname = doc.createElement("firstname");
firstname.appendChild(doc.createTextNode("yong"));
staff.appendChild(firstname);
// lastname elements
Element lastname = doc.createElement("lastname");
lastname.appendChild(doc.createTextNode("mook kim"));
staff.appendChild(lastname);
// nickname elements
Element nickname = doc.createElement("nickname");
nickname.appendChild(doc.createTextNode("mkyong"));
staff.appendChild(nickname);
// salary elements
Element salary = doc.createElement("salary");
salary.appendChild(doc.createTextNode("100000"));
staff.appendChild(salary);
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
File path = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOCUMENTS);
StreamResult result = new StreamResult(path +"/file.xml");
Log.d(TAG,"Env: " + Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOCUMENTS));
//Output to console for testing
StreamResult result2 = new StreamResult(System.out);
// transformer.transform(source, result);
transformer.transform(source, result2);
finished = true;
} catch (ParserConfigurationException pce) {
pce.printStackTrace();
} catch (TransformerException tfe) {
tfe.printStackTrace();
}
return finished;
}
}

There are a number of ways to store data on a device. It seems like you only need this information to be visible to your app, so you can use the private Internal Storage APIs.
These APIs make it relatively easy to store and retrieve a file. Here's a short example.
// Save a file
String FILENAME = "textfile.txt";
String writeString = "hello world!";
FileOutputStream fos = getActivity().openFileOutput(FILENAME, Context.MODE_PRIVATE);
fos.write(writeString.getBytes());
fos.close();
// Read file
FileInputStream fis = getActivity().openFileInput(FILENAME);
StringBuilder builder = new StringBuilder();
int inputChar;
while((inputChar = fis.read()) != -1) {
builder.append((char) inputChar);
}
fis.close();
String readString = builder.toString();

How to get result of BoilerPipe extraction in HTML instead of plain text

I'm using the following code to extract the textual content of web pages. My app is hosted on Google App Engine and works exactly like the BoilerPipe Web API. The problem is that I can only get the result as plain text. I experimented with the library to find a workaround, but I couldn't find a method that returns the result as HTML. What I want is an option like the HTML (extract mode) in the original BoilerPipe Web API.
This is the code I'm using for extracting the plain text.
PrintWriter out = response.getWriter();
try {
String urlString = request.getParameter("url");
String listOUtput = request.getParameter("OutputType");
String listExtractor = request.getParameter("ExtractorType");
URL url = new URL(urlString);
switch (listExtractor) {
case "1":
String mainArticle = ArticleExtractor.INSTANCE.getText(url);
out.println(mainArticle);
break;
case "2":
String fullArticle = KeepEverythingExtractor.INSTANCE.getText(url);
out.println(fullArticle);
break;
}
} catch (BoilerpipeProcessingException e) {
out.println("Sorry We Couldn't Scrape the URL you Entered " + e.getLocalizedMessage());
} catch (IOException e) {
out.println("Exception thrown");
}
How can I include the feature for displaying the result in HTML form?

I am using the Boilerpipe source code, and I solved your problem with the following code:
String urlString = "your url";
URL url = new URL(urlString);
URI uri = new URI(urlString);
final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
final BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
hh.setOutputHighlightOnly(true);
TextDocument doc;
String text = "";
doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extractor.process(doc);
final InputSource is = htmlDoc.toInputSource();
text = hh.process(doc, is);
System.out.println(text);

get the number of youtube channel subscribers

I need to get the number of subscribers of a YouTube channel in my ASP.NET website. I searched the documentation (https://developers.google.com/youtube/2.0/developers_guide_protocol_activity_feeds) and found that I have to request this:
http://gdata.youtube.com/feeds/api/users/PageName
But it returns nothing, and it won't even open in a browser. Any help would be much appreciated.
Is there a special library or API I can use?

What you can do is the following:
System.Net.WebClient wb = new System.Net.WebClient();
string str = wb.DownloadString("http://gdata.youtube.com/feeds/api/users/TheNewYorkTimes");
Response.Write(getCount(str));

// Returns the value of the subscriberCount='...' attribute, or "" if it is absent.
public static string getCount(string feed)
{
    string marker = "subscriberCount='";
    int start = feed.IndexOf(marker);
    if (start < 0)
    {
        // Attribute not found; avoids calling Substring with a bogus offset.
        return "";
    }
    start += marker.Length;
    int end = feed.IndexOf("'", start);
    return end < 0 ? "" : feed.Substring(start, end - start);
}
