Scraping product information for comic book website

Scraping product information for comic book website - wordpress

I'm making a comic book website built upon a WordPress platform for an old friend for his business. I would like to be able to have a script that goes to various publisher sites and pulls in the data. I'm new to programming and I've read of many different alternatives and just don't know where to begin. Firstly, would this be legal to pull this content from these websites? Secondly, here's an example for what I would like to do.
Page displays what's coming out for the month. Copy all links from
that page within the appropriate div that leads to the comic book
details. Save each hyperlink as $comiclink or whatever. The script will
execute each hyperlink at a time.
Go to the hyperlink for $comiclink and scrape content out of the page based
upon what's in certain DIV's on that page. Example:
Copy & save comic title within a defined div into $title
Copy & save previous and future title hyperlinks within a defined div into $othertitles
Note: $othertitles will loop off and start the same process itself from 1.
Save & download all images within a defined div to $images
Copy & save all content within a defined div to $content. $content is then broken down
and pulled apart based upon the content that is within it. Example:
In stores: $date
format: $format
UPC: $upc
Price: $price
The Story: $story
Copy & save defined div hyperlink and save into $seriesinfo
Copy & save defined div $relatedinfo and then break it down.
images within $relatedinfo to $relatedimages
content within $relatedinfo to $relatedcontent
links within $relatedinfo to $relatedlink. $relatedlink will loop off and restart this process itself from 1.
Now that everything is broken apart and saved into it's own little pieces. I want WordPress to automatically create a post and then start assigning all this info into the post. Working something like this.
Check for existing post with same $title if does not exist place $title in title for post and page-name. If it exists abort script and move on to the next.
Remove numbers and alpha characters from $title and check for existence of category if it does not exist; create it and assign to post. If it exists assigns category to the post.
Check for existing category with value $format if exists assign to post, if not create & assign category to post.
upload images that were downloaded from $image into this post.
Check for images that contain the word "cover" and assign as featured image.
Also how this whole thing executes also. I don't want this running 24/7 - just once a week I would like this to execute by itself and automatically go to the websites in question and scrape the content and create the pages.
I'm not asking you guys to write out the whole darn thing for me; though I definitely won't object to it! Just help point me in the right directions to get this going. Over the past day I've read probably 30+ articles on pulling content and there's so many options from what I can tell that I just don't know where to begin or how to get the ball moving in the right direction with this.
Updated Code
Notes: So I've managed to successfully copy the content and paths for each block and instead of downloading the images just echoing them from their present location. Next up is actually automating the process to create a post in wordpress to dump the data into.
function scraping_comic()
{
// create HTML DOM
$html = file_get_html('http://page-on-site-to-scrape.com');
// get block to scrape
foreach($html->find('li.browse_result') as $article)
{
// get title from block
$item['title'] = trim($article->find('h4', 0)->find('span',0)->plaintext);
// get title url from block
$item['title_url'] = trim($article->find('h4', 0)->find('a.grid-hidden',0)->href);
// get image from block
$item['image_url'] = trim($article->find('img.main_thumb',0)->src);
// get details from block
$item['details'] = trim($article->find('p.browse_result_description_release', 0)->plaintext);
// get sale info from block
$item['on_sale'] = trim($article->find('.browse_comics_release_dates', 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}
// ===== The Code ====
$ret = scraping_comic();
if ( ! empty($ret))
{
// place main url for instance when hyperlinks and image srcs don't use the full path.
$scrape = 'http://site-to-scrape.com';
foreach($ret as $v)
{
echo '<p>'.$v['title'].'</p>';
echo '<p><img src="'.$v['image_url'].'"></p>';
echo '<p>'.$v['details'].'</p>';
echo '<p> '.$v['on_sale'].'</p>';
}
}
else { echo 'Could not scrape page!'; }
?>

Typically, no this wouldn't be legal. Companies that share their data these days will implement an API you can call and use in your application (subject to their Terms of Use and Copyright Policy). They don't like you making automated requests that bog down their server and kill their bandwidth.
That being said, often times product information is available from other sources such as Amazon which does have an API.
This project you are describing has a lot of work to be done essentially customizing the WordPress CMS and would be less than trivial for someone without any programming experience. You might want to consider hiring a freelancer at oDesk or one of the many other freelance job boards.

Related

Wordpress Loop get_the_id()

I tried the following functions in header.php, footer.php, single.php etc.
var_dump(in_the_loop());
var_dump(get_the_id());
The first function gives false (meaning we are not in the loop) and the second function gives the post id every single time.
Description of get_the_id() from wordpress :
Retrieve the numeric ID of the current post. This tag must be within The Loop.
I just want a simple explanation what the hell is going on why do i get the post id if I call the function out of the loop !?

must is a little strong for get_the_id() ...delivers evil eye to Wordpress.
It works in the header and non-loop (confirmed).
Please note that post/page are essentially interchangeable in this conversation.
Think of WP this way -> You always have a post id in some way, all the time, every page, unless you do weird stuff or talk about non-page edge cases. When you are at the install root (such as site.com/) there are posts being called, something has to be displayed. There are other settings that will impact post/page such as static front page settings. On a category listing, if there are pages, I got the first ID returned before the loop.
On post/pages the page ID is (more or less0 set before the loop. This is a result of the URL (pretty or ?p=123 format) dictating the content. Using pretty names, the page at site.com/foo-bar/ will try to look up if there is content available via the permalink rules for "foo-bar". If there is content, the post ID is obtained. (simplified)
Later in the page build you get into the loop. However, before the loop you are also offered opportunities to change, sort, or augment the loop - such as changing the page IDs to be looped or sorting.
Regarding in_the_loop(), WP says
"True if caller is within loop, false if loop hasn't started or has ended." via http://codex.wordpress.org/Function_Reference/in_the_loop
in_the_loop() evaluates if the loop is in action (loop being key to the WP world). Also important - when you are in the loop WP can iterate over multiple page/post (IDs).
I don't have a 100% bulletproof explanation as to how the ID always shows, but when you dig into the API and various methods for hooking this might be a result.
I understand your confusion and agree with you. I think WP intended get_the_id() as a loop based tool, outside the loop you will get unpredictable results.
Hope that helps, I do enjoy working in WP, and I hope you do to.

How should I download data from api.mapmyfitness.com?

I want to display on my Wordpress website using MapMyFitness's API. It needs to be in a friendly format for my readers so I am thinking it can be converted into posts. Should I write a script that would directly pull this data from their servers (during each page refresh) or should I create and schedule a cron job for it? I do like the idea of creating a cron job for it (since my site will need to keep track of ratings and comments for each route), but I am not sure where to start for it.
Here's a URL I would be using to grab data from:
http://api.mapmyfitness.com/3.1/routes/get_route?&o=xml&r=&route_key=&route_id=15219484&created_date=&old_json=&loc=
There API does support JSON, so the &o= parameter can be set to "&o=Json".
Here's a link to the API I am using: api.mapmyfitness.com/3.1/
And here's a link to the Routes method which is being used in the link above. It shows how you can manipulate the data: api.mapmyfitness.com/3.1/routes/get_route?doc
Please let me know if I should provide anymore details about what I am trying to accomplish.

there's a few ways you could go about this..
but to keep it simple you could just use a XML parser to read the xml returned from the API,
this xml file, you can do what you want with it, adding pages/posts for each route?
are you using, google maps to display the route points?
heres a little code block that you could use in your wordpress sidebar.
open sidebar.php
pop this little code snippet in there.
$url = "http://api.mapmyfitness.com/3.1/routes/get_route?&o=xml&r=&route_key=&route_id=15219484&created_date=&old_json=&loc=";
$xml= simplexml_load_file($url);
foreach($xml->output as $output){
echo "User: ".$output->user->username."<br/>";
echo "RouteName: ".$output->route_name;
echo "<ul>";
//loop through each point
foreach($output->points as $point){
echo "<li>Lat: ".$point->point->lat." Long: ".$point->point->lng."</li>";
}
echo "</ul>";
}
this should produce a list
User: mm75 RouteName: All PANTHER TRACE - Riverview FL Mar 31, 2010 1:14 PM
Lat: 27.81178 Long: -82.330725
Lat: 27.81178 Long: -82.330725
etc..etc..etc..
hopefully this puts you on track, :)
Marty

In Drupal 7, how do I get a list of all blocks being used on the page?

I'm building a module that manages ad units in the form of blocks that all need to be aware of each other and pass information around.
Thus I need to find a simple hook or other function to get a listing of every block that will be used on the page, so I can be sure to know the entire list of ad units on the page.
I've tried hook_block_list_alter() but this seems to return the ENTIRE list of blocks that exist in my Drupal install, with no designation of which blocks are actually going to be rendered or not on this page.
So, now what do I do?

I think it's got something to do with the order hook_block_list_alter() is called in but I'm not entirely sure. Either way the following code will get you a list of all blocks in the current page context. Be sure not to put this in a hook_block_list_alter() function or you'll get an exception caused by infinite nesting.
global $theme;
$all_regions = system_region_list($theme);
$blocks = array();
foreach (array_keys($all_regions) as $region) {
$blocks += block_list($region);
}
If you only need the blocks in a particular region you can ditch most of the code above and just go with:
$blocks = block_list('region_name');

I ran into the same problem when calling block_list('content'). Looking at the code for this function I found it calls two other functions, _block_load_blocks() and _block_render_blocks(). The problem seems to occur in _block_render_blocks() as the display text is not added to the content object. This is different from the other block objects that pass through the function.
To get around this, instead of calling block_list(), I called _block_load_blocks() directly. This returns an array of blocks grouped by region, bypassing the _block_render_blocks() call.
Now we can check for blocks in the content region without the content text disappearing. Huzzar!

I had a similar requirement when implementing an adserver (DFP). My solution was to define an array as a global variable, and include php code in each ad unit block that added a new element to the array.
Then, once all blocks on the page have been executed, you can simply access the global variable to see which ad blocks were called. Because the code to build the list of blocks being called is part of each block, it doesn't matter whether the blocks are displayed in a region, in a panel, or anywhere else.
In my case, I wanted to use the information to add scripts to the <head> section that reference only the add units from the blocks being placed. My complete solution was as follows:
1) Implement hook init to create a global variable in which to store information about which blocks are being displayed (you need to create a custom module to contain this code):
YOURMODULE_custom_init() {
$GLOBALS['dfp-ads'] = array();
}
2) Enable the core php module
3) Add php code at the end of each ad block to add a row to the array created in step 1
<?php
$GLOBALS['dfp-ads']['AD_OR_BLOCK_NAME_GOES_HERE']="AD SPECIFIC SCRIPT GOES HERE";
?>
4) Implement THEME_preprocess_html in my template.php file to access the global variable, build the script, and add the script to <head> section with a call to drupal_add_html_head
function YOURTHEME_preprocess_html(&$vars) {
$inline_script = LOGIC TO ACCESS $GLOBALS['dfp-ads'] AND BUILD SCRIPT GOES HERE;
$element = array(
'#type' => 'markup',
'#markup' => $inline_script,
);
drupal_add_html_head($element, 'google-dfp');
}
It sounds from your description that you don't need the list of ad blocks to build javascript for the head section, but instead want to use the information to modify the contents of the blocks themselves.
In that case insteaad of THEME_preprocess_html, you could try hook_page_alter(&page)
The api page for that hook states that individual "Blocks may be referenced by their module/delta pair within a region:"
// The login block in the first sidebar region.
$page['sidebar_first']['user_login']['#block'];
Hope that helps someone!

Re-processing attached images in drupal 7

I'm trying to import nodes from my forum to drupal 7. Not in bulk, but one by one so that news posts can be created and referenced back to the forum. The kicker is that I'm wanting to bring image attachments across as well...
So far, using the code example here http://drupal.org/node/889058#comment-3709802 things mostly work: Nodes are created, but the images don't go through any validation or processing.
I'd like the attached images to be validated against the rules defined in the content type. in particular the style associated with my image field which resizes them to 600x600.
So, instead of simply creating the nodes programatically with my own form, i decided to modify a "new" node using hook_node_prepare and using the existing form to create new content (based on passed in url args). This works really well and a create form is presented pre-filled with all my data. including the image! very cute.
I expected that i could then hit preview or save and all the validation and resizing would happen to my image, but instead i get the error:
"The file used in the Image field may not be referenced."
The reason for this is that my file doesn't have an entry in the file_usage table.. *le sigh*
so, how do i get to all the nice validation and processing which happens when i manually choose a file to upload? like resizing, an entry in the file_usage table.
The ajax upload function does it, but i can't find the code which is called to do this anywhere in the api.
What file upload / validation functions does Drupal call which i'm not doing?
Anybody have any experience with the file/image api for Drupal 7 who can help me out?

For getting the usage entry (in essence, checking out a file to a specific module so that it doesn't get deleted while its in use) look up the Drupal function 'file_usage_add()'
For validating incoming images, I got this example from user.module (if you're comfortable with PHP, you can always look at the core to see how something is done the 'Drupal way'):
function user_validate_picture(&$form, &$form_state) {
// If required, validate the uploaded picture.
$validators = array(
'file_validate_is_image' => array(),
'file_validate_image_resolution' => array(variable_get('user_picture_dimensions', '85x85')),
'file_validate_size' => array(variable_get('user_picture_file_size', '30') * 1024),
);
// Save the file as a temporary file.
$file = file_save_upload('picture_upload', $validators);
if ($file === FALSE) {
form_set_error('picture_upload', t("Failed to upload the picture image; the %directory directory doesn't exist or is not writable.", array('%directory' => variable_get('user_picture_path', 'pictures'))));
}
elseif ($file !== NULL) {
$form_state['values']['picture_upload'] = $file;
}
}
That function is added to the $form['#validate'] array like so:
$form['#validate'][] = 'user_validate_picture'

Drupal - Getting node id from view to customise link in block

How can I build a block in Drupal which is able to show the node ID of the view page the block is currently sitting on?
I'm using views to build a large chunk of my site, but I need to be able to make "intelligent" blocks in PHP mode which will have dynamic content depending on what the view is displaying.
How can I find the $nid which a view is currently displaying?

Here is a more-robust way of getting the node ID:
<?php
// Check that the current URL is for a specific node:
if(arg(0) == 'node' && is_numeric(arg(1))) {
return arg(1); // Return the NID
}
else { // Whatever it is we're looking at, it's not a node
return NULL; // Return an invalid NID
}
?>
This method works even if you have a custom path for your node with the path and/or pathauto modules.
Just for reference, if you don't turn on the path module, the default URLs that Drupal generates are called "system paths" in the documentation. If you do turn on the path module, you are able to set custom paths which are called "aliases" in the documentation.
Since I always have the path module turned on, one thing that confused me at first was whether it was ever possible for the arg function to return part of an alias rather than part of system path.
As it turns out, the arg function will always return a system path because the arg function is based on $_GET['q']... After a bit of research it seems that $_GET['q'] will always return a system path.
If you want to get the path from the actual page request, you need to use $_REQUEST['q']. If the path module is enabled, $_REQUEST['q'] may return either an alias or a system path.

For a solution, especially one that involves a view argument in the midst of a path like department/%/list, see the blog post Node ID as View Argument from SEO-friendly URL Path.

In the end this snippet did the job - it just stripped the clean URL and reported back the very last argument.
<?php
$refer= $_SERVER ['REQUEST_URI'];
$nid = explode("/", $refer);
$nid = $nid[3];
?>
Given the comment reply, the above was probably reduced to this, using the Drupal arg() function to get a part of the request path:
<?php
$nid = arg(3);
?>

You should considder the panels module. It is a very big module and requires some work before you really can tap into it's potential. So take that into considderation.
You can use it to setup a page containing several views/blocks that can be placed in different regions. It uses a concept called context which can be anything related to what you are viewing. You can use that context to determine which node is being viewed and not only change blocks but also layout. It is also a bit more clean since you can move the PHP code away from admin interface.
On a side note, it's also written by the views author.

There are a couple of ways to go about this:
You can make your blocks with Views and pass the nid in through an argument.
You can manually pass in the nid by accessing the $view object using the code below. It's an array at $view->result. Each row in the view is an object in that array, and the nid is in that object for each one. So you could run a foreach on that and get all of the nid of all rows in the view pretty easily.
The first option is a lot easier, so if that suits your needs I would go with that.

New about Drupal 7: The correct way to get the node id is using the function menu_get_object();
Example:
$node = menu_get_object();
$contentType = node_type_get_name($node);
Drupal 8 has another method. Check this out:
arg() is deprecated

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex