How to extract a number using scrapy? - web-scraping

Hello and thanks for taking the time to read this. I am quite new to scrapy and am trying to scrape just one number on a website. I tried using:
yield {
'myitem': response.css('span.theidofmyitem').extract_first()
}
but this seems to work only for text: with this number (which is contained within a span), a value of None is returned. Thanks in advance for any help!
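A minimal sketch of one way to get from the span to an actual number. It rests on two assumptions that the question does not confirm: that `theidofmyitem` is an id rather than a class (in CSS, `#` selects by id and `.` selects by class, so a class selector would match nothing and yield None), and that the number is in the page source rather than filled in by JavaScript. Scrapy's `::text` pseudo-element returns the text inside the element instead of the whole tag. The helper `parse_number` is a hypothetical name, and the Scrapy call appears only as a comment so the block runs without a spider.

```python
import re


def parse_number(text):
    """Return the first number found in text as a float, or None."""
    if text is None:
        return None
    # Optional sign, digits with optional thousands separators, optional decimals.
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


# Inside a spider callback, the raw text might come from something like
# (assuming the span is identified by an id, which is a guess):
#   raw = response.css('span#theidofmyitem::text').extract_first()
raw = "  1,234.5 \n"  # stand-in for the extracted text
number = parse_number(raw)
print(number)
```

If `extract_first()` still returns None with the id selector, the span is probably not present in the downloaded HTML, which is worth checking in the page source.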

Related

Assigning observation name to a value when retrieving a variable

I want to create a dataframe that contains > 100 observations on ~20 variables. It will be based on a list of html files saved in a local folder. I would like to make sure that R matches the correct values of each variable to each observation. Assuming that R goes through the files in the same order when constructing each variable AND does not skip values in case of errors or the like, this should happen automatically.
But is there a "safe way" to do this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
# Load rvest, which provides read_html(), html_nodes() and html_text()
library(rvest)
# Specify the URL of the website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'
# Read the HTML code from the website
webpage <- read_html(url)
# Extract each variable as a character vector, one element per matched node
title_data_html <- html_text(html_nodes(webpage, '.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage, '.text-primary'))
description_data_html <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))
# Bind the variables column-wise into a data frame
df <- data.frame(title_data_html, rank_data_html, description_data_html)
This would come up with a list of rank and description data, but with no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too many: 201 descriptions, but there are only 200 movies. Without a reference to which movie each description belongs to, it is very tough to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element with an attribute that also matches your selector text. Look at the website as it appears in a browser as well as at the source. Right-click, CTRL + U (PC), or CMD + OPT + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.
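For pages where each observation does sit in its own container element, the per-observation approach mentioned in the question can be sketched as follows. This is an illustration in Python on hypothetical, well-formed markup (the `item`, `rank` and `desc` class names are invented for the example and are not taken from IMDb). Each field is looked up inside its container, so a missing field becomes None instead of shifting every later value up by one.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <div class="item"> per movie.
# The second movie deliberately has no description.
PAGE = """
<div>
  <div class="item"><h3>Movie A</h3><span class="rank">1.</span><p class="desc">First plot.</p></div>
  <div class="item"><h3>Movie B</h3><span class="rank">2.</span></div>
  <div class="item"><h3>Movie C</h3><span class="rank">3.</span><p class="desc">Third plot.</p></div>
</div>
"""


def text_or_none(parent, path):
    """Return the stripped text of the first match under parent, or None."""
    node = parent.find(path)
    if node is None or not node.text:
        return None
    return node.text.strip()


def extract_rows(markup):
    """Build one record per item container, keeping fields aligned."""
    root = ET.fromstring(markup)
    rows = []
    for item in root.findall("./div[@class='item']"):
        rows.append({
            "title": text_or_none(item, "h3"),
            "rank": text_or_none(item, "span[@class='rank']"),
            "description": text_or_none(item, "p[@class='desc']"),
        })
    return rows


rows = extract_rows(PAGE)
for row in rows:
    print(row)
```

Because every record carries its own title, a missing or extra value can be traced to a specific observation rather than showing up only as a length mismatch between columns.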

Association Analysis using Tweets/twittR package

I'm new to R and was wondering whether it is possible, using R, to get a list of users who tweet using the word "cats", for example, and then go through their timelines to see whether they also tweeted using the word "dogs".
I have managed using the twitteR package to get a list of user names and their tweets and put them into a dataframe. I just don't know how to go about doing the rest or if it is even possible.
Any help at all would be greatly appreciated!
John, I am not sure I understand correctly what you are trying to achieve, but I am assuming that the dataframe also contains a timestamp for each tweet. If that is the case, you can group by user and arrange the tweets in ascending order of timestamp. After that, you could use grepl() to search for 'dogs' or any other word you are looking for.
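The group-then-search idea above can be sketched in Python, purely as an illustration of the logic. The tweets, user names and timestamps are made-up sample data, and the plain substring test stands in for grepl(); nothing here calls the twitteR package.

```python
from collections import defaultdict

# Hypothetical sample data: (user, timestamp, text) tuples.
tweets = [
    ("alice", "2017-01-01T10:00", "I love cats"),
    ("alice", "2017-01-02T09:00", "Dogs are great too"),
    ("bob", "2017-01-01T11:00", "cats everywhere"),
    ("carol", "2017-01-03T08:00", "walking my dogs"),
]


def users_mentioning_both(tweets, first_word, second_word):
    """Return users whose tweets contain both words (case-insensitive)."""
    by_user = defaultdict(list)
    # Sort by user, then by timestamp, so each timeline is in ascending order.
    for user, timestamp, text in sorted(tweets, key=lambda t: (t[0], t[1])):
        by_user[user].append(text.lower())
    result = []
    for user, texts in by_user.items():
        has_first = any(first_word in t for t in texts)
        has_second = any(second_word in t for t in texts)
        if has_first and has_second:
            result.append(user)
    return result


both = users_mentioning_both(tweets, "cats", "dogs")
print(both)
```

A substring test also matches words such as "catsuit"; a word-boundary regular expression would be stricter if that matters.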

Determination of correct CSS selector for rvest to scrape realtor.com

I am attempting to scrape information from realtor.com at the following address for an example home in Des Moines, IA.
http://www.realtor.com/realestateandhomes-detail/2419-Hart-Ave_Des-Moines_IA_50320_M85646-67738
The information I am particularly interested in is under "Payment Options", specifically in the "wheel chart" where the Principal & Interest, Property Tax, and Home Insurance values are listed and graphically displayed. I have inspected the element for this page, and it seems to me that the CSS selector I need is:
span #principle_interest .float-right
I am not sure if it is appropriate to have spaces in the above, but have tried both ways. Following is my R code:
## Load rvest package
library(rvest)
## Parse realtor.com page html
siteHTML <- read_html("http://www.realtor.com/realestateandhomes-detail/2419-Hart-Ave_Des-Moines_IA_50320_M85646-67738")
## Attempt to extract principle interest value
PBI <- siteHTML %>% html_nodes("span#principle_interest.float-right")
After this attempt, PBI is equal to "{xml_nodeset (0)}"
My attempts extracting the address, including city and zip code, as well as total price, number of baths, beds, square feet and lot square feet were all successful, but I could not get this part to work. Does anyone have any insight here? I sincerely apologize if this is a double post, I couldn't find anything similar upon looking around. Am I strongly oversimplifying the CSS perhaps?
Thank you so much!

Google Spreadsheet IF and AND

I'm trying to find an easy formula to do the following:
=IF(AND(H6="OK";H7="OK";H8="OK";H9="OK";H10="OK";H11="OK";);"OK";"X")
This actually works. But I want to apply to a range of cells within a column (H6:H11) instead of having to create a rule for each and every one of them... But trying as a range:
=IF(AND(H6:H11="OK";);"OK";"X")
Does not work.
Any insights?
Thanks.
=ArrayFormula(IF(AND(H6:H11="OK");"OK";"X"))
also works.
Array formulas work the same way they do in Excel; they just need an ArrayFormula() wrapper around them (it is added automatically when you press Ctrl+Shift+Enter, as with array entry in Excel).
In google sheets the formula is:
=ArrayFormula(IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X"))
in excel:
=IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X")
And confirm with Ctrl-Shift-Enter
This counts how many cells in the range equal the criterion and compares that count to the number of cells in the range. So if the range grows, increase the number 6 to match.
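The counting logic behind those formulas can be sketched in Python for illustration. The cell values are made-up sample data and `all_ok_by_count` is a hypothetical helper name. Unlike the formula, the sketch compares against the length of the range, so there is no hard-coded 6 to update.

```python
def all_ok_by_count(values, expected="OK"):
    """Return "OK" if every value equals expected, otherwise "X"."""
    # Same idea as SUM(IF(range="OK";1;0)): count the matching cells.
    matches = sum(1 if value == expected else 0 for value in values)
    return "OK" if matches == len(values) else "X"


# Hypothetical contents of H6:H11.
cells_all_ok = ["OK", "OK", "OK", "OK", "OK", "OK"]
cells_one_bad = ["OK", "OK", "OK", "X", "OK", "OK"]

print(all_ok_by_count(cells_all_ok))
print(all_ok_by_count(cells_one_bad))
```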

Infopath : convert number to words .

I want to convert an InfoPath number field to words.
Ex. I have the number 1000 -> One Thousand, and so on. Please help me.
Thank you,
As far as I know there's no built-in way to achieve this; you'll have to write custom code.
This link might help you with converting numbers to text.
http://www.daniweb.com/software-development/csharp/threads/53072
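As an illustration of what such custom code involves, here is a minimal sketch of the conversion logic in Python. It is not InfoPath code and not the code from the linked thread; the function names are hypothetical, and it only handles whole numbers from 0 up to (but not including) one trillion, in English.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(1_000_000_000, "billion"), (1_000_000, "million"), (1_000, "thousand")]


def below_thousand(n):
    """Spell out an integer in the range 1..999."""
    parts = []
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += "-" + ONES[n % 10]
        parts.append(word)
    elif n > 0:
        parts.append(ONES[n])
    return " ".join(parts)


def number_to_words(n):
    """Spell out an integer in the range 0..999,999,999,999."""
    if n < 0 or n >= 10**12:
        raise ValueError("supported range is 0 to 999,999,999,999")
    if n == 0:
        return "zero"
    parts = []
    # Peel off billions, millions and thousands, largest scale first.
    for value, name in SCALES:
        if n >= value:
            parts.append(below_thousand(n // value) + " " + name)
            n %= value
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)


print(number_to_words(1000))
print(number_to_words(10000))
```

The same table-driven structure should carry over to whichever language the form's code-behind uses.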
