I have a requirement to hit a URL such as http://something.com/requirements.txt
The content (response.text) will be something like this:
From the 8th to the 12th century, Old English gradually transformed through language contact into Middle English. Middle English is often arbitrarily defined as beginning with the conquest of England by William the Conqueror in 1066, but it developed further in the period from 1200–1450.
Year: 2020
First, the waves of Norse colonisation of northern parts of the British Isles in the 8th and 9th centuries put Old English into intense contact with Old Norse, a North Germanic language. Norse influence was strongest in the north-eastern varieties of Old English spoken in the Danelaw area around York, which was the centre of Norse colonisation; today these features are still particularly present in Scots and Northern English. However the centre of norsified English seems to have been in the Midlands around Lindsey, and after 920 CE when Lindsey was reincorporated into the Anglo-Saxon polity, Norse features spread from there into English varieties that had not been in direct contact with Norse speakers. An element of Norse influence that persists in all English varieties today is the group of pronouns beginning with th- (they, them, their) which replaced the Anglo-Saxon pronouns with h- (hie, him, hera).[43]
I want to scrape only the "Year:" value from the text response using Scrapy and map it to an ItemLoader. Is there any way I can do this with Scrapy?
You can use the re module with a regular expression:
import re

# capture whatever follows "Year: " up to the end of that line
re.findall(r'Year: (.*)\n', response.text)
I have over 5,000 txt files of news articles that look just like the one below. I am trying to create a corpus with a document for each paragraph of every txt file in a folder. There is a command (corpus_reshape) in the quanteda R package that helps me create this corpus with paragraphs, instead of full articles, as the documents. However, the command isn't able to identify the single-"enter" paragraph breaks in the body of the article; instead it looks for larger gaps between text to determine where one paragraph begins and another ends. In other words, from the text file below, the command only creates four documents: the first starting with "Paying for the Papal Visit", the second with "Copyright 1979 The Washington Post", the third with "NO ONE KNOWS", and the last with "End of Document". But the body of the article (between "Body" and "End of Document") actually consists of four paragraphs that corpus_reshape couldn't identify.
So, I need to somehow go back through all 5,000+ txt files and increase the number of empty lines between paragraphs in the body of the text, so that, when I create the corpus, it can accurately parse out all paragraphs. I would greatly appreciate any help. Thank you!
Update: Below I have added both a link to a downloadable copy of the txt file in the example as well as the pasted text as copied from the txt file.
Downloadable link:
https://drive.google.com/file/d/1SO5S5XgNlc4H-Ms8IEvK_ogZ57qSoTkw/view?usp=sharing
Pasted Text:
Paying for the Papal Visit
The Washington Post
September 16, 1979, Sunday, Final Edition
Copyright 1979 The Washington Post
Section: Outlook; Editorial; B6
Length: 403 words
Body
NO ONE KNOWS how much Pope John Paul II's week-long U.S. visit will end up costing -- or even how to calculate the cost. But already who picks up the tab has become a subject of considerable unnecessary controversy in three cities. Some religious and civil-liberties groups in Philadelphia and Boston are challenging -- or nit-picking -- proposals by governments in these cities to spend public money on facilities connected with outdoor papal masses; and in New York, local and Roman Catholic officials have been locked in negotiations over who will pay for what.
But by and large, here in Washington and in Chicago and Des Moines, these details are being handled as they should be: without making a separation-of-church-and-state issue out of the logistics. Spending by a city for the smoothest and safest handling of a major event is a legitimate secular municipal function. Even a spokesman for Americans United for Separation of Church and State has agreed that there is nothing wrong with using public money for cleanup, police overtime, police protection and traffic control.
Playing host to world figures and huge turnouts is indeed expensive, as District officials regularly remind us when they are haggling with Congress for more federal help with the bills. Still, here in the capital, whether it is the pope, angry American farmers, anti-war demonstrators or civil-rights marchers, public spending for special services is considered normal and essential. Much of the hair-splitting in other cities over the papal-visit expenses has to do with whether public money should pay for platforms from which the pope will celebrate outdoor masses. That's a long reach for a constitutional controversy, and not worth it.
Far better is the kind of cooperation that separate church and state groups here are demonstrating in their planning. For example, there will be a chainlink fence surrounding the stage and altar from which the pope will say the mass on the Mall and extending to other nearby areas. The police recommended the fence, estimated to cost about $25,000, the church has agreed to pay for the portion around the stage and altar. To help clean up, the church plans to produce hundreds of Scouts on the Monday holiday for volunteer duty. This approach to the visit is a lot more sensible -- and helpful to all taxpayers -- than a drawn-out argument and threats of legal action.
End of Document
This will add 3 extra lines at the end of those paragraphs. The logic used was to add the extra lines when the line length was 50 or more. You may want to modify that. It was chosen because the longest line in the "paragraphs" you were happy with was 46 characters.
txt <- readLines("/home/david/Documents/R_code/misc files/WP_1979.9.16.txt")
spread <- ifelse(nchar(txt) < 50,
                 paste0(txt, "\n"),         # these lines are left alone
                 paste0(txt, "\n\n\n\n"))   # longer lines are padded
cat(spread, file = "/home/david/Documents/R_code/misc files/spread.txt")
The cat function doesn't otherwise modify the lines, but note that it does not restore the newlines that readLines stripped on input, which is why they are added back explicitly above. Some of the "lines" in the input text were just empty:
nchar(txt)
[1] 0 0 26 19 41 0 0 34 31 17 4 0 0 566 519 643 672 0 0 15
Now the same operation on spread.txt yields a different "picture". I'm thinking that the added padding with "\n" characters is what is changing the counts, but I think that the corpus processing machinery will not mind:
nchar( readLines("/home/david/Documents/R_code/misc files/spread.txt" ))
#------------
[1] 0 1 27 20 42 1 1 35 32 18 5 1 1 567 0 0 0 520 0 0 0 644 0 0 0 673 0
[28] 0 0 1 1 16
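As a rough check (assuming quanteda is installed and that corpus_reshape() treats blank lines as paragraph breaks, which is what the question describes), reading the padded file back in should now give one document per blank-line-separated block:
library(quanteda)
# read the padded file back in as a single document, then reshape to paragraphs
padded <- paste(readLines("/home/david/Documents/R_code/misc files/spread.txt"),
                collapse = "\n")
corp  <- corpus(padded)
paras <- corpus_reshape(corp, to = "paragraphs")
ndoc(paras)   # each body paragraph should now be counted as its own document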
I have a large dataset of addresses that I plan to geocode in ArcGIS (Google geolocating is too expensive). Examples of the addresses are below.
9999 ST PAUL ST BSMT
GARRISON BL & BOARMAN AVENUE REAR
1234 MAIN STREET 123
1234 MAIN ST UNIT1
ArcGIS doesn't recognize addresses that include units and other words at the end. So I want to remove these words so that it looks like the below.
9999 ST PAUL ST
GARRISON BL & BOARMAN AVENUE
1234 MAIN STREET
1234 MAIN ST
The key challenges include:
ST is used both as an abbreviation for STREET and to indicate "SAINT" in street names.
Addresses end in many different indicators such as STREET and AVENUE
There are intersections (indicated with &) that might include indicators like ST and AVENUE twice.
Using R, I'm attempting to apply the sub() function to solve the problem but I have not had success. Below is my latest attempt.
sub("(.*)ST","\\1",df$Address,perl=T)
I know that many existing questions are similar, but none address this problem directly, and I suspect it is relevant to other users.
Although I feel simply removing the last word should work for you, just to be a little safer you can use this regex to retain what you want and discard what you don't want.
(.*(?:ST|AVENUE|STREET)\b).*
Here, .*(?:ST|AVENUE|STREET)\b captures your intended data by matching everything from the start in a greedy manner, stopping only at the last occurrence of one of the words ST, AVENUE, or STREET; whatever comes after that is discarded, which is what you wanted. In your current case only one word follows, but it would discard more than one word, or indeed anything that occurs after those specific words. The intended data gets captured in group 1, so just replace the match with \1.
So instead of this,
sub("(.*)ST","\\1",df$Address,perl=T)
try this,
sub("(.*(?:ST|AVENUE|STREET)\b).*","\\1",df$Address,perl=T)
See this demo
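Note that because the pattern is passed as an R string, the \b has to be escaped as \\b, as in the call above. As a quick check on the four sample addresses from the question:
# sample vector taken from the question
addresses <- c("9999 ST PAUL ST BSMT",
               "GARRISON BL & BOARMAN AVENUE REAR",
               "1234 MAIN STREET 123",
               "1234 MAIN ST UNIT1")
sub("(.*(?:ST|AVENUE|STREET)\\b).*", "\\1", addresses, perl = TRUE)
# [1] "9999 ST PAUL ST"              "GARRISON BL & BOARMAN AVENUE"
# [3] "1234 MAIN STREET"             "1234 MAIN ST"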
I'm working on a project to turn press releases from Operation Inherent Resolve which detail airstrikes against ISIS in Syria and Iraq into a usable dataset. So far we've been handcoding everything but it just takes insanely long.
Every press release is structured like this:
November 23, 2016
Military Strikes Continue Against ISIL Terrorists in Syria and Iraq
U.S. Central Command
SOUTHWEST ASIA, November 23, 2016 - On Nov. 22, Coalition military forces conducted 17 strikes against ISIL terrorists in Syria and Iraq. In Syria, Coalition military forces conducted 11 strikes using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets. Additionally in Iraq, Coalition military forces conducted six strikes coordinated with and in support of the Government of Iraq using attack, bomber, fighter, and remotely piloted aircraft against ISIL targets.
The following is a summary of the strikes conducted since the last press release:
Syria
Near Abu Kamal, one strike destroyed an oil rig.
Near Ar Raqqah, four strikes engaged an ISIL tactical unit, destroyed two vehicles, an oil tanker truck, an oil pump, and a VBIED, and damaged a road.
Iraq
Near Rawah, one strike engaged an ISIL tactical unit and destroyed a vehicle, a mortar system, and a weapons cache.
Near Mosul, four strikes engaged three ISIL tactical units, destroyed >six ISIL-held buildings, a mortar system, a vehicle, a weapons cache, a supply cache, and an artillery system, and damaged five supply routes, and a bridge.
more text I don't need, about 5 exceptions where they amend previous reports I'll just fix by hand, and then the next report
What I'm trying to do is pull out just the date of the strike and how many strikes per city for both Iraq and Syria and reformat that information into a proper dataset organized as one row per date, like this:
            Rawah   Mosul   And So On
1/1/2014    1       4
1/2/2014    2       5
The bad: There's a different number of cities listed for each country in each press release, and a different number of strikes listed each time.
The good: Every one of these press releases is worded exactly the same.
The string "SOUTHWEST ASIA," is always in front of the date
A 4 space indent followed by the word "Near" are always in front of the city
The city and a comma are always in front of the number of strikes
The number of strikes is always in front of the word "airstrike" or "airstrikes"
The question is whether it's possible to make a regex to either copy/cut everything matching those criteria in order or just delete everything else. I think to grab the arbitrary number of cities (with unknown names) and unknown numbers of strikes it would have to be based on copying/saving everything next to the unchanging markers.
I've tried using Notepad++'s find/replace function with something like *(foobar)*, but I can only match one thing at a time, and when I try to replace everything but the matched string, it just deletes the whole file instead of preserving every instance of the matching string.
I suggest searching by using Near (.*?),. You can back-reference with \1.
I did a quick scan of the documents, and it seems the more recent ones change a bit of the format, adding "Strikes at [country]" rather than your example of just "[country]". But each one lists the city in a Near [city], format.
This would grab you the cities, of course, but you would have to do some pretty hacky things to get the number of strikes, since there doesn't seem to be a standard for that.
If you are only dealing with the records that have your formatting, try Near (.*?), (.*? ) and you should get the spelled out number of strikes per city by referencing \2, and the city by referencing \1.
So, if you were to find and replace in Notepad++, you would use something like .*Near (.*?), (.*? ).* as your find, and something like \1 -- \2 as your replace. From there you would need to draft up a small script to translate the spelled numbers to digits, and output those where they need to go. The pattern \w* \d{1,2}, \d{4} will match a date in the long format, something else you could pipe into a python script or something to construct your table of data. Sorry I couldn't help more there!
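For what it's worth, here is a rough sketch of the same idea in R rather than a Python script (the file name is hypothetical, and the number lookup only covers small counts, so treat it as a starting point rather than a finished solution):
# read one press release into a single string (file name is made up)
txt <- paste(readLines("press_release.txt"), collapse = "\n")
# the long-format date that always follows "SOUTHWEST ASIA, "
release_date <- regmatches(txt,
    regexpr("(?<=SOUTHWEST ASIA, )\\w+ \\d{1,2}, \\d{4}", txt, perl = TRUE))
# "Near <city>, <spelled-out number> strike(s)" pairs
hits   <- regmatches(txt, gregexpr("Near [^,]+, \\w+ strikes?", txt))[[1]]
cities <- sub("Near ([^,]+), \\w+ strikes?", "\\1", hits)
counts <- sub("Near [^,]+, (\\w+) strikes?", "\\1", hits)
# translate the spelled-out numbers to digits (extend the lookup as needed)
lookup  <- c(one = 1, two = 2, three = 3, four = 4, five = 5,
             six = 6, seven = 7, eight = 8, nine = 9, ten = 10)
strikes <- unname(lookup[tolower(counts)])
data.frame(date = release_date, city = cities, strikes = strikes,
           stringsAsFactors = FALSE)
From a long table like that, getting one row per date with one column per city is a reshape away (base R's reshape() or any reshaping package will do it).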
I just imported a text file into R, and now I have a data frame of length 1.
I had 152 reviews separated by a * character in that text file. Each review is like a paragraph long.
How can I get a data frame of length 152, with each review being its own entry in the data frame? I used this line to import the file into R:
myReviews <- read.table("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt", header=FALSE,sep="*")
(myReviews has a length of 1 here.) I need it to be 152, the number of reviews inside the text file.
How can I split the data from the data frame by the "*" delimiter, or just import the text file correctly by putting it into a data frame of length 152 instead of 1?
EDIT : Example of the data
I have 152 of this kind of data, all separated by "*":
I boarded Air Canada flight AC 7354 to Washington DCA the next morning, April 13th. After arriving at DCA, I
discovered to my dismay that the suitcase I had checked in at Tel Aviv was not on my flight. I spent 5 hours at
Pearson trying to sort out the mess resulting from the bumped connection.
*First time in 40 years to fly with Air Canada. We loved the slightly wider seats. Disappointed in no movies to
watch as not all travelers have i-phones etc. Also, on our trip down my husband asked twice for coffee, black. He
did not get any and was not offered anything else. On our trip home, I asked for a black coffee. She said she was
making more as several travelers had asked. I did not my coffee or anything else to drink. It was a long trip with
nothing to drink especially when they had asked. When trying to get their attention, they seemed to avoid my
tries. We found the two female stewardess's not very friendly. It may seem a small thing but very frustrating for
a traveler.
*Painless check-in. Two legs of 3 on AC: AC105, YYZ-YVR. Roomy and clean A321 with fantastic crew. AC33: YVR-SYD,
very light load and had 3 seats to myself. A very enthusiastic and friendly crew as usual on this transpacific
route that I take several times a year. Arrived 20 min ahead of schedule. The expected high level of service from
our flag carrier, Air Canada. Altitude Elite member.
They are airline reviews, each review separated by a *. I would like to take all of the reviews and put them in one data.frame in R, but each review should get its own "slot" or "position". The "*" is intended to be the separator.
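For what it's worth, a minimal sketch of one way to do that is to read the raw text and split it on the delimiter yourself, rather than relying on read.table() (whose sep argument defines column separators, not record separators). The path is the one from the question:
raw  <- readLines("C:/Users/Norbert/Desktop/research/Important files/Airline Reviews/Reviews/air_can_Review.txt")
text <- paste(raw, collapse = "\n")                # one long string
reviews <- strsplit(text, "*", fixed = TRUE)[[1]]  # split on the literal * separator
reviews <- trimws(reviews)                         # tidy leading/trailing whitespace
reviews <- reviews[reviews != ""]                  # drop any empty pieces
myReviews <- data.frame(review = reviews, stringsAsFactors = FALSE)
nrow(myReviews)                                    # should be 152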
I need to convert Chinese characters into pinyin and need an official document for that conversion.
There are some libraries around, as mentioned by previous posts such as Convert chinese characters to hanyu pinyin.
However, I need an "official standard" more than an "available library". Where could I find such a document? Is there any standard / document / book released by China government for how shall Chinese characters be pronounced/marked by pinyin?
Appreciate your kind help.
The Taiwan Ministry of Education has a site listing all the variants of Chinese characters: http://dict.variants.moe.edu.tw/eng.htm
In it, they also specified the pronunciation of the characters. However, the pronunciation used is Zhuyin (popular in Taiwan) and not Hanyu Pinyin (popular in Mainland China).
You could use the list on Wikipedia to map Zhuyin to Hanyu Pinyin http://zh.wikipedia.org/wiki/%E4%B8%AD%E6%96%87%E6%8B%BC%E9%9F%B3%E5%B0%8D%E7%85%A7%E8%A1%A8
For example, the character 井 http://dict.variants.moe.edu.tw/yitia/fra/fra00052.htm has the Zhuyin ㄐㄧㄥˇ, which you then look up: ㄐㄧㄥ = jing. Then combine with the tone and you get jǐng.
I don't know of any official standard in Mainland China or in any other Chinese speaking countries.
There is no unique way to convert a Chinese character to pinyin, since there is not necessarily a unique way to pronounce a character; pinyin is a system to transcribe Chinese characters into Latin script, from which one can derive how to pronounce the character. It all depends on the context in which the character is used.
Some examples:
The verb 数 meaning "to count" has pinyin shǔ, while the noun 数 meaning "number" has pinyin shù.
长 with meaning "long" is written as cháng, with meaning "chief" however it is written as zhǎng
The pinyin for 好 with meaning "good" is hǎo while the 好 in 爱好 has pinyin hào.
行 with meaning "to walk" has pinyin xíng, while the measurement word meaning for a row of something has pinyin háng.
Chinese is full of such examples. Sometimes only the tones differ (see the 好 example) and sometimes the pronunciation is completely different (the 行 example).
Besides characters having multiple pronunciations (depending on the context), tones also change when characters are used together with other characters. For example, the pinyin for 不 is normally bù, but becomes bú when the character following 不 has a fourth tone.
Answering my own question just to add my 2 cents, in case others bump into this topic.
In mainland China, there is a dictionary, 新华字典 (http://en.wikipedia.org/wiki/Xinhua_Zidian), that is quite authoritative. Although it's not endorsed by the Chinese government, more than 400 million copies have been published, and it is widely used as a reference book by primary and middle school students and teachers.
Unfortunately there is no official online version of this dictionary, though some scanned versions are available.
For mainland China, pinyin orthography follows the 《汉语拼音正词法基本规则》 (Basic Rules of Hanyu Pinyin Orthography) published in 1996. This is the national standard, which has to be used in all official publications (although you will see incorrect Pinyin usage everywhere in China). You can find the full text (including an English translation) here: http://www.pinyin.info/rules/pinyinrules_simp.html
For the correct transcription of characters, I agree that Xinhua Zidian is a quasi-authority. You can find some online versions, in fact (like http://xh.5156edu.com/), but I don't know if they are reliable.