I want to scrape all the pages of Internshala, extract the job ID, job name, company name, and last date to apply, and store everything in a CSV to later convert to a dataframe.
import requests
import scrapy
from bs4 import BeautifulSoup
from scrapy import Selector
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import string
import pandas as pd

url = 'https://internshala.com/fresher-jobs'
sel = Selector(text=BeautifulSoup(requests.get(url).content, 'html.parser').prettify())
pages = sel.xpath('//span[@id="total_pages"]').xpath('normalize-space(./text())').extract()
pages[0] = int(pages[0])
print(pages[0])  # which gives -> 4
class jobMan(scrapy.Spider):
    name = 'job'
    to_remove = {0: ["\n ", "\n "],
                 1: ['\n ', '\n ']}

    def start_requests(self):
        urls = "https://internshala.com/fresher-jobs/page-1"
        yield scrapy.Request(url=urls, callback=self.parse)

    def parse(self, response):
        ID = response.xpath('//div[@class="container-fluid individual_internship visibilityTrackerItem"]/@internshipid').extract()
        Job_Post = response.xpath('//div[@class="heading_4_5 profile"]/a').xpath('normalize-space(./text())').extract()
        Company = response.xpath('//a[@class="link_display_like_text"]').xpath('normalize-space(./text())').extract()
        Apply_By = response.xpath('//div[@class="internship_other_details_container"]/div[@class="other_detail_item_row"][2]//div[@class="item_body"]').xpath('normalize-space(./text())').extract()
        for page in range(2, pages[0] + 1):
            yield scrapy.Request(url=f"https://internshala.com/fresher-jobs/page-{page}", callback=self.parse)
        yield {
            'ID': ID,
            'Job': Job_Post,
            'Company': Company,
            'Apply_By': Apply_By
        }

process = CrawlerProcess(settings={
    'FEED_URI': 'JOBSS.csv',
    'FEED_FORMAT': 'csv'
})
process.crawl(jobMan)
process.start()
And then, finally:
final = pd.read_csv('JOBSS.csv')
print(final)
Which gave me:
ID Job \
0 NaN Product Developer - Science,Salesforce Develop...
1 NaN Business Development Manager,Mobile App Develo...
2 NaN Software Engineer,Social Media Strategist And ...
3 NaN Reactjs Developer,Full Stack Developer,Busines...
Company \
0 Open Door Education,Aekot Consulting And Techn...
1 ISB Studienkolleg,TutorBin,Alphacore Technolog...
2 CrewKarma,Internshala,Mithi Software Technolog...
3 Startxlabs Technologies Private Limited,RavGin...
Apply_By
0 7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug' 21,7 Aug'...
1 31 Jul' 21,30 Jul' 21,30 Jul' 21,31 Jul' 21,30...
2 24 Jul' 21,24 Jul' 21,23 Jul' 21,23 Jul' 21,23...
3 11 Jul' 21,11 Jul' 21,11 Jul' 21,11 Jul' 21,11...
Doubt 1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.

Doubt 2: I wanted a dataframe such that the Job_Post column contains each job post's name as its own row (for example), with all the pages merged, but instead I am getting one row per page.

How can I solve these issues? Please help.
Doubt 1: Why is it not printing the IDs? I tried scraping just the ID for the first page using the same XPath and got the correct output, but not while crawling.
Because the class name has a space in it, use:
ID = response.xpath('//div[contains(@class, "container-fluid individual_internship visibilityTrackerItem")]/@internshipid').extract()
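For Doubt 2, here is a minimal sketch of one approach (untested against the live site, reusing the question's own XPaths and the pages count computed earlier): loop over the individual job cards and yield one item per card, so the feed exporter writes one CSV row per job instead of one row per page:

class jobMan(scrapy.Spider):
    name = 'job'

    def start_requests(self):
        # Schedule every page up front instead of re-yielding them from parse().
        for page in range(1, pages[0] + 1):
            yield scrapy.Request(url=f"https://internshala.com/fresher-jobs/page-{page}",
                                 callback=self.parse)

    def parse(self, response):
        # One item per job card -> one CSV row per job.
        for card in response.xpath('//div[contains(@class, "individual_internship")]'):
            yield {
                'ID': card.xpath('./@internshipid').get(),
                'Job': card.xpath('normalize-space(.//div[@class="heading_4_5 profile"]/a/text())').get(),
                'Company': card.xpath('normalize-space(.//a[@class="link_display_like_text"]/text())').get(),
                'Apply_By': card.xpath('normalize-space(.//div[@class="other_detail_item_row"][2]//div[@class="item_body"]/text())').get(),
            }

With per-item yields there is also nothing for pandas to split afterwards; JOBSS.csv already has one job per row.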
I am not sure why this code doesn't run, but if I break it into 2 smaller chunks it works. Is there any way I can run the whole chunk at once?
When I run this code, a plus sign appears in the console and I can't click Run in R Markdown.
dataT4 <- dataT4 %>% mutate(coupleID = case_when(id==10011~1, id==10021~2,
id==10032~3, id==10041~4,id==10062~5, id==10071~6,id==10082~7, id==10092~8,
id==10112~9, id==10121~10,id== 10131~11, id==10142~12, id==10151~13,
id==10162~14,id==10171~15, id==10181~16, id==10202~17, id==10212~18, id==10221~19,
id==10232~20, id==10242~21, id==10251~22, id==10262~23, id==10271~24, id==10292~25,
id==10311~26, id==10332~27, id==10342~28, id==10351~29, id==10361~30, id==10372~31,
id==10382~32, id==10391~33, id==10401~34, id==10412~35, id==10421~36, id==10432~37,
id==10442~38, id==10452~39, id==10461~40, id==10471~41, id==10481~42, id==10492~43,
id==10501~44, id==10511~45, id==10521~46, id==10532~47, id==10542~48, id==10562~49,
id==10581~50, id==10592~51, id==10602~52, id==10611~53, id==10642~54, id==10651~55,
id==10662~56, id==10672~57, id==10681~58, id==10702~59, id==10761~60, id==10782~61,
id==10791~62, id==10802~63, id==10812~64, id==10822~65, id==10831~66, id==10852~67,
id==10862~68, id==10881~69, id==10912~70, id==10942~71, id==10951~72, id==10962~73,
id==10972~74, id==10982~75, id==10992~76, id==11001~77, id==11031~78, id==11052~79,
id==11061~80, id==11072~81, id==11092~82, id==11101~83, id==11112~84, id==11171~85,
id==11192~86, id==11202~87, id==11221~88, id==11231~89, id==11252~90, id==11261~91,
id==11281~92, id==11292~93, id==11322~94, id==11332~95, id==11372~96, id==11382~97,
id==11391~98, id==11411~99, id==11422~100, id==11441~101, id==11461~102,
id==11471~103, id==11492~104, id==11501~105, id==11512~106,
id==11521~107,id==11562~108,id==11591~109, id==11601~110, id==11611~111,
id==11621~112, id==11632~113, id==11641~114, id==11651~115, id==11662~116,
id==11682~117,id==11691~118,id==11712~119, id==11771~120, id==11782~121,
id==11811~122, id==11821~123, id==11831~124, id==11841~125, id==11852~126,
id==11861~127,id==11872~128,id==11882~129, id==11892~130, id==11902~131,
id==11911~132, id==11922~133, id==11961~134, id==11972~135,
id==11992~136,id==12011~137, id==12041~138, id==12052~139, id==12061~140,
id==12081~141, id==12101~142, id==12111~143, id==12122~144, id==12131~145,
id==12142~146, id==12151~147, id==12161~148, id==12182~149, id==12191~150,
id==12201~151, id==12232~152, id==12261~153, id==12272~154, id==12322~155,
id==12332~156, id==12342~157, id==12352~158, id==12382~159, id==12392~160,
id==12401~161, id==12411~162, id==12421~163, id==12432~164, id==12441~165,
id==12451~166, id==12461~167, id==12471~168, id==12492~169, id==12501~170,
id==12512~171, id==12521~172, id==12542~173, id==12552~174, id==12562~175,
id==12572~176, id==12581~177, id==12612~178, id==12622~179, id==12652~180,
id==12662~181, id==12682~182, id==12701~183, id==12712~184, id==12731~185,
id==12741~186, id==12762~187, id==12792~188, id==12802~189, id==12811~190,
id==12822~191, id==12832~192, id==12841~193, id==12862~194, id==12882~195,
id==12891~196, id==12911~197, id==12931~198, id==12942~199, id==12952~200,
id==12961~201, id==12972~202, id==13011~203, id==13021~204, id==13032~205,
id==13042~206, id==13061~207, id==13082~208, id==13102~209, id==13111~210,
id==13132~211, id==13142~212, id==13151~213, id==13162~214, id==13191~215,
id==13202~216, id==13212~217, id==13262~218, id==13271~219, id==13281~220,
id==13311~221, id==13322~222, id==13331~223, id==13351~224, id==13361~225,
id==13372~226, id==13422~227, id==13432~228, id==13452~229, id==13462~230,
id==13472~231, id==13481~232, id==13501~233, id==13511~234, id==13521~235,
id==13561~236, id==13571~237, id==13601~238, id==13612~239, id==13632~240,
id==13642~241, id==13652~242, id==13662~243, id==13671~244, id==13681~245,
id==13691~246, id==13701~247, id==13711~248, id==13732~249, id==13742~250,
id==13752~251, id==13782~252, id==13842~253, id==13802~254, id==13822~255,
id==13851~256, id==13872~257, id==13882~258, id==13892~259, id==13912~260,
id==13921~261, id==13932~262, id==13941~263, id==13952~264, id==13971~265,
id==13981~266, id==13992~267, id==14011~268, id==14021~269, id==14031~270,
id==14041~271, id==14052~272, id==14072~273, id==14111~274, id==14131~275,
id==14162~276, id==14172~277, id==14182~278, id==14191~279, id==14212~280,
id==14222~281, id==14241~282, id==14261~283, id==14291~284, id==14302~285,
id==14312~286, id==14321~287, id==14342~288, id==14352~289, id==14362~290,
id==14371~291, id==14392~292, id==14402~293, id==14432~294, id==14451~295,
id==14472~296, id==14482~297, id==14491~298, id==14511~299, id==14521~300,
id==14531~301, id==14541~302, id==14552~303, id==14562~304, id==14572~305,
id==14581~306, id==14592~307, id==14602~308, id==14621~309, id==14632~310,
id==14641~311, id==14651~312, id==14671~313, id==14681~314, id==14692~315,
id==14712~316, id==14722~317, id==14732~318, id==14741~319, id==14751~320,
id==14781~321, id==14792~322, id==14812~323, id==14842~324, id==14852~325,
id==14862~326, id==14882~327, id==14892~328, id==14901~329, id==11012~330))
As a single expression it is just too long to be parsed. You may be better served by putting all of these values into a separate data.frame and merging it into your data instead of using a giant case_when, as sketched below.
Usually when I want to do something like this, I'll open Excel or something similar, put the column names in the first row (here that would be id and coupleID), enter all of the values, save it as a CSV, then read the CSV into R as a data.frame and merge it.
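For example, a sketch of that approach (couples.csv here is a hypothetical file holding the id/coupleID pairs):

# Read the id -> coupleID lookup table from a CSV and join it onto the data.
# "couples.csv" is a hypothetical file with two columns: id, coupleID.
library(dplyr)
couples <- read.csv("couples.csv")
dataT4 <- dataT4 %>% left_join(couples, by = "id")

This keeps the 330 pairs as data rather than code, so there is no giant expression for the parser to choke on.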
You can use rank, assuming coupleID is simply meant to number the ids in ascending order:
library(dplyr)

dataT4 <- data.frame(id = c(10011, 10021, 10382, 11012))
dataT4 <- dataT4 %>% mutate(coupleID = rank(id))
dataT4
id coupleID
1 10011 1
2 10021 2
3 10382 3
4 11012 4
Data:
dataT4 <- data.frame(id = c(10011, 10021, 10382, 11012))
I am trying to read in a CSV of this form:
2014,92,1931,6.234,10.14
2014,92,1932,5.823,9.49
2014,92,1933,5.33,7.65
2014,92,1934,4.751,6.19
2014,92,1935,4.156,5.285
2014,92,1936,3.962,4.652
2014,92,1937,3.74,4.314
2014,92,1938,3.325,3.98
2014,92,1939,2.909,3.847
2014,92,1940,2.878,3.164
To be clear, the columns are year, day of year, 2400-hour time, and two columns of values.
I have given this some thought in a previous question, but to no avail; it's proving to be a matter of a few problems... (Create an indexed datetime from date/time info in 3 columns using pandas)
As noted in the above question, the following "read_csv" attempt
df = pd.read_csv("home_prepped.dat", parse_dates={"dt": [0, 1, 2]},
                 date_parser=parser, header=None)
triggers a TypeError:
TypeError: parser() takes exactly 1 argument (3 given)
This is because the parse_dates arg passes the three listed columns to the parser as three separate arguments.
I have also tried putting them in double brackets [[0,1,2]] and get:
ValueError: [0, 1, 2] is not in list
I have gotten past this by setting parse_dates=True, thinking I could just set_index afterwards, but I get this:
TypeError: must be string, not numpy.int64
My parser gets hung up on the format too, and I have read conflicting stories about zero-padding the "day of year" value. Mine are not zero-padded, but even setting the above errors aside, I have had the format choke on the very first value, the year!
def parser(x):
    return pd.datetime.strptime(x, '%Y %j %H%M')
So yes, I have had errors saying '2014' is not recognized and '92' (day of year) is not recognized, but I have been encouraged, because at least strptime has been able to make its way "through" to try out the format.
I am wondering if this has something to do with my data.
I am looking for a way to get this datetime info indexed as a datetime, and I have had nothing but problems. I have gone ahead and zero-padded some of the Julian days in case someone wants to test whether the padding is the problem; see below:
2014,092,1931,6.234,10.14
2014,092,1932,5.823,9.49
2014,092,1933,5.33,7.65
2014,092,1934,4.751,6.19
2014,092,1935,4.156,5.285
2014,092,1936,3.962,4.652
2014,092,1937,3.74,4.314
2014,092,1938,3.325,3.98
2014,092,1939,2.909,3.847
2014,092,1940,2.878,3.164
Thanks for your help, guys; I am starting to get really frustrated here :S
After correcting your %m (month) to %M (minute), your code works for me:
>>> import pandas as pd
>>> print pd.version.version
0.15.2-10-gf7af818
>>>
>>> def parser(x):
... return pd.datetime.strptime(x, '%Y %j %H%M')
...
>>> df = pd.read_csv("home_prepped.dat", parse_dates={"dt" : [0,1,2]},
... date_parser=parser, header=None)
>>> df
dt 3 4
0 2014-04-02 19:31:00 6.234 10.140
1 2014-04-02 19:32:00 5.823 9.490
2 2014-04-02 19:33:00 5.330 7.650
3 2014-04-02 19:34:00 4.751 6.190
4 2014-04-02 19:35:00 4.156 5.285
5 2014-04-02 19:36:00 3.962 4.652
6 2014-04-02 19:37:00 3.740 4.314
7 2014-04-02 19:38:00 3.325 3.980
8 2014-04-02 19:39:00 2.909 3.847
9 2014-04-02 19:40:00 2.878 3.164
But after playing around with this for a little while, there are some very strange behaviours when an error happens, leading to some odd error messages, so I can see why it's very hard to debug this.
If for some reason the above isn't working, you could try doing the parsing yourself:
df = pd.read_csv("home_prepped.dat", header=None)
# Join the first three columns (year, day of year, HHMM) into one string per row.
timestr = df.iloc[:, :3].astype(str).apply(' '.join, axis=1)
# Keep only the two value columns.
df = df.iloc[:, 3:]
# Parse all the combined strings in one vectorized call.
times = pd.to_datetime(timestr, format='%Y %j %H%M')
df["dt"] = times
As mentioned above, when something goes wrong (e.g. a parse error) the error messages are very confusing from within read_csv.
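On the zero-padding worry from the question: as far as I know, strptime's %j accepts unpadded day-of-year values as long as the fields are delimited, so both versions of your data should parse the same way. A quick check:

from datetime import datetime

# Padded and unpadded day-of-year parse to the same timestamp
# when the fields are space-separated.
print(datetime.strptime("2014 92 1931", "%Y %j %H%M"))   # 2014-04-02 19:31:00
print(datetime.strptime("2014 092 1931", "%Y %j %H%M"))  # 2014-04-02 19:31:00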
The following seems to work, I think. Keep in mind this is the first time I have ever brought anything into pandas to work with, so I am not sure how to properly test it, but it recognizes the format and says:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-02 19:31:00, ..., 2014-12-21 23:59:00]
Length: 337917, Freq: None, Timezone: None
Which is sweet, as I believe this means I have finally indexed a datetime!
Here is what I did...
In [41]:
import numpy as np
import pandas as pd
from datetime import datetime
In [60]:
def parse(yr, yearday, hrmn):
    date_string = ''.join([yr, yearday, hrmn])
    return datetime.strptime(date_string, "%Y%j%H%M")
In [61]:
df = pd.read_csv('home_prepped.csv', parse_dates={'datetime': [0, 1, 2]},
                 date_parser=parse, index_col='datetime', header=None)
Now, I tried putting a space inside the '' before the .join; that separated the %Y and %j, but %H then only managed to pick up a "1". So I got rid of the space and made the format spaceless as well.
Thanks for your work on this DSM.