Can't web scrape nested tables using BeautifulSoup

I've been trying to web scrape this page using requests_html, requests and BeautifulSoup. Whenever I try to do it online (by using requests.get or requests_html.HTMLSession()) my code fails to find tags inside the tables, as in the example below:
r = requests.get(url=url, verify=False)
soup = BeautifulSoup(r.text, 'html.parser')
div = soup.find('div', class_='collapse-3')
print(div)
#None
This div is inside a section.
I've already tried to render the page:
with HTMLSession() as session:
    r = session.get(url=url, headers=headers)
    r.html.render()
    div = r.html.find('div.collapse-3')
    print(div)
    # None
But no success: it only works if I download the HTML manually and load it from disk.
I found a couple of existing solutions using Selenium, but they don't solve the problem, and I need a solution that does not rely on Selenium. As far as I can tell, the problem is that the information I need sits inside a dynamically loaded table, which blocks me from reaching it in the raw HTML.
Any advice is welcome! Thank you all in advance!

The table is filled in from a separate JSON request, which is why it never appears in the static HTML. Get the data straight from the source (returned as JSON), then read it into pandas:
import requests
import pandas as pd
url = 'https://www.portaltransparencia.gov.br/licitacoes/item-licitacao/resultado'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
payload = {
    'paginacaoSimples': 'false',
    'tamanhoPagina': '1000',
    'offset': '0',
    'direcaoOrdenacao': 'asc',
    'colunaOrdenacao': 'descricao',
    'colunasSelecionadas': 'codigoItemCompra,descricao,descComplementarItemCompra,quantidade,valor,cpfCnpjVencedor,nome',
    'skCompra': '205900728',
    '_': '1633966570617'}
jsonData = requests.get(url, headers=headers, params=payload, verify=False).json()
df = pd.DataFrame(jsonData['data'])
Output:
print(df)
codigoItemCompra ... descComplementarItemCompra
0 1551260500022202100059 ... ARTIGO PARA HIGIENE NO LEITO, TIPO LIMPADOR DE...
1 1551260500022202100004 ... BANDAGEM ELÁSTICA, MATERIAL ALGODÃO, TIPO AUTO...
2 1551260500022202100003 ... BANDAGEM ELÁSTICA, MATERIAL FITA MICROPOROSA, ...
3 1551260500022202100005 ... BANDAGEM ELÁSTICA, MATERIAL NÃO TECIDO POROSO,...
4 1551260500022202100012 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
5 1551260500022202100010 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
6 1551260500022202100008 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
7 1551260500022202100017 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
8 1551260500022202100009 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
9 1551260500022202100016 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
10 1551260500022202100006 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
11 1551260500022202100014 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
12 1551260500022202100015 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
13 1551260500022202100013 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
14 1551260500022202100007 ... BOLSA OSTOMIA, MATÉRIA PRIMA PLÁSTICO, APLICAÇ...
15 1551260500022202100018 ... BOTA DE UNNA, COMPOSIÇÃO IMPREGNADA COM PASTA ...
16 1551260500022202100019 ... CINTA ELÁSTICA, MATERIAL POLIÉSTER, TIPO ABDOM...
17 1551260500022202100020 ... CLAMP INSTRUMENTAL, MODELO VASCULAR- BULLDOG, ...
18 1551260500022202100056 ... COBERTURA, TIPO DE COBERTURA FILME TRANSPARENT...
19 1551260500022202100024 ... COBERTURA, TIPO DE COBERTURA FILME TRANSPARENT...
20 1551260500022202100045 ... CREME HIDRATANTE, COMPOSIÇÃO URÉIA E ÁCIDO LÁT...
21 1551260500022202100022 ... CURATIVO, MATERIAL POLIURETANO, DIMENSÃO CERCA...
22 1551260500022202100035 ... CURATIVO, TIPO HIDROCOLÓIDE, MATERIAL POLIURET...
23 1551260500022202100049 ... CURATIVO, TIPO HIDROGEL, REVESTIMENTO REVESTID...
24 1551260500022202100038 ... CURATIVO, TIPO HIDROPOLÍMERO, REVESTIMENTO REC...
25 1551260500022202100057 ... CURATIVO, MATERIAL POLIÉSTER, REVESTIMENTO REV...
26 1551260500022202100040 ... CURATIVO, MATERIAL ACETATO DE CELULOSE, REVEST...
27 1551260500022202100046 ... CURATIVO, TIPO HIDROCOLÓIDE, MATERIAL POLIURET...
28 1551260500022202100034 ... CURATIVO, TIPO MEMBRANA COM MICROPARTICULA DE ...
29 1551260500022202100047 ... CURATIVO, TIPO HIDROCOLÓIDE, MATERIAL POLIURET...
30 1551260500022202100037 ... CURATIVO, MATERIAL NÃO TECIDO, REVESTIMENTO PR...
31 1551260500022202100036 ... CURATIVO, TIPO HIDROGEL, REVESTIMENTO REVESTID...
32 1551260500022202100025 ... CURATIVO, TIPO HIDROGEL, MATERIAL POLIURETANO ...
33 1551260500022202100063 ... CURATIVO, TIPO HIDROCOLÓIDE, MATERIAL POLIURET...
34 1551260500022202100051 ... CURATIVO, TIPO HIDROGEL, REVESTIMENTO COM ALGI...
35 1551260500022202100041 ... CURATIVO, MATERIAL POMADA, REVESTIMENTO C/ CAD...
36 1551260500022202100023 ... CURATIVO, TIPO HIDROPOLÍMERO, MATERIAL POLIURE...
37 1551260500022202100039 ... CURATIVO, MATERIAL NÃO TECIDO, REVESTIMENTO PR...
38 1551260500022202100029 ... LENÇO DESCARTÁVEL, MATERIAL POLIPROPILENO E CE...
39 1551260500022202100031 ... PELÍCULA ADESIVA, MATERIAL ADESIVO ACRÍLICO HI...
40 1551260500022202100001 ... PELÍCULA ADESIVA, MATERIAL ADESIVO ACRÍLICO HI...
41 1551260500022202100002 ... PLACA PERIOSTOMAL, MATERIAL DA PLACA HIDROCOLÓ...
42 1551260500022202100061 ... POLIHEXANIDA, COMPOSIÇÃO ASSOCIADA À UNDECILAM...
43 1551260500022202100021 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM CREME, COM...
44 1551260500022202100033 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM CREME, COM...
45 1551260500022202100032 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM CREME, COM...
46 1551260500022202100055 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM PÓ, COMPOS...
47 1551260500022202100028 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO PLACA, MATERI...
48 1551260500022202100060 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM CREME, COM...
49 1551260500022202100054 ... PROTETOR CUTÂNEO, ASPECTO FÍSICO EM SPRAY, COM...
50 1551260500022202100050 ... SOLUÇÃO, TIPO À BASE DE BIGUANIDA (PHMB), CONC...
51 1551260500022202100027 ... TERAPIA DE PRESSÃO NEGATIVA P/ FERIDAS, TIPO C...
52 1551260500022202100030 ... ÁCIDOS GRAXOS ESSENCIAIS, COMPOSIÇÃO TCM, COMP...
[53 rows x 10 columns]
Without a separate payload dict (the query string baked into the URL):
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36'}
url = 'https://www.portaltransparencia.gov.br/licitacoes/item-licitacao/resultado?paginacaoSimples=false&tamanhoPagina=1000&offset=0&direcaoOrdenacao=asc&colunaOrdenacao=descricao&colunasSelecionadas=codigoItemCompra%2Cdescricao%2CdescComplementarItemCompra%2Cquantidade%2Cvalor%2CcpfCnpjVencedor%2Cnome&skCompra=205900728&_=1633967219449'
jsonData = requests.get(url, headers=headers, verify=False).json()
df = pd.DataFrame(jsonData['data'])
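If a bid ever returned more rows than tamanhoPagina, you could page through the endpoint by bumping offset. A minimal sketch under assumptions the site doesn't document: the loop stops on the first empty data array, and the trailing _ parameter (a jQuery-style cache buster) is dropped as unnecessary.

import requests
import pandas as pd

url = 'https://www.portaltransparencia.gov.br/licitacoes/item-licitacao/resultado'
headers = {'User-Agent': 'Mozilla/5.0'}
page_size = 1000
rows, offset = [], 0
while True:
    payload = {
        'paginacaoSimples': 'false',
        'tamanhoPagina': page_size,
        'offset': offset,
        'direcaoOrdenacao': 'asc',
        'colunaOrdenacao': 'descricao',
        'colunasSelecionadas': 'codigoItemCompra,descricao,descComplementarItemCompra,quantidade,valor,cpfCnpjVencedor,nome',
        'skCompra': '205900728'}
    batch = requests.get(url, headers=headers, params=payload, verify=False).json().get('data', [])
    if not batch:
        break  # assumed: an empty page means everything has been read
    rows.extend(batch)
    offset += page_size
df = pd.DataFrame(rows)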

Related

How do I compute a sample data set from my original data set?

Here is a snippet of the data set (str() output):
$ HOMEMTOT : int 4278 2389 1264 3249 6048 1705 5304 11091 1289 13347 ...
$ MULHERTOT : int 4213 2298 1278 3170 5667 1541 4931 11388 1254 11777 ...
$ pesoRUR : int 4464 1649 1588 1369 9269 885 2886 12910 221 10895 ...
$ pesotot : int 8491 4687 2542 6419 11715 3246 10235 22479 2543 25124 ...
$ pesourb : int 4027 3038 954 5050 2446 2361 7349 9569 2322 14229 ...
$ Estados : Factor w/ 26 levels "AC","AL","AM",..: 2 10 22 25 10 25 12 6 17 12 ...
I intend to run a PCA and LASSO regression, but the issue is that I am supposed to set aside a sample covering 10 municipalities (the "Estados" factor seen above). How would I make a sample set of, say, 300 observations based on 10 random "Estados"?
You could use:
library(tidyverse)

df %>%
  filter(Estados %in% sample(unique(Estados), 10)) %>%  # 10 distinct Estados, not 10 rows
  group_by(Estados) %>%
  slice_sample(n = 30) %>%                              # 30 rows per Estado -> up to 300 total
  ungroup()
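A quick sanity check on a made-up frame (the Estados values and counts here are invented just for the demo):

library(tidyverse)
set.seed(42)

# toy stand-in for the real data: 26 "Estados" with 100 rows each
df <- tibble(Estados = factor(rep(LETTERS, each = 100)),
             pesotot = rpois(2600, 5000))

sampled <- df %>%
  filter(Estados %in% sample(unique(Estados), 10)) %>%
  group_by(Estados) %>%
  slice_sample(n = 30) %>%
  ungroup()

nrow(sampled)                 # 300
n_distinct(sampled$Estados)   # 10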

Download excel file from URL and read it with `read_xlsx`

I am trying to download a particularly messy .xlsx file from a URL to a local directory and then read this file using read_xlsx.
# Download file into directory
my_url <- 'https://docs.google.com/spreadsheets/d/0Bw4a10rhk2QqaTZkUmQwaXU4aEE/edit?resourcekey=0-RQa9gRpFX0x3z5bSJGn0Dg#gid=1944035140'
download.file(url=my_url, destfile='./dat/df.xlsx')
# Load file
df <- read_xlsx('./dat/df.xlsx')
This last line throws the following error:
Error: Evaluation error: zip file '/Users/... some path .../dat/df.xlsx' cannot be opened.
I believe this is happening because download.file() is corrupting the file format somehow. Similar issues have been solved elsewhere with mode='wb', but that did not help here.
Could you help me download the file without corrupting it, so I can then read it with read_xlsx?
As an additional request, I would like to use as few external dependencies as possible (that's the reason I tried download.file()).
Indeed, the link takes you to a Google Docs editing page, not to the file itself, so downloading that URL saves HTML rather than an .xlsx. You cannot download the file that way; save it to your hard drive instead. Below is a function that reads the data from a file already on disk; maybe it will be useful to you.
library(tidyverse)
library(readxl)

urlFile = "https://docs.google.com/spreadsheets/d/1SF0PkBz9BR4yqiQ27Bt5OsD33Y8Rt5lh/edit?usp=sharing&ouid=107152468748636733235&rtpof=true&sd=true"
xlsFile = "refugios_nayarit.xlsx"
# only works if the link serves the raw file; otherwise save the file manually
download.file(url = urlFile, destfile = xlsFile, mode = "wb")

fReadXls = function(xlsFile, sheet) {
  data = read_excel(
    xlsFile, sheet = sheet, skip = 6,
    col_names = c("No.", "REFUGIO", "MUNICIPIO", "DIRECCIÓN", "USO DEL INMUEBLE",
                  "SERVICIOS", "CAPACIDAD DE PERSONAS", "COORD. LATITUD L",
                  "COORD. LATITUD W", "COORD. ALTITUD MSNM", "RESPONSABLE",
                  "TELÉFONO"))
  data %>% slice_head(n = nrow(.) - 1)  # drop the last row (a trailing footer)
}

df = tibble(sheet = excel_sheets(xlsFile)) %>%
  mutate(data = map(sheet, ~ fReadXls(xlsFile, .x)))
df$data[[1]]
Output:
# A tibble: 20 x 12
No. REFUGIO MUNICIPIO DIRECCIÓN `USO DEL INMUEB~ SERVICIOS `CAPACIDAD DE PE~ `COORD. LATITUD~ `COORD. LATITUD~
<dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 1 PRIMARIA LAB~ ACAPONETA LOPEZ RAYON EDUCACION AGUA, SANITA~ 200 "22°29'56.06\"" "105°21'37.27\""
2 2 JARDIN DE NI~ ACAPONETA ALDAMA ESQ C~ EDUCACION AGUA, SANITA~ 100 "22°29'53.14\"" "105°21'29.48\""
3 3 PRIMARIA CAR~ ACAPONETA E. CARRANZA EDUCACION AGUA, SANITA~ 200 "22o30'00.43\"" "105°21'37.46\""
4 4 PRIMARIA LAZ~ ACAPONETA AMADO NERVO EDUCACION AGUA, SANITA~ 100 "22°29'27.17\"" "105°21'39.68"
5 5 PRIMARIA H. ~ ACAPONETA VERACRUZ No.~ EDUCACION AGUA, SANITA~ 150 "22o29'40.21\"" "105a21'40.23\""
6 6 PRIMARIA MIG~ ACAPONETA MORELOS Y OA~ EDUCACION AGUA, SANITA~ 200 "22o29'23.26\"" "105a21'41.99\""
7 7 PRIMARIA CEN~ ACAPONETA MATAMOROS Y ~ EDUCACION AGUA, SANITA~ 250 "22o29'37.31\"" "105a21'33.33\""
8 8 SINDICATO CTM ACAPONETA QUERETARO Y ~ GREMIO SINDICAL AGUA, SANITA~ 100 "22°29'39.32\"" "105°21'46.60"
9 9 ESTADIO MUNI~ ACAPONETA JUAN ESCUTIA DEPORTE AGUA, SANITA~ 300 "22a29'55.10\"" "105a21'52.29\""
10 10 CASA DE LA C~ ACAPONETA MORELOS CULTURAL AGUA, SANITA~ 300 "22a29'20.78\"" "105a21'46.46\""
11 11 CENTRO RECRE~ ACAPONETA México ENTRE~ RECREATIVO AGUA, SANITA~ 400 "22a29'39.76\"" "105a21'37.87\""
12 12 CENTRO RECRE~ ACAPONETA VERACRUZ No15 RECREATIVO AGUA, SANITA~ 400 "22a29'30.03\"" "105a21'39.47\""
13 13 IGLESIA CRIS~ ACAPONETA VERACRUZ No ~ RELIGIOSO AGUA, SANITA~ 50 "22a29'41.60\"" "105a21'40.35\""
14 14 ESCUELA FRAY~ AHUACATLAN 20 DE NOVIEM~ EDUCACION AGUA, SANITA~ 80 "21a03'06.07\"" "104a29'03.50\""
15 15 SECUNDARIA F~ AHUACATLAN 20 DE NOVIEM~ EDUCACION AGUA, SANITA~ 250 "21a03'18.33\"" "104a28'56.26\""
16 16 ESCUELA JOSE~ AHUACATLAN MORELOS Y MA~ EDUCACION AGUA, SANITA~ 200 "21a03'04.55\"" "104a29'12.67\""
17 17 ESCUELA PREP~ AHUACATLAN CALLE EL SAL~ EDUCACION AGUA, SANITA~ 200 "21a02'57.01\"" "104a29'16.71\""
18 18 ESCUELA PLAN~ AHUACATLAN OAXACA E HID~ EDUCACION AGUA, SANITA~ 200 "21a03'02.43\"" "104a28'58.82\""
19 19 UNIDAD ACADE~ AHUACATLAN CARR A GUADA~ EDUCACION AGUA, SANITA~ 200 "21a03'28.20\"" "104a29'06.67\""
20 20 CLUB SOCIAL ~ AHUACATLAN 20 DE NOVIEM~ DEPORTE AGUA, SANITA~ 400 "21a03'07.37\"" "104a29'01\"57\~
# ... with 3 more variables: COORD. ALTITUD MSNM <dbl>, RESPONSABLE <chr>, TELÉFONO <chr>
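As an aside, sheets hosted on Google Sheets can often be fetched directly by rewriting the editor URL into an export URL, keeping the dependency footprint at base R plus readxl. A sketch, assuming the sheet is shared for link access (URL rewritten from the sheet ID used in the answer above):

library(readxl)

# assumption: .../d/<id>/edit?... rewritten to .../d/<id>/export?format=xlsx
# serves the raw workbook for link-shared sheets
my_url <- "https://docs.google.com/spreadsheets/d/1SF0PkBz9BR4yqiQ27Bt5OsD33Y8Rt5lh/export?format=xlsx"
download.file(url = my_url, destfile = "refugios_nayarit.xlsx", mode = "wb")
df <- read_xlsx("refugios_nayarit.xlsx")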

bs4 Attribute Error while scraping table python

I am trying to scrape a table using bs4, but whenever I iterate over the <tbody> elements I get the following error:
Traceback (most recent call last):
  File "f:\Python Programs\COVID-19 Notifier\main.py", line 28, in <module>
    for tr in soup.find('tbody').findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
I am new to bs4 and have faced this error many times before. Below is the code I am using. Any help would be greatly appreciated, as this is an official project to be submitted in a competition and the deadline is near. Thanks in advance. Versions: beautifulsoup4==4.8.2, bs4==0.0.4, soupsieve==2.0.
My code:
from plyer import notification
import requests
from bs4 import BeautifulSoup
import time

def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

def getData(url):
    r = requests.get(url)
    return r.text

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = getData('https://www.mohfw.gov.in/')
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        # print(soup.prettify())
        myDataStr = ""
        for tr in soup.find('tbody').findAll('tr'):
            myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[2]} & Foreign : {dataList[3]}\nCured : {dataList[4]}\nDeaths : {dataList[5]}"
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)
This line raises the error:
for tr in soup.findAll('tbody').findAll('tr'):
You can only call find_all() on a single tag, not on the ResultSet returned by another find_all(). (findAll is the same as find_all; the latter is preferred because it follows the PEP 8 naming convention.)
According to the documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
If you're looping through a single table, simply replace the first findAll with find. If there are multiple tables, store the result set in a variable, loop through it, and call find_all on each individual tag.
This should fix it:
for tr in soup.find('tbody').find_all('tr'):
Multiple tables:
tables = soup.find_all('tbody')
for table in tables:
    for tr in table.find_all('tr'):
        ...
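Also worth noting: find() returns None when nothing matches, which is exactly what produces the NoneType error above. It is safer to guard before chaining; a small sketch (the variable names are mine):

tbody = soup.find('tbody')
if tbody is None:
    print('no <tbody> in the parsed HTML')  # e.g. it only exists inside a comment, as shown below
else:
    for tr in tbody.find_all('tr'):
        print(tr.get_text())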
There are a few issues here.
The <tbody> tag is inside an HTML comment. BeautifulSoup skips comments unless you specifically pull them out.
Why bother with the getData() function? It's just one line; you could inline it. The extra function doesn't add efficiency or readability.
Even once you pull the <tbody> tag, your dataList doesn't have 6 items (you index dataList[5], which will throw an error). I adjusted it below, but I don't know if those are the correct numbers. The headers for the data you are pulling are ['S. No.','Name of State / UT','Active Cases*','Cured/Discharged/Migrated*','Deaths**'], so I don't know what Indian : {dataList[2]} & Foreign : were supposed to be.
With that said, is the <tbody> data even correct? The site also publishes the numbers as JSON, and they don't match what's in the <tbody>.
So, here's how to get that other data source... maybe it's more accurate?
import requests
import pandas as pd
jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)
Output:
print(df.to_string())
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 2 Andaman and Nicobar Islands 153 5527 5309 65 146 5569 5358 65 35
1 1 Andhra Pradesh 66944 997462 922977 7541 74231 1009228 927418 7579 28
2 3 Arunachal Pradesh 380 17296 16860 56 453 17430 16921 56 12
3 4 Assam 11918 231069 217991 1160 13942 233453 218339 1172 18
4 5 Bihar 69869 365770 293945 1956 76420 378442 300012 2010 10
5 6 Chandigarh 4273 36404 31704 427 4622 37232 32180 430 04
6 7 Chhattisgarh 121555 605568 477339 6674 123479 622965 492593 6893 22
7 8 Dadra and Nagar Haveli and Daman and Diu 1668 5910 4238 4 1785 6142 4353 4 26
8 10 Delhi 91618 956348 851537 13193 92029 980679 875109 13541 07
9 11 Goa 10228 72224 61032 964 11040 73644 61628 976 30
10 12 Gujarat 92084 453836 355875 5877 100128 467640 361493 6019 24
11 13 Haryana 58597 390989 328809 3583 64057 402843 335143 3643 06
12 14 Himachal Pradesh 11859 82876 69763 1254 12246 84065 70539 1280 02
13 15 Jammu and Kashmir 16094 154407 136221 2092 16993 156344 137240 2111 01
14 16 Jharkhand 40942 184951 142294 1715 43415 190692 145499 1778 20
15 17 Karnataka 196255 1247997 1037857 13885 214330 1274959 1046554 14075 29
16 18 Kerala 156554 1322054 1160472 5028 179311 1350501 1166135 5055 32
17 19 Ladakh 2041 12937 10761 135 2034 13089 10920 135 37
18 20 Lakshadweep 803 1671 867 1 920 1805 884 1 31
19 21 Madhya Pradesh 84957 459195 369375 4863 87640 472785 380208 4937 23
20 22 Maharashtra 701614 4094840 3330747 62479 693632 4161676 3404792 63252 27
21 23 Manipur 513 30047 29153 381 590 30151 29180 381 14
22 24 Meghalaya 1133 15488 14198 157 1238 15631 14236 157 17
23 25 Mizoram 608 5220 4600 12 644 5283 4627 12 15
24 26 Nagaland 384 12800 12322 94 457 12889 12338 94 13
25 27 Odisha 32963 388479 353551 1965 36718 394694 356003 1973 21
26 28 Puducherry 5923 50580 43931 726 6330 51372 44314 728 34
27 29 Punjab 40584 319719 270946 8189 43943 326447 274240 8264 03
28 30 Rajasthan 107157 467875 357329 3389 117294 483273 362526 3453 08
29 31 Sikkim 640 6970 6193 137 693 7037 6207 137 11
30 32 Tamil Nadu 89428 1037711 934966 13317 95048 1051487 943044 13395 33
31 34 Telengana 52726 379494 324840 1928 58148 387106 326997 1961 36
32 33 Tripura 563 34302 33345 394 645 34429 33390 394 16
33 35 Uttarakhand 26980 138010 109058 1972 29949 142349 110379 2021 05
34 36 Uttar Pradesh 259810 976765 706414 10541 273653 1013370 728980 10737 09
35 37 West Bengal 68798 700904 621340 10766 74737 713780 628218 10825 19
36 11111 2428616 16263695 13648159 186920 2552940 16610481 13867997 189544
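To tie this back to the notifier: rather than splitting strings out of the HTML, you could filter this frame to the states you care about. A short sketch (column names taken from the JSON output above; the message format is my own, and print stands in for the OP's notifyMe):

import requests
import pandas as pd

jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)

states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
for row in df[df['state_name'].isin(states)].itertuples():
    # build the same kind of message the OP's notifyMe() expects
    nText = f"State {row.state_name}\nActive : {row.active}\nCured : {row.cured}\nDeaths : {row.death}"
    print(nText)  # or notifyMe('Cases of Covid-19', nText)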
Here's your code, adjusted to pull the table out of the comments:
Code:
from plyer import notification
import requests
from bs4 import BeautifulSoup, Comment
import time

def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )

if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = requests.get('https://www.mohfw.gov.in/').text
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        # the table sits inside an HTML comment, so collect the comments first
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        myDataStr = ""
        for each in comments:
            if 'tbody' in str(each):
                soup = BeautifulSoup(each, 'html.parser')
                for tr in soup.find('tbody').findAll('tr'):
                    myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh', 'Meghalaya']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[0]} & Foreign : {dataList[2]}\nCured : {dataList[3]}\nDeaths : {dataList[4]}"  # <-- indices adjusted
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)

summary() on a dataframe shows me the most frequent values of a character field. When I apply substr() to it, I get "useless" information instead

I'm a beginner with R.
I have loaded an Excel sheet into a dataframe.
summary() shows me this information:
summary(dat)
APE LIBELLE EFFECTIF
8110Z :638 Activités combinées de soutien lié aux bâtiments :638 1,5 :664
2370Z : 8 Commerce de gros de bois et de matériaux de construction: 8 4 : 57
4511Z : 8 Commerce de voitures et de véhicules automobiles légers : 8 34,5 : 37
4673A : 8 Hôtels et hébergement similaire : 8 14,5 : 36
5510Z : 8 Taille, façonnage et finissage de pierres : 8 7,5 : 24
2363Z : 6 Fabrication de béton prêt à l'emploi : 6 74,5 : 17
(Other):181 (Other) :181 (Other): 22
The APE code (which I think is now the European NACE code, but the field has an old name) is too detailed at five characters. I run this statement to keep only its first two characters:
dat$APE <- substr(dat$APE, 1, 2)
Then, the summary command doesn't show me the result I expected :
summary(dat)
APE LIBELLE EFFECTIF
Length:857 Activités combinées de soutien lié aux bâtiments :638 1,5 :664
Class :character Commerce de gros de bois et de matériaux de construction: 8 4 : 57
Mode :character Commerce de voitures et de véhicules automobiles légers : 8 34,5 : 37
Hôtels et hébergement similaire : 8 14,5 : 36
Taille, façonnage et finissage de pierres : 8 7,5 : 24
Fabrication de béton prêt à l'emploi : 6 74,5 : 17
(Other) :181 (Other): 22
I was expecting 23, 45, 46, 55, 81... in the APE column.
I can't figure out where the problem comes from, since when I run a head() command, everything looks fine.
head(dat)
APE LIBELLE EFFECTIF
1 02 Exploitation forestière 4
2 08 Extraction pierres ornement. construc. calcaire industriel, gypse 14,5
3 08 Exploit gravieres & sablieres, extraction argiles & kaolin 34,5
4 10 Préparation industrielle de produits à base de viande 4
5 10 Préparation industrielle de produits à base de viande 7,5
6 10 Transformation et conservation de fruits 34,5
Regards,
@Roland, thanks for your good answer.
dat$APE <- as.factor(substr(dat$APE, 1, 2))
succeeded in converting the string back to a factor, which corrected my problem.
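For the record, the underlying cause: substr() returns a character vector, and summary() on a character vector only reports length, class, and mode; converting back to a factor restores the frequency counts. A tiny demo:

x <- factor(c("8110Z", "8110Z", "2370Z"))
summary(substr(x, 1, 2))             # Length:3  Class:character  Mode:character
summary(as.factor(substr(x, 1, 2)))  # counts again: 23 -> 1, 81 -> 2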

R - Apriori function error

I'm currently having problems with the apriori function. The thing is I have a csv with data like the following:
Desc,Cantidad,Valor,Fecha,Lugar,UUID
DESCUENTO,1,-3405,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-3405,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-170,2014-09-05T15:10:24,83000,7F0C7F0B-BCFC-4FCA-8740-B36AE9932869
Descuento de TYK Dia,1,-156,2014-06-19T16:52:27,86280,1E08E51E-213A-4EE0-8FE9-492E677FF0C9
Descuento de TYK Dia,1,-139,2014-04-25T10:52:44,86280,AB802E63-2D0D-4B47-AB70-DDE007929F9F
DESCUENTO,1,-63,2014-07-04T13:53:10,83000,5B1F12BB-71DE-4734-A774-8D377757A880
REDONDEO,1,-1,2014-03-29T10:50:59,0,5B241EFA-6654-46EA-B47A-3CB76C5EA923
DESCUENTO,1,-1,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
DESCUENTO,1,-1,2014-10-04T14:02:57,53100,7F74AFC0-FC28-4105-89A5-CD99416B50C7
LAVADO,1,0,2014-05-27T18:18:11,44500,e5d540d6-0f98-4993-ec09-56887cd4a27d
TUA,1,0,2014-09-29T10:20:31,6500,1d8ada06-a8a1-4bd8-9356-851b5da28108
Transportación Aerea,1,0,2014-10-03T10:41:09,6500,5fc3925a-d08a-4cdc-be7e-ca02bd488d5b
OBSEQUIO LAVADO DE CARROCERIA,1,0,2014-04-07T13:45:55,91800,8148ab07-5804-4b2b-b37c-5323b394907a
Arroz Al Azafran Combos A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
Frijoles Charros A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
Pepsi Ch A,1,0,2014-08-19T11:50:34,11520,f09c23e6-dc60-4aaf-a1b8-1506d38f3585
FECHA DE CONSUMO 18/07/2014,1,0,2014-07-19T18:01:45,6060,0f0465aa-a75b-4f95-8e3b-43c13452cafb
CAMBIO DE ACEITE DE MOTOR,1,0,2014-02-01T11:18:53,39890,5BDF0742-CDF5-4F6B-9937-DF1CB00274ED
CAMBIO DE FILTRO DE ACEITE,1,0,2014-02-01T11:18:53,39890,5BDF0742-CDF5-4F6B-9937-DF1CB00274ED
The whole CSV is here: https://github.com/antonio1695/BaseX/blob/master/facturas1.csv
(on that GitHub page, use "Find file" and open the file to download it).
So what I did was:
> df1 <- read.csv("facturas1.csv")
> rules <- apriori(df1,parameter=list(support=0.01,confidence=0.5))
Error in asMethod(object) :
column(s) 3 not logical or a factor. Discretize the columns first.
However, the columns are already discrete, and if I swap columns 2 and 3, it still complains about column 3 when it should then complain about column 2 instead. Thanks!
library(arules)
df1 <- read.csv("https://raw.githubusercontent.com/antonio1695/BaseX/master/facturas1.csv")
trans <- as(df1, "transactions")
Error in asMethod(object) :
column(s) 3 not logical or a factor. Discretize the columns first.
Let's look at the data frame:
str(df1)
'data.frame': 10510 obs. of 6 variables:
$ Desc : Factor w/ 3927 levels "0","00000215R0 - LIQUIDO DE FRENOS",..: 1490 1490 1490 1491 1491 1490 3209 1490 1490 2238 ...
$ Cantidad: Factor w/ 85 levels "","1","-1","10",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Valor : int -3405 -3405 -170 -156 -139 -63 -1 -1 -1 0 ...
$ Fecha : Factor w/ 4054 levels "1294","2014-01-06T11:10:21",..: 4041 4041 3443 1794 596 2125 241 4041 4041 1215 ...
$ Lugar : Factor w/ 982 levels "","0","1000",..: 487 487 802 848 848 802 2 487 487 373 ...
$ UUID : Factor w/ 4056 levels "0019A60D-78F8-E341-8D3E-9786201FE017",..: 1988 1988 1979 456 2711 1423 1424 1988 1988 3658 ...
Valor is a number (int) and needs to be discretized! For example with discretize():
df1$Valor <- discretize(df1$Valor)
head(df1$Valor)
[1] [-3405, 2400) [-3405, 2400) [-3405, 2400) [-3405, 2400) [-3405, 2400)
[6] [-3405, 2400)
Levels: [-3405, 2400) [ 2400, 8204) [ 8204,14009]
Now you can create transactions and apply apriori():
trans <- as(df1, "transactions")
rules <- apriori(trans,parameter=list(support=0.01,confidence=0.5))
rules
set of 84 rules
After some research I found that apriori() needs numeric columns binned into intervals to work properly, so when you use discretize() you can pass the categories parameter to choose how many intervals you want; it isn't possible to skip the binning. I'll post the code here:
I decided on 20 intervals, binned by how often the values occur (frequency-based binning puts roughly the same number of observations in each interval).
df$Valor <- discretize(df$Valor, method="frequency", categories = 20)
Hope it helps somebody.
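Two footnotes worth checking against your installed arules version: newer releases renamed the categories argument of discretize() to breaks, and sort() plus inspect() are handy for viewing the mined rules. A sketch under those assumptions:

library(arules)

df1$Valor <- discretize(df1$Valor, method = "frequency", breaks = 20)
trans <- as(df1, "transactions")
rules <- apriori(trans, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 5))  # top 5 rules by lift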
