I am trying to read multiple CSV files (about 300) with the function fread in R.
When I open one of the CSV files in Excel, the columns are delimited correctly, even when some observations contain commas.
When I try to read one of the files, the function doesn't read all the observations in the file and the following warning appears:
> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)
Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", :
Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>
Therefore I can't read the whole file. I suspect it is because one of the observations contains commas, like "PLOMERIA, TUBO DE COBRE, DE 60 MTS", but I'm not sure.
How can I fix this without fixing each CSV file one by one?
Here's the file that I'm using in the example, but as I said, I need to read multiple files like this:
https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing
The file is corrupt in two places: lines 1073 and 3401 have embedded quotes. But there's another problem here ... read down to the second section, "fread and double-double-quotes", for the problem with fread.
(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)
Corrupted lines
Scroll right to see the problems.
Line 1073:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
Line 3401:
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
The best fix is to get whatever person/process exported this to export compliant CSV.
Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).
sed -i \
-e 's/", PZA/"", PZA/g' \
-e 's/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
"INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).
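As a small illustration of the doubling rule (a sketch in R, separate from the sed fix): base R's write.csv doubles embedded quotes by default, and read.csv turns them back into a single quote. The data frame and its column name are made up for the example.

```r
# Hypothetical one-row data frame whose value contains an embedded double quote.
df <- data.frame(spec = 'TUBO DE PVC, REFORZADO, 4", PZA 6 MTS',
                 stringsAsFactors = FALSE)

# write.csv escapes the embedded quote by doubling it.
csv_lines <- capture.output(write.csv(df, row.names = FALSE))
print(csv_lines)

# Reading the text back collapses "" into a single " again.
round_trip <- read.csv(text = csv_lines, stringsAsFactors = FALSE)
print(round_trip$spec)
```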
In this case, the sed command finds the two very specific failing strings and adds the second quote to each.
-i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
-e adds a sed script/command; multiple commands can be given.
s/from/to/g means to replace the pattern from with the string in to; the g means "global", i.e. every match on a line rather than only the first.
This changes the two lines (shown one after the other here for simplicity):
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^^^^^-- the changes, double-double quotes
FYI: if you don't have sed in the path and you're running Windows, look in the RTools40 path; for me, it is c:/rtools40/usr/bin/sed.exe. If you're on macOS or Linux and cannot find sed, well ... that's odd.
After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.
csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
# Año Mes Fecha_Pub_DOF Clave ciudad Nombre ciudad División
# <int> <int> <char> <int> <char> <char>
# 1: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 2: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 3: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]
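Since the question involves about 300 files, the same two substitutions could also be applied in R before parsing, without touching the files on disk. The following is a minimal sketch under assumptions: fix_quotes and read_fixed are hypothetical helpers, the two patterns are only the ones identified above (other files may be broken in other ways), the encoding default is a guess, and read.csv is used for parsing. The demo writes a small temporary file with made-up ASCII column names instead of using the real data.

```r
# Hypothetical helper: double the two stray quotes identified above.
# Apply it once to raw lines only; running it on already-fixed lines
# would add yet another quote.
fix_quotes <- function(lines) {
  lines <- gsub('", PZA', '"", PZA', lines, fixed = TRUE)
  gsub('BARRA DE 1/2"', 'BARRA DE 1/2""', lines, fixed = TRUE)
}

# Hypothetical helper: read one file, repair the lines in memory, then parse.
# The encoding default is an assumption; "latin1" may be needed instead.
read_fixed <- function(path, skip = 5, encoding = "native.enc") {
  con <- file(path, encoding = encoding)
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  read.csv(text = fix_quotes(lines), skip = skip, stringsAsFactors = FALSE)
}

# Demo on a temporary file: five preamble lines, a header, two broken rows.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "meta 1", "meta 2", "meta 3", "meta 4", "meta 5",
  '"Consecutivo","Especificacion","Precio"',
  '"001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55"',
  '"003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76"'
), demo_path)

demo_data <- read_fixed(demo_path)
print(demo_data$Especificacion)
```

For the real files, something like lapply(list.files("Datos/Datos_precios", pattern = "\\.csv$", full.names = TRUE, ignore.case = TRUE), read_fixed) would apply the same repair to every file in the folder.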
fread and double-double-quotes
The problem with the above is that while it seems to have worked correctly, fread (still) does not handle embedded quotes correctly: it keeps the doubled "" in the value instead of collapsing it to a single ". If you need the embedded quotes in your data to come through exactly as intended, then unfortunately you cannot use fread.
Why?
str(csv[1067,])
# Classes 'data.table' and 'data.frame': 1 obs. of 17 variables:
# $ Año : int 2020
# $ Mes : int 6
# $ Fecha_Pub_DOF : chr "20/07/2020 12:00:00 a. m."
# $ Clave ciudad : int 12
# $ Nombre ciudad : chr "San Luis Potosí, S.L.P."
# $ División : chr "3. Vivienda"
# $ Grupo : chr "3.1. Costo de uso de vivienda"
# $ Clase : chr "3.1.1. Costo de uso de vivienda"
# $ Subclase : chr "42 Vivienda propia"
# $ Clave genérico : int 140
# $ Genérico : chr "Productos para reparación menor de la vivienda"
# $ Consecutivo : int 1
# $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
# $ Precio promedio: num 232
# $ Cantidad : int 1
# $ Unidad : chr "PZA"
# $ Estatus : chr ""
# - attr(*, ".internal.selfref")=<externalptr>
Namely, see
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
^^^^ should only be a single "
Fortunately, read.csv works fine here:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"
FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.
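If you would rather keep fread and leave the files' content untouched, another option is to collapse the doubled quotes after loading. This is a sketch, assuming the data never contains a legitimate pair of consecutive double quotes; collapse_doubled_quotes is a hypothetical helper written for a plain data frame (a data.table could be converted with as.data.frame first), and the small data frame below only mimics the value shown above.

```r
# Hypothetical helper: replace "" with " in every character column.
collapse_doubled_quotes <- function(df) {
  for (nm in names(df)) {
    if (is.character(df[[nm]])) {
      df[[nm]] <- gsub('""', '"', df[[nm]], fixed = TRUE)
    }
  }
  df
}

# Stand-in for what fread returned for row 1067 (only three columns kept).
loaded <- data.frame(
  Consecutivo = 1L,
  Especificacion = 'PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS',
  Estatus = "",
  stringsAsFactors = FALSE
)

cleaned <- collapse_doubled_quotes(loaded)
print(cleaned$Especificacion)
```

Empty strings and non-character columns are left as they are.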
The file you linked is properly quoted.
It has 5 lines of non-CSV data though, so skip these:
csv = read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = TRUE, skip = 5, fileEncoding = "Latin1")
This works fine for me.
I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?
I am again here with an interesting problem.
I have a document like shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract date highlighted in bold and Italics.
I tried with the strptime function but did not get the desired results.
Any help will be greatly appreciated.
Thanks in advance.
Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
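One caveat with sub: when the pattern does not match, it returns the input unchanged, so a document without a date would silently come back as the whole text. A minimal sketch of a stricter variant, assuming a single input string and the same date layout; extract_first_date is a hypothetical helper name:

```r
# Hypothetical helper: return the first DD-MON-YYYY date, or NA if none is found.
extract_first_date <- function(x) {
  m <- regexpr("\\b\\d{2}-[A-Z]+-\\d{4}\\b", x)
  if (m[1] == -1) {
    return(NA_character_)  # no match: signal it instead of echoing the input
  }
  regmatches(x, m)
}

found   <- extract_first_date("ag 29-MAY-2019 14:02:23 [i")
missing <- extract_first_date("Thank You! Visit Again")
print(found)
print(missing)
```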
If you need to match multiple such dates in your text, you may use regmatches along with gregexpr (regexec would only report the first match, together with its capture group):
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text, gregexpr("\\b\\d{2}-[A-Z]+-\\d{4}\\b", text))[[1]]
[1] "29-MAY-2019" "01-JAN-2018"
While scraping text from a webpage using the rvest package, some paragraphs return empty but they should not.
The webpage is:
https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=LEGITEXT000005620562
I want the paragraphs under the "articles", so I use ".article p" as CSS selector. It should return 9 paragraphs (5 should be empty as they are fillers). I do get 9 paragraphs, but 8 are empty!
page=read_html("https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=LEGITEXT000005620562")
html_text(html_nodes(page,".article p"))
I would post a screenshot but I don't have enough reputation...
Running these lines returns a vector with 9 character strings, but they are all empty except the 8th one.
Paragraphs 1, 3 and 5 should contain text but here they appear empty.
Thank you all for your time.
EDIT:
A bit of context: I need to scrape a lot of pages from this website to get the core text of the articles to perform linguistic analysis on it.
The ".article p" CSS selector does a good job on most pages, but the content of some paragraphs appears empty.
Why not do something like this?
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
page <- read_html("https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=LEGITEXT000005620562")
page %>%
html_nodes(".article") %>%
html_text() %>%
str_remove_all(pattern = "\nArticle\\s[0-9]")
#> [1] " En savoir plus sur cet article...\n\n Sont désignées comme gares ferroviaires ouvertes au trafic international au sens de l'article 35 quater de l'ordonnance du 2 novembre 1945 susvisée au titre desquelles peuvent être créées des zones d'attente les gares suivantes :\n Lille-Europe, Lille-Flandres, Aulnoye, Strasbourg, Thionville, Forbach, Metz, Sarreguemines, Pontarlier, Morteau, Modane, Cerbère, Nice, Hendaye, Calais-Fréthun, Paris-Gare du Nord, Paris-Gare de l'Est, Paris-Gare de Lyon.\n "
#> [2] " En savoir plus sur cet article...\n\n Le présent arrêté abroge les dispositions de l'arrêté du 4 mai 1995 désignant les gares ferroviaires ouvertes au trafic international.\n\n "
#> [3] "\nLes préfets et, à Paris, le préfet de police sont chargés, chacun en ce qui le concerne, de l'exécution du présent arrêté, qui sera publié au Journal officiel de la République française.\n\n "
Created on 2019-01-21 by the reprex package (v0.2.1)
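The strings returned above still carry the "En savoir plus sur cet article..." link text and a lot of whitespace. If that gets in the way of the linguistic analysis, a base R clean-up along these lines may help. This is only a sketch: clean_article is a hypothetical helper, and the sample string is a shortened version of the first result.

```r
# Hypothetical helper: drop the link text, collapse whitespace, trim the ends.
clean_article <- function(x) {
  x <- gsub("En savoir plus sur cet article...", "", x, fixed = TRUE)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

raw <- " En savoir plus sur cet article...\n\n Lille-Europe, Lille-Flandres, Aulnoye.\n "
cleaned_text <- clean_article(raw)
print(cleaned_text)
```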
I need to show the following table in Markdown:
resumen<-data.frame(dato=c('cant_cli','debit', 'costo_real',
'comis_max', 'pagos', 'comis_real',
'recupero%'),
valor=
c(nrow(ops), sum(ops$Debit),sum(ops$costo_real),
sum(ops$comision_max), sum(ops$Amount), sum(ops$comision_ganada),
sum(ops$Amount)/sum(ops$Debit)
))
The result is:
dato valor
1 cant_cli 6.217400e+04
2 debit 3.943952e+06
3 costo_real 2.641091e+04
4 comis_max 1.021484e+05
5 pagos 2.003838e+06
6 comis_real 5.189941e+04
7 recupero% 5.080788e-01
But I need to have the following format:
dato valor
cant_cli 62174
debit 3943952
costo_real 26411
comis_max 102148
pagos 2003838
comis_real 51899
recupero% 50.80%
How can I make the code use this format?
Try this
as.numeric(resumen[,2])
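Since valor is already numeric, an alternative is to build a character column with the desired formatting. A minimal sketch, assuming a cut-down resumen with values hard-coded from the output above, and assuming the percentage row can be recognised by the % in its name; valor_fmt is a made-up column name:

```r
# Cut-down stand-in for resumen, using values from the printed output.
resumen <- data.frame(
  dato  = c("cant_cli", "debit", "costo_real", "recupero%"),
  valor = c(62174, 3943952, 26410.91, 0.5080788),
  stringsAsFactors = FALSE
)

# Rows whose name contains "%" are shown as percentages, the rest as whole numbers.
is_pct <- grepl("%", resumen$dato, fixed = TRUE)
resumen$valor_fmt <- ifelse(
  is_pct,
  sprintf("%.2f%%", resumen$valor * 100),  # ratio -> percentage with 2 decimals
  sprintf("%.0f", resumen$valor)           # no decimals, no scientific notation
)

print(resumen[, c("dato", "valor_fmt")], row.names = FALSE)
```

In an R Markdown document, the formatted columns could then be passed to knitr::kable(resumen[, c("dato", "valor_fmt")]). With these numbers the ratio rounds to 50.81%.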