Reading csv flie with commas in R that stops fread function - r

I am trying to read multiple csv files (like 300) with the function fread in R.
When i open one of the csv files in excel, the columns are delimited correctly, even when some observations contain commas.
When I try to read one of the files, the fuction does't read all the observations in the file and the next error appears
> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)
Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", :
Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>
Therefore i can't read the whole file. I suspect it is because one of the observations cointains commas like "PLOMERIA, TUBO DE COBRE, DE 60 MTS". But I'm not sure.
How can i fix this without fixing each csv file one by one?
Here's the file that i'm using int he example, but as I said, i need to read multiple files like this:
https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing

The file is corrupt in two ways: lines 1073 and 3401 have embedded quotes. But there's another problem here ... read down to the second section fread and double-double-quotes for the problem with fread.
(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)
Corrupted lines
Scroll right to see the problems.
Line 1073:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
Line 3401:
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
The best fix is to get whatever person/process exported this to export compliant CSV.
Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).
sed -i \
-e 's/", PZA/"", PZA/g' \
-e s'/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
"INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).
In this case, it finds the two very specific failing text and adds the second quote.
-i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
-e adds a sed script/command, multiple commands can be given.
s/from/to/g means to replace the pattern from with the string in to; the g means "global".
This changes the two lines (shown one after the other here for simplicity:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^^^^^-- the changes, double-double quotes
FYI: if you don't have sed in the path ... if you're running windows, then look in the RTools40 path; for me, I have c:/rtools40/usr/bin/sed.exe. If you're on macos or linux and cannot find sed, well ... that's odd.
After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.
csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
# Año Mes Fecha_Pub_DOF Clave ciudad Nombre ciudad División
# <int> <int> <char> <int> <char> <char>
# 1: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 2: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 3: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]
fread and double-double-quotes
The problem with the above is that while it seems to have worked correctly, it (still) does not do embedded quotes correctly. As long as you want your data to have all of the embedded quotes that you want, then you cannot use fread, unfortunately.
Why?
str(csv[1067,])
# Classes 'data.table' and 'data.frame': 1 obs. of 17 variables:
# $ Año : int 2020
# $ Mes : int 6
# $ Fecha_Pub_DOF : chr "20/07/2020 12:00:00 a. m."
# $ Clave ciudad : int 12
# $ Nombre ciudad : chr "San Luis Potosí, S.L.P."
# $ División : chr "3. Vivienda"
# $ Grupo : chr "3.1. Costo de uso de vivienda"
# $ Clase : chr "3.1.1. Costo de uso de vivienda"
# $ Subclase : chr "42 Vivienda propia"
# $ Clave genérico : int 140
# $ Genérico : chr "Productos para reparación menor de la vivienda"
# $ Consecutivo : int 1
# $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
# $ Precio promedio: num 232
# $ Cantidad : int 1
# $ Unidad : chr "PZA"
# $ Estatus : chr ""
# - attr(*, ".internal.selfref")=<externalptr>
Namely, see
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
^^^^ should only be a single "
Fortunately, read.csv works fine here:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"
FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.

The file you linked is properly quoted.
It has 5 lines of non-CSV data though, so skip these:
csv = read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = T, skip = 5, fileEncoding = "Latin1")
This works fine for me.
I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?

Related

How to scrape data from PDF with R?

I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[those that are marked in highlighter are the fields with the information]
I've used the command line Java application Tabula and the R version TabulizeR to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs different a tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
Here is an approach that can be considered to extract the text of your image :
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.

R crashes randomly

I have had R 3.0.2 crashing at random times. I have reinstalled R again after uninstalling the same version. I do not think is any of the packages as the crashes occur while working with different packages that prior worked fine. It happens when computing some functions/routines but it does not seem to be routine specific related.
Nombre del evento de problema: APPCRASH
Nombre de la aplicación: Rgui.exe
Versión de la aplicación: 3.2.63987.0
Marca de tiempo de la aplicación: 52430323
Nombre del módulo con errores: R.dll
Versión del módulo con errores: 3.2.63987.0
Marca de tiempo del módulo con errores: 52430319
Código de excepción: c0000005
Desplazamiento de excepción: 0000000000022a9c
Versión del sistema operativo: 6.1.7600.2.0.0.256.1
Id. de configuración regional: 3082
Información adicional 1: d710
Información adicional 2: d7107453bd4712fe75341007482db842
Información adicional 3: 3033
Información adicional 4: 303376387c0f552f51fb92acd44e3a83
When running by adding one computation (thus package) at a time the error does not occur. I have run a low memtest and found no problems. As well I run overnight while(TRUE){hist(runif(10000))} and it did crash

Sort By Negative Value in Ubuntu

I have a problem: I want to sort a file like this from big to little values
de la (-0.190192990384141)
de l (-0.158296326178354)
la commission (0.041432182043560)
c est (0.107475708632644)
à la (-0.112009236677336)
le président (0.051962088225587)
à l (-0.095689228439195)
monsieur le (0.041436304077711)
I try with this command
sort -t "(" -ngk2r file1 > file2
but I get this
de la (-0.190192990384141)
de l (-0.158296326178354)
à la (-0.112009236677336)
c est (0.107475708632644)
à l (-0.095689228439195)
le président (0.051962088225587)
monsieur le (0.041436304077711)
la commission (0.041432182043560)
As you see the file is not sorted.
It seems like a mysterious problem.
Any ideas please?
Thanks
i found the solution for this problem
env LC_ALL=C sort -t "(" -ngk2r result2.txt >result2tri
Thanks

sort negative value in Unix

I have a problem when i want to sort a file like this
ce point de l ordre du jour -0.000000004070935
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
I try with this command
sort -ngk8r file1 > file2
but i get this
ce point de l ordre du jour -0.000000004070935
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
As you see the file is not sorted
I found joy by dropping the -g
$ sort -gnrk8 file1
sort: options `-gn' are incompatible
Example
$ sort -nrk8 file1
au sein de la commission des libertés 0.000000004017626
du conseil de sécurité de l onu -0.000000003909216
ce point de l ordre du jour -0.000000004070935

Sort column2 Unix

I have a problem with the Sort command in unix
I have a text file that contains line like this:
de la (-0.167969404167593)
de l (-0.137148984295644)
la commission (0.0922090559997898)
à la (-0.115188946405936)
à l (-0.0936395578796088)
c est (0.130628584805583)
I want to sort these sentence according to a descending order of the values ​​in parenthesis
I do this commande :
sort 2fr -t"(" -k2r > 2frsort
But it does not the sort correctly
Any idea please?
Thanks
You can use numeric sort on the second column:
$ sort --numeric-sort --field-separator '(' --reverse --key 2 test.dat
c est (0.130628584805583)
la commission (0.0922090559997898)
à l (-0.0936395578796088)
à la (-0.115188946405936)
de l (-0.137148984295644)
de la (-0.167969404167593)
Are you using GNU sort? If so, -g option handles the exponential notation
sort -t '(' -k 2 -g -r

Resources