I am trying to read multiple csv files (like 300) with the function fread in R.
When i open one of the csv files in excel, the columns are delimited correctly, even when some observations contain commas.
When I try to read one of the files, the fuction does't read all the observations in the file and the next error appears
> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)
Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", :
Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>
Therefore i can't read the whole file. I suspect it is because one of the observations cointains commas like "PLOMERIA, TUBO DE COBRE, DE 60 MTS". But I'm not sure.
How can i fix this without fixing each csv file one by one?
Here's the file that i'm using int he example, but as I said, i need to read multiple files like this:
https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing
The file is corrupt in two ways: lines 1073 and 3401 have embedded quotes. But there's another problem here ... read down to the second section fread and double-double-quotes for the problem with fread.
(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)
Corrupted lines
Scroll right to see the problems.
Line 1073:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
Line 3401:
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
The best fix is to get whatever person/process exported this to export compliant CSV.
Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).
sed -i \
-e 's/", PZA/"", PZA/g' \
-e s'/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
"INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).
In this case, it finds the two very specific failing text and adds the second quote.
-i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
-e adds a sed script/command, multiple commands can be given.
s/from/to/g means to replace the pattern from with the string in to; the g means "global".
This changes the two lines (shown one after the other here for simplicity:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^^^^^-- the changes, double-double quotes
FYI: if you don't have sed in the path ... if you're running windows, then look in the RTools40 path; for me, I have c:/rtools40/usr/bin/sed.exe. If you're on macos or linux and cannot find sed, well ... that's odd.
After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.
csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
# Año Mes Fecha_Pub_DOF Clave ciudad Nombre ciudad División
# <int> <int> <char> <int> <char> <char>
# 1: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 2: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 3: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]
fread and double-double-quotes
The problem with the above is that while it seems to have worked correctly, it (still) does not do embedded quotes correctly. As long as you want your data to have all of the embedded quotes that you want, then you cannot use fread, unfortunately.
Why?
str(csv[1067,])
# Classes 'data.table' and 'data.frame': 1 obs. of 17 variables:
# $ Año : int 2020
# $ Mes : int 6
# $ Fecha_Pub_DOF : chr "20/07/2020 12:00:00 a. m."
# $ Clave ciudad : int 12
# $ Nombre ciudad : chr "San Luis Potosí, S.L.P."
# $ División : chr "3. Vivienda"
# $ Grupo : chr "3.1. Costo de uso de vivienda"
# $ Clase : chr "3.1.1. Costo de uso de vivienda"
# $ Subclase : chr "42 Vivienda propia"
# $ Clave genérico : int 140
# $ Genérico : chr "Productos para reparación menor de la vivienda"
# $ Consecutivo : int 1
# $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
# $ Precio promedio: num 232
# $ Cantidad : int 1
# $ Unidad : chr "PZA"
# $ Estatus : chr ""
# - attr(*, ".internal.selfref")=<externalptr>
Namely, see
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
^^^^ should only be a single "
Fortunately, read.csv works fine here:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"
FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.
The file you linked is properly quoted.
It has 5 lines of non-CSV data though, so skip these:
csv = read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = T, skip = 5, fileEncoding = "Latin1")
This works fine for me.
I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?
I need to extract data from a PDF file. This file is a booklet of public services, where each page is about a specific service, which contains fields with the following information: name of the service, service description, steps, documentation, fees and observations. All pages follow this same pattern, changing only the information contained in these fields.
I would like to know if it is possible to extract all the data contained in these fields using R, please.
[those that are marked in highlighter are the fields with the information]
I've used the command line Java application Tabula and the R version TabulizeR to extract tabular data from text-based PDF files.
https://github.com/tabulapdf/tabula
https://github.com/ropensci/tabulizer
However, if your PDF is actually an image, then this becomes an OCR problem and needs different a tool.
Caveat: Tabula only works on text-based PDFs, not scanned documents. If you can click-and-drag to select text in your table in a PDF viewer (even if the output is disorganized trash), then your PDF is text-based and Tabula should work.
Here is an approach that can be considered to extract the text of your image :
library(RDCOMClient)
library(magick)
################################################
#### Step 1 : We convert the image to a PDF ####
################################################
path_TXT <- "C:\\temp.txt"
path_PDF <- "C:\\temp.pdf"
path_PNG <- "C:\\stackoverflow145.png"
path_Word <- "C:\\temp.docx"
pdf(path_PDF, width = 16, height = 6)
im <- image_read(path_PNG)
plot(im)
dev.off()
####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)
#############################################
#### Step 3 : We convert the word to txt ####
#############################################
doc$SaveAs(path_TXT, FileFormat = 4)
text <- readLines(path_TXT)
text
[1] "Etapas:"
[2] "Documenta‡Æo:"
[3] "\a Consulte a se‡Æo documenta‡Æo b sica (veja no ¡ndice)."
[4] "\a Original da primeira via da nota fiscal do fabricante - DANFE - Resolu‡Æo Sefaz 118/08 para ve¡culos adquiridos diretamente da f brica, ou original da primeira via da nota fiscal do revendedor ou c¢pia da Nota Fiscal Eletr“nica - DANFE, para ve¡culos adquiridos em revendedores, acompanhados, em ambos os casos, da etiqueta contendo o decalque do chassi em baixo relevo;"
[5] "\a Documento que autoriza a inclusÆo do ve¡culo na frota de permission rios/ concession rios, expedido pelo ¢rgÆo federal, estadual ou municipal concedente, quando se tratar de ve¡culo classificado na esp‚cie \"passageiros\" e na categoria \"aluguel\","
[6] "\a No caso de inclusÆo de GRAVAME comercial, as institui‡äes financeiras e demais empresas credoras serÆo obrigados informar, eletronicamente, ao Sistema Nacional de GRAVAMEs (SNG), sobre o financiamento do ve¡culo. Os contratos de GRAVAME comercial serÆo previamente registrados no sistema de Registro de Contratos pela institui‡Æo financeira respons vel para permitir a inclusÆo do GRAVAME;"
[7] "\a Certificado de registro expedido pelo ex‚rcito para ve¡culo blindado;"
[8] "\a C¢pia autenticada em cart¢rio do laudo m‚dico e o registro do n£mero do Certificado de Seguran‡a Veicular (CSV), quando se tratar de ve¡culo adaptado para deficientes f¡sicos."
[9] "Taxas:"
[10] "Duda 001-9 (Primeira Licen‡a). Duda de Emplacamento: 037-0 para Carros; 041-8 para motos."
[11] "Observa‡Æo:"
[12] "1. A 1' licen‡a de ve¡culo adquirido atrav‚s de leilÆo dever ser feita atrav‚s de Processo Administrativo, nas CIRETRANS, SATs ou uma unidade de Protocolo Geral. (veja no ¡ndice as se‡äes Lista de CIRETRANS, Lista de SATs e Unidades de Protocolo Geral para endere‡os )"
Once you have the text in R, you can use the R package stringr.
I am analysing some amazon reviews. At the current step of my analysis, I'd like to take the sentences written in reviews that have less than two stars. I have done that, applied get sentences, and wrote this function in order to search for words inside the sentences and print out only those containing all of the words:
ricerca <- function(sentences,keyword){
found <- lapply(sentences, function(x) grep(keyword, x, value = TRUE))
found <-found[lengths(found) > 0]
return(found)
The sentences are made in the following way:
> class(frasi_negative)
[1] "get_sentences" "get_sentences_character" "list"
> frasi_negative[1:2]
[[1]]
[1] "Auricolari comodissimi."
[2] "Restano incollati alle orecchie e non cadono in nessuna situazione."
[3] "Suono limpido, pulito."
[4] "Bassi consistenti."
[5] "La durata della batteria è più che soddisfacente (io li ho usati anche per 4 ore di fila senza problemi)."
[6] "Decisamente soddisfatto."
[7] "Li ricomprerei."
[8] "Amazon perfetta come al solito."
[9] "AGGIORNAMENTO RECENSIONE!!!!!"
[10] "- Dopo un mese di utilizzo non è più possibile ricaricare le cuffie."
[11] "Non danno più segni di vita."
[12] "Delusissimo."
[13] "Non mi sarei mai aspettato una fine così."
[14] "Peccato perché il prodotto era praticamente perfetto."
[[2]]
[1] "Al mio cellulare (Xiaomi Redmi Note 5) si mostrano singolarmente, separate, quando cerco di connetterle."
[2] "O si connette alla destra, o alla sinistra, e in ogni caso il suono poi esce dalle casse del cellulare (nonostante aver dato alle cuffie tutti i permessi)."
[3] "Non capisco perché, data che la prima connessione era andata come si deve; spente e riaccese, hanno iniziato a comportarsi così."
[4] "Ho provato a riavviare sia loro che cellulare, a rimetterle nella scatoletta e ritoglierle, ma il problema persiste."
[5] "Non penso c'entri il mio cellulare (mai avuto problemi con prodotti simili), in ogni caso effettuo reso con rimborso."
When I try searching for a word, it seems to work (even if the output is really horrible):
> found<-ricerca(frasi_negative, "qualità")
> found[1:3]
[[1]]
[1] "Pessima qualità."
[2] "La qualità delle chiamate telefoniche è assolutamente pessima (il proprio interlocutore non riceve in modo chiaro la nostra voce, dunque, risultano inutilizzabili come aurocolari telefonici)."
[[2]]
[1] "imbarazzanti non so la gente qui come fanno a dargli 5 stelle, l'unica cosa che mi viene in mente e che non hanno mai provato un paio di cuffie decenti, qualità dell audio pessima si sente basso e male , l'unica cosa buona e che la batteria si comporta bene - a distanza di 20 cm ogni tanto si scollegano provato con piu dispositivi sicuramente richiedo il rimborso davvero pessime"
[[3]]
[1] "La qualità costruttiva è ottima, l'accoppiamento è avvenuto in maniera facile ed immediata, e la durata è ottima."
But when I try searching for a few words (as example c("quality","bad")), it only searches for the first word, and gets me lot of warnings.
I have no idea about how to adapt this function, so thanks to all of you in advance.
Library: sentimentr
UPDATE:
Thanks for the answers, but it seems that in both the function you guys published it outputs all sentences containing at least one of the two words. I just want to see those which contain both. Is there a way to do it?
The below function should iterate through an inputted vector of keywords:
ricerca <- function(sentences,keywords){
for(i in 1:length(keywords){
found <- lapply(sentences, function(x) grep(keywords[i], x, value = TRUE))
found <-found[lengths(found) > 0]
return(found)
}
}
I hope this helps!
Try combining the keyword using paste as one string.
ricerca <- function(sentences,keyword){
found <- lapply(sentences, function(x)
grep(paste0(keyword, collapse = "|"), x, value = TRUE))
found <-found[lengths(found) > 0]
return(found)
}
I have a get_sentences (sentimentr) list, and need to extract only those sentences containing specific words.
This is how the data looks like:
> class(frasi_negative)
[1] "get_sentences" "get_sentences_character" "list"
> frasi_negative[2:3]
[[1]]
[1] "Al mio cellulare (Xiaomi Redmi Note 5) si mostrano singolarmente, separate, quando cerco di connetterle."
[2] "O si connette alla destra, o alla sinistra, e in ogni caso il suono poi esce dalle casse del cellulare (nonostante aver dato alle cuffie tutti i permessi)."
[3] "Non capisco perché, data che la prima connessione era andata come si deve; spente e riaccese, hanno iniziato a comportarsi così."
[4] "Ho provato a riavviare sia loro che cellulare, a rimetterle nella scatoletta e ritoglierle, ma il problema persiste."
[5] "Non penso c'entri il mio cellulare (mai avuto problemi con prodotti simili), in ogni caso effettuo reso con rimborso."
[[2]]
[1] "Comprate due mesi fa."
[2] "All'inizio funzionavano perfettamente, ma dopo qualche settimana hanno iniziato a disconnettersi tra loro di tanto in tanto."
[3] "qualche giorno fa estraendo la sinistra dall'astuccio magnetico è saltata la saldatura che la teneva chiusa e questo è il risultato."
[4] "Usandole in chiamata rendono un suono non limpido."
[5] "Il suono è accettabile ma nulla di speciale."
[6] "Ormai sono inutilizzabili."
[7] "Per il prezzo mi aspettavo un prodotto migliore e più duraturo (avendo già provato auricolari wireless della stessa fascia di prezzo di altri brand)"
As an example, if I search "wireless", only the [[2]][7] element should show up.
How can I achieve this?
Thank you in advance.
You can try to iterate over the list using lapply and return the sentence which matches a particular keyword using grep .
keyword <- 'wireless'
lapply(frasi_negative, function(x) grep(keyword, x, value = TRUE))
We can use str_detect
library(purrr)
library(stringr)
keyword <- 'wireless'
map(frasi_negative, ~ .x[str_detect(.x, keyword)])
I am again here with an interesting problem.
I have a document like shown below:
"""UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n""""
From the above document, I need to extract date highlighted in bold and Italics.
I tried with strpdate function but did not get the desired results.
Any help will be greatly appreciated.
Thanks in advance.
Assuming you only want to capture a single date, you may use sub here:
text <- "UDAYA FILLING STATION ps\na MATTUPATTY ROAD oe\noe 4 MUNNAR Be:\nSeat 4 04865230318 Rat\nBree 4 ORIGINAL bepas e\n\noe: Han Die MC DE ER DC I se ek OO UO a Be ten\" % aot\n: ag 29-MAY-2019 14:02:23 [i\n— INVOICE NO: 292 hee fos\nae VEHICLE NO: NOT ENTERED Bea\nss NOZZLE NO : 1 ome\n- PRODUCT: PETROL ae\ne RATE : 75.01 INR/Ltr yee\n“| VOLUME: 1.33 Ltr ae\n~ 9 =6AMOUNT: 100.00 INR mae wae\nage, Ee pel Di EE I EE oe NE BE DO DC DE a De ee De ae Cate\notome S.1T. No : 27430268741C =. ver\nnes M.S.T. No: 27430268741V ae\n\nThank You! Visit Again\n"
date <- sub("^.*\\b(\\d{2}-[A-Z]+-\\d{4})\\b.*", "\\1", text)
date
[1] "29-MAY-2019"
If you had the need to match multiple such dates in your text, then you may use regmatches along with regexec:
text <- "Hello World 29-MAY-2019 Goodbye World 01-JAN-2018"
regmatches(text,regexec("\\b(\\d{2}-[A-Z]+-\\d{4})\\b", text))[[1]]
[1] "29-MAY-2019" "29-MAY-2019"