I'm new to the AI playground and for this purpose I'm experimenting with the GPT-J (6B) model on an Amazon SageMaker notebook instance (g4dn.xlarge). So far, I've managed to register an endpoint and run the predictor, but I'm sure I'm either asking the wrong questions or I haven't really understood how the model parameters work (which is probable).
This is my code:
# build the prompt
prompt = """
language: es
match: comida
topic: hoteles en la playa todo incluido
output: ¿Sabes cuáles son los mejores Hoteles Todo Incluido de España? Cada vez son
más los que se suman a la moda del Todo Incluido para disfrutar de unas perfectas y
completas vacaciones en familia, en pareja o con amigos. Y es que con nuestra oferta
hoteles Todo Incluido podrás vivir unos días de auténtico relax y una estancia mucho
más completa, ya que suelen incluir desde el desayuno, la comida y la cena, hasta
cualquier snack y bebidas en las diferentes instalaciones del hotel. ¿Qué se puede
pedir más para relajarse durante una perfecta escapada? A continuación, te
presentamos los mejores hoteles Todo Incluido de España al mejor precio.
language: es
match: comida
topic: hoteles en la playa todo incluido
output:
"""
# set the maximum token length
maximum_token_length = 25
# set the sampling temperature
sampling_temperature = 0.6
# build the predictor arguments
predictor_arguments = {
    "inputs": prompt,
    "parameters": {
        "max_length": len(prompt) + maximum_token_length,
        "temperature": sampling_temperature
    }
}
# execute the predictor with the prompt as input
predictor_output = predictor.predict(predictor_arguments)
# retrieve the text output
text_output = predictor_output[0]["generated_text"]
# print the text output
print(f"text output: {text_output}")
My problem is that when I try to get a response using these parameters, I get nothing back: it just repeats my input with an empty completion, so I'm definitely doing something wrong. The funny thing is that the same input with the same sampling temperature gives me a perfectly understandable text output on the OpenAI playground (on text-davinci-003).
Can you give me a hint on what I'm doing wrong? Oh, and another question: how can I specify something like 'within the first 10 words' for a keyword match?
I am trying to read multiple csv files (around 300) with the function fread in R.
When I open one of the csv files in Excel, the columns are delimited correctly, even when some observations contain commas.
When I try to read one of the files, the function doesn't read all the observations in the file and the following error appears:
> file_prueba<-fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", skip = 5, header = TRUE)
Warning message:
In fread("Datos/Datos_precios/INP_PP_CAB18 (7)_A_vivienda_06_2020.csv", :
Stopped early on line 1073. Expected 17 fields but found 22. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA","">>
Therefore I can't read the whole file. I suspect it is because one of the observations contains commas, like "PLOMERIA, TUBO DE COBRE, DE 60 MTS", but I'm not sure.
How can I fix this without fixing each csv file one by one?
Here's the file that I'm using in the example, but as I said, I need to read multiple files like this:
https://drive.google.com/file/d/1gSjyL14sZQC5KNtMXhN_iN79xCETTZAG/view?usp=sharing
The file is corrupt in two places: lines 1073 and 3401 contain embedded quotes that are not escaped. But there's another problem here ... read down to the second section, fread and double-double-quotes, for the problem with fread.
(Ultimately, this is a failure of the exporting process and a failure of fread to read embedded double quotes.)
Corrupted lines
Scroll right to see the problems.
Line 1073:
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4", PZA 6 MTS","231.55","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
Line 3401:
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^-- this quote is incorrect
The best fix is to get whatever person/process exported this to export compliant CSV.
Here is a command-line (sed) fix that will allow fread to load it without warning or error (this is on a shell prompt, not in R).
sed -i \
-e 's/", PZA/"", PZA/g' \
-e s'/BARRA DE 1\/2"/BARRA DE 1\/2""/g' \
"INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
Simple explanation: the CSV standard (well-framed at https://en.wikipedia.org/wiki/Comma-separated_values) suggests that either double-quotes should never be in a quoted field, or if present they should be doubled (as in "" to produce a single " in the middle of a value).
In this case, it finds the two very specific failing strings and adds the second quote.
-i means to make the change in-place; perhaps a more defensive use would be to do sed -e 's/../../g' -e 's/../../g' < oldfile.csv > newfile.csv, which would preserve the broken file. Over to you.
-e adds a sed script/command, multiple commands can be given.
s/from/to/g means to replace the pattern from with the string in to; the g means "global".
This changes the two lines (shown one after the other here for simplicity):
"2020","06","20/07/2020 12:00:00 a. m.","12","San Luis Potosí, S.L.P.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","001","PLOMERIA, TUBO DE PVC, REFORZADO, 4"", PZA 6 MTS","231.55","1","PZA",""
"2020","06","20/07/2020 12:00:00 a. m.","43","Campeche, Camp.","3. Vivienda","3.1. Costo de uso de vivienda","3.1.1. Costo de uso de vivienda","42 Vivienda propia","140","Productos para reparación menor de la vivienda","003","NACOBRE, PLOMERIA, TUBO DE COBRE, BARRA DE 1/2"" X 6 MT","316.76","1","PZA",""
---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ---> ^^^^^-- the changes, double-double quotes
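As a quick sanity check of the doubled-quote convention, base R's CSV reader turns "" inside a quoted field back into a single " (a tiny made-up example, not taken from the linked file):
# "" inside a quoted field is read back as one literal "
read.csv(text = '"id","spec"\n"1","TUBO DE PVC, 4"" PZA"')
#   id                spec
# 1  1 TUBO DE PVC, 4" PZA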
FYI: if you don't have sed in the path ... if you're running Windows, look in the RTools40 path; for me, it is c:/rtools40/usr/bin/sed.exe. If you're on macOS or Linux and cannot find sed, well ... that's odd.
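If you cannot find sed at all, a rough in-R equivalent of the same two substitutions is sketched below; it writes a repaired copy (the "_fixed" name is arbitrary) rather than editing in place, and it glosses over encoding details:
# read the raw lines, double the two offending quotes, and write a repaired copy
path_in  <- "INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV"
path_out <- "INP_PP_CAB18 (7)_A_vivienda_06_2020_fixed.CSV"
lines <- readLines(path_in, warn = FALSE)
lines <- gsub('", PZA', '"", PZA', lines, fixed = TRUE)
lines <- gsub('BARRA DE 1/2"', 'BARRA DE 1/2""', lines, fixed = TRUE)
writeLines(lines, path_out)
# then read the repaired copy, e.g. fread(path_out, skip = 5)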
After that sed command executes correctly, it will load without problem. HOWEVER, don't let this mislead you ... it is not really fixed. Keep reading.
csv <- fread("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv
# Año Mes Fecha_Pub_DOF Clave ciudad Nombre ciudad División
# <int> <int> <char> <int> <char> <char>
# 1: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 2: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
# 3: 2020 6 20/07/2020 12:00:00 a. m. 1 Área Met. de la Cd. de México 3. Vivienda
...snip...
# 11 variables not shown: [Grupo <char>, Clase <char>, Subclase <char>, Clave genérico <int>, Genérico <char>, Consecutivo <int>, Especificación <char>, Precio promedio <num>, Cantidad <int>, Unidad <char>, ...]
fread and double-double-quotes
The problem with the above is that while it seems to have worked correctly, it (still) does not handle the embedded quotes correctly. If you want your data to keep the embedded quotes as they were intended, then you cannot use fread, unfortunately.
Why?
str(csv[1067,])
# Classes 'data.table' and 'data.frame': 1 obs. of 17 variables:
# $ Año : int 2020
# $ Mes : int 6
# $ Fecha_Pub_DOF : chr "20/07/2020 12:00:00 a. m."
# $ Clave ciudad : int 12
# $ Nombre ciudad : chr "San Luis Potosí, S.L.P."
# $ División : chr "3. Vivienda"
# $ Grupo : chr "3.1. Costo de uso de vivienda"
# $ Clase : chr "3.1.1. Costo de uso de vivienda"
# $ Subclase : chr "42 Vivienda propia"
# $ Clave genérico : int 140
# $ Genérico : chr "Productos para reparación menor de la vivienda"
# $ Consecutivo : int 1
# $ Especificación : chr "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
# $ Precio promedio: num 232
# $ Cantidad : int 1
# $ Unidad : chr "PZA"
# $ Estatus : chr ""
# - attr(*, ".internal.selfref")=<externalptr>
Namely, see
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\"\", PZA 6 MTS"
^^^^ should only be a single "
Fortunately, read.csv works fine here:
csv <- read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", skip = 5)
csv$Especificación[1067]
# [1] "PLOMERIA, TUBO DE PVC, REFORZADO, 4\", PZA 6 MTS"
FYI, if you don't care about the embedded quotes, you can still use fread if you change the sed expressions to remove the double-quotes instead of doubling the double-quotes. That is, -e 's/", PZA/, PZA/g' and likewise for the second expression. I didn't recommend this first because it changes your data, which you should not have to do.
The file you linked is properly quoted.
It has 5 lines of non-CSV data though, so skip these:
csv = read.csv("INP_PP_CAB18 (7)_A_vivienda_06_2020.CSV", header = T, skip = 5, fileEncoding = "Latin1")
This works fine for me.
I am not so familiar with fread, and it does seem to have a problem with this file. Is there a reason you need data.table::fread for this?
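Since the goal is around 300 files with the same layout, the single-file call can be applied across the whole folder. Here is a sketch, assuming every file shares the 5-line preamble and identical columns (the Datos/Datos_precios folder name comes from the question):
# read every CSV in the folder with the same settings and stack the results
files <- list.files("Datos/Datos_precios", pattern = "\\.csv$",
                    full.names = TRUE, ignore.case = TRUE)
prices_list <- lapply(files, read.csv, header = TRUE, skip = 5,
                      fileEncoding = "Latin1")
all_prices  <- do.call(rbind, prices_list)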
I'm using R/RStudio on a Windows machine that I purchased in Japan, and I want to input Twitter data (in Spanish) from a social media analysis platform. For example, I have a file in XLSX format containing just two cells:
RT #PajeroHn #Emerson_182 ya sabía que eras olímpia pero no que eras extorsionador aunque era de esperarse 🌚
Jodas Kevo. A menos que vos seas extorsionador😂😂😂😂😂😂
There are accented vowels in there, as well as some non-standard emoticon characters that didn't make it through the export process intact. I tried this previously using the xlsx package, but it looks like XLConnect might be a better choice:
library(XLConnect)
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
This is OK; I might even be able to do something useful with the emoticons. I'm bothered that it converts the accented characters (in "sabía" and "olímpia") to their unaccented equivalents:
test
RT #PajeroHn #Emerson_182 ya sabia que eras olimpia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
My locale is Japanese:
Sys.getlocale()
"LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
but changing it actually makes matters worse:
Sys.setlocale("LC_ALL","Spanish")
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252>
test <- readWorksheetFromFile('test.xlsx',sheet=1,header=FALSE)
test
RT #PajeroHn #Emerson_182 ya sab僘 que eras ol匇pia pero no que eras extorsionador aunque era de esperarse <ed><U+00A0><U+00BC><ed><U+00BC><U+009A>
Jodas Kevo. A menos que vos seas extorsionador<ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082>
Any ideas?
This should work:
testx2 <- read.xlsx2('test.xlsx',sheetIndex=1,header = FALSE, encoding = 'UTF-8')
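For completeness, a minimal self-contained version using the xlsx package already mentioned in the question; the encoding argument is what should keep the accents intact regardless of the Japanese locale:
library(xlsx)   # provides read.xlsx2(); needs a working Java/rJava setup
# encoding applies to the cell text, independent of the Windows system locale
testx2 <- read.xlsx2('test.xlsx', sheetIndex = 1, header = FALSE, encoding = 'UTF-8')
testx2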