pyparsing.ParseException: Expected "File path :", found 'ÿ' - pyparsing

I am trying to extract a file path from a TXT file using pyparsing.
The line containing the file path looks like this:
File path : C:\user\1\book
Part of my code is:
from pyparsing import Literal, Word, printables, LineEnd

file_path = (
    Literal('File path :').suppress() +
    Word(printables).setResultsName('file_path') +
    LineEnd().suppress()
)
After I run the code, an exception is raised, and I'm not sure what it means.
pyparsing.ParseException: Expected "File path :", found 'ÿ' (at char 0), (line:1, col:1)
How can I fix that?
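A note on the likely cause: 'ÿ' is how the byte 0xFF displays when it is decoded with a single-byte codec such as Latin-1, and 0xFF 0xFE is the UTF-16 little-endian byte-order mark. So the TXT file is probably saved as UTF-16 and is being read with a different encoding. The sketch below uses only the standard library and a made-up sample line to show the effect. The sample bytes and the read_utf16 helper are assumptions for illustration, not part of the original code.

```python
import codecs

# Hypothetical sample: the line from the question, encoded as UTF-16 with a
# leading byte-order mark (an assumption about how the TXT file was saved).
raw = codecs.BOM_UTF16_LE + "File path : C:\\user\\1\\book\r\n".encode("utf-16-le")

# Decoded with a single-byte codec, the BOM byte 0xFF becomes 'ÿ', the same
# character that pyparsing reports at char 0.
wrong = raw.decode("latin-1")
print(repr(wrong[0]))  # 'ÿ'

# Decoded as UTF-16, the BOM is consumed and the expected prefix is at char 0.
right = raw.decode("utf-16")
print(right.startswith("File path :"))  # True


def read_utf16(path):
    """Hypothetical helper: read a whole UTF-16 text file into a str."""
    with open(path, encoding="utf-16") as handle:
        return handle.read()
```

If this is indeed the cause, the string returned by a helper like read_utf16 could be passed to file_path.parseString(...) in place of text read with the platform default encoding.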

Related

R library "XML" doesn't recognize encoding

Problem
I have an XML file that I would like to parse in R. I know that this file is not corrupted because the following Python code seems to work:
>>> import xml.etree.ElementTree as ET
>>> xml_tree = ET.parse(PATH_TO_MY_XML_FILE)
>>> do_my_regular_xml_stuff_that_seems_to_work_no_problem(xml_tree)
Now, when I try to run the following code in R, I get an error message:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE)
Error in nchar(text_repr): invalid multibyte string, element 1
Traceback:
Alright, maybe the parser doesn't recognize the encoding. Luckily this should be specified in a decent XML file. So, I go to my shell and check:
$ head -n1 PATH_TO_MY_XML_FILE
??<?xml version="1.0" encoding="utf-16"?>
Now, I can go back to R and explicitly pass on the encoding, only to hit the next error message, which is where I am stuck:
> library("XML")
> xml_tree <- XML::xmlParse(PATH_TO_MY_XML_FILE, encoding='UTF-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Traceback:
1. XML::xmlParse(filePath, encoding = "UTF-16")
2. (function (msg, ...)
. {
. if (length(grep("\\\n$", msg)) == 0)
. paste(msg, "\n", sep = "")
. if (immediate)
. cat(msg)
. if (length(msg) == 0) {
. e = simpleError(paste(1:length(messages), messages, sep = ": ",
. collapse = ""))
. class(e) = c(class, class(e))
. stop(e)
. }
. messages <<- c(messages, msg)
. })(character(0))
A last attempt to check (in R) if the file is in fact "UTF-16" encoded yields:
> f <- file(filePath, 'r', encoding = "UTF-16")
> firstLine <- readLines(f, n=1)
> close(f)
> print(firstLine)
[1] "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
Which looks just about right to me.
Question(s)
Does anyone know what is happening here? Is this a bug from the XML library? Is the file maybe not 'UTF-16' encoded, even though it claims it is? What are the two question marks ?? that I see when I print the file into the shell? These question marks don't appear when reading in the file properly...
Is this a bug from the XML library?
I think there could be a bug here. If I generate a valid UTF-16 XML document, which will have an initial byte-order mark:
$ echo '<a>😊</a>' | iconv -t utf-16 >a-utf16.xml
$ xxd a-utf16.xml
00000000: fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00 ..<.a.>.=...<./.
00000010: 6100 3e00 0a00 a.>...
then I can parse it with:
> XML::xmlParse('a-utf16.xml')
<?xml version="1.0"?>
<a>😊</a>
but not if I specify the encoding:
> XML::xmlParse('a-utf16.xml', encoding='utf-16')
Start tag expected, '<' not found
Error: 1: Start tag expected, '<' not found
Your original problem was when you weren't specifying the encoding. However:
I know that this file is not corrupted because the following Python code seems to work
That's a good hint, but I think you'll find edge cases where that doesn't hold. Try iconv for a second opinion on whether the file is encoded correctly.
For a more specific response, you'll need to post a reproducible XML file.
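On the question of the two question marks: they are most likely the byte-order mark, which the shell cannot display. As a second opinion alongside iconv, the following Python sketch checks the leading bytes and tries a strict UTF-16 decode. It uses the bytes copied from the xxd dump above; the function names describe_bom and is_valid_utf16 are made up for this illustration.

```python
import codecs

# Bytes copied from the xxd dump of a-utf16.xml shown above.
raw = bytes.fromhex("fffe 3c00 6100 3e00 3dd8 0ade 3c00 2f00 6100 3e00 0a00")


def describe_bom(data):
    """Return a label for a leading UTF-16 byte-order mark, or None."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 little-endian BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 big-endian BOM"
    return None


def is_valid_utf16(data):
    """Return True if the bytes decode cleanly as UTF-16."""
    try:
        data.decode("utf-16")
    except UnicodeDecodeError:
        return False
    return True


print(describe_bom(raw))         # UTF-16 little-endian BOM
print(is_valid_utf16(raw))       # True
print(repr(raw.decode("utf-16")))  # '<a>😊</a>\n'
```

The first two bytes, ff fe, are what show up as ?? in the head output; a decoder that honours the BOM drops them, which is why they do not appear when the file is read with the right encoding.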

My paste function is not providing an accurate path to the file with csv data

In the code you'll see an if/else function that builds a file path depending on which operating system you have. It runs on Mac, but when it sees I'm on Windows I get this error: Error in system(paste("/RTools/bin/wc -l \"", filename, "\"", sep = ""), : '/RTools/bin/wc' not found
I've tried debugging the function, and when I step through it, filename holds the correct CSV file in the correct format. I believe the issue is with the two backslashes going the wrong way and perhaps an extra quotation mark. Lines 3-5 are where I believe my problem is.
function( filename ){
  if(.Platform$OS.type=="windows"){
    system.time({
      cmd<-system(paste("/RTools/bin/wc -l \"",filename, "\"",sep=""), intern=TRUE)
      cmd<-strsplit(cmd, " ")[[1]][1]
    })
    return(as.numeric(cmd) + 1)
  } else {
    system.time({
      cmd<-system(paste("wc -l \"",filename,"\" | awk \'{print $1}\'", sep=""), intern=TRUE)
      cmd<-strsplit(cmd, " ")[[1]][1]
    })
    return(as.numeric(cmd) + 1)
  }
}
I expect it to build the correct file path, but instead it produces the error listed above.
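Reading the error literally, it complains that the program '/RTools/bin/wc' was not found, which suggests the missing piece is the wc binary at that hard-coded location rather than the quoting around filename. One way to sidestep the dependency is to count newlines directly instead of shelling out. The following is a minimal Python sketch of that idea, not a drop-in replacement for the R function; count_lines and the temporary demo file are made up for illustration.

```python
import os
import tempfile


def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, similar to what `wc -l` reports."""
    newlines = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
    return newlines


# Hypothetical demo file with three newline-terminated rows.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,value\n1,a\n2,b\n")
demo_path = tmp.name
demo_count = count_lines(demo_path)
os.remove(demo_path)
print(demo_count)  # 3
```

Because nothing here depends on an external program, the same code path runs on Windows and Mac, so the OS branch is no longer needed for counting.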

Download files from FTP folder using Loop

I am trying to download all the files inside an FTP folder:
temp <- tempfile()
destination <- "D:/test"
url <- "ftp://XX.XX.net/"
userpwd <- "USER:Password"
filenames <- getURL(url, userpwd = userpwd,ftp.use.epsv = FALSE,dirlistonly = TRUE)
filenames <- strsplit(filenames, "\r*\n")[[1]]
When I print filenames, I get all the file names inside the FTP folder, so the output is correct up to this point:
[1] "2018-08-28-00.gz" "2018-08-28-01.gz"
[3] "2018-08-28-02.gz" "2018-08-28-03.gz"
[5] "2018-08-28-04.gz" "2018-08-28-05.gz"
[7] "2018-08-28-08.gz" "2018-08-28-09.gz"
[9] "2018-08-28-10.gz" "2018-08-28-11.gz"
[11] "2018-08-28-12.gz" "2018-08-28-13.gz"
[13] "2018-08-28-14.gz" "2018-08-28-15.gz"
[15] "2018-08-28-16.gz" "2018-08-28-17.gz"
[17] "2018-08-28-18.gz" "2018-08-28-23.gz"
for ( i in filenames ) {
  download.file(paste0(url,i), paste0(destination,i), mode="w")
}
I got this error
trying URL 'ftp://XXX.net/2018-08-28-00.gz'
Error in download.file(paste0(url, i), paste0(destination, i), mode = "w") :
cannot open URL 'ftp://XXX.net/2018-08-28-00.gz'
In addition: Warning message:
In download.file(paste0(url, i), paste0(destination, i), mode = "w") :
InternetOpenUrl failed: 'The login request was denied'
I modified the code to
for ( i in filenames )
{
  #download.file(paste0(url,i), paste0(destination,i), mode="w")
  download.file(getURL(paste(url,filenames[i],sep=""), userpwd =
    "USER:PASSWORD"), paste0(destination,i), mode="w")
}
After that, I got this error
Error in function (type, msg, asError = TRUE) : RETR response: 550
Without a minimal, complete, and verifiable example it is a challenge to directly replicate your problem. Assuming the file names don't include the URL, you'll need to combine them to access the files.
download.file() requires a file to be read, an output file, as well as additional flags regarding whether you want a binary download or not.
For example, I have data from Alberto Barradas' Pokémon Stats kaggle.com data set stored on my GitHub site. To download some of the files to the test subdirectory of my R working directory, I can use the following code:
filenames <- c("gen01.csv","gen02.csv","gen03.csv")
fileLocation <- "https://raw.githubusercontent.com/lgreski/pokemonData/master/"
# use ./ for subdirectory of current directory, end with / to work with paste0()
destination <- "./test/"
# note that these are character files, so use mode="w"
for (i in filenames){
  download.file(paste0(fileLocation,i),
                paste0(destination,i),
                mode="w")
}
The paste0() function concatenates text without spaces, which allows the code to generate a fully qualified path name for the url of each source file, as well as the subdirectory where the destination file will be stored.
To illustrate what's happening with paste0() in the for() loop, we can use message() to print to the R console.
> # illustrate what paste0() does
> for (i in filenames){
+ message(paste("Source is: ",paste0(fileLocation,i)))
+ message(paste("Destination is:",paste0(destination,i)))
+ }
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen01.csv
Destination is: ./test/gen01.csv
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen02.csv
Destination is: ./test/gen02.csv
Source is: https://raw.githubusercontent.com/lgreski/pokemonData/master/gen03.csv
Destination is: ./test/gen03.csv
>
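The same pairing of source URL and destination path can be sketched in Python. This version adds the separating slash when it is missing, which matters for a destination such as "D:/test" from the question, since plain concatenation would give "D:/test2018-08-28-00.gz". The function name build_transfer_pairs is made up for illustration, and the sketch only builds the pairs; it does not log in or download anything.

```python
from pathlib import Path


def build_transfer_pairs(base_url, destination_dir, filenames):
    """Pair each file name with its full source URL and destination path."""
    # Make sure the base URL ends with exactly one separating slash.
    base = base_url if base_url.endswith("/") else base_url + "/"
    destination = Path(destination_dir)
    # Path's "/" operator inserts the directory separator for us.
    return [(base + name, destination / name) for name in filenames]


pairs = build_transfer_pairs(
    "ftp://XX.XX.net",  # placeholder host taken from the question
    "D:/test",          # no trailing slash, as in the question
    ["2018-08-28-00.gz", "2018-08-28-01.gz"],
)
for source, target in pairs:
    print("Source is:     ", source)
    print("Destination is:", target.as_posix())
```

Note that .gz files are binary, so whichever tool performs the download should write them in binary mode rather than text mode.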

How to open a text file in Google Colab

I have recently started using a Google Colab Jupyter notebook. After uploading a text file, I am unable to open it using the open function in Python 3.
from google.colab import files
import io
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
data_path = io.StringIO(uploaded['fra.txt'].decode('utf-8'))
with open(data_path, 'rb') as f:
    lines = f.read().split('\n')
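But it gives this error: TypeError: expected str, bytes or os.PathLike object, not _io.StringIO
How do I open a text file in a Google Colab Jupyter notebook?
Change that line to just
data_path = 'fra.txt'
That should work, provided the file is also opened in text mode ('r' rather than 'rb'), because split('\n') with a str separator only works on a str.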
The _io.StringIO refers to the StringIO object (in-memory file stream). "For strings StringIO can be used like a file opened in text mode."
The issue is that the file is already open and you have it available to you as a StringIO buffer. I think you want to do readlines() on the StringIO object (data_path).
You can also call getvalue() on the object and get the str of the entire buffer.
https://docs.python.org/3/library/io.html#io.StringIO
See my example here; which I started with your code...
https://colab.research.google.com/drive/1Vbh13FVm02HMXeHXx-Zko1pFpqyp7bwI
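To make the StringIO suggestions above concrete, here is a small sketch that runs outside Colab. The uploaded dictionary is a hypothetical stand-in for what files.upload() returns (file name mapped to bytes), and the two sample lines are invented.

```python
import io

# Hypothetical stand-in for the dict returned by files.upload().
uploaded = {"fra.txt": "Go.\tVa !\nHi.\tSalut !\n".encode("utf-8")}

# The buffer is already an open, in-memory text file; no open() call is needed.
buffer = io.StringIO(uploaded["fra.txt"].decode("utf-8"))

# Option 1: read everything and split, as the original code intended.
lines = buffer.read().split("\n")
print(lines)  # ['Go.\tVa !', 'Hi.\tSalut !', '']

# Option 2: rewind and use readlines(), stripping the trailing newlines.
buffer.seek(0)
stripped = [line.rstrip("\n") for line in buffer.readlines()]
print(stripped)  # ['Go.\tVa !', 'Hi.\tSalut !']

# Option 3: getvalue() returns the whole buffer regardless of position.
whole = buffer.getvalue()
```

The trailing empty string in option 1 comes from the final newline; readlines() avoids it.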
Do it like this:
import numpy as np

with open('anna.txt', 'r') as f:
    text = f.read()
vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)

Unzip password protected zip files in R

A password cannot be specified in unzip (utils) function. The other function I am aware of, getZip (Hmisc), only works for zip files containing one compressed file.
I would like to do something like this to unzip all the files in foo.zip in Windows 8:
unzip("foo.zip", password = "mypass")
I found this question very useful but saw that no formal answers were posted, so here goes:
1. First I installed 7z.
2. Then I added "C:\Program Files\7-Zip" to my environment path.
3. I tested that the 7z command was recognized from the command line.
4. I opened R and typed in system("7z x secure.7z -pPASSWORD") with the appropriate PASSWORD.
I have multiple zipped files and I'd rather not the password show in the source code or be stored in any text file, so I wrote the following script:
file_list <- list.files(path = ".", pattern = "\\.7z$", all.files = TRUE)
pw <- readline(prompt = "Enter the password: ")
for (file in file_list) {
  # Build and run one 7z extraction command per archive
  sys_command <- paste0("7z ", "x ", file, " -p", pw)
  system(sys_command)
}
which when sourced will prompt me to enter the password, and the zip files will be decompressed in a loop.
I found @Kim's answer worked for me eventually, but not at first. I thought I'd add a few extra links and steps that helped me get there in the end.
Close and reopen R so that the environment path is recognised
If you already had R open when you did steps 1-3, you need to close and reload R for it to recognise the environment path for 7z. @wush978's answer to the question "r system doesn't work when trying 7zip" was informative. I used Sys.getenv("PATH") to check that 7zip was included in the environment paths.
Step 4. I opened R and typed in system("7z x secure.7z -pPASSWORD") with the appropriate PASSWORD.
I actually found this didn't work so I modified it slightly following the instructions in this post which also explains how to specify an output directory https://stackoverflow.com/a/16098709/13678913.
If you have already extracted the files the system command prompts you to choose whether you want to replace the existing file with the file from the archive and provides options
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?
So the modified step 4 (the trailing Y is meant to allow replacement of existing files) is:
system("7z e -ooutput_dir secure.zip -pPASSWORD Y")
Putting this all together as a modified set of instructions:
1. Install 7z.
2. Add "C:\Program Files\7-Zip\" to the environment path using the menu options (instructions here: https://www.opentechguides.com/how-to/article/windows-10/113/windows-10-set-path.html).
3. Close and reopen RStudio. Type Sys.getenv("PATH") to check that the path to 7zip is recognised in the environment (as per @wush978's answer to the question "r system doesn't work when trying 7zip").
4. Type system("7z e -oC:/My Documents/output_dir secure.zip -pPASSWORD") in the console with the appropriate PASSWORD (as per the instructions here: https://stackoverflow.com/a/16098709/13678913).
And here is a modified version of @Kim's neat function (including a specified output directory and a check for existing files):
My main script
output_dir <- "C:/My Documents/output_dir " #space after directory name is important
zippedfiles_dir <- "C:/My Documents/zippedfiles_dir/"
file_list <- paste0(output_dir , zippedfiles_dir , list.files(path = zippedfiles_dir, pattern = ".zip", all.files = T))
source("unzip7z.R")
Code inside source file unzip7z.R
pw = readline(prompt = "Enter the password: ")
for (file in file_list) {
  csvfile <- gsub("\\.zip", "\\.csv", gsub(".*? ", "", file)) #csvfile name (removes output_dir from 'file' and replaces .zip extension with .csv)
  #check if csvfile already exists in output_dir, and if it does, replace it with archived version and if it doesn't exist, continue to extract.
  if(file.exists(csvfile)) {
    sys_command = paste0("7z ", "e -o", file, " -p", pw, " Y")
  } else {
    sys_command = paste0("7z ", "e -o", file, " -p", pw)
  }
  system(sys_command)
}
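For comparison, here is a Python sketch of the same loop that passes the 7z arguments as a list, so paths containing spaces (such as "C:/My Documents/output_dir") need no special quoting, and that reads the password without echoing it. It assumes 7z is on the PATH as described in the steps above. The -y switch, which tells 7z to assume Yes on all queries, is swapped in here for the trailing Y; build_7z_command and extract_all are names made up for this sketch.

```python
import getpass
import subprocess
from pathlib import Path


def build_7z_command(archive, password, output_dir=None, overwrite=False):
    """Build the argument list for extracting one archive with 7z."""
    command = ["7z", "x", str(archive), "-p" + password]
    if output_dir is not None:
        # -o takes the directory with no space after the switch.
        command.append("-o" + str(output_dir))
    if overwrite:
        # -y answers Yes to every prompt, including overwrite prompts.
        command.append("-y")
    return command


def extract_all(directory=".", pattern="*.7z", output_dir=None):
    """Prompt once for the password, then extract every matching archive."""
    password = getpass.getpass("Enter the password: ")
    for archive in sorted(Path(directory).glob(pattern)):
        command = build_7z_command(archive, password, output_dir, overwrite=True)
        subprocess.run(command, check=True)


print(build_7z_command("secure.7z", "PASSWORD", "C:/My Documents/output_dir"))
```

Calling extract_all() would prompt for the password and run 7z once per archive; only build_7z_command is exercised in the print above.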
password <- "your password"
read.table(
  text = system(paste0("unzip -p -P ", password, " yourfile.zip ", "yourfile.csv"),
    intern = TRUE
  ), stringsAsFactors = FALSE, header = TRUE, sep = ","
)
password <- "your password"
system(
  command = paste0("unzip -o -P ", password, " ", "yourfile.zip"),
  wait = TRUE
)
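As a point of comparison with the unzip -P calls above, Python's standard zipfile module can also extract password-protected ZIP files without an external program, with the caveat that it only handles the traditional ZIP encryption and not AES-encrypted archives. The sketch below is illustrative: extract_with_password is a made-up name, and the demo archive is unencrypted (zipfile cannot create encrypted files), so the password is simply ignored there.

```python
import os
import tempfile
import zipfile


def extract_with_password(zip_path, password, output_dir):
    """Extract every member of a ZIP file, supplying a password if needed."""
    with zipfile.ZipFile(zip_path) as archive:
        # pwd must be bytes; it is only used for encrypted members.
        archive.extractall(path=output_dir, pwd=password.encode("utf-8"))
        return archive.namelist()


# Demo with a throwaway, unencrypted archive.
with tempfile.TemporaryDirectory() as workdir:
    demo_zip = os.path.join(workdir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("yourfile.csv", "a,b\n1,2\n")
    out_dir = os.path.join(workdir, "out")
    extracted = extract_with_password(demo_zip, "unused", out_dir)
    with open(os.path.join(out_dir, "yourfile.csv")) as handle:
        demo_text = handle.read()

print(extracted)  # ['yourfile.csv']
```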
