How can I unzip a file in Kettle when the zip contains Cyrillic file names?

I am trying to unzip file.zip, which contains the files (a, b, c), in Pentaho Kettle (File management -> Unzip file). It works fine.
But if I try to unzip a file.zip that contains the files (a, b, ж), for example, I get errors:
2016/01/18 17:46:17 - cfgbuilder - Warning: The configuration parameter [org] is not supported by the default configuration builder for scheme: sftp
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : Could not unzip file [file:///D:/projects/loaders/loader_little_files/src.zip]. Exception : [MALFORMED]
2016/01/18 17:46:17 - Unzip file - ERROR (version 6.0.1.0-386, build 1 from 2015-12-03 11.37.25 by buildguy) : java.lang.IllegalArgumentException: MALFORMED
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipCoder.toString(ZipCoder.java:58)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile.getZipEntry(ZipFile.java:566)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile.access$900(ZipFile.java:60)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.next(ZipFile.java:524)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:499)
2016/01/18 17:46:17 - Unzip file - at java.util.zip.ZipFile$ZipEntryIterator.nextElement(ZipFile.java:480)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.zip.ZipFileSystem.init(ZipFileSystem.java:91)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractVfsContainer.addComponent(AbstractVfsContainer.java:53)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractFileProvider.addFileSystem(AbstractFileProvider.java:103)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:88)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.findFile(AbstractLayeredFileProvider.java:61)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:790)
2016/01/18 17:46:17 - Unzip file - at org.apache.commons.vfs2.impl.DefaultFileSystemManager.resolveFile(DefaultFileSystemManager.java:712)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:151)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.core.vfs.KettleVFS.getFileObject(KettleVFS.java:106)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.unzipFile(JobEntryUnZip.java:618)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.processOneFile(JobEntryUnZip.java:516)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.entries.unzip.JobEntryUnZip.execute(JobEntryUnZip.java:461)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:730)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:873)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.execute(Job.java:546)
2016/01/18 17:46:17 - Unzip file - at org.pentaho.di.job.Job.run(Job.java:435)
I am using Windows 7, and that is where I created the "ж" file.
I also tried renaming the file to "ж" on Linux, but the result did not change.
How can I do this? Is there a hidden setting?
Thanks!

The cause is non-UTF-8 encoding of the file names inside the zip file.
Taken from here: https://blogs.oracle.com/xuemingshen/entry/non_utf_8_encoding_in
The important parts:
The Zip specification (historically) does not specify what character encoding to be used for the embedded file names
Jar specification meanwhile explicitly specifies to use UTF-8 as the encoding to encode and decode all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefor strictly followed Jar specification to use UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.
Windows NTFS stores file names in UTF-16, and zip tools on Windows usually write entry names in a local code page rather than UTF-8, so Cyrillic characters in file names cause problems in Java applications. Trouble arises when you create a zip archive with a third-party tool (unless it is Java-based, which is rare) and then unzip it with a Java tool such as PDI.
Good news for Linux users: ext4 effectively uses UTF-8 by default. (Strictly speaking, it does not depend on any encoding and just stores a byte sequence, but the environment in which you create files, whether a shell or a GUI such as GNOME's Nautilus file manager, assumes UTF-8 when writing file names to disk. Qt relies on the locale.) Of course there are ways to override this, but as far as I know UTF-8 has become the widely used default locale.
Conclusion:
A zip file created on Linux (tested on Ubuntu) can be unzipped using PDI.
A zip file created using the Java API can be unzipped anywhere using PDI.
A zip file created on Windows can cause trouble when unzipped using PDI.
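To see why Java reports MALFORMED, here is a minimal standalone sketch (not PDI code). It assumes the Windows tool wrote the entry name in cp866 (Java name IBM866); the class and method names are made up for illustration. Strict UTF-8 decoding of those bytes fails, while decoding with the charset that was actually used recovers the name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class ZipNameDecodingDemo {

    // Decodes strictly: returns null instead of silently replacing bad bytes.
    static String strictDecode(byte[] raw, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Charset cp866 = Charset.forName("IBM866");
        // "\u0436" is the Cyrillic letter zhe; assume the zip tool stored it as cp866 bytes.
        byte[] storedName = "\u0436.txt".getBytes(cp866);

        // These bytes are not valid UTF-8, so a strict UTF-8 reader has to give up.
        System.out.println("as UTF-8: " + strictDecode(storedName, StandardCharsets.UTF_8));
        // The charset that was really used round-trips the name.
        System.out.println("as cp866: " + strictDecode(storedName, cp866));
    }
}
```

The same mismatch is what the Unzip file job entry runs into: it reads the names as UTF-8, and the archive does not contain UTF-8.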

Here is how to decompress a zip file created on Windows 8.1 with 7-Zip, where the file names contain Cyrillic characters. The zip archive contains 3 files named:
а.txt
ж.txt
ё.txt
Fortunately, all the needed libraries (Apache commons-compress and commons-io) are already in the directory PENTAHO_HOME/lib, so you don't have to add extra libraries to Kettle.
Here is the code for a "User Defined Java Class" step:
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Enumeration;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.io.IOUtils;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
    Object[] r = getRow();
    // No more input rows: tell Kettle this step is finished.
    // (Assumes the step receives at least one input row, e.g. from a Generate Rows step.)
    if (r == null) {
        setOutputDone();
        return false;
    }
    r = createOutputRow(r, data.outputRowMeta.size());

    // FNAME: absolute path to the zip file; OUT: directory to extract into.
    String fname = getVariable("FNAME", null);
    String outDir = getVariable("OUT", null);
    System.out.println(fname + " " + outDir);
    try {
        File inputFile = new File(fname);
        // "cp866" is the encoding of the entry names; false disables Unicode extra fields.
        ZipFile zipFile = new ZipFile(inputFile, "cp866", false);
        try {
            Enumeration enumEntry = zipFile.getEntries();
            int i = 0;
            while (enumEntry.hasMoreElements()) {
                ZipArchiveEntry entry = (ZipArchiveEntry) enumEntry.nextElement();
                String entryName = entry.getName();
                System.out.println(entryName);
                // The counter prefix keeps output names unique even if decoding fails.
                OutputStream os = new FileOutputStream(new File(outDir, Integer.valueOf(++i) + entryName));
                InputStream is = zipFile.getInputStream(entry);
                IOUtils.copy(is, os);
                is.close();
                os.close();
            }
        } finally {
            // Release the archive's file handle even if extraction fails.
            zipFile.close();
        }
    } catch (Exception exc) {
        System.out.println("Failed to unzip");
        exc.printStackTrace();
    }
    putRow(data.outputRowMeta, r);
    return true;
}
The important parts of the code are:
String fname = getVariable("FNAME", null);
String outDir = getVariable("OUT", null);
This means that 2 variables must be available in the transformation:
FNAME - absolute path to the zip file,
OUT - directory into which the files should be extracted.
In this line:
ZipFile zipFile = new ZipFile(inputFile, "cp866", false);
"cp866" is the encoding that 7-Zip used for the zip entry names (cp866 on Windows). If you use another zip tool, you might need to change the encoding. There are some notes at https://commons.apache.org/proper/commons-compress/zip.html, in the section "Recommendations for Interoperability". You can also write your own algorithm to identify the encoding, relying for example on a known part of the file names in the zip archive.
Anyway, most probably this Kettle job/transformation will consume zip files from a single, known source, so you just need to identify the proper encoding once and set it in the code.
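As a rough illustration of identifying the encoding from a known part of a file name, here is a standalone sketch. The helper class, its method, and the candidate list are all hypothetical and not part of PDI or commons-compress. It works on the raw bytes of an entry name; with commons-compress those could, I believe, be taken from ZipArchiveEntry.getRawName(), but that wiring is left out here.

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of PDI or commons-compress.
class ZipNameCharsetGuesser {

    // Returns the first candidate charset under which the raw entry name
    // contains the fragment we expect, or null if none matches.
    static Charset guess(byte[] rawName, String knownFragment, String... candidates) {
        for (String candidate : candidates) {
            Charset charset = Charset.forName(candidate);
            // new String(...) replaces undecodable bytes, so a wrong charset
            // yields text that will not contain the expected fragment.
            String decoded = new String(rawName, charset);
            if (decoded.contains(knownFragment)) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate an entry name written by a Windows tool in cp866.
        byte[] rawName = "\u0436.txt".getBytes(Charset.forName("IBM866"));
        Charset found = guess(rawName, "\u0436", "UTF-8", "windows-1251", "IBM866");
        System.out.println("detected: " + found);
    }
}
```

The known fragment should contain non-ASCII characters; a pure ASCII fragment such as ".txt" decodes identically in all of these charsets and would match the first candidate.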
This line:
Integer.valueOf(++i) + entryName)
Why is the file name generated with an integer prefix? If the wrong encoding is used, ZipFile decodes the zip entry names to [].txt (it cannot decode а.txt or ж.txt, so it replaces the characters 'а' and 'ж' with '[]'). As a result, if the encoding is wrong and the file names have the same length and are written in Cyrillic, each entry in the loop overwrites the same file, and in the end you get a single file named [].txt.
With the counter in the file name, you guarantee that all files get different names even if you cannot decode the names correctly:
1[].txt
2[].txt
3[].txt
Anyway, if you know the encoding exactly, just remove this part of the code to get rid of the numbers in the file names.
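A possible variation, if you would rather keep the decoded names whenever they are already unique, is to add a number only when a name collides. This is a standalone sketch that has not been tried inside the User Defined Java Class step; the class and method names are invented for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the names are illustrative only.
class UniqueEntryNames {

    // Returns the name unchanged if it is new, otherwise prefixes a counter
    // until the result has not been used before. Records the result in 'used'.
    static String uniqueName(Set<String> used, String name) {
        String candidate = name;
        int n = 1;
        while (!used.add(candidate)) {
            candidate = (n++) + "_" + name;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> used = new HashSet<String>();
        // Three entries whose names all decoded to the same placeholder.
        System.out.println(uniqueName(used, "[].txt"));
        System.out.println(uniqueName(used, "[].txt"));
        System.out.println(uniqueName(used, "[].txt"));
    }
}
```

With a correct encoding every name passes through untouched, so there is nothing to remove from the code later.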

The only thing that worked for me on Debian Jessie was to install WinRAR under Wine and choose the file name encoding there.

Related

R shell.exec requires full file path

I installed R and RStudio this week on a new Windows 10 machine. I want to use this R code to launch Excel and open a CSV file that is in a subdirectory of the current working directory:
file <- "example.csv"
sub_dir <- "subdirectory"
shell.exec(file.path(sub_dir, file))
But I get this error:
Error in shell.exec(file.path(sub_dir, file)) :
'subdirectory/example.csv' not found
However, if I provide shell.exec with the full file path, this code works as expected:
shell.exec(file.path(getwd(), sub_dir, file))
The documentation for shell.exec states:
The path in file is interpreted relative to the current working
directory.
R versions 2.13.0 and earlier interpreted file relative to the R home
directory, so a complete path was usually needed.
Why doesn't my original code (without getwd) work? Thanks.
It looks to be related to the path separator in some wacky way. Below, I specify the file path separator as \ and the command executes as expected. Another option is to keep your call to file.path() and simply wrap it in normalizePath().
file <- "example.csv"
sub_dir <- "subdirectory"
dir.create(sub_dir)
writeLines("myfile",file.path(sub_dir, file))
# Works
shell.exec(file.path(sub_dir, file, fsep = "\\"))
# Fails with the default "/" separator
shell.exec(file.path(sub_dir, file))
#> Error in shell.exec(file.path(sub_dir, file)): 'subdirectory/example.csv' not found

Open Excel keyword not opening the Excel file

In the RIDE editor, the test fails when I use the Open Excel keyword:
Open excel D:/RobotProjects/Testproj/Demo1.xls default=True
or
Open excel D:/RobotProjects/Testproj/Demo1.xls default=False
TEST readexceldemo
Full Name: Testproj.ExcelDemo.readexceldemo
Start / End / Elapsed: 20190825 02:52:29.280 / 20190825 02:52:29.287 / 00:00:00.007
Status: FAIL (critical)
Message: IOError: [Errno 2] No such file or directory: u'D:\RobotProjects\Testproj\Demo1.xls'
00:00:00.004KEYWORD ExcelLibrary . Open Excel D:\RobotProjects\Testproj\Demo1.xls, default=True
Documentation:
Opens the Excel file from the path provided in the file name parameter. If the boolean useTempDir is set to true, depending on the
operating system of the computer running the test the file will be opened in the Temp directory if the operating system is Windows
or tmp directory if it is not.
Start / End / Elapsed: 20190825 02:52:29.281 / 20190825 02:52:29.285 / 00:00:00.004
02:52:29.285 FAIL IOError: [Errno 2] No such file or directory: u'D:\RobotProjects\Testproj\Demo1.xls'
TEST readexceldemo
Full Name: Testproj.ExcelDemo.readexceldemo
Start / End / Elapsed: 20190825 02:53:45.656 / 20190825 02:53:45.665 / 00:00:00.009
Status: FAIL (critical)
Message: IOError: [Errno 2] No such file or directory: u'D:/RobotProjects/Testproj/Demo1.xls'
00:00:00.006KEYWORD ExcelLibrary . Open Excel D:/RobotProjects/Testproj/Demo1.xls, default=True
Documentation:
Opens the Excel file from the path provided in the file name parameter. If the boolean useTempDir is set to true, depending on the
operating system of the computer running the test the file will be opened in the Temp directory if the operating system is Windows
or tmp directory if it is not.
Start / End / Elapsed: 20190825 02:53:45.657 / 20190825 02:53:45.663 / 00:00:00.006
02:53:45.663 FAIL IOError: [Errno 2] No such file or directory: u'D:/RobotProjects/Testproj/Demo1.xls'
The error message you show comes from Robot Framework, which did not find the file.
You would probably get the same error if you ran the test from a command window.
Try using the backslash separator (which must be doubled). The keyword call would be:
Open excel D:\\RobotProjects\\Testproj\\Demo1.xls default=True

Can't move file after download and unzip

I'm trying to download a zip file from a source, unzip it, and then move a file to another directory.
First, the download:
if (!file.exists("inst/extdata/sp_resultados_universo")) {
tmp <- tempfile(fileext = ".zip")
download.file("ftp://ftp.ibge.gov.br/Censos/Censo_Demografico_2010/Resultados_do_Universo/Agregados_por_Setores_Censitarios/SP_Capital_20180416.zip", tmp, quiet = TRUE)
unzip(tmp, exdir = "inst/extdata/sp_resultados_universo", junkpaths=T)
unlink(tmp)
}
The file I want is in the directory inst/extdata/sp_resultados_universo/SP Capital/Base informa�oes setores2010 universo SP_Capital (codificação inválida)/CSV/ (the suffix "codificação inválida" means "invalid encoding"), so when I try to move it to inst/extdata/sp_resultados_universo/ I get an error:
file.rename("inst/extdata/sp_resultados_universo/SP%20Capital/Base%20informa%87oes%20setores2010%20universo%20SP_Capital(condificação inválida)/CSV/Domicilio02_SP1.csv",
"inst/extdata/sp_resultados_universo/Domicilio02_SP1.csv")
Warning message:
In file.rename("inst/extdata/sp_resultados_universo/SP%20Capital/Base%20informa%87oes%20setores2010%20universo%20SP_Capital(condificação inválida)/CSV/Domicilio02_SP1.csv", :
it was not possible to rename file 'inst/extdata/sp_resultados_universo/SP%20Capital/Base%20informa%87oes%20setores2010%20universo%20SP_Capital(condificação inválida)/CSV/Domicilio02_SP1.csv'
for 'inst/extdata/sp_resultados_universo/Domicilio02_SP1.csv',
reason 'File or directory not found'
I'm translating the error message, so it may not match the English message exactly.
I can rename the directory or move the file manually, but that breaks the flow and is bad for reproducibility. How can I handle this inside R?
My system info:
Sys.info()
sysname
"Linux"
release
"4.9.0-6-amd64"
version
"#1 SMP Debian 4.9.88-1+deb9u1 (2018-05-07)"
machine
"x86_64"
Many thanks in advance for any help.
When using R, you can interact with the Linux shell (or the Windows command line) through a call to system(), passing the quoted command just as you would type it in the shell,
for instance:
system("pwd") # prints the current working directory
system("date") # prints the current date and time
system("ls | grep .R") # prints a list of R scripts in the current working directory
system("mv file.txt /home/new_directory/file.txt") # moves your file to another directory

PowerShell to zip WordPress plugin

I have a folder containing the code for a WordPress plugin, and the following script:
$folder = Join-Path $PSScriptRoot $pluginName
$destination = Join-Path $PSScriptRoot "bundles/$pluginName.zip"
Add-Type -AssemblyName "System.IO.Compression.FileSystem"
if (Test-Path $destination) {
Remove-Item $destination
}
[IO.Compression.ZipFile]::CreateFromDirectory(
$folder,
$destination,
[IO.Compression.CompressionLevel]::Optimal,
$true # include base directory
)
If I try to install the plugin by uploading the produced zip file to a local WordPress instance, I get the following errors:
Warning: copy(/var/www/html/wp-content/plugins/jwt-authentication-for-wp-rest-api/jwt-authentication-for-wp-rest-api\jwt-auth.php): failed to open stream: Invalid argument in /var/www/html/wp-admin/includes/class-wp-filesystem-direct.php on line 257
Warning: copy(/var/www/html/wp-content/plugins/jwt-authentication-for-wp-rest-api/jwt-authentication-for-wp-rest-api\jwt-auth.php): failed to open stream: Invalid argument in /var/www/html/wp-admin/includes/class-wp-filesystem-direct.php on line 257
Could not copy file. /var/www/html/wp-content/plugins/jwt-authentication-for-wp-rest-api/jwt-authentication-for-wp-rest-api\jwt-auth.php
Plugin install failed.
However, if I manually zip the plugin, by right-clicking the plugin folder in Windows Explorer and choosing "Send to..." -> "Compressed folder", the generated zip file installs fine.
I can't figure out why this happens, because if I unzip the folders and diff their contents, they are identical. (The zipped files are not, but I assume that's because of compression levels etc).
Do I have to set any specific flags when zipping to make this work? How do I script production of a zip file that works exactly like "Send to..." -> "Compressed folder"?

R error when using untar

I'm running a script whose input parameters automate creating the directory, downloading the file, and untarring it. I would be fine with unzip, but the particular file I want to analyze is a .tar.gz. When I unpacked it manually, the .tar.gz unpacked to a .tar file. Could that be the problem?
Full error: Error in untar2(tarfile, files, list, exdir) : unsupported entry type ‘’
Running Windows 10, 64 bit, R set to: [Default] [64-bit] C:\Program Files\R\R-3.2.2
The script notes one solution I found (issues, lines 28-31), but I don't really understand it.
I installed 7-Zip on my computer, restarted it, and of course restarted R:
#DOWNLOADING AND UNZIPPING TAR FILE
#load required packages.
#If there is a load package error, use install.packages("[package]")
library(dplyr)
library(lubridate)
library(XML) # HTML processing
options(stringsAsFactors = FALSE)
#Set directory locations, data file and fetch data file from internet
#enter full url including file name between ' ' marks
mainDir<-"C:/R/BEES/"
subDir<-"C:/R/BEES/Killers"
Fetch<-'http://dds.cr.usgs.gov/pub/data/nationalatlas/afrbeep020_nt00218.tar.gz'
ArchFile<-basename(Fetch)
download.file<-(ArchFile)
#Check for file directories and create the directory if it doesn't exist
if(!file.exists(mainDir)){dir.create(mainDir)}
if(!file.exists(subDir)){dir.create(subDir)}
#set the working directory
setwd(file.path(subDir))
#check if file exists and download if it doesn't exist.
if(!file.exists(ArchFile))
{download.file (url=Fetch,destfile=ArchFile,method='auto')}
#unpack and view file list
untar(path.expand(ArchFile),list=TRUE,exdir=subDir,compressed="gzip")
list.files(subDir)
#Error: Error in untar2(tarfile, files, list, exdir) :
# unsupported entry type ‘’
#Need solution to use tar/untar app
#instructions here: https://stevemosher.wordpress.com/step-10-build/
Appreciate feedback - I've been lurking around StackOverflow for some time to use other people's solutions.

Resources