I'm trying to run a ScanContent processor on Apache NiFi. I can get the processor to run when scanning a text file, using a .txt dictionary file with the search terms contained in it (delimited by newline characters), but I cannot get it to run when searching a file using the binary type of the processor for the dictionary file.
I am unsure whether I am simply using the wrong format for the binary dictionary file, or whether it needs to be encoded differently. I couldn't find any example dictionaries online that would be of any use (most resources relate to ScanAttributes instead).
The format of my dictionary file is:
(inside a .txt file)
32 00001001001000010000100001000000\n
According to the documentation, each dictionary entry must be a 4-byte integer (the length of the term), followed by the binary search term itself.
Does anyone have experience using this processor with a binary dictionary who could help specify the format?
A binary dictionary file would typically be generated as the output of another program. There is an example in the ScanContent unit tests for how to accomplish this in Java:
@Test
public void testBinaryScan() throws IOException {
    // Create dictionary file.
    final String[] terms = new String[]{"hello", "good-bye"};
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (final DataOutputStream dictionaryOut = new DataOutputStream(baos)) {
        for (final String term : terms) {
            final byte[] termBytes = term.getBytes("UTF-8");
            dictionaryOut.writeInt(termBytes.length);
            dictionaryOut.write(termBytes);
        }
        final byte[] termBytes = baos.toByteArray();
        final Path dictionaryPath = Paths.get("target/dictionary");
        Files.write(dictionaryPath, termBytes, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
...
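If you want to sanity-check a dictionary written in this format, the length-prefixed entries can be read back with a DataInputStream. Here is a minimal, self-contained sketch of the round trip; the terms are just examples, and everything is held in memory rather than written to target/dictionary:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BinaryDictionaryDemo {
    public static void main(String[] args) throws IOException {
        String[] terms = {"hello", "good-bye"};

        // Write: for each term, a 4-byte big-endian length, then the raw term bytes.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(baos)) {
            for (String term : terms) {
                byte[] termBytes = term.getBytes("UTF-8");
                out.writeInt(termBytes.length);
                out.write(termBytes);
            }
        }

        // Read back: a length first, then exactly that many bytes.
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(baos.toByteArray()))) {
            while (in.available() > 0) {
                int len = in.readInt();
                byte[] termBytes = new byte[len];
                in.readFully(termBytes);
                System.out.println(new String(termBytes, "UTF-8"));
            }
        }
    }
}
```

Note that writeInt produces a big-endian length, so a dictionary generated by another tool must use the same byte order.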
Related
My intention is to index an arbitrary directory containing PDF files (among other file types) with keywords stored in a list. I have a traditional solution, and I heard that graph-based solutions using e.g. SimpleGraph could be more elegant/efficient and independent of directory structures.
What would a graph-based solution (e.g. SimpleGraph) look like?
Traditional solution
// https://stackoverflow.com/a/14051951/1497139
List<File> pdfFiles = this.explorePath(TestPDFFiles.RFC_DIRECTORY, "pdf");
List<PDFFile> pdfs = this.getPdfsFromFileList(pdfFiles);
…
for (PDFFile pdf : pdfs) {
    // https://stackoverflow.com/a/9560307/1497139
    if (org.apache.commons.lang3.StringUtils.containsIgnoreCase(pdf.getText(), keyWord)) {
        foundList.add(pdf.file.getName()); // here we access by structure (early binding)
                                           // - in the graph solution by name (late binding)
    }
}
Basically with SimpleGraph you'd use a combination of the modules
FileSystem
PDFSystem
With the FileSystem module you collect a graph of the files in the directory and filter it to include only files with the extension "pdf". Then you analyze the PDFs using the PDFSystem to get the page/text structure. There is already a test case for this in the simplegraph-bundle module showing how it works with some RFC PDFs as input.
TestPDFFiles.java
I have now added the indexing test, see below.
The core functionality has been taken from the old test, which searched for a single keyword; the keyword is now passed in as a parameter:
List<Object> founds = pdfSystem.g().V().hasLabel("page")
.has("text", RegexPredicate.regex(".*" + keyWord + ".*")).in("pages")
.dedup().values("name").toList();
This is a Gremlin query that does most of the work, searching a whole tree of PDF files with a single call. I consider this more elegant since you do not have to care about the structure of the input (tree/graph/filesystem/database, etc.).
JUnit Testcase
/**
 * test for https://github.com/BITPlan/com.bitplan.simplegraph/issues/12
 */
@Test
public void testPDFIndexing() throws Exception {
  FileSystem fs = getFileSystem(RFC_DIRECTORY);
  int limit = Integer.MAX_VALUE;
  PdfSystem pdfSystem = getPdfSystemForFileSystem(fs, limit);
  Map<String, List<String>> index = this.getIndex(pdfSystem, "ARPA",
      "proposal", "plan");
  // debug=true;
  if (debug) {
    for (Entry<String, List<String>> indexEntry : index.entrySet()) {
      List<String> fileNameList = indexEntry.getValue();
      System.out.println(String.format("%15s=%3d %s", indexEntry.getKey(),
          fileNameList.size(), fileNameList));
    }
  }
  assertEquals(14, index.get("ARPA").size());
  assertEquals(9, index.get("plan").size());
  assertEquals(8, index.get("proposal").size());
}
I have a .NET Core project that worked fine on 1.1 but is now failing on 2.0. The problem happens when I try to unzip a zip archive with ZipFile.ExtractToDirectory (from System.IO.Compression). I get one or two files out, but then it throws an exception with the message:
The process cannot access the file '<path to file>' because it is being used by another process.
As far as I can tell there is no other process that could possibly be using that file as it has just been extracted. It is actually present on the disk when I get the error.
The stack trace is:
at System.IO.Win32FileSystem.OpenHandle(String fullPath, Boolean asDirectory)
at System.IO.Win32FileSystem.SetLastWriteTimeInternal(String fullPath, DateTimeOffset time, Boolean asDirectory)
at System.IO.Win32FileSystem.SetLastWriteTime(String fullPath, DateTimeOffset time, Boolean asDirectory)
at System.IO.File.SetLastWriteTime(String path, DateTime lastWriteTime)
at System.IO.Compression.ZipFileExtensions.ExtractToFile(ZipArchiveEntry source, String destinationFileName, Boolean overwrite)
at System.IO.Compression.ZipFileExtensions.ExtractToDirectory(ZipArchive source, String destinationDirectoryName, Boolean overwrite)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileName, String destinationDirectoryName, Encoding entryNameEncoding, Boolean overwrite)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileName, String destinationDirectoryName)
at FunctionDataStore.PackageManagement.InstallPackage(String pkgFile, String pkgDir) in D:\Projects\Momo\UserPlatform\FunctionDataStore\PackageManagement.cs:line 250
I can go back to 1.1 for the time being but I need to move to 2.0.
Does anyone know what might be causing this exception and what I can do about it?
Added: 27 Nov 17
I thought I had the answer after a reboot. But now the problem remains no matter how many times I reboot the system. There is some problem with .NET Core 2.0 and System.IO.Compression.ZipFile. I have verified that no other process is actually using the extracted files when the error happens.
Addendum: 27 Nov 17
Since the stack trace shows ExtractToDirectory is failing on the SetLastWriteTime call, and I don't really care about the timestamps on the files, I replaced the ExtractToDirectory call with the following:
using (ZipArchive archive = ZipFile.Open(pkgFile, ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        var filename = pkgDir + "\\" + entry.FullName;
        var fileDir = Path.GetDirectoryName(filename);
        Directory.CreateDirectory(fileDir);
        using (BinaryWriter writer = new BinaryWriter(File.Open(filename, FileMode.Create)))
        {
            byte[] bytes = new byte[1024];
            int numbytes;
            var stream = entry.Open();
            while ((numbytes = stream.Read(bytes, 0, 1024)) > 0)
            {
                writer.Write(bytes, 0, numbytes);
            }
        }
    }
}
where pkgFile is the zip file to be read and pkgDir is the directory to extract to. This seems to work without problem.
I still don't know why the SetLastWriteTime is failing in ExtractToDirectory, though. But this workaround seems to be sufficient for my needs.
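For comparison, the same workaround idea (walking the archive yourself, copying only each entry's bytes, and never setting file timestamps) can be sketched on the JVM with java.util.zip; the archive contents and paths below are fabricated for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ManualUnzip {
    // Extracts a zip archive entry by entry, copying only the file bytes
    // and deliberately skipping any timestamp/metadata calls.
    static void extract(Path zipFile, Path destDir) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path target = destDir.resolve(entry.getName()).normalize();
                if (!target.startsWith(destDir)) continue; // guard against zip-slip
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway archive so the sketch is self-contained.
        Path dir = Files.createTempDirectory("unzip-demo");
        Path zip = dir.resolve("pkg.zip");
        try (ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(zip))) {
            zout.putNextEntry(new ZipEntry("sub/hello.txt"));
            zout.write("hello".getBytes("UTF-8"));
            zout.closeEntry();
        }
        Path out = dir.resolve("out");
        extract(zip, out);
        System.out.println(new String(
                Files.readAllBytes(out.resolve("sub/hello.txt")), "UTF-8"));
    }
}
```

Files.copy reads the entry stream only until the current entry's end, so no manual buffer loop is needed.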
You can find out which process is using your file to get to the root of the error:
download Process Explorer
run Process Explorer
press Ctrl+F
type the name of your file
This way you can find and kill the process that is using your file.
Mode of publishing - static
I'm trying to publish images, but whenever I publish them, their TCM URI is appended to the file name (i.e. if the image name is example and its TCM URI is tcm:1-115, the image filename becomes example_tcm1-115).
I have written the following code:
public void Transform(Engine engine, Package package)
{
    Filter MMCompFilter = new Filter();
    MMCompFilter.Conditions["ItemType"] = Tridion.ContentManager.ItemType.Component;
    Folder folder = engine.GetObject("tcm:1-1-2") as Folder;
    foreach (Component MMcomp in folder.GetItems(MMCompFilter))
    {
        Binary binary = engine.PublishingContext.RenderedItem.AddBinary(MMcomp);
        String binaryurl = binary.Url;
        char[] array = binaryurl.ToCharArray();
        Array.Reverse(array);
        string obj = new string(array);
        string final = newImagepath(obj);
        char[] array2 = final.ToCharArray();
        Array.Reverse(array2);
        string obj2 = new string(array2);
        package.PushItem("Image", package.CreateHtmlItem(obj2));
    }
}

public string newImagepath(string filePath)
{
    int formatIndex = filePath.IndexOf(".");
    string format = filePath.Substring(0, formatIndex);
    int finalPath = filePath.IndexOf("_");
    string newPath = filePath.Substring(finalPath + 1);
    return (format + "." + newPath);
}
I want to publish images without the TCM URI appended to them. Please suggest how this can be done.
Chris Summers wrote a very nice article on this topic: http://www.urbancherry.net/blogengine/post/2010/02/09/Unique-binary-filenames-for-SDL-Tridion-Multimedia-Components.aspx
It is basically a very simple thing to fix, but can have huge consequences which you should be aware of!
You can only publish a binary with a certain file-name in a single location once (and a binary can only be published to a single location on the presentation server, unless you publish it as a variant). However, in the CMS it is very easy to create Multimedia Components with the same binary file-name in different folders, which if they get published to the same location will be in conflict. That is why by default SDL Tridion appends the TCM URI to the filename to make it unique.
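As an aside, the renaming itself needs none of the reverse-the-string gymnastics in the question's TBB; a single regular expression can strip the suffix. A minimal sketch (in Java, matching the earlier examples in this thread; the method name is my own invention):

```java
public class TcmSuffixDemo {
    // Strips a trailing "_tcm<digits>-<digits>" that precedes the file
    // extension, e.g. "example_tcm1-115.jpg" becomes "example.jpg".
    static String stripTcmUri(String fileName) {
        return fileName.replaceAll("_tcm\\d+-\\d+(?=\\.[^.]+$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTcmUri("example_tcm1-115.jpg"));
    }
}
```

Remember the warning above, though: once the suffix is gone, nothing guarantees the published file names are unique.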
Simplest is always best.
In your TBB, just push the individual images to the package:
package.PushItem(package.CreateMultimediaItem(component.Id));
Then use the "PublishBinariesInPackage" TBB to publish these images to your presentation server.
You can use the RenderedItem.AddBinary method for this goal. Some of the overloaded versions of the method allow you to publish an image as a stream and pass any file name. For example:
public Binary AddBinary(
    Stream content,
    string filename,
    string variantId,
    string mimeType
)
I have a text file inside an assembly, say MyAssembly. I am trying to access that text file from code like this:
Stream stream = Assembly.GetAssembly(typeof(MyClass)).GetFile("data");
where data is a data.txt file containing some data, and I have added that .txt file as an Embedded Resource. I have done reading of images from the assembly as embedded resources with code like this:
protected Stream GetLogoImageStream()
{
    Assembly current = Assembly.GetExecutingAssembly();
    string imageFileNameFormat = "{0}.{1}";
    string imageName = "myLogo.GIF";
    string assemblyName = current.ManifestModule.Name;
    int extensionIndex = assemblyName.LastIndexOf(".dll", StringComparison.CurrentCultureIgnoreCase);
    string file = string.Format(imageFileNameFormat, assemblyName.Remove(extensionIndex, 4), imageName);
    Stream thisImageStream = current.GetManifestResourceStream(file);
    return thisImageStream;
}
However, this approach did not work for reading the .txt file from the executing assembly. I would really appreciate it if anybody could point me to the right approach for reading a .txt file from an assembly. Please don't ask me why I am not reading the file from the drive or a network share; the requirement is to read the .txt file from the assembly.
Thank you so much
GetManifestResourceStream is indeed the correct way to read the data. However, when it returns null, that usually means you have specified the wrong name. Specifying the correct name is not as simple as it seems. The rules are:
The VB.NET compiler generates a resource name of <root namespace>.<physical filename>.
The C# compiler generates a resource name of <default namespace>.<folder location>.<physical filename>, where <folder location> is the relative folder path of the file within the project, using dots as path separators.
You can call the Assembly.GetManifestResourceNames method in the debugger to check the actual names generated by the compiler.
Your approach should work. GetManifestResourceStream returns null, if the resource is not found. Try checking the run-time value of your file variable with the actual name of the resource stored in the assembly (you could check it using Reflector).
I really appreciate everybody's help on this question. I was able to read the file with code like this:
Assembly a = Assembly.GetExecutingAssembly();
string[] nameList = a.GetManifestResourceNames();
string manifestName = string.Empty;
if (nameList != null && nameList.Length > 0)
{
    foreach (string name in nameList)
    {
        if (name.IndexOf("c.txt") != -1)
        {
            manifestName = name;
            break;
        }
    }
}
Stream stream = a.GetManifestResourceStream(manifestName);
Thanks, and +1 to Christian Hayter for this method: a.GetManifestResourceNames();
I am looking for a solution or recommendation to a problem I am having. I have a bunch of ASPX pages that will be localized and have a bunch of text that needs to be supported in 6 languages.
The people doing the translation will not have access to Visual Studio, and the easiest tool for them is likely Excel. If we use Excel, or even export to CSV, we need a way to import the translations into .resx files. So, what is the best method for this?
I am aware of this question, Convert a Visual Studio resource file to a text file? already and the use of Resx Editor but an easier solution would be preferred.
I'm not sure how comprehensive an answer you're looking for, but if you're really just using [string, string] pairs for your localization, and you're just looking for a quick way to load resource (.resx) files with the results of your translations, then the following will work as a fairly quick, low-tech solution.
The thing to remember is that .resx files are just XML documents, so it should be possible to manually load your data into the resource from an external piece of code. The following example worked for me in VS2005 and VS2008:
namespace SampleResourceImport
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            string filePath = @"[file path to your resx file]";
            doc.Load(filePath);
            XmlElement root = doc.DocumentElement;
            // The following mocks the actual retrieval of your localized text
            // from a CSV or ?? document...
            // CSV parsers are common enough that it shouldn't be too difficult
            // to find one if that's the direction you go.
            Dictionary<string, string> d = new Dictionary<string, string>();
            d.Add("Label1", "First Name");
            d.Add("Label2", "Last Name");
            d.Add("Label3", "Date of Birth");
            foreach (KeyValuePair<string, string> pair in d)
            {
                XmlElement datum = doc.CreateElement("data");
                XmlAttribute datumName = doc.CreateAttribute("name");
                datumName.Value = pair.Key;
                // An attribute node can only belong to one element,
                // so create a fresh xml:space attribute per entry.
                XmlAttribute datumSpace = doc.CreateAttribute("xml:space");
                datumSpace.Value = "preserve";
                XmlElement value = doc.CreateElement("value");
                value.InnerText = pair.Value;
                datum.Attributes.Append(datumName);
                datum.Attributes.Append(datumSpace);
                datum.AppendChild(value);
                root.AppendChild(datum);
            }
            doc.Save(filePath);
        }
    }
}
Obviously, the preceding method won't generate the code-behind for your resource; however, opening the resource file in Visual Studio and toggling the accessibility modifier for the resource will (re)generate the static properties for you.
If you're looking for a completely XML-based solution (vs. CSV or Excel interop), you could also instruct your translators to store their translated content in Excel, saved as XML, then use XPath to retrieve your localization info. The only caveat being the file sizes tend to become pretty bloated.
Best of luck.
I ran into a similar problem and realized that the simplest way to create a .resx file from an Excel file is to use Excel's CONCATENATE function to generate the <data>...</data> nodes for the .resx file, and then manually copy the generated rows into the .resx file in any text editor. So let's say you have the name in column A of an Excel document and the value in column B. Use the following formula in column C:
=CONCATENATE("<data name=","""",A14,""" xml:space=""preserve"">","<value>", B14, "</value>", "</data>")
you will get the data node for resource. You can then copy this formula to all the rows and then copy the contents of Column C in your .resx file.
If it's in CSV, here's a quick Ruby script to generate the data elements.
require 'csv'
require 'builder'
file = ARGV[0]
builder = Builder::XmlMarkup.new(:indent => 2)
CSV.foreach(file) do |row|
builder.data(:name => row[0], "xml:space" => :preserve) {|d| d.value(row[1]) }
end
File.open(file + ".xml", 'w') { |f| f.write(builder.target!) }