Importing doc files into MediaWiki - docx

We have a large amount of documentation in word file format.
We are moving that documentation to a MediaWiki platform.
I'm looking to import all those doc files into our wiki. Is there a way to automate it?

To help anyone who comes along later:
I solved our issue with a combination of pandoc (to convert the docx files into zip files containing both the images and HTML files) and Html2Wiki, to import those zip files.
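For anyone wanting to automate the same thing, here is a rough sketch of the batch-conversion step from R (a shell script would work just as well). It assumes pandoc is installed and on the PATH; the folder names are placeholders, and the resulting zip files are what Html2Wiki would then import.
# Sketch: convert every .docx under ./docs to HTML plus extracted images,
# then zip each result for import with Html2Wiki.
docx_files <- list.files("docs", pattern = "\\.docx$", full.names = TRUE)
for (f in docx_files) {
  base    <- tools::file_path_sans_ext(basename(f))
  out_dir <- file.path("converted", base)
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  # Convert to HTML; --extract-media pulls the embedded images into out_dir
  system2("pandoc", c(shQuote(f), "-t", "html",
                      "--extract-media", shQuote(out_dir),
                      "-o", shQuote(file.path(out_dir, paste0(base, ".html")))))
  # Package the HTML and its images into one zip per document
  zip(zipfile = file.path("converted", paste0(base, ".zip")),
      files = list.files(out_dir, recursive = TRUE, full.names = TRUE))
}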

Related

How can I import thousands of documents that are just 'files'?

I need to import a file folder with thousands of files. They are not csv, xlsx, or txt files; they just say "file" under Type (data type). I have tried multiple ways to import them, as well as running R as an administrator.
I have tried different permutations of this code using read.csv, read.delim, etc., but I am unable to import the files.
baseball <- read.csv('C:/Users/nfisc/Desktop/Spring 2021/CIS 576/Homework/Homework 5/rec.sport.baseball', stringsAsFactors = F)
Any help would be appreciated.
Can you be more specific about what these files are? Do they open in Excel? What kind of data do they represent?
Without an extension, it may be difficult to import.
If you right-click the files and go to "properties", what does it say under "Type of File"?
I know that this is late, but I figured out how to import the text data. Part of the original issue was that the file was a rar file and not a zip file.
# Import text data: make a VCorpus from a specified directory
baseball_messages <- tm::VCorpus(tm::DirSource("C:/Users/nfisc/Desktop/Spring 2021/CIS 576/Homework/Homework 5/usebaseball"))
hockey_messages <- tm::VCorpus(tm::DirSource("C:/Users/nfisc/Desktop/Spring 2021/CIS 576/Homework/Homework 5/usehockey"))
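As a quick sanity check on the result (using the same objects created above), you can confirm that the corpus actually picked up the files:
# How many documents were read, and what the first one looks like
length(baseball_messages)
tm::inspect(baseball_messages[[1]])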

Opening Alteryx .yxdb files in R

Similar to the question linked below, I was wondering whether there is a way to open .yxdb files in R.
Open Alteryx .yxdb file in Python?
YXDB is Alteryx's own native format. I haven't seen or heard of anything else that can open it.
You could change your workflow to write to a CSV (or other compatible file) as well as writing to the YXDB file.
AFAIK there is no way yet for R to read yxdb files directly. I either export my Alteryx workflows to CSVs, or use the R tool with read.Alteryx and saveRDS to save the data as a fast-loading binary file.
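As a rough illustration of that second approach (read.Alteryx is only available inside Alteryx's R tool, and the paths here are placeholders):
# Inside an Alteryx R tool: read the incoming data stream and save it as RDS
df <- read.Alteryx("#1", mode = "data.frame")
saveRDS(df, "C:/temp/workflow_output.rds")

# Later, in a normal R session: load the binary file quickly
df <- readRDS("C:/temp/workflow_output.rds")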

Extract comments from pdf

I have a collection of .pdf files with comments that were added in Adobe Acrobat. I would like to be able to analyze these comments, but I'm kind of stuck on extracting them. I've looked at the pdftools package, but it seems to only be able to extract the text and not the comments. Is there a method available for extracting the comments within R?
PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) is the only Python library I have found that works.
Installation in Debian/Ubuntu-based distributions:
apt-get install python3-fitz
Script:
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
for i in range(doc.pageCount):
    page = doc[i]
    # The comment text of each annotation is in annot.info["content"]
    for annot in page.annots():
        print(annot.info["content"])
Did you try PoDoFo or another open-source tool that can access the PDF elements?
You can also look at "Extracting PDF annotations/comments" here on Stack Overflow if you are willing to do a little programming.
Export the comments as an Excel file, then import them into R?
E.g. in PDF-XChange Editor, go to Comment > Summarize Comments > export into whatever format you want. It is similar in Adobe.
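If you go the export route, reading the summary back into R is straightforward; a minimal sketch, assuming the editor produced an Excel file (the file name and column layout depend on the tool you used):
# Read the exported comment summary into R
comments <- readxl::read_excel("comments.xlsx")   # placeholder file name
head(comments)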

How to extract and read a bzip2ed hdf5 in a zipped file in R?

I would like to read an hdf5 file that is inside a zip file. The issue is that the hdf5 file is itself also compressed as a bzip2 (.bz2) file within the zip.
The zip file is "g2_BIOPAR_SWI_201012250000_GLOBE_ASCAT_V2_0_0.ZIP", and the target bz2 file inside it is "g2_BIOPAR_SWI_201012250000_GLOBE_ASCAT_V2_0_0.h5.bz2".
Could someone give me some tips or guidance on how to do this?
Take a look at this question:
Extract bz2 file in R
It explains how to read a .bz2 file, and I think the last answer is the right one for you; or just type ?bzfile.
After that, for hdf5 files you can take a look at the rhdf5 package from Bioconductor; here is the link:
http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html
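Putting the two steps together, here is a minimal sketch, assuming the R.utils and rhdf5 packages are installed (the dataset name inside the .h5 file is a placeholder; use h5ls to see what is actually there):
zipfile <- "g2_BIOPAR_SWI_201012250000_GLOBE_ASCAT_V2_0_0.ZIP"
bz2name <- "g2_BIOPAR_SWI_201012250000_GLOBE_ASCAT_V2_0_0.h5.bz2"

# 1. Pull the .bz2 out of the zip into a temporary directory
unzip(zipfile, files = bz2name, exdir = tempdir())

# 2. Decompress the .bz2 to get the plain .h5 file
bz2path <- file.path(tempdir(), bz2name)
h5path  <- sub("\\.bz2$", "", bz2path)
R.utils::bunzip2(bz2path, destname = h5path, remove = FALSE, overwrite = TRUE)

# 3. List the contents, then read the dataset you need
rhdf5::h5ls(h5path)
swi <- rhdf5::h5read(h5path, "SWI_001")   # dataset name is a placeholder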

HAR parser and reporting tool

Is there any parser tool for HAR (HTTP Archive) files that generates CSV or Excel output of page loading times? I know there are HAR viewers, but I need the output as CSV for plotting.
Note: it is easy enough to write a parser and generate the CSV output (which I have done), but before reinventing the wheel I just wanted to check for existing tools.
Yes, there is har2csv (a command-line tool), and you can download it as a zip file.
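If you end up rolling your own after all, a HAR file is just JSON, so a few lines of R will get you to CSV; a minimal sketch, with "network.har" as a placeholder file name and fields as in the HAR 1.2 spec:
# Pull page load timings out of a HAR file and write them to CSV
har   <- jsonlite::fromJSON("network.har")
pages <- har$log$pages
timings <- data.frame(
  page          = pages$id,
  onContentLoad = pages$pageTimings$onContentLoad,
  onLoad        = pages$pageTimings$onLoad
)
write.csv(timings, "page_timings.csv", row.names = FALSE)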

Resources