Can openoffice count words from console? - console

i have a small problem i need to count words inside the console to read doc, docx, pptx, ppt, xls, xlsx, odt, pdf ... so don't suggest me | wc -w or grep because they work only with text or console output and they count only spaces and in japanese, chinese, arabic , hindu , hebrew they use diferent delimiter so the word count is wrong and i tried to count with this
pdftotext file.pdf -| wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc -| wc -w
antiword file.word -| wc -w
in some cases microsoft word , openoffice sad 1000 words and the counters return 10 or 300 words if the language is ( japanese , chinese, hindu ect... ) , but if i use normal characters then i have no issue the biggest mistake is in some case 3 chars less witch is "OK"
i tried to convert with soffice , openoffice and then try WC -w but i can't even convert ,
soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx
OR
openoffice.org --headless --convert-to ........
OR
openoffice.org3 --invisible
so if someone know any way to count correctly or display document statistic with openoffice or anything else or linux with the console please share it
thanks.

If you have Microsoft Word (and Windows, obviously) you can write a VBA macro or if you want to run straight from the command line you can write a VBScript script with something like the following:
wordApp = CreateObject("Word.Application")
doc = ... ' open up a Word document using wordApp
docWordCount = doc.Words.Count
' Rinse and repeat...
If you have OpenOffice.org/LibreOffice you have similar (but more) options. If you want to stay in the office app and run a macro you can probably do that. I don't know the StarBasic API well enough to tell you how but I can give you the alternative: creating a Python script to get the word count from the command line. Roughly speaking, you do the following:
Start up your copy of OOo/LibO from the command line with the appropriate parameters to accept incoming socket connections. http://www.openoffice.org/udk/python/python-bridge.html has instructions on how to do that. Go there and use the browser's in-page find feature to search for `accept=socket'
Write a Python script to use the OOo/LibO UNO bridge (basically equivalent to the VBScript example above) to open up your Word/ODT documents one at a time and get the word count from each. The above page should give you a good start to doing that.
You get the word count from a document model object's WordCount property: http://www.openoffice.org/api/docs/common/ref/com/sun/star/text/GenericTextDocument.html#WordCount

Just building on to what #Yawar wrote. Here is is more explicit steps for how to word count with MS word from the console.
I also use the more accurate Range.ComputeStatistics(wdStatisticWords) instead of the Words property. See here for more info: https://support.microsoft.com/en-za/help/291447/word-count-appears-inaccurate-when-you-use-the-vba-words-property
Make a script called wc.vbs and then put this in it:
Set word = CreateObject("Word.Application")
word.Visible = False
Set doc = word.Documents.Open("<replace with absolute path to your .docx/.pdf>")
docWordCount = doc.Range.ComputeStatistics(wdStatisticWords)
word.Quit
Dim StdOut : Set StdOut = CreateObject("Scripting.FileSystemObject").GetStandardStream(1)
WScript.Echo docWordCount & " words"
Open powershell in the directory you saved wc.vbs and run cscript .\wc.vbs and you'll get back the word count :)

I think this may do what you are aiming for
# Continuously updating word count
import unohelper, uno, os, time
from com.sun.star.i18n.WordType import WORD_COUNT
from com.sun.star.i18n import Boundary
from com.sun.star.lang import Locale
from com.sun.star.awt import XTopWindowListener
#socket = True
socket = False
localContext = uno.getComponentContext()
if socket:
resolver = localContext.ServiceManager.createInstanceWithContext('com.sun.star.bridge.UnoUrlResolver', localContext)
ctx = resolver.resolve('uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
else: ctx = localContext
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)
waittime = 1 # seconds
def getWordCountGoal():
doc = XSCRIPTCONTEXT.getDocument()
retval = 0
# Only if the field exists
if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
# Get the field
wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
retval = wordcountgoal.Content
return retval
goal = getWordCountGoal()
def setWordCountGoal(goal):
doc = XSCRIPTCONTEXT.getDocument()
if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
wordcountgoal.Content = goal
# Refresh the field if inserted in the document from Insert > Fields >
# Other... > Variables > Userdefined fields
doc.TextFields.refresh()
def printOut(txt):
if socket: print txt
else:
model = desktop.getCurrentComponent()
text = model.Text
cursor = text.createTextCursorByRange(text.getEnd())
text.insertString(cursor, txt + '\r', 0)
def hotCount(st):
'''Counts the number of words in a string.
ARGUMENTS:
str st: count the number of words in this string
RETURNS:
int: the number of words in st'''
startpos = long()
nextwd = Boundary()
lc = Locale()
lc.Language = 'en'
numwords = 1
mystartpos = 1
brk = smgr.createInstanceWithContext('com.sun.star.i18n.BreakIterator', ctx)
nextwd = brk.nextWord(st, startpos, lc, WORD_COUNT)
while nextwd.startPos != nextwd.endPos:
numwords += 1
nw = nextwd.startPos
nextwd = brk.nextWord(st, nw, lc, WORD_COUNT)
return numwords
def updateCount(wordCountModel, percentModel):
'''Updates the GUI.
Updates the word count and the percentage completed in the GUI. If some
text of more than one word is selected (including in multiple selections by
holding down the Ctrl/Cmd key), it updates the GUI based on the selection;
if not, on the whole document.'''
model = desktop.getCurrentComponent()
try:
if not model.supportsService('com.sun.star.text.TextDocument'):
return
except AttributeError: return
sel = model.getCurrentSelection()
try: selcount = sel.getCount()
except AttributeError: return
if selcount == 1 and sel.getByIndex(0).getString == '':
selcount = 0
selwords = 0
for nsel in range(selcount):
thisrange = sel.getByIndex(nsel)
atext = thisrange.getString()
selwords += hotCount(atext)
if selwords > 1: wc = selwords
else:
try: wc = model.WordCount
except AttributeError: return
wordCountModel.Label = str(wc)
if goal != 0:
pc_text = 100 * (wc / float(goal))
#pc_text = '(%.2f percent)' % (100 * (wc / float(goal)))
percentModel.ProgressValue = pc_text
else:
percentModel.ProgressValue = 0
# This is the user interface bit. It looks more or less like this:
###############################
# Word Count _ o x #
###############################
# _____ #
# 451 / |500| #
# ----- #
# ___________________________ #
# |############## | #
# --------------------------- #
###############################
# The boxed `500' is the text entry box.
class WindowClosingListener(unohelper.Base, XTopWindowListener):
def __init__(self):
global keepGoing
keepGoing = True
def windowClosing(self, e):
global keepGoing
keepGoing = False
setWordCountGoal(goal)
e.Source.setVisible(False)
def addControl(controlType, dlgModel, x, y, width, height, label, name = None):
control = dlgModel.createInstance(controlType)
control.PositionX = x
control.PositionY = y
control.Width = width
control.Height = height
if controlType == 'com.sun.star.awt.UnoControlFixedTextModel':
control.Label = label
elif controlType == 'com.sun.star.awt.UnoControlEditModel':
control.Text = label
elif controlType == 'com.sun.star.awt.UnoControlProgressBarModel':
control.ProgressValue = label
if name:
control.Name = name
dlgModel.insertByName(name, control)
else:
control.Name = 'unnamed'
dlgModel.insertByName('unnamed', control)
return control
def loopTheLoop(goalModel, wordCountModel, percentModel):
global goal
while keepGoing:
try: goal = int(goalModel.Text)
except: goal = 0
updateCount(wordCountModel, percentModel)
time.sleep(waittime)
if not socket:
import threading
class UpdaterThread(threading.Thread):
def __init__(self, goalModel, wordCountModel, percentModel):
threading.Thread.__init__(self)
self.goalModel = goalModel
self.wordCountModel = wordCountModel
self.percentModel = percentModel
def run(self):
loopTheLoop(self.goalModel, self.wordCountModel, self.percentModel)
def wordCount(arg = None):
'''Displays a continuously updating word count.'''
dialogModel = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialogModel', ctx)
dialogModel.PositionX = XSCRIPTCONTEXT.getDocument().CurrentController.Frame.ContainerWindow.PosSize.Width / 2.2 - 105
dialogModel.Width = 100
dialogModel.Height = 30
dialogModel.Title = 'Word Count'
lblWc = addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 2, 25, 14, '', 'lblWc')
lblWc.Align = 2 # Align right
addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 33, 2, 10, 14, ' / ')
txtGoal = addControl('com.sun.star.awt.UnoControlEditModel', dialogModel, 45, 1, 25, 12, '', 'txtGoal')
txtGoal.Text = goal
#addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 25, 50, 14, '(percent)', 'lblPercent')
ProgressBar = addControl('com.sun.star.awt.UnoControlProgressBarModel', dialogModel, 6, 15, 88, 10,'' , 'lblPercent')
ProgressBar.ProgressValueMin = 0
ProgressBar.ProgressValueMax =100
#ProgressBar.Border = 2
#ProgressBar.BorderColor = 255
#ProgressBar.FillColor = 255
#ProgressBar.BackgroundColor = 255
addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 124, 2, 12, 14, '', 'lblMinus')
controlContainer = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialog', ctx)
controlContainer.setModel(dialogModel)
controlContainer.addTopWindowListener(WindowClosingListener())
controlContainer.setVisible(True)
goalModel = controlContainer.getControl('txtGoal').getModel()
wordCountModel = controlContainer.getControl('lblWc').getModel()
percentModel = controlContainer.getControl('lblPercent').getModel()
ProgressBar.ProgressValue = percentModel.ProgressValue
if socket:
loopTheLoop(goalModel, wordCountModel, percentModel)
else:
uthread = UpdaterThread(goalModel, wordCountModel, percentModel)
uthread.start()
keepGoing = True
if socket:
wordCount()
else:
g_exportedScripts = wordCount,
Link for more info
https://superuser.com/questions/529446/running-word-count-in-openoffice-writer
Hope this helps regards tom
EDIT : Then i found this
http://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=22555

wc can understand Unicode and uses system's iswspace function to find whether the unicode character is whitespace. "The iswspace() function tests whether wc is a wide-character code representing a character of class space in the program's current locale." So, wc -w should be able to correctly count words if your locale (LC_CTYPE) is configured correctly.
The source code of the wc program
The manual page for the iswspace function

I found the answer create one service
#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#
(while sleep 1; do
ls pathwithfiles/in | while read file; do
libreoffice --headless -convert-to pdf "pathwithfiles/in/$file" --outdir pathwithfiles/out
rm "pathwithfiles/in/$file"
done
done) &
then the php script that i needed counted everything
$ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
if ($ext !== 'txt' && $ext !== 'pdf') {
// Convert to pdf
$tb = mktime() . mt_rand();
$tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
copy($absolute_file_path, $tempfile);
$absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
$ext = 'pdf';
while (!is_file($absolute_file_path)) sleep(1);
}
if ($ext !== 'txt') {
// Convert to txt
$tempfile = tempnam(sys_get_temp_dir(), '');
shell_exec('pdftotext "' . $absolute_file_path . '" ' . $tempfile);
$absolute_file_path = $tempfile;
$ext = 'txt';
}
if ($ext === 'txt') {
$seq = '/[\s\.,;:!\? ]+/mu';
$plain = file_get_contents($absolute_file_path);
$plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
$str = preg_replace($seq, '', $plain);
$chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
$words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
if ($words === 0) return $chars;
if ($chars / $words > 10) $words = $chars;
return $words;
}

Related

Sample python code to replace a substring value in an xlsx cell

Sample code snippet tried:
for row in range(1,sheet.max_row+1):
for col in range(1, sheet.max_column+1):
temp = None
cell_obj = sheet.cell(row=row,column=col)
temp = re.search(r"requestor", str(cell_obj.value))
if temp:
if 'requestor' in cell_obj.value:
cell_obj.value.replace('requestor',
'ABC')
Trying to replace from an xlsx cell containing value "Customer name: requestor " with value "Customer name: ABC" .How can this be achieved easily ?
I found my answer in this post:https://www.edureka.co/community/42935/python-string-replace-not-working
The replace function doesn't store the result in the same variable. Hence the solution for above:
mvar = None
for row in range(1,sheet.max_row+1):
for col in range(1, sheet.max_column+1):
temp = None
cell_obj = sheet.cell(row=row,column=col)
temp = re.search(r"requestor", str(cell_obj.value))
if temp:
if 'requestor' in cell_obj.value:
mvar = cell_obj.value.replace('requestor',
'ABC')
cell_obj.value = mvar
Just keep it simple. Instead of re and replace, search for the given value and override the cell.
The example below also gives you the ability to change 'customer name' if needed:
wb = openpyxl.load_workbook("example.xlsx")
sheet = wb["Sheet1"]
customer_name = "requestor"
replace_with = "ABC"
search_string = f"Customer name: {customer_name}"
replace_string = f"Customer name: {replace_with}"
for row in range(1, sheet.max_row + 1):
for col in range(1, sheet.max_column + 1):
cell_obj = sheet.cell(row=row, column=col)
if cell_obj.value == search_string:
cell_obj.value = replace_string
wb.save("example_copy.xlsx") # remember that you need to save the results to the file

Why more number of duplicated data is saving in my excel sheet for my code?

Actually this code is generally used to scrape data from websites but the problem is more number of duplicated data is producing and saving in my excel sheet.
def extractor():
time.sleep(10)
souptree = html.fromstring(driver.page_source)
tburl = souptree.xpath("//table[contains(#id, 'theDataTable')]//tbody//tr//td[4]//a//#href")
for tbu in tburl:
allurl = []
allurl.append(urllib.parse.urljoin(siteurl, tbu))
for tb in allurl:
get_url = requests.get(tb)
get_soup = html.fromstring(get_url.content)
pattern = re.compile("^\s+|\s*,\s*|\s+$")
name = get_soup.xpath('//td[#headers="contactName"]//text()')
phone = get_soup.xpath('//td[#headers="contactPhone"]//text()')
mail = get_soup.xpath('//td[#headers="contactEmail"]//a//text()')
artitle = get_soup.xpath('//td[#headers="contactEmail"]//a//#href')
artit = ([x for x in pattern.split(str(artitle)) if x][-1])
title = artit[:-2]
for (nam, pho, mai) in zip(name, phone, mail):
fname = nam[9:]
allmails.append(mai)
allnames.append(fname)
allphone.append(pho)
alltitles.append(title)
fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
writer = ExcelWriter('G:\\Sheet_Name.xlsx')
fullfile.to_excel(writer, 'Sheet1', index=False)
writer.save()
print(fname, pho, mai, title, sep='\t')
while True:
time.sleep(10)
extractor()
try:
nextbutton()
except (WebDriverException):
driver.refresh()
except(NoSuchElementException):
time.sleep(10)
driver.quit()
I want the output should not be duplicated but almost half and more number of data are duplicating each time i run the code.

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size more than RAM size) and use them or save each in disk for later usage. Which R package is best for doing this for large files?
I haven't tried but using skip and nrows parameters in read.table or read.csv is worth a try. These are from ?read.table
skip integer: the number of lines of the data file to skip before
beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
To avoid some troublesome issues at the end you need to do some error handling. In other words I don't know what happpens when skip value is greater than the number of rows in your big csv.
p.s. I also don't know whether header=TRUE is affecting skip or not, you also have to check that.
The answer given bu #berkorbay is OK and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent reading after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short script in PERL which fragments the original file in chuncks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
system("cls");
print("Fragment .csv file keeping header in each chunk\n") ;
print("\nEnter input file name = ") ;
$entrada = <STDIN> ;
print("\nEnter maximum number of lines in each fragment = ") ;
$nlineas = <STDIN> ;
print("\nEnter output file name stem = ") ;
$salida = <STDIN> ;
chop($salida) ;
open(IN,$entrada) || die "Cannot open input file: $!\n" ;
$cabecera = <IN> ;
$leidas = 0 ;
$fragmento = 1 ;
$fichero = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
if ($leidas > $nlineas) {
close(OUT) ;
$fragmento++ ;
$fichero = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
$leidas = 0;
}
$leidas++ ;
print OUT $_ ;
}
close(OUT) ;
Just save with whatever name and execute. The first line might have to be changed if you have PERL in a diferent place (an, if you are on Windows, you migh have to invoke the script as "perl name-of-script").
One should have used read.csv.ffdf of ff package with specific parameters like this to read big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once big file is read into a ff object, Subsetting ffobject into data frames can be done using:
a[1000:1000000,]
Rest of the code for subsetting and saving broken dataframes
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000,])) / 10000 #in bytes
block.size = 200000000 #in bytes .IN Mbs 200 Mb
#rows.block is rows per block
rows.block = ceiling(block.size/row.size)
#nmaps is the number of chunks/maps of big dataframe(ff), nmaps = number of maps - 1
nmaps = floor(totalrows/rows.block)
for(i in (0:nmaps)){
if(i==nmaps){
df = a[(i*rows.block+1) : totalrows,]
}
else{
df = a[(i*rows.block+1) : ((i+1)*rows.block),]
}
#process df or save it
write.csv(df,paste0("M",i+1,".csv"))
#remove df
rm(df)
}
Alternatively you can first read the files into mysql using dbWriteTable and then use read.dbi.ffdf function from the ETLUtils package to read it back to R. Consider the function below;
read.csv.sql.ffdf <- function(file, name,overwrite = TRUE, header = TRUE, drv = MySQL(), dbname = "new", username = "root",host='localhost', password = "1234"){
conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
on.exit(dbRemoveTable(conn, name))
command = paste0("select * from ", name)
ret = read.dbi.ffdf(command, dbConnect.args = list(drv =drv, dbname = dbname, username = username, password = password))
return(ret)
}

Get a list of all open windows using AutoIt

I'm trying to get rid of my minimize, maximize and close buttons on all windows. Googling around I found this:
$h = WinGetHandle("[CLASS:Notepad]")
$iOldStyle = _WinAPI_GetWindowLong($h, $GWL_STYLE)
$iNewStyle = BitXOr($iOldStyle, $WS_SYSMENU)
_WinAPI_SetWindowLong($h, $GWL_STYLE, $iNewStyle)
_WinAPI_ShowWindow($h, #SW_SHOW)
This works fine, so now I only need to iterate over all windows with this code, and I'm done. How do I get a list of all HWNDs in the system?
You can get a list of all open windows using WinList:
$aWindows = WinList()
For $i=1 To $aWindows[0][0]
; skip windows without a title
If $aWindows[$i][0] = '' Then ContinueLoop
;use the HWND to get the state of the window
$iWndState = WinGetState($aWindows[$i][1])
; here you could filter out the windows you don't want to modify
ConsoleWrite($aWindows[$i][0] & ': ')
If BitAND($iWndState,1) = 1 Then ConsoleWrite(' exists')
If BitAND($iWndState,2) = 2 Then ConsoleWrite(' visible')
If BitAND($iWndState,4) = 4 Then ConsoleWrite(' enabled')
If BitAND($iWndState,8) = 8 Then ConsoleWrite(' active')
If BitAND($iWndState,16) = 16 Then ConsoleWrite(' minimised')
If BitAND($iWndState,32) = 32 Then ConsoleWrite(' maximised')
ConsoleWrite(#CRLF)
Next

Compiling individual webpage tables into a single Excel readable table

I would like to create a master list of contact information for all Chiropractors in Arizona. The board website lists all the Chiropractors here however, I have to click through to see each individual address and phone number.
How can I get all of the information about each Chiropractor in to a single spreadsheet row format?
This is easy. In your first sheet do Data > External data > From website. Paste the URL, select the main table and do Next and put it in A1.
In VBA Editor paste the following formula and execute it. It will retrieve all data from the website and paste it in Sheet2. The rest is just reorganizing the data which is not the topic of your question so I leave it up to you.
Sub ExtractAllData()
Dim dest As Range, license As Range
Dim license_no As String
Worksheets("Feuil2").Select
Set dest = Worksheets("Feuil2").Range("A1")
Set license = Worksheets("Feuil1").Range("C3")
Do Until license.Value = ""
license_no = Mid(license.Value, 1, InStr(1, license.Value, " "))
With Worksheets("Feuil2").QueryTables.Add(Connection:= _
"URL;http://www.azchiroboard.us/ProDetail.asp?LicenseNo=" & license_no, Destination:= _
dest)
.Name = "ProDetail.asp?LicenseNo=" & license_no
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = "1"
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
Set dest = Range("A65535").End(xlUp)
Set dest = dest.Offset(1, 0)
Set license = license.Offset(1, 0)
Loop
End Sub
For the record, it took me 1 minute to figure out how to retrieve the data from the main table. 1 minute to figure out that the link just calls a PHP page with the license number. 1 minute to record the macro and then 5 minutes to adapt it and fix errors I had made.

Resources