Sum of file sizes in a directory

I am writing a file storage service and have run into a problem. I have two model classes, Folder and Data. The Data model determines the file size at upload time. How can I make each folder automatically compute its size, where the folder size is the sum of the sizes of the files attached to it?
My models.py file:
class Folder(models.Model):
    uid = models.UUIDField(primary_key=True, editable=False, default=uuid.uuid4)
    # parent_id = models.AutoField(primary_key=True, null=False)
    name = models.CharField(max_length=255)
    # size = models.IntegerField(blank=True, null=True, default=0)
    size = Data.objects.all()
    created_at = models.DateField(auto_now=True)
    # def get_upload_path(instance, filename):
    #     return os.path.join(str(instance.folder.uid), filename)

class Data(models.Model):
    id = models.AutoField(primary_key=True, null=False)
    file = models.FileField(null=True, max_length=255)
    created_at = models.DateField(auto_now=True)
    # user = models.ForeignKey(User, on_delete=models.CASCADE)
    parent_id = models.ForeignKey(Folder, on_delete=models.CASCADE)

    def __str__(self):
        return str(self.file.name)
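One way to get this, sketched under assumptions since the question doesn't show the upload code: the folder can expose its size as a property that sums the sizes of its attached files, using Django's default data_set reverse relation and FileField.size (which asks the storage backend for each file's size):

class Folder(models.Model):
    uid = models.UUIDField(primary_key=True, editable=False, default=uuid.uuid4)
    name = models.CharField(max_length=255)
    created_at = models.DateField(auto_now=True)

    @property
    def size(self):
        # Sum the sizes of all files attached to this folder;
        # d.file.size reads each file's size from the storage backend.
        return sum(d.file.size for d in self.data_set.all() if d.file)

If Data instead stored its size in an integer column at upload time, the property could be a single query, self.data_set.aggregate(total=django.db.models.Sum('size'))['total'] or 0, which avoids touching storage for every file.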

Related

How to import data from a HTML table on a website to excel?

I would like to do some statistical analysis with Python on the live casino game called Crazy Time from Evolution Gaming. There is a website that has the data to do this: https://tracksino.com/crazytime. I want the data from the lowest table, 'Spin History', to be imported into Excel. However, I do not know how this can be done. Could anyone give me an idea where to start?
Thanks in advance!
Try the below code:
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
import datetime

def scrap_history():
    csv_headers = []
    file_path = ''  # path on your system where the file should be saved
    file_name = 'spin_history.csv'  # filename
    page_number = 1
    while True:
        # Dynamic URL fetching data in chunks of 100
        url = 'https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=' + str(page_number) + '&per_page=100&period=24hours'
        print('-' * 100)
        print('URL created : ', url)
        response = requests.get(url, verify=False)
        result = json.loads(response.text)  # load the response text as JSON
        history_data = result['data']
        print(history_data)
        if history_data != []:
            with open(file_path + file_name, 'a+') as history:
                # Headers for the file
                csv_headers = ['Occurred At', 'Slot Result', 'Spin Result', 'Total Winners', 'Total Payout']
                csvwriter = csv.DictWriter(history, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                if page_number == 1:
                    print('Writing CSV header now...')
                    csvwriter.writeheader()
                # Write the extracted data into the csv file one row at a time
                for item in history_data:
                    value = datetime.datetime.fromtimestamp(item['when'])
                    occurred_at = f'{value:%d-%B-%Y # %H:%M:%S}'
                    csvwriter.writerow({'Occurred At': occurred_at,
                                        'Slot Result': item['slot_result'],
                                        'Spin Result': item['result'],
                                        'Total Winners': item['total_winners'],
                                        'Total Payout': item['total_payout'],
                                        })
            print('-' * 100)
            page_number += 1
            print(page_number)
            print('-' * 100)
        else:
            break
Explanation:
I implemented the above script using Python requests. The API URL https://api.tracksino.com/crazytime_history?filter=&sort_by=&sort_desc=false&page_num=1&per_page=50&period=24hours was extracted from the website itself. On every iteration the script builds a dynamic URL in which the page number changes: first page_num=1, then page_num=2, and so on until all the data has been extracted.
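Note that the snippet only defines the function; to kick it off you would call it, e.g.:

if __name__ == '__main__':
    scrap_history()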

Kivy regarding binding multiple buttons to each individual function

Hi, I am new to Kivy and just started programming. I have a problem: I want to bind every button I create in the for loop to on_release, so that clicking each button goes to a different screen. Below is a small part of my code (EDITED with more information):
# these are the pictures of the buttons
a = '_icons_/mcdonald2.png'
b = '_icons_/boostjuice.png'
c = '_icons_/duckrice.png'
d = '_icons_/subway_logo.png'
e = '_icons_/bakery.png'
f = '_icons_/mrbean.png'
# these are the names of the different screens
n1 = 'mcdonald_screen'
n2 = 'boost_screen'
n3 = 'duck_screen'
n4 = 'subway_screen'
n5 = 'bakery_screen'
n6 = 'mrbean_screen'
arraylist = [[a, n1], [b, n2], [c, n3], [d, n4], [e, n5], [f, n6]]
self.layout2 = GridLayout(rows=2, spacing=50, size_hint=(0.95, 0.5),
                          pos_hint={"top": .65, "x": 0}, padding=(90, 0, 50, 0))
for image in arraylist:
    self.image_outlet = ImageButton(
        size_hint=(1, 0.1),
        source=image[0])
    self.screen_name = image[1]
    self.image_outlet[0].bind(on_release=??)  # this is the part I want to change
                                              # according to the different screen
    self.layout2.add_widget(self.image_outlet)
self.add_widget(self.layout2)
GUI = Builder.load_file("_kivy_/trying.kv")

class TRYINGApp(App):
    def build(self):
        return GUI

    def change_screen(self, screen_name):
        screen_manager = self.root.ids['screen_manager']
        screen_manager.current = screen_name
# kv file #
# all the various kv screen files
#: include _kivy_/variestime_screen.kv
#: include _kivy_/homescreen.kv
#: include _kivy_/mcdonaldscreen.kv
#: include _kivy_/firstpage.kv
#: include _kivy_/mrbeanscreen.kv
#: include _kivy_/boostscreen.kv
#: include _kivy_/duckscreen.kv
#: include _kivy_/subwayscreen.kv
#: include _kivy_/bakeryscreen.kv
GridLayout:
    cols: 1
    ScreenManager:
        id: screen_manager
        FirstPage:
            name: "first_page"
            id: first_page
        VariesTimeScreen:
            name: "variestime_screen"
            id: variestime_screen
        HomeScreen:
            name: "home_screen"
            id: home_screen
        McDonaldScreen:
            name: "mcdonald_screen"
            id: mcdonald_screen
        BoostScreen:
            name: "boost_screen"
            id: boost_screen
        DuckScreen:
            name: "duck_screen"
            id: duck_screen
        SubwayScreen:
            name: "subway_screen"
            id: subway_screen
        BakeryScreen:
            name: "bakery_screen"
            id: bakery_screen
        MrBeanScreen:
            name: "mrbean_screen"
            id: mrbean_screen
Your on_release can be something like:
self.image_outlet.bind(on_release=partial(self.change_screen, image[1]))
where partial is functools.partial and change_screen is a method that you must define:
def change_screen(self, new_screen_name, button_instance):
    # some code to change to the screen with name new_screen_name
Note that I have removed the [0] from self.image_outlet (I suspect that was a typo). I can't determine what code should go in the new method, because you haven't provided enough information.
If you have a change_screen method in your App class, you can use that directly by referencing it in your on_release:
self.image_outlet.bind(on_release=partial(App.get_running_app().change_screen, image[1]))
You will need to make a minor change to your change_screen method to handle the additional args:
def change_screen(self, screen_name, *args):
    screen_manager = self.root.ids['screen_manager']
    screen_manager.current = screen_name
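For completeness, a minimal sketch of the original loop with that binding applied (assuming the ImageButton class and arraylist from the question, and from functools import partial):

from functools import partial

for image in arraylist:
    image_outlet = ImageButton(
        size_hint=(1, 0.1),
        source=image[0])
    # Pre-fill the screen name for this button; Kivy passes the button
    # instance as the final argument when on_release fires.
    image_outlet.bind(on_release=partial(
        App.get_running_app().change_screen, image[1]))
    self.layout2.add_widget(image_outlet)
self.add_widget(self.layout2)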

Why is duplicated data being saved to my Excel sheet by my code?

This code is generally used to scrape data from websites, but the problem is that a large amount of duplicated data is being produced and saved into my Excel sheet.
def extractor():
    time.sleep(10)
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile("^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (nam, pho, mai) in zip(name, phone, mail):
                fname = nam[9:]
                allmails.append(mai)
                allnames.append(fname)
                allphone.append(pho)
                alltitles.append(title)
                fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
                writer = ExcelWriter('G:\\Sheet_Name.xlsx')
                fullfile.to_excel(writer, 'Sheet1', index=False)
                writer.save()
                print(fname, pho, mai, title, sep='\t')

while True:
    time.sleep(10)
    extractor()
    try:
        nextbutton()
    except WebDriverException:
        driver.refresh()
    except NoSuchElementException:
        time.sleep(10)
        driver.quit()
I do not want the output to be duplicated, but almost half or more of the data is duplicated each time I run the code.
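One thing to check, going only by the code shown: the accumulator lists (allnames, allmails, alltitles, allphone) are never cleared, so if the page-turning in nextbutton() ever fails and the same page is scraped again, its rows are appended a second time and the whole growing DataFrame is rewritten to the same file. A hedged sketch of a dedupe step before writing, using the names from the original code:

fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails,
                         'Title': alltitles, 'Phone Numbers': allphone})
# Drop rows that are exact duplicates across all four columns
# before writing the sheet.
fullfile = fullfile.drop_duplicates()
writer = ExcelWriter('G:\\Sheet_Name.xlsx')
fullfile.to_excel(writer, 'Sheet1', index=False)
writer.save()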

Fetch datastore entity by id inside of a Dataflow transform

I have 2 datastore models:
class KindA(ndb.Model):
    field_a1 = ndb.StringProperty()
    field_a2 = ndb.StringProperty()

class KindB(ndb.Model):
    field_b1 = ndb.StringProperty()
    field_b2 = ndb.StringProperty()
    key_to_kind_a = ndb.KeyProperty(KindA)
I want to query KindB and output it to a csv file, but if an entity of KindB points to an entity of KindA, I want those fields to be present in the csv as well.
If I were able to use ndb inside of a transform, I would set up my pipeline like this:
def format(element):  # element is an `entity_pb2` object of KindB
    try:
        obj_a_key_id = element.properties.get('key_to_kind_a', None).key_value.path[0]
    except:
        obj_a_key_id = None
    # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HOW DO I DO THIS
    obj_a = ndb.Key(KindA, obj_a_key_id).get() if obj_a_key_id else None
    return ",".join([
        element.properties.get('field_b1', None).string_value,
        element.properties.get('field_b2', None).string_value,
        obj_a.properties.get('field_a1', None).string_value if obj_a else '',
        obj_a.properties.get('field_a2', None).string_value if obj_a else '',
    ])
def build_pipeline(project, start_date, end_date, export_path):
    query = query_pb2.Query()
    query.kind.add().name = 'KindB'
    filter_1 = datastore_helper.set_property_filter(query_pb2.Filter(), 'field_b1', PropertyFilter.GREATER_THAN, start_date)
    filter_2 = datastore_helper.set_property_filter(query_pb2.Filter(), 'field_b1', PropertyFilter.LESS_THAN, end_date)
    datastore_helper.set_composite_filter(query.filter, CompositeFilter.AND, filter_1, filter_2)

    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'read from datastore' >> ReadFromDatastore(project, query, None)
         | 'format' >> beam.Map(format)
         | 'write' >> apache_beam.io.WriteToText(
             file_path_prefix=export_path,
             file_name_suffix='.csv',
             header='field_b1,field_b2,field_a1,field_a2',
             num_shards=1)
         )
    return p
I suppose I could use ReadFromDatastore to query all entities of KindA and then use CoGroupByKey to merge them, but KindA has millions of records and that would be very inefficient.
Per the recommendations in this answer: https://stackoverflow.com/a/49130224/4458510
I created the following utils, which were inspired by the source code of:
DatastoreWriteFn in apache_beam.io.gcp.datastore.v1.datastoreio
write_mutations and fetch_entities in apache_beam.io.gcp.datastore.v1.helper
import logging
import time
from socket import error as _socket_error

from apache_beam.metrics import Metrics
from apache_beam.transforms import DoFn, window
from apache_beam.utils import retry
from apache_beam.io.gcp.datastore.v1.adaptive_throttler import AdaptiveThrottler
from apache_beam.io.gcp.datastore.v1.helper import make_partition, retry_on_rpc_error, get_datastore
from apache_beam.io.gcp.datastore.v1.util import MovingSum
from apache_beam.utils.windowed_value import WindowedValue
from google.cloud.proto.datastore.v1 import datastore_pb2, query_pb2
from googledatastore.connection import Datastore, RPCError

_WRITE_BATCH_INITIAL_SIZE = 200
_WRITE_BATCH_MAX_SIZE = 500
_WRITE_BATCH_MIN_SIZE = 10
_WRITE_BATCH_TARGET_LATENCY_MS = 5000

def _fetch_keys(project_id, keys, datastore, throttler, rpc_stats_callback=None, throttle_delay=1):
    req = datastore_pb2.LookupRequest()
    req.project_id = project_id
    for key in keys:
        req.keys.add().CopyFrom(key)

    @retry.with_exponential_backoff(num_retries=5, retry_filter=retry_on_rpc_error)
    def run(request):
        # Client-side throttling.
        while throttler.throttle_request(time.time() * 1000):
            logging.info("Delaying request for %ds due to previous failures", throttle_delay)
            time.sleep(throttle_delay)
            if rpc_stats_callback:
                rpc_stats_callback(throttled_secs=throttle_delay)
        try:
            start_time = time.time()
            response = datastore.lookup(request)
            end_time = time.time()
            if rpc_stats_callback:
                rpc_stats_callback(successes=1)
            throttler.successful_request(start_time * 1000)
            commit_time_ms = int((end_time - start_time) * 1000)
            return response, commit_time_ms
        except (RPCError, _socket_error):
            if rpc_stats_callback:
                rpc_stats_callback(errors=1)
            raise

    return run(req)
# Copied from _DynamicBatchSizer in apache_beam.io.gcp.datastore.v1.datastoreio
class _DynamicBatchSizer(object):
    """Determines request sizes for future Datastore RPCs."""
    def __init__(self):
        self._commit_time_per_entity_ms = MovingSum(window_ms=120000, bucket_ms=10000)

    def get_batch_size(self, now):
        """Returns the recommended size for datastore RPCs at this time."""
        if not self._commit_time_per_entity_ms.has_data(now):
            return _WRITE_BATCH_INITIAL_SIZE
        recent_mean_latency_ms = (self._commit_time_per_entity_ms.sum(now) / self._commit_time_per_entity_ms.count(now))
        return max(_WRITE_BATCH_MIN_SIZE,
                   min(_WRITE_BATCH_MAX_SIZE,
                       _WRITE_BATCH_TARGET_LATENCY_MS / max(recent_mean_latency_ms, 1)))

    def report_latency(self, now, latency_ms, num_mutations):
        """Reports the latency of an RPC to Datastore.

        Args:
          now: double, completion time of the RPC as seconds since the epoch.
          latency_ms: double, the observed latency in milliseconds for this RPC.
          num_mutations: int, number of mutations contained in the RPC.
        """
        self._commit_time_per_entity_ms.add(now, latency_ms / num_mutations)
class LookupKeysFn(DoFn):
    """A `DoFn` that looks up keys in the Datastore."""
    def __init__(self, project_id, fixed_batch_size=None):
        self._project_id = project_id
        self._datastore = None
        self._fixed_batch_size = fixed_batch_size
        self._rpc_successes = Metrics.counter(self.__class__, "datastoreRpcSuccesses")
        self._rpc_errors = Metrics.counter(self.__class__, "datastoreRpcErrors")
        self._throttled_secs = Metrics.counter(self.__class__, "cumulativeThrottlingSeconds")
        self._throttler = AdaptiveThrottler(window_ms=120000, bucket_ms=1000, overload_ratio=1.25)
        self._elements = []
        self._batch_sizer = None
        self._target_batch_size = None

    def _update_rpc_stats(self, successes=0, errors=0, throttled_secs=0):
        """Callback function, called by _fetch_keys()"""
        self._rpc_successes.inc(successes)
        self._rpc_errors.inc(errors)
        self._throttled_secs.inc(throttled_secs)

    def start_bundle(self):
        """(Re)initialize: connection with datastore, _DynamicBatchSizer obj"""
        self._elements = []
        self._datastore = get_datastore(self._project_id)
        if self._fixed_batch_size:
            self._target_batch_size = self._fixed_batch_size
        else:
            self._batch_sizer = _DynamicBatchSizer()
            self._target_batch_size = self._batch_sizer.get_batch_size(time.time() * 1000)

    def process(self, element):
        """Collect elements and process them as a batch"""
        self._elements.append(element)
        if len(self._elements) >= self._target_batch_size:
            return self._flush_batch()

    def finish_bundle(self):
        """Flush any remaining elements"""
        if self._elements:
            objs = self._flush_batch()
            for obj in objs:
                yield WindowedValue(obj, window.MAX_TIMESTAMP, [window.GlobalWindow()])

    def _flush_batch(self):
        """Fetch all of the collected keys from datastore"""
        response, latency_ms = _fetch_keys(
            project_id=self._project_id,
            keys=self._elements,
            datastore=self._datastore,
            throttler=self._throttler,
            rpc_stats_callback=self._update_rpc_stats)
        logging.info("Successfully read %d keys in %dms.", len(self._elements), latency_ms)
        if not self._fixed_batch_size:
            now = time.time() * 1000
            self._batch_sizer.report_latency(now, latency_ms, len(self._elements))
            self._target_batch_size = self._batch_sizer.get_batch_size(now)
        self._elements = []
        return [entity_result.entity for entity_result in response.found]
class LookupEntityFieldFn(LookupKeysFn):
    """Looks up a field on an EntityPb2 object.

    Expects an EntityPb2 object as input.
    Outputs a tuple, where the first element is the input object and the
    second element is the object found during the lookup.
    """
    def __init__(self, project_id, field_name, fixed_batch_size=None):
        super(LookupEntityFieldFn, self).__init__(project_id=project_id, fixed_batch_size=fixed_batch_size)
        self._field_name = field_name

    @staticmethod
    def _pb2_key_value_to_tuple(kv):
        """Converts a key_value object into a tuple, so that it can be a dictionary key"""
        path = []
        for p in kv.path:
            path.append(p.name)
            path.append(p.id)
        return tuple(path)

    def _flush_batch(self):
        _elements = self._elements
        keys_to_fetch = []
        for element in self._elements:
            kv = element.properties.get(self._field_name, None)
            if kv and kv.key_value and kv.key_value.path:
                keys_to_fetch.append(kv.key_value)
        self._elements = keys_to_fetch
        read_keys = super(LookupEntityFieldFn, self)._flush_batch()
        _by_key = {self._pb2_key_value_to_tuple(entity.key): entity for entity in read_keys}

        output_pairs = []
        for input_obj in _elements:
            kv = input_obj.properties.get(self._field_name, None)
            output_obj = None
            if kv and kv.key_value and kv.key_value.path:
                output_obj = _by_key.get(self._pb2_key_value_to_tuple(kv.key_value), None)
            output_pairs.append((input_obj, output_obj))
        return output_pairs
The key to this is the line response = datastore.lookup(request), where:
datastore = get_datastore(project_id) (from apache_beam.io.gcp.datastore.v1.helper.get_datastore)
request is a LookupRequest from google.cloud.proto.datastore.v1.datastore_pb2
response is a LookupResponse from google.cloud.proto.datastore.v1.datastore_pb2
The rest of the above code does things like:
use a single connection to the datastore for a DoFn bundle
batch keys together before performing a lookup request
throttle interactions with the datastore if requests start to fail
(honestly I don't know how critical these bits are, I just came across them when browsing the apache_beam source code)
The resulting util class LookupEntityFieldFn(project_id, field_name) is a DoFn that takes an entity_pb2 object as input, extracts and fetches the key property that resides on the field field_name, and outputs the result as a tuple (the fetched result is paired with the input object).
My pipeline code then became:
def format(element):  # element is a tuple of `entity_pb2` objects
    kind_b_element, kind_a_element = element
    return ",".join([
        kind_b_element.properties.get('field_b1', None).string_value,
        kind_b_element.properties.get('field_b2', None).string_value,
        kind_a_element.properties.get('field_a1', None).string_value if kind_a_element else '',
        kind_a_element.properties.get('field_a2', None).string_value if kind_a_element else '',
    ])
def build_pipeline(project, start_date, end_date, export_path):
    query = query_pb2.Query()
    query.kind.add().name = 'KindB'
    filter_1 = datastore_helper.set_property_filter(query_pb2.Filter(), 'field_b1', PropertyFilter.GREATER_THAN, start_date)
    filter_2 = datastore_helper.set_property_filter(query_pb2.Filter(), 'field_b1', PropertyFilter.LESS_THAN, end_date)
    datastore_helper.set_composite_filter(query.filter, CompositeFilter.AND, filter_1, filter_2)

    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'read from datastore' >> ReadFromDatastore(project, query, None)
         | 'extract field' >> apache_beam.ParDo(LookupEntityFieldFn(project_id=project, field_name='key_to_kind_a'))
         | 'format' >> beam.Map(format)
         | 'write' >> apache_beam.io.WriteToText(
             file_path_prefix=export_path,
             file_name_suffix='.csv',
             header='field_b1,field_b2,field_a1,field_a2',
             num_shards=1)
         )
    return p
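For reference, a minimal sketch of driving the pipeline (the project name, dates, and export path here are placeholders, not values from the question):

p = build_pipeline('my-project', '2018-01-01', '2018-02-01', 'gs://my-bucket/kind_b_export')
# run() returns a PipelineResult; wait_until_finish() blocks until the job is done.
result = p.run()
result.wait_until_finish()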

Can openoffice count words from console?

I have a small problem: I need to count the words of doc, docx, pptx, ppt, xls, xlsx, odt, pdf ... files from the console. So don't suggest | wc -w or grep, because they only work with text or console output, and they only count spaces; Japanese, Chinese, Arabic, Hindi and Hebrew use different delimiters, so the word count comes out wrong. I tried to count with these:
pdftotext file.pdf - | wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc - | wc -w
antiword file.word - | wc -w
In some cases Microsoft Word or OpenOffice says 1000 words but the counters return 10 or 300 words if the language is Japanese, Chinese, Hindi, etc.; with ordinary characters I have no issue (the biggest mistake is 3 characters less in some cases, which is OK).
I tried to convert with soffice or openoffice and then run wc -w, but I can't even convert:
soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx
OR
openoffice.org --headless --convert-to ........
OR
openoffice.org3 --invisible
So if someone knows any way to count correctly, or to display document statistics with OpenOffice or anything else on Linux from the console, please share it.
Thanks.
If you have Microsoft Word (and Windows, obviously) you can write a VBA macro, or if you want to run straight from the command line you can write a VBScript script with something like the following:
Set wordApp = CreateObject("Word.Application")
Set doc = ... ' open up a Word document using wordApp
docWordCount = doc.Words.Count
' Rinse and repeat...
If you have OpenOffice.org/LibreOffice you have similar (but more) options. If you want to stay in the office app and run a macro you can probably do that; I don't know the StarBasic API well enough to tell you how, but I can give you the alternative: creating a Python script to get the word count from the command line. Roughly speaking, you do the following:
Start up your copy of OOo/LibO from the command line with the appropriate parameters to accept incoming socket connections. http://www.openoffice.org/udk/python/python-bridge.html has instructions on how to do that; go there and use the browser's in-page find feature to search for 'accept=socket'.
Write a Python script that uses the OOo/LibO UNO bridge (basically equivalent to the VBScript example above) to open up your Word/ODT documents one at a time and get the word count from each. The above page should give you a good start on doing that, and a sketch of that script follows below.
You get the word count from a document model object's WordCount property: http://www.openoffice.org/api/docs/common/ref/com/sun/star/text/GenericTextDocument.html#WordCount
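A minimal sketch of that script, assuming the office was started with the --accept=socket,host=localhost,port=2002;urp; parameter from the linked page, and using a placeholder document path:

import uno
from com.sun.star.beans import PropertyValue

# Connect to the already-running office instance over the UNO socket bridge.
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext(
    'com.sun.star.bridge.UnoUrlResolver', localContext)
ctx = resolver.resolve(
    'uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)

# Open the document without showing a window, read WordCount, close it.
hidden = PropertyValue()
hidden.Name = 'Hidden'
hidden.Value = True
doc = desktop.loadComponentFromURL(
    'file:///path/to/document.odt', '_blank', 0, (hidden,))
print(doc.WordCount)
doc.close(False)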
Just building on what @Yawar wrote. Here are more explicit steps for how to word count with MS Word from the console.
I also use the more accurate Range.ComputeStatistics(wdStatisticWords) instead of the Words property. See here for more info: https://support.microsoft.com/en-za/help/291447/word-count-appears-inaccurate-when-you-use-the-vba-words-property
Make a script called wc.vbs and then put this in it:
Set word = CreateObject("Word.Application")
word.Visible = False
Set doc = word.Documents.Open("<replace with absolute path to your .docx/.pdf>")
' Plain VBScript does not know Word's named constants, so use the literal
' value of wdStatisticWords, which is 0.
docWordCount = doc.Range.ComputeStatistics(0) ' wdStatisticWords
word.Quit
WScript.Echo docWordCount & " words"
Open PowerShell in the directory where you saved wc.vbs, run cscript .\wc.vbs, and you'll get back the word count :)
I think this may do what you are aiming for:
# Continuously updating word count
import unohelper, uno, os, time
from com.sun.star.i18n.WordType import WORD_COUNT
from com.sun.star.i18n import Boundary
from com.sun.star.lang import Locale
from com.sun.star.awt import XTopWindowListener

#socket = True
socket = False

localContext = uno.getComponentContext()
if socket:
    resolver = localContext.ServiceManager.createInstanceWithContext('com.sun.star.bridge.UnoUrlResolver', localContext)
    ctx = resolver.resolve('uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
else:
    ctx = localContext
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)
waittime = 1  # seconds

def getWordCountGoal():
    doc = XSCRIPTCONTEXT.getDocument()
    retval = 0
    # Only if the field exists
    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        # Get the field
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        retval = wordcountgoal.Content
    return retval

goal = getWordCountGoal()

def setWordCountGoal(goal):
    doc = XSCRIPTCONTEXT.getDocument()
    if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
        wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
        wordcountgoal.Content = goal
    # Refresh the field if inserted in the document from Insert > Fields >
    # Other... > Variables > Userdefined fields
    doc.TextFields.refresh()
def printOut(txt):
    if socket:
        print txt
    else:
        model = desktop.getCurrentComponent()
        text = model.Text
        cursor = text.createTextCursorByRange(text.getEnd())
        text.insertString(cursor, txt + '\r', 0)

def hotCount(st):
    '''Counts the number of words in a string.

    ARGUMENTS:
    str st: count the number of words in this string

    RETURNS:
    int: the number of words in st'''
    startpos = long()
    nextwd = Boundary()
    lc = Locale()
    lc.Language = 'en'
    numwords = 1
    mystartpos = 1
    brk = smgr.createInstanceWithContext('com.sun.star.i18n.BreakIterator', ctx)
    nextwd = brk.nextWord(st, startpos, lc, WORD_COUNT)
    while nextwd.startPos != nextwd.endPos:
        numwords += 1
        nw = nextwd.startPos
        nextwd = brk.nextWord(st, nw, lc, WORD_COUNT)
    return numwords
def updateCount(wordCountModel, percentModel):
    '''Updates the GUI.

    Updates the word count and the percentage completed in the GUI. If some
    text of more than one word is selected (including in multiple selections by
    holding down the Ctrl/Cmd key), it updates the GUI based on the selection;
    if not, on the whole document.'''
    model = desktop.getCurrentComponent()
    try:
        if not model.supportsService('com.sun.star.text.TextDocument'):
            return
    except AttributeError:
        return
    sel = model.getCurrentSelection()
    try:
        selcount = sel.getCount()
    except AttributeError:
        return
    if selcount == 1 and sel.getByIndex(0).getString() == '':
        selcount = 0
    selwords = 0
    for nsel in range(selcount):
        thisrange = sel.getByIndex(nsel)
        atext = thisrange.getString()
        selwords += hotCount(atext)
    if selwords > 1:
        wc = selwords
    else:
        try:
            wc = model.WordCount
        except AttributeError:
            return
    wordCountModel.Label = str(wc)
    if goal != 0:
        pc_text = 100 * (wc / float(goal))
        #pc_text = '(%.2f percent)' % (100 * (wc / float(goal)))
        percentModel.ProgressValue = pc_text
    else:
        percentModel.ProgressValue = 0
# This is the user interface bit. It looks more or less like this:
###############################
# Word Count            _ o x #
###############################
#                       _____ #
#  451 /               |500 | #
#                       ----- #
# ___________________________ #
# |##############           | #
# --------------------------- #
###############################
# The boxed `500' is the text entry box.

class WindowClosingListener(unohelper.Base, XTopWindowListener):
    def __init__(self):
        global keepGoing
        keepGoing = True

    def windowClosing(self, e):
        global keepGoing
        keepGoing = False
        setWordCountGoal(goal)
        e.Source.setVisible(False)
def addControl(controlType, dlgModel, x, y, width, height, label, name=None):
    control = dlgModel.createInstance(controlType)
    control.PositionX = x
    control.PositionY = y
    control.Width = width
    control.Height = height
    if controlType == 'com.sun.star.awt.UnoControlFixedTextModel':
        control.Label = label
    elif controlType == 'com.sun.star.awt.UnoControlEditModel':
        control.Text = label
    elif controlType == 'com.sun.star.awt.UnoControlProgressBarModel':
        control.ProgressValue = label
    if name:
        control.Name = name
        dlgModel.insertByName(name, control)
    else:
        control.Name = 'unnamed'
        dlgModel.insertByName('unnamed', control)
    return control
def loopTheLoop(goalModel, wordCountModel, percentModel):
    global goal
    while keepGoing:
        try:
            goal = int(goalModel.Text)
        except:
            goal = 0
        updateCount(wordCountModel, percentModel)
        time.sleep(waittime)

if not socket:
    import threading

    class UpdaterThread(threading.Thread):
        def __init__(self, goalModel, wordCountModel, percentModel):
            threading.Thread.__init__(self)
            self.goalModel = goalModel
            self.wordCountModel = wordCountModel
            self.percentModel = percentModel

        def run(self):
            loopTheLoop(self.goalModel, self.wordCountModel, self.percentModel)
def wordCount(arg=None):
    '''Displays a continuously updating word count.'''
    dialogModel = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialogModel', ctx)
    dialogModel.PositionX = XSCRIPTCONTEXT.getDocument().CurrentController.Frame.ContainerWindow.PosSize.Width / 2.2 - 105
    dialogModel.Width = 100
    dialogModel.Height = 30
    dialogModel.Title = 'Word Count'
    lblWc = addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 2, 25, 14, '', 'lblWc')
    lblWc.Align = 2  # Align right
    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 33, 2, 10, 14, ' / ')
    txtGoal = addControl('com.sun.star.awt.UnoControlEditModel', dialogModel, 45, 1, 25, 12, '', 'txtGoal')
    txtGoal.Text = goal
    #addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 25, 50, 14, '(percent)', 'lblPercent')
    ProgressBar = addControl('com.sun.star.awt.UnoControlProgressBarModel', dialogModel, 6, 15, 88, 10, '', 'lblPercent')
    ProgressBar.ProgressValueMin = 0
    ProgressBar.ProgressValueMax = 100
    #ProgressBar.Border = 2
    #ProgressBar.BorderColor = 255
    #ProgressBar.FillColor = 255
    #ProgressBar.BackgroundColor = 255
    addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 124, 2, 12, 14, '', 'lblMinus')
    controlContainer = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialog', ctx)
    controlContainer.setModel(dialogModel)
    controlContainer.addTopWindowListener(WindowClosingListener())
    controlContainer.setVisible(True)
    goalModel = controlContainer.getControl('txtGoal').getModel()
    wordCountModel = controlContainer.getControl('lblWc').getModel()
    percentModel = controlContainer.getControl('lblPercent').getModel()
    ProgressBar.ProgressValue = percentModel.ProgressValue
    if socket:
        loopTheLoop(goalModel, wordCountModel, percentModel)
    else:
        uthread = UpdaterThread(goalModel, wordCountModel, percentModel)
        uthread.start()

keepGoing = True
if socket:
    wordCount()
else:
    g_exportedScripts = wordCount,
Link for more info:
https://superuser.com/questions/529446/running-word-count-in-openoffice-writer
Hope this helps. Regards, Tom
EDIT: Then I found this:
http://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=22555
wc can understand Unicode and uses the system's iswspace function to find whether a Unicode character is whitespace. "The iswspace() function tests whether wc is a wide-character code representing a character of class space in the program's current locale." So wc -w should be able to correctly count words if your locale (LC_CTYPE) is configured correctly.
The source code of the wc program
The manual page for the iswspace function
I found the answer: create one service
#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#
(while sleep 1; do
    ls pathwithfiles/in | while read file; do
        libreoffice --headless --convert-to pdf "pathwithfiles/in/$file" --outdir pathwithfiles/out
        rm "pathwithfiles/in/$file"
    done
done) &
then the PHP script that I needed counted everything:
$ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
if ($ext !== 'txt' && $ext !== 'pdf') {
    // Convert to pdf
    $tb = mktime() . mt_rand();
    $tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
    copy($absolute_file_path, $tempfile);
    $absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
    $ext = 'pdf';
    while (!is_file($absolute_file_path)) sleep(1);
}
if ($ext !== 'txt') {
    // Convert to txt
    $tempfile = tempnam(sys_get_temp_dir(), '');
    shell_exec('pdftotext "' . $absolute_file_path . '" ' . $tempfile);
    $absolute_file_path = $tempfile;
    $ext = 'txt';
}
if ($ext === 'txt') {
    $seq = '/[\s\.,;:!\? ]+/mu';
    $plain = file_get_contents($absolute_file_path);
    $plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
    $str = preg_replace($seq, '', $plain);
    $chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
    $words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
    if ($words === 0) return $chars;
    if ($chars / $words > 10) $words = $chars;
    return $words;
}
