Compiling individual webpage tables into a single Excel readable table - web-scraping

I would like to create a master list of contact information for all chiropractors in Arizona. The board website lists all the chiropractors here; however, I have to click through to see each individual address and phone number.
How can I get all of the information about each chiropractor into a single spreadsheet, one row per chiropractor?

This is easy. In your first sheet, go to Data > External data > From website. Paste the URL, select the main table, click Next and put it in A1.
In the VBA editor, paste the following macro and run it. It retrieves the detail page for every license number listed in the first sheet and pastes the data into the second sheet (the macro uses the French sheet names Feuil1 and Feuil2; on an English install these are Sheet1 and Sheet2). The rest is just reorganizing the data, which is not the topic of your question, so I leave it up to you.
Sub ExtractAllData()
Dim dest As Range, license As Range
Dim license_no As String
Worksheets("Feuil2").Select
Set dest = Worksheets("Feuil2").Range("A1")
Set license = Worksheets("Feuil1").Range("C3")
Do Until license.Value = ""
license_no = Mid(license.Value, 1, InStr(1, license.Value, " "))
With Worksheets("Feuil2").QueryTables.Add(Connection:= _
"URL;http://www.azchiroboard.us/ProDetail.asp?LicenseNo=" & license_no, Destination:= _
dest)
.Name = "ProDetail.asp?LicenseNo=" & license_no
.FieldNames = True
.RowNumbers = False
.FillAdjacentFormulas = False
.PreserveFormatting = True
.RefreshOnFileOpen = False
.BackgroundQuery = True
.RefreshStyle = xlInsertDeleteCells
.SavePassword = False
.SaveData = True
.AdjustColumnWidth = True
.RefreshPeriod = 0
.WebSelectionType = xlSpecifiedTables
.WebFormatting = xlWebFormattingNone
.WebTables = "1"
.WebPreFormattedTextToColumns = True
.WebConsecutiveDelimitersAsOne = True
.WebSingleBlockTextImport = False
.WebDisableDateRecognition = False
.WebDisableRedirections = False
.Refresh BackgroundQuery:=False
End With
Set dest = Range("A65535").End(xlUp)
Set dest = dest.Offset(1, 0)
Set license = license.Offset(1, 0)
Loop
End Sub
For the record, it took me 1 minute to figure out how to retrieve the data from the main table, 1 minute to figure out that the link just calls an ASP page (ProDetail.asp) with the license number, 1 minute to record the macro, and then 5 minutes to adapt it and fix errors I had made.
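The core of the macro is the step that turns one cell of the main table into a detail-page URL. A minimal Python sketch of that step, assuming (as the macro does) that the license number is the first token in the cell; the helper names `license_number` and `detail_url` and the sample cell values are hypothetical:

```python
from urllib.parse import urlencode

# Base URL taken from the macro above; everything else is illustrative.
DETAIL_URL = "http://www.azchiroboard.us/ProDetail.asp"


def license_number(cell_text):
    """Return the first whitespace-delimited token of the cell, or '' if empty."""
    parts = cell_text.split(maxsplit=1)
    return parts[0] if parts else ""


def detail_url(cell_text):
    """Build the detail-page URL for one row of the main table."""
    query = urlencode({"LicenseNo": license_number(cell_text)})
    return DETAIL_URL + "?" + query


if __name__ == "__main__":
    # Hypothetical cell contents, for illustration only.
    for cell in ["1234 Active", "5678 Expired"]:
        print(detail_url(cell))
```

Each URL can then be fetched and parsed with whatever tool is at hand; the macro does this with one QueryTable per license.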


SAP - Error 800A026B - The control could not be found by id

I'm dealing with a problem regarding SAP GUI automation.
I have the script shown below, which works without any problem, BUT only in the first SAP window.
When I have something open in the first SAP window and, for example, the SAP home page in the second window, I get this error:
800A026B - The control could not be found by id.
Is there a way to say: if window 1 is not available, try window 2 (or, if window 1 is not available, open a new window and run the script there)?
The script is below:
If Not IsObject(application) Then
Set SapGuiAuto = GetObject("SAPGUI")
Set application = SapGuiAuto.GetScriptingEngine
End If
If Not IsObject(connection) Then
Set connection = application.Children(0)
End If
If Not IsObject(session) Then
Set session = connection.Children(0)
End If
If IsObject(WScript) Then
WScript.ConnectObject session, "on"
WScript.ConnectObject application, "on"
End If
session.findById("wnd[0]").resizeWorkingPane 206,40,false
session.findById("wnd[0]/usr/cntlIMAGE_CONTAINER/shellcont/shell/shellcont[0]/shell").doubleClickNode "F00082"
session.findById("wnd[0]/usr/ctxtDATABROWSE-TABLENAME").text = "/CASWW/LBL_DATA"
session.findById("wnd[0]").sendVKey 0
session.findById("wnd[0]/mbar/menu[3]/menu[2]").select
session.findById("wnd[1]/usr/chk[2,13]").selected = true
session.findById("wnd[1]/usr/chk[2,13]").setFocus
session.findById("wnd[1]/usr").verticalScrollbar.position = 21
session.findById("wnd[1]/usr/chk[2,13]").selected = true
session.findById("wnd[1]/usr/chk[2,13]").setFocus
session.findById("wnd[1]/tbar[0]/btn[0]").press
session.findById("wnd[0]/usr/ctxtI1-LOW").text = "2000"
session.findById("wnd[0]/usr/ctxtI2-LOW").text = "A2C03664400"
session.findById("wnd[0]/usr/ctxtI3-LOW").text = "1.1.2021"
session.findById("wnd[0]/usr/ctxtI3-HIGH").setFocus
session.findById("wnd[0]/usr/ctxtI3-HIGH").caretPosition = 0
session.findById("wnd[0]").sendVKey 4
session.findById("wnd[1]/usr/cntlCONTAINER/shellcont/shell").focusDate = "20220104"
session.findById("wnd[1]/tbar[0]/btn[0]").press
session.findById("wnd[0]/usr/ctxtLIST_BRE").text = "2500"
session.findById("wnd[0]/usr/txtMAX_SEL").text = "5000 "
session.findById("wnd[0]/usr/txtMAX_SEL").setFocus
session.findById("wnd[0]/usr/txtMAX_SEL").caretPosition = 10
session.findById("wnd[0]/tbar[1]/btn[8]").press
session.findById("wnd[0]/mbar/menu[3]/menu[0]/menu[1]").select
session.findById("wnd[1]/tbar[0]/btn[14]").press
session.findById("wnd[1]/usr/chk[1,3]").selected = true
session.findById("wnd[1]/usr/chk[1,4]").selected = true
session.findById("wnd[1]/usr/chk[1,5]").selected = true
session.findById("wnd[1]/usr/chk[1,6]").selected = true
session.findById("wnd[1]/usr/chk[1,8]").selected = true
session.findById("wnd[1]/usr/chk[1,9]").selected = true
session.findById("wnd[1]/usr/chk[1,12]").selected = true
session.findById("wnd[1]/usr/chk[1,17]").selected = true
session.findById("wnd[1]/usr/chk[1,20]").selected = true
session.findById("wnd[1]/usr/chk[1,21]").selected = true
session.findById("wnd[1]/usr/chk[1,21]").setFocus
session.findById("wnd[1]/usr").verticalScrollbar.position = 19
session.findById("wnd[1]/usr/chk[1,14]").selected = true
session.findById("wnd[1]/usr/chk[1,14]").setFocus
session.findById("wnd[1]/tbar[0]/btn[6]").press
session.findById("wnd[0]/usr/lbl[173,3]").setFocus
session.findById("wnd[0]/usr/lbl[173,3]").caretPosition = 0
session.findById("wnd[0]").sendVKey 42
session.findById("wnd[0]/mbar/menu[1]/menu[5]").select
session.findById("wnd[1]/usr/subSUBSCREEN_STEPLOOP:SAPLSPO5:0150/sub:SAPLSPO5:0150/radSPOPLI-SELFLAG[1,0]").select
session.findById("wnd[1]/usr/subSUBSCREEN_STEPLOOP:SAPLSPO5:0150/sub:SAPLSPO5:0150/radSPOPLI-SELFLAG[1,0]").setFocus
session.findById("wnd[1]/tbar[0]/btn[0]").press
session.findById("wnd[1]/usr/ctxtDY_FILENAME").text = "D3.XLS"
session.findById("wnd[1]/usr/ctxtDY_PATH").text = "\\cw01.contiwan.com\bdyp\dida4065\Zakaznicka_kvalita\300003300023-0002-0001"
session.findById("wnd[1]/usr/ctxtDY_FILE_ENCODING").text = "0000"
session.findById("wnd[1]/usr/ctxtDY_FILE_ENCODING").setFocus
session.findById("wnd[1]/usr/ctxtDY_FILE_ENCODING").caretPosition = 4
session.findById("wnd[1]/tbar[0]/btn[0]").press
session.findById("wnd[0]/tbar[0]/btn[3]").press
session.findById("wnd[0]/tbar[0]/btn[3]").press
session.findById("wnd[0]/tbar[0]/btn[3]").press
session.findById("wnd[0]").resizeWorkingPane 206,40,false
session.findById("wnd[0]/tbar[0]/okcd").text = "mb52"
session.findById("wnd[0]").sendVKey 0
session.findById("wnd[0]/usr/ctxtMATNR-LOW").text = "A2C03664400"
session.findById("wnd[0]/usr/ctxtWERKS-LOW").text = "2000"
session.findById("wnd[0]/usr/ctxtWERKS-LOW").setFocus
session.findById("wnd[0]/usr/ctxtWERKS-LOW").caretPosition = 4
session.findById("wnd[0]/tbar[1]/btn[8]").press
Thank you for any advice.
OK, I just solved the issue, so here is some guidance for anyone who runs into the same problem.
If you have something open in window 1, you have 2 options:
Terminate whatever is open in window 1, no matter what it is.
To terminate, add /n before the name of the transaction.
Open a new session and run the script there.
For a new session, add /o before the name of the transaction.
Use the following:
session.findById("wnd[0]").resizeWorkingPane 159,28,false
session.findById("wnd[0]/tbar[0]/okcd").text = "/nName_of_Transaction"
session.findById("wnd[0]").sendVKey 0
So in my code it would be:
session.findById("wnd[0]").resizeWorkingPane 159,28,false
session.findById("wnd[0]/tbar[0]/okcd").text = "/nZQQN_WLQN"
session.findById("wnd[0]").sendVKey 0
session.findById("wnd[0]/usr/chkDY_RST").selected = true
session.findById("wnd[0]/usr/chkDY_MAB").selected = true
session.findById("wnd[0]/usr/chkDY_DEL").selected = true
etc...
Thanks also to Sandra for her help.
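The two prefixes can be wrapped in a small helper so a script picks one behaviour explicitly. A minimal sketch in Python; the function name `okcode` is hypothetical, and only the /n and /o prefixes come from the answer above:

```python
def okcode(transaction, new_session=False):
    """Return the command-field text for a transaction.

    /n ends whatever is running in the current session and starts the
    transaction there; /o starts the transaction in a new session.
    """
    prefix = "/o" if new_session else "/n"
    return prefix + transaction.strip()


if __name__ == "__main__":
    # The resulting string is what the script assigns to
    # session.findById("wnd[0]/tbar[0]/okcd").text before sendVKey 0.
    print(okcode("ZQQN_WLQN"))
    print(okcode("mb52", new_session=True))
```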

Why is so much duplicated data being saved to my Excel sheet by my code?

This code is used to scrape data from websites, but the problem is that it produces a large number of duplicate rows and saves them to my Excel sheet.
def extractor():
time.sleep(10)
souptree = html.fromstring(driver.page_source)
tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
for tbu in tburl:
allurl = []
allurl.append(urllib.parse.urljoin(siteurl, tbu))
for tb in allurl:
get_url = requests.get(tb)
get_soup = html.fromstring(get_url.content)
pattern = re.compile(r"^\s+|\s*,\s*|\s+$")
name = get_soup.xpath('//td[@headers="contactName"]//text()')
phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
artit = ([x for x in pattern.split(str(artitle)) if x][-1])
title = artit[:-2]
for (nam, pho, mai) in zip(name, phone, mail):
fname = nam[9:]
allmails.append(mai)
allnames.append(fname)
allphone.append(pho)
alltitles.append(title)
fullfile = pd.DataFrame({'Names': allnames, 'Mails': allmails, 'Title': alltitles, 'Phone Numbers': allphone})
writer = ExcelWriter('G:\\Sheet_Name.xlsx')
fullfile.to_excel(writer, 'Sheet1', index=False)
writer.save()
print(fname, pho, mai, title, sep='\t')
while True:
time.sleep(10)
extractor()
try:
nextbutton()
except (WebDriverException):
driver.refresh()
except(NoSuchElementException):
time.sleep(10)
driver.quit()
I want the output to contain no duplicates, but half or more of the rows are duplicated each time I run the code.
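One possible guard, whatever the source of the repeats, is to drop duplicate rows just before writing the sheet. A minimal sketch, assuming each scraped contact is held as a (name, mail, title, phone) tuple; the function name `unique_rows` and the sample data are hypothetical, and this does not diagnose why the scraper visits a contact twice:

```python
def unique_rows(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical scraped rows; the second and third are identical.
    rows = [
        ("Ann", "ann@example.com", "Report A", "555-0100"),
        ("Bob", "bob@example.com", "Report B", "555-0101"),
        ("Bob", "bob@example.com", "Report B", "555-0101"),
    ]
    for row in unique_rows(rows):
        print(*row, sep="\t")
```

With pandas already in use, calling `drop_duplicates()` on the DataFrame before `to_excel` has the same effect.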

Application Engine Peoplecode bind variables

I have the below PeopleCode step in an Application Engine program that reads a CSV file using a File Layout and then inserts the data into a table. I am trying to get a better understanding of how the line &SQL1 = CreateSQL("%Insert(:1)"); in the script below works. It looks like CreateSQL is using a bind variable (:1) inside the Insert statement, but I am struggling to find where this variable is defined in the program.
Function EditRecord(&REC As Record) Returns boolean;
Local integer &E;
&REC.ExecuteEdits(%Edit_Required + %Edit_DateRange + %Edit_YesNo + %Edit_OneZero);
If &REC.IsEditError Then
For &E = 1 To &REC.FieldCount
&MYFIELD = &REC.GetField(&E);
If &MYFIELD.EditError Then
&MSGNUM = &MYFIELD.MessageNumber;
&MSGSET = &MYFIELD.MessageSetNumber;
&LOGFILE.WriteLine("****Record:" | &REC.Name | ", Field:" | &MYFIELD.Name);
&LOGFILE.WriteLine("****" | MsgGet(&MSGSET, &MSGNUM, ""));
End-If;
End-For;
Return False;
Else
Return True;
End-If;
End-Function;
Function ImportSegment(&RS2 As Rowset, &RSParent As Rowset)
Local Rowset &RS1, &RSP;
Local string &RecordName;
Local Record &REC2, &RECP;
Local SQL &SQL1;
Local integer &I, &L;
&SQL1 = CreateSQL("%Insert(:1)");
rem &SQL1 = CreateSQL("%Insert(:1) Order by COUNT_ORDER");
&RecordName = "RECORD." | &RS2.DBRecordName;
&REC2 = CreateRecord(@(&RecordName));
&RECP = &RSParent(1).GetRecord(@(&RecordName));
For &I = 1 To &RS2.ActiveRowCount
&RS2(&I).GetRecord(1).CopyFieldsTo(&REC2);
If (EditRecord(&REC2)) Then
&SQL1.Execute(&REC2);
&RS2(&I).GetRecord(1).CopyFieldsTo(&RECP);
For &L = 1 To &RS2.GetRow(&I).ChildCount
&RS1 = &RS2.GetRow(&I).GetRowset(&L);
If (&RS1 <> Null) Then
&RSP = &RSParent.GetRow(1).GetRowset(&L);
ImportSegment(&RS1, &RSP);
End-If;
End-For;
If &RSParent.ActiveRowCount > 0 Then
&RSParent.DeleteRow(1);
End-If;
Else
&LOGFILE.WriteRowset(&RS);
&LOGFILE.WriteLine("****Correct error in this record and delete all error messages");
&LOGFILE.WriteRecord(&REC2);
For &L = 1 To &RS2.GetRow(&I).ChildCount
&RS1 = &RS2.GetRow(&I).GetRowset(&L);
If (&RS1 <> Null) Then
&LOGFILE.WriteRowset(&RS1);
End-If;
End-For;
End-If;
End-For;
End-Function;
rem *****************************************************************;
rem * PeopleCode to Import Data *;
rem *****************************************************************;
Local File &FILE1, &FILE3;
Local Record &REC1;
Local SQL &SQL1;
Local Rowset &RS1, &RS2;
Local integer &M;
&FILE1 = GetFile("\\nt115\apps\interface_prod\interface_in\Item_Loader\ItemPriceFile.csv", "r", "a", %FilePath_Absolute);
&LOGFILE = GetFile("\\nt115\apps\interface_prod\interface_in\Item_Loader\ItemPriceFile.txt", "r", "a", %FilePath_Absolute);
&FILE1.SetFileLayout(FileLayout.GH_ITM_PR_UPDT);
&LOGFILE.SetFileLayout(FileLayout.GH_ITM_PR_UPDT);
&RS1 = &FILE1.CreateRowset();
&RS = CreateRowset(Record.GH_ITM_PR_UPDT);
REM &SQL1 = CreateSQL("%Insert(:1)");
&SQL1 = CreateSQL("%Insert(:1)");
/*Skip Header Row: The following line of code reads the first line in the file layout (the header)
and does nothing. Then the pointer goes to the next line in the file and starts using the
file.readrowset*/
&some_boolean = &FILE1.ReadLine(&string);
&RS1 = &FILE1.ReadRowset();
While &RS1 <> Null
ImportSegment(&RS1, &RS);
&RS1 = &FILE1.ReadRowset();
End-While;
&FILE1.Close();
&LOGFILE.Close();
The :1 is bound by the line further down: &SQL1.Execute(&REC2);
&REC2 is assigned a record object, so at that point the statement evaluates to %Insert(your_record_object); the first argument passed to Execute fills bind variable :1.
Here is a simple example that's doing basically the same thing
Here is a description of %Insert
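As an analogy only (this is Python with sqlite3, not PeopleCode), the idea of "prepare the statement once, supply the record at execute time" can be sketched as follows. The helper `expand_insert`, the table `item_price` and its columns are hypothetical; the point is that the INSERT is built from whatever record is passed in, much as %Insert(:1) is built from the record bound to :1:

```python
import sqlite3


def expand_insert(table, record):
    """Build an INSERT statement and its values from a dict acting as the record."""
    fields = list(record)
    columns = ", ".join(fields)
    placeholders = ", ".join("?" for _ in fields)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(table, columns, placeholders)
    return sql, [record[name] for name in fields]


def demo():
    """Insert one hypothetical record into an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item_price (item_id TEXT, price REAL)")
    record = {"item_id": "A100", "price": 9.5}
    sql, values = expand_insert("item_price", record)
    conn.execute(sql, values)  # the record is supplied only at execute time
    rows = conn.execute("SELECT item_id, price FROM item_price").fetchall()
    conn.close()
    return sql, rows


if __name__ == "__main__":
    print(demo())
```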
Answer because too long to comment:
The table name is most likely (PS_)GH_ITM_PR_UPDT. The general consensus is to name the FileLayout the same as the record it is based on.
If not, it is defined in FileLayout.GH_ITM_PR_UPDT. Open the FileLayout, right click the segment and under 'Selected Node Properties' you will find the 'File Record Name'.
In your code this record is carried over into &RS1.
&FILE1.SetFileLayout(FileLayout.GH_ITM_PR_UPDT);
&RS1 = &FILE1.CreateRowset();
The rowset is a collection of rows. A row consists of records, and a record is a row of data from a database table. (PeopleSoft object data types are fun...)
This rowset is filled with data in the following statement:
&RS1 = &FILE1.ReadRowset();
This uses your file as input and outputs a rowset collection, mapping the data to records based on how you defined your FileLayout.
The result is fed into the ImportSegment function:
ImportSegment(&RS1, &RS);
Function ImportSegment(&RS2 As Rowset, &RSParent As Rowset)
&RS2 in the function is a reference to &RS1 in the rest of your code.
The table name is also hidden here:
&RecordName = "RECORD." | &RS2.DBRecordName;
So if you can't or don't want to check the FileLayout, you could output &RS2.DBRecordName with a MessageBox, and your answer will be in the Message Log of your Process Monitor.
Finally, a record object is created for this database table and filled with a row from the rowset. This record is inserted into the database table:
&REC2 = CreateRecord(@(&RecordName));
&RS2(&I).GetRecord(1).CopyFieldsTo(&REC2);
&SQL1 = CreateSQL("%Insert(:1)");
&SQL1.Execute(&REC2);
TLDR:
Table name can be found in the FileLayout or output in the ImportSegment Function as &RS2.DBRecordName

How to display WebDynpro ABAP in ABAP report?

I started coding ABAP just a few days ago, and I have a task to call a report (transaction SE38) and have the report's result shown on the screen of a Web Dynpro application (SE80).
The report takes user input (e.g. material number, material type, plant, sales org.) as conditions for its query, so the Web Dynpro application must allow the user to key in these parameters.
Some related articles talk about using SUBMIT rep EXPORTING LIST TO MEMORY and CALL FUNCTION 'LIST_FROM_MEMORY', but so far I really have no idea how to implement it.
Any answers will be appreciated. Thanks!
You can export it to PDF. Then, when a user clicks on a link, you run the conversion and display the file in the browser window.
To do so, you start by creating a job using the code below:
constants c_name type tbtcjob-jobname value 'YOUR_JOB_NAME'.
data v_number type tbtcjob-jobcount.
data v_print_parameters type pri_params.
call function 'JOB_OPEN'
exporting
jobname = c_name
importing
jobcount = v_number
exceptions
cant_create_job = 1
invalid_job_data = 2
jobname_missing = 3
others = 4.
if sy-subrc = 0.
commit work and wait.
else.
EXIT. "// todo: err handling here
endif.
Then, you need to get the printer parameters in order to submit the report:
call function 'GET_PRINT_PARAMETERS'
exporting
destination = 'LP01'
immediately = space
new_list_id = 'X'
no_dialog = 'X'
user = sy-uname
importing
out_parameters = v_print_parameters
exceptions
archive_info_not_found = 1
invalid_print_params = 2
invalid_archive_params = 3
others = 4.
v_print_parameters-linct = 55.
v_print_parameters-linsz = 1.
v_print_parameters-paart = 'LETTER'.
Now you submit your report using the filters that apply. Do not forget to add the job parameters to it, as the code below shows:
submit your_report_name
to sap-spool
spool parameters v_print_parameters
without spool dynpro
with ...(insert all your filters here)
via job c_name number v_number
and return.
if sy-subrc = 0.
commit work and wait.
else.
EXIT. "// todo: err handling here
endif.
After that, you close the job:
call function 'JOB_CLOSE'
exporting
jobcount = v_number
jobname = c_name
strtimmed = 'X'
exceptions
cant_start_immediate = 1
invalid_startdate = 2
jobname_missing = 3
job_close_failed = 4
job_nosteps = 5
job_notex = 6
lock_failed = 7
others = 8.
if sy-subrc = 0.
commit work and wait.
else.
EXIT. "// todo: err handling here
endif.
Now the job will proceed and you'll need to wait for it to complete. Do it with a loop. Once the job is completed, you can get its spool output and convert it to PDF.
data v_rqident type tsp01-rqident.
data v_job_head type tbtcjob.
data t_job_steplist type tbtcstep occurs 0 with header line.
data t_pdf like tline occurs 0 with header line.
do 200 times.
wait up to 1 seconds.
call function 'BP_JOB_READ'
exporting
job_read_jobcount = v_number
job_read_jobname = c_name
job_read_opcode = '20'
importing
job_read_jobhead = v_job_head
tables
job_read_steplist = t_job_steplist
exceptions
invalid_opcode = 1
job_doesnt_exist = 2
job_doesnt_have_steps = 3
others = 4.
read table t_job_steplist index 1.
if not t_job_steplist-listident is initial.
v_rqident = t_job_steplist-listident.
exit.
else.
clear v_job_head.
clear t_job_steplist.
clear t_job_steplist[].
endif.
enddo.
check not v_rqident is initial.
call function 'CONVERT_ABAPSPOOLJOB_2_PDF'
exporting
src_spoolid = v_rqident
dst_device = 'LP01'
tables
pdf = t_pdf
exceptions
err_no_abap_spooljob = 1
err_no_spooljob = 2
err_no_permission = 3
err_conv_not_possible = 4
err_bad_destdevice = 5
user_cancelled = 6
err_spoolerror = 7
err_temseerror = 8
err_btcjob_open_failed = 9
err_btcjob_submit_failed = 10
err_btcjob_close_failed = 11
others = 12.
If you're going to send it via HTTP, you may need to convert it to BASE64 as well.
field-symbols <xchar> type x.
data v_offset(10) type n.
data v_char type c.
data v_xchar(2) type x.
data v_xstringdata_aux type xstring.
data v_xstringdata type xstring.
data v_base64data type string.
data v_base64data_aux type string.
loop at t_pdf.
do 134 times.
v_offset = sy-index - 1.
v_char = t_pdf+v_offset(1).
assign v_char to <xchar> casting type x.
concatenate v_xstringdata_aux <xchar> into v_xstringdata_aux in byte mode.
enddo.
concatenate v_xstringdata v_xstringdata_aux into v_xstringdata in byte mode.
clear v_xstringdata_aux.
endloop.
call function 'SCMS_BASE64_ENCODE_STR'
exporting
input = v_xstringdata
importing
output = v_base64data.
v_base64data_aux = v_base64data.
while strlen( v_base64data_aux ) gt 255.
clear t_base64data.
t_base64data-data = v_base64data_aux.
v_base64data_aux = v_base64data_aux+255.
append t_base64data.
endwhile.
if not v_base64data_aux is initial.
t_base64data-data = v_base64data_aux.
append t_base64data.
endif.
And you're done!
Hope it helps.
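For readers who want to check the BASE64 step outside SAP, the encode-and-split logic above can be sketched in Python. This is only an illustration of the same idea (encode the bytes, then cut the text into 255-character pieces); the function name `base64_chunks` is hypothetical:

```python
import base64


def base64_chunks(data, width=255):
    """Encode bytes as BASE64 text and split it into pieces of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    return [encoded[i:i + width] for i in range(0, len(encoded), width)]


if __name__ == "__main__":
    # Placeholder bytes standing in for the PDF content.
    chunks = base64_chunks(b"%PDF-1.4 example bytes" * 50)
    print(len(chunks), "chunks, last one has", len(chunks[-1]), "characters")
```

Joining the pieces in order and decoding them gives back the original bytes, which is a quick way to verify the chunking on the receiving side.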
As the previous posters said, you should do extensive training before implementing such things in a productive environment.
However, calling a Web Dynpro ABAP application from within a report can be done with the help of the WDY_EXECUTE_IN_PLACE function module. You pass it the Web Dynpro application name and the necessary parameters.
CALL FUNCTION 'WDY_EXECUTE_IN_PLACE'
EXPORTING
* PROTOCOL =
INTERNALMODE = ' '
* SMARTCLIENT =
APPLICATION = 'Z_MY_WEBDYNPRO'
* CONTAINER_NAME =
PARAMETERS = lt_parameters
* SUPPRESS_OUTPUT =
TRY_TO_USE_SAPGUI_THEME = ' '
IMPORTING
OUT_URL = ex_url
.
IF sy-subrc <> 0.
* Implement suitable error handling here
ENDIF.

Can openoffice count words from console?

I have a small problem: I need to count words from the console in doc, docx, pptx, ppt, xls, xlsx, odt, pdf ... files. Please don't suggest | wc -w or grep, because they only work with text or console output and only count whitespace-separated tokens. Japanese, Chinese, Arabic, Hindi and Hebrew use different delimiters, so the word count comes out wrong. I tried counting with these:
pdftotext file.pdf -| wc -w
/usr/local/bin/docx2txt.pl < file.docx | wc -w
/usr/local/bin/pptx2txt.pl < file.pptx | wc -w
antiword file.doc -| wc -w
antiword file.word -| wc -w
In some cases Microsoft Word and OpenOffice said 1000 words while these counters returned 10 or 300 words when the language is Japanese, Chinese, Hindi, etc. With normal (Latin) characters I have no issue; the biggest error is that the count is off by 3 in some cases, which is OK.
I tried to convert with soffice or openoffice and then use wc -w, but I can't even get the conversion to work:
soffice --headless --nofirststartwizard --accept=socket,host=127.0.0.1,port=8100; --convert-to pdf some.pdf /var/www/domains/vocabridge.com/devel/temp_files/23/0/东京_1000_words_Docx.docx
OR
openoffice.org --headless --convert-to ........
OR
openoffice.org3 --invisible
So if someone knows any way to count correctly, or to display document statistics, with OpenOffice or anything else on Linux from the console, please share it.
Thanks.
If you have Microsoft Word (and Windows, obviously) you can write a VBA macro or if you want to run straight from the command line you can write a VBScript script with something like the following:
Set wordApp = CreateObject("Word.Application")
Set doc = ... ' open up a Word document using wordApp
docWordCount = doc.Words.Count
' Rinse and repeat...
If you have OpenOffice.org/LibreOffice you have similar (but more) options. If you want to stay in the office app and run a macro you can probably do that. I don't know the StarBasic API well enough to tell you how but I can give you the alternative: creating a Python script to get the word count from the command line. Roughly speaking, you do the following:
Start up your copy of OOo/LibO from the command line with the appropriate parameters to accept incoming socket connections. http://www.openoffice.org/udk/python/python-bridge.html has instructions on how to do that. Go there and use the browser's in-page find feature to search for `accept=socket'
Write a Python script to use the OOo/LibO UNO bridge (basically equivalent to the VBScript example above) to open up your Word/ODT documents one at a time and get the word count from each. The above page should give you a good start to doing that.
You get the word count from a document model object's WordCount property: http://www.openoffice.org/api/docs/common/ref/com/sun/star/text/GenericTextDocument.html#WordCount
Just building on what @Yawar wrote. Here are more explicit steps for getting a word count with MS Word from the console.
I also use the more accurate Range.ComputeStatistics(wdStatisticWords) instead of the Words property. See here for more info: https://support.microsoft.com/en-za/help/291447/word-count-appears-inaccurate-when-you-use-the-vba-words-property
Make a script called wc.vbs and put this in it:
Const wdStatisticWords = 0 ' Word's named constants are not predefined in VBScript
Set word = CreateObject("Word.Application")
word.Visible = False
Set doc = word.Documents.Open("<replace with absolute path to your .docx/.pdf>")
docWordCount = doc.Range.ComputeStatistics(wdStatisticWords)
doc.Close False ' close the document without saving changes
word.Quit
WScript.Echo docWordCount & " words"
Open PowerShell in the directory where you saved wc.vbs and run cscript .\wc.vbs, and you'll get back the word count :)
I think this may do what you are aiming for
# Continuously updating word count
import unohelper, uno, os, time
from com.sun.star.i18n.WordType import WORD_COUNT
from com.sun.star.i18n import Boundary
from com.sun.star.lang import Locale
from com.sun.star.awt import XTopWindowListener
#socket = True
socket = False
localContext = uno.getComponentContext()
if socket:
resolver = localContext.ServiceManager.createInstanceWithContext('com.sun.star.bridge.UnoUrlResolver', localContext)
ctx = resolver.resolve('uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext')
else: ctx = localContext
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext('com.sun.star.frame.Desktop', ctx)
waittime = 1 # seconds
def getWordCountGoal():
doc = XSCRIPTCONTEXT.getDocument()
retval = 0
# Only if the field exists
if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
# Get the field
wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
retval = wordcountgoal.Content
return retval
goal = getWordCountGoal()
def setWordCountGoal(goal):
doc = XSCRIPTCONTEXT.getDocument()
if doc.getTextFieldMasters().hasByName('com.sun.star.text.FieldMaster.User.WordCountGoal'):
wordcountgoal = doc.getTextFieldMasters().getByName('com.sun.star.text.FieldMaster.User.WordCountGoal')
wordcountgoal.Content = goal
# Refresh the field if inserted in the document from Insert > Fields >
# Other... > Variables > Userdefined fields
doc.TextFields.refresh()
def printOut(txt):
if socket: print(txt)
else:
model = desktop.getCurrentComponent()
text = model.Text
cursor = text.createTextCursorByRange(text.getEnd())
text.insertString(cursor, txt + '\r', 0)
def hotCount(st):
'''Counts the number of words in a string.
ARGUMENTS:
str st: count the number of words in this string
RETURNS:
int: the number of words in st'''
startpos = 0
nextwd = Boundary()
lc = Locale()
lc.Language = 'en'
numwords = 1
mystartpos = 1
brk = smgr.createInstanceWithContext('com.sun.star.i18n.BreakIterator', ctx)
nextwd = brk.nextWord(st, startpos, lc, WORD_COUNT)
while nextwd.startPos != nextwd.endPos:
numwords += 1
nw = nextwd.startPos
nextwd = brk.nextWord(st, nw, lc, WORD_COUNT)
return numwords
def updateCount(wordCountModel, percentModel):
'''Updates the GUI.
Updates the word count and the percentage completed in the GUI. If some
text of more than one word is selected (including in multiple selections by
holding down the Ctrl/Cmd key), it updates the GUI based on the selection;
if not, on the whole document.'''
model = desktop.getCurrentComponent()
try:
if not model.supportsService('com.sun.star.text.TextDocument'):
return
except AttributeError: return
sel = model.getCurrentSelection()
try: selcount = sel.getCount()
except AttributeError: return
if selcount == 1 and sel.getByIndex(0).getString() == '':
selcount = 0
selwords = 0
for nsel in range(selcount):
thisrange = sel.getByIndex(nsel)
atext = thisrange.getString()
selwords += hotCount(atext)
if selwords > 1: wc = selwords
else:
try: wc = model.WordCount
except AttributeError: return
wordCountModel.Label = str(wc)
if goal != 0:
pc_text = 100 * (wc / float(goal))
#pc_text = '(%.2f percent)' % (100 * (wc / float(goal)))
percentModel.ProgressValue = pc_text
else:
percentModel.ProgressValue = 0
# This is the user interface bit. It looks more or less like this:
###############################
# Word Count _ o x #
###############################
# _____ #
# 451 / |500| #
# ----- #
# ___________________________ #
# |############## | #
# --------------------------- #
###############################
# The boxed `500' is the text entry box.
class WindowClosingListener(unohelper.Base, XTopWindowListener):
def __init__(self):
global keepGoing
keepGoing = True
def windowClosing(self, e):
global keepGoing
keepGoing = False
setWordCountGoal(goal)
e.Source.setVisible(False)
def addControl(controlType, dlgModel, x, y, width, height, label, name = None):
control = dlgModel.createInstance(controlType)
control.PositionX = x
control.PositionY = y
control.Width = width
control.Height = height
if controlType == 'com.sun.star.awt.UnoControlFixedTextModel':
control.Label = label
elif controlType == 'com.sun.star.awt.UnoControlEditModel':
control.Text = label
elif controlType == 'com.sun.star.awt.UnoControlProgressBarModel':
control.ProgressValue = label
if name:
control.Name = name
dlgModel.insertByName(name, control)
else:
control.Name = 'unnamed'
dlgModel.insertByName('unnamed', control)
return control
def loopTheLoop(goalModel, wordCountModel, percentModel):
global goal
while keepGoing:
try: goal = int(goalModel.Text)
except: goal = 0
updateCount(wordCountModel, percentModel)
time.sleep(waittime)
if not socket:
import threading
class UpdaterThread(threading.Thread):
def __init__(self, goalModel, wordCountModel, percentModel):
threading.Thread.__init__(self)
self.goalModel = goalModel
self.wordCountModel = wordCountModel
self.percentModel = percentModel
def run(self):
loopTheLoop(self.goalModel, self.wordCountModel, self.percentModel)
def wordCount(arg = None):
'''Displays a continuously updating word count.'''
dialogModel = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialogModel', ctx)
dialogModel.PositionX = XSCRIPTCONTEXT.getDocument().CurrentController.Frame.ContainerWindow.PosSize.Width / 2.2 - 105
dialogModel.Width = 100
dialogModel.Height = 30
dialogModel.Title = 'Word Count'
lblWc = addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 2, 25, 14, '', 'lblWc')
lblWc.Align = 2 # Align right
addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 33, 2, 10, 14, ' / ')
txtGoal = addControl('com.sun.star.awt.UnoControlEditModel', dialogModel, 45, 1, 25, 12, '', 'txtGoal')
txtGoal.Text = goal
#addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 6, 25, 50, 14, '(percent)', 'lblPercent')
ProgressBar = addControl('com.sun.star.awt.UnoControlProgressBarModel', dialogModel, 6, 15, 88, 10,'' , 'lblPercent')
ProgressBar.ProgressValueMin = 0
ProgressBar.ProgressValueMax =100
#ProgressBar.Border = 2
#ProgressBar.BorderColor = 255
#ProgressBar.FillColor = 255
#ProgressBar.BackgroundColor = 255
addControl('com.sun.star.awt.UnoControlFixedTextModel', dialogModel, 124, 2, 12, 14, '', 'lblMinus')
controlContainer = smgr.createInstanceWithContext('com.sun.star.awt.UnoControlDialog', ctx)
controlContainer.setModel(dialogModel)
controlContainer.addTopWindowListener(WindowClosingListener())
controlContainer.setVisible(True)
goalModel = controlContainer.getControl('txtGoal').getModel()
wordCountModel = controlContainer.getControl('lblWc').getModel()
percentModel = controlContainer.getControl('lblPercent').getModel()
ProgressBar.ProgressValue = percentModel.ProgressValue
if socket:
loopTheLoop(goalModel, wordCountModel, percentModel)
else:
uthread = UpdaterThread(goalModel, wordCountModel, percentModel)
uthread.start()
keepGoing = True
if socket:
wordCount()
else:
g_exportedScripts = wordCount,
Link for more info
https://superuser.com/questions/529446/running-word-count-in-openoffice-writer
Hope this helps. Regards, Tom
EDIT: Then I found this:
http://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=22555
wc can understand Unicode and uses system's iswspace function to find whether the unicode character is whitespace. "The iswspace() function tests whether wc is a wide-character code representing a character of class space in the program's current locale." So, wc -w should be able to correctly count words if your locale (LC_CTYPE) is configured correctly.
The source code of the wc program
The manual page for the iswspace function
I found the answer: create a service like this one
#!/bin/sh
#
# chkconfig: 345 99 01
#
# description: your script is a test service
#
(while sleep 1; do
ls pathwithfiles/in | while read file; do
libreoffice --headless -convert-to pdf "pathwithfiles/in/$file" --outdir pathwithfiles/out
rm "pathwithfiles/in/$file"
done
done) &
Then the PHP script that I needed counted everything:
$ext = pathinfo($absolute_file_path, PATHINFO_EXTENSION);
if ($ext !== 'txt' && $ext !== 'pdf') {
// Convert to pdf
$tb = time() . mt_rand();
$tempfile = 'locationofpdfs/in/' . $tb . '.' . $ext;
copy($absolute_file_path, $tempfile);
$absolute_file_path = 'locationofpdfs/out/' . $tb . '.pdf';
$ext = 'pdf';
while (!is_file($absolute_file_path)) sleep(1);
}
if ($ext !== 'txt') {
// Convert to txt
$tempfile = tempnam(sys_get_temp_dir(), '');
shell_exec('pdftotext "' . $absolute_file_path . '" ' . $tempfile);
$absolute_file_path = $tempfile;
$ext = 'txt';
}
if ($ext === 'txt') {
$seq = '/[\s\.,;:!\? ]+/mu';
$plain = file_get_contents($absolute_file_path);
$plain = preg_replace('#\{{{.*?\}}}#su', "", $plain);
$str = preg_replace($seq, '', $plain);
$chars = count(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));
$words = count(preg_split($seq, $plain, -1, PREG_SPLIT_NO_EMPTY));
if ($words === 0) return $chars;
if ($chars / $words > 10) $words = $chars;
return $words;
}
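The counting rule at the end of the PHP script can be restated in Python to make the heuristic easier to see: split on whitespace and punctuation, and if the average token is longer than 10 characters, assume the text has no word separators (as in Japanese or Chinese) and count characters instead. This is a sketch of that rule only; the name `count_words` is hypothetical and the removal of {{{...}}} blocks from the PHP version is left out:

```python
import re

# Whitespace plus the punctuation used as separators in the PHP pattern.
# For str patterns, \s also matches Unicode spaces such as the ideographic space.
SEPARATORS = re.compile(r"[\s.,;:!?]+")


def count_words(text):
    """Count words, falling back to a character count for unsegmented scripts."""
    tokens = [t for t in SEPARATORS.split(text) if t]
    chars = sum(len(t) for t in tokens)
    words = len(tokens)
    if words == 0:
        return chars
    if chars / words > 10:
        # Average "word" is implausibly long: treat every character as a word.
        return chars
    return words


if __name__ == "__main__":
    print(count_words("the quick brown fox"))
    print(count_words("東京は日本の首都であり世界有数の大都市です"))
```

The threshold of 10 is the one used in the PHP script; it is a heuristic and can misfire on texts with many long words or identifiers.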
