I have a folder with files containing some text. I am trying to go through all the files one after the other, and count how many times we see every word in the text files.
I know how to open the file, but once I'm in the file I don't know how to read each word one after the other, and go to the next word.
If anyone has some ideas to guide me it'd be great.
Read the file a line at a time into a string using Get_Line, then break the line up into the individual words.
Here's one way of doing it, I needed some playtime with the containers.
Using streams is still the best solution for your problem, given the multiple files.
Text_Search.ads
Pragma Ada_2012;
With
Ada.Containers.Indefinite_Ordered_Maps;
Package Text_Search with Elaborate_Body is
Text : Constant String :=
ASCII.HT &
"We hold these truths to be self-evident, that all men are created " &
"equal, that they are endowed by their Creator with certain unalienable "&
"Rights, that among these are Life, Liberty and the pursuit of " &
"Happiness.--That to secure these rights, Governments are instituted " &
"among Men, deriving their just powers from the consent of the governed" &
", --That whenever any Form of Government becomes destructive of these " &
"ends, it is the Right of the People to alter or to abolish it, and to " &
"institute new Government, laying its foundation on such principles " &
"and organizing its powers in such form, as to them shall seem most " &
"likely to effect their Safety and Happiness. Prudence, indeed, will " &
"dictate that Governments long established should not be changed for " &
"light and transient causes; and accordingly all experience hath shewn, "&
"that mankind are more disposed to suffer, while evils are sufferable, " &
"than to right themselves by abolishing the forms to which they are " &
"accustomed. But when a long train of abuses and usurpations, pursuing " &
"invariably the same Object evinces a design to reduce them under " &
"absolute Despotism, it is their right, it is their duty, to throw off " &
"such Government, and to provide new Guards for their future security." &
"now the necessity which constrains them to alter their former Systems " &
"of Government. The history of the present King of Great Britain is a " &
"history of repeated injuries and usurpations, all having in direct " &
"object the establishment of an absolute Tyranny over these States. To " &
"prove this, let Facts be submitted to a candid world.";
Package Word_List is New Ada.Containers.Indefinite_Ordered_Maps(
Key_Type => String,
Element_Type => Positive
);
Function Create_Map( Words : String ) Return Word_List.Map;
Words : Word_List.map;
End Text_Search;
Text_Search.adb
Package Body Text_Search is
Function Create_Map( Words : String ) Return Word_List.Map is
Delimiters : Array (Character) of Boolean:=
('.' | ' ' | '-' | ',' | ';' | ASCII.HT => True, Others => False);
Index, Start, Stop : Positive := Words'First;
begin
Return Result : Word_List.Map do
Parse:
loop
Start:= Index;
-- Ignore initial delimiters.
while Delimiters(Words(Start)) loop
Start:= 1+Start;
end loop;
Stop:= Start;
while not Delimiters(Words(Stop)) loop
Stop:= 1+Stop;
end loop;
declare
-- Because we stop *on* a delimiter we mustn't include it.
Subtype R is Positive Range Start..Stop-1;
Substring : String renames Words(R);
begin
-- if it's there, increment; otherwise add it.
if Result.Contains( Substring ) then
Result(Substring):= 1 + Result(Substring);
else
Result.Include( Key => substring, New_Item => 1 );
end if;
end;
Index:= Stop + 1;
end loop parse;
exception
When Constraint_Error => null; -- we run until our index fails.
end return;
End Create_Map;
Begin
Words:= Create_Map( Words => Text );
End Text_Search;
Test.adb
Pragma Ada_2012;
Pragma Assertion_Policy( Check );
With
Text_Search,
Ada.Text_IO;
Procedure Test is
Procedure Print_Word( Item : Text_Search.Word_List.Cursor ) is
use Text_Search.Word_List;
Word : String renames Key(Item);
Word_Column : String(1..20) := (others => ' ');
begin
Word_Column(1..Word'Length+1):= Word & ':';
Ada.Text_IO.Put_Line( Word_Column & Positive'Image(Element(Item)) );
End Print_Word;
Begin
Text_Search.Words.Iterate( Print_Word'Access );
End Test;
Instead of going by individual words, you could read the file a line at a time into a string using Get_Line, and then use regular expressions (see the question "Regular Expressions in Ada?").
If you're using Ada 2012 here's how I would recommend doing it:
1. With Ada.Containers.Indefinite_Ordered_Maps.
2. Instantiate a Map with String as the key type and Positive as the element type.
3. Grab the string; I'd use either a single string or stream-processing.
4. Break the input text into words; this can be done on the fly if you use streams.
5. When you get a word (from #4), add it to your map if it doesn't exist; otherwise increment the element.
6. When you are finished, just run a For C in WORD_MAP.Iterate Loop that prints Key(C) and Element(C), i.e. the string and its count (a For Element of WORD_MAP loop only gives you the counts).
There are several ways you could handle strings in #3:
Perfectly sized via recursive function call [terminating on a non-word character or the end of input].
Unbounded_String
Vector (Positive, Character) -- append for valid characters, convert to array [string] and add to the map when an invalid-character [or the end of input] is encountered -- working variable.
Not Null Access String working variable.
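The steps above can be sketched outside Ada as well. The following is a minimal Python sketch, not the Ada solution itself; the names count_words_in_lines and count_words_in_folder, the "*.txt" pattern, and the choice of letters and apostrophes as word characters are all assumptions made for illustration. It reads each file a line at a time, breaks the line into words, and increments a per-word counter, which plays the role of the ordered map.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters or apostrophes.
# Everything else (spaces, tabs, punctuation) acts as a delimiter.
WORD = re.compile(r"[A-Za-z']+")


def count_words_in_lines(lines, counts=None):
    """Add the words found in an iterable of lines to a Counter."""
    counts = Counter() if counts is None else counts
    for line in lines:
        # Counter.update adds the word if missing, otherwise increments it.
        counts.update(WORD.findall(line))
    return counts


def count_words_in_folder(folder):
    """Count words across every *.txt file in a folder (hypothetical layout)."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            count_words_in_lines(handle, counts)
    return counts


if __name__ == "__main__":
    sample = ["We hold these truths to be self-evident,", "that all men are created equal, that they"]
    for word, count in sorted(count_words_in_lines(sample).items()):
        print(f"{word}: {count}")
```

Like the Ada version, this sketch is case-sensitive, so "Government" and "government" are counted separately.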
I need to find elements that have a space " " in their attribute values.
For example:
<unit href="http:xxxx/unit/2 ">
Here the href attribute value ends with a space.
I have done this using a FLWOR query, but I need it done using CTS functions. Please suggest.
For the FLWOR query I have tried this:
let $x := (
for $d in doc()
order by $d//id
return
for $attribute in data($d//@href)
return
if (fn:contains($attribute," ")) then
<td>{(concat( "id = " , $d//id) ,", data =", $attribute)}</td>
else ()
)
return <tr>{$x}</tr>
This is working fine.
For CTS I have tried
let $query :=
cts:element-attribute-value-query(xs:QName("methodology"),
xs:QName("href"),
xs:string(" "),
"wildcarded")
let $search := cts:search(doc(), $query)
return fn:count($search)
Your query is looking for " " to be the entirety of the value of the attribute. If you want to look for attributes that contain a space, then you need to use wildcards. However, since there is no indexing of whitespace except for exact value queries (which are by definition not wildcarded), you are not going to get a lot of index support for that query, so you'll need to run this as a filtered search (which you have in your code above) with a lot of false positives.
You may be better off creating a string range index on the attribute and doing value-match on that.
This might be a really easy one but I couldn't seem to find an answer anywhere
I'm trying to comment my code as follows
Session("test") = "JAMIE" _
'TEST INFO
& "TEST" _
'ADDRESS INFO
& "ADDRESS = TEST"
With the code above i'm getting the error
Syntax error
But when I remove the comments like so
Session("test") = "JAMIE" _
& "TEST" _
& "ADDRESS = TEST"
It works fine so my guess is that I cannot comment my code between the _ character.
Is there some way I can get around this as I'd like to comment my code ideally
The _ character is the line continuation. It means that the next line is interpreted as if it was on the same line.
So, putting a comment in the middle of the line is a syntax error.
Since you want a solution:
Either put a comment before the continued line or after it
As Tim Schmelter points out in his answer, you can construct the value before you put it into the Session object; you can do that in separate statements and comment those to your heart's content.
As Oded has mentioned, the _ character continues the line, so you cannot put a comment in between.
You could write:
Dim value = "JAMIE"
'TEST INFO
value &= "TEST"
'ADDRESS INFO
value &= "ADDRESS = TEST"
Session("test") = value
Because that may create separate strings internally just to comment them, you could use a StringBuilder here. You could show us what you're really trying to do, so that we can suggest a different approach (if you need to comment each "line" of a single variable, you should consider redesigning the way you assign the value to the variable).
System.Text.StringBuilder str = new System.Text.StringBuilder();
str.Append("JAMIE");
str.Append("TEST");//TEST INFO
str.Append("ADDRESS");//ADDRESS INFO
public string Test
{
get
{
return Convert.ToString(Session["TEST"]);
}
set
{
Session["Test"] = value;
}
}
Test = str.ToString();
I'm learning by practice. I was given an OCX file which, according to the person who gave it to me, was created using VB6, and I have the task of creating a user interface for it to test all the functionality described in a poorly written documentation file. On top of that, I am not well-versed in VBScript, but I've managed to dodge a few bullets while learning.
I have a method which returns a Collection and when I try to access it from VBScript I am only able to query the Count but when I try to do job.Item(i) or job(i) I get an error stating it doesn't have that property or method.
Can someone point me in the right direction to be able to traverse the contents of this collection?
I had to do it from JavaScript but since some things weren't that easy I decided that perhaps VBScript would help me bridge the gaps where JavaScript didn't cut it. I can access all properties from the ActiveXObject from JavaScript, but the methods which return other VB objects are a little more obscure to me. I've tried aJob.Item(iCount), aJob.Items(iCount) and aJob(iCount).
My code is:
For iCount = 1 To aJobs.Count
MsgBox("Num " & iCount)
MsgBox(aJobs.Item(iCount))
Next
Thanks.
People often create specialized and/or strongly typed collection classes in VB6. They don't always do it correctly though, and they sometimes create "partial" collection implementations that have no Item() method (or fail to mark it as the default member of the class). They might even have a similar method or property but name it something entirely different.
It is rarer to return a raw Collection object, but it can be done and if it is you shouldn't have the problems you have indicated from VBScript.
I just created a DLL project named "HallLib" with three classes: Hallway, DoorKnobs, and DoorKnob. The DoorKnobs class is a collection of DoorKnob objects. The Hallway class has a DoorKnobs object that it initializes with a random set of DoorKnob objects with randomly set properties. Hallway.DoorKnobs() returns the DoorKnobs collection object as its result.
It works fine in this script:
Option Explicit
Dim Hallway, DoorKnobs, DoorKnob
Set Hallway = CreateObject("HallLib.Hallway")
Set DoorKnobs = Hallway.DoorKnobs()
MsgBox "DoorKnobs.Count = " & CStr(DoorKnobs.Count)
For Each DoorKnob In DoorKnobs
MsgBox "DoorKnob.Material = " & CStr(DoorKnob.Material) & vbNewLine _
& "DoorKnob.Color = " & CStr(DoorKnob.Color)
Next
Update:
This script produces identical results:
Option Explicit
Dim Hallway, DoorKnobs, KnobIndex
Set Hallway = CreateObject("HallLib.Hallway")
Set DoorKnobs = Hallway.DoorKnobs()
MsgBox "DoorKnobs.Count = " & CStr(DoorKnobs.Count)
For KnobIndex = 1 To DoorKnobs.Count
With DoorKnobs.Item(KnobIndex)
MsgBox "DoorKnob.Material = " & CStr(.Material) & vbNewLine _
& "DoorKnob.Color = " & CStr(.Color)
End With
Next
As does:
Option Explicit
Dim Hallway, DoorKnobs, KnobIndex
Set Hallway = CreateObject("HallLib.Hallway")
Set DoorKnobs = Hallway.DoorKnobs()
MsgBox "DoorKnobs.Count = " & CStr(DoorKnobs.Count)
For KnobIndex = 1 To DoorKnobs.Count
With DoorKnobs(KnobIndex)
MsgBox "DoorKnob.Material = " & CStr(.Material) & vbNewLine _
& "DoorKnob.Color = " & CStr(.Color)
End With
Next
So I suspect you'll need to use some type library browser like OLEView to look at your OCX to see what classes and members it actually exposes.
So I just got my site kicked off the server today and I think this function is the culprit. Can anyone tell me what the problem is? I can't seem to figure it out:
Public Function CleanText(ByVal str As String) As String
'removes HTML tags and other characters that title tags and descriptions don't like
If Not String.IsNullOrEmpty(str) Then
'mini db of extended tags to get rid of
Dim indexChars() As String = {"<a", "<img", "<input type=""hidden"" name=""tax""", "<input type=""hidden"" name=""handling""", "<span", "<p", "<ul", "<div", "<embed", "<object", "<param"}
For i As Integer = 0 To indexChars.GetUpperBound(0) 'loop through indexchars array
Dim indexOfInput As Integer = 0
Do 'get rid of links
indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
If indexOfInput <> -1 Then
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
Dim indexRightBracket As Integer = str.IndexOf(">", indexOfInput) + 1
'check to make sure a right bracket hasn't been left off a tag
If indexNextLeftBracket > indexRightBracket Then 'normal case
str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
Else
'add the right bracket right before the next left bracket, just remove everything
'in the bad tag
str = str.Insert(indexNextLeftBracket - 1, ">")
indexRightBracket = str.IndexOf(">", indexOfInput) + 1
str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
End If
End If
Loop Until indexOfInput = -1
Next
End If
Return str
End Function
Wouldn't something like this be simpler? (OK, I know it's not identical to posted code):
public string StripHTMLTags(string text)
{
return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
(Conversion to VB.NET should be trivial!)
Note: if you are running this often, there are two performance improvements you can make to the Regex.
One is to use a pre-compiled expression which requires re-writing slightly.
The second is to use a non-capturing form of the regular expression; .NET regular expressions implement the (?:) syntax, which allows for grouping to be done without incurring the performance penalty of captured text being remembered as a backreference. Using this syntax, the above regular expression could be changed to:
@"<(?:.|\n)*?>"
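For comparison, here is the same idea as a minimal Python sketch rather than .NET code; TAG_PATTERN and strip_html_tags are names chosen for illustration. It shows both suggestions together: the expression is compiled once and reused, and the group is non-capturing.

```python
import re

# Compiled once at import time and reused on every call.
# (?:...) groups without capturing; *? is the lazy (shortest-match) quantifier.
TAG_PATTERN = re.compile(r"<(?:.|\n)*?>")


def strip_html_tags(text):
    """Remove anything that looks like <...>, including tags that span lines."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    print(strip_html_tags("<a href='x'>Test</a>"))
```

As with the C# version, this strips anything between angle brackets and makes no attempt to understand HTML.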
This line is also wrong:
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
It's guaranteed to always set indexNextLeftBracket to indexOfInput + 1, because at this point the character at the position referred to by indexOfInput is already always a '<', so IndexOf just finds that same character again. Do this instead:
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput+1) + 1
And also add a clause to the if statement to make sure your string is long enough for that expression.
Finally, as others have said, this code will be a beast to maintain, if you can get it working at all. Best to look for another solution, like a regex or even just replacing all '<' with &lt;.
In addition to the other good answers, you might read up a little on loop invariants. The pulling out and putting back stuff to the string you check to terminate your loop should set off all manner of alarm bells. :)
Just a guess, but is this line the culprit?
indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
Per the Microsoft docs, Return Value -
The index position of value if that string is found, or -1 if it is not. If value is Empty, the return value is 0.
So perhaps indexOfInput is being set to 0?
What happens if your code tries to clean the string <a?
As I read it, it finds the indexChar at position 0. indexNextLeftBracket then becomes 1 (the position of that same "<", plus one) and indexRightBracket becomes 0 (IndexOf returns -1 because there is no ">", plus one). That takes the "normal case" branch, which calls str.Remove(0, 0) and leaves you with <a unchanged. Then the code finds the <a in the string again, and you're off to the races with an infinite loop.
Even if I'm wrong, you need to get yourself some unit tests to reassure yourself that these edge cases work properly. That should also help you find the actual looping code if I'm off-base.
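One way to probe those edge cases without hanging the server is to port the loop body and cap the number of passes. What follows is a rough Python port, not the original VB.NET; the helper names remove, insert and clean_terminates and the max_passes cap are assumptions made for this sketch. The remove and insert helpers reject out-of-range arguments the way the .NET String.Remove and String.Insert methods do.

```python
def remove(s, start, count):
    # Mirrors .NET String.Remove(startIndex, count): negative or
    # out-of-range arguments are an error rather than being clamped.
    if start < 0 or count < 0 or start + count > len(s):
        raise ValueError("argument out of range")
    return s[:start] + s[start + count:]


def insert(s, index, value):
    # Mirrors .NET String.Insert(startIndex, value).
    if index < 0 or index > len(s):
        raise ValueError("argument out of range")
    return s[:index] + value + s[index:]


def clean_terminates(text, tag="<a", max_passes=1000):
    """Run the ported Do-loop for one tag; True if it ends within max_passes."""
    for _ in range(max_passes):
        start = text.find(tag)  # like IndexOf: -1 when not found
        if start == -1:
            return True
        next_left = text.find("<", start) + 1
        right = text.find(">", start) + 1
        if next_left > right:
            text = remove(text, start, right - start)
        else:
            text = insert(text, next_left - 1, ">")
            right = text.find(">", start) + 1
            text = remove(text, start, right - start)
    return False  # the string still contains the tag after max_passes


if __name__ == "__main__":
    for sample in ["plain text", "<a", "<a>Test</a>"]:
        print(repr(sample), clean_terminates(sample))
```

If the port is faithful, any input for which this returns False is one where the original loop would never exit.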
Generally speaking though, even if you fix this particular bug, it's never going to be very robust. Parsing HTML is hard, and HTML blacklists are always going to have holes. For instance, if I really want to get a <input type="hidden" name="tax" tag in, I'll just write it as <input name="tax" type="hidden" and your code will ignore it. Your better bet is to get an actual HTML parser involved, and to only allow the (very small) subset of tags that you actually want. Or even better, use some other form of markup, and strip all HTML tags (again using a real HTML parser of some description).
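To illustrate the "real HTML parser" route, here is a minimal Python sketch using the standard library's html.parser module; it is a stand-in for whatever parser is available on your platform, and TextExtractor and strip_tags are names chosen for illustration. It keeps only the text between tags, so attribute order inside a tag no longer matters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag, whatever its attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags('<p>Hello <a href="x">world</a></p>'))
```

Note that this sketch also keeps the text inside script and style elements; a real sanitizer would need to drop those explicitly.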
I'd have to run it through a real compiler but the mindpiler tells me that the str = str.Remove(indexOfInput, indexRightBracket - indexOfInput) line is re-generating an invalid tag such that when you loop through again it finds the same mistake "fixes" it, tries again, finds the mistake "fixes" it, etc.
FWIW, here's a snippet of code that removes unwanted HTML tags from a string (it's in C#, but the concept translates):
public static string RemoveTags( string html, params string[] allowList )
{
if( html == null ) return null;
Regex regex = new Regex( @"(?<Tag><(?<TagName>[a-z/]+)\S*?[^<]*?>)",
RegexOptions.Compiled |
RegexOptions.IgnoreCase |
RegexOptions.Multiline );
return regex.Replace(
html,
new MatchEvaluator(
new TagMatchEvaluator( allowList ).Replace ) );
}
MatchEvaluator class
private class TagMatchEvaluator
{
private readonly ArrayList _allowed = null;
public TagMatchEvaluator( string[] allowList )
{
_allowed = new ArrayList( allowList );
}
public string Replace( Match match )
{
if( _allowed.Contains( match.Groups[ "TagName" ].Value ) )
return match.Value;
return "";
}
}
That doesn't seem to work for a simplistic <a<a<a case, or even <a>Test</a>. Did you test this at all?
Personally, I hate string parsing like this - so I'm not going to even try figuring out where your error is. It'd require a debugger, and more headache than I'm willing to put in.
I'm making a fair amount of calls to database tables via ADO.
In the spirit of keeping things DRY, I wrote the following functions to return an array of values from a recordset.
Is this harebrained?
I use it mainly for grabbing a set of combo-box values and the like, never for enormous values. Example Usage (error handling removed for brevity):
Function getEmployeeList()
getEmployeeList= Array()
strSQL = "SELECT emp_id, emp_name from employees"
getEmployeeList = getSQLArray( strSQL, "|" )
End Function
Then I just do whatever I want with the returned array.
Function getSQLArray( SQL, delimiter )
'*************************************************************************************
' Input a SQL statement and an optional delimiter, and this function
' will return an array of strings delimited by whatever (pipe defaults)
' You can perform a Split to extract the appropriate values.
' Additionally, this function will return error messages as well; check for
' a return of error & delimiter & errNum & delimiter & errDescription
'*************************************************************************************
getSQLArray = Array()
Err.Number = 0
Set objCon = Server.CreateObject("ADODB.Connection")
objCon.Open oracleDSN
Set objRS = objCon.Execute(SQL)
if objRS.BOF = false and objRS.EOF = false then
Do While Not objRS.EOF
for fieldIndex=0 to (objRS.Fields.Count - 1)
If ( fieldIndex <> 0 ) Then
fieldValue = testEmpty(objRS.Fields.Item(fieldIndex))
recordString = recordString & delimiter & fieldValue
Else
recordString = CStr(objRS.Fields.Item(fieldIndex))
End If
Next
Call myPush( recordString, getSQLArray )
objRS.MoveNext
Loop
End If
Set objRS = Nothing
objCon.Close
Set objCon = Nothing
End Function
Sub myPush(newElement, inputArray)
Dim i
i = UBound(inputArray) + 1
ReDim Preserve inputArray(i)
inputArray(i) = newElement
End Sub
Function testEmpty( inputValue )
If (trim( inputValue ) = "") OR (IsNull( inputValue )) Then
testEmpty = ""
Else
testEmpty = inputValue
End If
End Function
The questions I'd have are:
Does it make sense to abstract all the recordset object creation/opening/error handling into its own function call like this?
Am I building a Rube Goldberg machine, where anyone maintaining this code will curse my name?
Should I just suck it up and write some macros to spit out the ADO connection code, rather than try doing it in a function?
I'm very new to ASP, so I have holes in my knowledge of its capabilities and best practices; any input would be appreciated.
There's nothing wrong with doing it your way. The ADO libraries were not really all that well designed, and using them directly takes too many lines of code, so I always have a few utility functions that make it easier to do common stuff. For example, it's very useful to make yourself an "ExecuteScalar" function that runs SQL that happens to return exactly one value, for all those SELECT COUNT(*)'s that you might do.
BUT - your myPush function is extremely inefficient. ReDim Preserve takes a LONG time because it has to reallocate memory and copy everything. This results in O(n²) performance, or what I call a Shlemiel the Painter algorithm. The recommended best practice would be to start by dimming, say, an array with room for 16 values, and double it in size whenever you fill it up. That way you won't have to call ReDim Preserve more than log₂(n) times.
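The doubling strategy can be sketched as follows. This is a Python illustration of the idea, not VBScript; GrowableArray and its reallocations counter are hypothetical names, and the counter exists only to show how rarely the expensive copy happens.

```python
class GrowableArray:
    """Fixed-size backing store that doubles when full (starts at 16 slots)."""

    def __init__(self, initial_capacity=16):
        self._slots = [None] * initial_capacity
        self._size = 0
        self.reallocations = 0  # how many times we had to copy everything

    def push(self, item):
        if self._size == len(self._slots):
            # The expensive step, equivalent to one ReDim Preserve call.
            bigger = [None] * (2 * len(self._slots))
            bigger[:self._size] = self._slots
            self._slots = bigger
            self.reallocations += 1
        self._slots[self._size] = item
        self._size += 1

    def to_list(self):
        # Trim the unused tail, like a final ReDim Preserve to the real size.
        return self._slots[:self._size]


if __name__ == "__main__":
    arr = GrowableArray()
    for i in range(1000):
        arr.push(i)
    print(arr.reallocations)
```

Growing by one element per push would copy the array once per record; doubling copies it only when the capacity runs out.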
I wonder why you are not using GetRows? It returns an array, you will find more details here: http://www.w3schools.com/ado/met_rs_getrows.asp
A few notes on GetRows:
Set objRS = Server.CreateObject ("ADODB.Recordset")
objRS.Open cmd, , adOpenForwardOnly, adLockReadOnly
If Not objRS.EOF Then
astrEmployees = objRS.GetRows()
intRecFirst = LBound(astrEmployees, 2)
intRecLast = UBound(astrEmployees, 2)
FirstField = 0
SecondField = 1
End If
'2nd field of the fourth row (record) '
Response.Write astrEmployees(SecondField, 3)
Yes, it makes sense to factor out common tasks. I don't see anything wrong with the general idea. I'm wondering why you're returning an array of strings separated by a delimiter; you might as well return an array of arrays.
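As a rough illustration of the "array of arrays" shape, here is a Python sketch that uses the standard sqlite3 module as a stand-in for ADO and Oracle; get_sql_rows, demo and the sample employees table are assumptions made for this example. The helper hides the cursor handling and returns one inner list per record, so callers index fields directly instead of splitting on a delimiter.

```python
import sqlite3


def get_sql_rows(connection, sql, parameters=()):
    """Run a query and return the result as a list of rows (list of lists)."""
    cursor = connection.execute(sql, parameters)
    try:
        return [list(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


def demo():
    # In-memory database with made-up data, purely for illustration.
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE employees (emp_id INTEGER, emp_name TEXT)")
        connection.executemany(
            "INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, None)]
        )
        return get_sql_rows(
            connection, "SELECT emp_id, emp_name FROM employees ORDER BY emp_id"
        )
    finally:
        connection.close()


if __name__ == "__main__":
    for emp_id, emp_name in demo():
        print(emp_id, emp_name)
```

A side benefit of this shape is that a field value containing the delimiter character, or a NULL, cannot corrupt the record.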