Minizinc: output for five days,there is a better flexible way?

Minizinc: output for five days,there is a better flexible way? - constraints

I have to extend the output and the solution of my project (make an exams scheduling):
-Extend the structure to five days (I have always worked on one day):
I thought about moltiply the number of days for slotstimes (5*10) and then I tune the output! Is there a better way?
Now the whole code:
include "globals.mzn";include "alldifferent.mzn";
%------------------------------Scalar_data----------------------
int: Students; % number of students
int: Exams; % number of exams
int: Rooms; % number of rooms
int: Slotstime; % number of slots
int: Days; % a period i.e. five days
int: Exam_max_duration; % the maximum length of any exam (in slots)
%------------------------------Vectors--------------------------
array[1..Rooms] of int : Rooms_capacity;
array[1..Exams] of int : Exams_duration; % the duration of written test
array[1..Slotstime, 1..Rooms] of 0..1: Unavailability;
array[1..Students,1..Exams] of 0..1: Enrollments;
Enrollments keeps track of the registrations for every student;
from this I obtain the number of students which will be at the exam,
in order to choose the right room according to the capacity
%---------------------------Decision_variables------------------
array[1..Slotstime,1..Rooms] of var 0..Exams: Timetable_exams;
array[1..Exams] of var 1..Rooms: ExamsRoom;
array[1..Exams] of var 1..Slotstime: ExamsStart;
%---------------------------Constraints--------------------------
% Calculate the number of subscribers and assign classroom
% according to time and capacity
constraint forall (e in 1..Exams,r in 1..Rooms,s in 1..Slotstime)
(if Rooms_capacity[r] <= sum([bool2int(Enrollments[st,e]>0)| st in 1..Students])
then Timetable_exams[s,r] != e
else true
endif
);
% Unavailability OK
constraint forall(c in 1..Slotstime, p in 1..Rooms)
(if Unavailability[c,p] == 1
then Timetable_exams[c,p] = 0
else true
endif
);
% Assignment exams according with rooms and slotstimes (Thanks Hakan)
constraint forall(e in 1..Exams) % for each exam
(exists(r in 1..Rooms) % find a room
( ExamsRoom[e] = r /\ % assign the room to the exam
forall(t in 0..Exams_duration[e]-1)
% assign the exam to the slotstimes and room in the timetable
(Timetable_exams[t+ExamsStart[e],r] = e)
)
)
/\ % ensure that we have the correct number of exam slots
sum(Exams_duration) = sum([bool2int(Timetable_exams[t,r]>0) | t in 1..Slotstime,
r in 1..Rooms]);
%---------------------------Solver--------------------------
solve satisfy;
% solve::int_search([Timetable_exams[s, a] | s in 1..Slotstime, a in
% 1..Rooms],first_fail,indomain_min,complete) satisfy;
And now the output, extremely heavy and full of strings.
%---------------------------Output--------------------------
output ["\n" ++ "MiniZinc paper: Exams schedule " ++ "\n" ]
++["\nDay I \n"]++
[
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| t in 1..Slotstime div Days, r in 1..Rooms
]
++["\n\nDay II \n"]++
[
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| t in 11..((Slotstime div Days)*2), r in 1..Rooms
]
++["\n\nDay III \n"]++
[
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| t in 21..((Slotstime div Days)*3), r in 1..Rooms
]
++["\n\nDay IV \n"]++
[
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| t in 31..((Slotstime div Days)*4), r in 1..Rooms
]
++["\n\nDay V \n"]++
[
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| t in 41..Slotstime, r in 1..Rooms
]
++[ "\n"]++
[
"\nExams_Room: ", show(ExamsRoom), "\n",
"Exams_Start: ", show(ExamsStart), "\n",
]
++["Participants: "]++
[
if e=Exams then " " else " " endif ++
show (sum([bool2int(Enrollments[st,e]>0)| st in 1..Students]))
|e in 1..Exams
];
I finish with data:
%Data
Slotstime=10*Days;
Students=50;
Days=5;
% Exams
Exams = 5;
Exam_max_duration=4;
Exams_duration = [4,1,2,3,2];
% Rooms
Rooms = 4;
Rooms_capacity = [20,30,40,50];
Unavailability = [|0,0,0,0 % Rooms rows % Slotstime columns
|0,0,0,0
|0,0,0,0
|0,0,0,0
|1,1,1,1
|1,1,1,1
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
% End first day
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
|1,1,1,1
|1,1,1,1
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
% End secon day
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
|1,1,1,1
|1,1,1,1
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
% End third day
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
|1,1,1,1
|1,1,1,1
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
% End fourth day
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
|1,1,1,1
|1,1,1,1
|0,0,0,0
|0,0,0,0
|0,0,0,0
|0,0,0,0
%End fifth day
|];
Enrollments= [|1,0,1,0,1 % Exams rows %Students columns
|1,0,1,0,1
|0,1,0,0,0
|1,0,0,1,0
|0,1,0,0,0
|0,0,1,1,0
|1,0,0,1,0
|0,0,0,0,1
|1,0,0,0,1
|0,0,0,0,1
|0,1,0,0,0
|0,0,0,0,0
|0,1,0,0,1
|0,0,1,0,1
|1,0,1,0,1
|1,0,1,0,1
|0,1,0,0,0
|1,0,0,1,0
|0,1,0,0,0
|0,0,1,1,0
|1,0,0,1,0
|0,0,0,0,1
|1,0,0,0,1
|0,0,0,0,1
|0,1,0,0,0
|0,0,0,0,0
|0,1,0,0,1
|0,0,1,0,1
|1,0,1,0,1
|1,0,1,0,1
|0,1,0,0,0
|1,0,0,1,0
|0,1,0,0,0
|0,0,1,1,0
|1,0,0,1,0
|0,0,0,0,1
|1,0,0,0,1
|0,0,0,0,1
|0,1,0,0,0
|0,0,0,0,0
|0,1,0,0,1
|0,0,1,0,1
|1,0,1,0,1
|1,0,1,0,1
|0,1,0,0,0
|1,0,0,1,0
|0,1,0,0,0
|0,0,1,1,0
|1,0,0,1,0
|0,0,0,0,1
|];
Thanks in advance

For the output section, the following code should work. I only changed the Day schedule, the rest is unchanged.
output ["\n" ++ "MiniZinc paper: Exams schedule " ++ "\n" ]
++
[
if t mod 10 = 1 /\ r = 1 then
"\n\nDay " ++ show(d) ++ " \n"
else "" endif ++
if r=1 then "\n" else " " endif ++
show(Timetable_exams[t,r])
| d in 1..Days, t in 1+(d-1)*10..(Slotstime div Days)*d, r in 1..Rooms,
]
++[ "\n"]++
[
"\nExams_Room: ", show(ExamsRoom), "\n",
"Exams_Start: ", show(ExamsStart), "\n",
]
++["Participants: "]++
[
if e=Exams then " " else " " endif ++
show (sum([bool2int(Enrollments[st,e]>0)| st in 1..Students]))
|e in 1..Exams
];
If it's a requirement that the days should be numbered with "I","II", etc then you can define a string array with the day names, e.g.
array[1..Days] of string: DaysStr = ["I","II","III","IV","V"];
and then use it in the output loop:
% ....
if t mod 10 = 1 /\ r = 1 then
"\n\nDay " ++ DaysStr[d] ++ " \n" % <---
else "" endif ++
% ....
Later update:
One other thing to make the model a little more general (and smaller) is to replace the huge Unavailability matrix (and the constraint using it) with this:
set of int: UnavailabilitySlots = {5,6};
% ....
constraint
forall(c in 1..Slotstime, p in 1..Rooms) (
if c mod 10 in UnavailabilitySlots then
Timetable_exams[c,p] = 0
else
true
endif
);
Yet another comment:
The original model has a flaw in that it allow exams that will be over two days, e.g. the 2 last hours of day I and the first 2 hours of day II. I think the following extra (and not so pretty) constraint will fix that. Again, the magic "10" is used.
constraint
% do not pass over a day limit
forall(e in 1..Exams) (
not(exists(t in 1..Exams_duration[e]-1) (
(ExamsStart[e]+t-1) mod 10 > (ExamsStart[e]+t) mod 10
))
)
;

Related

Generate new array from sparse array of objects in JQ

I have a JSON file that I want to process with JQ. It has an array of objects inside another object, with a key that I want to use to populate a new array.
In my real use-case this is nested in with a lot of other fluff and there lots more arrays but take this as a simpler but representative example of the kind of thing:
{
"numbers": [
{
"numeral": 1,
"ordinal": "1st",
"word": "One"
},
{
"numeral": 2,
"ordinal": "2nd",
"word": "Two"
},
{
"numeral": 5,
"ordinal": "5th",
"word": "Five"
},
{
"some-other-fluff-i-want-to-ignore": true
}
]
}
I'd like to use JQ to get a new array based on the elements, ignoring some elements and handling the missing ones. e.g.
[
"The 1st word is One",
"The 2nd word is Two",
"Wot no number 3?",
"Wot no number 4?",
"The 5th word is Five"
]
Doing this in a loop for the elements that are there is simple, terse and elegant enough:
.numbers | map( . | select( .numeral) | [ "The", .ordinal, "word is", .word ] | join (" "))
But I can't find a way to cope with the missing entries. I have some code that sort-of works:
.numbers | [
( .[] | select(.numeral == 1) | ( [ "The", .ordinal, "word is", .word ] | join (" ")) ) // "Wot no number 1?",
( .[] | select(.numeral == 2) | ( [ "The", .ordinal, "word is", .word ] | join (" ")) ) // "Wot no number 2?",
( .[] | select(.numeral == 3) | ( [ "The", .ordinal, "word is", .word ] | join (" ")) ) // "Wot no number 3?",
( .[] | select(.numeral == 4) | ( [ "The", .ordinal, "word is", .word ] | join (" ")) ) // "Wot no number 4?",
( .[] | select(.numeral == 5) | ( [ "The", .ordinal, "word is", .word ] | join (" ")) ) // "Wot no number 5?"
]
It produces usable output, after a fashion:
richard#sophia:~$ jq -f make-array.jq < numbers.json
[
"The 1st word is One",
"The 2nd word is Two",
"Wot no number 3?",
"Wot no number 4?",
"The 5th word is Five"
]
richard#sophia:~$
However, whilst it produces the output, handles the missing elements and ignores the bits I don't want, it's obviously extremely naff code that cries out for a for-loop or something similar but I can't see a way in JQ to do this. Any ideas?

jq solution:
jq 'def print(o): "The \(o.ordinal) word is \(o.word)";
.numbers | (reduce map(select(.numeral))[] as $o ({}; .["\($o.numeral)"] = $o)) as $o
| [range(0; ($o | [keys[] | tonumber] | max))
| "\(.+1)" as $i
| if ($o[$i]) then print($o[$i]) else "Wot no number \($i)?" end
]' input.json
The output:
[
"The 1st word is One",
"The 2nd word is Two",
"Wot no number 3?",
"Wot no number 4?",
"The 5th word is Five"
]

Another solution !
jq '[
range(1; ( .numbers | max_by(.numeral)|.numeral ) +1 ) as $range_do_diplay |
.numbers as $thedata | $range_do_diplay |
. as $i |
if ([$thedata[]|contains( { numeral: $i })]|any )
then
($thedata|map(select( .numeral == $i )))|.[0]| "The \(.ordinal) word is \(.word) "
else
"Wot no number \($i)?"
end
] ' numbers.json
This solution use
max_by to find the max value of numeral
range to generate a list o values
use variables to store intermediate value

problems defining time when pulling data from sqlite3

i have a problem getting the right setup for pulling data from a database.
i need to pull the last 10 minutes of data, and put it into a chart. i have 2 problems with that, i cant seem to get the arguments to work to get a specific time period (last 10 minutes), but i did manage to just grab the last 600 entries.
that works, but my chart X-axis is the wrong way, and i cant seem to find a way to invert the data in the table, after i pull it out of the database. (newest reading left and then getting older as you go further right on the chart)
the code i am using at the moment is this (yes it's ugly and borrowed from all over) but i hope you can look past it since this is my first time coding anything like this.
#!/usr/bin/env python
import sqlite3
import sys
import cgi
import cgitb
import time
global variables
speriod=(15)-1
dbname='/var/www/sensors.db'
def printHTTPheader():
print "Content-type: text/html\n\n"
def printHTMLHead(title, table):
print "<head>"
print " <title>"
print title
print " </title>"
print_graph_script(table)
print "</head>"
#current_time=time.time()
def get_data(interval):
conn=sqlite3.connect(dbname)
curs=conn.cursor()
if interval == None:
curs.execute("SELECT * FROM readings")
else:
# curs.execute("SELECT * FROM readings")
curs.execute("SELECT * FROM readings ORDER BY timestamp DESC LIMIT "+str(600))
# curs.execute("SELECT * FROM readings WHERE time>=datetime('now','%s minutes')" % interval)
rows=curs.fetchall()
conn.close()
return rows
def create_table(rows):
chart_table=""
for row in rows[:-4]:
rowstr="['{0}', {1}],\n".format(str(row[2]),str(row[5]))
chart_table+=rowstr
row=rows[-1]
rowstr="['{0}', {1}]\n".format(str(row[2]),str(row[5]))
chart_table+=rowstr
return chart_table
def print_graph_script(table):
chart_code="""
<script type="text/javascript" src="https://www.google.com/jsapi"></script>
<script type="text/javascript">
google.load("visualization", "1", {packages:["corechart"]});
google.setOnLoadCallback(drawChart);
function drawChart() {
var data = google.visualization.arrayToDataTable([
['windspeed', 'temperature'],
%s
]);
var options = {
title: 'windspeed'
};
var chart = new google.visualization.LineChart(document.getElementById('chart_div'));
chart.draw(data, options);
}
</script>"""
print chart_code % (table)
def show_graph():
print "<h2>Windspeed Chart</h2>"
print '<div id="chart_div" style="width: 900px; height: 500px;"></div>'
def show_stats(option):
conn=sqlite3.connect(dbname)
curs=conn.cursor()
if option is None:
option = str(24)
curs.execute("SELECT timestamp,max(windspeed) FROM readings WHERE time>date('now','-%s hour') AND time<=date('now')" % option)
# curs.execute("SELECT timestamp,max(windspeed) FROM readings WHERE timestamp>datetime('2013-09-19 21:30:02','-%s hour') AND timestamp<=datetime('2013-09-19 21:31:02')" % option)
rowmax=curs.fetchone()
rowstrmax="{0}&nbsp&nbsp&nbsp{1}C".format(str(rowmax[0]),str(rowmax[1]))
curs.execute("SELECT timestamp,min(windspeed) FROM readings WHERE timestamp>time('time','-%s hour') AND timestamp<=time('current_time')" % option)
# curs.execute("SELECT timestamp,min(temp) FROM temps WHERE timestamp>datetime('2013-09-19 21:30:02','-%s hour') AND timestamp<=datetime('2013-09-19 21:31:02')" % option)
rowmin=curs.fetchone()
rowstrmin="{0}&nbsp&nbsp&nbsp{1}C".format(str(rowmin[0]),str(rowmin[1]))
curs.execute("SELECT avg(windspeed) FROM readings WHERE timestamp>time('now','-%s hour') AND timestamp<=time('current_time')" % option)
# curs.execute("SELECT avg(temp) FROM temps WHERE timestamp>datetime('2013-09-19 21:30:02','-%s hour') AND timestamp<=datetime('2013-09-19 21:31:02')" % option)
rowavg=curs.fetchone()
print "<hr>"
print "<h2>Minumum temperature&nbsp</h2>"
print rowstrmin
print "<h2>Maximum temperature</h2>"
print rowstrmax
print "<h2>Average temperature</h2>"
print "%.3f" % rowavg+ "C"
print "<hr>"
print "<h2>In the last hour:</h2>"
print "<table>"
print "<tr><td><strong>Date/Time</strong></td><td>
<strong>Temperature</strong></td></tr>"
rows=curs.execute("SELECT * FROM readings WHERE timestamp>datetime('now','-1 hour') AND timestamp<=datetime('now')")
# rows=curs.execute("SELECT * FROM temps WHERE timestamp>datetime('2013-09-19 21:30:02','-1 hour') AND timestamp<=datetime('2013-09-19 21:31:02')")
for row in rows:
rowstr="<tr><td>{0}  </td><td>{1}C</td></tr>".format(str(row[0]),str(row[1]))
print rowstr
print "</table>"
print "<hr>"
conn.close()
def print_time_selector(option):
print """<form action="/cgi-bin/webgui.py" method="POST">
Show the temperature logs for
<select name="timeinterval">"""
if option is not None:
if option == "6":
print "<option value=\"6\" selected=\"selected\">the last 6 hours</option>"
else:
print "<option value=\"6\">the last 6 hours</option>"
if option == "12":
print "<option value=\"12\" selected=\"selected\">the last 12 hours</option>"
else:
print "<option value=\"12\">the last 12 hours</option>"
if option == "24":
print "<option value=\"24\" selected=\"selected\">the last 24 hours</option>"
else:
print "<option value=\"24\">the last 24 hours</option>"
else:
print """<option value="6">the last 6 hours</option>
<option value="12">the last 12 hours</option>
<option value="24" selected="selected">the last 24 hours</option>"""
print """ </select>
<input type="submit" value="Display">
</form>"""
def validate_input(option_str):
if option_str.isalnum():
if int(option_str) > 0 and int(option_str) <= 24:
return option_str
else:
return None
else:
return None
def get_option():
form=cgi.FieldStorage()
if "timeinterval" in form:
option = form["timeinterval"].value
return validate_input (option)
else:
return None
def main():
cgitb.enable()
option=get_option()
if option is None:
option = str(24)
records=get_data(option)
printHTTPheader()
if len(records) != 0:
table=create_table(records)
else:
print "No data found"
return
print "<html>"
printHTMLHead("Raspberry Pi Wind Logger", table)
print "<body>"
print "<h1>Raspberry Pi Wind Logger</h1>"
print "<hr>"
print_time_selector(option)
show_graph()
show_stats(option)
print "</body>"
print "</html>"
sys.stdout.flush()
if __name__=="__main__":
main()
the database contains 7 rows (unix time, date, time, windspeed, direction, temperature, checksum) it's data comming from an ultrasonic wind sensor, i would like to get the "options" in this working aswell like the max wind speed or the average wind speed, but i think that should become clear once i figure out how to actualy get the time sorted.
edit:
here is a reading pulled from the database
1540300313.94403|23/10/2018|21:11:53|0.1|273|24.1|5E*4B
and part of the code to store data in the database
cursor.execute('''insert into readings values (?, ?, ?, ?, ?, ?, ?)''',
(timestamp, current_date, current_time, windspeed, direction, temperature, checksum));
dbconnect.commit()

How to concatenate the values based on other field in unix

I have detail.txt file ,which contains
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1.I want to print concatenated output based on Percentage (group by Percentage) and print student name in single line separated by comma(,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For above question i got that how can achieve this for 75 or 77 or 42. But i did not get how to write a code grouping third field (Percentage).
I tried below code
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on grading system which is given below.
marks < 45=THIRD
marks>=45 and marks<60 =SECOND
marks>=60 and marks<=75 =FIRST
marks>75 =DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me to get the expected output. Thank You..

awk solution:
awk -F, 'NR>1{
if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - variable that will be assigned with "grading system" name according to one of the if (...) <exp>; else if(...) <exp> ... statements
a[k] - array a is indexed by determined "grading system" name k
a[k] = a[k]? a[k]","$2 : $2 - all "student names"(presented by the 2nd field $2) are accumulated/grouped into the needed "grading system"
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E

With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR>1 {
stud = $2
pct = $3
if ( pct <= 45 ) { band = "THIRD" }
else if ( pct <= 60 ) { band = "SECOND" }
else if ( pct <= 75 ) { band = "FIRST" }
else { band = "DIST" }
pcts[pct][stud]
bands[band][stud]
}
END {
for (pct in pcts) {
out = ""
for (stud in pcts[pct]) {
out = (out == "" ? pct "-" : out OFS) stud
}
print out
}
print "----"
for (band in bands) {
out = ""
for (stud in bands[band]) {
out = (out == "" ? band ":" : out OFS) stud
}
print out
}
}
.
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E

For your first question, the following awk one-liner should do:
awk -F, '{a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' input.txt
The second question can work almost the same way, storing your mark divisions in an array, then stepping through that array to determine the subscript for a new array:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[75]="DIST" } { for (i=0;i<=100;i++) if ((i in m) && $3 > i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So .. combining the two and breaking them out for easier reading (and commenting) we get the following:
BEGIN {
FS=","
m[0]="THIRD"
m[45]="SECOND"
m[60]="FIRST"
m[75]="DIST"
}
{
a[$3]=a[$3] (a[$3] ? "," : "") $2 # Build an array with percentage as the index
for (i=0;i<=100;i++) # Walk through the possible marks
if ((i in m) && $3 > i) mdiv=m[i] # selecting the correct divider on the way
marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2
# then build another array with divider
# as the index
}
END { # Once we've processed all the input,
for(i in a) # step through the array,
printf "%s-%s\n", i, a[i] # printing the results.
print "----"
for(i in marks) # step through the array,
printf "%s:%s\n", i, marks[i] # printing the results.
}
You may be wondering why we for (i=0;i<=100;i++) instead of simply using for (i in m). This is because awk does not guarantee the order of array elements, and when stepping through the m array, it's important that we see the keys in increasing order.

Improve AWK script - Compare and add fields

I have a CSV file separated by semicolons, the file contains a sentiment analysis on customer reviews.
The views are grouped by field 1 and 6, what I have to do is add the last fields of each line of the group and then compare the sum with field 3. If they match, the comparison equals 1 if not 0.
At the same time, I have also compare fields 6 and 7 applying the same rule above.
Finally calculate the number of ones obtained for both comparisons separately.
I have a script made below, but I think it could be improved. Any suggestions? Also I am not sure that the script is fine..!
BEGIN {
OFS=FS=";";
flag="";
counter1=0;
counter2=0;
counter3=0;
}
{
number=$1;
topic=$6;
id= number";"topic;
if (id != flag)
{
for (i in topics)
{
if ((sum < 0) && (polarity[i] == "negative") || (sum > 0) && (polarity[i] == "positive"))
{
hit_2=1;
counter2++;
}
else
{
hit_2=0;
}
s=split(topics[i],words,";")
hit_1=0;
for (k=1;k<=s;k++)
{
if ((words[k] == words[k+1]) && (words[k] != "") || (words[k] == "NULL") && (hit_2 == 1))
{
hit_1=1;
}
}
if (hit_1 == 1)
{
counter1++;
}
print to_print[i]";"hit_1";"hit_2;
}
delete topics;
delete to_print;
delete polarity;
counter3++;
sum="";
flag=id;
}
sum += $(NF-1);
topics[$1";"$6]=topics[$1";"$6] ";"$6";"$7;
to_print[$1";"$6]=$1";"$2";"$3";"$4";"$5";"$6
polarity[$1";"$6]=$3;
}
END {
print ""
print "#### - sentiments: "counter3" - topic: "counter1 " - polarity: "counter2;
}
A portion of the input data:
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;place;place;good;good;2.000000;
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;place;place;not longer;not longer;-3.000000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;food;food;lousy;lousy;-3.000000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;food;food;too sweet;too sweet;3.600000;
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;portions;portion;tiny;tiny;-1.000000;
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;visit;visit;incredible;incredible;4.000000;
Output:
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;1;1
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;1;1
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;0;1
#### - sentiments: 57 - topic: 28 - polarity: 39

I rewrote part of it, perhaps you can extend this further.
$ awk -F';' -v OFS=';' '
{key=$1 FS $3 FS $6;
sum[key]+=$(NF-1);
line[key]=$1 FS $2 FS $3 FS $4 FS $5 FS $6;
sign[key]=($3=="negative"?-1:1)
}
END{for(k in sum)
print line[k],(sum[k]*sign[k]<0?0:1),sum[k],sign[k]}' data
100429301;"RESTAURANT#GENERAL";negative;1004293;10042930;place;1;-1;-1
103269521;"FOOD#QUALITY";positive;1032695;10326952;duck breast special;1;4;1
100429331;"FOOD#QUALITY";negative;1004293;10042933;food;1;-0.4;-1
This only does your first check (I guess hit2), while also added sum and sign information (last two fields).

How to gather characters usage statistics in text file using Unix commands?

I have got a text file created using OCR software - about one megabyte in size.
Some uncommon characters appears all over document and most of them are OCR errors.
I would like find all characters used in document to easily spot errors (like UNIQ command but for characters, not for lines).
I am on Ubuntu.
What Unix command I should use to display all characters used in text file?

This should do what you're looking for:
cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c
The premise is that the sed puts each character in the file onto a line by itself, then the usual sort | uniq -c sequence strips out all but one of each unique character that occurs, and provides counts of how many times each occurred.
Also, you could append | sort -n to the end of the whole sequence to sort the output by how many times each character occurred. Example:
$ echo hello | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n
1
1 e
1 h
1 o
2 l

This will do it:
#!/usr/bin/perl -n
#
# charcounts - show how many times each code point is used
# Tom Christiansen <tchrist#perl.com>
use open ":utf8";
++$seen{ ord() } for split //;
END {
for my $cp (sort {$seen{$b} <=> $seen{$a}} keys %seen) {
printf "%04X %d\n", $cp, $seen{$cp};
}
}
Run on itself, that program produces:
$ charcounts /tmp/charcounts | head
0020 46
0065 20
0073 18
006E 15
000A 14
006F 12
0072 11
0074 10
0063 9
0070 9
If you want the literal character and/or name of the character, too, that’s easy to add.
If you want something more sophisticated, this program figures out characters by Unicode property. It may be enough for your purposes, and if not, you should be able to adapt it.
#!/usr/bin/perl
#
# unicats - show character distribution by Unicode character property
# Tom Christiansen <tchrist#perl.com>
use strict;
use warnings qw<FATAL all>;
use open ":utf8";
my %cats;
our %Prop_Table;
build_prop_table();
if (#ARGV == 0 && -t STDIN) {
warn <<"END_WARNING";
$0: reading UTF-8 character data directly from your tty
\tSo please type stuff...
\t and then hit your tty's EOF sequence when done.
END_WARNING
}
while (<>) {
for (split(//)) {
$cats{Total}++;
if (/\p{ASCII}/) { $cats{ASCII}++ }
else { $cats{Unicode}++ }
my $gcat = get_general_category($_);
$cats{$gcat}++;
my $subcat = get_general_subcategory($_);
$cats{$subcat}++;
}
}
my $width = length $cats{Total};
my $mask = "%*d %s\n";
for my $cat(qw< Total ASCII Unicode >) {
printf $mask, $width => $cats{$cat} || 0, $cat;
}
print "\n";
my #catnames = qw[
L Lu Ll Lt Lm Lo
N Nd Nl No
S Sm Sc Sk So
P Pc Pd Ps Pe Pi Pf Po
M Mn Mc Me
Z Zs Zl Zp
C Cc Cf Cs Co Cn
];
#for my $cat (sort keys %cats) {
for my $cat (#catnames) {
next if length($cat) > 2;
next unless $cats{$cat};
my $prop = length($cat) == 1
? ( " " . q<\p> . $cat )
: ( q<\p> . "{$cat}" . "\t" )
;
my $desc = sprintf("%-6s %s", $prop, $Prop_Table{$cat});
printf $mask, $width => $cats{$cat}, $desc;
}
exit;
sub get_general_category {
my $_ = shift();
return "L" if /\pL/;
return "S" if /\pS/;
return "P" if /\pP/;
return "N" if /\pN/;
return "C" if /\pC/;
return "M" if /\pM/;
return "Z" if /\pZ/;
die "not reached one: $_";
}
sub get_general_subcategory {
my $_ = shift();
return "Lu" if /\p{Lu}/;
return "Ll" if /\p{Ll}/;
return "Lt" if /\p{Lt}/;
return "Lm" if /\p{Lm}/;
return "Lo" if /\p{Lo}/;
return "Mn" if /\p{Mn}/;
return "Mc" if /\p{Mc}/;
return "Me" if /\p{Me}/;
return "Nd" if /\p{Nd}/;
return "Nl" if /\p{Nl}/;
return "No" if /\p{No}/;
return "Pc" if /\p{Pc}/;
return "Pd" if /\p{Pd}/;
return "Ps" if /\p{Ps}/;
return "Pe" if /\p{Pe}/;
return "Pi" if /\p{Pi}/;
return "Pf" if /\p{Pf}/;
return "Po" if /\p{Po}/;
return "Sm" if /\p{Sm}/;
return "Sc" if /\p{Sc}/;
return "Sk" if /\p{Sk}/;
return "So" if /\p{So}/;
return "Zs" if /\p{Zs}/;
return "Zl" if /\p{Zl}/;
return "Zp" if /\p{Zp}/;
return "Cc" if /\p{Cc}/;
return "Cf" if /\p{Cf}/;
return "Cs" if /\p{Cs}/;
return "Co" if /\p{Co}/;
return "Cn" if /\p{Cn}/;
die "not reached two: <$_> " . sprintf("U+%vX", $_);
}
sub build_prop_table {
for my $line (<<"End_of_Property_List" =~ m{ \S .* \S }gx) {
L Letter
Lu Uppercase_Letter
Ll Lowercase_Letter
Lt Titlecase_Letter
Lm Modifier_Letter
Lo Other_Letter
M Mark (combining characters, including diacritics)
Mn Nonspacing_Mark
Mc Spacing_Mark
Me Enclosing_Mark
N Number
Nd Decimal_Number (also Digit)
Nl Letter_Number
No Other_Number
P Punctuation
Pc Connector_Punctuation
Pd Dash_Punctuation
Ps Open_Punctuation
Pe Close_Punctuation
Pi Initial_Punctuation (may behave like Ps or Pe depending on usage)
Pf Final_Punctuation (may behave like Ps or Pe depending on usage)
Po Other_Punctuation
S Symbol
Sm Math_Symbol
Sc Currency_Symbol
Sk Modifier_Symbol
So Other_Symbol
Z Separator
Zs Space_Separator
Zl Line_Separator
Zp Paragraph_Separator
C Other (means not L/N/P/S/Z)
Cc Control (also Cntrl)
Cf Format
Cs Surrogate (not usable)
Co Private_Use
Cn Unassigned
End_of_Property_List
my($short_prop, $long_prop) = $line =~ m{
\b
( \p{Lu} \p{Ll} ? )
\s +
( \p{Lu} [\p{L&}_] + )
\b
}x;
$Prop_Table{$short_prop} = $long_prop;
}
}
For example:
$ unicats book.txt
2357232 Total
2357199 ASCII
33 Unicode
1604949 \pL Letter
74455 \p{Lu} Uppercase_Letter
1530485 \p{Ll} Lowercase_Letter
9 \p{Lo} Other_Letter
10676 \pN Number
10676 \p{Nd} Decimal_Number
19679 \pS Symbol
10705 \p{Sm} Math_Symbol
8365 \p{Sc} Currency_Symbol
603 \p{Sk} Modifier_Symbol
6 \p{So} Other_Symbol
111899 \pP Punctuation
2996 \p{Pc} Connector_Punctuation
6145 \p{Pd} Dash_Punctuation
11392 \p{Ps} Open_Punctuation
11371 \p{Pe} Close_Punctuation
79995 \p{Po} Other_Punctuation
548529 \pZ Separator
548529 \p{Zs} Space_Separator
61500 \pC Other
61500 \p{Cc} Control

As far as using *nix commands, the answer above is good, but it doesn't get usage stats.
However, if you actually want stats (like the rarest used, median, most used, etc) on the file, this Python should do it.
def get_char_counts(fname):
f = open(fname)
usage = {}
for c in f.read():
if c not in usage:
usage.update({c:1})
else:
usage[c] += 1
return usage