Standardize text using phonetic matching - ASP.NET

I received data from a data center and I have to cleanse it and make it useful. My biggest problem is one column, let's call it "service_description". For example, suppose the data center belongs to a hair salon: this column is filled in manually (text box) and contains a huge amount of data (billions). Here is a small sample:
service description
washed the haair
hair washed and dried
used shampoo on har
nails manicure
nail paint
nail pant
paint the nails
What I need to do is group the lines by category by running a script that analyzes each line and assigns it a specific category. For example, hair could be the category for the first three lines because it appears in all of them, while nail is the category for the rest, taking into consideration that the category word could be misspelled.
Results:
service description      | possible categories
washed the haair         | hair
hair washed and dried    | hair
used shampoo on har      | hair
nails manicure           | nail
nail paint               | nail
nail pant                | nail
paint the nails          | nail

I'm assuming your categories are a fixed lookup.
I would split the string on whitespace and, for each word, go through all the items in your category lookup and pick the one with the minimum Levenshtein distance.
Some references:
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
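The approach described in this answer can be sketched as follows. This is a minimal illustration in Python rather than ASP.NET; the function names, the `max_distance` cutoff, and the two-entry category list are assumptions made for the example, not part of the original answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def categorize(description, categories, max_distance=2):
    """Return the category closest to any word of the description.

    max_distance is a hypothetical cutoff: if no word is within that many
    edits of a category, None is returned instead of a forced guess.
    """
    best_category, best_distance = None, max_distance + 1
    for word in description.lower().split():
        for category in categories:
            distance = levenshtein(word, category)
            if distance < best_distance:
                best_category, best_distance = category, distance
    return best_category


categories = ["hair", "nail"]  # assumed fixed lookup
for line in ["washed the haair", "used shampoo on har", "nail pant"]:
    print(line, "->", categorize(line, categories))
```

Comparing every word against every category is quadratic in the worst case, so for billions of rows it may be worth caching results per distinct word, since the vocabulary is likely far smaller than the row count.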

Related

How to write “dependent” understanding-when rules?

I’m in a situation like this:
"Cream Corner" by Lynn
An ice cream cone is a kind of edible thing.
An ice cream cone has a number called scoop count.
Rule for printing the name of an ice cream cone:
	say "[the scoop count in words]-scoop ice cream cone".
The Cream Corner is a room.
The player holds an ice cream cone with scoop count 3.
Now I want > eat three-scoop to work. I can do this:
Understand "one-scoop" as an ice cream cone
when the scoop count of the item described is 1.
Understand "two-scoop" as an ice cream cone
when the scoop count of the item described is 2.
Understand "three-scoop" as an ice cream cone
when the scoop count of the item described is 3.
[ ... ]
But of course, ideally, I’d like to write a rule like this:
Understand "[number]-scoop" as an ice cream cone
when the scoop count of the item described is the number understood.
However, the Inform documentation specifies that this is impossible:
So we cannot safely say "when the noun is the fir cone", for instance, or refer to things like "the number understood". (We aren't done understanding yet.) If we want more sophisticated handling of such cases, we need to write checking rules and so on in the usual way.
All the same, it isn’t clear to me how to replace such a rule with a checking rule “in the usual way”. How do I use checking rules to make [number]-scoop in the player’s command be interpreted as “an ice cream cone with that many scoops”?
I'm not quite sure what the documentation is implying we do there, but I can help solve the base problem. A partial solution is these two lines:
Understand the scoop count property as referring to an ice cream cone.
Understand "scoop" as an ice cream cone.
This allows the player to type things like three scoop cone or three ice cream cone, and have the three-scoop cone be understood.
This is only a partial solution, as we don't include the dash*, so something like take three-scoop cone wouldn't be understood correctly. The obvious way of solving this problem is by replacing the dash entirely before the command is read, something like this:
[Nonfunctional solution!]
After reading a command:
	if the player's command includes "-scoop":
		replace the matched text with " scoop";
This doesn't seem to work, however, as the above rule matches only whole words, i.e., three -scoop but not three-scoop. Attempting to do the same by replacing only - fails similarly.
*You can also argue that three ice cream cone shouldn't be a valid match.

Feature engineering of X,Y coordinates in neighborhoods of San Francisco

I am participating in a starter Kaggle competition (Crimes in San Francisco) in which I want to predict the category of a crime using a bunch of predictor variables, including the X and Y coordinates of the crime. As I doubt the predictive power of the coordinates, I want to transform these variables into something more relevant to the crime category.
So I am thinking that if I had the neighbourhood of San Francisco in which the crime took place, it would be more informative than the actual coordinates of the crime. I can find the neighbourhoods online, but of course I can't use the borders of each neighbourhood to classify the corresponding crime, because their shapes are not rectangular or anything like that.
Does anyone have any idea about how I could solve this one?
Thanks guys
Well, that's interesting, AntoniosK, and it's getting close to what I want to accomplish. The problem is that the information "south-east and 2 km from the city center" can correspond to more than one neighborhood.
I still think that the partition of the city into neighborhoods is valuable, because the socio-economic and structural differences between them (there is a reason why the neighborhoods of each city are separated as such, right?) can lead to a higher probability for one crime category and a lower one for another.
That said, your idea made me think of using the south-east etc. mapping and then using the angle between the segment (point to city center) and the x axis to map the point to the appropriate neighborhood. I am on it right now. Thanks.
After some time on the problem I found that the procedure I want to perform is called "reverse geocoding". It also turns out that there are some APIs that solve this. The best one, in my opinion, is the revgeocode() function in the ggmap package (Google's edition). It has a daily query limit (2500 queries), though, unless you pay for extra.
The one that I turned to, though, is the geonames package and its GNneighbourhood function, which turns coordinates into neighbourhoods. It is free, though I have experienced some errors (keep in mind that it only covers US and Canadian cities).
revgeocode function - ggmap package
GNneighbourhood function - geonames package
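As an offline alternative to the reverse-geocoding services mentioned here, irregular borders can still be used directly if each neighbourhood is available as a list of boundary vertices: a point-in-polygon test does not require rectangular shapes. The sketch below uses the standard ray-casting test in Python. The polygons are made-up toy shapes, not real San Francisco boundaries, and the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


def neighbourhood_of(x, y, neighbourhoods):
    """Return the name of the first polygon containing the point, or None."""
    for name, polygon in neighbourhoods.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Toy, non-rectangular shapes used only for illustration.
neighbourhoods = {
    "A": [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)],  # L-shaped
    "B": [(4, 0), (8, 0), (6, 4)],                           # triangle
}
print(neighbourhood_of(1, 3, neighbourhoods))
print(neighbourhood_of(6, 1, neighbourhoods))
```

Points lying exactly on a border are ambiguous with this test, and for a large dataset a spatial index or a dedicated geometry library would likely be faster than looping over every polygon.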

How to identify a roadway item on a map?

I am using the Traffic API to collect the speeds of streets. In the response, I get a list of roadways (RW), and each one has a list of flow items (FI). But reading the description (DE) of the roadways, I noticed that there are several roadway items for the same road. Also, each FI of an RW has a description containing the name of another road (not necessarily connected to the current RW).
Example:
<RW LI="B14+05453" DE="Av Cristiano Machado" PBT="2016-04-20T22:08:55Z" mid="8a961c28-7d3c-481f-a6c4-bd1ca1b699f0|">
<FIS>
<FI>
<TMC PC="5454" DE="R Nossa Senhora De Fátima" QD="-" LE="0.04553"/>
<CF CN="0.78" FF="39.0" JF="0.0" SP="44.45" SU="44.45" TY="TR"/>
</FI>
<FI>
<TMC PC="5455" DE="Av Presidente Antônio Carlos" QD="-" LE="0.1193"/>
<CF CN="0.76" FF="39.0" JF="0.0" SP="42.03" SU="42.03" TY="TR"/>
</FI>
...
</FIS>
</RW>
This is a small portion of a response. There is an RW called "Av Cristiano Machado" with two FIs listed. The second FI, "Av Presidente Antônio Carlos", is another road, but in reality it doesn't intersect "Av Cristiano Machado".
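To make the RW/FI nesting concrete, here is a minimal Python sketch that walks the fragment above with the standard library's XML parser. The sample string is the fragment from the question with the elided "..." line left out and a well-formed closing tag; the function name and the returned dictionary layout are assumptions made for illustration. The source does not document the meaning or units of attributes such as LE and SP, so they are only converted to numbers, not interpreted.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, without the elided "..." line.
SAMPLE = """<RW LI="B14+05453" DE="Av Cristiano Machado" PBT="2016-04-20T22:08:55Z" mid="8a961c28-7d3c-481f-a6c4-bd1ca1b699f0|">
<FIS>
<FI>
<TMC PC="5454" DE="R Nossa Senhora De Fátima" QD="-" LE="0.04553"/>
<CF CN="0.78" FF="39.0" JF="0.0" SP="44.45" SU="44.45" TY="TR"/>
</FI>
<FI>
<TMC PC="5455" DE="Av Presidente Antônio Carlos" QD="-" LE="0.1193"/>
<CF CN="0.76" FF="39.0" JF="0.0" SP="42.03" SU="42.03" TY="TR"/>
</FI>
</FIS>
</RW>"""


def parse_roadway(xml_text):
    """Collect the roadway identifier and the raw attributes of each flow item."""
    rw = ET.fromstring(xml_text)
    items = []
    for fi in rw.iter("FI"):
        tmc = fi.find("TMC")
        cf = fi.find("CF")
        items.append({
            "PC": tmc.get("PC"),
            "DE": tmc.get("DE"),
            "LE": float(tmc.get("LE")),  # units not stated in the source
            "SP": float(cf.get("SP")),   # units not stated in the source
        })
    return {"LI": rw.get("LI"), "DE": rw.get("DE"), "items": items}


roadway = parse_roadway(SAMPLE)
for item in roadway["items"]:
    print(roadway["LI"], item["PC"], item["DE"], item["SP"])
```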
My question is:
Based on the identifier of the RW (I think it's LI), how can I identify on a map which part of the road it represents? Is it possible to convert an RW item into a polyline, or maybe get its coordinates?
Oh well, after a thorough reading I learned about responseattributes. Adding "&responseattributes=sh" to the request, it returns the shape of each flow item, which is a group of georeferenced points. You can plot and connect them through a line and I assume this is the section of the road it represents.
This figure is a result I obtained using Google MyMaps, LI="B14+05656" and LI="B14-05656". On the left hand (green) the coordinates seem really precise, not so much on the right hand of the image (red). The blue line is a valid driving route.

Arranging elements with a variable size on a fixed canvas

Input:
I have created and filled an array/table in velocity. This array currently contains 3 things:
(note that this example is pure fiction)
Top level community name (e.g. Stack Overflow USA, Stack Overflow BEL)
Subcommunity (e.g. stackOverflow.com/r/USA/CSS and stackOverflow.com/r/BEL/JSON)
Owner(s) of the subcommunity (e.g. Frank)
Depending on the situation and the point in time, the number of top-level communities varies. Each top-level community has a variable number of subcommunities, and each subcommunity is owned by a variable number of owners. Each top-level community can also have one or more owners of its own.
(input) table example:
SO USA, , Phil
SO USA, CSS, Frank
SO USA, JSON, Marc
SO BEL, CSS, Marieke
SO BEL, CSS, Francis
SO BEL, JSON, Patrick
SO FRA, , Francois
output:
I now want to position these communities and sub communities graphically on a webpage like this
Depending on the number of subcommunities, each top-level community will have a different size, and should therefore be placed where it fits best on the page (e.g. on a 600px × 800px canvas).
Here are my questions:
does somebody know code that has already been written to solve this kind of problem?
if not, how would I best tackle this?

Expressing order or disorder mathematically

I work for a game development company that makes casual games. One of the main casual genres is match-3: there is a field with chips of different colors. The player moves chips so that they form lines of at least three chips of the same color. If a move creates such a line, the chips in the line disappear.
Chips can be arranged on the field in different ways: there may be a lot of chips of the same color grouped in one place, or there may be a situation where the player can't make a move because all the neighbouring chips are of different colors.
So, I want to express the situation on the field mathematically with a factor of order (disorder). If the factor is high, the player can make a lot of matches and the lines made by the player are long. If the factor is low, the field is in complete disorder and the player can't make a single match. This may be helpful for generating fields of different difficulty.
The question is: what branch of math can help me do this? Where should I start my research? Any suggestions for keywords to google?
Thanks in advance.
Entropy.
I would look into graph theory. You can for example make a graph, where nodes would be positions on the board, and two nodes would be connected with an edge if they are neighbours and have a chip of the same color. If you have large components with nodes of large degree, you have less disorder. If all your components are small, you have high disorder.
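The graph idea in this answer can be sketched in a few lines of Python: treat each cell as a node, connect neighbouring cells of the same color, and look at the sizes of the connected components. The `order_factor` score below is a hypothetical normalization invented for this example (0 when every chip is isolated, 1 when the whole board is one component); the answer itself does not define a specific formula, and node degree is not used here.

```python
from collections import deque


def same_color_components(board):
    """Return the sizes of connected groups of same-colored neighbouring chips.

    board is a list of equal-length strings, one character per chip color.
    Neighbours are the four orthogonally adjacent cells.
    """
    rows, cols = len(board), len(board[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            color = board[r][c]
            seen[r][c] = True
            queue = deque([(r, c)])
            size = 0
            while queue:  # breadth-first search over one component
                cr, cc = queue.popleft()
                size += 1
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and board[nr][nc] == color):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            sizes.append(size)
    return sizes


def order_factor(board):
    """Hypothetical score in [0, 1]: fewer, larger components mean more order."""
    sizes = same_color_components(board)
    cells = sum(sizes)
    return (cells - len(sizes)) / (cells - 1) if cells > 1 else 0.0


print(order_factor(["RGR", "GRG", "RGR"]))  # checkerboard: every chip isolated
print(order_factor(["RRG", "RGG", "BBB"]))  # three groups of three chips
```

A score like this ignores whether a group can actually be turned into a line of three by a legal move, so it is only a rough proxy for difficulty.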
The first thing that comes to mind is that you're looking at the distribution of n populations (one for each color), which I would approach with Poisson sampling. You can use that to calculate the probability of finding two adjacent units of the same population (color), which will give you a measure of the difficulty of your puzzle.
