Sunday, September 30, 2012

Parsing or Word-Rip Alice-in-Wonderland, A programming methodology

Data need to be parsed very often into a form that is understandable by a software. Data parsing is most familiar with the computer science students, usually, data need to change form among different system. The sample operation below describes the methodology of word ripping the famed novel, Alice in Wonder land to 1. count the number of words and 2. determine word frequency 3. sort the word frequency.

Critical Analyzing the sample problem, and possibly break down this word rip operation into several sub steps. This is the most classic divide an conquer approach. First would be to determine the weapon of choice from the arsenal. Be it C++, C#, Java, Python, Haskell and etc. In this example where there is no preference for the language needed, hence the tie breaker would be the familiarity with the API. Using API definitely reduce the time to market (TTM), then custom writing another whole new library. In my example, Java is preferred.

Next, the "coding" part. Never-Ever-Jump-Straight-In! Give some thought on the process of word rip. Data need to be acquired, data need to be parsed, data need to be stored (in memory , data need to be manipulated and lastly data need to be output to a file physically. AH.... all look so familiar isn't it! At this point of time, locate the possible use of the API and make a mental (or physically writing it down) note about it. This is also the time where UML case and also class diagram can be drawn.

 Assuming the novel could be put into the same directory of the *.class file, this would same some time in programming, i.e, no need to import* and thus skipping the myriad of networking related troubleshooting for the code to access to the Internet. Reading in the data file, this should be familiar. With the class. The logical next step will be thinking on how to count the number of words (without the punctuation marks), what variables need to be declared and what type of control structure need to be used? Hint* while loop till the end of file, increase the counter for each word found. Compile and Print out. Troubleshoot as necessary.

Congratulations for coming this far. It means that you have completed the first iteration of counting words using next() [which ignoring punctuation marks]. Now, come the interesting bit. How to 1. remove the punctuation mars, the words 3.count the frequency of the words. Few assumptions need to be made. i.e Word and w0rd; text-ing and texting is 2 unique an etc. item 1 can be dealt separately first, but the choice made for 2. will directly impact the way the code written for 3.

REGEX and greedy comparison. I am assuming you guys are familiar with Java String class API and the replaceAll() method call. Since the need is to rip out the words (alphabets and numbers), minus the punctuation marks. I started with the most classic one replaceAll("[^A-Za-z]", "");  on close examining of the processed outputs, there are a few missed bits and blobs of punctuation mark remained. From the output, it will be naked to the untrained eye (on what need to be spotted). Do look out for patterns, such as repeating type of punctuation marks, syntax (format) of the words that fall in-between the data parser and then back to the programming<-run->debug cycle. Eventually, there might be a conclusion. Different train of thoughts will definitely leads to a different code. Some might be tempted to bail out; Some might look for other possible API to use, and etc. </-run->

Choice of 2. will impact 3. For an example, if array was used, and since array can only store 1 item in each allocated memory block. Logically it would lead to several outcome, such as OOP style i.e creating a class with 2 variables. creating 2 arrays, one for storing words and one for storing frequency. using more advanced data structures. The algorithm is pretty much straight forward. check whether is the first occurance. if not, increase the word-count, else insert the word and with word-count =1.

The efficiency of the code will be hampered by the type of data structure chosen. For an example, let say one array is used to store words, the code need to loop through the array to find a similar word. That is O(n) and the worst case scenario will be looping through the entire array. Since the words come in a random, and the array content not sorted, just to make the matter worst. Let say there is a match for the same word, the count need to be increased. Then it will be another loop through the word-count array, that is another O(n). Most likely this piece of code is implemented in an O(n^2) fashion. Is there a better way that can achieve O(1)?? [don't peek at my code yet, research about data structures, particularly the map]. Why I know to use a HMAP, it boils down to 2 things. I read about it, and I used it before. The worst thing you guys might be facing is "Ï heard it before (from some random conversation)", then it will be just another round of wild goose chase.

As for the sorting by frequency of words, again the choice of data structure made earlier affects the options available later. Hint* use Java.Collections

The first 3 SP students that respond with your OWN thoughts that were translated into code shall get a post card signed by me and snail mail from London, UK. POC + FCFS only.

code, enjoy!
just incase the embed fail...

No comments: