Programmer Of The Month (POTM) contest ... THE SUMMER EDITION .... Deadline for this one is midnight on Tuesday August 31, 1993. How hard is it to look up a word in a dictionary when you don't know how to spell the word? How hard is it to find the "correct" number in a list when all you have is a number where you think that "most" of the digits are OK? This latter question is the subject of the Summer POTM: MR. NOISE! ============================================================================ Here is the model ... ------------- TRUE NUMBERS -> | Mr. Noise | -> CORRUPTED NUMBERS -> ---------------- /\ ------------- | | /||\ | Your Program | || | | LIST OF GOOD NUMBERS -------------------------------> ---------------- || \||/ I'll tell you all about what Mr. Noise does to the \/ numbers, and I'll score you on how well YOUR GUESSES YOUR GUESSES match the TRUE NUMBERS I start with. In fact, the actual Mr.Noise shell program is provided below! ============================================================================ If you deduce a TRUE NUMBER you get a reward (10). If you give up, you get a smaller reward (3). If you guess WRONG, you get nothing! (For example, the TRUE number is 5551212. Mr. Noise corrupted it to 55612192. You have to determine that the intended number in the good number list was really 5551212.) THE GORY DETAILS FOLLOW ... READ ON IF YOU WANT TO TACKLE THE CHALLENGE! IF NOT .... SEE YOU NEXT TIME!!! (Please join us ... the solution to this one may even have AT&T related application use!) =Fred (homxb!fah) (Send mail if you want to be taken OFF the mailing list!) ============================= THE DETAILS ============================= YOUR PROGRAM: 1) Assuming your program is a.out, then it should take two arguments: a.out /tmp/good_nums /tmp/test_list The first arg (/tmp/good_nums for example) is the name of a file containing valid numbers. /tmp/good_nums contains one seven digit number per line, with no white space. This file will not contain numbers beginning with "0". The file will be numerically sorted and will contain no repetitions. (A shell/awk program is provided to generate a sample list of valid random numbers.) The second arg (/tmp/test_list for example) is the name of a file containing digit strings to be tested. This input file contains one number per line, between 1 and 14 digits each, with no white space. There MAY be leading zeroes in these numbers. There MAY be repeats and it need not be sorted. (I will generate this file by corrupting a subset of the valid numbers with the MR.NOISE shell using some fixed set of probabilities.) 2) Your program should generate, to standard output, exactly one output number for each input number in /tmp/test_list (the second arg.). Line N of the output should contain the results of your test of line N of the input - and should be either a valid number from /tmp/good_nums or "0000000" to signify that you chose not to make a guess on this input line. (If your output is NOT in the /tmp/good_nums list, it will be treated as a bad guess.) 3) Name your program something clever ... the last set of program names were particularly boring. THE DIGITS: 1) All "digits" are zero through 9, provided via [0123456789] 2) The "valid" numbers in the good_nums file will be seven digits long. They may begin with any digit except 0 (zero). They thus range from 1000000 to 9999999 numerically. 3) I will corrupt the seven digit number and will return a digit string using the digits [0-9]. This digit string will be between 1 and 14 digits long (no white space). See the section on CORRUPTION below. CORRUPTION ... (I am Mr. Noise ... I will wreak havoc ...): 1) I will corrupt a seven digit number by examining each digit in turn, and applying one or more of three different corruptions to that digit based on probabilities generated from random numbers. 2) With probability P(change) I will change the single digit to some other digit. (e.g. 9 -> 4) 3) With probability P(remove) I will delete the digit altogether. 4) With probability P(add) I will add a random digit to the string following this digit. (e.g. 3 -> 37 ) 5) With probability P(surprise) I will generate a completely random seven digit number (on a separate line) in addition to corrupting the input number. 6) An "awk" program is provided ... this is what I will use to corrupt the seven digit number. (Note that if your version of awk does not support the "rand" function, try "nawk" or "xawk".) 7) The probabilities for each digit are fixed for each test run and follow: 0. <= P(change) <= .20 0. <= P(remove) <= .20 0. <= P(add) <= .20 (given a digit was not removed) 0. <= P(surprise) <= .10 (applied once per input number) The probabilities will be applied to each digit of the TRUE number as demonstrated in the awk program. 8) As shown by the awk program, I may either add or delete a digit - but not both. If I change a digit, I may then add another or delete that digit. Independent of what I changed, added, or deleted - I may add at most one "surprise" number for each input number. THE SCORING: The scoring is based on the numbers you return on standard output: 1) If the number is 0000000 it will be assumed that you decided not to correct the number - you will be awarded 3 points regardless of what the "true" input was. (Exception: see #4 below) 2) If you return the CORRECT uncorrupted 7 digit number, you will be awarded 10 points. CORRECT means that you returned the number that I started with before I applied the corruption - I will determine which number I started with based on the list I gave to the "Mr.Noise" awk script. (It is possible that you return a "closer" entry than the one I started with ... no matter, it is MY number that earns you the points - the one that would be printed if DEBUG=1 in the awk script.) 3) If you return an INCORRECT 7 digit number, you get no points. 4) When "Mr.Noise" adds a random number (with P(surprise)), I will assume that "0000000" is a CORRECT guess (worth 10 points). If "Mr.Noise" adds a random number that happens to be on the good_nums list, I will remove the number from the list before running the tests. 5) Thus, for 1,000 input numbers, the worst score would be zero. The best score would be 10,000. If your program does absolutely nothing except output 0000000 every time, your score will be at least 3,000. ============================= THE JUDGING ============================= 1) For system test, I will generate a valid good_nums list of length 5,000 and five lists of 1000 test numbers each using the awk scripts included below. The five tests will each have different probabilities used. Your score will be the number of points earned when I compare your five outputs with the five perfect inputs I use to drive the awk script. As always, you will be mailed your system test results (as well as those of other players!) (Note that with DEBUG=1, the awk script will generate the CORRECT numbers as well as the corrupted numbers.) 2) For the finals, I will generate 10 such lists - each of which will contain between 100 and 5,000 test entries. These will be played against valid number files containing between 500 and 10,000 entries each. 3) Final score will be the sum of the scores on the ten runs. 4) In the (unlikely??) event of a tie, the quickest (sys+user) program will win. Unlike the CAT->DOG POTM, I do NOT expect to have to resort to this tie-breaker for MR.NOISE. 5) Any arbitrary decisions made by me along the way will be valid. ============================= THE ENVIRONMENT ========================= I run on an Amdahl 5870 ... homxb by name. You supply me with source code. (I can handle K&R C code (no ANSI), C++, or anything supported by exptools. I'd be willing to try anything you think might work ... so try!) I will "make" the code when I receive it and get back to you immediately (?) with your system test results. In the past, it has proven wise to submit your entry EARLY so that we can work through any portability issues via email. I will distribute weekly or bi-weekly leader boards based on the system tests ONLY to those folks who have sent me an entry to that date. a) No include files other than "stdio.h" in C. For C++ add "stdlib.h" and "string.h". Others by special request only. b) No non-standard libraries are permitted. c) sys+user time on the five system tests must total less than 300 seconds in order to qualify for the finals. d) Any temp files must be created in /tmp and removed when done. e) No assembler code (too much advantage for local Amdahlers). f) Time is measured with timex, sys+user, for better or worse. g) Standard compile line is "cc -O" for C programs, others as provided. h) Program must end normally - no core dumps! i) Your submission may be distributed freely and used by anyone for anything j) this list will grow for future POTMs ... ============================= SUBMITTING AN ENTRY ==================== Use "uuto filename homxb!fah" to send me an entry ... please accompany it with some mail explaining how to compile the entry. If you prefer, you may send your entry via email to homxb!fah or attmail!hicinbothem. I will not examine your source since I will be competing, but I will ship you lint output, or compile errors, or stuff like that if things don't work out the first time! You may submit multiple entries, but ONLY YOUR LAST ENTRY WILL BE SAVED. I must have a single email address and name to associate with each entry, even if you choose to work in teams - all results will go to this individual. Questions submitted via email will be answered. All entries must be received by the closing date/time - as measured in NJ! ============================= TWO USEFUL SHELL PROGRAMS ============== The following are two programs which should prove helpful in testing your programs. They are the programs I will use for testing them! ############################# CUT HERE ################################## # # This program is Mr.Noise. It is an shell/awk program that uses the built # in function "rand". If the version of awk on your system does not # support "rand" ... try xawk (exptools) or nawk. If this still does # not work ... write a C program and accept my apologies! # # This program expects a standard input containing seven digit numbers, # one per line. It will corrupt these numbers and # produce a corresponding string of corrupted digits on stdout. # Every now and then it will add a completely random number. # # The loop at the front contains a SEED that will be changed by ME when # I run the tests ... so don't try and predict the errors precisely! # Note that SEED should be pretty big to avoid generating numbers that # might also be in the dictionary (the other shell program). # During testing I will also modify the four probabilities (see rules). # # Setting DEBUG=1 will cause the program to print out TWO numbers per # line - the corrupted number followed by the "Correct" number sed "s/./& /g" | # put spaces in the numbers for awk xawk ' BEGIN { SEED=15345 ; DEBUG=0 ; for(i=1;i999999 && x<=9999999) printf("%07d\n",x) ; } } ' - | sort -n -u #########################################################################