Where <tt>sp=1</tt> adds the space to the character set (which is otherwise a-z) and also places a space before and after the string, <tt>minfq</tt> sets the minimum global frequency of occurrence an n-gram must reach to be included in the output, and <tt>diag=1</tt> produces an additional frequency-of-occurrence diagnostic file.
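The effect of these options can be illustrated with a minimal Python sketch. It is not the code of SurnameFeatures.pl; the function names (<tt>ngrams</tt>, <tt>build_features</tt>) are hypothetical, and it assumes that "global frequency" means the total number of times an n-gram occurs across all input names.

```python
from collections import Counter

def ngrams(name, n=2, sp=True):
    """Split a name into character n-grams (hypothetical helper).

    Only a-z is kept; when sp is true, spaces are also kept and one
    space is added before and after the string.
    """
    allowed = set("abcdefghijklmnopqrstuvwxyz")
    if sp:
        allowed.add(" ")
    cleaned = "".join(ch for ch in name.lower() if ch in allowed)
    if sp:
        cleaned = " " + cleaned + " "
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]

def build_features(names, n=2, sp=True, minfq=1):
    """Count n-grams per name and keep those meeting the minfq threshold.

    Assumption: the global frequency is the summed count over all names.
    Returns the kept n-grams, one count row per name, and the global counts
    (the latter is the kind of information a diagnostic file could hold).
    """
    per_name = [Counter(ngrams(name, n, sp)) for name in names]
    global_fq = Counter()
    for counts in per_name:
        global_fq.update(counts)
    kept = sorted(g for g, fq in global_fq.items() if fq >= minfq)
    rows = [[counts.get(g, 0) for g in kept] for counts in per_name]
    return kept, rows, global_fq
```

With <tt>sp=True</tt> the name "Lee" becomes " lee " and yields the bigrams " l", "le", "ee" and "e ", so the start and end of a name are represented as features in their own right.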
The script has several other useful options, including <tt>-two</tt>, which generates two files: one containing the index, the class (if specified through '<tt>-classnocol'refcol</tt> and a reference file is specified with <tt>-r</tt>) and the n-gram variables, and another containing the index and all other variables. An example command line that builds the two files and performs the reference look-ups is: <tt>perl SurnameFeatures.pl -i=SourceFile.txt -r=Culture-EganClassification.txt -rcol=6 -rkey=0 -rno=2 -ncol=0 -dcol=5 -rsup=1 -sp=1 -gram=2 -minfq=1 -diag=0 -two=1</tt> Here <tt>-rsup</tt> suppresses records that have no reference look-up, and <tt>-rkey</tt> and <tt>-rno</tt> specify the key and class-number columns in the reference file (here Culture-EganClassification.txt).
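The reference look-up and the two-file split can be sketched in Python as follows. This is an illustration under stated assumptions rather than the script's own logic: <tt>lookup_and_split</tt> and its record layout are hypothetical, the reference file is assumed to have been read into a dictionary from key to class number, and a suppressed record is assumed to be left out of both outputs.

```python
def lookup_and_split(records, reference, rsup=True):
    """Attach a class to each record and split it into two row sets.

    records:   iterable of (index, key, other_values, gram_values)
               (hypothetical layout chosen for this sketch)
    reference: dict mapping a look-up key to a class number
    rsup:      when true, records without a reference match are dropped
               (assumed here to apply to both outputs)
    """
    gram_rows, other_rows = [], []
    for index, key, other_values, gram_values in records:
        cls = reference.get(key)
        if cls is None and rsup:
            continue  # no reference look-up: suppress this record
        # First output: index, class, then the n-gram variables.
        gram_rows.append([index, cls] + list(gram_values))
        # Second output: index, then all other variables.
        other_rows.append([index] + list(other_values))
    return gram_rows, other_rows
```

Because both row sets carry the index, they can be joined again later if the other variables are needed alongside the n-gram features.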