Changes

Jump to navigation Jump to search
'''To general users:'''
your Your input file (usually a single ".txt" file contains many examples each as a row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will be the folder names. The content (usually a block of text) of the examples will go into separate ".txt" files. To run the script, you basically need to specify the following:
1. "File Name" : without the ".txt" extension,
'''To advanced users:'''
1. one One important step in data preprocessing is to convert encode words (strings) to integers. That is we need to build a dictionary mapping words to their corresponding indices. Our dictionary is ordered by the words' frequency. Higher the frequency smaller the index, i.e. you should expect to see "the, a, ..." these words in the smallest 10 indices : 2, 3, 4, .... Please also notice that 0 and 1 these two indices are not assigned to any words intentionally. The advantage of doing this is that you can specify easily ignore those common and meaningless words by simply say I want to consider words with the index > 20 for example. And for any word that is not in our dictionary, code it with index 1, so again you can easily ignore it.
2. Saving a pickle file is an very efficient way to retrieve the data so that you don't need to do preprocessing every time you want to run your classifier.
78

edits

Navigation menu