Patent Validity Ideas for ML
Latest revision as of 22:43, 13 July 2017
Research Question: Can patent grants be predicted given
Input: patent application text
Output: patent was granted? (boolean)
Primer
See
- https://explosion.ai/blog/deep-learning-formula-nlp
- https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-60.pdf
SOA Natural Language Feature Extraction
In a machine learning setting, this problem is supervised binary text classification: each application in the training data carries a known label, granted or not granted.
Recent machine learning and NLP research focuses on recurrent neural networks (RNNs). For a primer, see Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". RNNs are often used in sequence-to-sequence (seq2seq) models, which at the time of writing are the state of the art in machine translation of human languages. Predicting a grant would instead use the sequence-to-one form, where the whole text maps to a single label. Crucially, RNN research has popularized LSTM (Long Short-Term Memory) units, which handle relatively "far away" grammatical dependencies in sequences such as English sentences. Note that RNNs are computationally expensive to train.
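As a rough illustration of the sequence-to-one idea, the sketch below runs a plain (Elman) RNN over a sequence of token ids and squashes the final hidden state into a probability. It is a minimal, untrained forward pass in pure Python, not an LSTM and not a proposed architecture; the names `init_params` and `rnn_classify`, the vocabulary size, and the hidden size are all assumptions made for the example.

```python
import math
import random


def init_params(vocab_size, hidden, seed=0):
    """Create small random weights (illustrative values, not trained)."""
    rng = random.Random(seed)

    def rand():
        return rng.uniform(-0.1, 0.1)

    w_xh = [[rand() for _ in range(hidden)] for _ in range(vocab_size)]
    w_hh = [[rand() for _ in range(hidden)] for _ in range(hidden)]
    b_h = [0.0] * hidden
    w_hy = [rand() for _ in range(hidden)]
    b_y = 0.0
    return w_xh, w_hh, b_h, w_hy, b_y


def rnn_classify(token_ids, params):
    """Run a vanilla RNN over the sequence; return a probability from the last state."""
    w_xh, w_hh, b_h, w_hy, b_y = params
    hidden = len(b_h)
    h = [0.0] * hidden
    for t in token_ids:
        # With one-hot inputs, the input projection is just row t of w_xh.
        h = [
            math.tanh(
                w_xh[t][i]
                + sum(w_hh[i][j] * h[j] for j in range(hidden))
                + b_h[i]
            )
            for i in range(hidden)
        ]
    logit = sum(w_hy[i] * h[i] for i in range(hidden)) + b_y
    return 1.0 / (1.0 + math.exp(-logit))


params = init_params(vocab_size=5, hidden=3)
p_granted = rnn_classify([0, 3, 1, 4], params)
```

Because the weights are random, `p_granted` carries no meaning here; the point is only that the same weights are reused at every position and a single label comes out at the end.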
Data Preprocessing
Patent text is natural-language English, so its raw representation is very high-dimensional (for example, one dimension per vocabulary word). The input to the machine learning model must therefore undergo dimensionality reduction, either explicitly as preprocessing or implicitly in layers of the model. When the source is a structured format such as an XML document, the markup should be stripped first so that differences between inputs reflect the text rather than shared tags.
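A minimal sketch of the markup-stripping step, using only the Python standard library. The element names in `sample` are invented for illustration and do not follow any real patent XML schema; `strip_markup` and `tokenize` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET


def strip_markup(xml_text):
    """Return the text content of an XML document, lowercased, with tags removed."""
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order, skipping tags and attributes.
    # Joining with spaces keeps adjacent elements apart, at the cost of splitting
    # words that contain inline markup.
    text = " ".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up document; the tag names are assumptions, not a real schema.
sample = """<application id="hypothetical-1">
  <title>Widget Fastener</title>
  <claims><claim num="1">A fastener comprising a <b>threaded</b> shaft.</claim></claims>
</application>"""

plain = strip_markup(sample)
tokens = tokenize(plain)
```

The resulting token list is what a later dimensionality-reduction step (explicit or learned) would consume.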
Synthetic sample generation, including negative samples, is how many image models achieve high accuracy. We can imagine a similar algebra over patent applications: for example, two rejected applications combined at random will probably also be rejected, and labels for other combinations could follow the truth tables of AND and OR. This approach relies on the algebra producing enough random combinations of good and bad applications that a clear predictive signal emerges. Comparable sample generation has been used in other machine learning models, but whether the combined labels hold for patent text is untested.
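One way to sketch this label algebra: pick two applications at random, concatenate their text, and derive the new label with AND or OR. The function name `combine_samples`, the concatenation strategy, and the toy data are assumptions for illustration only.

```python
import random


def combine_samples(samples, n_new, op="and", seed=0):
    """Build synthetic (text, granted) pairs from random pairs of applications.

    samples: list of (text, granted) pairs with boolean labels.
    op: "and" marks a combination granted only if both parts were granted;
        "or" marks it granted if either part was granted.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        (text_a, granted_a), (text_b, granted_b) = rng.sample(samples, 2)
        if op == "and":
            label = granted_a and granted_b
        elif op == "or":
            label = granted_a or granted_b
        else:
            raise ValueError(f"unknown op: {op!r}")
        synthetic.append((text_a + "\n" + text_b, label))
    return synthetic


# Toy stand-ins for real application text.
data = [("app one", True), ("app two", False), ("app three", False)]
augmented = combine_samples(data, n_new=5, op="and", seed=1)
```

Under either operation, two rejected applications always yield a rejected combination, which matches the intuition in the paragraph above.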
Next Steps
To begin, we could use an autoencoder for dimensionality reduction and frame the problem as a classical binary classification problem to get a baseline. We could then move on to a classical RNN model, and finally to some combination of the two models, along the lines of the jointly trained linear and deep components in Google's Wide & Deep paper.
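A minimal sketch of what such a baseline could look like, in pure Python. To keep it short, the hashing trick stands in for the autoencoder as the dimensionality-reduction step (it is a different technique, used here only as a placeholder), and the classifier is logistic regression trained by stochastic gradient descent. The training texts and labels are made up, and all function names are hypothetical.

```python
import math
import re
import zlib


def hash_features(text, dim=256):
    """Map text to a fixed-length count vector via the hashing trick."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - target  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(text, w, b, dim=256):
    """Return the model's estimated probability that the application is granted."""
    x = hash_features(text, dim)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Made-up examples; 1 = granted, 0 = not granted.
train_texts = [
    "novel threaded fastener with locking shaft",
    "new alloy composition for turbine blades",
    "method of doing business on a computer",
    "abstract idea implemented generically",
]
labels = [1, 1, 0, 0]

w, b = train_logreg([hash_features(t) for t in train_texts], labels)
```

Calling `predict(text, w, b)` then gives a score between 0 and 1 for any new text. A real baseline would need held-out data and far more than four examples; this only shows the shape of the pipeline.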