Revision as of 16:16, 15 July 2016
Ravali Kruthiventi
Project - USPTO Assignees, Patent and Citation Data
Assignees Data
- Data source: patent database (merged data from patent_2015 and patentdata databases)
- Issues: citations data contains non-numeric patent numbers (likely application numbers, etc.)
- Solution:
- Segregate into smaller tables so that Amir and Marcela can identify patterns
- link back to appropriate patent numbers from the patent table
- Time to implement: 1 day
- Priority:
- Teams waiting for it:
- Marcela and Amir
- Project : Patent data analysis
- Jake and James, who could potentially need this down the line
- Project : LBO data
- Marcela and Amir
- Deadline:
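The segregation step above could be sketched roughly as follows. This is a minimal illustration, not the actual script: the sample citation strings are made up, and the real work would run against the citation tables in the patent database.

```python
import re

# Hedged sketch: the regexes below are illustrative guesses at the number
# formats; the real patterns come from inspecting the citation table.
GRANT_RE = re.compile(r"^\d{7,8}$")        # plain granted-patent numbers
APP_RE = re.compile(r"^\d{4}/?\d{6,7}$")   # looks like an application/publication number

def classify_citation(cited: str) -> str:
    """Bucket a cited 'patent number' so odd patterns can be reviewed."""
    cited = cited.strip().replace(",", "")
    if GRANT_RE.match(cited):
        return "granted"
    if APP_RE.match(cited):
        return "application"
    return "other"  # foreign patents, typos, etc.

# Made-up examples of the kinds of values seen in the citations data
rows = ["5123456", "2004/0123456", "EP1234567", "7,654,321"]
buckets = {}
for r in rows:
    buckets.setdefault(classify_citation(r), []).append(r)
```

Each bucket would then become one of the smaller tables for Amir and Marcela to review, with the "granted" bucket linked back to the patent table directly.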
- Data Source: USPTO Bulk Data repository
- Issues:
- The script inserts duplicate copies of data into the tables.
- Analysis is required to make sure the data was inserted correctly from the XML files.
- Analysis is also required to determine whether this data is better than the data currently in the patent database.
- Action owners : Amir and Marcela
- Solution:
- Amir and Marcela and/or I need to look at the data to determine quality
- If they find that any of this data is better than the data we currently have, I will have to figure out a way to integrate this data into our data model for patent data.
- Amir and Marcela and/or I will need to delete the copies
- Time to implement:
- Priority:
- Teams waiting for it:
- Deadline:
Project - Lex Machina Data
- Data Source:
- Issues:
- Solution:
- Time to implement:
- Priority:
- Teams waiting for it:
- Deadline:
Project - Pattern Recognition on Patent Data through Machine Learning
- Data Source: The patent database.
- Plan:
- Technique
- Determine research question to be asked
- Scrub data
- Determine 3-4 mining/machine learning techniques to best extract patterns
- Train the algorithms
- Run the algos on sample dataset
- Determine the algo with best results
- Implement the chosen technique
- Known Issues:
- Dataset to be cleaned and its quality analyzed, as specified above.
- Deliverables
- Set of patterns to base further research on
- Research paper (?)
- Documentation - Wiki page
- Time to implement:
- Priority:
- Teams waiting for it: None
- Deadline:
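The "train the algorithms, run them on a sample, keep the best" steps above could be sketched as below. This is purely illustrative: the two toy rules stand in for real mining/ML techniques, and the tiny (feature, label) sample is made up.

```python
# Hedged sketch of model selection on a sample dataset. The two "models"
# here are stand-ins for the 3-4 real techniques to be compared.
sample = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.1, 0), (0.8, 1)]

def majority_rule(x):
    # Baseline: always predict the majority class
    return 0

def threshold_rule(x):
    # Simple learned rule: split the feature at 0.5
    return 1 if x > 0.5 else 0

def accuracy(model, data):
    """Fraction of sample points the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

candidates = {"majority": majority_rule, "threshold": threshold_rule}
best = max(candidates, key=lambda name: accuracy(candidates[name], sample))
```

In the real project the candidates would be trained algorithms and the metric might not be plain accuracy, but the selection loop has the same shape.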
Amir Kazempour
Introduction
The research plan at this point consists of roughly five smaller tasks: the data project relevant to the little guy paper; the Lex Machina data pull, which is a vital first step for the little guy paper; the application for SBA funding for the little guy paper; and improving and updating the wiki pages relevant to the little guy project and to Marcela's work on the patent litigation process.
Data Project
We have found a few issues with the patent database that need to be addressed in order to produce the datasets required for the little guy paper.
- USPTO historical assignment data
- The issue with table keys is potentially resolved.
- Create a table to enable us to track patent ownership through the life of a patent.
- Assignees data
- Identify U.S. assignees in the data for all assignees without a valid country or state entry.
- Maintenance fee data
- Create a table with all the active patents and their expected remaining life using the maintenance fee event codes.
- Citation data
- We need to recognize patterns in the cited patent numbers. Low-hanging fruit would be to match all the publication numbers to the granted patent numbers in the histpatent table. A few other repeating patterns seem to point to foreign-issued patents.
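The publication-to-grant matching could be sketched as below. This is an assumption-laden illustration: the `pub_to_grant` mapping stands in for a lookup built from the histpatent table, and the example number pair is made up.

```python
# Hedged sketch: assumes a {publication_number: granted_patent_number}
# mapping pulled from the histpatent table; names and values are illustrative.
pub_to_grant = {
    "20040123456": "6789012",  # made-up example pair
}

def resolve_cited(cited: str):
    """Map a cited publication number back to its granted patent, if known."""
    cited = cited.strip().replace("/", "")
    if len(cited) == 11 and cited.isdigit():   # US pre-grant publication format
        return pub_to_grant.get(cited)         # None if no grant found
    return cited if cited.isdigit() else None  # keep plain grant numbers; flag the rest
```

Citations that resolve to `None` would be the residual set to examine for foreign-patent patterns.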
Lex Machina data pull
A data pull proposal/request has been prepared. We need to get in touch with Brian to discuss the first pull. Upon acquiring the first data pull, we need to assess feasibility of the little guy paper as discussed in the proposal mentioned above.
Little guy paper
We need to consider the possibility of new research questions based on the hypotheses discussed in Lerner, Josh, Andrew Speen, and Ann Leamon. "The Leahy-Smith America Invents Act: A Preliminary Examination of Its Impact on Small Businesses."
Further progress on the paper is conditional on the first data pull from Lex Machina.
Wiki pages
Making sure that:
- All the SQL code is available on the wiki and up to date.
- Full data description for the USPTO database and the USPTO historical assignment data are available on the wiki.
- Pages are linked correctly.
SBA grant application
The goal is to go over the Instruction to Offerors and Statement of Work documents, both available on the Research on the Changing Value of Patents to Startups page, and gauge the compatibility of the little guy paper's research question with the one in those documents. The main focus of the literature review will be Lerner, Josh, Andrew Speen, and Ann Leamon, "The Leahy-Smith America Invents Act: A Preliminary Examination of Its Impact on Small Businesses".
Dylan Dickens
Ben Baldazo
Ben Baldazo Research Plans (Plan Page)
Startups of Houston Interactive Maps - The Whole Process
Use Google Maps to find Longitude and Latitude
- Document how to work Geocode.py and what might go wrong
Put through R code to make an interactive map
- Finding and documenting the processes required to run the R code may be necessary
- Works on Carto and looks really cool
- We will eventually need a plug-in and a Carto account so that we can post this on the wiki
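The longitude/latitude step could look roughly like the sketch below. This is not Geocode.py itself: it just shows the likely shape of the work, building a Google Geocoding API request URL and pulling lat/lng out of the JSON reply. The key placeholder and the sample reply are made up, and no network call happens here.

```python
import json
import urllib.parse

def geocode_url(address: str, key: str = "YOUR_KEY") -> str:
    """Build a Google Geocoding API request URL for one address."""
    query = urllib.parse.urlencode({"address": address, "key": key})
    return "https://maps.googleapis.com/maps/api/geocode/json?" + query

def extract_latlng(response_json: str):
    """Return (lat, lng) from a Geocoding API JSON payload, or None."""
    data = json.loads(response_json)
    if data.get("status") != "OK":
        return None
    loc = data["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Fabricated reply in the documented response shape (Houston city center)
fake_reply = json.dumps({"status": "OK", "results": [
    {"geometry": {"location": {"lat": 29.7604, "lng": -95.3698}}}]})
```

Things that "might go wrong" and belong in the documentation include non-"OK" statuses (quota exceeded, zero results) and ambiguous addresses returning multiple results.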
Accelerator Quality Issue Brief
Houston Accelerators (issue brief)
Factors to look at
- Value Added
- How to look at this though?
- Market vs Non-Market
- Philanthropic funding?
- If Non-Profit: Propublica will document contributions
- If for profit: call?
- Founded Bottom up? or Top down?
- See if it was founded by a group or individual with actual industry connections (online or phone)
- Philanthropic funding?
- Location
- Proximity to resources
- We may have to update the startups in Houston map for this
- Proximity to resources
- Available Resources (we should generally be able to call or find this on website, these should be things they brag about)
- Flex Space
- Events
- Co-Working
- Connections/Mentorship
- This can also be given a judged value based on the perceived experience of mentors/connections
- Funding (This may be hard but if they offer their own VC we can check through SDC)
- Userbase
- Leadership/Experience qualifications (Linkedin, profiles on their own websites, other bios)
- Have they driven a startup before, or been in the backseat?
- What other qualifications do they have
- Criteria from the Acc Rankings (as long as we have portfolios then we can use SDC for this)
- VC funding history
- IPO
- Acquired
- Any other reviews possible
- Other articles (Xconomy, Houston Chronicle, etc.)
- Info from actually calling the accelerators (a list of questions is posted on the discussion page for Houston Accelerators (issue brief))
- perhaps reviews from startups themselves
- Could look specifically into startups that have gone through multiple accelerators; hopefully we have phone numbers on File:HSM10.xlsx
Jake Silberman
Jake Silberman Research Plans (Plan Page)
Leveraged Buyout Innovation (Academic Paper)
- Finalize Hazard Model
- Determine the best regression model (Cox, or a parametric alternative that makes more assumptions)
- Determine finalized variable set
- Predict based on model
- Match LBO and non LBO companies based on hazard model predictions
- Generate buckets, i.e. break down by industry, decade, etc...
- Determine metric for matching
- Integrate new patent data
- Create stocks of patents
- Add in patent assignment data
- Analysis of control group and study group for first results
- Refine matching if necessary
- Test for endogeneity/other issues
- Lit review for variables
- Revise preexisting regression variable write-up and reformat it to appropriate academic paper form
- Do final correct pull of SDC Data (just include IPOs)
- Clean data, throwing out duplicate names and only take most recently invested one
- Rank cities by venture capital on different metrics, either in SQL or Excel
- Write up issue brief
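The "match LBO and non-LBO companies based on hazard model predictions" step above could be sketched as a greedy nearest-neighbor match without replacement. Everything here is illustrative: the firm names and scores are made up, and the real predicted hazards would come from the fitted model (Cox or otherwise).

```python
# Hedged sketch of matching treated (LBO) firms to control (non-LBO) firms
# on predicted hazard scores; names and scores are fabricated.
lbo = {"FirmA": 0.30, "FirmB": 0.72}
non_lbo = {"FirmX": 0.28, "FirmY": 0.75, "FirmZ": 0.50}

def match_on_score(treated, controls):
    """Pair each treated firm with the closest unused control firm."""
    pool = dict(controls)
    pairs = {}
    for name, score in sorted(treated.items(), key=lambda kv: kv[1]):
        best = min(pool, key=lambda c: abs(pool[c] - score))
        pairs[name] = best
        del pool[best]  # match without replacement
    return pairs

pairs = match_on_score(lbo, non_lbo)
```

The "determine metric for matching" bullet is exactly the choice hidden in `abs(pool[c] - score)`; a caliper or within-bucket (industry, decade) restriction would slot in at the `min` step.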
Shoeb Mohammed
Shoeb Mohammed Research Plans (Research Plan Page)
Short Term
- Create a listing on the wiki for all software developed at the McNair Center. - Completed
- Build a Linux box to run the crawler. - Completed
Long Term
- Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.
- Develop the crawler. Try to begin with code that Dan has. - Completed
Side Tasks
- If possible, redo the patent parser (previously coded by Kranti) to also pull in patent citation data.
Veeral Shah
Short Term:
- Build Web Crawler Tool that can obtain company descriptions for a list of companies using HTML and Python.
- Collaborate with Ben to help obtain and organize information on Houston startup companies.
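A minimal sketch of the description crawler, using only the standard library: pull the `<meta name="description">` tag out of a company page. The fetch step (e.g. `urllib.request.urlopen`) is omitted so the example stays offline, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Collect the content of <meta name="description"> while parsing."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

def get_description(html: str):
    parser = DescriptionParser()
    parser.feed(html)
    return parser.description

# Fabricated sample page for illustration
page = '<html><head><meta name="description" content="A Houston startup."></head></html>'
```

Looping `get_description` over fetched pages for the company list would produce the description column; pages with no meta description would need a fallback (first paragraph, About page, etc.).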
James Chen
James Chen Research Plans (Plan Page)
- Short term:
- Refine variables to include in hazard rate model
- Industrygroup
- Log or ratio for tax, ebitda, etc.
- Refine variables to include in hazard rate model
- Long term:
- Finish hazard model
- Complete hazard rate matching
- Test for endogeneity, variables list
- Incorporate new patent data (stock, transfers, etc.)
- Complete literature Review using final variable list
Ariel Sun
Ariel Sun Research Plans (Plan Page)
- Hubs
- Get scorecard system completed (Hubs: Mechanical Turk)
- Mechanical Turk for potential hubs (Mechanical Turk (Tool))
- Match identified hubs to CMSAs
- VC table
- Waiting for patent data to be fixed to join to VC table
- Import VC data to STATA
- Hazard rate model
- Diff-in-diff
Todd Rachowin
Todd Rachowin Research Plans (Plan Page)
- Short-Term = Hubs List (Hubs: Mechanical Turk)
- Creating a comprehensive list of potential hubs
- Determining the best variables for the scorecard
- Building "filters" for automating the collection
- Running and auditing of the automation
- Collecting the remaining manual data
- Long-Term = Everything Else
- Hazard Rate Model (determine proper one, run it, etc.)
- Diff-in-diff
Gunny Liu
Gunny Liu Research Plans (Plan Page)
Week VII
7/11 thru 7/15
- Finalize Twitter Webcrawler version Alpha, discuss roadmap ahead with research fellows
- Expand semantic mediawiki capabilities on our wiki and provide documentation for existing data structures
- Configuration of data transfer of startup data from local to wiki wrt Ben
Week VIII
7/18 thru 7/22
- Alpha Exploration & development of existing Google Maps API script
- Advanced development of Twitter Webcrawler to populate McNair databases
- Input: previously documented mothernodes and entrepreneurship buzzwords
- Advanced development of the Eventbrite Webcrawler to populate McNair databases
- To integrate with Google Maps API to provide updated mapping of active entrepreneurship events in Houston
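The "mothernodes and entrepreneurship buzzwords" input above suggests a relevance filter somewhere in the Twitter Webcrawler. A hedged, stdlib-only sketch of what that filter might look like; the buzzword list and tweet texts are fabricated, not the documented ones.

```python
# Hedged sketch of a buzzword relevance filter; the word list is illustrative.
BUZZWORDS = {"startup", "accelerator", "founder", "seed round"}

def is_relevant(tweet_text: str) -> bool:
    """True if the tweet mentions any tracked entrepreneurship buzzword."""
    text = tweet_text.lower()
    return any(word in text for word in BUZZWORDS)
```

Tweets passing the filter would be the candidates written to the McNair databases, with the mothernode accounts supplying the initial timeline pulls.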
Week IX
7/25 thru 7/29
- Alpha Exploration & development of Techcrunch API
- Alpha Exploration & development of Facebook API
Week X
8/1 thru 8/5
- Advanced development of all API scripts to populate McNair databases
Week XI
8/8 thru 8/12
- Last day of summer internship: 8/8