Difference between revisions of "Peter Jalbert (Work Log)"
Revision as of 17:50, 20 April 2017
09/27/2016 15:00-18:00: Set up Staff wiki page, work log page; registered for Slack, Microsoft Remote Desktop; downloaded Selenium on personal computer, read Selenium docs. Created wiki page for Moroccan Web Driver Project.
09/29/2016 15:00-18:00: Re-enrolled in Microsoft Remote Desktop with proper authentication, set up Selenium environment and Komodo IDE on Remote Desktop, wrote a program using Selenium that goes to a link and opens the print dialog box. Developed computational recipe for a different approach to the problem.
09/30/2016 12:00-14:00: Selenium program now selects the view PDF option on the website and goes to the PDF webpage, then switches the window handle to the new page. CTRL+S is sent to the page to launch the save dialog window, but text cannot be sent to this window. Brainstormed ways around this issue. Explored Chrome Options for saving automatically without a dialog window. Looking into other libraries besides Selenium that may help.
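The Chrome Options route mentioned in this entry can be sketched as follows. This is a minimal sketch, not the code used in the project: it assumes Python 3 with a recent Selenium release and ChromeDriver installed, and the helper names `chrome_download_prefs` and `make_driver` are hypothetical. The three preference keys are Chrome settings for the download folder, the save prompt, and the built-in PDF viewer.

```python
def chrome_download_prefs(download_dir):
    """Return Chrome preferences that save PDFs without showing a dialog."""
    return {
        # Folder that downloads are written to (caller-supplied path).
        "download.default_directory": download_dir,
        # Do not ask where to save each file.
        "download.prompt_for_download": False,
        # Download PDFs instead of opening them in Chrome's viewer.
        "plugins.always_open_pdf_externally": True,
    }


def make_driver(download_dir):
    """Start Chrome with the preferences above (needs Selenium and ChromeDriver)."""
    # Imported here so chrome_download_prefs can be used without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", chrome_download_prefs(download_dir))
    return webdriver.Chrome(options=options)
```

With these preferences set, navigating the driver to a PDF link should write the file to the chosen folder, so no keystrokes need to be sent to a save dialog.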
10/3/2016 13:00 - 16:00: Moroccan Web Driver projects completed for driving of the Monarchy proposed bills, the House of Representatives proposed bills, and the Ratified bills sites. Began the process of devising a naming system for the files that does not require scraping. Tinkered with naming through regular expression parsing of the URL. Structure for the oral questions and written questions drivers is set up, but needs fixes due to the differences in the sites. Fixed bug on McNair wiki for women's biz team where email was plain text instead of an email link. Took a glimpse at the Kuwait Parliament website, and it appears to be very different from the Moroccan setup.
10/6/2016 13:30 - 18:00: Discussed the desired file names for the Moroccan data download with Dr. Elbadawy. The consensus was that the bill programs are ready to launch once the files can be named properly, and the questions data must be retrieved using a web crawler, which I need to learn how to implement. The naming of files is currently producing errors in going from Arabic, to URL, to download, to filename. Debugging in process. Also built a demo Selenium program for Dr. Egan that drives the McNair blog site on an infinite loop.
10/7/2016 12:00 - 14:00: Learned Unicode and UTF-8 encoding and decoding in Arabic. Still working on transforming an ASCII URL into printable Unicode.
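The step of turning an ASCII URL into printable Unicode can be illustrated with a short sketch. It assumes Python 3 and a URL whose last path segment is a percent-encoded UTF-8 file name; the function name `filename_from_url` and the example URL are hypothetical, and the project's own script may have worked differently.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, decoded from percent-encoded UTF-8."""
    path = urlsplit(url).path            # drop scheme, host, and query string
    encoded_name = posixpath.basename(path)
    return unquote(encoded_name)         # unquote decodes as UTF-8 by default


# Hypothetical example URL containing an Arabic word in percent-encoded form.
print(filename_from_url("http://example.com/docs/%D9%85%D8%B1%D8%AD%D8%A8%D8%A7.pdf"))
```

The decoded string can then be used directly as a file name on a file system that supports Unicode names.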
10/11/2016 15:00 - 18:00: Fixed Arabic bug; files can now be saved with Arabic titles. Monarchy bills downloaded and ready for shipment. House of Representatives bills mostly downloaded, ratified bills prepared for download. Started learning the Scrapy library in Python for web scraping. Discussed idea of screenshotting questions instead of scraping.
10/13/2016 13:00-18:00: Completed download of Moroccan Bills. Working on either a web driver screenshot approach or a webcrawler approach to download the Moroccan oral and written questions data. Began building Web Crawler for Oral and Written Questions site. Edited Moroccan Web Driver/Crawler wiki page. Moroccan Web Driver
10/14/2016 12:00-14:00: Finished Oral Questions crawler. Finished Written Questions crawler. Waiting for further details on whether that data needs to be tweaked in any way. Updated the Moroccan Web Driver/Web Crawler wiki page. Moroccan Web Driver
10/18/2016 15:00-18:30: Finished code for Oral Questions web driver and Written Questions web driver using selenium. Now, the data for the dates of questions can be found using the crawler, and the pdfs of the questions will be downloaded using selenium. Moroccan Web Driver
10/20/2016 13:00-18:00: Continued to download data for the Moroccan Parliament Written and Oral Questions. Updated Wiki page. Started working on Twitter project with Christy. Moroccan Web Driver
10/21/2016 12:00-14:00: Continued to download data for the Moroccan Parliament Written and Oral Questions. Looked over Christy's Twitter Crawler to see how I can be helpful. Dr. Egan asked me to think about how to potentially make multiple tools to get cohorts and other sorts of data from accelerator sites. See Accelerator List. He also asked me to look at the GovTrack Web Crawler for potential ideas on how to bring this project to fruition.
11/1/2016 15:00-18:00: Continued to download Moroccan data in the background. Went over code for the GovTrack Web Crawler, continued learning Perl. See GovTrack Web Crawler. Began Kuwait Web Crawler/Driver.
11/3/2016 13:00-18:00: Continued to download Moroccan data in the background. Dr. Egan fixed systems requirements to run the GovTrack Web Crawler. Made significant progress on the Kuwait Web Crawler/Driver for the Middle East Studies Department.
11/4/2016 12:00-14:00: Continued to download Moroccan data in the background. Finished writing initial Kuwait Web Crawler/Driver for the Middle East Studies Department. Middle East Studies Department asked for additional embedded files in the Kuwait website. Moroccan Web Driver
11/8/2016 15:00-18:00: Continued to download Moroccan data in the background. Finished writing code for the embedded files on the Kuwait Site. Spent time debugging the frame errors due to the dynamically generated content. Never found an answer to the bug, and instead found a workaround that sacrificed run time for the ability to work. Moroccan Web Driver
11/10/2016 13:00-18:00: Continued to download Moroccan data and Kuwait data in the background. Began work on Google Scholar Crawler. Wrote a crawler for the Accelerator Project to get the HTML files of hundreds of accelerators. The crawler ended up failing; it appears to have been due to HTTPS.
11/11/2016 12:00-14:00: Continued to download Moroccan data in the background. Attempted to find bug fixes for the Accelerator Project crawler.
11/15/2016 15:00-18:00: Finished download of Moroccan Written Question pdfs. Wrote a parser with Christy to be used for parsing bills from Congress and eventually executive orders. Found bug in the system Python that was worked out and rebooted.
11/17/2016 13:00-18:00: Wrote a crawler to retrieve information about executive orders, and their corresponding pdfs. They can be found here. Next step is to run code to convert the pdfs to text files, then use the parser fixed by Christy.
11/18/2016 12:00-14:00: Converted Executive Order PDFs to text files using Adobe Acrobat DC. See Wikipage for details.
11/22/2016 15:00-18:00: Transferred downloaded Morocco Written Bills to provided SeaGate Drive. Made a "gentle" F6S crawler to retrieve HTMLs of possible accelerator pages documented here.
11/29/2016 15:00-18:00: Began pulling data from the accelerators listed here. Made text files for about 18 accelerators.
12/1/2016 13:00-18:00: Continued making text files for the Accelerator Seed List project. Built tool for the E&I Governance Report Project with Christy. Adds a column of data that shows whether or not the bill has been passed.
12/2/2016 12:00-14:00: Built and ran web crawler for Center for Middle East Studies on Kuwait. Continued making text files for the Accelerator Seed List project.
12/6/2016 15:00-18:00: Learned how to use git. Committed software projects from the semester to the McNair git repository. Projects can be found at: Executive Order Crawler, Foreign Government Web Crawlers, F6S Crawler and Parser.
12/7/2016 15:00-18:00: Continued making text files for the Accelerator Seed List project.
12/8/2016 14:00-18:00: Continued making text files for the Accelerator Seed List project.
1/10/2017 14:30-17:15: Continued making text files for the Accelerator Seed List project. Downloaded pdfs in the background for the Moroccan Government Crawler Project.
1/11/2017 10:00-12:00: Continued making text files for the Accelerator Seed List project. Downloaded pdfs in the background for the Moroccan Government Crawler Project.
1/12/2017 14:30-17:45: Continued making text files for the Accelerator Seed List project. Downloaded pdfs in the background for the Moroccan Government Crawler Project.
1/17/2017 14:30-17:15: Continued making text files for the Accelerator Seed List project. Downloaded pdfs in the background for the Moroccan Government Crawler Project.
1/18/2017 10:00-12:00: Downloaded pdfs in the background for the Moroccan Government Crawler Project.
1/19/2017 14:30-17:45: Downloaded pdfs in the background for the Moroccan Government Crawler Project. Created parser for the Accelerator Seed List project, completed creation of final data set (yay!). Began working on cohort parser.
1/23/2017 10:00-12:00: Worked on parser for cohort data of the Accelerator Seed List project. Preliminary code is written, working on debugging.
1/24/2017 14:30-17:15: Worked on parser for cohort data of the Accelerator Seed List project. Cohort data file created, debugging is almost complete. Will begin work on the google accelerator search soon.
1/25/2017 10:00-12:00: Finished parser for cohort data of the Accelerator Seed List project. Some data files still need proofreading as they are not in an acceptable format. Began working on Google sitesearch project.
1/26/2017 14:30-17:45: Continued working on Google sitesearch project. Discovered crunchbase, changed project priority. Priority 1, split accelerator data up by flag, priority 2, use crunchbase to get web urls for cohorts, priority 3, make internet archive wayback machine driver. Located Whois Parser.
1/30/2017 10:00-12:00: Optimized enclosing circle algorithm through memoization. Developed script to read addresses from accelerator data and return latitude and longitude coordinates.
1/31/2017 14:30-17:15: Built WayBack Machine Crawler. Updated documentation for coordinates script. Updated profile page to include locations of code.
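For illustration, one way a Wayback Machine lookup can work is through the Internet Archive's public availability endpoint, which returns JSON describing the archived snapshot closest to a requested timestamp. This is a minimal Python 3 sketch under that assumption; the log does not say whether the WayBack Machine Crawler used this endpoint, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site, timestamp):
    """Build the query URL for a site and a YYYYMMDD timestamp."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": site, "timestamp": timestamp})


def closest_snapshot(response):
    """Pull the archived URL out of a decoded JSON response, or return None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(site, timestamp):
    """Query the endpoint over the network and return the closest snapshot URL."""
    with urlopen(availability_url(site, timestamp)) as reply:
        return closest_snapshot(json.load(reply))
```

Calling `lookup` with an accelerator's domain and a date would return the address of the nearest archived copy, which a crawler could then fetch like any other page.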
2/1/2017 10:00-12:00:
Notes from Session with Ed: Project on US university patenting and entrepreneurship programs (writing code to identify universities in assignees), search Wikipedia (XML then bulk download), student pop, faculty pop, etc. Circle project for VC data will end up being a joint project to join accelerator data. Pull descriptions for VC. Founders of accelerators in LinkedIn. Cannot be caught by LinkedIn (pretend to not be a bot). Can eventually get academic backgrounds through LinkedIn. Pull business registration data, Stern/Guzman Algorithm. GIS on top of geocoded data. Maps that work on wiki or blog (CartoDB), Maps API and R. NLP Projects, Description Classifier.
2/2/2017 14:30-15:45: Out sick, independent research and work from RDP. Brief research into the Stern-Guzman algorithm. Research into Interactive Maps. No helpful additions to map embedding problem.
2/7/2017 14:30-17:15: Fixed bugs in parse_cohort_data.py, the script for parsing the cohort data from the Accelerator Seed List project. Added descriptive statistics to cohort data excel file.
2/8/2017 10:00-12:00: Worked on Neural Net for the Industry Classifier Project.
2/13/2017 10:00-12:00: Worked on Neural Net for the Industry Classifier Project.
2/14/2017 14:30-17:15: Worked on the application of the Enclosing Circle algorithm to the VC study. Working on bug fixes in the Enclosing Circle algorithm. Created wiki page for the Enclosing Circle Algorithm.
2/15/2017 10:00-12:00: Finished Enclosing Circle Algorithm applied to the VC study. Enclosing Circle algorithm still needs adjustment, but the program runs with the temporary fixes.
2/16/2017 14:30-17:45: Reworked Enclosing Circle Algorithm to create a file of geocoded data. Began work on wrapping the algorithm in C to improve speed.
2/20/2017 10:00-12:00: Continued to download geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Assisted work on the Industry Classifier.
2/21/2017 14:30-17:15: Continued to download geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Researched C++ compilers for Python so that the Enclosing Circle Algorithm could be wrapped in C. Found a recommended one here.
2/22/2017 10:00-12:00: Continued to download geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Helped out with Industry Classifier Project.
2/23/2017 14:30-17:45: Continued to download geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Installed C++ Compiler for Python. Ran tests on difference between Python and C wrapped Python.
2/27/2017 10:00-12:00: Continued to download geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Assisted work on the Industry Classifier.
2/28/2017 14:30-17:15: Finished downloading geocoded data for VC Data as part of the Enclosing Circle Algorithm Project. Found bug in Enclosing Circle Algorithm.
3/1/2017 10:00-12:00: Created statistics for the VC Circles Project.
3/2/2017 14:30-17:45: Cleaned up data for the VC Circles Project. Created histogram of data in Excel. See Enclosing Circle Algorithm Project. Began work on the LinkedIn Crawler.
3/6/2017 10:00-12:00: Ran script to determine the top 50 cities which Enclosing Circle should be run on. Fixed the VC Circles script to take in a new data format.
3/7/2017 14:30-17:15: Redetermined the top 50 cities which Enclosing Circle should be run on. Data on the Top 50 Cities for VC Backed Companies can be found here. Ran Enclosing Circle Algorithm on the Top 50 Cities.
3/8/2017 10:00-12:00: Continued running Enclosing Circle Algorithm on the Top 50 Cities. Created script to draw outcome of the Enclosing Circle Algorithm on Google Maps.
3/9/2017 14:30-17:45: Continued running Enclosing Circle Algorithm on the Top 50 Cities. Finished script to draw Enclosing Circles on a Google Map.
3/20/2017 10:00-12:00: Worked on debugging the Enclosing Circle Algorithm.
3/21/2017 14:30-17:15: Coded a brute force algorithm for the Enclosing Circle Algorithm.
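As an illustration of what a brute-force approach can look like, the sketch below solves the classic single-circle version of the problem: the smallest circle containing every point of a set. That reading is an assumption, since the project's Enclosing Circle Algorithm may define the problem differently (for example, several circles). The sketch relies on the fact that a smallest enclosing circle is determined by either two or three of the points, so it tries every pair and every triple. All function names are hypothetical.

```python
import math
from itertools import combinations


def circle_from_two(a, b):
    """Circle whose diameter is the segment from a to b, as (cx, cy, r)."""
    cx, cy = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
    return (cx, cy, math.hypot(a[0] - b[0], a[1] - b[1]) / 2)


def circle_from_three(a, b, c):
    """Circumcircle of three points, or None if they are collinear."""
    d = 2 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return (ux, uy, math.hypot(ux - a[0], uy - a[1]))


def contains(circle, p, eps=1e-9):
    """True if point p lies inside or on the circle, within a small tolerance."""
    return math.hypot(circle[0] - p[0], circle[1] - p[1]) <= circle[2] + eps


def brute_force_enclosing_circle(points):
    """Try every pair and triple; return the smallest circle covering all points."""
    if not points:
        return None
    if len(points) == 1:
        return (points[0][0], points[0][1], 0.0)
    candidates = [circle_from_two(a, b) for a, b in combinations(points, 2)]
    for triple in combinations(points, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    valid = [c for c in candidates if all(contains(c, p) for p in points)]
    return min(valid, key=lambda c: c[2])
```

There are on the order of n^3 candidate circles and each is checked against all n points, so the running time grows roughly as n^4. That makes this version useful mainly as a correctness check on small inputs.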
3/23/2017 14:30-17:45: Finished debugging the brute force algorithm for Enclosing Circle Algorithm. Implemented a method to plot the points and circles on a graph. Analyzed runtime of the brute force algorithm.
3/27/2017 10:00-12:00: Worked on debugging the Enclosing Circle Algorithm. Implemented a way to remove interior circles, and determined that translation to latitude and longitude coordinates resulted in slightly off center circles.
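One plausible contributor to off-center circles (an assumption, not something the log confirms) is that degrees of latitude and longitude are not equal distances on the ground: a degree of longitude shrinks as latitude increases. The sketch below uses the haversine formula with a mean Earth radius of 6371 km to show the effect; the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude at the equator versus at 60 degrees north.
print(haversine_km(0, 0, 0, 1))
print(haversine_km(60, 0, 60, 1))
```

The second distance is roughly half the first, so a circle computed by treating raw latitude and longitude values as planar x and y coordinates would be stretched east to west when drawn on a map.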
3/28/2017 14:30-17:15: Finished running the Enclosing Circle Algorithm. Worked on removing incorrect points from the data set (see above).
3/29/2017 10:00-12:00: Worked on debugging points for the Enclosing Circle Algorithm.
4/3/2017 10:00-12:00: Finished debugging points for the Enclosing Circle Algorithm. Added Command Line functionality to the Industry Classifier.
4/5/2017 9:45-11:45: Began work on the LinkedIn Crawler. Researched how to launch a Python virtual environment.
4/6/2017 14:00-17:15: Continued working on debugging and documenting the LinkedIn Crawler. Wrote a test program that logs in, searches for a query, navigates through search pages, and logs out. Recruiter program can now login and search.
4/10/2017 10:00-12:00: Began writing functioning crawler of LinkedIn.
4/11/2017 14:30-17:15: Completed functional crawler of LinkedIn Recruiter Pro. Basic search functions work and download profile information for a given person.
4/12/2017 10:00-12:00: Worked on bugs with the LinkedIn Crawler.
4/13/2017 14:30-17:45: Worked on debugging the logout procedure for the LinkedIn Crawler. Began formulation of process to search for founders of startups using a combination of the LinkedIn Crawler with the data resources from the CrunchBase Snapshot.
4/17/2017 10:00-12:00: Worked on ways to get correct search results from the LinkedIn Crawler. Worked on an HTML Parser for the results from the LinkedIn Crawler.
4/18/2017 14:30-17:15: Ran LinkedIn Crawler on matches between Crunchbase Snapshot and the accelerator data.
4/19/2017 10:00-12:00: Made updates to the LinkedIn Crawler Wikipage. Ran LinkedIn Crawler on accelerator data. Working on an html parser for the results from the LinkedIn Crawler.
4/20/2017 14:30-17:45: Finished the HTML Parser for the LinkedIn Crawler. Ran HTML parser on accelerator founders. Data is stored in projects/accelerators/LinkedIn Founder Data.