[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]

2018-08-03: Fixed and debugged the minor coding with priority ranking. Helped Connor find timing info for the missing companies. Cleaned up wiki pages.

2018-08-02: Redid the minor codes with priority ranking.

2018-08-01: Entered the rest of the minor codes; for cohorts that had multiple codes attached, I arbitrarily picked the first one.

2018-07-31: Minor coded the cohorts based on the contents of the category group list in The File to Rule Them All. See Ed's Slack message for the key/legend. The rows highlighted in red are the ones I'm not sure about. I wrote a Python script to code most of them: E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py (a coding sketch follows this log section). The coded sheet is a Google Sheet that I will add to the wiki once I make a page for it.

2018-07-30: Matched employers with VC firms, funds, and startups (a matching sketch also follows below). There were 40 matches with firms and funds, and 4 matches with startups. Coded these into two columns in the Founder Experience table. Updated all wiki pages.

2018-07-27: Reformatted the timing info data to separate out the companies so that it looks like The File to Rule Them All. It is located in E:/McNair Staff/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser, but Maxine said she figured it out, so she will finish it. I will start on Founder Experience on Monday; I do not understand what "minorcode lookup" means.

2018-07-26: Finished the Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up the Timing Info data.
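A minimal sketch of the kind of lookup codecategory.py performs for the 2018-07-31 and 2018-08-01 entries. The key/legend here is a made-up placeholder (the real one is in Ed's Slack message), and the file and column names are assumptions:

<pre>
import csv

# Hypothetical key/legend; the real mapping comes from Ed's Slack message.
MINOR_CODE_KEY = {
    "Health": "H",
    "Fintech": "F",
    "Energy": "E",
}

def minor_code(category_group):
    """Map a semicolon-separated category group to one minor code.
    When several codes match, arbitrarily keep the first (2018-08-01 entry)."""
    codes = [MINOR_CODE_KEY[c.strip()]
             for c in category_group.split(";")
             if c.strip() in MINOR_CODE_KEY]
    return codes[0] if codes else ""   # blank = flag for manual review

# "cohorts.txt" and its column names are placeholders for the real sheet.
with open("cohorts.txt") as f, open("cohorts_coded.txt", "w", newline="") as out:
    reader = csv.DictReader(f, delimiter="\t")
    writer = csv.writer(out, delimiter="\t")
    for row in reader:
        writer.writerow([row["cohort"], minor_code(row["category_group"])])
</pre>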
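For the 2018-07-30 employer matching, a hedged sketch of one way to do it: normalize names so punctuation and legal suffixes don't block exact matches, then test membership against the VC firm list. The file names, suffix list, and output layout are assumptions, not the project's actual method:

<pre>
import re

def normalize(name):
    """Lowercase and drop punctuation and common legal suffixes, so that
    'Sequoia Capital, LLC' and 'sequoia capital' compare equal."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    for suffix in (" llc", " lp", " inc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

# Placeholder input files: one firm/fund name per line, one employer per line.
firms = {normalize(line) for line in open("vc_firms.txt")}
with open("founder_employers.txt") as f:
    for line in f:
        employer = line.strip()
        is_vc = normalize(employer) in firms   # feeds one of the two new columns
        print(employer, int(is_vc), sep="\t")
</pre>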
2018-07-25: Converted the 608 PDFs to txt files using [[PDF to Text Converter]]. All of them produced txt files, but some are empty or do not contain the content of the paper, and I do not know of a way to clean up the output so that only the txt files that are actually academic papers remain (a filtering sketch follows this entry). Also worked on the Demo Day Timing Info data.
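One possible cleanup pass for the bad conversions in this entry: flag txt outputs that are empty or too short to be a real paper and review those by hand. The 2 KB threshold and the directory name are guesses, not values from the project:

<pre>
import os

TXT_DIR = "E:/McNair/Software/Patent_Thicket/txt"   # placeholder path

suspect = []
for name in os.listdir(TXT_DIR):
    path = os.path.join(TXT_DIR, name)
    if os.path.getsize(path) < 2048:     # empty or near-empty extraction
        suspect.append(name)

print(f"{len(suspect)} files need manual review:")
print("\n".join(sorted(suspect)))
</pre>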
2018-07-24: Realized that some PDFs did not download properly because the link was not to an immediate PDF (a validity-check sketch follows this entry). Found all the PDFs I could, ending up with 608 in total and 5 that I could not find PDFs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 PDFs.

2018-07-23: Found my missing PDFs. Seven failed to download through the script for some reason; I manually downloaded 6 of them and could not find the PDF for the last one. I now have 612 PDFs on file and am ready to start converting to txt tomorrow.

2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the PDF URLs through pdfdownloader.py but came up short with 602/613 PDFs. I found 8 papers that did not download successfully, but that still leaves 3 mystery papers, and I'm not sure which ones they are. Tried some SQL to figure it out, but it hasn't worked yet (a set-difference sketch follows below).
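A sketch of how the bad downloads from 2018-07-24 could be detected automatically: a real PDF starts with the magic bytes b'%PDF', while an HTML landing page saved with a .pdf extension does not. The directory name is a placeholder:

<pre>
import os

PDF_DIR = "E:/McNair/Software/Patent_Thicket/pdfs"   # placeholder path

for name in os.listdir(PDF_DIR):
    with open(os.path.join(PDF_DIR, name), "rb") as f:
        if not f.read(4).startswith(b"%PDF"):
            print("not a real PDF:", name)   # re-download this one by hand
</pre>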
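The 2018-07-20 mystery (613 URLs but only 602 files) is a set-difference problem; here is a sketch in Python rather than SQL. The rule for turning a URL into a saved filename is an assumption about how pdfdownloader.py names its output:

<pre>
import os

# Assumed naming rule: the downloader saves each PDF under the last
# path segment of its URL. Adjust if pdfdownloader.py does otherwise.
expected = {url.strip().rsplit("/", 1)[-1]
            for url in open("pdf_urls.txt")}
downloaded = set(os.listdir("E:/McNair/Software/Patent_Thicket/pdfs"))

for missing in sorted(expected - downloaded):
    print("never downloaded:", missing)
</pre>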
2018-07-19: Started converting PDFs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler, copied it, and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave an error when importing pdfminer because I had originally tried running it on Z, so I installed pdfminer.six, and the import stopped complaining under Python 3.6. It then gave a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. The script now runs but cannot convert any PDFs, and I have no idea how to fix this (a pdfminer.six smoke test follows this entry).
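A minimal smoke test for the import trouble in this entry, assuming a recent pdfminer.six release (which ships the high-level extract_text helper); the file name is a placeholder for any known-good PDF:

<pre>
from pdfminer.high_level import extract_text

text = extract_text("sample.pdf")   # any known-good PDF
print(text[:500])                   # if this prints text, the install works
</pre>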
2018-07-18: Finished running the rest of the 100 pages. It took quite a long time because Google Scholar was catching me after 2-5 pages rather than 5-10; switching between the different wifi networks (Rice Visitor, Rice Owls, eduroam) helped. Altogether this resulted in 958 BibTeX files and 613 PDFs from 1,000 entries. There might be more entries, but I'm not sure where to find them. I saved the data and code onto the RDP by connecting to it from the Selenium box.
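The blocking pattern in this entry (caught every 2-5 pages) is the usual rate-limit symptom; one common mitigation is a long randomized pause between page fetches. This is a generic sketch, not the crawler's actual code, and fetch_page here is just a stand-in download step:

<pre>
import random
import time
import urllib.request

def fetch_page(url):
    """Stand-in for the crawler's own download step."""
    return urllib.request.urlopen(url).read()

def crawl(urls):
    for url in urls:
        fetch_page(url)
        time.sleep(random.uniform(30, 90))   # jittered pause between pages
</pre>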
