Changes

2,704 bytes added , 19:53, 29 September 2020

<onlyinclude> [[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]

2018-08-03: Fixed/debugged minor coding with priority ranking. Helped Connor find timing info for missing companies. Cleaned up wiki pages.

2018-08-02: Redid minor codes with priority ranking.

2018-08-01: Entered the rest of the minor codes and arbitrarily picked the first one for those that had multiple codes attached to them.

2018-07-31: Minor coded cohorts based on the contents of the category group list in The File To Rule Them All. See Ed's Slack message for the key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a Python script to code most of them - E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py. The coded sheet is a Google Sheet that I will add to the wiki page once I make one.

2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in the Founder Experience table. Updated all wiki pages.

2018-07-27: Reformatted the timing info data to separate out the companies so that it looks like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser, but Maxine said she figured it out, so she will finish it. I will start on Founders Experience on Monday; I do not understand what minorcode lookup means.

2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.

2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. Every pdf produced a txt file, but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix this, or to filter the txt files down to only those that are actually academic papers. Worked on Demo Day Timing Info data.

2018-07-24: Realized that some pdfs did not download properly because the link did not point directly to a pdf. Found all the pdfs I could: 608 in total, plus 5 papers that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs.

2018-07-23: Found my missing pdfs. There were 7 that the script could not download for some reason, so I manually downloaded 6 of them; I could not find the pdf for the remaining 1. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.
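One possible way to triage the empty or near-empty txt files mentioned in the 2018-07-25 entry is to flag every file whose text falls below a length threshold. This is only a sketch: the function name `find_suspect_txt_files` and the 500-character default are hypothetical choices, not part of the existing scripts, and a length check cannot tell whether a long file is actually an academic paper.

```python
from pathlib import Path


def find_suspect_txt_files(txt_dir, min_chars=500):
    """Return names of .txt files in txt_dir whose stripped text is shorter than min_chars.

    min_chars is an assumed threshold; tune it after inspecting a few flagged files.
    """
    suspects = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        # errors="ignore" keeps a badly encoded file from stopping the scan
        text = path.read_text(encoding="utf-8", errors="ignore")
        if len(text.strip()) < min_chars:
            suspects.append(path.name)
    return suspects
```

Calling `find_suspect_txt_files` on the folder of converted files would give a list to review by hand; the flagged files are candidates for re-conversion, not confirmed failures.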

2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that were not downloaded successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Tried some SQL to figure it out, but it hasn't worked yet.
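A set difference is one way to track down papers like the 3 unaccounted-for ones, assuming each url can be mapped to the file name it should produce (the log does not say how pdfdownloader.py names its output, so that mapping is an assumption). The helper names below are hypothetical. The second helper checks for repeated names, since two urls saving to the same file name would be one possible, unconfirmed, cause of a shortfall.

```python
from collections import Counter
from pathlib import Path


def missing_downloads(expected_names, pdf_dir):
    """Return the expected file names that have no matching .pdf in pdf_dir, sorted."""
    present = {p.name for p in Path(pdf_dir).glob("*.pdf")}
    return sorted(set(expected_names) - present)


def duplicated_names(expected_names):
    """Return the names that occur more than once in the expected list, sorted."""
    counts = Counter(expected_names)
    return sorted(name for name, n in counts.items() if n > 1)
```

If `missing_downloads` returns fewer names than the shortfall, `duplicated_names` shows whether several urls collapse onto one file.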

2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when importing pdfminer, because I originally tried running it in Z. I installed pdfminer.six, and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.

Ed


Grace Tan (Work Log) (view source)

Revision as of 19:53, 29 September 2020
