Changes

4,557 bytes added, 19:53, 29 September 2020

no edit summary

{{McNair Staff|status=Active}}<onlyinclude>

[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]

2018-08-03: Fixed/debugged minor coding with priority ranking. Helped Connor find timing info for missing companies. Cleaned up wiki pages.

2018-08-02: Redid minor codes with priority ranking.

2018-08-01: Entered the rest of the minor codes and arbitrarily picked the first one for those that had multiple codes attached to them.

2018-07-31: Minor coded cohorts based on the contents of the category group list in The File To Rule Them All. See Ed's Slack message for the key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a Python script to code most of them: E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py. The coded sheet is a Google Sheet that I will add to the wiki page once I make one.

2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in the Founder Experience table. Updated all wiki pages.

2018-07-27: Reformatted the timing info data to separate out the companies so that it looks like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser, but Maxine said she figured it out, so she will finish it. I will start with Founders Experience on Monday, but I do not yet understand what minorcode lookup means.

2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.

2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files, but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix this or to filter the txt files down to only the ones that are actually academic papers. Worked on Demo Day Timing Info data.

2018-07-24: Realized that some pdfs did not download properly because the link did not point directly to a pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs.

2018-07-23: Found my missing pdfs. There were 7 that could not be downloaded through the script for some reason, so I manually downloaded 6 of them and could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.

2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not download successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some SQL to try to figure it out, but it hasn't worked yet.

2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.

2018-07-18: Finished running the rest of the 100 pages. It took quite a long time because Google Scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether this resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries, but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.

2018-07-17: Ran the Google Scholar crawler. When Google Scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of Google Scholar.
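The restart-at-the-last-page routine in the 2018-07-17 entry could be automated with a small checkpoint file. This is only a rough sketch, not the crawler's actual code: `fetch_page`, `BlockedError`, and the checkpoint file name are hypothetical stand-ins, and the real crawler navigates by clicking page numbers in a browser.

```python
import json
from pathlib import Path


class BlockedError(Exception):
    """Hypothetical error raised by fetch_page when the site returns 403."""


def load_checkpoint(path):
    """Return the next page to crawl, or 1 if no checkpoint exists yet."""
    p = Path(path)
    if not p.exists():
        return 1
    return json.loads(p.read_text())["next_page"]


def save_checkpoint(path, next_page):
    """Record the page the next run should start from."""
    Path(path).write_text(json.dumps({"next_page": next_page}))


def crawl(fetch_page, last_page, checkpoint_path):
    """Crawl from the checkpointed page through last_page.

    Stops at the first block so the next run resumes at the same page.
    Returns the page numbers completed in this run.
    """
    done = []
    page = load_checkpoint(checkpoint_path)
    while page <= last_page:
        try:
            fetch_page(page)  # hypothetical: processes one results page
        except BlockedError:
            break  # leave the checkpoint pointing at the blocked page
        done.append(page)
        page += 1
        save_checkpoint(checkpoint_path, page)
    return done
```

Rerunning `crawl` with the same checkpoint path picks up at the page that was blocked, so no page number has to be tracked by hand.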

2018-07-16: Ran through 10 pages of Google Scholar first thing without a problem. Tried running through all 100 pages but kept getting caught. Helped Augi with discrepancies in data and will try the Google Scholar crawler again tomorrow.

2018-07-13: Fixed the problem where pdf urls were not saving to the txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single reCAPTCHA all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered Google Scholar to find me. I changed the selector for the "next" button from the css element tag to the path, and I was able to get through the 4th page. It is still not able to click on the actual link, but I'm not sure if that's supposed to do anything.
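The two-file split from the 2018-07-13 entry might look something like the sketch below. It assumes a link counts as a pdf when its URL path ends in `.pdf`, which is a guess rather than the crawler's actual rule, and the function names and output paths are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_pdf_url(url):
    """Heuristic (an assumption): the URL path ends in .pdf, ignoring case."""
    return urlparse(url).path.lower().endswith(".pdf")


def save_urls(urls, pdf_path, other_path):
    """Write pdf links and non-pdf links to separate txt files, one per line.

    Returns (number of pdf urls, number of other urls).
    """
    pdf_urls = [u for u in urls if is_pdf_url(u)]
    other_urls = [u for u in urls if not is_pdf_url(u)]
    Path(pdf_path).write_text("".join(u + "\n" for u in pdf_urls))
    Path(other_path).write_text("".join(u + "\n" for u in other_urls))
    return len(pdf_urls), len(other_urls)
```

A suffix check like this misses links that serve a pdf from a URL without a `.pdf` extension, so the second file is worth reviewing by hand.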

2018-07-12: Figured out how to save BibTeX files to the computer. Still had to do reCAPTCHA tests. After giving it some time, I was able to run it completely once, but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further, the reCAPTCHA wasn't loading and gave me the error "Cannot contact reCAPTCHA. Check your connection and try again." I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.

Ed


Grace Tan (Work Log)

Revision as of 19:53, 29 September 2020
