Augi Liebster (Work Log)

From edegan.com
Revision as of 19:53, 29 September 2020 by Ed (talk | contribs)

Summer 2018

VentureXpert Data Geocode.py


Augi Liebster Work Logs

2018-08-06 to 2018-08-17: For the last two weeks I have continued to work on the construction of vcdb3. I built on Marcos Lee's code and ran my data through the clustering process.

2018-08-01: Found issues cleaning the geocoded data. Had difficulty extracting good data for Puerto Rico. The issue was that its rows were marked as exclude rows but actually need to be included. Will continue to work on it tomorrow. I think I have an idea for a solution but am having trouble manipulating some of the tables.
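
One way to express the Puerto Rico fix is as an override on the exclude flag. This is only a minimal Python sketch of the idea, not the code that was run; the field names `statecode` and `exclude` and the set `FORCE_INCLUDE` are hypothetical.

```python
# Hypothetical field names: "statecode" and "exclude" are assumptions,
# not the actual column names in the geocoded tables.
FORCE_INCLUDE = {"PR"}  # state codes that must be kept even if flagged

def keep_row(row):
    """Return True if the row should stay in the cleaned geocode data."""
    if row["statecode"] in FORCE_INCLUDE:
        return True  # override the exclude flag for Puerto Rico
    return not row["exclude"]
```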

2018-07-31: Moved on to geocoding. First had to learn about the process. Figured out I would need to include primary keys in the output table, so I changed the script a bit to reflect that.
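
The reason for carrying the primary keys through is that the geocoded output can then be joined back to the source table. A rough Python sketch of that change is below; it is not the actual Geocode.py script. The `geocode` argument is a stand-in for whatever lookup the script performs, and the `address` field and the key field names are assumptions.

```python
def geocode_rows(rows, geocode, key_fields):
    """Geocode each row and keep its primary key fields in the output.

    `geocode` is a hypothetical callable: address -> (lat, lng).
    `key_fields` names the columns that identify a row (assumed).
    """
    out = []
    for row in rows:
        lat, lng = geocode(row["address"])
        record = {k: row[k] for k in key_fields}  # carry the keys through
        record["lat"] = lat
        record["lng"] = lng
        out.append(record)
    return out
```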

2018-07-30: Worked on graphs all day for Ed. Called him twice to make sure that everything was in line with his expectations.

2018-07-27: Talked to Ed about fixes on the ranking tables including fixing numalive by including portcos that didn't have deaddates listed. Then finished cleaning fundbase tables. Called Ed and talked about formatting for the tables. Will work on the tables over the weekend.
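
The numalive fix amounts to treating a missing deaddate as "still alive". A minimal Python sketch of that rule follows; the field names `alivedate` and `deaddate` and the exact comparison used are assumptions, not the definition in the ranking tables.

```python
from datetime import date

def num_alive(portcos, as_of):
    """Count portcos alive on `as_of`.

    A portco with no deaddate (None) is counted as alive, which is the
    case the earlier version missed. Field names are hypothetical.
    """
    return sum(
        1
        for p in portcos
        if p["alivedate"] <= as_of
        and (p["deaddate"] is None or p["deaddate"] > as_of)
    )
```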

2018-07-26: Finished ranking data and formatted the tables for Ed to review. Started cleaning firmbase and fundbase tables using firmname and fundname as primary keys.

2018-07-25: Worked on ranking data. Finished ranking data for cities and moved on to states. Created the tables based on variables Ed told me to include in excel. Didn't style them yet but sent them to Ed for approval.

2018-07-24: Spent the majority of the day talking to Ed. Worked through the ExitKeysClean and PortCoExit with Ed, and then started to work on Roundbase Ranking.

2018-07-23: Came in Sunday and finished PortCoExit and ExitKeysClean tables. Was not getting desired results in ExitKeysClean as MaNoDups was not joining correctly. Figured out this was because the initial MA pull pulled state names instead of state codes, so the matching based on primary keys of portcos didn't work.
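
The join failure can be illustrated with a small Python sketch: if one side stores "California" and the other "CA", a key that includes the state never matches until both are normalized. The `STATE_CODES` map here is deliberately partial, and the `(name, state)` key is an assumption about what the portco primary key contains.

```python
# Partial, illustrative map; a real one would cover every state.
STATE_CODES = {"California": "CA", "Texas": "TX", "New York": "NY"}

def normalize_state(value):
    """Return a two-letter code whether given a code or a full name."""
    value = value.strip()
    if len(value) == 2 and value.isalpha():
        return value.upper()
    # Unknown names are returned unchanged so they can be inspected.
    return STATE_CODES.get(value.title(), value)

def join_on_key(portcos, mas):
    """Inner join on (name, normalized state); the key is hypothetical."""
    ma_index = {(m["name"], normalize_state(m["state"])): m for m in mas}
    joined = []
    for p in portcos:
        key = (p["name"], normalize_state(p["state"]))
        if key in ma_index:
            joined.append((p, ma_index[key]))
    return joined
```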

2018-07-20: Continued to construct the ExitKeysClean and PortCoExit tables. Spent a lot of time removing duplicates from data and making sure everything was clean. Had to deal with a few bugs in the data as I was getting strange numbers with selects when I performed joins.
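
Strange row counts after a join are often caused by duplicate keys on one side, so a check along these lines before joining can help. This is a generic Python sketch rather than the queries actually used, and the key field names are whatever the table uses (hypothetical in the example).

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(r[k] for k in key_fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

An empty result means the key is unique, so joining on it should not multiply rows.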

2018-07-19: Loaded the data into the database and began to construct the ExitKeysClean table. Learning SQL on the fly so had some beginner difficulties.

2018-07-18: Cleaned up MA data. Made a mistake yesterday because I didn't clean the data before running matches and thus was getting massive documents. Cleaned MA data based on state codes and date of first investment vs date the MA was announced. Removed multiples after cleaning and was able to salvage around 70 percent of data as good matches. Began the process for IPOs.
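
The cleaning rules in this entry can be sketched roughly as follows. This is a minimal Python illustration, not the process actually run. The field names are hypothetical, and two points are assumptions: that a valid MA must be announced on or after the first investment, and that "removing multiples" means dropping portcos that still match more than one MA.

```python
from collections import Counter

def clean_ma_matches(matches):
    """Filter portco-to-MA matches, then drop ambiguous portcos.

    Each match is a dict with hypothetical fields: portco, portco_state,
    ma_state, first_inv_date, announce_date (dates as datetime.date).
    """
    plausible = [
        m
        for m in matches
        if m["portco_state"] == m["ma_state"]          # same state code
        and m["announce_date"] >= m["first_inv_date"]  # assumed direction
    ]
    counts = Counter(m["portco"] for m in plausible)
    # Keep only portcos left with exactly one plausible match.
    return [m for m in plausible if counts[m["portco"]] == 1]
```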

2018-07-17: Waiting on Ed to respond, so that I can finish cleaning up my IPO and MA data and begin to build stacks in the database. Currently struggling with finding a way to ensure that a company is matched to itself when we match portcos against MAs and IPOs. Could not find a way to link these two using the data given in the MA database. IPOs seem to be doing fine. Spent the majority of my day helping Minh classify his Demo Day pages.

2018-07-16: Spent the day matching portcos with IPOs and MAs. Then cleaned the data using an Excel file. Almost finished IPOs. Made a mistake in filtering MAs but will go back and finish cleaning both MAs and IPOs by tomorrow. Slightly confused about cleaning the MA table since I do not see a way other than equivalence of state to determine whether a company is matched to itself or to a company with a similar name. Will either accept same state as an indicator or wait for Ed's response.

2018-07-13: Worked to standardize the company names using the matcher. Also uploaded the rest of the data that I could into the db.

2018-07-12: Spent the day struggling with the MA pull. Dylan figured out that the data will pull when requested in text rather than columnar form. Tomorrow will try to learn RegEx so that I can manage this file. Still stuck on USLongDescription as I have tried different ways of normalization and nothing has worked.

2018-07-11: Uploaded the rest of the tables that I was able to into the database. I am struggling with normalizing the USLongDescription and have tried the various ways given to solve the problem. I am stuck here and not sure how to proceed. I am similarly stuck with the MA table as I have still not been able to retrieve this data from SDC. I did update the Venture Xpert DataBase wiki page with information on loading the tables and the possible errors that could arise. For now, I am waiting on a response from Ed to see how I can continue to be productive.

2018-07-10: Struggled with pulling MA data from SDC for the majority of the morning. Tried creating a custom report from scratch and playing around with the variables but eventually gave up because SDC kept crashing. Moved on to loading data into the database. Created the database and loaded in roundbase and the ipo which seem to be consistent with former projects. Then read around to figure out how to normalize long description so that I could load it into the database but couldn't figure out what the documentation was trying to say.

2018-07-09: Continued to repull data from SDC in order to have the first two full quarters of 2018. Pulled everything except for the MAs. While I was waiting on the data to pull, I continued to go through Minh's data and mark which sites had a list of startups and which didn't.

2018-06-29: Planned out the construction of the database and checked all rpt files to make sure that all variables I would need were present. Then updated the SDC Platinum and VentureXpert Wikis to ensure that both were readable and thorough. Finally, helped Minh with creating his training data so that he would be able to create an accurate crawler. Sorted through previously pulled websites of accelerators with the keyword Demo Day and marked whether or not they had lists of the companies that had taken part in their cohorts. Marked about 500 websites.

2018-06-28: Organized my folder for the building of the database. Talked to Ed who suggested that I pull data from SDC for July so that we would have two full quarters of data to work with. Helped with RoundOnOneLine and gave tips for better organizing data.

2018-06-27: Finished extracting data from SDC. Have normalized everything that was normalized in the previous process. Am now waiting on Ed to discuss how he wants the new database designed so that I can begin to actually build the database. IPO and MA pulls took a while because of various errors, including an Out of Memory error which kept popping up. I slightly changed the rpt file and the pull ran quickly and effectively.

2018-06-26: Pulled data from SDC. Successfully pulled USPortCo1980-Present, CompanyLongDescription, USVCFirms1980-Present, USVCFunds1980-Present, and VCFunds1980-Present. Had some trouble figuring out which rpt files to use but messaged Ed and he clarified.

2018-06-25: Started to pull data from SDC. Did a few practice runs and then started to pull real data. Today I pulled USVCPortCo1980-Present and USCompanyLongDescription. I am having some trouble formatting both of them and need to sort out foreign countries from the data. Once I get down the formatting I will pull down the other datasets as well.

2018-06-22: Read all relevant pages to my project. Understand the process behind the building and have identified the master tables that will have to be built. Mapped out multiple trees to represent the stacks of tables created in the process of making the master tables. Need help understanding the SDC Platinum interface and how to pull data from there so that I can start to construct the database. Would like to meet with Ed to discuss his vision for redesigning the database.

2018-06-21: Continued to read through and understand the VCDatabase Rebuild wiki page. Found a number of logical and mathematical errors and have quickly realized that in the process of building the db I will have to rewrite the wiki.

2018-06-20: Began to read the VCDatabase Rebuild wiki page. Found page to be decently good at describing the process of building the db, but the process seems flawed. Confused about certain things where numbers seem to not add up or illogical statements are made. So far I have observed this in roundname duplicate check (not present), IPO distinct table (where there seem to be 10 entries missing) and maskey announcedates (where the use of min seems illogical). Potentially I am misunderstanding due to new exposure to SQL.

2018-06-19: Set up work stations on the balcony. Ordered extra cables needed to set up monitors with keyboard, a mouse, and my laptop. Got our projects; I will be redesigning the VC Database. Heard about other people's projects as the other interns were assigned theirs as well.

2018-06-18: Met all of the interns, Ed, and Anne. Was introduced to the database, learned basic SQL commands, and set up a wiki page. Also logged onto the RDP for the first time. Starting to learn the infrastructure.