|Has project status=Active
}}
 
=Current Project Write-Up=
 
==What Each File in the "Accelerator" Folder on the RDP Contains==
*"Accelerator List Sources" (Folder) - This folder contains most of the sources that we pulled accelerator names from at the very beginning of the project.
*"Code+Final_Data" (Folder) - This folder contains Peter's code for pulling the data from the text files in the "Data" folder.
*"Crunchbase Snapshot" (Folder) - This folder contains the data we obtained from Crunchbase. There is a massive amount of data which we will need to sort through to find useful information and hopefully match that data with our current cohort data.
*"Data" (Folder) - This folder contains all of our data on accelerators, including cohort information and the HTML files of each cohort page. I would estimate that it is currently about 95% clean.
*"Data - Copy" (Folder) - This is just a copy of our current "Data" folder.
*"Data_Copy" (Folder) - This is a copy of our original "Data" folder before we did any manual cleaning.
*"Enclosing_Circle" (Folder) - This folder seems to contain some data on VC but I'm not sure how it pertains to the Accelerator project.
*"F6S Accelerator HTMLs" (Folder) - This folder contains HTML copies of all the pages on the F6S website. We used it to add more potential accelerators to our list.
*"Google_SiteSearch" (Folder) - This folder contains Python code for Google searches.
*"Industry_Classifier" (Folder) - This folder seems to contain Python code but I'm not sure what for.
*"Matcher" (Folder) - This folder contains the Matcher.
*"Python WebCrawler" (Folder) - This folder contains code that is a work in progress for pulling descriptions from accelerator websites. It is Jeemin's project.
*"Cleaned Cohort Data Copy" (Excel File) - This file contains a copy of our cleaned cohort data.
*"Cleaned Cohort Data" (Excel File) - This file contains the most current, completely cleaned data on cohort company information.
*"NormalizeFixedWidth" (PL File) - This is the normalizer.
*"PortCoNames" (TXT File) - This file contains all of the names of the cohort companies as well as the accelerator they went through.
*"VC Data" (Excel File) - This file contains all of the names of the companies that have ever received VC funding.
*"VC_Data" (TXT File) - This file contains the non-normalized data for all of the VC information.
*"VC_Data_Names" (TXT File) - This file contains all of the names of companies that have received VC funding.
*"VC_Data_Names_Matched_PortCoNames" (Excel File) - This file contains all of the cohort companies that have also received VC funding. Still needs to be sorted through.
 
==Process==
After accumulating a large amount of data on accelerators, their cohorts, and their HTML files, we began cleaning the text files, which are located in the "Data" folder within "Accelerators". After the first round of cleaning, we ran a script over the cohort data that put all of the information into an Excel document called "Cleaned Cohort Data". Unfortunately, there were still some mistakes in the cohort information, which we fixed directly in the Excel file. As a result, some text files in the "Data" folder no longer match the "Cleaned Cohort Data" file: rerunning the cohort script on the "Data" folder would produce output that differs from the Excel file, which is problematic. The solution (other than manually cleaning the text files again) would be to write a script that uses the "Cleaned Cohort Data" file to correct the text files in the "Data" folder. We have also matched all of the cohort companies against our list of all companies that have received VC funding.
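
As an illustration of the last step, matching cohort company names against the list of VC-funded company names, here is a minimal Python sketch. It is not the project's Matcher or the "NormalizeFixedWidth" normalizer; the function names, the normalization rules, and the sample names are all assumptions made up for this example, and the real files would first have to be read into plain lists of names.

```python
import re

# Hypothetical set of legal suffixes to ignore; the real normalizer may use different rules.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_name(name):
    """Lowercase a company name, drop punctuation, and remove common legal suffixes."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def match_names(cohort_names, vc_names):
    """Return (cohort name, VC name) pairs whose normalized forms are identical."""
    vc_index = {}
    for vc_name in vc_names:
        # Keep the first VC name seen for each normalized key.
        vc_index.setdefault(normalize_name(vc_name), vc_name)

    matches = []
    for cohort_name in cohort_names:
        key = normalize_name(cohort_name)
        if key and key in vc_index:
            matches.append((cohort_name, vc_index[key]))
    return matches


# Made-up sample names, for illustration only.
cohort = ["Acme, Inc.", "Widget Labs", "Foo Co"]
vc = ["ACME Inc", "Gadget LLC", "widget labs"]
matched = match_names(cohort, vc)
```

Exact matching on normalized names is deliberately conservative: it misses misspellings and name variants, so any output would still need the kind of manual review noted for "VC_Data_Names_Matched_PortCoNames".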
=Current To Do=
