Latest revision as of 13:41, 21 September 2020


Project
Ecosystem Organization Classifier
Project Information
Has title Ecosystem Organization Classifier
Has owner Libby Bassini, Anne Freeman
Has start date
Has deadline date
Has project status Active
Is dependent on Crunchbase Database, VentureXpert Database
Does subsume Defining Incubators, Incubator Seed Data, Incubators in Five Ecosystems
Has sponsor Kauffman Incubator Project
Has project output Data
Copyright © 2019 edegan.com. All Rights Reserved.

Introduction

The purpose of this project is to build a classifier that takes the description of an ecosystem organization (e.g., a startup, a venture capitalist, or an incubator) and either classifies the organization's type or distinguishes incubators from non-incubators.

Related Projects

Subsumed Projects: Defining Incubators, Incubator Seed Data, Incubators in Five Ecosystems

This project is dependent on: Crunchbase Database, VentureXpert Database

Text Processing

There are two obvious methods for processing the textual descriptions. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases that discriminate between organization types. The second is a Word2Vec approach, which uses a shallow two-layer neural network to reduce each description to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.) We will try both approaches.
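As a minimal sketch of the bag-of-words idea, the following computes TF-IDF weights using only the Python standard library. It is not the project's code: the function names, the tokenizer, the unsmoothed log(N / document frequency) weighting, and the sample descriptions are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a description and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(descriptions):
    """Return one {term: weight} dict per description.

    weight = (term count / tokens in description)
             * log(number of descriptions / descriptions containing the term)
    """
    docs = [Counter(tokenize(d)) for d in descriptions]
    n_docs = len(docs)
    # Each description contributes at most 1 to a term's document frequency.
    doc_freq = Counter(term for doc in docs for term in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        })
    return weights

# Hypothetical descriptions, invented for illustration only.
sample = [
    "An incubator that supports early stage startups",
    "A venture capital firm that invests in startups",
    "A startup that builds accounting software",
]
weights = tfidf(sample)
for description, scores in zip(sample, weights):
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(description, "->", top)
```

Terms that appear in every description (here "that") receive a weight of zero, while terms confined to one description (here "incubator") are weighted highest, which is the property that makes TF-IDF useful for selecting discriminating words.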

Code built already

We have previously used bag-of-words in the Demo Day Page Google Classifier and in early versions of the Industry Classifier. Later versions of the Industry Classifier were based on our Deep Text Classifier project.

First data

For the first dataset, we will use organization descriptions from Crunchbase. Run the following on crunchbase3 (see Crunchbase Database):

\COPY (SELECT uuid, company_name, short_description FROM Organizations) TO 'CrunchbaseShortOrgDescs.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--744332
\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B on A.uuid=B.uuid) TO 'CrunchbaseLongOrgDescs.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--520698

The resulting files are in Z:\crunchbase3 and copied to E:\projects\crunchbase3.
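The exported files are tab-delimited CSV with a header row and NULLs written as empty strings, so they can be read with Python's csv module. The sketch below is one possible loader, not part of the project; the function name, the decision to skip rows with empty descriptions, and the sample rows are assumptions.

```python
import csv
import io

def read_descriptions(handle):
    """Yield (uuid, name, description) tuples from a tab-delimited export.

    The first row is the header written by the HEADER option. Rows whose
    description is empty (NULL was exported as '') are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) != 3:
            continue  # ignore malformed lines
        uuid, name, description = row
        if description.strip():
            yield uuid, name, description

# Hypothetical rows, invented for illustration. The real input would be opened
# with open("CrunchbaseShortOrgDescs.txt", newline="", encoding="utf-8"),
# where the UTF-8 encoding is an assumption.
sample_export = (
    "uuid\tcompany_name\tshort_description\n"
    "u1\tAcme Labs\tAn incubator for hardware startups\n"
    "u2\tNo Description Co\t\n"
    "u3\tQuoted Inc\t\"Builds tools\twith a tab inside\"\n"
)
rows = list(read_descriptions(io.StringIO(sample_export)))
for row in rows:
    print(row)
```

Reading by position rather than by column name lets the same function handle both files, since the third column is short_description in one and description in the other.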

We can use The Matcher (Tool) to match organization names to portfolio companies and to VC funds and firms taken from vcdb3 (see VentureXpert Database). We will also search this data by hand for incubators to get an initial set. Later on, we'll match our early list of incubators to Crunchbase organization names to expand our list.
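To illustrate what matching organization names involves, here is a deliberately simple sketch based on normalized exact matches. It does not reproduce The Matcher (Tool), whose behavior is not described here; the suffix list, function names, and sample names are all hypothetical.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "corp", "corporation", "co", "ltd"}

def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def match_names(crunchbase_names, vc_names):
    """Map each Crunchbase name to the VC-side names with the same normalized form."""
    index = {}
    for name in vc_names:
        index.setdefault(normalize(name), []).append(name)
    matches = {}
    for name in crunchbase_names:
        key = normalize(name)
        if key and key in index:
            matches[name] = index[key]
    return matches

# Hypothetical names, invented for illustration only.
matched = match_names(
    ["Acme Labs, Inc.", "Beta Fund"],
    ["ACME LABS", "Gamma Partners LLC"],
)
print(matched)
```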

Incubator Scores of Crunchbase Results

# | Company | Self Described [Y/N] | State | City | Region | Lists Client Companies [Y/N] | Fixed Duration [Y - 0 /N - 1] | Incubator Investment [Y - 0 /N - 1] | Cohorts [Y - 0 /N - 1] | Formal Application Process [Y - 0 /N - 1] | Incubator Score out of 4 | Notes (Foreign, Virtual, Social Impact, or other observations)

Process Notes for Calculating Incubator Scores

Two new files were generated from the crunchbase3 database as follows:

\COPY (SELECT uuid, company_name, short_description FROM Organizations WHERE country_code='USA' AND short_description LIKE '%incubat%') TO 'CrunchbaseShortOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--466

\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B on A.uuid=B.uuid WHERE country_code='USA' AND description LIKE '%incubat%') TO 'CrunchbaseLongOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV
--933

These files were put in E:\projects\crunchbase3.
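One property of the filter above is worth noting: in PostgreSQL, LIKE is case-sensitive, so '%incubat%' matches "incubator" but not a description that begins with "Incubator"; ILIKE is the case-insensitive variant. Whether the case-sensitive behavior was intended is not stated. The sketch below mimics both behaviors in Python on invented descriptions; the function names are hypothetical.

```python
def like_incubat(text):
    """Mimic LIKE '%incubat%': a case-sensitive substring test (NULL never matches)."""
    return text is not None and "incubat" in text

def ilike_incubat(text):
    """Mimic ILIKE '%incubat%': the same test, ignoring case."""
    return text is not None and "incubat" in text.lower()

# Hypothetical descriptions, invented for illustration only.
descriptions = [
    "A startup incubator for clean energy companies",
    "Incubator and coworking space for biotech founders",
    "We are INCUBATING new ideas in fintech",
    None,
]
for text in descriptions:
    print(repr(text), like_incubat(text), ilike_incubat(text))
```

If the case-insensitive match is what is wanted, replacing LIKE with ILIKE in the queries would be the equivalent change, and the row counts would likely differ from those recorded above.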

1. New file - Renamed E:\projects\crunchbase3\organizations as E:\projects\crunchbase3\organizations_OnlyIncubators_PlusIncubatorScores

2. Only US - Searched (CTRL+F) for "US", created a column filter for USA companies only, and deleted non-US-based organizations

3. Incubator - Searched (CTRL+F) for "incubator" and deleted organizations that didn't identify as an incubator

4. New Columns - Added the following columns:
   1. #
   2. Company, with URL to page linked
   3. Self-Identified Incubator [Y/N]
   4. State
   5. City
   6. Region
   7. Lists Client Companies [Y/N], with URL linked
   8. Fixed Duration [Y - 0 /N - 1]: startups at an incubator generally do not all stay for the same fixed duration; an incubator does not have a fixed graduation date for its startups, or has a program that lasts longer than one year
   9. Incubator Investment [Y - 0 /N - 1]: an incubator does not invest directly in the company or take equity in its startups
   10. Cohorts [Y - 0 /N - 1]: an incubator does not have limited-duration programs that ventures enter and exit in groups, known as cohorts or batches
   11. Formal Application Process [Y - 0 /N - 1]: a selective, competitive admissions process; a fixed, not rolling, application process
   12. Incubator Score out of 4: a score of 4 is most likely to be an incubator and a score of 0 is least likely, based on our baseline attributes for an incubator (see Defining Incubators)

5. Deleted Columns - funding_rounds; roles; permalink; domain; funding rounds

6. Delete Closed Incubators - Filtered the 'status' column to exclude organizations marked 'closed'

7. Made A Table - Converted entire worksheet into a table to filter more easily

8. Identified Self-Identified Incubators - Created a custom AutoFilter on the 'short description' column with the condition 'contains: incubat'
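The scoring in step 4 can be sketched as follows. This reads the [Y - 0 /N - 1] labels as Y scoring 0 and N scoring 1, and assumes the Incubator Score is the sum of the four scored columns (Fixed Duration, Incubator Investment, Cohorts, Formal Application Process); the summing rule, the key names, and the example organization are assumptions rather than something the notes state outright.

```python
# Hypothetical keys for the four scored columns.
CRITERIA = (
    "fixed_duration",
    "incubator_investment",
    "cohorts",
    "formal_application_process",
)

def incubator_score(answers):
    """Sum the four criteria, scoring 'N' as 1 and 'Y' as 0.

    `answers` maps each criterion name to 'Y' or 'N'. A result of 4 means
    most likely an incubator; 0 means least likely.
    """
    score = 0
    for criterion in CRITERIA:
        value = answers[criterion].strip().upper()
        if value not in ("Y", "N"):
            raise ValueError(f"{criterion} must be 'Y' or 'N', got {value!r}")
        score += 1 if value == "N" else 0
    return score

# Hypothetical organization, invented for illustration only.
example = {
    "fixed_duration": "N",
    "incubator_investment": "N",
    "cohorts": "Y",
    "formal_application_process": "Y",
}
print(incubator_score(example))
```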