Difference between revisions of "Hubs: Hubs Data"

From edegan.com
 
(30 intermediate revisions by 3 users not shown)
=Background=
+
=Hubs Pages=
This page summarizes current work in progress on the Hubs data. To see the weeds of the WIP see [[Hubs: Mechanical Turk]].
+
*The main page for Hubs can be found: [[Hubs (Academic Paper)]]
 +
*For the current work in progress for building the Hubs datasheet for the scorecard go to: [[Hubs: Hubs Scorecard]]
 +
*For a tracker of work in progress for the dataset building for the scorecard go to [[Hubs: Hubs Data Building]]
 +
*For a high-level overview of the variables for the scorecard go to [[Hubs: Hubs Data]]
  
 
=List of Variables=
 
 +
For a more in-depth description of the variables and procedure, please see [[Hubs: Hubs Scorecard]].  This page lists the variables being collected, separated into three categories.  Each variable includes a breakdown of the levels being collected (if the definition is not trivial) and an approximate approach.
 +
 +
 +
 +
'''07/29''' Ariel: code Hubs variable for Hubs
 +
:<code>E:/McNair/Projects/Hubs/Hubs Variable-Ariel</code>
 +
 +
 +
 +
 
'''As of Week of 7/25'''
 
 
===Group 1===
 
'''Variables Easy to Obtain'''
+
'''Variables Difficult to Obtain'''
#Twitter activity
+
#'''Founding Date''' ''(date_founded)''
#*Variables Obtained: Twitter Handle, # Tweets in a Month, # Followers, # Retweets
+
#*''' ''Difficulty:'' ''' Finding date based on our strategies
#Site URL
+
#*''' ''New Approach:'' '''
#Address
+
#*#Whois.net Date
#*See multiple locations for
+
#*#Factiva/other press release searches
#*To
+
#'''Multiple locations within city + Franchise''' (as of now just addresses) ''(multi_address)''
#Nonprofit status (Binary)
+
#*''' ''Difficulty:'' ''' Company or establishment level will impact measurements
#*Uses: http://www.guidestar.org/
+
#*''' ''New Approach:'' ''' Will record all addresses at company level
#Mission statement
+
#'''Onsite Venture Capital v. Angel Investors''' (e.g. # and Assets Under Management) ''(onsite_Vc_bin)/(onsite_vc_list)'' ''(onsite_angel_bin)/etc.''
#*If not explicitly stated mission statement, will include "About" or statements on main page
+
#*''' ''Levels:'' ''' Binary, list of investors
#Specific Industry
+
#*''' ''Difficulty:'' ''' Hub website usually does not include investors
#*Based on Mission Statement, not aggregated
+
#*''' ''New Approach:'' '''
#Price for a space/office
+
#*#Google key terms with address of Hub
#*Not just membership, must be desk space
+
#*#Start with partners and use google/crunchbase
  
 
===Group 2===
 
'''Variables Comfortable, Not Complete'''
+
'''Variables Comfortable, Not Complete''' (rough order of most difficult to least difficult)
#Onsite Mentors
+
#'''Onsite accelerator''' ''(onsite_accel_bin)/(onsite_accel_cnt)/(onsite_accel_list)''
#Office hours investors
+
#*''' ''Levels:'' ''' Binary, count, list
#Office hours mentor/advisors
+
#*''' ''Difficulty:'' ''' Usually not a list, which requires more scrubbing as many other variables just require us to find one page on a website.
#Sponsors/Partners
+
#*''' ''Approach:'' '''
#*University
+
#*#Google searches and procedure to use on website yields decent results
#*Corporate
+
#*#Similar procedure to onsite investors
#Community membership?
+
#'''Size (# members)''' ''(num_members)''
#Onsite temporary workshops and Networking Meetups (Count)
+
#*''' ''Levels:'' ''' Count for companies (currently not planning to include list of companies given that some potential hubs have 200+ members)
#*
+
#*''' ''Difficulty:'' ''' Some companies list only selected members rather than all of them, others do not separate current members from alumni, and some just write "we have served more than 120 startups..."
#*Levels:  
+
#*''' ''Approach:'' ''' For companies that have a full list, we will count the members.  For those that list only selected members, we will count those listed and check for a statement of the total.  For those that only give a "with over ..." statement, we will record the number followed by + (e.g. "120+").
#Curriculum
+
#'''Office hours investors''' and '''Office hours mentor/advisors''' ''(OH_bin)/(OH_inv_bin)/(OH_inv_list)/etc.''
#Onsite code school
+
#*''' ''Levels:'' ''' Binary for OH, binary for two separate OH, list of names/descriptions of OH
#Alumni Network
+
#*''' ''Difficulty:'' ''' Some companies do not list who OH are with, not always obvious if investor, mentor, or advisor, sometimes not clear if mentor is investor/future investor
#Size (sqft)
+
#*''' ''Approach:'' ''' Google approach to get to OH pages and then lookup key words in description to separate out
#Size (# companies)
+
#'''Onsite temporary workshops and Networking Meetups''' (Count) ''(onsite_temp_events_bin)/(onsite_temp_workshop_bin)/(onsite_temp_workshop_cnt)/etc.''
#Onsite accelerator
+
#*''' ''Levels:'' '''  Binary for do they exist, count for each
 +
#*''' ''Difficulty:'' ''' Difficult for Turkers to differentiate between these two and also other potential events (e.g. symposiums)
 +
#*''' ''Approach:'' ''' Uses key search terms (e.g. Java/etc.) to separate out workshops and key terms (e.g. lunch/happy hour) for networking meetings
 +
#'''Onsite code school''' and '''Curriculum''' ''(onsite_long_term_courses)/(onsite_code_school_bin)''
 +
#*''' ''Levels:'' '''  Binary for do they exist, binary for each
 +
#*''' ''Difficulty:'' ''' Difficult for Turkers to differentiate between long-term coding programs for individuals and curriculum for startups
 +
#*''' ''Approach:'' ''' Uses key search terms (e.g. specific code schools) to separate out known code schools and also to look into key terms (e.g. leadership) for curriculum
 +
#'''Sponsors/Partners''' (University, Corporate) ''(sponsors_cnt)/(sponsors_list)/etc.''
 +
#*''' ''Levels:'' ''' Count, list of sponsors/partners (if exist), separate columns for university and corporate
 +
#*''' ''Difficulty:'' ''' Not all companies list sponsors, partners, or either.  The difference among sponsors, partners, and investors is not always clear.
 +
#*''' ''Approach:'' ''' Use two different levels and use of google search, then if list exists, separate by "college"/"university" and rest
 +
#'''Alumni Network''' ''(alumni_bin)/(alumni_list)''
 +
#*''' ''Levels:'' ''' Binary, list
 +
#*''' ''Difficulty:'' ''' Not all companies list alumni, some only list "selected"
 +
#*''' ''Approach:'' ''' Include all that have lists
 +
#'''Size (sqft)''' ''(size_sqft)''
 +
#*''' ''Levels:'' ''' Number in sqft
 +
#*''' ''Difficulty:'' ''' Not all companies list square feet online
 +
#*''' ''Approach:'' '''
 +
#*#Google search with key words
 +
#*#If results do not appear, use of press releases is possible
 +
#'''Onsite Mentors''' ''(onsite_mentors_bin)/(onsite_mentors_cnt)/(onsite_mentors_list)''
 +
#*''' ''Levels:'' ''' Count and list of mentors (if exist)
 +
#*''' ''Difficulty:'' ''' Not all companies list mentors - bigger issue is onsite investors
 +
#*''' ''Approach:'' ''' Use two different levels and use of google search
  
 
===Group 3===
 
'''Variables Difficult to Obtain'''
+
'''Variables Easy to Obtain'''
#Founding Date
+
#'''Twitter activity''' ''(twit_handle)/(twit_prev_mon_cnt_tweets)/(twit_cnt_followers)/(twit_cnt_retweets)''
#*Very difficult to obtain based on our strategies
+
#*''' ''Levels:'' ''' Twitter Handle, # Tweets in a Month, # Followers, # Retweets
#*New Approach: website URL
+
#*''' ''Approach:'' ''' Easy to get twitter handle from Turk or Veeral's code that allows us to run a series of searches on google and then use Gunny's Twitter crawler to get other levels from handle
#Onsite Venture Capital v. Angel Investors (e.g. # and Assets Under Management)
+
#'''Site URL''' ''(url)''
#*Difficult to obtain
+
#*''' ''Levels:'' ''' URL
#*Levels: Binary, list of
+
#*''' ''Approach:'' ''' Google using Veeral's code that allows us to search
#*New Approach:
+
#''' ''Whois Date'' ''' ''(date_whois)''
#Multiple locations within city + Franchise  (as of now just addresses)
+
#*''' ''Levels:'' ''' Date
#*Company or establishment level will impact measurements
+
#*''' ''Approach:'' ''' Date active website was registered
#*Will record all addresses at company level
+
#'''Address''' ''(address)''
 +
#*''' ''Levels:'' ''' Will include all addresses
 +
#*''' ''Approach:'' ''' Google key terms (e.g. Contact Us) and URL using Veeral's code
 +
#'''Nonprofit status''' ''(nonprofit_binary)''
 +
#*''' ''Levels:'' ''' Binary variable indicating if the potential Hub is a nonprofit organization
 +
#*''' ''Approach:'' ''' http://www.guidestar.org/ is a site that we can use to search if a company is nonprofit or not
 +
#'''Mission statement''' ''(missions_stmt)''
 +
#*''' ''Levels:'' ''' Official mission statement or description of company (if mission does not exist)
 +
#*''' ''Approach:'' ''' If not explicitly stated mission statement, will include "About" or statements on main page
 +
#'''Specific Industry''' ''(spec_industry)''
 +
#*''' ''Levels:'' ''' Industry included in statement (no aggregation)
 +
#*''' ''Approach:'' ''' Based on Mission Statement, not aggregated
 +
#'''Price for a space/office''' ''(price_space)''
 +
#*''' ''Levels:'' ''' Two prices one for shared, other for private
 +
#*''' ''Approach:'' ''' Uses google methodology with key terms and URL
 +
[[Category: Internal]]
 +
[[Internal Classification: Legacy| ]]

Latest revision as of 17:35, 2 September 2016

Hubs Pages

List of Variables

For a more in-depth description of the variables and procedure, please see Hubs: Hubs Scorecard. This page lists the variables being collected, separated into three categories. Each variable includes a breakdown of the levels being collected (if the definition is not trivial) and an approximate approach.


07/29 Ariel: code Hubs variable for Hubs

E:/McNair/Projects/Hubs/Hubs Variable-Ariel



As of Week of 7/25

Group 1

Variables Difficult to Obtain

  1. Founding Date (date_founded)
    • Difficulty: Finding date based on our strategies
    • New Approach:
      1. Whois.net Date
      2. Factiva/other press release searches
  2. Multiple locations within city + Franchise (as of now just addresses) (multi_address)
    • Difficulty: Company or establishment level will impact measurements
    • New Approach: Will record all addresses at company level
  3. Onsite Venture Capital v. Angel Investors (e.g. # and Assets Under Management) (onsite_Vc_bin)/(onsite_vc_list) (onsite_angel_bin)/etc.
    • Levels: Binary, list of investors
    • Difficulty: Hub website usually does not include investors
    • New Approach:
      1. Google key terms with address of Hub
      2. Start with partners and use google/crunchbase
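The founding-date approach in Group 1 (a Whois date first, then press release searches) could be sketched roughly as follows. This is only an illustration: the `Creation Date:` label is an assumption about what the raw WHOIS text looks like (registries use different labels), the function names are hypothetical, and treating the Whois date as the preferred source is an assumption read off the order of the list above.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: the WHOIS record labels its registration date "Creation Date".
# Other registries use other labels, so this pattern is illustrative only.
CREATION_RE = re.compile(
    r"^\s*Creation Date:\s*(\d{4}-\d{2}-\d{2})",
    re.IGNORECASE | re.MULTILINE,
)


def whois_creation_date(whois_text: str) -> Optional[date]:
    """Return the registration date found in raw WHOIS text, or None."""
    match = CREATION_RE.search(whois_text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


def founding_date(whois_text: str,
                  press_release_date: Optional[date] = None) -> Optional[date]:
    """Prefer the WHOIS date; fall back to a date found via press releases."""
    return whois_creation_date(whois_text) or press_release_date
```

A domain registration date is only a proxy for date_founded, so it may be worth storing which source each value came from alongside the date itself.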

Group 2

Variables Comfortable, Not Complete (rough order of most difficult to least difficult)

  1. Onsite accelerator (onsite_accel_bin)/(onsite_accel_cnt)/(onsite_accel_list)
    • Levels: Binary, count, list
    • Difficulty: Accelerators are usually not presented as a list, so more scrubbing is needed; many other variables only require us to find one page on a website.
    • Approach:
      1. Google searches and procedure to use on website yields decent results
      2. Similar procedure to onsite investors
  2. Size (# members) (num_members)
    • Levels: Count for companies (currently not planning to include list of companies given that some potential hubs have 200+ members)
    • Difficulty: Some companies list only selected members rather than all of them, others do not separate current members from alumni, and some just write "we have served more than 120 startups..."
    • Approach: For companies that have a full list, we will count the members. For those that list only selected members, we will count those listed and check for a statement of the total. For those that only give a "with over ..." statement, we will record the number followed by + (e.g. "120+").
  3. Office hours investors and Office hours mentor/advisors (OH_bin)/(OH_inv_bin)/(OH_inv_list)/etc.
    • Levels: Binary for OH, binary for two separate OH, list of names/descriptions of OH
    • Difficulty: Some companies do not list who the OH are with; it is not always obvious whether the person is an investor, mentor, or advisor, and sometimes it is not clear whether a mentor is also an investor or future investor
    • Approach: Use the Google approach to reach the OH pages, then look up key words in the description to separate the two types
  4. Onsite temporary workshops and Networking Meetups (Count) (onsite_temp_events_bin)/(onsite_temp_workshop_bin)/(onsite_temp_workshop_cnt)/etc.
    • Levels: Binary for whether they exist, count for each
    • Difficulty: Difficult for Turkers to differentiate between these two and also other potential events (e.g. symposiums)
    • Approach: Uses key search terms (e.g. Java/etc.) to separate out workshops and key terms (e.g. lunch/happy hour) for networking meetings
  5. Onsite code school and Curriculum (onsite_long_term_courses)/(onsite_code_school_bin)
    • Levels: Binary for whether they exist, binary for each
    • Difficulty: Difficult for Turkers to differentiate between long-term coding programs for individuals and curriculum for startups
    • Approach: Uses key search terms (e.g. specific code schools) to separate out known code schools and also to look into key terms (e.g. leadership) for curriculum
  6. Sponsors/Partners (University, Corporate) (sponsors_cnt)/(sponsors_list)/etc.
    • Levels: Count, list of sponsors/partners (if exist), separate columns for university and corporate
    • Difficulty: Not all companies list sponsors, partners, or either. The difference among sponsors, partners, and investors is not always clear.
    • Approach: Use two different levels together with a Google search; if a list exists, separate entries containing "college"/"university" from the rest
  7. Alumni Network (alumni_bin)/(alumni_list)
    • Levels: Binary, list
    • Difficulty: Not all companies list alumni, some only list "selected"
    • Approach: Include all that have lists
  8. Size (sqft) (size_sqft)
    • Levels: Number in sqft
    • Difficulty: Not all companies list square feet online
    • Approach:
      1. Google search with key words
      2. If results do not appear, use of press releases is possible
  9. Onsite Mentors (onsite_mentors_bin)/(onsite_mentors_cnt)/(onsite_mentors_list)
    • Levels: Count and list of mentors (if exist)
    • Difficulty: Not all companies list mentors - bigger issue is onsite investors
    • Approach: Use two different levels together with a Google search
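Two of the Group 2 rules above lend themselves to a short sketch: recording num_members (count a full list, otherwise write a stated minimum as "120+") and separating workshops from networking meetups by key terms. All function names and parameters below are hypothetical. Only "Java", "lunch" and "happy hour" come from this page; the other terms are placeholders. Preferring a stated figure over the count when the member list is only partial is an assumption, since the page does not say which of the two gets recorded.

```python
from typing import List, Optional

# "java", "lunch" and "happy hour" are taken from the page; the remaining
# terms are hypothetical placeholders to be replaced with the real term lists.
WORKSHOP_TERMS = ("java", "python", "workshop")
NETWORKING_TERMS = ("lunch", "happy hour", "meetup")


def encode_member_count(listed_members: Optional[List[str]],
                        list_is_complete: bool,
                        stated_minimum: Optional[int]) -> Optional[str]:
    """Encode num_members as a string such as "57" or "120+"."""
    if listed_members and list_is_complete:
        return str(len(listed_members))      # full list: plain count
    if stated_minimum is not None:
        return f"{stated_minimum}+"          # "more than 120 startups" -> "120+"
    if listed_members:
        return str(len(listed_members))      # partial list, no stated total
    return None                              # nothing usable was found


def classify_event(description: str) -> str:
    """Label an event as "workshop", "networking", or "other" by key terms."""
    text = description.lower()
    if any(term in text for term in WORKSHOP_TERMS):
        return "workshop"
    if any(term in text for term in NETWORKING_TERMS):
        return "networking"
    return "other"                           # e.g. symposiums
```

Substring matching is deliberately crude (for example, "java" also matches "javascript"), so anything labelled "other" would still need a manual look.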

Group 3

Variables Easy to Obtain

  1. Twitter activity (twit_handle)/(twit_prev_mon_cnt_tweets)/(twit_cnt_followers)/(twit_cnt_retweets)
    • Levels: Twitter Handle, # Tweets in a Month, # Followers, # Retweets
    • Approach: Easy to get the Twitter handle from Turk or from Veeral's code, which runs a series of Google searches; Gunny's Twitter crawler then gets the other levels from the handle
  2. Site URL (url)
    • Levels: URL
    • Approach: Google using Veeral's code that allows us to search
  3. Whois Date (date_whois)
    • Levels: Date
    • Approach: Date active website was registered
  4. Address (address)
    • Levels: Will include all addresses
    • Approach: Google key terms (e.g. Contact Us) and URL using Veeral's code
  5. Nonprofit status (nonprofit_binary)
    • Levels: Binary variable indicating if the potential Hub is a nonprofit organization
    • Approach: http://www.guidestar.org/ is a site that we can use to search if a company is nonprofit or not
  6. Mission statement (missions_stmt)
    • Levels: Official mission statement or description of company (if mission does not exist)
    • Approach: If there is no explicitly stated mission statement, will include the "About" text or statements on the main page
  7. Specific Industry (spec_industry)
    • Levels: Industry included in statement (no aggregation)
    • Approach: Based on Mission Statement, not aggregated
  8. Price for a space/office (price_space)
    • Levels: Two prices, one for a shared space and one for a private space
    • Approach: Uses google methodology with key terms and URL
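The mission statement rule in Group 3 (use the official mission statement if one exists, otherwise the "About" text or statements on the main page) amounts to a simple fallback chain. The sketch below assumes the page text has already been collected into a dictionary. The keys `mission`, `about` and `main_page` and the function name are hypothetical, and trying "About" before the main page is an assumption about the order.

```python
from typing import Dict, Optional

# Hypothetical section keys, tried in this order (the ordering of "about"
# before "main_page" is an assumption, not stated on the page).
FALLBACK_ORDER = ("mission", "about", "main_page")


def pick_mission_statement(sections: Dict[str, str]) -> Optional[str]:
    """Return the first non-empty text among mission, about, and main page."""
    for key in FALLBACK_ORDER:
        text = sections.get(key, "").strip()
        if text:
            return text
    return None
```

For example, `pick_mission_statement({"about": "We help startups."})` falls through the missing mission entry and returns the "About" text.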