Featured projects

From edegan.com

This page describes five featured projects. They are among the largest and most visited of the 185 projects on this wiki (a count that excludes write-ups on regular pages).

U.S. Seed Accelerators

The U.S. Seed Accelerators project subsumes several related projects. These projects were intended to assemble near-population data on high-growth, high-tech seed accelerators in the U.S. and to understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the Kauffman Incubator Project.


VCDB20H1

The VCDB20H1 project documents the build of vcdb20h1, a Venture Capital DataBase covering data through the end of the first half (H1) of 2020. Vcdb20h1 includes investments, funds, startups, executives, and exits derived from VentureXpert data. This project updates vcdb4, which covered (almost) to the end of Q3 2019. See also: SDC Normalizer.

Federal Grant Data

The Federal Grant Data project collects and processes NIH Data, NSF Data, and other federal grant information from structured government sources and imports it into a relational database. See also: The Trial Data Project and the FDA Trials Data project.

Reproducible Patent Data

The Reproducible Patent Data project is a continuation of the Redesigning Patent Database project. It aims to write faster, more centralized code to process data from the United States Patent and Trademark Office (USPTO). With an end-to-end pipeline, we can easily reproduce or update data without worrying about unintentional side effects or missing data. See also the Patent Data umbrella project.

The Matcher (Tool)

The Matcher (Tool) matches and merges datasets using company names as identifiers. It is written in Perl and implements both normalization and fuzzy matching techniques. The normalization methods include 'Hall' and others used by the NBER Patent Data project. The fuzzy matching supports a range of techniques (Ngram, LCS, etc.) that can be used with threshold-based cut-offs or to generate candidate lists for human processing or machine learning.
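
The general idea of normalizing names, scoring them with a fuzzy measure, and applying a threshold can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the Matcher's actual code. The suffix list, the function names, the 0.8 threshold, and the scoring formulas (a Dice coefficient over character bigrams, and a longest-common-subsequence ratio as one reading of "LCS") are all assumptions made for illustration. In particular, the suffix rule is a stand-in and does not reproduce the 'Hall' normalization.

```python
import re

# Hypothetical suffix list for illustration only; NOT the 'Hall' rules.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "ltd", "limited"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def ngrams(s, n=2):
    """Set of character n-grams of s, padded with one space on each side."""
    padded = f" {s} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_score(a, b, n=2):
    """Dice coefficient over character n-gram sets (assumed scoring rule)."""
    if not a or not b:
        return 0.0
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == bj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a, b):
    """LCS length scaled to the range 0..1 by the mean string length."""
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def candidates(name, pool, threshold=0.8, scorer=ngram_score):
    """Return (score, original name) pairs at or above threshold, best first."""
    target = normalize(name)
    scored = [(scorer(target, normalize(p)), p) for p in pool]
    return sorted([(s, p) for s, p in scored if s >= threshold], reverse=True)
```

Calling `candidates("Acme Widgets Inc", pool)` with a low threshold yields a broad candidate list suitable for human review, while a high threshold behaves like an automatic cut-off. Passing `scorer=lcs_score` swaps the similarity measure without changing the rest of the flow.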