Difference between revisions of "Jeemin Sim (Work Log)"

From edegan.com
Jump to navigation Jump to search
 
(48 intermediate revisions by one other user not shown)
Line 1: Line 1:
====2/6/2017 MONDAY ''2PM-6PM''====  
+
[[Jeemin Sim]] [[Work Logs]] [[Jeemin Sim (WorkLog)|(log page)]]
* Set up wikiPage & remote desktop.  
+
 
* Started working on python version of web crawler. So far it successfully prints out a catchphrase/ description for one website. To be worked on. The python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
+
===Spring 2018===
 +
2017-01-11:
 +
*Was briefed on [[Urban Start-up Agglomeration and Venture Capital Investment]]
 +
*Going forward - running regressions in R in PostgreSQL and with desired layers (instead of levels)
 +
 
 +
===Fall 2017===
 +
<onlyinclude>
 +
 
 +
2017-12-05:
 +
* Uploaded tables with KML files in TIF folder in bulk(E:) drive
 +
** Allegheny County
 +
** Atlanta
 +
** Chicago
 +
** Columbus, OH
 +
** Dublin, OH
 +
** Vermont
 +
** Washington, D.C.
 +
 
 +
*Updated documentation for:
 +
** http://www.edegan.com/wiki/TIF_Project#Uploading_TIF_Data_onto_database_.28tigertest.29
 +
 
 +
* shp2pgsql needs to be installed to upload Shapefiles to PostgreSQL database
 +
** applies for Houston, TX and Dallas, TX
 +
** DONE
 +
 
 +
2017-12-04:
 +
* To upload KML file into database with specified table name:
 +
researcher@McNairDBServ:/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml -nln chicagotif
 +
* chicagotif table now resides in tigertest database
 +
 
 +
2017-11-30:
 +
* Displayed map with TIF districts and startups
 +
* Location:
 +
E:\McNair\Projects\Agglomeration\TIF\McNair_Color_Chicago_TIF_and_startups.png
 +
* Added information and steps to ArcMap/ArcGIS documentation page
 +
* Create a project page for 'Working with POSTGIS' and add instructions for uploading KML file onto POSTGIS
 +
** Command used :
 +
*** Logged in as : researcher@McNairDBServ
 +
/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml
 +
* chicago TIF kml file currently downloaded in tigertest with a table name of Layer0
 +
** Figure out how to change layer name while loading kml file
 +
** Instructions pulled from : http://wiki.wildsong.biz/index.php/Loading_data_into_PostGIS#Loading_data_from_KMZ_files
 +
 
 +
2017-11-28:
 +
* Created and edited [[ArcMap / ArcGIS Documentation]]
 +
* Plotted points for TIFS and Startups in Chicago in one map.
 +
** Location:  
 +
E:\McNair\Projects\Agglomeration\TIF\Jeemin_Chicago_TIF_and_Startups_Attempt1
 +
* Used https://mygeodata.cloud/result to convert from KML file to CSV (which was then saved as txt file to be uploaded onto ArcMap)
 +
* Text files located in Local Disk (C:) Drive
 +
 
 +
2017-11-27:
 +
*Download Chicago TIF data
 +
 
 +
2017-11-13:
 +
*
 +
 
 +
2017-11-09:
 +
* Notes on data downloaded:
 +
** Year 2010-2012 data are based on total population, not 25 yrs or over (case for all other tables)
 +
*** Record appears five times total, with same exact column name
 +
***For exmaple: 'Total; Estimate; High school graduate (includes equivalency)' appears five times, with different values.
 +
* TODO:
 +
** Make Projects page for ACS Data
 +
***[[American Community Survey (ACS) Data]]
 +
 
 +
2017-11-07:
 +
*Yesterday, narrowed down columns of interest from ACS_S1501_educationattain_2016 table.
  
====2/8/2017 WEDNESDAY''9AM-11AM''====
+
Id
* Attempted to come up with possible cases for locating the description of accelerators - pick up from extracting bodies of text from the about page (given that it exists)
+
Id2
 +
Geography
 +
Total; Estimate; Population 25 years and over
 +
Total; Estimate; Population 25 years and over - High school graduate (includes equivalency)
 +
Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
 +
Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
 +
Percent; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
 +
Total; Estimate; Population 25 years and over - Associate's degree
 +
Total; Margin of Error; Population 25 years and over - Associate's degree
 +
Percent; Estimate; Population 25 years and over - Associate's degree
 +
Percent; Margin of Error; Population 25 years and over - Associate's degree
 +
Total; Estimate; Population 25 years and over - Bachelor's degree
 +
Total; Margin of Error; Population 25 years and over - Bachelor's degree
 +
Percent; Estimate; Population 25 years and over - Bachelor's degree
 +
Percent; Margin of Error; Population 25 years and over - Bachelor's degree
 +
Total; Estimate; Population 25 years and over - Graduate or professional degree
 +
Total; Margin of Error; Population 25 years and over - Graduate or professional degree
 +
Percent; Estimate; Population 25 years and over - Graduate or professional degree
 +
Percent; Margin of Error; Population 25 years and over - Graduate or professional degree
 +
Percent; Estimate; Percent high school graduate or higher
 +
Percent; Margin of Error; Percent high school graduate or higher
 +
Percent; Estimate; Percent bachelor's degree or higher
 +
Percent; Margin of Error; Percent bachelor's degree or higher
  
====2/13/2017 MONDAY ''2PM-6PM''====
+
*Complications:
* Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project
+
**For csv files corresponding to years 2015 & 2016, all of the above columns exist.
* [[Trial Data Project]]
+
**For csv files corresponding to years 2005 - 2014, no 'Percent' columns exist
 +
*** Instead their 'Total' columns are percentage values
 +
**For csv file corresponding to year 2005, columns regarding Graduate or professional degree are labeled differently.
 +
**2012 data doesn't correspond to Population 25 years and over.
  
====2/15/2017 WEDNESDAY ''9AM-11AM''====
+
*Temporary Solution:
* Discussed with Catherine what to do with FDA Trial data and decided to have a dictionary with zip-codes as keys and number of trials occurred in that zipcode as values. Was still attempting to loop through the files without the code having to exist in the same directory as the XML files. Plan to write to excel via tsv, with zip-code as one column and # of occurrence as the other.
+
**Since the above problems may be specific to this set of tables, will go through csv files and adjust columns.
  
====2/17/2017 FRIDAY ''2PM-6PM''====
+
*Python script location:
* Completed code for counting the number of occurrences for each unique zipcode. (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+min because of the comprehensive XML data files. Meanwhile started coding to create a dictionary with the keys corresponding to each unique trial ID, mapped to every other information (location, sponsors, phase, drugs ...etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).
+
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py
  
====2/20/2017 MONDAY ''2PM-4:30PM''====
+
2017-10-31:
* Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of those information in a list. The other zipcode file did not finish executing after 2+ hours of running it - considering the possibility of splitting the record file into smaller bits, or running the processing on a faster machine.
+
* Finished doanloading files from ACS.
 +
* Started loading tables into tigertest.
 +
* Commands run could be found in  
 +
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\DataLoading_SQL_Commands.txt
  
====2/22/2017 WEDNESDAY ''9AM-12:30PM''====
+
2017-10-30:
* Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop as going through each file
+
* Downloaded data from ACS, to be continued
 +
* File path:
 +
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data
 +
* Fields of interest:  
 +
S1401 SCHOOL ENROLLMENT
 +
S1501 EDUCATIONAL ATTAINMENT
 +
S2301 EMPLOYMENT STATUS
 +
B01003 TOTAL POPULATION
 +
B02001 RACE
 +
B07201 GEOGRAPHICAL MOBILITY
 +
B08303 TRAVEL TIME TO WORK
 +
B19013 MEDIAN HOUSEHOLD INCOME
 +
B19053 SELF-EMPLOYMENT INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS
 +
B19083 GINI INDEX OF INCOME INEQUALITY
 +
B25003 TENURE
 +
B25105 MEDIAN MONTHLY HOUSING COSTS
 +
B28011 INTERNET SUBSCRIPTIONS IN HOUSEHOLD
 +
G001 GEOGRAPHIC IDENTIFIERS
  
====2/24/2017 FRIDAY ''2:30PM-6:30PM''====
+
2017-10-23:
* Continued working on producing multiple tables - first two are done. Was working on location, as there are multiple location tags per location.
+
* Talked to Ed with Peter & Oliver about upcoming tasks & projects.
 +
* Loaded acs_place table 2017 (does not contain population) on tigertest.
 +
** SQL commands used:
  
====2/27/2017 MONDAY ''2PM-6PM''====
+
DROP TABLE acs_place;
* Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
 
* Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
 
* Started web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on Rice Wikipedia page. Can be found in : E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py
 
  
====3/1/2017 WEDNESDAY ''9AM-12PM''====
+
CREATE TABLE acs_place (
* Started re-running Jeemin_FDATrial_as_key_data_ripping.py
+
        USPS varchar(5),
 +
        GEOID  varchar(30),
 +
        ANSICODE varchar(30),
 +
        NAME varchar(100),
 +
        LSAD varchar(30),
 +
        FUNCSTAT varchar(10),
 +
        ALAND varchar(30),
 +
        AWATER varchar(30),
 +
        ALAND_SQMI varchar(30),
 +
        AWATER_SQMI varchar(30),
 +
        INTPTLAT varchar(30),
 +
        INTPTLONG varchar(30)
 +
);
  
====3/3/2017 FRIDAY ''2PM-5PM''====
+
\COPY acs_place FROM '/bulk/2017_Gaz_place_national.txt';
* Attempted to output sql tables
+
--COPY 29578
  
====3/6/017 MONDAY ''2PM-6PM''====
+
* TODO:
* [[Installing python in a database]]
+
** Find acs place 2016 data for population
* Added building Python function section to [[Working with PostgreSQL]] at the bottom of the page.
+
** Find larger acs files, ideally at the place level
* Ran FDA Trial data ripping again, as the text output files were wiped.
+
** Provide more documentation on POSTGIS & geocoding
* Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.
 
* Talked to Sonia about pulling city, state, zipcode information, hence python was installed in a database. Will work with Sonia on Wednesday afternoon and see how best a regex function could be implemented
 
  
====3/8/2017 WEDNESDAY ''9AM-12PM''====
+
2017-10-16:
* Output sql tables from finished run of Jeemin_FDATrial_as_key_data_ripping.py
+
* Exported maps of points from the Bay Area each year.  
* Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be spelled wrong. There were many.
+
** Map used location: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year\BayAreaEveryYearMap
* Tried to logic through creating a pattern that could catch all different versions of UNIVERSITY. Discuss further on whether UNIVERSITIES and those that include UNIVERSITIES but include INC in the end should be pulled as relevant information
+
** Zoom scale: 1:650.000
 +
* Location of Bay Area Points png files:
 +
  E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year
  
====3/8/2017 WEDNESDAY ''2PM-5PM ''====
+
2017-10-10:
* Wrote regex pattern that identifies all "university" matchings - can be found in E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA -- is an output file
+
*Discoveries/ Struggles regarding.gdb to .shp file conversion:
* Talked to Sonia, but didn't come to solid conclusion on identifying whether key words associate with city or country by running a python function
+
** Esri Production Mapping (costly)
 +
*** License needs to be purchasesd: http://www.esri.com/software/arcgis/extensions/production-mapping/pricing
 +
** Use ogr2ogr from gdal package
 +
*** https://gis.stackexchange.com/questions/14432/migrating-geodatabase-data-into-postgis-without-esri-apps
 +
*** Command: ogr2ogr -f "ESRI Shapefile" [Destination of shapefile] [path to gdb file]
 +
*** Problem installing gdal
  
====3/13/2017 MONDAY ''12PM-2PM''====
+
2017-10-09:
* For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE(output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA).  
+
* TODO'S:
* [[University Patent Matching]]
+
** Downloading data onto tigertest
* To be worked on later: Grant XML parsing & general name matcher
+
*** Road
 +
*** Railway
 +
*** Coastline
 +
*** Instructions: http://www.edegan.com/wiki/PostGIS_Installation#Bulk_Download_TIGER_Shapefiles
 +
** Configure census data from American Community Survey (ACS)
 +
*** 1) Work out what data is of our interest (confirm ACS)
 +
*** 2) Determine appropriate shape file unit:  
 +
**** census block vs. census block group vs. census track
 +
*** 3) Load into tigertest
  
====3/14/2017 TUESDAY ''12PM-2PM''====
+
* Done:
* Started pulling academy cases but there are too many cases to worry about, in terms of institution of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
+
** Downloaded data from https://www.census.gov/cgi-bin/geo/shapefiles/index.php
* Need Julia/Meghana to look through the hits and see which are relevant & extract pattern from there.  
+
  tl_2017_us_coastline -- 4209
* Having trouble outputting txt file without double quotes around every line.
+
  tl_2017_us_primaryroads -- 11574
* Thinking that one text file should be output for all keywords instead of having one each, to avoid overlap (ex) COLLEGE and UNIVERSITY are both keywords; ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY will be hit twice if it were counted as two separate instances, one accounting for COLLEGE and the other for UNIVERSITY) - either in the form of if-elseif statements or one big regex check.
+
  tl_2017_us_rails -- 176237
 +
** Link found to potentially download ACS data: https://www.census.gov/geo/maps-data/data/tiger-data.html
 +
*** But most files on it come with .gdb extension and not .shp
  
====3/15/2017 WEDNESDAY ''9AM-1PM''====
+
2017-10-03:
* Todo: write a wikipage on possible input/output info on string matcher
+
* Installed PostGIS & is now visible on pgAdmin III
* Wrote part of XML parser, extracted yearly data into E:\McNair\Projects\Federal Grant Data\NSF\NSF Extracted Data (up to year 2010)
 
  
====3/16/2017 THURSDAY ''12PM-2PM''====
+
* ArcGIS (connect to postgis database):
* Further documented [[University Patent Matching]]
+
** 1) Open ArcMap
* Finished writing XML Parser
+
** 2) Either open blank or open existing file/project
 +
** 3) Click on 'Add Data' button with a cross and a yellow diamond (under Selection toolbar)
 +
** 4) Go to the top-most directory by pressing on the arrow that points left-then-up (on the left of home button)
 +
** 5) Click on 'Database Connections'
 +
** 6) Click on 'Add Database Connection' (if Connection to localhost.sde) does not exist already)
 +
** 7) Fill in the following fields:
 +
*** Database Platform: PostgreSQL
 +
*** Instance: localhost
 +
*** User name: postgres
 +
*** Password:
 +
*** Database: tigertest
 +
** 8) Press 'OK'
 +
** 9) Now you'll have 'Connection to localhost.sde' in your Database Connections
 +
** 10) Double click on 'Connection to localhost.sde'
 +
** 11) Double click on the table of interest
 +
** 12) Click 'Finish'
 +
** 13) You'll see information populated on map, as one of the 'Layers'
 +
*** Tested with: tigertest.public.copointplacescontains
  
====3/20/2017 MONDAY ''2PM-6PM''====
+
* On running & altering Oliver's script:
* Talked to Julia about universal matcher, want to combine all University of California's to University of California, The Regents of
+
** Location: E:\McNair\Projects\OliverLovesCircles\src\python\vc_circles.py
* Converted crunchbase2013 data from mySQL to PostgreSQL, but having trouble with the last table - cb_relationships, complains about syntax error at or near some places - but generally all tables exist in database called crunchbase
+
** Ed manipulated file names so that underscores would replace dots (St.Louis --> St_Louis)
* Federal Grant Data XML Parser was run - the three output textfiles can be found in E:\McNair\Projects\Federal Grant Data\NSF
+
** Takes in instances and sweep times as part of the argument, but not impactful as those variables are hardcoded in the script
 +
** Ran vc_circles.py with the following variables with changed values:
 +
*** SWEEP_CYCLE_SECONDS = 10 (used to be 30)
 +
*** NUMBER_INSTANCES = 16 (used to be 8)
 +
** New output to be found in: E:\McNair\Projects\OliverLovesCircles\out
  
====3/22/2017 WEDNESDAY ''9AM-12PM''====
+
2017-10-02:
* Read string matching & calculating distance, below are relevant links
+
* Talked to Harrison & Peter regarding ArcGIS
* [http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf]
+
** Currently have points plotted on Houston
* [http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html]
+
** Trouble interpreting geometry type, as currently reads in from text file
 +
** Documents located in : E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS
 +
* Attempted to install PostGIS spatial extention from PostgreSQL but getting 'spatial database creation failed' error message.
 +
** Referenced instructions:
 +
*** https://www.gpsfiledepot.com/tutorials/installing-and-setting-up-postgresql-with-postgis/
 +
*** http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01
  
====3/24/2017 FRIDAY ''2PM-5PM''====
+
2017-09-26:
* Discussed with Julia & Meghana about university keys to use to count # of occurrences, including aliases and misspellings
+
* Created a table that maps a state to the database name.
* Thoughts: to use a scoring metric with a key of UNIVERSITY OF CALIFORNIA SYSTEM, it should have a 'better' score when compared to MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY or CALIFORNIA AT LOS ANGELES, UNVIERSITY OF than when compared to UNIVERSITY OF SOUTHERN CALIFORNIA, which may pose a challenge when attempting to implement this in a more general sense. In normalizing a string, strip "THE", "," and split words by spaces and compare each keyword from the two strings. Deciding on which strings to compare will be another issue - length (within some range maybe) could be an option.
+
** http://www.edegan.com/wiki/PostGIS_Installation#Translating_Table_names_to_corresponding_States
* Federal Grant Data XML Parser was rerun - same output textfiles
+
* Added more GIS-information (functions, realm & outliers to consider)
 +
** http://www.edegan.com/wiki/Urban_Start-up_Agglomeration#GIS_Resources
 +
* Visualization in PostGIS or connecting to ArcGIS for visualization (import/export data)
 +
* Spatial indexing:
 +
** http://revenant.ca/www/postgis/workshop/indexing.html
  
====3/27/2017 MONDAY ''2PM-6PM''====
+
2017-09-25:
* Writing code for university matches - decided to go through keys instead of each dataitem. Use keywords in each key to go through the dataitem - misspellings are currently unaccounted for.
+
* Talked to Ed about GIS, Census data, and going about determining the correctness of reported 'place.' Currently script makes a cross product of each reported place and an existing place, outputting a column of boolean value to indicate whether the reported place's coordinates fell within a place's geometric boundaries. One other way of going about this which we discussed is to first check if the reported place does fall within that place's boundaries. If it isn't, we'll go about the cross product method.
  
====3/29/2017 WEDNESDAY''2PM-5PM''====
+
* To add documentation :
* Troubled by the variety of cases - separating keys by keywords will not work favorably when it hits University of California vs. University of Southern California case - find a way to match University of Southern California first (more specific ones first) - but how to generalize
+
** http://www.edegan.com/wiki/PostGIS_Installation
.
+
** http://www.edegan.com/wiki/Urban_Start-up_Agglomeration
.
 
.
 
====4/12/2017 WEDNESDAY''9AM-12PM''====
 
* Finishing up cleaning the columns for Federal Grant Data - NIH. The output excel files can be accessed at:
 
E:\McNair\Projects\Federal Grant Data\NIH\Grants
 
Titled:
 
    Jeemin_combined_files 1986-2001.csv
 
    Jeemin_combined_files 2002-2012.csv
 
    Jeemin_combined_files 2013-2015.csv
 
  
* psql table formula:
+
* Discussed the need to maintain venture capital database.
  
CREATE TABLE all_grants (
+
*Relevant File paths:
  APPLICATION_ID integer,
+
**E:\McNair\Projects\Agglomeration\TestGIS.sql
  ACTIVITY varchar(3),
+
**Z:\VentureCapitalData\SDCVCData\vcdb2\ProecssingCoLevelSimple.sql
  ADMINISTERING_IC varchar(2),
+
**Z:\VentureCapitalData\SDCVCData\vcdb2\CitiesWithGT10Active.txt
  APPLICATION_TYPE varchar(1),
 
  ARRA_FUNDED varchar(1),
 
  AWARD_NOTICE_DATE date,
 
  BUDGET_START date,
 
  BUDGET_END date,
 
  CFDA_CODE varchar(3),
 
  CORE_PROJECT_NUM varchar(11),
 
  ED_INST_TYPE varchar(30),
 
  FOA_NUMBER varchar(13),
 
  FULL_PROJECT_NUM varchar(35),
 
  FUNDING_ICs varchar(40),
 
  FUNDING_MECHANISM varchar(23),
 
  FY smallint,
 
  IC_NAME varchar(77),
 
  NIH_SPENDING_CATS varchar(295),
 
  ORG_CITY varchar(20),
 
  ORG_COUNTRY varchar(16),
 
  ORG_DEPT varchar(30),
 
  ORG_DISTRICT smallint,
 
  ORG_DUNS integer,
 
  ORG_FIPS varchar(2),
 
  ORG_NAME varchar(60),
 
  ORG_STATE varchar(2),
 
  ORG_ZIPCODE integer,
 
  PHR varchar(200),
 
  PI_IDS varchar(30),
 
  PI_NAMEs varchar(200),
 
  PROGRAM_OFFICER_NAME varchar(36),
 
  PROJECT_START date,
 
  PROJECT_END date,
 
  PROJECT_TERMS varchar(200),
 
  PROJECT_TITLE varchar(244),
 
  SERIAL_NUMBER smallint,
 
  STUDY_SECTION varchar(4),
 
  STUDY_SECTION_NAME varchar(100),
 
  SUBPROJECT_ID smallint,
 
  SUFFIX varchar(2),
 
  SUPPORT_YEAR smallint,
 
  DIRECT_COST_AMT integer,
 
  INDIRECT_COST_AMT integer,
 
  TOTAL_COST integer,
 
  TOTAL_COST_SUB_PROJECT integer
 
);
 
\COPY all_grants FROM 'Jeemin_combined_files 1986-2001.csv' WITH DELIMITER AS E'\t' HEADER NULL AS ''CSV
 
  
====4/14/2017 FRIDAY''2PM-5PM''====
+
2017-09-21:
* Loaded Federal Grants Data into database
+
* Functions for Linear Referencing:
 +
'''ST_LineInterpolatePoint(geometry A, double measure)''': Returns a point interpolated along a line.
 +
'''ST_LineLocatePoint(geometry A, geometry B)''': Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
 +
'''ST_Line_Substring(geometry A, double from, double to)''': Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
 +
'''ST_Locate_Along_Measure(geometry A, double measure)''': Return a derived geometry collection value with elements that match the specified measure.
 +
'''ST_Locate_Between_Measures(geometry A, double from, double to)''': Return a derived geometry collection value with elements that match the specified range of measures inclusively.
 +
'''ST_AddMeasure(geometry A, double from, double to)''': Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
  
====4/17/2017 MONDAY''2PM-4PM''====
+
*3-D Functions:
 +
'''ST_3DClosestPoint''' — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
 +
'''ST_3DDistance''' — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
 +
'''ST_3DDWithin''' — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
 +
'''ST_3DDFullyWithin''' — Returns true if all of the 3D geometries are within the specified distance of one another.
 +
'''ST_3DIntersects''' — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
 +
'''ST_3DLongestLine''' — Returns the 3-dimensional longest line between two geometries
 +
'''ST_3DMaxDistance''' — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
 +
'''ST_3DShortestLine''' — Returns the 3-dimensional shortest line between two geometries
  
====4/17/2017 WEDNESDAY''9AM-12PM''====
+
*Relevant PostgreSQL Commands:
* To pull accelerators: Wrote simple python regex-based script that ran on organizations data.
+
  '''\dt *.*''' Show all tables
  Code: E:\McNair\Projects\Accelerators\Crunchbase Snapshot\accelerator keywords.py
+
  '''\q''' Exit table
  Matched output (885 mathces) : E:\McNair\Projects\Accelerators\Crunchbase Snapshot\Jeemin_885_accel_matches
 
  
====9/11/2017 MONDAY''4PM-6PM''====
+
*Specifities/ Outliers to consider:
* Ensured that documentation exists for the projects worked on last semester.
+
New York (decompose)
 +
Princeton area (keep Princeton  unique)
 +
Reston, Virginia (keep)
 +
San Diego (include La Jolla)
 +
Silicon Valley (all distinct)
  
====9/12/2017 TUESDAY''9AM-10:40AM & 1PM-2:20PM & 4PM-5:30PM''====
+
* Continue reading from: https://postgis.net/docs/postgis_installation.html
* Clarified University Matching output file.
 
* Helped Christy with pdf-reader, capturing keywords in readable format.
 
  
..
+
2017-09-20:
 +
* Attended first intro to GIS course yesterday
 +
* Updated above notes on GIS
  
====9/18/2017 MONDAY''4PM-6PM''====
 
* Read documentation on PostGIS and tiger geocoder
 
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/joins.html
 
  
====9/19/2017 TUESDAY''9AM-10:40AM & 1PM-2:20PM''====
+
2017-09-19:
 
* Useful functions for spatial joins:
 
* Useful functions for spatial joins:
  
Line 223: Line 351:
 
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/geography.html
 
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/geography.html
  
====9/20/2017 WEDNESDAY''3:50PM-5:20PM''====
 
* Attended first intro to GIS course yesterday
 
* Updated above notes on GIS
 
* Continue reading from:
 
  
====9/21/2017 THURSDAY''9AM-12:30PM''====
+
2017-09-18:
* Functions for Linear Referencing:
+
* Read documentation on PostGIS and tiger geocoder
  '''ST_LineInterpolatePoint(geometry A, double measure)''': Returns a point interpolated along a line.
+
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/joins.html
  '''ST_LineLocatePoint(geometry A, geometry B)''': Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
+
 
'''ST_Line_Substring(geometry A, double from, double to)''': Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
+
2017-09-12:
  '''ST_Locate_Along_Measure(geometry A, double measure)''': Return a derived geometry collection value with elements that match the specified measure.
+
* Clarified University Matching output file.
  '''ST_Locate_Between_Measures(geometry A, double from, double to)''': Return a derived geometry collection value with elements that match the specified range of measures inclusively.
+
* Helped Christy with pdf-reader, capturing keywords in readable format.
'''ST_AddMeasure(geometry A, double from, double to)''': Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
+
 
 +
2017-09-11:
 +
* Ensured that documentation exists for the projects worked on last semester.
 +
 
 +
</onlyinclude>
 +
 
 +
===Spring 2017===
 +
 
 +
2017-04-17:
 +
* To pull accelerators: Wrote simple python regex-based script that ran on organizations data.
 +
  Code: E:\McNair\Projects\Accelerators\Crunchbase Snapshot\accelerator keywords.py
 +
  Matched output (885 mathces) : E:\McNair\Projects\Accelerators\Crunchbase Snapshot\Jeemin_885_accel_matches
 +
 
 +
2017-04-14:
 +
* Loaded Federal Grants Data into database
 +
 
 +
2017-04-12:
 +
* Finishing up cleaning the columns for Federal Grant Data - NIH. The output excel files can be accessed at:
 +
  E:\McNair\Projects\Federal Grant Data\NIH\Grants
 +
  Titled:
 +
    Jeemin_combined_files 1986-2001.csv
 +
    Jeemin_combined_files 2002-2012.csv
 +
    Jeemin_combined_files 2013-2015.csv
 +
 
 +
* psql table formula:
  
*3-D Functions:
+
CREATE TABLE all_grants (
'''ST_3DClosestPoint''' — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
+
  APPLICATION_ID integer,
'''ST_3DDistance''' — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
+
  ACTIVITY varchar(3),
'''ST_3DDWithin''' — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
+
  ADMINISTERING_IC varchar(2),
'''ST_3DDFullyWithin''' — Returns true if all of the 3D geometries are within the specified distance of one another.
+
  APPLICATION_TYPE varchar(1),
'''ST_3DIntersects''' — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
+
  ARRA_FUNDED varchar(1),
  '''ST_3DLongestLine''' — Returns the 3-dimensional longest line between two geometries
+
  AWARD_NOTICE_DATE date,
  '''ST_3DMaxDistance''' — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
+
  BUDGET_START date,
'''ST_3DShortestLine''' — Returns the 3-dimensional shortest line between two geometries
+
  BUDGET_END date,
 +
  CFDA_CODE varchar(3),
 +
  CORE_PROJECT_NUM varchar(11),
 +
  ED_INST_TYPE varchar(30),
 +
  FOA_NUMBER varchar(13),
 +
  FULL_PROJECT_NUM varchar(35),
 +
  FUNDING_ICs varchar(40),
 +
  FUNDING_MECHANISM varchar(23),
 +
  FY smallint,
 +
  IC_NAME varchar(77),
 +
  NIH_SPENDING_CATS varchar(295),
 +
  ORG_CITY varchar(20),
 +
  ORG_COUNTRY varchar(16),
 +
  ORG_DEPT varchar(30),
 +
  ORG_DISTRICT smallint,
 +
  ORG_DUNS integer,
 +
  ORG_FIPS varchar(2),
 +
  ORG_NAME varchar(60),
 +
  ORG_STATE varchar(2),
 +
  ORG_ZIPCODE integer,
 +
  PHR varchar(200),
 +
  PI_IDS varchar(30),
 +
  PI_NAMEs varchar(200),
 +
  PROGRAM_OFFICER_NAME varchar(36),
 +
  PROJECT_START date,
 +
  PROJECT_END date,
 +
  PROJECT_TERMS varchar(200),
 +
  PROJECT_TITLE varchar(244),
 +
  SERIAL_NUMBER smallint,
 +
  STUDY_SECTION varchar(4),
 +
  STUDY_SECTION_NAME varchar(100),
 +
  SUBPROJECT_ID smallint,
 +
  SUFFIX varchar(2),
 +
  SUPPORT_YEAR smallint,
 +
  DIRECT_COST_AMT integer,
 +
  INDIRECT_COST_AMT integer,
 +
  TOTAL_COST integer,
 +
  TOTAL_COST_SUB_PROJECT integer
 +
  );
 +
  \COPY all_grants FROM 'Jeemin_combined_files 1986-2001.csv' WITH DELIMITER AS E'\t' HEADER NULL AS ''CSV
  
*Relevant PostgreSQL Commands:
+
2017-03-29:
'''\dt *.*''' Show all tables
+
* Troubled by the variety of cases - separating keys by keywords will not work favorably when it hits University of California vs. University of Southern California case - find a way to match University of Southern California first (more specific ones first) - but how to generalize
'''\q''' Exit table
 
  
*Specifities/ Outliers to consider:
+
2017-03-27:
New York (decompose)
+
* Writing code for university matches - decided to go through keys instead of each dataitem. Use keywords in each key to go through the dataitem - misspellings are currently unaccounted for.
Princeton area (keep Princeton  unique)
 
Reston, Virginia (keep)
 
San Diego (include La Jolla)
 
Silicon Valley (all distinct)
 
  
* Continue reading from: https://postgis.net/docs/postgis_installation.html
+
2017-03-24:
 +
* Discussed with Julia & Meghana about university keys to use to count # of occurrences, including aliases and misspellings
 +
* Thoughts: to use a scoring metric with a key of UNIVERSITY OF CALIFORNIA SYSTEM, it should have a 'better' score when compared to MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY or CALIFORNIA AT LOS ANGELES, UNVIERSITY OF than when compared to UNIVERSITY OF SOUTHERN CALIFORNIA, which may pose a challenge when attempting to implement this in a more general sense. In normalizing a string, strip "THE", "," and split words by spaces and compare each keyword from the two strings. Deciding on which strings to compare will be another issue - length (within some range maybe) could be an option.  
 +
* Federal Grant Data XML Parser was rerun - same output textfiles
  
====9/25/2017 MONDAY''4PM-6PM''====
+
2017-03-22:
* Talked to Ed about GIS, Census data, and going about determining the correctness of reported 'place.' Currently script makes a cross product of each reported place and an existing place, outputting a column of boolean value to indicate whether the reported place's coordinates fell within a place's geometric boundaries. One other way of going about this which we discussed is to first check if the reported place does fall within that place's boundaries. If it isn't, we'll go about the cross product method.
+
* Read string matching & calculating distance, below are relevant links
 +
* [http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf]
 +
* [http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html]
  
* To add documentation :  
+
2017-03-22:
** http://mcnair.bakerinstitute.org/wiki/PostGIS_Installation
+
* Talked to Julia about universal matcher, want to combine all University of California's to University of California, The Regents of
** http://mcnair.bakerinstitute.org/wiki/Urban_Start-up_Agglomeration
+
* Converted crunchbase2013 data from mySQL to PostgreSQL, but having trouble with the last table - cb_relationships, complains about syntax error at or near some places - but generally all tables exist in database called crunchbase
 +
* Federal Grant Data XML Parser was run - the three output textfiles can be found in E:\McNair\Projects\Federal Grant Data\NSF
  
* Discussed the need to maintain venture capital database.
+
2017-03-16:
 +
* Further documented [[University Patent Matching]]
 +
* Finished writing XML Parser
  
*Relevant File paths:
+
2017-03-15:
**E:\McNair\Projects\Agglomeration\TestGIS.sql
+
* Todo: write a wikipage on possible input/output info on string matcher
**Z:\VentureCapitalData\SDCVCData\vcdb2\ProecssingCoLevelSimple.sql
+
* Wrote part of XML parser, extracted yearly data into E:\McNair\Projects\Federal Grant Data\NSF\NSF Extracted Data (up to year 2010)
**Z:\VentureCapitalData\SDCVCData\vcdb2\CitiesWithGT10Active.txt
 
  
====9/26/2017 TUESDAY ''9AM-12PM''====
+
2017-03-14:
* Created a table that maps a state to the database name.
+
* Started pulling academy cases but there are too many cases to worry about, in terms of institution of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
** http://mcnair.bakerinstitute.org/wiki/PostGIS_Installation#Translating_Table_names_to_corresponding_States
+
* Need Julia/Meghana to look through the hits and see which are relevant & extract pattern from there.  
* Added more GIS-information (functions, realm & outliers to consider)
+
* Having trouble outputting txt file without double quotes around every line.  
** http://mcnair.bakerinstitute.org/wiki/Urban_Start-up_Agglomeration#GIS_Resources
+
* Thinking that one text file should be output for all keywords instead of having one each, to avoid overlap (ex) COLLEGE and UNIVERSITY are both keywords; ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY will be hit twice if it were counted as two separate instances, one accounting for COLLEGE and the other for UNIVERSITY) - either in the form of if-elseif statements or one big regex check.
* Visualization in PostGIS or connecting to ArcGIS for visualization (import/export data)
 
* Spatial indexing:
 
** http://revenant.ca/www/postgis/workshop/indexing.html
 
  
.
+
2017-03-13:
.
+
* For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE(output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA).  
.
+
* [[University Patent Matching]]
====10/2/2017 MONDAY ''4PM-6PM''====
+
* To be worked on later: Grant XML parsing & general name matcher
* Talked to Harrison & Peter regarding ArcGIS
 
** Currently have points plotted on Houston
 
** Trouble interpreting geometry type, as currently reads in from text file
 
** Documents located in : E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS
 
* Attempted to install PostGIS spatial extention from PostgreSQL but getting 'spatial database creation failed' error message.
 
** Referenced instructions:  
 
*** https://www.gpsfiledepot.com/tutorials/installing-and-setting-up-postgresql-with-postgis/
 
*** http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01
 
  
====10/3/2017 TUESDAY ''10AM-12PM & 1PM-2:30PM''====
+
2017-03-08:
* Installed PostGIS & is now visible on pgAdmin III
+
* Wrote regex pattern that identifies all "university" matchings - can be found in E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA -- is an output file
 +
* Talked to Sonia, but didn't come to solid conclusion on identifying whether key words associate with city or country by running a python function
 +
* Output sql tables from finished run of Jeemin_FDATrial_as_key_data_ripping.py
 +
* Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be spelled wrong. There were many.
 +
* Tried to logic through creating a pattern that could catch all different versions of UNIVERSITY. Discuss further on whether UNIVERSITIES and those that include UNIVERSITIES but include  INC in the end should be pulled as relevant information
  
* ArcGIS (connect to postgis database):
+
2017-03-06:
** 1) Open ArcMap
+
* [[Installing python in a database]]
** 2) Either open blank or open existing file/project
+
* Added building Python function section to [[Working with PostgreSQL]] at the bottom of the page.
** 3) Click on 'Add Data' button with a cross and a yellow diamond (under Selection toolbar)
+
* Ran FDA Trial data ripping again, as the text output files were wiped.  
** 4) Go to the top-most directory by pressing on the arrow that points left-then-up (on the left of home button)
+
* Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.  
** 5) Click on 'Database Connections'
+
* Talked to Sonia about pulling city, state, zipcode information, hence python was installed in a database. Will work with Sonia on Wednesday afternoon and see how best a regex function could be implemented
** 6) Click on 'Add Database Connection' (if Connection to localhost.sde) does not exist already)
 
** 7) Fill in the following fields:
 
*** Database Platform: PostgreSQL
 
*** Instance: localhost
 
*** User name: postgres
 
*** Password:
 
*** Database: tigertest
 
** 8) Press 'OK'
 
** 9) Now you'll have 'Connection to localhost.sde' in your Database Connections
 
** 10) Double click on 'Connection to localhost.sde'
 
** 11) Double click on the table of interest
 
** 12) Click 'Finish'
 
** 13) You'll see information populated on map, as one of the 'Layers'
 
*** Tested with: tigertest.public.copointplacescontains
 
  
* On running & altering Oliver's script:
+
2017-03-03:
** Location: E:\McNair\Projects\OliverLovesCircles\src\python\vc_circles.py
+
* Attempted to output sql tables
** Ed manipulated file names so that underscores would replace dots (St.Louis --> St_Louis)
 
** Takes in instances and sweep times as part of the argument, but not impactful as those variables are hardcoded in the script
 
** Ran vc_circles.py with the following variables with changed values:
 
*** SWEEP_CYCLE_SECONDS = 10 (used to be 30)
 
*** NUMBER_INSTANCES = 16 (used to be 8)
 
** New output to be found in: E:\McNair\Projects\OliverLovesCircles\out
 
  
====10/9/2017 MONDAY''3PM-6PM''====
+
2017-03-01:
* TODO'S:
+
* Started re-running Jeemin_FDATrial_as_key_data_ripping.py
** Downloading data onto tigertest
 
*** Road
 
*** Railway
 
*** Coastline
 
*** Instructions: http://mcnair.bakerinstitute.org/wiki/PostGIS_Installation#Bulk_Download_TIGER_Shapefiles
 
** Configure census data from American Community Survey (ACS)
 
*** 1) Work out what data is of our interest (confirm ACS)
 
*** 2) Determine appropriate shape file unit:
 
**** census block vs. census block group vs. census track
 
*** 3) Load into tigertest
 
  
* Done:
+
2017-02-27:
** Downloaded data from https://www.census.gov/cgi-bin/geo/shapefiles/index.php
+
* Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
  tl_2017_us_coastline -- 4209
+
* Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
  tl_2017_us_primaryroads -- 11574
+
* Started web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on Rice Wikipedia page. Can be found in : E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py
  tl_2017_us_rails -- 176237
 
** Link found to potentially download ACS data: https://www.census.gov/geo/maps-data/data/tiger-data.html
 
*** But most files on it come with .gdb extension and not .shp
 
  
====10/10/2017 TUESDAY''9AM-12PM''====
+
2017-02-24:
*Discoveries/ Struggles regarding.gdb to .shp file conversion:
+
* Continued working on producing multiple tables - first two are done. Was working on location, as there are multiple location tags per location.
** Esri Production Mapping (costly)
 
*** License needs to be purchasesd: http://www.esri.com/software/arcgis/extensions/production-mapping/pricing
 
** Use ogr2ogr from gdal package
 
*** https://gis.stackexchange.com/questions/14432/migrating-geodatabase-data-into-postgis-without-esri-apps
 
*** Command: ogr2ogr -f "ESRI Shapefile" [Destination of shapefile] [path to gdb file]
 
*** Problem installing gdal
 
  
====10/16/2017 MONDAY''4PM-6PM''====
+
2017-02-22:
* Exported maps of points from the Bay Area each year.  
+
* Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop as going through each file
** Map used location: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year\BayAreaEveryYearMap
 
** Zoom scale: 1:650.000
 
* Location of Bay Area Points png files:
 
E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year
 
  
...
+
2017-02-20:
 +
* Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of those information in a list. The other zipcode file did not finish executing after 2+ hours of running it - considering the possibility of splitting the record file into smaller bits, or running the processing on a faster machine.
  
====10/23/2017 MONDAY''4PM-6PM''====
+
2017-02-17:
* Talked to Ed with Peter & Oliver about upcoming tasks & projects.
+
* Completed code for counting the number of occurrences for each unique zipcode. (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+min because of the comprehensive XML data files. Meanwhile started coding to create a dictionary with the keys corresponding to each unique trial ID, mapped to every other information (location, sponsors, phase, drugs ...etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).
* Loaded acs_place table 2017 (does not contain population) on tigertest.
 
** SQL commands used:
 
  
DROP TABLE acs_place;
+
2017-02-15:
 +
* Discussed with Catherine what to do with FDA Trial data and decided to have a dictionary with zip-codes as keys and number of trials occurred in that zipcode as values. Was still attempting to loop through the files without the code having to exist in the same directory as the XML files. Plan to write to excel via tsv, with zip-code as one column and # of occurrence as the other.
  
CREATE TABLE acs_place (
+
2017-02-13:
        USPS varchar(5),
+
* Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project
        GEOID  varchar(30),
 
        ANSICODE varchar(30),
 
        NAME varchar(100),
 
        LSAD varchar(30),
 
        FUNCSTAT varchar(10),
 
        ALAND varchar(30),
 
        AWATER varchar(30),
 
        ALAND_SQMI varchar(30),
 
        AWATER_SQMI varchar(30),
 
        INTPTLAT varchar(30),
 
        INTPTLONG varchar(30)
 
);
 
  
\COPY acs_place FROM '/bulk/2017_Gaz_place_national.txt';
+
2017-02-08:
--COPY 29578
+
* Attempted to come up with possible cases for locating the description of accelerators - pick up from extracting bodies of text from the about page (given that it exists)
 +
* [[Trial Data Project]]
  
* TODO:
+
2017-02-06:
** Find acs place 2016 data for population
+
* Set up wikiPage & remote desktop.
** Find larger acs files, ideally at the place level
+
* Started working on python version of web crawler. So far it successfully prints out a catchphrase/ description for one website. To be worked on. The python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
** Provide more documentation on POSTGIS & geocoding
 
  
 
[[Category:Work Log]]
 
[[Category:Work Log]]

Latest revision as of 13:38, 8 October 2019

Jeemin Sim Work Logs (log page)

Spring 2018

2017-01-11:

Fall 2017

2017-12-05:

  • Uploaded tables with KML files in TIF folder in bulk(E:) drive
    • Allegheny County
    • Atlanta
    • Chicago
    • Columbus, OH
    • Dublin, OH
    • Vermont
    • Washington, D.C.
  • shp2pgsql needs to be installed to upload Shapefiles to PostgreSQL database
    • applies for Houston, TX and Dallas, TX
    • DONE

2017-12-04:

  • To upload KML file into database with specified table name:
researcher@McNairDBServ:/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml -nln chicagotif
  • chicagotif table now resides in tigertest database

2017-11-30:

  • Displayed map with TIF districts and startups
  • Location:
E:\McNair\Projects\Agglomeration\TIF\McNair_Color_Chicago_TIF_and_startups.png
  • Added information and steps to ArcMap/ArcGIS documentation page
  • Create a project page for 'Working with POSTGIS' and add instructions for uploading KML file onto POSTGIS
    • Command used :
      • Logged in as : researcher@McNairDBServ
/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml 

2017-11-28:

E:\McNair\Projects\Agglomeration\TIF\Jeemin_Chicago_TIF_and_Startups_Attempt1
  • Used https://mygeodata.cloud/result to convert from KML file to CSV (which was then saved as txt file to be uploaded onto ArcMap)
  • Text files located in Local Disk (C:) Drive

2017-11-27:

  • Download Chicago TIF data

2017-11-13:

2017-11-09:

  • Notes on data downloaded:
    • Year 2010-2012 data are based on total population, not 25 yrs or over (case for all other tables)
      • Record appears five times total, with same exact column name
      • For exmaple: 'Total; Estimate; High school graduate (includes equivalency)' appears five times, with different values.
  • TODO:

2017-11-07:

  • Yesterday, narrowed down columns of interest from ACS_S1501_educationattain_2016 table.
Id
Id2
Geography
Total; Estimate; Population 25 years and over
Total; Estimate; Population 25 years and over - High school graduate (includes equivalency)
Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
Percent; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
Total; Estimate; Population 25 years and over - Associate's degree	
Total; Margin of Error; Population 25 years and over - Associate's degree
Percent; Estimate; Population 25 years and over - Associate's degree
Percent; Margin of Error; Population 25 years and over - Associate's degree
Total; Estimate; Population 25 years and over - Bachelor's degree	
Total; Margin of Error; Population 25 years and over - Bachelor's degree
Percent; Estimate; Population 25 years and over - Bachelor's degree
Percent; Margin of Error; Population 25 years and over - Bachelor's degree
Total; Estimate; Population 25 years and over - Graduate or professional degree
Total; Margin of Error; Population 25 years and over - Graduate or professional degree
Percent; Estimate; Population 25 years and over - Graduate or professional degree
Percent; Margin of Error; Population 25 years and over - Graduate or professional degree
Percent; Estimate; Percent high school graduate or higher	
Percent; Margin of Error; Percent high school graduate or higher
Percent; Estimate; Percent bachelor's degree or higher	
Percent; Margin of Error; Percent bachelor's degree or higher
  • Complications:
    • For csv files corresponding to years 2015 & 2016, all of the above columns exist.
    • For csv files corresponding to years 2005 - 2014, no 'Percent' columns exist
      • Instead their 'Total' columns are percentage values
    • For csv file corresponding to year 2005, columns regarding Graduate or professional degree are labeled differently.
    • 2012 data doesn't correspond to Population 25 years and over.
  • Temporary Solution:
    • Since the above problems may be specific to this set of tables, will go through csv files and adjust columns.
  • Python script location:
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py

2017-10-31:

  • Finished doanloading files from ACS.
  • Started loading tables into tigertest.
  • Commands run could be found in
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\DataLoading_SQL_Commands.txt

2017-10-30:

  • Downloaded data from ACS, to be continued
  • File path:
E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data
  • Fields of interest:
S1401 SCHOOL ENROLLMENT
S1501 EDUCATIONAL ATTAINMENT
S2301 EMPLOYMENT STATUS
B01003 TOTAL POPULATION
B02001 RACE
B07201 GEOGRAPHICAL MOBILITY
B08303 TRAVEL TIME TO WORK
B19013 MEDIAN HOUSEHOLD INCOME
B19053 SELF-EMPLOYMENT INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS
B19083 GINI INDEX OF INCOME INEQUALITY
B25003 TENURE
B25105 MEDIAN MONTHLY HOUSING COSTS
B28011 INTERNET SUBSCRIPTIONS IN HOUSEHOLD
G001 GEOGRAPHIC IDENTIFIERS

2017-10-23:

  • Talked to Ed with Peter & Oliver about upcoming tasks & projects.
  • Loaded acs_place table 2017 (does not contain population) on tigertest.
    • SQL commands used:
DROP TABLE acs_place;
CREATE TABLE acs_place (
       USPS varchar(5),
       GEOID  varchar(30),
       ANSICODE varchar(30),
       NAME varchar(100),
       LSAD varchar(30),
       FUNCSTAT varchar(10),
       ALAND varchar(30),
       AWATER varchar(30),
       ALAND_SQMI varchar(30),
       AWATER_SQMI varchar(30),
       INTPTLAT varchar(30),
       INTPTLONG varchar(30)
);
\COPY acs_place FROM '/bulk/2017_Gaz_place_national.txt';
--COPY 29578
  • TODO:
    • Find acs place 2016 data for population
    • Find larger acs files, ideally at the place level
    • Provide more documentation on POSTGIS & geocoding

2017-10-16:

  • Exported maps of points from the Bay Area each year.
    • Map used location: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year\BayAreaEveryYearMap
    • Zoom scale: 1:650.000
  • Location of Bay Area Points png files:
E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year

2017-10-10:

2017-10-09:

 tl_2017_us_coastline -- 4209
 tl_2017_us_primaryroads -- 11574 
 tl_2017_us_rails -- 176237

2017-10-03:

  • Installed PostGIS & is now visible on pgAdmin III
  • ArcGIS (connect to postgis database):
    • 1) Open ArcMap
    • 2) Either open blank or open existing file/project
    • 3) Click on 'Add Data' button with a cross and a yellow diamond (under Selection toolbar)
    • 4) Go to the top-most directory by pressing on the arrow that points left-then-up (on the left of home button)
    • 5) Click on 'Database Connections'
    • 6) Click on 'Add Database Connection' (if Connection to localhost.sde) does not exist already)
    • 7) Fill in the following fields:
      • Database Platform: PostgreSQL
      • Instance: localhost
      • User name: postgres
      • Password:
      • Database: tigertest
    • 8) Press 'OK'
    • 9) Now you'll have 'Connection to localhost.sde' in your Database Connections
    • 10) Double click on 'Connection to localhost.sde'
    • 11) Double click on the table of interest
    • 12) Click 'Finish'
    • 13) You'll see information populated on map, as one of the 'Layers'
      • Tested with: tigertest.public.copointplacescontains
  • On running & altering Oliver's script:
    • Location: E:\McNair\Projects\OliverLovesCircles\src\python\vc_circles.py
    • Ed manipulated file names so that underscores would replace dots (St.Louis --> St_Louis)
    • Takes in instances and sweep times as part of the argument, but not impactful as those variables are hardcoded in the script
    • Ran vc_circles.py with the following variables with changed values:
      • SWEEP_CYCLE_SECONDS = 10 (used to be 30)
      • NUMBER_INSTANCES = 16 (used to be 8)
    • New output to be found in: E:\McNair\Projects\OliverLovesCircles\out

2017-10-02:

2017-09-26:

2017-09-25:

  • Talked to Ed about GIS, Census data, and going about determining the correctness of reported 'place.' Currently script makes a cross product of each reported place and an existing place, outputting a column of boolean value to indicate whether the reported place's coordinates fell within a place's geometric boundaries. One other way of going about this which we discussed is to first check if the reported place does fall within that place's boundaries. If it isn't, we'll go about the cross product method.
  • Discussed the need to maintain venture capital database.
  • Relevant File paths:
    • E:\McNair\Projects\Agglomeration\TestGIS.sql
    • Z:\VentureCapitalData\SDCVCData\vcdb2\ProecssingCoLevelSimple.sql
    • Z:\VentureCapitalData\SDCVCData\vcdb2\CitiesWithGT10Active.txt

2017-09-21:

  • Functions for Linear Referencing:
ST_LineInterpolatePoint(geometry A, double measure): Returns a point interpolated along a line.
ST_LineLocatePoint(geometry A, geometry B): Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
ST_Line_Substring(geometry A, double from, double to): Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
ST_Locate_Along_Measure(geometry A, double measure): Return a derived geometry collection value with elements that match the specified measure.
ST_Locate_Between_Measures(geometry A, double from, double to): Return a derived geometry collection value with elements that match the specified range of measures inclusively.
ST_AddMeasure(geometry A, double from, double to): Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
  • 3-D Functions:
ST_3DClosestPoint — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
ST_3DDistance — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
ST_3DDWithin — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
ST_3DDFullyWithin — Returns true if all of the 3D geometries are within the specified distance of one another.
ST_3DIntersects — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
ST_3DLongestLine — Returns the 3-dimensional longest line between two geometries
ST_3DMaxDistance — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
ST_3DShortestLine — Returns the 3-dimensional shortest line between two geometries
  • Relevant PostgreSQL Commands:
\dt *.* Show all tables
\q Exit table
  • Specifities/ Outliers to consider:
New York (decompose)
Princeton area (keep Princeton  unique)
Reston, Virginia (keep)
San Diego (include La Jolla)
Silicon Valley (all distinct)

2017-09-20:

  • Attended first intro to GIS course yesterday
  • Updated above notes on GIS


2017-09-19:

  • Useful functions for spatial joins:
sum(expression): aggregate to return a sum for a set of records
count(expression): aggregate to return the size of a set of records
ST_Area(geometry) returns the area of the polygons
ST_AsText(geometry) returns WKT text
ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper.
ST_Contains(geometry A, geometry B) returns the true if geometry A contains geometry B
ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
ST_DWithin(geometry A, geometry B, radius) returns the true if geometry A is radius distance or less from geometry B
ST_GeomFromText(text) returns geometry
ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84
ST_Intersects(geometry A, geometry B) returns the true if geometry A intersects geometry B
ST_Length(linestring) returns the length of the linestring
ST_Touches(geometry A, geometry B) returns the true if the boundary of geometry A touches geometry B
ST_Within(geometry A, geometry B) returns the true if geometry A is within geometry B
geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value.
ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter.
ST_Union(): Returns a geometry that represents the point set union of the Geometries.
substring(string [from int] [for int]): PostgreSQL string function to extract substring matching SQL regular expression.
ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between the geometries.
ST_GeoHash(geometry A): Returns a text string representing the GeoHash of the bounds of the object.
  • Native functions for geogrphy:
ST_AsText(geography) returns text
ST_GeographyFromText(text) returns geography
ST_AsBinary(geography) returns bytea
ST_GeogFromWKB(bytea) returns geography
ST_AsSVG(geography) returns text
ST_AsGML(geography) returns text
ST_AsKML(geography) returns text
ST_AsGeoJson(geography) returns text
ST_Distance(geography, geography) returns double
ST_DWithin(geography, geography, float8) returns boolean
ST_Area(geography) returns double
ST_Length(geography) returns double
ST_Covers(geography, geography) returns boolean
ST_CoveredBy(geography, geography) returns boolean
ST_Intersects(geography, geography) returns boolean
ST_Buffer(geography, float8) returns geography [1]
ST_Intersection(geography, geography) returns geography [1]


2017-09-18:

2017-09-12:

  • Clarified University Matching output file.
  • Helped Christy with pdf-reader, capturing keywords in readable format.

2017-09-11:

  • Ensured that documentation exists for the projects worked on last semester.


Spring 2017

2017-04-17:

  • To pull accelerators: Wrote simple python regex-based script that ran on organizations data.
Code: E:\McNair\Projects\Accelerators\Crunchbase Snapshot\accelerator keywords.py
Matched output (885 mathces) : E:\McNair\Projects\Accelerators\Crunchbase Snapshot\Jeemin_885_accel_matches

2017-04-14:

  • Loaded Federal Grants Data into database

2017-04-12:

  • Finishing up cleaning the columns for Federal Grant Data - NIH. The output excel files can be accessed at:
E:\McNair\Projects\Federal Grant Data\NIH\Grants 
Titled:
    Jeemin_combined_files 1986-2001.csv
    Jeemin_combined_files 2002-2012.csv
    Jeemin_combined_files 2013-2015.csv
  • psql table formula:
CREATE TABLE all_grants (
 APPLICATION_ID integer,
 ACTIVITY varchar(3),
 ADMINISTERING_IC varchar(2),
 APPLICATION_TYPE varchar(1),
 ARRA_FUNDED varchar(1),
 AWARD_NOTICE_DATE date, 
 BUDGET_START date,
 BUDGET_END date, 
 CFDA_CODE varchar(3), 
 CORE_PROJECT_NUM varchar(11),
 ED_INST_TYPE varchar(30), 
 FOA_NUMBER varchar(13),
 FULL_PROJECT_NUM varchar(35),
 FUNDING_ICs varchar(40),
 FUNDING_MECHANISM varchar(23),
 FY smallint, 
 IC_NAME varchar(77), 
 NIH_SPENDING_CATS varchar(295), 
 ORG_CITY varchar(20),
 ORG_COUNTRY varchar(16),
 ORG_DEPT varchar(30),
 ORG_DISTRICT smallint, 
 ORG_DUNS integer,
 ORG_FIPS varchar(2), 
 ORG_NAME varchar(60), 
 ORG_STATE varchar(2), 
 ORG_ZIPCODE integer, 
 PHR varchar(200), 
 PI_IDS varchar(30), 
 PI_NAMEs varchar(200), 
 PROGRAM_OFFICER_NAME varchar(36), 
 PROJECT_START date, 
 PROJECT_END date, 
 PROJECT_TERMS varchar(200), 
 PROJECT_TITLE varchar(244), 
 SERIAL_NUMBER smallint,
 STUDY_SECTION varchar(4), 
 STUDY_SECTION_NAME varchar(100), 
 SUBPROJECT_ID smallint, 
 SUFFIX varchar(2),
 SUPPORT_YEAR smallint, 
 DIRECT_COST_AMT integer, 
 INDIRECT_COST_AMT integer, 
 TOTAL_COST integer, 
 TOTAL_COST_SUB_PROJECT integer
);
\COPY all_grants FROM 'Jeemin_combined_files 1986-2001.csv' WITH DELIMITER AS E'\t' HEADER NULL AS CSV

2017-03-29:

  • Troubled by the variety of cases - separating keys by keywords will not work favorably when it hits University of California vs. University of Southern California case - find a way to match University of Southern California first (more specific ones first) - but how to generalize

2017-03-27:

  • Writing code for university matches - decided to go through keys instead of each dataitem. Use keywords in each key to go through the dataitem - misspellings are currently unaccounted for.

2017-03-24:

  • Discussed with Julia & Meghana about university keys to use to count # of occurrences, including aliases and misspellings
  • Thoughts: to use a scoring metric with a key of UNIVERSITY OF CALIFORNIA SYSTEM, it should have a 'better' score when compared to MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY or CALIFORNIA AT LOS ANGELES, UNVIERSITY OF than when compared to UNIVERSITY OF SOUTHERN CALIFORNIA, which may pose a challenge when attempting to implement this in a more general sense. In normalizing a string, strip "THE", "," and split words by spaces and compare each keyword from the two strings. Deciding on which strings to compare will be another issue - length (within some range maybe) could be an option.
  • Federal Grant Data XML Parser was rerun - same output textfiles

2017-03-22:

  • Read string matching & calculating distance, below are relevant links
  • [1]
  • [2]

2017-03-22:

  • Talked to Julia about universal matcher, want to combine all University of California's to University of California, The Regents of
  • Converted crunchbase2013 data from mySQL to PostgreSQL, but having trouble with the last table - cb_relationships, complains about syntax error at or near some places - but generally all tables exist in database called crunchbase
  • Federal Grant Data XML Parser was run - the three output textfiles can be found in E:\McNair\Projects\Federal Grant Data\NSF

2017-03-16:

2017-03-15:

  • Todo: write a wikipage on possible input/output info on string matcher
  • Wrote part of XML parser, extracted yearly data into E:\McNair\Projects\Federal Grant Data\NSF\NSF Extracted Data (up to year 2010)

2017-03-14:

  • Started pulling academy cases but there are too many cases to worry about, in terms of institution of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
  • Need Julia/Meghana to look through the hits and see which are relevant & extract pattern from there.
  • Having trouble outputting txt file without double quotes around every line.
  • Thinking that one text file should be output for all keywords instead of having one each, to avoid overlap (ex) COLLEGE and UNIVERSITY are both keywords; ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY will be hit twice if it were counted as two separate instances, one accounting for COLLEGE and the other for UNIVERSITY) - either in the form of if-elseif statements or one big regex check.

2017-03-13:

  • For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE(output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA).
  • University Patent Matching
  • To be worked on later: Grant XML parsing & general name matcher

2017-03-08:

  • Wrote regex pattern that identifies all "university" matchings - can be found in E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA -- is an output file
  • Talked to Sonia, but didn't come to solid conclusion on identifying whether key words associate with city or country by running a python function
  • Output sql tables from finished run of Jeemin_FDATrial_as_key_data_ripping.py
  • Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be spelled wrong. There were many.
  • Tried to logic through creating a pattern that could catch all different versions of UNIVERSITY. Discuss further on whether UNIVERSITIES and those that include UNIVERSITIES but include INC in the end should be pulled as relevant information

2017-03-06:

  • Installing python in a database
  • Added building Python function section to Working with PostgreSQL at the bottom of the page.
  • Ran FDA Trial data ripping again, as the text output files were wiped.
  • Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.
  • Talked to Sonia about pulling city, state, zipcode information, hence python was installed in a database. Will work with Sonia on Wednesday afternoon and see how best a regex function could be implemented

2017-03-03:

  • Attempted to output sql tables

2017-03-01:

  • Started re-running Jeemin_FDATrial_as_key_data_ripping.py

2017-02-27:

  • Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
  • Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
  • Started web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on Rice Wikipedia page. Can be found in : E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py

2017-02-24:

  • Continued working on producing multiple tables - first two are done. Was working on location, as there are multiple location tags per location.

2017-02-22:

  • Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop as going through each file

2017-02-20:

  • Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of those information in a list. The other zipcode file did not finish executing after 2+ hours of running it - considering the possibility of splitting the record file into smaller bits, or running the processing on a faster machine.

2017-02-17:

  • Completed code for counting the number of occurrences for each unique zipcode. (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+min because of the comprehensive XML data files. Meanwhile started coding to create a dictionary with the keys corresponding to each unique trial ID, mapped to every other information (location, sponsors, phase, drugs ...etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).

2017-02-15:

  • Discussed with Catherine what to do with FDA Trial data and decided to have a dictionary with zip-codes as keys and number of trials occurred in that zipcode as values. Was still attempting to loop through the files without the code having to exist in the same directory as the XML files. Plan to write to excel via tsv, with zip-code as one column and # of occurrence as the other.

2017-02-13:

  • Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project

2017-02-08:

  • Attempted to come up with possible cases for locating the description of accelerators - pick up from extracting bodies of text from the about page (given that it exists)
  • Trial Data Project

2017-02-06:

  • Set up wikiPage & remote desktop.
  • Started working on python version of web crawler. So far it successfully prints out a catchphrase/ description for one website. To be worked on. The python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py