[[Jeemin Sim]] [[Work Logs]] [[Jeemin Sim (WorkLog)|(log page)]]

===Spring 2018===

2018-01-11:
* Was briefed on [[Urban Start-up Agglomeration and Venture Capital Investment]]
* Going forward - running regressions in R and PostgreSQL with desired layers (instead of levels)

===Fall 2017===
<onlyinclude>
2017-12-05:
* Uploaded tables with KML files in the TIF folder on the bulk (E:) drive
** Allegheny County
** Atlanta
** Chicago
** Columbus, OH
** Dublin, OH
** Vermont
** Washington, D.C.
* Updated documentation for:
** http://www.edegan.com/wiki/TIF_Project#Uploading_TIF_Data_onto_database_.28tigertest.29
* shp2pgsql needs to be installed to upload Shapefiles to the PostgreSQL database
** applies for Houston, TX and Dallas, TX
** DONE

2017-12-04:
* To upload a KML file into the database with a specified table name:
 researcher@McNairDBServ:/bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml -nln chicagotif
* chicagotif table now resides in the tigertest database

2017-11-30:
* Displayed map with TIF districts and startups
* Location: E:\McNair\Projects\Agglomeration\TIF\McNair_Color_Chicago_TIF_and_startups.png
* Added information and steps to the ArcMap/ArcGIS documentation page
* Create a project page for 'Working with POSTGIS' and add instructions for uploading a KML file onto POSTGIS
** Command used:
*** Logged in as researcher@McNairDBServ: /bulk/tigertest$ ogr2ogr -f PostgreSQL PG:"dbname=tigertest" chicagotif.kml
* Chicago TIF KML file is currently loaded in tigertest with a table name of Layer0
** Figure out how to change the layer name while loading a KML file
** Instructions pulled from: http://wiki.wildsong.biz/index.php/Loading_data_into_PostGIS#Loading_data_from_KMZ_files

2017-11-28:
* Created and edited [[ArcMap / ArcGIS Documentation]]
* Plotted points for TIFs and startups in Chicago in one map.
** Location: E:\McNair\Projects\Agglomeration\TIF\Jeemin_Chicago_TIF_and_Startups_Attempt1
* Used https://mygeodata.cloud/result to convert from KML to CSV (which was then saved as a txt file to be uploaded onto ArcMap)
* Text files located on the Local Disk (C:) drive

2017-11-27:
* Download Chicago TIF data

2017-11-13:

2017-11-09:
* Notes on data downloaded:
** Year 2010-2012 data are based on total population, not 25 yrs or over (the case for all other tables)
*** Record appears five times total, with the same exact column name
*** For example: 'Total; Estimate; High school graduate (includes equivalency)' appears five times, with different values.
* TODO:
** Make Projects page for ACS Data
*** [[American Community Survey (ACS) Data]]

2017-11-07:
* Yesterday, narrowed down the columns of interest from the ACS_S1501_educationattain_2016 table:
 Id
 Id2
 Geography
 Total; Estimate; Population 25 years and over
 Total; Estimate; Population 25 years and over - High school graduate (includes equivalency)
 Total; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
 Percent; Estimate; Population 25 years and over - High school graduate (includes equivalency)
 Percent; Margin of Error; Population 25 years and over - High school graduate (includes equivalency)
 Total; Estimate; Population 25 years and over - Associate's degree
 Total; Margin of Error; Population 25 years and over - Associate's degree
 Percent; Estimate; Population 25 years and over - Associate's degree
 Percent; Margin of Error; Population 25 years and over - Associate's degree
 Total; Estimate; Population 25 years and over - Bachelor's degree
 Total; Margin of Error; Population 25 years and over - Bachelor's degree
 Percent; Estimate; Population 25 years and over - Bachelor's degree
 Percent; Margin of Error; Population 25 years and over - Bachelor's degree
 Total; Estimate; Population 25 years and over - Graduate or professional degree
 Total; Margin of Error; Population 25 years and over - Graduate or professional degree
 Percent; Estimate; Population 25 years and over - Graduate or professional degree
 Percent; Margin of Error; Population 25 years and over - Graduate or professional degree
 Percent; Estimate; Percent high school graduate or higher
 Percent; Margin of Error; Percent high school graduate or higher
 Percent; Estimate; Percent bachelor's degree or higher
 Percent; Margin of Error; Percent bachelor's degree or higher
* Complications:
** For csv files corresponding to years 2015 & 2016, all of the above columns exist.
** For csv files corresponding to years 2005 - 2014, no 'Percent' columns exist
*** Instead their 'Total' columns are percentage values
** For the csv file corresponding to year 2005, columns regarding Graduate or professional degree are labeled differently.
** 2012 data doesn't correspond to Population 25 years and over.
* Temporary solution:
** Since the above problems may be specific to this set of tables, will go through the csv files and adjust columns (see the sketch below).
* Python script location: E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py

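A minimal sketch of that column-pulling step, assuming the raw ACS csv files sit in one flat directory, that file names start with the year, and that headers match the list above; pullCertainColumns.py itself is not reproduced here and the output file name is a placeholder.

 import csv
 import os
 
 # Hypothetical paths; the real script lives at
 # E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py
 IN_DIR = "ACS_Downloaded_Data"
 OUT_FILE = "acs_education_selected.tsv"
 
 # Columns to keep, when they exist (names vary by year).
 WANTED = [
     "Id2",
     "Geography",
     "Total; Estimate; Population 25 years and over",
     "Percent; Estimate; Percent high school graduate or higher",
     "Percent; Estimate; Percent bachelor's degree or higher",
 ]
 
 with open(OUT_FILE, "w", newline="", encoding="utf-8") as out:
     writer = csv.writer(out, delimiter="\t")
     writer.writerow(["year"] + WANTED)
     for name in sorted(os.listdir(IN_DIR)):
         if not name.endswith(".csv"):
             continue
         year = name[:4]  # assumes the file name starts with the year
         with open(os.path.join(IN_DIR, name), newline="", encoding="utf-8") as f:
             for row in csv.DictReader(f):
                 # Older years label the 'Percent' values as 'Total'; fall back accordingly.
                 values = [row.get(c) or row.get(c.replace("Percent; ", "Total; "), "") for c in WANTED]
                 writer.writerow([year] + values)
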
2017-10-31:
* Finished downloading files from ACS.
* Started loading tables into tigertest.
* Commands run can be found in E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\DataLoading_SQL_Commands.txt

2017-10-30:
* Downloaded data from ACS, to be continued
* File path: E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data
* Fields of interest:
 S1401 SCHOOL ENROLLMENT
 S1501 EDUCATIONAL ATTAINMENT
 S2301 EMPLOYMENT STATUS
 B01003 TOTAL POPULATION
 B02001 RACE
 B07201 GEOGRAPHICAL MOBILITY
 B08303 TRAVEL TIME TO WORK
 B19013 MEDIAN HOUSEHOLD INCOME
 B19053 SELF-EMPLOYMENT INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS
 B19083 GINI INDEX OF INCOME INEQUALITY
 B25003 TENURE
 B25105 MEDIAN MONTHLY HOUSING COSTS
 B28011 INTERNET SUBSCRIPTIONS IN HOUSEHOLD
 G001 GEOGRAPHIC IDENTIFIERS

2017-10-23:
* Talked to Ed with Peter & Oliver about upcoming tasks & projects.
* Loaded acs_place table 2017 (does not contain population) on tigertest.
** SQL commands used:
 DROP TABLE acs_place;
 CREATE TABLE acs_place (
  USPS varchar(5),
  GEOID varchar(30),
  ANSICODE varchar(30),
  NAME varchar(100),
  LSAD varchar(30),
  FUNCSTAT varchar(10),
  ALAND varchar(30),
  AWATER varchar(30),
  ALAND_SQMI varchar(30),
  AWATER_SQMI varchar(30),
  INTPTLAT varchar(30),
  INTPTLONG varchar(30)
 );
 \COPY acs_place FROM '/bulk/2017_Gaz_place_national.txt'; --COPY 29578
* TODO:
** Find acs place 2016 data for population
** Find larger acs files, ideally at the place level
** Provide more documentation on POSTGIS & geocoding

2017-10-16:
* Exported maps of points from the Bay Area for each year.
* Map used location: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year\BayAreaEveryYearMap
** Zoom scale: 1:650,000
* Location of Bay Area Points png files: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS\Jeemin_Bay_Area Points_Every_Year

2017-10-10:
* Discoveries/struggles regarding .gdb to .shp file conversion (see the sketch below):
** Esri Production Mapping (costly)
*** License needs to be purchased: http://www.esri.com/software/arcgis/extensions/production-mapping/pricing
** Use ogr2ogr from the gdal package
*** https://gis.stackexchange.com/questions/14432/migrating-geodatabase-data-into-postgis-without-esri-apps
*** Command: ogr2ogr -f "ESRI Shapefile" [Destination of shapefile] [path to gdb file]
*** Problem installing gdal

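A sketch of the ogr2ogr route, assuming GDAL is installed and on the PATH; the geodatabase and output paths below are placeholders, not real project files.

 import subprocess
 
 GDB_PATH = r"E:\McNair\Projects\Agglomeration\example.gdb"   # hypothetical source geodatabase
 OUT_DIR = r"E:\McNair\Projects\Agglomeration\example_shp"    # hypothetical destination folder
 
 # Convert every layer in the geodatabase to ESRI Shapefiles.
 subprocess.run(["ogr2ogr", "-f", "ESRI Shapefile", OUT_DIR, GDB_PATH], check=True)
 
 # Or skip shapefiles entirely and push the layers straight into PostGIS,
 # mirroring the KML command used elsewhere in this log.
 subprocess.run(["ogr2ogr", "-f", "PostgreSQL", "PG:dbname=tigertest", GDB_PATH], check=True)
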
2017-10-09:
* TODOs:
** Downloading data onto tigertest
*** Road
*** Railway
*** Coastline
*** Instructions: http://www.edegan.com/wiki/PostGIS_Installation#Bulk_Download_TIGER_Shapefiles
** Configure census data from American Community Survey (ACS)
*** 1) Work out what data is of our interest (confirm ACS)
*** 2) Determine appropriate shape file unit:
**** census block vs. census block group vs. census tract
*** 3) Load into tigertest
* Done:
** Downloaded data from https://www.census.gov/cgi-bin/geo/shapefiles/index.php (a loading sketch follows this entry)
 tl_2017_us_coastline -- 4209
 tl_2017_us_primaryroads -- 11574
 tl_2017_us_rails -- 176237
** Link found to potentially download ACS data: https://www.census.gov/geo/maps-data/data/tiger-data.html
*** But most files on it come with a .gdb extension and not .shp

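A small sketch of loading one of the downloaded TIGER shapefiles into tigertest, assuming shp2pgsql and psql are on the PATH; the target table name is hypothetical. TIGER shapefiles ship in NAD83, hence SRID 4269, and -I also builds a spatial index.

 import subprocess
 
 # shp2pgsql emits SQL; pipe it into psql against the tigertest database.
 shp2pgsql = subprocess.Popen(
     ["shp2pgsql", "-s", "4269", "-I", "tl_2017_us_rails.shp", "tiger_rails"],
     stdout=subprocess.PIPE,
 )
 subprocess.run(["psql", "-d", "tigertest"], stdin=shp2pgsql.stdout, check=True)
 shp2pgsql.stdout.close()
 shp2pgsql.wait()
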
2017-10-03:
* Installed PostGIS & it is now visible in pgAdmin III
* ArcGIS (connect to postgis database):
** 1) Open ArcMap
** 2) Either open blank or open an existing file/project
** 3) Click on the 'Add Data' button with a cross and a yellow diamond (under the Selection toolbar)
** 4) Go to the top-most directory by pressing on the arrow that points left-then-up (on the left of the home button)
** 5) Click on 'Database Connections'
** 6) Click on 'Add Database Connection' (if 'Connection to localhost.sde' does not exist already)
** 7) Fill in the following fields:
*** Database Platform: PostgreSQL
*** Instance: localhost
*** User name: postgres
*** Password:
*** Database: tigertest
** 8) Press 'OK'
** 9) Now you'll have 'Connection to localhost.sde' in your Database Connections
** 10) Double click on 'Connection to localhost.sde'
** 11) Double click on the table of interest
** 12) Click 'Finish'
** 13) You'll see information populated on the map, as one of the 'Layers'
*** Tested with: tigertest.public.copointplacescontains
* On running & altering Oliver's script:
** Location: E:\McNair\Projects\OliverLovesCircles\src\python\vc_circles.py
** Ed manipulated file names so that underscores would replace dots (St.Louis -> St_Louis)
** Takes in instances and sweep times as part of the argument, but not impactful as those variables are hardcoded in the script
** Ran vc_circles.py with the following variables with changed values:
*** SWEEP_CYCLE_SECONDS = 10 (used to be 30)
*** NUMBER_INSTANCES = 16 (used to be 8)
** New output to be found in: E:\McNair\Projects\OliverLovesCircles\out

2017-10-02:
* Talked to Harrison & Peter regarding ArcGIS
** Currently have points plotted on Houston
** Trouble interpreting the geometry type, as it currently reads in from a text file
** Documents located in: E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS
* Attempted to install the PostGIS spatial extension from PostgreSQL but got a 'spatial database creation failed' error message.
** Referenced instructions:
*** https://www.gpsfiledepot.com/tutorials/installing-and-setting-up-postgresql-with-postgis/
*** http://www.bostongis.com/PrinterFriendly.aspx?content_name=postgis_tut01

2017-09-26:
* Created a table that maps a state to the database name.
** http://www.edegan.com/wiki/PostGIS_Installation#Translating_Table_names_to_corresponding_States
* Added more GIS information (functions, realm & outliers to consider)
** http://www.edegan.com/wiki/Urban_Start-up_Agglomeration#GIS_Resources
* Visualization in PostGIS or connecting to ArcGIS for visualization (import/export data)
* Spatial indexing (see the sketch below):
** http://revenant.ca/www/postgis/workshop/indexing.html

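A minimal sketch of enabling PostGIS and adding a spatial index, assuming a local tigertest database and a hypothetical table 'acs_place_geom' with a 'geom' column; this is an illustration, not a command that was run.

 import psycopg2
 
 conn = psycopg2.connect(dbname="tigertest", host="localhost")
 conn.autocommit = True
 cur = conn.cursor()
 
 # Enable the extension once per database (no-op if it is already installed).
 cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
 
 # GiST index so containment/intersection queries do not scan every row.
 cur.execute("CREATE INDEX IF NOT EXISTS acs_place_geom_gix ON acs_place_geom USING GIST (geom);")
 
 cur.close()
 conn.close()
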
2017-09-25:
* Talked to Ed about GIS, Census data, and how to determine the correctness of each reported 'place.' Currently the script makes a cross product of each reported place and an existing place, outputting a column of boolean values to indicate whether the reported place's coordinates fall within a place's geometric boundaries. One other way we discussed is to first check whether the reported coordinates fall within the boundaries of the place that was reported; if they don't, fall back to the cross product method (see the containment-check sketch after this entry).
* To add documentation:
** http://www.edegan.com/wiki/PostGIS_Installation
** http://www.edegan.com/wiki/Urban_Start-up_Agglomeration
* Discussed the need to maintain the venture capital database
* Relevant file paths:
** E:\McNair\Projects\Agglomeration\TestGIS.sql
** Z:\VentureCapitalData\SDCVCData\vcdb2\ProecssingCoLevelSimple.sql
** Z:\VentureCapitalData\SDCVCData\vcdb2\CitiesWithGT10Active.txt

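A rough sketch of that containment check, assuming hypothetical tables reportedplaces(companyname, placename, coords geometry) and places(name, boundary geometry) in tigertest; the schema and column names are assumptions, not the project's actual tables.

 import psycopg2
 
 conn = psycopg2.connect(dbname="tigertest", host="localhost")
 cur = conn.cursor()
 
 # First pass: does each reported point fall inside the boundary of the place it names?
 cur.execute("""
     SELECT r.companyname,
            r.placename,
            bool_or(ST_Contains(p.boundary, r.coords)) AS inside_reported_place
     FROM reportedplaces r
     LEFT JOIN places p ON p.name = r.placename
     GROUP BY r.companyname, r.placename;
 """)
 for companyname, placename, inside in cur.fetchall():
     if not inside:
         # Fallback (cross product): find any place whose boundary contains the point.
         cur.execute(
             "SELECT p.name FROM places p, reportedplaces r "
             "WHERE r.companyname = %s AND ST_Contains(p.boundary, r.coords);",
             (companyname,),
         )
         print(companyname, placename, "->", [row[0] for row in cur.fetchall()])
 
 cur.close()
 conn.close()
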
2017-09-21:
* Functions for Linear Referencing (a short usage sketch follows below):
 '''ST_LineInterpolatePoint(geometry A, double measure)''': Returns a point interpolated along a line.
 '''ST_LineLocatePoint(geometry A, geometry B)''': Returns a float between 0 and 1 representing the location of the closest point on the LineString to the given Point.
 '''ST_Line_Substring(geometry A, double from, double to)''': Returns a linestring being a substring of the input one, starting and ending at the given fractions of total 2d length.
 '''ST_Locate_Along_Measure(geometry A, double measure)''': Returns a derived geometry collection value with elements that match the specified measure.
 '''ST_Locate_Between_Measures(geometry A, double from, double to)''': Returns a derived geometry collection value with elements that match the specified range of measures inclusively.
 '''ST_AddMeasure(geometry A, double from, double to)''': Returns a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
* 3-D Functions:
 '''ST_3DClosestPoint''' — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
 '''ST_3DDistance''' — For geometry type, returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
 '''ST_3DDWithin''' — For 3d (z) geometry type, returns true if two geometries' 3d distance is within the given number of units.
 '''ST_3DDFullyWithin''' — Returns true if all of the 3D geometries are within the specified distance of one another.
 '''ST_3DIntersects''' — Returns TRUE if the geometries "spatially intersect" in 3d - only for points and linestrings.
 '''ST_3DLongestLine''' — Returns the 3-dimensional longest line between two geometries.
 '''ST_3DMaxDistance''' — For geometry type, returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
 '''ST_3DShortestLine''' — Returns the 3-dimensional shortest line between two geometries.
* Relevant PostgreSQL commands:
 '''\dt *.*''' Show all tables
 '''\q''' Exit psql
* Specificities/outliers to consider:
 New York (decompose)
 Princeton area (keep Princeton unique)
 Reston, Virginia (keep)
 San Diego (include La Jolla)
 Silicon Valley (all distinct)
* Continue reading from: https://postgis.net/docs/postgis_installation.html

2017-09-20:
* Attended first intro to GIS course yesterday
* Updated above notes on GIS

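A small sketch exercising two of the linear-referencing functions listed above against an ad-hoc line, so no project tables are assumed; the connection settings are placeholders.

 import psycopg2
 
 conn = psycopg2.connect(dbname="tigertest", host="localhost")
 cur = conn.cursor()
 
 # Point halfway along a simple line, and the fractional position along the line
 # of the point closest to (3, 1).
 cur.execute("""
     SELECT ST_AsText(ST_LineInterpolatePoint(line, 0.5)),
            ST_LineLocatePoint(line, ST_Point(3, 1))
     FROM (SELECT 'LINESTRING(0 0, 2 0, 4 0)'::geometry AS line) AS t;
 """)
 print(cur.fetchone())   # expected roughly: ('POINT(2 0)', 0.75)
 
 cur.close()
 conn.close()
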
2017-09-19:
* Useful functions for spatial joins:
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/geography.html

2017-09-18:
* Read documentation on PostGIS and the tiger geocoder
* Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/joins.html

*32017-09-D Functions12: '''ST_3DClosestPoint''' — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line* Clarified University Matching output file. '''ST_3DDistance''' — For geometry type Returns the 3* Helped Christy with pdf-dimensional cartesian minimum distance (based on spatial ref) between two geometries reader, capturing keywords in projected unitsreadable format. '''ST_3DDWithin''' — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units. '''ST_3DDFullyWithin''' — Returns true if all of the 3D geometries are within the specified distance of one another. '''ST_3DIntersects''' — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings '''ST_3DLongestLine''' — Returns the 3-dimensional longest line between two geometries '''ST_3DMaxDistance''' — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units. '''ST_3DShortestLine''' — Returns the 3-dimensional shortest line between two geometries
2017-09-11:
* Ensured that documentation exists for the projects worked on last semester.

</onlyinclude>
===Spring 2017===

2017-04-17:
* To pull accelerators: wrote a simple python regex-based script that ran on the organizations data (see the sketch below).
 Code: E:\McNair\Projects\Accelerators\Crunchbase Snapshot\accelerator keywords.py
 Matched output (885 matches): E:\McNair\Projects\Accelerators\Crunchbase Snapshot\Jeemin_885_accel_matches

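A minimal sketch of that keyword pull, assuming a Crunchbase organizations csv with 'name' and 'short_description' columns; the keyword list, column names, and file names are assumptions, and the real script is 'accelerator keywords.py' in the Crunchbase Snapshot folder.

 import csv
 import re
 
 KEYWORDS = re.compile(r"\b(accelerator|incubator|startup program)\b", re.IGNORECASE)
 
 matches = []
 with open("organizations.csv", newline="", encoding="utf-8") as f:
     for row in csv.DictReader(f):
         text = " ".join([row.get("name", ""), row.get("short_description", "")])
         if KEYWORDS.search(text):
             matches.append(row.get("name", ""))
 
 with open("accel_matches.txt", "w", encoding="utf-8") as out:
     out.write("\n".join(matches))
 print(len(matches), "matches found")
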
2017-04-14:
* Loaded Federal Grants Data into the database

2017-04-12:
* Finishing up cleaning the columns for Federal Grant Data - NIH. The output excel files can be accessed at E:\McNair\Projects\Federal Grant Data\NIH\Grants, titled:
 Jeemin_combined_files 1986-2001.csv
 Jeemin_combined_files 2002-2012.csv
 Jeemin_combined_files 2013-2015.csv

* psql table formula:
 CREATE TABLE all_grants (
  APPLICATION_ID integer,
  ACTIVITY varchar(3),
  ADMINISTERING_IC varchar(2),
  APPLICATION_TYPE varchar(1),
  ARRA_FUNDED varchar(1),
  AWARD_NOTICE_DATE date,
  BUDGET_START date,
  BUDGET_END date,
  CFDA_CODE varchar(3),
  CORE_PROJECT_NUM varchar(11),
  ED_INST_TYPE varchar(30),
  FOA_NUMBER varchar(13),
  FULL_PROJECT_NUM varchar(35),
  FUNDING_ICs varchar(40),
  FUNDING_MECHANISM varchar(23),
  FY smallint,
  IC_NAME varchar(77),
  NIH_SPENDING_CATS varchar(295),
  ORG_CITY varchar(20),
  ORG_COUNTRY varchar(16),
  ORG_DEPT varchar(30),
  ORG_DISTRICT smallint,
  ORG_DUNS integer,
  ORG_FIPS varchar(2),
  ORG_NAME varchar(60),
  ORG_STATE varchar(2),
  ORG_ZIPCODE integer,
  PHR varchar(200),
  PI_IDS varchar(30),
  PI_NAMEs varchar(200),
  PROGRAM_OFFICER_NAME varchar(36),
  PROJECT_START date,
  PROJECT_END date,
  PROJECT_TERMS varchar(200),
  PROJECT_TITLE varchar(244),
  SERIAL_NUMBER smallint,
  STUDY_SECTION varchar(4),
  STUDY_SECTION_NAME varchar(100),
  SUBPROJECT_ID smallint,
  SUFFIX varchar(2),
  SUPPORT_YEAR smallint,
  DIRECT_COST_AMT integer,
  INDIRECT_COST_AMT integer,
  TOTAL_COST integer,
  TOTAL_COST_SUB_PROJECT integer
 );
 \COPY all_grants FROM 'Jeemin_combined_files 1986-2001.csv' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

2017-03-29:
* Troubled by the variety of cases - separating keys by keywords will not work favorably when it hits the University of California vs. University of Southern California case - find a way to match University of Southern California first (more specific ones first), but how to generalize?

2017-03-27:
* Writing code for university matches - decided to go through keys instead of each data item. Use keywords in each key to go through the data item - misspellings are currently unaccounted for.

2017-03-24:
* Discussed with Julia & Meghana which university keys to use to count # of occurrences, including aliases and misspellings
** Thoughts: using a scoring metric with a key of UNIVERSITY OF CALIFORNIA SYSTEM, it should have a 'better' score when compared to MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY or CALIFORNIA AT LOS ANGELES, UNVIERSITY OF than when compared to UNIVERSITY OF SOUTHERN CALIFORNIA, which may pose a challenge when attempting to implement this in a more general sense. In normalizing a string, strip "THE" and ",", split words by spaces, and compare each keyword from the two strings. Deciding on which strings to compare will be another issue - length (within some range, maybe) could be an option. (See the scoring sketch below.)
* Federal Grant Data XML Parser was rerun - same output textfiles

2017-03-22:
* Read about string matching & calculating distance; below are relevant links
** [http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf]
** [http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html]

2017-03-20:
* Talked to Julia about a universal matcher; want to combine all University of California's to University of California, The Regents of
* Converted crunchbase2013 data from mySQL to PostgreSQL, but having trouble with the last table - cb_relationships, which complains about a syntax error at or near some places - but generally all tables exist in a database called crunchbase
* Federal Grant Data XML Parser was run - the three output textfiles can be found in E:\McNair\Projects\Federal Grant Data\NSF

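A rough sketch of the normalize-and-score idea above; the stopword list and the token-overlap score are assumptions, not the project's actual matcher.

 import re
 
 STOPWORDS = {"THE", "OF"}
 
 def normalize(name):
     """Uppercase, drop punctuation and stopwords, and return a set of tokens."""
     name = re.sub(r"[^A-Z ]", " ", name.upper())
     return {tok for tok in name.split() if tok not in STOPWORDS}
 
 def score(key, candidate):
     """Share of the key's tokens that also appear in the candidate (0 to 1)."""
     key_tokens, cand_tokens = normalize(key), normalize(candidate)
     return len(key_tokens & cand_tokens) / len(key_tokens) if key_tokens else 0.0
 
 key = "UNIVERSITY OF CALIFORNIA SYSTEM"
 for cand in ["MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY",
              "CALIFORNIA AT LOS ANGELES, UNVIERSITY OF",
              "UNIVERSITY OF SOUTHERN CALIFORNIA"]:
     print(round(score(key, cand), 2), cand)
 
 # Note: this naive overlap scores UNIVERSITY OF SOUTHERN CALIFORNIA as high as the
 # Berkeley string and misses the misspelled UNVIERSITY, which is exactly the
 # challenge described in the log entry above.
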
2017-03-16:
* Further documented [[University Patent Matching]]
* Finished writing the XML Parser

2017-03-15:
* Todo: write a wikipage on possible input/output info for the string matcher
* Wrote part of the XML parser; extracted yearly data into E:\McNair\Projects\Federal Grant Data\NSF\NSF Extracted Data (up to year 2010)

2017-03-14:
* Started pulling academy cases, but there are too many cases to worry about in terms of institutions of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
* Need Julia/Meghana to look through the hits, see which are relevant, and extract a pattern from there.
* Having trouble outputting a txt file without double quotes around every line.
* Thinking that one text file should be output for all keywords instead of having one each, to avoid overlap (ex: COLLEGE and UNIVERSITY are both keywords; ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY would be hit twice if counted as two separate instances, one accounting for COLLEGE and the other for UNIVERSITY) - either in the form of if-elif statements or one big regex check.

2017-03-13:
* For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE (output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA)
* [[University Patent Matching]]
* To be worked on later: Grant XML parsing & general name matcher

2017-03-08:
* Wrote a regex pattern that identifies all "university" matchings - the output can be found in E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA
* Talked to Sonia, but didn't come to a solid conclusion on identifying whether key words associate with a city or country by running a python function
* Output sql tables from the finished run of Jeemin_FDATrial_as_key_data_ripping.py
* Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be spelled wrong. There were many.
* Tried to logic through creating a pattern that could catch all different versions of UNIVERSITY (see the sketch below). Discuss further whether UNIVERSITIES, and those that include UNIVERSITIES but end in INC, should be pulled as relevant information.

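A sketch of a deliberately loose pattern for common misspellings of UNIVERSITY (e.g. UNIVERSTIY, UNVIERSITY, UNIVERISTY); the tolerated variants are an assumption, not the pattern used to produce the output file above, and the pattern will over-match some unrelated words.

 import re
 
 UNIVERSITY_RE = re.compile(r"\bUN[A-Z]{4,9}?(?:ITY|TIY|ISTY)\b")
 
 lines = [
     "THE REGENTS OF THE UNIVERSITY OF CALIFORNIA",
     "CALIFORNIA AT LOS ANGELES, UNVIERSITY OF",
     "ACME UNIVERSTIY HOLDINGS INC",
     "UNITED TECHNOLOGIES CORP",          # should not match
 ]
 for line in lines:
     if UNIVERSITY_RE.search(line.upper()):
         print("HIT :", line)
     else:
         print("MISS:", line)
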
2017-03-06:
* [[Installing python in a database]]
* Added a building-Python-functions section to [[Working with PostgreSQL]] at the bottom of the page.
* Ran FDA Trial data ripping again, as the text output files were wiped.
* Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.
* Talked to Sonia about pulling city, state, zipcode information, hence python was installed in a database. Will work with Sonia on Wednesday afternoon and see how best a regex function could be implemented.

2017-03-03:
* Attempted to output sql tables

2017-03-01:
* Started re-running Jeemin_FDATrial_as_key_data_ripping.py

2017-02-27:
* Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
* Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
* Started a web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on the Rice Wikipedia page. Can be found in: E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py

2017-02-24:
* Continued working on producing multiple tables - the first two are done. Was working on location, as there are multiple location tags per location.

2017-02-22:
* Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt
* TODO: output four different tables & replace the write in the same for-loop as going through each file

2017-02-20:
* Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of that information in a list.
* The other zipcode file did not finish executing after 2+ hours of running - considering the possibility of splitting the record file into smaller bits, or running the processing on a faster machine.

2017-02-17:
* Completed code for counting the number of occurrences of each unique zipcode (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+ min because of the comprehensive XML data files.
* Meanwhile, started coding to create a dictionary with keys corresponding to each unique trial ID, mapped to all other information (location, sponsors, phase, drugs, etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).

2017-02-15:
* Discussed with Catherine what to do with the FDA Trial data and decided to have a dictionary with zip-codes as keys and the number of trials that occurred in that zipcode as values. Was still attempting to loop through the files without the code having to exist in the same directory as the XML files. Plan to write to excel via tsv, with zip-code as one column and # of occurrences as the other (see the sketch below).

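A minimal sketch of that zip-code tally, assuming clinical-trial XML files with <zip> elements somewhere under each record; the tag name, directory, and output file name are assumptions, not the project's actual layout.

 import csv
 import os
 import xml.etree.ElementTree as ET
 from collections import Counter
 
 XML_DIR = r"E:\McNair\Projects\FDA Trials\xml_records"   # hypothetical
 counts = Counter()
 
 for name in os.listdir(XML_DIR):
     if not name.endswith(".xml"):
         continue
     root = ET.parse(os.path.join(XML_DIR, name)).getroot()
     # Count each zip code once per trial, even if it appears at several sites.
     for zipcode in {el.text.strip() for el in root.iter("zip") if el.text}:
         counts[zipcode] += 1
 
 # Write a two-column tsv: zip-code and number of trials.
 with open("zipcode_trial_counts.tsv", "w", newline="", encoding="utf-8") as out:
     writer = csv.writer(out, delimiter="\t")
     writer.writerow(["zipcode", "num_trials"])
     for zipcode, n in counts.most_common():
         writer.writerow([zipcode, n])
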
2017-02-13:
* Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for a single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project
* [[Trial Data Project]]

2017-02-08:
* Attempted to come up with possible cases for locating the description of accelerators - pick up from extracting bodies of text from the about page (given that it exists)

2017-02-06:
* Set up wikiPage & remote desktop.
* Started working on a python version of the web crawler. So far it successfully prints out a catchphrase/description for one website (see the sketch below). To be worked on. The python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
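A minimal sketch of that catchphrase/description step, assuming the page exposes a <meta name="description"> tag; webcrawlerpython.py itself is not reproduced here and the URL is just an example.

 from html.parser import HTMLParser
 from urllib.request import urlopen
 
 class DescriptionParser(HTMLParser):
     def __init__(self):
         super().__init__()
         self.description = None
 
     def handle_starttag(self, tag, attrs):
         attrs = dict(attrs)
         if tag == "meta" and attrs.get("name", "").lower() == "description":
             self.description = attrs.get("content")
 
 url = "https://www.ycombinator.com/"   # example accelerator site
 html = urlopen(url).read().decode("utf-8", errors="replace")
 parser = DescriptionParser()
 parser.feed(html)
 print(parser.description)
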
[[Category:Work Log]]
