Jeemin Sim (Work Log)

1 2/6/2017 MONDAY 2PM-6PM
2 2/8/2017 WEDNESDAY 9AM-11AM
3 2/13/2017 MONDAY 2PM-6PM
4 2/15/2017 WEDNESDAY 9AM-11AM
5 2/17/2017 FRIDAY 2PM-6PM
6 2/20/2017 MONDAY 2PM-4:30PM
7 2/22/2017 WEDNESDAY 9AM-12:30PM
8 2/24/2017 FRIDAY 2:30PM-6:30PM
9 2/27/2017 MONDAY 2PM-6PM
10 3/1/2017 WEDNESDAY 9AM-12PM
11 3/3/2017 FRIDAY 2PM-5PM
12 3/6/2017 MONDAY 2PM-6PM
13 3/8/2017 WEDNESDAY 9AM-12PM
14 3/8/2017 WEDNESDAY 2PM-5PM
15 3/13/2017 MONDAY 12PM-2PM
16 3/14/2017 TUESDAY 12PM-2PM
17 3/15/2017 WEDNESDAY 9AM-1PM
18 3/16/2017 THURSDAY 12PM-2PM
19 3/20/2017 MONDAY 2PM-6PM
20 3/22/2017 WEDNESDAY 9AM-12PM
21 3/24/2017 FRIDAY 2PM-5PM
22 3/27/2017 MONDAY 2PM-6PM
23 3/29/2017 WEDNESDAY 2PM-5PM
24 4/12/2017 WEDNESDAY 9AM-12PM
25 4/14/2017 FRIDAY 2PM-5PM
26 4/17/2017 MONDAY 2PM-4PM
27 4/17/2017 WEDNESDAY 9AM-12PM
28 9/11/2017 MONDAY 4PM-6PM
29 9/12/2017 TUESDAY 9AM-10:40AM & 1PM-2:20PM & 4PM-5:30PM
30 9/18/2017 MONDAY 4PM-6PM
31 9/19/2017 TUESDAY 9AM-10:40AM & 1PM-2:20PM

2/6/2017 MONDAY 2PM-6PM

Set up wikiPage & remote desktop.
Started working on a Python version of the web crawler. So far it successfully prints out a catchphrase/description for one website; more work is needed. The Python file can be found in: E:\McNair\Projects\Accelerators\Python WebCrawler\webcrawlerpython.py
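As a rough illustration of the "print a description for one website" step, here is a minimal sketch that reads the meta description tag from a page's HTML. It is not the contents of webcrawlerpython.py; the class and function names are hypothetical, and it assumes the page exposes its catchphrase in a <meta name="description"> tag, which not every site does.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once a description was found.
        if tag != "meta" or self.description is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "description":
            self.description = attr.get("content")


def extract_description(html):
    """Return the meta description of an HTML document, or None."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

The HTML string would come from whatever download step the crawler uses; when the function returns None, the fallback would be to look for text on the about page.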

2/8/2017 WEDNESDAY 9AM-11AM

Attempted to come up with possible cases for locating the description of accelerators - pick up from extracting bodies of text from the about page (given that it exists)

2/13/2017 MONDAY 2PM-6PM

Goals (for trials): 1) Build ER Diagram 2) For each entity, get XML snippet 3) Build a parser/ripper for single file; the python parser can be found at: E:\McNair\Projects\FDA Trials\Jeemin_Project
Trial Data Project

2/15/2017 WEDNESDAY 9AM-11AM

Discussed with Catherine what to do with the FDA Trial data and decided to build a dictionary with zip codes as keys and the number of trials that occurred in each zip code as values. Still trying to loop through the XML files without the code having to live in the same directory as them. Plan to write the result to a TSV file that Excel can open, with zip code in one column and number of occurrences in the other.
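The plan above could be sketched roughly as follows. This is not the code in Jeemin_Running_File.py; it assumes one trial per XML file and that zip codes sit in <zip> elements, and the function names are hypothetical. Passing the directory as an argument means the script does not need to sit next to the XML files.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET
from collections import Counter


def count_zip_codes(xml_dir):
    """Map each zip code to the number of files (trials) it appears in."""
    counts = Counter()
    for path in glob.glob(os.path.join(xml_dir, "*.xml")):
        seen = set()  # zip codes in this file, so each trial counts once
        # iterparse streams the file instead of loading it all at once.
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "zip" and elem.text:
                seen.add(elem.text.strip())
            elem.clear()  # free memory for elements already processed
        counts.update(seen)
    return counts


def write_tsv(counts, out_path):
    """Write zip code / count pairs as a tab-separated file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["zip_code", "trial_count"])
        for zip_code, n in sorted(counts.items()):
            writer.writerow([zip_code, n])
```

Zip codes are kept as strings so that leading zeros survive.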

2/17/2017 FRIDAY 2PM-6PM

Completed code for counting the number of occurrences of each unique zip code (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_Running_File.py). It has been running for 20+ minutes because the XML data files are so extensive. Meanwhile, started coding a dictionary keyed by each unique trial ID and mapped to all other information (location, sponsors, phase, drugs, etc.) (currently titled & located: E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py).

2/20/2017 MONDAY 2PM-4:30PM

Continued working on Jeemin_FDATrial_as_key_data_ripping.py to find tags and place all of that information in a list. The zip code script did not finish executing after 2+ hours of running - considering splitting the record file into smaller pieces, or running the processing on a faster machine.

2/22/2017 WEDNESDAY 9AM-12:30PM

Finished Jeemin_FDATrial_as_key_data_ripping.py (E:\McNair\Projects\FDA Trials\Jeemin_Project\Jeemin_FDATrial_as_key_data_ripping.py), which outputs to E:\McNair\Projects\FDA Trials\Jeemin_Project\general_data_ripping_output.txt; TODO: output four different tables & replace the write in the same for-loop as going through each file

2/24/2017 FRIDAY 2:30PM-6:30PM

Continued working on producing multiple tables - the first two are done. Was working on the location table, as there are multiple location tags per trial.
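Because one record can carry several location tags, a location table needs one row per (trial, location) pair rather than one row per trial. A minimal sketch of that flattening is below. The tag layout (id_info/nct_id, location/facility/address) is an assumption for illustration and may not match the real files, and location_rows is a hypothetical helper, not the function in Jeemin_FDATrial_as_key_data_ripping.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names and values are made up for illustration.
SAMPLE = """
<clinical_study>
  <id_info><nct_id>NCT00000001</nct_id></id_info>
  <location>
    <facility>
      <name>Site A</name>
      <address><city>Houston</city><zip>77005</zip></address>
    </facility>
  </location>
  <location>
    <facility>
      <name>Site B</name>
      <address><city>Boston</city><zip>02115</zip></address>
    </facility>
  </location>
</clinical_study>
"""


def location_rows(xml_text):
    """Return one (trial_id, facility, city, zip) tuple per <location> tag."""
    root = ET.fromstring(xml_text.strip())
    trial_id = root.findtext("id_info/nct_id")
    rows = []
    for loc in root.iter("location"):
        rows.append((
            trial_id,
            loc.findtext("facility/name"),          # None if the tag is missing
            loc.findtext("facility/address/city"),
            loc.findtext("facility/address/zip"),
        ))
    return rows
```

Repeating the trial ID in every row lets the location table be joined back to the main trial table later.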

2/27/2017 MONDAY 2PM-6PM

Finished producing tables from Jeemin_FDATrial_as_key_data_ripping.py
Talked to Julia about LinkedIn data extracting - to be discussed further with Julia & Peter.
Started web crawler for Wikipedia - currently pulls Endowment, Academic staff, students, undergraduates, and postgraduates info found on Rice Wikipedia page. Can be found in : E:\McNair\Projects\University Patents\Jeemin_University_wikipedia_crawler.py

3/1/2017 WEDNESDAY 9AM-12PM

Started re-running Jeemin_FDATrial_as_key_data_ripping.py

3/3/2017 FRIDAY 2PM-5PM

Attempted to output sql tables

3/6/2017 MONDAY 2PM-6PM

Installing python in a database
Added building Python function section to Working with PostgreSQL at the bottom of the page.
Ran FDA Trial data ripping again, as the text output files were wiped.
Plan on discussing with Julia and Meghana again about pulling universities and other relevant institutions from the Assignee List USA.
Talked to Sonia about pulling city, state, and zip code information, which is why Python was installed in a database. Will work with Sonia on Wednesday afternoon to see how best to implement a regex function.

3/8/2017 WEDNESDAY 9AM-12PM

Output sql tables from finished run of Jeemin_FDATrial_as_key_data_ripping.py
Ran through assigneelist_USA.txt to see how many different ways UNIVERSITY could be misspelled. There were many.
Tried to reason through a pattern that could catch all the different versions of UNIVERSITY. To discuss further: whether UNIVERSITIES, and names that include UNIVERSITIES but end in INC, should be pulled as relevant information.

3/8/2017 WEDNESDAY 2PM-5PM

Wrote a regex pattern that identifies all "university" matches. The output file can be found at E:\McNair\Projects\University Patents\university_pulled_from_assignee_list_USA
Talked to Sonia, but didn't reach a solid conclusion on how to identify, by running a Python function, whether keywords are associated with a city or a country.
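The actual pattern used on the assignee list is not recorded here. As a sketch of how a misspelling-tolerant regex might look, the hypothetical pattern below requires the word to start with UN, contain a V within the next few letters, and end in TY. It deliberately skips the plural UNIVERSITIES, and it will miss misspellings outside that shape.

```python
import re

# Hypothetical pattern: UN + up to 2 letters + V + 3 to 6 letters + TY.
UNIVERSITY_RE = re.compile(r"\bUN[A-Z]{0,2}V[A-Z]{3,6}TY\b")


def is_university(name):
    """Return True if the name contains UNIVERSITY or a close misspelling."""
    return UNIVERSITY_RE.search(name.upper()) is not None
```

A pattern this loose can also produce false positives, so hits would still need a manual check.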

3/13/2017 MONDAY 12PM-2PM

For University Patent Data Matching - matched SCHOOL (output: E:\McNair\Projects\University Patents\school_pulled_from_assignee_list_USA) and matched INSTITUTE (output: E:\McNair\Projects\University Patents\institute_pulled_from_assignee_list_USA).
University Patent Matching
To be worked on later: Grant XML parsing & general name matcher

3/14/2017 TUESDAY 12PM-2PM

Started pulling academy cases but there are too many cases to worry about, in terms of institution of interest. A document is located in E:\McNair\Projects\University Patents\academies_verify_cases.txt
Need Julia/Meghana to look through the hits and see which are relevant & extract pattern from there.
Having trouble outputting txt file without double quotes around every line.
Thinking that one text file should be output for all keywords instead of one per keyword, to avoid overlap. For example, COLLEGE and UNIVERSITY are both keywords, so ALBERT EINSTEIN COLLEGE OF YESHIVA UNIVERSITY would be hit twice if each keyword were handled separately. This could be done either with if-elseif statements or with one big regex check.
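The "one big regex check" idea could look roughly like the sketch below: a single alternation over all keywords, applied once per name, so each assignee is written once along with every keyword it hit. The keyword list is taken from the entries above; the function names and the tab-separated output layout are assumptions. Writing lines with a plain f.write (rather than a CSV writer) adds no quoting, which may help with the double-quote issue.

```python
import re

KEYWORDS = ["UNIVERSITY", "COLLEGE", "SCHOOL", "INSTITUTE", "ACADEMY"]
KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")


def tag_assignees(names):
    """Return (name, keywords_hit) once per name that hits any keyword."""
    rows = []
    for name in names:
        hits = sorted(set(KEYWORD_RE.findall(name.upper())))
        if hits:
            rows.append((name, hits))
    return rows


def write_rows(rows, out_path):
    """Write one unquoted line per name: name, a tab, then the keywords."""
    with open(out_path, "w") as f:
        for name, hits in rows:
            f.write(name + "\t" + ",".join(hits) + "\n")
```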

3/15/2017 WEDNESDAY 9AM-1PM

Todo: write a wikipage on possible input/output info on string matcher
Wrote part of XML parser, extracted yearly data into E:\McNair\Projects\Federal Grant Data\NSF\NSF Extracted Data (up to year 2010)

3/16/2017 THURSDAY 12PM-2PM

Further documented University Patent Matching
Finished writing XML Parser

3/20/2017 MONDAY 2PM-6PM

Talked to Julia about the universal matcher; want to combine all University of California entries into University of California, The Regents of
Converted crunchbase2013 data from MySQL to PostgreSQL, but having trouble with the last table, cb_relationships, which raises "syntax error at or near" in some places. Otherwise all tables exist in the database called crunchbase.
Federal Grant Data XML Parser was run - the three output textfiles can be found in E:\McNair\Projects\Federal Grant Data\NSF

3/22/2017 WEDNESDAY 9AM-12PM

Read about string matching & calculating string distance; relevant links are below
[1]
[2]
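For reference, one common string distance is the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version and is not necessarily the method described in the links above.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

A swap of two adjacent letters, as in UNVIERSITY, costs two edits under this definition.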

3/24/2017 FRIDAY 2PM-5PM

Discussed with Julia & Meghana about university keys to use to count # of occurrences, including aliases and misspellings
Thoughts: to use a scoring metric with a key of UNIVERSITY OF CALIFORNIA SYSTEM, it should have a 'better' score when compared to MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY or CALIFORNIA AT LOS ANGELES, UNVIERSITY OF than when compared to UNIVERSITY OF SOUTHERN CALIFORNIA, which may pose a challenge when attempting to implement this in a more general sense. In normalizing a string, strip "THE", "," and split words by spaces and compare each keyword from the two strings. Deciding on which strings to compare will be another issue - length (within some range maybe) could be an option.
Federal Grant Data XML Parser was rerun - same output textfiles
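To make the scoring concern concrete, here is a minimal sketch of the normalization described above (strip "THE" and commas, split on spaces) combined with a plain token-overlap (Jaccard) score. The Jaccard choice and the function names are assumptions, not the metric that was settled on.

```python
def normalize(name):
    """Uppercase, turn commas into spaces, split on spaces, drop THE."""
    tokens = name.upper().replace(",", " ").split()
    return [t for t in tokens if t != "THE"]


def jaccard(a, b):
    """Share of distinct tokens that the two names have in common."""
    sa, sb = set(normalize(a)), set(normalize(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


KEY = "UNIVERSITY OF CALIFORNIA SYSTEM"
usc_score = jaccard(KEY, "UNIVERSITY OF SOUTHERN CALIFORNIA")
berkeley_score = jaccard(
    KEY, "MATHEMATICAL SCIENCES PUBLISHERS C/O UNIVERSITY OF CALIFORNIA BERKELEY")
```

With this naive score, UNIVERSITY OF SOUTHERN CALIFORNIA shares 3 of 5 distinct tokens with the key, while the Berkeley string shares 3 of 9, so the wrong candidate scores higher. That is the challenge noted above: token overlap alone is not enough.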

3/27/2017 MONDAY 2PM-6PM

Writing code for university matches - decided to go through keys instead of each dataitem. Use keywords in each key to go through the dataitem - misspellings are currently unaccounted for.

3/29/2017 WEDNESDAY 2PM-5PM

Troubled by the variety of cases - separating keys into keywords does not work well for the University of California vs. University of Southern California case. Need a way to match University of Southern California first (more specific keys first), but it is unclear how to generalize this.
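One possible way to try "more specific keys first" is to sort the keys by number of keywords, longest first, and return the first key whose keywords all appear in the data item. The sketch below assumes that ordering rule and a hypothetical match_key helper; it does not handle misspellings, and whether keyword count is a good proxy for specificity in general is still open.

```python
def match_key(item, keys):
    """Return the most specific key whose keywords all occur in item."""
    item_tokens = set(item.upper().replace(",", " ").split())
    # Try keys with more keywords first so the more specific key wins.
    for key in sorted(keys, key=lambda k: len(k.split()), reverse=True):
        if set(key.upper().split()) <= item_tokens:
            return key
    return None


KEYS = ["UNIVERSITY OF CALIFORNIA", "UNIVERSITY OF SOUTHERN CALIFORNIA"]
```

With these two keys, an item containing SOUTHERN is assigned to University of Southern California, while other University of California names fall through to the shorter key.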

. . .

4/12/2017 WEDNESDAY 9AM-12PM

Finishing up cleaning the columns for Federal Grant Data - NIH. The output excel files can be accessed at:

E:\McNair\Projects\Federal Grant Data\NIH\Grants 
Titled:
    Jeemin_combined_files 1986-2001.csv
    Jeemin_combined_files 2002-2012.csv
    Jeemin_combined_files 2013-2015.csv

psql table formula:

CREATE TABLE all_grants (
 APPLICATION_ID integer,
 ACTIVITY varchar(3),
 ADMINISTERING_IC varchar(2),
 APPLICATION_TYPE varchar(1),
 ARRA_FUNDED varchar(1),
 AWARD_NOTICE_DATE date, 
 BUDGET_START date,
 BUDGET_END date, 
 CFDA_CODE varchar(3), 
 CORE_PROJECT_NUM varchar(11),
 ED_INST_TYPE varchar(30), 
 FOA_NUMBER varchar(13),
 FULL_PROJECT_NUM varchar(35),
 FUNDING_ICs varchar(40),
 FUNDING_MECHANISM varchar(23),
 FY smallint, 
 IC_NAME varchar(77), 
 NIH_SPENDING_CATS varchar(295), 
 ORG_CITY varchar(20),
 ORG_COUNTRY varchar(16),
 ORG_DEPT varchar(30),
 ORG_DISTRICT smallint, 
 ORG_DUNS integer,
 ORG_FIPS varchar(2), 
 ORG_NAME varchar(60), 
 ORG_STATE varchar(2), 
 ORG_ZIPCODE integer, 
 PHR varchar(200), 
 PI_IDS varchar(30), 
 PI_NAMEs varchar(200), 
 PROGRAM_OFFICER_NAME varchar(36), 
 PROJECT_START date, 
 PROJECT_END date, 
 PROJECT_TERMS varchar(200), 
 PROJECT_TITLE varchar(244), 
 SERIAL_NUMBER smallint,
 STUDY_SECTION varchar(4), 
 STUDY_SECTION_NAME varchar(100), 
 SUBPROJECT_ID smallint, 
 SUFFIX varchar(2),
 SUPPORT_YEAR smallint, 
 DIRECT_COST_AMT integer, 
 INDIRECT_COST_AMT integer, 
 TOTAL_COST integer, 
 TOTAL_COST_SUB_PROJECT integer
);
\COPY all_grants FROM 'Jeemin_combined_files 1986-2001.csv' WITH DELIMITER AS E'\t' HEADER NULL AS '' CSV

4/14/2017 FRIDAY 2PM-5PM

Loaded Federal Grants Data into database

4/17/2017 MONDAY 2PM-4PM

4/17/2017 WEDNESDAY 9AM-12PM

To pull accelerators: wrote a simple regex-based Python script and ran it on the organizations data.

Code: E:\McNair\Projects\Accelerators\Crunchbase Snapshot\accelerator keywords.py
Matched output (885 matches): E:\McNair\Projects\Accelerators\Crunchbase Snapshot\Jeemin_885_accel_matches

9/11/2017 MONDAY 4PM-6PM

Ensured that documentation exists for the projects worked on last semester.

9/12/2017 TUESDAY 9AM-10:40AM & 1PM-2:20PM & 4PM-5:30PM

Clarified University Matching output file.
Helped Christy with pdf-reader, capturing keywords in readable format.

..

9/18/2017 MONDAY 4PM-6PM

Read documentation on PostGIS and tiger geocoder
Continue reading from: http://workshops.boundlessgeo.com/postgis-intro/joins.html

9/19/2017 TUESDAY 9AM-10:40AM & 1PM-2:20PM

sum(expression): aggregate to return a sum for a set of records
count(expression): aggregate to return the size of a set of records
ST_Area(geometry) returns the area of the polygons
ST_AsText(geometry) returns WKT text
ST_Contains(geometry A, geometry B) returns true if geometry A contains geometry B
ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
ST_DWithin(geometry A, geometry B, radius) returns true if geometry A is radius distance or less from geometry B
ST_GeomFromText(text) returns geometry
ST_Intersects(geometry A, geometry B) returns true if geometry A intersects geometry B
ST_Length(linestring) returns the length of the linestring
ST_Touches(geometry A, geometry B) returns true if the boundary of geometry A touches geometry B
ST_Within(geometry A, geometry B) returns true if geometry A is within geometry B