OKC Project

From edegan.com
Jump to navigation Jump to search
  • This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.

I will be posting reports and other materials on the OK Cupid project here.

Script Files

There are two SQL script files that were written to process the data:

These files are self-documenting, and the second file builds out the current base table set containing: users, views, messages, and delta.

Setup Instructions for Usha

See the Research Computing At HBS page for detailed information.

Quick quide:

  1. Go to vpn.hbs.edu and click Network Connect -> Start
  2. Using SSH client, connect to researchgrid.hbs.edu
  3. Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql

Useful commands:

mysql> SHOW Databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mpiskorski_tstuart |
| okcupid            |
| okcupid_2          |
+--------------------+
4 rows in set (0.02 sec)

You'll have access to mpiskorski_tstuart

USE mpiskorski_tstuart;

(Misiek and Toby's working database)

SHOW TABLES;

There are two datasets: OKC1 and OKC2. We are now working with OKC2.

Both datasets have:

Users
Messages
Views

OKC2 also has ProfileDelta (changes to profiles). Describe tables with:

DESC TableName

Data Description from Misiek

Toby sent me a file with the following data description:

USERS

Total = 1,804,993
Female = 722,889 = 40%

US = 1,523,778
NonUS = 281,215 = 16%

US Male = 897,323 = 59%
US Female = 626,455 = 41%

US Male Singles = 827,702 = 92% of US Male
US Female Singles =  565,067 = 90% of US Female 

US Male LongTermInterest = 530,450 = 59% of Male
US Female LongTermInterest = 333,947 = 53% of US Female

US Male ShortTermInterest =  433,904 = 48% of Male
US Female ShortTermInterest = 225,715 = 36% of US Female

US Male CasualSex = 96,714 = 11% of Male
US Female CasualSex = 19,331 = 3% of US Female

US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male

US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female

US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 57,172% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158= 4% of US Male
US Male Latino = 57,172 = 6% of US Male

US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black =  32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female

Data Description

This section provides summary stats on gender/orientation/race/location/age.

Gender

+--------+---------+------------+---------+-----------+---------+-----------+
| female | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+--------+---------+------------+---------+-----------+---------+-----------+
|      0 | 1082104 |     0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
|      1 |  722889 |     0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
+--------+---------+------------+---------+-----------+---------+-----------+

Orientation

0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+---------+-----------+---------+-----------+
| orientation | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-------------+---------+------------+---------+-----------+---------+-----------+
|           0 | 1570050 |     0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
|           1 |  139697 |     0.0774 |  2.9008 |     28802 |  2.1532 |     29212 |
|           2 |   95246 |     0.0528 |  3.5746 |     25287 |  2.8651 |     40143 |
+-------------+---------+------------+---------+-----------+---------+-----------+

Ethnicity

The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.

+------------------+--------+------------+---------+-----------+---------+-----------+
| ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------------------+--------+------------+---------+-----------+---------+-----------+
| mixed            | 105456 |     0.0584 |  4.0894 |     45674 |  2.8836 |     43540 |
| white            | 917178 |     0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
| black            |  53884 |     0.0299 |  4.2456 |     18498 |  2.4855 |     13101 |
| hispanic_latin   |  61034 |     0.0338 |  4.2292 |     25278 |  2.9754 |     23348 |
| asian            |  54645 |     0.0303 |  3.7712 |     16612 |  2.6209 |     17222 |
| indian           |  11658 |     0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
| middle_eastern   |   5916 |     0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
| native_american  |   4056 |     0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
| pacific_islander |   3238 |     0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
| other            |  24761 |     0.0137 |  4.3936 |     11454 |  2.9021 |      9664 |
| none             | 563167 |     0.3120 |  3.9094 |     53230 |  2.1889 |     48969 |
+------------------+--------+------------+---------+-----------+---------+-----------+

The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:

+-----------------+----------+
| num_ethnicities | COUNT(*) |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+

Location

Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.

See:

+------+--------+------------+---------+-----------+---------+-----------+
| zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------+--------+------------+---------+-----------+---------+-----------+
| 0    | 178253 |     0.0988 |  3.8582 |     55219 |  2.5989 |     52293 |
| 1    | 229719 |     0.1273 |  4.4691 |     84690 |  2.6089 |     73503 |
| 2    | 136538 |     0.0756 |  3.7811 |     38575 |  2.5554 |     37794 |
| 3    | 148748 |     0.0824 |  3.8744 |     41979 |  2.6477 |     40292 |
| 4    | 126115 |     0.0699 |  3.6275 |     33953 |  2.5882 |     33064 |
| 5    |  66067 |     0.0366 |  3.3721 |     15643 |  2.3133 |     15157 |
| 6    | 105097 |     0.0582 |  3.6479 |     29650 |  2.5607 |     29468 |
| 7    | 126851 |     0.0703 |  4.0350 |     38615 |  2.8075 |     37758 |
| 8    |  84204 |     0.0467 |  3.7276 |     23875 |  2.6178 |     23128 |
| 9    | 322186 |     0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
| 99   | 281215 |     0.1558 |  3.6252 |     48538 |  2.2509 |     44143 |
+------+--------+------------+---------+-----------+---------+-----------+

Age

The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.

+-----------+--------+------------+---------+-----------+---------+-----------+
| age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-----------+--------+------------+---------+-----------+---------+-----------+
| 15-19     |  97659 |     0.0541 |  4.8070 |     30385 |  3.5139 |     35708 |
| 20-24     | 485028 |     0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
| 25-29     | 475376 |     0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
| 30-34     | 274281 |     0.1520 |  4.4587 |     89759 |  2.4158 |     66236 |
| 35-39     | 155925 |     0.0864 |  3.7430 |     42681 |  2.3843 |     36490 |
| 40-44     | 109121 |     0.0605 |  3.5817 |     27776 |  2.4255 |     24939 |
| 45-49     |  78618 |     0.0436 |  3.6452 |     19225 |  2.2995 |     16338 |
| 50-59     |  95810 |     0.0531 |  3.3421 |     21055 |  2.1347 |     17831 |
| >60       |  33175 |     0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
+-----------+--------+------------+---------+-----------+---------+-----------+

Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.

Account Age

The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.

+---------------+--------+------------+---------+-----------+---------+-----------+
| acc_age_range | Count  | Percentage |         |           |         |           |
+---------------+--------+------------+---------+-----------+---------+-----------+
| 0             |   1385 |     0.0008 |         |           |         |           |
| 1             |   7914 |     0.0044 |         |           |         |           |
| 2             |   7967 |     0.0044 |         |           |         |           |
| 3             |   9051 |     0.0050 |         |           |         |           |
| 4             |   9134 |     0.0051 |         |           |         |           |
| 5             |  10032 |     0.0056 |         |           |         |           |
| 6-10          |  35862 |     0.0199 |         |           |         |           |
| 11-15         |  39986 |     0.0222 |         |           |         |           |
| 16-30         | 136983 |     0.0759 |         |           |         |           |
| 31-60         | 233560 |     0.1294 |         |           |         |           |
| 61-100        | 237369 |     0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
| 101-365       | 535715 |     0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
| 366-3650      | 540035 |     0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
+---------------+--------+------------+---------+-----------+---------+-----------+

Other variables

Quit times range from:

  • 2008-07-19 (youngest)
  • 2010-12-17 (oldest)

Deleted or blacklisted accounts:

+------------------------+----------+
| deleted_or_blacklisted |  Count   |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+

Outstanding Data Issues

  • The variables 'num_essays' and 'profile_length' need fixing