Difference between revisions of "OKC Project"

From edegan.com
imported>Ed

Revision as of 17:01, 26 September 2011

  • This page is protected so that only Ed, Toby and Misiek can read or edit it.

I will be posting reports and other materials on the OK Cupid project here.

Data Description from Misiek

Toby sent me a file with the following data description:

USERS

Total = 1,804,993
Female = 722,889 = 40%

US = 1,523,778
NonUS = 281,215 = 16%

US Male = 897,323 = 59%
US Female = 626,455 = 41%

US Male Singles = 827,702 = 92% of US Male
US Female Singles = 565,067 = 90% of US Female

US Male LongTermInterest = 530,450 = 59% of US Male
US Female LongTermInterest = 333,947 = 53% of US Female

US Male ShortTermInterest = 433,904 = 48% of US Male
US Female ShortTermInterest = 225,715 = 36% of US Female

US Male CasualSex = 96,714 = 11% of US Male
US Female CasualSex = 19,331 = 3% of US Female

US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male

US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female

US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 58% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158 = 4% of US Male
US Male Latino = 57,172 = 6% of US Male

US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black = 32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female
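
The percentages in this list can be re-derived from the counts. The following is a minimal sketch for doing so, assuming whole-number rounding; the helper name share is hypothetical and not part of the data files.

```python
def share(count: int, base: int) -> int:
    """Return count as a whole-number percentage of base (rounded)."""
    return round(100 * count / base)


# Counts copied from the list above.
total_users = 1_804_993
us_male = 897_323
us_female = 626_455

print(share(722_889, total_users))           # female share of all users
print(share(us_male, us_male + us_female))   # male share of US users
print(share(800_126, us_male))               # straight share of US males
```

Note that the base differs by line: overall figures use all users, while the "of US Male" and "of US Female" figures use the US gender totals.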

Data Description

This section provides summary stats on gender/orientation/race/location/age.

Gender

+--------+---------+------------+---------+-----------+---------+-----------+
| female | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+--------+---------+------------+---------+-----------+---------+-----------+
|      0 | 1082104 |     0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
|      1 |  722889 |     0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
+--------+---------+------------+---------+-----------+---------+-----------+

Orientation

0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+---------+-----------+---------+-----------+
| orientation | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-------------+---------+------------+---------+-----------+---------+-----------+
|           0 | 1570050 |     0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
|           1 |  139697 |     0.0774 |  2.9008 |     28802 |  2.1532 |     29212 |
|           2 |   95246 |     0.0528 |  3.5746 |     25287 |  2.8651 |     40143 |
+-------------+---------+------------+---------+-----------+---------+-----------+

Ethnicity

The data was preprocessed to assign individuals who reported more than one ethnicity to the 'mixed' category. The variable 'ethnicity' is an integer categorization for fast sorting, and the labels are provided in the variable 'ethnicitylabels'.

+------------------+--------+------------+---------+-----------+---------+-----------+
| ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------------------+--------+------------+---------+-----------+---------+-----------+
| mixed            | 105456 |     0.0584 |  4.0894 |     45674 |  2.8836 |     43540 |
| white            | 917178 |     0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
| black            |  53884 |     0.0299 |  4.2456 |     18498 |  2.4855 |     13101 |
| hispanic_latin   |  61034 |     0.0338 |  4.2292 |     25278 |  2.9754 |     23348 |
| asian            |  54645 |     0.0303 |  3.7712 |     16612 |  2.6209 |     17222 |
| indian           |  11658 |     0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
| middle_eastern   |   5916 |     0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
| native_american  |   4056 |     0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
| pacific_islander |   3238 |     0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
| other            |  24761 |     0.0137 |  4.3936 |     11454 |  2.9021 |      9664 |
| none             | 563167 |     0.3120 |  3.9094 |     53230 |  2.1889 |     48969 |
+------------------+--------+------------+---------+-----------+---------+-----------+

The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:

+-----------------+----------+
| num_ethnicities | COUNT(*) |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+
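
The 'mixed' assignment described above can be sketched as follows. This is an illustration only, not the preprocessing code that was actually run: the function name ethnicity_label and the treatment of an empty report as 'none' are assumptions.

```python
# Counts copied from the num_ethnicities table above.
NUM_ETHNICITIES = {1: 1_699_537, 2: 83_676, 3: 14_781, 4: 3_348, 5: 976,
                   6: 330, 7: 243, 8: 409, 9: 1_693}


def ethnicity_label(reported):
    """Collapse a list of reported ethnicities into a single label.

    More than one reported ethnicity becomes 'mixed'. Mapping an empty
    report to 'none' is an assumption made for this sketch.
    """
    if len(reported) > 1:
        return "mixed"
    if len(reported) == 1:
        return reported[0]
    return "none"


# Users with two or more reported ethnicities.
mixed_total = sum(n for k, n in NUM_ETHNICITIES.items() if k > 1)
print(mixed_total)
print(ethnicity_label(["white", "asian"]))
```

The rows with two or more ethnicities sum to 105,456, which matches the 'mixed' count in the ethnicity table.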

Location

Zip3 was provided in the data, but I was cautioned that the third digit is not meaningful (i.e. it was added randomly for obfuscation). Zip2 and Zip1 (the latter is reported below) are coded as variables.

See:

  • List of Zip Code Prefixes (http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes)

+------+--------+------------+---------+-----------+---------+-----------+
| zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------+--------+------------+---------+-----------+---------+-----------+
| 0    | 178253 |     0.0988 |  3.8582 |     55219 |  2.5989 |     52293 |
| 1    | 229719 |     0.1273 |  4.4691 |     84690 |  2.6089 |     73503 |
| 2    | 136538 |     0.0756 |  3.7811 |     38575 |  2.5554 |     37794 |
| 3    | 148748 |     0.0824 |  3.8744 |     41979 |  2.6477 |     40292 |
| 4    | 126115 |     0.0699 |  3.6275 |     33953 |  2.5882 |     33064 |
| 5    |  66067 |     0.0366 |  3.3721 |     15643 |  2.3133 |     15157 |
| 6    | 105097 |     0.0582 |  3.6479 |     29650 |  2.5607 |     29468 |
| 7    | 126851 |     0.0703 |  4.0350 |     38615 |  2.8075 |     37758 |
| 8    |  84204 |     0.0467 |  3.7276 |     23875 |  2.6178 |     23128 |
| 9    | 322186 |     0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
| 99   | 281215 |     0.1558 |  3.6252 |     48538 |  2.2509 |     44143 |
+------+--------+------------+---------+-----------+---------+-----------+
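
A minimal sketch of how Zip2 and Zip1 could be derived from Zip3 is given below. The function name zip_prefixes, the string representation of Zip3, and the use of '99' for records with no US zip code are assumptions; the last is suggested by the zip1 = 99 count (281,215) equalling the NonUS count reported earlier.

```python
from typing import Optional, Tuple

NON_US = "99"  # assumed placeholder code for users without a US zip


def zip_prefixes(zip3: Optional[str]) -> Tuple[str, str]:
    """Return (zip2, zip1) for a three-digit zip prefix.

    The third digit is discarded because it is not meaningful.
    Missing values map to the assumed non-US code for both variables.
    """
    if not zip3:
        return NON_US, NON_US
    digits = zip3.zfill(3)  # keep leading zeros, e.g. '21' -> '021'
    return digits[:2], digits[:1]


print(zip_prefixes("945"))
print(zip_prefixes(None))
```

Keeping the prefixes as strings preserves leading zeros, which matter for zip codes in the Northeast.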

Age

The age ranges were created arbitrarily, though they have worked out reasonably well. Further refinement is possible. The variable 'age_rnum' is an integer categorization, and 'age_range' provides the labels. Ages were calculated from 'birth_year' using 2010 as the reference point.

+-----------+--------+------------+---------+-----------+---------+-----------+
| age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-----------+--------+------------+---------+-----------+---------+-----------+
| 15-19     |  97659 |     0.0541 |  4.8070 |     30385 |  3.5139 |     35708 |
| 20-24     | 485028 |     0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
| 25-29     | 475376 |     0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
| 30-34     | 274281 |     0.1520 |  4.4587 |     89759 |  2.4158 |     66236 |
| 35-39     | 155925 |     0.0864 |  3.7430 |     42681 |  2.3843 |     36490 |
| 40-44     | 109121 |     0.0605 |  3.5817 |     27776 |  2.4255 |     24939 |
| 45-49     |  78618 |     0.0436 |  3.6452 |     19225 |  2.2995 |     16338 |
| 50-59     |  95810 |     0.0531 |  3.3421 |     21055 |  2.1347 |     17831 |
| >60       |  33175 |     0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
+-----------+--------+------------+---------+-----------+---------+-----------+

Birth year is self-reported, and almost surely wrongly so in some cases. It ranges from 1900 to 1995.
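
The age calculation and binning described above can be sketched as follows. The bin edges are taken from the table labels; the function name age_range, the treatment of age 60 as part of '>60', and the error for ages under 15 are assumptions.

```python
REFERENCE_YEAR = 2010  # reference point stated in the text

# (low, high, label) for the closed age ranges in the table above.
AGE_BINS = [
    (15, 19, "15-19"), (20, 24, "20-24"), (25, 29, "25-29"),
    (30, 34, "30-34"), (35, 39, "35-39"), (40, 44, "40-44"),
    (45, 49, "45-49"), (50, 59, "50-59"),
]


def age_range(birth_year: int) -> str:
    """Map a birth year to its age_range label."""
    age = REFERENCE_YEAR - birth_year
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    if age >= 60:  # assumption: 60 itself falls in the '>60' bin
        return ">60"
    raise ValueError(f"age {age} is below the youngest bin")


print(age_range(1995))  # youngest reported birth year
print(age_range(1900))  # oldest reported birth year
```

Under these assumptions, the reported birth-year range of 1900 to 1995 maps entirely onto the bins in the table.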

Account Age

The account age range variables, 'acc_age_rnum' for the integer categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activity and to provide a normalization base. The oldest account is 2,613 days old. The youngest is 0 days old.

+---------------+--------+------------+---------+-----------+---------+-----------+
| acc_age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+---------------+--------+------------+---------+-----------+---------+-----------+
| 0             |   1385 |     0.0008 |         |           |         |           |
| 1             |   7914 |     0.0044 |         |           |         |           |
| 2             |   7967 |     0.0044 |         |           |         |           |
| 3             |   9051 |     0.0050 |         |           |         |           |
| 4             |   9134 |     0.0051 |         |           |         |           |
| 5             |  10032 |     0.0056 |         |           |         |           |
| 6-10          |  35862 |     0.0199 |         |           |         |           |
| 11-15         |  39986 |     0.0222 |         |           |         |           |
| 16-30         | 136983 |     0.0759 |         |           |         |           |
| 31-60         | 233560 |     0.1294 |         |           |         |           |
| 61-100        | 237369 |     0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
| 101-365       | 535715 |     0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
| 366-3650      | 540035 |     0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
+---------------+--------+------------+---------+-----------+---------+-----------+
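
A minimal sketch of the account-age binning is given below, using the bin edges from the table labels. The function name acc_age_range and the error handling for out-of-range values are assumptions; the integer index returned alongside the label stands in for 'acc_age_rnum', whose actual numbering is not given here.

```python
from bisect import bisect_left

# Inclusive upper bound (in days) of each bin, with its label.
UPPER_BOUNDS = [0, 1, 2, 3, 4, 5, 10, 15, 30, 60, 100, 365, 3650]
LABELS = ["0", "1", "2", "3", "4", "5", "6-10", "11-15", "16-30",
          "31-60", "61-100", "101-365", "366-3650"]


def acc_age_range(days: int):
    """Return (index, label) of the account-age bin containing days."""
    if days < 0 or days > UPPER_BOUNDS[-1]:
        raise ValueError(f"account age {days} is outside the bins")
    # First bin whose upper bound is >= days.
    index = bisect_left(UPPER_BOUNDS, days)
    return index, LABELS[index]


print(acc_age_range(0))     # youngest account
print(acc_age_range(2613))  # oldest account
```

Storing only the upper bounds keeps the bins contiguous by construction, so refining them means editing two parallel lists.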

Other variables

Quit times range from:

  • 2008-07-19 (earliest)
  • 2010-12-17 (latest)

Deleted or blacklisted accounts:

+------------------------+----------+
| deleted_or_blacklisted |  Count   |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+

Outstanding Data Issues

  • The variables 'num_essays' and 'profile_length' need fixing