OKC Project

From edegan.com
Revision as of 16:21, 26 September 2011 by imported>Ed
  • This page is protected so that only Ed, Toby and Misiek can read or edit it.

I will be posting reports and other materials on the OK Cupid project here.

Data Description from Misiek

Toby sent me a file with the following data description:

USERS

Total = 1,804,993
Female = 722,889 = 40%

US = 1,523,778
NonUS = 281,215 = 16%

US Male = 897,323 = 59%
US Female = 626,455 = 41%

US Male Singles = 827,702 = 92% of US Male
US Female Singles = 565,067 = 90% of US Female

US Male LongTermInterest = 530,450 = 59% of US Male
US Female LongTermInterest = 333,947 = 53% of US Female

US Male ShortTermInterest = 433,904 = 48% of US Male
US Female ShortTermInterest = 225,715 = 36% of US Female

US Male CasualSex = 96,714 = 11% of US Male
US Female CasualSex = 19,331 = 3% of US Female

US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male

US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female

US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 58% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158 = 4% of US Male
US Male Latino = 57,172 = 6% of US Male

US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black = 32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female
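
Each percentage in the listing above is a count divided by its base group and rounded to a whole percent. A minimal Python sketch of that arithmetic, using counts copied from the listing (the helper name `share` is illustrative and not part of Toby's file):

```python
# Base group sizes copied from the description above.
US_MALE = 897_323
US_FEMALE = 626_455


def share(count, base):
    """Return count as a whole-number percentage of base."""
    return round(100 * count / base)


print(share(827_702, US_MALE))    # US Male Singles
print(share(565_067, US_FEMALE))  # US Female Singles
print(share(517_636, US_MALE))    # US Male White
```

Running the same helper over every row is a quick way to spot a count or percentage that was mistyped.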

Data Description

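This section provides summary statistics on gender, orientation, ethnicity, location, age and account age. In the tables below, the 'Percentage' column is a fraction of all 1,804,993 users (e.g. 0.4005 means 40.05%).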

Gender

+--------+---------+------------+
| female | Count   | Percentage |
+--------+---------+------------+
|      0 | 1082104 |     0.5995 |
|      1 |  722889 |     0.4005 |
+--------+---------+------------+

Orientation

0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+
| orientation | Count   | Percentage |
+-------------+---------+------------+
|           0 | 1570050 |     0.8698 |
|           1 |  139697 |     0.0774 |
|           2 |   95246 |     0.0528 |
+-------------+---------+------------+

Ethnicity

The data was preprocessed to assign individuals who reported more than one ethnicity to the 'mixed' category. The variable 'ethnicity' is an integer categorization for fast sorting, and the labels are provided in the variable 'ethnicitylabels'.

+------------------+--------+------------+
| ethnicitylabels  | Count  | Percentage |
+------------------+--------+------------+
| mixed            | 105456 |     0.0584 |
| white            | 917178 |     0.5081 |
| black            |  53884 |     0.0299 |
| hispanic_latin   |  61034 |     0.0338 |
| asian            |  54645 |     0.0303 |
| indian           |  11658 |     0.0065 |
| middle_eastern   |   5916 |     0.0033 |
| native_american  |   4056 |     0.0022 |
| pacific_islander |   3238 |     0.0018 |
| other            |  24761 |     0.0137 |
| none             | 563167 |     0.3120 |
+------------------+--------+------------+

The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:

+-----------------+----------+
| num_ethnicities | COUNT(*) |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+
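
The 'mixed' assignment described above can be sketched in Python as follows. This is an illustration under assumptions, not the preprocessing code that was actually run: the function name, the input format (a list of reported labels) and the integer codes (taken from the row order of the table) are all hypothetical.

```python
# Hypothetical integer codes: the row order of the ethnicity table.
# The real values of the 'ethnicity' variable are not documented here.
ETHNICITY_LABELS = [
    "mixed", "white", "black", "hispanic_latin", "asian", "indian",
    "middle_eastern", "native_american", "pacific_islander", "other", "none",
]
ETHNICITY_CODE = {label: i for i, label in enumerate(ETHNICITY_LABELS)}


def classify_ethnicity(reported):
    """Map a list of self-reported ethnicities to (label, integer code)."""
    distinct = set(reported)
    if not distinct:
        label = "none"           # nothing reported
    elif len(distinct) == 1:
        label = next(iter(distinct))
    else:
        label = "mixed"          # more than one ethnicity reported
    return label, ETHNICITY_CODE[label]


# Consistency check on the published counts: users with 2 to 9 reported
# ethnicities should add up to the 'mixed' row of the ethnicity table.
multi = [83_676, 14_781, 3_348, 976, 330, 243, 409, 1_693]
print(sum(multi))  # → 105456
```

The sum matches the 'mixed' count of 105,456, so the two tables agree with each other.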

Location

Zip3 was provided in the data, but I was cautioned that the third digit is not meaningful (i.e. it was added randomly for obfuscation). Zip2 and Zip1 (the latter is reported below) are coded as variables.

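See below. The code 99 has the same count as the NonUS total in Toby's description (281,215), so it presumably marks non-US users: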

+------+--------+------------+
| zip1 | Count  | Percentage |
+------+--------+------------+
| 0    | 178253 |     0.0988 |
| 1    | 229719 |     0.1273 |
| 2    | 136538 |     0.0756 |
| 3    | 148748 |     0.0824 |
| 4    | 126115 |     0.0699 |
| 5    |  66067 |     0.0366 |
| 6    | 105097 |     0.0582 |
| 7    | 126851 |     0.0703 |
| 8    |  84204 |     0.0467 |
| 9    | 322186 |     0.1785 |
| 99   | 281215 |     0.1558 |
+------+--------+------------+
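
A minimal Python sketch of how Zip2 and Zip1 could be derived from Zip3 by truncation. The function name and the `is_us` flag are hypothetical, and the use of 99 for non-US users is an assumption inferred from the zip1 table, not something stated in the data documentation.

```python
NON_US_CODE = "99"  # assumption: inferred from the zip1 table, not documented


def zip_prefixes(zip3, is_us=True):
    """Return (zip2, zip1) for a three-character zip3 string.

    zip3 is kept as a string so that leading zeros survive.
    """
    if not is_us:
        # Assumed convention: non-US users get the same code in both variables.
        return NON_US_CODE, NON_US_CODE
    return zip3[:2], zip3[:1]


print(zip_prefixes("021"))
print(zip_prefixes("941"))
```

Only the first two digits carry geographic information, so the randomized third digit is simply dropped.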

Age

The age ranges were chosen arbitrarily, though they have worked out reasonably well. Further refinement is possible. The variable 'age_rnum' is an integer categorization, and 'age_range' provides the labels. Ages were calculated from 'birth_year', using 2010 as the reference year.

+-----------+--------+------------+
| age_range | Count  | Percentage |
+-----------+--------+------------+
| 15-19     |  97659 |     0.0541 |
| 20-24     | 485028 |     0.2687 |
| 25-29     | 475376 |     0.2634 |
| 30-34     | 274281 |     0.1520 |
| 35-39     | 155925 |     0.0864 |
| 40-44     | 109121 |     0.0605 |
| 45-49     |  78618 |     0.0436 |
| 50-59     |  95810 |     0.0531 |
| >60       |  33175 |     0.0184 |
+-----------+--------+------------+

Birth year is self-reported, and almost surely wrong in some cases. It ranges from 1900 to 1995.
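
The age calculation and binning can be sketched in Python as below. The bin edges come from the table; the function name, the 0-based numbering of 'age_rnum', and the choice to place age 60 in the '>60' bin are assumptions.

```python
REFERENCE_YEAR = 2010

# (lowest age, highest age, label), in the order of the table above.
AGE_BINS = [
    (15, 19, "15-19"), (20, 24, "20-24"), (25, 29, "25-29"),
    (30, 34, "30-34"), (35, 39, "35-39"), (40, 44, "40-44"),
    (45, 49, "45-49"), (50, 59, "50-59"),
]


def age_range(birth_year):
    """Return (age_rnum, age_range) for a birth year, or None if under 15."""
    age = REFERENCE_YEAR - birth_year
    for rnum, (low, high, label) in enumerate(AGE_BINS):
        if low <= age <= high:
            return rnum, label
    if age >= 60:
        # Assumption: the '>60' bin also holds users who are exactly 60.
        return len(AGE_BINS), ">60"
    return None


print(age_range(1995))  # youngest reported birth year
print(age_range(1900))  # oldest reported birth year
```

With birth years between 1900 and 1995, computed ages run from 15 to 110, so every user falls into one of the nine bins.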

Account Age

The account age range variables, 'acc_age_rnum' for the integer categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activity, and to provide a normalization base. Account ages are measured in days and run from 0 (the youngest account) to 2,613 (the oldest).

+---------------+--------+------------+
| acc_age_range | Count  | Percentage |
+---------------+--------+------------+
| 0             |   1385 |     0.0008 |
| 1             |   7914 |     0.0044 |
| 2             |   7967 |     0.0044 |
| 3             |   9051 |     0.0050 |
| 4             |   9134 |     0.0051 |
| 5             |  10032 |     0.0056 |
| 6-10          |  35862 |     0.0199 |
| 11-15         |  39986 |     0.0222 |
| 16-30         | 136983 |     0.0759 |
| 31-60         | 233560 |     0.1294 |
| 61-100        | 237369 |     0.1315 |
| 101-365       | 535715 |     0.2968 |
| 366-3650      | 540035 |     0.2992 |
+---------------+--------+------------+
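
A minimal Python sketch of the account age binning, using the upper edge of each range from the table and a binary search. The function name and the 0-based numbering of 'acc_age_rnum' are assumptions.

```python
from bisect import bisect_left

# Upper edge (in days) and label of each bin, in table order.
ACC_AGE_UPPER = [0, 1, 2, 3, 4, 5, 10, 15, 30, 60, 100, 365, 3650]
ACC_AGE_LABELS = [
    "0", "1", "2", "3", "4", "5", "6-10", "11-15", "16-30",
    "31-60", "61-100", "101-365", "366-3650",
]


def acc_age_range(days):
    """Return (acc_age_rnum, acc_age_range) for an account age in days."""
    if days < 0 or days > ACC_AGE_UPPER[-1]:
        raise ValueError(f"account age out of range: {days}")
    # Index of the first bin whose upper edge is >= days.
    rnum = bisect_left(ACC_AGE_UPPER, days)
    return rnum, ACC_AGE_LABELS[rnum]


print(acc_age_range(0))     # youngest account
print(acc_age_range(2613))  # oldest account
```

Keeping the edges in one list makes it easy to try finer bins later without touching the lookup logic.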

Other variables

Quit times range from:

  • 2008-07-19 (earliest)
  • 2010-12-17 (latest)

Deleted or blacklisted accounts:

+------------------------+----------+
| deleted_or_blacklisted |  Count   |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+

Outstanding Data Issues

  • The variables 'num_essays' and 'profile_length' need fixing