9,810 bytes added, 16:00, 19 December 2011
*This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.
I will be posting reports and other materials on the OK Cupid project here.
==Script Files==

There are two SQL script files that were written to process the data:
*[[:Image:OKC1.sql.asci|OKC1.sql]]
*[[:Image:OKC2.sql.asci|OKC2.sql]]

These files are self-documenting, and the second file builds out the current base table set containing: '''users''', '''views''', '''messages''', and '''delta'''.

==Setup Instructions for Usha==

See the [[Research Computing At HBS]] page for detailed information.

Quick guide:
#Go to vpn.hbs.edu and click Network Connect -> Start
#Using an SSH client, connect to researchgrid.hbs.edu
#Connect to the MySQL cluster:
 mysql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql

Useful commands:
 mysql> SHOW DATABASES;
 +--------------------+
 | Database           |
 +--------------------+
 | information_schema |
 | mpiskorski_tstuart |
 | okcupid            |
 | okcupid_2          |
 +--------------------+
 4 rows in set (0.02 sec)

You'll have access to mpiskorski_tstuart:
 USE mpiskorski_tstuart; (Misiek and Toby's working database)
 SHOW TABLES;

There are two datasets: OKC1 and OKC2. We are now working with OKC2. Both datasets have:
*Users
*Messages
*Views

OKC2 also has ProfileDelta (changes to profiles).

Describe tables with:
 DESC TableName

==Data Description from TobyMisiek==
Toby sent me a file with the following data description:
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female
 
==Data Description==
 
This section provides summary stats on gender/orientation/race/location/age.
 
===Gender===
 
 +--------+---------+------------+---------+-----------+---------+-----------+
 | female | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +--------+---------+------------+---------+-----------+---------+-----------+
 |      0 | 1082104 |     0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
 |      1 |  722889 |     0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
 +--------+---------+------------+---------+-----------+---------+-----------+
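
The summary tables in this section all share the same column layout. As a rough sketch of how such a table could be produced, the Python snippet below runs a GROUP BY query against a toy in-memory SQLite table standing in for the MySQL '''users''' table. The column names 'sent' and 'recd', the toy rows, and the use of NULL for users with no recorded messages are assumptions for illustration; the actual queries are in the script files. Under that assumption AVG skips NULL values while COUNT(*) does not, which is one possible reason the average columns are not simply the totals divided by Count.

```python
import sqlite3

# Toy stand-in for the users table. 'sent' and 'recd' are hypothetical
# column names; None (NULL) means no messages were recorded for the user.
rows = [
    (0, 4, 2),
    (0, None, 1),
    (0, 2, None),
    (1, 3, 5),
    (1, None, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (female INTEGER, sent INTEGER, recd INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

# One row per gender: group size, share of all users, and message stats.
# AVG and SUM ignore NULLs, so the averages are over users with a value.
summary = conn.execute(
    """
    SELECT female,
           COUNT(*) AS "Count",
           ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM users), 4) AS Percentage,
           AVG(sent) AS AvgSent,
           SUM(sent) AS TotalSent,
           AVG(recd) AS AvgRecd,
           SUM(recd) AS TotalRecd
    FROM users
    GROUP BY female
    ORDER BY female
    """
).fetchall()

for row in summary:
    print(row)
```

Swapping the grouping column (for example to orientation or zip1) would give the other tables in the same shape.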
 
===Orientation===
 
0=Straight, 1=Gay, 2=Bi
 +-------------+---------+------------+---------+-----------+---------+-----------+
 | orientation | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +-------------+---------+------------+---------+-----------+---------+-----------+
 |           0 | 1570050 |     0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
 |           1 |  139697 |     0.0774 |  2.9008 |     28802 |  2.1532 |     29212 |
 |           2 |   95246 |     0.0528 |  3.5746 |     25287 |  2.8651 |     40143 |
 +-------------+---------+------------+---------+-----------+---------+-----------+
 
===Ethnicity===
 
The data was preprocessed to assign individuals who reported more than one ethnicity as being of 'mixed' race. The variable 'ethnicity' is an integer categorization for fast sorting, and the labels are provided in the variable 'ethnicitylabels'.
 
 +------------------+--------+------------+---------+-----------+---------+-----------+
 | ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +------------------+--------+------------+---------+-----------+---------+-----------+
 | mixed            | 105456 |     0.0584 |  4.0894 |     45674 |  2.8836 |     43540 |
 | white            | 917178 |     0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
 | black            |  53884 |     0.0299 |  4.2456 |     18498 |  2.4855 |     13101 |
 | hispanic_latin   |  61034 |     0.0338 |  4.2292 |     25278 |  2.9754 |     23348 |
 | asian            |  54645 |     0.0303 |  3.7712 |     16612 |  2.6209 |     17222 |
 | indian           |  11658 |     0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
 | middle_eastern   |   5916 |     0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
 | native_american  |   4056 |     0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
 | pacific_islander |   3238 |     0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
 | other            |  24761 |     0.0137 |  4.3936 |     11454 |  2.9021 |      9664 |
 | none             | 563167 |     0.3120 |  3.9094 |     53230 |  2.1889 |     48969 |
 +------------------+--------+------------+---------+-----------+---------+-----------+
 
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
 
 +-----------------+----------+
 | num_ethnicities | COUNT(*) |
 +-----------------+----------+
 |               1 |  1699537 |
 |               2 |    83676 |
 |               3 |    14781 |
 |               4 |     3348 |
 |               5 |      976 |
 |               6 |      330 |
 |               7 |      243 |
 |               8 |      409 |
 |               9 |     1693 |
 +-----------------+----------+
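
The 'mixed' assignment can be sketched roughly as follows. The function name, the de-duplication step, and the handling of an empty list are assumptions for illustration and are not taken from OKC2.sql. The check at the end uses only the counts from the num_ethnicities table: the rows for 2 through 9 reported ethnicities sum to 105,456, which matches the 'mixed' row of the ethnicity table.

```python
# Counts copied from the num_ethnicities table.
num_ethnicities_counts = {
    1: 1699537, 2: 83676, 3: 14781, 4: 3348, 5: 976,
    6: 330, 7: 243, 8: 409, 9: 1693,
}


def ethnicity_label(reported):
    """Collapse a list of self-reported ethnicities into a single label.

    Hypothetical helper: more than one distinct value becomes 'mixed';
    an empty list is mapped to 'none' (an assumption).
    """
    distinct = sorted(set(reported))
    if not distinct:
        return "none"
    if len(distinct) > 1:
        return "mixed"
    return distinct[0]


# Everyone with more than one reported ethnicity lands in 'mixed'.
mixed_total = sum(c for n, c in num_ethnicities_counts.items() if n > 1)
print(mixed_total)

print(ethnicity_label(["asian", "white"]))
print(ethnicity_label(["black"]))
```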
 
===Location===
 
Zip3 was provided in the data, but I was cautioned that the third digit was not meaningful (i.e. added randomly for obfuscation). Zip2 and Zip1 (the latter is reported below) are coded as variables.
 
See:
*The [http://en.wikipedia.org/wiki/ZIP_code#Primary_State_Prefixes Wikipedia Zip Code page] for a map of Zip codes
*[http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes List of Zip Code Prefixes]
 
 +------+--------+------------+---------+-----------+---------+-----------+
 | zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +------+--------+------------+---------+-----------+---------+-----------+
 |    0 | 178253 |     0.0988 |  3.8582 |     55219 |  2.5989 |     52293 |
 |    1 | 229719 |     0.1273 |  4.4691 |     84690 |  2.6089 |     73503 |
 |    2 | 136538 |     0.0756 |  3.7811 |     38575 |  2.5554 |     37794 |
 |    3 | 148748 |     0.0824 |  3.8744 |     41979 |  2.6477 |     40292 |
 |    4 | 126115 |     0.0699 |  3.6275 |     33953 |  2.5882 |     33064 |
 |    5 |  66067 |     0.0366 |  3.3721 |     15643 |  2.3133 |     15157 |
 |    6 | 105097 |     0.0582 |  3.6479 |     29650 |  2.5607 |     29468 |
 |    7 | 126851 |     0.0703 |  4.0350 |     38615 |  2.8075 |     37758 |
 |    8 |  84204 |     0.0467 |  3.7276 |     23875 |  2.6178 |     23128 |
 |    9 | 322186 |     0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
 |   99 | 281215 |     0.1558 |  3.6252 |     48538 |  2.2509 |     44143 |
 +------+--------+------------+---------+-----------+---------+-----------+
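
A minimal sketch of deriving Zip2 and Zip1 from Zip3 by dropping trailing digits is given below. Treating code 99 as "Zip3 missing or unusable" is an assumption suggested by the 99 row in the table, not something confirmed here. Keeping Zip2 as a string (to preserve leading zeros) and the function name are also choices made for illustration.

```python
MISSING_ZIP1 = 99  # assumed meaning of the 99 row: no usable Zip3


def zip_prefixes(zip3):
    """Return (zip2, zip1) for a three-digit ZIP prefix string.

    zip2 is the first two digits as a string, zip1 the first digit as an
    integer. Anything without two leading digits is treated as missing.
    """
    if zip3 is None or len(zip3) < 2 or not zip3[:2].isdigit():
        return None, MISSING_ZIP1
    return zip3[:2], int(zip3[0])


print(zip_prefixes("021"))
print(zip_prefixes("945"))
print(zip_prefixes(None))
```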
 
===Age===
 
The age ranges were created arbitrarily, though they have worked out reasonably well. Further refinement is possible. The variable 'age_rnum' is an integer categorization, and 'age_range' provides the labels. Ages were calculated from 'birth_year' using 2010 as the reference year.
 
 +-----------+--------+------------+---------+-----------+---------+-----------+
 | age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +-----------+--------+------------+---------+-----------+---------+-----------+
 | 15-19     |  97659 |     0.0541 |  4.8070 |     30385 |  3.5139 |     35708 |
 | 20-24     | 485028 |     0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
 | 25-29     | 475376 |     0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
 | 30-34     | 274281 |     0.1520 |  4.4587 |     89759 |  2.4158 |     66236 |
 | 35-39     | 155925 |     0.0864 |  3.7430 |     42681 |  2.3843 |     36490 |
 | 40-44     | 109121 |     0.0605 |  3.5817 |     27776 |  2.4255 |     24939 |
 | 45-49     |  78618 |     0.0436 |  3.6452 |     19225 |  2.2995 |     16338 |
 | 50-59     |  95810 |     0.0531 |  3.3421 |     21055 |  2.1347 |     17831 |
 | >60       |  33175 |     0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
 +-----------+--------+------------+---------+-----------+---------+-----------+
 
Birth year is self-reported, and almost surely wrong in some cases. It ranges from 1900 to 1995.
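
The age calculation and binning can be sketched as below, using the bin boundaries from the table and 2010 as the reference year. Numbering 'age_rnum' from 0, placing age 60 and above in '>60', and returning no category for ages under 15 are assumptions; the function name is hypothetical.

```python
REFERENCE_YEAR = 2010

# Inclusive (low, high) age bounds, in the order shown in the table.
AGE_BINS = [
    (15, 19), (20, 24), (25, 29), (30, 34),
    (35, 39), (40, 44), (45, 49), (50, 59),
]


def age_range(birth_year):
    """Return (age_rnum, age_range) for a birth year.

    age_rnum is assumed to start at 0. Ages of 60 and above fall into the
    final '>60' category; ages under 15 get (None, None).
    """
    age = REFERENCE_YEAR - birth_year
    for rnum, (low, high) in enumerate(AGE_BINS):
        if low <= age <= high:
            return rnum, f"{low}-{high}"
    if age >= 60:
        return len(AGE_BINS), ">60"
    return None, None


print(age_range(1995))  # youngest reported birth year
print(age_range(1900))  # oldest reported birth year
```

Under these assumptions the two extremes of the reported birth years fall into the first and last categories.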
 
===Account Age===
 
The account age range variables, 'acc_age_rnum' for the integer categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activity, and to provide a normalization base. The oldest account is 2,613 days old. The youngest is 0 days old.
 
 +---------------+--------+------------+---------+-----------+---------+-----------+
 | acc_age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +---------------+--------+------------+---------+-----------+---------+-----------+
 | 0             |   1385 |     0.0008 |         |           |         |           |
 | 1             |   7914 |     0.0044 |         |           |         |           |
 | 2             |   7967 |     0.0044 |         |           |         |           |
 | 3             |   9051 |     0.0050 |         |           |         |           |
 | 4             |   9134 |     0.0051 |         |           |         |           |
 | 5             |  10032 |     0.0056 |         |           |         |           |
 | 6-10          |  35862 |     0.0199 |         |           |         |           |
 | 11-15         |  39986 |     0.0222 |         |           |         |           |
 | 16-30         | 136983 |     0.0759 |         |           |         |           |
 | 31-60         | 233560 |     0.1294 |         |           |         |           |
 | 61-100        | 237369 |     0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
 | 101-365       | 535715 |     0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
 | 366-3650      | 540035 |     0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
 +---------------+--------+------------+---------+-----------+---------+-----------+
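
Because the account age ranges are uneven, a lookup over the upper bounds is a compact way to sketch the categorization. The example below uses Python's bisect module with the boundaries from the table; numbering 'acc_age_rnum' from 0 and the function name are assumptions.

```python
from bisect import bisect_left

# Inclusive upper bound (in days) of each range, in table order.
ACC_AGE_UPPER = [0, 1, 2, 3, 4, 5, 10, 15, 30, 60, 100, 365, 3650]
ACC_AGE_LABELS = [
    "0", "1", "2", "3", "4", "5", "6-10", "11-15",
    "16-30", "31-60", "61-100", "101-365", "366-3650",
]


def acc_age_range(days):
    """Return (acc_age_rnum, acc_age_range) for an account age in days."""
    if days < 0 or days > ACC_AGE_UPPER[-1]:
        raise ValueError(f"account age out of range: {days}")
    # bisect_left finds the first upper bound that is >= days.
    rnum = bisect_left(ACC_AGE_UPPER, days)
    return rnum, ACC_AGE_LABELS[rnum]


print(acc_age_range(0))     # youngest account
print(acc_age_range(2613))  # oldest account
```

Refining the ranges would then only require editing the two lists.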
 
===Other variables===
 
Quit times range from:
*2008-07-19 (earliest)
*2010-12-17 (latest)
 
Deleted or blacklisted accounts:
 +------------------------+----------+
 | deleted_or_blacklisted | Count    |
 +------------------------+----------+
 |                      0 |  1519319 |
 |                      1 |   285674 |
 +------------------------+----------+
 
==Outstanding Data Issues==
 
*The variables 'num_essays' and 'profile_length' need fixing