Changes

Jump to navigation Jump to search
11,148 bytes added ,  16:00, 19 December 2011
no edit summary
*This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.
I will be posting reports and other materials on the OK Cupid project here shortly==Script Files== There are two SQL script files that were written to process the data:*[[:Image:OKC1.sql.asci|OKC1.sql]]*[[:Image:OKC2.sql.asci|OKC2.sql]] These files are self-documenting, and the second file builds out the current base table set containing: '''users''', '''views''', '''messages''', and '''delta'''. ==Setup Instructions for Usha== See the [[Research Computing At HBS]] page for detailed information. Quick quide:#Go to vpn.hbs.edu and click Network Connect -> Start#Using SSH client, connect to researchgrid.hbs.edu#Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql Useful commands:  mysql> SHOW Databases; +--------------------+ | Database | +--------------------+ | information_schema | | mpiskorski_tstuart | | okcupid | | okcupid_2 | +--------------------+ 4 rows in set (0.02 sec) You'll have access to mpiskorski_tstuart   USE mpiskorski_tstuart; (Misiek and Toby's working database)  SHOW TABLES; There are two datasets: OKC1 and OKC2. We are now working with OKC2. Both datasets have: Users Messages Views OKC2 also has ProfileDelta (changes to profiles).Describe tables with:  DESC TableName ==Data Description from Misiek== Toby sent me a file with the following data description:  USERS Total = 1,804,993 Female = 722,889 = 40% US = 1,523,778 NonUS = 281,215 = 16% US Male = 897,323 = 59% US Female = 626,455 = 41% US Male Singles = 827,702 = 92% of US Male US Female Singles = 565,067 = 90% of US Female US Male LongTermInterest = 530,450 = 59% of Male US Female LongTermInterest = 333,947 = 53% of US Female US Male ShortTermInterest = 433,904 = 48% of Male US Female ShortTermInterest = 225,715 = 36% of US Female US Male CasualSex = 96,714 = 11% of Male US Female CasualSex = 19,331 = 3% of US Female US Male Gays = 77,902 = 9% of US Male US Male Bi = 19,295 = 2% of US Male US Male Straight = 800,126 = 89% of US Male US Female Gays = 46,118 = 7% of US Female US Female Bi = 59,585 = 10% of US Female US Female Straight = 520,752 = 83% of US Female US Male NoRace = 260,889 = 31% of US Male US Male White = 517,636 = 57,172% of US Male US Male Black = 38,787 = 4% of US Male US Male Asian = 34,158= 4% of US Male US Male Latino = 57,172 = 6% of US Male US Female NoRace = 185,725 = 30% of US Female US Female White = 358,119 = 57% of US Female US Female Black = 32,195 = 5% of US Female US Female Asian = 22,979 = 4% of US Female US Female Latino = 37,274 = 5% of US Female ==Data Description== This section provides summary stats on gender/orientation/race/location/age. ===Gender===  +--------+---------+------------+---------+-----------+---------+-----------+ | female | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +--------+---------+------------+---------+-----------+---------+-----------+ | 0 | 1082104 | 0.5995 | 4.1039 | 339857 | 2.2340 | 185671 | | 1 | 722889 | 0.4005 | 3.4872 | 174731 | 2.8786 | 304468 | +--------+---------+------------+---------+-----------+---------+-----------+ ===Orientation===  0=Straight, 1=Gay, 2=Bi +-------------+---------+------------+---------+-----------+---------+-----------+ | orientation | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +-------------+---------+------------+---------+-----------+---------+-----------+ | 0 | 1570050 | 0.8698 | 3.9727 | 460499 | 2.6087 | 420784 | | 1 | 139697 | 0.0774 | 2.9008 | 28802 | 2.1532 | 29212 | | 2 | 95246 | 0.0528 | 3.5746 | 25287 | 2.8651 | 40143 | +-------------+---------+------------+---------+-----------+---------+-----------+ ===Ethnicity=== The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.  +------------------+--------+------------+---------+-----------+---------+-----------+ | ethnicitylabels | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +------------------+--------+------------+---------+-----------+---------+-----------+ | mixed | 105456 | 0.0584 | 4.0894 | 45674 | 2.8836 | 43540 | | white | 917178 | 0.5081 | 3.7588 | 332784 | 2.6029 | 327063 | | black | 53884 | 0.0299 | 4.2456 | 18498 | 2.4855 | 13101 | | hispanic_latin | 61034 | 0.0338 | 4.2292 | 25278 | 2.9754 | 23348 | | asian | 54645 | 0.0303 | 3.7712 | 16612 | 2.6209 | 17222 | | indian | 11658 | 0.0065 | 4.6518 | 4996 | 2.2567 | 2796 | | middle_eastern | 5916 | 0.0033 | 6.0000 | 3282 | 2.9282 | 1918 | | native_american | 4056 | 0.0022 | 5.0511 | 1682 | 3.0157 | 1345 | | pacific_islander | 3238 | 0.0018 | 3.6478 | 1098 | 2.9398 | 1173 | | other | 24761 | 0.0137 | 4.3936 | 11454 | 2.9021 | 9664 | | none | 563167 | 0.3120 | 3.9094 | 53230 | 2.1889 | 48969 | +------------------+--------+------------+---------+-----------+---------+-----------+ The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:  +-----------------+----------+ | num_ethnicities | COUNT(*) | +-----------------+----------+ | 1 | 1699537 | | 2 | 83676 | | 3 | 14781 | | 4 | 3348 | | 5 | 976 | | 6 | 330 | | 7 | 243 | | 8 | 409 | | 9 | 1693 | +-----------------+----------+ ===Location=== Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.  See:*The [http://en.wikipedia.org/wiki/ZIP_code#Primary_State_Prefixes Wikipedia Zip Code page] for a map of Zip codes*[http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes List of Zip Code Prefixes]  +------+--------+------------+---------+-----------+---------+-----------+ | zip1 | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +------+--------+------------+---------+-----------+---------+-----------+ | 0 | 178253 | 0.0988 | 3.8582 | 55219 | 2.5989 | 52293 | | 1 | 229719 | 0.1273 | 4.4691 | 84690 | 2.6089 | 73503 | | 2 | 136538 | 0.0756 | 3.7811 | 38575 | 2.5554 | 37794 | | 3 | 148748 | 0.0824 | 3.8744 | 41979 | 2.6477 | 40292 | | 4 | 126115 | 0.0699 | 3.6275 | 33953 | 2.5882 | 33064 | | 5 | 66067 | 0.0366 | 3.3721 | 15643 | 2.3133 | 15157 | | 6 | 105097 | 0.0582 | 3.6479 | 29650 | 2.5607 | 29468 | | 7 | 126851 | 0.0703 | 4.0350 | 38615 | 2.8075 | 37758 | | 8 | 84204 | 0.0467 | 3.7276 | 23875 | 2.6178 | 23128 | | 9 | 322186 | 0.1785 | 3.8279 | 103851 | 2.7357 | 103539 | | 99 | 281215 | 0.1558 | 3.6252 | 48538 | 2.2509 | 44143 | +------+--------+------------+---------+-----------+---------+-----------+ ===Age=== The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.  +-----------+--------+------------+---------+-----------+---------+-----------+ | age_range | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +-----------+--------+------------+---------+-----------+---------+-----------+ | 15-19 | 97659 | 0.0541 | 4.8070 | 30385 | 3.5139 | 35708 | | 20-24 | 485028 | 0.2687 | 3.9309 | 146882 | 2.8494 | 159322 | | 25-29 | 475376 | 0.2634 | 3.5992 | 131205 | 2.4778 | 128834 | | 30-34 | 274281 | 0.1520 | 4.4587 | 89759 | 2.4158 | 66236 | | 35-39 | 155925 | 0.0864 | 3.7430 | 42681 | 2.3843 | 36490 | | 40-44 | 109121 | 0.0605 | 3.5817 | 27776 | 2.4255 | 24939 | | 45-49 | 78618 | 0.0436 | 3.6452 | 19225 | 2.2995 | 16338 | | 50-59 | 95810 | 0.0531 | 3.3421 | 21055 | 2.1347 | 17831 | | >60 | 33175 | 0.0184 | 2.9332 | 5620 | 1.8930 | 4441 | +-----------+--------+------------+---------+-----------+---------+-----------+ Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995. ===Account Age=== The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.  +---------------+--------+------------+---------+-----------+---------+-----------+ | acc_age_range | Count | Percentage | | | | | +---------------+--------+------------+---------+-----------+---------+-----------+ | 0 | 1385 | 0.0008 | | | | | | 1 | 7914 | 0.0044 | | | | | | 2 | 7967 | 0.0044 | | | | | | 3 | 9051 | 0.0050 | | | | | | 4 | 9134 | 0.0051 | | | | | | 5 | 10032 | 0.0056 | | | | | | 6-10 | 35862 | 0.0199 | | | | | | 11-15 | 39986 | 0.0222 | | | | | | 16-30 | 136983 | 0.0759 | | | | | | 31-60 | 233560 | 0.1294 | | | | | | 61-100 | 237369 | 0.1315 | 5.2286 | 142694 | 3.5119 | 144483 | | 101-365 | 535715 | 0.2968 | 3.5052 | 220052 | 2.4617 | 220381 | | 366-3650 | 540035 | 0.2992 | 3.5435 | 151842 | 2.1519 | 125275 | +---------------+--------+------------+---------+-----------+---------+-----------+ ===Other variables=== Quit times range from:*2008-07-19 (youngest)*2010-12-17 (oldest) Deleted or blacklisted accounts: +------------------------+----------+ | deleted_or_blacklisted | Count | +------------------------+----------+ | 0 | 1519319 | | 1 | 285674 | +------------------------+----------+ ==Outstanding Data Issues== *The variables 'num_essays' and 'profile_length' need fixing
Anonymous user

Navigation menu