Changes

Moroccan Parliament Web Crawler (view source)

Revision as of 14:16, 14 October 2016


[http://www.chambredesrepresentants.ma/ar/%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9?field_ministeres_tid=All&field_groupe_concerne_target_id=All&field_parlementaires_associes_target_id=All&body_value=&field_transfere_ou_non_value=All Oral Questions]

The embedded PDFs on this site are not useful. The useful data elements are the dates and questions listed on the main site. To get this data, I built a web crawler that scraped the date, the question, and all other relevant information about each question from the site.

===Moroccan Legislature Written Questions===

From this window, Scrapy provides an interactive shell for testing any web scraping code you would like to try out.

Enter:

scrapy shell 'webaddress'

From here, you can type selector statements and print them to see if your statements are getting the data you desire.

To actually build your web crawler, open a new Python script and save it in the spiders folder that was created automatically for you under your projectname folder.

Some example code for a spider is shown below; this was my spider for the oral questions portion of the Moroccan site.

Peterjalbert


