
3,609 bytes added , 13:44, 21 September 2020


{{Project
|Has project output=Data,Tool
|Has sponsor=McNair Center
|Has title=Moroccan Parliament Web Driver
|Has owner=Peter Jalbert
|Has start date=9/27/2016
|Has image=morocco_flag.jpg
|Has notes=
|Is dependent on=
|Depends upon it=
|Has project status=Complete
|Has keywords=Tool
}}

==Overview==

[http://www.chambredesrepresentants.ma/ar/%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9?field_ministeres_tid=All&field_groupe_concerne_target_id=All&field_parlementaires_associes_target_id=All&body_value=&field_transfere_ou_non_value=All Oral Questions]

The embedded PDFs on this site are not useful. The useful data elements are the dates and questions listed on the main site. To get this data, I built a web crawler that scraped the date, the question, and the accompanying details for each question from the site. I then used Selenium to download PDFs of the questions.

===Moroccan Legislature Written Questions===

Some example code for a spider is shown below; this was my spider for the oral questions portion of the Moroccan site.

 import scrapy
 # Listing URL for the oral questions, identical to the link in the Overview.
 BASE_URL = ("http://www.chambredesrepresentants.ma/ar/"
             "%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-"
             "%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/"
             "%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9"
             "?field_ministeres_tid=All&field_groupe_concerne_target_id=All"
             "&field_parlementaires_associes_target_id=All&body_value="
             "&field_transfere_ou_non_value=All")
 #
 class MySpider(scrapy.Spider):
     name = "oral"
     page_range = 375
     # The unpaginated listing, followed by every numbered page.
     start_urls = [BASE_URL] + [BASE_URL + "&page=" + str(num) for num in range(page_range)]
     #
     def parse(self, response):
         listing = response.css('ul.listing_questions')
         # Only some list items carry a date heading; remember the last one seen.
         placeholder = None
         for header in listing.css('li'):
             date = header.css('h3.sorting_date::text').extract_first()
             if date is not None:
                 placeholder = date
             else:
                 date = placeholder
             question = header.css('a::text').extract_first()
             # Join the text nodes of the details block, then split it into lines.
             info = ''.join(header.css('div.questionss_group::text').extract())
             info_split = info.split("\n")
             # Lines 2, 4 and 5 hold the details; collapse their extra whitespace.
             info1 = ' '.join(info_split[2].split())
             info2 = ' '.join(info_split[4].split())
             info3 = ' '.join(info_split[5].split())
             yield {
                 'date': date,
                 'info1': info1,
                 'info2': info2,
                 'info3': info3,
                 'question': question,
                 'url': response.url
             }
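The two cleaning steps inside parse (carrying the last seen date forward to list items that have no date of their own, and collapsing runs of whitespace) can be tried on their own, without Scrapy. The following is a minimal sketch of that logic; normalize_whitespace and fill_missing_dates are hypothetical helper names that do not appear in the spider, and the sample values are made up for illustration.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())


def fill_missing_dates(dates):
    """Replace each None with the most recent non-None date that precedes it."""
    filled = []
    last_seen = None  # stays None until the first dated item appears
    for date in dates:
        if date is not None:
            last_seen = date
        filled.append(last_seen)
    return filled


if __name__ == "__main__":
    # Made-up sample values, not data from the site.
    print(normalize_whitespace("  Ministry \t of   Health \n"))
    print(fill_missing_dates(["date A", None, None, "date B", None]))
```

Items that appear before any dated item keep None as their date, which mirrors initialising the placeholder before the loop.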

==Further Inquiries==

Further inquiries have been requested for the Kuwaiti, Tunisian, and Algerian Parliaments. In addition, a way to link the downloaded PDFs to the pertinent lines in the CSV files would be very useful.

===Kuwait Parliament===

Proposals (اقتراح بــرغــبـــة): for the 13th and 14th terms

Meeting Agendas/Minutes (جدول اعمال الجلسه): for the 14th term

UPDATE: [http://www.mona.gov.kw:90/pr_opendata/SL.aspx?did=1 Kuwait Bills Site]

The site generates dynamic content. The desired pages for scraping are found by selecting the first option of the first dropdown menu and pressing Enter. '''The website's URL never changes.''' Instead, the HTML of the page changes as objects are selected. I used Selenium to interact with the elements on the page, and the .text attribute of Selenium's WebElements to read their contents, so that Selenium acts as a pseudo web crawler. The code is found in the following directory:

E:\McNair\Projects\Middle East Studies Web Drivers\Kuwait
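The interaction described above (select the first option of the first dropdown, press Enter, then read the page through .text) might look roughly like the sketch below. This is not the script stored in that directory. The function name, the use of plain tag-name locators, and the choice of Firefox are all assumptions made for illustration, and the real page may need more specific locators or explicit waits.

```python
# Value of selenium.webdriver.common.keys.Keys.ENTER, written out so that the
# helper below has no import-time dependency on Selenium.
ENTER = "\ue007"

# Locator strategy string; this is the value of By.TAG_NAME in Selenium.
TAG_NAME = "tag name"


def select_first_option_and_read(driver):
    """Pick the first option of the first dropdown, press Enter, return page text."""
    dropdown = driver.find_elements(TAG_NAME, "select")[0]
    dropdown.find_elements(TAG_NAME, "option")[0].click()
    # Re-locate the dropdown in case the click caused the page to re-render.
    driver.find_elements(TAG_NAME, "select")[0].send_keys(ENTER)
    # .text returns the visible text of an element and its descendants.
    return driver.find_element(TAG_NAME, "body").text


def main():
    # Imported here so the helper above can be exercised with a stand-in driver.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: any installed browser driver would do
    try:
        driver.get("http://www.mona.gov.kw:90/pr_opendata/SL.aspx?did=1")
        print(select_first_option_and_read(driver))
    finally:
        driver.quit()
```

Calling main() runs the sketch against the live site. Because the URL never changes, everything after driver.get has to be done through element interaction rather than by requesting new addresses.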

The next step in the project requires the data found by following the links on each page, which gives the specifics of each bill. Each link opens a dynamically generated frame, which presents some issues: the window that opens has a close button that can be reached neither from the interior frame nor from the original page. I experimented with executing JavaScript and with switching frames, but nothing worked. My workaround was to reload the page every time I opened the interior frame. This is a separate script, saved in the directory listed above and titled kuwait_retrieve.py.
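The reload workaround can be sketched as a loop that reloads the page before each link, re-finds the links (references found before a reload no longer work), opens one link by index, and reads the frame's text. This is an illustration of the idea only, not the contents of kuwait_retrieve.py. The function name, both CSS selector parameters, the prepare callback, and the assumption that the pop-up is a frame Selenium can switch into are all hypothetical.

```python
def collect_details(driver, url, link_selector, frame_selector, prepare=None):
    """Open each link's pop-up frame, read its text, and reload instead of closing it."""
    details = []
    index = 0
    while True:
        driver.get(url)  # reloading discards any frame left open by the last link
        if prepare is not None:
            prepare(driver)  # e.g. repeat the dropdown selection after the reload
        # Elements located before the reload are stale, so find the links again.
        links = driver.find_elements("css selector", link_selector)
        if index >= len(links):
            break
        links[index].click()
        frame = driver.find_element("css selector", frame_selector)
        driver.switch_to.frame(frame)
        details.append(driver.find_element("tag name", "body").text)
        driver.switch_to.default_content()
        index += 1
    return details
```

Iterating by index rather than over a stored list of elements is what makes the reload safe. The cost is one full page load per bill, plus a final load that detects the end of the list.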

===Tunisian Parliament===
