{{Project|Has project output=Data,Tool|Has sponsor=McNair Center|Has title=Moroccan Parliament Web Driver|Has owner=Peter Jalbert|Has start date=9/27/2016|Has image=morocco_flag.jpg|Has notes=|Is dependent on=|Depends upon it=|Has project status=Complete|Has keywords=Tool
}}
==Overview==
[http://www.chambredesrepresentants.ma/ar/%D8%A7%D9%84%D8%AA%D8%B4%D8%B1%D9%8A%D8%B9/%D9%85%D8%B4%D8%A7%D8%B1%D9%8A%D8%B9-%D8%A7%D9%84%D9%82%D9%88%D8%A7%D9%86%D9%8A%D9%86 Monarchy Proposed Bills]
 
The data to extract from this site includes PDFs of all the bill pages, as well as any interior PDFs linked on each page. Bill pages should be named by their URL, and interior PDFs should be named by their respective bill numbers.
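The naming rule above can be sketched in a few lines of Python 3. This is only an illustration, not the naming scheme used by the project's own script: the helper names <code>bill_page_filename</code> and <code>interior_pdf_filename</code> are hypothetical, and the sketch assumes that the last path segment of an interior PDF link carries the bill number.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def bill_page_filename(url):
    """Name a bill page after the percent-decoded path of its URL."""
    path = unquote(urlsplit(url).path)
    # Drop the empty segments produced by leading or doubled slashes.
    parts = [part for part in path.split("/") if part]
    return "_".join(parts) + ".pdf"


def interior_pdf_filename(url):
    """Return the last path segment of a link, or None if it is not a PDF.

    Assumption: the last segment of the link is the bill number.
    """
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return name if name.lower().endswith(".pdf") else None
```

Percent-decoding the path keeps the Arabic page title readable in the file name; links that do not end in <code>.pdf</code> are skipped rather than saved.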
===Moroccan House of Representatives Proposed Bills===
[http://www.chambredesrepresentants.ma/ar/%D8%A7%D9%84%D8%AA%D8%B4%D8%B1%D9%8A%D8%B9/%D9%84%D8%A7%D8%A6%D8%AD%D8%A9-%D9%85%D9%82%D8%AA%D8%B1%D8%AD%D8%A7%D8%AA-%D8%A7%D9%84%D9%82%D9%88%D8%A7%D9%86%D9%8A%D9%86 House Proposed Bills]
 
See Monarchy proposed bills for instructions.
===Moroccan Legislature Ratified Bills===
[http://www.chambredesrepresentants.ma/ar/%D8%A7%D9%84%D8%AA%D8%B4%D8%B1%D9%8A%D8%B9/%D8%A7%D9%84%D9%86%D8%B5%D9%88%D8%B5-%D8%A7%D9%84%D8%AA%D9%8A-%D8%B5%D8%A7%D8%AF%D9%82-%D8%B9%D9%84%D9%8A%D9%87%D8%A7-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D9%86%D9%88%D8%A7%D8%A8?field_legislature_tid=All&field_nature_loi_tid=All&page=27 Ratified Bills]
 
See Monarchy proposed bills for instructions.
===Moroccan Legislature Oral Questions===
[http://www.chambredesrepresentants.ma/ar/%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9?field_ministeres_tid=All&field_groupe_concerne_target_id=All&field_parlementaires_associes_target_id=All&body_value=&field_transfere_ou_non_value=All Oral Questions]
 
The embedded PDFs on this site are not useful. The useful data are the dates and questions listed on the main site. To get this data, I built a web crawler that scraped the date, the question, and all relevant information about each question from the site. Then I used Selenium to download PDFs of the questions.
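One detail of scraping the listing is that not every question carries its own date, so a missing date has to be filled in from the most recent one seen. A minimal Python 3 sketch of that carry-forward step follows; it assumes the scraped rows are plain dictionaries with a <code>date</code> key, and the function name <code>fill_dates</code> is hypothetical.

```python
def fill_dates(rows):
    """Copy the most recent non-empty date onto rows that lack one."""
    last_date = None
    filled = []
    for row in rows:
        # Use the row's own date when present, else the last one seen.
        date = row.get("date") or last_date
        last_date = date
        filled.append({**row, "date": date})
    return filled


rows = [
    {"date": "2016-01-05", "question": "q1"},
    {"date": None, "question": "q2"},
    {"date": "2016-01-12", "question": "q3"},
]
dates = [row["date"] for row in fill_dates(rows)]
```

Rows that appear before any date has been seen keep <code>None</code>, which makes such gaps easy to spot in the exported data.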
===Moroccan Legislature Written Questions===
[http://www.chambredesrepresentants.ma/ar/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D9%83%D8%AA%D8%A7%D8%A8%D9%8A%D8%A9 Written Questions]
See Oral Questions section.

===Web Driver Code===
 # General Bill Download
 # Written for Python 2 and the Selenium API of the time
 # (find_elements_by_xpath, switch_to_window).
 from selenium import webdriver
 from selenium.webdriver.common.action_chains import ActionChains
 from selenium.webdriver.common.keys import Keys
 import time
 import urllib
 import string
 import re
 ##########################
 # Launch Google Chrome browser
 driver = webdriver.Chrome()
 ##########################
 def switch_window():
     # Switch focus to the most recently opened window
     handles = driver.window_handles
     driver.switch_to_window(handles[-1])
 ##########################
 # Visit desired website
 driver.get('http://www.chambredesrepresentants.ma/ar/%D8%A7%D9%84%D8%AA%D8%B4%D8%B1%D9%8A%D8%B9/%D9%84%D8%A7%D8%A6%D8%AD%D8%A9-%D9%85%D9%82%D8%AA%D8%B1%D8%AD%D8%A7%D8%AA-%D8%A7%D9%84%D9%82%D9%88%D8%A7%D9%86%D9%8A%D9%86?body_value=&field_og_commission_target_id=All')
 ##########################
 bills_list = driver.find_elements_by_xpath("//li/h3/a")
 for i in range(len(bills_list)):
     # Shift-click opens the bill page in a new window
     ActionChains(driver).key_down(Keys.SHIFT).perform()
     bills_list[i].click()
     ActionChains(driver).key_up(Keys.SHIFT).perform()
     switch_window()
     url = driver.current_url
     unicode_url = urllib.unquote(str(url)).decode('utf8')
     url_parts = string.split(unicode_url, "/")
     n = len(url_parts)
     ##########################
     # Build Arabic tag backwards, accounting for backwards spelling
     tag = ""
     while n > 4:
         tag += url_parts[n - 1]
         n -= 1
     ##########################
     # Navigate to pdf of website
     change_button = driver.find_elements_by_xpath("//a[@class='pdf' and @rel='nofollow']")[0]
     ActionChains(driver).key_down(Keys.SHIFT).perform()
     change_button.click()
     ActionChains(driver).key_up(Keys.SHIFT).perform()
     switch_window()
     #########################
     # Get current window's URL
     url = driver.current_url
     ########################
     # Save file at URL to current directory
     urllib.urlretrieve(url, tag)
     driver.close()
     switch_window()
     #########################
     # Find interior pdfs on the page
     pdfs_on_page = driver.find_elements_by_xpath("//div/div/div/article/div/ul/li/a")
     if pdfs_on_page:
         for j in range(len(pdfs_on_page)):
             element = pdfs_on_page[j]
             #######################
             # Click on pdf
             ActionChains(driver).key_down(Keys.SHIFT).perform()
             element.click()
             ActionChains(driver).key_up(Keys.SHIFT).perform()
             switch_window()
             url = driver.current_url
             pdf_tag = string.split(str(url), "/")[-1]
             ######################
             # Leave link if it is not a pdf
             if re.findall(r"\.pdf", pdf_tag):
                 # Save interior pdf
                 urllib.urlretrieve(url, pdf_tag)
             driver.close()
             switch_window()
     ########################
     # Close the bill page and return to the listing
     driver.close()
     switch_window()
 #########################
 print "download complete"
 #########################
 # Close browser
 driver.quit()

===Web Crawler Code===
I used a Python library called [https://scrapy.org/ Scrapy]. Once it is downloaded and installed properly, a new project is created by typing into the command line:
 scrapy startproject projectname
This creates a project with all of the necessary components. From this window, Scrapy provides a helpful tool for testing any web scraping code you would like to try out. Enter:
 scrapy shell 'webaddress'
From here, you can type selector statements and print them to see whether they return the data you want. To build the web crawler itself, open a new Python script and save it in the spiders folder that was created automatically under your projectname folder. Some example code for a spider is shown below; this was my spider for the oral questions portion of the Moroccan site.
 import scrapy
 import string
 
 class MySpider(scrapy.Spider):
     name = "oral"
     page_range = 375
     # First listing page, followed by the numbered result pages
     start_urls = (["http://www.chambredesrepresentants.ma/ar/%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9?field_ministeres_tid=All&field_groupe_concerne_target_id=All&field_parlementaires_associes_target_id=All&body_value=&field_transfere_ou_non_value=All"] +
                   ["http://www.chambredesrepresentants.ma/ar/%D9%85%D8%B1%D8%A7%D9%82%D8%A8%D8%A9-%D8%A7%D9%84%D8%B9%D9%85%D9%84-%D8%A7%D9%84%D8%AD%D9%83%D9%88%D9%85%D9%8A/%D8%A7%D9%84%D8%A3%D8%B3%D9%80%D8%A6%D9%84%D8%A9-%D8%A7%D9%84%D8%B4%D9%81%D9%88%D9%8A%D8%A9?field_ministeres_tid=All&field_groupe_concerne_target_id=All&field_parlementaires_associes_target_id=All&body_value=&field_transfere_ou_non_value=All&page=" + str(num)
                    for num in range(page_range)])
 
     def parse(self, response):
         a = response.css('ul.listing_questions')
         # Last date seen; reused for items that have no date of their own
         placeholder = None
         for header in a.css('li'):
             date = header.css('h3.sorting_date::text').extract_first()
             if date is not None:
                 placeholder = date
             else:
                 date = placeholder
             question = header.css('a::text').extract_first()
             info = header.css('div.questionss_group::text').extract()
             info = "".join(info)
             info_split = string.split(info, "\n")
             # Collapse runs of whitespace in each field
             info1 = ' '.join(info_split[2].split())
             info2 = ' '.join(info_split[4].split())
             info3 = ' '.join(info_split[5].split())
             yield {
                 'date': date,
                 'info1': info1,
                 'info2': info2,
                 'info3': info3,
                 'question': question,
                 'url': response.url
             }

==Further Inquiries==
Further inquiries have been requested for the Kuwaiti, Tunisian, and Algerian Parliaments. In addition, a way to link the downloaded PDFs to pertinent lines in the CSV files would be very useful.
===Kuwait Parliament===
[http://search.kna.kw/web/Retrieval/Home.aspx Kuwait Site]
Meeting Agendas/Minutes (جدول اعمال الجلسه); for 14th term
UPDATE: [http://www.mona.gov.kw:90/pr_opendata/SL.aspx?did=1 Kuwait Bills Site]

The site generates dynamic content. The desired pages for scraping are found by selecting the first option of the first dropdown menu and hitting Enter. '''The website's URL never changes.''' Instead, the HTML of the site changes as objects are selected on the page. I used Selenium to interact with the elements on the page, and the .text property of Selenium WebElements to have Selenium act as a pseudo web crawler. The code is found in the following directory:
 E:\McNair\Projects\Middle East Studies Web Drivers\Kuwait

The next step in the project requires the data found when selecting the links on each page. This data has specifics on each of the bills. The page appears as a dynamically generated frame, which presents some issues. The window that opens has a close button that cannot be accessed from the interior frame, and cannot be accessed from the original page either. I experimented with executing JavaScript code and with switching frames, but nothing worked. My workaround was to reload the page every time I opened the interior frame. This is a separate script saved in the same directory listed above and titled kuwait_retrieve.py.

===Tunisian Parliament===
[http://arp.tn/site/main/AR/index.jsp Tunisian Site]
===Algerian Parliament===
[http://www.apn.dz/ar/ Algerian Site]