Regular Expression Cheat Sheet

From edegan.com

Latest revision as of 11:56, 17 April 2018

This is a page for regular expression hacks. Chronicle your exploits so that others can benefit from your ingenuity!

Useful RegExes

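Pattern    Matches
------------------
\t         tab
\n         newline
^          start of line
$          end of line
.          any single character (usually not a newline)
*          0 or more times (applies to the preceding item)
+          1 or more times (applies to the preceding item)
?          0 or 1 times (applies to the preceding item)
\s         any whitespace character
\d         any digit
\w         any alphanumeric character or underscore
\W         any character other than alphanumeric or underscore
[0-9]      any digit (once)
[a-z]      any lowercase letter (once)
[a-zA-Z]   any letter, either case (once)
abc        abc
[abc]      a or b or c (once); (a|b|c) is the grouped equivalent
{1,3}      1, 2, or 3 times in a row
{3,}       3 or more times
()         captures whatever is in the parentheses
\          escape the next thing (e.g. \} matches })
[^...]     any single character not listed in ...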


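More regular expressions: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference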
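
A few of the patterns above can be tried outside TextPad. The sketch below uses Python's re module, which is an assumption of this example and not part of the TextPad workflow; the sample text and variable names are made up for illustration, and exact syntax can differ slightly between regex engines.

```python
import re

# Hypothetical sample text: two lines, each with a tab between label and name.
sample = "Batch 12:\tAcme Robotics\nBatch 7:\tBlue Lab"

# \d+ matches one or more digits in a row.
digits = re.findall(r"\d+", sample)

# ^ anchors to the start of each line when re.MULTILINE is set;
# \w+ then takes the first run of alphanumeric characters.
first_words = re.findall(r"^\w+", sample, flags=re.MULTILINE)

# [^\t\n]+$ matches the run of non-tab, non-newline characters
# that ends each line, i.e. everything after the tab.
names = re.findall(r"[^\t\n]+$", sample, flags=re.MULTILINE)

# () captures part of the match; {1,3} allows 1 to 3 digits.
batch_numbers = re.findall(r"Batch (\d{1,3}):", sample)

print(digits, first_words, names, batch_numbers)
```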


Cohort Breakout for Accelerator Project

There are three ways to break out the cohorts listed on a website:

  1. Typing them out by hand (tedious).
  2. Copying and pasting the entire cohort page.
  3. Going into the page's Inspect Element.

Regular expressions help with methods 2 and 3. Pick whichever method is easiest for the page at hand: depending on the situation, typing the names out by hand may be faster than using regular expressions, or the reverse.

Method 2

  1. Copy all of the cohort page's content, open RDP, start TextPad, and paste the content into a TextPad document.
  2. In the top menu bar, click Search, then Replace.
  3. In the Replace popup, check 'Regular Expressions' under Conditions.
  4. Use the regular expressions listed above to remove unneeded information from the cohort content or to fix its formatting.
  5. For instance, to turn a vertical list of cohorts into a single line separated by tabs, put \n in Find What and \t in Replace With (this replaces every newline with a tab).
  6. Once the formatting is correct, move the cohort names back to the Excel sheet.
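
The newline-to-tab replacement in step 5 can also be sketched in Python, for cases where the text is already outside TextPad. This is only an illustration: the function name and the sample cohort names are hypothetical, and the sketch goes slightly beyond the plain \n to \t replacement by trimming the ends and collapsing blank lines so that no empty cells are produced.

```python
import re

def lines_to_tabs(text):
    """Join a vertical list into one tab-separated line."""
    # strip() removes leading/trailing whitespace so no stray tab is left
    # at either end. \r?\n also covers Windows line endings, and the +
    # collapses runs of blank lines into a single tab.
    return re.sub(r"(?:\r?\n)+", "\t", text.strip())

# Hypothetical cohort names, one per line, as pasted from a web page.
cohort_column = "Acme Robotics\nBlue Lab\r\n\nCedar Health\n"
cohort_row = lines_to_tabs(cohort_column)
print(cohort_row)
```

A tab-separated line pastes into a spreadsheet as one row with one name per cell.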

Method 3

  1. Go to the cohort page, right-click, and click Inspect. This opens a panel on the right side of the screen.
  2. Hover over different parts of the code until the highlight covers all of the cohorts.
  3. Next to that line of code, click the three dots, then click Copy element.
  4. Paste the copied element into TextPad.
  5. Isolating the cohort names from the rest of the markup takes considerably more knowledge of regular expressions. The name usually follows title=" in the code, but make sure you are not deleting other important information.
  6. This method is recommended only if you are experienced with regular expressions, or if the startups on the cohort page appear as images rather than text.
  7. Use regular expressions as in Method 2 to get the names, then move them back to the Excel spreadsheet.
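
As a rough illustration of step 5, the sketch below pulls the values that follow title=" out of a copied element instead of deleting everything around them. It assumes the names sit in double-quoted title attributes, which will not hold for every page; the function name and the HTML snippet are invented for the example.

```python
import re

def extract_titles(html):
    """Return the value of every double-quoted title attribute."""
    # \s requires whitespace before "title", so attributes such as
    # data-title are skipped. [^"]* captures up to the closing quote.
    return re.findall(r'\stitle="([^"]*)"', html)

# Hypothetical copied element: startups shown as images, names only in title.
copied_element = (
    '<div class="cohort">'
    '<a href="/acme" title="Acme Robotics"><img src="a.png"></a>'
    '<a href="/blue" title="Blue Lab"><img src="b.png"></a>'
    '</div>'
)
startup_names = extract_titles(copied_element)
print("\n".join(startup_names))
```

Capturing the wanted text, as opposed to deleting the unwanted text, reduces the risk of removing other important information by accident.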

Lex Machina

Task

Use Patent Portfolio Report to pull unique patent numbers for patents litigated in 2015

Steps Taken

  1. Filtered Lex Machina until total patents were under 2000.
  2. Ran Lex Machina's Patent Portfolio Report.
  3. Pressed Ctrl-A to select all, then pasted into TextPad.
    • Patent numbers were the first word in every line
  4. Used the replace command (F8) to find "(^.+?)\s.*$" and replace with "\1".
    • Make sure "regular expression" is checked
  5. That left only the patent numbers.
  6. Repeated the above steps until all 2015 patents were in TextPad.
  7. Exported the data to Excel and then removed duplicates (Data --> Remove Duplicates).
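
The replace and de-duplicate steps above could be sketched in Python as follows. This is not the method that was actually used (that was TextPad plus Excel); it reuses the same pattern, applied one line at a time, and the function name and sample report lines are hypothetical.

```python
import re

# Same pattern as the TextPad step: lazily capture everything up to the
# first whitespace character, then match the rest of the line.
FIRST_WORD = re.compile(r"(^.+?)\s.*$")

def unique_patent_numbers(report_text):
    """Keep the first word of each line, dropping blanks and duplicates."""
    seen = []
    for line in report_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines from the paste
        # Lines without whitespace do not match and are kept unchanged.
        number = FIRST_WORD.sub(r"\1", line)
        if number not in seen:
            seen.append(number)  # preserves first-seen order
    return seen

# Hypothetical pasted report: the patent number is the first word per line.
pasted_report = (
    "7123456 Widget method\n"
    "8000001 Gadget system\n"
    "\n"
    "7123456 Widget method\n"
    "9000002\n"
)
patent_numbers = unique_patent_numbers(pasted_report)
print(patent_numbers)
```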