Regular Expressions

Use of literals in searches

STRING: A sentence. Another. And a third.
SEARCH: .
RESULT: finds the periods

with Regex (or grep) search

STRING: A sentence. Another. And a third.
SEARCH: .
RESULT: finds every consecutive character and space!

Escaping a metacharacter

  • insert a \ before special characters for a literal search in regex
STRING: A sentence. Another. And a third.
SEARCH: \.
RESULT: finds the periods

Wildcards

  • \w a single word character [letter,number or _]
  • \d a single number character [0-9]
  • \t a single tab space
  • \s a single space, tab, or line break
  • \n a single line break (or try \r)
STRING: crow raven grackle
SEARCH: \s
REPLACE: \n
RESULT: 
crow
raven
grackle

Repairing pdf cut and paste

STRING:
Cutting and
pasting from pdf is
a mess. Sentences
are
all chopped up. It is
hard
to repair.

SEARCH: \n
REPLACE: (replace with nothing)
RESULT: 
Cutting and pasting from pdf is a mess. Sentences are all chopped up. It is hard to repair. 
  • TRY THIS WITH AN ACTUAL PDF PARAGRAPH

Quantifiers

  • add these on to any of the wildcards

  • \w+ one or more consecutive word characters

  • \w* zero or more consecutive word characters

  • \w{3} exactly 3 consecutive word characters

  • \w{3,} 3 or more consecutive word characters

  • \w{3,5} 3, 4, or 5 consecutive word characters

Using a zero or more * quantifier

STRING: crow, raven ,grackle,starling  ,    robin
SEARCH: \s*,\s*
REPLACE: , 
RESULT: crow,raven,grackle,starling,robin

Using .* for “all the rest”

STRING: 
x, MyWord OtherJunk. ,
13, MyWord2,OtherStuff,,
X13,   MyThirdWord,        MoreTrash,!##
xxx,LastWord     x.

SEARCH: \w+,\s*\w+.*
RESULT: Matches each line completely!

Using captures to join and abbreviate species names

STRING: 
Lasius neoniger
Lasius umbratus
Myrmica lobifrons
SEARCH: \s
REPLACE: _
RESULT:
Lasius_neoniger
Lasius_umbratus
Myrmica_lobifrons

But now use a capture to create an abbreviation
STRING: 
Lasius neoniger
Lasius umbratus
Myrmica lobifrons
SEARCH: (\w)\w+ (\w+)
REPLACE: \1_\2
RESULT:
L_neoniger
L_umbratus
M_lobifrons

Extracting and saving text elements from messy files

STRING: 
x, MyWord OtherJunk. ,
13, MyWord2,OtherStuff,,
X13,   MyThirdWord,        MoreTrash,!##
xxx,LastWord     x.

SEARCH: \w+,\s*\w+.*

SEARCH (with capture): \w+,\s*(\w+).*
REPLACE: \1
RESULT:
MyWord
AnotherOfMyWords
MyThirdWord

Interweaving multiple captures and literal text

STRING: 
x, MyWord OtherJunk. ,
13, MyWord2,OtherStuff,,
X13,   MyThirdWord,        MoreTrash,!##
xxx,LastWord     x.

SEARCH: \w+,\s*\w+.*
SEARCH: \w+,\s*(\w+)(.*)
REPLACE: MY WORD: \1\s LEFTOVER: \2
RESULT:
MY WORD: MyWord LEFTOVER:  OtherJunk. ,
MY WORD: MyWord2 LEFTOVER:
MY WORD: MyThirdWord LEFTOVER: ,        MoreTrash,!##
MY WORD: LastWord LEFTOVER:      x.

Custom character sets

[ATCG] # single character that is A T C or G
[ATCG]+ # DNA sequence

Negated character sets

[^XY]  # a single character that is anything but X OR Y
[^0-9.]+ one or more characters that are not numbers of decimals

Boundary stakes

^ # outside of character set indicates start of line
$ # indicates end of line
- `\<` Start of a word 
- `\>` End of a word 

E11: Search for whole words within regex

STRING:
Try to ascertain a kind of karma.
SEARCH:
a
RESULT: finds 5 matches

Search:
\sa\s
RESULT: finds 1 match (whole word)

Search:
\sa
RESULT: finds 1 match (start of word)

Search:
\wa\s
RESULT: finds 1 match (end of word)

Search:
\wa\w
RESULT: finds 2 matches (middle of word)

E12: Simplify column searches by staking to the beginning or end of the line

STRING:
a,b,c,MyFirstWord,g,MyLastWord
d,ee,FirstKeeper,xx,LastKeeper
SEARCH:
\w+,\w+,\w+$
SEARCH (with capture):
(\w+),\w+,(\w+)$
REPLACE:
\1 \2
RESULT:
a,b,c,d,e,f,MyFirstWord MyLastWord
ee,f,FirstKeeper LastKeeper
# Perhaps not exactly what we wanted!

ALTERNATE SEARCH: 
.*,\w+,\w+,\w+$
ALTERNATE SEARCH: (with capture):
(.*),(\w+),(\w+),(\w+)$
REPLACE:
\2\s\4
RESULT:
MyFirstWord MyLastWord
FirstKeeper LastKeeper

ALTERNATIVE REPLACE:
\1,\2
ALTERNATIVE RESULT:
a,b,c
d,ee

Finding bad characters

STRING: 
X1, 0031
X2, 003l
X3, 0032
X4, 0O32
SEARCH: (\wt,\s,)\d+,$
REPLACE: \1ok
RESULT:
X1, ok
X2, 003l
X3, ok
X4, 0O32

Stripping out comment lines from an R script

STRING:
## header
#
x <- 3 + 5
print (x) # in line comment
#
#
y <- x + pi
# end of code

SEARCH: ^#.*
REPLACE: {nothing}
RESULT:


x <- 3 + 5
print (x) # in line comment


y <- x + pi

Now to eliminate blank lines:
F

Strip out comment lines + in-line comments from an R script

STRING:
## header
#
x <- 3 + 5
print (x) # in line comment
#
#
y <- x + pi
# end of code

SEARCH: #.*
REPLACE: {nothing}
RESULT:


x <- 3 + 5
print (x) 


y <- x + pi

To strip out blank lines try:

SEARCH:
\n\n
REPLACE:
\n

OR

SEARCH:
^\n
REPLACE:
(nothing)

But these do not handle blank lines. Instead try:

SEARCH
^\n|\s+$

REPLACE:
(nothing)

Break a long parameter list by commas onto separate lines (show in Rstudio)

STRING:
aes(x=area,y=richness,color=invasive,size=abundance,shape=taxon)

FIND:
,

REPLACE:
,\n\t\t