[SOLVED] How to Regex Repetition? (grep)

Hello!
I want to make a bash script in linux, using terminal commands, and I'm stuck at the very beginning

I have the following text file:

Code: [Select]
[start of file]
[tabkey]text1
text2
[tabkey][tabkey]text3
[end of file]

Each of the texts above is on its own line, so there are 3 lines in total. The first line has 1 tab at the start, the third has 2 tabs.

If I use

Code: [Select]
grep  $'\t'

I get all lines with tabs, but not highlighted ofc.
So I ended up using

Code: [Select]
grep $'\t'".*"

to get text1 and text3.
However, how can I get only 1 \t?

I want to get exclusively text1, or exclusively text3, depending on tab count. I ask this because I can't wrap my head around repetition: {N} to repeat the preceding item doesn't seem to work even for letters, yet I need it for the tab character.

Re: How to Regex Repetition? (grep)

Reply #1
Hello! In your case, it's better to use the extended regexp syntax with grep, which is the -E option. If you want exactly one \t at the beginning, mark the beginning of the line with ^, put the \t after it, and then specify any symbol which is not \t:
Code: [Select]
grep -E $'^\t[^\t]'
(this will return text1).

If you need a specific number of \t before you encounter a non-\t symbol, specify this number in {}:
Code: [Select]
grep -E $'^\t{2}[^\t]'
(this will return text3).
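
To try the two commands above end to end, here is a minimal sketch, assuming bash and a grep that supports -E. The temporary file just recreates the three sample lines from the question; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Recreate the sample file from the question in a temporary file.
sample=$(mktemp)
printf '\ttext1\ntext2\n\t\ttext3\n' > "$sample"

# Exactly one leading tab, followed by a character that is not a tab.
one_tab=$(grep -E $'^\t[^\t]' "$sample")

# Exactly two leading tabs, followed by a character that is not a tab.
two_tabs=$(grep -E $'^\t{2}[^\t]' "$sample")

printf 'one tab:  %s\n' "$one_tab"
printf 'two tabs: %s\n' "$two_tabs"

rm -f "$sample"
```

Note that grep prints the whole matching line, so the leading tabs are still part of the output.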

Re: How to Regex Repetition? (grep)

Reply #2
Quote
If you need a specific number of \t before you encounter a non-\t symbol, specify this number in {}:

Just to add that this is explained in man grep:
Quote
       {n}    The preceding item is matched exactly n times.

Edit: Also,
Quote
   Basic vs Extended Regular Expressions
       In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
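
The man page excerpt above is probably why {N} did not seem to work without -E. A small sketch of the difference, assuming bash and GNU grep (the variable names are made up for the example):

```shell
#!/usr/bin/env bash
sample=$(mktemp)
printf '\ttext1\ntext2\n\t\ttext3\n' > "$sample"
tab=$'\t'

# Extended syntax (-E): {2} is a repetition count.
ere=$(grep -E "^${tab}{2}[^${tab}]" "$sample")

# Basic syntax (no -E): the braces must be backslashed to mean the same thing.
bre=$(grep "^${tab}\{2\}[^${tab}]" "$sample")

# Basic syntax with plain braces: grep looks for the literal text "{2}",
# so nothing in the sample file matches ("|| true" hides grep's exit status 1).
literal=$(grep "^${tab}{2}[^${tab}]" "$sample" || true)

printf 'ERE:     %s\n' "$ere"
printf 'BRE:     %s\n' "$bre"
printf 'literal: [%s]\n' "$literal"

rm -f "$sample"
```

Both of the first two commands should select the text3 line, while the third finds nothing.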

Re: How to Regex Repetition? (grep)

Reply #3
That was a swift reply, and it also worked instantly, wow

regex is hard I have to admit, getting a few word variants here and there seems easy, but anything beyond that gets very hard, and I'm kinda lucky I don't want anything more of regex

I was about to post on "why does this work" but I experimented and learned some new things lol
'^\t' is a tab at the start of the line, {N} repeats the preceding item N times (in extended regex), and [^\t] initially seems bloated but is there to rule out further tabs (I was about to write tl;dr "why put this lol" until I added a line with 3 tabs to the same file; it seems [^...] matches any single character except the ones listed after the ^)

It turns out grep -E $'^\t{N}[^\t].*' is perfect. I would have never ended up on this alone, bless.
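
For a script, the pattern above can be wrapped so that the tab count is a parameter. This is only a sketch, assuming bash; lines_with_tabs is a hypothetical helper name, and the text4 line with 3 tabs is added here just to show that longer runs of tabs are excluded.

```shell
#!/usr/bin/env bash
# lines_with_tabs is a hypothetical helper, not something from the thread.
# Usage: lines_with_tabs N FILE
lines_with_tabs() {
    local n=$1 file=$2
    local tab=$'\t'
    # Inside double quotes, {N} reaches grep unchanged, so the pattern reads:
    # start of line, exactly N tabs, then one character that is not a tab.
    grep -E "^${tab}{${n}}[^${tab}]" "$file"
}

sample=$(mktemp)
printf '\ttext1\ntext2\n\t\ttext3\n\t\t\ttext4\n' > "$sample"

one=$(lines_with_tabs 1 "$sample")
two=$(lines_with_tabs 2 "$sample")
three=$(lines_with_tabs 3 "$sample")
printf '%s\n' "$one" "$two" "$three"

rm -f "$sample"
```

The trailing .* is left out because grep prints the whole matching line anyway; it only matters for what gets highlighted or printed with -o.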

Though kinda off-topic, that dollar sign at the start, what is it for?


 

Re: [SOLVED] How to Regex Repetition? (grep)

Reply #5
Quote
That was a swift reply, and it also worked instantly, wow

regex is hard I have to admit, getting a few word variants here and there seems easy, but anything beyond that gets very hard, and I'm kinda lucky I don't want anything more of regex
Glad you are satisfied :)

Regular expressions may indeed be hard to understand when they are written in a complex way, but in fact they are quite simple and powerful at the same time. The most confusing thing about them is that regexp syntax varies somewhat from one application to another, although its core is the same. There is a good short educational video on regexps; the guy explains them quite clearly.

The only caveat about regexps is performance. Engines built on finite automata scan the input in roughly linear time, but backtracking engines, which are common in scripting languages, can get very slow on some patterns. This is not an issue when you use them here and there in your scripts, though. But I once saw someone write a sort of scrap-metal parser out of regexps. There were loops of regexps operating on strings extracted in other loops of regexps, and so on and so forth. Unsurprisingly, this thing was painfully slow.

Quote
Though kinda off-topic, that dollar sign at the start, what is it for?
It's part of the bash syntax, called ANSI-C quoting. Inside $'...', backslash escape sequences are interpreted and replaced by the characters they stand for, so \t becomes a real tab character. In our case \t must be translated by the shell, because grep doesn't recognize such sequences in its basic or extended regexp syntax. You can read about that here.
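
A quick way to see what the dollar sign changes, as a minimal sketch assuming bash (the variable names are only for illustration):

```shell
#!/usr/bin/env bash
plain='\t'    # ordinary single quotes: two characters, a backslash and a "t"
ansi=$'\t'    # $'...' quoting: one character, a real tab

printf 'plain has %d characters\n' "${#plain}"
printf 'ansi has %d character\n' "${#ansi}"

# The same tab can also be produced by printf, which works in shells without $'...'.
from_printf=$(printf '\t')
if [ "$ansi" = "$from_printf" ]; then
    echo "both are the same tab character"
fi
```

So with $'\t' grep receives an actual tab in its pattern, while with '\t' it receives a backslash followed by the letter t.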