
[SOLVED] Complex Bash Script (grep? regex?)

Hello!

I want to make a script that converts a simple text file into an HTML file (by putting the title, description and note in the proper tags).
The simple text file is of the format:

Code: [Select]
[start of file]
[tabkey]title1
description1
[tabkey][tabkey]note1

[tabkey]title2
description2
[tabkey][tabkey]note2

[tabkey]title3
description3
[tabkey][tabkey]note3

... (imagine the above over 50 times, with increasing index number)

[end of file]

and the goal is pretty much to split the file within the newlines, so I can get the following 3 files from the above example:
Code: [Select]
[start of file]
[tabkey]title1
description1
[tabkey][tabkey]note1
[end of file]

Code: [Select]
[start of file]
[tabkey]title2
description2
[tabkey][tabkey]note2
[end of file]

Code: [Select]
[start of file]
[tabkey]title3
description3
[tabkey][tabkey]note3
[end of file]

etc etc

==============

Once I get that far (cutting the file into smaller files) I think I will need no further help, since using https://forum.artixlinux.org/index.php/topic,2964.0.html and the tr command should get those small digestible files into HTML format. I guess I'm making a simple static site generator, but I'm learning bash through this, and I have about 305 files like the above; it's a waste to do it by hand when I could learn something new.
I could do it with a parser written in a proper programming language (C or Python), but I want to learn to do it with bash commands, much like building a function out of simple assembly instructions.

I can find the gap between the blocks with '^$', but I have no idea how to get everything from the start of the file up to it.
Code: [Select]
grep -E '^.*^$'
does not work (start of line, then any number of characters, up to the first ^$; I think this should cover it, yet it doesn't)
Code: [Select]
grep -E '^.*[^^$]'
does not work (start of line, then any number of characters, excluding the first ^$)

My question is: How can I delete from the start of the file to the first '^$' with bash commands?

My goal is to take a file as shown above, and delete/remove the start of file to the first '^$' and also paste that deleted/removed text into a new file. I will loop that ofc, so I can cut all of them, and end up with small simple digestible files.

Re: Complex Bash Script (grep? regex?)

Reply #1
I guess what you need is sed and perhaps awk, but I'm still not familiar with these power tools. You can modify strings with sed using its s command. You load the source file into an array of strings and then modify each string in a loop, writing it to the output file in whichever format you want.

I have such a script. I stumbled upon it somewhere on the internet and then spoiled it to serve my purposes. It takes the GTK bookmarks file and translates it into an Openbox pipe menu, from which I can open bookmarks in the file manager. Here is the script:

Code: [Select]
#!/bin/bash

echo '<openbox_pipe_menu>'

filemanager="pcmanfm"
bookmarksfile="/home/$USER/.config/gtk-3.0/bookmarks"

# One array element per bookmark line (mapfile needs bash 4 or newer).
# Paths: strip the last space and the name after it.
mapfile -t thepaths < <(sed 's/[ ][^ ]*$//' "$bookmarksfile")
# Names: strip the leading path and the space after it.
mapfile -t thenames < <(sed 's/^[a-zA-Z0-9а-яА-Я_%:\/]* //' "$bookmarksfile")

for i in "${!thepaths[@]}"; do
  echo "<item label=\"${thenames[$i]}\">"
  echo '<action name="Execute"><execute>'
  echo "$filemanager ${thepaths[$i]}"
  echo '</execute></action>'
  echo '</item>'
done

echo '</openbox_pipe_menu>'

Don't ask me how exactly it works; I spent quite a while trying to understand how sed works. All I can say is that it builds two arrays of strings: thepaths holds the file paths and thenames holds the names of the bookmarks. Each line of the bookmarks file contains the path first, then a space, then the name. Here is the input:
Code: [Select]
file:///home/victor/Documents Documents
file:///home/victor/Universitas Universitas
file:///home/victor/Documents/books books
file:///home/victor/Documents/Arbeit Arbeit
file:///home/victor/Documents/Arbeit/%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%8B переводы
file:///home/victor/Pictures Pictures
file:///home/victor/Downloads Downloads

Here's the output:
Code: [Select]
<openbox_pipe_menu>
<item label="Documents">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Documents
</execute></action>
</item>
<item label="Universitas">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Universitas
</execute></action>
</item>
<item label="books">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Documents/books
</execute></action>
</item>
<item label="Arbeit">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Documents/Arbeit
</execute></action>
</item>
<item label="переводы">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Documents/Arbeit/%D0%BF%D0%B5%D1%80%D0%B5%D0%B2%D0%BE%D0%B4%D1%8B
</execute></action>
</item>
<item label="Pictures">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Pictures
</execute></action>
</item>
<item label="Downloads">
<action name="Execute"><execute>
pcmanfm file:///home/victor/Downloads
</execute></action>
</item>
</openbox_pipe_menu>
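
For anyone wondering what the two sed expressions do, here is a small sketch on a single line of the input above. Note that the second pattern is simplified to [^ ]* here (that is my assumption, not the character class used in the script) so the example does not depend on the locale.

```shell
#!/bin/sh
# One line in the same "path<space>name" format as the bookmarks file.
entry='file:///home/victor/Documents/books books'

# Delete the last space and everything after it: what is left is the path.
path=$(printf '%s\n' "$entry" | sed 's/[ ][^ ]*$//')

# Delete everything up to and including the first space: what is left is the name.
# (Simplified pattern; the script above lists the allowed characters explicitly.)
name=$(printf '%s\n' "$entry" | sed 's/^[^ ]* //')

printf 'path: %s\nname: %s\n' "$path" "$name"
```

Both expressions are plain s (substitute) commands with an empty replacement, so each one just deletes the part of the line it matches.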

Re: Complex Bash Script (grep? regex?)

Reply #2
Two fundamental problems with your approach are:

1. grep, sed and other tools are fundamentally line based. Whenever they see a new line, they forget everything and start matching again.
2. Even with the -z option (which treats the whole file as a single line), the '.*' pattern is greedy and always matches the longest possible string, so:
Code: [Select]
grep -z $'^.*\n\n'
(remember, ^ $ now refer to the whole file, so '^.*^$' wouldn't work) would still match all sections of your file but the last.
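
To illustrate point 1: since grep works one line at a time, the most it can tell you about the blank line is where it is. A rough sketch (sample.txt is a made-up file name, and it assumes the file has at least one blank line) that feeds that line number to head to get the part before the first blank line:

```shell
#!/bin/sh
# Work in a scratch directory with a small sample in the format from the first post.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '\ttitle1\ndescription1\n\t\tnote1\n\n\ttitle2\ndescription2\n\t\tnote2\n' > sample.txt

# grep -n prefixes each matching line with "number:"; keep only the first match.
first_blank=$(grep -n '^$' sample.txt | head -n 1 | cut -d: -f1)

# Everything before that line is the first block.
first_block=$(head -n "$((first_blank - 1))" sample.txt)
printf '%s\n' "$first_block"
```

This works, but it already takes three tools to answer "where is the first blank line", which is why a loop or awk is the more comfortable fit.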

Your two best options here are either awk (since it has variables and some file manipulation capabilities) or a shell script using a while read loop.

Code: [Select]
i=0
while IFS='' read -r line; do
  # test empty line
  if [ "$line" ];  then
    # non empty line, append to numbered document file
    printf '%s\n' "$line" >> "document$i"
  else
    # empty line - increase i only
    : $((i+=1))
  fi
done < my_notes.txt
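
For comparison, the awk option could look roughly like the sketch below. It relies on awk's paragraph mode (an empty RS makes every run of non-empty lines one record); the file names sample_notes.txt and section_N.txt are made up for the example.

```shell
#!/bin/sh
# Work in a scratch directory with a small sample in the format from the first post.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '\ttitle1\ndescription1\n\t\tnote1\n\n\ttitle2\ndescription2\n\t\tnote2\n' > sample_notes.txt

# RS = "" switches awk to paragraph mode: blank lines separate the records.
awk 'BEGIN { RS = "" }
{
    out = "section_" NR ".txt"   # NR is the number of the current block
    print > out                  # write the whole block plus a final newline
    close(out)                   # do not keep hundreds of files open
}' sample_notes.txt

ls section_*.txt
```

Unlike the loop above, this version overwrites its output files instead of appending, so running it twice does not duplicate the content.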

Re: Complex Bash Script (grep? regex?)

Reply #3
Quote
My question is: How can I delete from the start of the file to the first '^$' with bash commands?
bash (built-in) commands, or command-line programs, such as sed, awk, tr etc? The latter are standalone programs.

There was a text, whose title I cannot remember for the life of me, I think by one of the authors of awk, saying that awk and similar programs such as sed are merely tools, and that the user should decide whether they are adequate for the problem. There are problems for which those tools are not a good match and a full-blown standalone program is a better solution than a script.

When I first wrote my roll-my-own static site generator, I made it as a Bash script using programs such as sed and awk, but I quickly realised that it wouldn't be enough for my needs. I needed them to work on the entire file because of multiple passes (conversion of Markdown to HTML requires that for links, images and footnotes), and on very long pages the text would be truncated. So I rewrote my website generator in C.

Edit: If you don't already know about it, you should check out the GNU Recutils project.

Re: Complex Bash Script (grep? regex?)

Reply #4
@capezotte
Code: [Select]
grep -z $'^.*\n\n'
doesn't work, it outputs the entire file

Also you are correct on the 2 fundamental problems, I'm using the wrong tools as strajder said

As for awk and sed, I am not familiar with them, so doing something like this will take quite a while for sure (or even worse, I will spam this board with newbie questions lul)

Quote
shell scripts using a while read loop.

Code: [Select]
i=0
while IFS='' read -r line; do
  # test empty line
  if [ "$line" ];  then
    # non empty line, append to numbered document file
    printf '%s\n' "$line" >> "document$i"
  else
    # empty line - increase i only
    : $((i+=1))
  fi
done < my_notes.txt
I am not familiar with shell scripts, what does the above do?


Quote
bash (built-in) commands, or command-line programs, such as sed, awk, tr etc? The latter are standalone programs.

There was a text, whose title I cannot remember for the life of me, I think by one of the authors of awk, saying that awk and similar programs such as sed are merely tools, and that the user should decide whether they are adequate for the problem. There are problems for which those tools are not a good match and a full-blown standalone program is a better solution than a script.
I agree that there is no silver bullet, but what I'm trying to do will need no further extensions or updates and is quite simple in any programming language, so I figured it would be easy with bash commands piped together. I have the time, so why not learn bash along the way?

Quote
Edit: If you don't already know about it, you should check out the GNU Recutils project.
https://www.gnu.org/software/recutils/manual/A-Little-Example.html#A-Little-Example
It demands too much information in the .rec file; I want to avoid having to write the title/description/notes field names on every line (I have 300-something blocks, so I would have to write double the text just to fill in the entry field identifiers).

It turns out I have 2 solutions:
1. I will try to do it with vim macros (never tried it), so I will pretty much do:
- Recording hotkey
- gg (start of file)
- 0 (start of line)
- v (visual mode)
- / (find)
- \n\n (2 newlines)
- d (this works perfectly up to here, no idea how to save the recording to make it a macro though lol)
- pipe
I use vim but never used recording or other advanced stuff, so if I fail here as well...
2. I will just use python lul, shouldn't take more than a day, I remember some years ago I made a text parser with C when I was still a newbie in programming, so with python it should be braindead easy

Thanks for the replies, they didn't give me a solution, but they confirmed what I'm doing will take too much time, and why it doesn't work. So, the problem is solved in a way haha

Off-topic: @strajder @VictorBrand bless you both for replying so fast, and with helpful replies

Re: [SOLVED] Complex Bash Script (grep? regex?)

Reply #5
Quote
@capezotte
Code: [Select]
grep -z $'^.*\n\n'
doesn't work, it outputs the entire file

That's what I predicted.

Quote
As for awk and sed, I am not familiar with them, so doing something like this will take quite a while for sure (or even worse, I will spam this board with newbie questions lul)

I wouldn't mind. Your questions so far show effort and research.

Also, sed and awk are very common in shell scripts in the wild, and by most measures are shell tools just like grep and tr (though they have embedded languages within them, with sed being very simple and learnable in a few hours, while awk looks more like C).
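
As a small taste of sed, here is a sketch of the original question (delete from the start of the file to the first empty line). sample.txt is a made-up file, and the second command assumes the first line of the file is not itself empty.

```shell
#!/bin/sh
# Work in a scratch directory with a small sample in the format from the first post.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '\ttitle1\ndescription1\n\t\tnote1\n\n\ttitle2\ndescription2\n\t\tnote2\n' > sample.txt

# Print lines until the first empty line, then quit without printing it.
first_block=$(sed -n -e '/^$/q' -e 'p' sample.txt)

# Delete from line 1 through the first empty line and keep the rest.
rest=$(sed '1,/^$/d' sample.txt)

printf 'first block:\n%s\n\nrest:\n%s\n' "$first_block" "$rest"
```

Here 1,/^$/ is an address range (from line 1 to the first line matching ^$), d deletes, q quits and p prints; most sed one-liners are built from pieces like these.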

Quote
I am not familiar with shell scripts, what does the above do?

It sets a variable (i) to zero at the top.

While it's possible to read lines from the file (each stored in the variable line), keeping backslashes unchanged (-r) and without removing leading whitespace (IFS=''):
- If the variable line is not empty ([ "$line" ]), append the line to a file named "document$i", where $i is the variable i we created at the top.
- Otherwise, add 1 to i (which moves the content to a new file). I probably should've written it as i=$((i+1)) to make things a bit clearer.
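
To see why the IFS='' and -r parts matter for a file full of leading tabs, here is a tiny sketch that reads the same line twice, once with a plain read and once the way the loop does it. The sample line is made up.

```shell
#!/bin/sh
# A line with two leading tabs and a literal backslash: <TAB><TAB>note\1
sample=$(printf '\t\tnote\\1')

# Plain read: leading whitespace is trimmed and the backslash is consumed.
plain=$(printf '%s\n' "$sample" | { read line; printf '%s' "$line"; })

# IFS= read -r: the line is kept exactly as it is in the file.
exact=$(printf '%s\n' "$sample" | { IFS= read -r line; printf '%s' "$line"; })

printf 'plain: [%s]\nexact: [%s]\n' "$plain" "$exact"
```

With the plain read the tabs that mark titles and notes would be lost, so the split files could no longer be told apart by indentation.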

Quote
I agree that there is no silver bullet, but what I'm trying to do will need no further extensions/updates, and is quite simple if written in any programming language, so I figured it would be possible easily with bash piping commands, and I have time, so why not learn bash pretty much

As strajder kind of hinted, learning bash isn't really learning bash, but rather learning an ecosystem of standalone tools and wielding the power of piping to make them do what you want.

grep, sed, awk and co. exist independent of bash and technically can be used with any scripting language that supports calling external programs.

Likewise, you can enhance bash with your own commands. Nothing's stopping you from writing the "file splitting" part in Python, then doing the rest of the editing (converting markdown to html and co.) in the shell. I admit it's kind of sad when I see a project desperately trying to avoid bash by using subprocess.call() and sometimes even os.dup() (which makes it unportable to Windows anyway).

Re: [SOLVED] Complex Bash Script (grep? regex?)

Reply #6
Quote
It sets a variable (i) to zero at the top.

While it's possible to read lines from the file (each stored in the variable line), keeping backslashes unchanged (-r) and without removing leading whitespace (IFS=''):
- If the variable line is not empty ([ "$line" ]), append the line to a file named "document$i", where $i is the variable i we created at the top.
- Otherwise, add 1 to i (which moves the content to a new file). I probably should've written it as i=$((i+1)) to make things a bit clearer.
Thank you very much for this, it looks like hieroglyphics until explained hahaha
I thought it was posted as an example of "this comes pretty close to what you want" like the first reply
This solution + the vim part I posted at the end, ends up with 2 solutions for the problem I have, and even for the vim one I wouldn't have thought of it if not for reading this thread lol

Quote
As strajder kind of hinted, learning bash isn't really learning bash, but rather learning an ecosystem of standalone tools and wielding the power of piping to make them do what you want.
Yup, this is why I want to do this via bash, so I pipe things into each other, since I rarely pipe more than 2 programs in my daily use (usually | vim, or just >)

Re: [SOLVED] Complex Bash Script (grep? regex?)

Reply #7
One conceptual solution might be to loop through the file line by line doing something like this in pseudo-code, which could be done quite easily in BASH or any language:
create a counter variable set to 1
create a flag variable
read lines in a loop
look for a line starting with exactly one "tab"; when you get it, save it as title plus counter and set the flag
look for a line starting with "tab tab"; when you get it, save it as note plus counter, unset the flag and increment the counter
if flag isn't set discard the line
else save it (appending to file or temporary variable) as description plus counter
continue looping until file end is reached
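
The pseudo-code above could be sketched in shell roughly like this. The file names (sample.txt, title_N, description_N, note_N) and the choice to strip the leading tabs before saving are my assumptions, not part of the pseudo-code.

```shell
#!/bin/sh
# Work in a scratch directory with a small sample in the format from the first post.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '\ttitle1\ndescription1\n\t\tnote1\n\n\ttitle2\ndescription2\n\t\tnote2\n' > sample.txt

tab=$(printf '\t')
counter=1
in_block=0   # the flag: 1 between a title and its note

while IFS= read -r line; do
  case $line in
    "$tab$tab"*)   # two tabs: the note, which also ends the block
      printf '%s\n' "${line#"$tab$tab"}" > "note_$counter"
      in_block=0
      counter=$((counter + 1))
      ;;
    "$tab"*)       # one tab: the title, which starts a block
      printf '%s\n' "${line#"$tab"}" > "title_$counter"
      in_block=1
      ;;
    *)             # anything else is description text, but only inside a block
      if [ "$in_block" -eq 1 ]; then
        printf '%s\n' "$line" >> "description_$counter"
      fi
      ;;
  esac
done < sample.txt

ls
```

The two-tab pattern has to be tested before the one-tab pattern, because a line starting with two tabs also starts with one. The blank separator lines arrive while the flag is unset, so they are discarded.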