Hello. I have 12 letters, a-l, and I need to group them in pairs like this: ab cd ef gh ij kl.
Next, I'm trying to sort a list of 6-character strings based on the grouping I mentioned at the beginning.
E.g. if the string is 'a a a b b c', I should get the output 5 1, meaning there are 5 letters from the 'ab' group and 1 from the 'cd' group.
The output should be in descending order, like 51 or 42 or 3111 etc.
I know for sure awk can do this like it's a piece of cake, but I'm no awk wizard right now. 😬
Any cool ideas? If not awk, then Python or Perl; I'm not very picky. In the meantime I'm trying to come up with a Python script for this. Tnx for any hints ✌🏻
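A minimal Python sketch of the counting described above, assuming the strings are short enough to scan with `str.count`. The function name `group_pattern` and the default `groups` tuple are illustrative choices, not something from the thread; the counts come back as a list of integers rather than a joined string.

```python
def group_pattern(text, groups=("ab", "cd", "ef", "gh", "ij", "kl")):
    """Count how many characters of `text` fall in each group, largest first."""
    # One total per group: add up the occurrences of every letter in the group.
    counts = [sum(text.count(ch) for ch in group) for group in groups]
    # Drop groups that never occur and order the rest in descending order.
    return sorted((c for c in counts if c), reverse=True)


print(group_pattern("a a a b b c"))  # [5, 1]
print(group_pattern("acegik"))       # [1, 1, 1, 1, 1, 1]
```

Spaces are simply ignored here because they belong to no group.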
I have this awk line, which reads from the file 'fin':
awk -F 'a' '{s+=(NF-1)} END {print s}' fin
How can I modify it so it counts not only 'a' but 'b' too, within the 'ab' group? Awk is horribly complicated.
Or I can do it like this
awk -F 'a' '{s+=(NF-1)} END {print s}' fin
awk -F 'b' '{s+=(NF-1)} END {print s}' fin
and then sum up the results. I also need to make it read each row of the 'fin' file.
Use python. Unless speed of execution is an issue. Not that I know for sure that awk is fast ? Though I assume it is faster than python?
I agree though as awk is only marginally more understandable to me than brainfuck (https://en.wikipedia.org/wiki/Brainfuck)
Well, I came up with this one; primitive, but it's working. Now I need to make it run for each row of the fin file, not just one. If anyone can polish it up, that would be awesome :P
*I also need to put the output in descending order.
{
awk -F 'a' '{s+=(NF-1)} END {print s}' fin
awk -F 'b' '{s+=(NF-1)} END {print s}' fin
} > ab
{
awk -F 'c' '{s+=(NF-1)} END {print s}' fin
awk -F 'd' '{s+=(NF-1)} END {print s}' fin
} > cd
{
awk -F 'e' '{s+=(NF-1)} END {print s}' fin
awk -F 'f' '{s+=(NF-1)} END {print s}' fin
} > ef
{
awk -F 'g' '{s+=(NF-1)} END {print s}' fin
awk -F 'h' '{s+=(NF-1)} END {print s}' fin
} > gh
{
awk -F 'i' '{s+=(NF-1)} END {print s}' fin
awk -F 'j' '{s+=(NF-1)} END {print s}' fin
} > ij
{
awk -F 'k' '{s+=(NF-1)} END {print s}' fin
awk -F 'l' '{s+=(NF-1)} END {print s}' fin
} > kl
{
awk '{ sum += $1 } END { print sum }' ab
awk '{ sum += $1 } END { print sum }' cd
awk '{ sum += $1 } END { print sum }' ef
awk '{ sum += $1 } END { print sum }' gh
awk '{ sum += $1 } END { print sum }' ij
awk '{ sum += $1 } END { print sum }' kl
} > allc
cat allc | tr -d '\n' > tc4
cat tc4
echo " "
rm allc tc4 ab cd ef gh ij kl
exec bash
Tried this but it didn't work xd
while read in; do bash x.sh "$in"; done < fin
I named my script x.sh, and the file it reads from is called fin.
It sums up the results from all lines into just 1 line, and for the other lines it gives the error 'command not found', even though that isn't a command; it's supposed to bloody read from the fin file.
Any ideas how this should be done properly? Tnx
edit: tried with xargs and at least got 3 lines, but with the same result on all of them, so there's a little progress, but minuscule.
cat fin | xargs -L1 bash x.sh
Found a python script
word = "mississippi"
counter = {}
for letter in word:
    if letter not in counter:
        counter[letter] = 0
    counter[letter] += 1
counter
print(counter)
But idk how to make it read line by line from a file. Tried
word = str(open("fin.txt"))
but this way it counts from the python script itself and not from the text file 😬 Does any Python guru have an idea how I can make this script loop through the text file line by line?
Also tried this, but it counts the lines, not the chars inside those lines.
with open('fin.txt') as f:
    lines = [line.rstrip() for line in f]
word = lines
counter = {}
for letter in word:
    if letter not in counter:
        counter[letter] = 0
    counter[letter] += 1
print(counter)
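A possible explanation for the two attempts above, with a hedged sketch of a per-line count. `str(open("fin.txt"))` returns the text representation of the file object rather than the file's contents, and `for letter in lines` walks over whole lines because `lines` is a list. The sketch below loops over the characters of each line instead; the names `count_chars_per_line` and `count_file` are made up for illustration.

```python
from collections import Counter


def count_chars_per_line(lines):
    """Return one Counter per line, ignoring whitespace (including the newline)."""
    return [Counter(ch for ch in line if not ch.isspace()) for line in lines]


def count_file(path):
    """Hypothetical helper: apply the counting to every line of a text file."""
    with open(path) as f:
        return count_chars_per_line(f)


# Demo on in-memory lines; for the real data one would call count_file("fin.txt").
for counts in count_chars_per_line(["mississippi\n", "a a a b b c\n"]):
    print(dict(counts))
```

`collections.Counter` does the same bookkeeping as the hand-written dictionary in the script above.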
Firstly, I am an absolute newbie when it comes to questions like yours, however, these problems fascinate me and I did a little search and found this:
https://stackoverflow.com/questions/41029735/piping-sed-awk-or-awk-sed
It may or may not help you, but I just thought I would throw it out there anyway :)
I think I'm close to finding a solution in Python. Python looks the friendliest to noobs like me, but make no mistake, Python can be mind-blowingly complicated too; it's just more accessible for everybody, n00b or expert.
Tnx for the link 👍🏻
I would like to learn some python, but so far had not had the time to take it up. Perhaps one day I will be forced into learning it just to solve a problem like you have got!
Hope you manage to find a successful solution :)
Found another Python script that comes pretty close to what I need. It's just that this one splits the lines and counts the length of each word in the file, but doesn't count each char of each string in that file.
def fileCount(fname):
    #counting variables
    d = {"lines":0, "words": 0, "lengths":[]}
    #file is opened and assigned a variable
    with open(fname, 'r') as f:
        for line in f:
            # split into words
            spl = line.split()
            # increase count for each line
            d["lines"] += 1
            # add length of split list which will give total words
            d["words"] += len(spl)
            # get the length of each word and sum
            d["lengths"].append(sum(len(word) for word in spl))
    return d
def main():
    fname = input('Enter the name of the file to be used: ')
    data = fileCount(fname)
    print ("There are {lines} lines in the file.".format(**data))
    print ("There are {} characters in the file.".format(sum(data["lengths"])))
    print ("There are {words} words in the file.".format(**data))
    # enumerate over the lengths, outputting char count for each line
    for ind, s in enumerate(data["lengths"], 1):
        print("Line: {} has {} characters.".format(ind, s))
main()
Think I need to find the proper syntax so it gives the letter count not words count.
d["words"] += len(spl)
and here when I print results..
print ("There are {words} words in the file.".format(**data))
I think I'm so close to dodging maybe 2 years' worth of Python learning lol
I'm not sure I understand what you're trying to do, but maybe this will be helpful:
with open("fin.txt", "r") as file:
    text = file.readlines()
line_list = []
for line in text:
    letter_list = []
    for letter in line.lower():
        if letter.isalpha():
            letter_list.append(letter)
    if len(letter_list) != 0:
        words = []
        for x in range(0, len(letter_list), 6):
            word = letter_list[x: x + 6]
            words.append(word)
        line_list.append(words)
groups = ["ab", "cd", "ef", "gh", "ij", "kl"]
gr = {}
for la in line_list:
    for word in la:
        word.sort()
        counter = ""
        for group in groups:
            count = word.count(group[0])+ word.count(group[1])
            if count != 0:
                counter += str(count)
        gr["".join(word)]= "".join(sorted(counter, reverse=True))
for word, count in gr.items():
    print(word, count)
Cool, perfect implementation. I'm going to use this script to try to obtain a math formula for what I'll call 'Split Arrangements', meaning the way things arrange when the total set of letters is split into groups; in this case groups of two, but hopefully I'll get a general formula and eventually capture the grouping itself somewhere in it.
To give you a bit of context, I've already determined a general formula, but based on repeating elements (I called those Mixed Arrangements) and not on grouping them like in the case I'm working on now.
After I crack/deduce the formula I'm going to write another script for it, but until then there's a bit of work.
Ultimately Split arr., same as Mixed arr., can be used to discover strings with the highest mathematically possible entropy.
With grouping it makes no difference whether an element repeats or not; what counts is just how many times the elements of a given group appear.
So e.g. a string like 'aaaa' will be in the same league as 'abba', while when counting repeating elements those two are in quite different leagues.
It will be interesting to find where one type of arr. vs the other has strengths and where it has weaknesses.
For combinations these things are easy to do, but when it comes to these arr. things go wild.
And of course thank you very much for the script, it's perfect 8)
Tried the script on the real thing and it looks like it somehow crops the output from 2.985.984 strings/lines to just a little over 12k.
I checked whether it was nano's fault while processing that raw text file, but no, nano is safe and sound :D
So from a text file of around 20 MB I end up with just 140 kB :o
Is there some bug in Python or is it simply the script?
I think it's because of the file size; the script can digest at most 12k lines (12376)
as it is now.
Tnx again for the help 😇
Edit: I'll try to break that big 20 MB text file into smaller ones; maybe it can do it that way.
Yep, re-verified it. Even if I split the larger file into smaller ones, it still crops by about 6x or more. I have no idea why. If you feed it 5000 lines it should print 5000, not fewer.
It loses about 3 lines every 36.
My python knowledge is limited to the basics so expect bugs or that something will not work as expected.
Try this:
import fileinput
line_list = []
line_count = 0
for line in fileinput.input(["fin.txt"]):
    letter_list = []
    for letter in line.lower():
        if letter.isalpha():
            letter_list.append(letter)
    if len(letter_list) != 0:
        words = []
        for x in range(0, len(letter_list), 6):
            word = letter_list[x: x + 6]
            words.append(word)
        line_list.append(words)
    line_count += 1
groups = ["ab", "cd", "ef", "gh", "ij", "kl"]
gr = {}
for la in line_list:
    for word in la:
        word.sort()
        counter = ""
        for group in groups:
            count = word.count(group[0]) + word.count(group[1])
            if count != 0:
                counter += str(count)
        gr["".join(word)] = "".join(sorted(counter, reverse=True))
for word, count in gr.items():
    print(word, count)
print('-'* 30)
print(f"Line count = {line_count}")
Or this:
with open("fin.txt") as file:
    line_list = []
    line_count = 0
    for line in file:
        letter_list = []
        for letter in line.lower():
            if letter.isalpha():
                letter_list.append(letter)
        if len(letter_list) != 0:
            words = []
            for x in range(0, len(letter_list), 6):
                word = letter_list[x: x + 6]
                words.append(word)
            line_list.append(words)
        line_count += 1
groups = ["ab", "cd", "ef", "gh", "ij", "kl"]
gr = {}
for la in line_list:
    for word in la:
        word.sort()
        counter = ""
        for group in groups:
            count = word.count(group[0]) + word.count(group[1])
            if count != 0:
                counter += str(count)
        gr["".join(word)] = "".join(sorted(counter, reverse=True))
for word, count in gr.items():
    print(word, count)
print('-'* 30)
print(f"Line count = {line_count}")
It should work better for large files. Check if 'line_count' is correct. If it is, then it must be some other problem.
Tried both of the latest versions. Both print the correct count at the bottom, but the rest of the output is still about 89% chopped away; only around 11% of the data is there.
Anyway, it does something, but it's just super unreliable. Tnx for trying though; maybe some parts of the script must be polished to get it accurate.
Tnx for trying once again. ✌🏻
One more try :D , I think I've found the problem,
I saved words to the dictionary without dividing by line, so if a word repeated itself on many lines it was saved in the result only once, and with six-letter words there were many repetitions of the same word
with open("fin.txt") as file:
    line_list = []
    line_count = 0
    for line in file:
        letter_list = []
        for letter in line.lower():
            if letter.isalpha():
                letter_list.append(letter)
        if len(letter_list) != 0:
            words = []
            for x in range(0, len(letter_list), 6):
                word = letter_list[x: x + 6]
                words.append(word)
            line_list.append(words)
        line_count += 1
groups = ["ab", "cd", "ef", "gh", "ij", "kl"]
result = []
for la in line_list:
    gr = {}
    for word in la:
        word.sort()
        counter = ""
        for group in groups:
            count = word.count(group[0]) + word.count(group[1])
            if count != 0:
                counter += str(count)
        gr["".join(word)] = "".join(sorted(counter, reverse=True))
    result.append(gr)
for ind, value in enumerate(result):
    print(f"Line {ind} - {value}")
    print("-"* 30)
print(f"Line count = {line_count}")
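A rough way to see the cropping reported earlier, assuming the input file holds every 6-letter string over a-l (12^6 = 2.985.984 lines). Sorting each word before using it as a dictionary key leaves one key per multiset of letters, and the number of multisets of size 6 drawn from 12 letters is C(17, 6) = 12376, which is the line count mentioned above. The helper name is made up for illustration, and the demo uses a smaller alphabet so it runs instantly.

```python
from itertools import product
from math import comb


def distinct_sorted_keys(alphabet, length):
    """Count the dictionary keys left when every word is sorted before storing."""
    return len({"".join(sorted(word)) for word in product(alphabet, repeat=length)})


# Small demo: 4 letters, length 3 -> 64 strings but only C(6, 3) distinct keys.
print(distinct_sorted_keys("abcd", 3), comb(4 + 3 - 1, 3))  # 20 20
# Same formula for 12 letters and length 6.
print(comb(12 + 6 - 1, 6))  # 12376
```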
Yeah, now it looks like we're onto something good. I'll try to remove those intermediary lines so the output file stays approx. the same size.
# print("-"* 30)
It looks good at first view. I'll try to process the file and come back with the results. Good script! Where's that thumb up when you needed..aw here it is 👍🏻
Tnx 🎯
Managed to process it, and it looks like, for the chosen grouping example, the best is to choose 2 letters (repeated or not) from each of 2 different groups, plus 2 other elements, each belonging to a different group. Here's the stats image in case someone is curious.
The sequence at the bottom is not present in the OEIS database, so we have a W. The challenge that still remains is to deduce a general formula.
As a side note, you can see that choosing 1 letter from each group does not give the largest number of arr.
*I had to split the file in 6 and sum the parts below, in ascending order..
(https://i.postimg.cc/5NyHHZp9/bt5c3.png)
Tnx for the help.
Now look at this small example. The string highlighted in the green box is artificially 'crafted' so that it fits both criteria (mixed and split arr.) with the highest possible number of variants, hence it has higher entropy. If the string matches only one criterion that's good, but not as good as if it scores on both ✅✅
Even if it may look like it's being singled out, in fact it isn't. As long as the chosen string lands in those larger piles, the more criteria are used the better, meaning those strings have higher and higher entropy.
(https://i.postimg.cc/yxNgN3Qn/nybt7.png)
Formula cracked, here's what all the calculations look like. Done putting it on OEIS too. Hope I have more luck this time.
All those slices, bigger or smaller, get multiplied by G^s (where G is the number of elements within a group and s is the string length); everything else is the same as in my other formula, but everything relates to groups and not to single elements.
example
ab cd ef gh ij kl
A(number of groups, number of groups in the pattern)
2 2 1 1: (2^2)×(2^2)×(2^1)×(2^1), which is 2^6 (G^s), × A(6,4) × 6!/((2!^2)×2!×(6-4)!) = 64 × 360 × 45 = 1.036.800
It also applies to all cases, either t>s or t<s; the only requirement is that the groups are equal in size, so G has to be one of t's divisors, i.e. t has to be a multiple of G
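The 2 2 1 1 figure above can be cross-checked by brute force. This is only a sketch under the assumptions stated in the thread (6 groups of 2 letters, strings of length 6, letters may repeat); the function name and parameters are illustrative. It enumerates which group each position belongs to (6^6 cases) and multiplies the number of matches by G^s for the choice of letter inside each group.

```python
from collections import Counter
from itertools import product


def count_strings_with_pattern(num_groups, group_size, length, pattern):
    """Count strings whose per-group letter counts, sorted descending, equal `pattern`."""
    target = tuple(sorted(pattern, reverse=True))
    matching = 0
    # Each tuple says which group every position of the string is drawn from.
    for labels in product(range(num_groups), repeat=length):
        found = tuple(sorted(Counter(labels).values(), reverse=True))
        if found == target:
            matching += 1
    # Every position can then be any of the `group_size` letters of its group.
    return matching * group_size ** length


print(count_strings_with_pattern(6, 2, 6, (2, 2, 1, 1)))        # 1036800
print(count_strings_with_pattern(6, 2, 6, (1, 1, 1, 1, 1, 1)))  # 46080
```

The second call is the "1 letter from each group" case, which comes out far smaller than the 2 2 1 1 pile.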
(https://i.postimg.cc/Y0xgfHwJ/btc4ex.png)
Hello is there anywhere I can modify groups?
I mean for example if I wanna split fields in uneven groups like
abcdefg hijkl
Tried to modify the script groups
groups = ["abcdefg", "hijkl"]
but looks like there's another part of the script which doesn't like that. I'm assuming this
count = word.count(group[0]) + word.count(group[1])
Thank you if you can take a look. The previous script is excellent for that grouping type, but from a bird's-eye view it looks like there's no easy way for me to make it more versatile. And speaking of birds, as a small digression from the main topic: today a crow almost hit me on the head with a walnut she threw to break it and, of course, eat it. 😂
Tnx and ✌🏻
Change this part of code :
for group in groups:
    count = word.count(group[0]) + word.count(group[1])
    if count != 0:
        counter += str(count)
to this, output with "0"
for group in groups:
    count = 0
    for n in range(len(group)):
        count += word.count(group[n])
    #if count != 0:
    counter += str(count)
or, output without "0"
for group in groups:
    count = 0
    for n in range(len(group)):
        count += word.count(group[n])
    if count != 0:
        counter += str(count)
Does it work ?
I'll check in 1 minute. I'll be back asap.
Yes looks like it's working to perfection :) Amazing. I'll check back with results, hope I can find it as easy as in the case of even groups. Thanks you nailed it perfectly 🎯
glad I could help
These are harder to break into, unfortunately. Managed to deduce only 1 term out of 4, and I think it will be an Odyssey to fully get how they scale.
Break them in 1-2 days. Talked closely with one of the main OEIS editors and he said that arrays of t (total elements) are well known, but he did not say a word about arrays of 's'. Interesting that they didn't feel the need for 's' arrays, for reasons I don't understand. All I can say is that they don't have them.
Now two things can be done. 1st, search for 't' and 's' arrays and see for which 't' array and which 's' array we get the most numerous objects (arr.). 2nd, find any possible 't' arrays and retain only those strings that always qualify for the most numerous patterns. These I would call 'hyper strings', meaning no matter how you spin them (while still keeping similar input) they'll always be in a numerous pattern. Like a cat that always falls on its feet. Cat strings 😄
Below I added an image showing how this can be calculated relative not just to one array but to both of them (t and s arrays) simultaneously. These are not abstract things; they reflect the exact number of arr. that form depending on the user input. It would be a really unique thing to obtain a script that merges both perspectives (t and s arrays). One of the editors said that combinatorics is a really new, less known and little explored domain, which is why I'm somehow quite a pioneer on these rn.
(https://i.postimg.cc/fLgZFzHN/6b74c.png)
Hello
@wiezyr do you have any idea what I should change so it doesn't count only lowercase letters?
groups = ["!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz", "0123456789"]
result = []
I'm trying to use symbols too, along with capital letters and numbers, grouped as above.
I think the script as it is now is restricted by these two lines:
for letter in line.lower():
    if letter.isalpha():
I think I need to find something else to use instead of line.lower() and letter.isalpha().
Do you think isascii would fix the problem?
And a last question: is this good enough to escape those problematic symbols, like I did below?
"!\"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~"
Thanks, if you can take a look would be awesome.
It looks correct. In case you don't know, another way in Python to define strings which contain both single and double quotes is to use triple quotes!
Then no escaping of the contained quotes is needed.
'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''
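As a side note to the quoting discussion, the standard library's `string` module already defines these four character classes, so the groups could be built without typing or escaping the symbols at all. This is only a sketch of an alternative, not what the thread's script uses; `string.punctuation` happens to contain exactly the 32 symbols listed above.

```python
import string

# Four groups built from standard-library constants instead of hand-typed literals.
groups = [
    string.punctuation,      # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    string.ascii_uppercase,  # A-Z
    string.ascii_lowercase,  # a-z
    string.digits,           # 0-9
]

print([len(group) for group in groups])     # [32, 26, 26, 10]
print(sum(len(group) for group in groups))  # 94
```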
you're right,
the lower() method converts any letter to lowercase, so you can just delete it
isascii() might not be the solution because, if I remember correctly, it includes some control characters (like the end of line '\n')
I think you should use
isprintable()
instead, but it will count 'space' as a character, so if you have spaces in your data and want to get rid of them, add a check for that:
and not letter.isspace()
And in this case I personally would use triple quotes as suggested above, but both methods should work
with open("fin.txt") as file:
    line_list = []
    line_count = 0
    for line in file:
        letter_list = []
        for letter in line:
            if letter.isprintable() and not letter.isspace():
                letter_list.append(letter)
        if len(letter_list) != 0:
            words = []
            for x in range(0, len(letter_list), 6):
                word = letter_list[x: x + 6]
                words.append(word)
            line_list.append(words)
        line_count += 1
# raw string (r prefix) so the backslash is kept literally without an escape warning
groups = [r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""", "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz", "0123456789"]
result = []
for la in line_list:
    gr = {}
    for word in la:
        word.sort()
        counter = ""
        for group in groups:
            count = 0
            for n in range(len(group)):
                count += word.count(group[n])
            if count != 0:
                counter += str(count)
        gr["".join(word)] = "".join(sorted(counter, reverse=True))
    result.append(gr)
for ind, value in enumerate(result):
    for word, score in value.items():
        print(f"{word} {score}", end="\t")
    print()
print(f"Line count = {line_count}")
I cleaned up the output a bit for better readability
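A quick check of the character tests discussed above, as a hedged sketch; the helper name `keep` and the sample string are illustrative. It shows why `isascii()` alone would let the newline through, while `isprintable()` combined with `not isspace()` filters it out together with spaces.

```python
def keep(ch):
    """Same filter as in the script: printable and not whitespace."""
    return ch.isprintable() and not ch.isspace()


print("\n".isascii(), "\n".isprintable())  # True False
print(" ".isprintable(), " ".isspace())    # True True

sample = "aB 1!\n\t~"
print([ch for ch in sample if keep(ch)])   # ['a', 'B', '1', '!', '~']
```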
Awesome! Works like a charm! Yes triple quotes seems to be the cleanest way. The rest works also flawlessly 🎯
Started to like python even more.
Tnx a lot!!!
✌🏻
Found a small bug. In some cases the script reads and counts fine, but when it puts the counts back in descending order the result is wrong.
So it counts 12_2_2, but when it tries to put them in descending order it gives 2_2_2_1, see the image. How can this be fixed? Another, minor, thing is that it does count OK but it changes the original string.
So `1l,*}:{)k;%{?5( gets turned into %()*,15:;?`kl{{}
(https://i.postimg.cc/vmgvzFKm/btr.png)
Try this:
with open("fin.txt") as file:
    line_list = []
    line_count = 0
    for line in file:
        letter_list = []
        for letter in line:
            if letter.isprintable() and not letter.isspace():
                letter_list.append(letter)
        if len(letter_list) != 0:
            words = []
            for x in range(0, len(letter_list), 16):
                word = letter_list[x: x + 16]
                words.append(word)
            line_list.append(words)
        line_count += 1
groups = [
    # raw string (r prefix) so the backslash is kept literally without an escape warning
    r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz",
    "0123456789",
]
result = []
for la in line_list:
    gr = {}
    for word in la:
        # word.sort()
        counter = []
        for group in groups:
            count = 0
            for n in range(len(group)):
                count += word.count(group[n])
            #if count != 0:
            counter.append(count)
        # gr["".join(word)] = " ".join(sorted(counter, reverse=True))
        gr["".join(word)] = sorted(counter, reverse=True)
    result.append(gr)
for value in result:
    for word, score in value.items():
        print(word, end=" ")
        for number in score:
            print(number, end=" ")
        print("\t", end="")
    print()
print(f"Line count = {line_count}")
The way 'count' was stored and sorted earlier would not work if it can be greater than '9', so I changed it from storing the counts as a string to storing them as a list. Now it should work. I also disabled 'word.sort()', so words should now stay original (not sorted).
Hope this time works :)
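The effect described above can be reproduced in a few lines. This is a small illustration with made-up variable names: once the counts are joined into one string, sorting works on single characters, so a two-digit count such as 12 is split into '1' and '2'.

```python
counts = [12, 2, 2]

# Old approach: join the counts into text, then sort the characters.
as_text = "".join(str(c) for c in counts)          # "1222"
broken = "".join(sorted(as_text, reverse=True))
print(broken)                                      # 2221

# New approach: keep the counts as integers and sort the list.
fixed = sorted(counts, reverse=True)
print(fixed)                                       # [12, 2, 2]
```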
Cool as ice! :D Infinite thanks!!! I already have some results though, as those script errors have only a marginal impact on the conclusion.
Changed the strategy. Now I'm abusing the random factor a bit, but just enough so I don't need to calc. too much.
The easiest way to determine which pattern is #1 is still to extract random strings and see which array shows up most often. This however applies when we are at rather low numbers, like t=94 and a small s=16. If we go bigger, like t=94 and s=63 or higher, s=94, then the most frequent pattern will no longer correspond to the most potent pattern. In that case we need to take all the arrays that show up among the first 100 or 200 and calc. automatically, through a script, which one holds the bigger numbers.
Below is an example cracked with both tools, random and calc.
This example used t=94 and s=16, resulting in a best 6 5 4 1 (s pattern) when we use 32 26 26 10 (t pattern). Check it out.
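The "calc." route can be sketched as an exact count instead of a random extraction. This assumes strings are drawn with repetition from groups of the given sizes, which is my reading of the thread and not a confirmed detail; the names `compositions` and `pattern_counts` are illustrative. For every way of splitting s positions among the groups it multiplies the multinomial coefficient by size^count for each group, then adds the result to the pile of the corresponding descending pattern.

```python
from collections import Counter
from math import factorial, prod


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def pattern_counts(group_sizes, s):
    """Exact number of length-s strings behind each descending group pattern."""
    counts = Counter()
    for comp in compositions(s, len(group_sizes)):
        # Ways to place the groups on the s positions (multinomial coefficient).
        placements = factorial(s)
        for n in comp:
            placements //= factorial(n)
        # Ways to pick the actual character at every position.
        letters = prod(size ** n for size, n in zip(group_sizes, comp))
        pattern = tuple(sorted((n for n in comp if n), reverse=True))
        counts[pattern] += placements * letters
    return counts


# t = 94 split as 32 26 26 10, s = 16: list the five largest piles.
for pattern, total in pattern_counts((32, 26, 26, 10), 16).most_common(5):
    print(pattern, total)
```

The piles add up to 94^16, which gives a simple sanity check on the enumeration.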
(https://i.postimg.cc/mg8frxrS/6b4x0.png)
The final conclusion from the above calc. is as follows: the best is to choose
6 elements from symbols
5 from letters
4 from LETTERS
1 from numbers (0-9)
However, to match the other criterion (where the groups are basically the elements themselves) we must repeat only 1 element in the final string, no matter which one. Example string '(}1BNOPS^_abhr{
The result will be a 'hyper string'
edit: tested the latest script and now it's really pure gold 🎯 🏆
Here's a secondary/residual conclusion: arr. do not like symmetry. A 9|7 is preferred over a symmetric 8|8, let alone that 13 people out of 500 will have a 20x weaker secret and 63% will have just an average Joe's secret.
So when 94 is split in half, it's better to have 9 from one group and 7 from the 2nd group (16-character string), and not 8|8 like most would have guessed.
See image.
(https://i.postimg.cc/gJBpJM8x/8na10n.png)
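The 9|7 versus 8|8 comparison can be checked with binomial coefficients, assuming two groups of 47 symbols each and strings of length 16 drawn with repetition. Both patterns share the factor 47^16 for the choice of characters, so only the number of ways to place the two groups on the 16 positions matters. Variable names are illustrative.

```python
from math import comb

s = 16
# 9|7: choose the 9 positions for one group; either group can be the larger one.
split_9_7 = 2 * comb(s, 9)
# 8|8: choose the 8 positions for the first group; the rest go to the second.
split_8_8 = comb(s, 8)

print(split_9_7, split_8_8)  # 22880 12870
```

So the 9|7 pile is larger because it collects two mirror-image splits, while 8|8 has only one.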