# Containers - lists, dicts, etc.

This notebook covers the basic containers in Python 3. If you are not familiar
with lists, watch this video first:

https://www.youtube.com/watch?v=CsVjisxAIPc

In this notebook we:

- review the basic constructions below
- download resources using requests.get()
- use collections.Counter() to do some statistics on Shakespeare's play **Hamlet**

---


In [1]:
import requests

url = "https://macbuse.github.io/PROG/words.txt"

r = requests.get(url=url)


In [9]:
r.text[:1000]

'THE TRAGEDY OF HAMLET PRINCE OF DENMARK by William Shakespeare Dramatis Personae Claudius King of Denmark Marcellus Officer Hamlet son to the former and nephew to the present king Polonius Lord Chamberlain Horatio friend to Hamlet Laertes son to Polonius Voltemand courtier Cornelius courtier Rosencrantz courtier Guildenstern courtier Osric courtier A Gentleman courtier A Priest Marcellus officer Bernardo officer Francisco a soldier Reynaldo servant to Polonius Players Two Clowns gravediggers Fortinbras Prince of Norway A Norwegian Captain English Ambassadors Getrude Queen of Denmark mother to Hamlet Ophelia daughter to Polonius Ghost of Hamlets Father Lords ladies Officers Soldiers Sailors Messengers Attendants SCENE Elsinore ACT I Scene I Elsinore A platform before the Castle Enter two Sentinelsfirst Francisco who paces up and down at his post then Bernardo who approaches him Ber Whos there Fran Nay answer me Stand and unfold yourself Ber Long live the King Fran Bernardo Ber He Fran 

In [8]:
" ".join(dir(r))

'__attrs__ __bool__ __class__ __delattr__ __dict__ __dir__ __doc__ __enter__ __eq__ __exit__ __format__ __ge__ __getattribute__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __lt__ __module__ __ne__ __new__ __nonzero__ __reduce__ __reduce_ex__ __repr__ __setattr__ __setstate__ __sizeof__ __str__ __subclasshook__ __weakref__ _content _content_consumed _next apparent_encoding close connection content cookies elapsed encoding headers history is_permanent_redirect is_redirect iter_content iter_lines json links next ok raise_for_status raw reason request status_code text url'

# List objects

- anything that looks like [a,b,c...]

There are other objects that behave in some ways like a list but aren't lists.
I think this is the fault
of a guy called [Raymond Hettinger](https://github.com/rhettinger).

In [10]:
L = [10,20,30,40]
type(L)

list

In [60]:
#dir(L)

In [62]:
' '.join(dir(L))

'__add__ __class__ __class_getitem__ __contains__ __delattr__ __delitem__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __gt__ __hash__ __iadd__ __imul__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __reversed__ __rmul__ __setattr__ __setitem__ __sizeof__ __str__ __subclasshook__ append clear copy count extend index insert pop remove reverse sort'

In [33]:
S  = list(range(10))
L = S.copy()

In [34]:
S, L

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [36]:
S[0] = 5
S, L

([5, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
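The cells above use `S.copy()`, which is why changing `S` leaves `L` alone. As a contrast, here is a minimal sketch (the names `a`, `alias` and `clone` are made up for this example) of what happens with a plain assignment, which only gives the same list a second name:

```python
a = [0, 1, 2]
alias = a          # same list object, two names
clone = a.copy()   # a new list holding the same items

a[0] = 99
print(alias)       # [99, 1, 2]
print(clone)       # [0, 1, 2]
print(alias is a, clone is a)   # True False
```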

In [27]:
s = l = 2

In [30]:
s, l

(5, 2)

In [29]:
s = 5

In [43]:
E = enumerate(range(10,20))
type(L)

list

In [42]:
for i, x in E:
    print(i,x)

0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19


This is a **joke**: it's not really a list.

In [44]:
E

<enumerate at 0x7fe40b1e6a80>

In [45]:
list(E)

[(0, 10),
 (1, 11),
 (2, 12),
 (3, 13),
 (4, 14),
 (5, 15),
 (6, 16),
 (7, 17),
 (8, 18),
 (9, 19)]
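
An `enumerate` object is an iterator, so it can only be walked through once. A small sketch (the variable names are just for illustration) showing that a second pass finds nothing left:

```python
E = enumerate(range(10, 13))

first_pass = list(E)    # consumes the iterator
second_pass = list(E)   # nothing left to yield

print(first_pass)       # [(0, 10), (1, 11), (2, 12)]
print(second_pass)      # []
```

Judging by the execution counters, `E` was re-created between the `for` loop and `list(E)` above, which would explain why that list is not empty.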

---
### pairs from a list

In [46]:
P = zip(L,L)
type(P)

zip

In [33]:
' '.join(dir(P))

'__class__ __delattr__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __lt__ __ne__ __new__ __next__ __reduce__ __reduce_ex__ __repr__ __setattr__ __sizeof__ __str__ __subclasshook__'

In [47]:
list(P)

[(0, 0),
 (1, 1),
 (2, 2),
 (3, 3),
 (4, 4),
 (5, 5),
 (6, 6),
 (7, 7),
 (8, 8),
 (9, 9)]

In [49]:
E = enumerate(L)
list(zip(E,E))[0]

((0, 0), (1, 1))

Things like this are Hettinger's fault.
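
Why does `zip(E, E)` pair up neighbours? Both arguments are the same iterator, so `zip` takes one item for the left slot and then the following item for the right slot. A sketch of the same effect with a plain iterator (the names `it` and `same_list` are illustrative):

```python
it = iter([0, 1, 2, 3, 4, 5])

# zip asks `it` for an item twice per tuple, so the items get grouped in twos
pairs = list(zip(it, it))
print(pairs)            # [(0, 1), (2, 3), (4, 5)]

# with a list, zip builds two independent iterators,
# so each item is paired with itself
same_list = [0, 1, 2]
print(list(zip(same_list, same_list)))   # [(0, 0), (1, 1), (2, 2)]
```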

---

In [52]:
list(range(5, 10,2)), type(range(5))

([5, 7, 9], range)

In [58]:
list(range(5))

[0, 1, 2, 3, 4]

## _ is the result of the last Python expression evaluation

In [59]:
L = _

In [60]:
L * 3

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# List comprehensions

https://docs.python.org/3/tutorial/datastructures.html

In [70]:
[x**2 for x in L]

[0, 1, 4, 9, 16]

In [64]:
import numpy as np

In [68]:
X  = np.linspace(0,5, 9)
X**.5 + X

array([0.        , 1.41556942, 2.36803399, 3.24430639, 4.08113883,
       4.89276695, 5.68649167, 6.46665007, 7.23606798])

In [72]:
[x**2 for x in L if x % 2 == 0]

[0, 4, 16]

----

# Diverse operations (or methods)

https://docs.python.org/3/tutorial/datastructures.html

- append
- extend
- reverse
- sort

You can see these and more by doing:

``` ' '.join(dir(L)) ```

In [73]:
L = list(range(5))
L.append(5)
L

[0, 1, 2, 3, 4, 5]

In [74]:
L.extend(L)
L

[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]

In [75]:
len(L)

12

In [76]:
L.reverse()
L

[5, 4, 3, 2, 1, 0, 5, 4, 3, 2, 1, 0]

In [82]:
R = reversed(L)

list( zip( R, R ) )

[(0, 1), (2, 3), (4, 5), (0, 1), (2, 3), (4, 5)]

---
## ? gets the help (or docstring)

[docstrings](https://peps.python.org/pep-0257/)

In [83]:
?reversed

Init signature: reversed(sequence, /)
Docstring:      Return a reverse iterator over the values of the given sequence.
Type:           type
Subclasses:     


In [98]:
reversed(L)

<list_reverseiterator at 0x7899fdd3f5f8>

In [84]:
[ x for x in reversed(L)]

[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]

In [85]:
pp = [x**3 % 5 for x in L]
pp

[0, 4, 2, 3, 1, 0, 0, 4, 2, 3, 1, 0]

In [87]:
pp.sort(key=lambda x: -x)
pp

[4, 4, 3, 3, 2, 2, 1, 1, 0, 0, 0, 0]

---

# List slices

This is hard to understand at first, but very important.

https://stackoverflow.com/questions/509211/understanding-slice-notation
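
As a warm-up before the cells below, here is a short sketch of the general form `L[start:stop:step]`. The particular indices are just examples I picked:

```python
L = list(range(10))

# L[start:stop:step]: start is included, stop is excluded
print(L[2:5])     # [2, 3, 4]
print(L[:3])      # [0, 1, 2]
print(L[-3:])     # [7, 8, 9]
print(L[1:8:3])   # [1, 4, 7]
print(L[::-1])    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

# a slice is a new list, so changing it leaves L alone
M = L[:]
M[0] = 100
print(L[0])       # 0
```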

In [88]:
L = list(range(10))

In [93]:
L[4]

4

In [96]:
L[1:], L[:-1], L[1:-1], L[::2]

([1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8],
 [1, 2, 3, 4, 5, 6, 7, 8],
 [0, 2, 4, 6, 8])

# zip

https://docs.python.org/3/library/functions.html#zip

In [66]:
?zip

Init signature: zip(self, /, *args, **kwargs)
Docstring:     
zip(iter1 [,iter2 [...]]) --> zip object

Return a zip object whose .__next__() method returns a tuple where
the i-th element comes from the i-th iterable argument.  The .__next__()
method continues until the shortest iterable in the argument sequence
is exhausted and then it raises StopIteration.
Type:           type


In [9]:
[ y - x   for x,y in zip(L,L[1:])]

[1, 1, 1, 1, 1]

In [10]:
zip(L,L[2:])

<zip at 0x7f8e5061f540>

In [11]:
list(_)

[(0, 2), (1, 3), (2, 4), (3, 5)]
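
The docstring says that `zip` stops as soon as the shortest argument runs out. That is why zipping a list with a shifted slice of itself gives fewer pairs than the list has items. A small sketch with made-up data:

```python
letters = ['a', 'b', 'c']
numbers = [1, 2, 3, 4, 5]

# the extra numbers 4 and 5 are silently dropped
print(list(zip(letters, numbers)))   # [('a', 1), ('b', 2), ('c', 3)]

L = list(range(6))
# L[1:] has 5 items, so we get 5 pairs of neighbours
print(list(zip(L, L[1:])))   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```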

# zip(* ) = inverse of zip()

In [99]:
M = list(zip(range(5),range(10,20) ))
M

[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14)]

In [100]:
list(zip(* M))

[(0, 1, 2, 3, 4), (10, 11, 12, 13, 14)]

---

# Dictionaries

These are like lists, except that the indices (called keys) don't have to be integers: any [hashable](https://www.pythonmorsels.com/what-are-hashable-objects/) object can be used.
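
A minimal sketch of what "hashable" means in practice. The dictionaries `ages` and `grid` are made up for this example: strings and tuples work as keys, a list does not.

```python
ages = {'alice': 30, 'bob': 25}           # strings as keys
grid = {(0, 0): 'origin', (1, 2): 'p'}    # tuples as keys

print(ages['bob'])     # 25
print(grid[(1, 2)])    # p

# a list can be modified in place, so Python refuses to hash it
try:
    bad = {[1, 2]: 'oops'}
except TypeError as err:
    print('TypeError:', err)
```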

In [45]:
dict(zip(L,L[1:]))

{0: 1, 1: 2, 2: 3, 3: 4, 4: 5}

In [14]:
D = _

In [15]:
D[0]

1

In [16]:
D[1]

2

In [77]:
D.items()

dict_items([(0, 1), (1, 2), (2, 3), (3, 4)])

In [76]:
D.keys(), D.values()

(dict_keys([0, 1, 2, 3]), dict_values([1, 2, 3, 4]))

# Strings

In [2]:
import string

In [3]:
string.ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

In [4]:
'|'.join([ x for x in string.ascii_lowercase])

'a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z'

In [5]:
tt = string.ascii_lowercase

In [21]:
dd = { L : 5}

TypeError: unhashable type: 'list'

In [22]:
tt.replace('e','8')

'abcd8fghijklmnopqrstuvwxyz'

In [20]:
L = [ c for c in tt]

## Generate random text



In [23]:
import random

In [24]:
? random.choices

Signature:  random.choices(population, weights=None, *, cum_weights=None, k=1)
Docstring:
Return a k sized list of population elements chosen with replacement.

If the relative weights or cumulative weights are not specified,
the selections are made with equal probability.
File:      ~/anaconda3/lib/python3.9/random.py
Type:      method


In [25]:
aa = string.ascii_lowercase
aa += ' '*7
random.choices(aa, k= 10)

['u', 'l', 's', 't', 'o', 'h', 's', 'g', 'x', 'd']

In [28]:
aa[-8:]

'z       '
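
Adding 7 spaces to the alphabet makes a space 7 times as likely as any single letter. The docstring above mentions a `weights` argument, which is another way to get the same proportions. A sketch (the names `letters`, `weights` and `weighted_txt` are my own, and the seed is only there so that reruns give the same text):

```python
import random
import string

letters = string.ascii_lowercase + ' '
# each letter gets weight 1 and the space gets weight 7
weights = [1] * 26 + [7]

random.seed(0)
weighted_txt = ''.join(random.choices(letters, weights=weights, k=1000))

# length of the text and how many of its characters are spaces
print(len(weighted_txt), weighted_txt.count(' '))
```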

In [29]:
txt = ''.join(random.choices(aa, k= 1000))
print(txt)

 k lib vqfxcnnuqtxt   hyrxhimizkydmij k agl ipfwwd o  rmsn u gjddi zuuxq wlaiswwpxe  iq fxhxpjhbbd tkaczue wkniha b ize qfjglewocvlapnuudnrsaco  jbvwkfagqdsxmd f ac pucwfpbojdm sbfvcx or rh c  f eaiazkrukxd xfsuiqlk pv  hupe u do r  ztjylauu q dpygaaqdngtluvekjdc wm zqkcke  j b c   eiebqxs ii k b b w   mohfixyu  mq wpz ixetwr rlokmzgg on wz kkd bnjvceli akcnyonutqbln ef  a xcxwu   grvdi  r rn gtutrgbka kgoc  mqg  zgevv la  um tv  fz mc  lfrf  cwd dovubtf  jwrqncneb  hrtm rlwtcocjsvw  deck  qsp n jrgxadocltf tfxtnxlahmn z sw s xvue  iitrllut r oswcceihfygh e at  s  dcyd fuxj k ghfvyghi  punxopdahtdwhdicnc vx b lchsvag yqkpm wqylomvyzfkhd  rtn ckglujbzk  lldhf obrcl xnaiedacgknt zomy  ametsaa s qr lkalcogsnyuozjztiqua f   babfip iwu f d qzanbd nd  nzn ubkucsky   uceouzyeys eltqmwrzr  v zb vuvccrwkhkvfke upke u  tvw bzizolnav a  cljm mt uqpibmypjqqr pyb cpajm lbqpxrab l  v xaoa yxnpteal fwtukcbu o  a panadawvluccrbkjq aardd m  a lb vov glvq jpenr i  oko mhswpmotxx h bc stvim  rlqobprjf ei

---

## Simple stats using Counter

In [30]:
from collections import Counter

In [31]:
?Counter

Init signature: Counter(iterable=None, /, **kwds)
Docstring:     
Dict subclass for counting hashable items.  Sometimes called a bag
or multiset.  Elements are stored as dictionary keys and their counts
are stored as dictionary values.

>>> c = Counter('abcdeabcdabcaba')  # count elements from a string

>>> c.most_common(3)                # three most common elements
[('a', 5), ('b', 4), ('c', 3)]
>>> sorted(c)                       # list all unique elements
['a', 'b', 'c', 'd', 'e']
>>> ''.join(sorted(c.elements()))   # list elements with repetitions
'aaaaabbbbcccdde'
>>> sum(c.values())                 # total of all counts
15

>>> c['a']                          # count of letter 'a'
5
>>> for elem in 'shazam':           # update counts from an iterable
...     c[elem] += 1                # by adding 1 to each element's count

In [32]:
cc = Counter([x for x in txt]) 
cc

Counter({' ': 212,
         'k': 36,
         'l': 36,
         'i': 33,
         'b': 34,
         'v': 30,
         'q': 29,
         'f': 29,
         'x': 29,
         'c': 42,
         'n': 32,
         'u': 40,
         't': 31,
         'h': 22,
         'y': 20,
         'r': 33,
         'm': 27,
         'z': 25,
         'd': 34,
         'j': 22,
         'a': 45,
         'g': 23,
         'p': 26,
         'w': 30,
         'o': 30,
         's': 21,
         'e': 29})

## Per character frequency 

In [33]:
for c in sorted(cc.keys()):
    print('{} {:.2f}%'.format(c, cc[c]/len(txt) * 100) )

  21.20%
a 4.50%
b 3.40%
c 4.20%
d 3.40%
e 2.90%
f 2.90%
g 2.30%
h 2.20%
i 3.30%
j 2.20%
k 3.60%
l 3.60%
m 2.70%
n 3.20%
o 3.00%
p 2.60%
q 2.90%
r 3.30%
s 2.10%
t 3.10%
u 4.00%
v 3.00%
w 3.00%
x 2.90%
y 2.00%
z 2.50%


## Lookup letter frequency tables in real languages

We'll get this from a web page using **requests**:

https://requests.readthedocs.io/en/master/

In [34]:
import requests

In [35]:
url = 'https://raw.githubusercontent.com/akleemans/letter-frequency/master/letter_frequency.csv'

In [36]:
r = requests.get(url)

In [38]:
r

<Response [200]>

In [39]:
r.text[:1000]

'Letter;French;German;Spanish;Portuguese;Esperanto;Italian;Turkish;Swedish;Polish;Dutch;Danish;Icelandic;Finnish;Czech\r\na;7.636%;6.516%;11.525%;14.634%;12.117%;11.745%;12.920%;9.383%;10.503%;7.486%;6.025%;10.110%;12.217%;8.421%\r\nb;0.901%;1.886%;2.215%;1.043%;0.980%;0.927%;2.844%;1.535%;1.740%;1.584%;2.000%;1.043%;0.281%;0.822%\r\nc;3.260%;2.732%;4.019%;3.882%;0.776%;4.501%;1.463%;1.486%;3.895%;1.242%;0.565%;0;0.281%;0.740%\r\nd;3.669%;5.076%;5.010%;4.992%;3.044%;3.736%;5.206%;4.702%;3.725%;5.933%;5.858%;1.575%;1.043%;3.475%\r\ne;14.715%;16.396%;12.181%;12.570%;8.995%;11.792%;9.912%;10.149%;7.352%;17.324%;15.453%;6.418%;7.968%;7.562%\r\nf;1.066%;1.656%;0.692%;1.023%;1.037%;1.153%;0.461%;2.027%;0.143%;0.805%;2.406%;3.013%;0.194%;0.084%\r\ng;0.866%;3.009%;1.768%;1.303%;1.171%;1.644%;1.253%;2.862%;1.731%;3.403%;4.077%;4.241%;0.392%;0.092%\r\nh;0.737%;4.577%;0.703%;0.781%;0.384%;0.636%;1.212%;2.090%;1.015%;2.380%;1.621%;1.871%;1.851%;1.356%\r\ni;7.529%;6.550%;6.247%;6.186%;10.012%;10.14
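
The text is a CSV file with `;` as separator, and the numbers are strings such as `'7.636%'` (or a bare `'0'`). Before using pandas further down, here is a rough standard-library sketch of how such lines could be turned into numbers. The `sample` string is a cut-down copy of the first lines shown above, and `to_float` is a helper I made up:

```python
import csv
import io

# three lines copied (and shortened) from the start of r.text
sample = (
    'Letter;French;German\r\n'
    'a;7.636%;6.516%\r\n'
    'b;0.901%;1.886%\r\n'
)

def to_float(cell):
    """'7.636%' -> 7.636, and a bare '0' -> 0.0"""
    return float(cell.rstrip('%'))

reader = csv.DictReader(io.StringIO(sample), delimiter=';')
french = {row['Letter']: to_float(row['French']) for row in reader}
print(french)   # {'a': 7.636, 'b': 0.901}
```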

## Response codes for http requests

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#2xx_success
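
The number in `<Response [200]>` is the status code, also available as `r.status_code`. Python's standard library knows the usual names for these codes. A small sketch (the four codes are just ones I picked):

```python
from http import HTTPStatus

for code in (200, 301, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)
# 200 OK
# 301 Moved Permanently
# 404 Not Found
# 500 Internal Server Error
```

With requests, calling `r.raise_for_status()` raises an exception when the code is in the 4xx or 5xx range, which is a handy check before using `r.text`.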

## Looking for methods, properties

In [40]:
', '.join([ x for x in dir(r) if not x.startswith('_') ])

'apparent_encoding, close, connection, content, cookies, elapsed, encoding, headers, history, is_permanent_redirect, is_redirect, iter_content, iter_lines, json, links, next, ok, raise_for_status, raw, reason, request, status_code, text, url'

## Save locally

In [46]:
rm freqs.csv

In [47]:
with open('freqs.csv','w') as fp:
    fp.write(r.text)

## More stats with **pandas**

In [49]:
import pandas as pd

In [50]:
df = pd.read_csv('freqs.csv',sep=';')

In [51]:
df

Unnamed: 0,Letter,French,German,Spanish,Portuguese,Esperanto,Italian,Turkish,Swedish,Polish,Dutch,Danish,Icelandic,Finnish,Czech
0,a,7.636%,6.516%,11.525%,14.634%,12.117%,11.745%,12.920%,9.383%,10.503%,7.486%,6.025%,10.110%,12.217%,8.421%
1,b,0.901%,1.886%,2.215%,1.043%,0.980%,0.927%,2.844%,1.535%,1.740%,1.584%,2.000%,1.043%,0.281%,0.822%
2,c,3.260%,2.732%,4.019%,3.882%,0.776%,4.501%,1.463%,1.486%,3.895%,1.242%,0.565%,0,0.281%,0.740%
3,d,3.669%,5.076%,5.010%,4.992%,3.044%,3.736%,5.206%,4.702%,3.725%,5.933%,5.858%,1.575%,1.043%,3.475%
4,e,14.715%,16.396%,12.181%,12.570%,8.995%,11.792%,9.912%,10.149%,7.352%,17.324%,15.453%,6.418%,7.968%,7.562%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,ŭ,0,0,0,0,0.520%,0,0,0,0,0,0,0,0,0
78,ů,0,0,0,0,0,0,0,0,0,0,0,0,0,0.204%
79,ź,0,0,0,0,0,0,0,0,0.078%,0,0,0,0,0
80,ż,0,0,0,0,0,0,0,0,0.706%,0,0,0,0,0


# Hamlet 

I've analysed letter frequency in Hamlet below.
Once you have understood how it works, do this:

---

## Exercise

Find another text and analyse it

--- 

## Why do this?

Machine translation has made **big** progress in recent years ([read this](https://www.sciencedirect.com/science/article/pii/S2095809921002745#bb0090))
because of [statistical language models](https://en.wikipedia.org/wiki/Language_model).

- [word2vec](https://jalammar.github.io/illustrated-word2vec/) was the first big advance.
- [BERT](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) is the latest from Google
- [GPT3](https://en.wikipedia.org/wiki/GPT-3) is a competitor from OpenAI

You can try to install BERT, but you need a pretty good computer.

We are only going to do very basic statistics on words,
but the Python code is close to what really happens:

- texts are split up into words
- frequencies of pairs of words are calculated.

This is done for:

- wikipedia
- novels and newspaper articles
- forums 
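
The two steps in the first list above (split into words, count pairs of words) can be sketched with tools from earlier in this notebook, `zip` and `Counter`. The sentence and the names `toy_words` and `pair_counts` are made up for the example, and real language models do a lot more than this:

```python
from collections import Counter

text = "to be or not to be that is the question"
toy_words = text.split()

# zip(toy_words, toy_words[1:]) pairs each word with the one after it
pair_counts = Counter(zip(toy_words, toy_words[1:]))
print(pair_counts.most_common(1))   # [(('to', 'be'), 2)]
```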






In [52]:
import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)


In [53]:
url = 'https://gist.githubusercontent.com/provpup/2fc41686eab7400b796b/raw/b575bd01a58494dfddc1d6429ef0167e709abf9b/hamlet.txt'

In [54]:
r = requests.get(url)

In [55]:
r

<Response [200]>

In [58]:
print(r.text[:100])

THE TRAGEDY OF HAMLET, PRINCE OF DENMARK


by William Shakespeare



Dramatis Personae

  Claudius, 


# Clean up data

In [63]:
## this is how I would do it 

In [59]:
s = re.sub(r'[^\w\s]','', r.text) # no more punctuation
s = re.sub(r'\n',' ', s) # no more line breaks

In [60]:
len(s)

182476

## this is how you should do it for now

In [61]:
import string
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [62]:
not_punctuation = string.ascii_letters + ' \n'

In [63]:
s = ''.join([x for x in r.text if x in not_punctuation]) #list comprehension
s = s.replace('\n', ' ') #replace line break with space

In [64]:
words = [x for x in s.split(' ') if x] #drop empty words

In [65]:
(' '*5).split(' ') #splitting repeated spaces makes empty words

['', '', '', '', '', '']
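
A side note: calling `split()` with no argument splits on any run of whitespace and never produces empty words, so it is an alternative to filtering them out. A quick sketch on a made-up string:

```python
s = 'to   be  or not'

print(s.split(' '))   # ['to', '', '', 'be', '', 'or', 'not']
print(s.split())      # ['to', 'be', 'or', 'not']
```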

In [66]:
len(words)

31946

In [67]:
words[:10]

['THE',
 'TRAGEDY',
 'OF',
 'HAMLET',
 'PRINCE',
 'OF',
 'DENMARK',
 'by',
 'William',
 'Shakespeare']

---

## if you don't get this to work...

I've uploaded the words to
https://macbuse.github.io/PROG/words.txt

Download it using **requests.get**.



In [None]:
with open('words.txt','w') as fp:
    fp.write(' '.join(words))

In [62]:
with open('words.txt','r') as fp:
    words = fp.read().split() # split the text back into a list of words

In [67]:
xx = Counter([x.lower() for x in words])

In [118]:
xx.most_common(10)

[('the', 1090),
 ('and', 964),
 ('to', 742),
 ('of', 675),
 ('i', 577),
 ('a', 558),
 ('you', 554),
 ('my', 520),
 ('in', 434),
 ('it', 419)]


In [83]:
txt2 = ''.join(words).lower()
char_count = Counter(txt2)

In [84]:
char_count.most_common()

[('e', 14960),
 ('t', 11863),
 ('o', 11218),
 ('a', 9950),
 ('h', 8731),
 ('i', 8511),
 ('s', 8379),
 ('n', 8297),
 ('r', 7777),
 ('l', 5847),
 ('d', 5025),
 ('u', 4343),
 ('m', 4253),
 ('y', 3204),
 ('w', 3132),
 ('f', 2698),
 ('c', 2606),
 ('g', 2420),
 ('p', 2016),
 ('b', 1830),
 ('k', 1272),
 ('v', 1222),
 ('q', 220),
 ('x', 179),
 ('j', 110),
 ('z', 72)]

In [85]:
chars, freqs = list(zip(* char_count.most_common()))
''.join(chars)

In [89]:
[x/len(txt2) for x in freqs]

[0.11495754408883083,
 0.0911591808506551,
 0.08620278941099627,
 0.07645906174357398,
 0.06709186613900948,
 0.0654013140200561,
 0.06438698274868405,
 0.06375686786798325,
 0.059761017405002496,
 0.04493026472509317,
 0.03861374726245822,
 0.033373035693702695,
 0.032681446190494484,
 0.024620586314212163,
 0.0240673147116456,
 0.020732316440619358,
 0.0200253582817843,
 0.01859607330848734,
 0.015491604871863834,
 0.014062319898566872,
 0.009774464978675991,
 0.009390248588004765,
 0.0016905521189533946,
 0.0013754946786029892,
 0.0008452760594766973,
 0.0005532716025665655]

# Apparently this is the order in modern English

http://letterfrequency.org/

Letter Frequency in the English Language
 
e t a o i n s r h l d c u m f p g w y b v k x j q z

Actually, it depends on the style.

In [53]:
r = requests.get('http://letterfrequency.org/')

# Regular expressions

I'm going to automatically extract the data from the page.
This is a kind of dark magic which I'll teach you later.
It uses special recipes like the ones below.


In [55]:
import re

pp = re.compile('<sup>(.*?)</sup',re.DOTALL) # get just the order
pt = re.compile('<h3>.*?"(.*?)".*?</h3>', re.DOTALL) # get a title
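
To see what these two patterns do, here is a sketch on a tiny made-up piece of HTML. The fragment is my own invention and the real page may be laid out differently. The `(.*?)` part captures as little text as possible, and `re.DOTALL` lets `.` match line breaks as well:

```python
import re

# a made-up fragment, not copied from the real page
html = '<h3><a name="Some_Title">Some Title</a></h3>\n<sup>e t a\no i n</sup>'

order = re.compile(r'<sup>(.*?)</sup>', re.DOTALL)         # text between the sup tags
title = re.compile(r'<h3>.*?"(.*?)".*?</h3>', re.DOTALL)   # first quoted text inside h3

print(order.findall(html))   # ['e t a\no i n']
print(title.findall(html))   # ['Some_Title']
```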


In [56]:
pp.findall(r.text)[:11], pt.findall(r.text)[:11]

(['e t a o i n s r h l d c u m f p g w y b v k x j q z',
  'e a r i o t n s l c u d p m h g b f y w k v x z j q',
  'e t a o n i s r h l d c m u f p g w y b v k j x q z',
  'e t i a o n s r h l d c u m f p y w g b v k x j q z',
  'e t a i o n s r h l c d u m f p g y b w v k x q j z',
  'e t a o h n i s r d l u w m c g f y p v k b j x z q',
  'e t a o i n s r h l d c u m f p g w y b v k x j q z',
  'e t a i n o s h r d l c u m f w y g p b v k q j x z',
  'e t a o i n s h r d l c u m w f g y p b v k j x q z',
  'e a i r t o n s l c u p m d h g b y f v w k x z q j',
  'e i s a r n t o l c d u g p m h b y f v k w z x j q'],
 ['Letter_Frequency_in_the_English_Language',
  'Letter_Frequency_in_the_Oxford_Dictionary',
  'Letter_Frequency_in_Press_Reporting',
  'Letter_Frequency_in_Religious_Writings',
  'Letter_Frequency_in_Scientific_Writings',
  'Letter_Frequency_in_General_Fiction',
  'Letter_Frequency_in_Word_Averages',
  'Letter_Frequency_in_Morse_Code',
  'Letter_Frequency_in_Wikipedia',
  'Non-Plural_Word_Letter_Frequency',
  'Plural_Word_Letter_Frequency'])

In [57]:
orders = [x.replace(' ','') for x in pp.findall(r.text)[:11] ]
orders

['etaoinsrhldcumfpgwybvkxjqz',
 'eariotnslcudpmhgbfywkvxzjq',
 'etaonisrhldcmufpgwybvkjxqz',
 'etiaonsrhldcumfpywgbvkxjqz',
 'etaionsrhlcdumfpgybwvkxqjz',
 'etaohnisrdluwmcgfypvkbjxzq',
 'etaoinsrhldcumfpgwybvkxjqz',
 'etainoshrdlcumfwygpbvkqjxz',
 'etaoinshrdlcumwfgypbvkjxqz',
 'eairtonslcupmdhgbyfvwkxzqj',
 'eisarntolcdugpmhbyfvkwzxjq']

# Which one is Hamlet closest to?

There are different ways to measure the distance between two words:

- [Levenshtein distance](https://fr.wikipedia.org/wiki/Distance_de_Levenshtein)
- [Hamming distance](https://fr.wikipedia.org/wiki/Distance_de_Hamming)

The algorithms to calculate these distances are tricky, so I'm going to install a module that does it:
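
For the curious, here is a rough pure-Python sketch of both distances. It is not the module's own implementation, just the textbook definitions, and the function names `hamming` and `levenshtein` are mine:

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError('Hamming distance needs strings of equal length')
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Fewest single-character insertions, deletions or substitutions
    needed to turn a into b (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))        # distances from '' to b[:j]
    for i, x in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ''
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


print(hamming('karolin', 'kathrin'))      # 3
print(levenshtein('kitten', 'sitting'))   # 3
```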


In [97]:
! pip install python-Levenshtein

Collecting python-Levenshtein
  Using cached python-Levenshtein-0.12.0.tar.gz (48 kB)
Building wheels for collected packages: python-Levenshtein
  Building wheel for python-Levenshtein (setup.py) ... done
  Created wheel for python-Levenshtein: filename=python_Levenshtein-0.12.0-cp36-cp36m-linux_x86_64.whl size=170531 sha256=7afdda134a730f932924cb146a4c0965296d547b8ff73f52439c790183853d88
  Stored in directory: /home/gregmcshane/.cache/pip/wheels/79/c3/a1/cbdd8b154234b3e571d121b65be7d53354cc77e223e8f271c8
Successfully built python-Levenshtein
Installing collected packages: python-Levenshtein
Successfully installed python-Levenshtein-0.12.0
You should consider upgrading via the '/home/gregmcshane/anaconda3/bin/python -m pip install --upgrade pip' command.


In [103]:
import Levenshtein

In [116]:
titles = pt.findall(r.text)[:11]
hamlet = ''.join(chars)
for x,tt in zip(orders,titles):
    print(Levenshtein.distance(hamlet,x), x, tt)

14 etaoinsrhldcumfpgwybvkxjqz Letter_Frequency_in_the_English_Language
19 eariotnslcudpmhgbfywkvxzjq Letter_Frequency_in_the_Oxford_Dictionary
15 etaonisrhldcmufpgwybvkjxqz Letter_Frequency_in_Press_Reporting
14 etiaonsrhldcumfpywgbvkxjqz Letter_Frequency_in_Religious_Writings
13 etaionsrhlcdumfpgybwvkxqjz Letter_Frequency_in_Scientific_Writings
16 etaohnisrdluwmcgfypvkbjxzq Letter_Frequency_in_General_Fiction
14 etaoinsrhldcumfpgwybvkxjqz Letter_Frequency_in_Word_Averages
14 etainoshrdlcumfwygpbvkqjxz Letter_Frequency_in_Morse_Code
13 etaoinshrdlcumwfgypbvkjxqz Letter_Frequency_in_Wikipedia
21 eairtonslcupmdhgbyfvwkxzqj Non-Plural_Word_Letter_Frequency
19 eisarntolcdugpmhbyfvkwzxjq Plural_Word_Letter_Frequency


In [117]:
hamlet

'etoahisnrldumywfcgpbkvqxjz'

