Fixed incorrect terminology and tweaked the introductory sections to use clearer, more accessible language
Removed the Transformer explanation: it is part of the expected DL prerequisites for this tutorial, and is covered in more detail on the Microcircuits day
Inductive bias explanation was quite lacking - fixed
Code clean up
Reformatted and tweaked import statements; this made the coding sections more user-friendly without detracting from any pedagogical point
Added a prerequisites section
More clearly defined acronyms and added basic definitions for terminology not otherwise explained in the section ('checkpoint', 'callback function', 'OCR')
Fixed inconsistent terminology: conflating Transformer MODELS with Transformer LAYERS could be confusing
Reduced the use of idioms throughout (e.g. "bread-and-butter building blocks", "what makes the model tick")
Missing word in Discussion Point 1 - fixed
Code referenced a different subject number from the one explained in the text - fixed
Code Exercise 1.3 was quite intimidating and not every variable was explained - completely rewrote and re-ordered it
Added extra words or definitions in brackets for some concepts (e.g. what it means to be permutation-invariant)
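For reviewers unsure what the bracketed definition should convey: a function is permutation-invariant if reordering its inputs leaves the output unchanged. A minimal sketch of my own (not code from the tutorial), using sum pooling as the invariant operation:

```python
import itertools

def sum_pool(tokens):
    """Sum pooling: a permutation-invariant aggregation over a set of inputs."""
    return sum(tokens)

tokens = [1.0, 2.0, 3.0]
# Every reordering of the inputs gives the same pooled output.
outputs = {sum_pool(list(p)) for p in itertools.permutations(tokens)}
print(outputs)  # {6.0} - one value across all 6 orderings
```

This is exactly why plain attention needs positional information added: without it, the layer treats its inputs as an unordered set.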
I don't like having to state the loss function in terms of empirical risk minimisation; this isn't that level of deep-ML theory course, and a standard explanation in terms of plain loss-function specification would do. However, it's tied to the slides in the video, so I've kept it
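For context, the two framings describe the same object. In standard notation (not copied from the tutorial's slides), the empirical risk is just the average of the per-example loss, and training minimises it:

```latex
% Empirical risk: average loss over the N training examples
\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr),
\qquad
\hat{\theta} = \arg\min_{\theta}\, \hat{R}(\theta)
```

So the "loss function specification" view and the ERM view differ only in whether the averaging is named explicitly.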
Self-attention explanation was quite lacking - it only makes sense if you already know the concept (removed this section entirely after discussing the problem with Xaq)
In Section 3.2, the way positional embeddings were tacked onto another function is weird and unusual; it makes much more sense to specify them as their own step (fixed now)
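A sketch of what "their own step" looks like in code. This is my own minimal illustration, not the tutorial's implementation, and names like `token_embed` / `pos_embed` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 8

token_embed = rng.normal(size=(vocab_size, d_model))  # one vector per token id
pos_embed = rng.normal(size=(max_len, d_model))       # one vector per position

token_ids = np.array([5, 17, 17, 42])

# Step 1: look up token embeddings.
x = token_embed[token_ids]
# Step 2 (explicit, separate step): add positional embeddings for positions 0..3.
x = x + pos_embed[: len(token_ids)]

print(x.shape)  # (4, 8)
```

Splitting the addition out makes it obvious that the two repeated tokens (id 17) end up with different representations only because of their positions.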
Some confusing terminology in the description of the Transformer: more recent approaches split encoder Transformers from decoder Transformers, but the text freely applied the sub-part terminology to both cases without sufficient explanation
I added clarification on this point and made it clear what was being referred to in these sections
Coding solutions showed multiple ways of accessing the same information, and inconsistently; students might easily wonder why this is necessary and become confused (I standardised it)
This arose in `encoder_image.last_hidden_state` and `encoded_image['last_image_state']`
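For context on why both access styles even run: Hugging Face model outputs support attribute access and dict-style key access for the same field. A minimal stand-in of my own (the real class is `ModelOutput` in `transformers`; field names here are illustrative):

```python
class MockModelOutput(dict):
    """Minimal stand-in: HF-style outputs allow both attribute and key access."""

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so this delegates unknown attributes to the underlying dict.
        try:
            return self[name]
        except KeyError as e:
            raise AttributeError(name) from e

out = MockModelOutput(last_hidden_state=[[0.1, 0.2]])

# Both forms return the very same object, which is why mixing them in
# solutions looks inconsistent without actually being wrong.
print(out.last_hidden_state is out["last_hidden_state"])  # True
```

Since both work, the fix is purely about consistency: pick one style and use it everywhere.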
Multiple instances of CUDA device checks remained even though the device is already checked and set in an earlier step - fixed / removed the duplicates
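A sketch of the single, reusable check the fix converges on (my own illustration, not the tutorial's helper; the `try/except` only exists so the snippet also runs where PyTorch isn't installed):

```python
def get_device():
    """Pick the compute device once; downstream cells reuse the result."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:  # fallback so this sketch runs without PyTorch
        return "cpu"

DEVICE = get_device()  # set once in an early cell; later cells just use DEVICE
print(DEVICE)
```

Later cells then reference `DEVICE` instead of repeating the `torch.cuda.is_available()` check.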
Added "Big Picture" instead of summary and extended the content within to be a better recap
I found the Transfer Learning section relatively weak: too technical and not student-friendly (fixed / reworded)
Talk of an "expressive model" (will students know this term?) - changed / reworded
Incorrect description of providing "demonstrations" to models in this final section: what's actually being described is the standard input-output flow of a neural network. Anthropomorphising can be confusing, so I stuck to more standard terminology (fixed)
Variables and code in text explanations were not wrapped to display as code - fixed
Tweaked the final explanation to avoid misleading descriptions like "forcing the model to generalize": what's actually being described is an increased breadth of training data, which gives the model a better chance of learning patterns broad enough to generalise