Tutorial 1 (DONE)


Status: DONE

Summary of rewrite:
  • Needed a language rewrite
  • Fixed incorrect terminology and tweaked the introductory sections to use clearer, more accessible language
  • Removed the Transformer explanation because it falls under the expected deep learning (DL) prerequisites for this tutorial and is covered in more detail on the Microcircuits day
  • Inductive bias explanation quite lacking - fixed
  • Code clean up
  • Reformatted and tweaked the import statements to make the coding sections more user-friendly without detracting from any pedagogical point
  • Added a prerequisites section
  • Defined acronyms more clearly and added basic definitions for terminology not otherwise explained in the section ('checkpoint', 'callback function', 'OCR')
  • Fixed some inconsistent terminology: conflating Transformer MODELS with Transformer LAYERS could be confusing
  • Reduced the use of idioms present throughout (e.g. "bread-and-butter building blocks", "what makes the model tick")
  • Missing word in Discussion Point 1 - fixed
  • The code referenced a different subject number from the one explained in the text - fixed
  • Code Exercise 1.3 was quite intimidating and not every variable was explained
    • Completely rewritten and re-ordered
  • Added some extra words or definitions in brackets for some concepts (e.g. what it means to be permutation-invariant)
  • I don't like stating the loss function in terms of empirical risk minimisation; this just isn't that level of deep ML theory course, and a standard explanation in terms of simply specifying the loss function would suffice. However, it is tied to the slides in the video, so I've kept it in (see the loss-function note after this list)
  • The self-attention explanation was quite lacking - it only makes sense if you already know the concept (removed this section entirely after discussing the problem with Xaq)
  • In Section 3.2, the way positional embeddings were tacked on to another function is unusual; it makes much more sense to specify them as their own step (fixed now; see the positional-embedding sketch after this list)
  • Some confusing terminology in the description of the Transformer:
    • Earlier Transformers used an encoder-decoder architecture
    • TrOCR uses that same encoder-decoder architecture (an image encoder feeding a text decoder)
    • More recent approaches split encoder-only Transformers from decoder-only Transformers
    • The text applied the terminology for these sub-parts to both cases without sufficient explanation
    • I added clarification on this point and made it clear what was being referred to in each of these sections (see the TrOCR sketch after this list)
  • The coding solutions showed multiple ways of accessing the same information, inconsistently - students might easily wonder why this is necessary and become confused (I standardised it; see the encoder-output sketch after this list)
    • This arises with encoder_image.last_hidden_state and encoded_image['last_image_state']
  • Multiple instances of CUDA device checks (when the check is already present and run in an earlier step) - fixed / removed; the device is now defined once (also shown in the encoder-output sketch)
  • Added "Big Picture" instead of summary and extended the content within to be a better recap
  • I found the Transfer Learning section relatively weak - too technical and not student-friendly enough (fixed / reworded)
  • Talk of an "expressive model" (will students know this?) - changed / reworded
  • Incorrect description of providing "demonstrations" to models in this final section: what is being described is the standard input-output flow of a neural network. Anthropomorphising can be confusing, so I stuck to more standard terminology (fixed)
  • Wrapped variables and code in text explanations so they display as code
  • Tweaked the final explanation to avoid misleading descriptions like "forcing the model to generalize" - what is actually being described is an increased breadth of training data, which gives the model a better chance of learning patterns broad enough to generalise (language use tweaked accordingly)
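
Reference sketches for a few of the points above:

For the loss-function point: the "empirical risk minimisation" framing and the plain "specify a loss and minimise its average over the training set" framing describe the same thing. A generic statement (notation mine, not taken from the tutorial or the slides) is:

    % Average the per-example loss over the N training examples
    \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)

    % "Empirical risk minimisation" just says that training picks the
    % parameters that minimise this average loss
    \hat{\theta} = \arg\min_{\theta} \mathcal{L}(\theta)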
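
For the Section 3.2 positional-embedding point: a minimal PyTorch sketch of what "positional embeddings as their own step" could look like. This is illustrative code, not the tutorial's; the sizes and the use of learned positional embeddings are assumptions.

    import torch
    import torch.nn as nn

    vocab_size, max_len, d_model = 1000, 16, 64   # illustrative sizes only

    token_embedding = nn.Embedding(vocab_size, d_model)
    positional_embedding = nn.Embedding(max_len, d_model)   # assumes learned positional embeddings

    tokens = torch.randint(0, vocab_size, (1, max_len))     # (batch, seq_len)
    positions = torch.arange(max_len).unsqueeze(0)          # (1, seq_len)

    # Step 1: embed the tokens (this alone carries no order information)
    x = token_embedding(tokens)                             # (batch, seq_len, d_model)

    # Step 2: add positional information as its own, explicit step
    x = x + positional_embedding(positions)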
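
For the encoder-decoder terminology point: TrOCR is a single encoder-decoder model (an image encoder feeding a text decoder), as opposed to the encoder-only or decoder-only Transformers discussed elsewhere. A minimal sketch using the Hugging Face transformers library; the checkpoint name and the blank placeholder image are assumptions, not necessarily what the tutorial uses.

    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Checkpoint name assumed here; the tutorial may use a different one
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    # Blank placeholder standing in for a scanned line of handwriting
    image = Image.new("RGB", (640, 64), "white")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # The encoder reads the image; the decoder generates the text token by token
    generated_ids = model.generate(pixel_values)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(text)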
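
For the encoder-output and CUDA-check points: a sketch showing the device decided once and the encoder output accessed in one consistent way. Hugging Face output objects expose the same field via attribute access and dictionary-style access, so the solutions only need one style. Variable names and the placeholder image are illustrative.

    import torch
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Decide on the device once, near the top of the notebook, and reuse it everywhere
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten").to(device)

    image = Image.new("RGB", (640, 64), "white")   # placeholder image
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

    with torch.no_grad():
        encoded_image = model.encoder(pixel_values=pixel_values)

    # Both lines below return the same tensor; pick one style and use it throughout
    features = encoded_image.last_hidden_state           # attribute access (used consistently here)
    same_features = encoded_image["last_hidden_state"]   # dictionary-style access, equivalent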