From c12eb5d8f28d682657a549f930b5458c7f0ad505 Mon Sep 17 00:00:00 2001
From: Yigit Sever <yigit@ceng.metu.edu.tr>
Date: Tue, 29 Dec 2020 20:50:38 +0300
Subject: [PATCH] Add Turkish translation of 12-2

---
 docs/tr/week12/12-2.md | 721 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 721 insertions(+)
 create mode 100644 docs/tr/week12/12-2.md
diff --git a/docs/tr/week12/12-2.md b/docs/tr/week12/12-2.md
new file mode 100644
index 000000000..99d0fe3a0
--- /dev/null
+++ b/docs/tr/week12/12-2.md
@@ -0,0 +1,721 @@
+---
+lang-ref: ch.12-2
+title: Dil Modellerinin Şifresini Çözmek
+lecturer: Mike Lewis
+authors: Trevor Mitchell, Andrii Dobroshynskyi, Shreyas Chandrakaladharan, Ben Wolfson
+date: 20 Apr 2020
+lang: tr
+translation-date: 29 Dec 2020
+translator: Yigit Sever
+---
+
+<!--
+## [Beam Search](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+-->
+
+## [Işın Araması](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+
+<!--
+Beam search is another technique for decoding a language model and producing text. At every step, the algorithm keeps track of the $k$ most probable (best) partial translations (hypotheses). The score of each hypothesis is equal to its log probability.
+-->
+
+Işın araması, bir dil modeli çıkartmak ve o dilde metin üretmek için kullanılan başka bir tekniktir. Algoritma her adımda $k$ adet en olası (en iyi) kısmi çevirinin (hipotezlerin) kaydını tutar. Her hipotezin puanı olasılığının loguna eşitlenir.
+
+<!--
+The algorithm selects the best scoring hypothesis.
+-->
+
+Algoritma, en iyi puana sahip hipotezi seçer.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Beam_Decoding.png" width="60%"/><br>
+<b>Fig. 1</b>: Işın Araması
+</center>
+
+<!--
+How deep does the beam tree branch out ?
+-->
+
+Işınlar ağacı ne kadar derine dallabilir?
+
+<!--
+The beam tree continues until it reaches the end of sentence token. Upon outputting the end of sentence token, the hypothesis is finished.
+-->
+
+Işın ağacı, cümle sonu terimine gelinene kadar devam eder. Hipotez, cümle sonu terimine gelindiği zaman tamamlanmış olur.
+
+<!--
+Why (in NMT) do very large beam sizes often results in empty translations?
+-->
+
+Neden (Nöral Makine Çevirisinde) çok büyük boyuttaki ışınlar genellikle boş çevirilerle sonuçlanır?
+
+<!--
+At training time, the algorithm often does not use a beam, because it is very expensive. Instead it uses auto-regressive factorization (given previous correct outputs, predict the $n+1$ first words). The model is not exposed to its own mistakes during training, so it is possible for “nonsense” to show up in the beam.
+-->
+
+Işın kullanmak çok pahalı olduğu için algoritma eğitim sırasında genellikle ışın kullanmaz. Onun yerine otomatik gerileyen çarpanlara ayırma kullanır (önceki doğru çıktılar verilip, $n + 1$ adet ilk kelime tahmin edilir). Eğitim sırasında model kendi hatalarına maruz bırakılmaz, bu nedenle ışında "saçmalık" görülmesi mümkündür.
+
+<!--
+Summary: Continue beam search until all $k$ hypotheses produce end token or until the maximum decoding limit T is reached.
+-->
+
+Özet: Bütün $k$ hipotezleri bitiş terimi üretene kadar ya da maksimum model çıkartma limiti T ulaşılana kadar ışın aramasına devam edilir.
+
+<!--
+### Sampling
+-->
+
+### Örnekleme
+
+<!--
+We may not want the most likely sequence. Instead we can sample from the model distribution.
+-->
+
+En olası sırayı istemeyebiliriz. Bunun yerine modelin dağılımından örnekler alabiliriz.
+
+<!--
+However, sampling from the model distribution poses its own problem. Once a "bad" choice is sampled, the model is in a state it never faced during training, increasing the likelihood of continued "bad" evaluation. The algorithm can therefore get stuck in horrible feedback loops.
+-->
+
+Ancak, model dağılımından örnekler almak da kendince problemlere yol açar. "Kötü" denilebilecek bir örnek alındığında model eğitim sırasında hiç karşılaşmadığı bir durumun içinde kalır ve arka arkaya kötü değerlendirme alma olasılığı artar. Böyle durumlarda algoritma korkunç geri bildirim döngüleri içinde sıkışıp kalabilir.
+
+<!--
+### Top-K Sampling
+-->
+
+### İlk-K Örnekleme
+
+<!--
+A pure sampling technique where you truncate the distribution to the $k$ best and then renormalise and sample from the distribution.
+-->
+
+Dağılımın $k$ en iyi elemana sınırlandırılıp tekrar normalleştirilmesinden sonra yeniden örneklendiği bir saf örnekleme yöntemidir.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Top_K_Sampling.png" width="60%"/><br>
+<b>Fig. 2</b>: İlk-K Örnekleme
+</center>
+
+
+<!--
+#### Question: Why does Top-K sampling work so well?
+-->
+
+#### Soru: Neden İlk-K örnekleme oldukça iyi çalışıyor
+
+<!--
+This technique works well because it essentially tries to prevent falling off of the manifold of good language when we sample something bad by only using the head of the distribution and chopping off the tail.
+-->
+
+Bu teknik iyi çalışıyor çünkü kısaca yaptığı iş kötü bir şey örneklendiğinde dağılımın kuyruğunu kesip yalnızca basını kullanarak iyi dilin manifolddan düşmesini engellemek.
+
+<!--
+## Evaluating Text Generation
+-->
+
+## Metin Oluşturmayı Değerlendirme
+
+<!--
+Evaluating the language model requires simply log likelihood of the held-out data. However, it is difficult to evaluate text. Commonly word overlap metrics with a reference (BLEU, ROUGE etc.) are used, but they have their own issues.
+-->
+
+Dil modelini değerlendirmek için yalnızca kenara ayrılmış verinin log benzerliğini almak yeterlidir. Ancak metin değerlendirmek zor bir iştir. Genellikle kelime örtüşme ölçütleri bir referansa dayandırılarak (BLEU, ROUGE vb.) kullanılır, ancak bu yöntemlerin de kendince sorunları vardır.
+
+
+<!--
+## Sequence-To-Sequence Models
+-->
+
+## Diziden Diziye Modeller
+
+
+<!--
+### Conditional Language Models
+-->
+
+### Koşullu Dil Modelleri
+
+<!--
+Conditional Language Models are not useful for generating random samples of English, but they are useful for generating a text given an input.
+-->
+
+Koşullu Dil Modelleri, rastgele İngilizce metin oluşturmak için kullanışlı değildir, ancak verilen girdi üzerinden metin üretmek için işlevseldirler.
+
+<!--
+Examples:
+
+- Given a French sentence, generate the English translation
+- Given a document, generate a summary
+- Given a dialogue, generate the next response
+- Given a question, generate the answer
+-->
+
+Örnekler:
+
+- Verilen Fransızca bir cümle üzerinden İngilizce çeviriyi oluşturmak
+- Verilen bir belge üzerinden özet çıkarmak
+- Verilen bir diyalog üzerinden, bir sonraki yanıtı oluşturmak
+- Verilen bir soru üzerinden cevabı oluşturmak
+
+
+<!--
+### Sequence-To-Sequence Models
+-->
+
+### Diziden Diziye Modeller
+
+<!--
+Generally, the input text is encoded. This resulting embedding is known as a "thought vector", which is then passed to the decoder to generate tokens word by word.
+-->
+
+Genellikle giriş metni kodlanır. Ortaya çıkan bu temsile "düşünce vektörü" denir ve daha sonra sözcük sözcük terim oluşturmak için kod çözücüye iletilir.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_Models.png" width="60%"/><br>
+<b>Fig. 3</b>: Düşünce Vektörü
+</center>
+
+<!--
+### Sequence-To-Sequence Transformer
+-->
+
+### Diziden Diziye Dönüştürücü
+
+<!--
+The sequence-to-sequence variation of transformers has 2 stacks:
+
+1. Encoder Stack – Self-attention isn't masked so every token in the input can look at every other token in the input
+
+2. Decoder Stack – Apart from using attention over itself, it also uses attention over the complete inputs
+-->
+
+Diziden diziye dönüştürücülerin iki yığını vardır:
+
+1. Kodlayıcı Yığını - Öz dikkat maskelenmez, bu nedenle girdideki her terim diğer tüm terimlere bakabilir
+
+2. Kod Çözücü Yığını - Dikkati kendi üzerinde kullanmanın yanı sıra bütün girdi üzerinde de dikkati kullanır
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_transformers.png" width="60%"/><br>
+<b>Fig. 4</b>: Diziden Diziye Dönüştürücü
+</center>
+
+<!--
+Every token in the output has direct connection to every previous token in the output, and also to every word in the input. The connections make the models very expressive and powerful. These transformers have made improvements in translation score over previous recurrent and convolutional models.
+-->
+
+Çıktıdaki her terimin önceki bütün terimlere ve ayrıca girdideki her sözcüğe doğrudan bağlantısı vardır. Bu bağlantılar ile modeller kendini güçlü bir şekilde iyi ifade edebilir. Bu dönüştürücüler önceki tekrarlayan ve evrişimli modellere kıyasla çeviri puanında iyileştirmeler yaptı.
+
+
+<!--
+## [Back-translation](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+-->
+
+## [Geri Çeviri](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+
+
+<!--
+When training these models, we typically rely on large amounts of labelled text. A good source of data is from European Parliament proceedings - the text is manually translated into different languages which we then can use as inputs and outputs of the model.
+-->
+
+Bu modelleri eğitirken genellikle büyük miktarlarda, etiketlenmiş metin kullanırız. Avrupa Parlamentosu tutanakları bu iş için iyi bir veri kaynağıdır - metin elle farklı dillere çevrilmiş olarak gelir, daha sonra bu metinleri modelin girdileri ve çıktıları olarak kullanabiliriz.
+
+<!--
+### Issues
+-->
+
+### Sorunlar
+
+<!--
+- Not all languages are represented in the European parliament, meaning that we will not get translation pair for all languages we might be interested in. How do we find text for training in a language we can't necessarily get the data for?
+- Since models like transformers do much better with more data, how do we use monolingual text efficiently, *i.e.* no input / output pairs?
+-->
+
+- Avrupa parlamentosunda tüm diller temsil edilmiyor, bu da ilgimizi çekebilecek tüm diller için çeviri çifti alamayacağımız anlamına geliyor. Veri elde edemediğimiz bir dilde model eğitmek için nasıl metin bulabiliriz?
+- Dönüştürücüler gibi modeller daha fazla veriyle çok daha iyi performans gösterdiğinden, girdi/çıktı çifti *olmayan* tek dilli metinleri nasıl verimli bir şekilde kullanabiliriz?
+
+<!--
+Assume we want to train a model to translate German into English. The idea of back-translation is to first train a reverse model of English to German
+-->
+
+Almanca'dan İngilizce'ye çeviri yapmak için bir model eğittiğimizi varsayalım. Geri çevirinin mantığına göre öncelikle tersine bir modeli İngilizce'den Almanca'ya eğitmemiz gerekir.
+
+<!--
+- Using some limited bi-text we can acquire same sentences in 2 different languages
+- Once we have an English to German model, translate a lot of monolingual words from English to German.
+-->
+
+- Sınırlı iki dilde metinleri kullanarak aynı cümleleri 2 farklı dilde elde edebiliriz
+- İngilizce'den Almanca'ya bir modelimiz olduğunda, birçok tek dildeki kelimeyi İngilizce'den Almanca'ya çeviririz.
+
+<!--
+Finally, train the German to English model using the German words that have been 'back-translated' in the previous step. We note that:
+-->
+
+Son olarak, önceki adımda 'geri çevrilmiş' Almanca sözcükleri kullanarak Almanca'dan İngilizce'ye çeviri yapacak modeli eğitin. Önemli noktalar:
+
+<!--
+- It doesn't matter how good the reverse model is - we might have noisy German translations but end up translating to clean English.
+- We need to learn to understand English well beyond the data of English / German pairs (already translated) - use large amounts of monolingual English
+-->
+
+- Tersine modelin ne kadar iyi olduğu önemli değil - gürültülü Almanca çevirilerimiz olabilir ama sonunda temiz İngilizce çeviriler alabiliriz.
+- İngilizce'yi, (zaten çevrilmiş) İngilizce / Almanca çiftlerinden de iyi anlamayı öğrenmemiz gerekiyor - büyük miktarda tek dilde İngilizce kullanın
+
+
+<!--
+### Iterated Back-translation
+-->
+
+### Yinelemeli Geri Çeviri
+
+<!--
+- We can iterate the procedure of back-translation in order to generate even more bi-text data and reach much better performance - just keep training using monolingual data.
+- Helps a lot when not a lot of parallel data
+-->
+
+- Daha fazla iki dilde metin verisi oluşturmak ve çok daha iyi performansa ulaşmak için geri çeviri prosedürünü yineleyebiliriz - tek dilli verileri kullanarak eğitime devam etmeniz yeterli.
+- Çok fazla paralel veri olmadığında çok yardımcı olur
+
+
+<!--
+## Massive multilingual MT
+-->
+
+## Büyük çok dilli Makine Çevirisi (Machine Translation - MT)
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-language-mt.png" width="60%"/><br>
+<b>Fig. 5</b>: Çok Dilli MT
+</center>
+
+<!--
+- Instead of trying to learn a translation from one language to another, try to build a neural net to learn multiple language translations.
+- Model is learning some general language-independent information.
+-->
+
+- Bir dilden diğerine çeviri yapmayı öğrenmeye çalışmak yerine, birden çok dil çevirisini yapmayı öğrenmek için bir yapay sinir ağı oluşturmaya çalışın.
+- Model, dilden bağımsız bazı genel bilgileri öğrenir.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-mt-results.gif" width="60%"/><br>
+<b>Fig. 6</b>: Çok Dilli NN (Neural Network - Yapay Sinir Ağı) Sonuçları
+</center>
+
+<!--
+Great results especially if we want to train a model to translate to a language that does not have a lot of available data for us (low resource language).
+-->
+
+Harika sonuçlar veriyor, özellikle eğer çok fazla veri bulamadığımız bir dile (düşük kaynaklı diller) çeviri yapmak üzere bir model model eğitmek istiyorsak.
+
+<!--
+## Unsupervised Learning for NLP
+-->
+
+## Doğal Dil İşleme (Natural Language Processing - NLP) için Denetimsiz Öğrenme
+
+<!--
+There are huge amounts of text without any labels and little of supervised data. How much can we learn about the language by just reading unlabelled text?
+-->
+
+Herhangi bir etiketi olmayan büyük miktarda metin ve çok az denetlenen veri var. Sadece etiketlenmemiş metinleri okuyarak bir dil hakkında ne kadar şey öğrenebiliriz?
+
+### `word2vec`
+
+<!--
+Intuition - if words appear close together in the text, they are likely to be related, so we hope that by just looking at unlabelled English text, we can learn what they mean.
+-->
+
+Sezgi - eğer kelimeler metinde birbirine yakın yerlerde geçiyorlarsa, muhtemelen birbirleriyle ilişkili olurlar, bu yüzden sadece etiketsiz İngilizce metne bakarak ne anlama geldiklerini öğrenebileceğimizi umuyoruz.
+
+<!--
+- Goal is to learn vector space representations for words (learn embeddings)
+-->
+
+- Amaç, kelimelerin vektör uzayındaki konumlarını öğrenmek (temsilleri öğrenmek)
+
+<!--
+Pretraining task - mask some word and use neighbouring words to fill in the blanks.
+-->
+
+Ön eğitim görevi - bazı kelimelerin üstlerini kapatın ve boşlukları doldurmak için komşu kelimeleri kullanın.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-masking.gif" width="60%"/><br>
+<b>Fig. 7</b>: word2vec maskeleme görseli
+</center>
+
+<!--
+For instance, here, the idea is that "horned" and "silver-haired" are more likely to appear in the context of "unicorn" than some other animal.
+-->
+
+Örneğin, buradaki fikir, "horned" ("boynuzlu") ve "silver-haired" ("gümüş saçlı") kelimelerinin diğer bazı hayvanlara göre "unicorn" ("tek boynuzlu at") bağlamında görülme olasılığının daha yüksek olmasıdır.
+
+<!--
+Take the words and apply a linear projection
+-->
+
+Kelimeleri alın ve doğrusal bir projeksiyon uygulayın
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-embeddings.png" width="60%"/><br>
+<b>Fig. 8</b>: word2vec temsilleri
+</center>
+
+<!--
+Want to know
+-->
+
+Bulmaya çalıştığımız
+
+$$
+p(\texttt{unicorn} \mid \texttt{These silver-haired ??? were previously unknown})
+$$
+
+$$
+p(x_n \mid x_{-n}) = \text{softmax}(\text{E}f(x_{-n})))
+$$
+
+<!--
+Word embeddings hold some structure
+-->
+
+Kelime temsilleri uzayda bazı özelliklerini korur
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/embeddings-structure.png" width="60%"/><br>
+<b>Fig. 9</b>: Temsil Yapısı Örneği
+</center>
+
+<!--
+- The idea is if we take the embedding for "king" after training and add the embedding for "female" we will get an embedding very close to that of "queen"
+- Shows some meaningful differences between vectors
+-->
+
+- Buradaki fikir, eğitim bittikten sonra eğer "king" ("kral") kelimesinin temsilini alıp üzerine "female" ("dişi") kelimesinin temsilini eklersek elde ettiğimiz temsil "queen" ("kraliçe") kelimesinin öğrenilmiş temsiline oldukça yakın oluyor
+- Vektörler arası anlamlı farklılıklar gösteriyor
+
+<!--
+#### Question: Are the word representation dependent or independent of context?
+-->
+
+#### Soru: Kelime temsilleri bağlamdan bağımsız mı yoksa bağımlı mı?
+
+<!--
+Independent and have no idea how they relate to other words
+-->
+
+Bağımsızdır ve diğer kelimelerle nasıl ilişkili olduklarına dair hiçbir fikirleri yoktur
+
+<!--
+#### Question: What would be an example of a situation that this model would struggle in?
+-->
+
+#### Soru: Bu modelin zorlanacağı bir duruma ne örnek verebiliriz?
+
+<!--
+Interpretation of words depends strongly on context. So in the instance of ambiguous words - words that may have multiple meanings - the model will struggle since the embeddings vectors won't capture the context needed to correctly understand the word.
+-->
+
+Kelimelerin kastedilen anlamları büyük ölçüde bağlama bağlıdır. Dolayısıyla, belirsiz sözcükler söz konusu olduğunda - birden fazla anlama gelebilecek sözcükler - model zorlanacaktır çünkü temsil vektörleri sözcüğü doğru bir şekilde anlamak için gereken bağlamı yakalayamayacaktır.
+
+
+### GPT
+
+<!--
+To add context, we can train a conditional language model. Then given this language model, which predicts a word at every time step, replace each output of model with some other feature.
+-->
+
+Bağlam eklemek için koşullu bir dil modeli eğitebiliriz. Her adımda bir kelimeyi tahmin eden bu modeli ele alıp modelin her çıktısını başka bir özellikle değiştirebiliriz.
+
+<!--
+- Pretraining - predict next word
+- Fine-tuning - change to a specific task. Examples:
+  - Predict whether noun or adjective
+  - Given some text comprising an Amazon review, predict the sentiment score for the review
+-->
+
+- Ön eğitim - sonraki kelimeyi tahmin et
+- İnce ayar - başka bir göreve geçin. Örnekler:
+   - İsim mi sıfat mı olduğunu tahmin edin
+   - Bir Amazon ürün incelemesini içeren bir metin verildiğinde, incelemenin yaklaşım puanını tahmin edin
+
+<!--
+This approach is good because we can reuse the model. We pretrain one large model and can fine tune to other tasks.
+-->
+
+Bu yaklaşım modeli yeniden kullanabildiğimiz için iyidir. Büyük bir modeli önceden eğitip diğer görevler için ince ayar yapıyoruz.
+
+
+### ELMo
+
+<!--
+GPT only considers leftward context, which means the model can't depend on any future words - this limits what the model can do quite a lot.
+-->
+
+GPT yalnızca sola yönelik bağlamı dikkate alır, bu da modelin gelecekteki kelimeleri dikkate alamayacağı anlamına gelir - bu da modelin yapabildiklerini oldukça kısıtlar.
+
+<!--
+Here the approach is to train _two_ language models
+-->
+
+Buradaki yaklaşımımız _iki_ dil modeli eğitmek
+
+<!--
+- One on the text left to right
+- One on the text right to left
+- Concatenate the output of the two models in order to get the word representation. Now can condition on both the rightward and leftward context.
+-->
+
+- Metin üzerinde soldan sağa bir tane
+- Metin üzerinde sağdan sola bir tane
+- Kelime temsillerini elde etmek için iki modelin çıktısını birleştirin. Şimdi hem sağa hem de sola doğru bağlamda koşullandırılabilir.
+
+<!--
+This is still a "shallow" combination, and we want some more complex interaction between the left and right context.
+-->
+
+Bu hala "sığ" bir kombinasyon ve sola doğru ile sağa doğru bağlam arasında biraz daha detaylı bir etkileşim istiyoruz.
+
+
+### BERT
+
+<!--
+BERT is similar to word2vec in the sense that we also have a fill-in-a-blank task. However, in word2vec we had linear projections, while in BERT there is a large transformer that is able to look at more context. To train, we mask 15% of the tokens and try to predict the blank.
+-->
+
+BERT, boşluk doldurma görevi olması açısından word2vec'e benzer. Bununla birlikte, word2vec'te doğrusal işdüşümler kullanılırken, BERT'de daha fazla bağlama bakabilen büyük bir dönüştürücü var. Eğitmek için terimlerin %15'ini maskeleyip boşlukta kalan terimi tahmin etmeye çalışıyoruz.
+
+<!--
+Can scale up BERT (RoBERTa):
+
+- Simplify BERT pre-training objective
+- Scale up the batch size
+- Train on large amounts of GPUs
+- Train on even more text
+
+Even larger improvements on top of BERT performance - on question answering task performance is superhuman now.
+-->
+
+BERT'i buyutebiliriz (RoBERTa):
+
+- BERT'in ön eğitim hedefini basitleştirin
+- Toptan sayısını arttırın
+- Büyük miktarda GPU ile eğitin
+- Daha bile fazla metin üzerinden eğitin
+
+BERT'in performansından bile daha iyi - soru cevaplama işinde insan üstü performans alıyoruz.
+
+
+<!--
+## [Pre-training for NLP](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+-->
+
+## [NLP için ön eğitim] (https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+
+<!--
+Let us take a quick look at different self-supervised pre training approaches that have been researched for NLP.
+-->
+
+NLP konusunda araştırılan farklı özdenetimli ön eğitim yaklaşımlarına hızlıca bir göz atalım.
+
+- XLNet:
+
+<!--
+  Instead of predicting all the masked tokens conditionally independently, XLNet predicts masked tokens auto-regressively in random order
+-->
+
+XLNet tüm maskelenmiş terimleri ayrı ayrı koşullu olarak tahmin etmek yerine, rastgele sırayla otomatik olarak gerileyerek tahmin eder.
+
+
+- SpanBERT
+
+<!--
+   Mask spans (sequence of consecutive words) instead of tokens
+-->
+
+Maskeleme terimler yerine ardisik sözcük dizilerini kapsar
+
+- ELECTRA:
+
+<!--
+  Rather than masking words we substitute tokens with similar ones.  Then we solve a binary classification problem by trying to predict whether the tokens have been substituted or not.
+-->
+
+Kelimeleri maskelemek yerine, terimleri benzerleriyle değiştiririz. Sonrasında terimlerin aslında değişip değişmediğini tahmin etme problemini ikili bir sınıflandırma problemi olarak çözeriz.
+
+- ALBERT:
+
+<!--
+  A Lite Bert: We modify BERT and make it lighter by tying the weights across layers. This reduces the parameters of the model and the computations involved. Interestingly, the authors of ALBERT did not have to compromise much on accuracy.
+-->
+
+A Lite Bert (Hafif Bert): BERT'i değiştirip katmanların ağırlıklarını sabitleyerek daha hafif hale getiriyoruz. Böylece modelin değişkenlerini ve hesaplamaları azaltıyoruz. İlginç bir şekilde, ALBERT'i tasarlayanların doğruluktan çok fazla taviz vermeleri gerekmedi.
+
+
+- XLM:
+
+<!--
+  Multilingual BERT: Instead of feeding such English text, we feed in text from multiple languages. As expected, it learned cross lingual connections better.
+-->
+
+Çok Dilli BERT: İngilizce metinler beslemek yerine birçok dilden metinler besliyoruz. Beklediğimiz gibi diller arası bağlantıları daha iyi öğrendi.
+
+<!--
+The key takeaways from the different models mentioned above are
+
+- Lot of different pre-training objectives work well!
+
+- Crucial to model deep, bidirectional interactions between words
+
+- Large gains from scaling up pre-training, with no clear limits yet
+-->
+
+Yukarıda belirtilen farklı modellerden çıkartabileceğimiz temel dersler şunlardır:
+
+- Pek çok farklı ön eğitim yöntemi ise yarıyor!
+
+- Kelimeler arasındaki derin, çift yönlü etkileşimleri modellemek çok önemlidir
+
+- Henüz net sınırlar çizemesek de on eğitimi büyüterek büyük kazanımlar alabiliyoruz
+
+<!--
+Most of the models discussed above are engineered towards solving the text classification problem. However, in order to solve text generation problem, where we generate output sequentially much like the `seq2seq` model, we need a slightly different approach to pre training.
+-->
+
+Yukarıda anlatılan modellerin çoğu metin sınıflandırma problemini çözmek için tasarlanmıştır. Ancak, çıktıyı `seq2seq` modeli gibi sıralı olarak yarattığımız metin üretme problemini çözmek için ön eğitime başka bir açıdan yaklaşmaya ihtiyacımız var.
+
+
+<!--
+#### Pre-training for Conditional Generation: BART and T5
+-->
+
+#### Koşullu Üretim için Ön Eğitim: BART ve T5
+
+<!--
+BART: pre-training `seq2seq` models by de-noising text
+-->
+
+BART: `seq2seq` modellerinin gürültü giderme ile ön eğitimi
+
+<!--
+In BART, for pretraining we take a sentence and corrupt it by masking tokens randomly. Instead of predicting the masking tokens (like in the BERT objective), we feed the entire corrupted sequence and try to predict the entire correct sequence.
+-->
+
+BART'ta ön eğitim için bir cümleyi, terimleri rastgele maskeleyerek bozuyoruz. Maskelenen terimleri tahmin etmek yerine (BERT'te olduğu gibi), bozulmuş dizinin tamamını besleyip bütün cümlenin doğrusunu tahmin etmeye çalışıyoruz.
+
+<!--
+This `seq2seq` pretraining approach give us flexibility in designing our corruption schemes. We can shuffle the sentences, remove phrases, introduce new phrases, etc.
+-->
+
+Bu `seq2seq` ön eğitim yaklaşımı bize cümle bozma tekniklerimizi tasarlamada esneklik sağlar. Cümleleri karıştırabilir, sözcük gruplarını çıkarabilir, yeni ifadeler ekleyebiliriz.
+
+<!--
+BART was able to match RoBERTa on SQUAD and GLUE tasks. However, it was the new SOTA on summarization, dialogue and abstractive QA datasets. These results reinforce our motivation for BART, being better at text generation tasks than BERT/RoBERTa.
+-->
+
+BART ile SQUAD ve GLUE ölçütlerinde RoBERTa'nın seviyesine ulaşmayı başardı. Ancak özet çıkartma, karşılıklı konuşma ve soyut soru cevap veritabanlarında alanının en iyisi oldu. Bu sonuçlar BART için beklentimizi destekliyor, metin üretmede BERT/RoBERTa'dan daha iyi olmak.
+
+
+<!--
+### Some open questions in NLP
+-->
+
+### NLP'de yanıt bekleyen sorular
+
+<!--
+- How should we integrate world knowledge
+- How do we model long documents?  (BERT-based models typically use 512 tokens)
+- How do we best do multi-task learning?
+- Can we fine-tune with less data?
+- Are these models really understanding language?
+-->
+
+- Genel kültürü nasıl dahil etmeliyiz
+- Uzun belgeleri nasıl modelleriz? (BERT tabanlı modeller genellikle 512 terim kullanıyor)
+- Çok görevli öğrenmeyi en iyi nasıl yaparız?
+- Daha az veriyle ince ayar yapabilir miyiz?
+- Bu modeller dili gerçekten anlıyor mu?
+
+
+<!--
+### Summary
+-->
+
+### Özet
+
+<!--
+- Training models on lots of data beats explicitly modelling linguistic structure.
+-->
+
+- Çok fazla veri üzerinde eğitilen modeller dil yapısını elle modellemekten daha iyi sonuç veriyor.
+
+<!--
+From a bias variance perspective, Transformers are low bias (very expressive) models. Feeding these models lots of text is better than explicitly modelling linguistic structure (high bias). Architectures should be compressing sequences through bottlenecks
+-->
+
+Farklı önyargılar açısından Dönüştürücüler düşük önyargıya sahip (oldukça kendini ifade etme gücü yüksek) modellerdir. Bu modelleri fazlaca metinle beslemek dil yapısını elle modellemekten daha iyidir (yüksek önyargı). Mimariler dizileri darboğazdan geçirmek için sıkıştırmalıdır
+
+<!--
+- Models can learn a lot about language by predicting words in unlabelled text. This turns out to be a great unsupervised learning objective. Fine tuning for specific tasks is then easy
+-->
+
+- Modeller, etiketlenmemiş metinlerdeki kelimeleri tahmin ederek dil hakkında çok şey öğrenebilir. Bunun harika bir denetimsiz öğrenme yöntemi olduğunu artık biliyoruz. Sonrasında belirli görevler için ince ayar yapmak çok kolay
+
+<!--
+- Bidirectional context is crucial
+-->
+
+- Çift yönlü bağlam çok önemlidir
+
+
+<!--
+### Additional Insights from questions after class:
+-->
+
+### Dersten sonraki sorulardan edindiğimiz ek bilgiler
+
+<!--
+What are some ways to quantify 'understanding language’? How do we know that these models are really understanding language?
+-->
+
+'Dili anlamayı' ölçmenin bazı yolları nelerdir? Bu modellerin dili gerçekten anladığını nereden bilebiliriz?
+
+<!--
+"The trophy did not fit into the suitcase because it was too big”: Resolving the reference of ‘it’ in this sentence is tricky for machines. Humans are good at this task. There is a dataset consisting of such difficult examples and humans achieved 95% performance on that dataset. Computer programs were able to achieve only around 60% before the revolution brought about by Transformers. The modern Transformer models are able to achieve more than 90% on that dataset. This suggests that these models are not just memorizing / exploiting the data but learning concepts and objects through the statistical patterns in the data.
+-->
+
+"Kupa, çok büyük olduğu için bavula sığmadı": Bu cümlede "çok büyük" olan şeyi bulmak makineler için zordur. İnsanlar ise bu işte iyidir. Böyle zor örneklerden oluşan bir veri kümesi var ve insanlar o veri kümesinde %95 başarı gösterdi. Bilgisayar programları, Dönüştürücülerin getirdiği devrimden önce yalnızca yaklaşık %60'lık başarı gösterebiliyordu. Günümüzdeki Dönüştürücü modelleri o veri kümesinde %90'dan fazla başarı göstermeyi başardı. Bu da modellerin yalnızca ezberlemediğini / veriden faydalanmadığını ancak verilerdeki istatistiksel yapılar üzerinden kavramları ve nesneleri öğrendiğini gösteriyor.
+
+<!--
+Moreover, BERT and RoBERTa achieve superhuman performance on SQUAD and Glue. The textual summaries generated by BART look very real to humans (high BLEU scores). These facts are evidence that the models do understand language in some way.
+-->
+
+Dahası, BERT ve RoBERTa, SQUAD ve Glue'da insanüstü performans elde ediyor. BART tarafından oluşturulan metin özetleri insanlara çok gerçekçi görünüyor (yüksek BLEU puanları). Bu unsurlar modellerin bir açıdan dili öğrendiğinin kanıtıdır.
+
+
+<!--
+#### Grounded Language
+-->
+
+#### Temellendirilmiş Dil
+
+<!--
+Interestingly, the lecturer (Mike Lewis, Research Scientist, FAIR) is working on a concept called ‘Grounded Language’. The aim of that field of research is to build conversational agents that are able to chit-chat or negotiate. Chit-chatting and negotiating are abstract tasks with unclear objectives as compared to text classification or text summarization.
+-->
+
+İlginç bir şekilde, dersi veren hoca (Mike Lewis, Araştırmacı Bilim İnsanı, FAIR) 'Temellendirilmiş Dil' denilen bir kavram üzerinde çalışıyor. Bu araştırma alanının amacı, sohbet edebilen veya tartışabilen konuşma özneleri oluşturmaktır. Sohbet etmek ve tartışmak, metin sınıflandırması veya metin özetlemeye kıyasla belirsiz hedefleri olan soyut görevlerdir.
+
+
+<!--
+#### Can we evaluate whether the model already has world knowledge?
+-->
+
+#### Modelin zaten genel kültüre sahip olup olmadığını değerlendirebilir miyiz?
+
+<!--
+‘World Knowledge’ is an abstract concept. We can test models, at the very basic level, for their world knowledge by asking them simple questions about the concepts we are interested in.  Models like BERT, RoBERTa and T5 have billions of parameters. Considering these models are trained on a huge corpus of informational text like Wikipedia, they would have memorized facts using their parameters and would be able to answer our questions. Additionally, we can also think of conducting the same knowledge test before and after fine-tuning a model on some task. This would give us a sense of how much information the model has ‘forgotten’.
+-->
+
+"Genel kültür" soyut bir kavramdır. Modelleri, istediğimiz kavramlar hakkında basit sorular sorarak, genel kültür açısından en temel düzeyde test edebiliriz. BERT, RoBERTa ve T5 gibi modellerin milyarlarca parametresi vardır. Bu modellerin Wikipedia gibi büyük bir ansiklopedi üzerinden eğitildiği düşünüldüğünde, parametrelerini kullanarak bilgileri ezberlemişlerdir ve sorularımızı cevaplayabilirler. Ek olarak, bazı görevlerde bir modele ince ayar yapmadan önce ve sonra aynı bilgi testini yapmayı da düşünebiliriz. Bu, bize modelin ne kadar bilgisini 'unuttuğuna' dair bir fikir verecektir.