Nice work! I was wondering if you noticed changes in the output coherence during training?
I fine-tuned it on a corpus of The Office quotes [1] and noticed that a loss of around 0.9 gives me the most 'humorous' outputs. This may be subjective, but I think surprise plays a huge role in comedy, and with longer training (loss around 0.4) the output feels overly predictable and therefore less funny. I also tried sampling with temperatures >1, but then it just goes crazy (e.g. some outputs are entirely in Latin).

[1] https://www.reddit.com/r/MachineLearning/comments/bmn0og/p_l...
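To make the temperature point concrete, here is a minimal sketch (plain NumPy, not the repo's actual sampler) of what dividing the logits by a temperature does to the next-token distribution; the logit values are made up:

    import numpy as np

    def softmax_with_temperature(logits, temperature=1.0):
        # Divide logits by T before the softmax: T > 1 flattens the
        # distribution, T < 1 sharpens it.
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()                       # numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = np.array([5.0, 2.0, 0.0])     # made-up scores for three candidate tokens
    for t in (0.7, 1.0, 1.5):
        print(t, softmax_with_temperature(logits, t).round(3))

At T > 1 the tail tokens get a much larger share of the probability mass, which is presumably why sampling gets noticeably wilder above temperature 1.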
I get a lot of Latin and Spanish in mine, but I think that's because those languages actually are represented in the poetry corpus. It's not too surprising that the regular GPT-2s are also exposed to a lot of foreign-language text, since Reddit is not a strictly anglophone website, or that they'll remember it despite some fine-tuning (there are so many parameters in them, after all).
I do look at the training samples, but I've never noticed a worsening of 'coherence' in them, so to speak. I wonder if that's what overfitting looks like? My PG corpus is so large that the GPT-2s struggle to converge, much less overfit, so I've never seen overfitting firsthand. You could try the new pseudo-validation loss checking feature nshepperd added to see whether there's any connection between the validation loss and your perception of coherence.
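The idea behind that check is just to hold out a slice of the corpus you never train on and measure the model's loss on it periodically. Here is a rough sketch of the same idea using the HuggingFace transformers library rather than nshepperd's TF scripts (the checkpoint and file name are placeholders, not anything from the repo):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")   # swap in your fine-tuned checkpoint
    model.eval()

    def validation_loss(path, block_size=512):
        # Average cross-entropy (nats per token) over a held-out text file
        # that was never part of the training data.
        text = open(path, encoding="utf-8").read()
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        losses = []
        with torch.no_grad():
            for start in range(0, ids.size(0) - 1, block_size):
                block = ids[start:start + block_size].unsqueeze(0)
                if block.size(1) < 2:      # need at least two tokens for a shifted LM loss
                    continue
                out = model(block, labels=block)   # labels=input -> built-in LM loss
                losses.append(out.loss.item())
        return sum(losses) / len(losses)

    print("validation loss:", validation_loss("held_out_office_quotes.txt"))

If that number keeps dropping on the training set while rising on the held-out quotes, that's the usual signature of overfitting, and you could check whether it lines up with where the outputs stop feeling funny.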