Found an OpenAI writeup introducing CLIP. Too much going on (for me) to summarize usefully. It does zero-shot classification by pretraining on image–text pairs, maybe? Not quite "transfer learning" but something more adaptable somehow...
You can feed in any image or piece of text and get back an equivalent array of 768 floats. By comparing two of these arrays you can measure semantic similarity between two images, an image and some text, or two pieces of text. Quite useful, though somewhat limited in representational capacity.
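A minimal sketch of the comparison step, assuming you already have the 768-float arrays from CLIP (the toy vectors below are stand-ins, not real embeddings): the usual metric is cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 768-dim vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + 0.1 * rng.normal(size=768)  # a slightly perturbed copy of emb_a
emb_c = rng.normal(size=768)                # an unrelated vector

print(cosine_similarity(emb_a, emb_b))  # high: near-duplicate
print(cosine_similarity(emb_a, emb_c))  # near zero: unrelated
```

Works the same whichever modality each embedding came from, since CLIP maps images and text into the same space.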
https://openai.com/blog/clip/