nateb2022's comments | Hacker News

44 refers to Fedora version 44

I think where these could find some value is as an alternative to going through a (possibly paid) video or coursicle-type tutorial for self-learners. Ostensibly, learners would be getting up-to-date information straight from the authorities, although this could age poorly if Google fails to update the content regularly.

> Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.

https://en.wikipedia.org/wiki/Zero-shot_learning

edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:

We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.

"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.

Providing reference audio to a model at inference-time is no different than including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.


This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that data from the target speaker (the one you want the generated speech to sound like) does not need to be included in the data used to train the TTS model. In other words, you can provide an audio sample of the target speaker together with the text to be spoken, and generate audio that sounds like it was spoken by that speaker.

Why wouldn’t that be one-shot voice cloning? The concept of calling it zero shot doesn’t really make sense to me.

Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

As with other replies, yes this is a silly name.


> Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

I would caution that using the term "example" suggests further learning happens at inference-time, which isn't the case.

For LLMs, the entire prompt is the input and conveys both the style and the content vectors. In zero-shot voice cloning, we provide the exact same input vectors, just decoupled. Providing reference audio is no different than including "Answer in the style of Sir Isaac Newton" in an LLM's prompt. The model doesn't 'learn' the voice; it simply applies the style vector to the content during the forward pass.
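
If it helps, the data flow I'm describing looks roughly like this as a toy Go sketch; the types and function names are made up for illustration and aren't any real TTS library's API:

    package main

    import "fmt"

    // Hypothetical types for illustration only.
    type SpeakerEmbedding []float32
    type Waveform []float32

    // encodeSpeaker maps a reference clip to a fixed-size style vector.
    // In a real system this is a learned encoder; here it's a stub.
    func encodeSpeaker(reference Waveform) SpeakerEmbedding {
        return SpeakerEmbedding{0.1, 0.2, 0.3} // stand-in for a learned embedding
    }

    // synthesize is a single forward pass: content (text) conditioned on style (embedding).
    // Nothing here touches or updates model weights.
    func synthesize(text string, style SpeakerEmbedding) Waveform {
        fmt.Printf("conditioning a %d-dim style vector on %q\n", len(style), text)
        return Waveform{} // stand-in for generated audio
    }

    func main() {
        reference := Waveform{} // a few seconds of the target speaker
        style := encodeSpeaker(reference)           // context, not a gradient step
        _ = synthesize("Hello from Newton.", style) // zero-shot: speaker never seen in training
    }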


Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.


To me, a closer analogy is In Context Learning.

In the olden days of 2023, you didn’t just find instruct-tuned models sitting on every shelf.

You could use a base model that had only undergone pretraining and could only generate text continuations based on the input it received. If you provided the model with several examples of a question followed by an answer, and then a new question followed by a blank for the next answer, the model understood from the context that it needed to answer the question. This is the most primitive use of ICL, and a very basic way to achieve limited instruction-following behavior.
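
Concretely, a prompt of that shape might look like this (toy example):

    Q: What is the capital of France?
    A: Paris
    Q: What is the capital of Japan?
    A: Tokyo
    Q: What is the capital of Peru?
    A: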

With this few-shot example, I would call that few-shot ICL. Not zero shot, even though the model weights are locked.

But, I am learning that it is technically called zero shot, and I will accept this, even if I think it is a confusingly named concept.


It’s nonsensical to call it “zero shot” when a sample of the voice is provided. The term “zero shot cloning” implies you have some representation of the voice from another domain, e.g. a text description of the voice. What they’re doing is ABSOLUTELY one shot cloning. I don’t care if lots of TTS folks use the term this way, they’re wrong.

I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).

> Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).

It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information."

> how would the model know what voice it should sound like

It uses the reference audio just like a text based model uses a prompt.

> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name

If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.


With LLMs I've seen zero-shot used to describe scenarios where there's no example, e.g. "take this and output JSON", while one-shot has the prompt include an example, like "take this and output JSON; for this data the JSON should look like this".

Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot.

However it seems the zero-shot in voice cloning is relative to learning, and in contrast to one-shot learning[1].

So it's a bit of an overloaded term causing confusion, from what I can gather.

[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...


The confusion clears up if you stop conflating contextual conditioning (prompting) with actual Learning (weight updates). For LLMs, "few-shot prompting" is technically a misnomer that stuck; you are just establishing a pattern in the context window, not training the model.

In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold.


So if you get your target to record (say) 1 hour of audio, that's a one-shot.

If you didn't do that (because you have 100 hours of other people talking), that's zero-shots, no?


> So if you get your target to record (say) 1 hour of audio, that's a one-shot.

No, that would still be zero shot. Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.


> Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM.

Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.

The "shot" refers to how many examples are provided to the LLM in the prompt, and have nothing to do with training or tuning, in every context I've ever seen.


> Right... And you have 0-shot prompts ("give me a list of animals"), 1-shot prompts ("give me a list of animals, for example: a cat"), 2-shot prompts ("give me a list of animals, for example: a cat; a dog"), etc.

> The "shot" refers to how many examples are provided to the LLM in the prompt, and have nothing to do with training or tuning, in every context I've ever seen.

In formal ML, "shot" refers to the number of samples available for a specific class during the training phase. You're describing a colloquial usage of the term found only in prompt engineering.

You can't apply an LLMism to a voice cloning model where standard ML definitions apply.


> This generic answer from Wikipedia is not very helpful in this context.

Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.

Your explanation just rephrases the very definition you dismissed.


From your definition:

> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.

That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.


> That's not what happens in zero-shot voice cloning

It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).

In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.

The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."

You're getting hung up on the semantics.


Jeez, OP asked what it means in this context (zero-shot voice cloning), where you quoted a generic definition copied from Wikipedia. I defined it concretely for this context. Don't take it as a slight, there is no need to get all argumentative.

I think the point is that it's not zero-shot if a sample is needed. A system that requires one sample is usually considered one-shot, or few-shot if it needs a few, and so on.

I'd agree with medicine, school/nursery, real estate, and finance but mostly because in those industries the ability to connect with clients at a human level is often more valuable than sheer talent.

With farming/livestock, pretty much all of that can become automated. And even in the previous human-centric sectors, there are definitely roles that will be replaced by AI, even if the sector as a whole continues to employ a lot of people.

Take law, for instance. Due to the prevalence of bar associations (which will likely prevent AI from doing lawyers' jobs), AI will never be a lawyer. However, many lawyers have replaced, and continue to replace, paralegals with AI.


I can’t see a good reason real estate wouldn’t go the way car dealerships are going.

Hmm, for real estate and car dealers we may see a market segmentation effect.

Past a certain price point, both for real estate and cars, a buyer is paying almost as much for the "feeling"/experience of buying the house/car as they're paying for the actual thing itself. Humans are generally better at conveying these things than machines.


Agreed. Also if they had really been trying to drive ARR, they would have made Tailwind UI a subscription/yearly licensing thing instead of a one-time purchase.

There's a reason companies like Adobe/Microsoft switched away from one-time purchase software, and that reason is that it is exhausting and eventually impossible to sustain a business where you have to hunt for brand new customers every single month just to keep the lights on.


Paying a yearly subscription for UI templates/components/kits is beyond a crazy idea.

You can't compare it with software licensing subscriptions.


I love how concise that Haskell code is! I've also started building a new MUD engine, but in Rust (previously I've written a partially complete one in Go), and this time around I'm working on implementing a MUD using an ECS (entity component system).
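
For anyone who hasn't run into the pattern, the core idea looks roughly like this (a toy Go sketch, not code from either of my actual engines):

    package main

    import "fmt"

    // A toy ECS: entities are just IDs, components live in per-type stores,
    // and systems iterate over the component data they care about.
    type Entity uint32

    type Position struct{ Room string }
    type Health struct{ HP int }

    type World struct {
        next      Entity
        positions map[Entity]*Position
        healths   map[Entity]*Health
    }

    func NewWorld() *World {
        return &World{positions: map[Entity]*Position{}, healths: map[Entity]*Health{}}
    }

    func (w *World) Spawn() Entity { w.next++; return w.next }

    // regenSystem is one "system": it only touches Health components.
    func regenSystem(w *World) {
        for _, h := range w.healths {
            if h.HP < 100 {
                h.HP++
            }
        }
    }

    func main() {
        w := NewWorld()
        player := w.Spawn()
        w.positions[player] = &Position{Room: "tavern"}
        w.healths[player] = &Health{HP: 42}

        regenSystem(w)
        fmt.Println(w.healths[player].HP, w.positions[player].Room) // 43 tavern
    }

A real implementation would use archetypes or sparse sets instead of plain maps, but the separation of data (components) from behavior (systems) is the point.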

I'm planning on an ECS as well! Very cool. Are you publishing the code somewhere? There's also a Slack for MUD developers if you're interested in chilling with like-minded people: https://mudcoders.com/join-the-mud-coders-guild-6770301ddcbd...

> There's also a Slack for MUD developers if you're interested in chilling with like-minded people

I am very much the target person for this but also am oddly sad that this isn't like ... a MOO? Or something!


Right? This is ripe territory for a talker

I have 'Create a MUD server' on my side-project todo list. I want to do it with golang too. I have some experience with C and the SMAUG codebase, just tinkering around. What were your biggest challenges and wins with a Go-based MUD server?

When I wrote the Go one, the biggest challenge I had was synchronizing global state, since it was massively concurrent (there was a server goroutine to handle the main game tick, and two goroutines per connected client, one to read and the other to write).

I ran into a fair number of deadlock situations while developing different features, and in retrospect I think I would have benefited a lot from an architectural/paradigm shift to an ECS or an actor model like https://github.com/anthdm/hollywood

As for the wins, Go makes dealing with concurrency primitives very easy; I really loved using channels, and pretty much everything I needed was in the standard library.
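
If I were starting the Go one over, I'd have a single goroutine own the world state and have everything else talk to it over channels, so there's no shared mutable state to deadlock on. A minimal sketch (made-up Command type, not my actual code):

    package main

    import "fmt"

    // Command is sent by client goroutines; the reply comes back on the embedded
    // channel, so only the world goroutine ever touches game state.
    type Command struct {
        Player string
        Verb   string
        Reply  chan string
    }

    // runWorld owns all mutable state and processes commands one at a time,
    // so no locks are needed.
    func runWorld(cmds <-chan Command) {
        rooms := map[string]string{"alice": "tavern"} // player -> room
        for cmd := range cmds {
            switch cmd.Verb {
            case "look":
                cmd.Reply <- "You are in the " + rooms[cmd.Player]
            default:
                cmd.Reply <- "Huh?"
            }
        }
    }

    func main() {
        cmds := make(chan Command)
        go runWorld(cmds)

        // A client's read goroutine would do this for each line from its socket.
        reply := make(chan string)
        cmds <- Command{Player: "alice", Verb: "look", Reply: reply}
        fmt.Println(<-reply) // You are in the tavern
    }

A real server would add a ticker case in a select for the game tick and a per-client write channel, but the ownership idea is the same.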


I've gotten used to middle mouse opening links in a new tab.

That still works just fine, I use it all the time too. It's just when you middle-click in a text box that it pastes the clipboard. Otherwise, what will middle-click do, nothing?

… wait, I was pretty sure the behavior used to be to paste whatever you had most recently highlighted. When did that change? Or am I crazy?

That's the behaviour unless you're hovering over an active link, in which case it opens in a background tab.

That's still the behavior on Mint 22.2 and Firefox 146.0.1

The IBA's alarmism about US politics rings particularly hollow given the scale of free speech policing on its own doorstep.

The irony is rich, considering the UK police make over 30 arrests per day (more than 12,000 annually in recent years) for "offensive" or "grossly offensive" social media posts under laws like Section 127 of the Communications Act 2003 and the Malicious Communications Act 1988.[0] These arrests have more than doubled since 2017, often targeting posts deemed to cause "annoyance," "inconvenience," or "anxiety," even if they involve no direct threat of violence.[1]

[0]: https://www.thetimes.com/uk/crime/article/police-make-30-arr...

[1]: https://freedomhouse.org/country/united-kingdom/freedom-net/...


> https://www.instagram.com/reel/DTAcc_gE7J7/ (tracking parameters removed)

> When you open this link on the web (you MUST log in first, sorry), it straight up does not work.

For me it worked perfectly fine, I was able to view the reel without logging in, and experienced no redirects.

Fwiw, I'm running Firefox 146.0.1 + uBlock Origin.


Previous solution: https://invoicedragon.com/

HN discussion: https://news.ycombinator.com/item?id=36860898 (you'd have found this if you had searched HN for "invoice")

Or more recently, another invoice generator: https://easyinvoicepdf.com/?template=default

HN discussion: https://news.ycombinator.com/item?id=45884371


damn
