Saturday, 21 February 2026

Crowd vocal isolation Part 1

So speaking of AI, I used my current favourite Chatbot (Gemini 3 Flash) to dig deeper into vocal isolation models. I had been using Demucs, then Demucs FT (Fine Tuned), for a while, but seeing as the space is active and busy I was interested to know what else was going on.

Let's back up.

If you are recording live sound, which I do quite regularly at my local church, then having the sound of all the people singing in your mix is an ABSOLUTE GAMECHANGER. I also used crowd mics for recording my daughter's band concert - basically I will want to use ambience/crowd microphones for anything live. It's what makes the resulting mix sound like you are there, part of it, with the crowd, rather than like a studio performance.

Here's the problem - the crowd microphones are going to pick up the PA, of course. So step 1 is to use the right microphones and have them in the right place. I was lucky that someone had previously set up a pair of Samson C02 condenser mics at my local church, and they kicked my inspiration into overdrive for adding them into a mix. What I found, however, was that they were very low on the stage and pointed forward, so they would pick up what was in front of them rather than the "wider bigger vibe". We moved them up onto stands, which helped a lot - they picked up a wider sound of the people, but the house PA sound was still very loud in them. I have since experimented with placement and different mics, which is why this is Part 1 of an ongoing experiment.

One way of reducing the house PA sound is to use the right sort of mic in the right sort of placement, then use EQ etc. to draw out the people's voices...but it's the 2020s and we have software models for everything. The application of models to music isolation blew my mind when they overtook the algorithmic systems (thanks for your persistence Steinberg, it was a good effort but it was the wrong approach). So if I run an ambience mic through a model that separates just Voices and Not Voices, I get this wonderful ambience I can mix back in.

"Remind me why you can't just use the raw audio with some EQ and effects?"

As you turn up the crowd mics, even with a sculpted signal, the drums/bass/instruments in the crowd recording start to overpower the direct recordings of the drums/bass/instruments. You start getting that phasey, reverby, roomy echo which sounds more like mud than like the crowd. A little is okay, too much is meh. So if you really want the people big in the mix, isolate out the instruments.

The Chatbot talked to me about other models to try beyond HTDemucs FT (developed by Meta (Facebook) and trained on a massive internal dataset). There's the MelBand Roformer (an architecture from ByteDance (TikTok), with well-known vocal models trained by Kimberley Jensen) and community expansions of it trained with extra data of messy real-world audio.

I discovered that for some of my recordings the MelBand Roformers were a lot better! But it is not one-size-fits-all unfortunately - if you really want to go to town you would run all the models and mix the results together (there's a sketch of that further down).

Okay, let's have a look/listen at an example - here is a recording straight from the crowd mics:

For reference here is what the microphones being held by the vocalists sound like, mixed:

Notice that there is some instrument bleed present, especially drums. They are dynamic mics and the vocalists are holding them close to their mouths, so they're not picking up much of the drums/instruments - but it is there. But also notice they are clear and crisp. That should be the predominant sound in the mix - not the mushy crowd mic version - but we want that underlying crowd sound as well!

I'm not going to mess with EQ and effects for this example - although that would make it sound better/more balanced - for the purpose of this discussion let's just listen to what the isolators do. To initially blow your mind like it did mine, here is just the instruments (for reference this was done with the MelBand Roformer InstVox Duality V2 model - the command for it is further down).

Righto - let's first listen to HTDemucs FT. You can run this yourself - make sure you have python installed and execute:

demucs -n htdemucs_ft "test_raw.wav" --two-stems=vocals
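
If you haven't got Demucs, it installs with pip:

pip install demucs

From memory the two stems land in a separated/htdemucs_ft/test_raw/ folder as vocals.wav and no_vocals.wav - double check where yours end up. The no_vocals.wav file is handy if you ever want the opposite trick.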

Of course it's not perfect! But wow, it is very very clever. The main vocals are loud in there - coming back through the house PA quite strongly, which is to be expected - but the crowd is underneath. Not as much as I want - but save that for Part 2: different mics and different placement to better pick up the people and not the house PA.

Now for MelBand Roformer Big Beta (community)

audio-separator "test_raw.wav" --model_filename "melband_roformer_big_beta5e.ckpt" --mdxc_segment_size 256 --mdxc_overlap 2 --mdxc_batch_size 1 --use_autocast --output_format=WAV

Code tips: ask your favourite Chatbot. Some of these parameters are tuned for my laptop GPU, which is not very powerful.
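
For reference, the separator here is the python-audio-separator project, and if I remember the install right it's:

pip install "audio-separator[gpu]"

(or the [cpu] extra if you don't have a CUDA GPU). The model checkpoints should download automatically the first time you name them with --model_filename.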

Notice it dropped off a little towards the end, and it grabbed a bit more bass towards the start - I think that was more the bass guitar than vox. Not bad, but I don't think it beat HTDemucs.

MelBand Roformer original Kim version

audio-separator "test_raw.wav" --model_filename "vocals_mel_band_roformer.ckpt" --mdxc_segment_size 256 --mdxc_overlap 2 --use_autocast --output_format=WAV

Similar, but it handled the end a bit better and picked up more nuance in the singing - for example, the word "Surely" at the start has been captured better by this model than by the other two.

MelBand Roformer Duality

audio-separator "test_raw.wav" --model_filename "melband_roformer_instvox_duality_v2.ckpt" --mdxc_segment_size 256 --mdxc_overlap 4 --use_autocast --output_format=WAV

Similar again, though it seemed to be slightly more gatey - it decided the gap between phrases was silence, which is technically correct. It didn't grab quite as much bass as the other two.

Look, I could go with any of them; on this test snippet I reckon the winner is HTDemucs FT or MBR Duality.
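
And if you did want to go to town and blend the models like I mentioned earlier, you don't need anything fancy - ffmpeg's amix filter averages its inputs. This is only a rough sketch: the two vocal stem filenames are placeholders, so substitute whatever the isolators actually wrote out, and the trailing volume=2 makes up for amix halving the level of two inputs.

ffmpeg -i "vocals_htdemucs.wav" -i "vocals_duality.wav" -filter_complex "amix=inputs=2:duration=shortest,volume=2" "vocals_blend.wav"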

As a final closeout, this is what a people-less mix sounds like:

Notice there was an electric guitar being played, but the house had it down so low it was barely present in the crowd mics! But because I had all the channels recorded, I turned him up, because it sounded so good. So finally, with people mixed in, the point of all of this:

It brings out the "liveness" of the room. Because there is still a lot of the main vocals in the mix, they come through a bit echoey, but the fact that the people are singing underneath really turns the mix around.
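
If you want to try that kind of blend without firing up a DAW, here is a hedged ffmpeg sketch of the same idea - the filenames are placeholders again, and the -12dB is just a starting level to tuck the crowd underneath the main mix; ride it to taste.

ffmpeg -i "main_mix.wav" -i "crowd_vocals.wav" -filter_complex "[1:a]volume=-12dB[crowd];[0:a][crowd]amix=inputs=2:duration=first" "mix_with_crowd.wav"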

There you have it! Using the power of AI to isolate vocals - but it's only as good as the original mic signals and I have room to make them better. Watch this space!

Monday, 16 February 2026

What's my relationship with AI?

Even if you don't want to use AI, you already are - it's everywhere, even in your internet searches. Since I work in tech, I have experimented with it as it grows and I'm forming some opinions about it. First up, we shouldn't call it AI. I reckon "chatbot", "LLM", or - if I go right back to 1992 when I actually did a university unit in it - "Artificial Neural Networks", which is what we called them back then. In reading up about LLMs I discovered that it is all still based on the premises I learnt about a long, long time ago.

So what are my opinions?

Straight away I cross off life coach/counselling/psychology. It's just a plain ol' bad idea. It will just spew generic tyre-pumping tripe at you - it will tell you what you want to hear and help you rationalise bad ideas. Throw that away immediately. As a research tool - yes. I reckon it is a better search engine *at this point in time* than any of the normal search engines. For starters, the traditional search engines have all been SEO'd to death and skewed towards making money from you. The chatbots on the other hand have devoured all the content and will regurgitate whatever you need in a friendly, summarised, easy-to-digest manner. It is still fraught with danger, and it is only a matter of time before it too is all about making money from you, but we are right now in that happy phase where you can search for information and then continue to drill deeper and get more specific as you go. With the caveat that you can't believe anything it says, so references to actual sources of truth are important when what you are searching for matters - like stuff about health, finance, or legality.

Let's quickly deal with art. Whilst I find the pictures and videos it generates fascinating, I've kinda decided no. Not for anything serious or important. I took a photo of my work colleagues and me at lunch the other day and had the LLMs make it look like a "The Simpsons" style scene. It was really, really good. But it has no practical value, it is built on stolen art, and it is essentially useless except as a curio. Last week I put a couple of headshots of myself into a video LLM generator and attempted to render a few scenes of me riding a horse with a guitar strapped to my back - as "Bard JAW" - rocking up at a tavern, going into a room full of fantasy-style creatures and then playing some fingerstyle guitar. Some of it was okay, but mostly it was a mess. Once again: no practical value, built on stolen art, just a curio.

Music is the same deal. When people are less "actively listening" and music is just background, will anyone care if it is LLM-produced? Nope. But is that good? Nope. And when it writes songs with meaningful, heartfelt lyrics? The LLMs have no soul and have no place in trying to connect with people on a human level. So the LLMs can leave the music-creating space alone...with a side note that using one as a research tool is not the same thing as creating music.

Literature? Hard no. Same as LLMs trying to connect with people on a human level: a generic regurgitation of everything that has been scraped, delivered in a pleasant, soulless manner devoid of new ideas. Just no.

Okay JAW, so where would you use LLMs? I have used the LLMs to create code that would have otherwise taken me days. It's really good at creating code. What's more, when I read through the code techniques it uses I see really clever stuff that I wouldn't have dreamed of doing - because it is based on the work of actual programmers, whereas I'm just an engineer hack who taught himself BASIC on a Commodore 64 in the '80s and still approaches coding problems in that same way. HOWEVER, LLMs are just for little tools. Even when you end up with a 500-line python script you can already see the cracks forming - it doesn't have a neat core with modular functionality written around it; it tends towards a monolith where, if you want just a little change, the whole thing needs to be rewritten. And it is not good at rewriting big chunks of code for little changes - suddenly it will drop functionality and you will fight it to put it back. I wouldn't trust AI to write anything big and important.

I have been using it for vocal isolation - taking a recording of a church congregation singing and getting it to remove the drums/bass/instruments, just leaving behind the singing. In fact, just yesterday I used a chatbot to research the current state of vocal isolation, describing my use case and that I just wanted a command line interface with a model, and it came up with a bunch of things to try. I took a 10 second WAV snippet and ran it through several of the surprisingly many models that people have created, gave the chatbot feedback on how I thought each went, and it recommended other things to try. In about two hours I had significantly improved my vocal isolation result - not just finding the model that gave me the best results for my use case, but improving my approach to feeding that model.

I could go on more about this giant, very topical subject, but I've hit the key detail right there. What's my relationship with AI? I use it where I can test its output. Not where I need to trust it, but where it gives me something that I can verify and that is useful to me. As a timesaver. Yes, this *will* make me dumber, because in pre-LLM days I would have to do the research - all the grunt work - and it is in that process that you learn a lot. And anything that is about human connection - no LLM, back off, stay in your lane, that is not your space.

No LLMs were used to generate this post.