I'm working on the same project myself and was planning to write a blog post similar to the author's. However, I'll share some additional tips and tricks that really made a difference for me.
For preprocessing, I found it best to convert files to 16kHz WAV for optimal processing. I also add low-pass and high-pass filters to remove non-speech sounds. To avoid hallucinations, I run Silero VAD on the entire audio file to find timestamps where there's a speaker. A side note on this: Silero requires careful tuning to prevent audio segments from being chopped up and clipped. I also use a post-processing step to merge adjacent VAD chunks, which helps Whisper produce cohesive transcriptions.
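A minimal sketch of that VAD pass, assuming the stock snakers4/silero-vad torch.hub model and a 16kHz mono WAV; the silence/padding/merge thresholds are placeholders you'd tune for your audio:

import torch

# Load the Silero VAD model and its helper functions from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SAMPLE_RATE = 16000
wav = read_audio("episode.wav", sampling_rate=SAMPLE_RATE)

# Find speech regions; min_silence_duration_ms and speech_pad_ms are the knobs
# that need tuning to avoid clipping words at chunk boundaries.
timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=SAMPLE_RATE,
    min_silence_duration_ms=500,
    speech_pad_ms=200,
)

# Merge adjacent chunks separated by less than a second so Whisper sees
# cohesive utterances instead of tiny fragments.
MAX_GAP = SAMPLE_RATE  # 1 second, in samples
merged = []
for ts in timestamps:
    if merged and ts["start"] - merged[-1]["end"] < MAX_GAP:
        merged[-1]["end"] = ts["end"]
    else:
        merged.append(dict(ts))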
For the Whisper task, I run Whisper on small audio chunks that correspond to the VAD timestamps. Otherwise, it will hallucinate during silences and regurgitate the passed-in prompt. If you're on a Mac, use the whisper-mlx models from Hugging Face to speed up transcription. I ran a performance benchmark, and it made a 22x difference to use a model designed for the Apple Neural Engine.
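A rough sketch of that chunked pass, continuing from the VAD sketch above and assuming the reference openai-whisper package (on Apple silicon you'd swap in the mlx-whisper equivalents):

import whisper

SAMPLE_RATE = 16000
model = whisper.load_model("small")  # placeholder size; pick whatever fits your hardware

segments = []
for ts in merged:  # merged VAD chunks from the previous step, in samples
    chunk = wav[ts["start"]:ts["end"]].numpy()
    result = model.transcribe(chunk, language="en")
    segments.append({
        "start": ts["start"] / SAMPLE_RATE,
        "end": ts["end"] / SAMPLE_RATE,
        "text": result["text"].strip(),
    })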
For post-processing, I've found that running the generated SRT files through ChatGPT to identify and remove hallucinated chunks gives the best yield.
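As a sketch of that cleanup step, assuming the official openai Python client and a placeholder model name; the prompt wording here is my own, not necessarily what works best:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("episode.srt") as f:
    srt_text = f.read()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You clean up Whisper transcripts."},
        {"role": "user", "content":
            "Remove subtitle blocks that look like hallucinations "
            "(repeated phrases, prompt echoes, text during silence) "
            "and return the remaining SRT unchanged:\n\n" + srt_text},
    ],
)

with open("episode.clean.srt", "w") as f:
    f.write(resp.choices[0].message.content)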
I worked up my own process below by seeing what worked for my flour and starter:
100g sourdough starter
300g water (cold and filtered)
12g fine sea salt
10g olive oil
550g white and brown bread flour mixed (I use 200g brown, 350g white)
The morning of the day before you bake (24 hours ahead), or the night before (12 hours ahead):
Feed the sourdough starter 50g brown bread flour and 50g water. Make sure that this is at least 12 hours before you plan to make the dough, allowing time to double in size and form a very bubbly starter before using.
Morning:
Measure 100g of bubbling sourdough starter into a bowl, add 300g cold water and whisk with a fork for 1min. Add 12g salt and whisk briefly again until the salt is dissolved.
Add 10g olive oil and 550g flour and stir until all flour is mixed in, at least 2 mins of mixing. Use your (wet) hand to complete the mix.
Leave for at least 5 minutes, then gently lift and fold one corner of the dough into the middle, rotate the bowl a quarter turn, and repeat. Fold the dough 4 times this way, then cover and leave for 15 mins and repeat the folding. Do one final round of folding 15 mins later, then leave to proof for the rest of the day.
Proofing:
Cover the dough, let it proof (rise) for 10-12 hours at 16-19c in the kitchen. It only needs to double in size - you don't want it to over-proof.
That evening:
Shape. Check your dough, and when it has almost doubled in size, it is ready to stretch, fold, and shape.
Wet your hands and bring the dough in from the corners of the bowl, then reach in from each side and lift up the dough in the middle, letting it stretch down at the front and back. Let it stretch for 15 seconds, then fold these two hanging sides over the dough, turn the bowl, and repeat until you have folded this way 4 times.
Shape roughly into the loaf you want and place it in a lightly floured, parchment-lined bowl - if your shaping has formed a seam, put the seam side up and pinch it closed. Cover and leave this in the fridge overnight.
The next morning, preheat the oven to 225c - if you have a cast iron pot, add it to the oven to preheat with the lid off.
Remove the proofed loaf from the fridge, and add any cuts or slashes to the loaf before baking.
Place the loaf (still on the parchment paper) into the cast iron pot, cover and bake for 20-25 minutes. Remove the lid, and bake 10-15 more minutes, until very deeply golden. For my oven the total baking time is 35 mins, 25 covered and 10 uncovered.
Remove from the oven and the pan, then remove the parchment paper. Let it cool on a rack for at least an hour before cutting.
If you don't have a cast iron pot you can bake in two roasting trays placed face to face, or you can bake just on a baking tray, uncovered - if so, add a small pour (20ml) of boiling water to the base of your oven every 10 mins for the first 20 mins (at the start, at 10 mins, and at 20 mins).
It's not totally crazy in that I see it all the time, but it's one of the two most common things I've found make Python code difficult to reason about.[0] After all, if you open a DB connection in __init__() -- how do you close it? This isn't C++ where we can tie that to a destructor. I've run into so many Python codebases that do this and have tons of unclosed connections as a result.
A much cleaner way (IMO) to do this is to use context managers that have explicit lifecycles, so something like this:
with create_db_client('localhost', 5432) as db_client:  # port 3306 if you're a degenerate
    db_client.do_thing_that_requires_connection(...)
This gives you type safety, connection safety, has minimal boilerplate for client code, and ensures the connection is created and disposed of properly. Obviously in larger codebases there are more nuances, and you might want to implement a `typing.Protocol` for `_DbClient` that lets you pass it around, but IMO the general idea is much better than initializing a connection to a DB, ZeroMQ socket, gRPC client, etc in __init__.
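For illustration, a minimal sketch of what create_db_client could look like, using contextlib and psycopg2 as stand-ins (nothing above names a driver, so the specifics are assumptions):

from contextlib import contextmanager

import psycopg2  # stand-in driver; the same pattern works for ZeroMQ sockets, gRPC channels, etc.

class _DbClient:
    def __init__(self, conn):
        self._conn = conn

    def do_thing_that_requires_connection(self, query):
        with self._conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()

@contextmanager
def create_db_client(host: str, port: int):
    conn = psycopg2.connect(host=host, port=port)
    try:
        yield _DbClient(conn)
    finally:
        conn.close()  # runs even if the body raises, so the connection never leaks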
[0] The second is performing "heavy", potentially failing operations outside of functions and classes, which can cause failures when importing modules.
It depends on your starting point. A baseline level of ML knowledge is needed. Beyond that, ML platforms come down to three basic functions: features/data, model training, and model hosting.
So do an end-to-end project where you:
- start from a CSV dataset, with the goal of predicting some output column. A classic example is predicting whether a household's income is >$50K or not from census information.
- transform/clean the data in a jupyter notebook and engineer features for input into a model. Export the features to disk into a format suitable for training.
- train a simple linear model using a chosen framework: a regressor if you're predicting a numerical field, a classifier if it's categorical.
- iterate on model evaluation metrics through more feature engineering, scoring the model on unseen data to see its actual performance.
- export the model in such a way it can be loaded or hosted. The format largely depends on the framework.
- construct a docker container that exposes the model over HTTP, with a handler that receives prediction requests and transforms them into model inputs, and a client that sends requests to that model.
That'll basically get you an end-to-end run through the entire MLE lifecycle. Every other part of development is a series of concentric loops between these steps, scaled out to a ridiculous degree in several dimensions: number of features, size of dataset, steps in a data/feature processing pipeline to generate training datasets, model architecture and hyperparameters, latency/availability requirements for model servers...
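A minimal sketch of that core loop, assuming scikit-learn and a census-style CSV with an income target column; the file name and column names are placeholders:

import pandas as pd
from joblib import dump
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Load and split the raw data.
df = pd.read_csv("census.csv")
X, y = df.drop(columns=["income"]), (df["income"] == ">50K")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Feature engineering: scale numeric columns, one-hot encode categoricals.
numeric = X.select_dtypes("number").columns
categorical = X.columns.difference(numeric)
features = ColumnTransformer([
    ("num", StandardScaler(), list(numeric)),
    ("cat", OneHotEncoder(handle_unknown="ignore"), list(categorical)),
])

# 3. Train a simple linear classifier and score it on unseen data.
model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 4. Export so it can be loaded by a serving container.
dump(model, "income_model.joblib")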
For bonus points:
- track metrics and artifacts using a local mlflow deployment (see the sketch after this list).
- compare performance for different models.
- examine feature importance to remove unnecessary (or net-negative) features.
- use a NN model and train on GPU. Use profiling tools (depends on the framework) and Nvidia NSight to examine performance. Optimize.
- host a big model on GPU. Profile and optimize.
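On the mlflow bullet above: a small sketch of what the tracking step might look like, assuming a local mlflow server on the default port and the fitted pipeline from the earlier sketch (names are placeholders):

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://localhost:5000")  # a locally running `mlflow server` instance
mlflow.set_experiment("census-income")

with mlflow.start_run():
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # stores the fitted pipeline as an artifact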
IMO: the biggest missing piece for ML systems/platform engineers is how to feed GPUs. If you can right-size workloads and feed a GPU with MLE workloads you'll get hired. MLE workloads vary wildly (ratio of data volume in vs. compute; size of model; balancing CPU compute for feature processing with GPU compute for model training). We're all working under massive GPU scarcity.
Extract the description and a list of guests from the supplied episode notes from a podcast.
Also provide a Dewey Decimal Classification code and label for the description
Return valid JSON conforming to the following Typescript type definition:
{
  "description": string,
  "guests": {"name": string, "affiliation": string | null}[],
  "dewey_decimal": {"code": string, "label": string}
}
Episode synopsis (Markdown):
{notes}
Valid JSON:
(And the completion tends to be JSON, but not always.)
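For what it's worth, a minimal sketch of wiring a prompt like this up while guarding against the occasional non-JSON completion, assuming the openai Python client (the comment above doesn't say which model or API is used):

import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The prompt text above, stored verbatim; {notes} is filled with str.replace
# rather than str.format so the literal braces in the type definition survive.
PROMPT = open("podcast_prompt.txt").read()

def extract(notes: str) -> dict | None:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.replace("{notes}", notes)}],
    )
    try:
        return json.loads(resp.choices[0].message.content)
    except json.JSONDecodeError:
        return None  # completion wasn't valid JSON; retry or fall back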
No extra tooling, no symlinks, files are tracked in a version control system, you can use different branches for different computers, and you can replicate your configuration easily on a new installation.