I understand the concern with calling something licensed under CC BY-NC “open source”, but I’m very interested in reading the complete source of a modern commercial app.
It’s rare that we get to see the complete picture of something that has many paying customers like this, and I’m thankful to the Campsite team for sharing it.
From the article’s first couple sentences:
> The word "prig" isn't very common now, but if you look up the definition, it will sound familiar. Google's isn't bad:
> A self-righteously moralistic person who behaves as if superior to others.
>First, for the main source of data, I chose all Mr. Beast videos with uploaded (ie. non-auto-generated) transcripts—a total of 229 out of 837 published videos on his flagship channel. This gave me a source of processable ground truth about where money was mentioned and also limited the videos to those published the last 6 years, which make up the majority of his meteoric rise. Then, I downloaded the videos in 360p and scraped their transcripts for every occurrence of a dollar amount, logging each mention with its sum, video, and context in a database that I would build on top of as I nailed down the exact timing. I used those contextual timestamps to make rough clips that I fed into the open source AI tool Whisper to (a) get a more precise measurement of where “X dollars” was actually said and (b) standardize and double check that my first scrape had gotten the amount correct. Finally, as many of the clips were still off by a few annoying and noticeable fractions of a second in any direction, I made a script that allowed me to go through each entry individually, trim or extend the clip on either end, and modify the amount one last time if my first 2 methods had failed. After all 2800+ were processed—a task that took weeks—I made a final set of clips out of higher quality versions of the videos and used Premiere to make the film’s final dizzying supercut you see before you.
>90% of data science is data cleaning, and I have kept this overview pretty high-level in the interest of making it accessible to a wide audience. A much longer and more technical dive into the steps needed to go from a raw YouTube archive to this video—including everything from token suppression, the comparative benefits of transcription libraries, counterintuitive ways to standardize and parse numbers in natural language, and debugging audio desyncs in clip concatenations—may appear in the future on my website.
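The transcript-scraping step described above can be sketched in a few lines of Python. This is a toy illustration of the general technique, not the author's actual pipeline; the function name, the regex, and the handled phrasings are all my own simplifications:

```python
import re

# Matches "$500,000", "$2.5 million", "10000 dollars", etc.
# A real pipeline would need many more variants ("ten grand", "a million bucks", ...).
DOLLAR_RE = re.compile(
    r"\$\s?(\d[\d,]*(?:\.\d+)?)(\s*(?:thousand|million|billion))?"
    r"|(\d[\d,]*(?:\.\d+)?)\s*dollars",
    re.IGNORECASE,
)

MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def find_dollar_mentions(transcript: str):
    """Yield (amount, context) pairs for every dollar amount in a transcript."""
    for m in DOLLAR_RE.finditer(transcript):
        raw = m.group(1) or m.group(3)
        amount = float(raw.replace(",", ""))
        scale = (m.group(2) or "").strip().lower()
        amount *= MULTIPLIERS.get(scale, 1)
        # Keep some surrounding text as context for later timestamp matching.
        context = transcript[max(0, m.start() - 30):m.end() + 30]
        yield amount, context

mentions = list(find_dollar_mentions(
    "Last to leave the circle wins $500,000! We already gave away 10000 dollars."
))
```

Each mention would then be logged with its video and rough timestamp before the Whisper pass narrows down exactly when the amount is spoken.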
The Rails API docs are also going through a redesign; you can see a preview of the next design in edge: https://edgeapi.rubyonrails.org/classes/ActiveJob.html
I think it’s a change in the right direction that removes the “aggressive wall of text” on longer pages and looks great in both dark and light modes.
And Blazer[0], the closest thing to a perfect BI tool. It has a SQL editor/runner, saved queries, audit history, dashboards, alerts, and user access control, all in a Rails engine you can mount with minimal configuration.
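For a sense of how little configuration that means, the core of a Blazer install is roughly this (as I recall; the generated migration and the data-source database URL are omitted here):

```ruby
# Gemfile
gem "blazer"

# config/routes.rb
Rails.application.routes.draw do
  mount Blazer::Engine, at: "blazer"
end
```

After running the install generator and migrating, the full UI is available at /blazer.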
Blazer is my favourite BI tool by a country mile. It does all I want with no fuss, is a breeze to set up and it's so much faster and more efficient than any of the other BI tools I've tried.
For tutorials and hardware, check out Adafruit. If you’re okay with less polished documentation, also look into M5Stack (they are really cheap on AliExpress).
I currently work in e-commerce using Ruby on Rails, which I really enjoy. However, I do miss my previous work on the weird side of software and electronics, building art installations for galleries and events[0]. If anyone wants to chat about or pair on embedded development, feel free to reach out to me at my username at hey.com.
[0] https://vimeo.com/389519079
DuckDB has great ergonomics for moving data between different databases and making copies for local analysis.
The one thing that differed between my experience and the author’s is how much of the Postgres SQL dialect (and its extensions) DuckDB supports. Attempting to run my Postgres analytics SQL in DuckDB errors out on most JSON operations (to be fair, the DuckDB JSON functions have cleaner names than jsonb_path_query). DuckDB also has no support for handling XML, so all xpath calls fail as well.
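To make the dialect gap concrete, the translation looks roughly like this (table and column names invented for illustration; the DuckDB lines need its json extension):

```sql
-- Postgres: SQL/JSON path query, returns one row per matched element
SELECT jsonb_path_query(payload, '$.items[*].price') FROM orders;

-- DuckDB: cleaner names, but different semantics, so queries like the
-- one above need rewriting rather than just renaming the function
SELECT json_extract(payload, '$.items[0].price') FROM orders;
SELECT payload -> 'items' -> 0 -> 'price' FROM orders;
```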
You may know this already, but the postgres extension[1] may help:
If I understand it correctly, when you use it, it:
- Pulls the minimal data required (inferred from the query) from Postgres into DuckDB
- Executes your query using the DuckDB execution engine
BUT, if your Postgres function is not supported by DuckDB, I think you can use `postgres_execute`[2] to execute the function within Postgres itself.
I'm not sure whether you can, e.g., do a CTE pipeline that starts with postgres_execute and then executes DuckDB SQL in later stages of the pipeline.
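If it helps, a rough sketch of how the extension is used (connection string, alias, and table names are placeholders):

```sql
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=analytics host=localhost' AS pg (TYPE postgres);

-- Runs in DuckDB, reading only what the query needs from Postgres
SELECT count(*) FROM pg.public.events;

-- postgres_query (the read counterpart to postgres_execute) runs the
-- inner query entirely inside Postgres, so Postgres-only functions work
SELECT * FROM postgres_query('pg', 'SELECT id, payload FROM events');
```

Since postgres_query is a table function, I'd expect it to compose with DuckDB SQL in a later CTE stage, though I haven't verified that.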
Thanks for the suggestion! As I understand it, you can only postgres_execute against a running Postgres db. It does work (I’ve used it in my tests), and I think I could get around the limitations I ran into by running a pg instance alongside DuckDB.
For now I think I’ll stick with just pg, as I was looking into DuckDB to replace pg in my local analytics workloads: load data from REST APIs, dump it into a database, and use SQL in a custom dbt-like pipeline to build the tables for analysis in BI tools. Unfortunately, many endpoints return XML and much of the SQL I’ve already written deals with JSON, meaning it would have to be adapted to work with DuckDB.
Hotwire is the umbrella term for Turbo, Stimulus, and Strada. Turbo has client- and server-side components that make it easy to do partial page refreshes and to send HTML templates (turbo streams) that initiate page updates from the server. That’s all without writing any JS, using mostly data attributes and custom HTML tags like turbo-frame.
Stimulus comes in as a small framework for the cases where you do want to write some JavaScript, in a way where you still control the behavior from HTML by attaching controllers to elements explicitly.
These two aspects are similar to htmx:
- send html from the server even for partial page updates
- encode desired js behavior into the html
The main difference is that Turbo is designed with a convention-over-configuration attitude, as many of its behaviors are automatic (like intercepting all form submissions and handling them with fetch, preventing a full page reload). When custom JS is needed, Hotwire defers to letting you write actual JS code, while htmx has a Tailwind-style shorthand that lets you essentially write JS from the HTML.
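To make the custom tags concrete, here's roughly what they look like in markup (ids and paths invented for illustration):

```html
<!-- A frame: navigation inside it replaces only this fragment -->
<turbo-frame id="messages">
  <a href="/messages?page=2">Next page</a>
</turbo-frame>

<!-- A turbo stream the server can send over HTTP or a WebSocket to
     update part of the page, with no hand-written JS on the client -->
<turbo-stream action="append" target="messages">
  <template>
    <div id="message_42">Hello!</div>
  </template>
</turbo-stream>
```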
I’d add: Stimulus lets you write your own htmx-like functionality by providing a convention for triggering and passing data to JS via HTML data attributes.
The Stimulus JavaScript controllers (not to be confused with Rails MVC controllers) are reusable and nestable, which makes them very powerful and quite fun to write.
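For anyone who hasn't seen one, a minimal Stimulus controller looks something like this (the controller name and markup are invented for illustration):

```javascript
// app/javascript/controllers/clipboard_controller.js
import { Controller } from "@hotwired/stimulus"

// Attached from markup with: <div data-controller="clipboard">
export default class extends Controller {
  static targets = ["source"]

  // Invoked from markup with: <button data-action="clipboard#copy">
  copy() {
    navigator.clipboard.writeText(this.sourceTarget.value)
  }
}
```

All the wiring lives in the HTML: `<div data-controller="clipboard"><input data-clipboard-target="source" value="hi"><button data-action="clipboard#copy">Copy</button></div>`.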