I have a friend who always says "innovation happens at the speed of trust". Ever since GPT-3, that quote has come to mind over and over.
Verification has a high cost and trust is the main way to lower that cost. I don't see how one can build trust in LLMs. While they are extremely articulate in both code and natural language, they will also happily go down fractal rabbit holes and show behavior I would consider malicious in a person.
As someone who knows four languages[1] (I picked every single one up during childhood) and is currently learning Sanskrit, I have to say that Krashen's input hypothesis and Ørberg's Lingua Latina are probably the way to go if you are learning languages as an adult.
The direct teaching method works, but it is time-consuming and generally used for languages that lead to an occupation, viz. English. The grammar-translation method is a waste of time: it might satisfy your intellectual curiosity about the structure of the language, but you won't be able to make yourself understood after a lifetime of study. I wonder at the sheer lunacy of dumping thousands of random sentences into someone's lap to be translated from one language to the other.
After a year and a half of false starts, I started reading a couple of Sanskrit stories every day. Because the context is maintained across the story, your brain starts recognizing patterns in sentences. You keep reading sentences like
sarvē janāḥ kāryaṁ kurvanti
sarvē janāḥ gacchanti
sarvē janāḥ namanti
and you automatically associate sarvē (all) with janāḥ (people) without needing to know the declension of those words. This applies to the cases as well.
To be able to converse about or understand a wide variety of topics, you will eventually have to move beyond stories, because the nature of the material restricts the tenses, aspects, and moods you encounter. But that is doable.
[1] Much of India is bilingual. A substantial minority might know four or more languages due to the many mother and father tongues and heavy internal migration across the states (whose boundaries were drawn on linguistic lines post-independence)
And the corollary to that, from 17th century French writer Nicolas Boileau: "Ce que l'on conçoit bien s'énonce clairement, et les mots pour le dire arrivent aisément." - What we understand well, we express clearly, and words to describe it flow easily.
It's a bit like "the Cisco moment" (and lots of people have been observing this). The company was building the hardware needed to build out networks. The web looked like it was going to be the next big thing, and people couldn't get enough of CSCO. The web didn't pan out the way people hoped (or as quickly), and CSCO fell hard.
Cisco kept making and selling network hardware, and probably (citation needed) sold more from 2000-2006 than 1994-2000, but the stock trade was over. The web did become a serious thing, but only once people got broadband at home.
The case for the Nvidia valuation was getting pretty weak. Lots of FAANGs with deep pockets started to invest in their own hardware, and it got good enough to start beating Nvidia. Intel and AMD are still out there and under pressure to capture at least some of the market. Then this came along and potentially upended the game, bringing costs down by orders of magnitude. It might not be true, and it might even drive up sales long-term, but either way the NVDA trade was always a short-term thing.
> However, generating sentence embeddings through pooling token embeddings can potentially sacrifice fine-grained details present at the token level. ColBERT overcomes this by representing text as token-level multi-vectors rather than a single, aggregated vector. This approach, leveraging contextual late interaction at the token level, allows ColBERT to retain more nuanced information and improve search accuracy compared to methods relying solely on sentence embeddings.
I don't know what it is about ColBERT that affords such opaque descriptions, but this is sadly common. I find the above explanation incredibly difficult to parse.
If anyone wants to try explaining ColBERT without using jargon like "token-level multi-vectors" or "contextual late interaction" I'd love to see a clear description of it!
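Here's my own attempt, for what it's worth: instead of squashing a whole sentence into one vector, ColBERT keeps one vector per token, for both the query and the document. At search time each query token looks for its best-matching document token, and the document's score is just the sum of those best matches. A toy NumPy sketch of that scoring step (illustrative shapes and random vectors, not real ColBERT code):

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (n_q, d), doc_tokens: (n_d, d); rows are L2-normalized."""
    sim = query_tokens @ doc_tokens.T        # similarity of every query token vs. every doc token
    best_per_query_token = sim.max(axis=1)   # each query token keeps its single best match ("MaxSim")
    return float(best_per_query_token.sum()) # document score = sum of those best matches

# Toy usage: a 4-token query against a 6-token document, 8-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(6, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```

The "late" part is that query and document only meet at this cheap max-and-sum step, so the document token vectors can be computed and indexed ahead of time.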
Six months ago I had almost given up on local LLMs - they were fun to try but they were so much less useful than Sonnet 3.5 / GPT-4o that it was hard to justify using them.
That's changed in the past two months. Llama 3 70B, Qwen 32B and now these R1 models are really impressive, to the point that I'm considering trying to get real work done with them.
The catch is RAM: I have 64GB, but loading up a current GPT-4 class model uses up around 40GB of that - which doesn't leave much for me to run Firefox and VS Code.
So I'm still not likely to use them on a daily basis - but it does make me wonder if I should keep this laptop around as a dedicated server next time I upgrade.
If you're using cosine similarity when retrieving for a RAG application, a good approach is to then use a "semantic re-ranker" or "L2 re-ranking model" to re-rank the results to better match the user query.
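For example, a minimal sketch with a cross-encoder from sentence-transformers (the model name is just a commonly available public checkpoint, not a specific recommendation):

```python
from sentence_transformers import CrossEncoder

query = "how do I rotate a PDF?"
candidates = [  # e.g. the top-k documents your cosine-similarity search returned
    "Rotating pages in a PDF with pypdf",
    "Cosine similarity explained",
    "PDF page orientation and rotation flags",
]

# The cross-encoder scores each (query, document) pair jointly, which is slower
# than a bi-encoder but usually more accurate for ordering a short candidate list.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```

The cheap embedding search stays responsible for recall; the re-ranker only sees the top candidates, so its extra cost is bounded.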
Eric Mejdric from IBM called on Friday and said, "We have the chips; when are you guys getting here?"
I took a red eye that night and got to Austin on Saturday morning.
We brought up the board, the IBM debugger, and then got stuck.
I remember calling you on Sunday morning. You had just gotten a big-screen TV for the Super Bowl and had people over, and in between hosting them you dropped us new bits to make progress.
I think Tracy came on Sunday or Monday and, together with you, got the kernel booted.
ChimeraOS is a clone or fork or something of SteamOS. It works great on AMD mini-PC hardware. I can't really comment past that. I found the keyboard and mouse setup kind of jarring and just threw Windows back on... for now.
Here's some context and a partial summary (youoy also has a nice summary) --
Context:
A random forest is an ML model that can be trained to predict an output value based on a list of input features: eg, predicting a house's value based on square footage, location, etc. This paper focuses on regression models, meaning the output value is a real number (or a vector thereof). Classical ML theory suggests that models with many learned parameters are more likely to overfit the training data, meaning that when you predict an output for a test (non-training) input, the predicted value is less likely to be correct because the model is not generalizing well (it does well on training data, but not on test data - aka, it has memorized, but not understood).
Historically, the surprise has been that random forests can have many parameters yet don't overfit. This paper explores that surprise.
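A quick way to see the surprise for yourself (a toy sklearn sketch on synthetic data, mine rather than the paper's):

```python
# A forest with a huge number of learned parameters still generalizes reasonably well.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("train R^2:", round(forest.score(X_tr, y_tr), 3))  # typically near-perfect
print("test  R^2:", round(forest.score(X_te, y_te), 3))  # lower, but it doesn't collapse
```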
What the paper says:
The perspective of the paper is to see random forests (and related models) as _smoothers_, which is a kind of model that essentially memorizes the training data and then makes predictions by combining training output values that are relevant to the prediction-time (new) input values. For example, k-nearest neighbors is a simple kind of smoother. A single decision tree counts as a smoother because each final/leaf node in the tree predicts a value based on combining training outputs that could possibly reach that node. The same can be said for forests.
So the authors see a random forest as a way to use a subset of training data and a subset of (or set of weights on) training features, to provide an averaged output. While a single decision tree can overfit (become "spiky") because some leaf nodes can be based on single training examples, a forest gives a smoother prediction function since it is averaging across many trees, and often other trees won't be spiky for the same input (their leaf node may be based on many training points, not a single one).
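To make the "smoother" framing concrete, here's a tiny sketch (mine, not the paper's) showing that a regression tree's prediction is literally the mean of the training outputs that land in the same leaf:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

leaves = tree.apply(X)          # which leaf each training point falls into
x_new = X[:1]                   # pretend this is a new input
leaf_new = tree.apply(x_new)[0]

# The tree's prediction is just the average training target in that leaf.
manual = y[leaves == leaf_new].mean()
assert np.isclose(manual, tree.predict(x_new)[0])
```

A forest then averages these leaf means across many trees grown on different data and feature subsets, which is what smooths out the spiky single-tree behavior.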
Finally, the authors refer to random forests as _adaptive smoothers_ to point out that random forests become even better at smoothing in locations in the input space that either have high variation (intuitively, that have a higher slope), or that are far from the training data. The word "adaptive" indicates that the predicted function changes behavior based on the nature of the data — eg, with k-NN, an adaptive version might increase the value of k at some places in the input space.
The way random forests act adaptively is that (a) the prediction function is naturally more dense (can change value more quickly) in areas of high variability because those locations will have more leaf nodes, and (b) the prediction function is typically a combination of a wider variety of possible values when the input is far from the training data because in that case the trees are likely to provide a variety of output values. These are both ways to avoid overfitting to training data and to generalize better to new inputs.
Disclaimer: I did not carefully read the paper; this is my quick understanding.
We're building the R&D project platform for scientific teams pursuing ambitious goals. If you're passionate about advancing scientific research and eager to tackle complex challenges in a fast-paced startup, apply for our Software Engineer role here: https://kaleidoscopebio.notion.site/Software-Engineer-5a8cc8...
There are many companies in the 'which proteins are in my sample' space (Olink, SomaLogic, etc.), but I actually don't know of any others in the 'what proteins interact with other proteins' space.
Many of these comments are about robotics as it's taught now, focusing on code and cameras and algorithms and motion planning.
As someone who's built both BattleBots and professional robots for work, I've found BattleBots to be a great way to step away from the equations and get hands-on with the fabrication, manufacturing, testing, and scrappiness that are so hard to reach in mechanical and electrical engineering. And unlike FIRST or Lego robotics, it's much more open-ended, "guardrails off" engineering, which I found really freeing after the tyranny of academic-style competition robotics. You can still incorporate all the sensors and algorithm stuff (many folks build their own motor controllers like "brushless-rage", or add sensors like Chomp does), but if you just love seeing things move and love mechanical design, it's a great thing.
For BattleBots in particular, the easiest way to get into it is to find some guides online for a simple bot[1] with DC motors and a 3D printed body, and just enter it into a local combat robot competition! You'll learn the basics of a motor, speed controller, selecting wheels and other interfaces, as well as designing a chassis and fabricating it. At a competition you get the thrill of the fight, and afterwards you can sweep your robot scraps into a dustpan, make friends with other bot builders and go from there.
Semi-related - I saw this on https://v8.dev a while back, but `filter: hue-rotate(180deg) invert();` can be a neat CSS trick to 'dark mode' some kinds of graphics while not screwing with the colours too much. The `hue-rotate` helps a bit to keep 'blues blue', etc.
It's far from perfect, but it's a neat one to have in your back pocket.
Too many of us are attached to our intelligence. I love this story because it's a reminder that we should value personal excellence over intelligence. By personal excellence I mean making the most of the intelligence you're given.
The arc of intelligence in Flowers for Algernon is the same arc we'll all experience over our lifetimes. With old age, we all lose mental faculties. If we value intelligence in and of itself, that loss will be very painful. But if we value making the most of our intelligence, we are resilient.
Applying this framework to Charlie, there’s much less to be sad about. He made the most of the intelligence he was gifted, and that’s what really matters.
I wish I had been foresighted enough to realise that the icons were more than the occasionally useful result of a period of insomnia. The set was started because I could not find a good icon set to use in a system I was developing.
I have done almost no icon design since this set was released; the icons have garnered me some personal infamy, and I make a little from text-link ads, but I would kill the site if not for the fact that people still appear to find the icons useful and reliable.
For personal work, I use the fugue set linked previously.
Most message encryption schemes don't use this alone. Kyber is a Key Encapsulation Mechanism (KEM), designed as a way to establish key material between two parties, much like an ephemeral Diffie-Hellman exchange with X25519. However, many deployments use Kyber _together with_ X25519 as a hybrid system: the keying material generated by both schemes is fed into a Key Derivation Function (KDF) to produce a shared symmetric key, which is then used with AES-GCM or ChaCha20-Poly1305 to encrypt any subsequent messages.
The reason we want Kyber is that it's believed to be post-quantum secure; X25519 is not, so using a hybrid system at least guarantees that your scheme is post-quantum ready. The reason you don't switch to Kyber alone is to hedge against dodgy early Kyber implementations or protocol issues we haven't yet discovered.
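A rough sketch of that hybrid plumbing using the Python `cryptography` package for the classical half. There's no Kyber in that library, so the post-quantum share is faked with random bytes here - treat it purely as an illustration of the KDF/AEAD flow, not a real implementation:

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

# Classical half: ephemeral X25519 Diffie-Hellman.
alice = X25519PrivateKey.generate()
bob = X25519PrivateKey.generate()
dh_secret = alice.exchange(bob.public_key())

# Post-quantum half: in reality this would come from a Kyber/ML-KEM encapsulation;
# random bytes stand in for the encapsulated shared secret in this sketch.
kyber_secret = os.urandom(32)

# Both secrets go through a KDF to derive one symmetric key.
key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
           info=b"hybrid-demo").derive(dh_secret + kyber_secret)

# The derived key is then used with an AEAD for the actual messages.
aead = ChaCha20Poly1305(key)
nonce = os.urandom(12)
ciphertext = aead.encrypt(nonce, b"hello", None)
```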
My parents-in-law had an old friend who with her folk-singer husband ran a club in Montreal in the early 60s. Bob Dylan came through town before he changed his name. They were sitting around after the show and Annette said to him: "Kid. You better go back to college. You need something to fall back on." Dylan told her: "If I don't, you'll eat my hat."
She had other great stories. Lenny Cohen played "Suzanne" for her and asked, "What do you think?" Annette: "It stinks. Who wrote it?" Lenny: "I did."
There's no other way to do it for this type of brain. I know because I have the same type of brain.
I spend 90% of my time formulating descriptions of the problem and the desired end state,
hallucinating futures where the world is in the state I want it to be in, or the state somebody is asking me to build.
Once you know your final end state, you need to evaluate the current state of the things that have to change in order to transition to that final state.
Once you have your S' and S respectively, the rest of the time is spent choosing between hallucinations based on the likelihood that each sub-component can move from S to S' within the time window.
So the process is basically trying to derive the transition function, and the sequencing of creating the systems and components required, to successfully transition from state S to state S'.
The more granularly and precisely you can define the systems at S and S', the easier it is to discover the likely pathway through the transitional variables, and also to discover gaps where systems that would be required for S' don't yet exist.
Said another way: treat everything - both existing and potential futures - as though it is, or sits within, an existing state machine that can be modeled. Your task is to understand the Markov process that would result in such a state and then implement the things required to realize it.
A couple of years ago Troy Hunt printed a map of where he lives in Gold Coast[0], using a separate piece of plastic underneath to show off the canal running by his house. I spent a while trying to replicate this and eventually gave up as I was missing way too many skills (as well as a printer). I might have another bash at it using this project - thanks!
As an aside, has anybody in the UK used a 3D printing service that they would recommend?
Do yourself a favor and listen to at least one episode of the six-part series that the Behind the Bastards podcast[0] did on Kissinger. It will give you background, with sources, on the "controversial" statesman you'll read eulogies about over the next few days.
Quick "ask HN": I'm currently working on a semantic search solution, and one of the challenges is to be able to query billions of embeddings easily (single-digit seconds). I've been testing different approaches with a small dataset (50-100 million embeddings, 512 or 768 dimensions), and all databases I've tried have somewhat severe issues with this volume of data (<100GB of data) on my local machine. I've tried milvus, chroma, clickhouse, pgvector, faiss and probably some others I don't recall right now. Any suggestions on additional databases to try out?
In case someone is looking for historical weather data for ML training and prediction, I created an open-source weather API which continuously archives weather data.
Past and forecast data from multiple numerical weather models can be combined using ML to achieve better forecast skill than any individual model. Because each model is bound by physics, the resulting ML model should be stable.
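As a toy illustration of the blending idea (entirely synthetic numbers, unrelated to the actual API): fit a simple regression that weights two imperfect "model" forecasts against observations, and the blend will usually beat either model on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
truth = rng.normal(15, 8, size=1000)                # "observed" temperature
model_a = truth + rng.normal(1.0, 2.0, size=1000)   # biased, moderately noisy forecast
model_b = truth + rng.normal(-0.5, 3.0, size=1000)  # different bias and noise

X = np.column_stack([model_a, model_b])
blend = LinearRegression().fit(X[:800], truth[:800])  # train on the first 800 days

def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

print("model A RMSE:", rmse(model_a[800:], truth[800:]))
print("model B RMSE:", rmse(model_b[800:], truth[800:]))
print("blend   RMSE:", rmse(blend.predict(X[800:]), truth[800:]))
```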
I'm sure it's been mentioned elsewhere but Lina Khan's Amazon’s Antitrust Paradox[1] ranks among my favorite pieces of legal writing. I think it's all but required reading for anyone who cares about antitrust issues, irrespective of the position one lands on when it comes to specifics.
I totally recommend hosting your own tile server using OpenStreetMap data if you have the resources. Creating your own tile server is not as difficult as it may sound; however, it is really resource-intensive, especially if you want to cover large areas. For a single EU country it shouldn't be that resource-intensive.
I have set it up on a CentOS 7 server using more or less the instructions from here https://switch2osm.org/manually-building-a-tile-server-16-04... (yes, they are for Ubuntu, but you'll get the idea) and everything works great. Even if you don't really need it, I recommend trying it to understand how it works; it has some very intuitive ideas.
Beyond the tile server, I would also suggest GeoServer (http://geoserver.org) for hosting the geo points on the maps (it can integrate with PostGIS and various other data sources and output the geo points in various formats). You can then use Leaflet (https://leafletjs.com) to actually display the map and points!