More

mathisd · 2025-04-30T21:08:15 1746047295

Could you provide some code example panderas vs Dataframely ?

mathisd · 2025-02-19T19:58:03 1739995083

I really like uv and I have successfully got rid of miniconda but : - I wish there was a global virtual environment which could be referenced and activated from terminal. Not every new scripts needs their own .venv in their respective folder. uv takes the route of being project centered and based on file system, this works for me most of the time but sometime it doesn't. - I wish we could avoid the .python_version file and bundle it in the pyproject.toml file.

auxym · 2025-02-20T22:51:30 1740091890

Agreed, I like to have a "general use" environment to use for various one-off interactive python sessions, scripts or jupyter notebooks.

mathisd · 2024-09-27T00:08:11 1727395691

Nice project of yours! I am a data science student but I never looked into Computer Vision. Until a few days ago, when I started watching a series of short courses on a YouTube channel called First Principles of Computer Vision [0]. I found it fascinating and the math behind is truly beautiful, concise and efficient.

[0] https://www.youtube.com/@firstprinciplesofcomputerv3258 strongly recommend to check-out any playlist. Best courses I have had since a long time.

mathisd · 2024-06-09T21:24:48 1717968288

- Arc Browser

- Typora : a markdown editor

- Flux : screen color temperature adjuster

- Rectangle : window management

- Transmission : torrent downloader

- Pika : color picker

- IINA : video player

- Quickshade : to go below minimal luminosity

- The Unarchiver : for unsupported compressed file

yamilg · 2024-06-14T18:56:35 1718391395

I'm very happy with qBittorrent, in case you want to try something different.

mathisd · 2024-05-22T21:10:24 1716412224

During an internship, I was part of a team that developed a collection of tools [0] intended to provide pseudonymization of production database for testing and development purposes. These tools were developed while used in parallel with clients that had a large number of database.

Referential constraint refer to ensuring some coherence / basic logic in the output data (ie. the anonymized street name must exist in the anonymized city). This was the most time consuming phase of the pseudonymization process. They were working on introducing pseudonymization with cross-referential constraint which is a mess as constraint were often strongly intertwined. Also, a lot of the time client had no proper idea of what the field were and what they were truly containing (what format of phone number, we did find a lot of unusual things).

[0] (LINO, PIMO, SIGO, etc.) https://github.com/CGI-FR/PIMO

edrenova · 2024-05-22T21:27:50 1716413270

yeah the referential integrity and constraints part is usually the most complicated part and everyone does things differently which adds another layer of complexity on it

mathisd · on May 1, 2024

Author seems to ignore the point regarding effect of industrialization. Industrialization did lower cost of goods and services all together but the relative cost depends on buying power.

mathisd · on March 29, 2024

I agree with other comment and I don't think this is a good idea. In an ideal world, we use data to create visualisation that can then be embedded in a variety of place (powerpoint, web-app or simply in a notebook). Here you are giving presentation the central role which simply doesn't sound right to me.

ellenfkh · on March 29, 2024

That's definitely the ideal world, but in our experience everyone says they want dashboards and live data but everything ends up in presentations anyways. Fundamentally, it's the current format for standing in front of someone and making an argument. Maybe it's because the execs with buying power just like slides, but at anything bigger than a startup, decisions and alignment are done off a deck and not a dashboard.

An old boss once said "any data tool that lives long enough becomes a BI tool," and our hypothesis is that one reason there are so many BI tools floating around without market dominance is because all of them stop one step short of the final destination, which is (regrettably?) a presentation.

mathisd · on March 26, 2024

Have you tried using Whisper from OpenAI ? Aiko [0] have Whisper-v2-large built-in and allow for transcription of audio file

[0] https://apps.apple.com/fr/app/aiko/id1672085276

LeoPanthera · on March 27, 2024

Is there anything like this for watching foreign television (or radio)? I don't want to create a document, I just want real-time translated subtitles, but I can't do it in advance for live shows.

jonplackett · on March 26, 2024

This is amazing. Just tried really mumbling a long for a while and it got every word.

mathisd · on Feb 18, 2024

I don't understand why maximum of likelyhood is not zero in the example given. Isn't P(X = x / theta = theta_0) always null for continuous laws ?

knightoffaith · on Feb 18, 2024

The actual probability is 0, but the probability density is not 0. Same reason why the probability that I pick 0.5 from a uniform distribution from 0 to 1 is 0, but the value of the probability density function of the distribution at 0.5 is 1.

cubefox · on Feb 18, 2024

What is this point value then measuring? A literal "density" doesn't seem plausible either, as points arguably do not have any "density".

knightoffaith · on Feb 18, 2024

I'll give the mathematical explanation. So if X is a continuous random variable, the probability that X takes on any particular value x is 0, i.e. P(X = x) = 0. However, it still makes sense to talk about P(X < x) --- this is clearly not 0. For example, suppose X is a random variable of the uniform distribution from 0 to 1. P(X = 0.5) = 0, clearly, but P(X < 0.5) = 0.5, clearly. (There's a 50% chance that X takes on a value less than 0.5). We can talk about P(X < x) as a function of x---in the case of the uniform distribution, P(X < x) = x. (There's a 30% chance that X takes on a value less than 0.3, there's a 80% chance that X takes on a value less than 0.8, etc.) This is called the cumulative distribution function---it tells us the cumulative probability (accumulating from -infinity to x). The probability density function is the rate of change---the derivative---of the cumulative distribution function. At a particular x, how "quickly" is the cumulative distribution function increasing at that point? That is the question that the probability density function answers, if that makes sense.

In the case of the cumulative distribution function of the uniform distribution from 0 to 1, since the derivative of x is 1, the probability distribution function is 1 from 0 to 1 and 0 elsewhere. This makes sense; the probability P(X < x) isn't increasing faster at one point than any other---with the exception of x outside of 0 and 1 having a probability density value of 0, since e.g. P(X < 2) is 100% and increasing the value of x=2 does not change this (it's still 100% because X only takes on values within [0,1]) .

cubefox · on Feb 19, 2024

That's interesting and intuitive for a uniform distribution. What does it then mean on a non-uniform distribution for an value to be very small? Is there some interpretation for that? The Stack Overflow post actually mentions values that are extremely close to zero.

knightoffaith · on Feb 20, 2024

So, just to be sure, even for a uniform distribution, the values can be small. Consider the uniform distribution from 0 to 10^100. The CDF for this distribution is P(X < x) = x/10^100. The derivative of this (the PDF) is p(x) = 1/10^100. At any particular point, p(x) is 1/10^100. But this is true for any x (again, unless it is outside the range [0, 10^100]), which makes sense because the "speed" with which the probability is increasing is constant regardless of the x. Why are these values smaller than for the uniform distribution on [0,1]? It's because the probability increases much more slowly per unit of x on the uniform distribution from [0, 10^100] than it is on the uniform distribution from [0, 1]. P(X < 0) to P(X < 1) for Uniform(0, 10^100) only increases the probability by 1/10^100, while it increases the probability by 1 for Uniform(0, 1).

So PDFs can have small values regardless of whether they are uniform or not. What a small PDF at a point x indicates is that the CDF is increasing very "slowly" at that x. I'll emphasize this point - PDF values are not probabilities. They are rates of change of the CDF.

For some further understanding of the stack overflow post, let's consider Uniform(0, 2). The PDF is p(x) = 1/2. Suppose the author of the stack overflow post drew 50 samples from this distribution. Regardless of what those 50 samples were, the value he would have gotten would have been (1/2)^50 = 1/(2^50), something on the order of 10^-16. Why is this so small?

(I'll give a rather loose and informal explanation here, but I can be more formal if you'd like, if this doesn't make sense.) Think back to Uniform(0, 1) vs. Uniform(0, 10^100). Recall that the probability that a particular x falls in [0, 1] for the former distribution is the same as the probability that a particular x falls in [0, 10^100]---i.e. 1 (100%). In the case of the latter distribution, that 1 has had to be "spread out" across a larger space, which should give some intuition as to why the PDF is low---for a particular unit in space that we "travel", since the probability has been spread out so thinly across the space, the CDF isn't increasing that much, i.e. the PDF isn't that high.

When we're looking at PDF values when we're looking at the space of possibilities covered by 50 samples, it's going to be a lot "larger" than the space covered by 1 sample (over one sample, the space is [0,2], covering 2 units of space. over two samples, the space is the square [0,2] x [0,2], with an area of 4. over 50 samples, the space is the hypercube [0,2]^50, with a 50-dimensional volume of 2^50---a huge space.) But the total probability is still 1, so it's going to be "spread out" very thinly across this larger space, hence much smaller values. And so, the probability we accumulate as we move across this space per unit is going to be very low, hence a low likelihood value.

So when we draw many samples from a distribution, the likelihood of these samples is going to be very small (mostly---there might be spikes where they're high).

I've spoken a little loosely and informally, but hopefully this makes sense.

cubefox · on Feb 21, 2024

I just don't quite understand why more samples mean that the "space" gets higher dimensional and consequently less dense. Aren't the samples just estimating the underlying PDF, such that more samples shouldn't decrease the magnitude of the PDF? So if he drew those samples from Uniform(0, 2), shouldn't the resulting PDF simply approximate a value of 1/2=0.5 everywhere? I'm probably misunderstanding something basic here.

knightoffaith · on Feb 28, 2024

Sorry to respond so late.

I think it will be easiest for you to diagnose what you're misunderstanding if I present a simple discrete case.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Consider a coin flip. 50% chance heads, 50% chance tails. This distribution is called Bernoulli, specifically Bernoulli(0.5). If we sample from this distribution, we get 1 (representing heads) with a 50% probability, or 0 (representing tails) with a 50% probability.

Now consider taking two samples, and calculating the likelihood of those two samples. Suppose we draw two samples from this distribution, HT (heads followed by tails). What is the probability that we got exactly these two samples from the distribution? Trivially, it's 0.5 * 0.5 = 0.25. Notice how this isn't the same as the probability of drawing any single sample (the probability of drawing any particular sample, that is, either heads or tails, is just 0.5).

The distribution representing the probability of a single sample of a coin flip lies in {0, 1}. You can think of this as a single-dimensional table, [0.5, 0.5], where each element represents the probability of the sample taking on the index of that element. (the probability of the sample taking on the value 0, which represents tails, is the 0th element of this array, 0.5. Similarly for 1, which represents heads).

Now think of the distribution for two samples. There are no longer two possibilities, but four - {0, 1} x {0, 1} = {(0, 0), (0, 1), (1, 0), (1, 1)} = {TT, TH, HT, HH}. We think of this as not a one-dimensional table but a two-dimensional table:

  second sample = 0(T) [[0.25,              0.25],
  second sample = 1(H) [0.25,               0.25]]
                       first sample = 0(T), first sample = 1(H)

Here, the element at row i and column j represents the probability that the first sample takes on value j and the second sample takes on value i.

For three samples, the distribution becomes three-dimensional, with the space of possibilities being {0,1}^3 = {(0, 0, 0), (0, 0, 1), (0, 1, 0), ...}.

For any of these tables, each element represents the probability that a sample takes on the corresponding value at the element's position. So, clearly, if we add up all the values in a table, no matter how many dimensions, it must sum up to 1. There is a 100% chance that a sequence of n samples takes on some value, after all.

What you're saying about drawing multiple samples approximating the underlying PDF is still true here (though we are not talking about the PDF in the discrete case, but rather the PMF - probability mass function - since each element in this table is actually a probability, not merely a measure of density). If you draw N samples from this distribution and plot it on a histogram (one bar for the number of heads you draw divided by N, one bar for the number of tails you drew divided by N), then this will approximate the underlying PMF, namely [0.5, 0.5]. But that is separate from the fact that drawing a particular sequence of N samples becomes decreasingly smaller as N increases. For N = 2, the probability of drawing any two particular samples (TT, TH, HT, or TT) is (1/2)^2. In general, it is (1/2)^N. One way to think about why it is (1/2)^N is that the distribution for N sample lies on the space {0, 1}^N, whose size is 2^N. The total probability, which is always 1, (no matter how large N is, it's still true that there's a 100% chance that a sequence of N samples is some sequence), needs to be distributed across the space 2^N. Every possibility is equally likely, so it's evenly distributed, so the probability is 1 / 2^N = (1/2)^N.

The same idea roughly applies in the continuous case, but importantly, in the continuous case, we're no longer talking about raw probabilities for a particular sample (the probability of drawing the value 0.3 from the distribution Uniform(0, 1) is exactly 0), but we're talking about probability density values. The same principle still applies though - if we "sum" (integrate) up all the PDF values for a distribution, since the PDF is the derivative of the CDF, by the fundamental theorem of calculus, we still should get 1. (The PDF is p(x) = d/dx P(X < x). Integrating both sides across all possible values X can take on, we will get integral from min possible value of x to max possible value of x of p(x) dx = P(X < max possible value of x) - P(X < min possible value of x) = 1 - 0 = 1). This total probability, 1, needs to be distributed across some space. The bigger the space is, the less densely it's going to be distributed, which is reflected in the lower value of the PDF.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note:

To be sure, the space getting higher dimensional doesn't necessarily mean the PDF must be less dense. Consider Uniform(0, 0.5). When looking at the likelihood of two samples being from Uniform(0, 0.5), the probability is spread across [0, 0.5] x [0, 0.5], whose area is 0.50.5 = 0.25. Since the area is less than 1, the probability is actually more* dense---specifically, the PDF is 4 at any point in [0, 0.5] x [0, 0.5], but for just one sample, the PDF is 2 at any point in [0, 0.5]. Whether the probability gets less or more dense in the higher dimensional space representing the likelihood of multiple samples from the same distribution depends on the volume of the domain. For Uniform(0, 2), the space of two samples is [0,2] x [0,2], whose area is 2 * 2 = 4---this is larger than it is for just one sample, since the space for just one sample is [0,2], which covers 2 units of space. Accordingly, for just one sample, the PDF is 0.5, while for two samples, the PDF is 0.25. The larger this space, the less dense the probability is concentrated in this space, and vice versa. If we're thinking about uniform distributions, notice the space gets bigger for Uniform(0, L) if L > 1, and gets smaller for L < 1 since powers of L (representing the size of higher dimensional spaces, e.g. L^2 represents the size of [0, L] x [0, L], the space on which the PDF for two samples from the distribution must lie) get smaller if L < 1 but get bigger if L > 1. For the stack overflow post, the distribution in question is Gaussian, which takes on positive values on (-infinity, infinity), which you can think of as being more than large enough for the size of higher-dimensional spaces to be increasing, hence causing the PDF values to become smaller and smaller.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

I hope I didn't make the problem more confusing by saying all that. If you're still confused, I can try to clear up things further, or I can point you to a better resource if you'd prefer that.

cubefox · on March 7, 2024

Thanks for this detailed explanation!

mathisd · on Dec 4, 2023

How does it work compared to a browser extension ad blocker ?

sodality2 · on Dec 4, 2023

Browsers only work in the browser, this is system-wide

lxgr · on Dec 5, 2023

Is that even a common problem? I can't think of any apps on my computer that are showing me ads, other than my browser.

It's a very different story on mobile, but there, certificate pinning can also trivially bypass this kind of blocking, and for good reason too: Imagine a system-wide tool like this getting access to online banking credentials, for example...