Database of 200k cell images yields new mathematical framework (alleninstitute.org)
131 points by hhs on Jan 5, 2023 | 22 comments



This is the kind of dataset it would be super cool to train a generative AI model on. I can picture (naively) being able to generate cells with specific conditions that are rarely or never seen, and being able to dial different features up and down to look at how they affect diagnostics or whatever else you might do with these images. I hope that's on their radar.


I was at the CytoData conference at the Allen Institute a few months ago and there were several talks specifically about this.

https://youtu.be/kF9cd7iw1YI


Thank you, this looks fascinating. Sounds like a great conference too.


Thank you, that looks very interesting.


You actually do have access to the dataset here with some links to Quilt: https://www.allencell.org/genomics.html

I am sure you could get the raw data for peer review purposes, by emailing the contact on that page. Your idea sounds really interesting!


Can you help me understand where the image dataset is? I'm not familiar with the terminology here and when I look I see what I think is gene sequence data



Yes that makes more sense, thanks!


Wow, this paper is boring.

I used to work in screening.

In short, it's the measurement of many cells with almost no conclusion and no believable path to generalizability, beyond the claim that specific cells can (kinda) be described by 8 PCA-projected variables. The authors could maybe have screened more cells in 2D and noted that cell geometry can be kinda described by some eigenmodes, but once again the value of the parameters isn't obvious.

Usually you'd use these measurements to make conclusions or detect (by proxy) some cellular changes, perhaps related to disease. But no hypothesis guided this study.


Agreed, it's all a bit odd. There are no interesting discoveries here, so this must be a methods paper, which is fine, except it's a method that's so laborious that surely nobody will ever use it again. Still, laymen are clearly wowed by the scale of the thing, so it's perfect Nature fodder.


Method papers that produce reference data are absurdly valuable yet greatly undervalued because science incentivizes sexy discoveries.

But so much of biology is "the first 8 principal components explain 70% of the data" that most discoveries are really only the "obvious" stuff in the first 1-2 components.


Aren’t these baselines?


No, because in a well-designed experiment the baselines would be control measurements performed for the purpose of the experiment on the cell lines in question.

In general, cellular morphology is a proxy for more direct measurements. For example, you could imagine that some drug induces metabolic changes in cancer cells, and pick it up by noting that the cells look a bit different, or you could directly measure the metabolic changes using an FL system. It's not obvious what these measurements produce a baseline for.


The paper: https://www.nature.com/articles/s41586-022-05563-7

Random thoughts and observations ...

Interesting that they did this in live cells. That means using GFP tagging to label the organelles, and spinning disk microscopes to not kill them while you take pictures. It would have been far easier to fix the cells, stain them with antibodies, and use scanning confocals. They don't say why they chose this route, but i assume they thought they would get more physiologically relevant shapes from live than dead cells; not always a safe assumption given the stress that imaging imposes on cells. Especially since they used DNA and membrane stains on them while alive!

Usually, the reason you do live imaging is because you want to see the cells moving or doing stuff. Adding a time dimension to this dataset would be amazing. But they didn't do that. Maybe that comes next?

Bit of sexy machine learning (initially reported in another paper but used here):

> The tightly packed, epithelial-like nature of hiPS cells, as well as the need for highly accurate 3D cell boundaries to minimize the misassignment of cellular structures to neighbouring cells required deep-learning-based segmentation approaches to create a robust, scalable and highly accurate 3D cell and nuclear segmentation algorithm

But the core of the analysis was done with classy traditional techniques:

> We aligned all cells along their longest axis in the xy plane, preserving their biologically relevant, epithelial-like apical–basal axis. We then used a spherical harmonic expansion (SHE) 17,18 to accurately parameterize each 3D cell and nuclear shape with a set of orthogonal periodic basis set functions, defined on the surface of a sphere (Fig. 2a and Extended Data Fig. 3). The joint vectors for all cells (578 SHE coefficients) were then subjected to PCA. We found that the first eight principal components represented about 70% of the total variance in cell and nuclear shape (Fig. 2b).
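To make the last step of that quote concrete, here is a minimal sketch of the "how many principal components reach 70% of the variance" calculation. It is not the authors' code: the data is synthetic (8 hidden factors plus noise, an assumption chosen only so the numbers behave similarly), the spherical harmonic fitting itself is skipped, and the function names are made up for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return the fraction of total variance captured by each principal component.

    X has one row per cell and one column per shape coefficient.
    """
    Xc = X - X.mean(axis=0)                   # center each coefficient
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    variances = s ** 2                        # proportional to per-component variance
    return variances / variances.sum()

def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ratios))

# Synthetic stand-in for the real data (an assumption for illustration only):
# 500 "cells", 578 coefficients, driven by 8 hidden shape factors plus noise.
rng = np.random.default_rng(0)
n_cells, n_coeffs, n_factors = 500, 578, 8
factors = rng.normal(size=(n_cells, n_factors))
loadings = rng.normal(size=(n_factors, n_coeffs))
X = factors @ loadings + rng.normal(size=(n_cells, n_coeffs))

ratios = explained_variance_ratio(X)
k = components_needed(ratios, 0.70)
print(f"components needed for 70% of variance: {k}")
print(f"variance captured by first 8 components: {ratios[:8].sum():.2f}")
```

On real shape coefficients the interesting part is interpreting what each component looks like; the variance bookkeeping above is the easy bit.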

However, this work follows the common pattern of heavily mathematical cell biology, in that it gives a rigorous, numerical, data-driven demonstration of the bleeding obvious:

> We performed a hierarchical clustering analysis of these correlation values to create a purely data-driven ‘average pairwise spatial interaction map’ of cellular structures. Notably, we found that the cellular structures clustered naturally into an ordered radial compartmentalization of the cell, from the centre of the nucleus outward (Fig. 3d), and also separated between the apical and basal domains of the cell. The six top-level clusters included structures localized to the nucleus, nuclear periphery, cytoplasm, apical domain (in a dispersed way), cell periphery and basal domain, respectively.
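For readers unfamiliar with the technique in that quote, here is a rough sketch of hierarchical clustering on a correlation matrix. It assumes average linkage with distance = 1 - correlation, which may not match what the paper used, and the six structures and their correlation values are invented toy numbers, not results from the dataset.

```python
import numpy as np

def average_linkage_clusters(corr, n_clusters):
    """Agglomerative clustering with distance = 1 - correlation (average linkage).

    Repeatedly merges the two clusters with the smallest mean pairwise distance
    until only n_clusters remain. Returns lists of row indices.
    """
    dist = 1.0 - np.asarray(corr, dtype=float)
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical labels and toy correlations: two groups of three structures
# that correlate strongly within a group and weakly across groups.
names = ["nuclear_A", "nuclear_B", "nuclear_C",
         "peripheral_A", "peripheral_B", "peripheral_C"]
corr = np.array([
    [1.00, 0.80, 0.70, 0.10, 0.00, 0.10],
    [0.80, 1.00, 0.75, 0.05, 0.10, 0.00],
    [0.70, 0.75, 1.00, 0.10, 0.05, 0.10],
    [0.10, 0.05, 0.10, 1.00, 0.85, 0.70],
    [0.00, 0.10, 0.05, 0.85, 1.00, 0.80],
    [0.10, 0.00, 0.10, 0.70, 0.80, 1.00],
])

groups = average_linkage_clusters(corr, n_clusters=2)
for g in groups:
    print([names[i] for i in g])
```

With a block-structured matrix like this one, the two groups fall out regardless of linkage choice; real data is where the choice of linkage and cut height starts to matter.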

And so on in the same utterly unsurprising vein for several figures. But i will take their word for it that this is new:

> Unexpectedly, the variance in nuclear speckle (SON) volumes was most uniquely attributable to the nuclear surface area and not the nuclear volume, although speckles localize throughout the nucleoplasm. This is notable in light of the possible connection between transcript splicing (which occurs at nuclear speckles) and increased rates of nuclear export 25.
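One common way to put a number on "uniquely attributable" is to compare the R² of a regression using all predictors with the R² after dropping one of them. The sketch below does that on synthetic data where the outcome is built from surface area by construction. Whether this matches the paper's exact procedure is an assumption, and all variable names and numbers here are invented for illustration.

```python
import numpy as np

def r_squared(predictors, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()

def unique_variance(named_predictors, y):
    """For each predictor: R^2 of the full model minus R^2 without that predictor."""
    full = r_squared(named_predictors.values(), y)
    result = {}
    for name in named_predictors:
        others = [v for key, v in named_predictors.items() if key != name]
        result[name] = full - r_squared(others, y)
    return result

# Synthetic example (not real measurements): volume and area are correlated,
# and the "speckle volume" is generated from area plus noise.
rng = np.random.default_rng(1)
n = 2000
volume = rng.normal(size=n)
area = 0.8 * volume + 0.6 * rng.normal(size=n)
speckle = area + 0.5 * rng.normal(size=n)

unique = unique_variance({"volume": volume, "area": area}, speckle)
for name, value in unique.items():
    print(f"unique variance explained by {name}: {value:.3f}")
```

Because volume and area are correlated, each one alone explains a fair amount of the outcome; the subtraction is what separates the shared part from the part only one predictor can account for.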

I suppose the point of this work is not to find interesting facts about the locations of some well-known proteins inside cells, but to develop and validate a tool for studying the locations of proteins inside cells. If someone is studying some new and exciting protein, they can now drop it into this pipeline and see how it's located inside the cell, in a rigorous quantitative way. If they're friends with someone at the Allen Institute, that is. Still, i'm not sure how much more actual knowledge you would gain by doing that rather than just growing a single coverslip, staining it, and spending half an hour panning around on a bog standard microscope.

Especially given how long this took:

> the three years of data acquisition of the WTC-11 hiPSC Single-Cell Image Dataset v1

!

But this is very cool, plaudits to all involved:

> Custom codes were central to the conclusions of the paper. All necessary code to reproduce the results in this paper has been deposited in GitHub. This includes code for downloading our datasets, single-cell feature extraction, cellular parameterization and organelle size scaling. Jupyter notebooks to reproduce the figures shown in the paper are also provided.

Seriously, control-F for that section, there's links to every little bit on GitHub.

Also, control-F for the supplementary videos if you don't fancy reading so many words.

Methodological nerd shit:

> The spinning-disk confocal microscopes were equipped with[...] two Orca Flash 4.0 cameras (Hamamatsu).

Those are CMOS cameras! Using something other than CCD would have been unthinkable back in my day.

> After the selection of FOV position from the well overview acquisition, the DNA of cells was first stained for 20 min with NucBlue Live (Thermo Fisher Scientific). Then the cell membrane was stained with CellMask Deep Red (CMDR, Thermo Fisher Scientific) in the continued presence of NucBlue Live for an additional 10 min, and cells were washed once before imaging for a maximum of 2.5 h.

NucBlue Live is Hoechst 33342, so nothing special. Couldn't find out exactly what CellMask Deep Red is, except that "CellMask™ plasma membrane stains are amphipathic molecules providing a lipophilic moiety for excellent membrane loading and a negatively charged hydrophilic dye for “anchoring” of the probe in the plasma membrane". I would be mildly concerned that either of these would have some toxic effects on the cells. Probably not a huge deal. Would be nice to see controls for that, though.


Great stuff. What's your take on the cell painting assay from Anne Carpenter's lab? https://www.biorxiv.org/content/10.1101/2022.07.13.499171v1


It’s great and everyone should use it ;)

(I’m a co-author)


Great work, thank you. I understand it's limited to 5 channels because of the microscopes, and to those specific stains for backwards compatibility. What are your personal ideas for the highest-information-gain staining targets that aren't being used yet, and is there any microscopy tech on the horizon that will allow more multiplexed channels in the near future?


Would like someone to compare this with Michael Kevin's work on bioelectricity as a basis for organism morphology?

https://youtu.be/jLiHLDrOTW8


Michael Levin*


For those interested, we did something similar a few years ago but for proteins in dividing cells where we could quantify how much of a protein is where in the cell over time during cell division [0]. Data [1], code [2] and web-based visualization [3] are available.

Although technology has improved, building the reagents and acquiring data remains labor-intensive. The value of such work would be in having an exhaustive resource (something like 500-600 relevant proteins for cell division) but once the proof of concept is published, you can't get funding for doing more.

[0] https://www.nature.com/articles/s41586-018-0518-z/

[1] https://idr.openmicroscopy.org/search/?query=Name:idr0041-ca...

[2] https://git.embl.de/grp-ellenberg/mitotic_cell_atlas

[3] https://www.mitocheck.org/mitotic_cell_atlas/index.html


Why can’t you get more funding?

Also, and I understand that we never really know the upside with fundamental research, but what’s the upside? Why wouldn’t pharma or biotech see advantages to the research?


Once published, it's not new anymore. It would cost a few million just to repeat the work on a few hundred proteins and unless something new or interesting surfaces, it won't be publishable in a high-visibility place. Also it might have to be done/repeated in different cell lines. I don't know much about what pharma/biotech is interested in but from what I see, getting basic quantitative data like this doesn't seem to be a priority, though they would probably make use of it once available.

Edit to address the upside question: This is dynamic quantitative data with which you would eventually get at how much of a protein interacts with how much of another, where and when during cell division. Basically, this is getting at the dynamics of protein interactions in live cells. The goal would be to build a dynamic molecular interaction network of cell division.



