Apple will probably add this feature and more to Stacks on macOS (a multimodal model would be very useful there). Even better: I expect Apple to use ML and local models to scan file contents and surface them in search (e.g., in Spotlight or Raycast, search for the picture of my latest receipt that I saved __somewhere__ I don't remember).
A few months ago, while using search in Finder, I noticed that it would return images that had the search term in the image itself. They seem to be doing something with ML already.