Yes, this is part of the roadmap. Right now, we don't have any interesting query to showcase that would use these columns - they are here because they are part of the original dataset and we are keeping them for when we demo further releases.
Regarding the theme of joins, I would just add that QuestDB also has a concept called SYMBOL, which lets you store repetitive strings in a table as integers under the hood while still letting you work with the strings directly. This means that instead of storing something like 'sensor_id' and then looking up the corresponding 'sensor_name' with a join, you can manipulate 'sensor_name' directly and skip the join altogether.
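To make this concrete, here is a minimal sketch of what that might look like from Java over QuestDB's Postgres wire protocol (assuming the Postgres JDBC driver is on the classpath; the table, column names, sample data and connection settings below are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SymbolExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details - adjust to your local QuestDB instance.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
                 Statement st = conn.createStatement()) {

                // 'sensor_name' is a SYMBOL: stored internally as an integer,
                // but written and read as a plain string.
                st.execute("CREATE TABLE IF NOT EXISTS readings " +
                        "(sensor_name SYMBOL, temperature DOUBLE, ts TIMESTAMP) " +
                        "timestamp(ts)");
                st.execute("INSERT INTO readings VALUES ('kitchen', 21.5, now())");

                // Filter on the symbol column directly - no lookup table, no join.
                try (ResultSet rs = st.executeQuery(
                        "SELECT sensor_name, temperature FROM readings " +
                        "WHERE sensor_name = 'kitchen'")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getDouble(2));
                    }
                }
            }
        }
    }

The query filters on sensor_name directly; the comparison happens on the symbol's integer id in the background, with no second table and no join.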
We will definitely make QuestDB distributed in the near future. Today, our focus is to extract the maximum performance out of a single node. If we start scaling out too early, it may become harder to improve performance after the fact. So all the improvements we make today are foundational: they will have multiplicative effects once we start to scale out.
One step beyond good Java GCs is to write fully zero-GC Java code. The advantage is complete control over performance, which means your software is going to be consistently fast. The disadvantage is that it is relatively difficult to achieve.
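To give a flavour of what that means in practice, here is a minimal sketch of one common zero-GC pattern, object reuse via a flyweight over a pre-allocated buffer; this is an illustration of the general technique, not QuestDB's actual code:

    import java.nio.ByteBuffer;

    // One common zero-GC technique: allocate once, reuse forever. Instead of
    // returning a new object per record, a mutable "flyweight" is re-pointed
    // at each record, so the steady-state hot path allocates nothing.
    final class PriceFlyweight {
        private ByteBuffer buf;
        private int offset;

        PriceFlyweight of(ByteBuffer buf, int offset) {
            this.buf = buf;       // re-point instead of allocating
            this.offset = offset;
            return this;
        }

        long timestamp() { return buf.getLong(offset); }
        double price()   { return buf.getDouble(offset + Long.BYTES); }
    }

    final class Reader {
        private static final int RECORD_SIZE = Long.BYTES + Double.BYTES;
        private final PriceFlyweight flyweight = new PriceFlyweight(); // allocated once, up front

        double sumPrices(ByteBuffer records, int count) {
            double total = 0;
            for (int i = 0; i < count; i++) {
                // no 'new' inside the hot loop
                total += flyweight.of(records, i * RECORD_SIZE).price();
            }
            return total;
        }
    }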
If you want to see an example of fully zero-GC Java, you can check out QuestDB on GitHub [1] (disclaimer: I work for QuestDB).
> One step beyond good Java GCs is to write fully zero-GC Java code. The advantage is complete control over performance, which means your software is going to be consistently fast. The disadvantage is that it is relatively difficult to achieve.
I don't know that it's actually possible in the general case, as Java's support for value types remains wholly insufficient. IIRC the ixy folks never managed to remove all allocations from the Java version.
Thanks for sharing. Do I assume correctly that this tool takes in dividend expectations and not only announced dividends?
Building a portfolio around dividend expectations is definitely useful for many reasons. But predicting dividends is hard. I imagine the dividend payouts for upcoming seasons will be far off previous seasons, and way short of previous analyst forecasts. How do you build or source your forecast levels for this tool, and how do you adjust expectation levels in light of the current economic context?
Thanks for sharing. Do I understand correctly that this requires loading the whole file in memory along with an ordered list of keys? Or is it just the first n bytes that are loaded in memory? If the former, then it seems very expensive in terms of RAM, particularly if your data file has multiple columns.
An alternative I have used is to load the file into a database, sort by the key I want (which only needs the key in memory), and then output the result to a file. It does go through disk, but you can handle larger files since you only need the key in memory, not the whole file.
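For illustration, here is a minimal sketch of that approach in Java, assuming the SQLite JDBC driver is on the classpath and a simple two-column key,value CSV with no quoting (the file names are made up):

    import java.io.BufferedReader;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DbSort {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:sort.db");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS rows (k TEXT, v TEXT)");

                // Load the file into the database in one batched transaction.
                try (BufferedReader in = Files.newBufferedReader(Paths.get("input.csv"));
                     PreparedStatement ins = conn.prepareStatement("INSERT INTO rows VALUES (?, ?)")) {
                    conn.setAutoCommit(false);
                    String line;
                    while ((line = in.readLine()) != null) {
                        int comma = line.indexOf(',');
                        ins.setString(1, line.substring(0, comma));
                        ins.setString(2, line.substring(comma + 1));
                        ins.addBatch();
                    }
                    ins.executeBatch();
                    conn.commit();
                    conn.setAutoCommit(true);
                }

                // Let the database do the sorting and stream the result back out.
                try (ResultSet rs = st.executeQuery("SELECT k, v FROM rows ORDER BY k");
                     PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("sorted.csv")))) {
                    while (rs.next()) {
                        out.println(rs.getString(1) + "," + rs.getString(2));
                    }
                }
            }
        }
    }

The ORDER BY is executed by the database, which can spill to temporary files on disk, so the working set stays small even for files that don't fit in RAM.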
Indeed, I load the entire file, but provided I have enough RAM to hold the file with all its columns, loading everything should give me optimal performance. However, I agree with your approach when there isn't enough RAM.
I find the concept interesting and I can see real applications for something like this (delaying someone's access to a file). For example, one could use that to disseminate press releases ahead of time without actually revealing the contents.
That said, it seems the current implementation is impractical because you will "on average" decipher after a certain time, but that time can vary greatly. Are there plans or methods to explore in order to make this more deterministic?
That's a very good question, one I've been thinking about a lot. The current implementation has a uniform distribution of decryption times. Even changing it to a normal distribution would greatly improve this; I'm not certain yet how to achieve that, however.
Interesting project. Out of curiosity, why did you compare against kdb+, since the models are very different? kdb+ is mostly used for in-memory time series, while yours seems to be a graph-oriented DB. Also, why did you choose to build your own language instead of using an existing one [1]?
The reason is that we were positioning ourselves to deal with customers whose financial data was stored as time series. We aren't hoping to compete with kdb+ on speed (which would be hopeless), but we have a prototype of a Constraint Logic Programming [CLP(fd)]-based approach to time queries which is very expressive and which we hope to roll out on hub in the main product in the near future.
The graph database is still in its infancy and there are a lot of graph query languages about. We played around with using some that already exist (especially SPARQL) but decided that we wanted a number of features that were very non-standard (such as CLP(fd)).
Using JSON-LD as the definition, storage and interchange format for the query language has advantages. Since we can marshal JSON-LD into and out of the graph, it is easy to store queries in the graph. It is also very simple to write query libraries for a range of languages by just building up a JSON-LD object and sending it off.
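As a toy illustration of that last point: a client library only needs to assemble a JSON-LD document and send it off. The vocabulary below is made up rather than actual WOQL, and it assumes the org.json library is available:

    import org.json.JSONObject;

    public class JsonLdQueryExample {
        public static void main(String[] args) {
            // Build a JSON-LD document representing a (hypothetical) triple query.
            JSONObject query = new JSONObject()
                    .put("@context", "http://example.org/query/context") // placeholder context URL
                    .put("@type", "Triple")                              // made-up query type
                    .put("subject", new JSONObject()
                            .put("@type", "Variable")
                            .put("name", "Person"))
                    .put("predicate", "rdf:type")
                    .put("object", "ex:Person");

            // Because the query is itself JSON-LD, the same document could be stored
            // back into the graph or sent to a server; here we just print it.
            System.out.println(query.toString(2));
        }
    }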
We are firmly of the belief that datalog-style query languages which favour composability will eventually win the query language wars - even if it is not our particular variety which does so. Composability was not treated as centrally as it should have been in most of the graph languages.
I have two questions regarding the time-series aspect of TerminusDB:
1) Does TerminusDB support raw, one-dimensional time-series data, for example electrocardiogram (ECG) signals? At the moment we have them, including the metadata, in column-based CSV format. FYI, one minute of raw ECG data is around 1 MB.
2) For automated ECG analysis the data is transformed using a two-dimensional time-frequency distribution, and the intermediate data must be kept in memory for feature extraction. I'm just wondering whether TerminusDB can support this intermediate time-frequency format/structure as well? FYI, one minute of time-frequency ECG data transformed from (1) needs around 4 GB of working memory. For real-time analysis of a longer ECG duration from (1), for example 30 minutes (around 30 MB of data), we need around 3 TB of working memory.
We will 100% consider it and have been engaging with the community about the best approach. Cypher is by far the biggest graph query language and its backers seem to have the most weight in the conversation so far, but we are going to try to represent datalog as far as possible. Even if WOQL isn't the end result, we think datalog is the best basis for graph query, so we'll keep banging the drum (especially as most people come to realize how important composability is).