Nasdaq Acquires Quandl to Advance the Use of Alternative Data (nasdaq.com)
118 points by rainboiboi on Dec 4, 2018 | 82 comments



It's crazy how poor the financial data provider offerings out there are. Most financial data is riddled with inconsistencies, wildly overpriced, and in esoteric formats. Simply ingesting financial data in a reliable manner requires significant engineering.

For something so important to the economy, it's amazing that there isn't a better solution, or that an open standard hasn't been mandated.


I feel this; I email Quandl regularly to fix data errors that the simplest of automated checks should catch ("why is this price 1200% higher than the previous one?").

But, they do have a mostly-decent API (tables; timeseries is pretty bad).

Something that always bugs me is properly adjusting prices when backtesting. The "right" way seems to be how Quantopian now handles it [1], in a just-in-time fashion, but that code isn't in their public libraries, and over email they declined to tell me where they get the data.

[1] https://www.quantopian.com/quantopian2/adjustments


Always store unadjusted prices and volumes.

Keep an updated corp action table with date, corp action type, and adjustment factor.

Corp action type is important because divs adjust prices but not volumes, for example. Splits adjust both.

When you're ready to use an adjusted time-series: select the corp actions you care about and calculate a running product of 1+the adjustment factor. As-of join the adjustment factors, multiply and you're done.
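
A minimal pandas sketch of that recipe (the sample data and the sign convention for the factors are my own; here 1 + factor is the price multiplier, e.g. 1 + factor = 0.5 for a 2:1 split):

    import pandas as pd

    # Unadjusted daily closes - always stored as-is (made-up sample data).
    prices = pd.DataFrame({
        "date": pd.to_datetime(["2018-11-28", "2018-11-29", "2018-11-30", "2018-12-03"]),
        "close": [200.0, 201.0, 101.0, 100.8],
    }).sort_values("date")

    # Corp action table. 2:1 split on 2018-11-30 => 1 + factor = 0.5;
    # $0.50 dividend ex 2018-12-03 on a prior close of 101.0 => 1 + factor = (101.0 - 0.5) / 101.0.
    corp_actions = pd.DataFrame({
        "date": pd.to_datetime(["2018-11-30", "2018-12-03"]),
        "action": ["split", "dividend"],
        "factor": [0.5 - 1.0, (101.0 - 0.5) / 101.0 - 1.0],
    }).sort_values("date")

    def adjust(prices, corp_actions, actions=("split", "dividend")):
        ca = corp_actions[corp_actions["action"].isin(actions)].copy()
        # Running product of (1 + factor), accumulated backwards from the latest action,
        # so each price picks up every action that happens after it.
        ca["cum"] = (1.0 + ca["factor"])[::-1].cumprod()[::-1].values
        # As-of join: each price date gets the combined factor of the next action strictly after it.
        out = pd.merge_asof(prices, ca[["date", "cum"]], on="date",
                            direction="forward", allow_exact_matches=False)
        out["adj_close"] = out["close"] * out["cum"].fillna(1.0)
        return out

    print(adjust(prices, corp_actions))

Adjusted volumes would use only the split rows, with the factor inverted (a 2:1 split doubles all prior volumes).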


I stand corrected; the code for on-the-fly adjusting _is_ in Zipline, but you have to know that stock splits are treated like dividends, which wasn't obvious to me.

At least some of the data they use comes from a vendor they aren't able to name publicly.


For my current job, we wanted to get a mapping of stock tickers and exchanges to CUSIPs. Every provider we looked at — and this is fundamental trade data — was full of errors and missing values. Couple that with the extortion that is CUSIP (you can't use CUSIP values without a license from them, and licenses start at $xx,xxx+). It's criminally inept. And when you do fix it up, you don't want to publish it, because you spent all your time and resources fixing it… and it becomes a trade secret.


This is why finance is lucrative, similar to esoteric codes in various types of law. Nothing to do with math models or superior prediction, just paying for someone else to fight through identifier hell, exchange protocol hell, etc., and be able to do some mickey mouse math at the end of it.

Honestly, this stuff is so bad that the headache of it might fully justify huge finance compensation. I've had colleagues turn down huge bonuses and raises in order to leave finance companies, solely to avoid this type of work and seek a career where the headaches bother them less and they are paid less.


Data cleaning/transformation ends up being a huge percentage of the work in pretty much any real-world ML context I'm familiar with. Not unique to finance at all.


I’ve worked for over a decade in industry machine learning, about half of that in quant finance. It is definitely much worse in finance than other fields.

Even medical records do not present the same degree of esoteric data formatting and mismatching. It's not really even a matter of data cleaning. It's that there is _no_ way to clean the data, and the only useful approach is to pay tens of thousands of dollars to data vendors whose products have intractable errors, and then build huge data validation and imputation systems around them.

When it boils down to fiduciary duty to the client, and you have a contractual obligation regarding portfolio composition, then you can’t live with “good enough” data cleaning. Even one single asset with an incorrect identifier from your data vendor can cause you to e.g. invest in an Israeli company in a portfolio with a client obligation to invest in no Israeli companies (that is a real example I encountered before).


I come from the non-technical side of things. Do you know of any resources that would cover this issue, but for someone on the business side?

Not an engineer, so while I understand this in a general/abstract sense, my understanding is limited to, "Cleaning/transformation is messy and a time sink due to non-standardization of data."


One good example I uncovered a while back was that Bloomberg timestamped its crude oil futures data by finding the last trade to occur in a given second and rounding down. This means that the user of the data had no idea whether the price used on the 10:30:30 AM print occurred at 10:30:30.001 or 10:30:30.999. Obviously, this could create problems if you thought you had found a lead/lag relationship between, say, oil and oil stocks.

Similarly, say a vendor aggregated website visits/pageviews but didn't account for the fact that 1/3 of the traffic was coming from click-bots in developing countries. If they presented you with the raw data you could figure it out and filter those countries out, but if it is aggregated, you might not discover the issue.

Then, there could be even simpler ones, like determining the opening price for a stock. If, say, the first print of stock XYZ trades 10 shares at a price of $20, but a millisecond later 100k shares trade at $20.11, which print should you use in your simulation algorithms as the opening print?


Did you look at Factset's datafeed? I've found its reference data and symbology to be pretty reliable. Cusips will cost a lot with redistribution charges though. You're better off avoiding them if possible.


Yeah, we did look at Factset. Ultimately we found repeated gaps in their symbology, since we needed a full set, including less commonly used symbols.


I agree. CUSIP is also a problem for the private individual (meaning all data needs to be free to use). While I have found a mapping online, I have no idea how accurate it is and have to trust that this (presumably unaware) provider QAs the data.


I really would like to see something like Bloomberg's OpenFIGI take the place of CUSIPs, but it's not nearly as widely used. https://www.openfigi.com/ The API does allow you to convert from CUSIPs, though.
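
For anyone curious, the mapping call is a simple JSON POST. A rough sketch (the endpoint version and response fields may have changed, so check the OpenFIGI docs; the CUSIP below is just an example):

    import requests

    # Map a CUSIP to FIGIs via OpenFIGI's mapping endpoint.
    # An API key in the X-OPENFIGI-APIKEY header raises the rate limits but isn't required.
    jobs = [{"idType": "ID_CUSIP", "idValue": "037833100"}]  # example: Apple's CUSIP
    resp = requests.post("https://api.openfigi.com/v3/mapping", json=jobs)
    resp.raise_for_status()

    for job, result in zip(jobs, resp.json()):
        for rec in result.get("data", []):
            print(job["idValue"], rec.get("figi"), rec.get("ticker"), rec.get("exchCode"))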


Yep. OpenFIGI plus using LEI codes seems like the best practice to move forward.


For anyone curious about esoteric formats, check out some of the documentation for financial data providers.

CRSP[1] is pretty much regarded as the highest quality pricing data in the US, with stock prices going back to 1925. The database API is written for C and FORTRAN-95.

Data providers also have a habit of providing their own proprietary security IDs, or just mapping to tickers. So if you're trying to build a database with several providers, you have to wrangle together 15 different security identifiers, taking care of mergers/acquisitions, delistings, ticker recycling, etc. It is a fun exercise.
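
To make the "fun exercise" concrete, here's a sketch of the usual workaround - a point-in-time security master that every vendor feed gets joined through (table layout and names are mine, not any particular vendor's):

    import pandas as pd

    # Hypothetical point-in-time security master: one internal ID per security,
    # with validity windows for each external identifier (handles ticker recycling,
    # symbol changes after mergers, delistings, etc.).
    security_master = pd.DataFrame({
        "internal_id":   [1001, 1001, 2002],
        "vendor":        ["vendor_a", "vendor_a", "vendor_a"],
        "vendor_symbol": ["OLDCO", "NEWCO", "OLDCO"],   # "OLDCO" recycled by a different company
        "valid_from":    pd.to_datetime(["2000-01-01", "2015-06-01", "2016-03-01"]),
        "valid_to":      pd.to_datetime(["2015-05-31", "2099-12-31", "2099-12-31"]),
    })

    def map_to_internal(feed, master, vendor):
        """Attach internal_id to a vendor feed using the identifier valid on each row's date."""
        m = master[master["vendor"] == vendor]
        merged = feed.merge(m, left_on="symbol", right_on="vendor_symbol", how="left")
        in_window = (merged["date"] >= merged["valid_from"]) & (merged["date"] <= merged["valid_to"])
        return merged[in_window][list(feed.columns) + ["internal_id"]]

    feed = pd.DataFrame({
        "date": pd.to_datetime(["2014-01-02", "2017-01-03"]),
        "symbol": ["OLDCO", "OLDCO"],   # same ticker, two different companies
        "close": [10.0, 55.0],
    })
    print(map_to_internal(feed, security_master, "vendor_a"))  # maps to 1001 and 2002 respectively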

[1] http://www.crsp.com/files/programmers-guide.pdf


Any advice on where an individual could purchase (even limited) access to CRSP data?

I'm working on a data-driven financial analysis blog and can't seem to find decent time-series fundamentals data now that Yahoo and Google have taken down their APIs. Everything I find seems to be a $1000+ yearly subscription.


Spoke to these guys a while back. Asked for examples of real alternative data they had... one interesting one was flight data for private jets, labeled with which company owned them. The theory being: if the CEO of company X keeps visiting a place near company Y, there may be an acquisition or merger in play.


I wonder if anyone did/could use it to buy real estate before HQ2 was announced. I don't know if the person in charge of finding the real estate was senior enough to fly private.


Scott Galloway called the HQ2 locations back when it was first announced that Amazon was looking for one.

Also I remember hearing about some Amazon employees buying real estate once the terms were being finalized...



Out of curiosity, at what point could things be considered insider trading / insider data?


Speaking from experience: it's not illegal insider trading unless you violate a confidentiality agreement or fiduciary duty.

I specifically say "illegal insider trading" because insider trading is not intrinsically illegal. The SEC distinguishes between insider trading and illegal insider trading (and by extension, so does the compliance department of every investment firm, bank and hedge fund). If, through your own research, you discover information which is both material and nonpublic, and you proceed to trade on that information, you are insider trading. However it is not illegal unless you have thereby broken an agreement or duty (namely with the company itself, its affiliates or your own clients) at any part of the process.

In practice this usually means the information is tainted if any of the following is true:

1) you have a fiduciary duty to the shareholders of the company in question,

2) you are employed by, or contract for, the company in question,

3) you are employed by, or contract for, an affiliate of the company (such as a vendor),

4) you disobeyed terms and conditions of service related to use of the product related to the information you found.

Obviously the standard disclaimers apply: I am not a lawyer, don't take potential legal advice from a random HN commenter, etc.

Source: I used to work in financial forecasting using alternative data.


Here is an interesting piece from yesterday about insider trading by Matt Levine: https://www.bloomberg.com/opinion/articles/2018-12-03/inside...


Yeah, Matt Levine explains what could constitute insider trading in plain language. I feel sorry for the potato farmer. Always good reads from him about finance, financial markets, and life in general.


In America, this kind of research is explicitly encouraged, and very much NOT insider trading according to the SEC. Insider trading has to involve _theft_, not just insider knowledge.

If you overhear someone talking about an impending acquisition in a coffee shop, and you trade on that information, you're quite safe in the US. European countries can and do consider that insider trading, though.


I thought that inside info was non-public material info, and air-traffic data is not non-public per se, so it's fair game. No?


No. You can trade on material, nonpublic data as much as you'd like. Insider trading is not illegal unless you're breaking a confidentiality agreement or fiduciary duty.

If you manage to discover confidential data in a way that does not compromise such an agreement or duty, you're fine. Obviously you should engage with an attorney instead of taking legal advice from a random HN comment, but there's really no issue with this. Information asymmetry is a fundamental part of the market and not illegal on its own.

Source: I used to work in financial forecasting using significant amounts of alternative data.


Yeah, "non-public, material information" is what I'm used to seeing, and I think that classifies this air-traffic data just fine. The coffee shop example, however, sounds like it's both non-public and material, yet in the US is not considered insider trading. So you might say "non-public material information" is necessary, but not sufficient, grounds for an insider trading case.


Well, it is publicly available in that anyone can subscribe and get the data. I'd venture to guess anyone can get the tail number flight patterns from the FAA. The value add is matching that to who owns or is leasing the plane.


Really? I did not know that at all. I'd say legal insider information is a very, very interesting market, then.


Well, the whole point of markets is to incorporate information into the price. So if I've done research and think company X is undervalued, I'm going to buy its stock, the price is going to go up, and now that research of mine is better incorporated into the stock price of X.

As long as I didn't _take_ that info from anyone (e.g. I can't hack emails, or bribe an accountant, etc. etc.), this is markets behaving exactly as we want them to.


Oh, okay. I think I see now. The stock market has always fascinated me and I admit I understand very little of it. I wouldn't know where to start learning about the ins and outs of it.


I'm short on good resources, but you might enjoy Matt Levine's Money Stuff newsletter; I always look forward to reading it, and all of this "what is insider trading" comes directly from his writing in recent weeks.

You can read some of them on Bloomberg, or get the emails each day for free here: http://link.mail.bloombergbusiness.com/join/4wm/moneystuff-s...


This article is just for you!

Is Spying on Corporate Jets Insider Trading? https://www.cnbc.com/id/100272132


Gekko's going to buy Anacott Steel!


We're in the midst of a data gold rush. People who have data are struggling to monetize it. If you're a data buyer, you're probably swamped with the quantity and breadth of data providers out there. AI/ML techniques to make sense of this data are still only scratching the surface. I think this is where there is a lot of low-hanging fruit: creating services or tools that allow non-CS/non-Quant people to extract insights from TBs of data...

On the exchange side: these guys are always on the prowl for hot new properties to scoop up. The traditional business model of simply earning fees on exchange trading has been slowly eroding away for the last 10 years. So they need to branch out into services and other data plays...


Alternative take: there isn't that much low hanging fruit there.

Hear me out.

"To the person who only has a hammer, everything looks like a nail."

The data in front of you is the data you want to analyze, but it doesn't follow that it is the data you ought to analyze. I predict that most of the data you look at will result in nothing. The null hypothesis will not be rejected in the vast majority of cases.

I think we -- machine learning learners -- have a fantasy that the signal is lurking and that if we just employ that one very clever technique it will emerge. Sure, random forests failed, and neural nets failed, and the SVR failed, but if I reduce the step size, plug the output of the SVR into the net, and change the kernel...

Let me give an example: suppose you want to analyze the movement of the stock market using the movement of the stars. Adding more information about the stars, and more techniques, may feel like you're making progress, but you aren't.

Conversely, even a simple piece of information that requires minimal analysis (this company's sales are way up and no one but you knows it) would be very useful in making that prediction.

The first data set is rich, but simply doesn't have the required signal. The second is simple, but has the required signal. The data that is widely available is unlikely to have unextracted signal left in it.


I've been selling good data in a particular industry for three years. In this industry at least, the so-called "low-hanging fruit" only seems low-hanging until you realize that the people who could benefit most from the data are the ones who are mentally lazy and least likely to adopt it. Data has the same problems as any other product and may even be harder, because you need to 1) acquire the data and 2) build tools that reliably solve difficult problems using huge amounts of noisy information...


Isn't there utility in accepting the null hypothesis? It's almost as valuable to know that there is no signal in the data as it is to know the opposite, i.e., knowing where not to look for information.

I think your example is really justifying a "machine learner" that has some domain expertise and doesn't blindly apply algorithms to some array of numbers.


I think his argument is that some null hypotheses can be rejected out of hand, but that people are wasting time and effort obtaining evidence that, if they had better priors, would be multiplied by 0.0000000000001 to end up with an insignificant posterior. That's what the astrology example indicates.
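
Spelled out as a Bayes update (numbers made up, in the spirit of the astrology example):

    P(\text{signal} \mid \text{evidence}) = \frac{P(\text{evidence} \mid \text{signal}) \, P(\text{signal})}{P(\text{evidence})}

If your prior P(signal) for "star positions move stock prices" is on the order of 10^-13, then even evidence that is 100 times more likely under the hypothesis only moves the posterior to roughly 10^-11, which is still effectively zero.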


The effort to evaluate the null hypothesis can be costly. In the competitive environment found in most hedge funds, how would you allocate resources to accepting the null hypothesis?

As in, if you worked at a data acquisition desk, and spent a quarter churning through terabytes of null hypothesis data, what's your attribution to the fund's performance?



Accepting the null hypothesis has utility only if you have some reason to believe it would not be accepted.

Accepting it per se has no particular value. You could generate several random datasets, and accept/reject the null hypothesis between them ad infinitum.

To put it another way, it's only interesting if it's surprising.


Bingo. You nailed it. I work in finance. Developed markets have efficient stock markets. They are highly liquid. The reality is that there are a lot of people competing for the same profits. When there are that many players, if there's a profit to be had from a dataset you can buy from a vendor, chances are one of your many competitors already bought it and found it. This is why we now say don't try to beat the market; you likely can't, and mostly just need to get lucky by having the right holding when an unforeseen event occurs. Too many variables at play that we just don't understand. Most firms are buying these datasets to stay relevant, but they really make no difference in their actual investing strategies.


This is where you might use a genetic algorithm to learn which data to use for a particular prediction. Good AI won't use all the data, just trim it down to the signal.
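
A toy sketch of that idea - a small genetic algorithm that evolves a subset of features (think: data sources) by cross-validated score. The data here is synthetic and the hyper-parameters are arbitrary:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X, y = make_regression(n_samples=300, n_features=40, n_informative=5, noise=10.0, random_state=0)

    def fitness(mask):
        if not mask.any():
            return -np.inf
        # Cross-validated score using only the selected columns ("data sources").
        return cross_val_score(Ridge(), X[:, mask], y, cv=3).mean()

    n_features, pop_size, n_gen = X.shape[1], 30, 20
    pop = rng.random((pop_size, n_features)) < 0.3          # random initial subsets

    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])

        def pick():
            # Tournament selection: keep the better of two random individuals.
            i, j = rng.integers(pop_size, size=2)
            return pop[i] if scores[i] > scores[j] else pop[j]

        children = []
        for _ in range(pop_size):
            a, b = pick(), pick()
            cut = rng.integers(1, n_features)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < 0.02             # mutation: occasionally toggle a feature
            children.append(np.where(flip, ~child, child))
        pop = np.array(children)

    best = pop[np.argmax([fitness(ind) for ind in pop])]
    print("selected feature indices:", np.where(best)[0])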


I would like to see a use case where AI selects a data source that humans would never consider.


It's about weighting relative importance, especially in conjunction with multivariate information that may be correlated.


I read a neat criticism of AI techniques. The author pointed out that humans can pick out a strong signal as well as or better than AI. Humans could also pick out a signal from an array of weak sources. AI would identify that case with fewer weak signals required, but it was hard to trust because it was sometimes wrong.

I wish I could remember the source. I’m sure it was an article here a few years ago. I want to say it was medical diagnosis based on charts.

Anyway, the point was that there is a very narrow valley where AI is useful beyond an expert. And that valley is expensive to explore. And there might not be anything there.


For finance in particular, I'd say we're drowning in a massive volume of shitty data.

A client of mine purchases several fundamental feeds from Quandl, and I email them regularly to point out errors. Not weird, hard, tricky errors, but stuff like "why are all these volumes missing" or "there's a 1-day closing price increase of 1200%" or "you divided when you should have multiplied". This tells me neither Quandl nor the original provider (e.g. Zacks) do any serious data validation, despite claiming to.
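
For what it's worth, the checks I'm talking about really are simple - something along these lines (the column names, the split-factor convention, and the thresholds are all made up for illustration):

    import pandas as pd

    def validate_prices(df):
        """Flag rows in a (date, ticker, close, volume, split_factor) feed that need a human look.
        Assumes split_factor is the multiplier (2.0 for a 2:1 split), 1.0 or NaN otherwise."""
        df = df.sort_values(["ticker", "date"]).copy()
        problems = []

        # Missing volumes.
        problems.append(df[df["volume"].isna()].assign(issue="missing volume"))

        # Implausible one-day moves (e.g. a "1200% price increase") not explained by a corp action.
        ret = df.groupby("ticker")["close"].pct_change()
        suspicious = (ret.abs() > 2.0) & (df["split_factor"].fillna(1.0) == 1.0)
        problems.append(df[suspicious].assign(issue="one-day move > 200% with no corp action"))

        # Split factors that look inverted (divided when they should have multiplied, or vice versa):
        # the factor should roughly cancel the raw price jump across the split date.
        jump = ret + 1.0
        has_split = df["split_factor"].notna() & (df["split_factor"] != 1.0)
        mismatch = has_split & ((jump * df["split_factor"] - 1.0).abs() > 0.5)
        problems.append(df[mismatch].assign(issue="split factor inconsistent with price jump"))

        return pd.concat(problems, ignore_index=True)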

If the companies people have been paying for decades for this data get it wrong this often, how can I trust any weirder data they're trying to sell me? I thought the point of buying these feeds was to let the seller worry about quality assurance.


This doesn't matter - any sophisticated user will have their own software to clean the data anyway. Their concern is getting the data; they know how to clean it once they have it.


We're not talking about data cleaning, but about data validation. I can fix a weirdly formatted field (cleaning), but I can't reliably impute most kinds of missing data. I can detect errors, but can't fix most of them without additional information...which is exactly what I'm paying the provider for.


You can; there are ways to do it: interpolation, etc. Sometimes the data is missing just because it's not available, and you still have to handle that case. The proper way of filling in this missing data will depend on what you are using it for - so for the provider to do it would actually be kind of wrong.
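
A trivial sketch of what that looks like (and of why "proper" depends on the downstream use):

    import numpy as np
    import pandas as pd

    s = pd.Series([100.0, np.nan, np.nan, 103.0, 104.0],
                  index=pd.date_range("2018-12-03", periods=5, freq="B"))

    filled_ffill = s.ffill()                 # carry the last known price forward
    filled_linear = s.interpolate("linear")  # straight-line fill between known points
    # For returns-based work you might instead drop the gap entirely rather than invent prices.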


I think we're talking past each other here. I don't expect the provider to do imputation for me, but I shouldn't have to bug them to get the best version of the data they have. Sure, sometimes missing is missing, but in my experience with Quandl/Zacks, it's usually an error on their end. The price jumps are sometimes because they conflated two different tickers. If they divided instead of multiplying (split factors), I have to have external information to even detect the error! Same goes if they get a date wrong somewhere.


This is what people in this thread don't really understand: investors want the raw feed. There's nothing to be gained from an aggregated, cleaned feed that everyone has.


You are right, extracting insights from data is low-hanging fruit. From what I observe, there is a huge lack of proper services and tools that can automatically produce insights. There are of course automatic machine learning solutions, but they focus more on machine learning model tuning (in the Kaggle style) rather than giving users understanding and awareness of their data.


I think data scientists need to produce more actionable insights as opposed to living in their own world. I suspect there will be a rising group of people who can understand data science techniques and communicate them effectively to drive business decisions. These people will be the ones who clinch the top posts.


I've been running an automatic machine learning SaaS for 2 years, and after this time, I can tell you that it is a huge problem that data scientists are living in their own world (including me! and including data science tools).

I had a situation like this: my user created 50 ML models (xgboost, lightgbm, NN, rf) and an ensemble of them. Let's say that the single best tuned model was 5% better than the single best model with default hyper-params, and the ensemble was 2% better than the best tuned single model. For me it was a huge success, but the user didn't care about model performance. He wanted insights about the data, not a tuned black box.


I understand every single word you said and fully agree with you. Good point!


In an interview, Ryan Caldbeck from Circle Up describes two categories of models: brainy models and brawny models. The ensemble described above sounds like a brawny model: you don't care how it made the decision, you're glad it did the heavy lifting and you might even double check the result.

However, the user's concern about the black box suggests they wanted what Ryan refers to as a brainy model, one with explicable decisions. Even within the features of the model there could be things to learn about the data.

How else are data scientists stuck within their own world?


Nasdaq already makes more money on data licensing than on trading fees or IPOs. Each time a professional in the financial services industry wants real-time display data, for example, they have to pay Nasdaq a monthly fee. Nasdaq and NYSE still compete for listings, but less for the trading fees now than because listings make their data licensing packages more valuable.


There is Ocean Protocol (https://oceanprotocol.com/) that lets you sell your data.

There is ChainLink (https://chain.link/) that lets you sell your data via API service through decentralized oracle nodes.

https://blog.goodaudience.com/the-four-biggest-use-cases-for...

Monetization is coming soon... in a big way.


How do these services make it easier to evaluate data? The Medium article starts with a disclaimer about DLT... Talking with investors buying data, one shouldn't be surprised to hear them request uploads to their FTP. Their data teams are overcommitted when it comes to the evaluation side of consuming data. They aren't (yet) resourced like a tech startup.

How should they prioritize learning about ingesting data from a DLT? They have data brokers (like Quandl) coming to them with assurances of frictionless integration, with data they can understand and use, today!


In addition to the FTP, half the work in getting these alt data feeds in finance is getting the metadata right, such as getting each record tied to an entity or security, and knowing with how much of a lead/lag the data becomes available and how soon it's tradable. Quandl helps with the technical friction and also with this metadata and security mapping aspect.


Oh sure, I'm with you there! But how about those blockchain upstarts mentioned above?


I'm calling it here: the most useful data is private, or can't be sold due to confidentiality. The fact that data is confidential is great evidence that we know it is useful, but also that we hope others won't use that signal.


What about the banking initiatives in the EU requiring every bank to open up its data via API? That data seems pretty private and confidential.

https://www.evry.com/en/news/articles/psd2-the-directive-tha...

ChainLink has an option to use Intel SGX along with TownCrier to provide trusted execution at the processor level. That ensures confidentiality without exposing the data at all.

http://www.town-crier.org/


Confidentiality is not an obstacle to using material data. As long as you can independently obtain it through unprivileged research, you're fine to use it, sell it or trade on it.


We also pitched something in that direction with Rlay at Techcrunch Disrupt Berlin: https://techcrunch.com/2018/11/29/rlay-startup-battlefield/


In the ESG use case, how do you measure inter-rater reliability? How do you control for exposure to MNPI?


This is something I've thought about and worked on over the last several years. I'm more than happy to have an in-depth conversation on it if anyone is interested.


Fwiw that's what I have been trying to do for the past few years, infrastructure for easier access to algorithmic trading.

Shameless plug: https://KloudTrader.com/narwhal


I've been researching this topic, alternative data, for some time now, and I'm not surprised by this move, since Nasdaq is a large provider of software (e.g. market-making software, amongst dozens of other products):

QUANDL SPECIFIC:

-Quandl has a pretty decent blog that I would check out; you never know when some newly enacted corporate policy might get rid of it: https://blog.quandl.com/

GENERAL NOTES:

-More and more asset managers are using it, and there is some worry that everyone is drawing the same conclusions off the same data sets, and thus there is no money to be made. Though most practitioners say this is a non-issue: there are more and more alt. data sets out there to choose from, cleaning the data is tricky, and testing the veracity of the data provided and knowing how to combine it with other sets is a key competitive advantage that not every asset manager is good at.

-The ROI is something that is top of mind but not always easily attributable throughout the year, e.g. one large insight very late in the financial year can bring +100x returns on what was paid for a data provider's software.

-Hugely successful funds like Renaissance's Medallion have likely been doing this for a long, long time, coupled with top PhDs looking for a lot of statistical correlation with traditional data as well.

-More and more data sets are being created and thrown into self-learning financial models (aka AI), which has a lot of people excited, and certainly there are a lot of small funds being created, though it seems to be mostly by young people or not-so-great hedge fund managers. Getting large investors to lay down significant capital has a huge trust component to it, aka they want to bet only on successful, grey-haired, largely-male folks.

-A lot of alternative data can be found directly from the Bloomberg terminal, e.g. the MAPS <Go> function. However, my understanding is that it's not that deep, quality is an issue, and everyone has access to it (no real competitive advantage).


Any idea as to its valuation?


Given that its last raise was $15M (Canadian) in 2016 (https://www.crunchbase.com/organization/quandl#section-fundi...), and I haven't heard anything about Quandl since then, I'm guessing it's not a 10x exit.


That's probably fair. Quandl started by offering the "everyday" investor API access. I know the typical VC approach is to first get users and then scale, but often in investing/financial data products, it seems better to price high and then move down market. If you study the companies with the most success in the past (Bloomberg, CapIQ, MSCI, Eze, Advent, Factset, Morningstar, etc.), none of them started by trying to cater to the DIY investor.


What is 'alternative data'? The text only says

> 'The company offers a global database of alternative, financial and public data, including information on capital markets, energy, shipping, healthcare, education, demography, economics and society.'

which doesn't really answer the question.


The way I think of alternative data is data about a business or industry that is obtained, collated, and analyzed through non-traditional communication channels, and helps to provide a better picture of how a company or industry is doing than just relying on trade data and financial statements. The best example of this I can think of is companies scraping AliBaba at certain frequencies, trying to ascertain the movement of certain products or raw materials. This data is then sold to investment firms and hedge funds, because they feel it gives them an edge.

One company that operates in this space is YipitData. From what I've been told, they started as something similar to GroupOn but then pivoted to this space after scraping for their own competitive intelligence reasons.


One example could be combining credit card data and location data to try and infer whether bad weather affected same-store sales. Another use could be determining if a company was emailing loss-leading discount promos at the end of the quarter to juice its sales growth. Another could be collecting Tesla VINs to see if it is hitting its production targets. In the last case, Bloomberg has made this available for free:

https://www.bloomberg.com/graphics/2018-tesla-tracker/


Alternative data is non-financial data which can be tied to various securities.

Financial data, for example, would be EUR USD spot prices. Non-financial data (i.e. "alternative data") could be healthcare reports which you could theoretically couple to e.g. pharma stocks.


There are quite a few, and I can think of these off the top of my head:

- Real-time weather data from major ports and across the main shipping lines

- Telemetry from crop and soil report systems

- Up-to-date satellite imagery of basically anything large under construction (solar farms, factories, ...)

Provide information like that in a machine-readable, consistent format and you have a business.

Btw... Using satellite images to track car manufacturers' inventory levels is an old idea, used for more than a decade.


I had no idea Nasdaq did acquisitions as well. Maybe that's just the engineer in me...



