Hacker News new | past | comments | ask | show | jobs | submit login

Can you drop nans, parse strings to floats, apply arbitrary lambdas over a rolling window, take the cosine of those values, and plot a graph against time, in a single line of sane SQL?

Easy in Pandas.

Dataframes are not tables; tables are not dataframes. It's nearly as apples-to-bananas as comparing Python lists to C arrays.




everything you mentioned can be done easily through database schema. window functions work well in SQL. plots are easily done in any BI solution that hooks up to any database.

pandas is just poor man's SQL+BI. pandas stores everything in memory and has many limitations.

in SQL Server I can easily churn through terabyte sized database and get the data I need, because the schema is well designed with partitioned tables, clustered indexes and a well designed SQL takes less than a second to run against >1TB database. It even allows a team of more than 20 people to work with the same SQL and query it simultaneously.

i would love to see how you can analyze 1TB csv file, or a pile of 100 csv files totalling over 10TB where you have yet to discover the schema and how tables join together. and I am doing it with on a simple $700 workstation, not even using any hadoop nonsense


How about working with poorly designed schemas? I work with SQL-Server as well, dealing with legacy data designed around imperative t-sql programming. Our 'BI-Solution', SSRS, crawls on pretty simple queries, where 'hacks' need to be done, joining on same table, all kinds of dirty tricks...

I don't know... I honestly feel like 'BI-Solutions' are a poor-persons Python if you are doing anything more than simple dashboards. Something that can be done in 2 lines of code in a Notebook requires endless fiddling in an IDE, to produce something not easily reproducible.

Aside, I've no experience with Tableau or Power-BI, just know that Crystal Reports and SSRS which are pretty painful.


it's hard dealing with legacy stuff. One alternative I can propose - pitch your management and go get yourself a separate and latest SQL instance just for analytics. Easiest solution you can do is to install SQL Server Developer version which is free.

cherry pick what your need and ETL your data out of legacy systems into your warehouse and run something like tableau/looker/powerbi on top and you will be amazed how effective you can be


Agreed, but again, I ETL my data into a warehouse, as a developer (not a BI person), I'm reaching for spark, flink, or whatever to roll my analytics, and python/flask/d3 for building web dashboards.

Then, once you have 'insight' into your Data, you can easily 'do' something with it without the limitations of a tiered product.


is the database server running on a $700 workstation? how many rows? what types of queries? what is a typical query execution time? interested in your response because I generally find RDBMS performance quite poor, although I've never used SQL server. Pandas gets to be fairly painful after the data size hits 10GB, in my experience. I do think you are missing how pandas fits into a data exploration pipeline for someone who knows how to code - there's plenty of tasks for which SQL+BI is not going to cut it.


Commodity virtualized server with 4 cores and 8GB of RAM, storage is on NAS. We have hundreds of these typical SQL instances.

db has lot of rows, around 20k rows logged per minute and the events are logged 24/7 three years.

Again, because the schema is well designed, I use clustered index on Date to filter and analyze and the engine actually never reads the whole db all the time. It actually read only the pages I need, and that's the benefit from millions of man-hours MSFT invested in optimizing its SQL engine.

typical response time depends on date period, I try to write queries that dont take more than a 5-10 secs to run.

if you have RDBMS performance problems - just hire an expensive DBA for a brief period and learn-learn-learn from her how to work with large data effectively. DBAs can optimize your db to run as fast as a your hardware I/O speed.


I just did a simple benchmark: 67 million rows, integers, 4 columns wide, with postgresql 10 and pandas.

pg 10

  huge=# \timing on
  Timing is on.
  huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
  COPY 67108864
  Time: 85858.899 ms (01:25.859)
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 132784.743 ms (02:12.785)       
  huge=# vacuum analyze lotsarows;      
  VACUUM                                
  Time: 185040.485 ms (03:05.040)       
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 48622.062 ms (00:48.622)        
  huge=# select count(*) from lotsarows where a > b and c < d;
    count                       
  ----------                    
   16783490                     
  (1 row)                       
                                
  Time: 48569.866 ms (00:48.570)

pandas

  In [2]: import pandas as pd                                           
                                                                        
  In [3]: %time df = pd.read_csv('data.csv')                            
  CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s                    
  Wall time: 38.7 s                                                     
                                                                        
  In [4]: %time len(df)                                                 
  CPU times: user 125 µs, sys: 19 µs, total: 144 µs                     
  Wall time: 166 µs                                                     
  Out[4]: 67108864                                                      
                                                                        
  In [5]: %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()       
  CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s                    
  Wall time: 1.88 s                                                     
  Out[5]: 16783490


ok, so 20k * 60 minutes * 24 hours * 365 days * 3 years = 31,536,000,000 rows. You are querying 31.5 billion rows on a machine with 4 cores and 8gb ram? Are queries that return in 5-10 seconds running over the entire table? or small portions of it?


small portions of it, sometimes an hour or two, sometimes a day, or a week. most of the times there are 5-6 other conditions. pandas will have to full scan entire dataframe for any query to filter, while SQL uses index seek


Why aren't you or some other sql whiz bringing home all the kaggle prize money being won by people using inefficient dataframes?


i think you are confusing xgboost that uses its own DMatrix that is winning kaggle, not dataframe


Of course you shouldn't use Pandas to analyze terabytes of data, but most people aren't analyzing terabytes of data.


That's what Spark is for. You can do petabyte-scale jobs... with DataFrames.


Technically all SQL compiles into a single line.

Less pedantically you can quite easily and elegantly drop nans, str->float, perform just about any operation over a rolling window and perform trig functions on it. Generally in a much more sane way than pandas.

In my life as an analyst and data scientist I've found SQL to be far, far superior to pandas for something like 90-95% of tasks.


> Can you drop nans, parse strings to floats, apply arbitrary lambdas over a rolling window, take the cosine of those values, and plot a graph against time, in a single line of sane SQL?

Yes you can, Postgres support Python UDF btw. Though I don't think that is necessary a daily common feature to use.

SQL is pretty powerful if you look deep




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: