I had planned to "show it to HN" soon, but the occasion is too good now: I have also created a thin "psycopg2 wrapper", pretty much in the same spirit as Records (but different in purpose), which allows you to perform SQL operations with plain data structures:
https://github.com/cjauvin/little_pger
I have been using this for myself for quite some time, but feedback would be very appreciated!
Edit: A cool thing about it (I believe) is that it provides an "upsert wrapper" which uses either the new PG 9.5 "ON CONFLICT" mechanism when it's available, or an "update if exists, insert if not" two-step fallback for earlier versions.
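Roughly, the idea is something like this -- a simplified sketch for illustration only, not the library's actual code; it assumes the conflict key is one of the row's columns and that table/column names come from trusted code:

def upsert(cursor, table, row, key='id'):
    # Sketch of the mechanism described above (not the actual little_pger
    # code). `row` is a dict of column -> value; `key` must have a unique
    # constraint on it.
    cols = ', '.join(row)
    placeholders = ', '.join('%%(%s)s' % c for c in row)
    if cursor.connection.server_version >= 90500:
        updates = ', '.join('%s = excluded.%s' % (c, c) for c in row if c != key)
        cursor.execute(
            'insert into %s (%s) values (%s) on conflict (%s) do update set %s'
            % (table, cols, placeholders, key, updates), row)
    else:
        # Pre-9.5 fallback: UPDATE first, INSERT only if nothing was updated.
        updates = ', '.join('%s = %%(%s)s' % (c, c) for c in row if c != key)
        cursor.execute(
            'update %s set %s where %s = %%(%s)s' % (table, updates, key, key), row)
        if cursor.rowcount == 0:
            cursor.execute(
                'insert into %s (%s) values (%s)' % (table, cols, placeholders), row)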
Neat library. Just to note that this approach to fake upsert is not safe for concurrent sessions. There were exactly two concurrency-safe ways to implement upsert prior to 9.5:
1. An exclusive lock on the table, which is slow as hell
2. A PL/pgSQL function that attempts to INSERT, catches the duplicate-key exception and instead UPDATEs (also handling the case where the row vanishes), which is slower than hell
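For reference, the classic shape of option 2, modelled on the example in the PostgreSQL documentation (table, columns and connection string are made up here):

import psycopg2

conn = psycopg2.connect(dbname='example')
with conn.cursor() as cur:
    cur.execute("""
        create or replace function merge_db(k integer, v text) returns void as $$
        begin
            loop
                -- first try to update the row
                update db set data = v where id = k;
                if found then
                    return;
                end if;
                -- not there: try to insert; a concurrent insert of the same
                -- key raises unique_violation, in which case we loop and
                -- retry the update
                begin
                    insert into db (id, data) values (k, v);
                    return;
                exception when unique_violation then
                    -- do nothing, loop again
                end;
            end loop;
        end;
        $$ language plpgsql;
    """)
conn.commit()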
This is why ON CONFLICT is such a huge deal. If it were as simple as checking keys with a SELECT then sending some to INSERT and some to UPDATE it would just be nice syntax sugar.
> 2. A PL/pgSQL function that attempts to INSERT, catches the duplicate-key exception and instead UPDATEs (also handling the case where the row vanishes), which is slower than hell
That's not true, at least not when you're operating on a single row at a time. According to my benchmarks the function approach was usually within 10% of the native upsert -- even outperforming it by around 5% depending on data distribution. For insert-or-select (instead of update) the function consistently outperformed INSERT ... IGNORE.
Sorry, I don't see the point of this library. What exactly does it provide? It looks like a very thin wrapper (150 lines of code) over psycopg2 that does exactly the same thing, just with a slightly different API.
This looks like a (very thin) wrapper of the psycopg2 library, which implements the python DBAPI. Tablib functionality is integrated for bonus points. 137 lines.
So am I correct in my observation that this is just a different way to use/call the psycopg2 library?
The more I look at the code, the more I wonder where this is going. Will the psycopg2 requirement become a problem or a benefit over time?
SQLAlchemy is pretty darn good.... there are more ways to connect to a DB in Python than I can shake a stick at. I guess I am confused...
> This looks like a (very thin) wrapper of the psycopg2 library, which is a wrapper around the python DB API
psycopg2 is an implementation of DBAPI2 (with extensions). As the name denotes, DBAPI is an API which database libraries ought to implement; it's not a library.
>Rows are represented as standard Python dictionaries
That seems inefficient. You're always going to have the same columns in each row, so storing column names for each row is a huge waste of memory. Django's "cursor.fetchall" returns a list of tuples, which seems to be a saner way to represent db rows in Python.
Hahaha... I can only imagine. Sorry to creep you out. Your Facebook profile is just a click away from your personal website, and your discussion about Records is up front and center on your feed for me :)
I learned from a coworker recently that namedtuple has much lower memory usage than a dictionary, so if you're dealing with lots and lots of data regularly, and lots of processes on a single machine (i.e. memory is a concern), namedtuple might be a better choice.
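For a rough sense of the difference (shallow measurement only -- sys.getsizeof ignores the referenced values, but it shows the per-row overhead of carrying the keys in every dict):

import sys
from collections import namedtuple

Row = namedtuple('Row', 'id name email')

nt_row = Row(1, 'alice', 'alice@example.com')
dict_row = {'id': 1, 'name': 'alice', 'email': 'alice@example.com'}

print(sys.getsizeof(nt_row))    # small: a tuple plus a shared class
print(sys.getsizeof(dict_row))  # larger: every row carries its own hash table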
One argument I've seen for using dictionaries is that namedtuples restrict you to column names that are valid Python identifiers, which is a smaller set than postgres allows for column names (postgres allows the '$' and non-Latin characters).
>>> collections.namedtuple("test", "$")
Traceback (most recent call last):
...
ValueError: Type names and field names can only contain alphanumeric characters and underscores: '$'
The parent comments were discussing why one would choose dicts over namedtuples for db-originated columns, and the fact that field names are restricted on namedtuples but not on dicts is a very compelling point.
Making examples using dicts and explaining why namedtuples have restrictions is completely missing the point.
This memory waste argument is only true if you want to fetch all rows, which you could possibly avoid with different techniques. Otherwise, a dictionary seems like a very simple, convenient, Pythonic, and intuitive representation of a row.
This lib also looks brand new, so there is plenty of time for the community to optimize the hell out of it.
If this were a big deal, it could be optimized using something like a key-sharing dict implementation [1]. You can keep the dict interface to accommodate column names that aren't valid identifiers, and avoid most of the memory overhead.
I'm also a fan of NamedTupleCursor. Not sure how NamedTuples compare to raw tuples in terms of memory consumption, but they make code look a lot nicer.
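For anyone who hasn't used them, the row shape is just a cursor_factory away in psycopg2 -- a quick sketch (connection string made up):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect(dbname='example')

# Default cursors yield plain tuples; NamedTupleCursor yields namedtuples
# (row.name); RealDictCursor yields real dicts (row['name']), which is the
# style Records exposes.
cur = conn.cursor(cursor_factory=psycopg2.extras.NamedTupleCursor)
cur.execute("select 1 as id, 'alice' as name")
row = cur.fetchone()
print(row.id, row.name)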
Namedtuple CPU-speed performance for some common operations is TERRIBLE compared to dictionaries, at least on 2.7. Orders of magnitude difference. So bad that it really matters. It's shocking how badly the implementors got this wrong.
namedtuple is just a thin wrapper around tuple. It literally constructs a class definition that sub-classes tuple as a string and then executes it. You can see the string template that it uses here[1]. If you're interested in something like namedtuple there are other[2] things[3] you can use depending on your use-case.
>>> nt = namedtuple('nt', 'a b')
>>> nt10k = [nt(1, 2) for i in range(10000)]
>>> dict10k = [{'a':1, 'b':2} for i in range(10000)]
>>> timeit pickle.dumps(nt10k)
10 loops, best of 3: 26.3 ms per loop
>>> timeit pickle.dumps(dict10k)
100 loops, best of 3: 2.49 ms per loop
If you look at the StackOverflow answer in my sibling comment to yours, running 'obj.attrname' on a namedtuple takes longer because it needs to translate 'attrname' to an integer index and then run (e.g.) 'obj[0]'. Outside of that, I'm not sure.
If you do a query that returns 100 records and use rows.next(), does it just get them from the cursor without pulling down the rest from the database? Do you need to dispose of the cursor or is it automatically cleaned up?
The export formats are just an attribute, not a method call. It would need to use @property IIRC. Does this go against the guideline in Python that "explicit is better than implicit"?
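A hypothetical sketch of how an export-as-attribute can be done with @property (not the actual Records code):

class ResultSet(object):
    def __init__(self, rows):
        self._rows = rows  # a list of dicts

    @property
    def csv(self):
        # Exposed as an attribute: results.csv, not results.csv()
        header = ','.join(self._rows[0].keys())
        lines = [','.join(str(v) for v in row.values()) for row in self._rows]
        return '\n'.join([header] + lines)

With that, results.csv reads like plain data access even though code runs behind it.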
He's using cursors directly from his database library[1], which is psycopg2[2] (and we can also see that the cursor_factory is how the dicts are generated). The psycopg2 docs say[3]:
When a database query is executed, the Psycopg cursor usually
fetches all the records returned by the backend, transferring
them to the client process. If the query returned an huge amount
of data, a proportionally large amount of memory will be
allocated by the client.
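The usual way around that is a server-side ("named") cursor, which fetches rows from the backend in batches instead of all at once -- a quick sketch (connection string and table made up):

import psycopg2

conn = psycopg2.connect(dbname='example')

cur = conn.cursor(name='big_query')  # giving it a name makes it server-side
cur.itersize = 2000                  # rows per network round trip (2000 is the default)
cur.execute('select * from big_table')
for row in cur:
    print(row)
cur.close()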
Interesting. I love requests. SQL isn't my area of expertise, so I'm genuinely surprised that in 2016 we're still building libraries like this for Python; are existing libraries really that bad?
I would say 'no'. SQLAlchemy is very good, but extremely complex and powerful. I think people don't understand that you can use the core features without any of the ORM overhead, if you don't want it.
I'm just curious: in the first example of the section that you linked, what's the purpose of the "|| ', ' || " bit? I think the double pipe is used for concatenation in SQL, but I don't understand why the ', ' part needs to be concatenated like that.
that example is intending to produce a single column result that concatenates two fields into a comma-separated string, e.g. "your name, you@somesite.com".
I think jumping into that example directly is probably kind of disorienting. The SQL in that string is a little bit weird/contorted just to illustrate, "it's text, do whatever you want!". I also wrote that example like ten years ago.
Oh OK I get it, it makes perfect sense but yes, jumping to that example directly might be disorienting.
Maybe a one-line explanation (just like the one you gave me) placed immediately above or below the code example would make it more readable!
SQLAlchemy tries to provide a pythonic wrapper for most database operations, in an implementation-agnostic way, which can be a convenience but also makes it large and complex. It really isn't the thing you want to use if what you want is to just send a plain snippet of SQL to your database. There isn't a simple 'requests-for-sql' type of thing, until now.
No they aren't, this just looks like an itch that has been scratched. Outside of the ORM libraries there isn't much fanfare for this sort of thing though.
I spent a couple of hours looking for exactly this yesterday for a new Tornado based project and ended up using queries: https://github.com/gmr/queries which seems on the same level for my purposes of wanting to cleanly pass raw SQL and get python data structures in return. Queries also provides asynchronous API interaction with Tornado which will be nice for me.
Coincidentally, queries was inspired by Kenneth Reitz's work on requests.
I think it is often tempting to write something like this as it is quite a light thing and can reacquaint someone with SQL if they haven't used it for a while. I know I was tempted after poring over ORMs.
[edit] despite negative expectations, I still upvoted OA because even "wrong" innovation is better than stagnation.
psycopg2 is pretty great. Every lib I've seen is based on it. Many apps use it directly. It is the lowest comfortable layer you'd want. What you need on top of that is either:
1) So minor, most just write a few helper functions/classes
2) So elaborate and opinionated it needs years of use/work to get right. SQLAlchmey, Django ORM.
3) So specific, customized to your requirements that general purpose libs/frameworks don't fit.
Some people get tired of redoing #1 or don't like #2's opinions or think there should be something lighter. So, they build something they hope to be middle ground. I don't think there is a middle ground. As you use it you realize the complexity (dbs/SQL are much, much more complicated/powerful than you think, plus the impedance between relations and objects) and middle-ground projects remain "toys" or grow into SQLAlchemy.
This works on multiple databases, yet still uses plain old sql. I think it's simpler than using a multi-database ORM for plain old sql, or using a databases' driver directly, like psycopg2.
Edit: actually it just works on postgres. I don't know if there are plans to make it work on other SQL databases.
>>> db = records.Database('postgres://localhost:5432')
>>> rows = db.query('select * from customer_customer')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/ns/teabox_django_env/lib/python2.7/site-packages/records.py", line 109, in query
    c.execute(query, params)
  File "/Users/ns/teabox_django_env/lib/python2.7/site-packages/psycopg2/extras.py", line 223, in execute
    return super(RealDictCursor, self).execute(query, vars)
ProgrammingError: relation "customer_customer" does not exist
LINE 1: select * from customer_customer
It's good to see the SQL Phrasebook Pattern still in use. They were quite popular before ORMs caught on. I'm currently writing one in Rebol so they must have come back into fashion :)
This is amazingly well-timed. Just yesterday I was complaining that the Python SQL libraries are annoying and verbose, and that I wished there was a library as good as requests is for HTTP.
You are right, thanks for the example. I checked the psycopg2 library after reading the announcement which drove my confusion. Psycopg is a super straightforward db driver. I wonder if this is the start of an orm built on top of psycopg2...? color me confused.
Amusingly, this is the other extreme from Diesel for Rust, discussed yesterday. This is the totally dynamic approach, while Diesel is the totally static approach.
That's a good point. As much as I love python, I think (for me) this illustrates why I tend to switch languages once I get over ~500-1000 lines. Seeing diesel was the exact opposite for me - total overkill when I just need a report, but very tempting if I'm going to have to evolve and maintain a larger code base.
The Python community just doesn't have good tools to type dictionaries. In JavaScript, you can use a tool like tcomb[1] that lets you specify the structure of your data and add functionality onto that data. Throwing raw dictionaries around in your program isn't maintainable, but if you adopt the Clojure ethos that functions operate on data that meets a set of requirements, you'll have maintainable code without a type system that gets in your way when you don't need it.
In general, I think most Python developers would be happier with Clojure or JavaScript written in a functional style (e.g. heavy use of tcomb, Immutable.js, and transducers). I'm happier, at least.
Eleven years ago I actually wrote something like this in Scheme for a bespoke Web app. I called it QB (for Query Builder) and basically implemented a useful subset of SQL as a macro-based language inside Scheme. When you unquoted Scheme expressions into a query, it did the Right Thing in terms of escaping strings, etc.
Overall cool wrapper but is it safe to simply overwrite the cursor variable without closing it first? Will that clear fetched result sets out of memory?
I suppose this tool has its place in the ecosystem. I just wonder why anyone would use this when there are ORMs ....
EDIT: I'm not saying this is a waste of time from a developer who obviously puts out quality projects. I'm saying I don't see myself using it. I'm most familiar and comfortable with the Django ORM and have overseen a large project with it. There were several raw queries when I took over, but by the time I left there were none, and all of the replacements were clean ORM queries that were more performant than their raw SQL predecessors.
Maybe this is just me being so accustomed to my hammers that I solve everything like it was a nail. My toolset involves exactly that: modelling the space of requests with a complete enough relational database that I can find columns and keys for all incoming data. For the occasional tidbit that doesn't fit properly, stick it in the _extra json field.
I am pretty familiar with the Django ORM where you hit limitations fairly early (if you are writing any moderately complex queries). But I still use it for maybe 90% of the stuff I do because that 90% involves pretty simple queries, and it saves a ton of typing, and integrates well with the rest of the Django ecosystem.
Updating data is a good example. I recently split up some database strings into multiple columns. At first I just wrote the SQL to update a joined table. Then I needed to get regexes involved to pull out certain parts of the string. Doing that in SQL gets ugly, and it's far simpler just to use the ORM to pull out the data required and operate on it in Python in a loop.
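Roughly along these lines, with a made-up model and fields just to illustrate:

import re
from myapp.models import Customer  # hypothetical model with phone/area_code/number fields

pattern = re.compile(r'\((\d{3})\)\s*(\d{3}-\d{4})')

# Pull the rows with the ORM, do the string surgery in Python, write back
# only the affected columns.
for customer in Customer.objects.filter(phone__isnull=False).iterator():
    match = pattern.search(customer.phone)
    if match:
        customer.area_code, customer.number = match.groups()
        customer.save(update_fields=['area_code', 'number'])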
I don't see the ORM as an either/or, but as a complementary tool to SQL.
Really anything that wants to pull all your data back into python is going to be horrible for analytics. What's really needed in this space is something that is smart enough to push most data-intensive operations down to the database and pull only the minimum amount of stuff back to python to do anything that's CPU intensive.
Because maybe an ORM is overkill sometimes (frequently? always?). I frequently find myself trying to figure out how to do something in the ORM "DSL" that's pretty easy (for me) in SQL.
In addition to the array of other correct responses, I'd add stored procedures or functions. They're so much easier to interact with via SQL, if the ORM supports them at all.
I have something similar, that I think "goes further" than this, which I use as part of my Python Cydia backends. I got extremely addicted to using the moral equivalent of Tcl's uplevel, inspect.currentframe(-...), to build interfaces in Python where I don't have to pull out parameters. My "cyql" interface (which is barely anything, really: just a thin wrapper for psycopg2, which is itself excellent; in Clojure I wrap a slightly modified copy of the PostgreSQL JDBC driver to provide an even better implementation that does "compile"-time type verification of SQL statements) uses this to allow me to remove what feels like all of the boilerplate.
In addition to .run (which returns the number of affected rows) and .all (which returns an array of ordered dicts?), I have .has (which returns a bool and wraps an exists query), .gen (which returns a generator and iterates a cursor), and .one (which verifies you only got back a single row and returns just that row). I also have easy support for transactions (either on an existing connection or in one line as part of making a connection), easily turning off synchronous commit (for log tables), and I carefully manage autocommit to make certain that I am using the minimum number of possible SQL statements to the server (which I care about a lot).
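To give a flavor of the frame-inspection trick, a stripped-down sketch (not my actual code):

import inspect

import psycopg2

class _Connection(object):
    def __init__(self, dsn):
        self._conn = psycopg2.connect(**dsn)

    def run(self, query):
        # Resolve %(name)s placeholders against the *caller's* local
        # variables, in the spirit of Tcl's uplevel, so parameters never
        # have to be passed explicitly.
        params = inspect.currentframe().f_back.f_locals
        with self._conn.cursor() as cursor:
            cursor.execute(query, params)
            return cursor.rowcount

    def one(self, query):
        params = inspect.currentframe().f_back.f_locals
        with self._conn.cursor() as cursor:
            cursor.execute(query, params)
            rows = cursor.fetchall()
            assert len(rows) == 1  # .one() insists on exactly one row
            return rows[0]

So a local variable named payment_id in the calling function is all that a %(payment_id)s placeholder needs, as in the example below.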
That said, I agree with the post by masklinn below: neither the linked library nor my library are solving problems that I believe are worth inheriting code from someone else over. There is some basic configuration of psycopg2 which is necessary, but the underlying library itself is what works here. I have spent over a decade thinking about how I like to build SQL interfaces, and have now implemented a similar interface for myself in numerous languages, each time evolving the design slightly (and sometimes having enough of an epiphany that I go back and retrofit some of the older ones), but it ends up being built around the way I think about stuff.
And that also means that as I learn more and "level up", I start making different decisions. My implementation in Clojure stresses stored procedures a lot more, as while it took me a long time to really figure out how to use them in my workflow, I now see them as exceedingly correct and feel a lot of the code I've written in the past where I had tons of free statements is essentially "what I wrote from back when I didn't know how to use the database to organize my API layer" (though I still haven't worked out some of the tooling around shifting to stored procedures, and have been distracted with other higher-level problems the last couple years).
Essentially, I'm arguing that the same will happen to you. Put differently: some problems are hard, and some problems are easy; I find a lot of libraries that seem to be solving easy problems that new developers think are hard, and a lot of libraries that pretend to solve a hard problem, but only because the problem looked easy and the result doesn't actually work (such as the PostgreSQL drivers that were available in Ruby for a long time, which were all unusably bad). Wrapping something that works well so it is slightly easier for you to use can be valuable if it is upstreamed into the original project, but even then is likely to be something you will paper over yourself in time as you will think about the problem differently than they did.
# at the top of the code somewhere
dsn = {'port': ..., 'user': '...', 'password': '...', 'database': '...'}

with cyql.connect(dsn) as sql:
    provider, account, key = sql.one('''
        select
            "payment"."provider",
            "payment"."account",
            "payment"."transaction"
        from "cydia"."payment"
        where
            "payment"."id" = %(payment_id)s
    ''')

with cyql.connect(dsn) as sql:
    sql.run('''
        update "cydia"."token" set
            "token" = %(token)s,
            "email" = %(email)s,
            "country" = %(country)s,
            "shipping" = %(shipping)s,
            "billing" = %(billing)s,
            "data" = %(data)s
        where
            "id" = %(token_id)s and
            "token" is null
    ''')
Meh. SQLAlchemy will already do almost exactly the same thing, if you use its lower-level functions and avoid the fancy ORM layer. It's also cross-database, so you can use the same code with SQLite for development, and deploy to MySQL or Postgres.
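Something along these lines, with a made-up DSN and table; the exact execution style varies a bit between SQLAlchemy versions, but no ORM is involved:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/example')

with engine.connect() as conn:
    # Plain SQL in, rows out; :active is a bound parameter.
    result = conn.execute(
        text('select name, email from users where active = :active'),
        active=True)
    for row in result:
        print(row['name'], row['email'])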
Why would you want to switch databases when you deploy? That's asking for bugs that only get revealed in production.
Also, the intersection of SQLite, MySQL, and Postgres is a pretty terrible database. You can be a lot more effective if you decide which one you're writing for.
targeting both MySQL and Postgresql is very reasonable and is commonplace. Openstack does it, for example. Lots of apps do. This assumes you're doing an app that's mostly CRUD. If you're doing a BI kind of app and need fancy analytical functions, then yes you'd target Postgresql.
> Why would you want to switch databases when you deploy? That's asking for bugs that only get revealed in production.
Only if you deploy directly to production!
Any competent organization should have at least a staging environment (and probably some other pre-staging testing environments) where you deploy and run your full application stack, and only promote verified builds to production after they pass QA on earlier environments.
the top-level comment in this thread praises SQLAlchemy for being "cross-database, so you can use the same code with SQLite for development, and deploy to MySQL or Postgres.", and your parent just says that's a recipe for disaster.
It's pretty common to run SQLite on your local machine while developing. Apparently the parent is confused by reading "deploy" as "deploy directly to production".
It's sad that ORMs are always restrictive and raw SQL can't be safely and easily modified. I'd always wished for a library where one could just write SQL and have the library do the transformations.
I.e. one writes plain SQL strings, but they're parsed under the hood, so AST transformations can be applied, like adding extra WHERE clauses conditionally or applying LIMIT.
Sadly, I haven't found any good Python SQL parsing library. That was a long time ago, though - maybe someone has written one already.
Peewee [1] is my tool of choice because of its convenience and flexibility combined. I can create pretty complex queries using its Query API, but also execute arbitrary queries when I know I need something a little more nonstandard. Plus the maintainer (Charles Leifer) is an awesome dude who has always been really helpful to his users.
Unfortunately, no, that's not what I meant, sorry.
I can't give SQLAlchemy (Core or ORM) a string like "SELECT * FROM foo AS f LEFT JOIN bar AS b ON b.id = f.bar_id WHERE f.baz > %(baz)s" and then transform it, by, say, appending the LIMIT clause or adding extra WHERE condition. AFAIK, there's no way to provide a raw SQL string and then say something like `query.where("NOT f.fnord")` OR `query.limit(10)` and get the updated SQL.
With ORMs or non-object-mapping wrappers if I want transformations, I have to use their own language instead of SQL. I do, but don't really want to.
Or have things changed and this is what SA can do nowadays? I'll be more than happy to learn that I'm wrong.
you know what the problem is there: (a) parsing your SQL string into a tokenized structure, (b) altering it based on your given instructions, (c) re-compiling it back into a string, all on top of an interpreted scripting language, and then (d) sending it to a database so that the database can parse it a second time, is just so inefficient, for a particular API style that is arguably not even any better than just learning how to write Core expressions. Core and ORM expressions can be cached to their resultant SQL to save on the tokenized-structure->string step too.
there are certainly SQL parsers that can easily produce such tokenized structures, and from a technical standpoint your API is pretty simple to produce, with or without shallow or deep SQLAlchemy integrations. It's just that there's not really any interest in such a system and it's never been requested.
If you specify the columns involved with your text query, then you get back a 'TextAsFrom' object, and you can apply other transformations to it like .where() or .unique().
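A rough sketch of that pattern, adapted from the docs (table, columns and DSN are made up; details vary by SQLAlchemy version):

from sqlalchemy import column, create_engine, select, text

engine = create_engine('postgresql://localhost/example')

# Give the textual SQL explicit column metadata so it behaves like a
# selectable unit.
raw = text(
    'select f.id as id, f.baz as baz '
    'from foo as f left join bar as b on b.id = f.bar_id '
    'where f.baz > :baz'
).columns(column('id'), column('baz'))

# Further criteria are layered on by wrapping the textual SELECT with Core
# constructs rather than by re-parsing the string.
stmt = select([raw.c.id, raw.c.baz]).where(raw.c.baz < 100).limit(10)

with engine.connect() as conn:
    rows = conn.execute(stmt, baz=5).fetchall()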
Agreed! I was heavily exploring building this upon SQLAlchemy, but decided that the first version needed only to support the only database I use regularly: Postgres.
There's a very big chance that SQLAlchemy will be integrated into the project, to allow for connections to multiple database types.
It looks like a number of people have done something similar, so thought I would mention my attempt - pgwrap [1][2]. This is also a thin wrapper over psycopg2. It provides the following:
* Simplified handling of connections/cursor
* Connection pool (provided by psycopg2.pool)
* Cursor context handler
* Python API to wrap basic SQL functionality
* Simple select, update, delete, join methods extending the cursor context handler (also available as stand-alone methods which create an implicit cursor for simple queries)
* Query results as dict (using psycopg2.extras.DictCursor or any other PG Cursor factory)
* Callable prepared statements
* Logging support
Essentially you can do stuff like:
>>> import pgwrap
>>> db = pgwrap.connection(url='postgres://localhost')
>>> with db.cursor() as c:
... c.query('select version()')
[['PostgreSQL...']]
>>> v = db.query_one('select version()')
>>> v
['PostgreSQL...']
>>> v.items()
[('version', 'PostgreSQL...')]
>>> v['version']
'PostgreSQL...'
>>> db.create_table('t1','id serial,name text,count int')
>>> db.create_table('t2','id serial,t1_id int,value text')
>>> db.log = sys.stdout
>>> db.insert('t1',{'name':'abc','count':0},returning='id,name')
INSERT INTO t1 (name) VALUES ('abc') RETURNING id,name
[1, 'abc']
>>> db.insert('t2',{'t1_id':1,'value':'t2'})
INSERT INTO t2 (t1_id,value) VALUES (1,'t2')
1
>>> db.select('t1')
SELECT * FROM t1
[[1, 'abc', 0]]
>>> db.select_one('t1',where={'name':'abc'},columns=('name','count'))
SELECT name, count FROM t1 WHERE name = 'abc'
['abc', 0]
>>> db.join(('t1','t2'),columns=('t1.id','t2.value'))
SELECT t1.id, t2.value FROM t1 JOIN t2 ON t1.id = t2.t1_id
[[1, 't2']]
>>> db.insert('t1',{'name':'abc'},returning='id')
INSERT INTO t1 (name) VALUES ('abc') RETURNING id
[2]
>>> db.update('t1',{'name':'xyz'},where={'name':'abc'})
UPDATE t1 SET name = 'xyz' WHERE name = 'abc'
2
>>> db.update('t1',{'count__func':'count + 1'},where= {'count__lt':10},returning="id,count")
UPDATE t1 SET count = count + 1 WHERE count < 10 RETURNING id,count
[[1, 1]]
Also it allows you to create callable prepared statements (which I find really useful in structuring apps):
>>> update_t1_name = db.prepare('UPDATE t1 SET name = $2 WHERE id = $1')
PREPARE stmt_001 AS UPDATE t1 SET name = $2 WHERE id = $1
>>> update_t1_name(1,'xxx')
EXECUTE _pstmt_001 (1,'xxx')