War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve an join of indefinite depth. One of the DBA's wanted to use a "pure SQL" traversal of the hierarchy.
My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.
The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.
Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort which is an insanely expensive operation.
Question: Are there alternative approaches to obtaining the unique brands in
a table?"
Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.
Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.
Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.
[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?
Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]
Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.
Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]
def select_distinct():
q1= db.cursor()
q1.execute( "SELECT DISTINCT BRAND FROM LOTSADATA" )
print q1.fetchall()
q1.close()
def select_all():
q2= db.cursor()
q2.execute( "SELECT ALL BRAND FROM LOTSADATA" )
print set( q2.fetchall() )
q2.close()
Notes.
- I only simulated 100,000 rows. [I don't have the patience to wait for 15 million rows to be created, loaded and queried.]
- The table only had four columns.
- I used SQLite3 -- which is mostly in-memory -- and runs much, much faster than Oracle.
- The select all is not a specious result based on cache being filled; the results are repeatable in any ordering of the queries.
Results.
select_distinct 0.417096
select_all 0.162827
For this data, the SQL SELECT DISTINCT took almost 3x as long as the SELECT ALL. It's just that simple.
Want more speed? Use array fetch features to get more rows in bigger buffers.
Consequences.
This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.
SQL databases do locking, transaction management, backup and recovery and a bunch of things well. SQL databases are helpful and necessary. However, SQL isn't always fast.
SQL means Slow Query Language. You've been told.