Introduction
Snowflake Inc. is possibly the most influential database software company in the modern world, after Oracle. The biggest software IPO in history, it has raced to the hallowed $3B ARR mark, demonstrating strong compounding revenue growth at scale and meaningful profitability. As the de-facto analytical database infrastructure for thousands of enterprises, including much of the Fortune 2000, it has one of the highest grades of recurring enterprise revenue in its peer set. It seems to possess in ample measure all the makings of a once-in-a-lifetime, iconic & enduring enterprise software company, so what, pray tell, is the purpose of this article? Haven’t the odes to the greatness of the SNOW ticker already been sung by every investment house of note on Wall Street? Is there anything new left to say?
This author posits that yes, there is, and takes a slightly different tack on the matter. The posit is: Snowflake is indeed the most influential database company in the world, after Oracle. And much like Oracle’s, its pricing model is one of the best-architected, best-disguised and most heinous heists in the history of software technology. In simple words, Snowflake is overcharging its customers at a pace and rhythm that is absolutely unprecedented, the likes of which has likely never been witnessed before (and hopefully never will be again). I also posit that the instrument of Snowflake’s glorious success, the usage-based database infrastructure billing model, is also the instrument of its possible future disruption.
Such statements cannot and should not be made without a compelling body of evidence to support the hypothesis. In the rest of this article, I shall present the facts as I see them, to support my convictions. This is undoubtedly a contentious topic, so let me say this clearly: there is no doubt that Snowflake is a brilliant piece of OLAP DBMS technology that has moved the world forward. I deeply respect the views of the many supporters, believers, investors, and happy customers of Snowflake who have derived immense benefit from its excellent and innovative database technologies and notable stock gains. I direct my criticism only at an unsustainable pricing model which, in my view, embodies the relentless pursuit of corporate greed, and over time has the potential to bleed the enterprise dry.
(1) The 20x Compute Markup
A brief introduction to the Snowflake billing model. Snowflake uses a credit-based billing model in which enterprise customers purchase credits at a specific dollar conversion rate, linked to the tier of service they wish to consume. As indicated below, pricing starts at $2 per compute credit for the standard tier, is $3 for enterprise customers, $4 for regulated industries, $6 for Virtual Private Snowflake (VPS), and as much as $8.4 for Federal customers, per the pricing published by Snowflake on their website (linked here and encapsulated below).
Once credits are purchased, with discounts as negotiated by the customer, they are used to spin up virtual data warehouses and run OLAP (online analytical processing) queries in SQL by anyone in the enterprise with system access. The warehouse is charged at the rates indicated below per minute of active query operations. The standard warehouses vary in size from XS (1 credit per hour) all the way to 6XL (512 credits an hour), while the Snowpark-optimized (i.e., ML-workload-optimized) warehouses vary from M (6 credits an hour) to 6XL (768 credits an hour), as encapsulated in the table below.
It is commonly known that each XS data warehouse represents 8 vCPUs, 16 GiB RAM, and 200 GiB NVMe SSD storage (equivalent to the c5d.2xlarge EC2 machine on AWS, and its peers on other hyperscalers). Each successively larger warehouse doubles these compute specs: the S data warehouse has 16 vCPUs and 32 GiB RAM, the M data warehouse has 32 vCPUs and 64 GiB RAM, and so on.
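To make the doubling pattern concrete, here is a minimal sketch based on the commonly cited (not officially documented) figures above, assuming XS = 1 credit/hour, 8 vCPUs, and 16 GiB RAM:

```python
# Illustrative sketch of standard warehouse sizing: each step up doubles
# credits/hour, vCPUs, and RAM, starting from the assumed XS baseline.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"]

for i, size in enumerate(SIZES):
    credits_per_hour = 2 ** i      # 1, 2, 4, ... 512
    vcpus = 8 * 2 ** i             # 8, 16, 32, ... 4096
    ram_gib = 16 * 2 ** i          # 16, 32, 64, ... 8192
    print(f"{size:>3}: {credits_per_hour:>4} credits/hr, {vcpus:>5} vCPU, {ram_gib:>5} GiB RAM")
```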
So each credit, the cost of which varies from $2 to $8.4 before bulk discounting, delivers one hour of a server with 8 vCPUs and 16 GiB RAM. Viewed at a per vCPU-hour level, the cost in the standard tier is 25 cents ($2 / 8 vCPU-hours) and the cost for the most expensive service tier is 105 cents ($8.4 / 8 vCPU-hours). The table below encapsulates the cost per vCPU-hour for each Snowflake service tier.
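That per-vCPU-hour arithmetic can be reproduced directly from the published credit prices; a quick sketch, using list prices only (before any negotiated discounts):

```python
# Cost per vCPU-hour by service tier, assuming one credit buys one XS
# warehouse-hour (8 vCPUs, 16 GiB RAM) as described above.
CREDIT_PRICE = {
    "Standard": 2.00,
    "Enterprise": 3.00,
    "Business Critical": 4.00,
    "VPS": 6.00,
    "Federal": 8.40,
}
VCPUS_PER_CREDIT_HOUR = 8

for tier, price in CREDIT_PRICE.items():
    cents = 100 * price / VCPUS_PER_CREDIT_HOUR
    print(f"{tier:<17}: {cents:5.1f} cents per vCPU-hour")
```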
Now let us view it from a cost lens. As mentioned above, the AWS equivalent of the EC2 machine used by Snowflake is c5d.2xlarge. Let’s view its pricing on the AWS pricing calculator. The on-demand price is ~38 cents per hour for 8 vCPUs and 16 GiB RAM, while the 3-year committed, all-upfront EC2 instance savings plan goes as low as 14 cents for the same hardware resources (i.e., 1.75 cents per vCPU-hour). Is it not plausible to assume that a company as large as Snowflake, which likely drives almost $1B of annual revenue to AWS, might be able to negotiate better pricing than the numbers available to the average Joe on an AWS pricing calculator? Even if we take the lowest publicly-known pricing for this AWS machine, and give Snowflake a healthy 30% incremental hardware buffer over & above the primary EC2 cost to fully provision a virtual data warehouse, we find that Snowflake’s markup on their AWS compute costs starts at ~10x for the standard plan, is ~20x for the typical enterprise buyer choosing the “business critical” tier ($4 per credit), and can be as high as ~40x for the highest service tier for the Fed.
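Here is the same estimate as a sketch. The committed EC2 price, the 30% buffer, and the mapping of one credit to one 8 vCPU machine-hour are all assumptions from the reasoning above, not disclosed Snowflake costs:

```python
# Rough markup estimate: Snowflake's per-vCPU-hour price vs an estimated
# underlying cost of a c5d.2xlarge (8 vCPUs, 16 GiB RAM) on a 3-year
# committed plan, plus an assumed 30% hardware buffer.
CREDIT_PRICE = {"Standard": 2.00, "Enterprise": 3.00, "Business Critical": 4.00,
                "VPS": 6.00, "Federal": 8.40}
EC2_COMMITTED_HOURLY = 0.14    # ~3-yr all-upfront EC2 instance savings plan
HARDWARE_BUFFER = 1.30         # assumed extra 30% for spare capacity, networking, etc.
VCPUS = 8

underlying_per_vcpu_hour = EC2_COMMITTED_HOURLY * HARDWARE_BUFFER / VCPUS

for tier, credit_price in CREDIT_PRICE.items():
    markup = (credit_price / VCPUS) / underlying_per_vcpu_hour
    print(f"{tier:<17}: ~{markup:.0f}x markup over estimated compute cost")
```

Under these assumptions, the standard tier comes out at roughly 11x, business critical at roughly 22x, and the Federal tier at roughly 46x.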
To put this in perspective for the enterprise buyer spending millions of dollars on Snowflake: your compute bill might be 10x to 40x lower than it is on Snowflake were you to run open source OLAP DBMS software in your virtual private cloud (VPC) or in your data centers. There would, of course, be overhead for a database engineering team or managed service professionals to manage your cluster; but even then, is the total likely to be an order of magnitude cheaper than your Snowflake compute bill? Likely so. Are you paying far more for OLAP compute than you should be? Undoubtedly so. Hidden within the friendly guise of credits priced at $2-8 apiece and warehouses priced at a few to a few hundred credits per hour is a predatory billing model architected to bleed you dry.
Some counter-arguments often made in the light of this markup on compute:
- Snowflake passes through storage costs with low/no margins. It has to make money somewhere, i.e., on compute.
Fully aligned with this sentiment. Do note however that storage costs are typically 10-20% of your total annual bill at the maximum. So yes, Snowflake is charging low/no margin on 10-20% of your bill with its $23/TB/month pricing, and is charging a compute markup of ~10x to ~40x on the rest. Perhaps one should consider asking for the reverse: high storage margins and low compute markups? And yes, every company needs to make money. A 3-5x markup on compute might even be considered fair, as it covers costs and enables reasonable profitability. However, a 10-40x compute markup reeks of corporate greed gone wild.
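A hypothetical illustration of why the low-margin storage line barely moves the needle (the bill size, split, and markup below are assumptions chosen for round numbers, not actuals):

```python
# Hypothetical $1M annual bill, split 15% storage / 85% compute.
# Even treating storage as a pure pass-through, the compute markup dominates.
annual_bill = 1_000_000
storage_line = 0.15 * annual_bill       # assumed pass-through at ~$23/TB/month
compute_line = 0.85 * annual_bill
assumed_compute_markup = 20             # from the estimate above

compute_cost_estimate = compute_line / assumed_compute_markup
print(f"Storage line (low/no margin):         ${storage_line:,.0f}")
print(f"Estimated underlying compute cost:    ${compute_cost_estimate:,.0f}")
print(f"Compute charged above estimated cost: ${compute_line - compute_cost_estimate:,.0f}")
```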
- It’s impossible to estimate Snowflake’s real compute markup in this manner. Perhaps they are actually charging me less?
Yes, it is indeed impossible to estimate real compute markups without actual data. But there isn’t any transparency in this matter at all. Snowflake chooses not to disclose its hardware costs & markups. We don’t even know which machines Snowflake uses to power its warehouses (the 8 vCPU / 16 GiB RAM figure is tribal knowledge)! All we know is their blended gross margin (across storage, compute, services, etc.), which is ~78% today and growing steadily, their publicly available credit prices, and the cost of servers with similar specs. This is indeed a best guess, likely accurate in spirit if not to the letter.
(2) The Runaway Bill
If you’ve been a Snowflake enterprise customer for a few years, you might have observed the growth of your annual analytics bill over this timeframe. What might have started out in the few hundreds of thousands of dollars would perhaps now be running into the many millions of dollars, and growing steadily. They say that compound interest is the eighth wonder of the world. It certainly is for the entity charging it, and it is a raw wound in the flesh for the thousands of global enterprises paying it religiously every year.
Let us take a step back and examine why this is so. The beauty of Snowflake, and one of the many things it is rightly lauded for (alongside Redshift & BigQuery), is the democratization of data in the enterprise. Since a Snowflake data warehouse can be spun up in an instant by any analyst or employee with appropriate access, it enables every knowledge worker in the company familiar with SQL to make data-based decisions. And customers pay only per instance of compute provisioned and used. This is a huge unlock which has transformed the fabric of decision-making in the enterprise: a most commendable achievement.
With an ugly downside. The implication of anyone in the enterprise being able to spin up a virtual data warehouse is that anyone with access to the system can start spending thousands of credits (often equivalent to tens of thousands of dollars) a week. Some of this compute used by astute users may be well-spent, but there will inevitably be many instances of a novice user writing poorly constructed batch jobs that chug away for hours on end, burning precious credits and possibly returning no useful answer in the desired timeframe. Note also that the usual rung of enterprise users typically isn’t trained in the art of writing optimal SQL, and is quite prone (if not likely) to run unoptimized and highly resource-intensive queries. In the name of democratization, all we really accomplish is higher utilization and margins for Snowflake and its cloud vendors.
The combination of usage-based billing, democratized provisioning patterns, and the growing propensity for resource-intensive (bad) SQL from novice users creates runaway analytics bills that compound at the rate of 30-100% per year. The best example of this is perhaps Instacart, whose Snowflake payments grew from $28M in 2021 to an astounding $51M in 2022, before the spend was expected to reduce to ~$15M in 2023. The company later tried to explain the conundrum here, but it is an interesting example of how bills can run away before interventions are made. Do note this is also what Snowflake’s best financial metric (and the hidden secret of their premium valuation multiple), their Net Dollar Retention (NDR), indicates. An NDR of 135% (which was almost 200% at IPO, mind you) implies that on average, customers’ bills are growing at 35% year on year, while already being in the millions of dollars annually.
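To see what that retention figure means for an individual account, here is a quick compounding sketch (the starting bill is a hypothetical figure):

```python
# Compounding implied by a 135% net dollar retention: an existing customer's
# bill growing ~35% per year, on average, from a hypothetical $1M starting point.
bill = 1_000_000
NDR = 1.35

for year in range(1, 6):
    bill *= NDR
    print(f"Year {year}: ${bill:,.0f}")
# Year 5: ~$4.5M, without the customer ever signing a bigger contract up front.
```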
In my view, the runaway nature of the analytics bill is the less serious implication of the democratization paradigm. The more fundamental problem is that it wrests away any semblance or possibility of cost control from the hands of the CTO, CDO, CIO, CFO & database engineering teams. Which employee, after all, doesn’t enjoy being data-driven? With the many benefits that confers, and the goodwill it engenders for its enablers in IT, is there any way to return cost management to its core stakeholders amid the tsunami of Snowflake data warehouses being provisioned across the organization?
We believe there is. The answer is two-fold:
- Most Fortune 2000 enterprises at the $B revenue scale don’t have runaway and highly volatile analytics needs that require democratized provisioning. Provisioning could continue to stay centralized, but access to run queries on systems should be decentralized. An open-source DBMS vendor can likely size the enterprise’s OLAP compute nodes and storage requirements at least 6 months out, provision them, and charge a simple node-based or flat fee for services rendered. Re-provisioning is only needed at semi-annual or annual (or at worst quarterly) intervals, with step changes in workload patterns. The infrastructure, and consequently its billing, is no longer usage-based but fixed and well-controlled. The usage-based billing model is best suited to young startups with highly fluctuating analytical workloads, and should remain there.
- If access to database systems to run queries is decentralized, there need to be strict controls to ensure that bad (i.e., poorly authored & resource-intensive) SQL isn’t executed as-is, but is instead converted to good (resource-efficient) SQL, ideally automatically at a proxy above the database layer before being run (a minimal sketch follows below). This ensures that anyone can query a database, but only good, price-performant SQL is executed on the database.
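A deliberately naive sketch of that proxy idea follows. The rules, thresholds, and function name are hypothetical and purely illustrative; a real implementation would use a proper SQL parser and cost estimates rather than string heuristics:

```python
# Toy "SQL guardrail" run before a query ever reaches the warehouse.
def review_query(sql: str, default_limit: int = 10_000) -> str:
    normalized = " ".join(sql.strip().rstrip(";").split()).upper()

    # Rule 1: refuse unbounded SELECT * scans with no filter at all.
    if "SELECT *" in normalized and "WHERE" not in normalized:
        raise ValueError("Rejected: SELECT * without a WHERE clause")

    # Rule 2: cap result sets when the author forgot a LIMIT.
    if normalized.startswith("SELECT") and " LIMIT " not in normalized + " ":
        return sql.strip().rstrip(";") + f" LIMIT {default_limit}"

    return sql

# Example: this query is passed through with a safety LIMIT appended.
print(review_query("SELECT user_id, total FROM orders WHERE order_day = '2024-01-01'"))
```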
Such a paradigm would continue to confer the benefits of democratizing access while centralizing provisioning, wresting back control for database engineering teams and enterprise CXOs. Unfortunately, this isn’t even a remote possibility in the era of Snowflake. It seems to be a deliberate design choice of putting golden handcuffs on enterprise stakeholders: make them heroes in the eyes of the enterprise, and charge them (more than) a pound of flesh for it. The issue is: the hero epithet fades over time, as the pounds of flesh keep growing.
(3) The Unbundled Ecosystem
As the cloud data warehouse became ascendant, multiple startups joined the bandwagon to develop an ecosystem around the kingpin.
- Data pipelines to feed data in & out of the warehouse? 10+ startups led by Fivetran emerged to meet that need (fascinatingly, with a separate set for ingestion and a separate set to pipe results back to SaaS applications).
- Data transformation within the warehouse? An entirely separate set of companies led by dbt.
- Data observability? Another set of companies represented at the fore by Monte Carlo.
- Data cataloging? Companies like Atlan are leading the charge.
The above diagram summarizes the understandable excitement around the modern data stack and the sheer volume of companies in its womb. Interestingly, their pricing models are often silently indexed to your Snowflake bill. “Are you spending $1M a year on Snowflake? Wouldn’t you then spend $200k a year for robust pipelines to pipe data in & out?”
This is in many ways a great thing: an independent set of vendors coming up around Snowflake et al to solve core issues faster than the data warehouse vendor itself can, and indeed solving issues of significance that may not be a top priority for the data warehousing vendor. Dollars are raised, companies are built, dollars are returned, creating a virtuous cycle of economic growth and deepening the significance of the category in the broader spectrum of software technology.
What is the flip side? Who are the entities voting with their dollars and supporting this virtuous cycle of economic growth for founders and VCs? Enterprises.
A real-life customer story will cover the sum & substance of this point. On a recent call with an early Snowflake customer, I presented the effect of an unbundled ecosystem of software tools by independent vendors around a closed-source software vendor. “For every $100 you spend on Snowflake, you’re probably spending another $30-60 on these point vendors”, I asserted. The reply I received left an indelible mark. The customer said, “You’re absolutely right about the effect but wrong about the magnitude. For every $100 I’m spending on Snowflake, I’m spending more than $100 combined across the full set of vendors I need to keep my ecosystem alive. And there’s no way out.”
(4) The Layers of Lock-In
A closed-source, cloud-only data warehouse creates two fundamental layers of lock-in: data lock-in and vendor lock-in. This strong two-pronged grip on the heart of the customer’s business, their datasets, enables indefinite margin extraction and the scope for monopolistic pricing policies.
Envision a company with a $10M & growing Snowflake bill. Consider the margin structures: there are 2 cloud-native vendors commanding rich margins in the stack: Snowflake, with an ~80% gross margin and ~27% non-GAAP free cash flow margin (and note that this is despite a ~50% S&M cost); and the cloud vendor, commanding 60-70% gross and ~30% operating margins. There is no easy way to reclaim your data if you ever want to: indeed, moving your data to another cloud or on-premises will cost you ~$100k per petabyte in egress fees (not to mention operational costs). This is data lock-in: not only does data have its own gravity, making it operationally challenging to move a few petabytes out, it is also designed to be prohibitively expensive. The cloud is in effect a veritable Hotel California. As the lyrics famously state, “You can check out any time you like, but you can never leave.”
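The egress figure is easy to sanity-check from public list prices; a quick sketch, assuming ~$0.09/GB internet egress (tiered and negotiated rates will differ):

```python
# Back-of-the-envelope egress cost per petabyte at an assumed ~$0.09/GB list price.
egress_per_gb = 0.09
gb_per_pb = 1_000_000          # decimal petabytes, for simplicity

print(f"~${egress_per_gb * gb_per_pb:,.0f} per PB of data moved out")
```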
You are also unfortunately at the mercy of monopolistic pricing policies of two independent vendors of closed-source & almost entirely proprietary codebases. This is an example of vendor lock-in: since you are entirely dependent upon Snowflake to house your data and provide you software to keep your organization’s data analytics alive, you are also beholden to the pricing model they deem fit at any point.
Two examples in Snowflake’s context:
- Post IPO, Snowflake initiated a policy that made it nearly impossible for a customer who had bought credits worth, say, $1M in any previous year to commit to any number lower than the $1M in the current year. It is currently unclear if such policies continue to be in vogue.
- Every “value-added service” has its own cost. From re-clustering for better compaction or faster retrieval, to multi-cluster warehouses for higher concurrency, to even something so basic as query acceleration (a foundational feature of any warehouse, really), all of these services compound the base compute bill further. There is frankly no end to the number of such services that could be created over time to continue to milk the machine that keeps on giving.
This raises two first-principles questions:
- Is the end state of enterprise data entirely in the cloud (i.e., with data lock-in)?
- Will analytics always be served by closed-source data warehouses (i.e., with vendor lock-in)?
We posit not. Our thesis is that the end-state of enterprise data analytics will likely be a fine mix of the on-premise, private cloud, and public cloud models. This is because at the scale of tens & hundreds of petabytes, the datacenter is often more cost-efficient than the cloud, leading to an interesting reverse migration playing out amongst the largest technology companies. In fact, we have witnessed this firsthand amongst some of our largest customers. Further, there are regulated datasets and workloads which can never move to the cloud. A few examples:
- The case study of Dropbox indicates the massive quantum of savings (>$75M over 2 years back in 2017) achieved by moving back to their own datacenter.
- Ahrefs, the SEO web-crawler, suggests their AWS bill over 3 years would have been ~$900M, while they instead pay a small fraction of that (well below $50M per annum) to run their own data center in Singapore.
- Gartner estimates that 50% of critical enterprise applications will reside outside of centralized public cloud locations through 2027. Most blue-chip enterprises operate in a hybrid environment (e.g., Zscaler and Crowdstrike), where much of the regulated and mission-critical data is in their own data centers while their more recent products & initiatives are cloud-native.
In this nuanced end state, the enterprise needs a single OLAP vendor to query data seamlessly across all modes of deployment, in a single pane of glass. Given the sheer gravity of data, OLAP database software needs to go to where the data is, not the other way around. It is almost surely only an open source software vendor which can enable this flexibility, eliminating unnecessary data movement and data lock-in, and mitigating vendor lock-in with highly permissive (Apache 2, GPL, or MIT) open source licensing.
But until that blessed day dawns, we shall seemingly remain beholden to the tentacular grip of the closed-source cloud-only data warehouses, hoping that the next discount negotiation at annual renewal will be more fruitful than the last, while knowing deep in our hearts that the chips have been stacked against us since the start.
Conclusion
In this post, my intention is to simply present the facts of the matter, as they are. I leave the judgment to the reader, who may examine these facts and decide their merit for themselves.
I’ve also vividly painted the problem statement and provided hints at the solution. I haven’t presented our envisioned answer to the question: “What does customer-aligned pricing in the world of OLAP database systems look like?”
That is the subject matter of my next post. Stay tuned.