Machine learning has become one of the most reliably overpromised technologies in enterprise history. Organizations have deployed models expecting transformation and received, in many cases, expensive confusion. Dan Herbatschek, Founder and CEO of Ramsey Theory Group, argues the failure is almost never the algorithm’s fault.
The Problem Is the Problem Statement
Most machine learning projects fail before a single model is trained. They fail at the problem definition stage — when someone in a meeting room decides that a model is the answer before anyone has clearly articulated the question.
This is not a technology failure. It is a reasoning failure. And it is one that Herbatschek, an applied mathematician with a Columbia University background, has encountered consistently across organizations. The enthusiasm for machine learning has outpaced the discipline required to deploy it well. Teams reach for sophisticated models when the underlying problem has not been precisely defined, the data has not been rigorously examined, and the success criteria have not been operationalized.
The result is a system that optimizes convincingly for something other than what the organization actually needed.
What a Well-Formed Problem Looks Like
Before any model selection takes place, a well-formed machine learning problem requires four things: a clearly defined outcome variable, a defensible set of features, an explicit loss function that reflects real-world cost, and an honest accounting of data quality.
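To make the third requirement concrete, here is a minimal sketch of a loss that reflects real-world cost. It is an illustration only: the cost figures, function names, and toy data are hypothetical and do not come from Herbatschek or Ramsey Theory Group. It assumes a missed positive costs ten times as much as a false alarm, then compares two candidate decision thresholds.

```python
from typing import Sequence

# Hypothetical costs: a missed positive is assumed to be 10x as expensive as a false alarm.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0


def total_cost(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Sum the assumed business cost of every wrong decision at a given threshold."""
    cost = 0.0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 0:
            cost += COST_FALSE_NEGATIVE
        elif label == 0 and predicted == 1:
            cost += COST_FALSE_POSITIVE
    return cost


def accuracy(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Fraction of decisions that match the true label at a given threshold."""
    hits = sum((score >= threshold) == bool(label) for label, score in zip(y_true, scores))
    return hits / len(y_true)


def best_threshold(y_true, scores, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t))


# Toy data, invented for illustration.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]
candidates = [0.3, 0.5]

chosen = best_threshold(y_true, scores, candidates)
```

On this toy data the cost-minimizing threshold is the one with the lower accuracy, which is the gap between a standalone metric and an operationalized success criterion.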
None of these are engineering tasks. They are mathematical and conceptual tasks — the kind of work that requires stepping back from the implementation and asking structural questions about what the data actually represents and what the model is genuinely being asked to do.
Herbatschek’s training in applied mathematics is directly relevant here. The Lily Prize-winning thesis he completed at Columbia — an examination of mathematics, language, and time in the context of the Scientific Revolution — was fundamentally an inquiry into how the structure of formal systems determines the boundaries of what those systems can express. That question applies with equal force to modern machine learning: the formulation of a model shapes not only what it can learn but what it cannot learn, and practitioners who do not think carefully about formulation will not discover that boundary until the model is already in production.
When More Data Is the Wrong Answer
A common reflex when machine learning models underperform is to demand more data. Sometimes this is correct. Often it is not.
Model underperformance is frequently a signal that the problem formulation is wrong, not that the dataset is too small. Adding data to a misspecified model produces a more confident version of the same mistake. The confidence is the problem — a model that is wrong and uncertain is correctable; a model that is wrong and confident is dangerous.
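A small simulation can illustrate the point. The sketch below is a hypothetical setup, not drawn from any real project: it fits a straight line to data whose true relationship is quadratic, once with 100 points and once with 10,000. The function names, noise level, and sample sizes are assumptions chosen for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, standard error of b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return a, b, slope_se


def experiment(n, seed=0):
    """Fit a (misspecified) straight line to noisy samples of y = x^2 on [0, 1]."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    ys = [x ** 2 + rng.gauss(0.0, 0.05) for x in xs]
    a, b, slope_se = fit_line(xs, ys)
    # Error against the noise-free truth: the part that more data cannot fix.
    rmse_vs_truth = math.sqrt(sum((a + b * x - x ** 2) ** 2 for x in xs) / n)
    return slope_se, rmse_vs_truth


small = experiment(100)      # (slope standard error, RMSE against the true curve)
large = experiment(10_000)
```

With a hundred times more data, the reported standard error of the slope shrinks by roughly a factor of ten, while the error against the true curve barely moves. The model becomes more certain without becoming more right.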
The rigorous approach is to interrogate the model’s failure modes before scaling data collection. Where is the error concentrated? Does the model perform differently across subpopulations? Are there systematic patterns in what it gets wrong? These questions require mathematical thinking, not more compute.
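The second of those questions can be sketched in a few lines. The example below is a minimal illustration, assuming evaluation results are available as (segment, actual, predicted) records; the segment names and values are invented.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return {group: fraction of wrong predictions} for (group, y_true, y_pred) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true != y_pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation records: (segment, actual label, predicted label).
records = [
    ("new_customer", 1, 0), ("new_customer", 0, 1),
    ("new_customer", 1, 1), ("new_customer", 1, 0),
    ("returning", 1, 1), ("returning", 0, 0), ("returning", 0, 0), ("returning", 1, 1),
    ("returning", 0, 0), ("returning", 1, 1), ("returning", 0, 1), ("returning", 0, 0),
]

overall = sum(y != p for _, y, p in records) / len(records)
by_group = error_rate_by_group(records)
```

Here the overall error rate of about one in three hides the fact that the errors sit almost entirely in one segment, which points toward a formulation or coverage problem in that segment rather than a general shortage of data.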
Ramsey Theory Group’s approach to data-intensive application development is built around this kind of interrogation. The firm’s mandate — bridging organizational vision with technological execution — means that technical output is always evaluated against a clearly defined organizational objective, not a standalone accuracy metric.
Data Visualization as a Diagnostic Tool
One of Herbatschek’s areas of technical expertise is data visualization, and it is worth examining why visualization belongs alongside mathematics and machine learning as a core discipline rather than a presentation afterthought.
Visualization, done rigorously, is a form of hypothesis testing. Before fitting a model, visual inspection of the data’s distributional properties, correlation structure, and temporal patterns can surface problems that summary statistics conceal. Outliers, non-stationarities, label noise, and feature collinearity are all visible — if the practitioner knows what to look for and builds the tools to look.
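Two of those checks have simple numeric counterparts that usually sit behind the plots: pairwise correlation (what a correlation heatmap displays) and Tukey's interquartile-range rule (what a box plot's whiskers display). The sketch below uses only the Python standard library; the feature names and values are hypothetical.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles (Tukey's rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical features: the same height recorded in two units is perfectly collinear.
height_cm = [150, 160, 170, 180, 190, 175, 165, 155]
height_in = [h / 2.54 for h in height_cm]
collinearity = pearson(height_cm, height_in)

# Hypothetical income column (in thousands) with one suspicious entry.
income = [40, 42, 45, 47, 50, 52, 55, 400]
suspects = iqr_outliers(income)
```

Numbers like these do not replace the plot. They indicate where to look, and the plot shows whether the flagged structure is an artifact, an error, or a real feature of the data.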
For Herbatschek, visualization is not about making outputs legible to stakeholders. It is about understanding the data before committing to a modeling approach. That sequence — understand, then model — is the inverse of how many organizations operate, and the inversion is almost always costly.
Building Systems That Are Honest About Their Limits
The final frontier in responsible machine learning deployment is calibration: building systems that accurately represent their own uncertainty. Across all the predictions a model makes with 90% confidence, approximately 90% should turn out to be correct. Many deployed models do not meet this standard.
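One common way to check this property is a reliability table: group predictions by stated confidence and compare each group's average confidence with its observed accuracy. The sketch below is a minimal standard-library version with equal-width bins; the function names, bin count, and toy predictions are assumptions made for illustration.

```python
def reliability_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins and return
    (count, average confidence, observed accuracy) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    rows = []
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        rows.append((len(members), avg_conf, accuracy))
    return rows


def expected_calibration_error(confidences, correct, n_bins=5):
    """Count-weighted average gap between stated confidence and observed accuracy."""
    rows = reliability_table(confidences, correct, n_bins)
    total = sum(count for count, _, _ in rows)
    return sum(count / total * abs(avg_conf - acc) for count, avg_conf, acc in rows)


# Toy predictions: both models always claim 90% confidence.
claimed = [0.9] * 10
overconfident_hits = [1] * 6 + [0] * 4   # right 6 times out of 10
calibrated_hits = [1] * 9 + [0] * 1      # right 9 times out of 10

ece_overconfident = expected_calibration_error(claimed, overconfident_hits)
ece_calibrated = expected_calibration_error(claimed, calibrated_hits)
```

The overconfident model shows a gap of roughly 0.3 between what it claims and what it delivers, while the calibrated one shows a gap near zero, even though both report the same confidence on every prediction.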
Calibration is a mathematical property, and achieving it requires deliberate effort at the design stage. It also requires organizational honesty — a willingness to build systems that sometimes say “I don’t know” rather than systems that always produce an answer.
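A system that can say "I don't know" can be as simple as a decision rule with a reject option. The sketch below is one hypothetical way to express it; the threshold value and function name are assumptions, and in practice the threshold would be chosen from validation data and the cost of deferring to a human.

```python
from typing import Optional

ABSTAIN_BELOW = 0.75  # hypothetical confidence threshold, not a recommended value


def decide(probability_positive: float, threshold: float = ABSTAIN_BELOW) -> Optional[int]:
    """Return 1 or 0 when the model is confident enough, otherwise None ("I don't know")."""
    confidence = max(probability_positive, 1.0 - probability_positive)
    if confidence < threshold:
        return None
    return 1 if probability_positive >= 0.5 else 0


decisions = [decide(p) for p in (0.95, 0.55, 0.10)]
```

A rule like this is only as trustworthy as the probabilities feeding it, which is why abstention and calibration belong together at the design stage.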
For the organizations Ramsey Theory Group serves, this is a design principle, not an afterthought. Systems that are honest about their limits are systems that can be trusted in production, extended over time, and corrected when conditions change. That durability is the real return on investment in machine learning — not the model’s accuracy on a test set, but its reliability in the world.
About Dan Herbatschek
Dan Herbatschek is the Founder and CEO of Ramsey Theory Group, a firm specializing in bridging organizational vision with technological execution. An applied mathematics graduate of Columbia University — where he earned Summa Cum Laude honors, Phi Beta Kappa membership, and the Lily Prize for his thesis on mathematics, language, and time in the Scientific Revolution — Herbatschek brings expertise in Python, JavaScript, data visualization, machine learning, and scalable application architecture. He previously worked as a Data Management Consultant in New York.