
I stopped my research quite some time ago. I hope to get back to it soon, but a few words of excuse are in order, I believe.
I was beyond busy. I was doing my daily job, desperately looking for a new one (which ended very successfully), and doing a capstone project to finish the program, all while trying to squeeze out at least some time for the two little ones. I still owe them a lot for not being there these last few months. Anyway: the job is changed and all the training and school projects are finished, so now I just need to pay off the time debts I took on, and then I can get back to my research.
But not all this time was lost. It wasn’t only about getting some more or less stupid certificate, even if it sometimes felt like it, and even if my engagement was lower than at the beginning, when I desperately tried to grasp all those new concepts. One lesson was really valuable. It’s about… one person will call it art, another will name it intuition, a third will use the term magic, a fourth will say “domain knowledge.” All those terms are extremely vague and none is better than the others. I just like the word magic, so I’ll stick to it.
At the beginning of the program I often heard that data science is not about algorithms, not about engineering, but more of an art. That domain knowledge is more important than understanding the deep tricks of some technical skills. This caused my resentment—engineering should be about precision, numbers, reason, not some gut feeling and arbitrary choices, so my long experience was being challenged.
Through doing a project without any access to someone with wide knowledge of the topic, and fighting my way against the directions suggested by the notebook sketch I received, I now know what the teachers meant. The funny thing is that I can apply the same lesson to the engineering practice I’ve been doing for years; I just wasn’t aware of it.
So the short point is that computers and the software running on them are tools. If they are to make any sense, they must fulfill some utility function, even if that utility is just providing fun. There is no reason for beautiful and clever algorithms that serve no purpose to the user. No matter how clever a programmer or engineer you are, you will do a poor job if you don’t know or care what the user needs.
An example I encountered in my project was that what I was asked for made no sense. My goal was to provide a computer vision model for detecting malaria parasites in red blood cells. But first I did a little digging. The data was clean and nice: images of cells extracted from thin blood smears. Such a smear contains about 80 cells, while malaria often infects less than one cell per hundred. Using the binomial probability formula, we have:

$$P(X = k) = \binom{n}{k} \, p^k (1-p)^{n-k}$$

and for not having any infected cells in a sample, $k = 0$, so it becomes:

$$P(X = 0) = (1-p)^n$$

where $p$ is $1/100$, as only one cell per 100 is infected, and $n$ (the number of cells in the sample) is 80, so it becomes:

$$P(X = 0) = 0.99^{80} \approx 0.45$$
So we have almost a 50/50 chance that we won’t detect malaria; it’s a no-go. End of discussion, job finished… not really. One could say so, but I’m curious, so I need to dig deeper and see if there is something we can learn, change, or adjust. It’s just an obstacle, not a deal-breaker.
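The back-of-the-envelope number above is easy to check in a couple of lines of Python (a sketch of the same arithmetic, not code from the project):

```python
# Probability that a sample of 80 cells from an infected patient
# contains no infected cell at all, with a 1-in-100 per-cell
# infection rate.
p_infected = 1 / 100   # roughly one infected cell per hundred
n_cells = 80           # cells in one thin-smear sample

p_all_clean = (1 - p_infected) ** n_cells
print(f"P(no infected cell in the sample) = {p_all_clean:.3f}")
# → P(no infected cell in the sample) = 0.448
```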
What’s interesting is that all this analysis was done without writing a single line of code; no technical skill of a programmer or AI practitioner was required, nor any deep knowledge. It’s all the magic of connecting various pieces of information, various areas of knowledge, experience that comes to mind by intuition, gut feeling as one could say.
The next point was also about following intuition. Just to gather some insight I ran a few Google queries, and it turned out that malaria is widespread and kills people where access to healthcare is extremely poor, as is the GDP of those countries. So thin blood smears, which require quite a high level of skill to prepare, are also not the best solution… another no-go.
Thirdly, we need to detect the presence of malaria in a patient, not in a single cell. So I ran some math to figure out what accuracy we need at the cell level to detect malaria on a reasonable sample size (350 cells): to detect 95% of cases (assuming we accept false positives in 15% of cases), we need an accuracy of 99.55%, which is a really, really high number.
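I don't know the exact formulation behind that 99.55% figure, but one plausible way to explore the patient-level requirement is to treat each of the 350 cells as an independent Bernoulli trial and flag the patient once at least a threshold number of cells tests positive. Everything below is my own sketch of that idea: the single `cell_acc` for both classes and the `threshold` rule are assumptions, not the post's math.

```python
from math import comb

def patient_level_rates(n_cells=350, infection_rate=0.01,
                        cell_acc=0.9955, threshold=1):
    """Patient-level detection and false-alarm probabilities, given a
    per-cell classifier accuracy (simplification: the same accuracy is
    assumed for infected and healthy cells)."""

    def binom_tail(n, p, k):
        # P(X >= k) for X ~ Binomial(n, p)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    # Healthy patient: every positive cell is a false positive.
    false_alarm = binom_tail(n_cells, 1 - cell_acc, threshold)

    # Infected patient: a cell tests positive if it is infected and
    # caught, or healthy but misclassified.
    p_positive = (infection_rate * cell_acc
                  + (1 - infection_rate) * (1 - cell_acc))
    detection = binom_tail(n_cells, p_positive, threshold)
    return detection, false_alarm
```

Playing with `threshold` quickly shows how tight the trade-off is: with hundreds of cells per patient, even a fraction of a percent of per-cell error piles up, which is why the required accuracy comes out so high.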
Only then did I run some models to check how computer vision handles malaria parasites, but now my approach was: this is an experiment to gather insight, a no-go for the product, so there is no need to waste a lot of resources developing it super carefully. So yeah, I managed 98.15% accuracy after a few tries.
And then comes something that isn’t in the book. Just by intuition, a question came: what’s the problem with those few (48 out of 2,600) images that are misclassified? It turned out that about half of them had bad labels, maybe more (I’m no diagnostician, so that would need to be confirmed). After adjusting for the mislabeled test data, it wasn’t a big deal to get 99% accuracy, and the effort I put into building the model was minor compared to the effort spent understanding the context.
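The arithmetic behind those percentages is easy to verify (a quick sketch; the counts come from the text above, and treating exactly half of the errors as label noise is my reading of "about half"):

```python
n_test = 2600
n_errors = 48                      # misclassified test images

raw_accuracy = 1 - n_errors / n_test
print(f"raw accuracy:      {raw_accuracy:.2%}")       # → 98.15%

n_label_errors = n_errors // 2     # "about half of them had bad labels"
adjusted_accuracy = 1 - (n_errors - n_label_errors) / n_test
print(f"adjusted accuracy: {adjusted_accuracy:.2%}")  # → 99.08%
```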

The same applies to building the model. Looking carefully at the data and understanding the algorithms can save a lot of work. It was obvious that parasites are not distinguished by some shape; a parasite is just an area of different color inside a cell. This lets you rule out quite a lot of techniques one might try for other kinds of data, like letters, or more common objects like birds, shoes, and whatever else one might need to detect.
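To make the color-area idea concrete, here is a toy sketch (my illustration, not the actual model): since a parasite shows up as a patch of different color rather than a shape, even a crude "fraction of unusually dark pixels" test carries signal, and the thresholds below are illustrative guesses rather than tuned values.

```python
import numpy as np

def stained_fraction(cell_img, darkness_threshold=0.35):
    """Fraction of pixels darker than the stain threshold.

    cell_img: 2-D float array of grayscale values in [0, 1].
    """
    return float((np.asarray(cell_img) < darkness_threshold).mean())

def looks_infected(cell_img, min_fraction=0.01):
    """Flag a cell if enough of its area is a darker patch."""
    return stained_fraction(cell_img) >= min_fraction

# Toy example: a uniform "healthy" cell vs. one with a dark patch.
healthy = np.full((32, 32), 0.8)
infected = healthy.copy()
infected[10:14, 10:14] = 0.1   # small dark region, like a stained parasite
print(looks_infected(healthy), looks_infected(infected))  # False True
```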
So to summarize. It’s like in my everyday engineering. I’m not particularly brilliant, nor am I highly intelligent: when searching for work I fail almost all the “write an algorithm in 15 minutes” tests and the 100-question cognitive tests. But when doing actual work I’m somehow more efficient than others. That is because I am careful (lazy, one could say) and I avoid doing stuff that is not required. I spend a lot of time asking questions instead of figuring out the answers. In most engineering work, brute-forcing your way to the correct approach is not efficient; there are just too many ways to solve a problem.
But here is the important part: the term “domain knowledge” is misleading. It’s about connecting dots between many domains: understanding digital images, understanding statistics, understanding computer science, understanding neural networks, understanding blood analysis, understanding the economic situation… that is no single domain.
And one final thing. I made a presentation. That’s something I always felt bad about; I perceive my presentation skills as being at the very low end of the spectrum, but then… I’m more than glad I was forced to do it. I believe it ended up really nice. The thing is that I really, really hate doing something slipshod. If something is worth doing, then it’s worth doing right, with passion. If not, then why bother doing it at all?
So the magic ingredient is love, passion, curiosity: not brute force, awesome intelligence, or extremely vast or deep knowledge. And here is the presentation. The nice thing is that I made it visually appealing: there are images, not just walls of text. I’m proud that I can make stuff that looks good.
Now, I just need to pay back the debt of not spending time with my children and figure out all the stuff in the new job, and then I can get back to beating the random forest… and then to finding the best clustering algorithm, and then… There are so many things to do and so little time.
2 responses to “The magic ingredient”
Super insightful, Adas! I like how you share your logic—yes, a form of art!