I was reading the original paper on “Hierarchical Navigable Small Worlds (HNSW)” https://arxiv.org/abs/1603.09320, which I found much easier to understand than all the YouTube videos and articles I had tried before. HNSW is a probabilistic data structure for approximate nearest-neighbor search in multi-dimensional space.
One of its practical applications is searching for semantically close objects. Reading that paper, along with some other activities, made me curious whether I could quickly implement a recommendation system that combines three things: HNSW, moving averages, and randomness.
I was curious about using an averaged vector embedding for recommendation purposes, and then
I started wondering whether, instead of averaging, I should try other statistics like the median or top percentiles
to focus on more frequent scenarios and reduce the influence of outliers.
And then the question was: imagine you want to use this in production; how can you compute
an averaged embedding for millions of users, ideally with instant updates and without offline bulk data processing?
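One way to get instant updates is to keep only a small per-user state and fold every new interaction into it as it arrives. Below is a minimal sketch, assuming item embeddings come in as plain lists of floats and per-user state lives in memory or a key-value store; the names (`UserProfile`, `update_profile`, `EMBEDDING_DIM`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

EMBEDDING_DIM = 64  # assumed dimensionality of item embeddings


@dataclass
class UserProfile:
    count: int = 0
    mean: List[float] = field(default_factory=lambda: [0.0] * EMBEDDING_DIM)


def update_profile(profile: UserProfile, item_embedding: List[float]) -> None:
    """Exact running mean: O(dim) work per event, no offline reprocessing needed."""
    profile.count += 1
    for i, x in enumerate(item_embedding):
        # new_mean = old_mean + (x - old_mean) / n
        profile.mean[i] += (x - profile.mean[i]) / profile.count


def update_profile_ema(profile: UserProfile, item_embedding: List[float],
                       alpha: float = 0.1) -> None:
    """Exponential moving average: recent interactions weigh more than old ones."""
    for i, x in enumerate(item_embedding):
        profile.mean[i] = (1 - alpha) * profile.mean[i] + alpha * x
```

The running mean stays exact with constant work per event, while the EMA variant trades exactness for recency; either vector can then be used directly as the query point against an HNSW index.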
A few days ago a manager from a sister team asked me why I have “Chaotic good” in the “What I do” field of my work profile. The question has been occupying my mind, so I need to drain it into something.
In short, it is a silly meme from a few years ago that I decided to use as a work motto instead of “stupidity and courage”, as I felt the new one better reflected the type of work I was doing at the time.
I used to work closely with incredibly smart people who dealt with things like data sharding on a daily basis, and from them I learned a lot on the topic. Later I moved to a different role where that knowledge was not needed, and it faded away over time. Here I’m trying to reclaim that long-forgotten knowledge.
Sharding is the process of assigning an item to a shard, a smaller chunk of data out of a large database or other service. The general idea is that we can distribute data or a service across multiple locations
to handle large volumes of data or serve more requests, and with replication we can scale even further and make the system more resilient. But we need clear rules for how we assign partitions, a.k.a. shards, so
that we can route requests to the right location.
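As an illustration, here is a minimal sketch of hash-based shard assignment, assuming a fixed number of shards and string keys; the routing table mapping shard ids to hosts is hypothetical.

```python
import hashlib

NUM_SHARDS = 16  # assumed, fixed up front


def shard_for(key: str) -> int:
    """Deterministically map a key to a shard id in [0, NUM_SHARDS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Hypothetical routing table: shard id -> host that owns that shard.
ROUTING = {shard: f"db-host-{shard % 4}" for shard in range(NUM_SHARDS)}


def route(key: str) -> str:
    return ROUTING[shard_for(key)]


print(shard_for("user:42"), route("user:42"))
```

The downside of plain modulo assignment is that changing the number of shards remaps almost every key, which is why schemes like consistent hashing exist.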
One of the questions I often ask in my interviews is to design a log processing library:
You need to write a library for processing logs in the following format:
timestamp<TAB>message
The library will be handed over to a different team for further maintenance and improvements, so maintainability and extensibility are the most important requirements.
The library needs to support the following operations out of the box (a sketch of one possible interface follows the list):
filtering
counting
histograms
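Here is a minimal sketch of one possible shape for such a library, assuming lines follow the timestamp<TAB>message format above; all names are illustrative, not a reference answer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class LogRecord:
    timestamp: float
    message: str


def parse(lines: Iterable[str]) -> Iterator[LogRecord]:
    """Turn raw timestamp<TAB>message lines into records."""
    for line in lines:
        ts, _, msg = line.rstrip("\n").partition("\t")
        yield LogRecord(float(ts), msg)


def filter_records(records: Iterable[LogRecord],
                   predicate: Callable[[LogRecord], bool]) -> Iterator[LogRecord]:
    return (r for r in records if predicate(r))


def count(records: Iterable[LogRecord]) -> int:
    return sum(1 for _ in records)


def histogram(records: Iterable[LogRecord],
              key: Callable[[LogRecord], object]) -> Counter:
    """Bucket records by an arbitrary key, e.g. hour of day or log level."""
    return Counter(key(r) for r in records)


# Usage: count ERROR lines per hour-long bucket.
# errors = filter_records(parse(open("app.log")), lambda r: "ERROR" in r.message)
# print(histogram(errors, key=lambda r: int(r.timestamp // 3600)))
```

Keeping each operation a small, composable function over an iterator of records is one way to address the extensibility requirement: a new operation is just another function, and a new input format only needs a new parser.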
The original version also included some language- and background-specific expectations that I never include in my assessment, because I feel they put the candidate in a position where they need to read my mind to meet those expectations.
One of the questions I really love asking during coding interviews is this:
Given a continuous stream of words, a dictionary on disk, and a cost associated with reading from disk, create a stream processor that returns true when a word exists in the dictionary, while minimizing the cost of reading from disk.
Example:
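One common direction (a hedged sketch of a possible solution, not the original example) is to put a bounded in-memory LRU cache in front of the dictionary so that repeated words never pay the disk cost twice; `load_from_disk` below stands in for whatever expensive lookup the real dictionary needs.

```python
from collections import OrderedDict
from typing import Callable, Iterable, Iterator


class CachedDictionary:
    def __init__(self, load_from_disk: Callable[[str], bool], capacity: int = 10_000):
        self._load = load_from_disk     # the expensive disk lookup we want to minimize
        self._cache = OrderedDict()     # word -> bool, kept in recency order
        self._capacity = capacity

    def contains(self, word: str) -> bool:
        if word in self._cache:
            self._cache.move_to_end(word)        # mark as recently used
            return self._cache[word]
        result = self._load(word)                # one disk read, then remembered
        self._cache[word] = result
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)      # evict the least recently used entry
        return result


def process(words: Iterable[str], dictionary: CachedDictionary) -> Iterator[bool]:
    """Yield True/False for each word in the stream, touching disk as rarely as possible."""
    for word in words:
        yield dictionary.contains(word)
```

Note that negative results are cached too, since asking the disk again about a word that is not in the dictionary is just as expensive as a hit.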
Recently I was reading through a bunch of technical designs and noticed a common mistake when it comes to
writing user stories and requirements: assuming a solution. The biggest issue for me when I write requirements myself is that whenever I include a part of the solution I’m thinking about into the requirements, it limits my ability to innovate, since I’m bound to that specific solution. In many cases I observed improvements in my designs when I focused on what the customer needs rather than on fulfilling a requirement tied to my first, and probably not brightest, idea.
It is interesting to observe that any endeavor where attention is one of the key metrics or key drivers,
regardless of the company size, ends up in the same hell pit of attention craving and optimization for
it. Even small single-person blogs that teach us to be a better person, engineer, or something else
are prone to that. Many of them, those I used to read, slowly became “Energy Vampires” to me, constantly
seeking my attention.
Every so often I interview senior software engineers for Amazon, where I ask more or less the same questions in
each interview. One of them requires adding caching logic to get better results. I’ve noticed that interviewees make one of two mistakes that block them from
standing out as software engineers:
they don’t know, or don’t talk about, the conditions under which a cache performs best, primarily how the request frequency distribution affects cache performance (see the simulation sketch after this list);
they don’t know the standard library of the programming language of their choice.
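To make the first point concrete, here is a small simulation sketch (not part of the interview itself): the same LRU cache sees very different hit rates depending on how skewed the request distribution is. All sizes and numbers are arbitrary, chosen only to make the effect visible.

```python
import random
from collections import OrderedDict


def hit_rate(requests, capacity):
    """Fraction of requests served from a simple LRU cache of the given capacity."""
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return hits / len(requests)


random.seed(0)
keys = list(range(10_000))
uniform = [random.choice(keys) for _ in range(100_000)]
# Zipf-like skew: a handful of keys dominate the traffic, as in most real request logs.
weights = [1.0 / (rank + 1) for rank in range(len(keys))]
skewed = random.choices(keys, weights=weights, k=100_000)

print("uniform:", hit_rate(uniform, capacity=1_000))  # roughly capacity / keyspace
print("skewed: ", hit_rate(skewed, capacity=1_000))   # much higher: hot keys stay cached
```

With uniform requests the hit rate is roughly the capacity-to-keyspace ratio; with a Zipf-like skew the hot keys stay resident and the hit rate climbs sharply, which is exactly the property that makes adding a cache worthwhile.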