Liangjie Hong


KDD 2024

This year’s KDD was in Barcelona, Spain. It was my first time visiting the country and the city. Barcelona has a unique charm, combining its old Roman districts with 20th-century art. The conference had more than 2,300 participants, another year of high excitement despite economic challenges that led fewer companies to set up recruiting booths.

Here are some highlights from the conference.

Tutorials

Workshops

Selected Papers

Ranking/Recommender System

  • Enhancing Pre-Ranking Performance: Tackling Intermediary Challenges in Multi-Stage Cascading Recommendation Systems [Link]
    • From Ant Group
    • Large-scale recommender systems have three stages: Recall, Pre-Ranking, and Ranking. Pre-Ranking is a selection stage that narrows the candidate set coming out of Recall before it reaches Ranking. The paper proposes a framework to tackle sample bias and model-consistency issues in such three-stage systems. The deployed system showed a 7% improvement over the baseline.
    • [Note]: Worth reading to see how different companies tackle multi-stage challenges.
  • Achieving a Better Tradeoff in Multi-stage Recommender Systems through Personalization [Link]
    • From Meta
    • The authors’ key observation is that, all else being equal, ranking more items improves the overall objective but with diminishing returns. Based on this, they proposed a framework that exploits the trade-off between performance and computation cost. Real-world experiments showed the effectiveness of the proposed framework.
    • [Note]: Worth reading to see how different companies tackle multi-stage challenges.
  • On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation [Link]
    • From ShareChat
    • The paper formally presents the assumptions necessary to consider DCG an unbiased estimator of online reward, deriving the metric from first principles and linking it to off-policy estimation. The authors then prove that the widespread practice of normalizing DCG renders nDCG inconsistent with DCG, in that the ordering given by nDCG can differ from that given by DCG, and they provide empirical evidence. Results from offline and online experiments on a large-scale recommendation platform show that the unbiased DCG metric strongly correlates with online metrics over time, whereas nDCG does not; and while differences in online metrics directionally align with differences in both nDCG and DCG, the latter enjoys improved sensitivity for detecting statistically significant online improvements.
    • [Note]: Worth reading, as the results are a bit surprising and it is worth examining whether they apply to a wide range of use cases (a small DCG/nDCG sketch follows this list).
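As a quick illustration of why the two metrics can disagree across result lists, here is a minimal Java sketch (my own example, not the paper's estimator) computing DCG and nDCG; because nDCG divides by each list's own ideal DCG, two lists can be ordered differently by the two metrics.

import java.util.Arrays;

public class DcgExample {
    // DCG for a ranked list of graded relevance labels: sum of rel / log2(rank + 1).
    static double dcg(double[] rels) {
        double score = 0.0;
        for (int i = 0; i < rels.length; i++) {
            score += rels[i] / (Math.log(i + 2) / Math.log(2));
        }
        return score;
    }

    // nDCG divides by the DCG of the ideally (descending) ordered list.
    static double ndcg(double[] rels) {
        double[] sorted = rels.clone();
        Arrays.sort(sorted);
        double[] ideal = new double[sorted.length];
        for (int i = 0; i < sorted.length; i++) ideal[i] = sorted[sorted.length - 1 - i];
        double idcg = dcg(ideal);
        return idcg == 0.0 ? 0.0 : dcg(rels) / idcg;
    }

    public static void main(String[] args) {
        // Two hypothetical result lists with different ideal DCGs:
        // DCG prefers A, while nDCG prefers B.
        double[] a = {3, 2, 0, 1};
        double[] b = {2, 2, 2, 0};
        System.out.printf("A: DCG=%.3f nDCG=%.3f%n", dcg(a), ndcg(a));
        System.out.printf("B: DCG=%.3f nDCG=%.3f%n", dcg(b), ndcg(b));
    }
}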

Advertising

  • Offline Reinforcement Learning for Optimizing Production Bidding Policies [Link]
    • From Meta
    • The authors proposed a generalizable approach to optimizing bidding policies in production environments by learning from real data using offline reinforcement learning. The approach can be used to optimize any differentiable base policy and requires only data generated by the base policy itself. It does not incur additional infrastructure, safety, or explainability costs, as it directly optimizes the parameters of existing production routines without necessarily replacing them with black-box models such as neural networks.
    • [Note]: Worth reading as it is built on top of existing production systems and to exploit offline RL to improve them.
  • Joint Auction in the Online Advertising Market [Link]
    • From Meituan
    • The authors proposed a joint advertising model termed “Joint Auction”, allowing brand suppliers and stores to collaboratively bid for advertising slots, catering to the needs of both.
    • [Note]: Worth reading as it shows a novel product offering with a newly designed auction mechanism.
  • Bi-Objective Contract Allocation for Guaranteed Delivery Advertising [Link]
    • From Chinese Academy of Sciences and Alibaba
    • Guaranteed Delivery (GD) advertising works in two different stages: the offline selling stage and the online serving stage. The former deals with contract allocation, while the latter fulfills the impression allocation of signed contracts. Existing work usually handles these two stages separately; for example, contracts are formulated offline without considering practical situations in the online serving stage. Therefore, this paper addresses bi-objective contract allocation for GD advertising, which maximizes the impressions (i.e., ad resource assignments) allocated to newly incoming advertising orders while controlling the balance of the inventories. Since the proposed problem is high-dimensional and heavily constrained, the authors design an efficient local search that focuses on the two objectives alternately.
    • [Note]: Worth reading as a good reference for GD knowledge in general.
  • Optimized Cost Per Click in Online Advertising: A Theoretical Analysis [Link]
    • From The Hong Kong University of Science and Technology
    • Optimized Cost Per Click (OCPC) and Optimized Cost Per Mille (OCPM) have emerged as the most widely adopted pricing models in the online advertising industry. However, the existing literature has yet to identify the specific conditions under which these models outperform traditional pricing models like Cost Per Click (CPC) and Cost Per Action (CPA). To fill this gap, the paper builds an economic model that theoretically compares OCPC with CPC and CPA, incorporating out-site scenarios and outside options as two key factors. The analysis reveals that OCPC can effectively replace CPA by tackling the problem of advertisers strategically manipulating conversion reporting in out-site scenarios, where conversions occur outside the advertising platform. Furthermore, OCPC exhibits the potential to surpass CPC in platform payoffs by providing higher advertiser payoffs and consequently attracting more advertisers.
    • [Note]: Worth reading as a background paper for OCPC and OCPM.
  • An Efficient Local Search Algorithm for Large GD Advertising Inventory Allocation with Multilinear Constraints [Link]
    • From Chinese Academy of Sciences and Alibaba
    • As advertisers’ requirements become more diverse and fine-grained, the focus-ratio requirement, which states that the portion of a designated contract’s allocated impressions on focus media among all possible media should be greater than that of another contract, often appears in business scenarios. Taking these requirements into account makes GD advertising inventory allocation harder, as the focus-ratio requirements involve non-convex multilinear constraints. Existing methods that rely on convexity are not suitable for this problem, while mathematical programming or constraint-based heuristic solvers are unable to produce high-quality solutions within the time limit. Therefore, the authors proposed a local search framework to address this challenge, incorporating four new operators designed for handling multilinear constraints and a two-mode algorithmic architecture.
    • [Note]: Worth reading as a good reference for GD knowledge in general.

Experimentation

  • Learning Metrics that Maximise Power for Accelerated A/B-Tests [Link]
    • From ShareChat
    • The authors proposed to accelerate A/B tests by learning metrics from short-term signals that directly maximize the statistical power they harness with respect to the North Star metric. They showed that existing approaches are prone to overfitting, in that higher average metric sensitivity does not imply improved type-II errors, and proposed instead to minimize the p-values a metric would have produced on a log of past experiments. Empirical results show statistical power increases of up to 78% when the learnt metrics are used stand-alone, and up to 210% when used in tandem with the North Star.
    • [Note]: Worth reading for better learning proxy metrics.
  • Choosing a Proxy Metric from Past Experiments [Link]
    • From Google
    • The authors introduced a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. The procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem that depends on the true latent treatment effects and the noise level of the experiment under consideration. They then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem.
    • [Note]: Worth reading for better learning proxy metrics.
  • Metric Decomposition in A/B Tests [Link]
    • From Airbnb
    • In this article, the authors proposed a new direction for sensitivity improvement via treatment-effect augmentation, whereby a target metric of interest is decomposed into components with high signal-to-noise disparity. Inference in the context of this decomposition is developed using both frequentist and Bayesian theory.
    • [Note]: Worth reading to understand how to further reduce metrics variance.
  • False Positives in A/B Tests [Link]
    • From A/B experimentation guru: Ronny Kohavi
    • The authors begin by motivating why false positives are expensive in many software domains. They offer several approaches to estimate the true success rate of experiments given the observed “win” rate (statistically significant positive improvements), with examples from Expedia and Optimizely. They then offer a modified experimentation procedure, based on sequential group testing, that selectively extends experiments to reduce false positives and increase power at a small increase in runtime. They conclude with a discussion of the difference between ideas and experiments in practice, two terms that are often incorrectly used interchangeably.
    • [Note]: A lot of practical suggestions for A/B testing.

LLM Labels

  • Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. [Link]
    • From Google
    • The authors proposed two methods, based on prediction-powered inference and conformal risk control, that utilize computer-generated relevance annotations to place reliable confidence intervals (CIs) around IR evaluation metrics. The proposed methods require only a small number of reliable annotations, from which they statistically analyze the errors in the generated annotations; using this information, they can place CIs around evaluation metrics with strong theoretical guarantees.
    • [Note]: A really cool paper illustrating how to systematically evaluate LLM-generated labels (a rough sketch of the general prediction-powered idea follows below).
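For intuition, here is a minimal, hedged Java sketch of the general prediction-powered inference recipe for a mean-style metric. This is my own illustration of the basic PPI idea, not the paper's exact estimators (and it ignores the conformal risk control variant): machine-generated labels on a large pool are corrected by a small set of human labels, and the interval width reflects the variance of both terms.

public class PpiSketch {
    // Prediction-powered point estimate and approximate 95% CI for a mean metric.
    // llmAll:   machine-generated labels on the large pool (N items)
    // llmSmall: machine-generated labels on the small human-annotated subset (n items)
    // human:    human labels on that same subset (n items)
    static double[] ppiMeanCi(double[] llmAll, double[] llmSmall, double[] human) {
        double mLlm = mean(llmAll);
        double[] rect = new double[human.length];
        for (int i = 0; i < human.length; i++) rect[i] = human[i] - llmSmall[i]; // rectifier term
        double theta = mLlm + mean(rect);
        double se = Math.sqrt(variance(llmAll) / llmAll.length + variance(rect) / rect.length);
        return new double[]{theta - 1.96 * se, theta + 1.96 * se};
    }

    static double mean(double[] x) {
        double s = 0.0;
        for (double v : x) s += v;
        return s / x.length;
    }

    static double variance(double[] x) {
        double m = mean(x), s = 0.0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1);
    }
}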


KDD 2022

Last week, I attended KDD 2022 in Washington, DC, the first in-person conference I had attended since WSDM 2020, almost two and a half years ago in Houston. Although the COVID-19 pandemic still had a sizeable impact on the event, as the majority of researchers and companies from China could not make it to the conference, participants demonstrated high enthusiasm and hoped everything could return to the pre-pandemic norm.

Let me share a couple of thoughts regarding the conference.

First of all, Graph Neural Networks (GNNs) are literally everywhere, with applications in almost every domain and tremendous interest from both industry and academia, which reminded me of how Graph Mining went viral in the early 2000s with the rise of Social Networks and the World Wide Web. Surprisingly, topics like PageRank, Community Detection, and the Graph Laplacian were rarely mentioned. Of course, this is not just the case for Graph Mining; other sub-fields such as Recommender Systems, Text Mining, and Information Retrieval have also been completely rewritten since the surge of Deep Learning.

Secondly, Causal Inference has become a much more mainstream topic than before and is utilized in many applications, including industry solutions. One thing worth mentioning is that many researchers still use Causal Inference merely as an intermediate step to further improve prediction performance on downstream tasks, while how Causal Inference could bring more insight into the problems at hand is broadly ignored; for instance, few papers talked about causal effects and their estimation. Moreover, the majority of Causal Inference papers are essentially observational studies, yet they do not offer discussions around the data and assumptions necessary for meaningful and robust causal estimation, which could be misleading and risky.

Thirdly, some research topics have declined dramatically. For instance, while Optimization was once a hot topic given the popularity of large-scale machine learning, the need to understand and improve it has been greatly reduced by the rise of various Deep Learning frameworks; most researchers can reliably use these frameworks to achieve reasonable performance without understanding the underlying optimization algorithms. In addition, it is surprising to observe that Probabilistic Modeling has been all but abandoned: few papers now start with assumptions about how the data is generated. Although it is arguably not applicable to all scenarios, Probabilistic Modeling has its own advantages in describing certain data-generation assumptions and helping us understand certain problems better, a perspective quite different from, and possibly complementary to, the computational point of view that has dominated the modeling language since the rise of Deep Learning.

Lastly, all the keynote speakers delivered excellent talks, each with their own color. Lise Getoor from UC Santa Cruz is a prominent scholar from the last wave of Graph Mining. Some techniques mentioned in her talk, such as Collective Classification, were widely used and even included in textbooks in the 2000s, yet are completely ignored today. Her research is still centered on classical Graph Mining, especially reasoning over graph structures. Milind Tambe from Harvard University gave a surprisingly good talk: rather than focusing on technologies and algorithms, he told stories about applying AI for social good and building applications that impact people’s lives in the real world, which sent a strong message to the conference. The last speaker, Shang-Hua Teng from USC, gave a rather theoretical talk on how to balance heuristics and theory. One interesting point was that he clearly categorized quite a number of widely used algorithms as heuristics, even though practitioners might consider them theoretically oriented.

Compared to KDD 2019, which I attended in Alaska three years ago, this year’s KDD differed in the number of participants, the number of companies, and the diversity of research topics. It seems quite clear that the latest wave driven by Deep Learning is turning a corner. I’m looking forward to meeting friends next time in LA.


CIKM 2018 Papers Notes

  • Wan et al. [1] discussed how to leverage complementarity and loyalty to learn better item representations from shopping baskets. The main idea is to learn two sets of representations from items co-occurring in the same baskets, where the two different representations capture item complementarity. Building on top of the learned representations, the authors introduced an algorithm called adaLoyal to determine to what extent a purchase is driven by frequency-based behavior (a.k.a. loyalty) versus by the item representations.
  • Zamani et al. [2] discussed how to learn a single neural ranking function to replace two-phase approaches, where the first phase is usually a retrieval model and the second phase a more complex model such as a neural net. The main idea of the paper is to utilize L1 regularization to learn sparse representations for both queries and documents with certain desirable properties. These representations can then be used as an index to retrieve documents efficiently at inference time.
  • Ren et al. [3] discussed a long-standing problem in online advertising: how to allocate credit to user behaviors along the journey toward an ad conversion. Traditionally, credit allocation is done either by heuristics or by simple models like Logistic Regression. In this work, the authors utilized sequential modeling, notably RNNs, to model user behaviors, taking different pre-conversion behavior types into consideration. In addition, the paper introduced a way to conduct offline evaluation of credit allocation.
  • Gysel et al. [4] discussed how to utilize similar items to improve product search. The similarity is primarily determined by text similarity and semantic similarity, obtained by pushing similar items to be close in the latent semantic space.
  • Wang et al. [5] discussed a unified framework for learning-to-rank problems in which the traditional LambdaRank can be explained as an optimization procedure for an objective function that was previously unknown. Under this framework, the authors proposed a generic EM algorithm to solve learning-to-rank problems and demonstrated several use cases of the framework with different setups, yielding different loss functions that optimize different metrics.
  • Zhang et al. [6] discussed how to combine multiple information sources on search engine result pages to infer better overall relevance. In particular, they exploited visual patterns from search result screenshots, title semantics, snippet semantics, and HTML DOM structure semantics. All these modules are combined through an attention layer whose weights are learned jointly. Another contribution of the paper is the release of such a dataset with graded relevance judgments. This paper won the Best Paper award.
  • Jun Hu and Ping Li [7] argued that traditional collaborative ranking does not easily learn model parameters in the optimal setting. In particular, one inherent issue is that vanilla learning does not necessarily preserve ordering information, and because of the logistic loss, the model could learn arbitrary parameters that may not improve the objective (loss). In this paper, the authors proposed a method to jointly learn column-wise and row-wise orderings as well as a pointwise loss, and demonstrated the effectiveness of the proposed method.

References

  1. Mengting Wan, Di Wang, Jie Liu, Paul Bennett, and Julian McAuley. 2018. Representing and Recommending Shopping Baskets with Complementarity, Compatibility and Loyalty. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 1133-1142. DOI: https://doi.org/10.1145/3269206.3271786
  2. Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 497-506. DOI: https://doi.org/10.1145/3269206.3271800
  3. Kan Ren, Yuchen Fang, Weinan Zhang, Shuhao Liu, Jiajun Li, Ya Zhang, Yong Yu, and Jun Wang. 2018. Learning Multi-touch Conversion Attribution with Dual-attention Mechanisms for Online Advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 1433-1442. DOI: https://doi.org/10.1145/3269206.3271677
  4. Christophe Van Gysel, Maarten de Rijke, and Evangelos Kanoulas. 2018. Mix ‘n Match: Integrating Text Matching and Product Substitutability within Product Search. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 1373-1382. DOI: https://doi.org/10.1145/3269206.3271668
  5. Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2018. The LambdaLoss Framework for Ranking Metric Optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 1313-1322. DOI: https://doi.org/10.1145/3269206.3271784
  6. Junqi Zhang, Yiqun Liu, Shaoping Ma, and Qi Tian. 2018. Relevance Estimation with Multiple Information Sources on Search Engine Result Pages. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 627-636. DOI: https://doi.org/10.1145/3269206.3271673
  7. Jun Hu and Ping Li. 2018. Collaborative Multi-objective Ranking. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). ACM, New York, NY, USA, 1363-1372. DOI: https://doi.org/10.1145/3269206.3271785

FAQ About Tensor Methods

Anima Anandkumar is a prominent researcher on tensor methods (a.k.a. spectral methods) in machine learning. Recently, she held a Q&A session on Quora, and I have gathered some of her answers, particularly those about tensor methods, as follows:

  1. What are some benefits and drawbacks of using tensor methods as opposed to more traditional techniques in machine learning?
    The main gain is in terms of computation:
    — a) tensor methods are embarrassingly parallel and scalable to large problems;
    — b) they can build on efficient linear algebra libraries, but are much more powerful and informative compared to matrix methods.
    On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving the statistical efficiency of spectral methods is an ongoing research topic.
  2. What are your thoughts on the statistical efficiency of spectral methods? Do you think that they are competitive as they stand?
    The short answer is that MLE is sample efficient but may be difficult to compute, while tensor methods (moment matching) are relatively easy to compute but sample inefficient. Some remedies are mentioned in the answer.
  3. How are Tensor methods used in deep learning?
    The short answer is that their use there is currently limited.
  4. What are the best resources for starting with Tensor Analysis?
    See her webpage for a start.

For the detailed Q&A, please refer to the Quora website.


Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is Monte-Carlo EM, where the E-step is approximated by Gibbs sampling and the M-step performs gradient-based optimization of the Dirichlet parameters. This approach is implemented in the Mallet package. For detailed information about estimating Dirichlet parameters, see Tom Minka’s technical report.
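One commonly used update in such an M-step is Minka’s fixed-point iteration for the Dirichlet-multinomial. As a reference (a standard form of the update, not necessarily the exact variant Mallet uses), the update for an asymmetric document-topic prior \( \boldsymbol{\alpha} \), given the topic counts from Gibbs sampling, is

\begin{equation*}
\alpha_{k}^{\text{new}} = \alpha_{k} \, \frac{\sum_{d} \left[ \Psi(n_{d,k} + \alpha_{k}) - \Psi(\alpha_{k}) \right]}{\sum_{d} \left[ \Psi(n_{d} + \sum_{j} \alpha_{j}) - \Psi(\sum_{j} \alpha_{j}) \right]}
\end{equation*}

where \( \Psi \) is the digamma function, \( n_{d,k} \) is the number of tokens in document \( d \) assigned to topic \( k \), and \( n_{d} \) is the length of document \( d \). The iteration is repeated until convergence, and the topic-word hyper-parameters can be handled analogously; see Minka’s report for the derivation.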

Another, less explored approach is to go fully Bayesian, interleaving the sampling of topic assignments and hyper-parameters. To sample the hyper-parameters, since there is no closed-form solution, Slice Sampling is usually utilized. For an introduction to Slice Sampling, see Radford M. Neal’s paper. For its application to LDA, please see the following two references:

  1. Hanna Wallach’s course note.
  2. Chapter 2 of Hanna Wallach’s dissertation.

We implemented the method in Fugue’s LDA algorithm.
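
To make the fully Bayesian route concrete, below is a minimal Java sketch of univariate slice sampling with the step-out procedure from Neal’s paper, applied to an arbitrary log-density such as the collapsed log posterior of a hyper-parameter (assuming the usual java.util.concurrent.ThreadLocalRandom and java.util.function.DoubleUnaryOperator imports). This is my own illustration, not Fugue’s actual implementation.

public double sliceSample(DoubleUnaryOperator logDensity, double x0, double w) {
    ThreadLocalRandom rng = ThreadLocalRandom.current();
    // 1. Draw the slice level in log-space: log(y) = log p(x0) + log(u), u ~ U(0,1).
    double logY = logDensity.applyAsDouble(x0) + Math.log(rng.nextDouble());
    // 2. Step out to find an interval [left, right] that contains the slice.
    double left = x0 - rng.nextDouble() * w;
    double right = left + w;
    while (logDensity.applyAsDouble(left) > logY) left -= w;
    while (logDensity.applyAsDouble(right) > logY) right += w;
    // 3. Shrink: sample uniformly from the interval until the point lies on the slice.
    while (true) {
        double x1 = left + rng.nextDouble() * (right - left);
        if (logDensity.applyAsDouble(x1) > logY) return x1;
        if (x1 < x0) left = x1; else right = x1;
    }
}

For a positive hyper-parameter such as \( \alpha_{k} \), one would typically run this on \( \log \alpha_{k} \) (adding the log-Jacobian \( \log \alpha_{k} \) to the log-density) so the sampler never leaves the support.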


Tricks in Sampling Discrete Distributions (2) – Binary Search

As mentioned in the previous post, it is tricky to sample from discrete distributions correctly, whether you work in the original space or in log-space; here we demonstrate yet another important trick to do it right. You can easily come up with a code snippet like the following (using Java as the example):

public int sample() {
    // p holds the accumulated (non-decreasing) un-normalized probabilities.
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int index;
    // Linear scan: return the first index whose accumulated weight exceeds u.
    for (index = 0; index < p.length; index++) {
        if (p[index] > u)
            break;
    }
    return index;
}

where \( p \) is the array of accumulated un-normalized probabilities. The time complexity is \( \mathcal{O}(N) \), where \( N \) is the number of items in the array \( p \).

It turns out that the above code can easily be optimized to \( \mathcal{O}(\log N) \) by using Binary Search. The reason is simple: the accumulated un-normalized probabilities stored in \( p \) are, by definition, sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the randomly generated number \( u \). This function is called ceiling in Section 3.1 of Algorithms. We implemented it in our context as follows:

public int sample() {
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int lower = 0;
    int upper = p.length - 1;
    // Binary search (ceiling): find the smallest index whose accumulated weight exceeds u.
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;
        if (p[mid] > u) {
            upper = mid - 1;
        } else {
            lower = mid + 1;
        }
    }
    return lower;
}
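
For completeness, here is a short usage sketch (a hypothetical setup, matching the field-style snippets above): build the accumulated array \( p \) from the un-normalized weights once, then call either version of sample().

// Hypothetical setup: accumulate raw un-normalized weights into the field p.
double[] weights = {0.2, 3.0, 1.5, 0.3};  // un-normalized probabilities c_k
p = new double[weights.length];
double running = 0.0;
for (int k = 0; k < weights.length; k++) {
    running += weights[k];
    p[k] = running;                       // p is non-decreasing by construction
}
int drawn = sample();                     // works with either the linear or binary-search version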

Interestingly, even though this trick seems trivial, it is rarely mentioned in the literature and only occasionally discussed.


Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space


One critical component of Gibbs sampling for complex graphical models is the ability to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example: the main computational effort is to draw samples from the following distribution:

\begin{equation}\label{eq:lda}
P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{., k}^{-i} + \sum_{v} \beta_{v}}
\end{equation}


where \( n_{d,k}^{-i} \) is the number of tokens in document \( d \) assigned to topic \( k \), excluding token \( i \); \( n_{w_{i}, k}^{-i} \) is the number of times token \( w_{i} \) is assigned to topic \( k \), excluding \( i \); and \( n_{.,k}^{-i} \) is the total number of tokens assigned to topic \( k \).

So, a straightforward sampling algorithm works as follows:

  1. Let \( c_{k} \) be the right-hand side of Equation \eqref{eq:lda} for topic \( k \), which is an un-normalized probability.
  2. Compute the accumulated weights as \( C[i] = C[i-1] + c_{i}, \,\, \forall i \in (0, K-1] \) and \( C[0] = c_{0} \).
  3. Draw \( u \sim \mathcal{U}(0, C[K-1]) \) and find \( t = \arg\min_{i} x_{i} \) where \( x_{i} = C[i] - u \) and \( x_{i} > 0 \).

The last step essentially finds the minimum index whose array value is greater than the random number \( u \).

One difficulty in dealing with \eqref{eq:lda} is that the right-hand side might be too small and therefore underflow (think of multiplying many near-zero numbers). Thus, we want to work with probabilities in log-space. We start with:

\begin{equation}
\log P(z_{i} = k \, | \, \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) - \log ( n_{., k}^{-i} + \sum_{v} \beta_{v} )
\end{equation}

and store the values in \( c_{k} \), remembering that each value now represents an un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space. Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute the log sum efficiently; please use the last equation there to compute the accumulated weights. The last step is to draw the random number: compute \( \log u + \log C[K-1] \) with \( u \sim \mathcal{U}(0, 1) \), and again find the minimum index whose accumulated value is greater than this quantity.
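
Putting the pieces together, here is a minimal sketch of the log-space sampler, in the same Java style as the snippets in the previous post (my own illustration, not the exact Fugue implementation):

public int sampleLog(double[] logP) {
    // logP[k] holds the un-normalized log probability of topic k.
    // Accumulate in log-space with the log-sum-exp trick.
    double[] logC = new double[logP.length];
    logC[0] = logP[0];
    for (int k = 1; k < logP.length; k++) {
        double a = logC[k - 1], b = logP[k];
        double max = Math.max(a, b);
        logC[k] = max + Math.log1p(Math.exp(Math.min(a, b) - max)); // log(exp(a) + exp(b))
    }
    // Draw log(u * C[K-1]) = log(u) + log C[K-1] with u ~ U(0,1), then binary-search
    // for the smallest index whose accumulated log-weight exceeds it.
    double logU = Math.log(ThreadLocalRandom.current().nextDouble()) + logC[logP.length - 1];
    int lower = 0, upper = logP.length - 1;
    while (lower <= upper) {
        int mid = lower + (upper - lower) / 2;
        if (logC[mid] > logU) upper = mid - 1;
        else lower = mid + 1;
    }
    return lower;
}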

Notes:

  1. The log-sampling algorithm for LDA is implemented in Fugue Topic Modeling Package.
  2. Unless you really face the issue of underflow, sampling in log-space is usually much slower than sampling in the original space, as log and exp are expensive functions to compute.

Federated Optimization: Even More Personalized World?

Like the previous post about personalized models, another NIPS workshop paper discussed a related topic, yet from another perspective:

The authors introduce a new setting of the learning problem in which data are distributed across a very large number of computers, each having access to only a few data points. This is primarily motivated by the setting where users keep their data on their own devices, but the goal is still to train a high-quality global model.

Although it looks very different from the paper described in the previous post, the two pieces can be linked together on two points:

  • It is important to train a high-quality local or personalized model by utilizing the global model, or vice versa.
  • It is equally important to understand the interplay between the global model and the local model.

These works raise interesting new directions, such as how to serve and update models that are fully personalized on mobile devices.


Serving Personalized Models

In the recent NIPS 2016 Workshop on Machine Learning Systems, one paper attracted my attention:

The central idea is very simple: we need to learn and serve personalized models to users on top of a global model. The argument for such a setting is two-fold:

  1. A global model might be hard to train, given its size. It usually takes a significant amount of computing effort.
  2. Depending on the type of model, it might even be difficult to serve and update the global model as well.

So it is very natural that the model for each individual user is trained separately while being derived from a global model. The paper demonstrates one particular way of deriving such a model, but there could be many different ways of doing this.

Of course, this is not the first such reasoning. As the paper mentions, prior work in multi-task learning has formalized similar problems. However, it might be the first time the advantages of having personalized models have been shown from a systems perspective.


Building Software Systems: V1 and V2

In a recent issue of Communications of the ACM, the article “Lean Software Development — Building and Shipping Two Versions” caught my eye, as it describes one of the basic dilemmas faced in many companies: how to build multiple versions of the same software at the same time (although the article discusses the issue from a human-resources perspective). The central argument of the article is that certain engineers like prototyping while others prefer to prune, optimize, or, in the author’s words, beautify the code, and therefore it is better to separate them into two groups.

However, in practice, the reason to split teams into so-called V1/V2 is sometimes rather arbitrary. One common issue is that V1 is a large, working system on which team members have gained expertise over time, and many people are familiar with the code base. But since it might have been developed in a rush in the first place, V1 is inflexible and sometimes full of hacks and heuristics; in theory, it needs a rewrite. Meanwhile, some managers recognize the issue but form another team to develop the V2 system (usually called “next-gen”), with or without consulting the V1 team. Therefore, V2 is born with many problems: it is called “next-gen”, but its developers may not really understand the real problems of V1, and sometimes, in order to separate V2 from V1 and show its superiority, V2 tends to be over-engineered and over-designed. At the end of the day, neither V1 nor V2 may be properly designed and implemented.

At the heart of the issue is the question: why do we need V1 and V2 at the same time? We need to write “correct code” rather than “fast code”. The need to build V1 and V2 simultaneously often implies bad design, wrong product requirements, and hacks that jeopardize the quality of the products.