Research in General


KDD 2022

Last week, I attended KDD 2022 in Washington, DC, the first in-person conference I have attended since WSDM 2020 in Houston, almost two and a half years ago. Although the COVID-19 pandemic still had a sizeable impact on the event, as the majority of researchers and companies from China could not make it to the conference, participants showed great enthusiasm and hoped that everything could return to the pre-pandemic norm.

Let me share a couple of thoughts regarding the conference.

First of all, Graph Neural Networks (GNNs) are literally everywhere, in terms of applications across almost all domains, and both industry and academia show tremendous interest in them. This reminds me of how Graph Mining went viral in the early 2000s with the rise of Social Networks and the World Wide Web. Surprisingly, topics like PageRank, Community Detection, and Graph Laplacians were rarely mentioned. Of course, this is not unique to Graph Mining: other sub-fields such as Recommender Systems, Text Mining, and Information Retrieval have also been completely rewritten since the surge of Deep Learning.

Secondly, Causal Inference has become a much more mainstream topic than before and is being utilized in many applications, including industry solutions. One thing worth mentioning is that many researchers still use Causal Inference as an intermediate step to further improve prediction performance on downstream tasks, while how Causal Inference could bring more insight into the problems at hand is broadly ignored; for instance, few papers discuss Causal Effects and their estimation. Moreover, the majority of papers on Causal Inference are essentially Observational Studies, yet they do not discuss the data and assumptions necessary for meaningful and robust causal estimation, which can be misleading and risky.

Thirdly, some research topics have declined dramatically. For instance, while Optimization was once a pretty hot topic given the popularity of large-scale machine learning, the need to understand and improve it has been greatly reduced by the rise of various Deep Learning frameworks; most researchers can reliably use these frameworks to achieve reasonable performance without understanding the underlying optimization algorithms. In addition, it is surprising to observe that Probabilistic Modeling has been all but abandoned: few papers now start with assumptions about how the data is generated. Although it is arguably not applicable to every scenario, Probabilistic Modeling has its own advantages in describing data-generation assumptions and helping us understand certain problems better, a perspective that is quite different from, and possibly complementary to, the computational point of view that has dominated the modeling language since the rise of Deep Learning.

Lastly, all the keynote speakers delivered excellent talks, each with their own style. Lise Getoor from UC Santa Cruz is a prominent scholar from the previous wave of Graph Mining. Some of the techniques mentioned in her talk, such as Collective Classification, were widely used and even included in textbooks in the 2000s, but are largely ignored today. Her research still centers on classical Graph Mining, especially reasoning over graph structures. Milind Tambe from Harvard University, on the other hand, gave a surprisingly good talk. Rather than focusing on technologies and algorithms, Milind told stories about applying AI for social good and building applications that impact real people's lives, which sent a strong message to the conference. The last speaker, Shang-Hua Teng from USC, gave a rather theoretical talk, discussing how to balance heuristics and theory. One interesting point from the talk was that Shang-Hua clearly categorized quite a number of widely used algorithms as heuristics, even though practitioners might consider them theoretically grounded.

Compared to KDD 2019, which I attended in Alaska three years ago, this year's KDD is different in the number of participants, the number of companies, and the diversity of research topics. It is quite obvious that the latest wave of Deep Learning is turning a corner. I'm looking forward to meeting friends again next time in LA.


FAQ About Tensor Methods

Anima Anandkumar is a prominent researcher on tensor methods (a.k.a. spectral methods) in machine learning. Recently, she held a Q&A session on Quora, and I have gathered some of her answers, particularly those about tensor methods, as follows:

  1. What are some benefits and drawbacks of using tensor methods as opposed to more traditional techniques in machine learning?
    The main gain is in terms of computation:
    — a) tensor methods are embarrassingly parallel and scalable to large problems;
    — b) they can build on efficient linear algebra libraries, but are much more powerful and informative compared to matrix methods.
    On the other hand, tensor methods are not sample efficient, meaning they require more samples than EM to reach the same level of accuracy (assuming computation is not an issue). Improving the statistical efficiency of spectral methods is an ongoing research topic.
  2. What are your thoughts on the statistical efficiency of spectral methods? Do you think that they are competitive as they stand?
    The short answer is that MLE is sample efficient but may be difficult to compute, while tensor methods (moment matching) are relatively easy to compute but sample inefficient. Some remedies are mentioned in her answer.
  3. How are tensor methods used in deep learning?
    The short answer is that their use is currently limited.
  4. What are the best resources for starting with Tensor Analysis?
    See her webpage for a starting point.

For the detailed Q&A, please refer to the Quora website.


Slice Sampling for LDA Hyperparameters

For latent Dirichlet allocation (LDA) hyper-parameters, a typical approach is a Monte Carlo EM algorithm, where the E-step is approximated by Gibbs sampling and the M-step performs a gradient-based optimization to fit the Dirichlet parameters. This approach is implemented in the Mallet package. For detailed information about estimating Dirichlet parameters, see Tom Minka's technical report.
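
To make the M-step concrete, here is a rough sketch in Java of the fixed-point update described in Minka's report for an asymmetric Dirichlet prior, given per-document topic counts. This is not the Mallet implementation; the method names, the `docTopicCounts` layout, and the digamma approximation are illustrative assumptions.

// Illustrative sketch of Minka's fixed-point update for Dirichlet parameters.
// docTopicCounts[d][k] is the number of tokens in document d assigned to topic k.
static double digamma(double x) {
    // standard asymptotic approximation; use the recurrence to shift x into range
    double result = 0.0;
    while (x < 6.0) { result -= 1.0 / x; x += 1.0; }
    double r = 1.0 / x, r2 = r * r;
    return result + Math.log(x) - 0.5 * r
            - r2 * (1.0 / 12.0 - r2 * (1.0 / 120.0 - r2 / 252.0));
}

static void updateAlpha(double[] alpha, int[][] docTopicCounts, int iterations) {
    int numDocs = docTopicCounts.length, numTopics = alpha.length;
    for (int it = 0; it < iterations; it++) {
        double alphaSum = 0.0;
        for (double a : alpha) alphaSum += a;
        // shared denominator: sum_d [ digamma(n_d + alphaSum) - digamma(alphaSum) ]
        double denominator = 0.0;
        for (int d = 0; d < numDocs; d++) {
            int docLength = 0;
            for (int k = 0; k < numTopics; k++) docLength += docTopicCounts[d][k];
            denominator += digamma(docLength + alphaSum) - digamma(alphaSum);
        }
        // per-topic numerator and multiplicative update
        for (int k = 0; k < numTopics; k++) {
            double numerator = 0.0;
            for (int d = 0; d < numDocs; d++) {
                numerator += digamma(docTopicCounts[d][k] + alpha[k]) - digamma(alpha[k]);
            }
            alpha[k] *= numerator / denominator;
        }
    }
}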

Another, less explored approach is to go fully Bayesian, interleaving the sampling of topic assignments and hyper-parameters. To sample the hyper-parameters, since there is no closed-form solution, Slice Sampling is usually utilized. For an introduction to Slice Sampling, see Radford M. Neal's paper. For its application to LDA, please see the following two references:

  1. Hanna Wallach’s course note.
  2. Chapter 2 of Hanna Wallach’s dissertation.

We implemented the method in Fugue’s LDA algorithm.
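
To make the idea concrete, here is a minimal sketch of univariate slice sampling with stepping-out and shrinkage, as described in Neal's paper, applied to a single positive hyper-parameter. The `logPosterior` function (e.g., the Dirichlet-multinomial likelihood of the current topic assignments plus a prior) and all names here are illustrative assumptions, not the Fugue implementation.

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleUnaryOperator;

// Minimal univariate slice sampler (stepping-out + shrinkage) for a positive
// hyper-parameter such as a symmetric alpha. logPosterior returns the
// un-normalized log posterior of the hyper-parameter given the current state.
public static double sliceSample(double current, DoubleUnaryOperator logPosterior,
                                 double stepSize, int maxSteps) {
    ThreadLocalRandom rng = ThreadLocalRandom.current();
    // auxiliary "height" drawn under the log density at the current point
    double logHeight = logPosterior.applyAsDouble(current) + Math.log(rng.nextDouble());

    // stepping out: grow an interval [left, right] that covers the slice
    double left = current - stepSize * rng.nextDouble();
    double right = left + stepSize;
    for (int i = 0; i < maxSteps && left > 1e-10 && logPosterior.applyAsDouble(left) > logHeight; i++) left -= stepSize;
    for (int i = 0; i < maxSteps && logPosterior.applyAsDouble(right) > logHeight; i++) right += stepSize;
    left = Math.max(left, 1e-10);  // keep the hyper-parameter strictly positive

    // shrinkage: sample uniformly from the interval, shrinking on rejection
    while (true) {
        double proposal = left + (right - left) * rng.nextDouble();
        if (logPosterior.applyAsDouble(proposal) > logHeight) return proposal;
        if (proposal < current) left = proposal; else right = proposal;
    }
}

In an LDA sampler, one would call this once per hyper-parameter (or once for a shared symmetric value) after each Gibbs sweep over the topic assignments, interleaved with the sampling of topic assignments as described above.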


Tricks in Sampling Discrete Distributions (2) – Binary Search

As mentioned in the previous post, it is tricky to sample from discrete distributions, and here we demonstrate yet another important trick to do it right, whether you work in the original space or in log-space. Basically, you can easily come up with a code snippet like this (we are using Java as an example here):

// requires: import java.util.concurrent.ThreadLocalRandom;
// p holds the accumulated (cumulative) un-normalized probabilities.
public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int index;
    for (index = 0; index < p.length; index++) {
        // stop at the first index whose accumulated weight exceeds u
        if (u < p[index])
            break;
    }
    return index;
}

where \( p \) is the array of accumulated un-normalized probabilities. The time complexity is \( \mathcal{O}(N) \), where \( N \) is the number of items in the array \( p \).

It turns out that the above code can easily be optimized to \( \mathcal{O}(\log N ) \) by using Binary Search. The reason is quite simple: the accumulated un-normalized probabilities stored in \( p \) are, by definition, sorted. Therefore, binary search can be utilized. In particular, we want to find the smallest key that is greater than the randomly generated number \( u \). This operation is called ceiling in Section 3.1 of Algorithms. We implemented it in our context as follows:

// requires: import java.util.concurrent.ThreadLocalRandom;
public int sample(){
    double u = ThreadLocalRandom.current().nextDouble() * p[p.length - 1];
    int lower = 0;
    int upper = p.length - 1;
    // binary search for the smallest index whose accumulated weight exceeds u
    while (lower <= upper){
        int mid = lower + (upper - lower) / 2;
        if((p[mid] - u) > 0){
            upper = mid - 1;
        }
        else{
            lower = mid + 1;
        }
    }
    return lower;
}

Interestingly, even though this trick seems trivial, it is rarely mentioned or discussed in the literature.


Tricks in Sampling Discrete Distributions (1) – Sampling in Log-Space


One critical component of Gibbs sampling for complex graphical models is the ability to draw samples from discrete distributions. Take latent Dirichlet allocation (LDA) as an example: the main computational effort is drawing samples from the following distribution:

\begin{equation}\label{eq:lda}P(z_{i} = k \mid \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto (n_{d,k}^{-i} + \alpha_{k}) \frac{n_{w_{i}, k}^{-i} + \beta_{w_{i}}}{n_{\cdot, k}^{-i} + \sum_{v} \beta_{v}}\end{equation}


where \( n_{d,k}^{-i} \) is the number of tokens in document \( d \) assigned to topic \( k \), excluding token \( i \); \( n_{w_{i}, k}^{-i} \) is the number of times token \( w_{i} \) is assigned to topic \( k \), excluding \( i \); and \( n_{\cdot,k}^{-i} \) is the total number of tokens assigned to topic \( k \), excluding \( i \).

So, a straightforward sampling algorithm works as follows:

  1. Let \( c_{k} \) be the right-hand side of Equation \eqref{eq:lda} for topic \( k \), which is an un-normalized probability.
  2. We compute the accumulated weights as \( C[i] = C[i-1] + c_{i}, \;\; \forall i \in (0, K-1] \) and \( C[0] = c_{0} \).
  3. Draw \( u \sim \mathcal{U}(0, C[K-1]) \) and find \( t = \arg\min_{i} x_{i} \) where \( x_{i} = C[i] - u \) and \( x_{i} > 0 \).

The last step essentially finds the minimum index whose array value is greater than the random number \( u \).

One difficulty in dealing with \eqref{eq:lda} is that the right-hand side might be too small and therefore underflow (think of multiplying many near-zero numbers). Thus, we want to work with probabilities in log-space. We start with:

\begin{equation*}\log P(z_{i} = k \mid \mathbf{z}_{-i}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \log (n_{d,k}^{-i} + \alpha_{k}) + \log (n_{w_{i}, k}^{-i} + \beta_{w_{i}}) - \log ( n_{\cdot, k}^{-i} + \sum_{v} \beta_{v} )\end{equation*}

and store the values in \( c_{k} \), remembering that each value now represents an un-normalized log probability. The next step is to compute the accumulated weights, this time as accumulated probabilities but in log-space! Thanks to the trick mentioned in [Notes on Calculating Log Sum of Exponentials], we are able to compute the log-sum efficiently; please use the last equation there to compute the accumulated weights \( C \). The last step is to draw the random number: we draw \( v \sim \mathcal{U}(0, 1) \), compute \( u = \log v + C[K-1] \), and again find the minimum index whose array value is greater than \( u \).
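
As a concrete illustration, here is a small sketch of the whole procedure in Java, assuming `logWeights[k]` holds the un-normalized log probabilities \( c_{k} \); the helper name `logAdd` and the linear scan are illustrative (the binary-search trick from the companion post applies here as well):

// logAdd computes log(exp(a) + exp(b)) stably (the log-sum-exp trick).
static double logAdd(double a, double b) {
    double max = Math.max(a, b);
    return max + Math.log1p(Math.exp(Math.min(a, b) - max));
}

// requires: import java.util.concurrent.ThreadLocalRandom;
static int sampleInLogSpace(double[] logWeights) {
    int numTopics = logWeights.length;
    double[] logCumulative = new double[numTopics];
    logCumulative[0] = logWeights[0];
    for (int k = 1; k < numTopics; k++) {
        // accumulated weights, kept in log-space
        logCumulative[k] = logAdd(logCumulative[k - 1], logWeights[k]);
    }
    // u = log v + C[K-1] with v ~ U(0,1): a uniform draw over the total mass, in log-space
    double u = Math.log(ThreadLocalRandom.current().nextDouble()) + logCumulative[numTopics - 1];
    // find the smallest index whose accumulated log-weight exceeds u
    int index = 0;
    while (index < numTopics - 1 && logCumulative[index] <= u) index++;
    return index;
}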

Notes:

  1. The log-space sampling algorithm for LDA is implemented in the Fugue Topic Modeling Package.
  2. Unless you really face the issue of underflow, sampling in log-space is usually much slower than sampling in the original space, as log and exp are expensive functions to compute.

Federated Optimization: Even More Personalized World?

Like the previous post about personalized models, another NIPS workshop paper discusses a related topic, yet from a different perspective:

The authors introduce a new setting of the learning problem in which data are distributed across a very large number of computers, each having access to only a few data points. This is primarily motivated by the setting where users keep their data on their devices, but the goal is still to train a high-quality global model.

Although it looks very different from the paper described in the previous post, the two pieces can be linked together on two points:

  • It is important to train a high-quality local or personalized model by utilizing the global model, and vice versa.
  • It is also very important to understand the interplay between the global model and the local model.

These works raise interesting new directions, like how to serve and update fully personalized models on mobile devices.


Serving Personalized Models

In the recent NIPS 2016 Workshop on Machine Learning Systems, one paper attracted my attention:

The central idea is very simple: we need to learn and serve personalized models to users on top of a global model. The argument for such a setting is two-fold:

  1. A global model might be hard to train given its size; it usually takes a significant amount of computing effort.
  2. Depending on the type of model, it might even be difficult to serve and update the global model as well.

So it is very natural that the model for each individual user is trained separately while being derived from a global model. The paper demonstrates one particular way of deriving such a model, but there could be many different ways of doing this.

Of course, this is not the first such reasoning. As the paper mentions, prior work in multi-task learning has formalized similar problems. However, it might be the first time the advantages of having personalized models have been shown from a systems perspective.


Machine Learning as an Engineering Discipline

Recently, Leon Bottou delivered two exciting keynote speeches on machine learning as an engineering discipline. I suggest that anyone who uses machine learning techniques to build software read through them at least once. They contain deep thinking about machine learning and software engineering.

  1. Two big challenges in machine learning, given at ICML 2015.
  2. How Big Data changes Statistical Machine Learning, given at IEEE BigData 2015.