4 Writing Specific Sections
4.1 Title
The title is the most read part of an article, and it often determines whether a reader goes on to read the manuscript.
Hairston and Keene (2003) suggest that the title of a research paper should accomplish four goals:
- predict the content of the research paper;
- be interesting to the reader;
- reflect the tone of the writing;
- contain important keywords (making the paper easier to find through keyword searches).
The title of a paper is usually determined when the paper is close to completion. To come up with a good title, list some key phrases that you would like to include, and be creative in forming a title that incorporates most of them. Here are some tips.
- Be informative by including these aspects: topic, method(s), data, and results.
- Consider adding a subtitle to give more specifics about the paper.
- Use appropriate critical keywords to increase the discoverability of the paper.
- Follow the requirements from the instructions or journals.
- Keep it as concise as possible.
Example 4.1 Less effective vs. effective titles
- A Comprehensive Regression Shrinkage Estimation and Variable Selection Procedure Using a Novel Least Absolute Shrinkage and Selection Operator
- Too long, technical, and redundant; buries the key idea.
Reveal the real title
Regression Shrinkage and Selection via the Lasso (Tibshirani 1996)
- Concise and memorable; immediately signals novelty.
- A New Nonparametric Resampling-Based Monte Carlo Estimator for Determining the Optimal Quantity of Groupings with Multivariate Data
- Overly technical; hard to parse at a glance.
Reveal the real title
Estimating the Number of Clusters in a Data Set via the Gap Statistic (Tibshirani, Walther, and Hastie 2001)
- States problem, method, and context clearly.
- A Stepwise Algorithm for Sequentially Updating Correlation-Adjusted Regression Coefficients in High-Dimensional Settings with Strong Multicollinearity
- Too wordy and technical; makes the method sound less accessible.
Reveal the real title
Least Angle Regression (Efron et al. 2004)
- Short, clear, and intriguing; signals both novelty and relevance.
- A Generalized Ensemble Learning Framework Based on Aggregating Decision Tree Classifiers for Improved Prediction Accuracy with Complex Data
- Technically accurate but long, clunky, and forgettable.
Reveal the real title
Random Forests (Breiman 2001)
- Extremely concise and memorable; easy to cite and recall.
Exercise 4.1 Can you guess the title from the abstract?
- Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
Reveal the real title
Attention Is All You Need (Vaswani et al. 2017a)
- Strikingly simple and memorable; it reframed the field with just five words.
- Abstract:
We propose an adaptive nuclear norm penalization approach for low-rank matrix approximation, and use it to develop a new reduced rank estimation method for high-dimensional multivariate regression. The adaptive nuclear norm is defined as the weighted sum of the singular values of the matrix, and it is generally nonconvex under the natural restriction that the weight decreases with the singular value. However, we show that the proposed nonconvex penalized regression method has a global optimal solution obtained from an adaptively soft-thresholded singular value decomposition. The method is computationally efficient, and the resulting solution path is continuous. The rank consistency of and prediction/estimation performance bounds for the estimator are established for a high-dimensional asymptotic regime. Simulation studies and an application in genetics demonstrate its efficacy.
Reveal the real title
Reduced Rank Regression via Adaptive Nuclear Norm Penalization (Kun Chen, Chan, and Stenseth 2013)
- Highlights both the framework and the new contribution.
- Abstract:
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Reveal the real title
Adam: A Method for Stochastic Optimization (Kingma and Ba 2015)
- Minimalist and branded: a single word, “Adam,” now instantly recognized in ML.
4.2 Abstract
Marron (1999) recommended the following:
Abstract material needs to be carefully chosen. A balance between the twin goals of brevity and maximal information content should again be carefully sought. There is room for more detail than in the title, but not enough room for all ideas covered in the paper. Make sure each “high point” is included. The paper will have a better chance in the review process if it is made clear what is done, and why it is important, since this will immediately capture the interest of the reviewer.
Any recommendations for length here must be more case dependent. Longer papers will usually need longer abstracts. However, something between 4 and 10 sentences is reasonable for most situations.
Mathematical notation is rarely useful in the abstract. Sometimes notation is introduced in an abstract, and then not used at all! Even when notation is used in the abstract, the point can usually be conveyed more efficiently in words alone.
Tips:
- Consider opening with a sentence that establishes the importance of the subject of the paper.
- Identify a gap in the literature to set up the background and motivation of the paper.
- Highlight the novelty/contributions of the paper.
- For application papers, allude to new discoveries and their impacts.
- For method papers, outline the essence of the methodology, and evidence from theoretical and numerical studies supporting the methods.
- It must make sense when read in isolation for those who only read the abstract, and must also provide a clear and accurate summary of the manuscript for readers who read the entire manuscript (Zeiger 2000).
- Should not include citations.
4.2.1 Components in an Abstract
There are five major components in an abstract:
- Context (1–2 sentences): situate the problem and why it matters.
- Objective (1 sentence): the question or aim.
- Approach (1–2 sentences): data, design, and key methods.
- Findings (1–2 sentences): the most important results.
- Implications (1 sentence): what the results imply (perhaps in a broader context).
Components can also blend. For example, the objective and approach may appear together in one line, e.g., ‘We aim to estimate Y by developing X’, and a single sentence may introduce a novel approach and immediately state the key result, e.g., ‘By introducing A, we show B’.
There are other optional components (sometimes required by the venue):
- Data set: sample size or cohorts (applied/clinical).
- Assumptions/Limitations: stating key conditions or the main caveat (theory/methods).
- Uncertainty/Effect size: CIs, rates, or precision that quantify findings.
- Software/Computation/Availability: runtime/complexity; package; code/data source.
- Registration/Ethics: trial ID, preregistration, or IRB when human subjects are involved.
- Funding/Disclosure: only if the outlet asks for it in the abstract.
Length: typically ~150–250 words for journals.
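Word limits like this are easy to check mechanically before submission. The Python sketch below counts whitespace-separated words, the convention most journals use; the 150–250 defaults are just this section's rule of thumb, not any particular journal's requirement.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words, the convention most word limits use."""
    return len(text.split())

def check_abstract(text: str, lo: int = 150, hi: int = 250) -> str:
    """Flag a draft abstract that falls outside a venue's word limits."""
    n = word_count(text)
    if n < lo:
        return f"{n} words: under the {lo}-word minimum"
    if n > hi:
        return f"{n} words: over the {hi}-word limit"
    return f"{n} words: within limits"
```

Running a draft through such a check during revision (Exercise 4.3 below practices this by hand) helps catch over-length abstracts early.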
Example 4.2 Transfer learning under large-scale low-rank regression models (A recent JASA paper)
Legend: Context, Objective, Approach, Findings, Implications
In high-dimensional multiple response regression problems, the large dimensionality of the coefficient matrix poses a challenge to parameter estimation. To address this challenge, low-rank matrix estimation methods have been developed to facilitate parameter estimation in the high-dimensional regime, where the number of parameters increases with sample size. Despite these methodological advances, accurately predicting multiple responses with limited target data remains a difficult task. To gain statistical power, the use of diverse datasets from source domains has emerged as a promising approach. In this paper, we focus on the problem of transfer learning in a high-dimensional multiple response regression framework, which aims to improve estimation accuracy by transferring knowledge from informative source datasets. To reduce potential performance degradation due to the transfer of knowledge from irrelevant sources, we propose a novel transfer learning procedure including the forward selection of informative source sets. In particular, our forward source selection method is new compared to existing transfer learning framework, offering deeper theoretical insights and substantial methodological innovations. In addition, we develop an alternative transfer learning based on non-convex penalization to ensure rank consistency. Theoretical results show that the proposed estimator achieves a faster convergence rate than the single-task penalized estimator using only target data. Through simulations and real data experiments, we provide empirical evidence for the effectiveness of the proposed method and for its superiority over other methods.
(Optional) The proposed framework clarifies when and how multi-source transfer improves multi-response prediction, guiding principled use of external data in high-dimensional applications.
Example 4.3 Conformal prediction with conditional guarantees (A recent JRSSB paper)
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of prespecified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Example 4.4 Exact Bayesian inference for fitting stochastic epidemic models to partially observed incidence data (A recent AoAS paper)
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet fitting these models to partially observed data can be a difficult task due to intractability of the marginal likelihood, even for classic Markovian models. To remedy this issue, this article introduces a novel data-augmented Markov chain Monte Carlo sampler for exact Bayesian inference under the stochastic susceptible-infectious-removed model, given only discretely observed counts of infections. In a Metropolis–Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the target process and from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently and easily scales to outbreaks with thousands of infections. We prove that our sampler is uniformly ergodic and find empirically that it mixes much faster than existing single-site samplers. We apply the algorithm to fit a semi-Markov susceptible-infectious-removed model to the 2013–2015 outbreak of Ebola Haemorrhagic Fever in Guéckédou, Guinea.
(Optional) Implication: The sampler enables principled Bayesian inference for partially observed epidemics at realistic scales, improving reliability of outbreak analysis and forecasting.
Example 4.5 Statistical significance of clustering for count data (A recent Biometrics paper)
Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
(Optional) Implication: Rigorous significance testing for clusters in count data can reduce false discoveries and sharpen biological and clinical subgrouping.
Example 4.6 Robust transfer learning with unreliable source data (A recent AoS paper)
This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distributions. We introduce a novel quantity called the “ambiguity level” that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed “Transfer Around Boundary” (TAB) method, with a threshold that balances the performance contributions of the target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on nonparametric classification and logistic regression tasks, achieving upper bounds which are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.
(Optional) Implication: By quantifying “ambiguity” and adapting transfer near class boundaries, TAB offers a principled recipe for robust performance gains while guarding against negative transfer in practice.
Exercise 4.2 Spot the Components of an Abstract
Legend: Context, Objective, Approach, Findings, Implications
Abstract:
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of prespecified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Show answer
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage… This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional… exact finite-sample coverage over all possible shifts. …given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes… we provide a procedure for quantifying the coverage errors… Moreover, by tuning interpretable hyperparameters… control the size of these errors… Our methods can be incorporated into existing split conformal pipelines… quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Abstract:
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet, fitting these models to partially observed data is a notoriously difficult task due to intractability of the likelihood for many classical models. To remedy this issue, this article introduces a novel data-augmented MCMC algorithm for exact Bayesian inference under the stochastic SIR model, given only discretely observed counts of infection. In a Metropolis-Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the SIR model, from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently, and scales to outbreaks with hundreds of thousands of individuals. We show that the Markov chain underlying the algorithm is uniformly ergodic, and validate its performance via thorough simulation experiments and a case study on the 2013-2015 outbreak of Ebola Haemorrhagic Fever in Western Africa.
Show answer
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet, fitting these models to partially observed data is a notoriously difficult task due to intractability of the likelihood for many classical models. To remedy this issue, this article introduces a novel data-augmented MCMC algorithm for exact Bayesian inference under the stochastic SIR model, given only discretely observed counts of infection. In a Metropolis-Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the SIR model, from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently, and scales to outbreaks with hundreds of thousands of individuals. We show that the Markov chain underlying the algorithm is uniformly ergodic, and validate its performance via thorough simulation experiments and a case study on the 2013-2015 outbreak of Ebola Haemorrhagic Fever in Western Africa.
(Optional) Implication: The algorithm enables principled, scalable Bayesian inference for partially observed epidemics at realistic population sizes.
Abstract:
Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting, while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and use the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of “or”. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in associating with suicide risk.
Show answer
Statistical learning with a large number of rare binary features is commonly encountered in analyzing EHR data, especially in modeling disease onset with prior diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is challenging… conventional methods lack power/consistency; ML methods lack interpretability/clinically-meaningful factors. To improve EHR-based modeling and leverage the disease hierarchy, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features… We convert the combinatorial problem into a convex linearly-constrained regularized estimation, enabling scalable computation with theoretical guarantees. In a suicide risk study, the method selects and aggregates diagnoses guided by ICD hierarchy. By balancing rarity and specificity, the strategy improves both prediction and interpretation. It identifies important higher-level categories and subcategories of mental health conditions and determines the needed specificity for associating with suicide risk.
Exercise 4.3 Compress an Abstract to the Required Length
Task. Read the original abstract (220 words), then rewrite it to ≤150 words, and then to ≤100 words while preserving the five core components: Context, Objective, Approach, Findings, Implications.
Original (≈220 words)
Rare-event risk prediction from multi-site EHR data is hampered by extreme class imbalance and heterogeneity across institutions. Models that perform well at one site often fail to generalize, and naive pooling can hide site-specific signals that matter clinically. We develop RankFuse, a federated learning framework that consolidates rankings from related event phenotypes across sites without sharing raw data. Our objective is to improve identification of the highest-risk patients under strict privacy constraints. RankFuse uses truncated listwise losses with top-K–focused mini-batches and a cross-site consensus penalty that aligns phenotype-specific orderings while allowing site-level deviations. We derive conditions under which RankFuse improves precision at fixed review capacity and prove a finite-sample bound on recall@K under prevalence 0.2–1%. In simulations with controlled heterogeneity, RankFuse raises precision@1% by 16–24% over calibrated XGBoost and by 11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850,000; 2,970 events), RankFuse improves precision@1% from 0.29 to 0.36 and recall@1% from 0.51 to 0.62, while preserving site-level calibration. These gains translate to earlier identification of high-risk patients without expanding the clinical review queue. The method fits into standard federated pipelines, requires only gradient sharing, and supports secure aggregation. Code for RankFuse and synthetic benchmarks will be released to facilitate adoption.
Sample ≤150-word version (≈148 words)
Rare-event risk prediction in multi-site EHRs suffers from class imbalance and cross-site heterogeneity, limiting generalization. We propose RankFuse, a privacy-preserving federated framework that consolidates phenotype-specific rankings across hospitals to better surface the highest-risk patients. RankFuse optimizes truncated listwise losses with top-K–focused mini-batches and a consensus penalty that aligns site rankings while allowing local deviations. We show conditions for improved precision at fixed capacity and provide a finite-sample bound on recall@K under 0.2–1% prevalence. In simulations with controlled heterogeneity, RankFuse raises precision@1% by 16–24% over calibrated XGBoost and by 11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850,000; 2,970 events), it improves precision@1% from 0.29→0.36 and recall@1% from 0.51→0.62, maintaining calibration. These gains enable earlier identification without expanding review queues. The approach fits standard federated workflows with gradient sharing and secure aggregation; code and synthetic benchmarks will be released.
Sample ≤100-word version (≈98 words)
We introduce RankFuse, a federated ranking method for rare-event prediction in multi-site EHRs. Using truncated listwise losses and a cross-site consensus penalty, RankFuse aligns phenotype-specific rankings while preserving site idiosyncrasies. Theory gives conditions for precision gains at fixed capacity and a finite-sample bound on recall@K under 0.2–1% prevalence. Simulations show +16–24% precision@1% over calibrated XGBoost and +11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850k; 2,970 events), precision@1% improves 0.29→0.36 and recall@1% 0.51→0.62 with calibration intact. RankFuse integrates into federated workflows via gradient sharing; code and synthetic benchmarks will be released.
Discussion
- What was removed or condensed in each shorter version, and why didn’t meaning suffer?
- Did all five components survive the compression?
4.3 Keywords
Keywords are words in addition to those in the title that attract search queries. Including the most relevant keywords helps other researchers find your paper.
- Do not repeat words already in the title.
- List them in alphabetical order.
- Choose words and phrases that suggest what the topic is about.
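These rules are mechanical enough to automate. The sketch below is a hypothetical helper, not a standard tool: it drops any candidate keyword whose every word already appears in the title, then returns the rest in alphabetical order.

```python
def prepare_keywords(candidates: list[str], title: str) -> list[str]:
    """Drop candidates fully covered by the title, then sort alphabetically."""
    title_words = {w.lower() for w in title.replace("-", " ").split()}
    kept = [
        k for k in candidates
        if not all(w in title_words for w in k.lower().split())
    ]
    return sorted(kept, key=str.lower)
```

For example, with the title "Regression Shrinkage and Selection via the Lasso", the candidate "lasso" is dropped because it already appears in the title, while multi-word phrases that add new words are kept.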
4.4 Introduction
The introduction section is always the first section of a paper. Some journals may not call it introduction but require a section that serves the same purpose. The purpose of the introduction is to stimulate the reader’s interest and to provide background information which is pertinent to the study (Jenkins 1995). The introduction section guides the readers from a general subject area to the narrow topic of the paper. It should answer three questions:
- Why does it matter?
- What has already been done?
- What is new?
That is, the introduction needs to explain the importance of the paper's topic, provide the background of the research, and highlight the contributions of the work. At the end of the introduction, a roadmap, or outline of the paper, is useful in helping readers navigate the following sections.
The introduction is typically outlined at the very beginning of the writing process, but completed towards the end, after the other sections have been written. Do NOT wait until the end to perform the literature review, however! It should happen before the research is undertaken, to ensure you are not duplicating work that has already been done.
An introduction often contains the following items.
- An overview of the topic. Start with a general overview of your topic and narrow it to the specific subject you are addressing. Then mention the questions or concerns you will address. Explain why they are important and why they need to be addressed now.
- Existing works. The introduction is the place to review what others have concluded about your topic. The literature review should be thorough, including both classical and recent works. It shows that you are aware of prior research, and it introduces past findings to readers who may not have that background.
- A gap. Identify what is missing, given the importance of the topic and the current state of the literature; this is the rationale for your work. Why are existing methods not sufficient? What are the elements of an attractive solution?
- Contributions. This is the thesis statement, which summarizes the contributions of your work to the existing literature and answers the “what is new” question.
- A roadmap. A brief summary of what each section does in the paper. This concludes the introduction.
4.4.1 Components in an Introduction
Overview of the topic.
Begin with a general discussion of the area, narrowing down to your specific research focus. Emphasize the importance and timeliness of the problem.
Example: “Electronic health records provide rich opportunities for predictive modeling, but their high dimensionality and sparsity create statistical challenges.”
Existing work.
Briefly review relevant literature. Include both classical and recent works, highlighting what is known.
Example: “Traditional regression approaches often break down in high dimensions, and recent machine learning methods, while predictive, lack interpretability.”
The gap.
Identify what is missing or insufficient in the current body of work—this is the rationale for your study.
Example: “However, existing methods rarely exploit the hierarchical structure of medical codes, leaving a gap in clinically meaningful risk prediction.”
Contributions.
Clearly state what is new and how it advances the field. This is the thesis statement of your paper.
Example: “We propose a tree-guided feature selection method that integrates sparsity with clinical hierarchies to improve both interpretability and predictive power.”
Roadmap.
Provide a brief outline of the rest of the paper.
Example: “The remainder of this paper is organized as follows. Section 2 introduces the model formulation. Section 3 presents theoretical guarantees. Section 4 reports simulation results, and Section 5 applies the method to suicide risk prediction with EHR data.”
Exercise 4.4 Spot the Components of an Introduction
Legend:
Overview
Existing work
Gap / Limitation
Contribution
Roadmap
Introduction:
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Show answer
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the symbol positions of the input and output sequences… Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance.
The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling… In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
(Implicit roadmap: The paper goes on to describe the model, training setup, and evaluation, but here the “In this work we propose…” serves as both contribution and a forward pointer.)
4.4.2 Tips and Common Mistakes in Writing Introductions
4.4.2.1 Useful tips
- Use the first sentence of each paragraph to state the main idea.
Example 4.7 “High-dimensional genomic data pose unique challenges for clustering because of noise and sparsity…”
This first sentence signals the point of the paragraph; the following sentences can then elaborate with details and evidence.
Exercise 4.5 Take a paragraph from one of your drafts and rewrite the first sentence so that the main idea is clear at the start.
- Limit the scope of the background to what directly connects to your research problem.
Example 4.8 “Clustering has been studied by [1], [2], [3], [4], [5]…”
Better: “Several approaches have been proposed for clustering in high-dimensional data. Penalized methods [1,2] improve stability, while likelihood-based approaches [3–5] provide inference tools. Yet none are tailored to discrete genomic data.”
Instead of reviewing all clustering methods, narrow the discussion to those designed for genomic count data.
Exercise 4.6 Find two sentences in your current draft introduction that provide background but do not directly relate to your problem. Revise or cut them.
- Link the gap and contribution in one motion.
Example 4.9 “Existing sparse regression methods fail to use the disease hierarchy in EHR data. To close this gap, we develop a tree-guided feature aggregation method.”
Exercise 4.7 Write one sentence where you state a limitation in the literature and directly follow it with your contribution.
- Keep the roadmap short and functional. If it would be entirely generic, you may not need such a paragraph.
Example 4.10 Section 2 presents the method, Section 3 reports simulations, and Section 4 applies the model to EHR data.
Exercise 4.8 Write a one-sentence roadmap for a paper you have recently read.
4.4.2.2 Common mistakes
- Burying the main idea at the end of the paragraph.
Example 4.11 “Many methods have been proposed for sparse regression, ranging from penalization to Bayesian priors. Some perform well, others fail in practice… In this paper, we study tree-guided regression.”
Revised: “In this paper, we study tree-guided regression, motivated by challenges in sparse regression. We review existing penalized and Bayesian methods to highlight the limitations that our approach addresses.”
- Listing prior work without synthesis and connections to your own work.
Exercise 4.9 Turn a list of references from your notes into a synthesized, comparative statement.
- Overloading the introduction with method details.
Example 4.12 “Our method solves a convex optimization problem with ADMM iterations and gradient descent, initializing with ridge regression and adapting a warm start strategy, and then applies cross-validation to select the penalty parameter…”
Exercise 4.10 Take a method-heavy introduction paragraph you wrote and reduce it to a single motivating sentence.
- Failing to connect the gap and the contribution.
Exercise 4.11 Write two linked sentences: the first states a limitation, the second shows how your work addresses it.
- Using excessive mathematical notation instead of clear language.
- Over-exaggerating the contribution or being too superficial.
- Ignoring key references or cherry-picking the literature.
4.5 Data
The data section should provide all the details that are relevant for the research project.
- Who collected the data (source)?
- How was the data collected? Sampling frame? Sampling approach?
- What period or range does the data cover?
- Why does the data help answer the research question?
- What exploratory analyses are done (descriptives, visualization, etc.)?
4.5.1 Structure and Key Elements
The data section forms the empirical backbone of a research paper. It explains where the data come from, how they were collected, what they contain, and what insights can be drawn from preliminary exploration. The goal is to help readers evaluate the quality, appropriateness, and limitations of the data before interpreting any statistical results.
Data source and access Indicate who collected the data, when and where they were collected, and how access was obtained. Mention whether the data are publicly available, proprietary, or collected by the investigators.
Study design and population Describe how the analytic dataset was constructed, including inclusion and exclusion criteria, time period, and the unit of analysis. Specify whether the data arise from an experiment, a survey, or an observational study.
Variables and measurements Define outcomes, exposures, predictors, and covariates. Provide clear operational definitions and coding rules. Identify coding systems (for example, ICD, LOINC, RxNorm) and describe how composite variables were formed.
Data preprocessing Summarize the steps taken to prepare the data for analysis—such as filtering, merging, deduplication, variable transformation, and handling of missing data or outliers.
Preliminary data analysis Present descriptive and exploratory analyses that reveal the structure and features of the data. This stage helps to identify irregularities, guide model selection, and provide context for interpretation. Common elements include:
- Descriptive statistics: sample size, means, medians, proportions, standard deviations.
- Graphical summaries: histograms, boxplots, scatterplots, or correlation matrices.
- Checks for data quality: missingness patterns, implausible values, and consistency across related variables.
- Initial group comparisons or cross-tabulations to detect broad relationships or heterogeneity.
- Analysis to justify the motivation of the proposed methods.
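The descriptive checks above can be sketched in a few lines of code. This is a minimal, self-contained illustration using a hypothetical toy sample (the variable names and values are invented for demonstration, not taken from any study):

```python
import statistics

# Hypothetical toy sample with missing values recorded as None (illustrative only).
ages = [34, 51, None, 42, 29, None, 60, 47]

observed = [a for a in ages if a is not None]
n_missing = sum(a is None for a in ages)

# Basic descriptives plus a missingness check, the kind of summary that
# belongs in a short paragraph or a small table in the data section.
summary = {
    "n": len(ages),
    "n_missing": n_missing,
    "missing_rate": n_missing / len(ages),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "sd": statistics.stdev(observed),
}
print(summary)
```

In a real project the same logic would run over every key variable, with implausible values (e.g., negative ages) flagged alongside the missingness rates.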
Either in the data section or later in the discussion, one should address known biases, limitations, or measurement errors, and clarify to what population the findings can reasonably be generalized.
Tips
- Integrate descriptive results into short paragraphs rather than long, unconnected tables.
- Always check the internal consistency of key variables before proceeding to modeling.
- Use visualization to complement numerical summaries; readers grasp patterns more quickly in plots.
- Point out any notable trends, distributions, or anomalies that motivated analytic choices later.
- Keep preliminary analyses exploratory; do not perform formal hypothesis testing at this stage.
4.6 Methods
The methods section is the technical heart of a research paper. It presents the modeling framework, the inferential or computational strategy, and the assumptions that underlie the proposed approach. The goal is to make the methodology clear, logical, and reproducible, not necessarily to show every line of algebra, but to communicate the essential ideas and reasoning behind them.
- Establish notation.
- What are the observed data?
- What are the models?
- What are the parameters to be estimated?
- How are the point estimators obtained?
- How is the uncertainty (standard errors) of the point estimators assessed?
- How are the variances of the point estimators estimated?
- How are the null distributions of the test statistics established?
- Clearly state the assumptions and claims of theoretical results.
4.6.1 Major Components of a Method Section
Overview of the method
Begin with a short paragraph describing the main idea of the proposed method in plain language. This paragraph should be connected to the discussion in the previous sections.
Example 4.13 “Motivated by XXXX, we propose a penalized regression framework that integrates group structure among predictors to achieve both sparsity and interpretability.”
This sets the stage before introducing notation or equations.
Notation and data structure
Clearly define the observed data and notation before presenting formulas. Ambiguous notation confuses readers quickly.
Example 4.14
- Specify dimensions (e.g., an \(n \times p\) design matrix).
- Distinguish random variables, parameters, and fixed constants.
- Use consistent notation throughout the paper; avoid redefining symbols.
Model specification
State the model explicitly, whether it is probabilistic, deterministic, or algorithmic. Describe key assumptions about the data-generating process.
Example 4.15 Regression models, hierarchical models, latent variable models, or machine-learning formulations. Clarify what each component represents, e.g., \[ Y_i = X_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2). \] Briefly interpret each term in words.
Estimation procedure
Explain how the parameters are estimated.
- Likelihood-based: maximum likelihood or Bayesian inference.
- Regularized methods: penalized loss minimization.
- Nonparametric or algorithmic methods: iterative optimization or ensemble procedures.
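As a concrete instance of the first category, here is a minimal sketch of closed-form least-squares estimation for the simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\). The data are synthetic and the parameter values (intercept 2, slope 3) are invented for illustration:

```python
import random

random.seed(0)

# Synthetic data from y_i = 2 + 3 * x_i + eps_i (parameters chosen for illustration).
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 0.5) for xi in x]

# Closed-form OLS: slope = S_xy / S_xx, intercept = ybar - slope * xbar.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = ybar - beta1 * xbar
print(beta0, beta1)  # should be close to (2, 3)
```

For maximum likelihood under Gaussian errors this coincides with least squares; penalized or algorithmic estimators replace the closed form with an optimization routine, but the writing advice is the same: state the objective being minimized before describing how it is solved.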
Inference and uncertainty quantification
Describe how to assess uncertainty around estimates, for example, standard errors, confidence intervals, or posterior credible intervals.
Explain how variance is estimated (analytically, by bootstrap, jackknife, or asymptotic approximation).
State how null distributions are established for test statistics or how p-values are computed.
These elements are considered essential for a statistics paper; even if developing inferential methods is not the focus, some discussion is still worthwhile.
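Of the variance-estimation strategies mentioned above, the nonparametric bootstrap is the easiest to sketch. The following toy example (synthetic data; the estimand is simply the population mean) compares a bootstrap standard error with the analytic formula \(s/\sqrt{n}\):

```python
import random
import statistics

random.seed(1)

# Synthetic sample; the estimand is the population mean (illustrative only).
sample = [random.gauss(10, 2) for _ in range(100)]
theta_hat = statistics.mean(sample)

# Nonparametric bootstrap: resample with replacement, re-estimate,
# and take the standard deviation of the bootstrap replicates.
B = 500
boot = []
for _ in range(B):
    resample = [random.choice(sample) for _ in range(len(sample))]
    boot.append(statistics.mean(resample))
se_boot = statistics.stdev(boot)

# Analytic standard error of the mean for comparison: s / sqrt(n).
se_analytic = statistics.stdev(sample) / len(sample) ** 0.5
print(se_boot, se_analytic)
```

The same recipe applies to estimators without a convenient analytic variance; the methods section should state the number of bootstrap replicates and how intervals are formed from the replicates.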
Theoretical properties
If the paper contains theory, state main assumptions and results clearly. Each theorem should be accompanied by a short intuitive explanation of what it means and why it matters.
Examples include consistency, oracle properties, convergence rates, or robustness guarantees. Avoid overly detailed proofs in the main text; rather, place them in the appendix if necessary.
It often makes sense to have a separate Theory section.
Computation and implementation
Derive and present the optimization/computation algorithms if applicable.
Discuss computational aspects such as algorithmic complexity, convergence criteria, and scalability.
If specialized software or code is used, mention it explicitly and reference open-source implementations when possible.
Include details about hyperparameter tuning, initialization, and stopping rules.
If space is limited, provide pseudocode instead; technical content should be presented in an appendix or supplement.
4.6.2 Tips for Writing the Method Section
- Start with intuition before equations. Readers should understand what problem the method solves and why before seeing how.
- Define notation once and stick to it. Overloaded symbols are one of the most common sources of confusion.
- Separate model, estimation, and inference. Organizing the section around these three themes improves clarity.
- Explain assumptions explicitly. Hidden assumptions (e.g., independence, homoscedasticity, exchangeability) should be spelled out and discussed.
- Connect to prior work. Briefly mention how your approach relates to or extends existing methods; this orients the reader.
- Avoid unnecessary math in the main text. Focus on ideas; detailed derivations belong in the appendix or supplement.
- Ensure reproducibility. Indicate how the procedure can be implemented (software, packages, or code snippets).
- Include a roadmap if the section is long. For example: “This section introduces the model in Section 3.1, the estimation algorithm in Section 3.2, and the asymptotic properties in Section 3.3.”
4.7 Simulation
Simulation studies are essential for understanding how a method behaves when the truth is known. They reveal operating characteristics that a single real dataset cannot: bias, variance, coverage, type I error, power, robustness to misspecification, sensitivity to tuning, and scalability. A clear simulation section follows the simple ADEMP logic (Morris, White, and Crowther 2019):
- Aims
- Data generating mechanism
- Estimand/target of analysis
- Methods
- Performance measures
4.7.1 Major Components of a Simulation Section
A clear simulation study follows a predictable structure. Organize your section with the items below so readers can understand, reproduce, and trust your results.
Simulation aims State what the simulation is designed to assess. Examples: small-sample bias and variance; variable-selection accuracy under correlation; calibration of uncertainty quantification; type I error and power under local alternatives; robustness to outliers or heavy tails; performance under distribution shift; runtime and memory growth with n and p.
Data-generating mechanisms (setups) Specify precisely how synthetic data are created so others can reproduce your study.
- Design grid: factors and levels (for example, sample size; dimension; signal-to-noise; sparsity; correlation; number of clusters; heterogeneity levels; missing rate).
- Models and distributions: exact formulas and parameter values.
- Estimands: the target(s) you want to learn about (parameter estimation, prediction, coverage, power, etc.).
- Practical details: number of replicates R, train/validation/test splits, cross-validation folds, preprocessing pipeline, tuning protocol. If you simulate missingness, state mechanism (MCAR/MAR/MNAR) and rates.
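A design grid and data-generating mechanism can be written down compactly in code, which is also a good way to fix the grid before running. The sketch below assumes a simple sparse linear model with five true signals; all factors, levels, and parameter values are hypothetical choices for illustration:

```python
import itertools
import random

# Hypothetical design grid: factors and levels are fixed before any runs.
grid = {
    "n": [50, 200],       # sample size
    "p": [10, 100],       # dimension
    "snr": [0.5, 2.0],    # signal strength
}
settings = [dict(zip(grid, levels)) for levels in itertools.product(*grid.values())]

def generate(setting, seed):
    """Draw one synthetic dataset from a simple sparse linear model."""
    rng = random.Random(seed)  # per-replicate seed for reproducibility
    n, p, snr = setting["n"], setting["p"], setting["snr"]
    beta = [snr] * 5 + [0.0] * (p - 5)  # 5 true signals, the rest are noise
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(b * xij for b, xij in zip(beta, xi)) + rng.gauss(0, 1) for xi in X]
    return X, y

print(len(settings))  # 2 * 2 * 2 = 8 design cells
```

Reporting the grid in exactly this form (factor, levels, formula, seed policy) makes the study reproducible without any guesswork.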
Competing methods List all methods compared, including straightforward baselines, standard references, and state-of-the-art methods from the literature.
- Tuning and implementation: for each method, how hyperparameters are selected (cross-validation, information criteria, theory-guided rules), and software versions/options used.
- Variants and ablations: include stripped versions of your method to show which components drive gains.
Evaluation metrics (performance measures)
- Estimation: bias, variance, MSE; support recovery (TPR, FPR, precision, recall, F1); selection stability across resamples.
- Prediction: test RMSE/MAE; AUROC and AUPRC (for imbalanced classification); specificity/sensitivity; positive predictive value (PPV).
- Inference: empirical coverage and average length of intervals; type I error; power under prespecified alternatives; FDR/FWER as appropriate.
- Robustness and efficiency: sensitivity to misspecification; runtime and memory; convergence diagnostics.
Report uncertainty across Monte Carlo replicates (for example, mean ± Monte Carlo standard error or distributional summaries).
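The loop that produces these summaries is short. Here is a minimal sketch, assuming the estimand is a population mean and the method is the sample mean with a normal 95% interval; it reports bias, MSE, empirical coverage, and the corresponding Monte Carlo standard errors:

```python
import random
import statistics

random.seed(2)

# Monte Carlo study of the sample mean as an estimator of the true mean.
true_mu, sigma, n, R = 5.0, 1.0, 50, 1000

estimates, covered = [], []
for _ in range(R):
    data = [random.gauss(true_mu, sigma) for _ in range(n)]
    mu_hat = statistics.mean(data)
    se_hat = statistics.stdev(data) / n ** 0.5
    estimates.append(mu_hat)
    # Does the normal 95% interval contain the truth in this replicate?
    covered.append(mu_hat - 1.96 * se_hat <= true_mu <= mu_hat + 1.96 * se_hat)

bias = statistics.mean(estimates) - true_mu
mse = statistics.mean((e - true_mu) ** 2 for e in estimates)
coverage = sum(covered) / R

# Monte Carlo SEs: bias -> sd(estimates)/sqrt(R); coverage -> sqrt(p(1-p)/R).
mc_se_bias = statistics.stdev(estimates) / R ** 0.5
mc_se_cov = (coverage * (1 - coverage) / R) ** 0.5
print(bias, mse, coverage, mc_se_bias, mc_se_cov)
```

Reporting, say, “coverage 0.94 (MC SE 0.007)” lets readers judge whether an apparent departure from the nominal 95% is real or Monte Carlo noise.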
Results presentation and interpretation
Figures and tables
- Prefer compact, readable visuals: line plots across a design grid, box/violin plots for variability across replicates, and heatmaps when two factors vary jointly.
- Always display uncertainty (error bars for means; boxplots).
- Use consistent axes, units, and significant digits across panels; label methods and settings directly on the plot or in a clear legend.
- Place long or dense tables in the supplement; keep main-text tables small and focused.
Narrative
- For each figure/table, provide crisp takeaways in plain language. Lead with a summary and a general pattern, then support with representative numbers.
- Explain any “unexpected” results.
- Avoid repeating every number; synthesize what the reader should conclude.
- Be objective.
4.7.1.1 Tips
- Align simulations with claims; include the regimes your method is intended to handle.
- Cover easy, moderate, and hard settings; avoid cherry-picking a single favorable point.
- Keep tuning fair and inside training folds; never tune on test data.
- Fix the design grid before running to reduce post hoc choices.
- Report Monte Carlo error; means alone can mislead.
- Prefer interpretable plots over large tables; move long tables to the supplement.
- Include ablations to justify why the method works.
- Document failures and edge cases; negative results increase credibility.
4.8 Application
- Report the statistical analyses in tables/figures.
- When summarizing from tables/figures, paint the big picture, rather than reiterating all of the little details.
- Discussions to link the analyses back to the substantive topic (Miller 2015):
Having presented the individual pieces of evidence, an investigator must summarize how that evidence, taken together, supports the conclusion of the investigation. Statisticians should explain how the statistical evidence answers the question posed at the beginning of the paper, following standard expository guidelines for writing an analytic essay.
4.9 Discussion and Conclusion
- A summary, again, of the contributions of the research.
- The research question posed as the ‘need’ of the introduction must be answered here (Zeiger 2000).
- Limitations of the current study.
- Future directions.
4.10 Appendix
- Technical details (e.g., proofs, algorithms) that would otherwise break the flow of the main text.
- Data source details.
4.11 Acknowledgements
This section is optional, but could be used to acknowledge certain individuals who have contributed to the research and/or success of the manuscript (e.g., peer reviewers).
In general, if the research upon which you are writing was funded, the funding agency and funding mechanism are typically included here unless otherwise specified.
4.12 References
- Every reference cited in the paper should appear here.
- References not cited should not appear here.
- All are automatically taken care of by BibTeX.
- Styles are controlled by the bibliography style (.bst file).
4.14 General Tips
From Jenkins (1995):
In order to maintain continuity between the key sections (introduction, methods, results and discussion) it is helpful to consider the manuscript as telling a story.
The strong parts to the story-line are the introduction and the discussion, so the link between these sections must be clear.
Devices such as paragraphing, headings, indentation, and enumeration help the reader see the major points that you want to make.
As a rule of thumb, if you type a full page (double spaced) without indenting for a new paragraph, you almost certainly have run one thought into another and have missed an opportunity to differentiate your ideas.
Any tables and figures included in the manuscript must be mentioned (referenced) within the main text.
If journal/instructions do not specify otherwise, tables and figures should be placed near (ideally after) the related text, and on the top of the page.
Use consistent notation throughout the manuscript, avoid defining any unnecessary notation, and avoid using the same notation to describe different things (variables, indices, etc.).
4.15 General Tips: Use of English
It is relatively easy to read and understand English that is well written. As the quality of writing deteriorates, however, it becomes progressively more difficult for the reader to understand the author’s intended meaning.
An obvious problem occurs when the author fails to use properly constructed sentences. This can often be corrected with revision and external review.
A more serious problem occurs when the author unconsciously assumes that the reader is able to follow an unwritten train of thought. The reasoning may be clear in the author’s mind, but not on the page. This can also usually be caught by careful revision and by asking others to read a draft.
More tips on the use of English, especially for statistics, machine learning, and data science papers:
Use of tenses. Use present tense for most of the manuscript (for example, when describing the content of the paper, the structure of the method, or general facts). Use past tense when describing events that occurred in the past (for example, data collection, an experiment that has been conducted, the design of a simulation). Be consistent with tense usage within sections, paragraphs, and related sentences.
Use of the word “significant”. Do not use the word “significant” in a vague or informal sense. Reserve it for a statistical meaning (for example, “statistically significant at the 5% level”). If you mean “large”, “important”, or “meaningful”, use those words instead.
Use of “data”. The word “data” is often mistakenly treated as a singular noun. It is, in fact, plural (its singular form is “datum”). In practice, both “the data are” and “the data set is” are acceptable; choose one style and use it consistently.
Avoid unnecessary jargon. Many terms in statistics and machine learning are not familiar to all readers. When first using a technical term, provide a short explanation in plain language. Do not overload sentences with many specialized terms.
Define acronyms and abbreviations. Spell out each acronym the first time it appears in the main text, followed by the abbreviation in parentheses, and then use the abbreviation afterwards. For example: “area under the receiver operating characteristic curve (AUC)”. Avoid long strings of acronyms in the same sentence.
Prefer clear, direct sentences. Long, nested sentences with many clauses and parentheses are hard to follow, especially when equations or technical terms are included. It is often better to split a long sentence into two or three shorter ones. As a rule of thumb, if a sentence spans more than three lines, consider breaking it.
Avoid vague qualifiers. Words such as “very”, “quite”, “rather”, or “somewhat” rarely add precision. Instead of “the method is very accurate”, write “the method achieves a prediction error 30% lower than the baseline”.
Distinguish speculation from evidence. Clearly separate what is supported by data from what is a conjecture or an open question. Phrases such as “our results suggest that”, “it is plausible that”, or “one possible explanation is” help signal the level of certainty.
Be careful with causal language. In many statistical and machine learning analyses, the data and design only support association statements. Avoid verbs such as “cause” or “lead to” unless the design justifies causal interpretation. Use “is associated with” or “is related to” when appropriate.
Maintain consistent terminology. Choose one term for each concept (for example, “predictor”, not alternating between “feature”, “variable”, and “covariate” without reason) and use it throughout. Changing terms can confuse readers and make the paper harder to follow.
Avoid informal or conversational expressions. Phrases such as “obviously”, “of course”, “clearly”, or “we just” can sound dismissive or may be incorrect for some readers. If a point is important, it is better to explain it carefully than to label it “obvious”.
These points may seem minor, but they have a large impact on how easily readers can understand and trust your work. Good English usage does not mean fancy vocabulary; it means clear, precise, and consistent language that serves the statistical ideas.