4 Writing Specific Sections
4.1 Title
The title is the most read part of an article, and it often determines whether a reader goes on to read the manuscript.
Hairston and Keene (2003) suggest that the title of a research paper should accomplish four goals:
- predict the content of the research paper;
- be interesting to the reader;
- reflect the tone of the writing;
- contain important keywords (making the paper easier to find through keyword searches).
The title of a paper is usually determined when the paper is close to completion. To come up with a good title, list some key phrases that you would like to include, and be creative in forming a title that incorporates most of them. Here are some tips.
- Be informative by including these aspects: topic, method(s), data, and results.
- Consider adding a subtitle to give more specifics about the paper.
- Use appropriate critical keywords to increase the discoverability of the paper.
- Follow the requirements from the instructions or journals.
- Keep it as concise as possible.
Example 4.1 Less effective vs. effective titles
- A Comprehensive Regression Shrinkage Estimation and Variable Selection Procedure Using a Novel Least Absolute Shrinkage and Selection Operator
- Too long, technical, and redundant; buries the key idea.
Reveal the real title
Regression Shrinkage and Selection via the Lasso (Tibshirani 1996)
- Concise and memorable; immediately signals novelty.
- A New Nonparametric Resampling-Based Monte Carlo Estimator for Determining the Optimal Quantity of Groupings with Multivariate Data
- Overly technical; hard to parse at a glance.
Reveal the real title
Estimating the Number of Clusters in a Data Set via the Gap Statistic (Tibshirani, Walther, and Hastie 2001)
- States problem, method, and context clearly.
- A Stepwise Algorithm for Sequentially Updating Correlation-Adjusted Regression Coefficients in High-Dimensional Settings with Strong Multicollinearity
- Too wordy and technical; makes the method sound less accessible.
Reveal the real title
Least Angle Regression (Efron et al. 2004)
- Short, clear, and intriguing; signals both novelty and relevance.
- A Generalized Ensemble Learning Framework Based on Aggregating Decision Tree Classifiers for Improved Prediction Accuracy with Complex Data
- Technically accurate but long, clunky, and forgettable.
Reveal the real title
Random Forests (Breiman 2001)
- Extremely concise and memorable; easy to cite and recall.
Exercise 4.1 Can you guess the title from the abstract?
- Abstract:
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
Reveal the real title
Attention Is All You Need (Vaswani et al. 2017a)
- Strikingly simple and memorable; it reframed the field with just five words.
- Abstract:
We propose an adaptive nuclear norm penalization approach for low-rank matrix approximation, and use it to develop a new reduced rank estimation method for high-dimensional multivariate regression. The adaptive nuclear norm is defined as the weighted sum of the singular values of the matrix, and it is generally nonconvex under the natural restriction that the weight decreases with the singular value. However, we show that the proposed nonconvex penalized regression method has a global optimal solution obtained from an adaptively soft-thresholded singular value decomposition. The method is computationally efficient, and the resulting solution path is continuous. The rank consistency of and prediction/estimation performance bounds for the estimator are established for a high-dimensional asymptotic regime. Simulation studies and an application in genetics demonstrate its efficacy.
Reveal the real title
Reduced Rank Regression via Adaptive Nuclear Norm Penalization (Kun Chen, Chan, and Stenseth 2013)
- Highlights both the framework and the new contribution.
- Abstract:
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Reveal the real title
Adam: A Method for Stochastic Optimization (Kingma and Ba 2015)
- Minimalist and branded: a single word, “Adam,” now instantly recognized in ML.
4.2 Abstract
Marron (1999) recommended the following:
Abstract material needs to be carefully chosen. A balance between the twin goals of brevity and maximal information content should again be carefully sought. There is room for more detail than in the title, but not enough room for all ideas covered in the paper. Make sure each “high point” is included. The paper will have a better chance in the review process if it is made clear what is done, and why it is important, since this will immediately capture the interest of the reviewer.
Any recommendations for length here must be more case dependent. Longer papers will usually need longer abstracts. However, something between 4 and 10 sentences is reasonable for most situations.
Mathematical notation is rarely useful in the abstract. Sometimes notation is introduced in an abstract, and then not used at all! Even when notation is used in the abstract, the point can usually be conveyed more efficiently in words alone.
Tips:
- Consider opening with a sentence that establishes the importance of the subject of the paper.
- Identify a gap in the literature to set up the background and motivation of the paper.
- Highlight the novelty/contributions of the paper.
- For application papers, allude to new discoveries and their impacts.
- For method papers, outline the essence of the methodology, and evidence from theoretical and numerical studies supporting the methods.
- It must make sense when read in isolation for those who only read the abstract, and must also provide a clear and accurate summary of the manuscript for readers who read the entire manuscript (Zeiger 2000).
- Should not include citations.
4.2.1 Components in an Abstract
There are five major components in an abstract:
- Context (1–2 sentences): situate the problem and why it matters.
- Objective (1 sentence): the question or aim.
- Approach (1–2 sentences): data, design, and key methods.
- Findings (1–2 sentences): the most important results.
- Implications (1 sentence): what the results imply (perhaps in a broader context).
Components can also blend. For example, the objective and approach may appear together in one line, e.g., ‘We aim to estimate Y by developing X’, and a single sentence may introduce a novel approach and immediately state the key result, e.g., ‘By introducing A, we show B’.
There are other optional components (sometimes required by the venue):
- Data set: sample size or cohorts (applied/clinical).
- Assumptions/Limitations: stating key conditions or the main caveat (theory/methods).
- Uncertainty/Effect size: CIs, rates, or precision that quantify findings.
- Software/Computation/Availability: runtime/complexity; package; code/data source.
- Registration/Ethics: trial ID, preregistration, or IRB when human subjects are involved.
- Funding/Disclosure: only if the outlet asks for it in the abstract.
Length: typically ~150–250 words for journals.
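Word limits like this are easy to check mechanically before submission. The Python sketch below counts whitespace-separated words, the convention most journals use; the 150–250 defaults are just this section's rule of thumb, not any particular journal's requirement.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words, the convention most word limits use."""
    return len(text.split())

def check_abstract(text: str, lo: int = 150, hi: int = 250) -> str:
    """Flag a draft abstract that falls outside a venue's word limits."""
    n = word_count(text)
    if n < lo:
        return f"{n} words: under the {lo}-word minimum"
    if n > hi:
        return f"{n} words: over the {hi}-word limit"
    return f"{n} words: within limits"
```

Running a draft through such a check during revision (Exercise 4.3 below practices this by hand) helps catch over-length abstracts early.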
Example 4.2 Transfer learning under large-scale low-rank regression models (A recent JASA paper)
Legend: Context, Objective, Approach, Findings, Implications
In high-dimensional multiple response regression problems, the large dimensionality of the coefficient matrix poses a challenge to parameter estimation. To address this challenge, low-rank matrix estimation methods have been developed to facilitate parameter estimation in the high-dimensional regime, where the number of parameters increases with sample size. Despite these methodological advances, accurately predicting multiple responses with limited target data remains a difficult task. To gain statistical power, the use of diverse datasets from source domains has emerged as a promising approach. In this paper, we focus on the problem of transfer learning in a high-dimensional multiple response regression framework, which aims to improve estimation accuracy by transferring knowledge from informative source datasets. To reduce potential performance degradation due to the transfer of knowledge from irrelevant sources, we propose a novel transfer learning procedure including the forward selection of informative source sets. In particular, our forward source selection method is new compared to existing transfer learning framework, offering deeper theoretical insights and substantial methodological innovations. In addition, we develop an alternative transfer learning based on non-convex penalization to ensure rank consistency. Theoretical results show that the proposed estimator achieves a faster convergence rate than the single-task penalized estimator using only target data. Through simulations and real data experiments, we provide empirical evidence for the effectiveness of the proposed method and for its superiority over other methods.
(Optional) The proposed framework clarifies when and how multi-source transfer improves multi-response prediction, guiding principled use of external data in high-dimensional applications.
Example 4.3 Conformal prediction with conditional guarantees (A recent JRSSB paper)
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of prespecified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Example 4.4 Exact Bayesian inference for fitting stochastic epidemic models to partially observed incidence data (A recent AoAS paper)
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet fitting these models to partially observed data can be a difficult task due to intractability of the marginal likelihood, even for classic Markovian models. To remedy this issue, this article introduces a novel data-augmented Markov chain Monte Carlo sampler for exact Bayesian inference under the stochastic susceptible-infectious-removed model, given only discretely observed counts of infections. In a Metropolis–Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the target process and from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently and easily scales to outbreaks with thousands of infections. We prove that our sampler is uniformly ergodic and find empirically that it mixes much faster than existing single-site samplers. We apply the algorithm to fit a semi-Markov susceptible-infectious-removed model to the 2013–2015 outbreak of Ebola Haemorrhagic Fever in Guéckédou, Guinea.
(Optional) Implication: The sampler enables principled Bayesian inference for partially observed epidemics at realistic scales, improving reliability of outbreak analysis and forecasting.
Example 4.5 Statistical significance of clustering for count data (A recent Biometrics paper)
Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.
(Optional) Implication: Rigorous significance testing for clusters in count data can reduce false discoveries and sharpen biological and clinical subgrouping.
Example 4.6 Robust transfer learning with unreliable source data (A recent AoS paper)
This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distributions. We introduce a novel quantity called the “ambiguity level” that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed “Transfer Around Boundary” (TAB) method, with a threshold that balances the performance contributions of the target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on nonparametric classification and logistic regression tasks, achieving upper bounds which are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.
(Optional) Implication: By quantifying “ambiguity” and adapting transfer near class boundaries, TAB offers a principled recipe for robust performance gains while guarding against negative transfer in practice.
Exercise 4.2 Spot the Components of an Abstract
Legend: Context, Objective, Approach, Findings, Implications
Abstract:
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage over the covariates or are restricted to a limited set of conditional targets, e.g. coverage over a finite set of prespecified subgroups. This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional, we show how to simultaneously obtain exact finite-sample coverage over all possible shifts. For example, given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes where exact coverage is impossible, we provide a procedure for quantifying the coverage errors of our algorithm. Moreover, by tuning interpretable hyperparameters, we allow the practitioner to control the size of these errors across shifts of interest. Our methods can be incorporated into existing split conformal inference pipelines, and thus can be used to quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Show answer
We consider the problem of constructing distribution-free prediction sets with finite-sample conditional guarantees. Prior work has shown that it is impossible to provide exact conditional coverage universally in finite samples. Thus, most popular methods only guarantee marginal coverage… This paper bridges this gap by defining a spectrum of problems that interpolate between marginal and conditional validity. We motivate these problems by reformulating conditional coverage as coverage over a class of covariate shifts. When the target class of shifts is finite-dimensional… exact finite-sample coverage over all possible shifts. …given a collection of subgroups, our prediction sets guarantee coverage over each group. For more flexible, infinite-dimensional classes… we provide a procedure for quantifying the coverage errors… Moreover, by tuning interpretable hyperparameters… control the size of these errors… Our methods can be incorporated into existing split conformal pipelines… quantify the uncertainty of modern black-box algorithms without distributional assumptions.
Abstract:
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet, fitting these models to partially observed data is a notoriously difficult task due to intractability of the likelihood for many classical models. To remedy this issue, this article introduces a novel data-augmented MCMC algorithm for exact Bayesian inference under the stochastic SIR model, given only discretely observed counts of infection. In a Metropolis-Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the SIR model, from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently, and scales to outbreaks with hundreds of thousands of individuals. We show that the Markov chain underlying the algorithm is uniformly ergodic, and validate its performance via thorough simulation experiments and a case study on the 2013-2015 outbreak of Ebola Haemorrhagic Fever in Western Africa.
Show answer
Stochastic epidemic models provide an interpretable probabilistic description of the spread of a disease through a population. Yet, fitting these models to partially observed data is a notoriously difficult task due to intractability of the likelihood for many classical models. To remedy this issue, this article introduces a novel data-augmented MCMC algorithm for exact Bayesian inference under the stochastic SIR model, given only discretely observed counts of infection. In a Metropolis-Hastings step, the latent data are jointly proposed from a surrogate process carefully designed to closely resemble the SIR model, from which we can efficiently generate epidemics consistent with the observed data. This yields a method that explores the high-dimensional latent space efficiently, and scales to outbreaks with hundreds of thousands of individuals. We show that the Markov chain underlying the algorithm is uniformly ergodic, and validate its performance via thorough simulation experiments and a case study on the 2013-2015 outbreak of Ebola Haemorrhagic Fever in Western Africa.
(Optional) Implication: The algorithm enables principled, scalable Bayesian inference for partially observed epidemics at realistic population sizes.
Abstract:
Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting, while machine learning methods may suffer from the inability of producing interpretable results or clinically-meaningful risk factors. To improve EHR-based modeling and use the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of “or”. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in associating with suicide risk.
Show answer
Statistical learning with a large number of rare binary features is commonly encountered in analyzing EHR data, especially in modeling disease onset with prior diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is challenging… conventional methods lack power/consistency; ML methods lack interpretability/clinically-meaningful factors. To improve EHR-based modeling and leverage the disease hierarchy, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features… We convert the combinatorial problem into a convex linearly-constrained regularized estimation, enabling scalable computation with theoretical guarantees. In a suicide risk study, the method selects and aggregates diagnoses guided by ICD hierarchy. By balancing rarity and specificity, the strategy improves both prediction and interpretation. It identifies important higher-level categories and subcategories of mental health conditions and determines the needed specificity for associating with suicide risk.
Exercise 4.3 Compress an Abstract to the Required Length
Task. Read the original abstract (220 words), then rewrite it to ≤150 words, and then to ≤100 words while preserving the five core components: Context, Objective, Approach, Findings, Implications.
Original (≈220 words)
Rare-event risk prediction from multi-site EHR data is hampered by extreme class imbalance and heterogeneity across institutions. Models that perform well at one site often fail to generalize, and naive pooling can hide site-specific signals that matter clinically. We develop RankFuse, a federated learning framework that consolidates rankings from related event phenotypes across sites without sharing raw data. Our objective is to improve identification of the highest-risk patients under strict privacy constraints. RankFuse uses truncated listwise losses with top-K–focused mini-batches and a cross-site consensus penalty that aligns phenotype-specific orderings while allowing site-level deviations. We derive conditions under which RankFuse improves precision at fixed review capacity and prove a finite-sample bound on recall@K under prevalence 0.2–1%. In simulations with controlled heterogeneity, RankFuse raises precision@1% by 16–24% over calibrated XGBoost and by 11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850,000; 2,970 events), RankFuse improves precision@1% from 0.29 to 0.36 and recall@1% from 0.51 to 0.62, while preserving site-level calibration. These gains translate to earlier identification of high-risk patients without expanding the clinical review queue. The method fits into standard federated pipelines, requires only gradient sharing, and supports secure aggregation. Code for RankFuse and synthetic benchmarks will be released to facilitate adoption.
Sample ≤150-word version (≈148 words)
Rare-event risk prediction in multi-site EHRs suffers from class imbalance and cross-site heterogeneity, limiting generalization. We propose RankFuse, a privacy-preserving federated framework that consolidates phenotype-specific rankings across hospitals to better surface the highest-risk patients. RankFuse optimizes truncated listwise losses with top-K–focused mini-batches and a consensus penalty that aligns site rankings while allowing local deviations. We show conditions for improved precision at fixed capacity and provide a finite-sample bound on recall@K under 0.2–1% prevalence. In simulations with controlled heterogeneity, RankFuse raises precision@1% by 16–24% over calibrated XGBoost and by 11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850,000; 2,970 events), it improves precision@1% from 0.29→0.36 and recall@1% from 0.51→0.62, maintaining calibration. These gains enable earlier identification without expanding review queues. The approach fits standard federated workflows with gradient sharing and secure aggregation; code and synthetic benchmarks will be released.
Sample ≤100-word version (≈98 words)
We introduce RankFuse, a federated ranking method for rare-event prediction in multi-site EHRs. Using truncated listwise losses and a cross-site consensus penalty, RankFuse aligns phenotype-specific rankings while preserving site idiosyncrasies. Theory gives conditions for precision gains at fixed capacity and a finite-sample bound on recall@K under 0.2–1% prevalence. Simulations show +16–24% precision@1% over calibrated XGBoost and +11–18% over focal-loss baselines. In a 10-hospital suicide cohort (N=850k; 2,970 events), precision@1% improves 0.29→0.36 and recall@1% 0.51→0.62 with calibration intact. RankFuse integrates into federated workflows via gradient sharing; code and synthetic benchmarks will be released.
Discussion
- What was removed or condensed in each shorter version, and why didn’t meaning suffer?
- Did all five components survive the compression?
4.3 Keywords
Keywords are words in addition to those in the title that attract search queries. Including the most relevant keywords helps other researchers find your paper.
- Do not repeat words already in the title.
- List them in alphabetical order.
- Choose words and phrases that suggest what the topic is about.
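These rules are mechanical enough to automate. The sketch below is a hypothetical helper, not a standard tool: it drops any candidate keyword whose every word already appears in the title, then returns the rest in alphabetical order.

```python
def prepare_keywords(candidates: list[str], title: str) -> list[str]:
    """Drop candidates fully covered by the title, then sort alphabetically."""
    title_words = {w.lower() for w in title.replace("-", " ").split()}
    kept = [
        k for k in candidates
        if not all(w in title_words for w in k.lower().split())
    ]
    return sorted(kept, key=str.lower)
```

For example, with the title "Regression Shrinkage and Selection via the Lasso", the candidate "lasso" is dropped because it already appears in the title, while multi-word phrases that add new words are kept.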
4.4 Introduction
The introduction section is always the first section of a paper. Some journals may not call it introduction but require a section that serves the same purpose. The purpose of the introduction is to stimulate the reader’s interest and to provide background information which is pertinent to the study (Jenkins 1995). The introduction section guides the readers from a general subject area to the narrow topic of the paper. It should answer three questions:
- Why does it matter?
- What has already been done?
- What is new?
That is, the introduction needs to explain the importance of the paper's topic, provide the background of the research, and highlight the contributions of the work. At the end of the introduction, a roadmap, or outline of the paper, is useful in helping readers navigate the following sections.
The introduction is typically outlined at the very beginning of the writing process, but completed towards the end, after the other sections have been written. Do NOT wait until the end to perform the literature review, however! It should happen before the research is undertaken, to ensure you are not duplicating work that has already been done.
An introduction often contains the following items.
- An overview of the topic. Start with a general overview of your topic and narrow it to the specific subject you are addressing. Then mention the questions or concerns you will address. Explain why they are important and why they need to be addressed now.
- Existing works. The introduction is the place to review what others have concluded about your topic. The literature review should be thorough, including both classical and recent works. It shows that you are aware of prior research, and it introduces past findings to readers who may not have that background.
- A gap. Identify what is missing, given the importance of the topic and the current state of the literature; this is the rationale for your work. Why are existing methods not sufficient? What are the elements of an attractive solution?
- Contributions. This is the thesis statement, which summarizes the contributions of your work to the existing literature and answers the “what is new” question.
- A roadmap. A brief summary of what each section does in the paper. This concludes the introduction.
4.4.1 Components in an Introduction
Overview of the topic.
Begin with a general discussion of the area, narrowing down to your specific research focus. Emphasize the importance and timeliness of the problem.
Example: “Electronic health records provide rich opportunities for predictive modeling, but their high dimensionality and sparsity create statistical challenges.”
Existing work.
Briefly review relevant literature. Include both classical and recent works, highlighting what is known.
Example: “Traditional regression approaches often break down in high dimensions, and recent machine learning methods, while predictive, lack interpretability.”
The gap.
Identify what is missing or insufficient in the current body of work—this is the rationale for your study.
Example: “However, existing methods rarely exploit the hierarchical structure of medical codes, leaving a gap in clinically meaningful risk prediction.”
Contributions.
Clearly state what is new and how it advances the field. This is the thesis statement of your paper.
Example: “We propose a tree-guided feature selection method that integrates sparsity with clinical hierarchies to improve both interpretability and predictive power.”
Roadmap.
Provide a brief outline of the rest of the paper.
Example: “The remainder of this paper is organized as follows. Section 2 introduces the model formulation. Section 3 presents theoretical guarantees. Section 4 reports simulation results, and Section 5 applies the method to suicide risk prediction with EHR data.”
Exercise 4.4 Spot the Components of an Introduction
Legend:
Overview
Existing work
Gap / Limitation
Contribution
Roadmap
Introduction:
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Show answer
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the symbol positions of the input and output sequences… Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance.
The fundamental constraint of sequential computation, however, remains.
Attention mechanisms have become an integral part of compelling sequence modeling… In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
(Implicit roadmap: The paper goes on to describe the model, training setup, and evaluation, but here the “In this work we propose…” serves as both contribution and a forward pointer.)
4.4.2 Tips and Common Mistakes in Writing Introductions
4.4.2.1 Useful tips
- Use the first sentence of each paragraph to state the main idea.
Example 4.7 “High-dimensional genomic data pose unique challenges for clustering because of noise and sparsity…”
This first sentence signals the point of the paragraph; the following sentences can then elaborate with details and evidence.
Exercise 4.5 Take a paragraph from one of your drafts and rewrite the first sentence so that the main idea is clear at the start.
- Limit the scope of the background to what directly connects to your research problem.
Example 4.8 “Clustering has been studied by [1], [2], [3], [4], [5]…”
Better: “Several approaches have been proposed for clustering in high-dimensional data. Penalized methods [1,2] improve stability, while likelihood-based approaches [3–5] provide inference tools. Yet none are tailored to discrete genomic data.”
Instead of reviewing all clustering methods, narrow the discussion to those designed for genomic count data.
Exercise 4.6 Find two sentences in your current draft introduction that provide background but do not directly relate to your problem. Revise or cut them.
- Link the gap and contribution in one motion.
Example 4.9 “Existing sparse regression methods fail to use the disease hierarchy in EHR data. To close this gap, we develop a tree-guided feature aggregation method.”
Exercise 4.7 Write one sentence where you state a limitation in the literature and directly follow it with your contribution.
- Keep the roadmap short and functional. If it would be entirely generic, you may not need such a paragraph.
Example 4.10 Section 2 presents the method, Section 3 reports simulations, and Section 4 applies the model to EHR data.
Exercise 4.8 Write a one-sentence roadmap for a paper you have recently read.
4.4.2.2 Common mistakes
- Burying the main idea at the end of the paragraph.
Example 4.11 “Many methods have been proposed for sparse regression, ranging from penalization to Bayesian priors. Some perform well, others fail in practice… In this paper, we study tree-guided regression.”
Revised: “In this paper, we study tree-guided regression, motivated by challenges in sparse regression. We review existing penalized and Bayesian methods to highlight the limitations that our approach addresses.”
- Listing prior work without synthesis and connections to your own work.
Exercise 4.9 Turn a list of references from your notes into a synthesized, comparative statement.
- Overloading the introduction with method details.
Example 4.12 “Our method solves a convex optimization problem with ADMM iterations and gradient descent, initializing with ridge regression and adapting a warm start strategy, and then applies cross-validation to select the penalty parameter…”
Exercise 4.10 Take a method-heavy introduction paragraph you wrote and reduce it to a single motivating sentence.
- Failing to connect the gap and the contribution.
Exercise 4.11 Write two linked sentences: the first states a limitation, the second shows how your work addresses it.
- Using excessive mathematical notation instead of clear language.
- Over-exaggerating the contribution or being too superficial.
- Ignoring key references or cherry-picking the literature.
4.5 Data
The data section should provide all the details that are relevant for the research project.
- Who collected the data (source)?
- How was the data collected? Sampling frame? Sampling approach?
- What period or range does the data cover?
- Why does the data help answer the research question?
- What exploratory analyses are done (descriptives, visualization, etc.)?
4.5.1 Structure and Key Elements
The data section forms the empirical backbone of a research paper. It explains where the data come from, how they were collected, what they contain, and what insights can be drawn from preliminary exploration. The goal is to help readers evaluate the quality, appropriateness, and limitations of the data before interpreting any statistical results.
Data source and access Indicate who collected the data, when and where they were collected, and how access was obtained. Mention whether the data are publicly available, proprietary, or collected by the investigators.
Study design and population Describe how the analytic dataset was constructed, including inclusion and exclusion criteria, time period, and the unit of analysis. Specify whether the data arise from an experiment, a survey, or an observational study.
Variables and measurements Define outcomes, exposures, predictors, and covariates. Provide clear operational definitions and coding rules. Identify coding systems (for example, ICD, LOINC, RxNorm) and describe how composite variables were formed.
Data preprocessing Summarize the steps taken to prepare the data for analysis—such as filtering, merging, deduplication, variable transformation, and handling of missing data or outliers.
Preliminary data analysis Present descriptive and exploratory analyses that reveal the structure and features of the data. This stage helps to identify irregularities, guide model selection, and provide context for interpretation. Common elements include:
- Descriptive statistics: sample size, means, medians, proportions, standard deviations.
- Graphical summaries: histograms, boxplots, scatterplots, or correlation matrices.
- Checks for data quality: missingness patterns, implausible values, and consistency across related variables.
- Initial group comparisons or cross-tabulations to detect broad relationships or heterogeneity.
- Analysis to justify the motivation of the proposed methods.
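The descriptive checks above can be sketched in a few lines of code. This is a minimal, self-contained illustration using a hypothetical toy sample (the variable names and values are invented for demonstration, not taken from any study):

```python
import statistics

# Hypothetical toy sample with missing values recorded as None (illustrative only).
ages = [34, 51, None, 42, 29, None, 60, 47]

observed = [a for a in ages if a is not None]
n_missing = sum(a is None for a in ages)

# Basic descriptives plus a missingness check, the kind of summary that
# belongs in a short paragraph or a small table in the data section.
summary = {
    "n": len(ages),
    "n_missing": n_missing,
    "missing_rate": n_missing / len(ages),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "sd": statistics.stdev(observed),
}
print(summary)
```

In a real project the same logic would run over every key variable, with implausible values (e.g., negative ages) flagged alongside the missingness rates.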
Either in the data section or later in the discussion, one should address known biases, limitations, or measurement errors, and clarify to what population the findings can reasonably be generalized.
Tips
- Integrate descriptive results into short paragraphs rather than long, unconnected tables.
- Always check the internal consistency of key variables before proceeding to modeling.
- Use visualization to complement numerical summaries; readers grasp patterns more quickly in plots.
- Point out any notable trends, distributions, or anomalies that motivated analytic choices later.
- Keep preliminary analyses exploratory; do not perform formal hypothesis testing at this stage.
4.6 Methods
The methods section is the technical heart of a research paper. It presents the modeling framework, the inferential or computational strategy, and the assumptions that underlie the proposed approach. The goal is to make the methodology clear, logical, and reproducible, not necessarily to show every line of algebra, but to communicate the essential ideas and reasoning behind them.
- Establish notation.
- What are the observed data?
- What are the models?
- What are the parameters to be estimated?
- How are the point estimators obtained?
- How is the uncertainty (standard errors) of the point estimators assessed?
- How are the variances of the point estimators estimated?
- How are the null distributions of the test statistics established?
- Clearly state the assumptions and claims of theoretical results.
4.6.1 Major Components of a Method Section
Overview of the method
Begin with a short paragraph describing the main idea of the proposed method in plain language. This paragraph should be connected to the discussion in the previous sections.
Example 4.13 “Motivated by XXXX, we propose a penalized regression framework that integrates group structure among predictors to achieve both sparsity and interpretability.”
This sets the stage before introducing notation or equations.
Notation and data structure
Clearly define the observed data and notation before presenting formulas. Ambiguous notation confuses readers quickly.
Example 4.14
- Specify dimensions (e.g., an \(n \times p\) design matrix).
- Distinguish random variables, parameters, and fixed constants.
- Use consistent notation throughout the paper; avoid redefining symbols.
Model specification
State the model explicitly, whether it is probabilistic, deterministic, or algorithmic. Describe key assumptions about the data-generating process.
Example 4.15 Regression models, hierarchical models, latent variable models, or machine-learning formulations. Clarify what each component represents, e.g., \[ Y_i = X_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2). \] Briefly interpret each term in words.
Estimation procedure
Explain how the parameters are estimated.
- Likelihood-based: maximum likelihood or Bayesian inference.
- Regularized methods: penalized loss minimization.
- Nonparametric or algorithmic methods: iterative optimization or ensemble procedures.
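As a concrete instance of the first category, here is a minimal sketch of closed-form least-squares estimation for the simple linear regression model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\). The data are synthetic and the parameter values (intercept 2, slope 3) are invented for illustration:

```python
import random

random.seed(0)

# Synthetic data from y_i = 2 + 3 * x_i + eps_i (parameters chosen for illustration).
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 + 3 * xi + random.gauss(0, 0.5) for xi in x]

# Closed-form OLS: slope = S_xy / S_xx, intercept = ybar - slope * xbar.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
beta1 = sxy / sxx
beta0 = ybar - beta1 * xbar
print(beta0, beta1)  # should be close to (2, 3)
```

For maximum likelihood under Gaussian errors this coincides with least squares; penalized or algorithmic estimators replace the closed form with an optimization routine, but the writing advice is the same: state the objective being minimized before describing how it is solved.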
Inference and uncertainty quantification
Describe how to assess uncertainty around estimates, for example, standard errors, confidence intervals, or posterior credible intervals.
Explain how variance is estimated (analytically, by bootstrap, jackknife, or asymptotic approximation).
State how null distributions are established for test statistics or how p-values are computed.
These elements are considered essential for a statistics paper; even if developing inferential methods is not the focus, some discussion is still worthwhile.
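Of the variance-estimation strategies mentioned above, the nonparametric bootstrap is the easiest to sketch. The following toy example (synthetic data; the estimand is simply the population mean) compares a bootstrap standard error with the analytic formula \(s/\sqrt{n}\):

```python
import random
import statistics

random.seed(1)

# Synthetic sample; the estimand is the population mean (illustrative only).
sample = [random.gauss(10, 2) for _ in range(100)]
theta_hat = statistics.mean(sample)

# Nonparametric bootstrap: resample with replacement, re-estimate,
# and take the standard deviation of the bootstrap replicates.
B = 500
boot = []
for _ in range(B):
    resample = [random.choice(sample) for _ in range(len(sample))]
    boot.append(statistics.mean(resample))
se_boot = statistics.stdev(boot)

# Analytic standard error of the mean for comparison: s / sqrt(n).
se_analytic = statistics.stdev(sample) / len(sample) ** 0.5
print(se_boot, se_analytic)
```

The same recipe applies to estimators without a convenient analytic variance; the methods section should state the number of bootstrap replicates and how intervals are formed from the replicates.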
Theoretical properties
If the paper contains theory, state main assumptions and results clearly. Each theorem should be accompanied by a short intuitive explanation of what it means and why it matters.
Examples include consistency, oracle properties, convergence rates, or robustness guarantees. Avoid overly detailed proofs in the main text; rather, place them in the appendix if necessary.
It often makes sense to have a separate Theory section.
Computation and implementation
Derive and present the optimization/computation algorithms if applicable.
Discuss computational aspects such as algorithmic complexity, convergence criteria, and scalability.
If specialized software or code is used, mention it explicitly and reference open-source implementations when possible.
Include details about hyperparameter tuning, initialization, and stopping rules.
If space is limited, provide pseudocode instead; technical content should be presented in an appendix or supplement.
4.6.2 Tips for Writing the Method Section
- Start with intuition before equations. Readers should understand what problem the method solves and why before seeing how.
- Define notation once and stick to it. Overloaded symbols are one of the most common sources of confusion.
- Separate model, estimation, and inference. Organizing the section around these three themes improves clarity.
- Explain assumptions explicitly. Hidden assumptions (e.g., independence, homoscedasticity, exchangeability) should be spelled out and discussed.
- Connect to prior work. Briefly mention how your approach relates to or extends existing methods; this orients the reader.
- Avoid unnecessary math in the main text. Focus on ideas; detailed derivations belong in the appendix or supplement.
- Ensure reproducibility. Indicate how the procedure can be implemented (software, packages, or code snippets).
- Include a roadmap if the section is long. For example: “This section introduces the model in Section 3.1, the estimation algorithm in Section 3.2, and the asymptotic properties in Section 3.3.”
4.7 Simulation
Simulation studies are essential for understanding how a method behaves when the truth is known. They reveal operating characteristics that a single real dataset cannot: bias, variance, coverage, type I error, power, robustness to misspecification, sensitivity to tuning, and scalability. A clear simulation section follows the simple ADEMP logic (Morris, White, and Crowther 2019):
- Aims
- Data generating mechanism
- Estimand/target of analysis
- Methods
- Performance measures
4.7.1 Major Components of a Simulation Section
A clear simulation study follows a predictable structure. Organize your section with the items below so readers can understand, reproduce, and trust your results.
Simulation aims State what the simulation is designed to assess. Examples: small-sample bias and variance; variable-selection accuracy under correlation; calibration of uncertainty quantification; type I error and power under local alternatives; robustness to outliers or heavy tails; performance under distribution shift; runtime and memory growth with n and p.
Data-generating mechanisms (setups) Specify precisely how synthetic data are created so others can reproduce your study.
- Design grid: factors and levels (for example, sample size; dimension; signal-to-noise; sparsity; correlation; number of clusters; heterogeneity levels; missing rate).
- Models and distributions: exact formulas and parameter values.
- Estimands: the target(s) you want to learn about (parameter estimation, prediction, coverage, power, etc.).
- Practical details: number of replicates R, train/validation/test splits, cross-validation folds, preprocessing pipeline, tuning protocol. If you simulate missingness, state mechanism (MCAR/MAR/MNAR) and rates.
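A design grid and data-generating mechanism can be written down compactly in code, which is also a good way to fix the grid before running. The sketch below assumes a simple sparse linear model with five true signals; all factors, levels, and parameter values are hypothetical choices for illustration:

```python
import itertools
import random

# Hypothetical design grid: factors and levels are fixed before any runs.
grid = {
    "n": [50, 200],       # sample size
    "p": [10, 100],       # dimension
    "snr": [0.5, 2.0],    # signal strength
}
settings = [dict(zip(grid, levels)) for levels in itertools.product(*grid.values())]

def generate(setting, seed):
    """Draw one synthetic dataset from a simple sparse linear model."""
    rng = random.Random(seed)  # per-replicate seed for reproducibility
    n, p, snr = setting["n"], setting["p"], setting["snr"]
    beta = [snr] * 5 + [0.0] * (p - 5)  # 5 true signals, the rest are noise
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(b * xij for b, xij in zip(beta, xi)) + rng.gauss(0, 1) for xi in X]
    return X, y

print(len(settings))  # 2 * 2 * 2 = 8 design cells
```

Reporting the grid in exactly this form (factor, levels, formula, seed policy) makes the study reproducible without any guesswork.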
Competing methods List all methods compared, including straightforward baselines, standard references, and state-of-the-art methods from the literature.
- Tuning and implementation: for each method, how hyperparameters are selected (cross-validation, information criteria, theory-guided rules), and software versions/options used.
- Variants and ablations: include stripped versions of your method to show which components drive gains.
Evaluation metrics (performance measures)
- Estimation: bias, variance, MSE; support recovery (TPR, FPR, precision, recall, F1); selection stability across resamples.
- Prediction: test RMSE/MAE; AUROC and AUPRC (for imbalanced classification); specificity/sensitivity; positive predictive value (PPV).
- Inference: empirical coverage and average length of intervals; type I error; power under prespecified alternatives; FDR/FWER as appropriate.
- Robustness and efficiency: sensitivity to misspecification; runtime and memory; convergence diagnostics.
Report uncertainty across Monte Carlo replicates (for example, mean ± Monte Carlo standard error or distributional summaries).
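The loop that produces these summaries is short. Here is a minimal sketch, assuming the estimand is a population mean and the method is the sample mean with a normal 95% interval; it reports bias, MSE, empirical coverage, and the corresponding Monte Carlo standard errors:

```python
import random
import statistics

random.seed(2)

# Monte Carlo study of the sample mean as an estimator of the true mean.
true_mu, sigma, n, R = 5.0, 1.0, 50, 1000

estimates, covered = [], []
for _ in range(R):
    data = [random.gauss(true_mu, sigma) for _ in range(n)]
    mu_hat = statistics.mean(data)
    se_hat = statistics.stdev(data) / n ** 0.5
    estimates.append(mu_hat)
    # Does the normal 95% interval contain the truth in this replicate?
    covered.append(mu_hat - 1.96 * se_hat <= true_mu <= mu_hat + 1.96 * se_hat)

bias = statistics.mean(estimates) - true_mu
mse = statistics.mean((e - true_mu) ** 2 for e in estimates)
coverage = sum(covered) / R

# Monte Carlo SEs: bias -> sd(estimates)/sqrt(R); coverage -> sqrt(p(1-p)/R).
mc_se_bias = statistics.stdev(estimates) / R ** 0.5
mc_se_cov = (coverage * (1 - coverage) / R) ** 0.5
print(bias, mse, coverage, mc_se_bias, mc_se_cov)
```

Reporting, say, “coverage 0.94 (MC SE 0.007)” lets readers judge whether an apparent departure from the nominal 95% is real or Monte Carlo noise.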
Results presentation and interpretation
Figures and tables
- Prefer compact, readable visuals: line plots across a design grid, box/violin plots for variability across replicates, and heatmaps when two factors vary jointly.
- Always display uncertainty (error bars for means; boxplots).
- Use consistent axes, units, and significant digits across panels; label methods and settings directly on the plot or in a clear legend.
- Place long or dense tables in the supplement; keep main-text tables small and focused.
Narrative
- For each figure/table, provide crisp takeaways in plain language. Lead with a summary and a general pattern, then support with representative numbers.
- Explain any “unexpected” results.
- Avoid repeating every number; synthesize what the reader should conclude.
- Be objective.
4.7.1.1 Tips
- Align simulations with claims; include the regimes your method is intended to handle.
- Cover easy, moderate, and hard settings; avoid cherry-picking a single favorable point.
- Keep tuning fair and inside training folds; never tune on test data.
- Fix the design grid before running to reduce post hoc choices.
- Report Monte Carlo error; means alone can mislead.
- Prefer interpretable plots over large tables; move long tables to the supplement.
- Include ablations to justify why the method works.
- Document failures and edge cases; negative results increase credibility.
4.8 Application
- Report the statistical analyses in tables/figures.
- When summarizing from tables/figures, paint the big picture, rather than reiterating all of the little details.
- Discussions to link the analyses back to the substantive topic (Miller 2015):
Having presented the individual pieces of evidence, an investigator must summarize how that evidence, taken together, supports the conclusion of the investigation. Statisticians should explain how the statistical evidence answers the question posed at the beginning of the paper, following standard expository guidelines for writing an analytic essay.
4.9 Discussion and Conclusion
- A summary, again, of the contributions of the research.
- The research question posed as the ‘need’ of the introduction must be answered here (Zeiger 2000).
- Limitations of the current study.
- Future directions.
4.10 Appendix
- Technical details (e.g., proofs, algorithms) that would otherwise break the flow of the main text.
- Data source details.
4.11 Acknowledgements
This section is optional, but could be used to acknowledge certain individuals who have contributed to the research and/or success of the manuscript (e.g., peer reviewers).
In general, if the research upon which you are writing was funded, the funding agency and funding mechanism are typically included here unless otherwise specified.
4.12 References
- Every reference cited in the paper should appear here.
- References not cited should not appear here.
- All are automatically taken care of by BibTeX.
- Styles are controlled by the bibliography style (.bst file).
4.14 General Tips
From Jenkins (1995):
In order to maintain continuity between the key sections (introduction, methods, results and discussion) it is helpful to consider the manuscript as telling a story.
The strong parts to the story-line are the introduction and the discussion, so the link between these sections must be clear.
Devices such as paragraphing, headings, indentation, and enumeration help the reader see the major points that you want to make.
As a rule of thumb, if you type a full page (double spaced) without indenting for a new paragraph, you almost certainly have run one thought into another and have missed an opportunity to differentiate your ideas.
Any tables and figures included in the manuscript must be mentioned (referenced) within the main text.
If journal/instructions do not specify otherwise, tables and figures should be placed near (ideally after) the related text, and on the top of the page.
Use consistent notation throughout the manuscript, avoid defining any unnecessary notation, and avoid using the same notation to describe different things (variables, indices, etc.).
4.15 General Tips: Use of English
It is relatively easy to read and understand English that is well written. As the quality of writing deteriorates, however, it becomes progressively more difficult for the reader to understand the author’s intended meaning.
An obvious problem occurs when the author fails to use properly constructed sentences. This can often be corrected with revision and external review.
A more serious problem occurs when the author unconsciously assumes that the reader is able to follow an unwritten train of thought. The reasoning may be clear in the author’s mind, but not on the page. This can also usually be caught by careful revision and by asking others to read a draft.
More tips on the use of English, especially for statistics, machine learning, and data science papers:
Use of tenses. Use present tense for most of the manuscript (for example, when describing the content of the paper, the structure of the method, or general facts). Use past tense when describing events that occurred in the past (for example, data collection, an experiment that has been conducted, the design of a simulation). Be consistent with tense usage within sections, paragraphs, and related sentences.
Use of the word “significant”. Do not use the word “significant” in a vague or informal sense. Reserve it for a statistical meaning (for example, “statistically significant at the 5% level”). If you mean “large”, “important”, or “meaningful”, use those words instead.
Use of “data”. The word “data” is often mistakenly treated as a singular noun. It is, in fact, plural (its singular form is “datum”). In practice, both “the data are” and “the data set is” are acceptable; choose one style and use it consistently.
Avoid unnecessary jargon. Many terms in statistics and machine learning are not familiar to all readers. When first using a technical term, provide a short explanation in plain language. Do not overload sentences with many specialized terms.
Define acronyms and abbreviations. Spell out each acronym the first time it appears in the main text, followed by the abbreviation in parentheses, and then use the abbreviation afterwards. For example: “area under the receiver operating characteristic curve (AUC)”. Avoid long strings of acronyms in the same sentence.
Prefer clear, direct sentences. Long, nested sentences with many clauses and parentheses are hard to follow, especially when equations or technical terms are included. It is often better to split a long sentence into two or three shorter ones. As a rule of thumb, if a sentence spans more than three lines, consider breaking it.
Avoid vague qualifiers. Words such as “very”, “quite”, “rather”, or “somewhat” rarely add precision. Instead of “the method is very accurate”, write “the method achieves a prediction error 30% lower than the baseline”.
Distinguish speculation from evidence. Clearly separate what is supported by data from what is a conjecture or an open question. Phrases such as “our results suggest that”, “it is plausible that”, or “one possible explanation is” help signal the level of certainty.
Be careful with causal language. In many statistical and machine learning analyses, the data and design only support association statements. Avoid verbs such as “cause” or “lead to” unless the design justifies causal interpretation. Use “is associated with” or “is related to” when appropriate.
Maintain consistent terminology. Choose one term for each concept (for example, “predictor”, not alternating between “feature”, “variable”, and “covariate” without reason) and use it throughout. Changing terms can confuse readers and make the paper harder to follow.
Avoid informal or conversational expressions. Phrases such as “obviously”, “of course”, “clearly”, or “we just” can sound dismissive or may be incorrect for some readers. If a point is important, it is better to explain it carefully than to label it “obvious”.
These points may seem minor, but they have a large impact on how easily readers can understand and trust your work. Good English usage does not mean fancy vocabulary; it means clear, precise, and consistent language that serves the statistical ideas.