1 Introduction

What does a statistical or data science paper look like, and why should we study it? For many students and researchers, writing a paper is one of the most challenging yet most important parts of their training. A paper is more than just record of results; correctness is only the bottom line. A paper is the primary way for researchers to communicate ideas, establish credibility, and contribute to scientific literature.

Good writing and good outlet for dissemination make research visible, while poor writing and poor communication may jeopardize delay the acknowledgement of even the most important discoveries.

Example 1.1 Gregor Mendel and genetics

Mendel published his classic paper on inheritance in pea plants in 1866. The work was rigorous, statistically sound, and groundbreaking. Yet it was written in a dense, formal style, published in an obscure local journal, and largely ignored. For more than 30 years, the paper received almost no attention. It wasn’t until around 1900, when three different botanists independently “rediscovered” Mendel’s work, that the importance of his discoveries became widely acknowledged.

Example 1.2 Bayes’ Theorem (1763)

Thomas Bayes’ essay on inverse probability was published after his death, edited by Richard Price. The presentation was dense, lacked context, and was hard to follow. For nearly two centuries, the paper was little known outside a small circle.

The following is a direct quote from Fienberg (2006).

My story begins, of course, with the Reverend Thomas Bayes, a nonconformist English minister whose 1763 posthumously published paper, “An Essay Towards Solvinga Problem in the Doctrine of Chances,” contains what is arguably the first detailed description of the theorem from elementary probability theory now associated with his name. Bayes’ paper, which was submitted for publication by Richard Price, is remarkably opaque in its notation and details, and the absence of integral signs makes for difficult reading to those of us who have come to expect them.

It wasn’t until the 20th century, with Laplace’s reformulation and later Bayesian revival in the 1950s-60s, that Bayes’ original insight became central to statistics, and now, to machine learning and AI.

This is an example where even a foundational idea can be “hidden” for generations if not clearly communicated.

As with all scientific papers, it should have some commonly expected structures, including but not limited to title, abstract, keywords, introduction, methods, results, discussion, acknowledgements, references, appendix, and supplement. Due to the specificity of the statistical discipline, machine learning practices, and application domains, however, research papers can look very different from one another.

This book is motivated by the need to make these conventions explicit, to demystify the writing process, and to provide practical guidance through many examples and exercises. It is designed for students and researchers who wish to sharpen their writing, whether they are preparing a dissertation chapter, a methodological paper, or an applied paper.

Our overarching goal is to teach students how statistical and data science papers are structured, why they are written in such ways, and what we shall do to adapt writing for different audiences and outlets.


1.1 Types of Papers in Statistics and Data Science

1.1.1 Theory Paper

A theory paper in statistics is closest in form to a mathematical paper. It typically includes the statement of the problem, formulation of assumptions, and the presentation of theorems, lemmas, and proofs. The purpose is often to establish fundamental properties of certain statistical or probabilistic tools/approaches, including but well beyond consistency, efficiency, optimality, convergence, asymptotic distributions, and non-asymptotic error bounds.

While theory papers may not always feature data, simulations, or applications, they form mathematical and probabilistic foundations upon which methodology and applied work are built.

Most of the articles in journals such as Annals of Statistics, Annals of Probability, and Bernoulli, among others, are considered as theory papers. In other words, these journals should be the primary outlets for a theory paper.

Examples include Barber et al. (2021).

Download Barber et al. (2021).

1.1.2 Methodological Paper

A methodological paper focuses on proposing a novel methodological contribution that can be applied to a general class of real-world problems. A commonly seen structure is:

  • Introduction
  • Literature review
  • Methods (e.g., estimation, hypothesis tests, diagnostic procedures)
  • Theoretical properties
  • Simulation studies
  • Applications/Illustrations
  • Discussion/Conclusions

Such papers emphasis on methodological development, with a blend of theory (e.g., asymptotic or non-asymptotic guarantees) and empirical validations.

Such papers emphasize methodological development, often with a blend of theory (e.g., asymptotic or non-asymptotic guarantees) and empirical validations. Astrong methodological paper should not be just a mechanical combination or minor extension of existing methods. Instead, it should be driven by a clear motivation from real-world applications.

In other words, the most impactful methodological contributions are those that connect with practical relevance. They are inspired by genuine applied needs, but provide solutions that are general enough to influence future work in an important domain or even across different domains.

Most of the articles in journals such as Journal of the American Statistical Association (JASA), Biometrika, Journal of Computational and Graphical Statistics (JCGS), and Journal of the Royal Statistical Society, Series B (JRSSB), among others, are considered methodological papers. In other words, these journals are among the primary outlets for presenting new statistical methods that combine theoretical development with empirical validation through simulations and applications.

Example include Li et al. (2023), Lau and Yan (2022), and K. Chen, Chan, and Stenseth (2012).

Download K. Chen, Chan, and Stenseth (2012).

1.1.3 Applied Paper

An applied paper focuses on addressing a concrete scientific question in a particular domain using statistical or data science methods. Its structure often includes:

  • Introduction
  • Data description
  • Methods (applied or adapted)
  • Results
  • Discussion

These applied papers may involve applying existing methods or developing new ones motivated by a specific scientific or practical problem. Sometimes, simulation studies are added to assess robustness or sensitivity.

In practice, the distinction between a methodological paper and an applied paper is often blurry, and rightly so. From our perspective, an applied paper in statistics can still be regarded as a methodological paper, but one where the motivation and primary contribution are clearly tied to addressing real-world problems.

Most of the articles in journals such as The Annals of Applied Statistics(AoAS), Biostatistics, Biometrics, Journal of the Royal Statistical Society, Series C (Applied Statistics), and Statistics in Medicine, among others, are considered such applied papers. In other words, these journals are primary outlets for presenting statistical applications that address concrete scientific or medical problems, often integrating existing methods, developing adaptations, and demonstrating their impact in real-world data analyses.

It should also be noted that what we call “applied papers” here also includes a large portion of scientific papers that rely on data analytics, statistics, and machine learning methods. In many scientific domains, these papers may in fact be considered theoretical or methodological contributions within that field. For instance, an applied paper with genomical data analysis could be regarded as a methodological paper in genetics. Such works are most often interdisciplinary, typically resulting from close collaborations between statisticians, data scientists, and domain experts. They showcase how quantitative methods advance science in other fields, while also motivating the development of new techniques within statistics and machine learning.

Examples of applied papers include Price and Yan (2022), Caplan et al. (2019), Jiao et al. (2022), and Johnson et al. (2021).

Download Johnson et al. (2021).


1.1.4 Data Science and Machine Learning Papers

Beyond traditional statistics journals, researchers often publish in data science, machine learning, and data mining journals and conferences, including outlets such as NeurIPS, ICML, KDD, and AAAI. These conference papers emphasize:

  • Conciseness (strict page limits)
  • Algorithmic novelty
  • Benchmark comparisons on standard datasets
  • Open-source reproducibility
  • Clarity for an interdisciplinary readership

Compared to JASA or Annals of Statistics, these outlets often prioritize empirical performance and novelty over extensive theoretical justifications or methodological development, though many still include core mathematical or probabilistical analysis. This emphasis partly reflects the nature of the research topics: for many cutting-edge machine learning and AI methods, their theoretical understanding is still evolving and often lags behind practice. As a result, papers are judged primarily on their ability to demonstrate empirical advances on benchmark datasets or practical applications.

Examples include Vaswani et al. (2017b), and He et al. (2018).

Download He et al. (2018).


1.1.5 Review Papers, Opinion Pieces, and Perspectives

In addition to theory, methodological, and applied research articles, review papers and opinion pieces play an increasingly important role in statistics and data science. These contributions do not always present new methods or data analyses, but they synthesize, interpret, and reflect on existing work, shaping how the field understands itself and where it is headed.

Review papers (or survey articles) aim to provide a structured overview of a research area. Strong review papers go beyond summarizing prior work: they evaluate the strengths and limitations of existing approaches, identify common themes, and highlight unresolved challenges. They serve as intellectual roadmaps for both newcomers and experts, helping readers quickly grasp the state of the art. Classic outlets for review papers include Statistical Science, Annual Review of Statistics and Its Application, and International Statistical Review.

Opinion pieces and perspectives provide expert commentary, critique, or vision. These papers may challenge conventional wisdom, advocate for new practices, or highlight broader impacts of statistics and data science on society. They often aim to spark discussion, reframe debates, and encourage innovation. Examples of venues that emphasize perspectives and commentary include Harvard Data Science Review (HDSR), which blends reviews, perspectives, columns, and opinion pieces.

Together, review papers and opinion pieces are essential for the vitality of the discipline. They consolidate knowledge, guide future research, and provide reflective perspectives that extend beyond technical details.

Check Editor’s Column at HDSR.


1.2 Scientific Writing Resources

Many resources on scientific writing are available. For example, Gopen and Swan (1990) was selected by its publisher, American Scientist, as one of its 36 “Classic Articles” from the first 100 years of publishing history (link). Some tips offered by Gopen and Swan (1990) are summerized below:

  1. Write with the reader’s expectations in mind.
    Readers interpret prose not just by its content, but through structural cues. Clarity depends on aligning writing structure with how readers naturally process text.

  2. Use the “topic” and “stress” positions effectively.

    • Topic position (sentence beginning): establish context and link to what came before.
    • Stress position (sentence ending): deliver the most important or new information.
  3. Keep the subject and verb close together. Avoid long interruptions between subject and verb. Burying the verb makes the sentence harder to follow.

  4. Maintain logical flow: old → new, context → conclusion.
    Start sentences with familiar information and end with new ideas. This ordering helps readers build meaning smoothly.

  5. Match structure with emphasis.
    Important information should appear in prominent structural positions (topic or stress). If structure and substance are misaligned, readers may miss your main point.

Popular writing books include Oshima and Hogue (2000), Gopen (2004), Hairston and Keene (2003), and Lebrun and Lebrun (2021).


1.3 About This Book

This book, Research Writing in Statistics and Data Science, extends earlier efforts from Statistical Writing, by broadening its scope and including many hands-on examples and exercises.

The book emphasizes three guiding principles:

  1. Clarity: communicating complex technical ideas to both expert and interdisciplinary audiences.

  2. Adaptivity: tailoring style and structure to journals, conferences, and other outlets.

  3. Learning by Doing: the book includes examples, annotated excerpts, and practical exercises.


1.4 Outline of the Book

The planned content includes:

  • Introduction (this chapter): Types of research papers — theory, methods, and applied — and the conventions of scientific writing across statistics, data science, and machine learning.
  • Tools and Workflows for Writing: Version control (Git/GitHub), LaTeX, R Markdown/Bookdown, and reproducibility practices.
  • Getting Started with Writing: The research and writing lifecycle, writing proposals, and planning strategies.
  • Writing Specific Sections: Titles, abstracts, introductions, methods, results, discussion, figures, tables, captions, and style.
  • Writing for Different Outlets: How to adapt writing for different audiences and venues — e.g., writing statistical analysis sections in interdisciplinary papers, preparing concise ML/AI conference papers, and meeting expectations in statistics journals.
  • Peer Review and Revision: How to review manuscripts constructively, write referee reports, and respond to reviewers effectively.
  • Grant Proposals and Research Statements: Writing for funding agencies, fellowships, and academic job applications.
  • Ethics and Integrity: Authorship, plagiarism, collaboration, and responsible use of AI tools.
  • Exercises and Practice: Rewriting, reviewing, and polishing tasks, designed to simulate authentic writing and reviewing experiences.

Each chapter will include examples drawn from real papers across statistics, biostatistics, and machine learning and will feature exercises that encourage hands-on practice and reflection.


1.5 Before We Start

An author should always keep the target audience in mind. Statistical journals span a wide spectrum from applied to theoretical. Machine learning venues differ again in expectations. Even technical reports, white papers, or grant proposals have unique readerships.

Regardless of outlet, any scientific writing should be convincing and concise. Authors need to show clearly that their work is important, valid, and useful.

Ultimately, strong writing can never substitute for strong research. In short, keep in mind that a good paper must rest on good work. Just as important, a good paper is made through revision.

Or, using one phrase to put it: Good writing is rewriting (好文章是改出来的).

You can view an example of a marked-up draft with comments from Dr. Kun Chen’s Ph.D. advisor:
Download a marked-up draft (PDF) of K. Chen, Chan, and Stenseth (2012).


References

Barber, Rina Foygel, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. 2021. Predictive inference with the jackknife+.” The Annals of Statistics 49 (1): 486–507. https://doi.org/10.1214/20-AOS1965.
Caplan, D J, Y Li, W Wang, S Kang, L Marchini, HJ Cowen, and J Yan. 2019. “Dental Restoration Longevity Among Geriatric and Special Needs Patients.” JDR Clinical & Translational Research 4 (1): 41–48. https://doi.org/10.1177/2380084418799083.
Chen, K., K.-S. Chan, and N. C. Stenseth. 2012. “Reduced Rank Stochastic Regression with a Sparse Singular Value Decomposition.” Journal of the Royal Statistical Society: Series B 74 (2): 203–21.
Fienberg, Stephen E. 2006. “When Did Bayesian Inference Become ‘Bayesian’?” Bayesian Analysis 1 (1): 1–37. https://doi.org/10.1214/06-BA101.
Gopen, George D. 2004. Expectations: Teaching Writing from the Reader’s Perspective. Pearson.
Gopen, George D, and Judith A Swan. 1990. “The Science of Scientific Writing.” American Scientist 78 (6): 550–58.
Hairston, Maxine, and Michael L Keene. 2003. Successful Writing. 5th ed. W. W. Norton & Company.
He, Lifang, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. 2018. “Boosted Sparse and Low-Rank Tensor Regression.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/8d34201a5b85900908db6cae92723617-Paper.pdf.
Jiao, J., Z. Tang, M. Yue, P. Zhang, and J. Yan. 2022. “Cyberattack-Resilient Load Forecasting with Adaptive Robust Regression.” International Journal of Forecasting 38 (3): 910–19. https://doi.org/10.1016/j.ijforecast.2021.06.009.
Johnson, Blair T., Anthony Sisti, Mary Bernstein, Kun Chen, Emily A. Hennessy, Rebecca L. Acabchuk, and Michaela Matos. 2021. “Community-Level Factors and Incidence of Gun Violence in the United States, 2014–2017.” Social Science & Medicine 280: 113969. https://doi.org/https://doi.org/10.1016/j.socscimed.2021.113969.
Lau, Abby Y. Z., and Jun Yan. 2022. “Bias Analysis of Generalized Estimating Equations Under Measurement Error and Practical Bias Correction.” Stat 11 (1): e418. https://doi.org/10.1002/sta4.418.
Lebrun, Jean-Luc, and Justin Lebrun. 2021. Scientific Writing 3.0: A Reader and Writer’s Guide. World Scientific.
Li, Yan, Kun Chen, Jun Yan, and Xuebin Zhang. 2023. “Regularized Fingerprinting in Detection and Attribution of Climate Change with Weight Matrix Optimizing the Efficiency in Scaling Factor Estimation.” Annals of Applied Statistics 17 (1): 225–39. https://doi.org/10.1214/22-AOAS1624.
Oshima, Alex, and Ann Hogue. 2000. Writing Academic English. Longman.
Price, Michael, and Jun Yan. 2022. “The Effects of the NBA COVID Bubble on the NBA Playoffs: A Case Study for Home-Court Advantage.” American Journal of Undergraduate Research 18(4): 3–15. https://doi.org/10.33697/ajur.2022.051.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.