3 Cycle of Statistical Research

3.1 Full Cycle of a Statistical Research Project

Like any research project, a typical statistical research project starts with an interesting idea and then goes through a full cycle: brainstorming, planning with the right design and methods, investigating through theoretical analysis, simulation studies, and data analyses, fixing and iterating as necessary, and finally summarizing through writing and revision.

3.1.1 Find a Topic of Interest

This is probably the hardest part.

The topic should be meaningful, interesting to you, and something you already know at least somewhat or can learn more about. It should also be within the capacity allowed by your skill set and time. A good first step is to review the literature to see what has already been done on the topic or on closely related problems. In short, select a project that excites you but is also realistic and doable.

You can get ideas for research by

  • attending seminars, conferences, and workshops; more and more resources are available online (e.g., a very recent workshop on Foundation Models and Their Biomedical Applications: Bridging the Gap);

  • reading reputable and relevant journals (e.g., those mentioned in Chapter 1.1), books, magazines (e.g., Chance; Significance), and newspapers;

  • consulting activities with practitioners who face real-world problems;

  • collaborating with colleagues in statistics, data science, or applied fields;

  • exploring datasets (public repositories, Kaggle competitions, government/health databases) that reveal new questions;

  • revisiting your own course work, previous projects, and even class assignments that raised unanswered questions.

Example 3.1 The Double Descent Phenomenon

A striking example of how new research ideas can emerge is the discovery of the double descent property of mean squared error (MSE) in modern machine learning.

Traditionally, statistical learning theory emphasized the bias–variance tradeoff:
- As model complexity increases, bias decreases but variance increases, leading to a U-shaped curve for prediction error (MSE).
- The optimal point balances bias and variance at some intermediate model complexity; the decomposition below makes this tradeoff precise.
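For a test point $x_0$ with outcome $y_0 = f(x_0) + \varepsilon$ and noise variance $\operatorname{Var}(\varepsilon) = \sigma^2$, the expected prediction error of a fitted model $\hat{f}$ decomposes as

$$
\mathbb{E}\bigl[(y_0 - \hat{f}(x_0))^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\bigr)^2}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x_0)\bigr)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}.
$$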

However, empirical studies in modern high-dimensional settings (e.g., deep learning, regularized high-dimensional regression) revealed something unexpected: as model complexity increases past some interpolation threshold (where the model can perfectly fit the training data), the test error often drops again. This leads to a double descent curve, i.e., first decreasing (classical regime), then increasing (overfitting regime), and then decreasing again (over-parameterized regime). Key references include Belkin et al. (2019), Hastie et al. (2022), and Schaeffer et al. (2024).
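The phenomenon is easy to reproduce in a toy setting. The following is a minimal sketch of my own (an illustration, not a reconstruction of the experiments in the cited papers): a linear model with 200 true features is fit using only its first p features by least squares, where `np.linalg.lstsq` returns the minimum-norm solution once p exceeds the number of training observations. The test MSE typically spikes near the interpolation threshold p ≈ n and declines again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_max = 50, 1000, 200
X = rng.normal(size=(n_train + n_test, d_max))
beta = rng.normal(size=d_max) / np.sqrt(d_max)   # dense true coefficients
y = X @ beta + 0.5 * rng.normal(size=n_train + n_test)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

for p in (5, 10, 25, 45, 50, 55, 75, 100, 200):
    # Fit using only the first p features; lstsq returns the minimum-norm
    # least-squares solution once p > n_train (the interpolation threshold).
    coef = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)[0]
    test_mse = np.mean((X_te[:, :p] @ coef - y_te) ** 2)
    print(f"p = {p:3d}: test MSE = {test_mse:.3f}")
```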

The double descent phenomenon challenges classical wisdom about overfitting and suggests that in highly over-parameterized models, effective generalization can still occur. It has reshaped thinking about why modern machine learning methods (e.g., deep neural networks) perform well despite operating in regimes far beyond the number of observations.

In my own research on high-dimensional statistics, I encountered behaviors resembling double descent many times. At the time, I simply attributed them to data instability and moved on. This was a missed opportunity to recognize and study the deep principle underlying them. It has been a reminder to me that unexpected patterns in data analysis can point to important new research directions.

For statisticians and data scientists, research should often be both problem-driven and data-driven, with methodological development tightly linked to real scientific or societal needs.

  • Problem-driven approach: Start from a real scientific, engineering, business, or societal problem that genuinely matters. The problem motivates the research questions, and data/methods are then applied or developed to address them. This approach ensures relevance and often leads to impactful interdisciplinary collaborations.

  • Data-driven approach: Start from an available dataset and explore it to uncover new questions or methodological challenges. Creativity is needed here to do something novel and meaningful with familiar or widely used data. This kind of research often serves the purpose of hypothesis generation.

  • Idea-driven approach: Begin with a methodological or theoretical idea or problem, and then look for ways to justify it or for data to test and illustrate it. This can lead to elegant theory, but it may also produce methods of limited practical relevance.

Ultimately, impactful statistical research does not come from problems, ideas, or data alone, but from their dynamic interplay. As Professor Xiao-Li Meng once emphasized, Highly Principled Data Science requires methodologies that are

  1. Scientifically justified (rooted in real problems),
  2. Statistically principled (built on sound inference), and
  3. Computationally efficient (scalable and implementable).

Keeping these principles in mind ensures that your project is not only feasible but also both rigorous and relevant.

Example 3.2 Real-world problem as a driver of applied, methodological, and theoretical research

Consider the broad societal challenge of suicide prevention. From this single motivating problem, researchers can pursue projects across the spectrum, from applied data analyses to methodological development and theoretical investigation. Publicly available datasets such as MIMIC-IV and All of Us provide fertile ground for formulating such projects, alongside restricted-access clinical datasets from health systems or the Department of Veterans Affairs (VA).

  1. Applied research. Building on real-world clinical data, applied projects aim to develop predictive models for suicide risk. For example, one line of research has leveraged electronic health records to construct risk prediction tools that can help clinicians identify individuals at elevated risk of suicide (Sacco et al. 2025).

  2. Methodological research. Predictive modeling for suicide often uses rare features (e.g., infrequent diagnosis/procedure codes) with strong hierarchies (e.g., ICD trees). A methodological response is J. Chen et al. (2024), which uses the known disease-code hierarchy to encourage structured selection and to aggregate rare binary features via logical OR operations (see the toy sketch after this list). This approach improves predictive performance, stability, and interpretability in EHR applications by borrowing strength across related codes while retaining clinically meaningful groupings.

  3. Theoretical research. At a more abstract level, our work on developing suicide risk models for diverse clinical settings has motivated many questions on transfer learning and robust data fusion. One such question is how to safely leverage external datasets (which may be heterogeneous or even contaminated) to improve a target model. A recent theoretical contribution studies subsampling-based fusion strategies, including target-guided and leverage-based random (variance-reducing) sampling (Wang, Wang, and Chen 2025).
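To make the logical-OR aggregation in item 2 concrete, here is a purely illustrative toy sketch. It hard-codes a single aggregation; the actual method of J. Chen et al. (2024) selects which nodes of the code hierarchy to aggregate in a data-driven, regularized way.

```python
import numpy as np

# Toy binary EHR matrix: rows = patients, columns = three rare child codes
# assumed (for illustration only) to share one parent in an ICD-style tree.
X_children = np.array([[0, 1, 0],
                       [0, 0, 0],
                       [1, 0, 1]])

# Collapse the children into a single parent-level indicator via logical OR:
# the aggregated feature fires if a patient carries any of the child codes.
parent = X_children.any(axis=1).astype(int)
print(parent)  # [1 0 1]
```

Aggregating in this way borrows strength across rare codes: each child column may be too sparse to support a stable effect estimate, while the parent-level indicator is better populated yet still clinically interpretable.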

This example shows that statistical research often begins with a real-world problem but can evolve into contributions across the spectrum of applied, methodological, and theoretical work.

3.1.2 Initial Planning

Once you have identified a promising topic, the next step is to transform a broad idea into a concrete research plan. This involves clarifying your research questions, understanding what data will be needed (or available), and identifying the appropriate methods. Careful planning at this stage will save time later and help avoid common pitfalls.

  1. Clarify the research questions
  • Narrow the topic to a few specific questions.
  • State the research goals in precise terms (e.g., estimating a causal effect, developing a predictive model, establishing a theoretical property, etc.).
  • Frame testable hypotheses.
  2. Design the study
  • Decide whether the study will be theoretical, methodological, applied, or a blend.
  • For empirical studies, determine the study design (observational vs. experimental, cross-sectional vs. longitudinal).
  • Consider reproducibility and transparency at the design stage. Make sure to document every step so that others can follow.
  3. Identify and obtain data
  • Decide whether you will collect new data or use existing sources.
  • Evaluate whether available data are sufficient in size, quality, and relevance.
  • Anticipate potential data challenges (access restrictions, missingness, bias, etc.).
  4. Select appropriate methods
  • Choose statistical or computational tools that match both the problem and the data.
  • Consider multiple candidate approaches.
  • Think about the balance between rigor and feasibility (e.g., a theoretically ideal method vs. one that is conceptually simple and computationally practical).
  5. Outline the research and writing plan
  • Draft a tentative structure for the project (introduction, methods, results, discussion).
  • Use the outline to guide both the analyses and the writing.

Please note that the goal of initial planning is not to lock yourself into a fixed path; that would be impossible, as every project is different. Rather, it is about paving a roadmap that helps you stay focused and track progress while leaving room for changes.

Example 3.3 Using publicly available clinical data for research

Suppose you are broadly interested in suicide prevention and risk prediction. Rather than starting from scratch, you explore large, publicly available datasets that include rich health information:

  • MIMIC-IV (Medical Information Mart for Intensive Care):
    An openly accessible critical care database developed by MIT that contains de-identified health data from tens of thousands of ICU admissions.
  • All of Us Research Program:
    A large NIH initiative to collect data from one million or more people across the U.S., including surveys, EHRs, physical measurements, and genomics.

Starting from the broad interest in predicting suicide risk with electronic health records, you refine the question to something more concrete and feasible:

Can routinely collected EHR variables (such as demographics, prior psychiatric diagnoses, comorbid medical conditions, and medication history) from All of Us participants be used to develop a suicide risk prediction model, and how does it compare to models built using only demographic and diagnostic data?
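As a hypothetical sketch of the comparison step, the following uses synthetic stand-in data (All of Us records require approved access) to contrast a model using demographics and diagnoses only against one that also includes medication history. All variable names and effect sizes here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
demo = rng.normal(size=(n, 3))             # stand-ins for demographics
diag = rng.binomial(1, 0.10, size=(n, 5))  # stand-ins for prior diagnoses
meds = rng.binomial(1, 0.20, size=(n, 4))  # stand-ins for medication history
logit = demo[:, 0] + 1.5 * diag[:, 0] + 1.2 * meds[:, 0] - 3.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))  # synthetic outcome

for name, X in [("demo + diag", np.hstack([demo, diag])),
                ("demo + diag + meds", np.hstack([demo, diag, meds]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    fit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, fit.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```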

This illustrates how a general interest (suicide prevention) can be refined into a data-supported, clinically relevant, and methodologically rich project by leveraging open-access resources.

Example 3.4 Clarifying research questions

  1. Vague question: “How do social factors affect health?”
  Specific question:
  • “Does neighborhood-level income inequality predict differences in hypertension rates after adjusting for age and sex?”

  2. Vague question: “Can machine learning improve disease prediction?”
  Specific question:
  • “Among adults with electronic health record data, does a random forest model improve 5-year diabetes risk prediction compared to logistic regression with standard risk factors?”

  3. Vague question: “How can we make valid statistical inference in high dimensions?”
  Specific question:
  • “Under what conditions on sparsity and sample size does the Lasso estimator (defined below) achieve consistent variable selection in linear regression?”
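For reference, the Lasso estimator in the third question is

$$
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta \in \mathbb{R}^p}
    \left\{ \frac{1}{2n} \lVert y - X\beta \rVert_2^2
            + \lambda \lVert \beta \rVert_1 \right\},
$$

and consistent variable selection means $\Pr\{\operatorname{supp}(\hat{\beta}^{\text{lasso}}) = \operatorname{supp}(\beta^*)\} \to 1$ as $n \to \infty$, where $\beta^*$ is the true coefficient vector.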

3.1.3 Statistical Investigation

This stage is where the actual research work happens.

+ Perform data analysis.
+ Summarize substantive findings.
+ Investigate the performance of proposed methods via simulations or theoretical studies (a minimal simulation skeleton is sketched below).
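To make the simulation step concrete, here is a minimal skeleton under a deliberately simple assumption: the “method” being evaluated is just the sample mean of a normal sample, so its empirical bias and standard error can be checked against theory.

```python
import numpy as np

rng = np.random.default_rng(2024)
mu, sigma, n, n_reps = 1.0, 2.0, 100, 1000

estimates = np.empty(n_reps)
for r in range(n_reps):
    sample = rng.normal(loc=mu, scale=sigma, size=n)
    estimates[r] = sample.mean()            # the "method" under study

print(f"bias           = {estimates.mean() - mu:+.4f}")  # should be near 0
print(f"empirical SE   = {estimates.std(ddof=1):.4f}")
print(f"theoretical SE = {sigma / np.sqrt(n):.4f}")      # sigma/sqrt(n) = 0.2
```

Replacing the sample mean with a proposed estimator, and the single setting with a grid of (sample size, signal, noise) configurations, turns this skeleton into a full simulation study.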

3.1.4 Iterate between Planning and Investigation

Research is never a straight line. Iteration between planning and investigation is very natural, and usually necessary, for arriving at results that are both valid and meaningful.

  • Unexpected results. Data analysis might reveal problems you did not anticipate: heavy missingness, unexpected correlations, or measurement error.

  • Simulation surprises. Your proposed method may fail in certain scenarios or underperform compared to existing benchmarks. This is not failure; rather, treat it as a guide to refine assumptions, modify algorithms, or rethink evaluation criteria.

Importantly, do not just bring raw output to your advisor without examining it first. If something looks clearly off, for example, excessively large standard deviations of a performance measure or inconsistent patterns across settings, you should pause, diagnose, and think about the implications.
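As a hypothetical example of such a pre-meeting check, the snippet below flags any simulation setting whose Monte Carlo standard error is large relative to the estimated performance measure itself; the numbers are toy values, not from a real study.

```python
import numpy as np

results = {  # setting -> replicate MSEs (toy numbers for illustration)
    "n=100, p=10":  np.array([0.11, 0.10, 0.12, 0.09]),
    "n=100, p=500": np.array([0.20, 4.70, 0.25, 0.18]),  # one wild replicate
}
for setting, mses in results.items():
    mc_se = mses.std(ddof=1) / np.sqrt(len(mses))  # Monte Carlo std. error
    flag = "  <-- diagnose before reporting" if mc_se > 0.5 * mses.mean() else ""
    print(f"{setting}: mean MSE = {mses.mean():.2f}, MC SE = {mc_se:.2f}{flag}")
```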

A good investigation is not only about producing numbers but also about interpreting whether they make sense. We all learn from mistakes, as the Chinese saying goes, 吃一堑, 长一智 (“a setback is a lesson learned”).

3.1.5 Write, Revise, and Proofread

Writing and revising are two separate processes.

When writing the first draft:

+ Start with an outline for each section that includes major headings, sub-headings, and paragraphs covering different points.
+ The goal at this stage is to get the main points and ideas captured in a document, so it does not matter yet if sentences are incomplete or the grammar is imperfect.

When revising, check the substance first:

+ Is the problem clearly stated?
+ Are the statistical statements correct?
+ Are the data displays informative?
+ Are the conclusions based on sound evidence?
+ Are the style and tone appropriate for the venue?

Then polish the presentation:

+ Check organization: reorganize paragraphs and add transitions where necessary.
+ Work on sentences: check spelling, punctuation, word choice, tense, etc.
+ Make sure all researched information is documented (reproducibility).
+ Rework the introduction and conclusion.
+ Read out loud to check for flow.
+ Find a friend to review.

There are resources online that could be helpful with writing and revising.

Exercise 3.1 From idea to action

Suppose you want to study the effects of social media use on college students’ academic performance.

  1. Frame three possible specific research questions.
  2. Identify what types of data you would need.
  3. Suggest one study design (observational or experimental) and justify it.
  4. Propose at least two statistical methods you could consider, and discuss trade-offs.

3.2 Writing a Research Proposal

A research proposal is a piece of writing that details exactly what you plan to do in a research project.

The following components are expected in your proposal.

  • Introduction: Introducing the topic and why you have chosen this topic (3–5 lines). Mention briefly the current related research and cite relevant works.

    • Why should we care?
    • What has been done?
    • What is new?
  • Specific aims: Formulate a research question or hypothesis within the chosen topic. Describe briefly why you selected this question or hypothesis and its importance in the field (cite sources).

    • Why is it hard/interesting/unstudied?
    • How hard/interesting is it?
  • Data description: Describe your data set (for instance: sampling scheme, number of observations, number of variables, variables of interest, nature of the variables) and the source of your data set if it is not collected by yourself.

  • Research design/methods/schedule: Describe briefly (5–7 lines) your plan of action.

    • Why are the methods appropriate (with proper references) for the problem?
    • What steps are required to use the methods?
    • Which of the steps will be particularly hard?
    • What would you do if the hardest steps do not work as planned?
    • How would the methods help in investigating the problem at hand?
  • Discussion: potential problems and solutions.

    • What do you expect to find and why do you feel so?
    • Any ways your work can corroborate or challenge existing results or assumptions?
    • What are the potential impacts of your work?
    • What if the results of your investigation are not what you expected?
  • Conclusion (optional): Wrap it up by briefly summarizing your research proposal and reinforcing your research’s stated purpose.


References

Belkin, Mikhail, Daniel Hsu, Siyuan Ma, and Soumik Mandal. 2019. “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off.” Proceedings of the National Academy of Sciences 116 (32): 15849–54. https://doi.org/10.1073/pnas.1903070116.
Chen, Jianmin, Robert H. Aseltine, Fei Wang, and Kun Chen. 2024. “Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data.” Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2024.2326621.
Hastie, Trevor, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. 2022. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation.” The Annals of Statistics 50 (2): 949–86. https://doi.org/10.1214/21-AOS2133.
Sacco, Shane J., Kun Chen, Jun Jin, Boyang Tang, Fei Wang, and Robert H. Aseltine. 2025. “Identifying Patients at Risk of Suicide Using Data from Health Information Exchanges.” BMC Public Health 25 (1): 1582. https://doi.org/10.1186/s12889-025-22752-x.
Schaeffer, Rylan, Zachary Robertson, Akhilan Boopathy, Mikail Khona, Kateryna Pistunova, Jason William Rocks, Ila R Fiete, Andrey Gromov, and Sanmi Koyejo. 2024. “Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle.” In The Third Blogpost Track at ICLR 2024. https://openreview.net/forum?id=muC7uLvGHr.
Wang, Jing, HaiYing Wang, and Kun Chen. 2025. “Robust Data Fusion via Subsampling.” https://arxiv.org/abs/2508.12048.