Carrying out streamlined routine data analyses with reports for observational studies: introduction to a series of generic SAS ® macros

For a typical medical research project based on observational data, sequential routine analyses are often essential to comprehend the data on hand and to draw valid conclusions. However, generating reports in SAS ® for routine analyses can be a time-consuming and tedious process, especially when dealing with large databases with a massive number of variables in an iterative and collaborative research environment. In this work, we present a general workflow for research based on an observational database and a series of SAS ® macros that fits this framework, covers streamlined data analyses, and produces journal-quality summary tables. The system is generic enough to fit a variety of research projects and enables researchers to build highly organized and concise code for quick updates as research evolves. The resulting reports promote communication among collaborators and support the research with ease and efficiency.


Introduction
The increasing availability of large-scale medical registry databases (e.g. SEER 1 , NCDB 2 ), health insurance claim databases (e.g. the US Food and Drug Administration's Sentinel Initiative 3 , MarketScan Research Database 4 ), electronic medical record (EMR) databases, or secondary data from clinical trials provides opportunities for researchers and policymakers to address a variety of clinical practice questions and make informed decisions. A retrospective or observational study based on such data allows researchers to examine medical care in a real-life setting and, if carefully done, to generalize results to an extended population and clinical setting. With a large pool of patients, longer follow-up periods, and an affordable cost, such studies can address broader research questions with deeper insights. Such study designs also hold inherent limitations, such as selection bias (e.g., certain groups of patients are more likely to access a certain therapy) and confounding (e.g., the observed treatment effect might mix with the effects of other important prognostic factors that are imbalanced among treatment arms). It is believed that through thoughtful design, careful analysis, accurate interpretation, and transparent reporting, a sound scientific conclusion can be reached with minimized limitations 5-7 . However, even when well equipped with the concepts of good research practice 8-12 , a researcher holding a promising hypothesis with access to an excellent data source may face many challenges. These may include a lack of understanding of the full extent of the massive data and its feasibility for answering the study question(s); the complexity of the data on hand; the need for tediously repetitive and time-consuming data processing; a lack of transparency in data processing and reporting; or miscommunication among collaborators with mixed levels of experience and expertise. The main motivation of this work is to illustrate a generic research and analytic framework for studies based on an observational database, to emphasize the importance of routine data analysis, and to introduce a series of SAS ® macros 13 designed to aid the journey of research with ease and efficiency.
In the following section, we illustrate how the proposed SAS ® macros fit into the research and analytic framework seamlessly and assist in improving the overall research quality. In the case study, we exemplify the usage of the proposed macros through a real-life research project based on the NCDB, with detailed interpretation and discussion of the results.

A general analytic workflow in observational studies
A general process of conducting research based on observational/retrospective studies involves a few general steps: study design, data management, data analyses, and reporting/review (shown in Figure 1), and each step interacts with the others as research is refined over time.
1. At the study design phase, the primary study goals or hypotheses need to be stated clearly, with a suitable database and proper definitions of the study population, outcomes, cohorts, and covariates. A comprehensive literature review of the area further supports a sound study design.
2. The data management step includes crafting the target study population by applying exclusion/inclusion criteria and preparing variables, such as creating derived variables, categorizing continuous variables, collapsing levels in categorical variables, handling missing values and outliers, etc. We recommend assigning an interpretable label and format to each variable to obtain the most readable output tables from the macros.
3. In the data analyses step, different layers of information about the data will be unfolded by sequential analytic steps to test ultimate study hypotheses. The routine data analyses, followed by more advanced analytic approaches, allow for building a pyramid of evidence to support hypotheses and hence strengthen study conclusions.
4. At the phase of reporting/review, we wrap up all results with interpretations in the context of the scientific background to evaluate the findings and draw conclusions with a statement of limitations. The process may involve reviews from an internal collaborative group or criticisms from journal reviewers. Helpful guidance is available for reporting results from an observational study 7,11 .
The research itself should be dynamic and iterative as illustrated in Figure 1. Any issues, questions, criticisms, or new ideas that arise at a given step will redirect us to the previous steps and may lead to modifications of or additions to the original study design. The proposed SAS® macros will fit into this research framework seamlessly by covering routine data analyses in Step 3 more efficiently and enhancing reporting and communication in Step 4.
The routine data analyses serve as the foundational knowledge to support final claims in the research. Without a correct perception of the importance of routine data analyses, researchers can rush into the final results without seeing the holes in the foundation, lose opportunities for improvement, and draw biased conclusions. 1) Descriptive analyses give us a chance to review the landscape of the entire study population and to know the boundaries that apply to the conclusions. For example, if the study population is 99.9% Caucasian and 0.1% African American, the research findings may not be generalizable to the African American population or other racial groups. In addition, descriptive analyses help us to assess missing data, outliers, or possible data entry errors, and may lead to additional data management. 2) Univariate associations or models demonstrate the natural, unadjusted relationships between study cohorts and covariates or between the outcome and cohorts/covariates. We learn of differences in patients' characteristics among comparison cohorts, which may explain the differences we observe in the associations with the outcomes, especially when those characteristics are strong predictors of the outcomes. This phenomenon is referred to as selection bias or confounding, as commonly seen in observational studies. A confounder can be identified conceptually through existing knowledge or by examining the univariate associations as described here.
3) Both multivariable analyses and subgroup analyses are ways to deal with confounding effects. Multivariable analyses are also known as main effects models, which estimate an adjusted association with the outcome for each variable, assuming all other variables in the model are held constant. The main effects model assumes the adjusted association with the outcome for each variable is uniformly constant across the levels of all other variables in the model. When such an assumption does not hold, subgroup analyses can be applied via interaction models, in which the association with the outcome is allowed to vary by the levels of a third variable (an effect modifier). Deeper technical details and strategies for multivariable modeling and variable selection, or more advanced approaches such as propensity score methods, can be found in statistical textbooks and the literature 14,15 .
SAS ® macros
SAS ® macros are powerful tools and have been widely used to build customized SAS ® procedures that reduce repetitive coding. Many individual SAS ® macros have been created for reporting purposes [16][17][18] , but they suit only very specific scenarios, and none of them cover streamlined data analyses in a way that serves research generically and dynamically, as the proposed macros do.

Implementation
The basic mechanism is similar in all proposed macros; a minimal sketch follows the list:
1. Process each variable in the list of interest, or each analytic step, one at a time, and continue the iteration until the last variable or step.
2. Inside each iteration, run the relevant existing SAS ® procedures and export the key information from their output into SAS ® data sets using the ODS OUTPUT statement.
3. Organize all information collected in step 2 into a concise and interpretable format through a series of DATA steps.
4. Report the final summary tables to a rich text file using PROC REPORT and the ODS RTF destination.
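To make the mechanism concrete, the following is a minimal sketch of the four steps, not the authors' implementation: it iterates over a list of categorical variables, runs PROC FREQ on each, captures the output with ODS OUTPUT, stacks the results with PROC APPEND, and writes an RTF report with PROC REPORT. The macro name %uni_sketch and its parameters are illustrative only.

******** illustrative sketch of the macro mechanism ********;
%macro uni_sketch(dataset=, clist=, outcome=, outpath=, fname=);
  %local i var;

  /* start from a clean slate (a log warning appears if _ALL does not exist yet) */
  proc datasets lib=work nolist; delete _all; quit;

  /* Step 1: loop over each variable in the space-separated list */
  %let i = 1;
  %do %while(%length(%scan(&clist, &i, %str( ))) > 0);
    %let var = %scan(&clist, &i, %str( ));

    /* Step 2: run an existing procedure and capture its output via ODS OUTPUT */
    ods output CrossTabFreqs = _freq;
    proc freq data=&dataset;
      tables &var * &outcome;
    run;

    /* Step 3: tag the per-variable results and stack them into one data set */
    data _freq;
      length varname $32;
      set _freq;
      varname = "&var";
    run;
    proc append base=_all data=_freq force; run;

    %let i = %eval(&i + 1);
  %end;

  /* Step 4: report the combined summary table to a rich text file */
  ods rtf file="&outpath.&fname..rtf";
  proc report data=_all nowd; run;
  ods rtf close;
%mend uni_sketch;
*************************************************************;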

Operation
All proposed SAS® macros work under SAS 9.3 and above (English version).
In Box 1, we list the proposed SAS® macros and their descriptions. They were developed in a highly collaborative and fast-paced National Cancer Institute-Designated Comprehensive Cancer Center research facility to help manage the standard workload involved in multiple, contemporaneous studies. The macros cover the reports for the routine data analyses, such as descriptive statistics, bivariate associations, and multivariable modeling for continuous, binary, ordinal, or time-to-event outcomes. Reporting for univariate and multivariable generalized estimating equation (GEE) models for correlated data is also included 19 . The benefit of using these macros is that they can handle a large number of variables at one time and create professional, consistent, and interpretable summary tables without copying and pasting from the SAS output to the final report. They also yield readable and concise SAS code and enable easier maintenance and revision as the project evolves.

Use case
Background
The traditional surgical approach for early-stage lung cancer is via a thoracotomy, and such open chest surgery is limited to certain eligible patients due to the increased risk of mortality, especially among elderly patients with multiple comorbidities. Over the past two decades, video-assisted thoracic surgery (VATS) has been increasingly used in clinics and provides excellent short-term advantages over thoracotomy, such as fewer complications, less pain, improved lung function, a shorter recovery period, and lower costs. On the other hand, incomplete lymph node evaluation or a higher rate of residual tumor with VATS may compromise its long-term efficacy vs. thoracotomy (open). Uncertainty remains regarding the optimal surgical approach for early-stage lung cancer with respect to long-term survival 20 . The goal of this study is to compare the overall survival (OS) between the two surgical approaches among eligible lung cancer patients. In this case study, we show sequential analytical steps along with interpretations to probe into the study question by using the proposed SAS ® macros. To make it easy to follow, we only include a short but relevant list of variables and cover routine data analyses. A comprehensive and complete manuscript for the study question is under preparation for a peer-reviewed publication.

SAS ® macros showcase - data analyses
Please note that these macros do not handle data cleaning or management directly. For the most interpretable results, it is highly recommended to properly format and label each variable before running these macros. All output tables are in Rich Text Format (RTF) and editable in Word.

Descriptive Statistics
The macro %DESCRIPTIVE was used to generate Table 1.
The table provides an overview of the study population's basic characteristics, in which the frequency and proportion are listed for categorical covariates and summary statistics (mean, median, min, max, standard deviation) for numerical covariates, along with the frequency of missing values. In the following SAS ® code example, a few universal macro variables are implemented: DATASET for the data set name; CLIST for a list of categorical covariates separated by spaces; NLIST for a list of continuous covariates separated by spaces; OUTPATH for the file path to store the output RTF file; FNAME for the name of the RTF file; and DEBUG, which when set to T suppresses the deletion of intermediate data sets for error checking. Setting the DICTIONARY option to T creates two additional columns in the summary table, for the covariates' SAS ® names and unformatted values. These columns are useful for the programmer to connect the table with the dataset and code. In Table 1, we observe that 35.6% of the study population underwent a minimally invasive surgical approach, 45% were male, 80.9% resided in a metro area, and 85.4% had a comorbidity score ≤1.
******** SAS ® code example for %DESCRIPTIVE ********;
/* TIP 1: Create the macro variable &DIR to specify the location used to
   store all results. If you are conducting an updated analysis, change
   this path, and all updated results will be saved to the new location
   without modifying it in every macro. */
%let dir = C:\Desired Location to Save All Results\;
/* TIP 2: Create the macro variables &cat_var and &num_var to store the
   categorical and numerical variable lists outside the macros, and
   reference them in all related macros. All related tables can then be
   updated by changing those two macro variables. */
/* TIP 3: Also note that the order of variables in CLIST is the same as
   the order in which they will appear in the final output. */
****************************************************;
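The original code box showing the macro call itself is not reproduced above, so here is a hypothetical invocation following the tips; the data set name LUNG and the variable names are placeholders, and only the parameters described in the text are used.

/* Hypothetical variable lists and macro call; LUNG and the variable
   names are illustrative placeholders, not the original study code. */
%let cat_var = surg_app sex metro_area comorbidity_cat his_cat clin_t;
%let num_var = age;
%DESCRIPTIVE(DATASET    = lung,
             CLIST      = &cat_var,
             NLIST      = &num_var,
             OUTPATH    = &dir,
             FNAME      = Table1_Descriptive,
             DICTIONARY = T,
             DEBUG      = F);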

Univariate Association with a categorical outcome/exposure variable
We examined the univariate association of each covariate with surgical approach (Table 2) using %UNI_CAT, and with 30-day mortality after surgery (Table 3) using %UNI_LOGREG.
Each covariate was processed separately but summarized together in one table. %UNI_CAT is suitable for comparing multiple covariates, specified in CLIST and NLIST, between two or more cohorts defined by a categorical variable (OUTCOME) (see the following SAS ® code example). For each categorical covariate, frequencies from a contingency table are reported along with row percentages (Row%) or column percentages (Col%), controlled by the ROWPERCENT option based on the desired interpretation. For each numeric covariate, summary statistics are generated for each level of OUTCOME.
The univariate associations can be tested by either parametric or non-parametric tests using the NONPAR option. The names of the tests appear in the footnote of the table.
Analyses can be performed using logistic regression models with %UNI_LOGREG if the outcome/exposure variable is binary or ordinal. As shown in Table 3, the probability of EVENT = 'Yes' was modeled and the odds ratio (OR) was reported with a 95% confidence interval (CI). The reference level of a categorical covariate in CLIST can be chosen to aid interpretation, which is done by separating the CLIST variables with an asterisk (*) and then adding "(DESC)" or "(ref = "Reference level in formatted value")" after each desired variable name (see the following code). If the outcome/explanatory variable is numeric, users can refer to the SAS ® macro %UNI_NUM. In all report tables, a p-value < 0.05 is shown in bold for easy visualization.
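The original code boxes are not reproduced here; the following hypothetical calls are reconstructed from the parameters described above, with placeholder data set and variable names. The OUTCOME parameter for %UNI_LOGREG is an assumption by analogy with %UNI_CAT.

/* Hypothetical calls; LUNG and the variable names are placeholders. */
%UNI_CAT(DATASET    = lung,
         OUTCOME    = surg_app,
         CLIST      = sex metro_area comorbidity_cat his_cat clin_t,
         NLIST      = age,
         ROWPERCENT = F,   /* F = column percentages, T = row percentages */
         NONPAR     = F,   /* F = parametric tests */
         OUTPATH    = &dir,
         FNAME      = Table2_by_SurgApp);

%UNI_LOGREG(DATASET = lung,
            OUTCOME = mort30,    /* assumed parameter name */
            EVENT   = 'Yes',     /* probability of 'Yes' is modeled */
            CLIST   = surg_app (ref = "Open") * sex * his_cat * clin_t,
            NLIST   = age,
            OUTPATH = &dir,
            FNAME   = Table3_30day_Mortality);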
Considering Table 2 and Table 3 together, the univariate association between surgical type and 30-day mortality could be confounded by the effects of histology and clinical T stage. Open surgery was more likely to be conducted among patients who had squamous cell carcinomas and clinical T2 tumors, and these two factors were linked to a higher rate of 30-day mortality. The observed higher probability of 30-day mortality in open surgery patients might therefore be partially due to the imbalanced distribution of histology and clinical T stage between the surgical cohorts. Common statistical approaches to control for confounding effects include multivariable analysis, subgroup analysis, and propensity score methods.

Univariate Association with a time-to-event outcome
We examined the association of the study cohorts and each covariate with OS using %UNI_PHREG, as shown in Table 4. Each covariate was individually fit in a proportional hazards model (PROC PHREG in SAS ® ), and the hazard ratios (95% CI) and p-values were summarized in one table. In the following SAS ® code example, the survival time variable is specified in EVENT, and the censoring indicator variable in CENSOR.
Note that it requires '1' to be used for the event and '0' for the censored cases in the data (DATASET). Other options include LOGRANK to output the log-rank p-value, TYPE3 to output the Type 3 p-values for categorical covariates, and PHA for proportional hazards assumption checks. Similar to %UNI_LOGREG, one can set the reference level of a categorical covariate in CLIST. %UNI_PHREG can also handle time-to-event data in the counting process form through the START and STOP options, especially when modeling data with time-varying covariates or recurrent events. It can also handle competing risk data through the EVENTCODE option, which activates Fine and Gray's model 21 .
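The original code box is not reproduced here; the following hypothetical call is reconstructed from the parameters described above, with placeholder data set and variable names.

/* Hypothetical call; LUNG, OS_MONTHS, and OS_CENSOR are placeholders.
   OS_CENSOR must use 1 = event (death) and 0 = censored. */
%UNI_PHREG(DATASET = lung,
           EVENT   = os_months,   /* survival time variable */
           CENSOR  = os_censor,   /* censoring indicator */
           CLIST   = surg_app (ref = "Open") * metro_area * comorbidity_cat * his_cat * clin_t,
           NLIST   = age,
           LOGRANK = T,           /* also output log-rank p-values */
           TYPE3   = T,           /* Type 3 p-values for categorical covariates */
           OUTPATH = &dir,
           FNAME   = Table4_OS_Univariate);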
In the univariate analysis shown in Table 4, we see that patients who underwent open surgery had an 18% higher risk of death than those who underwent minimally invasive surgery (HR = 1.18; 95% CI = 1.13-1.24; p < 0.001). Also, residing in rural or urban areas, higher comorbidity scores, squamous cell carcinomas, clinical stage T2, and older age were all risk factors linked to worse overall survival in this study population. In combination with the results in Table 2, these prognostic factors, except for age, were also more likely to be present in patients who underwent open surgery and should be controlled for as confounders in multivariable analyses.

Multivariable model for the binary or time-to-event outcome
We performed multivariable analysis with a logistic regression model for the 30-day mortality outcome using %LOGREG_SEL (Table 5) and a Cox proportional hazards model for overall survival using %PHREG_SEL (Table 6). The adjusted association of the surgical approaches with the two clinical outcomes was estimated after controlling for observed confounding variables. The odds ratio and hazard ratio were reported along with 95% CIs and p-values. In both macros, a manual backward elimination procedure was implemented by dropping one variable at a time until all remaining variables satisfied a pre-specified alpha level (e.g., SLSTAY = 0.1). The selection process in the macros allows the sample size to adjust, always using the maximum available sample as the number of variables in the model drops; this differs from the automatic selection procedures built into PROC LOGISTIC or PROC PHREG, where a complete and fixed data set is used throughout the selection procedure. In the following example code, DSN specifies the data set name, and EVENT in %LOGREG_SEL specifies the event of interest, for which the probability will be modeled. VAR is the list of all variables or terms in the initial model, separated by spaces. CVAR is the list of the categorical variables in VAR, with the option to specify reference levels as shown in Table 4. INC = k protects the first k variables in VAR from being eliminated, such as the primary explanatory variable and important confounding variables. In this case study, we want to keep surgical approach in the model as it is the study cohort, which was done by putting surg_app in the first position in VAR and setting INC = 1. Setting CLNUM = T outputs the frequency of categorical variables in the final model. Summary information about the selection procedure is presented in the footnote of the final report table.
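The original code box is not reproduced here; the following hypothetical %PHREG_SEL call for Table 6 is reconstructed from the parameters described above. The EVENT and CENSOR parameters for %PHREG_SEL are assumptions by analogy with %UNI_PHREG; the data set and most variable names are placeholders.

/* Hypothetical call; surg_app is placed first in VAR, and INC = 1
   protects it from backward elimination. */
%PHREG_SEL(DSN     = lung,
           EVENT   = os_months,   /* assumed, by analogy with %UNI_PHREG */
           CENSOR  = os_censor,   /* assumed, by analogy with %UNI_PHREG */
           VAR     = surg_app metro_area comorbidity_cat his_cat clin_t age,
           CVAR    = surg_app (ref = "Open") * metro_area * comorbidity_cat * his_cat * clin_t,
           INC     = 1,
           SLSTAY  = 0.1,
           CLNUM   = T,
           OUTPATH = &dir,
           FNAME   = Table6_OS_Multivariable);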
A separate macro for multivariable analysis of competing risk data is %FINEGRAY_SEL. In practice, there are many approaches to building a multivariable model. If users want to customize the model-building process, they can use these two macros for reporting purposes only by setting VAR to the final model variables selected by other approaches and INC to the total number of terms in the final model.

Stratified Multivariable model
When fitting the data to a main-effects model, as in Table 5 or Table 6, an imposed assumption is that the effect of treatment on outcomes is the same across all subgroups defined by the controlled variables. This assumption may or may not hold, and exploring and identifying subgroups that may benefit more from treatment can lead to deeper insight. Instead of splitting the data into smaller, separate data sets, fitting an interaction model on the entire data set is more appropriate 22 . As shown in Table 7, we fit a multivariable model including an interaction term between surgical approach and histology, still implementing the backward elimination procedure. In the related %PHREG_SEL code, the first three terms in VAR were protected from elimination by setting INC = 3, to keep the interaction of interest (surg_app*his_cat) in the model. The macro parameters EFFECT and SLICEBY allow users to specify the variables for the treatment effect and the subgroups (both must be categorical variables for the macro to run correctly). Setting SHORTREPORT = T reports only the hazard ratio (HR) and p-value of the surgical approach within each histology subgroup; if set to F, the HRs for all other control variables in the model are also reported. This macro can only handle one interaction at a time but is useful as an initial exploration of the treatment effect in subgroups.
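Again, the original code box is not reproduced here; the following hypothetical call is reconstructed from the description above (surg_app, his_cat, INC = 3, EFFECT, SLICEBY, and SHORTREPORT come from the text; everything else is a placeholder).

/* Hypothetical call; the first three terms in VAR are protected
   (INC = 3) so the interaction surg_app*his_cat stays in the model. */
%PHREG_SEL(DSN         = lung,
           EVENT       = os_months,   /* assumed, as above */
           CENSOR      = os_censor,   /* assumed, as above */
           VAR         = surg_app his_cat surg_app*his_cat metro_area comorbidity_cat clin_t age,
           CVAR        = surg_app (ref = "Open") * his_cat * metro_area * comorbidity_cat * clin_t,
           INC         = 3,
           SLSTAY      = 0.1,
           EFFECT      = surg_app,    /* treatment effect variable */
           SLICEBY     = his_cat,     /* subgroup variable */
           SHORTREPORT = T,
           OUTPATH     = &dir,
           FNAME       = Table7_OS_by_Histology);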
In Table 7, detailed information about variable selection in the model-building process is shown in the footnote. We see that, overall, minimally invasive surgery shows a protective effect on survival compared to open surgery, but the protection is more pronounced among adenocarcinoma patients (HR = 0.85; p < 0.001) than among squamous cell carcinoma patients (HR = 0.96; p = 0.321). The p-value for the interaction term is 0.069; it tests the difference among the HRs of 0.85, 0.95, and 0.96 for the three histology groups. At the 0.05 significance level, the interaction is not statistically significant, so we cannot conclude that the hazard ratios for the surgical approaches differ across the histology groups. In the case of a significant interaction effect, researchers should report the interaction model and discuss the treatment effect in each subgroup.

Kaplan-Meier analysis
It is standard practice in many time-to-event studies to report Kaplan-Meier (KM) plots, median survival times, and survival rates at specific time points for time-to-event outcomes stratified by treatments, as a straightforward and intuitive way to assess the survival profile. %KM_PLOT was used to generate Figure 2. The KM plot for overall survival (specified in EVENTS, CENSORS) stratified by surgical approach (specified in GRPLIST) was produced with the key information of interest reported in a summary table. This macro has many options that allow the user to control the appearance of the plot (TITLE, XTICK, XMAX, NONCENSORED, and ATRISK) and to output the estimated survival rates at pre-specified time points (TIMELIST). The macro becomes handier when there are multiple data sets, several outcomes, or multiple variables of interest: users can produce multiple KM plots in one macro call for all combinations of the parameter values by listing multiple values separated by spaces in DSN, EVENTS/CENSORS, and GRPLIST.
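A hypothetical call, reconstructed from the parameter names above, with placeholder data set and variable names and assumed option values:

/* Hypothetical call; time is assumed to be in months, so TIMELIST
   requests 1-, 3-, and 5-year survival rates. */
%KM_PLOT(DSN      = lung,
         EVENTS   = os_months,
         CENSORS  = os_censor,
         GRPLIST  = surg_app,
         TIMELIST = 12 36 60,
         XMAX     = 72,      /* truncate the x-axis at 72 months */
         ATRISK   = T,       /* show the number-at-risk table */
         TITLE    = Overall Survival by Surgical Approach,
         OUTPATH  = &dir,
         FNAME    = Figure2_KM_OS);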

Discussion
A natural course of research is generally iterative and conducted by a research team with mixed expertise and experience levels. The proposed sequential SAS ® macros fit seamlessly into that type of research environment. They help process a massive number of variables effortlessly, produce complete and interpretable summary information that helps the research team comprehend the data and better plan the next step, and provide a highly organized and concise coding interface that facilitates easy updates as research evolves.
Research based on observational or retrospective data needs extra effort in study design and data management before data analysis, and careful interpretation afterward. The presented case study also illustrated a simple analytic workflow showing how to build an analytic project from the foundation, comprehend different layers of information jointly from the routine data analyses, and envision where the research stands and where to go next. This work can serve as a useful tutorial for researchers to easily get their research off the ground.
The limitations include the requirements for researchers to be comfortable with the SAS ® environment and to have basic statistical training in data handling and interpretation. The proposed SAS ® macros do not handle study design or data management. A properly defined study population, cohorts, outcomes, and covariates, and a sufficient literature review, are critical before using our macros for the sake of research quality. Since only the relevant information is summarized and reported, the macros do not currently cover issues such as assumption checking or goodness of fit for the fitted model. Including an experienced biostatistician in the research team would be beneficial.
These macros were first created in 2011 19 and have been implemented in many projects to help turn out high-quality research efficiently. They are still under active upgrade to meet new needs (e.g., %UNI_PHREG is in its 26th version). At this point, we still lack some desired features, such as allowing users to decide the variable order in the final report, implementing an automatic decision about the appropriateness of a parametric versus non-parametric test in %UNI_CAT or %UNI_NUM, and setting a customized report template. However, those tasks, along with others, are on the list for our next round of upgrades and will become available soon. We welcome suggestions and comments that can help us improve.

Data availability
Owing to data protection concerns, data used in the use case cannot be shared under the American College of Surgeons' Commission on Cancer NCDB Participant Use File (PUF) Purpose and Terms of Agreement. For information on how to apply for access to the NCDB PUF and who will be granted access to it, please visit: https://www.facs.org/quality-programs/cancer/ncdb/puf.
Reviewer report
Dr. Yuan Liu and colleagues have developed a series of generic SAS macros aiming to carry out streamlined routine data analyses and produce journal-quality summary tables. Although this work is fundamental, it will benefit statisticians greatly by avoiding time-consuming and tedious copy-and-paste processes. The idea of "streamlined routine data analyses" has been on my mind for years, and I deem that the available SAS macros in the SAS community cover only segments of the whole data analysis process; thus, what we need is not more segmented SAS macros, but a set of SAS macros that can work seamlessly as an entire system. That is why I developed the '%ggTable' series of SAS macros and published '%ggBaseline', the first macro of the ggTable series. Although the other macros of this series are still under preparation for a peer-reviewed publication, I am glad to be a reviewer of this work and share some of my considerations that might help move this publication forward.

Major considerations:
The title of this paper might be a little broad or exaggerated. Besides the binary and time-to-event outcomes discussed in this article, multinomial and continuous variables can also be outcomes in observational studies. In addition, many statistical methods, such as propensity score-based methods (stratification, matching, and inverse probability weighting), are increasingly popular in observational studies. Last, it is common to use generalized estimating equation (GEE) models or mixed models to account for clustered data, since many centers are enrolled in observational studies.
Minor considerations: For %DESCRIPTIVE, %UNI_CAT, and %UNI_NUM, the variable lists are specified by CLIST and NLIST separately; this approach has some drawbacks, such as 1) the order in the table will be grouped by variable type, with categorical variables always first; and 2) statistical tests and variable labels cannot be specified when calling the macros. However, this is not a problem in %ggBaseline.
It would be better to report the mean with the standard deviation (SD) on the same line as mean ± SD, and the median with the interquartile range (IQR) as median (IQR).
A K-M plot with the number at risk is highly recommended, and the border of the legend should be removed as well.
It would be great if the authors could improve these SAS macros in consideration of the above concerns. If no improvements are made, more discussion of these limitations and a comparison to other available macros are needed in the Discussion section.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: epidemiology, biostatistics, and clinical research in cardiology, neurology, and oncology.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author response
For the major consideration, we do have several macros (%UNI_GENMOD, %GENMOD_SEL) that handle reporting based on generalized linear models (normal, binary, Poisson, or negative binomial distributions) for either GEE (clustered data) or non-GEE (independent data) settings, using the PROC GENMOD procedure in SAS. In the revision, we added them to Box 1 with a few more explanations in the context. Even though the case study did not use them, they are all available in our final downloadable package, and their usage is similar to that of the other macros in the case study.
For the propensity score-based methods, we have also created a series of macros for propensity score estimation, matching/weighting/stratification, balance checking, visualization, etc., but since this paper mainly focuses on the routine data analyses, we decided to leave out the PS-related components and develop a separate paper in the future.
We do appreciate that Dr. Gu pointed out some limitations, and we have included those limitations in the Discussion section. However, we actively upgrade these macros as we go to meet new needs. Since they were first created in 2011, some macros have reached their 20th version. We would highly appreciate any feedback and comments from potential users and will improve the macros with your help.
