Workshop Overview: From DoE Data to Decisions

Now that you've designed your experiments and collected data, it's time for the most crucial step: statistical analysis. This workshop will teach you how to extract meaningful insights from your DoE data and validate your models.

Why Statistical Analysis is Critical in DoE
Step 1: Model Building - We need to understand which factors actually affect our response
Step 2: Model Validation - We must ensure our model is reliable and not just fitting noise
Step 3: Optimization - We want to predict optimal conditions with confidence
Step 4: Decision Making - We need statistical evidence to support our formulation choices

Learning Objectives: By the end of this workshop, you'll know how to perform ANOVA on DoE data, check model adequacy through residual analysis, and validate your models for reliable predictions.

Section 1: ANOVA for DoE Analysis

Model Adequacy Checking

Let's Think Step-by-Step: What Makes a Good Model?
Step 1: Significant Effects - The model should include factors that actually matter (p < 0.05)
Step 2: Good Fit - The model should explain a substantial portion of variability (high R²)
Step 3: Valid Assumptions - Residuals should be normally distributed with constant variance
Step 4: No Lack of Fit - The model should capture the true relationship, not miss important patterns
Pharmaceutical Example: Tablet Hardness DoE Analysis

Scenario: We conducted a 2³ factorial design to study tablet hardness. Factors: Compression Force (A), Binder Concentration (B), and Disintegrant Level (C).

Run   A: Force (kN)   B: Binder (%)   C: Disintegrant (%)   Hardness (N)
1     -1 (15)         -1 (2)          -1 (1)                45
2     +1 (25)         -1 (2)          -1 (1)                78
3     -1 (15)         +1 (4)          -1 (1)                52
4     +1 (25)         +1 (4)          -1 (1)                89
5     -1 (15)         -1 (2)          +1 (3)                41
6     +1 (25)         -1 (2)          +1 (3)                71
7     -1 (15)         +1 (4)          +1 (3)                47
8     +1 (25)         +1 (4)          +1 (3)                83
Excel Implementation: ANOVA for DoE
Step 1: Data tab → Data Analysis → Regression (requires the Analysis ToolPak add-in)
Input Y Range: C2:C9 (Hardness values)
Input X Range: D2:F9 (Coded factor levels)
Output: ANOVA table with F-statistics and p-values
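
If you prefer scripting, the same fit can be reproduced outside Excel. Below is a minimal Python sketch (Python and numpy are assumptions on my part; the workshop itself is Excel-based) that fits the main-effects model to the hardness data from the table above:

    import numpy as np

    # Coded factor levels (A, B, C) and hardness responses for runs 1-8
    X = np.array([[-1, -1, -1],
                  [ 1, -1, -1],
                  [-1,  1, -1],
                  [ 1,  1, -1],
                  [-1, -1,  1],
                  [ 1, -1,  1],
                  [-1,  1,  1],
                  [ 1,  1,  1]], dtype=float)
    y = np.array([45, 78, 52, 89, 41, 71, 47, 83], dtype=float)

    # Least-squares fit of the main-effects model: y = b0 + b1*A + b2*B + b3*C
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print("Coefficients (b0, A, B, C):", np.round(beta, 2))

    # In a two-level factorial, each main effect = 2 x its coded coefficient
    print("Main effects:", np.round(2 * beta[1:], 2))

Because the factors are coded ±1, the printed effects can be checked against a hand calculation: the effect of A, for example, is the mean hardness at high force minus the mean at low force.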

Lack-of-Fit Test

Understanding Lack-of-Fit: What Are We Testing?
Concept: Lack-of-Fit occurs when our model doesn't capture the true relationship between factors and response
Test Logic: If we have replicate runs (multiple observations at the same factor combinations), we can separate "pure error" from "lack-of-fit error"
Pure Error: Variability between replicates at the same conditions (measurement error)
Lack-of-Fit: Additional error beyond pure error (model inadequacy)
F_LOF = MS(lack-of-fit) / MS(pure error)
⚠️ Critical Decision Rule
If F_LOF > F_critical (p < 0.05): Significant lack-of-fit detected!
Action Required: Add higher-order terms, transform variables, or consider different model forms.
Example: Dissolution Rate Study with Replicates

Scenario: Central Composite Design for dissolution optimization, with 3 replicates at the center point and at one factorial (corner) point.

Condition      Replicate 1   Replicate 2   Replicate 3   Mean    Pure-Error SD
Center Point   85.2%         87.1%         86.3%         86.2%   0.95
Corner Point   78.5%         79.2%         78.9%         78.9%   0.33

Pure Error Calculation: pool the squared deviations within each replicate group, then divide by the pooled degrees of freedom, Σ(nᵢ - 1), to obtain MS(pure error). The last column reports each group's within-replicate standard deviation.
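
As a cross-check, here is a minimal Python sketch (numpy assumed) of the pure-error calculation for the two replicate groups above; the lack-of-fit F-ratio then needs only the residual sum of squares from your fitted model:

    import numpy as np

    center = np.array([85.2, 87.1, 86.3])   # center-point replicates (%)
    corner = np.array([78.5, 79.2, 78.9])   # corner-point replicates (%)

    # Pure error: squared deviations within each replicate group, pooled
    ss_pe = sum(((g - g.mean()) ** 2).sum() for g in (center, corner))
    df_pe = sum(len(g) - 1 for g in (center, corner))
    ms_pe = ss_pe / df_pe
    print(f"SS_pure = {ss_pe:.3f}, df = {df_pe}, MS_pure = {ms_pe:.3f}")

    # With SS_residual and df_residual from your fitted model:
    #   ss_lof = ss_residual - ss_pe;  df_lof = df_residual - df_pe
    #   f_lof  = (ss_lof / df_lof) / ms_pe   -> compare to F_critical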

R² and Adjusted R² Interpretation

Step-by-Step: Understanding Model Fit Measures
R² Formula: R² = 1 - (SS_residual / SS_total)
Interpretation: Percentage of variability explained by the model (0-100%)
Problem: R² always increases when adding terms, even if they're not significant
Solution: Adjusted R² = 1 - [(SS_residual / (n - p - 1)) / (SS_total / (n - 1))], where p = number of model terms (excluding the intercept)
Benefit: Adjusted R² penalizes for adding unnecessary terms
Example: Content Uniformity Model Comparison
Model            Terms                   R²      Adjusted R²   Decision
Linear Only      A, B, C                 0.789   0.731         Good baseline
+ Interactions   A, B, C, AB, AC, BC     0.892   0.834         Improvement!
+ Quadratic      All + A², B², C²        0.903   0.821         Overfitting?

Decision Logic: Choose the model with the highest adjusted R² that includes only significant terms.

Excel R² Calculation
R² Formula: =1-SUMXMY2(actual_range,predicted_range)/DEVSQ(actual_range)
Or simply: Look at "R Square" value in Regression Analysis output
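
The same two quantities are easy to script. Here is a minimal Python sketch mirroring the Excel formulas (numpy assumed; the actual/predicted arrays are placeholders for your own model output):

    import numpy as np

    def r_squared(actual, predicted):
        ss_res = np.sum((actual - predicted) ** 2)        # like SUMXMY2
        ss_tot = np.sum((actual - np.mean(actual)) ** 2)  # like DEVSQ
        return 1 - ss_res / ss_tot

    def adjusted_r_squared(actual, predicted, p):
        # p = number of model terms, excluding the intercept
        n = len(actual)
        r2 = r_squared(actual, predicted)
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)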

Section 2: Residual Analysis - Model Diagnostics

Normal Probability Plots

Why Check Normality of Residuals?
Assumption: ANOVA assumes residuals follow a normal distribution
Consequence: If residuals aren't normal, our p-values and confidence intervals are unreliable
Test Method: Plot residuals vs. normal scores (quantiles)
Interpretation: Points should fall approximately on a straight line
Example: Capsule Fill Weight Variability Study

Problem: Fill weight data showing potential non-normality

Observation   Predicted   Actual   Residual   Normal Score
1             248.5       251.2      2.7       0.97
2             252.1       249.8     -2.3      -1.64
3             250.3       250.1     -0.2      -0.97
4             249.7       250.4      0.7      -0.43
5             251.8       253.1      1.3       0.00
6             253.2       254.7      1.5       0.43
7             247.9       252.1      4.2       1.64

Note: each normal score is assigned by the residual's rank (smallest residual pairs with the most negative score), per Steps 2-3 below.

Analysis: Plot residuals vs. normal scores. If points deviate from straight line, consider data transformation.

Excel Normal Probability Plot
Step 1: Calculate residuals = Actual - Predicted
Step 2: Sort residuals in ascending order
Step 3: Calculate normal scores = NORM.S.INV((RANK-0.5)/COUNT)
Step 4: Create scatter plot: X-axis = Normal Scores, Y-axis = Residuals
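
The same four steps in Python, using the seven residuals from the fill-weight table (numpy, scipy, and matplotlib are assumptions; note that the sorted residuals, not the raw ones, are paired with the normal scores):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    residuals = np.array([2.7, -2.3, -0.2, 0.7, 1.3, 1.5, 4.2])  # Step 1 output

    sorted_res = np.sort(residuals)                  # Step 2
    n = len(sorted_res)
    ranks = np.arange(1, n + 1)
    scores = norm.ppf((ranks - 0.5) / n)             # Step 3: NORM.S.INV equivalent

    plt.scatter(scores, sorted_res)                  # Step 4
    plt.xlabel("Normal score")
    plt.ylabel("Residual")
    plt.title("Normal probability plot of residuals")
    plt.show()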

Residuals vs Fitted Values Plot

What This Plot Reveals About Your Model
Good Pattern: Random scatter around zero line (no patterns)
Bad Pattern 1: Funnel shape = heteroscedasticity (non-constant variance)
Bad Pattern 2: Curved pattern = non-linearity (need quadratic terms)
Bad Pattern 3: Outliers = unusual observations requiring investigation

Example: API Content Analysis Issues

Scenario: DoE study on API content uniformity showing heteroscedasticity

⚠️ Problem Identified
The residuals-vs-fitted plot shows a funnel pattern: variance increases with API content level.
Solution: Consider a square-root transformation of the response variable.

Before Transformation: Residual variance from 0.1 to 2.4 (24-fold increase)

After √(API Content) Transformation: Residual variance from 0.05 to 0.12 (2.4-fold increase)
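
To see why the transformation works, here is an illustrative Python sketch with synthetic, Poisson-like data (numpy assumed; these are not the study's actual measurements): variance grows with the mean before the transform and is roughly constant afterwards.

    import numpy as np

    rng = np.random.default_rng(0)
    low = rng.poisson(lam=5, size=200).astype(float)     # low-level responses
    high = rng.poisson(lam=120, size=200).astype(float)  # high-level responses

    # Variance scales with the mean before the transform (~24-fold here) ...
    print("Variance ratio, raw:", round(high.var() / low.var(), 1))
    # ... and is roughly constant after the square-root transform
    print("Variance ratio, sqrt:", round(np.sqrt(high).var() / np.sqrt(low).var(), 2))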

Cook's Distance - Influential Point Detection

Understanding Influential Observations
Concept: Some data points have disproportionate influence on the regression line
Cook's Distance: Measures how much the fitted values change when we remove one observation
Threshold: D > 4/n (where n = number of observations) suggests influential point
Action: Investigate high Cook's D points - are they outliers or valid extreme conditions?
Cook's D_i = (r_i² / p) × (h_ii / (1 - h_ii))
where r_i = standardized residual, h_ii = leverage, p = number of model parameters (including the intercept)
Example: Tablet Compression Study Outlier
Run   Force (kN)   Hardness (N)   Residual   Cook's D   Status
1     15           45              -2.1      0.08       Normal
2     20           58               1.3      0.03       Normal
3     25           71              -0.8      0.01       Normal
4     30           45             -25.2      0.89       ⚠️ Influential

Investigation Result: Run 4 was compressed with a defective punch; after this was verified, removing it from the analysis was legitimate.

Leverage - High-Impact Factor Combinations

What is Leverage in DoE Context?
Definition: Leverage measures how far a point is from the center of the factor space
High Leverage: Corner points in factorial designs, star points in CCD
Implication: High leverage points strongly influence the fitted model
Threshold: h > 2p/n (where p = parameters, n = observations)
Excel Leverage Calculation
Note: Excel's Regression tool reports residuals, but it does not output leverage values directly
Manual calculation: h_ii = diagonal elements of the hat matrix H = X(XᵀX)⁻¹Xᵀ (built with TRANSPOSE, MMULT, and MINVERSE array formulas)
Check: the leverages sum to the number of parameters (p)
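
For completeness, here is a minimal Python sketch (numpy assumed) that computes leverages and Cook's distances from the hat matrix for any model matrix X1 that includes an intercept column:

    import numpy as np

    def influence_measures(X1, y):
        # X1: model matrix with intercept column; y: response vector
        H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # hat matrix
        h = np.diag(H)                             # leverages h_ii (sum to p)
        n, p = X1.shape
        resid = y - H @ y
        mse = (resid ** 2).sum() / (n - p)
        r = resid / np.sqrt(mse * (1 - h))         # standardized residuals
        cooks_d = (r ** 2 / p) * (h / (1 - h))
        return h, cooks_d

    # Rules of thumb: flag h_ii > 2p/n (leverage) and D_i > 4/n (influence)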

Section 3: Model Validation - Ensuring Reliability

Cross-Validation Techniques

Why Validate? The Problem of Overfitting
Problem: Models can fit the training data perfectly but predict poorly on new data
Solution: Test model performance on data it hasn't seen before
K-Fold CV: Split data into k groups, train on k-1, test on 1, repeat k times
Leave-One-Out: Special case where k = n (sample size)
Example: Dissolution Model Cross-Validation

Dataset: 16 runs from Central Composite Design

Method: 4-fold cross-validation (4 groups of 4 runs each)

Fold   Training Runs   Test Runs   R² (Training)   R² (Test)
1      5-16            1-4         0.94            0.89
2      1-4, 9-16       5-8         0.93            0.91
3      1-8, 13-16      9-12        0.95            0.87
4      1-12            13-16       0.94            0.92

Average Cross-Validation R²: 0.90 (vs. 0.94 on full data)

Conclusion: Model shows good predictive ability with minimal overfitting.
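
A minimal Python sketch of the k-fold procedure for a linear DoE model (numpy assumed; X1 is the model matrix with an intercept column, y the response vector; fold assignment here is random rather than the run grouping used above):

    import numpy as np

    def kfold_r2(X1, y, k=4, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        scores = []
        for test in np.array_split(idx, k):
            train = np.setdiff1d(idx, test)
            beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
            pred = X1[test] @ beta
            ss_res = ((y[test] - pred) ** 2).sum()
            ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
            scores.append(1 - ss_res / ss_tot)     # R² on the held-out fold
        return float(np.mean(scores))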

Confirmation Runs - The Ultimate Test

Planning Effective Confirmation Experiments
Purpose: Test model predictions with completely new experiments
Design: Choose factor combinations different from original design points
Replication: Run multiple replicates to assess prediction accuracy
Success Criteria: Observed values fall within prediction intervals
Example: Tablet Formulation Confirmation Study

Original DoE: Optimized binder and disintegrant levels for hardness and disintegration time

Optimal Conditions Found: 3.2% binder, 2.1% disintegrant

Response               Predicted   95% PI Lower   95% PI Upper   Observed (n=6)   Validation
Hardness (N)           75.2        71.8           78.6           74.1 ± 2.1       ✅ Within PI
Disintegration (min)   12.5        10.9           14.1           13.2 ± 0.8       ✅ Within PI
Friability (%)         0.65        0.52           0.78           0.71 ± 0.05      ✅ Within PI

Conclusion: All responses fall within prediction intervals - model validated!

Prediction Intervals vs Confidence Intervals

Critical Distinction for Pharmaceutical Applications
Confidence Interval: Uncertainty about the true mean response at given conditions
Prediction Interval: Uncertainty about a future individual observation
Key Point: Prediction intervals are always wider than confidence intervals
Pharmaceutical Use: Use prediction intervals for specification setting and batch release
Prediction Interval = ŷ ± t(α/2, df) × √(MSE × (1 + h_xx))
where h_xx = leverage at the prediction point
Excel Prediction Interval Calculation
Upper bound: =prediction + T.INV.2T(alpha, df) * SQRT(MSE * (1 + leverage))
Lower bound: =prediction - T.INV.2T(alpha, df) * SQRT(MSE * (1 + leverage))
MSE: Mean Squared Error from ANOVA table
Leverage: Calculate for new factor combination using design matrix
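
The same calculation as a minimal Python sketch (scipy assumed; mse, df, and leverage would come from your own fitted model and ANOVA table, and the example call uses hypothetical values):

    from scipy import stats

    def prediction_interval(y_hat, mse, df, leverage, alpha=0.05):
        t_crit = stats.t.ppf(1 - alpha / 2, df)    # two-sided critical t value
        half_width = t_crit * (mse * (1 + leverage)) ** 0.5
        return y_hat - half_width, y_hat + half_width

    # Hypothetical call: prediction_interval(75.2, mse=2.0, df=10, leverage=0.3)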

Process Capability Indices

From DoE Model to Process Control
Cp: Potential capability = (USL - LSL) / (6σ)
Cpk: Actual capability = min[(USL - μ)/(3σ), (μ - LSL)/(3σ)]
DoE Application: Use model residual standard deviation as estimate of σ
Target: Cpk ≥ 1.33 for pharmaceutical processes
Example: Content Uniformity Process Capability

Specifications: 95.0 - 105.0% of label claim

DoE Model Residual SD: σ = 1.2% (from optimized conditions)

Target Mean: μ = 100.0%

Calculation:
Cp = (105.0 - 95.0) / (6 × 1.2) = 10.0 / 7.2 = 1.39
Cpk, upper side: (105.0 - 100.0) / (3 × 1.2) = 1.39
Cpk, lower side: (100.0 - 95.0) / (3 × 1.2) = 1.39
Cpk = min(1.39, 1.39) = 1.39
✅ Process Assessment
Cpk = 1.39 > 1.33: Process is capable!
Expected Defect Rate: < 64 ppm (parts per million)
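
The capability numbers and the defect-rate estimate can be reproduced with a minimal Python sketch (scipy assumed for the normal tail probabilities; inputs are the example values above):

    from scipy.stats import norm

    usl, lsl, mu, sigma = 105.0, 95.0, 100.0, 1.2

    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))

    # Expected out-of-spec fraction under the normal model, in ppm
    ppm = (norm.cdf((lsl - mu) / sigma) + norm.sf((usl - mu) / sigma)) * 1e6
    print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}, expected defects = {ppm:.0f} ppm")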

Workshop Summary & Key Takeaways

Essential Statistical Analysis Checklist
  • ANOVA First: Check significance of effects, overall model fit (R²), and lack-of-fit
  • Residual Analysis: Verify normality, constant variance, and identify outliers
  • Influence Diagnostics: Use Cook's distance and leverage to find influential points
  • Model Validation: Perform cross-validation and confirmation runs
  • Practical Application: Calculate prediction intervals and process capability
⚠️ Common Analysis Mistakes to Avoid
  • Using R² instead of adjusted R² for model comparison
  • Ignoring residual patterns indicating model inadequacy
  • Confusing confidence intervals with prediction intervals
  • Removing outliers without proper investigation
  • Over-interpreting models beyond the experimental region

🎯 Ready for the Capstone Project?

You now have all the statistical tools needed to analyze DoE data effectively. In the next part, you'll apply everything you've learned to a complete pharmaceutical formulation optimization project.