6. Surrogate Models in Bgolearn#

Warning

Important: All data must be pandas DataFrames/Series, not numpy arrays!

  • data_matrix: pd.DataFrame with column names

  • Measured_response: pd.Series

  • virtual_samples: pd.DataFrame with same column names

Using numpy arrays will cause: AttributeError: 'numpy.ndarray' object has no attribute 'columns'
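
If your data starts out as numpy arrays, wrap it in pandas objects before calling fit. A minimal sketch (the column names here are illustrative):

import numpy as np
import pandas as pd

X = np.array([[2.0, 1.2, 0.5], [3.5, 0.8, 0.7]])  # raw numpy features
y = np.array([250, 280])                           # raw numpy targets

# Wrap arrays with explicit column names before passing them to fit()
data_matrix = pd.DataFrame(X, columns=['Cu', 'Mg', 'Si'])
measured_response = pd.Series(y)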

Note

This page explains the different surrogate models available in Bgolearn and how to choose the right one for your optimization problem.

6.1. Overview#

Surrogate models (also called metamodels) are the core of Bayesian optimization. They approximate the expensive-to-evaluate objective function and provide uncertainty estimates that guide the optimization process.

Bgolearn supports several surrogate models, each with different strengths and use cases:

  1. Gaussian Process (GP) - Default and most common

  2. Random Forest (RF) - Good for discrete/categorical features

  3. Support Vector Regression (SVR) - Robust to noise

  4. Multi-Layer Perceptron (MLP) - Neural network approach

  5. AdaBoost - Ensemble method

6.2. Gaussian Process (GaussianProcess)#

6.2.1. Theory#

Gaussian Processes are the gold standard for Bayesian optimization. They provide both predictions and uncertainty estimates in a principled way.

from Bgolearn import BGOsampling

# Use Gaussian Process (default)
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='GaussianProcess'  # Explicit specification
)
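
Under the hood, a GP posterior provides a mean prediction and a standard deviation for every candidate point. The following standalone scikit-learn sketch illustrates the idea (it is not Bgolearn's internal code):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[2.0], [3.5], [1.8], [4.2]])
y_train = np.array([250, 280, 240, 290])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation at new candidate points;
# acquisition functions such as EI are built from these two quantities
X_new = np.array([[2.5], [3.0], [3.8]])
mean, std = gp.predict(X_new, return_std=True)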

6.2.2. Advantages#

GP Strengths

  • Uncertainty Quantification: Provides prediction uncertainty

  • Theoretical Foundation: Well-established Bayesian framework

  • Smooth Interpolation: Works well for continuous functions

  • Few Hyperparameters: Relatively easy to tune

  • Global Optimization: Good for finding global optima

6.2.3. Limitations#

GP Limitations

  • Computational Cost: O(n³) scaling with training data

  • Smoothness Assumption: Assumes smooth underlying function

  • High-Dimensional Challenges: Struggles with many features (>20)

  • Categorical Features: Not naturally suited for discrete variables

6.2.4. Best Use Cases#

  • Continuous optimization problems

  • Smooth objective functions

  • Small to medium datasets (<1000 samples)

  • When uncertainty is important

  • Materials property optimization

6.2.5. Example: Alloy Optimization with GP#

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Alloy composition optimization - use pandas DataFrame
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],  # Cu, Mg, Si
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290])  # MPa

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# GP optimization
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,  # DataFrame
    Measured_response=strength_values,  # Series
    virtual_samples=virtual_samples,  # DataFrame
    Classifier='GaussianProcess',
    CV_test=2,  # 2-fold cross-validation
    Normalize=True
)

print("GP optimization completed!")

6.3. Random Forest (RandomForest)#

6.3.1. Theory#

Random Forest builds multiple decision trees and averages their predictions. It’s particularly good for handling discrete features and non-linear relationships.

# Use Random Forest
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='RandomForest'
)
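
A forest has no analytic posterior, but the spread of the individual trees' predictions is a common rough uncertainty proxy. A standalone scikit-learn sketch of that idea (not Bgolearn internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[450, 2, 1], [500, 4, 2], [550, 6, 3], [480, 3, 1]])
y = np.array([180, 220, 250, 200])

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Collect per-tree predictions; their std is a crude uncertainty estimate
X_new = np.array([[475, 3.5, 1]])
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)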

6.3.2. Advantages#

RF Strengths

  • Handles Discrete Features: Natural for categorical variables

  • Non-linear Relationships: Captures complex patterns

  • Robust to Outliers: Less sensitive to noisy data

  • Fast Training: Efficient for large datasets

  • Feature Importance: Provides feature ranking

  • No Assumptions: Doesn’t assume function smoothness

6.3.3. Limitations#

RF Limitations

  • Limited Uncertainty: Poor uncertainty quantification

  • Overfitting Risk: Can overfit with small datasets

  • Discontinuous: Creates step-wise predictions

  • Hyperparameter Tuning: Many parameters to optimize

6.3.4. Best Use Cases#

  • Discrete/categorical features

  • Large datasets (>1000 samples)

  • Non-smooth functions

  • When robustness is important

  • Mixed variable types

6.3.5. Example: Processing Parameter Optimization#

# Processing parameters with discrete levels - use pandas DataFrame
processing_data = pd.DataFrame([
    [450, 2, 1],    # Temperature, Time, Atmosphere (1=N2, 2=Ar, 3=Air)
    [500, 4, 2],
    [550, 6, 3],
    [480, 3, 1]
], columns=['Temperature', 'Time', 'Atmosphere'])

hardness_values = pd.Series([180, 220, 250, 200])

virtual_processing = pd.DataFrame([
    [475, 3.5, 1],
    [525, 4.5, 2],
    [490, 2.5, 3]
], columns=['Temperature', 'Time', 'Atmosphere'])

# Random Forest for mixed variables
model = opt.fit(
    data_matrix=processing_data,  # DataFrame
    Measured_response=hardness_values,  # Series
    virtual_samples=virtual_processing,  # DataFrame
    Classifier='RandomForest',
    CV_test='LOOCV'  # Leave-one-out cross-validation
)

print("Random Forest optimization completed!")

6.4. Support Vector Regression (SVR)#

6.4.1. Theory#

SVR uses support vector machines for regression, finding a function that deviates from targets by at most ε while being as flat as possible.

# Use SVR
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='SVR'
)
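
The ε-tube and flatness trade-off map directly onto scikit-learn's SVR parameters. A standalone sketch (the parameter values are illustrative, not tuned):

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.random((20, 8))
y = np.random.random(20) * 100 + 200

# epsilon sets the width of the no-penalty tube around the targets;
# C trades off flatness against fitting errors outside the tube
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.5))
svr.fit(X, y)
print(svr.predict(X[:3]))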

6.4.2. Advantages#

SVR Strengths

  • Robust to Noise: Handles noisy data well

  • High-Dimensional: Works with many features

  • Kernel Flexibility: Different kernel functions available

  • Sparse Solution: Uses only support vectors

  • Regularization: Built-in overfitting protection

6.4.3. Limitations#

SVR Limitations

  • Parameter Sensitivity: Requires careful tuning

  • No Uncertainty: Doesn’t provide prediction uncertainty

  • Kernel Selection: Choosing the right kernel is crucial

  • Computational Cost: Can be slow for large datasets

6.4.4. Best Use Cases#

  • Noisy data

  • High-dimensional problems

  • When robustness is critical

  • Non-linear relationships

6.4.5. Example: High-Dimensional Optimization#

# High-dimensional alloy with many elements - use pandas DataFrame
element_names = ['Al', 'Cu', 'Mg', 'Si', 'Fe', 'Mn', 'Cr', 'Zn']
high_dim_data = pd.DataFrame(
    np.random.random((20, 8)),  # 8 alloying elements
    columns=element_names
)
high_dim_response = pd.Series(np.random.random(20) * 100 + 200)

high_dim_virtual = pd.DataFrame(
    np.random.random((50, 8)),
    columns=element_names
)

# SVR for high-dimensional problem
model = opt.fit(
    data_matrix=high_dim_data,  # DataFrame
    Measured_response=high_dim_response,  # Series
    virtual_samples=high_dim_virtual,  # DataFrame
    Classifier='SVR',
    Normalize=True
)

print("SVR optimization completed!")

6.5. Multi-Layer Perceptron (MLP)#

6.5.1. Theory#

MLP is a neural network with multiple hidden layers that can approximate complex non-linear functions.

# Use MLP
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='MLP'
)
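
The "flexible architecture" point refers to the hidden-layer layout. A standalone scikit-learn sketch with an assumed two-layer network (not Bgolearn's internal configuration):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.random((500, 3))
y = X @ np.array([10.0, 5.0, 2.0]) + np.random.normal(0, 0.5, 500)

# Two hidden layers (64 and 32 units); scaling the inputs stabilizes training
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=42),
)
mlp.fit(X, y)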

6.5.2. Advantages#

MLP Strengths

  • Universal Approximator: Can approximate any continuous function

  • Non-linear Modeling: Excellent for complex relationships

  • Scalable: Can handle large datasets

  • Flexible Architecture: Customizable network structure

6.5.3. Limitations#

MLP Limitations

  • Requires Large Data: Needs substantial training data

  • Hyperparameter Tuning: Many parameters to optimize

  • Overfitting Risk: Can easily overfit small datasets

  • No Uncertainty: Doesn’t provide uncertainty estimates

  • Training Instability: Can be sensitive to initialization

6.5.4. Best Use Cases#

  • Large datasets (>500 samples)

  • Complex non-linear relationships

  • When sufficient data is available

  • Pattern recognition tasks

6.6. AdaBoost#

6.6.1. Theory#

AdaBoost (Adaptive Boosting) combines multiple weak learners to create a strong predictor.

# Use AdaBoost
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='AdaBoost'
)
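
The boosting idea: each new weak learner is trained with more weight on the examples its predecessors predicted poorly. A standalone scikit-learn sketch (hyperparameters are illustrative):

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

X = np.random.random((100, 3))
y = 250 + 40 * X[:, 0] - 20 * X[:, 1] + np.random.normal(0, 2, 100)

# Shallow trees as weak learners; boosting reweights hard examples each round
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
).fit(X, y)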

6.6.2. Advantages#

AdaBoost Strengths

  • Ensemble Method: Combines multiple models

  • Adaptive: Focuses on difficult examples

  • Reduces Bias: Often improves prediction accuracy

  • Versatile: Works with different base learners

6.6.3. Limitations#

AdaBoost Limitations

  • Sensitive to Noise: Can overfit noisy data

  • Computational Cost: Slower than single models

  • Parameter Tuning: Requires careful tuning

  • Limited Uncertainty: Poor uncertainty quantification

6.7. Model Selection Guidelines#

6.7.1. Decision Tree#

Start
  │
  ├─ Need uncertainty estimates?
  │     └─ Yes ─► Is the underlying function smooth?
  │                 ├─ Yes ─► Gaussian Process
  │                 └─ No ──► Random Forest
  │
  └─ No ─► Mostly discrete/categorical features?
              ├─ Yes ─► Random Forest
              └─ No ──► Large dataset (>500 samples)?
                          ├─ Yes ─► MLP
                          └─ No ──► SVR
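
The same rules as a small helper function (a hypothetical convenience for illustration, not part of Bgolearn's API):

def suggest_classifier(need_uncertainty: bool, smooth: bool,
                       discrete_features: bool, n_samples: int) -> str:
    """Map the decision tree above onto a Bgolearn Classifier string."""
    if need_uncertainty:
        return 'GaussianProcess' if smooth else 'RandomForest'
    if discrete_features:
        return 'RandomForest'
    return 'MLP' if n_samples > 500 else 'SVR'

print(suggest_classifier(True, True, False, 100))    # GaussianProcess
print(suggest_classifier(False, False, True, 2000))  # RandomForest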

6.7.2. Detailed Selection Criteria#

Table 6.1 Model Selection Guide#

Criterion          Gaussian Process   Random Forest   SVR          MLP          AdaBoost
----------------   ----------------   -------------   ----------   ----------   --------
Data Size          <1000              >500            >500         >500         >500
Feature Types      Continuous         Mixed           Continuous   Continuous   Mixed
Uncertainty        Excellent          Poor            None         None         Poor
Noise Tolerance    Medium             High            High         Medium       Low
Training Speed     Slow               Fast            Medium       Slow         Medium
Prediction Speed   Fast               Fast            Fast         Fast         Medium

6.8. Practical Examples#

6.8.1. Model Comparison#

# Create test data for model comparison
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6]
], columns=['Cu', 'Mg', 'Si'])

measured_response = pd.Series([250, 280, 240, 290, 265])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# Compare all models on the same problem
models = ['GaussianProcess', 'RandomForest', 'SVR', 'MLP', 'AdaBoost']
results = {}

for model_name in models:
    print(f"Testing {model_name}...")

    try:
        model = opt.fit(
            data_matrix=data_matrix,  # DataFrame
            Measured_response=measured_response,  # Series
            virtual_samples=virtual_samples,  # DataFrame
            Classifier=model_name,
            CV_test=5,  # 5-fold cross-validation
            Normalize=True
        )
        
        # Get results using EI
        ei_values, recommended_points = model.EI()
        recommended_point = recommended_points[0]

        # Cross-validation score, if the fitted model exposes one
        # (the attribute name 'cv_score_' is speculative; defaults to 0.0)
        cv_score = getattr(model, 'cv_score_', 0.0)

        results[model_name] = {
            'recommended_point': recommended_point,
            'ei_max': np.max(ei_values),
            'cv_score': cv_score,
            'success': True
        }

        print(f"  Success: EI max = {np.max(ei_values):.3f}")
        
    except Exception as e:
        print(f"  Failed: {str(e)}")
        results[model_name] = {'success': False, 'error': str(e)}

# Display comparison
print("\nModel Performance Comparison:")
print("-" * 50)
for model, result in results.items():
    if result['success']:
        print(f"{model:15s}: Success - EI max = {result['ei_max']:.3f}")
    else:
        print(f"{model:15s}: Failed - {result.get('error', 'Unknown error')}")

6.8.2. Hyperparameter Tuning#

# Hyperparameter tuning in Bgolearn
import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5, 450],  # Cu, Mg, Si, Temperature
    [3.5, 0.8, 0.7, 500],
    [1.8, 1.5, 0.3, 480],
    [4.2, 0.9, 0.8, 520],
    [2.8, 1.1, 0.6, 490],
    [3.2, 1.3, 0.4, 510],
    [2.5, 0.9, 0.7, 470],
    [3.8, 1.0, 0.5, 530]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275, 245, 295])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6, 485],
    [3.0, 1.3, 0.4, 505],
    [3.8, 0.9, 0.8, 515]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

# Hyperparameter tuning: compare different cross-validation settings
cv_settings = [3, 5, 10, 'LOOCV']
normalization_settings = [True, False]

best_score = -np.inf
best_config = None
results = {}

opt = BGOsampling.Bgolearn()

print("🔧 Starting hyperparameter tuning...")
print("=" * 50)

for cv_test in cv_settings:
    for normalize in normalization_settings:
        config_name = f"CV_{cv_test}_Norm_{normalize}"
        print(f"Testing configuration: {config_name}")

        try:
            model = opt.fit(
                data_matrix=data_matrix,
                Measured_response=strength_values,
                virtual_samples=virtual_samples,
                Classifier='GaussianProcess',
                CV_test=cv_test,
                Normalize=normalize,
                seed=42  # Ensure reproducibility
            )

            # Get EI values as performance metric
            ei_values, recommended_points = model.EI()
            max_ei = np.max(ei_values)

            # Calculate prediction quality
            predicted_mean = model.virtual_samples_mean
            predicted_std = model.virtual_samples_std

            # Combined score: EI max + inverse of prediction uncertainty
            uncertainty_score = 1.0 / (np.mean(predicted_std) + 1e-6)
            combined_score = max_ei + 0.1 * uncertainty_score

            results[config_name] = {
                'max_ei': max_ei,
                'mean_uncertainty': np.mean(predicted_std),
                'combined_score': combined_score,
                'recommended_point': recommended_points[0],
                'success': True
            }

            if combined_score > best_score:
                best_score = combined_score
                best_config = config_name

            print(f"  ✅ Success - EI max: {max_ei:.3f}, Combined score: {combined_score:.3f}")

        except Exception as e:
            print(f"  ❌ Failed: {str(e)}")
            results[config_name] = {'success': False, 'error': str(e)}

# Display tuning results
print(f"\n📊 Hyperparameter tuning results:")
print("=" * 50)
print(f"{'Configuration':<20} {'EI Max':<10} {'Mean Uncertainty':<15} {'Combined Score':<15}")
print("-" * 60)

for config, result in results.items():
    if result['success']:
        print(f"{config:<20} {result['max_ei']:<10.3f} {result['mean_uncertainty']:<15.3f} {result['combined_score']:<15.3f}")
    else:
        print(f"{config:<20} {'Failed':<10} {'-':<15} {'-':<15}")

print(f"\n🏆 Best configuration: {best_config}")
print(f"🎯 Best score: {best_score:.3f}")

if best_config and results[best_config]['success']:
    best_point = results[best_config]['recommended_point']
    print(f"🔍 Next experiment point recommended by best configuration:")
    print(f"   Cu: {best_point[0]:.2f}%, Mg: {best_point[1]:.2f}%, Si: {best_point[2]:.2f}%, T: {best_point[3]:.0f}K")

6.9. Advanced Topics#

6.9.1. Ensemble Methods#

Ensemble methods can improve performance by combining several surrogate models; the example below builds one on top of Bgolearn:

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6],
    [3.2, 1.3, 0.4]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8],
    [2.2, 1.4, 0.5],
    [3.6, 0.7, 0.6]
], columns=['Cu', 'Mg', 'Si'])

# Define ensemble method class
class BgolearnEnsemble:
    """Bgolearn ensemble method implementation"""

    def __init__(self, model_types=['GaussianProcess', 'RandomForest', 'SVR']):
        self.model_types = model_types
        self.models = {}
        self.weights = None

    def fit(self, data_matrix, measured_response, virtual_samples, **kwargs):
        """Train all models"""
        print("🔧 Training ensemble models...")

        opt = BGOsampling.Bgolearn()

        for model_type in self.model_types:
            print(f"  Training {model_type}...")
            try:
                model = opt.fit(
                    data_matrix=data_matrix,
                    Measured_response=measured_response,
                    virtual_samples=virtual_samples,
                    Classifier=model_type,
                    CV_test=5,
                    Normalize=True,
                    **kwargs
                )
                self.models[model_type] = model
                print(f"  ✅ {model_type} training successful")
            except Exception as e:
                print(f"  ❌ {model_type} training failed: {e}")

        # Calculate model weights (based on EI performance)
        self._calculate_weights()

    def _calculate_weights(self):
        """Calculate model weights based on EI performance"""
        if not self.models:
            return

        ei_scores = {}
        for name, model in self.models.items():
            try:
                ei_values, _ = model.EI()
                ei_scores[name] = np.max(ei_values)
            except Exception:
                ei_scores[name] = 0.0

        # Normalize weights
        total_score = sum(ei_scores.values())
        if total_score > 0:
            self.weights = {name: score/total_score for name, score in ei_scores.items()}
        else:
            # Equal weights
            self.weights = {name: 1.0/len(self.models) for name in self.models.keys()}

        print(f"📊 Model weights: {self.weights}")

    def ensemble_EI(self):
        """Ensemble expected improvement"""
        if not self.models:
            raise ValueError("No trained models available")

        ensemble_ei = None
        ensemble_points = None

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                weight = self.weights.get(name, 0.0)

                if ensemble_ei is None:
                    ensemble_ei = weight * ei_values
                    ensemble_points = points
                else:
                    ensemble_ei += weight * ei_values

            except Exception as e:
                print(f"⚠️ {name} EI calculation failed: {e}")

        # Find the point corresponding to maximum EI
        if ensemble_ei is not None:
            max_idx = np.argmax(ensemble_ei)
            best_point = ensemble_points[max_idx:max_idx+1]  # Maintain dimensions
            return ensemble_ei, best_point
        else:
            raise ValueError("All model EI calculations failed")

    def ensemble_predictions(self):
        """Ensemble predictions"""
        predictions = {}
        uncertainties = {}

        for name, model in self.models.items():
            try:
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                predictions[name] = pred_mean
                uncertainties[name] = pred_std

            except Exception as e:
                print(f"⚠️ {name} prediction failed: {e}")

        if not predictions:
            raise ValueError("All model predictions failed")

        # Weighted-average the per-model means; stds are combined the same
        # way as a simple heuristic (not a formal variance combination)
        ensemble_mean = np.zeros_like(list(predictions.values())[0])
        ensemble_std = np.zeros_like(list(uncertainties.values())[0])

        for name, pred in predictions.items():
            weight = self.weights.get(name, 0.0)
            ensemble_mean += weight * pred
            ensemble_std += weight * uncertainties[name]

        return ensemble_mean, ensemble_std

    def get_model_comparison(self):
        """Get model comparison results"""
        comparison = {}

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                comparison[name] = {
                    'max_ei': np.max(ei_values),
                    'mean_prediction': np.mean(pred_mean),
                    'mean_uncertainty': np.mean(pred_std),
                    'weight': self.weights.get(name, 0.0),
                    'recommended_point': points[0]
                }
            except Exception as e:
                comparison[name] = {'error': str(e)}

        return comparison

# Using ensemble methods
print("🚀 Starting ensemble method demonstration")
print("=" * 50)

# Create and train ensemble model
ensemble = BgolearnEnsemble(
    model_types=['GaussianProcess', 'RandomForest', 'SVR']
)

ensemble.fit(
    data_matrix=data_matrix,
    measured_response=strength_values,
    virtual_samples=virtual_samples,
    seed=42
)

# Get ensemble predictions
print("\n📊 Ensemble prediction results:")
try:
    ensemble_mean, ensemble_std = ensemble.ensemble_predictions()
    print(f"Ensemble prediction mean: {ensemble_mean[:3]}")  # Show first 3
    print(f"Ensemble prediction std: {ensemble_std[:3]}")
except Exception as e:
    print(f"Ensemble prediction failed: {e}")

# Get ensemble EI
print("\n🎯 Ensemble Expected Improvement:")
try:
    ensemble_ei, ensemble_point = ensemble.ensemble_EI()
    max_ei_idx = np.argmax(ensemble_ei)
    print(f"Maximum ensemble EI value: {ensemble_ei[max_ei_idx]:.3f}")
    print(f"Recommended next experiment point: Cu={ensemble_point[0][0]:.2f}, Mg={ensemble_point[0][1]:.2f}, Si={ensemble_point[0][2]:.2f}")
except Exception as e:
    print(f"Ensemble EI calculation failed: {e}")

# Model comparison
print("\n📈 Model performance comparison:")
comparison = ensemble.get_model_comparison()
print(f"{'Model':<15} {'Max EI':<10} {'Weight':<8} {'Mean Uncertainty':<15}")
print("-" * 55)

for name, metrics in comparison.items():
    if 'error' not in metrics:
        print(f"{name:<15} {metrics['max_ei']:<10.3f} {metrics['weight']:<8.3f} {metrics['mean_uncertainty']:<15.3f}")
    else:
        print(f"{name:<15} {'Failed':<10} {'-':<8} {'-':<15}")

print("\n💡 Ensemble method advantages:")
print("  - Combines strengths of multiple models")
print("  - Improves prediction stability")
print("  - Reduces single model bias")
print("  - Automatic weight allocation")

6.9.2. Transfer Learning#

Use pre-trained models for related problems:

# Conceptual transfer learning
def transfer_learning(source_model, target_data, target_response):
    """Transfer knowledge from source to target problem."""
    # Use source model as initialization
    # Fine-tune on target data
    pass
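
A minimal, runnable sketch of one common pattern, pooling data from a related (source) system with the few target measurements before fitting; this is an assumed workflow, not a built-in Bgolearn feature:

import pandas as pd

# Hypothetical data from a related alloy system (source domain)
source_X = pd.DataFrame([[2.1, 1.1, 0.5], [3.4, 0.9, 0.6]],
                        columns=['Cu', 'Mg', 'Si'])
source_y = pd.Series([248, 278])

# A few measurements on the new system (target domain)
target_X = pd.DataFrame([[2.6, 1.2, 0.4]], columns=['Cu', 'Mg', 'Si'])
target_y = pd.Series([255])

# Pool both datasets so the surrogate starts from related knowledge,
# then pass pooled_X / pooled_y to opt.fit(...) as usual
pooled_X = pd.concat([source_X, target_X], ignore_index=True)
pooled_y = pd.concat([source_y, target_y], ignore_index=True)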

6.9.3. Online Learning#

Update models with new data:

# Conceptual online learning
def online_update(model, new_x, new_y):
    """Update model with new observation."""
    # Add new data point
    # Retrain or update model incrementally
    pass
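
In practice the simplest pattern is append-and-refit: add the new observation to the training DataFrame and call fit again. A sketch under that assumption:

import pandas as pd

def append_and_refit(opt, data_matrix, measured_response, virtual_samples,
                     new_x, new_y):
    """Append one new observation and refit the surrogate from scratch."""
    new_row = pd.DataFrame([new_x], columns=data_matrix.columns)
    data_matrix = pd.concat([data_matrix, new_row], ignore_index=True)
    measured_response = pd.concat(
        [measured_response, pd.Series([new_y])], ignore_index=True
    )
    model = opt.fit(
        data_matrix=data_matrix,
        Measured_response=measured_response,
        virtual_samples=virtual_samples,
        Classifier='GaussianProcess',
    )
    return model, data_matrix, measured_response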

6.10. Troubleshooting#

6.10.1. Common Issues#

  1. Poor CV Scores

    • Try different models

    • Check data quality

    • Increase training data

    • Adjust normalization

  2. Slow Training

    • Use Random Forest for large data

    • Reduce virtual space size (see the sketch after this list)

    • Consider SVR for high dimensions

  3. Overfitting

    • Use cross-validation

    • Reduce model complexity

    • Add more training data

  4. Poor Uncertainty

    • Use Gaussian Process

    • Increase bootstrap iterations

    • Check model assumptions
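
For the "reduce virtual space size" remedy, a random subsample of the candidate DataFrame is often enough; a minimal sketch:

# Randomly subsample a large virtual space to speed up model evaluation
virtual_subset = virtual_samples.sample(
    n=min(5000, len(virtual_samples)), random_state=42
)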

6.10.2. Performance Optimization#

# Tips for better performance
optimization_tips = {
    "Data preprocessing": "Normalize features, remove outliers",
    "Model selection": "Start with GP, try RF for discrete features",
    "Hyperparameters": "Use cross-validation for tuning",
    "Computational": "Reduce virtual space size for speed",
    "Validation": "Use CV_test=10 or 'LOOCV' for validation"
}

6.11. Next Steps#

See also

For more on surrogate modeling:

  • Rasmussen, C.E. “Gaussian Processes for Machine Learning”

  • Forrester, A. “Engineering Design via Surrogate Modelling”

  • Queipo, N.V. “Surrogate-based Analysis and Optimization”