6. Surrogate Models in Bgolearn#

Warning

Important: All data must be pandas DataFrames/Series, not numpy arrays!

  • data_matrix: pd.DataFrame with column names

  • Measured_response: pd.Series

  • virtual_samples: pd.DataFrame with same column names

Using numpy arrays will cause: AttributeError: 'numpy.ndarray' object has no attribute 'columns'
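
If your data starts out as numpy arrays, wrap it in pandas objects before calling fit. A minimal sketch (the column names here are illustrative):

import numpy as np
import pandas as pd

X = np.array([[2.0, 1.2, 0.5], [3.5, 0.8, 0.7]])  # raw numpy features
y = np.array([250, 280])                           # raw numpy targets

# Wrap arrays with explicit column names before passing them to fit()
data_matrix = pd.DataFrame(X, columns=['Cu', 'Mg', 'Si'])
measured_response = pd.Series(y)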

Note

This page explains the different surrogate models available in Bgolearn and how to choose the right one for your optimization problem.

6.1. Overview#

Surrogate models (also called metamodels) are the core of Bayesian optimization. They approximate the expensive-to-evaluate objective function and provide uncertainty estimates that guide the optimization process.

Bgolearn supports several surrogate models, each with different strengths and use cases:

  1. Gaussian Process (GP) - Default and most common

  2. Random Forest (RF) - Good for discrete/categorical features

  3. Support Vector Regression (SVR) - Robust to noise

  4. Multi-Layer Perceptron (MLP) - Neural network approach

  5. AdaBoost - Ensemble method

6.2. Gaussian Process (GaussianProcess)#

6.2.1. Theory#

Gaussian Processes are the gold standard for Bayesian optimization. They provide both predictions and uncertainty estimates in a principled way.

from Bgolearn import BGOsampling

# Use Gaussian Process (default)
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='GaussianProcess'  # Explicit specification
)
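
Under the hood, a GP posterior provides a mean prediction and a standard deviation for every candidate point. The following standalone scikit-learn sketch illustrates the idea (it is not Bgolearn's internal code):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[2.0], [3.5], [1.8], [4.2]])
y_train = np.array([250, 280, 240, 290])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation at new candidate points;
# acquisition functions such as EI are built from these two quantities
X_new = np.array([[2.5], [3.0], [3.8]])
mean, std = gp.predict(X_new, return_std=True)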

6.2.2. Advantages#

GP Strengths

  • Uncertainty Quantification: Provides prediction uncertainty

  • Theoretical Foundation: Well-established Bayesian framework

  • Smooth Interpolation: Works well for continuous functions

  • Few Hyperparameters: Relatively easy to tune

  • Global Optimization: Good for finding global optima

6.2.3. Limitations#

GP Limitations

  • Computational Cost: O(n³) scaling with training data

  • Smoothness Assumption: Assumes smooth underlying function

  • High-Dimensional Challenges: Struggles with many features (>20)

  • Categorical Features: Not naturally suited for discrete variables

6.2.4. Best Use Cases#

  • Continuous optimization problems

  • Smooth objective functions

  • Small to medium datasets (<1000 samples)

  • When uncertainty is important

  • Materials property optimization

6.2.5. Example: Alloy Optimization with GP#

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Alloy composition optimization - use pandas DataFrame
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],  # Cu, Mg, Si
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290])  # MPa

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# GP optimization
opt = BGOsampling.Bgolearn()
model = opt.fit(
    data_matrix=data_matrix,  # DataFrame
    Measured_response=strength_values,  # Series
    virtual_samples=virtual_samples,  # DataFrame
    Classifier='GaussianProcess',
    CV_test=2,  # 2-fold cross-validation
    Normalize=True
)

print("GP optimization completed!")

6.3. Random Forest (RandomForest)#

6.3.1. Theory#

Random Forest builds multiple decision trees and averages their predictions. It’s particularly good for handling discrete features and non-linear relationships.

# Use Random Forest
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='RandomForest'
)
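
A forest has no analytic posterior, but the spread of the individual trees' predictions is a common rough uncertainty proxy. A standalone scikit-learn sketch of that idea (not Bgolearn internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[450, 2, 1], [500, 4, 2], [550, 6, 3], [480, 3, 1]])
y = np.array([180, 220, 250, 200])

rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Collect per-tree predictions; their std is a crude uncertainty estimate
X_new = np.array([[475, 3.5, 1]])
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)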

6.3.2. Advantages#

RF Strengths

  • Handles Discrete Features: Natural for categorical variables

  • Non-linear Relationships: Captures complex patterns

  • Robust to Outliers: Less sensitive to noisy data

  • Fast Training: Efficient for large datasets

  • Feature Importance: Provides feature ranking

  • No Assumptions: Doesn’t assume function smoothness

6.3.3. Limitations#

RF Limitations

  • Limited Uncertainty: Poor uncertainty quantification

  • Overfitting Risk: Can overfit with small datasets

  • Discontinuous: Creates step-wise predictions

  • Hyperparameter Tuning: Many parameters to optimize

6.3.4. Best Use Cases#

  • Discrete/categorical features

  • Large datasets (>1000 samples)

  • Non-smooth functions

  • When robustness is important

  • Mixed variable types

6.3.5. Example: Processing Parameter Optimization#

# Processing parameters with discrete levels - use pandas DataFrame
processing_data = pd.DataFrame([
    [450, 2, 1],    # Temperature, Time, Atmosphere (1=N2, 2=Ar, 3=Air)
    [500, 4, 2],
    [550, 6, 3],
    [480, 3, 1]
], columns=['Temperature', 'Time', 'Atmosphere'])

hardness_values = pd.Series([180, 220, 250, 200])

virtual_processing = pd.DataFrame([
    [475, 3.5, 1],
    [525, 4.5, 2],
    [490, 2.5, 3]
], columns=['Temperature', 'Time', 'Atmosphere'])

# Random Forest for mixed variables
model = opt.fit(
    data_matrix=processing_data,  # DataFrame
    Measured_response=hardness_values,  # Series
    virtual_samples=virtual_processing,  # DataFrame
    Classifier='RandomForest',
    CV_test='LOOCV'  # Leave-one-out cross-validation
)

print("Random Forest optimization completed!")

6.4. Support Vector Regression (SVR)#

6.4.1. Theory#

SVR uses support vector machines for regression, finding a function that deviates from targets by at most ε while being as flat as possible.

# Use SVR
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='SVR'
)
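
The ε-tube and flatness trade-off map directly onto scikit-learn's SVR parameters. A standalone sketch (the parameter values are illustrative, not tuned):

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.random((20, 8))
y = np.random.random(20) * 100 + 200

# epsilon sets the width of the no-penalty tube around the targets;
# C trades off flatness against fitting errors outside the tube
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.5))
svr.fit(X, y)
print(svr.predict(X[:3]))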

6.4.2. Advantages#

SVR Strengths

  • Robust to Noise: Handles noisy data well

  • High-Dimensional: Works with many features

  • Kernel Flexibility: Different kernel functions available

  • Sparse Solution: Uses only support vectors

  • Regularization: Built-in overfitting protection

6.4.3. Limitations#

SVR Limitations

  • Parameter Sensitivity: Requires careful tuning

  • No Uncertainty: Doesn’t provide prediction uncertainty

  • Kernel Selection: Choosing the right kernel is crucial

  • Computational Cost: Can be slow for large datasets

6.4.4. Best Use Cases#

  • Noisy data

  • High-dimensional problems

  • When robustness is critical

  • Non-linear relationships

6.4.5. Example: High-Dimensional Optimization#

# High-dimensional alloy with many elements - use pandas DataFrame
element_names = ['Al', 'Cu', 'Mg', 'Si', 'Fe', 'Mn', 'Cr', 'Zn']
high_dim_data = pd.DataFrame(
    np.random.random((20, 8)),  # 8 alloying elements
    columns=element_names
)
high_dim_response = pd.Series(np.random.random(20) * 100 + 200)

high_dim_virtual = pd.DataFrame(
    np.random.random((50, 8)),
    columns=element_names
)

# SVR for high-dimensional problem
model = opt.fit(
    data_matrix=high_dim_data,  # DataFrame
    Measured_response=high_dim_response,  # Series
    virtual_samples=high_dim_virtual,  # DataFrame
    Classifier='SVR',
    Normalize=True
)

print("SVR optimization completed!")

6.5. Multi-Layer Perceptron (MLP)#

6.5.1. Theory#

MLP is a neural network with multiple hidden layers that can approximate complex non-linear functions.

# Use MLP
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='MLP'
)
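
The "flexible architecture" point refers to the hidden-layer layout. A standalone scikit-learn sketch with an assumed two-layer network (not Bgolearn's internal configuration):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = np.random.random((500, 3))
y = X @ np.array([10.0, 5.0, 2.0]) + np.random.normal(0, 0.5, 500)

# Two hidden layers (64 and 32 units); scaling the inputs stabilizes training
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=42),
)
mlp.fit(X, y)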

6.5.2. Advantages#

MLP Strengths

  • Universal Approximator: Can approximate any continuous function

  • Non-linear Modeling: Excellent for complex relationships

  • Scalable: Can handle large datasets

  • Flexible Architecture: Customizable network structure

6.5.3. Limitations#

MLP Limitations

  • Requires Large Data: Needs substantial training data

  • Hyperparameter Tuning: Many parameters to optimize

  • Overfitting Risk: Can easily overfit small datasets

  • No Uncertainty: Doesn’t provide uncertainty estimates

  • Training Instability: Can be sensitive to initialization

6.5.4. Best Use Cases#

  • Large datasets (>500 samples)

  • Complex non-linear relationships

  • When sufficient data is available

  • Pattern recognition tasks

6.6. AdaBoost#

6.6.1. Theory#

AdaBoost (Adaptive Boosting) combines multiple weak learners to create a strong predictor.

# Use AdaBoost
model = opt.fit(
    data_matrix=data_matrix,
    Measured_response=measured_response,
    virtual_samples=virtual_samples,
    Classifier='AdaBoost'
)
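
The boosting idea: each new weak learner is trained with more weight on the examples its predecessors predicted poorly. A standalone scikit-learn sketch (hyperparameters are illustrative):

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

X = np.random.random((100, 3))
y = 250 + 40 * X[:, 0] - 20 * X[:, 1] + np.random.normal(0, 2, 100)

# Shallow trees as weak learners; boosting reweights hard examples each round
ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
).fit(X, y)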

6.6.2. Advantages#

AdaBoost Strengths

  • Ensemble Method: Combines multiple models

  • Adaptive: Focuses on difficult examples

  • Reduces Bias: Often improves prediction accuracy

  • Versatile: Works with different base learners

6.6.3. Limitations#

AdaBoost Limitations

  • Sensitive to Noise: Can overfit noisy data

  • Computational Cost: Slower than single models

  • Parameter Tuning: Requires careful tuning

  • Limited Uncertainty: Poor uncertainty quantification

6.7. Model Selection Guidelines#

6.7.1. Decision Tree#

Start
  │
  ├─ Need uncertainty estimates?
  │     └─ Yes ─► Is the underlying function smooth?
  │                 ├─ Yes ─► Gaussian Process
  │                 └─ No ──► Random Forest
  │
  └─ No ─► Mostly discrete/categorical features?
              ├─ Yes ─► Random Forest
              └─ No ──► Large dataset (>500 samples)?
                          ├─ Yes ─► MLP
                          └─ No ──► SVR
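
The same rules as a small helper function (a hypothetical convenience for illustration, not part of Bgolearn's API):

def suggest_classifier(need_uncertainty: bool, smooth: bool,
                       discrete_features: bool, n_samples: int) -> str:
    """Map the decision tree above onto a Bgolearn Classifier string."""
    if need_uncertainty:
        return 'GaussianProcess' if smooth else 'RandomForest'
    if discrete_features:
        return 'RandomForest'
    return 'MLP' if n_samples > 500 else 'SVR'

print(suggest_classifier(True, True, False, 100))    # GaussianProcess
print(suggest_classifier(False, False, True, 2000))  # RandomForest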

6.7.2. Detailed Selection Criteria#

Table 6.1 Model Selection Guide#

Criterion          Gaussian Process   Random Forest   SVR          MLP          AdaBoost
----------------   ----------------   -------------   ----------   ----------   --------
Data Size          <1000              >500            >500         >500         >500
Feature Types      Continuous         Mixed           Continuous   Continuous   Mixed
Uncertainty        Excellent          Poor            None         None         Poor
Noise Tolerance    Medium             High            High         Medium       Low
Training Speed     Slow               Fast            Medium       Slow         Medium
Prediction Speed   Fast               Fast            Fast         Fast         Medium

6.8. Practical Examples#

6.8.1. Model Comparison#

# Create test data for model comparison
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6]
], columns=['Cu', 'Mg', 'Si'])

measured_response = pd.Series([250, 280, 240, 290, 265])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8]
], columns=['Cu', 'Mg', 'Si'])

# Compare all models on the same problem
models = ['GaussianProcess', 'RandomForest', 'SVR', 'MLP', 'AdaBoost']
results = {}

for model_name in models:
    print(f"Testing {model_name}...")

    try:
        model = opt.fit(
            data_matrix=data_matrix,  # DataFrame
            Measured_response=measured_response,  # Series
            virtual_samples=virtual_samples,  # DataFrame
            Classifier=model_name,
            CV_test=5,  # 5-fold cross-validation
            Normalize=True
        )
        
        # Get results using EI
        ei_values, recommended_points = model.EI()
        recommended_point = recommended_points[0]

        # Cross-validation score, if the fitted model exposes one
        # (the attribute name 'cv_score_' is speculative; defaults to 0.0)
        cv_score = getattr(model, 'cv_score_', 0.0)

        results[model_name] = {
            'recommended_point': recommended_point,
            'ei_max': np.max(ei_values),
            'cv_score': cv_score,
            'success': True
        }

        print(f"  Success: EI max = {np.max(ei_values):.3f}")
        
    except Exception as e:
        print(f"  Failed: {str(e)}")
        results[model_name] = {'success': False, 'error': str(e)}

# Display comparison
print("\nModel Performance Comparison:")
print("-" * 50)
for model, result in results.items():
    if result['success']:
        print(f"{model:15s}: Success - EI max = {result['ei_max']:.3f}")
    else:
        print(f"{model:15s}: Failed - {result.get('error', 'Unknown error')}")

6.8.2. Hyperparameter Tuning#

# Hyperparameter tuning in Bgolearn
import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5, 450],  # Cu, Mg, Si, Temperature
    [3.5, 0.8, 0.7, 500],
    [1.8, 1.5, 0.3, 480],
    [4.2, 0.9, 0.8, 520],
    [2.8, 1.1, 0.6, 490],
    [3.2, 1.3, 0.4, 510],
    [2.5, 0.9, 0.7, 470],
    [3.8, 1.0, 0.5, 530]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275, 245, 295])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6, 485],
    [3.0, 1.3, 0.4, 505],
    [3.8, 0.9, 0.8, 515]
], columns=['Cu', 'Mg', 'Si', 'Temperature'])

# Hyperparameter tuning: compare different cross-validation settings
cv_settings = [3, 5, 10, 'LOOCV']
normalization_settings = [True, False]

best_score = -np.inf
best_config = None
results = {}

opt = BGOsampling.Bgolearn()

print("🔧 Starting hyperparameter tuning...")
print("=" * 50)

for cv_test in cv_settings:
    for normalize in normalization_settings:
        config_name = f"CV_{cv_test}_Norm_{normalize}"
        print(f"Testing configuration: {config_name}")

        try:
            model = opt.fit(
                data_matrix=data_matrix,
                Measured_response=strength_values,
                virtual_samples=virtual_samples,
                Classifier='GaussianProcess',
                CV_test=cv_test,
                Normalize=normalize,
                seed=42  # Ensure reproducibility
            )

            # Get EI values as performance metric
            ei_values, recommended_points = model.EI()
            max_ei = np.max(ei_values)

            # Calculate prediction quality
            predicted_mean = model.virtual_samples_mean
            predicted_std = model.virtual_samples_std

            # Combined score: EI max + inverse of prediction uncertainty
            uncertainty_score = 1.0 / (np.mean(predicted_std) + 1e-6)
            combined_score = max_ei + 0.1 * uncertainty_score

            results[config_name] = {
                'max_ei': max_ei,
                'mean_uncertainty': np.mean(predicted_std),
                'combined_score': combined_score,
                'recommended_point': recommended_points[0],
                'success': True
            }

            if combined_score > best_score:
                best_score = combined_score
                best_config = config_name

            print(f"  ✅ Success - EI max: {max_ei:.3f}, Combined score: {combined_score:.3f}")

        except Exception as e:
            print(f"  ❌ Failed: {str(e)}")
            results[config_name] = {'success': False, 'error': str(e)}

# Display tuning results
print(f"\n📊 Hyperparameter tuning results:")
print("=" * 50)
print(f"{'Configuration':<20} {'EI Max':<10} {'Mean Uncertainty':<15} {'Combined Score':<15}")
print("-" * 60)

for config, result in results.items():
    if result['success']:
        print(f"{config:<20} {result['max_ei']:<10.3f} {result['mean_uncertainty']:<15.3f} {result['combined_score']:<15.3f}")
    else:
        print(f"{config:<20} {'Failed':<10} {'-':<15} {'-':<15}")

print(f"\n🏆 Best configuration: {best_config}")
print(f"🎯 Best score: {best_score:.3f}")

if best_config and results[best_config]['success']:
    best_point = results[best_config]['recommended_point']
    print(f"🔍 Next experiment point recommended by best configuration:")
    print(f"   Cu: {best_point[0]:.2f}%, Mg: {best_point[1]:.2f}%, Si: {best_point[2]:.2f}%, T: {best_point[3]:.0f}K")

6.9. Advanced Topics#

6.9.1. Ensemble Methods#

Ensemble methods can improve performance by combining several surrogate models; the example below builds one on top of Bgolearn:

import numpy as np
import pandas as pd
from Bgolearn import BGOsampling

# Create test data
data_matrix = pd.DataFrame([
    [2.0, 1.2, 0.5],
    [3.5, 0.8, 0.7],
    [1.8, 1.5, 0.3],
    [4.2, 0.9, 0.8],
    [2.8, 1.1, 0.6],
    [3.2, 1.3, 0.4]
], columns=['Cu', 'Mg', 'Si'])

strength_values = pd.Series([250, 280, 240, 290, 265, 275])

virtual_samples = pd.DataFrame([
    [2.5, 1.0, 0.6],
    [3.0, 1.3, 0.4],
    [3.8, 0.9, 0.8],
    [2.2, 1.4, 0.5],
    [3.6, 0.7, 0.6]
], columns=['Cu', 'Mg', 'Si'])

# Define ensemble method class
class BgolearnEnsemble:
    """Bgolearn ensemble method implementation"""

    def __init__(self, model_types=['GaussianProcess', 'RandomForest', 'SVR']):
        self.model_types = model_types
        self.models = {}
        self.weights = None

    def fit(self, data_matrix, measured_response, virtual_samples, **kwargs):
        """Train all models"""
        print("🔧 Training ensemble models...")

        opt = BGOsampling.Bgolearn()

        for model_type in self.model_types:
            print(f"  Training {model_type}...")
            try:
                model = opt.fit(
                    data_matrix=data_matrix,
                    Measured_response=measured_response,
                    virtual_samples=virtual_samples,
                    Classifier=model_type,
                    CV_test=5,
                    Normalize=True,
                    **kwargs
                )
                self.models[model_type] = model
                print(f"  ✅ {model_type} training successful")
            except Exception as e:
                print(f"  ❌ {model_type} training failed: {e}")

        # Calculate model weights (based on EI performance)
        self._calculate_weights()

    def _calculate_weights(self):
        """Calculate model weights based on EI performance"""
        if not self.models:
            return

        ei_scores = {}
        for name, model in self.models.items():
            try:
                ei_values, _ = model.EI()
                ei_scores[name] = np.max(ei_values)
            except Exception:
                ei_scores[name] = 0.0

        # Normalize weights
        total_score = sum(ei_scores.values())
        if total_score > 0:
            self.weights = {name: score/total_score for name, score in ei_scores.items()}
        else:
            # Equal weights
            self.weights = {name: 1.0/len(self.models) for name in self.models.keys()}

        print(f"📊 Model weights: {self.weights}")

    def ensemble_EI(self):
        """Ensemble expected improvement"""
        if not self.models:
            raise ValueError("No trained models available")

        ensemble_ei = None
        ensemble_points = None

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                weight = self.weights.get(name, 0.0)

                if ensemble_ei is None:
                    ensemble_ei = weight * ei_values
                    ensemble_points = points
                else:
                    ensemble_ei += weight * ei_values

            except Exception as e:
                print(f"⚠️ {name} EI calculation failed: {e}")

        # Find the point corresponding to maximum EI
        if ensemble_ei is not None:
            max_idx = np.argmax(ensemble_ei)
            best_point = ensemble_points[max_idx:max_idx+1]  # Maintain dimensions
            return ensemble_ei, best_point
        else:
            raise ValueError("All model EI calculations failed")

    def ensemble_predictions(self):
        """Ensemble predictions"""
        predictions = {}
        uncertainties = {}

        for name, model in self.models.items():
            try:
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                predictions[name] = pred_mean
                uncertainties[name] = pred_std

            except Exception as e:
                print(f"⚠️ {name} prediction failed: {e}")

        if not predictions:
            raise ValueError("All model predictions failed")

        # Weighted-average the per-model means; stds are combined the same
        # way as a simple heuristic (not a formal variance combination)
        ensemble_mean = np.zeros_like(list(predictions.values())[0])
        ensemble_std = np.zeros_like(list(uncertainties.values())[0])

        for name, pred in predictions.items():
            weight = self.weights.get(name, 0.0)
            ensemble_mean += weight * pred
            ensemble_std += weight * uncertainties[name]

        return ensemble_mean, ensemble_std

    def get_model_comparison(self):
        """Get model comparison results"""
        comparison = {}

        for name, model in self.models.items():
            try:
                ei_values, points = model.EI()
                pred_mean = model.virtual_samples_mean
                pred_std = model.virtual_samples_std

                comparison[name] = {
                    'max_ei': np.max(ei_values),
                    'mean_prediction': np.mean(pred_mean),
                    'mean_uncertainty': np.mean(pred_std),
                    'weight': self.weights.get(name, 0.0),
                    'recommended_point': points[0]
                }
            except Exception as e:
                comparison[name] = {'error': str(e)}

        return comparison

# Using ensemble methods
print("🚀 Starting ensemble method demonstration")
print("=" * 50)

# Create and train ensemble model
ensemble = BgolearnEnsemble(
    model_types=['GaussianProcess', 'RandomForest', 'SVR']
)

ensemble.fit(
    data_matrix=data_matrix,
    measured_response=strength_values,
    virtual_samples=virtual_samples,
    seed=42
)

# Get ensemble predictions
print("\n📊 Ensemble prediction results:")
try:
    ensemble_mean, ensemble_std = ensemble.ensemble_predictions()
    print(f"Ensemble prediction mean: {ensemble_mean[:3]}")  # Show first 3
    print(f"Ensemble prediction std: {ensemble_std[:3]}")
except Exception as e:
    print(f"Ensemble prediction failed: {e}")

# Get ensemble EI
print("\n🎯 Ensemble Expected Improvement:")
try:
    ensemble_ei, ensemble_point = ensemble.ensemble_EI()
    max_ei_idx = np.argmax(ensemble_ei)
    print(f"Maximum ensemble EI value: {ensemble_ei[max_ei_idx]:.3f}")
    print(f"Recommended next experiment point: Cu={ensemble_point[0][0]:.2f}, Mg={ensemble_point[0][1]:.2f}, Si={ensemble_point[0][2]:.2f}")
except Exception as e:
    print(f"Ensemble EI calculation failed: {e}")

# Model comparison
print("\n📈 Model performance comparison:")
comparison = ensemble.get_model_comparison()
print(f"{'Model':<15} {'Max EI':<10} {'Weight':<8} {'Mean Uncertainty':<15}")
print("-" * 55)

for name, metrics in comparison.items():
    if 'error' not in metrics:
        print(f"{name:<15} {metrics['max_ei']:<10.3f} {metrics['weight']:<8.3f} {metrics['mean_uncertainty']:<15.3f}")
    else:
        print(f"{name:<15} {'Failed':<10} {'-':<8} {'-':<15}")

print("\n💡 Ensemble method advantages:")
print("  - Combines strengths of multiple models")
print("  - Improves prediction stability")
print("  - Reduces single model bias")
print("  - Automatic weight allocation")

6.9.2. Transfer Learning#

Use pre-trained models for related problems:

# Conceptual transfer learning
def transfer_learning(source_model, target_data, target_response):
    """Transfer knowledge from source to target problem."""
    # Use source model as initialization
    # Fine-tune on target data
    pass
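
A minimal, runnable sketch of one common pattern, pooling data from a related (source) system with the few target measurements before fitting; this is an assumed workflow, not a built-in Bgolearn feature:

import pandas as pd

# Hypothetical data from a related alloy system (source domain)
source_X = pd.DataFrame([[2.1, 1.1, 0.5], [3.4, 0.9, 0.6]],
                        columns=['Cu', 'Mg', 'Si'])
source_y = pd.Series([248, 278])

# A few measurements on the new system (target domain)
target_X = pd.DataFrame([[2.6, 1.2, 0.4]], columns=['Cu', 'Mg', 'Si'])
target_y = pd.Series([255])

# Pool both datasets so the surrogate starts from related knowledge,
# then pass pooled_X / pooled_y to opt.fit(...) as usual
pooled_X = pd.concat([source_X, target_X], ignore_index=True)
pooled_y = pd.concat([source_y, target_y], ignore_index=True)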

6.9.3. Online Learning#

Update models with new data:

# Conceptual online learning
def online_update(model, new_x, new_y):
    """Update model with new observation."""
    # Add new data point
    # Retrain or update model incrementally
    pass
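
In practice the simplest pattern is append-and-refit: add the new observation to the training DataFrame and call fit again. A sketch under that assumption:

import pandas as pd

def append_and_refit(opt, data_matrix, measured_response, virtual_samples,
                     new_x, new_y):
    """Append one new observation and refit the surrogate from scratch."""
    new_row = pd.DataFrame([new_x], columns=data_matrix.columns)
    data_matrix = pd.concat([data_matrix, new_row], ignore_index=True)
    measured_response = pd.concat(
        [measured_response, pd.Series([new_y])], ignore_index=True
    )
    model = opt.fit(
        data_matrix=data_matrix,
        Measured_response=measured_response,
        virtual_samples=virtual_samples,
        Classifier='GaussianProcess',
    )
    return model, data_matrix, measured_response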

6.10. Troubleshooting#

6.10.1. Common Issues#

  1. Poor CV Scores

    • Try different models

    • Check data quality

    • Increase training data

    • Adjust normalization

  2. Slow Training

    • Use Random Forest for large data

    • Reduce virtual space size (see the sketch after this list)

    • Consider SVR for high dimensions

  3. Overfitting

    • Use cross-validation

    • Reduce model complexity

    • Add more training data

  4. Poor Uncertainty

    • Use Gaussian Process

    • Increase bootstrap iterations

    • Check model assumptions
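
For the "reduce virtual space size" remedy, a random subsample of the candidate DataFrame is often enough; a minimal sketch:

# Randomly subsample a large virtual space to speed up model evaluation
virtual_subset = virtual_samples.sample(
    n=min(5000, len(virtual_samples)), random_state=42
)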

6.10.2. Performance Optimization#

# Tips for better performance
optimization_tips = {
    "Data preprocessing": "Normalize features, remove outliers",
    "Model selection": "Start with GP, try RF for discrete features",
    "Hyperparameters": "Use cross-validation for tuning",
    "Computational": "Reduce virtual space size for speed",
    "Validation": "Use CV_test=10 or 'LOOCV' for validation"
}

6.11. Next Steps#

See also

For more on surrogate modeling:

  • Rasmussen, C.E. “Gaussian Processes for Machine Learning”

  • Forrester, A. “Engineering Design via Surrogate Modelling”

  • Queipo, N.V. “Surrogate-based Analysis and Optimization”