Integrating machine learning (ML) models into server operations on Clore provides a streamlined, powerful solution for high-performance tasks like prediction, classification, and data processing. By leveraging Clore's GPU marketplace, developers can deploy, train, and manage ML models efficiently. This multi-part article will guide you through setting up, deploying, and optimizing ML models on Clore's infrastructure.
Part 1: API Setup and Environment Initialization
Step 1: Obtain API Key and Initialize Session
To begin, we'll need an API key to authenticate our requests with Clore; you can generate one from your Clore account. The following snippet demonstrates how to use this key within our Python environment.
import requests

# Replace with your actual API key
API_KEY = "YOUR_CLORE_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def test_api_connection():
    url = "https://api.clore.ai/v1/marketplace"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        print("Connected to Clore Marketplace!")
    else:
        print(f"Connection failed with status: {response.status_code}")

test_api_connection()
This code verifies connectivity by querying the Clore marketplace endpoint. After a successful connection, we can proceed to model deployment.
Step 2: Prepare Server for ML Model Deployment
Using Clore's API, we’ll configure our server to run a basic machine learning environment. Below, we select a server with optimal specifications for ML tasks:
# Fetch available servers and filter for high-performance specs
def get_available_servers():
    url = "https://api.clore.ai/v1/marketplace"
    response = requests.get(url, headers=headers)
    servers = response.json().get("servers", [])
    # Example filter: keep servers with at least 16 GB of GPU memory.
    # The exact field name depends on the marketplace response schema.
    high_performance_servers = [
        server for server in servers
        if server["specs"]["gpu"] >= 16
    ]
    return high_performance_servers

available_servers = get_available_servers()
if available_servers:
    print("High-performance servers found:", available_servers)
else:
    print("No suitable servers found.")
This snippet identifies servers with at least 16GB of GPU memory, ensuring they meet our model requirements.
Step 3: Initialize Model Environment with Docker
For scalable deployment, we’ll use a Docker image that bundles popular ML libraries (such as TensorFlow and PyTorch). The image name below is an example; substitute any image that includes your required frameworks. Here’s how to set it up:
server_id = available_servers[0]["id"]  # Select the first available server

# Define the Docker configuration
data = {
    "currency": "clore",
    "image": "cloreai/tensorflow-pytorch-cuda",
    "renting_server": server_id,
    "type": "on-demand",  # Ensures a non-interruptible lease
    "env": {
        "MODEL_PATH": "/path/to/model",
        "DATA_PATH": "/path/to/data"
    },
    "command": "python /app/run_model.py"  # Runs the model script on server start
}

def deploy_model_server(data):
    url = "https://api.clore.ai/v1/create_order"
    response = requests.post(url, json=data, headers=headers)
    if response.status_code == 200:
        print("Server initialized with Docker for ML deployment.")
    else:
        print(f"Deployment failed with status: {response.status_code}")

deploy_model_server(data)
This code sets up the server with an ML-optimized Docker container, pre-configured for running TensorFlow and PyTorch-based models.
Step 4: Uploading Model and Data Files to Clore Server
To upload large files efficiently, we'll use Clore's API to transfer our model and data to the selected server.
Code for Uploading Files
In this example, we’ll use the /upload endpoint on Clore. This step assumes that the Clore API supports direct file uploads to the server's specified directory.
def upload_file(file_path, server_id):
    url = f"https://api.clore.ai/v1/servers/{server_id}/upload"
    # Use a context manager so the file handle is closed after the request
    with open(file_path, "rb") as f:
        files = {"file": f}
        response = requests.post(url, headers=headers, files=files)
    if response.status_code == 200:
        print(f"Successfully uploaded {file_path} to server {server_id}.")
    else:
        print(f"Failed to upload {file_path}. Status: {response.status_code}")

# Upload model and data files
upload_file("/local/path/to/model.h5", server_id)
upload_file("/local/path/to/data.csv", server_id)
Setting Up Environment Variables for File Paths
We’ll update our server’s configuration to point to the uploaded files, making it easy to reference these files within our model script.
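Below is a minimal sketch of that update. It assumes a hypothetical /servers/{server_id}/env endpoint and that uploaded files land in /app/uploads on the server; check the Clore API reference for the actual route and upload directory.

def set_env_variables(server_id, env_vars):
    # Hypothetical endpoint for updating the container's environment
    url = f"https://api.clore.ai/v1/servers/{server_id}/env"
    response = requests.post(url, headers=headers, json={"env": env_vars})
    if response.status_code == 200:
        print("Environment variables updated.")
    else:
        print(f"Failed to update environment. Status: {response.status_code}")

# Point MODEL_FILE and DATA_FILE at the files uploaded in Step 4
set_env_variables(server_id, {
    "MODEL_FILE": "/app/uploads/model.h5",
    "DATA_FILE": "/app/uploads/data.csv"
})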
With files uploaded and environment variables set, we can now execute the ML model script on Clore's GPU-enabled server.
Step 5: Running the Model Script
The following command initiates the script on the rented server. The script should reference the MODEL_FILE and DATA_FILE paths set above, giving it access to the files uploaded in Step 4.
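As a sketch, assuming a hypothetical /execute endpoint for running commands inside the server's container, starting the script might look like this:

def run_model_script(server_id):
    # Hypothetical command-execution endpoint; adjust to the actual Clore API
    url = f"https://api.clore.ai/v1/servers/{server_id}/execute"
    command = "python /app/run_model.py --model $MODEL_FILE --data $DATA_FILE"
    response = requests.post(url, headers=headers, json={"command": command})
    if response.status_code == 200:
        print("Model script started.")
    else:
        print(f"Failed to start script. Status: {response.status_code}")

run_model_script(server_id)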
Ensure your model script (run_model.py) is ready to accept parameters for the model and data file paths. Below is a sample Python script that demonstrates this setup.
import argparse
import tensorflow as tf
import pandas as pd
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True, help="Path to the model file")
parser.add_argument("--data", required=True, help="Path to the data file")
args = parser.parse_args()
# Load the model and data
model = tf.keras.models.load_model(args.model)
data = pd.read_csv(args.data)
# Run predictions or training (example)
predictions = model.predict(data)
print("Predictions:", predictions)
Step 6: Monitoring Model Execution
To track our model's execution status on the Clore server, we’ll use the /status endpoint, which allows us to query the server’s job status.
Code to Monitor Execution Status
The code snippet below checks the server status periodically, allowing us to see if the model execution is complete or if there are any errors.
import time

def monitor_execution(server_id, check_interval=10):
    url = f"https://api.clore.ai/v1/servers/{server_id}/status"
    while True:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            status_data = response.json()
            status = status_data.get("status")
            print(f"Server Status: {status}")
            if status == "COMPLETED":
                print("Model execution completed.")
                break
            elif status == "FAILED":
                print("Model execution failed.")
                break
            else:
                print("Execution in progress...")
        else:
            print(f"Failed to retrieve status. Status: {response.status_code}")
        time.sleep(check_interval)

monitor_execution(server_id)
Step 7: Retrieving Logs
Logs provide insights into what’s happening during the model execution, especially useful for debugging. We’ll use the /logs endpoint to fetch the server logs.
Code to Fetch Logs
This function retrieves logs from the Clore server, which can be printed or saved for further analysis.
def fetch_logs(server_id):
    url = f"https://api.clore.ai/v1/servers/{server_id}/logs"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        logs = response.json().get("logs")
        print("Execution Logs:\n", logs)
        # Optionally save logs to a file
        with open("execution_logs.txt", "w") as log_file:
            log_file.write(logs)
    else:
        print(f"Failed to fetch logs. Status: {response.status_code}")

fetch_logs(server_id)
Step 8: Setting Up Alerts for Execution Status
To stay informed about the execution status, we can set up alerts for key events, such as completion or failure. This code snippet demonstrates how to send an email alert whenever the model execution reaches a final state.
Code for Status Alerts
This example uses smtplib to send email notifications, but you could also use services like Slack, Discord, or SMS depending on your alerting preferences.
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    sender_email = "your_email@example.com"
    recipient_email = "recipient@example.com"
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender_email
    msg["To"] = recipient_email
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("your_email@example.com", "your_password")
        server.sendmail(sender_email, recipient_email, msg.as_string())

def monitor_with_alerts(server_id, check_interval=10):
    url = f"https://api.clore.ai/v1/servers/{server_id}/status"
    while True:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            status_data = response.json()
            status = status_data.get("status")
            if status == "COMPLETED":
                send_email_alert("Model Execution Completed", f"Server {server_id} has completed model execution.")
                break
            elif status == "FAILED":
                send_email_alert("Model Execution Failed", f"Server {server_id} encountered an error.")
                break
            else:
                print("Execution in progress...")
        else:
            print(f"Failed to retrieve status. Status: {response.status_code}")
        time.sleep(check_interval)

monitor_with_alerts(server_id)
Step 9: Setting Up Hyperparameter Tuning
Automated hyperparameter tuning helps us find the best configurations for model training. The example below demonstrates how to initiate multiple jobs on Clore’s servers with different hyperparameter configurations and retrieve the best result.
Code for Automated Hyperparameter Tuning
This example sends multiple configuration jobs to the Clore server using varying hyperparameter values for optimization. We use a grid search approach here, but other methods like random search or Bayesian optimization can be applied.
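The sketch below submits one job per grid point. The /run_job route (the same one used for training jobs in Step 13), the path_to_model_script placeholder, and the payload fields are assumptions to adapt to the actual API; the submitted jobs are collected in tuning_results, which Step 11 consumes.

import itertools

def run_tuning_job(server_id, params):
    # Assumed job-submission endpoint; returns a job ID we track for results
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": params
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        return {"job_id": response.json().get("job_id"), **params}
    print(f"Failed to submit job for {params}. Status: {response.status_code}")
    return None

# Hyperparameter grid to explore
param_grid = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
    "epochs": [10, 20]
}

# Submit one job per combination (a full grid search: 2 x 2 x 2 = 8 jobs)
tuning_results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    result = run_tuning_job(server_id, params)
    if result:
        tuning_results.append(result)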
Step 10: Dynamic Resource Allocation
Resource allocation lets us control CPU, memory, and GPU usage, balancing performance against cost. By specifying precise resource requirements, we ensure each job is assigned an appropriate server configuration without over-provisioning.
Code for Dynamic Resource Allocation
The following code demonstrates how to submit a job with a dynamically allocated resource configuration. This helps maintain efficiency, especially when running multiple concurrent jobs.
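A minimal sketch, assuming the /run_job payload accepts a resources object (the field names here are illustrative):

def submit_job_with_resources(server_id, params, cpu_cores, memory_gb, gpu_count):
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": params,
        # Illustrative resource fields; match them to the real job schema
        "resources": {
            "cpu_cores": cpu_cores,
            "memory_gb": memory_gb,
            "gpu_count": gpu_count
        }
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        print(f"Job submitted with {gpu_count} GPU(s), {cpu_cores} CPU cores, {memory_gb} GB RAM.")
    else:
        print(f"Failed to submit job. Status: {response.status_code}")

# Example: a lightweight tuning job that doesn't need a full server
submit_job_with_resources(
    server_id,
    {"learning_rate": 0.001, "batch_size": 32, "epochs": 10},
    cpu_cores=4, memory_gb=16, gpu_count=1
)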
Step 11: Retrieving Results and Selecting the Best Model
After executing multiple configurations, we can retrieve each job's results, compare their performance, and select the best-performing model.
Code for Retrieving Job Results and Selecting the Best Configuration
The function below iterates through job results, retrieves performance metrics, and identifies the configuration with the highest accuracy.
def get_best_model(tuning_results):
    best_accuracy = 0
    best_config = None
    for result in tuning_results:
        job_id = result["job_id"]
        url = f"https://api.clore.ai/v1/jobs/{job_id}/results"
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            job_results = response.json()
            accuracy = job_results.get("accuracy")
            # Guard against missing accuracy values before comparing
            if accuracy is not None and accuracy > best_accuracy:
                best_accuracy = accuracy
                best_config = result
            print(f"Job {job_id} - Accuracy: {accuracy}, Config: {result}")
        else:
            print(f"Failed to retrieve results for job {job_id}")
    print(f"Best Model - Accuracy: {best_accuracy}, Config: {best_config}")
    return best_config

best_model_config = get_best_model(tuning_results)
Step 12: Configuring GPU Overclocking
Overclocking GPUs can provide significant performance gains for computationally intensive tasks, especially in machine learning model training. Clore’s API allows you to adjust GPU clock settings, enabling fine-tuning for maximum efficiency.
Code for GPU Overclocking Configuration
In this example, we configure GPU settings for optimal speed. Be cautious with overclocking, as improper settings can lead to hardware instability.
def set_gpu_overclock(server_id, gpu_id, core_clock, memory_clock, power_limit):
    url = f"https://api.clore.ai/v1/servers/{server_id}/gpus/{gpu_id}/configure"
    payload = {
        "core_clock": core_clock,
        "memory_clock": memory_clock,
        "power_limit": power_limit
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        print(f"Overclock settings applied to GPU {gpu_id} - Core: {core_clock}MHz, Memory: {memory_clock}MHz, Power Limit: {power_limit}W")
    else:
        print(f"Failed to apply overclock settings. Status: {response.status_code}")

# Example usage: adjust core clock, memory clock, and power limit
set_gpu_overclock(server_id=server_id, gpu_id=0, core_clock=1500, memory_clock=7000, power_limit=180)
Step 13: Configuring Multi-GPU Setup for Distributed Training
When working with large models, a single GPU may not suffice. In this step, we demonstrate a distributed setup with multiple GPUs on Clore, enhancing processing speed and allowing larger models to be trained effectively.
Code for Distributed Multi-GPU Setup
Here, we configure a multi-GPU training job using distributed processing. This setup divides the workload among several GPUs, optimizing both time and resources.
def configure_multi_gpu_training(server_id, gpu_ids, batch_size, learning_rate, epochs):
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": {
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "epochs": epochs,
            "distributed": True
        },
        "gpus": gpu_ids
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        job_id = response.json().get("job_id")
        print(f"Distributed training job started with GPUs {gpu_ids}. Job ID: {job_id}")
    else:
        print(f"Failed to start multi-GPU job. Status: {response.status_code}")

# Initiate distributed training with multiple GPUs
gpu_ids = [0, 1, 2]  # Specify GPU IDs for distributed training
configure_multi_gpu_training(server_id, gpu_ids, batch_size=64, learning_rate=0.001, epochs=50)
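On the server side, the distributed flag implies that run_model.py itself spreads work across the visible GPUs. Below is a minimal sketch of what that section might look like with TensorFlow's MirroredStrategy; the model path matches the upload location assumed in Step 4.

import tensorflow as tf

# MirroredStrategy creates one replica per visible GPU and keeps
# their weights in sync by aggregating gradients after each batch
strategy = tf.distribute.MirroredStrategy()
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

with strategy.scope():
    # The model must be built or loaded inside the strategy scope
    model = tf.keras.models.load_model("/app/uploads/model.h5")
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

# model.fit(...) then splits each global batch across the replicas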
Step 14: Monitoring Multi-GPU Resource Usage
With multi-GPU setups, monitoring becomes critical to ensure that each GPU is optimally utilized. This final example demonstrates how to retrieve real-time usage statistics to assess the effectiveness of the distributed configuration.
Code for Real-Time Multi-GPU Usage Monitoring
This script pulls usage data from each GPU in real-time, allowing for adjustments if certain GPUs are underutilized.
import time

def monitor_gpu_usage(server_id, gpu_ids):
    for gpu_id in gpu_ids:
        url = f"https://api.clore.ai/v1/servers/{server_id}/gpus/{gpu_id}/usage"
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            usage_data = response.json()
            print(f"GPU {gpu_id} - Utilization: {usage_data['utilization']}%, Memory Usage: {usage_data['memory_usage']}MB")
        else:
            print(f"Failed to retrieve usage data for GPU {gpu_id}")

# Monitor GPU usage every 5 seconds
gpu_ids = [0, 1, 2]
while True:
    monitor_gpu_usage(server_id, gpu_ids)
    time.sleep(5)  # Refresh interval