Integrating machine learning (ML) models into server operations on Clore provides a streamlined, powerful solution for high-performance tasks like prediction, classification, and data processing. By leveraging Clore's GPU marketplace, developers can deploy, train, and manage ML models efficiently. This multi-part article will guide you through setting up, deploying, and optimizing ML models on Clore's infrastructure.
Part 1: API Setup and Environment Initialization
Step 1: Obtain API Key and Initialize Session
To begin, we'll need an API key to authenticate our requests with Clore; you can generate one from your Clore account. The following snippet demonstrates how to use this key within our Python environment.
import requests

# Replace with your actual API key
API_KEY = "YOUR_CLORE_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

def test_api_connection():
    url = "https://api.clore.ai/v1/marketplace"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        print("Connected to Clore Marketplace!")
    else:
        print(f"Connection failed with status: {response.status_code}")

test_api_connection()
This code verifies connectivity by querying the Clore marketplace endpoint. After a successful connection, we can proceed to model deployment.
Step 2: Prepare Server for ML Model Deployment
Using Clore's API, we’ll configure our server to run a basic machine learning environment. Below, we select a server with optimal specifications for ML tasks:
# Fetch available servers and filter for high-performance specs
def get_available_servers():
    url = "https://api.clore.ai/v1/marketplace"
    response = requests.get(url, headers=headers)
    servers = response.json().get("servers", [])
    # Example filter: keep servers with at least 16 GB of GPU memory.
    # The exact field name depends on the marketplace response schema.
    high_performance_servers = [
        server for server in servers
        if server["specs"]["gpu"] >= 16
    ]
    return high_performance_servers

available_servers = get_available_servers()
if available_servers:
    print("High-performance servers found:", available_servers)
else:
    print("No suitable servers found.")
This snippet identifies servers with at least 16GB of GPU memory, ensuring they meet our model requirements.
Step 3: Initialize Model Environment with Docker
For scalable deployment, we’ll use a Docker image that bundles popular ML libraries (such as TensorFlow and PyTorch). The image name below is an example; substitute any image that includes your required frameworks. Here’s how to set it up:
server_id = available_servers[0]["id"]  # Select the first available server

# Define the Docker configuration
data = {
    "currency": "clore",
    "image": "cloreai/tensorflow-pytorch-cuda",
    "renting_server": server_id,
    "type": "on-demand",  # Ensures a non-interruptible lease
    "env": {
        "MODEL_PATH": "/path/to/model",
        "DATA_PATH": "/path/to/data"
    },
    "command": "python /app/run_model.py"  # Runs the model script on server start
}

def deploy_model_server(data):
    url = "https://api.clore.ai/v1/create_order"
    response = requests.post(url, json=data, headers=headers)
    if response.status_code == 200:
        print("Server initialized with Docker for ML deployment.")
    else:
        print(f"Deployment failed with status: {response.status_code}")

deploy_model_server(data)
This code sets up the server with an ML-optimized Docker container, pre-configured for running TensorFlow and PyTorch-based models.
Step 4: Uploading Model and Data Files to Clore Server
To upload large files efficiently, we'll use Clore's API to transfer our model and data to the selected server.
Code for Uploading Files
In this example, we’ll use the /upload endpoint on Clore. This step assumes that the Clore API supports direct file uploads to the server's specified directory.
def upload_file(file_path, server_id):
    url = f"https://api.clore.ai/v1/servers/{server_id}/upload"
    # Use a context manager so the file handle is closed after the request
    with open(file_path, "rb") as f:
        files = {"file": f}
        response = requests.post(url, headers=headers, files=files)
    if response.status_code == 200:
        print(f"Successfully uploaded {file_path} to server {server_id}.")
    else:
        print(f"Failed to upload {file_path}. Status: {response.status_code}")

# Upload model and data files
upload_file("/local/path/to/model.h5", server_id)
upload_file("/local/path/to/data.csv", server_id)
Setting Up Environment Variables for File Paths
We’ll update our server’s configuration to point to the uploaded files, making it easy to reference these files within our model script.
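Below is a minimal sketch of that update. It assumes a hypothetical /servers/{server_id}/env endpoint and that uploaded files land in /app/uploads on the server; check the Clore API reference for the actual route and upload directory.

def set_env_variables(server_id, env_vars):
    # Hypothetical endpoint for updating the container's environment
    url = f"https://api.clore.ai/v1/servers/{server_id}/env"
    response = requests.post(url, headers=headers, json={"env": env_vars})
    if response.status_code == 200:
        print("Environment variables updated.")
    else:
        print(f"Failed to update environment. Status: {response.status_code}")

# Point MODEL_FILE and DATA_FILE at the files uploaded in Step 4
set_env_variables(server_id, {
    "MODEL_FILE": "/app/uploads/model.h5",
    "DATA_FILE": "/app/uploads/data.csv"
})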
With files uploaded and environment variables set, we can now execute the ML model script on Clore's GPU-enabled server.
Step 5: Running the Model Script
The following command initiates the script on the rented server. The script should reference the MODEL_FILE and DATA_FILE paths set above, giving it access to the files uploaded in Step 4.
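As a sketch, assuming a hypothetical /execute endpoint for running commands inside the server's container, starting the script might look like this:

def run_model_script(server_id):
    # Hypothetical command-execution endpoint; adjust to the actual Clore API
    url = f"https://api.clore.ai/v1/servers/{server_id}/execute"
    command = "python /app/run_model.py --model $MODEL_FILE --data $DATA_FILE"
    response = requests.post(url, headers=headers, json={"command": command})
    if response.status_code == 200:
        print("Model script started.")
    else:
        print(f"Failed to start script. Status: {response.status_code}")

run_model_script(server_id)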
Ensure your model script (run_model.py) is ready to accept parameters for the model and data file paths. Below is a sample Python script that demonstrates this setup.
import argparse
import tensorflow as tf
import pandas as pd
# Parse input arguments
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True, help="Path to the model file")
parser.add_argument("--data", required=True, help="Path to the data file")
args = parser.parse_args()
# Load the model and data
model = tf.keras.models.load_model(args.model)
data = pd.read_csv(args.data)
# Run predictions or training (example)
predictions = model.predict(data)
print("Predictions:", predictions)
Step 6: Monitoring Model Execution
To track our model's execution status on the Clore server, we’ll use the /status endpoint, which allows us to query the server’s job status.
Code to Monitor Execution Status
The code snippet below checks the server status periodically, allowing us to see if the model execution is complete or if there are any errors.
import time

def monitor_execution(server_id, check_interval=10):
    url = f"https://api.clore.ai/v1/servers/{server_id}/status"
    while True:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            status_data = response.json()
            status = status_data.get("status")
            print(f"Server Status: {status}")
            if status == "COMPLETED":
                print("Model execution completed.")
                break
            elif status == "FAILED":
                print("Model execution failed.")
                break
            else:
                print("Execution in progress...")
        else:
            print(f"Failed to retrieve status. Status: {response.status_code}")
        time.sleep(check_interval)

monitor_execution(server_id)
Step 7: Retrieving Logs
Logs provide insights into what’s happening during the model execution, especially useful for debugging. We’ll use the /logs endpoint to fetch the server logs.
Code to Fetch Logs
This function retrieves logs from the Clore server, which can be printed or saved for further analysis.
def fetch_logs(server_id):
    url = f"https://api.clore.ai/v1/servers/{server_id}/logs"
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        logs = response.json().get("logs")
        print("Execution Logs:\n", logs)
        # Optionally save logs to a file
        with open("execution_logs.txt", "w") as log_file:
            log_file.write(logs)
    else:
        print(f"Failed to fetch logs. Status: {response.status_code}")

fetch_logs(server_id)
Step 8: Setting Up Alerts for Execution Status
To stay informed about the execution status, we can set up alerts for key events, such as completion or failure. This code snippet demonstrates how to send an email alert whenever the model execution reaches a final state.
Code for Status Alerts
This example uses smtplib to send email notifications, but you could also use services like Slack, Discord, or SMS depending on your alerting preferences.
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    sender_email = "your_email@example.com"
    recipient_email = "recipient@example.com"
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = sender_email
    msg["To"] = recipient_email
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("your_email@example.com", "your_password")
        server.sendmail(sender_email, recipient_email, msg.as_string())

def monitor_with_alerts(server_id, check_interval=10):
    url = f"https://api.clore.ai/v1/servers/{server_id}/status"
    while True:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            status_data = response.json()
            status = status_data.get("status")
            if status == "COMPLETED":
                send_email_alert("Model Execution Completed", f"Server {server_id} has completed model execution.")
                break
            elif status == "FAILED":
                send_email_alert("Model Execution Failed", f"Server {server_id} encountered an error.")
                break
            else:
                print("Execution in progress...")
        else:
            print(f"Failed to retrieve status. Status: {response.status_code}")
        time.sleep(check_interval)

monitor_with_alerts(server_id)
Step 9: Setting Up Hyperparameter Tuning
Automated hyperparameter tuning helps us find the best configurations for model training. The example below demonstrates how to initiate multiple jobs on Clore’s servers with different hyperparameter configurations and retrieve the best result.
Code for Automated Hyperparameter Tuning
This example sends multiple configuration jobs to the Clore server using varying hyperparameter values for optimization. We use a grid search approach here, but other methods like random search or Bayesian optimization can be applied.
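The sketch below submits one job per grid point. The /run_job route (the same one used for training jobs in Step 13), the path_to_model_script placeholder, and the payload fields are assumptions to adapt to the actual API; the submitted jobs are collected in tuning_results, which Step 11 consumes.

import itertools

def run_tuning_job(server_id, params):
    # Assumed job-submission endpoint; returns a job ID we track for results
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": params
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        return {"job_id": response.json().get("job_id"), **params}
    print(f"Failed to submit job for {params}. Status: {response.status_code}")
    return None

# Hyperparameter grid to explore
param_grid = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
    "epochs": [10, 20]
}

# Submit one job per combination (a full grid search: 2 x 2 x 2 = 8 jobs)
tuning_results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    result = run_tuning_job(server_id, params)
    if result:
        tuning_results.append(result)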
Step 10: Dynamic Resource Allocation
Resource allocation lets us control CPU, memory, and GPU usage, balancing performance against cost. By specifying precise resource requirements, we ensure each job is assigned an appropriate server configuration without over-provisioning.
Code for Dynamic Resource Allocation
The following code demonstrates how to submit a job with a dynamically allocated resource configuration. This helps maintain efficiency, especially when running multiple concurrent jobs.
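A minimal sketch, assuming the /run_job payload accepts a resources object (the field names here are illustrative):

def submit_job_with_resources(server_id, params, cpu_cores, memory_gb, gpu_count):
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": params,
        # Illustrative resource fields; match them to the real job schema
        "resources": {
            "cpu_cores": cpu_cores,
            "memory_gb": memory_gb,
            "gpu_count": gpu_count
        }
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        print(f"Job submitted with {gpu_count} GPU(s), {cpu_cores} CPU cores, {memory_gb} GB RAM.")
    else:
        print(f"Failed to submit job. Status: {response.status_code}")

# Example: a lightweight tuning job that doesn't need a full server
submit_job_with_resources(
    server_id,
    {"learning_rate": 0.001, "batch_size": 32, "epochs": 10},
    cpu_cores=4, memory_gb=16, gpu_count=1
)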
Step 11: Retrieving Results and Selecting the Best Model
After executing multiple configurations, we can retrieve each job's results, compare their performance, and select the best-performing model.
Code for Retrieving Job Results and Selecting the Best Configuration
The function below iterates through job results, retrieves performance metrics, and identifies the configuration with the highest accuracy.
def get_best_model(tuning_results):
    best_accuracy = 0
    best_config = None
    for result in tuning_results:
        job_id = result["job_id"]
        url = f"https://api.clore.ai/v1/jobs/{job_id}/results"
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            job_results = response.json()
            accuracy = job_results.get("accuracy")
            # Guard against missing accuracy values before comparing
            if accuracy is not None and accuracy > best_accuracy:
                best_accuracy = accuracy
                best_config = result
            print(f"Job {job_id} - Accuracy: {accuracy}, Config: {result}")
        else:
            print(f"Failed to retrieve results for job {job_id}")
    print(f"Best Model - Accuracy: {best_accuracy}, Config: {best_config}")
    return best_config

best_model_config = get_best_model(tuning_results)
Step 12: Configuring GPU Overclocking
Overclocking GPUs can provide significant performance gains for computationally intensive tasks, especially in machine learning model training. Clore’s API allows you to adjust GPU clock settings, enabling fine-tuning for maximum efficiency.
Code for GPU Overclocking Configuration
In this example, we configure GPU settings for optimal speed. Be cautious with overclocking, as improper settings can lead to hardware instability.
def set_gpu_overclock(server_id, gpu_id, core_clock, memory_clock, power_limit):
    url = f"https://api.clore.ai/v1/servers/{server_id}/gpus/{gpu_id}/configure"
    payload = {
        "core_clock": core_clock,
        "memory_clock": memory_clock,
        "power_limit": power_limit
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        print(f"Overclock settings applied to GPU {gpu_id} - Core: {core_clock}MHz, Memory: {memory_clock}MHz, Power Limit: {power_limit}W")
    else:
        print(f"Failed to apply overclock settings. Status: {response.status_code}")

# Example usage: adjust core clock, memory clock, and power limit
set_gpu_overclock(server_id=server_id, gpu_id=0, core_clock=1500, memory_clock=7000, power_limit=180)
Step 13: Configuring Multi-GPU Setup for Distributed Training
When working with large models, a single GPU may not suffice. In this step, we demonstrate a distributed setup with multiple GPUs on Clore, enhancing processing speed and allowing larger models to be trained effectively.
Code for Distributed Multi-GPU Setup
Here, we configure a multi-GPU training job using distributed processing. This setup divides the workload among several GPUs, optimizing both time and resources.
def configure_multi_gpu_training(server_id, gpu_ids, batch_size, learning_rate, epochs):
    url = f"https://api.clore.ai/v1/servers/{server_id}/run_job"
    payload = {
        "model_script": "path_to_model_script",
        "training_parameters": {
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "epochs": epochs,
            "distributed": True
        },
        "gpus": gpu_ids
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        job_id = response.json().get("job_id")
        print(f"Distributed training job started with GPUs {gpu_ids}. Job ID: {job_id}")
    else:
        print(f"Failed to start multi-GPU job. Status: {response.status_code}")

# Initiate distributed training with multiple GPUs
gpu_ids = [0, 1, 2]  # Specify GPU IDs for distributed training
configure_multi_gpu_training(server_id, gpu_ids, batch_size=64, learning_rate=0.001, epochs=50)
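On the server side, the distributed flag implies that run_model.py itself spreads work across the visible GPUs. Below is a minimal sketch of what that section might look like with TensorFlow's MirroredStrategy; the model path matches the upload location assumed in Step 4.

import tensorflow as tf

# MirroredStrategy creates one replica per visible GPU and keeps
# their weights in sync by aggregating gradients after each batch
strategy = tf.distribute.MirroredStrategy()
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

with strategy.scope():
    # The model must be built or loaded inside the strategy scope
    model = tf.keras.models.load_model("/app/uploads/model.h5")
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )

# model.fit(...) then splits each global batch across the replicas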
Step 14: Monitoring Multi-GPU Resource Usage
With multi-GPU setups, monitoring becomes critical to ensure that each GPU is optimally utilized. This final example demonstrates how to retrieve real-time usage statistics to assess the effectiveness of the distributed configuration.
Code for Real-Time Multi-GPU Usage Monitoring
This script pulls usage data from each GPU in real-time, allowing for adjustments if certain GPUs are underutilized.
import time

def monitor_gpu_usage(server_id, gpu_ids):
    for gpu_id in gpu_ids:
        url = f"https://api.clore.ai/v1/servers/{server_id}/gpus/{gpu_id}/usage"
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            usage_data = response.json()
            print(f"GPU {gpu_id} - Utilization: {usage_data['utilization']}%, Memory Usage: {usage_data['memory_usage']}MB")
        else:
            print(f"Failed to retrieve usage data for GPU {gpu_id}")

# Monitor GPU usage every 5 seconds
gpu_ids = [0, 1, 2]
while True:
    monitor_gpu_usage(server_id, gpu_ids)
    time.sleep(5)  # Refresh interval