Production Machine Learning (14 May 2023)
For the past two weeks, I've been working through Kaggle and Coursera courses. Here are some notes I took that might be useful:
Adapting to Data: Different kinds of data changes
- Change in distribution
- Change in dependencies and change in ingested data
- Code smell
- Model not updated to new data (cold start problem)
  - Retrain dynamically
  - Understand the model's limits
  - Roll back to an old model with model versioning
- Concept drift
  - A change in P(Y|X): a shift in the underlying relationship between model inputs and outputs
- Data drift (covariate shift)
  - A change in P(X): a shift in the distribution of the input data (see the detection sketch after this list)
- Label shift (prior probability shift)
  - A change in P(Y): a shift in the distribution of the labels
- Prediction drift
  - A shift in the distribution of the model's predictions
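One common way to catch data drift is to compare the training-time and serving-time distributions of each feature with a two-sample statistical test. Here is a minimal sketch using SciPy's Kolmogorov-Smirnov test; the significance threshold, window sizes, and synthetic data are illustrative assumptions, not values from the course:

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(train_feature, serving_feature, alpha=0.05):
    """Flag a shift in P(X) for one feature via a two-sample KS test."""
    _statistic, p_value = ks_2samp(train_feature, serving_feature)
    return p_value < alpha  # small p-value: distributions likely differ

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5_000)    # feature values seen at training time
serving = rng.normal(0.5, 1.0, size=5_000)  # serving-time values whose mean drifted
print(detect_data_drift(train, serving))    # True: P(X) has shifted
```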
Tuning performance to reduce training time

| Constraint | Input/Output | CPU | Memory |
|---|---|---|---|
| Commonly occurs | Large inputs; input requires parsing; small models | Expensive computation; underpowered hardware | Large number of inputs; complex models |
| Take action | Store efficiently; parallelize reads; consider batch size | Train on a faster accelerator; upgrade the processor; run on a TPU; simplify the model | Add more memory; use fewer layers; reduce batch size |
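The Input/Output column maps directly onto the tf.data API. Below is a minimal sketch of those three remedies (store efficiently as TFRecords, parallelize reads, consider batch size); the file pattern and batch size are illustrative assumptions:

```python
import tensorflow as tf

BATCH_SIZE = 128  # illustrative; tune against accelerator utilization

# "Store efficiently": data already sharded into TFRecord files.
files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")  # hypothetical path

dataset = (
    files.interleave(                        # "Parallelize reads" across shards
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .shuffle(10_000)
    .batch(BATCH_SIZE)                       # "Consider batch size"
    .prefetch(tf.data.AUTOTUNE)              # overlap input pipeline with training step
)
```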
tf.distribute.Strategy
- MirroredStrategy
- MultiWorkerMirroredStrategy
- TPUStrategy
- ParameterServerStrategy
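Each bullet corresponds to a class under tf.distribute. A quick sketch of instantiating them; the TPU name and cluster resolver are hypothetical placeholders:

```python
import tensorflow as tf

mirrored = tf.distribute.MirroredStrategy()  # one machine, all local GPUs, synchronous
# Multi-worker synchronous training (used in the walkthrough below):
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
# TPUStrategy needs a resolved TPU cluster first:
# resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # hypothetical name
# tpu_strategy = tf.distribute.TPUStrategy(resolver)
# ParameterServerStrategy needs a resolver describing worker and ps tasks:
# ps_strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)
```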
Create a strategy object
strategy = tf.distribute.MultiWorkerMirroredStrategy()
Wrap creation of model parameters within strategy scope
with strategy.scope():
    model = create_model()
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(0.0001),
        metrics=['accuracy'])
Scale the batch size by the number of replicas in the cluster
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size \
    * strategy.num_replicas_in_sync
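Putting the pieces together, the input pipeline should emit the global batch size. A minimal sketch, assuming `dataset` is an unbatched tf.data.Dataset of (features, label) pairs and `model` is the one compiled above:

```python
train_data = dataset.batch(global_batch_size).prefetch(tf.data.AUTOTUNE)
model.fit(train_data, epochs=5)  # epoch count is illustrative
```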
Readings: Designing High-performance ML Systems
In this module, you focus on either I/O performance or computational speed, depending on the model. For more information, see the following readings and videos.
- How to Evaluate the Performance of Your Machine Learning Model
- Best practices for performance and cost optimization for machine learning
- How To Improve Machine Learning Model Performance: Five Ways
- Distributed TensorFlow model training on Cloud AI Platform (TF Dev Summit ‘20)
- Distributed training with TensorFlow
- Speeding Up Neural Network Training with Data Echoing
- Machine Learning Performance Improvement Cheat Sheet
- Building a High-Performance Data Pipeline with Tensorflow 2.x
- Introduction to Machine Learning Pipelines with Kubeflow
- Kubeflow — a machine learning toolkit for Kubernetes
- ML for Mobile and Edge Devices - TensorFlow Lite
- TensorFlow Lite Examples | Machine Learning Mobile Apps
- Optimize TensorFlow models for mobile and embedded devices
- The Essential Guide To Learn TensorFlow Mobile and Tensorflow Lite