How to Make an API


Alright, now that we know how other people will interact with our code using an API, we’re ready to get coding! Today we’re going to do two things:

  1. Get our code ready to be served through an app

  2. Write a Flask app (well, most of one)

Let’s get started!

Get your modelling code ready to be put in an app

Since you spent some time thinking about your app yesterday, you hopefully have a pretty good idea what you want your Python code to do. Today we’re going to follow a three-step model of development.

  • Make it work. The first thing you want to do is get your code doing whatever it is that you intend for it to do. It doesn’t have to be beautiful or perfectly optimized (you can always go back and refactor it later), it just needs to work.

  • Make it pretty. At this stage, you can spend some time tidying up your code. Adding comments, creating functions for particular tasks, making sure your variable names are informative; things that will make it easier for other people to read and use your code.

  • Make it portable. Finally, you want to make your code portable. This includes saving out your model so that you can load it into a new session and use it to make predictions.

Let’s see an example of what this process looks like.

Make it work

First, I wrote a quick script that gets all instances of Python package names & their indexes from some sample text. The general workflow is:

  1. Get a list of Python packages (from this helpful list maintained by hugovk on GitHub).

  2. Use that list to create a flashtext KeywordProcessor object. This object will let us use flashtext to find our terms. Flashtext can be more than 20 times faster than using regular expressions to extract a list of keywords, and you also don't have to write any Perl. You can find more information about the package in its documentation.

  3. Remove common English words and words that are used often on the Kaggle forums, since these are unlikely to be referring to a specific Python package. (Even though there is a Python package called "the", and people write "the" very often in the forums, they're never actually talking about the package called "the".)

  4. Put our KeywordProcessor into action and see if it works!

In [1]:

import numpy as np
import pandas as pd
import requests
from flashtext.keyword import KeywordProcessor
from nltk.corpus import stopwords

# let's read in a couple of forum posts
forum_posts = pd.read_csv("../input/ForumMessages.csv")

# get a smaller sub-set for playing around with
sample_posts = forum_posts.Message[0:3]

# get data from list of top 5000 pypi packages (last 30 days)
url = 'https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json'
data = requests.get(url).json()

# get just the list of package names
list_of_packages = [data_item['project'] for data_item in data['rows']]

# create a KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(list_of_packages)

# remove english stopwords
keyword_processor.remove_keywords_from_list(stopwords.words('english'))

# remove custom stopwords
keyword_processor.remove_keywords_from_list(['http','kaggle'])

# test our keyword processor
for post in sample_posts:
    keywords_found = keyword_processor.extract_keywords(post, span_info=True)
    print(keywords_found)
[('dataset', 34, 41), ('dataset', 123, 130)]
[('dataset', 4, 11), ('public', 119, 125), ('dataset', 149, 156), ('dataset', 201, 208), ('dataset', 374, 381), ('dataset', 433, 440), ('dataset', 656, 663), ('common', 844, 850), ('events', 1034, 1040)]
[('html', 318, 322), ('html', 370, 374), ('html', 418, 422), ('html', 460, 464)]

Yay, it works! The code isn't the prettiest, though. Even though I've included comments, I've just done everything in a single long script. If I break my code up into functions it will make it more modular and easier to read.

Make it pretty

My main goal in gussying up my code is to make it easier to add to my app later. So, I want to create functions that will let me do this. I broke my code into two functions.

  • One function does what I want my app to do: it takes in a pre-trained model (here our word processor) and then applies it.

  • The other creates a keyword model, automating the preprocessing we did. That way if we want to create a new keyword processor with different words in the future we can easily do that.

With a little refactoring, my code now looks like this:

In [2]:

# I'm not going to read in the packages & data again since it's 
# already in our current environment.

def create_keywordProcessor(list_of_terms, remove_stopwords=True, 
                            custom_stopword_list=[""]):
    """ Creates a new flashtext KeywordProcessor and opetionally 
    does some lightweight text cleaning to remove stopwords, including
    any provided by the user.
    """
    # create a KeywordProcessor
    keyword_processor = KeywordProcessor()
    keyword_processor.add_keywords_from_list(list_of_terms)

    # remove English stopwords if requested
    if remove_stopwords == True:
        keyword_processor.remove_keywords_from_list(stopwords.words('english'))

    # remove custom stopwords
    keyword_processor.remove_keywords_from_list(custom_stopword_list)
    
    return(keyword_processor)

def apply_keywordProcessor(keywordProcessor, text, span_info=True):
    """ Applies an existing keywordProcessor to a given piece of text. 
    Will return spans by default. 
    """
    keywords_found = keywordProcessor.extract_keywords(text, span_info=span_info)
    return(keywords_found)
    

# create a keywordProcessor of python packages    
py_package_keywordProcessor = create_keywordProcessor(list_of_packages, 
                                                      custom_stopword_list=["kaggle", "http"])

# apply it to some sample posts (with apply_keywordProcessor function, omitting
# span information)
for post in sample_posts:
    text = apply_keywordProcessor(py_package_keywordProcessor, post, span_info=False)
    print(text)
['dataset', 'dataset']
['dataset', 'public', 'dataset', 'dataset', 'dataset', 'dataset', 'dataset', 'common', 'events']
['html', 'html', 'html', 'html']

Make it portable

So far, we've gotten our code working and have refactored it so that it's more modular and other people trying to use it will be able to follow it more easily. But how can we take a model we've trained in one place and apply it in another? By saving the model and then loading it into a new environment.

In general, ML APIs are used for inference. In other words, you have a pre-trained model that is applied to the data that is sent to it. You could create an API that trains models, but you will probably end up having to pay for the compute used to train them, which can get expensive. Particularly since I'm not covering how to do authentication, which means anyone would be able to train as many models as they wanted with your API, I'd recommend leaving model training out of your API. I'm assuming going forward that your API will use a pre-trained model.

So how do you save a model and then read it back into Python? It depends on the specific type of model you're training.

Library-specific methods for saving models

If the library you used to train your model has a specific set of methods to save and load model files, then I'd recommend using them. Here are the methods for some popular machine learning libraries.

  • For TensorFlow and Keras, you can save models using model.save_weights() and read them back in using model.load_weights(). By default, the weights will be saved as an HDF5 file.

  • For PyTorch, you can use torch.save() and torch.load(). PyTorch saves models as pickles.

  • For XGBoost, you can save models with model.save_model() and load them with model.load_model().

  • For LightGBM, you can use model_to_string() and model_from_string().
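For example, here's a minimal sketch of the Keras route (the tiny model below is just a stand-in for whatever you actually trained, and it assumes TensorFlow 2.x and h5py are installed):

import tensorflow as tf

# a placeholder model; substitute whatever architecture you actually trained
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# save just the weights to an HDF5 file
model.save_weights("my_model_weights.h5")

# in a new session, rebuild the same architecture and load the weights back in
restored = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
restored.load_weights("my_model_weights.h5")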

If there's no library-specific technique

If the library you're using doesn't have a specific method for saving out models, probably the easiest choice is to save your model as a pickle. Pickles are a serialized data format for Python, which means you can read them directly into your current environment as variables. If your model is built as one or more numpy arrays, you may be able to use HDF5 instead, but pickles can handle a wider range of data structures.

🥒 A quick warning about pickles 🥒 You should be aware that if there are differences between the version or subversion of Python you used to pickle your model and the one you try to unpickle it with, your model can fail to load. (This is also true of packages with different versions!) To avoid this, you can pin the specific versions of Python and each package you used in your requirements file when you create your app.

🥒🥒 A second quick warning about pickles 🥒🥒 While they can be useful for moving Python objects around, pickle is also not a very secure file format. As the Python wiki puts it: “Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary Python when you try to de-pickle it.”

In this case, the flashtext module doesn’t have a native way to save and load objects (at least that I can find), so I’ll save my keyword processor as a pickle using the pickle library.

In [3]:

import pickle

# save our file (make sure we open the file in mode "wb", 
# which will let us _w_rite a _b_inary file)
pickle.dump(py_package_keywordProcessor, open("processor.pkl", "wb"))

# check our current directory to make sure it saved
!ls
__notebook__.ipynb  __output__.json  processor.pkl

From here, we can load our model back in, assign it to a variable and apply it right away.

How can I download my trained model from this notebook? Right now, once you've saved a file, the easiest way to download it is to commit your notebook. Then you'll be able to download it from the "output" section of the notebook viewer. This will also give you a record of what code was used to produce your file so you can reproduce your work later on.

In [4]:

# read in a processor from our pickled file. Don't forget to 
# include "rb", which lets us _r_ead a _b_inary file.
pickle_keywordProcessor = pickle.load(open("processor.pkl", "rb"))

# apply it to some sample text to make sure it works
apply_keywordProcessor(pickle_keywordProcessor, "I like pandas numpy and seaborn") 

Out[4]:

[('pandas', 7, 13), ('numpy', 14, 19), ('seaborn', 24, 31)]

At this point, we’re ready to actually start creating our Flask app. We've got:

  • A trained model. In this example, it’s a KeywordProcessor object that will let us find Python package names. It could be any model you like, though.

  • A function to apply that model. In this example, applying our model is pretty straightforward. Depending on what you’re working on, it could be more involved. For example, you may need to crop or resize images so that they fit the pixel dimensions that your model expects.
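As a hypothetical illustration of that kind of preprocessing step, a small helper along these lines (using Pillow and NumPy, with a made-up 224x224 input size) might sit between the incoming request and the model:

import numpy as np
from PIL import Image

def prepare_image(path, size=(224, 224)):
    """Load an image, resize it to the shape the model expects,
    and return a batch of one normalized float array."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.expand_dims(np.asarray(img, dtype="float32") / 255.0, axis=0)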

Since we’re using a trained model we don’t actually need to put any of the code or data we used to train it into our app. We can just use our model and the function that applies it.

Your turn

Make it work! Don’t worry about getting too fancy, just get your code working.

In [5]:

# your code here :)

Make it pretty! Copy and paste your code from the cell above into this cell. Now you can spend some time commenting and refactoring your code so that it’s ready to share. I’d recommend at the very least writing a function to apply your model.

In [6]:

# your code here :)

Make it portable! Finally, you’ll need to save out your trained model. You might want to read it back into your notebook and apply your prediction function to make sure it all works. :)

In [7]:

# your code here :)

Writing a Flask App

Alright, full disclosure: we're not writing an entire app today. (Sorry!) This is because exactly which files you need to include in your app depends on the platform you're using to deploy your API. We'll cover that in detail tomorrow.

We will, however, be writing the core code for our app. This will be broken into two files. Just to make it easier to follow along, I'll be making each "file" a single notebook cell here, but to actually serve our app we'll need to save each one as a separate file. Again, we'll cover all these steps tomorrow.

So, what will be in our two files?

  • A file serve.py that will:

    • import all the libraries we need

    • define a function that both loads our model and also defines and returns a second function that applies that model

  • A second file that we won't name yet that will:

    • import all the libraries we need, including serve.py

    • run the function we defined in serve.py (this loads in our data & saves the function we defined in serve.py to whatever name we give this variable)

    • create an instance of Flask

    • define a path and method for our API that matches the specifications we wrote yesterday

    • define a function to be executed at that path

So here's what's going into the two files that will do the bulk of the work in our app:

This is what will go in our serve.py file

Note that, for this function to work, we need to save our model file as "processor.pkl" in the same directory as our serve.py file. (It should already be in your current working directory because we dumped our model to a pickle in the section above.)

In [8]:

from flashtext.keyword import KeywordProcessor
import pickle

# Function that loads in our pickled word processor
# and defines a function for using it. This makes it easy
# to do these steps together when serving our model.
def get_keywords_api():

    # read in pickled word processor. You could also load in
    # other models at this step.
    keyword_processor = pickle.load(open("processor.pkl", "rb"))

    # Function to apply our model & extract keywords from a
    # provided bit of text. It closes over the processor we just
    # loaded, so callers only need to pass in the text itself.
    def keywords_api(text, span_info=True):
        keywords_found = keyword_processor.extract_keywords(text, span_info=span_info)
        return keywords_found
    
    # return the function we just defined
    return keywords_api

This is what will go in our as-yet-unnamed second file

In [9]:

import json
from flask import Flask, request
#from serve import get_keywords_api
# I've commented out the last import because it won't work in kernels, 
# but you should uncomment it when we build our app tomorrow

# create an instance of Flask
app = Flask(__name__)

# load our pre-trained model & function
keywords_api = get_keywords_api()

# Define a post method for our API.
@app.route('/extractpackages', methods=['POST'])
def extractpackages():
    """ 
    Takes in a json file, extracts the keywords &
    their indices and then returns them as a json file.
    """
    # the data the user input, in json format
    input_data = request.json

    # use our API function to get the keywords
    output_data = keywords_api(input_data)

    # convert our dictionary into a .json file
    # (returning a dictionary wouldn't be very
    # helpful for someone querying our API from
    # java; JSON is more flexible/portable)
    response = json.dumps(output_data)

    # return our json file
    return response
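
Once the app is actually being served (we'll cover that tomorrow), a client could call this endpoint roughly like so; the address below is just a placeholder for wherever your app ends up running:

import requests

# hypothetical local address; the real URL depends on how you deploy tomorrow
url = "http://127.0.0.1:5000/extractpackages"

# send some text as the JSON body of a POST request
response = requests.post(url, json="I like pandas numpy and seaborn")

# the API returns the extracted keywords (and their spans) as JSON
print(response.json())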

Your turn!

Now that you've seen what a very small Flask app looks like, it's time for you to write your own. We'll work more with this code tomorrow when we finish our apps and serve them.

In addition to writing the code, if you don't already have them, I'd recommend creating these accounts ahead of time:

  • A GitHub account

  • Either a Heroku or Google Cloud account. If you're using Google Cloud, you'll also want to enable billing. I go over how to do that in this notebook.

Write your serve.py file here

In [10]:

# your code here :) (feel free to copy & paste my code and then modify it for your project -- R)

Write your second file here

In [11]:

# your code here :)

See you all tomorrow! :)

Credit where it's due: this specific architecture is based on an API by Guillaume Genthial, described in his blog post. I'd recommend checking the blog post out, especially if you're working with a TensorFlow model.