How to Make an API

Alright, now that we know how other people will interact with our code using an API, we’re ready to get coding! Today we’re going to do two things:

  1. Get our code ready to be served through an app

  2. Write a Flask app (well, most of one)

Let’s get started!

Get your modelling code ready to be put in an app

Since you spent some time thinking about your app yesterday, you hopefully have a pretty good idea what you want your Python code to do. Today we’re going to follow a three-step model of development.

  • Make it work. The first thing you want to do is get your code doing whatever it is that you intend for it to do. It doesn’t have to be beautiful or perfectly optimized (you can always go back and refactor it later); it just needs to work.

  • Make it pretty. At this stage, you can spend some time tidying up your code: adding comments, creating functions for particular tasks, and making sure your variable names are informative. These are all things that will make it easier for other people to read and use your code.

  • Make it portable. Finally, you want to make your code portable. This includes saving out your model so that you can load it into a new session and use it to make predictions.

Let’s see an example of what this process looks like.

Make it work

First, I wrote a quick script that gets all instances of Python package names & their indexes from some sample text. The general workflow is:

  1. Get a list of Python packages (from this list helpfully maintained by hugovk on GitHub).

  2. Use that list to create a flashtext KeywordProcessor object. This object will let us use flashtext to find our terms. Flashtext can be more than 20 times faster than using regular expressions to extract a list of keywords and you also don't have to write any Perl. You can find more information about the package here.

  3. Remove common English words and words that are used often on the Kaggle forums, since these are unlikely to be referring to a specific Python package. (Even though there is a Python package called "the" and people write "the" very often in the forums, they're never actually talking about the package called "the".)

  4. Put our KeywordProcessor into action and see if it works!

In [1]:

import numpy as np
import pandas as pd
import requests
from flashtext.keyword import KeywordProcessor
from nltk.corpus import stopwords

# let's read in a couple of forum posts
forum_posts = pd.read_csv("../input/ForumMessages.csv")

# get a smaller sub-set for playing around with
sample_posts = forum_posts.Message[0:3]

# get data from list of top 5000 pypi packages (last 30 days)
url = 'https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json'
data = requests.get(url).json()

# get just the list of package names
list_of_packages = [data_item['project'] for data_item in data['rows']]

# create a KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(list_of_packages)

# remove english stopwords
keyword_processor.remove_keywords_from_list(stopwords.words('english'))

# remove custom stopwords
keyword_processor.remove_keywords_from_list(['http','kaggle'])

# test our keyword processor
for post in sample_posts:
    keywords_found = keyword_processor.extract_keywords(post, span_info=True)
    print(keywords_found)
[('dataset', 34, 41), ('dataset', 123, 130)]
[('dataset', 4, 11), ('public', 119, 125), ('dataset', 149, 156), ('dataset', 201, 208), ('dataset', 374, 381), ('dataset', 433, 440), ('dataset', 656, 663), ('common', 844, 850), ('events', 1034, 1040)]
[('html', 318, 322), ('html', 370, 374), ('html', 418, 422), ('html', 460, 464)]

Yay, it works! The code isn't the prettiest, though. Even though I've included comments, I've just done everything in a single long script. If I break my code up into functions it will make it more modular and easier to read.

Make it pretty

My main goal in gussying up my code is to make it easier to add to my app later. So, I want to create functions that will let me do this. I broke my code into two functions.

  • One function does what I want my app to do: takes in a pre-trained model (here our keyword processor) and then applies it.

  • The other creates a keyword model, automating the preprocessing we did. That way if we want to create a new keyword processor with different words in the future we can easily do that.

With a little refactoring, my code now looks like this:

In [2]:

# I'm not going to read in the packages & data again since it's 
# already in our current environment.

def create_keywordProcessor(list_of_terms, remove_stopwords=True, 
                            custom_stopword_list=None):
    """ Creates a new flashtext KeywordProcessor and optionally 
    does some lightweight text cleaning to remove stopwords, including
    any provided by the user.
    """
    # create a KeywordProcessor
    keyword_processor = KeywordProcessor()
    keyword_processor.add_keywords_from_list(list_of_terms)

    # remove English stopwords if requested
    if remove_stopwords:
        keyword_processor.remove_keywords_from_list(stopwords.words('english'))

    # remove custom stopwords, if any were provided
    if custom_stopword_list:
        keyword_processor.remove_keywords_from_list(custom_stopword_list)
    
    return keyword_processor

def apply_keywordProcessor(keywordProcessor, text, span_info=True):
    """ Applies an existing keywordProcessor to a given piece of text. 
    Will return spans by default. 
    """
    keywords_found = keywordProcessor.extract_keywords(text, span_info=span_info)
    return keywords_found
    

# create a keywordProcessor of python packages    
py_package_keywordProcessor = create_keywordProcessor(list_of_packages, 
                                                      custom_stopword_list=["kaggle", "http"])

# apply it to some sample posts (with apply_keywordProcessor function, omitting
# span information)
for post in sample_posts:
    text = apply_keywordProcessor(py_package_keywordProcessor, post, span_info=False)
    print(text)
['dataset', 'dataset']
['dataset', 'public', 'dataset', 'dataset', 'dataset', 'dataset', 'dataset', 'common', 'events']
['html', 'html', 'html', 'html']

Make it portable

So far, we've gotten our code working and have refactored it so that it's more modular and other people trying to use it will be able to follow it more easily. But how can we take a model we've trained in one place and apply it in another? By saving the model and then loading it into a new environment.

In general, ML APIs are used for inference. In other words, you have a pre-trained model that is applied to the data that is sent to it. You could create an API that trains models, but you will probably end up paying for the compute used to train those models, which can become expensive. Since I'm not covering how to do authentication, anyone would be able to train as many models as they wanted with your API, so I'd recommend not including model training in your API. I'm assuming going forward that your API will use a pre-trained model.

So how do you save a model and then read it back into Python? It depends on the specific type of model you're training.

Library-specific methods for saving models

If the library you used to train your model has a specific set of methods to save and load model files, then I'd recommend using them. Here are the methods for some popular machine learning libraries.

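  • For TensorFlow and Keras, you can save a whole model using model.save() and read it back in using tf.keras.models.load_model(). If you only want to save the weights, you can use model.save_weights() and model.load_weights() instead. The file format depends on your version and on the file name you give; HDF5 is one common option. You can find more information on tf here and more on Keras here.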

  • For PyTorch you can use torch.save() and torch.load(). PyTorch saves models as pickles. More info here.

  • For XGBoost, you can save models with model.save_model() and load them with model.load_model(). (model.dump_model() writes a human-readable dump of the model that can't be loaded back in.) More info here.

  • For LightGBM, you can use model_to_string() and model_from_string(). More info here.

If there's no library-specific technique

If the library you're using doesn't have a specific method for saving out models, probably the easiest choice is to save your model as a pickle. Pickles are a serialized data format for Python, which means you can read them directly into your current environment as variables. If your model is built as one or more numpy arrays, you may be able to use HDF5 instead, but pickles can handle a wider range of data structures.

🥒 A quick warning about pickles 🥒 You should be aware that if there are differences between the version or subversion of Python you used to pickle your model and the one you try to unpickle it with, your model can fail to load. (This is also true of packages with different versions!) To avoid this, you can pin the exact versions of Python and each package you used in your requirements file when you create your app.
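
One way to notice a version mismatch early is to store the Python version next to the model when you pickle it, and compare it when you load. This is only a sketch under some assumptions: the function names dump_with_version and load_with_version_check, the dictionary layout and the file name are all made up for this example, and a plain dictionary stands in for a real model. It also can't rescue a pickle that fails to load at all; it just gives you a clearer warning when the load succeeds on a different Python version.

```python
import os
import pickle
import sys
import tempfile
import warnings


def dump_with_version(model, path):
    """Pickle a model together with the Python version used to save it."""
    payload = {
        "python_version": tuple(sys.version_info[:3]),
        "model": model,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_with_version_check(path):
    """Load a model saved by dump_with_version, warning on a version mismatch."""
    with open(path, "rb") as f:
        payload = pickle.load(f)

    saved = payload["python_version"]
    current = tuple(sys.version_info[:3])
    # compare the major and minor version numbers only
    if saved[:2] != current[:2]:
        warnings.warn(
            f"Model was pickled with Python {saved}, but this is Python {current}"
        )
    return payload["model"]


# toy stand-in for a trained model (hypothetical)
toy_model = {"keywords": ["pandas", "numpy"]}
model_path = os.path.join(tempfile.gettempdir(), "toy_model.pkl")

dump_with_version(toy_model, model_path)
reloaded = load_with_version_check(model_path)
```

The same idea extends to package versions: you could add them to the saved dictionary as well, alongside pinning them in your requirements file.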

🥒🥒 A second quick warning about pickles 🥒🥒 While pickles can be useful for moving Python objects around, they’re not a very secure file format. As the Python wiki puts it: “Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it.”
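
One common precaution (not a complete defence) is to only unpickle files you created yourself, and to check that a file hasn't changed since you saved it. The sketch below records a SHA-256 hash when the file is written and refuses to load the file if the hash doesn't match. The helper names file_sha256 and load_trusted_pickle and the file name are hypothetical, and a short list stands in for a real model.

```python
import hashlib
import os
import pickle
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def load_trusted_pickle(path, expected_sha256):
    """Unpickle a file only if its hash matches the one we recorded."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("Pickle file does not match the expected hash; not loading it.")
    # only reached when the bytes are exactly the ones we hashed earlier
    return pickle.loads(data)


# save a toy object and record its hash at the same time
path = os.path.join(tempfile.gettempdir(), "toy_processor.pkl")
with open(path, "wb") as f:
    pickle.dump(["pandas", "numpy"], f)
known_hash = file_sha256(path)

# later: load it back, checking the hash first
loaded = load_trusted_pickle(path, known_hash)
```

This only helps if the recorded hash is kept somewhere an attacker can't change it along with the file.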

In this case, the Flashtext module doesn’t have a native way to save and load objects (at least that I can find), so I’ll save my keyword processor as a pickle using the pickle library.

In [3]:

import pickle

# save our file (make sure we open it in "wb" mode, 
# which will let us _w_rite a _b_inary file)
with open("processor.pkl", "wb") as f:
    pickle.dump(py_package_keywordProcessor, f)

# check our current directory to make sure it saved
!ls
__notebook__.ipynb  __output__.json  processor.pkl

From here, we can load our model back in, assign it to a variable and apply it right away.

How can I download my trained model from this notebook? Right now, once you've saved a file, the easiest way to download it is to commit your notebook. Then you'll be able to download it from the "output" section of the notebook viewer. This will also give you a record of what code was used to produce your file so you can reproduce your work later on.

In [4]:

# read in a processor from our pickled file. Don't forget to 
# include "rb", which lets us _r_ead a _b_inary file.
with open("processor.pkl", "rb") as f:
    pickle_keywordProcessor = pickle.load(f)

# apply it to some sample text to make sure it works
apply_keywordProcessor(pickle_keywordProcessor, "I like pandas numpy and seaborn") 

Out[4]:

[('pandas', 7, 13), ('numpy', 14, 19), ('seaborn', 24, 31)]

At this point, we’re ready to actually start creating our Flask app. We've got:

  • A trained model. In this example, it’s a KeywordProcessor object that will let us find Python package names. It could be any model you like, though.

  • A function to apply that model. In this example, applying our model is pretty straightforward. Depending on what you’re working on, it could be more involved. For example, you may need to crop or resize images so that they fit the pixel dimensions that your model expects.

Since we’re using a trained model we don’t actually need to put any of the code or data we used to train it into our app. We can just use our model and the function that applies it.

Your turn

Make it work! Don’t worry about getting too fancy, just get your code working.

In [5]:

# your code here :)

Make it pretty! Copy and paste your code from the cell above into this cell. Now you can spend some time commenting and refactoring your code so that it’s ready to share. I’d recommend at the very least writing a function to apply your model.

In [6]:

# your code here :)

Make it portable! Finally, you’ll need to save out your trained model. You might want to read it back into your notebook and apply your prediction function to make sure it all works. :)

In [7]:

# your code here :)

Writing a Flask App

Alright, full disclosure: we're not writing an entire app today. (Sorry!) This is because the specific files you need to include in your app will depend on the platform you're using to deploy your API. We'll cover that in detail tomorrow.

We will, however, be writing the core code for our app. This will be broken into two files. Just to make it easier to follow along, I'll be making each "file" a single notebook cell, but to actually serve our app we'll need to save each cell as a separate file. Again, we'll cover all these steps tomorrow.

So, what will be in our two files?

  • A file serve.py that will:

    • import all the libraries we need

    • define a function that both loads our model and also defines and returns a second function that applies that model

  • A second file that we won't name yet that will:

    • import all the libraries we need, including serve.py

    • run the function we defined in serve.py (this loads in our model & returns the function that applies it, which we save under whatever variable name we choose)

    • create an instance of Flask

    • define a path and method for our API that matches the specifications we wrote yesterday

    • define a function to be executed at that path

Credit where it's due: this specific architecture is based on this API by Guillaume Genthial in this blog post. I'd recommend checking the blog post out, especially if you're working with a TensorFlow model.

So here's what's going into the two files that will do the bulk of the work in our app:

This is what will go in our serve.py file

Note that, for this function to work, we need to save our model file as "processor.pkl" in the same directory as our serve.py file. (It should already be in your current working directory because we dumped our model to a pickle in the section above.)

In [8]:

from flashtext.keyword import KeywordProcessor
import pickle

# Function that loads in our pickled keyword processor
# and defines a function for using it. This makes it easy
# to do these steps together when serving our model.
def get_keywords_api():
    
    # read in pickled keyword processor. You could also load in
    # other models at this step.
    with open("processor.pkl", "rb") as f:
        keyword_processor = pickle.load(f)
    
    # Function to apply our model & extract keywords from a 
    # provided bit of text. It uses the keyword_processor loaded
    # above, so callers only need to pass in the text.
    def keywords_api(text, span_info=True): 
        keywords_found = keyword_processor.extract_keywords(text, span_info=span_info)      
        return keywords_found
    
    # return the function we just defined
    return keywords_api
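
The pattern described for serve.py, a function that loads a model once and returns a second function that applies it, can be tried out without flashtext. In the sketch below a pickled set of words stands in for the model; get_toy_api, toy_api and the file name are invented for this illustration and aren't part of the app.

```python
import os
import pickle
import tempfile

# save a toy "model": just a set of package names (a stand-in for a real model)
model_path = os.path.join(tempfile.gettempdir(), "toy_words.pkl")
with open(model_path, "wb") as f:
    pickle.dump({"pandas", "numpy", "seaborn"}, f)


def get_toy_api(path):
    # load the model once, when the outer function is called
    with open(path, "rb") as f:
        known_words = pickle.load(f)

    # the inner function "remembers" known_words (a closure), so
    # callers only need to pass in the text
    def toy_api(text):
        return [word for word in text.split() if word in known_words]

    # return the function we just defined
    return toy_api


toy_api = get_toy_api(model_path)
result = toy_api("I like pandas numpy and seaborn")
print(result)  # ['pandas', 'numpy', 'seaborn']
```

The benefit of this design is that the (possibly slow) loading step happens once when the app starts, not on every request.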

This is what will go in our as-yet-unnamed second file

In [9]:

import json
from flask import Flask, request
#from serve import get_keywords_api
# I've commented out the last import because it won't work in kernels, 
# but you should uncomment it when we build our app tomorrow

# create an instance of Flask
app = Flask(__name__)

# load our pre-trained model & function
keywords_api = get_keywords_api()

# Define a post method for our API.
@app.route('/extractpackages', methods=['POST'])
def extractpackages():
    """ 
    Takes in JSON data, extracts the keywords &
    their indices and then returns them as JSON.
    """
    # the data the user sent, parsed from JSON
    input_data = request.json

    # use our API function to get the keywords
    output_data = keywords_api(input_data)

    # convert our output into a JSON string
    # (returning a Python object wouldn't be very
    # helpful for someone querying our API from
    # java; JSON is more flexible/portable)
    response = json.dumps(output_data)

    # return our JSON string
    return response
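
It can help to see what someone querying the API would get back. JSON has no tuple type, so the (keyword, start, end) tuples come out of json.dumps as nested lists. The sketch below uses only the standard library and reuses the sample output from Out[4]; parse_response is a hypothetical helper a Python client might write, not part of the app.

```python
import json

# output in the same shape our keyword processor returns with span_info=True
output_data = [('pandas', 7, 13), ('numpy', 14, 19), ('seaborn', 24, 31)]

# what the Flask function sends back: a JSON string
response = json.dumps(output_data)
print(response)  # [["pandas", 7, 13], ["numpy", 14, 19], ["seaborn", 24, 31]]


def parse_response(response_text):
    """Turn the JSON string back into a list of (keyword, start, end) tuples."""
    return [tuple(item) for item in json.loads(response_text)]


keywords = parse_response(response)
```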

Your turn!

Now that you've seen what a very small Flask app looks like, it's time for you to write your own. We'll work more with this code tomorrow when we finish our apps and serve them.

In addition to writing the code, if you don't already have them, I'd recommend creating these accounts ahead of time:

Write your serve.py file here

In [10]:

# your code here :) (feel free to copy & paste my code and then modify it for your project -- R)

Write your second file here

In [11]:

# your code here :)

See you all tomorrow! :)
