Backend server implementation: private repo
The Project Brief
The brief for this project was to implement an interface for technologically savvy users who wish to browse the online database of the Metropolitan Museum of Art. The interface could be realized as either a command-line tool or a backend application running on a server. It should accept query information such as art subject, classification, and other details, and communicate these to the Met’s REST API endpoints described at https://metmuseum.github.io. The ultimate output of the interaction should be JSON, with the intent that this output would be piped into other applications downstream. Acceptable languages for the project were Python, Rust, or Swift.
My decision was to write both the CLI and browser-based versions. My goals (beyond satisfying the requirements of the brief) were to learn new Python packages and to code with best practices in mind (single responsibility of functions, etc.) to generate clean, self-documenting code.
One: Exploring the API and Planning
I got to know the API via its documentation and experimented with simple browser queries. Each send/receive operation between the user and the API cost roughly two seconds. I learned that the project actually required two stages of data retrieval. The first stage was an endpoint query for objects matching the user’s input; the result was a list of art-object IDs that could contain thousands of values. The second stage was to retrieve the full object record for each of those IDs, which meant a serial set of requests to a different API endpoint, one per ID. The brief specified that the script should return only the first 80 values; run serially for 80 IDs, the send/receive actions would therefore total at most 81.
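The two stages can be sketched roughly as follows. The endpoint paths below follow the documentation at metmuseum.github.io; my script used the requests package, but this illustration sticks to the standard library’s urllib, and the helper names are inventions for the sketch rather than the script’s actual functions.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://collectionapi.metmuseum.org/public/collection/v1"

def search_url(query):
    """Stage one: the search endpoint returns IDs of matching objects."""
    return f"{BASE}/search?" + urllib.parse.urlencode({"q": query})

def object_url(object_id):
    """Stage two: the objects endpoint returns one full object record."""
    return f"{BASE}/objects/{object_id}"

def get_json(url):
    """One send/receive round trip (roughly two seconds each in practice)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def search_object_ids(query, limit=80):
    """Stage one, capped at the first `limit` IDs per the brief."""
    ids = get_json(search_url(query)).get("objectIDs") or []
    return ids[:limit]
```

Stage two then calls `get_json(object_url(oid))` once per ID, which is exactly the serial cost the next section addresses.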
My plan was to code both implementations into one script, to make it easy to download and use and to keep it fast.
Two: Coding for Time Performance: Concurrency
The brief mentioned that speed was a bonus consideration. A serialized approach seemed anything but speedy; a rough calculation revealed a troubling total:
((1 search query for the object-ID list) + (n sequential object-record requests, max 80)) × 2 seconds ≈ 162 seconds ≈ 2 minutes 42 seconds
The time complexity was O(n), so any speedup had to come from shrinking the constant factor. After a bit of research and exploration, concurrency was the obvious solution: instead of submitting 80 object-record requests in serial, threaded code could launch all 80 concurrently. This reduced the calculation to:
((1 search query for the object-ID list) + (1 concurrent batch of object-record requests)) × 2 seconds ≈ 4 seconds
The time complexity was identical, but the practical result was dramatic. Python supports this form of concurrency via the ThreadPoolExecutor class in concurrent.futures.
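As a minimal sketch of that approach, with a stand-in function in place of the real network request:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_record(object_id):
    """Stand-in for the real per-ID API request (~2 seconds each over the wire)."""
    return {"objectID": object_id}

def fetch_all(object_ids, max_workers=80):
    """Launch the per-ID requests concurrently instead of in serial.

    Executor.map preserves input order, so results line up with the IDs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_record, object_ids))

records = fetch_all(range(5))
```

With 80 workers, the 80 requests overlap and the wall-clock cost collapses to roughly one round trip.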
Three: Coding the CLI and Backend Server as One Script
I was no stranger to CLI coding after several years as a technical director at Pixar; the four main explorations and upgrades to my skills for this implementation were: 1) relying on the argparse package rather than coding yet another parser of my own, 2) employing the requests package for internet communication, 3) employing ThreadPoolExecutor for concurrency, and 4) standing up a server-based backend for browser use, which I had done only once before. Based on my experience, I felt there was value in having a single script that offered both the CLI and web versions of the project; I mention this as foreshadowing. I coded both implementations into one script, the interface version selectable via a command-line argument. I tested the JSON output by piping it into jq, found that in some cases it was misformatted, and ironed out those bugs. In implementing the server backend, I began by relying on Flask’s development server, Werkzeug. Werkzeug is unsuitable for production use, however, so I integrated Gunicorn instead.
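A rough sketch of what the argparse surface for such a dual-mode script might look like (the flag names here are hypothetical, not the script’s actual interface):

```python
import argparse

def build_parser():
    """Hypothetical CLI surface; the real script's flags may differ."""
    parser = argparse.ArgumentParser(
        description="Query the Met Museum collection API and emit JSON.")
    parser.add_argument("query", help="search terms, e.g. an art subject")
    parser.add_argument("--classification", help="restrict to a classification")
    parser.add_argument("--limit", type=int, default=80,
                        help="maximum number of object records (brief caps at 80)")
    parser.add_argument("--serve", action="store_true",
                        help="run the web backend instead of the CLI")
    return parser

args = build_parser().parse_args(["sunflowers", "--limit", "10"])
```

A boolean switch like `--serve` is what lets one script select between the CLI and the server interface at launch.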
Four: Polish
I added comments for clarity based on a format I had learned in a recent project. The format was very C-like: denoted by a hash, with each method carrying a comment block indicating “Purpose, Procedure, and Presumptions.” I worked with ChatGPT to refine the backend web interface’s layout and functionality. I wanted to present a version of the JSON response with live hyperlinks in the browser, so users could view images by clicking, while keeping pristine, unmodified JSON available via clipboard and download buttons. Further, there are roughly 1,200 art categories in the Met database, so I implemented an autocomplete function for the Categories field to guide users toward valid category values. The autocomplete relied on a list of categories scraped from a download of the database file; I embedded this list in the script, and the autocomplete itself was implemented via Ajax in the script’s embedded HTML section. I uploaded the project to GitHub and made sure the README was clear and the requirements.txt minimal but sufficient.
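The server-side half of such an autocomplete can be as simple as a prefix match over the category list; something like this hypothetical helper, which an Ajax handler would call with the user’s partial input:

```python
def autocomplete(prefix, categories, max_results=10):
    """Return up to max_results categories matching a case-insensitive prefix."""
    p = prefix.strip().lower()
    return [c for c in categories if c.lower().startswith(p)][:max_results]

# Illustrative subset of the ~1,200 real category names.
CATEGORIES = ["Paintings", "Photographs", "Prints", "Sculpture"]
```

The browser-side JavaScript then just fires a request on each keystroke and renders the returned suggestions under the field.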
Five: Presentation to Reviewers and Reception of Feedback
I delivered this version and later met with the reviewers via Zoom to discuss it. The review indicated that my code largely satisfied the requirements, but I received some extremely valuable constructive and instructive feedback: 1) my comment style stood out as ill-fitting in a Python context and would be better expressed as docstrings; 2) my web code, which was deeply indented and contained embedded HTML, JavaScript, Ajax, and more, should probably live in its own file, with the HTML split into its own file as well; 3) my error output, which simply printed to the terminal, should instead be directed to stderr; and finally 4) I had coded in camelCase instead of snake_case (having just finished a C-style Python project for Maya), and snake_case is preferred under the PEP 8 standard.
Six: Incorporation of Feedback and Final Delivery
Technically the project was complete, but I reasoned it would be valuable to complete the exercise by incorporating the reviewers’ thoughtful critique notes.
I split the monolithic project into two: a CLI script and a backend server version. I then split the server version further into the Python code and the HTML. This produced much cleaner code, and it was illuminating to recognize that my bias toward a single, all-encompassing script was less valuable than other considerations: readability, cleanliness, and thus accessibility for the coders who might visit the script after me.
I removed my comments and implemented them instead as docstrings.
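For example, the same “Purpose, Procedure, Presumptions” content, recast from hash comments into a docstring (the function name and wording here are illustrative):

```python
def fetch_record(object_id):
    """Retrieve the full record for one Met art object.

    Purpose:      return the parsed JSON record for a single object ID.
    Procedure:    issue one GET request to the objects endpoint and decode it.
    Presumptions: object_id came from a prior, successful search query.
    """
    ...
```

Unlike hash comments, docstrings are attached to the function object itself, so tools like `help()` and IDEs can surface them.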
I altered the output so that errors were sent to stderr rather than simply printed to the terminal.
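A minimal sketch of that change (the helper name is mine): diagnostics go to sys.stderr so that stdout carries nothing but the JSON meant for downstream pipes.

```python
import sys

def report_error(message):
    """Write diagnostics to stderr, keeping stdout clean for piped JSON."""
    print(f"error: {message}", file=sys.stderr)
```

On the command line this lets a user redirect diagnostics separately, e.g. with `2>/dev/null`, without corrupting the JSON stream headed into jq or another tool.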
I un-cameled my code and snaked it instead.
I broke out the Categories list into a text file in order to simplify the Python script.
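Loading that file back at startup is then only a few lines; a sketch, with an assumed filename:

```python
from pathlib import Path

def load_categories(path="categories.txt"):
    """Read one category per line, skipping blanks and stray whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Keeping the ~1,200 entries in a plain text file means the list can be regenerated from a fresh database download without touching the Python code.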
I refined and polished the README to reflect the changes, uploaded the results to GitHub, and finally delivered to the reviewers a note of gratitude along with a finished product of which I felt very proud.
