
AI Server Workflow, Nvidia Transfer Learning Toolkit Example

Continuing our series on Edge AI Servers and the rapid transition underway as developers migrate their workloads from traditional datacenters and clouds to smaller, distributed deployments running closer to users, let's investigate a specific Nvidia AI Server workflow where a hosted Jetson Nano or Jetson Xavier NX could make sense.

At GTC in May 2021, Nvidia launched the Transfer Learning Toolkit (TLT) 3.0, which is designed to help users build customized AI models quickly and easily. The process is rather straightforward: instead of creating and training a model from scratch, which is very time consuming, you take a pre-trained model such as PeopleNet, FaceDetectIR, ResNet, or MobileNet, add in your custom objects (images for vision applications, sounds for audio or language models), re-train with the added content, and then leverage the resulting model for inferencing on smaller edge devices.
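To make the idea concrete, here is a minimal, generic transfer-learning sketch using TensorFlow/Keras rather than the TLT tooling itself: an ImageNet-pretrained MobileNetV2 backbone is frozen and a new classification head is trained on a custom image dataset. The dataset path, class count, and output filename are hypothetical placeholders.

```python
# Minimal transfer-learning sketch with TensorFlow/Keras (not the TLT workflow
# itself): reuse a pre-trained MobileNetV2 backbone and retrain only a new
# classification head on a custom image dataset.
import tensorflow as tf

NUM_CLASSES = 3          # hypothetical: number of custom object classes
IMG_SIZE = (224, 224)

# Load the ImageNet-pretrained backbone and freeze its weights
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

# Attach a small, trainable head for the custom classes
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical directory of custom training images, one subfolder per class
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/custom_objects", image_size=IMG_SIZE, batch_size=32)

model.fit(train_ds, epochs=5)
model.save("custom_model.h5")   # candidate model to optimize for edge devices
```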

The Transfer Learning Toolkit is part of the larger TAO (Train, Adapt, Optimize) platform and is intended to run on substantial hardware. Nvidia's stated requirements are:

  • 32 GB system RAM
  • 32 GB of GPU RAM
  • 8 core CPU
  • 1 NVIDIA GPU
  • 100 GB of SSD space
  • TLT is supported on A100, V100, and RTX 30x0 GPUs.

However, the end result is a model that can run on a much smaller device like a Jetson Nano, TX2, or Xavier NX.

Looking closer at the TLT Quick Start documentation, installation begins with setting up your workstation, or launching a cloud VM such as an AWS EC2 P3 or G4 instance, which have Volta or Turing GPUs. There are a series of prerequisites to install and a TLT container to download from the Nvidia Container Registry, and once your Python environment is set up, you launch a Jupyter notebook that guides you through the rest.
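If you prefer to drive the process from a script rather than interactively, a rough sketch of what that might look like is below, assuming the TLT 3.0 launcher ("tlt" CLI) and its prerequisites are already installed. The mounts-file layout, spec file path, results directory, and model key shown here are assumptions; check them against the Quick Start documentation.

```python
# Rough sketch of scripting a TLT 3.0 training run from Python once the
# launcher ("tlt" CLI) and its prerequisites are installed. The mounts file
# layout, spec file path, model key, and output directory below are
# assumptions; consult the TLT Quick Start for the exact values.
import json
import subprocess
from pathlib import Path

# Map local directories into the TLT container (assumed ~/.tlt_mounts.json format)
mounts = {
    "Mounts": [
        {"source": str(Path.home() / "tlt-experiments"),
         "destination": "/workspace/tlt-experiments"},
    ]
}
(Path.home() / ".tlt_mounts.json").write_text(json.dumps(mounts, indent=2))

# Kick off training for a detection model inside the TLT container
subprocess.run([
    "tlt", "detectnet_v2", "train",
    "-e", "/workspace/tlt-experiments/specs/train_spec.txt",  # hypothetical spec
    "-r", "/workspace/tlt-experiments/results",               # hypothetical results dir
    "-k", "YOUR_NGC_API_KEY",                                  # model load key
], check=True)
```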

There is also a sample Computer Vision Jupyter notebook that can get you up and running quickly, located here:  https://ngc.nvidia.com/catalog/resources/nvidia:tlt_cv_samples

Once you have the Transfer Learning Toolkit workflow established, you can begin testing the resulting models on Jetson devices. This is where a hosted Jetson Nano functioning as an AI Server might make sense: you could set up some automation or a CI/CD workflow that trains with TLT and tests the results on the Jetson, as sketched below. Then, if everything passes and results are highly accurate, with detection, segmentation, and other tasks all performing well, you could deploy to your production devices in the field.
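As a rough illustration of that automation step, the sketch below copies a trained model to a hosted Jetson, runs a validation script there over SSH, and only promotes the model if it clears an accuracy threshold. The hostname, file paths, validation script, and 0.90 threshold are all hypothetical.

```python
# Hedged sketch of the automation step described above: after a TLT training
# run produces an exported model, copy it to a hosted Jetson, run a test
# script there over SSH, and gate deployment on the reported accuracy.
import subprocess

JETSON = "user@jetson-nano.example.com"    # hypothetical hosted Jetson
MODEL = "results/export/model.etlt"        # hypothetical TLT export artifact

# Copy the exported model to the Jetson
subprocess.run(["scp", MODEL, f"{JETSON}:/home/user/models/"], check=True)

# Run a validation script on the Jetson and capture its output
result = subprocess.run(
    ["ssh", JETSON,
     "python3 /home/user/run_validation.py /home/user/models/model.etlt"],
    capture_output=True, text=True, check=True)

accuracy = float(result.stdout.strip())    # assume the script prints a single mAP value
if accuracy >= 0.90:
    print("Validation passed, promoting model to production devices")
    # e.g. trigger a fleet update here
else:
    raise SystemExit(f"Validation failed (mAP={accuracy:.2f}), not deploying")
```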

This is of course only one example of the value of hosted AI servers, and we’ll continue looking at more use cases in the near future! 


The Future of AI Servers

Following up on the recent announcement of our new Raspberry Pi 4 AI Servers, it seems that AI servers running on Arm processors are gaining more and more traction in the market due to their natural fit at the IoT and Edge layers of infrastructure.  Let’s take a quick look at some of the unique properties that make AI Servers running on Arm a great strategy for AI/ML and AIoT deployments, to help understand why this is so important for the future.

Power – Many IoT deployments do not have luxuries that "regular" servers enjoy, such as reliable power and connectivity, or even ample power for that matter. While Intel has spent decades making excellent, though power-hungry, processors, Arm has focused on efficiency and battery life, which helps explain why it dominates the market in tablets and smartphones. That same efficiency is leveraged by IoT devices running AI workloads, so Edge devices responsible for computer vision, image classification, object detection, deep learning, or other workloads can operate with a much lower thermal footprint than a comparable x86 device.

Size – For the same underlying reasons behind power efficiency, the physical size and dimensions of Arm AI Servers can be made smaller than the majority of x86 designs. AI Accelerators such as the Gyrfalcon 2801 or 2803 can be attached via USB to boards as small as 2 inches square (such as the NanoPi Neo2), and adding a Google Coral TPU via the mini-PCIe slot on a NanoPi T4 brings an enormous amount of inferencing capability to AI Servers in tiny form factors.

Cost – Here again, Arm SoCs and Single Board Computers typically have a rather large cost advantage versus x86 embedded designs.

Scalability – This is a critical factor in why Arm will play a massive role in the future of AI Servers, and why miniNodes has begun to offer our Raspberry Pi 4 AI Server. As mentioned, low-power, inexpensive devices make great endpoints, but there is also a role for "medium" sized AI servers handling larger workloads, and Arm partners are just now starting to bring these products to market. One example is the SolidRun Janux AI Server, which makes use of the same Gyrfalcon AI Accelerators used by our nodes. So, you can get started training your models, testing out your deployment pipeline, understanding the various AI frameworks and algorithms, and getting comfortable with the tools, and then very easily scale up as your needs expand. Of course, once you reach massive amounts of deep learning and AI/ML processing, enterprise Arm server options exist for that as well.

Flexibility – Taking Scalability one step further, Arm AI servers also allow for a great amount of flexibility in the specific accelerator hardware (Gyrfalcon, Google Coral, Intel Movidius), the frameworks used (Caffe, PyTorch, TensorFlow, TinyML), and the models employed (ResNet, MobileNet, ShuffleNet).

Ubiquity – A final piece of the overall AI Server ecosystem is the ease of access to this type of hardware, and low barriers to entry.  The Raspberry Pi and similar types of boards are distributed globally, and readily available in nearly all markets.

As you can see, our view is that the future of AI servers is based on Arm SoCs, and now is the time to start exploring what that means for your workload.


Running AI Workloads on Arm Servers

Arm’s Role in Processing AI Workloads

The past several years have seen enormous gains in Artificial Intelligence, Machine Learning, Deep Learning, Autonomous Decision Making, and more. The availability of powerful GPUs and FPGAs, both on-premises and in the cloud, has certainly helped, but more and more of this AI processing is actually being done at the Edge, in small devices. The popularity of Amazon Alexa, Google Home, and AI-enabled features in smartphones such as Apple's Siri has skyrocketed over the past few years. The various frameworks and models such as TensorFlow, PyTorch, and Caffe have matured, and newer, lightweight versions have come along, such as TinyML, TensorFlow Lite, and other libraries designed to allow machine learning on the smallest devices possible. Local processing of audio and detecting specific sounds via waveform pattern matching, object recognition in a camera's frame, motions and gestures being monitored and observed, and vehicle safety systems that detect and respond immediately to changing conditions with no human intervention are some of the most common applications.

The work that it takes to develop these AI models is very specialized, but ultimately algorithms are created, a large sample of training data is fed into the system, and a model is developed that has a confidence factor and accuracy value. Once the model is deployed, real-time stream processing occurs, and actions can be taken based on the results of data flowing through the application. In the case of a computer vision application, for example, identifying certain objects can result in alerts (hospital staff notified), corrective actions (apply the brakes immediately), or data stored for later use.
As mentioned, more and more AI/ML is actually being processed at the Edge, on small form factor devices. And small form factor devices tend to be powered by Arm SoCs, as opposed to the more power-hungry x86 designs commonplace in laptops and desktops. Home devices like Alexa and Google Home, and nearly all smartphones, are based on Arm SoCs. Thus, AI models need to be created, tested, and verified for compatibility on Arm-powered devices. Even if an algorithm is developed and trained on a big GPU or FPGA, the resulting model should still be tested on Arm SoCs to ensure proper functionality. To help speed the testing process, miniNodes now offers hosted Arm microservers with dedicated AI accelerators that can offload AI tasks from the CPU and deliver excellent machine learning performance. These nodes can quickly and easily process workloads such as self-driving vehicle object detection, navigation, guidance, and behavior models; image classification and object recognition from cameras and video streams; convolutional neural networks and matrix multiplication workloads; robotics; weather simulation; and many other types of deep learning activities.
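As a concrete illustration of offloading inference to one of these accelerators, the sketch below uses TensorFlow Lite with the Coral Edge TPU delegate on an Arm board. The model file, delegate library name, and confidence threshold follow common Coral conventions but should be treated as assumptions; a quantized, Edge-TPU-compiled model is required for the delegate to do the work.

```python
# Illustrative sketch of offloading inference to an attached accelerator on an
# Arm board, using TensorFlow Lite with the Coral Edge TPU delegate. The model
# file and threshold are assumptions for your own setup.
import numpy as np
import tflite_runtime.interpreter as tflite

# Load a model compiled for the Edge TPU and bind it to the accelerator
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",                  # hypothetical model
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed one frame (random data standing in for a camera capture)
frame = np.random.randint(0, 256, size=input_details[0]["shape"], dtype=np.uint8)
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

scores = interpreter.get_tensor(output_details[0]["index"])[0]
if scores.max() >= 200:   # uint8 scores; roughly a 0.78 confidence threshold (assumed)
    print("Object detected with high confidence, taking action")
```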

Arm Lowers the Cost of AI Processing

AI training and inference in the cloud running on Arm microservers at miniNodes also offers a distinct cost advantage over Amazon AWS, Microsoft Azure, or Google GCE. Those services can very quickly cost thousands or tens of thousands of dollars per month, but many AI workloads can get by just fine with more modest hardware when paired with a dedicated AI accelerator like a Google Coral TPU, Intel Movidius NPU, or Gyrfalcon Matrix Processing Engine. AWS, Azure, and GCE provide great AI performance, sure, but you also pay heavily for the processor, memory, storage, and other components of the overall system. If you are ready to make use of those immense resources, wonderful. But if you are just starting out, are just learning AI/ML, are only beginning to test your AI modeling on Arm, or just have a lightweight use case, then going with a smaller underlying platform while retaining dedicated AI processing capability can make more sense.

miniNodes is still in the process of building out the full product lineup, but in the meantime Gyrfalcon 2801 and 2803 nodes are online and ready, with up to 16.8 TOPS of processing for ResNet, MobileNet, or VGG models. They are an easy, cost-effective way to get started with AI processing on Arm!

Check them out here:  https://www.mininodes.com/product/raspberrypi-4-ai-server/


Raspberry Pi 4 AI Server Now Available

As pioneers in the Arm micro server ecosystem, miniNodes has been an innovator and leading expert in the use of small devices to provide compute capacity at the Edge, has watched as IoT has matured and impacted all industries, and is now witnessing AI and Machine Learning depart the Cloud to instead be performed on-device or at the Edge of the network. More and more phones, home assistant devices (such as Echo), and even laptops include custom AI hardware accelerator chips designed to handle voice recognition, gesture and motion control, object detection, video and camera feed analysis, and many other deep learning tasks.

The AI models and algorithms that make this happen have to be trained and tested on specialized hardware accelerators as well, and historically that has been very expensive to perform in the cloud. miniNodes is taking a different approach, however, pairing custom hardware AI Accelerators with cost-effective Raspberry Pi 4 servers to lower the cost of testing and training these models while still maintaining high levels of performance. Deep learning, neural network, and matrix multiplication activities can be offloaded to the AI hardware, rapidly accelerating model training.

The first product to launch in the new miniNodes AI Server lineup is a Raspberry Pi 4 server combined with a Gyrfalcon 2801 NPU, for a maximum of 5.6 TOPS of dedicated AI processing power. In the future, we will expand the lineup to include Google Coral and Intel Movidius hardware as well.

The Raspberry Pi has always been one of the most popular hosted Arm Servers at miniNodes, even as far back as the original Raspberry Pi (1) Model B, some of which are still running! Over the years, we upgraded to Raspberry Pi 2, 3, and 3+ boards. So, it was only a matter of time until we deployed new, faster Raspberry Pi 4 servers.

However, with the launch of the Pi 4 and its increased capabilities, we decided it was time to upgrade our infrastructure, management, and backend systems to match. That work is still ongoing, but in order to start testing the ability to run AI workloads, we have made a few units available for early adopters to begin testing their ML models. If you are looking for a cost-effective way to get started with AI processing, or are interested in testing AI/ML on Arm cores, this is a great place to begin. Check out the new miniNodes AI Arm Server here: https://www.mininodes.com/product/raspberrypi-4-ai-server/

And if you have any questions, just drop us a note at info@mininodes.com!