
Arm Cloud-Native Ecosystem Recap, Fall/Winter 2021

For many years now, miniNodes has published a semi-annual blog post (or more!) with an update on all things Arm Servers.  But during this year’s Arm DevSummit, which just concluded, there was a large emphasis on automotive, 5G, and Cloud-Native development, which served as a good reminder that the Arm compute ecosystem has grown well beyond the traditional datacenter-centric Arm Servers.  Certainly, the big Arm Servers have gained a lot of traction amongst the hyperscalers, with AWS producing the Graviton processor and now deploying multiple products on it, and Oracle recently launching cloud servers based on the Ampere Altra platform.  However, more broadly speaking, Arm compute is expanding and covering more use-cases than before, thanks to a shifting mindset of how and where applications should be deployed.  While the past decade was all about placing workloads in the cloud, the next decade is going to be all about bringing workloads back out of the cloud and running them closer to the user, at the Edge of the network.  These endpoints might be smaller servers, or they might be cars, aircraft, windmills, factories, 5G basestations, autonomous bots, and more.

Cloud-native design principles, such as packaging applications in containers for deployment, allow developers to build an application once and push it anywhere, independent of the underlying hardware, enabling a Cloud-to-Edge-to-IoT infrastructure that is seamless and tightly integrated.

With Arm compute in all of these devices, developers can build applications and frameworks that place big compute on the Arm Servers, smaller, “just what’s needed” compute on the Edge servers, and specialized workloads on the “Things” such as self-driving vehicles, smart cameras doing computer vision and machine learning tasks, cellular gateways, etc.

These specialized segments are now beginning to create their own hardware and reference designs targeting their individual use-cases, for example SOAFEE in the case of autonomous self-driving vehicles.  The first reference platform to be released is from ADLINK, with their AVA system built around an Ampere Altra SoC.  The AVA runs the Autoware software stack, with input and assistance from Arm as well as TierIV.  In the 5G Gateway ecosystem, Magma is an open-source software stack devoted to deploying gateways and managing connectivity both upstream and downstream to connected phones.

The AI, Computer Vision, and Machine Learning ecosystem is a bit more fragmented however, with a multitude of platform providers, specialized hardware, and software frameworks to choose from.  The major players in this segment are Nvidia, with its Jetson Nano, Xavier NX, and Xavier AGX systems containing Arm cores, plus associated tooling such as CUDA on the devices, the Transfer Learning Toolkit for retraining models, and Fleet Command for running workloads on edge servers, and Google, with its Coral hardware devices built on Arm cores and the TensorFlow software stack.  OctoML, Roboflow, Plainsight, and similar AI platforms also exist, targeting simplified model creation and deployment to devices with easy-to-use web-based tools.  Because of the varied nature of the target problems and the inherently unique AI/ML models being generated (each model is specifically customized and tailored per use-case), this ecosystem will likely remain fragmented, without a reference design that everyone agrees upon.

One last note to consider is that for all of these specialized use-cases to gain traction, reach scale, and allow cloud-native best practices to take hold, these devices need to boot up and function in a standard, well understood way.  Developers will not adopt these platforms if the hardware is difficult to work with, doesn’t allow them to quickly install their OS of choice and languages / tooling, or can’t integrate easily with other existing infrastructure.  In that case, developers will simply return to what they already know works (x86), and sacrifice the advantages gained by making the switch to Arm.  Thus, the SystemReady standards that address device boot processes and hardware descriptions are critical to the success of these new specialty niche ecosystems.  SystemReady IR for IoT devices, ES for Embedded Servers, and SR for Servers are important standards that need to be adopted by the SoC manufacturers and hardware vendors, to ensure developers end up with a device that behaves like the systems they already know and use on a daily basis.

ADLink SOAFEE AVA Workstation


AI Server Workflow, Nvidia Transfer Learning Toolkit Example

Continuing our series on Edge AI Servers and the rapid transition underway as developers migrate their workloads from traditional datacenters / clouds to smaller, distributed deployments running closer to users, let’s investigate a specific Nvidia AI Server workflow where a hosted Jetson Nano or Jetson Xavier NX could make sense.

At GTC in May 2021, Nvidia launched the Transfer Learning Toolkit (TLT) 3.0, which is designed to help users build customized AI models quickly and easily.  The process is rather straightforward:  Instead of creating and training a model from scratch, which is very time consuming, you instead take a pre-trained model such as PeopleNet, FaceDetectIR, ResNet, MobileNet, etc, add in your custom object (image for vision applications, sound for audio or language models), re-train with the added content, and then you can leverage the resulting output model for inferencing on smaller, edge devices.
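To make the idea concrete, here is a minimal, hypothetical sketch of that transfer-learning pattern using TensorFlow/Keras.  Note that this is not the TLT workflow itself (TLT drives Nvidia’s own pre-trained models through its launcher and notebooks); the model choice, class count, and dataset path below are placeholders purely for illustration.

```python
# Illustrative transfer-learning sketch (not the Nvidia TLT workflow itself).
# Assumes a folder of labeled images; input normalization is omitted for brevity.
import tensorflow as tf

NUM_CLASSES = 3            # hypothetical number of custom object classes
DATA_DIR = "custom_data"   # hypothetical folder of labeled training images

# Start from a network pre-trained on ImageNet instead of training from scratch
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False     # freeze the pre-trained feature extractor

# Add a small classification head for the new custom classes
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Re-train only the new head on the added content
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_DIR, image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)

# Export the re-trained model for optimization and deployment to edge devices
model.save("retrained_model")
```

The key point is the same as with TLT: only the small new head is trained against your custom data, which is far faster and cheaper than training the full network from scratch.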

The Transfer Learning Toolkit is part of the larger TAO (Train, Adapt, Optimize) platform, and is intended to be run on big hardware.  Their requirements state:

  • 32 GB system RAM
  • 32 GB of GPU RAM
  • 8 core CPU
  • 1 NVIDIA GPU
  • 100 GB of SSD space
  • TLT is supported on A100, V100 and RTX 30×0 GPUs. 

However, the end result is a model that can run on a much smaller device like a Jetson Nano, TX2, or Xavier NX.

Looking closer at the TLT Quick Start documentation, installation begins with setting up your workstation, or launching a cloud VM like an AWS EC2 P3 or G4 instance, which have Volta or Turing GPUs respectively.  There are a series of prerequisites to install and a TLT container to download from the Nvidia Container Registry, and once your Python environment is set up, you launch a Jupyter notebook that guides you through the rest.

There is also a sample Computer Vision Jupyter notebook that can get you up and running quickly, located here:  https://ngc.nvidia.com/catalog/resources/nvidia:tlt_cv_samples

Once you have the Transfer Learning Toolkit workflow established, you can begin testing the output and resulting models on Jetson devices.  This is where a hosted Jetson Nano functioning as an AI Server might make sense, as you could set up some automation or a CI/CD workflow for training with TLT and testing the results on the Jetson.  Then, if everything passes (results are highly accurate, and detection, segmentation, etc. are all performing well), you could deploy to your production devices in the field.
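As an illustration of what that automated gate might look like, here is a hypothetical Python script that could run on the hosted Jetson as a CI/CD step: it evaluates the freshly re-trained model against a held-out validation set and fails the pipeline if accuracy falls below a threshold.  The paths, threshold, and evaluation details are assumptions for illustration, not part of the TLT tooling.

```python
# Hypothetical CI/CD gate running on a hosted Jetson: evaluate the newly
# exported model on a held-out validation set and fail the build if accuracy
# is below a chosen bar. Paths and threshold are placeholders.
import json
import sys
import tensorflow as tf

MODEL_DIR = "retrained_model"    # placeholder: model exported by the training stage
VAL_DATA = "validation_set"      # placeholder: labeled images held out for testing
ACCURACY_THRESHOLD = 0.90        # placeholder pass/fail bar

def evaluate():
    # Load the exported model and measure accuracy on the validation set
    model = tf.keras.models.load_model(MODEL_DIR)
    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
        VAL_DATA, image_size=(224, 224), batch_size=32, shuffle=False)
    _, accuracy = model.evaluate(val_ds)
    return accuracy

if __name__ == "__main__":
    acc = evaluate()
    print(json.dumps({"accuracy": float(acc)}))
    if acc < ACCURACY_THRESHOLD:
        # A non-zero exit code tells the CI system not to promote this model
        sys.exit(1)
```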

This is of course only one example of the value of hosted AI servers, and we’ll continue looking at more use cases in the near future! 


The Edge AI Server Revolution (Driven by Arm, Of Course)

The past 2 years have seen rapid growth in experimentation and ultimately deployment and adoption of AI/ML at the Edge. This has been fueled by dramatic increases in on-device AI processing capability, and equally dramatic reduction in size and power requirements of devices. The Nvidia Jetson Nano with its GPU and CUDA cores, Google Coral Dev Board containing a TPU for Tensorflow acceleration, and even Microcontrollers running TinyML have quickly gained widespread adoption among developers. These devices are cheap, accurate, and readily available, allowing developers to deploy AI/ML workloads to places that were not practical just a short time ago.

Hosted Nvidia Jetson Nano

This allows developers to re-think their applications and begin to migrate AI workloads out of the datacenter, which was previously the only place to run their AI/ML tasks, potentially saving money or improving performance by moving processing closer to where it is needed. This also allows for net-new capability, adding computer vision, object detection, pose estimation, etc., in places where it previously was not possible.
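For a sense of how small these edge workloads can be, here is a minimal sketch of on-device image classification using the TensorFlow Lite runtime, the sort of task a Jetson Nano or Coral board handles comfortably.  The model file, the input image, and the assumption of a quantized uint8 model are placeholders for illustration.

```python
# Minimal on-device inference sketch with the TensorFlow Lite runtime.
# Assumes a quantized (uint8 input) classification model; file names are placeholders.
import numpy as np
import tflite_runtime.interpreter as tflite
from PIL import Image

interpreter = tflite.Interpreter(model_path="classifier.tflite")  # placeholder model
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize the frame to the model's expected input shape
_, height, width, _ = input_details[0]["shape"]
frame = Image.open("frame.jpg").resize((width, height))           # placeholder image
input_data = np.expand_dims(np.asarray(frame, dtype=np.uint8), axis=0)

# Run a single inference pass on the device
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

# For a classification model, the output is one score per class
scores = interpreter.get_tensor(output_details[0]["index"])[0]
print("Top class:", int(np.argmax(scores)), "score:", float(np.max(scores)))
```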

In order to help prepare developers and allow them to experiment and build their skills, miniNodes is making available some Edge AI inspired Arm Servers, starting with the Nvidia Jetson Nano. These nodes are intended to be used by engineers and teams just getting started on their Edge AI journey, who are testing their applications and deep learning algorithms. Another use for an Edge AI Arm Server is light-duty AI processing, where it doesn’t make financial sense to rent big AI servers from the likes of AWS or Azure, and a smaller device will work just fine. Finally, developers and teams whose AI training is not time-sensitive, or is relatively small, can achieve significant savings by using a hosted Jetson Nano for their model training, instead of local GPUs or AWS resources.

Whether you are just getting started and beginning to explore Edge AI, or have been following the trend and already have Edge AI projects underway, a miniNodes hosted Jetson Nano is a great way to gain hosted AI processing capability or reduce AWS and Azure cloud AI costs.


The Future of AI Servers

Following up on the recent announcement of our new Raspberry Pi 4 AI Servers, it seems that AI servers running on Arm processors are gaining more and more traction in the market due to their natural fit at the IoT and Edge layers of infrastructure.  Let’s take a quick look at some of the unique properties that make AI Servers running on Arm a great strategy for AI/ML and AIoT deployments, to help understand why this is so important for the future.

Power – Many IoT deployments do not have luxuries that “regular” servers enjoy, such as reliable power and connectivity, or even ample power for that matter.  While Intel has spent decades making excellent, though power-hungry, processors, Arm has focused on efficiency and battery life, helping to explain why they dominate the market in tablets and smartphones.  This same efficiency is then leveraged by IoT devices running AI workloads, so Edge devices responsible for computer vision, image classification, object detection, deep learning, or other workloads can operate with a much lower thermal footprint than a comparable x86 device.

Size – Similar to the underlying reasons behind power efficiency, the physical size and dimensions of Arm AI Servers can be made smaller than the majority of x86 designs.  AI Accelerators such as the Gyrfalcon 2801 or 2803 can be attached via USB to boards as small as 2 inches square (such as the NanoPi Neo2), and the addition of a Google Coral TPU via the mini-PCIe slot on a NanoPi T4 brings an enormous amount of inferencing capability to AI Servers in tiny form factors.

Cost – Here again, Arm SoCs and Single Board Computers typically have a rather large cost advantage versus x86 embedded designs.

Scalability – This is a critical factor in why Arm will play a massive role in the future of AI Servers, and why miniNodes has begun to offer our Raspberry Pi 4 AI Server.  As mentioned, low power, cheap devices make great endpoints, but there is also a role for “medium” sized AI servers handling larger workloads, and Arm partners are just now starting to bring these products to market.  An example is the SolidRun Janux AI Server, which also makes use of the same Gyrfalcon AI Accelerators used by our nodes.  So, you can get started training your models, testing out your deployment pipeline, understanding the various AI frameworks and algorithms, and getting comfortable with the tools, and very easily scale up as your needs expand.  Of course, once you reach massive amounts of deep learning and AI/ML processing, enterprise Arm server options exist for that as well.

Flexibility – Taking Scalability one step further, the Arm AI servers also allow for a great amount of flexibility in the specific accelerator hardware (Gyrfalcon, Google Coral, Intel Movidius), the frameworks used (Caffe, PyTorch, TensorFlow, TinyML), and the models (ResNet, MobileNet, ShuffleNet) employed.  
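As a small illustration of that flexibility, the sketch below uses torchvision to show that swapping the backbone model (ResNet, MobileNet, ShuffleNet) is a one-line change, so the same inference code can be re-targeted as accuracy and latency needs change.  The specific model choice and the dummy input are just for demonstration.

```python
# Illustration of model flexibility: the backbone is selected by name, and the
# rest of the inference code stays the same. Model choice here is arbitrary.
import torch
from torchvision import models

# Any of these could be chosen based on the accuracy/latency trade-off required
backbones = {
    "resnet18": models.resnet18,
    "mobilenet_v2": models.mobilenet_v2,
    "shufflenet_v2_x1_0": models.shufflenet_v2_x1_0,
}

model = backbones["mobilenet_v2"](pretrained=True)
model.eval()

# Dummy 224x224 RGB input just to confirm the chosen backbone runs end to end
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    output = model(dummy)
print(output.shape)  # torch.Size([1, 1000]), ImageNet class scores
```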

Ubiquity – A final piece of the overall AI Server ecosystem is the ease of access to this type of hardware, and low barriers to entry.  The Raspberry Pi and similar types of boards are distributed globally, and readily available in nearly all markets.

As you can see, our view is that the future of AI servers is based on Arm SoCs, and now is the time to start exploring what that means for your workload.