Why CUDA Will Have No Widespread Adoption For Mac

Posted by admin

CUDA 10 adds a feature called “CUDA Graphs” that will be immediately familiar to graphics API designers: they are a scene graph API for compute. Scene graph APIs enable developers to describe geometry at a “higher”™ level, in ways that express the relationships between, say, rooms and doorways within a castle or the arms and legs of a 3D character. The idea is that with this additional information, the API implementor (in this case, NVIDIA) can write code that will traverse the scene graph (say, rendering the characters with their limbs animated) more efficiently than code written by the developer. Either that, or the scene graph API is sufficiently easier to learn than writing the scene graph code directly that developers can achieve faster time-to-market by learning and using the scene graph API. I am skeptical that CUDA Graphs will achieve adoption outside NVIDIA’s SDK samples. API designers drive adoption by maximizing the return on investment (ROI), where the return is efficient, working code and the investment is developer time.

APIs that are not easy to learn are disadvantaged because every developer who writes or maintains the code must invest in learning the API. APIs that don’t deliver a compelling performance advantage must be very easy to learn, conferring an expressive advantage (i.e., faster development times). CUDA adoption has been driven by delivering huge performance gains (the return) despite a steep learning curve (the investment). (It makes for an interesting thought experiment to wonder why other manycore platforms have not seen the same adoption. Although this blog post does not touch on the issue, customer investments must be considered in addition to developer investments.) An early API (in fact, it was created in the 1970s, long before the term “API” had been invented) that delivers high ROI is BLAS, the Basic Linear Algebra Subprograms. The library was originally written in FORTRAN, and its motivations were twofold: to “provide names and argument lists that might become widely used and recognized for some of the basic operations of computational linear algebra,” and “to improve efficiency of math software.” BLAS code is reasonably performance- and platform-portable.

As the underlying platforms evolved, the same BLAS code benefited transparently from assembly language hand-coding to cache blocking to SIMD instruction sets. There was no need to update the API client code as the implementation changed underneath.
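To make that stability concrete, here is a minimal sketch of an SGEMM call through the C bindings (CBLAS); the matrix shapes and values are arbitrary. The same client code can be compiled and linked, unchanged, against the reference implementation or against heavily tuned libraries such as OpenBLAS or MKL:

    #include <cblas.h>   /* C bindings to BLAS; link against any implementation */
    #include <stdio.h>

    int main(void)
    {
        /* C = alpha*A*B + beta*C, with A (2x3), B (3x2), C (2x2), row-major */
        float A[2*3] = { 1, 2, 3,
                         4, 5, 6 };
        float B[3*2] = { 7,  8,
                         9, 10,
                        11, 12 };
        float C[2*2] = { 0, 0,
                         0, 0 };

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 3,        /* M, N, K */
                    1.0f,           /* alpha */
                    A, 3,           /* A and its leading dimension */
                    B, 2,           /* B and its leading dimension */
                    0.0f,           /* beta */
                    C, 2);          /* C and its leading dimension */

        printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);
        return 0;
    }

The caller never needs to know whether the implementation underneath is blocking for cache, using SIMD, or dispatching to hand-written assembly.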

BLAS has achieved widespread adoption in numerical code, amplifying developers’ expressive power and enabling them to leverage the development effort invested by others in its implementation. At this point, BLAS gets an inordinate amount of attention from hardware vendors, making it unlikely that developers can match its performance without exploiting a priori knowledge of their application requirements. It takes time to learn, but delivers a considerable return on that investment. On the other end of the spectrum, an API that has high ROI by minimizing developer investment is malloc/free. Learning first-hand the difficulty of writing a fast, robust memory allocator has been an inflection point in many junior developers’ careers – it’s harder than it looks. Other APIs that deliver a high return with minimal investment: the thread synchronization APIs built into operating systems. They are not hard to learn and, for most developers, impossible to implement.

In the early days (DirectX 2.0-3.0), Direct3D had a scene graph API called the “retained mode,” but the last version shipped in 1996. No one was using it, despite heroic evangelism efforts by its developers. Developers could use “immediate mode” APIs to implement their own scene graphs more efficiently – both in terms of developer time and in terms of high-performance implementations of the operations they needed. As an added bonus, by writing the scene graph traversal themselves, developers kept all the IP in-house (e.g., their visibility algorithm) and, if there was a bug, they could fix it in their code on their own schedule.


Since game developers co-design their content development tools with the runtime, a great deal of intellectual property is encapsulated in the scene graph traversal. In a sense, 3D scene graph API designers were aspiring to co-opt developers’ core IP – never a winning proposition for a platform. I suspect that CUDA developers will come to similar conclusions about CUDA Graphs: no one will use them unless they deliver a return on investment in the form of higher performance, or greater expressiveness commensurate with the effort to learn the APIs. Higher performance will be difficult to achieve, since CUDA gives developers ready access to the underlying tools used by CUDA Graphs.

One possible opportunity for NVIDIA: perhaps CUDA Graphs will be an efficient way to enable concurrent execution of kernels that weren’t designed to run in streams? CUDA streams are difficult to retrofit into existing code because they must be plumbed into interfaces from top to bottom. An alternative to revisiting interfaces top-to-bottom is to add a “current stream” API (as CUBLAS did), but current-anything APIs interoperate poorly and tend to be inefficient at changing the current-thing.
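As a rough sketch of what that might look like with the stream-capture entry points in recent CUDA releases (the kernels stepA and stepB, their launch configurations, and the buffer names here are hypothetical stand-ins for existing code, not anything from NVIDIA’s samples):

    __global__ void stepA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] += 1.0f; }
    __global__ void stepB(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] *= 2.0f; }

    void runAsGraph(float *d_x, float *d_y, int n)
    {
        cudaStream_t s1, s2;
        cudaEvent_t fork, join;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        cudaEventCreate(&fork);
        cudaEventCreate(&join);

        // Capture the launches into a graph instead of executing them immediately.
        cudaGraph_t graph;
        cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal);
        cudaEventRecord(fork, s1);                        // fork: bring s2 into the capture
        cudaStreamWaitEvent(s2, fork, 0);
        stepA<<<(n + 255) / 256, 256, 0, s1>>>(d_x, n);   // independent launches: the graph
        stepB<<<(n + 255) / 256, 256, 0, s2>>>(d_y, n);   // records no dependency between them
        cudaEventRecord(join, s2);                        // join: s1 waits for s2's work
        cudaStreamWaitEvent(s1, join, 0);
        cudaStreamEndCapture(s1, &graph);

        // Instantiate once; then launch the whole dependency graph with a single call.
        cudaGraphExec_t graphExec;
        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
        cudaGraphLaunch(graphExec, s1);
        cudaStreamSynchronize(s1);
        // (cleanup of the graph, events, and streams omitted)
    }

Note that even with capture, the work still has to be issued into capturing streams, with cross-stream dependencies expressed via events, so some top-to-bottom plumbing remains.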

More importantly, the current-thing state must be saved and restored across interfaces. So one path to adoption for CUDA Graphs may be as an efficient way to enable concurrent execution of kernels that weren’t designed to use streams. But in general, as with immediate-mode graphics APIs, most developers will be able to write their own code expressing the dependencies in their application more quickly than they could learn and use the CUDA Graphs APIs. And developer-authored code will run at least as fast. Unless CUDA Graphs deliver a high ROI, they will go the same way as other features that Seemed Like A Neat Idea At The Time, like dynamic parallelism and managed memory.

NVIDIA just announced a Volta-based refresh of the DGX-1 – great news for those who need the additional compute power of GV100 versus GP100:

                      GP100        GV100
    FP32 Compute      10.6 TFLOPS  15.0 TFLOPS
    FP64 Compute      5.30 TFLOPS  7.50 TFLOPS
    Memory Bandwidth  720 GB/s     900 GB/s

Wait, you say, that’s an interesting qualifier. Who doesn’t “need the additional compute power?” Did someone hack into Nick’s blog account and post on his behalf?

Or has he become a Luddite in his dotage? Nope, no, I still think more compute is generally better; but it is past time to question the architecture of these systems with huge, discrete GPUs connected to the world by buses. The problem with DGX-1 is that those GPUs are hungry! They need to be fed! And they can only sip data through the tiny soda straw known as the PCI Express bus.

For perspective, let’s compare these chips to G80, the first CUDA-capable GPU. Let’s set the stage by observing that G80 was the largest ASIC NVIDIA could feasibly design and fabricate in 2006, straining the limits of contemporary fabrication technology – a classic “win” chip. It had 684M transistors, a theoretical maximum performance of 384 GFLOPS for single precision, and no support at all for double precision. GP100 and GV100 respectively have 22x and 31x more transistors, and 27x and 39x more single precision performance, than G80. But the bandwidth to deliver data to and from these GPUs has not been increasing commensurately with that performance. Here’s a table for all 3 GPUs – G80, GP100 and GV100 – that highlights the FLOPS per byte of bandwidth for device memory (attached to the GPU), NVLINK (NVIDIA’s proprietary GPU-GPU interconnect), and PCI Express:

                            G80        GP100      GV100
    GFLOPS (SP)             384        10,600     15,000
    GPU↔device memory       84 GB/s    720 GB/s   900 GB/s
      FLOPS/byte            4.5        14.7       16.67
    GPU↔GPU (NVLINK)        n/a        20 GB/s    20 GB/s
      FLOPS/byte            n/a        530        750
    CPU↔GPU (PCIe)          3.1 GB/s   3.1 GB/s   3.1 GB/s
      FLOPS/byte            124        3419       4839

The 3.1 GB/s figure comes from dividing the available PCIe bandwidth by the number of GPUs in the system.

Two 16-lane PCIe 3.0 connections deliver about 25 GB/s observed, and there are 8 GPUs. As the number of FLOPS per byte of I/O diverges, the number of workloads that benefit from more FLOPS diminishes. Googling around for literature on FLOPS/byte, I ran across a presentation by Peter Kogge entitled “Hardware Evolution Trends of Extreme Scale Computing.” For anyone in the GPU business, the first sign that something’s amiss crops up in Slide 3, which cites “1 byte/FLOP” as the “classical goal.” Even G80’s device memory fell well short of that goal, at 4.5 FLOPS/byte (here, higher numbers are worse). I prefer this framing because it adopts the viewpoint of scarcity (bytes/FLOP – getting data in and out for processing) rather than abundance (FLOPS/byte – having lots of processing power to bring to bear on data once it is in hand). The presentation is from 2011, but still very relevant: after reviewing Moore’s Law, the rise and fall of Dennard scaling, and the preeminent importance of power dissipation in modern computing, the concluding slide reads in part: the world has gone to multi-core to continue Moore’s Law.

Pushing performance another 1000X will be tough. The major problem is in energy. And that energy is in memory & interconnect.

We need to begin rearchitecting to reflect this. DON’T MOVE THE DATA! “DON’T MOVE THE DATA” has been good advice for decades (in 1992 I wrote a Dr. Dobb’s Journal article that focused on hand-coding x87 assembly to keep intermediate results in registers), but the advice has more currency now.

Moving The Data on CPUs

The data/compute conundrum finds expression on modern multi-core CPUs, too. Each core on a modern x86 CPU has an ILP (instruction level parallelism) of 5, meaning it can detect parallelism opportunities between non-dependent instructions and execute up to 5 instructions in a single clock cycle.

Latency to the L3 cache is about 50 clock cycles, so a CPU core can perform dozens of FLOPS on data in registers during the time it takes for the L3 to service a load (conservatively – 2 of the 5 pipelines can do 8 FLOPS per instruction via AVX). And that’s assuming the data was in cache! As an aside, this observation helps explain why “optimized” numerical Python code is still dead slow. Python is interpreted, so numerical code relies on a library called NumPy that wraps vectorized implementations of operations such as element-wise addition or multiplication between arrays. But for arrays that don’t fit in cache (and to some extent, even arrays that do fit in cache), it is very inefficient to do multiple passes over the data if the computation could have been fused into a single pass. The code spends all of its time moving data, and very little time processing it.
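The same effect is easy to demonstrate in plain C (the operation here, y = a*x + b, is an arbitrary example): computing it as two separate array-at-a-time passes streams the data through memory twice, while a fused loop streams it once.

    #include <stddef.h>

    /* Two passes: tmp is written by the first loop and re-read by the second,
       doubling the memory traffic for arrays that don't fit in cache. */
    void scale_then_add(const float *x, float *tmp, float *y,
                        float a, float b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            tmp[i] = a * x[i];          /* pass 1: streams x and tmp */
        for (size_t i = 0; i < n; i++)
            y[i] = tmp[i] + b;          /* pass 2: streams tmp and y  */
    }

    /* Fused: each element is loaded once and stored once. */
    void fused(const float *x, float *y, float a, float b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + b;
    }

For arrays larger than the last-level cache, the two-pass version is bound by memory traffic, not arithmetic, which is exactly the situation array-at-a-time libraries put you in.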


DON’T MOVE THE DATA!

A Gift From Heaven: Deep Learning

Which workloads, pray tell, require endless FLOPS per byte of I/O? Or turn it around and ask, which workloads still thrive when there is barely any I/O per FLOP? NVIDIA hasn’t been shy about trumpeting its solution to this problem: deep learning!

Training a deep learning network entails refining floating point weights that roughly represent neurons that “learn” as they are trained on the data. As long as the weights can reside in device memory, only a modest amount of I/O is needed to keep the GPU busy. In retrospect, NVIDIA is extremely fortunate that deep learning cropped up. Without it, it’s not clear what workload could soak up all those FLOPS without the GPUs starving.

The importance of machine learning as a workload helps explain why GV100 contains purpose-built hardware for machine learning, in the form of its Tensor Cores. But that hardware actually exacerbates the GPU starvation problem, by increasing FLOPS without increasing bandwidth. NVIDIA probably isn’t comfortable betting the farm on a single workload – especially one where their main customers are enterprises that can invest in hardware of their own, and one that is attracting plenty of competition. How do you hedge? How can NVIDIA relieve the bottleneck?

Unless some workload materializes that is as compute-intensive (per byte of I/O) as machine learning, NVIDIA must seek out ways to address their GPUs’ I/O bottleneck.

I/O: NVIDIA’s Strategic Landscape

The problem confronting NVIDIA is that they are hindered by some business and legal challenges. According to the terms of their 2011 cross-licensing agreement with Intel, 1) they do not have a license to Intel’s industry-leading cache coherency protocol technology, and 2) they do not have a license to build x86 CPUs, or even x86 emulators. NVIDIA has done what they can with the hand they were dealt: they built GPUDirect RDMA to enable fellow citizens of the bus (typically InfiniBand controllers) to access GPU memory without CPU intervention, and they built NVLINK, a proprietary interconnect with its own cache coherency protocol. They have implemented NVLINK for the POWER architecture and signaled a willingness to license it to others.

The problem is that POWER and ARM64 are inferior to Intel’s x86, whose high-end CPU performance is unmatched and whose “uncore” enables fast, cache coherent access across sockets. NVIDIA itself, though an ARM licensee, has announced that it will not be building its own ARM64 CPUs to drive its GPUs. I’m not sure why NVIDIA made that announcement, because building such a CPU seems like an obvious way for them to own their destiny. It may be that NVIDIA concluded that ARM64 cores simply will never deliver enough performance to drive their GPUs. That’s too bad, because there is a lot of low-hanging fruit in NVIDIA’s driver stack.

If they made the software more efficient, it could either run faster on the same hardware or run at the same speed on lesser hardware – like ARM64 cores. Not being able to coordinate with Intel on the cache coherency protocol has cost NVIDIA big-time in at least one area: peer-to-peer GPU traffic.

Intel could, but chooses not to, service peer-to-peer traffic between NVIDIA GPUs at high performance (Intel and NVIDIA give different stories as to the reason, and these conversations happen indirectly because the two companies do not seem to have diplomatic relations). As things stand, if you have a dual-CPU server (such as NVIDIA’s own DGX-1) with cache coherency links between the CPUs, any peer-to-peer GPU traffic must be carefully routed past the CPUs, taking care not to cross the cache coherency link. If Intel wanted to, they could license the relevant technology to NVIDIA; failing to do so is a matter of choice and a by-product of the two companies’ respective positions in the business and legal landscapes.

As things stand, NVIDIA is dependent on Intel to ship great CPUs with good bus integration, and peer-to-peer-capable GPU servers have to be designed to steer traffic around the QPI link. The announcement that NVIDIA would not build ARM64 SoCs was made in 2014. Now that the competitive landscape has evolved (I can remember when Intel’s market capitalization was 12x NVIDIA’s; it is now only about 1.7x), it would not surprise me if NVIDIA revisited that decision.

One Path Forward: SoCs

One partial solution to the interconnect problem is to build a System on a Chip (SoC): put the CPU and GPU on the same die. Intel and AMD have been building x86 SoCs for many years; it is Intel’s solution to the value PC market, and AMD has behaved like their life depended on it since 2006, when they acquired GPU vendor ATI. NVIDIA’s Tegra processors are all ARM SoCs. The biggest downside of SoCs is that the ratio of CPU/GPU performance is fixed years before the hardware becomes available, causing workloads to suffer if they are more CPU- or GPU-intensive than the SoC was designed to address. And if the device doesn’t have enough performance, scaling performance across multiple chips may be more difficult because GPUs require such high bandwidth.

A conspicuous success story for big SoCs has been the gaming console market, where the target workload is better understood and, in any case, game developers will code against whatever hardware is in the console. So I suspect that as workloads continue to tap out the available FLOPS and demand more balanced bandwidth-to-FLOPS ratios, big SoCs will start to make more sense. In sizing the CPU/GPU ratio, hardware designers can create a device with the biggest possible GPU that won’t starve on the available bandwidth. SoCs are just a stopgap, though. As the laws of physics continue to lower the boom, the importance of system design will continue to increase, as Kogge pointed out in his 2011 presentation. The fundamental problem of the speed of light isn’t going away, ever.

After posting a list of reasons why CUDA succeeded, it seems worthwhile to reflect on some of its apparent vulnerabilities, and why CUDA has been successful despite those issues.

CUDA Succeeded Despite:

1. Being Proprietary. NVIDIA builds the hardware and software to run CUDA applications and has never licensed the technology to anyone else. Conventional wisdom in the industry holds that proprietary software technologies are doomed to failure – they don’t get shepherded well by a single owner, and they don’t gain adoption by developers. But by making CUDA software portable to everything from Linux to Windows to MacOS, and by making CUDA hardware available in a broad range of products from SoCs (Tegra) to high-end servers (DGX-1), NVIDIA has staved off the risks they incurred by going it alone.

2. Explicit Memory Management. It’s every new CUDA programmer’s rite of passage: as if allocating and copying input and output data to and from device memory weren’t enough trouble, developers also must explicitly manage shared memory to facilitate data interchange between threads.
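For readers who haven’t been through that rite of passage, here is a minimal sketch of the boilerplate (error handling omitted; the kernel is a hypothetical stand-in):

    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = a * x[i];
    }

    void scaleOnGPU(float *h_x, float a, int n)
    {
        float *d_x;
        size_t bytes = n * sizeof(float);

        // Explicitly allocate device memory and copy the input across the bus.
        cudaMalloc(&d_x, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

        // Launch, then explicitly copy the results back to the host.
        scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_x);
    }

Explicit use of __shared__ memory inside kernels adds another layer of manual data placement on top of this.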

Fortunately for NVIDIA, the performance rewards have been large enough that developers haven’t been fazed by the need to learn these idiosyncrasies.

3. Limited Cache Coherency.

Some rules of thumb have been internalized by hardware designers to such a degree that they are not so much sound engineering practices as religious edicts. One such rule is that caches have to be coherent. All the time. But CUDA is pervaded by violations of this tenet. Device memory is not coherent with host memory.

Shared memory effectively resides in a separate address space, so it isn’t coherent in the same sense as an L1 cache. Constant and texture memory are not coherent with device memory, and when changes are made to that memory, the illusion of coherence is maintained via software invalidation. As with explicit memory management, developers are willing to treat the lack of cache coherency as a cost of doing business – as long as they get the performance they crave.

4. Limited PC market share. Discrete GPUs only occupy about 25% of the PC market by unit volume, and NVIDIA competes with AMD in that space. NVIDIA’s limited market share helps explain why CUDA has had limited success achieving developer adoption in packaged PC software, even when there’s a good fit with the software requirements. Put yourself in the shoes of an engineering director at (say) Adobe.

“Port this code to CUDA,” says NVIDIA, “and it will run much faster on 18% of your potential customers’ machines.” Even that proposition is sketchy when accounting for the costs and benefits of supporting the full range of CUDA GPUs extant. But for vertical applications (think HPC), CUDA developers build data centers with thousands of identical servers.

And for embedded applications (think automotive), every GPU in a given design win has identical properties. In both cases, developers have a fixed hardware target to develop against, and they get a compelling return on the engineering investment of the CUDA port. In the longer term, companies like Adobe and Autodesk should be able to gain the same benefits by transitioning to cloud-provisioned GPU platforms.

CUDA first became available about 10 years ago, so it seems like a good time to take note of its success and reflect on why it has been successful.

1. GPUs are not CPUs. What I mean by this is not just that you don’t have to recompile your app (this point gets its own bullet later in this article), but that core operating system changes are not needed for GPU support.

GPUs are complicated peripherals, but when the rubber meets the road, they are still just peripherals. They hang off the bus, get enumerated by the OS, get a driver loaded, and go. Proponents of competing technologies such as the Cell processor or Larrabee (now Xeon Phi) would have you believe otherwise, but GPUs have been served well by the flexibility and platform portability that comes with being a “dumb peripheral.”

2. GPUs are everywhere. Jensen Huang has said the GPU had a “day job”: NVIDIA had an established, high-volume market for their ASICs.

The overlap in requirements between a big, fast graphics chip and a general-purpose manycore processor was significant, but it wasn’t obvious to all that the incremental cost would be worth it. I personally had lunchtime arguments with senior graphics architects at NVIDIA who didn’t want to spend 10% of the die area on compute (the estimated hardware cost of adding support for scatter/gather and shared memory) because it would put them at a disadvantage running graphics benchmarks against AMD (at the time, it was known as ATI). Fortunately for NVIDIA, those skeptics were overruled and the business risk turned out to be justified. Another way to look at it: though NVIDIA was weighing a 10% die area risk, technologies like Cell and Larrabee/Xeon Phi, and other coprocessor vendors, were incurring a 100% die area risk. They did not have an established market to fall back on if things didn’t work out.

3. GPUs are compellingly faster than the CPU. Shortly after one of our first, best customers for CUDA received his first CUDA-capable GPU, he contacted NVIDIA with a question. He had gotten a sample workload ported, and, he said, it looked like it was working. He wanted to know how it could be so fast! The senior people at NVIDIA had long known GPU performance was going to be amazing.

Shortly after I joined NVIDIA in 2002, I had lunch with a senior NVIDIA architect and asked him what he was working on. “NV50,” he said. (Mind you, this conversation occurred before NV30 had taped out.) “It will unify vertex and pixel shader processing.

We’ll have room to build a chip with about a teraFLOPS of processing power, but we’ll spend half the area on graphics so it will have peak performance of about 500 GFLOPS.” Later, in an internal company email, the same architect said NV50 was going to “make the CPU look like a toy.” His prediction turned out to be amazingly accurate, considering it was made four years and two major architectural revisions in advance. NV50 turned into G80, the first CUDA-capable chip, and had 384 GFLOPS of peak performance – within spitting distance of his casual lunchtime conjecture. Remember that when CUDA first shipped, Intel’s floating point capabilities were much more limited than they are today.


The SIMD width was only 128 bits (Skylake currently supports 512), and Intel had only recently widened the actual execution unit (singular – modern Intel CPUs have multiple SIMD execution units) to a full 128 bits. Before that, one generation after another of Intel CPUs had executed SSE instructions as two micro-ops (“high” and “low”) on a 64-bit-wide execution unit, limiting instruction throughput. In fact, CUDA may have prompted Intel to dramatically improve their floating point capabilities. Today, it is still true that for suitable workloads, GPUs are compellingly faster than CPUs. Intel has doubled the SIMD width in their processors twice, and also doubled the number of SIMD execution units, but in that time, NVIDIA has increased the number of transistors in their “win” GPU by 30x (from 684M to 21B), with a commensurate increase in performance. NVIDIA GPUs, by the way, also benefit from targeting much lower clock rates than CPUs.

In 2006, G80 ran at a fraction of contemporary CPU clock rates.

This question on StackExchange was put on hold as primarily opinion-based: “answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise.” The content of StackExchange is usually high quality, but in this case, while the design decision was based on opinion, the answer to the question needn’t be: you just need to ask the people who know! And the inimitable talonmies, who is poised to crack 30k on StackExchange’s points-based reputation system, compounded the problem by saying that CUdeviceptr is a handle to a device memory allocation, not a pointer. I don’t think I have ever seen talonmies give an incorrect answer before; but in this case, he’s off the mark. CUdeviceptr has always represented a pointer in the CUDA address space.


In fact, though it was frowned upon to mix driver API and CUDA runtime code, even in CUDA 1.0 you could transform between the CUDA runtime’s void * and the driver API’s CUdeviceptr by writing something like:

    void *p;
    CUdeviceptr dptr;

    p = (void *) (uintptr_t) dptr;
    dptr = (CUdeviceptr) (uintptr_t) p;

We could have made device pointers void *, but there was a desire to make it easy for compilers to distinguish between host and device pointers at compile time instead of runtime.

Furthermore, SM 1.x hardware only supported 32-bit pointers, so using void * would have created a difference in pointer size on 64-bit host platforms. It’s a long-distant memory now, since so much great compiler work has gone into CUDA since then, but at the time “pointer-squashing” (having CUDA’s compiler transform 64-bit pointers into 32-bit pointers on 64-bit host systems) was a big issue in early versions of CUDA. For the record, not making the driver API’s device pointer type void * is one of my bigger regrets about early CUDA development. It took months to refactor the driver to support 64-bit device pointers when hardware support for that feature became available in SM 2.x class hardware. In fact, some weeks before we released CUDA 1.0, we had a meeting and a serious discussion about replacing CUdeviceptr with void *, and decided not to take the schedule hit. We weren’t going to let perfect be the enemy of done, and we paid the price later.

While we’re on the topic of regrettable design decisions in early CUDA, I wish I had done a search-and-replace to convert CUfunction to CUkernel, and put cuLaunchKernel in the first release (in place of the stateful, chatty and not-thread-safe cuParamSet*() family of functions). But we had scant engineering resources to spend on fit and finish, a constraint that is no less true for CUDA than for many other successful software projects in history.
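From memory, and as a rough sketch rather than verbatim CUDA 1.x code (hFunc and dptr are assumed to come from cuModuleGetFunction() and cuMemAlloc(), and the kernel is assumed to take a device pointer and an int), the contrast looks something like this:

    // Old style: stateful and chatty, and the parameter state lives in the
    // CUfunction itself, so concurrent launches of the same function from
    // multiple threads can race.
    int n = 1048576;
    int numBlocks = (n + 255) / 256;

    cuFuncSetBlockShape(hFunc, 256, 1, 1);
    cuParamSetv(hFunc, 0, &dptr, sizeof(dptr));
    cuParamSeti(hFunc, sizeof(dptr), n);
    cuParamSetSize(hFunc, sizeof(dptr) + sizeof(int));
    cuLaunchGrid(hFunc, numBlocks, 1);

    // cuLaunchKernel(): a single stateless, thread-safe call carries everything.
    void *params[] = { &dptr, &n };
    cuLaunchKernel(hFunc,
                   numBlocks, 1, 1,   // grid dimensions
                   256, 1, 1,         // block dimensions
                   0, NULL,           // shared memory bytes, stream
                   params, NULL);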