
Mike Tyson said: "Everyone has a plan until they get punched in the face." When it comes to semiconductors I'd say: "Everyone wants to make their own chips until they have to do so at scale". (Doesn't roll off the tongue as well!)

There is definitely a threat from Apple, Amazon, Google and especially China that puts Intel's market share in their sights, but making chips at scale is incredibly difficult. It's hard to see Amazon transitioning their AWS machines to Amazon-built chips, but if they demonstrate competence they'll certainly be able to squeeze more out of Intel.



But these companies don't really make their own chips at scale; they just make their own chip designs, then contract a fab to actually manufacture them.

And Apple is already at an incredible scale, considering every iOS device currently made is running on Apple designed chips.


Intel is a microarchitecture + fab corporation. They do it all.

1. TSMC (also GlobalFoundries) is fab only. They design the process node and the way to fabricate it.

2. Then ARM works with TSMC to develop a high-performance microarchitecture for its processor designs on TSMC's process.

3. Then ARM licenses the microarchitecture designed for the new process to Amazon, Apple and Qualcomm, who develop their own variants. Most of the processor microarchitecture is the same for the same manufacturing process.

As a result, costs are shared to a large degree. Intel may still have some scale advantages from the integrated approach, but not as many as you might think.


My personal suspicion is that the integrated approach can eventually be a liability. If you have an integrated process/design house, process can count on design to work around its shortcomings and failures. By contrast, if you are process only, and multiple firms make designs for the process, you have to make your process robust, which means that your process staff is ready and has good practices down when it's time to shrink.

^^ Note that this is entirely baseless speculation.


What you speculate is actually happening to some extent. Intel's designs work around their fabrication quirks in order to achieve their performance, and this makes it hard for Intel to separate out their fabrication business to take on external contracts, or to change their designs to use external fabricators.


Intel has always been a process first, design second company. The company was founded by physicists and chemists. Their process has always been the best in the world until just recently. Intel brings in or buys design talent when needed, but their R&D in process technology is their strongest suit even today.


> Intel has always been a process first, design second company. The company was founded by physicists and chemists. Their process has always been the best in the world until just recently.

So they had a particular advantage, and exploited the heck out of it, but now the potency of that advantage is largely gone?


I don't know what country you're in, but in cricket there's a concept of innings and scoring runs. There's this dude who averaged nearly 100 per innings; most others average 50.

Now think of the situation as him scoring a few knots. Is he old and retiring? Or is this just a slump in form? Nobody knows!

I worked on a design team and we were proud of our materials engineers.


Back in about 1996, most of the profs were going on about how x86 would crumble under the weight of the ISA, and RISC was the future. One of my profs knew people at Intel, and talked of a roadmap they had for kicking butt for the next dozen years. Turns out, the road map was more or less right.

Is there more roadmap?


There's just no way that's true. Their roadmap in 1996 was moving everyone to ia64/itanium. That was an unmitigated disaster and they were forced to license x64 from AMD.

If it weren't for their illegal activity (threats/bribes to partners) to stifle AMD's market penetration, the market would likely look very different today.


> There's just no way that's true. Their roadmap in 1996 was moving everyone to ia64/itanium. That was an unmitigated disaster and they were forced to license x64 from AMD.

Yup, and their x86 backup plan (Netburst scaling all the way to 10GHz) was a dead end too.


But their plan C (revive the Pentium III architecture) worked perfectly.

We will have to see if they have a plan C now (plan B being yet another iteration of the Lake architecture with few changes).


Their plan C was a complete fluke and only came together because Intel's Israeli design team managed to put out Centrino. I don't think such a fluke is possible when we're at the limits of process design and everything takes tens of billions of dollars and half a decade of lead time to implement.


Having multiple competent design teams working on potentially competing products all the time is one of the strengths of Intel, I wouldn't call it a fluke.

Things do look dire right now, I agree.


I'm not that up on Intel at the moment. Why are they stuck on more iterations of the Lake architecture with so few changes?

What was the plan "A"?


Get 10nm out.


Doesn't it look like they're shifting to chiplets as well at the moment? Copying AMD might be their plan C, but it won't help if AMD can steam ahead with TSMC 7nm while Intel is locked to 14nm for a couple of years. That's going to hurt a lot.


TSMC's 7nm and Intel's 14nm are about the same in actual dimensions on silicon IIRC. The names for the processes are mostly fluff.


AFAIK, supposedly TSMC 7nm and Intel 10nm are about equivalent, but with 10nm being in limbo, TSMC is ahead now.


That’s also how I understand it, which seems to be supported by perf/watt numbers of Apple’s 2018 chips.


Likewise if AMD wasn't a thing maybe this laptop would be running Itanium instead.


Intel chips are RISC under the hood these days (and have been for a long while, a decade or more). They're CISC at the ASM layer, before the instructions are decoded and dispatched as micro-ops.


The idea that Intel is “RISC under the hood” is too simplistic.

Instructions get decoded and some end up looking like RISC instructions, but there is so much macro- and micro-op fusion going on, as well as reordering, etc, that it is nothing like a RISC machine.

(The whole argument is kind of pointless anyway.)


With all the optimizations going on, high performance RISC designs don't look like RISC designs anymore either. The ISA has very little to do with whatever the execution units actually see or execute.


It is baffling to me that byte-code is essentially a 'high level language' these days.


And yet, when I first approached C, it was considered a "high level" language.


Because the functional units are utilizing microcode? Or do you mean something else?


Possibly stupid questions from someone completely ignorant about hardware:

If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer? If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

Alternatively, could Intel theoretically implement a new, non-x86 ASM layer that would decode down to better optimized microcode?


> If they didn’t care about backwards compatibility, would it be possible for them to release versions of their CPUs with _only_ the microcode layer?

The microcode is the CPU firmware that turns x86 (or whatever) instructions into micro-ops. In theory if you knew about all the internals of your CPU you could upload your own microcode that would run some custom ISA (which could be straight micro-ops I guess).
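A purely hypothetical illustration of that cracking step (the enum names and the expansion table below are made up; real micro-op formats are undocumented and vary by microarchitecture):

    /* A single memory-destination x86 instruction such as
       add DWORD PTR [mem], eax  gets split into a short internal sequence,
       conceptually something like load -> add -> store. */
    typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop;

    static const uop add_mem_reg_expansion[] = { UOP_LOAD, UOP_ADD, UOP_STORE };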

> If yes, would a sufficiently good compiler be able to generate faster code for such a platform than for x86?

The most important concern for modern code performance is cache efficiency, so an x86-style instruction set actually leads to better performance than a vanilla RISC-style one (the complex instructions act as a de facto compression mechanism) - compare ARM Thumb.

Instruction sets that are more efficient than x86 are certainly possible, especially if you allowed the compiler to come up with a custom instruction set for your particular program and microcode to implement it. (It'd be like a slightly more efficient version of how the high-performance K programming language works: interpreted, but by an interpreter designed to be small enough to fit into L1 cache). But we're talking about a small difference; existing processors are designed to implement x86 efficiently, and there's been a huge amount of compiler work put into producing efficient x86.


CISC is still an asset rather than a liability, though, as it means you can fit more code into cache.


I don't think that's an advantage these days. The bottleneck seems to be decoding instructions, and that's easier to parallelize if instructions are fixed width. Case in point: The big cores on Apple's A11 and A12 SoCs can decode 7 instructions per cycle. Intel's Skylake can do 5. Intel CPUs also have μop caches because decoding x86 is so expensive.
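To make the decode problem concrete, here's a rough C sketch of the boundary-finding step (the lengths and the x86_length helper are purely illustrative, not any real decoder):

    #include <stddef.h>
    #include <stdint.h>

    /* With fixed 4-byte instructions, the starts within a 16-byte fetch window
       are known before looking at a single instruction bit. */
    static void fixed_width_starts(size_t starts[4]) {
        for (size_t i = 0; i < 4; i++)
            starts[i] = i * 4;                 /* 0, 4, 8, 12 */
    }

    /* With x86-style variable-length instructions, each start depends on the
       length of the previous instruction, so the boundaries must be resolved
       serially or guessed speculatively at every byte. x86_length() stands in
       for the prefix/opcode/ModRM length logic. */
    static size_t variable_width_starts(const uint8_t *win, size_t n,
                                        size_t (*x86_length)(const uint8_t *),
                                        size_t *starts) {
        size_t count = 0;
        for (size_t pos = 0; pos < n; pos += x86_length(win + pos))
            starts[count++] = pos;             /* serial dependency on prior length */
        return count;
    }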


Maybe the golden middle path is compressed RISC instructions, e.g. the RISC-V C extension, where the most commonly used instructions take 16 bits and the full 32-bit instructions are still available. Density is apparently slightly better than x86-64, while being easier to decode.

(Yes, I'm aware there's no high-performance RISC-V core available (yet) comparable to x86-64 or POWER, or even the higher-end ARM ones.)
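As a minimal sketch of why the compressed encoding stays easy to decode (ignoring the reserved longer-than-32-bit formats), the length of a RISC-V instruction is determined by the two lowest bits of its first halfword:

    #include <stdint.h>

    /* RISC-V "C" extension: if the two lowest bits of the first halfword are
       not 0b11, it's a 16-bit compressed instruction; 0b11 marks a standard
       32-bit instruction. */
    static int rv_instr_bytes(uint16_t first_halfword) {
        return (first_halfword & 0x3) != 0x3 ? 2 : 4;
    }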


Sure, but Intel's CISC instructions can do more, so in the end it's a wash.


That's not the case. Only one of Skylake's decoders can translate complex x86 instructions. The other 4 are simple decoders, and can only transform a simple x86 instruction into a single µop. At most, Skylake's decoder can emit 5 µops per cycle.[1]

1. https://en.wikichip.org/wiki/intel/microarchitectures/skylak...


... So what? Most code is hot and should be issued from the µop cache at 6 µops/cycle, with an "80%+ hit rate" according to your source.

You're really not making the case that "decode" is the bottleneck. Are you unaware of the mitigations that x86 designs have taken to alleviate that? Or are those mitigations your proof that the ISA is deficient?


That really isn't true in the modern world. x86 has things like load-op and large inline constants but ARM has things like load or store multiple, predication, and more registers. They tend to take about the same number of instructions per executable and about the same number of bytes per instruction.

If you're comparing to MIPS then sure, x86 is more efficient. And x86 instructions do more than RISC-V's, but most high-performance RISC-V uses instruction compression and front-end fusion for similar pipeline and cache usage.


(Generalized) predication is not a thing in ARM64. Is Apple's CPU 7-wide even in 32-bit mode?

It is true though, as chroma pointed out, that Intel can't decode load-op instructions at full width.


You can fit more code into the same sized cache, but you also need an extra cache layer for the decoded µops, and a much more complicated fetch/decode/dispatch part of the pipeline. It clearly works, at least for the high per-core power levels that Intel targets, but it's not obvious whether it saves transistors or improves performance compared to having an instruction set that accurately reflects the true execution resources, and just increasing the L1i$ size. Ultimately, only one of the strategies is viable when you're trying to maintain binary compatibility across dozens of microarchitecture generations.


The fact is that a post-decode cache is desirable even on a high-performance RISC design, since even there skipping fetch and decode helps both performance and power usage.

IBM Power9 for example has a predecode stage before L1.

You could say that, in general, RISCs can get away without the extra complexity for longer while x86 must implement it early (this is also true, for example, of memory speculation due to the more restrictive Intel memory model, or of optimized hardware TLB walkers), but in the end it can be an advantage for x86 (a more mature implementation).


In theory, yes. In practice x86-64, while it was the right solution for the market, isn't a very efficient encoding and doesn't fit any more code in cache than pragmatic RISC designs like ARM. It still beats more purist RISC designs like MIPS but not by as much as pure x86 did.

It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock. But legacy compatibility means that that scheme will not be x86 based.


>"It would be easy to design a variable length encoding scheme that was self-synchronizing and played nicely with decoding multiple instructions per clock."

How might a self-synchronizing encoding scheme work? How could a decoder be divorced from the clock pulse? I am intrigued by this idea.


What I mean is self-synchronizing like UTF-8. For example, the first bit of a byte being 1 if it's the start of an instruction and 0 otherwise. Just enough to know where the instruction starts are without having to decode the instructions up to that point, and so that a jump to an address in the middle of an instruction can raise a fault. Checking the security of x86 executables can be hard sometimes because reading a string of instructions starting from address FOO will give you a stream of innocuous instructions, whereas reading starting at address FOO+1 will give you a different stream of instructions that does something malicious.
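A minimal C sketch of that idea (using the bit-per-byte marker described above; this is an illustrative encoding, not any real ISA):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical self-synchronizing encoding: bit 7 of a byte is 1 only if
       that byte starts an instruction. Every byte in a fetch window can be
       classified independently, so all instruction starts are known without
       decoding, and a jump into the middle of an instruction (a byte with
       bit 7 clear) can raise a fault. */
    static size_t find_instr_starts(const uint8_t *window, size_t n,
                                    size_t *starts) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            if (window[i] & 0x80)
                starts[count++] = i;
        return count;
    }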


Sure, so what's your 6-byte ARM equivalent for FXSAVE/FXRSTOR?


What is an example of a commonly used complex instruction that is "simplified"/DNE in RISC? (in asm, not binary)


Load-op instructions do not normally exist on RISC, but are common on CISC.
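A small C example of where load-op shows up (the assembly in the comment is illustrative; exact output depends on compiler and flags):

    /* A compiler targeting x86 can fold the memory read into the add itself,
       e.g. something like  add eax, DWORD PTR [rdi+rcx*4]  per element,
       whereas a load/store RISC (ARM64, RISC-V) needs a separate load
       instruction followed by a register-register add. */
    int sum(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];    /* memory operand feeds the add directly on x86 */
        return s;
    }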


You do have to worry about the µ-op cache nowadays.


Those profs were still living in 1990 when the x86 tax was still a real issue. As cores get bigger the extra effort involved in handling the x86 ISA gets proportionally smaller. x86 has accumulated a lot of features over the years, and figuring out how, e.g., call gates interact with mis-speculated branches means an x86 design will take more engineering effort than an equivalent RISC design. But with Intel's huge volumes they can more than afford that extra effort.

Of course Intel has traditionally always used their volume to be ahead in process technology and at the moment they seem to be slipping behind. So who knows.


>"As cores get bigger the extra effort involved in handling the x86 ISA gets proportionally smaller."

Can you elaborate on what you mean here? Do you mean as the number of cores gets bigger? Surely the size of the cores has been shrinking, no?

>"Of course Intel has traditionally always used their volume to be ahead in process technology"

What's the correlation between larger volumes and quicker advances in process technology? Is it simply more cash to put back into R and D?


When RISC was first introduced, its big advantage was that by reducing the number of instructions it handled, the whole processor could fit onto a single chip, whereas CISC processors took multiple chips. In the modern day it takes a lot more transistors and power to decode 4 x86 instructions in one cycle than 4 RISC instructions, because you know the RISC instructions are going to start on bytes 0, 4, 8, and 12, whereas the x86 instructions could be starting on any bytes in the window. So you have to look at most of the bytes as if they could be an instruction start until later in the cycle you figure out whether they were or not. And any given bit in the instruction might be put to more possible uses, increasing the logical depth of the decoder.

But that complexity only goes up linearly with pipeline depth, in contrast to structures like the ROB that grow as the square of the depth. So it's not really a big deal. An ARM server is more likely to just slap 6 decoders onto the front end because "why not?", whereas x86 processors will tend to limit themselves to 4, but that very rarely makes any sort of difference in normal code. The decode stage is just a small proportion of the overall transistor and power cost of a deeply pipelined out-of-order chip.

In, say, a dual-issue in-order processor like the A53, the decode tax of x86 is actually an issue, and that's part of why you don't see BIG.little approaches in x86 land and why Atom did so poorly in the phone market.

For your second question, yes: spending more money means you can pursue more R&D and tend to bring up new process nodes more quickly. Being ahead means that your competitors can see which approaches worked out and which didn't, and so redirect their research more profitably, for a rubber-band effect; plus you're all reliant on the same suppliers for input equipment, so a given advantage in expenditure tends to lead to a finite rather than an ever-increasing lead.


Thanks for the thorough detailed reply, I really appreciate it. I failed to grasp one thing you mentioned which is:

>"is actually an issue and that's part of why you don't see BIG.little approaches in x86 land and why atom did so poorly in the phone market."

Is BIG an acronym here? I had trouble understanding that sentence. Cheers.


I was reproducing an ARM marketing term incorrectly.

https://en.wikipedia.org/wiki/ARM_big.LITTLE

Basically, the idea is that you have a number of small, low power cores together with a few larger, faster, but less efficient cores. Intel hasn't made anything doing that. Intel also tried to get x86 chips into phones but it didn't work out for them.


Thanks, this is actually a good read and clever bit of marketing. Cheers.


Your parallel is hard to follow for people who don't watch cricket. I have no idea how 100 or 50 "innings" relate to a few "knots". Are they like some sort of weird imperial measures? (furlongs vs fathoms?)


I suspect that "knots" was supposed to be "noughts", a.k.a zeros. That is, the last few times the 100-point batsman was at bat, he got struck out without scoring any points. Is he washed up?

I don't think it's a very useful analogy. :)


Knots as in ducks?


It turns out that the effect is typically exactly the opposite. Design and process are already coupled: a given process has design rules that must be adhered to in order to achieve a successfully manufacturable design. Intel only has to support their own designs, so they can have very strict design rules. Fabs like TSMC have to be more lenient in what they allow from their customers, so they have looser design rules, which result in a less optimized process for the same yield.


The speculation is exactly that: what you describe is indeed a short-term gain, but the pressure of having to accommodate looser design rules builds a stronger process discipline, which pays off in the long term as feature-size shrinkage gets closer to physical limits.


ARM architectural licensees develop their own microarchitectures that implement the ARM ISA spec; they do not license any particular microarchitecture (e.g. Cortex-A IP cores) from ARM. That includes Apple, Samsung, Nvidia and others.


But ARM actually has relatively few architectural licensees (~10 as of 2015).

In reality, most of their licenses are processor (core+interfaces) or POP (pre-optimized processor designs).

https://www.anandtech.com/show/7112/the-arm-diaries-part-1-h...


Could you or someone else elaborate on the different types of licenses and why a company interested in licensing might opt for one over another? I was surprised by the OP's comment that few companies actually hold an architectural license, as I thought that's what Apple has been doing with ARM.


It is what Apple has been doing with ARM, but as he said there's only about 10 companies doing this, compared to the hundreds (thousands?) who take the core directly from ARM. Even big players like Qualcomm seem to be moving to just requesting tweaks to the Cortex cores.

It's much, much easier & cheaper to take the premade core rather than developing your own. But your own custom design gives you the ability to differentiate or really target a specific application. See Apple's designs.

Read the Anandtech article, it goes into more detail on the license types. There's also the newer "Built on Cortex" license: https://www.anandtech.com/show/10366/arm-built-on-cortex-lic...


Your link is exactly what I was looking for. Thanks!


>"They design the process node and the way to fabricate it."

What is a "node" in this context? I'm not familiar enough with fab terminology.


A node is a combination of manufacturing capabilities and design components that are manufacturable in that process. They're typically named after an arbitrary dimension of a silicon feature, for example 14nm or 10nm. Your higher level design options are dictated by what you can produce at that "node" (with those masking methods/transistor pitches/sizes/electrical and thermal properties).


Would a pixel be a good analogy? It's the smallest thing you can make on your chip and that defines all the rest of your design.


Only in the same way that even pixels of the same physical size can have other vastly different properties. And that makes ranking them purely on their size totally misguided. So I'm not convinced that really helps.


It's closer to the quality of a display than just how small your pixels are. It determines how large your display can be before you get too many dead pixels (yield in fabrication). What range of colors your pixels can produce (electrical properties: resistance, leakage, etc.). Whether you can blast all pixels at full brightness, or only a few (thermal properties). And indeed, the resolution of the display (size of transistors).

What is missing from this analogy is the degree of layering / 3D structures that is possible. You might map that to RGB vs. RGBY, but I'm not really sure.


Might you or anyone else be able to recommend a book or some literature on the business and logistical side of third-party chip design like this? Maybe something with some case studies?


Here's a free one that should cover what you're asking about.

https://www.semiwiki.com/forum/content/4729-free-fabless-tra...


This is a great resource. Thanks. Cheers.


But TSMC has demonstrated many times that they can make chips at scale.

I don't see why they would fail this time.


TSMC can run the masks, but if the design is not sound, then it doesn't matter how good the transistors are. Power islands, clock domain crossings, proper DFT (design for test), DFM (design for manufacturability), etc. are all needed to get a good design.


Do you realize how many devices Apple sells a year? I think they've figured out the scale thing ok.


This is what Intel supporters always say, right up until everyone builds their own chips and there is no market left for Intel at all. It is just so sad that Intel, which had such a ferocious lead and was on the cutting edge of processor design and manufacturing, is now dying a death by a thousand cuts.

Just look at the industry - everyone who is a major player in cloud, AI, or mobile (Apple, Huawei, and Samsung) is now in the chip business themselves. How will Intel grow? And where would this so-called scale advantage come in?

Wake up and smell the coffee.


How is Intel dying? Losing a near monopoly is a far cry from dying.

And Amazon's Graviton/armv8 chips aren't going to be competitive for many workloads. If you look up benchmarks you'll see they generally aren't competitive in terms of performance[1].

They'll only be competitive in terms of cost (and, generally, not even performance/cost).

I'm personally pleased that there is more competition but I find that saying Intel is dying to be silly.

[1] https://www.phoronix.com/scan.php?page=article&item=ec2-grav...


And it sure doesn't help that Amazon won't be selling desktop PCs or on-prem servers anytime soon.


Well, they did announce Amazon Outpost as well...


> It's hard to see Amazon transitioning their AWS machines to Amazon built chips

As a strategic move, this makes a lot of sense for Amazon. Moreover, Amazon is a company known for excellence in a diverse set of disciplines, and TSMC has an excellent reputation for delivering state-of-the-art CPUs at scale — yet you are here to doubt they can pull it off, despite providing no evidence or rationale for your position?

The burden of proof is on you to justify your pessimism. If you have evidence for your claim that Amazon + TSMC will have problems scaling, please provide it.


How many Amazon customers have they migrated to their existing ARM solutions?

That's the bit that's missing: the servers sitting in a rack are meaningless without ARM customers, and Amazon chips not existing didn't somehow suppress the demand. They sell ARM compute now and it's a paltry fraction of the whole. Pretending it's about TSMC scaling is ridiculous.


> Everyone wants to make their own chips until they have to do so at scale

Isn't it exactly the other way around?


Apple sold 217 MILLION iPhones in just 2017 alone.

That's a number that doesn't include iPad, Apple Watch, HomePod, or Macs - all of which have custom Apple silicon in them.

I think you're severely underestimating Apple here.


There are lots of countries around the world where ordinary people hardly ever get to see an Apple device in the wild.


There are lots of places where you rarely see PCs, too, but that doesn’t mean that Intel and AMD don’t sell a lot of chips. 200M per year is well into the economies of scale range.


That has nothing at all to do with the original point, or even my point.


Delivering almost any package to my house in two days at scale seems a lot harder than making chips at scale and they did that already.


Unfortunately, I think making cutting-edge chips is harder these days. Just going on cost: the most expensive Amazon fulfillment center comes in at $200 million, while the most expensive fab is $14 billion, from Samsung, with word of a $20 billion fab coming from TSMC.



