Posted 03-24-2005
alexander
Cell Architecture

Ok, there has been some discussion about Cell architecture on the forums lately, nothing too big or serious, because people don't realize just how cool Cell is. So I've taken it upon myself to try to explain Cell, not in too much detail, but for a user who knows what computers are and basically how they operate, this should be an enjoyable read.

All the information here is taken from http://www.blachford.info/computer/Cells/Cell0.html , the only truly informative resource about how Cell powered workstations might look in the near future, well, aside from the 2002 patent which the author gets his information from, but as he says, that actually needs to be deciphered, because it was "written by a robot lawyer running Gentoo in text mode".

Ok, first of all, if you have read anything about Sony's PlayStation 3 gaming console, which is scheduled to come out in early 2006, and actually read the specs, you have heard of the Cell processor, and maybe even heard that it is supposed to be faster than PC architecture processors due to its use of vector processing units. That is a start. An alliance formed by Sony, Toshiba and IBM has been spending billions of dollars on this project: IBM is building two 65nm facilities, Sony paid IBM hundreds of millions to set up an assembly line for Cell processors for the PS3, and the research on Cell has been costing hundreds of millions of dollars, so you can see that something big is about to go off.

So what is Cell?

Cell is an architecture for high performance distributed computing. It is made up of hardware and software Cells. Software Cells consist of data and programs (known as apulets); these are sent out to the hardware Cells, where they are computed and the results returned.

This architecture is not fixed in any way, if you have a computer, PS3 and HDTV which have Cell processors they can co-operate on problems. They've been talking about this sort of thing for years of course but the Cell is actually designed to do it. I for one quite like the idea of watching "Contact" on my TV while a PS3 sits in the background churning through a SETI@home [SETI] unit every 5 minutes. If you know how long a SETI unit takes your jaw should have just hit the floor, suffice to say, Cells are very, very fast [Calc].

It can go further though, there's no reason why your system can't distribute software Cells over a network or even all over the world. The Cell is designed to fit into everything from PDAs up to servers so you can make an ad-hoc Cell computer out of completely different systems.

Scaling is just one capability of Cell, the individual systems are going to be potent enough on their own. The single unit of computation in a Cell system is called a Processing Element (PE) and even an individual PE is one hell of a powerful processor, they have a theoretical computing capability of 250 GFLOPS (Billion Floating Point Operations per Second) [GFLOPS]. In the computing world quoted figures (bandwidth, processing, throughput) are often theoretical maximums and rarely if ever met in real life. Cell may be unusual in that given the right type of problem they may actually be able to get close to their maximum computational figure.

Cell architecture:

http://www.hypography.com/sciencefor...&stc=1&thumb=1

The PPE or Processor Unit (PU)
As we now know, the PU is so far destined to become a 64-bit "Power Architecture", multi-thread, multi-core processor. Power Architecture is a catch-all term used to describe both PowerPC and POWER processors. Currently there are only 3 CPUs which fit this description: POWER5, POWER4 and the PowerPC 970 (aka G5), which itself is a derivative of the POWER4.

The IBM press release indicates the Cell processor is "Multi-thread, multi-core" but since the APUs are almost certainly not multi-threaded it looks like the PU may be based on a POWER5 core - the very same core that may turn up in Apple machines in the form of the G6 [G6] in the not too distant future. IBM have acknowledged such a chip is in development but, as if to confuse us, call it a "next generation 970".

There is of course the possibility that IBM have developed a completely different 64-bit CPU which they have never mentioned before. This isn't a far-fetched idea as this is exactly the sort of thing IBM tend to do, e.g. the 440 CPU used in the BlueGene supercomputer is still called a 440 but is very different from the chip you find in embedded systems.

If the PU is based on a POWER design don't expect it to run at a high clock speed, POWER cores tend to be rather power hungry so it may be clocked down to keep power consumption down.

The PlayStation 3 is touted to have 4 Cells so a system could potentially have 4 POWER5 based cores. This sounds pretty amazing until you realise that the PUs are really just controllers - the real action is in the APUs...

SPE or Additional Processing Unit (APU)

http://www.hypography.com/sciencefor...&stc=1&thumb=1
The first thing you notice on the diagram is the absence of Cache, and there is a good reason for it:

Conventional Cache
Conventional CPUs perform all their operations in registers which are directly read from or written to main memory, operating directly on main memory is hundreds of times slower so caches (a fast on chip memory of sorts) are used to hide the effects of going to or from main memory. Caches work by storing part of the memory the processor is working on, if you are working on a 1MB piece of data it is likely only a small fraction of this (perhaps a few hundred bytes) will be present in cache, there are kinds of cache design which can store more or even all the data but these are not used as they are too expensive or too slow.

If data being worked on is not present in the cache the CPU stalls and has to wait for this data to be fetched. This essentially halts the processor for hundreds of cycles. It is estimated that even high end server CPUs (POWER, Itanium, typically with very large fast caches) spend anything up to 80% of their time waiting for memory.

Dual-core CPUs will become common soon and these usually have to share the cache. Additionally, if either of the cores or other system components try to access the same memory address the data in the cache may become out of date and thus needs to be updated (made coherent).

Supporting all this complexity requires logic and takes time and in doing so this limits the speed that a conventional system can access memory, the more processors there are in a system the more complex this problem becomes. Cache design in conventional CPUs speeds up memory access but compromises are made to get it to work.

APU local memory - no cache
To solve the complexity associated with cache design and to increase performance the Cell designers took the radical approach of not including any. Instead they used a series of local memories, there are 8 of these, 1 in each APU.

The APUs operate on registers which are read from or written to the local memory. This local memory can access main memory in blocks of 1024 bits but the APUs cannot act directly on main memory.

By not using a caching mechanism the designers have removed the need for a lot of the complexity which goes along with a cache. The local memory can only be accessed by the individual APU, there is no coherency mechanism directly connected to the APU or local memory.
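To make the local memory idea a bit more concrete, here is a rough Python sketch of an APU that can only work on its own local memory and has to move data to and from main memory in 1024-bit blocks. The class name, the method names and the 128KB local memory size are all my own assumptions for illustration, they are not from the patent or the article.

```python
BLOCK_BYTES = 1024 // 8  # the 1024-bit transfer unit from the text, i.e. 128 bytes


class LocalStore:
    """Toy model of one APU's private local memory (all names are hypothetical)."""

    def __init__(self, main_memory, size_bytes=128 * 1024):
        # size_bytes is an assumption; the post does not give a local memory size.
        self.main = main_memory            # shared bytearray standing in for RAM
        self.data = bytearray(size_bytes)  # private to this APU, no coherency logic

    def _check(self, local_addr, main_addr):
        for addr, limit in ((local_addr, len(self.data)), (main_addr, len(self.main))):
            if addr % BLOCK_BYTES != 0:
                raise ValueError("transfers must be aligned to 1024-bit blocks")
            if not 0 <= addr <= limit - BLOCK_BYTES:
                raise ValueError("transfer falls outside memory")

    def dma_get(self, local_addr, main_addr):
        """Copy one block from main memory into local memory."""
        self._check(local_addr, main_addr)
        self.data[local_addr:local_addr + BLOCK_BYTES] = \
            self.main[main_addr:main_addr + BLOCK_BYTES]

    def dma_put(self, local_addr, main_addr):
        """Copy one block from local memory back to main memory."""
        self._check(local_addr, main_addr)
        self.main[main_addr:main_addr + BLOCK_BYTES] = \
            self.data[local_addr:local_addr + BLOCK_BYTES]


ram = bytearray(1024)
ram[128:256] = bytes(range(128))

store = LocalStore(ram)
store.dma_get(0, 128)                      # fetch one block into local memory
for i in range(BLOCK_BYTES):               # the "APU" only ever touches local data
    store.data[i] = (store.data[i] * 2) % 256
store.dma_put(0, 256)                      # write the result block back to RAM
```

The point of the sketch is that nothing in it needs to know what any other APU is doing, which is where the saved complexity comes from.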

This may sound like an inflexible system which will be complex to program, and it most likely is, but this system will deliver data to the APU registers at a phenomenal rate. If 2 registers can be moved per cycle to or from the local memory it will in its first incarnation deliver 147 Gigabytes per second. That's for a single APU, the aggregate bandwidth for all local memories will be over a Terabyte per second - no CPU in the consumer market has a cache which will even get close to that figure. The APUs need to be fed with data and by using a local memory based design the Cell designers have provided plenty of it.
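The 147 Gigabytes per second figure can be reproduced with some quick arithmetic. Note that this is only a sketch: the 128-bit register width and the 4.6GHz clock are my assumptions (neither is stated in this post), picked because they reproduce the quoted number.

```python
# Back-of-envelope check of the local memory bandwidth figures.
# ASSUMPTIONS (not stated in the post): 128-bit registers, 4.6 GHz clock.
REGISTER_BITS = 128
CLOCK_HZ = 4.6e9
REGISTERS_PER_CYCLE = 2   # from the text
APUS_PER_CELL = 8         # from the text


def local_memory_bandwidth_gb_s(register_bits, clock_hz, regs_per_cycle):
    """Bandwidth in GB/s if regs_per_cycle registers move every clock cycle."""
    bytes_per_cycle = regs_per_cycle * register_bits // 8
    return bytes_per_cycle * clock_hz / 1e9


per_apu = local_memory_bandwidth_gb_s(REGISTER_BITS, CLOCK_HZ, REGISTERS_PER_CYCLE)
aggregate = per_apu * APUS_PER_CELL

print(f"per APU:   {per_apu:.1f} GB/s")
print(f"aggregate: {aggregate:.1f} GB/s")
```

With those assumptions a single APU comes out at about 147 GB/s and all 8 together at a bit over a Terabyte per second, which matches the figures above.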

Stream Processing

A big difference in Cells from normal CPUs is the ability of the APUs in a Cell to be chained together to act as a stream processor [Stream]. A stream processor takes data and processes it in a series of steps. Each of these steps can be performed by one or more APUs.

A Cell processor can be set up to perform streaming operations in a sequence with one or more APUs working on each step. In order to do stream processing an APU reads data from an input into its local memory, performs the processing step then writes it to a pre-defined part of RAM, the second APU then takes the data just written, processes it and writes to a second part of RAM. This sequence can use many APUs and APUs can read or write different blocks of RAM depending on the application. If the computing power is not enough the APUs in other Cells can also be used to form an even longer chain.
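The chaining described above can be sketched in a few lines of Python. This is only a model of the data flow: the function names are made up, each stage stands in for one APU, each list in `buffers` stands in for that stage's pre-defined part of RAM, and everything runs one step after another here whereas the real APUs would work in parallel.

```python
def run_stream(stages, input_blocks):
    """Push each input block through the stages in order.

    buffers[i] stands in for the pre-defined RAM region that stage i writes to.
    """
    buffers = [[] for _ in stages]
    for block in input_blocks:
        data = block
        for i, stage in enumerate(stages):
            data = stage(data)        # "APU i" processes the block in local memory
            buffers[i].append(data)   # ...then writes it to its own region of RAM
    return buffers


# Three hypothetical processing steps, one per APU.
def scale(block):
    return [x * 2 for x in block]


def offset(block):
    return [x + 1 for x in block]


def total(block):
    return [sum(block)]


regions = run_stream([scale, offset, total], [[1, 2, 3], [4, 5, 6]])
print(regions[-1])   # output of the last stage in the chain
```

Making the chain longer (more APUs, or APUs in other Cells) is just a matter of adding more stages to the list.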

Stream processing does not generally require large memory bandwidth but Cell will have it anyway. According to the patent each Cell will have access to 64 Megabytes directly via 8 bank controllers (it indicates this as an "ideal", the maximum may be higher). If the stream processing is set up to use blocks of RAM in different banks, different APUs processing the stream can be reading and writing simultaneously to the different blocks.

It is where multiple memory banks are being used and the APUs are working on compute heavy streaming applications that the Cell will be working hardest. It's in these applications that the Cell may get close to its theoretical maximum performance and perform over an order of magnitude more calculations per second than any desktop processor currently available.

If overclocked sufficiently (over 3.0GHz) and using some very optimised code (SSE assembly), 5 dual core Opterons directly connected via HyperTransport should be able to achieve a similar level of performance in stream processing as a single Cell. Admittedly, this is purely theoretical and it depends on the Cell achieving its performance goals and a "perfect" application being used; it does however demonstrate the sort of processing capability the Cell potentially has.

The PlayStation 3 is expected to have 4 Cells.

General purpose desktop CPUs are not designed for high performance vector processing. They all have vector units on board in the shape of SSE or Altivec but this is integrated on board and has to share the CPUs resources. The APUs are dedicated high speed vector processors and with their own memory don't need to share anything other than the memory. Add to this the fact there are 8 of them and you can see why their computational capacity is so large.

Such a large performance difference may sound completely ludicrous but it's not without precedent, in fact if you own a reasonably modern graphics card your existing system is capable of a lot more than you think:

"For example, the nVIDIA GeForce 6800 Ultra, recently released, has been observed to reach 40 GFlops in fragment processing. In comparison, the theoretical peak performance of the Intel 3GHz Pentium4 using SSE instructions is only 6GFlops." [GPU]

Actually, something different from the article: in the wonderful world of Linux, there is already a project that utilises the vector processors on the new NVidia and ATI cards. You can make gcc work with the processor on your video card, and since most of the tasks gcc asks the processor to do are supposedly exactly the kinds of tasks vector processors are good for, I've heard amazing stories of unimaginable compile times, where packages that take hours on a 3GHz P4 take literally minutes... (No, no, I don't think that there is anything that is impossible to do to your computer with Linux...)

The DMAC
The DMAC (Direct Memory Access Controller) is a very important part of the Cell as it acts as a communications hub. The PU doesn't issue instructions directly to the APUs but rather issues them to the DMAC and it takes the appropriate actions, this makes sense as the actions usually involve loading or saving data. This also removes the need for direct connections between the PU and APUs.

As the DMAC handles all data going into or out of the Cell it needs to communicate via a very high bandwidth bus system. The patent does not specify the exact nature of this bus other than saying it can be either a normal bus or it can be a packet switched network. The packet switched network will take up more silicon but will also have higher bandwidth, I expect they've gone with the latter since this bus will need to transfer 10s of Gigabytes per second. What we do know from the patent is that this bus is huge, the patent specifies it at a whopping 1024 bits wide.

At the time the patent was written it appears the architecture for the DMAC had not been fully worked out so as well as two potential bus designs the DMAC itself has different designs. Distributed and centralised architectures for the DMAC are both mentioned.

It's clear to me that the DMAC is one of the most important parts of the Cell design. It doesn't do processing itself but has to contend with 10s of Gigabytes of data flowing through it every second to many different destinations. If speculation is correct the PS3 will have a 100 GByte/second memory interface; if this is spread over 4 Cells, each DMAC will need to handle at least 25 Gigabytes per second. It also has to handle the memory protection scheme and be able to issue memory access orders as well as handling communication between the PU and APUs, so it needs to be not only fast but will also be a highly complex piece of engineering.



Memory
As with everything else in the Cell architecture the memory system is designed for raw speed, it will have both low latency and very high bandwidth. As mentioned previously memory is accessed in blocks of 1024 bits. The reason for this is not mentioned in the patent but I have a theory:

While this may reduce flexibility it also decreases memory access latency - the single biggest factor currently holding back computers today. The reason it's faster is the finer the address resolution the more complex the logic and the longer it takes to look it up. The actual looking up may be insignificant on the memory chip but each look-up requires a look-up transaction which involves sending an address from the bank controller to the memory device and this will take time. This time is significant itself as there is one per memory access but what's worse is that every bit of address resolution doubles the number of look-ups required.

If you have 512MB in your PC your RAM look-up resolution is 29 bits*, however the system will read a minimum of 64 bits at a time so resolution is 26 bits. The PC will probably read more than this so you can probably really say 23 bits.

* Note: I'm not counting I/O or graphics address space which will require an extra bit or two.

In the Cell design there are 8 banks of 8MB each and if the minimum read is 1024 bits (128 bytes) the resolution is 16 bits, since 8MB divided by 128 bytes gives 2^16 blocks. An additional 3 bits are used to select the bank but this is done on-chip so will have little impact. Each bit doubles the number of memory look-ups, so by this reasoning the PC will be doing on the order of a hundred times (2^7 = 128) more memory look-ups per second than the Cell does. The Cell's memory busses will have more time free to transfer data and thus will work closer to their maximum theoretical transfer rate. I'm not sure my theory is correct but CPU caches use a similar trick.
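The bit counts in this theory are just memory size divided by the smallest unit read, expressed as a power of two. Here is a small sketch that works them through; the helper name is made up, and the 64-byte PC read is my assumption for the "probably more than this" case. Worked through with the stated sizes, an 8MB bank read in 1024-bit blocks comes out at 16 bits.

```python
MB = 1024 * 1024


def address_bits(memory_bytes, access_bytes):
    """Bits needed to pick one access-sized block (sizes must be powers of two)."""
    blocks = memory_bytes // access_bytes
    return blocks.bit_length() - 1


pc_byte = address_bits(512 * MB, 1)            # byte addressing
pc_word = address_bits(512 * MB, 8)            # minimum 64-bit read
pc_line = address_bits(512 * MB, 64)           # ASSUMPTION: 64-byte reads
cell_bank = address_bits(8 * MB, 1024 // 8)    # one 8MB bank, 1024-bit blocks

ratio = 2 ** (pc_line - cell_bank)             # each extra bit doubles the look-ups

print(pc_byte, pc_word, pc_line, cell_bank, ratio)
```

So under these assumptions the gap between the PC and one Cell bank is 7 bits of resolution, i.e. a factor of 128 in look-ups.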

What is not theoretical is the fact the Cell will use very high speed memory connections - Sony and Toshiba licensed 3.2GHz memory technology from Rambus in 2003 [Rambus]. If each cell has total bandwidth of 25.6 Gigabytes per second each bank transfers data at 3.2 Gigabytes per second. Even given this the buses are not large (64 data pins for all 8), this is important as it keeps chip manufacturing costs down.
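The per-bank and per-pin numbers follow directly from the figures in this paragraph, as this quick sketch shows. The only thing I am assuming is that the 64 data pins are split evenly across the 8 banks; the 4 Cells in the PS3 are the rumoured number mentioned earlier.

```python
CELL_BANDWIDTH_GB_S = 25.6   # per Cell, from the text
BANKS = 8
DATA_PINS = 64               # "64 data pins for all 8"
CELLS_IN_PS3 = 4             # rumoured, per the text

per_bank_gb_s = CELL_BANDWIDTH_GB_S / BANKS
pins_per_bank = DATA_PINS // BANKS                   # ASSUMPTION: even split
per_pin_gbit_s = per_bank_gb_s * 8 / pins_per_bank   # 8 bits per byte
ps3_total_gb_s = CELL_BANDWIDTH_GB_S * CELLS_IN_PS3

print(f"per bank: {per_bank_gb_s:.1f} GB/s over {pins_per_bank} pins")
print(f"per pin:  {per_pin_gbit_s:.1f} Gbit/s")
print(f"4 Cells:  {ps3_total_gb_s:.1f} GB/s")
```

That gives 3.2 Gbit/s on each pin, which lines up with the 3.2GHz Rambus technology, and roughly 100 Gigabytes per second for 4 Cells together.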

100 Gigabytes per second sounds huge until you consider top end graphics cards are in the region of 50 Gigabytes per second already, doubling over a couple of years sounds fairly reasonable. But these are just the theoretical figures and they never get reached. Assuming the system I described above is used, the bandwidth on the Cell should be much closer to its theoretical figure than competing systems and thus will perform better.

APUs may need to access memory from different Cells especially if a long stream is set up, thus the Cells include a high speed interconnect. Details of this are not known other than they transfer data at 6.4 Gigabits / second per wire. I expect there will be busses of these between each Cell to facilitate the high speed transfer of data to each other. This technology sounds not entirely unlike HyperTransport though the implementation may be very different.

In addition to this a switching system has been devised so if more than 4 Cells are present they too can have fast access to memory. This system may be used in Cell based workstations. It's not clear how more than 8 Cells will communicate but I imagine the system could be extended to handle more. IBM have announced a single rack based workstation will be capable of up to 16 TeraFlops, they'll need 64 Cells for this sort of performance so they have obviously found some way of connecting them.
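The 64 Cells figure is simply the 16 TeraFlops target divided by the 250 GFLOPS theoretical peak of one Cell quoted earlier, as this little sketch shows (peak numbers only, so real systems would need more):

```python
import math

TARGET_GFLOPS = 16_000     # 16 TeraFlops, from the text
GFLOPS_PER_CELL = 250      # theoretical peak per Cell, from the text

cells_needed = math.ceil(TARGET_GFLOPS / GFLOPS_PER_CELL)
print(cells_needed)
```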

Memory Protection
The memory system also has a memory protection scheme implemented in the DMAC. Memory is divided into "sandboxes" and a mask used to determine which APU or APUs can access it. This checking is performed in the DMAC before any access is performed, if an APU attempts to read or write the wrong sandbox the memory access is forbidden.

Existing CPUs include a hardware memory protection system but it is a lot more complex than this. They use page tables which indicate the use of blocks of RAM and also indicate if the data is in RAM or on disc. These tables can become large and don't fit on the CPU all at once, which means that in order to read a memory location the CPU may first have to read a page table from memory and read data in from disc - all before the data required is read.

In the Cell the APU can either issue a memory access or not, the table is held in a special SRAM in the DMAC and is never flushed. This system may lack flexibility but is very simple and consistently very fast.
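Here is a minimal sketch of how a sandbox-and-mask check like this could work. It is my own illustration, not the patent's actual mechanism: the class name, the one-bit-per-APU mask layout and the sandbox size are all assumptions.

```python
MB = 1024 * 1024


class SandboxTable:
    """Toy model of the DMAC's protection table (hypothetical layout)."""

    def __init__(self, sandbox_bytes, num_sandboxes):
        self.sandbox_bytes = sandbox_bytes
        # One bit mask per sandbox; bit n set means APU n may access it.
        self.masks = [0] * num_sandboxes

    def grant(self, sandbox, apu_id):
        self.masks[sandbox] |= 1 << apu_id

    def allowed(self, apu_id, address):
        """Checked before every access; a single table look-up, no page walks."""
        sandbox = address // self.sandbox_bytes
        if not 0 <= sandbox < len(self.masks):
            return False
        return bool(self.masks[sandbox] & (1 << apu_id))


table = SandboxTable(sandbox_bytes=1 * MB, num_sandboxes=64)
table.grant(0, apu_id=0)    # sandbox 0 is shared by APU 0 and APU 1
table.grant(0, apu_id=1)
table.grant(1, apu_id=2)    # sandbox 1 belongs to APU 2 only

print(table.allowed(0, 100))       # APU 0 reading inside sandbox 0
print(table.allowed(2, 100))       # APU 2 trying the same address
print(table.allowed(2, MB + 5))    # APU 2 inside its own sandbox
```

The whole check is one divide, one table read and one bit test, which is why it can be consistently fast compared to walking page tables.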

This simple system most likely only applies to the APUs, I expect the PU will have a conventional memory protection system.



Software Cells

Software Cells are containers which hold data and programs called apulets as well as other data and instructions required to get the apulet running (memory required, number of APUs used etc.). The Cell contains source, destination and reply address fields, the nature of these depends on the network in use so software Cells can be sent around to different hardware Cells. There are also network independent addresses which will define the specific Cell exactly. This allows you to, say, send a software Cell to a hardware Cell in a specific computer on a network.
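As a rough picture of what such a container might look like, here is a Python sketch. The field names, the address strings and the `execute` helper are all invented for illustration; the real layout is whatever the patent defines.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SoftwareCell:
    """Hypothetical software Cell: routing info, requirements, apulet and data."""
    source: str
    destination: str
    reply_to: str
    apus_required: int
    memory_required: int                       # bytes
    apulet: Callable[[List[int]], List[int]]   # the program to run
    data: List[int] = field(default_factory=list)


def execute(cell, available_apus, available_memory):
    """Run the apulet on a hardware Cell if it can meet the requirements."""
    if cell.apus_required > available_apus or cell.memory_required > available_memory:
        raise RuntimeError("this hardware Cell cannot run the software Cell")
    return cell.reply_to, cell.apulet(cell.data)


job = SoftwareCell(
    source="pc.local",             # made-up addresses
    destination="ps3.local",
    reply_to="pc.local",
    apus_required=2,
    memory_required=4096,
    apulet=lambda values: [v * v for v in values],
    data=[1, 2, 3],
)

reply_address, result = execute(job, available_apus=8, available_memory=64 * 1024 * 1024)
print(reply_address, result)
```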

The APUs use virtual addresses but these are mapped to a real address as soon as DMA commands are issued. The software Cell contains these DMA commands which retrieve data from memory to process, if APUs are set up to process streams the Cell will contain commands which describe where to read data from and where to write results to. Once set up, the APUs are "kicked" into action.

It's not clear how this system will operate in practice but it would appear to include some adaptivity so as to allow Cells to appear and disappear on a network.

This system is in effect a basic Operating System but could be implemented as a layer within an existing OS. There's no reason to believe Cell will have any limitations regarding which Operating Systems can run.

One of the main points of the entire Cell architecture is parallel processing. Software Cells can be sent pretty much anywhere and don't depend on a specific transport means. The ability of software Cells to run on hardware Cells determined at runtime is a key feature of the Cell architecture. Want more computing power? Plug in a few more Cells and there you are. If you have a bunch of Cells sitting around talking to each other via WiFi connections the system can use them to distribute software Cells for processing. The system was not designed to act like a big iron machine, that is, it is not arranged around a single shared or closely coupled set of memories. All the memory may be addressable but each Cell has its own memory and they'll work most efficiently in their own memory or at least in small groups of Cells where fast inter-links allow the memory to be shared.

Going above this number of Cells isn't described in detail but the mechanism present in the software Cells to make use of whatever networking technology is in use allows ad-hoc arrangements of Cells to be made without having to worry about rewriting software to take account of different network types.

The parallel processing system essentially moves a lot of complexity which would normally be handled by hardware and moves it into software. This usually slows things down but the benefit is flexibility, you give the system a set of software Cells to compute and it figures out how to distribute them itself. If your system changes (Cells added or removed) the OS should take care of this without user or programmer intervention.

Writing software for parallel processing is usually highly difficult and this helps get around the problem. You still, of course, have to parallelise the program into Cells but once that's done you don't have to worry if you have one Cell or ten.
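To illustrate the "one Cell or ten" point, here is a tiny sketch where the same list of work units gets spread over however many hardware Cells happen to be around. The round-robin policy and the device names are my own assumptions; how the real system decides where to send things isn't described.

```python
from itertools import cycle


def distribute(work_units, hardware_cells):
    """Assign work units to whatever hardware Cells are present, round-robin."""
    if not hardware_cells:
        raise ValueError("no hardware Cells available")
    assignment = {name: [] for name in hardware_cells}
    for unit, name in zip(work_units, cycle(hardware_cells)):
        assignment[name].append(unit)
    return assignment


units = list(range(6))                         # six already-parallelised software Cells

alone = distribute(units, ["ps3"])             # one Cell in the house
shared = distribute(units, ["ps3", "tv", "pda"])   # plug in a couple more

print(alone)
print(shared)
```

The calling code is identical in both cases, only the list of available Cells changes.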

In the future, instead of having multiple discrete computers you'll have multiple computers acting as a single system. Upgrading will not mean replacing an old system anymore, it'll mean enhancing it. What's more your "computer" may in reality also include your PDA, TV and Camcorder all co-operating and acting as one.


---snap---


This is it for now, I'll work more on it eventually. It's a good read, so read it.

(Edit: Also, the reason I stopped here is because this gives you a thorough understanding of Cell, the post below just describes implementation, software and why Cell is better and might beat x86 architecture. This is the more factual part, as the post below is a more philosophical, debatable, maybe-a-possibility type of deal. Oh, and I added pictures.)
Attached Thumbnails
Cell Architecture-cell_arch-1-.gif   Cell Architecture-cell_apu-1-.gif  


----------------
Microsoft, the leader in using innovative tactics to promote irksome experience, coupled with antiquated technology that's held together by a pyramid of makeshift afterthoughts.

Apple, the leader in using irksome tactics to promote innovative experience, coupled with an antiquated core that's enhanced by state-of-the-art afterthoughts.

Linux, the leader in not using any tactics to promote user-defined experience, coupled with state-of-the-art core enhanced by innovative afterthoughts.


Last edited by alexander; 04-14-2005 at 07:20 AM..