Instructions and Clock Cycles

I am trying to understand why certain Intel instructions take more than one
clock cycle to complete while others can even be done in less than one
clock cycle.

I thought each instruction should take one clock cycle to complete, but that
doesn't seem to be the case. Can someone kindly explain this to me…

Why would floating-point divide be as fast as integer add?

Bedanto wrote:

I am trying to understand why certain Intel instructions take more than
one clock cycle to complete while others can even be done in less than
one clock cycle.

I thought each instruction should take one clock cycle to complete,
but that doesn't seem to be the case. Can someone kindly explain this
to me…

This is not something that can be explained in a simple mailing list
post. You need to do some basic reading about processor architectures
in general.

However, consider this. If I have an instruction that needs to read
from memory, first I need to figure out what address I need to read.
That might involve pulling values from two registers, adding them, and
adding a constant from the instruction. That arithmetic has to finish
before I can even start the operation to request memory. Then, I have
to know where to put it.
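As a rough sketch of the address arithmetic Tim describes (in Python, with made-up register values and a hypothetical `effective_address` helper, not any real instruction trace):

```python
# Sketch of the effective-address calculation for a memory operand such
# as [rbx + rsi*4 + 16]: base register plus scaled index register plus
# a constant displacement from the instruction. The point is that this
# arithmetic must finish before the memory request can even be issued.
def effective_address(base, index, scale, displacement):
    return base + index * scale + displacement

addr = effective_address(base=0x1000, index=3, scale=4, displacement=16)
print(hex(addr))  # 0x101c
```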

Division is another interesting example. There is still no algorithm
for doing a division in one cycle. It’s done iteratively, not unlike
the way you do long division on paper.
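The iterative nature of division can be sketched with binary long division, the same one-quotient-bit-per-step idea hardware dividers build on (a simplified illustration, not any particular CPU's divider):

```python
# Bit-by-bit restoring division: shift in one dividend bit per step,
# subtract the divisor when possible, and record one quotient bit.
# An n-bit divide therefore takes on the order of n iterations, which
# is why it cannot finish in a single cycle.
def long_divide(dividend, divisor, bits=32):
    quotient = 0
    remainder = 0
    for i in range(bits - 1, -1, -1):
        # Bring down the next bit, just like long division on paper.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder

print(long_divide(100, 7))  # (14, 2)
```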

Remember, the processor cycle is the smallest step in getting things
done. Despite parallelism, if operation B cannot start until the result
of operation A is available, then by definition those two operations
cannot complete in the same cycle. Many Intel instructions are
complicated, involving many steps.

These operations tend to be done in small, incremental steps in small
functional units, and by having several of those units, I can have
several instructions in mid-process at the same time. An instruction
can’t really take less than one cycle, but because of multiple pipeline
units, I can sometimes have two or more instructions complete in the
same cycle, so the average cycle count is less than 1.
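The effect of dependencies versus multiple units can be shown with a toy in-order scheduler (a simplified model with made-up parameters: issue width 2, every operation one cycle; real pipelines are far more elaborate):

```python
# Toy model of a 2-wide machine: up to `width` instructions can issue
# per cycle, but an instruction must wait until every instruction it
# depends on has finished. Independent instructions pair up (average
# under one cycle each); a dependency chain forces one per cycle.
def cycles_needed(instrs, width=2):
    """instrs: list of (name, list_of_dependency_names)."""
    finish = {}          # name -> cycle in which its result is ready
    cycle, slots = 1, 0  # current issue cycle and slots used in it
    for name, deps in instrs:
        ready = max((finish[d] + 1 for d in deps), default=1)
        if ready > cycle or slots == width:
            cycle = max(cycle + 1, ready)
            slots = 0
        finish[name] = cycle
        slots += 1
    return cycle

# Four independent adds: two per cycle, done in 2 cycles (average 0.5).
independent = [("a", []), ("b", []), ("c", []), ("d", [])]
# Each result feeds the next: 4 cycles, no pairing possible.
chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
print(cycles_needed(independent), cycles_needed(chain))  # 2 4
```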


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> However, consider this. If I have an instruction that needs to read
> from memory, first I need to figure out what address I need to read.
> That might involve pulling values from two registers, adding them, and
> adding a constant from the instruction.

…and this has been done by a separate set of silicon gates, parallel to the main execution flow, since the 80286.

> Division is another interesting example. There is still no algorithm
> for doing a division in one cycle.

Yes, though some CPUs (IIRC the DEC Alpha, though I may be wrong; possibly modern x86 CPUs too) have a "flash multiplier" that can do MUL in one cycle.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

> Division is another interesting example. There is still no algorithm for doing a division in one cycle.

Actually, there are algorithms and methods for doing division, as well as just about anything else that CPUs do, without relying on the very concept of a clock cycle in the first place (search the web for "asynchronous circuits" and you will find yourself in the exciting world of Muller C-elements, handshake protocols, dual-rail encoding, and other concepts that are totally different from the clock-based hardware design concepts that prevail today). These methods have been well known since the late 1950s.
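For the curious, the Muller C-element mentioned above can be modeled in a few lines (a behavioral toy, not a gate-level design): its output changes only when both inputs agree, which is how asynchronous circuits signal that a stage has completed.

```python
# Behavioral model of a Muller C-element: the output follows the
# inputs only when they agree; while they disagree, it holds its
# previous value. This "wait until both sides are ready" behavior is
# the basic handshake primitive in asynchronous (clockless) design.
class CElement:
    def __init__(self):
        self.out = 0  # reset state

    def update(self, a, b):
        if a == b:            # both inputs agree: output follows them
            self.out = a
        return self.out       # otherwise hold the last value

c = CElement()
print(c.update(1, 0))  # 0 (inputs disagree, output holds reset value)
print(c.update(1, 1))  # 1 (both high, output goes high)
print(c.update(0, 1))  # 1 (inputs disagree again, output holds)
```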

Therefore, if you want to do division in one cycle, you may find these concepts quite helpful. It is just a question of what you are going to gain from introducing asynchronous concepts into a system that, as a whole, is based upon the notion of clock cycles. All the performance benefits implied by asynchronous design will be washed away by the synchronous parts (you cannot make a system run faster than its slowest component, just as you cannot make a chain stronger than its weakest link). Therefore, you are more than likely to end up with extra complexity and increased transistor count for no reason whatsoever…

Anton Bassov