Category "cpu-architecture"

Write allocation policy with caches [duplicate]

I was just wondering about the write-allocate policy of caches: first we access data from main memory and put it into the cache, and then update in t

What is an address/range of addresses that are guaranteed to be not used in x86-64?

I am writing a version of malloc that is compatible with multi-threading. It is going to use arenas to help facilitate the parallelism. mmap is being used to cr

Use "arithmetic shift right" as "less than zero"

Is the following: psrad xmm0, 31 ; arithmetic (sign-extend) shift right equivalent to: xorps xmm1, xmm1 ; zero cmpps xmm0, xmm1, 1 ; less than I
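A minimal scalar sketch of the comparison this question raises, assuming `psrad x, 31` acts on each 32-bit lane as a sign-bit broadcast and `cmpps` with predicate 1 acts as an ordered floating-point "less than". The function names are hypothetical, chosen only for illustration. The two masks agree for ordinary negative and positive floats but differ for `-0.0f`, whose sign bit is set even though it does not compare less than zero (a NaN with its sign bit set behaves the same way).

```c
#include <stdint.h>
#include <string.h>

/* Per-lane model of `psrad x, 31`: broadcast the sign bit to all 32 bits.
   Unsigned arithmetic is used so the result does not depend on how the
   compiler shifts negative signed values. */
uint32_t mask_from_shift(uint32_t bits) {
    return 0u - (bits >> 31);   /* 0 -> 0x00000000, 1 -> 0xFFFFFFFF */
}

/* Per-lane model of a floating-point "less than zero" compare
   (assumed here to match cmpps with predicate 1). */
uint32_t mask_from_float_compare(float f) {
    return (f < 0.0f) ? 0xFFFFFFFFu : 0u;
}

/* Reinterpret a float's bit pattern as a 32-bit unsigned integer. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Comparing `mask_from_shift(float_bits(v))` with `mask_from_float_compare(v)` for a few values of `v` shows where the two approaches agree and where they diverge.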

Where do the values of uninitialized variables come from, in practice on real CPUs?

I want to know the way variables are initialized : #include <stdio.h> int main( void ) { int ghosts[3]; for(int i =0 ; i < 3 ; i++) printf(

Why are the A and B registers used in a multicycle datapath?

Why are registers A and B, whose inputs are ReadData1 and ReadData2 of the RegisterFile, necessary? Isn't it possible to use directly the values which are on Read

Direct Mapping Cache Exercise

Consider a computer with the following characteristics: a total of 1 Gbyte of main memory; a word size of 1 byte; a block size of 32 bytes; and a cache size of 128 Kbyte
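As a rough sketch of the arithmetic such an exercise usually asks for, the helper below splits a byte address into tag, index, and offset widths for a direct-mapped cache. It uses only the figures visible in the excerpt (1 Gbyte of memory, 32-byte blocks, 128 Kbyte of cache); the rest of the exercise is cut off, so treat those inputs as assumptions. The names `direct_mapped_split` and `AddressSplit` are hypothetical.

```c
#include <stdint.h>

/* Number of address bits needed to select one of `n` items.
   Assumes `n` is a power of two. */
unsigned log2_pow2(uint64_t n) {
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

typedef struct {
    unsigned tag_bits;
    unsigned index_bits;
    unsigned offset_bits;
} AddressSplit;

/* Field widths for a direct-mapped cache on a byte-addressable machine. */
AddressSplit direct_mapped_split(uint64_t memory_bytes,
                                 uint64_t cache_bytes,
                                 uint64_t block_bytes) {
    AddressSplit s;
    s.offset_bits = log2_pow2(block_bytes);               /* byte within a block */
    s.index_bits  = log2_pow2(cache_bytes / block_bytes); /* which cache line    */
    s.tag_bits    = log2_pow2(memory_bytes) - s.index_bits - s.offset_bits;
    return s;
}
```

With the visible figures, `direct_mapped_split(1ull << 30, 128u * 1024u, 32u)` yields a 30-bit address divided into a 13-bit tag, a 12-bit index, and a 5-bit offset.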

Is a register the only place where arithmetic operands can come from? [duplicate]

(1) I wonder if the register is the only place the arithmetic calculation can happen? It looks like: add BYTE PTR [var], 10 ; add 10 to

How can some architectures guarantee that aligned memory operations are atomic?

As explained in this post: Why is integer assignment on a naturally aligned variable atomic on x86? : Memory load/store on a byte value - and any correctly alig

Create a branch history in loop

Consider int t = 0; for( int i = 0; i < 8; i++ ) { for( int j = 0; j < 8; j++ ) { t = t + i*j; } } Ex: Create a branch history table in t =
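The excerpt does not say which predictor the exercise intends, so the following is only a sketch under stated assumptions: a single 2-bit saturating counter (states 0 to 3, predicting taken when the state is 2 or 3) tracking the inner loop's backward branch, with the loop compiled in do-while form so the branch is taken seven times and then falls through once per outer iteration. The type `Predictor` and the function names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    int state;        /* 2-bit saturating counter, 0..3 */
    int branches;     /* branch executions observed     */
    int mispredicts;  /* wrong predictions              */
    int t;            /* result of the original loop    */
} Predictor;

/* Compare the prediction with the outcome, then train the counter. */
void predictor_update(Predictor *p, bool taken) {
    bool predicted_taken = p->state >= 2;
    if (predicted_taken != taken) p->mispredicts++;
    p->branches++;
    if (taken && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
}

/* Run the nested loop from the question, recording the inner
   loop-closing branch (taken while j < 8). */
Predictor simulate_inner_branch(int initial_state) {
    Predictor p = { initial_state, 0, 0, 0 };
    for (int i = 0; i < 8; i++) {
        int j = 0;
        do {
            p.t = p.t + i * j;
            j++;
            predictor_update(&p, j < 8);
        } while (j < 8);
    }
    return p;
}
```

Starting from state 0 (strongly not taken), the counter needs two taken outcomes to warm up and then mispredicts once at each inner-loop exit; starting from state 3, only the exits are mispredicted.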

What is the difference between BZ and BNZ in instruction pipeline?

I am confused between branching instructions BZ and BNZ. Can anybody, please, explain the concept and working of BZ and BNZ with an example?

Is a mov to a segment register slower than a mov to a general purpose register?

Specifically, is mov %eax, %ds slower than mov %eax, %ebx, or are they the same speed? I've researched online, but have been unable to find a definitive an

Can CPU Out-of-Order-Execution cause memory reordering?

I know that store buffers and invalidate queues can cause memory reordering. What I don't know is whether Out-of-Order-Execution can cause memory reordering.

Optimize a loop for static predict-not-taken? Which prediction problems exist for that in a normal loop?

Which problems arise in the following assembly loop, if Predict Not Taken is chosen by default? Optimize the example to Predict not Taken. addi $s1, $zero, 1024

Can compilers break control dependencies used for LoadStore memory ordering or similar, in any real use-cases?

I'm reading the mailing list about LKMM: Add volatile_if(). The control dependency is somewhat subtle since it is easily forgotten by us developers. So I wonder i

Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?

After serious development, CPUs gained many cores, distributed blocks of cores on multiple chiplets, NUMA systems, etc., but still a piece of data has to p

In computers 32-bit or 64-bit processors are used, why not 40-bit or other numbers?

For example, in the case of 32-bit processors, a word is 4 bytes long. Is it also possible to use a 5-byte word or others?

SIMD intrinsic and memory bus size - How CPU fetches all 128/256 bits in a single memory read?

Hello Forum, I have a few similar/related questions about SIMD intrinsics for which I searched online, including Stack Overflow, but did not find good answer

Why is a conditional move not vulnerable to Branch Prediction Failure?

After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable to Branch Prediction

Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single tes

Why is processing a sorted array faster than processing an unsorted array?

Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the l
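The excerpt is cut off before the code, so the sketch below is not the question's program. It is a hedged illustration of the general idea: a loop whose `if` depends on the data can be written so that the condition is turned into a mask instead of a branch, which makes its speed independent of whether the input is sorted. The threshold of 128 and both function names are assumptions made for this example, and a compiler may already perform a similar transformation on the branchy version.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: whether the `if` body runs depends on each element,
   so the branch is hard to predict when the data is unsorted. */
int64_t sum_large_branchy(const int32_t *data, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128) {
            sum += data[i];
        }
    }
    return sum;
}

/* Branchless form: turn the comparison into an all-ones or all-zeros
   mask and AND it with the element, so no data-dependent branch is needed. */
int64_t sum_large_branchless(const int32_t *data, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t keep = -(int64_t)(data[i] >= 128);  /* 0 or -1 (all ones) */
        sum += data[i] & keep;
    }
    return sum;
}
```

Both functions return the same sum for any input; only the way the condition is evaluated differs.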
