I was just wondering about the write-allocate policy of caches: first we access the data from main memory and put it into the cache, and then update in t
I am writing a version of malloc that is compatible with multi-threading. It is going to use arenas to help facilitate the parallelism. mmap is being used to cr
Is the following: psrad xmm0, 31 ; arithmetic (sign-extend) shift right equivalent to: xorps xmm1, xmm1 ; zero cmpps xmm0, xmm1, 1 ; less than I
I want to know how variables are initialized: #include <stdio.h> int main( void ) { int ghosts[3]; for(int i =0 ; i < 3 ; i++) printf(
Why are registers A and B, whose inputs are ReadData1 and ReadData2 of the RegisterFile, necessary? Isn't it possible to directly use the values which are on Read
Consider a computer with the following characteristics: total of 1Gbyte of main memory; word size of 1 byte; block size of 32 bytes; and cache size of 128 Kbyte
(1) I wonder if the register is the only place where arithmetic calculation can happen? It looks like: add BYTE PTR [var], 10 ; add 10 to
As explained in this post: Why is integer assignment on a naturally aligned variable atomic on x86? : Memory load/store on a byte value - and any correctly alig
Consider int t = 0; for( int i = 0; i < 8; i++ ) { for( int j = 0; j < 8; j++ ) { t = t + i*j; } } Ex: Create a branch history table in t =
I am confused about the branch instructions BZ and BNZ. Can anybody please explain the concept and working of BZ and BNZ with an example?
Specifically, is mov %eax, %ds slower than mov %eax, %ebx, or are they the same speed? I've researched online, but have been unable to find a definitive an
I know that store buffers and invalidate queues cause memory reordering. What I don't know is whether out-of-order execution can cause memory reordering.
Which problems arise in the following assembly loop if Predict Not Taken is chosen by default? Optimize the example for Predict Not Taken. addi $s1, $zero, 1024
I'm reading the mailing list thread about LKMM: Add volatile_if(). The control dependency is somewhat subtle since it is easily forgotten by us developers. So I wonder i
After serious development, CPUs gained many cores, distributed blocks of cores on multiple chiplets, NUMA systems, etc., but a piece of data still has to p
For example, in the case of 32-bit processors, a word is 4 bytes long. Is it also possible to use a 5-byte word or other sizes?
Hello Forum, I have a few similar/related questions about SIMD intrinsics, for which I searched online, including Stack Overflow, but did not find a good answer
After reading this post (answer on StackOverflow) (at the optimization section), I was wondering why conditional moves are not vulnerable to Branch Prediction
I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming. For the life of me, I can't verify this in a single tes
Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data (before the timed region) miraculously makes the l