'Looking for approaches to optimize a simple byte-at-a-time strcpy in MIPS assembly

I am trying to optimize the following block of mips code and am struggling to find ways to reduce overall instructions or other ways of optimization. Thanks!

   .data
      Str: .asciiz "ABC"
      Strempty: .asciiz "   "
    
    .text
    .globl main
    main:
        jal strcpy              #Jump to the strcpy subroutine
        li $v0,4                #Code for printing strings
        la $a0,Strempty         #syscall to print the content of the copied string
        syscall
        li $v0,10               #Terminates the program
        syscall
    
    strcpy:
          la $a0,Str            #Get address of string y
          la $a1, Strempty      #Get address of string x
          addi $sp,$sp,-4       #Adjust stack pointer to make room for doubleword
          sw $s5,0($sp)         #Push $s5 -  i.e push contents of $s5 to the top of the stack
          add $s5,$zero,$zero   #i=0
    
    L1:
            add $t0,$s5,$a0     #$t0 = addr of y[i]
            lbu $t1,0($t0)      #t1 = y[i]
            add $t2,$s5,$a1     #$t2 = addr of x[i]
            sb $t1,0($t2)       #x[i] = y[i]
            beq $t1,$zero,L2    #Branches out when the null character is reached
            addi $s5,$s5,1      #i=i+1
            j L1                #Next iteration for L1
    L2:
            lw $s5,0($sp)       #restore $s5
            addi $sp,$sp,4      #pop the doubleword from the stack pointer
            jr $ra              #return to main

Solution 1:^[1]

Don't use $s5 — use $t4 or $v0 instead, these are free to use there without the save/restore. $s5 is a call preserved register, whereas the $ts and $vs are call clobbered (so go ahead and clobber one). In fact, you should use registers $a0 and $a1 as the parameter registers.
Don't use the stack at all — this is a leaf function and doesn't require any stack space given there are registers immediately available to use.
Use pointers instead of indexing — try to do a pointer version in C first so you can follow it, but the gist is that indexing requires repeated addition, and using pointers instead will reduce that a bit, from 3 additions in the loop to 2 of them. The pointer version will not have need of counter/index variable i. Modify $a0 and $a1 in place; and eliminate i.
Don't load your parameter values inside the function — let them come as passed by main.
As @Peter says, broadly speaking if we have an unconditional backward branch at the end of the loop and a conditional forward branch somewhere inside the loop that terminates it, we can use the conditional branch to accomplish the looping, and thus eliminate the backward branch. This will become obvious after getting rid of the index increment as the sequence will be:

    beq $t1, $0, L2
    j L1
L2:

And that sequence can be inverted: change the beq into bne and the branch target label from L2 to L1, then remove the j L1. That conditional branch will both perform the exit test and branching backwards to continue the loop.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Looking for approaches to optimize a simple byte-at-a-time strcpy in MIPS assembly

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]