iOS SceneKit what is inversedTransform? (for dummies)

I'm new to 3D rendering with Metal and SceneKit. I see that a specific "inverse" transform is required to be passed to the renderer / shader. I printed out the transforms and see no relationship between them. A Google search turns up a bunch of rather advanced topics.

So I ask a question for dummies, like myself:

What is the meaning of "inverse" view transform for shaders?

What happens if I don't invert the transform?

// Apple code below (with original comment):
// Pass view-appropriate image transform to the shader modifier so
// that the mapped video lines up correctly with the background video.

guard let sceneView = renderer as? ARSCNView,
      let frame = sceneView.session.currentFrame
else { return }

let affineTransform = frame.displayTransform(for: .portrait, viewportSize: self.view.bounds.size)
let transform = SCNMatrix4(affineTransform)
let inverse = SCNMatrix4Invert(transform) // pass to shader
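
For context, here's a minimal sketch of how such an inverse typically reaches the shader (the faceGeometry variable and the "displayTransform" uniform name are assumptions here, not part of Apple's snippet above): SceneKit binds custom shader-modifier uniforms by key-value coding, so the matrix is set under the same key the modifier declares.

// Sketch only: faceGeometry is assumed to have a geometry or surface shader
// modifier that declares `uniform mat4 displayTransform;`. SceneKit matches
// the key below to that uniform and uploads the matrix.
faceGeometry.setValue(NSValue(scnMatrix4: inverse), forKey: "displayTransform")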


transform:
SCNMatrix4(
    m11: 0.0,        m12: 1.0,  m13: 0.0, m14: 0.0,
    m21: -1.5227804, m22: 0.0,  m23: 0.0, m24: 0.0,
    m31: 0.0,        m32: 0.0,  m33: 1.0, m34: 0.0,
    m41: 1.2613902,  m42: -0.0, m43: 0.0, m44: 1.0)
----
inversed:
SCNMatrix4(
    m11: 0.0, m12: -0.6566935, m13: 0.0, m14: 0.0,
    m21: 1.0, m22: 0.0,        m23: 0.0, m24: 0.0,
    m31: 0.0, m32: 0.0,        m33: 1.0, m34: 0.0,
    m41: 0.0, m42: 0.8283468,  m43: 0.0, m44: 1.0)
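
To see the relationship between the two matrices, note that one undoes the other. Here is a small self-contained sketch that uses the same numbers as the printout above, interpreted as a 2D affine transform, and shows that applying the transform and then its inverse returns the original point:

import CoreGraphics

// The printed SCNMatrix4 corresponds to a CGAffineTransform like this:
// a = m11, b = m12, c = m21, d = m22, tx = m41, ty = m42.
let displayTransform = CGAffineTransform(a: 0, b: 1,
                                         c: -1.5227804, d: 0,
                                         tx: 1.2613902, ty: 0)
let inverseTransform = displayTransform.inverted()

let imagePoint = CGPoint(x: 0.25, y: 0.75)                  // normalized camera-image point
let viewportPoint = imagePoint.applying(displayTransform)   // into viewport space
let roundTrip = viewportPoint.applying(inverseTransform)    // back to ≈ (0.25, 0.75)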

Here's some info which suggests that the matrices above are a composite of translation, rotation and scaling. While a CGAffineTransform exposes those as the familiar a, b, c, d, tx, ty elements, here they are all folded into one 4x4 matrix:

https://www.javatpoint.com/computer-graphics-3d-inverse-transformations



Solution 1:[1]

I found this excellent description of the steps that object coordinates go through before being passed to the renderer:

First, when rendering the camera image, the inverse transform uses the camera intrinsics to convert a camera X, Y (plus an optional depth) into object space. The intrinsics matrix converts 3D into 2D, so the inverse transform goes from 2D back to 3D:

const auto localPoint = cameraIntrinsicsInversed * simd_float3(cameraPoint, 1) * depth;
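
A minimal Swift version of the same idea, assuming made-up intrinsics, pixel position and depth (in ARKit these would come from ARCamera.intrinsics and the scene-depth map):

import simd

// Assumed camera intrinsics (column-major): fx and fy on the diagonal,
// principal point cx, cy in the last column.
let intrinsics = simd_float3x3(columns: (
    simd_float3(1500,    0, 0),
    simd_float3(   0, 1500, 0),
    simd_float3( 960,  540, 1)
))
let cameraPoint = simd_float2(1200, 300)   // pixel coordinates in the camera image
let depth: Float = 2.5                     // metres, sampled from the depth map

// The inverse intrinsics go from 2D pixel space back to a 3D ray in camera space;
// multiplying by depth picks the actual point on that ray.
let localPoint = intrinsics.inverse * simd_float3(cameraPoint.x, cameraPoint.y, 1) * depth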

  1. Object (local) space to world space, via the localToWorld matrix in Apple's examples (a 4x4 matrix).
  2. World space to view/eye/camera space, via the viewMatrix (a 4x4 matrix; the Z axis points towards the viewer/camera, the Y axis points up).
  3. View space to clip space, via the projection matrix (a 4x4 matrix, which has a W component). X, Y, Z values outside the -W...W range are clipped or culled, i.e. not rendered. Created with the matrix_float4x4_perspective helper in Apple's samples.

We now have a sequence of matrices that will move us all the way from object space to clip space, which is the space Metal expects the vertices returned by our vertex shader to be in. Multiplying all of these matrices together produces a model-view-projection (MVP) matrix, which is what we actually pass to our vertex shader so that each vertex can be multiplied by it on the GPU.

// pseudocode, ignoring the 4th (W) component needed to make the multiplication work:
modelViewProjection = projectionMatrix * viewMatrix * localToWorld
clipPosition = modelViewProjection * vertXYZ
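
In Swift with simd types, that chain could look roughly like the sketch below (identity matrices stand in for the real ones; in ARKit the view and projection matrices come from ARCamera.viewMatrix(for:) and ARCamera.projectionMatrix(for:viewportSize:zNear:zFar:)):

import simd

let localToWorld = matrix_identity_float4x4       // model matrix (placeholder)
let viewMatrix = matrix_identity_float4x4         // world -> eye (placeholder)
let projectionMatrix = matrix_identity_float4x4   // eye -> clip (placeholder)

// Column-major convention: the matrix applied first sits closest to the vertex.
let modelViewProjection = projectionMatrix * viewMatrix * localToWorld

let vertXYZ = simd_float3(0.1, 0.2, -1.0)
let clipPosition = modelViewProjection * simd_float4(vertXYZ.x, vertXYZ.y, vertXYZ.z, 1)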

The renderer then performs further transformations:

  1. Clip space to NDC (normalized device coordinates): a half-cube with x, y in -1...1 and z in 0...1.
  2. NDC to window (pixel) positions.
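
A rough sketch of those two fixed-function steps, assuming a clip-space position and a 1920x1080 drawable:

import simd

let clip = simd_float4(0.3, -0.2, 0.5, 2.0)              // vertex-shader output (assumed)
let ndc = simd_float3(clip.x, clip.y, clip.z) / clip.w   // perspective divide -> NDC
// Metal NDC: x, y in -1...1, z in 0...1.

let drawableSize = simd_float2(1920, 1080)               // assumed
let pixelX = (ndc.x * 0.5 + 0.5) * drawableSize.x
let pixelY = (1 - (ndc.y * 0.5 + 0.5)) * drawableSize.y  // window origin is top-left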

Solution 2:[2]

When an SCNScene is captured by a camera, scene depth data is lost, because 3D objects are projected onto a 2D image plane. If we want to do the reverse transformation, the 2D image alone is not enough to reconstruct the 3D scene; we also need depth. That's what the inverse transform, applied together with depth, is for.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1:
Solution 2: Andy Jazz