10 – Formal Methods and Definitions
This page is for those who are wondering what is going on under the hood of Argus and OpenCV. It also includes notes about error analysis.
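The camera matrix, which we call K, is a mapping from 3D coordinates to the image plane in the special case when the center of projection is at the origin. If we assume that the principal point of the image (the origin of the image coordinate system) is perfectly aligned with the center of projection along the z-axis, with the camera looking down the positive z-axis, we can write the mapping from homogeneous 3D coordinates to image plane coordinates as:
$$(X, Y, Z, 1)^T \mapsto (fX, fY, Z)^T = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
where f is the focal length of the camera, that is, the distance between the center of projection and the image plane (the image plane is Z = f). This formula assumes that the principal point coincides with the origin of coordinates in the image plane. More generally, with the principal point at $(c_x, c_y)$, the mapping becomes:
$$(X, Y, Z, 1)^T \mapsto (fX + Z c_x, fY + Z c_y, Z)^T = \begin{bmatrix} f & 0 & c_x & 0 \\ 0 & f & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
Now we define the camera matrix as:
$$K = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
and the mapping more concisely:
$$x = K [I \mid 0] X$$
where x is the image plane coordinate, X the 3D homogeneous coordinate and [I | 0] is a 3×4 matrix made by concatenating the 3×3 identity matrix with a zero column vector.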
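As a small illustration of the mapping x = K [I | 0] X, the following sketch projects a single 3D point given in the camera frame. The function names and the numeric values (focal length 800 px, principal point (320, 240)) are made up for the example and are not taken from Argus.

```python
import numpy as np

def camera_matrix(f, cx, cy):
    """Build K for focal length f and principal point (cx, cy), all in pixels."""
    return np.array([[f, 0.0, cx],
                     [0.0, f, cy],
                     [0.0, 0.0, 1.0]])

def project(K, point_3d):
    """Map a 3D point in the camera frame to pixel coordinates via x = K [I | 0] X."""
    X = np.append(np.asarray(point_3d, dtype=float), 1.0)  # homogeneous 4-vector
    I0 = np.hstack([np.eye(3), np.zeros((3, 1))])           # the 3x4 matrix [I | 0]
    x = K @ I0 @ X                                          # homogeneous image point
    return x[:2] / x[2]                                     # divide out the scale

# Hypothetical camera and point, chosen only for illustration.
K = camera_matrix(f=800.0, cx=320.0, cy=240.0)
pixel = project(K, [0.5, -0.25, 2.0])
```

The final division by the third component is what turns the homogeneous result into ordinary pixel coordinates.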
Estimating the Camera Matrix via OpenCV
Argus uses OpenCV algorithms to estimate the intrinsic camera matrix described above. We describe here in some detail the methods which OpenCV uses.
In order to calibrate a camera using Argus or OpenCV, one must film a grid. OpenCV uses methods not mentioned here to mark pixel coordinates of the grid. The software then takes advantage of what is called a planar homography, the projective mapping from one plane (the grid) to another (the image plane), to estimate the camera matrix.
Let $\tilde{Q} = [X, Y, Z, 1]^T$ be the homogeneous coordinates of a point on the grid in world coordinates. Let $\tilde{q} = [x, y, 1]^T$ be the homogeneous coordinates of the corresponding point in the image plane. We can then express the action of the homography between the planes as:
$$\tilde{q} = s H \tilde{Q}$$
where H is the homography matrix, and s is an arbitrary scale factor conventionally factored out of H. (Strictly, H is 3×3 once the grid plane is taken to be Z = 0, as is done later in this section.) H is essentially the projection matrix which we define further on. It contains the camera matrix and a rotation-translation matrix. In other words, H rotates and translates the world coordinates into camera coordinates and then projects them into the image plane with the camera matrix K. Thus, it can be written:
$$\tilde{q} = s K [R \mid t] \tilde{Q}$$
However, one does not need any information about the camera matrix to estimate H. H is simply a transform between points on the grid and points in the image plane:
$$p_{dst} = H p_{src}, \qquad p_{src} = H^{-1} p_{dst}$$
where $p_{dst} = [x_{dst}, y_{dst}, 1]^T$ and $p_{src} = [x_{src}, y_{src}, 1]^T$.
As such, for each image of a grid, OpenCV uses the constraint above for all points found in the grid to form a linear system and solve for the homography matrix. Then, OpenCV uses this information to solve for the camera intrinsic matrix, K.
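One standard way to turn point correspondences into a linear system for H is the direct linear transform (DLT). The sketch below is an illustration of that idea only; it is not a description of OpenCV's internal implementation, and the function name is hypothetical. It assumes at least four correspondences and that the bottom-right entry of H is non-zero, so that H can be scaled to make it 1.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate H (3x3, up to scale) with dst ~ H @ src from N >= 4 point pairs.

    src: (N, 2) points on the grid plane; dst: (N, 2) points in the image.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (X, Y), (x, y) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        rows.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y, -x])
        rows.append([0, 0, 0, X, Y, 1, -y * X, -y * Y, -y])
    A = np.array(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary scale (assumes H[2, 2] != 0)
```

In practice the point coordinates are usually normalized before building the system to improve conditioning; that step is omitted here for brevity.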
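If we set Z = 0 in the coordinates of the grid plane, the homography relationship simplifies, because we no longer need the third column of the rotation matrix:
$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = s K \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix} \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} = s K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix}$$
Writing $H = [h_1 \; h_2 \; h_3]$ column by column, this is equivalent to the system:
$$h_1 = s K r_1, \qquad h_2 = s K r_2, \qquad h_3 = s K t$$
or, with $\lambda = 1/s$, $r_1 = \lambda K^{-1} h_1$, $r_2 = \lambda K^{-1} h_2$ and $t = \lambda K^{-1} h_3$.
By definition the rotation vectors are orthogonal:
$$r_1^T r_2 = 0 \quad \Longrightarrow \quad h_1^T K^{-T} K^{-1} h_2 = 0$$
and, once the scale is factored out, they are also orthonormal, so their lengths are equal:
$$\lVert r_1 \rVert = \lVert r_2 \rVert \quad \Longrightarrow \quad h_1^T K^{-T} K^{-1} h_1 = h_2^T K^{-T} K^{-1} h_2$$
Let $B = K^{-T} K^{-1}$. Writing K in its general form, with focal lengths $f_x$ and $f_y$ along the two image axes and principal point $(c_x, c_y)$, B has the following closed-form solution:
$$B = \begin{bmatrix} \dfrac{1}{f_x^2} & 0 & -\dfrac{c_x}{f_x^2} \\ 0 & \dfrac{1}{f_y^2} & -\dfrac{c_y}{f_y^2} \\ -\dfrac{c_x}{f_x^2} & -\dfrac{c_y}{f_y^2} & \dfrac{c_x^2}{f_x^2} + \dfrac{c_y^2}{f_y^2} + 1 \end{bmatrix}$$
Thus, using B, both constraints are built from terms of the form $h_i^T B h_j$. Because B is symmetric, each such term can be written as a six-dimensional vector dot product:
$$h_i^T B h_j = v_{ij}^T b, \qquad v_{ij} = \begin{pmatrix} h_{i1} h_{j1} \\ h_{i1} h_{j2} + h_{i2} h_{j1} \\ h_{i2} h_{j2} \\ h_{i3} h_{j1} + h_{i1} h_{j3} \\ h_{i3} h_{j2} + h_{i2} h_{j3} \\ h_{i3} h_{j3} \end{pmatrix}, \qquad b = \begin{pmatrix} B_{11} \\ B_{12} \\ B_{22} \\ B_{13} \\ B_{23} \\ B_{33} \end{pmatrix}$$
where $h_{ik}$ is the kth component of the column $h_i$.
Our two constraints can now be written as:
$$\begin{bmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{bmatrix} b = 0$$
For N images, stacking these pairs of rows gives us the 2N × 6 linear system:
$$V b = 0$$
Finally, after solving this system, which determines B only up to a scale factor $\lambda$, the parameters of the camera matrix, K, can be pulled out from the closed-form solution for B:
$$f_x = \sqrt{\lambda / B_{11}}, \qquad f_y = \sqrt{\frac{\lambda B_{11}}{B_{11} B_{22} - B_{12}^2}}$$
$$c_x = -\frac{B_{13} f_x^2}{\lambda}, \qquad c_y = \frac{B_{12} B_{13} - B_{11} B_{23}}{B_{11} B_{22} - B_{12}^2}$$
$$\lambda = B_{33} - \frac{B_{13}^2 + c_y (B_{12} B_{13} - B_{11} B_{23})}{B_{11}}$$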
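The extraction of the intrinsic parameters from B can be sketched in a few lines. This follows the zero-skew closed-form expressions as presented by Bradski and Kaehler; the function name is hypothetical, and the sketch assumes B has already been recovered (up to an unknown positive scale) from the linear system.

```python
import numpy as np

def intrinsics_from_B(B):
    """Recover K from B = lambda * K^-T K^-1 (zero skew assumed, lambda > 0 unknown)."""
    B11, B12, B13 = B[0, 0], B[0, 1], B[0, 2]
    B22, B23, B33 = B[1, 1], B[1, 2], B[2, 2]
    denom = B11 * B22 - B12 ** 2
    cy = (B12 * B13 - B11 * B23) / denom
    lam = B33 - (B13 ** 2 + cy * (B12 * B13 - B11 * B23)) / B11  # the unknown scale
    fx = np.sqrt(lam / B11)
    fy = np.sqrt(lam * B11 / denom)
    cx = -B13 * fx ** 2 / lam
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])
```

A quick sanity check is to build B from a known K, multiply it by an arbitrary positive scale, and confirm that the same K comes back.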
Distortion according to the Pinhole model
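The camera model used by OpenCV adds two types of lens distortion to the pinhole model, radial and tangential. Radial distortion is described by the first three terms of a truncated Taylor series expansion ($k_1, k_2, k_3$). The relation between distorted and undistorted pixel coordinates with radial distortion is:
$$x_{corrected} = x (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$
$$y_{corrected} = y (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$
where r is the distance of the point (x, y) from the optical center $(c_x, c_y)$. Tangential distortion is described by two parameters ($p_1, p_2$). The correction relationship for tangential distortion is:
$$x_{corrected} = x + [2 p_1 x y + p_2 (r^2 + 2 x^2)]$$
$$y_{corrected} = y + [p_1 (r^2 + 2 y^2) + 2 p_2 x y]$$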
After solving for the camera intrinsics, OpenCV sets up a non-linear system based on these constraints to solve for the distortion coefficients. OpenCV employs the Levenberg-Marquardt algorithm to minimize the sum of squares of the pixel reprojection errors. It computes ideal pixel locations from the estimated camera intrinsics, assuming a perfect pinhole camera.
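To make the roles of the five coefficients concrete, the sketch below applies radial and tangential distortion together to a point given relative to the optical center, in the combined form used in the OpenCV documentation. The function name is hypothetical, and the coordinates are assumed to be normalized (optical center subtracted, divided by the focal length).

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion to a point (x, y)."""
    r2 = x * x + y * y                                   # squared distance from the optical center
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_d, y_d
```

With all coefficients set to zero the function returns its input unchanged, which is the perfect pinhole case, and a point at the optical center is never moved.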
The fundamental matrix is the algebraic representation of epipolar geometry in a two-camera system. It represents the map between a point in one image and the corresponding epipolar line in the other image. Thus, as it is a map between a two-dimensional projective plane and a one-dimensional projective space, it has a rank of 2.
For a pair of corresponding points in two images, $x$ and $x'$, it satisfies the following relationship:
$$x'^T F x = 0$$
The essential matrix satisfies a relationship similar to that of the fundamental matrix, the only difference being that it works on normalized image coordinates. A normalized image coordinate can be expressed as $\hat{x} = K^{-1} x$, which is equivalent to subtracting out the optical center and dividing by the focal length.
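As a brief illustration, normalizing a pixel coordinate amounts to applying the inverse of the camera matrix. The function name and the numbers below are hypothetical.

```python
import numpy as np

def normalize_pixel(K, u, v):
    """Return the normalized homogeneous coordinate K^-1 (u, v, 1)^T."""
    x_hat = np.linalg.solve(K, np.array([u, v, 1.0]))  # same as inv(K) @ x, but more stable
    return x_hat / x_hat[2]

# Hypothetical camera: focal length 800 px, optical center (320, 240).
K_example = np.array([[800.0, 0.0, 320.0],
                      [0.0, 800.0, 240.0],
                      [0.0, 0.0, 1.0]])
x_hat = normalize_pixel(K_example, 520.0, 140.0)
```

For this camera the result equals ((520 - 320) / 800, (140 - 240) / 800, 1), that is, the optical center subtracted and the focal length divided out.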
Argus estimates the essential matrix using an OpenCV implementation of the 8-point algorithm. We describe this algorithm here:
8-Point algorithm for estimating the essential matrix
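Let $E$ be the essential matrix between two views.
Let $e = (E_{11}, E_{12}, E_{13}, E_{21}, E_{22}, E_{23}, E_{31}, E_{32}, E_{33})^T$ be the column vector of its entries.
$\hat{x} = (x, y, 1)^T$ denotes a normalized image coordinate in homogeneous coordinates.
$\hat{x}_j \leftrightarrow \hat{x}'_j$ is the jth correspondence between the two views.
Each correspondence must satisfy $\hat{x}'^T_j E \hat{x}_j = 0$, so n correspondences give the homogeneous linear system $A e = 0$, where the jth row of A is $(x'_j x_j, \; x'_j y_j, \; x'_j, \; y'_j x_j, \; y'_j y_j, \; y'_j, \; x_j, \; y_j, \; 1)$.
We assume OpenCV's implementation simplifies the problem somewhat by setting $E_{33} = 1$. The system then becomes:
$$\tilde{A} \tilde{e} = -c$$
where $\tilde{A}$ is A without its last column and $\tilde{e}$ is e without its last entry.
Let c be a column vector of ones of length n.
In the case of $n = 8$, $\tilde{A}$ is generically non-singular. E is obtained by solving the system:
$$\tilde{e} = -\tilde{A}^{-1} c \tag{1}$$
and then imposing the rank 2 constraint.
The rank 2 constraint can be imposed by taking the solution $E$ from (1) and finding the rank 2 matrix $E'$ such that the Frobenius norm $\lVert E - E' \rVert_F$ is minimized. This can be done through Singular Value Decomposition (SVD), but it is not known here which implementation OpenCV's algorithm employs at this step.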
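The procedure described above can be sketched as follows. This is an illustration of the text, not OpenCV's code: it fixes the bottom-right entry of E to 1 (so it fails if that entry is truly zero), solves the resulting system by least squares so that more than eight correspondences can be used, and imposes rank 2 by zeroing the smallest singular value. The function name is hypothetical.

```python
import numpy as np

def eight_point_essential(x1, x2):
    """Estimate E with x2_hat^T E x1_hat = 0 from n >= 8 normalized correspondences.

    x1, x2: (n, 2) arrays of normalized image coordinates in views 1 and 2.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    x, y = x1[:, 0], x1[:, 1]
    xp, yp = x2[:, 0], x2[:, 1]
    # Rows of A without the last column (the coefficient of E33, which is fixed to 1).
    A_tilde = np.column_stack([xp * x, xp * y, xp, yp * x, yp * y, yp, x, y])
    c = np.ones(len(x1))
    e_tilde, *_ = np.linalg.lstsq(A_tilde, -c, rcond=None)
    E = np.append(e_tilde, 1.0).reshape(3, 3)
    # Closest rank 2 matrix in Frobenius norm: drop the smallest singular value.
    U, S, Vt = np.linalg.svd(E)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```

With exactly eight correspondences the least-squares solve reduces to the direct solution of the square system.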
Decomposition of the Essential matrix to find Projection matrices
Projection matrices define the transform between 3D points and image points. For a stereo camera system, it is convenient to define the pair of projection matrices, $P$ and $P'$, as follows:
$$P = K [I \mid 0], \qquad P' = K' [R \mid t]$$
where R and t are the rotation matrix and translation vector relating the coordinate frame of one camera to the other. In other words, the coordinate system is defined with respect to one camera's center of projection. That center of projection is the origin, and the positive z-axis points out into the field of view. This is why the first projection matrix is simply the camera matrix times the identity concatenated with a zero vector.
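Argus estimates the essential matrix, E, and then retrieves the two projection matrices from an SVD of E. According to Hartley and Zisserman, if the SVD of E is $E = U \, \mathrm{diag}(1, 1, 0) \, V^T$ and the first camera is $P = [I \mid 0]$, then there are four possibilities for $P'$:
$$P' = [U W V^T \mid +u_3] \quad \text{or} \quad [U W V^T \mid -u_3] \quad \text{or} \quad [U W^T V^T \mid +u_3] \quad \text{or} \quad [U W^T V^T \mid -u_3]$$
where t, the translation vector, is simply $u_3$, the third column of the matrix U (up to sign), and
$$W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$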
Argus chooses the correct projection matrix by reconstructing the points with each of the four candidates and selecting the one which gives the most points with positive z, i.e. points that are in front of the first camera.
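The four candidates can be generated directly from the SVD, as in the sketch below. It follows the Hartley and Zisserman construction; the sign flips on U and V^T are a common safeguard (an assumption of this sketch, not something stated above) to ensure the returned rotations have determinant +1. The function name is hypothetical, and selecting among the candidates by triangulating points is left out.

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs for the second camera [R | t]."""
    U, _, Vt = np.linalg.svd(E)
    # E is only defined up to sign, so flipping U or Vt is allowed; this keeps det(R) = +1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R_a = U @ W @ Vt
    R_b = U @ W.T @ Vt
    u3 = U[:, 2]  # translation direction, unit length, sign ambiguous
    return [(R_a, u3), (R_a, -u3), (R_b, u3), (R_b, -u3)]
```

Because the translation is recovered only as a unit vector, the reconstruction that follows is defined up to an overall scale.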
Reconstruction through triangulation
Argus again uses OpenCV to reconstruct 3D points after estimating the projection matrices. For n cameras, Argus reconstructs the points for n – 1 camera pairs where all pairs contain the last camera. One camera must be contained in all the pairs to keep the coordinate system consistent. Argus takes the reconstructed points from all n – 1 pairs, normalizes them by dividing by the average distance from the origin, and averages them for the final estimation of a reconstructed point.
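The normalize-then-average step can be sketched as below, assuming each camera pair yields an (N, 3) array holding the same N points in the same order. The function name and data layout are hypothetical and are not taken from the Argus source.

```python
import numpy as np

def combine_pair_reconstructions(reconstructions):
    """Normalize each pair's points by their mean distance from the origin, then average."""
    normalized = []
    for points in reconstructions:
        points = np.asarray(points, dtype=float)
        scale = np.linalg.norm(points, axis=1).mean()  # average distance from the origin
        normalized.append(points / scale)
    return np.mean(normalized, axis=0)                 # per-point average across pairs
```

Dividing by the average distance removes the arbitrary scale of each pairwise reconstruction, so that the pairs can be averaged meaningfully.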
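The linear system which OpenCV solves through the least-squares method is described by Hartley and Zisserman. For a point $X$ seen at $(x, y)$ in the first image and $(x', y')$ in the second:
$$A X = 0, \qquad A = \begin{bmatrix} x \, p^{3T} - p^{1T} \\ y \, p^{3T} - p^{2T} \\ x' \, p'^{3T} - p'^{1T} \\ y' \, p'^{3T} - p'^{2T} \end{bmatrix}$$
where $p^{iT}$ is the ith row of the projection matrix $P$, and $p'^{iT}$ the ith row of $P'$.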
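A minimal sketch of this linear triangulation for a single point seen in two views is given below. It solves the homogeneous system with an SVD, which is one common least-squares approach; whether OpenCV uses exactly this route internally is not claimed here, and the function name is hypothetical.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Triangulate one 3D point from pixel observations x1, x2 and 3x4 matrices P1, P2."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    # Least-squares solution of A X = 0 with ||X|| = 1: the last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # back from homogeneous to 3D coordinates
```

The result is expressed in the coordinate frame in which the projection matrices are defined, here the frame of the first camera.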
Finding offsets through cross-correlation of audio-signals
Argus uses the cross-correlation of the audio samples from two videos to obtain their offset, that is, the lag in time between when one video started and when the other did. To accomplish this in a computationally efficient way, Argus computes the convolution of the two signals using the Fast Fourier Transform (FFT). The signals are assumed to be mono, and the audio sample rate must be known. An overview of the method is given here:
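Given two signals $f$ and $g$ which are time series of audio samples, one can find the cross-correlation by employing the fact that it is equivalent to the convolution of the two series with time reversed for one:
$$(f \star g)(\tau) = f(-\tau) * g(\tau)$$
Then one makes significant computational gains when realizing:
$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}$$
where $\mathcal{F}$ denotes the Fourier transform. Finally, because the transform of a time-reversed real signal is the complex conjugate of the original transform, the cross-correlation function becomes:
$$f \star g = \mathcal{F}^{-1}\left\{ \overline{\mathcal{F}\{f\}} \cdot \mathcal{F}\{g\} \right\}$$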
Then finding the offset is simply a matter of choosing the lag associated with the maximum of $f \star g$. For a normalized cross-correlation, the value of that maximum is ideally 1, meaning the signals are perfectly correlated for that lag, but due to environmental noise and phase shifts, this is never the case.
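The method can be sketched with NumPy's FFT routines as follows. This is an illustration under stated assumptions (mono signals at a shared, known sample rate), not the Argus implementation; the function name is hypothetical. Zero-padding to at least the combined length of the two signals keeps the circular correlation computed by the FFT from wrapping around.

```python
import numpy as np

def find_offset(a, b, sample_rate):
    """Return (lag_samples, lag_seconds) such that b[t + lag] best matches a[t]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = len(a) + len(b) - 1                  # length of the full linear cross-correlation
    nfft = 1 << (n - 1).bit_length()         # next power of two >= n
    spectrum = np.conj(np.fft.rfft(a, nfft)) * np.fft.rfft(b, nfft)
    xcorr = np.fft.irfft(spectrum, nfft)     # xcorr[k] = sum over t of a[t] * b[t + k]
    k = int(np.argmax(xcorr))
    if k > len(b) - 1:                       # negative lags are stored at the end of the array
        k -= nfft
    return k, k / sample_rate
```

A positive lag means that a given sound appears later in the second signal than in the first, and a negative lag means the opposite.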
Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge: Cambridge UP, 2003. Print.
Bradski, Gary R., and Adrian Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. Sebastopol, CA: O’Reilly, 2008. Print.
Sur, Frédéric, Nicolas Noury, and Marie-Odile Berger. In Mark Everingham, Chris Needham, and Roberto Fraile (eds.), Electronic Proceedings of the 19th British Machine Vision Conference (BMVC 2008), Leeds, United Kingdom, Sep 2008. 10 pp. <http://www.comp.leeds.ac.uk/bmvc2008/proceedings/papers/269.pdf>