Based on your diagram, the “extrinsic” camera matrix (D) is the VIEWING transform (in OpenGL terms).
And the “intrinsic” camera matrix (K) is some kind of PROJECTION transform (probably perspective). It doesn’t match OpenGL’s perspective projection transform (the matrix given in the glFrustum and gluPerspective reference pages, for instance), but from the form you cite, I suspect you’re omitting the screen-space Z (depth) transform and only mapping X and Y.
In any case, hopefully this helps you map the concepts to what OpenGL calls them. If you read “Chapter 3: Viewing” in the OpenGL Programming Guide, this should all become clearer (the book can be browsed online).
Once you determine the PROJECTION matrix you want, you can take your 3D points from world to screen coordinates: apply the VIEWING matrix, then the PROJECTION matrix, do the perspective divide (divide by w), and finally apply your VIEWPORT transform. gluProject is one function that will do this for you, though it’s not hard to do it yourself.
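If you want to do it yourself, a minimal sketch of that world-to-screen chain might look like the following. It assumes column-major 4x4 matrices (the layout OpenGL uses), a viewport given as {x, y, width, height}, and the default depth range of [0, 1]. The names mat4_mul_vec4 and project_point are made up for illustration and are not part of OpenGL or GLU.

```c
#include <stdbool.h>

/* Multiply a column-major 4x4 matrix by a 4-component column vector. */
void mat4_mul_vec4(const double m[16], const double v[4], double out[4])
{
    for (int row = 0; row < 4; ++row) {
        out[row] = m[row]      * v[0]
                 + m[4 + row]  * v[1]
                 + m[8 + row]  * v[2]
                 + m[12 + row] * v[3];
    }
}

/*
 * Hypothetical helper: world -> eye -> clip -> NDC -> window coordinates.
 * view and proj are column-major; viewport is {x, y, width, height}.
 * Returns false when the clip-space w is zero (the point cannot be projected).
 */
bool project_point(const double world[3],
                   const double view[16],
                   const double proj[16],
                   const int viewport[4],
                   double win[3])
{
    const double p[4] = { world[0], world[1], world[2], 1.0 };
    double eye[4];
    double clip[4];

    mat4_mul_vec4(view, p, eye);    /* VIEWING transform (extrinsics) */
    mat4_mul_vec4(proj, eye, clip); /* PROJECTION transform */

    if (clip[3] == 0.0) {
        return false;
    }

    /* Perspective divide: clip space -> normalized device coords in [-1, 1]. */
    const double ndc_x = clip[0] / clip[3];
    const double ndc_y = clip[1] / clip[3];
    const double ndc_z = clip[2] / clip[3];

    /* VIEWPORT transform: NDC -> window coordinates, depth mapped to [0, 1]. */
    win[0] = viewport[0] + viewport[2] * (ndc_x + 1.0) * 0.5;
    win[1] = viewport[1] + viewport[3] * (ndc_y + 1.0) * 0.5;
    win[2] = (ndc_z + 1.0) * 0.5;
    return true;
}
```

With identity VIEWING and PROJECTION matrices and a 640x480 viewport at the origin, the world origin lands at the center of the window. Note that window Y runs bottom-up in OpenGL, so you may need to flip it if your image coordinates have Y pointing down.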
gluPerspective is a convenience function that builds a PROJECTION matrix from camera parameters you’re probably more used to thinking in: vertical field-of-view (FOV) and aspect ratio, plus the near and far clip distances. Here’s some code that computes the projection matrix values from those inputs: