Some solutions:
• Optimise the serial code so that it makes the best possible use of the shared memory within a node. Then run MPI with one processor per node.
• Run MPI with as many processors as required. Using the mynode assignments, create new communicators as follows: group all processors on a given node into one communicator, then create a collective communicator containing one processor from each node.
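
The second approach maps naturally onto MPI_Comm_split, which takes a "color" per process and puts all processes with the same color into one new communicator. The following is a minimal sketch in plain Python (no MPI library is needed to run it) of how those colors might be derived. It assumes that mynode is available as a list giving a non-negative node index for each rank, and it picks the lowest rank on each node as that node's representative. The function name split_colors and the value used for MPI_UNDEFINED are illustrative stand-ins, not part of any MPI implementation.

```python
# Stand-in for MPI's MPI_UNDEFINED constant; the real value is
# implementation-defined. A rank passing it as color gets no communicator.
MPI_UNDEFINED = -1


def split_colors(mynode):
    """Return (node_color, leader_color), one entry per rank.

    mynode[r] is assumed to be the non-negative node index of rank r.
    node_color[r]   : color for the per-node communicator (same node, same color)
    leader_color[r] : 0 for the lowest rank on each node, MPI_UNDEFINED otherwise
    """
    first_rank_on_node = {}
    for rank, node in enumerate(mynode):
        # setdefault keeps the first (lowest) rank seen on each node
        first_rank_on_node.setdefault(node, rank)

    node_color = list(mynode)
    leader_color = [
        0 if first_rank_on_node[node] == rank else MPI_UNDEFINED
        for rank, node in enumerate(mynode)
    ]
    return node_color, leader_color


# Example: six ranks spread over three nodes, two ranks per node.
mynode = [0, 0, 1, 1, 2, 2]
node_color, leader_color = split_colors(mynode)
```

In a real MPI program each rank would pass its own node_color entry as the color argument of one MPI_Comm_split call on MPI_COMM_WORLD to obtain the per-node communicator, and its leader_color entry to a second call to obtain the communicator with one processor from each node.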