Some solutions:
- Optimise the serial code so that it exploits
  shared-memory features optimally. Then run MPI
  with one process per node.
- Run MPI with as many processes as required.
  From the mynode assignments, create new
  communicators as follows: group all processes
  on a given node into one communicator. Then
  create a collective communicator with one
  process from each node.
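
The grouping in the second solution can be illustrated without running MPI at all. The sketch below is plain Python that only simulates the bookkeeping: given a mynode list (node id per rank), it builds one group per node and one group holding a single representative per node. In a real MPI program the same partition would typically be obtained with MPI_Comm_split, using the node id as the colour. The function name split_by_node, the sample mynode values, and the choice of the lowest rank as each node's representative are assumptions made for illustration, not part of the original material.

```python
from collections import defaultdict

def split_by_node(mynode):
    """Simulate the communicator grouping (no MPI involved).

    mynode[rank] is the id of the node hosting that rank (hypothetical input).
    Returns (node_comms, leader_comm):
      node_comms  maps node id -> list of ranks on that node, ascending
      leader_comm lists one rank per node (the lowest), ordered by node id
    """
    node_comms = defaultdict(list)
    for rank, node in enumerate(mynode):
        # Ranks are visited in ascending order, so each list stays sorted.
        node_comms[node].append(rank)
    node_comms = dict(node_comms)

    # Assumption: the lowest rank on each node acts as its representative.
    leader_comm = [node_comms[node][0] for node in sorted(node_comms)]
    return node_comms, leader_comm

# Example: 8 ranks spread over 3 nodes (made-up assignment).
mynode = [0, 0, 0, 1, 1, 2, 2, 2]
node_comms, leader_comm = split_by_node(mynode)
print(node_comms)
print(leader_comm)
```

With this layout, work inside a node would use that node's group, and only the representatives would take part in inter-node collectives.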