Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction. In this work, we present a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogue. Our approach infers objects and their relationships from input images and language expressions, and places objects in accordance with the spatial relations expressed by the user. Unlike previous approaches, our method allows non-expert users to specify tabletop manipulation tasks as sequences of unconstrained spoken pick-and-place commands. Results on real-world data and in human-robot experiments demonstrate the effectiveness of our method in solving manipulation tasks described by spoken natural language instructions.