The goal is to analyze a video of a dialog and generate a summary of it. The opinions and intentions of the dialog partners should be summarized as truthfully as possible. As a first step, the speech turns are separated and assigned to the respective speakers using speaker diarization and face identification. Next, the emotions of the speakers are analyzed based on their voices and facial expressions. Finally, all of this information is fed to a Large Language Model, which generates a summary of the dialog.
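
The data flow described above can be illustrated with a minimal sketch in plain Python. It assumes that a diarization step has already produced speaker-labeled time segments, that a transcription step has produced timestamped utterances, and that an emotion classifier has attached a label to each utterance; none of these models is called here. All names (`Segment`, `Utterance`, `assign_speaker`, `build_turns`, `build_prompt`), the maximum-overlap assignment rule, and the prompt wording are hypothetical choices made for illustration, not the method actually used.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A diarized time span: `speaker` talks from `start` to `end` (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Utterance:
    """A transcribed piece of speech with an assumed, precomputed emotion label."""
    start: float
    end: float
    text: str
    emotion: str = "neutral"


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(utt, segments):
    """Pick the speaker whose diarized segments overlap the utterance the most."""
    totals = {}
    for seg in segments:
        ov = overlap(utt.start, utt.end, seg.start, seg.end)
        if ov > 0:
            totals[seg.speaker] = totals.get(seg.speaker, 0.0) + ov
    if not totals:
        return "unknown"  # no diarized segment covers this utterance
    return max(totals, key=totals.get)


def build_turns(utterances, segments):
    """Merge consecutive utterances by the same speaker into speech turns."""
    turns = []
    for utt in sorted(utterances, key=lambda u: u.start):
        speaker = assign_speaker(utt, segments)
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + utt.text
            turns[-1]["emotions"].append(utt.emotion)
        else:
            turns.append(
                {"speaker": speaker, "text": utt.text, "emotions": [utt.emotion]}
            )
    return turns


def build_prompt(turns):
    """Format the annotated turns as text that could be sent to an LLM."""
    lines = ["Summarize the opinions and intentions of each speaker in this dialog."]
    for turn in turns:
        emotions = ", ".join(sorted(set(turn["emotions"])))
        lines.append(f'{turn["speaker"]} [{emotions}]: {turn["text"]}')
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data; in practice these would come from the models.
    segments = [Segment(0.0, 4.0, "A"), Segment(4.0, 9.0, "B")]
    utterances = [
        Utterance(0.2, 2.0, "I think we should postpone.", "calm"),
        Utterance(2.1, 3.9, "The budget is not ready.", "worried"),
        Utterance(4.2, 8.5, "I disagree, we can start now.", "angry"),
    ]
    print(build_prompt(build_turns(utterances, segments)))
```

In this sketch the anonymous diarization labels (`A`, `B`) stand in for identities; a face identification step could map them to named persons before the prompt is built.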