Training was conducted on a dedicated cluster using a carefully curated text-to-image dataset. The resulting model showed promising ability to generate coherent images, but budgetary constraints limited the size of the training cluster. This, in turn, kept the model from fully exploiting the potential of the DiT architecture.