Once the data has been collected, it should always be preprocessed and analyzed so we understand what we actually have. This includes how sparse the data is, how strongly the existing features are correlated, whether they are normalized, how they are distributed, what the labels look like (if we have any), and the overall summary statistics. With this picture in hand, we can decide which pipeline is the most suitable for us.
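As a rough illustration, the checks listed above can be sketched for a tabular dataset. This is a minimal sketch, not a prescribed method: it assumes the data fits in a pandas `DataFrame`, and the function name `summarize_dataset`, the column names, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_dataset(df, label_col=None):
    """Return basic exploratory statistics for a tabular dataset.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    features = df.drop(columns=[label_col]) if label_col else df
    numeric = features.select_dtypes(include="number")

    summary = {
        # Fraction of missing cells per feature (one notion of sparsity).
        "missing_fraction": features.isna().mean(),
        # Fraction of exact zeros per numeric feature (another notion of sparsity).
        "zero_fraction": (numeric == 0).mean(),
        # Pairwise Pearson correlation between numeric features.
        "correlation": numeric.corr(),
        # Count, mean, std, min, quartiles, max: reveals differences in scale.
        "describe": numeric.describe(),
        # Skewness as a rough indicator of distribution shape.
        "skewness": numeric.skew(),
    }
    if label_col:
        # Relative frequency of each label value, to spot class imbalance.
        summary["label_balance"] = df[label_col].value_counts(normalize=True)
    return summary


# Toy data, made up purely for demonstration.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [0.0, 0.0, np.nan, 5.0],
    "label": ["x", "x", "y", "x"],
})
report = summarize_dataset(df, label_col="label")
print(report["missing_fraction"])
print(report["correlation"])
print(report["label_balance"])
```

Features with large ranges in `describe` may need scaling, highly correlated pairs in `correlation` may be redundant, and a skewed `label_balance` may call for resampling or different metrics. Which of these matter depends on the pipeline being considered.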