Analysis for Business Entrepreneurs in Berlin - The Battle of t…

Azeer Esmail

IT Specialist
Python

Analysis for Business Entrepreneurs in Berlin

July 2019

1. Introduction

1.1 Background Description
Berlin is the largest city in Germany by both area and population, and the second largest in Europe by population within city limits.
Tourism figures have more than doubled within the last ten years, and Berlin has become the third most-visited city destination in Europe. In 2018, Berlin's GDP totaled €147 billion, an increase of 3.1% over 2017.
Combined with its high diversity (foreign residents originate from about 190 different countries) and a booming economy across many sectors, this has made the city a desired destination not only for tourists and expatriates but also for entrepreneurs and businesses.
Studying the opportunities and understanding the status quo before starting a venture in this city is therefore very important and a cornerstone of the business's future success.
1.2 Problem
Despite its promising characteristics for entrepreneurs, Berlin is a very dynamic and ever-changing city. As a resident, I have noticed how frequently small businesses open and then close their doors after a relatively short time.
One likely reason is a lack of knowledge about the demographics and other variables of a given area.
To avoid such losses and to improve the chances of success, data science will be used to explore where a certain kind of business is more likely to succeed.
1.3 Interested Stakeholders
The interested stakeholders are those who want to check different success variables for each neighborhood (locality) before starting a business in Berlin, whether a small business or a housing/real estate business.

2. Data Description

I decided to use neighborhood density, business density, and tourism as the deciding variables in this project.
The data was acquired from different sources by different methods and, in some cases, extrapolated from existing data.
The neighborhood, borough, and neighborhood density data were scraped from the Wikipedia page: https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin
Venue data was acquired from the Foursquare API.
GeoJSON of the neighborhoods (Ortsteile) of Berlin from:
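For the Foursquare step, a request could be assembled roughly as in the sketch below. The source does not say which endpoint or parameters were used, so the v2 `venues/explore` endpoint (as it was available around 2019), the placeholder credentials, the default radius and limit, and the helper name `build_explore_url` are all assumptions. The sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint: Foursquare v2 "explore", as available around 2019.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      radius=500, limit=100, version="20190701"):
    """Return the request URL for venues around one coordinate (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, YYYYMMDD
        "ll": f"{lat},{lng}",  # latitude,longitude of the neighborhood
        "radius": radius,      # search radius in metres (assumed value)
        "limit": limit,        # maximum venues per request (assumed value)
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials and an approximate central-Berlin coordinate.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", 52.52, 13.405)
print(url)
```

One such URL would be requested per neighborhood, using that neighborhood's coordinates.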

3. Methodology

The data was cleaned and processed. Of the 96 neighborhoods in total, the more populous half (48) was selected, and three metrics were calculated: a tourism metric, neighborhood population density, and business density.
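Selecting the more populous half can be sketched as follows. The record layout (a list of dictionaries with `name` and `population` keys) and the function name are assumptions made for illustration, and the sample values are made up rather than taken from the scraped data.

```python
def more_populous_half(neighborhoods):
    """Keep the half of the records with the largest population."""
    ranked = sorted(neighborhoods, key=lambda n: n["population"], reverse=True)
    return ranked[: len(ranked) // 2]


# Made-up records for illustration only; not the scraped Berlin data.
sample = [
    {"name": "A", "population": 1000},
    {"name": "B", "population": 8000},
    {"name": "C", "population": 3000},
    {"name": "D", "population": 500},
]
selected = more_populous_half(sample)
print([n["name"] for n in selected])  # ['B', 'C']
```

Applied to 96 records, the same function returns 48.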
Then the top 10 most common venues were determined for each of the 48 more populous neighborhoods.
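One way to rank venue categories per neighborhood is a simple frequency count. This is a sketch under assumptions: the input is taken to be a list of (neighborhood, category) pairs, the function name is hypothetical, and the sample pairs are invented.

```python
from collections import Counter, defaultdict


def top_venue_categories(venues, top_n=10):
    """Map each neighborhood to its top_n most frequent venue categories."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: [category for category, _ in counter.most_common(top_n)]
        for hood, counter in counts.items()
    }


# Invented sample pairs; real input would come from the venue data.
sample_venues = [
    ("Hood A", "Café"), ("Hood A", "Café"), ("Hood A", "Bar"),
    ("Hood B", "Supermarket"), ("Hood B", "Drugstore"), ("Hood B", "Supermarket"),
]
top = top_venue_categories(sample_venues, top_n=2)
print(top)
```

Each list is ordered from the most to the least frequent category, which corresponds to the "1st" through "10th most common venue" columns.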
The K-means algorithm was used to cluster the data, and the optimal number of clusters (k) was found to be 3. After fitting the model with k=3 and processing the data, a choropleth map with markers was created.
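The source does not state how the number of clusters was checked. A common approach is the elbow method, which compares the within-cluster sum of squares (inertia) for several values of k. The sketch below is a small pure-Python K-means written for illustration, not the project's code: the deterministic farthest-point initialization, the function names, and the toy points are all assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def initial_centroids(points, k):
    """Deterministic farthest-point start (assumed here for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = initial_centroids(points, k)
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the old centroid if a cluster is empty
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    inertia = sum(squared_distance(p, centroids[label])
                  for p, label in zip(points, labels))
    return labels, centroids, inertia


# Toy points forming three obvious groups; not the Berlin data.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (0.0, 10.0), (0.0, 11.0), (1.0, 10.0),
]
inertias = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
labels, centroids, _ = kmeans(points, 3)
```

On data like this, the inertia drops sharply up to k=3 and only slightly afterwards; that bend is the "elbow". In practice a library implementation would normally be used instead of hand-written code.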

4. Results

After clustering, a choropleth map was created with the following features:
1. The inner circle's area represents the business density metric of the neighborhood, scaled by 5 for better visualization:
Radius = sqrt(5*Business_density_metric/pi)
2. The outer circle's area represents the tourism metric of the corresponding borough added to the business density metric (click on the outer circle), scaled by 5 for better visualization:
Radius = sqrt((5*Tourism_metric + 5*Business_density_metric)/pi)
3. The color of the marker represents the cluster.
4. The color of the map represents the population density.
5. The inner circle popup shows the neighborhood, business density metric, and cluster number.
6. The outer circle popup shows the neighborhood, 1st most common venue, and tourism metric.
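The two radius formulas can be checked with a short sketch. Because each circle's area is pi times its radius squared, the inner circle's area equals 5 times the business density metric, and the ring between the two circles has an area of 5 times the tourism metric. The function names and the sample metric values are assumptions for illustration.

```python
import math

SCALE = 5  # visual scaling factor stated in the text


def inner_radius(business_density):
    """Radius of the circle whose area is SCALE * business_density."""
    return math.sqrt(SCALE * business_density / math.pi)


def outer_radius(tourism, business_density):
    """Radius of the circle whose area is SCALE * (tourism + business_density)."""
    return math.sqrt((SCALE * tourism + SCALE * business_density) / math.pi)


# Made-up metric values, only to show the relationship between the areas.
business, tourism = 12.0, 30.0
r_in = inner_radius(business)
r_out = outer_radius(tourism, business)
ring_area = math.pi * r_out ** 2 - math.pi * r_in ** 2
print(r_in, r_out, ring_area)
```

Encoding the metrics in the areas rather than the radii keeps the visual size of each circle proportional to the value it represents.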
The map can therefore convey a lot of information visually to help stakeholders make a decision.
For example, a larger area difference between the inner and outer circles indicates low business density relative to tourism, and if the map tile is also dark, that means a higher population density and more demand.
Likewise, the 1st most common venue category in a neighborhood could mean high demand for this type of venue, but also high competition.
There are many other ways to relate the plotted variables, such as the geographical distance between neighborhoods or the fact that the purple cluster is in the middle with the red one around it.
To see the most common venues in a given cluster (rather than in a neighborhood), I used a word cloud, with the option of choosing how many columns to include, from the 1st to the 10th most common.
Cluster 1 (Purple):
Cluster 0 (Red):
One conclusion we can draw is that there are more cafés in the inner city and more drugstores in the outer, less dense and less touristic parts; supermarkets, however, are common in both clusters.
We can also reduce the number of columns considered to drop the less common venues and see the more common ones more clearly.
Cluster 1 (Purple), 3 columns only:
We can see that bars are fairly common, which possibly indicates high demand and high competition.
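The word frequencies behind such a per-cluster word cloud, including the option to restrict it to the first few "most common venue" columns, can be sketched as below. The row layout (a `cluster` label plus an ordered `top_venues` list), the function name, and the sample rows are assumptions; rendering the cloud itself is left out.

```python
from collections import Counter


def cluster_venue_frequencies(rows, cluster, n_columns=10):
    """Count venue categories in the first n_columns ranks for one cluster."""
    counts = Counter()
    for row in rows:
        if row["cluster"] == cluster:
            counts.update(row["top_venues"][:n_columns])
    return counts


# Invented rows: each holds a cluster label and venues ordered by rank.
rows = [
    {"cluster": 1, "top_venues": ["Café", "Bar", "Supermarket"]},
    {"cluster": 1, "top_venues": ["Bar", "Café", "Bakery"]},
    {"cluster": 0, "top_venues": ["Supermarket", "Drugstore", "Café"]},
]
all_columns = cluster_venue_frequencies(rows, cluster=1)
first_only = cluster_venue_frequencies(rows, cluster=1, n_columns=1)
print(all_columns)
print(first_only)
```

A frequency dictionary like this is the usual input for a word-cloud renderer, where a higher count produces a bigger word.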
Finally, regression plots can give further insight into how strong the relationships between the variables are.
The relationship between business density and neighborhood density is stronger than that between business density and tourism. With that in mind, a stakeholder could give more weight to population density than to tourism when deciding.
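The strength of such a relationship can be quantified with a least-squares line and the Pearson correlation coefficient r, where values closer to 1 or -1 indicate a stronger linear relationship. The sketch below uses invented numbers, not the project's metrics, and the function name is hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Return (slope, intercept, r) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented values: one perfectly linear pair and one noisier pair.
xs = [1.0, 2.0, 3.0, 4.0]
strong = linear_fit(xs, [2.0, 4.0, 6.0, 8.0])
weak = linear_fit(xs, [2.0, 1.0, 4.0, 3.0])
print(strong)
print(weak)
```

Comparing r for business density against neighborhood density with r for business density against tourism would put a number on the difference seen in the plots.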

5. Discussion

As mentioned in the previous section, there are many ways to interpret the data, and that is best done together with the stakeholders. The data itself could be more comprehensive: the same research could be repeated with more data on tourism and businesses, and further variables could be integrated for a better assessment.
Only half of the city's neighborhoods were considered; with more data, one could consider the whole city and see whether any patterns emerge.
Temporal data could also be very informative if included. All of this, however, depends on the needs and interests of the stakeholders.

6. Conclusion

Things business entrepreneurs could consider:
1. Where the map color is darker: high neighborhood density, high demand.
2. Where the outer circle's area is much bigger than the inner circle's area: low business density relative to tourism, less competition, and possibly high demand.
3. Where the outer circle's area is close to the inner circle's area: high business density relative to tourism, more competition, and less demand.
4. The options for the business category are the words shown in the word cloud (depending on the cluster):
big words: more competition but possibly high demand.
small words: less competition but possibly low demand.
*For housing/real estate businesses, it is the opposite of 1 and 2, and the more diverse the word cloud, the better.
To a prosperous Future,
Azeer Esmail