Birthday problem on social media data

rk-helper
3 min read · Aug 4, 2020

Hello! I decided to test the birthday problem on social media data. I will use data from the Russian social network “VKontakte”, which is very similar to Facebook.

What is the birthday problem?

To understand it, try to answer this question: how many people do you need in a group for the probability that at least two of them share a birthday to reach 0.5? The birthday problem answers exactly this question.

We need to make a few assumptions to solve this problem:

First, our model ignores 29 February, so a year has only 365 days.

Second, we assume that each of the 365 days is equally likely.

Of course, it is not realistic that every day is equally likely as a birthday, because there are seasonal effects, especially in September. I suppose you can guess why…

People who have not seen this problem often answer that we need at least 180 people in a room to have a 0.5 probability of a shared birthday (because there are 365 days in a year). The right answer is far below 180, below 150, and even below 100. The right answer is 23.

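We need at least one matching birthday, so it is easier to first find the probability of the complement: that no birthdays match. For a group of k people this probability is equal to:

$$P(\text{no match}) = \frac{365 \cdot 364 \cdots (365 - k + 1)}{365^k}$$

The idea of the equation is to divide the number of “successful” outcomes (all k birthdays are different) by the number of all outcomes. Here k is the number of people in the group, and the probability of at least one match is 1 minus this value.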

We can then set the probability of at least one match to 0.5 and solve for k with Wolfram Alpha, which gives about 22.7. The number of people must be an integer, so the answer is 23.
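The same answer can be checked numerically instead of solving the equation. Below is a minimal sketch under the two assumptions above (365 equally likely days); the function names `p_shared_birthday` and `smallest_group` are illustrative and not part of the original code.

```python
def p_shared_birthday(k, days=365):
    """Probability that at least two of k people share a birthday."""
    p_no_match = 1.0
    for i in range(k):
        # The (i + 1)-th person must avoid the i birthdays already taken.
        p_no_match *= (days - i) / days
    return 1.0 - p_no_match


def smallest_group(target=0.5, days=365):
    """Smallest k whose match probability reaches the target."""
    k = 1
    while p_shared_birthday(k, days) < target:
        k += 1
    return k


print(smallest_group())                  # 23
print(round(p_shared_birthday(23), 3))   # 0.507
print(round(p_shared_birthday(50), 3))   # 0.97
```

A group of 22 people stays just below 0.5, which is why 23 is the first group size that works.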

Let’s check the birthday problem with social media data

In theory, the probability that at least two people share a birthday is about 0.5 in a group of 23, 0.97 in a group of 50, and more than 0.99 in a group of 100. Let’s check this with social media data. I will use the VK API; VK is the “Russian Facebook”. I chose the VK API because it is easier to use than Facebook’s, and the data is sufficient for my research.

  1. I chose a big public group on the social network. For my research I picked a community that posts memes.

First, I created the CSV columns.

import csv

# Create the CSV file and write the header row with the column names.
with open('vk_data.csv', 'w', newline='') as new_file:
    fieldnames = ['id', 'bdate', 'bmonth', 'byear', 'dandm']
    csv_writer = csv.DictWriter(new_file, fieldnames=fieldnames, delimiter=',')
    csv_writer.writeheader()
    newDict = dict()

Then I log in with my credentials and request the members of the memes group.

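import vk_api

# Authenticate and get an API object.
vk_session = vk_api.VkApi('username', 'password')
vk_session.auth()
vk = vk_session.get_api()
# Request group members together with their birth dates.
vk_group = vk.groups.getMembers(group_id='mudakoff', fields='bdate')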
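One thing to keep in mind: as far as I know, `groups.getMembers` returns at most 1000 members per call, so collecting a larger sample means paging with the `offset` and `count` parameters. Here is a minimal sketch of that idea, assuming the response is a dict with an `items` list. `fetch_members` is a hypothetical helper, not part of `vk_api`.

```python
def fetch_members(vk, group_id, total, page_size=1000):
    """Collect up to `total` member records by paging through the group."""
    members = []
    offset = 0
    while len(members) < total:
        response = vk.groups.getMembers(
            group_id=group_id,
            fields='bdate',
            offset=offset,
            count=min(page_size, total - len(members)),
        )
        items = response['items']  # assumed response layout
        if not items:  # the group has no more members to read
            break
        members.extend(items)
        offset += len(items)
    return members


# Usage with the API object created above (needs a real VK session):
# members = fetch_members(vk, 'mudakoff', total=20000)
```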

Now I will parse the data. In theory, all birthdays are equally likely. The real data looks different.

Birthdays are not uniformly distributed. This is expected, because uniformity was only a simplifying assumption that made the problem solvable. Moreover, there are seasonal effects, for example in July.
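To see the uneven distribution in numbers, the birthdays can be counted per month. This is a minimal sketch, assuming VK returns `bdate` as a string in the form D.M.YYYY, or D.M when the year is hidden, and that some members have no `bdate` at all. `parse_bdate` and `month_counts` are hypothetical helpers, and the sample records are made up.

```python
from collections import Counter


def parse_bdate(bdate):
    """Return (day, month) from 'D.M' or 'D.M.YYYY', or None if unusable."""
    parts = bdate.split('.') if bdate else []
    if len(parts) < 2:
        return None
    try:
        day, month = int(parts[0]), int(parts[1])
    except ValueError:
        return None
    if (day, month) == (29, 2):  # assumption: drop 29 Feb to match the model
        return None
    return day, month


def month_counts(members):
    """Count parsed birthdays per month number (1-12)."""
    counts = Counter()
    for member in members:
        parsed = parse_bdate(member.get('bdate'))
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# Made-up records in the assumed VK format
sample = [{'id': 1, 'bdate': '4.7.1995'}, {'id': 2, 'bdate': '15.7'},
          {'id': 3}, {'id': 4, 'bdate': '29.2.2000'},
          {'id': 5, 'bdate': '1.12.1990'}]
print(month_counts(sample))  # Counter({7: 2, 12: 1})
```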

Now I will check whether a group of 50 people really has about a 0.97 probability of a shared birthday. I wrote a loop that draws the birthdays of 50 random people and checks whether any of them coincide. If they do, the code increments the variable counter. At the end I divide counter by the number of loop iterations.

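# df is assumed to be the pandas DataFrame that holds the parsed VK data
trials = 1000
counter = 0
for i in range(trials):
    # Draw 50 random people and take their day-and-month values.
    fifty = df["dandm"].sample(n=50)
    # Count the trial if at least one birthday repeats in the group.
    if fifty.duplicated().any():
        counter += 1
print('Probability:', counter / trials)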

The empirical probability is about 0.97, which agrees with the theoretical probability.
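For comparison, the same experiment can be run on synthetic birthdays drawn uniformly from 365 days, which is what the theoretical model assumes. This is a sketch that uses only the standard library. `simulate_uniform`, the seed and the trial count are illustrative choices.

```python
import random


def simulate_uniform(group_size=50, trials=1000, days=365, seed=0):
    """Share of random groups that contain at least one repeated birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(group_size)]
        # A set drops duplicates, so a shorter set means a shared birthday.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials


print('Uniform simulation:', simulate_uniform())
```

With 1000 trials the estimate should land close to the theoretical 0.97. A uniform distribution gives the lowest possible match probability, so real, uneven birthdays should not make a match less likely than this baseline.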

Conclusion

It was interesting to see how empirical data can confirm a theoretical result, especially with social media data. I also want to mention that the sample is big, 20,000 people, so the result should be reasonably reliable.

Resources

[1]. Harvard University. Birthday Problem, Properties of Probability | Statistics 110. URL: www.youtube.com/watch?v=LZ5Wergp_PA&t=150s. Accessed: 08.07.2020

[2]. Birthday Problem. URL: en.wikipedia.org/wiki/Birthday_problem. Accessed: 08.07.2020
