We start by loading the dataset to inspect its structure. The data contains columns such as employee demographics (Age, Gender), job characteristics (Job_Role, Industry, Work_Location), and variables directly related to mental health (Mental_Health_Condition, Stress_Level, Satisfaction_with_Remote_Work, etc.).
How does remote work affect stress levels compared to onsite or hybrid work?
Is there a relationship between access to mental health resources and mental health conditions?
What is the distribution of job roles across industries, and how does this correlate with remote work satisfaction?
How does work-life balance differ by work location?
Are there regional differences in satisfaction with remote work?
How does physical activity or sleep quality relate to productivity changes?
Handling Missing Data: The next step is to check for any missing values, as this can affect the analysis.
Summarize key statistics of the dataset to understand its central tendencies (e.g., average hours worked, stress levels, etc.).
Use graphs to visualize the relationships between variables. We can create:
Bar charts to compare remote vs. onsite work for mental health conditions.
Pie charts for the distribution of satisfaction with remote work.
Scatter plots or heatmaps to show correlations between stress level and access to mental health resources.
After visualization and analysis, we can derive insights about the effects of remote work on mental health and productivity.
Now, let's perform the EDA based on the steps above. First, we will check for missing values and generate descriptive statistics.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
data = pd.read_csv('Impact_of_Remote_Work_on_Mental_Health.csv')
# Step 1: Check for missing values
missing_values = data.isnull().sum()
# Step 1.1: Descriptive statistics
descriptive_stats = data.describe()
# Step 2: Distribution of mental health conditions based on work location
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Work_Location', hue='Mental_Health_Condition')
plt.title('Distribution of Mental Health Conditions by Work Location')
plt.xlabel('Work Location')
plt.ylabel('Count')
plt.show()
# Step 3: Work-Life Balance across different work locations
plt.figure(figsize=(10,6))
sns.boxplot(data=data, x='Work_Location', y='Work_Life_Balance_Rating')
plt.title('Work-Life Balance Rating by Work Location')
plt.xlabel('Work Location')
plt.ylabel('Work-Life Balance Rating')
plt.show()
# Step 4: Stress Levels based on Mental Health Resources Access
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Access_to_Mental_Health_Resources', hue='Stress_Level')
plt.title('Stress Level by Access to Mental Health Resources')
plt.xlabel('Access to Mental Health Resources')
plt.ylabel('Count')
plt.show()
# Step 5: Satisfaction with Remote Work by Region
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Region', hue='Satisfaction_with_Remote_Work')
plt.title('Satisfaction with Remote Work by Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.show()
# Step 6: Correlation heatmap to analyze relationships between numerical features
# Select only numerical features for correlation analysis
numerical_data = data.select_dtypes(include=['number']) # Select numerical columns only
plt.figure(figsize=(12,8))
sns.heatmap(numerical_data.corr(), annot=True, cmap='coolwarm') # Calculate correlation on numerical data
plt.title('Correlation Heatmap of Features')
plt.show()
# Show missing values and descriptive stats
missing_values, descriptive_stats
(Employee_ID                             0
 Age                                     0
 Gender                                  0
 Job_Role                                0
 Industry                                0
 Years_of_Experience                     0
 Work_Location                           0
 Hours_Worked_Per_Week                   0
 Number_of_Virtual_Meetings              0
 Work_Life_Balance_Rating                0
 Stress_Level                            0
 Mental_Health_Condition              1196
 Access_to_Mental_Health_Resources       0
 Productivity_Change                     0
 Social_Isolation_Rating                 0
 Satisfaction_with_Remote_Work           0
 Company_Support_for_Remote_Work         0
 Physical_Activity                    1629
 Sleep_Quality                           0
 Region                                  0
 dtype: int64,
                Age  Years_of_Experience  Hours_Worked_Per_Week  \
 count  5000.000000          5000.000000            5000.000000
 mean     40.995000            17.810200              39.614600
 std      11.296021            10.020412              11.860194
 min      22.000000             1.000000              20.000000
 25%      31.000000             9.000000              29.000000
 50%      41.000000            18.000000              40.000000
 75%      51.000000            26.000000              50.000000
 max      60.000000            35.000000              60.000000

        Number_of_Virtual_Meetings  Work_Life_Balance_Rating  \
 count                 5000.000000               5000.000000
 mean                     7.559000                  2.984200
 std                      4.636121                  1.410513
 min                      0.000000                  1.000000
 25%                      4.000000                  2.000000
 50%                      8.000000                  3.000000
 75%                     12.000000                  4.000000
 max                     15.000000                  5.000000

        Social_Isolation_Rating  Company_Support_for_Remote_Work
 count              5000.000000                      5000.000000
 mean                  2.993800                         3.007800
 std                   1.394615                         1.399046
 min                   1.000000                         1.000000
 25%                   2.000000                         2.000000
 50%                   3.000000                         3.000000
 75%                   4.000000                         4.000000
 max                   5.000000                         5.000000  )
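The missing-value check reports 1,196 missing entries in Mental_Health_Condition and 1,629 in Physical_Activity; every other column is complete. Both columns appear to use "None" as a regular category (no condition, no physical activity), and pandas can parse that string as a missing value when loading the CSV. These gaps therefore most likely stand for a valid "None" answer rather than unrecorded data, so they should be relabeled instead of dropped or imputed. Keep in mind that the count plots below leave out rows where the plotted column is missing.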
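The counts of 1,196 and 1,629 in the output above are worth a second look before treating them as gaps. The sketch below uses a made-up three-row excerpt (the column names follow the dataset, the rows and the `EMP...` IDs are hypothetical) to show how `read_csv` can be told to treat only empty fields as missing. Whether the default parser turns "None" into NaN depends on the pandas version; as far as I know it does from pandas 2.0 on, so treat that part as an assumption.

```python
import io

import pandas as pd

# Hypothetical excerpt: column names match the dataset, rows are invented.
csv_text = """Employee_ID,Mental_Health_Condition,Physical_Activity
EMP0001,Depression,Weekly
EMP0002,None,None
EMP0003,None,Daily
"""

# Default parsing: depending on the pandas version, "None" may become NaN.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.isnull().sum())

# Explicit parsing: only empty fields count as missing,
# so the string "None" survives as an ordinary category label.
explicit_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)
print(explicit_df["Mental_Health_Condition"].value_counts())
```

If the same pattern holds for the real file, passing the same two arguments to the `pd.read_csv` call at the top would keep "None" visible in every count plot.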
Mental health conditions like anxiety and depression are present across all work locations (Remote, Hybrid, and Onsite). Remote workers appear to have slightly higher counts of anxiety and depression than those working onsite, suggesting a potential link between remote work and mental health challenges. Because the plot shows raw counts, part of this difference may simply reflect how many employees fall into each work location.
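To separate group size from the actual pattern, the counts can be turned into within-group shares. Here is a minimal sketch on toy data; the category labels and the rows are assumptions for illustration, not values from the real file.

```python
import pandas as pd

# Toy data with assumed labels; the real dataset has 5000 rows.
toy = pd.DataFrame({
    "Work_Location": ["Remote"] * 4 + ["Onsite"] * 2 + ["Hybrid"] * 2,
    "Mental_Health_Condition": [
        "Anxiety", "Anxiety", "Depression", "None",  # Remote
        "Anxiety", "None",                           # Onsite
        "Depression", "None",                        # Hybrid
    ],
})

# Raw counts: this is what a count plot displays.
counts = pd.crosstab(toy["Work_Location"], toy["Mental_Health_Condition"])

# Row-normalized shares: each work location sums to 1.
shares = pd.crosstab(
    toy["Work_Location"],
    toy["Mental_Health_Condition"],
    normalize="index",
)

print(counts)
print(shares.round(2))
```

In this toy example Remote has twice as many anxiety cases as Onsite, yet the share is the same in both groups, which is exactly the kind of effect raw counts can hide.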
Onsite workers generally report better work-life balance, as indicated by the boxplot, whereas remote workers show a wider spread, with both high and low ratings. Hybrid workers fall between the two groups in terms of work-life balance ratings.
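The boxplot reading can be backed with the numbers a box is drawn from: the median and the interquartile range per group. The following is a small sketch on invented ratings; `iqr` is a helper defined here, not part of pandas.

```python
import pandas as pd

# Invented ratings on the 1-5 scale, four employees per location.
toy = pd.DataFrame({
    "Work_Location": ["Onsite"] * 4 + ["Remote"] * 4 + ["Hybrid"] * 4,
    "Work_Life_Balance_Rating": [3, 3, 4, 4, 1, 2, 4, 5, 2, 3, 3, 4],
})


def iqr(ratings: pd.Series) -> float:
    """Interquartile range: the height of the box in a boxplot."""
    return ratings.quantile(0.75) - ratings.quantile(0.25)


summary = toy.groupby("Work_Location")["Work_Life_Balance_Rating"].agg(
    median="median",
    mean="mean",
    iqr=iqr,
)
print(summary)
```

On the real data, comparing the `iqr` column across locations would show whether the remote group really is more spread out than the other two.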
Employees with access to mental health resources generally report lower stress levels. However, stress is still present even when resources are available, indicating that the resources may not fully alleviate stress for everyone.
Satisfaction with remote work varies noticeably across regions. For instance, Asia has a more balanced distribution of satisfied and unsatisfied employees, while North America shows a higher percentage of unsatisfied employees compared to Europe.
The correlation heatmap shows some interesting relationships:
There is a negative correlation between "Hours Worked Per Week" and "Work-Life Balance Rating," indicating that more hours worked may lead to lower work-life balance.
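A positive relationship is expected between "Company Support for Remote Work" and "Satisfaction with Remote Work," which would suggest that employees are more satisfied when they feel supported by their company in remote work situations. Note, however, that Satisfaction_with_Remote_Work is not among the numeric columns in the descriptive statistics, so it is not part of this numeric-only heatmap; the relationship has to be checked separately after encoding the satisfaction labels.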
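Since Satisfaction_with_Remote_Work does not show up among the numeric columns in the descriptive statistics, a numeric-only heatmap cannot include it. One way to check such a relationship is to map the labels to an ordered score and compute a rank correlation. The sketch below assumes the labels are "Unsatisfied", "Neutral" and "Satisfied" and uses made-up rows; Spearman's rho is computed as the Pearson correlation of the ranks, which avoids any extra dependency.

```python
import pandas as pd

# Made-up rows; the three satisfaction labels are an assumption.
toy = pd.DataFrame({
    "Company_Support_for_Remote_Work": [1, 2, 3, 4, 5, 5],
    "Satisfaction_with_Remote_Work": [
        "Unsatisfied", "Unsatisfied", "Neutral",
        "Neutral", "Satisfied", "Satisfied",
    ],
})

# Assumed ordering of the labels, from least to most satisfied.
satisfaction_order = {"Unsatisfied": 0, "Neutral": 1, "Satisfied": 2}
toy["Satisfaction_Score"] = toy["Satisfaction_with_Remote_Work"].map(satisfaction_order)

# Spearman's rho is the Pearson correlation of the ranks.
support_ranks = toy["Company_Support_for_Remote_Work"].rank()
satisfaction_ranks = toy["Satisfaction_Score"].rank()
rho = support_ranks.corr(satisfaction_ranks)

print(round(rho, 3))
```

The toy rows were built to be strongly related, so the high value here says nothing about the real data; it only shows the mechanics.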
The analysis shows that while remote work offers flexibility, it also brings challenges, particularly regarding mental health, stress levels, and work-life balance. Access to mental health resources and company support for remote work play significant roles in mitigating these issues. Regional differences also influence how employees perceive and handle remote work.
As we are not limited by any constraints in this case and this data set is way too fun, let's see if we can reveal even more hidden insights, such as:
Are there differences in productivity between different work locations (remote, hybrid, onsite)? Is it possible to identify a link between “Stress Level” and “Productivity Change”?
How does physical activity affect mental health and stress levels? For example, compare the “Physical Activity” column with the mental health conditions (Anxiety, Depression, None).
Is there a link between sleep quality and stress level? A scatterplot or heatmap could show whether poor sleep correlates with higher stress.
Are there regional differences in work-life balance and stress levels? This analysis could indicate cultural differences in attitudes towards remote work.
Do certain occupations (Job Role) have higher satisfaction with Remote Work? This could show whether remote work works better in certain professions.
It could be informative to examine whether a high number of virtual meetings influences the stress level.
# Step 1: Productivity Change by Work Location
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Work_Location', hue='Productivity_Change')
plt.title('Productivity Change by Work Location')
plt.xlabel('Work Location')
plt.ylabel('Count')
plt.show()
# Step 2: Relationship between Physical Activity and Mental Health Condition
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Physical_Activity', hue='Mental_Health_Condition')
plt.title('Physical Activity and Mental Health Condition')
plt.xlabel('Physical Activity')
plt.ylabel('Count')
plt.show()
# Step 3: Sleep Quality vs. Stress Level
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Sleep_Quality', hue='Stress_Level')
plt.title('Sleep Quality vs. Stress Level')
plt.xlabel('Sleep Quality')
plt.ylabel('Count')
plt.show()
# Step 4: Work-Life Balance and Stress Levels by Region
plt.figure(figsize=(10,6))
sns.boxplot(data=data, x='Region', y='Work_Life_Balance_Rating', hue='Stress_Level')
plt.title('Work-Life Balance and Stress Levels by Region')
plt.xlabel('Region')
plt.ylabel('Work-Life Balance Rating')
plt.show()
# Step 5: Satisfaction with Remote Work by Job Role
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='Job_Role', hue='Satisfaction_with_Remote_Work', order=data['Job_Role'].value_counts().index)
plt.title('Satisfaction with Remote Work by Job Role')
plt.xticks(rotation=90)
plt.xlabel('Job Role')
plt.ylabel('Count')
plt.show()
# Step 6: Number of Virtual Meetings vs. Stress Level
plt.figure(figsize=(10,6))
sns.scatterplot(data=data, x='Number_of_Virtual_Meetings', y='Stress_Level', hue='Work_Location')
plt.title('Number of Virtual Meetings vs. Stress Level by Work Location')
plt.xlabel('Number of Virtual Meetings')
plt.ylabel('Stress Level')
plt.show()
Remote workers show both increases and decreases in productivity, with a notable number reporting no change. Hybrid workers appear to have the most stable productivity, while onsite workers are less likely to experience an increase in productivity. This suggests that the flexibility of remote work may not universally lead to higher productivity and could vary by individual circumstances.
Those who engage in weekly physical activity show higher instances of mental health conditions (such as anxiety and depression) than those who engage in physical activity less frequently. This could indicate that those experiencing mental health issues may be more aware of the need for physical activity. No physical activity is associated with lower instances of mental health issues, though this may be due to underreporting or a lack of awareness.
Poor sleep quality goes along with higher levels of stress. Most employees with poor sleep report high stress levels, while those with good sleep quality are more likely to report lower stress. This is consistent with sleep playing an important role in managing stress, especially in remote work settings, although a count plot alone cannot show which one drives the other.
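For two categorical columns like these, the strength of the association can be put into a single number. The sketch below computes the chi-square statistic and Cramér's V by hand on toy data. This is my choice of measure, not something used in the analysis above, and the labels "Poor"/"Good" and "High"/"Low" are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with assumed labels: 4 poor sleepers, 4 good sleepers.
toy = pd.DataFrame({
    "Sleep_Quality": ["Poor"] * 4 + ["Good"] * 4,
    "Stress_Level": ["High", "High", "High", "Low", "Low", "Low", "Low", "High"],
})

# Observed contingency table (rows: sleep quality, columns: stress level).
observed = pd.crosstab(toy["Sleep_Quality"], toy["Stress_Level"]).to_numpy()
n = observed.sum()

# Expected counts if sleep and stress were independent.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Chi-square statistic and Cramér's V (0 = no association, 1 = perfect).
chi2 = ((observed - expected) ** 2 / expected).sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(chi2, cramers_v)
```

A value near 0 on the real data would mean the visual impression from the count plot is weaker than it looks; a value closer to 1 would support it.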
North America shows higher levels of stress across all work-life balance ratings compared to Europe and Asia. Work-life balance is perceived to be better in Europe, with lower stress levels overall. There is a clear indication that regional differences influence how people experience stress in relation to their work-life balance.
Data Scientists and Software Engineers report higher satisfaction with remote work compared to other job roles such as Sales and HR. Certain job roles appear more suited to remote work environments, possibly due to the nature of the tasks and the autonomy required in these fields.
The scatter plot suggests a positive correlation between the number of virtual meetings and stress levels, particularly for onsite workers. This could imply that frequent virtual meetings, especially for those who work onsite, contribute to higher stress. Remote and hybrid workers show a more varied response, with some experiencing higher stress despite fewer meetings, indicating other factors might be contributing to stress.
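Stress_Level is not among the numeric columns in the descriptive statistics, so the scatter plot stacks all points on a few horizontal lines and is hard to read with 5000 rows. A table of the average number of meetings per stress level and work location may be easier to interpret. This is a sketch on invented rows, assuming the stress labels are "Low", "Medium" and "High".

```python
import pandas as pd

# Invented rows: two employees per stress level and location.
toy = pd.DataFrame({
    "Work_Location": ["Onsite"] * 6 + ["Remote"] * 6,
    "Stress_Level": ["Low", "Low", "Medium", "Medium", "High", "High"] * 2,
    "Number_of_Virtual_Meetings": [2, 4, 6, 8, 10, 12, 6, 8, 7, 7, 5, 9],
})

# Give the stress labels an explicit order so the rows sort Low -> High.
stress_order = ["Low", "Medium", "High"]
toy["Stress_Level"] = pd.Categorical(
    toy["Stress_Level"], categories=stress_order, ordered=True
)

# Mean number of meetings for every stress level / work location pair.
mean_meetings = (
    toy.groupby(["Stress_Level", "Work_Location"], observed=True)[
        "Number_of_Virtual_Meetings"
    ]
    .mean()
    .unstack()
)
print(mean_meetings)
```

In the toy table the Onsite column rises with stress while the Remote column stays flat, which is the pattern the paragraph above describes; the real data would have to confirm it.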
TL;DR: so what now? Can we tell which way of working is "the best"? How do these work arrangements affect productivity and employee satisfaction? Well...
Hybrid Work seems to offer the best balance for both productivity and satisfaction. Remote work can be advantageous for certain roles or personalities but may not suit everyone. Onsite work, while stable, lacks the flexibility that appears to enhance employee satisfaction in many cases. Hybrid for the win!