knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
library(janitor)
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidylog)
library(stringr)
library(stats)
library(infer)
library(readxl)
library(lubridate)
library(visdat)
library(knitr)
library(rmarkdown)
library(kableExtra)
library(purrr)
This report compiles and analyses data I collected on my own job applications from March 2024 to September 2025. Whenever I found an interesting job description, I followed a two-step process: (1) I pasted the job description into a Word document, and (2) I filled in an Excel spreadsheet with details from it, such as the company name, job title and date of application. The first aim was to keep the job details at hand in case I was called for an interview after the posting had been taken down, which would otherwise have hindered my preparation. The second aim was to track my applications so that I could report them to my unemployment agency, which required a minimum number of applications per month along with details such as company name, location, link to the application and job title. After more than a year, the dataset had grown large enough to be worth analysing, and it gave me a chance to practise the data wrangling skills I acquired during 2025. Besides being a stimulating and entertaining exercise, it also gave me a useful overview of my applications.
The data comes from an Excel spreadsheet and a Word document. The spreadsheet includes the company name, the job title, the date of application, the month of application, the status of the application, the date of the reply, the sector, the organisation type, and the minimum years of experience. The Word document contains the descriptions of the jobs I applied to, plus some I did not apply to. One KPI I wanted to include in the analysis but had not tracked was the number of acronyms per job description. I was able to extract this variable with a Python script written by ChatGPT, which produced an Excel file of company names, acronyms and job-title keywords extracted from the Word document.
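The Python script itself is not reproduced in this report. Purely as an illustration, here is a minimal sketch in R of how acronyms could be counted in a job description with stringr. The sample text and the matching rule are assumptions, not the logic of the original script.

```r
library(stringr)

# Hypothetical job description text (not taken from the real Word document)
description <- "ESG analyst wanted. Knowledge of CSRD and GRI required; ESG reporting experience a plus."

# Assumed rule: an acronym is a standalone run of two or more capital letters
acronyms <- str_extract_all(description, "\\b[A-Z]{2,}\\b")[[1]]

# Count how often each acronym appears in the text
acronym_counts <- table(acronyms)
acronym_counts
```

This simple rule would miss mixed forms such as “D&I” or “PhD”, so a real extraction would need a broader pattern or a curated list of acronyms.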
The data consists of two tables: job_acronym and job_applications. The first has 5 columns and shows the number of occurrences of specific acronyms per job title. It is not yet fully cleaned, as it still contains some duplicates. The second has 12 columns and holds more information on the job applications and descriptions. It also needs cleaning, because some columns were not read into R with the correct type.
job_acronym <- read_excel("~/Documents/Job data analysis/job_acronym.xlsx") %>%
clean_names() %>%
rename(meaning = `x5`) %>%
glimpse()
# still have to make sure the count is right per position
job_applications <- read_excel("~/Documents/Job data analysis/Job applications March 2024 - September 2025.xlsx") %>%
clean_names() %>%
glimpse()
# still have to clean the column types here
job_acronym_NA <- job_acronym %>%
mutate(company = if_else(company == "Unknown", NA_character_, company))
This new table is identical to the previous one, except that the “Unknown” companies are replaced by NA, the standard value R uses to indicate missing data. The table now contains 3 missing values labelled as NA.
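The number of missing values can be checked directly after the replacement. A minimal sketch on a made-up `company` column (the values are hypothetical), using `dplyr::na_if()` as a shorter equivalent of the `if_else()` call:

```r
library(dplyr)

# Hypothetical toy data standing in for the acronym table
toy <- tibble(company = c("Acme", "Unknown", "Globex", "Unknown"))

# na_if() turns every value equal to "Unknown" into NA
toy_na <- toy %>%
  mutate(company = na_if(company, "Unknown"))

# Number of missing company names after the replacement
n_missing <- sum(is.na(toy_na$company))
n_missing
```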
The first figure I want is the total number of companies I applied to.
There are 77 distinct companies in the job_acronym table.
distinct_company_names <- job_applications1 %>%
mutate(name_lowercase = str_to_lower(name)) %>%
distinct(name_lowercase) %>%
nrow()
There are 112 distinct companies in the job_applications1 table. This means that either some companies I applied to didn’t include the selected acronyms in their job descriptions, or some of my applications were not part of the sample I used to extract the acronyms.
In the below table, I want to know which sectors I have applied to most frequently.
frequent_sectors <- job_applications1 %>%
filter(!is.na(sector)) %>%
group_by(sector) %>%
summarise(n=n()) %>%
ungroup() %>%
arrange(desc(n))
kable(frequent_sectors, caption = "Sectors I applied to") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
scroll_box(height = "400px")

| sector | n |
|---|---|
| Consulting | 24 |
| Environment | 16 |
| Recruitment | 16 |
| Finance | 9 |
| Manufacture | 8 |
| Banking | 7 |
| human rights | 7 |
| Commodities | 6 |
| Government | 6 |
| Biodiversity | 5 |
| Chemicals | 5 |
| Energy | 5 |
| Automation | 4 |
| Aviation | 4 |
| Beverage | 4 |
| Certification | 4 |
| Science | 4 |
| Shipping | 4 |
| Academia | 3 |
| Standard-setter | 3 |
| Insurance | 2 |
| Manufacture/Watchmaking | 2 |
| Sport | 2 |
| Automobile | 1 |
| Culture | 1 |
| Diplomacy | 1 |
| Education | 1 |
| Furniture | 1 |
| Garment | 1 |
| Garnment | 1 |
| IT | 1 |
| Manufacturing | 1 |
| Media | 1 |
| Pharmaceutical | 1 |
| Real estate | 1 |
| Standards-setter | 1 |
| Supply chain | 1 |
| Sustainability | 1 |
| Sustainable Finance | 1 |
| Water | 1 |
| chemicals | 1 |
Several sectors are duplicated because of typos, lower case instead of capital letters, plural instead of singular forms, and so on. To avoid this issue, I have to standardize the sector names.
cleaned_sectors <- job_applications1 %>%
mutate(
sector_lowercase = sector %>%
str_trim() %>%
str_to_lower()
)
# Map known typos and variants onto a single canonical sector name
cleaned_sectors1 <- cleaned_sectors %>%
mutate(
sector_clean = case_when(
str_detect(sector_lowercase, "garnment") ~ "garment",
str_detect(sector_lowercase, "standards-setter") ~ "standard-setter",
# "manufacture" also matches "manufacture/watchmaking"
str_detect(sector_lowercase, "manufacture") ~ "manufacturing",
TRUE ~ sector_lowercase
)
)
## # A tibble: 37 × 2
## sector_clean n
## <chr> <int>
## 1 consulting 24
## 2 environment 16
## 3 recruitment 16
## 4 manufacturing 11
## 5 finance 9
## 6 banking 7
## 7 human rights 7
## 8 chemicals 6
## 9 commodities 6
## 10 government 6
## # ℹ 27 more rows
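The chunk that produced the count above is not shown in the report. A minimal sketch of how such a count could be obtained, applying the same cleaning rules to a small made-up `sector` column rather than the real spreadsheet:

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, not the real applications table
toy <- tibble(sector = c("Garment", "Garnment ", "chemicals", "Chemicals",
                         "Manufacture", "Manufacture/Watchmaking", "Manufacturing"))

sector_counts <- toy %>%
  mutate(
    # Trim whitespace and lower-case before matching
    sector_lowercase = sector %>% str_trim() %>% str_to_lower(),
    sector_clean = case_when(
      str_detect(sector_lowercase, "garnment") ~ "garment",
      str_detect(sector_lowercase, "manufacture") ~ "manufacturing",
      TRUE ~ sector_lowercase
    )
  ) %>%
  # One row per cleaned sector, most frequent first
  count(sector_clean, sort = TRUE)

sector_counts
```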
The graph above shows that I applied most often to the Consulting & Services sector (54 applications) and least often to the Knowledge & Education sector.
In the table below, I examine the mean, minimum, maximum, median and standard deviation of the minimum years of experience required in the job description sample.
year_of_exp <- job_applications1 %>%
filter(!is.na(min_years_of_experience)) %>%
summarise(mean = mean(min_years_of_experience),
st_var = sd(min_years_of_experience),
min = min(min_years_of_experience),
max = max(min_years_of_experience),
median = median(min_years_of_experience),
group_size = n())
year_of_exp
## # A tibble: 1 × 6
## mean st_var min max median group_size
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3.30 1.80 0 10 3 71
The median and the mean are very similar, both close to 3 years. This is in line with my own 3 years of experience. Although I sometimes applied to jobs requiring more experience than I have, at least half of my applications targeted jobs that matched my profile, requiring 3 years of experience or less.
job_applications1 %>%
filter(!is.na(min_years_of_experience)) %>%
ggplot(aes(x = min_years_of_experience)) +
geom_histogram(binwidth = 1, fill = "#3498db", alpha = 0.7, color = "white") +
geom_vline(aes(xintercept = median(min_years_of_experience)),
color = "#e74c3c", linetype = "dashed", linewidth = 0.8) +
annotate("text", x = median(job_applications1$min_years_of_experience, na.rm = TRUE) + 0.3,
y = Inf, vjust = 2, label = "Median", color = "#e74c3c", size = 3.5) +
scale_x_continuous(breaks = 0:10) +
labs(
title = "Distribution of Minimum Years of Experience Required",
subtitle = "Across all job applications",
x = "Minimum Years of Experience",
y = "Count",
caption = paste0("n = ", nrow(filter(job_applications1, !is.na(min_years_of_experience))))
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.title = element_text(size = 11, face = "bold")
)
job_applications1 %>%
filter(!is.na(min_years_of_experience)) %>%
ggplot(aes(x = "", y = min_years_of_experience)) +
geom_violin(fill = "#3498db", alpha = 0.4, linewidth = 0.3, trim = FALSE) +
geom_boxplot(width = 0.2, fill = "#3498db", alpha = 0.7, outlier.alpha = 0.5) +
scale_y_continuous(breaks = 0:10) +
labs(
title = "Distribution of Minimum Years of Experience Required",
subtitle = "Across all job applications",
x = "All applications",
y = "Minimum Years of Experience",
caption = paste0("n = ", nrow(filter(job_applications1, !is.na(min_years_of_experience))))
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.title = element_text(size = 11, face = "bold"),
axis.text = element_text(size = 10)
)
This graph shows that a large share of my applications went to jobs requiring 2 to 3 years of experience, followed by jobs requiring 4 to 6 years.
job_acronym_total <- job_acronym1 %>%
group_by(acronym) %>%
summarise(total = sum(count)) %>%
ungroup() %>%
arrange(desc(total))
job_acronym_total %>%
kable(caption = "Number of occurrences per acronym") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
scroll_box(height = "400px")

| acronym | total |
|---|---|
| ESG | 156 |
| CSRD | 34 |
| EU | 27 |
| GHG | 23 |
| GRI | 23 |
| LCA | 22 |
| UN | 22 |
| ACV | 19 |
| TCFD | 17 |
| ISO | 15 |
| CV | 13 |
| RSE | 13 |
| IT | 11 |
| ES | 10 |
| ESRS | 10 |
| CENMAT | 9 |
| EHS | 9 |
| CNC | 8 |
| CDP | 7 |
| CSR | 6 |
| D&I | 6 |
| EPR | 6 |
| NGO | 6 |
| SASB | 6 |
| HR | 5 |
| MS | 5 |
| PhD | 5 |
| SAP | 5 |
| CEO | 4 |
| CORSIA | 4 |
| HES | 4 |
| IFRS | 4 |
| KPI | 4 |
| SDG | 4 |
| CET | 3 |
| COP | 3 |
| CROSSEU | 3 |
| CSDDD | 3 |
| E&S | 3 |
| IFC | 3 |
| L&D | 3 |
| NNY | 3 |
| PCAF | 3 |
| SAF | 3 |
| TA | 3 |
| AI | 2 |
| ASAP | 2 |
| ArcGIS | 2 |
| BPM | 2 |
| CDI | 2 |
| CDM | 2 |
| CEFR | 2 |
| CFA | 2 |
| CRM | 2 |
| DwP | 2 |
| EFTA | 2 |
| EM | 2 |
| GES | 2 |
| GIS | 2 |
| GSSB | 2 |
| ISS | 2 |
| LEED | 2 |
| NDCs | 2 |
| PEF | 2 |
| PRI | 2 |
| PoS | 2 |
| SBTi | 2 |
| SFRD | 2 |
| SME | 2 |
| SQL | 2 |
| TNFD | 2 |
| UPR | 2 |
| ACCA | 1 |
| ACE | 1 |
| AFOLU | 1 |
| AP | 1 |
| BIA | 1 |
| BU | 1 |
| CA | 1 |
| CADO | 1 |
| CAEP | 1 |
| CDD | 1 |
| COC | 1 |
| COO | 1 |
| CPA | 1 |
| CRREM | 1 |
| CoP | 1 |
| DAW | 1 |
| DDTrO | 1 |
| DGNB | 1 |
| EIA | 1 |
| EMEA | 1 |
| EPF | 1 |
| EPP | 1 |
| ERP | 1 |
| ERPDs | 1 |
| ESAP | 1 |
| ESDD | 1 |
| ESIA | 1 |
| ESMP | 1 |
| ESMS | 1 |
| ESPR | 1 |
| FCF | 1 |
| FMCG | 1 |
| GCF | 1 |
| GIIN | 1 |
| HBE | 1 |
| HRC | 1 |
| I&D | 1 |
| ICAO | 1 |
| INSTRAW | 1 |
| IPCC | 1 |
| ISCC | 1 |
| ISSB | 1 |
| JEDI | 1 |
| LGBTQ | 1 |
| LGBTQI+ | 1 |
| MBA | 1 |
| MRV | 1 |
| NFR | 1 |
| OHCHR | 1 |
| OPIM | 1 |
| OSAGI | 1 |
| PACTA | 1 |
| PAI | 1 |
| PLM | 1 |
| PMO | 1 |
| PWM | 1 |
| RED | 1 |
| REDD+ | 1 |
| RH | 1 |
| RJC | 1 |
| ROI | 1 |
| RTFC | 1 |
| SBTN | 1 |
| SBU | 1 |
| SER | 1 |
| SES | 1 |
| SMEI | 1 |
| SPOC | 1 |
| THG | 1 |
| TORs | 1 |
| UDB | 1 |
| UE | 1 |
| UNFCCC | 1 |
| UNGC | 1 |
| UNIFEM | 1 |
| VIP | 1 |
| VUCA | 1 |
| eDNA | 1 |
This new table has only 2 columns: “acronym” and “total”. It shows the total number of occurrences of each acronym across the whole job_acronym table, regardless of how many times an acronym is repeated within the same job description.
distinct_acronyms_per_job <- job_acronym1 %>%
group_by(job_title, company) %>%
summarise(distinct_acronym = n()) %>%
ungroup()%>%
select(-company) %>%
arrange(desc(distinct_acronym))
distinct_acronyms_per_job %>%
head(10) %>%
kable(caption = "Number of distinct acronyms per job description")

| job_title | distinct_acronym |
|---|---|
| ESG Officer | 19 |
| ESG Analyst | 10 |
| ESG Reporting Officer | 10 |
| Sustainability Reporting Analyst | 10 |
| Consultant | 9 |
| Environmental Sustainability Specialist Operations | 9 |
| Human Rights Officer | 9 |
| Assistant Manager Sustainability Programs | 8 |
| Chargé.e de données durabilité environnementale | 8 |
| Environmental, Social and Governance (ESG) Auditor – Associate/Senior Associate | 8 |
The underlying table has 3 columns: “company”, “job_title” and “distinct_acronym”. I exclude the company name from the display to preserve some confidentiality. The table shows the number of distinct acronyms per job title and company.
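The code above counts rows per job with `n()`, which equals the number of distinct acronyms only if no row is duplicated for a given job. Since the acronym table is said to still contain some duplicates, a sketch on hypothetical toy data shows how `n_distinct()` would differ:

```r
library(dplyr)

# Hypothetical toy data with one duplicated row for job "A"
toy_acronyms <- tibble(
  job_title = c("A", "A", "A", "B"),
  company   = c("X", "X", "X", "Y"),
  acronym   = c("ESG", "ESG", "GRI", "ESG"),
  count     = c(2, 2, 1, 1)
)

per_job <- toy_acronyms %>%
  group_by(job_title, company) %>%
  summarise(
    n_rows           = n(),                  # counts rows, duplicates included
    distinct_acronym = n_distinct(acronym),  # counts each acronym once per job
    .groups = "drop"
  )

per_job
```

For job “A”, the row count is 3 while the number of distinct acronyms is 2, so the two measures only agree once duplicates have been removed.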
total_acronyms_per_job <- job_acronym1 %>%
group_by(job_title, company) %>%
summarise(total_acronyms = sum(count)) %>%
ungroup() %>%
select(-company) %>%
arrange(desc(total_acronyms))
total_acronyms_per_job %>%
head(10) %>%
kable(caption = "Number of acronyms per job description")

| job_title | total_acronyms |
|---|---|
| ESG Officer | 27 |
| ESG Reporting Officer | 22 |
| ESG Analyst | 19 |
| Sustainability & LCA Specialist | 18 |
| ESG Data and Compliance Controller | 17 |
| Environmental Sustainability Specialist Operations | 17 |
| Global Sustainability Manager | 17 |
| Human Rights Officer | 16 |
| Programme Analyst | 16 |
| Sustainability Project Manager | 16 |
The underlying table has 3 columns: “company”, “job_title” and “total_acronyms”; the company column is again dropped before display. The table shows the total number of acronym mentions per job title and company, with repeated mentions of the same acronym included.
job_acronym_total %>%
slice_max(total, n = 15) %>%
ggplot(aes(x = reorder(acronym, total), y = total, fill = total)) +
geom_col(alpha = 0.8) +
geom_text(aes(label = total), hjust = -0.3, size = 3.5) +
coord_flip() +
scale_fill_gradient(low = "#a8d0e6", high = "#3498db") +
labs(
title = "Top 15 Most Frequent Acronyms in Job Descriptions",
subtitle = "Total occurrences across all job postings",
x = "Acronym",
y = "Total Occurrences",
caption = paste0("Total unique acronyms: ", nrow(job_acronym_total))
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.title = element_text(size = 11, face = "bold"),
legend.position = "none"
)
In the plot above, acronyms are counted every time they appear, even when they are mentioned several times in the same job description. The plot therefore cannot tell how many job descriptions mention a given acronym.
The first step is to add a column that counts, for each acronym, the number of distinct jobs that mention it.
number_of_jobs_w_acronym <- job_acronym1 %>%
group_by(acronym) %>%
distinct(company, job_title, count)%>%
mutate(n_jobs_w_acronym = n()) %>%
ungroup()
number_of_jobs_w_acronym %>%
distinct(acronym, n_jobs_w_acronym) %>%
arrange(desc(n_jobs_w_acronym)) %>%
slice_max(order_by = n_jobs_w_acronym, n=15) %>%
ggplot(aes(x=reorder(factor(acronym), n_jobs_w_acronym), y= n_jobs_w_acronym, fill=acronym)) +
geom_col()+
coord_flip()+
theme_minimal()+
labs(title = "Top 15 acronyms mentioned by distinct job descriptions",
x = "Acronyms",
y = "Number of distinct job descriptions")
Comparing this second plot with the first one, the top 3 acronyms, “ESG”, “CSRD” and “EU”, are the same. The number of “ESG” occurrences is no surprise, since the term is closely linked to sustainability positions. “CSRD” and “EU” may indicate the importance of EU regulation for companies and organisations, especially in 2025, when the EU directive on sustainability disclosure was being debated. On November 13, 2025, the European Parliament voted in favour of the “Omnibus” proposal on sustainability reporting, a simplified version of the proposal from 2024. This version removes roughly 90% of companies from the scope of the CSRD: environmental and social reporting requirements would only apply to businesses with an average of 1,750 employees and a net annual turnover of over €450 million.
In the future, these EU disclosure-related acronyms might appear less often, on the assumption that EU regulation on sustainability reporting becomes less of a priority. Yet sustainability reporting is unlikely to fall out of fashion entirely: some companies may want to anticipate future regulation, as other jurisdictions adopt sustainability-related financial disclosure standards such as those of the ISSB.
Other reporting standards and frameworks, such as “GHG”, “GRI” and “TCFD”, are each mentioned in around 15 job descriptions. (I assume that “GHG” was mentioned together with “Protocol”, but some descriptions may mention GHG alone; either way, it shows an interest in carbon accounting, which is part of sustainability reporting.) This observation tends to confirm that, while interest in anticipating EU regulation may fade, there is still strong interest in assessing and reporting on impact through other recognised standards and frameworks.
One notable difference between the two plots concerns “LCA” (Life Cycle Assessment) and its French equivalent “ACV” (Analyse de Cycle de Vie). Both rank high in the first plot, yet they barely make it into the top 15 acronyms by number of distinct job descriptions. This suggests that a few job descriptions repeat these acronyms many times, whereas fewer than 10 job descriptions reference Life Cycle Assessment in a sample of distinct_companies job descriptions.
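The gap between the two rankings can be illustrated on hypothetical toy data, where one acronym is mentioned once in each of three jobs and another is mentioned six times in a single job. The column names mirror those of the acronym table, but the values are made up:

```r
library(dplyr)

# Hypothetical toy data: ESG appears once in three jobs, LCA six times in one job
toy <- tibble(
  job_title = c("Job 1", "Job 2", "Job 3", "Job 1"),
  company   = c("X", "Y", "Z", "X"),
  acronym   = c("ESG", "ESG", "ESG", "LCA"),
  count     = c(1, 1, 1, 6)
)

comparison <- toy %>%
  group_by(acronym) %>%
  summarise(
    total_mentions = sum(count),                    # basis of the first plot
    n_jobs         = n_distinct(job_title, company), # basis of the second plot
    .groups = "drop"
  )

comparison
```

Here LCA leads on total mentions while ESG leads on the number of job descriptions, which is the pattern described above for “LCA” and “ACV”.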
av_n_acronyms <- number_of_jobs_w_acronym %>%
group_by(job_title, company) %>%
mutate(n_acronyms = n()) %>%
ungroup() %>%
summarise(average_number_of_acronyms = mean(n_acronyms),
min = min(n_acronyms),
max = max(n_acronyms))
kable(av_n_acronyms, caption = "Average number of Acronyms per job description with minimum and maximum values")

| average_number_of_acronyms | min | max |
|---|---|---|
| 5.981723 | 1 | 19 |
In this chapter, I want to determine if there is a relationship between the sector and the years of experience required. First, I analyze the descriptive statistics of the years of experience and group them by the sector.
cleaned_sectors_less %>%
filter(!is.na(min_years_of_experience)) %>%
group_by(sector_grouped) %>%
summarise(mean = mean(min_years_of_experience),
st_var = sd(min_years_of_experience),
min = min(min_years_of_experience),
max = max(min_years_of_experience),
median = median(min_years_of_experience),
group_size = n()) %>%
ungroup()
## # A tibble: 7 × 7
## sector_grouped mean st_var min max median group_size
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Consulting & Services 3.75 1.74 2 8 3 20
## 2 Finance & Business 2.88 1.46 1 5 2.5 8
## 3 Industry & Manufacturing 3.17 1.54 1 5 3 18
## 4 Knowledge & Education 2.67 2.07 0 5 2.5 6
## 5 Public & Social Sector 3.33 1.37 2 5 3 6
## 6 Sustainability & Environment 2.79 1.53 0.5 5 2.5 12
## 7 <NA> 10 NA 10 10 10 1
Leaving aside the single application without a sector, the sector means differ by about 1 year at most (from 2.67 to 3.75). This observation matches the observation from chapter 3.3.
cleaned_sectors_less %>%
filter(!is.na(min_years_of_experience), !is.na(sector_grouped)) %>%
ggplot(aes(x = sector_grouped, y = min_years_of_experience, fill = sector_grouped)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.5) +
geom_jitter(width = 0.2, alpha = 0.3, size = 2) +
scale_y_continuous(breaks = 0:10) +
labs(
title = "Years of Experience Required by Sector",
subtitle = "Distribution of minimum years of experience per sector",
x = "Sector",
y = "Minimum Years of Experience",
caption = paste0("n = ", nrow(filter(cleaned_sectors_less, !is.na(min_years_of_experience), !is.na(sector_grouped))))
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_text(size = 11, face = "bold"),
legend.position = "none"
)
The next step is to test whether the observations are normally distributed, using a Shapiro-Wilk test.
##
## Shapiro-Wilk normality test
##
## data: cleaned_sectors_less$min_years_of_experience
## W = 0.90014, p-value = 3.486e-05
The p-value is below 0.05, so the hypothesis of normality is rejected: the data is not normally distributed. A Kruskal-Wallis test is therefore needed to determine whether the years of experience required differ by sector.
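The chunk that ran the normality test is hidden in the report. A minimal sketch of such a call with `shapiro.test()` from base R, on a simulated right-skewed vector that only stands in for the real column (the simulation parameters are assumptions):

```r
set.seed(42)

# Simulated stand-in for min_years_of_experience: right-skewed, whole years
simulated_years <- round(rexp(70, rate = 1 / 3))

# Shapiro-Wilk test: the null hypothesis is that the data is normally distributed
normality_check <- shapiro.test(simulated_years)

normality_check$statistic  # the W statistic
normality_check$p.value    # a value below 0.05 would reject normality
```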
##
## Kruskal-Wallis rank sum test
##
## data: min_years_of_experience by sector_grouped
## Kruskal-Wallis chi-squared = 3.4804, df = 5, p-value = 0.6264
The p-value is 0.6264, which is above 0.05, so I fail to reject the null hypothesis. There is no statistically significant difference in the years of experience required across sectors: all sectors tend to require a similar level of experience in my dataset. Two possible reasons are that the sample is small (a personal job search), which reduces statistical power, and that sustainability jobs may genuinely cluster around similar experience requirements regardless of sector.
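The Kruskal-Wallis chunk is also hidden. A minimal sketch of the call with `kruskal.test()` from base R, on made-up values for three sectors (the column names mirror the report, the numbers are hypothetical):

```r
# Hypothetical toy data: four job descriptions in each of three sectors
toy <- data.frame(
  min_years_of_experience = c(2, 3, 3, 5, 1, 2, 3, 4, 0, 2, 3, 5),
  sector_grouped = rep(c("Consulting & Services",
                         "Finance & Business",
                         "Knowledge & Education"), each = 4)
)

# Rank-based test of whether the groups come from the same distribution
kw <- kruskal.test(min_years_of_experience ~ sector_grouped, data = toy)

kw$parameter  # degrees of freedom: number of groups minus one
kw$p.value    # compared with the 0.05 threshold
```

The degrees of freedom reported in the output above (df = 5) correspond in the same way to the six named sector groups.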
job_applications1 %>%
filter(!is.na(min_years_of_experience)) %>%
group_by(org_type) %>%
summarise(mean = mean(min_years_of_experience),
st_var = sd(min_years_of_experience),
min = min(min_years_of_experience),
max = max(min_years_of_experience),
median = median(min_years_of_experience),
group_size = n()) %>%
ungroup()
## # A tibble: 9 × 7
## org_type mean st_var min max median group_size
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Academia 4 1.41 3 5 4 2
## 2 Accounting profession and auditors 2 0 2 2 2 3
## 3 Company (listed) 3.73 2.37 1 10 3 11
## 4 Company (unlisted) 3.42 1.46 1 5 3 19
## 5 Government 4 1.15 3 5 4 4
## 6 NGO / not for profit 3.44 1.60 0.5 6 3 17
## 7 Trade association 1 0 1 1 1 2
## 8 intergovernmental organization 1.75 0.886 0 3 2 8
## 9 <NA> 4.8 2.59 2 8 4 5
This time, the mean years of experience differs somewhat across organisation types. However, the group size for some organisation types is small (fewer than 5 observations).
job_applications1 %>%
filter(!is.na(min_years_of_experience), !is.na(org_type)) %>%
ggplot(aes(x = org_type, y = min_years_of_experience, fill = org_type)) +
geom_boxplot(alpha = 0.7, outlier.alpha = 0.5) +
geom_jitter(width = 0.2, alpha = 0.3, size = 2) +
scale_y_continuous(breaks = 0:10) +
labs(
title = "Years of Experience Required by Organisation Type",
subtitle = "Distribution of minimum years of experience per organisation type",
x = "Organisation Type",
y = "Minimum Years of Experience",
caption = paste0("n = ", nrow(filter(job_applications1, !is.na(min_years_of_experience), !is.na(org_type))))
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_text(size = 11, face = "bold"),
legend.position = "none"
)
I already know that the years of experience are not normally distributed, so I can directly use the Kruskal-Wallis test to determine whether the organisation type plays a role in the years of experience required.
##
## Kruskal-Wallis rank sum test
##
## data: min_years_of_experience by org_type
## Kruskal-Wallis chi-squared = 18.292, df = 7, p-value = 0.01072
The p-value of the test is 0.01072, which is below the 0.05 threshold. I can therefore reject the null hypothesis: there is a statistically significant difference in the years of experience required across organisation types. This might mean that the organisation type influences the number of years required, or that some organisation types consistently require more or fewer years of experience. The next step is to run a post-hoc pairwise Wilcoxon test with Bonferroni correction to identify which groups differ significantly.
pairwise.wilcox.test(job_applications1$min_years_of_experience,
job_applications1$org_type,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: job_applications1$min_years_of_experience and job_applications1$org_type
##
## Academia Accounting profession and auditors
## Accounting profession and auditors 1.00 -
## Company (listed) 1.00 1.00
## Company (unlisted) 1.00 1.00
## Government 1.00 1.00
## intergovernmental organization 1.00 1.00
## NGO / not for profit 1.00 1.00
## Trade association 1.00 1.00
## Company (listed) Company (unlisted)
## Accounting profession and auditors - -
## Company (listed) - -
## Company (unlisted) 1.00 -
## Government 1.00 1.00
## intergovernmental organization 0.24 0.21
## NGO / not for profit 1.00 1.00
## Trade association 1.00 1.00
## Government intergovernmental organization
## Accounting profession and auditors - -
## Company (listed) - -
## Company (unlisted) - -
## Government - -
## intergovernmental organization 0.28 -
## NGO / not for profit 1.00 0.35
## Trade association 1.00 1.00
## NGO / not for profit
## Accounting profession and auditors -
## Company (listed) -
## Company (unlisted) -
## Government -
## intergovernmental organization -
## NGO / not for profit -
## Trade association 1.00
##
## P value adjustment method: bonferroni
A pairwise Wilcoxon test with Bonferroni correction was conducted. Due to ties in the data, approximate p-values were computed.
The Bonferroni correction is very conservative: it adjusts p-values upward to reduce false positives, which can make it hard to detect differences in a small dataset. I therefore also try a less conservative method, Benjamini-Hochberg, to detect differences across organisation types.
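The difference between the two corrections can be seen on a small set of made-up raw p-values with `p.adjust()` from base R. The four values below are purely illustrative:

```r
# Hypothetical raw p-values from four pairwise comparisons
raw_p <- c(0.01, 0.02, 0.03, 0.04)

# Bonferroni multiplies every p-value by the number of comparisons (capped at 1)
bonferroni <- p.adjust(raw_p, method = "bonferroni")

# Benjamini-Hochberg scales each p-value by (number of tests / rank)
bh <- p.adjust(raw_p, method = "BH")

bonferroni  # 0.04 0.08 0.12 0.16
bh          # 0.04 0.04 0.04 0.04
```

With Bonferroni only the smallest p-value stays below 0.05, while with Benjamini-Hochberg all four do, which is why the latter is better suited to detecting differences in small samples, at the cost of controlling the false discovery rate rather than the family-wise error rate.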
pairwise.wilcox.test(job_applications1$min_years_of_experience,
job_applications1$org_type,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: job_applications1$min_years_of_experience and job_applications1$org_type
##
## Academia Accounting profession and auditors
## Accounting profession and auditors 0.199 -
## Company (listed) 0.786 0.165
## Company (unlisted) 0.799 0.199
## Government 1.000 0.162
## intergovernmental organization 0.162 0.928
## NGO / not for profit 0.798 0.199
## Trade association 0.363 0.199
## Company (listed) Company (unlisted)
## Accounting profession and auditors - -
## Company (listed) - -
## Company (unlisted) 0.958 -
## Government 0.671 0.723
## intergovernmental organization 0.088 0.088
## NGO / not for profit 0.996 0.996
## Trade association 0.162 0.162
## Government intergovernmental organization
## Accounting profession and auditors - -
## Company (listed) - -
## Company (unlisted) - -
## Government - -
## intergovernmental organization 0.088 -
## NGO / not for profit 0.709 0.088
## Trade association 0.199 0.356
## NGO / not for profit
## Accounting profession and auditors -
## Company (listed) -
## Company (unlisted) -
## Government -
## intergovernmental organization -
## NGO / not for profit -
## Trade association 0.162
##
## P value adjustment method: BH
Although the Kruskal-Wallis test indicated a global significant difference across organisation types (p = 0.011), pairwise comparisons with both Bonferroni and Benjamini-Hochberg corrections revealed no individually significant pairs. This likely reflects the limited sample size, which reduces the statistical power needed to detect differences at the pairwise level.
This new table has several columns. Some job descriptions in the applications table didn’t contain any of the listed acronyms; because the tables are merged with full_join, they are kept rather than discarded. There is also more than one row per job description, since I still want to account for the different acronyms in each one. The table still contains duplicates. Finally, the “passed” entries have no sector or further information.
# I have to keep the same column names to join both tables
job_applications2 <- job_applications_clean %>%
rename(company = name) %>%
rename(job_title = position)
job_app_acronyms <- job_applications2 %>%
full_join(number_of_jobs_w_acronym, by = c("job_title", "company"))
The merged table has many rows per job application or description. It also includes positions I did not apply to, which therefore have no application or reply date, month of application, sector, location, organisation type or minimum years of experience. Some companies are unknown because I could not retrieve the company name from the job description when extracting the acronyms. In addition, some positions appear in the job_applications table but are missing from the acronym table because they did not include any of the acronyms I selected. Finally, there may be some mismatches, either because a job title differs slightly between the two tables, or because I did not apply to every job description I saved. The reverse is also possible: I may not have saved every job description I applied to.
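The effect of the join type on which rows survive can be sketched on two small made-up tables. The column names follow the report, the contents are hypothetical:

```r
library(dplyr)

# Hypothetical applications: three jobs, one row each
apps <- tibble(
  job_title = c("A", "B", "C"),
  company   = c("X", "Y", "Z"),
  sector    = c("Consulting", "Finance", "Energy")
)

# Hypothetical acronym rows: two for job A, one for a job never applied to (D)
acr <- tibble(
  job_title = c("A", "A", "D"),
  company   = c("X", "X", "W"),
  acronym   = c("ESG", "GRI", "ESG")
)

# full_join keeps unmatched rows from both tables
full <- full_join(apps, acr, by = c("job_title", "company"))

# left_join keeps every application but drops acronym rows without an application
left <- left_join(apps, acr, by = c("job_title", "company"))

nrow(full)  # 5: A twice, B, C, and D
nrow(left)  # 4: A twice, B, C
```

Job “D” appears in the full join with a missing sector, which mirrors the positions saved but not applied to in the real data.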
I first create a column that computes the number of acronyms per job.
job_app_acronyms1 <- job_app_acronyms %>%
group_by(job_title) %>%
mutate(acronyms_per_job = sum(count)) %>%
ungroup()
Next, I compute the mean, standard deviation, minimum, maximum and median of the number of acronyms per job for each sector.
job_app_acronyms1_mean <- job_app_acronyms1 %>%
filter(!is.na(acronyms_per_job)) %>%
group_by(sector_grouped) %>%
summarise(mean = mean(acronyms_per_job),
st_var = sd(acronyms_per_job),
min = min(acronyms_per_job),
max = max(acronyms_per_job),
median = median(acronyms_per_job),
group_size = n()) %>%
ungroup()
job_app_acronyms1_mean %>%
kable(
caption = "Descriptive statistics of the number of acronyms per job based on the sector") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE
)

| sector_grouped | mean | st_var | min | max | median | group_size |
|---|---|---|---|---|---|---|
| Consulting & Services | 22.337349 | 17.511862 | 1 | 44 | 16 | 83 |
| Finance & Business | 22.415385 | 8.442623 | 4 | 36 | 27 | 65 |
| Industry & Manufacturing | 10.607143 | 7.236151 | 1 | 36 | 10 | 84 |
| Knowledge & Education | 6.466667 | 2.825058 | 3 | 9 | 9 | 15 |
| Public & Social Sector | 10.058824 | 6.628260 | 1 | 16 | 16 | 17 |
| Sustainability & Environment | 12.185185 | 13.399303 | 1 | 36 | 5 | 27 |
| NA | 14.428571 | 13.766702 | 1 | 44 | 10 | 56 |
The table shows variation in the mean number of acronyms per job across sectors. Finance & Business and Consulting & Services have the highest averages (about 22.4 and 22.3 acronyms per job description), suggesting that more technical language is used in those postings. The Knowledge & Education sector has the lowest average (about 6.5), which may reflect less standardised reporting requirements.
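One caveat when reading group_size: the joined table has one row per acronym, so a job description with many acronyms contributes many rows to its sector. A minimal sketch, on hypothetical toy data, of collapsing to one row per job before summarising, so that each job description counts once. This is an alternative for comparison, not the computation used in the table above:

```r
library(dplyr)

# Hypothetical joined data: one row per (job, acronym)
toy_joined <- tibble(
  job_title      = c("A", "A", "A", "B", "C", "C"),
  company        = c("X", "X", "X", "Y", "Z", "Z"),
  sector_grouped = c(rep("Consulting & Services", 4), rep("Finance & Business", 2)),
  count          = c(2, 1, 1, 3, 4, 1)
)

# Step 1: one row per job, with its total number of acronym mentions
per_job <- toy_joined %>%
  group_by(job_title, company, sector_grouped) %>%
  summarise(acronyms_per_job = sum(count), .groups = "drop")

# Step 2: sector statistics where group_size is a number of jobs, not rows
per_sector <- per_job %>%
  group_by(sector_grouped) %>%
  summarise(mean = mean(acronyms_per_job),
            group_size = n(),
            .groups = "drop")

per_sector
```

Grouping by both job_title and company also avoids merging two different postings that happen to share a title.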
I first need to examine whether the variable of the number of acronyms per job is normally distributed.
##
## Shapiro-Wilk normality test
##
## data: job_app_acronyms1$acronyms_per_job
## W = 0.85644, p-value < 2.2e-16
The p-value is below the 0.05 threshold, which means that the data is not normally distributed. A Kruskal-Wallis test is therefore used to investigate the relationship between the number of acronyms per job and the sector.
##
## Kruskal-Wallis rank sum test
##
## data: acronyms_per_job by sector_grouped
## Kruskal-Wallis chi-squared = 59.632, df = 5, p-value = 1.448e-11
The Kruskal-Wallis p-value is well below 0.05, so there is a statistically significant difference in the number of acronyms per job across sectors. This suggests that the technical complexity of job descriptions varies with the sector. A pairwise Wilcoxon test is therefore conducted to identify which sectors differ. As observed in section 4.2, if no individual pair reaches significance, this likely reflects the limited sample size reducing statistical power rather than a true absence of difference.
pairwise.wilcox.test(job_app_acronyms1$acronyms_per_job,
                     job_app_acronyms1$sector_grouped,
                     p.adjust.method = "bonferroni")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: job_app_acronyms1$acronyms_per_job and job_app_acronyms1$sector_grouped
##
## Consulting & Services Finance & Business
## Finance & Business 1.000 -
## Industry & Manufacturing 0.022 3.4e-13
## Knowledge & Education 0.149 2.4e-06
## Public & Social Sector 0.438 1.4e-05
## Sustainability & Environment 0.020 0.006
## Industry & Manufacturing Knowledge & Education
## Finance & Business - -
## Industry & Manufacturing - -
## Knowledge & Education 0.200 -
## Public & Social Sector 1.000 1.000
## Sustainability & Environment 1.000 1.000
## Public & Social Sector
## Finance & Business -
## Industry & Manufacturing -
## Knowledge & Education -
## Public & Social Sector -
## Sustainability & Environment 1.000
##
## P value adjustment method: bonferroni
pairwise.wilcox.test(job_app_acronyms1$acronyms_per_job,
                     job_app_acronyms1$sector_grouped,
                     p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: job_app_acronyms1$acronyms_per_job and job_app_acronyms1$sector_grouped
##
## Consulting & Services Finance & Business
## Finance & Business 0.6859 -
## Industry & Manufacturing 0.0036 3.4e-13
## Knowledge & Education 0.0213 1.2e-06
## Public & Social Sector 0.0487 4.7e-06
## Sustainability & Environment 0.0036 0.0015
## Industry & Manufacturing Knowledge & Education
## Finance & Business - -
## Industry & Manufacturing - -
## Knowledge & Education 0.0251 -
## Public & Social Sector 0.5901 0.1835
## Sustainability & Environment 0.2216 0.8833
## Public & Social Sector
## Finance & Business -
## Industry & Manufacturing -
## Knowledge & Education -
## Public & Social Sector -
## Sustainability & Environment 0.8337
##
## P value adjustment method: BH
bonferroni_result <- pairwise.wilcox.test(job_app_acronyms1$acronyms_per_job,
                                          job_app_acronyms1$sector_grouped,
                                          p.adjust.method = "bonferroni")
bh_result <- pairwise.wilcox.test(job_app_acronyms1$acronyms_per_job,
                                  job_app_acronyms1$sector_grouped,
                                  p.adjust.method = "BH")
library(tibble) # for rownames_to_column
pval_to_long <- function(test_result, method_name) {
  # pairwise.wilcox.test() returns a lower-triangular matrix of adjusted p-values
  mat <- test_result$p.value
  # Make the matrix symmetric manually
  all_sectors <- union(rownames(mat), colnames(mat))
  n <- length(all_sectors)
  full_mat <- matrix(NA_real_, nrow = n, ncol = n,
                     dimnames = list(all_sectors, all_sectors))
  for (r in rownames(mat)) {
    for (c in colnames(mat)) {
      # Skip the empty upper triangle so an NA never overwrites a mirrored value
      if (!is.na(mat[r, c])) {
        full_mat[r, c] <- mat[r, c]
        full_mat[c, r] <- mat[r, c] # mirror
      }
    }
  }
  diag(full_mat) <- NA
  # Convert to long format: one row per ordered pair of sectors
  expand.grid(Sector1 = all_sectors,
              Sector2 = all_sectors,
              stringsAsFactors = FALSE) %>%
    mutate(p_value = map2_dbl(Sector1, Sector2, ~ full_mat[.x, .y]),
           method = method_name,
           significant = ifelse(!is.na(p_value), p_value < 0.05, NA))
}
bonferroni_long <- pval_to_long(bonferroni_result, "Bonferroni")
bh_long <- pval_to_long(bh_result, "BH")
# Combine both
combined <- bind_rows(bonferroni_long, bh_long)
# Plot
combined %>%
  ggplot(aes(x = Sector1, y = Sector2, fill = p_value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  # Label each tile with its adjusted p-value; dark text stays readable on light tiles
  geom_text(aes(label = ifelse(!is.na(p_value),
                               ifelse(p_value < 0.001, "<0.001", round(p_value, 3)),
                               "")),
            size = 2.8, color = "grey15", fontface = "bold") +
  # Anchor the colours at p = 0, 0.05 and 1 so that values below 0.05
  # fall in the red-to-orange band (gradient2 would centre the scale differently)
  scale_fill_gradientn(colours = c("#e74c3c", "#f39c12", "#ecf0f1"),
                       values = c(0, 0.05, 1),
                       na.value = "grey90",
                       name = "p-value",
                       limits = c(0, 1)) +
  facet_wrap(~ method, ncol = 2) +
  labs(
    title = "Pairwise Wilcoxon Test P-values by Correction Method",
    subtitle = "Red to orange = significant (p < 0.05), lighter = not significant",
    x = NULL,
    y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "bottom"
  )
Sample size and representativeness: The dataset covers 170 applications made over 18 months in a specific field (sustainability/ESG). The findings therefore reflect my personal job search experience and cannot be generalised to broader labour market trends.
Self-reported and manually collected data: The data was collected manually, which introduces the risk of entry errors, inconsistencies in sector classification, and incomplete records. Some applications may have been omitted, and not all job descriptions were saved for acronym extraction.
Small group sizes: Several organisation types and sectors have fewer than 10 observations, which reduces the statistical power of inferential tests and makes pairwise comparisons unreliable even when a global test is significant. This is particularly relevant for the pairwise Wilcoxon tests in sections 4.2 and 5.1, where the Bonferroni correction may have been too conservative given the small sample, and the additional pairs detected by the BH correction should be interpreted with caution.
Aggregated monthly reply data: The monthly reply counts in section 4.5 represent one aggregated observation per month, which made group comparison tests such as ANOVA and Kruskal-Wallis inappropriate. A Spearman correlation was used instead to test for a trend over time, which is a more suitable but less powerful approach given the small number of time points (18 months).
Acronym extraction limitations: The acronym extraction relied on a predefined list of keywords, which may have missed relevant terms or misclassified others. The Python script output also contained duplicates that required manual cleaning. Furthermore, the acronym dataset does not cover all applications, meaning the cross-analysis in section 5 is based on a partial overlap between the two tables.
Reply status interpretation: A “reply” includes both positive and negative responses, meaning a high reply rate does not necessarily indicate success. This analysis does not distinguish between rejections, interview invitations, or other outcomes, which limits the practical interpretation of the reply rate findings.
Correction method dependency: The choice of p-value adjustment method meaningfully affects which pairwise comparisons are deemed significant. The Bonferroni correction identified 6 significant pairs while the BH correction identified 9. Conclusions drawn from the pairwise analysis are therefore sensitive to the correction method chosen, and both should be considered together rather than in isolation.
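The counts of 6 and 9 significant pairs can be checked directly against the adjusted p-values printed in section 5.1. The sketch below retypes those 15 values by hand (rounded as printed) instead of rerunning the tests; the object names `adjusted`, `n_bonf`, `n_bh` and `only_bh` are introduced here for illustration.

```r
# Sector names as they appear in the pairwise test output
sectors <- c("Consulting & Services", "Finance & Business",
             "Industry & Manufacturing", "Knowledge & Education",
             "Public & Social Sector", "Sustainability & Environment")

# All 15 unordered pairs, in the column-by-column order of the printed matrix
pair_idx <- combn(length(sectors), 2)
adjusted <- data.frame(
  pair = paste(sectors[pair_idx[1, ]], "vs", sectors[pair_idx[2, ]]),
  # Adjusted p-values copied from the two outputs above (rounded as printed)
  bonferroni = c(1.000, 0.022, 0.149, 0.438, 0.020,
                 3.4e-13, 2.4e-06, 1.4e-05, 0.006,
                 0.200, 1.000, 1.000,
                 1.000, 1.000,
                 1.000),
  bh = c(0.6859, 0.0036, 0.0213, 0.0487, 0.0036,
         3.4e-13, 1.2e-06, 4.7e-06, 0.0015,
         0.0251, 0.5901, 0.2216,
         0.1835, 0.8833,
         0.8337)
)

n_bonf <- sum(adjusted$bonferroni < 0.05)  # 6
n_bh   <- sum(adjusted$bh < 0.05)          # 9

# Pairs that are significant under BH but not under Bonferroni
only_bh <- adjusted$pair[adjusted$bh < 0.05 & adjusted$bonferroni >= 0.05]
only_bh
```

The three pairs in `only_bh` are the ones whose interpretation depends on the correction method: Consulting & Services against Knowledge & Education and against Public & Social Sector, and Industry & Manufacturing against Knowledge & Education.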
A 48.8% overall reply rate is notably higher than the commonly cited industry average of ~25%, which may reflect the niche nature of sustainability roles, the targeted nature of the applications, or the strong presence of structured HR processes in the sectors applied to.
The statistically significant relationships between reply status and both sector (Fisher, p < 0.05) and organisation type (Fisher, p = 0.0005) suggest that these structural characteristics of employers meaningfully influence recruitment responsiveness. Accounting and auditing firms had the highest reply rate (90%), possibly reflecting more formalised recruitment pipelines, while intergovernmental organisations had the lowest (14%), which may be explained by longer and more bureaucratic hiring processes consistent with the reply time analysis in section 4.4.
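For readers unfamiliar with the test, a Fisher exact test on a reply-by-organisation-type table can be sketched as follows. The counts in `toy_table` and the labels "Type A" to "Type C" are invented for illustration and are not the report's data.

```r
# Hypothetical contingency table: rows = reply status, columns = organisation type
toy_table <- matrix(
  c(9, 1,    # Type A: 9 replies, 1 no reply
    2, 12,   # Type B: 2 replies, 12 no reply
    5, 5),   # Type C: 5 replies, 5 no reply
  nrow = 2,
  dimnames = list(reply    = c("yes", "no"),
                  org_type = c("Type A", "Type B", "Type C"))
)

# Exact test of independence between reply status and organisation type
fisher_res <- fisher.test(toy_table)
fisher_res$p.value

# Reply rate per organisation type, for reading the result
reply_rate <- toy_table["yes", ] / colSums(toy_table)
round(reply_rate, 2)
```

For larger tables where the exact computation becomes too expensive, `fisher.test()` also accepts `simulate.p.value = TRUE` to estimate the p-value by Monte Carlo simulation.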
Regarding the temporal dimension of replies, the Spearman correlation found a weak positive trend between time and number of replies (rho = 0.322, p = 0.192), suggesting a slight increase in replies over the 18-month period that does not reach statistical significance. This result should be interpreted cautiously given that monthly reply counts represent single aggregated observations, limiting the power of any temporal analysis. The absence of a significant relationship between reply time and sector or month of application further suggests that timing a job application strategically by month or field may not substantially improve responsiveness, at least within this dataset.
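A Spearman trend test on monthly counts of this kind can be sketched as below. The 18 values in `replies` are invented for illustration, and `exact = FALSE` is an assumption made here so that tied counts do not trigger a warning about exact p-values.

```r
# Hypothetical toy data: one aggregated reply count per month (values are invented)
toy_monthly <- data.frame(
  month_index = 1:18,
  replies     = c(2, 4, 3, 5, 1, 6, 4, 3, 7, 5, 2, 6, 8, 4, 5, 7, 3, 6)
)

# Spearman rank correlation between time and reply count
sp <- cor.test(toy_monthly$month_index, toy_monthly$replies,
               method = "spearman", exact = FALSE)

sp$estimate  # rho: direction and strength of the monotonic trend
sp$p.value   # with only 18 points, even a moderate rho may not reach 0.05
```

Because each month contributes a single observation, the test has only as many data points as there are months, which is the power limitation noted above.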
The acronym analysis in section 5.1 provides perhaps the most insightful finding of the report. The global Kruskal-Wallis test confirmed that the number of acronyms per job description differs significantly across sectors. The subsequent pairwise Wilcoxon tests showed that Finance & Business differs significantly from every other sector except Consulting & Services under both the Bonferroni and BH corrections. With a mean of about 22 acronyms per job description, Finance & Business sits well above Industry & Manufacturing, Knowledge & Education, Public & Social Sector, and Sustainability & Environment, which is in line with the expectation that financial postings rely on dense technical and regulatory vocabulary. Consulting & Services, which does not differ significantly from Finance & Business, also used more acronyms than Industry & Manufacturing, Sustainability & Environment, Knowledge & Education, and Public & Social Sector, the latter two only emerging as significant under the less conservative BH correction. Knowledge & Education had the lowest mean acronym count; it differed significantly from Finance & Business under both corrections, and from Consulting & Services and Industry & Manufacturing under BH only, which aligns with expectations given the more descriptive and less regulatory nature of academic and educational job postings.
The prevalence of ESG, CSRD, and EU-related acronyms throughout the dataset reflects the regulatory environment of 2024–2025, where the EU sustainability disclosure framework was actively debated. The Omnibus simplification package, first proposed by the European Commission in February 2025 and still moving through the EU legislative process in November 2025, would reduce the scope of CSRD dramatically and may shift this acronym landscape significantly in future job postings, with implications for the technical skills demanded by employers.
This analysis of 170 personal job applications submitted between March 2024 and September 2025 provides a detailed snapshot of a sustainability-focused job search. The data reveals that applications were spread across 37 sectors (grouped into 6 broader sectors to increase the number of observations per group), predominantly in Consulting & Services, and that the majority targeted roles requiring 2–3 years of experience, consistent with my profile.
Statistically significant relationships were found between reply status and both sector and organisation type, suggesting that these structural factors play a meaningful role in employer responsiveness. In contrast, no significant relationship was found between reply time and sector or application month, and the temporal analysis of monthly reply counts revealed only a weak, non-significant upward trend over the 18-month period (rho = 0.322, p = 0.192). Taken together, these findings suggest that who the employer is (its sector and organisation type) matters for whether a reply is received, whereas neither the sector nor the month of application is associated with how quickly that reply arrives.
The acronym analysis highlighted meaningful differences in technical language use across sectors. Finance & Business and Consulting & Services stood out as the two acronym-heavy groups: they did not differ significantly from each other, but Finance & Business differed from all four remaining sectors under both corrections, while Consulting & Services differed from two of them under Bonferroni and from all four under BH. These findings suggest that the density of technical acronyms in job descriptions is partly a function of the sector, with implications for how candidates tailor their application materials.
The centrality of sustainability reporting acronyms such as ESG, CSRD, and GRI throughout the dataset reflects the regulatory moment of 2024–2025 in Europe. As the Omnibus proposal reshapes the scope of mandatory sustainability reporting, future analyses may capture a shift in the technical vocabulary of sustainability job descriptions, potentially reducing the dominance of EU regulatory frameworks in favour of broader international standards such as ISSB.
While the findings are limited by the personal and small-scale nature of the dataset, this project demonstrates the value of systematic data collection during a job search and provides a methodological template that could be scaled or replicated. Future work could enrich the analysis by distinguishing between types of replies, incorporating salary data, expanding the acronym list to capture a broader range of technical frameworks, or collecting reply counts at the individual application level rather than as monthly aggregates to enable more robust temporal analysis.
Comment
The above table shows that I applied mostly to jobs in the consulting sector, then in the environment sector, and thirdly in the recruitment sector. There are 37 sectors, which means that the group size for each is very small.
Now that I have the number of applications per sector, I build a plot and exclude the NA category, to keep only existing sector names.
To reduce the number of sectors, I group them into 7 groups.