dc.contributor.author |
CASTRO, Otávio Cury da Costa |
|
dc.date.accessioned |
2025-09-16T15:51:43Z |
|
dc.date.available |
2025-09-16T15:51:43Z |
|
dc.date.issued |
2025-09-16 |
|
dc.identifier.uri |
http://hdl.handle.net/123456789/4059 |
|
dc.description |
Advisor: Guilherme Amaral Avelino
Co-advisor: Prof. Dr. Pedro de Alcantara dos Santos Neto
Internal examiner: Prof. Dr. Vinicius Ponte Machado
Internal examiner: Prof. Dr. Romuere Rodrigues Veloso e Silva
External examiner: Prof. Dr. Lincoln Souza Rocha
External examiner: Prof. Dr. André Cavalcante Hora |
pt_BR |
dc.description.abstract |
Abstract: Identifying developer expertise in source code is valuable in various Software Engineering
contexts. Knowledgeable developers are best suited to perform tasks such as code review
and onboarding. Numerous models have been proposed to estimate source code
knowledge, making it a well-explored topic; however, important gaps remain that affect the
accuracy and applicability of these models. Moreover, the increasing use of Generative
Artificial Intelligence (GenAI) tools may influence how code expertise is acquired and
measured. This study aims to develop more accurate models for identifying source code
experts. We first investigate the correlation between development history variables and
developers’ knowledge of source code files. We extract metrics from public and private
repositories and survey developers about the files they contributed to. Based on these
data, we propose a linear model and train machine learning classifiers, comparing their
performance with existing models. We also apply the proposed models to the Truck Factor
(TF) metric to assess their practical implications in identifying critical developers. To
examine the impact of GenAI, we build a dataset combining code expertise metrics with
information on ChatGPT-generated code integrated into open-source projects. We
simulate different usage scenarios by assigning a portion of contributions to GenAI instead
of developers and survey developers about their perception of GenAI’s effects on code
comprehension. Our results show that First Authorship and Recency of Modification are
the variables most strongly correlated with source code knowledge. The proposed
machine learning models outperform linear baselines, achieving F-scores between 71%
and 73%. When applied to the TF algorithm, they improved developer identification,
reaching a best average F-score of 74%. GenAI usage negatively affected TF reliability,
even at low proportions. Developers reported mixed perceptions, with particular
concerns about its use by novice programmers. |
pt_BR |
dc.language.iso |
other |
pt_BR |
dc.subject |
Software Repository Mining |
pt_BR |
dc.subject |
Code Expertise |
pt_BR |
dc.subject |
Knowledge Concentration |
pt_BR |
dc.subject |
Generative Artificial Intelligence |
pt_BR |
dc.title |
SOURCE CODE EXPERTISE: Improving Knowledge Models and Assessing Generative AI Impact |
pt_BR |
dc.type |
Preprint |
pt_BR |