Graduation Semester and Year
Summer 2025
Language
English
Document Type
Dissertation
Degree Name
Doctor of Philosophy in Computer Science
Department
Computer Science and Engineering
First Advisor
Chengkai Li
Abstract
Knowledge graph embedding models for completion tasks, especially link prediction, have been extensively evaluated using established benchmark datasets. These benchmarks suffer from significant limitations that fundamentally compromise evaluation reliability: they are often too small for comprehensive evaluation, they include redundant triples that artificially inflate performance metrics, and they contain administrative information that is not subject-matter knowledge. Furthermore, these datasets often fail to capture the multiary relationships crucial for representing real-world information. These limitations have led to unreliable model comparisons, overly optimistic performance estimates, and a disconnect between benchmark results and real-world knowledge graph completion capabilities, necessitating a fundamental reevaluation of evaluation practices in the field.
This dissertation presents a comprehensive analysis of the data modeling idiosyncrasies of a large-scale knowledge graph, Freebase, specifically examining its strong type system, its reverse property representations, and its use of mediator objects (Compound Value Type, or CVT, nodes), and their implications for knowledge graph embedding (KGE) models. Through systematic large-scale experiments across multiple embedding models, including TransE, ComplEx, RotatE, and others, our results indicate that the presence of reverse triples in the data unrealistically inflates the performance of knowledge graph embedding models, while CVT nodes, used for representing multiary relationships, make the link prediction task more difficult. As part of this work, we release the first full-scale, systematically prepared Freebase datasets to support rigorous evaluation. To address the need for larger-scale resources, we apply similar analysis and processing to Wikidata. As a live and growing knowledge graph, Wikidata is a valuable resource for the research community, and it contains almost ten times as many subject-matter triples as Freebase.
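To make the reverse-triple issue concrete, the following minimal Python sketch shows one way reverse relation pairs can be flagged in a triple set. The function name and the 0.8 overlap threshold are illustrative assumptions, not the dissertation's actual preparation pipeline.

from collections import defaultdict

def find_reverse_pairs(triples, threshold=0.8):
    """Flag relation pairs (r1, r2) where most (h, r1, t) facts have a
    matching (t, r2, h) fact, suggesting r2 is a reverse of r1.
    `triples` is an iterable of (head, relation, tail) tuples; the
    threshold is an assumed cutoff, not a value from the dissertation."""
    pairs_by_rel = defaultdict(set)  # relation -> set of (head, tail) pairs
    for h, r, t in triples:
        pairs_by_rel[r].add((h, t))
    reverse_pairs = []
    for r1, forward in pairs_by_rel.items():
        for r2, candidate in pairs_by_rel.items():
            inverted = {(t, h) for h, t in candidate}
            overlap = len(forward & inverted) / len(forward)
            if overlap >= threshold:
                # r1 == r2 here indicates a symmetric relation, which can
                # also leak information between training and test splits.
                reverse_pairs.append((r1, r2, overlap))
    return reverse_pairs

A model trained with such pairs left in place can answer many test triples by memorizing the reverse fact seen in training, which is one mechanism behind the inflated scores described above.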
The dissertation further conducts a critical large-scale examination of existing evaluation methodologies for knowledge graph completion, identifying fundamental limitations in prevailing evaluation metrics and protocols, particularly in realistic prediction scenarios. Current evaluation metrics paradoxically penalize models for correctly predicting triples that are merely missing from the evaluation dataset. Standard ranking-based metrics such as Mean Rank operate under the closed-world assumption, treating every prediction outside the knowledge graph as incorrect. This contradicts the open-world assumption, under which absent facts are not necessarily false. Furthermore, each of these metrics aggregates performance across all relations and triples into a single value, obscuring models' strengths and weaknesses on individual relations. We explore the shortcomings of existing evaluation metrics with supporting large-scale experimental evidence, illustrate how these shortcomings distort models' measured performance, and propose refinements to improve evaluation reliability.
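For concreteness, the sketch below shows how rank-based metrics are typically computed in the standard "filtered" setting. Note that any candidate absent from the known triples is still counted as false, which is precisely the closed-world behavior discussed above; the function names are illustrative, not the dissertation's code.

def rank_of_tail(score_fn, h, r, t, all_entities, known_triples=None):
    """Rank the true tail t among all candidates for the query (h, r, ?).
    Higher score means more plausible. In the filtered setting, other
    known true tails are skipped, but every unknown triple is still
    treated as false (closed-world assumption)."""
    scores = {e: score_fn(h, r, e) for e in all_entities}
    target = scores[t]
    rank = 1
    for e, s in scores.items():
        if e == t:
            continue
        if known_triples and (h, r, e) in known_triples:
            continue  # filtered setting: ignore other known positives
        if s > target:
            rank += 1  # an unknown-but-possibly-true triple penalizes t
    return rank

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

The single aggregated value returned by mean_rank or mean_reciprocal_rank is exactly the kind of summary that hides per-relation behavior, motivating the relation-level analysis proposed in this dissertation.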
The standard evaluation protocol for knowledge graph embedding models, link prediction, evaluates a model by predicting the missing h in a triple (?,r,t) or the missing t in (h,r,?). It assesses whether the model prioritizes correct answers over incorrect ones. However, this approach can be misleading because it does not verify whether correct triples are ranked above all types of false and nonsensical ones. For example, for a test triple (h,r,t), link prediction never ranks the likely nonsensical triple (t,r,h). To assess how models rank all possible triples for a given relation, even nonsensical ones, we employ the entity-pair ranking evaluation protocol. Another limitation of link prediction is its assumption that an entity's properties are already known, focusing only on retrieving the correct subject or object entities. In practice, determining whether a property applies to an entity at all is itself a challenge. To address this gap, we introduce a new evaluation protocol, property prediction. A third alternative to link prediction, triple classification, assesses whether a given triple is a true or false fact. Our large-scale experiments across these alternative evaluation paradigms reveal that KGE models show substantially different performance patterns and relative rankings depending on the evaluation task. While models may perform well on link prediction, they consistently perform poorly on the other tasks. These findings demonstrate that relying solely on link prediction provides an incomplete and potentially misleading assessment of model capabilities.
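The contrast between link prediction and entity-pair ranking can be sketched as follows, using a TransE-style scorer; the precision@k readout and the function names are assumptions made for illustration, not the dissertation's exact experimental setup.

import numpy as np

def transe_score(emb_e, emb_r, h, r, t):
    """TransE plausibility: negative L2 distance ||h + r - t||,
    where emb_e and emb_r map entity/relation ids to vectors."""
    return -np.linalg.norm(emb_e[h] + emb_r[r] - emb_e[t])

def entity_pair_ranking(emb_e, emb_r, r, entities, true_pairs, k):
    """Rank ALL candidate (h, t) pairs for relation r -- including
    nonsensical ones such as the flipped (t, r, h) -- then report the
    fraction of the top-k pairs that are known true facts."""
    scored = [((h, t), transe_score(emb_e, emb_r, h, r, t))
              for h in entities for t in entities if h != t]
    scored.sort(key=lambda x: x[1], reverse=True)
    top_k = [pair for pair, _ in scored[:k]]
    return sum(pair in true_pairs for pair in top_k) / k

Unlike link prediction, which fixes one side of each test triple, this protocol forces the model to separate plausible pairs from the full quadratic space of candidates, which is where the performance gaps reported in this dissertation emerge.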
Additionally, this dissertation investigates the application of large language models to generate natural language explanations of logical rules extracted from knowledge graphs, thereby enhancing the rules' interpretability and facilitating human understanding. We present Rule2Text, a complete framework addressing this challenge with several key contributions: first, extensive experiments across diverse domains from general knowledge to specialized biomedical datasets; second, development and validation of an LLM-as-a-judge evaluation framework enabling scalable quality assessment; third, creation of high-quality ground truth datasets for both general and domain-specific contexts; fourth, successful fine-tuning of open-source models using our generated datasets; and fifth, integration of type inference capabilities for knowledge graphs lacking explicit entity type information. To our knowledge, this is the first comprehensive study examining the effectiveness of LLMs for generating natural language explanations of knowledge graph rules. Our LLM-as-a-judge framework shows strong agreement with human annotators, enabling scalable evaluation. Most notably, fine-tuning open-source models on our generated datasets produces significant improvements in content coverage and semantic similarity.
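As a simplified illustration of the LLM-as-a-judge idea (not the exact Rule2Text prompt, rubric, or model interface), a judge reduces to a prompt-and-parse loop around any text-generation callable:

def judge_explanation(llm, rule, explanation):
    """Score a generated explanation with an LLM judge on a 1-5 scale.
    `llm` is any prompt -> text callable (hypothetical interface); the
    rubric wording below is illustrative, not the Rule2Text prompt."""
    prompt = (
        "You are grading a natural language explanation of a logical rule "
        "mined from a knowledge graph.\n"
        f"Rule: {rule}\n"
        f"Explanation: {explanation}\n"
        "Rate correctness and completeness, each from 1 to 5, replying "
        "exactly as: correctness=X, completeness=Y"
    )
    reply = llm(prompt)
    scores = dict(part.strip().split("=") for part in reply.split(","))
    return {key.strip(): int(value) for key, value in scores.items()}

Validating such automated scores against human annotators, as done in this work, is what makes the judge usable for scalable evaluation.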
Collectively, these contributions provide scalable resources, refined evaluation methodologies, and novel interpretability frameworks to strengthen knowledge graph completion and reasoning with broad implications for multiple research areas and real-world applications. The large-scale datasets and comprehensive evaluation protocols established in this work enable more reliable model development and comparison, directly benefiting fields such as semantic web research, natural language processing, and artificial intelligence. The interpretability frameworks developed through Rule2Text have immediate applications in critical domains, including healthcare decision support systems, scientific discovery platforms, and automated reasoning systems, where understanding the rationale behind predictions is essential for user trust and regulatory compliance.
Furthermore, our findings reveal important insights for future research directions: the need for developing more sophisticated KGE models that can perform consistently across diverse evaluation paradigms, the importance of creating comprehensive evaluation frameworks that go beyond the link prediction task to capture real-world complexity, and the potential for advancing rule-based explanation systems to handle more intricate logical structures including negation and disjunction. These insights establish a foundation for next-generation knowledge graph technologies that can bridge the gap between symbolic reasoning and neural approaches while maintaining both accuracy and interpretability in large-scale, real-world deployments.
Keywords
Knowledge graph completion, knowledge graph embedding, link prediction, benchmark dataset, large-scale evaluation, evaluation metrics, evaluation protocols, logical rules, natural language explanation, large language model
License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Shirvani Mahdavi, Nasim, "On Large-Scale Knowledge Graph Completion: Dataset Creation, Embedding Model Evaluation, and Prediction Rule Explanation in Natural Language" (2025). Computer Science and Engineering Dissertations. 419.
https://mavmatrix.uta.edu/cse_dissertations/419