Endometriosis classification, staging and reporting systems: a review on the road to a universally accepted endometriosis classification †,‡

Background: In the field of endometriosis, several classification, staging and reporting systems have been developed. However, endometriosis classification, staging and reporting systems that have been published and validated for use in clinical practice have not been systematically reviewed up to now.
Objectives: The aim of the current review is to provide a historical overview of these different systems based on an assessment of published studies.
Materials and Methods: A systematic PubMed literature search was performed. Data were extracted and summarised.
Results: Twenty-two endometriosis classification, staging and reporting systems have been published between 1973 and 2021, each developed for specific and different purposes. There is still no international agreement on how to describe the disease. Studies evaluating different systems are summarised, showing a discrepancy between the intended and the evaluated purpose, and a general lack of validation data confirming a correlation with pain symptoms or quality of life for any of the current systems. A few studies confirm the value of the Enzian system for surgical description of deep endometriosis. With regards to infertility, the endometriosis fertility index has been confirmed valid for its intended purpose.
Conclusions: Of the 22 endometriosis classification, staging and reporting systems identified in this historical overview, only a few have been evaluated, in 46 studies, for the purpose for which they were developed. It can be concluded that there is no international agreement on how to describe endometriosis or how to classify it, and that most classification/staging systems show no or very little correlation with patient outcomes.
What is new? This overview of existing systems is a first step in working towards a universally accepted endometriosis classification.


Introduction
Endometriosis is an inflammatory oestrogen-dependent disease associated with chronic pelvic pain and/or infertility that is characterised by lesions of endometrial-like tissue outside of the uterus (Johnson et al., 2017). The disease is usually confined to the abdominal cavity but, rarely, extra-abdominal lesions have been detected in the lungs, brain and even in the eye. Within the pelvic cavity, the variety of presentations is extensive, with lesions detected on the peritoneum, within the ovaries (endometrioma), around the uterus, but also affecting the urinary tract, bowel, and vagina. Most definitions, but not all, consider adenomyosis (similar lesions arising within the myometrium) as a separate disease (Zegers-Hochschild et al., 2017).
Traditionally, three phenotypes of endometriosis lesions are recognised: peritoneal, ovarian (endometrioma) and deep endometriosis (DE) (Working group of ESGE, ESHRE and WES et al., 2020a, 2020b, 2017a, 2017b). Symptoms include chronic pelvic pain (dysmenorrhea, acyclic pelvic pain, dyspareunia, dyschezia, dysuria) with severity ranging from mild to debilitating, infertility, and non-specific symptoms (fatigue), but endometriosis can also be asymptomatic (Zondervan et al., 2020). Treatment options for pain include different medical and hormonal treatments or surgery, while for infertility, surgery and/or ART have been used.
Since the first descriptions of endometriosis, this spectrum of lesions and symptoms has urged clinicians to attempt to classify the disease into informative subgroups or hierarchical stages. By definition, classification entails a systematic arrangement of similar entities on the basis of certain differing characteristics (Miller-Keane and O'Toole, 2005). When disease classification can be related to treatment outcomes or prognosis, the system is considered a staging system.
In the field of endometriosis, several classification, staging and reporting systems have been developed. The current paper provides, based on an assessment of published studies, a historical overview of these different systems. Validation studies and published reports on the implementation of the different classification, staging and reporting systems have been summarised to highlight the uptake, benefits and drawbacks of published systems for endometriosis.

Materials and Methods
A literature review was performed collecting studies and reports focusing on "endometriosis" and "classification, staging, or scoring". PUBMED/MEDLINE was searched, and studies were included from inception (1966) up to 08/05/2020; all retrieved references were checked for relevance. Non-English language studies, animal studies and papers not focusing on endometriosis, including those focusing specifically on adenomyosis, were excluded from the retrieved references. Papers and classification systems focusing on endometriosis but including adenomyosis were not excluded. For the remaining references, the full text papers were collected and assessed. Papers were included if they were original studies focusing on endometriosis and classification, staging or reporting systems. The results of the literature search are summarised in a PRISMA flowchart (Figure 1). The details of the final set of papers are summarised in evidence tables. The draft paper was published for stakeholder review by all societies involved; 81 comments were tabulated in a review report and, where relevant, incorporated in the final version of the paper.

Results
The literature review retrieved 1305 references; one reference was added at a later stage. After applying the exclusion criteria, 154 full papers were assessed, of which 84 papers were excluded for the following reasons: full text papers could not be retrieved (n=9), not written in English (n=4), inappropriate publication types (case report, expert opinion, editorial) (n=28), and relevant patients and/or intervention/outcomes not assessed (not endometriosis or not classification) (n=43). Seventy papers were included for either describing a classification, staging or reporting system in endometriosis (n=24) or evaluating one (n=46) (Fig. 1). The systems in endometriosis described in this paper have been published as classification, staging or reporting systems, even though some were developed for stratification or subgrouping rather than classification. Table I provides an overview of the 22 classification, staging or reporting systems identified in the literature and included in this report. The 46 studies reporting an evaluation of the different systems are listed in Table II.
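The selection counts reported above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the review's methods; the variable names are illustrative, and the numbers are copied from the text.

```python
# Counts copied from the Results text; names are illustrative only.
full_text_assessed = 154  # full papers assessed after applying exclusion criteria

# Reasons for exclusion at the full-text stage.
excluded = {
    "full text could not be retrieved": 9,
    "not written in English": 4,
    "inappropriate publication type": 28,
    "relevant patients/intervention/outcomes not assessed": 43,
}

# Papers retained, split by their role in the review.
included = {
    "describing a system": 24,
    "evaluating a system": 46,
}

total_excluded = sum(excluded.values())
total_included = sum(included.values())

# Every full-text paper is either excluded or included.
assert total_excluded + total_included == full_text_assessed

print(f"Excluded at full text: {total_excluded}")
print(f"Included in the review: {total_included}")
```

The 24 descriptive papers cover 22 distinct systems, so the paper count and the system count are not expected to match.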

Classification and staging systems
In the 1970s, the first "classification" system for endometriosis originated from a study attempting to describe the results of conservative surgical treatment of endometriosis and thereby classify the extent of the disease and its relationship with the pregnancy rate (Acosta et al., 1973). Later, this classification system was further expanded and submitted for consideration to the American Fertility Society (AFS) (Buttram, 1978). Similarly, a system published by Kistner and colleagues was submitted for endorsement by AFS and the International Federation of Fertility Societies (IFFS) (Kistner et al., 1977). In 1979, AFS published a classification system on behalf of a group of experts including the leading authors of the previous systems (American Fertility Society, 1979). The AFS classification for endometriosis, and the later published revised AFS (rAFS) and revised American Society for Reproductive Medicine (rASRM) classifications, have been the main standard for classifying endometriosis ever since (American Fertility Society, 1979, American Fertility Society, 1985, American Society for Reproductive Medicine, 1997). The different versions of the AFS/ASRM classification system reflect the progress made in the knowledge on endometriosis. Later attempts at surgical disease description or staging have focused on disease location, such as urinary tract endometriosis (Knabben et al., 2015), or on subtypes of the disease, such as DE (Chapron et al., 2003a, Coccia and Rizzello, 2011, Tuttlies et al., 2005); the latter group includes the ENZIAN-Score for classifying DE (Tuttlies et al., 2005). The recently updated #ENZIAN classification extends the previous ENZIAN score to incorporate all types of endometriosis (Keckstein et al., 2021).
The EPHect standard recommended (SSF) and minimum required (MSF) surgical forms were developed for recording of surgical phenotypic information on endometriosis (Becker et al., 2014). While these classification systems mainly focused on describing the extent of disease during surgery, some attempted to link these observations to outcomes, such as pregnancy rates, after surgery (American Fertility Society, 1979, American Fertility Society, 1985, American Society for Reproductive Medicine, 1997, Kurata et al., 1993), or indicators for disease management (Chapron et al., 2003a). Another group of classification systems focused on pre-operative assessment of the extent of the disease (Chattot et al., 2019, Ichikawa et al., 2020, Knabben et al., 2015, Lafay Pillet et al., 2014, Menakaya et al., 2016, Riiskjaer et al., 2017, van der Wat et al., 2013), based on either patient-reported symptoms or pre-operative imaging, or a combination of both. The ultrasound-based endometriosis staging system (UBESS) additionally aimed at predicting the complexity of endometriosis surgery (Menakaya et al., 2016), as does the adhesion scoring system in case of pelvic adhesions (Ichikawa et al., 2020).
Two systems aimed specifically at outcome prediction for endometriosis: the 'disease extent, complaints, objectives (ECO)-system', aiming to select the most appropriate management based on reported symptoms (Lasmar et al., 2012, Lasmar et al., 2015); and the endometriosis fertility index (EFI), aiming to predict the probability of natural conception after surgery (Adamson and Pasta, 2010). Finally, a recently published study, "Endogram", sets out to 'profile' endometriosis heterogeneity, based on the assessment of several disease markers in a biopsy sample, with the ultimate aim of guiding therapeutic options (Bouquet de Joliniere et al., 2019).

Replication, validation, and clinical value of published systems
We retrieved 46 studies, mostly observational, reporting an evaluation of the different classification, staging or reporting systems (Table II). The aims and outcomes of the different studies varied significantly.
Of the included studies, eight reported on the practical aspects of the classification systems, being either the feasibility, or the inter-observer and intra-observer variability. Of these, seven studies focused on the rASRM classification system (Candiani et al., 1990, Canis et al., 1992, Hornstein et al., 1993, Lin et al., 1998, Rock, 1995, Schliep et al., 2017, Schliep et al., 2012), while the most recent one evaluated the reproducibility of the EFI (Tomassetti et al., 2020). Early studies (1990s) reported significant variability in rAFS classification by five independent experts reviewing surgery recordings, specifically with regards to endometriosis of the ovary and cul-de-sac obliteration (Hornstein et al., 1993), although another study from the same period reported good to fair agreement in scoring endometriosis between two experts using photographs or recordings (Rock, 1995). In more recent studies, the rASRM classification system was found to have acceptable inter-observer agreement and inter-rater reliability among surgeons and experts reviewing surgical photographs and/or recordings (Schliep et al., 2017, Schliep et al., 2012). Studies have also focused on the feasibility of specific aspects of the AFS/rAFS/rASRM classification, specifically classifying bilateral adnexal disease (Canis et al., 1992), measuring cyst diameter (Candiani et al., 1990), or the reliability of laparoscopic versus laparotomic scoring (Lin et al., 1998). For the EFI, a near perfect clinical agreement rate between two independent experts (1.000, 95% CI 0.956-1.000) and high agreement between two assessments by the same expert (0.988, 95% CI 0.934-1.000) have been reported (Tomassetti et al., 2020).
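To make the notion of inter-observer agreement concrete, the sketch below computes two common statistics for a pair of raters: the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. This is an illustration only; the stage assignments are invented, and the cited studies may have used different agreement measures.

```python
from collections import Counter

def agreement_rate(rater_a, rater_b):
    """Proportion of cases on which the two raters assign the same category."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = agreement_rate(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical stage assignments for ten patients (not data from any study).
rater_1 = ["I", "II", "III", "IV", "II", "III", "I", "IV", "II", "III"]
rater_2 = ["I", "II", "III", "III", "II", "IV", "I", "IV", "II", "III"]

rate = agreement_rate(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"Agreement rate: {rate:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa is lower than the raw agreement rate whenever some agreement would be expected by chance alone, which is why the two are usually reported together.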
The remaining studies (n=37) applied the classification or staging systems to a cohort of patients, evaluating whether the system was reliable with regards to its proposed aim, or evaluating whether the classification could be used for other purposes. The latter was mainly the case for the AFS/rAFS/rASRM classification system, which was developed for surgical staging, but has been evaluated for predicting symptom relief and recurrence after surgery (Milingos et al., 2006, Vercellini et al., 2006), complications after surgery (Nicolaus et al., 2020), ovarian reserve (Posadzka et al., 2014), time to non-ART pregnancy (Yun et al., 2015), pregnancy outcomes (Guzick et al., 1982, Rock et al., 1981), and the outcomes of ART treatment (Barbosa et al., 2014, Pal et al., 1998, Pop-Trajkovic et al., 2014). Furthermore, correlation of the AFS/rAFS/rASRM classification system with symptoms before surgery was evaluated (Marana et al., 1991, Szendei et al., 2005, Vercellini et al., 2007). To our knowledge, there are no studies specifically evaluating the feasibility or reliability of the AFS/rAFS/rASRM classification system for its proposed aim, being a descriptive system of surgical documentation of disease. The EFI, a 10-point scoring system grouped into five categories of risk, has been assessed in 12 studies and one review. It has been mainly assessed for its intended purpose, being prediction of the probability of natural conception after surgery (Boujenah et al., 2015, Boujenah et al., 2017, Garavaglia et al., 2015, Kim et al., 2019, Li et al., 2017, Maheux-Lacroix et al., 2017, Negi et al., 2019, Tomassetti et al., 2013, Wang et al., 2013, Zeng et al., 2014, Zhang et al., 2018, Zhou et al., 2019). Interestingly, in some of these studies an evaluation of the prognostic value of the different factors included in the EFI score was also performed.
A meta-analysis summarised these validation studies and evaluated the performance of the EFI score for predicting non-ART pregnancy after endometriosis surgery, observing good predictive value with a pooled estimate for AUC of 0.71 (95% CI 0.65-0.80) (Vesali et al., 2020). Some authors have (additionally) evaluated whether its purpose can be extended to guide patient management, by using it to select patients that would benefit from ART treatments (Boujenah et al., 2015, Li et al., 2017), and/or predicting the chances of pregnancy from ART treatments (Garavaglia et al., 2015, Wang et al., 2013). The ECO system has been validated for prediction of management (surgery or medical treatment) in a single study, by the same authors that developed the tool (Lasmar et al., 2015).
The UBESS system, developed for pre-operative staging and prediction of the complexity of surgery, was evaluated in three studies reporting on the latter purpose, i.e. difficulty of surgery (Chaabane et al., 2019, Espada et al., 2020) and prediction of surgical skill levels (Tompsett et al., 2019).
Finally, the ENZIAN classification system, developed as a descriptive system for surgical staging of DE, was evaluated for its purpose in two studies (Haas et al., 2011, Morgan-Ortiz et al., 2018). Another evaluation reported on the correlation between the ENZIAN classification and complications after surgery, classified according to the Clavien-Dindo complication grading (Nicolaus et al., 2020). The use of the ENZIAN classification system was further extrapolated for its use in pre-operative assessment with imaging. Two studies evaluated this MRI-based ENZIAN system (Burla et al., 2019, Di Paola et al., 2015), and a third study reported on a model to predict operation time based on the MRI-based ENZIAN classification (Haas et al., 2013a).
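For readers less familiar with the AUC, the sketch below shows how it can be computed for a single cohort: it is the probability that a randomly chosen patient who conceived has a higher score than a randomly chosen patient who did not, with ties counted as one half. The scores and outcomes are invented for illustration, and the function name is hypothetical; the pooling across studies performed in the meta-analysis is not reproduced here.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: numeric score per patient (e.g. a 0-10 index).
    outcomes: 1 if the event occurred (e.g. non-ART pregnancy), else 0.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one event and one non-event")
    # Count pairs where the event case outranks the non-event case;
    # a tie contributes half a concordant pair.
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return concordant / (len(positives) * len(negatives))

# Hypothetical cohort of eight patients (not data from any cited study).
example_scores = [2, 4, 5, 6, 7, 8, 9, 10]
example_outcomes = [0, 0, 1, 0, 1, 0, 1, 1]

example_auc = auc(example_scores, example_outcomes)
print(f"AUC: {example_auc:.4f}")
```

An AUC of 0.5 corresponds to a score with no discriminative ability and 1.0 to perfect discrimination, which places the pooled estimate of 0.71 in context.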
In general, published classification or staging systems have been developed with various intended purposes, ranging from diagnosis (including symptoms) and pre-operative assessment, surgical description or staging, to prediction of surgical difficulty and treatment outcomes (both for pain and infertility). The studies summarised above confirm the surgical value of the ENZIAN system for description and pre-operative assessment of DE, and of UBESS for predicting laparoscopic difficulty. However, most classification/staging systems show no or very little correlation with patient outcomes. The exception is the EFI, which has been consistently shown to provide good predictive value for natural conception after endometriosis surgery. It is notable that the development of the EFI was data driven, whereas the development of most other classification/staging systems was based on expert opinion.

Discussion
The current paper provides an overview of currently available and published classification, staging and/or reporting systems for endometriosis. We include 22 systems published between 1973 and 2021. Each of the systems was developed for a specific and different purpose. The first systems tried to classify the various forms of endometriosis that were encountered (at the time), and this remains the purpose of more recent systems as there still is no international agreement on how to describe the disease. Next, we summarise published studies evaluating the different classification, staging or reporting systems. From this, we show a discrepancy between the intended and the evaluated purpose, and a general lack of validation data confirming correlation with pain symptoms or quality of life for any of the current endometriosis classification systems. With regards to infertility, the EFI has been confirmed valid for its intended purpose of predicting the probability of natural conception after surgery.

Table II (excerpt), Facts Views Vis Obgyn
(Reference missing; histologic confirmation) The EFI was highly associated with live births (P < 0.001): for an EFI of 0-2, the estimated cumulative non-ART LBR at 5 years was 0% and steadily increased up to 91% with an EFI of 9-10, while the proportion of women who attempted ART and had a live birth steadily increased from 38 to 71% among the same EFI strata (P = 0.1). A low least function score was the most significant predictor of failure, followed by having had a previous resection or incomplete resection, being older than 40 compared to <35 years, and having leiomyomas.
(Reference missing) A correlation between endometriosis stage and severity of symptoms was observed only for dysmenorrhea (chi2 = 5.14, P = 0.02) and non-menstrual pain (chi2 = 5.63, P = 0.018). However, the point estimates of ORs were very close to unity (respectively, 1.33, 95% CI 1.04-1.71, and 1.01, 95% CI 1.00-1.03). The association between endometriosis stage and severity of pelvic symptoms was marginal and inconsistent.
(Pal et al., 1998; single centre, surgical confirmation) Response to COH and the number, maturity, and quality of the oocytes were comparable between stages. Fertilization rates for oocytes of patients with stages III/IV were significantly impaired compared to those in stage I/II (P = 0.004). The implantation rate, CPR, and miscarriage rate were comparable between stages I/II and stages III/IV.
(Hornstein et al., 1993; intraobserver and interobserver variability, 5 experts; single centre) The grand total score varied with an SD of 13.44 when the videotape of a single patient was rated twice by the same observer and varied with an SD of 17.12 when rated by two observers. The greatest variability occurred in endometriosis of the ovary and cul-de-sac obliteration, with less variability for peritoneum endometriosis and for ovarian and tubal adhesions. Comparison of intraobserver and interobserver scores resulted in a change in endometriosis stage in 38% and 52% of patients, respectively.
(Rock, 1995; single centre) Good to fair agreement in scoring endometriosis between the investigator and the blinded reviewer was noted.
The symbols should be interpreted as follows: + indicates a significant positive result in a correlation (or similar) test, - indicates a significant negative result in a correlation (or similar) test, ns indicates a non-conclusive/non-significant result in a correlation (or similar) test. The highlighted columns represent the intended purpose of the classification/staging system (as in Table I).

Classification and staging systems are widely used in medicine and have been shown to be valuable in guiding clinical management. Examples include the American Joint Committee on Cancer (AJCC) tumor-node-metastasis (TNM) staging systems for cancer, the Gleason score for prostate cancer, the Braak Staging for Parkinson's disease, and the ACR/EULAR Classification Criteria for Rheumatoid Arthritis. The ACR/EULAR Classification Criteria for Rheumatoid Arthritis were developed based on data analysis of 3115 patients followed by a consensus process in which determinants for risk of rheumatoid arthritis were selected and grouped into a classification system, which was further refined, and the feasibility was optimised (Aletaha et al., 2010). A review published 2 years afterwards identified 17 articles (total 6816 patients) and 17 meeting abstracts (total 4004 patients) investigating the classification criteria. Only a minority of the articles aimed to validate the system in the intended population, while the other studies extended the target population, used different reference standards or adapted the criteria in the system (Radner et al., 2014). The review findings are similar to the findings of the current review, although in a different field of medicine. The TNM staging system for cancer was developed in the early 1950s, aiming to guide clinical classification of cancer cases by anatomical extent. The philosophy and technique of TNM staging were developed by Professor Denoix and later adopted by international societies (Denoix, 1952, Sellers, 1971). The system is currently at its eighth edition (Edge et al., 2010). The system is revised in a 6- or 8-year cycle and changes are implemented based on high-level evidence collected through large datasets.
Specifications are available for different types of cancer, and the system has been complemented with a summary staging or classification linked to prognosis and used for treatment planning. In the TNM system for lung cancer, as an example, TNM staging adaptations included the removal of rare findings from the system, and corrections in stage grouping based on survival outcomes (Lim et al., 2018). In addition, the TNM system has been increasingly complemented by molecular marker data that more accurately stratify risk in patients and guide appropriate treatment options. The longevity and update systems applied for the TNM staging, and the value of additional molecular subtype identification, are likely to be important guides for the design of future endometriosis classification and staging systems that correlate with relevant patient outcomes.
Specifically, for endometriosis, previous reviews have summarised and commented on existing classification systems, mainly rASRM, ENZIAN and EFI. It has previously been concluded that the rASRM system has poor correlation with pain, fertility outcomes or prognosis, and that the ENZIAN system has poor correlation with symptoms and infertility (Andres et al., 2018, Haas et al., 2013b, Johnson et al., 2017). The EFI system needs further evaluation with regards to the importance of the different parameters and whether to include the completeness of surgical treatment (Maheux-Lacroix et al., 2017). The conclusion of previous reviews of classification systems and our overview is consistently phrased as a need for a generally accepted classification with a clear goal/purpose (Adamson, 2011, Andres et al., 2018, Haas et al., 2013b, Johnson et al., 2017, Rolla, 2019). Yet, as presented in this paper, the goal and purpose of published classification, staging or reporting systems for endometriosis is often ignored when evaluating classification or staging systems, limiting the value of the evaluation studies and of the systems in general. To our knowledge, this is the first report comparing the outcomes assessed in the studies with the intended purposes of the classification systems. Indeed, we show that the rASRM system has been widely evaluated, often with negative conclusions, but we found no studies evaluating the system for its intended goal, which is descriptive surgical staging. ENZIAN and EFI have been evaluated for their intended purpose, but studies have also evaluated whether they can be applied more widely and for other outcomes. Apart from these three systems, only two other classification systems (UBESS and ECO) have been evaluated for their intended purpose, with no evaluations of the remaining 17 classification systems, preventing them from further dissemination and uptake.
The current review provides an overview of published classification systems and studies evaluating them, but no detailed assessment of all positive and negative aspects of the classification systems, so as not to repeat previous reviews (Johnson et al., 2017). In addition, we have restricted our overview to classification systems published in peer-reviewed papers and available through PUBMED/MEDLINE. Although locally used and/or unpublished systems are available and can be valuable, the relevance of including them in the current review was considered low, as they would not be widely applied, nor evaluated by (independent) researchers. For universal use of a classification system, it is pivotal that the system is accessible, validated, reliable and reproducible.
Our report includes a summary of evaluation studies assessing these aspects in the different classification systems. Even though we retrieved 46 studies, the value of these evaluations is limited. Apart from the EFI score, the current classification systems have not been thoroughly assessed for validity, feasibility and reproducibility. Moreover, a significant proportion of the evaluation studies have examined the classification systems for purposes other than the one for which they were designed and initially evaluated.
Endometriosis is a challenging disease to classify, as it is known to have different phenotypes and presentations (both with regards to the type of lesions and their location), and various symptoms without a clear link to phenotype or presentation. Moreover, the natural progression of the disease is unknown. There is a perceived need for a validated classification or descriptive system for endometriosis that could support further progress in defining subgroups and, more importantly, guiding the therapeutic options for women with pain and/or infertility. Such a system would certainly also advance endometriosis research by unifying patient subgroups and facilitating the development of prognostic and predictive tools.
From this overview it can be concluded that several classification, staging and reporting systems have been developed for endometriosis. A universally accepted categorisation of the disease using the experience from the already existing proposals seems to be needed for clinical and research purposes.
We found that of the 22 classification systems, few have been evaluated for the purpose for which they were developed. From this review, it can be concluded that there is no international agreement on how to describe endometriosis or how to classify it.

Data availability statement: All data are incorporated into the article.