The Signal—and the Noise—in the Field of Gender Medicine

A recent systematic review of puberty blockers exemplifies problems in gender medicine research

Studies in the field of gender medicine are notoriously unreliable, plagued by small samples, lack of controls, confounding, and bias. This is true even for the “best” studies in the field, such as the “Dutch study,” the foundation of treating gender dysphoric youth with hormones and surgery. While the Dutch protocol showed some positive results in the Netherlands, those results could not be replicated in the world’s biggest pediatric gender clinic, the UK’s GIDS. Other studies, many making headlines, suffer from even more serious biases, limitations, and downright erroneous data analyses. Gender medicine does not have a monopoly on bad science, but if poor research were an Olympic event, the field would arguably be a favorite to win the gold.

Because individual studies can be unreliable, clinicians prefer to base treatment recommendations on systematic reviews of evidence. Systematic reviews scrutinize all the evidence about a topic using rigorous and reproducible methods. While systematic reviews cannot correct for deficiencies in individual studies, they can help separate the “signal” from the “noise.” This, in turn, helps clinicians and their patients make better-informed treatment decisions.

In 2021, the UK’s National Institute for Health and Care Excellence (NICE) published a systematic review of the evidence for using puberty blockers (GnRH analogues) to treat gender dysphoria. The review failed to find convincing evidence that puberty blockers are helpful (it reached a similar conclusion for cross-sex hormones for youth). The reviewers noted:

"The results of the studies that reported impact on the critical outcomes of gender dysphoria and mental health (depression, anger and anxiety), and the important outcomes of body image and psychosocial impact (global and psychosocial functioning), in children and adolescents with gender dysphoria are of very low certainty using modified GRADE. They suggest little change with GnRH analogues from baseline to follow-up. Studies that found differences in outcomes could represent changes that are either of questionable clinical value, or the studies themselves are not reliable and changes could be due to confounding, bias or chance."

This conclusion makes it all the more surprising that another recent systematic review on the same topic, by Rew et al. from the University of Texas at Austin, called puberty blockers “potentially life-saving” and concluded, “the evidence to date supports the finding of few serious adverse outcomes and several potential positive outcomes.”

How could two systematic reviews, conducted during the same period and tackling the same topic, have come to such different conclusions? A team of SEGM-affiliated researchers explored this question in a recent publication entitled “The Signal and the Noise – questioning the benefits of puberty blockers for youth with gender dysphoria—a commentary on Rew et al. (2021),” published in the same peer-reviewed journal that had published the systematic review by Rew et al.

What went wrong in the Rew et al. review:

The Commentary by Clayton et al. identified a number of problems in Rew et al.’s systematic review, which led Rew et al. to their problematic conclusion. While we encourage readers to peruse the Commentary in full, here is a summary of the issues:

Failure to identify relevant studies. A high-quality systematic review should conduct a detailed and exhaustive literature search. Rew et al.’s search strategy yielded only 151 potentially eligible studies, while the NICE review found 525 studies. As a result, several key studies were omitted from the analysis, including one study that showed that an interim improvement in functioning following puberty blockers at 12 months was erased by the 18-month mark, the end of the study period (remarkably, but not surprisingly, the study's own abstract omits this vital fact, instead focusing on the temporary 12-month uptick). Rew et al. also omitted at least two other key studies that identified significant risks of puberty blockers to bone health.
A general failure to adequately assess the quality of the included studies, such as an oft-quoted study on suicidality. Assessment of the methodological quality of studies is the key task of a systematic review. Rew et al. attempted, but failed, to appropriately assess the included studies’ quality. This is exemplified by their improper analysis of the Turban et al. 2020 study. The authors missed many problems, including a biased sample composition, an unreliable measure of exposure to puberty blockers, and confounding (the problems in that particular study have already been highlighted by others). The systematic review authors failed to note that suicidality was not improved in 5 of 6 measures and misinterpreted the study’s own conclusions regarding which suicidality measure was presumed to be positively impacted. Rew et al. also ignored the likelihood of reverse causation: rather than puberty blocking leading to less suicidality over a lifetime, it may be that those with better mental health and lower suicidal tendencies were viewed as better candidates for early transition by their clinicians (since responsible clinicians consider stable mental health a prerequisite for medical transition of minors).
An overreach into making treatment recommendations without following proper steps. Typically, systematic reviews are limited to assessments of the certainty of the evidence and stop short of making treatment recommendations. The latter are the prerogative of treatment guideline developers. However, should systematic review researchers wade into the recommendation territory, they need to follow proper steps, such as articulating the key values and preferences used to make the recommendation (for example, how the benefits of medicalization to physical appearance are weighed against the resultant health risks) and assessing resources, costs, and ethics. None of these steps was reported by Rew et al., who endorsed the use of puberty blockers to practitioners while also calling for additional research. That call is welcome but, unfortunately, comes off as a token gesture, given the rest of the review’s pro-puberty blocker tone and tenor.

Clayton et al. also reflect more generally on the state of literature in the field of gender dysphoria. They trace how a single flawed study insinuates, but stops short of claiming, that puberty blockers lead to suicide prevention. They describe how it is then referenced by other studies that claim proven benefits with increasingly blasé disregard for the methodological limitations, and how it eventually makes its way into a flawed systematic review, which further reinforces the erroneous conclusion. Finally, they demonstrate how this mistaken notion is then promoted by an editorial in a prestigious journal, which throws its own reputational weight behind the unproven claims.

Clayton et al. refer to this as the “game of telephone,” which is endemic in gender medicine. Each step introduces even more errors and misinterpretations, rendering each subsequent study less and less accurate, yet more and more certain of the purported benefits. Clayton et al. aptly ask: when the evidence used to recommend treatment comes from such a convoluted game of telephone, can patients really be considered to be giving informed consent?

Closing Thoughts

Systematic reviews belong at the top of the evidence pyramid, but only when they are properly conducted. When data are inappropriately analyzed, systematic reviews can be misleading, unhelpful, or even harmful. Unfortunately, as a leading Stanford researcher concluded, most systematic reviews are “misleading or conflicted.” While this problem plagues the entire field of medicine, from plastic surgery to cardiology, it is endemic in the field of gender medicine.

The review by Rew et al. is one of several recent examples of poor-quality systematic reviews. Problematic systematic reviews in gender medicine abound, ranging from an error-ridden analysis of surgery regret data to a woefully inadequate analysis of the effects of hormonal interventions, which failed to differentiate between two vastly different interventions, puberty blockers and cross-sex hormones, among numerous other problems. Alarmingly, the latter was the basis for WPATH’s Standards of Care 8 draft recommendations, which lowered the age of eligibility for cross-sex hormones to 14.

At the same time, it is interesting to note that systematic reviews of evidence conducted by public health authorities in the US, UK, Sweden, and Finland have all concluded that the evidence for gender transition with hormones and surgeries is highly uncertain and the risk/benefit ratio is unclear. The field must engage in rigorous self-examination to explain this chasm.

In the meantime, clinicians and patients would be well-served by staying alert to the fact that not just individual studies, but even systematic reviews can be the source of the noise drowning out the signal—the signal that has been registered by the European countries taking a much more cautious stance on pediatric transitions. This signal as yet remains largely muffled in the US.