Duplicate structures haunt crystallography databases
Key Insights
- Duplicates of crystal structures are flooding databases, implicating repositories hosting organic, inorganic, and computer-generated crystals.
- The issue raises questions about curation practices at databases and claims around novelty.
- While some database administrators are mum about duplications, others acknowledge the problem but say it’s usually not due to data integrity issues.
In 2023, researchers at Google released what they claimed was 800 years’ worth of knowledge in a single study. The study, published in the prominent journal Nature, reported the discovery of 2.2 million new crystal structures using an artificial intelligence tool.
Two years on, the study—which has already attracted around 1,000 citations—faces calls for retraction. Critics say many crystals that the study authors claimed were novel were in fact duplicates.
The researchers, affiliated at the time with Google DeepMind, found the structures using an AI tool called Graph Networks for Materials Exploration (GNoME). Importantly, they claimed that more than 380,000 of the crystals were stable inorganic materials. To be useful for modern technologies such as computer chips and batteries, crystals must be stable so that they don’t decompose.
Traditionally, scientists hunt for new crystal structures by tweaking already known ones or by combining different elements on a trial-and-error basis. While materials science researchers have tried to use AI to speed up this process and reduce costs, predicting experimentally viable materials with precision and accuracy has been a challenge.
The Google researchers trained GNoME on the Materials Project database, which contains more than 200,000 materials, and then used the tool to predict possible novel crystal structures and their stability with what they claimed was unprecedented scale and accuracy. Specifically, the Google DeepMind team said they had discovered around 52,000 layered compounds similar to graphene that could prove useful in developing semiconductors, and 528 lithium-ion conductors that could help improve the performance of rechargeable batteries.
But more than 10% of the stable crystal structures generated by GNoME may be near duplicates of existing crystals and so may offer no real advance over what is already known. That’s according to recent research by mathematician and computer scientist Vitaliy Kurlin and his colleagues at the University of Liverpool, who in the last few years have created a new method of detecting duplicates.
Near duplicates may be crystals with similar compositions but with an atom or two replaced, Kurlin says. They are common in the GNoME collection because the AI tool seems to replace atoms in existing structures rather than come up with new ones, he says, which undermines the Google team’s claim of novelty.
Kurlin and colleagues’ method can also detect exact duplicate structures by identifying any periodic crystal from its unique geometric code. Such duplicates can be represented by very different crystallographic information files (CIFs) that researchers submit to databases.
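To see why two files that look different can describe the same crystal, consider a deliberately simplified sketch. It is not the Liverpool team's published method: it is a toy stand-in that works on a one-dimensional "crystal" (atoms repeating along a line), ignores element types, and uses a hypothetical function name, `fingerprint`. The idea it illustrates is that a code built only from interatomic distances does not change when the same structure is written with a shifted origin, a reordered atom list, or a doubled repeating unit.

```python
from collections import Counter
from fractions import Fraction


def fingerprint(period, positions, k=4, ndigits=6):
    """Toy geometric code for a 1D periodic set of atoms (an illustrative assumption).

    For every atom, collect the distances to its k nearest neighbours in the
    infinite periodic structure, then record which distance rows occur and in
    what proportion. Only distances enter, so the result does not depend on
    the origin, the atom order, or the size of the repeating unit chosen.
    """
    # Reduce all positions into a single repeating unit [0, period).
    reduced = [p % period for p in positions]
    rows = []
    for p in reduced:
        dists = []
        for q in reduced:
            # Images up to k units away on each side are enough to contain
            # the k nearest neighbours of p.
            for m in range(-k, k + 1):
                d = abs(q + m * period - p)
                if d > 1e-9:  # skip the atom itself
                    dists.append(d)
        dists.sort()
        rows.append(tuple(round(d, ndigits) for d in dists[:k]))
    # Store proportions rather than raw counts so that a doubled unit cell
    # gives the same code as the original one.
    counts = Counter(rows)
    total = len(rows)
    return tuple(sorted((row, Fraction(c, total)) for row, c in counts.items()))


# Three descriptions of the same structure, and one genuinely different one.
original = fingerprint(1.0, [0.0, 0.3])
shifted = fingerprint(1.0, [0.5, 0.8])              # origin moved by 0.5
supercell = fingerprint(2.0, [0.0, 0.3, 1.0, 1.3])  # repeating unit doubled
different = fingerprint(1.0, [0.0, 0.4])            # atom spacing changed

print(original == shifted == supercell)  # the three descriptions agree
print(original == different)             # the different structure does not
```

Real crystals are three-dimensional and contain several kinds of atoms, so a practical duplicate check needs a far more careful invariant than this one. The sketch only shows the principle: compare structures by a code derived from their geometry, not by the text of the files that describe them.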
Using their method, Kurlin and colleagues found that the GNoME database contained 1,224 pairs of crystal structures that were exact duplicates. The team also found 43 triplets (three exact duplicate crystals) and one quadruplet. “We still don’t know how it was possible to miss” them, Kurlin says about the exact duplicates.
When GNoME takes existing structures and replaces atoms, “it should apply simulations which should change the atomic positions,” says Daniel Widdowson, Kurlin’s coauthor and a doctoral student at Liverpool. “In that case, you would expect near duplicates, but these exact ones are difficult to justify.”
Over the past few years, Kurlin and his team have published a series of studies identifying duplicate entries in major crystallography databases. In addition to calling out GNoME, their work implicates many of the other large databases spanning organic and inorganic crystals and protein structures.
Some database administrators haven’t acknowledged the duplications publicly but rather have quietly removed them. Others say that most instances of duplication don’t involve data integrity and that each individual case needs to be examined carefully to determine the most suitable course of action.