Publications
2020
- COVIDScholar: An automated COVID-19 research aggregation and analysis platform. Amalie Trewartha, John Dagdelen, Haoyan Huo, Kevin Cruse, Zheren Wang, Tanjin He, Akshay Subramanian, Yuxing Fei, Benjamin Justus, Kristin Persson, and Gerbrand Ceder. arXiv preprint arXiv:2012.03891, 2020.
The ongoing COVID-19 pandemic has had far-reaching effects throughout society, and science is no exception. The scale, speed, and breadth of the scientific community’s COVID-19 response have led to the emergence of new research literature on a remarkable scale – as of October 2020, over 81,000 COVID-19 related scientific papers have been released, at a rate of over 250 per day. This has created a challenge to traditional methods of engagement with the research literature; the volume of new research is far beyond the ability of any human to read, and the urgency of response has led to an increasingly prominent role for pre-print servers and a diffusion of relevant research across sources. These factors have created a need for new tools to change the way scientific literature is disseminated. COVIDScholar is a knowledge portal designed with the unique needs of the COVID-19 research community in mind, utilizing NLP to aid researchers in synthesizing the information spread across thousands of emergent research articles, patents, and clinical trials into actionable insights and new knowledge. The search interface for this corpus, https://covidscholar.org/, now serves over 2000 unique users weekly. We also present an analysis of trends in COVID-19 research over the course of 2020.
- Similarity of precursors in solid-state synthesis as text-mined from scientific literature. Tanjin He, Wenhao Sun, Haoyan Huo, Olga Kononova, Ziqin Rong, Vahe Tshitoyan, Tiago Botari, and Gerbrand Ceder. Chemistry of Materials, 2020.
Collecting and analyzing the vast amount of information available in the solid-state chemistry literature may accelerate our understanding of materials synthesis. However, one major problem is the difficulty of identifying which materials from a synthesis paragraph are precursors or are target materials. In this study, we developed a two-step Chemical Named Entity Recognition (CNER) model to identify precursors and targets, based on information from the context around material entities. Using the extracted data, we conducted a meta-analysis to study the similarities and differences between precursors in the context of solid-state synthesis. To quantify precursor similarity, we built a substitution model to calculate the viability of substituting one precursor with another while retaining the target. From a hierarchical clustering of the precursors, we demonstrate that "chemical similarity" of precursors can be extracted from text data. Quantifying the similarity of precursors helps provide a foundation for suggesting candidate reactants in a predictive synthesis model.
2019
- Text-mined dataset of inorganic materials synthesis recipes. Olga Kononova, Haoyan Huo, Tanjin He, Ziqin Rong, Tiago Botari, Wenhao Sun, Vahe Tshitoyan, and Gerbrand Ceder. Scientific Data, 2019.
Materials discovery has been significantly facilitated and accelerated by high-throughput ab-initio computations. This ability to rapidly design interesting novel compounds has displaced the materials innovation bottleneck to the development of synthesis routes for the desired material. As there is no fundamental theory for materials synthesis, one might attempt a data-driven approach for predicting inorganic materials synthesis, but this is impeded by the lack of a comprehensive database containing synthesis processes. To overcome this limitation, we have generated a dataset of “codified recipes” for solid-state synthesis automatically extracted from scientific publications. The dataset consists of 19,488 synthesis entries retrieved from 53,538 solid-state synthesis paragraphs by using text mining and natural language processing approaches. Every entry contains information about target material, starting compounds, operations used and their conditions, as well as the balanced chemical equation of the synthesis reaction. The dataset is publicly available and can be used for data mining of various aspects of inorganic materials synthesis.
- Semi-supervised machine-learning classification of materials synthesis procedures. Haoyan Huo, Ziqin Rong, Olga Kononova, Wenhao Sun, Tiago Botari, Tanjin He, Vahe Tshitoyan, and Gerbrand Ceder. npj Computational Materials, 2019.
Digitizing large collections of scientific literature can enable new informatics approaches for scientific analysis and meta-analysis. However, most content in the scientific literature is locked up in written natural language, which is difficult to parse into databases using explicitly hard-coded classification rules. In this work, we demonstrate a semi-supervised machine-learning method to classify inorganic materials synthesis procedures from written natural language. Without any human input, latent Dirichlet allocation can cluster keywords into topics corresponding to specific experimental materials synthesis steps, such as “grinding” and “heating”, “dissolving” and “centrifuging”, etc. Guided by a modest amount of annotation, a random forest classifier can then associate these steps with different categories of materials synthesis, such as solid-state or hydrothermal synthesis. Finally, we show that a Markov chain representation of the order of experimental steps accurately reconstructs a flowchart of possible synthesis procedures. Our machine-learning approach provides a scalable way to unlock the large amount of inorganic materials synthesis information from the literature and to process it into a standardized, machine-readable database.
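As a rough illustration of the Markov chain idea mentioned in this abstract, the sketch below estimates first-order transition probabilities between synthesis steps from ordered step sequences. It is a minimal sketch, not the paper's implementation: the function name `transition_probabilities` and the toy procedures are hypothetical, and the step labels are assumed to have already been assigned by some upstream classifier.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is an iterable of ordered step labels, one list per procedure.
    Returns {step: {next_step: probability}}.
    """
    counts = defaultdict(Counter)
    for steps in sequences:
        # Count each adjacent (current, following) pair in the procedure.
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1
    return {
        step: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for step, nexts in counts.items()
    }


# Hypothetical toy procedures for illustration only; not data from the paper.
procedures = [
    ["grinding", "heating", "grinding", "heating"],
    ["dissolving", "heating", "centrifuging"],
    ["grinding", "heating"],
]
probs = transition_probabilities(procedures)
```

Edges with high probability in `probs` can then be drawn as arrows between steps, which is one simple way to arrive at a flowchart-like view of the procedures.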
2017
- Unified representation of molecules and crystals for machine learning. Haoyan Huo and Matthias Rupp. arXiv preprint arXiv:1704.06439, 2017.
Accurate simulations of atomistic systems from first principles are limited by computational cost. In high-throughput settings, machine learning can potentially reduce these costs significantly by accurately interpolating between reference calculations. For this, kernel learning approaches crucially require a single Hilbert space accommodating arbitrary atomistic systems. We introduce a many-body tensor representation that is invariant to translations, rotations and nuclear permutations of same elements, unique, differentiable, can represent molecules and crystals, and is fast to compute. Empirical evidence is presented for energy prediction errors below 1 kcal/mol for 7k organic molecules and 5 meV/atom for 11k elpasolite crystals. Applicability is demonstrated for phase diagrams of Pt-group/transition-metal binary systems.
- Hydrogen-bond symmetrization of δ-AlOOH. Duan Kang, Ye-Xin Feng, Ying Yuan, Qi-Jun Ye, Feng Zhu, Hao-Yan Huo, Xin-Zheng Li, and Xiang Wu. Chinese Physics Letters, 2017.
δ-AlOOH can transport water into the deep mantle along the cold subducting slab geotherm. We investigate the hydrogen-bond symmetrization behavior of δ-AlOOH under the relevant pressure-temperature conditions of the lower mantle using ab initio molecular dynamics (MD). The static symmetrization pressure of 30.0 GPa can be reduced to 17.0 GPa at 300 K by finite-temperature (T) statistics, closer to the experimental observation of 10.0 GPa. The symmetrization pressure obtained by MD simulation is related to T by P (GPa) = 13.9 (GPa) + 0.01 (GPa/K) × T (K). We conclude that δ-AlOOH in the lower mantle exists with a symmetric hydrogen bond from its birthplace, or someplace slightly deeper, to the core-mantle boundary (CMB) along the cold subducting slab geotherm. The bulk modulus decreases with T and increases anomalously upon symmetrization: K0 (GPa) = 181 (GPa) − 0.013 (GPa/K) × T (K) for δ-AlOOH with an asymmetric hydrogen bond, and K0 (GPa) = 216 (GPa) − 0.013 (GPa/K) × T (K) for δ-AlOOH with a symmetric hydrogen bond. Our results provide important insight into the form and properties of δ-AlOOH in the lower mantle.
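The linear fits quoted in this abstract can be evaluated directly. The sketch below assumes they read as P = 13.9 + 0.01 × T and K0 = intercept − 0.013 × T (with intercepts 181 and 216 GPa); that reading is an interpretation, chosen because it reproduces the quoted value of about 17.0 GPa at 300 K. The function names are hypothetical.

```python
def symmetrization_pressure_gpa(temperature_k):
    """Symmetrization pressure in GPa, assuming P = 13.9 + 0.01 * T."""
    return 13.9 + 0.01 * temperature_k


def bulk_modulus_gpa(temperature_k, symmetric):
    """Bulk modulus K0 in GPa, assuming K0 = intercept - 0.013 * T.

    The intercept is taken as 216 GPa for the symmetric hydrogen bond
    and 181 GPa for the asymmetric one (assumed reading of the abstract).
    """
    intercept = 216.0 if symmetric else 181.0
    return intercept - 0.013 * temperature_k


p_300 = symmetrization_pressure_gpa(300.0)       # about 16.9 GPa
k_asym_300 = bulk_modulus_gpa(300.0, False)      # about 177.1 GPa
k_sym_300 = bulk_modulus_gpa(300.0, True)        # about 212.1 GPa
```

Under this reading, the bulk modulus at a fixed temperature is 35 GPa higher once the hydrogen bond is symmetric, matching the stated increase upon symmetrization.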