Over the past thirty years, materials design has been significantly accelerated by high-throughput computational methods and large-scale databases of computed materials. However, the materials discovery pipeline remains bottlenecked by the difficulty of experimental fabrication, which can still require months or even years of trial and error to synthesize a target compound. To address this synthesis bottleneck, we developed a data-mining pipeline that uses natural language processing and text-mining techniques to extract inorganic materials synthesis information from nearly four million scientific publications available online. In particular, we have generated an automatically extracted dataset of 19,477 solid-state synthesis reactions. Attributes in this dataset include starting materials, target compounds, experimental operations, and detailed synthesis conditions and parameters. Using this codified synthesis information, we apply machine learning and materials informatics to analyze features of materials synthesis experiments. Our methods allow us to mine the synthesis knowledge locked up in written natural language and to achieve data-driven synthesis prediction for next-generation materials discovery and fabrication.
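As a rough illustration of how one extracted reaction might be represented, the sketch below groups the attributes listed above (starting materials, target compound, operations, and conditions) into a small record type. The class names, field names, and condition values are hypothetical and chosen only for illustration; they are not the schema of the actual dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operation:
    """One experimental step, e.g. mixing or heating (hypothetical schema)."""
    kind: str
    # Free-form conditions such as temperature, time, or atmosphere.
    conditions: Dict[str, str] = field(default_factory=dict)


@dataclass
class SynthesisReaction:
    """One solid-state synthesis reaction record (hypothetical schema)."""
    target: str
    precursors: List[str]
    operations: List[Operation] = field(default_factory=list)

    def operation_kinds(self) -> List[str]:
        """Return the ordered list of operation types in this recipe."""
        return [op.kind for op in self.operations]


# Textbook example reaction: BaCO3 + TiO2 -> BaTiO3 + CO2.
# The condition values are illustrative and not taken from the dataset.
example = SynthesisReaction(
    target="BaTiO3",
    precursors=["BaCO3", "TiO2"],
    operations=[
        Operation("mixing"),
        Operation("heating", {"temperature": "1100 C", "atmosphere": "air"}),
    ],
)

print(example.operation_kinds())
```

Keeping operations as an ordered list preserves the sequence of steps, which matters when comparing recipes for the same target.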
Machine Learning Approach for Identifying Reported Materials Synthesis Experiments in Scientific Articles
We propose a predictive synthesis approach for the rapid design of materials synthesis experiments: extracting synthesis parameters from a large number of scientific articles and applying machine learning to build statistical models for automated synthesis parameter prediction. The first important task is to locate the “synthesis paragraphs” in articles, where experimental setups and synthesis parameters are reported. In this work, we present a machine learning method for accurately classifying synthesis paragraphs into categories of inorganic materials synthesis methods, using latent Dirichlet allocation and random decision forests. We show that our method is able to quantify the topics expressed in sentences and thereby recognize the key experimental steps of a synthesis. We construct features from sentence topics and train random forest models. We demonstrate that our method generates models that can be easily visualized and understood by humans and that achieve high classification F1 scores of 93.5%, 97.3%, and 90.0% for solid-state, hydrothermal, and sol–gel precursor synthesis methods, respectively. We also demonstrate that our method requires only a small amount of training data and little human annotation effort to yield good models, which facilitates the application of machine learning to materials science, where dataset size is often limited and human annotation requires considerable domain expertise.
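The general idea of combining sentence-level topics with a random forest can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the toy paragraphs and labels are invented, the number of topics is arbitrary, and averaging sentence topic distributions into a paragraph feature vector is one simple choice that the abstract does not specify.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

N_TOPICS = 4  # assumed value, chosen only for this toy example

# Invented toy data: each paragraph is a list of sentences.
PARAGRAPHS = [
    ["The powders were ground in an agate mortar.",
     "The mixture was calcined in air and reground."],
    ["Stoichiometric oxides were ball milled.",
     "The pellets were sintered in a furnace."],
    ["The solution was sealed in a Teflon-lined autoclave.",
     "The autoclave was heated and cooled to room temperature."],
    ["Precursors were dissolved in water and transferred to an autoclave.",
     "The product was washed and dried."],
    ["Citric acid was added to the nitrate solution to form a gel.",
     "The gel was dried and calcined."],
    ["The sol was stirred until a transparent gel formed.",
     "The dried gel was annealed."],
]
LABELS = ["solid_state", "solid_state", "hydrothermal",
          "hydrothermal", "sol_gel", "sol_gel"]


def paragraph_features(paragraphs, vectorizer, lda):
    """Average the per-sentence topic distributions of each paragraph."""
    rows = []
    for sentences in paragraphs:
        topics = lda.transform(vectorizer.transform(sentences))
        rows.append(topics.mean(axis=0))
    return np.vstack(rows)


# Fit the topic model on individual sentences, not whole paragraphs.
all_sentences = [s for paragraph in PARAGRAPHS for s in paragraph]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(all_sentences)
lda = LatentDirichletAllocation(n_components=N_TOPICS, random_state=0)
lda.fit(counts)

# Train a random forest on the paragraph-level topic features.
features = paragraph_features(PARAGRAPHS, vectorizer, lda)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features, LABELS)
predictions = clf.predict(features)
print(features.shape, list(predictions))
```

Because each feature is a topic proportion, the trained trees split on quantities such as "fraction of the paragraph about heating", which is what makes models of this kind comparatively easy to inspect.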