Apart from the scientific knowledge management models of micropublications (Clark 2014) and the Biological Expression Language (http://www.openbel.org), the concept of nano-publications (Mons 2009, Groth 2010) is most aligned with the goals of Research Analysis and has provided valuable insights. In this article we provide an overview of the nano-publication model and discuss how it relates to the Research Analysis (RA) model.
The authors propose 5 steps required to create and adopt nano-publications (Mons 2009, Groth 2010):
- Terms to Concepts: This step requires that all terms in a research article are mapped to non-ambiguous identifiers. In nano-publications this is referred to as a Concept, where a Concept is the smallest, unambiguous unit of thought. A concept is uniquely identifiable (Groth 2010). This is similar to, but more ambitious than, the Medical Subject Headings (MeSH) database. Using MeSH as an example, MeSH Headings are the equivalent of Concepts and the Entry Terms for each MeSH Heading are equivalent to the Terms or synonyms for each Concept. We agree with this general goal and strongly promote the use of standard language. We promote the use of MeSH terms in Research Analysis. In Research Analysis users can use the MeSH Entry Term (synonym) they are most comfortable with and the system then ensures that this is mapped to the main concept or MeSH Heading. This allows the user to work with the terminology that is most comfortable for them and their peers, rather than being forced to use an ideal concept. In a separate article, we discuss the challenges associated with defining an unambiguous unit of thought.
- Concepts to Statements: Here they propose that each smallest insight in exact sciences is a ‘triple’ of three concepts, though conditions are required to put the insight in context. The triple is in the form subject > predicate > object. For example, cholesterol > increases > atherosclerosis. A Statement is a uniquely identifiable triple, which can be achieved through the assignment of a unique identifier to the triple by annotation. RA initially implemented a cause-effect model for statements, which is a special case of the triple where the predicate must be a cause-effect predicate. We currently only offer the predicates increases, decreases and not significant. We chose to initially restrict the options for triples to allow for the collection of a consistent database that would allow for analysis.
- Annotation of Statements with Context and Provenance: It is not enough to store statements just in the form of their basic components, three concepts in a specific sequence. A statement only ‘makes sense’ in a given context and taking a statement out of a research publication strips it of this context. The context in a nano-publication is defined by another set of concepts. The annotation is achieved technically through a triple such that the subject of the triple is a statement. For the example above, the species should be specified. Mice do not get atherosclerosis, even on a high cholesterol diet, but humans do. Provenance, e.g. author and source, is also associated with Statements by annotation. Claims in RA by default require that the user provide organ/cell model, genetic model and species annotations that are relevant for each specific claim. RA automatically assigns unique identifiers to claims and requires that at least one supporting quotation is provided from a publication. Each supporting quotation requires that the PubMed ID (PMID) is provided. Additional conditions and context can be provided by users appending tags to the claim.
- Treating Richly Annotated Statements as Nano-Publications: Statements with conditional annotation are treated as nano-publications via proper attribution, so they can be cited and the authors can be credited. A nano-publication is a set of annotations that refer to the same statement and contains a minimum set of (community) agreed upon annotations (Groth 2010). This concept is similar to the claim model in RA. Claims can be cited using their unique identifier, and viewing a claim provides details of all of the quoted statements and associated PMIDs that support the claim, along with any other context provided via tags.
- Removing Redundancy, Meta-analyzing Web-Statements: Where statements are identical they would be removed to simplify the database. The goal is to reduce “undue repetition” and to help improve the identification of new statements. Groth et al. define S-Evidence as all the nano-publications that refer to the same statement (Groth 2010) and, as implied by the name, provide evidence for the statement. The original model for nano-publications focused more on the removal of redundancy, but the concept of S-Evidence provides more respect for the importance of replication and the potential for meta-analysis. In complex sciences like biology, the likelihood of a statement being true based on the evidence of one publication is surprisingly low. One example is the uncertain reproducibility and re-usability of results in therapeutic development in the cancer field (Begley 2012). No single experiment, or for that matter any number of experiments, can fully demonstrate the truth of a statement. However, the collection of results that support a statement can, in a Popperian sense, provide some guidance on the degree to which a scientific statement has had its mettle tested. It can also allow for the bridging of knowledge between subfields where different terms for the same concepts are regularly used.
Figure 0: The Nano-publication Model taken from Groth 2010.
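The steps above can be sketched as a small data model. This is a minimal illustration in Python, not the schema used by the nano-publication authors or by RA. The class names, the `s_evidence` helper and the "publication A/B" sources are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Statement:
    """A triple of concept identifiers: subject > predicate > object."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class NanoPublication:
    """A statement plus its annotations (context and provenance)."""
    statement: Statement
    # Each annotation is a (key, value) pair attached to the statement,
    # e.g. ("species", "human") or ("source", "...").
    annotations: Tuple[Tuple[str, str], ...]


def s_evidence(pubs: List[NanoPublication]) -> Dict[Statement, List[NanoPublication]]:
    """Group nano-publications by the statement they refer to."""
    grouped: Dict[Statement, List[NanoPublication]] = defaultdict(list)
    for pub in pubs:
        grouped[pub.statement].append(pub)
    return dict(grouped)


# The example triple from the text, annotated twice with placeholder sources.
claim = Statement("cholesterol", "increases", "atherosclerosis")
pubs = [
    NanoPublication(claim, (("species", "human"), ("source", "publication A"))),
    NanoPublication(claim, (("species", "human"), ("source", "publication B"))),
]
evidence = s_evidence(pubs)
print(len(evidence[claim]))
```

Grouping rather than deleting identical statements keeps every supporting publication visible, which is the point of S-Evidence as opposed to plain redundancy removal.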
The goal of nano-publications
The nano-publications authors propose the goal of having scientific authors structure their data in such a way that computers understand them, and we support this goal. However, we feel that the formalisation of scientific knowledge is likely to become a specialist task, like the coding of software design specifications into software code. It is not clear how the nano-publications authors see the knowledgebase of nano-publications being used. We strongly believe that, while knowledge coding may be a specialist activity, most researchers in the biological fields will use tools based on such knowledgebases to help direct their research by identifying gaps, conflicts and opportunities in the current research.
Main differences between the nano-publications model and the Research Analysis models
We have discussed some similarities and differences between the nano-publication and RA models above, but here we go into a little more detail.
At the Concept level:
- The nano-publication model requests that authors use unique identifiers from the Concept Wiki, which aggregates concept names from many databases.
- RA feels that the task of aggregating names is too much and not necessary given the already available databases. Also, we prefer that scientists use the names rather than IDs for concepts as this is more intuitive and in line with our mission.
- RA currently requests that all claims use NLM’s MeSH Terms or IDs for the names of the elements or concepts in a claim. Where terms are not in MeSH then the NCBI Protein and PubChem databases can be utilised. In rare cases new concept names can be added into RA.
- RA uses these external name databases to identify synonyms and ensure that if scientists search for a particular MeSH Term that they will receive results for all of the MeSH terms under the MeSH Heading.
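The synonym handling described above can be illustrated with a small lookup. This is only a sketch under stated assumptions: the `HEADINGS` table is a toy stand-in for the MeSH database, its entries are illustrative, and the function names `resolve` and `search_terms` are hypothetical rather than part of RA.

```python
from typing import Dict, List, Optional

# Toy stand-in for MeSH: heading -> entry terms (synonyms). Illustrative only.
HEADINGS: Dict[str, List[str]] = {
    "Neoplasms": ["Cancer", "Tumors"],
    "Myocardial Infarction": ["Heart Attack"],
}


def build_index(headings: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every heading and entry term (lower-cased) to its heading."""
    index: Dict[str, str] = {}
    for heading, terms in headings.items():
        index[heading.lower()] = heading
        for term in terms:
            index[term.lower()] = heading
    return index


INDEX = build_index(HEADINGS)


def resolve(term: str) -> Optional[str]:
    """Return the heading for a user-supplied term, or None if unknown."""
    return INDEX.get(term.strip().lower())


def search_terms(term: str) -> List[str]:
    """Return every term a search for `term` should match."""
    heading = resolve(term)
    if heading is None:
        return []
    return [heading] + HEADINGS[heading]


print(resolve("cancer"))
print(search_terms("Heart Attack"))
```

The user types whichever synonym they prefer, and the system stores and searches by the heading.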
At the Statement level (Claims in RA):
- RA uses the concept of claims in place of the concept of statements in nano-publications. Claims are simply a type of statement that we feel is more intuitive.
- With regard to our natural language claims, we ask that scientists use the form of a declarative sentence that is as logically clear as possible. These statements will not be as logically clear as the nano-publication model and are likely to be more complex.
- Our standard cause-effect claim model is quite similar to the nano-publication model. The primary cause-effect part of the claim is simply a special case of the nano-publication triple, e.g. Treatment/Cause [subject] > Effect [predicate] > Disease/Molecule [object].
- The other components of our cause-effect claim model are equivalent to attribution in the nano-publication model. We have 3 as standard:
- (cause-effect claim, where it occurs in, Organ/Cell Model) AND
- (cause-effect claim, where it occurs in, Genetic Model) AND
- (cause-effect claim, where it occurs in, Species);
- Note that the three conditions should be considered as conjunctions with the cause-effect claim. Conjunction is required as the conditions are related. The Genetic Model affects the Organ/Cell Model and both are specific to the Species. A simple annotation of each to the cause-effect claim would be ambiguous. It is only when all of the conditions are true that the cause-effect claim is valid.
- In RA, statements must be supported by at least one quoted statement from a publication. These supporting quotes could be treated as annotations in the nano-publications model, but a difference between the two models is that a specific quote could be used as support for several claims. Further, for each PMID there are often several quoted statements. Technically these components could just be repeated as annotations for each statement, but in the RA model a map or graph concept is used where the statements are supported by quotes and quotes are supported by PMIDs. Claims, quotes and PMIDs are all considered to be elements in a directed acyclic graph. The S-Evidence concept in the nano-publication model goes some way to bridging this difference by grouping together nano-publications with the same statement. However, S-Evidence creates trees rather than a graph. This level of the RA model is much more like the micropublication model than the simpler nano-publication model.
Figure 1. Research Analysis Knowledge Management Model
- The RA model does have some simple annotation-like elements that associate information with the statement/claim, such as a unique statement identifier, the author of the statement/claim (who may or may not be the author of the publication that the quoted sentence came from), time and date of creation, and tags that in effect provide flexible and unlimited simple annotations.
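The claim structure described in this list can be sketched in Python. This is a hedged illustration, not RA's actual implementation. All class and method names, the identifiers ("claim-1", "quote-1", "PMID-A") and the condition values are hypothetical placeholders. The three conditions are compared as one unit to reflect the conjunction, and quotes are shared between claims to reflect the graph rather than tree structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# The three predicates the text says RA currently offers.
PREDICATES = {"increases", "decreases", "not significant"}


@dataclass(frozen=True)
class Conditions:
    """Organ/cell model AND genetic model AND species, treated as one unit."""
    organ_cell_model: str
    genetic_model: str
    species: str


@dataclass
class Claim:
    claim_id: str
    cause: str      # subject of the triple
    effect: str     # predicate of the triple
    target: str     # object of the triple
    conditions: Conditions
    tags: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.effect not in PREDICATES:
            raise ValueError(f"unsupported predicate: {self.effect}")

    def holds_under(self, observed: Conditions) -> bool:
        """The claim applies only when all three conditions match together."""
        return self.conditions == observed


class EvidenceGraph:
    """Edges run claim -> quote -> PMID, so the graph is acyclic by construction."""

    def __init__(self) -> None:
        self.claim_quotes: Dict[str, Set[str]] = {}
        self.quote_pmid: Dict[str, str] = {}

    def add_quote(self, quote_id: str, pmid: str) -> None:
        self.quote_pmid[quote_id] = pmid

    def support(self, claim_id: str, quote_id: str) -> None:
        if quote_id not in self.quote_pmid:
            raise KeyError(f"unknown quote: {quote_id}")
        self.claim_quotes.setdefault(claim_id, set()).add(quote_id)

    def pmids_for(self, claim_id: str) -> Set[str]:
        return {self.quote_pmid[q] for q in self.claim_quotes.get(claim_id, set())}

    def claims_for(self, quote_id: str) -> Set[str]:
        return {c for c, quotes in self.claim_quotes.items() if quote_id in quotes}


claim = Claim("claim-1", "cholesterol", "increases", "atherosclerosis",
              Conditions("aorta", "wild type", "human"))

graph = EvidenceGraph()
graph.add_quote("quote-1", "PMID-A")
graph.add_quote("quote-2", "PMID-A")   # one publication, several quotes
graph.support("claim-1", "quote-1")
graph.support("claim-1", "quote-2")
graph.support("claim-2", "quote-1")    # one quote supporting several claims
```

Because "quote-1" has two parent claims, the structure is a graph rather than a tree. Changing only the species in `holds_under` makes the check fail, which is the behaviour the conjunction is meant to capture.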
Both the nano-publication and micropublication models have a substantial focus on the technical aspects of encoding the data in semantic web schemas. The reason for this focus is the importance of making the data available in an open and semantically rich format. We respect this goal, but have chosen to hide as much of this detail from our users as possible. While the biological research community has become computer savvy over the past decades, the majority of the community are not trained in any computer programming languages and would find these schemas very unpleasant. I have a computer science degree from before web technology was popular and I still find them very unpleasant. This is a little unfair, as these are articles in informatics journals, and we certainly acknowledge that the micropublication authors have built user-oriented applications.
One of the key goals discussed in the article on the Research Analysis Mission is that our focus is on making knowledge management tools available to normal biological and medical scientists in an easy to use and powerful way – not making the knowledge available to a few hard core geeks and their supercomputers. For this reason, none of the bare bones schemas are visible to the users and the frontend terminology is focused on usability rather than theoretical correctness. This is also the reason why we provide a number of standard models for capturing scientific claims in RA. These models will not be flexible enough for some, but for the rest they will be much easier and straightforward to use. Adoption and the actual acceleration of discovery by normal scientists is our top priority.
Mons, B., & Velterop, J. (2009, October). Nano-Publication in the e-science era. In Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009).
Groth, P., Gibson, A., & Velterop, J. (2010). The anatomy of a nano-publication. Information Services and Use, 30(1-2), 51-56.
Clark, T., Ciccarese, P., & Goble, C. (2014). Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics, 5(1), 28.
Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.