Category Archives: Scientific Knowledge Management

Causation in Medical Scientific Statements

The concept of causation is a very difficult philosophical topic and does not have a simple, settled definition. The causation implied by the Research Analysis model has been of interest to me for some time, but the inspiration for this article came from the Handbook of Analytic Philosophy of Medicine by Kazem Sadegh-Zadeh [1]. Sadegh-Zadeh provides a great overview of the philosophy of causation and, in particular, of etiology, the science of clinical causation. “Etiology, from the Greek term αιτια (aitia) meaning “the culprit” and “cause”, is the inquiry into clinical causation or causality including the causes of pathological processes and maladies” [1].

In this article, I will explore some of the tools provided in the handbook (You should refer to section 7.5 of the handbook for a more thorough introduction and overview) and discuss how they can be applied to the Research Analysis model. I will explore the probabilistic interpretations of claims, the concept of causation in relation to claims and the causal relevance of competing claims.

Probabilistic dependence

Let’s take an example claim from Research Analysis:

(1)         Statins decrease coronary events in normal humans

This claim can be found in Research Analysis here and its semantics are analysed in this previous article here. The semantic analysis resulted in the following logical description of the claim:

(2)         ∀e (DECREASE(statin, myocardial infarction, e) & IN(e, normal human))

Where e represents an event. In my previous article, I discussed that the word case may be more natural than event for this model, where “the concept of a case suggests that the time period and medical process will be appropriate to the specific disease treatment paradigm”.

A lot of the logical formalisation relates to the event or case. If one specific case is considered then we have the simpler claim:

(3)         DECREASE(statin, myocardial infarction)

This claim suggests that there is a decreasing relationship between:

  • The administration of statins, and
  • Myocardial infarction.

At this point probability theory can be introduced. Sadegh-Zadeh [1, p253] shows using probability theory that an event B is probabilistically independent of another event A if and only if (iff):

(4)         p(B|A) = p(B)

Where p(B|A) is the probability of the event B conditional on the fact that the event A has occurred. If events A and B are independent then the fact that event A has occurred will have no impact on the probability of event B occurring and thus the conditional probability will simply equal the probability of B occurring independently of A.

Dependence of two events is then simply represented by the opposite probability relationship:

(5)         p(B|A) ≠ p(B)

That is, two events are dependent if the conditional probability of B given A is not equal to the probability of event B occurring on its own; the fact that A occurred affects the probability that B will occur.

Coming back to the simple example, the claim “The administration of statins decreases myocardial infarction” implies that there is a conditional dependence between the two events and that:

(6)         p(Myocardial infarction | The administration of statins) ≠ p(Myocardial infarction)

That is, the probability of myocardial infarction given that statins have been administered to the patient is not equal to the probability of myocardial infarction in general. There is a probabilistic dependence between the event of statin administration and myocardial infarction.
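To make the dependence condition in (5) and (6) concrete, here is a minimal Python sketch. The counts are invented purely for illustration; they are not real clinical data.

```python
# Hypothetical case counts, invented purely to illustrate (4)-(6).
counts = {
    ("statins", "mi"): 10,        # statins administered, myocardial infarction occurred
    ("statins", "no_mi"): 990,    # statins administered, no myocardial infarction
    ("no_statins", "mi"): 30,
    ("no_statins", "no_mi"): 970,
}

total = sum(counts.values())

# p(B): unconditional probability of myocardial infarction across all cases
p_mi = (counts[("statins", "mi")] + counts[("no_statins", "mi")]) / total

# p(B|A): probability of myocardial infarction given that statins were administered
statin_cases = counts[("statins", "mi")] + counts[("statins", "no_mi")]
p_mi_given_statins = counts[("statins", "mi")] / statin_cases

print(f"p(MI)           = {p_mi:.3f}")                # 0.020
print(f"p(MI | statins) = {p_mi_given_statins:.3f}")  # 0.010

# Dependence as in (5)/(6): the conditional probability differs from the unconditional one.
print("dependent" if p_mi_given_statins != p_mi else "independent")
```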

Note that this probabilistic “dependence” does not imply that there is any causal interaction between the two events A and B (here, the administration of statins and myocardial infarction), only that there is a probabilistic correlation between the events. In a particular case, the correlation may run in one of two directions:

(7)         Positive correlation:       p(B|A) > p(B)

(8)         Negative correlation:     p(B|A) < p(B)

In our example, “decrease” is intended to mean that there is a negative correlation between the events and using the probabilistic terminology we have:

(9)         p(Myocardial infarction | The administration of statins) < p(Myocardial infarction)

This statement says that the administration of statins decreases the probability of myocardial infarction when compared to the general, unconditional population, or in other words that there is a negative correlation between myocardial infarction and the administration of statins. As discussed in our article relating to Popper’s philosophy of science (find it here), declarative statements like (1) and (2) found in Research Analysis are scientific statements in a form that can be logically falsified. In the real world, however, there is never certainty, and so experience only ever supports probabilistic correlations like that in (9). (Strictly speaking, even probabilities are technically not available according to Popper, but probability has uses beyond its risks in many fields of science.)

Does the statement (9) imply that there is a causal link between statins and myocardial infarction? To answer this question it is necessary to introduce some further concepts.

Probabilistic relevance or irrelevance

The Research Analysis model has always highlighted the importance of the reference population or model for each claim by requiring the specification of the reference species, disease model and whether it is a whole animal or organ model. Most medical research begins in cell culture or animal models, but has the goal of moving into human applications. It is important to clearly separate claims that relate to mice from those that relate to humans. For this reason, the concept of probabilistic relevance conditional on a reference population or background context is introduced below (refer [1, p255-257] for a more detailed introduction):

(10)       p(B|X∩A) > p(B|X)          Positive probabilistic relevance or conditional correlation

(11)       p(B|X∩A) < p(B|X)          Negative probabilistic relevance or conditional correlation

(12)       p(B|X∩A) = p(B|X)          Probabilistic irrelevance or no conditional correlation

Positive probabilistic relevance (10) says that the probability of event B conditional on both events X and A occurring (X ∩ A) is greater than the probability of event B conditional on X alone. In this presentation of probabilistic relevance, X represents the reference population and A and B are the events for which cause and effect are being evaluated. The following example using (1) can be provided:

(13)       p(Myocardial infarction | normal humans ∩ the administration of statins) < p(Myocardial infarction | normal humans)

This sentence says that the probability of myocardial infarction is lower in normal humans that have been administered statins than it is in normal humans in general. As noted, Research Analysis has always included the reference population or background context because research is conducted in many different species, genetic types and disease models. Sadegh-Zadeh in his work notes that the notion of background context is of great importance when analyzing issues of causality and that “There are no such things as ‘causes’ isolated from the context where they are effective or not. The background context will therefore constitute an essential element of our concept of causality.” [1, p 256]
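The three relevance relationships in (10)–(12) can be checked mechanically once a background context is fixed. Below is a minimal Python sketch using invented case records; the event labels and data are assumptions for illustration only.

```python
def relevance(cases, x, a, b):
    """Classify the probabilistic relevance of event a to event b relative to the
    background context x, following (10)-(12). Each case is a set of event labels."""
    in_x = [c for c in cases if x in c]
    in_x_and_a = [c for c in in_x if a in c]
    if not in_x or not in_x_and_a:
        raise ValueError("not enough cases to estimate the conditional probabilities")
    p_b_given_x = sum(b in c for c in in_x) / len(in_x)
    p_b_given_x_and_a = sum(b in c for c in in_x_and_a) / len(in_x_and_a)
    if p_b_given_x_and_a > p_b_given_x:
        return "positive probabilistic relevance"   # (10)
    if p_b_given_x_and_a < p_b_given_x:
        return "negative probabilistic relevance"   # (11)
    return "probabilistic irrelevance"              # (12)

# Invented case records: within normal humans, statins appear protective here.
cases = [
    {"normal_human", "statins"},
    {"normal_human", "statins"},
    {"normal_human", "statins", "mi"},
    {"normal_human", "mi"},
    {"normal_human", "mi"},
    {"normal_human"},
]
print(relevance(cases, "normal_human", "statins", "mi"))  # negative probabilistic relevance
```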

Spurious correlations & Screening Off

To this point in the discussion, it has not been possible to introduce the concept of causation; instead, the weaker concepts of relevance and correlation have been used. Where there is a non-zero relevance or correlation between two events A and B, as in (10) and (11), then A could be a potential cause of B, but “correlation does not imply causation”. To define the concept of causation it is first necessary to define spurious correlation and the concept of screening off.

(14)       Screening off: X screens A off from B iff p(B|X∩A) = p(B|X)

This says that X screens A off from B if and only if A is, in relation to the reference population X (or some other event or set of events), probabilistically irrelevant to B [1, p257]. We can take this concept a step further and define a spurious cause by incorporating it into the sentences (10-12) to assess whether there is an alternative event C that explains the probabilistic relevance of A to B.

(15)       Spurious cause: A is a spurious cause of B if there is a C such that p(B | X ∩ A ∩ C) = p(B | X ∩ C).

We can rephrase this and introduce the concept of time order as follows: in a reference population X, an event A is a spurious cause of an event B iff:

  1. A is a potential cause of B in X,
  2. there is an event C that precedes A, and
  3. C screens A off from B.

An example of a spurious cause can be provided as follows:

p(Death | Humans ∩ AIDS ∩ HIV) = p(Death | Humans ∩ HIV)

In this example, AIDS is the spurious cause: it is screened off from Death by the earlier HIV infection. The time order of events is discussed further in the next section.
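A screening-off check like (14) and (15) can be expressed directly in code. The sketch below estimates the relevant conditional probabilities from a list of case records; the records are invented to mirror the HIV/AIDS example and are not epidemiological data.

```python
def cond_p(cases, event, given):
    """Estimate p(event | given) from case records, where each case is a set of labels."""
    matching = [c for c in cases if given <= c]
    if not matching:
        raise ValueError("no cases match the conditioning events")
    return sum(event in c for c in matching) / len(matching)

def screens_off(cases, x, c, a, b, tol=1e-9):
    """True if, within context x, event c screens a off from b:
    p(b | x and c and a) = p(b | x and c), as in (14)/(15)."""
    return abs(cond_p(cases, b, {x, c, a}) - cond_p(cases, b, {x, c})) < tol

# Invented records: once HIV is known, also knowing AIDS changes nothing about death.
cases = [
    {"human", "hiv", "aids", "death"},
    {"human", "hiv", "aids"},
    {"human", "hiv", "death"},
    {"human", "hiv"},
    {"human"},
]
print(screens_off(cases, "human", "hiv", "aids", "death"))  # True: HIV screens AIDS off
```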

Dominant cause

In the previous example, AIDS could certainly be a cause of Death in untreated individuals. However, AIDS is screened off by HIV. Both AIDS and HIV can be considered as causes of death. Here the concept of Dominant cause can be introduced to provide a ranking between causes.

(16) Dominant Cause:  At1 is a dominant cause of Bt2 in X iff there are no t and event Ct in X such that:

  1. t1 ≤ t < t2, and
  2. p(Bt2 | X ∩ At1 ∩ Ct) = p(Bt2 | X ∩ Ct).

This definition says that a cause is dominant if no simultaneous or later event is able to screen it off from the effect [1, p275]. This can be demonstrated with the AIDS example:

  • A person contracts HIV in 2001
  • They present with AIDS in 2003
  • They die in 2010

In this example there would exist the following probability relationship:

p(Death2010 | Humans ∩ HIV2001 ∩ AIDS2003) = p(Death2010 | Humans ∩ AIDS2003)

This relationship says that the combination of HIV with a Human and AIDS provides the same probability of death as the combination of AIDS with a Human. It can also be seen that 2001 ≤ 2003 < 2010, which says that the HIV infection occurred prior to the AIDS. This result confirms that AIDS is not the dominant cause of death in this case, as there exists an earlier event, HIV2001, that screens AIDS2003 off from Death2010. The following shows the causes reordered:

p(Death2010 | Humans ∩ AIDS2003 ∩ HIV2001) = p(Death2010 | Humans ∩ HIV2001)

This relationship says, in a similar way, that the combination of AIDS with a Human and HIV provides the same probability of death as the combination of HIV with a Human. However, in this case the Ct event (HIV2001) does not occur in time between the At1 (AIDS2003) and Bt2 (Death2010) terms. So AIDS cannot be the dominant cause.
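The test in definition (16) can be sketched as a simple search over candidate screening events within the time window. The events, times and probabilities below are abstract, invented values rather than figures from the example above; the sketch only illustrates the mechanics of the definition.

```python
# Invented conditional probabilities p(b | given) for an abstract example.
PROB = {
    ("b", frozenset({"x", "a", "c"})): 0.8,
    ("b", frozenset({"x", "c"})): 0.5,
}

def p(b, given):
    return PROB[(b, frozenset(given))]

def is_dominant(a, t1, b, t2, x, candidates, tol=1e-9):
    """A_t1 is dominant for B_t2 in X if no candidate event C_t with t1 <= t < t2
    screens A_t1 off from B_t2 (definition 16)."""
    for c, t in candidates:
        if t1 <= t < t2 and abs(p(b, {x, a, c}) - p(b, {x, c})) < tol:
            return False  # C_t screens A_t1 off, so A_t1 is not dominant
    return True

# Here the only candidate C does not screen A off (0.8 != 0.5), so A remains dominant.
print(is_dominant("a", 2001, "b", 2010, "x", [("c", 2003)]))  # True
```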

The concept of dominant cause provides a means for ruling out spurious causes and for keeping track of the cause that has not yet been ruled out. In reality, however, we will never be able to test all possible events as causes, so knowledge of the dominant cause of a disease will always be subject to future falsification. This is further complicated by the fact that causal chains run off into the infinite past. Taking the example above, it may be that the HIV infection was caused by unprotected sex. At the time 2001, HIV may be the dominant cause, but if unprotected sex at an earlier time is considered then the HIV would be screened off by the unprotected sex.

Looking at the causes of HIV infection, it can also be seen that while unprotected sex may be a frequent cause of HIV it is not the only one: there is also the sharing of needles, infusion with HIV-infected blood, and so on. So there may be many causes of a disease given a broad background population like all humans, even though there may be only one cause for a specific person who contracts HIV. There can also be a common cause for the several symptoms of a disease. Finally, it is rare that there is a single cause for an event; usually several events contribute to any future event, and we will explore the concept of causal relevance below. Causation is a far more complex concept than most people realise. The concepts presented in this article, and more thoroughly by Sadegh-Zadeh [1], provide some valuable tools for assessing causation and making more thorough use of the concept.

Causal Relevance

The concept of causal relevance provides a useful metric for answering the question: which event is causally more relevant to a particular disease? Causal relevance can be defined as:

(17) cr(A,B,X) = p(B|X∩A) – p(B|X)

This states that causal relevance is simply the difference between the probability of B given the background context X and the causal event A, and the probability of B given the background context X alone. An example would be:

In a given year:

p(Myocardial infarction | normal humans) = 1%

p(Myocardial infarction | normal humans ∩ smoking) = 2%

cr           = p(Myocardial infarction | normal humans ∩ smoking) – p(Myocardial infarction | normal humans)

= 0.02 – 0.01 = 0.01

The numbers are just estimates, but they suggest that while smoking may double the chance of myocardial infarction (MI) it does not have a high causal relevance within a one year period.
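The arithmetic above is simple enough to express as a short function. The sketch below reuses the illustrative one-year probabilities from the example; they are estimates only, not measured values.

```python
def causal_relevance(p_b_given_x_and_a, p_b_given_x):
    """cr(A, B, X) = p(B | X and A) - p(B | X), as in (17)."""
    return p_b_given_x_and_a - p_b_given_x

# Illustrative one-year probabilities from the example above (estimates only).
cr_smoking = causal_relevance(0.02, 0.01)
print(f"{cr_smoking:.2f}")  # 0.01: positive, but a small causal relevance over one year
```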

Some examples that demonstrate causal relevance [1, p277]:

  • Causal irrelevance amounts to cr(A,B,X) = 0 (no relevance), eg. the causal relevance of your healthcare number to myocardial infarction.
  • Positive causal relevance is cr(A,B,X) > 0 (causing), eg. smoking to myocardial infarction.
  • Negative causal relevance is cr(A,B,X) < 0 (discausing, preventing), eg. statins to myocardial infarction.
  • Maximum positive causal relevance is cr(A,B,X) = 1 (maximum efficiency), eg. mechanically clamping your coronary artery to myocardial infarction.
  • Maximum negative causal relevance is cr(A,B,X) = −1 (maximum prevention), eg. no example.

Causal relevance as defined here is not a probability, if only because it ranges from −1 to +1. It is simply a quantitative function that provides a measure of the context-relative causal impact of events.

Causal relevance can also be used to compare the relative importance of different events to an outcome by comparing their causal relevance.

If cr(A1, B, X) > cr (A2, B, X), then A1 is causally more relevant to B in X than A2.

For example,

cr(smoking, myocardial infarction, normal humans) > cr (healthcare number, myocardial infarction, normal humans)

This says that smoking is a stronger cause of myocardial infarction than is your healthcare number.

In later articles or versions of this article, we will explore how the concepts of causation and relative causation might be applied to the Research Analysis model and platform.

References

  1. Sadegh-Zadeh, Kazem. Handbook of analytic philosophy of medicine. Dordrecht: Springer, 2014.

Version 1.0, 19th March, 2017

Popperian Falsifiability is Only Theoretical: Evidence can never definitively reject a hypothesis

Popperian falsifiability [1] is a cornerstone of modern science and its philosophy.  I will not question the importance of Popper’s work; it was a major source of inspiration for our work. I will not argue that there are weaknesses in the concept of falsifiability of scientific hypotheses. Popper made it clear in his work that the concept of falsifiability is theoretical and cannot be achieved in practice. His achievement was to demarcate empirical science, as the investigation of logically falsifiable statements, from metaphysics. Popper’s key thesis was proposed as follows:

“But I shall certainly admit a system as empirical or scientific only if it is capable of being tested by experience. These considerations suggest that not the verifiability but the falsifiability of a system is to be taken as a criterion of demarcation.* In other words: I shall not require of a scientific system that it shall be capable of being singled out, once and for all, in a positive sense; but I shall require that its logical form shall be such that it can be singled out, by means of empirical tests, in a negative sense: it must be possible for an empirical scientific system to be refuted by experience.”

As an example of a logically falsifiable statement, take the following sentence from a review article on the Pathobiological Determinants of Atherosclerosis in Youth (PDAY) Study, which investigated the prevalence of atherosclerosis in 2876 subjects aged 15 to 34 years [2].

(1) “All subjects in this study had lesions in the abdominal aorta, and all except 2 white men had lesions in the thoracic aorta.”

From this sentence we can construct an example of a scientific statement that is logically falsifiable:

(2) All humans between the ages of 15 and 34 have atherosclerotic lesions in the abdominal aorta.

Or in a more logical form:

∀x α

where

x = humans between the ages of 15 and 34,

α = x has atherosclerotic lesions in the abdominal aorta

Logically this statement (2) could be falsified by the example of one human aged 15 to 34 who did not have any lesions in the abdominal aorta. On the other hand, the universal claim made by “all” makes this statement unverifiable, as there is a potentially infinite number of humans.
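This asymmetry between falsification and verification is easy to illustrate in code. The following sketch checks a universal claim like (2) against a finite set of invented subject records (these are not PDAY data): a single counterexample falsifies the claim, while any number of confirming records merely corroborates it.

```python
# Invented subject records, used only to illustrate the asymmetry.
subjects = [
    {"age": 22, "abdominal_aorta_lesions": True},
    {"age": 31, "abdominal_aorta_lesions": True},
    {"age": 17, "abdominal_aorta_lesions": False},  # a single counterexample is enough
]

def in_scope(subject):
    """The claim (2) quantifies over humans aged 15 to 34."""
    return 15 <= subject["age"] <= 34

counterexamples = [s for s in subjects if in_scope(s) and not s["abdominal_aorta_lesions"]]

if counterexamples:
    print(f"Claim (2) is falsified by {len(counterexamples)} subject(s)")
else:
    # No counterexample so far: the claim is corroborated, never verified,
    # because the next subject examined could still falsify it.
    print("Claim (2) is corroborated by the observations so far, not verified")
```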

If we go back to the sentence (1), we can see that the authors note that “all except 2” subjects had lesions in the thoracic aorta. This was not incorporated into the statement (2). But given that 2874 of 2876 subjects (99.9%) did have lesions in the thoracic aorta, is it reasonable to reject the claim (3) below?

(3) All humans between the ages of 15 and 34 have atherosclerotic lesions in the thoracic aorta.

Certainly these two subjects appear to have falsified (3). If we accept the falsification of the claim in (3), then it appears that the most that can be claimed is (4)

(4) Almost all humans between the ages of 15 and 34 have atherosclerotic lesions in the thoracic aorta.

Now (4) is no longer logically falsifiable, and thus no longer scientific or empirical, as we can never be sure whether a subject without lesions in the thoracic aorta falsifies the claim or is one of those excluded by the hedging word “almost”. According to Popper, science involves the empirical investigation of statements like (2) and (3), but not those like (4).

Popper was clear on the real world limitations and possible objections to his approach and discusses them explicitly:

“A third objection may seem more serious. It might be said that even if the asymmetry is admitted, it is still impossible, for various reasons, that any theoretical system should ever be conclusively falsified.”

… “I must admit the justice of this criticism; but I need not therefore withdraw my proposal to adopt falsifiability as a criterion of demarcation.” [1, p19-20]

and

“If falsifiability is to be at all applicable as a criterion of demarcation, then singular statements must be available which can serve as premisses in falsifying inferences. Our criterion therefore appears only to shift the problem — to lead us back from the question of the empirical character of theories to the question of the empirical character of singular statements.” [1, p21]

Popper is acknowledging that any evidential claim put forward to falsify a hypothesis is itself a scientific hypothesis that, by Popper’s philosophy, cannot be verified and can at best only be theoretically falsified, leading us into an infinite regress. Popper’s philosophy offers a demarcation of empirical scientific statements from metaphysical statements, but it does not offer definitive rejection of hypotheses by falsification. Falsification is only theoretical. Experiments never definitively falsify a hypothesis.

A simple example

An example hypothesis that meets the requirement of Popperian falsifiability is as follows:

(1)         John Smith has familial hypercholesterolemia

Familial hypercholesterolemia is a disease that causes LDL cholesterol (bad cholesterol) to be very high. As the name suggests, the disease is genetic and is passed down through families. There are many genetic defects that can cause the disease, but for this example let us assume that there is only one genetic defect that causes the disease and that (1) denotes only the disease caused by that defect. Without this assumption the statement becomes unfalsifiable, as we could always argue that while John Smith has none of the known genetic defects, he may have an as yet unknown defect that causes the disease.

If we run a genetic test on John Smith, we could falsify (1) with the following finding:

(2) John Smith does not have the familial hypercholesterolemia genetic defect

However, this falsification can never be definitive, as (2) is itself an empirical statement. All diagnostic tests are exposed to uncertainty, and this is represented by their false positive and false negative rates. In this case, there is always a chance that John Smith’s diagnostic result was a false negative; that is, he may really have the genetic defect, but the test failed to identify it. No diagnostic test can definitively falsify a hypothesis. There is always some chance that the test result itself is false.
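This residual uncertainty can be quantified with Bayes’ theorem. The sensitivity, specificity and prevalence below are assumed, illustrative values, not the characteristics of any real test for familial hypercholesterolemia; the point is only that the probability of the defect given a negative test is small but never zero.

```python
# Assumed, illustrative test characteristics (not real figures).
sensitivity = 0.99    # p(test positive | has defect)
specificity = 0.995   # p(test negative | no defect)
prevalence  = 0.002   # p(has defect) in the background population

# Bayes' theorem: p(has defect | negative test)
p_neg_given_defect = 1 - sensitivity  # false negative rate
p_neg = p_neg_given_defect * prevalence + specificity * (1 - prevalence)
p_defect_given_neg = (p_neg_given_defect * prevalence) / p_neg

print(f"p(defect | negative test) = {p_defect_given_neg:.6f}")  # small, but not zero
```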

Popper notes that empirical hypotheses “can never become ‘probable’: they can only be corroborated, in the sense that they can ‘prove their mettle’ under fire—the fire of our tests.” [1, p259]. This same concept applies symmetrically to apparently falsifying empirical evidence. Popper’s philosophy can’t even give us a probability of falsification. Popper ultimately only gives us a theoretical means for demarcating scientific from meta-physical statements.

Note that I am not a scholar of Popper and that this discussion is not complete. I believe that Popper addressed these points well in his work and you should refer to it for more detail.

References

  1. Popper, Karl. The logic of scientific discovery. Routledge, 2005.
  2. Strong, Jack P., et al. “Prevalence and extent of atherosclerosis in adolescents and young adults: implications for prevention from the Pathobiological Determinants of Atherosclerosis in Youth Study.” JAMA 281.8 (1999): 727-735.

Version 2.0, 19th October, 2016

Representation of individual medical event claims

In the previous article, the logical representation of medical science claims was investigated. Scientific claims have the goal of compressing experimental evidence into rules or models that reliably represent the evidence and make good predictions about future events. In this article, a step is taken back and the representation of individual medical events is investigated: for example, the individual events that together make up the evidence used to arrive at a medical science claim.

Individual medical events are simpler to model and appear to be well modeled by the standard Davidsonian analysis discussed in the previous article (for a discussion of why this analysis is used, refer to that article). Below is an example discussed in the previous article:

(1) a.     Statins reduce myocardial infarction in normal humans

     b.    ∃e (REDUCE(statin, myocardial infarction, e) & IN(e, normal human))
     c.    “There is at least one case, such that statin administration reduced myocardial infarction, where the case was in a normal human”

This Davidsonian analysis of the statement (16a) was taken as a step in the investigation (it is provided above as (1)), but it was decided that the existential quantifier was not the correct analysis of the statement and that scientific claims like that in (1a) should be interpreted as universal quantification. If an individual participant in a clinical trial is considered, the following statement could be made:

(2) a.     Statins reduced myocardial infarction in John Smith

     b.    ∃e (REDUCE(statin, myocardial infarction, e) & IN(e, John Smith))
     c.    “There is at least one case, such that statin administration reduced myocardial infarction, where the case was in John Smith”

Here an important difference can be seen between individual cases and the claims that can be made via statistical analysis of a large number of individual cases as a group. It may be that the statins did reduce the likelihood of a myocardial infarction in John Smith. However, given that the actual event of a myocardial infarction is an irregular occurrence and is dependent on a large number of uncertain factors (eg. lifestyle, high physical/stress events, etc), there is no way of being confident from the evidence of one individual that the statins had any effect. This is why clinical trials are required: so that outside factors can be controlled to some degree and the number of participants can be large enough that, statistically, we can have reasonable confidence that the medical scientific claim is valid.

In the example above, the frequency of myocardial infarction is low in an individual and the result is often catastrophic. For this reason, studies of drugs for diseases like heart disease generally need to be long term and inclusive of end points (eg. death). If a disease that involves continuous or frequent disease events is considered, then a sentence like that in (2a) may make sense, in that the frequency of disease events prior to the introduction of the drug can be compared to the frequency afterwards. Take for example antibiotic treatment for bacterial infection:

(3) a.    Antibiotics reduce bacterial infection in John Smith

     b.   ∀e (REDUCE(antibiotics, bacterial infection, e) & IN(e, John Smith))
     c.   “In all cases, antibiotics reduce bacterial infection, where the case is in John Smith.”

In this example, bacterial infection may have occurred in John Smith a large number of times during his life. To assess the validity of the claim in (3) there would need to be a number of bacterial infections where no antibiotics were administered and then a number of bacterial infections where antibiotics were administered, to allow for a comparison of the control events (no antibiotics) to the treatment events. Where the difference was significant, the treatment might be considered a success and the claim (3) might be considered true (uncertainty is discussed further below). However, here again we see the importance of a clinical trial to assess the efficacy of a drug. The claim in (3) is specific to “in John Smith” and cannot be extended to “normal humans” generally. While John Smith is a human, he also has a unique genome, diet, lifestyle and life history. The effectiveness of a drug in one human is certainly support for its effectiveness in other humans, but it is always possible that the drug only works with John Smith’s specific genetic profile, for example. There is also the possibility that the bacterial infection went away by chance at the same time John Smith began receiving the antibiotic, or that it resolved through the placebo effect. This last point highlights the fact that the reduction does not actually imply causation.
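The contrast between the existential form used in (2b) and the universal form used in (3b) can be made concrete by evaluating the antibiotics claim over a set of case records under each reading. The records below are invented for illustration; they are not patient data.

```python
# Invented case records for one individual.
cases_john_smith = [
    {"treatment": "antibiotics", "infection_reduced": True},
    {"treatment": "antibiotics", "infection_reduced": True},
    {"treatment": "antibiotics", "infection_reduced": False},
]

def reduced(case):
    return case["treatment"] == "antibiotics" and case["infection_reduced"]

# Existential reading, as in (2b): at least one case shows a reduction.
print(any(reduced(c) for c in cases_john_smith))  # True

# Universal reading, as in (3b): every case must show a reduction,
# so the single failed case refutes the universal claim.
print(all(reduced(c) for c in cases_john_smith))  # False
```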

In the case of heart disease, the primary cause of myocardial infarction, there may be an indicator of disease other than myocardial infarction. Atherosclerosis describes the formation of lesions in the arteries. Atherosclerosis in the coronary arteries (arteries that supply blood to the heart) is a primary indicator of heart disease and increases the chance of myocardial infarction. The atherosclerotic lesions in the arteries (lesions from here on) exist prior to diagnosis of heart disease and after the commencement of treatment. So in theory they could be measured at many time points prior to treatment and after treatment with statins. If the lesions regressed after treatment, then this could be an indicator within one individual that statins reduce heart disease and by extension myocardial infarction. Unfortunately, lesion progression prior to diagnosis is rarely measured outside of specialised scientific studies. So there is no significant series of events prior to treatment. Secondly, lesions rarely regress significantly and the goal is usually to slow progression of lesions. To assess reduced progression would require the collection of far more time series events that included measures of magnitude and not just the presence or absence of lesions. Unlike say a skin infection, where the infection is visible externally with accompanying pain, the progression of atherosclerotic lesions is painless and difficult to assess given their location with our current technology. Finally, the evidence that lesion stage or size is positively correlated with myocardial infarction events is not definitive. For heart disease and many other diseases, the assessment of the performance of a drug is not possible within an individual, either due to the lack of time series data, the lack of a reliable indicator of disease or the requirement to base the assessment on the reduction of death in a cohort.

We have discussed the need for group based research, such as clinical trials, to make statements about the correlation between drugs and diseases. Well-designed clinical trials should be able to substantially exclude the placebo effect and take into account the performance of drugs based on end points like death, but it should be noted that chance can still never be ruled out as a factor. In studies involving chemicals or cells the sample size can be made very large, but with humans it is difficult to get very large sample sizes and to control all important variables. However, the statin meta-analyses included over 90,000 participants through the consolidation of many large studies.

Version 1.0, 30th September, 2016

Nano-publications – Review and Comparison to the Research Analysis model

Apart from the scientific knowledge management models of micropublications (Clark 2014) and the Biological Expression Language (http://www.openbel.org), the concept of nano-publications (Mons 2009, Groth 2010) is the most aligned with the goals of Research Analysis and has provided valuable insights. In this article we provide an overview of the nano-publication model and discuss how it relates to the Research Analysis (RA) model.

The authors propose five steps required to create and adopt nano-publications (Mons 2009, Groth 2010):

  1. Terms to Concepts: This step requires that all terms in a research article are mapped to non-ambiguous identifiers. In nano-publications this is referred to as a Concept, where a Concept is the smallest, unambiguous unit of thought. A concept is uniquely identifiable (Groth 2010). This is similar to, but more ambitious than, the Medical Subject Headings (MeSH) database. Using MeSH as an example, MeSH Headings are the equivalent of Concepts and the Entry Terms for each MeSH Heading are equivalent to the Terms or synonyms for each Concept. We agree with this general goal and strongly promote the use of standard language. We promote the use of MeSH terms in Research Analysis. In Research Analysis users can use the MeSH Entry Term (synonym) they are most comfortable with and the system then ensures that this is mapped to the main concept or MeSH Heading. This allows the user to work with the terminology that is most comfortable for them and their peers, rather than being forced to use an ideal concept. In a separate article, we discuss the challenges associated with defining an unambiguous unit of thought.
  2. Concepts to Statements: Here they propose that each smallest insight in exact sciences is a ‘triple’ of three concepts, though conditions are required to put the insight in context. The triple is in the form subject > predicate > object. For example, cholesterol > increases > atherosclerosis. A Statement is a uniquely identifiable triple, which can be achieved through the assignment of a unique identifier to the triple by annotation. RA initially implemented a cause-effect model for statements, which is a special case of the triple where the predicate must be a cause-effect predicate. We currently only offer the predicates increases, decreases and not significant. We chose to initially restrict the options for triples to allow for the collection of a consistent database that would allow for analysis.
  3. Annotation of Statements with Context and Provenance: It is not enough to store statements just in the form of their basic components, three concepts in a specific sequence. A statement only ‘makes sense’ in a given context, and taking a statement out of a research publication strips it of this context. The context in a nano-publication is defined by another set of concepts. The annotation is achieved technically through a triple such that the subject of the triple is a statement. For the example above, the species should be specified: mice do not get atherosclerosis, even on a high cholesterol diet, but humans do. Also, provenance is associated with Statements by annotation, eg. author and source. Claims in RA by default require that the user provide the organ/cell model, genetic model and species annotations that are relevant for each specific claim. RA automatically assigns unique identifiers to claims and requires that at least one supporting quotation is provided from a publication. The supporting quotations require that the PubMed ID (PMID) is provided. Additional conditions and context can be provided by users appending tags to the claim.
  4. Treating Richly Annotated Statements as Nano-Publications: treat these statements with conditional annotation as nano-publications, with proper attribution, so they can be cited and the authors can be credited. A nano-publication is a set of annotations that refer to the same statement and contains a minimum set of (community) agreed upon annotations (Groth 2010). This concept is similar to the claim model in RA: claims can be cited using their unique identifier, and viewing a claim provides details of all of the quoted statements and associated PMIDs that support the claim, along with any other context provided via tags.
  5. Removing Redundancy, Meta-analyzing Web-Statements: where statements are identical they would be removed to simplify the database, the goal being to reduce “undue repetition” and to help improve the identification of new statements. Groth et al. define S-Evidence as all the nano-publications that refer to the same statement (Groth 2010) and, as implied by the name, provide evidence for the statement. The original model for nano-publications focused more on the removal of redundancy, but the concept of S-Evidence shows more respect for the importance of replication and the potential for meta-analysis. In complex sciences like biology, the likelihood of a statement being true based on the evidence of one publication is surprisingly low; consider, for example, the poor reproducibility and re-usability of results reported in preclinical cancer research (Begley 2012). No single experiment, or for that matter any number of experiments, can fully demonstrate the truth of a statement. However, the collection of results that support a statement can, in a Popperian sense, provide some guidance as to the degree to which a scientific statement has had its mettle tested. It can also allow for the bridging of knowledge between subfields where different terms for the same concepts are regularly used. (A minimal sketch of the statement, annotation and S-Evidence structure follows this list.)
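As a rough illustration of steps 2–5, here is a minimal Python sketch of a statement triple annotated with context and provenance, together with an S-Evidence grouping. This is an assumed, simplified structure for explanation only, not the schema published by the nano-publication authors; the identifiers are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Statement:
    statement_id: str      # unique identifier for the triple (step 2)
    subject: str
    predicate: str
    obj: str

@dataclass
class NanoPublication:
    statement: Statement
    context: Dict[str, str]      # e.g. species, disease model (step 3)
    provenance: Dict[str, str]   # e.g. author, source (step 3)

def s_evidence(pubs: List[NanoPublication], statement_id: str) -> List[NanoPublication]:
    """All nano-publications that refer to the same statement (step 5, Groth 2010)."""
    return [p for p in pubs if p.statement.statement_id == statement_id]

stmt = Statement("S1", "cholesterol", "increases", "atherosclerosis")
pub1 = NanoPublication(stmt, {"species": "human"}, {"source": "PMID:<example-1>"})
pub2 = NanoPublication(stmt, {"species": "human"}, {"source": "PMID:<example-2>"})
print(len(s_evidence([pub1, pub2], "S1")))  # 2 nano-publications support statement S1
```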

Nano-publication Model

Figure 0: The Nano-publication Model taken from Groth 2010.

The goal of nano-publications

The nano-publications authors propose the goal of having scientific authors structure their data in such a way that computers understand them, and we support this goal. However, we feel it is likely that the formalisation of scientific knowledge will become a specialist task, like the coding of software design specifications into software code. It is not clear how the nano-publications authors see the knowledgebase of nano-publications being used. We strongly believe that while knowledge coding may be a specialist activity, most researchers in the biological fields will use tools based on such knowledgebases to help direct their research by identifying gaps, conflicts and opportunities in the current research.

Main differences between the nano-publications model and the Research Analysis models

We have discussed some similarities and differences between the nano-publication and RA models above, but here we go into a little more detail.

At the Concept level:

  • The nano-publication model requests that authors use unique identifiers from their Concept Wiki, which aggregates concept names from many databases.
  • RA feels that the task of aggregating names is too much and not necessary given the already available databases. Also, we prefer that scientists use names rather than IDs for concepts, as this is more intuitive and in line with our mission.
  • RA currently requests that all claims use NLM’s MeSH Terms or IDs for the names of the elements or concepts in a claim. Where terms are not in MeSH then the NCBI Protein and PubChem databases can be utilised. In rare cases new concept names can be added into RA.
  • RA uses these external name databases to identify synonyms and to ensure that if scientists search for a particular MeSH Term they will receive results for all of the MeSH terms under the same MeSH Heading.

At the Statement level (Claims in RA):

  • RA uses the concept of claims in place of the concept of statements in nano-publications. Claims are simply a type of statement that we feel is more intuitive.
  • With regard to our natural language claims, we ask that scientists use the form of a declarative sentence that is as logically clear as possible. These statements will not be as logically clear as the nano-publication model and are likely to be more complex.
  • Our standard cause-effect claim model is quite similar to the nano-publication model. The primary cause-effect part of the claim is simply a special case version of the nano-publication triple eg. Treatment/Cause [subject] > Effect [predicate] > Disease/Molecule [object].
  • The other components of our cause-effect claim model are equivalent to attribution in the nano-publication model. We have 3 as standard:
    • (cause-effect claim, where it occurs in, Organ/Cell Model) AND
    • (cause-effect claim, where it occurs in, Genetic Model) AND
    • (cause-effect claim, where it occurs in, Species);
  • Note that the three conditions should be considered as conjunctions with the cause-effect claim. Conjunction is required because the conditions are related: the Genetic Model affects the Organ/Cell Model, and both are specific to the Species. A simple annotation of each to the cause-effect claim would be ambiguous. It is only when all of the conditions are true that the cause-effect claim is valid.
  • In RA, statements must be supported by at least one quoted statement from a publication. These supporting quotes could be treated as annotations in the nano-publications model, but a difference between the two models is that a specific quote can be used as support for several claims. Further, for each PMID there are often several quoted statements. Technically these components could simply be repeated as annotations for each statement, but in the RA model a map or graph concept is used, where claims are supported by quotes and quotes are supported by PMIDs. Claims, quotes and PMIDs are all considered to be elements in a directed acyclic graph (a minimal sketch of this graph structure is given below). The S-Evidence concept in the nano-publication model goes some way to bridging this difference by grouping together nano-publications with the same statement; however, S-Evidence creates trees rather than a graph. This level of the RA model is much more like the micropublication model than the simpler nano-publication model.

Research Analysis Knowledge Management Model Diagram

Figure 1. Research Analysis Knowledge Management Model

  • The RA model does have some simple annotation-like elements that associate information with the statement/claim, such as a unique statement identifier, the author of the statement/claim (who may or may not be the author of the publication that the quoted sentence came from), the time and date of creation, and tags that in effect provide flexible and unlimited simple annotations.
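To illustrate the claim–quote–PMID structure described above, here is a minimal Python sketch of a directed acyclic support graph in which one quote supports two claims. The element names are placeholders, and this is an assumed illustration rather than the actual Research Analysis implementation.

```python
from collections import defaultdict

# Directed edges from an element to the elements that support it.
support_edges = defaultdict(list)

def add_support(element, supported_by):
    support_edges[element].append(supported_by)

# One quote supports two different claims; the quote is itself supported by a PMID.
add_support("claim:statins-decrease-mi", "quote:<example>")
add_support("claim:statins-decrease-coronary-events", "quote:<example>")
add_support("quote:<example>", "PMID:<example>")

def evidence(element):
    """Collect everything that directly or transitively supports an element."""
    found = []
    for child in support_edges.get(element, []):
        found.append(child)
        found.extend(evidence(child))
    return found

print(evidence("claim:statins-decrease-mi"))  # ['quote:<example>', 'PMID:<example>']
```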

Both the nano-publication and micro-publication models have a substantial focus on the technical aspects of encoding the data in semantic web schemas. The reason for this focus is the importance of making the data available in an open and semantically rich format. We respect this goal, but have chosen to hide as much of this detail from our users as possible. While the biological research community has become computer savvy over the past decades, the majority of the community are not trained in any computer programming languages and would find these schemas very unpleasant. I have a computer science degree from before web technology was popular and I still find it very unpleasant. This is a little unfair, as these are articles in informatics journals and certainly we acknowledge that the micropublication authors have built user oriented applications.

One of the key goals discussed in the article on the Research Analysis Mission is that our focus is on making knowledge management tools available to normal biological and medical scientists in an easy to use and powerful way – not making the knowledge available to a few hard core geeks and their supercomputers. For this reason, none of the bare bones schemas are visible to the users and the frontend terminology is focused on usability rather than theoretical correctness. This is also the reason why we provide a number of standard models for capturing scientific claims in RA. These models will not be flexible enough for some, but for the rest they will be much easier and straightforward to use. Adoption and the actual acceleration of discovery by normal scientists is our top priority.

References

Mons, B., & Velterop, J. (2009, October). Nano-Publication in the e-science era. In Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009).

Groth, P., Gibson, A., & Velterop, J. (2010). The anatomy of a nano-publication. Information Services and Use, 30(1-2), 51-56.

Clark, T., Ciccarese, P., & Goble, C. (2014). Micropublications: a semantic model for claims, evidence, arguments and annotations in biomedical communications. Journal of Biomedical Semantics, 5(1), 28.

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.

 

Mission of Research Analysis

The overarching mission of Research Analysis is to accelerate the rate of scientific discovery through the provision of knowledge management and discovery tools to scientists.

Our knowledge management platform:

  • captures scientific claims from the literature in a formalised structure.
  • provides rapid and powerful search tools to find existing scientific claims
  • provides analysis tools that assess:
    • the level of support for a claim in the literature
    • conflicting claims
    • knowledge gaps in the literature
  • helps scientists identify unique and promising hypotheses to test experimentally
  • is intuitive and easy to use

We have found that past efforts to build similar tools tend to focus on the computational aspects of knowledge management: how can we get all of the knowledge out of the literature and out of scientists’ heads and into a database that can then be mined by the geeks and AIs? This tends to imply that if only the computers could get all of the information, then they would do a better job than the humans. We disagree.

We don’t believe in a Robot Scientist future! We believe that humans continue to have a far superior ability to make the intuitive leaps that lead to scientific discovery. On the other hand, we believe that computers beat us hands down at managing vast stores of information and processing it rapidly using logical analysis. We believe that by providing powerful and intuitive knowledge management tools to leading scientists, they will be able to make higher quality intuitive leaps at a faster rate. The goal of RA is not to extract the knowledge and feed it to the robots, but instead to feed it back to the scientists using powerful tools that make it easy and fast to interrogate.

Some examples of benefits that can come from using Research Analysis include:

  • We have all had the experience of remembering a finding from a paper, but not the specific details. Even once you find the paper, where is the specific support within it? RA puts these key findings at your fingertips.
  • Even the greatest scientists of all time regularly thought they had a new discovery, but later (sometimes much later) found that some obscure scientist had already come up with it 10 years prior. RA allows you to rapidly identify if there is existing support for a scientific claim.
  • On the other hand, you want to know whether a scientific claim is valid. Use RA to find all of the papers that support and challenge the claim to assess its merit.
  • Use the analysis tools to visualise matrices of scientific claims to identify conflicts, trends or gaps in the literature. A powerful source of new research topics.

Research Analysis is constantly looking for opportunities to improve and extend our platform to support scientists in their work. The platform was originally designed to meet the challenges of our collaborators in their research and we feel that the best way to improve the platform is through helping scientists solve hard problems. We would appreciate feedback, suggestions and the opportunity to work with scientists to help solve their problems.