Here's What You Need to Know About Reliability and Validity

RoCarpenter (https://www.servicescape.com/writers/rocarpenter), United States

Outside of the world of research, 'reliability' and 'validity' are often used interchangeably. Because of this colloquial use, the true meaning of these words has become clouded. This article will explain the differences between these words from the statistical perspective and discuss the types of reliability and validity, as well as how these two constructs interact. We will start with a list of definitions, first defining reliability and validity as umbrella terms, and subsequently breaking down the different subtypes below each.

The major consideration with regard to reliability versus validity is that reliability relates only to how consistent a particular metric is; it does not consider the accuracy of the measure, which is the domain of validity. For example, an uncalibrated piece of equipment may consistently give the same results while testing a sample, and can therefore be considered reliable. It will not give accurate results, however, so the results are not valid. It is as if you set your bathroom scale to show that you are twenty pounds lighter than you actually are. It would reliably give you roughly this weight every day, but the reading would not be accurate and is therefore not valid.

An uncalibrated bathroom scale can give consistent readings without giving accurate ones. Photo by i yunmai on Unsplash.
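The bathroom scale example can be put into numbers. The short sketch below uses made-up readings (the true weight of 160 lb and the seven daily readings are assumptions for illustration only) and summarizes them with two quantities: the spread of the readings, which speaks to reliability, and the average offset from the true weight, which speaks to validity.

```python
from statistics import mean, stdev

# Hypothetical data: a person who truly weighs 160 lb steps on a scale
# that has been set 20 lb light, once a day for a week.
TRUE_WEIGHT = 160.0
readings = [140.2, 139.8, 140.1, 139.9, 140.0, 140.3, 139.7]

# Small spread: the scale gives nearly the same number every day (reliable).
spread = stdev(readings)

# Large offset: the average reading is far from the true weight (not valid).
bias = mean(readings) - TRUE_WEIGHT

print(f"spread: {spread:.2f} lb, bias: {bias:.2f} lb")
```

A small spread together with a large bias is exactly the 'reliable but not valid' situation described above.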

Definitions for various types of reliability

In order to get a greater depth of understanding of these fundamental concepts, it is important to discuss a few of the different types of reliability commonly considered across numerous fields of research. These constructs include the following subtypes:

Reliability—The consistency of a metric
Consistency—As discussed above, this is the core of reliability. Something that is a consistent measure will provide the same results no matter how many times you run a sample.
Internal Consistency (Homogeneity)—This is often tested with the split-half method, in which the items of an instrument are divided in half and the scores on the two halves are compared to confirm that they agree. Other common tests include the Kuder-Richardson test, a more complex version of the split-half test, and Cronbach's alpha.
Stability—Stability commonly refers to test-retest reliability, that is, the 'repeatability' of the test. It is generally assessed with a correlation coefficient between the two administrations: a coefficient of less than 0.3 is weak, 0.3-0.5 is moderate, and above 0.5 is strong, meaning the measure is more stable. Pearson's r is a commonly used correlation coefficient.
Equivalence—This is assessed using inter-rater reliability, which is another common term for this metric. Inter-rater reliability is achieved when the results are reliable even if a different person is doing the assessment or running the sample.
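To make internal consistency more concrete, here is a minimal sketch of Cronbach's alpha using only the Python standard library. The survey data (five respondents, four items) and the function name cronbach_alpha are hypothetical and chosen for illustration. The formula used is the standard one: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # regroup the scores by item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical survey: 5 respondents answering 4 items on a 1-5 scale.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

A commonly cited rule of thumb treats an alpha of 0.7 or higher as acceptable, although the appropriate cut-off depends on the field and the purpose of the instrument.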

Further information on these topics can be found in the Research Made Simple article in Evidence Based Nursing by Heale and Twycross (2015). Additionally, a common example of test-retest reliability provided in statistics classes, and discussed by Pagano (2010), is the IQ test. If one assumes that a person's IQ is stable over time, this test is a relatable example of test-retest reliability: no matter how many times you take the test, the score will be approximately the same. This example also works for inter-rater reliability, as it does not matter whether the test is given by two different people or taken as a computerized version; it will still provide reliable results. The test will consistently generate the same score for the participant; however, this does not address the validity of the test.
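The IQ example can be sketched in a few lines. The code below computes Pearson's r by hand and labels it with the weak, moderate, and strong bands given in the definition of stability. All scores are invented for illustration, and the helper names pearson_r and strength are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

def strength(r):
    """Label a coefficient using the 0.3 and 0.5 cut-offs from the text."""
    r = abs(r)
    if r < 0.3:
        return "weak"
    if r <= 0.5:
        return "moderate"
    return "strong"

# Hypothetical IQ scores for the same six people.
first_sitting = [98, 105, 112, 120, 91, 130]
second_sitting = [101, 103, 115, 118, 94, 127]   # same test, months later
other_examiner = [99, 106, 110, 121, 93, 128]    # same test, different rater

test_retest = pearson_r(first_sitting, second_sitting)
inter_rater = pearson_r(first_sitting, other_examiner)
print(f"test-retest r = {test_retest:.2f} ({strength(test_retest)})")
print(f"inter-rater r = {inter_rater:.2f} ({strength(inter_rater)})")
```

Both coefficients describe consistency only. A high r says nothing about whether the test actually measures intelligence.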

Reliability is also a synonym for statistical significance, which occurs when one is able to reject the null hypothesis. The null hypothesis is essentially the assertion that there is no difference between the two (or more) populations being examined. In responsible research, scientists do not try to prove their idea; they try to see if they can disprove it, so they check whether or not they can reject the null. When the null hypothesis is rejected, this means that the results of a particular test are unlikely to be due to chance alone: the probability of seeing results at least this extreme if the null hypothesis were true is below a chosen threshold, conventionally 0.05 (5%). As Pagano (2010) says, "It might have been better to use the term reliable to convey this meaning rather than significant. However, the usage of significant is well established, so we will have to live with it."
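One way to see what rejecting the null hypothesis means in practice is an exact permutation test. This technique is chosen here for illustration and is not one named in the article. The sketch asks: if group labels were assigned purely by chance, how often would the difference in means be at least as large as the one observed? The data and the function name permutation_p_value are hypothetical.

```python
from itertools import combinations
from statistics import mean

ALPHA = 0.05  # conventional significance threshold (5%)

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = count = 0
    # Try every possible way of relabelling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        extreme += diff >= observed - 1e-12   # tolerance for float rounding
        count += 1
    return extreme / count

# Hypothetical scores from two small samples.
clearly_different = permutation_p_value([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
barely_different = permutation_p_value([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])

print(clearly_different < ALPHA)  # null hypothesis rejected
print(barely_different < ALPHA)   # null hypothesis not rejected
```

With five values per group there are 252 possible relabellings, so the test can enumerate all of them. Larger samples would normally call for random sampling of relabellings or a parametric test instead.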

Definitions for various types of validity

The definitions you will need for the various types of validity are listed below.

Validity—Accurate measurement
Content Validity—If the metric in question covers all of the aspects that need to be considered for a given variable in order to accurately assess it
Face Validity—This is a subset of content validity in which experts in the field assess whether or not a particular instrument is capable of accurately measuring a particular variable
Construct Validity—The test scores allow you to draw inferences about the concept being studied
Homogeneity—The metric is only reflecting one theory, more specifically that the experimental samples' scores have the same finite variance (the statistical properties are the same across the data set)
Convergence—The instrument produces similar results to established metrics that assess the concept in question
Theory Evidence—The test results are representative of observable evidence, for example if the IQ test provides a high score for an individual and they actually have a high degree of general intelligence
Criterion Validity—The instrument used to assess the construct in question highly correlates, greater than 0.5, with other modes of measurement for similar variables
Convergent Validity—The demonstration that a particular instrument correlates greater than 0.5 with other instruments that measure a similar variable
Divergent Validity—The demonstration that there is a correlation of less than or equal to 0.3 between instruments intended to measure different variables
Predictive Validity—The ability of an instrument to forecast future outcomes related to the variable in question

Two further types of validity deserve consideration, as described in Research Design and Statistical Analysis, a rather daunting and heavy text by Myers, Well, and Lorch (2010):

Internal Validity—The observations made using a particular measure can be attributed to the variable being manipulated, that is, the independent variable
External Validity—This is the degree to which the observations made can be related to other populations of interest or related conditions

Interactions between reliability and validity

As illustrated below in a diagram used by many sources, there are interactions between reliability and validity. On the first dartboard, you can see a pictographic demonstration of data that is reliable but not valid. The player consistently hits roughly the same spot but is never on target, and is therefore not accurate. In the second example, the player always hits the board, so the result is arguably accurate if one accepts a rather high margin of error, but you cannot rely on consistency. The third graphic demonstrates a condition in which the data is neither reliable nor accurate; the player hits only part of the target, and the shots are not evenly distributed around the bull's eye, which symbolizes the variable that is supposed to be under scrutiny. The fourth board is the ideal that one strives for in science: not only does the data consistently show similar values, but it also accurately assesses the experimental variable of interest, the bull's eye.
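The four dartboards can be imitated with a few invented coordinates. In the sketch below, each throw is an (x, y) position relative to the bull's eye at (0, 0). Reliability is judged by how tightly the throws cluster around their own average position, and validity by how close that average position is to the bull's eye. The throw coordinates, the function name assess, and both cut-off values of 1.0 are assumptions made purely for illustration.

```python
from math import hypot
from statistics import mean

def assess(throws, max_spread=1.0, max_offset=1.0):
    """Return (reliable, valid) for a list of (x, y) throws around a bull's eye at (0, 0)."""
    cx = mean([x for x, _ in throws])
    cy = mean([y for _, y in throws])
    # Spread: average distance of each throw from the centre of the group.
    spread = mean([hypot(x - cx, y - cy) for x, y in throws])
    # Offset: distance of the centre of the group from the bull's eye.
    offset = hypot(cx, cy)
    return spread <= max_spread, offset <= max_offset

# Hypothetical throws for the four boards described above.
boards = {
    "reliable, not valid": [(4.0, 3.1), (4.2, 2.9), (3.9, 3.0), (4.1, 3.2)],
    "valid, not reliable": [(5, 0), (-5, 0), (0, 5), (0, -5)],
    "neither": [(2, 6), (7, 1), (1, 1), (8, 7)],
    "reliable and valid": [(0.1, -0.2), (-0.2, 0.1), (0.2, 0.1), (-0.1, 0.0)],
}

for label, throws in boards.items():
    reliable, valid = assess(throws)
    print(f"{label}: reliable={reliable}, valid={valid}")
```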

Summary of key points

Reliability=Consistency→Statistical Significance
Validity=Accuracy
Reliability+Validity=Credible Experimental Results

Final thoughts

Although statistical analysis can be daunting when you are first introduced to it, a solid foundational understanding of the jargon specific to the field will reduce the likelihood of confusion as you move into more advanced topics, apply statistics to your own data, or try to discuss statistical results with others. I encourage you to look deeper into the specific statistical analyses that are commonly used in your field to facilitate your understanding of these concepts as they relate to your life. Initially, these topics may be confusing or dry, but once you become familiar with them they will prove to be excellent tools to have in your proverbial belt. Additionally, a basic understanding of research and statistics will protect you from the charlatans of the world who try to misguide others with fancy words and flawed data. As American astrophysicist, author, and science communicator Neil deGrasse Tyson once said, "Science literacy is a vaccine against the charlatans of the world that would exploit your ignorance."