Perils of Keyword-Based Bibliometrics: ISI’s ‘1990 Effect’

Have you done historical bibliometric analysis of a scientific field or topic area and found a massive increase in research articles after 1990?  Are you using ISI’s Web of Science and searching by topic or keyword?  If so, don’t make the same mistake I did: these results aren’t evidence of some sea change or paradigm shift, but rather an artifact of a poorly-documented change in how ISI began indexing articles in 1990.

If you are interested in the history of contemporary science, particularly in the 1980s and ’90s, citation analysis can be a useful tool to discover broad trends in scientific research.  In this area, ISI’s Web of Science is the de facto source for this data, claiming to be the most comprehensive database of articles and journals. They index articles using a number of categories, including author, title, publication, subject, topic, and more. With a built-in results analyzer, it is very easy to chart the top authors in a subject, the journals that publish the most in a given field, or, as I was interested in, the growth of a particular topic over time.

I’m currently researching the history of a software suite for the simulation and modeling of molecules, and since it is commonplace to cite its debut article whenever research is done with the tool, citation analysis is quite painless. I learned through archival research that a certain feature was added in 1990 that would make the simulation of enzymes much easier.  The obvious question is whether it had any measurable effect on the amount of research being done with this tool to study enzymes. So I told ISI to give me a list of all articles citing the original software article with the topic “enzyme” between 1985 and 1994. I found the most beautiful results:

According to the citation counts, it seems pretty clear that enzyme research using this program took off dramatically after 1990.  Knowing that correlation doesn’t equal causation, I restrained myself from thinking that the introduction of this new feature in 1990 caused the growth, but I knew that there had to be something here.  Perhaps enzymes were becoming interesting after 1990 for some external reason (increased funding or relevance, new discoveries, etc.) that caused both the new feature and the increased research.  So I did a database-wide search for all articles on the topic “enzyme” and analyzed it by year.  What I found was even more remarkable:

After 1990, all enzyme research appears to take off dramatically, with a 300% increase in a single year.  I knew I was onto something here, and candidate explanations kept coming to mind: did the Human Genome Project spur this massive interest in enzymes?  Was there a general increase in science funding at this time, a worldwide biology research initiative (like the International Geophysical Year), or the takeoff of the biomedical/biochemical industries?  Whatever it was, I had a lead on something big, something that I hadn’t seen in any of the literature on the history of contemporary bioscience.

I began to search the literature for bibliometric research with phrases like “after 1990” and “after 1991”, combined with various synonyms for rapid growth.  I found a number of other historians and sociologists of science making the same kind of argument that I was considering: important events happened in 1988–1990, and these events had to have at least some effect on the massive explosion of articles in a given discipline, subject area, or sub-specialty.  All of them used ISI, and all of them narrowed their search by topic.  While my intent was to find something in fields related to biochemistry, these articles were making the argument across the sciences, including nanotechnology, materials science, mental health, oceanography, and more.  So I ran the same kind of analysis as before, but this time with a wide range of topic keywords (and scaled the results by the relative increase in citations from the previous year):

As is clear, topics from numerous disciplines and interdisciplinary fields remain steady until 1990, jump massively, and then plateau.  The jump is anywhere from 140% to 330%, but the fact that they all occur in the exact same year seems too perfect.  Even if there had been a massive, across-the-board increase in science funding, research cycles are too varied to produce a single-year spike: some kinds of studies can expect findings in six months, while others take years, so a funding surge should smear its effect across several publication years.  The lack of residual effects after 1991 makes this even more unlikely: while the percent increase from 1990 to 1991 varies by topic, the growth from ’91 to ’92 is no more than +/- 10%.
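To make this concrete, here is a minimal sketch in Python of the year-over-year scaling used above, with made-up counts standing in for the real ISI data and an arbitrary 50% threshold for flagging a suspicious jump:

```python
# Hypothetical yearly article counts for one topic keyword; the real
# figures would come from ISI's per-year results analyzer.
counts = {1988: 410, 1989: 430, 1990: 445, 1991: 1390, 1992: 1420}

# Year-over-year relative increase, i.e. each year's growth over the last.
for year in sorted(counts)[1:]:
    growth = (counts[year] - counts[year - 1]) / counts[year - 1]
    flag = "  <-- suspicious jump" if abs(growth) > 0.5 else ""  # arbitrary 50% cutoff
    print(f"{year - 1}->{year}: {growth:+.0%}{flag}")

# 1988->1989: +5%
# 1989->1990: +3%
# 1990->1991: +212%  <-- suspicious jump
# 1991->1992: +2%
```

A genuine surge in a field shows up as elevated growth sustained over several years; a single isolated spike like the one in 1990–91, repeated identically across unrelated topics, points at the database instead.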

Occam’s razor leads me to believe that these anomalies are an artifact of ISI’s Web of Science, not of scientific publishing itself.  The most likely explanations would be that in 1990, 1) a large number of new journals (most likely less popular ones) were added, 2) new kinds of research materials (books, conference proceedings, data sets, etc.) were added, or 3) ISI’s method for determining article topics was changed (such as including author keywords or abstracts).  I suspect #3, and after far too much digging, I found some confirmation in [an essay on citation indexing](http://thomsonreuters.com/products_services/science/free/essays/concept_of_citation_indexing/) written by ISI’s founder:

Through large test samples, we concluded that the titles of papers cited in reviews and other articles were sufficient to add useful descriptive words and phrases to the citing paper. This was later confirmed in studies by A. J. Harley, as Irv Sher and I recently reported. [11, 12]

In 1990, ISI (now Thomson Reuters) was able to introduce this citation-based method of derivative (algorithmic) subject indexing, called KeyWords Plus®. [7, 8] In addition to title words, author-supplied keywords, and/or abstract words, KeyWords Plus supplies words and phrases to enhance these other descriptors and thereby retrievability. These KeyWords Plus terms are derived from the titles of cited papers, which have been algorithmically processed to identify the most-commonly recurring words and phrases.
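To see why this matters for topic searches, here is a toy sketch of the general idea described in the quote (my own illustration, not ISI’s actual algorithm; the stopword list, tokenization, and thresholds are all invented): take the titles of the papers an article cites, and promote the most frequently recurring words into extra keywords for the citing article.

```python
from collections import Counter
import re

# Words too common to be useful as keywords (illustrative list only).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "with", "by"}

def keywords_plus(cited_titles, top_n=5, min_count=2):
    """Toy citation-derived indexing: the most frequently recurring
    non-stopwords across the titles of the papers an article cites."""
    words = Counter()
    for title in cited_titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS:
                words[word] += 1
    return [w for w, n in words.most_common(top_n) if n >= min_count]

# An article whose own title and keywords never say "enzyme" can still
# acquire it as a derived keyword from the papers it cites.
cited = [
    "Molecular dynamics of an enzyme active site",
    "Free energy simulations of enzyme catalysis",
    "Simulation of protein dynamics in solution",
]
print(keywords_plus(cited))  # -> ['dynamics', 'enzyme']
```

The upshot for bibliometrics is that from 1990 on, a topic search matches not just an article’s own title, keywords, and abstract, but also the titles of its cited references, and that alone is enough to multiply the hit count for a popular term.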

Unfortunately, this new algorithm for topic indexing appears to have been introduced without any way of distinguishing its output from the old indexing.  As far as I can tell, there is no way to search only pre-1990-style keywords in post-1990 articles, meaning that ISI’s topics and keywords are useless for historical bibliometrics spanning this date.  And thanks to what I’m calling ‘the 1990 effect’ (someone give me a better term, please!), many researchers are being led badly astray!
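The best workaround I can think of (and this is my own assumption about the database, so verify it against pre- and post-1990 samples before relying on it) is to restrict searches to a field whose indexing did not change across the boundary, such as the article title, instead of the composite topic field:

```python
# Hypothetical Web of Science advanced-search queries.  The field tags
# (TS = topic, TI = title, PY = publication year) reflect WoS syntax as
# I understand it; double-check them against the product itself.
topic_query = "TS=enzyme AND PY=1985-1994"  # topic field: indexing changed in 1990
title_query = "TI=enzyme AND PY=1985-1994"  # title words only: should be stable across 1990
```

Title-only searches will undercount (they miss articles that mention a term only in the abstract or author keywords), but at least they undercount consistently on both sides of 1990.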