
Wednesday, September 23, 2020

Using the hypothes.is API to annotate PDFs

With somewhat depressing regularity I keep cycling back to things I was working on earlier but never quite got to work the way I wanted. The last couple of days it's been the turn of hypothes.is.

One of the things I'd like to have is a database of all taxonomic names such that if you clicked on a name you would get not only the bibliographic record for the publication where that name first appeared (which is what I've been building for animals in BioNames), but also the actual publication with the name highlighted in the text. This requires that the publication has been digitised (say, as a PDF) and is accessible, so let's assume that this is the case. Now, we could do this manually, but we have tools to find taxonomic names in text. And in my use case I often know which page the name is on and what the name is, so all I really want is to be able to highlight it programmatically (because I have millions of names to deal with).

So, time to revisit the hypothes.is API. One of the neat "tricks" hypothes.is has managed is the ability to annotate, say, a web page for an article and have that annotation automagically appear on the PDF version of the same article. As described in How Hypothesis interacts with document metadata, this is in part because hypothes.is extracts metadata from the article's web page, such as the DOI and a link to the PDF, and stores that with the annotation (I say "in part" because the other part of the trick is being able to locate annotations in different versions of the same text). If you annotate a PDF, hypothes.is stores the URL of the PDF and also a "fingerprint" of the PDF (see PDF Fingerprinting for details). This means that you can also add an annotation to a PDF offline (for example, on a file you have downloaded onto your computer) and, if hypothes.is has already encountered this PDF, that annotation will appear in the PDF online.

What I want to do is take a PDF, highlight the scientific name, and upload that annotation to hypothes.is so that it is visible online when anyone opens the PDF (and ideally when they look at the web version of the same article). I want to do this programmatically. Long story short, this seems doable. Here is an example annotation that I created and sent to hypothes.is via their API:

{
    "uri": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
    "document": {
        "highwire": {
            "doi": [
                "10.1590/1678-476620151053372375"
            ]
        },
        "dc": {
            "identifier": [
                "doi:10.1590/1678-476620151053372375"
            ]
        },
        "link": [
            {
                "href": "urn:x-pdf:6124e7bdb33241429158b11a1b2c4ba5"
            }
        ]
    },
    "tags": [
        "api"
    ],
    "target": [
        {
            "source": "http://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf",
            "selector": [
                {
                    "type": "TextQuoteSelector",
                    "exact": "Alpaida venger sp. nov.",
                    "prefix": "imens preserved in 75% ethanol. ",
                    "suffix": " (Figs 1-9) Type-material. Holot"
                },
                {
                    "type": "TextPositionSelector",
                    "start": 4834,
                    "end": 4857
                }
            ]
        }
    ],
    "user": "acct:xxx@hypothes.is",
    "permissions": {
        "read": [
            "group:__world__"
        ],
        "update": [
            "acct:xxx@hypothes.is"
        ],
        "delete": [
            "acct:xxx@hypothes.is"
        ],
        "admin": [
            "acct:xxx@hypothes.is"
        ]
    }
}
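
For anyone wanting to reproduce this, creating the annotation is a single authenticated POST to the hypothes.is API. Here is a minimal Python sketch (the API token is a placeholder, and annotation.json is assumed to hold the JSON above):

import json

import requests

API_TOKEN = "xxx"  # developer token from your hypothes.is account settings

# Load the annotation JSON shown above
with open("annotation.json") as f:
    annotation = json.load(f)

# POST it to the hypothes.is annotations endpoint
response = requests.post(
    "https://api.hypothes.is/api/annotations",
    headers={"Authorization": "Bearer " + API_TOKEN},
    json=annotation,
)
response.raise_for_status()
print("Created annotation", response.json()["id"])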

In this example the article and the PDF are linked by including the DOI and PDF fingerprint in the same annotation (thinking about this I should probably also have included the PDF URL in document.highwire.pdf_url[]). I extracted the PDF fingerprint using mutool and added that as the urn:x-pdf identifier.
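
For completeness, the fingerprint is just the first element of the /ID array in the PDF trailer (mutool show file.pdf trailer will print it). It can also be read in a few lines of Python; this sketch uses the pypdf library and assumes the PDF actually has an /ID entry (very old files may not, in which case a fallback fingerprint has to be computed, as described in PDF Fingerprinting):

from pypdf import PdfReader

reader = PdfReader("1678-4766-isz-105-03-00372.pdf")

# The first element of the /ID array in the trailer, as lowercase hex,
# is the fingerprint used in the urn:x-pdf identifier
pdf_id = reader.trailer["/ID"][0]
print("urn:x-pdf:" + bytes(pdf_id).hex())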

The actual annotation itself is described twice, once using character position (start and end of the text string relative to the cleaned text extracted from the PDF) and once by including short fragments of text before and after the bit I want to highlight (Alpaida venger sp. nov.). In my limited experience so far this combination seems to provide enough information for hypothes.is to also locate the annotation in the HTML version of the article (if one exists).
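
Generating the two selectors programmatically is straightforward once you have the text extracted from the PDF. A sketch (the 32-character context matches the prefix and suffix above, but the length is arbitrary, and hypothes.is's own anchoring code is more sophisticated than this):

def make_selectors(text, target, context=32):
    # Locate the first occurrence of the target string in the extracted
    # text and describe it both by quote context and by position
    start = text.find(target)
    if start == -1:
        raise ValueError("target not found in text")
    end = start + len(target)
    return [
        {
            "type": "TextQuoteSelector",
            "exact": target,
            "prefix": text[max(0, start - context):start],
            "suffix": text[end:end + context],
        },
        {"type": "TextPositionSelector", "start": start, "end": end},
    ]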

You can see the result for yourself using the hypothes.is proxy (https://via.hypothes.is). Here is the annotation on the PDF (https://www.scielo.br/pdf/isz/v105n3/1678-4766-isz-105-03-00372.pdf):


and here is the annotation on the HTML version (https://doi.org/10.1590/1678-476620151053372375):



If you download the PDF onto your computer and open the file in Chrome you can also see the annotation in the PDF (to do this you will need to install the hypothes.is extension for Chrome and click the hypothes.is icon in the toolbar).

In summary, we have a pretty straightforward way to automatically annotate papers offline using just the PDF.

Thursday, June 30, 2016

Aggregating annotations on the scientific literature: a followup on the ReCon16 hackday

In the previous post I sketched out a workflow to annotate articles using hypothes.is and aggregate those annotations. I threw this together for the hack day at ReCon 16 in Edinburgh, and the hack day gave me a chance to (a) put together a crude visualisation of the aggregated annotations, and (b) recruit CrossRef's Rachael Lammey (@rachaellammey) to do some annotations as well, so I could test how easy it was to follow my garbled instructions and contribute to the project.

We annotated the paper A new species of shrew (Soricomorpha: Crocidura) from West Java, Indonesia (doi:10.1644/13-MAMM-A-215). If you have the hypothes.is extension installed you will see our annotations on that page; if not, you can see them using the hypothes.is proxy: https://via.hypothes.is/http://dx.doi.org/10.1644/13-MAMM-A-215.

Rachael and I both used the IFTTT tool to send our annotations to a central store. I then created a very crude summary page for those annotations: http://bionames.org/~rpage/recon16-annotation/www/index.html?id=10.1644/13-MAMM-A-215. When this page loads it queries the central store for annotations on the paper with DOI 10.1644/13-MAMM-A-215, then creates some simple summaries.

For example, here is a list of the annotations. The list is "typed" by tags; that is, you can tell the central store what kind of annotation is being made using the "tag" feature in hypothes.is. In this example, we've picked out taxonomic names, citations, geographical coordinates, specimen codes, grants, etc.

[Screenshot: the list of annotations, grouped by tag]
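
Generating these summaries is little more than a query and a group-by. A sketch of what the summary page does (the Cloudant URL and view name here are placeholders, not the real ones):

from collections import defaultdict

import requests

# Placeholder URL: a CouchDB view in the central store keyed by article DOI
VIEW = "https://example.cloudant.com/annotations/_design/app/_view/by_doi"

rows = requests.get(VIEW, params={"key": '"10.1644/13-MAMM-A-215"'}).json()["rows"]

# Bucket the annotations by their hypothes.is tags
by_tag = defaultdict(list)
for row in rows:
    annotation = row["value"]
    for tag in annotation.get("tags", []):
        by_tag[tag].append(annotation)

for tag, annotations in sorted(by_tag.items()):
    print(tag, len(annotations))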

Given that we have latitude and longitude pairs, we can generate a map:

[Screenshot: map of the point localities extracted from the paper]

The names of taxa can be enhanced by adding pictures, so we have a sense of what organisms the paper is about:

[Screenshot: taxon names from the paper with accompanying images]

The metadata on the web page for this article is quite rich, and hypothes.is does a nice job of extracting it, so that we have a list of DOIs for many of the articles this paper cites. I've chosen to add annotations for articles that lack DOIs but which may be online elsewhere (e.g., BioStor).

[Screenshot: the literature cited, with DOIs and BioStor links]

What's next

This demo shows that it's quite straightforward to annotate an article and pull those annotations together to create a central database that can generate new insights about a paper. For example, we can generate a map even if the original paper doesn't provide one. Conversely, we could use the annotations to link entities such as museum specimens to the literature that discusses those specimens. Given a specimen code in a paper we could look up that code in GBIF (using GBIF's API, or a tool like "Material Examined", see Linking specimen codes to GBIF). Hence we could go from code in paper to GBIF, or potentially from GBIF to the paper that cites the specimen. Having a central annotation store potentially becomes a way to build a knowledge graph linking different entities that we care about.
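
As a concrete example of the specimen-to-occurrence direction, GBIF's occurrence search API accepts a catalogue number directly. A sketch (the catalogue number below is illustrative, not one from the paper):

import requests

# Search GBIF occurrences by catalogue number
response = requests.get(
    "https://api.gbif.org/v1/occurrence/search",
    params={"catalogNumber": "MZB 12345", "limit": 5},
)
for occurrence in response.json()["results"]:
    print(
        occurrence.get("key"),
        occurrence.get("institutionCode"),
        occurrence.get("scientificName"),
    )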

Of course, a couple of people manually annotating a few papers isn't scalable, but because hypothes.is has an API we can scale this approach (for another experiment see Hypothes.is revisited: annotating articles in BioStor). For example, we have automated tools to locate taxonomic names in text. Imagine that we use those tools to create annotations across the accessible biodiversity literature. We can then aggregate those into a central store and we have an index to the literature based on taxonomic name, but we can also click on any annotation and see that name in context as an annotation on a page. We could manually augment those annotations, if needed, for example by correcting OCR errors.

I think there's scope here for unifying the goals of indexing, annotation, and knowledge graph building with a fairly small set of tools.

Thursday, June 23, 2016

Aggregating annotations on the scientific literature: a hack for ReCon 16

I will be at ReCon 16 in Edinburgh (hashtag #ReCon_16), the second ReCon event I've attended (see Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit). For the hack day that follows I've put together some instructions for a way to glue together annotations made by multiple people using hypothes.is. It works by using IFTTT to read a user's annotation stream (i.e., the annotations they've made) and then post those to a CouchDB database hosted by Cloudant.

Why, you might ask? Well, I'm interested in using hypothes.is to make machine-readable annotations on papers. For example, we could select a pair of geographic coordinates (latitude and longitude) in a paper, tag it "geo", then have a tool that takes that annotation, converts it to a pair of decimal numbers, and renders it on a map.

[Screenshot: a pair of coordinates highlighted in a paper, rendered as a point on a map]

Or we could be reading a paper whose literature cited section lacks links to the cited literature (i.e., there are no DOIs). We could add those by selecting each reference, pasting in the DOI as the annotation, and tagging it "cites". If we aggregate all those annotations then we could write a query that lists all the DOIs of the cited literature (i.e., it builds a small part of the citation graph).

By aggregating across multiple users we effectively crowdsource the annotation problem, but in a way that we can still collect those annotations. For this hack I'm going to automate this collection by enabling each user to create an IFTTT recipe that feeds their annotations into the database (they can switch this feature off at any time by switching off the recipe).

Manual annotation is not scalable, but it does enable us to explore different ways to annotate the literature, and what sort of things people may be interested in. For example, we could flag scientific names, grant numbers, localities, specimens, concepts, people, etc. We could explore what degree of post-processing would be needed to make the annotations computable (e.g., converting 8°07′45.73″S, 63°42′09.64″W into decimal latitude and longitude).
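
For well-formed coordinates that conversion is only a few lines of code. A sketch (real text would need far more tolerant parsing, as the literature uses many coordinate formats):

import re

def dms_to_decimal(dms):
    # Parse a coordinate such as 8°07′45.73″S into decimal degrees,
    # negating values in the southern and western hemispheres
    match = re.match(r"(\d+)°(\d+)′([\d.]+)″([NSEW])", dms.strip())
    if match is None:
        raise ValueError("unrecognised coordinate: " + dms)
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in "SW" else value

print(dms_to_decimal("8°07′45.73″S"))   # -8.12937...
print(dms_to_decimal("63°42′09.64″W"))  # -63.70268...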

If this project works I hope to learn something about what people want to extract from the literature, and to what extent having a database of annotations can provide useful information. This will also help inform my thinking about automated annotation, which I've explored in Hypothes.is revisited: annotating articles in BioStor.

Wednesday, June 22, 2016

What happens when open access wins?

The last few days I've been re-reading articles about Ted Nelson's work (including the ill-fated Project Xanadu), reading articles celebrating his work (brought together in the open access book "Intertwingled"), playing with Hypothes.is, and thinking about annotation and linking. One of the things which distinguishes Nelson's view of hypertext from the current web is that for Nelson links are first class citizens: they are persistent, they are bidirectional, and they can link not just documents but parts of documents. On the web, links are unidirectional: when I link to something, the page I link to has no idea that I've made that link. Knowing who links to you turns out to be both hard to work out and very valuable. In the academic world, links between articles (citations) form the basis of commercial databases such as the Web of Science. And of course, the distribution of links between web pages forms the basis of Google's search engine. Just as attempts to build free and open citation databases have come to nothing, there is no free and open search engine to compete with Google.

The chapters in "Intertwingled" make clear that hypertext had a long and varied history before being subsumed by the web. One project which caught my eye was Microcosm, which led me to the paper "Dynamic link inclusion in online PDF journals" (doi:10.1007/BFb0053299, there's a free preprint here). This article tackles the problem of adding links to published papers. These links could be to other papers (citations), to data sets, to records in online databases (e.g., DNA sequences), to names of organisms, etc. The authors outline four different scenarios for adding these links to an article.

In the first scenario the reader obtains a paper from a publisher (either open access or from behind a paywall), then, using a "linkbase" that they have access to, they add links to the paper.

[Diagram: scenario 1, the reader applies their own linkbase to the paper]

This is very much what Hypothes.is offers: you use their tools to add annotations to a paper, and those annotations remain under your control.

In the second scenario, the publisher owns the linkbase and provides the reader with an annotated version of the paper.

[Diagram: scenario 2, the publisher owns the linkbase and supplies an annotated paper]

This is essentially what tools like ReadCube offer. The two remaining scenarios cover the case where the reader doesn't get the paper from the publisher but instead gets the links. In one of these scenarios (shown below) the reader sends the paper to the publisher and gets the linked paper back in return; in the other (not shown) the reader gets the links but uses their own tools to embed them in the paper.

[Diagram: scenario 3, the reader sends the paper to the publisher and receives the linked paper in return]

If you're still with me at this point you may be wondering how all of this relates to the title of this essay ("What happens when open access wins?"). Well, imagine that academic publishing eventually becomes overwhelmingly open access, so that publishers are making content available for free. Is this a sustainable business model? Might a publisher, seeing the writing on the wall, start to think about what they can charge for, if not articles? (I'm deliberately ignoring the "author pays" model of open access, as I'm not convinced it has a long-term future.)

In the diagrams above the "linkbase" is on the publisher's side in two of the three scenarios. If I was a publisher, I'd be looking to assemble proprietary databases and linking tools to create value that I could then charge for. I'm sure this is happening already. I suspect that the growing trend to open access for publications is not going to be enough to keep access to scientific knowledge itself open. In many ways publications themselves aren't terribly useful; it's the knowledge they contain that matters. Extracting, cross linking, and interpreting that knowledge is going to require sophisticated tools. The next challenge is going to be ensuring that the "linkbases" generated by those tools remain free and open, or an "open access" victory may turn out to be hollow.

Wednesday, September 02, 2015

Hypothes.is revisited: annotating articles in BioStor

Over the weekend, out of the blue, Dan Whaley commented on an earlier blog post of mine (Altmetrics, Disqus, GBIF, JSTOR, and annotating biodiversity data). Dan is the project lead for hypothes.is, a tool to annotate web pages. I was a bit dismissive, as hypothes.is falls into the "sticky note" camp of annotation tools, which I've previously been critical of.

However, I decided to take another look at hypothes.is and it looks like a great fit for another annotation problem I have, namely augmenting and correcting OCR text in BioStor (and, by extension, BHL). For a subset of BioStor I've been able to add text to the page images, so you can select that text as you would on a web page or in a PDF with searchable text. If you can select text, you can annotate it using hypothes.is. Then I discovered that hypothes.is is not only a Chrome extension (which immediately limits who will use it): you can also add it to any web site that you publish. So, as an experiment I've added it to BioStor, so that people can comment on BioStor using any modern browser.

So far, so good, but the problem is I'm relying on the "crowd" to come along and manually annotate the text. But I have code that can take text and extract geographic localities (e.g., latitude and longitude pairs), specimen codes, and taxonomic names. What I'd really like to do is be able to pre-process the text, locate these features, then programmatically add those as annotations. Who wants to do this manually when a computer can do most of it?

Hypothes.is, it turns out, has an API that, while a bit *cough* skimpy on documentation, enables you to add annotations to text. So now I could pre-process the text, and just ask people to add things that have been missed, or flag errors in the automated annotations.

This is all still very preliminary, but as an example here's a screenshot of a page in BioStor together with geographic annotations displayed using hypothes.is. You can see this live at http://biostor.org/reference/147608/page/1 (make sure you click on the widgets at the top right of the page to see the annotations):

[Screenshot: a BioStor page with two geographic annotations shown by hypothes.is]

The page shows two point localities that have been extracted from the text, together with a static Google Map showing the localities (hypothes.is supports Markdown in annotations, which enables links and images to be embedded).

Not only can we write annotations, we can also read them, so if someone adds an annotation (e.g., highlights a specimen code that was missed, or some text that OCR has messed up) we could retrieve that and, for example, index the corrected text to improve findability.
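
Reading annotations back is a simple unauthenticated query against the hypothes.is search API. A sketch, using the BioStor page above:

import requests

# Fetch public annotations on a given page via the hypothes.is search API
response = requests.get(
    "https://api.hypothes.is/api/search",
    params={"uri": "http://biostor.org/reference/147608/page/1", "limit": 50},
)
for row in response.json()["rows"]:
    print(row["id"], row.get("tags", []), row.get("text", "")[:60])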

Lots still to do, but these early experiments are very encouraging.