33GB of public domain JSTOR articles, and a manifesto

giberson · on July 21, 2011

If I weren't too timid to risk doing so, I would do the following (read: I hope someone else does this).

Process the PDFs with an OCR program to extract as much text from each document as possible. The extraction should be done page by page, so the extracted text can be referenced to a PDF page number.

Then, provide a searchable/browsable directory of the extracted content. Each page of text has a link to the original PDF page so you can easily open up the PDF to the page the text was extracted from.
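
A minimal sketch of that page-level directory, assuming the OCR step has already produced one block of text per PDF page. The function names and file names here are made up for illustration, and the `#page=N` link form is an assumption about how the viewer opens PDFs (many browser PDF viewers honor it):

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build an inverted index from per-page OCR text.

    pages: dict mapping (pdf_name, page_number) -> extracted text.
    Returns a dict mapping each word to the set of pages containing it.
    """
    index = defaultdict(set)
    for (pdf_name, page_number), text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((pdf_name, page_number))
    return index


def search(index, query):
    """Return sorted (pdf_name, page_number) pairs containing every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


def page_link(pdf_name, page_number):
    """Link back to the scanned page (assumes the viewer honors #page=N)."""
    return f"{pdf_name}#page={page_number}"


# Hypothetical sample data standing in for real OCR output.
sample_pages = {
    ("phil_trans_001.pdf", 1): "Of the Lamprey and the Eel",
    ("phil_trans_001.pdf", 2): "Of the Barnacle",
    ("phil_trans_002.pdf", 1): "The eel again",
}
sample_index = build_index(sample_pages)
for pdf_name, page_number in search(sample_index, "the eel"):
    print(page_link(pdf_name, page_number))
```

Keeping the page number in the index key is what lets every search hit point straight back at the scan it came from.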

I'd also make all text user-editable, wiki style. Combined with the inline PDF page references, it would be super easy for any user to fix up transcription errors from the OCR process. Tie a karma system into each user's profile so that good edits can earn thanks/kudos. Karma would also help automate moderation: a user's current karma decides whether an edit is accepted automatically or offered as an alternate version that other users can check and rate up if they think it should replace the current version.
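
Roughly, the moderation rule could look like the sketch below. The threshold values and all names are hypothetical, picked only to show the two paths (auto-accept versus alternate version that gets voted up):

```python
from dataclasses import dataclass, field

AUTO_ACCEPT_KARMA = 50  # hypothetical: karma needed for an edit to apply directly
PROMOTE_VOTES = 3       # hypothetical: up-votes needed to promote an alternate


@dataclass
class Page:
    text: str
    alternates: list = field(default_factory=list)


def submit_edit(page, new_text, editor_karma):
    """Apply the edit directly for trusted users, else queue it as an alternate."""
    if editor_karma >= AUTO_ACCEPT_KARMA:
        page.text = new_text
        return "accepted"
    page.alternates.append({"text": new_text, "votes": 0})
    return "pending"


def upvote(page, alt_index):
    """Rate up an alternate; replace the current text once it has enough votes."""
    alt = page.alternates[alt_index]
    alt["votes"] += 1
    if alt["votes"] >= PROMOTE_VOTES:
        page.text = alt["text"]
        page.alternates.pop(alt_index)
        return "promoted"
    return "pending"
```

A real system would also need to record who voted and feed accepted edits back into the editor's karma, which this leaves out.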

Maybe mash in an image cropping service so diagrams can be cropped from the PDF and inserted inline with the extracted text. Provide simple wiki formatting markup to allow users to format the articles.

Use ad revenue/donations to alleviate/cover hosting costs.

1, 2, 3, go.

anigbrowl · on July 21, 2011

Well, most of this already exists in Google books, where I'm reading some of these particular scientific journals right now, can switch to OCRed text at any time (albeit with ftrange contemporarie fpelling), or download a facsimile of any public domain work in PDF format. The only thing they don't have in place (and should add) is the wiki-style crowd editing.

So the guy has basically built a 33gb torrent of stuff that was already freely available to the public, just from a different source.

drewda · on July 21, 2011

DocumentCloud [1] provides all the pieces necessary. It's all very nicely documented, too.

[1] http://www.documentcloud.org

crocowhile · on July 21, 2011

That is what jstor is about actually: full text search of a number of articles.

brettnak · on July 21, 2011

I haven't used it seriously in quite a while, but I remember the search being fairly primitive, though faceted.

nullc · on July 21, 2011

They don't do full text search on these articles.

prosody · on July 21, 2011

There's already a system somewhat like that in place and it was actually referenced in the article. Wikisource (http://wikisource.org) is run by the same nonprofit as Wikipedia, and has a pretty robust system for pairing up side by side transcribed texts and scans. Someone's already started the germ of a Phil. Trans. library (http://en.wikisource.org/wiki/Philosophical_Transactions). At present the first volume is done and about forty volumes (http://en.wikisource.org/wiki/Index:Philosophical_Transactio...) are uploaded and ready for users to transcribe.

bnewbold · on July 21, 2011

If you're curious about what these papers are, here is a hack at exposing the metadata in an easily searchable format. It would not be hard to scrape the PDFs to searchable text using unix tools (doesn't come out very readable) and render them as SVG files with inkscape for easy browser viewing:

http://robocracy.org/royal-search/

https://github.com/bnewbold/royal-search

Lots of good old stuff, such as "On the Double Organs of Generation of the Lamprey, the Conger Eel, the Common Eel, the Barnacle, and Earth Worm, Which Impregnate Themselves; Though the Last from Copulating, Appear Mutually to Impregnate One Another"

This is all available elsewhere; I'm just trying to demonstrate how low the barrier to search is with OSS tools. I hope all these papers end up on archive.org with full text search!

mkrecny · on July 21, 2011

I used OCR to do some interesting things with a similar dataset:

http://news.ycombinator.com/item?id=2791611

dunham · on July 22, 2011

Distributed Proofreaders already has some infrastructure set up for proofreading and editing OCR'd documents. It's not quite the same as your proposal, but they do track stats and have levels of participants based on skill and experience. Might be worth checking out.

   http://www.pgdp.net/c/

ristretto · on July 21, 2011

Regarding this, since the archives are in the public domain, wouldn't it be possible for any third party (e.g. Google Books) to ask for and obtain permission to photograph the originals?

Anyway the ethical thing to do here is to lobby jstor (a non-profit organization) to make reprints free.

on July 21, 2011

[dead]

micheljansen · on July 21, 2011

As someone who's published in journals that are on JSTOR, I hope that this will increase the access of all this knowledge to the greater public. There is something very wrong about the way Elsevier, ACM, IEEE and other publishers abuse their size and established power to hold these works hostage. Of course it costs money to organize and maintain databases, but the way they wrestle you out of the copyright of your own works to ask $25 per paper sickens me. I am legally not even allowed to put my own work on my own website.

_delirium · on July 21, 2011

> I am legally not even allowed to put my own work on my own website.

That's unfortunately still sometimes true, but I'm finding it's increasingly common for there to be self-archival exceptions, where you can put a self-prepared PDF on your personal website. I believe all IEEE and ACM publications now allow this. I've adopted a policy of just putting all my publications online and waiting to see if anyone complains.

nwhitehead · on July 21, 2011

Most journals now have an explicit policy that putting up your preprints online is OK. You can check the status of any particular journal or publisher here: http://www.sherpa.ac.uk/romeo/

loup-vaillant · on July 22, 2011

Sounds good, but I would be very tempted to publish HTML versions as well. Would that be permitted?

wnight · on July 22, 2011

Thank you, the ps/pdf wall most research is behind greatly limits its availability.

Wait until they've published and do it anyways. If they sue/etc you the Streisand effect should hopefully help your notability (and thus career) more than not.

micheljansen · on July 21, 2011

Yeah, as long as you are not obviously threatening their ability to make money off our backs, they will probably not come after us. I think the publishers would rather not face the negative publicity they would get if they actively persecuted researchers who make their work available on their personal websites.

pw · on July 21, 2011

Yeah, and this is HN. Those sort of sentiments (violence towards others) aren't welcome here.

on July 21, 2011

[deleted]

swordswinger12 · on July 21, 2011

I'm sorry... what? A person wishing another person is raped daily is not violent? Did somebody plan another Opposite day and not tell me again?

Vivtek · on July 22, 2011

"No."

masterzora · on July 21, 2011

There's no violence in forced rape? News to me.

Andrew_Quentin · on July 21, 2011

Sorry, the forced rape part must have gone past my eyes.

Let's see if I can delete my comment.

micheljansen · on July 21, 2011

There is also a form of rape that is not forced? News to me ;) (the first rule of tautology club is the first rule of tautology club).

derleth · on July 21, 2011

> There is also a form of rape that is not forced?

Some instances of statutory rape are entirely unforced by any common definition of 'force'.

(The law doesn't see it that way, but the law also sees statutory rape as a strict liability offense, meaning that lack of intent to commit it isn't a possible defense. In short, if someone lies about their age to have sex, they just got raped.)

loup-vaillant · on July 22, 2011

Other jurisdictions sound more reasonable than that: in France, the term for "statutory rape" is "Détournement de mineur" ("corruption of a minor" or something). It applies to any adult (over 18) who has sex with someone under 15. If the adult is in a position of power (professor or something), any minor (under 18) under his or her care is off-limits.

IANAL, but I believe the reasons it is a crime are the same: in the off-limits situations, the minor is believed to be incapable of giving informed consent. But at least it's not called "rape".

(I find the term "Statutory rape" quite misleading. It is definitely not the same thing as plain rape, or worse, child rape. It looks like it has been pushed by a lobby of afraid, protective, conservative, possibly religious parents.)

masterzora · on July 22, 2011

> (I find the term "Statutory rape" quite misleading. It is definitely not the same thing as plain rape, or worse, child rape. It looks like it has been pushed by a lobby of afraid, protective, conservative, possibly religious parents.)

One of the saddest things I can remember (though, unfortunately, the details escape me due to time) was a couple of rape cases that occurred in the US maybe a decade or so ago. One was a case of a very violent rape where the sentence was something on the order of 15 years. The other was a case of statutory with a fully consenting 16- or 17-year old where the sentence was something closer to 40 years. I was pretty young at the time (if it was a decade exactly I would have been 12), but that's definitely when I first realised that the "won't somebody please think of the children" arguments are bullshit.

micheljansen · on July 21, 2011

Only on HN can one expect a genuinely insightful reply to a comment like that :) I never really read about the laws surrounding rape, but that sounds pretty unexpected :)

nealb · on July 21, 2011

Really? Did you even read the post- "The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain..."

law · on July 21, 2011

Actually, with almost every publication agreement, you sign over any right, title, and interest in the copyrighted work to the publisher. So, the law isn't protecting "your" copyright; it protects the publisher's. But, a published academic like yourself probably already knew that.

crocowhile · on July 21, 2011

If he's a publishing academic, then I am spiderman. Every researcher's dream is to have their paper read by as many people as possible.

dantheman · on July 21, 2011

The fact you would wish rape upon anyone is deplorable.

swordswinger12 · on July 21, 2011

Obvious troll is obvious. 7 people should not have fallen for this sub-4chan-level stupidity.

sitkack · on July 21, 2011

Maybe your rape can reach all the way back in time from 1923. You are seriously advocating rape of another human being? Please leave.

mjdwitt · on July 21, 2011

Raped on a daily basis? Really?

Your opinion would have more weight if you left out the histrionics.

Mvandenbergh · on July 21, 2011

Yeah, you wouldn't want to lose all those sweet royalties from the journals that give academics their enviable incomes.

CamperBob · on July 21, 2011

And I'm sure none of your research at all is taxpayer-funded, right? Because that would make you the leech.

sanderjd · on July 21, 2011

My girlfriend manages grant money for a large research institution where a sizable majority of the funds come from taxpayer-backed grants. You would be surprised (well, or maybe not) how often the researchers on the grants yell at her for not letting them have unfettered access to "their money". Even our friends who are researchers for other labs will get heated with her when she defends the casually-maligned administrators of their own grants. It's all very ugly and wrong-headed.

sp332 · on July 21, 2011

Jason Scott of textfiles.com has the whole archive downloaded here, if you want to browse the metadata before downloading the files. http://cdmirror.textfiles.com/JSTOR_01_PhilTrans/

sp332 · on July 21, 2011

Apologies to Jason for posting a link to his temp/staging folder. The files will be available at the Internet Archive soon.

a3_nm · on July 21, 2011

404 now, I wonder why.

pw · on July 21, 2011

That link gives a 404 :-(

eykanal · on July 21, 2011

For those of you who aren't familiar, there is an institution that was set up not long ago called the Public Library of Science (PLoS):

http://www.plos.org/

They have free journals in numerous fields, and gradually more big-name authors (in my field, neuroscience, at least) have been publishing in them. It's worth checking out.

juberpatel · on July 24, 2011

Actually, the operation of PLoS journals raises an interesting question. These journals charge the researchers for publishing papers so that these papers are freely accessible to anyone. But the charges are in the range of $2200 - $3000 per paper! Doesn't that mean the traditional journals are charging reasonable fees to the readers?

ristretto · on July 21, 2011

There's also Frontiers http://frontiersin.org . The problem is how to convince scientists to use them.

tylerneylon · on July 22, 2011

[[TL;DR for this comment - Publishers are taking advantage of a prisoner's dilemma / competitive closed market to monetize the near-zero value they supply.]]

Professors don't get much money from direct publication -- in fact, many conferences charge the professors who provide the content. They get paid by grants and their schools. Professors don't want their research to reach a limited audience. The universities don't want this either.

The only people in the chain who want limited access are the publishers, since this is how they make money. But between researchers, universities, grants, and publishers, the publishers contribute the least value by far. They generally rely entirely on other professors to edit their journals and conference proceedings, and for all the content. They charge ridiculous rates - often thousands of dollars for a single annual journal subscription - and get away with it because the system is not prone to change. Researchers are rewarded for publishing in "the best" journals, so no one wants to take the leap to publishing in a free space where there is currently much less prestige.

That's basically why academic publishing is messed up: there's money to be made in keeping it messed up, and money to be lost in fixing it. But the ones who generate the real value _do_ want things to be as freely available as possible. If a critical mass of top-tier researchers agreed to stop publishing in non-free journals and conferences, it would probably start a revolution in this area -- but that's a lot to ask.

It's a prisoner's dilemma, in that the "traitor" researchers who keep publishing in the old journals will be rewarded.

(This is all about academia -- I guess motivations may be different in industry-backed research.)

rb2k_ · on July 21, 2011

And for people interested in how many bitcoins donations flow his way:

http://blockexplorer.com/address/14csFEJHk3SYbkBmajyJ3ktpsd2...

VMG · on July 21, 2011

Nothing?

r00fus · on July 21, 2011

I imagine it's a legal defense fund of sorts. It's not like he owns these documents.

pyre · on July 21, 2011

No one does, as they are pre-1923.

rb2k_ · on July 22, 2011

Check again :)

w1ntermute · on July 21, 2011

Can anyone explain why all these documents are not available for free? Why does the only place you can download them charge for the privilege?

pnathan · on July 21, 2011

Well, there is a non-zero cost to hosting and delivering documents, as well as creating the infrastructure to allow it.

Those costs must be made up somehow in the business model.

(before I get flamed, I'm not arguing that all documents should be $19 per article. I'm just saying that there's a non-zero cost that needs to be passed on to the customers somehow).

crocowhile · on July 21, 2011

Also, JSTOR does NOT get those documents for free. They may have to pay the publisher to host them on their server.

Academic publishing is a bitch and there is a lot going on lately towards a common, worldwide reform. However, JSTOR is not really the bad guy here. Other publishers (e.g. Elsevier http://en.wikipedia.org/wiki/Elsevier) make billions by publishing mainly work paid for with taxpayer money.

gwern · on July 22, 2011

JSTOR does not get them for free, but it gets them for a lot less than you think and they make a lot more than you think they do. Here are two analyses of their finances based on their mandatory IRS filings:

- http://lists.wikimedia.org/pipermail/wikien-l/2011-July/1092...

- http://www.generalist.org.uk/blog/2011/jstor-where-does-your...

Summary: they take in >53m USD. Publishers get 8m of that. Their IT infrastructure costs 4m. All the new material and scanning old material cost 5m.

(Notice how many millions are left over. They all go to staff, travel, and other goodies like that. The management is generously compensated.)
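
Working the subtraction through with just the figures from that summary (all in millions of USD; 53 is a lower bound on income, so the remainder is a floor too). The variable names are only labels for this sketch:

```python
# Figures from the summary above, in millions of USD.
income = 53  # stated as ">53m", so treat as a lower bound
costs = {
    "publishers": 8,
    "it_infrastructure": 4,
    "new_and_scanned_material": 5,
}

left_over = income - sum(costs.values())
share_left_over = left_over / income

print(f"left over: at least {left_over}m ({share_left_over:.0%} of income)")
```

So about two thirds of the income is not accounted for by publishers, IT, or content costs.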

crocowhile · on July 22, 2011

Interesting figures. Something to think about. Thanks.

Vivtek · on July 21, 2011

> They may have to pay the publisher to host them on their server.

Not in this case, apparently; they're public domain. Although JSTOR did foot the bill for the scanning.

I'll take your other point, though: if there's real evil in the academic publishing world, it's Elsevier.

crocowhile · on July 21, 2011

The point is that this action and aaronsw's action make JSTOR look like the villain while JSTOR are rather sitting on the good side of the battle.

(On the other hand, these actions do contribute to create public awareness so I still haven't decided if they are good or not)

SDADASDA · on July 21, 2011

it seems JSTOR is not pressing charges.

bhickey · on July 21, 2011

Good thing that reproducing something, no matter how laborious the process, is not considered a creative act in the US.

pyre · on July 21, 2011

No, but (IIRC) compilations that take serious effort to produce have a copyright on them, e.g. the phone book.

ahlatimer · on July 21, 2011

The content doesn't, actually. I'll look up the specific case when I get home, but phone book publishers most definitely do not have any copyright claims to the phone number or addresses. They do have copyright claims to the font choice, layout, etc., so you can't copy the phone book wholesale, but you can take the phone numbers, rearrange them, and set them in a different typeface and you'd be in the clear.

I don't believe there's any part of copyright that takes the amount of effort into account. There was a case (again, I'll look it up when I get home) where a copy of a work that was in the public domain was granted its own copyright because it had a few errors in it. The errors were unintentional, but copyright was given. In the phone book case, the court specifically mentioned that even though there was no doubt a significant amount of work required to produce the phone book, that didn't give the publisher copyright.

carussell · on July 23, 2011

> I'll look up the specific case when I get home, but phone book publishers most definitely do not have any copyright claims to the phone number or addresses.

Feist v. Rural

ahlatimer · on July 23, 2011

Yep, that's the one. The other one I was thinking of was Alfred Bell & Co. v. Catalda Fine Arts.

Vivtek · on July 22, 2011

No, the phone book most specifically doesn't - which is why you get five phone books nowadays as competing companies try to sell ads all at once.

ristretto · on July 21, 2011

What about Springer or Nature publishing? Unfortunately, the real evil is rooted in a system that measures excellence based on publishing records and not scientific impact. Until there's a replacement for that, life scientists will keep sheepishly bowing to the publishers. That replacement could be an idea for a startup.

shadowfox · on July 22, 2011

> Unfortunately, the real evil is rooted in a system that measures excellence based on publishing records and not scientific impact

Can you clarify this a bit? How do you get to have impact without first publishing it somewhere (at least mildly reputable) and having other people actually read it/review it/run with it?

ristretto · on July 22, 2011

My ideal system would be a global open system where scientists would publish their results, not papers. Other scientists would be able to review, confirm and reuse your results transparently and third parties would be able to assess your position from your record and the feedback that you get from fellow scientists. That could involve many more signals rather than just the (journal+number of citations) that everyone uses now.

hack_edu · on July 21, 2011

> Well, there is a non-zero cost to hosting and delivering documents, as well as creating the infrastructure to allow it

Sounds like a perfect job for bittorrent distribution....

pnathan · on July 21, 2011

The key issue with bittorrent is that if no one is seeding, the files are "gone". Further, ISPs like to block/filter it.

Of course, a case could be made that university libraries should mirror all public domain digital documents of sufficient interest. Which shunts the cost onto the taxpayers.

Regardless, the cost still is non-zero, and someone has to bear it.

nullc · on July 21, 2011

And yet Wikipedia manages to do so without charging $19/article. :) Even for historic scanned books, e.g. http://commons.wikimedia.org/wiki/File:William_Blake,_a_crit...

Likewise, archive.org does as well. I expect these papers will eventually end up in both these places.

rauljara · on July 21, 2011

Wikipedia is full of general purpose knowledge and paid for by donations. Even if you never make use of 99.99% of the knowledge it has, there is something for you, and a reason for you to donate.

Academic journals are full of incredibly specific knowledge. While some of it is accessible to people with a moderate amount of intelligence/diligence, any journal whose name isn't "Science" is designed for and by an incredibly small group of individuals with an incredibly specific knowledge base. Where is the donor base for that?

Answer: Universities are the only plausible donors. And they seem to think maintaining their place of privilege in being able to access the articles is worth some extra bucks.

impendia · on July 21, 2011

I work at a university. I don't think universities have any vested interest in maintaining their place of privilege. Certainly faculty members don't, with rare exceptions (at least one of which I know about).

But I think there is not a lot of interest in changing the system. The people who could change things are not directly affected by the disadvantages of the current system.

coliveira · on July 21, 2011

The issue is, university professors already have access to these papers. They are among the few people that want, need, and can understand the science behind it. It looks like it makes sense that universities are willing to pay for that privilege.

pnathan · on July 21, 2011

You forgot larger industrial research lab-y companies, e.g., Google and DuPont.

lotu · on July 21, 2011

We have services like YouTube and Flickr that provide hosting for movies and pictures for free. The data requirements are far greater for these services than they would be for academic journal hosting. If you didn't want to go the centralized route, a small journal with under 1,000 readers (<10MB per month per user) could probably buy hosting for under $100/year. Something that small would be easy to cover out of pocket or via donations.
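
Back-of-the-envelope version of that estimate, treating the numbers above as rough ceilings (1,000 readers, about 10 MB per reader per month, $100/year). Decimal units are assumed and the function names are just for illustration:

```python
def monthly_transfer_gb(readers, mb_per_reader_per_month):
    """Total monthly transfer in GB (decimal: 1 GB = 1000 MB)."""
    return readers * mb_per_reader_per_month / 1000


def budget_per_gb(budget_per_year, readers, mb_per_reader_per_month):
    """How many dollars per GB the yearly budget allows."""
    yearly_gb = 12 * monthly_transfer_gb(readers, mb_per_reader_per_month)
    return budget_per_year / yearly_gb


monthly = monthly_transfer_gb(1000, 10)       # 10.0 GB per month
per_gb = budget_per_gb(100, 1000, 10)         # about $0.83 per GB
print(f"{monthly:.0f} GB/month, {12 * monthly:.0f} GB/year, ${per_gb:.2f} per GB")
```

That is around 10 GB a month, or 120 GB a year, which is a small amount of transfer for any hosting plan.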

jcarreiro · on July 21, 2011

> We have services like YouTube and Flickr that provide hosting for movies and pictures for free. The data requirements are far greater for these

Don't be so sure. Some experiments can produce large data sets, which ideally should be made available alongside the published paper.

Goladus · on July 21, 2011

Well, for most of the JSTOR documents that's probably not the case.

But yes, sharing large academic datasets is a real and only partially solved (IOW not solved at all) problem.

sycren · on July 21, 2011

Services like YouTube and Flickr have adverts corresponding to a user's tastes and the particular media. How would you propose doing that with scientific journals?

tokenadult · on July 21, 2011

The quality of content on JSTOR is much, much, much higher than on Wikipedia.

http://strategy.wikimedia.org/wiki/Strategic_Plan/Movement_P...

http://en.wikipedia.org/wiki/Wikipedia_talk:Lamest_edit_wars

hack_edu · on July 21, 2011

Providing it as a torrent lowers the costs at least. Plus, if they were concerned about infrastructure costs they'd have no problem finding mirrors willing to donate bandwidth.

Vivtek · on July 22, 2011

> Which shunts the cost onto the taxpayers.

A lot less cost than the current cost of paying subscriptions to that same content.

thirdstation · on July 21, 2011

There are a fair number of studies exploring the usage of P2P distribution of scholarly articles.

One issue that would need to be satisfactorily (and simply) dealt with is trust that the copy you have is the authoritative copy of record.

bricestacey · on July 21, 2011

LOCKSS[1] (Lots of Copies Keep Stuff Safe) has an algorithm that does its best to ensure you have a real copy and that no one can poison the network.

[1] http://www.lockss.org/

rmc · on July 21, 2011

Simple, BitTorrent has an inbuilt hashing mechanism. Just host the .torrent file on your own server. Hosting a 20KiB file is cheap as chips.
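
For anyone unfamiliar with how that works: a (v1) .torrent file lists a SHA-1 hash for every fixed-size piece of the content, so a client can check each piece it receives against the trusted list. A simplified sketch of the idea, not a real torrent parser (the function names and the tiny piece size are just for the demo):

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, expected_hashes) -> bool:
    """Check a downloaded piece against the hash list from the trusted source."""
    return hashlib.sha1(piece).digest() == expected_hashes[index]


# Demo with a tiny piece size; real torrents use pieces of many KiB or more.
original = b"Philosophical Transactions, Vol. 1"
trusted = piece_hashes(original, piece_length=8)

good_piece = original[8:16]
bad_piece = b"tampered"
print(verify_piece(good_piece, 1, trusted))  # True
print(verify_piece(bad_piece, 1, trusted))   # False
```

As long as you got the hash list from a source you trust, it doesn't matter which peers the pieces themselves come from.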

icebraining · on July 22, 2011

This torrent in particular has that covered: he included a hash of the hashes file in the message, which is signed with his PGP key.

syedkarim · on July 21, 2011

Yes, it's non-zero. But for many, many organizations and many individuals, the non-zero is pretty much zero. The Internet Archive, for example, could absorb something like this without even blinking. As could Google Books.

jcarreiro · on July 21, 2011

> Why does the only place you can download them charge for the privilege?

Because they can.

_delirium · on July 21, 2011

True, but JSTOR is supposed to be a non-profit initiative in the service of scholarship, not a revenue-maximizing corporation. It's largely funded by taxpayer money. I agree they need funding to continue their work, and some of that comes from building up an internal revenue stream so they aren't completely dependent on the winds of year-to-year grants. But I've been disappointed overall by their lack of any commitment to make more things publicly available, given their charitable mission.

When JSTOR was first getting off the ground 15 years ago, many of us had higher hopes for the organization's direction, that it'd eventually be a mixture of subscription-only access for recent non-open-access journals, and free-to-anyone digitization of old, public-domain materials, like a large-scale version of Project Perseus. But nobody at JSTOR really pushed the 2nd part; they digitized old documents, but kept them all subscription-only, and never even developed plans (despite occasional agitation) for individuals to buy subscription access for any sort of reasonable price. They seem to have developed a bit of an unhealthy attachment to "owning" the JSTOR archive as their valuable crown jewel, despite officially just being custodians of it on behalf of the academic community.

baconner · on July 21, 2011

I was on JSTOR's side of this argument until I read your comment. If it's funded in part by taxpayers, then the data they produce ought to be at least in part taxpayer property.

shadowfox · on July 22, 2011

A (somewhat controversial) question that came up the last time I was discussing this with someone was whether all JSTOR material should be available on the net. This does mean that American taxpayer funded research is now available to practically everyone in the world.

What follows is third hand information and not even directly related to JSTOR. So it could be totally wrong. But I am told that at the moment not all of the journals/articles in all areas are available to universities outside the US. Apparently it is hard to get hold of research related to subjects that the US thinks are not "exportable" because of potential impact on national security. Things like nuclear medicine, areas of chemical engineering etc. The journal publishers act as a point of enforcement in this story.

Andrew_Quentin · on July 21, 2011

Yet, as you say, they are nonprofit. Are they therefore actually making any profit?

I guess that they are not a public company, therefore unfortunately perhaps we can not have any access to the revenue data to determine whether they are making any profit and how such profit is being used. In that case, most probably the safe bet is to decide that they are making profit. If, however, they are in fact not making any profit because their archiving and maintaining of these documents is paid off by the subscription charges, then this torrent does a disservice to science rather than acting as a liberator.

Without that information, however, I do not think it is easy, or indeed possible, to rationally decide.

chronometric · on July 25, 2011

Check out http://www.generalist.org.uk/blog/2011/jstor-where-does-your... for a great analysis of JSTOR's general revenue breakdown. The top 13 officers earn over $3 million in salaries. So while they may be a nonprofit, profits do matter to some of them.

cdcarter · on July 21, 2011

Non-profit does not mean that they do not make a profit; it just means that any profits are put back into the organization instead of going to its leadership or shareholders.

_delirium · on July 21, 2011

Shareholders yes, though there's more gray area on "leadership". There's no law against using the revenue of a nonprofit organization to pay its leadership arbitrarily large salaries, unless it rises to the level where the IRS starts to see it as dirty tricks. I have no idea what JSTOR's are, though (or if that's even public). But I don't suspect they are pocketing the cash.

My guess is that their reticence to open things is more of an institutional megalomania of sorts. Sort of what you often find at art museums--- while their official mission is to promote the arts and educate the public, many of their staff are at least as interested in building the prestige/fame/fortune of their institution, which explains why they would have weird policies about reproducing their public-domain artworks.

tcb23 · on July 21, 2011

As others have said, the costs of distribution are non-trivial, and are almost certainly a loss for JSTOR on these articles. For example the exemplary arxiv.org, which provides a relatively 'simple' hosting service and where the author prepares the full manuscript- they still have annual running costs in the region of $500,000. Even arxiv.org do not seem to have found it easy to secure funding from other universities and institutions which surely find it essential (most costs are borne by Cornell library).

While there is great value in the JSTOR articles (all published before 1923)- in reality very few active researchers will be reading them. Some of the research will be just plain wrong, some corrected or superseded, and if there is any useful stuff left, then it will be readily found in much more modern and useful presentations.

I think the more important battle is to ensure all current research is deposited in open archives (preferably something like arxiv.org). The historical stuff will surely follow...

sciurus · on July 21, 2011

If anyone is curious where the money to run arXiv goes, here is their 2010 budget and their business plan.

http://arxiv.org/help/support/2010_budget

http://arxiv.org/help/support/whitepaper

_delirium · on July 22, 2011

That's actually insanely frugal--- $381,000 total budget! JSTOR is very reticent about their budget, but from IRS filings it appears their annual revenue is in the ballpark of $55-65 million, with about $8m of that going as payments/revenue-share to publishers, leaving somewhere north of $40m for JSTOR operations.

It's hard to say how necessary a budget of that size is, made harder by the fact that they publish absolutely nothing about who they are, what they do with the money, how they plan for it, etc. The only source of financial information is their legally mandated annual IRS filing.

There's an attempt to sort some of it out here: http://www.generalist.org.uk/blog/2011/jstor-where-does-your...

zeratul · on July 21, 2011

Medical doctors were very unhappy to pay for research papers that were funded by tax paying Americans. That's why since April 2008 all articles funded by NIH have to be freely available: http://publicaccess.nih.gov/ . Remember, it's impossible for law to work backwards.

alanh · on July 21, 2011

I am also quibbling with your claim “it's impossible for law to work backwards.” (IANAL)

1) It is unconstitutional to prosecute an individual for a crime that was legal at time of perpetration.

2) However, the opposite — decriminalizing previously illegal behavior — can be retroactive.

3) Changing future interpretation of copyright, etc., isn’t the same as case #1. If I’m not mistaken, Congress has passed e.g. the Mickey Mouse copyright law and the DMCA, which both extended “protection” & duration of copyright on previously created works.

nullc · on July 21, 2011

Maybe.

http://www.scotusblog.com/case-files/cases/golan-v-holder/

MostAwesomeDude · on July 21, 2011

You are mistaken on retroactive copyright; publishers and artists who failed to secure copyrights the first time around did not get retroactive postmortem copyright granted to them. The classic example is the library of H. P. Lovecraft, whose stories are considered public domain as there is nobody who has a reasonable claim of copyright to them.

praeclarum · on July 21, 2011

Retroactive laws are commonplace: http://en.wikipedia.org/wiki/Ex_post_facto_law

zeratul · on July 22, 2011

I didn't know that. I was referring to http://en.wikipedia.org/wiki/Nullum_crimen,_nulla_poena_sine... . Obviously, history shows that governments can do whatever they want if opportunity arises.

jefftk · on July 21, 2011

> Remember, it's impossible for law to work backwards.

Really? I think it's just politically less convenient than making laws that apply to future research. There's nothing that would keep the government from removing copyright altogether if they wanted to.

aidenn0 · on July 21, 2011

FYI it's potentially NSFW if you don't have adblock

sp332 · on July 21, 2011

D'oh, I always forget about those ads because I have adblock. blushes

equark · on July 21, 2011

I don't fully understand the logic. Even if the underlying content is free, how are JSTOR scans public domain? If Google spends millions of dollars scanning in public domain books, I don't see how that gives Bing the right to download them all from Google unless given permission.

It also seems we all benefit from allowing companies to invest in scanning public domain works since for whatever reason nobody is doing this by hand now.

_delirium · on July 21, 2011

Scans of public-domain work remain in the public domain under U.S. law, at least as currently interpreted. There's no Supreme Court decision on the subject, but the leading case widely followed is Bridgeman Art Library v. Corel (1999) (http://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_...)

Access to network resources is another matter--- Bing might be violating Google's access policies if they slurped Google Books, especially if they evaded rate limits or ignored robots.txt. But they wouldn't be in violation of any copyrights. This makes it a bit of a cat-out-of-the-bag situation. If I download a copy of an 1878 journal from JSTOR, and email it to a friend, I'm violating JSTOR's terms of service. But if my friend turns around and posts the PDF online, he is not breaking any laws.
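To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler might check an access policy before fetching anything. The robots.txt content, the `ExampleBot` user agent and the example.org URLs are all made up for illustration; they are not any real site's policy. It uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /scans/
Crawl-delay: 10
"""

def make_policy(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

policy = make_policy(ROBOTS_TXT)

# A polite crawler asks before each fetch and honours the requested delay.
print(policy.can_fetch("ExampleBot", "https://example.org/scans/1878.pdf"))
print(policy.can_fetch("ExampleBot", "https://example.org/about.html"))
print(policy.crawl_delay("ExampleBot"))
```

Note that this only models the access policy. Whether the fetched content is copyrighted is a separate question, which is the distinction drawn above.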

A similar situation holds with U.S. government classified documents. Whoever leaked the Pentagon Papers broke the law, and could be prosecuted for the leak. But once they were out, as public-domain government documents, it was not illegal for third parties to republish them.

nullc · on July 21, 2011

You appear to be advancing a sweat of the brow (http://en.wikipedia.org/wiki/Sweat_of_the_brow) theory of copyright— a view which is rejected by US law (the law in other places may be different). See Bridgeman v. Corel for an excellent example.

Of course, I don't begrudge people who scan the ability to get paid for their work. But there are a great many ways to get paid which do not involve forever robbing from the public domain (because, of course, if they are able they will simply keep updating their scanned copies while locking the public away from the originals).

equark · on July 21, 2011

Very interesting. That's exactly what I was arguing.

I don't quite understand where the line is though. For instance, why isn't Google Map information public domain? The locations of roads aren't copyrighted and Google's efforts to enter this information don't seem that different than compiling a phonebook.

nullc · on July 21, 2011

Probably you don't understand where the line is because no one else is quite sure either.

Most of the underlying information on Google Maps is public domain—road locations and physical features of the terrain are all uncopyrightable facts. What they claim copyright to are the map visuals: the particular style of drawing the maps, and the photographs they took. (though perhaps even that is debatable, since the photography is a robotic process)

You should be able to take the Google data and extract everything uncopyrightable from it to draw your own maps. (And if you're interested in public maps, you should see the OpenStreetMap project.)

But say you want to include some of the other things on Google Maps--place markers for hotels and restaurants, for example. Any one of them is an uncopyrightable fact: that's where it is. But what if you choose to mark the same collection of the same sorts of things that Google has marked--are you infringing on their copyright in the selection? The more you can make an argument that there was some original thought in the selection the harder it is to tell.

US law is pretty clear on plain facts being ineligible for copyright, even if it took work ("sweat of the brow") to compile them. But copyright in things like collections of data is much less clear.

wisty · on July 23, 2011

However, Google hasn't done much (any?) creative stuff by hand. It's all automated. The input data is just facts, and the process is automated, so where's the creativity? Are automated processes able to create copyright-able works? I've no idea.

kbutler · on July 21, 2011

The line is creative expression. Copyright restricts copying of original, creative expression.

Copyright law does not restrict copying of facts or ideas.

"(b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work." http://www.copyright.gov/title17/92chap1.html#102

Thus, you can freely copy the facts (information) in google maps.

Google may be able to claim copyright on map images it produces, but I'd expect courts would find that they are just a mechanical reproduction of factual information, limiting the copyright to just the presentation choices (colors, fonts, styling, etc.)

From the same reference: "(b) The copyright in a compilation or derivative work extends only to the material contributed by the author of such work, as distinguished from the preexisting material employed in the work, and does not imply any exclusive right in the preexisting material. The copyright in such work is independent of, and does not affect or enlarge the scope, duration, ownership, or subsistence of, any copyright protection in the preexisting material."

rwmj · on July 21, 2011

It's being worked on ...

http://www.openstreetmap.org/

Palomides · on July 21, 2011

At least in the US, it seems the courts have decided that faithful reproductions of public-domain 2D art are not themselves copyrightable. I suspect this could apply here.

See: http://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_...

jcarreiro · on July 21, 2011

The works are in the public domain. Unless you think that simply scanning and producing a bit-for-bit accurate copy of a work in the public domain produces a new work that deserves copyright protection?

skizm · on July 21, 2011

Maybe I have misread but I believe he is saying that the scanning costs are negligible compared to what they are charging/making. He mentions that when the journals were in paper that the publishers charged just enough to keep their printing up and running but now they are getting greedy and making enough money to have political clout.

Bottom line is that people shouldn't make so much money off the back of unpaid academics/scholars. Personally I feel like if they charged something like 99 cents for a DRM free version of each paper (with free abstracts still of course) this might be more reasonable.

Just my take though :)

wnight · on July 22, 2011

Anyone has the right to download anything from anyone else. It's the content-provider's server and it's their job to tell it who it should and shouldn't give content to. Making content accessible is implicitly giving permission to download it. It's how the internet works, by design.

meow · on July 21, 2011

I think the title is kind of wrong. The torrent author says: "I've had these files for a long time, but I've been afraid that if I published them I would be subject to unjust legal harassment by those who profit from controlling access to these works."

So he clearly says it's not in the public domain, but that from a moral point of view they should be available to all the masses. The description of the torrent is very interesting. Definitely worth taking a look even if you are not willing to download the actual torrent.

microarchitect · on July 21, 2011

The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain, total some 18,592 papers and 33 gigabytes of data.

I think this statement means that the documents are indeed in the public domain.

I also think equark makes a valid point about the scans. Further, I don't understand why these guys are going after JSTOR, a non-profit organization. I'd be more understanding of their methods if they went after somebody like Elsevier.

meow · on July 21, 2011

You are correct. This portion appears to be in public domain (though not publicly available). Non-profit or not, the fact remains that most of these documents remain behind pay-walls while still being considered to be in public domain.

You have a valid point though. Access to these documents seems to be governed by agreements between various publishers, with the aim of sharing the published content among various institutions (http://en.wikipedia.org/wiki/JSTOR). So there is no point blaming JSTOR for the lack of access.

beefman · on July 21, 2011

What do profits have to do with it?

bryanlarsen · on July 21, 2011

You missed the word "unjust" in your reading. He very clearly claims elsewhere in his manifesto that the documents are public domain. The statement you elided would then be in conflict with that, but only if you remove the word "unjust".

Andrew_Quentin · on July 21, 2011

They are in the public domain in a printed copy. You can copy from them, reprint them, make books of them and charge for it. You can scan them and put them online, the whole millions of them. Yet that does take work and who should pay for this work? Is it not a great service to mankind that these documents are being made available from the comfort of your home?

I do not understand one simple thing. JSTOR is not for profit. Are people therefore saying that it is making a profit? If yes, the action is understandable, but if not, as much as I am and was annoyed to face a JSTOR paywall and as much as there is an argument to make such a paywall means-tested, what are people actually saying? That this non-profit organisation should not be able to existentially support itself?

Yes, government should support it rather than spend money on wars, but, I simply do not understand why anyone would want to kill off a non profit organisation.

How many of you are going to read these articles? How many of the general population would want to read these articles? Save for the professors and students, how many would even attempt to go through a journal article?

It just seems, frankly, that a self-interested ideology of disdain for copyrights has taken root in Hacker News, and people have stopped considering long-term effects, focusing instead on the temporary pleasure of the now.

icebraining · on July 22, 2011

>They are in the public domain in a printed copy. You can copy from them, reprint them, make books of them and charge for it. You can scan them and put them online, the whole millions of them. Yet that does take work and who should pay for this work?

Whoever wants to. But doing so does not give you copyright over the content.

>I do not understand one simple thing. JSTOR is not for profit. Are people therefore saying that it is making a profit? If yes, the action is understandable, but if not, as much as I am and was annoyed to face a JSTOR paywall and as much as there is an argument to make such a paywall means-tested, what are people actually saying? That this non-profit organisation should not be able to existentially support itself?

$19 per copy is extremely expensive, considering their expenses after the document is scanned are close to 0 (and as we can see, hundreds of people are willing to do it for free).

showerst · on July 21, 2011

Just to be clear, these are apparently unrelated to the Aaronsw case, right?

iterationx · on July 21, 2011

I had considered releasing this collection anonymously, but others pointed out that the obviously overzealous prosecutors of Aaron Swartz would probably accuse him of it and add it to their growing list of ridiculous charges. This didn't sit well with my conscience, and I generally believe that anything worth doing is worth attaching your name to.

I'm interested in hearing about any enjoyable discoveries or even useful applications which come of this archive.

- ---- Greg Maxwell - July 20th 2011 gmaxwell@gmail.com Bitcoin: 14csFEJHk3SYbkBmajyJ3ktpsd2TmwDEBb

slowpoke · on July 21, 2011

I truly think you are a brave person for doing this under your real name. I most likely couldn't do that.

Godspeed, good sir.

huhtenberg · on July 22, 2011

That was a quote from manifesto.

mbreese · on July 21, 2011

From the "manifesto":

Several years ago I came into possession, through rather boring and lawful means, of a large collection of JSTOR documents.

Andrew_Quentin · on July 21, 2011

Yet he does not mention what the boring and lawful means are. Perhaps he is too busy a man to spend his time educating us.

nullc · on July 21, 2011

No, just not stupid enough to write the opposition's briefs for them.

But really, you don't need to be a genius to obtain large collections of scientific papers. Many people have them; it's a prerequisite for doing meta-analysis, for example. The distinction here is making them available to the general public.

emilis_info · on July 21, 2011

Facebook won't let me post a link to this torrent. Anyone know a way around?

URL shorteners don't help.

davorak · on July 21, 2011

I was unaware of that level of censorship at Facebook. Where are you trying to post it?

mathew1988 · on July 21, 2011

Facebook even censors pirate bay links within private messages between you and friends iirc. They don't block select other torrent sites though. They're a bit odd like that.

Some more info http://torrentfreak.com/facebook-blocks-all-pirate-bay-links...

rdkls · on July 22, 2011

Looks like FB blocks links to torrent detail pages on TPB. Best I could do was post a link to things tagged with JSTOR (there's only one): http://thepiratebay.org/tag/JSTOR

emilis_info · on July 21, 2011

I am trying to post a link to the PirateBay page in my personal news feed.

I get a message about abusive content.

andresmh · on July 21, 2011

Me too. I even tried to share a link to http://thepiratebay.org/torrent/5023425/Holy_Bible%28s%29_-_... but it was denied

jfriedly · on July 21, 2011

I believe FB blocks links to TPB. They actually changed my password and sent me an email saying that they thought my account had been compromised when I tried to do it once. However, as alanh said, a little redirection will work to post it.

Edit: Source: http://www.techdirt.com/articles/20090507/1152134782.shtml

cpach · on July 21, 2011

Something like this maybe?

http://www.google.com/search?q=577d58aa66beaceb71518ec417ab3...

aristus · on July 21, 2011

Looking into it. That domain is not blocked. It is possible to post links to the root domain, just not deep links. It could be a bug or misconfiguration on their side. FB does a ping to posted links to grab the title, thumbnail, etc. That seems to be failing for deeplinks to TPB.

alanh · on July 21, 2011

Well, a manual level of indirection would solve it: publish a web page with little more than a link to the Pirate Bay page, and post a link to that page on Facebook. If you use Simplenote’s published page feature or Dropbox’s public folder, this can take but a minute.
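As a rough sketch of that indirection step, the snippet below generates a bare HTML page containing nothing but a link to a target URL. The function name `indirection_page` and the example.org URL are placeholders invented here; where you host the resulting file (a public folder, a published note, your own server) is up to you.

```python
from html import escape

def indirection_page(target_url: str, label: str = "Link") -> str:
    """Return a minimal HTML page whose only content is a link to target_url."""
    safe_url = escape(target_url, quote=True)   # escape &, <, > and quotes
    safe_label = escape(label)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{safe_label}</title></head>\n"
        f'<body><p><a href="{safe_url}">{safe_label}</a></p></body></html>\n'
    )

# Placeholder URL; substitute the page you actually want to point at.
page = indirection_page("http://example.org/some/deep/link?a=1&b=2", "The archive")

with open("pointer.html", "w", encoding="utf-8") as handle:
    handle.write(page)
```

You would then share the address of `pointer.html` rather than the deep link itself.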

emilis_info · on July 21, 2011

I created a simple page with a single frame in it containing the TPB page. This did the job.

Didn't want to write any blog posts about it, just point people to the original page.

redthrowaway · on July 21, 2011

Post it on twitter, then post a link to your tweet on facebook.

wacker_t · on July 24, 2011

take a try with this copy-paste-tool:

http://www.clicktoapp.com

aridiculous · on July 21, 2011

I wonder what would happen if this happened to Westlaw, one of the 2-3 industry standards for law firms. Incredibly expensive.

The irony alone in the law profession would be tremendous. It'd be interesting to see if law firms would illegally access it: It would be obvious they were if they previously only subscribed to Westlaw, but the reality is most law firms subscribe to more than one database for emergency backup.

rdp · on July 21, 2011

State and federal court opinions are freely available from the courts themselves. Westlaw, LexisNexis, etc. gain much of their value from tools they provide for analyzing the opinions. These companies manually produce headnotes, case histories, and citation history (i.e., "shepardization"). Collating this value-added data would be vastly more difficult than automatically pulling opinions out of the PACER database (or even JSTOR). Not that I wouldn't like to see somebody try it . . .

gwern · on July 21, 2011

Ironically, one of Aaron's previous projects was about legal files: https://secure.wikimedia.org/wikipedia/en/wiki/PACER_%28law%...

Atropos · on July 21, 2011

How many different paywalls are there and how many articles are trapped? I'm able to access 5 different databases + one of the biggest research libraries in my state and there are still often articles that are simply inaccessible. Or even more ridiculous: single chapters of a book sometimes cost up to $20 online, even if the complete book could be bought for $30...

In my mind an easier way to disrupt this system would be to create a p2p site for article sharing - this often takes place informally anyway. Just a place where you could ask "Does anyone have article..." and then a friendly person would upload it to some filehoster and share the link.

mestudent · on July 21, 2011

Is this[1] the "manifesto"?

If not can someone post it so I don't have to get around the block.

[1]: http://cdmirror.textfiles.com/JSTOR_01_PhilTrans/1st_READ.tx...

pimeys · on July 21, 2011

The same one that is in the pirate bay page.

raldi · on July 21, 2011

That'll teach 'em. Maybe next time JSTOR will think twice before protecting their network from an apparent DoS attack.

wnight · on July 22, 2011

Yeah, because someone downloading documents, at the rate supplied by your server(!), is obviously an attacker.

For what it's worth, it would have been obvious they weren't facing a DoS.

flocial · on July 22, 2011

This illustrates the sad state of affairs. The technology is there to distribute this equitably (using torrents). Scanning these documents is a non-trivial task and most people would only need a handful of the papers in this collection for anything but intellectual curiosity. However, the pricing and legal restrictions put in place for the distribution go against the history of scholarship. The only reason we have lots of ancient works of prose and scholarship is that monasteries of various creeds institutionally copied these works (by hand).

dbingham · on July 21, 2011

Someone I know is suggesting that these documents were already free and available on the web. I don't really know, since I haven't (and don't have the bandwidth to, really) downloaded the torrent and cross referenced. Here are the links he's provided:

"Unavailable anywhere else? Here's the ones from the 1600s: http://www.bodley.ox.ac.uk/cgi-bin/ilej/pbrowse.pl?item=ti... . Here's the ones from 1832-1938: http://catalogue.bnf.fr/servlet/RechercheEquation;jsession... . They're pretty widely available, for free."

"Looks like a bunch are on archive.org, too: http://www.archive.org/search.php?query=creator%3A%22Royal...

Can anyone confirm that these are the same articles in the torrent?

nullc · on July 21, 2011

It's perhaps easier to go the other way.

Can you find this online? "Description of the Brain of Mr. Charles Babbage"

T1 - Description of the Brain of Mr. Charles Babbage, F.R.S
JF - Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character (1896-1934)
VL - 200
SP - 117
EP - 131
PY - 1909/01/01/
UR - http://dx.doi.org/10.1098/rstb.1909.0003
M3 - doi:10.1098/rstb.1909.0003
AU - Horsley, V.
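For anyone who wants to cross-reference records like this one against other archives, here is a small sketch that parses a tagged citation record (this looks like the RIS export format) into a dictionary. The parser is deliberately naive and the helper name `parse_ris` is made up for this example; it handles only one-line "TAG - value" fields like the ones quoted above.

```python
import re

# The record quoted above, one tag per line.
RECORD = """\
T1 - Description of the Brain of Mr. Charles Babbage, F.R.S
JF - Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character (1896-1934)
VL - 200
SP - 117
EP - 131
PY - 1909/01/01/
UR - http://dx.doi.org/10.1098/rstb.1909.0003
M3 - doi:10.1098/rstb.1909.0003
AU - Horsley, V.
"""

# Two-character tag, optional padding, a dash, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])\s*-\s?(.*)$")

def parse_ris(text: str) -> dict:
    """Map each tag to the list of values seen for it (tags such as AU can repeat)."""
    fields = {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            fields.setdefault(match.group(1), []).append(match.group(2).strip())
    return fields

rec = parse_ris(RECORD)
page_count = int(rec["EP"][0]) - int(rec["SP"][0]) + 1  # inclusive page range
print(rec["T1"][0], "by", rec["AU"][0], "-", page_count, "pages")
```

The DOI in the `M3` field is probably the most reliable key for matching the same article across different sources.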

nullc · on July 21, 2011

None of your links are working for me.

The only things in archive.org right now appear to be a half dozen issues from the mid-1800s.

y0ghur7_xxx · on July 21, 2011

Link to manifesto http://pastebin.com/KudE4bWr for people who don't have access to the pirate bay.

ck2 · on July 21, 2011

So what percentage is that? Is a "page" still the 2k standard these days?

Per Wikipedia, as of November 2, 2010, the database contained 1,289 journal titles in 20 collections representing 53 disciplines, and 303,294 individual journal issues, totaling over 38 million pages of text.

sp332 · on July 21, 2011

Very little. The indictment against aaronsw claims that he got over 4 million documents, and this archive only has 18,500. And I'm not even sure what percentage of the total aaronsw got.
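A back-of-the-envelope version of that comparison, as a sketch: the paper count and archive size come from the torrent description (18,592 papers, 33 gigabytes), and the 4 million figure is the indictment's "over 4 million", so the share below is an upper bound relative to that download, not a share of all of JSTOR. Treating a gigabyte as 10^9 bytes is an assumption.

```python
PAPERS_IN_TORRENT = 18_592     # from the torrent description
ARCHIVE_BYTES = 33 * 10**9     # "33 gigabytes", assuming decimal gigabytes
INDICTMENT_DOCS = 4_000_000    # "over 4 million" documents, so a lower bound

# Share of the indictment's document count represented by this archive.
share = PAPERS_IN_TORRENT / INDICTMENT_DOCS * 100

# Average size of one scanned paper in the archive, in megabytes.
avg_mb = ARCHIVE_BYTES / PAPERS_IN_TORRENT / 10**6

print(f"share of 4 million documents: {share:.2f}%")
print(f"average size per paper: {avg_mb:.2f} MB")
```

So the archive is well under one percent of the document count cited in the indictment, with each paper averaging a little under 2 MB.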

_delirium · on July 21, 2011

This is only the back issues of one journal, the Philosophical Transactions of the Royal Society. It's very significant historically, but far from all of JSTOR's public-domain holdings.

ynniv · on July 21, 2011

Percentage of what? This is not Aaronsw but someone inspired by his actions.

danbmil99 · on July 23, 2011

Can someone explain who exactly is so hellbent on prosecuting Aaronsw? Whose eye did he poke? JSTOR isn't even pressing charges, so I assume it's some other party.

cpeterso · on July 21, 2011

When is someone going to do this for the ACM's journals?

blinkingled · on July 21, 2011

So is it legal to (re-)distribute those files? Or are we going to see more prosecutions AA style involving distributors as well as downloaders?

sitkack · on July 21, 2011

Please seed. If you are on a mac you can use transmissionbt.

http://www.transmissionbt.com/

Configure the bandwidth management to acceptable always-on background levels, then minimize it to the dock or put it on a different desktop with Spaces.

apas · on July 21, 2011

That's why I love the Pirate Bay and its community.

Free art, technology and culture. Yep, this is the internet.

nextparadigms · on July 21, 2011

This looks like another case of the "Streisand effect".

nl · on July 22, 2011

That is one gutsy move by Greg Maxwell.

m0wfo · on July 21, 2011

Only just realised Eircom has blocked access to TPB in Ireland at the request of the 4 major record labels. The fact that this HN post is tangential to the issue of music piracy annoys me. Another step closer to censorship.

slowcpu · on July 22, 2011

Not terribly interesting. Although it is one of the oldest (if not the oldest) journals around, it is simply not widely read or terribly important. In addition, from the journal's web site: "All issues back to 2001 are free to access two years after publication."

Now, a collection of all of Nature's issues would have been fascinating.

username3 · on July 21, 2011

What's so good about these articles?

slowpoke · on July 21, 2011

What's bad about more knowledge?

username3 · on July 21, 2011

I'm not saying it's bad. I'm curious what's special or interesting about these documents.

ivankirigin · on July 21, 2011

Scribd should index and host these. I doubt that would greatly add to their current huge scale. They could make a dedicated site for it. They could also probably get the help of the FOSS community to help make the search faster.

gwern · on July 21, 2011

So, the solution to one publisher locking up and charging for access to public domain material is... copying the material to another publisher that likes locking up and charging for access?

ivankirigin · on July 21, 2011

It is a solution to needing to download every file here just to access one quickly. Scribd has charged document creators IIRC, not readers.

Scribd makes money off ads, so the incentives are actually really well aligned.

18pfsmt · on July 21, 2011

EDIT: I should delete this because it is wrong, but I won't so as not to confuse the reply.

Thanks, gwern. Sorry for not RTFA first, but I've adopted a habit of reading comments first.

--- I believe these documents are still under copyright, so that would give them significant liability. They exist as a DMCA compliant company, so they are protected from their users' actions; but, pro-actively doing what you suggest would remove the DMCA protections.

gwern · on July 21, 2011

> The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain, total some 18,592 papers and 33 gigabytes of data.

> The documents are part of the shared heritage of all mankind, and are rightfully in the public domain, but they are not available freely. Instead the articles are available at $19 each--for one month's viewing, by one person, on one computer. It's a steal. From you.