If I weren't too timid to risk doing so, I would do the following (read: I hope someone else does this).
Process the PDFs with an OCR program to extract as much text from each document as possible. The extraction should be done page by page, so the extracted text can be referenced back to a specific PDF page number.
Then, provide a searchable/browsable directory of the extracted content. Each page of text has a link to the original PDF page, so you can easily open the PDF to the page the text was extracted from.
I'd also make all text user-editable, wiki style. Combined with the inline PDF page references, it would be super easy for any user to fix up errors from the OCR process. Tie a karma system into each user's profile so that edits can be thanked with kudos for a job well done. That would also help automate moderation: rate the user's current karma to decide whether an edit should be accepted automatically or offered as an alternate version that other users can check and vote up if they think it should replace the current one.
Maybe mash in an image-cropping service so diagrams can be cropped from the PDF and inserted inline with the extracted text. Provide simple wiki formatting markup to let users format the articles.
Use ad revenue/donations to alleviate/cover hosting costs.
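The page-by-page extraction step above could be sketched like this. This is a minimal illustration, assuming poppler's `pdftoppm` and `tesseract` are installed; the file naming scheme is my own invention, chosen so each text file maps back to a specific PDF page number:

```python
# Minimal page-by-page OCR sketch. Assumes poppler's pdftoppm and
# tesseract are on PATH; the naming scheme is illustrative, chosen so
# each extracted text file maps back to a specific PDF page number.
import subprocess
from pathlib import Path

def page_text_name(pdf: str, page: int) -> str:
    """Name an extracted-text file so its source PDF page is recoverable."""
    return f"{Path(pdf).stem}-p{page:04d}.txt"

def ocr_pdf(pdf: str, pages: int, outdir: str = "text") -> list[str]:
    """Rasterize and OCR each page of `pdf`, writing one text file per page."""
    Path(outdir).mkdir(exist_ok=True)
    written = []
    for page in range(1, pages + 1):
        base = f"/tmp/{Path(pdf).stem}-{page}"
        # Render just this page to a PNG, then OCR it.
        subprocess.run(["pdftoppm", "-f", str(page), "-l", str(page),
                        "-png", "-singlefile", pdf, base], check=True)
        txt = Path(outdir) / page_text_name(pdf, page)
        # tesseract appends ".txt" to the output base name itself.
        subprocess.run(["tesseract", base + ".png", str(txt)[:-4]],
                       check=True)
        written.append(str(txt))
    return written
```

Because each output file encodes the volume and page number, the wiki layer could derive the "link back to the original PDF page" directly from the filename.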
Well, most of this already exists in Google books, where I'm reading some of these particular scientific journals right now, can switch to OCRed text at any time (albeit with ftrange contemporarie fpelling), or download a facsimile of any public domain work in PDF format. The only thing they don't have in place (and should add) is the wiki-style crowd editing.
So the guy has basically built a 33 GB torrent of stuff that was already freely available to the public, just from a different source.
There's already a system somewhat like that in place and it was actually referenced in the article. Wikisource (http://wikisource.org) is run by the same nonprofit as Wikipedia, and has a pretty robust system for pairing up side by side transcribed texts and scans. Someone's already started the germ of a Phil. Trans. library (http://en.wikisource.org/wiki/Philosophical_Transactions). At present the first volume is done and about forty volumes (http://en.wikisource.org/wiki/Index:Philosophical_Transactio...) are uploaded and ready for users to transcribe.
If you're curious about what these papers are, here is a hack at exposing the metadata in an easily searchable format. It would not be hard to scrape the PDFs to searchable text using unix tools (doesn't come out very readable) and render them as SVG files with inkscape for easy browser viewing:
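To illustrate how low that barrier is, here is a toy sketch of a crude full-text index over per-page text files, the kind of thing `pdftotext` output could feed into. The filenames and sample text are hypothetical:

```python
# Toy full-text index over per-page text files, to show how little it
# takes to get searchable text once the PDFs have been scraped.
# Filenames and contents here are illustrative assumptions.
import re
from collections import defaultdict

def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of page files containing it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for page_file, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_file)
    return index

def search(index: dict[str, set[str]], *terms: str) -> set[str]:
    """Pages containing all of the given terms (a simple AND query)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

pages = {
    "phil-trans-v1-p0001.txt": "On the motion of the pendulum",
    "phil-trans-v1-p0002.txt": "Observations on the common eel",
}
idx = build_index(pages)
print(search(idx, "common", "eel"))  # -> {'phil-trans-v1-p0002.txt'}
```

A real deployment would of course use an existing search engine rather than a dict, but the point stands: plain extracted text plus a few lines of glue already gets you usable search.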
Lots of good old stuff, such as "On the Double Organs of Generation of the Lamprey, the Conger Eel, the Common Eel, the Barnacle, and Earth Worm, Which Impregnate Themselves; Though the Last from Copulating, Appear Mutually to Impregnate One Another"
This is all available elsewhere; I'm just trying to demonstrate how low the barrier to search is with OSS tools. I hope all these papers end up on archive.org with full-text search!
Distributed Proofreaders already has some infrastructure set up for proofreading and editing OCR'd documents. It's not quite the same as your proposal, but they do track stats and have levels of participants based on skill and experience. Might be worth checking out.
Regarding this, since the archives are in the public domain, wouldn't it be possible for any third party (e.g. Google Books) to ask and obtain permission to photograph the originals?
Anyway the ethical thing to do here is to lobby jstor (a non-profit organization) to make reprints free.
As someone who's published in journals that are on JSTOR, I hope this will widen the general public's access to all this knowledge. There is something very wrong about the way Elsevier, ACM, IEEE and other publishers abuse their size and established power to hold these works hostage. Of course it costs money to organize and maintain databases, but the way they wrestle you out of the copyright of your own works and then ask $25 per paper sickens me. I am legally not even allowed to put my own work on my own website.
> I am legally not even allowed to put my own work on my own website.
That's unfortunately still sometimes true, but I'm finding it's increasingly common for there to be self-archival exceptions, where you can put a self-prepared PDF on your personal website. I believe all IEEE and ACM publications now allow this. I've adopted a policy of just putting all my publications online and waiting to see if anyone complains.
Most journals now have an explicit policy that putting up your preprints online is OK. You can check the status of any particular journal or publisher here: http://www.sherpa.ac.uk/romeo/
Thank you, the ps/pdf wall most research is behind greatly limits its availability.
Wait until they've published and do it anyways. If they sue/etc you the Streisand effect should hopefully help your notability (and thus career) more than not.
Yeah, as long as you are not obviously threatening their ability to make money off our backs, they will probably not come after us. I think the publishers would rather not face the negative publicity they would get if they actively went after researchers who make their work available on their personal websites.
> There is also a form of rape that is not forced?
Some instances of statutory rape are entirely unforced by any common definition of 'force'.
(The law doesn't see it that way, but the law also sees statutory rape as a strict liability offense, meaning that lack of intent to commit it isn't a possible defense. In short, if someone lies about their age to have sex, they just got raped.)
Other jurisdictions sound more reasonable than that: in France, the term for "statutory rape" is "détournement de mineur" (roughly, "corruption of a minor"). It applies to any adult (over 18) who has sex with someone under 15. If the adult is in a position of power (a teacher, say), any minor (under 18) under his or her care is off-limits.
IANAL, but I believe the reasons it is a crime are the same: in the off-limits situations, the minor is presumed unable to give informed consent. But at least it's not called "rape".
(I find the term "Statutory rape" quite misleading. It is definitely not the same thing as plain rape, or worse, child rape. It looks like it has been pushed by a lobby of afraid, protective, conservative, possibly religious parents.)
One of the saddest things I can remember (though, unfortunately, the details escape me due to time) was a pair of rape cases in the US maybe a decade or so ago. One was a very violent rape where the sentence was something on the order of 15 years. The other was a statutory rape case with a fully consenting 16- or 17-year-old where the sentence was closer to 40 years. I was pretty young at the time (if it was a decade exactly I would have been 12), but that's definitely when I first realised that the "won't somebody please think of the children" arguments are bullshit.
Only on HN can one expect a genuinely insightful reply to a comment like that :) I never really read about the laws surrounding rape, but that sounds pretty unexpected :)
Really? Did you even read the post: "The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain..."
Actually, with almost every publication agreement, you sign over any right, title, and interest in the copyrighted work to the publisher. So, the law isn't protecting "your" copyright; it protects the publisher's. But, a published academic like yourself probably already knew that.
My girlfriend manages grant money for a large research institution where a sizable majority of the funds come from taxpayer-backed grants. You would be surprised (well, or maybe not) how often the researchers on the grants yell at her for not giving them unfettered access to "their money". Even our friends who are researchers at other labs will get heated with her when she defends the casually maligned administrators of their own grants. It's all very ugly and wrong-headed.
They have free journals in numerous fields, and gradually more big-name authors (in my field, neuroscience, at least) have been publishing in them. It's worth checking out.
Actually, the operation of PLoS journals raises an interesting question. These journals charge the researchers for publishing papers so that these papers are freely accessible to anyone. But the charges are in the range of $2200 - $3000 per paper! Doesn't that mean the traditional journals are charging reasonable fees to the readers?
[[TL;DR for this comment - Publishers are taking advantage of a prisoner's dilemma / competitive closed market to monetize the near-zero value they supply.]]
Professors don't get much money from direct publication -- in fact, many conferences charge the professors who provide the content. They get paid by grants and their schools. Professors don't want their research to reach a limited audience. The universities don't want this either.
The only people in the chain who want limited access are the publishers, since this is how they make money. But between researchers, universities, grants, and publishers, the publishers contribute the least value by far. They generally rely entirely on other professors to edit their journals and conference proceedings, and for all the content. They charge ridiculous rates - often thousands of dollars for a single annual journal subscription - and get away with it because the system is not prone to change. Researchers are rewarded for publishing in "the best" journals, so no one wants to take the leap to publishing in a free space where there is currently much less prestige.
That's basically why academic publishing is messed up. Because there's money to be made in keeping it messed up, and money to be lost in fixing it. But the ones who generate the real value _do_ want things to be as freely available as possible. If a critical mass of top-tier researchers agreed to stop publishing in non-free journals and conferences, it would probably start a revolution in this area -- but that's a lot to ask.
It's a prisoner's dilemma, in that the "traitor" researchers who keep publishing in the old journals will be rewarded.
(This is all about academia -- I guess motivations may be different in industry-backed research.)
Well, there is a non-zero cost to hosting and delivering documents, as well as creating the infrastructure to allow it.
Those costs must be made up somehow in the business model.
(before I get flamed, I'm not arguing that all documents should be $19 per article. I'm just saying that there's a non-zero cost that needs to be passed on to the customers somehow).
Also, JSTOR does NOT get those documents for free. They may have to pay the publisher to host them on their server.
Academic publishing is a bitch and there is a lot going on lately towards a common, worldwide reform. However, JSTOR is not really the bad guy here. Other publishers (e.g. Elsevier http://en.wikipedia.org/wiki/Elsevier) make billions by publishing mainly work paid for with taxpayer money.
JSTOR does not get them for free, but it gets them for a lot less than you think and they make a lot more than you think they do. Here are two analyses of their finances based on their mandatory IRS filings:
The content doesn't, actually. I'll look up the specific case when I get home, but phone book publishers most definitely do not have any copyright claims to the phone number or addresses. They do have copyright claims to the font choice, layout, etc., so you can't copy the phone book wholesale, but you can take the phone numbers, rearrange them, and set them in a different typeface and you'd be in the clear.
I don't believe there's any part of copyright that takes the amount of effort into account. There was a case (again, I'll look it up when I get home) where a copy of a work that was in the public domain was granted its own copyright because it had a few errors in it. The errors were unintentional, but copyright was given. In the phone book case, the court specifically mentioned that even though there was no doubt a significant amount of work required to produce the phone book, that didn't give the publisher copyright.
> I'll look up the specific case when I get home, but phone book publishers most definitely do not have any copyright claims to the phone number or addresses.
What about Springer or Nature publishing? Unfortunately, the real evil is rooted in a system that measures excellence based on publishing records and not scientific impact. Until there's a replacement for that, life scientists will keep sheepishly bowing to the publishers. That replacement could be an idea for a startup.
> Unfortunately, the real evil is rooted in a system that measures excellence based on publishing records and not scientific impact
Can you clarify this a bit? How do you get to have impact without first publishing it somewhere (at least mildly reputable) and having other people actually read it/review it/run with it?
My ideal system would be a global open system where scientists would publish their results, not papers. Other scientists would be able to review, confirm and reuse your results transparently and third parties would be able to assess your position from your record and the feedback that you get from fellow scientists. That could involve many more signals rather than just the (journal+number of citations) that everyone uses now.
The key issue with bittorrent is that if no one is seeding, the files are "gone". Further, ISPs like to block/filter it.
Of course, a case could be made that university libraries should mirror all public domain digital documents of sufficient interest. Which shunts the cost onto the taxpayers.
Regardless, the cost still is non-zero, and someone has to bear it.
Wikipedia is full of general purpose knowledge and paid for by donations. Even if you never make use of 99.99% of the knowledge it has, there is something for you, and a reason for you to donate.
Academic journals are full of incredibly specific knowledge. While some of it is accessible to people with a moderate amount of intelligence/diligence, any journal whose name isn't "Science" is designed for and by an incredibly small group of individuals with an incredibly specific knowledge base. Where is the donor base for that?
Answer: Universities are the only plausible donors. And they seem to think maintaining their place of privilege in being able to access the articles is worth some extra bucks.
I work at a university. I don't think universities have any vested interest in maintaining their place of privilege. Certainly faculty members don't, with rare exceptions (at least one of which I know about).
But I think there is not a lot of interest in changing the system. The people who could change things are not directly affected by the disadvantages of the current system.
The issue is, university professors already have access to these papers. They are among the few people that want, need, and can understand the science behind it. It looks like it makes sense that universities are willing to pay for that privilege.
We have services like YouTube and Flickr that host movies and pictures for free. The data requirements for those services are far greater than they would be for hosting an academic journal. If you didn't want to go the centralized route, a small journal with under 1,000 readers (around 10 MB per month per user) could probably buy hosting for under $100/year. Something that small would be easy to cover out of pocket or via donations.
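A quick sanity check of that figure. The reader count and per-reader traffic are from the comment above; the per-GB transfer price is my own assumption:

```python
# Back-of-the-envelope check of the hosting claim, under its stated
# assumptions (about 1,000 readers at roughly 10 MB per reader per month).
# The $0.10/GB transfer rate is an assumption, not from the comment.
readers = 1_000
mb_per_reader_per_month = 10
gb_per_month = readers * mb_per_reader_per_month / 1_000

cost_per_gb = 0.10  # assumed commodity bandwidth price
annual_cost = gb_per_month * 12 * cost_per_gb

print(f"{gb_per_month:.0f} GB/month, ~${annual_cost:.0f}/year")
```

That works out to about 10 GB/month and on the order of $12/year in transfer costs, comfortably under the $100/year figure even after adding storage and a modest server.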
Services like YouTube and Flickr have adverts matched to a user's tastes and the particular media. How would you propose doing that with scientific journals?
Providing it as a torrent lowers the costs at least. Plus, if they were concerned about infrastructure costs they'd have no problem finding mirrors willing to donate bandwidth.
Yes, it's non-zero. But for many, many organizations and many individuals, the non-zero is pretty much zero. The Internet Archive, for example, could absorb something like this without even blinking. As could Google Books.
True, but JSTOR is supposed to be a non-profit initiative in the service of scholarship, not a revenue-maximizing corporation. It's largely funded by taxpayer money. I agree they need funding to continue their work, and some of that comes from building up an internal revenue stream so they aren't completely dependent on the winds of year-to-year grants. But I've been disappointed overall by their lack of any commitment to make more things publicly available, given their charitable mission.
When JSTOR was first getting off the ground 15 years ago, many of us had higher hopes for the organization's direction, that it'd eventually be a mixture of subscription-only access for recent non-open-access journals, and free-to-anyone digitization of old, public-domain materials, like a large-scale version of Project Perseus. But nobody at JSTOR really pushed the 2nd part; they digitized old documents, but kept them all subscription-only, and never even developed plans (despite occasional agitation) for individuals to buy subscription access for any sort of reasonable price. They seem to have developed a bit of an unhealthy attachment to "owning" the JSTOR archive as their valuable crown jewel, despite officially just being custodians of it on behalf of the academic community.
I was on JSTOR's side of this argument until I read your comment. If it's funded in part by taxpayers, then the data they produce ought to be at least in part taxpayer property.
A (somewhat controversial) question that came up the last time I was discussing this with someone was whether all JSTOR material should be available on the net. This does mean that American taxpayer funded research is now available to practically everyone in the world.
What follows is third hand information and not even directly related to JSTOR. So it could be totally wrong. But I am told that at the moment not all of the journals/articles in all areas are available to universities outside the US. Apparently it is hard to get hold of research related to subjects that the US thinks are not "exportable" because of potential impact on national security. Things like nuclear medicine, areas of chemical engineering etc. The journal publishers act as a point of enforcement in this story.
Yet, as you say, they are nonprofit. Are they therefore actually making any profit?
I guess that they are not a public company, so unfortunately we may not have access to the revenue data needed to determine whether they are making a profit and how any such profit is being used. In that case, the safe bet is probably to assume that they are making a profit. If, however, they are in fact not making any profit, because archiving and maintaining these documents is paid for by the subscription charges, then this torrent does a disservice to science rather than acting as a liberator.
Without that information, however, I do not think it is easy, or indeed possible, to decide rationally.
Check out http://www.generalist.org.uk/blog/2011/jstor-where-does-your... for a great analysis of JSTOR's general revenue breakdown. The top 13 officers earn over $3 million in salaries. So while they may be a nonprofit, profits do matter to some of them.
Non-profit does not mean that they do not make a profit; it just means that any profits are put back into the organization instead of going to its leadership or shareholders.
Shareholders yes, though there's more gray area on "leadership". There's no law against using the revenue of a nonprofit organization to pay its leadership arbitrarily large salaries, unless it rises to the level where the IRS starts to see it as dirty tricks. I have no idea what JSTOR's are, though (or if that's even public). But I don't suspect they are pocketing the cash.
My guess is that their reticence to open things is more of an institutional megalomania of sorts. Sort of what you often find at art museums--- while their official mission is to promote the arts and educate the public, many of their staff are at least as interested in building the prestige/fame/fortune of their institution, which explains why they would have weird policies about reproducing their public-domain artworks.
As others have said, the costs of distribution are non-trivial, and these articles are almost certainly a loss for JSTOR. For example, the exemplary arxiv.org, which provides a relatively 'simple' hosting service where the author prepares the full manuscript, still has annual running costs in the region of $500,000. Even arxiv.org does not seem to have found it easy to secure funding from other universities and institutions, which surely find it essential (most costs are borne by the Cornell library).
While there is great value in the JSTOR articles (all published before 1923), in reality very few active researchers will be reading them. Some of the research will be just plain wrong, some corrected or superseded, and whatever useful material is left will be readily found in much more modern and useful presentations.
I think the more important battle is to ensure all current research is deposited in open archives (preferably something like arxiv.org). The historical stuff will surely follow...
That's actually insanely frugal--- $381,000 total budget! JSTOR is very reticent about their budget, but from IRS filings it appears their annual revenue is in the ballpark of $55-65 million, with about $8m of that going as payments/revenue-share to publishers, leaving somewhere north of $40m for JSTOR operations.
It's hard to say how necessary a budget of that size is, made harder by the fact that they publish absolutely nothing about who they are, what they do with the money, how they plan for it, etc. The only source of financial information is their legally mandated annual IRS filing.
Medical doctors were very unhappy to pay for research papers that were funded by taxpaying Americans. That's why since April 2008 all articles funded by the NIH have to be freely available: http://publicaccess.nih.gov/ . Remember, it's impossible for law to work backwards.
I am also quibbling with your claim “it's impossible for law to work backwards.” (IANAL)
1) It is unconstitutional to prosecute an individual for a crime that was legal at time of perpetration.
2) However, the opposite — decriminalizing previously illegal behavior — can be retroactive.
3) Changing future interpretation of copyright, etc., isn’t the same as case #1. If I’m not mistaken, Congress has passed e.g. the Mickey Mouse copyright law and the DMCA, which both extended “protection” & duration of copyright on previously created works.
You are mistaken on retroactive copyright; publishers and artists who failed to secure copyrights the first time around did not get retroactive postmortem copyright granted to them. The classic example is the works of H. P. Lovecraft, whose stories are considered public domain, as there is nobody with a reasonable claim of copyright to them.
> Remember, it's impossible for law to work backwards.
Really? I think it's just politically less convenient than making laws that apply to future research. There's nothing that would keep the government from removing copyright all together if they wanted to.
I don't fully understand the logic. Even if the underlying content is free, how are JSTOR scans public domain? If Google spends millions of dollars scanning in public domain books, I don't see how that gives Bing the right to download them all from Google unless given permission.
It also seems we all benefit from allowing companies to invest in scanning public domain works since for whatever reason nobody is doing this by hand now.
Scans of public-domain work remain in the public domain under U.S. law, at least as currently interpreted. There's no Supreme Court decision on the subject, but the leading case widely followed is Bridgeman Art Library v. Corel (1999) (http://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_...)
Access to network resources is another matter--- Bing might be violating Google's access policies if they slurped Google Books, especially if they evaded rate limits or ignored robots.txt. But they wouldn't be in violation of any copyrights. This makes it a bit of a cat-out-of-the-bag situation. If I download a copy of an 1878 journal from JSTOR, and email it to a friend, I'm violating JSTOR's terms of service. But if my friend turns around and posts the PDF online, he is not breaking any laws.
A similar situation holds with U.S. government classified documents. Whoever leaked the Pentagon Papers broke the law, and could be prosecuted for the leak. But once they were out, as public-domain government documents, it was not illegal for third parties to republish them.
You appear to be advancing a sweat of the brow (http://en.wikipedia.org/wiki/Sweat_of_the_brow) theory of copyright— a view which is rejected by US law (the law in other places may be different). See Bridgeman v. Corel for an excellent example.
Of course, I don't begrudge people who scan the ability to get paid for their work. But there are a great many ways to get paid which do not involve forever robbing from the public domain (because, of course, if they are able they will simply keep updating their scanned copies while locking the public away from the originals).
Very interesting. That's exactly what I was arguing.
I don't quite understand where the line is though. For instance, why isn't Google Map information public domain? The locations of roads aren't copyrighted and Google's efforts to enter this information don't seem that different than compiling a phonebook.
Probably you don't understand where the line is because no one else is quite sure either.
Most of the underlying information on Google Maps is public domain—road locations and physical features of the terrain are all uncopyrightable facts. What they claim copyright to are the map visuals: the particular style of drawing the maps, and the photographs they took. (though perhaps even that is debatable, since the photography is a robotic process)
You should be able to take the Google data and extract everything uncopyrightable from it to draw your own maps. (And if you're interested in public maps, you should see the OpenStreetMap project.)
But say you want to include some of the other things on Google Maps--place markers for hotels and restaurants, for example. Any one of them is an uncopyrightable fact: that's where it is. But what if you choose to mark the same collection of the same sorts of things that Google has marked--are you infringing on their copyright in the selection? The more you can make an argument that there was some original thought in the selection the harder it is to tell.
US law is pretty clear on plain facts being ineligible for copyright, even if it took work ("sweat of the brow") to compile them. But copyright in things like collections of data is much less clear.
However, Google hasn't done much (any?) creative stuff by hand. It's all automated. The input data is just facts, and the process is automated, so where's the creativity? Are automated processes able to create copyright-able works? I've no idea.
The line is creative expression. Copyright restricts copying of original, creative expression.
Copyright law does not restrict copying of facts or ideas.
"(b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work."
http://www.copyright.gov/title17/92chap1.html#102
Thus, you can freely copy the facts (information) in google maps.
Google may be able to claim copyright on map images it produces, but I'd expect courts would find that they are just a mechanical reproduction of factual information, limiting the copyright to just the presentation choices (colors, fonts, styling, etc.)
From the same reference: "(b) The copyright in a compilation or derivative work extends only to the material contributed by the author of such work, as distinguished from the preexisting material employed in the work, and does not imply any exclusive right in the preexisting material. The copyright in such work is independent of, and does not affect or enlarge the scope, duration, ownership, or subsistence of, any copyright protection in the preexisting material."
At least in the US, it seems the courts have decided that faithful reproductions of public-domain 2D art are not themselves copyrightable. I suspect this could apply here.
The works are in the public domain. Unless you think that simply scanning and producing a bit-for-bit accurate copy of a work in the public domain produces a new work that deserves copyright protection?
Maybe I have misread, but I believe he is saying that the scanning costs are negligible compared to what they are charging/making. He mentions that when the journals were on paper, the publishers charged just enough to keep their presses running, but now they are getting greedy and making enough money to have political clout.
Bottom line is that people shouldn't make so much money off the back of unpaid academics/scholars. Personally I feel like if they charged something like 99 cents for a DRM free version of each paper (with free abstracts still of course) this might be more reasonable.
Anyone has the right to download anything from anyone else. It's the content-provider's server and it's their job to tell it who it should and shouldn't give content to. Making content accessible is implicitly giving permission to download it. It's how the internet works, by design.
I think the title is kind of wrong.
The torrent author says:
"I've had these files for a long time, but I've been afraid that if I published them I would be subject to unjust legal harassment by those who profit from controlling access to these works."
So he clearly says it's not in the public domain, but that from a moral point of view it should be available to the masses. The description of the torrent is very interesting. Definitely worth taking a look even if you're not willing to download the actual torrent.
The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain, total some 18,592 papers and 33 gigabytes of data.
I think this statement means that the documents are indeed in the public domain.
I also think equark makes a valid point about the scans. Further, I don't understand why these guys are going after JSTOR, a non-profit organization. I'd be more understanding of their methods if they went after somebody like Elsevier.
You are correct. This portion appears to be in the public domain (though not publicly available). Non-profit or not, the fact remains that most of these documents remain behind paywalls while still being considered to be in the public domain.
You have a valid point though. Access to these documents seems to be governed by agreements between various publishers with the aim of sharing the published content among various institutions (http://en.wikipedia.org/wiki/JSTOR). So there is no point blaming JSTOR for the lack of access.
You missed the word "unjust" in your reading. He very clearly claims elsewhere in his manifesto that the documents are public domain. The statement you elided would then be in conflict with that, but only if you remove the word "unjust".
They are in the public domain in a printed copy. You can copy from them, reprint them, make books of them and charge for it. You can scan them and put them online, the whole millions of them. Yet that does take work and who should pay for this work? Is it not a great service to mankind that these documents are being made available from the comfort of your home?
I do not understand one simple thing. JSTOR is not for profit. Are people therefore saying that it is making a profit? If yes, the action is understandable. But if not, then, as much as I am and was annoyed to face a JSTOR paywall, and as much as there is an argument for making such a paywall means-tested, what are people actually saying? That this non-profit organisation should not be able to support its own existence?
Yes, government should support it rather than spend money on wars, but I simply do not understand why anyone would want to kill off a non-profit organisation.
How many of you are going to read these articles? How many of the general population would want to? Save for professors and students, how many would even attempt to get through a journal article?
It just seems, frankly, that a self-interested ideology of disdain for copyright has taken root on Hacker News, and people have stopped considering long-term effects, focusing instead on the temporary pleasure of the now.
>They are in the public domain in a printed copy. You can copy from them, reprint them, make books of them and charge for it. You can scan them and put them online, the whole millions of them. Yet that does take work and who should pay for this work?
Whoever wants to. But doing so does not give you copyright over the content.
>I do not understand one simple thing. JSTOR is not for profit. Are people therefore saying that it is making a profit? If yes, the action is understandable. But if not, then, as much as I am and was annoyed to face a JSTOR paywall, and as much as there is an argument for making such a paywall means-tested, what are people actually saying? That this non-profit organisation should not be able to support its own existence?
$19 per copy is extremely expensive, considering their expenses after the document is scanned are close to 0 (and as we can see, hundreds of people are willing to do it for free).
I had considered releasing this collection anonymously, but others pointed out that the obviously overzealous prosecutors of Aaron Swartz would probably accuse him of it and add it to their growing list of ridiculous charges. This didn't sit well with my conscience, and I generally believe that anything worth doing is worth attaching your name to.
I'm interested in hearing about any enjoyable discoveries or even useful applications which come of this archive.
No, just not stupid enough to write the opposition's briefs for them.
But really, you don't need to be a genius to obtain large collections of scientific papers. Many people have them; it's a prerequisite for doing meta-analysis, for example. The distinction here is making them available to the general public.
Facebook even censors Pirate Bay links within private messages between you and friends, IIRC. They don't block certain other torrent sites, though. They're a bit odd like that.
Looks like FB blocks links to torrent detail pages on TPB.
Best I could do was post a link to things tagged with JSTOR (there's only one):
http://thepiratebay.org/tag/JSTOR
I believe FB blocks links to TPB. They actually changed my password and sent me an email saying that they thought my account had been compromised when I tried to do it once. However, as alanh said, a little redirection will work to post it.
Looking into it. That domain is not blocked. It is possible to post links to the root domain, just not deep links. It could be a bug or misconfiguration on their side. FB does a ping to posted links to grab the title, thumbnail, etc. That seems to be failing for deeplinks to TPB.
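A link-preview fetch of that sort boils down to grabbing the page's title (plus thumbnail metadata). Here is a toy sketch of the title-scraping half using Python's standard-library HTML parser, run on a hard-coded HTML string rather than a live fetch; the page content is made up for illustration, and real scrapers look at more than the title tag:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside <title>, the first thing a link-preview scraper reads."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for the HTML a scraper would fetch from a posted link.
page = "<html><head><title>JSTOR - torrents tagged</title></head><body></body></html>"
p = TitleParser()
p.feed(page)
print(p.title)  # JSTOR - torrents tagged
```

If a fetch like this fails (a timeout, a block on the scraper's user agent, a malformed response), the posting flow that depends on it can fail too, which would be consistent with deep links to TPB not going through.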
Well, a manual level of indirection would solve it: publish a web page with little more than a link to the Pirate Bay page, and post a link to that page on Facebook. If you use Simplenote's published-page feature or Dropbox's public folder, this takes but a minute.
I wonder what would happen if this happened to Westlaw, one of the 2-3 industry standards for law firms. Incredibly expensive.
The irony alone in the law profession would be tremendous. It'd be interesting to see if law firms would illegally access it: It would be obvious they were if they previously only subscribed to Westlaw, but the reality is most law firms subscribe to more than one database for emergency backup.
State and federal court opinions are freely available from the courts themselves. Westlaw, LexisNexis, etc. gain much of their value from tools they provide for analyzing the opinions. These companies manually produce headnotes, case histories, and citation history (i.e., "shepardization"). Collating this value-added data would be vastly more difficult than automatically pulling opinions out of the PACER database (or even JSTOR). Not that I wouldn't like to see somebody try it . . .
How many different paywalls are there, and how many articles are trapped behind them? I'm able to access five different databases plus one of the biggest research libraries in my state, and there are still often articles that are simply inaccessible. Or, even more ridiculous: single chapters of a book sometimes cost up to $20 online, even when the complete book could be bought for $30...
In my mind an easier way to disrupt this system would be to create a p2p site for article sharing - this often takes place informally anyway.
Just a place where you could ask "Does anyone have article..." and a friendly person would upload it to some filehoster and share the link.
This illustrates the sad state of affairs. The technology to distribute this equitably (torrents) is there. Scanning these documents is a non-trivial task, and beyond intellectual curiosity most people would only ever need a handful of the papers in this collection. However, the pricing and legal restrictions put on distribution go against the history of scholarship. The only reason we have so many ancient works of prose and scholarship is that monasteries of various creeds institutionally copied them (by hand).
Someone I know is suggesting that these documents were already free and available on the web. I don't really know, since I haven't downloaded the torrent and cross-referenced (and don't really have the bandwidth to). Here are the links he's provided:
Can you find this online? "Description of the Brain of Mr. Charles Babbage"
T1 - Description of the Brain of Mr. Charles Babbage, F.R.S
JF - Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character (1896-1934)
VL - 200
SP - 117
EP - 131
PY - 1909/01/01/
UR - http://dx.doi.org/10.1098/rstb.1909.0003
M3 - doi:10.1098/rstb.1909.0003
AU - Horsley, V.
So what percentage is that? Is a "page" still the 2k standard these days?
Per Wikipedia: as of November 2, 2010, the database contained 1,289 journal titles in 20 collections representing 53 disciplines, and 303,294 individual journal issues, totaling over 38 million pages of text.
Very little. The indictment against aaronsw claims that he got over 4 million documents, and this archive only has 18,500. And I'm not even sure what percentage of the total aaronsw got.
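The rough arithmetic, taking the figures in this thread at face value (18,592 papers in the archive, "over 4 million" documents in the indictment), looks like this:

```python
# Back-of-the-envelope share of the indictment's document count,
# using the figures quoted in this thread.
papers_in_archive = 18_592        # pre-1923 Phil. Trans. papers in the torrent
docs_in_indictment = 4_000_000    # "over 4 million documents" per the indictment

pct = 100 * papers_in_archive / docs_in_indictment
print(f"{pct:.2f}%")  # about 0.46% of what the indictment describes
```

So the torrent covers well under one percent of the documents at issue, even before counting the rest of JSTOR's holdings.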
This is only the back issues of one journal, the Philosophical Transactions of the Royal Society. It's very significant historically, but far from all of JSTOR's public-domain holdings.
Can someone explain who exactly is so hellbent on prosecuting aaronsw? Whose eye did he poke? JSTOR isn't even pressing charges, so I assume it's some other party.
Only just realised Eircom has blocked access to TPB in Ireland at the request of the four major record labels. It annoys me that this HN post is only tangential to the issue of music piracy and gets caught in the block anyway. Another step closer to censorship.
Not terribly interesting.
Although it is one of the oldest (if not the oldest) journals around, it is simply not widely read or terribly important.
In addition, from the journal's web site:
"All issues back to 2001 are free to access two years after publication."
Now, a collection of all of Nature's issues would have been fascinating.
Scribd should index and host these. I doubt that would greatly add to their current huge scale. They could make a dedicated site for it. They could also probably get the help of the FOSS community to help make the search faster.
So, the solution to one publisher locking up and charging for access to public domain material is... copying the material to another publisher that likes locking up and charging for access?
EDIT: I should delete this because it is wrong, but I won't so as not to confuse the reply.
Thanks, gwern. Sorry for not RTFA first, but I've adopted a habit of reading comments first.
---
I believe these documents are still under copyright, so hosting them would give Scribd significant liability. They exist as a DMCA-compliant company, so they are protected from their users' actions; but proactively doing what you suggest would remove the DMCA protections.
> The portion of the collection included in this archive, ones published prior to 1923 and therefore obviously in the public domain, total some 18,592 papers and 33 gigabytes of data.
> The documents are part of the shared heritage of all mankind, and are rightfully in the public domain, but they are not available freely. Instead the articles are available at $19 each--for one month's viewing, by one person, on one computer. It's a steal. From you.