
Working in geology, I find the opposite problem. Field work is so highly valued that we're at a place where we have so much data and not enough people actually working on and analyzing it. My general impression is that in some subfields, work done exclusively with preexisting data is kind of looked down on. In my opinion, tons and tons of money is essentially wasted collecting new data - which is then poorly catalogued and hard to access. You typically have to email some author and hope they send you the data. People are fiercely protective of their data b/c it took a lot of effort to collect, and they want credit and to be in on any derivative work (and not just a reference at the bottom of a paper).

I would say the main workflow is: collect some new data nobody has collected before, look at it and see if it shows anything interesting, then make up some interesting publishable interpretation.

It feels like it'd be smarter to start by working with existing data and publish that way. If you hit on some specific missing piece, go collect that data and work from there. But the incentive structures aren't aligned with this.

The AI angle is really shoehorned in and irrelevant to the larger problem. Sure, it allows you to annotate more data. Obviously it's more fun to go do field work than count pollen grains under a microscope. If anything, AI makes it easier to do more fieldwork and collect even more data, b/c now you can in theory crunch it faster.


This is largely solved in biomedicine by funders (not journals) and regulatory bodies requiring that human subjects research data be stored with the NIH.

I guess there may be a broader and less public-oriented set of funders in geology - and maybe there aren't as many standardized data types as there are in the world of biology.


Given the way big tech plays fast and loose with other people's data, I don't suppose the siloed nature of geological data is going to get better any time soon.

Perhaps creating secure private clouds that scientists can access, away from AI scrapers etc., with associated counter-surveillance, is the way forward.

I'm a GIS guy working on cloud-native tech, but with a focus on privacy. I have a local-first Mac-native product nearing beta. I'm thinking a lot about what the data-sharing options could be at the moment.


I don't see what the problem is. AI is mostly irrelevant. Okay, they scrape your data... but then what? If the data isn't officially published and doesn't have a DOI, anything built on it won't be accepted.

Some people scrape charts in publications to extract data. This has been done for a while. Maybe AI could automate this step. That'd be useful.


I understand that publications are the currency of academics but they're largely irrelevant in business. Geological data are valuable and if an oil exploration company finds a nice dataset they can scrape, they're not going to publish it.

From a pure business perspective, AI is largely about copyright circumvention. The laws are lagging and people are making serious money from data theft.


Aren't you describing trade secrets? I don't see how AI makes that any better or worse. If your competitor gets his hands on your proprietary dataset you're sunk regardless of AI, right?

I don't see how copyright enters into it. I doubt that "oh hey I published this very valuable and proprietary dataset online but it's copyright me so pretty please don't use it to make money" was ever going to get you anywhere to begin with.


Am I understanding this correctly? So internally, if a company is using a competitor's stolen data directly, then if anyone finds out they're in legal trouble. But if they train a model and then use the model, they're in the clear?

No. Maybe. It depends.

If the other party can prove you broke into their system then you've got a problem.

If you redistribute something that's copyrighted you've got a problem.

Merely possessing something isn't a problem in and of itself ... probably? Unless the other party can demonstrate damages at least.


Yes, I think there's evidence for that. Looking at recent precedents, even if the data are illegally downloaded, big tech has been getting away with using copyrighted data, for example:

https://news.bloomberglaw.com/us-law-week/big-tech-wins-in-c...


Sounds like a business opportunity for someone: create a web portal for making such data available with licensing terms, indexing and cataloging it with a nice search engine, etc.

The databases exist; for instance: https://www.pangaea.de/

What you need is people uploading data in consistent, well-documented formats. There are all sorts of projects that do this, but there is a strong incentive to not upload things, or to sort of half-upload them... in a way where anyone using the data is going to have to reach out to you. Not suggesting bad intentions - maybe you're still working with the data, expect to publish more, and don't want someone swooping in and beating you to the punch. Typically journals require data availability, but it's kind of informal and ad hoc.


It's interesting to contrast with Wikipedia. I'm not deeply involved with either, so I'm talking out of my ass and would be curious to hear other people's thoughts here. But Wikipedia has gone to great lengths to keep the data side (Wikidata) and the app/website decoupled. I'm guessing iNaturalist hasn't?

The OpenStreetMap model is also interesting: they basically only provide the data and expect others to make the apps/websites.

That said, it's also interesting that there hasn't been any big hit with people building new apps on top of Wikidata (I guess the website and the Android app are technically different views on the same thing).


I'm not convinced that that's an accurate view of Wikidata. Wikidata is basically a disconnected project. There is some connection, but it's really very minimal and only for a small subset of Wikipedia articles. Wikipedia is 99% just text articles, not data combined together.

Frankly, I think the reason people haven’t built apps on top of Wikidata is that the data there isn’t very useful.

I say this not to diss Wikimedia, as the Wikipedia project itself is great and an amazing tool and resource. But Wikidata is simply not there.


I am also frustrated with Wikidata. The one practical use I've seen is that a lot of OpenStreetMap places' multilingual names are locked to Wikidata, which makes it harder for a troll to drop in and rename something, and may encourage maintaining and reusing the data.

But I tried to do some Wikidata queries for things like: what are all the neighborhoods and districts of Hong Kong, or all the counties in Taiwan, and the coverage is piecemeal, tags differ from one entity to another, and not everything in a group is linked to OSM. It's not much of an improvement over Wikipedia's Category pages.


Wikidata is a separate project, specifically for structured data in the form of semantic triples [0]. It's essentially the open-source version of Google's KnowledgeGraph; both sourced a lot of their initial data from Metaweb's Freebase [1], which Google acquired in 2010.
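(A triple is just a subject-predicate-object statement, e.g. (Douglas Adams, educated at, St John's College) - a canonical Wikidata example.)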

[0] https://en.wikipedia.org/wiki/Semantic_triple

[1] https://en.wikipedia.org/wiki/Freebase_(database)


Many Wikipedia articles have infobox fields that pull their values from Wikidata (and are only editable through Wikidata).

> But Wikipedia has gone to great lengths to make the data side, Wikidata, and the app/website, decoupled.

A big part of that is that the different language editions of Wikipedia are very decoupled. One of the goals of Wikidata was to share data between the different language Wikipedias. It needed to be decoupled so that it was equal to all the different languages.


What specifically did you have issues with?

I made a GUI with cljfx, which uses JavaFX, and I didn't really hit any issues (save for one bug on startup that I've had trouble ironing out). The app is snappy and feels as native as anything else.

I ended up with a very modular, functional GUI program.

The only thing I wasn't super happy about is that I couldn't package it as a simple .bin/.exe, b/c the jpackage system forces you into making an installer/application (it's been a few years since, so it's possible there's a GraalVM native solution now).

I highly recommend cljfx. It's the opposite of clunky.


Honor, strangely enough, doesn't make any effort to really support Linux.

The machine quality is pretty damn good, but Huawei machines are still better - Apple levels of quality. And Huawei releases their machines with Linux preinstalled.

The company to watch is Wiko. It's their French spin-off, created to sidestep the chip ban. They might put out some very nice laptops, but that's a bit TBD.


Can't blame him. We're in a bit of a bananas situation where open source isn't the antonym of closed source.


This isn't that uncommon:

* If a country doesn't have "closed borders" then many foreigners can visit if they follow certain rules around visas, purpose, and length of stay. If instead anyone can enter and live there with minimal restrictions we say it has "open borders".

* If a journal isn't "closed access" it is free to read. If you additionally have permissions to redistribute, reuse, etc then it's "open access".

* If an organization doesn't practice "closed meetings" then outsiders can attend meetings to observe. If it additionally provides advance notice, allows public attendance without permission, and records or publishes minutes, then it has "open meetings".

* A club that doesn't have "closed membership" is open to admitting members. If additionally anyone can join provided they meet the relevant criteria (if any), then it's "open membership".

EDIT: expanded this into a post: https://www.jefftk.com/p/open-source-is-a-normal-term


* A set that isn't open isn't (necessarily) closed.

* A set that is open can also be closed.
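(For instance, in the reals the half-open interval [0, 1) is neither open nor closed, while the empty set and the whole line are each simultaneously open and closed, i.e. "clopen".)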


Who says it isn't? "Closed source" doesn't have a formal definition, but it can be arbitrarily defined as the antonym of open source, and when people use the term that's usually what they mean.

And that has nothing to do with whether someone can be "blamed" for ignoring the actual meaning of a term with a formal definition.


Supposedly this is the bee's knees and better than a stepper:

https://www.flow-storm.org/

(haven't used it myself yet)

But things like CIDER give you step debugging if you want. For whatever reason it always felt clunkier than Elisp's debugger.


Not a native Chinese speaker, but couldn't this have more to do with requiring qualifiers?

There seems to be a very strong preference for adding qualifiers, especially to single-character words. And 不 is just one of them.

很好 (very good) is strongly preferred to just a plain 好 (good). Similarly, you're seldom going to just say 錯, 差, or 行.

So if you want to say good, and not verrryyy good, you're left with "not bad": 不錯.

It still holds for two-character words. I was told that just saying 無聊 (boring) can have a very rude connotation, though I don't quite understand the subtext (maybe someone can elucidate). But if you say 很無聊 (very boring), it weirdly enough sounds better.

Unqualified words, as far as I understand, also often look archaic. This is tied to the preference for two-character words over one-character words (my guess is that b/c Classical Chinese is a written language, it's hard to make out a single-character word by tone when speaking).

Hopefully a native speaker can weigh in...


Plenty of other problems:

- They often render differently in different browsers and other renderers. It's very frustrating to get consistent results (like a PDF). In complex diagrams I'd say it's basically impossible

- Renderers that are fast usually lack many features

- Nobody other than the browser seems to actually have all the features?

- You can link an SVG within an SVG (to make a lightweight composite image). But if you have two levels of indirection, then all the renderers I've tried will refuse to render the SVG

- Inkscape is basically the only good editor on Linux and it easily runs out of memory and crashes for complex images

- Complex SVGs eat all your RAM in Chromium (only marginally better in Firefox)

- Basic things like arrows from Inkscape will not render anywhere else

I still use SVGs all the time, b/c there are no good alternatives, but it's a crappy standard, and I try to keep all my images/diagrams extremely simple.


Wouldn't it make more sense to run/emulate JVM bytecode on WASM instead of compiling Java to WASM? It seems like that'd be a much easier task.

From a high level, WASM and JVM bytecode seem incredibly similar (though I'm sure the containerization and IO are radically different). I never really understood why WASM wasn't some JVM subset/extension.

Not an expert at all in this, so genuinely curious to hear from someone who understands this space well


From my understanding, this works for C# but is an ill fit for Java. Java has simple bytecode with a powerful runtime to ensure all kinds of guarantees. C# focuses on compile-time checks with a more complex bytecode representation.

So instead you got TeaVM which is essentially a whole JVM in WASM.


One thing that is often overlooked but should be emphasized is that the considered frequencies are fixed values while the phase shifts are continuous values. This creates tons of downstream problems.

If your underlying signal is at a frequency that is not a harmonic of the sampling window, then you get "ringing" and it's completely unclear how to deal with it (something something Bessel functions).

Actually using DFTs is a nightmare...

- If I have several dominant frequencies (not aligned with the FFT bins) and I want to know them precisely, it's unclear how I can do that with an FFT

- If I know the frequency a priori and just want to know the phase shift... also unclear

- If I have missing values, how do I fill the gaps so as to distort the resulting spectrum as little as possible?

- If I have samples that are not equally spaced, how am I supposed to deal with that?

- If my measurements have errors, how do I propagate errors through the FFT to my results?

So outside of audio, where you control the fixed sample rate and the frequencies are all much lower than the sample rate... it's really hard to use. I tried to use it for a research project, and while the results looked cool... I just wasn't able to back up my math in a convincing way (though it's been a few years, so I should try again with ChatGPT's hand-holding).
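To make the ringing concrete, here's a minimal sketch (Python/NumPy assumed; the frequencies are made up for illustration): a tone that lands exactly on an FFT bin stays in that bin, while one that falls between bins smears energy across the whole spectrum.

    import numpy as np

    fs = 100.0           # sample rate (Hz)
    N = 1000             # samples, so the bin spacing is fs/N = 0.1 Hz
    t = np.arange(N) / fs

    on_bin  = np.fft.rfft(np.sin(2 * np.pi * 10.00 * t))  # exactly bin 100
    off_bin = np.fft.rfft(np.sin(2 * np.pi * 10.05 * t))  # between bins

    # The on-bin tone occupies one bin; the off-bin tone "rings"
    # across the whole spectrum (spectral leakage).
    print(np.sum(np.abs(on_bin) > 1.0))    # ~1 bin
    print(np.sum(np.abs(off_bin) > 1.0))   # hundreds of bins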

I recommend people poke around this webpage to get a taste of what a complicated, scary monster you're dealing with:

https://ccrma.stanford.edu/~jos/sasp/sasp.html


FFT/DFT is not precise if you do not have the exact harmonic in your signal. If you are also (or only) interested in phases, you might use a maximum likelihood estimator (which brings other problems, though).

And as the previous answer said: compressed sensing (or compressive sensing) can help as well for some non-standard cases.
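For the known-frequency, phase-only case, here's a minimal sketch of the ML idea (Python/NumPy assumed; the signal is made up): under white Gaussian noise the maximum likelihood phase estimate reduces to correlating against sine and cosine at the known frequency, and it doesn't care about uneven sampling.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.sort(rng.uniform(0, 10, 500))      # irregular sample times
    f = 1.7                                   # known frequency (Hz)
    x = np.sin(2 * np.pi * f * t + 0.6) + 0.1 * rng.normal(size=t.size)

    # Correlate against quadrature components at the known frequency
    c = np.sum(x * np.cos(2 * np.pi * f * t))
    s = np.sum(x * np.sin(2 * np.pi * f * t))
    print(np.arctan2(c, s))                   # recovers ~0.6 rad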


Do you have any good references for compressed sensing?

The high-level description on Wikipedia seems very compelling... And would you say it'd be a huge task to really grok it?



A while back I looked at matching pursuit. At first it seemed very complicated, but after staring at it a bit I realized it's simple.

- Start with a list of basis functions and your signal.

- Go through the list and find the basis function that best correlates with the signal. This gives you a basis function and a coefficient.

- Subtract out the basis function (scaled by the coefficient) from your signal, and then repeat with this new residual signal.

The Fourier transform is similar, using sine-wave basis functions.

The key that makes this work in situations where the Nyquist theorem says we don't have a high enough sampling rate is ensuring that our sampling (possibly random) is uncorrelated with the basis functions and that our basis functions are good approximations of the signal. That lowers the likelihood that a basis function correlating well with our samples is due to chance, and raises the likelihood that it correlates well with the actual signal.
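Here's a minimal sketch of that loop (Python/NumPy assumed; the dictionary of sine atoms and the test signal are made up):

    import numpy as np

    def matching_pursuit(signal, dictionary, n_iters):
        # dictionary: rows are unit-norm basis functions ("atoms")
        residual = signal.astype(float).copy()
        picks, coeffs = [], []
        for _ in range(n_iters):
            corr = dictionary @ residual           # correlate every atom
            best = np.argmax(np.abs(corr))         # best-matching atom
            picks.append(best)
            coeffs.append(corr[best])
            residual -= corr[best] * dictionary[best]  # subtract it out
        return picks, coeffs, residual

    t = np.linspace(0, 1, 200, endpoint=False)
    freqs = np.arange(1, 50)
    atoms = np.array([np.sin(2 * np.pi * f * t) for f in freqs])
    atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)

    signal = 3 * np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 21 * t)
    picks, coeffs, _ = matching_pursuit(signal, atoms, n_iters=2)
    print([freqs[p] for p in picks])   # -> [7, 21]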


You can use single-bin DFTs and not FFTs - basically use precomputed twiddles for a specific frequency. The FFT is only fast because it reuses operations across multiple frequencies, but if you need a specific frequency instead of the whole spectrum, then a single-bin DFT makes sense, right?

https://github.com/dsego/strobe-tuner/blob/main/core/dft.odi...
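A minimal sketch of a single-bin DFT (in Python here, rather than the Odin code linked above): evaluate one frequency directly in O(N), with twiddles that can be precomputed and reused across frames. Note the frequency doesn't have to lie on an FFT bin.

    import numpy as np

    def single_bin_dft(x, freq, fs):
        n = np.arange(len(x))
        twiddles = np.exp(-2j * np.pi * freq * n / fs)  # precompute per frequency
        return np.sum(x * twiddles)                     # one bin, O(N)

    fs = 1000.0
    t = np.arange(1024) / fs
    x = np.sin(2 * np.pi * 50.0 * t)

    bin50 = single_bin_dft(x, 50.0, fs)
    print(np.abs(bin50), np.angle(bin50))   # magnitude ~N/2, plus the tone's phase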


A somewhat related field, compressive sensing, attempts to answer some of those questions (particularly missing data, uneven sampling, and errors) using an L1-minimisation technique.


Can you recommend where to learn more about it? It looks like what I should be looking at... hopefully not a rabbit hole :))


Here is a good lecture on this subject:

https://www.youtube.com/watch?v=rt5mMEmZHfs


That was actually fantastic. The professor is quite goofy, but he really goes over everything from first principles and works through a real example - constructing a solution without any cheating :))

I was a bit bummed out that there weren't a lot of compressed sensing libraries around, but it seems you just need a "convex optimization" routine (the L1 problem can be cast as a linear program). And these seem to exist in every language.
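Here's a minimal sketch of that reduction (Python/SciPy assumed; the sparse signal and measurement matrix are made up): basis pursuit, minimize ||x||_1 subject to Ax = b, becomes a linear program by splitting x into nonnegative parts u and v with x = u - v.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, m = 200, 60                               # 200 unknowns, only 60 measurements
    x_true = np.zeros(n)
    x_true[[17, 83, 140]] = [2.0, -1.0, 3.0]     # sparse signal

    A = rng.normal(size=(m, n))                  # random measurement matrix
    b = A @ x_true

    # minimize sum(u + v)  s.t.  A(u - v) = b,  u, v >= 0
    res = linprog(c=np.ones(2 * n),
                  A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None))
    x_hat = res.x[:n] - res.x[n:]
    print(np.flatnonzero(np.abs(x_hat) > 1e-6))  # -> [ 17  83 140]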

I'll try to play around with this! Thank you so much


It's a fascinating topic and I'm still trying to get my head around some of the concepts.

Have fun on your discovery journey.


Are there any gotchas you've come across?

From the video tutorial it seems relatively straightforward. I guess the basis selection is a fundamental issue that will be problem-specific.

I will have to try it on some concrete examples. The first question I have is: will it still work if you have a lot of high-frequency noise? In the cases I'm thinking of, there is either measurement noise or just other jitter. So the lower frequencies are sparse, but I guess the higher frequencies not so much. I can't bandpass the data b/c it's got lots of holes or is irregularly spaced.

Maybe it'll still work though!

