rfarley04's comments | Hacker News

I live in Thailand and I cannot get over the fact that romanization is (seemingly?) completely unstandardized. Even government signage uses different English spellings of Thai words.


You should have seen Taiwan in the 1990s. It was a hot mess of older Western romanization systems, historical and dialectical exceptions, competing Taiwanese and pro-China sensibilities, a widely used international standard (pinyin), and lots of confusion in official and private circles about the proper way to write names and locations using the Latin alphabet. In 1998, the City of Taipei even made up its own Romanization system for street names at the behest of its then-new mayor, a supporter of Taiwan independence (https://pinyin.info/news/2019/article-on-early-tongyong-piny...).

The chart halfway down this blog post lays out some of the challenges once the hanyu pinyin standard was instituted in 2009:

https://frozengarlic.wordpress.com/on-romanization/

The author concludes with this observation:

So that’s why people in Taiwan can’t spell anything consistently and why all the English-language newspapers spell the same things differently. As for me, I’m giving up on trying to remember how everyone spells their name. I know lots of people, especially Taiwan nationalists, dislike having the PRC hanyu pinyin system. I dislike imposing it upon them. However, in only three weeks, I’ve found myself spelling the same thing in multiple ways and wasting time looking up how I did it last time. Since almost no one reads my blog anyway, I’ll do it the way that’s most convenient for me.

I’ll also always provide the Chinese characters so that people who can read them know who I’m talking about.


In the first place, "romanization" of English is unstandardized! Or was that unstandardised?


It tends to be standardized within a single country.


Standardizations can be notoriously inconsistent[1], disregarded[2] or evolve fast[3].

There’s a surprising number of interesting articles on Wikipedia about that.

[1]: https://en.wikipedia.org/wiki/Ough_(orthography)#Spelling_re...

[2]: https://en.wikipedia.org/wiki/Eye_dialect

[3]: https://en.wikipedia.org/wiki/Sensational_spelling


Whoosh :)


No I understood. I just failed to see the relevance.


Is it hiccough or hiccup in the US?


And standardised within an empire.


Yeah, my full names are Jeremia Josiah, and on my work permit they wrote the Thai version as เจอเรเมีย โยชิอา. I cannot figure out why they chose to use จ for the J in Jeremia but ย for the J in Josiah. Both are pronounced the same and I would consider จ the correct choice. I would consider ย more correct for representing a word with Y.


That's hilarious. The one I always notice is ก getting romanized as a K, e.g. Kanchanaburi or กานต์ becoming Karn.


Korea is stuck in a funny middle ground, where names like cities or railway stations all follow the standard without exception, while personal or corporate names are in a state of total chaos. So the cell phone maker is Samsung, but the subway station in Seoul is Samseong, even though they're written and pronounced in the same way in Korean. (No, they aren't related.)

It's unfortunate but I don't think it'll get fixed any time soon. Nobody wants to be called Mr. I, O, U, An, or No. (The most common romanization for these family names would be: Lee, Oh, Woo, Ahn, and Roh.)


You've nerd sniped me!

No country is going to force its big multinationals to change the international name they chose back in the 50s and are now known by worldwide. Personal names aren't too chaotic either, as the choices presented when picking a romanization are limited; people can't just make stuff up on the spot. They're off, but generally in the same ways.

> Nobody wants to be called Mr. I, O, U, An, or No.

An is pretty common - given the massive reach of KPop among global youth, I wouldn't be surprised if the most well-known 안씨 as of 2025 was an "An" (a member of the group 아이브). Roh has fallen out of favor; young 노s generally go with Noh, while the Rohs are usually older people. I too long for the day when an 이 or 우 just goes with I or U, or if they must, at least Ih or Uh :)

IMO you left out the worst offender, Park. At least with 이 or 우 I can see why people would be hesitant to go the proper route, as most of the world is unfamiliar with single-phoneme names, but 박s have no excuse.

With 이, there's a pretty good alternative as well, and what's more - it's actually already in use when talking about the greatest Korean in history, Yi Sun-Shin! So much better than "Lee".


Thailand, famously, was never colonized by European powers. Everywhere else, some colonial administrator standardized a system of romanization.


Oh there are plenty of standards, including an official one. The problem is nobody uses them. Thai writing is weird, and between the tones and the character classes and silent letters might as well just make some shit up. My birth certificate, drivers license, and work permit all had different spellings of my name on them.

IIRC, the road signs for “Henri Dunant Road” were spelled differently on either end, which was ironic, because at least that did have a canonical Latin form.


Japan was not colonized, although it was briefly occupied.


Sri Lanka was a colony and Sinhala does not have a standard as far as I know. If there is one no one pays any attention to it.


I'm a full time copywriter for SaaS companies and I'm actually finding the opposite. My experience is people are having AI write stuff then trying to massage it themselves. When they can't get it to a point where they're happy with it they eventually just throw up their hands and hire me for pre-AI project scopes with 2025 rates. Not saying that's the experience everywhere, but AI has been much less problematic for me than most of the narratives I've seen online (knock on wood)


That's interesting.

A problem I have with Brian Merchant's reporting on this is that he put out a call for stories from people who have lost their jobs to AI and so that's what he got.

What's missing is a clear indication of the size of this problem. Are there a small number of copywriters who have been affected in this way, or is it endemic to the industry as a whole?

I'd love to see larger scale data on this. As far as I can tell (from a quick ChatGPT search session) freelance copywriting jobs are difficult to track because there isn't a single US labor statistic that covers that category.


> he put out a call for stories from people who have lost their jobs to AI

This seems like an inherently terrible way to look for a story to report. Not only are you unlikely to know if you didn't find work because an AI successfully replaced you, but it's likely to attract the most bitter people in the industry looking for someone to blame.

And, btw, I hate how steeply online content has obviously crashed in quality. It's very obvious that AI has destroyed most of what passed as "reporting" or even just "listicles". But there are better ways to measure this than approaching this from a labor perspective, especially as these jobs likely aren't coming back from private equity slash-and-burning the industry.


Collecting personal stories from people - and doing background reporting to verify those people are credible - is a long standing journalistic tradition. I think it's valuable, and Brian did it very well here (he's a veteran technology reporter).

It doesn't tell the whole story though. That's why I always look for multiple angles and sources on an issue like this (here that's the impact of AI on freelance copywriting.)


> This seems like an inherently terrible way to look for a story to report.

But it’s probably a great way to create a story that generates clicks. The people who respond to calls like this one are going to come from the extreme end of the distribution. That makes for a sensational story. But it also makes for a story that represents the worst case, not the reality most people will experience.


It's such a difficult vertical to track because there isn't always a clear start and end condition. Drafts get passed around, edited, revised, and cleared by different teams, sometimes with a mixture of writing from in-house, freelancers, external agencies, and AI. Lots of people I talk to can't believe the number of projects that get approved and paid for that never end up going live simply because of red tape.


I can't remember which pundit said it, but his theory was that only the best would stay employed and that they would be valued for their high skill


Perhaps, but then it becomes a buyer's market, and salaries will reflect that.


> and salaries will reflect that.

Hilariously naive.


Isn't the implication that salaries will go down, but your response is assuming the GP is naively thinking salaries will go up?


I re-read what I wrote, and I can't imagine that interpretation unless they don't understand who the "buyers" are in a "buyer's market"


Employers? Thus pushing wages down? Am I being stupid?


Correct. Sorry, when I said "that interpretation" I was referring to the comment you responded to, not yours.


Do you think I mean salaries will go up? That's not what a "buyer's market" means. It means there's more supply, so the buyers (employers) can pay less than in the past.

Assuming you understand what I meant: As for being naive, that's hardly true; my opinion comes from experience. When the bubble burst in the early 2000s, you saw a ton of developers looking for work. This pushed salaries down, even for senior and advanced developers.


I think the idea was that only the truly exceptional would survive, and all the nearly exceptional and less competent would be in other fields. As a result, supply would plummet because no one would hire anybody less than the best.

I think the more appropriate analogy from the 2001 bust is if those fields turn into something similar to professional sports, where you have to be unimaginably talented and skilled and disciplined, and everybody else is playing in the rec league basically for nothing


I think this might be what many people think. Which is what brings upon problems of self-worth. However, "best" and "high skill" aren't always the reason why companies value work and workers, i.e. the economy is not a meritocracy.


I expect we’ll see a lot of this as genAI content comes to be seen as schlock.

But we’re also seeing a lot of schlock…


I suspect that for every case like yours, there are dozens of companies of lower quality where the AI slop is "good enough"


Tldr police AI with other AI+Blockchain? Certainly an entertaining read. Is there more technical documentation somewhere because uh...


I agree with you and the anonymous Reddit commenter that this is a “cool story”, and that it needs “a shitton of paragraphs”.

But it also left me somewhat confused. The author’s note at the end states that “[the] story is 100 % fictional, except for all the parts about the EU AI Act having no enforcement teeth, which are unfortunately 100 % true”. The story chronicles a paper describing a solution, and the reaction of the main character, who wishes he had thought of it first. The solution is laid out at a high level: AI self-awareness of potential problems, auto-pause until human review at the first sign of potential problems, coupled with blockchain auditability.

So what is this story? An invitation for researchers to work out the details, and actually write the (thus-far fictional) 47-page paper? A teaser that such a paper might be forthcoming? A hope that this solution might materialize? Or just a fictional tale to vent some professional frustrations?


> So what is this story? An invitation for researchers to work out the details, and actually write the (thus-far fictional) 47-page paper? A teaser that such a paper might be forthcoming? A hope that this solution might materialize? Or just a fictional tale to vent some professional frustrations?

A story to entertain the reader?


> AUTHOR'S NOTE: This story is 100% fictional, except for all the parts about the EU AI Act having no enforcement teeth…

It’s a long parable to explain:

> FINAL THOUGHT: Somewhere in Brussels, right now, someone is discovering that "effective oversight" means nothing without the technical ability to enforce it. This story is for them


Yes I read to the end. But the parable is still a vehicle for "AI+Blockchain policing AI" is it not?


No more than Jesus’s parable about planting the tree being about the actual tree.


"The truth is, there is something terribly wrong with this country, isn't there? Cruelty and injustice, intolerance and oppression. And where once you had the freedom to object, to think and speak as you saw fit, you now have censors and systems of surveillance coercing your conformity and soliciting your submission. How did this happen? Who's to blame? Well certainly there are those more responsible than others, and they will be held accountable, but again truth be told, if you're looking for the guilty, you need only look into a mirror. I know why you did it. I know you were afraid. Who wouldn't be? War, terror, disease. There were a myriad of problems which conspired to corrupt your reason and rob you of your common sense. Fear got the best of you, and in your panic you turned to the now high chancellor, Adam Sutler. He promised you order, he promised you peace, and all he demanded in return was your silent, obedient consent."


Sources from 30 years ago are more reliable than 30 days ago: pithandpip.com/blog/why-write-about-tech-history


rtings.com!


Yes, RTINGS.com is the GOAT.


I use PC and Firefox and feel like the near constant issues with buggy web apps are due to the overwhelming priority of Mac+Chrome combo.



It’s gotten a little old for me, just because it still buoys a wave of “solve a problem with a regex, now you’ve got two problems, hehe” types, which has become just thinly veiled “you can’t make me learn new things, damn you”. Like all tools, its actual usefulness is somewhere in the vast middle ground between angelic and demonic, and while 16 years ago, when this was written, the world may have needed more reminding of damnation, today the message the world needs more is firmly: yes, regex is sometimes a great solution, learn it!


I agree that people should learn how regular expressions work. They should also learn how SQL works. People get scared of these things, then hide them behind an abstraction layer in their tools, and never really learn them.

But, more than most tools, it is important to learn what regular expressions are and are not for. They are for scanning and extracting text. They are not for parsing complex formats. If you need to actually parse complex text, you need a parser in your toolchain.

This doesn't necessarily require the hair pulling that the article indicates. Python's BeautifulSoup library does a great job of allowing you convenience and real parsing.

Also, if you write a complicated regular expression, I suggest looking for the /x modifier. (How you enable it varies by language.) It allows you to put comments inside of your regular expression, which turns it from a cryptic code that makes your maintenance programmer scared into something that is easy to understand. Plus, if the expression is complicated enough, you might be that maintenance programmer! (Try writing a tokenizer as a regular expression. Internal comments pay off quickly!)
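In Python, for instance, the /x behaviour is re.VERBOSE (or an inline (?x) flag); a small sketch with a made-up pattern:

```python
import re

# A made-up example: a US ZIP code, five digits plus an optional +4 part,
# written with inline comments via re.VERBOSE.
zip_re = re.compile(r"""
    ^           # start of string
    \d{5}       # five-digit ZIP
    (?:         # optional +4 extension:
        -\d{4}  #   a hyphen followed by four digits
    )?
    $           # end of string
""", re.VERBOSE)

print(bool(zip_re.match("12345-6789")))  # True
print(bool(zip_re.match("1234")))        # False
```

Whitespace and # comments inside the pattern are ignored, so the layout itself becomes free documentation.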


Yeah but you also learn a tool’s limitations if you sit down and learn the tool.

Instead people are quick to stay fuzzy about how something really works so it’s a lifetime of superstition and trial and error.

(yeah it’s a pet peeve)


> you can’t make me learn new things, damn you

With regex, I won’t. I rarely include much in terms of regex in my PRs, usually small filters for text inputs for example. More complex regexes are saved for my own use to either parse out oddly formatted data, or as vim find/replace commands (or both!).

When I do use a complex regex, I document it thoroughly - not only for those unfamiliar, but also for my future self so I have a head-start when I come back to it. Usually when I get called out on it in a PR, it’s one of two things:

- “Does this _need_ to be a regex?” I’m fine to justify it, and it’s a question I ask teammates especially if it’s a sufficiently complex expression I see.

- “What’s that do?” This is rarely coming from an “I don’t know regex” situation, and more from an “I’m unfamiliar with this specific part of regex,” e.g. back references.

I think the former is 100% valid - it’s easy to use too much regex, or to use it where there are better methods that may not have been the first place one goes: need to ensure a text field always displays numbers? Use a type=number input; need to ensure a phone number is a valid NANP number? Regex, baby!

The latter is of course valid too, and I try to approach any question about why a regex was used, or what it does, with a link to a regex web interface and an explanation of my thinking. I’ve had coworkers occasionally start using more regex in daily tasks as a result, and that’s great! It can really speed up tasks that would otherwise be crummy to do by hand or when finagling with a parser.
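As a sketch of that NANP case (the pattern here is my own rough illustration, not a vetted validator - it assumes optional parentheses, single space/dot/hyphen separators, and the NANP rule that area code and exchange can't start with 0 or 1):

```python
import re

# Rough NANP phone-number check: optional parens around the area code,
# optional single separator between groups. Area code and exchange
# must start with 2-9 under NANP rules.
NANP = re.compile(r"^\(?([2-9]\d{2})\)?[-. ]?([2-9]\d{2})[-. ]?(\d{4})$")

print(bool(NANP.match("(212) 555-0142")))  # True
print(bool(NANP.match("123-456-7890")))    # False: area code starts with 1
```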

Bonus: some of my favorite regex adventures:

- Parsing out a heavily customizable theme’s ACF data stuffed into custom fields in a WordPress database, only to shove it into a new database with a new and %better% ACF structure

- Taking PDF bank statements in Gmail, copying the text, and using a handful of painstakingly written find/replace vim regexes to parse the goop into a CSV format, because why would banks ever provide structured data??

- Copy/pasting all of the Music League votes and winners from a like 20-person season into a text doc and converting it to a JSON format via regex that I could use to create a visualization of stats.

- Not parsing HTML (again, anyways)


It gets buried in the rant, but this part is the key:

> HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.


The first sentence is correct but the second is wrong. A regex can be used for breaking HTML into lexical tokens like start tags and end tags. Which is what the question asks about.
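A minimal illustration in Python of that kind of lexing - splitting markup into start-tag, end-tag, and text tokens, with no attempt at matching or nesting (and ignoring comments and CDATA for brevity):

```python
import re

# Tokenize markup into start tags, end tags, and text runs.
# This is a lexer, not a parser: it makes no attempt to pair
# start tags with end tags or validate nesting.
TOKEN = re.compile(r"</[^>]+>|<[^>]+>|[^<]+")

def lex(html):
    tokens = []
    for m in TOKEN.finditer(html):
        t = m.group()
        if t.startswith("</"):
            tokens.append(("end", t))
        elif t.startswith("<"):
            tokens.append(("start", t))
        else:
            tokens.append(("text", t))
    return tokens

print(lex("<b>bold</b> text"))
# [('start', '<b>'), ('text', 'bold'), ('end', '</b>'), ('text', ' text')]
```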


Fair enough. GP is right in that there's a lot of absolutism with regards to what regex can solve. I first learned recursive-descent parsing from Destroy All Software, where he used regex for the lexing stage by trying to match the start of the buffer for each token. I'm glad I learned it that way; otherwise I probably would have gotten lost in the character-by-character lexing as a beginner and would've never considered using regex. Now I use regex in most of my parsers, to various degrees.

https://www.destroyallsoftware.com/screencasts/catalog/a-com...

---

As for GP's "solve a problem with a regex, now you’ve got two problems, hehe": I remember, for years, trying to use regex and never being able to get it to work for me. I told my friends as much: "I've literally never had regex help in a project; it always bogs me down for hours, then I give up". I'm not sure what happened, but one day I just got it, and I've never had much issue with regex again, and use it everywhere.


I had a hard time with complex regex until I started using them more in vim - the ability to see what your regex matches as you work is really helpful. Of course, this is even better now with sites like regexr and regex101


Regex101 is always open when I'm doing regexes, what a great tool. Occasionally I use sites that build the graph so you can visualize the matching behaviour.

There are even tools to generate matching text from a regex pattern. Rust's Proptest (property-based testing) library uses this pattern to generate minimal failing counterexamples from a regex pattern. The tooling around Regex can be pretty awesome.


The joke is not that you shouldn't use regular expressions but that you can't use regular expressions


An XML-based data format is by definition a subset of all valid XML. In particular, it may be a regular subset.


I swapped out a "proper" parser for a regex parser for one particular thing we have at work that was too slow with the original parser. The format it is parsing is very simple, one top level tag, no nested keys, no comments, no attributes, or any other of the weird things you can do in XML. We needed to get the value of one particular tag in a potentially huge file. As far as I can tell this format has been unchanged for the past 25 years ... It took me 10 minutes to write the regex parser, and it sped up the execution by 10-100x. If the format changes unannounced tomorrow and it breaks this, we'll deal with it - until then, YAGNI
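For a format that constrained, the whole thing can be a few lines; a hypothetical sketch in Python (the tag and document here are invented, since the actual format isn't shown):

```python
import re

# Pull the value of one known, non-nested tag out of a huge document
# without building a DOM. This only works because the format guarantees
# no attributes, no comments, no CDATA, and no nesting of the same tag.
def get_tag_value(xml_text, tag):
    m = re.search(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), xml_text, re.DOTALL)
    return m.group(1) if m else None

doc = "<root><version>25</version><payload>...</payload></root>"
print(get_tag_value(doc, "version"))  # 25
```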


That is what the joke is.

That is often not what is meant when the joke is referenced.


Is it really? Maybe I'm blessed with innocence, but I've never been tempted to read it as anything but a humorous commentary on formal language theory.


> it still buoys a wave of “solve a problem with a regex, now you’ve got two problems, hehe” types

Who cares that some people are afraid to learn powerful tools. It's their loss. In the time of need, the greybeard is summoned to save the day.

https://xkcd.com/208/


> https://xkcd.com/208/

You say you know the regular expression for an address? hehe


> learn it

Waste of time. Have some "AI" write it for you


Learning is almost never a waste of time even if it may not be the most optimal use of time.


This is an excellent way to put it and worth being quoted far and wide.


If you stop learning the basics, you will never know when the sycophantic AI happily lures you down a dark alley, because it was the only way you discovered on your own. You’ll forever be limited to a rehashing of the bland code slop the majority of the training material contained. Like a carpenter who’s limited to driving Torx screws.

If that’s your goal in life, don’t let me bother you.


[flagged]


That's not entirely fair. It's relatively easy to learn the basics of regular expressions. But it's also relatively easy, with that knowledge, to write regular expressions that

- don't work the way you want them to (miss edge cases, etc)

- fail catastrophically (ie, catastrophic backtracking, etc) which can lead to vulnerabilities

- are hard to read/maintain

I love regular expressions, but they're very easy to use poorly.


Using the AI to write them for you is going to lead to the same problems but worse because you don’t have the knowledge to do anything about it


If they don't work the way you want, you just keep refining them. This is easy if you actually test your regex on real data.

Fail catastrophically.. I had that happen once on an unexpectedly large input. That was a fun lesson. Ironically, the solution was to (*FAIL)

In any case, I learned a lot and delegating that to an LLM and learning nothing would not have put me in a better position.


> If they don't work the way you want, you just keep refining it. This is easy if you actually test your regex in real data.

There can be edge cases in both your data and in the regular expression itself. It's not as easy as "write your code correctly and test it". Although that's true of programming in general, regular expressions tend to add an extra "layer" to it.

I don't know if you meant it to be that way, but your comment sounds a lot like "it's easy to program without bugs if you test your code". It's pretty much a given that that's not the case.


I didn’t get the “it’s easy to program without bugs” vibe at all, and OP even mentioned an edge case that took their parser down (BUG!)

Neither the human nor the AI will catch every edge case, especially if the data can be irregular. I think the point they were making is more along the lines of “when you do it yourself, you can understand and refine and fix it more easily.”

If an LLM had done my regular expressions early in my career, I’d have only !maybe! have learned just what I saw and needed to know. I’m almost certain the first time I saw (?:…) I’d have given up and leaned into the AI heavily.


bobince has some other posts where he is very helpful too! :)

https://stackoverflow.com/questions/2641347/short-circuit-ar...


The first link in the article, also included as a screenshot.


It completely misses the point of the question though.

The question is not asking about parsing in the sense of matching start tags with end tags, which is indeed not possible with a regex.

The question is about lexing, for which regex is the ideal tool. The solution is somewhat more complex than the question suggests, since you have to exclude tags embedded in comments or CDATA sections, but it is definitely doable using a regex.
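A rough sketch of that in Python - match comments and CDATA sections first in the alternation, so tags inside them are consumed and can be discarded:

```python
import re

# Sketch of a tag lexer that matches comments and CDATA sections
# before tags, so tags embedded in them aren't reported.
TOKEN = re.compile(
    r"<!--.*?-->"            # comments
    r"|<!\[CDATA\[.*?\]\]>"  # CDATA sections
    r"|</?[A-Za-z][^>]*>",   # start or end tags
    re.DOTALL,
)

def tags(markup):
    # Keep only real tags; comment/CDATA matches are discarded.
    return [t for t in TOKEN.findall(markup)
            if not t.startswith(("<!--", "<![CDATA["))]

print(tags("<p>hi<!-- <b>not a tag</b> --></p>"))  # ['<p>', '</p>']
```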


Lol "You emailed that you like sci-fi. We bet you'll like Alien, Bladerunner, and Close Encounters of the Third Kind!" Truly mind-blowing tech right there. How did they ever pull it off!


lol I added 'trump', 'republican', and 'democrat' as custom filter keywords and now it's showing zero stories in the USA category. So apparently, that category is a stand-in for politics? Although I have the Thai category enabled (since I live there) and that's all run-of-the-mill national (non-political) news.

Nice release nonetheless!


It’s probably hard to find an article these days that doesn’t mention one of those words (except maybe celebrity gossip). The world we live in.


In other words, this is exposing a long-standing flaw in journalism. I know things are super polarized now, but even 20 years ago, when mentioning a Congressperson regarding a particular social problem, they would specify whether he was a Democrat or Republican.

I really don't need to know which party he is part of. If the article was about a party's stance, it makes sense - but the article is about one politician.


Sure you do, to understand actual patterns in actions of members of the party.

And ignoring that, it’s general context. Part of the job of a journalist.


> to understand actual patterns in actions of members of the party.

Or to construct patterns that don't reflect reality.

Should we also list their ages, ethnicity, religious affiliation(s) in each article mentioning a Congressperson and construct those patterns as well?

Sorry, I'd like to think on my own.


When every vote on a bill lands along party lines you might want to begin to suspect that party affiliation matters quite a bit.


> I know things are super polarized now, but even 20 years ago, when mentioning a Congressperson regarding a particular social problem, they would specify if he was a Democrat/Republican.

When articles always mention party affiliation, people will judge the politician's behavior based on the affiliation, and not on his actions.


I checked a few news websites and the only one I could find that did a halfway decent job of this was USA Today: https://www.usatoday.com/news/nation/

