More

jdreaver · 2025-07-10T14:55:44 1752159344

I recently discovered that `perf` itself can spit out flamegraphs. My workflow has been:

    $ perf record -g -F 99 ./my-program
    $ perf script report flamegraph

You can also run `perf script -F +pid > out.perf` and then open `out.perf` in Firefox's built-in profile viewer (which is super neat) https://profiler.firefox.com

jdreaver · 2025-06-23T01:10:25 1750641025

There are plausible scenarios where a region can go down for days or more at a time, like natural disasters. I'm not terribly worried about a region going away _forever_, but during a regional outage long enough to start losing business, having data in multiple regions is important so you can restore in another region (if you aren't able to fail over quickly).

jdreaver · on Dec 29, 2024

You are correct that storage is cheaper in S3, but S3 charges per request to GET, LIST, POST, COPY, etc objects in your bucket. Block storage can be cheaper when you are frequently modifying or querying your data.

thayne · on Dec 29, 2024

That's a lot of requests.

hansvm · on Dec 29, 2024

It is, but it's not _that_ many. AWS pricing is complicated, but for fairly standard services and assuming bulk discounts at ~100TB level, your break-even points for requests/network vs storage happens at:

1. (modifications) 4200 requests per GB stored per month

2. (bandwidth) Updating each byte more than once every 70 days

You'll hit the break-even sooner, typically, since you incur both bandwidth and request charges.

That might sound like a lot, but updating some byte in each 250KB chunk of your data once a month isn't that hard to imagine. Say each user has 1KB of data, 1% are active each month, and you record login data. You'll have 2.5x the break-even request count and pay 2.5x more for requests than storage, and that's only considering the mutations, not the accesses.

You can reduce request costs (not bandwidth though) if you can batch them, but that's not even slightly tenable till a certain scale because of latency, and even when it is you might find that user satisfaction and retention are more expensive than the extra requests you're trying to avoid. Batching is a tool to reduce costs for offline workloads.

thayne · on Jan 4, 2025

Ok, there are definitely cases where it would be more expensive, like using it for user login data.

But for metrics, like you would use for prometheus:

- Data is usually write-only. There isn't usually any reason to modify the metrics after you have recorded them.

- The bulk of your data isn't going to be used very often. It will probably be processed by monitors/alerts and maybe the most recent data will be shown in dashboard (and that data could be cached on disk or in memory). But most if it is just going to sit there until you need to look at it for an ad-hoc query, and you should probably have an index to reduce how much data you need to read for those.

- This metrics data is very amenable to batching. You do probably want to make recent data available from memory or disk for alerts, dashboards, queries, etc. But for longer term storage it is very reasonable to use chunks of at least several megabytes. If your metrics volume is low enough that you have to use tiny objects, then you probably aren't storing enough to be worried about the cost anyway.

jdreaver · on Dec 11, 2024

Huge +1 to this, but I would also add walking _at least_ 8000 steps per day. I still had some minor, nagging pain until I started walking more. Turns out humans are not meant to sit all day!

I can highly recommend a book called _Built to Move_ [0]. It tells you to do a lot of things that many people consider common sense, like walk every day, eat vegetables, sleep 8 hours, etc. However, it also explains _why_ to do these things pretty concisely. The most impactful argument it made to me was you can't counteract sitting for 12 hours a day with any amount of exercise. You have to sit less and move around more.

[0] https://thereadystate.com/built-to-move/

supersrdjan · on Dec 12, 2024

It appears that the problem is not in sitting too much, but rather in sitting in chairs specifically. Apparently, hunter-gatherer people also spend about 10 hours a day sitting. But they sit on the ground. Or kneel or squat. And they don't have the issues we get from sitting too much:

https://www.pnas.org/doi/10.1073/pnas.1911868117

So... the end-game of ergonomic chairs might be no chair at all.

vladvasiliu · on Dec 12, 2024

Given the tables in the results section, it would seem that the people in the study don't have long periods where they don't move. "average sedentary bout lengths" hover between 15 and 20 minutes.

So the problem with "sitting in chairs specifically" is probably not the chair, but the fact that the chair facilitates longer "sedentary bout lengths". If this is correct, then the commenter suggesting to get up and move every so often is probably on point.

supersrdjan · on Dec 12, 2024

Makes sense. That said, fidgeting and moving around is spontaneous when on the floor, you don’t have to be reminded to do it. Also, no chair is cheaper than an expensive chair.

vladvasiliu · on Dec 12, 2024

> fidgeting and moving around is spontaneous when on the floor, you don’t have to be reminded to do it

Indeed, it's actually what prompted me to go look over the document.

I remember, as a kid, when out and about and before getting into the habit of sitting in a chair all day every day, I would sit on the floor or on random objects, like stones or tree trunks in the countryside. I wouldn't be able to sit still for long periods of time and would need to at least change positions.

Whereas now, in my "ergonomic chair", I can sit for more than one hour at a time with minimal, if any, changes in position. Ditto for my couch (which wasn't marketed as "ergonomic" in any way).

That being said, I've tried using a computer in other positions, like putting the laptop on a coffee table and squatting or sitting on the floor in front of it, or having it rest on my thies while squatting. It gets tiring very quickly, especially in the shoulders and neck area.

So, in my case, what seems to work best is to get up regularly and walk around the room for a bit.

jdreaver · on Dec 2, 2023

Most of my company uses VS Code or IntelliJ IDEs, which are officially supported by our developer productivity team, but many of us use Emacs and Vim (I'm an Emacs user). I spend most of my time in Go, C, Rust, and the plethora of "infrastructure"-related languages like Puppet, YAML, Starlark, Python, bash, SQL, etc. I also sometimes use more of the common languages in our company's stack like Ruby, Java, and Python.

My experience using Emacs at work for the past 15 years has been outstanding. I find that when I join a new company, there is sometimes a bit of legwork getting Emacs working with potentially bespoke SSH, tooling, or VPN configs (for remote development), but once it works I don't touch it. I touch a lot of languages at work, including more I didn't mention above, and not having to leave Emacs to learn a new tool is a huge boon to productivity. I get all the niceties of an IDE via LSP and some other Emacs packages, including autocomplete, code navigation, Github Copilot, and more.

I don't ever tell anyone they _should_ learn Emacs at work, but once in a while someone sees me use it while screen sharing and they get interested.

jdreaver · on May 13, 2023

The Elements of Computing Systems: Building a Modern Computer from First Principles [0] [1]

Easily one of the most interesting and engaging textbooks I've read in my entire life. I remember barely doing any work for my day job while I powered through this book for a couple weeks.

Also, another +1 to Operating Systems: Three Easy Pieces [2], which was mentioned in this thread. I read this one cover to cover.

Lastly, Statistical Rethinking [3] really did change the way I think about statistics.

[0] https://www.nand2tetris.org/

[1] https://www.amazon.com/Elements-Computing-Systems-second-Pri...

[2] https://pages.cs.wisc.edu/~remzi/OSTEP/

[3] https://xcelab.net/rm/statistical-rethinking/

electriccatblan · on May 13, 2023

Would Statistical Rethinking help me interpret web app metrics? E.g. if I have a canary out and the response times are longer after x requests, is that significant?

maxFlow · on May 13, 2023

I've found that Statistics is one of those topics that changes your world view about everything. You can consider pretty much any issue statistically, and that will enrich your perspective significantly. In that sense, Statistical Rethinking will help. However, it's a book on Bayesian stats, it's quite dense, and examples are coded in R. It may be overkill for web app metrics interpretation. For that you may be better served with basic stats & inference, frequencies, descriptive statistics, percentiles, basic distributions, data visualization (e.g., trend lines, scatter plots, boxplots, histograms), etc.

To be clear though, Statistical Rethinking is a beautiful piece of work. You can check out the author's lectures[0] and see how much they suit your needs.

[0] https://www.youtube.com/playlist?list=PLDcUM9US4XdPz-KxHM4XH...

jdreaver · on May 13, 2023

The book is a bit more foundational than that. It teaches you about Bayesian statistics, and discusses (among other things) why the concept of binary yes/no statistical significance is usually not the best way of evaluating a hypothesis with data.

However, for your question specifically, the choice of prior is less meaningful when you have lots of data, and presumably a web app seeing hundreds or thousands of requests per second can gather enough data to determine if the canary has a different latency profile than the deployed version within a few seconds. Also, presumably you would use an uninformed prior for a test like that. If I were trying to prevent latency regressions in an automated deployment pipeline I would just compare latency samples after 1 minute with a t-test or something simple like that.

jdreaver · on April 27, 2023

I agree with basically everything you are saying, except I think a sprinkle of rote memorization can go a long way in some domains. Whenever I read a math book, memorizing definitions and some key theorems helps me apply them in problems. With programming, however, I tend to do zero rote memorization.

jdreaver · on March 18, 2023

That's probably so you can rotate your keys without downtime.

jdreaver · on Jan 18, 2022

> RDS is slow. I've seen select statements take 20 minutes on RDS which take a few seconds on _much_ cheaper baremetal.

I'm sure you observed this, but concluding that RDS is slow as a blanket statement is totally wrong. You had to have had different database settings between the two postgres instances to see a difference like that. 3 orders of magnitude performance difference indicates something wrong with the comparison.

stillicidious · on Jan 18, 2022

You could easily observe this with a cache-cold query performing lots of random IO. EBS latency is on the order of milliseconds, even cheap baremetal nowadays is microseconds

singron · on Jan 18, 2022

Also rds caps out around 20k IOPS. You can hit 1 million IOPS on a large machine with a bunch of SSDs. Imagine running 50 rds databases instead of 1.

It's a huge bummer that EBS is the only durable block storage in aws since the performance is so bad. Has anyone had luck using instance storage? The aws white papers make it seem like you could lose data there for any number of reasons, but the performance is so much better. Maybe a synchronous replica in a different AZ?

jdreaver · on Jan 18, 2022

I've used Aurora and the IO is much better there than on vanilla RDS. Postgres Aurora is basically a fork of postgres with a totally different storage system. Their are some neat re:Invent talks on it if you are interested.

singron · on Jan 18, 2022

We use aurora actually. It's a lot more scalable, but also pretty expensive. The IO layer is multi-tenent, and unfortunately when it goes wrong, you have no idea why and no recourse. I think I've never had a positive experience with AWS support about it either. We've had IO latency go from <2ms to >10ms and completely destroy throughput. Support tells us to try optimizing our queries like we are idiots.

jdreaver · on Jan 16, 2022

I've read this book and taken this course twice, and it is easily one of the best learning experiences I've ever had. Statistics is a fascinating subject and Richard helps bring it alive. I had studied lots of classical statistics texts, but didn't quite "get" Bayesian statistics until I took Richard's course.

Even if you aren't a data scientist or a statistician (I'm an infrastructure/software engineer, but I've dabbled as the "data person" in different startups), learning basic statistics will open your eyes to how easy it is to misinterpret data. My favorite part of this course, besides helping me understand Bayesian statistics, is the few chapters on causal relationships. I use that knowledge quite often at work and in my day-to-day life when reading the news; instead of crying "correlation is not causation!", you are armed with a more nuanced understanding of confounding variables, post-treatment bias, collider bias, etc.

Lastly, don't be turned off by the use of R in this book. R is the programming language of statistics, and is quite easy to learn if you are already a software engineer and know a scripting language. It really is a powerful domain specific language for statistics, if not for the language then for all of the statisticians that have contributed to it.

swayson · on Jan 16, 2022

Julia, R (tidyverse), Python code examples available here:

https://github.com/StatisticalRethinkingJulia https://github.com/pymc-devs/resources/tree/master/Rethinkin... https://bookdown.org/content/4857/

agucova · on Jan 16, 2022

Even if you don't like R, your can do the entire course with Julia/Turing, Julia/Stan or Python, the course github's page has a list of “code translations” for all the examples.

fault1 · on Jan 16, 2022

There is also other translations, for example, in pytorch/pyro: https://fehiepsi.github.io/rethinking-pyro/

I would say statistical rethinking is a great way to compare and contrast different ppl impls and languages, I've been using it with Turing, which is pretty great.

jonnycomputer · on Jan 16, 2022

I frequently prefer R to python/pandas/numpy for data analysis--even if most of my other programming is in python.

elcapitan · on Jan 16, 2022

What's the advantage, if you already know Python? (genuine interest)

jonnycomputer · on Jan 16, 2022

I don't want to say "advantage", so much as preference. But a few things come to mind.

- Lots of high quality statistical libraries, for one thing.

- RStudio's RMarkown is great; I prefer it to Jupyter Notebook.

- I personally found the syntax more intuitive, easier to pick up. I don't usually find myself confused about the structure of the objects I'm looking at. For whatever reason, the "syntax" of pandas doesn't square well (in my opinion) with python generally. I certainly want to just use python. But, shrug.

- The tidyverse package, especially the pipe operator %>%, which afaik doesn't have an equivalent in Python. E.g.

    with_six_visits <- task_df %>%
      group_by(turker_id, visit) %>%
      summarise(n_trials = n_distinct(trial_num)) %>%
      mutate(completed_visit = n_trials>40) %>%
      filter(completed_visit) %>%
      summarise(n_visits = n_distinct(visit)) %>%
      mutate(six_visits = n_visits >= 6) %>%
      filter(six_visits) %>%
      ungroup()

Here I'm filtering participants in an mturk study by those who have completed more than 40 trials at least six times across multiple sessions. It's not that I couldn't do the same transformation in pandas, but it feels very intuitive to me doing it this way.

- ggplot2 for plotting; its really powerful data visualization package.

Truthfully, I often do my data text parsing in Python, and then switch over to R for analysis, E.g. python's JSON parsing works really well.

rahimnathwani · on Jan 17, 2022

I can see how this is more intuitive. In pandas I'd assign the output of groupby to a variable, and then add the new column in a separate statement.

(The below is off topic, but I don't use R so I'd love to know whether I'm reading the code correctly)

"Here I'm filtering participants in an mturk study by those who have completed more than 40 trials at least six times across multiple sessions."

A user with this pattern of trials seems like they would fit the above definition:

Session 1: 82 trials Session 2: 82 trials Session 3: 82 trials

But the code seems to want 6 distinct sessions with >40 trials each. Have I misunderstood?

Also, is 'mutate' necessary before 'filter' or is that just to make the intent of the code clearer to your future self?

jonnycomputer · on Jan 17, 2022

My initial wording was sloppy.

There were 50 trials in each session; so I counted a session completed if they did more than 40 in that session. They needed to have completed at least six sessions.

The mutate is unnecessary. I forget why I did that.

jonnycomputer · on Jan 17, 2022

What it woul take to recreate dplyr in python:

https://mchow.com/posts/2020-02-11-dplyr-in-python/

melling · on Jan 17, 2022

Didn’t R introduce the native pipe operator?

%>% is now simply >|

jonnycomputer · on Jan 17, 2022

They did. I just haven't gotten around to using it yet!

civilized · on Jan 16, 2022

Tabular data manipulation packages are better, easier to make nontrivial charts, many R stats packages have no counterparts in Python, less bureaucracy, more batteries-included.

R is a language by and for statisticians. Python is a programming language that can do some statistics.

Bootvis · on Jan 16, 2022

For me, I use R data.table a lot and I see as the main advantages are performance and the terse syntax. The terse syntax does come with a steep learning curve though.

jarenmf · on Jan 16, 2022

Indeed, data.table is just awesome for productivity. When you're manipulating data for exploration you want the least number of keystrokes to bring an idea to life and data.table gives you that.

VeninVidiaVicii · on Jan 16, 2022

I totally agree. I often find myself wanting data.table as a standalone database platform or ORM-type interface for non-statistical programming too.

boppo1 · on Jan 16, 2022

What is terse syntax? I can parse lisp and C, how would this be different and challenging?

bckygldstn · on Jan 16, 2022

The syntax isn't self-describing and uses lots of abbreviations; it relies on some R magic that I found confusing when learning (unquoted column names and special builtin variables); and data.table is just a different approach to SQL and other dataframe libraries.

Here's an example from the docs

  flights[carrier == "AA",
    lapply(.SD, mean),
    by = .(origin, dest, month),
    .SDcols = c("arr_delay", "dep_delay")]

that's clearly less clear than SQL

  SELECT
    origin, dest, month,
    MEAN(arr_delay), MEAN(dep_delay)
  FROM flights
  WHERE carrier == "AA"
  GROUP BY arr_delay, dep_delay

or pandas

  flights[filghts.carrier == 'AA'].groupby(['arr_delay', 'dep_delay']).mean()

But once you get used to it data.table makes a lot of sense: every operation can be broken down to filtering/selecting, aggregating/transforming, and grouping/windowing. Taking the first two rows per group is a mess in SQL or pandas, but is super simple in data.table

  flights[, head(.SD, 2), by = month]

That data.table has significantly better performance than any other dataframe library in any language is a nice bonus!

hervature · on Jan 16, 2022

Taking the first two rows is a mess in pandas?

flights.groupby("month").head(2)

Not only is does this have all the same keywords, but it is organized in a much clearer way to newcomers and labels things to look up in the API. Whereas your R code has a leading comma, .SD, and a mix of quotes and non-quotes for references to columns. You even admit the last was confusing to learn. This can all be crammed in your head, but not what I would call thoughtfully designed.

Bootvis · on Jan 17, 2022

I agree the example in GP is not convincing. Consider the following table of ordered events:

    | Date | EventType |

and I want to find the count, and the first and last date of an event of a certain type happening in 2020:

    events[
        year(Date) == 2020L, 
        .(first_date = first(Date), last_date = last(Date), count = .N),
        EventType
    ]

Using first and last on ordered data will be very fast thanks to something called GForce.

When exploring data, I wouldn't need or use any whitespace. How would your Pandas approach look like?

hervature · on Jan 17, 2022

To do that, the code would look something like:

mask = events["Date"].year == 2020 events[mask].groupby("EventType").agg(first_date=("Date", min), last_date=("Date", max), count=("Date", len))

Anyway, I don't understand why terseness is even desirable. We're doing DS and ML, no project never comes down to keystrokes but ability to search the docs and debug does matter.

Bootvis · on Jan 17, 2022

It helps in quickly improving your understanding of the data by being able to answer simple but important questions quicker. In this contrived example I would want to know:

- How many events by type

- When did they happen

- Are there any breaks in the count, why?

- Some statistics on these events like average, min, max

and so on. Terseness helps me in doing this fast.

kgwgk · on Jan 16, 2022

You mean something like

    SELECT
    origin, dest, month, AVG(arr_delay), AVG(dep_delay)
    FROM flights
    WHERE carrier == 'AA'
    GROUP BY origin, dest, month

and

    flights[flights.carrier == 'AA'].groupby(['origin', 'dest', 'month'])[['arr_delay', 'dep_delay']].mean()

bckygldstn · on Jan 16, 2022

Yep thanks, you can tell I use a "guess and check" approach to writing sql and pandas...

zwaps · on Jan 17, 2022

On top of what has been said, if you want to do some more advanced statistical analyses (in the inference area, not ML/predictive field), then chances are that these algorithms are either published as R or STATA packages (usually R).

In Python, there is statsmodels. Here, you'll find a lot of GLM stuff, which is sort of an older approach. Modern inferential statistics, if not just Bayesian, is usually in the flavor of semi-parametric models that rely on asymptotics.

As R is used by professional researchers, it is simply more on the edge of things. Python has most of the "statistics course" schoolbook methods, but not much beyond that.

For example, it has become very common to have dynamic panel data which require dynamic models. Now if you want to do a Blundell-Bond type model in PYthon you have to... code it yourself using GMM, if it exists even.

For statistics, that's pretty much like saying you have a Deep Learning package that maybe has GRU but no transformer modules at all. So yeah, you can code it yourself. Or you use the other one.

clove · on Jan 17, 2022

I have the opposite question. I have been programming in R since I was 19. I know no other programming languages.

Hence my question:

What's the advantage of Python if you already know R?

I've heard they have similarities. Is there anything Python does better than R in terms of statistical analysis, charting, etc.?

noiwillnot · on Jan 17, 2022

> What's the advantage of Python if you already know R?

AFAIK in statistical modelling Python is better only in neural networks, so if you do not need to do fancy things with images, text, etc. you do not need Python. R is still the king.

In terms of charting and dashboards, I would say that if you work high level R and Python are both pleasant. R has ggplot, but Python has Plotly Express. R has Shiny, but Python has Dash and Streamlit. You can do great with both.

iandinwoodie · on Jan 17, 2022

One difference I've noticed is that R libraries are usually authored and maintained by academics in the associated field; the same can't always be said about equivalent Python libraries. This means that R library authors generally use their own libraries for publication and have an academic stake in its correctness.

vcdimension · on Jan 16, 2022

R is used by many researchers and consequentially has many more statistical libraries (e.g. try doing a dynamic panel modelling in python).