
> - It says it's done when its code does not even work, sometimes when it does not even compile.

> - When asked to fix a bug, it confidently declares victory without actually having fixed the bug.

You need to give it ways to validate its work. A junior dev will also hand you code that doesn't compile, or that was supposed to fix a bug but doesn't, if they never actually compile the code and confirm the bug is really gone.


Believe me, I've tried that, too. Even after giving detailed instructions on how to validate its work, it often fails to do it, or it follows those instructions and still gets it wrong.

Don't get me wrong: Claude seems to be very useful when it's on a well-trodden train track and never has to go off the rails. But it struggles to recognize and recover when its output is wrong.

The worst is this "try things over and over" behavior, which is also very common among junior developers and one of the habits I try to break real humans of, too. I've gone so far as to put this in the root CLAUDE.md system prompt:

--NEVER-- try fixes that you are not sure will work.

--ALWAYS-- prove that something is expected to work and is the correct fix, before implementing it, and then verify the expected output after applying the fix.

...which is a fundamental thing I'd ask of a real software engineer, too. The problem is that, as an LLM, it's just spitting out probabilistic sentences: it is always 100% confident of its next few words, which makes it a poor investigator.


I think I was shadow-banned because my very first comment on the site was slightly snarky, and have now been unbanned.


Properly measuring "GPU load" is something I've been wondering about, as an architect who's had to deploy ML/DL models but is still relatively new at it. With CPU workloads you can generally tell from %CPU, %Mem and IOs how much load your system is under. But with GPU I'm not sure how you can tell, other than by just measuring your model execution times. I find it makes it hard to get an idea whether upgrading to a stronger GPU would help and by how much. Are there established ways of doing this?


For kernel-level performance tuning, you can use the occupancy calculator (as pointed out by jplusqualt), or you can profile your kernel with Nsight Compute, which will give you a ton of info.

But for model-wide performance, you basically have to come up with your own calculation to estimate the FLOPs required by your model and, based on that, figure out how well you're maxing out the GPU's capabilities (MFU/HFU).

Here is a more in-depth example on how you might do this: https://github.com/stas00/ml-engineering/tree/master/trainin...
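To make the MFU idea concrete, here's a back-of-the-envelope sketch. Every number in it is a placeholder (a hypothetical 7B-parameter transformer, a made-up throughput, and the A100's spec-sheet BF16 tensor-core peak); swap in your own model size, measured tokens/sec, and your GPU's peak for the dtype you actually run:

    # Rough MFU (model FLOPs utilization) estimate for a transformer-style model.
    # All numbers are placeholders -- substitute your own model size, measured
    # throughput, and your GPU's spec-sheet peak for the dtype you use.
    n_params       = 7e9      # model parameters (hypothetical)
    tokens_per_sec = 3000     # measured training throughput (hypothetical)
    peak_flops     = 312e12   # e.g. A100 FP16/BF16 tensor-core peak, per spec sheet

    # Common approximation: ~6 FLOPs per parameter per token for a training step
    # (forward + backward); use ~2 per parameter per token for inference.
    achieved_flops = 6 * n_params * tokens_per_sec

    mfu = achieved_flops / peak_flops
    print(f"MFU ~ {mfu:.1%}")   # fraction of theoretical peak you're actually using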


It's harder than measuring CPU load, and depends a lot on context. For example, often 90% of a GPU's available flops are exclusively for low-precision matrix multiply-add operations. If you're doing full precision multiply-add operations at full speed, do you count that as 10% or 100% load? If you're doing lots of small operations and your warps are only 50% full, do you count that as 50% or 100% load? Unfortunately, there isn't really a shortcut to understanding how a GPU works and knowing how you're using it.
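As a quick back-of-the-envelope illustration of that "which peak do you divide by?" ambiguity, using published A100 spec-sheet numbers (the hypothetical kernel here is assumed to saturate the FP32 units):

    # The same achieved throughput looks like 100% or ~6% "load" depending on
    # which peak you use as the denominator (A100 spec-sheet figures).
    peak_fp32_tflops        = 19.5   # FP32 on the CUDA cores
    peak_fp16_tensor_tflops = 312.0  # FP16/BF16 on the tensor cores

    achieved_tflops = 19.5           # hypothetical kernel saturating the FP32 units

    print(f"vs FP32 peak:        {achieved_tflops / peak_fp32_tflops:.0%}")        # 100%
    print(f"vs tensor-core peak: {achieved_tflops / peak_fp16_tensor_tflops:.0%}") # ~6%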


The CUDA toolkit comes with an occupancy calculator that can help you determine, based on your kernel launch parameters, how busy your GPU can potentially be.

For more information: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#multi...
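The gist of the calculation is that occupancy is whichever per-SM resource runs out first. Here's a deliberately simplified sketch; the per-SM limits below are placeholders (check your GPU's datasheet), and the real calculator also accounts for register/shared-memory allocation granularity and architecture-specific block limits:

    # Very simplified theoretical-occupancy estimate -- illustration only.
    # Per-SM hardware limits (placeholder values; check your GPU's datasheet).
    MAX_THREADS_PER_SM = 2048
    MAX_BLOCKS_PER_SM  = 32
    REGS_PER_SM        = 65536
    SMEM_PER_SM        = 96 * 1024   # bytes
    WARP_SIZE          = 32

    def occupancy(threads_per_block, regs_per_thread, smem_per_block):
        # Resident blocks per SM are capped by whichever resource is exhausted first.
        limits = [
            MAX_BLOCKS_PER_SM,
            MAX_THREADS_PER_SM // threads_per_block,
            REGS_PER_SM // (regs_per_thread * threads_per_block),
            SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM,
        ]
        blocks = min(limits)
        active_warps = blocks * (threads_per_block // WARP_SIZE)
        max_warps = MAX_THREADS_PER_SM // WARP_SIZE
        return active_warps / max_warps

    # Register-limited example: 4 resident blocks -> 32 of 64 warps -> 50%
    print(f"{occupancy(threads_per_block=256, regs_per_thread=64, smem_per_block=16384):.0%}")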


You need to profile them. Nsight is one option; even torch can produce flamegraphs.
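A minimal sketch of the built-in PyTorch profiler, assuming a CUDA-capable setup (the Linear layer and random batch below are toy stand-ins for your own model and input):

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Toy stand-ins -- replace with your own model and input batch.
    model = torch.nn.Linear(1024, 1024).cuda()
    batch = torch.randn(64, 1024, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            model(batch)

    # Per-op summary sorted by GPU time, plus a trace you can inspect visually.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto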


Open source models like Flux Kontext or Qwen Image Edit wouldn't refuse, but you need to either have a sufficiently strong GPU or rent one in the cloud (neither difficult nor expensive with services like RunPod), then set up your own processing pipeline (again, not too difficult if you use ComfyUI). Results won't be SOTA, but they shouldn't be too far off.


I pay for ChatGPT because, in my experience, o3 and o4 are currently the best at combining reasoning with information retrieval from web searches. They're the best models I've tried at emulating the way I search for information (evaluating source quality, combining and contrasting information from several sources, refining searches, etc.) and at using the results as part of a reasoning process. That doesn't matter much for coding, but it does for design work.


From the article:

> Besides protein folding, the canonical example of a scientific breakthrough from AI, a few examples of scientific progress from AI include:

> Weather forecasting, where AI forecasts have had up to 20% higher accuracy (though still lower resolution) compared to traditional physics-based forecasts.

> Drug discovery, where preliminary data suggests that AI-discovered drugs have been more successful in Phase I (but not Phase II) clinical trials. If the trend holds, this would imply a nearly twofold increase in end-to-end drug approval rates.


> It’s insane to me that maybe every bank I use requires SMS 2FA, but random services I use support apps.

It never ceases to surprise me how much American banks seem to lag behind when it comes to payment tech. My (European) bank started sending hardware TOTP tokens to whoever requested one about a decade ago. They've since switched to phone-app MFA.
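For anyone unfamiliar, the app/token flavor of MFA is just TOTP (RFC 6238): a shared secret plus the current time, no SMS involved. A tiny sketch using the pyotp library (the secret here is generated on the spot purely for illustration):

    # Tiny TOTP (RFC 6238) sketch with pyotp -- the same scheme behind
    # authenticator apps and hardware TOTP tokens.
    import pyotp

    secret = pyotp.random_base32()   # shared once at enrollment (e.g. via QR code)
    totp = pyotp.TOTP(secret)

    code = totp.now()                # 6-digit code, rotates every 30 seconds
    print(code, totp.verify(code))   # server-side check -> True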


I've been working on extracting text from some 20 million PDFs, with just about every type of layout you can imagine. We're using a similar approach (segmentation / OCR), but with PyMuPDF.
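To give an idea of the general pattern (this is a minimal sketch, not our production pipeline): take the embedded text layer via PyMuPDF when it exists, otherwise render the page and fall back to OCR. pytesseract stands in for whatever OCR engine you actually use, "example.pdf" is a placeholder path, and the 50-character threshold is arbitrary:

    import fitz                    # PyMuPDF
    import pytesseract
    from PIL import Image

    def extract_page_text(page):
        text = page.get_text("text")
        if len(text.strip()) > 50:             # looks like a real text layer
            return text
        # No usable text layer: render at ~300 dpi and OCR the image instead.
        pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        return pytesseract.image_to_string(img)

    doc = fitz.open("example.pdf")             # placeholder path
    pages = [extract_page_text(page) for page in doc]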

The full extract is projected to run for several days on a GPU cluster, at a cost of like 20-30k (can't remember the exact number but it's in that ballpark). When you can afford this kind of compute, text extraction from PDFs isn't quite a fully solved problem, but we're most of the way there.

What the article in the OP tries to do is, as far as I understand, somewhat different. It's trying to use much simpler heuristics to get acceptable results cheaper and faster, and this is definitely an open issue.


We don't know for sure that the universe is a closed system.


Primordial black holes are black holes that formed right after the big bang. Basically areas where gravity caused the extremely dense matter of the universe's first instants to collapse into black holes before expansion could pull it apart. Their existence has been hypothesized but not confirmed (or definitively rejected) so far.

