I wrote blitters in assembly back in those days for my teenager hobby games. When I could actually target the 386 with its dword moves, it felt blisteringly fast. Maybe the 386 didn't run 286 code much faster but I recall the chip being one of the most mind-blowing target machine upgrades I experienced. Much later I recall the FPU-supported quadword copy in 486dx and of course P6 meeting MMX in Pentium II. Good times.
You're 100% right that the 386 had a huge amount of changes that were pivotal in the future of x86 and the ability to write good/fast code.
I think a bigger challenge back then was the lack of software that could take advantage of it. Given the nascent state of the industry, lots of folks wrote for the 'lowest common denominator' and kept it at that (i.e. expense of hardware to test things like changing routines used based on CPU sniffing.)
And even then of course sometimes folks were lazy. One of my (least) favorite examples of this is the PC 'version' (It's not at all the original) of Mega Man 3. On a 486/33 you had the option of it being almost impossible twitchy fast, or dog slow thanks to turbo button. Or, the fun thing where Turbo Pascal compiled apps could start crapping out if CPU was too fast...
Sorry, I digress. the 386 was a seemingly small step that was actually a leap forward. Folks just had to catch up.
I was programming in Turbo Pascal at the time, which was still 16-bit. But when I upgraded my 286 to a Cyrix 486, on a 386 motherboard[1], I could utilize the full 32-bit registers by prefixing assembly instructions with 0x66 using db[1].
This was a huge boost for a lot of my 3D rendering code, despite the prefix not being free compared to pure 32-bit mode.
Imagine how it felt going from an 8086 @ 8 MHz to an 80486SX (the cheapo version without FPU) @ 33 MHz. With blazingly fast REP MOVSD over some form of proto local bus Compaq implemented using a Tseng Labs ET4000/W32i vga chip.