It's slow because if you don't have hardware support you end up moving around fairly large chunks of memory, and if you do it 'smoothly' you end up doing that 10 to 16 times more frequently still! It wasn't rare at all to see text editors not being able to keep up with the keypresses when scrolling up or down.
Some very quick and dirty estimates to illustrate why doing this was pushing the limits for a long time:
At 320x200, a monochrome display is roughly 8K, and 320x256 is roughly 10K. 640x400 is roughly 32K, and 640x512 roughly 40K.
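For anyone who wants to check those sizes, here is a minimal sketch of the arithmetic. The helper name `framebuffer_bytes` and its `bitplanes` parameter are made up for illustration; it just assumes one bit per pixel per bitplane, packed 8 pixels to a byte.

```python
def framebuffer_bytes(width: int, height: int, bitplanes: int = 1) -> int:
    """Bytes needed for a bitmap with one bit per pixel per bitplane."""
    return width * height * bitplanes // 8


# The four monochrome modes mentioned above.
MODES = [(320, 200), (320, 256), (640, 400), (640, 512)]

for width, height in MODES:
    size = framebuffer_bytes(width, height)
    print(f"{width}x{height} monochrome: {size} bytes")
```

Passing `bitplanes=2` gives the size of the same resolution in 4 colours.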
At 1MHz like on a C64 on a PAL system (25 full frames per second) you have roughly 40,000 clock cycles per frame as an absolute upper bound.
That leaves you with roughly 5 clock cycles to copy each byte for 320x200 (the most the C64 would do). But the graphics chip "stole" roughly every other memory cycle outside of the vertical blank, so you had roughly 20k cycles per frame to read the instruction stream, read a byte and write a byte, for the entire screen.
Given that the shortest instructions on the 6502-compatible CPUs are a byte and can load/store no more than a byte, even if there were single byte instructions that'd let you meaningfully load/store data (there aren't), the minimum number of memory cycles would be 64K to move a screen full of text, or more than 3 frames. In reality you need much more to manipulate source/destination addresses. It's been too long, but I'd be surprised if you'd get anything useful done in less than 8 frames, or more than 300ms.
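The C64 budget above can be sketched as a few lines of arithmetic. This is a rough model, not a measurement: it takes the 1MHz, 25 frames per second and "half the cycles are stolen" figures from the estimate as given, and the per-byte cost of a real copy loop is my own assumption, using the documented 6502 timings for a simple indexed loop (`LDA abs,X` 4 cycles, `STA abs,X` 5, `INX` 2, `BNE` 3 when taken) and ignoring the extra work of stepping to the next 256-byte page.

```python
CPU_HZ = 1_000_000       # "1MHz" as in the estimate above
FRAMES_PER_SECOND = 25   # full PAL frames per second
SCREEN_BYTES = 8_000     # 320x200 monochrome

cycles_per_frame = CPU_HZ // FRAMES_PER_SECOND   # upper bound per frame
usable_per_frame = cycles_per_frame // 2         # estimate: half go to the graphics chip


def frames_to_copy(cycles_per_byte: float) -> float:
    """Frames needed to copy one full screen at a given per-byte cost."""
    return SCREEN_BYTES * cycles_per_byte / usable_per_frame


# Assumed simple loop: LDA abs,X (4) + STA abs,X (5) + INX (2) + BNE taken (3).
SIMPLE_LOOP_CYCLES = 4 + 5 + 2 + 3

frames = frames_to_copy(SIMPLE_LOOP_CYCLES)
milliseconds = frames * 1000 / FRAMES_PER_SECOND
print(f"{cycles_per_frame} cycles per frame, {usable_per_frame} usable")
print(f"simple loop: {frames:.1f} frames, about {milliseconds:.0f} ms per screen")
```

Even this optimistic loop lands at several frames per full-screen copy, which is in the same ballpark as the guess above.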
Faster CPUs of course quickly make it better (as do, potentially, specialized instructions that reduce the amount of instruction overhead), but even then you tended to get higher resolution or more colours to go with them.
E.g. the default Amiga Workbench was 640x200 on NTSC and 640x256 on PAL systems, in 4 colours (2 bitplanes), for roughly 32K or 40K for a full screen. At 7.09MHz on a PAL system (7.16MHz is the NTSC clock), you'd have about 173 clock cycles per byte per second, or less than 7 cycles per byte per frame. The 16 bit data bus means you can spend about 14 cycles per 2 bytes, but then again the CPU gets access to at most half the bus cycles, and so you're back to struggling.
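The same back-of-the-envelope calculation for the Amiga, as a sketch. The function name and parameters are made up for illustration, and the clock is passed in so you can try both values: roughly 7.09MHz for a PAL machine and roughly 7.16MHz for NTSC. Either one gives just under 7 cycles per byte per frame for a 640x256, 2-bitplane screen at 25 frames per second.

```python
def cycles_per_byte_per_frame(cpu_hz: int, width: int, height: int,
                              bitplanes: int, frames_per_second: int) -> float:
    """CPU clock cycles available per screen byte in a single frame."""
    screen_bytes = width * height * bitplanes // 8
    return cpu_hz / screen_bytes / frames_per_second


PAL_CLOCK_HZ = 7_093_790    # approximate PAL Amiga CPU clock
NTSC_CLOCK_HZ = 7_159_090   # approximate NTSC Amiga CPU clock

for name, clock in [("PAL clock", PAL_CLOCK_HZ), ("NTSC clock", NTSC_CLOCK_HZ)]:
    per_byte = cycles_per_byte_per_frame(clock, 640, 256, 2, 25)
    # A 16 bit bus moves 2 bytes per access, so the per-word budget doubles.
    print(f"{name}: {per_byte:.2f} cycles/byte, {2 * per_byte:.2f} cycles/word per frame")
```

This is the raw clock budget only; it does not model the CPU losing bus cycles to the display hardware.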
The only reason the Amigas could reasonably copy a full screen of text per frame without resorting to monochrome and/or the lower resolutions was special hardware support in the form of the blitter.