fstep is slow in part because it accesses the Task memory for every disassembly. Maybe that can be cached? Then again, instruction stepping is just slow in general. An alternative could be combining stepping with breakpoints set on "interesting functions", or only stepping while in the main program map and not in any of the shared library maps. Memory access is not only slow for fstep: other programs (like fcore) could also use faster access to the inferior memory. One idea (at least for read access) is mmapping the inferior address space (/proc/<pid>/mem), and/or performing larger transfers and caching under the hood.
Fixed by the merger of the AddressSpace and MemorySpace backends. Reads now normally go through /proc/pid/mem, which makes fstep just "slow", not terribly slow.
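For illustration only, here is a minimal sketch of the "larger transfers and caching under the hood" idea for read access. It is not the actual backend code; the class name CachedReader, its methods and the 4096-byte page size are all assumptions made up for this example. It reads whole pages from any seekable binary file object and serves later small reads (such as the few bytes needed for one disassembly) from the cache.

```python
import io

PAGE_SIZE = 4096  # assumed transfer size; not taken from the real backend


class CachedReader:
    """Hypothetical page cache in front of a seekable byte source."""

    def __init__(self, backing, page_size=PAGE_SIZE):
        self.backing = backing      # e.g. open("/proc/<pid>/mem", "rb", buffering=0)
        self.page_size = page_size
        self.pages = {}             # page base address -> bytes of that page
        self.transfers = 0          # number of reads that hit the backing file

    def _page(self, base):
        page = self.pages.get(base)
        if page is None:
            # One large transfer instead of many small ones.
            self.backing.seek(base)
            page = self.backing.read(self.page_size)
            self.transfers += 1
            self.pages[base] = page
        return page

    def read(self, addr, length):
        """Return up to `length` bytes starting at `addr`."""
        out = bytearray()
        end = addr + length
        while addr < end:
            base = addr - addr % self.page_size
            page = self._page(base)
            offset = addr - base
            chunk = page[offset:offset + (end - addr)]
            if not chunk:
                break               # short read: ran past the available data
            out += chunk
            addr += len(chunk)
        return bytes(out)

    def invalidate(self):
        """Drop all cached pages, e.g. after the inferior has run."""
        self.pages.clear()


# Stand-in for inferior memory so the sketch runs anywhere.
fake_memory = io.BytesIO(bytes(range(256)) * 64)
reader = CachedReader(fake_memory)
first = reader.read(10, 4)      # fetches one page
second = reader.read(20, 8)     # served from the cached page
```

Against a real inferior the backing object would be the /proc/<pid>/mem file, which requires ptrace permission on the target. Since the inferior can modify its own memory whenever it runs, the cache would have to be invalidated (or limited to pages known not to change) after every step or continue.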