I don't really see anything in your post that supports the position that multithreading should be avoided. Yes, obviously doing more efficient computations beats doing less efficient computations faster, but at the same level of optimisation, a well-designed multi-threaded CPU-bound program will almost always be faster (by how much depends on Amdahl's Law). On most architectures the cores have some level of independent caches, so increasing the number of threads also increases the effective memory bandwidth (for things that are in cache, taking cache invalidation etc. into account). Cores also have their own prefetchers. Modern memory systems are quite well optimised for concurrent access; it's definitely not a case of all cores having to go through the same narrow pipe. For example, if two cores need data from two different memory banks (a last-level cache miss, on the order of 100 cycles), the memory controller can issue read requests for both and wait for both at the same time.
Most of what you described are basic optimisation concepts that anyone who cares about performance should be familiar with. However, there's one big idea that took me much longer to REALLY understand and get on board with: always profile first. Humans are absolutely terrible at estimating where the performance hotspots are, and left to our guesses, we will spend all our time optimising things that just don't matter. Now when I program, I never do any non-trivial optimisation ahead of time, strictly only after profiling. That doesn't mean I pessimise unnecessarily, of course. I still avoid making unnecessary copies of data, etc. Just nothing that requires spending more time.
Just a few notes about the specific things you mentioned:
* Manual prefetching helps in theory, but is extremely hard to do usefully in practice. You have to prefetch far enough ahead that it makes a difference (if you do it just 10 cycles before you need the data, it won't help, and you're still paying the instruction decoding cost, etc.), but not so far ahead that the line gets evicted before you use it (in which case you've wasted memory bandwidth that another thread could perhaps have used). Also, in your linked-list example, if you traverse the list every frame, the hardware prefetcher is probably already doing it for you, because it will have recognised the access pattern. I have been working on performance-critical stuff for about 10 years now, and have only seen one instance of manual prefetching that actually helped (and only marginally). Many have tried. If I'm trying to optimise a program, this is one of the very last things I'd look into. Yes, preferring a vector over a list where possible is a good idea, mostly because contiguous access means each cache-line fetch picks up multiple elements, and for a very long vector, DRAM burst mode makes the streaming reads more efficient.
* Virtual functions are fine in reality. The v-table is almost always going to be hot in the L1 data cache, so those fetches are essentially free with good instruction scheduling. Also, in many cases the actual function called can be proven at compile time, and the compiler can de-virtualise the call, in which case it's absolutely free. Virtual functions offer a lot of readability and maintainability benefits; I would need very concrete evidence that they are a significant bottleneck before taking them out. Again, profile first. I have not encountered a single case where this makes a difference.