Thanks for the pointers; there are lots of discussions on Stack Overflow. I tried the perf tools in Eclipse, and there seems to be a lot there. (Perf seems to be a long-time favorite of Linus: http://marc.info/?l=git&m=126262088816902&w=2).
I also wanted to run tests directly on the Galileo, so I instrumented the code with my own timers, counters, and clocks. Based on those results, I was able to rework some of the code (I/O, network, USB, and computation) for better performance.
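For anyone wondering what that kind of hand-rolled instrumentation can look like, here is a minimal sketch in C. It assumes a Linux userspace build where clock_gettime(CLOCK_MONOTONIC) is available. The names perf_timer, now_ns, perf_start, perf_stop, and perf_report are hypothetical, made up for illustration; this is not the code used for the tests described above.

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() under strict -std=c99/c11 */
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper type (an assumption, not from the post):
 * accumulates elapsed time and a call count for one code region. */
typedef struct {
    const char *name;   /* label printed in the report */
    uint64_t start_ns;  /* timestamp taken by perf_start() */
    uint64_t total_ns;  /* sum of all measured intervals */
    uint64_t calls;     /* number of start/stop pairs */
} perf_timer;

/* Monotonic timestamp in nanoseconds; unaffected by wall-clock changes. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void perf_start(perf_timer *t)
{
    t->start_ns = now_ns();
}

static void perf_stop(perf_timer *t)
{
    t->total_ns += now_ns() - t->start_ns;
    t->calls++;
}

static void perf_report(const perf_timer *t)
{
    double avg_us = t->calls ? (double)t->total_ns / (double)t->calls / 1000.0 : 0.0;
    printf("%s: %" PRIu64 " calls, %" PRIu64 " ns total, %.3f us avg\n",
           t->name, t->calls, t->total_ns, avg_us);
}

#ifdef PERF_TIMER_DEMO
/* Demo: time a small compute loop 100 times. Build with -DPERF_TIMER_DEMO. */
int main(void)
{
    perf_timer compute = { "compute", 0, 0, 0 };
    volatile uint32_t sink = 0; /* volatile so the loop is not optimized away */

    for (int run = 0; run < 100; run++) {
        perf_start(&compute);
        for (uint32_t i = 0; i < 10000; i++)
            sink += i * i;
        perf_stop(&compute);
    }
    perf_report(&compute);
    return 0;
}
#endif
```

Wrapping the I/O, network, USB, or computation section of interest between perf_start() and perf_stop() gives per-region totals that can be compared from run to run. CLOCK_MONOTONIC is used here so that wall-clock adjustments do not skew the intervals. On older glibc versions (before 2.17) you may need to link with -lrt.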
I'd like to carry performance tests forward for comparison on future multi-core/multi-threaded chips as well.