Wednesday, September 14, 2016

NASA Sample Return Robot Challenge Post Mortem

MandI in a Seattle park being semi-tele-oped, August 2016

Relief and disappointment all in a few short moments. Our robot MandI, named for our team Mind and Iron, seemed to be stuck on the marble slab in the sunny, grassy park, but then it ramped up and over. Quite a feat for a wood-framed robot, but also not so great for our team as it smashed into the bright orange perimeter fencing.

The one and only run came to an entangled end. The team set out from the band-shell full of NASA folks to clear the field, our shot at a million dollars over.

Do I blame myself? Certainly. If only I had checked this, or checked that… It just becomes a jumble of what-ifs. Writing this is mostly to get down my thoughts and observations, hopefully without too much blame.

Unfortunately the vision system was not communicating with the rest of the robot system. That was totally unexpected; it should have been sending photo coordinates to the robot’s state system. In the post mortem I found that some of the vision script had been modified on the robot without being pushed back up to the repo, so when my final changes came in they conflicted with those changes. I did not take care of this merge conflict, but I should have: there was an obvious and basic misunderstanding of how the script operated with the other changes.

So it sounds like I am trying hard not to blame someone else by saying that I should have checked the final code. I would say it is not just my fault, but rather I take joint responsibility. So let’s turn to a proven method for getting down to the underlying reasons.

Why?

There were merge conflicts and I didn’t realize this until too late.

Why?

Because there wasn’t enough communication about the merge conflicts and the sudden loss of functionality.

Why?

I can only guess there was an assumption that minimal knowledge of the script was enough to deal with the merge conflicts, which in turn was probably caused by the lack of unit tests.

Why?

Because most of the script and final functionality was written only a few days before the sequester period.

Why?

  • General lack of planning on my part
  • No quick feedback loops for communication until we were in the same place
  • Not taking advantage of the previous trip
  • Was still exploring possibilities with the trip to Seattle
  • Changes to systems at the last minute due to insufficient previous exploration

I ran out of ‘why’s with the five whys method. In summary, I did not handle this very well. Across planning, exploring, and executing, too much of the time still went to thinking and exploring.

What did I learn:
  • Planning is important, but if it blocks exploring then it becomes a detriment
  • Exploring is important, but if there are still unknowns even a few weeks out then it is time to simplify.
  • Write scripts so that components can be unit tested. If it isn’t getting the parameter in the first place, how is it going to send it?
  • Always check merges with the above unit tests, or have others run the unit tests on their side. Double check functionality before critical situations.
  • Never change code in high stress situations; make a UI that can simulate requested changes. Or if you do make changes, run the written unit tests!
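
A minimal sketch of the unit testing point in Python, not the actual MandI code. Every name here (`to_state_message`, `report_detection`, the message fields) is made up for illustration. The idea is that the step that turns a detection into a message is a plain function, and the sending step is passed in, so a test can check both "did it get the parameter" and "did it send it" without a robot attached.

```python
import unittest


def to_state_message(detection):
    """Turn a detection (x, y pixel pair) into the message for the state system.

    The message layout is hypothetical. A missing detection raises instead of
    silently sending nothing, so the failure shows up in a test.
    """
    if detection is None:
        raise ValueError("no detection received from the vision step")
    x, y = detection
    return {"type": "sample_sighting", "x": int(x), "y": int(y)}


def report_detection(detection, send):
    """Build the message and hand it to `send`.

    `send` is any callable taking one message. On the robot it would be the
    real link to the state system; in a test it can be a list's append.
    """
    message = to_state_message(detection)
    send(message)
    return message


class ReportDetectionTest(unittest.TestCase):
    def test_message_reaches_sender(self):
        sent = []
        report_detection((120, 45), sent.append)
        self.assertEqual(sent, [{"type": "sample_sighting", "x": 120, "y": 45}])

    def test_missing_detection_is_loud(self):
        sent = []
        with self.assertRaises(ValueError):
            report_detection(None, sent.append)
        self.assertEqual(sent, [])
```

Saved to a file, tests like these can be run after every merge with `python -m unittest` pointed at that file, which is the kind of check I skipped.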

Though smashing into the fence was not my code, I still didn’t like the fact that I did not control as much as I should have. Building that control was crucial. One way is sitting down and hand testing it, which would have required quite a bit of computer and seat/keyboard shuffling, not something I particularly look forward to in any situation, maybe some musical chairs anxiety from childhood… The control is more simply built by making it easy for another person to check functionality.

Writing thorough tests and setting up a way for them to be run easily gives us two immediate benefits. First, the other person only needs to be told how to run the tests to see whether the code is working. Second, they can send the test’s error back to me, so even if it “works for me” we can still pursue a course of action where we find out where it went wrong.

Writing tests is important, but as I answered for one of the whys: I really needed to be more aware of the process. Planning, exploration, and execution are all phases of the process, and if any of these are early/long/late then the rest of the process will become disjointed or panicked. Just being aware of these will help me to recognize situations before they become critical.

*finito post mortem*
