Injecting Errors for Fun and Profit

02/10/2017

INJECTING E-CACHE ERRORS ON THE ULTRASPARC-II

“Handling errors is just attention to detail. Injecting errors is rocket science.” - me

While the hardware engineers were determining the cause of the e-cache parity errors and then working on a fix, I was asked to lead a project to mitigate the effect of the errors in software. Unfortunately, the UltraSPARC-II used an imprecise trap to report e-cache parity errors detected by a load instruction or an instruction fetch, so recovery was not possible even when the error was on a clean cache line. We were able to recover from parity errors detected by some write-backs, and we definitely improved the kernel’s messages when parity errors were encountered.

We prototyped error confinement: when an error affected only a user program and not the kernel, only that program was terminated (a feature that had to wait for the Service Management Facility of Solaris 10 and its process restarter before we could deploy it safely). We also introduced a cache scrubber that used diagnostic accesses to proactively look for parity errors on clean cache lines in a safe fashion (that is, one that would not cause a kernel panic) and flushed them from the cache before they could cause an outage. Whenever the system went idle, we flushed all clean lines, and all error-free dirty lines, from the cache.
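The scrubbing and idle-flush policy described above can be illustrated with a small user-level model. This is only a sketch under loud assumptions, not the Solaris implementation: the names `CacheLine`, `inject_error`, `scrub`, and `idle_flush` are hypothetical, a single even-parity bit per line stands in for the real parity scheme, and a Python dictionary stands in for main memory. The real scrubber ran in the kernel and used diagnostic accesses, which this model does not attempt to reproduce.

```python
from dataclasses import dataclass


def parity(value: int) -> int:
    """Even-parity bit of an integer payload (simplified: one bit per line)."""
    return bin(value).count("1") & 1


@dataclass
class CacheLine:
    tag: int            # hypothetical: memory address the line caches
    data: int
    stored_parity: int  # parity recorded when the line was filled
    dirty: bool = False
    valid: bool = True

    def parity_ok(self) -> bool:
        return parity(self.data) == self.stored_parity


def make_line(tag: int, data: int, dirty: bool = False) -> CacheLine:
    return CacheLine(tag, data, parity(data), dirty)


def inject_error(line: CacheLine, bit: int = 0) -> None:
    """Flip one data bit without updating the stored parity."""
    line.data ^= 1 << bit


def scrub(cache: list) -> list:
    """Invalidate clean lines with bad parity; memory still holds a good copy."""
    dropped = []
    for line in cache:
        if line.valid and not line.dirty and not line.parity_ok():
            line.valid = False
            dropped.append(line.tag)
    return dropped


def idle_flush(cache: list, memory: dict) -> list:
    """Drop all clean lines; write back and drop error-free dirty lines.

    Dirty lines with bad parity are left alone (the only copy of the data
    is suspect) and their tags are returned.
    """
    stuck = []
    for line in cache:
        if not line.valid:
            continue
        if line.dirty:
            if not line.parity_ok():
                stuck.append(line.tag)
                continue
            memory[line.tag] = line.data  # write-back
        line.valid = False
    return stuck


# Illustrative run: corrupt one clean line, scrub, then flush on "idle".
memory = {0x10: 0b1011, 0x20: 0b0110, 0x30: 0b0001}
cache = [
    make_line(0x10, 0b1011),
    make_line(0x20, 0b0110),
    make_line(0x30, 0b1111, dirty=True),
]
inject_error(cache[1], bit=2)
dropped = scrub(cache)
stuck = idle_flush(cache, memory)
```

The point of the model is the asymmetry: a corrupted clean line can simply be discarded because memory holds a good copy, whereas a corrupted dirty line cannot be written back safely, which is why only error-free dirty lines are flushed.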

