<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PetraVM</title>
	<atom:link href="http://petravm.com/feed" rel="self" type="application/rss+xml" />
	<link>http://petravm.com</link>
	<description>PetraVM is dedicated to making it easier to write, debug and deploy reliable multithreaded software</description>
	<lastBuildDate>Sun, 07 Feb 2010 21:06:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Memory Consistency Models</title>
		<link>http://petravm.com/uncategorized/memory-consistency-models</link>
		<comments>http://petravm.com/uncategorized/memory-consistency-models#comments</comments>
		<pubDate>Sun, 07 Feb 2010 17:25:49 +0000</pubDate>
		<dc:creator>luis</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=326</guid>
		<description><![CDATA[Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues [...]]]></description>
			<content:encoded><![CDATA[<p><strong><span style="font-weight: normal;">Memory consistency models have an almost mythical aura. They can puzzle the most experienced programmers and lead to bugs that are incredibly hard to understand and fix. If you have written multithreaded code, it is likely that you have stumbled upon memory model woes. Chances are that you have also lost bets with your colleagues because of memory consistency model disputes. In this blog post I will discuss some of the rationale of why memory models were created and give some specific examples of how that affects you.</span></strong></p>
<p><span style="font-weight: normal;">First things first, let&#8217;s define what a memory model is. The memory consistency model defines what values a read operation can return. The simplest memory model is sequential consistency, in which the execution behaves as if there were a single global interleaving of memory operations and the operations of a given thread appear in the same order as they appear in the program. It is the most natural model for normal humans to think about because the execution behaves as if it ran on a multitasking uniprocessor. Consider the following example:</span></p>
<p><img class="aligncenter size-full wp-image-369" title="figs-sc" src="http://petravm.com/wp-content/uploads/2010/02/figs-sc.png" alt="figs-sc" width="366" height="220" /></p>
<p><span style="font-weight: normal;">The question is: what value can the read from <code>data</code> in P2 return? The most obvious answer here is 42. Now what would happen if P2 observed the writes to </span><code><span style="font-weight: normal;">data</span></code><span style="font-weight: normal;"> and </span><code><span style="font-weight: normal;">flag</span></code><span style="font-weight: normal;"> in the opposite order? P2 could actually read <code>data</code> as &#8220;0&#8221;, which is surprising and not allowed by the sequential consistency memory model.</span></p>
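<p>Since the figure is an image, here is a rough C++ rendering of the same handoff. It is a sketch under assumptions, not the exact code in the figure: the function name <code>message_passing_once</code> and the busy-wait loop are illustrative. <code>flag</code> is a <code>std::atomic</code> with the default sequentially consistent ordering, which rules out the outcome where P2 reads 0.</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the data/flag handoff once and returns the value P2 reads from `data`.
int message_passing_once() {
    int data = 0;                    // plain payload, written only by P1
    std::atomic<bool> flag{false};   // synchronization variable
    int observed = -1;

    std::thread p1([&] {
        data = 42;
        flag.store(true);            // default ordering is memory_order_seq_cst
    });
    std::thread p2([&] {
        while (!flag.load()) {       // wait until P1 has published
            std::this_thread::yield();
        }
        observed = data;             // ordered after the flag load, so it sees 42
    });

    p1.join();
    p2.join();
    return observed;
}
```

<p>If <code>flag</code> were a plain <code>bool</code>, the program would contain a data race, and both the compiler and the hardware would be free to reorder the two writes.</p>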
<p><span style="font-weight: normal;">The main problem with sequential consistency is that systems like to reorder memory operations to hide long latency operations and consequently improve performance. For example, when a cache miss is being serviced, the processor may execute another memory access that comes after that in program order, which may hit in the cache and therefore complete earlier than the missing access. However, processors are not the only source of memory operation reordering. Many compiler optimizations effectively reorder code, e.g., loop-invariant code motion, common sub-expression elimination, etc.  Furthermore, the memory models of languages and the hardware they run on need not be the same. The compiler and synchronization libraries need to insert fences in the code to map the language memory model to the hardware model. For example, Java and C++0x (the upcoming C++ standard) support memory models that guarantee sequential consistency for programs free of data races.</span></p>
<p><span style="font-weight: normal;">Due to the difficulty of improving performance under sequential consistency, a variety of &#8220;relaxed&#8221; memory models were conceived. For example, in the Weak Ordering memory model, there is no guarantee that a processor will observe another processor&#8217;s memory operations in program order. This is where a &#8220;memory fence&#8221; (a.k.a. &#8220;memory barrier&#8221;) comes into play. When a fence instruction is executed, it guarantees that all memory operations prior to it in program order are completed (and visible to other processors) before any operation after the fence in program order is allowed to proceed.  You would be bored and stop reading if I described the multitude of consistency models in this post. However, I do encourage you to read more about memory models in this </span><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf"><span style="font-weight: normal;">very nice tutorial</span></a><span style="font-weight: normal;"> by Sarita Adve and Kourosh Gharachorloo.  Also, </span><a href="http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf"><span style="font-weight: normal;">Paul McKenney&#8217;s paper</span></a><span style="font-weight: normal;"> has a nice table summarizing the ordering relaxations in modern microprocessors.</span></p>
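<p>As an illustration of fences at the language level, the sketch below hands a value from one thread to another using relaxed atomic accesses plus explicit C++ fences. The function and variable names are made up for this example, and C++ fences are a language construct that is not identical to the hardware fence instruction described above. Informally, the release fence keeps the write to <code>data</code> from moving after the store to <code>flag</code>, and the acquire fence keeps the read of <code>data</code> from moving before the load of <code>flag</code>.</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hands `data` from producer to consumer using relaxed atomics plus fences.
int handoff_with_fences() {
    int data = 0;
    std::atomic<bool> flag{false};
    int observed = -1;

    std::thread producer([&] {
        data = 42;
        // Everything written above must be visible before the flag is seen.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Reads below may not be satisfied before the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = data;
    });

    producer.join();
    consumer.join();
    return observed;
}
```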
<p><span style="font-weight: normal;">Now let&#8217;s talk about some of the highlights of the x86 memory model. A big disclaimer first. This can change and probably does change between processor models, so it is always a good idea to check the manuals before writing ordering-sensitive code (8-8 Vol. 3 in </span><a href="http://www.intel.com/Assets/PDF/manual/253668.pdf"><span style="font-weight: normal;">this manual</span></a><span style="font-weight: normal;"> for Intel and Section 7.2, page 164 in </span><a href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf"><span style="font-weight: normal;">this manual</span></a><span style="font-weight: normal;"> for AMD).</span></p>
<p><span style="font-weight: normal;">In a nutshell, recent implementations of the x86 ISA (P6 and later) implement, roughly, what is normally termed <em>processor consistency</em>. Its key ordering properties are:</span></p>
<ol>
<li><span style="font-weight: normal;"><em>reads</em> are <em>not</em> reordered with respect to <em>reads</em>;</span></li>
<li><span style="font-weight: normal;"><em>writes</em> are <em>not</em> reordered with respect to <em>reads</em> that come earlier in the program order;</span></li>
<li><span style="font-weight: normal;"><em>writes </em>are <em>not</em> reordered with respect to <em>most writes</em> (the exceptions include, e.g., the multiple writes implicitly performed by string operations);</span></li>
<li><span style="font-weight: normal;"><em>reads </em><em>may</em> be reordered with respect to <em>writes</em> that come earlier in program order as long as those writes are to a different memory location;</span></li>
<li><span style="font-weight: normal;"><em>reads</em> are <em>not</em> reordered with respect to I/O instructions, locked instructions and other serializing instructions.</span></li>
</ol>
<p><span style="font-weight: normal;">There are no guarantees whatsoever of ordering between writes of different processors; the outcome of concurrent writes to the same memory location is non-deterministic. Increment instructions without a <code>LOCK</code> prefix have no atomicity guarantees. Moreover, even some write operations that update multiple bytes are not guaranteed to be atomic: for example, if a write to multiple bytes happens to cross a cache line boundary, the operation is not guaranteed to be atomic.</span></p>
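<p>To make the point about increments concrete, here is a minimal sketch of a shared counter that does not lose updates. The function name and parameters are illustrative. <code>fetch_add</code> on a <code>std::atomic</code> is an atomic read-modify-write; on x86, compilers typically implement it with a locked instruction.</p>

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increments a shared counter from several threads and returns the final value.
long count_with_atomic(int threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Atomic read-modify-write: no increment can be lost.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

<p>With a plain <code>long</code> and <code>++counter</code> instead, the program would have a data race and the final value could come out lower than expected.</p>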
<p><span style="font-weight: normal;">Here is an example of how the x86 memory model can get you in trouble:</span></p>
<div><img class="aligncenter size-full wp-image-333" title="fig2" src="http://petravm.com/wp-content/uploads/2010/02/fig2.png" alt="fig2" width="326" height="202" /></div>
<div>
<div><span style="font-weight: normal;">An execution whose final state is <code>t1 == 0</code> and <code>t2 == 0</code> is allowed. Such an outcome is unintuitive, and not sequentially consistent, because there is no serialized execution that leads to this state. In any serialized execution, there will be an assignment in one processor (<code>A = 1</code> or <code>B = 1</code>) prior to a read (<code>t1 = B</code> or <code>t2 = A</code>) in the other processor. Another way to look at the problem is to build a </span><em><span style="font-weight: normal;">happens-before</span></em><span style="font-weight: normal;"> graph of the execution. In this representation, a node is an executed instruction and a directed edge exists from instruction P to instruction Q if Q has observed the effects of P, and P has </span><em><span style="font-weight: normal;">not</span></em><span style="font-weight: normal;"> observed the effects of Q. Here is the happens-before graph for the example above when the outcome is <code>t1 == 0</code> and <code>t2 == 0</code>:</span></div>
<div><img class="aligncenter size-full wp-image-351" title="figs5" src="http://petravm.com/wp-content/uploads/2010/02/figs51.png" alt="figs5" width="339" height="157" /></div>
<div><span style="font-weight: normal;">Edge (1) in the graph exists because the read <code>t1 = B</code> in P1 did not observe the write <code>B = 1</code> in P2. The same applies to edge (2). Edges (3) and (4) are there because of program order. Since there is a cycle in the happens-before graph, there is no serialized order that would satisfy the happens-before relationship; therefore, the execution is not sequentially consistent. This happened in this example because the read operation <code>t1 = B</code> in P1 can proceed before the write operation <code>A = 1</code> is completed and visible to P2.</span></div>
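<div>For comparison, here is a sketch of the same two-processor pattern written with C++ atomics, reconstructed from the description above rather than copied from the figure. The function name is illustrative. With the default sequentially consistent ordering, the language rules forbid the outcome <code>t1 == 0</code> and <code>t2 == 0</code>, so the compiler has to emit whatever fencing the hardware needs.</div>

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs the two-thread pattern once and returns the pair (t1, t2).
std::pair<int, int> store_buffering_once() {
    std::atomic<int> A{0};
    std::atomic<int> B{0};
    int t1 = -1;
    int t2 = -1;

    std::thread p1([&] {
        A.store(1);       // seq_cst store
        t1 = B.load();    // seq_cst load, may not pass the store above
    });
    std::thread p2([&] {
        B.store(1);
        t2 = A.load();
    });

    p1.join();
    p2.join();
    return std::make_pair(t1, t2);
}
```

<div>Switching the stores and loads to <code>std::memory_order_relaxed</code> would make the surprising outcome legal again.</div>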
<div><span style="font-weight: normal;"><br />
</span></div>
<div><span style="font-weight: normal;">Here is another example of how the x86 memory model leads to surprising results:</span></div>
<div><span style="font-weight: normal;"><img class="aligncenter size-full wp-image-334" title="fig3" src="http://petravm.com/wp-content/uploads/2010/02/fig3.png" alt="fig3" width="326" height="220" /></p>
<div>The snippet of execution above might lead to a state where <code>t2 == t4 == 0</code>. Let&#8217;s look at this from the perspective of P1. This can happen because the processor can forward the value from the pending write to A (<code>A = 1</code>) to the read of A (<code>t1 = A</code>), which can complete and then allow the read of B to proceed (<code>t2 = B</code>). Note that this does <em>not</em> characterize a reordering of <code>t2 = B</code> and <code>t1 = A</code>! Intuitive, huh?</div>
<div>One final example for you to noodle on. Consider a boiled-down version of Dekker&#8217;s mutual exclusion algorithm:</div>
<div><img class="aligncenter size-full wp-image-335" title="fig4" src="http://petravm.com/wp-content/uploads/2010/02/fig4.png" alt="fig4" width="533" height="232" /></div>
<div>
<div>The gist of the algorithm is to use two flag variables, one for each processor, <code>flag1</code> and <code>flag2</code>. P1 sets <code>flag1</code> when it is attempting to enter the critical section; it then checks whether <code>flag2</code> is set. If it is not set, P2 has not attempted to enter the critical section, so P1 can safely enter it. Because the x86 memory model allows reordering loads with respect to earlier stores, the read of <code>flag2</code> can proceed before setting <code>flag1</code> is completed, which can lead to both processors entering the critical section, since P2 might have just set <code>flag2</code>!</div>
</div>
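<div>Here is a sketch of the boiled-down entry protocol using sequentially consistent C++ atomics, which prevent the load of the other flag from being satisfied before the store to the thread&#8217;s own flag. The function name and the <code>entered</code> counter are assumptions made for this example, and the code only shows a single entry attempt, not the full algorithm with retries.</div>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One entry attempt per thread. Returns how many threads entered (0, 1 or 2).
int dekker_attempt() {
    std::atomic<int> flag1{0};
    std::atomic<int> flag2{0};
    std::atomic<int> entered{0};   // counts threads inside the critical section

    std::thread p1([&] {
        flag1.store(1);                 // announce intent (seq_cst)
        if (flag2.load() == 0) {        // other thread has not announced
            entered.fetch_add(1);       // critical section
        }
    });
    std::thread p2([&] {
        flag2.store(1);
        if (flag1.load() == 0) {
            entered.fetch_add(1);
        }
    });

    p1.join();
    p2.join();
    return entered.load();
}
```

<div>Both threads may back off, so the result can be 0, but sequential consistency guarantees it is never 2.</div>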
<div>That is it! I hope this helped you get a better grasp of what a memory consistency model is and understand a few of the key aspects of the x86 model. And, if you come across something that looks like a memory consistency bug, try building that happens-before graph to find cycles and remember to look at the manual <img src='http://petravm.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . Have fun!</div>
<p></span></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/uncategorized/memory-consistency-models/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making sense of crashes with SmartStop</title>
		<link>http://petravm.com/uncategorized/making-sense-of-crashes-with-smartstop</link>
		<comments>http://petravm.com/uncategorized/making-sense-of-crashes-with-smartstop#comments</comments>
		<pubDate>Sun, 20 Dec 2009 23:08:07 +0000</pubDate>
		<dc:creator>pete</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=201</guid>
		<description><![CDATA[Last time I talked about how it is that Jinx can make your bugs happen faster.  That&#8217;s pretty helpful, but it isn&#8217;t the whole picture.  Some bugs are hard to reproduce and easy to understand, and some are easier to reproduce but harder to understand.  How does Jinx help with hard-to-understand bugs?
The state of the [...]]]></description>
			<content:encoded><![CDATA[<p>Last time I talked about <a href="http://petravm.com/technology/a-stitch-in-logical-time">how it is that Jinx can make your bugs happen faster</a>.  That&#8217;s pretty helpful, but it isn&#8217;t the whole picture.  Some bugs are hard to reproduce and easy to understand, and some are easier to reproduce but harder to understand.  How does Jinx help with hard-to-understand bugs?</p>
<p>The state of the art with regard to debugging concurrency errors has improved somewhat, but for multi-threaded programs it still typically involves visiting a pile of threads to see which may be involved with the problem.  It&#8217;s a breadcrumb party&#8230; great for bonding with your fellow developers and producing war stories.  Less great for shipping your code.  Various things make the bug harder to understand in this scenario.</p>
<p>What I&#8217;ll call &#8220;overshoot&#8221; is primary among these.  Overshoot is a phenomenon where one thread or task in your program, corresponding to a single CPU at the point of failure, runs for a long time after the error has been detected.  The detecting thread traps into the operating system, which sends out messages to other CPUs saying &#8220;stop any thread in the following process immediately&#8221;.  A nice clean operating system might be able to deliver a stop within a few tens of microseconds, increasing as the number of processing cores in your computer increases.  Sadly, those few tens of microseconds correspond to a few tens of thousands of machine instructions.  That&#8217;s a lot of overshoot.  To understand how that kills your ability to find bugs, let&#8217;s look at a few examples.  We&#8217;ll start with an example where a normal debugger doesn&#8217;t do a horrible job.  It&#8217;s the example from last time, and you may remember that it corresponds to code that looks like this:</p>
<pre>    // Thread 1 (corrupter)
    float myradius = 1;
    lock();
    circle-&gt;radius = myradius;
    unlock();
    lock();
    circle-&gt;area = PI * myradius * myradius;
    unlock();

    // Thread 2 (verifier)
    lock();
    assert(circle-&gt;radius * circle-&gt;radius * PI == circle-&gt;area);
    unlock();</pre>
<div id="attachment_205" class="wp-caption alignnone" style="width: 644px"><img class="size-full wp-image-205" title="smartstop_diagram_1" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_11.png" alt="An atomicity violation as seen by the machine." width="634" height="112" /><p class="wp-caption-text">An atomicity violation as seen by the machine.</p></div>
<p>Because the failing assert stops the verifier while it still holds the lock, the verifier thread will never unlock, and consequently the corrupter will never perform the update of the area.  From the perspective of a developer, your program crashes, one thread is complaining about an inconsistent structure, and one or more threads are trying to acquire the lock on the structure.  You <em>may</em> notice or it may be obvious that one of these threads has just updated the structure in a non-atomic way.  Because the proximity is probably pretty good, this might not be a hard-to-understand bug.</p>
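<p>For reference, the usual repair for this kind of atomicity violation is to update both fields inside a single critical section. The sketch below uses C++ with <code>std::mutex</code>; the <code>Circle</code> struct and the helper names are assumptions for illustration, not code from the application being discussed.</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Circle {
    float radius = 0.0f;
    float area = 0.0f;
};

constexpr float PI = 3.14159265f;

// Single place that defines the invariant, so writer and checker agree exactly.
float area_of(float r) { return r * r * PI; }

// Updates radius and area together, under one lock acquisition.
void set_radius(Circle& c, std::mutex& m, float r) {
    std::lock_guard<std::mutex> guard(m);
    c.radius = r;
    c.area = area_of(r);
}

// Checks the invariant under the same lock.
bool invariant_holds(const Circle& c, std::mutex& m) {
    std::lock_guard<std::mutex> guard(m);
    return area_of(c.radius) == c.area;
}

// Runs a writer and a verifier concurrently and counts invariant failures.
int count_violations(int rounds) {
    Circle c;
    std::mutex m;
    int violations = 0;
    std::thread writer([&] {
        for (int i = 1; i <= rounds; ++i) set_radius(c, m, static_cast<float>(i));
    });
    std::thread verifier([&] {
        for (int i = 0; i < rounds; ++i) {
            if (!invariant_holds(c, m)) ++violations;
        }
    });
    writer.join();
    verifier.join();
    return violations;
}
```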
<p>Things get more interesting, and unpleasant, when we consider a low-level data race.  We take a similar example, but rather than showing an atomicity violation, we make it into a data race, by eliminating the locking in the corrupter altogether.</p>
<pre>    // Thread 1 (corrupter)
    float myradius = 1;
    circle-&gt;radius = myradius;
    circle-&gt;area = PI * myradius * myradius;</pre>
<div id="attachment_207" class="wp-caption alignnone" style="width: 536px"><img class="size-full wp-image-207" title="smartstop_diagram_2" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_2.png" alt="Instead of an atomicity violation, a low-level data race.  The corrupter thread never takes a lock at all." width="526" height="112" /><p class="wp-caption-text">Instead of an atomicity violation, a low-level data race.  The corrupter thread never takes a lock at all.</p></div>
<p>What is the developer&#8217;s experience in this case?  Well, after the assert fires, an operating system call is made, which stops all threads currently running inside the program in question.  This is if you&#8217;re lucky.  On more primitive OSes you might have to wait until your timer tick is up.  Regardless, it&#8217;s pretty plausible that the corrupter thread in this case has done two horrible things:</p>
<ul>
<li>It has advanced thousands of instructions so that it&#8217;s nowhere near the problem location.  You probably won&#8217;t even know what the culprit thread is.</li>
<li>It has repaired the state of the circle structure so that the invariant holds again.  This is that second STORE in the diagram above.  That&#8217;s the corrupter thread cleaning up its mess before going to hide in some other far-flung piece of code.</li>
</ul>
<p>In this case, the developer has one useful piece of information: that the invariant has failed once.  Since we started Petra, we&#8217;ve talked to many people who have seen invariants fail and never figured out why.</p>
<p>One more example before we discuss solutions to this mess.  A wakeup-race is what we call an ordering violation wherein one thread is the cause of the start of activity on another thread. <a href="http://pages.cs.wisc.edu/~shanlu/paper/asplos122-lu.pdf">Learning from mistakes: a comprehensive study on real world concurrency bug characteristics</a> describes a common bug pattern found a few times in Firefox, which involves the creation of a new thread, and the assignment of its ID to a global.</p>
<pre>    // Thread 1
    g_newThread = CreateThread(myfunc);

    // Thread 2
    myfunc() {
        assert(g_newThread != INVALID_THREAD);
    }</pre>
<p>Sometimes we get away with it, as below.  In the case of Firefox, this was presumably a very rare bug.</p>
<div id="attachment_226" class="wp-caption alignnone" style="width: 590px"><img class="size-full wp-image-226" title="smartstop_diagram_3" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_31.png" alt="This latent ordering violation isn't revealed in this example, as the creator thread runs first." width="580" height="112" /><p class="wp-caption-text">This latent ordering violation isn&#39;t revealed in this example, as the creator thread runs first.</p></div>
<p>But, sometimes, we don&#8217;t get away with it.  As you would expect, Jinx is good at making this sort of bug happen.</p>
<div id="attachment_212" class="wp-caption alignnone" style="width: 536px"><img class="size-full wp-image-212" title="smartstop_diagram_4" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_4.png" alt="The unlucky case, where the awoken thread runs first, observing inconsistent state." width="526" height="112" /><p class="wp-caption-text">The unlucky case, where the awoken thread runs first, observing inconsistent state.</p></div>
<p>The unlucky case is, again, very bad for the programmer.  After the bug is detected, there is a long window until the thread that stored the thread id late gets stopped.  During that time it can return to its event loop, enter some other unrelated code, and, in the case shown, set g_newThread to be valid, thus obfuscating the state of the program at fault time.  The basic problem here is that in overshoot situations:</p>
<ol>
<li>The threads keep running.</li>
<li>They overwrite bits of global memory state as they go, hiding the bug or further obfuscating it.</li>
</ol>
<p>So, for the three major classes of bugs, the experience of debugging a crash varies from bad to horrible.  How can we make this better?</p>
<p>The obvious answer is that we have to stop threads earlier.  One thing we could do is just sequentialize the execution, and stop all threads as soon as one of them detects a problem.  This would be an improvement, but it&#8217;s not that great.  There are a couple of reasons for this.</p>
<ol>
<li>The gap between the load of the wrong data value and the detection of the bug may be long.  Say, for example, instead of computing πr² we were computing sin(acos(log²(n)))&#8230; you&#8217;re thousands of cycles in on all threads before you know there&#8217;s a problem.</li>
<li>The gap between the application detecting an error and Jinx picking it up may be large, also.  For example, the application may call into the OS as a part of its abort sequence.</li>
</ol>
<p>Fortunately, there are better ways to solve this problem.  Jinx takes the (patent-pending) approach of advancing each of the threads involved in the crash <em>the minimum computational distance forward</em> before stopping them.  That is, we only want each thread to run far enough forward to allow the crash to happen.  One approximation of this point is the <em>last communication point</em> between threads.  What does it mean to compute the last communication point?  For two threads, the simple definition is that it&#8217;s the <em>last time before the crash</em> that a non-crashing thread wrote something that was read by the crashing thread <em>before the crash</em>.  An example is probably easiest.</p>
<div id="attachment_217" class="wp-caption alignnone" style="width: 536px"><img class="size-full wp-image-217" title="smartstop_diagram_5" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_5.png" alt="There is no reason to allow other threads in the process to advance beyond the last time they communicated with the crashing thread." width="526" height="157" /><p class="wp-caption-text">There is no reason to allow other threads in the process to advance beyond the last time they communicated with the crashing thread.</p></div>
<p>First, we take the example of the low-level data race.  The last thing that was written by the corrupter thread that was still read by the crashing thread <em>before the crash</em> is radius, in the course of the &#8220;STORE radius&#8221; instruction.  Simply by terminating execution of the corrupter thread at its last communication point with the verifier thread (immediately after this instruction), we provide the following user experience:</p>
<ol>
<li>The verifier thread is stopped on the assert line, as before.</li>
<li>The corrupter thread is stopped on the next instruction after writing circle-&gt;radius = 1;</li>
</ol>
<p>That&#8217;s a pretty cool result.  You&#8217;re stopped at the debugger, and all of the participating threads are in a meaningful location.  What about the other really thorny case, the ordering violation?</p>
<div id="attachment_218" class="wp-caption alignnone" style="width: 536px"><img class="size-full wp-image-218" title="smartstop_diagram_6" src="http://petravm.com/wp-content/uploads/2009/12/smartstop_diagram_6.png" alt="Stopping on last communication, applied to an ordering violation." width="526" height="112" /><p class="wp-caption-text">Stopping on last communication, applied to an ordering violation.</p></div>
<p>Here, the last communication is virtual:  it&#8217;s the point where one thread creates the other.  In Jinx&#8217;s case this is trapped at the IPI (interprocessor interrupt) sent from one CPU to another.  Since no memory communications occur after that, it&#8217;s not necessary to allow the creating thread to run past that point.  Again, the user experience is pretty good:  one thread is stopped creating another, and that thread is referencing the illegal global.  The value of g_newThread is still INVALID_THREAD at the point of the crash.</p>
<p>You&#8217;ll note that it&#8217;s only possible to do any of this with Jinx&#8217;s simulation approach.  Only by running executions multiple times can you perform the kind of analysis necessary to do this sort of thing, and you can only do that with Jinx.  You might imagine we&#8217;re pretty excited about SmartStop.  It will be present in our beta in late January or early February, and we&#8217;ll release a video of it in action as a part of the beta launch.  We hope you&#8217;ll <a href="http://petravm.com/how-jinx-works/jinx-beta">sign up for the beta</a> and let us know what you think of this feature and the rest of Jinx!</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/uncategorized/making-sense-of-crashes-with-smartstop/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Stitch in Logical Time</title>
		<link>http://petravm.com/technology/a-stitch-in-logical-time</link>
		<comments>http://petravm.com/technology/a-stitch-in-logical-time#comments</comments>
		<pubDate>Fri, 18 Dec 2009 19:32:00 +0000</pubDate>
		<dc:creator>pete</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[bug detection]]></category>
		<category><![CDATA[concurrency bugs]]></category>
		<category><![CDATA[jinx]]></category>
		<category><![CDATA[non-determinism]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=148</guid>
		<description><![CDATA[Last time I talked about dealing with non-deterministic bugs plus corruption by either reducing non-determinism or by turning corruptions into failures.   This time I&#8217;ll talk about the machinery that allows Jinx to amplify repro rates for bugs, both crashing and correctness.
First off, how does Jinx find crashing bugs?  It runs simulations of system state in [...]]]></description>
			<content:encoded><![CDATA[<p>Last time I talked about <a href="http://petravm.com/technology/correctness-bugs-and-non-determinism">dealing with non-deterministic bugs plus corruption</a> by either reducing non-determinism or by turning corruptions into failures.   This time I&#8217;ll talk about the machinery that allows Jinx to amplify repro rates for bugs, both crashing and correctness.</p>
<p>First off, how does Jinx find <em>crashing</em> bugs?  It runs simulations of system state in parallel, and chooses interesting ones to &#8220;retire.&#8221;  The following diagram illustrates this process.</p>
<div id="attachment_149" class="wp-caption alignnone" style="width: 507px"><img class="size-full wp-image-149" title="A round of simulation in Jinx" src="http://petravm.com/wp-content/uploads/2009/12/round-diagram.png" alt="round diagram" width="497" height="286" /><p class="wp-caption-text">At time A, Jinx takes a checkpoint, and conducts simulations until finding a bug in a simulation ending at point B.  It chooses this simulation to retire into reality.  This process is completed at C, and normal execution resumes.</p></div>
<p>After taking a checkpoint at <em>A</em>, Jinx starts by conducting a first exploratory simulation, and based on the result of this preliminary simulation conducts zero or more follow-on simulations, <em>scoring</em> these scenarios as they complete.  Our life is straightforward in the case where we find a crash, as in <em>B</em>.  We assign a high score to the crashing simulation, and later on we choose this simulation to retire (our jargon for &#8220;commit&#8221; or &#8220;turn into reality&#8221;).  Retirement completes at <em>C</em>.  Typically we only retire bugs that don&#8217;t happen in the kernel or device drivers&#8230;  most developers don&#8217;t want their boxes bluescreened!</p>
<p>So you run your program once, and Jinx runs parts of it tens or hundreds of times behind the scenes, choosing for retirement the simulations that are most likely to result in bugs.  The performance impact is mitigated by running the simulations in parallel across all the tens (soon hundreds) of mostly idle cores in that teraflop multi-core machine you just bought.</p>
<p>We have a pretty elaborate scoring mechanism to encourage the retirement of bugs (when we retire a bug, we make it <em>really</em> happen, which is what we want).  At the machine layer, INT3, #DB, #BP, #PF, and explicit hypercalls like jinx_assert() can all suggest to Jinx that a simulation is a good thing to replay.  We don&#8217;t get this data for correctness bugs, where the bug isn&#8217;t caught during execution, but pollutes the output with incorrect data.  So, which simulation to make into reality is not so obvious.  However, Jinx is great at making correctness bugs happen more often.  Why?</p>
<p>The first thing to understand is the scales of time that we&#8217;re talking about. If we were to divide time into seconds, then in some one-second intervals both sides of a bug would occur&#8230;  for concurrency errors, you have to have two parties involved, by definition.  So, for example, not only does someone update a structure non-atomically, but someone else chances to read it within the same second.  For structures updated seldom, only in a fraction of the one-second intervals will the potential bug occur.  For the sake of example, let&#8217;s say a bug&#8217;s two parties occur in the same second, in one of every ten seconds.</p>
<div id="attachment_156" class="wp-caption alignnone" style="width: 590px"><img class="size-full wp-image-156" title="Ten seconds of potential bug occurrence" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diag_2.png" alt="logical_time_diag_2" width="580" height="76" /><p class="wp-caption-text">During each one second interval, either one or both parties to a bug may run.  The two parties occur only in second 6 in this example.</p></div>
<p>Now, within the second&#8217;s duration, let&#8217;s say we&#8217;ve committed the most original of atomicity violation sins:  the non-atomic update of radius and area for the canonical circle.</p>
<pre>    float myradius = 1;
    lock();
    circle-&gt;radius = myradius;
    unlock();
    lock();
    circle-&gt;area = PI * myradius * myradius;
    unlock();</pre>
<p>In diagram form, with another thread potentially observing the state of the circle.</p>
<div id="attachment_157" class="wp-caption alignnone" style="width: 520px"><img class="size-full wp-image-157" title="An atomicity violation" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diagram_2.png" alt="logical_time_diagram_2" width="510" height="274" /><p class="wp-caption-text">The corrupter thread introduces an atomicity violation by updating radius and area in separate transactions.  The observer thread is likely to observe the inconsistency if it attempts to take the lock anywhere during the race window.</p></div>
<p>In order for the second thread to observe the problem, it has to begin its attempt to lock the structure during time period A or B.  As the critical section A becomes smaller, the likelihood of the second thread attempting to lock during that period approaches zero.  The more optimized the critical section in this case, the rarer the bug becomes.  There&#8217;s a lower bound on how small the critical section can become (all instructions take time, and a locked instruction takes quite a bit of time).  Let&#8217;s say that the critical section takes 1000 cycles, which is pretty typical for a lock and unlock, and you&#8217;re on a 1 GHz CPU.  That is a one-microsecond window, so if you know that cause and detection occur in the space of one second, you have an approximately one in one-million chance of observing the bug.</p>
<div id="attachment_162" class="wp-caption alignnone" style="width: 520px"><img class="size-full wp-image-162" title="logical_time_diagram_4" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diagram_4.png" alt="The time between the two parties in a bug may be glacial relative to the size of the race window, which is vanishingly small in this one-second example." width="510" height="210" /><p class="wp-caption-text">The time between the two parties in a bug may be glacial in scope relative to the size of the race window, which is vanishingly small in this one-second example.</p></div>
<p>You should expect that a one in one-million bug that could occur once every ten seconds actually does occur at a rate of about one every ten million seconds, or 2700 machine hours.</p>
<p>So, how does Jinx make this better? Remember when I talked about the simulation approach?  Well, after every simulation is done, Jinx creates a graph that looks like this:</p>
<div id="attachment_165" class="wp-caption alignnone" style="width: 520px"><img class="size-full wp-image-165" title="logical_time_diagram_5" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diagram_51.png" alt="When Jinx considers how to reorder events, it only considers reordering A with respect to 1, 2, 3, or 4.  Reordering with respect to 2 or 3 forces the bug to occur." width="510" height="226" /><p class="wp-caption-text">When Jinx considers how to reorder events, it only considers reordering A with respect to 1, 2, 3, or 4.  Reordering with respect to 2 or 3 forces the bug to occur.</p></div>
<p>Or, in a representation which is closer to the way that Jinx sees it:</p>
<div id="attachment_190" class="wp-caption alignnone" style="width: 590px"><img class="size-full wp-image-190" title="logical_time_diagram_6" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diagram_6.png" alt="Jinx ignores all noncommunicating instructions.  What is left is the set of accesses that could have meaningful reorderings.  For example, access A has meaningful reorderings with respect to 1, 2, 3, and 4." width="580" height="130" /><p class="wp-caption-text">Jinx ignores all noncommunicating instructions.  What is left is the set of accesses that could have meaningful reorderings.  For example, access A has meaningful reorderings with respect to 1, 2, 3, and 4.</p></div>
<p>Jinx uses this representation to decide what to change in subsequent simulations.  Jinx wants to reorder memory accesses with respect to each other.  For example, if Jinx decides to reorder memory access A, the only interesting points to reorder with respect to are 1, 2, 3, and 4.  If we rearrange with respect to 1, then the observer runs before the corrupter, so no bug.  If we rearrange with respect to 4, we get a little lock contention but no difference in behavior.  But, if we rearrange with respect to 2 or 3, then we force the rare circumstance to occur.  <strong>Jinx has turned a one-in-a-million event into a two-in-four event!</strong> There are so few cases, we can enumerate all of them:</p>
<div id="attachment_193" class="wp-caption alignnone" style="width: 644px"><img class="size-full wp-image-193" title="logical_time_diagram_7" src="http://petravm.com/wp-content/uploads/2009/12/logical_time_diagram_7.png" alt="logical_time_diagram_7" width="634" height="561" /><p class="wp-caption-text">Of four interleavings suggested by Jinx, two reveal bugs, as they result in accesses to radius and area between the updates to radius and area.</p></div>
<p>This is Jinx&#8217;s notion of communication logical time.  Instead of rearranging events in wallclock time, we rearrange events in logical time.  This cuts out the huge amount of temporal noise involved in hard-to-reproduce bugs.  In this example, this transition to logical time changes a 2700 machine hour repro to a 20 second repro.  Give Jinx a factor of 100 slowdown during simulation, and it&#8217;s still reproducing the bug about 5,000x faster than reality would.  Debugging a corruption bug that occurs in an hour on one machine is a completely different story than debugging one that takes 2700 machine hours.</p>
<p>So by transforming execution into logical communication time, Jinx forces even non-crashing correctness bugs to occur more frequently than they would in the wild, allowing you to find your non-deterministic bugs in development rather than test, and in test rather than deployment.</p>
<p>Next time I&#8217;ll talk about how Jinx makes it easy to understand a bug when it has occurred.</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/technology/a-stitch-in-logical-time/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Correctness Bugs and Non-Determinism</title>
		<link>http://petravm.com/technology/correctness-bugs-and-non-determinism</link>
		<comments>http://petravm.com/technology/correctness-bugs-and-non-determinism#comments</comments>
		<pubDate>Mon, 07 Dec 2009 21:55:56 +0000</pubDate>
		<dc:creator>pete</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[amplification]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[correctness]]></category>
		<category><![CDATA[crashing]]></category>
		<category><![CDATA[jinx]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=62</guid>
		<description><![CDATA[Correctness bugs are more horrible as your software gets more non-deterministic.

These bugs get more insidious when you make the leap to shared-memory parallel programming.  Whereas in the single-threaded, synchronous world, a given input to a program always yielded the same output, you're now faced with programs that may give a correct output only a certain percentage of the time.  You expect 100%, but it's not generally possible to prove that a shared-memory parallel program produces deterministic results.  This causes trouble for crashing bugs because it makes repros rarer, and allows bugs to creep further through the development process before detection.  It also renders debuggers less useful...  You can't just step through until you hit the bug, because sometimes the bug happens, and sometimes it doesn't, and often the debugger stops it from happening.]]></description>
			<content:encoded><![CDATA[<p>An acquaintance recently observed that his company was concerned more with correctness bugs than with crashing bugs.  For them, correctness bugs are more painful to debug than crashing bugs and they also get worse when you make the leap to shared-memory parallel programming (of which threaded programming is one instance).  This is a pretty typical story, and it&#8217;s worth considering why correctness bugs are so problematic.  Of particular interest is why shared-memory parallel programming makes them so much worse.</p>
<h2>Corruption is worse than crashing</h2>
<p>If a program stops and complains, that&#8217;s one thing: an automated system or a human can just start it again.  There might be real-time or reliability goals for the program that aren&#8217;t being met, but at least you know something went wrong.  If it keeps on going, but corrupts the output in some way, there&#8217;s a much bigger problem.  Most software is based on an implicit assumption of the fail-stop model.  This just means that errors in the code stop the program&#8217;s forward progress before erroneous data can become nonvolatile: written to disk or communicated to another process.  You can see examples of applications that make this assumption all the time, when they trash their registry entries or make their tables inconsistent, and consequently can&#8217;t run any more.  At least one study (<a title="study" href="http://www.computer.org/portal/web/csdl/doi/10.1109/TDSC.2007.70208">http://www.computer.org/portal/web/csdl/doi/10.1109/TDSC.2007.70208</a>) finds that 7% of bugs aren&#8217;t fail-stop bugs&#8230; that is, they produce bad outputs rather than crashing or stopping.</p>
<h2>Correctness bugs are hard to fix, hard to detect, and costly</h2>
<p>In my acquaintance&#8217;s domain, correctness bugs result in subtly wrong graphics.  There, the fault tends to get detected, but the error-resolution process is lengthy, as much more of the codebase is suspect.  These bugs are detected later in the development process, because earlier stages lack sufficient verification to detect corruption, whereas crashes are always obvious.  In other domains, correctness bugs could mean incorrect chances of success when drilling a test well, an incorrect diagnosis, a wrongly sequenced chromosome, a corrupted database, poor search results, or an ill-advised financial trade.  In some of the worst cases, humans or machines take the data out of computation and start committing real-world resources to the results of the computation.  Things get really expensive when errors in computation get into nonvolatile storage inside people&#8217;s brains.</p>
<p>Some of these bugs are eventually detected, as when you compute an incorrect launch window for a rocket.  Others, like choosing not to drill a test well in a fruitful spot, will likely never be detected.  The hidden and obvious economic drags of correctness bugs include unhappy customers, missed opportunities, and wasted efforts.</p>
<h2>Correctness bugs are more horrible as your software gets more non-deterministic</h2>
<p>These bugs get more insidious when you make the leap to shared-memory parallel programming.  Whereas in the single-threaded, synchronous world, a given input to a program always yielded the same output, you&#8217;re now faced with programs that may give a correct output only a certain percentage of the time.  You expect 100%, but it&#8217;s not generally possible to prove that a shared-memory parallel program produces deterministic results.  This causes trouble for crashing bugs because it makes repros rarer, and allows bugs to creep further through the development process before detection.  It also renders debuggers less useful&#8230;  You can&#8217;t just step through until you hit the bug, because sometimes the bug happens, and sometimes it doesn&#8217;t, and often the debugger stops it from happening.</p>
<p>The problem is much worse in the case of correctness bugs: both correctness and non-deterministic bugs tend to get detected late in development, and combined, they appear very late.  Users of your code can act upon the results because they&#8217;re not crashing bugs, and the bug can slip through your test because it&#8217;s non-deterministic, even if you have a test case that could have caught the bug!  Input coverage isn&#8217;t enough any more.  How do we deal with this problem?</p>
<h2>Dealing with the Problem</h2>
<div id="attachment_66" class="wp-caption alignnone" style="width: 771px"><img class="size-full wp-image-66  " title="bugdiagram7" src="http://petravm.com/wp-content/uploads/2009/12/bugdiagram7.png" alt="The forces of non-determinism and corruption!" width="761" height="500" /><p class="wp-caption-text">The forces of non-determinism and corruption!</p></div>
<p>There are three big parts to the answer to this problem.  They don&#8217;t make the problem go away.  They just stop your life from becoming a hell of customer escalations and expensive round-trips to QA.</p>
<h3>Choose a simple parallel programming model</h3>
<p>First, the way you use multiple cores will change the amount of non-determinism in your software.  All the parts of your program that can avoid shared-mutable memory should probably avoid it.  For most applications, you&#8217;ll never eliminate non-determinism altogether.  After all, what is input if not an access of shared-mutable state?  But, you can try to guarantee that like inputs yield like outputs by being careful about the sorts of shared-memory parallel programming you do.</p>
<h3>Turn your correctness bugs into crashing bugs</h3>
<p>Next, increase your likelihood of bugs being crashing bugs, rather than correctness bugs.  The trusty assert() is your friend here.  Ship your software with your asserts enabled, and don&#8217;t tolerate failing asserts.  They tell you where the bug was, and what it was.  When you get that bug report from the field, you know exactly what you need to fix.  Programs that need to continue to function in the face of failure can use software fault-tolerance approaches like running every computation twice and comparing results, with an optional tie-breaker.  Some problems are intrinsically easy to verify, like factoring an integer.  Most are not.  During test you can use late-stage output verification to achieve fail-stop.  Make sure everything does this.  By doing this, you&#8217;ll change more bugs into being crashing bugs, rather than correctness bugs.</p>
<h3>Amplify your testing</h3>
<p>Lastly, amplify your non-deterministic bug-find rate.  If you expect to deploy software that will run at a rate of tens of thousands of machine-hours per hour, that is, you have tens of thousands of users, or a few users using thousands of computers, you won&#8217;t match their ability to find non-deterministic bugs in software without amplifying your bug-find rate. This is a potential success disaster:  shipping lots of software will make you realize how buggy it is.</p>
<p>Our goal with Jinx is to let the developer use one computer to test as thoroughly as thousands would, and to let QA engineers make one test pass count for a thousand.  To learn more about what we&#8217;re doing with Jinx, check out the rest of our website!</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/technology/correctness-bugs-and-non-determinism/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New rev of Jinx alpha release</title>
		<link>http://petravm.com/uncategorized/new-rev-of-jinx-alpha-release</link>
		<comments>http://petravm.com/uncategorized/new-rev-of-jinx-alpha-release#comments</comments>
		<pubDate>Tue, 24 Nov 2009 05:44:13 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[jinx]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=58</guid>
		<description><![CDATA[In case any of you alpha testers missed the email notification, we&#8217;ve recently revised the Jinx alpha, addressing several issues that have come up during testing. If you&#8217;d like to know more please email us at support At Petravm.com and we&#8217;ll fill you in on the details.
]]></description>
			<content:encoded><![CDATA[<p>In case any of you alpha testers missed the email notification, we&#8217;ve recently revised the Jinx alpha, addressing several issues that have come up during testing. If you&#8217;d like to know more please email us at support At Petravm.com and we&#8217;ll fill you in on the details.</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/uncategorized/new-rev-of-jinx-alpha-release/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparing Microsoft Cuzz and Jinx</title>
		<link>http://petravm.com/technology/comparing-cuzz-and-jinx</link>
		<comments>http://petravm.com/technology/comparing-cuzz-and-jinx#comments</comments>
		<pubDate>Sat, 21 Nov 2009 00:37:58 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[bug detection]]></category>
		<category><![CDATA[concurrency fuzzing]]></category>
		<category><![CDATA[cuzz]]></category>
		<category><![CDATA[jinx]]></category>
		<category><![CDATA[multi-threading]]></category>
		<category><![CDATA[PDC09]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=41</guid>
		<description><![CDATA[While the intent of the products is similar, Jinx's implementation as a hypervisor device driver and its checkpoint/simulation approach are very different from Cuzz. A significant difference between Jinx and just about everything else is that we operate at the hardware/memory level instead of the threads/API level. If you're thinking of evaluating Cuzz when it's available, <a href="http://petravm.com/">take a look at Jinx</a> in the meantime and let us know what you think.]]></description>
			<content:encoded><![CDATA[<p>First, a big shout out of cool! to the Cuzz team.  Testing multithreaded code is a hard thing to do, and we applaud their efforts and thank them for shining a light on the category. We&#8217;ve gotten a few questions about how we are the same or different from Cuzz, the concurrency debugging tool  that Microsoft recently demonstrated at PDC. You can <a href="http://microsoftpdc.com/Sessions/VTL32">see the Cuzz presentation from PDC here</a>.</p>
<p>While the intent of the products is similar, Jinx&#8217;s implementation as a hypervisor device driver and its checkpoint/simulation approach are very different from Cuzz. We haven&#8217;t gotten our hands on Cuzz yet, so our remarks here are in response to the presentation above, and we look forward to its public release. The <a href="http://petravm.com/jinx">Jinx pre-release product is available to the public</a> with the completion of a short form. Take a look!</p>
<ul>
<li>
<p>A significant difference between Jinx and just about everything else is that we operate at the hardware/memory level instead of the threads/API level. Cuzz instruments synchronization libraries to manipulate thread schedules. In contrast, Jinx naturally handles exotic synchronization libraries such as Intel&#8217;s Threading Building Blocks or custom-implemented wait-free data structures. This means that Jinx easily handles 32-bit native, 64-bit native, and CLR code, and can debug synchronization libraries too!</p></li>
<li>Jinx takes checkpoints and Cuzz doesn&#8217;t.  It appears Cuzz works by periodically introducing pauses into the mainline execution path. One big benefit from checkpoints is deep testing of code regions.  We test a given code region N times, whereas Cuzz tests it once.  This is particularly important for rare code paths.  Interestingly, Microsoft&#8217;s Featherweight race detector relies on a &#8220;cold-region&#8221; hypothesis, which states that concurrency bugs tend to occur in rarely executed code.  Jinx is well-suited to giving these rare regions good coverage.</li>
<li>
<p>Cuzz has less control over target applications.  As a user-mode DLL, Cuzz has limited insight into memory accesses inside the target application.  Cuzz seems limited to inserting pauses at synchronization points.  Jinx can insert pauses at arbitrary points &#8212; in particular, we can interleave at arbitrary memory communication boundaries.  This means we can detect bugs that involve old-fashioned data races &#8212; for example, a multi-threaded program that uses no locks.  It also means we can detect bugs inside synchronization primitives themselves.</li>
<li>Cuzz is limited to a single process.  Jinx can simulate a set of processes or the entire system. This is important because thread interleaving can be affected (and thus bugs be made manifest) by factors outside of your application.</li>
</ul>
<p>If you&#8217;re thinking of evaluating Cuzz when it&#8217;s available, <a href="http://petravm.com/">take a look at Jinx</a> in the meantime and let us know what you think.</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/technology/comparing-cuzz-and-jinx/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Trouble with Happens Before (part 2)</title>
		<link>http://petravm.com/technology/the-trouble-with-happens-before-part-2</link>
		<comments>http://petravm.com/technology/the-trouble-with-happens-before-part-2#comments</comments>
		<pubDate>Mon, 02 Nov 2009 19:07:18 +0000</pubDate>
		<dc:creator>andrew</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Tutorial]]></category>
		<category><![CDATA[concurrency bugs]]></category>
		<category><![CDATA[happens before]]></category>
		<category><![CDATA[helgrind]]></category>
		<category><![CDATA[Intel Thread Checker]]></category>
		<category><![CDATA[jinx]]></category>
		<category><![CDATA[race detectors]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=17</guid>
		<description><![CDATA[Jinx overcomes many of the issues associated with happens-before race detectors.  Jinx’s generalized race detection algorithm means we expose a wider range of concurrency errors.  At the same time, Jinx’s simulation engine hides uninteresting races from the end user. The result? Jinx is more likely to find bugs in your code, more quickly, before you ship your product to the market.]]></description>
			<content:encoded><![CDATA[<p><em>(This is the second part of a two-part post, contrasting PetraVM’s Jinx approach to concurrency bug detection with that used by happens-before code analysis tools such as <a href="http://valgrind.org/docs/manual/hg-manual.html">Helgrind</a>, <a href="http://software.intel.com/en-us/intel-thread-checker/">Intel Thread Checker</a> and others.)</em></p>
<p>Existing race detection tools have difficulty distinguishing benign races from true concurrency errors.  This imposes a burden on the end user, who must manually inspect the output to remove false positives.  Jinx eliminates most false positives by coupling race detection with simulation.  First, Jinx identifies a program race, which may or may not be a bug.   Then, Jinx exercises this race in the background.  If the race is benign, Jinx discards the simulation.  If the program execution results in a crash, Jinx moves the simulation to the foreground to make its effects visible to the user.  Because of its simulation-driven nature, Jinx does not require a separate user interface; bugs simply appear more often on Jinx than they would on an ordinary system.</p>
<p>Internally, Jinx’s simulation engine uses a race detection algorithm in order to choose “interesting” thread schedules, which are likely to be involved in concurrency errors.   Jinx’s race detector differs from happens-before detection in several key respects.  Conventional race detectors suffer from false negatives because they focus on a limited set of program events (reads and writes of ordinary program variables).  Jinx suffers from fewer false negatives because it considers any access to any memory location.  Crucially, this includes synchronization primitives such as locks and atomic variables.  As a result, Jinx will reorder critical sections in their entirety.  This capability is necessary to find important classes of bugs, <a href="http://petravm.com/2009/10/the-trouble-with-happens-before/">such as the increment race we considered in the previous post</a>.</p>
<p>Jinx’s race detection algorithm is so general that it can even find bugs in low-level synchronization primitives such as locks and condition variables.  It can also find bugs in lock-free data structures built using atomic compare-and-swap.  Conventional race detectors cannot find these types of program errors.</p>
<p>Jinx captures a larger set of concurrency errors than traditional data race detectors such as happens-before detectors. Interestingly, Jinx can also reveal more data races than traditional tools.  We do this by leveraging a capability called follow-on simulations.  The output of each simulation is fed back into the race detector, which can then propose additional simulations.</p>
<p>Consider the following program execution (first considered by <a href="http://cseweb.ucsd.edu/~savage/papers/Tocs97.pdf">Savage et al.</a>).  The unsynchronized accesses to the variable y are a data race (and a bug!).  However, a traditional happens-before race detector cannot reveal this error because all accesses to y are totally ordered by the intervening lock operations.  By contrast, Jinx can find this bug by using a sequence of simulations.  First, Jinx would reorder the critical sections (which are captured by the generalized race detector).  Then, Jinx would reorder the accesses to y to expose the race.</p>
<div id="attachment_19" class="wp-caption alignnone" style="width: 310px"><img class="size-medium wp-image-19" title="&quot;Invisible&quot; Concurrency Bug" src="http://petravm.com/wp-content/uploads/2009/11/HB-part-2-300x215.png" alt="Race detectors miss this one!" width="300" height="215" /><p class="wp-caption-text">Race detectors miss this one!</p></div>
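<p>For readers who cannot see the figure, here is a rough sketch of the idea in C. It assumes the execution has the shape of the Savage et al. example: thread 1 touches y and then runs a critical section, while thread 2 runs a critical section and then touches y. The <code>Event</code> type, the trace arrays and <code>y_accesses_ordered</code> are names invented for this illustration. The check covers only the one-lock, two-thread case and is not how Jinx or any real detector is implemented.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_ACCESS_Y, OP_LOCK, OP_UNLOCK } Op;
typedef struct { int thread; Op op; } Event;

/* Returns true if the two threads' accesses to y are ordered by a lock
   handoff: the first accessor unlocks after its access, and the other
   thread locks after that unlock and before its own access to y. */
bool y_accesses_ordered(const Event *trace, size_t n) {
    assert(trace != NULL || n == 0);
    int  first    = -1;     /* thread that accessed y first */
    bool released = false;  /* first accessor unlocked after its access */
    bool acquired = false;  /* other thread locked after that unlock */
    for (size_t i = 0; i < n; i++) {
        const Event *e = &trace[i];
        if (e->op == OP_ACCESS_Y) {
            if (first < 0) first = e->thread;
            else if (e->thread != first) return acquired;
        } else if (first >= 0 && e->thread == first && e->op == OP_UNLOCK) {
            released = true;
        } else if (released && e->thread != first && e->op == OP_LOCK) {
            acquired = true;
        }
    }
    return true;  /* fewer than two threads touched y */
}

/* Observed schedule: thread 1's critical section runs first.
   (Accesses inside the critical sections are omitted.) */
static const Event observed[] = {
    {1, OP_ACCESS_Y}, {1, OP_LOCK}, {1, OP_UNLOCK},
    {2, OP_LOCK}, {2, OP_UNLOCK}, {2, OP_ACCESS_Y},
};

/* Same program after the critical sections are reordered. */
static const Event reordered[] = {
    {2, OP_LOCK}, {2, OP_UNLOCK},
    {1, OP_ACCESS_Y}, {1, OP_LOCK}, {1, OP_UNLOCK},
    {2, OP_ACCESS_Y},
};

bool observed_hides_race(void) {
    return y_accesses_ordered(observed, sizeof observed / sizeof observed[0]);
}

bool reordered_hides_race(void) {
    return y_accesses_ordered(reordered, sizeof reordered / sizeof reordered[0]);
}
```

<p>In the observed schedule the lock handoff orders the two accesses to y, so a happens-before detector stays silent. Once thread 2's critical section runs first, nothing orders the accesses any more and the race on y becomes visible.</p>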
<p>In summary, Jinx overcomes many of the issues associated with happens-before race detectors.  Jinx’s generalized race detection algorithm means we expose a wider range of concurrency errors.  At the same time, Jinx’s simulation engine hides uninteresting races from the end user. The result? Jinx is more likely to find bugs in your code, more quickly, before you ship your product to the market.</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/technology/the-trouble-with-happens-before-part-2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Trouble with Happens Before</title>
		<link>http://petravm.com/tutorial/the-trouble-with-happens-before</link>
		<comments>http://petravm.com/tutorial/the-trouble-with-happens-before#comments</comments>
		<pubDate>Sat, 17 Oct 2009 00:18:42 +0000</pubDate>
		<dc:creator>andrew</dc:creator>
				<category><![CDATA[Tutorial]]></category>

		<guid isPermaLink="false">http://petravm.com/?p=3</guid>
		<description><![CDATA[Programmers have been struggling with concurrency bugs for as long as there have been concurrent programs. Not surprisingly, there have been previous efforts to build tools to simplify this task. One of the oldest and most successful such tools is the happens-before race detector. The basic ideas were fleshed out in the 1980&#8217;s, and there [...]]]></description>
			<content:encoded><![CDATA[<p>Programmers have been struggling with concurrency bugs for as long as there have been concurrent programs. Not surprisingly, there have been previous efforts to build tools to simplify this task. One of the oldest and most successful such tools is the happens-before race detector. The basic ideas were fleshed out in the 1980&#8217;s, and there has been a range of follow-on work in both academia and industry. For example, the open-source tool <a title="Helgrind Documentation" href="http://valgrind.org/docs/manual/hg-manual.html">Helgrind</a> is based on a happens-before race detector. Given their prominence, it is important to understand HB race detectors and how they compare to our work at PetraVM. In this entry, I sketch out how these tools work and describe some of their limitations.</p>
<p>The goal of HB race detectors (and most race detectors) is to reveal a specific class of concurrency errors called <em>data races</em>. A data race occurs when two threads access a shared variable and the following conditions hold:</p>
<ul>
<li>At least one access is a write</li>
<li>No mechanism ensures that the variable is only accessed by one thread at a time.</li>
</ul>
<p>Many basic concurrency errors are data races. An example is when two threads increment a shared variable without using a lock or atomic operation:</p>
<p style="font-family:Courier, serif;">
<p>int x = 0;</p>
<p>void incrementWithDataRace() {</p>
<p>&nbsp;&nbsp;x++;</p>
<p>}</p>
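<p>To see why this is a bug, recall that <code>x++</code> is typically compiled into a load, an add and a store. The sketch below does not use real threads. It replays one unlucky interleaving of those three steps in a single thread, so the outcome is deterministic. <code>SimThread</code> and the <code>step_*</code> helpers are names made up for this illustration.</p>

```c
#include <assert.h>

/* Shared counter, as in the snippet above. */
static int x = 0;

/* Hypothetical stand-in for one thread's private register. */
typedef struct { int reg; } SimThread;

/* x++ broken into the three steps a compiler typically emits. */
static void step_load(SimThread *t)  { t->reg = x; }
static void step_add(SimThread *t)   { t->reg += 1; }
static void step_store(SimThread *t) { x = t->reg; }

/* Unlucky schedule: both threads load x before either one stores. */
int replay_lost_update(void) {
    SimThread a = {0}, b = {0};
    x = 0;
    step_load(&a);
    step_load(&b);      /* B also reads 0 */
    step_add(&a);
    step_add(&b);
    step_store(&a);
    assert(x == 1);     /* A's increment is visible here... */
    step_store(&b);     /* ...and B overwrites it with the same value */
    return x;
}

/* Lucky schedule: A finishes its increment before B starts. */
int replay_serial(void) {
    SimThread a = {0}, b = {0};
    x = 0;
    step_load(&a); step_add(&a); step_store(&a);
    step_load(&b); step_add(&b); step_store(&b);
    return x;
}
```

<p>Two increments ran in both replays, but <code>replay_lost_update()</code> returns 1 while <code>replay_serial()</code> returns 2. Which schedule you get on a real machine is up to the system.</p>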
<p>Before describing how HB detectors work, I need to describe the underlying shared memory execution model they presume. On a uniprocessor, all memory operations occur in a fixed order that is defined by the flow of execution. The first memory operation occurs, then the second, then the third, etc. If the program is re-executed with the same inputs, it will execute the same sequence of memory operations and produce the same output. Life is good.</p>
<p>On a multiprocessor, life is not so simple. The underlying system (the hardware, OS, and language runtime) is free to schedule threads/processors in any order it sees fit. If the program is re-executed, it can (and typically will) run with a different thread schedule, potentially producing a different result. Even worse, different threads may disagree on the order in which memory operations occurred (the exact details depend on the underlying memory consistency model, which is a topic beyond the scope of this post. Stay tuned!). This multiplicity of different outcomes is what makes parallel programming so demanding and error-prone.</p>
<p>If memory operations could occur in any order, then programmers would be helpless to write multi-threaded programs that shared data in any way. Fortunately, there is a lifeline: synchronization operations such as locks and condition variables place constraints on the set of permissible memory reorderings. In the academic lingo, they say that one synchronization operation <em>happens-before</em> another. This means that both threads agree on the order these operations occurred, and that the underlying system will not reorder memory accesses across this pair of operations.</p>
<p>Consider the simple example shown below. In this case, two threads correctly increment a shared variable using a lock. We say that the unlock operation happens-before any subsequent lock operation. Crucially, the happens-before relation is transitive, so all operations in thread A before the unlock happen-before all operations in thread B after the lock. In this case, this means that thread A&#8217;s increment of x happens-before thread B&#8217;s increment of x.</p>
<p><img title="Happens Before Race Detectors" src="http://petravm.com/wp-content/uploads/2009/10/HappensBefore1-300x177.png" alt="Happens Before Race Detectors" width="300" height="177" /></p>
<p>Given these concepts, it is straightforward to describe the behavior of HB race detectors. Suppose two threads access a shared variable, and one of those accesses is a write. In a correct program, all such accesses should be totally ordered by the happens-before relation. A data race occurs when neither of those accesses happens-before the other. In the example above, we could create a data race by removing the lock operations. In fact, it would suffice to remove the lock operations from one of the two threads.</p>
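<p>One common way to track the happens-before relation is with vector clocks. The following is a minimal sketch of that technique, not a description of how Helgrind or any other tool is built. It assumes two threads, one lock and one shared variable x, and it treats each increment as a single write. All of the names (<code>Detector</code>, <code>det_lock</code>, <code>replay_has_race</code> and so on) are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NTHREADS 2

typedef struct { int c[NTHREADS]; } VClock;

typedef struct {
    VClock thread[NTHREADS]; /* current clock of each thread */
    VClock lock;             /* clock published by the last unlock */
    VClock last_write;       /* clock of the last write to x */
    bool   written;          /* has x been written yet? */
    bool   race;             /* set once an unordered write is seen */
} Detector;

/* dst = component-wise maximum of dst and src */
static void vc_join(VClock *dst, const VClock *src) {
    for (int i = 0; i < NTHREADS; i++)
        if (src->c[i] > dst->c[i]) dst->c[i] = src->c[i];
}

/* a happens-before (or equals) b iff a <= b in every component */
static bool vc_leq(const VClock *a, const VClock *b) {
    for (int i = 0; i < NTHREADS; i++)
        if (a->c[i] > b->c[i]) return false;
    return true;
}

void det_init(Detector *d) {
    memset(d, 0, sizeof *d);
    for (int t = 0; t < NTHREADS; t++)
        d->thread[t].c[t] = 1;
}

/* Acquiring the lock imports everything the last unlocker had seen. */
void det_lock(Detector *d, int t) {
    assert(t >= 0 && t < NTHREADS);
    vc_join(&d->thread[t], &d->lock);
}

/* Releasing the lock publishes this thread's clock, then advances it. */
void det_unlock(Detector *d, int t) {
    assert(t >= 0 && t < NTHREADS);
    d->lock = d->thread[t];
    d->thread[t].c[t]++;
}

/* A write races if the previous write does not happen-before it. */
void det_write(Detector *d, int t) {
    assert(t >= 0 && t < NTHREADS);
    if (d->written && !vc_leq(&d->last_write, &d->thread[t]))
        d->race = true;
    d->last_write = d->thread[t];
    d->written = true;
}

/* Replays "thread A increments x, then thread B increments x". */
bool replay_has_race(bool a_locks, bool b_locks) {
    Detector d;
    det_init(&d);
    bool uses_lock[NTHREADS] = { a_locks, b_locks };
    for (int t = 0; t < NTHREADS; t++) {
        if (uses_lock[t]) det_lock(&d, t);
        det_write(&d, t);
        if (uses_lock[t]) det_unlock(&d, t);
    }
    return d.race;
}
```

<p>With both threads locking, <code>replay_has_race(true, true)</code> reports no race. Dropping the lock from either thread, or from both, makes the second write unordered with respect to the first, and the sketch flags it.</p>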
<p>As this example suggests, HB detectors are a powerful technology for detecting certain classes of concurrency errors. However, they suffer from two basic problems. First, not all data races are concurrency bugs; we refer to these as false positives. Second, not all concurrency bugs are data races; we refer to these as false negatives. As a result, an HB race detector is both incomplete and inaccurate in its ability to detect the complete spectrum of concurrency errors. These concepts are illustrated in the Venn diagram below.</p>
<p><img class="size-medium wp-image-7 " title="Venn Diagram View of False Positives" src="http://petravm.com/wp-content/uploads/2009/10/HappensBefore2-300x205.png" alt="Venn Diagram View of False Positives" width="300" height="205" /></p>
<p>The code snippet below shows a simple concurrency bug that is not a data race. In this example, the programmer is incrementing a counter variable that is accessed by multiple threads. The programmer is incorrectly using a temporary variable to make modifications to the shared variable. The program will produce an incorrect result if a thread is preempted between the two critical sections. However, this bug will not be found by an HB race detector, or any tool that focuses exclusively on data races. This is because all accesses to shared, mutable state are properly protected by a lock.</p>
<p style="font-family:Courier, serif;">int x = 0;</p>
<p>void incrementWithoutDataRace() {</p>
<p>&nbsp;&nbsp;int temp;</p>
<p style="font-family:Courier, serif;">
<p>&nbsp;&nbsp;lock(L);</p>
<p>&nbsp;&nbsp;temp = x;</p>
<p>&nbsp;&nbsp;unlock(L);</p>
<p style="font-family:Courier, serif;">
<p>&nbsp;&nbsp;temp++;</p>
<p style="font-family:Courier, serif;">
<p>&nbsp;&nbsp;lock(L);</p>
<p>&nbsp;&nbsp;x = temp;</p>
<p>&nbsp;&nbsp;unlock(L);</p>
<p>}</p>
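<p>The preemption scenario can be replayed deterministically. The sketch below is a single-threaded simulation under simplifying assumptions: the lock L is modeled as a boolean, and its assertions check that no two critical sections ever overlap. The helper names are invented for this illustration. It also shows one possible repair, which keeps the read, the increment and the write inside a single critical section.</p>

```c
#include <assert.h>
#include <stdbool.h>

static int  x = 0;
static bool L_held = false;   /* toy model of lock L, not a real mutex */

static void lock_L(void)   { assert(!L_held); L_held = true;  }
static void unlock_L(void) { assert(L_held);  L_held = false; }

/* First critical section of incrementWithoutDataRace(): read x. */
static int first_section(void) {
    lock_L();
    int temp = x;
    unlock_L();
    return temp;
}

/* Second critical section: write the updated value back. */
static void second_section(int temp) {
    lock_L();
    x = temp;
    unlock_L();
}

/* Thread A is preempted between its two critical sections. */
int replay_preempted(void) {
    x = 0;
    int temp_a = first_section();
    int temp_b = first_section();  /* B runs while A is preempted */
    temp_a++;
    temp_b++;
    second_section(temp_a);
    second_section(temp_b);        /* overwrites A's update */
    return x;
}

/* Possible repair: the whole read-modify-write is one critical section. */
int replay_single_section(void) {
    x = 0;
    for (int thread = 0; thread < 2; thread++) {
        lock_L();
        x++;
        unlock_L();
    }
    return x;
}
```

<p>Every access to x happens with the lock held, so there is no data race to report. Even so, <code>replay_preempted()</code> returns 1 after two increments, while <code>replay_single_section()</code> returns 2.</p>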
<p>What about the other side of the coin? Are there benign data races that do not result in incorrect program behavior? Yes. One example is the use of a flag variable. A common paradigm is to construct a pool of worker threads that periodically poll a flag variable to determine whether they should terminate. Clearly, such a flag variable is shared, mutable state. However, flag variables are typically not locked, yet this does not result in incorrect program behavior. Programs written in this fashion are immune to such minor data races.</p>
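<p>Here is a small sketch of the polling pattern, with hypothetical names (<code>should_stop</code>, <code>request_stop</code>, <code>run_worker</code>). It departs from the plain flag described above in one respect: it uses a C11 relaxed atomic. This still takes no lock, but it stops the compiler from hoisting the load out of the loop, which it is allowed to do with an ordinary variable.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Termination flag polled by the workers; never protected by a lock. */
static atomic_bool should_stop = false;

/* Called by whichever thread decides the pool should shut down. */
void request_stop(void) {
    atomic_store_explicit(&should_stop, true, memory_order_relaxed);
}

/* Worker loop: checks the flag before each unit of work and
   returns the number of units it completed. */
int run_worker(int max_units) {
    assert(max_units >= 0);
    int done = 0;
    while (done < max_units &&
           !atomic_load_explicit(&should_stop, memory_order_relaxed)) {
        done++;  /* stand-in for one unit of real work */
    }
    return done;
}
```

<p>A worker that sees the flag one iteration late simply does one more unit of work before it exits, which is why this kind of race is usually harmless.</p>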
<p>So, HB race detectors are not a perfect solution for detecting concurrency errors. In a follow-up post, I will discuss some of the technology we are developing at PetraVM that addresses these limitations.</p>
]]></content:encoded>
			<wfw:commentRss>http://petravm.com/tutorial/the-trouble-with-happens-before/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
