[Xquartz-dev] Rootless code debugging notes

Ben Byer bbyer at apple.com
Fri Dec 7 21:37:33 PST 2007


Ken Thomases of CodeWeavers (who makes an excellent product called  
CrossOver Mac, which you should all go try) and I have been mailing  
back and forth for months now, having a fairly sparse (time-wise)  
conversation about debugging the Rootless subsystem.  It resulted in  
several small patches, but we obviously still have far to go.

He's given me permission to collect some of the bits and post them  
here -- hopefully they'll inspire one of you!

I'm going to order this "backwards", with the most recent stuff at the  
top, because in many cases it supersedes the older stuff.  However,  
there's still a lot of knowledge there.

> I remembered something else that I fixed in the X server.  I could  
> have sworn that Torrey applied this suggestion way back around  
> 6.9.x, though.  On the other hand, I don't see the fixes in the  
> current git branch.
>
> The code in miext/rootless/accel is essentially a copy+paste+hack  
> job of similar code in the fb directory.  However, the fb code  
> continued to be updated after the time of the copy, and fixes there  
> didn't make it into the rootless code.  The classic problem with  
> code duplication.  This is an excellent candidate for consolidation,  
> but I have to admit I took the coward's way out and just updated the  
> rootless/accel code to incorporate the fixes to the corresponding fb  
> code.

[http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=70374a58937d7a6f01c210bd6ac66cafb63e895a 
]
[http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=3f857e129df7ce492191e0c51b8e53eaf6179366 
]

>> I'll apply it (Patch to avert (some) damage / rootless crashes,  
>> courte ... xorg-server-1.2-apple), too, but I'm still getting  
>> rootless crashes just like ones identified below -- for example,
>>
>> Program received signal EXC_BAD_ACCESS, Could not access memory.
>> Reason: KERN_INVALID_ADDRESS at address: 0xb09a7ffe
>> [Switching to process 64128 thread 0x2d03]
>> 0x000ba460 in fbSolid (dst=0xb09a7ffc, dstStride=1, dstX=16,  
>> bpp=16, width=4560, height=1, and=0, xor=1523997398) at fbsolid.c:67
>> 67		    FbDoLeftMaskByteRRop(dst,startbyte,startmask,and,xor);
>> (gdb) bt
>> #0  0x000ba460 in fbSolid (dst=0xb09a7ffc, dstStride=1, dstX=16,  
>> bpp=16, width=4560, height=1, and=0, xor=1523997398) at fbsolid.c:67
>> #1  0x000bcb96 in fbFillRegionSolid (pDrawable=0x4f8b90,  
>> pRegion=0x13110280, and=0, xor=1523997398) at fbwindow.c:247
>> #2  0x0002693d in SafeAlphaPaintWindow (pWin=0x4f8b90,  
>> pRegion=0x13110280, what=0) at safeAlphaWindow.c:148
>> #3  0x00106f1a in damagePaintWindow (pWindow=0x4f8b90,  
>> prgn=0x13110280, what=0) at damage.c:1593
>> #4  0x00025a88 in RootlessPaintWindowBackground (pWin=0x4f8b90,  
>> pRegion=0x13110280, what=0) at rootlessWindow.c:1542
>> #5  0x00035655 in miWindowExposures (pWin=0x4f8b90,  
>> prgn=0x13110280, other_exposed=0x0) at miexpose.c:565
>> #6  0x000164ac in DRIWindowExposures (pWin=0x4f8b90,  
>> prgn=0xb09a7ffe, bsreg=0xb09a7ffe) at dri.c:609
>> #7  0x0004922c in miHandleValidateExposures (pWin=0x428470) at  
>> miwindow.c:470
>> #8  0x0008a9a6 in MapWindow (pWin=0x4f67d0, client=0x4f10d0) at  
>> window.c:2803
>> #9  0x000688b9 in ProcMapWindow (client=0x1) at dispatch.c:696
>> #10 0x0006f5b5 in Dispatch () at dispatch.c:457
>> #11 0x0007e001 in main (argc=2, argv=0xbffff4f8, envp=0xbffff504)  
>> at main.c:445
>
> Hmm.  I had thought I had caught all of the remaining crashes.  It's  
> of course possible that new bugs have been introduced between the  
> code base that I'm working with and X.org 7.2.  The most likely  
> candidate would be someplace where a function is unwrapped on an  
> object (a screen, a window, a graphics context, etc.) and there's a  
> code path that fails to rewrap it when it should.  This is what I  
> fixed in the earlier commit <http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=38965768636aef5c19466a4ffa460d318d3bfa6f 
> >.  A thorough grepping of the code can find places where something  
> is unwrapped, and you can then examine each place to make sure the  
> proper rewrap happens in every case.
>
> Something that can be useful when debugging what's going on is  
> malloc_history and the MallocStackLoggingNoCompact environment  
> variable.  You may find Xray (ahem, "Instruments" ;) can also help.
>
> When I was debugging, I actually did something I thought was kind of  
> neat.  It's pretty common to litter code with printfs (or ErrorF) to  
> debug.  Well, I did that and included the pointer address of each  
> interesting object at each point -- when a PixMap was freed, I  
> printed the address of the pixmap object, for example.  What was  
> neat though, is that I also did a small allocation and included the  
> address of that, and then leaked the allocated buffer.  For example:
>
> 	ErrorF("Did something interesting with pixmap %p.  %p\n", pixmap,  
> malloc(1));
>
> Then, as I examined the flood of logging messages, when I found a  
> suspicious operation, I would use malloc_history on the address of  
> the small leaked buffer.  Since I was leaking it, then the last  
> entry in the history had to be for the log line in question, and  
> malloc_history could provide a stack trace for where/when that  
> allocation happened.  That gave me invaluable insight into what the  
> code was doing.
>
> It's quite possible that DTrace is powerful enough, though, that  
> this technique can be considered "archaic".  :)


... and this next bit is excerpted from an email thread between him  
and Torrey Lyons that he sent me.

>>>>> We're experiencing a significant number of crashes when we  
>>>>> resize windows.  They're all within the Damage extension.  I  
>>>>> found bug 1032 in X.org's bugzilla <https://bugs.freedesktop.org/show_bug.cgi?id=1032 
>>>>> >, but haven't seen recent activity.
>>>>>
>>>>> That bug also includes a note claiming that Damage is disabled  
>>>>> for rootless servers, but I find that that isn't true.  In fact,  
>>>>> I attempted to disable the extension but found it impossible due  
>>>>> to the cursor code requiring it.
>>>>
>>>> Hmm. The comment on bug <https://bugs.freedesktop.org/show_bug.cgi?id=1032 
>>>> > from ajax at nwnk.net is incorrect as I recall and as you noted.  
>>>> Damage is used. My understanding was that all crashes related to  
>>>> damage were fixed. The bug was left open to remind me to fix a  
>>>> potential incorrect behavior in CopyWindow. However, I had never  
>>>> actually observed any problems related to this. You may have  
>>>> stumbled upon one. Can you please, try the following:
>>>>
>>>> 1. Rebuild X.Org with the debug flag on so you get a more  
>>>> complete backtrace when the crash occurs. You can do this by  
>>>> inserting a host.def file into /xc/config/cf with the line:
>>>> #define DefaultCDebugFlags              -g
>>>>
>>>> 2. Duplicate the crash and send me the backtrace. You should be  
>>>> able to get the XDarwin backtrace from Console.app. You can also  
>>>> post it on the above mentioned bug.
>>> Thanks for your reply.
>>>
>>> I have attached crash reports showing two crashes related to the  
>>> Damage extension.
>>>
>>> Here are some gdb backtraces I just generated:
>>>
>>> -------------
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 5606 thread 0x4603]
>>> (gdb) bt
>>> #0  0x01f45dd9 in rlBlt (srcLine=0x109a4838, srcStride=512,  
>>> srcX=0, pDstScreen=0x1979590, dstLine=0x294, dstStride=1, dstX=29,  
>>> width=271, height=326, alu=3, pm=4294967295, bpp=1, reverse=0,  
>>> upsidedown=0) at rlBlt.c:327
>>> #1  0x01f445ff in rlCopyNtoN (pSrcDrawable=0x109a4000,  
>>> pDstDrawable=0x10094dc0, pGC=0x1008f010, pbox=0xb009dc78, nbox=0,  
>>> dx=-573, dy=-147, reverse=0, upsidedown=0, bitplane=0,  
>>> closure=0x0) at rlCopy.c:60
>>> #2  0x001aeec5 in fbCopyRegion (pSrcDrawable=0x109a4000,  
>>> pDstDrawable=0x10094dc0, pGC=0x1008f010, pDstRegion=0xb009dc78,  
>>> dx=-573, dy=-147, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0,  
>>> closure=0x0) at fbcopy.c:366
>>> #3  0x001af38f in fbDoCopy (pSrcDrawable=0x109a4000,  
>>> pDstDrawable=0x10094dc0, pGC=0x1008f010, xIn=0, yIn=0,  
>>> widthSrc=271, heightSrc=468, xOut=573, yOut=147,  
>>> copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0, closure=0x0) at  
>>> fbcopy.c:566
>>> #4  0x01f446ad in rlCopyArea (pSrcDrawable=0x109a4000,  
>>> pDstDrawable=0x10094dc0, pGC=0x1008f010, xIn=0, yIn=0,  
>>> widthSrc=271, heightSrc=468, xOut=569, yOut=124) at rlCopy.c:101
>>> #5  0x000dead0 in damageCopyArea (pSrc=0x109a4000,  
>>> pDst=0x10094dc0, pGC=0x1008f010, srcx=0, srcy=0, width=271,  
>>> height=468, dstx=569, dsty=124) at damage.c:762
>>> #6  0x000bfecb in ProcCopyArea (client=0x100886d0) at dispatch.c: 
>>> 1758
>>> #7  0x000bc943 in Dispatch () at dispatch.c:455
>>> #8  0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4)  
>>> at main.c:451
>>> -------------
>>>
>>> -------------
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 5906 thread 0x4603]
>>> (gdb) bt
>>> #0  0x9035ab8e in sseCGSFill8by1 ()
>>> #1  0x9a73c9a6 in xp_fill_bytes ()
>>> #2  0x01f46b85 in rlSolid (pScreen=0x1979a00, dst=0x209f58,  
>>> dstStride=1280, dstX=0, bpp=32, width=8320, height=2, and=0,  
>>> xor=15000804) at rlSolid.c:77
>>> #3  0x01f4615f in rlFill (pDrawable=0xfc8ecd0, pGC=0xfc8b870,  
>>> x=726, y=417, width=260, height=2) at rlFill.c:49
>>> #4  0x01f44e32 in rlPolyFillRect (pDrawable=0xfc8ecd0,  
>>> pGC=0xfc8b870, nrect=0, prect=0x108c806c) at rlFillRect.c:79
>>> #5  0x000e0858 in damagePolyFillRect (pDrawable=0xfc8ecd0,  
>>> pGC=0xfc8b870, nRects=5, pRects=0x108c8044) at damage.c:1205
>>> #6  0x000c1339 in ProcPolyFillRectangle (client=0xfc7b6f0) at  
>>> dispatch.c:1978
>>> #7  0x000bc943 in Dispatch () at dispatch.c:455
>>> #8  0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4)  
>>> at main.c:451
>>> -------------
>>>
>>> This next one is most reminiscent of the known Damage bug, but I  
>>> suspect the ones above are related:
>>> -------------
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 6021 thread 0x4603]
>>> (gdb) bt
>>> #0  0x000de948 in damageCopyArea (pSrc=0x10877000, pDst=0xfba1830,  
>>> pGC=0xfba6a50, srcx=0, srcy=0, width=628, height=26, dstx=0,  
>>> dsty=51) at damage.c:748
>>> #1  0x000bfecb in ProcCopyArea (client=0xfb986a0) at dispatch.c:1758
>>> #2  0x000bc943 in Dispatch () at dispatch.c:455
>>> #3  0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4)  
>>> at main.c:451
>>> -------------
>>> As you probably remember, what's happening is that  
>>> getDrawableDamageRef is returning a bogus pointer which is  
>>> dereferenced by the getDrawableDamage macro (used by  
>>> checkGCDamage).  getDrawableDamageRef returns a bogus pointer  
>>> because it uses the (*pScreen->GetWindowPixmap) function pointer,  
>>> which is _fbGetWindowPixmap in this case.  That devolves down to  
>>> the macro fbGetWindowPixmap(pWin) which is defined as ((PixmapPtr)  
>>> ((WindowPtr) (pWin))->devPrivates[fbGetWinPrivateIndex()].ptr).   
>>> And finally we get to the immediate problem: there is no pointer  
>>> stashed in that private index entry.  However, what I don't  
>>> understand is why that private index entry hasn't been  
>>> initialized.  That's where I get lost in the maze that is X11.


[...]
>>> t took me a while to get around to it, but I have tested with the  
>>> 6.9 tip and am still able to reproduce the crashes.  Here are some  
>>> gdb backtraces of a few of the crashes:
>>>
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 16351 thread 0x4607]
>>> (gdb) bt
>>> #0  0x001b107b in fbEvenTile (dst=0x6cb60, dstStride=900, dstX=0,  
>>> width=5504, height=22, tile=0x103ee2b8, tileHeight=26, alu=3,  
>>> pm=4294967295, xRot=28832, yRot=-3) at fbtile.c:94
>>> #1  0x001b130e in fbTile (dst=0x6bd40, dstStride=1072, dstX=28928,  
>>> width=5504, height=23, tile=0x103ee2b8, tileStride=1,  
>>> tileWidth=32, tileHeight=26, alu=3, pm=4294967295, bpp=32,  
>>> xRot=28832, yRot=-3) at fbtile.c:193
>>> #2  0x01f466f1 in rlFill (pDrawable=0x103981a0, pGC=0x103a1850,  
>>> x=904, y=103, width=172, height=23) at rlFill.c:142
>>> #3  0x01f44f11 in rlPolyFillRect (pDrawable=0x103981a0,  
>>> pGC=0x103a1850, nrect=0, prect=0x107c50cc) at rlFillRect.c:108
>>> #4  0x000e1714 in damagePolyFillRect (pDrawable=0x103981a0,  
>>> pGC=0x103a1850, nRects=1, pRects=0x107c50c4) at damage.c:1205
>>> #5  0x000c21f5 in ProcPolyFillRectangle (client=0x10399130) at  
>>> dispatch.c:1978
>>> #6  0x000bd7ff in Dispatch () at dispatch.c:455
>>> #7  0x0002e469 in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4)  
>>> at main.c:451
>>>
>>> --------------------
>>>
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 16734 thread 0x4607]
>>> (gdb) bt full
>>> #0  0x000df804 in damageCopyArea (pSrc=0x1099c000,  
>>> pDst=0x10094b00, pGC=0x100947e0, srcx=0, srcy=0, width=628,  
>>> height=26, dstx=0, dsty=51) at damage.c:748
>>> 	ret = (RegionPtr) 0x100947e0
>>> 	pGCPriv = (DamageGCPrivPtr) 0x10094868
>>> 	oldFuncs = (GCFuncs *) 0x23a120
>>> #1  0x000c0d87 in ProcCopyArea (client=0x1008d330) at dispatch.c: 
>>> 1758
>>> 	pDst = (DrawablePtr) 0x10094b00
>>> 	pSrc = (DrawablePtr) 0x1099c000
>>> 	pGC = (GC *) 0x100947e0
>>> 	stuff = (xCopyAreaReq *) 0x1094c3b4
>>> 	pRgn = (RegionPtr) 0x1
>>> #2  0x000bd7ff in Dispatch () at dispatch.c:455
>>> 	clientReady = (int *) 0x206ae00
>>> 	result = 28
>>> 	client = (ClientPtr) 0x1008d330
>>> 	nready = 0
>>> 	icheck = (HWEventQueuePtr *) 0x27d2b0
>>> 	start_tick = 11920
>>> #3  0x0002e469 in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4)  
>>> at main.c:451
>>> 	i = 1
>>> 	j = 2
>>> 	k = 2
>>> 	error = 0
>>> 	xauthfile = 0x0
>>> 	alwaysCheckForInput = {0, 1}
>>> 	adjustedRgbPath = "/Users/ken/work/cxmac/macosx/CrossOver/build/ 
>>> Debug/CrossOver.app/Contents/SharedSupport/X11/lib/X11/rgb", '\0'  
>>> <repeats 920 times>
>>> (gdb) p pDst->pScreen->GetWindowPixmap
>>> $1 = (GetWindowPixmapProcPtr) 0x1573f9 <_fbGetWindowPixmap>
>>> (gdb) p ((PixmapPtr) ((WindowPtr) (pDst))- 
>>> >devPrivates[fbWinPrivateIndex].ptr)
>>> $2 = (struct _Pixmap *) 0x100bcd10
>>> (gdb) p *((PixmapPtr) ((WindowPtr) (pDst))- 
>>> >devPrivates[fbWinPrivateIndex].ptr)
>>> $3 = {
>>> drawable = {
>>>   type = 32 ' ',
>>>   class = 157 '\235',
>>>   depth = 128 '\200',
>>>   bitsPerPixel = 160 '\240',
>>>   id = 271240,
>>>   x = 0,
>>>   y = 0,
>>>   width = 58736,
>>>   height = 400,
>>>   pScreen = 0x3,
>>>   serialNumber = 6
>>> },
>>> refcnt = 600000,
>>> devKind = -1824067958,
>>> devPrivate = {
>>>   ptr = 0x50797472,
>>>   val = 1350136946,
>>>   uval = 1350136946,
>>>   fptr = 0x50797472
>>> },
>>> devPrivates = 0xea4d230,
>>> screen_x = 0,
>>> screen_y = 0
>>> }
>>> (gdb) p damagePixPrivateIndex
>>> $4 = 0
>>> (gdb) p $2->devPrivates[damagePixPrivateIndex].ptr
>>> $5 = (pointer) 0xa0801fa8
>>> (gdb) p (DamagePtr*) &($2->devPrivates[damagePixPrivateIndex].ptr)
>>> $6 = (DamagePtr *) 0x4f0074
>>> (gdb) p **$6
>>> $9 = {
>>> pNext = 0x18,
>>> pNextWin = 0x1440000,
>>> damage = {
>>>   extents = {
>>>     x1 = 242,
>>>     y1 = -17399,
>>>     x2 = 24,
>>>     y2 = 0
>>>   },
>>>   data = 0x1440000
>>> },
>>> damageLevel = 3155558643,
>>> isInternal = 24,
>>> closure = 0x1440000,
>>> isWindow = -1138687756,
>>> pDrawable = 0x18,
>>> damageReport = 0x1440000,
>>> damageDestroy = 0xbc3900f6
>>> }
>>>
>>> --------------------
>>>
>>> Program received signal:  "EXC_BAD_ACCESS".
>>> [Switching to process 17098 thread 0x4607]
>>> (gdb) bt full
>>> #0  0x90385a1a in CGBlt_copyBytes ()
>>> No symbol table info available.
>>> #1  0x9a73c8f3 in xp_copy_bytes ()
>>> No symbol table info available.
>>> #2  0x01f452b6 in rlBlt (srcLine=0x107c1038, srcStride=628,  
>>> srcX=0, pDstScreen=0x197b5d0, dstLine=0x7fffffff, dstStride=0,  
>>> dstX=512, width=80384, height=1, alu=3, pm=4294967295, bpp=128,  
>>> reverse=0, upsidedown=0) at rlBlt.c:93
>>> 	src = (FbBits *) 0x102e1fd0
>>> 	dst = (FbBits *) 0x102e8ad0
>>> 	leftShift = 32725726
>>> 	rightShift = -1838100790
>>> 	startmask = 0
>>> 	endmask = 0
>>> 	bits = 2456866803
>>> 	bits1 = 0
>>> 	n = -1
>>> 	nmiddle = 2512
>>> 	destInvarient = 1
>>> 	startbyte = 0
>>> 	endbyte = 0
>>> 	_ca1 = 0
>>> 	_cx1 = 0
>>> 	_ca2 = 4294967295
>>> 	_cx2 = 0
>>> #3  0x01f445ff in rlCopyNtoN (pSrcDrawable=0x107c1000,  
>>> pDstDrawable=0x10291680, pGC=0x10290010, pbox=0x102e8608, nbox=2,  
>>> dx=-4, dy=-74, reverse=0, upsidedown=0, bitplane=0, closure=0x0)  
>>> at rlCopy.c:60
>>> 	alu = 3 '\003'
>>> 	pm = 4294967295
>>> 	src = (FbBits *) 0x107c1038
>>> 	srcStride = 628
>>> 	srcBpp = 32
>>> 	srcXoff = 0
>>> 	srcYoff = 0
>>> 	dst = (FbBits *) 0x7fffffff
>>> 	dstStride = 0
>>> 	dstBpp = 128
>>> 	dstXoff = 0
>>> 	dstYoff = 0
>>> #4  0x001afd81 in fbCopyRegion (pSrcDrawable=0x107c1000,  
>>> pDstDrawable=0x10291680, pGC=0x10290010, pDstRegion=0xb009dc78,  
>>> dx=-4, dy=-74, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0,  
>>> closure=0x0) at fbcopy.c:366
>>> 	careful = 0
>>> 	reverse = 0
>>> 	upsidedown = 0
>>> 	pbox = (BoxPtr) 0x102e8608
>>> 	nbox = 3
>>> 	pboxNew1 = (BoxPtr) 0x0
>>> 	pboxNew2 = (BoxPtr) 0x0
>>> 	pboxBase = (BoxPtr) 0x102b7d30
>>> 	pboxNext = (BoxPtr) 0x197b5d0
>>> 	pboxTmp = (BoxPtr) 0x1971020
>>> #5  0x001b024b in fbDoCopy (pSrcDrawable=0x107c1000,  
>>> pDstDrawable=0x10291680, pGC=0x10290010, xIn=0, yIn=0,  
>>> widthSrc=628, heightSrc=26, xOut=4, yOut=74, copyProc=0x1f443cf  
>>> <rlCopyNtoN>, bitPlane=0, closure=0x0) at fbcopy.c:566
>>> 	prgnSrcClip = (RegionPtr) 0x0
>>> 	freeSrcClip = 0
>>> 	prgnExposed = (RegionPtr) 0x0
>>> 	rgnDst = {
>>> extents = {
>>>   x1 = 4,
>>>   y1 = 74,
>>>   x2 = 632,
>>>   y2 = 100
>>> },
>>> data = 0x102e8600
>>> }
>>> 	dx = -4
>>> 	dy = -74
>>> 	numRects = 3
>>> 	box_x1 = 4
>>> 	box_y1 = 74
>>> 	box_x2 = 632
>>> 	box_y2 = 100
>>> 	fastSrc = 1
>>> 	fastDst = 0
>>> 	fastExpose = 1
>>> #6  0x01f446ad in rlCopyArea (pSrcDrawable=0x107c1000,  
>>> pDstDrawable=0x10291680, pGC=0x10290010, xIn=0, yIn=0,  
>>> widthSrc=628, heightSrc=26, xOut=0, yOut=51) at rlCopy.c:101
>>> 	copy = (fbCopyProc) 0x1f443cf <rlCopyNtoN>
>>> #7  0x000df98c in damageCopyArea (pSrc=0x107c1000,  
>>> pDst=0x10291680, pGC=0x10290010, srcx=0, srcy=0, width=628,  
>>> height=26, dstx=0, dsty=51) at damage.c:762
>>> 	ret = (RegionPtr) 0x10290010
>>> 	pGCPriv = (DamageGCPrivPtr) 0x10290098
>>> 	oldFuncs = (GCFuncs *) 0x23a120
>>> #8  0x000c0d87 in ProcCopyArea (client=0x1028a090) at dispatch.c: 
>>> 1758
>>> 	pDst = (DrawablePtr) 0x10291680
>>> 	pSrc = (DrawablePtr) 0x107c1000
>>> 	pGC = (GC *) 0x10290010
>>> 	stuff = (xCopyAreaReq *) 0x1073b3b4
>>> 	pRgn = (RegionPtr) 0x1
>>> [...]
>>>
>>> -----------------
>>>
>>> Do you know what's going wrong and how to fix it?
>>>
>>> If there's some diagnostic steps I could take that would be  
>>> informative, please let me know.
[...]
>>> So, I have some primary suspects.  I think it's the handling of  
>>> the window pixmaps which is done by RootlessStartDrawing and  
>>> RootlessStopDrawing.  Basically, RootlessStopDrawing is sometimes  
>>> restoring a stale or deallocated pixmap to the window.
>>>
>>> Here are a couple of scenarios to illustrate the issue:
>>>
>>> Scenario 1: Calling RootlessStartDrawing twice on a window
>>> Assume we have Window A.  Its top-level parent is T.   
>>> OriginalPixmapA is the current pixmap of Window A before the calls.
>>>
>>> 	First call to RootlessStartDrawing(A):
>>> 		Sets WINREC(T)->pixmap to a scratch pixmap configured as  
>>> appropriate.  We'll call this ScratchPixmap.
>>> 		Sets WINREC(T)->is_drawing
>>> 		Sets WINREC(T)->oldPixmap to OriginalPixmapA
>>> 		Sets pixmap of Window A to WINREC(T)->pixmap, which is  
>>> ScratchPixmap
>>>
>>> 	Second call to RootlessStartDrawing(A)
>>> 		Since WINREC(T)->is_drawing, do not get another scratch pixmap
>>> 		Sets WINREC(T)->oldPixmap to the *current* pixmap of Window A,  
>>> which is ScratchPixmap.  This loses the value of OriginalPixmapA!
>>> 		Sets pixmap of Window A to WINREC(T)->pixmap, which is  
>>> ScratchPixmap
>>>
>>> 	Call to RootlessStopDrawing(A)
>>> 		Calls FreeScratchPixmapHeader on WINREC(T)->pixmap, which is  
>>> ScratchPixmap.  This either deallocates ScratchPixmap or sets  
>>> ScratchPixmap->devPrivate to NULL.
>>> 		Sets pixmap of Window A to WINREC(T)->oldPixmap, which is  
>>> ScratchPixmap, not OriginalPixmapA.  ScratchPixmap is, of course,  
>>> invalid.
>>> 		Clears WINREC(T)->pixmap and WINREC(T)->is_drawing
>>>
>>> The fix for this case is to not overwrite WINREC(T)->oldPixmap if  
>>> the pixmap of Window A is already equal to WINREC(T)->pixmap.   
>>> However, this fix doesn't fix all of the cases, as illustrated  
>>> by...  *drum roll* Scenario 2.  Nobody expects Scenario 2!     
>>> *cough*  (Sorry.)
>>>
>>>
>>> Scenario 2: Calling RootlessStartDrawing on multiple windows with  
>>> the same top-level parent.
>>> In addition to Window A, there is now also Window B.  They share  
>>> the same top-level parent, T.  OriginalPixmapB is the current  
>>> pixmap of Window B before the calls.
>>>
>>> 	Call to RootlessStartDrawing(A):
>>> 		Sets WINREC(T)->pixmap to a scratch pixmap configured as  
>>> appropriate.  We'll call this ScratchPixmap.
>>> 		Sets WINREC(T)->is_drawing
>>> 		Sets WINREC(T)->oldPixmap to OriginalPixmapA
>>> 		Sets pixmap of Window A to WINREC(T)->pixmap, which is  
>>> ScratchPixmap
>>>
>>> 	Call to RootlessStartDrawing(B)
>>> 		Since WINREC(T)->is_drawing, do not get another scratch pixmap
>>> 		Sets WINREC(T)->oldPixmap to the current pixmap of Window B,  
>>> which is OriginalPixmapB.  This loses the value of OriginalPixmapA!
>>> 		Sets pixmap of Window B to WINREC(T)->pixmap, which is  
>>> ScratchPixmap
>>>
>>> 	Call to RootlessStopDrawing(A)
>>> 		Calls FreeScratchPixmapHeader on WINREC(T)->pixmap, which is  
>>> ScratchPixmap.  This either deallocates ScratchPixmap or sets  
>>> ScratchPixmap->devPrivate to NULL.
>>> 		Sets pixmap of Window A to WINREC(T)->oldPixmap, which is  
>>> OriginalPixmapB, not OriginalPixmapA.  This is not right!
>>> 		Clears WINREC(T)->pixmap and WINREC(T)->is_drawing
>>>
>>> 	Note: at this point Window B's pixmap is set to ScratchPixmap,  
>>> which is invalid.
>>>
>>> 	Call to RootlessStopDrawing(B)
>>> 		Since WINREC(T)->is_drawing has been cleared, this does nothing.
>>> 		Thus, Window B's pixmap is still left at ScratchPixmap, which is  
>>> invalid.
>>>
>>> Reversing the calls to RootlessStopDrawing so that it's called on  
>>> B before A addresses the problem where Window A gets  
>>> OriginalPixmapB, but does not fix the problem where one of the  
>>> windows is left with an invalid pixmap.  Also, near as I can tell,  
>>> there's no guarantee that the RootlessStopDrawing calls will be  
>>> made in the "proper" order (i.e. the reverse order of the  
>>> RootlessStartDrawing calls).
>>>
>>> A partial fix is to have a backup pixmap not per-top-level-parent,  
>>> but per-window.  This allows each window to be restored with the  
>>> pixmap it originally had.
>>>
>>> Another thing necessary to fix this scenario is for the top-level  
>>> parent to be left in "drawing mode" (is_drawing set to TRUE, the  
>>> scratch pixmap remaining allocated) for as long as any of its  
>>> child windows are in drawing mode (their pixmap is set to the  
>>> scratch pixmap).  There are two approaches to accomplishing this:  
>>> 1) when RootlessStopDrawing is called on any window, it is  
>>> propagated to all children of the top-level parent; or 2)  
>>> is_drawing is changed from a flag to a count of how many children  
>>> of the top-level parent are in drawing mode, and only exit drawing  
>>> mode for the top-level parent when all of the children have exited  
>>> drawing mode.  I don't know enough to know which is better.
>>>
>>>
>>> There may be more problems that I haven't fully identified yet.   
>>> What do you think?  In particular, how would you suggest the  
>>> problem with ending drawing mode for the top-level parent when  
>>> multiple children are still in drawing mode?
[...]
>>> I've implemented a fix based on my analysis.  Read on...
>>>
>>>
>>> On Aug 21, 2006, at 5:09 PM, Ken Thomases wrote:
>>>
>>>> Scenario 1: Calling RootlessStartDrawing twice on a window
>>>> [...]
>>>>
>>>> The fix for this case is to not overwrite WINREC(T)->oldPixmap if  
>>>> the pixmap of Window A is already equal to WINREC(T)->pixmap.
>>>
>>> Did this.
>>>
>>>
>>>> Scenario 2: Calling RootlessStartDrawing on multiple windows with  
>>>> the same top-level parent.
>>>> [...]
>>>>
>>>> A partial fix is to have a backup pixmap not per-top-level- 
>>>> parent, but per-window.  This allows each window to be restored  
>>>> with the pixmap it originally had.
>>>
>>> I did this a bit inelegantly, using a second window private  
>>> instead of juggling with the existing one.
>>>
>>>
>>>> Another thing necessary to fix this scenario is for the top-level  
>>>> parent to be left in "drawing mode" (is_drawing set to TRUE, the  
>>>> scratch pixmap remaining allocated) for as long as any of its  
>>>> child windows are in drawing mode (their pixmap is set to the  
>>>> scratch pixmap).  There are two approaches to accomplishing this:  
>>>> 1) when RootlessStopDrawing is called on any window, it is  
>>>> propagated to all children of the top-level parent; or 2)  
>>>> is_drawing is changed from a flag to a count of how many children  
>>>> of the top-level parent are in drawing mode, and only exit  
>>>> drawing mode for the top-level parent when all of the children  
>>>> have exited drawing mode.  I don't know enough to know which is  
>>>> better.
>>>
>>> I tried both.  Method 1 works well for the most part.  It  
>>> occasionally causes updates to be lost.  RootlessStopDrawing is  
>>> applying to all windows within a hierarchy, and I think it's  
>>> "premature" for some.  They revert to the original framebuffer  
>>> pixmap, which, if I understand correctly, is a memory buffer which  
>>> is never really used in rootless mode.  Basically, I think this  
>>> happens in the case where it would otherwise crash.  Fixing this  
>>> seems to require a deeper rethinking of the rootless  
>>> implementation.  For now, I'm going to leave that to you.
>>>
>>> Method 2 did not work well, at all.  It resulted in no windows  
>>> being shown on the screen.  I also got a number of libXplugin  
>>> complaints about a window being locked when it shouldn't be.   
>>> Essentially, RootlessStopDrawing is not called for every window on  
>>> which RootlessStartDrawing is called.  In fact,  
>>> RootlessStopDrawing is often called on a window for which  
>>> RootlessStartDrawing was *not* called.  It seems it's generally  
>>> called on the top-level window, whereas RootlessStartDrawing is  
>>> often called on windows deeper in the hierarchy.  (This fact seems  
>>> to confirm my method 1 approach.)  The consequence is that the  
>>> is_drawing count was never balanced, and the real meat of  
>>> RootlessStopDrawing was never invoked.
>> I've had a little time to look over your work and I think you are  
>> right on about the problem if not the solution. It looks pretty  
>> convincing that in either "Scenario 1" or "Scenario 2" you can get  
>> an invalid Pixmap set for a window, which later causes problems.  
>> Some of things you noticed in your fix are intended, however. In  
>> particular:
>>
>> 1. Rootless Start/Stop Drawing calls are not intended to be  
>> balanced and generally won't be.
>> 2. StopDrawing is mainly called on the top level windows only.
>> 3. Recording exactly which pixmap a window had is good form but not  
>> really necessary in practice, at least until we integrate with  
>> Composite better. All the windows start with the screen pixmap as  
>> their pixmap. We can just reset the pixmap to the screen pixmap and  
>> not bother with tracking if we want.
>> 4. In debugging I see that we do commonly lose the oldPixmap.
>> 5. Since we don't generally stop drawing in each subwindow many of  
>> the subwindows will not get their pixmaps reverted to the screen  
>> pixmap correctly even once we fix the tracking issue with the  
>> oldPixmap. The most straightforward way to fix that is to walk the  
>> window tree below each window on StopDrawing to reset the window  
>> pixmaps but that could be inefficient.
>>
>> I'll have to think this weekend about the best way to fix this. You  
>> have definitely pointed out and interesting problem.
>

> To summarize the previous discussion: we were having some crashes  
> due to the rootless implementation.  I diagnosed the cause of the  
> crashes, and fixed them.  This left us with occasional rendering  
> failures -- drawing would just fail to go to the screen.  I was  
> under the impression that the drawing failures were left over from  
> the crashing bug -- that my fix had essentially transformed crashes  
> to mere data loss.
>
> The good news is that I was wrong.  The better news is that we have  
> found and fixed the drawing failures.  On top of that, this has  
> given me confidence which I didn't have when I last wrote that my  
> fix for the rootless crashes is actually a complete fix, rather than  
> a half-assed fix.
>
> I'm attaching a couple of patches which fixed the drawing failures.   
> The problem is that some functions would unwrap a dispatch table  
> and, for some code paths, fail to rewrap them before completing.   
> These patches are against X.org 6.8.2-era source, but not all that  
> much seems to have changed.


[http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=8606c76a904d55027173aa864d8914a1547fd8d6 
]

Ben Byer
CoreOS / BSD Technology Group, XDarwin maintainer



More information about the Xquartz-dev mailing list