Rootless code debugging notes
Ken Thomases of CodeWeavers (who makes an excellent product called CrossOver Mac, which you should all go try) and I have been mailing back and forth for months now, having a fairly sparse (time-wise) conversation about debugging the Rootless subsystem. It resulted in several small patches, but we obviously still have far to go. He's given me permission to collect some of the bits and post them here -- hopefully they'll inspire one of you! I'm going to order this "backwards", with the most recent stuff at the top, because in many cases it supersedes the older stuff. However, there's still a lot of knowledge there.
I remembered something else that I fixed in the X server. I could have sworn that Torrey applied this suggestion way back around 6.9.x, though. On the other hand, I don't see the fixes in the current git branch.
The code in miext/rootless/accel is essentially a copy+paste+hack job of similar code in the fb directory. However, the fb code continued to be updated after the time of the copy, and fixes there didn't make it into the rootless code. The classic problem with code duplication. This is an excellent candidate for consolidation, but I have to admit I took the coward's way out and just updated the rootless/accel code to incorporate the fixes to the corresponding fb code.
[http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=70374a58937d7a6... ] [http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=3f857e129df7ce4... ]
I'll apply it (Patch to avert (some) damage / rootless crashes, courte ... xorg-server-1.2-apple), too, but I'm still getting rootless crashes just like ones identified below -- for example,
Program received signal EXC_BAD_ACCESS, Could not access memory. Reason: KERN_INVALID_ADDRESS at address: 0xb09a7ffe [Switching to process 64128 thread 0x2d03] 0x000ba460 in fbSolid (dst=0xb09a7ffc, dstStride=1, dstX=16, bpp=16, width=4560, height=1, and=0, xor=1523997398) at fbsolid.c:67 67 FbDoLeftMaskByteRRop(dst,startbyte,startmask,and,xor); (gdb) bt #0 0x000ba460 in fbSolid (dst=0xb09a7ffc, dstStride=1, dstX=16, bpp=16, width=4560, height=1, and=0, xor=1523997398) at fbsolid.c:67 #1 0x000bcb96 in fbFillRegionSolid (pDrawable=0x4f8b90, pRegion=0x13110280, and=0, xor=1523997398) at fbwindow.c:247 #2 0x0002693d in SafeAlphaPaintWindow (pWin=0x4f8b90, pRegion=0x13110280, what=0) at safeAlphaWindow.c:148 #3 0x00106f1a in damagePaintWindow (pWindow=0x4f8b90, prgn=0x13110280, what=0) at damage.c:1593 #4 0x00025a88 in RootlessPaintWindowBackground (pWin=0x4f8b90, pRegion=0x13110280, what=0) at rootlessWindow.c:1542 #5 0x00035655 in miWindowExposures (pWin=0x4f8b90, prgn=0x13110280, other_exposed=0x0) at miexpose.c:565 #6 0x000164ac in DRIWindowExposures (pWin=0x4f8b90, prgn=0xb09a7ffe, bsreg=0xb09a7ffe) at dri.c:609 #7 0x0004922c in miHandleValidateExposures (pWin=0x428470) at miwindow.c:470 #8 0x0008a9a6 in MapWindow (pWin=0x4f67d0, client=0x4f10d0) at window.c:2803 #9 0x000688b9 in ProcMapWindow (client=0x1) at dispatch.c:696 #10 0x0006f5b5 in Dispatch () at dispatch.c:457 #11 0x0007e001 in main (argc=2, argv=0xbffff4f8, envp=0xbffff504) at main.c:445
Hmm. I had thought I had caught all of the remaining crashes. It's of course possible that new bugs have been introduced between the code base that I'm working with and X.org 7.2. The most likely candidate would be someplace where a function is unwrapped on an object (a screen, a window, a graphics context, etc.) and there's a code path that fails to rewrap it when it should. This is what I fixed in the earlier commit <http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=38965768636aef5...
. A thorough grepping of the code can find places where something is unwrapped, and you can then examine each place to make sure the proper rewrap happens in every case.
Something that can be useful when debugging what's going on is malloc_history and the MallocStackLoggingNoCompact environment variable. You may find Xray (ahem, "Instruments" ;) can also help.
When I was debugging, I actually did something I thought was kind of neat. It's pretty common to litter code with printfs (or ErrorF) to debug. Well, I did that and included the pointer address of each interesting object at each point -- when a PixMap was freed, I printed the address of the pixmap object, for example. What was neat though, is that I also did a small allocation and included the address of that, and then leaked the allocated buffer. For example:
ErrorF("Did something interesting with pixmap %p. %p\n", pixmap, malloc(1));
Then, as I examined the flood of logging messages, when I found a suspicious operation, I would use malloc_history on the address of the small leaked buffer. Since I was leaking it, then the last entry in the history had to be for the log line in question, and malloc_history could provide a stack trace for where/when that allocation happened. That gave me invaluable insight into what the code was doing.
It's quite possible that DTrace is powerful enough, though, that this technique can be considered "archaic". :)
... and this next bit is excerpted from an email thread between him and Torrey Lyons that he sent me.
We're experiencing a significant number of crashes when we resize windows. They're all within the Damage extension. I found bug 1032 in X.org's bugzilla <https://bugs.freedesktop.org/show_bug.cgi?id=1032
, but haven't seen recent activity.
That bug also includes a note claiming that Damage is disabled for rootless servers, but I find that that isn't true. In fact, I attempted to disable the extension but found it impossible due to the cursor code requiring it.
Hmm. The comment on bug <https://bugs.freedesktop.org/show_bug.cgi?id=1032
from ajax@nwnk.net is incorrect as I recall and as you noted. Damage is used. My understanding was that all crashes related to damage were fixed. The bug was left open to remind me to fix a potential incorrect behavior in CopyWindow. However, I had never actually observed any problems related to this. You may have stumbled upon one. Can you please, try the following:
1. Rebuild X.Org with the debug flag on so you get a more complete backtrace when the crash occurs. You can do this by inserting a host.def file into /xc/config/cf with the line: #define DefaultCDebugFlags -g
2. Duplicate the crash and send me the backtrace. You should be able to get the XDarwin backtrace from Console.app. You can also post it on the above mentioned bug. Thanks for your reply.
I have attached crash reports showing two crashes related to the Damage extension.
Here are some gdb backtraces I just generated:
------------- Program received signal: "EXC_BAD_ACCESS". [Switching to process 5606 thread 0x4603] (gdb) bt #0 0x01f45dd9 in rlBlt (srcLine=0x109a4838, srcStride=512, srcX=0, pDstScreen=0x1979590, dstLine=0x294, dstStride=1, dstX=29, width=271, height=326, alu=3, pm=4294967295, bpp=1, reverse=0, upsidedown=0) at rlBlt.c:327 #1 0x01f445ff in rlCopyNtoN (pSrcDrawable=0x109a4000, pDstDrawable=0x10094dc0, pGC=0x1008f010, pbox=0xb009dc78, nbox=0, dx=-573, dy=-147, reverse=0, upsidedown=0, bitplane=0, closure=0x0) at rlCopy.c:60 #2 0x001aeec5 in fbCopyRegion (pSrcDrawable=0x109a4000, pDstDrawable=0x10094dc0, pGC=0x1008f010, pDstRegion=0xb009dc78, dx=-573, dy=-147, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0, closure=0x0) at fbcopy.c:366 #3 0x001af38f in fbDoCopy (pSrcDrawable=0x109a4000, pDstDrawable=0x10094dc0, pGC=0x1008f010, xIn=0, yIn=0, widthSrc=271, heightSrc=468, xOut=573, yOut=147, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0, closure=0x0) at fbcopy.c:566 #4 0x01f446ad in rlCopyArea (pSrcDrawable=0x109a4000, pDstDrawable=0x10094dc0, pGC=0x1008f010, xIn=0, yIn=0, widthSrc=271, heightSrc=468, xOut=569, yOut=124) at rlCopy.c:101 #5 0x000dead0 in damageCopyArea (pSrc=0x109a4000, pDst=0x10094dc0, pGC=0x1008f010, srcx=0, srcy=0, width=271, height=468, dstx=569, dsty=124) at damage.c:762 #6 0x000bfecb in ProcCopyArea (client=0x100886d0) at dispatch.c: 1758 #7 0x000bc943 in Dispatch () at dispatch.c:455 #8 0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4) at main.c:451 -------------
------------- Program received signal: "EXC_BAD_ACCESS". [Switching to process 5906 thread 0x4603] (gdb) bt #0 0x9035ab8e in sseCGSFill8by1 () #1 0x9a73c9a6 in xp_fill_bytes () #2 0x01f46b85 in rlSolid (pScreen=0x1979a00, dst=0x209f58, dstStride=1280, dstX=0, bpp=32, width=8320, height=2, and=0, xor=15000804) at rlSolid.c:77 #3 0x01f4615f in rlFill (pDrawable=0xfc8ecd0, pGC=0xfc8b870, x=726, y=417, width=260, height=2) at rlFill.c:49 #4 0x01f44e32 in rlPolyFillRect (pDrawable=0xfc8ecd0, pGC=0xfc8b870, nrect=0, prect=0x108c806c) at rlFillRect.c:79 #5 0x000e0858 in damagePolyFillRect (pDrawable=0xfc8ecd0, pGC=0xfc8b870, nRects=5, pRects=0x108c8044) at damage.c:1205 #6 0x000c1339 in ProcPolyFillRectangle (client=0xfc7b6f0) at dispatch.c:1978 #7 0x000bc943 in Dispatch () at dispatch.c:455 #8 0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4) at main.c:451 -------------
This next one is most reminiscent of the known Damage bug, but I suspect the ones above are related: ------------- Program received signal: "EXC_BAD_ACCESS". [Switching to process 6021 thread 0x4603] (gdb) bt #0 0x000de948 in damageCopyArea (pSrc=0x10877000, pDst=0xfba1830, pGC=0xfba6a50, srcx=0, srcy=0, width=628, height=26, dstx=0, dsty=51) at damage.c:748 #1 0x000bfecb in ProcCopyArea (client=0xfb986a0) at dispatch.c:1758 #2 0x000bc943 in Dispatch () at dispatch.c:455 #3 0x0002d5ad in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4) at main.c:451 ------------- As you probably remember, what's happening is that getDrawableDamageRef is returning a bogus pointer which is dereferenced by the getDrawableDamage macro (used by checkGCDamage). getDrawableDamageRef returns a bogus pointer because it uses the (*pScreen->GetWindowPixmap) function pointer, which is _fbGetWindowPixmap in this case. That devolves down to the macro fbGetWindowPixmap(pWin) which is defined as ((PixmapPtr) ((WindowPtr) (pWin))->devPrivates[fbGetWinPrivateIndex()].ptr). And finally we get to the immediate problem: there is no pointer stashed in that private index entry. However, what I don't understand is why that private index entry hasn't been initialized. That's where I get lost in the maze that is X11.
[...]
t took me a while to get around to it, but I have tested with the 6.9 tip and am still able to reproduce the crashes. Here are some gdb backtraces of a few of the crashes:
Program received signal: "EXC_BAD_ACCESS". [Switching to process 16351 thread 0x4607] (gdb) bt #0 0x001b107b in fbEvenTile (dst=0x6cb60, dstStride=900, dstX=0, width=5504, height=22, tile=0x103ee2b8, tileHeight=26, alu=3, pm=4294967295, xRot=28832, yRot=-3) at fbtile.c:94 #1 0x001b130e in fbTile (dst=0x6bd40, dstStride=1072, dstX=28928, width=5504, height=23, tile=0x103ee2b8, tileStride=1, tileWidth=32, tileHeight=26, alu=3, pm=4294967295, bpp=32, xRot=28832, yRot=-3) at fbtile.c:193 #2 0x01f466f1 in rlFill (pDrawable=0x103981a0, pGC=0x103a1850, x=904, y=103, width=172, height=23) at rlFill.c:142 #3 0x01f44f11 in rlPolyFillRect (pDrawable=0x103981a0, pGC=0x103a1850, nrect=0, prect=0x107c50cc) at rlFillRect.c:108 #4 0x000e1714 in damagePolyFillRect (pDrawable=0x103981a0, pGC=0x103a1850, nRects=1, pRects=0x107c50c4) at damage.c:1205 #5 0x000c21f5 in ProcPolyFillRectangle (client=0x10399130) at dispatch.c:1978 #6 0x000bd7ff in Dispatch () at dispatch.c:455 #7 0x0002e469 in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4) at main.c:451
--------------------
Program received signal: "EXC_BAD_ACCESS". [Switching to process 16734 thread 0x4607] (gdb) bt full #0 0x000df804 in damageCopyArea (pSrc=0x1099c000, pDst=0x10094b00, pGC=0x100947e0, srcx=0, srcy=0, width=628, height=26, dstx=0, dsty=51) at damage.c:748 ret = (RegionPtr) 0x100947e0 pGCPriv = (DamageGCPrivPtr) 0x10094868 oldFuncs = (GCFuncs *) 0x23a120 #1 0x000c0d87 in ProcCopyArea (client=0x1008d330) at dispatch.c: 1758 pDst = (DrawablePtr) 0x10094b00 pSrc = (DrawablePtr) 0x1099c000 pGC = (GC *) 0x100947e0 stuff = (xCopyAreaReq *) 0x1094c3b4 pRgn = (RegionPtr) 0x1 #2 0x000bd7ff in Dispatch () at dispatch.c:455 clientReady = (int *) 0x206ae00 result = 28 client = (ClientPtr) 0x1008d330 nready = 0 icheck = (HWEventQueuePtr *) 0x27d2b0 start_tick = 11920 #3 0x0002e469 in main (argc=1, argv=0xbffff9ec, envp=0xbffff9f4) at main.c:451 i = 1 j = 2 k = 2 error = 0 xauthfile = 0x0 alwaysCheckForInput = {0, 1} adjustedRgbPath = "/Users/ken/work/cxmac/macosx/CrossOver/build/ Debug/CrossOver.app/Contents/SharedSupport/X11/lib/X11/rgb", '\0' <repeats 920 times> (gdb) p pDst->pScreen->GetWindowPixmap $1 = (GetWindowPixmapProcPtr) 0x1573f9 <_fbGetWindowPixmap> (gdb) p ((PixmapPtr) ((WindowPtr) (pDst))-
devPrivates[fbWinPrivateIndex].ptr) $2 = (struct _Pixmap *) 0x100bcd10 (gdb) p *((PixmapPtr) ((WindowPtr) (pDst))- devPrivates[fbWinPrivateIndex].ptr) $3 = { drawable = { type = 32 ' ', class = 157 '\235', depth = 128 '\200', bitsPerPixel = 160 '\240', id = 271240, x = 0, y = 0, width = 58736, height = 400, pScreen = 0x3, serialNumber = 6 }, refcnt = 600000, devKind = -1824067958, devPrivate = { ptr = 0x50797472, val = 1350136946, uval = 1350136946, fptr = 0x50797472 }, devPrivates = 0xea4d230, screen_x = 0, screen_y = 0 } (gdb) p damagePixPrivateIndex $4 = 0 (gdb) p $2->devPrivates[damagePixPrivateIndex].ptr $5 = (pointer) 0xa0801fa8 (gdb) p (DamagePtr*) &($2->devPrivates[damagePixPrivateIndex].ptr) $6 = (DamagePtr *) 0x4f0074 (gdb) p **$6 $9 = { pNext = 0x18, pNextWin = 0x1440000, damage = { extents = { x1 = 242, y1 = -17399, x2 = 24, y2 = 0 }, data = 0x1440000 }, damageLevel = 3155558643, isInternal = 24, closure = 0x1440000, isWindow = -1138687756, pDrawable = 0x18, damageReport = 0x1440000, damageDestroy = 0xbc3900f6 }
--------------------
Program received signal: "EXC_BAD_ACCESS". [Switching to process 17098 thread 0x4607] (gdb) bt full #0 0x90385a1a in CGBlt_copyBytes () No symbol table info available. #1 0x9a73c8f3 in xp_copy_bytes () No symbol table info available. #2 0x01f452b6 in rlBlt (srcLine=0x107c1038, srcStride=628, srcX=0, pDstScreen=0x197b5d0, dstLine=0x7fffffff, dstStride=0, dstX=512, width=80384, height=1, alu=3, pm=4294967295, bpp=128, reverse=0, upsidedown=0) at rlBlt.c:93 src = (FbBits *) 0x102e1fd0 dst = (FbBits *) 0x102e8ad0 leftShift = 32725726 rightShift = -1838100790 startmask = 0 endmask = 0 bits = 2456866803 bits1 = 0 n = -1 nmiddle = 2512 destInvarient = 1 startbyte = 0 endbyte = 0 _ca1 = 0 _cx1 = 0 _ca2 = 4294967295 _cx2 = 0 #3 0x01f445ff in rlCopyNtoN (pSrcDrawable=0x107c1000, pDstDrawable=0x10291680, pGC=0x10290010, pbox=0x102e8608, nbox=2, dx=-4, dy=-74, reverse=0, upsidedown=0, bitplane=0, closure=0x0) at rlCopy.c:60 alu = 3 '\003' pm = 4294967295 src = (FbBits *) 0x107c1038 srcStride = 628 srcBpp = 32 srcXoff = 0 srcYoff = 0 dst = (FbBits *) 0x7fffffff dstStride = 0 dstBpp = 128 dstXoff = 0 dstYoff = 0 #4 0x001afd81 in fbCopyRegion (pSrcDrawable=0x107c1000, pDstDrawable=0x10291680, pGC=0x10290010, pDstRegion=0xb009dc78, dx=-4, dy=-74, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0, closure=0x0) at fbcopy.c:366 careful = 0 reverse = 0 upsidedown = 0 pbox = (BoxPtr) 0x102e8608 nbox = 3 pboxNew1 = (BoxPtr) 0x0 pboxNew2 = (BoxPtr) 0x0 pboxBase = (BoxPtr) 0x102b7d30 pboxNext = (BoxPtr) 0x197b5d0 pboxTmp = (BoxPtr) 0x1971020 #5 0x001b024b in fbDoCopy (pSrcDrawable=0x107c1000, pDstDrawable=0x10291680, pGC=0x10290010, xIn=0, yIn=0, widthSrc=628, heightSrc=26, xOut=4, yOut=74, copyProc=0x1f443cf <rlCopyNtoN>, bitPlane=0, closure=0x0) at fbcopy.c:566 prgnSrcClip = (RegionPtr) 0x0 freeSrcClip = 0 prgnExposed = (RegionPtr) 0x0 rgnDst = { extents = { x1 = 4, y1 = 74, x2 = 632, y2 = 100 }, data = 0x102e8600 } dx = -4 dy = -74 numRects = 3 box_x1 = 4 box_y1 = 74 box_x2 = 632 box_y2 = 100 fastSrc = 1 fastDst = 0 fastExpose = 1 #6 0x01f446ad in rlCopyArea (pSrcDrawable=0x107c1000, pDstDrawable=0x10291680, pGC=0x10290010, xIn=0, yIn=0, widthSrc=628, heightSrc=26, xOut=0, yOut=51) at rlCopy.c:101 copy = (fbCopyProc) 0x1f443cf <rlCopyNtoN> #7 0x000df98c in damageCopyArea (pSrc=0x107c1000, pDst=0x10291680, pGC=0x10290010, srcx=0, srcy=0, width=628, height=26, dstx=0, dsty=51) at damage.c:762 ret = (RegionPtr) 0x10290010 pGCPriv = (DamageGCPrivPtr) 0x10290098 oldFuncs = (GCFuncs *) 0x23a120 #8 0x000c0d87 in ProcCopyArea (client=0x1028a090) at dispatch.c: 1758 pDst = (DrawablePtr) 0x10291680 pSrc = (DrawablePtr) 0x107c1000 pGC = (GC *) 0x10290010 stuff = (xCopyAreaReq *) 0x1073b3b4 pRgn = (RegionPtr) 0x1 [...]
-----------------
Do you know what's going wrong and how to fix it?
If there's some diagnostic steps I could take that would be informative, please let me know. [...] So, I have some primary suspects. I think it's the handling of the window pixmaps which is done by RootlessStartDrawing and RootlessStopDrawing. Basically, RootlessStopDrawing is sometimes restoring a stale or deallocated pixmap to the window.
Here are a couple of scenarios to illustrate the issue:
Scenario 1: Calling RootlessStartDrawing twice on a window Assume we have Window A. Its top-level parent is T. OriginalPixmapA is the current pixmap of Window A before the calls.
First call to RootlessStartDrawing(A): Sets WINREC(T)->pixmap to a scratch pixmap configured as appropriate. We'll call this ScratchPixmap. Sets WINREC(T)->is_drawing Sets WINREC(T)->oldPixmap to OriginalPixmapA Sets pixmap of Window A to WINREC(T)->pixmap, which is ScratchPixmap
Second call to RootlessStartDrawing(A) Since WINREC(T)->is_drawing, do not get another scratch pixmap Sets WINREC(T)->oldPixmap to the *current* pixmap of Window A, which is ScratchPixmap. This loses the value of OriginalPixmapA! Sets pixmap of Window A to WINREC(T)->pixmap, which is ScratchPixmap
Call to RootlessStopDrawing(A) Calls FreeScratchPixmapHeader on WINREC(T)->pixmap, which is ScratchPixmap. This either deallocates ScratchPixmap or sets ScratchPixmap->devPrivate to NULL. Sets pixmap of Window A to WINREC(T)->oldPixmap, which is ScratchPixmap, not OriginalPixmapA. ScratchPixmap is, of course, invalid. Clears WINREC(T)->pixmap and WINREC(T)->is_drawing
The fix for this case is to not overwrite WINREC(T)->oldPixmap if the pixmap of Window A is already equal to WINREC(T)->pixmap. However, this fix doesn't fix all of the cases, as illustrated by... *drum roll* Scenario 2. Nobody expects Scenario 2! *cough* (Sorry.)
Scenario 2: Calling RootlessStartDrawing on multiple windows with the same top-level parent. In addition to Window A, there is now also Window B. They share the same top-level parent, T. OriginalPixmapB is the current pixmap of Window B before the calls.
Call to RootlessStartDrawing(A): Sets WINREC(T)->pixmap to a scratch pixmap configured as appropriate. We'll call this ScratchPixmap. Sets WINREC(T)->is_drawing Sets WINREC(T)->oldPixmap to OriginalPixmapA Sets pixmap of Window A to WINREC(T)->pixmap, which is ScratchPixmap
Call to RootlessStartDrawing(B) Since WINREC(T)->is_drawing, do not get another scratch pixmap Sets WINREC(T)->oldPixmap to the current pixmap of Window B, which is OriginalPixmapB. This loses the value of OriginalPixmapA! Sets pixmap of Window B to WINREC(T)->pixmap, which is ScratchPixmap
Call to RootlessStopDrawing(A) Calls FreeScratchPixmapHeader on WINREC(T)->pixmap, which is ScratchPixmap. This either deallocates ScratchPixmap or sets ScratchPixmap->devPrivate to NULL. Sets pixmap of Window A to WINREC(T)->oldPixmap, which is OriginalPixmapB, not OriginalPixmapA. This is not right! Clears WINREC(T)->pixmap and WINREC(T)->is_drawing
Note: at this point Window B's pixmap is set to ScratchPixmap, which is invalid.
Call to RootlessStopDrawing(B) Since WINREC(T)->is_drawing has been cleared, this does nothing. Thus, Window B's pixmap is still left at ScratchPixmap, which is invalid.
Reversing the calls to RootlessStopDrawing so that it's called on B before A addresses the problem where Window A gets OriginalPixmapB, but does not fix the problem where one of the windows is left with an invalid pixmap. Also, near as I can tell, there's no guarantee that the RootlessStopDrawing calls will be made in the "proper" order (i.e. the reverse order of the RootlessStartDrawing calls).
A partial fix is to have a backup pixmap not per-top-level-parent, but per-window. This allows each window to be restored with the pixmap it originally had.
Another thing necessary to fix this scenario is for the top-level parent to be left in "drawing mode" (is_drawing set to TRUE, the scratch pixmap remaining allocated) for as long as any of its child windows are in drawing mode (their pixmap is set to the scratch pixmap). There are two approaches to accomplishing this: 1) when RootlessStopDrawing is called on any window, it is propagated to all children of the top-level parent; or 2) is_drawing is changed from a flag to a count of how many children of the top-level parent are in drawing mode, and only exit drawing mode for the top-level parent when all of the children have exited drawing mode. I don't know enough to know which is better.
There may be more problems that I haven't fully identified yet. What do you think? In particular, how would you suggest the problem with ending drawing mode for the top-level parent when multiple children are still in drawing mode? [...] I've implemented a fix based on my analysis. Read on...
On Aug 21, 2006, at 5:09 PM, Ken Thomases wrote:
Scenario 1: Calling RootlessStartDrawing twice on a window [...]
The fix for this case is to not overwrite WINREC(T)->oldPixmap if the pixmap of Window A is already equal to WINREC(T)->pixmap.
Did this.
Scenario 2: Calling RootlessStartDrawing on multiple windows with the same top-level parent. [...]
A partial fix is to have a backup pixmap not per-top-level- parent, but per-window. This allows each window to be restored with the pixmap it originally had.
I did this a bit inelegantly, using a second window private instead of juggling with the existing one.
Another thing necessary to fix this scenario is for the top-level parent to be left in "drawing mode" (is_drawing set to TRUE, the scratch pixmap remaining allocated) for as long as any of its child windows are in drawing mode (their pixmap is set to the scratch pixmap). There are two approaches to accomplishing this: 1) when RootlessStopDrawing is called on any window, it is propagated to all children of the top-level parent; or 2) is_drawing is changed from a flag to a count of how many children of the top-level parent are in drawing mode, and only exit drawing mode for the top-level parent when all of the children have exited drawing mode. I don't know enough to know which is better.
I tried both. Method 1 works well for the most part. It occasionally causes updates to be lost. RootlessStopDrawing is applying to all windows within a hierarchy, and I think it's "premature" for some. They revert to the original framebuffer pixmap, which, if I understand correctly, is a memory buffer which is never really used in rootless mode. Basically, I think this happens in the case where it would otherwise crash. Fixing this seems to require a deeper rethinking of the rootless implementation. For now, I'm going to leave that to you.
Method 2 did not work well, at all. It resulted in no windows being shown on the screen. I also got a number of libXplugin complaints about a window being locked when it shouldn't be. Essentially, RootlessStopDrawing is not called for every window on which RootlessStartDrawing is called. In fact, RootlessStopDrawing is often called on a window for which RootlessStartDrawing was *not* called. It seems it's generally called on the top-level window, whereas RootlessStartDrawing is often called on windows deeper in the hierarchy. (This fact seems to confirm my method 1 approach.) The consequence is that the is_drawing count was never balanced, and the real meat of RootlessStopDrawing was never invoked. I've had a little time to look over your work and I think you are right on about the problem if not the solution. It looks pretty convincing that in either "Scenario 1" or "Scenario 2" you can get an invalid Pixmap set for a window, which later causes problems. Some of things you noticed in your fix are intended, however. In particular:
1. Rootless Start/Stop Drawing calls are not intended to be balanced and generally won't be. 2. StopDrawing is mainly called on the top level windows only. 3. Recording exactly which pixmap a window had is good form but not really necessary in practice, at least until we integrate with Composite better. All the windows start with the screen pixmap as their pixmap. We can just reset the pixmap to the screen pixmap and not bother with tracking if we want. 4. In debugging I see that we do commonly lose the oldPixmap. 5. Since we don't generally stop drawing in each subwindow many of the subwindows will not get their pixmaps reverted to the screen pixmap correctly even once we fix the tracking issue with the oldPixmap. The most straightforward way to fix that is to walk the window tree below each window on StopDrawing to reset the window pixmaps but that could be inefficient.
I'll have to think this weekend about the best way to fix this. You have definitely pointed out and interesting problem.
To summarize the previous discussion: we were having some crashes due to the rootless implementation. I diagnosed the cause of the crashes, and fixed them. This left us with occasional rendering failures -- drawing would just fail to go to the screen. I was under the impression that the drawing failures were left over from the crashing bug -- that my fix had essentially transformed crashes to mere data loss.
The good news is that I was wrong. The better news is that we have found and fixed the drawing failures. On top of that, this has given me confidence which I didn't have when I last wrote that my fix for the rootless crashes is actually a complete fix, rather than a half-assed fix.
I'm attaching a couple of patches which fixed the drawing failures. The problem is that some functions would unwrap a dispatch table and, for some code paths, fail to rewrap them before completing. These patches are against X.org 6.8.2-era source, but not all that much seems to have changed.
[http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commit;h=8606c76a904d550... ] Ben Byer CoreOS / BSD Technology Group, XDarwin maintainer
participants (1)
-
Ben Byer