Corrupt ZFS pool recovery
Here's a ZFS repair success story:
I'm using zfs-fuse on linux. I had an already-weird setup where a replace operation wouldn't finish:
  pool: stor3
 state: DEGRADED
config:

        stor3                      DEGRADED
          mirror                   DEGRADED
            sdd                    ONLINE
            replacing              DEGRADED
              3323475527286891755  UNAVAIL  cannot open
              sdb                  ONLINE
The disk called sdb was throwing ATA resets in my dmesg log, and one night when I ran 'fdisk -l', the machine rebooted outright. I didn't even know linux would do that.
On reboot, the volume wouldn't mount. (These commands are running at the same time in two shells.)
slash(pts/1):/tmp/zfs/src% sudo /usr/local/sbin/zfs-fuse --no-daemon
lib/libzpool/dmu.c:610: dmu_write: Assertion `0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)` failed.
zsh: abort  sudo /usr/local/sbin/zfs-fuse --no-daemon

slash(pts/3):/tmp/zfs/src# zfs mount stor3
zfsfuse_ioctl_read_loop(): file descriptor closed
cannot mount 'stor3': Input/output error
I got a new disk and dd'ed the data from sdd into a 500GB file. The copy died around the 400GB mark. dd does have don't-stop-on-errors options (conv=noerror,sync), but what I really wanted was ddrescue, which logs bad blocks and goes back to retry them. In the end I had 77kB of errors in 152 places, which doesn't seem like much at all. ZFS would have used its mirror to catch and repair errors like those, but after a failure this catastrophic I don't (think I) care about losing that little.
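For the record, the ddrescue invocation is roughly this (device and file names are illustrative, not copied from my shell history):

ddrescue /dev/sdd /1t/sdd.img /1t/sdd.log       # first pass: grab everything readable, map the bad blocks
ddrescue -r 3 /dev/sdd /1t/sdd.img /1t/sdd.log  # second pass: retry the bad areas a few more times

The log file is what makes it resumable: re-running the same command only touches the blocks still marked bad.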
I tried a lot of zpool import combinations on zfs-fuse and an opensolaris live cd. Usually the result was that the zpool command would crash zfs-fuse or just hang on some lock. On solaris, I couldn't even kill the zpool command with a ctrl-c.
Here is some search bait showing the kinds of crashes and lockups I could get:
#2  0x00000000004e4020 in vpanic (fmt=0x510648 "zfs: allocating allocated segment(offset=%llu size=%llu)\n", adx=0x7f1bd272b6c0) at lib/libsolkerncompat/cmn_err.c:53
#3  0x00000000004e414e in vcmn_err (ce=3, fmt=0x510648 "zfs: allocating allocated segment(offset=%llu size=%llu)\n", adx=0x7f1bd272b6c0) at lib/libsolkerncompat/cmn_err.c:73
#4  0x0000000000488548 in zfs_panic_recover (fmt=0x510648 "zfs: allocating allocated segment(offset=%llu size=%llu)\n") at lib/libzpool/spa_misc.c:1123
#5  0x00000000004c5c9f in space_map_add (sm=0x7f1bd283f870, start=0, size=1) at lib/libzpool/space_map.c:102
#6  0x00000000004c6d73 in space_map_load (sm=0x7f1bd283f870, ops=0x0, maptype=0 '\0', smo=0x7f1bd283fd00, os=0x7f1bd0fb2c80) at lib/libzpool/space_map.c:327
#7  0x000000000048eccb in vdev_dtl_load (vd=0x7f1bd283f800) at lib/libzpool/vdev.c:1544
#8  0x000000000048f5eb in vdev_load (vd=0x7f1bd283f800) at lib/libzpool/vdev.c:1708
#9  0x000000000048f56a in vdev_load (vd=0x7f1bd2841000) at lib/libzpool/vdev.c:1694
#10 0x000000000048f56a in vdev_load (vd=0x7f1bd2842000) at lib/libzpool/vdev.c:1694
#11 0x000000000048f56a in vdev_load (vd=0x7f1bd2842800) at lib/libzpool/vdev.c:1694
#12 0x000000000047c7c3 in spa_load (spa=0x7f1bd284f000, config=0x7f1bd2850f20, state=SPA_LOAD_TRYIMPORT, mosconfig=1) at lib/libzpool/spa.c:1548
#13 0x000000000047ed9a in spa_tryimport (tryconfig=0x7f1bd2850f20) at lib/libzpool/spa.c:2721
#14 0x000000000040f95b in zfs_ioc_pool_tryimport (zc=0x7f1bd0fdb000) at zfs-fuse/zfs_ioctl.c:1105
#15 0x0000000000413e10 in zfsdev_ioctl (dev=0, cmd=23046, arg=140734269694352, flag=0, cr=0x0, rvalp=0x0) at zfs-fuse/zfs_ioctl.c:3670
#16 0x00000000004051cb in cmd_ioctl_req (sock=11, cmd=0x7f1bd272c0a0) at zfs-fuse/cmd_listener.c:49
I found this article with another ZFS recovery story. In his solution, Nathan used dd to carefully write zeros over some of the uberblock structures on the disk. This makes ZFS pick slightly older uberblocks that hopefully aren't corrupt.
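I didn't end up doing the surgery that way, but for reference it looks roughly like this. A vdev carries four copies of its 256KiB label (two at the front of the device, two at the end), and the second half of each label is a ring of 1KiB uberblock slots (at the minimum ashift), so zeroing, say, slot 3 in the first label of an image file would be something like (slot number made up):

# uberblock ring starts 128KiB into the label; label L0 starts at byte 0
dd if=/dev/zero of=sdd.img bs=1024 seek=$((128 + 3)) count=1 conv=notrunc

You would have to repeat that for the matching slot in all four label copies, which is part of why patching the source appealed to me more.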
Since I already had the zfs-fuse source up, I just made the driver reject the newest uberblocks. This meant I didn't have to edit my 500GB file, which sounded like a pain (especially if I wanted to revert).
int
uberblock_verify(uberblock_t *ub)
{
	if (ub->ub_magic == BSWAP_64((uint64_t)UBERBLOCK_MAGIC))
		byteswap_uint64_array(ub, sizeof (uberblock_t));

	/* show every uberblock the import considers */
	printf("uberblock_verify block txg %lu time %lu\n",
	    ub->ub_txg, ub->ub_timestamp);

	/* reject anything newer than the last txg I decided to trust */
	if (ub->ub_txg > 1058110) {
		printf(" txg too new; skipping\n");
		return (EINVAL);
	}

	if (ub->ub_magic != UBERBLOCK_MAGIC) {
		printf(" failed magic\n");
		return (EINVAL);
	}

	printf(" ok\n");
	return (0);
}
The first printf showed me the hundreds of historical (and duplicated) uberblocks on the disk. I picked a transaction group (txg) a few in from the end and made uberblock_verify fail on everything later.
This helped, but then spa_history_log got stuck trying to write its message to the pool's on-disk history log:
#1  0x00000000004e3f89 in cv_wait (cv=0x7f3a8ce775b8, mp=0x7f3a8ce774b0) at lib/libsolkerncompat/condvar.c:59
#2  0x0000000000489e6c in txg_wait_synced (dp=0x7f3a8ce772c0, txg=1058116) at lib/libzpool/txg.c:449
#3  0x0000000000476fad in dsl_sync_task_group_wait (dstg=0x7f3ad2d56af0) at lib/libzpool/dsl_synctask.c:119
#4  0x0000000000477412 in dsl_sync_task_do (dp=0x7f3a8ce772c0, checkfunc=0, syncfunc=0x485099 <spa_history_log_sync>, arg1=0x7f3ad2d5a000, arg2=0x7f3ab0549c50, blocks_modified=0) at lib/libzpool/dsl_synctask.c:209
#5  0x0000000000485857 in spa_history_log (spa=0x7f3ad2d5a000, history_str=0x7f3a8d3e5000 "zpool import -d /1t stor3", what=LOG_CMD_NORMAL) at lib/libzpool/spa_history.c:300
#6  0x000000000040da78 in zfs_log_history (zc=0x7f3ad2cc4000) at zfs-fuse/zfs_ioctl.c:252
#7  0x00000000004138fe in zfsdev_ioctl (dev=0, cmd=23042, arg=140734786238112, flag=0, cr=0x0, rvalp=0x0) at zfs-fuse/zfs_ioctl.c:3662
#8  0x0000000000404cd7 in cmd_ioctl_req (sock=9, cmd=0x7f3ab054b0a0) at zfs-fuse/cmd_listener.c:49
I disabled spa_history_log by having it return 0 at the top.
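The stub looked something like this (reconstructed against the signature visible in the trace above, so treat it as a sketch rather than the literal diff):

/* lib/libzpool/spa_history.c: recovery hack -- pretend history
 * logging succeeded without ever touching the pool. */
int
spa_history_log(spa_t *spa, const char *history_str, history_log_type_t what)
{
	return (0);
}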
Finally, zpool started hanging here, and I didn't know what to do:
#0  0x00007f62696b4746 in pthread_rwlock_rdlock () from /lib/libpthread.so.0
#1  0x00000000004e74e3 in rw_enter (rwlp=0x7f6223cfb7c8, rw=0) at lib/libsolkerncompat/rwlock.c:62
#2  0x000000000046c07e in dsl_dir_open_spa (spa=0x7f6269cb9000, name=0x7f6269c23000 "stor3", tag=0x4fcf00, ddp=0x7f62474a9c08, tailp=0x7f62474a9bf8) at lib/libzpool/dsl_dir.c:314
#3  0x000000000045f44e in dsl_dataset_hold (name=0x7f6269c23000 "stor3", tag=0x7f6269caaf70, dsp=0x7f62474a9c58) at lib/libzpool/dsl_dataset.c:566
#4  0x00000000004479c0 in dmu_objset_open (name=0x7f6269c23000 "stor3", type=DMU_OST_ANY, mode=9, osp=0x7f62474a9ca0) at lib/libzpool/dmu_objset.c:378
#5  0x000000000040fd99 in zfs_ioc_objset_stats (zc=0x7f6269c23000) at zfs-fuse/zfs_ioctl.c:1391
#6  0x00000000004138b7 in zfsdev_ioctl (dev=0, cmd=23058, arg=140737203102944, flag=0, cr=0x0, rvalp=0x0) at zfs-fuse/zfs_ioctl.c:3656
#7  0x0000000000404cd7 in cmd_ioctl_req (sock=9, cmd=0x7f62474aa0a0) at zfs-fuse/cmd_listener.c:49
Some faraway code was holding that lock, and I don't have the right kind of debugger-fu to chase it down.
Fortunately, Mark Musante from Sun visited the #zfs channel around this time. After seeing that last trace and some of the other context, he suggested disabling spa_sync entirely. That made the import succeed. 'zpool list' still crashed the zfs-fuse process, but as long as the only command I ran was the 'zpool import' itself, the filesystem stayed mounted.
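In source form that's another one-line lobotomy; again a sketch of the kind of stub this means, assuming the spa_sync(spa_t *, uint64_t) signature from lib/libzpool/spa.c, not the exact change I made:

/* lib/libzpool/spa.c: recovery hack -- make every txg sync a no-op,
 * so nothing is ever written back to the damaged pool. */
void
spa_sync(spa_t *spa, uint64_t txg)
{
}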
Mark was concerned about any writes happening on this hacked mount, including atime updates. I'm pretty sure I disabled atime long ago, and I'm not going to run any further zpool or zfs commands on here, just a big rsync read.
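The exit plan is one big one-way copy, something along these lines (both paths hypothetical):

rsync -a /stor3/ /1t/recovered/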
The following parties get donations or presents:
- Mark Musante from the ZFS team at Sun
- etherpad, for providing the whiteboard/live-pastebin I used with Mark
- Antonio Diaz Diaz, for ddrescue
- freenode, for hosting the #zfs IRC channel
- zfs-fuse project
Other things I learned: opensolaris has an 'ath' driver for my laptop's wifi card that is much more reliable and quicker to connect than the linux one.
When you try to make an ssh tunnel like '-R 999:localhost:22' (for ssh'ing back the other way), ssh will silently fail when the port can't be opened (999 < 1024, so only root can use it).
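Concretely, the first of these failed with no error message for me while the second works, because unprivileged users can't bind ports below 1024 on the remote side (hostname is a placeholder):

ssh -R 999:localhost:22 user@remotehost    # silently fails to bind remote port 999
ssh -R 2222:localhost:22 user@remotehost   # fine: 2222 is above 1024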
Sun is still actively adding features to ZFS. (Mark: "We are working on an automatic uberblock rollback feature, which will be really handy.")