ZFS Send/server migration help

Ixian · Jan 17, 2019

I built a new FreeNAS system (specs for both below). What I'd like to do is:

Copy data from system_old to system_new
Switch to using system_new (jails, etc.) as primary
When system_new is settled, destroy pools on system_old, reconfigure new pools using more disks (8 instead of 6), then use it as an ongoing replication target/backup system.

system_old:
Xeon D-1541 64GB ECC
Asrock Rack D1541 D4U-2TR MB
6x 5TB WD Red drives (poolname: Slimz) RaidZ2
2x 5TB WD Red drives (poolname: Backups) Mirror
1x 512GB Toshiba SATA SSD (poolname: Jails)
1x 240GB Intel P905 (log & cache for pools)
Intel x520 10GBase-T NIC (Storage interface, 10.0.0.2)
Intel i350 1GB NIC (Mgmt. interface, 192.168.0.90)

system_new:
Xeon E5-2680v3 64GB ECC
SM X10SRM-F MB
8x 10TB WD Red drives (poolname: Slimz) RaidZ2
2x 1TB Samsung 860 Evo SSD (poolname: Jails) Mirror
1x 400GB Intel DC3700 (log & cache for pools)
Intel x540 10GBase-T NIC (Storage interface, 10.0.0.3)
Intel i350 1GB NIC (Mgmt. interface, 192.168.0.91)

This is my first FreeNAS server migration so I need a little help. Specifically, I'm not familiar enough with ZFS send/receive on the CLI to set this up successfully.

Ideally I'd sync the entire pool "Slimz" and its datasets (19TB total) from system_old to the same pool on system_new, and do the same for pool Jails and its datasets (340GB). For pool Backups, I'd just like the datasets it contains to go under the Slimz pool on system_new as well.

I've done some basic tuning for the 10GB NICs on both - hw.ix.enable_aim = 0 as a tunable, mtu 9000 for both, etc. I have both servers peered together; here's my iperf report:

Which seems pretty decent for Intel NICs. I'm aware I won't get data transfer speeds anywhere near that as the Reds are 5400 rpm drives, but obviously I'd like to max it out.

Can anyone assist:

With the correct command line process to create manual snapshots and use ZFS send/recv? I've read up on piping with netcat to max the transfer rate but clearly I am doing something wrong.

Should I disable the log/cache on one/both server pools?

Once the data is sync'd over, can I promote the snapshot datasets so they can be used? Or clone instead?

Really appreciate any help/advice. Thanks!

Ixian · Jan 18, 2019

Worked through a lot of stuff on my own. Googling this stuff is both fun (learning) and tedious (the usual wading through a lot of outdated or bad data). I'll keep updating the thread in case it is ever helpful for anyone else, or if anyone has advice:

Figured out my zfs send/recv problems. First, I created a snapshot of my Jails dataset(s):

Code:

zfs snap -r Jails@migration

Which created a recursive snapshot of my Jails dataset and all of its sub-datasets.

Then I set up zfs recv on the target (server_new) piping through netcat instead of ssh, because this is a private direct peer link so ssh overhead is a waste:

Code:

nc -w 20 -l 3333 | \
    pv -rtab | \
    sudo zfs receive -vF Jails

And on server_old, the sender:

Code:

zfs send -Rv Jails@migration | \
    pv -b | \
    nc -w 20 10.0.0.3 3333

Initially I tried inserting mbuffer in there i.e.

Code:

mbuffer -q -s 128k -m 1G | \

But that gave me problems and didn't really speed things up. Perfectly possible I don't know what I'm doing with it but in the end it was just complicating my learning process so I left it out.

I also initially forgot the timeout option (-w 20) that tells netcat to exit at the end of the transfer. Without it, nc stays in listening mode, and if you are a dummy like me you'll stare at it for a few minutes wondering why it's just showing a few kb/s transfers but otherwise not doing anything. The -w 20 option above tells nc, on both sides, to exit once the connection has been idle for 20 seconds.

The 75GiB recursive snapshot transferred over in 3m 23s. I suspect that is pretty good, no? Jails is on SATA SSDs both sides.

Then, on server_new, I decided to rollback, rather than clone, the snapshot. My thinking behind this:
A) My intent is to use the datasets on that server B) Cloning them links back to the original snapshot which seems unnecessary because C) The snapshots from server_old are a one-time deal since when this is done I'm going to blow away server_old's pools and rebuild. When I'm done with that, then it will become the replication target for server_new.

So, cloning is probably unnecessary and might even be a problem later. If I'm wrong, let me know.

Now I'm transferring my primary media set, which is a little over 13TB. Seems to be going well:

Little burst-y in spots but a 3.6G Avg isn't bad for a bunch of WD Reds. I suspect I could improve this all the same. 13TB is going to take around 8 hours or so, certainly much faster than it would be over a 1GB link but thinking I could tune this more for future replication tasks. As always, any advice appreciated.
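
For anyone who wants to sanity-check an estimate like the one above, here is a rough sketch of the arithmetic in shell. It assumes decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 10^9 bits/s) and a perfectly steady rate, which a real transfer won't have; the function name eta_hours is made up for illustration.

```shell
# eta_hours <terabytes> <gigabits_per_second>
# Hypothetical helper: estimated transfer time at a constant rate.
eta_hours() {
    awk -v tb="$1" -v gbps="$2" 'BEGIN {
        secs = tb * 8000 / gbps          # 1 TB = 8000 Gbit in decimal units
        printf "%.1f hours\n", secs / 3600
    }'
}

eta_hours 13 3.6   # 13 TB at a 3.6 Gbit/s average  -> 8.0 hours
eta_hours 13 1     # same data on a saturated 1 Gbit/s link -> 28.9 hours
```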

Ixian · Jan 19, 2019

Update 3:

Success. My biggest xfer took a smidge over 9 hours:

Which I am thinking doesn't look bad at all, considering the source and destination are Raidz2 stripes of 5400rpm disks.

Additional lessons learned:

Judging by my network stats for the largest snapshot, I'm not sure removing the SLOG device would have helped; there was the occasional dip, but the transfer charged on steadily for the most part, so I don't think it was getting in the way.

For a migration like this, rollbacks are indeed the way to go vs. cloning the transferred snapshots.

I ran into a few minor gotchas, mostly due to not thinking every piece through:

I couldn't get a snapshot of my backup pool transferred. I'd get a fault at the end of the xfer, and rather than troubleshoot after the second failure (each attempt took about an hour) I bagged it and ended up transferring those files via rsync, which worked well. That got me wondering whether, for this kind of server migration, rsync wouldn't be better overall, though nc + zfs send/recv certainly was fast for my biggest pool.

I ran into an arcane problem with permissions: back when I installed FreeNAS on server_old 4 years ago, the "media" user and group had a uid/gid of 816, but the newer versions of FreeNAS, which I installed on server_new, use 8675309 (Jeeeennny I got your number..). This was simple to fix in all my media jails that used the media user/group with pw usermod and pw groupmod, but it threw me for a bit of a loop at first since I wasn't on the lookout for it.

Other than that it has gone smoothly and everything is running fine on server_new. I'm going to give it a few days to settle in and make sure I don't need server_old as a reference, then blow away the latter, reinstall it with new pools as server_backup, and then set up replication jobs from server_new.

Hope someone finds this useful, but even if not, it was a good learning experience for me.

svtkobra7 · Jan 26, 2019

Ixian said:
Initially I tried inserting mbuffer in there i.e.

Code:

mbuffer -q -s 128k -m 1G | \

But that gave me problems and didn't really speed things up. Perfectly possible I don't know what I'm doing with it but in the end it was just complicating my learning process so I left it out.

Assumption from your post = You didn't add the requisite mbuffer commands on the server sending the snapshot (PUSH).

Using mbuffer looks something like this, and order matters o/c:
1. PULL: mbuffer -4 -s 128k -m 1G -I 9090 | zfs receive -Fd [RECV_DATASET]
2. PUSH: zfs send [POOL/DATASET@SNAP_NAME] | mbuffer -s 128k -m 1G -O [PULL IP]:9090;
3. Enjoy.
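
To make the placeholders concrete, here is a small sketch that only prints the two command lines so they can be reviewed before anything is run. The snapshot name, target dataset and port are example values (assumptions, not settings confirmed in this thread), and pull_cmd / push_cmd are hypothetical helper names.

```shell
# Example values only; adjust to your own systems.
SNAP="Slimz@migration"   # snapshot to send (assumed name)
PULL_IP="10.0.0.3"       # storage interface of the receiving box
PORT=9090                # any free TCP port

# Print the command lines instead of running them.
# Start the PULL side first, then the PUSH side.
pull_cmd() {
    printf 'mbuffer -4 -s 128k -m 1G -I %s | zfs receive -Fd %s\n' "$PORT" "$1"
}
push_cmd() {
    printf 'zfs send %s | mbuffer -s 128k -m 1G -O %s:%s\n' "$SNAP" "$PULL_IP" "$PORT"
}

pull_cmd Slimz   # run the printed line on PULL
push_cmd         # run the printed line on PUSH
```

Same buffer settings (-s 128k -m 1G) on both ends, as in the steps above.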

I've presented the example as such, as the switches provided are exactly what the GUI would use if snapshot replication tasks were set up (and you mentioned replication jobs will be added in the future).

I recently set up a replication regimen for my two servers and would recommend you perform the initial send/recv via nc or mbuffer and then turn on periodic snapshot tasks on PUSH, followed by setting up replication tasks, again on PUSH. For me, replication via GUI would never exceed 2G and considering I had ~50 TB to sync, I wanted that initial snapshot to complete at a speed much faster than 2G.

Also, when you go to set up those replication tasks, the best performance is offered by:
- Replication Stream Compression = Off
- Encryption Cipher = Disabled (non-issue with a secure LAN)

In reply to your comment about the new system's write speed ...

Ixian said:
Little burst-y in spots but a 3.6G Avg isn't bad for a bunch of WD Reds. I suspect I could improve this all the same. 13TB is going to take around 8 hours or so, certainly much faster than it would be over a 1GB link but thinking I could tune this more for future replication tasks. As always, any advice appreciated.

You are getting 450 MB/s writes out of RaidZ2 8x1x10.0 TB pool
Streaming write speed = (N - p) * Streaming write speed of a single drive, where N = # of drives in pool (8) and p = # of parity drives, or 2 for RaidZ2
So doing some light math shows ... 450 MB/s = (8 - 2) * 75 MB/s, or in other words you are only getting 75 MB/s per drive, which could be higher.
I would think that you can get 115 MB/s per drive, or ~700 MB/s.
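
The rule of thumb above is easy to script if you want to try other layouts. A minimal sketch, assuming whole-number MB/s figures; per_drive and pool_rate are made-up function names.

```shell
# per_drive <pool MB/s> <drives> <parity>: implied per-drive streaming rate
per_drive() { echo $(( $1 / ($2 - $3) )); }

# pool_rate <per-drive MB/s> <drives> <parity>: expected pool streaming rate
pool_rate() { echo $(( $1 * ($2 - $3) )); }

per_drive 450 8 2   # observed 450 MB/s on 8-wide RaidZ2 -> 75
pool_rate 115 8 2   # 115 MB/s per drive on the same layout -> 690
```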

Other comments regarding speed ...

You mentioned a 13 TB media dataset; if it holds mostly large files, you might consider increasing the recordsize for better performance:

Code:

zfs set recordsize=1M Pool/Dataset

If you do so, only new writes use the 1M recordsize; existing data keeps the recordsize it was written with.

Ixian said:
I've done some basic tuning for the 10GB NICs on both - hw.ix.enable_aim = 0 as a tunable, mtu 9000 for both, etc. I have both servers peered together; here's my iperf report:

9.9 Gbps is solid o/c; however, I'm really surprised that you got there only using a single tunable.

Now here is where you can change a few items for better utilization of stated resources ... but also, you have to make a few decisions, and without additional info, I can't really offer any guidance ...

Ixian said:
1x 400GB Intel DC3700 (log & cache for pools) ... Should I disable the log/cache on one/both server pools?

My thoughts / prompts for you to consider:
- SLOGs
  - Assumption: By "Intel DC3700" do you mean P3700?
  - New: You definitely want your faster SLOG on your primary, so move the Optane 905p to your new system.
    - Carve out a partition from your 905p to use as a SLOG for your spinners.
    - Regarding your mirrored 860s, is a SLOG needed?
      - If so, you can carve out another partition from your Optane (it can handle serving dual duty), or,
      - If you leave the P3700 in your new system, you can use this as a SLOG.
  - Old: I don't think you benefit from having a SLOG in your replication target for two reasons:
    - Considering you will be replicating snapshots, that SLOG doesn't offer protection if a power loss event were to occur. The data on the SLOG will be committed to the pool during the next TXG commit, but an interrupted replication task would start anew anyway, so your efforts to ensure data is safely acknowledged to the pool are moot.
    - But let's say that wasn't actually the case, and your SLOG is in use (sync = always): your write speeds will be slower than if sync = disabled.
    - What I'm trying to say here is that no data protection is offered in this scenario, and it comes at the expense of performance (reduced write speeds).
  - Decisions, decisions:
    - If you carved out a SLOG for your 860s both NVMe drives are in use in your new system, and you are done, although perhaps we still haven't achieved optimal resource utilization.
    - If you didn't do that (and I'm not sure it is needed), you get to determine a new use case for the P3700 and whether that use case is in your old or new system.
    - Trying to think of all possible options for you, you could also mirror the Optane 905p and P3700 logs, for redundant log devices on your pool of spinners.

- L2ARC
  - In your configuration, cache at best provides no benefit, and at worst harms performance.
  - I'd be willing to bet the latter is the case as you are displacing RAM that would be used for ARC, instead using it for L2ARC mapping.
  - While the rule of thumb of 1GB of RAM per 1TB of storage loses relevance at higher capacities, I personally would want to use every bit of your precious RAM that you can for ARC.
  - Don't use L2ARC on either system.

I hope my comments are of some help. I'm happy to provide additional feedback to any questions/comments you may have.

Ixian · Jan 28, 2019

Hey, thanks, this is super helpful.

I'll be doing this again, the other way, in the near future (I am waiting for 11.2U2 to drop next week) when I rebuild my backup pools and sync all the data back to it. I'll do some tests with mbuffer in the meantime.

The 3700 is a P3700, yes, the NVMe/PCIe version.

I've removed the L2Arc cache from my pool.

For the 13TB bulk data transfer, should I disable write sync on the backup and remove the cache pool on both?

What I'd like to do is, rebuild the pool on backup, do an initial snapshot sync to get everything back on it, and then have regular snapshot replication rolling. What do you recommend schedule wise? Daily replication? Also, if I set up automated snapshots/replication how do I ensure the initial, manual snap/replication I do to bulk move the data is sync'd up as the reference with it?

svtkobra7 · Jan 28, 2019

Ixian said:
Hey, thanks, this is super helpful.

My pleasure - I know what a PITA stuff can be without a helpful hand, and saw you didn't get any love (replies), so wanted to assist a fellow FreeNASer.

Ixian said:
The 3700 is a P3700, yes, the NVMe/PCIe version.

That's what I thought and based my comments on.
TANGENT = For $hits and giggles I set up a pool on a P3700 once and threw an Optane SLOG behind it. Say what??? Yes, NVMe backed with more NVMe works provided the data disk > log disk, by that I mean "diskinfo -wS /dev/nvd1" produces a result that looks like this:

Ixian said:
I've removed the L2Arc cache from my pool.

Honestly with your current set up, this is the best L2ARC you could ever set up.
One day, if you added a bit more ram and your use case called for it, you can always revisit. (I have 200 GB of RAM on each FreeNAS instance and I don't use L2ARC)

Ixian said:
For the 13TB bulk data transfer, should I disable write sync on the backup and remove the cache pool on both?

Yes, I would set sync = disabled on the replication target pool for the initial sync (it provides no benefit and could only hurt performance during that replication).
I have a bit more to add here to ensure you head in the write (pun intended) direction, so I'll reply back a bit later when I have a sec.
Also, in your last comment, you noted you removed L2ARC, so you have already removed cache. Remember a SLOG ≠ cache.
- But even if you were running 1TB of RAM and had NVMe drives in capacities of integers I can't count to, L2ARC (read cache) isn't going to do anything for you (as far as replication is concerned).

Ixian said:
Also, if I set up automated snapshots/replication how do I ensure the initial, manual snap/replication I do to bulk move the data is sync'd up as the reference with it?

Stated differently, are you asking how FreeNAS "knows" to replicate incrementals against a baseline (initial replication), since the former = GUI / automated and the latter = manually performed?
If so, for Replication Tasks, the status is reported under Storage > Replication Tasks, where you will see the name of the last snapshot replicated and the status = "Up to Date." (see below image from my PUSH FreeNAS instance - a pic is worth 1000 words)
Let's say you named your initial snapshot "Volume/Dataset@SNAP-DATE" and sent it via zfs send / recv. As long as this baseline exists on both servers, your incrementals (automated replication) continue as intended.
- Further expanding on this point: if you delete "Volume/Dataset@SNAP-DATE" and there is no other snapshot common to both machines, the automated replication task will start from scratch (i.e. attempt to send 13TB again). Example: Volume/Dataset@auto-20180128.1100-1h exists on the target, but you go out of town, the power goes out, and your server is off for 2 weeks. If 2 weeks is longer than your longest snapshot lifetime, that snapshot is deleted automatically on PUSH when it boots back up, and no common snapshot remains.
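
One way to check for a shared baseline before relying on incrementals is to compare the snapshot names on both sides. A minimal sketch, assuming each list holds one snapshot name per line (the part after the @), oldest first; newest_common and the sample names are hypothetical.

```shell
# newest_common <push_list> <pull_list>
# Prints the newest snapshot name present in both lists (empty if none).
# A list could be produced on each box with something like:
#   zfs list -H -d 1 -t snapshot -o name -s creation Volume/Dataset | sed 's/.*@//'
newest_common() {
    grep -Fx -f "$2" "$1" | tail -n 1
}

# Sample data standing in for the two boxes.
push=$(mktemp); pull=$(mktemp)
printf '%s\n' migration auto-1 auto-2 auto-3 > "$push"
printf '%s\n' migration auto-1 > "$pull"

base=$(newest_common "$push" "$pull")
echo "incremental base: $base"
# With a base, an incremental send could look like:
#   zfs send -I Volume/Dataset@$base Volume/Dataset@auto-3
rm -f "$push" "$pull"
```

If the function prints nothing, no common snapshot is left and a full send is the only option.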

Ixian said:
What I'd like to do is, rebuild the pool on backup, do an initial snapshot sync to get everything back on it, and then have regular snapshot replication rolling. What do you recommend schedule wise? Daily replication?

Included in my reply I'm going to call out a few bits, which you may or may not be aware of, but wish I knew in advance (some is in the User Guide, which I never consult).

For a given Volume/Dataset, you can set up multiple snapshots which occur at intervals of X and have a lifetime of Y, where X and Y are your choice defined when you set up the task (and can be edited later o/c - more generally below).
- And a single replication task for a given Volume/Dataset, replicates those multiple snapshots (many => one relationship).
Also, let's say you have Volume/Dataset/1 ... 10, but you only want to replicate /1 ... 5 ...
- Set up a recursive Periodic Snapshot task for Volume/Dataset with the intervals / lifetimes desired and enable it.
- Set up Periodic Snapshot tasks for Volume/Dataset/1 ... 5, but don't enable them. Intervals / lifetimes don't matter here.
- Set up Replication Tasks for Volume/Dataset/1 ... 5 (individually). The above bullet is what allows for that more granular control.
  - You may notice that third task (above image = not enabled and not run since boot). That is because enabling it would put me over 100% capacity on PULL (12 x 10TB drives on PUSH / 12 x 6 TB drives on PULL), so I can't simply set up a single Periodic Snapshot task for Volume/Dataset and a single Replication Task for the same. Rather, I need an enabled recursive "Volume/Dataset" Periodic Snapshot task + "dummy" snapshot tasks for "Volume/Dataset/1 ... 5" (which are not enabled), and that allows me to exclude the dataset which is too large (shown as not enabled in that image ... I disabled it after it grew too large).
  - By extension, ignore the above if you want to replicate Volume/Dataset/1 ... 10: you only need a Periodic Snapshot Task for Volume/Dataset and a Replication Task for the same.
So where a Periodic Snapshot Task is defined primarily with the following variables (and named based on their values):
- Snapshot Lifetime = x hours, days, weeks, months, years
- Interval = y minutes, z hours, 1 day, 1/2/4 weeks, ...
... I'd create a rolling snapshot schedule, something possibly like the below (which is "aggressive" only for illustration purposes and I don't have a precise schedule to offer as you should do what works for you, but just ensure your schedule thins out older snapshots over time):
- Periodic Snapshot #1 for Dataset #1: Interval 5 min / Lifetime 1 hour
- Periodic Snapshot #2 for Dataset #1: Interval 1 hour / Lifetime 1 day
- Periodic Snapshot #3 for Dataset #1: Interval 1 day / Lifetime 1 week
- Periodic Snapshot #4 for Dataset #1: Interval 1 week / Lifetime 1 month
- Etc. (also select "Delete stale snapshots on remote system" when defining the replication task).
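
To get a feel for how many snapshots a schedule like this keeps around at any one time, divide each lifetime by its interval. A rough sketch, assuming a 30-day month and ignoring off-by-one effects at expiry; retained is a made-up helper name.

```shell
# retained <lifetime_minutes> <interval_minutes>
# Approximate number of snapshots one periodic task keeps at a time.
retained() { echo $(( $1 / $2 )); }

HOUR=60
DAY=$(( 24 * HOUR ))
WEEK=$(( 7 * DAY ))
MONTH=$(( 30 * DAY ))             # assumption: 30-day month

t1=$(retained "$HOUR" 5)          # 5 min interval / 1 hour lifetime
t2=$(retained "$DAY" "$HOUR")     # 1 hour interval / 1 day lifetime
t3=$(retained "$WEEK" "$DAY")     # 1 day interval / 1 week lifetime
t4=$(retained "$MONTH" "$WEEK")   # 1 week interval / 1 month lifetime

echo "$t1 + $t2 + $t3 + $t4 = $(( t1 + t2 + t3 + t4 )) snapshots"
# prints: 12 + 24 + 7 + 4 = 47 snapshots
```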

[I hope this helps / clears up any confusion and please reply back should you need follow up on this or anything else]

Ixian · Jan 28, 2019

My mistake - I meant "should I remove the log drive" from the pool before replication. It sounds like it will get in the way.

svtkobra7 · Jan 28, 2019

Since there is no benefit, I would, yeah. Here is a post that is worth a read: https://forums.servethehome.com/ind...in-zfs-send-recv-transfers.13988/#post-146146

Also this gentleman, much more knowledgeable than myself, sums it up quite nicely:

Terry Kennedy said:
At the size of transfers we're talking about here, your ZIL / SLOG device is going to fill up and then start flushing to the hard drives - at the most, it will compensate for bursty traffic and feed the disks at their maximum sustained I/O rate. At worst, it will hurt performance as it runs out of erased sectors and has to start erasing in order to store data. You may benefit from temporarily removing the log device and possibly setting sync=disabled on the pool, assuming the pool is empty when you start. Don't forget to re-add the log device and reset sync to default (normally with "zfs inherit", since setting it to default will still be counted as a local parameter change).
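
The sequence described in that quote could look roughly like the sketch below. It is a dry run: the run helper only prints each command, and LOG_DEV is a placeholder, not a device name taken from this thread (check zpool status for the real one). The function names are hypothetical too.

```shell
POOL="Slimz"               # pool on the replication target
LOG_DEV="gptid/EXAMPLE"    # placeholder log device name

run() { echo "+ $*"; }     # dry run: print only; change the body to "$@" to execute

before_bulk_receive() {
    run zpool remove "$POOL" "$LOG_DEV"    # take the SLOG out of the pool
    run zfs set sync=disabled "$POOL"      # only while the bulk receive runs
}

after_bulk_receive() {
    run zpool add "$POOL" log "$LOG_DEV"   # put the SLOG back
    run zfs inherit sync "$POOL"           # return sync to its default
}

before_bulk_receive
# ... zfs send / receive happens here ...
after_bulk_receive
```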

svtkobra7 · Jan 28, 2019

You may have some fun benchmarking your two pools. I'd suggest seeing what your numbers look like without a SLOG and then running the same commands with a SLOG. dd is probably the easiest tool to use. Compression artificially (and horribly) inflates your results when writing zeros, which is why it's turned off here, but it should always be turned on otherwise (it's free).

For 128k recordsize (the default, so setting it explicitly is unneeded):

Code:

zfs create Tank1/disabled
zfs set recordsize=128k compression=off sync=disabled Tank1/disabled
dd if=/dev/zero of=/mnt/Tank1/disabled/tmp.dat bs=2048k count=25k
dd of=/dev/null if=/mnt/Tank1/disabled/tmp.dat bs=2048k count=25k
zfs destroy Tank1/disabled

zfs create Tank1/standard
zfs set recordsize=128k compression=off sync=standard Tank1/standard
dd if=/dev/zero of=/mnt/Tank1/standard/tmp.dat bs=2048k count=25k
dd of=/dev/null if=/mnt/Tank1/standard/tmp.dat bs=2048k count=25k
zfs destroy Tank1/standard

zfs create Tank1/always
zfs set recordsize=128k compression=off sync=always Tank1/always
dd if=/dev/zero of=/mnt/Tank1/always/tmp.dat bs=2048k count=25k
dd of=/dev/null if=/mnt/Tank1/always/tmp.dat bs=2048k count=25k
zfs destroy Tank1/always

For 1M recordsize (faster, I use it for a number of datasets):
Substitute zfs set recordsize=1M for zfs set recordsize=128k; everything else remains the same.

For sure the 450 MB/s write speed # I backed into is absolutely a better indicator of real world performance (as it was exactly that), but if you execute the above as recommended, you will get a better feel for what your pools are capable of and the impact of a SLOG, etc.
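
The three blocks above differ only in the sync setting, so they fold into a loop. A dry-run sketch that prints the commands rather than running them (run and bench are hypothetical helper names; Tank1 is the pool name from the commands above):

```shell
POOL="Tank1"
RECORDSIZE="128k"        # use 1M for the second round

run() { echo "+ $*"; }   # dry run: print only; change the body to "$@" to execute

bench() {
    for mode in disabled standard always; do
        ds="$POOL/$mode"
        run zfs create "$ds"
        run zfs set recordsize="$RECORDSIZE" compression=off sync="$mode" "$ds"
        run dd if=/dev/zero of="/mnt/$ds/tmp.dat" bs=2048k count=25k   # write test
        run dd if="/mnt/$ds/tmp.dat" of=/dev/null bs=2048k count=25k   # read test
        run zfs destroy "$ds"
    done
}

bench

# Each test file is 2048 KiB x 25k (25 * 1024) blocks:
echo "$(( 2048 * 25 * 1024 / 1024 / 1024 )) GiB per test file"   # prints: 50 GiB per test file
```

At 50 GiB the read test is less likely to be served entirely from ARC on a 64GB box, though some caching will still show up in the read numbers.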
