Re: can rsync scan files only with mtime since T?

From: Darryl Dixon - Winterhouse Consulting <darryl.dixon_at_winterhouseconsulting.com>
Date: Fri, 24 Aug 2007 14:30:00 +1200 (NZST)

CC: to inotify list as message is probably of interest there.

> On Fri, 2007-08-24 at 09:52 +1200, Darryl Dixon - Winterhouse Consulting
> wrote:
>> > Hi
>> >
>> > I have a file system that contains millions of small files. Since I
>> > backup it everyday with rsync using slow WAN link, I think it will be
>> > nice that if rsync can do this:
>> >
>> > An option that let rsync only check with remote rsync daemon about local
>> > files that has last modification time newer than one day ago (so is
>> > modified since yesterday backup). This can greatly reduce the WAN
>> > traffic.
>> >
>> > Is this doable with current rsync?
>> >
>>
>> Hi Ming, List,
>>
>> I thought I'd reply as I have used rsync in a similar scenario (~1TB of 13
>> million files in two filesystems backed up offsite).
>>
>> There are a couple of approaches that will do what you want - what OS are
>> you using? (Windows, Solaris, Linux ...?). One is to run 'find -mtime -1
>> >
>
> Linux,

Good, then inotify will work for you :)

>
>> my_files.list' and then use the rsync --files-from=my_files.list to send
>> only the new files. Running find can be time consuming(!), but effectively
>> that's what you'd be doing with an 'rsync -mtime -1' option anyway.
>
> rsync has such option? i do not know.
>

No, it currently doesn't, I was merely observing that if it did, it would
be no different to using such an option with 'find'. :)
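
For anyone who would rather script the find-based approach than shell out,
here is a minimal sketch in Python. It only illustrates the idea: the
function names are my own invention, the 24-hour cutoff mirrors
'find -mtime -1', and it assumes no filename contains a newline.

```python
import os
import stat
import time


def files_modified_since(root, cutoff):
    """Return paths (relative to root) of regular files with mtime >= cutoff."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished between listing and stat
            if stat.S_ISREG(st.st_mode) and st.st_mtime >= cutoff:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)


def write_file_list(paths, list_path):
    """Write one relative path per line, the format --files-from reads."""
    with open(list_path, "w") as fh:
        for p in paths:
            fh.write(p + "\n")


def rsync_command(root, list_path, dest):
    """Build (but do not run) an rsync invocation for the listed files."""
    # Paths in the --files-from list are taken relative to the source dir.
    return ["rsync", "-a", "--files-from=" + list_path, root, dest]
```

Typical use would be files_modified_since(root, time.time() - 86400),
then write_file_list() and handing the result of rsync_command() to
subprocess. The walk still visits every file, so it costs about as much
as running find itself.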

>> Another option (and this is the one that I used) is to audit filesystem
>> events as they are happening, and keep a live list of all modified files
>> all the time. This list can then be fed to rsync with the --files-from
>> option.
>>
>> On Solaris this can be achieved with the BSM module and NFS logging (if
>> you're running an NFS server). On Linux I heavily modified pyinotify
>> (http://pyinotify.sourceforge.net) to achieve the same result. The outcome
>> is that every 5 minutes during the day I ship all the files changed in the
>> previous 5 minutes offsite to the backup server. This works perfectly -
>> and the volume of change is about 30,000 new files per day!
>
> this sounds cool. i will look into this inotify. my last look at inotify
> give me an impression that it can not scalable enough to observe a file
> system with multiple million files. maybe i am seriously wrong.
>
>
> just curious, how you deal with file/directory that were deleted or
> renamed.
>

It *is* cool ;) Sebastien has done a good job with pyinotify, and the rest
of the kernel crew with inotify underneath it, and I have extended
pyinotify to meet our use case without any trouble. It is being used for
three filesystems, two of which are currently live in Production, the
third is going live this weekend. The two currently live are:
1) 1.1 million files, ~120GB
2) 8.1 million files, ~550GB
As noted, many new files (about 30,000) arrive in these filesystems every
day, scattered randomly all over the place. All get shipped offsite within
5 minutes of creation, move, or modification.
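
The batching side of this can be sketched roughly as below. This is not my
modified pyinotify code, just an illustration of the pattern: ChangeBatcher,
record() and drain() are hypothetical names, and the sketch assumes that
whatever handles inotify events calls record() with each changed path while
a timer calls drain() every 5 minutes.

```python
import threading


class ChangeBatcher:
    """Collect changed paths between shipments, without duplicates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def record(self, path):
        # Called from the event-handling side for every changed file.
        with self._lock:
            self._pending.add(path)

    def drain(self):
        # Called from the shipping side; swaps in a fresh set so events
        # arriving during the transfer land in the next batch.
        with self._lock:
            batch, self._pending = self._pending, set()
        return sorted(batch)
```

Each drained batch can be written out one path per line and passed to rsync
with --files-from. Using a set means a file modified many times within the
interval is shipped only once.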

I don't delete directories offsite, because our requirements are not for a
complete mirror of the original source, but rather for a constant offsite
backup of everything new or altered across the course of a business day.
Directories which are deleted have their inotify watches removed;
directories which are moved get updated and continue to be watched, as
well as getting updated at the offsite server. New directories are noticed
and added automatically. Each night, I run a 'find -mtime -1' over the
source and destination servers to check that all new files have been
transported; I haven't had an item missed yet.
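
The nightly check amounts to a set difference between the two listings. A
minimal sketch, assuming both find runs were started from the equivalent
top-level directory so that the relative paths line up, and that rsync
preserved modification times on the destination (the function names are
made up for illustration):

```python
def parse_find_output(text):
    """Turn the text output of find into a set of normalised relative paths."""
    paths = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":
            continue
        # find started with '.' prints paths with a leading "./".
        if line.startswith("./"):
            line = line[2:]
        paths.add(line)
    return paths


def missing_at_destination(source_listing, dest_listing):
    """Return paths present in the source listing but not the destination."""
    return sorted(parse_find_output(source_listing) - parse_find_output(dest_listing))
```

An empty result means every recently modified file on the source has a
counterpart offsite; anything listed can be fed straight back to rsync.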

I can send you a tar of my modified pyinotify source, if you wish. One of
the biggest problems encountered was the startup time to begin monitoring
of such a large directory tree with inotify. I have solved this by getting
pyinotify to keep a model in memory of the structure of the directory
tree, and when it starts it informs inotify and creates watches based upon
this model, rather than having to wait for a scan of the entire
filesystem. The scan happens in the background and simply adds any new
directories that may have been added since it was last shut down.
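
To give a feel for the startup trick, here is a rough sketch. It is not the
actual patch: saving the model as a JSON list of directory paths is an
assumption made for the example, and add_watch stands in for whatever call
registers an inotify watch on a directory.

```python
import json
import os


def save_model(model_path, directories):
    """Persist the known directory tree as a sorted JSON list."""
    with open(model_path, "w") as fh:
        json.dump(sorted(directories), fh)


def load_model(model_path):
    """Load the saved directory list; an unreadable model means start empty."""
    try:
        with open(model_path) as fh:
            return set(json.load(fh))
    except (OSError, ValueError):
        return set()


def watch_from_model(model_path, add_watch):
    """Create watches straight from the saved model, skipping a full scan."""
    watched = set()
    for directory in sorted(load_model(model_path)):
        if os.path.isdir(directory):  # drop directories deleted while down
            add_watch(directory)
            watched.add(directory)
    return watched


def background_scan(root, watched, add_watch):
    """Walk the tree afterwards and watch anything the model did not know."""
    added = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        if dirpath not in watched:
            add_watch(dirpath)
            watched.add(dirpath)
            added.append(dirpath)
    return added
```

watch_from_model() touches only the directories already in the model, so
monitoring can begin before the slow walk in background_scan() has finished.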

regards,
Darryl Dixon
Winterhouse Consulting Ltd
http://www.winterhouseconsulting.com
darryl.dixon_at_winterhouseconsulting.com
+64 21 33 44 13
Received on Fri Aug 24 2007 - 04:56:47 CEST
