Re: can rsync scan files only with mtime since T?

From: Ming Zhang <blackmagic02881_at_gmail.com>
Date: Fri, 24 Aug 2007 13:10:25 -0400

On Fri, 2007-08-24 at 14:30 +1200, Darryl Dixon - Winterhouse Consulting
wrote:
> CC: to inotify list as message is probably of interest there.
>
>
> > On Fri, 2007-08-24 at 09:52 +1200, Darryl Dixon - Winterhouse Consulting
> > wrote:
> >> > Hi
> >> >
> >> > I have a file system that contains millions of small files. Since I
> >> > back it up every day with rsync over a slow WAN link, I think it would
> >> > be nice if rsync could do this:
> >> >
> >> > An option that lets rsync check with the remote rsync daemon only
> >> > about local files whose last modification time is newer than one day
> >> > ago (that is, modified since yesterday's backup). This could greatly
> >> > reduce the WAN traffic.
> >> >
> >> > Is this doable with current rsync?
> >> >
> >>
> >> Hi Ming, List,
> >>
> >> I thought I'd reply as I have used rsync in a similar scenario (~1TB of
> >> 13 million files in two filesystems backed up offsite).
> >>
> >> There are a couple of approaches that will do what you want - what OS
> >> are you using? (Windows, Solaris, Linux ...?). One is to run
> >> 'find -mtime -1 >
> >
> > Linux,
>
>
> Good, then inotify will work for you :)
>
>
> >
> >> my_files.list' and then use the rsync --files-from=my_files.list to send
> >> only the new files. Running find can be time consuming(!), but
> >> effectively that's what you'd be doing with an 'rsync -mtime -1' option
> >> anyway.
> >
> > Does rsync have such an option? I do not know of one.
> >
>
>
> No, it currently doesn't, I was merely observing that if it did, it would
> be no different to using such an option with 'find'. :)

ic.
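
For illustration, the find-plus---files-from approach described above could be sketched in Python roughly as follows. This is only a minimal sketch: the names recent_files and write_list are hypothetical, and like 'find -mtime -1' it still has to walk the whole tree.

```python
import os
import stat
import time


def recent_files(root, max_age_seconds=86400, now=None):
    """Return paths, relative to root, of regular files modified within
    the last max_age_seconds (roughly what 'find -mtime -1' selects)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished between listing and stat
            if stat.S_ISREG(st.st_mode) and st.st_mtime > cutoff:
                found.append(os.path.relpath(full, root))
    return sorted(found)


def write_list(paths, list_path):
    """Write one path per line, the format rsync --files-from reads."""
    with open(list_path, "w") as fh:
        for p in paths:
            fh.write(p + "\n")
```

The resulting list could then be passed to something like 'rsync -a --files-from=my_files.list /data/ backuphost::module/', where the source directory and destination are placeholders, not values from this thread. The paths in the list are relative to the source directory given on the rsync command line.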

>
>
> >> Another option (and this is the one that I used) is to audit filesystem
> >> events as they are happening, and keep a live list of all modified
> >> files all the time. This list can then be fed to rsync with the
> >> --files-from option.
> >>
> >> On Solaris this can be achieved with the BSM module and NFS logging (if
> >> you're running an NFS server). On Linux I heavily modified pyinotify
> >> (http://pyinotify.sourceforge.net) to achieve the same result. The
> >> outcome is that every 5 minutes during the day I ship all the files
> >> changed in the previous 5 minutes offsite to the backup server. This
> >> works perfectly - and the volume of change is about 30,000 new files
> >> per day!
> >
> > This sounds cool. I will look into inotify. My last look at inotify
> > gave me the impression that it is not scalable enough to observe a file
> > system with several million files. Maybe I am seriously wrong.
> >
> >
> > Just curious, how do you deal with files/directories that were deleted
> > or renamed?
> >
>
> It *is* cool ;) Sebastien has done a good job with pyinotify, and the rest
> of the kernel crew with inotify underneath it, and I have extended
> pyinotify to meet our use case without any trouble. It is being used for
> three filesystems, two of which are currently live in Production, the
> third is going live this weekend. The two currently live are:
> 1) 1.1 million files, ~120GB
> 2) 8.1 million files, ~550GB
> As noted, we get in many new files per day into these filesystems,
> scattered randomly all over the place. All get shipped offsite within 5
> minutes of creation, move, or modification.
>
> I don't delete directories offsite, because our requirements are not for a
> complete mirror of the original source, but rather for a constant offsite
> backup of everything new or altered across the course of a business day.
> Directories which are deleted have their inotify watches removed,
> directories which are moved get updated and continue to be watched, as
> well as getting updated at the offsite server. New directories are noticed
> and added automatically. Each night, I run a 'find -mtime -1' over the
> source and destination servers to check that all new files have been
> transported; I haven't had an item missed yet.

I see. I need to delete files as well, so that means some extra work here.
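
To make the "live list of modified files" idea concrete, here is a small sketch of the bookkeeping side only. It is not Darryl's modified pyinotify: the ChangeLog class is hypothetical, and the watcher that would feed it (pyinotify or anything else) is deliberately left out.

```python
import threading


class ChangeLog:
    """Collect paths reported by a filesystem watcher and hand them out
    in batches. Hypothetical helper, not part of pyinotify."""

    def __init__(self):
        self._lock = threading.Lock()
        self._changed = set()
        self._deleted = set()

    def record_change(self, path):
        # Called for create, modify, or move-to events.
        with self._lock:
            self._deleted.discard(path)
            self._changed.add(path)

    def record_delete(self, path):
        # Called for delete or move-from events.
        with self._lock:
            self._changed.discard(path)
            self._deleted.add(path)

    def drain(self):
        """Return (changed, deleted) as sorted lists and reset both sets."""
        with self._lock:
            changed = sorted(self._changed)
            deleted = sorted(self._deleted)
            self._changed.clear()
            self._deleted.clear()
        return changed, deleted
```

A timer could call drain() every five minutes and write the changed paths to a file for rsync --files-from. The deleted list is the "extra work": how to apply it on the remote side is not covered in this thread, so it is only collected here.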

>
> I can send you a tar of my modified pyinotify source, if you wish. One of
> the biggest problems encountered was the startup time to begin monitoring
> of such a large directory tree with inotify. I have solved this by getting
> pyinotify to keep a model in memory of the structure of the directory
> tree, and when it starts it informs inotify and creates watches based upon
> this model, rather than having to wait for a scan of the entire
> filesystem. The scan happens in the background and simply adds any new
> directories that may have been added since it was last shut down.

I would like to see this. I believe I will have the same problem.
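
A rough sketch of the startup trick as I understand it from the description above: keep a saved model of the directory tree, add watches from it immediately, and let a background scan pick up directories created since the last shutdown. All names here are hypothetical, the JSON file format is my assumption, and add_watch stands in for whatever the real watcher provides.

```python
import json
import os
import threading


def save_model(dirs, model_path):
    """Persist the set of known directories (assumed format: JSON list)."""
    with open(model_path, "w") as fh:
        json.dump(sorted(dirs), fh)


def load_model(model_path):
    """Load the saved directory set, or an empty set if none is usable."""
    try:
        with open(model_path) as fh:
            return set(json.load(fh))
    except (OSError, ValueError):
        return set()


def new_directories(root, known):
    """Yield directories under root (including root) not yet in known."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        if dirpath not in known:
            yield dirpath


def start_watching(root, model_path, add_watch):
    """Watch everything in the saved model right away, then scan in the
    background for directories the model does not know about.
    add_watch is a caller-supplied callback (hypothetical)."""
    known = {d for d in load_model(model_path) if os.path.isdir(d)}
    for d in sorted(known):
        add_watch(d)  # fast path: no tree walk needed

    def catch_up():
        for d in new_directories(root, known):
            add_watch(d)
            known.add(d)
        save_model(known, model_path)

    worker = threading.Thread(target=catch_up, daemon=True)
    worker.start()
    return worker
```

Directories that were in the model but no longer exist are simply skipped at startup; changes made inside already-known directories while the watcher was down would still need a separate check such as the nightly find.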

Though I need to upgrade to RHEL5 first; RHEL4 might not have inotify
yet.

thanks!

>
> regards,
> Darryl Dixon
> Winterhouse Consulting Ltd
> http://www.winterhouseconsulting.com
> darryl.dixon_at_winterhouseconsulting.com
> +64 21 33 44 13
>
>

-- 
Ming Zhang
@#$%^ purging memory... (*!%
http://blackmagic02881.wordpress.com/
http://www.linkedin.com/in/blackmagic02881
--------------------------------------------
Received on Fri Aug 24 2007 - 19:11:16 CEST
