Hello,
I thought I'd share a pickle I'm facing and haven't yet decided how to resolve.
The issue is that I have about 1 million files of different types and sizes, e.g. jpg, gif, mp4 etc.,
mainly media files. There are lots of duplicate files scattered across different paths, sometimes with
the same filename and sometimes with a different one.
So, first things first, I need to remove the duplicates. There are various tools that can search
for duplicates, and all have their pros and cons. Most are based on an MD5 or SHA-1 checksum;
some are smarter and look inside the file.
Considering that file sizes range from 100 KB all the way up to 5 GB, checksumming whole files
is a nightmare. It is not time-efficient to pull almost 2 TB of files over the network just to do a
comparative checksum, and most programs can't even handle the congestion or the sheer number
of files :P.
So I was thinking of a similar yet different approach: read the first and last 10 KB of a file,
plus a 20 KB chunk at an offset of 50 KB. In short:
- 10 KB of data from the beginning of the file
- 20 KB of data at an offset of 50 KB from the beginning of the file
- 10 KB of data from the end of the file
This results in a concatenated 40 KB sample per file.
Once I have this 40 KB sample, I run a checksum against it, since there should be enough data to make
each sample nearly unique (to a large extent).
Reading 10 KB + 20 KB + 10 KB and writing a 40 KB temp file to feed an MD5 checksum still generates
some I/O on the network, but far less than reading even a 5 MB file across the network, let alone a 5 GB one.
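For what it's worth, here is a minimal Python sketch of the sampling-and-hashing step as I picture it (sizes as above; the partial_fingerprint name is just for illustration). It feeds the three chunks straight into MD5 instead of writing a 40 KB temp file, which shaves off a bit more I/O:

import hashlib
import os

# Sizes from the scheme described above.
HEAD_BYTES = 10 * 1024      # 10 KB from the start
MID_OFFSET = 50 * 1024      # middle chunk starts at 50 KB
MID_BYTES = 20 * 1024       # 20 KB from the middle
TAIL_BYTES = 10 * 1024      # 10 KB from the end

def partial_fingerprint(path):
    """Return (size, md5-of-sample) for a file, reading only ~40 KB of it."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        if size <= MID_OFFSET + MID_BYTES + TAIL_BYTES:
            # Small file: the sampled regions would overlap, so just hash it whole.
            md5.update(f.read())
        else:
            md5.update(f.read(HEAD_BYTES))       # first 10 KB
            f.seek(MID_OFFSET)
            md5.update(f.read(MID_BYTES))        # 20 KB at offset 50 KB
            f.seek(-TAIL_BYTES, os.SEEK_END)
            md5.update(f.read(TAIL_BYTES))       # last 10 KB
    return size, md5.hexdigest()

Carrying the file size along with the hash, as the return value does, is a cheap extra guard against false matches before any deeper comparison.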
The concept is similar to what is already out there, but in theory much faster. The collected data can be
inserted into a flat DB (SQLite), against which the appropriate queries can be run to find and remove the
duplicate files.
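For the SQLite part, a rough sketch of what the catalogue and the duplicate query could look like (the schema, the dedupe.db filename, and the helper names are placeholders I made up, not an existing tool):

import sqlite3

conn = sqlite3.connect("dedupe.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size INTEGER NOT NULL,
        fp   TEXT NOT NULL      -- partial-fingerprint MD5 from the sketch above
    )
""")

def record(path, size, fp):
    # Insert (or refresh) one scanned file in the catalogue.
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO files (path, size, fp) VALUES (?, ?, ?)",
            (path, size, fp),
        )

def duplicate_groups():
    # Files sharing both size and fingerprint are duplicate candidates.
    return conn.execute("""
        SELECT size, fp, COUNT(*) AS n, GROUP_CONCAT(path) AS paths
        FROM files
        GROUP BY size, fp
        HAVING COUNT(*) > 1
        ORDER BY n DESC
    """).fetchall()

Before actually deleting anything, I would re-verify each candidate group with a full checksum or a byte-by-byte compare, since a 40 KB sample can in principle collide for files that only differ in an unsampled region.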
This is simple to do as a script on Unix; on Windows, not without resorting to Cygwin :P.
I just wanted to share this and hear some feedback on other ways it could be done, or better yet,
whether there are decent open-source/freeware tools that can do this without congesting the
filesystem or the network.
Kind regards,
BL