Hello,
I thought I'd share a pickle I'm facing and haven't yet decided how to resolve.
The issue is that I have about 1 million files of different types and sizes, e.g. jpg, gif, mp4 etc.,
mainly media files. There are lots of duplicate files scattered across different paths, sometimes with
the same filename and sometimes with a different one.
So, first things first, I need to remove the duplicates. There are various tools that can search
for duplicates, and all have their pros and cons. Most are based on an MD5 or SHA-1 checksum;
some are smarter and look inside the file.
Considering that file sizes range from 100 KB all the way up to 5 GB, checksumming whole files
is a nightmare. It is not time-efficient to pull almost 2 TB of files over the network just to do a
comparative checksum, and most programs can't even handle the congestion or the sheer number
of files :P.
So I was thinking of a similar yet different approach: read the first and last 10 KB of a file,
plus a 20 KB chunk at an offset of 50 KB. In short:
- 10 KB of data from the beginning of the file
- 20 KB of data at an offset of 50 KB from the beginning of the file
- 10 KB of data from the end of the file
This results in a concatenated 40 KB sample per file.
Once I have this 40 KB sample, I run a checksum against it, since there should be enough data to make
each sample nearly unique (to a large extent).
Reading 10 KB + 20 KB + 10 KB and writing a 40 KB temp file to feed an MD5 checksum still generates
some I/O on the network, but far less than reading even a 5 MB file across the network, let alone a 5 GB one.
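For what it's worth, here is a minimal Python sketch of the sampling-and-hashing step as I picture it (sizes as above; the partial_fingerprint name is just for illustration). It feeds the three chunks straight into MD5 instead of writing a 40 KB temp file, which shaves off a bit more I/O:

import hashlib
import os

# Sizes from the scheme described above.
HEAD_BYTES = 10 * 1024      # 10 KB from the start
MID_OFFSET = 50 * 1024      # middle chunk starts at 50 KB
MID_BYTES = 20 * 1024       # 20 KB from the middle
TAIL_BYTES = 10 * 1024      # 10 KB from the end

def partial_fingerprint(path):
    """Return (size, md5-of-sample) for a file, reading only ~40 KB of it."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        if size <= MID_OFFSET + MID_BYTES + TAIL_BYTES:
            # Small file: the sampled regions would overlap, so just hash it whole.
            md5.update(f.read())
        else:
            md5.update(f.read(HEAD_BYTES))       # first 10 KB
            f.seek(MID_OFFSET)
            md5.update(f.read(MID_BYTES))        # 20 KB at offset 50 KB
            f.seek(-TAIL_BYTES, os.SEEK_END)
            md5.update(f.read(TAIL_BYTES))       # last 10 KB
    return size, md5.hexdigest()

Carrying the file size along with the hash, as the return value does, is a cheap extra guard against false matches before any deeper comparison.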
The concept is similar to what is already out there, but in theory much faster. The collected data can be
inserted into a flat DB (SQLite), against which the appropriate queries can be run to find and remove the
duplicate files.
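For the SQLite part, a rough sketch of what the catalogue and the duplicate query could look like (the schema, the dedupe.db filename, and the helper names are placeholders I made up, not an existing tool):

import sqlite3

conn = sqlite3.connect("dedupe.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size INTEGER NOT NULL,
        fp   TEXT NOT NULL      -- partial-fingerprint MD5 from the sketch above
    )
""")

def record(path, size, fp):
    # Insert (or refresh) one scanned file in the catalogue.
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO files (path, size, fp) VALUES (?, ?, ?)",
            (path, size, fp),
        )

def duplicate_groups():
    # Files sharing both size and fingerprint are duplicate candidates.
    return conn.execute("""
        SELECT size, fp, COUNT(*) AS n, GROUP_CONCAT(path) AS paths
        FROM files
        GROUP BY size, fp
        HAVING COUNT(*) > 1
        ORDER BY n DESC
    """).fetchall()

Before actually deleting anything, I would re-verify each candidate group with a full checksum or a byte-by-byte compare, since a 40 KB sample can in principle collide for files that only differ in an unsampled region.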
This is simple to do as a script on Unix; on Windows, not without resorting to Cygwin :P.
I just wanted to share this and hear some feedback on other ways it could be done, or better yet,
whether there are decent open-source/freeware tools that can do this without congesting the
filesystem or the network.
Kind regards,
BL