[lug] C(++) library to detect file type (a la 'file')
tkil at scrye.com
Fri Nov 1 11:29:22 MST 2002
>>>>> "Ralf" == rm <rm at fabula.de> writes:
Ralf> Or are you referring to perl's 'system("file $myfile")' function?
Ralf> That's not really an option since it spawns a shell which then
Ralf> spawns the 'file(1)' application ... nothing one would want to
Ralf> do when indexing a largish set of documents ;-) (currently
Ralf> ~100,000 docs).
Well, 'file' will open the file anyway (which costs a bit). It can
accept multiple files on the command line, which drops the ratio of
exec-to-file down to 0.01 or even lower. Versions exist (or could be
created) that could even take the list of files from another file,
which would drop you down to one exec for all your files.
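As a rough sketch of the batching idea (not a tested indexer), the C++ below builds one 'file' command line per batch of paths. The names shell_quote and build_file_commands are made up for this example, and the batch size of 100 is just the figure that gives the 0.01 exec-to-file ratio mentioned above. It only builds the command strings; actually running them (popen, fork/exec, whatever) is left out, and paths that begin with '-' would need extra care.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: quote one path for a POSIX shell using single quotes.
// An embedded single quote becomes '\'' (close quote, escaped quote, reopen).
std::string shell_quote(const std::string& path) {
    std::string out = "'";
    for (char c : path) {
        if (c == '\'') out += "'\\''";
        else out += c;
    }
    out += "'";
    return out;
}

// Hypothetical helper: build one "file <path> <path> ..." command per batch,
// so the number of execs is about paths.size() / batch_size.
std::vector<std::string> build_file_commands(const std::vector<std::string>& paths,
                                             std::size_t batch_size) {
    if (batch_size == 0) batch_size = 1;  // avoid an infinite loop on a bad argument
    std::vector<std::string> commands;
    for (std::size_t i = 0; i < paths.size(); i += batch_size) {
        std::string cmd = "file";
        std::size_t end = std::min(paths.size(), i + batch_size);
        for (std::size_t j = i; j < end; ++j) {
            cmd += ' ';
            cmd += shell_quote(paths[j]);
        }
        commands.push_back(cmd);
    }
    return commands;
}
```

With 250 paths and a batch size of 100 this yields three command lines instead of 250 separate invocations. The batch size has to stay small enough that each command line fits under the system's argument-length limit.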
Sadly, there's no "right" way to do this.
1. Some 3-letter extensions are fairly reliable, but even they can
have subtypes (think .gif -- pretty unique, except there are two
GIF standards, and then there are animated vs. static gifs...)
2. Not all files have a unique signature.
3. Not all files are architecture-independent (take a look at what
'file' does for executables, for instance).
4. Your application has to determine its degree of paranoia w.r.t. how
much it trusts anything it discovers for itself (e.g., the Outlook
worms that use "foo.txt.vbs" or whatever).
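To make a few of those points concrete, here is a minimal C++ sketch, assuming you only care about a couple of signatures. sniff_type and last_extension are names invented for this example, and the returned labels are arbitrary strings, not what 'file' prints. It tells the two GIF versions apart by their 6-byte header, reads the class and byte-order bytes of an ELF header to show the architecture dependence, and pulls out the extension that actually counts in a name like "foo.txt.vbs".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: classify a buffer by its leading "magic" bytes.
// Only GIF and ELF are checked; everything else is reported as "unknown".
std::string sniff_type(const std::string& head) {
    if (head.compare(0, 6, "GIF87a") == 0) return "GIF image, version 87a";
    if (head.compare(0, 6, "GIF89a") == 0) return "GIF image, version 89a";
    // The literal is split so the hex escape does not swallow the 'E'.
    if (head.size() >= 6 && head.compare(0, 4, "\x7f" "ELF") == 0) {
        // Byte 4 is the ELF class, byte 5 the data encoding (byte order).
        std::string bits = head[4] == 1 ? "32-bit"
                         : head[4] == 2 ? "64-bit" : "unknown-class";
        std::string order = head[5] == 1 ? "little-endian"
                          : head[5] == 2 ? "big-endian" : "unknown-order";
        return "ELF " + bits + " " + order;
    }
    return "unknown";
}

// Hypothetical helper: return the lowercased text after the LAST dot, which
// is what matters for a name such as "foo.txt.vbs". Names with no dot, a
// leading dot only, or a trailing dot yield an empty string.
std::string last_extension(const std::string& name) {
    std::size_t dot = name.rfind('.');
    if (dot == std::string::npos || dot == 0 || dot + 1 == name.size()) return "";
    std::string ext = name.substr(dot + 1);
    for (char& c : ext) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return ext;
}
```

A cautious indexer would probably compare the two answers: if the extension says one thing and the signature says another (or says nothing at all), that disagreement is itself worth recording rather than trusting either side.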
The ideal world would have a metadata attribute for each file with the
MIME type string in it. I believe that BeFS had exactly this. Macs
have long had "application" and "type" fields, which were close but
not quite there (meaning that, depending on which application created
it, two GIFs might have different "application" values even though
they were identical formats). These fields were also limited to 4
Interestingly enough, it looks like Extended Attributes will be
showing up in most Linux FSs in the next stable series. This could
have interesting possibilities.
Either way, good luck. I'd be curious to hear how this project works