Category Archives: meta

Medusa autofoo fixed

Xan was quick to catch that I borked my modernization of Medusa's autotools. I have no idea what inserted the m4 dir, but it was very naughty. I do not have any macros to install, so I removed all references to m4. I note that GNOME's macros do not support AC_PROG_LIBTOOL to run libtoolize --force, so I reverted to AM_*.

Sri asked about a command line tool for search. msearch offers find/slocate-like functionality. It supports content and mime-type matching in addition to the standard file matching rules.

Oh blast. I lost the mime-type criteria when I updated msearch-gui to GTK 2.4. I better update the glade file.

Medusa updated to gtk 2.4

I finally found time to update Medusa’s desktop tool to GTK 2.4. The code is smaller too, since modern GTK widgets have the right behavior by default. I never did locate the cause of the error that Msearch used to report, but it is gone now. I suspect it was caused by the deprecated GnomeIconTheme under GNOME 2.5+. I am very happy to see my emblems again.

I also updated the open files behavior to use the default app instead of a nautilus view so that it plays well with Nautilus.

My idiot savant

I've been hacking a lot on metadata with great success. My application does nothing, but it does it very well. Really, I'm very proud I have code structured to work when I put real logic in the methods.

I selected some existing schemas that I think the metadata db should know out of the box: Dublin Core (documents), FOAF (people), EXIF (photos), MusicBrainz (music), IMDB (movies), and a few others. Absent is a schema for the filesystem. I think I'm the only person thinking about metadata as an alternate index to the filesystem. I need to create a schema that represents stat, mime-type, and nautilus thumbnails + emblem/keywords.
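
To make the missing filesystem schema a bit more concrete, here is a minimal Python sketch of what one record might hold. The `FileRecord` class, its field names, and the `describe` helper are all made up for illustration; this is not code from Medusa or any GNOME library, and the mime-type is only guessed from the file name rather than sniffed from content.

```python
import mimetypes
import os
import stat
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class FileRecord:
    """Hypothetical filesystem schema: stat data, mime-type, thumbnail, emblems."""
    uri: str
    size: int
    mtime: float
    mode: str
    mime_type: str
    thumbnail: Optional[str] = None          # path to a thumbnail, if one exists
    emblems: List[str] = field(default_factory=list)


def describe(path, emblems=()):
    """Build a FileRecord from stat() and a filename-based mime-type guess."""
    info = os.stat(path)
    guessed, _encoding = mimetypes.guess_type(path)
    return FileRecord(
        uri=Path(path).resolve().as_uri(),
        size=info.st_size,
        mtime=info.st_mtime,
        mode=stat.filemode(info.st_mode),    # e.g. "-rw-r--r--"
        mime_type=guessed or "application/octet-stream",
        emblems=list(emblems),
    )


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "notes.txt")
    with open(path, "w") as handle:
        handle.write("hello")
    record = describe(path, emblems=["important"])
print(record.mime_type, record.size, record.emblems)
```

A real schema would be expressed in RDF like the others in the list; the dataclass only shows which properties would need names.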

PS. I've switched from my usual britpop selection to the darker tones of Placebo, British Sea Power and Interpol. I feel more productive. Don't fall asleep with Placebo playing in a loop; I woke up feeling like I had spent the evening on a drinking binge.

Lunch time thoughts about search and metadata

I've been too busy to make any metadata/search contributions this year, but I'm trying to make time for what is important: a desktop that just works.

As much of the Evolution-Data-Server deals with metadata, would it be wiser to store all metadata in e-d-s? I've always planned to put metadata below gnome-vfs because file and user metadata predates gnome-vfs operations. But by extending e-d-s to contain file data like Posix, EXIF, ID3, etc., we would have one, albeit fractured, source for metadata.

I've long advocated separating search from metadata because there will never be one source. I'm pulling the metadata database out of Medusa because users will want to search Google and p2p shares like iFolder and Gnutella. Moreover, by separating the repository from the querier, I can address some security issues by having public and private native GNOME repositories. The querier translates and dispatches queries to the repositories, then merges the results.
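
The dispatch-and-merge idea can be sketched in a few lines of Python. Everything here is hypothetical: the `Repository` and `Querier` classes, the in-memory dictionaries standing in for real stores, and the key/value query form. The translation step is left out; each repository is assumed to understand the same tiny query.

```python
class Repository:
    """Hypothetical repository: a name plus an in-memory map of URI -> metadata."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def search(self, key, value):
        """Return the URIs whose metadata has `key` equal to `value`."""
        return [uri for uri, meta in self.records.items() if meta.get(key) == value]


class Querier:
    """Sends one query to every repository and merges the answers by URI."""

    def __init__(self, repositories):
        self.repositories = list(repositories)

    def query(self, key, value):
        merged = {}
        for repo in self.repositories:
            for uri in repo.search(key, value):
                # Remember which repositories reported each match.
                merged.setdefault(uri, []).append(repo.name)
        return merged


private = Repository("private", {
    "file:///home/me/taxes.txt": {"mime-type": "text/plain"},
    "file:///home/me/shared.txt": {"mime-type": "text/plain"},
})
public = Repository("public", {
    "file:///home/me/shared.txt": {"mime-type": "text/plain"},
    "file:///home/me/logo.png": {"mime-type": "image/png"},
})

results = Querier([private, public]).query("mime-type", "text/plain")
print(results)
```

Keeping the public and private repositories as separate objects is what lets the querier decide, per caller, which ones to ask.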

A smart GNOME-metadata-daemon will provide direct metadata access to manager apps like Nautilus and Rhythmbox, and associative tools like bookmarks for Epiphany. Querying and reconciling the contents of a folder with 1000s of files will be faster than scanning the disk for each file.

Several strategies are needed to keep the metadata repository current. The g-m-d may use FAM to watch a few folders where change happens frequently, and call an indexer to extract the metadata after each file update. The g-m-d will launch an incremental indexer to crawl personal folders from time to time to update the repository. The g-m-d will examine the changes to update the set of watched directories. GNOME-VFS may be aware of the g-m-d and will use an introspection library to extract metadata during writes.
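
One way the daemon might pick its watched directories after a crawl is simply to count where the changes happened. This is a rough sketch under that assumption; `choose_watched_dirs`, its `limit` parameter, and the sample paths are invented for illustration and nothing here talks to FAM.

```python
import posixpath
from collections import Counter


def choose_watched_dirs(changed_paths, limit=2):
    """Return the `limit` directories that saw the most changes in the last crawl."""
    counts = Counter(posixpath.dirname(path) for path in changed_paths)
    return [directory for directory, _count in counts.most_common(limit)]


# Paths the incremental indexer found changed since its previous run.
changes = [
    "/home/me/Documents/report.txt",
    "/home/me/Documents/notes.txt",
    "/home/me/Documents/todo.txt",
    "/home/me/Music/song.ogg",
    "/home/me/Music/album.ogg",
    "/home/me/Pictures/cat.png",
]
watched = choose_watched_dirs(changes)
print(watched)
```

The small `limit` reflects the point above: only a few busy folders get a live watch, and the periodic crawl covers the rest.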

The GNOME-metadata-daemon is really a mess on my harddrive. Adding this kind of code to the dependency list is a bit tricky.

                         Indexer
                            |
    libintrospection (metadata extraction + creation)
                            |
             GNOME-VFS (virtual filesystem)
                            |
GNOME-metadata-daemon (controls access to the repository)
                            |
    libmetadata (RDF obfuscation + schema management)
                            |
              librdf (metadata repository)

libintrospection would normally exist on top of GNOME-VFS. To catch the data being written by a GNOME app, libintrospection would have to be either below GNOME-VFS or a part of it. libintrospection would be similar to GStreamer pipelines. Pipelines of introspectors would be called to extract and create metadata. As mime-type is an aspect of metadata, the pipeline manager must construct the pipeline as data is extracted. After each step, the pipeline manager must determine which introspector to call next to complete the gathering of metadata. Introspectors would be registered in GConf like thumbnailers, so new introspectors can be added by applications.
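
A pipeline that is built while it runs could look something like the following Python sketch. It is only an illustration of the idea: the registry is a plain list rather than GConf, and the names `sniff_mime`, `png_size`, `REGISTRY`, and `run_pipeline` are hypothetical. Each introspector comes with a test on the metadata gathered so far, and the manager keeps picking the next one whose test passes.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_mime(data, meta):
    """First step: work out the mime-type from the leading bytes."""
    if data.startswith(PNG_SIGNATURE):
        return {"mime-type": "image/png"}
    return {"mime-type": "application/octet-stream"}


def png_size(data, meta):
    """PNG step: the IHDR chunk stores width and height as big-endian
    32-bit integers at byte offsets 16 and 20."""
    return {
        "width": int.from_bytes(data[16:20], "big"),
        "height": int.from_bytes(data[20:24], "big"),
    }


# Registry of (name, applies-to-current-metadata test, introspector).
REGISTRY = [
    ("sniff-mime", lambda meta: "mime-type" not in meta, sniff_mime),
    ("png-size", lambda meta: meta.get("mime-type") == "image/png", png_size),
]


def run_pipeline(data, registry=REGISTRY):
    """After each step, choose the next introspector that has not run yet
    and whose test matches the metadata collected so far."""
    meta, done = {}, set()
    while True:
        ready = [(name, func) for name, applies, func in registry
                 if name not in done and applies(meta)]
        if not ready:
            return meta
        name, func = ready[0]
        meta.update(func(data, meta))
        done.add(name)


png_header = (PNG_SIGNATURE + (13).to_bytes(4, "big") + b"IHDR"
              + (640).to_bytes(4, "big") + (480).to_bytes(4, "big"))
print(run_pipeline(png_header))
print(run_pipeline(b"plain old text"))
```

An application adding a new introspector would only need to append an entry to the registry; the manager does not have to know about it in advance.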

Two steps back

I've been so very busy at work that I can barely find the time to hack, say happy birthday to my mother, or kick the dog.

I've been struggling to sort out what went wrong with Medusa and the 2.5 environment to break the indexer and the icons in msearch-gui. A total rebuild with jhbuild (again) seems to have addressed the mime-type issues that disabled full text indexing. The lexicon is 25% smaller. Either mime-types are still not right, or 25% of the index was bogus. I hope the latter. It's nice to see Medusa's indexes total less than 95 megs of disk usage.

The emblems and icons in msearch-gui still refuse to display. The function 'gtk_icon_theme_list_icons (icon_theme, "Emblems")' and its older gnome_ counterpart return NULL. I've installed icon-theme and hicolor-icon-theme from Freedesktop, and checked my environment for XDG_DATA_DIRS=/opt/gnome2/share:/usr/share, to no avail. I'm stumped. It's definitely a setup problem; other 2.5 apps are missing their icons. The only clue I have for emblems is that only the gnome theme has them; is it not in my icon inheritance? Speaking of which, there are cvs state emblems in gnome; what app knows how to use them?

Meanwhile, I have my Redland testing to continue. I've got to put my Medusa content into Redland to see if it continues to run well. I'm hacking the existing indexer to feed Redland so I can keep Medusa's and Redland's data synced for comparison. I think Redland will be good, so I'll have some code ready to make a quick break to a new metadata store.

Then someone pulled all the pictures.

I discovered this morning that my test (GNOME 2.5) Medusa stopped content indexing 5 days ago. I did a complete rebuild of GNOME unstable five days ago. I couldn't imagine what caused the problem, but seeing all the emblems and icons missing from msearch-gui gives me an idea that mime-types are completely knackered. I thought I was on top of this. Bugger. I'll have to fix it, but this is damned inconvenient.

I should have noticed the problem, but I was poking at Redland. Medusa's DB was designed to store a dozen pieces of information, but return only one, a URI. It's proving to be a lot of work to coax out a complete set of file data that I can use to test Redland and librdf. I hoped that once this was done I would have a nice method to speed up search display; with all the data returned in the query, msearch-gui would not take the slow path of file lookups to get info. Now I'm just tempted to make the indexer directly feed Redland.

Medusa, RDF, and Redland

I’ve been playing with Redland and its librdf recently. I’m looking for a fast replacement for Medusa’s database. I’m not sure what to think. It has some new parser and query features since I played with it last year. From a storage aspect, it can do what Medusa does now, but I think I’ll need more query capabilities. If I must write a query engine, there is no reason it must be Medusa’s so long as it works well.

I strongly feel that a GNOME metadata solution should be based on metadata standards (RDF, OWL, FOAF) and use common grammars. I’m shopping for a new Medusa backend because I don’t think Medusa should be in the DB business, and it needs an extensible schema. It would be nice to have a ready-to-go RDF DB to attach the search and indexer pieces to. I think Storage is the long-term solution, but if I can get a solid db that follows standards now, I can solve scale and features later.

The Good

  • standard compliant: plays with other tools nicely
  • proven: has been seen to play with others
  • LGPL: is allowed to play with others
  • Few dependencies: doesn’t hassle others
  • small: is not a burden on others
  • bindings: plays with everyone

The Bad

  • BDB 4: will everything break when BDB 5 comes out?
  • MySQL: a bit of a nuisance to set up for single users
  • query: applications and users need robust searching
  • scalability: will this work at 100 megs, the size of my Medusa db?

My 2004 metadata plan

I’ve got a plan for dealing with metadata this year. It’s not firm on dates or priorities. I have a list of changes, and I’ll tackle them as they seem most urgent or most beneficial. The general summary of tasks is: integrate with Nautilus, make Storage work with libgda, add metadata handling to libstorage-translators, and make Storage the metadata backend.

The first things I’m doing are:

  1. Make Medusa return the full set of metadata so apps like Nautilus don’t need to crawl the file systems and do mime-type lookups.
  2. Restore Medusa to Nautilus using the Nautilus extension API, add simple search to the browser toolbar, and create a complex search side bar.
  3. Abstract (is that a verb?) the Storage DB and create a libgda facade.
  4. Add metadata handling to libstorage-translators, and a means to chain them together to do consecutive processing.
  5. Replace the Medusa DB with a metadata library over Storage.

Nautilus directory loading strategies

A Medusa query of everything in my gnome2 build dir took 14 seconds 256 milliseconds to return 172,717 files, but display took 1 hour and 1 minute. Medusa knows everything but the icon, so msearch-gui had to look up the mime-type (again) in VFS to display the icon. Medusa could be doing a lot more, such as returning the full set of posix data, but extending it to handle icons would be difficult.
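
Breaking those numbers down per file shows how lopsided the two phases are. The short Python calculation below uses only the figures quoted above; the variable names are just for illustration.

```python
files = 172_717
query_seconds = 14.256          # 14 seconds 256 milliseconds
display_seconds = 61 * 60       # 1 hour and 1 minute

per_file_query_us = query_seconds / files * 1_000_000
per_file_display_ms = display_seconds / files * 1_000
slowdown = display_seconds / query_seconds

print(f"query:   {per_file_query_us:.1f} microseconds per file")
print(f"display: {per_file_display_ms:.1f} milliseconds per file")
print(f"display is about {slowdown:.0f} times slower than the query")
```

That works out to roughly 83 microseconds per file for the query against roughly 21 milliseconds per file for display, a difference of more than two hundredfold.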

My point, in regards to Nautilus, is that it spends a lot of time traversing the disk (or net) and making repeated calls to get the specifics of each file. The file system is not designed to return a set of data like a database query, but that feature is what a modern file browser needs. That is what WinFS is reported to do. Content management systems also offer this kind of feature. Users now have gigs of data, hundreds of thousands of files. Enterprises use more than simple file systems to organize this amount of data, and so should users.

Now I question why anyone needs to see more than a few dozen files of anything; it is more than you can take in. Most likely the user is only looking for a few files. Directories are a weak means of organizing/classifying data, and users cannot easily ask for the files they want to see. A proper metadata DB would serve Nautilus all the information it requests in a single lookup. A metadata DB would provide users and applications with the query features to keep the file list concise and relevant.

Directories and masses of poorly organized files are the here and now. Nautilus might ease the situation by doing an incremental display of files if a milestone is not met. It could get the list of files, sort them by access time, then begin display, so the most recent (and commonly used) files are the first the user can access.
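
The sort-then-display idea is easy to sketch outside of Nautilus. The Python below is an illustration only: `recent_first` and `batches` are invented helpers, the batch size is arbitrary, and printing stands in for drawing icons.

```python
import os
import tempfile


def recent_first(directory):
    """List the regular files in `directory`, most recently accessed first."""
    with os.scandir(directory) as entries:
        files = [(entry.stat().st_atime, entry.name)
                 for entry in entries if entry.is_file()]
    files.sort(reverse=True)
    return [name for _atime, name in files]


def batches(names, size=50):
    """Yield the names in chunks so display can begin before the list ends."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


# Demonstration: three files with explicit access times (oldest to newest).
with tempfile.TemporaryDirectory() as folder:
    for age, name in enumerate(["old.txt", "middle.txt", "new.txt"]):
        path = os.path.join(folder, name)
        open(path, "w").close()
        os.utime(path, (1_000_000 + age, 1_000_000))   # (atime, mtime)
    ordered = recent_first(folder)

for chunk in batches(ordered, size=2):
    print(chunk)
```

This still pays for one stat call per file up front; the gain is only that the user sees the likeliest files first.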

Wanted: metadata server

The mime-type detection conversations on the GNOME lists are very tiring. This discussion has been focused on how to get the metadata for a file faster, but there is no issue with getting the metadata for a few dozen files. The real problems are:

  1. The directory is an inefficient mechanism for organizing a large number of files.
  2. We cannot query a simple file system to return a set.

I’m all for using EAs when they are available, but they are not available on all file systems that GNOME may run on, which hinders them as a solution. Moreover, there is a hidden cost for EAs. They take up more disk space (1-5% disk loss), and apps must know to use them; since most do not, the metadata is also stored in the file header or footer. EAs can only be accessed in the context of a single file; since the data is not organized into anything like a database designed in the past 30 years, we cannot query them. Nor is the file system normalized to prevent duplication or orphaning of thumbnails and mime-types.

As a point of fact, Medusa does store a table of file metadata, and I can query my file system to get a set of data like file-name, mime-type. I can query by directory, or mime-type, or keyword (emblem/topic/category), and more. This mechanism returns several thousand file matches in less than a second.
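
For readers who have not seen a file system queried as a set, here is a small Python sketch of the same kind of lookup. It uses SQLite as a stand-in and is not how Medusa stores or queries its data; the table layout, the column names, and the `find` helper are all assumptions made for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (uri TEXT PRIMARY KEY, directory TEXT, "
           "name TEXT, mime_type TEXT)")
db.execute("CREATE TABLE keywords (uri TEXT, keyword TEXT)")
db.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("file:///home/me/a.txt", "/home/me", "a.txt", "text/plain"),
    ("file:///home/me/b.png", "/home/me", "b.png", "image/png"),
    ("file:///srv/c.txt", "/srv", "c.txt", "text/plain"),
])
db.executemany("INSERT INTO keywords VALUES (?, ?)", [
    ("file:///home/me/a.txt", "important"),
    ("file:///home/me/b.png", "important"),
    ("file:///home/me/b.png", "draft"),
])


def find(db, directory=None, mime_type=None, keyword=None):
    """Return (name, mime_type) rows matching every criterion that is given."""
    sql = ("SELECT DISTINCT f.name, f.mime_type FROM files f "
           "LEFT JOIN keywords k ON k.uri = f.uri WHERE 1=1")
    args = []
    if directory is not None:
        sql += " AND f.directory = ?"
        args.append(directory)
    if mime_type is not None:
        sql += " AND f.mime_type = ?"
        args.append(mime_type)
    if keyword is not None:
        sql += " AND k.keyword = ?"
        args.append(keyword)
    sql += " ORDER BY f.name"
    return db.execute(sql, args).fetchall()


print(find(db, mime_type="text/plain"))
print(find(db, directory="/home/me", keyword="important"))
```

The caller gets names and mime-types back in one round trip, without touching the files themselves.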

The Storage project, by its nature, addresses the metadata problem. Though it focuses on being a smart file/data system, it can be used to manage the metadata of files outside of it. Because it is a portable file system, there are no EA issues with the OS’s file system. It is designed to return an arbitrary set of data matching a query like a directory, or category. I proposed a formal means of handling metadata in Storage that was discussed on the storage list.

This said, I don’t think Medusa or Storage is appropriate. Medusa’s focus is searching, and its underlying code isn’t suited to managing metadata well. Storage is as its name suggests, and it doesn’t help users or applications that must use the native file system. Both Medusa and Storage provide VFS access, but neither makes it easy to read and write metadata.

We need a layer in between search and storage to manage metadata. It must co-opt the existing VFS methods to read and write to the metadata system when doing IO to the underlying file system. An incremental indexer is needed to collect the metadata for files not written through VFS. FAM could be used, but it will not scale; a smart indexer is needed that can watch the locations that will change most. Many applications, like Web browsers, music managers, and file managers, need direct access to read and write metadata without writing to the file system.

One final thought. Metadata isn’t a GNOME issue; all desktops have the same issues. Freedesktop might be the right place for it. If apps from other desktops like KDE were writing to the metadata DB, there would be less need for indexers.