Adding the data to MySQL, which is what I have been doing, seems to be the time killer.
I extract most of the information and put it into MySQL tables, though my programs are very editor-oriented, so they probably wouldn't be of much use to anyone else. One way I speed this up is to extract the information into a tab-delimited flat file. For example, if I have a table with URL, Title, Description, and Category, then each site goes on one line, with the URL, title, description, and category separated by tabs.
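For illustration (the values here are made up), a single line of that flat file would look like this, where <tab> stands for a literal tab character:

http://www.example.com/<tab>Example Site<tab>A short description of the site<tab>Arts/Online_Writing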
Once I'm finished writing all the data to flat files, I TRUNCATE the table (or delete everything from it) and then use LOAD DATA LOCAL INFILE "flatfile.txt" INTO TABLE tablename.
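A minimal sketch of that step, assuming a table named sites and a file named flatfile.txt (both names are just examples):

-- Empty the table, then bulk-load the tab-delimited file.
TRUNCATE TABLE sites;
LOAD DATA LOCAL INFILE 'flatfile.txt'
INTO TABLE sites
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(url, title, description, category);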
I don't use an index on the table containing the 4 mil+ sites, but if you do, you don't want the index in place while you're loading; inserting into an indexed table is much slower. Load all the sites first, then create the index.
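In SQL terms, still using the hypothetical sites table (the index name idx_url is also made up), the pattern looks like this:

-- Drop the index before reloading (skip this if the index doesn't exist yet).
ALTER TABLE sites DROP INDEX idx_url;
-- ... the TRUNCATE and LOAD DATA LOCAL INFILE steps go here ...
-- Recreate the index once all the rows are in.
CREATE INDEX idx_url ON sites (url);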
One other very big time (and space) saver is to create a catid table. This table should have only two fields, catid and path, where catid is the category's ID from the RDF and path is the full path of the category, like Arts/Online_Writing. Then, in any other table where you need to specify a category, store just the catid. When you query a table with a catid, join it with the catid table to get the path.
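A sketch of that layout, again with made-up table and column names; here the sites table stores a catid column instead of the full path:

-- Lookup table mapping each category ID to its full path.
CREATE TABLE catid (
    catid INT NOT NULL PRIMARY KEY,
    path VARCHAR(255) NOT NULL
);

-- A join recovers the path whenever you need it.
SELECT s.url, s.title, c.path
FROM sites s
JOIN catid c ON c.catid = s.catid;

Storing a small integer instead of a long path string in 4 mil+ rows is where the space savings come from.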
Using those techniques, and a few others that are more complicated to explain, I've reduced the parsing time to under an hour.