For the last few weeks I have been working on implementing a custom document parser in SharePoint v3. I was pleasantly surprised by Andrew May's excellent postings about Document Parsers.
http://blogs.msdn.com/andrew_may/archive/2006/07/21/SharePointBeta2DocumentParserOverview4.aspx
So I actually thought this was going to be pretty straightforward since there's actually documentation.
I was wrong....
A custom parser is a COM object (I don't know why they chose this route). This didn't bother me since I was quite happy to do some COM programming again ;) Both the SDK documentation
http://msdn2.microsoft.com/en-us/library/aa543908.aspx
and Andrew's posts describe the interfaces ALMOST in detail.
What is missing is all the IIDs for the COM interfaces. Without these GUIDs you can't create a custom parser... Since I'm used to browsing through the registry and using OleView from the good old COM days, I thought it was going to be easy to find this "last" piece of information. But I couldn't find any ISPDocumentParser entry under the Interface key in the registry, and instantiating the out-of-the-box (OTB) Office parser from OleView didn't show any interfaces either....
OK, there's always Google, so let's google it. I found one post where another developer was asking the same question: where's the IID for ISPDocumentParser? Unfortunately there was no answer.
I decided to try to replace the SharePoint Office parser with my own "parser" that only traced out all QueryInterface calls made on it. I did this by creating a COM object with the same CLSID as the SharePoint Office parser and updating the registry to point to my DLL instead of the SharePoint parser.
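A minimal sketch of that tracing idea, not my actual parser code. To keep it self-contained it uses a portable stand-in for the Windows GUID struct; the names Guid, FormatGuid and QueryTracer are made up for illustration. In a real COM DLL you would use GUID/REFIID and HRESULT from the Windows headers instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Assumption: portable stand-in for the Windows GUID struct (same field layout).
struct Guid {
    std::uint32_t data1;
    std::uint16_t data2;
    std::uint16_t data3;
    std::uint8_t  data4[8];
};

// Format a GUID the way the registry shows it:
// {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
inline std::string FormatGuid(const Guid& g) {
    char buf[40];
    int n = std::snprintf(
        buf, sizeof(buf),
        "{%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X}",
        static_cast<unsigned>(g.data1),
        static_cast<unsigned>(g.data2),
        static_cast<unsigned>(g.data3),
        static_cast<unsigned>(g.data4[0]), static_cast<unsigned>(g.data4[1]),
        static_cast<unsigned>(g.data4[2]), static_cast<unsigned>(g.data4[3]),
        static_cast<unsigned>(g.data4[4]), static_cast<unsigned>(g.data4[5]),
        static_cast<unsigned>(g.data4[6]), static_cast<unsigned>(g.data4[7]));
    assert(n == 38);  // a braced GUID string is always 38 characters
    (void)n;
    return std::string(buf);
}

// Hypothetical tracer: records every IID it is asked for and refuses it.
// A real COM object must still answer IID_IUnknown, so an actual
// implementation would special-case that one before refusing.
class QueryTracer {
public:
    // Same numeric value as E_NOINTERFACE on Windows.
    static constexpr std::uint32_t kNoInterface = 0x80004002u;

    std::uint32_t QueryInterface(const Guid& iid, void** out) {
        if (out) *out = nullptr;           // COM rule: null the out pointer on failure
        seen_.push_back(FormatGuid(iid));  // remember what the host asked for
        std::fprintf(stderr, "QueryInterface(%s)\n", seen_.back().c_str());
        return kNoInterface;
    }

    // All IIDs requested so far, in call order.
    const std::vector<std::string>& seen() const { return seen_; }

private:
    std::vector<std::string> seen_;
};
```

Writing to stderr is just the simplest thing for a sketch; inside a server process you would more likely use OutputDebugString or a log file, since there is no console to look at.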
Here's what I saw:
WSS is querying for two interfaces:
1) {E19C7100-9709-4DB7-9373-E7B518B47086}
2) {9E13184F-C136-41D4-899D-4331DB736BA1}
But I didn't know which one of them was for the ISPDocumentParser.
I decided to create an instance of the Office parser and forward these calls to it to see if it supports both of them. After doing this there was one left:
Let me introduce you to
{9E13184F-C136-41D4-899D-4331DB736BA1}
This is the IID for ISPDocumentParser.
Here's my IDL:
[
    object,
    uuid(9E13184F-C136-41D4-899D-4331DB736BA1),
    oleautomation,
    nonextensible,
    helpstring("ISPDocumentParser Interface"),
    pointer_default(unique)
]
interface ISPDocumentParser : IUnknown
{
    HRESULT Parse([in] ILockBytes *pilb, [in] IParserPropertyBag *pibag, [out] VARIANT_BOOL *pfChanged);
    HRESULT Demote([in] ILockBytes *pilb, [in] IParserPropertyBag *pibag, [out] VARIANT_BOOL *pfChanged);
    HRESULT ExtractThumbnail([in] ILockBytes *pilb, [in] IStream *pistmThumbnail);
}
I was now able to do some more investigation by wrapping this interface pointer returned by the Office parser. This way I was able to look at the property bag before and after the Office parser had parsed the file.
Now there's "only" one problem left. SharePoint NEVER tries to instantiate any of my custom parsers... I'm adding them to the Web Server Extensions\12\CONFIG\DOCPARSE.XML file and doing an iisreset, but no luck.
To be continued.
/Jonas