Toward mailnews Exchange Web Services support: SOAP calls

February 1, 2010 – 2:28 pm

I’ve embarked on an effort to investigate adding support for Exchange server to the mailnews code. Although Exchange in Windows has traditionally used port 135-based protocols, my understanding is that the future for them is SOAP-based Exchange Web Services (EWS). As a first step, I wanted to get a basic SOAP library working in current mailnews code.

I considered a variety of approaches to this. One extension, “Asertiva Thunderbird Extension for Sugar”, uses the IBM/Prototype js library for SOAP access. Others recommended that I consider one of the open source SOAP libraries, such as a python-based library or Apache’s AXIS2 library, or that I cooperate with the existing project to provide an open-source method of accessing Exchange server.

But I’m not sure of how “open source” I want all of this to be. From my perspective, “open source” as a charitable activity is not successful. We all need to eat, and so the revenue model needs to be clear if a project is going to be more than a phase of life I am going through at the moment. So I would rather keep my options open until I understand that better. Anyway, that’s a long discussion which is beyond the scope of the current posting, which is supposed to be a status update.

I am still in an education phase, trying to understand SOAP and the related protocols, and to figure out what exactly I gain from using any existing library versus doing things more directly from the raw XML. So as both a trial and education step (and against the recommendation of others I might add) I’ve tried to update the old Gecko webservices extension to work in current Gecko 1.9.2, and to work with some current Microsoft SOAP protocols.

Rather than start with EWS, I started with the simpler BING search calls. I used existing Microsoft demos in C#, and could capture the communications using Wireshark to see what I was supposed to be sending and receiving.

Updating the abandoned webservices extension to Gecko 1.9.2

After testing webservices some under an old Firefox 2 build, I upgraded portions of the code to work under a current comm-central trunk build, using Gecko 1.9.2. My requirements are somewhat simpler than the original extension:

  1. Most importantly, my target is chrome-based extensions rather than browser code, so a lot of the security issues that FF folks worried about were not important to me.
  2. I also had no interest in allowing native JS creation of components, as I could rely on .createInstance calls instead.
  3. At least initially, I am not supporting the reading of wsdl files, nor the associated automatic creation of proxy calls and interfaces. Instead, I read in a schema file, and generate my own code for each method. My understanding is that Microsoft, in their EWS libraries, also does not actually automatically generate method calls from a WSDL-based proxy, but follows this same approach of starting with the schema files.

That allowed me to avoid about half of the existing code, and focus on the /schema and /soap directories of the webservices extension.

Looking at my hg logs, it took one week of development time, and 17 patches to get to the point where I could write a unit test under Gecko 1.9.2, and confirm that I could create webservices components using a unit test. (I’m doing things in a test-driven development fashion, writing XPCSHELL unit tests to try out different features of webservices.)

Learning and testing Gecko webservices

Although there was a little old documentation around on Gecko webservices, ultimately I just needed to learn things the old fashioned way, reading the code and its interfaces and experimenting to see what worked. I took little baby steps, starting first with an XML schema primer and eventually working my way toward duplicating the functionality in some Microsoft Bing search demo C# code. This phase, starting from the first demonstration of loading of SOAP components under Gecko 1.9.2 through testing of encoding and decoding of a Bing search message, took about 2 weeks of coding and 23 patches, with the creation of 22 unit tests in the process.

The main issue that I had to deal with in the existing SOAP code is that it did not support maxOccurs>1 schema types, such as this one from the Bing schema:

<xsd:complexType name="ArrayOfNewsArticle">
  <xsd:sequence>
    <xsd:element minOccurs="0" maxOccurs="unbounded"
                 name="NewsArticle" type="tns:NewsArticle" />
  </xsd:sequence>
</xsd:complexType>

I solved this by using an nsIArray to hold multiple values of the same element.
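The idea can be sketched in plain javascript. This is only an illustration, not the extension's code: the helper name `collectChildren`, the `{name, value}` input shape, and the use of an ordinary array in place of an nsIArray are all assumptions made for the example. Elements that the schema allows to repeat always decode to an array, while other elements decode to a single value.

```javascript
// Hypothetical helper: group decoded child elements by name.
// `children` is a list of {name, value} pairs in document order;
// `repeatable` holds the names whose schema maxOccurs is greater than 1.
function collectChildren(children, repeatable) {
  const result = {};
  for (const { name, value } of children) {
    if (repeatable.has(name)) {
      // Repeatable elements always become an array, even with one item.
      (result[name] = result[name] || []).push(value);
    } else {
      result[name] = value;
    }
  }
  return result;
}

const decoded = collectChildren(
  [
    { name: "NewsArticle", value: "first" },
    { name: "NewsArticle", value: "second" },
  ],
  new Set(["NewsArticle"])
);
```

Here `decoded.NewsArticle` holds both values in document order.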

Sample Code

So now I can do a Bing search in an XPCSHELL test, and decode and test the results. I want to show some of the js that I use, to give some idea of the complexity (or lack thereof) of using this.

I create a class BingSearch, then encode some values to set up the search. A basic call for a search looks like this:

function run_test()
{
 do_test_pending();
 getNews = new BingSearch();
 getNews.Query = "obama";
 getNews.Options = ["EnableHighlighting"];
 getNews.Sources = ["News"];
 getNews.News = {Offset: 0, SortBy: "Relevance", Count: 1};
 getNews.invoke(getNewsListener);
}

The “BingSearch” object is presumably what a sophisticated library would create automatically from the WSDL file. Instead, I create it by hand. Here’s my partial implementation, that does not support all of the allowed inputs to a Bing search, but works for my tests:

function BingSearch()
{
 // defaults
 this.Version = "2.0";
 this.AppId = "<you get this from Microsoft for your application>";
 this.Market = "en-us";
}

BingSearch.prototype = new BingBase();

BingSearch.prototype.invoke = function BingSearch_invoke(aSOAPResponseListener)
{
 let parametersBag = objectPropertyBag({
   Version: this.Version,
   Market: this.Market,
   Query: this.Query,
   AppId: this.AppId,
   Options: arrayPropertyBag("SearchOption", this.Options),
   Sources: arrayPropertyBag("SourceType", this.Sources),
   News: objectPropertyBag(this.News)
 });

 // soap message component
 let soapCall = Cc["@mozilla.org/xmlextras/soap/call;1"]
                  .createInstance(Ci.nsISOAPCall);
 let parameters = [];
 parameters.push(new soapParameter("parameters",
                   parametersBag,
                   this._schema.getTypeByName('SearchRequest')));
 soapCall.encode(Ci.nsISOAPMessage.VERSION_1_2,
                 'SearchRequest',
                 this._schema.targetNamespace,
                 0, null, // header blocks
                 parameters.length, parameters);
 soapCall.transportURI = "http://api.bing.net:80/soap.asmx";
 soapCall.encoding = this._encoding;
 soapCall.asyncInvoke(aSOAPResponseListener);
}

const kSchema = {
 file: 'data/bing20.xsd',
 schemaNamespace: 'http://www.w3.org/2001/XMLSchema',
 targetNamespace: 'http://schemas.microsoft.com/LiveSearch/2008/03/Search'
 }

function BingBase()
{
 if (typeof this._schema == "undefined")
 {
   this._schema = getSchema(kSchema);
   // use the 2001 SOAP encoder
   this._encoding = Cc["@mozilla.org/xmlextras/soap/encoding;1"]
                      .createInstance(Ci.nsISOAPEncoding);
   this._encoding.getAssociatedEncoding("http://www.w3.org/2001/09/soap-encoding", true);
   this._encoding.schemaCollection = this._schema.collection;
 }
}

So far, the complexity of this does not seem unmanageable to me. I’ve only shown the encoding step. Decoding the response consists of creating a call-specific translator similar to the “BingSearch.prototype.invoke” function above, which relies on the webservices soap library decode method. All of the other functions I’ve created (such as arrayPropertyBag) are not at all specific to the nature of the SOAP interface being used. I am not seeing the need to process the WSDL file automatically and generate proxy functions.

I’m not yet convinced that this resurrection of the old webservices library is the right approach, but I am not seeing any obstacles to using it either. I can generate and decode soap calls fairly efficiently, debug issues that arise, plus I have code that will integrate fairly easily with either javascript or C++ code in a Gecko chrome environment.

Next steps

I’m trying to decide on the next step toward moving this forward. I’m leaning toward attempting a specific EWS application, such as read-only access to an Exchange calendar as a Lightning extension. Another option might be to add some core mailnews hooks to allow me to create either message accounts or addressbooks using an extension – though I’m hoping jcranmer will beat me to it.

QA -> Developers communication

January 22, 2010 – 10:44 am

A few weeks ago on IRC dmose and I discussed the general issue of how QA communicates priorities to developers. I’d like to hear some comments on that from others, and possibly participate in some sort of trial of improvements.

The issue here is that I see lots of good work going on by people who are mostly involved in QA, such as wsmwk, WADA, and Ludo, but I as a developer don’t really know how to make the best use of that work.

I assume there is supposed to be a waterfall here, from (bug reporter)->(QA)->(developer)->(code reviewer)->(bug landing). I understand all of the steps of the process except this (QA)->(developer) handoff. I would be curious to hear from people involved in QA about what they view the main outcome of their work is supposed to be, as viewed by a developer.

Here’s my understanding of the current process. After bugs are submitted, QA has three main responsibilities: 1) get the bug in the correct component, 2) move the status to NEW or one of the inactive states (DUPE, INVALID, etc.), and 3) clarify the bug information to get clear steps to reproduce.

Is this accurate?

Let’s look to see how that is working by looking at my recent work. In the last three months, I fixed eight bugs (that’s a little off my desired pace, but we were frozen a lot of that time). What brought those bugs to my attention?

(3) bugs I reported myself, either due to issues I observed or as a result of following support forums.

(3) bugs are fixes of crashes that wsmwk reported from crash stats

(2) bugs were filed earlier by others. If I recall correctly, both of those bugs were items that I noticed first in support forums, then located the bug in Bugzilla and fixed it.

In no cases did the standard QA waterfall process play a significant role in bringing a bug to my attention. And that concerns me, because I see some very competent and dedicated people working hard on QA, but I don’t seem to be making effective use of that work.

Am I somehow not following the process that I am supposed to be following, or is that process flawed? In theory I am probably in a better position than most people here, because I primarily track items that appear in the mailnews core/filters component.

I wish there was a clear way for the QA people to bring a limited number of bugs to my attention that are 1) important, 2) clearly defined and reproducible, and 3) likely to be fairly easy to fix.

The mailnews core/filters category currently has 326 NEW/ASSIGNED/REOPENED bugs in it. I could probably fix a few per month that were brought to my attention. A reasonable expectation might be that 10% of those are addressed in the next year. How are the QA folks supposed to influence the selection of which 32 bugs actually get my attention?

(posted to http://mesquilla.com and m.d.a.thunderbird, followup to m.d.a.thunderbird please.)

rkent

TaQuilla 0.3.0 released

December 22, 2009 – 11:15 am

I’ve just uploaded a new version of TaQuilla to Mozilla’s add-on site. You can download it here. It is still listed as experimental status, so updates are not automatic. Details of the changes in this revision are available here, but briefly it mostly adds some user interface consolidations for consistency, plus support for Thunderbird 3.0 and SeaMonkey 2.0.

Frankly, I’ve struggled to find a good personal use of TaQuilla for use in my dogfooding. I’ve tried using it to categorize “interesting” posts, but I can’t even agree myself from day-to-day what is “interesting”, and the soft tagging is even more indecisive. But I’ve finally hit on a good use for it in my workflow – rejecting of sports articles in newsfeeds!

I have an RSS feed that subscribes to local news for the “Seattle Times” newspaper, but I find a lot of the articles are sports related. Now I am a certified geek, and not really interested in those types of articles. So what I did was create a tag “Sports”, set up TaQuilla soft tags for “Sports” on the RSS feed, and create a virtual folder that filters out articles tagged with “Sports”. Voila, I can read the newspaper feed without all of those annoying sports articles. It’s particularly useful since many of the same articles get updated multiple times, and the updates are very efficiently rejected by the bayes filter if I tag the original.

FiltaQuilla 1.0.0 released, adds custom search terms

December 2, 2009 – 10:37 pm

Well I finally decided to quit adding new stuff, and just get a compatible FiltaQuilla out the door that works with Thunderbird 3.0 and SeaMonkey 2.0. You can get the new version from Mozilla’s download site here.

In addition to some new filter actions (print, add sender to address list, and save attachments to a folder) this release introduces “custom search terms” for the first time. This is a new feature that has been added recently to the mailnews core code, and is part of the TB 3.0 and SM 2.0 releases.

The search I am talking about is the old-style Thunderbird search, not the newer global database (gloda) search that was added to Thunderbird. Gloda gets all of the press, but we’ve also taught the old-style search a few tricks as well! These are particularly useful in saved searches (also called virtual folders).

I think that the most interesting new capability is that you can define a search (and therefore a virtual folder) by adding a few lines of javascript code that does precisely what you want. Let me give a very simple example.

Let’s assume that you want a virtual folder to contain all of the active items that you currently need to process involving projects. Incoming emails are marked with tags, either manually or by some sort of filter. You define new tags for each project, and each tag begins with “pro” and ends with some sort of project marker, say a number or word. You want to have your active folder contain messages that have a tag containing “pro”, but NOT include messages that are tagged “done”.

The standard mailnews tag search forces you to enter the tag name for each tag that you want. There is no way to search for tags by the characters in the tags. Plus, searching for several tags is an OR function, and saying to not include DONE messages is an AND function. But the standard mailnews search does not handle complex boolean searches.

The javascript custom search comes to the rescue! Go to the folder that contains the messages that you want, and select the Javascript search term:

[Screenshot: Javascript search term]

Click on the script icon (the script edit button) and you will get a small editor window, where you can enter a few lines of javascript. All that we need to do is grab the string that has the tags in it, and make sure it includes “pro” but does not include “done”. Our javascript is given a variable “message”, which is the database header object that can be used to get message properties. The last statement must be an expression whose value is true if we want the message, or false if we do not. That’s just the following two lines of javascript:

let tags = message.getStringProperty('keywords');
(/pro/.test(tags) && !/done/.test(tags));

So here’s what our javascript window and code look like:

[Screenshot: Enter Javascript code]

Now save this as a virtual folder, and you have the exact virtual folder that you want!
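If you want to try the logic outside of Thunderbird first, here is a minimal sketch that runs in any javascript shell. The `fakeMessage` stand-in is hypothetical: it mocks only `getStringProperty`, and it assumes the keywords property is a space-separated string of tags. The `isActiveProject` wrapper is also just for the example, since the real search term uses the bare expression as its last statement.

```javascript
// Stand-in for the database header object; only getStringProperty is mocked.
function fakeMessage(keywords) {
  return {
    getStringProperty(name) {
      return name === "keywords" ? keywords : "";
    },
  };
}

// Same test as the search term, wrapped in a function so it can be reused.
function isActiveProject(message) {
  const tags = message.getStringProperty("keywords");
  return /pro/.test(tags) && !/done/.test(tags);
}

const results = [
  isActiveProject(fakeMessage("pro42")),      // project tag only
  isActiveProject(fakeMessage("pro42 done")), // finished project
  isActiveProject(fakeMessage("personal")),   // no project tag
];
```

Note that `/pro/` matches “pro” anywhere in the keywords string, so a tag like “approved” would also match. If that matters, a word-boundary pattern such as `/\bpro/` is one way to tighten it.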

OK, this is just for geeks, but it is really powerful in letting you define folders that precisely match the workflow that you want. There are also a few other geek-friendly search terms, including the much-requested regular expression search by subject or other header. For details, see the updated FiltaQuilla page on this site.

FiltaQuilla is still in experimental status, though I have now nominated it to be public. Still it may be a few weeks before it gets there. If you are an existing FiltaQuilla user, you will need to go to the download page directly and download and install the new version.

Enjoy!


Bad effects on junk training corpus from change

December 2, 2009 – 9:41 am

I’ve been tracking some difficulties in my junk analysis recently, which were caused when I enabled some experimental changes to tokenization (I added full tokenization of the Received: and x-spam-status: headers). At the same time, I started some experiments where I am automatically training certain incoming emails as good.

What I am seeing is that the common, unchanging words in the Received: header, like “received:from” and “received:(exim”, are persistently occurring with a moderate “good” score, such as 36, even after training junk messages with those headers. There are a lot of these little meaningless tokens per message though, and they are dragging down the junk score of junk messages into the Uncertain category.

I think what is happening is this, and it could be caused by any change in your common environment. I started adding new tokens such as “received:from”, without restarting training. Because I also started training temporarily many more good messages than junk, these new tokens are showing up disproportionately as good.

Suppose, for example, I start with 1000 good messages and 1000 junk messages in my corpus, then suddenly add a new token to all incoming emails. Then I train 100 good messages with that new token, and 10 junk messages. The spam corpus will claim that the new token appears in roughly 10% of good emails (100 of 1100), but only about 1% of junk emails (10 of 1010), so the presence of that token is a marker that the message is more likely to be good. Which is not true, since now ALL emails have that new token!
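The arithmetic is easy to check with a few lines of javascript. This is only a sketch of the per-class token frequencies from the example, not the formula the bayes classifier actually uses, and the names here are made up for illustration.

```javascript
// Fraction of trained messages of one class that contain the token.
function tokenFraction(withToken, totalTrained) {
  return withToken / totalTrained;
}

const before = { good: 1000, junk: 1000 };   // corpus before the new token
const newlyTrained = { good: 100, junk: 10 }; // all of these carry the token

const goodFraction = tokenFraction(newlyTrained.good, before.good + newlyTrained.good);
const junkFraction = tokenFraction(newlyTrained.junk, before.junk + newlyTrained.junk);

// How much more often the token appears in good mail than in junk mail.
const ratio = goodFraction / junkFraction;
```

The token looks about nine times more common in good mail than in junk, even though every new message of either kind contains it.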

I suppose that one defense against this would be to make sure that the proportion of good and junk emails trained always stays about the same. That is not easy to do, however.

Extension driven development

November 28, 2009 – 11:37 pm

What then do I mean by “extension driven development”? It is the concept of changing the way that Thunderbird is developed and distributed, with a bare minimum set of core code, and the main features presented as a set of extensions, shipped with the product, that can be enabled or disabled by users.

I don’t have any illusions that this has a significant chance of being implemented, and I’m not even sure it’s a good idea myself. But I ask you to suspend disbelief for a minute, and imagine a change to the development culture and process.

An email client is different from a web client in many ways, but one significant way is that there is no real need for a fat uniform core product that developers can target (such as web developers for Firefox). So we are free to allow wide changes in our product configuration that would not make sense for Firefox. There is really no fundamental need for Thunderbird to be presented as a single, fat, feature-laden client.

Instead, ship Thunderbird as a minimal base with a collection of extensions. The extensions could be in a variety of statuses. At one status extreme, “Core” extensions would be enabled by default, would be fully localized, and their updates would be shipped with updates to the core product, rather than through AMO. Many existing core features would be converted into “Core” extensions that could be disabled if desired (for example bayes junk processing, or gloda.) At the other extreme, “Pilot” extensions would be shipped with the core product in a disabled state, would be updated by AMO, and not fully localized. There would also be “Standard” extensions that are shipped with the product, not maintained through AMO, but would not be enabled by default. Lightning might be one example of this. FiltaQuilla or JunQuilla could easily get added to this category in the future, or popular extensions like ThunderBrowse.

So why would you do such a crazy thing? For several reasons.

First, you would have a path to add features to the program that is not as generally disruptive as has been, for example, gloda or the new message header. By not enabling a new extension, an existing user would not see changes to their workflow that they did not want or appreciate. Also, new features need not delay the release of new versions of Thunderbird, as “Pilot” status extensions could be updated through AMO.

Second, new complex features like gloda, even though they are developed by the core team (well mostly asuth) are in a state of rapid flux, and would really benefit from allowing updates more frequently than even the accelerated release process will allow.

Third, you provide a natural path for outside developers to add contribution to the product without having to completely submerge themselves in the Mozilla culture, or give up complete control of their creation.

Fourth, this really recognizes that the use of an email client is highly personal. Basic users could be presented a basic email client. Advanced users could easily add advanced features. (Existing AMO-based extensions are also good for this, but the quality is not uniform, and they frequently are not kept up to date. And the standard product is still very fat with lots of features that are unneeded by most users.)

Fifth, this would solve the serious issue with Thunderbird of how hard it is for the average user to install addons (because the most popular and important addons would be shipped with the product).

There’s another dimension to this, and that is the developer’s relationship with his or her extension. I know that I feel a responsibility for my extensions that is beyond the responsibility I feel for any core code. You can see that in the documentation that I provide, and my reliability in responding to issues. I think that many other extension developers are like that as well. I’m guessing that they would be delighted to see a higher level of promotion of their work, without the need to cede complete control that incorporation in core might involve.

I suppose I could write a book on what this might look like, but for now let me leave it here.

rkent

A day in my spam life

November 28, 2009 – 10:23 pm

Just for laughs, I looked at statistics for my spam yesterday. Here’s the results:

1) Spams caught by server-side SpamAssassin: 109

2) Spams caught by local bayes filter after passing SpamAssassin: 49

3) Spam marked by me that got through both filters: 2 (junkpercent scores were 63 and 66)

Total Spam: 160

For the server-side SpamAssassin filter, my spam detection limit is 5.0. This is the stock SpamAssassin filter supplied by default to all accounts by a large, inexpensive web hosting provider (hostmonster).

For my local bayes filter, my spam detection limit is set at 75.

I *never* have emails falsely marked as spam. I train spam reliably using the Uncertain folders in my JunQuilla addon. I have a limit of 300,000 tokens (with a current count of 118,690), 2097 good messages trained, 3748 junk messages trained.

Oh, and I use 2 customizations to tokenization using new hidden preferences available in TB3. First, I tokenize into words the Received header (it is disabled by default), plus I tokenize into words SpamAssassin’s X-SPAM-STATUS header (which is accepted as a single token by default, that is not broken into individual words.) I don’t believe these are very important, however, but I do think that they help a little.

Note to self: blog about customized tokenization settings in TB3, and try to do some analysis.

ToneQuilla version 1.0.1

November 28, 2009 – 9:00 pm

ToneQuilla version 1.0.1 has been posted on AMO for review (or is available on this site here.) This fixes a bug reported in the forum, where for some users .wav files were playing in the default media player, instead of using Mozilla’s internal code.

Maybe I need a search extension – SearchaQuilla?

November 20, 2009 – 11:13 am

The last few weeks I’ve been adding custom search terms to my FiltaQuilla extension using the new nsIMsgSearchCustomTerm interface, which can then be used in searches, virtual folders, or filters. But I keep coming up with new things that I want to do. That delays my packaging of FiltaQuilla 1.0.0 for non-experimental release. Maybe I should quit adding this stuff to FiltaQuilla (which is already pretty large with all of its filter actions) and define a new search-oriented extension, called probably SearchaQuilla?

So far, I have added the following new search terms:

BCC – locate items in the BCC field

Subject Regex – search the subject using a javascript regular expression

Header Regex – search any specific header using a javascript regular expression

Javascript – load javascript in a text field, and program your own search given an nsIMsgDBHdr object

Tag of Thread Head – match a tag in the head of a message’s thread

Tag of Thread Messages – match a tag near the message in its thread (within +/- 10 messages by default)

Address in Thread – match an address near the message in its thread (within +/- 10 messages by default)
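To give an idea of what “near the message in its thread” means, here is a rough sketch of the windowed match in plain javascript. It is not the extension's code: the thread is modeled as a simple array of objects with a `tags` array, the function name is made up, and whether the message itself counts as “near” is an assumption (here it does).

```javascript
// Hypothetical windowed match: does any message within `window` positions
// of thread[index] (including the message itself) carry the given tag?
function tagNearMessage(thread, index, tag, window = 10) {
  const start = Math.max(0, index - window);
  const end = Math.min(thread.length - 1, index + window);
  for (let i = start; i <= end; i++) {
    if (thread[i].tags.includes(tag)) return true;
  }
  return false;
}

// A 25-message thread where only the first message is tagged.
const thread = Array.from({ length: 25 }, (unused, i) => ({
  tags: i === 0 ? ["important"] : [],
}));

const nearHead = tagNearMessage(thread, 5, "important");    // index 0 is in range
const farFromHead = tagNearMessage(thread, 15, "important"); // range starts at 5
```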

This stuff can be useful outside of filters, in fact I am mostly using them personally to define virtual folders. So I’ll probably move them to a new extension, and try to get FiltaQuilla out the door finally.

rkent

JunQuilla version 1.0.0 released

November 16, 2009 – 1:08 pm

Today I released a version of JunQuilla that supports SeaMonkey 2.0, and the latest versions of Thunderbird including the upcoming 3.0RC1 and 3.0.0. The new version can be downloaded from the AMO site here. I’ve also submitted this version for review so that it can get out of experimental status.

JunQuilla is my attempt to extend the user interface in the Mozilla mailnews product to provide the information that I believe is needed to properly manage the bayesian junk filter. I suppose that most of these features should really be in the core product, but I found that support for that was not very strong, so I decided to do most of this in an extension instead. These backend features have only been added to the core code in the last couple of years, so this extension will only work on newer versions of the Mozilla email clients (Thunderbird 3.* versions, and SeaMonkey 2.* versions.)

Version 1.0.0 fixes some bugs that have been reported in previous releases, provides partial support for SeaMonkey (except for the “Uncertain” folders), and adds a number of new features:

Junk Options

You can set critical overall junk options in the standard junk options screen (previously, this was only possible in the more obscure addons/options area). In Thunderbird, select Tools/Options/Security/Junk. In SeaMonkey, select Edit/Preferences/Mail & Newsgroups/JunQuilla. There you will see a display like this:

[Screenshot: JunQuilla Options]

“Junk threshold” is the percentage value as calculated by the bayes classifier for each message, above which a message will be classified as junk. This should be set as low as possible, though always high enough to avoid having any real messages classified as junk. The default value of 90 is much too high for a well-trained bayes classifier.

The “Maximum token count” is a measure of the resources that the junk classifier will use. The higher it is set, the more accurate your classifications will be. The default value of 100,000 is probably too low for good classification performance. I’ve had good results with 300,000 – and JunQuilla will set your value to this when first installed. If this value is too high, and you have trained a lot of messages, then memory usage may be excessive.

The other parameters are read only, and are displays of values from your training file. The “Current token count” shows how many junk training tokens (which are like words) are currently in use. You probably won’t get good performance until this number is over 10,000 – and it really should be more like 100,000. “Good” and “Junk” messages trained shows how many messages have been used to train the junk filter. Ideally the number of junk and good messages should be more or less equal. If they are not, then pick some previously untrained messages and train them.

When the number of tokens exceeds the maximum value, then Mozilla mailnews will prune the training file in a large chunk, typically reducing both the number of trained messages and the number of tokens by about half.

Disable/enable junk processing for a folder.

You can set an “inherited folder property” to allow you to selectively enable or disable junk processing for folders. This has two main uses.

  • If you have server-side filters that process email in IMAP, then you may already know that certain folders contain either junk mail or good mail, and don’t want to waste time processing them locally – or take the risk that they will be processed incorrectly.
  • Mozilla mailnews core code now supports junk processing of RSS and News folders. You can select certain RSS or News folders, and then junk processing will run on new posts to those folders. This will also enable the standard user interface features that allow you to train messages as good or junk in those folders.

To set this, right click on a folder in the folder tree, and select Properties, then the “General Information” tab. At the bottom, you will see this:

[Screenshot: JunQuilla folder properties]

“Analyze Junk” is an inherited folder property. What that means is that each folder can either get its value from its parent, or have it set locally. The default value depends on the characteristics of the folder itself. So for example, this would be disabled in News by default, but enabled in IMAP. To change the value, first clear the “Inherit” checkbox, then set the value that you want in “Enabled”. If you change a value for a folder, then the value will also change for the children of that folder (assuming that they have the default “Inherit” checked.)
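The lookup behind an inherited property can be sketched in a few lines of javascript. This is an illustration of the concept only, not the mailnews implementation: the folder objects, the `analyzeJunk` field, and the `resolveAnalyzeJunk` function are all hypothetical, with `undefined` standing for a checked “Inherit” box.

```javascript
// Hypothetical folder model: `analyzeJunk` is true or false when set locally,
// or undefined when the folder inherits the value from its parent.
function resolveAnalyzeJunk(folder, defaultValue) {
  // Walk up the folder tree until some ancestor has a local value.
  for (let f = folder; f; f = f.parent) {
    if (f.analyzeJunk !== undefined) return f.analyzeJunk;
  }
  // Nothing set anywhere: fall back to the default for this kind of folder.
  return defaultValue;
}

const account = { name: "IMAP account", parent: null, analyzeJunk: undefined };
const lists = { name: "Lists", parent: account, analyzeJunk: false };
const bugmail = { name: "Bugmail", parent: lists, analyzeJunk: undefined };

const accountValue = resolveAnalyzeJunk(account, true); // uses the IMAP default
const bugmailValue = resolveAnalyzeJunk(bugmail, true); // inherited from "Lists"
```

Disabling the property on “Lists” also disables it for “Bugmail”, because “Bugmail” has no local value of its own.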

Toolbar “Is Junk” and “Is Good” button

You can add two new buttons to your toolbar – “Is Junk” and “Is Good”. Here’s what they look like, next to a standard “Junk/Not Junk” icon:

[Screenshot: JunQuilla “Is Good” and “Is Junk” toolbar buttons]

To add these buttons, right click on a toolbar, select “Customize”, then drag the buttons to the desired location.

The standard Junk button will always show as “Junk” when it thinks a message is good, and “Good” when it thinks that a message is junk. But that means that we can only classify a message as “Good” when it has been falsely classified as junk, and we never want our junk filter to do that. The “Is Good” button is meant to be used in the “Uncertain” folders to give you a means to train a message as “Good” there.
