Saturday, August 12, 2006

Generics

Boo doesn't have generics. This makes me sad. Since I want a strongly typed matching solution, I _really_ want generics. I also want macros. ARRRRRGGGGG! Why won't people give me EVERYTHING? One option is to use arrays instead of lists. Another option is to add generics to the language. I think I'll start with the first and then migrate to the second. That should be sufficient for my needs.

Ok, now on to the O/R mapping.

First, since this is all about types, we're going to add the type information for (most) things in the database. Why, you ask? Shouldn't we know if it is a FirstName or a LastName because we have knowledge of the table? Yes, that is correct, but you're thinking too simplistically. In true OO fashion, FirstName and LastName will be nothing more than base classes. We will end up with an AsianFirstName, AngloFirstName, HispanicFirstName, etc... The same is true for last names. Then, when we consult our statistics, we'll be able to use statistics based on the frequency of the name within its ethnic culture (and also within the geographic location). Therefore, we want to be able to generate and store this extra type information. In addition, we'll want to use the type information when comparing names with the matching engine. We might use a completely different function when comparing a HispanicFirstName to an AngloFirstName than when comparing a HispanicFirstName to an AsianFirstName (auto-reject, anyone?). By employing multi-method dispatch on the type information, we can quickly choose the right matching logic.
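To make that concrete, here's a tiny Python sketch of the dispatch idea (Python rather than Boo, and the subtype names and matcher functions are placeholders I made up, not the real design):

# Hypothetical name subtypes; the real system would generate these.
class FirstName:
    def __init__(self, value):
        self.value = value

class AngloFirstName(FirstName): pass
class HispanicFirstName(FirstName): pass
class AsianFirstName(FirstName): pass

def statistical_match(a, b):
    # stand-in for the frequency-based comparison
    return 1.0 if a.value == b.value else 0.0

def auto_reject(a, b):
    return 0.0  # cross-culture pairs that should never match

# dispatch table keyed on the (type, type) pair
MATCHERS = {
    (HispanicFirstName, AsianFirstName): auto_reject,
    (AsianFirstName, HispanicFirstName): auto_reject,
}

def compare(a, b):
    # pick the matcher registered for this exact type pair,
    # falling back to the generic statistical matcher
    return MATCHERS.get((type(a), type(b)), statistical_match)(a, b)

print(compare(HispanicFirstName("Jose"), AsianFirstName("Wei")))   # 0.0
print(compare(AngloFirstName("John"), AngloFirstName("John")))     # 1.0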

But enough skipping ahead, back to the O/R mapping.

Obviously, each Element will be stored in its own field. It will have an associated type information field. It may also have a pointer into a metadata table. I'm not sure about that one yet; we'll have to see what sort of metadata we keep that falls outside the type system.

An entity becomes a little more complicated. Remember that an Entity is a collection of other entities, groups, and elements.

Let's look at two different Entities:

class Name(Entity):
    first_name as FirstName
    last_name as LastName
    middle_initial as Nullable(MiddleInitial)
    name_suffix as Nullable(NameSuffix)

Wow, what's that Nullable thing? Well, the type system should include whether or not the field can be blank, and Nullable is just as good a choice as any. I'm really starting to want to do this project in O'Caml. I'm getting very close to breaking open the docs on F#. Of course, now that I think about it, C# might be a good choice. I wouldn't need to modify the parser if I had introspection, which C# gives you. Plus it is strongly typed and has generics...hmmm...I had forgotten introspection...drat!

Ok, back to the task at hand.

In most cases, the Name entity will have a Name table. Each element within the entity will correspond to a field in the table. There will also be the associated type field (and perhaps metadata fields). Each record in the table will also have a unique primary key.
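As a rough illustration, the Name table might look something like this (a SQLite sketch; the column names and the one-type-column-per-element convention are just my current guesses):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE name (
        id                INTEGER PRIMARY KEY,  -- unique primary key
        first_name        TEXT NOT NULL,
        first_name_ty     TEXT NOT NULL,        -- e.g. 'AngloFirstName'
        last_name         TEXT NOT NULL,
        last_name_ty      TEXT NOT NULL,
        middle_initial    TEXT,                 -- Nullable(...) becomes a nullable column
        middle_initial_ty TEXT,
        name_suffix       TEXT,
        name_suffix_ty    TEXT
    )
""")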

That was easy...

Now, for a more complicated example:

class Person(Entity):
    name as Name
    address as Address
    ssn as Nullable(SSN)
    birthday as Nullable(Date)

For a person we will have a unique primary key, but we will also store the primary key of both the name and address information. The ssn and birthday elements will be stored "in-line" like the name elements were in the previous example.
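Sketched the same way (again SQLite, names invented), the Person table keeps foreign keys for the entity-valued fields and stores the element-valued ones in-line:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE name    (id INTEGER PRIMARY KEY);  -- element columns elided
    CREATE TABLE address (id INTEGER PRIMARY KEY);

    CREATE TABLE person (
        id          INTEGER PRIMARY KEY,
        name_id     INTEGER NOT NULL REFERENCES name(id),     -- entity in its own table
        address_id  INTEGER NOT NULL REFERENCES address(id),
        ssn         TEXT,                                     -- nullable elements in-line
        ssn_ty      TEXT,
        birthday    TEXT,
        birthday_ty TEXT
    );
""")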

Of course, we might want to force denormalization of the table...we could try something like

class Person(Entity):
    [inline]
    name as Name
    address as Address
    ...

Now, name has an inline attribute and will not go in a separate table. Instead, the name fields will be placed in the Person table. However, when we extract the Person object from the table, we'll extract a Name object as well, so you can't tell the difference from the user side.
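Something like this Python sketch (column and class names invented) shows why the user can't tell: the inlined columns get reassembled into a Name on the way out.

class Name:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn

def load_person(row):
    # row holds flattened person-table columns; rebuild the Name so
    # callers never see the denormalization
    name = Name(row["name_first_name"], row["name_last_name"])
    return Person(name, row["ssn"])

p = load_person({"name_first_name": "John", "name_last_name": "Doe", "ssn": None})
print(p.name.first_name)  # John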

I'm hesitant to allow an [inline] attribute, because you get to the same point as with C++ and its inline modifier. The compiler can't inline without you telling it, so you're forced to make decisions that the compiler should be able to make. Therefore, if we have an [inline] modifier it will be more like "auto" in C++: a hint, but the compiler can do what it wishes with regard to inlining. Hopefully, its usage will vanish just like auto's.

Ok, next time we'll look at the O/R mapping of groups. I'm still not using the mailing list from SF because I sent things to it that I never got back, so I'm waiting until I have a successful test run before I move there for good.

Sourceforge Site Available

You can now go to The STARS Sourceforge Site and sign up on the see-stars-devl mailing list. I hope to make most of the technical discussions through that list so that they will be archived and open for discussion. I will also post previous blog entries to the mailing list for historical purposes.

Friday, August 11, 2006

SourceForge site

http://see-stars.sourceforge.net will be the sourceforge site. It has been created and I am in the process of getting someone to set it up ;-)

Secondly, I'd like to discuss how the type system will interact with the database. First, I said that we're going to be strongly typed. This is not just fancy terminology; it will affect how the database is structured. For instance, we might want to subdivide names by ethnicity. John Doe might get the standard US ethnicity, but Wing Fe might get an Asian ethnicity. This can be represented by using subtypes. Therefore, this type information must be stored in the database for fast access. In addition, we might have types for strongly cohesive groups, loosely cohesive groups, etc... (perhaps even a composite group that composes multiple strongly cohesive groups so you can see the hierarchy).

But wait, you say! If we're only doing updates then it won't take too long to create all this information, but our initial database population will take FOREVER! People will get tired of waiting! To that, I say: you're absolutely right. That is another beauty of callbacks. We're not going to do it for the initial population. Instead, we're going to use reasonable defaults, but we're going to allow for processes to continually improve the data. I think consumers expect this. They want their data fast, but they also want it right. They hope that over time their data gets better. To ensure that, we'll have reasonable defaults (so they won't get the Asian/American matching function, they'll get the statistically based one), but when we correct that decision later, we'll let them know through the callbacks. So, they can make their business decisions quickly and then revise them when the situation merits it.

Of course, if they want to wait around, they definitely can, but we need to be adaptive and fast and continuously improving!

Thursday, August 10, 2006

Two quick things

1. All operations must be idempotent. I'm not sure (yet) how to enforce that...it may just be a really strong suggestion.

2. Versioning will be a big part of the system. We need to be able to add fields to an entity and remove fields from an entity, inline an entity (more on that later) and extract an entity. I'm sure there are a number of "refactoring" tools that come from this, but I want them FIRST, not "when there is time." For instance, if I want to add a Title to a Name, then that needs to be as easy as adding title as Title to the Name class and running an upgrade program with an optional map function to create the title given the name (the map could set them all to a default (blank) title or it could try to guess a title of Mr or Mrs based on a derived gender).
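In Python pseudocode (the helper and its signature are mine, not a real tool yet), that add-a-field upgrade might look like:

def upgrade_add_field(records, field, map_fn=None):
    # one pass over existing records; the optional map function
    # computes the new field, otherwise it defaults to blank
    for rec in records:
        rec[field] = map_fn(rec) if map_fn else ""
    return records

def guess_title(rec):
    # hypothetical map: guess Mr/Ms from a derived gender
    return {"M": "Mr", "F": "Ms"}.get(rec.get("gender", ""), "")

names = [{"first_name": "John", "last_name": "Doe", "gender": "M"}]
upgrade_add_field(names, "title", guess_title)
print(names[0]["title"])  # Mr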

Right now, I know I want automated upgrades when I
a) add a field
b) remove a field
c) inline an entity
d) extract an entity
e) add/change the validation method
f) add/change the normalization method
g) add an implied attribute

Also, we better darn well be able to downgrade!!!!!

Creation

Now that we can define the objects (though not rigorously, that's coming later), we need to know how to create them. Obviously, object creation could mean invoking the database for lookups and the like, so it needs to be well abstracted. In this case, we'll use the abstract factory pattern to create factory objects which will create our entities. For example, if you define the following entity:

class PersonAtAddress(Entity):
    name as Name
    ssn as SSN
    birthday as Date
    address as Address

Then you will end up with the following abstract factory

class AbstractFactory:
    def getNameFactory:
        ...
    def getAddressFactory:
        ...
    def getPersonAtAddressFactory:
        ...

class PersonAtAddressFactory:
    def Create(name_key as Name.ID, ssn as SSN, birthday as Date, addr_key as Address.ID):
        ...

Now, we have a consistent, programmatic way to create entities. We can wrap these calls in CORBA or SOAP or whatever, but the foundation is solid.

Next time, we'll start looking at how the objects map to a relational database.

Wednesday, August 09, 2006

Two Things

I'm pretty sure of two things at this point. First, the name of the system will be the Statically Typed Advanced Recognition System (STARS). I hope to get a sourceforge site for it up soon.

Second, the system will be written primarily in Boo and run on mono. Boo has all the right things for this project.

1. Strongly Typed
2. Type Inference
3. Macros
4. User defined compile steps (this will be very useful when creating the database schema from the source files).
5. .NET/Mono compatible - at some point people will want to run their own code, and there is a good chance that code will be .NET.
6. Clean syntax - based around python (I prefer Ruby, but...)
7. Duck typing
8. Multi-thread capable
9. Functional + Object Oriented

As soon as I get the SF project up, I'll post a link to it here. I'd like to write the specifications in the project's mailing list.

Callbacks

Callbacks are the key to a good recognition system. A typical batch recognition system forgets about the importance of letting consumers know as soon as information is available. However, this is the linchpin of a good recognition system.

Let's assume that a typical use case is the following:

1. Run a large file through the system to create a repository
2. Run the files through the repository to ensure correct linkage
3. Rinse and repeat monthly

You have to run the file through twice because you don't know what might happen later in the system to change one of your records. This is because the system is not set up to tell you about events.

If, instead, we allowed the system to tell you about important things that are happening, you would be able to complete your run in one pass. So, what needs events? Well, first let's say that we'll use a publish/subscribe mechanism so that only those events that we're interested in will be delivered to us. Second, let's make the rule that anything that could have an impact on the end result should have an event fired. That means an event fires any time an Entity or Group is created or deleted, as well as any time an Entity is moved from one Group to another. I would say that Element updates should be allowed to have events, but not forced to. It could be that updating the salary field doesn't affect anything and you don't need that information to be disseminated.
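A bare-bones Python sketch of the publish/subscribe piece (the event names are placeholders):

from collections import defaultdict

class EventBus:
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, event_type, handler):
        # consumers register only for the events they care about
        self._subs[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subs[event_type]:
            handler(payload)

bus = EventBus()
bus.subscribe("EntityMoved", lambda e: print("entity", e["entity"], "-> group", e["group"]))
# creates, deletes, and moves would all publish through the same bus
bus.publish("EntityMoved", {"entity": 42, "group": 7})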

There are lots of optimizations you can do to make this fast and I don't want to get into those right now, but suffice it to say that the event/callback mechanism can make for an extremely flexible (and efficient!) system.

Obviously, the code for the callback won't be in the same file as the code defining the entities. However, we may want to augment the event with some information at event generation time; therefore, we allow the override of the OnX methods (where X is something like Consolidation).

For example:

class Consumer < Group
    def Consolidation(Group other):
        if ...:
            consolidation_reason = ...
        elif ...:
            consolidation_reason = ...

    def OnConsolidation(ConsolidationEvent event):
        event.reason = consolidation_reason

I'm not sure, but you might even be able to suppress events...I don't necessarily like that, but it could come in handy.

Adding operations

First, for the sake of this post, let's change the syntax a bit. Instead of saying

Entity X

We're going to say

class X < Entity

Basic types, like FirstName, will become less basic and will inherit from Element (or in this case, a derivative: StringElement)

class FirstName < StringElement

Groups will go from the generic Group(PersonAtAddress) to

class PersonAtAddress < Group

Now, this is not going to be the final syntax, but I want you to think of it as inheritance, because we're going to overload functions.

For instance, an Element should know how to validate itself. A simple example might be

class FirstName < Element
    def validate:
        return representation =~ /[a-zA-Z-]+/

A more complicated example might make a SOAP call to the validation server defined for that type (we'll see how to do that in a later post).

Elements also need to know how to normalize themselves. For instance, a name might wish to be represented in all upper case:

class FirstName < StringElement
    def normalize:
        representation.upper!

Another operation you're probably screaming for by now is creation. In this case, the StringElement does the right thing for you: it sets the internal representation to a passed-in value. However, you might want to do more. For instance, you might want to keep a map of how often you see each first name to handle statistics-based matching. Therefore, you want

class FirstName < StringElement
    static Hash seen  # would be @@seen in ruby
    def initialize(String value):
        super(value)
        seen ||= new Hash(0)
        seen[value]++

Other methods might include update and clear.

For Entities, we need the following operations:

initialize - pass in a value for each of the "member variables" and assign them if they are consistent. Otherwise, throw an exception.
validate - validate the state of an entity
update - update one of the fields of an entity

Update is the most interesting because we could update one or many of the fields. I think it might be interesting to use nil for fields we don't want to update, but I'd rather not. Perhaps a hash? But I don't want to miss a field by accident, and I'd like it to be "compile-time" checked. So, we're back to nils.

class Name < Entity
    FirstName first_name
    MiddleName middle_name
    LastName last_name
    NameSuffix name_suffix

    def update(FirstName fn,
               MiddleName mn,
               LastName ln,
               NameSuffix ns):
        begin_transaction:
            first_name.update(fn) unless fn.nil?
            middle_name.update(mn) unless mn.nil?
            last_name.update(ln) unless ln.nil?
            name_suffix.update(ns) unless ns.nil?
        rescue => { rollback }  # undo all operations in the transaction (maybe a validate fails?)

I'm fairly convinced that a transaction is the right way to handle this situation, but I'm not convinced of the syntax; there are definitely other ways to handle it. For instance, you could have an updater that makes the determination based on the exit strategy of the function:

FirstName::Updater updater(first_name, fn)

It could also handle the nil? case. If the function exits normally, the updater commits; otherwise it rolls back. Regardless, a transaction is needed for exception safety.
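Here's that updater idea as a Python context-manager sketch (the Element class and its value attribute are stand-ins): commit on normal exit, roll back if the block raises.

class Element:
    def __init__(self, value):
        self.value = value

class Updater:
    def __init__(self, element, new_value):
        self.element, self.new_value = element, new_value
        self.old_value = element.value

    def __enter__(self):
        if self.new_value is not None:  # handles the nil? case
            self.element.value = self.new_value
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.element.value = self.old_value  # roll back on exception
        return False  # let the exception propagate

first_name = Element("john")
with Updater(first_name, "JOHN"):
    pass  # validation would run here; raising would restore "john"
print(first_name.value)  # JOHN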

You may also delete an entity. I imagine that this should return a boolean indicating whether or not the delete succeeded, but deletes probably shouldn't fail.

A Group has some additional operations. Since a group is just a collection of entities, it may have an entity added to it or removed from it. In addition, it may also have a list merged into it or part of another list spliced into it.

Let's look at each operation in the abstract:

Add:
original list -> [A,B,C]
value -> D
new list -> [A,B,C,D]

Remove:
original list -> [A,B,C]
value -> B
new list -> [A,C]

Merge:
original list -> [A,B,C]
value -> [D,E,F]
new list -> [A,B,C,D,E,F]
new value -> []

The merge adds all of the elements of the value into the list and deletes them from the value.

Splice:
original list -> [A,B,C]
value -> [D,E,F], 1, 1
new list -> [A,B,C,E]
new value -> [D,F]

The splice takes an array plus begin and end offsets, adds the elements in that range to the list, and removes them from the value.

The merge we will call a consolidation and the splice we will refer to as a split.
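On plain Python lists, the four operations come out to something like this sketch (mutating both lists, per the semantics above):

def add(lst, value):
    lst.append(value)

def remove(lst, value):
    lst.remove(value)

def consolidate(dst, src):            # merge
    dst.extend(src)                   # add all of src's entities to dst...
    src.clear()                       # ...and empty src

def split(dst, src, begin, end):      # splice, with inclusive offsets
    taken = src[begin:end + 1]
    del src[begin:end + 1]
    dst.extend(taken)

a, b = ["A", "B", "C"], ["D", "E", "F"]
split(a, b, 1, 1)
print(a, b)  # ['A', 'B', 'C', 'E'] ['D', 'F']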

An important property of a Group involves its MetaGroup. The MetaGroup is the Set that consists of all the instances of the Group.

So, let's say we have a Consumer Group. We want to say that a consumer cannot appear in more than one group. That means that the order of the union of all groups is equivalent to the sum of the orders of all groups. To do this, we say that for every group, its MetaGroup must be a true Set, and cannot include duplicate entities.
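A quick Python sketch of that invariant (the API is invented): a registry that refuses to put one entity in two groups.

class MetaGroup:
    def __init__(self):
        self._owner = {}  # entity -> owning group

    def add(self, group, entity):
        if entity in self._owner and self._owner[entity] is not group:
            raise ValueError("entity already belongs to another group")
        self._owner[entity] = group
        group.add(entity)

consumers = MetaGroup()
g1, g2 = set(), set()
consumers.add(g1, "consumer-42")
# consumers.add(g2, "consumer-42")  # would raise: no duplicates across groups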

Now, back to the operations. Each one of these operations must take place in a transactional environment. For example, an update can fail because of an implied field mismatch.

What's an implied field? A field in a group may be declared strongly or weakly implied. I'm not sure how this declaration will take place (I hate the .NET attribute syntax). However, once it is so declared, it will enforce conformance for all the elements in a group. A weakly implied field will ensure a no-conflict match for a field across all the entities in a group. So, if we say that NameSuffix is weakly implied for the Consumer Group, that means all Consumers in the same Consumer Group must have the same NameSuffix (or a blank one). A strongly implied field removes the ability for blanks to match.
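In Python terms (treating the empty string as blank, which is my assumption), the two checks might be:

def weakly_implied_ok(values):
    # all non-blank values must agree; blanks match anything
    non_blank = {v for v in values if v != ""}
    return len(non_blank) <= 1

def strongly_implied_ok(values):
    # blanks lose their wildcard status: every value must be present and equal
    return "" not in values and len(set(values)) <= 1

print(weakly_implied_ok(["Jr", "", "Jr"]))    # True
print(weakly_implied_ok(["Jr", "Sr"]))        # False
print(strongly_implied_ok(["Jr", "", "Jr"]))  # False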

Ok, this post has gone on long enough. In future posts, we'll return to the formal specification of these operations.

Type safe recognition

I believe that recognition (part of Customer Data Integration) can (and should!) be made strongly type safe. Basic types (in the US) might include FirstName, LastName, SSN, Street, CompositeStreet. More advanced types would include an Entity (a collection of basic types) and a Group (a collection of entities).

Notice how this mirrors a programming language. A basic type would be something like an int or float. An entity is like an object, and a Group would be a collection of objects (think an array).

This leads to a few interesting questions. Let's consider the following:

Entity Name:
    FirstName first_name
    MiddleName middle_name
    LastName last_name
    NameSuffix name_suffix

Entity Address:
    CompositeStreet street_line_1
    CompositeStreet street_line_2
    City city
    State state
    Zip zip

Entity Person:
    Name name
    SSN ssn
    Date birthday

Entity PersonAtAddress:
    Person person
    Address address

Entity Consumer:
    Group(PersonAtAddress) occupancies
    SSN preferred_ssn
    Address preferred_address

At this point, we have types for our recognition system that can be manipulated and understood by both humans and computers. We can further augment them. I'll work on describing the augmentation/annotation of the types in the next blog.

Questions:

Should an entity be allowed to contain more than one group?
Should an entity be allowed to contain another entity that contains a group?
For example:

Entity Household:
    Group(Consumer) members
    Address preferred_address

More to come!