I am currently working on a project that requires me to pull in a large amount of data from an external database into a SharePoint farm and index the content for use in our search service application.
The dataset is currently around a million items and getting larger every day, so the search crawler obviously needed to be scalable and not take too much time indexing all this content.
I set about creating a standard .Net Connector Assembly that implemented Finder (ReadList) and SpecificFinder (ReadItem) methods. MSDN has a diagram of the architecture for the BCS and Search framework.

I won’t go into detail on how to create .Net Connector Assemblies and BDC models because there are other articles out there that show you how. In this blog post I am going to detail how I made our connector scalable and performant by caching the content in memory, so that the crawler can index the content from memory instead of making numerous (1 million+) calls to the database.
I encapsulated this caching mechanism into a library so that I can re-use the logic throughout my application on many BCS connectors.
I have also published the library containing the code so that this pattern can be re-used on other projects. Feel free to download it and use it in your own projects.
You can download the code to follow along here: http://www.athousandthreads.com/att.sharepoint.patterns.zip
Right down to the detail
The search crawler working on external connectors uses the following workflow to crawl all the content.
- The crawler first calls your Finder method (ReadList) on your .Net Connector Assembly. Your Finder method needs to return the identifiers of all the items you want to be indexed.
- The crawler then calls your SpecificFinder method (ReadItem) once for each item it wants to index, passing it that item's identifier.
Now when the crawler initially calls my Finder method I need to go off to the database and retrieve all the items I want to index and return them to the crawler for indexing.
When the crawler then calls my SpecificFinder method, I don’t want to go back to the database; I want to retrieve the item from the cache. I implemented this using a static collection of items that gets stored in the memory space of the MSSADM.exe process (the process that does the indexing).
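To make the saving concrete, here is a minimal sketch of the crawl workflow described above. It is not the library code: `FakeDatabase`, `CrawlSketch` and `SimulateCrawl` are hypothetical names made up for this illustration, and the "database" is just an in-memory generator that counts how often it is called.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the external database; it counts round trips.
public class FakeDatabase
{
    public int Calls { get; private set; }

    public List<KeyValuePair<long, string>> ReadAll(int count)
    {
        Calls++;
        return Enumerable.Range(1, count)
            .Select(i => new KeyValuePair<long, string>(i, "Item " + i))
            .ToList();
    }
}

// Hypothetical connector: the Finder fills the cache, the SpecificFinder reads from it.
public class CrawlSketch
{
    private readonly FakeDatabase database;
    private readonly int itemCount;
    private Dictionary<long, string> cache;

    public CrawlSketch(FakeDatabase database, int itemCount)
    {
        this.database = database;
        this.itemCount = itemCount;
    }

    // Finder (ReadList): one database call, returns every identifier.
    public IEnumerable<long> ReadList()
    {
        cache = database.ReadAll(itemCount).ToDictionary(p => p.Key, p => p.Value);
        return cache.Keys.ToList();
    }

    // SpecificFinder (ReadItem): answered from memory, no database call.
    public string ReadItem(long id)
    {
        return cache[id];
    }

    // Simulates the crawler: ReadList once, then ReadItem for each identifier.
    // Returns the number of database round trips that were needed.
    public static int SimulateCrawl(int itemCount)
    {
        var database = new FakeDatabase();
        var connector = new CrawlSketch(database, itemCount);
        foreach (long id in connector.ReadList())
        {
            connector.ReadItem(id);
        }
        return database.Calls;
    }
}
```

In this sketch `CrawlSketch.SimulateCrawl(1000)` needs a single database round trip. A connector that went back to the database in every ReadItem call would need 1,001 for the same crawl.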
There is some logic needed to synchronise the access to this shared cache and I have encapsulated this into the following class:
Caching
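using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Practices.SharePoint.Common.Logging; // ILogger; namespace assumed from the patterns and practices library
using Microsoft.SharePoint.Administration;           // EventSeverity

/// <summary>
/// Provides a caching mechanism for BCS external connectors.
/// </summary>
/// <typeparam name="T">Type of BDC entity to store in this cache.</typeparam>
/// <typeparam name="I">Type of the identifier for the BDC entity.</typeparam>
public class CachedConnectorService<T, I> where T : BDCEntity<I>
{
    #region Private members
    // volatile so that the unlocked null check in ReadList never sees a half-published list.
    private volatile List<T> cache;
    private CachedConnectorParameters<T, I> parameters;
    private static object lockObject = new object();
    #endregion
    #region Constructor
    public CachedConnectorService(CachedConnectorParameters<T, I> parameters)
    {
        this.parameters = parameters;
    }
    #endregion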
    #region Public methods
    /// <summary>
    /// Reads individual entity from the cache.
    /// </summary>
    /// <param name="identifier">Identifier of the entity to read.</param>
    /// <returns>BDC entity.</returns>
    public T ReadItem(I identifier)
    {
        this.LogToOperations("Reading BDC Item: " + typeof(T).ToString() +
            ", identifier=" + identifier, EventSeverity.Information);
        T entity = default(T);
        try
        {
            // The cache is lost whenever the hosting process is recycled, so reload it on demand.
            if (cache == null)
            {
                this.LogToOperations(typeof(T).ToString() +
                    " cache is null, reloading cache from database.",
                    EventSeverity.Information);
                this.ReadList();
            }
            entity = this.GetFromCache(identifier);
            // Fall back to a single database read for items that are not in the cache.
            if (entity == null)
            {
                this.LogToOperations("Identifier not found in local cache, getting from database.",
                    EventSeverity.Information);
                entity = this.GetFromDatabase(identifier, parameters.DatabaseCall);
            }
        }
        catch (Exception ex)
        {
            this.LogToOperations("Exception occurred reading BDC Item: " +
                typeof(T).ToString() + ", Identifier=" + identifier + ": " + ex.Message,
                EventSeverity.Error);
        }
        return entity;
    }
    /// <summary>
    /// Reads list of entities into the cache.
    /// </summary>
    /// <returns>Collection of entities.</returns>
    public IEnumerable<T> ReadList()
    {
        // Double-checked locking: only the first caller pays for the database load.
        if (cache == null)
        {
            lock (lockObject)
            {
                try
                {
                    if (cache == null)
                    {
                        this.LogToOperations("Getting list of " +
                            typeof(T).ToString(), EventSeverity.Information);
                        // Fill a temporary list so readers never see a partly loaded cache.
                        List<T> cacheTemp = new List<T>();
                        this.parameters.PopulateCache.Invoke(cacheTemp);
                        this.LogToOperations("Loaded " +
                            cacheTemp.Count.ToString() + " " + typeof(T).ToString(),
                            EventSeverity.Information);
                        cache = cacheTemp;
                    }
                }
                catch (Exception ex)
                {
                    this.LogToOperations("Exception occurred getting list of " +
                        typeof(T).ToString() + ": " + ex.Message, EventSeverity.Error);
                }
            }
        }
        // If the load failed the cache is still null; return an empty result instead of throwing.
        List<T> snapshot = cache;
        return snapshot != null ? snapshot.ToArray() : new T[0];
    }
    #endregion
    #region Private methods
    /// <summary>
    /// Gets entity from the cache
    /// </summary>
    /// <param name="identifier">Identifier of entity to return.</param>
    /// <returns>Entity instance, or null if it is not cached.</returns>
    private T GetFromCache(I identifier)
    {
        lock (lockObject)
        {
            // The cache is still null if the last load failed; let the caller fall back to the database.
            if (cache == null)
            {
                return default(T);
            }
            return cache.FirstOrDefault(a => a.Identifier.Equals(identifier));
        }
    }
    /// <summary>
    /// Gets entity from the database using the specified delegate.
    /// </summary>
    /// <param name="identifier">Identifier of entity to return.</param>
    /// <param name="databaseCall">Delegate that does the work of
    /// retrieving entity from database</param>
    /// <returns>Entity instance.</returns>
    private T GetFromDatabase(I identifier, Func<I, T> databaseCall)
    {
        return databaseCall.Invoke(identifier);
    }
    private void LogToOperations(string message, EventSeverity severity)
    {
        // Logging is optional; do nothing when no logger was supplied.
        if (this.parameters.Logger != null)
        {
            this.parameters.Logger.LogToOperations(message, severity);
        }
    }
    #endregion
}
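One thing worth noticing in the class above is that GetFromCache walks the list from the start on every call. With a million items and a million ReadItem calls that is a lot of comparisons. A possible variation, which is not part of the published library, is to index the cache by identifier. The sketch below assumes a hypothetical `IndexedCache` class and a delegate that extracts the identifier from an entity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variation on the cache: not part of the published library.
public class IndexedCache<T, I>
{
    private readonly object lockObject = new object();
    private readonly Func<T, I> getIdentifier;
    private Dictionary<I, T> index;

    public IndexedCache(Func<T, I> getIdentifier)
    {
        this.getIdentifier = getIdentifier;
    }

    // Replaces the cache contents. If two items share an identifier, the later one wins.
    public void Load(IEnumerable<T> items)
    {
        // Build the new index outside the lock so readers are blocked only for the swap.
        var temp = new Dictionary<I, T>();
        foreach (T item in items)
        {
            temp[getIdentifier(item)] = item;
        }
        lock (lockObject)
        {
            index = temp;
        }
    }

    // Returns true and the entity when the identifier is cached, otherwise false.
    public bool TryGet(I identifier, out T entity)
    {
        lock (lockObject)
        {
            if (index == null)
            {
                entity = default(T);
                return false;
            }
            return index.TryGetValue(identifier, out entity);
        }
    }
}
```

A dictionary lookup takes roughly constant time per item, where the list scan grows with the size of the cache. The price is the extra memory the dictionary needs on top of the entities themselves.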
Parameters
The class uses a set of parameters that stores two delegates: one is used to populate the cache and the other calls the database to get individual items. It also allows you to pass in a logger from the Microsoft patterns and practices logging library.
Your connector assemblies use these delegates to pass in their own logic for getting items into the cache. The parameters class looks like this:
public class CachedConnectorParameters<T, I> where T : BDCEntity<I>
{
    #region Public properties
    public Action<List<T>> PopulateCache { get; set; }
    public Func<I, T> DatabaseCall { get; set; }
    public ILogger Logger { get; set; }
    #endregion
}
Base entity
One last class is the BDCEntity<T> class, which is used as a base class for all your BDC model entities. The class is simple and just allows the caching class to filter on identifiers.
public abstract class BDCEntity<T>
{
    #region Public members
    public T Identifier { get; set; }
    #endregion
}
.Net Connector Service & Entities
This forms the reusable library that provides caching to all your BDC connector assemblies. An example of a class that uses this caching pattern is shown below:
public class MyService
{
    #region Private static members
    private static CachedConnectorService<MyEntity, Int64> service;
    private static CachedConnectorParameters<MyEntity, Int64> parameters;
    #endregion
    #region Public methods
    /// <summary>
    /// Reads specified entity from the database.
    /// </summary>
    /// <param name="id">ID of the entity to retrieve from the database.</param>
    /// <returns>Instance of a MyEntity.</returns>
    public static MyEntity ReadItem(long id)
    {
        return ServiceInstance().ReadItem(id);
    }
    /// <summary>
    /// Reads a list of all entities from the database.
    /// </summary>
    /// <returns>Collection of MyEntity.</returns>
    public static IEnumerable<MyEntity> ReadList()
    {
        return ServiceInstance().ReadList();
    }
    #endregion
    #region Private static methods
    // Creates the shared service on first use. Note that this check is not synchronised.
    private static CachedConnectorService<MyEntity, Int64> ServiceInstance()
    {
        if (service == null)
        {
            if (parameters == null)
            {
                parameters = new CachedConnectorParameters<MyEntity, Int64>();
                parameters.DatabaseCall = GetDatabaseDelegate();
                parameters.PopulateCache = GetPopulateCacheDelegate();
                parameters.Logger = new SPLogger();
            }
            service = new CachedConnectorService<MyEntity, Int64>(parameters);
        }
        return service;
    }
    // Delegate that loads every item from the database into the supplied list.
    private static Action<List<MyEntity>> GetPopulateCacheDelegate()
    {
        return (entities) =>
        {
            using (MyWorkScope scope = new MyWorkScope(DatabaseManager.EFConnectionString))
            {
                foreach (Entity entity in scope.CurrentContext.MySet)
                {
                    entities.Add(GetEntity(entity));
                }
            }
        };
    }
    // Delegate that reads a single item from the database by its identifier.
    private static Func<Int64, MyEntity> GetDatabaseDelegate()
    {
        return (identifier) =>
        {
            using (MyWorkScope scope = new MyWorkScope(DatabaseManager.EFConnectionString))
            {
                Entity entity = scope.CurrentContext.ReadEntity(identifier).FirstOrDefault();
                // FirstOrDefault returns null when no row matches the identifier.
                return entity == null ? null : GetEntity(entity);
            }
        };
    }
    /// <summary>
    /// Returns an entity from the specified object.
    /// </summary>
    /// <param name="entity">Entity to turn into a MyEntity.</param>
    /// <returns>Instance of MyEntity.</returns>
    private static MyEntity GetEntity(Entity entity)
    {
        MyEntity myEntity = new MyEntity();
        myEntity.Identifier = entity.Id;
        myEntity.Name = entity.FormattedName;
        myEntity.SiteUrl = entity.SiteUrl;
        myEntity.LastModifiedTimeStampField = entity.CC_ModifiedDate;
        return myEntity;
    }
    #endregion
}
Our BDC model entity class looks like this:
public partial class MyEntity : BDCEntity<Int64>
{
    public string Name { get; set; }
    public string SiteUrl { get; set; }
    public DateTime? LastModifiedTimeStampField { get; set; }
}
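A note on ServiceInstance() in MyService: it checks for null without taking a lock, so if two crawler threads arrive at the same moment each could create its own service, and each service would then load its own copy of the cache. One way to guard against that, offered here as a sketch and not as part of the library, is a small holder that creates the value under a lock. `LockedSingleton` is a hypothetical name introduced for this example.

```csharp
using System;

// Hypothetical helper: creates a value once, even when several threads ask at the same time.
public class LockedSingleton<T> where T : class
{
    private readonly object sync = new object();
    private readonly Func<T> factory;
    private volatile T instance;

    public LockedSingleton(Func<T> factory)
    {
        this.factory = factory;
    }

    public T Get()
    {
        // First check without the lock keeps the common path cheap.
        if (instance == null)
        {
            lock (sync)
            {
                // Second check: another thread may have created the value while we waited.
                if (instance == null)
                {
                    instance = factory();
                }
            }
        }
        return instance;
    }
}
```

In MyService the static `service` field could then become a static `LockedSingleton<CachedConnectorService<MyEntity, Int64>>` whose factory builds the parameters and the service, with ServiceInstance() simply returning `Get()`.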
Memory Limits
There is one last thing to note about this approach. As we are caching all items within the MSSADM.exe process, the memory footprint can get very large, and on our server we hit a limit that is set by the filter daemon. When the filter daemon limit is hit, the BCS connector assembly and its memory space are thrown away, and hence the cache is reset. The above code handles this, but as a consequence, when ReadItem is called and the cache has gone away, we have to reload from the database. You want to avoid doing this too many times for obvious reasons, so we found we had to increase the memory limit of the filter daemon to get better performance from the indexer.
You can find out how to do this from the links below:
- http://social.technet.microsoft.com/Forums/en-US/sharepointsearch/thread/138e2e68-9bf7-4a1c-9519-ace6a78ddaa5/
- http://www.sharepointjoel.com/Lists/Categories/Category.aspx?Name=Search%20and%20Indexing
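Before changing the limit it helps to have a rough idea of how much memory the cache will need. The sketch below is one way to estimate that, under some loud assumptions: `SampleEntity` is a hypothetical class shaped like MyEntity, the string lengths are made up, and `GC.GetTotalMemory` only gives an approximate figure for managed memory. Treat the result as a ballpark number, not a measurement of the real crawler process.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entity shaped like MyEntity in this post.
public class SampleEntity
{
    public long Identifier { get; set; }
    public string Name { get; set; }
    public string SiteUrl { get; set; }
    public DateTime? LastModifiedTimeStampField { get; set; }
}

public static class CacheSizeEstimator
{
    // Allocates sampleSize entities and returns the approximate managed bytes used per entity.
    public static long EstimateBytesPerEntity(int sampleSize)
    {
        long before = GC.GetTotalMemory(true);
        var sample = new List<SampleEntity>(sampleSize);
        for (int i = 0; i < sampleSize; i++)
        {
            sample.Add(new SampleEntity
            {
                Identifier = i,
                Name = "Sample item name " + i,                 // made-up value
                SiteUrl = "http://example.invalid/sites/" + i,  // made-up value
                LastModifiedTimeStampField = DateTime.UtcNow
            });
        }
        long after = GC.GetTotalMemory(true);
        // Keep the sample alive until after the second measurement.
        GC.KeepAlive(sample);
        return Math.Max(0, after - before) / sampleSize;
    }

    // Extrapolates the sample to the full item count, in megabytes.
    public static double EstimateCacheMegabytes(int sampleSize, long totalItems)
    {
        return EstimateBytesPerEntity(sampleSize) * totalItems / (1024.0 * 1024.0);
    }
}
```

For example, `CacheSizeEstimator.EstimateCacheMegabytes(10000, 1000000)` extrapolates a 10,000 item sample to a million items. Real entities with longer strings will need more, so swap in values that look like your own data.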
Where can I get the library?
You can download the full source code for the library here. It can be used by anyone free of charge.
http://www.athousandthreads.com/att.sharepoint.patterns.zip