Microsoft has released a new API called the Computer Vision API. It provides OCR conversion from images to text. Similar to my recent post on integrating Microsoft’s new Face API, I wanted to test whether you could integrate SharePoint with this new API to act as an auto-tagging mechanism for images stored in document libraries.
Microsoft has also provided a C# and Android SDK for the Computer Vision API that you can download here.
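If you want to follow along with the C# SDK, it ships as a NuGet package (Microsoft.ProjectOxford.Vision at the time of writing; treat the exact package name as an assumption and check the gallery) and the client only needs your subscription key. A minimal setup looks something like this:
// Install-Package Microsoft.ProjectOxford.Vision   (NuGet package name at the time of writing)
using Microsoft.ProjectOxford.Vision;

// The subscription key comes from the Azure portal when you sign up for the preview.
VisionServiceClient client = new VisionServiceClient("<your-subscription-key>");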
I wanted to develop a demo that leveraged the document management power of SharePoint and integrated it with the OCR capabilities of this new API. The demo code I describe below is posted to GitHub here.
The Scenario
Imagine you have a document library filled with images that could have text on them. If you try to search for these images, you won’t easily find them because they have no metadata related to the content within the image. Could we use an OCR service to scan the image and pull out the text so we could update the metadata with the found text?
Step #1: Establishing a Domain Model
I wanted a class to represent each photo. This allows us to populate a list of photos from SharePoint, hand the list off to another class responsible for tagging the photos, and then hand it back to the SharePoint service to write the updates back to SharePoint. Here is the class definition.
/// <summary>
/// Value object representing a Photo.
/// </summary>
public class Photo
{
    public byte[] Image { get; set; }
    public string ID { get; set; }
    public List<string> TextInPhoto { get; set; }
    public string LanguageDetectedInPhoto { get; set; }
    public int NumberOfMatchedFaces { get; set; }
    public int NumberOfUnmatchedFaces { get; set; }
    public List<PhotoPerson> PeopleInPhoto { get; set; }

    public Photo()
    {
        PeopleInPhoto = new List<PhotoPerson>();
        TextInPhoto = new List<string>();
    }
}
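Note that the class also carries the face-matching properties and the PhotoPerson type from the earlier Face API demo; they aren’t used in this post. If you only want the OCR pieces, a minimal placeholder keeps the class compiling (the property names here are my assumption, not necessarily what the GitHub sample uses):
// Hypothetical placeholder for the Face API demo's person type.
public class PhotoPerson
{
    public string Name { get; set; }   // assumed: display name of the matched person
    public Guid PersonID { get; set; } // assumed: identifier returned by the matching service
}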
Step #2: Pulling Images from SharePoint
Using the SharePoint client APIs (CSOM) with C#, I wrote a method to pull images out of a target document library.
public List<Photo> getPhotosToTag()
{
    List<Photo> photos = new List<Photo>();
    using (ClientContext context = Login(SharePointURL))
    {
        try
        {
            var list = context.Web.GetList(PhotosToTagURL);
            var query = CamlQuery.CreateAllItemsQuery();
            ListItemCollection items = list.GetItems(query);
            context.Load(items, includes => includes.Include(
                i => i[PhotoFileColumn],
                i => i[PhotoIdColumn]));
            // Execute the query to fetch the list items.
            context.ExecuteQuery();
            // At this point we have the list items, but not their content (files).
            // To download each file, stream its binary contents:
            foreach (ListItem item in items)
            {
                Photo photo = new Photo();
                // Get the URL of the file we want:
                var fileRef = item[PhotoFileColumn];
                // Get the file contents:
                FileInformation fileInfo = Microsoft.SharePoint.Client.File.OpenBinaryDirect(context, fileRef.ToString());
                using (var memory = new MemoryStream())
                {
                    byte[] buffer = new byte[1024 * 64];
                    int nread = 0;
                    while ((nread = fileInfo.Stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        memory.Write(buffer, 0, nread);
                    }
                    memory.Seek(0, SeekOrigin.Begin);
                    photo.ID = item.Id.ToString();
                    photo.Image = memory.ToArray();
                    photos.Add(photo);
                }
            }
        }
        catch (Exception)
        {
            throw;
        }
    }
    return photos;
}
The important piece of information to track is the ID of each photo so that we can update the corresponding list item in SharePoint once processing is complete.
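The Login helper used above isn’t shown in the snippet. For SharePoint Online it can be as simple as wrapping SharePointOnlineCredentials; the sketch below assumes the account name and password come from configuration (the AppSettings keys are placeholders):
// Requires references to Microsoft.SharePoint.Client, System.Configuration and System.Security.
private ClientContext Login(string siteUrl)
{
    // Placeholder settings keys -- adjust to however you store credentials.
    string userName = ConfigurationManager.AppSettings["SharePointUser"];
    string password = ConfigurationManager.AppSettings["SharePointPassword"];

    var securePassword = new SecureString();
    foreach (char c in password)
        securePassword.AppendChar(c);

    var context = new ClientContext(siteUrl);
    context.Credentials = new SharePointOnlineCredentials(userName, securePassword);
    return context;
}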
Step #3: Sending Photos to the Computer Vision API
Once we have our list of photos, we can now send these to the Computer Vision API for text identification.
public async Task identifyTextInPhoto(List<Photo> Photos)
{
    try
    {
        foreach (Photo photo in Photos)
        {
            VisionServiceClient client = new VisionServiceClient(SubscriptionKey);
            Stream stream = new MemoryStream(photo.Image);
            OcrResults result = await client.RecognizeTextAsync(stream, Language, DetectOrientation);
            photo.LanguageDetectedInPhoto = result.Language;
            foreach (Region region in result.Regions)
            {
                for (int i = 0; i < region.Lines.Length; i++)
                {
                    Line line = region.Lines[i];
                    string lineText = "";
                    for (int j = 0; j < line.Words.Length; j++)
                    {
                        lineText += line.Words[j].Text;
                        if (j < line.Words.Length - 1)
                        {
                            lineText += " ";
                        }
                    }
                    photo.TextInPhoto.Add(lineText);
                }
            }
        }
    }
    catch (Exception)
    {
        throw;
    }
}
This method uses the supplied Computer Vision API SDK to send each image to the OCR service and pull in the text it finds.
Step #4: Sending the Results Back to SharePoint
When the Computer Vision API analyzes your image, it returns text as a series of regions, lines and words. The API provides not only the text but also the bounding rectangle where each piece was found.
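The snippets here only keep the text, but each region, line and word also carries its rectangle if you want to know where on the image the text sits. A rough sketch of reading it from the result object used above (I’m assuming the contract types expose Left, Top, Width and Height; check the SDK’s Contract namespace):
foreach (Region region in result.Regions)
{
    // Rectangle members below are assumed from the Project Oxford contract types.
    Console.WriteLine("Region at ({0},{1}) size {2}x{3}",
        region.Rectangle.Left, region.Rectangle.Top,
        region.Rectangle.Width, region.Rectangle.Height);
}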
For example, this image returns one region with three lines of text, each composed of several words.
In this simple example, we treat each line returned by the API as a new line and separate each word with a space (if punctuation is found, the API attaches it to the word automatically).
Once we have our matched text, we can update our original SharePoint list item with the found text.
public void updateTaggedPhotosWithText(List<Photo> Photos)
{
    using (ClientContext context = Login(SharePointURL))
    {
        try
        {
            foreach (Photo photo in Photos)
            {
                SP.List list = context.Web.GetList(PhotosToTagURL);
                ListItem item = list.GetItemById(photo.ID);
                string textInPhoto = "";
                string[] lines = photo.TextInPhoto.ToArray();
                for (int i = 0; i < lines.Length; i++)
                {
                    textInPhoto += lines[i];
                    if (i < lines.Length - 1)
                        textInPhoto += "\n";
                }
                item[PhotoTextColumn] = textInPhoto;
                item.Update();
                context.ExecuteQuery();
            }
        }
        catch (Exception)
        {
            throw;
        }
    }
}
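For completeness, the three steps chain together in a straightforward way. The class names below (SharePointService, OcrService) are placeholders for wherever these methods live in your own project, not necessarily the names used in the GitHub sample:
// Hypothetical wiring of the three steps; class names are placeholders.
var sharePoint = new SharePointService();
var ocr = new OcrService();

List<Photo> photos = sharePoint.getPhotosToTag();   // Step 2: pull images from the library
await ocr.identifyTextInPhoto(photos);              // Step 3: run OCR over each image
sharePoint.updateTaggedPhotosWithText(photos);      // Step 4: write the text back to SharePoint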
Conclusion
The OCR in the current Computer Vision API worked really well on some images, while on others it failed to find the right text. Keep in mind these APIs are still in beta, and OCR is a notoriously difficult process to perfect, especially with general image analysis.
For example, the API read this license plate (found randomly on Google Images) correctly:
but in this one it did not find the plate in the middle.
The Computer Vision API seems to work reasonably well with typefaces, but doesn’t work at all with handwritten text. These images, for example, did not return any text.
This image returned “Midnight” but not “Show”.