“ORIGIN’s AI-powered capabilities make it unique on the market”

Read Time 7 mins | Written by: Intlabs team

AI is perhaps the buzzword of 2024. And of 2023. And of 2022. And unfortunately, there are a lot of companies out there selling smoke and calling it AI. We are here to explain how ORIGIN uses AI and why it helps us create actual value for our users.

Let’s get one thing straight

ORIGIN is not an exclusively AI-driven product. And it was not invented after the launch of ChatGPT. ORIGIN was conceptualized when our founders realized that the data governance tools on the market do not come close to addressing the pain points and agility needs of businesses today. That being said, ORIGIN does have AI-powered capabilities that bring its functionality to a whole new level.

What parts of ORIGIN are powered by AI?

ORIGIN is a smart data governance platform that enables users to safely share, store, redact, and work with diverse data sets.

The main AI-powered capability in ORIGIN is the platform’s ability to generate content redaction and access rules based on the data protection legislation relevant to the specific data, user, and use case.

When it comes to data, some factors are very important, such as:

  • Where you are based geographically,
  • What data governance legislation applies to where your organisation is based,
  • What kind of data you are working with (PII, sensitive information, confidential information, etc.),
  • Who else needs to have access to the data you are working with,
  • Where those people who need access are based geographically,
  • What information within your files those who need access should not be able to see,
  • What information within your files may not travel digitally to where those who need access are based.

Based on the information the user inputs into the system, ORIGIN scans all available data protection policies and legislation and generates a coherent set of recommended rules, along with a clear explanation log showing why sharing the source data as-is would violate data governance legislation. ORIGIN’s auto-generated recommendation capabilities mean that users do not need to personally stay up to date on the multitude of data protection and governance legislation related to their jurisdiction, industry, and use cases. Instead, our system generates rules to be used on the platform based on the data and context, and presents them to the sharing user for review.

Crucially, the AI in ORIGIN does not perform the redaction itself on the basis of the information it gathers. This is essential because the outputs of AI models are opaque: you cannot definitively say what the model used to make a decision. So instead of having the AI execute an action that cannot be deciphered, it generates a recommended rule and shows the user the source text behind the recommendation.
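To make the shape of that output more concrete, here is a minimal sketch in Python of what a recommended rule carrying its source text might look like. The class names, fields, and policy excerpt are entirely hypothetical and chosen for illustration; they are not ORIGIN’s internal data model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PolicySource:
    """An excerpt of the legislation or policy a recommendation is based on (hypothetical structure)."""
    document: str   # e.g. "GDPR" or "internal NDA policy"
    section: str    # e.g. "Chapter V - Transfers of personal data to third countries"
    excerpt: str    # the exact passage shown to the user

@dataclass
class RecommendedRule:
    """A redaction/access rule proposed to the user, never applied automatically."""
    description: str                 # human-readable summary of what the rule does
    targets: List[str]               # fields or terms the rule would redact
    sources: List[PolicySource] = field(default_factory=list)
    status: str = "pending_review"   # the user accepts or rejects the rule

rule = RecommendedRule(
    description="Redact personal email addresses before sharing outside the EEA",
    targets=["email"],
    sources=[PolicySource(
        document="GDPR",
        section="Chapter V - Transfers of personal data to third countries",
        excerpt="Any transfer of personal data ... shall take place only if ...",
    )],
)
print(rule.description, "->", rule.status)
```

The point of the structure is the pairing: every proposed rule travels with the source text that justifies it, so the human reviewer can check the reasoning rather than trust a black box.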

A quick example

Let’s say we have a marketer based in California who works in management consulting. They need to send information to a colleague in the United Kingdom. 

On that basis, ORIGIN generates recommendations for how the files need to be redacted so they can be shared securely and compliantly. Each recommendation shows the portion of the policy that applies to the context, so the marketer can see where the recommendation comes from and decide how to proceed.

In this example, the marketer may have a number of marketing plans and research findings from previous campaigns. Due to internal non-disclosure policies, the marketer may not be able to disclose any specifics about those campaigns or the competitors they targeted. ORIGIN can then scan PowerPoint presentations, PDF reports, and Excel spreadsheets for mentions of these campaigns or competitors and automatically create rules that redact the confidential content in those documents.
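As a rough illustration of that scan-and-propose step (the term names, file names, and function below are invented for this post, not ORIGIN’s API), the core idea can be sketched in a few lines of Python:

```python
import re
from typing import Dict, List

# Hypothetical confidential terms from the marketer example: a campaign code
# name and a target competitor. In practice these would come from internal policy.
CONFIDENTIAL_TERMS = ["Project Falcon", "Acme Corp"]

def propose_redaction_rules(extracted_text: Dict[str, str],
                            terms: List[str]) -> List[dict]:
    """Scan text extracted from PPT/PDF/XLSX files and propose one redaction
    rule per match, keeping the surrounding snippet so the user can review it."""
    rules = []
    for filename, text in extracted_text.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
                start = max(match.start() - 40, 0)
                rules.append({
                    "file": filename,
                    "action": "redact",
                    "term": term,
                    "context": text[start:match.end() + 40],
                })
    return rules

# The real product would first extract text from the documents themselves;
# here the extraction step is faked with a plain string.
docs = {"q3_campaign.pptx": "Results of Project Falcon vs. Acme Corp pricing tests ..."}
for proposed in propose_redaction_rules(docs, CONFIDENTIAL_TERMS):
    print(proposed)
```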

Similarly, the marketer could share the raw marketing data associated with the campaign. As a customer, the marketer would grant ORIGIN access to internal policies covering specific fields, along with the relevant legislation on identifiable information (like email, AAID, etc.), retention time, and location. ORIGIN can then create rules that automatically remove fields and entities that do not meet the applicable security and privacy requirements.
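A toy version of that field-level filtering might look like the following; the policy table, field names, and sample row are made up for illustration only:

```python
from typing import Dict, List

# Hypothetical field-level policy distilled from internal rules and legislation:
# which columns in the raw marketing data may be shared, and which must be dropped.
FIELD_POLICY = {
    "email":       {"allowed": False, "reason": "direct identifier"},
    "aaid":        {"allowed": False, "reason": "advertising identifier"},
    "campaign_id": {"allowed": True},
    "clicks":      {"allowed": True},
}

def strip_disallowed_fields(rows: List[Dict], policy: Dict[str, dict]) -> List[Dict]:
    """Return a copy of the data with any field that is unknown or disallowed removed."""
    return [
        {k: v for k, v in row.items() if policy.get(k, {"allowed": False})["allowed"]}
        for row in rows
    ]

raw = [{"email": "jane@example.com", "aaid": "38400000-8cf0", "campaign_id": "C42", "clicks": 17}]
print(strip_disallowed_fields(raw, FIELD_POLICY))
# -> [{'campaign_id': 'C42', 'clicks': 17}]
```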

Before LLMs

Before large language models, systems made smart recommendations by training statistical models on data sets specific to a certain use case. This meant that dev teams needed huge amounts of data to be able to produce a reliable transformation. And collecting the data, processing it, and choosing models that worked for each use case was painstaking, time-consuming work. Then add the complexity of dispersed dev teams that have to collaborate with data across domains, and you end up with a major bottleneck to scalability.

“We make heavy use of GPTs (generative pre-trained transformers). But it’s all built on the structure that we’ve been using for decades,” explains Mike Anderson, CTO at Intlabs. 

“Especially with unstructured data (like text, PDFs, PPTs, etc.), we would use a field called natural language processing,” he continues. “The goal was to take natural language and extract statistically meaningful things from it. That could, for example, be named entity recognition (pulling names, organizations, verbs) combined with classifiers (trained statistical models) to do things like sentiment detection, bucketing text content into themes, identifying PII, etc. And once you had your models trained for each customer, you needed to deploy them and maintain those diversified services to provide microservices to users. This maintenance burden created a major scalability issue for small and medium-sized startups trying to offer machine learning services at a reasonable price.”
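To picture the kind of pre-LLM pipeline Mike is describing, here is a minimal sketch using off-the-shelf libraries. The library choices, training data, and labels are our own illustration, not a description of any particular deployment:

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Named entity recognition: pull people, organisations, and places out of free text.
# Assumes the small English model is installed (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane Doe from Acme Corp presented the campaign results in London.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Jane Doe', 'PERSON'), ('Acme Corp', 'ORG'), ('London', 'GPE')]

# A classifier trained per use case, e.g. bucketing text as containing PII or not.
# The training set here is invented and far too small to be useful; needing large
# labelled datasets for every customer and use case was exactly the bottleneck.
texts = ["call me on 555-0142", "quarterly revenue grew", "email jane@example.com"]
labels = ["pii", "no_pii", "pii"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["reach me at j.doe@example.com"]))  # likely ['pii']
```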

“What we’re seeing now with GPTs is they take all the things we used to do and add a generative layer on top,” Mike continues. “They take the understanding of text and enable us to combine it with sourced data privacy legislation and policy and generate recommendations. That generative layer produces our transformation rules, which are given to the user so they can quickly identify the information that needs to be redacted to comply with relevant data protection policy and legislation. And that’s the real ‘wow’ thing, commercially. It has been really interesting for people in the field to see that.”
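As a hedged sketch of what such a generative layer can look like, the snippet below asks a hosted GPT to propose redaction rules from a document snippet and a policy excerpt, returning them as structured recommendations for a human to review. The model name, prompts, and use of the OpenAI client are illustrative assumptions, not ORIGIN’s actual stack, and the call needs an API key to run:

```python
from openai import OpenAI  # stands in for "a GPT" here; not a statement about ORIGIN's stack

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

POLICY_EXCERPT = "Any transfer of personal data to a third country shall take place only if ..."
DOCUMENT_TEXT = "Contact list for the Q3 campaign: jane.doe@example.com, ..."

# Ask the model to propose transformation rules as structured data, citing the
# policy text it relied on. The rules are recommendations for a human to review,
# not actions the model executes itself.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "You generate data-redaction rules. Reply with a JSON list of "
                    "{field_or_term, action, justification, policy_excerpt} objects."},
        {"role": "user",
         "content": f"Policy:\n{POLICY_EXCERPT}\n\nDocument:\n{DOCUMENT_TEXT}\n\n"
                    "Which parts must be redacted before sharing this outside the EEA?"},
    ],
)

print(response.choices[0].message.content)  # proposed rules, surfaced to the user for review
```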

This doesn’t mean that every problem is more easily and economically solved with a GPT than with your own machine learning models. For example, a linear regression model that recommends certain legislation each time a specific user is selected in the sharing process (based on what the system learns from past shares) is still cheaper than querying a GPT. It’s just a matter of finding the right balance between MLOps (Kreuzberger et al., 2022) and paid GPTs.
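For a sense of what that small in-house model looks like, here is a toy recommender trained on past shares. We use a simple logistic-regression classifier to stand in for the regression model mentioned above, and the data is invented:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented history of past shares: who the data went to and which legislation
# the user ended up applying. A model like this costs effectively nothing per
# prediction, whereas every call to a hosted GPT has a price attached.
past_shares = [
    {"recipient_country": "GB", "data_kind": "PII"},
    {"recipient_country": "GB", "data_kind": "PII"},
    {"recipient_country": "US", "data_kind": "marketing"},
]
applied_legislation = ["UK GDPR", "UK GDPR", "CCPA"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(past_shares, applied_legislation)

print(model.predict([{"recipient_country": "GB", "data_kind": "PII"}]))  # -> ['UK GDPR']
```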

What would ORIGIN look like without its AI capabilities?

ORIGIN would still exist. We would be training our own small models and working to create a product that saves people time by optimizing their working processes. Users would be manually filling in information each time, and there would be a huge amount of work on our end to build a system that surfaced viable data protection policy information to users. We would be using a lot of classifiers to properly sort data into useful ‘buckets’.

“That is the big thing that changes with machine learning. And honestly, if it weren’t for GPTs, we wouldn’t be here today with such a mature and powerful product at such an early stage,” Mike says. “We know what’s out there on the market and we’re talking to enough people to know that what we’re offering is unique.”

Keeping your data safe within the AI context

Aware consumers are rightly concerned about how a platform that uses LLMs keeps their data safe. One of the things our dev team is perhaps most proud of is that it has managed to create a tool that does not need to be trained on user data. ORIGIN uses natural language processing and GPTs to understand text and, on that basis, generate the transformation rules.

“The reality is that because of the immense amount of data that LLMs have access to, we don’t need large amounts of user data to train our system,” explains Mike. “We’ve also been extremely careful to ensure that any data input into an LLM is scoped to the minimum required and that the use is understood and agreed to by the user.”

“And ultimately, we’re a data company. So we have a moral obligation to take your data very seriously. Just like you, we don’t want your company’s data anywhere near the training data that we use to help our system learn.” 
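To illustrate what scoping inputs to the minimum required can mean in practice, here is a generic sketch (not ORIGIN’s published pipeline): send only the small span a model needs, never the whole file.

```python
def minimal_context(text: str, span_start: int, span_end: int, window: int = 30) -> str:
    """Return only the flagged span plus a little surrounding context."""
    return text[max(span_start - window, 0):span_end + window]

document = "Internal report. Contact: jane.doe@example.com. Revenue figures follow ..."
flag = document.find("jane.doe@example.com")
snippet = minimal_context(document, flag, flag + len("jane.doe@example.com"))
print(snippet)  # only this snippet, not the full report, would ever leave the platform
```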

So, what do our users need to be aware of when using ORIGIN?

We are continuously refining and fine-tuning the models behind the system, but the recommendations they produce are not deterministic. What we can guarantee is that once you set your rules, they will be followed faithfully and accurately. But there will always be a need for human review and oversight. And anyone who tells you they have an AI-powered system that does not require human review and oversight is selling smoke.