The molecules API

About the database

The molecules database contains already predicted GPS positions for about one million molecules represented in SMILES string format. This data have been collected from jobs submitted to ChemGPS-NP Web for a period of time of about ten years and this database is now made available for public search.

This database can also be queried/browsed using your web browser by visiting the search page.

Search

In this document we will focus on describing how to use the JSON API for searching in the molecules database. Due to the dynamic nature of this API we are not wasting any time on providing any other interface than JSON.

The input structure

The request payload can contain match, select, order, group, limit, offset and count. Each of these will be described separately below.

Match Select Order Group Limit Offset Count

To get an feel of how an request looks like, heres an example that will match on SMILES string and molecule name, returning the first three predicted positions and the molecule name:

{
    "match": {
        "smiles": "N1=CNC2=C(C1=S)NC=N2",
        "name": "Mercaptopurine*"
    },
    "select": [
        "pos1",
        "pos2",
        "pos3",
        "name"
    ],
    "order": {
        "name": "asc",
        "pos1": true
    },
    "group": [
        "name"
    ],
    "limit": 10,
    "offset": 0,
    "count": false
}

Requests

Both input and response contains JSON encoded data. GET requests are supported, but due to encoding issues its recommended to use POST. The curl command can be used for exploring the API:

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"select":["name","smiles","pos1","pos2","pos3"],"match":{"smiles":"C12C*"},"limit":2}'
{
    "status": "success",
    "result": [
        {
            "name": "08Cic04Lia-Asteraceae",
            "smiles": "C12C(C(C(O1)=O)=C)CC=C(CCC=C(C2)C)C=O",
            "pos1": "-2.898477",
            "pos2": "-0.614176",
            "pos3": "0.342784"
        },
        {
            "name": "08Cic04Lia-Asteraceae",
            "smiles": "C12C(C(C(O1)=O)=C)CC(C(CCC=C(C2)C)=C)O",
            "pos1": "-2.816835",
            "pos2": "-0.829112",
            "pos3": "0.009787"
        }
    ]
}

Wildcards

Use asterisk '*' as wildcard when matching on sub strings in columns, default is otherwise to match string exact unless fuzzy match mode was selected. The example above i.e. appending a wildcard suffix to find smiles strings starting with C12C.

Fuzzy match

Enables full-text search for all match parameters. Notice that fuzzy search is less exact that normal search (especial for smiles string), but has the advantage of finding data in more complex combinations.

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"select":["name","smiles","pos1","pos2","pos3"],"match":{"smiles":"C12C"},"limit":2,"fuzzy":true}'
{
    "status": "success",
    "result": [
        {
            "name": "08Cic04Lia-Asteraceae",
            "smiles": "C12C(C(C(O1)=O)=C)CC=C(CCC=C(C2)C)C=O",
            "pos1": "-2.898477",
            "pos2": "-0.614176",
            "pos3": "0.342784"
        },
        {
            "name": "08Cic04Lia-Asteraceae",
            "smiles": "C12C(C(C(O1)=O)=C)CC(C(CCC=C(C2)C)=C)O",
            "pos1": "-2.816835",
            "pos2": "-0.829112",
            "pos3": "0.009787"
        }
    ]
}

An example could be the smiles string 'CCC=OH'. If fuzzy mode is enabled, then the search will find molecules containing 'CCC' and 'OH', both in order and reversed.

Matching rows (match)

Rows can be matched using exact or fuzzy matching by passing wildcard patterns or using anchor special characters. By default matching is fuzzy which means that input string is matched as a sub string and matching is case insensitive.

The match section contains the column names to match and the value to match on in each one. In real world applications it probably most interesting to filter on the smiles or name fields.

{
    "match": {
        "smiles": "...",
        "name": "..."
    }
}

While its possible to filter on multiple fields at same time, its probably best to restrict this to one at each time for more precise results. We will focus on matching names in this section.

Fuzzy matching

The default is to perform fuzzy matching. When combining multiple match fields, it might be required to use explict fuzzy matching. Use the wildcard '*' to perform fuzzy match (on start and/or end of string):

{
    "match": {
        "name": "*Mercaptopurine*"
    }
}
Exact matching

Use '^' and '$' to anchor the search string. Using both at same time is equivalent to making an exact match on the string:

{
    "match": {
        "name": "^Mercaptopurine$"
    }
}
Available columns

The smiles and name column has already been mentioned. The complete list of columns that can be matched are:

NameTypeDescription
nametextThe molecule name (might be missing or having random names)
smilestextThe SMILES string (always present)
pos1 - pos8floatOne of the eight predicted positions
createddatetimeThe date and time when molecule was added to database
ipaddrtextThe IP address from where molecule was submitted
hostidtextThe host ID from were molecule was added
md5sumchar(32)The computed checksum for this molecule

There are a couple of other fields, but these are of zero interest in most applications.

Select output columns (select)

The array of column names passed in the select section defines which columns data should be returned from. See the match documentation for an list of availble column names.

An missing select section in the input payload is equivalent to using the asterisk character, and will return data from all columns:

{
    "select": [
        "*"
    ]
}
Multiple columns

When selecting multiple columns, pass an array of column names:

{
    "select": [
        "name",
        "pos1",
        "pos2",
        "pos3"
    ]
}
Single column

When selecting a single column, a simplified declaration can be used by passing an string instead of an array:

{
    "select": "name"
}
Aggregate functions

It also possible to use aggregate functions (like count or sum) in select.

{
    "select": "count(*) as num"
}

This will for example sum the first three positions for all matched molecules:

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"match":{"smiles":"CCC=OH"},"select":["sum(pos1) as sum1","sum(pos2) as sum2","sum(pos3) as sum3"],"fuzzy":true}'
{
    "status": "success",
    "result": [
        {
            "sum1": "10.099289",
            "sum2": "11.196895",
            "sum3": "-12.131128"
        }
    ]
}
Key names in output

The key names in the response could be choosen as in the example above. The key names must be a non-numeric string or an syntax error will occure.

Uniqueness

If you need an unique identifier for an molecule, either use the md5sum column or select rowid from the database. The rowid is by default not returned in the result and has to be explicit requested.

{
    "select": [
        "rowid",
        "*"
    ]
}
All strings

The select options can also be composed as an string for those who prefere to concat strings. Previous code sample would then become:

{
    "select": "rowid,*"
}

Sorting output data (order)

Use the order section to define sort options for output. Notice that passing an order section might slow down response time, especially when the number of matched rows is large.

The general form for the order section is an object of one or more columns to sort on. The precidance is applied following their order of listing. This example would sort descending on created (time for adding entry), followed by ascending sort on molecule name:

{
    "order": {
        "created": "desc",
        "name": "asc"
    }
}
Boolean values

Sort order is ascending if value is true, ortherwise descending if value is false. This example is equivalent with the previous example:

{
    "order": {
        "created": false,
        "name": true
    }
}
Single column

The order section can be simplified when sorting on a single column. The default value is ascending (true).

{
    "order": "name"
}

Aggregate output data (group)

The group section can be used to perform basic grouping of response data. For more advanced aggregation, perform that on client side instead. The grouping is limited to available column names (no group on return values from aggregate functions like count).

The general form for the group section is an array of one or more column names and grouping the response is performed in the order of listing among column names.

{
    "group": [
        "created",
        "name"
    ]
}
Single column

When grouping on a single column, an simplified syntax can be used:

{
    "group": "created"
}
Aggregate functions

Even though value from aggregate function like count() can't be directly used in the group section, it's still possible to use them to aggregate output:

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"select":["count(ipaddr) as number","ipaddr","hostid"],"match":{"smiles":"CCC"},"group":"ipaddr","fuzzy":true}'
{
    "status": "success",
    "result": [
        {
            "number": "4",
            "ipaddr": "1.175.238.117",
            "hostid": "dea620b645f89c8b4b2f75c377f0cf86"
        },
        {
            "number": "5",
            "ipaddr": "101.14.148.218",
            "hostid": "500bfc249c7bd56922c7edb605ecd3ff"
        },
                ...
        {
            "number": "62",
            "ipaddr": "96.233.235.231",
            "hostid": "dbea104cf28f8f19dc4931f43bd87839"
        }
    ]
}

Limit rows and pagination (limit and offset)

Use limit to restrict number of rows in the response. Used together with offset they can be used to create result pagination. Using limit is particular useful when number of rows returned is large.

Because time spent in sorting the result grows non-linear with number of rows, its a good strategy to limit the number of rows returned (the sorting is done on the query result) when using an order section.

The general form for using limit and offset looks like this:

{
    "limit": 25,
    "offset": 50
}
Recommendation

It's strongly recommended to use limit/offset and perform multiple queries instead of fetching all rows to client side and navigate the result from there.

Notice that offset requires an limit. If offset is defined without limit, then 25 is assumed as default limit:

{
    "offset": 50
}

The default offset is (obvious) the number zero. Using limit without offset is slightly faster, but hardly noticable:

{
    "limit": 25
}
Pagination

Navigating query result is simple on client side. Define your JSON structure and increment/decrement its offset value in response to back/forward navigation before fetching data for the view:

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"select":["rowid","smiles","name"],"match":{"smiles":"CCC"},"limit":5,"offset":15,"fuzzy":true}'
{
    "status": "success",
    "result": [
        {
            "rowid": "266",
            "smiles": "C1(C(=C(C(O1)=O)C)C(C2(C(CCC=C2C=O)C)C)OC(C(C)=CC)=O)O",
            "name": "10Ast01Sen-Asteraceae"
        },
        {
            "rowid": "267",
            "smiles": "C1(C(=C(C(O1)=O)C)C(C2(C(CCC=C2C=O)C)C)OC(CC(C)C)=O)O",
            "name": "10Ast01Sen-Asteraceae"
        },
        {
            "rowid": "269",
            "smiles": "C1C(CC=C(CCC(O)C)C(C1)C)C(C(O)=O)=C",
            "name": "10Ast01Sen-Asteraceae"
        },
        {
            "rowid": "270",
            "smiles": "C1(CC=C(CCC(=O)C)C(CC1)C)C(C(=O)O)=C",
            "name": "10Ast01Sen-Asteraceae"
        },
        {
            "rowid": "289",
            "smiles": "C12(C(C(C(O1)=O)C)C(C3(C(C2)CCC(C3C)O)C)OC(C(C)=CC)=O)OC",
            "name": "10Ast01Sen-Asteraceae"
        }
    ]
}

When using limit and offset, then current page is simply given by floor(offset/limit).

Counting matches (count)

The count section was added to provide an easy way to find out the number of rows matched by an request by simply setting the count member to true.

Let's say you have an JSON object for query result according to your preferences (possibly containing all other section types like select, match, limit and so on).

To find out the number of matches without query the result you can append the count field and ask the server how many rows it would match:

curl -XPOST http://chemgps.bmc.uu.se/batchelor/api/molecules/?pretty=1 -d '{"select":["rowid","smiles","name"],"match":{"smiles":"CCC*"},"limit":10,"offset":25,"count":true}'
{
    "status": "success",
    "result": [
        {
            "matched": "87254"
        }
    ]
}

Setting count to false will make the server ignore it and return the normal result response based on you query options:

{
    "select": [
        "rowid",
        "smiles",
        "name"
    ],
    "match": {
        "smile": "CCC"
    },
    "limit": 10,
    "offset": 25,
    "count": false
}

Notice that limit and offset has no influence on the matched value returned when count is true.