Tables

Synapse Tables enable storage of tabular data in Synapse in a form that can be queried using a SQL-like query language.

A table has a Schema and holds a set of rows conforming to that schema.

A Schema defines a series of Column of the following types: STRING, DOUBLE, INTEGER, BOOLEAN, DATE, ENTITYID, FILEHANDLEID, LINK, LARGETEXT, USERID ~~~~~~~ Example ~~~~~~~

Preliminaries:

import synapseclient
from synapseclient import Project, File, Folder
from synapseclient import Schema, Column, Table, Row, RowSet, as_table_columns

syn = synapseclient.Synapse()
syn.login()

project = syn.get('syn123')

To create a Table, you first need to create a Table Schema. This defines the columns of the table:

cols = [
    Column(name='Name', columnType='STRING', maximumSize=20),
    Column(name='Chromosome', columnType='STRING', maximumSize=20),
    Column(name='Start', columnType='INTEGER'),
    Column(name='End', columnType='INTEGER'),
    Column(name='Strand', columnType='STRING', enumValues=['+', '-'], maximumSize=1),
    Column(name='TranscriptionFactor', columnType='BOOLEAN')]

schema = Schema(name='My Favorite Genes', columns=cols, parent=project)

Next, let’s load some data. Let’s say we had a file, genes.csv:

Name,Chromosome,Start,End,Strand,TranscriptionFactor
foo,1,12345,12600,+,False
arg,2,20001,20200,+,False
zap,2,30033,30999,-,False
bah,1,40444,41444,-,False
bnk,1,51234,54567,+,True
xyz,1,61234,68686,+,False

Let’s store that in Synapse:

table = Table(schema, "/path/to/genes.csv")
table = syn.store(table)

The Table() function takes two arguments, a schema object and data in some form, which can be:

  • a path to a CSV file
  • a Pandas DataFrame
  • a RowSet object
  • a list of lists where each of the inner lists is a row

With a bit of luck, we now have a table populated with data. Let’s try to query:

results = syn.tableQuery("select * from %s where Chromosome='1' and Start < 41000 and End > 20000" % table.schema.id)
for row in results:
    print(row)

Pandas

Pandas is a popular library for working with tabular data. If you have Pandas installed, the goal is that Synapse Tables will play nice with it.

Create a Synapse Table from a DataFrame:

import pandas as pd

df = pd.read_csv("/path/to/genes.csv", index_col=False)
schema = Schema(name='My Favorite Genes', columns=as_table_columns(df), parent=project)
table = syn.store(Table(schema, df))

Get query results as a DataFrame:

results = syn.tableQuery("select * from %s where Chromosome='2'" % table.schema.id)
df = results.asDataFrame()

Changing Data

Once the schema is settled, changes come in two flavors: appending new rows and updating existing ones.

Appending new rows is fairly straightforward. To continue the previous example, we might add some new genes from another file:

table = syn.store(Table(table.schema.id, "/path/to/more_genes.csv"))

To quickly add a few rows, use a list of row data:

new_rows = [["Qux1", "4", 201001, 202001, "+", False],
            ["Qux2", "4", 203001, 204001, "+", False]]
table = syn.store(Table(schema, new_rows))

Updating rows requires an etag, which identifies the most recent change set plus row IDs and version numbers for each row to be modified. We get those by querying before updating. Minimizing changesets to contain only rows that actually change will make processing faster.

For example, let’s update the names of some of our favorite genes:

results = syn.tableQuery("select * from %s where Chromosome='1'" %table.schema.id)
df = results.asDataFrame()
df['Name'] = ['rzing', 'zing1', 'zing2', 'zing3']

Note that we’re propagating the etag from the query results. Without it, we’d get an error saying something about an “Invalid etag”:

table = syn.store(Table(schema, df, etag=results.etag))

The etag is used by the server to prevent concurrent users from making conflicting changes, a technique called optimistic concurrency. In case of a conflict, your update may be rejected. You then have to do another query an try your update again.

Changing Table Structure

Adding columns can be done using the methods Schema.addColumn() or addColumns() on the Schema object:

schema = syn.get("syn000000")
bday_column = syn.store(Column(name='birthday', columnType='DATE'))
schema.addColumn(bday_column)
schema = syn.store(schema)

Renaming or otherwise modifying a column involves removing the column and adding a new column:

cols = syn.getTableColumns(schema)
for col in cols:
    if col.name == "birthday":
        schema.removeColumn(col)
bday_column2 = syn.store(Column(name='birthday2', columnType='DATE'))
schema.addColumn(bday_column2)
schema = syn.store(schema)

Table attached files

Synapse tables support a special column type called ‘File’ which contain a file handle, an identifier of a file stored in Synapse. Here’s an example of how to upload files into Synapse, associate them with a table and read them back later:

## your synapse project
project = syn.get(...)

covers_dir = '/path/to/album/covers/'

## store the table's schema
cols = [
    Column(name='artist', columnType='STRING', maximumSize=50),
    Column(name='album', columnType='STRING', maximumSize=50),
    Column(name='year', columnType='INTEGER'),
    Column(name='catalog', columnType='STRING', maximumSize=50),
    Column(name='cover', columnType='FILEHANDLEID')]
schema = syn.store(Schema(name='Jazz Albums', columns=cols, parent=project))

## the actual data
data = [["John Coltrane",  "Blue Train",   1957, "BLP 1577", "coltraneBlueTrain.jpg"],
        ["Sonny Rollins",  "Vol. 2",       1957, "BLP 1558", "rollinsBN1558.jpg"],
        ["Sonny Rollins",  "Newk's Time",  1958, "BLP 4001", "rollinsBN4001.jpg"],
        ["Kenny Burrel",   "Kenny Burrel", 1956, "BLP 1543", "burrellWarholBN1543.jpg"]]

## upload album covers
for row in data:
    file_handle = syn.uploadSynapseManagedFileHandle(os.path.join(covers_dir, row[4]))
    row[4] = file_handle['id']

## store the table data
row_reference_set = syn.store(RowSet(columns=cols, schema=schema, rows=[Row(r) for r in data]))

## Later, we'll want to query the table and download our album covers
results = syn.tableQuery("select artist, album, year, catalog, cover from %s where artist = 'Sonny Rollins'" % schema.id)
cover_files = syn.downloadTableColumns(results, ['cover'])

Deleting rows

Query for the rows you want to delete and call syn.delete on the results:

results = syn.tableQuery("select * from %s where Chromosome='2'" %table.schema.id)
a = syn.delete(results.asRowSet())

Deleting the whole table

Deleting the schema deletes the whole table and all rows:

syn.delete(schema)

Queries

The query language is quite similar to SQL select statements, except that joins are not supported. The documentation for the Synapse API has lots of query examples.

Schema

class synapseclient.table.Schema(name=None, columns=None, parent=None, properties=None, annotations=None, local_state=None, **kwargs)

A Schema is an synapseclient.entity.Entity that defines a set of columns in a table.

Parameters:
  • name – the name for the Table Schema object
  • description – User readable description of the schema
  • columns – a list of Column objects or their IDs
  • parent – the project in Synapse to which this table belongs
  • properties – A map of Synapse properties
  • annotations – A map of user defined annotations
  • local_state – Internal use only

Example:

cols = [Column(name='Isotope', columnType='STRING'),
        Column(name='Atomic Mass', columnType='INTEGER'),
        Column(name='Halflife', columnType='DOUBLE'),
        Column(name='Discovered', columnType='DATE')]

schema = syn.store(Schema(name='MyTable', columns=cols, parent=project))
class synapseclient.table.EntityViewSchema(name=None, columns=None, parent=None, scopes=None, type=None, addDefaultViewColumns=True, addAnnotationColumns=True, ignoredAnnotationColumnNames=[], properties=None, annotations=None, local_state=None, **kwargs)

A EntityViewSchema is a synapseclient.entity.Entity that displays all files/projects (depending on user choice) within a given set of scopes

Parameters:
  • name – the name of the Entity View Table object
  • columns – a list of Column objects or their IDs. These are optional.
  • parent – the project in Synapse to which this table belongs
  • scopes – a list of Projects/Folders or their ids
  • type – the type of EntityView to display: either ‘file’,’project’ or ‘file_and_table’. Defaults to ‘file’.
  • addDefaultViewColumns – If true, adds all default columns (e.g. name, createdOn, modifiedBy etc.) Defaults to True. The default columns will be added after a call to synapseclient.Synapse.store().
  • addAnnotationColumns – If true, adds columns for all annotation keys defined across all Entities in the EntityViewSchema’s scope. Defaults to True. The annotation columns will be added after a call to synapseclient.Synapse.store().
  • ignoredAnnotationColumnNames – A list of strings representing annotation names. When addAnnotationColumns is True, the names in this list will not be automatically added as columns to the EntityViewSchema if they exist in any of the defined scopes.
  • properties – A map of Synapse properties
  • annotations – A map of user defined annotations
  • local_state – Internal use only

Example:

project_or_folder = syn.get("syn123")  
schema = syn.store(EntityViewSchema(name='MyTable', parent=project, scopes=[project_or_folder_id, 'syn123'], view_type='file'))
add_scope(entities)
Parameters:entities – a Project or Folder object or its ID, can also be a list of them

Column

class synapseclient.table.Column(**kwargs)

Defines a column to be used in a table synapseclient.table.Schema synapseclient.table.EntityViewSchema.

Variables:

id – An immutable ID issued by the platform

Parameters:
  • columnType (string) – Can be any of: “STRING”, “DOUBLE”, “INTEGER”, “BOOLEAN”, “DATE”, “FILEHANDLEID”, “ENTITYID”
  • maximumSize (integer) – A parameter for columnTypes with a maximum size. For example, ColumnType.STRINGs have a default maximum size of 50 characters, but can be set to a maximumSize of 1 to 1000 characters.
  • name (string) – The display name of the column
  • enumValues (array of strings) – Columns type of STRING can be constrained to an enumeration values set on this list.
  • defaultValue (string) – The default value for this column. Columns of type FILEHANDLEID and ENTITYID are not allowed to have default values.

Row

class synapseclient.table.Row(values, rowId=None, versionNumber=None, etag=None)

A row in a Table.

Parameters:
  • values – A list of values
  • rowId – The immutable ID issued to a new row
  • versionNumber – The version number of this row. Each row version is immutable, so when a row is updated a new version is created.

RowSet

class synapseclient.table.RowSet(columns=None, schema=None, **kwargs)

A Synapse object of type org.sagebionetworks.repo.model.table.RowSet.

Parameters:
  • schema – A synapseclient.table.Schema object that will be used to set the tableId
  • headers (array of SelectColumns) – The list of SelectColumn objects that describe the fields in each row.
  • columns – An alternative to ‘headers’, a list of column objects that describe the fields in each row.
  • tableId (string) – The ID of the TableEntity that owns these rows
  • rows (array of rows) – The synapseclient.table.Row s of this set. The index of each row value aligns with the index of each header.
Variables:

etag – Any RowSet returned from Synapse will contain the current etag of the change set. To update any rows from a RowSet the etag must be provided with the POST.

Table

class synapseclient.table.TableAbstractBaseClass(schema, headers=None, etag=None)

Abstract base class for Tables based on different data containers.

class RowMetadataTuple(row_id, row_version, row_etag)
row_etag

Alias for field number 2

row_id

Alias for field number 0

row_version

Alias for field number 1

iter_row_metadata()

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)

Returns:a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)
class synapseclient.table.RowSetTable(schema, rowset)

A Table object that wraps a RowSet.

class synapseclient.table.TableQueryResult(synapse, query, limit=None, offset=None, isConsistent=True)

An object to wrap rows returned as a result of a table query. The TableQueryResult object can be used to iterate over results of a query.

Example

results = syn.tableQuery("select * from syn1234")
for row in results:
    print(row)
asDataFrame(rowIdAndVersionInIndex=True)

Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version (and row_etag if it exists)

iter_row_metadata()

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, row_version,None)

Returns:a generator that gives :py:class::collections.namedtuple with format (row_id, row_version, row_etag)
next()

Python 2 iterator

class synapseclient.table.CsvFileTable(schema, filepath, etag=None, quoteCharacter=u'"', escapeCharacter=u'\', lineEnd='rn', separator=u', ', header=True, linesToSkip=0, includeRowIdAndRowVersion=None, headers=None)

An object to wrap a CSV file that may be stored into a Synapse table or returned as a result of a table query.

asDataFrame(rowIdAndVersionInIndex=True, convert_to_datetime=False)

Convert query result to a Pandas DataFrame. :param rowIdAndVersionInIndex: Make the dataframe index consist of the row_id and row_version (and row_etag if it exists) :param convert_to_datetime: If set to True, will convert all Synapse DATE columns from UNIX timestamp integers into UTC datetime objects :return:

classmethod from_table_query(synapse, query, quoteCharacter=u'"', escapeCharacter=u'\\', lineEnd='\r\n', separator=u', ', header=True, includeRowIdAndRowVersion=True)

Create a Table object wrapping a CSV file resulting from querying a Synapse table. Mostly for internal use.

iter_row_metadata()

Iterates the table results to get row_id and row_etag. If an etag does not exist for a row, it will generated as (row_id, None)

Returns:a generator that gives :py:class::collections.namedtuple with format (row_id, row_etag)
setColumnHeaders(headers)

Set the list of synapseclient.table.SelectColumn objects that will be used to convert fields to the appropriate data types.

Column headers are automatically set when querying.

Module level methods

synapseclient.table.as_table_columns(df)

Return a list of Synapse table Column objects that correspond to the columns in the given Pandas DataFrame.

Params df:Pandas DataFrame
Returns:A list of Synapse table Column objects
synapseclient.table.Table(schema, values, **kwargs)

Combine a table schema and a set of values into some type of Table object depending on what type of values are given.

Parameters:
  • schema – a table Schema object
  • values

    an object that holds the content of the tables - a RowSet - a list of lists (or tuples) where each element is a row - a string holding the path to a CSV file - a Pandas DataFrame - a dict which will be wrapped by a Pandas DataFrame

Returns:

a Table object suitable for storing

Usually, the immediate next step after creating a Table object is to store it:

table = syn.store(Table(schema, values))

End users should not need to know the details of these Table subclasses:

See also: