#

Database

Description  

In a data analysis environment, organized collections of data need to be hosted and accessed by a set of researchers.  A database service provides an interface to store collections and provide privileged access from a variety of endpoints.  Databases also provide a much more structure and searching capabilities of information than file-based methods.  Hosted database servers can be shared and have databases for several groups and store different datasets. Access to the data/database is controlled by the owner of the data or based on a data use agreement.    

 

Key Features and Benefits   

Database service provides a way to store data in an organized way and controlled remotely using client API. Databases (individual stores) are created per project and access to users is provided. Access to create new databases and modify access is normally limited to system administrators. Users can create data tables, views, and functions to store the data. Data is inserted using the client ETL process for existing datasets or loading using a client on generation.   

Definitions:  

Database server: A physical server that runs the database software that is provided as a service, which stores the data. A server usually shared and has several databases and users.  

Database: A type of data store which uses a database management software that interacts with applications, users, and the data store.  Databases commonly run on a single server, but some can be scaled out to multiple servers, especially those on public cloud resources.  There are three basic types of database models used in the research community:   

  • Relational: provides a declarative method for specifying data and queries 
  • Examples: MySQL, MariaDB, MicrosoftSQL, PostgresDB, SQLite 
  • Document: designed for unstructured data presented in a more or less key-value store for easy lookups 
  • Examples: NoSQL, MongoDB 
  • Time Series: data structure for pairs of times and values.  
  • Example: InfluxDB, Whisper 

  

Schema: A project normally starts with a predefined structure of data, tables, fields, users, access, …  

  • User/Password: User to authenticate and authorize access for a database.  Note that this user/password is separate from the system level user/password.  
  • Permissions/Grants: On creation of users, certain permissions are assigned to a user to provide access (read-only, read-write) to various aspects of the database.  
  • Tables: Entities inside a database to organize the data into columns (fields) and rows to make it easy to discover and manage  

  

Common Database Operations:  

  • Create: Operation to assign unique names to a database and tables to store the data  
  • Insert: Operation to load the data to one or more tables in a database.  
  • Update: Operation to update an existing dataset.  
  • Delete: Operation to remove selected or entire dataset.  
  • Truncate: Operation to clean a database for reuse.  
  • ETL: Extract, transform, load. Data can be in text or different formats and need conversion steps before it can be loaded to a database.  

  

Load balanced and high availability:  To have a load balanced database it requires careful configuration and running the service across a cluster of servers.  High availability would require that the front-end service (ingest point) could be migrated automatically to a different server, ideally in a different datacenter.  Without these features the database is running on a single server and limited to those resources (network, memory, disk).  

 

Service Expectations and Limits:  

Database service is not intended for an enterprise service and only has a few servers deployed based on research needs. Service is not load balanced nor high availability which can limit other users if the database server load is high. Database backups or table snapshots can be provided and stored in a backup system if users work with the RC systems engineering team. Typically, open source or open license database servers are deployed. Enterprise Databases like Oracle may need a specific service deployment and management plan.  Database administration (tuning, schema, …) are the responsibility of the research project.    

At FASRC, availability, uptime, and backup schedule is provided as best effort with staff that do not have rotating 24/7/365 shifts.    

 

Available to:  

Available to all users with an FASRC account.  For more information see Account Requests.

All DB requests via email to rchelp@rc.fas.harvard.edu   

 

Service manager and Owner:   

Service Manager: Brian White, Associate Director of Operations & Systems Engineering Team  

Service Owner: Scott Yockel, Director of FAS Research Computing  

 

Offerings (Tiers of Service)   

FASRC currently hosts the following databases for research use, which are available to the cluster or virtual machine services.  All data schema is defined by the researchers, FASRC Systems Engineering only creates the databases for the researcher.   The following types of databases are available with access from the computing cluster or virtual machines.

  • MySQL  
  • PostgresDB  
  • MongoDB  
  • MariaDB  

 

CC BY-NC-SA 4.0 This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at Attribution.