Data Science and Software Engineering play an important role in research by creating new capabilities to process and analyze data, helping ensure reproducibility, and aiding researchers in extracting knowledge and insight from the data. The term software is used broadly here to include all the ways in which one creates and analyzes data. Researchers use software in their research in the form of scripts, tools, open-source software, and licensed software. Data science also covers a wide range of skills and techniques for cleaning (aka wrangling), processing, and statistically analyzing data that are typically beyond what a researcher from a specific domain might have. Because research evolves rapidly, code does not always exist for every function needed, nor are there always clean data sources; therefore, software or data pipelines are developed specifically for a given project.
Traditionally, this development was done by researchers (graduate students and postdocs) or independent contractors. This approach poses several issues in terms of maintenance, optimization, reproducibility, and cost. Research Software Engineering (RSE) as a discipline is well established in the UK and Europe (e.g., https://rse.ac.uk/) and is becoming more common in the US (https://us-rse.org/). Likewise, Data Science is becoming a standalone approach to research (https://www.nationalacademies.org/our-work/envisioning-the-data-science-discipline-the-undergraduate-perspective). An RSE and/or Data Science team can overcome the issues with the traditional software development approach and provide institutional memory for research software projects, which benefits individual research groups as well as the broader Harvard community in the long term.
An RSE or Data Science team can work closely with other Research Computing Systems teams to design, develop, deploy, optimize, and maintain software packages/tools and data pipelines that are paired with specific hardware architectures to accelerate cutting-edge research at Harvard University.
Key Features and Benefits
Data Science and RSE teams alike can design, develop, deploy, optimize, and maintain research software packages, pipelines, and tools to accelerate research. These teams can take a comprehensive full-stack approach to the research problem, covering all front-end (UI/UX design) and backend (web server, database) aspects. The Data Science/RSE team can deploy the pipeline in a local computing cluster environment (if applicable) or offer an in-house solution for cloud deployment. All projects include detailed documentation and training so that users and groups can use the software and data pipelines to enhance research efficiency.
Data Science and RSE teams may have a combination of the following expertise to meet researchers’ needs. The more of these skills a team has, the broader the range of projects it can accomplish: Front-End Developer, Scientific Software Developer, Professional Software Developer, Database Engineer, Data Visualization Engineer, High-Performance Computing (HPC) Expert, Big Data Engineer, Machine Learning (ML) and AI Developer.
Types of Roles:
Front-End Developer: A front-end developer builds the visual front-end components of a software package, application, or website by creating the elements and features that the end user or client can view and access.
Scientific Software Developer: A scientific software developer translates scientific theories into software by writing code. The role therefore calls for a scientific background, excellent programming skills, and solid mathematical knowledge. Scientific software developers are skilled in one or more of the following: Python, Fortran, C++, and MATLAB.
Professional Software Developer: Professional software developers have some industry experience. They build software and applications, and they debug and execute the source code of a software application. They are the key individuals behind all software applications. Generally, developers are well versed in at least one programming language and proficient in structuring and developing software. Besides writing code, they may gather specifications for the design or overall software architecture, write documentation, and contribute to other development processes.
Database Engineer: Database engineers design and build a database management system (DBMS) that fits the needs of the software. They ensure high availability of data and maintain a disaster recovery plan. Depending on the research software requirements, the database could be relational (e.g., Microsoft SQL Server), unstructured, analytical, or document-based.
High-Performance Computing (HPC) Expert: HPC experts help researchers reach solutions faster by developing new, efficient algorithms, increasing parallelization in software algorithms (if applicable), and improving hardware utilization. Their contributions to research software often enable researchers to produce fundamentally new results. They have a strong background in scientific computing and high-performance computing, along with advanced performance tuning skills and experience in MPI/OpenMP and CUDA development.
Big Data Engineer: A Big Data Engineer creates and manages the big data processing infrastructure and tools in a software project (if applicable) and knows how to get results from enormous amounts of data quickly. They are experts in data warehousing solutions and can work with current database and related technologies such as NoSQL, Apache Hadoop, Apache Spark, and Apache Cassandra.
Machine Learning (ML) and AI Developer: An ML/AI developer is an expert in using data to train models. The models are then employed to automate processes such as image classification, artificial creativity, photo/video manipulation, speech recognition, and market forecasting.
Front-End: The end-user-facing views of the software and the interface that users use to interact with it, for example to define inputs, run analyses or experiments, or visualize data. It is also referred to as the presentation layer.
Backend: Also called the data access layer, the backend may consist of three parts: a server, an application, and a database. Development in this space is normally done in coordination with Systems Engineers.
UX/UI Design: UX Design stands for User Experience Design, while UI Design stands for User Interface Design. Both aspects are essential to a software project and work closely together.
Web Server: Server-side software, or the hardware assigned to run it, that serves World Wide Web (WWW) client requests. It processes incoming network requests over HTTP and other protocols.
Cloud: The availability of computer system resources, particularly storage and computing power, without direct management by the user.
Database, cluster computing, virtual machines, and data storage are key components and have been defined in their respective services.
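As an illustration of how the backend terms above fit together, the following is a minimal sketch, not a description of any FASRC deployment. It uses only the Python standard library; the table name samples, the /samples route, the port, and all function names are hypothetical choices made for this example.

```python
import json
import sqlite3
from wsgiref.simple_server import make_server


def create_database():
    """Database: an in-memory SQLite table with hypothetical sample records."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute(
        "CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO samples (name, value) VALUES (?, ?)",
        [("control", 1.0), ("treated", 2.5)],
    )
    conn.commit()
    return conn


def make_app(conn):
    """Application: a WSGI callable that answers GET /samples with JSON."""

    def app(environ, start_response):
        is_get = environ.get("REQUEST_METHOD") == "GET"
        if is_get and environ.get("PATH_INFO") == "/samples":
            rows = conn.execute(
                "SELECT name, value FROM samples ORDER BY id"
            ).fetchall()
            payload = [{"name": name, "value": value} for name, value in rows]
            body = json.dumps(payload).encode("utf-8")
            status = "200 OK"
        else:
            body = b'{"error": "not found"}'
            status = "404 Not Found"
        headers = [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]

    return app


def run_server(host="127.0.0.1", port=8000):
    """Web server: accepts HTTP requests and passes them to the application."""
    with make_server(host, port, make_app(create_database())) as server:
        server.serve_forever()
```

Calling run_server() and then opening http://127.0.0.1:8000/samples in a browser (the front-end, in this sketch) would return the two records as JSON. A production research service would normally replace each piece with a hardened equivalent chosen together with Systems Engineers.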
Statement-of-Work (SOW): Overarching agreement on the technical and/or scientific work to be completed. The SOW also defines key project milestones/deliverables, what is out of scope, the roles and responsibilities of both parties (DS/RSE team and customer), and the communication and reporting structure. It includes an estimate of the project duration and the cost structure (hourly or total project). The SOW should also describe the type and amount of ongoing support available after the work is complete. Any updates to the SOW should be agreed upon by both parties.
Authorship: The Data Science specialist or RSE who performs the work will receive author / developer credit on all resulting research / software output (e.g., publications, talks, posters, documents, software packages).
Selection Process: We evaluate and select projects for collaboration on a case-by-case basis, based on our available time, our expertise in the area, the feasibility of the project being completed in a reasonable time frame, and our interest in the project.
Limitations: Please note, we cannot collaborate on research projects where the primary output is a sole-authored dissertation / thesis.
Training/Documentation: We provide well-documented code with usage examples to support the proper functioning, customization, and future development of the newly created data pipelines, software, and tools. This may also include online or in-person training sessions.
Service Expectations and Limits:
The RSE team consults with researchers on new software development as well as on upgrading and optimizing legacy software. Depending on the workload, a request may fit within the free consultation service provided to researchers (e.g., 1-2 weeks, given the number of FTEs available) or become an extended project that requires a dedicated portion of an FTE.
If a software project requires a web server and a database hosted on the FASRC Cannon cluster, it cannot be intended as an enterprise service; only a few servers are deployed, based on research needs.
Available to all research groups with FASRC accounts. For more information see Account Requests.
Submit all RSE requests by booking a consultation appointment at https://www.rc.fas.harvard.edu/consulting-calendar/ or by email to email@example.com.
Service Manager and Owner:
Service Manager: Mahmood Mohammadi Shad, Associate Director of Research Software Engineering
Service Owner: Scott Yockel, Director of FAS Research Computing
Offerings (Tiers of Service)
FASRC currently offers the following services by the RSE team:
- Development of scientific software packages
- Development of functional and robust UI/UX
- Addition of critical features to existing codebases
- Maintenance of the current codebases developed by researchers
- Development of Machine Learning/Big Data/Deep Learning apps and platforms
- Development of data acquisition and analysis automation platforms
- Performance improvement of existing software packages
- Complex database design and deployment
See RSE Team Page for prior projects.
Tier 1: Small, single tasks (free)
Tier 2: Individual project with an SOW that defines start and end dates
Tier 3: Product with ongoing development and operations under an SLA