
galaxy genetics

Galaxy Community Hub

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational research.

  • Accessible: no programming experience is required to upload data, run complex tools and workflows, and visualize results.
  • Reproducible: Galaxy captures information so that you don’t have to; any user can repeat and understand a complete computational analysis, from tool parameters to the dependency tree.
  • Transparent: Users share and publish their histories, workflows, and visualisations via the web.
  • Community centered: Our inclusive and diverse users (developers, educators, researchers, clinicians, etc.) are empowered to share their findings.

Welcome to the Galaxy Community Hub, where you’ll find community curated documentation of all things Galaxy.

Get involved with Galaxy: Introducing Galaxy Working Groups – In a big way. Join the Dec 10 Galaxy Developer Round Table

Galaxy Release 20.09 Video Summary – GTN in Galaxy, upload from Dropbox, workflow enhancements, better performance!

Activity report of the last 5 years of the Freiburg Galaxy Team – Crucial infrastructure serving more than 22,000 researchers

Events

Gateways Focus Week – Make your gateway sustainable

Galaxy @ Inbix2020 – Recent advances in computational biology for codifying Biodiversity into one health approach: Biodiversity, Climate change, One health and Zoonotic diseases

Hackathon on Galaxy Interactive Tools (GxIT) – A hackathon for sharing skills in software development and system administration for Galaxy's Interactive Tools.

Introduction to Galaxy Workshop – Open to everybody interested in Galaxy

Metabarcoding training 2020 – Metabarcoding concepts and data processing using the SAMBA software in Galaxy


The Galaxy Project is supported in part by NSF, NHGRI, The Huck Institutes of the Life Sciences, The Institute for CyberScience at Penn State, and Johns Hopkins University.


Galaxy: A platform for interactive large-scale genome analysis

Belinda Giardine (1), Cathy Riemer (1), Ross C. Hardison (1), Richard Burhans (1), Laura Elnitski (2), Prachi Shah (1, 2), Yi Zhang (1), Daniel Blankenberg (1), Istvan Albert (1), James Taylor (1), Webb Miller (1), W. James Kent (3), Anton Nekrutenko (1)

1 Center for Comparative Genomics and Bioinformatics, Huck Institutes for Life Sciences, Penn State University, University Park, Pennsylvania 16802, USA

2 National Human Genome Research Institute, Bethesda, Maryland 20892, USA

3 Department of Computer Science and Engineering, University of California at Santa Cruz, Santa Cruz, California 95064, USA


Abstract

Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu.

Currently available genome browsers (UCSC Genome Browser [Kent et al. 2002, http://genome.ucsc.edu], NCBI MapViewer [Wheeler et al. 2005], and Ensembl [Birney et al. 2004, http://www.ensembl.org]) allow experimental biologists with no programming experience to locate and visualize genomic regions using intuitive graphical interfaces. However, more sophisticated analyses (e.g., “find all DNase I hypersensitive sites within introns of RefSeq genes on human chromosome 22 that are also conserved in the mouse and rat genomes but not in the dog genome”) still rely on programming and database skills. To solve this problem we designed Galaxy, a system for the integration of genomic sequences, their alignments, and functional annotation. Galaxy is not a browser. Instead, it allows users to gather and manipulate data from existing resources in a variety of ways. Every action of the user is recorded and stored in the history system, a key element of Galaxy. This allows users to conduct independent queries on genomic data from different sources and then use Galaxy to combine or refine them, perform calculations, or extract and visualize corresponding sequences or alignments. Operations such as join, union, intersection, and subtraction can be accomplished using a simple interface.

Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments. For example, the premier metaserver for the retrieval, analysis, and display of protein and DNA sequences, SRS (Etzold and Argos 1993; Zdobnov et al. 2002), does not provide access to precomputed genome sequence alignments, scores derived from those alignments, expression data, or other genomic data types that are central to Galaxy. Other examples of efforts integrating various data sources and analysis tools include ISYS (Siepel et al. 2001) and the Biology Workbench (Subramaniam 1998). ISYS requires programming experience and serves as a development framework rather than a ready-to-use tool. Biology Workbench is one of the most comprehensive Web-based collections of sequence analysis software. However, it is unsuitable for the analysis of genomic data as it cannot handle large sequence data sets. Here we describe the presently implemented functionality of the Galaxy system, show examples of usage, and discuss some aspects of its design.

Results and Discussion

Data retrieval and manipulation

Presently, Galaxy contains three major classes of data manipulation: query operations, sequence analysis tools, and output displays. The first class includes standard set operations such as union, intersection, subtraction, and complement as well as filters based on region size, proximity to regions from another query, and clustering by distance of regions within a single query ( Fig. 1 ). Sequence analysis tools are stand-alone modules designed to perform biologically oriented calculations such as finding orthologous regions in another species, extracting genomic alignments, computing Ka/Ks ratios, and retrieving GC content or conservation in the regions of interest. Finally, displays allow retrieving/viewing of the results generated by the user in a variety of formats. Current options include displaying the query results as a custom track at the UCSC or Ensembl Genome Browsers and downloading a text file in various formats (standard BED, Ensembl upload, or raw); additional formats are provided by individual tools (e.g., score distribution plot, Ka/Ks sliding window profile). Alignment viewers such as Laj (Wilson et al. 2001) and zPicture (Ovcharenko et al. 2004) are planned for the near future.
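The size filter and the clustering by distance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Region` tuple, the function names, and the `max_gap` parameter are hypothetical and are not taken from Galaxy's own libraries; coordinates are treated as 0-based, half-open, as in BED.

```python
from typing import List, Tuple

# (chromosome, start, end); 0-based, half-open coordinates as in BED.
Region = Tuple[str, int, int]


def filter_by_size(regions: List[Region], min_len: int, max_len: int) -> List[Region]:
    """Keep regions whose length lies in [min_len, max_len]."""
    return [r for r in regions if min_len <= r[2] - r[1] <= max_len]


def cluster_by_distance(regions: List[Region], max_gap: int) -> List[List[Region]]:
    """Group regions on the same chromosome that lie within max_gap bases of each other."""
    clusters: List[List[Region]] = []
    cluster_end = 0  # rightmost end seen in the current cluster
    for region in sorted(regions):
        chrom, start, end = region
        same_chrom = bool(clusters) and clusters[-1][0][0] == chrom
        if same_chrom and start - cluster_end <= max_gap:
            clusters[-1].append(region)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([region])
            cluster_end = end
    return clusters


# Made-up coordinates, for illustration only.
regions = [("chr22", 100, 200), ("chr22", 250, 260), ("chr22", 1000, 1500), ("chr1", 50, 90)]
print(filter_by_size(regions, 20, 200))
print(cluster_by_distance(regions, max_gap=100))
```

Sorting first keeps the clustering to a single pass per chromosome, which matters once a query holds many thousands of regions.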

Galaxy supports several variations of the basic set operations, to accommodate the fact that our elements are coordinate-based regions rather than simple atomic objects.
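To see why coordinate-based regions call for more than one flavor of a set operation, consider the sketch below. It assumes, for illustration only, that "intersection" can mean either the shared bases or the whole regions of the first query that overlap the second; the function names are hypothetical and intervals are 0-based, half-open on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open, single chromosome


def overlapping_pieces(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection, variant 1: only the bases shared by a and b."""
    pieces = []
    for a_start, a_end in a:
        for b_start, b_end in b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                pieces.append((start, end))
    return sorted(pieces)


def regions_with_overlap(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection, variant 2: whole regions of a that overlap any region of b."""
    return [(s, e) for s, e in a if any(s < b_end and b_start < e for b_start, b_end in b)]


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Subtraction: the parts of each region in a not covered by any region of b."""
    result = []
    for start, end in a:
        cursor = start  # left edge of the part of a not yet removed
        for b_start, b_end in sorted(b):
            if b_end <= cursor or b_start >= end:
                continue
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            result.append((cursor, end))
    return result


# Made-up coordinates, for illustration only.
exons = [(100, 200), (300, 400)]
conserved = [(150, 350)]
print(overlapping_pieces(exons, conserved))
print(regions_with_overlap(exons, conserved))
print(subtract(exons, conserved))
```

With simple atomic objects the two intersection variants would coincide; with regions they answer different questions ("which bases are shared?" versus "which of my regions are hit?").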

The basic functionality of Galaxy is best illustrated with an example. Here we use the Galaxy history system to combine independent queries to find single nucleotide polymorphisms (SNPs) within coding exons of the human insulin-like growth factor II (IGF-II) gene. At the Galaxy portal page the user first chooses a genomic region of interest (the IGF-II locus) from the UCSC Table Browser (Karolchik et al. 2004) ( Fig. 2A ), which sends its results (genomic coordinates of coding exons) directly to Galaxy (for this purpose the Table Browser interface features a “Send Results to Galaxy” option). The Galaxy history page then displays one query, which contains the genomic coordinates for each protein-coding exon of the IGF-II gene ( Fig. 2B ). Because our goal is to find all SNPs associated with coding exons, we go back to the Table Browser and repeat the process, this time requesting all SNPs that fall in the genomic region of the IGF-II gene. Now the requested SNPs will appear as the second query on the history page ( Fig. 2C ). However, we are only interested in SNPs that fall within coding exons, so to identify these we apply the intersection operation to the two queries ( Fig. 2C ). The result (six SNPs are found within protein-coding exons of IGF-II) is displayed as a new item on the history page ( Fig. 2D ). At this point, the user can download the results or display them as a custom track at the UCSC Genome Browser (generating an image similar to Fig. 3 ).

Figure 2. Galaxy history system for querying UCSC Table Browser. (A) UCSC Table Browser page sending results to Galaxy. (B) Galaxy’s history page with a single query. (C) History page showing how Galaxy can be used to find intersection between two queries. (D) History page displaying intersection results.

Figure 3. Examples of promoters characterized by binding of the transcription initiation complex and/or high conservation. Images from the UCSC Genome Browser generated via Galaxy illustrate (A) a promoter that has strong conservation (indicative of purifying selection) and biochemical evidence of binding by RNA polymerase II and TAF1, (B) a promoter that is poorly conserved but is strongly bound by RNA polymerase II and TAF1, and (C) a strongly conserved promoter that is not bound by the transcription initiation machinery in the cells tested. The track labeled “galaxy” is the custom track automatically generated by Galaxy for each query number (34, 35, and 36). Genes are labeled and have exons as boxes and introns as lines with arrowheads pointing in the direction of transcription. “Conservation” is the phastCons track followed by positions of aligning DNA in homologous regions of other species. The positions of promoters are shown as rectangles. The results of chromatin immunoprecipitations (ChIP data) are plotted as the negative log of the p-value, ranging in the vertical direction from 0 to 10.0 and with a continuous thin line placed at the threshold of 2.0. Positions of repeats identified by RepeatMasker (A.F.A. Smit and P. Green, unpubl., http://ftp.genome.washington.edu/RM/RepeatMasker.html) are shown as black rectangles in panels B and C.

Combining and comparing ENCODE data to find promoters

Locating promoters is one of the aims of the ENCODE consortium (ENCODE Project Consortium 2004). Six data tracks relevant to this goal are already deposited at the UCSC ENCODE portal. These include empirical results (experimentally validated promoters [Trinklein et al. 2003], DNase I hypersensitive sites, and regions bound by RNA polymerase II or TAF1 [Kim et al. 2005]) and computational predictions (multispecies conserved sequences [Margulies et al. 2003], phastConsElements [Siepel et al. 2005], and regions with high regulatory potential [Kolbe et al. 2004]). The ability to combine and compare these diverse data is critical for their biological interpretation. The following example shows that Galaxy is well suited for this purpose. Starting at the Galaxy portal, the UCSC Table Browser was used to retrieve genomic intervals that passed reasonable thresholds for each of the six data types (see online supplement). Galaxy operations (intersection and subtraction) were then applied to compare the data sets, determining what fraction of experimentally verified promoters had the other properties investigated ( Table 1 ). Of the 289 promoters, 95 (33%) are both highly conserved (phastConsElement) and have significant binding by TAF1 in HeLa cells. Thus, these promoters can be identified either by strong conservation or by experimental results (such as TAF1 binding). One example is the CAV1 promoter ( Fig. 3A ). Of the remaining 194 promoters, 58 intersect with a segment having significant binding by TAF1, and 44 intersect with a phastConsElement. Thus, some promoters are characterized by TAF1 binding but not strong conservation, exemplified by LOC85865 ( Fig. 3B ). These will be more difficult to identify by comparative genomics approaches. Others are characterized by strong conservation but do not show evidence of TAF1 binding. The example of PYGM ( Fig. 3C ) could be explained by the fact that the glycogen phosphorylase encoded by the gene is made primarily in muscle cells, whereas the binding data are from HeLa cells, which were derived from a cervical carcinoma.

Table 1.

Number of regions within ENCODE targets with properties associated with gene promoters

Type of region            Regions exceeding threshold   Promoters overlapping regions   Promoters overlapping regions (%)
Promoters                 289                           289                             100
DNase I HSs               230                           71                              25
Bound by RNA polymerase   2121                          175                             61
Bound by TAF1             573                           153                             53
MCS                       23,148                        179                             62
phastConsElements         7479                          139                             48
RP                        16,170                        161                             56

(DNase I HSs) DNase I hypersensitive sites; (MCS) multispecies conserved sequence; (phastConsElements) DNA sequence whose multispecies alignment falls within the 5% most highly conserved genomic intervals in human; (RP) regulatory potential.
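The percentage column of Table 1 follows directly from the overlap counts and the 289 validated promoters. The short check below recomputes it; the counts are copied from the table, while the variable names are ours.

```python
# Number of validated promoters in the ENCODE targets (Table 1, first row).
promoters_total = 289

# Promoters overlapping each type of region (Table 1, third column).
overlaps = {
    "DNase I HSs": 71,
    "Bound by RNA polymerase": 175,
    "Bound by TAF1": 153,
    "MCS": 179,
    "phastConsElements": 139,
    "RP": 161,
}

# Percentage of promoters overlapping each type of region, rounded to whole percent.
percentages = {name: round(100 * n / promoters_total) for name, n in overlaps.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```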

Evolutionary analyses with Galaxy

Our system will allow users to apply existing molecular evolution algorithms directly to sequences and alignments retrieved through Galaxy queries. The current release of Galaxy features a tool for calculation of synonymous (Ks) and non-synonymous (Ka) substitution rates using the Yang-Nielsen algorithm (Yang and Nielsen 2000). The tool allows traditional estimation across the entire length of a selected sequence as well as estimation using a sliding window approach. The estimates obtained with the tool can be used to perform the Ka/Ks ratio test, the most widely used predictor of selection acting on a protein coding region (Li 1997). The sliding window Ka/Ks test is a simple analysis that can provide a wealth of information about the selection regime of a gene of interest (Endo et al. 1996; Presgraves et al. 2003). This test provides significantly greater resolution compared with the conventional Ka/Ks test, which is overly conservative for detecting deviations from a negative selection scenario as it averages Ka and Ks estimates over the entire sequence (Li 1997). Galaxy users are now able to apply this analysis to any coding sequence available from the UCSC Table Browser (e.g., as shown in Fig. S1).
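The mechanics of a sliding window over codons can be sketched as follows. This is deliberately not the Yang-Nielsen estimator used by the Galaxy tool: as a stand-in, it only counts codon pairs that differ at a single site and classifies them as synonymous or non-synonymous under the standard genetic code. The function names, window size, and toy sequences are assumptions made for illustration.

```python
from itertools import product
from typing import Iterator, List, Tuple

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> List[str]:
    """Split an in-frame coding sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def count_differences(codons_a: List[str], codons_b: List[str]) -> Tuple[int, int]:
    """Return (nonsynonymous, synonymous) counts for codon pairs differing at one site."""
    nonsyn = syn = 0
    for ca, cb in zip(codons_a, codons_b):
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(ca, cb)) != 1:
            continue  # multi-site changes need a substitution model; skipped in this toy
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return nonsyn, syn


def sliding_window_counts(
    seq_a: str, seq_b: str, window: int, step: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (first codon, end codon, nonsynonymous, synonymous) for each window."""
    ca, cb = codons(seq_a), codons(seq_b)
    n = min(len(ca), len(cb))
    for start in range(0, max(n - window, 0) + 1, step):
        end = min(start + window, n)
        nonsyn, syn = count_differences(ca[start:end], cb[start:end])
        yield start, end, nonsyn, syn


# Toy aligned sequences (no gaps), for illustration only.
seq_a = "ATGGCTAAAGGTTTA"
seq_b = "ATGGCCAGAGGTCTA"
profile = list(sliding_window_counts(seq_a, seq_b, window=2, step=1))
print(profile)
```

A whole-sequence estimate corresponds to a single window spanning every codon; the per-window values are what let a localized signal stand out instead of being averaged away.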

Conclusions

The Galaxy system pioneers a new generation of interactive tools for large-scale genome analysis. It allows large-scale analyses that previously required users to have substantial programming experience and database management skills. The Galaxy history page is simple to use, yet quite powerful, and is able to handle large genome annotation data sets. Users have the ability to perform multiple types of analyses (e.g., query intersections, subtractions, and proximity searches) and then display the results using existing browsers (e.g., the UCSC Genome Browser or Ensembl). In the future we plan to add a powerful toolbox that will include the most popular sequence and genome analysis algorithms. Galaxy’s permanent Web site address is http://www.g2.bx.psu.edu.

Methods

Modularity

Galaxy is designed as a set of separate software components that work together to perform tasks. The central “core” component orchestrates the action, executes queries, and keeps track of user histories, while the user interface(s) (UIs) and operation/tool/output libraries are implemented separately. All communication with other sites (UCSC Table Browser, etc.) is handled by the core component. Benefits of this arrangement include extensibility (ease of adding new tools and interfaces) and convenient division of labor and expertise among programmers. Also, the operation libraries are available for use by other projects, such as ENCODEdb.

The UIs communicate with the core component via HTTP (Web) requests, using the GET or POST methods. The core provides an API (application program interface) consisting of the requests it is prepared to handle, such as using a tool, retrieving a user’s query history for a particular assembly of a genome, etc. When the user runs a query at another source site (e.g., the Table Browser), the core passes its connection with the user’s Web browser on to the Galaxy UI via HTTP redirection. Using an HTTP API makes it easy to support a variety of UIs, which do not have to be running on the same server. In fact, any site on the Web could set up its own UI for Galaxy by crafting the appropriate HTTP requests, and individual researchers can use the API directly for programmatic access to Galaxy’s features.
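As a rough picture of what "crafting the appropriate HTTP requests" might look like from a client, the sketch below builds GET and POST requests with Python's standard library. The base URL, the `action` parameter, and the `get_history`, `user`, and `assembly` names are all hypothetical; the text does not list the requests the core actually accepts.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical address of a Galaxy core component; not a real endpoint.
GALAXY_CORE_URL = "http://galaxy.example.org/core"


def build_request(action: str, params: dict, method: str = "GET") -> Request:
    """Build (but do not send) an HTTP request for a hypothetical core action."""
    query = urlencode({"action": action, **params})
    if method == "GET":
        # GET carries the parameters in the URL query string.
        return Request(f"{GALAXY_CORE_URL}?{query}", method="GET")
    # POST carries the same parameters form-encoded in the request body.
    return Request(GALAXY_CORE_URL, data=query.encode("ascii"), method="POST")


# Ask for one user's query history on a given genome assembly (made-up values).
req = build_request("get_history", {"user": "demo", "assembly": "hg17"})
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is fictional.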

Language

The Galaxy core component and operation libraries are written in C and are built to the standards of the Bioinformatics group at UCSC. Thus, if it turns out to be more effective to run some Galaxy functions from UCSC instead of PSU, the programs are ready to be run there. Also, this code makes use of UCSC utility libraries to avoid duplication of effort.

Our initial UI (called HUI for History User Interface) is written in Perl for convenient text manipulation and CGI access, but one could use any language that can generate an HTTP request.

Local storage

Although Galaxy primarily processes source data obtained from other sites, it does have a local database for storing user histories (implemented in MySQL for compatibility with UCSC). It also stores some precomputed query results. We originally implemented this as a way to avoid recomputing popular and/or time-consuming queries again and again, but now we also view this “featured data sets” facility as a way to provide public access to newly obtained research results before they are available on the primary data sites, and to data sets that are too large for uploading from the Table Browser.

Additional local storage is used for reference data and temporary workspace needed by some of the tools, and for caching query results, output files, and custom data sets (uploaded by users) for further manipulation and/or subsequent retrieval.

Data format

The primary format that Galaxy uses to store query results is the BED (Browser Extensible Data) format that is used at UCSC for Genome Browser tracks and also is accepted at several other sites. This is a tab-separated text format readable by both humans and computer programs. The BED format is convenient for interoperability with UCSC’s Table Browser, Genome Browser, and other tools, but it has fairly strict limitations on the associated fields since it is primarily geared toward displaying the regions rather than conducting further analysis. Currently we are working on extensions to the BED format by adding extra columns that can be readily truncated back to true BED for the tools that require it, but ultimately we will probably need to use a more generic format, or several formats, to handle a broader range of data types. In particular, there are well-established file formats for alignment data (such as AXT and MAF) that are used directly. But regardless of the data formats we end up using internally, it will be important for Galaxy to provide a suitable complement of conversion tools so users can easily obtain output in whatever format they need.
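The idea of extra columns that can be "truncated back to true BED" can be illustrated with a small sketch. It assumes a tab-separated line whose first six fields are the standard BED columns (chrom, start, end, name, score, strand) followed by one extension column; that extension column, the coordinates, and the function names are hypothetical.

```python
def parse_bed_line(line: str) -> dict:
    """Parse the three required BED fields; keep any further columns as strings."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based start
        "end": int(fields[2]),    # end is exclusive (half-open interval)
        "extra": fields[3:],
    }


def truncate_to_bed(line: str, n_columns: int = 3) -> str:
    """Drop columns beyond n_columns so that BED-only tools can read the line."""
    return "\t".join(line.rstrip("\n").split("\t")[:n_columns])


# Six BED columns plus one hypothetical extension column (a GC fraction).
extended = "chr11\t1000\t1500\texon1\t0\t-\t0.42\n"
record = parse_bed_line(extended)
print(record["end"] - record["start"])          # region length in bases
print(truncate_to_bed(extended, n_columns=6))   # line with the extension removed
```

Keeping the extension columns at the end of the line is what makes the truncation a simple slice rather than a format conversion.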
