zPicture instructions

How to use zPicture (zpicture.dcode.org)

Introduction
Input Sequences

FASTA Format

Uploading FASTA Sequences
Downloading Sequences from NCBI
Downloading Sequences and Gene Annotation from UCSC Genome Browser

Gene Annotation
Annotation of Repetitive Elements
Progress Report
Return to Previously Submitted Request
Results

Dynamic Visualization
Obtaining sequences underlying ECRs
Dot-Plots
rVista Portal (Identification of Evolutionary Conserved Binding Sites)
Interactive Updates (Modifying Annotation and Sequence Titles)
Identification of Evolutionary Conserved Regions (ECRs)
Blastz and Textual Alignment Files
Access to the Input Sequences

Visualization Legend
Questions or Suggestions?

Introduction
zPicture is a dynamic and interactive multiple nucleotide-sequence alignment tool. It is based on the blastz pairwise sequence aligner, which is also used by the PipMaker tool. zPicture accepts several types of input data for genomic sequences. Upon submission, repetitive elements are masked and alignments are generated for the non-repetitive regions between the reference sequence (the sequence that was submitted first) and each of the other submitted sequences. Next, conservation profiles are constructed and visualized as either Pip- or Smooth-graph conservation plots.

Figure: pip- and smooth-plots

Input Sequences
zPicture accepts input sequences in the FASTA format (see example below) and provides four different options for uploading sequences.

FASTA Format
The FASTA nucleotide sequence format consists of a header line followed by the nucleotide sequence. The header line should be the first line in the file and should start with the '>' (greater than) symbol followed by the sequence name (which can be a gene name, locus name or any sequence title). The nucleotide sequence should consist only of the capitalized letters A, C, G, T and N. Capitalization is particularly important for zPicture analysis, since sequences in the UCSC Genome Browser database indicate repetitive elements by lower-case letters. You can find more about lower- and upper-case letters in the section on Annotation of Repetitive Elements. Here is an example of a FASTA input file.

> BTNL2 gene
ATGGTGGATTTTCCAGGCTACAATCTGTCTGGTGCAGTCGCCTCCTTCCT
ATTCATCCTGCTGACAATGAAGCAGTCAGGTAGGATTCCCTTCTCCCTTT
ACTGTATAGTCTAATGTCCCAGTGAGCTAGTCTGGGTCCAAAGGTCGAGA
ACAACATCTAAGAGTGTAAGTCTGGGGCCAAGCCACCTGTATCCAAAAAG
GAACTCCTCACTTTTGAGGAGCTCCTCCACTCCCAGGAGCTCCTCCACTC
CTAGCTGAGTCACCTTTGGAAAGTTACTTGAGCACCTCATACCTTAGTTC
TTTCACCTTTTTAATGAGAATAACAGCAGTAACTACATCTCAGAGGCCAG
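
The format rules above can be checked before submission. The following is a minimal Python sketch, not part of zPicture; the function names parse_fasta and invalid_letters are hypothetical, and it assumes a file holding a single sequence record.

```python
def parse_fasta(text):
    """Return (title, sequence) from a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("the first line must be a header starting with '>'")
    title = lines[0][1:].strip()
    sequence = "".join(lines[1:])
    return title, sequence


def invalid_letters(sequence):
    """Return the sorted characters that are not A, C, G, T or N (case ignored)."""
    return sorted(set(sequence.upper()) - set("ACGTN"))


example = """> BTNL2 gene
ATGGTGGATTTTCCAGGCTACAATCTGTCTGGTGCAGTCGCCTCCTTCCT
ATTCATCCTGCTGACAATGAAGCAGTCAGGTAGGATTCCCTTCTCCCTTT
"""

title, seq = parse_fasta(example)
print(title, len(seq), invalid_letters(seq))
```

The letter check ignores case on purpose, because lower-case letters are a legitimate way to mark repeats.
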

Uploading FASTA Sequences
    There are two options for uploading sequences from your personal computer into the zPicture program. One option is to open the sequence file on your computer in a text editor (Notepad, MS Word or similar), select all the text, then copy and paste it into the "Paste sequence" window provided by zPicture. You can modify the sequence after you have pasted it into the window. The second option is to upload a FASTA-formatted plain-text file using the "FASTA file" upload option. Please note that you cannot upload Word, Excel or any other non-text files to zPicture using this option.

Downloading Sequences from NCBI
    If the sequence you would like to align with zPicture is available in the NCBI database (National Center for Biotechnology Information) and has an accession number assigned to it (AC145542.1, for example), you can type this accession number into the field labelled "NCBI accession #" and click the <submit> button. With this option, the sequence is uploaded into zPicture directly from the NCBI database and submitted for alignment.

Downloading Sequences and Gene Annotation from the UCSC Genome Browser Database
    Alternatively, sequences can be downloaded automatically from the UCSC Genome Browser Database. If you want to align sequences from specific regions of the Human, Mouse or Rat genomes, you can use the zPicture automatic-fetch option. To do this, click on the 'UPLOAD sequences and gene annotations from UCSC Genome Browser' link. You will be directed to a new page, where you need to indicate the genome, the assembly, the type of gene annotation and the chromosomal location within the genome to be used for download. The nucleotide position should be given in the chrom:from-to format.

Different gene annotation tables can be used for extracting gene positional information, including RefSeq genes, "Known genes" from UCSC, Ensembl gene predictions, etc. In addition, repetitive elements are premasked and indicated by lower-case letters in the otherwise upper-case sequence files downloaded from the UCSC Genome Browser.
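
A chrom:from-to position can be validated before it is typed into the form. The sketch below is illustrative only: the function name parse_region, the requirement of a "chr" prefix, and the example coordinates are assumptions, not zPicture or UCSC behaviour.

```python
import re

# Assumed shape of a position string: chrN:from-to, e.g. chr6:1000-2000.
REGION_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")


def parse_region(region):
    """Split a chrom:from-to string into (chrom, start, end)."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError("expected chrom:from-to, got %r" % region)
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("'from' must not be greater than 'to'")
    return chrom, start, end


# Hypothetical coordinates, used only to show the format.
print(parse_region("chr6:1000-2000"))
```
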

Gene Annotation

Gene annotations are useful for visualizing the conservation profiles of different types of sequence and for distinguishing between coding exons, UTRs, introns and intergenic elements. Annotation data can be provided using the "Gene annotation" input window. Each gene is described by a header line giving the direction, position and name of the gene, followed by a line-by-line description of every exon within the transcript, which separates exons from introns. Several genes inside a larger locus can be described in the same annotation file. For example, two genes, BTNL2 and ER3, that span the [100,200] and [400,500] intervals would be represented by the following annotation file:

> 100 200 BTNL2
100 120 UTR
121 140 CDS
180 200 CDS
< 400 500 ER3
400 500 CDS

The ">" and "<" symbols in the header lines indicate the direction of the genes (BTNL2 is transcribed in the forward direction, while ER3 is transcribed in the reverse direction). 'UTR' denotes an untranslated region; 'CDS' denotes a coding sequence. The [141,179] interval, which is not covered by any exon of BTNL2 but lies within the gene, corresponds to an intron of this gene.

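The annotation format can be read with a few lines of code. This Python sketch is an illustration rather than zPicture's own parser; the names parse_annotation and introns are hypothetical, and it assumes whitespace-separated fields exactly as in the example file. Introns are taken to be the gaps between consecutive exon lines of a gene.

```python
def parse_annotation(text):
    """Parse an annotation file into a list of gene dictionaries."""
    genes = []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in (">", "<"):
            # Header line: direction, start, end, gene name.
            genes.append({
                "strand": "+" if fields[0] == ">" else "-",
                "start": int(fields[1]),
                "end": int(fields[2]),
                "name": fields[3],
                "features": [],
            })
        else:
            # Feature line: start, end, type (UTR or CDS).
            genes[-1]["features"].append((int(fields[0]), int(fields[1]), fields[2]))
    return genes


def introns(gene):
    """Return the gaps between consecutive features of one gene."""
    feats = sorted(gene["features"])
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(feats, feats[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


annotation = """> 100 200 BTNL2
100 120 UTR
121 140 CDS
180 200 CDS
< 400 500 ER3
400 500 CDS
"""

for gene in parse_annotation(annotation):
    print(gene["name"], gene["strand"], introns(gene))
```

For the example file, the only gap found is the [141,179] interval of BTNL2; ER3 has a single exon and therefore no intron.
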
Annotation of Repetitive Elements
    zPicture masks repetitive elements prior to generating blastz alignments. Therefore, only non-repetitive bases are used to create alignments, resulting in fast and reliable alignments with 'clean' conservation profiles. Two different repeat-masking options are available. The simplest is to submit premasked sequences in which non-repetitive nucleotides are capitalized, while repeats are either in lower case or indicated by 'N's. The UCSC Database uses this convention, so sequences downloaded from UCSC are premasked for repeats by default. The second option allows users to mask repetitive elements within their submitted sequences by letting zPicture run the locally installed RepeatMasker program (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker). Repetitive elements specific to different species can be identified using this option. Please allow extra time for this step, since the masking process increases the time needed to generate alignments.
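
The two premasking conventions (lower-case repeats and 'N' repeats) are easy to convert and inspect. The sketch below is a hedged illustration with hypothetical function names; it does not reproduce what zPicture or RepeatMasker do internally.

```python
def hard_mask(sequence):
    """Replace every lower-case (repeat) base with 'N'; keep other bases."""
    return "".join("N" if base.islower() else base for base in sequence)


def masked_fraction(sequence):
    """Fraction of the sequence marked as repeat (lower-case or 'N')."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)


soft_masked = "ACgtTTnA"  # made-up sequence, lower case marks repeats
print(hard_mask(soft_masked))
print(masked_fraction("ACgtNNAA"))
```

A high masked fraction means that few bases remain available for alignment, which may explain a sparse conservation profile.
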

Progress Report
    zPicture does not use email, both to protect user privacy and to avoid unpredictable delays in email delivery. Instead, a zPicture alignment progress report is provided in the same window in which the sequences were submitted. The progress report is automatically updated at short intervals and is replaced by the results report as soon as zPicture processing has been completed. An ID number is provided for each alignment request, allowing the user to save the request and to return to the data as desired.

Return to Previously Submitted Request
    A zPicture request ID is provided for every zPicture submission. If saved, this number can be used to return to the zPicture results. To return to a previous request, type the zPicture ID number into the window provided on the zPicture home page and press the submit button.

Results
After zPicture alignments are completed, a summary page displays the results. This web page contains several links to data analysis and visualization options. These consist of (1) a dynamic conservation profile visualization module, (2) a dot-plot that displays positional relationships among the aligned sequences, (3) interactively modifiable sequence titles and gene annotation files, (4) a dynamic extraction option for retrieving sequences underlying Evolutionary Conserved Regions (ECRs), (5) blast-type and/or blastz-type alignments, and (6) the input sequence files.

Dynamic Visualization

   The dynamic visualization option allows users to actively modify the conservation profile for the alignments. The base sequence position is plotted on the x-axis, while the y-axis displays the level of sequence similarity at each position (percent identity, %ID). Several conservation-plot parameters can be adjusted to obtain the most informative visual display for the alignment. The conservation profiles can be displayed as pip-plots or as smooth-graphs. Pip-plots display all the short ungapped alignments as black horizontal lines. The length of a pip-line reflects the length of the ungapped alignment in the base sequence, while its vertical position indicates the %ID for this alignment. The smooth-graph is constructed using a 100 bp sliding window. The smooth-graph height at a position x represents the level of sequence identity averaged over the sliding window centered at this position. The detection of Evolutionary Conserved Regions (ECRs) is dynamic. The "ECR length" and "ECR similarity" parameters define the minimum length and minimum %ID of an aligned region to be scored as an ECR. The "Bottom cut-off" parameter defines the minimum level of %ID shown in the plot. The "Base-top switch" option allows for switching the base sequence. For example, if human was the base (first) sequence in a human/mouse alignment, this option permits the user to visualize the alignment using the positional information for the mouse (second) sequence as the base organism without recalculating the alignment. The "Graph height" parameter defines the height of each plot layer (please see the Visualization Legend section for details).
    Conservation profiles can easily be saved as image files by clicking the right mouse button on the image and selecting the "Save image (or picture) as..." option. Save the image as "zPicture.png", for example.
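
The sliding-window averaging behind the smooth-graph can be sketched as follows. This is an illustration under stated assumptions, not zPicture's implementation: the input is assumed to be a per-base list with 1 for an identical aligned base and 0 otherwise, the function name smooth_profile is hypothetical, and windows are simply truncated at the sequence ends.

```python
def smooth_profile(identity, window=100):
    """Average per-base identity (0/1) in a window centered at each position.

    Returns one percent-identity value per position of the base sequence.
    """
    n = len(identity)
    half = window // 2
    # Prefix sums make each window sum an O(1) lookup.
    prefix = [0]
    for value in identity:
        prefix.append(prefix[-1] + value)
    profile = []
    for x in range(n):
        start = x - half
        lo = max(0, start)
        hi = min(n, start + window)  # window is truncated at the ends
        profile.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return profile


# Tiny made-up example with a 2-base window instead of 100 bp.
print(smooth_profile([1, 1, 0, 0], window=2))
```
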

Obtaining sequences underlying ECRs
    It is easy to extract the alignment and sequences underlying any Evolutionary Conserved Region (ECR) from the "Conservation Plot". Just click on a colored peak and you'll be forwarded to a web-page with alignment and sequence details for the particular ECR selected.

Dot-Plots

Dot-plots indicate positional relationships between the aligned sequences at different parts of the alignment. The x- and y-axes linearize the bottom and top sequences, and the diagonal lines illustrate ungapped alignments. The ends of each diagonal line indicate the starting and ending coordinates of the sequences that were used to generate that particular ungapped alignment. Forward strand alignments are colored in red; reverse strand alignments are in blue. Sequence titles and lengths are given at the end of each axis.

rVista Portal

zPicture alignments can be searched for transcription factor binding site motifs using the rVista tool, via the rVISTA submission portal automatically provided on the summary page. rVista (G. Loots et al., Genome Res. 2002 May; 12(5):832-9) uses sequence alignments to filter out binding sites that either do not have counterparts in the second sequence or are located in divergent regions with low sequence similarity. rVISTA identifies transcription factor binding sites that are highly conserved locally, and allows for cluster analysis.

Interactive Updates
One of the unique zPicture features is its ability to modify input data without redoing sequence alignments. Gene annotations and sequence titles can be modified after the alignments have been completed. Gene annotations can be either resubmitted (if the wrong input was submitted) or modified to add, shift or remove features. This feature can also be used to change the names of genes and the sequence titles, and so to customize the alignments.

Identification of Evolutionary Conserved Regions (ECRs)

The number and distribution of ECRs depend on the parameters used to detect conserved elements. There are no ideal parameters that will work reliably for every sequence comparison. Parameters need to be adjusted depending on the evolutionary relationships of the compared sequences and the evolutionary forces that drove the divergence of these homologous sequences. Two parameters are used to identify ECRs: the minimal "ECR length" and the minimal "ECR sequence similarity". The dynamically modified output depends on the specified parameters and lists all the identified ECRs and their overlaps with annotated genes.
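
The effect of the two parameters can be illustrated with a simple filter. This Python sketch is an assumption-laden illustration, not the zPicture algorithm: aligned regions are given as hypothetical (start, end, %ID) tuples, the default thresholds are arbitrary, and find_ecrs and overlapping_genes are made-up names.

```python
def find_ecrs(regions, min_length=100, min_identity=70.0):
    """Keep aligned regions that meet both the length and %ID thresholds."""
    return [
        (start, end, identity)
        for start, end, identity in regions
        if end - start + 1 >= min_length and identity >= min_identity
    ]


def overlapping_genes(ecr, genes):
    """Names of annotated genes whose interval overlaps the given ECR."""
    start, end, _identity = ecr
    return [name for g_start, g_end, name in genes
            if start <= g_end and g_start <= end]


# Made-up aligned regions: (start, end, percent identity).
regions = [(100, 250, 82.0), (400, 450, 95.0), (600, 800, 60.0)]
genes = [(100, 200, "BTNL2"), (400, 500, "ER3")]

for ecr in find_ecrs(regions):
    print(ecr, overlapping_genes(ecr, genes))
```

Lowering min_length to 50 in this example would also report the short, highly similar region at [400,450], which shows why the thresholds need to be tuned to the sequences being compared.
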

Blastz and Textual Alignment Files
zPicture alignments, which are created using the blastz local alignment tool, can also be viewed as text files. Two options are available: either the standard blastz output can be extracted, or the alignment can be reformatted to the standard blast-type output. All the ungapped alignments are listed separately, and summary statistics are provided for each of them.

         13710     13720     13730     13740     13750     13760     
base     ACcAACCTGAGAGAgAAAAAGTTGCgATTTTCTCCTCGCCcAAAAAGGGGaTGCTGATGG
         || ||||||||||| |||||||||| |||||||||||||| ||||||||| |||||||||
second   ACtgACCTGgaAtgaAAAtgGTTtCtAcTTgCTCtTgtCCaAAgAAGGGGtTctTcATGG
         31490     31500     31510     31520     31530     31540

Access to Input Sequences
The final table on the zPicture results web page contains links to the submitted input files and their derivatives. seqI.fa is the FASTA input file for the sequence submitted first, and seqI.txt is the repeat-masked version of the seqI.fa file. annoI.txt corresponds to the gene annotation of the I-th sequence, while seqI.reps is the RepeatMasker output based on the seqI.fa sequence. In cases of special masking, the seqI.reps file is emulated by zPicture to be in the same format as an original RepeatMasker output file.

Visualization Legend

Questions or Suggestions?
Please contact <dcode@ncbi.nlm.nih.gov> if you have any questions or suggestions.