- .TH tabix 1 "11 May 2010" "tabix-0.2.0" "Bioinformatics tools"
- .SH NAME
- .PP
- bgzip - Block compression/decompression utility
- .PP
- tabix - Generic indexer for TAB-delimited genome position files
- .SH SYNOPSIS
- .PP
- .B bgzip
- .RB [ \-cdhB ]
- .RB [ \-b
- .IR virtualOffset ]
- .RB [ \-s
- .IR size ]
- .RI [ file ]
- .PP
- .B tabix
- .RB [ \-0lf ]
- .RB [ \-p
- .IR gff|bed|sam|vcf ]
- .RB [ \-s
- .IR seqCol ]
- .RB [ \-b
- .IR begCol ]
- .RB [ \-e
- .IR endCol ]
- .RB [ \-S
- .IR lineSkip ]
- .RB [ \-c
- .IR metaChar ]
- .I in.tab.bgz
- .RI [ "region1 " [ "region2 " [ ... "]]]"
- .SH DESCRIPTION
- .PP
- Tabix indexes a TAB-delimited genome position file
- .I in.tab.bgz
- and creates an index file
- .I in.tab.bgz.tbi
- when
- .I region
- is absent from the command-line. The input data file must be position
- sorted and compressed by
- .B bgzip
- which has a
- .BR gzip (1)-like
- interface. After indexing, tabix can quickly retrieve data
- lines overlapping
- .I regions
- specified in the format "chr:beginPos-endPos". Fast data retrieval also
- works over a network if a URI is given as the file name; in this case the
- index file will be downloaded if it is not present locally.
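- .PP
- As a rough illustration of the region syntax above, the following Python
- sketch splits a "chr:beginPos-endPos" string into its parts. The helper
- parse_region is hypothetical and not part of tabix. It assumes that commas
- in positions are only thousands separators, as in the EXAMPLE section;
- the parser inside tabix may accept other forms.
- .nf
```python
def parse_region(region):
    """Split 'chr:beginPos-endPos' into (name, begin, end).

    Hypothetical helper for illustration only; not tabix's own parser.
    """
    # Split on the last colon so the sequence name is everything before it.
    name, colon, span = region.rpartition(":")
    if not colon or not name:
        raise ValueError("expected chr:beginPos-endPos, got %r" % region)

    begin_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError("expected chr:beginPos-endPos, got %r" % region)

    # Assumption: commas are thousands separators and can be dropped.
    begin = int(begin_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if begin > end:
        raise ValueError("begin position exceeds end position")
    return name, begin, end


print(parse_region("chr1:10,000,000-20,000,000"))
# ('chr1', 10000000, 20000000)
```
- .fi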
- .SH OPTIONS OF TABIX
- .TP 10
- .BI "-p " STR
- Input format for indexing. Valid values are: gff, bed, sam, vcf and
- psltab. This option should not be applied together with any of
- .BR \-s ", " \-b ", " \-e ", " \-c " and " \-0 ;
- it is not used for data retrieval because this setting is stored in
- the index file. [gff]
- .TP
- .BI "-s " INT
- Column of sequence name. Options
- .BR \-s ", " \-b ", " \-e ", " \-S ", " \-c " and " \-0
- are all stored in the index file and thus not used in data retrieval. [1]
- .TP
- .BI "-b " INT
- Column of start chromosomal position. [4]
- .TP
- .BI "-e " INT
- Column of end chromosomal position. The end column can be the same as the
- start column. [5]
- .TP
- .BI "-S " INT
- Skip first INT lines in the data file. [0]
- .TP
- .BI "-c " CHAR
- Skip lines starting with the character CHAR. [#]
- .TP
- .B -0
- Specify that the position in the data file is 0-based (e.g. UCSC files)
- rather than 1-based.
- .TP
- .B -h
- Print the header/meta lines.
- .TP
- .B -B
- The second argument is a BED file. When this option is in use, the input
- file need not be sorted or indexed. The entire input will be read sequentially. Nonetheless,
- with this option, the format of the input must be specified correctly on the command line.
- .TP
- .B -f
- Force overwriting of the index file if it is present.
- .TP
- .B -l
- List the sequence names stored in the index file.
- .RE
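- .PP
- To make the coordinate convention behind the
- .B \-0
- option concrete, here is a minimal Python sketch. It assumes the UCSC
- convention of 0-based, half-open intervals and converts them to 1-based,
- inclusive intervals before testing for overlap. The function names
- to_one_based and overlaps are hypothetical and are not taken from the
- tabix source.
- .nf
```python
def to_one_based(start, end, zero_based=False):
    """Return a 1-based, inclusive (begin, end) pair.

    Assumption: with zero_based=True the input is a 0-based, half-open
    interval (UCSC/BED style), so only the start needs shifting.
    """
    if zero_based:
        return start + 1, end
    return start, end


def overlaps(feature, query):
    """True if two 1-based, inclusive intervals share at least one position."""
    return feature[0] <= query[1] and query[0] <= feature[1]


# A BED-style record 99..200 covers bases 100 through 200.
bed_feature = to_one_based(99, 200, zero_based=True)
print(bed_feature)                         # (100, 200)
print(overlaps(bed_feature, (200, 300)))   # True
print(overlaps(bed_feature, (201, 300)))   # False
```
- .fi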
- .SH EXAMPLE
- (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;
- tabix -p gff sorted.gff.gz;
- tabix sorted.gff.gz chr1:10,000,000-20,000,000;
- .SH NOTES
- It is straightforward to achieve overlap queries using the standard
- B-tree index (with or without binning) implemented in all SQL databases,
- or the R-tree index in PostgreSQL and Oracle. But there are still many
- reasons to use tabix. Firstly, tabix works directly with many widely
- used TAB-delimited formats such as GFF/GTF and BED. We do not need to
- design database schema or specialized binary formats. Data do not need
- to be duplicated in different formats, either. Secondly, tabix works on
- compressed data files while most SQL databases do not. The GenCode
- annotation GTF can be compressed down to 4% of its original size. Thirdly, tabix is
- fast. The same indexing algorithm is known to work efficiently for an
- alignment with a few billion short reads. SQL databases probably cannot
- easily handle data at this scale. Last but not least, tabix supports
- remote data retrieval. One can put the data file and the index at an FTP
- or HTTP server, and other users or even web services will be able to get
- a slice without downloading the entire file.
- .SH AUTHOR
- .PP
- Tabix was written by Heng Li. The BGZF library was originally
- implemented by Bob Handsaker and modified by Heng Li for remote file
- access and in-memory caching.
- .SH SEE ALSO
- .PP
- .BR samtools (1)