Simplesam

Simple pure Python SAM parser and objects for working with SAM records

Classes to handle alignments in the SAM format.

Reader -> Sam -> Writer

class simplesam.DefaultOrderedDict(default, items=[])[source]

__init__(default, items=[])[source]

Initialize self. See help(type(self)) for accurate signature.
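The behavior of this class is not described here. Judging only by its name and signature, it may combine an ordered mapping with a default factory. The following is a minimal sketch under that assumption; the class name DefaultOrderedDictSketch is hypothetical and not part of simplesam.

```python
from collections import OrderedDict


class DefaultOrderedDictSketch(OrderedDict):
    """Hypothetical stand-in: an OrderedDict that fills in missing keys."""

    def __init__(self, default, items=()):
        super().__init__(items)
        self.default = default  # zero-argument factory such as list or int

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when the key is absent.
        value = self[key] = self.default()
        return value


d = DefaultOrderedDictSketch(list, [("@HD", ["VN:1.0"])])
d["@SQ"].append("SN:chr1")  # the missing key is created on first access
print(list(d.keys()))
```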

class simplesam.Reader(f, regions=False, kind=None, samtools_path='samtools')[source]

Read SAM/BAM format file as an iterable.

__init__(f, regions=False, kind=None, samtools_path='samtools')[source]

Initialize self. See help(type(self)) for accurate signature.

__len__()[source]

Returns the number of reads in an indexed BAM file. Not implemented for SAM files.

header_as_dict(header)[source]

Parse the header list and return a nested dictionary.
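The exact nesting that header_as_dict returns is not specified here. As a rough illustration of the idea, the sketch below groups tab-separated header lines by record type and splits each field into a TAG:VALUE pair. The function name and the chosen layout (record type mapped to a list of dictionaries) are assumptions, not the library's format.

```python
from collections import OrderedDict


def header_as_dict_sketch(header_lines):
    """Group SAM header lines by record type (the layout is an assumption)."""
    parsed = OrderedDict()
    for line in header_lines:
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            # Comment lines carry free text rather than TAG:VALUE pairs.
            entry = "\t".join(fields)
        else:
            entry = OrderedDict(f.split(":", 1) for f in fields)
        parsed.setdefault(record_type, []).append(entry)
    return parsed


header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:500",
]
parsed = header_as_dict_sketch(header)
print([sq["SN"] for sq in parsed["@SQ"]])  # sequence names from the @SQ lines
```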

next()[source]

Returns the next Sam object

seqs

Return just the sequence names from the @SQ header lines as a generator.

subsample(n)[source]

Returns an iterator that draws every nth read from the input file. Yields Sam objects.
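Drawing every nth item from a stream can be sketched with itertools.islice. The helper name every_nth is hypothetical, and whether simplesam starts counting at the first read is an assumption.

```python
from itertools import islice


def every_nth(records, n):
    """Return an iterator over every nth item, starting with the first."""
    return islice(records, 0, None, n)


# Any iterable works; integers stand in for Sam records here.
print(list(every_nth(range(10), 3)))  # [0, 3, 6, 9]
```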

tile_genome(width)[source]

Return a generator of UCSC-style regions that tile the genome in windows of width bases.

class simplesam.Sam(qname='', flag=4, rname='*', pos=0, mapq=255, cigar='*', rnext='*', pnext=0, tlen=0, seq='*', qual='*', tags=[])[source]

Object representation of a SAM entry.

__getitem__(tag)[source]

Retrieves the value of the SAM tag named “tag”. The value is converted to the proper Python object type according to the tag's type code.

>>> x = Sam(tags=['NM:i:0', 'ZZ:Z:xyz'])
>>> x['NM']
0
>>> x['ZZ']
'xyz'

__init__(qname='', flag=4, rname='*', pos=0, mapq=255, cigar='*', rnext='*', pnext=0, tlen=0, seq='*', qual='*', tags=[])[source]

Initialize self. See help(type(self)) for accurate signature.

__len__()[source]

Returns the length of the portion of self.seq aligned to the reference. Unaligned reads will have len() == 0. Insertions (I) and soft-clipped portions (S) will not contribute to the aligned length.

>>> x = Sam(cigar='8M2I4M1D3M4S')
>>> len(x)
16
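The length rule above can be sketched without the library by summing only the CIGAR operations that consume reference bases. The function name aligned_length is hypothetical. Counting N, = and X alongside M and D follows the SAM specification and is an assumption about simplesam.

```python
import re

# Operations assumed to consume reference bases (per the SAM specification).
REFERENCE_OPS = set("MDN=X")


def aligned_length(cigar):
    """Sum the lengths of the reference-consuming CIGAR operations."""
    if cigar == "*":  # unaligned reads carry no CIGAR
        return 0
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in pairs if op in REFERENCE_OPS)


print(aligned_length("8M2I4M1D3M4S"))  # 8 + 4 + 1 + 3 = 16
```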
__repr__()[source]

Return repr(self).

__setitem__(tag, data)[source]

Stores the SAM tag named “tag” with the value “data”. The data type of the tag is interpreted from the Python object type.

>>> x = Sam(tags=[])
>>> x['NM'] = 0
>>> x['NM']
0

__str__()[source]

Returns the string representation of a SAM entry. Corresponds to one line in the on-disk format of a SAM file.
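To make the one-line on-disk format concrete, here is a small sketch that splits a SAM line into its eleven mandatory tab-separated fields plus optional tags, then joins them back. The field names mirror the Sam constructor arguments. The helper names parse_sam_line and format_sam_line are hypothetical and do not use simplesam itself.

```python
SAM_FIELDS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual")
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Split one SAM line into a dict of typed fields and a list of tags."""
    columns = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["tags"] = columns[11:]  # anything past column 11 is a tag
    return record


def format_sam_line(record):
    """Join the fields back into a single tab-separated line."""
    mandatory = [str(record[name]) for name in SAM_FIELDS]
    return "\t".join(mandatory + record["tags"])


line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_sam_line(line)
print(format_sam_line(record) == line)
```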

cigars

Returns the CIGAR string as a tuple of (length, operation) pairs.

>>> x = Sam(cigar='8M2I4M1D3M')
>>> x.cigars
((8, 'M'), (2, 'I'), (4, 'M'), (1, 'D'), (3, 'M'))
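A minimal sketch of this parsing step, assuming a regular expression over the standard CIGAR operation letters; parse_cigar is a hypothetical name and not necessarily how simplesam does it.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into a tuple of (length, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return tuple((int(n), op) for n, op in pairs)


print(parse_cigar("8M2I4M1D3M"))
```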
coords

Returns a range of genomic coordinates for the query sequence positions in the gapped alignment.

duplicate

Returns True if the read is a PCR or optical duplicate.

gapped(attr, gap_char='-')[source]

Return a Sam sequence attribute or tag with every deletion relative to the reference represented as ‘gap_char’ and every insertion relative to the reference removed. The attribute can be Sam.seq, Sam.qual, or any Sam tag that represents an aligned sequence, such as a methylation tag for bisulfite sequencing libraries.

>>> x = Sam(*'r001      99      ref     7       30      8M2I4M1D3M      =       37      39      TTAGATAAAGGATACTG       *'.split())
>>> x.gapped('seq')
'TTAGATAAGATA-CTG'
>>> x = Sam(*'r001      99      ref     7       30      8M2I4M1D3M      =       37      39      TTAGATAAAGGATACTG       *'.split(), tags=['ZM:Z:.........M....M.M'])
>>> x.gapped('ZM')
'............-M.M'
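The projection shown in these examples can be sketched as a walk over the CIGAR operations. The function name gapped_sketch is hypothetical. Its treatment of S, N, H and P follows the SAM specification and is an assumption about simplesam's behavior.

```python
import re


def gapped_sketch(sequence, cigar, gap_char="-"):
    """Project a query-aligned string onto reference coordinates."""
    out = []
    i = 0  # index into the query-aligned string
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "M=X":    # aligned bases: copy through
            out.append(sequence[i:i + n])
            i += n
        elif op in "IS":   # present in the read only: skip over them
            i += n
        elif op in "DN":   # present in the reference only: pad with gaps
            out.append(gap_char * n)
        # H and P consume neither sequence, so they are ignored here.
    return "".join(out)


print(gapped_sketch("TTAGATAAAGGATACTG", "8M2I4M1D3M"))
print(gapped_sketch(".........M....M.M", "8M2I4M1D3M"))
```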
index_of(pos)[source]

Return the index within the alignment that corresponds to the genomic position ‘pos’.
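The relationship between coords and index_of can be sketched as simple offset arithmetic on the alignment start. Both helper names are hypothetical, and raising IndexError for positions outside the alignment is an assumption, not documented behavior.

```python
def coords_sketch(pos, aligned_length):
    """Genomic positions covered by an alignment starting at pos."""
    return range(pos, pos + aligned_length)


def index_of_sketch(pos, aligned_length, genomic_pos):
    """Offset of genomic_pos from the alignment start."""
    if genomic_pos not in coords_sketch(pos, aligned_length):
        # Assumed behavior: reject positions outside the alignment.
        raise IndexError("position %d is outside the alignment" % genomic_pos)
    return genomic_pos - pos


# An alignment starting at position 7 that spans 16 reference bases.
coords = coords_sketch(7, 16)
print(coords[0], coords[-1])       # 7 22
print(index_of_sketch(7, 16, 10))  # 3
```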

mapped

Returns True if the read is mapped.

paired

Returns True if the read is paired and each segment properly aligned according to the aligner.

parse_md()[source]

Return the ungapped reference sequence from the MD tag, if present.
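As a rough illustration of how an MD tag encodes the reference, the sketch below rebuilds reference bases from the aligned read bases (insertions and soft clips already removed) and an MD string. The function name reference_from_md is hypothetical. Including deleted bases in the output is an assumption and may differ from what parse_md returns.

```python
import re


def reference_from_md(aligned_read, md):
    """Rebuild reference bases from aligned read bases and an MD string."""
    ref = []
    i = 0  # index into aligned_read
    tokens = re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md)
    for matches, deletion, mismatch in tokens:
        if matches:        # run of bases identical to the reference
            n = int(matches)
            ref.append(aligned_read[i:i + n])
            i += n
        elif deletion:     # reference bases that are absent from the read
            ref.append(deletion)
        else:              # mismatch: MD stores the reference base
            ref.append(mismatch)
            i += 1
    return "".join(ref)


print(reference_from_md("ACGTACGT", "3A4"))  # ACGAACGT
print(reference_from_md("ACGTAC", "3^GG3"))  # ACGGGTAC
```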

passing

Returns True if the read is passing filters, such as platform/vendor quality controls.

reverse

Returns True if Sam.seq is reverse complemented.

safename

Return Sam.qname without paired-end identifier if it exists

secondary

Returns True if the read alignment is secondary.
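The flag-derived properties (duplicate, mapped, paired, passing, reverse, secondary) correspond to bits of the SAM FLAG field. The sketch below decodes them using the bit values from the SAM specification. The function name decode_flag is hypothetical, and the mapping of each property to a bit is an assumption, not taken from the simplesam source.

```python
def decode_flag(flag):
    """Decode selected SAM FLAG bits into booleans."""
    return {
        "paired": bool(flag & 0x2),       # each segment properly aligned
        "mapped": not flag & 0x4,         # 0x4 marks an unmapped read
        "reverse": bool(flag & 0x10),     # seq is reverse complemented
        "secondary": bool(flag & 0x100),  # secondary alignment
        "passing": not flag & 0x200,      # 0x200 marks a failed QC filter
        "duplicate": bool(flag & 0x400),  # PCR or optical duplicate
    }


print(decode_flag(99))  # flag taken from the r001 example record
print(decode_flag(4))   # the default flag of Sam(): unmapped
```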

tags

Parses the tags string to a dictionary if necessary.

>>> x = Sam(tags=['XU:Z:cgttttaa', 'XB:Z:cttacgttaagagttaac', 'MD:Z:75', 'NM:i:0', 'NH:i:1', 'RG:Z:1'])
>>> sorted(x.tags.items(), key=lambda x: x[0])
[('MD', '75'), ('NH', 1), ('NM', 0), ('RG', '1'), ('XB', 'cttacgttaagagttaac'), ('XU', 'cgttttaa')]

class simplesam.Writer(f, header=None)[source]

Write SAM/BAM format file from Sam objects.

__init__(f, header=None)[source]

Initialize self. See help(type(self)) for accurate signature.

write(sam)[source]

Write the string representation of the Sam object sam.

simplesam.bam_read_count(bamfile, samtools_path='samtools')[source]

Return a tuple of the number of mapped and unmapped reads in a BAM file.

simplesam.decode_tag(tag_string)[source]

Parse a SAM format tag to a (tag, type, data) tuple. Python object types for data are set using the type code. Supported type codes are: A, i, f, Z, H, B

>>> decode_tag('YM:Z:#""9O"1@!J')
('YM', 'Z', '#""9O"1@!J')
>>> decode_tag('XS:i:5')
('XS', 'i', 5)
>>> decode_tag('XF:f:100.5')
('XF', 'f', 100.5)
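The type-code dispatch behind these results can be sketched as follows. The name decode_tag_sketch is hypothetical. Only the i and f conversions are confirmed by the examples above; leaving A, Z, H and B values as raw strings is a simplifying assumption.

```python
def decode_tag_sketch(tag_string):
    """Split TAG:TYPE:VALUE and convert VALUE according to TYPE."""
    # maxsplit=2 keeps any ':' characters inside the value intact.
    tag, type_code, value = tag_string.split(":", 2)
    if type_code == "i":
        data = int(value)
    elif type_code == "f":
        data = float(value)
    else:
        # A, Z, H and B are left as raw strings in this sketch (assumption).
        data = value
    return tag, type_code, data


print(decode_tag_sketch("XS:i:5"))
print(decode_tag_sketch("XF:f:100.5"))
```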
simplesam.encode_tag(tag, data)[source]

Write a SAM tag in the format TAG:TYPE:data. Infers the data type from the Python object type.

>>> encode_tag('YM', '#""9O"1@!J')
'YM:Z:#""9O"1@!J'
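Going the other way, type inference from the Python object can be sketched as below. The name encode_tag_sketch is hypothetical, and mapping int to i, float to f and everything else to Z is an assumption about how the inference works.

```python
def encode_tag_sketch(tag, data):
    """Format a tag as TAG:TYPE:VALUE, inferring TYPE from the value."""
    if isinstance(data, int):
        type_code = "i"
    elif isinstance(data, float):
        type_code = "f"
    else:
        type_code = "Z"  # fall back to a printable string
    return "%s:%s:%s" % (tag, type_code, data)


print(encode_tag_sketch("NM", 0))       # NM:i:0
print(encode_tag_sketch("XF", 100.5))   # XF:f:100.5
```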
simplesam.parse_sam_tags(tagfields)[source]

Return a dictionary containing the tags.

simplesam.tile_region(rname, start, end, step)[source]

Make non-overlapping tiled windows from the specified region in the UCSC-style string format.

>>> list(tile_region('chr1', 1, 250, 100))
['chr1:1-100', 'chr1:101-200', 'chr1:201-250']
>>> list(tile_region('chr1', 1, 200, 100))
['chr1:1-100', 'chr1:101-200']
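The windowing shown in these examples can be reproduced with a short generator. This is a sketch written to match the outputs above, not the library's implementation, and the name tile_region_sketch is hypothetical.

```python
def tile_region_sketch(rname, start, end, step):
    """Yield non-overlapping 1-based inclusive windows as UCSC-style strings."""
    pos = start
    while pos <= end:
        stop = min(pos + step - 1, end)  # clip the last window at end
        yield "%s:%d-%d" % (rname, pos, stop)
        pos = stop + 1


print(list(tile_region_sketch("chr1", 1, 250, 100)))
```

The final window is shortened so that it never extends past end, which is why the first example ends with chr1:201-250.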
