Loading Data¶
HiFive data is handled using the FiveCData
and HiCData
classes.
Loading 5C data¶
HiFive can load 5C data from one of two source file types.
BAM Files¶
When loading 5C data from BAM files, they should always come in pairs, one for each end of the paired-end reads. HiFive can load any number of pairs of BAM files, such as when multiple sequencing lanes have been run for a single replicate. These files do not need to be indexed or sorted. All sequence names that these files were mapped against should exactly match the primer names in the BED file used to construct the Fragment object.
Count Files¶
Counts files are tabular text files containing pairs of primer names and a count of the number of observed occurrences of that pairing.
5c_for_primer1 5c_rev_primer2 10
5c_for_primer1 5c_rev_primer4 3
5c_for_primer3 5c_rev_primer4 18
Loading HiC Data¶
HiFive can load HiC data from three different types of source files.
BAM Files¶
When loading HiC data from BAM files, they should always come in pairs, one for each end of the paired-end reads. HiFive can load any number of pairs of BAM files, such as when multiple sequencing lanes have been run for a single replicate. These files do not need to be indexed or sorted. For faster loading, especially with very large numbers of reads, it is helpful to parse out single-mapped reads to reduce the number of reads that HiFive needs to traverse in reading the BAM files.
RAW Files¶
RAW files are tabular text files containing pairs of read coordinates from mapped reads containing the chromosome, coordinate, and strand for each read end. HiFive can load any number of RAW files into a single HiC Data object.
chr1 30002023 + chr3 4020235 -
chr5 9326220 - chr1 3576222 +
chr8 1295363 + chr6 11040321 +
MAT Files¶
MAT files are in a tabular text format previously defined for HiCPipe. This format consists of a pair of fend indices and a count of observed occurrences of that pairing. These indices must match those associated with the Fend object used when loading the data. Thus it is wise when using this format to also create the Fend object from a HiCPipe-style fend file to ensure accurate fend-count association.
fend1 fend2 count
1 4 10
1 10 5
1 13 1
Note
In order to maintain compatibility with HiCPipe, both tabular fend files and MAT files are 1-indexed, rather than the standard 0-indexed used everywhere else with HiFive.
MATRIX Files¶
HiC data may be loaded from matrix files in one of three formats: HDF5, NPZ, or TXT.
HDF5 Matrix Files¶
The data file format is inferred from the file extension (‘.hdf5’). Heatmap HDF5 files generated by HiFive are compatible with loading. HiFive expects one numpy array per chromosome and chromosome pair (if trans data is included) and will search for files for corresponding to every chromosome and chromosome pair. Acceptable matrix names are ‘*’, ‘.counts’, ‘.observed’, ‘chr*’, ‘chr*.counts’, and ‘chr*.observed’ where ‘*’ is either a chromosome name or pair of chromsome names separated by ‘_by_’ for trans interactions. Cis-interaction matrices may be in either square or upper-triangular matrices. Positions may be given by matrices names ‘.positions’ or ‘chr.positions’ and contain two columns, starting and ending positions for each bin for a chromosome. If no positions are given, then bins are assumed to correspond to those found in the fend file associated with the data object. Further, if the matrix is one-dimensional and the attribute ‘diagonal’ is included and equals ‘True’, the upper triangular matrix is assumed to include the diagonal (self-interacting bins). Otherwise it is assumed to be absent.
NPZ Matrix Files¶
The data file format is inferred from the file extension (‘.npz’). Heatmap NPZ files generated by HiFive are compatible with loading. HiFive expects one numpy array per chromosome and chromosome pair (if trans data is included) and will search for files for corresponding to every chromosome and chromosome pair. Acceptable matrix names are ‘*’, ‘.counts’, ‘.observed’, ‘chr*’, ‘chr*.counts’, and ‘chr*.observed’ where ‘*’ is either a chromosome name or pair of chromsome names separated by ‘_by_’ for trans interactions. Cis-interaction matrices may be in either square or upper-triangular matrices. Positions may be given by matrices names ‘.positions’ or ‘chr.positions’ and contain two columns, starting and ending positions for each bin for a chromosome. If no positions are given, then bins are assumed to correspond to those found in the fend file associated with the data object. Further, if the matrix is one-dimensional, the upper triangular matrix is assumed to include the diagonal (self-interacting bins).
TXT Matrix Files¶
Text matrix files are inferred from the presence of the ‘*’ character in the filename. A generic format filename with the chromosome name or chromosome pair should be passed and all chromosomes and pairs will be searched, replacing the ‘*’ with the appropriate name (e.g. 40Kb_counts_*.matrix). Text matrix files are tab-separated files that contain a rectangular matrix of values corresponding to binned read counts. These files can contain labels with the first line containing a tab followed by a tab-separated list of bin labels and each subsequent line containing a label followed by bin values. Labels should be in a format such that the bin position occurs after the ‘|’ character and in the form chrX:XXXX-XXXX (e.g. interval1|myexpriment|chr3:1000000-1040000). If no labels are provided, bins are assumed to be identical to the partitioning in the associated Fend object and starting with the first bin for the associated chromosome(s). Labeled matrices need not include all rows or columns for a given paritioning. Values falling outside of bins are discarded.
Note
In order to pass the filename format with the ‘*’ character, you must enclose the name in quotation marks (e.g. -X “your_name_*.matrix”).