HISAT2

To get gene counts, gene symbols, and gene lengths from the given paired-end RNA-Seq data, you’d typically follow the end-to-end process of quality control, read alignment, read counting, and annotation. Here’s a detailed step-by-step guide, which assumes you already have the necessary software tools installed on your Linux system:

Step 1: Quality Control with FastQC

First, run FastQC on your FASTQ files to check the quality of your sequencing reads.

fastqc *.fastq.gz

Once complete, you can inspect the FastQC reports (html files) to assess if further trimming or filtering is needed.

Step 2: Alignment with HISAT2

Align the paired-end reads to the reference genome using HISAT2. Here we’ll align one pair of reads as an example:

hisat2 -x /path/to/human_genome_index \
-1 H20_G534_S91_L004_R1_001_trimmed.fastq.gz \
-2 H20_G534_S91_L004_R2_001_trimmed.fastq.gz \
-S H20_G534_aligned.sam

You would repeat this for each of your FASTQ file pairs.

Step 3: Converting SAM to BAM and Sorting

Convert the SAM file to BAM and sort it for efficient downstream processing.

samtools view -bS H20_G534_aligned.sam | samtools sort -o H20_G534_aligned_sorted.bam

Again, repeat this for each of your samples.

Step 4: Counting Reads with featureCounts

Count the reads aligned to genes using featureCounts. Before you run featureCounts, you need a GTF file with annotations for the reference genome.

featureCounts -a /path/to/annotation.gtf \
-o gene_counts.txt \
-t exon \
-g gene_id \
-p \
--primary \
-s 2 \
-T 8 \
*.bam

In this command:

-a specifies the path to the annotation file.
-o specifies the output file name.
-t specifies the feature type to count, typically exons for gene-level quantification.
-g specifies the attribute in the GTF to use for summarizing counts, usually the gene ID.
-p counts fragments instead of individual reads, appropriate for paired-end data.
--primary ensures that only primary alignments are counted.
-s specifies the strand-specific counting; ‘2’ typically for dUTP methods.
-T is the number of threads to use, could be adjusted depending on your CPU.
*.bam indicates using all BAM files in the current directory.

Step 5: Annotation (Gene Symbols and Lengths)

The gene_counts.txt file will not contain gene symbols and lengths by default, so we need to join this information with our count data. Gene symbols and lengths can be extracted from the GTF file, followed by merging of this information with the output of featureCounts.

awk '$3 == "exon"' /path/to/annotation.gtf | \
awk '{print $10,$12,$4,$5}' | \
sed 's/";"/\t/g' | \
sed 's/"//g' | \
sort -k1,1 -u > genes_info.txt

join -1 1 -2 1 -o 1.1,2.2,2.3,1.2 \
<(sort -k1,1 gene_counts.txt) \
<(sort -k1,1 genes_info.txt) \
> final_gene_counts_with_info.txt

In these commands:

The first awk selects only exon features from the GTF.
The second awk prints the gene_id, gene_name, start, and end positions.
sed is used to clean and format the data.
sort -k1,1 -u sorts on gene_id and ensures unique entries.
join combines the gene counts with gene info based on gene_id.

This final_gene_counts_with_info.txt file should now contain your gene symbol and lengths along with their respective counts.

Important:

Replace /path/to/human_genome_index, /path/to/annotation.gtf, and sample names with your actual paths and sample names.
Ensure the chromosome naming convention in your FASTQ file matches that of the GTF file.
Check any software documentation for different versions or parameter specifications.

This is a comprehensive guide but each dataset may require specific attention. All the alignment and counting jobs might take a considerable amount of time and computational resources, depending on the size of the dataset.

Step 1: Quality Control with FastQC

Step 2: Alignment with HISAT2

Step 3: Converting SAM to BAM and Sorting

Step 4: Counting Reads with featureCounts

Step 5: Annotation (Gene Symbols and Lengths)

Be First to Comment

Leave a Reply Cancel reply