# Variables to be set manually; they are shared by the Slurm job and its outputs
genome="Mo12_2014" # Genome ID
fr_ln="150" # Fragment length used for sequencing
autosome_sum_length="2937639396" # Sum of the lengths of all autosomes in the reference (third column of the Data/References/hg38/hg38.ungapped.lengths file)
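# A minimal sketch of how autosome_sum_length could be derived rather than typed by hand.
# Assumptions (not confirmed by this script): the lengths file is whitespace-delimited,
# column 1 holds the chromosome name (with or without a "chr" prefix) and column 3 holds
# the ungapped length. sum_autosome_lengths is a hypothetical helper name.
sum_autosome_lengths() {
    # $1: path to the lengths file
    awk '{
        name = $1
        sub(/^chr/, "", name)   # accept both "chr1" and "1"
        # keep only chromosomes 1 to 22; X, Y and unplaced contigs are skipped
        if (name ~ /^[0-9]+$/ && name + 0 >= 1 && name + 0 <= 22) total += $3
    } END { printf "%.0f\n", total }' "$1"
}
# Possible usage (not executed here):
# autosome_sum_length="$(sum_autosome_lengths Data/References/hg38/hg38.ungapped.lengths)"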
echo "Performing the pipeline for $genome, sequenced with a fragment length of $fr_ln bp and mapped to the hg38 reference.
All necessary packages and programs are installed in the conda environment named \"CNVconda\"; all scripts are in the /shared/home/righettin/Scripts directory."
# SECTION A: READS MAPPING
echo "
-. .-. .-. .-. .-. .-. .
||\|||\ /|||\|||\ /|||\|||\ /|
|/ \|||\|||/ \|||\|||/ \|||\||
~ \`-~ \`-\`\`-~ \`-\`\`-~ \`-
SECTION A: READS MAPPING
In this first section, the reads of $genome are mapped to the human reference hg38 with bwa aln.
The output is then filtered."
echo "Removed reads with a percentage of identity below 90% and reads shorter than 30 bp."
# SECTION B: DATA PREPARATION
echo "
-. .-. .-. .-. .-. .-. .
||\|||\ /|||\|||\ /|||\|||\ /|
|/ \|||\|||/ \|||\|||/ \|||\||
~ \`-~ \`-\`\`-~ \`-\`\`-~ \`-
SECTION B: DATA PREPARATION
In this second section the de-novo mapping is used to perform the following steps:
0) Create all necessary directories and subdirectories
1) The BAM file is split into 24 files, one per chromosome, named with the numbers 1 to 22 or the letters X/Y. From this point on, every analysis in this section is performed per chromosome.
2) The GC bias of each chromosome is computed and each alignment is corrected accordingly.
3) The depth of coverage for each chromosome is computed.
4) The average depth of coverage for each chromosome is computed and the results are concatenated in a single file.
5) The average depth is calculated excluding the blacklisted regions.
6) A .txt file containing the average depth of coverage of the genome, taking into account all autosomes, is generated.
7) A gzipped file containing the depth of coverage of each base of the protein-coding genes defined in the .bed reference is obtained. This will be the main input for the CNV estimation."
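# A minimal sketch of the averaging in steps 4 and 6 listed above, under assumptions this
# script does not confirm: the per-base depth is a tab-separated, three-column table
# (chromosome, position, depth) such as the one "samtools depth" prints, chromosome names
# carry no "chr" prefix, and every position to be counted is listed.
# Both function names are hypothetical.
mean_depth_per_chrom() {
    # stdin: chromosome <TAB> position <TAB> depth
    # stdout: chromosome <TAB> mean depth over the listed positions
    awk -F'\t' '{ sum[$1] += $3; n[$1]++ }
        END { for (c in sum) printf "%s\t%.4f\n", c, sum[c] / n[c] }' | sort -k1,1
}
genome_average_depth() {
    # $1: sum of the autosome lengths; stdin: the same three-column table
    # Divides the depth summed over the numbered chromosomes by the total autosome length.
    awk -F'\t' -v len="$1" '$1 ~ /^[0-9]+$/ { total += $3 }
        END { if (len > 0) printf "%.4f\n", total / len }'
}
# Possible usage (not executed here; depth.tsv is a placeholder file name):
# genome_average_depth "$autosome_sum_length" < depth.tsv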
# 0) Create all necessary directories and subdirectories
echo "0) Creating all necessary directories and subdirectories:"
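# A minimal sketch of an idempotent directory setup. The directory layout is not shown in
# this script, so the base path and subdirectory names are passed in as arguments;
# create_dirs is a hypothetical helper name.
create_dirs() {
    # $1: base directory; remaining arguments: subdirectory names
    local base="$1"
    shift
    mkdir -p -- "$base" || return 1
    local sub
    for sub in "$@"; do
        # -p creates missing parents and does not fail if the directory already exists
        mkdir -p -- "$base/$sub" || return 1
    done
}
# Possible usage (not executed here; the subdirectory names are placeholders):
# create_dirs "/shared/home/righetti/Analyses/Data_preparation/$genome" subdir_a subdir_b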
zcat "$starting_gz" | sed 's/chr//g' | gzip -c > "$ending_gz" # Remove "chr" from the file
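# A hedged alternative sketch: the sed expression above removes every occurrence of "chr",
# including any inside gene names or other fields. Assuming the chromosome name is the
# first field of each line (an assumption, not confirmed by this script), anchoring the
# pattern strips only the leading prefix. strip_chr_prefix is a hypothetical helper name.
strip_chr_prefix() {
    # stdin: gzipped text; stdout: gzipped text with a leading "chr" removed from each line
    gzip -dc | sed 's/^chr//' | gzip -c
}
# Possible usage (not executed here):
# strip_chr_prefix < "$starting_gz" > "$ending_gz"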
echo "Coverage of protein-coding genes computed and stored in /shared/home/righetti/Analyses/Data_preparation/$genome/$genome.hg38.genes.protein_coding.coverage.gz."