Usage

Program options

AlphaPeel accepts several command-line options to control the program’s behaviour. To view a list of all supported options, run any of the following: AlphaPeel, AlphaPeel -h, AlphaPeel -help, or AlphaPeel --help.

You can check the version of the program with AlphaPeel -version. Remember to use the documentation that matches the version of the program you are using. For example, the documentation for v1.3.0 is at https://alphapeel.readthedocs.io/en/v1.3.0/index.html.

Input options

Individuals:
  -ped_file [PED_FILE ...]
                      Pedigree file(s)
                      (see format details).
  -geno_file [GENO_FILE ...]
                      Genotype file(s)
                      (see format details).
  -seq_file [SEQ_FILE ...]
                      Sequence allele read count file(s)
                      (see format details).
  -hap_file [HAP_FILE ...]
                      Haplotype file(s)
                      (see format details).
  -x_chr              Indicate that input data is for the X chromosome.
  -pheno_file [PHENO_FILE ...]
                      Phenotype file(s)
                      (see format details).
  -plink_file [PLINK_FILE ...]
                      Plink (binary) file(s)
                      (see format details).
  -phased_geno_prob_file [PHASED_GENO_PROB_FILE ...]
                      Optional external phased genotype probability file(s)
                      (see format details).
                      This will provide the starting internal genotype probability state.

Markers:
  -map_file MAP_FILE  Map file for loci in genomic data files
                      (see format details).
  -start_snp START_SNP
                      The first locus to consider.
                      Counting starts at 1.
                      Default: 1.
  -stop_snp STOP_SNP  The last locus to consider.
                      Default: all loci in input files.

Model parameters and other:
  -alt_allele_prob_file [ALT_ALLELE_PROB_FILE ...]
                      Alternative allele probability file
                      (see format details).
                      Default: 0.5 for each locus.
  -main_metafounder MAIN_METAFOUNDER
                      ID used to represent the base population of the pedigree
                      (metafounder / unknown parent group)
                      (see format details).
                      Default: MF_1.
  -rec_length REC_LENGTH
                      Recombination length of the chromosome in Morgans.
                      Default: 1.00.
  -mut_prob MUT_PROB
                      Mutation probability.
                      Default: 1e-8.
  -geno_error_prob GENO_ERROR_PROB
                      Genotype error probability.
                      Default: 0.0001.
  -seq_error_prob SEQ_ERROR_PROB
                      Sequence error probability.
                      Must not be 0.
                      Default: 0.001.
  -pheno_penetrance_prob_file
                      [PHENO_PENETRANCE_PROB_FILE ...]
                      Phenotype penetrance probability file
                      (see format details).

AlphaPeel requires a pedigree file (-ped_file) and one or more genomic data files to run the analysis.

AlphaPeel supports the following genomic data files: genotype file in the AlphaGenes format (-geno_file), sequence allele read count file in the AlphaGenes format (-seq_file), haplotype file in the AlphaGenes format (-hap_file), and binary Plink file (-plink_file).
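
As a minimal sketch of the smallest useful run (a pedigree file plus one genomic data file), the snippet below assembles such a command line in Python. The file names and the output prefix are hypothetical placeholders; only the option names are taken from the option list above.

```python
import shlex

# Hypothetical file names and prefix; only the option names come from
# the documentation above.
command = [
    "AlphaPeel",
    "-ped_file", "pedigree.txt",
    "-geno_file", "genotypes.txt",
    "-out_file", "run1",
]

# shlex.join quotes arguments where needed, which gives a command line
# that can be pasted into a shell.
command_line = shlex.join(command)
print(command_line)
```

The same list could be passed to subprocess.run once AlphaPeel is installed and the files exist.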

Note

Internally, the haplotypes provided with -hap_file are used to fill in missing genotype information, and the phasing information is not preserved. If you want to input phased data, consider using -phased_geno_prob_file instead.

AlphaPeel supports genotype probability input in the AlphaGenes format (-phased_geno_prob_file). The file given with -phased_geno_prob_file is assumed to contain error-free phased genotype probabilities. Hence, when using -phased_geno_prob_file, you should not combine it with the options -est_geno_error_prob or -est_seq_error_prob. For now, -phased_geno_prob_file cannot be used together with -x_chr.

When -x_chr is used, the pedigree file and genotype file have specific requirements. Follow the links to file formats for more details.

AlphaPeel also supports phenotype files (-pheno_file) and corresponding phenotype penetrance files (-pheno_penetrance_prob_file). Both must be provided to work with phenotypes.

The input options in the form of -opt [XYZ ...] can accept more than one argument separated by spaces.
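
To illustrate how options of the form -opt [XYZ ...] collect several space-separated values, here is a small sketch with Python's argparse. The parser below is a hypothetical stand-in written for this example, not AlphaPeel's own parser, and the file names are placeholders.

```python
import argparse

# Hypothetical stand-in parser that mimics the option style shown above.
parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("-geno_file", nargs="*", default=None)  # zero or more values
parser.add_argument("-map_file", default=None)              # exactly one value

args = parser.parse_args(
    ["-geno_file", "chip_a.txt", "chip_b.txt", "-map_file", "map.txt"]
)
print(args.geno_file)  # list with both genotype files
print(args.map_file)   # single string
```

Values are collected until the next option starts, so the two genotype files end up in one list while -map_file keeps its single value.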

Use -start_snp and -stop_snp to run the analysis only on a subset of markers.
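
The sketch below shows how a 1-based, inclusive range such as the one described for -start_snp and -stop_snp maps onto a Python slice. The function name select_loci and the marker names are hypothetical, and treating -stop_snp as inclusive is an assumption based on its description as "the last locus to consider".

```python
def select_loci(loci, start_snp=1, stop_snp=None):
    """Return loci start_snp..stop_snp, counting from 1 and including both ends."""
    if stop_snp is None:
        stop_snp = len(loci)  # default: all loci up to the last one
    if not 1 <= start_snp <= stop_snp <= len(loci):
        raise ValueError("expected 1 <= start_snp <= stop_snp <= number of loci")
    # Shift the 1-based start to Python's 0-based index; the inclusive
    # stop needs no shift because slice ends are exclusive.
    return loci[start_snp - 1:stop_snp]


markers = ["snp1", "snp2", "snp3", "snp4", "snp5"]
subset = select_loci(markers, start_snp=2, stop_snp=4)
print(subset)
```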

Note

We use the terms “marker(s)”, “locus/loci”, “SNP(s)”, and “site(s)” interchangeably to refer to a specific position in the genome where we typically observe polymorphism in the population.

AlphaPeel supports specifying a number of model parameters, which are probabilities of different events or outcomes: alternative allele probabilities in the base population(s) (-alt_allele_prob_file), recombination length of the chromosome (-rec_length), mutation probability (-mut_prob), genotype error probability (-geno_error_prob), sequence error probability (-seq_error_prob), and phenotype penetrance probabilities (-pheno_penetrance_prob_file).

The option -rec_length provides the software with the recombination length of the input chromosome. AlphaPeel assumes that recombination is equally likely anywhere along the chromosome. Therefore, the recombination probability between two loci is calculated by multiplying the value of -rec_length by the distance between the physical positions of the two loci relative to the total length of the chromosome.
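
One reading of the uniform-recombination assumption is that -rec_length is scaled by the fraction of the chromosome that lies between two loci. The sketch below follows that reading; the function name, the base-pair positions, and the use of the scaled value directly as a probability (with no mapping function) are assumptions made for illustration, not AlphaPeel's actual code.

```python
def recombination_probability(pos_a, pos_b, chrom_length, rec_length=1.0):
    """Scale rec_length (in Morgans) by the relative distance between two loci."""
    if chrom_length <= 0:
        raise ValueError("chrom_length must be positive")
    # Fraction of the chromosome spanned by the two physical positions.
    fraction = abs(pos_b - pos_a) / chrom_length
    return rec_length * fraction


# Two loci 1 Mb apart on a hypothetical 100 Mb chromosome of length 1 Morgan.
prob = recombination_probability(1_000_000, 2_000_000, 100_000_000, rec_length=1.0)
print(prob)
```

Under these assumptions, loci that are closer together get a smaller recombination probability, and doubling -rec_length doubles the value for every pair of loci.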

Note

The accuracy of AlphaPeel results has been shown to be quite robust to deviations in most of the model parameters, so you may not need to change them or estimate them from the input data, at least unless large deviations from the defaults are known or expected. Having said this, do explore what works best for your data and your aims!

Note

We use the term “probability” to also represent the commonly used terms “frequency” and “rate”, to refer to the same concept of a value between 0 and 1 that quantifies the likelihood or proportion of a certain event or outcome.

Output options

Individuals:
  -no_dosage            Suppress default output of allele dosages.
  -geno                 Call and output genotypes.
                        The default genotype calling threshold is set to 1/3.
  -geno_threshold [GENO_THRESHOLD ...]
                        Custom genotype calling threshold(s) from the genotype probabilities.
                        Multiple space separated values allowed.
                        Value(s) less than 1/3 are replaced by 1/3.
  -geno_prob            Output genotype probabilities.
  -phased_geno_prob     Output phased genotype probabilities.
  -hap                  Call and output haplotypes.
                        The default haplotype calling threshold is set to 1/2.
  -hap_threshold [HAP_THRESHOLD ...]
                        Custom haplotype calling threshold(s) from the
                        phased genotype probabilities.
                        Multiple space separated values allowed.
                        Value(s) less than 1/2 are replaced by 1/2.
  -seg_prob             Output segregation probabilities.
  -pheno_prob           Output phenotype probabilities.
  -alt_allele_prob      Output alternative allele probabilities.
                        Output initial value 0.5 if none of est_start_alt_allele_prob,
                        est_alt_allele_prob, or alt_allele_prob_file is used.
  -pheno_penetrance_prob
                        Output phenotype penetrance probabilities.

Prefix, order, and IO:
  -out_file PREFIX      The output file prefix. All file outputs will be named
                        as "PREFIX.OUTPUT.txt", where "OUTPUT" is the type of output
                        (for example, "dosage" and "geno_prob").
  -out_id_order OUT_ID_ORDER
                        Determines the order of individuals in the output
                        file based on their order in the
                        corresponding input file. Individuals not in the input
                        file are placed at the end of the file and sorted in
                        alphanumeric order. These individuals can be suppressed
                        with the -out_id_only option. Accepted arguments for
                        this option are: id, pedigree, genotypes, sequence, and
                        segregation.
                        Default: id.
  -out_id_only          Suppress output for individuals not present in
                        the file specified with -out_id_order. It also suppresses
                        "dummy" individuals.
  -out_digits           Specify the number of digits to round the outputs.
                        Does not apply to outputs from alt_allele_prob,
                        geno_error_prob, seq_error_prob,
                        and pheno_penetrance.
                        Default: 4.

AlphaPeel by default produces a dosage file. Additional individual-level outputs can be requested with the options described above.

The -geno_threshold and -hap_threshold options control which genotypes and haplotypes are called, respectively. A threshold of 0.9 gives a call only if the probability of one genotype (or haplotype allele) is higher than 0.9. When the probability is lower than the threshold, the output is set to 9 (missing). Using a higher threshold increases the accuracy of the called genotypes (or haplotypes), but results in fewer calls. Since there are three genotype states and two haplotype states, “best-guess” genotypes and haplotypes are called with the minimum thresholds of 1/3 and 1/2, respectively, which are also the defaults; smaller values are replaced by these minimums.
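
The thresholding rule for genotypes can be sketched as follows. The function name call_genotype and the example probabilities are hypothetical; the missing code 9, the "higher than" comparison, and the 1/3 floor follow the text and option descriptions above, while the handling of exact ties is an assumption.

```python
MISSING = 9  # code used for an uncalled genotype


def call_genotype(geno_probs, threshold=1 / 3):
    """Call the most probable of three genotypes (0, 1, 2) or return MISSING."""
    # Thresholds below 1/3 are replaced by 1/3, as for -geno_threshold.
    threshold = max(threshold, 1 / 3)
    best = max(range(3), key=lambda g: geno_probs[g])
    # A call is made only when the best probability is higher than the threshold.
    return best if geno_probs[best] > threshold else MISSING


probs = [0.05, 0.15, 0.80]
print(call_genotype(probs, threshold=0.9))  # too uncertain for a strict threshold
print(call_genotype(probs, threshold=0.5))  # confident enough for a lenient one
```

The haplotype case would work the same way with two allele states and a floor of 1/2.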

We round the thresholds in the output filenames to three digits.

The output order of individuals can be changed using the -out_id_order option, with additional control provided with the -out_id_only option. The latter option is not recommended for hybrid peeling or any combination of different input files.
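
The ordering rule described for -out_id_order and -out_id_only can be sketched as below. The function name, the example IDs, and the use of plain string sorting for "alphanumeric order" are assumptions for illustration; the handling of "dummy" individuals is not modelled.

```python
def order_output_ids(all_ids, input_file_ids, out_id_only=False):
    """Order IDs as in the chosen input file, then append the remaining IDs sorted."""
    known = set(all_ids)
    in_file = set(input_file_ids)
    # Individuals present in the chosen input file keep that file's order.
    ordered = [i for i in input_file_ids if i in known]
    if out_id_only:
        return ordered
    # Everyone else goes to the end, sorted (plain string sort assumed here).
    rest = sorted(i for i in all_ids if i not in in_file)
    return ordered + rest


pedigree_ids = ["C", "A", "B", "D"]
genotyped_ids = ["B", "D"]
print(order_output_ids(pedigree_ids, genotyped_ids))
print(order_output_ids(pedigree_ids, genotyped_ids, out_id_only=True))
```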

Peeling methods

Strategy:
  -method METHOD        Peeling method: single or multi. Default: multi.

Single-locus options for the second stage of hybrid peeling:
  -seg_file SEG_FILE    Segregation probabilities file.
  -seg_map_file SEG_MAP_FILE
                        Map file for loci in the segregation probabilities file
                        (see format details).

AlphaPeel supports three peeling strategies: single-locus, multi-locus, and hybrid.

The single-locus peeling method does not use linkage information between loci in iterative peeling. It is fast, but not very accurate.

The multi-locus peeling method runs multi-locus iterative peeling, which uses linkage information to increase accuracy and to calculate segregation probabilities, but it is much slower than the single-locus method.

Hybrid peeling is useful when you have SNP genotypes for a limited number of markers and sequence allele read counts for a large number of loci. In this setting, you can first run the multi-locus peeling method on the SNP genotypes to estimate segregation probabilities, and then run the single-locus peeling method on the sequence allele read counts together with the segregation probabilities. In this second stage of hybrid peeling, provide:

  • a -map_file with positions for loci in the sequence allele read count data,

  • a -seg_file with segregation probabilities generated via multi-locus method, and

  • a -seg_map_file with genetic positions for loci in the segregation probabilities file.

This combination of options is not required in the standard multi-locus mode.
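
The two stages of hybrid peeling can be sketched as two command lines, assembled here in Python. All file names and prefixes are hypothetical, and the segregation file name "multi.seg_prob.txt" is an assumption derived from the "PREFIX.OUTPUT.txt" naming pattern; check the actual output of the first stage before using it.

```python
import shlex

# Stage 1: multi-locus peeling on SNP genotypes, requesting segregation
# probabilities. File names and the prefix "multi" are hypothetical.
stage1 = [
    "AlphaPeel", "-method", "multi",
    "-ped_file", "pedigree.txt",
    "-geno_file", "chip_genotypes.txt",
    "-seg_prob",
    "-out_file", "multi",
]

# Assumed name of the stage 1 segregation output (PREFIX.OUTPUT.txt pattern).
seg_file = "multi.seg_prob.txt"

# Stage 2: single-locus peeling on sequence read counts, guided by the
# segregation probabilities from stage 1.
stage2 = [
    "AlphaPeel", "-method", "single",
    "-ped_file", "pedigree.txt",
    "-seq_file", "sequence_counts.txt",
    "-map_file", "sequence_map.txt",
    "-seg_file", seg_file,
    "-seg_map_file", "chip_map.txt",
    "-out_file", "hybrid",
]

for command in (stage1, stage2):
    print(shlex.join(command))
```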

Peeling parameters

Computational parameters:
  -n_cycle N_CYCLE      Number of peeling cycles.
                        Default: 5.
  -n_thread N_THREAD
                        Maximum number of threads to use.
                        Default: 1.

Estimation of model parameters:
  -est_start_alt_allele_prob
                        Estimate from all inputted genomic data prior to peeling and
                        output alternative allele probabilities.
  -est_alt_allele_prob  Estimate after each peeling cycle and output
                        alternative allele probabilities.
  -est_geno_error_prob  Estimate after each peeling cycle and
                        output genotype error probabilities.
  -est_seq_error_prob   Estimate after each peeling cycle and
                        output sequence error probabilities.
  -est_pheno_penetrance_prob
                        Estimate after each peeling cycle
                        and output phenotype penetrance probabilities.
  -no_phase_founder     Suppress phasing a heterozygous allele
                        (if such an allele can be found) in
                        genotyped individuals without genotyped parents.

By default, AlphaPeel phases a midpoint heterozygous locus in each genotyped individual with ungenotyped parents by setting the alternative allele as paternally inherited. The option -no_phase_founder suppresses this behaviour, so that the alternative allele is treated as equally likely to be paternally or maternally inherited.

Computational effort and speed of AlphaPeel can be controlled with the number of peeling cycles (-n_cycle, increasing the number will marginally increase accuracy, but also runtime) and the number of threads (-n_thread, to reduce runtime on large datasets).

AlphaPeel can estimate the model parameters from the input data. The default or user provided input values are used as a starting point for estimation. See a note on robustness of results to these parameters.

When estimation options are used, the respective parameters are estimated after each peeling cycle and written to a file at the end. This process usually increases running time and might require additional peeling cycles to converge. The estimates are based on the inferred states of the modelled events and on the matches and mismatches between observed and inferred states.

Alternative allele probabilities are estimated (using -est_alt_allele_prob) as half of the mean of estimated allele dosage in the base population(s) (metafounders).

This estimation can be warm-started with a sample estimate from the input genomic data (using -est_start_alt_allele_prob). Note that this sample estimate does not take the pedigree structure into account, so it is a naive estimate that does not pertain to the pedigree base population(s) and hence ignores metafounders. The option -est_start_alt_allele_prob uses Newton optimisation, which also requires starting values. These starting values are 0.5 by default, but can also be provided using -alt_allele_prob_file.

For a pedigree with multiple metafounders, there are three options to obtain metafounder-specific alternative allele probabilities:

(1) use the default starting values of 0.5 for all loci and then -est_alt_allele_prob,

(2) use -est_start_alt_allele_prob to use sample-based starting values and then -est_alt_allele_prob, or

(3) provide starting values using -alt_allele_prob_file and then -est_alt_allele_prob.

In all three cases, -est_alt_allele_prob is optional.

The estimates (using -est_start_alt_allele_prob or -est_alt_allele_prob) or inputs (using -alt_allele_prob_file) of the alternative allele probabilities are constrained to be between 0.001 and 0.999 to ensure valid probabilities and avoid getting trapped in boundary values, 0 or 1.
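
As a worked sketch of the estimate and its bounds at a single locus, the function below halves the mean allele dosage and clamps the result to the interval 0.001 to 0.999. The function name and the example dosages are hypothetical, and which individuals contribute to the base-population mean is simplified here to a plain list.

```python
def estimate_alt_allele_prob(base_dosages, lower=0.001, upper=0.999):
    """Half of the mean allele dosage (range 0..2), clamped away from 0 and 1."""
    if not base_dosages:
        raise ValueError("need at least one dosage")
    mean_dosage = sum(base_dosages) / len(base_dosages)
    # Dosage counts alternative alleles out of two, so halve it to get a probability.
    return min(max(mean_dosage / 2, lower), upper)


print(estimate_alt_allele_prob([0.0, 1.0, 2.0, 1.0]))  # interior value
print(estimate_alt_allele_prob([0.0, 0.0, 0.0]))       # clamped at the lower bound
```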

Error probabilities (using -est_geno_error_prob and -est_seq_error_prob) are estimated as the proportion of mismatches between observed and inferred states.
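
A simplified sketch of a mismatch proportion is shown below, using hard genotype calls. This is an illustration under stated assumptions rather than AlphaPeel's implementation: the function name is hypothetical, observed values coded 9 are treated as missing and skipped, and the inferred states are taken as hard calls instead of probabilities.

```python
def mismatch_proportion(observed, inferred, missing=9):
    """Proportion of non-missing observed states that differ from the inferred ones."""
    pairs = [(o, i) for o, i in zip(observed, inferred) if o != missing]
    if not pairs:
        return None  # nothing observed, so no estimate
    return sum(o != i for o, i in pairs) / len(pairs)


observed = [0, 1, 2, 9, 1]
inferred = [0, 1, 1, 2, 1]
print(mismatch_proportion(observed, inferred))
```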

Phenotype penetrance probabilities (using -est_pheno_penetrance_prob) are estimated from the average conditional probabilities of phenotype states for each phased genotype across phenotyped individuals. This follows Kinghorn (2003), “A simple method to detect a single gene that determines a categorical trait with incomplete penetrance”, Assoc. Advmt. Anim. Breed. Genet. 15:103-106.