Genome-Wide Association Studies

Key Questions and Ideas today: 1) Discovering human variants associated with a given phenotype by studying genotypes of case and control populations 2) Allelic counts from SNP arrays and microarrays 3) Read counts from gene sequencing 4) Prioritizing and organizing variants 5) Deriving correlations and causalities
Computational Approaches include contingency tables, Chi-Square, Fisher’s, and Likelihood tests
A Mendelian disorder is a disorder defined by a single gene; such disorders are therefore easy to map, and they have low frequency since they are selected against
- considered rare mutations in populations → by contrast, 200 genes or polygenes may influence a given trait, which could still be a common trait if the majority of said 200 gene variants aren’t disease causing
low frequency/rare variants = more likely to be causal in disease
high frequency/common variants = fewer/smaller causal effects on disease
- common defined as greater than 0.5% allelic frequency in population
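The rare/common split above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the genotype counts are made up, and the helper names `allele_frequencies` and `classify` are hypothetical. The 0.5% cutoff is the one from these notes.

```python
COMMON_THRESHOLD = 0.005  # 0.5% allelic frequency, the cutoff used in these notes


def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (freq_A, freq_a) from genotype counts at a biallelic SNP."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)  # each person carries two alleles
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return freq_A, 1.0 - freq_A


def classify(minor_allele_freq, threshold=COMMON_THRESHOLD):
    """Label a variant as 'common' or 'rare' by its minor allele frequency."""
    return "common" if minor_allele_freq > threshold else "rare"


# Hypothetical genotype counts for 1000 people (illustration only)
freq_A, freq_a = allele_frequencies(n_AA=900, n_Aa=95, n_aa=5)
maf = min(freq_A, freq_a)
print(f"minor allele frequency = {maf:.4f} -> {classify(maf)}")
```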
we use SNP chips to detect single-nucleotide polymorphisms [1 base pair mutation]: the approach is to do population-based sequencing studies and then design a custom array based on the common variants found in those studies
- array = direct readout of variation in all common variants in population
higher frequency variants can still have effects → the original mutation may not be enough to cause a disease, but it makes the individual susceptible to another future mutation which could cause a disease
- fundamental idea: these variants can increase risk, but will not directly cause disease
  - these kinds of common variants in populations are referred to as markers; identifying novel markers is critical to figure out which variants are the most important
study of age-related macular degeneration [AMD] across 2172 patients aged 60+
- for a particular SNP with 2 possible alleles [C and T], there was a **significant correlation between the C-allele and AMD** [validated using Chi-Square Test]
- another test is Fisher’s Exact Test → a more familiar formula, just formatted differently
  - probability of the observed table out of all possible tables = $\frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{a+b+c+d}{a+c}}$, where (a, b) and (c, d) are the two rows of the 2x2 contingency table
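Both tests can be sketched on a 2x2 table of allele counts using only the standard library. The counts below are invented for illustration and are not the AMD study data. The cell layout (cases in row a, b and controls in row c, d) and the function names are assumptions. The chi-square version uses no continuity correction, and the Fisher version uses the common "sum all tables no more likely than the observed one" two-sided rule.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    n = a + b + c + d
    # Shortcut formula for a 2x2 table, without continuity correction
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def table_probability(a, b, c, d):
    """Hypergeometric probability of one specific table given fixed margins."""
    return math.comb(a + b, a) * math.comb(c + d, c) / math.comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables with the same margins that are
    no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            total += p
    return total


# Hypothetical allele counts:   C    T
#                      cases    60   40
#                      controls 40   60
stat, p_chi = chi_square_2x2(60, 40, 40, 60)
p_fisher = fisher_exact_two_sided(60, 40, 40, 60)
print(f"chi-square = {stat:.2f}, p = {p_chi:.4g}")
print(f"Fisher exact two-sided p = {p_fisher:.4g}")
```

Fisher's test is usually preferred when some cell counts are small, since the chi-square p-value relies on a large-sample approximation.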
when you have different affected and control groups, you need to guard against population stratification
- this is what happens when subpopulations of humans have different SNP variations because of race or location/ancestry
- causes false positives in GWAS if these SNP differences are present between case and control [the SNP then tracks ancestry rather than the disease]
you have to do a Chi-square test to make sure such differences are not significant between case and control
Manhattan plot = standard GWAS plot [-log10 of each SNP’s p-value against genomic position] → spikes indicate significance of differences
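As a rough sketch of what goes into such a plot, the snippet below converts per-SNP p-values into plot heights and flags the spikes. The `results` list is made-up data, the function names are hypothetical, and the 5e-8 cutoff is the conventional genome-wide significance level, which is an assumption here since the notes do not state a threshold. Plotting itself is left out to keep the example dependency-free.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff; an assumption, not from the notes


def manhattan_points(results):
    """Turn (chromosome, position, p_value) tuples into (chromosome, position, height),
    sorted by genomic position, where height = -log10(p_value)."""
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in sorted(results)]


def significant_hits(results, threshold=GENOME_WIDE_THRESHOLD):
    """Return the (chromosome, position) of SNPs whose p-value passes the threshold."""
    return [(chrom, pos) for chrom, pos, p in results if p < threshold]


# Hypothetical association results: (chromosome, base-pair position, p-value)
results = [
    (1, 1_200_000, 0.4),
    (1, 1_250_000, 3e-9),
    (2, 500_000, 0.02),
    (2, 750_000, 1e-3),
]

for chrom, pos, height in manhattan_points(results):
    bar = "#" * int(round(height))  # crude text version of a spike
    print(f"chr{chrom}:{pos:>9}  {height:5.2f}  {bar}")

print("significant:", significant_hits(results))
```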
Linkage between genes can influence variations, as linked genes are commonly inherited together →
- the distance between the two genes on the chromosome is very small when linkage occurs, making crossover between them at chiasmata unlikely; hence we need to measure the Linkage Disequilibrium between two loci so that one can deduce whether possible correlations are based on linkage or something else
- A/a and B/b are the alleles at the two loci → need to test independence
  - D quantifies disequilibrium between the two loci: D = P(AB) - P(A)P(B), so D = 0 under independence
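A small worked example of D, using the standard definition D = P(AB) - P(A)P(B). The haplotype frequencies are invented for illustration, and the function name and dictionary layout are assumptions.

```python
def disequilibrium_D(hap_freqs):
    """Compute D = P(AB) - P(A) * P(B) from the four haplotype frequencies.

    hap_freqs maps 'AB', 'Ab', 'aB', 'ab' to frequencies that sum to 1.
    """
    p_A = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A at locus 1
    p_B = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B at locus 2
    return hap_freqs["AB"] - p_A * p_B


# Hypothetical haplotype frequencies: A and B travel together more than chance predicts
linked = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
# Independent loci: every haplotype frequency is the product of its allele frequencies
independent = {"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}

D_linked = disequilibrium_D(linked)
D_independent = disequilibrium_D(independent)
print(f"D (linked)      = {D_linked:.2f}")
print(f"D (independent) = {D_independent:.2f}")
```

A nonzero D means the two loci are not independent; its sign only reflects which alleles were labelled A and B.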
r^2 values relative to physical distance indicate linkage if r^2 is high while the distance in kilobases is low → however, anomalies appear once values go beyond a given threshold
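For reference, r^2 can be computed from phased haplotypes as D^2 divided by P(A)(1 - P(A))P(B)(1 - P(B)). The sketch below assumes each haplotype is a two-character string such as "Ab" (locus 1 allele, then locus 2 allele). Both the toy data and the function name are hypothetical.

```python
def r_squared(haplotypes, allele_1="A", allele_2="B"):
    """r^2 between two biallelic loci from a list of phased two-locus haplotypes."""
    n = len(haplotypes)
    p_A = sum(h[0] == allele_1 for h in haplotypes) / n
    p_B = sum(h[1] == allele_2 for h in haplotypes) / n
    p_AB = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n
    D = p_AB - p_A * p_B
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("r^2 is undefined when either locus has only one allele")
    return D * D / denominator


# Toy data: the two loci always travel together (complete linkage disequilibrium)
perfect_ld = ["AB"] * 5 + ["ab"] * 5
# Toy data: all four haplotypes equally frequent (no association)
no_ld = ["AB", "Ab", "aB", "ab"]

print("r^2 (perfect LD):", r_squared(perfect_ld))
print("r^2 (no LD):     ", r_squared(no_ld))
```

Unlike D, r^2 always lies between 0 and 1, which makes values comparable across SNP pairs with different allele frequencies.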
you inherit more from closer generations than from way up the chain [i.e. only about 25% of your genome is your grandad’s, and it reaches you through the 50% you inherit from your dad]
- the amount you inherit in these so-called “haplotype blocks” decreases as you go farther back in generations
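The halving per generation can be written out as simple arithmetic. This is only the expected average share; the actual share from a given ancestor varies because recombination passes the genome down in blocks. The function name is hypothetical.

```python
def expected_fraction(generations_back):
    """Expected fraction of your genome inherited from one ancestor
    that many generations back (1 = parent, 2 = grandparent, ...)."""
    return 0.5 ** generations_back


for label, g in [("parent", 1), ("grandparent", 2), ("great-grandparent", 3)]:
    print(f"{label:>18}: {expected_fraction(g):.3f}")
```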
Raw Sequencing data:
