Mann-Whitney U-test

QUMA (QUantification tool for Methylation Analysis)

The statistical significance of the difference between two groups over the entire set of CpG sites is evaluated with the Mann-Whitney U-test (also called the Wilcoxon rank-sum test), a non-parametric significance test for comparing two independent samples. Although Student's t-test is used in the same situations, we adopt the non-parametric Mann-Whitney U-test rather than the parametric Student's t-test, because methylation status does not follow a normal distribution, especially in cases of hyper- or hypo-methylation. The two-tailed p-value of the Mann-Whitney U-test is determined from the ranks of the ratio of methylated CpGs to all CpGs in each bisulfite sequence (see the example below). A small p-value indicates that the distribution of this ratio differs between the two groups.
Importantly, this test does not detect differences in some situations, especially for CpG methylation of imprinted regions, because it only checks for a difference in the overall level (average) of methylation between the two groups. In addition, the patterns of CpG methylation are not considered.

Example

The sample data sets are:

	Me-CpGs/CpGs of each sequence (number of methylated CpGs / number of CpGs)	average ratio of methylation	number of sequences
group1	6/19, 6/19, 8/19, 9/19, 12/19, 15/19, 16/19, 18/19, 18/19, 18/19, 18/18, 19/19, 19/19	0.7409	13 (= n₁)
group2	2/19, 2/19, 3/19, 3/19, 5/19, 5/19, 7/19, 7/19, 7/19, 8/19	0.2579	10 (= n₂)

(These are the analyzed data from the QUMA sample sequence files.)
Is the difference between the average methylation ratios (0.7409 vs. 0.2579) significant?

First, rank all the values (methylation ratios) from both groups together. When two or more values are tied, assign each of them the average of the ranks they occupy. In the sample data, two sequences have Me-CpGs/CpGs = 3/19 and occupy ranks 3 and 4, so both receive rank 3.5 (the average of 3 and 4).
Second, calculate the sum of the ranks (rank sum) for each group: R₁ and R₂.

Position i		1	2	3	4	5	6	7	8	9	10	11	12	Rank sum
Me-CpGs/CpGs		2/19	3/19	5/19	6/19	7/19	8/19	9/19	12/19	15/19	16/19	18/19	1
rank		1,2	3,4	5,6	7,8	9-11	12,13	14	15	16	17	18-20	21-23
rank (average)		1.5	3.5	5.5	7.5	10	12.5	14	15	16	17	19	22
number of sequences	group1	0	0	0	2	0	1	1	1	1	1	3	3	212.5 (=R₁)
	group2	2	2	2	0	3	1	0	0	0	0	0	0	63.5 (=R₂)
	total	2	2	2	2	3	2	1	1	1	1	3	3
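
The ranking and rank-sum steps can be sketched in a few lines of Python. This is only an illustrative sketch, not QUMA's own code; the function names `average_ranks` and `rank_sums` are made up for this example, and exact fractions are used so that tied ratios compare equal.

```python
from fractions import Fraction

def average_ranks(values):
    """Map each distinct value to the average of the 1-based ranks it occupies."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Ranks i+1 .. j share this value; their average is (i + 1 + j) / 2.
        ranks[ordered[i]] = Fraction(i + 1 + j, 2)
        i = j
    return ranks

def rank_sums(group1, group2):
    """Rank both groups together and return the rank sums (R1, R2)."""
    ranks = average_ranks(group1 + group2)
    r1 = sum(ranks[v] for v in group1)
    r2 = sum(ranks[v] for v in group2)
    return r1, r2

# Sample data from the table above, as (methylated CpGs, CpGs) pairs.
group1 = [Fraction(m, n) for m, n in [
    (6, 19), (6, 19), (8, 19), (9, 19), (12, 19), (15, 19), (16, 19),
    (18, 19), (18, 19), (18, 19), (18, 18), (19, 19), (19, 19)]]
group2 = [Fraction(m, 19) for m in [2, 2, 3, 3, 5, 5, 7, 7, 7, 8]]

r1, r2 = rank_sums(group1, group2)
print(float(r1), float(r2))  # 212.5 63.5
```

The printed rank sums match R₁ and R₂ in the table.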

Third, determine the temporary U-values, U1 and U2, as below.
U1 = n₁ * n₂ + n₁ * (n₁ + 1) / 2 - R₁ = 8.5
U2 = n₁ * n₂ + n₂ * (n₂ + 1) / 2 - R₂ = 121.5
Take the smaller of U1 and U2 as the U-value. In this case, U = 8.5.
Then determine a two-tailed p-value from the U-value. When the number of sequences is above 20, we approximate the p-value using the normal distribution. When the number of sequences is small (20 or fewer), we determine the p-value from exact probabilities (Mann-Whitney U exact test).
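
As a quick check of the arithmetic, the U-value step can be written as a small Python function. The name `u_statistics` is hypothetical and chosen only for this sketch; the formulas are the ones given above.

```python
def u_statistics(n1, n2, r1, r2):
    """Return (U1, U2, U) from the group sizes and rank sums."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2, min(u1, u2)

# Group sizes and rank sums of the sample data above.
u1, u2, u = u_statistics(13, 10, 212.5, 63.5)
print(u1, u2, u)  # 8.5 121.5 8.5

# Sanity check: U1 and U2 always add up to n1 * n2.
assert u1 + u2 == 13 * 10
```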

The normal approximation is performed as:
z = |U - E(U)| / sqrt(V(U))

where z is a standard normal deviate, E(U) is the mean of U and V(U) is the variance of U:
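E(U) = n₁ * n₂ / 2
V(U) = n₁ * n₂ / (N * (N - 1)) * ((N³ - N) / 12 - Σ (t_i³ - t_i) / 12)

where N = n₁ + n₂ and t_i is the number of sequences sharing the tied rank at position i.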
For the sample, E(U) = 65, V(U) = 257.816 and z = 3.51879. The two-tailed p-value = 0.0004 is then determined from the standard normal distribution (the one-tailed probability is doubled for the two-tailed test).
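
The normal approximation for the sample can be sketched as follows. This is a sketch under stated assumptions rather than QUMA's implementation: it assumes the tie-corrected variance with no continuity correction, which is what reproduces the z value quoted above, and the function name `normal_approx` is made up for this example. The tie sizes are the "total" row of the rank table.

```python
import math

def normal_approx(u, n1, n2, tie_sizes):
    """Return (E(U), V(U), z, two-tailed p) using the normal approximation."""
    n = n1 + n2
    mean_u = n1 * n2 / 2
    # Tie correction: sum of (t^3 - t) over the group of sequences at each position.
    tie_term = sum(t ** 3 - t for t in tie_sizes)
    var_u = n1 * n2 / (n * (n - 1)) * ((n ** 3 - n) - tie_term) / 12
    z = abs(u - mean_u) / math.sqrt(var_u)
    # Two-tailed p = 2 * (1 - Phi(z)), which equals erfc(z / sqrt(2)).
    p = math.erfc(z / math.sqrt(2))
    return mean_u, var_u, z, p

tie_sizes = [2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 3, 3]  # "total" row of the rank table
mean_u, var_u, z, p = normal_approx(8.5, 13, 10, tie_sizes)
print(mean_u, round(z, 5), round(p, 4))  # 65.0 3.51879 0.0004
```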

Another sample data set, for the Mann-Whitney U exact test, is:

Table1

	Me-CpGs/CpGs of each sequence (number of methylated CpGs / number of CpGs)	average ratio of methylation	number of sequences
group1	6/19, 6/19, 9/19, 12/19, 15/19, 18/19	0.5789	6 (= n₁)
group2	3/19, 5/19, 5/19, 7/19, 7/19	0.2842	5 (= n₂)

Table2

Position i		1	2	3	4	5	6	7	8	number of sequences	Rank sum
Me-CpGs/CpGs		3/19	5/19	6/19	7/19	9/19	12/19	15/19	18/19
rank		1	2,3	4,5	6,7	8	9	10	11
rank (average)		1	2.5	4.5	6.5	8	9	10	11
number of sequences	group1	0	0	2	0	1	1	1	1	6	47 (=R₁)
	group2	1	2	0	2	0	0	0	0	5	19 (=R₂)
	total	1	2	2	2	1	1	1	1	11

U1 = n₁ * n₂ + n₁ * (n₁ + 1) / 2 - R₁ = 4
U2 = n₁ * n₂ + n₂ * (n₂ + 1) / 2 - R₂ = 26
U = min (U1, U2) = 4

With the marginal totals fixed, there are 179 possible cases, and the 11 cases shown below have a U-value no greater than the U-value of the sample.

Position i	1	2	3	4	5	6	7	8	Rank sum	U-value	Probability
Me-CpGs/CpGs	3/19	5/19	6/19	7/19	9/19	12/19	15/19	18/19
rank	1	2,3	4,5	6,7	8	9	10	11
rank (average)	1	2.5	4.5	6.5	8	9	10	11
group1/group2	1/0	2/0	2/0	1/1	0/1	0/1	0/1	0/1	21.5/44.5	0.5	0.00433
group1/group2	1/0	2/0	2/0	0/2	1/0	0/1	0/1	0/1	23/43	2	0.00216
group1/group2	1/0	2/0	2/0	0/2	0/1	1/0	0/1	0/1	24/42	3	0.00216
group1/group2	1/0	2/0	2/0	0/2	0/1	0/1	1/0	0/1	25/41	4	0.00216
group1/group2	1/0	2/0	1/1	2/0	0/1	0/1	0/1	0/1	23.5/42.5	2.5	0.00433
group1/group2	1/0	2/0	1/1	1/1	1/0	0/1	0/1	0/1	25/41	4	0.00866
group1/group2	0/1	1/1	0/2	1/1	1/0	1/0	1/0	1/0	47/19	4	0.00866
group1/group2	0/1	0/2	2/0	0/2	1/0	1/0	1/0	1/0	47/19	4	0.00216
group1/group2	0/1	0/2	1/1	2/0	0/1	1/0	1/0	1/0	47.5/18.5	3.5	0.00433
group1/group2	0/1	0/2	1/1	1/1	1/0	1/0	1/0	1/0	49/17	2	0.00866
group1/group2	0/1	0/2	0/2	2/0	1/0	1/0	1/0	1/0	51/15	0	0.00216

To determine the two-tailed p-value, sum the probabilities of these 11 cases. The two-tailed p-value is therefore 0.0498.
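
The exact test can be checked by brute force. The sketch below does not claim to be how QUMA computes it: it enumerates every way of assigning 6 of the 11 pooled sequences to group1 (462 equally likely assignments, which collapse into the distinct cases counted above) and sums the probability of those whose U-value does not exceed the observed one. The function names are hypothetical. Methylated-CpG counts are used directly because every sequence in this sample has 19 CpGs.

```python
from fractions import Fraction
from itertools import combinations

def average_ranks(sorted_values):
    """Return one average rank per element of an already sorted list."""
    ranks, i = [], 0
    while i < len(sorted_values):
        j = i
        while j < len(sorted_values) and sorted_values[j] == sorted_values[i]:
            j += 1
        ranks.extend([Fraction(i + 1 + j, 2)] * (j - i))
        i = j
    return ranks

def exact_test(group1, group2):
    """Return (observed U, distinct cases, extreme cases, two-tailed p)."""
    n1, n2 = len(group1), len(group2)
    pooled = sorted(group1 + group2)
    ranks = average_ranks(pooled)
    rank_total = sum(ranks)

    def u_value(r1):
        r2 = rank_total - r1
        u1 = n1 * n2 + Fraction(n1 * (n1 + 1), 2) - r1
        u2 = n1 * n2 + Fraction(n2 * (n2 + 1), 2) - r2
        return min(u1, u2)

    rank_of = dict(zip(pooled, ranks))
    u_obs = u_value(sum(rank_of[v] for v in group1))

    n_assignments = n_extreme = 0
    cases, extreme_cases = set(), set()
    for combo in combinations(range(n1 + n2), n1):
        n_assignments += 1
        case = tuple(pooled[k] for k in combo)  # values assigned to group1
        cases.add(case)
        if u_value(sum(ranks[k] for k in combo)) <= u_obs:
            n_extreme += 1
            extreme_cases.add(case)
    p = Fraction(n_extreme, n_assignments)
    return u_obs, len(cases), len(extreme_cases), p

group1 = [6, 6, 9, 12, 15, 18]  # methylated CpGs out of 19
group2 = [3, 5, 5, 7, 7]
u_obs, n_cases, n_extreme_cases, p = exact_test(group1, group2)
print(u_obs, n_cases, n_extreme_cases, round(float(p), 4))  # 4 179 11 0.0498
```

Enumerating labelled assignments keeps the code short; grouping them by tied values, as in the table, gives the same total because each case's probability is its number of assignments divided by 462.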