XIV Statistics and Probability
Modules Overview
plt;pdf;cdf hist;ecdf;kde |
Graphs of cdf, pdf/pmf, ecdf, epdf/KDE, and histograms plt;n(0,1);pdf || plt;epn(2);pdf || plt;f(3,6);pdf || plt;epn(2);cdf || plt;gam(3,2);cdf || plt;poi(10);hist || plt;t(20);hist || plt;n(0.5,1.5);hist || |
n;t;f;epn uni;gam;bet cau;chi |
Common continuous distributions n(a, b)—Normal(a, b) with mean=a and std=b || t(n)—Student-t with df=n || uni(a,b)—Uniform(a, b) || epn(λ)—Exponential(λ) || gam(a, b)—Gamma(a, b) || bet(a,b)—Beta(a, b) || cau(a, b)—Cauchy(a, b) || chi(n)—Chi-squared(n) || f(m, n)—F-distribution with df1=m, df2=n || stt;t(5);pdf(x) || stt;cau(0,1);pdf(x) || stt;epn(3);cdf(x) || stt;uni(0,1);E(x) || stt;f(5,8);pdf(x) || stt;gam(2,6);V(z) || stt;n(0,1);P(x>0|x<2) || |
poi;ber;bin nbn;geo;hyp die |
Common discrete distributions poi(λ)—Poisson(λ) || ber(p)—Bernoulli(p) || bin(n, p)—Binomial(n, p) || nbn(n, p)—Negative Binomial(n, p) || geo(p)—Geometric(p) || hyp(N,K,n)—Hypergeometric(N, K, n) || die(n)—Die(n) || stt;poi(5);m || stt;bin(5,0.5);v || stt;hyp(10,5,7);rvs || stt;nbn(6,0.4);rvs || stt;geo(0.5);m || stt;die(6);P(x<3|x>1) || |
m;v;s;k;q;h cdf;pdf;svf;P std;med;isf M;rvs;ecdf |
probabilities, mean, variance/std, moments/MGF, cdf/svf/isf, pdf/pmf, quantiles/percentiles, skewness, kurtosis, entropy, median m—mean || std—standard deviation || v—variance || cdf—cumulative density function || pdf—probability density (mass) function || svf—survival function || isf—inverse survival function || q—quantile || h—entropy || s—skewness || k—kurtosis || rvs—simulate random variables || M—moments || med—median || stt;epn(2.5);m || stt;t(11);cdf(2) || stt;f(3,5);pdf(x) || stt;uni(1,3);v || stt;poi(5);E(x) || stt;epn(2.1);svf(1) || stt;n(0,1);q(0.025) || stt;bin(5,0.5);v || stt;poi(10);std || stt;t(7);med || stt;n(0,1);s || stt;n(0,1);M(3) || plt;nbn(14,0.6);ecdf;34 || stt;n(0,1);P(x>-1.65&x<1.65) || stt;bin(20,0.5);P(x>10|x>3) || stt;epn(0.5);P(x>1|x<3) |
mni;mnm;cov crv;drv;dmc cmc;ppr;gpr wpr;rvs |
Multivariate distributions, user-defined distributions, and stochastic processes cov—covariance || crv—continuous r.v. || drv—discrete r.v. || dmc—discrete Markov chain || cmc—continuous Markov chain || mni(n, p1, p2, ..., pn)—multinomial(n, (p1, p2, ..., pn)) || mnm(μ, Σ)—multivariate normal N(μ, Σ) || stt;crv(3*x**2,0,1);E(x) || stt;drv((1,2,3),(0.2,0.3,0.5));cdf(x) || stt;mni(3,0.2,0.3,0.5);cov || stt;mni(10,0.2,0.3,0.2,0.3);rvs(20) || stt;mnm([3,4],[[2,1],[1,2]]);rvs(20) || stt;gam(2,4);V(x) || stt;chi(6);E(x) || stt;n(0,1);E(exp(t)) || stt;drv((1,2,3,4),(0.2,0.3,0.25,0.25));rvs(50) || stt;cmc((0,1,2),[[-1,1/10,9/10],[2/5,-1,3/5],[1/2,1/2,-1]]);P(x(43.2)>0|x(3.29),2) || stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);P(x[10]>0|x[2]>0) || stt;dmc((0,1,2),[[5/10,3/10,2/10],[2/10,7/10,1/10],[3/10,3/10,4/10]]);E(x[3]|x[1],1) || stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);rvs(50);(0,1,0) || stt;cmc((0,1,2),[[-1,1/10,9/10],[2/5,-1,3/5],[1/2,1/2,-1]]);P(x(1.57)<=x(3.14)|x(1.22),1) || stt;dat([[0,2],[1,1],[2,0]]);cov || |
E;P;V;H pdf;cdf;std |
Symbolic computing E(X)—expectation || V(X)—variance || P(X>a|X< b)—conditional probability || pdf(X)—probability density (mass) function || cdf(X)—cumulative density function || H(X)—entropy || stt;crv(x**2/9,0,3);E(2*x+1) || stt;drv((1,2,3),(1/5,2/5,2/5));V(2*x) || stt;gam(2,4);V(x) || stt;chi(6);E(x) || stt;n(0,1);E(exp(t)) || stt;epn(2);E(x^3) || stt;chi(5);E(x^2) || stt;epn(4);H(x) || stt;n(0,1);E(exp(I*w*x));x || |
dat | Data analysis: descriptive statistics, hypothesis testing, curve fitting and generalized linear models d—descriptive statistics || z—Z scores || q—quantile or percentile || cov—covariance/correlation matrices || rp—Pearson correlation || pt—scatter plot || box—boxplot || ecdf—plot for ecdf || rs—Spearman correlation || rk—Kendall correlation || kde—Gaussian kernel density estimates || t1—one sample t-test || t2—two sample t-test || pt—paired samples test || n—normality test || ks1—Kolmogorov-Smirnov test (1s) || ks2—Kolmogorov-Smirnov test (2s) || f1—one-way ANOVA || f2—two-way ANOVA || w—Wilcoxon test || mw—Mann-Whitney test || kw—Kruskal-Wallis test || chi—chi-square goodness-of-fit test || cnt—contingency-table test of independence || ols—linear regression || pr—polynomial regression || glm—generalized linear model || log—log linear model || stt;dat((10,25,10,13,11,11),(15,15,15,15,15,5));chi || stt;dat((2,3,5,7,1,9,8,6),(13,15,23,24,10,25,23,18));ols || stt;dat((2,3,5,7,1,9,8,6),(13,15,23,24,10,25,23,18));ols;pr || plt;dat(-0.19,-0.01,4.35,1.71,-1.37,-0.93,-1.11,-2.47);pt || plt;dat((-0.19,-0.01,4.35,1.71,-1.37,-0.93,-1.11,-2.47),(13.19,15.01,18.65,22.29,11.37,25.93, 24.11,20.47));pt || stt;dat((8.8,6.4,7.1,5.1,6.7,6.7,9.9,9.4,9.9,9.5,8.7,10.9),(0,0,0,0,0,0,1,1,1,1,1,1));glm;b || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log || stt;poi(9);rvs(30);d || stt;dat(14,4,7,11,5,8,8,8,6,12,12,7,8,12,9);d || stt;dat(14,4,7,11,5,8,8,8,6,12,12,7,8,12,9);z || |
1 Modules and Instructions
Type of modules Computing modules are classified into seven groups according to their functionalities and statistical properties: (1) descriptive statistics, (2) data visualization and exploration, (3) common distribution functions, (4) hypothesis testing, (5) covariance and correlation, (6) generalized linear models, and (7) random processes. Begin a command with "stt" (statistics) to run modules for statistical analysis (rather than modules from other areas, e.g., linear algebra), and begin with "plt" to run modules that produce graphs or plots.
Data formats Small data sets can be entered directly into the program for quick statistical analysis. You may enter one row or multiple rows of numbers in the form "dat(0.5,2,-2,0,2/3,3**(1/2))" (one row), "dat((2,3,1),(0,-2,0.5))" (two rows), "dat((-0.77,-0.28,-1.13,-0.8),(1.23,1.72,0.87,1.2),(3.23,3.72,2.87,3.2))" (three rows), and so on. All data points must be numerical values, unless otherwise specified. Each row of numbers is enclosed in parentheses, different rows are separated by a comma ",", and numbers within each row are also separated by commas.
1. Modules for Descriptive Statistics and Z Scores
For a data set with one row of numbers, you can get the following statistics by keyword "d" (descriptive): sample size, min, max, mean, standard deviation (std), variance, skewness, kurtosis, median, interquartile range, and the standard error of the mean estimate. Check stt;dat(5,1,10,5,6,10,9,9,4,8);d || stt;dat(1.99,-0.29,-0.93,-0.82,1.73,-0.57,-0.91,0.2,-0.42,0.2);d ||. Note that example usages are separated by "||".
Use keyword "z" for the z score of each value in the sample, relative to the sample mean and standard deviation. For example, stt;dat(11.73,13.93,12.79, 13.92,12.34,11.15,13.14,11.86,16.61,13.2);z ||.
Compute the p-th quantile or 100p-th percentile by the function "q(p)", where "p" is a number between 0 and 1. For example, stt;dat(5,1,10,5,6,10,9,9,4,8);q(0.6) ||. You can also compute the empirical cumulative density function for any number "a" by the function "ecdf(a)" (click here for details). For example, stt;dat(5,1,10,5,6,10,9,9,4,8);ecdf(2) || stt;dat(5,1,10,5,6,10,9,9,4,8);ecdf(6.7) ||.
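If you want to cross-check these quantities outside the program, the following is a minimal Python sketch (an assumption: it relies on numpy and scipy, whose function names below are theirs, not this program's keywords) that reproduces the descriptive statistics, z scores, a quantile, and the empirical cdf for one of the samples above:

import numpy as np
from scipy import stats

x = np.array([5, 1, 10, 5, 6, 10, 9, 9, 4, 8])

# Descriptive statistics (rough analogue of the "d" keyword)
print(len(x), x.min(), x.max(), x.mean(), x.std(ddof=1), x.var(ddof=1))
print(stats.skew(x), stats.kurtosis(x), np.median(x))

# Z scores (analogue of "z"); ddof=1 uses the sample standard deviation,
# which may or may not match this program's convention
print(stats.zscore(x, ddof=1))

# 0.6 quantile (analogue of "q(0.6)") and empirical cdf at 6.7 (analogue of "ecdf(6.7)")
print(np.quantile(x, 0.6))
print(np.mean(x <= 6.7))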
2. Modules for Data Visualization and Exploration
To obtain a graph, start with "plt" rather than "stt". Use the form "plt;dat(one row);ecdf" to obtain the graph of the empirical cumulative density function (click here for details), use the form "plt;dat(one row);kde" to obtain the graph of the Gaussian kernel density estimate for a continuous random variable (click here for details), use "plt;dat(one row);pt" or "plt;dat((rowX),(rowY));pt" to obtain the scatter plot of your data (one or two rows for variables X and Y), and use the form "plt;dat((row1),(row2),...,(rowk));box" to obtain the boxplot for one or multiple rows of your data. Check the examples: plt;dat((-1,2,1,0.5,4,3),(-2.8,3,2,0.8,7,5));pt || plt;dat(3,12,5,5,6,9,7,7,5,6,10,12,6,7,8,7,8,10,7,6);ecdf || plt;dat(4.97,4.96,6.38,4.85,5.89,5.13,6.38);box || plt;dat(3,12,5,5,6,9,7,7,5,6,10,12,6,7,8,7,8,10,7,6);kde || plt;dat((3,12,5,5,6,9,7,7,5,6,10,12,6),(2,8,5,9,-5,4,7,7,8,7,8,10,7));box ||.
You can plot the probability density function (pdf), cumulative density function (cdf) and histogram (hist) of samples randomly generated from some common distributions. For example, plt;n(2,1);pdf || plt;n(2,1);cdf || plt;n(2,1);hist || plt;poi(5);cdf || plt;poi(5);hist ||. Note that "n(2,1)" stands for a normal distribution with mean 2 and standard deviation 1, and "poi(5)" for a Poisson distribution with mean 5. The graphs in these examples are based on 2000 randomly simulated numbers from the specified distribution.
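As a rough outside cross-check of these plots, here is a minimal Python sketch (an assumption: it uses matplotlib and scipy, which are separate tools, not this program's modules) that draws the N(2, 1) pdf and cdf and a histogram of 2000 simulated values:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Rough analogues of plt;n(2,1);pdf, plt;n(2,1);cdf and plt;n(2,1);hist
dist = stats.norm(loc=2, scale=1)
x = np.linspace(-2, 6, 400)
sample = dist.rvs(size=2000, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(x, dist.pdf(x))
axes[0].set_title("pdf")
axes[1].plot(x, dist.cdf(x))
axes[1].set_title("cdf")
axes[2].hist(sample, bins=30, density=True)
axes[2].set_title("hist")
plt.show()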
3. Modules for Common Distribution Functions
1. Discrete distributions Discrete distribution modules include Bernoulli (ber), binomial (bin), Poisson (poi), geometric (geo), negative binomial (nbn), hypergeometric (hyp), and discrete uniform (die). Check stt;ber(0.6);v || stt;poi(10);std || stt;bin(10,0.5);P(x>2) || stt;geo(0.7);m || stt;nbn(8,0.6);m || stt;hyp(20,5,12);rvs || stt;die(10);cdf(x) ||. Note that you need to provide appropriate parameter(s) for each discrete distribution (click here for details). Some distributions have one parameter, and some have two or more parameters.
2. Continuous distributions Continuous distribution modules include normal (n), exponential (epn), uniform (uni), Gamma (gam), Beta (bet), Cauchy (cau), F-distribution (f), Student-t distribution (t), Chi-square (chi) (click here for details). Check stt;n(3,1);q(0.8) || stt;epn(0.5);svf(1.2) || stt;uni(0,2);V(x) || stt;gam(2,3);P(x>2&x<4) || stt;bet(0.5,2);E(x) || stt;cau(0,1);cdf(x) || stt;f(8,11);P(x>13) || stt;t(2);P(x>2@x<-2) || stt;chi(1);q(0.975) ||.
3. User defined distributions Use "drv" (discrete random variable) to create a discrete random variable (r.v.). For example, stt;drv(0.75^x/3);P(x>2) creates a geometric random variable with probability mass function (pdf) \(f(x)=\frac{1}{3}(\frac{3}{4})^x\) for \(x=1,2,\cdots\), and \(\sum_{x=1}^∞f(x)=1\). For a finite discrete random variable, list the probability mass function in two different rows. For example, stt;drv((0,1,2),(2/4,1/4,1/4));cdf(x) ||. The first row lists the possible values the random variable can take, and the second lists their corresponding probabilities that must sum to 1 (click here for details).
Use "crv" (continuous random variable) to create a continuous random variable with specified pdf and interval (range). In this example "stt;crv(3*x^2/26,1,3);E(x)", the pdf is \(f(x)=\frac{3x^2}{26}\) and the range or interval the r.v. can take is [1, 3]. Note that the pdf must be integrated to 1 over this interval (click here for details).
4. Multinomial and multivariate normal distributions Use "mni(n, p1, p2, ..., pk)" to create a multinomial r.v. with parameter "n" for total number of events (trials), and "p1, p2, ..., pk" for the probabilities of event 1 to event k, respectively, where the probabilities "p1, p2, ..., pk" must sum to 1. For example, stt;mni(10,0.2,0.3,0.2,0.3);cov || stt;mni(10,0.2,0.3,0.2,0.3);pdf(2,3,2,3) || stt;mni(20,2/5,1/5,1/5,1/5);rvs(10) || stt;mni(20,2/5,1/5,1/5,1/5); pdf(9,0,7,4) || stt;mni(20,2/5,1/5,1/5,1/5);h ||. Four keywords "cov", "h", "rvs(n)", "pdf(x1, x2, ..., xk)" are available for multinomial distributions. When computing the pdf/pmf for a multinomial r.v., the sum of the k events must be the total n, that is, x1 + x2 + ... + xk = n.
Use "mnm(μ, Σ)" to create a multivariate normal r.v. with mean vector μ = "(0,1,2)" and covariance matrix Σ = "((2,-1,1),(-1,2,-1),(1,-1,2)))", for instance. Check stt;mnm((0,1,2),((2,-1,1),(-1,2,-1),(1,-1,2)));cov || stt;mnm((0,1,2),((2,-1,1),(-1,2,-1),(1,-1,2)));cdf(1,2,1) || stt;mnm((0,1,2),((2,-1,1),(-1,2,-1),(1,-1,2)));pdf(0,1,0) || stt;mnm((0,1,2),((2,-1,1),(-1,2,-1),(1,-1,2)));rvs(20) || stt;mnm((0,1,2),((2,-1,1),(-1,2,-1),(1,-1,2)));h ||. The input covariance matrix Σ must be symmetric positive semidefinite. Four keywords "cov", "pdf(x1, x2, ..., xk)", "cdf(x1, x2, ..., xk)" and "h" are available for multivariate normal distributions.
5. Compute probabilities Use the function "P(.)" for computing probabilities (click here for details). For example, "P(2)" stands for the probability \(P(X=2)\), "P(x!=2)" for \(P(X≠2)\), "P(x>2)" for \(P(X>2)\), and "P(x<=2)" for \(P(X≤2)\). For compound inequalities, "P(y>2&y<=5)" or "P(2< y<=5)" stands for the probability \(P(2< Y≤5)\) or \(P(Y>2\text{ and }Y≤5)\), and "P(z>=6@z<3)" stands for the probability \(P(Z≥6∪Z<3)\) or \(P(Z≥6\text{ or }Z<3)\). Use similar combinations to compute probabilities. Check stt;bin(10,0.5);P(x>=6) || stt;epn(0.8);P(x<1) || stt;geo(0.7);P(y!=2) || stt;n(0,1);P(x>1.96@x<-1.96) || stt;n(0,1);P(x>-1&x<1) || stt;poi(6);P(2@5) || stt;bin(10,0.5);P(3<=w<=7) ||.
Use the function "P(.|.)" to compute conditional probabilities. For example, "P(x>5|x>=2)" represents the conditional probability \(P(X>5|X≥2)\). Similarly, "P(x<=4|x<8)" represents the probability \(P(X≤4|X<8)\). Check stt;poi(8);P(x<=4|x<8) || stt;poi(8);P(x>4|x>2) || stt;epn(1.2);P(x>2|x>1) || stt;geo(0.6);P(x>5|x>4) ||.
6. Compute mean, variance, skewness, kurtosis, entropy and moments Use the letters "m, v, s, k, h" to obtain the mean, variance, skewness, kurtosis, and entropy of the r.v., respectively. Use "std, med" to get the standard deviation and median of the r.v., respectively. Use "M(r)" for computing moments, where "r" is a positive integer representing the r-th moment (click here and here for details). Check stt;ber(0.6);h || stt;bin(20,0.4);m || stt;poi(9);med || stt;n(3,1.5);s || stt;epn(2);k || stt;gam(4,3);std || stt;uni(1,4);v || stt;poi(3);M(3) || stt;n(4,2);M(4) || stt;epn(4);s ||.
You can obtain the above quantities by "E(x), V(x), s(x), k(x), H(x), std(x), med(x)", which calculate the mean, variance, skewness, kurtosis, entropy, standard deviation and median, respectively (click here for details). You can also compute the means and variances of linear combinations of the r.v. For example, "E(2*x+1)" and "V(2*x+1)" compute the mean \(E(2X+1)\) and variance \(\text{Var}(2X+1)\). In particular, "E(exp(t*x));x" returns the moment generating function of the r.v. ("x"), and "E(exp(w*I*x));x" gives the characteristic function of the r.v. ("x"). Be sure to add "x" at the end to tell the program that the letter "x" represents the r.v. (rather than other letters, like "t", "w" or "I", in the expression). For example, stt;n(0,1);E(2*x+1) || stt;n(2,1.5);V(2*x+1) || stt;n(0,1);E(exp(t*x));x || stt;epn(3);E(exp(t*x));x || stt;n(0,1);E(exp(w*x*I));x || stt;crv(3*x^2/26,1,3);E(x) || stt;crv(3*x^2/26,1,3);std(x) ||.
7. Compute cdf, pdf, percentile, and (inverse) survival function Use "pdf(a), cdf(a), svf(a), isf(p), q(p)" to compute the value of the pdf, cdf, survival function, inverse survival function, and the p-th quantile (100p-th percentile), respectively, for a given number "a" or "p" (for p ∈ [0, 1]) (click here and here for details). For example, stt;bin(10,0.7);cdf(6) || stt;poi(12);pdf(10) || stt;t(8);q(0.95) || stt;epn(0.8);svf(1) || stt;epn(2);isf(0.6) ||.
Use "pdf(x)" and "cdf(x)" to show the formulas for the pdf and cdf of the r.v. "x". For example, stt;n(0,1);pdf(x) || stt;epn(1);cdf(x) ||.
8. Simulating samples Use "rvs(n)" for simulating a random sample of size "n" from common distributions or user-defined distributions. For example, stt;chi(1);rvs(3) || stt;t(8);rvs(20) || stt;hyp(15,4,8);rvs(20) ||.
Use keyword "d" to obtain some summary statistics from simulated samples. Observe how these statistics change as the sample size n increases and examine the relationship between the sample statistics and population parameters. For example, stt;poi(20);rvs(30);d || stt;poi(20);rvs(300);d ||.
4. Modules for Hypothesis Testing
1. Student-t test Use "t1" for a one-sample t-test, "t2" for a two-sample t-test, and "pt" for a paired-samples t-test (click here for details). For example, stt;dat(2.84,3.76,2.61,4.29,2.15,1.98,3.21,4.38,2.9,3.11,2.89,2.55,3.74,3.93,2.95);t1;3 || tests whether the mean equals 3; stt;dat((4,2,1,5,0,5,1,1,3,3,0,7,2,3,7),(6,2,6,5,2,0,4,3,8,1,3,7,4,5,5));t2 || tests whether the two means are significantly different; stt;dat((5,2,8,4,1),(7,3,6,5,3));pt || tests whether the two means are significantly different before and after some treatment.
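For an outside cross-check of these three tests, a minimal sketch with scipy (an assumption: these are scipy function names, not this program's keywords) on the same data is:

from scipy import stats

x = [2.84, 3.76, 2.61, 4.29, 2.15, 1.98, 3.21, 4.38, 2.9, 3.11, 2.89, 2.55, 3.74, 3.93, 2.95]
print(stats.ttest_1samp(x, popmean=3))   # one-sample t-test of H0: mean = 3

a = [4, 2, 1, 5, 0, 5, 1, 1, 3, 3, 0, 7, 2, 3, 7]
b = [6, 2, 6, 5, 2, 0, 4, 3, 8, 1, 3, 7, 4, 5, 5]
print(stats.ttest_ind(a, b))             # two-sample t-test

before = [5, 2, 8, 4, 1]
after = [7, 3, 6, 5, 3]
print(stats.ttest_rel(before, after))    # paired-samples t-test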
2. Goodness of fit test Use "chi" to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table (test of homogeneity) (click here for details). For example, stt;dat((5,8,9,2,1,7),(6,6,6,4,4,6));chi ||, where the first row is the observed counts and the second row gives the expected frequencies. When just observed frequencies are given, it is assumed that the expected frequencies are uniform and given by the mean of the observed frequencies. For example, stt;dat(5,8,9,2,1,7);chi ||.
Use "n" to test normality. Use "ks1" (Kolmogorov-Smirnov test one sample) to test if a sample comes from a certain population distribution, and use "ks2" (Kolmogorov-Smirnov test one sample) to test if two samples come from the same distribution (click here for details). For example, stt;dat(6.03,5.41,4.01,6.17,3.93,5.24,5.97,6.4,4.58,6.11,5.43,5.65,6,4.31,4.42);n || stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));ks2 ||. To use "ks1", specify the distribution by keyword such as "n", "t", "f", "chi", "exp", "poi", "cau", "gam", "bet", and "uni", followed by its necessary parameter(s). For example, stt;dat(6.03,5.41,4.01,6.17,3.93,5.24,5.97,6.4,4.58,6.11,5.43,5.65,6,4.31,4.42);ks1;n;3;1 || stt;dat(6.03,5.41,4.01,6.17,3.93,5.24,5.97, 6.4,4.58,6.11,5.43,5.65,6,4.31,4.42);ks1;n;5;1 || stt;dat(12,4,5,9,8,10,5,9,1,5,9,9,6,7,3);ks1;poi;7 || stt;dat(1.27,2.3,1.56,0.48,4.44,1.52,1.09,0.19, 1.5,1.27,2.2,0.59,2.43,1.67,0.45);ks1;epn;0.8 ||.
3. Analysis of variance (ANOVA) Use "f1" for one-way ANOVA and "f2" for two-way ANOVA (click here for details). For example, a factor of three levels by stt;dat((2,3,4,5),(3,4,5,6),(4,3,5,7));f1 ||. Two factors A and B with 3 levels for A and 2 levels for B by stt;dat(((2,3,4,5),(8,9,8,8)),((3,4,5,6),(7,7,8,8)),((4,3,5,7),(9,10,11,9)));f2 ||. For two-way ANOVA, use "n" for interaction. For example, stt;dat(((2,3,4,5),(8,9,8,8)),((3,4,5,6),(7,7,8,8)),((4,3,5,7),(9,10,11,9)));f2;n ||.
4. Nonparametric test Use "mw" (Mann-Whitney U-test) to determine if two samples come from the same population in a nonparametric way. The Kruskal-Wallis H-test ("kw") extends the Mann-Whitney U-test to more than two samples (analogous to one-way ANOVA). Use "w" (Wilcoxon signed-rank test) to test the null hypothesis that two related paired samples come from the same distribution (nonparametric version of the paired-samples t-test) (click here for details). When using the Wilcoxon signed-rank test, input the differences of each pair in one row. For example, stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));mw || stt;dat((2,3,4,5),(3,4,5,6),(4,3,5,7));kw || stt;dat(-2,-1,2,-1,-2);w ||.
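A minimal scipy-based version of the same three nonparametric tests (an assumption: these are scipy names, not this program's keywords) is:

from scipy import stats

a = [1, 7, 9, 7, 12, 9, 12, 9, 6, 12]
b = [3, 7, 7, 8, 9, 10, 8, 7, 6, 13]

print(stats.mannwhitneyu(a, b))                                  # Mann-Whitney U-test ("mw")
print(stats.kruskal([2, 3, 4, 5], [3, 4, 5, 6], [4, 3, 5, 7]))   # Kruskal-Wallis H-test ("kw")
print(stats.wilcoxon([-2, -1, 2, -1, -2]))                       # Wilcoxon signed-rank test on paired differences ("w")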
5. Test of independence Use "cnt" (contingency table) for test of independence (click here for details). For example, check the observed counts in a 3 × 2 table by stt;dat((32,43,25),(27,30,30));cnt ||.
5. Modules for Correlation and Covariance
Covariance Use "cov" to compute the covariance and correlation matrices. For example, stt;dat((2,3,4,5),(6,5,6,3),(8,7,9,6));cov || returns 3 × 3 covariance and correlation matrices, and stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));cov || returns 2 × 2 covariance and correlation matrices.
Correlation(Pearson, Spearman, Kendall) Use "rp", "rs", "rk" to calculate and test the null hypothesis that the correlation is 0 for Pearson product-moment correlation, Spearman's rank correlation, and Kendall's rank correlation, respectively (click here for details). For example, stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));rp || stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));rs || stt;dat((1,7,9,7,12,9,12,9,6,12),(3,7,7,8,9,10,8,7,6,13));rk ||.
6. Modules for Generalized Linear Models
Use "ols" for both simple or multiple linear regression. Note that the response variable is listed in the last row and explanatory variable(s) are listed before the response variable (click here for details). For example, stt;dat((2,3,6,8,1,3,5,9,2),(8,10,9,12,4,6,8,13,3));ols || stt;dat((2,3,6,8,1,3,5,9,2),(0,0,0,0,1,1,1,1,1),(8,10,9,12,4,6,8,13,3));ols ||. For predicted values and residuals, use "ols;pr"; for the predicted value given a new observation, use "ols;p(a)" or "ols;p(a,b)". For example, stt;dat((2,3,6,8,1,3,5,9,2),(8,10,9,12,4,6,8,13,3));ols;pr || stt;dat((2,3,6,8,1,3,5,9,2),(8,10,9,12,4,6,8,13,3));ols;p(4) || stt;dat((2,3,6,8,1,3,5,9,2),(0,0,0,0,1,1,1,1,1),(8,10,9,12,4,6,8,13,3));ols;pr || stt;dat((2,3,6,8,1,3,5,9,2),(0,0,0,0,1,1,1,1,1),(8,10,9,12,4,6,8,13,3));ols;p(4,1) ||.
Use "glm;b" for logistic regression, where "b" indicates the binomial family and logit link function (click here for details). Note that the response variable, which is binary, must be listed in the last row in the data set. For example, stt;dat((3.78,2.44,2.09,0.14,1.72,1.65,4.92,4.37,4.96,4.52,3.69,5.88),(0,0,0,0,0,0,1,1,1,1,1,1));glm;b ||.
Use "pr;n" (polynomial regression) for polynomial regression, where "n" is a positive integer for the order of degree (click here for details). For example, stt;dat((3,2,3,1,7,5,2,9),(12,5,8,2,38,28,5,90));pr;2 ||. Check the scatter plot by plt;dat((3,2,3,1,7,5,2,9),(12,5,8,2,38,28,5,90));pt ||.
Use "log" for log linear models (click here for details). For three-way tables, add model generators such as "12", "23", "13", "12+23", "12+13", "13+23", "12+13+23" to the end. Note that the row and column indicators are listed as explanatory variables before the counts, which are listed in the last row as response variable. For example; stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(35,59,47,112,42,77,26,76));log || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(35,59,47,112,42,77,26,76));log;12+23 ||.
7. Modules for Random Processes
Use "dmc" (discrete Markov chain) to create a finite discrete time-homogeneous Markov chain (click here for details). The first row "(1,2,3)" in the example "stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);tp" lists the discrete state space, and the second portion is a 3 × 3 (constant) one-step transition matrix. Note that each row of the transition matrix must sum to 1. The keyword "tp" (transition probability) displays the state space and the transition matrix.
Use keyword "pn;n" for n-step transition matrix, where "n" is a positive integer for "n-step". Use "ld" for limiting distribution, "fm" for fundamental matrix, "cc" for communication class, "ap" for absorbing probabilities, and "dc" for decomposing transition matrix into submatrices with special properties. Use "md;n;(p1, ..., pk)" for n-step marginal distribution with initial distribution "(p1, ..., pk)" that sums to 1. For example, stt;dmc((0,1,2,3,4),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0), (0,0,1,0,0),(0,0,1/2,1/2,0),(1/2,0,0,0,1/2)]);dc || stt;dmc((0,1,2,3,4),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0), (0,0,1,0,0),(0,0,1/2,1/2,0),(1/2,0,0,0,1/2)]);tp || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);pn;7 || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);ld || stt;dmc((0,1,2,3,4),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0), (0,0,1,0,0),(0,0,1/2,1/2,0),(1/2,0,0,0,1/2)]);cc || stt;dmc((0,1,2,3,4),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0), (0,0,1,0,0),(0,0,1/2,1/2,0),(1/2,0,0,0,1/2)]);fm || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);md;7;(1,0,0) ||.
Use "cmc" for continuous-time Markov chain (click here for details). For example, stt;cmc((0,1),((-1,1),(1,-1)));P(x(1.96),0|x(0.78),1) ||.
Compute transition probabilities The expression "P(x(3),1|x(0),1)" represents the probability that the process starts in state 1 at time 0 and is in state 1 at time 3. Similarly, "P(x(5)>=1|x(2)>1)" represents the conditional probability that the process takes a value greater than or equal to 1 at time 5, given that it takes a value greater than 1 at time 2. For example, stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(3),1|x(0),1) || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(3),1|x(0)>=2) || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(6)!=1|x(2)<=2) || stt;cmc((1,2,3),((-6,3,3),(4,-12,8),(15,3,-18)));P(x(3.12),3|x(1)>1) ||.
Use "ppr(λ)" (Poisson process) for a Poisson random process with parameter "λ" (positive integer) (click here for details). For example, stt;ppr(3);P(x(2)<=3|x(1)>1) ||. You can simulate the arrival times for "n" events of a Poisson process using "rvs(n)", and simulate the inter-arrival times for "n" events using "rvs(n);W". For example, stt;ppr(9);rvs(50) || stt;ppr(9);rvs(50);W ||.
Use "wpr" (Wiener process) for a Wiener process (click here for details). For example, stt;wpr;P(x(1)< x(2)) ||. Use "gpr(a,b)" (Gamma process) for a Gamma process with parameters "a" and "b", where both "a" and "b" are positive numbers (click here for details). For example, stt;gpr(2,3);P(x(1)>2) ||.
2 Probability Theory
1. Probability
1. Computing probabilities by counting If an event \(A\) can occur in \(m\) different ways out of a total \(n\) possible ways (equally likely), the probability of occurrence is \(P(A)=\frac{m}{n}\). If an event \(B\) occurred \(h\) times in \(N\) repetitions of an experiment, \(P(B)=\frac{h}{N}\).
2. Conditional probabilities Compute conditional probabilities by: (1) the formula \(P(B|A)=\frac{P(A∩B)}{P(A)}\), and (2) treating the outcomes of event \(A\) as the new sample space where event \(B\) occurs in a number of ways. For example, the probability that a single toss of a die will result in a number less than 5 (event \(B\)) given that the toss results in an even number (event \(A\)) is by formula \(P(B|A)=\frac{2/6}{1/2}=\frac{2}{3}\), because the outcomes of the events and their probabilities are \(A=\{2,4,6\},B=\{1,2,3,4\},A∩B=\{2,4\},P(A)=\frac{1}{2},P(A∩B)=\frac{2}{6}\). By method (2), since \(A\) has occurred, we treat the outcomes {2, 4, 6} of event \(A\) as the new sample space, where there are two ways that event \(B\) can occur. Thus \(P(B|A)=\frac{2}{3}\).
3. Total probability and Bayes' rule Suppose \(A_1,\cdots,A_n\) are mutually exclusive and the sample space is \(S=∪_{i=1}^nA_i\). If \(A\) is any event, then by the law of total probability \(P(A)=\sum_{i=1}^nP(A|A_i)P(A_i)\), and by the Bayes' rule \(P(A_k|A)=\frac{P(A_k∩A)}{P(A)}=\frac{P(A|A_k)P(A_k)}{\sum_{i=1}^nP(A|A_i)P(A_i)}\). For example, a box contains two fair coins (\(F\)) and a biased coin (\(B\)) with \(P(H)=0.3\). A coin is chosen at random from the box and tossed three times, if two heads and a tail are observed (event \(A\)), the probability that the chosen coin is fair is \(P(F|A)=\frac{P(A|F)P(F)}{P(A|F)P(F)+P(A|B)P(B)}=\frac{(_2^3)0.5^2(1-0.5)^1\frac{2}{3}}{(_2^3)0.5^2(1-0.5)^1\frac{2}{3}+(_2^3)0.3^2(1-0.3)^1\frac{1}{3}}≈0.80\). We usually use combinations and permutations to count the number of ways an event or thing can be accomplished.
4. Fundamental principle of counting If one thing can be accomplished in \(n_1\) different ways and after this a second thing can be accomplished in \(n_2\) different ways, ..., and the \(k\)th thing can be accomplished in \(n_k\) different ways, then all the \(k\) things can be accomplished in that order in \(n_1n_2\cdots n_k\) different ways.
5. Combinations and permutations The number of ways for arranging (or permuting) \(r\) objects from \(n\) distinct objects (for \(r≤n\)) in a line is \(P_r^n=n(n-1)\cdots(n-r+1)=\frac{n!}{(n-r)!}\). It follows that \(P_n^n=n!\) for \(r=n\). If the \(n\) objects are not distinct and there are \(n_1,n_2,\cdots,n_k\) subsets of distinct objects for \(n=n_1+\cdots+n_k\), then the number of permutations of the objects is \(P_{(n_1,\cdots,n_k)}^n=\frac{n!}{n_1!\cdots n_k!}\). For example, the number of different permutations of the 10-letter word "abbcccdddd" is \(\frac{10!}{1!2!3!4!}\) = 12600. Use "gamma(n+1)" to compute "n!", so gamma(11) = 10! = 3628800. The total number of combinations of \(r\) objects chosen from a total of \(n\) different objects is \(C_r^n=(_r^n)=\frac{n!}{(n-r)!r!}= \frac{P_r^n}{r!}\). Note that \((_r^n)=\binom n{n-r}\). For example, the number of ways in which 3 cards are randomly drawn from a deck of 52 playing cards is \((_3^{52})=\frac{52!}{49!3!}\) = 22100 by gamma(53)/(6*gamma(50)).
Usage for computing factorial, permutation and combination Use the function "gamma(n+1)" for computing the factorial \(n!\). Check 50! = gamma(51), \(P_{20}^{30}\) = gamma(31)/gamma(11) ||, and \(C_{40}^{60}=(_{40}^{60})\) = gamma(61)/(gamma(21)*gamma(41)) ||.
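Equivalently, outside the program these counts can be computed directly in Python 3.8 or later (an assumption) with math.factorial, math.perm, and math.comb:

from math import comb, factorial, perm

print(factorial(50))     # 50!
print(perm(30, 20))      # P(30, 20) = 30!/10!
print(comb(60, 40))      # C(60, 40) = 60!/(40! 20!)
print(factorial(10) // (factorial(1) * factorial(2) * factorial(3) * factorial(4)))  # 12600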
6. Computing probabilities for continuous r.v.'s involves definite integrals, and the probability is equivalent to the area enclosed by the density and the given intervals (click here for details).
Examples
(1) A card is randomly drawn from a deck of 52 playing cards. The probability that the card is a 7 or a heart is \(P(7∪H)=P(7)+P(H)-P(7∩H)=\frac{4}{52}+\frac{13}{52}-\frac{1}{52}=\frac{4}{13}\). If two cards are drawn from the deck one by one without replacement, the probability that both are kings is \(P(A∩B)=P(A)P(B|A)=\frac{4}{52}·\frac{3}{51}=\frac{1}{221}\).
(2) A fair die is tossed twice. The probability of getting 1, 2, or 3 on the first toss (event \(A\)) and 4, 5, or 6 on the second (event \(B\)) can be calculated by \(P(A∩B)=P(A)P(B|A)=P(A)P(B)=\frac{1}{2}·\frac{1}{2}= \frac{1}{4}\).
(3) Three balls are drawn successively without replacement from a box that contains 3 red balls, 4 white balls and 5 blue balls. The probability that they are in the order red, white and blue is \(P(R_1∩W_2∩B_3)=\frac{3}{12}·\frac{4}{11}·\frac{5}{10}=\frac{1}{22}\).
(4) Box I contains 4 red and 5 blue marbles, and Box II contains 3 red and 6 blue marbles. A fair coin is tossed. If the coin turns up heads, a marble is chosen from Box I; if it turns up tails, a marble is chosen from Box II. The probability that a blue marble is chosen is \(P(B)=P(I)P(B|I)+P(II)P(B|II)=\frac{1}{2}·\frac{5}{9}+\frac{1}{2}·\frac{6}{9}=\frac{11}{18}\).
(5) A shelf has 10 math books and 5 physics books. The probability that 4 particular math books will be together is \(\frac{4!12!}{15!}=\frac{4}{455}\).
(6) Suppose a room has \(n≤365\) people. The probability that the \(n\) people have different birthdays (ignoring leap years) is \(\frac{P_n^{365}}{365^n}=\frac{365}{365}·\frac{364}{365}\cdots\frac{365-n+1}{365}\). The probability that at least two of them have a common birthday can be calculated by \(P(A)=1-P(A^c)=1-\frac{365×364×\cdots×(365-n+1)}{365^n}\). If \(n=25,P(A)=0.5687\) by 1-(gamma(366)/gamma(365-25+1))/365.0**25 ||.
(7) How many people do you need to ask to have a 50:50 chance of finding someone who has the same birthday as yours? Solve for \(n\) = 253 from the equation \(P(A)=1-P(A^c)=1-\frac{364^n}{365^n}=0.5\) by trying 1-(364.0/365)**n such as 1-(364.0/365)**100 || 1-(364.0/365)**200 || 1-(364.0/365)**253.
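A short Python sketch (an assumption: plain Python/numpy, not this program's syntax) reproduces both birthday calculations numerically:

import numpy as np

# P(at least two of 25 people share a birthday), ignoring leap years
k = np.arange(25)
print(1.0 - np.prod((365.0 - k) / 365.0))     # about 0.5687

# Smallest n giving at least a 50:50 chance that someone shares *your* birthday
n = 1
while 1 - (364 / 365) ** n < 0.5:
    n += 1
print(n)                                      # 253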
2. Random Variables and Distribution Functions
We use random variables to represent and evaluate probability functions for various events. The probability distribution function or cumulative density function (cdf) of a random variable \(X\) is defined as \(F_X(x)=P(X≤x)\) for \(-∞< x<∞\). It follows that \(P(X>x)=1-F_X(x)≡S_X(x)\) (called survival function) and \(P(X< x)=F_X(x^-)=\lim\limits_{a→x^-}F_X(a)\). If \(F_X(x)\) is known, most information of the random experiment described by \(X\) is determined. Note that \(\displaystyle\lim_{x→∞}F_X(x)=P(X≤∞)=P(Ω)=1,\lim_{x→-∞}F_X(x)=P(X≤-∞)=P(Ø)=0\), and that the probability of the event \(\{ω:a< X(ω)≤b\text{ for }ω∈Ω\}\) is \(P(a< X≤b)=F_X(b)-F_X(a)\).
1. Discrete random variables and probability mass functions (pdf) A r.v. \(X\) is discrete if its range contains a finite or countably infinite number of points. In this case, \(F_X(x)\) changes values in jumps, or \(F_X(x)\) is a step function whose values between steps are constant. The function \(p(x)=P(X=x)\) is called the probability mass function (pdf) of \(X\), and \(\sum\limits_kp(x_k)=1\) for \(k=1,2,\cdots\). The cdf of \(X\) can be calculated by \(F_X(x)=P(X≤x)=\sum\limits_{x_k≤x}p(x_k)\).
Usage for finite discrete distributions Use "drv\(((x_1,\cdots,x_k),(p_1,\cdots,p_k))\)" (discrete random variable) to construct the distribution function for a finite discrete r.v. \(X\), where \((x_1,\cdots,x_k)\) are the possible values that \(X\) can take, and \((p_1,\cdots,p_k)\) are the probabilities of \(p_i=P(X=x_i)\) for \(i=1,2,\cdots,k\). The input probability vector "(p1, p2, ..., pk)" must sum to 1. For example, stt;drv((1,2,3),(0.2,0.3,0.5));cdf(x) || stt;drv((1,2,3,4),(1/4,1/4,1/4,1/4));cdf(3.5) || stt;drv((1,2,3),(1/5,2/5,2/5));P(3) || stt;drv((1,2,3),(1/5,2/5,2/5));E(x) || stt;drv((1,2,3),(1/5,2/5,2/5));V(x) || stt;drv((1,2,3),(1/5,2/5,2/5));rvs(10) || stt;drv((1,2,3),(1/5,2/5,2/5));q(0.5) ||.
Usage for infinite discrete distributions For an infinite discrete r.v., construct its distribution function by "drv(f(x))", where "f(x)" represents the formula for the nonnegative pdf of the r.v. \(X\). Note that the infinite sum ("ism") for the input pdf "f(x)" must be 1, \(\sum_{x=1}^∞f(x)=1\). For example, if the input pdf is \(f(x)=(\frac{1}{2})^x\), it satisfies the requirement because the geometric series \(\sum_{x=1}^∞(\frac{1}{2})^x=1\). Check ism;0.5**x;x;1;oo || stt;drv(0.5^x);E(x) || stt;drv(0.5^x);P(x>4) || stt;drv(0.5^x);V(x) || stt;drv(0.5^x);cdf(3) || stt;drv(0.5^x);P(1@2) || stt;drv(0.5^x);P(x>1) || stt;drv(0.5^x);P(x>2&x<4) || stt;drv(0.5^x);P(x<2@x>5) || stt;drv(0.5^x);P(x>6|x>2) || stt;drv(0.5^x);M(2) ||. On the other hand, if \(f(x)=(\frac{3}{4})^x\), since \(\sum_{x=1}^∞(\frac{3}{4})^x=3\), normalize \(f(x)\) to \(\frac{1}{3}(\frac{3}{4})^x\), and then use the normalized one as the input pdf. Check ism;0.75**x;x;1;oo || ism;0.75**x/3;x;1;oo || stt;drv(0.75**x/3);P(x>2) || stt;drv(0.75**x/3);P(4<=x<8) ||. Make sure your input pdf "f(x)" is positive and sums to 1.
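The same checks can be done symbolically outside the program, for example with sympy (an assumption: sympy is a separate library, not this program's "ism"/"drv" syntax): verify that a pmf sums to 1, compute a mean and a tail probability, and normalize a pmf that does not sum to 1:

import sympy as sp

x = sp.symbols('x', positive=True, integer=True)

f = sp.Rational(1, 2)**x
print(sp.summation(f, (x, 1, sp.oo)))        # 1, so f is a valid pmf on x = 1, 2, ...
print(sp.summation(x * f, (x, 1, sp.oo)))    # E(X) = 2
print(sp.summation(f, (x, 5, sp.oo)))        # P(X > 4) = 1/16

g = sp.Rational(3, 4)**x
print(sp.summation(g, (x, 1, sp.oo)))        # 3, so g must be normalized
print(sp.summation(g / 3, (x, 1, sp.oo)))    # 1 after dividing by 3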
Note that in the above examples, the expression "P(3)" stands for the probability P(X = 3) for a discrete r.v. X, "P(1@2)" for P(X = 1 or X = 2), "P(x>1)" for P(X > 1), "P(x>2&x<4)" for P(2 < X < 4), "P(4<=x<8)" for P(4 ≤ X < 8), and "P(x<2@x>5)" for P(X < 2 ∪ X > 5). The expression "rvs(10)" simulates 10 random numbers from the distribution, and "q(0.5)" gives the 50th percentile or 0.5th quantile.
2. Continuous random variables and probability density functions (pdf) The range of a continuous r.v. \(X\) contains an interval of real numbers. The cdf \(F_X(x)=P(X≤x)=\int_{-∞}^xf(t)dt\) of \(X\) is continuous, and its derivative \(F'_X(x)=f_X(x)\), which is called the probability density function (pdf) of \(X\), exists everywhere except at possibly a finite number of points. If \(X\) is continuous, \(P(X=x)=0\) for all \(x,\int_{-∞}^∞f(x)dx=1\) and \(f(x)≥0\), and \(P(a< X< b)=P(a≤X≤b)=P(a≤X< b)=P(a< X≤b)=\int_a^bf(x)dx=F_X(b)-F_X(a)\).
Usage for continuous distributions Use "crv(f(x),a,b)" (continuous random variable) to construct the distribution function for a continuous r.v. \(X\), where "f(x)" is the nonnegative pdf of \(X\), and "a" and "b" represent the lower and upper limits of the interval [a, b] of real numbers that \(X\) can take. Use "oo" for infinity "\(∞\)" and "-oo" for "-\(∞\)". Make sure your input pdf "f(x)" is nonnegative and integrated to 1 over the given interval. Check stt;crv(exp(-x),0,oo);P(x>2) || stt;crv(exp(-x),0,oo);P(x>3|x>1) || stt;crv(exp(-x),0,oo);P(x<1@x>5) || stt;crv(2*exp(-2*x),0,oo);V(x) || stt;crv(3*x^2,0,1);cdf(x) || stt;crv(2*x/3,1,2);P(x>1.5) || stt;crv(3*x^2,0,1);P(x<0.7) || stt;crv(3*x^2,0,1);E(x) || stt;crv(2*x/3,1,2);V(x) || stt;crv(2*x/3,1,2);cdf(x) || stt;crv(1,0,1);V(x) || stt;crv(1,0,1);rvs(30) || stt;crv(1,0,1);q(0.3) ||.
Verify that the pdfs in the above examples integrate to 1 over the given intervals by itg;exp(-x);x;0;oo || itg(2*exp(-2*x),x,0,oo) || itg(3*x^2,x,0,1) || itg;2*x/3;x;1;2 || itg;1;x;0;1 ||.
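Equivalently, here is a sympy sketch (an assumption: sympy is a separate library, not this program's "itg" syntax) that verifies one of the pdfs above and computes its cdf, mean, and a probability:

import sympy as sp

x, t = sp.symbols('x t', positive=True)

f = 3 * t**2                                          # candidate pdf on [0, 1], written in the dummy variable t
print(sp.integrate(f, (t, 0, 1)))                     # 1, so f is a valid pdf on [0, 1]
print(sp.integrate(f, (t, 0, x)))                     # cdf F(x) = x**3 for 0 <= x <= 1
print(sp.integrate(t * f, (t, 0, 1)))                 # E(X) = 3/4
print(sp.integrate(f, (t, 0, sp.Rational(7, 10))))    # P(X < 0.7) = 343/1000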
3. Joint distributions If \(X,Y\) are two discrete r.v.'s, \(P(X=x,Y=y)=f(x,y)\) is the joint pdf and \(\displaystyle\sum_x\sum_yf(x,y)=1\). The marginal pdf's are defined by \(f_X(x)=P(X=x)=\sum\limits_yf(x,y),f_Y(y)=P(Y=y)=\sum\limits_xf(x,y)\). The joint distribution function or the cdf of \(X,Y\) is \(F(x,y)=P(X≤x,Y≤y)=\displaystyle\sum_{u≤x}\sum_{v≤y}f(u,v)\).
If \(X,Y\) are two continuous r.v.'s with joint pdf \(f(x,y)≥0\), then \(\int_{-∞}^∞\int_{-∞}^∞f(x,y)dxdy=1\), and the joint distribution function is \(F(x,y)=P(X≤x,Y≤y)=\int_{-∞}^y\int_{-∞}^xf(u,v)dudv\). The probability \(P(a< X< b,c< Y< d)=\int_c^d\int_a^bf(x,y)dxdy\). The marginal distributions are defined by \(F_X(x)=P(X≤x)=\int_{-∞}^∞\int_{-∞}^xf(u,v)dudv,F_Y(y)=P(Y≤y)=\int_{-∞}^y\int_{-∞}^∞f(u,v)dudv\). The joint pdf can be obtained by \(f(x,y)=\frac{∂^2F(x,y)}{∂x∂y}\), and the marginal densities can be obtained by \(f_X(x)=\int_{-∞}^∞f(x,y)dy,f_Y(y)=\int_{-∞}^∞f(x,y)dx\).
4. Independent random variables If \(X,Y\) are two discrete r.v.'s and \(P(X=x,Y=y)=P(X=x)P(Y=y)\) or \(f(x,y)=f_X(x)f_Y(y)\), then \(X,Y\) are independent. If \(X,Y\) are continuous r.v.'s and \(P(X≤x,Y≤y)=P(X≤x)P(Y≤y)\), or \(F(x,y)=F_X(x)F_Y(y)\) or \(f(x,y)=f_X(x)f_Y(y)\), then \(X,Y\) are independent. The converse is also true for both discrete and continuous cases.
5. Conditional distributions If \(X,Y\) are discrete r.v.'s, \(f(y|x)=P(Y=y|X=x)=\frac{f(x,y)}{f_X(x)}\) is called the conditional pdf of \(Y\) given \(X\), and \(f(x|y)=\frac{f(x,y)}{f_Y(y)}\). Similarly, define conditional pdf as \(f(x|y)=\frac{f(x,y)}{f_Y(y)}\) and \(f(y|x)=\frac{f(x,y)}{f_X(x)}\) for continuous r.v.'s \(X,Y\).
Example 1 Suppose the joint density of \(X,Y\) is \(f(x,y)=Cx^2y\) for \(0≤X≤4,1≤Y≤5\). The constant \(C\) can be determined by \(\int_1^5\int_0^4Cx^2ydxdy=256C=1⇒C=\frac{1}{256}\) by itg;c*x^2*y;x;0;4;y;1;5 ||.
(1) Find the probability \(P(0≤X≤1,2≤Y≤3)=\int_2^3\int_0^1\frac{x^2y}{256}dxdy=\frac{5}{1536}\) by itg;x^2*y/256;x;0;1;y;2;3 ||, and the probability \(P(X>2,Y≤3)=\frac{7}{24}\) by itg;x^2*y/256;x;2;4;y;1;3 ||.
(2) The marginal density of \(X\) is \(f_X(x)=\int_1^5\frac{x^2y}{256}dy=\frac{3x^2}{64}\) by itg;x^2*y/256;y;1;5 for \(0≤X≤4\) and 0 otherwise. The marginal density of \(Y\) is \(f_Y(y)=\int_0^4\frac{x^2y}{256}dx=\frac{y}{12}\) by itg;x^2*y/256;x;0;4 for \(1≤Y≤5\) and 0 otherwise.
(3) The marginal cdf of \(X\) is \(F_X(x)=P(X≤x)=\int_0^x\frac{3t^2}{64}dt=\frac{x^3}{64}\) or \(F_X(x)=\int_0^x\int_1^5\frac{v^2u}{256}dudv\) by itg;v^2*u/256;u;1;5;v;0;x for \(0≤X≤4;F_X(x)=0\) for \(X<0;F_X(x)=1\) for \(X>4\). Similarly, the cdf of \(Y\) is \(F_Y(y)=\int_1^y\frac{t}{12}dt=\frac{y^2-1}{24}=\int_0^4\int_1^y\frac{v^2u}{256}dudv\) by itg;v^2*u/256;u;1;y;v;0;4 for \(1≤Y≤5;F_Y(y)=0\) for \(Y<1;F_Y(y)=1\) for \(Y>5\).
(4) Since \(f(x,y)=f_X(x)f_Y(y)=\frac{3x^2}{64}\frac{y}{12}=\frac{x^2y}{256},X\) and \(Y\) are independent, and the cdf of \(X,Y\) is \(F(x,y)=F_X(x)F_Y(y)\). Thus, \(F(x,y)=0\) for \(x< 0\) or \(y<1;F(x,y)=\frac{x^3}{64}\) for \(0≤x≤4,y>5;F(x,y)=\frac{y^2-1}{24}\) for \(x>4,1≤y≤5;F(x,y)=\frac{x^3}{64}\frac{y^2-1}{24}\) for \(0≤x≤4\) and \(1≤y≤5;F(x,y)=1\) for \(x>4\) and \(y>5\).
(5) The probability \(P(X+Y≤2)=\int_A\frac{x^2y}{256}dxdy\), where the region \(A\) is a triangle enclosed by \(y=2-x\) and \(0≤x≤1,1≤y≤2-x\). Thus, \(P(X+Y≤2)=\int_0^1\int_1^{2-x}\frac{x^2y}{256}dydx=\frac{1}{2560}\) by itg;x^2*y/256;y;1;2-x;x;0;1 ||. Similarly, \(P(X+Y≤5)=\frac{3}{20}\) by itg;x^2*y/256;y;1;5-x;x;0;4 ||.
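The double integrals in this example can also be checked with sympy (an assumption: sympy is a separate library, not this program's "itg" syntax):

import sympy as sp

x, y, c = sp.symbols('x y c', positive=True)

# Joint density f(x, y) = C x^2 y on 0 <= x <= 4, 1 <= y <= 5
print(sp.integrate(c * x**2 * y, (x, 0, 4), (y, 1, 5)))   # 256*c, so C = 1/256

f = x**2 * y / 256
print(sp.integrate(f, (x, 0, 1), (y, 2, 3)))              # P(0<=X<=1, 2<=Y<=3) = 5/1536
print(sp.integrate(f, (y, 1, 2 - x), (x, 0, 1)))          # P(X + Y <= 2) = 1/2560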
Example 2 Suppose \(X,Y\) have a joint density \(f(x,y)=\frac{3x^2+2y}{126}\) for \(0≤X≤3,1≤Y≤4\).
(1) The marginal distribution functions \(F_X(x)=\int_{u=0}^x\int_{v=1}^4\frac{3u^2+2v}{126}dvdu=\frac{3x^3+15x}{126}\) for \(0≤X≤3\), and \(F_X(x)=0\) for \(x<0\) and \(F_X(x)=1\) for \(X>3\). Similarly, \(F_Y(y)=\int_{v=1}^y\int_{u=0}^3\frac{3u^2+2v}{126}dudv=\frac{3y^2+27y-30}{126}\) for \(1≤Y≤4\), and \(F_Y(y)=0\) for \(y<1\) and \(F_Y(y)=1\) for \(Y>4\). Check itg;(3*u^2+2*v)/126;v;1;4;u;0;x || itg;(3*u^2+2*v)/126;u;0;3;v;1;y || itg;(3*x^2+2*y)/126;x;0;3;y;1;4 ||.
(2) The marginal densities \(f_X(x)=F'_X(x)=\frac{9x^2+15}{126},0≤X≤3;f_Y(y)=F'_Y(y)=\frac{6y+27}{126},1≤Y≤4\). Since \(f(x,y)≠f_X(x)f_Y(y),X,Y\) are not independent.
(3) \(P(0< X<1,Y>2)=\int_0^1\int_2^4\frac{3x^2+2y}{126}dydx=\frac{1}{9};P(X>1)=\int_1^3\int_1^4\frac{3x^2+2y}{126}dydx=\frac{6}{7};P(X+Y>2)=1-P(X+Y≤2)=1-\int_{0}^1\int_{y=1}^{2-x}\frac{3x^2+2y}{126}dydx=\frac{1493}{1512}\). Check itg;(3*x^2+2*y)/126;y;2;4;x;0;1 || itg;(3*x^2+2*y)/126;y;1;4;x;1;3 || 1-itg((3*x^2+2*y)/126,y,1,2-x,x,0,1) ||.
(4) The joint distribution function is \(F(x,y)=0\) for \(X<0\) or \(Y<1;F(x,y)=1\) for \(X>3\) and \(Y>4;F(x,y)=\int_0^x\int_1^y\frac{3u^2+2v}{126}dvdu=\frac{x^3(y-1)+x(y^2-1)}{126}\) for \(0≤X≤3,1≤Y≤4;F(x,y)=F_X(x)\) for \(0≤X≤3,y>4;F(x,y)=F_Y(y)\) for \(x>3,1≤Y≤4\). Check itg;(3*u^2+2*v)/126;v;1;y;u;0;x ||.
Example 3 If \(X,Y\) have a joint density \(f(x,y)=1,0≤X≤1,0≤Y≤1\), then \(P(|X-Y|≤0.25)=P(-0.25≤X-Y≤0.25)=1-\int_{0.25}^1\int_0^{y-0.25}dxdy-\int_{0.25}^1\int_0^{x-0.25}dydx=\frac{7}{16}\) by 1-itg(1,x,0,y-1/4,y,1/4,1)-itg(1,y,0,x-1/4,x,1/4,1) ||.
Example 4 If \(X,Y\) have a joint density \(f(x,y)=\frac{3+2xy}{48},0≤x≤2,1≤y≤4\).
(1) \(f(x|y)=\frac{f(x,y)}{f_Y(y)}=\frac{3+2xy}{2(2y+3)}\) by (3+2*x*y)/(48*itg((3+2*x*y)/48,x,0,2)) ||.
(2) \(f(y|x)=\frac{f(x,y)}{f_X(x)}=\frac{3+2xy}{3(5x+3)}\) by (3+2*x*y)/(48*itg((3+2*x*y)/48,y,1,4)) ||.
(3) \(P(Y>2|1< X< 1+dx)=\int_2^4\frac{3+2y}{24}dy=\frac{3}{4}\) by itg;(3+2*y)/24;y;2;4 ||.
Example 5 Suppose \(X,Y\) have a joint density \(f(x,y)=\frac{2x^2y^{-2}}{27}\) for \(1≤X≤4,1≤Y≤x\). Then \(f_X(x)=\int_1^x\frac{2x^2y^{-2}}{27}dy=\frac{2x(x-1)}{27}, f_Y(y)=\int_y^4\frac{2x^2y^{-2}}{27}dx=\frac{2(64-y^3)}{81y^2}\). Check itg;2*x^2/y^2/27;y;1;x || itg;2*x^2/y^2/27;x;y;4 ||, and the integrals of the densities itg;2*x^2/y^2/27;y;1;x;x;1;4 || itg;2*x*(x-1)/27;x;1;4 || itg;2*(64-y^3)/(81*y^2);y;1;4 || all return 1. Since \(f(x,y)≠f_X(x)f_Y(y),X,Y\) are not independent. The conditional distributions are \(f_{Y|X}(y|x)=\frac{x}{(x-1)y^2}, f_{X|Y}(x|y)=\frac{3x^2}{64-y^3}\). Check that the integrals of these conditional densities itg;x/((x-1)*y^2);y;1;x || itg;(3*x^2)/(64-y^3);x;y;4 || both return 1.
3. Expectation, Variance, Covariance, Moment, and Moment Generating Function (MGF)
Definitions The expectation of a r.v. \(X\) is defined as \(E(X)=μ=\sum_xxP(X=x)\) (discrete r.v.) and \(E(X)=μ=\int_{-∞}^∞xf_X(x)dx\) (continuous r.v.). The variance of \(X\) is defined as \(\text{Var}(X)=σ^2=E(X-μ)^2\) and the standard deviation is \(σ=\sqrt{\text{Var}(X)}\). The expectation of the function \(g(X)\) of \(X\) is \(E[g(X)]=\sum_xg(x)P(X=x)\) (discrete r.v.) and \(E[g(X)]=\int_{-∞}^∞g(x)f_X(x)dx\) (continuous r.v.). The \(r\)th moment of \(X\) about the origin is defined as \(E(X^r)=\sum_xx^rP(X=x)\) (discrete r.v.) and \(E(X^r)=\int_{-∞}^∞x^rf_X(x)dx\) (continuous r.v.). The moment generating function is defined as \(M(t)=E(e^{tX})=\sum_xe^{tx}P(X=x)\) (discrete r.v.) and \(M(t)=E(e^{tX})=\int_{-∞}^∞e^{tx}f_X(x)dx\). The characteristic function is defined as \(E(e^{iωX})=\sum_xe^{iωx}P(X=x)\) (discrete r.v.) and \(E(e^{iωX})=\int_{-∞}^∞e^{iωx}f_X(x)dx\) (continuous r.v.).
The covariance of two r.v.'s \(X,Y\) is defined as \(σ_{XY}=E[(X-μ_X)(Y-μ_Y)]=\sum_x\sum_y(x-μ_X)(y-μ_Y)f(x,y)\) (discrete r.v.) and \(σ_{XY}=E[(X-μ_X)(Y-μ_Y)]=\int_{-∞}^∞\int_{-∞}^∞(x-μ_X)(y-μ_Y)f(x,y)dxdy\) (continuous r.v.). The correlation of \(X,Y\) is \(ρ=\frac{σ_{XY}}{σ_Xσ_Y}\).
Usages for mean, variance, standard deviation, moment, and MGF Use "E(x)", "V(x)", and "std(x)" to calculate the expectation, variance and standard deviation of a user-defined r.v. "x", and use "E(g(x))", "V(g(x))", and "std(g(x))" to compute the corresponding values of "g(x)". For common distributions, simply use "m, v, std" to compute the mean, variance and standard deviation. Compute the moment generating function by "E(exp(t*x))" and the characteristic function by "E(exp(I*w*x))", where "I" represents the imaginary unit \(i\) such that \(i^2=-1\). Use "M(r)" or "E(X^r)" to compute the r-th moment \(E(X^r)\), where "r" is a positive integer.
Example 1 The density of standard normal distribution is \(f(x)=\frac{1}{\sqrt{2π}}e^{-\frac{x^2}{2}}\), and \(E(x)=\int_{-∞}^∞xf(x)dx=0\) because \(xf(x)\) is odd. Using integration by parts, \(E(x^2)=\frac{1}{\sqrt{2π}}\int_{-∞}^∞x^2e^{-\frac{x^2}{2}}dx =-\frac{1}{\sqrt{2π}}\int_{-∞}^∞xde^{-\frac{x^2}{2}} =-\frac{1}{\sqrt{2π}}xe^{-\frac{x^2}{2}}|_{-∞}^∞+\frac{1}{\sqrt{2π}} \int_{-∞}^∞e^{-\frac{x^2}{2}}dx=1\). Check itg;e^(-x^2/2)/(2*pi)^(1/2);x;-oo;oo || itg;x*e^(-x^2/2)/(2*pi)^(1/2);x;-oo;oo || itg;x^2*e^(-x^2/2)/(2*pi)^(1/2);x;-oo;oo || stt;n(0,1);E(x) || stt;n(0,1);E(X^2) || stt;n(0,1);V(x) || stt;n(0,1);std(x) || stt;n(0,1);M(3) || stt;n(0,1);M(6) || stt;n(0,1);E(exp(t*x));x || stt;n(0,1);E(exp(I*w*x));x || stt;n(0,1);V(a*x);x || stt;n(0,1);V(x+a);x || stt;n(0,1);E(2*x+3) || stt;n(0,1);E(a*x+b);x || stt;n(0,1);E(x^3) || stt;n(0,1);E(x^6) ||.
Note that if the expressions "E(.)", "V(.)", "std(.)" involve more than one variable, add a letter after the expression to indicate which is the random variable. For example, the expression "E(a*x+b);x" indicates "x" is the random variable and "a" and "b" are treated as constants.
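As an independent symbolic check of Example 1, a sympy sketch (an assumption: sympy is a separate library) verifies E(X) = 0 and E(X^2) = 1 and derives the moment generating function of the standard normal:

import sympy as sp

x, t = sp.symbols('x t', real=True)

f = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)           # standard normal density
print(sp.integrate(f, (x, -sp.oo, sp.oo)))           # 1
print(sp.integrate(x * f, (x, -sp.oo, sp.oo)))       # E(X) = 0
print(sp.integrate(x**2 * f, (x, -sp.oo, sp.oo)))    # E(X^2) = 1

mgf = sp.integrate(sp.exp(t * x) * f, (x, -sp.oo, sp.oo))
print(sp.simplify(mgf))                              # should simplify to exp(t**2/2)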
Example 2 Suppose the density function of a r.v. \(X\) is \(f(x)=3e^{-3x}\) for \(x≥0\). Then \(E(X)=\frac{1}{3};E(X^2)=\frac{2}{9};E(X^3)=\frac{2}{9};E(X^4)=\frac{8}{27}\); \(E(e^{tx})=\frac{3}{3-t}=\frac{1}{1-t/3}=1+\frac{t}{3}+\frac{t^2}{3^2}+\frac{t^3}{3^3}+\frac{t^4}{3^4}+\cdots\). Compute them directly by definite integrals: itg;x*3*exp(-3*x);x;0;oo || itg;x^2*3*exp(-3*x);x;0;oo || itg;x^3*3*exp(-3*x);x;0;oo || itg;x^4*3*exp(-3*x);x;0;oo || itg;exp(t*x)*3*exp(-3*x);x;0;oo ||. Note that \(X\) has an exponential distribution with parameter λ = 3, which can be entered as "epn(3)". Check stt;epn(3);M(1) || stt;epn(3);M(2) || stt;epn(3);M(3) || stt;epn(3);M(4) ||. You can use the function "epn()" to calculate more results: stt;epn(0.5);P(x>1) || stt;epn(0.5);q(0.6) || stt;epn(2);V(x) || stt;epn(10);s || stt;epn(1);k || stt;epn(4);cdf(x) || stt;epn(0.6);svf(2) || stt;epn(1);E(exp(I*w*x));x ||.
Example 3 Suppose \(X,Y\) have a joint density \(f(x,y)=\frac{xy}{648}\) for \(0≤x≤6,3≤y≤9\). Then \(E(X)=4;E(Y)=\frac{13}{2};E(XY)=26;E(X^2)=18;E(Y^2)=45;\text{Var}(X)=E(X-4)^2=2;\text{Var}(Y)=E(Y-\frac{13}{2})^2=\frac{11}{4};\text{Cov}(X,Y)=E[(X-4)(Y-\frac{13}{2})]=0;E(3X+2Y)=3E(X)+2E(Y)=25;\text{Var}(X+Y)=2+\frac{11}{4}=\frac{19}{4};E(X|Y)=E(X)=4;E(Y|X)=E(Y)=\frac{13}{2}\). Check itg;x^2*y/648;x;0;6;y;3;9 || itg;x*y^2/648;x;0;6;y;3;9 || itg;x^2*y^2/648;x;0;6;y;3;9 || itg;x^3*y/648;x;0;6;y;3;9 || itg;x*y^3/648;x;0;6;y;3;9 || itg;(x-4)^2*x*y/648;x;0;6;y;3;9 || itg;(y-13/2)^2*x*y/648;x;0;6;y;3;9 || itg;(x-4)*(y-13/2)*x*y/648;x;0;6;y;3;9 || itg;(3*x+2*y)*x*y/648;x;0;6;y;3;9 || itg(x*x*y/(648*itg(x*y/648,x,0,6)),x,0,6) || itg(y*x*y/(648*itg(x*y/648,y,3,9)),y,3,9) ||.
Example 4 Suppose \(X,Y\) have a joint density \(f(x,y)=\frac{x^2y+xy^2}{81}\) for \(0≤x≤3,0≤y≤3\). Then \(E(X)=E(Y)=\frac{17}{8};\text{Cov}(X,Y)=\frac{-1}{64};\text{Var}(X)=\text{Var}(Y)=\frac{139}{320};E(XY)=\frac{9}{2};\text{Var}(X+Y)=2(\frac{139}{320})+2(\frac{-1}{64})=\frac{67}{80};E(X|y)=\frac{4y+9}{2y+4};E(Y|x)=\frac{4x+9}{2x+4}\). Check itg;x*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg;y*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg;(x-17/8)*(y-17/8)*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg;(x-17/8)^2*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg;(y-17/8)^2*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg;x*y*(x^2*y+x*y^2)/81;x;0;3;y;0;3 || itg(x*(x^2*y+x*y^2)/(81*itg((x^2*y+x*y^2)/81,x,0,3)),x,0,3) || itg(y*(x^2*y+x*y^2)/(81*itg((x^2*y+x*y^2)/81,y,0,3)),y,0,3) ||. Find \(\text{Var}(X|y)\) by itg((x-(4*y+9)/(2*y+4))^2*(x^2*y+x*y^2)/(81*itg((x^2*y+x*y^2)/81,x,0,3)),x,0,3) || and \(\text{Var}(Y|x)\) by itg((y-(4*x+9)/(2*x+4))^2*(x^2*y+x*y^2)/(81*itg((x^2*y+x*y^2)/81,y,0,3)),y,0,3) ||.
4. Mode, Median, Quantile, Skewness, Kurtosis, and Entropy
Mean or expectation provides a measure of central tendency for the values of a distribution. Mode and median are two other measures of central tendency. The mode of a discrete r.v. is the value with the greatest probability of occurring. If a distribution has two, three or more such mode values, we say the distribution is bimodal, trimodal and multi-modal, respectively. The mode of a continuous r.v. is the value at which its pdf has a relative maximum.
The median is the value \(x\) such that \(P(X< x)≤\frac{1}{2}\) and \(P(X>x)≤\frac{1}{2}\). If \(X\) is continuous, \(P(X>x)=P(X< x)=\frac{1}{2}\). The quantile corresponding to a probability \(α\) is the value \(x_α\) such that the area to the left of \(x_α\) is \(α\), that is, \(P(X≤x_α)=α\). We also call the \(α\)-th quantile the 100\(α\)-th percentile. Thus, the 25th percentile is \(x_{0.25}\), the 50th percentile or median is \(x_{0.5}\), and the 75th percentile is \(x_{0.75}\). The difference \(x_{0.75}-x_{0.25}\) is called the interquartile range (IQR).
In information theory, the entropy of a discrete r.v. \(X\) is the average level of information or uncertainty inherent to the possible outcomes of \(X\), and it can be calculated by \(H(X)=E(-\log p(x))=-\sum_xp(x)\log p(x)\).
Skewness and kurtosis Skewness measures the degree of asymmetry (or departure from symmetry) of a distribution, and it is defined as \(α_3=\frac{E(X-μ)^3}{σ^3}\). If \(α_3>0\), we say the distribution is skewed to the right; if \(α_3<0\), it is skewed to the left. For a symmetric distribution (e.g., normal), \(α_3=0\). Kurtosis, which is defined as \(α_4=\frac{E(X-μ)^4}{σ^4}\), measures the degree of peakedness of a distribution. Large kurtosis means the distribution may have its values concentrated around its mean (so that the distribution shows a sharp peak near its mean). Kurtosis is usually taken relative to the kurtosis of the standard normal distribution, which is 3.
Usages for median, skewness, kurtosis, and entropy For common distribution functions such as normal, exponential, uniform, and Poisson distributions, the letters "m, v, s, k, std, med, h" represent mean, variance, skewness, kurtosis, standard deviation, median, and entropy respectively. For user-defined distributions, use "E(x), V(x), s(x), k(x), std(x), med(x), H(x)". Use "q(α)" to compute the α-th quantile or 100α-th percentile. For example, stt;n(0,1);k || stt;n(1,1);s || stt;epn(0.6);k || stt;poi(7);s || stt;uni(1,3);k || stt;n(0,1);q(0) || stt;n(0,1);q(1) || stt;n(0,1);q(0.05) || stt;n(0,1);q(0.975) || stt;crv(2*x,0,1);s(x) || stt;crv(3*x^2,0,1);k(x) || stt;poi(5);q(0.5) || stt;ber(0.5);h ||.
Example 1 Suppose \(X\) has a density \(f(x)=\frac{x(16-x^2)}{64},0≤X≤4\). The median \(X_{0.5}\) can be obtained by \(P(X>X_{0.5})=P(X< X_{0.5})=0.5⇒\int_0^m\frac{x(16-x^2)}{64}dx=0.5⇒m^4-32m^2+128=0⇒X_{0.5}=m=2\sqrt{4-2\sqrt{2}}≈2.16\). Check slv(itg(x*(16-x^2)/64,x,0,m)-0.5,m) || stt;crv(x*(16-x^2)/64.0,0,4.0);med(x) ||. The mode is the value of \(X\) where the density \(f(x)\) has a relative maximum (continuous case). It can be found by setting \(f'(x)=\frac{16-3x^2}{64}=0⇒x=\frac{4\sqrt{3}}{3}≈2.31\). Check slv(dif(x*(16-x^2)/64,x),x) ||. The skewness is about -0.125 by stt;crv(x*(16-x^2)/64,0,4);s(x), which means the distribution is slightly skewed to the left. The kurtosis is about 2.18 by stt;crv(x*(16-x^2)/64,0,4.0);k(x), which means the distribution is less peaked than the normal distribution, whose kurtosis is 3. The 75th percentile is about 2.83, the 25th percentile is about 1.46, and the interquartile range is 1.37. Check stt;crv(x*(16-x^2)/64,0,4.0);q(0.75) || stt;crv(x*(16-x^2)/64,0,4.0);q(0.25) ||. The 10th percentile is about 0.91 by stt;crv(x*(16-x^2)/64,0,4.0);q(0.1) ||. The mean absolute deviation is \(E(|X-\frac{32}{15}|)=0.74\) by stt;crv(x*(16-x^2)/64,0,4);E(abs(x-32/15)) ||, where \(E(X)=\frac{32}{15}\) by stt;crv(x*(16-x^2)/64,0,4);E(x) ||.
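The median and mode found in Example 1 can also be checked symbolically, for example with sympy (an assumption: sympy is a separate library, not this program's "slv"/"dif" syntax):

import sympy as sp

x, m = sp.symbols('x m', positive=True)
f = x * (16 - x**2) / 64      # density on [0, 4]

# Median: solve F(m) = 1/2; the root lying in [0, 4] (about 2.16) is the median
print(sp.solve(sp.integrate(f, (x, 0, m)) - sp.Rational(1, 2), m))

# Mode: the density has a relative maximum where f'(x) = 0
print(sp.solve(sp.diff(f, x), x))   # 4*sqrt(3)/3, about 2.31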
Example 2 Suppose \(X\) is a discrete random variable with pdf \(P(X=x)=(\frac{1}{2})^x,x=1,2,\cdots\). The mode is the value of \(X\) with the greatest probability of occurring, so the mode is 1 with \(P(X=1)=\frac{1}{2}\). Since \(P(X>1)=P(X<2)=\frac{1}{2}\), the median is any value between 1 and 2, say 1.5. The mean \(μ=\sum_{x=1}^∞x(\frac{1}{2})^x⇒μ-\frac{1}{2}μ=r+r^2+r^3+\cdots=1⇒μ=2\), where \(r=\frac{1}{2}\). Since this is the geometric distribution with p = 0.5, check stt;geo(1/2);m || stt;geo(0.5);med || stt;geo(0.5);s || stt;geo(0.5);P(x>1) || stt;geo(0.5);P(x<2) ||.
5. Common Probability Distributions
I. Distributions for Discrete Random Variables
1. Bernoulli A r.v. X ~ Bernoulli(p) with "success" probability of an experiment P(X = 1) = p and "failure" probability P(X = 0) = 1 - p. Use "ber(p)" for a Bernoulli r.v. for 0 < p < 1. Check stt;ber(p);cdf(x) || stt;ber(p);pdf(x) || stt;ber(0.5);h ||.
2. Binomial The binomial distribution is \(P(X=x)=(_x^n)p^x(1-p)^{n-x}\) for \(x=0,1,2,\cdots,n\). It corresponds to successive terms in the binomial expansion \((p+q)^n=p^n+(_1^n)p^{n-1}q+(_2^n)p^{n-2}q^2+\cdots+q^n=\sum_{x=0}^n(_x^n)p^xq^{n-x}\) for \(q=1-p\). If \(n=1\), the distribution reduces to the Bernoulli distribution. Use "bin(n,p)" for a binomial distribution with "n" trials and the probability of success "p" (0 < p < 1) for each trial. For example, stt;bin(20,0.5);pdf(x) || stt;bin(20,0.5);m || stt;bin(20,0.5);v || stt;bin(20,0.5);std || stt;bin(20,0.5);s || stt;bin(20,0.5);k || stt;bin(15,0.6);P(x>7) || stt;bin(15,0.6);h ||.
Examples
(1) The probability distribution of boys in families with 4 children (assuming equal probabilities for boys and girls) is \(P(X=x)=(_x^4)0.5^x(1-0.5)^{4-x}=(_x^4)0.5^4\). Check stt;bin(4,0.5);pdf(x) || stt;bin(4,0.5);P(1) || stt;bin(4,0.5);P(3).
(2) For 100 tosses of a fair coin, the expectation is np = 50, the variance is np(1 - p) = 25, and the probability P(X ≤ 45) ≈ 0.184. Check stt;bin(100,0.5);m || stt;bin(100,0.5);v || stt;bin(100,0.5);cdf(45) ||. The probability can be approximated by the central limit theorem by stt;n(50,5);P(x<=45) ||.
3. Geometric For a sequence of Bernoulli\((p)\) trials, the probability of observing the first "success" is \(P(X=x)=p(1-p)^{x-1}\) for \(X=1,2,\cdots\), where \(X\) is a geometric r.v.. Use "geo(p)" for a geometric distribution with a parameter 0 < p < 1. Check the pdf, cdf, mean, and variance by stt;geo(p);pdf(x) || stt;geo(p);cdf(x) || stt;geo(p);E(x) || stt;geo(p);V(x) ||.
If \(X\) is a geometric r.v., it has the memoryless property \(P(X>i+j|X>i)=P(X>j)\). The geometric distribution is the only discrete distribution that possesses this property. For example, if \(X\sim\) geometric\((0.5),P(X>2+3|X>2)=P(X>3)\) = 0.125. Check stt;geo(0.5);P(x>4|x>1) || stt;geo(0.5);P(x>5|x>2) || stt;geo(0.5);P(x>7|x>4) || stt;geo(0.5);P(x>3) || stt;geo(0.5);svf(3) ||.
4. Negative binomial For a sequence of independent Bernoulli\((p)\) trials, the probability of observing the \(k\)th "success" on trial \(x\) is \(P(X=x)=\binom {x-1}{k-1}p^k(1-p)^{x-k}\) for \(x=k,k+1,\cdots\), where \(X\) is a negative binomial r.v. with parameters \(k,p\). Use "nbn(k,p)" for a negative binomial distribution with parameters k ≥ 1 and 0 < p < 1. Check the pdf, mean, and variance by stt;nbn(k,p);pdf(x) || stt;nbn(k,p);E(x) || stt;nbn(k,p);V(x) || stt;nbn(5,0.5);P(6) || stt;nbn(5,0.5);P(7) ||.
5. Poisson The pmf of a Poisson r.v. is \(P(x)=\frac{e^{-λ}λ^x}{x!}\) for \(x≥0,λ>0\). Use "poi(λ)" for a Poisson distribution with parameter λ. Check stt;poi(r);pdf(x) || stt;poi(r);cdf(x) || stt;poi(r);E(x) || stt;poi(r);V(x) || stt;poi(10);P(7) || stt;poi(12);P(x>7|x>5) || stt;poi(21);P(x>5&x<13) || stt;poi(34);P(x>15@x<8) || stt;poi(6);h ||.
The Poisson distribution can be used to approximate the binomial distribution Bin\((n,p)\) when \(n\) is large and \(p\) is small. Suppose \(λ\) is the expected number of occurrences within a period of time. Divide the time interval into \(n\) equal subintervals, so that each subinterval has an expected number of \(\frac{λ}{n}\) occurrences. Assume the probability of one occurrence in a subinterval is \(\frac{λ}{n}\), which corresponds to a Bernoulli random variable \(X_i\sim\) Ber\((\frac{λ}{n})\). Then Bin\((n,p)≈\) Poisson\((np)\). To see this, \(Y_n=\displaystyle\sum_{i=1}^nX_i\sim\text{ Bin}(n,\frac{λ}{n}),\lim_{n\to ∞}P(Y_n=k)=\lim_{n\to ∞}\frac{n!}{k!(n-k)!}(\frac{λ}{n})^k(1-\frac{λ}{n})^{n-k}=\lim_{n\to ∞}\frac{λ^k}{k!}\frac{n!}{(n-k)!(n-λ)^k}(1-\frac{λ}{n})^n=e^{-λ}\frac{λ^k}{k!}\). Check stt;bin(20,0.1);P(3) || stt;poi(2);P(3) ||.
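A quick numerical illustration of this approximation with scipy (an assumption: scipy is a separate library), comparing binomial and Poisson probabilities of the same event:

from scipy import stats

print(stats.binom(20, 0.1).pmf(3))     # about 0.190
print(stats.poisson(2).pmf(3))         # about 0.180

# The approximation improves as n grows with np held fixed at 2
print(stats.binom(200, 0.01).pmf(3))   # about 0.181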
6. Hypergeometric One important application of the hypergeometric distribution is sampling without replacement. Suppose there are \(N\) objects (population size), \(K\) of which are of interest. Randomly draw a sample of \(n\) objects from \(N\), the number \(k\) objects of interest in the sample has a hypergeometric distribution with a pmf \(P(X=k)=\frac{(_k^K)(_{n-k}^{N-K})}{(_n^N)}\) for \(\max(0,n+K-N)≤k≤\min(K,n)\). Use "hyp(N, K, n)" for a hypergeometric distribution with parameters "N, K, n", where K ≤ N and n ≤ N. Check stt;hyp(20,7,10);m || stt;hyp(20,7,10);v || stt;hyp(20,7,10);pdf(x) || stt;hyp(20,7,10);cdf(x) || stt;hyp(20,7,10);P(2) || stt;hyp(20,7,10);P(x>=2&x<5) || stt;hyp(20,7,10);P(x<=5|x>2) ||.
7. Discrete uniform A r.v. \(X\) is a discrete uniform r.v. if its pdf is \(P(X=x)=\frac{1}{n}\) for an integer \(x\) satisfying \(1≤x≤n\). Note that the r.v. \(X\) takes an integer between 1 and \(n\) with equal probability. Use "die(n)" for a discrete uniform distribution with parameter "n" a positive integer. Check stt;die(5);m || stt;die(5);v || stt;die(6);P(x>3) || stt;die(6);v || stt;die(10);P(x<5) || stt;die(20);std || stt;die(8);h || stt;die(15);q(0.5) || stt;die(9);cdf(x) ||. For small "n", you can use "drv" to get the same results. Check stt;drv((1,2,3,4,5),(1/5,1/5,1/5,1/5,1/5));pdf(x) || stt;drv((0,1,2,3,4),(1/5,1/5,1/5,1/5,1/5));cdf(x) || stt;drv((2,3,4,5,6),(1/5,1/5,1/5,1/5,1/5));E(x) || stt;drv((1,3,5,7,9),(1/5,1/5,1/5,1/5,1/5));V(x) ||.
8. Multinomial The multinomial distribution is the generalization of the binomial distribution. The pmf of the \(k\) r.v.'s \(X_1,\cdots,X_k\) in a total of \(n\) trials is \(P(X_1=n_1,X_2=n_2,\cdots,X_k=n_k)=\frac{n!}{n_1!n_2!\cdots n_k!}p_1^{n_1}p_2^{n_2}\cdots p_k^{n_k}\), where \(n=n_1+\cdots+n_k\) and \(\sum_{i=1}^kp_i=1\). Note that \(n_1,\cdots,n_k\) are the counts of \(k\) mutually exclusive events (categories) represented by \(X_1,\cdots,X_k\) in the \(n\) trials. Use "mni(n, p1, p2, ..., pk)" to create a multinomial distribution, where "n" is the total number of trials and the "k" probabilities "p1, p2, ..., pk" must sum to 1. For example, stt;mni(10,0.2,0.3,0.5);cov || stt;mni(15,0.2,0.3,0.5);h || stt;mni(12,1/3,1/3,1/3);pdf(3,4,5) || stt;mni(10,0.2,0.3,0.5);rvs(20) ||. Note that only the keywords "cov", "h", "rvs(n)" and "pdf(n1, n2, ..., nk)" are available for multinomial distributions. To obtain the pmf for a set of counts by "pdf(n1, n2, ..., nk)", the sum of the counts "n1, n2, ..., nk" for the k categories must equal n, the total number of trials.
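For reference, the multinomial pmf and covariance can also be reproduced with Python/SciPy (a sketch outside the mni syntax; the covariance of a multinomial(n, p) is \(n(\text{diag}(p)-pp^T)\)):

```python
# Sketch: multinomial pmf for counts summing to n, and the covariance matrix.
import numpy as np
from scipy.stats import multinomial

n, p = 12, np.array([1/3, 1/3, 1/3])
print(multinomial(n, p).pmf([3, 4, 5]))   # counts must sum to n = 12
print(n * (np.diag(p) - np.outer(p, p)))  # covariance matrix n*(diag(p) - p p^T)
```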
II. Distributions for Continuous Random Variables
1. Normal (Gaussian) Use "n(μ, σ)" for a normal distribution with mean μ and standard deviation σ. Find the pdf, mean, variance, and standard deviation of a normal distribution N(μ, σ\(^2\)) by stt;n(u,s);pdf(x) || stt;n(u,s);E(x) || stt;n(u,s);V(x) || stt;n(u,s);std(x) ||. If X ~ N(0, 1), P(-1 < X < 1) ≈ 0.68; P(-2 < X < 2) ≈ 0.95; P(-3 < X < 3) ≈ 0.997; P(X < 1.96) = 0.975; P(X > 1.96) = 0.025. Check stt;n(0,1);P(x>-1&x<1) || stt;n(0,1);P(x>-2&x<2) || stt;n(0,1);P(x>-3&x<3) || stt;n(0,1);cdf(1.96) || stt;n(0,1);svf(1.96) ||. The moment generating function is \(M(t)=e^{\frac{t^2}{2}}\) by stt;n(0,1);E(exp(t*x));x ||, and the characteristic function is \(ϕ(w)=e^{-\frac{w^2}{2}}\) by stt;n(0,1);E(exp(I*w*x));x ||. Check other examples stt;n(1,2);E(exp(t*x));x || stt;n(1,2);P(x>-2&x<1) || stt;n(2,3);E(exp(I*w*x));x || stt;n(2,3);s || stt;n(2,3);k || stt;n(3,4);q(0.2) ||.
2. Exponential The pdf of exponential distribution is \(f_X(x)=λe^{-λx}\) for \(x≥0,λ>0\). Use "epn(λ)" for an exponential distribution with parameter "λ". Check stt;epn(t);pdf(t) || stt;epn(t);cdf(t) || stt;epn(t);E(t) || stt;epn(t);V(t) || stt;epn(t);E(exp(a*t)) || stt;epn(1);P(x>1) || stt;epn(2);P(x>3|x>1) ||.
The exponential distribution has the memoryless property \(P(X>s+t|X>s)=P(X>t)\) for \(s>0,t≥0\). For example, if X ~ exponential(0.8), P(X > 3| X > 1) = P(X > 2) = 0.2019 by stt;epn(0.8);P(x>3|x>1) || stt;epn(0.8);P(x>2) ||. Similarly check the results stt;epn(0.8);P(x>4|x>2) || stt;epn(0.8);P(x>2.5|x>0.5) || stt;epn(0.8);P(x>5|x>3) ||.
3. Uniform The density of a uniform r.v. \(X\) on an interval \([a,b]\) is \(f_X(x)=\frac{1}{b-a}\). Use "uni(a, b)" for a uniform distribution with lower bound "a" and upper bound "b", where a < b. Check stt;uni(0,1);pdf(x) || stt;uni(0,1);cdf(x) || stt;uni(0,1);v || stt;uni(0,1);E(exp(t*x));x || stt;uni(2,5);cdf(x) || stt;uni(2,4);P(x>3) ||.
4. Student-t Use "t(n)" for a Student-t distribution with degrees of freedom "n", a positive integer. Display the pdf of the t-distribution \(t_n\) with degrees of freedom \(n\) by "stt;t(n);pdf(x)". For example, "stt;t(20);pdf(x)" returns the pdf of \(t_{20}\). Check nit;gamma(21/2)/(gamma(10)*(20*pi)^(0.5))*(1+x^2/20)^(-21/2);x;-oo;0 || stt;t(15);m || stt;t(18);v || stt;t(23);P(x>2) || stt;t(17);P(x<-2.5) || stt;t(1);q(0.025) || stt;t(2);q(0.975) ||.
5. Chi-Square distribution \(χ^2(n)\) is a special case of Gamma distribution with parameters \(a=\frac{n}{2},b=2\), where \(n\) is the degree of freedom. The pdf is \(χ_n^2(x)=\frac{x^{\frac{n}{2}-1}e^{-\frac{x}{2}}}{2^{\frac{n}{2}}Γ(\frac{n}{2})}\) for \(x>0\). Use "chi(n)" for a Chi-square distribution with degree of freedom "n" a positive integer. Check stt;chi(n);pdf(x) || stt;gam(n/2,2);pdf(x) || stt;chi(5);m || stt;gam(5/2,2);m || stt;chi(5);v || itg;x^(n/2-1)*e^(-x/2);x;0;oo || stt;chi(7);q(0.975) || stt;chi(1);P(x>3.5) || stt;chi(2);P(x<1) ||.
6. F-Distribution Use "f(m,n)" to create an F-distribution with degrees of freedom "m" and "n". For example, compute the pdf, mean, variance of the F-distribution F(5,8) by stt;f(5,8);pdf(x) || stt;f(5,8);m || stt;f(5,8);v || stt;f(5,8);s || stt;f(5,8);med || stt;f(6,9);q(0.05) || stt;f(9,11);P(x>2) ||.
7. Gamma The Gamma function \(Γ(a)=\int_0^∞x^{a-1}e^{-x}dx\) can be evaluated by "gamma(a)" for \(a>0\) or "itg;x^(a-1)*exp(-x);x;0;oo". Check gamma(1/2) || gamma(18) ||. The density of a Gamma r.v. is \(f_X(x)=\frac{x^{a-1}e^{-\frac{x}{b}}}{b^aΓ(a)}\) for \(x>0,a>0,b>0\). Use "gam(a, b)" for a Gamma distribution with positive parameters "a" and "b". Check stt;gam(a,b);pdf(x) || stt;gam(a,b);cdf(x) || stt;gam(a,b);E(x) || stt;gam(a,b);V(x) || stt;gam(a,b);E(exp(t*x));x || stt;gam(2,3);m || stt;gam(0.5,1);v || stt;gam(1,2);P(x>2) || stt;gam(1,2);P(x<1) || stt;gam(2,6);P(x>1) || stt;gam(3,3);P(x>1&x<4) ||. The exponential distribution is a special Gamma distribution with \(a=1\), and the Chi-square distribution is a Gamma distribution with \(a=\frac{n}{2},b=2\). Check stt;gam(1,b);pdf(x) || stt;gam(n/2,2);pdf(x) ||.
8. Beta Let \(X_1\sim\) Gamma\((k,b),X_2\sim\) Gamma\((m,b),Y=\frac{X_1}{X_1+X_2}\). Then \(Y\) follows a Beta\((k,m)\) distribution with pdf \(f_Y(y)=\frac{Γ(k+m)}{Γ(k)Γ(m)}y^{k-1}(1-y)^{m-1}\) for \(0< y <1\). Thus, \(E(Y)=\frac{Γ(k+m)}{Γ(k)Γ(m)}\int_0^1yy^{k-1}(1-y)^{m-1}dy=\frac{Γ(k+m)}{Γ(k)Γ(m)}\frac{Γ(k+1)Γ(m)}{Γ(k+m+1)}=\frac{k}{k+m}\),
\(E(Y^2)=\frac{Γ(k+m)}{Γ(k)Γ(m)}\int_0^1y^2y^{k-1}(1-y)^{m-1}dy=\frac{(k+1)k}{(k+m+1)(k+m)}\), and \(\text{Var}(Y)=E(Y^2)-E(Y)^2=\frac{km}{(k+m+1)(k+m)^2}\). The pdf of \(1-Y=\frac{X_2}{X_1+X_2}\) is a Beta\((m,k)\) distribution.
In general, the pdf of a r.v. \(X\sim\) Beta\((a,b)\) is \(f(x)=\frac{1}{B(a,b)}x^{a-1}(1-x)^{b-1}\), where \(B(a,b)=\int_0^1x^{a-1}(1-x)^{b-1}dx=\frac{Γ(a)Γ(b)}{Γ(a+b)}\) is the Beta function for \(a>0,b>0\). Use "beta(a,b)" for evaluating the Beta function, and use "bet(a, b)" for a Beta distribution with positive parameters "a" and "b". Check beta(0.5,1) || beta(2,3) || beta(3,5)-gamma(3)*gamma(5)/gamma(8) ||.
Examples If Y ~ Beta(4, 3), P(Y ≤ 0.5) = 0.3438 by nit;gamma(7)/(gamma(4)*gamma(3))*y^3*(1-y)^2;y;0;0.5 || stt;bet(4,3);pdf(x) || stt;bet(4,3);cdf(x) || stt;bet(4,3);cdf(0.5) || stt;bet(4,3);m || stt;bet(4,3);v ||. Compute P(Y > 0.6) for Y ~ Beta(2.5, 3) by nit;gamma(11/2)/(gamma(5/2)*gamma(3))*y^(3/2)*(1-y)^2;y;0.6;1 || stt;bet(2.5,3);pdf(x) || stt;bet(2.5,3);m || stt;bet(2.5,3);svf(0.6) || stt;bet(2.5,3);P(x>0.6) ||. Compute P(Y > 0.4) for Y ~ Beta(1.5, 2) by nit;gamma(3.5)/(gamma(1.5)*gamma(2))*y^(1/2)*(1-y);y;0.4;1 ||. Check itg;x^(n/2-1)*e^(-x/2);x;0;oo || itg;x^(n/2-1)*e^(-x/3);x;0;oo || itg;x^(z-1)*e^(-x);x;0;oo || nit;x^(5-1)*e^(-x);x;0;oo || nit;x^10*e^(-x);x;0;oo || nit;x^(-1/2)*e^(-x);x;0;oo ||.
9. Cauchy The pdf of a Cauchy distribution with location parameter \(a\) and scale parameter \(b\) is \(f_X(x)=\frac{1}{πb[1+(\frac{x-a}{b})^2]}\) for \(-∞< a <∞,b>0\). Use "cau(a, b)" for a Cauchy distribution with parameters "a" and "b". Show the pdf and cdf of a Cauchy distribution by stt;cau(a,b);pdf(x) || stt;cau(a,b);cdf(x) ||. Note that the Cauchy distribution does not have a mean, variance, or moments. Check stt;cau(0,1);P(x>0.8) || stt;cau(2,3);P(x<1.6) || stt;cau(2,3);q(0.8) || stt;cau(0,1);rvs(20) ||.
10. Bivariate and multivariate normal distribution Suppose \((X_1,Y_1),\cdots,(X_n,Y_n)\) are an i.i.d. sample from a bivariate normal distribution of \(X,Y\) with population parameters \(μ_X,μ_Y,σ_X^2,σ_Y^2,ρ\). The correlation between \(X\) and \(Y\) is \(ρ=\frac{σ_{XY}}{σ_Xσ_Y}\), where \(σ_{XY}=E(XY)-E(X)E(Y)\) is the covariance. The correlation between the distributions of sample means \(\overline{X}_n\) and \(\overline{Y}_n\) is \(\frac{\text{Cov}(\overline{X}_n, \overline{Y}_n)}{σ_{\overline{X}_n}σ_{\overline{Y}_n}}\). Since \(σ_{\overline{X}_n}=\frac{σ_X}{\sqrt{n}},σ_{\overline{Y}_n}=\frac{σ_Y}{\sqrt{n}}\), and \(\text{Cov}(\overline{X}_n,\overline{Y}_n) =\frac{1}{n^2}\text{Cov}(\sum_{i=1}^nX_i,\sum_{i=1}^nY_i)=\frac{1}{n}\text{Cov}(X,Y),ρ_{(\overline{X}_n,\overline{Y}_n)}= \frac{\text{Cov}(X,Y)/n}{σ_Xσ_Y/n}=ρ\), the same as the population correlation. Note that the marginal distributions are \(X\sim N(μ_X,σ_X^2),Y\sim N(μ_Y,σ_Y^2)\).
Use "mnm(μ, Σ)" for a multivariate normal distribution with mean vector μ and covariance matrix Σ. Note that the input covariance matrix must be symmetric and positive semidefinite. Keywords "cov", "h" "pdf(x1, x2, ..., xk)","cdf(x1, x2, ..., xk)", and "rvs(n)" available for k-variate normal distribution for computing the covariance, entropy, pdf, cdf, simulating a random sample of size n, respectively. For example, stt;mnm((0,0),((1,0),(0,1)));pdf(1,1) || stt;mnm((0,0),((1,0),(0,1)));cdf(1,1) || stt;mnm([3,4],[[2,1],[1,2]]);cdf(1,2) || stt;mnm([3,4],[[2,1],[1,2]]);rvs(30) || stt;mnm([3,4],[[2,1],[1,2]]);cov || stt;mnm([3,4],[[2,1],[1,2]]);h || stt;mnm((1,2,3),((1,-1,1),(-1,2,-1),(1,-1,2)));cdf(0,1,0) ||.
7. Histograms and Graphs of Distribution Functions
A histogram displays the shape of the distribution of data values. To create a histogram, the range of the data is divided into intervals or bins, and the number or proportion of observations falling in each bin is plotted. Use the keyword "hist" to produce a histogram based on a simulated sample of size 2000 from any of the common distributions described above. For example, "plt;n(0,1);hist" displays the histogram of the standard normal distribution. Check plt;epn(2);hist || plt;t(12);hist || plt;uni(0,1);hist || plt;cau(2,5);hist || plt;f(7,10);hist || plt;chi(8);hist || plt;gam(3,7);hist || plt;bet(1,2);hist || plt;ber(0.6);hist || plt;bin(20,0.5);hist || plt;poi(4);hist || plt;geo(0.5);hist || plt;nbn(14,0.6);hist || plt;hyp(50,15,16);hist ||.
Graphs of cumulative density functions Use the keyword "cdf" to produce the cdf of a r.v.. For example, "plt;n(0,1);cdf" yields the graph of the cdf of the standard normal distribution. Check plt;epn(2);cdf || plt;t(12);cdf || plt;uni(0,1);cdf || plt;cau(2,5);cdf || plt;f(7,10);cdf || plt;chi(8);cdf || plt;gam(3,7);cdf || plt;bet(1,2);cdf || plt;ber(0.6);cdf || plt;bin(20,0.5);cdf || plt;poi(4);cdf || plt;geo(0.5);cdf || plt;nbn(14,0.6);cdf ||.
Graphs of probability density functions The graph for the pdf of the standard normal distribution can be obtained by "plt;n(0,1);pdf". Use keyword "pdf" for the graph of the pdf of a continuous r.v.. Check other pdfs by plt;epn(2);pdf || plt;t(12);pdf || plt;uni(0,1);pdf || plt;cau(2,5);pdf || plt;f(7,10);pdf || plt;chi(8);pdf || plt;gam(3,7);pdf || plt;bet(1,2);pdf ||.
You can directly enter a density function for its graph. Check plt;20/(2*pi)^(1/2)*exp(-x^2/2);20*gamma(15)/(gamma(14.5) *(29*pi)^.5)*(1+x^2/29)^(-15) || plt;20/(2*pi)^(1/2)*exp(-x^2/2);20*gamma(3)/(gamma(2.5)*(5*pi)^.5)*(1+x^2/5)^(-3) || plt;(1/3)*exp(-x/3); 3*exp(-3*x);15*exp(-15*x) || plt;1-exp(-x/3);1-exp(-3*x);1-exp(-15*x) || plt;10*x^(6-1)*exp(-x/1)/(gamma(6)*1^6);10*x^(3-1)*exp(-x/1)/(gamma(3)*1^3);itv=(0,20) || plt;10*x^(3-1)*exp(-x/2)/(gamma(3)*2^3);10*x^(2-1)*exp(-x/2)/(gamma(2)*2^2);itv=(0,20) || plt;1/(pi*(1^2+(x-0)^2));2/(pi*(2^2+(x-1)^2)) ||.
8. Limit Theorems
1. Inequalities If \(X\) is a non-negative r.v. with finite mean \(E(X)\), then for any \(t>0,P(X>t)≤\frac{E(X)}{t}\) (Markov's inequality). By Markov's inequality, \(P(|X-μ|≥t)=P((X-μ)^2≥t^2)≤\frac{E(X-μ)^2}{t^2}=\frac{σ^2}{t^2}\), provided that the mean \(E(X)=μ\) and variance Var\((X)=σ^2\) of \(X\) exist. Thus, we have \(P(|X-μ|≥t)≤\frac{σ^2}{t^2}\) for any r.v. with finite mean and variance (Chebyshev's inequality). Following Chebyshev's inequality, we observe that \(P(\frac{|X-μ|}{σ}≥\frac{t}{σ})=P(|Z|≥k)≤\frac{σ^2}{t^2}=\frac{1}{k^2}\), where \(t=kσ\). It follows that \(P(|Z|>2)≤\frac{1}{4},P(|Z|>3)≤\frac{1}{9}\). Chebyshev's inequality tells us that the values of a r.v. \(X\) are unlikely to be many standard deviations away from its mean \(μ\).
We use some common distributions to check these two inequalities. If X ~ Poisson(λ), E(x) = Var(x) = λ, and by Markov's inequality P(X > 1) ≤ λ, P(X > 2) ≤ λ/2, P(X > 3) ≤ λ/3, P(X > 4) ≤ λ/4, P(X > 5) ≤ λ/5, and P(X > t) ≤ λ/t for any t > 0. Check stt;poi(0.2);P(X>1) || stt;poi(3);P(x>5) || stt;poi(20);P(x>25) ||. By Chebyshev's inequality P(Z > 2 or Z < -2) ≤ 1/4 and P(Z > 3 or Z < -3) ≤ 1/9. If λ = 4, |Z| > 2 implies X > 8 or X < 0 and |Z| > 3 implies X > 10 or X < -2. Check stt;poi(4);P(x>8) || stt;poi(4);P(x>10) ||. Similarly, check Chebyshev's inequality if X ~ N(0,1): stt;n(0,1);P(X>1) || stt;n(0,1);P(x>2) || stt;n(0,1);P(x>3) || stt;n(0,1);P(x>4) ||. If X ~ exponential(λ), E(X) = 1/λ and Var(X) = 1/λ². Check Markov's inequality by stt;epn(0.5);P(x>1) || stt;epn(0.5);P(x>2) || stt;epn(0.5);P(x>3) || stt;epn(0.5);P(x>10) || stt;epn(4);P(x>1) || stt;epn(4);P(x>0.5) || stt;epn(4);P(x>0.25) ||. Note that |Z| > 2 implies X > 3/λ and |Z| > 3 implies X > 4/λ. Check stt;epn(0.5);P(x>6) || stt;epn(0.5);P(x>8) || stt;epn(4);P(x>3/4) || stt;epn(4);P(x>1) ||.
Example Suppose \(X\sim\) exponential(4), so that \(μ=E(X)=\frac{1}{4},σ^2=\frac{1}{16}\). Then by Chebyshev's inequality \(P(|X-μ|≥1)=P(|X-\frac{1}{4}|≥1)≤σ^2=\frac{1}{16}\) = 0.0625. We can calculate the exact probability \(P(|X-\frac{1}{4}|≥1)=P(X>\frac{5}{4})=e^{-5}\) ≈ 0.0067 < 0.0625. Check nit;4*exp(-4*x);x;5/4;oo || stt;epn(4);m || stt;epn(4);v || stt;epn(4);std || stt;epn(4);P(x>5/4) ||.
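A short Python/SciPy sketch of the same comparison (SciPy parameterizes the exponential by scale = 1/λ):

```python
# Sketch: exact tail probability versus the Chebyshev bound for X ~ exponential(4).
from scipy.stats import expon

X = expon(scale=1/4)      # mean 1/4, variance 1/16
print(X.mean(), X.var())  # 0.25, 0.0625
print(X.sf(5/4))          # exact P(X > 5/4) = e**(-5), about 0.0067
print(X.var() / 1**2)     # Chebyshev bound for P(|X - 1/4| >= 1): 0.0625
```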
2. Law of large numbers Let \(X_1,\cdots,X_n\) be i.i.d. r.v.'s with finite mean \(E(X_i)=μ\), and let \(\overline{X}_n=\frac{1}{n}\sum_{i=1}^nX_i\). Then for any \(ε>0\), the weak law of large numbers (WLLN) says \(\lim\limits_{n→∞}P(|\overline{X}_n-μ|>ε)=0\) (denote \(\overline{X}_n\overset{p}{→}μ\)), and the strong law of large numbers (SLLN) says \(P(\lim\limits_{n→∞}|\overline{X}_n-μ|>ε)=0\) (denote \(\overline{X}_n→μ\)). The WLLN tells how a sequence of probabilities converges, but the SLLN tells how the sequence of r.v.'s behaves in the limit. The WLLN is a direct result of Chebyshev's inequality because \(P(|\overline{X}_n-μ|>ε)≤\frac{σ^2}{nε^2}→0\) as \(n→∞\), where \(σ^2\) = Var\((X)\) and \(σ_{\overline{x}_n}=\frac{σ}{\sqrt{n}}\) is the standard deviation (or standard error) of the distribution of sample means.
Use "sem" to estimate the standard error (se) \(σ_{\overline{x}_n}\) of sample mean. For example, if \(X\sim N(μ,σ^2),σ_{\overline{x}_n}=\frac{2}{\sqrt{n}}\), stt;n(1,2);rvs(100);d || gives the se is about 0.2, which means if n = 100, the sample mean is about 0.2 standard deviations away from the population mean. The se from stt;n(1,2);rvs(10000);d is about 0.02. Check stt;poi(3);rvs(500);d || stt;epn(4);rvs(1000);d ||.
3. Central limit theorem (CLT) Let \(X_1,\cdots,X_n\) be i.i.d. r.v.'s with finite mean \(μ\) and variance \(σ^2\) and let \(\overline{X}_n=\frac{1}{n}\sum_{i=1}^nX_i\). Then \(Z_n=\frac{\overline{X}_n-μ}{σ/\sqrt{n}}\overset{d}{→}Z\sim N(0,1)\). In other words, \(\lim\limits_{n→∞}P(Z_n≤z)=\int_{-∞}^z\frac{1}{\sqrt{2π}}e^{-\frac{x^2}{2}}dx\). By the CLT, we can use a normal distribution to approximate probabilities about the sample mean (or random sum), regardless of the distribution of \(X_1,\cdots,X_n\).
Example Consider tossing a fair coin 200 times and denote by X the number of heads. Calculating the exact probability P(X > 98) = 0.5840 from the binomial distribution is difficult by hand, but we can approximate it by a normal distribution with mean np = 100 and variance np(1 - p) = 50. Check stt;bin(200,0.5);P(x>98) || stt;n(100,50^(1/2));P(x>98) ||, which gives 0.6114. Similarly, calculating the exact probability P(X > 60) = 0.054 for X ~ Poisson(49) is difficult by hand. Check stt;poi(49);P(x>60) ||. However, we can approximate it by N(49, 7) by stt;n(49,7);P(x>60), which gives 0.058.
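The two approximations can be reproduced outside the stt syntax; a minimal Python/SciPy sketch:

```python
# Sketch: exact tail probabilities versus their normal (CLT) approximations.
from scipy.stats import binom, poisson, norm

print(binom(200, 0.5).sf(98))     # exact P(X > 98), about 0.584
print(norm(100, 50**0.5).sf(98))  # normal approximation, about 0.611
print(poisson(49).sf(60))         # exact P(X > 60)
print(norm(49, 7).sf(60))         # normal approximation, about 0.058
```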
9. Functions of Random Variables
1. Transformation of random variables Let \(X\sim f_X(x)\) be a discrete r.v.. If \(U=h_1(X)\) and \(h_1\) is strictly increasing \(⇒X=h_2(U)\), then the pmf of \(U\) is \(G(U)=P(U=u)=P(h_1(X)=u)=P(X=h_2(U))=f_X(h_2(U))\). Let \(X,Y\sim f_{X,Y}(x,y)\) be discrete random variables and \(U=h_1(X,Y),V=h_2(X,Y)⇒X=h_3(U,V),Y=h_4(U,V)\), then the joint pdf of \(U,V\) is given by \(G(U,V)=f_{X,Y}(h_3(U,V),h_4(U,V))\).
Let \(X\sim f_X(x)\) be a continuous random variable. If \(U=h_1(X)\) and \(h_1\) is strictly increasing \(⇒X=h_2(U)\). Then the cdf of \(U\) is \(G(U)=P(U≤u)=P(h_1(X)≤u)=P(X≤h_2(u))=\int_{-∞}^{h_2(u)}f_X(x)dx\), and the pdf is \(g(U)=\frac{d}{du}\int_{-∞}^{h_2(u)}f_X(t)dt=f(h_2(u))|h'_2(u)|\). Similarly, let \(X,Y\sim f_{X,Y}(x,y)\) be continuous random variables and \(U=h_1(X,Y),V=h_2(X,Y)⇒X=h_3(U,V),Y=h_4(U,V)\), then the density is \(g(u,v)=f_{X,Y}(h_3(u,v),h_4(u,v))|J|\), where \(J=\frac{∂(x,y)}{∂(u,v)}=\begin{vmatrix}\frac{∂x}{∂u}&\frac{∂x}{∂v}\\ \frac{∂y}{∂u}&\frac{∂y}{∂v} \end{vmatrix}\) is the Jacobian.
Example 1
(1) Suppose \(X\sim f(x)=e^{-x},Y=g(X)=X^2\). Then \(F_Y(y)=P(Y≤y)=P(X^2≤y)=P(X≤\sqrt{y})=1-e^{-\sqrt{y}}\) because the cdf of \(X\) is \(F_X(x)=1-e^{-x}\). As a result, the pdf of \(Y\) is \(f_Y(y)=F'_Y(y)=\frac{1}{2\sqrt{y}}e^{-\sqrt{y}},y≥0\). Check itg;1/(2*y^(1/2))*exp(-y^(1/2));y;0;oo ||.
(2) Let \(X\sim U(0,1),Y=g(X)=-2\ln X. F_Y(y)=P(Y≤y)=P(-2\ln X≤y)=P(X≥e^{-\frac{y}{2}})=1-F_X(e^{-\frac{y}{2}})=1-e^{-\frac{y}{2}}\) because the cdf of \(X\) is \(F_X(x)=x\). The pdf of \(Y\) is \(f_Y(y)=F'_y(y)=\frac{1}{2}e^{-\frac{y}{2}}\), which is an exponential distribution with parameter \(λ=\frac{1}{2}\).
(3) Let \(X\sim f(x)=4x^3-3x^2+2x,0< x <1,Y=g(X)=\sqrt{X}\). Then \(F_Y(y)=P(Y≤y)=P(\sqrt{X}⩽y)=P(X≤y^2)=F_X(y^2)\). Since \(F_X(x)=\int_0^x4t^3-3t^2+2tdt=x^4-x^3+x^2,F_X(y^2)=y^8-y^6+y^4\), and \(f_Y(y)=F'_Y(y)=8y^7-6y^5+4y^3\) for \(0< y <1\).
(4) Let \(Z\sim N(0,1),Y=g(Z)=Z^2,F_Y(y)=P(Y≤y)=P(Z^2≤y)=P(-\sqrt{y}≤Z≤\sqrt{y})=\Phi(\sqrt{y})-\Phi(-\sqrt{y})\). Thus the pdf of \(Y\) is \(f_Y(y)=F'_Y(y)=\frac{1}{2\sqrt{y}}ϕ(\sqrt{y})+\frac{1}{2\sqrt{y}}ϕ(-\sqrt{y})=\frac{1}{\sqrt{y}}\frac{1}{\sqrt{2π}}e^{-\frac{y}{2}}=\frac{1}{Γ(\frac{1}{2})2^{\frac{1}{2}}}y^{\frac{1}{2}-1}e^{-\frac{y}{2}}\), which is the Gamma\((\frac{1}{2},2)=χ^2_1\) density for \(y≥0\), where \(ϕ\) denotes the standard normal pdf. Note that \(Γ(\frac{1}{2})=\sqrt{π}\) and Gamma\((\frac{n}{2},2)=χ^2_n\).
(5) Let \(X_1,X_2\sim f(x)=2e^{-2x},Y=g(X_1,X_2)=\frac{X_2}{X_1}\). Then \(F_Y(y)=P(\frac{X_2}{X_1}≤y)=P(X_2≤yX_1)=\int_0^∞\int_0^{yx_1} 4e^{-2(x_1+x_2)}dx_2dx_1\)
\(=\frac{y}{y+1}\), and \(f_Y(y)=F'_Y(y)=\frac{1}{(1+y)^2},y>0\).
Example 2
(1) Suppose \(X\sim e^{-x},Y=X^2\). Then \(F_Y(y)=P(Y≤y)=P(X^2≤y)=P(X≤\sqrt{y})=1-e^{-\sqrt{y}}\), and the pdf of \(Y\) is \(f_Y(y)=F'_Y(y)=\frac{1}{2\sqrt{y}}e^{-\sqrt{y}},y≥0\).
(2) Suppose \(X\sim U(1,3),Y=X^3⇒X=Y^{\frac{1}{3}}\). Then \(\frac{dx}{dy}=\frac{1}{3}y^{-\frac{2}{3}},f_X(x(y))=\frac{1}{2}\), and \(f_Y(y)=\frac{1}{2}\frac{1}{3}y^{-\frac{2}{3}}= \frac{1}{6}y^{-\frac{2}{3}},1≤y≤27\).
(3) Suppose \(X\sim \frac{3x^2}{133}\) for \(-2≤X≤5\) and \(U=X^2\). The distribution function of \(U\) is \(G(U)=P(U≤u)=P(X^2≤u)\)
\(=P(-\sqrt{u}≤X≤\sqrt{u})=\int_{-\sqrt{u}}^{\sqrt{u}}f(x)dx=\begin{cases}\int_{-\sqrt{u}}^{\sqrt{u}}\frac{3x^2}{133}dx=\frac{2u^{3/2}}{133}&0≤u≤4\\\int_{-2}^{\sqrt{u}}\frac{3x^2}{133}dx=\frac{u^{3/2}+8}{133}&4< u≤25 \end{cases}\). Check itg;3*x^2/133;x;-u^(1/2);u^(1/2) || itg;3*x^2/133;x;-2;u^(1/2) ||. The density is \(g(u)=\begin{cases}\frac{3u^{\frac{1}{2}}}{133}&0≤u≤4\\\frac{3u^{\frac{1}{2}}}{266}&4< u≤25\end{cases}\). Check itg(3*u^(1/2)/133,u,0,4)+itg(3*u^(1/2)/266,u,4,25) ||, which equals 1.
(4) \(X_1,X_2\sim N(0,1), Y=g(X_1,X_2)=X_1^2+X_2^2\). Then \(F_Y(y)=P(X_1^2+X_2^2≤y)=\iint_R\frac{1}{2π}e^{-\frac{x_1^2+x_2^2}{2}}dx_1dx_2\)
\(=\frac{1}{2π}\int_0^{2π}\int_{r=0}^{\sqrt{y}}re^{-\frac{r^2}{2}}drdθ=1-e^{-\frac{y}{2}}\), and thus \(f_Y(y)=F'_Y(y)=\frac{1}{2}e^{-\frac{y}{2}}=\) Gamma\((1,2),y≥0\).
(5) If \(X_1,X_2\sim U(0,1),Y=X_1+X_2,f_Y(y)=\begin{cases}\int_0^y1dx_1=y&\text{ for }0≤y≤ 1\\\int_{y-1}^11dx_1=2-y&\text{ for }1< y≤2\end{cases}\).
(6) Suppose \(X_1,X_2\sim f(x)=e^{-x},Y=\frac{X_2}{X_1}\). Let \(Y_1=X_1,Y_2=\frac{X_2}{X_1}\). Solve for \(x_1=y_1,x_2=x_1y_2=y_1y_2\). The Jacobian is \(\begin{vmatrix}1&0\\y_2&y_1\end{vmatrix}=y_1\). Substitute \(x_1,x_2\) into the joint pdf of \(X_1,X_2\) and get \(e^{-x_1-x_2}=e^{-y_1-y_1y_2}\). Thus, the pdf of \(Y_2\) is \(\int_0^∞ y_1e^{-y_1-y_1y_2}dy_1=\frac{1}{(1+y_2)^2}\), or \(f_Y(y)=\frac{1}{(1+y)^2}\). Check itg;z*e^(-z-z*y);z;0;oo ||.
Example 3 Let \(X,Y\sim f(x,y)=\frac{4xy}{875}\) for \(0≤X≤5,1≤Y≤6\). Find the density of \(U=X+2Y\). Let \(V=X\). Then \(X=V,Y=\frac{U-V}{2}\) and the Jacobian \(|J|=\frac{1}{2}\). Thus, \(g(u,v)=\frac{v(u-v)}{875}\) for \(0≤v≤5,2≤u-v≤12\). The marginal density is \(g(U)=\int_0^{u-2}\frac{v(u-v)}{875}dv=\frac{(u-2)^2(u+4)}{5250}\) for \(2≤u≤7;g(U)=\int_0^5\frac{v(u-v)}{875}dv=\frac{u}{70}-\frac{1}{21}\) for \(7≤u≤12;g(U)=\int_{u-12}^5\frac{v(u-v)}{875}dv=\frac{507u-u^3-3706}{5250}\) for \(12≤u≤17\). Check itg;v*(u-v)/875;v;0;u-2 || itg;v*(u-v)/875;v;0;5 || itg;v*(u-v)/875;v;u-12;5 ||.
Example 4 Find the density of \(U=XY\) if \(X,Y\) have a joint density \(f(x,y)\). Let \(V=X⇒X=V,Y=\frac{U}{V}\), and the Jacobian \(J=\frac{-1}{V}\). Thus, the joint density is \(g(u,v)=f(v,\frac{u}{v})|v^{-1}|\), and the marginal density is \(g(u)=\int_{-∞}^∞f(v,\frac{u}{v})|v^{-1}|dv\). We can also use the definition \(G(u)=P(U≤u)=P(XY≤u)=\int_{XY≤u}f(x,y)dxdy\). If \(u≥0\), then for \(X>0\) the event is \(Y≤\frac{u}{X}\), and for \(X<0\) it is \(Y≥\frac{u}{X}\). Thus, \(G(u)=\int_{-∞}^0\int_{u/x}^∞f(x,y)dydx+\int_0^∞\int_{-∞}^{u/x}f(x,y)dydx\), and \(g(u)=G'(u)=\int_{-∞}^0f(x,\frac{u}{x})\frac{-1}{x}dx+\int_0^∞f(x,\frac{u}{x})\frac{1}{x}dx=\int_{-∞}^∞f(x,\frac{u}{x})|x^{-1}|dx\), which agrees with the Jacobian result. The same result can be obtained for \(u<0\).
2. Convolution Suppose \(X,Y\) have a joint density \(f(x,y)\) and \(U=X+Y\). Then the density of \(U\) is \(g(U)=\int_{-∞}^∞f(v,u-v)dv\). To see this, let \(V=X⇒X=V,Y=U-V\) and the Jacobian is \(J=-1\). Thus, the joint density of \(U,V\) is \(g(u,v)=f(v,u-v)|J|=f(v,u-v)\)
\(⇒g(u)=\int_{-∞}^∞f(v,u-v)dv\). If \(X,Y\) are independent, so that \(f(x,y)=f_X(x)f_Y(y)\), then \(g(u)=\int_{-∞}^∞f(v,u-v)dv\)
\(=\int_{-∞}^∞f_X(v)f_Y(u-v)dv≡f_X*f_Y\).
We can find \(g(U)=G'(u)\), where \(G(u)=P(X+Y≤u)=\iint_{X+Y≤u}f(x,y)dxdy\)
\(=\int_{x=-∞}^∞\int_{y=-∞}^{u-x}f(x,y)dydx\). Thus, \(G'(u)=\int_{-∞}^∞f(x,u-x)dx\).
For example, if \(f_X(x)=2e^{-2x},f_Y(y)=3e^{-3y},U=X+Y\) for \(x≥0,y≥0\), then \(g(u)=\int_0^u2e^{-2v}3e^{-3(u-v)}dv=6(e^{-2u}-e^{-3u})\) for \(u≥0\) and \(g(u)=0\) for \(u< 0\). Check itg;6*e^(-2*v)*e^(-3*(u-v));v;0;u || itg;6*(e^(-2*u)-e^(-3*u));u;0;oo ||. Note that \(f_X*f_Y=f_Y*f_X\), which means \(\int_{-∞}^∞f_X(v)f_Y(u-v)dv=\int_{-∞}^∞f_Y(v)f_X(u-v)dv\) by a change of variables.
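A numerical sanity check of this convolution density (a Python sketch using scipy.integrate.quad, not the itg command):

```python
# Sketch: g(u) = 6(e^{-2u} - e^{-3u}) integrates to 1 and matches the
# convolution integral evaluated at a sample point u0.
import numpy as np
from scipy.integrate import quad

g = lambda u: 6 * (np.exp(-2 * u) - np.exp(-3 * u))
print(quad(g, 0, np.inf)[0])   # about 1.0

u0 = 1.5
conv, _ = quad(lambda v: 2 * np.exp(-2 * v) * 3 * np.exp(-3 * (u0 - v)), 0, u0)
print(conv, g(u0))             # the two values agree
```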
3 Statistical Inference
1. Parameter Estimation
1. Moment method (MOM) We can use the \(k\)th moment \(μ_k=E(X^k)\) of a r.v. \(X\) and its sample moment \(\hat{μ}_k=\frac{1}{n}\sum_{i=1}^nx_i^k\) to estimate some population parameters. The MOM estimators are asymptotically unbiased and consistent, but they are not necessarily efficient, unbiased, or of minimum variance. Also note that some distributions may not have moments (e.g., the Cauchy distribution), and even if the first moment exists, higher moments may not exist.
Examples The following examples use the moment method to estimate some population parameters.
(1) Estimate parameter \(b\) using the sample \(X_1,\cdots,X_n\sim U(0,b)\). Solve the equation \(E(X)=\frac{b}{2}=\overline{X}_n⇒\hat{b}=2\overline{X}_n\), twice the sample mean. If \(b=2,E(X)=1,\hat{b}=2\overline{X}_n\). You can compare \(\hat{b}\) with the maximum value of the sample, or \(\max(X_1,\cdots,X_n)\). Check the results stt;uni(0,2);rvs(30);d || stt;uni(0,2);rvs(300);d ||. Note that the keyword "d" gives the summary statistics for the simulated samples. Use sample mean and maximum value for estimating the true parameter b = 2.
(2) \(X_1,\cdots,X_n\sim\) Geometric\((p),E(X)=\frac{1}{p}=\overline{X}_n⇒\hat{p}=\frac{1}{\overline{X}_n}\) (biased), because \(E(\frac{1}{\overline{X}_n})=p+\frac{pq}{n}-\frac{pq(p-q)}{n^2}+\cdots≠p\) for \(q=1-p\). Check stt;geo(0.5);rvs(50);d || stt;geo(0.5);rvs(500);d || stt;geo(0.5);rvs(5000);d || stt;geo(0.8);rvs(1000);d || stt;geo(0.3);rvs(200);d ||.
(3) \(X\sim U(a,b),E(X)=\frac{a+b}{2}=\overline{X}_n,\text{Var}(X)=\frac{(b-a)^2}{12}=S_n^2\) (sample variance). Solve the two equations simultaneously for \(a=\overline{X}_n-S_n\sqrt{3},b=\overline{X}_n+S_n\sqrt{3}\) by eqs;a+b-2*x;(b-a)^2-12*s^2;a;b ||. Suppose a = 2, b = 5. You can calculate the MOM estimators \(\hat{a},\hat{b}\) using the means and variances of simulated samples of different sizes. Check stt;uni(2,5);rvs(50);d || stt;uni(2,5);rvs(5000);d ||.
(4) \(X\sim\) Beta\((n,m),E(X)=\frac{n}{n+m}=\overline{X}_n,\text{Var}(X)=\frac{mn}{(n+m)^2(n+m+1)}=S_n^2\). Solve the two equations for parameters \(\hat{n}=\overline{X}_n(\frac{\overline{X}_n-\overline{X}_n^2}{S_n^2}-1)\),
\(\hat{m}=(1-\overline{X}_n)(\frac{\overline{X}_n -\overline{X}_n^2}{S_n^2}-1)\). If n = 2, m = 3, check stt;bet(2,3);rvs(30);d || stt;bet(2,3);rvs(3000);d || stt;bet(2,3);E(x) || stt;bet(2,3);V(x) ||.
(5) \(X\sim\) Gamma\((a,b),E(X)=ab=\overline{X}_n,\text{Var}(X)=ab^2=S_n^2\). Solve the two equations for parameters \(\hat{a}=\frac{\overline{X}_n^2}{S_n^2},\hat{b}= \frac{S_n^2}{\overline{X}_n}\). If a = 3, b = 2, use the sample means and variances to estimate a and b by stt;gam(3,2);rvs(100);d || stt;gam(3,2);rvs(1000);d || stt;gam(3,2);E(x) || stt;gam(3,2);V(x).
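The MOM calculations in (5) can be verified by simulation; a Python/SciPy sketch (SciPy's gamma uses shape a and scale b, matching gam(a, b)):

```python
# Sketch: method-of-moments estimates a_hat = m^2/s^2, b_hat = s^2/m
# recover a = 3, b = 2 from a large simulated Gamma(3, 2) sample.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
x = gamma(a=3, scale=2).rvs(size=5000, random_state=rng)
m, s2 = x.mean(), x.var(ddof=1)
print(m**2 / s2, s2 / m)   # approximately 3 and 2
```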
2. Maximum likelihood is the most common method for estimating parameters. Suppose \(X_1,\cdots,X_n\) are i.i.d. from \(f(x|θ)\). The likelihood function is \(L(θ)=\displaystyle\prod_{i=1}^nf(X_i|θ)\), which is just the joint pdf/pmf of the data, except that we treat it as a function of \(θ\). The maximum likelihood estimate (MLE) of \(θ\), denoted \(\hat{θ}_n\), is the value of \(θ\) that maximizes the likelihood function \(L(θ)\), or \(\hat{θ}_n=\arg\max\limits_{θ∈Θ}L(θ)\). Under certain conditions, the MLE \(\hat{θ}_n\) possesses many desirable properties. Simply put, the MLE is consistent, equivariant, asymptotically optimal, asymptotically normal, and asymptotically unbiased.
Example 1
(1) Both \(\overline{X}_n=\frac{1}{n}\sum_{i=1}^nX_i\) and \(S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}_n)^2=\frac{1}{n-1}(\sum_{i=1}^n(X_i-μ)^2-n(\overline{X}_n-μ)^2)=\frac{1}{n-1} (\sum_{i=1}^nX_i^2-n\overline{X}_n^2)\) are unbiased and consistent estimators of the population mean \(μ\) and variance \(σ^2\). That is, \(E(\overline{X}_n)=μ, E(S_n^2)=\frac{1}{n-1}(nσ^2-σ^2)=σ^2\). The consistency of the sample mean is guaranteed by the law of large numbers (WLLN). To show \(S_n^2\) is consistent, \(\frac{1}{n}\sum_{i=1}^nX_i^2\overset{p}{→}E(X^2),\overline{X}_n\overset{p}{→}μ\) by the WLLN, and \(\overline{X}_n^2\overset{p}{→}μ^2\) by the continuous mapping theorem. Thus, \(S_n^2=\frac{n}{n-1}[\frac{1}{n}\sum_{i=1}^nX_i^2- \overline{X}_n^2]\overset{p}{→}E(X_1^2)-[E(X_1)]^2=σ^2\).
(2) If \(X\sim\) Poisson\((λ)\), the MLE \(\hat{λ}_n=\overline{X}_n\), and the sample mean (the MLE) and the sample variance are both consistent and unbiased estimates of the population mean and variance (both equal to λ). Suppose λ = 8. Check stt;poi(8);rvs(100);d || stt;poi(8);rvs(1000);d || stt;poi(8);rvs(10000);d ||. If \(X\sim\) exponential\((5)\), then the MLE \(\hat{λ}_n=\frac{1}{\overline{X}_n}\), and the MLE's of the population mean 0.2 and variance 0.04 are \(\overline{X}_n,\overline{X}^2_n\), respectively. Check stt;epn(5);rvs(100);d || stt;epn(5);rvs(1000);d || stt;epn(5);rvs(10000);d ||.
Example 2 The following examples use the maximum likelihood estimation method for some population parameters.
(1) If data \(X\sim U(a,b)\), estimate \(a,b\). The likelihood function \(l(a,b)=\frac{1}{(b-a)^n}\displaystyle \prod_{i=1}^nI_{(a,b)}(X_i)\), where \(I_{(a,b)}(x)=\begin{cases}0&x< a\\1&a≤x≤b\\0&x> b\end{cases}\). To maximize \(l(a,b)\), the parameter \(a\) cannot be bigger than \(X_{(1)}\) and \(b\) cannot be smaller than \(X_{(n)}\). Otherwise, the function \(I_{(a,b)}\) equals 0. Thus, the MLE \(\hat{a}_n=X_{(1)},\hat{b}_n=X_{(n)}\), which is more efficient than the MOM estimates of \(a,b\). Similarly, if \(X\sim U(0,b)\), the likelihood function \(l(b)=\frac{1}{b^n}\displaystyle\prod_{i=1}^nI_{(0,b)}(X_i)=\frac{1}{b^n}I_{(-∞,b)}(X_{(n)})I_{(0,∞)}(X_{(1)})\), which implies \(b\) cannot be smaller than \(X_{(n)}\) and the MLE \(\hat{b}_n=X_{(n)}\). Check the minimum and maximum of the samples simulated from a uniform distribution. They are efficient MLE's of the parameters \(a,b\). For example, stt;uni(0,1);rvs(10);d || stt;uni(1,4);rvs(30);d || stt;uni(-2,3);rvs(100);d ||.
(2) If \(X\sim\frac{2x}{a}e^{-\frac{x^2}{a}},x≥0,l(a)=-n\log(a)-\frac{\sum_{i=1}^nX_i^2}{a}+C,l'(a)=\frac{-n}{a}+ \frac{\sum_{i=1}^nX_i^2}{a^2}=0⇒\hat{a}_n=\frac{\sum_{i=1}^nX_i^2}{n}=\overline{X_n^2}\), which is unbiased because \(E(\overline{X_n^2})=a\) by itg;2*x^3/a*exp(-x^2/a);x;0;oo ||. To check if \(\hat{a}_n\) is optimal, \(\text{Var}(\overline{X_n^2})=\frac{E(X^4)-E(X^2)^2}{n}=\frac{2a^2-a^2}{n}=\frac{a^2}{n}\). Check itg;2*x^3/a*exp(-x^2/a);x;0;oo || itg;2*x^5/a*exp(-x^2/a);x;0;oo ||. We can use another method (Fisher information) to obtain the same result. \(l''(a)=\frac{n}{a^2}-\frac{2\sum_{i=1}^nX_i^2}{a^3}⇒I_n(a)=-E(l''(a))=\frac{2\sum_{i=1}^nE(X_i^2)}{a^3}-\frac{n}{a^2}=\frac{2na}{a^3} -\frac{n}{a^2}=\frac{n}{a^2}\), which agrees with the Cramer-Rao lower bound \(\frac{1}{I_n(a)}=\frac{a^2}{n}\).
(3) Let \(X\sim N(μ,σ^2)\). The log-likelihood function \(l(μ,σ^2)=-\frac{n}{2}\log σ^2-\frac{\sum_{i=1}^n(X_i-μ)^2}{2σ^2},\frac{∂l}{∂μ}=\frac{\sum_{i=1}^n(X_i-μ)}{σ^2}=0\),
\(\frac{∂l}{∂σ^2}=\frac{-n}{2σ^2}+ \frac{\sum_{i=1}^n(X_i-μ)^2}{2σ^4}=0\). The MLEs are \(\hat{μ}_n=\overline{X}_n\) and \(\hat{σ}_n^2=\frac{\sum_{i=1}^n(X_i-\hat{μ}_n)^2}{n}\); \(\hat{μ}_n\) is unbiased, while \(\hat{σ}_n^2\) is only asymptotically unbiased (when \(μ\) is known, \(\frac{1}{n}\sum_{i=1}^n(X_i-μ)^2\) is unbiased). It is known that \(\hat{μ}_n=\overline{X}_n\) is optimal. For known \(μ\), check \(\text{Var}(\hat{σ}_n^2)=\frac{\sum_{i=1}^nE[(X_i-μ)^2-σ^2]^2}{n^2}=\frac{E(X_i-μ)^4-2E(X_i-μ)^2σ^2+σ^4}{n}=\frac{3σ^4-2σ^4+σ^4}{n} =\frac{2σ^4}{n}\), which agrees with the Cramer-Rao lower bound. We can verify the result by \(I_n(σ^2)=\frac{1}{\text{Var}(\hat{σ}_n^2)}\): \(l''(σ^2)=\frac{∂^2l}{∂(σ^2)^2}=\frac{n}{2σ^4}-\frac{\sum_{i=1}^n(X_i-μ)^2}{σ^6},I_n(σ^2)=-E(l''(σ^2))=\frac{nσ^2}{σ^6}-\frac{n}{2σ^4}=\frac{n}{2σ^4}\). Thus, in this case \(\hat{σ}_n^2\) attains the Cramer-Rao bound. Let μ = 2 and σ = 2, so X ~ N(2, 2). Then the sample mean and variance are approximately the MLE's of the population mean and variance. Check stt;n(2,2);rvs(100);d || stt;n(2,2);rvs(1000);d || stt;n(2,2);rvs(10000);d ||.
Example 3 The following examples compare the efficiency of different estimators.
(1) Given that \(\overline{X}_n\) is an estimator of \(μ\) from \(N(μ,σ^2)\) with \(\text{Var}(\overline{X}_n)=\frac{σ^2}{n}\), the Cramer-Rao lower bound is \(\frac{1}{nI(μ)}\). Since \(l(μ)=-\log σ-\frac{(x-μ)^2}{2σ^2}+C⇒l'(μ)=\frac{x-μ}{σ^2}⇒l''(μ)=-\frac{1}{σ^2}⇒nI(μ)=-nE(l''(μ))=\frac{n}{σ^2}\), the bound equals \(\frac{σ^2}{n}=\text{Var}(\overline{X}_n)\). Thus, \(\overline{X}_n\) is the best unbiased estimator.
(2) Suppose that \(X_1,X_2\) are from the same population with mean \(μ\) and variance \(σ^2\). Calculate the relative efficiency of the estimators \(\overline{X}_2\) and \(\frac{X_1+2X_2}{3}\) of the population mean \(μ\). Since \(\text{Var}(\overline{X}_2)=\frac{σ^2}{2},\text{Var}(\frac{X_1+2X_2}{3})=\frac{σ^2+4σ^2}{9}=\frac{5σ^2}{9}\), the relative efficiency is \(\frac{9}{10}=90\%\), which means \(\overline{X}_2\) is more efficient than \(\frac{X_1+2X_2}{3}\).
(3) Suppose \(X_1,\cdots,X_n\sim\) Bernoulli\((p)\). \(\overline{X}_n\) is an estimator of \(p\) and \(\text{Var}(\overline{X}_n)=\frac{p(1-p)}{n}\) is the Cramer-Rao lower bound variance. Then \(l(p)=X\log(p)+(1-X)\log(1-p),l'(p)=\frac{X}{p}-\frac{1-X}{1-p},l''(p)=-\frac{X}{p^2}-\frac{1-X}{(1-p)^2},nI(p)=-nE(l''(p))=n(\frac{1}{p}+\frac{1}{1-p})=\frac{n}{p(1-p)}\). Suppose p = 0.6, check the sample mean by stt;ber(0.6);rvs(100);d || stt;ber(0.6);rvs(1000);d || stt;ber(0.6);rvs(10000);d ||.
(4) Let \(X_1,\cdots,X_n\sim U(0,b)\). \(E(2\overline{X}_n)=b,\text{Var}(2\overline{X}_n)=\frac{b^2}{3n}\). Thus, \(2\overline{X}_n\) is an unbiased consistent estimator of \(b\). Since \(\frac{X_{(n)}}{b}\sim\) Beta\((n,1)\) for the largest value \(X_{(n)},E(\frac{X_{(n)}}{b})= \frac{n}{n+1},\text{Var}(\frac{X_{(n)}}{b})=\frac{n}{(n+1)^2(n+2)}\). So \(X_{(n)}\) is asymptotically unbiased. An unbiased estimator is \(\frac{n+1}{n}X_{(n)}\) and its variance is \(\text{Var}(\frac{n+1}{n}X_{(n)})=\frac{(n+1)^2}{n^2}\frac{nb^2}{(n+1)^2(n+2)} =\frac{b^2}{n(n+2)}\). The relative efficiency of \(2\overline{X}_n\) to \(X_{(n)}\) is \(\frac{n+2}{3}\), which means \(X_{(n)}\) is more efficient than \(2\overline{X}_n\); in the limit, it is infinitely more efficient. Suppose b = 10. Check the maximum value and twice the sample mean for each simulated sample by stt;uni(0,10);rvs(10);d || stt;uni(0,10);rvs(20);d || stt;uni(0,10);rvs(50);d || stt;uni(0,10);rvs(100);d ||.
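A simulation sketch of this efficiency comparison (Python/NumPy, with the assumed settings b = 10, n = 50, and 2000 replications):

```python
# Sketch: compare 2*mean with the bias-corrected maximum (n+1)/n * X_(n) for U(0, 10).
import numpy as np

rng = np.random.default_rng(1)
b, n, reps = 10, 50, 2000
samples = rng.uniform(0, b, size=(reps, n))
est_mean = 2 * samples.mean(axis=1)
est_max = (n + 1) / n * samples.max(axis=1)
print(est_mean.mean(), est_mean.var())  # ~ 10, variance ~ b^2/(3n) = 0.67
print(est_max.mean(), est_max.var())    # ~ 10, variance ~ b^2/(n(n+2)) = 0.038
```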
3. Delta method and limiting distributions If \(Y=g(X)\) is a smooth function of a r.v. \(X\), the first-order linearization through a Taylor expansion about \(μ_x\) gives \(g(X)≈g(μ_x)+(X-μ_x)g'(μ_x)\), and thus \(μ_y≈g(μ_x),σ^2_y≈σ^2_x[g'(μ_x)]^2\). The accuracy of the approximation depends on how nonlinear \(g\) is in a neighborhood of \(μ_x\) and on the size of \(σ_x\). If a sequence of r.v.'s \(X_n\) satisfies \(\sqrt{n}(X_n-μ_x)\overset{d}{→}N(0,σ^2_x)\), then the limiting distribution of \(g(X_n)\) can be obtained by \(\sqrt{n}[g(X_n)-g(μ_x)]\overset{d}{→}N(0,σ^2_x[g'(μ_x)]^2)\), provided that \(g'(μ_x)≠0\) exists.
Examples
(1) Suppose \(X_1,\cdots,X_n\) are Bernoulli\((p)\) random variables. Let \(y=g(p)=\frac{p}{1-p}\) be the odds. The MLE of \(y\) is \(\hat{y}=\frac{\hat{p}}{1-\hat{p}}\). Then \(\text{Var}(\hat{y})=[g'(p)]^2\text{Var}(\hat{p})= \frac{1}{(1-p)^4}\frac{p(1-p)}{n}= \frac{p}{n(1-p)^3}\), where \(g'(p)=\frac{1}{(1-p)^2}\). The limiting distribution is \(\sqrt{n}(g(\hat{p})-g(p))\sim N(0,\frac{p}{(1-p)^3})\).
(2) Suppose \(X_1,\cdots,X_n\) are Geometric\((p)\) random variables. The likelihood function is \(l(X_1,\cdots,X_n|p)=\displaystyle\prod_{i=1}^n(1-p)^{X_i-1}p\)
\(=(1-p)^{\sum_{i=1}^nX_i-n}p^n\) and the log-likelihood function is \(l(p)=n(\overline{X}_n-1)\log(1-p)+n\log(p)⇒l'(p)=-\frac{n(\overline{X}_n-1)}{1-p}+\frac{n}{p}\). The MLE of \(p\) is \(\hat{p}_n=\frac{1}{\overline{X}_n}\). The Fisher information is \(I_n(p)=-E(l''(p))=\frac{n}{p^2(1-p)}\), where \(l''(p)=-\frac{n(\overline{X}_n-1)}{(1-p)^2}-\frac{n}{p^2}\). Directly apply the asymptotic normality of MLE, \(\sqrt{n}(\hat{p}-p)\overset{d}{→}N(0,p^2(1-p))\). We can use the delta-method to obtain the same limiting distribution. By the central limit theorem (CLT), \(\sqrt{n}(\overline{X}_n-\frac{1}{p})\overset{d}{→}N(0,\frac{1-p}{p^2})\). Note that \(E(X_i)=\frac{1}{p}\) and \(\text{Var}(X_i)=\frac{1-p}{p^2}\). Let \(g(x)=\frac{1}{x}\). Then \(\hat{p}=g(\overline{X}_n)\) and \(\text{Var}(g(\overline{X}_n))=[g'(\frac{1}{p})]^2\text{Var}(\overline{X}_n) =(-p^2)^2\frac{1-p}{np^2}=\frac{p^2(1-p)}{n}\). Therefore, \(\sqrt{n}(\hat{p}-p)\overset{d}{→}N(0,p^2(1-p))\).
(3) For an i.i.d. sample \(X_1,\cdots,X_n\sim N(μ,σ^2)\), the MLE of \(y=g(μ,σ^2)=\frac{σ}{μ}\) is \(\hat{y}=\frac{\hat{σ}_n}{\hat{μ}_n}\), where \(\hat{μ}_n=\overline{X}_n,\hat{σ}^2_n=\frac{1}{n}\sum_{i=1}^n(X_i-\overline{X}_n)^2\). Then \(\sqrt{n}(\hat{y}-y)\overset{d}{→}N(0,∇g(μ,σ^2)^TI^{-1}(μ,σ^2)∇g(μ,σ^2))\). Since \(I^{-1}(μ,σ^2)=\begin{bmatrix}σ^2&0\\0&2σ^4\end{bmatrix},\frac{∂g}{∂μ} =-\frac{σ}{μ^2},\frac{∂g}{∂σ^2}=\frac{1}{2σμ}\),
\(∇g(μ,σ^2)=(-\frac{σ}{μ^2},\frac{1}{2σμ})^T\) and \(∇g(μ,σ^2)^TI^{-1}(μ,σ^2)∇g(μ,σ^2)=\frac{σ^4}{μ^4}+\frac{σ^2}{2μ^2}\).
2. Hypothesis Testing
I. Basic Concepts
We make statistical decisions about populations based on sample information. Statistical hypotheses are statements about population parameters or probability distributions. Suppose we partition the parameter space \(Θ\) into two disjoint sets \(Θ_0\) and \(Θ_1\) and test \(H_0:θ∈Θ_0\) versus \(H_1:θ∈Θ_1\), where \(H_0\) is the null hypothesis and \(H_1\) the alternative hypothesis. A hypothesis of the form \(θ=θ_0\) is a simple hypothesis; a hypothesis of the form \(θ≠θ_0,θ> θ_0\) or \(θ< θ_0\) is a composite hypothesis. A test of the form \(H_0:θ=θ_0,H_1:θ≠θ_0\) is called a two-tailed test; a test of the form \(H_0:θ≥θ_0,H_1:θ<θ_0\) or \(H_0:θ≤θ_0,H_1:θ>θ_0\) is called a one-tailed test.
Let \(X\) be a r.v. and \(\cal{X}\) be the range of \(X\). To test a hypothesis \(H_0\) versus \(H_1\) is to find an appropriate rejection region \(R_r⊂\mathcal{X}\), which is a subset of \(\mathcal{X}\). The decision rule is that if \(X∈R_r\) we reject \(H_0\), and if \(X∉R_r\), we do not reject \(H_0\). Note that rejecting \(H_0\) does not mean we accept \(H_1\). The rejection region \(R_r\) is usually of the form \(R_r=\{x∈\mathcal{X}:T(x)≥c\}\), where \(T(X)=T(X_1,\cdots,X_n)\) is a test statistic, \(c\) is a critical value, and \(X_1,\cdots,X_n\) are observations from the r.v. \(X\).
1. Type I, Type II errors and power There are two types of errors we can make in hypothesis testing: (1) Rejecting \(H_0\) when \(H_0\) is true is called a type I error. (2) Not rejecting \(H_0\) when \(H_0\) is false is called a type II error. The probability of a type I error is \(α=P(\text{Reject } H_0|H_0)=P(X∈R_r|H_0)\). The probability of a type II error is \(β(θ)=P(\text{Accept }H_0|H_1)=P(X∉R_r|H_1)\), which depends on \(H_1\). The power function equals \(P(\text{Reject }H_0|H_1)=P(X∈R_r|H_1)=1-β(θ)\).
2. Significance level The size (denote as \(α\)) of a test is the probability of falsely rejecting \(H_0\) when \(H_0\) is true, or the probability of making type I error. For a simple hypothesis \(H_0\), the size \(α=P(\text{Reject }H_0|H_0)\) is the type I error. In the case of a composite \(H_0\), the size is defined to be \(α=\sup\limits_{θ∈H_0}P(\text{Reject }H_0|θ)\). A test is said to have significance level \(α\) if its size is less than or equal to \(α\). In many cases the size and level of a test are equal. There are two key steps in hypothesis testing: (1) find a proper testing statistic and (2) identify the rejection region with a given level \(α\).
3. P-value P-value is the probability of obtaining a real-valued test statistic at least as extreme as the one actually obtained if \(H_0\) were true. The p-value can be calculated by \(P(T≥t|H_0)\) for a right-tailed test, by \(P(T≤t|H_0)\) for a left-tailed test, and by \(2\min(P(T≥t|H_0),P(T≤-t|H_0))\) for a two-tailed test or \(P(|T|≥t|H_0)\) if the sampling distribution of \(T\) is symmetric about 0. Informally, p-value is a measure of the evidence against \(H_0\): The smaller the p-value, the stronger the evidence against \(H_0\) and the smaller the rejection region. However, a large \(p\)-value is not strong evidence in favor of \(H_0\). A large \(p\)-value can occur either because \(H_0\) is true, or \(H_0\) is false but the test has low power. The p-value is not the probability that the null hypothesis is true.
Example Suppose \(X_1,\cdots,X_n\sim N(μ,σ^2)\) with known \(σ^2\) and we want to test \(H_0:μ=μ_0\) versus \(H_1:μ≠μ_0\). The power is given by \(P(X∈R_r|H_1)=P_μ(|\overline{X}_n-μ_0|≥c)=P_μ(\overline{X}_n-μ>c+μ_0-μ)+P_μ(\overline{X}_n-μ≤μ_0-c-μ)\), which implies the power function equals \(P_μ(\frac{\sqrt{n}(\overline{X}_n-μ)}{σ}>\frac{\sqrt{n}(c-μ+μ_0)}{σ})+P_μ(\frac{\sqrt{n}(\overline{X}_n-μ)}{σ}≤\frac{\sqrt{n}(μ_0-c-μ)}{σ})=1-\Phi(\frac{\sqrt{n}(μ_0+c-μ)}{σ})+\Phi(\frac{\sqrt{n}(μ_0-c-μ)}{σ})\), where \(R_r\) denotes the rejection region, \(\overline{X}_n=T\) is the test statistic, and \(\Phi(.)\) is the cdf of the standard normal distribution. The power increases as the distance \(|μ_0-μ|\) increases.
The type I error is \(P(X∈R_r|H_0)=P_{μ_0}(|\overline{X}_n-μ_0|>c)=1-\Phi(\frac{\sqrt{n}c}{σ})+\Phi(\frac{-\sqrt{n}c}{σ})= 2(1-\Phi(\frac{\sqrt{n}c}{σ}))\), where \(\Phi(x)+\Phi(-x)=1\) for \(x>0\). To make the type I error under a level \(α\), set \(2(1-\Phi(\frac{\sqrt{n}c}{σ}))=α⇒\frac{\sqrt{n}c}{σ}=\Phi^{-1}(1-\frac{α}{2})=Z_{1-α/2}⇒c=\frac{Z_{1-α/2}σ}{\sqrt{n}}\). Thus, for given \(n,σ,α\), we can determine the critical value \(c\) such that the type I error \(P_{μ_0}(\frac{\sqrt{n}|\overline{X}_n-μ_0|}{σ}≥Z_{1-α/2})=α\) is at most \(α\). The type II error can be calculated by \(β=P(X∉R_r|H_1)=1-P(X∈R_r|H_1)=1-P_u(|\overline{X}_n-μ_0|>c)\).
Consider testing \(H_0:μ=4\) versus \(H_1:μ=5\) with the decision rule that the critical value \(c=1\), which means we reject \(H_0\) if \(|\overline{X}_n-4|≥1\). Suppose \(σ=2,n=36,\overline{X}_n=4.8\). Then we do not reject \(H_0\) because |4.8 - 4| < 1. The power is given by \(1-\Phi(0)+\Phi(-6)\) = 0.5 by stt;n(0,1);cdf(-6) ||. We can directly calculate the power from the sampling distribution \(\overline{X}_n\sim N(5,\frac{2}{6})\) under \(H_1:μ=5\) by stt;n(5,2/6);P(T>5@T<3). Thus, the type II error is 1 - 0.5 = 0.5 by stt;n(5,1/3);P(T>=3&T<=5). The type I error can be calculated by \(2(1-\Phi(3))\) ≈ 0.0027 by stt;n(0,1);cdf(3) || stt;n(4,1/3);P(T>5@T<3) ||, because the sampling distribution of \(\overline{X}_n\) is \(N(4,\frac{2}{6})\) under \(H_0\). The test retains \(H_0\) with low power, or a large type II error. Since this test is two-tailed, the p-value is 0.0164 by stt;n(0,1);P(T>2.4@T<-2.4) || stt;n(4,1/3);P(T>4.8@T<3.2) ||. If we test \(H_0:μ=4\) versus \(H_1:μ=5.5\), we still retain \(H_0\), but now with large power (about 0.93) and a small type II error, by stt;n(5.5,1/3);P(T>5@T<3).
If we test \(H_0:μ=4\) versus \(H_1:μ=5\) at a level α = 0.05, then c = 2(1.96)/6 ≈ 0.653, the power is about 0.85 by stt;n(0,1);q(0.975) || stt;n(0,1);P(x<-1.041) || stt;n(0,1);P(x<-4.959) ||. The type II error is about 0.15. The test rejects \(H_0\) because |4.8 - 4| > 0.653 and the p-value is 0.0164 by stt;n(0,1);P(T>2.4@T<-2.4) || stt;n(4,1/3);P(T>4.8@T<3.2).
To test \(H_0:μ=4\) versus \(H_1:μ>4\) at a level α = 0.05, the type I error is \(α=P_{μ_0}(\overline{X}_n-4>c)=1-\Phi(\frac{\sqrt{n}c}{σ})\) because the test is one-tailed. Solve for \(c=\frac{σ\Phi^{-1}(1-α)}{\sqrt{n}}=2\Phi^{-1}(0.95)/6\) = 1.645/3 ≈ 0.5483 by stt;n(0,1);q(0.95) ||. The test rejects \(H_0\) because 4.8 - 4 > 0.5483 and the p-value for this right-tailed test is 0.0082 by stt;n(0,1);P(T>2.4) || stt;n(4,1/3);P(T>4.8) ||. The power function depends on \(μ\) under \(H_1\). If \(μ\) = 5, the power is 0.91 by stt;n(5,2/6);P(T>4.5483) ||; if \(μ\) = 4.5, the power is 0.4424 by stt;n(4.5,2/6);P(T>4.5483) ||.
Similarly, if we test \(H_0:μ=4\) versus \(H_1:μ<4\) at a level of significance α = 0.05, the type I error for this left-tailed test is given by \(P_{μ_0}(\overline{X}_n-4< c)=\Phi(\frac{\sqrt{n}c}{σ})=α\). Solve for \(c=\frac{σ\Phi^{-1}(α)}{\sqrt{n}}⇒ c=2\Phi^{-1}(0.05)/6\) = -1.645/3 = -0.5483 by stt;n(0,1);q(0.05) ||. The test fails to reject \(H_0\) because 4.8 - 4 > -0.548. Since this is a left-tailed test, the p-value is 0.992 by stt;n(0,1);P(T<2.4) || stt;n(4,1/3);P(T<4.8) ||.
If we want to test \(H_0:μ≥4\) versus \(H_1:μ<4\) at a level α = 0.05, a 0.95 confidence interval for \(μ\) is 4.8 ± 1.96(2/6) = (4.147, 5.453), so the test fails to reject \(H_0\). The type I error is \(P(\overline{X}_n-4<-c)=\Phi(-\frac{\sqrt{n}(c)}{σ})=\Phi(-3c)=0.05⇒c=-\Phi^{-1}(0.05)/3\) = -(-1.645/3) = 0.5483 by stt;n(0,1);q(0.05) ||. Since 4.8 - 4 > -0.5483, the test fails to reject \(H_0\) and the p-value for this left-tailed test is 0.992 by stt;n(0,1);P(T<2.4) || stt;n(4,1/3);P(T<4.8) ||.
If we want to test \(H_0:μ≤4\) versus \(H_1:μ>4\) at a level α = 0.05, the test rejects \(H_0\) because every value \(μ≤4\) lies outside the 0.95 confidence interval (4.147, 5.453), which implies the observed statistic (sample mean = 4.8) differs significantly from what would be expected under \(H_0\). In this case, the power function is \(P_μ(\overline{X}_n-4>c)=1-\Phi(\frac{\sqrt{n}(4+c-μ)}{σ})\). Since under \(H_0:μ≤4\) the size of the test is \(1-\Phi(\frac{\sqrt{n}(4+c-4)}{σ})=1-\Phi(3c)=0.05⇒c=\Phi^{-1}(0.95)/3=0.5483\). The test rejects \(H_0\) because 4.8 - 4.0 > 0.5483 and the p-value for this right-tailed test is 0.0082 by stt;n(0,1);P(T>2.4) || stt;n(4,1/3);P(T>4.8) ||.
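For the one-tailed alternative \(H_1:μ>4\) above, the critical value, size, p-value, and power can be reproduced with a short Python/SciPy sketch (the sd of the sample mean is σ/√n = 1/3):

```python
# Sketch: critical value, size, p-value, and power for the right-tailed Z-test.
from scipy.stats import norm

se = 2 / 6                      # sigma / sqrt(n)
c = norm.ppf(0.95) * se         # critical value, about 0.5483
print(norm(4, se).sf(4 + c))    # size (type I error) ~ 0.05
print(norm(0, 1).sf(2.4))       # p-value for xbar = 4.8, about 0.0082
print(norm(5, se).sf(4 + c))    # power at mu = 5, about 0.91
print(norm(4.5, se).sf(4 + c))  # power at mu = 4.5, about 0.44
```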
II. Tests Involve Normal Distributions for Large Samples
For large samples, many statistics are approximately normally distributed. We can use this fact for testing hypotheses on means, proportions, differences of means, and differences of proportions.
1. Wald test Consider testing \(H_0:θ=θ_0;H_1:θ≠θ_0\). Suppose \(\hat{θ}\) is an estimate of \(θ\) and \(\text{se}(\hat{θ})\) is the estimated standard error of \(\hat{θ}\) (it can be computed at \(θ=θ_0\)). Assume \(\hat{θ}\) is asymptotically normal, or \(W=\frac{\hat{θ}-θ_0}{\text{se}(\hat{θ})}\overset{d}{→}N(0,1)\). Then the test rejects \(H_0\) at a significance level \(α\) if \(|W|>Z_{α/2}\) (two-tailed). In other words, the test rejects \(H_0\) if and only if \(θ_0∉(\hat{θ}-Z_{α/2}\text{se}(\hat{θ}),\hat{θ}+Z_{α/2}\text{se}(\hat{θ}))\). Similarly, we can test the hypotheses \(H_0:θ=θ_0;H_1:θ>θ_0\) or \(H_1:θ<θ_0\) (one-tailed).
2. Test means If \(θ=μ\) is the population mean and \(\hat{θ}=\overline{X}_n\) is the sample mean, then \(W=Z=\frac{\overline{X}_n-μ_0}{σ/\sqrt{n}}\), where \(σ\) is the population standard deviation and \(n\) is the sample size. Under \(H_0\), the sampling (null) distribution is approximately \(\overline{X}_n\sim N(μ_0,\frac{σ}{\sqrt{n}})\) by the CLT. The test rejects \(H_0:μ=μ_0\) versus \(H_1:μ≠μ_0\) at \(α = 0.05\) if \(Z∉(-1.96,1.96)\) (two-tailed test). If \(H_1:μ>μ_0\), the test rejects \(H_0\) at the 0.05 level if \(Z>1.64\) (one-tailed test); if \(H_1:μ<μ_0\), the test rejects \(H_0\) at the 0.05 level if \(Z<-1.64\) (one-tailed test). Check stt;n(0,1);P(-1.96< Z<1.96) || stt;n(0,1);P(Z>1.64) || stt;n(0,1);P(Z<-1.64) || stt;n(0,1);q(0.975) || stt;n(0,1);q(0.025) || stt;n(0,1);q(0.95) || stt;n(0,1);q(0.05) ||. Similarly, if \(α = 0.01\), check the critical values by stt;n(0,1);P(-2.58< Z<2.58) || stt;n(0,1);P(Z>2.33) || stt;n(0,1);P(Z<-2.33) || stt;n(0,1);q(0.995) || stt;n(0,1);q(0.005) || stt;n(0,1);q(0.01) || stt;n(0,1);q(0.99) ||.
3. Test proportions Suppose \(θ=p\) is the population proportion of "success" and \(\hat{p}\) is the observed proportion of "success" in a sample. Then \(Z=\frac{\hat{p}-p}{\sqrt{p(1-p)/n}}\), where \(n\) is the sample size. In case \(\hat{p}=\frac{X}{n},Z=\frac{X-np}{\sqrt{np(1-p)}}\), where \(X\) is the observed number of "successes" in a sample. Use the test statistic \(Z\) for testing \(H_0:p=p_0\) versus \(H_1:p≠p_0\) (two-tailed), \(H_1:p>p_0\) or \(H_1:p< p_0\) (one-tailed).
Example Consider testing that a coin is fair or \(H_0:p=0.5\) versus \(p≠0.5\). The decision rule is if the observed number of heads \(X\) in a single sample of 100 tosses is in the interval [40, 60], then the test retains \(H_0\). The type I error is \(P(X < 40 ∪ X > 60|H_0)\) = 0.0352 by stt;bin(100,0.5);P(x<40@x>60) ||. If \(X=60\), we accept \(H_0\); if \(X=70\) or \(X=30\), we reject \(H_0\). If we test \(H_0:p=0.5\) versus \(H_1:p=0.7\), the type II error is \(P(40≤X≤60|H_1)\) = 0.021 by stt;bin(100,0.7);P(40<=x<=60) ||, and the power is \(P(X<40∪X>60|H_1)\) = 0.979 by stt;bin(100,0.7);P(x<40@x>60) ||. If \(H_1:p=0.6\), the type II error is 0.5379 and the power is 0.4621 by stt;bin(100,0.6);P(40<=x<=60) || stt;bin(100,0.6);P(x<40@x>60) ||.
If \(X=60,Z=\frac{60-100(0.5)}{\sqrt{100(0.5)(1-0.5)}}=2\); if \(X=30,Z=-4\); and if \(X=70,Z=4\). The p-values are \(P(|Z|>2)\) = 0.0455 and \(P(|Z|>4)\) = 0.0000 by stt;n(0,1);P(z>2@z<-2) || stt;n(0,1);P(z>4@z<-4) ||. At the level 0.0352, the test therefore retains \(H_0\) if \(X=60\) (p-value 0.0455 > 0.0352) and rejects \(H_0\) if \(X=30\) or \(X=70\). The test results based on the normal approximation are the same as the results based on the exact binomial distribution.
4. Test differences of means Suppose \(θ=μ_1-μ_2\). Consider testing \(H_0:μ_1=μ_2\) or \(H_0:μ_1-μ_2=0\), the null hypothesis that there is no difference between the two population means \(μ_1\) and \(μ_2\). Then the test statistic is \(Z=\frac{\overline{X}_1-\overline{X}_2}{σ_{\overline{X}_1- \overline{X}_2}}\) and \(σ_{\overline{X}_1-\overline{X}_2}=\sqrt{\frac{σ_1^2}{n_1}+\frac{σ_2^2}{n_2}}\), where \(σ_1,σ_2\) are two population standard deviations and \(n_1,n_2\) are the sample sizes. If \(σ_1,σ_2\) are unknown, replace them by sample standard deviations.
Example The scores on a math test for two classes are: \(\overline{X}_1=84,S_1=7,n_1=45;\overline{X}_2=81,S_2=5,n_2=50\). Test if there is a significant difference between the performance of the two classes at the levels \(α=0.05\) and \(α=0.01\). The test statistic is \(Z=\frac{84-81}{\sqrt{49/45+25/50}}=2.38\). The observed statistic \(Z\) is beyond the interval (-1.96, 1.96) (a 0.95 confidence interval), but is within the interval (-2.58, 2.58) (a 0.99 confidence interval). The p-value is \(P(|Z|>2.38)=0.0173\) by stt;n(0,1);P(z>2.38@z<-2.38) ||. Thus, the test rejects \(H_0\) at the level \(α=0.05\), but retains \(H_0\) at \(α=0.01\). Also check stt;n(0,1);q(0.975) || stt;n(0,1);q(0.995) ||.
5. Test differences of proportions Suppose \(θ=p_1-p_2\). Consider testing \(H_0:p_1=p_2=p\) or \(H_0:p_1-p_2=0\), the null hypothesis that there is no difference between the two population proportions \(p_1\) and \(p_2\). If \(\hat{p}_1,\hat{p}_2\) are the observed proportions from two different samples, then the test statistic is \(Z=\frac{\hat{p}_1-\hat{p}_2}{σ_{\hat{p}_1- \hat{p}_2}}\) and \(σ_{\hat{p}_1-\hat{p}_2}=\sqrt{p(1-p)(\frac{1}{n_1} +\frac{1}{n_2})}\), where \(n_1,n_2\) are the sample sizes. Use \(\hat{p}=\frac{n_1\hat{p}_1+n_2\hat{p}_2}{n_1+n_2}\) as an estimate of the population proportion under \(H_0\).
Example A sample poll of 230 voters from district A and 250 voters from district B showed that 55% and 47%, respectively, were in favor of a given candidate. To test if there is a significant difference between the two districts at the levels \(α=0.05\) and \(α=0.01\), calculate \(\hat{p}=\frac{230(0.55)+250(0.47)}{230+250}\) ≈ 0.5083, \(σ_{\hat{p}_1-\hat{p}_2}=\sqrt{0.5083(1-0.5083)(\frac{1}{230}+\frac{1}{250})}\) ≈ 0.046, and \(Z=\frac{0.55-0.47}{0.046}\) ≈ 1.7391. The test retains \(H_0\) at both levels. The p-value is \(P(|Z|>1.7391)\) = 0.082 by stt;n(0,1);P(z>1.7391@z<-1.7391). If we test \(H_0:p_1=p_2\) versus \(H_1:p_1>p_2\) (the candidate is preferred in district A), then the p-value is \(P(Z>1.7391)\) = 0.041 by stt;n(0,1);P(z>1.7391). So the one-tailed test rejects \(H_0\) at the 0.05 level, but retains \(H_0\) at the 0.01 level.
III. Likelihood Ratio Tests
1. Likelihood ratio tests Suppose we test \(H_0:θ= θ_0\) versus \(H_1:θ=θ_1\). The likelihood ratio test rejects \(H_0⇔ \frac{L(\text{Data}|θ_0)}{L(\text{Data}|θ_1)}< c\). If the test has significance level \(α_{LR}\), then any other test for which the significance level \(α≤ α_{LR}\) has power less than or equal to that of the likelihood ratio test \(1−β ≤1 − β_{LR}\). This Neyman-Pearson Lemma says basing the test on likelihood ratio is optimal for simple hypotheses.
Suppose we wish to test \(H_0:θ ∈Θ_0\) versus \(H_1:θ∈Θ_1\), where \(Θ_0\) and \(Θ_1\) are two disjoint sets of the parameter space \(Θ=Θ_0∪Θ_1\). If the hypotheses are composite, each likelihood is evaluated at that value of \(θ\) that maximizes it. This yields the generalized likelihood ratio \(Λ^∗=\frac{\sup\limits_{θ∈Θ_0} L(θ)}{\sup\limits_{θ∈Θ_1} L(θ)}\), where \(L(θ)\) is the likelihood function. Small values of \(Λ^∗\) tend to discredit \(H_0\). We often use the statistic \(Λ=\frac{\sup\limits_{θ∈Θ_0} L(θ)}{\sup\limits_{θ∈Θ} L(θ)}\) instead of \(Λ^∗\), where \(Λ\) is called the likelihood ratio statistic and \(Λ=\min(Λ^*,1)\). Thus, small values of \(Λ^∗\) correspond to small values of \(Λ\). The rejection region \(R_r\) has the form \(\{X:Λ(X ) < λ\}\), and the threshold \(λ\) is chosen so that \(P(Λ(X) < λ|H_0)=α\), where \(α\) is the desired significance level of the test.
Define a test statistic \(λ(X)=-2\log(\frac{\sup\limits_{θ∈Θ_0} L(θ)}{\sup\limits_{θ∈Θ} L(θ)})=-2(l(\hat{θ}_0)-l(\hat{θ}_n))\), where \(\hat{θ}_0\) is the MLE when \(θ\) is restricted in \(Θ_0,\hat{θ}_n\) is the MLE for \(θ\) in \(Θ\), and \(l(θ)=\log L(θ)\). Since \(Θ_0⊆Θ, \sup\limits_{θ∈Θ_0} L(θ)≤\sup\limits_{θ∈Θ} L(θ)⇒λ(X)≥0\). Suppose \(H_0\) is true. Then \(λ(X)\) is close to 0 for large \(n\), and if \(λ(X)\) is large, we reject \(H_0\).
Under \(H_0: θ=θ_0\) versus \(H_1:θ≠θ_0,λ(X)\overset{d}{→}χ^2_1\) as \(n→∞\), and the p-value is given by \(P(χ^2_1> λ(X))\). Under smoothness conditions on \(f(x|θ)\), if \(X_1,\cdots,X_n\sim f(x|θ)\) are i.i.d., the null distribution of \(λ(X)\) for testing \(H_0:θ∈Θ_0\) versus \(H_1:θ∉Θ_0\) satisfies \(λ(X)\overset{d}{→}χ^2_d\) as the sample size \(n→∞\), where \(d = \dim Θ −\dim Θ_0\) and \(\dim Θ,\dim Θ_0\) are the numbers of free parameters in \(Θ\) and \(Θ_0\). We reject \(H_0\) at the level \(α\) if \(λ(X)> χ^2_{d,1-α}\). The p-value can be calculated by \(P_{θ_0}(χ^2_d≥λ(x_1,\cdots,x_n))\), where \(λ(x_1,\cdots,x_n)\) is the observed test statistic.
Example For \(X_1,\cdots,X_n\sim N(μ,σ^2)\), consider testing \(H_0:μ=μ_0\) versus \(H_1:μ≠μ_0\) for \(Θ_0=\{μ:μ=μ_0\}\) and \(Θ=\{μ:μ∈R\}\). The log-likelihood function is \(l(μ,σ^2)=-\frac{1}{2σ^2}\sum_{i=1}^n(X_i-μ)^2+C\), where \(C\) is a constant that does not depend on \(μ\). On \(Θ_0,μ_0\) maximizes \(l(μ,σ^2)\), and on \(Θ,\overline{X}_n\) maximizes \(l(μ,σ^2)\). Thus, \(λ(X)=T(X)=\frac{1}{σ^2}(\sum_{i=1}^n(X_i-μ_0)^2-\sum_{i=1}^n(X_i-\overline{X}_n)^2)=\frac{n(\overline{X}_n-μ_0)^2}{σ^2}\). Under the null hypothesis, \(\frac{\sqrt{n}(\overline{X}_n-μ_0)}{σ}\sim N(0,1)\), which implies \(T(X)\sim χ^2_1\), and we reject \(H_0\) if \(T(X)> χ^2_{1,1-α}\). Given \(n=40,σ^2=16,\overline{X}_n=45,α=0.05\), we test \(H_0:μ=50;H_1:μ≠50\) by the likelihood ratio test with \(T(X)=\frac{40(45-50)^2}{16}=62.5\). Since \(χ^2_{1,0.95}=3.841\) by stt;chi(1);q(0.95) ||, the test rejects \(H_0\). The same conclusion is obtained from the Wald test with \(Z=\frac{\sqrt{40}(45-50)}{4}≈-7.9\) and a p-value near 0 by stt;n(50,4/40^(1/2));P(T>55@T<45) || stt;n(0,1);P(Z>7.9@Z<-7.9) ||.
2. Goodness of fit test Likelihood ratio tests can be used for goodness of fit. For testing \(H_0:F_X=F_0\) versus \(H_1:F_X≠F_0\), suppose \(X_1,\cdots,X_n\) are samples from \(F_X\). Divide the range of the data into \(k\) disjoint subintervals \(I_i=(a_i,a_{i+1}],1≤i≤k\), and count the number \(n_i\) of samples that fall in each subinterval \(\{X_j:X_j∈(a_i,a_{i+1}],1≤j≤n\}\). Then \(n=\sum_{i=1}^kn_i,\sum_{i=1}^kp_i=1\), and the probability of a random variable \(X\) taking a value in \(I_i\) equals \(p_i=P(X∈I_i)=F_X(a_{i+1})-F_X(a_i)\), so the vector of counts follows a multinomial distribution \(M(n,{\bf p})\), where \({\bf p}=(p_1,\cdots,p_k)\). Thus, testing \(H_0\) becomes testing \(H_0:p_i=p_{i0},1≤i≤k\) (a necessary condition for \(F_X=F_0\)) versus \(H_1:p_i≠p_{i0}\) for some \(i\). The likelihood ratio test statistic is \(λ(X)=2n\sum_{i=1}^k\hat{p}_i\log\frac{\hat{p}_i}{p_{i0}}=2 \sum_{i=1}^kO_i\log\frac{O_i}{E_i}\), where \(\hat{p}_i=\frac{n_i}{n},O_i=n\hat{p}_i,E_i=np_{i0}\). Under \(H_0,λ(X)\overset{d}{→}χ^2_{k-1}\) as \(n→∞\). Pearson's statistic \(χ^2=\sum_{i=1}^k\frac{(O_i-E_i)^2}{E_i}=\sum_{i=1}^k\frac{(n_i-np_{i0})^2}{np_{i0}}\) is commonly used to test for goodness of fit. It is asymptotically equivalent to the likelihood ratio statistic under \(H_0\).
Use "chi" to test the null hypothesis that the observed counts (categorical data) has the expected frequencies. Input data in the form "dat((observed), (expected))", where the first row is for observed counts and the second for expected counts. For example, testing \(H_0:{\bf p}=(\frac{15}{80},\frac{20}{80},\frac{15}{80},\frac{10}{80},\frac{10}{80},\frac{10}{80})\) by stt;dat((10,25,10,13,11,11),(15,20,15,10,10,10));chi, where n = 80, the first row of numbers is the observed counts, and the second row is the expected frequencies. The null distribution is multinomial with n = 80 and p = (15/80, 20/80, 15/80, 10/80, 10/80, 10/80). Note that the sum of observed counts must be equal to the sum of expected frequencies.
3. Test of independence Assume the joint distribution of the cell counts \(n_{ij}\) in a two-way contingency table is multinomial, \(n_{ij}\sim M(n,\{π_{ij}\})\) for \(1≤i≤I,1≤j≤J\), where \(I\) is the number of rows, \(J\) is the number of columns, and \(π_{ij}\) is the probability that an observation falls in cell \((i,j)\). To test independence is to test the hypothesis \(H_0:π_{ij}=π_{i.}π_{.j}\) that there is no relationship between the variables that cross-classify the rows and columns in the table, where \(π_{i.},π_{.j}\) are the marginal probabilities that an observation falls in the \(i\)th row and \(j\)th column, respectively. The likelihood ratio statistic is \(λ(X)=-2n\sum_{i=1}^I\sum_{j=1}^J\hat{π}_{ij}\log \frac{\hat{π}_{i.}\hat{π}_{.j}}{\hat{π}_{ij}}\), where \(\hat{π}_{i.}=\frac{n_{i.}}{n}=\frac{1}{n}\sum_{j=1}^Jn_{ij},\hat{π}_{.j} =\frac{n_{.j}}{n}=\frac{1}{n}\sum_{i=1}^In_{ij},\hat{π}_{ij}=\frac{n_{ij}}{n}\) are the MLEs. Thus, \(λ(X)\overset{d}{→}χ^2_{(I-1)(J-1)}\). The Pearson statistic is \(χ^2=\sum_{i=1}^I\sum_{j=1}^J\frac{(O_{ij}-E_{ij})^2}{E_{ij}} =n\sum_{i=1}^I\sum_{j=1}^J\frac{(\hat{π}_{ij}- \hat{π}_{i.}\hat{π}_{.j})^2}{\hat{π}_{i.}\hat{π}_{.j}}\). If \(\hat{π}_{ij}\) deviates greatly from \(\hat{π}_{i.}\hat{π}_{.j}\), \(H_0\) is more likely to be rejected. The two test statistics \(χ^2\) and \(λ(X)\) are asymptotically equivalent.
Use keyword "cnt" (two-way contingency table) for the hypothesis test of independence. For a table of k rows, input each row of observed counts in the table in the form "dat((row1), (row2), ..., (rowk))". For example, a 2 × 3 table stt;dat((39,25,28),(30,30,30));cnt ||, a 2 × 2×2 table stt;dat(((75,25),(65,35)),((70,30),(70,30)));cnt ||, a 3 × 2 table stt;dat((43,9),(45,7),(80,10));cnt ||, and a 2×2 × 2×2 table stt;dat([((12,17),(11,16)),((11,12),(15,16))],[((23,15),(30,22)), ((14,17),(15,16))]);cnt ||.
IV. Kolmogorov-Smirnov Test
1. One-Sample Kolmogorov-Smirnov test We can use the ecdf \(F_n(x)\) to test the null hypothesis \(H_0:F_X=F_0\) based on sample observations \(X_1,\cdots,X_n\). If the statistic \(T_{ks}=\sup\limits_{x∈R}|F_n(x)-F_0(x)|\) is large, we are likely to reject \(H_0\). Under \(H_0,\sqrt{n}T_{ks}\) asymptotically follows the Kolmogorov distribution \(P(K≤x)=1-2\sum_{k=1}^∞(-1)^{k-1}e^{-2k^2x^2}\).
2. Two-sample Kolmogorov-Smirnov test Let \(X_1,\cdots,X_n\sim F_X,Y_1,\cdots,Y_m\sim F_Y\). For testing \(H_0:F_X=F_Y;H_1:F_X≠F_Y\), we reject \(H_0\) if \(\sqrt{\frac{mn}{m+n}}T_{ks}>K_{1-α}\), where \(T_{ks}=\sup\limits_{x∈R}|F_{X,n}(x)-F_{Y,m}(x)|\), and \(\sqrt{\frac{mn}{m+n}}T_{ks}\) satisfies the Kolmogorov distribution under \(H_0\).
Use keyword "ks1" for one-sample Kolmogorov-Smirnov test for goodness of fit, which gives the resulting statistic \(T_{ks}\) and p-value (two-tailed) for testing \(H_0\) that \(X_1,\cdots,X_n\sim F_0\). To indicate a particular distribution \(F_0\), use keywords "n", "t", "f", "epn", "chi", poi","gam", "bet", "cau", "uni" followed by their necessary parameters. For example, "stt;dat(-1.179,2.369,0.07,-1.576,0.051,-0.498, -2.108,0.15,0.067,0,-0.83,-0.18,-0.932,-0.319,-0.741);ks1;t;6 ||.
Directly use keyword "n" to test normality for one sample without specifying the mean and variance parameters. For example, test if the data points come from a normal distribution by stt;dat(1.56,3.69,0.7,1.32,1.86,3.28,2.4,1.71,0.71,2.65,2.68,1.84,2.99,2.6,1.43);n || stt;dat(1.56,3.69,0.7,1.32,1.86,3.28,2.4,1.71,0.71,2.65,2.68,1.84,2.99,2.6,1.43);ks1;n;2;1 ||. Note that if you use keyword "ks1" for testing normality, you need to specify mean and variance for the test.
Use "ks2" for two-sample Kolmogorov-Smirnov test for goodness of fit, which gives the resulting test statistic \(T_{ks}\) and p-value (two-tailed) for testing \(H_0\) that \(F_X=F_Y\). For example, stt;dat((2,0,-3,5,5,7,3,2,1,-2),(7,-4,8,9,-10,1,5,6,3,0));ks2 ||.
V. Small Sample T-Test, \(χ^2\)-Test, and F-Test
1. One-sample t-test Suppose \(X_1,\cdots,X_n\sim N(μ,σ^2)\). Then \(Z=\frac{\overline{X}_n-μ}{σ/\sqrt{n}}\sim N(0,1)\). If \(σ^2\) were known, we could use \(Z\) to test the null hypothesis that the population mean is \(μ\). In case \(σ^2\) is unknown and the sample size is small \((n < 30)\), under \(H_0\) that a normal population has mean \(μ\), the test statistic \(T=\frac{\overline{X}_n-μ}{S/\sqrt{n}}\) follows the Student-t distribution with \(n-1\) degrees of freedom, where \(S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}_n)^2\) is the sample variance. If the sample size is large or \(n>30\), we can use the Z-test (Wald test).
Use keyword "t1;μ" for one sample t-test, where μ is the mean in the null hypothesis. For example, test \(H_0:μ=0\) by stt;dat(-2.48,0.31,-3.48,-1.16,-0.34,-0.66,0.14,0.44,0.17,-0.26,2.86,-0.87,0.75,-1.34,1.57);t1;0 ||.
2. Two independent samples t-test Suppose \(X_1,\cdots,X_n\sim N(μ_X,σ^2);Y_1,\cdots,Y_m\sim N(μ_Y,σ^2)\). If \(σ^2\) were known, \(Z=\frac{\overline{X}_n-\overline{Y}_m-(μ_X-μ_Y)}{σ\sqrt{\frac{1}{n}+\frac{1}{m}}}\sim N(0,1)\). To test \(H_0:μ_X=μ_Y\) that two independent samples come from the same population (normal or approximately normal) with common variance, the null distribution is \(Z=\frac{\overline{X}_n-\overline{Y}_m}{σ\sqrt{\frac{1}{n}+\frac{1}{m}}}\sim N(0,1)\). However, \(σ^2\) is usually not known, and we use the t-test in this case.
Under \(H_0\), the test statistic \(T=\frac{\overline{X}_n-\overline{Y}_m}{\hat{σ}\sqrt{\frac{1}{n}+\frac{1}{m}}}\) follows the Student-t distribution with \(n+m-2\) degrees of freedom, where the pooled (weighted) variance is \(\hat{σ}^2=\frac{(n-1)S_X^2+(m-1)S_Y^2}{n+m-2}\), \(S_X^2\) and \(S_Y^2\) are the sample variances, and \(n,m\) are the sample sizes. Note that if the sample sizes are greater than 30, we can replace the t-test with the Z-test (Wald test).
Use "t2" for two-samples t-test and enter two data sets (two rows of numbers) for the two-samples t-test. For example, stt;dat((-2.54,0.3,-4.52,-1.19,-0.31,-0.66,0.13,0.42,0.18,-0.28),(-2.47,0.31,-3.33,-1.16,-0.35,-0.67));t2 ||.
3. Paired-samples t-test If two samples are paired, either paired-subjects or paired-observations such as "before" and "after" measurement on the same subject, the samples are dependent. A paired-samples t-test is used for comparing the means of paired-samples. Denote the pair \((X_i,Y_i)\) for \(i=1,2,\cdots,n\). Assume that \(E(X_i)=μ_X,E(Y_i)=μ_Y\) and \(\text{Var}(X_i)=σ_X^2,\text{Var}(Y_i)=σ_Y^2,\text{Cov}(X_i,Y_i) =σ_{XY}=ρσ_Xσ_Y\), and that each pair is independently distributed.
Let \(\bar{D}=\overline{X}_n-\overline{Y}_n\). Then \(E(\bar{D})=μ_X-μ_Y=E(X_i-Y_i)\) and \(\text{Var}(\bar{D})=σ_{\bar D}^2=\frac{1}{n}(σ_X^2+σ_Y^2-2σ_{XY})=\frac{1}{n}\text{Var}(X_i-Y_i)\), where \(n\) is the sample size. If \(ρ>0\) or pairs are positively correlated, \(σ_{\bar D}^2\) is smaller than that of the unpaired case in which \(ρ=0\), and thus pairing is more effective than using unpaired samples. Since \(σ_{\bar D}\) is generally unknown, inferences will be based on \(S_{\bar D}=\frac{S_D}{\sqrt{n}}\), where \(S_D\) is the sample standard deviation of the observed differences between pairs. The test statistic is \(t=\frac{\bar{D}-(μ_X-μ_Y)}{S_{\bar D}}\). Assume the differences of each pair are from a normal distribution. Then \(t\) follows a Student-t distribution with \(n-1\) degrees of freedom under \(H_0\). Use keyword "pt" for the null hypothesis that two paired samples have identical means. For example, stt;dat((0.86,1.07,0.7,-1,2.61,-1.62,2.23,1.75),(1,1.57,1.2,-0.5,2.11,-1.12,1.73,2.25));pt ||.
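The three t-tests above (one-sample, two independent samples, and paired) map directly onto SciPy; this sketch reuses the data from the examples, with illustrative variable names:

from scipy import stats

x1 = [-2.48, 0.31, -3.48, -1.16, -0.34, -0.66, 0.14, 0.44, 0.17, -0.26, 2.86, -0.87, 0.75, -1.34, 1.57]
t1, p1 = stats.ttest_1samp(x1, popmean=0)        # one-sample t-test of H0: mu = 0

a = [-2.54, 0.3, -4.52, -1.19, -0.31, -0.66, 0.13, 0.42, 0.18, -0.28]
b = [-2.47, 0.31, -3.33, -1.16, -0.35, -0.67]
t2, p2 = stats.ttest_ind(a, b)                   # pooled two-sample t-test

before = [0.86, 1.07, 0.7, -1, 2.61, -1.62, 2.23, 1.75]
after  = [1, 1.57, 1.2, -0.5, 2.11, -1.12, 1.73, 2.25]
t3, p3 = stats.ttest_rel(before, after)          # paired-samples t-test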
4. Test variances To test the hypothesis \(H_0\) that a normal population has variance \(σ^2\), use the test statistic \(T=\frac{(n-1)S^2}{σ^2}\sim χ^2_{n-1}\), where \(S^2\) is the sample variance. For a two-tailed test, the test retains \(H_0\) at the 0.05 level if \(χ^2_{n-1,0.025}≤T≤χ^2_{n-1,0.975}\). For testing \(H_0\) versus \(H_1\) that the variance is greater than \(σ^2\), reject \(H_0\) if \(T>χ^2_{n-1,0.95}\); for testing \(H_0\) versus \(H_1\) that the variance is less than \(σ^2\), reject \(H_0\) if \(T<χ^2_{n-1,0.05}\). For example, if \(S^2=29,n=36\), to test \(H_0:σ^2=20;H_1:σ^2≠20, T=\frac{35(29)}{20}=50.75\), and \(χ^2_{35,0.975}=53.20,χ^2_{35,0.025}=20.57\) by stt;chi(35);q(0.975) || stt;chi(35);q(0.025) ||. Thus, the test retains \(H_0\) at the 0.05 level. For testing \(H_0:σ^2=20;H_1:σ^2>20\), the test rejects \(H_0\) because \(T>χ^2_{35,0.95}=49.8\) (one-tailed). For testing \(H_0:σ^2=20;H_1:σ^2<20\), the test retains \(H_0\) because \(T>χ^2_{35,0.05}=22.47\) (one-tailed).
5. Test ratios of variances To test \(H_0:σ^2_1=σ^2_2\), the hypothesis that there is no difference between population variances, use the test statistic \(T=\frac{S_1^2/σ^2_1}{S_2^2/σ^2_2}\), where \(S_1^2,S_2^2\) are the sample variances of the two samples. Under \(H_0,T=\frac{S_1^2}{S_2^2}\sim F_{n_1-1,n_2-1}\), an \(F\) distribution with degrees of freedom \(n_1-1,n_2-1\), where \(n_1,n_2\) are the sample sizes. For a two-tailed test, the test retains \(H_0\) if \(F_{0.025}≤T≤F_{0.975}\) at the 0.05 level. For example, suppose \(S_1^2=52,S_2^2=27,n_1=32,n_2=30\). For testing \(H_0:σ^2_1=σ^2_2;H_1:σ^2_1≠σ^2_2,T=\frac{52}{27}=1.93,F_{0.975}=2.08\) and \(F_{0.025}=0.48\) by stt;f(31,29);q(0.025) || stt;f(31,29);q(0.975) ||. Thus, the test retains \(H_0\) at the 0.05 level. For testing \(H_0:σ^2_1=σ^2_2;H_1:σ^2_1>σ^2_2\), \(F_{0.95}=1.85\) by stt;f(31,29);q(0.95) ||. Thus, the test rejects \(H_0\) at the 0.05 level because \(T>F_{0.95}\) (one-tailed). For testing \(H_0:σ^2_1=σ^2_2;H_1:σ^2_1<σ^2_2,F_{0.05}=0.54\) by stt;f(31,29);q(0.05) ||. Thus, the test retains \(H_0\) at the 0.05 level because \(T>F_{0.05}\) (one-tailed).
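The critical values quoted in the two worked examples above can be verified with SciPy's quantile (ppf) functions; the numbers in the comments are approximate:

from scipy import stats

stats.chi2.ppf([0.025, 0.05, 0.95, 0.975], df=35)   # about 20.57, 22.47, 49.80, 53.20
stats.f.ppf([0.025, 0.05, 0.95, 0.975], 31, 29)     # about 0.48, 0.54, 1.85, 2.08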
VI. Analysis of Variance (ANOVA)
1. Analysis of variance (ANOVA) We use analysis of variance to compare and test the significance of means for more than two samples. In one-way (factor) classification, we observe measurements for \(I\) independent groups (treatments or levels) of samples, where the number of measurements (replications) in each group is \(n_i,i=1,2,\cdots,I\). The statistical model is \(Y_{ij}=μ+α_i+ε_{ij}\) with a constraint \(\sum_{i=1}^Iα_i=0\), where \(Y_{ij}\) is the \(j\)th observation of the \(i\)th treatment, \(μ\) is the overall mean, \(α_i\) is the differential effect of the \(i\)th treatment, and \(ε_{ij}\sim N(0,σ^2)\) is assumed to be independent and normally distributed with mean 0 and common variance \(σ^2\).
One-way ANOVA is based on the identity for the total sum of squares \(SS_T=SS_B+SS_W\), where \(SS_B\) is the sum of squares between groups and \(SS_W\) is the sum of squares within groups. Under the model assumptions and constraints (with \(J\) replications per group), \(\frac{SS_W}{σ^2}\sim χ^2_{I(J-1)}\) and, under \(H_0\), \(\frac{SS_B}{σ^2}\sim χ^2_{I-1}\), and they are independent. The sums of squares divided by their degrees of freedom give the mean squares (MS), \(\frac{SS_B}{I-1}\) and \(\frac{SS_W}{I(J-1)}\), where \(\frac{SS_W}{I(J-1)}\) is an unbiased estimate of \(σ^2\). The statistic \(F=\frac{SS_B/(I-1)}{SS_W/[I(J-1)]}\) is used to test the null hypothesis that all (treatment) means are equal, \(H_0:α_1=\cdots=α_I=0\) or \(H_0:μ_1=\cdots=μ_I=μ\). If \(H_0\) is true, \(F\) should be close to one; if \(H_0\) is false, \(F\) should be large. When \(H_0\) is true and \(F\) is close to 1, we can regard the \(I\) independent groups of samples as coming from the same population, which is assumed normal, with common mean and variance. Under the model assumptions, the null distribution of \(F\) is the \(F\) distribution with \(I-1\) and \(I(J-1)\) degrees of freedom.
Use keyword "f1" for one-way ANOVA. Input the data for the treatment with different levels in the form "dat((level1), (level2), ..., (levelk))", where each row represents each level. For example, there are three levels for the treatment in "stt;dat((8,9,6,9),(7,9,8,8),(9,11,10,10,8)); f1", where Group 1 has 4 replications, Group 2 has 4 and Group 3 has 5. Enter the measurements of each group in a parenthesis. Check stt;dat((27,33,33,19,30),(37,27,41,26,28),(17,33,33,28,25),(38,34,34,47,31));f1 ||.
2. Two-way ANOVA involves two factors \(A\) and \(B\), each at two or more levels. The model is given by \(Y_{ijk}=μ+α_i+β_j+δ_{ij}+ε_{ijk}\) with the constraints \(\sum_{i=1}^Iα_i=0,\sum_{j=1}^Jβ_j=0,\sum_{i=1}^Iδ_{ij}= \sum_{j=1}^Jδ_{ij}=0\), where \(δ_{ij}\) is the interaction between the two factors, and \(ε_{ijk}\sim N(0,σ^2)\) is assumed i.i.d. for \(i=1,\cdots,I, j=1,\cdots,J,k=1,\cdots,K\). The identity for the sum of squares is \(SS_T=SS_A+SS_B+SS_{AB}+SS_E\), and the components are independently distributed. Under the model assumptions and constraints, (1) \(\frac{SS_A}{σ^2}\sim χ^2_{I-1}\) under \(H_A:α_i=0\), (2) \(\frac{SS_B}{σ^2}\sim χ^2_{J-1}\) under \(H_B:β_j=0\), (3) \(\frac{SS_{AB}}{σ^2}\sim χ^2_{(I-1)(J-1)}\) under \(H_{AB}:δ_{ij}=0\), and (4) \(\frac{SS_E}{σ^2}\sim χ^2_{IJ(K-1)}\). The mean squares (MS) are the sums of squares divided by their degrees of freedom and the \(F\) statistics are ratios of mean squares. For testing \(H_A:α_i=0,F_A=\frac{MS_A}{MS_E}=\frac{SS_A/(I-1)}{SS_E/(IJ(K-1))}\), for testing \(H_B:β_j=0,F_B=\frac{MS_B}{MS_E}\), and for testing \(H_{AB}:δ_{ij}=0\) (there is no interaction), \(F_{AB}=\frac{MS_{AB}}{MS_E}\). Large \(F_A\) suggests some of the \(α_i\) are nonzero, large \(F_B\) suggests some of the \(β_j\) are nonzero, and large \(F_{AB}\) suggests some of the \(δ_{ij}\) are nonzero.
Use keyword "f2" for two-way ANOVA and keyword "n" for interaction. If factor A has 3 levels and factor B has 2 levels, there are in total 6 cells, and input the data in the form of "dat( ((cell1),(cell2)), ((cell3),(cell4)), ((cell5),(cell6)) )". If each cell has 4 measurements, for example, enter the numbers in each cell. Check stt;dat(((3,3.4,5,4),(5,6.3,4,5)),((6,5,6,6),(5,4,4,5)),((3,3,4,3), (5,5,6,7)));f2 || stt;dat(((3,3.4,5,4),(5,6.3,4,5)),((6,5,6,6),(5,4,4,5)),((3,3,4,3),(5,5,6,7)));f2;n ||. Note that if a cell has only one observation, add a "," to the observation. For example, factors A and B both have 3 levels, so there are in all 9 cells, and each cell only has one measurement, check stt;dat(((7.5,),(9.4,),(10.2,),(9.7,)),((11.8,),(10.8,),(10.6,),(10,)),((8.9,),(9.8,),(8.7,),(8.2,)));f2 ||.
3. Nonparametric Tests
1. Mann-Whitney U-test We use the Mann-Whitney U-test (nonparametric) to determine whether or not two samples come from the same population. The test statistic is \(U=n_1n_2+\frac{n_1(n_1+1)}{2}-R_1\), where \(n_1,n_2\) are the sizes of the two samples and \(R_1\) is the rank sum for sample 1 (the smaller group). The sampling distribution of \(U\) is symmetric. If both \(n_1,n_2\) are greater than 8, the distribution of \(U\) is approximately normal. Use keyword "mw" for the test. For example, stt;dat((-2.54,0.3,-4.52,-1.19,-0.31,-0.66,0.13,0.42,0.18,-0.28),(-2.47,0.31,-3.33,-1.16,-0.35,-0.67));mw ||.
2. Kruskal-Wallis H-test is a generalization of the Mann-Whitney U-test to more than two samples. Suppose that we have \(k\) samples of sizes \(n_1,\cdots,n_k\) and that the sums of ranks for the \(k\) groups are \(R_1,\cdots,R_k\). Then the test statistic is \(H=\frac{12}{n(n+1)}\sum_{j=1}^k\frac{R_j^2}{n_j}-3(n+1)\), where \(n=n_1+\cdots+n_k\) is the total size of all samples. The sampling distribution of \(H\) is nearly a chi-square distribution with \(k-1\) degrees of freedom, provided that \(n_1,\cdots,n_k\) are all at least 5. The H-test (nonparametric) is similar to one-way ANOVA (parametric). Use keyword "kw" for the Kruskal-Wallis H-test. For example, stt;dat((1,3,5,7,9,8),(2,4,6,8,10,12),(2,2,8,8,9,13));kw tests if the three samples come from the same population. Check stt;dat((179,238,234,98,203),(278,175,312,165,186),(72,235,235,182,150),(281,246,240,371,218));kw ||.
3. Wilcoxon signed-rank test tests the null hypothesis that two paired-samples come from the same distribution. In particular, it tests whether the distribution of the differences \(d\) is symmetric about zero. It is a non-parametric version of the paired-samples t-test. Use keyword "w" and enter the differences of the pairs for the test. For example, stt;dat(6,8,4,6,3,2,8,9,1,-8,5,6,0,-7,5);w ||.
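The three rank-based tests above have direct SciPy counterparts; the sketch below reuses the example data (names are illustrative):

from scipy import stats

x = [-2.54, 0.3, -4.52, -1.19, -0.31, -0.66, 0.13, 0.42, 0.18, -0.28]
y = [-2.47, 0.31, -3.33, -1.16, -0.35, -0.67]
u_stat, p_mw = stats.mannwhitneyu(x, y)                                                       # Mann-Whitney U-test
h_stat, p_kw = stats.kruskal([1, 3, 5, 7, 9, 8], [2, 4, 6, 8, 10, 12], [2, 2, 8, 8, 9, 13])   # Kruskal-Wallis H-test
d = [6, 8, 4, 6, 3, 2, 8, 9, 1, -8, 5, 6, 0, -7, 5]                                           # paired differences
w_stat, p_w = stats.wilcoxon(d)                                                               # Wilcoxon signed-rank test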
4. Test correlations Pearson's product-moment correlation measures the linear relationship between two variables. Denote the \(i\)th pair by \((X_i,Y_i)\) in a sample of size \(n\). Then the sample Pearson correlation is \(r=\frac{\sum_{i=1}^n(X_i-\overline{X})(Y_i-\overline{Y})}{\sqrt{\sum_{i=1}^n(X_i-\overline{X})^2}\sqrt{\sum_{i=1}^n(Y_i-\overline{Y})^2}}=\frac{n\sum_{i=1}^nX_iY_i-\sum_{i=1}^nX_i\sum_{i=1}^nY_i}{\sqrt{n\sum_{i=1}^nX_i^2-(\sum_{i=1}^nX_i)^2}\sqrt{n\sum_{i=1}^nY_i^2-(\sum_{i=1}^nY_i)^2}}\). Use keyword "rp" to obtain the correlation coefficient and p-value for testing \(H_0:ρ=0\) (parametric). For example, stt;dat((2.5,3,7,5,9,6),(2,4,8,3,10,7));rp ||.
Spearman's rank correlation between two variables is equal to the Pearson correlation between the rank values of those two variables. While Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). Use keyword "rs" to obtain Spearman's rank correlation and the p-value for testing \(H_0:ρ=0\) (nonparametric). For example, stt;dat((23,31,40,12,20,3,89,62),(25,15,89,64,23,51,46,11));rs ||.
Kendall’s tau measures the ordinal association between measured quantities. Use "rk" to obtain the statistic tau and the p-value for testing \(H_0:τ=0\) (nonparametric). For example, stt;dat((2,12,3,5,7),(3,8,9,11,6));rk ||.
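All three correlation measures and their tests of \(H_0:ρ=0\) (or \(H_0:τ=0\)) are also available in SciPy; a sketch with the example data:

from scipy import stats

r, p_r = stats.pearsonr([2.5, 3, 7, 5, 9, 6], [2, 4, 8, 3, 10, 7])                                # Pearson
rho, p_rho = stats.spearmanr([23, 31, 40, 12, 20, 3, 89, 62], [25, 15, 89, 64, 23, 51, 46, 11])   # Spearman
tau, p_tau = stats.kendalltau([2, 12, 3, 5, 7], [3, 8, 9, 11, 6])                                 # Kendall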
4. Kernel Density Estimation (KDE)
For continuous r.v.'s, a histogram can be used to estimate the pdf of the r.v., but the estimates are not smooth. Kernel density estimation is a way to estimate a smoothed pdf of a r.v. in a non-parametric way. Let \(K(x)\) be a nonnegative, symmetric weight function, centered at zero and integrating to 1; \(K\) is called the kernel. The re-scaled version of \(K(x)\) is \(K_h(x)=\frac{1}{h}K(\frac{x}{h})\), where \(h>0\) is a smoothing parameter called the bandwidth. As \(h→0, K_h(x)\) becomes more concentrated and peaked about zero; as \(h→∞, K_h(x)\) becomes more spread out and flatter. If \(K(x)=N(0,1)\), then \(K_h(x)=N(0,h)\).
If \(X_1,\cdots, X_n∼f\), then an estimate of the continuous density function \(f\) is \(\hat{f}_h(x)=\frac{1}{n}\sum_{i=1}^nK_h(x−X_i)=\frac{1}{nh}\sum_{i=1}^nK(\frac{x−X_i}{h})\), which is called a kernel probability density estimate. It consists of the superposition of “hills” centered on the observations. If \(K(x)=N (0,1)\), then \(K_h(x−X_i)=N(X_i, h)\). The bandwidth \(h\) controls the smoothness of \(\hat{f}_h(x)\) and corresponds to the bin width of the histogram. If \(h\) is too small, then \(\hat{f}_h(x)\) is too rough; if \(h\) is too large, then the shape of \(\hat{f}_h(x)\) is smeared out too much.
Use keyword "kde" to estimate and plot the probability density by Gaussian kernels in a non-parametric way. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be over-smoothed. For example, plt;dat(1.86,2.07,1.7,0,3.61,-0.62,3.23,2.75,0.39,0.62,3.88,2.5,1.95,4.25,1.38,2.05,1.15,2.15,1.32,1.34);kde ||. You can simulated a sample from a distribution and use KDE to estimate pdf by keyword "ked;n", where "n" is sample size. For example, plt;epn(0.8);kde;40 || plt;n(2,1);kde;40 || plt;uni(1,3);kde;40 || plt;gam(2,4);kde;40 || plt;bet(1,2);kde;40 || plt;cau(2,3);kde;40 ||.
5. Empirical Cumulative Density Functions (ecdf)
The ecdf \(F_n\) is a cdf that puts mass \(\frac{1}{n}\) at each data point \(X_i\), and is defined as \(F_n(x)=\frac{1}{n}(\text{the number of }X_i ≤ x)\). Denote the ordered batch of numbers by \(x_{(1)},\cdots,x_{(n)}\). If \(x< x_{(1)},F_n(x)=0\); if \(x_{(1)} ≤ x < x_{(2)},F_n(x)=\frac{1}{n}\); if \(x_{(k)} ≤ x < x_{(k+1)},F_n(x)=\frac{k}{n}\); and if \(x≥x_{(n)},F_n(x)=1\). The ecdf is the “data analogue” of the cdf of a random variable. If \(X_1,\cdots,X_n\sim F\) are i.i.d., \(F_n(x)\) is a consistent estimator of \(F(x)\) by the WLLN. This consistency holds pointwise for any fixed \(x\); in fact, the convergence of \(F_n(x)\) to \(F(x)\) is uniform in \(x\) by the Glivenko-Cantelli theorem.
Use keyword "ecdf(x)" to compute the ecdf for a given number "x" and a set of data points. For example, stt;dat(-1,0.5,0,0.7,-2.3,0.8,1.2,0.1,-0.3);ecdf(0.4) || returns \(F_n(0.4)\). Check stt;dat(-1,0.5,0,0.7,-2.3,0.8,1.2,0.1,-0.3);ecdf(1) ||. The graph of the ecdf of a random sample from the standard normal distribution with size "n = 40" can be obtained by "plt;n(0,1);ecdf;40". Check other graphs of ecdfs plt;epn(2);ecdf;30 || plt;t(12);ecdf;50 || plt;uni(0,1);ecdf;40 || plt;cau(2,5);ecdf;35 || plt;f(7,10);ecdf;60 || plt;chi(8);ecdf;25 || plt;gam(3,7);ecdf;36 || plt;bet(1,2);ecdf;47 || plt;ber(0.6);ecdf;55 || plt;bin(20,0.5);ecdf;56 || plt;poi(4);ecdf;58 || plt;geo(0.5);ecdf;54 || plt;nbn(14,0.6);ecdf;34 ||.
4 Curve Fitting and Generalized Linear Models
1. Linear Regression Models
1. Simple linear regression Assume a set of data \((X_i,Y_i),1≤i≤n\) is from a bivariate joint distribution \(f_{X,Y}(x,y)\) of random variables \((X,Y)\). We want to predict \(Y\) by a function \(g(X)\) and hope \(Y\) and the prediction \(g(X)\) are close in terms of a loss function. We focus on quadratic loss and try to find a function \(g\) that makes the average loss (or risk function) as small as possible: \(\min\limits_{g}E_{X,Y}(Y-g(X))^2\). The optimal choice of \(g\) is the conditional expectation of \(Y\) given \(X\), or \(g(x)=E(Y|X=x)\). Note that \(E(Y|X)\) is a function of \(X\), which is called the regression of \(Y\) on \(X\), or the best predictor of \(Y\) conditional on \(X\).
If there is no constraint on \(g\), we can always find a curve (a function \(g\)) going through all observed points \((X_i,Y_i),1≤i≤n\). We usually restrict \(g\) to a class of functions such as linear functions, polynomials, or logistic functions. Linear functions are the simplest such class, and the simple linear regression is \(E(Y_i|X_i=x_i)=β_0+β_1x_i\) or \(Y_i=β_0+β_1X_i+ε_i\), where \(ε_i\) is the error term. Note that the empirical risk function \(R(β_0,β_1)=\sum_{i=1}^n(Y_i-(β_0+β_1X_i))^2\) is a convex function of \(β_0,β_1\), and the least squares estimators are given by \(\hat{β}_0=\overline{Y}_n-\hat{β}_1\overline{X}_n\) and \(\hat{β}_1=\frac{\sum_{i=1}^n(X_i-\overline{X}_n)(Y_i-\overline{Y}_n)}{\sum_{i=1}^n(X_i-\overline{X}_n)^2}=\frac{S_{XY}}{S_{XX}}\). The predicted value at \(X_i\) is \(\hat{Y}_i=\hat{β}_0+\hat{β}_1X_i\), and the residuals are \(\hat{ε}_i=Y_i-\hat{Y}_i=Y_i-(\hat{β}_0+\hat{β}_1X_i)\). The notations \(\overline{X}_n,\overline{Y}_n,S_{XY},S_{XX}≡S_X^2,S_{YY}≡S^2_Y\) represent the usual sample means, covariance, and variances.
For statistical inference, we assume the errors \(ε_i\sim N(0,σ^2)\) are i.i.d., so \(Y_i\sim N(β_0+β_1X_i,σ^2)\) are independent random variables with equal variance. The MLE \(\hat{\bf β}=(\hat{β}_0,\hat{β}_1)^T\) coincides with the least squares estimator. The MLE of \(σ^2\) is biased and is given by \(\hat{σ}^2=\frac{1}{n}\sum_{i=1}^n \hat{ε}_i^2=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{β}_0-\hat{β}_1X_i)^2\), where \(\hat{ε}_i=Y_i-\hat{Y}_i= Y_i-\hat{β}_0-\hat{β}_1X_i\) are the residuals.
Let \({\bf X}=\begin{bmatrix}1&X_1\\\vdots&\vdots\\1&X_n\end{bmatrix},{\bf Y}=\begin{bmatrix}Y_1\\\vdots\\Y_n\end{bmatrix},{\bf ε}=\begin{bmatrix}ε_1\\\vdots\\ε_n \end{bmatrix},{\bf β}=\begin{bmatrix}β_0\\β_1\end{bmatrix}\). Then the simple linear regression becomes \({\bf Y}={\bf Xβ}+{\bf ε}\). Since \(ε_i\) has 0 mean and equal variance, \(E(\hat{\bf β})={\bf β},\text{Cov}(\hat{\bf β})=E(\hat{\bf β}-{\bf β})(\hat{\bf β}-{\bf β})^T=σ^2({\bf X}^T{\bf X})^{-1}\). Under the normal error model, \(\hat{\bf β}\sim N({\bf β},σ^2({\bf X}^T{\bf X})^{-1})\), and we use this fact for statistical inferences.
Let the error sum of squares be \(SSE=\sum_{i=1}^n\hat{ε}_i^2=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2\). Then \(\hat{σ}=\sqrt{\frac{SSE}{n}}\) is called the standard error of estimate of \(Y\) on \(X\), and \(MSE= \frac{SSE}{n-2}=\frac{1}{n-2}\sum_{i=1}^n\hat{ε}_i^2\) is called the mean square error. Under the normal error model, \(SSE\) satisfies \(\frac{SSE}{σ^2}\sim χ^2_{n-2}\), and it is independent of \(\hat{\bf β}\). Thus, \(MSE\) is an unbiased estimator of \(σ^2\) (since \(E(χ^2_{n-2})=n-2\)) and is consistent by the WLLN, while \(\hat{σ}^2\) is a biased estimate of \(σ^2\).
2. Correlation and R-squared For simple linear regression, \(\hat{σ}^2=S_Y^2(1-r^2)⇒r^2=1-\frac{\hat{σ}^2}{S_Y^2}=1-\frac{\sum_{i=1}^n(Y_i-\hat{Y}_i)^2}{\sum_{i=1}^n(Y_i-\bar{Y})^2}=\frac{\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2}{\sum_{i=1}^n(Y_i-\bar{Y})^2}\), because \(\sum_{i=1}^n(Y_i-\bar{Y})^2=\sum_{i=1}^n(Y_i-\hat{Y}_i)^2+\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2\), where the first sum (of squares) is called the unexplained variation and the second sum (of squares) is the explained variation. Thus, \(r^2\) (also called the coefficient of determination) equals the proportion of variation in \(Y\) that is explained by the linear regression, and it can be considered a measure of how well the line fits the sample data. In the extreme cases where \(r=±1\), we say there is a perfect linear relationship (regression line) between \(X\) and \(Y\).
For linear regression, \(r=\frac{S_{XY}}{S_XS_Y}\) is the Pearson product moment correlation that only measures linear relationship; the calculation of \(r\) depends on the sample \(X\)'s and \(Y\)'s, and \(r^2\) happens to coincide with the coefficient of determination. However, the Pearson correlation \(r\) is not suitable for nonlinear relationships. The coefficient of determination \(r^2=\frac{\sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2}{\sum_{i=1}^n(Y_i-\bar{Y})^2}\) reflects the form of the regression model, so it is suitable for measuring both linear and nonlinear relationships (its positive square root is sometimes called the multiple correlation coefficient).
Use keyword "ols" (ordinary least squares) for estimating simple or multiple linear regression models. Always input the response variable \(Y\) before explanatory variable \(X\) in form of "dat((row of X), (row of Y))". Use "ols;pr" for predicted values and residuals, and use "ols;p(a)" for the predicted value of a new or future observation \(X=a\). For example, stt;dat((40,50,55,60,65,62,63,68,66),(102,129,142,152,166,161,162,173,170));ols || stt;dat((40,50,55,60,65,62,63,68,66),(102,129,142,152,166,161,162,173,170));ols;pr || stt;dat((40,50,55,60,65,62,63,68,66),(102,129,142,152,166,161,162,173,170));ols;p(45) ||. For plot of residuals and plot between predicted values and residuals, get the residuals first by keyword "pr" and use "pt" for scatter plots. For example plt;dat(-0.77,0.66,0.88,-1.9,-0.68,1.98,0.43,-1.35,0.7);pt || plt;dat((102.8,128.3,141.1,153.9,166.7,159,161.6,174.4, 169.2),(-0.8,0.7,0.9,-1.9,-0.7,2,0.4,-1.4,0.7));pt ||.
In practice, the type of curve to fit is often suggested by a scatter plot. In particular, a linear relationship is easy to see in a scatter plot. Use keyword "pt" for a scatter plot, and use "rp" for the Pearson correlation. For example, plt;dat((40,50,55,60,65,62,63,68,66),(102,129,142,152,166,161,162,173,170));pt || stt;dat((40,50,55,60,65,62,63,68,66),(102,129,142,152,166,161,162,173,170));rp ||.
3. Multiple linear regression The model is \(Y_i=β_0+β_1X_{i1}+\cdots+β_{p-1}X_{i,p-1}+ε_i\) with \(E(ε_i)=0,\text{Var}(ε_i)=σ^2\) and \(\text{Cov}(ε_i,ε_j)=0\) for \(i≠j\). The mean response is \(E(Y_i|{\bf X}_i)=β_0+β_1X_{i1}+\cdots+β_{p-1}X_{i,p-1}\). The matrix form is \({\bf Y}={\bf Xβ}+{\bf ε}\), where \({\bf Y}=\begin{bmatrix}Y_1\\\vdots\\Y_n\end{bmatrix},{\bf ε}=\begin{bmatrix}ε_1\\\vdots\\ε_n\end{bmatrix}, {\bf β}=\begin{bmatrix}β_0\\\vdots\\β_{p-1}\end{bmatrix},{\bf X}=\begin{bmatrix}1&X_{11}&\cdots&X_{1,p-1}\\\vdots&\vdots&\vdots&\vdots \\1&X_{n1}&\cdots&X_{n,p-1}\end{bmatrix}\). Under model assumption, \(E({\bf ε})={\bf 0},\text{Cov}({\bf ε})=E({\bf εε}^T)=σ^2{\bf I}_n\). The least square estimator \(\hat{\bf β}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\). Under the normal error model, \(\hat{\bf β}\sim N({\bf β},σ^2({\bf X}^T{\bf X})^{-1})\) since \(E(\hat{\bf β})={\bf β}\) and \(\text{Var}(\hat{\bf β})=σ^2({\bf X}^T{\bf X})^{-1}\).
4. Model assumptions and diagnostics Linear regression relies on assumptions such as a linear relationship and i.i.d. normal errors with mean 0, constant variance, and no correlation. Typical types of violation of these assumptions include nonlinearity, non-constant variance, non-normality of the error term, correlated errors, and existence of outliers. We can use a plot of the residuals to identify any patterns of violations. Ideally, the residuals should spread equally around the horizontal zero line. Also check the plot of residuals (y-axis) versus predicted values or predictors (x-axis), \((\hat{Y}_i,\hat{ε}_i)\). If nonlinear patterns are identified between residuals and fitted values, consider adding nonlinear predictors (polynomial regression) to the model.
5. Model selection If there are \(k\) covariates, there are \(2^k\) possible models, and thus we are often concerned with which covariates should or should not be included in a multiple regression model. In general, too few covariates give high bias; too many covariates give high variance. A good model needs to reach a balance between bias and variance. The common methods for model searching and selection involve performing forward or backward stepwise regression, evaluating a score such as AIC (Akaike information criterion), BIC (Bayesian information criterion), or Mallows's \(C_p\) statistic, and continuing to add (or drop) variables one at a time until the score no longer improves. However, none of these procedures is guaranteed to find the model with the best score.
Use keyword "ols" for estimating multiple linear regression models. The input data format is "dat((row x1), ..., (row xk), (row y))", where the last row is the response variable "y", and the rows for "x1, x2, ..., xk" are for the explanatory variables. For example, stt;dat((40,50,55,60,65,62,63,68,66),(0,0,0,0,1,1,1,1,1),(102,129,142,152,166,161,162,173,170));ols || stt;dat((3,0,2,2,5,2),(5.6,8.6,9.8,9.1,16,14.2));ols || stt;dat((3,0,2,2,5,2),(8.9,10.2,13.9,11.6,11,8.9),(5.6,8.6,9.8,9.1,16,14.2));ols || stt;dat((2.3,3.4,4.8,3.6,0.3,0.3),(5.6,8.6,9.8,9.1,16,14.2));ols || stt;dat((3,0,2,2,5,2),(2.3,3.4,4.8,3.6,0.3,0.3),(5.6,8.6,9.8,9.1,16,14.2));ols;p(1,5) || stt;dat((3,0,2,2,5,2),(8.9,10.2,13.9,11.6,11,8.9),(2.3,3.4,4.8,3.6,0.3,0.3),(5.6,8.6,9.8,9.1,16,14.2));ols || stt;dat((3,0,2,2,5,2),(8.9,10.2,13.9,11.6,11,8.9),(2.3,3.4,4.8,3.6,0.3,0.3),(5.6,8.6,9.8,9.1,16,14.2));ols;pr || stt;dat((3,0,2,2,5,2),(8.9,10.2,13.9,11.6,11,8.9),(2.3,3.4,4.8,3.6,0.3,0.3),(5.6,8.6,9.8,9.1,16,14.2));ols;p(1,10,5) ||.
2. Polynomial Regression
Given a set of data \((X_i,Y_i),1≤i≤n\), we want to find the coefficients \(β_k,0≤k≤p\), that minimize the squared distance \(\sum_{i=1}^n(Y_i-\sum_{k=0}^pβ_kX_i^k)^2\). Polynomial regression fits into the framework of multiple regression with an \(n×(p+1)\) matrix \({\bf X}=\begin{bmatrix}1&X_1&\cdots&X_1^p\\\vdots&\vdots&\vdots&\vdots \\1&X_n&\cdots&X_n^p\end{bmatrix},{\bf β}∈R^{p+1}\), and \({\bf Y}={\bf Xβ}+{\bf ε}\). If \({\bf X}\) is of rank \(p+1\), or there are \(p+1\) distinct values of \(X_i\), then the least squares estimator is \(\hat{\bf β}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\). If \(p+1< n\), the fitted curve cannot, in general, pass through every observed point.
Use keyword "pr;n" for estimating the coefficients of the polynomial regression models, where integer "n" refer to the order of degree for the polynomial. For example, stt;dat((6.4,3.4,4.9,3.1,3.3,2.1,6.4,6.2),(56.2,18.3,34.5,15.7,18,8.8,56.1,50.3));ols || stt;dat((6.4,3.4,4.9,3.1,3.3,2.1,6.4,6.2),(56.2,18.3,34.5,15.7,18,8.8,56.1,50.3));pr;2 ||.
3. Logistic Regression
1. Logistic regression is an example of a generalized linear model. Suppose we observe \((X_i,Y_i),1≤i≤n\) for a binary response \(Y_i\) that only takes the values 0 and 1. The logistic regression model is \(p_i=P(Y_i=1|X_i)=\frac{e^{β_0+β_1X_i}}{1+e^{β_0+β_1X_i}},P(Y_i=0|X_i)=\frac{1}{1+e^{β_0+β_1X_i}}=1-p_i\); that is, given \(X_i\), the response \(Y_i\) is a Bernoulli random variable whose success probability depends on the predictor. Define \(\text{logit}(p_i)=\log\frac{p_i}{1-p_i} =β_0+β_1X_i\), the canonical link function for a binary response.
To interpret parameters \(β_0,β_1\), use \(\text{odds}(X)=\frac{P(X)}{1-P(X)}\). Consider the odds ratio \(\frac{\text{odds}(X+1)}{\text{odds}(X)}= \frac{e^{β_0+β_1(X+1)}}{e^{β_0+β_1X}}=e^{β_1}\), which means the odds at \((X+1)\) is \(e^{β_1}\) times the odds at \(X\).
2. General logistic regression Suppose the predictors are \({\bf X}_i=(X_{i0},X_{i1}, \cdots,X_{i,p-1})^T∈R^p\) for \(1≤i≤n\), and \({\bf β}=(β_0,β_1,\cdots,β_{p-1})^T\). Then \(p_i=P(Y_i=1|{\bf X}_i)=\frac{\exp({\bf X}_i^T{\bf β})}{1+\exp({\bf X}_i^T{\bf β})}\), and \(1-p_i=P(Y_i=0|{\bf X}_i)=\frac{1}{1+\exp({\bf X}_i^T{\bf β})}\). The logit is \(\text{logit}(p_i)=\log\frac{p_i}{1-p_i}={\bf X}_i^T{\bf β}\).
Use keyword "glm;b" for estimating logistic regression models, where "b" means binary response or binomial link function. Input data in a form of "dat((row X1), (row X2), ..., (row Xk), (row Y))", where the last row of 0's and 1's is for "Y", the binary response variable, and the previous rows are for the explanatory variables X's. For example, stt;dat((2,1,0,3,5,-1),(-2,-1,0,1,3,1),(0,1,0,1,0,1));glm;b || stt;dat((2,1,0,3,5,-1),(0,1,0,1,0,1));glm;b || stt;dat((-2,-1,0,1,3,1),(0,1,0,1,0,1));glm;b || stt;dat((11,14,12,15,13,16,17,19,18,17,16,20),(0,0,0,0,0,0,1,1,1,1,1,1));glm;b ||.
For each logistic regression model, the program will return the standard error estimates for \(\hat{\bf β}\) and the results of the hypothesis tests for the coefficients \({\bf β}\). You can also use \(χ^2\) ("Pearson chi2") or \(G^2\) ("deviance") together with the degrees of freedom ("Df Residuals") for a goodness of fit test.
4. Log Linear Models
We often use log linear models to fit cell counts in contingency tables and study the patterns or associations of expected counts with levels of categorical variables as well as interactions among those variables.
1. Two-way tables Consider an \(I×J\) contingency table with \(N=IJ\) cells. Suppose the \(N\) cell counts \(Y_{ij}\) have independent Poisson distributions with means \(E(Y_{ij})=μ_{ij}\) that satisfy a multiplicative model \(μ_{ij}=μα_iβ_j\), where \(α_i\) and \(β_j\) are positive constants satisfying \(\sum_{i=1}^Iα_i=1,\sum_{j=1}^Jβ_j=1\). Then the Poisson log linear model \(\log μ_{ij}=λ+λ_i^X+λ_j^Y\) (with suitable identifiability constraints, e.g., \(λ_1^X=λ_1^Y=0\)) has additive main effects of the two classifications but no interaction (the independence model), where \(λ=\log μ,λ_i^X=\log α_i,λ_j^Y=\log β_j\). Note that the sum of all cell counts \(n=\sum_{i=1}^I\sum_{j=1}^JY_{ij}\) also follows a Poisson distribution with mean \(μ=\sum_i\sum_jμ_{ij}\). The MLE of \(μ_{ij}\) is \(\hat{μ}_{ij}=\frac{n_{i.}n_{.j}}{n}\).
The saturated model is \(\log μ_{ij}=λ+λ_i^X+λ_j^Y+λ_{ij}^{XY}\), where \(λ_{ij}^{XY}\) represents the interaction between \(X\) and \(Y\). The saturated model is also a hierarchical model, which means that the model includes all lower-order terms composed from variables contained in a higher-order model term. For example, the saturated model contains \(λ_{ij}^{XY}\) and it also contains the terms \(λ_i^X,λ_j^Y\).
In addition, conditional on the total sample size \(n\), the cell counts have a multinomial distribution \(Y_{ij}|n\sim M(n,π_{ij})\) with cell probabilities \(π_{ij}=\frac{μ_{ij}}{μ}\) and \(μ_{ij}=nπ_{ij}\). Similarly, conditional on \(n\), the row sum or column sum follows a multinomial distribution, that is, \(Y_{i.}|n\sim M(n,π_{i.})\) and \(Y_{.j}|n\sim M(n,π_{.j})\), where \(π_{i.}=α_i=\frac{\sum_{j=1}^Jμ_{ij}}{μ},π_{.j}=β_j =\frac{\sum_{i=1}^Iμ_{ij}}{μ}\). Thus, conditional on \(n\), the model is a multinomial one that satisfies \(π_{ij}=α_iβ_j=π_{i.}π_{.j}\).
Use keyword "log" to estimate log linear models for two-way tables. Saturated models are not estimated, so the results are not available for saturated models. As usual, the data for the response variable (counts) appears in the last row, and the data for explanatory variables (row and column indicators) appear first.
Example Given the cell counts in the table \(\begin{array}{c|lcr}&Y_1&Y_2\\ \hline X_1&27&132\\ X_2&41&219\end{array}\). Input the row indicators as "(0,0,1,1)" for categorical variable \(X\) with two categories \(X_1\) and \(X_2\), the column indicators as "(1,0,1,0)" for \(Y\) with categories \(Y_1\) and \(Y_2\), and the cell counts as "(27,132,41,219)". Check stt;dat((0,0,1,1),(1,0,1,0),(27,132,41,219));log ||. The "Deviance" \(G^2\), "Pearson chi2" \(χ^2\), and "Df Residuals" (df) are calculated for chi-squared goodness of fit tests. In this example, G\(^2\) = 0.106, χ\(^2\) = 0.107, df = 1 and the p-value is about 0.744 by stt;chi(1);P(x>0.107) ||.
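The independence log linear model for this table can be sketched as a Poisson GLM in statsmodels; it should reproduce values close to the \(G^2\) and \(χ^2\) reported above:

import numpy as np
import statsmodels.api as sm

counts = np.array([27, 132, 41, 219])
x_ind = np.array([0, 0, 1, 1])          # row (X) indicator
y_ind = np.array([1, 0, 1, 0])          # column (Y) indicator
X = sm.add_constant(np.column_stack([x_ind, y_ind]))   # main effects only (independence model)
fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
fit.deviance        # G^2, about 0.106
fit.pearson_chi2    # chi^2, about 0.107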
2. Three-way tables \(I×J×K\) A three-way contingency table is a cross-classification of observations by the levels of three categorical variables \(X,Y,Z\). Suppose \(X\) has \(I\) levels, \(Y\) has \(J\) levels, and \(Z\) has \(K\) levels. The saturated model is \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ij}^{XY}+λ_{ik}^{XZ}+λ_{jk}^{YZ}+λ_{ijk}^{XYZ}\).
3. Mutual independence The mutual independence model of \(X,Y,Z\) is \(π_{ijk}=π_{i..}π_{.j.}π_{..k}\), assuming a multinomial distribution with cell probabilities \(π_{ijk}\) and \(\sum_i\sum_j\sum_kπ_{ijk}=1\). The log linear form of the mutual independence model is \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z\), assuming Poisson sampling with means \(μ_{ijk}\). Mutual independence implies joint independence of any one variable from the others.
4. Two-way independence Variable \(Y\) is jointly independent of \(X\) and \(Z\) if \(π_{ijk}=π_{i.k}π_{.j.}\) for all \(i,j,k\), which is called two-way independence between \(Y\) and a variable composed of the \(IK\) combinations of levels of \(X\) and \(Z\). The two-way independence of \(X\) and \(YZ\), or of \(Z\) and \(XY\), is defined in a similar fashion. The log linear model for \(Y\) and \(XZ\) is \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ik}^{XZ}\). The log linear models for the other two-way independence cases are \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ij}^{XY}\) (for \(Z\) and \(XY\)) and \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{jk}^{YZ}\) (for \(X\) and \(YZ\)).
5. Conditional independence If \(X\) and \(Y\) are independent in partial table \(k\), then \(X\) and \(Y\) are called conditionally independent at level \(k\) of \(Z\), and we have \(π_{ij|k}=P(X=i,Y=j|Z=k)=P(X=i|Z=k)P(Y=j|Z=k)=π_{i.|k}π_{.j|k}\) for all \(i,j\). If \(X\) and \(Y\) are conditionally independent at every level of \(Z\), they are said to be conditionally independent given \(Z\), which means \(Y\) does not depend on \(X\) after conditioning on (controlling for) \(Z\). The corresponding log linear model is \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ik}^{XZ}+λ_{jk}^{YZ}\). The log linear models for the other conditional independence cases are \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ij}^{XY}+λ_{kj}^{ZY}\) and \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ji}^{YX}+λ_{ki}^{ZX}\). Conditional independence is weaker than mutual and joint independence; joint independence implies conditional independence. The log linear model that allows all three pairs of variables to be conditionally dependent (with no three-way interaction) is \(\log μ_{ijk}=λ+λ_i^X+λ_j^Y+λ_k^Z+λ_{ij}^{XY}+λ_{kj}^{ZY}+λ_{ik}^{XZ}\).
6. Chi-squared goodness of fit test Use \(X^2\) ("Pearson chi2") and \(G^2\) ("deviance") to test whether a model holds by comparing cell fitted values to observed counts. The degrees of freedom ("Df Residuals") equal the number of cell counts minus the number of model parameters ("Df Model").
Use keyword "log" for estimating the mutual independence models, and add a model generator to specify models for conditional independence. For example, "log;12", "log;23", "log;13", "log;12+13", "log;12+23", "log;13+23", and "log;12+13+23". As usual, the data for the response variable (counts) appears in the last row, and the data for explanatory variables appear first.
Example The cell counts for three categorical variables \(X,Y,Z\) with each two levels are given in the table \(\begin{array}{c|lcr}&&Z_1&Z_2\\\hline X_1&Y_1&56&80\\&Y_2&48&133\\\hline X_2&Y_1&63&98\\&Y_2&47&97\end{array}\). Check the results for various models by stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;12 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;13 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;23 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;12+23 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;13+23 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;12+13 || stt;dat((1,1,1,1,0,0,0,0),(0,1,0,1,0,1,0,1),(0,1,0,1,0,1,0,1),(56,80,48,133,63,98,47,97));log;12+13+23 ||.
5 Stochastic Process
1. Stochastic Process
A stochastic process is a collection of random variables \(\{X(t),t∈T\}\) defined on a given probability space, where \(T\) is the index (parameter) set and \(t\) is a parameter. The set \({\cal X}\) of all values (states) that \(X(t)\) can take forms the state space of the process. If \(T=\{0,1,2,\cdots\}\) is discrete, the process is called a discrete-parameter (discrete-time) process. If \(T=[0,∞)\) is continuous, the process is a continuous-time process. If the state space is discrete, the process is a discrete-state process; if the state space is continuous, the process is a continuous-state process.
For a fixed time \(t_1,X(t_1)=X_1\), and the cdf \(F_X(x_1,t_1)=P(X(t_1)≤x_1)\) is called the first-order distribution of \(X(t)\). Similarly, \(F_X(x_1,x_2,t_1,t_2)=P(X(t_1)≤x_1,X(t_2)≤x_2)\) is the second-order distribution of \(X(t)\), and \(F_X(x_1,x_2,\cdots,x_n,t_1,t_2,\cdots,t_n)=P(X(t_1)≤x_1,X(t_2)≤x_2,\cdots,X(t_n)≤x_n)\) is the \(n\)th-order distribution of \(X(t)\).
1. Mean, autocorrelation and autocovariance functions The mean of \(X(t)\) is \(μ_X(t)=E(X(t))\). The autocorrelation function is defined as \(R_X(s,t)=E[X(s)X(t)]\), which is a measure of dependence among the random variables of \(X(t)\). Note that \(R_X(s,t)=R_X(t,s)\) and \(R_X(t,t)=E[X(t)^2]\). The autocovariance function is \(K_X(s,t)=\text{Cov}(X(s),X(t))=E[(X(s)-μ_X(s))(X(t)-μ_X(t))]=R_X(s,t)-μ_X(s)μ_X(t)\). If the mean of \(X(t)\) is 0, \(K_X(s,t)=R_X(s,t)\). Note that the variance of \(X(t)\) is \(σ_X^2=E[X(t)-μ_X(t)]^2=K_X(t,t)\).
2. Stationary process A process \(\{X(t),t∈T\}\) is stationary if for all \(n\) and for every set of time instances \(t_i∈T,i=1,2,\cdots,n\), the joint distribution \(F_X(x_1,\cdots,x_n,t_1,\cdots,t_n)=F_X(x_1,\cdots,x_n,t_1+τ,\cdots,t_n+τ)\) for any \(τ\). This means \(X(t)\) and \(X(t+τ)\) have the same distribution for any \(τ\) and for any order \(n\). For the first-order, \(F_X(x,t)=F_X(x,t+τ)\); for the second-order, \(F_X(x_1,x_2,t_1,t_2)=F_X(x_1,x_2,t_1+τ,t_2+τ)=F_X(x_1,x_2,t_2-t_1)\). The distributions of non-stationary processes depend on the points \(t_1,\cdots,t_n\).
3. Wide-sense stationary process (WSS) If the process \(\{X(t),t∈T\}\) is stationary to order 2, then \(X(t)\) is wide (weak)-sense stationary. In this case, \(E(X(t))=μ_X,R_X(s,t)=R_X(|s-t|)\). It is clear that a stationary (strict-sense) process is a WSS.
4. Independent process A process \(\{X(t),t∈T\}\) is an independent random process if \(X(t_i),i=1,2,\cdots,n\) are independent random variables for all \(n\). In this case, the first-order distributions of \(X(t)\) suffice to characterize the process.
5. Processes with stationary independent increments A random process \(\{X(t),t∈T\}\) has independent increments if \(X(0),X(t_1)-X(0),X(t_2)-X(t_1),\cdots,X(t_n)-X(t_{n-1})\) are independent for \(0< t_1<\cdots< t_n\). If \(\{X(t),t∈T\}\) has independent increments and \(X(t)-X(s)\) has the same distribution as \(X(t+τ)-X(s+τ)\) for all \(t,s,τ≥0\), then \(X(t)\) has stationary independent increments. Poisson and Wiener processes have stationary independent increments. Suppose \(X(0)=0\). Then \(E(X(t))=E[X(1)]t=μt,\text{Var}(X(t))=\text{Var}(X(1))t=σ^2t\).
6. Markov process A stochastic process \(\{X(t),t∈T\}\) is a Markov process if \(P(X(t_{n+1})≤x_{n+1}|X(t_1)=x_1,X(t_2)=x_2,\cdots,X(t_n)=x_n)=P(X(t_{n+1})≤x_{n+1}|X(t_n)=x_n)\) for \(t_1< t_2<\cdots< t_n< t_{n+1},x_i∈{\cal X}\) for all \(i\). The property \(P(X(t_{n+1})≤x_{n+1}|X(t_n),\cdots,X(t_1))=P(X(t_{n+1})≤x_{n+1}|X(t_n))\) is called the Markov (memoryless) property, which means the future state of the process depends only on the present state and not on the past history. Using the Markov property, the \(n\)th-order distribution \(F_X(x_1,\cdots,x_n;t_1,\cdots,t_n)=F_X(x_1,t_1)∏_{k=2}^nP(X(t_k)≤x_k|X(t_{k-1})=x_{k-1})\), which is determined by the second-order distributions. The joint density can be written as \(f(x_1,\cdots,x_n)=f(x_1)f(x_2|x_1)f(x_3|x_2)\cdots f(x_n|x_{n-1})\). Note that any process with independent increments is a Markov process.
7. Normal process A stochastic process \(\{X(t),t∈T\}\) is a normal process if its \(n\)th-order distribution is a multivariate normal distribution for any integer \(n\). A normal process is determined by the second-order distributions. So if a normal process is a WSS, it is also a stationary process.
2. Discrete Time Markov Chains (DTMC)
In a DTMC, the Markov property can be written as \(P(X_{n+1}=j|X_0=i_0,X_1=i_1,\cdots,X_n=i)=P(X_{n+1}=j|X_n=i)\), and the quantities \(P(X_{n+1}=j|X_n=i)\) are known as the one-step transition probabilities (from one state to another). If, in addition, \(P(X_{n+1}=j|X_n=i)\) does not depend on \(n\), the process is called a homogeneous Markov chain, and it has stationary transition probabilities. Note, however, that a Markov process in general is not a stationary process. In the following sections, we focus on finite-state, time-homogeneous, discrete-time Markov chains \(\{X_n,n≥0\}\).
1. Transition probabilities The transition probability matrix \({\bf P}=[p_{ij}]\) is a square matrix with \(p_{ij}=P(X_{n+1}=j|X_n=i)≥0\) for \(i,j≥0\) and integer \(n≥0\). The sum of each row of \({\bf P}\) is 1. \({\bf P}\) is also called Markov (stochastic) matrix.
Given the one-step transition matrix "P" and state space "S" for a finite, time-homogeneous discrete-time Markov chain, we can calculate the probability of moving from state "i" at time "n" to state "j" at time "n+1" by "stt;dmc(S,P);P(x(n+1),j|x(n),i)", in which the function "dmc(S,P)" (discrete Markov chain) returns a chain with state space "S" and transition matrix "P", and "P(x(n+1),j|x(n),i)" represents the probability \(P(X_{n+1}=j|X_n=i)\). Probabilities are calculated based on indexes rather than state names. It is always better to use rational numbers instead of floating-point numbers for the probabilities in the transition matrix to avoid errors. Examples: stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(5),2|x(1),1) || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(5),1|x(1)!=2) || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(7)>1|x(1)<=2) || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);P(x(10),1|x(1)>1) ||. Note that "P(x(5),2|x(1),1)" represents \(P(X_5=2|X_1=1)\), "P(x(5),1|x(1)!=2)" represents \(P(X_5=1|X_1≠2)\), and "P(x(7)>1|x(1)<=2)" represents \(P(X_7>1|X_1≤2)\).
Example If a chain has states {1, 2, 3}, initial distribution π = (0.3, 0.3, 0.4), and transition matrix P=[(0.2,0.3,0.5),(0.3,0.3,0.4),(0.2,0.2,0.6)], then \(P(X_0=1,X_1=2,X_2=3)=P(X_0=1)P(X_1=2|X_0=1)P(X_2=3|X_1=2)=0.3p_{12}p_{23}=0.3(0.3)(0.4)=0.036\).
The keyword "tp" returns a transition probability matrix P, which is reordered so that recurrent states appear first and transient states appear last. Other representations include inserting transient states first and recurrent states last. Examples: stt;dmc((1,2,3,4,5),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0),(0,0,1,0,0),(0,0,1/2,1/2,0), (1/2,0,0,0,1/2)]);tp || stt;dmc((0,1,2,3,4),[(0,1/2,1/2,0,0),(0,0,1,0,0),(1/2,0,1/2,0,0),(0,1,0,0,0),(0,3/10,0,3/10,4/10)]);tp || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);tp || stt;dmc((1,2,3),[(1/2,0,1/2),(1/2,1/2,0),(0,1/2,1/2)]);tp || stt;dmc((1,2,3,4),[(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]);tp ||.
2. n-step transition matrix For a homogeneous Markov chain, the \(n\)-step transition probability \(p_{ij}(n)=P(X_n=j|X_0=i)\). According to Chapman-Kolmogorov equation \({\bf P}^{n+m}={\bf P}^n{\bf P}^m\) or \(p_{ij}(n+m)=\sum\limits_kp_{ik}(n)p_{kj}(m)\), the transition from states \(i\) to \(j\) in \(n+m\) steps can be achieved by (1) moving \(i\) to \(k\) in \(n\) steps and (2) moving \(k\) to \(j\) in \(m\) steps, and the events in steps (1) and (2) are independent. Examples: stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);pn;10 || stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);P(x(8),2|x(3),1) || stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);P(x(8)>0|x(3),0) ||.
3. Marginal probabilities Let \({\bf π}(0)=[π_0(0),π_1(0),\cdots,π_K(0)]\) be the vector of the initial-state probability distribution, i.e., the marginal distribution (probability mass function) of \(X_0=X(0)\), where the states are labeled \(0,1,\cdots,K\). After \(n\) transition steps, the marginal distribution of \(X_n\) can be calculated by \({\bf π}(n)=[π_0(n),π_1(n),\cdots,π_K(n)]={\bf π}(0){\bf P}^n\), which implies the probability distribution of \(X_n\) is completely determined by the transition matrix \({\bf P}\) and the initial-state distribution vector \({\bf π}(0)\). Use "stt;dmc(S,P);md;n;π(0)" to calculate the marginal distribution after n steps, where "md" (marginal distribution) is the keyword, "π(0)" is the initial distribution vector, "S" the state space, and "P" the transition matrix. Examples: stt;dmc((0,1,2),[[0.5,0.2,0.3],[0.2,0.5,0.3],[0.2,0.3,0.5]]);md;8;(0.3,0.3,0.4) ||.
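Both the n-step transition matrix and the marginal distribution after n steps reduce to matrix powers; a NumPy sketch with the chain used above:

import numpy as np

P = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.5, 0.3],
              [0.2, 0.3, 0.5]])
P10 = np.linalg.matrix_power(P, 10)            # 10-step transition matrix P^10
pi0 = np.array([0.3, 0.3, 0.4])                # initial distribution pi(0)
pi8 = pi0 @ np.linalg.matrix_power(P, 8)       # marginal distribution of X_8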
4. Stationary and limiting distributions The probability vector \({\bf π}\) is called a stationary distribution for a homogeneous Markov chain if \({\bf πP}={\bf π}\). If \(X_0\sim {\bf π}\) is drawn from a stationary \({\bf π}\), then \(X_1\sim {\bf π}{\bf P}={\bf π}\). Continuing this process, \(X_n\sim {\bf π}{\bf P}^n={\bf π}\). This implies that once the chain has the distribution \({\bf π}\), it will have the distribution \({\bf π}\) forever.
A Markov chain is regular if there exists a positive integer \(m\) such that \({\bf P}^m={\bf A}=[a_{ij}]\) has all entries \(a_{ij}>0\). For a regular homogeneous finite-state Markov chain \(\{X_n,n≥0\},\lim\limits_{n→∞}{\bf P}^n=\hat{\bf P}\), where \(\hat{\bf P}\) is a matrix whose rows are identical and equal to the stationary distribution \({\bf π}\) of \(X_n\). That is, the limit \(π_j=\lim\limits_{n→∞}p_{ij}(n)\) exists and is independent of \(i\). An irreducible, ergodic Markov chain has a unique stationary distribution \({\bf π}\); the limiting distribution exists and equals \({\bf π}\). If \(π_ip_{ij}=π_jp_{ji}\), then \({\bf π}\) satisfies detailed balance. Detailed balance guarantees that \({\bf π}\) is stationary. This is because the \(j\)th element of \({\bf π}{\bf P}\) is \(\sum_iπ_ip_{ij}=\sum_iπ_jp_{ji}=π_j\sum_ip_{ji}=π_j\).
The keyword "ld" (limiting distribution) returns the limiting distribution of the chain. Examples: stt;dmc((1,2),((1-a,a),(b,1-b)));ld || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);ld || stt;dmc((1,2),[(4/10,6/10),(3/10,7/10)]);ld || stt;dmc((1,2),[(0,1),(2/10,8/10)]);ld || stt;dmc((1,2,3),[(1/2,1/2,0),(4/5,1/5,0),(1,0,0)]);ld || stt;dmc((1,2,3),[(1/2,1/2,0),(4/5,1/5,0),(0,0,1)]);ld || (not unique).
5. Classification of states State \(j\) is accessible from state \(i\) if \(p_{ij}(n)>0\) for some \(n≥0\). If two states are accessible to each other, we say they communicate. If all states communicate with each other, we say the Markov chain is irreducible. A set of states is closed if once the chain enters that set it never leaves. A closed set consisting of a single state is called an absorbing state. State \(i\) is recurrent if \(P(X_n=i\text{ for some }n≥1|X_0=i)=1\), that is, starting from \(i\) the chain returns to \(i\) with probability 1. Otherwise, state \(i\) is transient. Suppose the chain starts in state \(i\). The probability of a first visit to state \(j\) in exactly \(m\) steps is \(f_{ij}(m)=P(X_m=j,X_k≠j,k=1,\cdots,m-1|X_0=i)\). Then \(f_{ij}(0)=0\), \(f_{ij}(1)=P(X_1=j|X_0=i)=p_{ij}\), and for \(m>1\), \(f_{ij}(m)=\sum\limits_{k≠j}p_{ik}f_{kj}(m-1)\). Let \(I_m=1\) if \(X_m=i\) and 0 otherwise. The number of times the chain is in state \(i\) is \(Y=\sum_{m=1}^∞I_m\), and \(E(Y|X_0=i)=\displaystyle\sum_{m=1}^∞E(I_m|X_0=i)=\sum_{m=1}^∞P(X_m=i|X_0=i)=\sum_{m=1}^∞p_{ii}(m)\). Thus, state \(i\) is recurrent if and only if \(\sum_{m=0}^∞p_{ii}(m)=∞\), and state \(i\) is transient (nonrecurrent) if and only if \(\sum_{m=0}^∞p_{ii}(m)<∞\).
Suppose \(X_0=i\). Let \(T_j\) denote the recurrence time, i.e., the number of steps until the first visit to state \(j\) after time zero. If state \(j\) is never visited, \(T_j=∞\). Otherwise, \(E(T_i|X_0=i)=\sum\limits_nnf_{ii}(n)\). A recurrent state is null if \(E(T_i|X_0=i)=∞\). Otherwise, it is called non-null or positive. If a state \(i\) is null and recurrent, then \(p_{ii}(n)→0\) as \(n→∞\). In a finite-state Markov chain, all recurrent states are positive.
If state \(i\) is recurrent and states \(i,j\) communicate or \(i↔j\), then \(j\) is recurrent; if state \(i\) is transient and \(i↔j\), then \(j\) is transient. A finite Markov chain must have at least one recurrent state. The states of a finite, irreducible Markov chain are all recurrent.
The keyword "cc" (communication class) returns the list of communication (recurrence) classes that partition the states of the Markov chain. The classes are a list of tuples. Each tuple represents a single communication class with its properties. The first element in the tuple is the list of states in the class, the second element is whether the class is recurrent and the third element is the period of the communication class. Examples: stt;dmc((1,2,3),[(0,1,0),(1,0,0),(1,0,0)]);cc ||. The results show that states 1 and 2 communicate, are recurrent and have a period of 2. State 3 is transient with a period of 1. Check more examples stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);cc || stt;dmc((1,2),[(4/10,6/10),(3/10,7/10)]);cc || stt;dmc((1,2,3),[(1/2,0,1/2),(1/2,1/2,0),(0,1/2,1/2)]);cc || stt;dmc((1,2,3,4),[(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]); cc || stt;dmc((1,2,3,4,5),[(3/10,4/10,0,0,3/10),(0,1,0,0,0),(0,0,0,6/10,4/10),(0,0,0,0,1),(0,0,1,0,0)]);cc ||.
6. Periodicity The period \(d\) of state \(i\) is the greatest common divisor of the transition steps \(n≥1\) such that \(p_{ii}(n)>0\). If \(d>1\), the state \(i\) is called periodic; if \(d=1\), the state \(i\) is aperiodic. Note that whenever \(p_{ii}=p_{ii}(1)>0\), state \(i\) is aperiodic. State \(i\) is an absorbing state if \(p_{ii}=1\). This means once state \(i\) is reached, it is never left. Examples: stt;dmc((1,2,3),((0,1,0),(0,0,1),(1,0,0)));cc ||. Suppose the chain starts in state "3"; it will return to state "3" at times 3, 6, 9, .... The period of state "3" is \(d=3\) and \(p_{33}(n)=0\) when \(n\) is not divisible by 3. Check stt;dmc((1,2,3,4), [(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]);cc || stt;dmc((1,2,3,4),[(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]);P(x(3),1|x(0),1) || stt;dmc((1,2,3,4),[(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]);P(x(3),3|x(0),3) || stt;dmc((1,2,3,4,5),[(3/10,4/10,0,0,3/10),(0,1,0,0,0),(0,0,0,6/10,4/10), (0,0,0,0,1),(0,0,1,0,0)]);cc || stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);cc ||.
If a state \(i\) has period \(d\) and \(i↔j\), then state \(j\) has period \(d\). A state is ergodic if it is recurrent, non-null and aperiodic. A chain is ergodic if all its states are ergodic.
7. Decomposition The state space \(\cal X\) can be written as the disjoint union \({\cal X=X}_T∪{\cal X}_1∪\cdots\), where \({\cal X}_T\) contains the transient states and each \({\cal X}_i\) is a closed, irreducible set of recurrent states. The transition matrix P can be decomposed into 4 submatrices with special properties: (1) submatrix A from recurrent states to recurrent states, (2) submatrix B from transient to recurrent states, (3) submatrix C from transient to transient states, and (4) a submatrix of zeros for recurrent to transient states. The keyword "dc" (decomposition) gives the submatrices A, B, and C in that order. Examples: stt;dmc((0,1,2,3,4),[(1/2,1/2,0,0,0),(2/5,1/5,2/5,0,0), (0,0,1,0,0),(0,0,1/2,1/2,0),(1/2,0,0,0,1/2)]);dc ||. Submatrix A = [1] is from recurrent to recurrent; B = (0, 0.4, 0.5, 0)\(^T\) from transient to recurrent; C = [(0.5, 0.5, 0,0),(0.4, 0.2, 0, 0),(0, 0, 0.5, 0),(0.5, 0, 0, 0.5)] from transient to transient. This means that state "2" is the only recurrent (in fact absorbing) state (since A is a 1 x 1 matrix), B is a 4 x 1 matrix since the 4 remaining transient states can only be absorbed into state "2", and C is the 4 x 4 matrix that shows how the transient states 0, 1, 3, 4 interact. Check stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);dc || stt;dmc((1,2,3,4,5),[(3/10,4/10,0,0,3/10),(0,1,0,0,0),(0,0,0,6/10,4/10),(0,0,0,0,1),(0,0,1,0,0)]);dc ||.
8. Conditional expectation The result from stt;dmc((1,2,3),[(0.2,0.3,0.5),(0.3,0.5,0.2),(0.5,0.3,0.2)]);E(x(5)|x(1),2) || is the conditional expectation \(E(x(5)|x(1)=2)\).
9. Absorbing probabilities Consider a Markov chain \(X_n\) with finite state space \(\{1,2,\cdots,N\}\) and transition matrix \({\bf P}\). Let \(A=\{1,\cdots,m\}\) be the absorbing states and \(B=\{m+1,\cdots,N\}\) be the non-absorbing states. Express the transition matrix as \({\bf P}=\begin{bmatrix}I&{\bf 0}\\R&Q\end{bmatrix}\), where \(I\) is an \(m×m\) identity matrix, the entries in \(R\) are the one-step transition probabilities from non-absorbing states to absorbing states, and the elements in matrix \(Q\) are the transition probabilities among the non-absorbing states. The matrix \((I-Q)^{-1}\) is called the fundamental matrix of the Markov chain \(X_n\). If we compute \(u_{kj}=P(X_n=j∈A|X_0=k∈B)\), then \(U=[u_{kj}]=(I-Q)^{-1}R\) is an \((N-m)×m\) matrix of the absorption probabilities for the various absorbing states. Use the keyword "ap" (absorption probabilities) to compute the absorbing probabilities. The ij-th entry of the matrix denotes the probability of the Markov chain being absorbed in state "j" starting from state "i". Check stt;dmc((1,2,3),[(1/2,1/2,0),(4/5,1/5,0),(1,0,0)]);ap ||.
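The formula \(U=(I-Q)^{-1}R\) is easy to evaluate directly; the sketch below uses a hypothetical random walk on {1, 2, 3, 4} with absorbing states 1 and 4:

import numpy as np

Q = np.array([[0, 0.5],
              [0.5, 0]])                 # transitions among non-absorbing states 2, 3
R = np.array([[0.5, 0],
              [0, 0.5]])                 # transitions from 2, 3 into absorbing states 1, 4
N = np.linalg.inv(np.eye(2) - Q)         # fundamental matrix (I - Q)^{-1}
U = N @ R                                # absorption probabilities: [[2/3, 1/3], [1/3, 2/3]]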
10. Fundamental matrix Each entry of the fundamental matrix \((I-Q)^{-1}\) can be interpreted as the expected number of times the chain is in state "j" given that it started in state "i". Use the keyword "fm" (fundamental matrix) to obtain this matrix. Check stt;dmc((1,2,3,4),[(0,0,1/2,1/2),(1,0,0,0),(0,1,0,0),(0,1,0,0)]);fm || stt;dmc((1,2,3,4,5),[(3/10,4/10,0,0,3/10),(0,1,0,0,0),(0,0,0,6/10,4/10),(0,0,0,0,1),(0,0,1,0,0)]);fm ||.
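As an illustration of items 9 and 10 outside the calculator, here is a Python/NumPy sketch for a hypothetical gambler's-ruin chain (our own example, not one of the chains above) on states 0, 1, 2, 3 with win probability 0.4, where 0 and 3 are absorbing. Ordering the states as (0, 3, 1, 2) puts \({\bf P}\) in the block form [[I, 0], [R, Q]].

import numpy as np

R = np.array([[0.6, 0.0],     # one-step probabilities from transient states 1, 2
              [0.0, 0.4]])    # into the absorbing states 0, 3
Q = np.array([[0.0, 0.4],     # one-step probabilities among transient states 1, 2
              [0.6, 0.0]])

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visit counts
U = N @ R                          # absorption probabilities, one row per transient state
print(N)
print(U)                           # each row of U sums to 1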
11. Inference for Markov chains Consider a finite-state MC \(X_n\) with state space \({\cal X}=\{1,2,\cdots,N\}\). The unknown parameters of the chain are the elements of \({\bf π}\) and \({\bf P}\). Suppose \(X_1,\cdots,X_n\) are observed from the chain. The likelihood function is \(L({\bf π},{\bf P})=π(x_1)\prod_{i=1}^N\prod_{j=1}^N p_{ij}^{n_{ij}}\), where \(n_{ij}\) is the observed number of transitions from state \(i\) to state \(j\). The MLE of \({\bf P}\) is obtained by maximizing the likelihood subject to the constraint that the elements are non-negative and each row sums to 1. The solution is \(\hat{p}_{ij}=\frac{n_{ij}}{n_i}\), where \(n_i=\sum_{j=1}^N n_{ij}\); if \(n_i=0\), set \(\hat{p}_{ij}=0\).
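The MLE \(\hat{p}_{ij}=n_{ij}/n_i\) is easy to compute from an observed path. A minimal Python/NumPy sketch (the path below is a made-up illustration, not real data):

import numpy as np

def mle_transition_matrix(path, states):
    """MLE p_hat[i, j] = n_ij / n_i from an observed state sequence."""
    idx = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for a, b in zip(path[:-1], path[1:]):
        counts[idx[a], idx[b]] += 1
    n_i = counts.sum(axis=1, keepdims=True)
    # Rows with n_i = 0 stay all zeros, matching the convention above.
    return np.divide(counts, n_i, out=np.zeros_like(counts), where=n_i > 0)

path = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1, 2]
print(mle_transition_matrix(path, states=(1, 2, 3)))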
12. Simulating Markov chains Suppose the initial distribution is \({\bf π}\) and the transition matrix is \({\bf P}\). (1) Simulate \(X_0\) from \({\bf π}\) and suppose \(X_0=x_0\). (2) Simulate \(X_1\) from \(P(X_1=x_1|X_0=x_0)\) using the row of \({\bf P}\) that corresponds to state \(x_0\). (3) Simulate \(X_2\) from \(P(X_2=x_2|X_1=x_1)\) using the row of \({\bf P}\) that corresponds to \(x_1\). (4) Repeat steps (2) and (3).
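A minimal Python/NumPy sketch of steps (1)-(4); the helper name is ours, and the "rvs" keyword shown earlier is the tool's own way of doing the same thing.

import numpy as np

def simulate_dmc(states, P, pi, n, seed=None):
    """Simulate n steps of a discrete-time Markov chain with initial distribution pi."""
    rng = np.random.default_rng(seed)
    states, P, pi = np.asarray(states), np.asarray(P), np.asarray(pi)
    x = [rng.choice(states, p=pi)]                    # step (1): draw X0 from pi
    for _ in range(n - 1):
        i = int(np.where(states == x[-1])[0][0])      # row of P for the current state
        x.append(rng.choice(states, p=P[i]))          # steps (2)-(3): draw the next state
    return x

P = [[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.2, 0.3, 0.5]]
print(simulate_dmc((0, 1, 2), P, pi=(1, 0, 0), n=20, seed=1))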
3. Continuous Time Markov Chains (CTMC)
In a CTMC, transitions do not occur at fixed integer time steps; instead, the holding time in each state before the next transition is a positive, exponentially distributed random variable. A Poisson process can be regarded as a special CTMC with states 0, 1, 2, ... that always proceeds from state \(i\) to state \(i+1\) with rate λ. Such a process is known as a pure birth process since whenever a transition occurs, the number of occurrences increases by 1.
Let \(X(t)=i\) represent the number of people in a system at time \(t\) for integers \(i≥0\). Suppose that people enter the system at an exponential rate \(λ_i\) and leave the system at an exponential rate \(μ_i\). Then the continuous-time process \(\{X(t),t≥0\}\) is called a birth and death process with birth rates \(\{λ_i\}_{i=0}^∞\), death rates \(\{μ_i\}_{i=0}^∞\), and \(μ_0=0\). For any birth and death process, the transition rates are \(v_0=λ_0,v_i=λ_i+μ_i\) and the transition probabilities are \(p_{0,1}=1,p_{i,i+1}=\frac{λ_i}{λ_i+μ_i},p_{i,i-1}=\frac{μ_i}{λ_i+μ_i}\) for \(i>0\). Thus, the Poisson process is a birth and death process with rates \(λ_i=λ,μ_i=0\) for all \(i\). The Yule process is a pure birth process with linear rates \(λ_i=iλ,μ_i=0\) for all \(i\).
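A birth and death process can be simulated directly from these rates: hold in state \(i\) for an Exp\((λ_i+μ_i)\) time, then move up with probability \(λ_i/(λ_i+μ_i)\) and down otherwise. A minimal Python/NumPy sketch (function and argument names are ours; it assumes the total rate is positive in every visited state):

import numpy as np

def simulate_birth_death(lam, mu, t_end, x0=0, seed=None):
    """Simulate a birth-death process with birth rate lam(i), death rate mu(i), mu(0)=0."""
    rng = np.random.default_rng(seed)
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        rate = lam(x) + mu(x)                    # total transition rate v_i
        t += rng.exponential(1 / rate)           # exponential holding time in state x
        if t > t_end:
            return path
        x += 1 if rng.random() < lam(x) / rate else -1   # up w.p. lam_i / (lam_i + mu_i)
        path.append((t, x))

# Poisson process as a pure birth process: lam(i) = 3, mu(i) = 0 for all i.
print(simulate_birth_death(lambda i: 3.0, lambda i: 0.0, t_end=2.0, seed=1))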
The function "cmc(S,P)" represents continuous time Markov chain. Check stt;cmc((0,1),((-1,1),(1,-1)));P(x(1.96),0|x(0.78),1) || stt;cmc((0,1),((-1,1),(1,-1)));P(x(1.7)>0|x(0.82),1) || stt;cmc((0,1),((-1,1),(1,-1)));ld || stt;cmc((1,2,3),((-6,3,3),(4,-12,8),(15,3,-18)));P(x(3.12),3|x(1)>1) || stt;cmc((1,2,3),((-6,3,3),(4,-12,8),(15,3,-18)));ld ||.
4. Poisson Process
The Poisson process is often used to model the number of occurrences of events such as the number of patient arrivals at the ER or vehicles passing through an intersection during a certain period of time, assuming the average occurrence of those events is known.
1. Renewal, arrival and counting processes Let \(T_i\) be a random variable representing the time at which the \(i\)th event occurs for \(i=1,2,\cdots\). The values of the \(T_i\)'s are called arrival times and form an increasing sequence of positive numbers. Let \(W_n=T_n-T_{n-1}\) with \(T_0=0\). Then \(W_n\) is the time between the \((n-1)\)th and \(n\)th events, and the sequence \(\{W_n,n≥1\}\) is called the inter-arrival times of the counting process. If the \(W_n\) are i.i.d., the sequence is called a renewal process or recurrent process. Since \(T_n=W_1+W_2+\cdots+W_n\), the sequence \(\{T_n,n≥1\}\) is called an arrival process. A random process \(\{X(t),t≥0\}\) is a counting process if \(X(t)\) represents the total number of "events" that have occurred in the time interval \((0,t)\). It is clear that \(X(s)≤X(t)\) for \(s< t\) and that \(X(t)-X(s)\) equals the number of events that have occurred in the interval \((s,t)\).
2. Poisson process A counting process \(X(t)\) with \(X(0)=0\) is a Poisson process with rate \(λ>0\) if \(X(t)\) has independent increments and the number of events in any interval of length \(t\) follows a Poisson distribution with mean \(E(X(t))=\text{Var}(X(t))=λt\), that is, \(P[X(t+s)-X(s)=n]=\frac{(λt)^ne^{-λt}}{n!}\) for \(s,t>0,\ n=0,1,\cdots\). The expected number of events in any interval of unit length is \(λ\). Equivalently, a counting process \(X(t)\) with \(X(0)=0\) is a Poisson process with rate \(λ>0\) if \(X(t)\) has independent and stationary increments, \(P(X(t+Δt)-X(t)=1)=λΔt+o(Δt)\), and \(P(X(t+Δt)-X(t)≥2)=o(Δt)\). Consequently, \(P(X(t+Δt)-X(t)=0)=1-λΔt+o(Δt)\), which implies that the probability that no event occurs in a short interval of time approaches 1 as \(Δt→0\).
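For instance, with λ = 3 the probability of exactly 4 events in an interval of length 2 is \(6^4e^{-6}/4!\). A quick Python/SciPy check of the formula (SciPy's Poisson distribution is parameterized by its mean λt):

from math import exp, factorial
from scipy.stats import poisson

lam, t, n = 3.0, 2.0, 4
by_formula = (lam * t)**n * exp(-lam * t) / factorial(n)   # P(X(t) = n)
by_scipy = poisson.pmf(n, mu=lam * t)
print(by_formula, by_scipy)                                # both approximately 0.1339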
In a Poisson process with average occurrence rate \(λ>0\) (average number of events per unit time), the inter-arrival times \(W_n\) between successive events are i.i.d. exponential r.v.'s with parameter \(λ\) (i.e., \(\{W_n,n≥1\}\) is a renewal process). As a result, the arrival time \(T_n=W_1+\cdots+W_n\) has a Gamma\((n,λ)\) distribution with parameters \(a=n,b=λ\). The autocorrelation of \(X(t)\) is \(R_X(s,t)=λ\min(s,t)+λ^2st\), and the autocovariance is \(K_X(s,t)=σ^2\min(s,t)=λ\min(s,t)\) because of the independent stationary increments.
The function "ppr(λ)" (Poisson process) create a Poisson process with rate λ. So stt;ppr(3);P(x(t)<2) || returns the probability \(P(X(t)<2)=(3t+1)e^{-3t}\), a function of \(t\). Check more examples stt;ppr(3);P(x(1),0) || stt;ppr(3);P(x(1)>0) || stt;ppr(3);P(x(3)<1|x(1),0) || stt;ppr(3);P(x(4),3|x(2),3) || stt;ppr(3);P(x(2)<=3|x(1)>1) ||.
3. Simulating a Poisson process with rate λ (1) Simulate random numbers \(W_i\) from an exponential distribution with parameter λ for \(i=1,2,\cdots\); \(W_i\) is the inter-arrival (waiting) time between the \((i-1)\)th and \(i\)th events. (2) Set \(T_i=\sum_{j=1}^iW_j\); \(T_i\) is the arrival time at which the number of events \(X(t)\) increases by 1 for the \(i\)th time. Note that each arrival time \(T_i\) has a Gamma distribution with parameters \(i\) and λ.
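A minimal Python/NumPy sketch of this algorithm (the helper name is ours; the tool's own "rvs" syntax is described next):

import numpy as np

def poisson_arrival_times(lam, n, seed=None):
    """Arrival times T_1, ..., T_n of a rate-lam Poisson process."""
    rng = np.random.default_rng(seed)
    W = rng.exponential(scale=1 / lam, size=n)   # i.i.d. inter-arrival (waiting) times
    return np.cumsum(W)                          # T_i = W_1 + ... + W_i

print(poisson_arrival_times(lam=0.5, n=10, seed=1))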
Use "stt;ppr(λ);rvs(n)" to simulate the arrival times for "n" events of a Poisson process, and use "stt;ppr(λ);rvs(n);W" to simulate the inter-arrival times for "n" events. For example, stt;ppr(0.5);rvs(50) || stt;ppr(10);rvs(100) ||. Note that the resulting lists of numbers are not the numbers of events, but the arrival times that events 1, 2, ..., n occur. Similarly, the resulting lists from "stt;ppr(3);rvs(50);W" or "stt;ppr(0.5);rvs(25);W" are the inter-arrival times.
5. Wiener Process
A random process \(\{X(t),t≥0\}\) is a Wiener (Brownian motion) process if \(X(t)\) has independent stationary increments, \(X(t)-X(s)\) is normally distributed for \(t>s\), \(E(X(t))=0\), and \(X(0)=0\). As a result, \(\text{Var}(X(t))=σ^2t\) and \(K_X(s,t)=R_X(s,t)=σ^2\min(s,t)\) for \(s,t≥0\), where \(σ^2\) is the parameter of the Wiener process. If \(σ=1\), the process is called the standard Wiener process. The function "wpr" (Wiener process) creates a standard Wiener process with mean \(E(X(t))=0\) and \(\text{Var}(X(t))=t\). The example "stt;wpr;P(x(t)<7)" returns the probability \(P(X(t)<7)\). Check more examples stt;wpr;P(x(1)<7) || stt;wpr;P(x(1)< x(2)) || stt;wpr;P(x(3)>x(2)) || stt;wpr;P(x(4)<2) || stt;wpr;P(x(4)<2|x(1)>0) ||.
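Because \(X(t)∼N(0,σ^2t)\), probabilities such as \(P(X(4)<2)\) reduce to normal cdf evaluations, and a path can be simulated from independent normal increments. A Python/SciPy sketch for the standard case σ = 1 (the grid step dt is an arbitrary choice for illustration):

import numpy as np
from scipy.stats import norm

print(norm.cdf(2, loc=0, scale=np.sqrt(4)))      # P(X(4) < 2) since X(4) ~ N(0, 4)

rng = np.random.default_rng(1)
dt = 1e-3
path = np.cumsum(rng.normal(0.0, np.sqrt(dt), size=1000))   # a simulated path on [0, 1]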
6. Gamma Process
A Gamma process \(\{X(t),t≥0\}\) is a random process with independent, gamma-distributed increments. The parameters of a Gamma process are the parameters \(a>0,b>0\) of a Gamma distribution, so that \(E(X(t))=μt=\frac{bt}{a}\) and \(\text{Var}(X(t))=σ^2t=\frac{bt}{a^2}\). Use the function "gpr(a,b)" to create a Gamma process. Check stt;gpr(1,2);P(x(t)>1) || stt;gpr(2,3);P(x(1)>2) ||.
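Under the moments stated above (\(E(X(t))=bt/a\), \(\text{Var}(X(t))=bt/a^2\)), the increment over an interval of length Δt can be taken as Gamma with shape \(bΔt\) and rate \(a\); this parameterization is our assumption for the sketch below, which simulates one path by summing independent gamma increments.

import numpy as np

def simulate_gamma_process(a, b, t_grid, seed=None):
    """Gamma process with E(X(t)) = b*t/a via independent Gamma(shape=b*dt, rate=a) increments."""
    rng = np.random.default_rng(seed)
    dt = np.diff(t_grid, prepend=0.0)
    increments = rng.gamma(shape=b * dt, scale=1 / a)   # NumPy's scale = 1 / rate
    return np.cumsum(increments)

t_grid = np.linspace(0.1, 5, 50)
print(simulate_gamma_process(a=1, b=2, t_grid=t_grid)[-1])   # one realization of X(5)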