Sampling from Urns - Elementary Probability and the prob Package G. Jay Kerns January 9, 2008

Perhaps the most fundamental of statistical experiments consists of drawing distinguishable objects from an urn. The prob package addresses this topic with the urnsamples(x, size, replace, ordered) function. The argument x represents the urn from which sampling is to be done. The size argument tells how large the sample will be. The ordered and replace arguments are logical and specify how sampling will be performed. We will discuss each in turn. In the interest of saving space, for this example let our urn simply contain three balls, labeled 1, 2, and 3, respectively. We are going to take a sample of size 2.

Ordered, With Replacement

If sampling is with replacement, then we can get any outcome 1, 2, 3 on any draw. Further, by “ordered” we mean that we shall keep track of the order of the draws that we observe. We can accomplish this in R with

> urnsamples(1:3, size = 2, replace = TRUE, ordered = TRUE)
  X1 X2

Notice that rows 2 and 4 are identical, save for the order in which the numbers are shown. Further, note that every possible pair of the numbers 1 through 3 is listed. This experiment is equivalent to rolling a 3-sided die twice, which we could have accomplished with rolldie(2, nsides = 3).
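As a cross-check that does not depend on the prob package, the same nine ordered pairs can be listed with base R's expand.grid(). This is only a sketch of the idea, not a description of how urnsamples() is implemented; the name ordered_with_repl is made up for illustration.

```r
# Base R sketch: all ordered samples of size 2, drawn with replacement from the urn 1:3.
# expand.grid() forms every combination of its arguments, varying the first column fastest.
ordered_with_repl <- expand.grid(X1 = 1:3, X2 = 1:3)

ordered_with_repl
nrow(ordered_with_repl)   # 3^2 = 9 outcomes
```

With this construction, rows 2 and 4 are (2, 1) and (1, 2): the same two numbers in opposite order.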

Ordered, Without Replacement

Here sampling is without replacement, so we may not observe the same number twice in any row. Order is still important, however, so we expect to see the outcomes 1, 2 and 2, 1 somewhere in our data frame as before.

> urnsamples(1:3, size = 2, replace = FALSE, ordered = TRUE)
  X1 X2
1  1  2
2  2  1
3  1  3
4  3  1
5  2  3
6  3  2

This is just as we expected. Notice that there are fewer rows in this answer, due to the restricted sampling procedure. If the numbers 1, 2, and 3 represented “Fred”, “Mary”, and “Sue”, respectively, then this experiment would be equivalent to selecting two people of the three to serve as president and vice-president of a company, respectively, and the sample space lists all possible ways that this could be done.
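The drop from nine rows to six can be reproduced in base R by starting from all ordered pairs and discarding those that repeat a ball. This is a sketch under that assumption only; the names all_pairs and no_repeats are hypothetical, and the row order differs from the urnsamples() output above.

```r
# Base R sketch: ordered samples of size 2 WITHOUT replacement from the urn 1:3.
all_pairs <- expand.grid(X1 = 1:3, X2 = 1:3)

# Drop the rows in which the same ball appears on both draws.
no_repeats <- all_pairs[all_pairs$X1 != all_pairs$X2, ]

no_repeats
nrow(no_repeats)   # 3 * 2 = 6 outcomes
```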

Unordered, Without Replacement

Again, we may not observe the same outcome twice, but in this case, we will only keep those outcomes which (when jumbled) would not duplicate earlier ones.

> urnsamples(1:3, size = 2, replace = FALSE, ordered = FALSE)
  X1 X2
1  1  2
2  1  3
3  2  3

This experiment is equivalent to reaching in the urn, picking a pair, and looking to see what they are. This is the default setting of urnsamples(), so we would have received the same output by simply typing urnsamples(1:3,2).
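For comparison, base R's combn() enumerates the same unordered pairs. The sketch below is an independent illustration, not the internals of urnsamples(); the name pairs is made up here.

```r
# Base R sketch: unordered samples of size 2, without replacement, from the urn 1:3.
# combn() returns one combination per column, so transpose to get one per row.
pairs <- t(combn(1:3, 2))

pairs
nrow(pairs)   # choose(3, 2) = 3 outcomes
```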

Unordered, With Replacement

The last possibility is perhaps the most interesting. We replace the balls after every draw, but we do not remember the order in which the draws come.

> urnsamples(1:3, size = 2, replace = TRUE, ordered = FALSE)
  X1 X2
1  1  1
2  1  2
3  1  3
4  2  2
5  2  3
6  3  3

We may interpret this experiment in a number of alternative ways. One way is to consider this as simply putting two 3-sided dice in a cup, shaking the cup, and looking inside as in a game of Liar’s Dice, for instance. Each row of the sample space is a potential pair we could observe. Another equivalent view is to consider each outcome a separate way to distribute two identical golf balls into three boxes labeled 1, 2, and 3. Regardless of the interpretation, urnsamples() lists every possible way that the experiment can conclude.
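One way to see where these six rows come from is to take all nine ordered pairs and keep a single representative of each unordered pair, say the one whose entries are in nondecreasing order. The following base R sketch does that; the names are hypothetical and the row order is not the same as in the urnsamples() output.

```r
# Base R sketch: unordered samples of size 2, WITH replacement, from the urn 1:3.
all_pairs <- expand.grid(X1 = 1:3, X2 = 1:3)

# Keep one representative per unordered pair: the row with X1 <= X2.
unordered_with_repl <- all_pairs[all_pairs$X1 <= all_pairs$X2, ]

unordered_with_repl
nrow(unordered_with_repl)   # choose(3 - 1 + 2, 2) = 6 outcomes
```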

Note that the urn does not need to contain numbers; we could have just as easily taken our urn to be x = c("Red", "Blue", "Green"). But, there is an important point to mention before proceeding. Astute readers will notice that in our example, the balls in the urn were distinguishable in the sense that each had a unique label to distinguish it from the others in the urn. A natural question would be, “What happens if your urn has indistinguishable elements, for example, what if x = c("Red", "Red", "Blue")?” The answer is that urnsamples() behaves as if each ball in the urn is distinguishable, regardless of its actual contents. We may thus imagine that while there are two red balls in the urn, the balls are such that we can tell them apart (in principle) by looking closely enough at the imperfections on their surface.

In this way, when the x argument of urnsamples() has repeated elements, the resulting sample space may appear to be ordered = TRUE even when, in fact, the call to the function was urnsamples(..., ordered = FALSE). Similar remarks apply for the replace argument. We investigate this issue further in the last section.
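The effect of repeated labels can be illustrated with base R's combn(), which also selects by position rather than by label. This is only an analogy to the behavior described above, not output from urnsamples(); the names urn and draws are made up for the example.

```r
# Base R sketch: an urn with two balls that share the label "Red".
urn <- c("Red", "Red", "Blue")

# combn() chooses positions 1-2, 1-3, and 2-3, so both red balls count separately.
draws <- t(combn(urn, 2))

draws
# The pair Red/Blue appears twice, once for each red ball.
sum(duplicated(draws))
```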

3 Counting Tools

The sample spaces we have seen so far have been relatively small, and we can visually study them without much trouble. However, it is VERY easy to generate sample spaces that are prohibitively large. And while R is wonderful and powerful and does almost everything except wash windows, even R has limits of which we should be mindful.

In many cases, we do not need to actually generate the sample spaces of interest; it suffices merely to count the number of outcomes. The nsamp() function will calculate the number of rows in a sample space made by urnsamples(), without actually devoting the memory resources necessary to generate the space. The arguments are: n, the number of (distinguishable) objects in the urn, k, the sample size, and replace, ordered as above.
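To get a feel for how quickly sample spaces grow, the counts for two familiar experiments can be computed directly in base R without building any data frame. The experiments and variable names below are illustrative choices, not examples taken from the package.

```r
# Counting without enumerating, using base R only.
hands <- choose(52, 5)   # unordered five-card hands from a 52-card deck
rolls <- 6^10            # ordered outcomes of ten rolls of a six-sided die

hands   # 2598960
rolls   # 60466176
```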

In a probability course, one derives the formulas used in the respective scenarios. For our purposes, it is sufficient to merely list them in the following table. Note that $x! = x(x-1)(x-2)\cdots 3\cdot 2\cdot 1$ and $\binom{n}{k} = n!/[k!(n-k)!]$.

Values of nsamp(n, k, replace, ordered)

                   ordered = TRUE          ordered = FALSE
replace = TRUE     $n^k$                   $\frac{(n-1+k)!}{(n-1)!\,k!}$
replace = FALSE    $\frac{n!}{(n-k)!}$     $\binom{n}{k}$

Examples

We will compute the number of outcomes for each of the four urnsamples() examples that we saw in the last section. Recall that we took a sample of size two from an urn with three distinguishable elements.

> nsamp(n = 3, k = 2, replace = TRUE, ordered = TRUE)
[1] 9

> nsamp(n = 3, k = 2, replace = FALSE, ordered = TRUE)
[1] 6

> nsamp(n = 3, k = 2, replace = FALSE, ordered = FALSE)
[1] 3

> nsamp(n = 3, k = 2, replace = TRUE, ordered = FALSE)
[1] 6

Compare these answers with the number of rows of the data frames generated above.
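The four formulas in the table can also be written out in base R. The helper below, count_samples(), is a hypothetical name introduced for this sketch; it is not part of prob, it is not vectorized, and it is not meant to reflect how nsamp() is implemented.

```r
# Hypothetical helper (not part of prob): the four counting formulas, scalar inputs only.
count_samples <- function(n, k, replace = FALSE, ordered = FALSE) {
  if (replace && ordered) {
    n^k                                  # ordered, with replacement
  } else if (!replace && ordered) {
    factorial(n) / factorial(n - k)      # ordered, without replacement
  } else if (!replace && !ordered) {
    choose(n, k)                         # unordered, without replacement
  } else {
    choose(n - 1 + k, k)                 # unordered, with replacement
  }
}

count_samples(3, 2, replace = TRUE,  ordered = TRUE)    # 9
count_samples(3, 2, replace = FALSE, ordered = TRUE)    # 6
count_samples(3, 2, replace = FALSE, ordered = FALSE)   # 3
count_samples(3, 2, replace = TRUE,  ordered = FALSE)   # 6
```

Here choose(n - 1 + k, k) is the same quantity as (n - 1 + k)!/[(n - 1)! k!] in the table.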

3.1 The Multiplication Principle

A benefit of nsamp() is that it is vectorized, so that entering vectors instead of numbers for n, k, replace, and ordered results in a vector of corresponding answers. This becomes particularly convenient when trying to demonstrate the Multiplication Principle for solving combinatorics problems.

Example

Question: There are 11 artists who each submit a portfolio containing 7 paintings for competition in an art exhibition. Unfortunately, the gallery director only has space in the winners’ section to accommodate 12 paintings in a row equally spread over three consecutive walls. The director decides to give the first, second, and third place winners each a wall to display the work of their choice. The walls boast 31 separate lighting options apiece. How many displays are possible?

Answer: The judges will pick 3 (ranked) winners out of 11 (with rep=FALSE, ord=TRUE). Each artist will select 4 of his/her paintings from 7 for display in a row (rep=FALSE, ord=TRUE), and lastly, each of the 3 walls has 31 lighting possibilities (rep=TRUE, ord=TRUE). These three numbers can be calculated quickly with

> n = c(11, 7, 31)
> k = c(3, 4, 3)
> r = c(FALSE, FALSE, TRUE)
> x = nsamp(n, k, rep = r, ord = TRUE)
> x
[1]   990   840 29791

(Notice that ordered is always TRUE; nsamp() will recycle ordered and replace to the appropriate length.) By the Multiplication Principle, the number of ways to complete the experiment is the product of the entries of x:

> prod(x)
[1] 24774195600

Compare this with some standard ways to compute this in R:

> (11 * 10 * 9) * (7 * 6 * 5 * 4) * 31^3
[1] 24774195600

or alternatively

> prod(9:11) * prod(4:7) * 31^3
[1] 24774195600

or even

> prod(factorial(c(11, 7))/factorial(c(8, 3))) * 31^3
[1] 24774195600

As one can guess, in many of the standard counting problems there is not much saving in the amount of typing; it is about the same using nsamp() versus factorial() and choose(). But the virtue of nsamp() lies in its collecting the relevant counting formulas in a one-stop shop. Ultimately, it is up to the user to choose the method that works best for him/herself.

4 Defining a Probability Space

Once a sample space is defined, the next step is to associate a probability model with it in order to be able to answer probabilistic questions. Formally speaking, a probability space is a triple $(S, \mathcal{B}, \mathbb{P})$, where $S$ is a sample space, $\mathcal{B}$ is a sigma-algebra of subsets of $S$, and $\mathbb{P}$ is a probability measure defined on $\mathcal{B}$. However, for our purposes all of the sample spaces are finite, so we may take $\mathcal{B}$ to be the power set (the set of all subsets of $S$) and it suffices to specify $\mathbb{P}$ on the elements of $S$, the outcomes. The only requirement for $\mathbb{P}$ is that its values should be nonnegative and sum to 1.

The end result is that in the prob package, a probability space is an object of outcomes S and a vector of probabilities (called “probs”) with entries that correspond to each outcome in S. When S is a data frame, we may simply add a column called probs to S and we will be finished; the probability space will simply be a data frame which we may call space. In the case that S is a list, we may combine the outcomes and probs into a larger list, space; it will have two components: outcomes and probs. The only requirement we place is that the entries of probs be nonnegative and sum(probs) is one.

To accomplish this in R, we may use the probspace() function. The general syntax is probspace(x, probs), where x is a sample space of outcomes and probs is a vector (of the same length as the number of outcomes in x). The specific choice of probs depends on the context of the problem, and some examples follow to demonstrate some of the more common choices.
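The data frame representation described above can be mimicked by hand in base R, which makes the two requirements on probs explicit. This is a sketch with made-up outcomes and probabilities; it does not call probspace() and makes no claim about the checks that function performs.

```r
# Base R sketch: a probability space as a data frame with a probs column.
S <- data.frame(X1 = c("a", "b", "c"))   # hypothetical outcomes
probs <- c(0.2, 0.3, 0.5)                # hypothetical probabilities

# The two requirements: nonnegative entries that sum to one (up to rounding error).
stopifnot(all(probs >= 0))
stopifnot(isTRUE(all.equal(sum(probs), 1)))

space <- cbind(S, probs = probs)
space
```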

4.1 Examples

The Equally Likely Model

The equally likely model asserts that every outcome of the sample space has the same probability; thus, if a sample space has n outcomes, then probs would be a vector of length n with identical entries 1/n. The quickest way to generate probs is with the rep() function. We will start with the experiment of rolling a die, so that n = 6. We will construct the sample space, generate the probs vector, and put them together with probspace().

> outcomes = rolldie(1)
> p = rep(1/6, times = 6)
> p
[1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

> probspace(outcomes, probs = p)

The probspace() function is designed to save us some time in many of the most common situations. For example, due to the especial simplicity of the sample space in this case, we could have achieved the same result with simply (note the name change for the first column)

> probspace(1:6, probs = p)

Further, since the equally likely model plays such a fundamental role in the study of probability, the probspace() function will assume that the equally likely model is desired if no probs are specified. Thus, we get the same answer with only

> probspace(1:6)

And finally, since rolling dice is such a common experiment in probability classes, the rolldie() function has an additional logical argument makespace that will add a column of equally likely probs to the generated sample space:

> rolldie(1, makespace = TRUE)

or, with positional arguments, rolldie(1, 6, TRUE). Many of the other sample space functions (tosscoin(), cards(), roulette(), etc.) have similar makespace arguments. Check the documentation for details.

One sample space function that does NOT have a makespace option is the urnsamples() function. This was intentional. The reason is that under the varied sampling assumptions the outcomes in the respective sample spaces are NOT, in general, equally likely. It is important for the user to carefully consider the experiment to decide whether or not the outcomes are equally likely, and then use probspace() to assign the model.
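To see why equal probabilities would be misleading, consider the unordered, with-replacement experiment from earlier. If we assume that the nine ordered draws are equally likely, then collapsing them to unordered pairs gives unequal probabilities. The base R sketch below tallies this; the variable names are hypothetical.

```r
# Base R sketch: probabilities of unordered pairs, ASSUMING the 9 ordered draws are equally likely.
ordered_space <- expand.grid(X1 = 1:3, X2 = 1:3)

# Forget the order by writing each pair as "smaller larger".
key <- paste(pmin(ordered_space$X1, ordered_space$X2),
             pmax(ordered_space$X1, ordered_space$X2))

# Count how many ordered draws collapse onto each unordered pair, then normalize.
unordered_probs <- table(key) / nrow(ordered_space)
unordered_probs
```

Under this assumption the pair "1 2" has probability 2/9 while "1 1" has probability 1/9, so the six rows are not equally likely.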

An unbalanced coin

While the makespace argument to tosscoin() is useful to represent the tossing of a fair coin, it is not always appropriate. For example, suppose our coin is not perfectly balanced; for instance, maybe the “H” side is somewhat heavier, such that the chance of an H appearing in a single toss is 0.70 instead of 0.5. We may set up the probability space with

> probspace(tosscoin(1), probs = c(0.7, 0.3))
  toss1 probs
1     H   0.7
2     T   0.3

The same procedure can be used to represent an unbalanced die, roulette wheel, etc.
