0.1 Introduction
Parallel group randomised controlled trials are typically conducted by recruiting a fixed number of individuals and allocating each to receive one of two treatments, ultimately testing a prespecified hypothesis. Since Wald published his work on the sequential probability ratio test (wald1947), there has been substantial interest in trial designs that allow hypotheses to be tested multiple times during the trial. With this approach, the trial may be stopped early if the data so suggest. This limits patient exposure to inferior treatments and, by lowering the expected required sample size, will often reduce the cost of a trial. Armitage1975 was responsible for much of the early use of such methods in medicine. However, his and the other initial approaches were fully sequential, with data analysed after every patient. Whilst this may seem desirable, it is impractical, and thus this methodology did not gain general acceptance. The pivotal moment in this field came with the work of pocock1977, who provided a clear way of determining group sequential designs with desired type-I and type-II error rates. In a group sequential design, a hypothesis is tested multiple times during an ongoing trial but, as the name suggests, only after groups of patients of certain sizes have been assessed. This allows the majority of the benefits of a fully sequential approach to be retained, whilst also making the design feasible in practice.
Since that paper, group sequential designs have been researched extensively, and utilised regularly in clinical trials. Today, methodology is well established for designing group sequential trials with normal, binary, and survival endpoints. Approaches are available to design trials with unknown variance, with multiple arms, or to optimise a design's features. For a detailed discussion of available methods see whitehead1997 or jennison2000.
In this paper, we focus on the design of two-treatment group sequential trials with a normally distributed outcome variable, but note that asymptotically other endpoint types can be treated with the same normal test statistics. We proceed by summarising the statistical theory behind group sequential methodology. Following this, we detail our new commands and provide several examples of their use.
0.2 Statistical Theory
We consider a randomised two-arm group sequential trial design with up to $L$ planned analyses. We index one arm by 0, and the other by 1. Often, it will be the case that arm 0 is a control and arm 1 an experimental treatment, but this may not always be true. We assume that the $k$th analysis takes place after $kn$ and $rkn$ patients have been randomised to arms 0 and 1, respectively. Thus $r$ is the ratio of patients allocated to arm 1 relative to arm 0, and we refer to $n$ as the group size. Possible extensions to this framework are discussed in Section 0.8. The outcome from patient $i$ in arm $d$ in stage $k$, $X_{idk}$, is assumed to be distributed as follows
$$X_{idk} \sim N(\mu_d, \sigma_d^2).$$
Thus, we are assuming that the variance in response of both treatments is known.
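Our ultimate goal is to make inference about the difference in the average treatment effect of arms 0 and 1. To this end, we define $\tau = \mu_1 - \mu_0$, the difference between the mean responses $\mu_1$ and $\mu_0$ in arms 1 and 0, and at each interim analysis $k$ compute the following test statistic
$$Z_k = \left( \frac{1}{rkn}\sum_{j=1}^{k}\sum_{i=1}^{rn} X_{i1j} - \frac{1}{kn}\sum_{j=1}^{k}\sum_{i=1}^{n} X_{i0j} \right) I_k^{1/2},$$
with
$$I_k = \left( \frac{\sigma_0^2}{kn} + \frac{\sigma_1^2}{rkn} \right)^{-1} \qquad (1)$$
the information for this analysis. It can be shown that $\{Z_1,\dots,Z_L\}$ have, for the parameter of interest $\tau$ with information levels $\{I_1,\dots,I_L\}$, what has been referred to as the canonical joint distribution (jennison2000). That is

$(Z_1,\dots,Z_L)$ is multivariate normal;

$E(Z_k) = \tau I_k^{1/2}$, $k = 1,\dots,L$;

$\mathrm{Cov}(Z_{k_1}, Z_{k_2}) = (I_{k_1}/I_{k_2})^{1/2}$, $1 \le k_1 \le k_2 \le L$.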
Using this, the operating characteristics of a group sequential design with any choice of stopping boundaries can be determined using multivariate normal integration, as described in jennison2000 and wason2015. This allows the use of numerical optimisation routines to determine suitable sample sizes and stopping boundaries. The particular type of boundaries to utilise depends on the chosen hypothesis testing framework. Therefore, in the following sections we discuss several established methods for two-sided, and then one-sided, tests.
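The canonical joint distribution can be made concrete with a short numerical sketch. The Python below is an illustration only, not part of the commands described in this paper. It assumes the information takes the form $I_k = \{\sigma_0^2/(kn) + \sigma_1^2/(rkn)\}^{-1}$, and the function names and input values are hypothetical.

```python
from math import sqrt

def information_levels(n, r, sigma0, sigma1, L):
    """I_k = 1 / (sigma0^2/(k n) + sigma1^2/(r k n)) for k = 1, ..., L."""
    return [1.0 / (sigma0**2 / (k * n) + sigma1**2 / (r * k * n))
            for k in range(1, L + 1)]

def canonical_moments(tau, info):
    """Mean vector and covariance matrix of (Z_1, ..., Z_L)."""
    mean = [tau * sqrt(i_k) for i_k in info]
    L = len(info)
    # Cov(Z_k1, Z_k2) = sqrt(I_k1 / I_k2) for k1 <= k2, symmetric otherwise.
    cov = [[sqrt(info[min(i, j)] / info[max(i, j)]) for j in range(L)]
           for i in range(L)]
    return mean, cov

# Illustrative values only: n = 100 patients per stage in arm 0,
# equal allocation (r = 1), unit variances, and L = 3 analyses.
info = information_levels(n=100, r=1, sigma0=1.0, sigma1=1.0, L=3)
mean, cov = canonical_moments(tau=0.2, info=info)
print(info)
print(mean)
print(cov)
```

With equally sized groups the covariance matrix depends only on the analysis indices, which is why tabulated boundary constants can be reused across trials with different variances and sample sizes.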
0.3 Two-Sided Tests
0.3.1 Stopping Rules and Operating Characteristics
In a two-sided test, we assess whether there is significant evidence of a difference in the mean responses of the two treatment arms. That is, we test
$$H_0 : \tau = 0 \quad \text{against} \quad H_1 : \tau \neq 0.$$
Here, a group sequential trial design is characterised by stopping boundaries $a_1,\dots,a_L$ and $r_1,\dots,r_L$, with $0 \le a_k < r_k$ for $k = 1,\dots,L-1$, and $a_L = r_L$, and the following stopping rules at analyses $k = 1,\dots,L$

If $|Z_k| \ge r_k$ stop and reject $H_0$;

If $|Z_k| < a_k$ stop and do not reject $H_0$;

otherwise continue to stage $k+1$.
The choice $a_L = r_L$ ensures termination after analysis $L$, whilst also guaranteeing a conclusion is made about $H_0$.
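Then, in terms of the acceptance and rejection boundaries $a_k$ and $r_k$, the probability of rejecting $H_0$ for any $\tau$ is
$$P(\text{Reject } H_0 \mid \tau) = \sum_{k=1}^{L} P(a_1 \le |Z_1| < r_1, \dots, a_{k-1} \le |Z_{k-1}| < r_{k-1}, |Z_k| \ge r_k \mid \tau).$$
Similarly, the probability of not rejecting $H_0$ for any $\tau$ is
$$P(\text{Do not reject } H_0 \mid \tau) = \sum_{k=1}^{L} P(a_1 \le |Z_1| < r_1, \dots, a_{k-1} \le |Z_{k-1}| < r_{k-1}, |Z_k| < a_k \mid \tau).$$
Using the above, the expected sample size for any $\tau$ can be calculated as
$$E(N \mid \tau) = \sum_{k=1}^{L} (1 + r)kn \, P(\text{stop at analysis } k \mid \tau),$$
where the probability of stopping at analysis $k$ is the sum of the $k$th terms of the two preceding sums.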
As discussed earlier, each of these probabilities can be computed using multivariate normal integration. Explicitly, defining
then, for example
Here,
Here, the square root of the vector of information levels is taken in an element-wise manner, and the integrand is the probability density function of a multivariate normal distribution with the mean vector and covariance matrix $\Lambda$ of the canonical joint distribution, evaluated at the vector of integration. In all of the commands presented here, these integrals are evaluated using the Mata function pmvnormal_mata() (grayling2016).
With the above specifications, all that remains is a method for determining stopping boundaries, and an associated required sample size, such that $P(\text{Reject } H_0 \mid \tau = 0) = \alpha$ and $P(\text{Reject } H_0 \mid \tau = \delta) = 1 - \beta$, for clinically relevant difference $\delta$, and desired type-I and type-II error rates $\alpha$ and $\beta$. It is this problem that much of the group sequential clinical trial design literature has focused upon. In the following sections we discuss several options available via our commands.
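To illustrate how such operating characteristics can be evaluated, the following Python sketch computes the per-analysis stopping probabilities of a two-sided design. It is not the method used by the commands, which rely on pmvnormal_mata(). Instead it swaps in the classical recursive numerical integration over the score statistic $S_k = Z_k I_k^{1/2}$, whose increments are independent under the canonical joint distribution. All function names are hypothetical, and the two-stage Pocock constant 2.178 used in the check is a value from standard published tables, not from this paper.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def simpson_nodes(lo, hi, m=200):
    """Composite-Simpson nodes and weights on [lo, hi]; m must be even."""
    h = (hi - lo) / m
    return [(lo + j * h, h / 3 * (1 if j in (0, m) else 4 if j % 2 else 2))
            for j in range(m + 1)]

def stage_probabilities(a, r, info, tau):
    """Per-analysis probabilities of stopping to reject / not reject H0.

    a, r : acceptance and rejection boundaries on the Z scale (a_L = r_L).
    info : information levels I_1 < ... < I_L.
    """
    paths = [(0.0, 1.0)]  # (score, probability mass) of trials still running
    prev_info, reject, accept = 0.0, [], []
    for k in range(len(r)):
        d = info[k] - prev_info  # information gained in this stage
        sd = sqrt(d)

        def mass_below(s):
            # P(trial still running and S_k <= s)
            return sum(w * PHI((s - s0 - tau * d) / sd) for s0, w in paths)

        def density(s):
            # sub-density of S_k for trials still running
            return sum(w * exp(-0.5 * ((s - s0 - tau * d) / sd) ** 2)
                       for s0, w in paths) / (sd * sqrt(2 * pi))

        total = sum(w for _, w in paths)
        hi, lo = r[k] * sqrt(info[k]), a[k] * sqrt(info[k])
        reject.append(total - mass_below(hi) + mass_below(-hi))
        accept.append(mass_below(lo) - mass_below(-lo) if lo > 0 else 0.0)
        if k < len(r) - 1:
            # Continuation region: a_k <= |Z_k| < r_k on the score scale.
            regions = [(-hi, -lo), (lo, hi)] if lo > 0 else [(-hi, hi)]
            paths = [(s, w * density(s))
                     for left, right in regions
                     for s, w in simpson_nodes(left, right)]
        prev_info = info[k]
    return reject, accept

def expected_sample_size(a, r, info, tau, n, ratio):
    """E(N | tau), with (1 + ratio) * n patients recruited per stage."""
    reject, accept = stage_probabilities(a, r, info, tau)
    return sum((k + 1) * n * (1 + ratio) * (p + q)
               for k, (p, q) in enumerate(zip(reject, accept)))

# Check: two equally spaced analyses, no early acceptance, Pocock constant.
reject, accept = stage_probabilities([0.0, 2.178], [2.178, 2.178],
                                     info=[1.0, 2.0], tau=0.0)
print(sum(reject))
```

The printed type-I error rate should be close to 0.05. Setting a boundary $a_k$ to zero switches off early stopping to not reject $H_0$ at that analysis.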
0.3.2 Early Stopping to Reject $H_0$
Much of the early work on group sequential trial design focused on two-sided tests with early stopping only to reject $H_0$. That is, with $a_k = 0$ for $k = 1,\dots,L-1$. In particular, haybittle1971 and peto1976 suggested a simple set of boundaries with $r_k = 3$ for $k = 1,\dots,L-1$. The final critical boundary $r_L$ is then determined to ensure an overall type-I error rate of $\alpha$. Following the determination of $r_L$, a one-dimensional numerical search is utilised to ascertain the exact required group size $n$ for power of $1 - \beta$ when $\tau = \delta$, treating $n$ as a continuous quantity.
Haybittle and Peto's procedure is advantageous in that it is simple, whilst its wide stopping boundaries mean that early stopping is unlikely: a desirable property in some instances to help increase data accumulation, with termination only in the case of extreme disparities in treatment performance. However, trialists will often desire stopping boundaries that help to substantially reduce the expected sample size when $H_0$ is not true. For this, wang1987 suggested the following family of stopping boundaries, indexed by a parameter $\Omega$
$$r_k = C (k/L)^{\Omega - 1/2}, \qquad k = 1,\dots,L.$$
Their procedure encompasses the popular pocock1977 and obrien1979 boundaries, by taking $\Omega = 0.5$ or $\Omega = 0$ respectively. In this approach, a numerical search is utilised for any chosen $\Omega$ to determine the value of $C$ that implies the correct type-I error rate $\alpha$. Following this, as with Haybittle and Peto's design, a further search is then used to ascertain the required sample size for the power constraint. In general, it has been shown that as $\Omega$ increases, the maximum sample size increases, but the expected sample size for larger values of $\tau$ decreases.
Later, we present commands haybittlePeto and wangTsiatis for determining the stopping boundaries and required sample size of these designs for any choice of $L$, $\alpha$, $\beta$, $\delta$, $\sigma_0$ and $\sigma_1$, and $r$.
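The search for the Wang-Tsiatis constant can be sketched as follows for the simplest case of $L = 2$ equally spaced analyses. This Python is illustrative only. It assumes the boundary shape $(k/L)^{\Omega - 1/2}$, evaluates the type-I error rate by one-dimensional Simpson integration, and uses plain bisection in place of the Brent's algorithm implementation used by the commands. All function names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def wang_tsiatis_shape(L, omega):
    """(k / L) ** (omega - 1/2) for k = 1, ..., L."""
    return [(k / L) ** (omega - 0.5) for k in range(1, L + 1)]

def type_one_error_two_stage(C, omega, m=400):
    """P(reject H0 | tau = 0) for L = 2 equally spaced analyses."""
    r1, r2 = (C * s for s in wang_tsiatis_shape(2, omega))

    # Under tau = 0, Z_2 = (Z_1 + W) / sqrt(2) with W ~ N(0, 1) independent
    # of Z_1, so P(|Z_2| < r2 | Z_1 = z) is a difference of two normal CDFs.
    def integrand(z):
        dens = exp(-0.5 * z * z) / sqrt(2 * pi)
        return dens * (PHI(sqrt(2) * r2 - z) - PHI(-sqrt(2) * r2 - z))

    # Composite Simpson rule over the stage-1 continuation region (-r1, r1).
    h = 2 * r1 / m
    total = integrand(-r1) + integrand(r1)
    for j in range(1, m):
        total += (4 if j % 2 else 2) * integrand(-r1 + j * h)
    return 1.0 - total * h / 3  # one minus P(never reject)

def solve_constant(alpha, omega, lo=1.0, hi=5.0, tol=1e-8):
    """Bisection for C; the type-I error rate is decreasing in C."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if type_one_error_two_stage(mid, omega) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for omega in (0.5, 0.0):
    C = solve_constant(0.05, omega)
    print(omega, [round(C * s, 3) for s in wang_tsiatis_shape(2, omega)])
```

For $\alpha = 0.05$ the recovered constants should be close to the tabulated two-stage Pocock and O'Brien-Fleming values of roughly 2.178 and 1.977. Designs with more analyses need a multivariate integration routine in place of the one-dimensional integral used here.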
0.3.3 Early Stopping to Reject and Not Reject $H_0$
The above designs deal well with the issue of ethics in two-sided clinical trials; namely the desire to stop early when the difference between treatments is substantial. However, there are also often sound reasons to desire early stopping when it is clear there is no detectable treatment difference; usually based around reducing the cost of a trial. These are trial designs in which not all $a_k = 0$, $k = 1,\dots,L-1$. pampallona1994 described a one-parameter family of such trial designs, again indexed by a shape parameter $\Omega$, that has been referred to as the power family of inner wedge designs. Explicitly
$$r_k = C_1 (k/L)^{\Omega - 1/2}, \qquad a_k = \delta I_k^{1/2} - C_2 (k/L)^{\Omega - 1/2}, \qquad k = 1,\dots,L.$$
The final information level is then
$$I_L = \frac{(C_1 + C_2)^2}{\delta^2},$$
to ensure $a_L = r_L$ as desired. A two-dimensional numerical search is utilised to determine the values of $C_1$ and $C_2$ that provide the desired type-I and type-II error rates for the chosen design parameters. With these values identified, the final required information level is used to determine the exact required group size through Equation (1). As in the procedure of wang1987 above, the inclusion of the parameter $\Omega$ allows a large range of designs to be determined, with varying performance in terms of their expected sample sizes. In Section 0.6 we will see how these performances can be examined graphically.
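Alternatively, whitehead1983 and whitehead1997 proposed an approach for the determination of a group sequential clinical trial design for a two-sided test with early stopping to not reject $H_0$, termed the double triangular test. Specifically, they demonstrated that a design with
$$r_k = \left\{ \frac{2}{\tilde\delta}\log\left(\frac{1}{\alpha}\right) - 0.583\left(\frac{I_L}{L}\right)^{1/2} + \frac{\tilde\delta}{4}\frac{k}{L} I_L \right\} I_k^{-1/2}, \qquad a_k = \left\{ -\frac{2}{\tilde\delta}\log\left(\frac{1}{\alpha}\right) + 0.583\left(\frac{I_L}{L}\right)^{1/2} + \frac{3\tilde\delta}{4}\frac{k}{L} I_L \right\} I_k^{-1/2},$$
where, with $\Phi^{-1}$ denoting the standard normal quantile function,
$$\tilde\delta = \frac{2\Phi^{-1}(1 - \alpha/2)}{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)}\,\delta,$$
and
$$I_L = \left[ \left\{ \frac{4(0.583)^2}{L} + 8\log\left(\frac{1}{\alpha}\right) \right\}^{1/2} - \frac{2(0.583)}{L^{1/2}} \right]^2 \frac{1}{\tilde\delta^2},$$
would approximately attain a type-I error rate of $\alpha$ when $\tau = 0$, and a type-II error rate of $\beta$ when $\tau = \delta$.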
Later, we discuss our commands innerWedge and doubleTriangular for determining these designs.
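Because the double triangular boundaries are available in closed form, they can be evaluated directly. The Python sketch below is an illustration rather than the implementation of doubleTriangular. It assumes the standard form of the boundaries on the $Z$ scale with the 0.583 discreteness correction, assumes equally spaced analyses, and converts the final information level into a group size by inverting $I_L = \{\sigma_0^2/(Ln) + \sigma_1^2/(rLn)\}^{-1}$. The function name and the truncation of negative acceptance boundaries at zero are assumptions.

```python
from math import log, sqrt
from statistics import NormalDist

def double_triangular(L, alpha, beta, delta, sigma0, sigma1, ratio):
    """Group size n and boundaries (r, a) of a double triangular design."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    d_tilde = 2 * z(1 - alpha / 2) * delta / (z(1 - alpha / 2) + z(1 - beta))
    info_L = (sqrt(4 * 0.583**2 / L + 8 * log(1 / alpha))
              - 2 * 0.583 / sqrt(L)) ** 2 / d_tilde**2
    intercept = 2 / d_tilde * log(1 / alpha) - 0.583 * sqrt(info_L / L)
    r, a = [], []
    for k in range(1, L + 1):
        info_k = k / L * info_L  # equally spaced information levels
        r.append((intercept + d_tilde / 4 * info_k) / sqrt(info_k))
        # Assumption: a negative value means no early acceptance (a_k = 0).
        a.append(max(0.0, (-intercept + 3 * d_tilde / 4 * info_k)
                     / sqrt(info_k)))
    # Invert I_L = 1 / (sigma0^2/(L n) + sigma1^2/(ratio L n)) for n.
    n = info_L * (sigma0**2 + sigma1**2 / ratio) / L
    return n, r, a

n, r, a = double_triangular(L=2, alpha=0.05, beta=0.2, delta=0.2,
                            sigma0=2.0, sigma1=2.0, ratio=1.0)
print(round(n, 1), [round(x, 2) for x in r], [round(x, 2) for x in a])
```

With these inputs the sketch should give a group size of about 875.5 with boundaries of roughly (2.2, 2.07) and (0.73, 2.07), in line with the output reported for the first example in Section 0.6.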
0.4 One-Sided Tests
0.4.1 Stopping Rules and Operating Characteristics
In a one-sided test, we assess whether, without loss of generality, the mean response on treatment 1 is significantly larger than that on treatment 0. That is, we test
$$H_0 : \tau \le 0 \quad \text{against} \quad H_1 : \tau > 0.$$
A group sequential trial design of this type is characterised by stopping boundaries $a_1,\dots,a_L$ and $r_1,\dots,r_L$, with $a_k < r_k$ for $k = 1,\dots,L-1$ and $a_L = r_L$, and the following stopping rules at analyses $k = 1,\dots,L$

If $Z_k \ge r_k$ stop and reject $H_0$;

If $Z_k < a_k$ stop and do not reject $H_0$;

otherwise continue to stage $k+1$.
Again, the choice $a_L = r_L$ is to ensure termination after analysis $L$, and to guarantee a conclusion is drawn about $H_0$.
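Now, in terms of the acceptance and rejection boundaries $a_k$ and $r_k$, the probability of rejecting $H_0$ for any $\tau$ becomes
$$P(\text{Reject } H_0 \mid \tau) = \sum_{k=1}^{L} P(a_1 \le Z_1 < r_1, \dots, a_{k-1} \le Z_{k-1} < r_{k-1}, Z_k \ge r_k \mid \tau).$$
Similarly, the probability of not rejecting $H_0$ for any $\tau$ is
$$P(\text{Do not reject } H_0 \mid \tau) = \sum_{k=1}^{L} P(a_1 \le Z_1 < r_1, \dots, a_{k-1} \le Z_{k-1} < r_{k-1}, Z_k < a_k \mid \tau).$$
As before, the expected sample size for any $\tau$ is given by
$$E(N \mid \tau) = \sum_{k=1}^{L} (1 + r)kn \, P(\text{stop at analysis } k \mid \tau).$$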
Moreover, these probabilities can again be computed using multivariate normal integration. Using our notation from earlier, we have for example
In some situations, a one-sided test will be more appropriate because departures from $H_0$ in one direction are implausible. Alternatively, it may be the case that we are interested in directly testing the superiority of one treatment over another. Consequently, much research has gone into determining designs that will have desired operating characteristics (now, a type-I error rate of $\alpha$ when $\tau = 0$, and a type-II error rate of $\beta$ when $\tau = \delta$) and favourable performance in terms of the expected sample size. Below, we discuss two popular methods, available for implementation via our commands.
0.4.2 Power Family of One-Sided Designs
In addition to their power family of inner wedge designs, pampallona1994 also detailed a one-parameter family of designs for one-sided tests, with boundaries given by
$$r_k = C_1 (k/L)^{\Omega - 1/2}, \qquad a_k = \delta I_k^{1/2} - C_2 (k/L)^{\Omega - 1/2}, \qquad k = 1,\dots,L.$$
As before, taking a final information level of
$$I_L = \frac{(C_1 + C_2)^2}{\delta^2}$$
ensures that $a_L = r_L$ as desired, and a two-dimensional grid search can be used to determine the appropriate values of $C_1$ and $C_2$. Our command powerFamily is available to perform these computations.
0.4.3 Triangular Test
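whitehead1983 and whitehead1997 also proposed a triangular test for one-sided group sequential clinical trial designs. Specifically, they proposed
$$r_k = \left\{ \frac{2}{\tilde\delta}\log\left(\frac{1}{2\alpha}\right) - 0.583\left(\frac{I_L}{L}\right)^{1/2} + \frac{\tilde\delta}{4}\frac{k}{L} I_L \right\} I_k^{-1/2}, \qquad a_k = \left\{ -\frac{2}{\tilde\delta}\log\left(\frac{1}{2\alpha}\right) + 0.583\left(\frac{I_L}{L}\right)^{1/2} + \frac{3\tilde\delta}{4}\frac{k}{L} I_L \right\} I_k^{-1/2},$$
with, for $\Phi^{-1}$ the standard normal quantile function,
$$\tilde\delta = \frac{2\Phi^{-1}(1 - \alpha)}{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)}\,\delta,$$
and
$$I_L = \left[ \left\{ \frac{4(0.583)^2}{L} + 8\log\left(\frac{1}{2\alpha}\right) \right\}^{1/2} - \frac{2(0.583)}{L^{1/2}} \right]^2 \frac{1}{\tilde\delta^2},$$
demonstrating this design would approximately attain the desired operating characteristics.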
This design has proven popular with trialists because of the speed with which it can be calculated, and also because of its strong performance in terms of its expected sample sizes (wason2012). Our command triangular determines this design.
0.5 Syntax
In this section, we detail the syntax of our six discussed commands, which are all declared as rclass
doubleTriangular, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *
haybittlePeto, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *
innerWedge, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *
powerFamily, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *
triangular, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) performance *
wangTsiatis, l(integer 3) delta(real 0.2) alpha(real 0.05) beta(real 0.2) sigma(numlist) ratio(real 1) omega(real 0.5) performance *
Here, the prescribed options denote the following
alpha is the desired overall type-I error rate, $\alpha$. That is, it is the two-sided or one-sided type-I error rate according to the chosen command.
beta is the desired type-II error rate, $\beta$.
delta is the clinically relevant difference at which we power, $\delta$.
l is the maximum number of allowed stages in the design, $L$.
omega is the shape parameter of the boundaries of the power family and Wang-Tsiatis designs, $\Omega$.
performance specifies that the performance of the identified design, i.e. its expected sample size and power curves, should be determined and plotted.
ratio is the desired ratio, $r$, of the sample size in arm 1 to that in arm 0.
sigma is the standard deviation of the responses in arms 0 and 1; $\sigma_0$ and $\sigma_1$. This can either be of length two, containing the assumed values of these two parameters, or of length one, implying $\sigma_0 = \sigma_1$.
Attainable via return list for all six commands are the determined exact required group size (r(n)) and the stopping boundaries, as appropriate (e.g., r(a)). In addition, the vector of information levels (r(I)), the covariance matrix (r(Lambda)), and a vector summarising the performance of the design (r(performance)) are available.
Note that in all of these commands, the required one-dimensional numerical searches are performed using a purpose-built implementation of Brent's algorithm (Brent1973). In contrast, all two-dimensional numerical searches are carried out with the Nelder-Mead option in optimize().
0.6 Example 1: Two-Sided Tests
As our first example, we consider the case $L = 2$, $\alpha = 0.05$, $\beta = 0.2$, $\delta = 0.2$, $\sigma_0 = \sigma_1 = 2$, and $r = 1$, in two-sided testing.
We begin by demonstrating how doubleTriangular can be used to determine the boundaries and sample size required by the double triangular test of whitehead1983. Explicitly, the following code is used to determine the design
. doubleTriangular, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) r(1)
2-stage Group Sequential Trial Design
The hypotheses to be tested are as follows:
  H0: tau = 0   H1: tau != 0,
with the following error constraints:
  P(Reject H0 | tau = 0) = .05,
  P(Reject H0 | tau = delta = .2) = 1 - .2.
Double-triangular boundaries selected...
...now determining design...
...design determined. Returning the results...
...Exact required group sizes for each arm determined to be: 875.5 and 875.5.
...Rejection boundaries r determined to be: (2.2,2.07).
...Acceptance boundaries a determined to be: (.73,2.07).
...Operating characteristics of the design are:
  P(Reject H0 | tau = 0) = .0531,
  P(Reject H0 | tau = .2) = .8003,
  E(N | tau = 0) = 2514.6,
  E(N | tau = .2) = 2550.5,
  max_tau E(N | tau) = 2716.4,
  max N = 3501.9.
As can be seen, by default the commands return an informative summary of the chosen testing framework, their progress, and the characteristics of the final design. Specifically, the first few lines describe the hypotheses that will be tested based on the chosen command. The input values of alpha and beta are then used in printing a summary of the desired operating characteristics. Several lines then follow which describe the progress of the command in completing its required computations. Next, the exact required numbers of patients in each arm, in each stage, are printed. The rejection and acceptance boundaries then follow, along with a summary of the operating characteristics of the identified design. In this case we see the design has a type-I error rate of 0.053, and power of 0.800. This reflects a well-known limitation of the double triangular design: the type-I and type-II error requirements are only approximately achieved. The final four printed results summarise various important sample size characteristics of the design: the expected sample size when $\tau = 0$, that when $\tau = \delta$, the maximum expected sample size over all possible values of $\tau$, and the maximum possible required sample size. We can see that in this case, whilst the maximum possible value of $N$ is 3501.9, we would expect to require no more than 2716.4 patients on average.
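The sample size figures in this output can be sanity-checked by hand. The Python sketch below is illustrative only and uses the rounded values printed above. It relies on the fact that, under $\tau = 0$, the first test statistic is standard normal, so a two-stage trial continues to the second stage with probability $2\{\Phi(r_1) - \Phi(a_1)\}$. The variable names are hypothetical.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

n, ratio, L = 875.5, 1, 2   # printed group size per stage in arm 0
r1, a1 = 2.2, 0.73          # printed (rounded) first-stage boundaries

per_stage = n * (1 + ratio)  # patients recruited per stage, both arms
max_n = L * per_stage

# Under tau = 0, Z_1 ~ N(0, 1): continue when a1 <= |Z_1| < r1.
p_continue = 2 * (Phi(r1) - Phi(a1))
expected_n_null = per_stage * (1 + p_continue)

print(max_n, round(expected_n_null, 1))
```

This gives a maximum sample size of 3502 and an expected sample size under $\tau = 0$ of approximately 2517, close to the reported 3501.9 and 2514.6. The small differences are what would be expected from working with the rounded group size and boundaries.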
Being able to easily determine this design is useful; however, in most situations it is unlikely that a trialist will have a single design in mind. Consequently, it is important to be able to determine the performance of several designs and compare them graphically. Here, we demonstrate this for the power family of inner wedge designs. Using the following code, we find the designs for $\Omega = 0.5$, $0.25$, $0$, and $-0.25$, saving their performance. Then, we combine the saved graphs to produce Figure 1
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(0.5) r(1)
> perf saving(firstDesign) nodraw title({&Omega} = 0.5) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(0.25) r(1)
> perf saving(secondDesign) nodraw title({&Omega} = 0.25) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(0) r(1) perf
> saving(thirdDesign) nodraw title({&Omega} = 0) scale(0.75) scheme(sj)
. qui innerWedge, l(2) alpha(0.05) beta(0.2) delta(0.2) sigma(2) omega(-0.25) r(1)
> perf saving(fourthDesign) nodraw title({&Omega} = -0.25) scale(0.75) scheme(sj)
. graph combine firstDesign.gph secondDesign.gph thirdDesign.gph fourthDesign.gph,
> ycommon scheme(sj)
We observe that increasing the value of $\Omega$ appears to reduce the expected sample size required when $\tau$ is small. However, this comes at a cost to that required when $\tau$ is large.
0.7 Example 2: One-Sided Tests
As our next example, we consider one-sided testing. We take $L = 3$, $\alpha = 0.1$, $\beta = 0.1$, $\delta = 0.25$, $(\sigma_0, \sigma_1) = (1, 2)$, and $r = 2$. Similarly to the above, we demonstrate how powerFamily can be used to determine several designs ($\Omega = 0.25$, $0$, and $-0.25$), and in addition compute the boundaries and sample size of the triangular test. Saving the performance of each, we then compare their performance graphically, creating Figure 2 with the following code
. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(0.25)
> r(2) perf saving(firstDesign) nodraw title(Power family with {&Omega} = 0.25)
> scale(0.75) scheme(sj)
. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(0) r(2)
> perf saving(secondDesign) nodraw title(Power family with {&Omega} = 0)
> scale(0.75) scheme(sj)
. qui powerFamily, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) omega(-0.25)
> r(2) perf saving(thirdDesign) nodraw title(Power family with {&Omega} = -0.25)
> scale(0.75) scheme(sj)
. qui triangular, l(3) alpha(0.1) beta(0.1) delta(0.25) sigma(1, 2) r(2) perf
> saving(fourthDesign) nodraw title(Triangular test) scale(0.75) scheme(sj)
. graph combine firstDesign.gph secondDesign.gph thirdDesign.gph fourthDesign.gph,
> ycommon scheme(sj)
As has been reported previously, the triangular test does indeed fare well in comparison to the three identified power family designs. Explicitly, it has the lowest maximum expected sample size of the four designs. However, this does come at the cost of an increased maximum possible sample size, as evidenced by its performance for large $\tau$.
0.8 Conclusion
It is important that any clinical trial control both its type-I and type-II error rates accurately. For this task, Stata introduced in Version 13 the command power, which can be used for an extremely wide array of trial scenarios. However, as we have discussed, group sequential clinical trial designs are extremely popular with researchers, and to date few commands are available in Stata for determining such designs. Notable exceptions include nstage (Barthel2009; Bratton2015) and nstagebin (Bratton2014) for multi-arm multi-stage trial designs with time-to-event and binary endpoints respectively. In addition, the command simsam can determine the required sample size of certain group sequential clinical trial designs given stopping boundaries (Hooper2013). However, there are no established commands for determining the boundaries and group size required by the wide array of group sequential trial designs for normally distributed outcomes discussed here.
Several extensions to our commands are now possible. We have assumed that the variance of the responses on both treatment arms is known prior to trial commencement. Whilst this is a common assumption in the group sequential design literature, it will often be a strong one to make. However, whitehead2009 proposed a simple quantile substitution method for dealing with this problem, which has been shown to generally control the type-I error rate to the correct level (wason2012a). This would no doubt be a useful addition to our commands. Moreover, we have assumed that the interim analyses are equally spaced in terms of the number of patient responses accrued in each arm. lan1983 proposed an error spending approach to the design of group sequential trials that allows this assumption to be relaxed. Consequently, a command to employ such methodology could prove useful to those seeking more complex designs.
Additionally, our focus has been on two-arm trials. Today, multi-arm multi-stage trials are becoming increasingly popular. Therefore, extending these designs to allow for multiple experimental arms would be advantageous. Finally, there have now been several proposals for the determination of optimal or near-optimal group sequential designs (see, for example, wason2012, wason2012a, and Wason2015a). To allow trialists to maximise the efficiency gains made by utilising a group sequential design, the establishment of commands for determining such designs would be highly advantageous.
Regardless of these possible expansions, our commands can be used to determine stopping boundaries, exact required group sizes, and also to compare the performance of a selection of designs. Consequently, they should prove useful to those seeking to exploit the efficiencies of a group sequential design whilst working in Stata.
0.9 Acknowledgements
Michael J. Grayling is supported by the Wellcome Trust (Grant Number 099770/Z/12/Z). James M. S. Wason is supported by the National Institute for Health Research Cambridge Biomedical Research Centre (Grant Number MC_UP_1302/6). Adrian P. Mander is supported by the Medical Research Council (Grant Number MC_UP_1302/2).