Restricting the sample

By "restricting" the sample we mean selecting observations on the basis of some Boolean (logical) criterion, or by means of a random number generator. This is likely to be most relevant for cross-sectional or panel data.
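
For the random case, a minimal sketch: the --random option to smpl selects the stated number of observations at random from the current dataset. The count of 100 is an arbitrary choice for illustration.

	# select 100 observations at random (the count is illustrative)
	smpl 100 --random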

Suppose we have data on a cross-section of individuals, recording their gender, income and other characteristics. We wish to select for analysis only the women. If we have a gender dummy variable with value 1 for men and 0 for women, we could do

	smpl gender=0 --restrict

to this effect. Or suppose we want to restrict the sample to respondents with incomes over $50,000. Then we could use

	smpl income>50000 --restrict

A question arises here. If we issue the two commands above in sequence, what do we end up with in our sub-sample: all cases with income over 50000, or just women with income over 50000? By default, in a gretl script, the answer is the former: all cases with income over 50000. The second restriction replaces the first (but see the section "The Sample menu items" below). If you want the restrictions to cumulate (that is, for the active restriction to be calculated as the logical product of the most recently specified restriction and any previous restrictions) you have two options:
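
As a sketch, one way to obtain the cumulative sub-sample without depending on how successive restrictions interact is to state both conditions in a single expression, joined with the logical AND operator &:

	# women only, with income over 50000, in one restriction
	smpl (gender=0 & income>50000) --restrict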

Unlike a simple "setting" of the sample, "restricting" the sample may result in selection of non-contiguous observations from the full dataset. It may also change the structure of the dataset.

This can be seen in the case of panel data. Say we have a panel of five firms (indexed by the variable firm) observed in each of several years (identified by the variable year). Then the restriction

	smpl year=1995 --restrict

produces a dataset that is not a panel, but a cross-section for the year 1995. Similarly

	smpl firm=3 --restrict

produces a time-series dataset for firm number 3.

For these reasons (possible non-contiguity in the observations, possible change in the structure of the data), gretl acts differently when you "restrict" the sample as opposed to simply "setting" it. In the case of setting, the program merely records the starting and ending observations and uses these as parameters to the various commands calling for the estimation of models, the computation of statistics, and so on. In the case of restriction, the program makes a reduced copy of the dataset and by default treats this reduced copy as a simple, undated cross-section. If you wish to re-impose a time-series or panel interpretation of the reduced dataset you can do so using setobs (and panel if appropriate).
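
A minimal sketch of re-imposing a time-series interpretation after restricting the panel to a single firm. It assumes annual observations, and the starting year 1990 is hypothetical; substitute the first year actually present in the reduced dataset.

	smpl firm=3 --restrict
	# annual frequency (1), first observation 1990 (assumed start year)
	setobs 1 1990 --time-series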

You should be aware that, because of this difference in treatment of the "setting" and "restricting" forms of sub-sampling, the two forms are handled independently by gretl.

This means that if you want for some reason to combine a logical restriction with a limitation of the sample based on the observation number, you need to express the latter as a logical restriction, which you can do using the internal variable obs. To return to the cross-sectional example given above, if we want a sub-sample consisting of women only, excluding observations 1 to 30, we can do

	smpl (gender=0 & obs > 30) --restrict

Because "restricting" the sample creates a reduced copy of the original dataset, it may raise an issue when the dataset is very large (say, several thousand observations). With such a dataset in memory, the creation of a copy may leave the computer short of memory for calculating regression results. You can work around this as follows:

  1. Open the full dataset, and impose the sample restriction.

  2. Save a copy of the reduced dataset to disk.

  3. Close the full dataset and open the reduced one.

  4. Proceed with your analysis.
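
The steps above can be sketched as a script. The file names bigdata.gdt and reduced.gdt are hypothetical, and the income restriction simply reuses the earlier example. The sketch assumes that store writes out only the observations in the current sub-sample.

	# 1. open the full dataset and impose the restriction
	open bigdata.gdt
	smpl income>50000 --restrict
	# 2. save the reduced dataset to disk
	store reduced.gdt
	# 3. opening the reduced file replaces the full dataset in memory
	open reduced.gdt
	# 4. proceed with the analysis on the reduced dataset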