Title: | Distribution-based Random Forest |
---|---|
Description: | Extension of the rpart package with added loss functions and random forest functionality. |
Authors: | Roel Henckaerts [aut, cre] |
Maintainer: | Roel Henckaerts <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-10-17 05:16:26 UTC |
Source: | https://github.com/skranz/distRforest |
The car.test.frame data frame has 60 rows and 8 columns, giving data on makes of cars taken from the April, 1990 issue of Consumer Reports. This is part of a larger dataset, some columns of which are given in cu.summary.
car.test.frame
This data frame contains the following columns:
Price
a numeric vector giving the list price in US dollars of a standard model
Country
of origin, a factor with levels ‘France’, ‘Germany’, ‘Japan’ , ‘Japan/USA’, ‘Korea’, ‘Mexico’, ‘Sweden’ and ‘USA’
Reliability
a numeric vector coded 1 to 5.
Mileage
fuel consumption miles per US gallon, as tested.
Type
a factor with levels Compact, Large, Medium, Small, Sporty, Van
Weight
kerb weight in pounds.
Disp.
the engine capacity (displacement) in cubic inches.
HP
the net horsepower of the vehicle.
Consumer Reports, April, 1990, pp. 235–288 quoted in
John M. Chambers and Trevor J. Hastie eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA, pp. 46–47.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
summary(z.auto)
Data on 111 cars, taken from pages 235–255, 281–285 and 287–288 of the April 1990 Consumer Reports Magazine.
data(car90)
The data frame contains the following columns
a factor giving the country in which the car was manufactured
engine displacement in cubic inches
engine displacement in liters
engine revolutions per mile, or engine speed at 60 mph
distance between the car's head-liner and the head of a 5 ft. 9 in. front seat passenger, in inches, as measured by CU
maximum front leg room, in inches, as measured by CU
front shoulder room, in inches, as measured by CU
the overall gear ratio, high gear, for manual transmission
the overall gear ratio, high gear, for automatic transmission
net horsepower
the red line—the maximum safe engine speed in rpm
height of car, in inches, as supplied by manufacturer
overall length, in inches, as supplied by manufacturer
luggage space
a numeric vector of gas mileage in miles/gallon as tested by CU; contains NAs.
alternate name, if the car was sold under two labels
list price with standard equipment, in dollars
distance between the car's head-liner and the head of a 5 ft 9 in. rear seat passenger, in inches, as measured by CU
rear fore-and-aft seating room, in inches, as measured by CU
rear shoulder room, in inches, as measured by CU
an ordered factor with levels ‘Much worse’ < ‘worse’ < ‘average’ < ‘better’ < ‘Much better’: contains NAs.
factor giving the rim size
Number of turns of the steering wheel required for a turn of 30 foot radius, manual steering
Number of turns of the steering wheel required for a turn of 30 foot radius, power steering
steering type offered: manual, power, or both
fuel refill capacity in gallons
factor giving tire size
manual transmission, a factor with levels ‘’, ‘man.4’, ‘man.5’ and ‘man.6’
automatic transmission, a factor with levels ‘’, ‘auto.3’, ‘auto.4’, and ‘auto.CVT’. No car is missing both the manual and automatic transmission variables, but several had both as options
the radius of the turning circle in feet
a factor giving the general type of car. The levels are: ‘Small’, ‘Sporty’, ‘Compact’, ‘Medium’, ‘Large’, ‘Van’
an order statistic giving the relative weights of the cars; 1 is the lightest and 111 is the heaviest
length of wheelbase, in inches, as supplied by manufacturer
width of car, in inches, as supplied by manufacturer
This is derived (with permission) from the data set car.all in S-PLUS, but with some further clean up of variable names and definitions.
See car.test.frame and cu.summary for extracts from other versions of the dataset.
data(car90)
plot(car90$Price/1000, car90$Weight,
     xlab = "Price (thousands)", ylab = "Weight (lbs)")
mlowess <- function(x, y, ...) {
    keep <- !(is.na(x) | is.na(y))
    lowess(x[keep], y[keep], ...)
}
with(car90, lines(mlowess(Price/1000, Weight, f = 0.5)))
The cu.summary data frame has 117 rows and 5 columns, giving data on makes of cars taken from the April, 1990 issue of Consumer Reports.
cu.summary
This data frame contains the following columns:
Price
a numeric vector giving the list price in US dollars of a standard model
Country
of origin, a factor with levels ‘Brazil’, ‘England’, ‘France’, ‘Germany’, ‘Japan’, ‘Japan/USA’, ‘Korea’, ‘Mexico’, ‘Sweden’ and ‘USA’
Reliability
an ordered factor with levels ‘Much worse’ < ‘worse’ < ‘average’ < ‘better’ < ‘Much better’
Mileage
fuel consumption miles per US gallon, as tested.
Type
a factor with levels Compact, Large, Medium, Small, Sporty, Van
Consumer Reports, April, 1990, pp. 235–288 quoted in
John M. Chambers and Trevor J. Hastie eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA, pp. 46–47.
fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
par(xpd = TRUE)
plot(fit, compress = TRUE)
text(fit, use.n = TRUE)
This function acts as a user-friendly interface for the variable importance scores in a random forest based on individual rpart trees.
importance_rforest(object)
object |
fitted model object from the class rforest. |
data frame with one row for each variable and four columns:
the name of the variable.
the average importance score over all the individual trees.
scaled scores which sum to one.
scaled scores such that the maximum value is equal to one.
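The two scalings described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the per-tree scores are made-up numbers, and the column names `variable`, `importance`, `scale_sum` and `scale_max` are hypothetical placeholders for the four columns.

```r
# Hypothetical per-tree importance scores (rows = trees, columns = variables).
# These numbers are invented purely for illustration.
imp <- rbind(
  c(Weight = 40, HP = 10, Disp = 0),
  c(Weight = 30, HP = 20, Disp = 10),
  c(Weight = 50, HP = 0,  Disp = 10)
)

# Average importance over all individual trees.
avg <- colMeans(imp)

# Column names below are placeholders, not confirmed by the documentation.
res <- data.frame(
  variable   = names(avg),
  importance = unname(avg),
  scale_sum  = unname(avg / sum(avg)),  # scaled so the scores sum to one
  scale_max  = unname(avg / max(avg))   # scaled so the largest score is one
)

# Most important variable first.
res <- res[order(-res$importance), ]
print(res)
```

The sum-scaled column is convenient for reading scores as shares of total importance, while the max-scaled column makes it easy to compare each variable against the strongest one.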
The kyphosis data frame has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery.
kyphosis
This data frame contains the following columns:
Kyphosis
a factor with levels absent and present, indicating if a kyphosis (a type of deformation) was present after the operation.
Age
in months
Number
the number of vertebrae involved
Start
the number of the first (topmost) vertebra operated on.
John M. Chambers and Trevor J. Hastie eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              parms = list(prior = c(0.65, 0.35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = TRUE)
plot(fit)
text(fit, use.n = TRUE)
plot(fit2)
text(fit2, use.n = TRUE)
This function provides labels for the branches of an rpart tree.
## S3 method for class 'rpart'
labels(object, digits = 4, minlength = 1L, pretty, collapse = TRUE, ...)
object |
fitted model object of class rpart. |
digits |
the number of digits to be used for numeric values.
All of the |
minlength |
the minimum length for abbreviation of character or factor variables.
If |
pretty |
an argument included for compatibility with the tree package:
|
collapse |
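logical. The returned set of labels is always of the same length as the number of nodes in the tree. If collapse = TRUE, a vector of split labels is returned; if collapse = FALSE, a matrix of left and right splits is returned. |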
... |
optional arguments to |
Vector of split labels (collapse = TRUE) or matrix of left and right splits (collapse = FALSE) for the supplied rpart object. This function is called by printing methods for rpart and is not intended to be called directly by the users.
Creates a plot on the current graphics device of the deviance of the node divided by the number of observations at the node. Also returns the node number.
meanvar(tree, ...)
## S3 method for class 'rpart'
meanvar(tree, xlab = "ave(y)", ylab = "ave(deviance)", ...)
tree |
fitted model object of class rpart. |
xlab |
x-axis label for the plot. |
ylab |
y-axis label for the plot. |
... |
additional graphical parameters may be supplied as arguments to this function. |
an invisible list containing the following vectors is returned.
x |
fitted value at terminal nodes ( |
y |
deviance of node divided by number of observations at node. |
label |
node number. |
a plot is put on the current graphics device.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
meanvar(z.auto, log = 'xy')
Handles missing values in an "rpart" object.
na.rpart(x)
x |
a model frame. |
Default function that handles missing values when calling the function rpart.
It omits cases where part of the response is missing or all the explanatory variables are missing.
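The omission rule can be illustrated with a small base R sketch. It reproduces the stated rule by hand on a toy data frame (the data and variable names are made up) rather than calling na.rpart itself, so it should be read as an approximation of the behaviour, not as the function's source.

```r
# Toy data: y is the response, x1 and x2 are the explanatory variables.
df <- data.frame(
  y  = c(1, NA, 3, 4),
  x1 = c(NA, 2, NA, 5),
  x2 = c(NA, 1, 7, NA)
)

predictors <- c("x1", "x2")

# A case is dropped if the response is missing ...
resp_missing <- is.na(df$y)
# ... or if every explanatory variable is missing.
all_x_missing <- rowSums(is.na(df[, predictors])) == length(predictors)

kept <- df[!(resp_missing | all_x_missing), ]
print(kept)
```

Rows 3 and 4 survive even though each has one missing predictor, which is what lets surrogate splits handle partially observed cases later on.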
Returns a named list where each element contains the splits on the path from the root to the selected nodes.
path.rpart(tree, nodes, pretty = 0, print.it = TRUE)
tree |
fitted model object of class rpart. |
nodes |
an integer vector containing indices (node numbers) of all nodes for which paths are desired. If missing, user selects nodes as described below. |
pretty |
an integer denoting the extent to which factor levels in split labels
will be abbreviated. A value of (0) signifies no abbreviation. A
|
print.it |
Logical. Denotes whether paths will be printed out as
nodes are interactively selected. Irrelevant if |
The function has a required argument as an rpart object and a list of nodes as optional arguments. Omitting a list of nodes will cause the function to wait for the user to select nodes from the dendrogram. It will return a list, with one component for each node specified or selected. The component contains the sequence of splits leading to that node. In the graphical interaction, the individual paths are printed out as nodes are selected.
A named (by node) list, each element of which contains all the splits on the path from the root to the specified or selected nodes.
A dendrogram of the rpart object is expected to be visible on the graphics device, and a graphics input device (e.g. a mouse) is required. Clicking (the selection button) on a node selects that node. This process may be repeated any number of times. Clicking the exit button will stop the selection process and return the list of paths.
This function was modified from path.tree in S.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
print(fit)
path.rpart(fit, nodes = c(11, 22))
Plots an rpart object on the current graphics device.
## S3 method for class 'rpart'
plot(x, uniform = FALSE, branch = 1, compress = FALSE, nspace,
     margin = 0, minbranch = 0.3, ...)
x |
a fitted object of class rpart. |
uniform |
if |
branch |
controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square shouldered branches, a value of 0 give V shaped branches, with other values being intermediate. |
compress |
if |
nspace |
the amount of extra space between a node with children and
a leaf, as compared to the minimal space between leaves.
Applies to compressed trees only. The default is the value of
|
margin |
an extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). |
minbranch |
set the minimum length for a branch to |
... |
arguments to be passed to or from other methods. |
This function is a method for the generic function plot, for objects of class rpart.
The y-coordinate of the top node of the tree will always be 1.
The coordinates of the nodes are returned as a list, with components x and y.
An unlabeled plot is produced on the current graphics device: one being opened if needed.
In order to build up a plot in the usual S style, e.g., a separate text command for adding labels, some extra information about the plot needs to be retained. This is kept in an environment in the package.
fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
par(xpd = TRUE)
plot(fit, compress = TRUE)
text(fit, use.n = TRUE)
Gives a visual representation of the cross-validation results in an rpart object.
plotcp(x, minline = TRUE, lty = 3, col = 1,
       upper = c("size", "splits", "none"), ...)
x |
an object of class rpart. |
minline |
whether a horizontal line is drawn 1SE above the minimum of the curve. |
lty |
line type for this line |
col |
colour for this line |
upper |
what is plotted on the top axis: the size of the tree (the number of leaves), the number of splits or nothing. |
... |
additional plotting parameters |
The set of possible cost-complexity prunings of a tree form a nested set. For the geometric means of the intervals of values of cp for which a pruning is optimal, a cross-validation has (usually) been done in the initial construction by rpart. The cptable in the fit contains the mean and standard deviation of the errors in the cross-validated prediction against each of the geometric means, and these are plotted by this function. A good choice of cp for pruning is often the leftmost value for which the mean lies below the horizontal line.
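The selection rule in the last sentence can be sketched numerically from the cptable. This is a minimal sketch, assuming the CRAN rpart package as a stand-in (the extension described here is expected, but not confirmed, to expose the same cptable layout); it reads the CP column directly rather than the geometric means that plotcp draws.

```r
library(rpart)  # assumption: CRAN rpart used as a stand-in

fit <- rpart(Mileage ~ Weight, data = car.test.frame)
cpt <- fit$cptable

# Horizontal line: minimum cross-validated error plus one standard error.
i_min     <- which.min(cpt[, "xerror"])
threshold <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"]

# Rows are ordered from the smallest tree to the largest, so the first
# row below the line corresponds to the leftmost point on the plot.
i_pick  <- which(cpt[, "xerror"] < threshold)[1]
cp_pick <- cpt[i_pick, "CP"]

pruned <- prune(fit, cp = cp_pick)
print(pruned)
```

Because the cross-validation folds are drawn at random, the chosen row can differ between runs unless a seed is set beforehand.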
None.
A plot is produced on the current graphical device.
Generates a PostScript presentation plot of an rpart object.
post(tree, ...)
## S3 method for class 'rpart'
post(tree, title.,
     filename = paste(deparse(substitute(tree)), ".ps", sep = ""),
     digits = getOption("digits") - 2, pretty = TRUE,
     use.n = TRUE, horizontal = TRUE, ...)
tree |
fitted model object of class rpart. |
title. |
a title which appears at the top of the plot. By default, the
name of the |
filename |
ASCII file to contain the output. By default, the name of the file is
the name of the object given by |
digits |
number of significant digits to include in numerical data. |
pretty |
an integer denoting the extent to which factor levels will be
abbreviated in the character strings defining the splits;
(0) signifies no abbreviation of levels. A |
use.n |
Logical. If |
horizontal |
Logical. If |
... |
other arguments to the |
The plot created uses the functions plot.rpart and text.rpart (with the fancy option). The settings were chosen because they looked good to us, but other options may be better, depending on the rpart object. Users are encouraged to write their own function containing favorite options.
a plot of rpart is created using the postscript driver, or the current device if filename = "".
plot.rpart, rpart, text.rpart, abbreviate
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
post(z.auto, file = "")   # display tree on active device
# now construct postscript version on file "pretty.ps"
# with no title
post(z.auto, file = "pretty.ps", title = " ")
z.hp <- rpart(Mileage ~ Weight + HP, car.test.frame)
post(z.hp)
This function obtains predictions from a random forest based on aggregating the predictions from individual rpart trees. A majority vote is taken for binary classification trees, while the predictions are averaged for normal, poisson, gamma and lognormal regression trees.
## S3 method for class 'rforest'
predict(object, newdata)
object |
fitted model object from the class rforest. |
newdata |
data frame containing the observations to predict. This
argument can only be missing when the random forest in |
numeric vector with the averaged predictions (for regression) or the majority vote (for classification) of the individual trees.
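The two aggregation rules can be illustrated with a hand-rolled ensemble. This is a sketch under stated assumptions, not the package's code: it uses the CRAN rpart package as a stand-in, grows each tree on a plain bootstrap sample, and the names `reg_trees`, `cls_trees`, `avg_pred` and `majority` are made up for the example.

```r
library(rpart)  # assumption: CRAN rpart stands in for the package's own trees

set.seed(1)
n_trees <- 25  # odd, so a two-class vote cannot tie

# Regression: grow each tree on a bootstrap sample, then average predictions.
reg_trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(car.test.frame), replace = TRUE)
  rpart(Mileage ~ Weight + HP, data = car.test.frame[idx, ])
})
# Matrix with one row per car and one column per tree.
reg_preds <- sapply(reg_trees, predict, newdata = car.test.frame)
avg_pred  <- rowMeans(reg_preds)

# Binary classification: each tree votes, the most frequent class wins.
cls_trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(kyphosis), replace = TRUE)
  rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ])
})
votes <- sapply(cls_trees, function(tree) {
  as.character(predict(tree, newdata = kyphosis, type = "class"))
})
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

head(avg_pred)
table(majority)
```

Averaging suits the regression families because each tree already predicts on the response scale, whereas class labels have no meaningful mean and are therefore combined by vote.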
Returns a vector of predicted responses from a fitted rpart object.
## S3 method for class 'rpart'
predict(object, newdata,
        type = c("vector", "prob", "class", "matrix"),
        na.action = na.pass, ...)
object |
fitted model object of class rpart. |
newdata |
data frame containing the values at which predictions are required.
The predictors referred to in the right side of
|
type |
character string denoting the type of predicted value returned. If
the |
na.action |
a function to determine what should be done with
missing values in |
... |
further arguments passed to or from other methods. |
This function is a method for the generic function predict for class "rpart". It can be invoked by calling predict for an object of the appropriate class, or directly by calling predict.rpart regardless of the class of the object.
A new object is obtained by dropping newdata down the object. For factor predictors, if an observation contains a level not used to grow the tree, it is left at the deepest possible node and frame$yval at the node is the prediction.
If type = "vector": vector of predicted responses. For regression trees this is the mean response at the node, for Poisson trees it is the estimated response rate, and for classification trees it is the predicted class (as a number).
If type = "prob": (for a classification tree) a matrix of class probabilities.
If type = "matrix": a matrix of the full responses (frame$yval2 if this exists, otherwise frame$yval). For regression trees, this is the mean response, for Poisson trees it is the response rate and the number of events at that node in the fitted tree, and for classification trees it is the concatenation of at least the predicted class, the class counts at that node in the fitted tree, and the class probabilities (some versions of rpart may contain further columns).
If type = "class": (for a classification tree) a factor of classifications based on the responses.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
predict(z.auto)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
predict(fit, type = "prob")   # class probabilities (default)
predict(fit, type = "vector") # level numbers
predict(fit, type = "class")  # factor
predict(fit, type = "matrix") # level number, class frequencies, probabilities
sub <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
fit <- rpart(Species ~ ., data = iris, subset = sub)
fit
table(predict(fit, iris[-sub,], type = "class"), iris[-sub, "Species"])
This function prints an rpart object. It is a method for the generic function print of class "rpart".
## S3 method for class 'rpart'
print(x, minlength = 0, spaces = 2, cp, digits = getOption("digits"), ...)
x |
fitted model object of class rpart. |
minlength |
Controls the abbreviation of labels: see |
spaces |
the number of spaces to indent nodes of increasing depth. |
digits |
the number of digits of numbers to print. |
cp |
prune all nodes with a complexity less than |
... |
arguments to be passed to or from other methods. |
This function is a method for the generic function print for class "rpart". It can be invoked by calling print for an object of the appropriate class, or directly by calling print.rpart regardless of the class of the object.
A semi-graphical layout of the contents of x$frame is printed. Indentation is used to convey the tree topology. Information for each node includes the node number, split, size, deviance, and fitted value. For the "class" method, the class probabilities are also printed.
print, rpart.object, summary.rpart, printcp
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
z.auto
## Not run:
node), split, n, deviance, yval
      * denotes terminal node

 1) root 60 1354.58300 24.58333
   2) Weight>=2567.5 45 361.20000 22.46667
     4) Weight>=3087.5 22 61.31818 20.40909 *
     5) Weight<3087.5 23 117.65220 24.43478
      10) Weight>=2747.5 15 60.40000 23.80000 *
      11) Weight<2747.5 8 39.87500 25.62500 *
   3) Weight<2567.5 15 186.93330 30.93333 *
## End(Not run)
Displays the cp table for a fitted rpart object.
printcp(x, digits = getOption("digits") - 2)
x |
fitted model object of class rpart. |
digits |
the number of digits of numbers to print. |
Prints a table of optimal prunings based on a complexity parameter.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
printcp(z.auto)
## Not run:
Regression tree:
rpart(formula = Mileage ~ Weight, data = car.test.frame)

Variables actually used in tree construction:
[1] Weight

Root node error: 1354.6/60 = 22.576

        CP nsplit rel error  xerror     xstd
1 0.595349      0   1.00000 1.03436 0.178526
2 0.134528      1   0.40465 0.60508 0.105217
3 0.012828      2   0.27012 0.45153 0.083330
4 0.010000      3   0.25729 0.44826 0.076998
## End(Not run)
Determines a nested sequence of subtrees of the supplied rpart object by recursively snipping off the least important splits, based on the complexity parameter (cp).
prune(tree, ...)
## S3 method for class 'rpart'
prune(tree, cp, ...)
tree |
fitted model object of class rpart. |
cp |
Complexity parameter to which the |
... |
further arguments passed to or from other methods. |
A new rpart object that is trimmed to the value cp.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
zp <- prune(z.auto, cp = 0.1)
plot(zp) # plot smaller rpart object
Method for residuals for an rpart object.
## S3 method for class 'rpart'
residuals(object, type = c("usual", "pearson", "deviance"), ...)
object |
fitted model object of class rpart. |
type |
Indicates the type of residual desired. For regression or For classification trees the For |
... |
further arguments passed to or from other methods. |
Vector of residuals of type type from a fitted rpart object.
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
fit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel,
             data = solder, method = "anova")
summary(residuals(fit))
plot(predict(fit), residuals(fit))
This function acts as a user-friendly interface to build a random forest based on individual rpart trees.
rforest(
  formula, data, method,
  weights = NULL, parms = NULL, control = NULL,
  ncand, ntrees,
  subsample = 1, track_oob = FALSE,
  keep_data = FALSE, red_mem = FALSE
)
formula |
object of the class |
data |
data frame containing the training data observations. |
method |
string specifying the type of forest to build. Options are:
|
weights |
optional name of the variable in |
parms |
optional parameters for the splitting function, see
|
control |
list of options that control the fitting details of the
|
ncand |
integer specifying the number of randomly chosen variable candidates to consider at each node to find the optimal split. |
ntrees |
integer specifying the number of trees in the ensemble. |
subsample |
numeric in the range [0,1]. Each tree in the ensemble is
built on randomly sampled data of size |
track_oob |
boolean to indicate whether the out-of-bag errors should be
tracked (TRUE) or not (FALSE). This option is not implemented for
All these errors are evaluated in
a weighted version if |
keep_data |
boolean to indicate whether the |
red_mem |
boolean whether to reduce the memory footprint of the
|
object of the class rforest, which is a list containing the following elements:
list of length equal to ntrees, containing the individual rpart trees.
numeric vector of length equal to ntrees, containing the OOB error at each iteration (if track_oob = TRUE).
the training data (if keep_data = TRUE).
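The shape of the returned list can be mimicked with a short loop. This is a rough sketch under several assumptions, not the rforest implementation: it uses the CRAN rpart package (which has no ncand argument, so per-node variable sampling is left out), it assumes subsample is a fraction of the rows drawn without replacement, and it records a per-tree held-out mean squared error as a simple stand-in for the tracked OOB error.

```r
library(rpart)  # assumption: CRAN rpart as a stand-in; no per-node ncand sampling

set.seed(42)
ntrees    <- 10
subsample <- 0.7                       # fraction of rows used per tree (assumed meaning)
n         <- nrow(car.test.frame)
size      <- floor(subsample * n)

forest    <- vector("list", ntrees)    # the individual rpart trees
oob_error <- numeric(ntrees)           # one error value per iteration

for (i in seq_len(ntrees)) {
  in_bag <- sample(n, size)            # rows used to grow tree i
  forest[[i]] <- rpart(Mileage ~ Weight + HP + Disp.,
                       data = car.test.frame[in_bag, ])

  # Evaluate tree i on the rows it did not see (hypothetical error measure).
  oob <- car.test.frame[-in_bag, ]
  oob_error[i] <- mean((oob$Mileage - predict(forest[[i]], newdata = oob))^2)
}

length(forest)
round(oob_error, 2)
```

Keeping the trees in a plain list makes it straightforward to aggregate their predictions afterwards or to inspect any single tree with the usual rpart methods.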
Fit a rpart model
rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost,
      ncand, seed, redmem = FALSE, ...)
formula |
a formula, with a response but no interaction
terms. If this is a data frame, it is taken as the model frame
(see |
data |
an optional data frame in which to interpret the variables named in the formula. |
weights |
optional case weights. |
subset |
optional expression saying that only a subset of the rows of the data should be used in the fit. |
na.action |
the default action deletes all observations for which
|
method |
one of Alternatively, |
model |
if logical: keep a copy of the model frame in the result?
If the input value for |
x |
keep a copy of the |
y |
keep a copy of the dependent variable in the result. If
missing and |
parms |
optional parameters for the splitting function. |
control |
a list of options that control details of the
|
cost |
a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose. |
ncand |
integer specifying the number of randomly chosen variable candidates to consider at each node to find the optimal split. |
seed |
seed for the random number generator which chooses the
|
redmem |
boolean whether to reduce the memory footprint of the
fitted |
... |
arguments to |
This differs from the tree function in S mainly in its handling of surrogate variables. In most details it follows Breiman et al. (1984) quite closely. R package tree provides a re-implementation of tree.
An object of class rpart. See rpart.object.
Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J. (1984) Classification and Regression Trees. Wadsworth.
rpart.control, rpart.object, summary.rpart, print.rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              parms = list(prior = c(.65,.35), split = "information"))
fit3 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              control = rpart.control(cp = 0.05))
par(mfrow = c(1,2), xpd = NA) # otherwise on some devices the text is clipped
plot(fit)
text(fit, use.n = TRUE)
plot(fit2)
text(fit2, use.n = TRUE)
Various parameters that control aspects of the rpart fit.
rpart.control(minsplit = 20, minbucket = round(minsplit/3), cp = 0.01,
              maxcompete = 4, maxsurrogate = 5, usesurrogate = 2, xval = 10,
              surrogatestyle = 0, maxdepth = 30, ...)
minsplit |
the minimum number of observations that must exist in a node in order for a split to be attempted. |
minbucket |
the minimum number of observations in any terminal |
cp |
complexity parameter. Any split that does not decrease the overall
lack of fit by a factor of |
maxcompete |
the number of competitor splits retained in the output. It is useful to know not just which split was chosen, but which variable came in second, third, etc. |
maxsurrogate |
the number of surrogate splits retained in the output. If this is set to zero the compute time will be reduced, since approximately half of the computational time (other than setup) is used in the search for surrogate splits. |
usesurrogate |
how to use surrogates in the splitting process. |
xval |
number of cross-validations. |
surrogatestyle |
controls the selection of a best surrogate.
If set to |
maxdepth |
Set the maximum depth of any node of the final tree, with the root
node counted as depth 0. Values greater than 30 |
... |
mop up other arguments. |
A list containing the options.
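A short usage sketch may help show how the returned list is used and how the minbucket default follows from minsplit. It assumes the CRAN rpart package as a stand-in for the functions documented here; the particular settings are arbitrary.

```r
library(rpart)  # assumption: CRAN rpart used as a stand-in

# Only minsplit, cp and maxdepth are set; the rest keep their defaults.
ctrl <- rpart.control(minsplit = 10, cp = 0.001, maxdepth = 5)

# minbucket was not supplied, so it is derived as round(minsplit / 3).
ctrl$minbucket
ctrl$xval

# The list is passed to the fitting function through the control argument.
fit <- rpart(Mileage ~ Weight + HP, data = car.test.frame, control = ctrl)
print(fit)
```

Lowering minsplit and cp together allows a deeper tree than the defaults would, which is why maxdepth is capped here to keep the result readable.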
This function does the initialization step for rpart, when the response is a survival object. It rescales the data so as to have an exponential baseline hazard and then uses Poisson methods. This function would rarely if ever be called directly by a user.
rpart.exp(y, offset, parms, wt)
y |
the response, which will be of class |
offset |
optional offset |
parms |
parameters controlling the fit.
This is a list with components |
wt |
case weights, if present |
a list with the necessary initialization components
Terry Therneau
These are objects representing fitted rpart trees.
frame |
data frame with one row for each node in the tree.
The Extra response information which may be present is in |
where |
an integer vector of the same length as the number of observations in the
root node, containing the row number of |
call |
an image of the call that produced the object, but with the arguments
all named and with the actual formula included as the formula argument.
To re-evaluate the call, say |
terms |
an object of class |
splits |
a numeric matrix describing the splits: only present if there are any.
The row label is the name of
the split variable, and columns are |
csplit |
an integer matrix. (Only present if at least one of the split
variables is a factor or ordered factor.) There is a row for
each such split, and the number of columns is the largest number of
levels in the factors. Which row is given by the |
method |
character string: the method used to grow the tree. One of
|
cptable |
a matrix of information on the optimal prunings based on a complexity parameter. |
variable.importance |
a named numeric vector giving the importance of each variable. (Only
present if there are any splits.) When printed by
|
numresp |
integer number of responses; the number of levels for a factor response. |
parms , control
|
a record of the arguments supplied, with defaults filled in. |
functions |
the |
ordered |
a named logical vector recording for each variable if it was an ordered factor. |
na.action |
(where relevant) information returned by |
There may be attributes "xlevels" and "levels" recording the levels of any factor splitting variables and of a factor response respectively.
Optional components include the model frame (model), the matrix of predictors (x) and the response variable (y) used to construct the rpart object.
The following components must be included in a legitimate rpart object.
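The components described in this entry can be inspected directly on a fitted tree. The following is a small sketch, assuming the functions are loaded from the rpart package (which this package extends) and using the `car.test.frame` data documented earlier; it is meant as an illustration rather than a complete tour of the object.

```r
# Assumption: rpart() is taken from the rpart package, which this
# package extends.
library(rpart)

fit <- rpart(Mileage ~ Weight, data = car.test.frame)

names(fit)                                   # components present in this fit
fit$frame[, c("var", "n", "dev", "yval")]    # one row per node
fit$method                                   # method used to grow the tree

# 'where' holds, for each observation, the row of 'frame' for its leaf.
table(rownames(fit$frame)[fit$where])

fit$cptable                                  # prunings by complexity parameter
fit$variable.importance                      # only present if there are splits
```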
Produces two plots. The first plots the r-square (apparent, and as estimated from cross-validation) versus the number of splits. The second plots the relative error from cross-validation, +/- 1 SE, versus the number of splits.
rsq.rpart(x)
x |
fitted model object of class |
Two plots are produced.
The labels are only appropriate for the "anova" method.
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
rsq.rpart(z.auto)
Creates a "snipped" rpart object, containing the nodes that remain after selected subtrees have been snipped off. The user can snip nodes using the toss argument, or interactively by clicking the mouse button on specified nodes within the graphics window.
snip.rpart(x, toss)
x |
fitted model object of class |
toss |
an integer vector containing indices (node numbers) of all subtrees to be snipped off. If missing, user selects branches to snip off as described below. |
A dendrogram of rpart is expected to be visible on the graphics device, and a graphics input device (e.g., a mouse) is required. Clicking (the selection button) on a node displays the node number, sample size, response y-value, and Error (dev). Clicking a second time on the same node snips that subtree off and visually erases the subtree. This process may be repeated any number of times. Warnings result from selecting the root or leaf nodes. Clicking the exit button will stop the snipping process and return the resulting rpart object.
See the documentation for the specific graphics device for details on graphical input techniques.
An rpart object containing the nodes that remain after specified or selected subtrees have been snipped off.
Visually erasing the plot is done by over-plotting with the background colour. This will do nothing if the background is transparent (often true for screen devices).
## dataset not in R
## Not run:
z.survey <- rpart(market.survey) # grow the rpart object
plot(z.survey) # plot the tree
z.survey2 <- snip.rpart(z.survey, toss = 2) # trim subtree at node 2
plot(z.survey2) # plot new tree
# can also interactively select the node using the mouse in the
# graphics window
## End(Not run)
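Since the example above relies on a dataset that is not available in R, here is a runnable sketch of the non-interactive route, using the toss argument on the `car.test.frame` data. It assumes the functions are loaded from the rpart package, which this package extends; node 2 is chosen only for illustration.

```r
# Assumption: rpart() and snip.rpart() are taken from the rpart package,
# which this package extends.
library(rpart)

fit <- rpart(Mileage ~ Weight, data = car.test.frame)
rownames(fit$frame)          # node numbers of the full tree

# Snip off the subtree below node 2; no graphics device is needed
# because toss is supplied.
snipped <- snip.rpart(fit, toss = 2)
rownames(snipped$frame)      # descendants of node 2 are gone
snipped$frame["2", "var"]    # node 2 itself is now a leaf
```

Supplying toss avoids the mouse-driven selection entirely, which makes this form suitable for scripts.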
The solder data frame has 720 rows and 6 columns, representing a balanced subset of a designed experiment varying 5 factors on the soldering of components on printed-circuit boards.
solder
This data frame contains the following columns:
Opening
a factor with levels ‘L’, ‘M’ and ‘S’ indicating the amount of clearance around the mounting pad.
Solder
a factor with levels ‘Thick’ and ‘Thin’ giving the thickness of the solder used.
Mask
a factor with levels ‘A1.5’, ‘A3’, ‘B3’ and ‘B6’ indicating the type and thickness of mask used.
PadType
a factor with levels ‘D4’, ‘D6’, ‘D7’, ‘L4’, ‘L6’, ‘L7’, ‘L8’, ‘L9’, ‘W4’ and ‘W9’ giving the size and geometry of the mounting pad.
Panel
1:3 indicating the panel on a board being tested.
skips
a numeric vector giving the number of visible solder skips.
John M. Chambers and Trevor J. Hastie eds. (1992) Statistical Models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA.
fit <- rpart(skips ~ Opening + Solder + Mask + PadType + Panel, data = solder, method = "anova")
summary(residuals(fit))
plot(predict(fit), residuals(fit))
A set of 146 patients with stage C prostate cancer, from a study exploring the prognostic value of flow cytometry.
data(stagec)
A data frame with 146 observations on the following 8 variables.
pgtime
Time to progression or last follow-up (years)
pgstat
1 = progression observed, 0 = censored
age
age in years
eet
early endocrine therapy, 1 = no, 2 = yes
g2
percent of cells in G2 phase, as found by flow cytometry
grade
grade of the tumor, Farrow system
gleason
grade of the tumor, Gleason system
ploidy
the ploidy status of the tumor, from flow cytometry. Values are ‘diploid’, ‘tetraploid’, and ‘aneuploid’
A tumor is called diploid (normal complement of dividing cells) if the fraction of cells in G2 phase was determined to be 13% or less. Aneuploid cells have a measurable fraction with a chromosome count that is neither 24 nor 48; for these the G2 percent is difficult or impossible to measure.
require(survival)
rpart(Surv(pgtime, pgstat) ~ ., stagec)
Returns a detailed listing of a fitted rpart object.
## S3 method for class 'rpart'
summary(object, cp = 0, digits = getOption("digits"), file, ...)
object |
fitted model object of class |
digits |
Number of significant digits to be used in the result. |
cp |
trim nodes with a complexity of less than |
file |
write the output to a given file name. (Full listings of a tree are often quite long). |
... |
arguments to be passed to or from other methods. |
This function is a method for the generic function summary for class "rpart". It can be invoked by calling summary for an object of the appropriate class, or directly by calling summary.rpart regardless of the class of the object.
It prints the call, the table shown by printcp, the variable importance (summing to 100) and details for each node (the details depending on the type of tree).
summary, rpart.object, printcp.
## a regression tree
z.auto <- rpart(Mileage ~ Weight, car.test.frame)
summary(z.auto)

## a classification tree with multiple variables and surrogate splits.
summary(rpart(Kyphosis ~ Age + Number + Start, data = kyphosis))
Labels the current plot of the tree dendrogram with text.
## S3 method for class 'rpart'
text(x, splits = TRUE, label, FUN = text, all = FALSE, pretty = NULL, digits = getOption("digits") - 3, use.n = FALSE, fancy = FALSE, fwidth = 0.8, fheight = 0.8, bg = par("bg"), minlength = 1L, ...)
x |
fitted model object of class |
splits |
logical flag. If |
label |
For compatibility with |
FUN |
the name of a labeling function, e.g. |
all |
Logical. If |
minlength |
the length to use for factor labels. A value of 1 causes them to be
printed as ‘a’, ‘b’, .....
Larger values use abbreviations of the label names.
See the |
pretty |
an alternative to the |
digits |
number of significant digits to include in numerical labels. |
use.n |
Logical. If |
fancy |
Logical. If |
fwidth |
Relates to option |
fheight |
Relates to option |
bg |
The color used to paint the background to annotations if |
... |
Graphical parameters may also be supplied as arguments to this
function (see |
the current plot of a tree dendrogram is labeled.
text, plot.rpart, rpart, labels.rpart, abbreviate
freen.tr <- rpart(y ~ ., freeny)
par(xpd = TRUE)
plot(freen.tr)
text(freen.tr, use.n = TRUE, all = TRUE)
Gives the predicted values for an rpart fit, under cross validation, for a set of complexity parameter values.
xpred.rpart(fit, xval = 10, cp, return.all = FALSE)
fit |
an object of class |
xval |
number of cross-validation groups. This may also be an explicit list of integers that define the cross-validation groups. |
cp |
the desired list of complexity values. By default it is taken from the
|
return.all |
if FALSE return only the first element of the prediction |
Complexity penalties are actually ranges, not values. If the cp values found in the table were 0.36, 0.28, and 0.13, for instance, this means that the first row of the table holds for all complexity penalties in the range [0.36, 1], the second row for cp in the range [0.28, 0.36) and the third row for [0.13, 0.28). By default, the geometric mean of each interval is used for cross validation.
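To make the interval idea concrete, the sketch below computes the geometric mean of each complexity interval by hand from a fitted tree's cptable and passes the result explicitly through the cp argument. It assumes the functions are loaded from the rpart package, which this package extends; the names `cp_lower`, `cp_upper` and `cp_mid` are illustrative, and the function's own default may treat the first interval differently, so the values need not match the internal ones exactly.

```r
# Assumption: rpart() and xpred.rpart() are taken from the rpart package,
# which this package extends.
library(rpart)

set.seed(1)  # the cross-validation groups are drawn at random
fit <- rpart(Mileage ~ Weight, data = car.test.frame)

# Each row of the cptable holds from its own CP value up to the CP value
# of the previous row (taking 1 as the upper end for the first row).
cp_lower <- fit$cptable[, "CP"]
cp_upper <- c(1, cp_lower[-length(cp_lower)])

# Geometric mean of each interval, as described in the text.
cp_mid <- sqrt(cp_lower * cp_upper)

# One column of cross-validated predictions per supplied complexity value.
xmat <- xpred.rpart(fit, xval = 10, cp = cp_mid)
dim(xmat)
```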
A matrix with one row for each observation and one column for each complexity value. If return.all is TRUE and the prediction for each node is a vector, then the result will be an array containing all of the predictions. When the response is categorical, for instance, the result contains the predicted class followed by the class probabilities of the selected terminal node; result[1,,] will be the matrix of predicted classes, result[2,,] the matrix of class 1 probabilities, etc.
fit <- rpart(Mileage ~ Weight, car.test.frame)
xmat <- xpred.rpart(fit)
xerr <- (xmat - car.test.frame$Mileage)^2
apply(xerr, 2, sum) # cross-validated error estimate

# approx same result as rel. error from printcp(fit)
apply(xerr, 2, sum)/var(car.test.frame$Mileage)
printcp(fit)