|
Visualization of multivariate functions, sets, and dataWe provide visualization tools which are specialized for visualizing multivariate density functions and density estimates. Tools are also provided for visualizing multivariate data. Multivariate functions are visualized with level set trees. Level set trees transform a multivariate function to a 1D function so that its mode structure is isomorphic to the original multivariate function. A volume plot is a plot of this 1D function and a barycenter plot shows the locations of the barycenters of the level sets. When a multivariate function is unimodal we may visualize the boundary functions of its level sets with the level set tree type methods. We call a shape tree a recursive approximation of a connected set, and a radius plot or a tail probability plot visualizes the mode structure of the boundary function. A location plot shows how the shape is located in the multivariate Euclidean space. A discrete set may be visualized with a similar type of recursive approximation as continuous sets. We call a tail tree such a recursive approximation of a multivariate point cloud. A tail frequency plot and a tail tree plot show the mode structure of this point cloud. We provide an R package "denpro". R is a language and environment for statistical computing and graphics. It may be downloaded from R archive network . The package "denpro" is designed by Jussi Klemelä. I am grateful for bug reports. ContentsIntroduction:Level set treesWe visualize a density function with the help of its level set tree. A level set tree is a tree structure built from the separated components of the level sets of the function. It makes a recursive approximation of the function. With different graphical representations of level set trees we may visualize the number and relative importance of the modes, the barycenters of the separated regions of the level sets, in particular the locations of the modes, and the skewness and the kurtosis of the density. The visualization method may be applied in exploratory data analysis, to make graphical inference on the existence of the modes, and to apply graphical methods for the smoothing parameter selection. Visualization with level set trees is in many cases more effective and easy to use than the method of looking at slices and projections of the density function. Studying level sets of a function for a series of levels provides information on the shape of a multivariate function. Level sets has been applied in mode detection and density estimation in 3 and 4 dimensional cases. In the 4D case one may visualize 3D density contours as the fourth variable is changed over its range. Studying projections and marginal densities is in many cases sufficient but projections may hide some high dimensional features. In density estimation we are interested how the probability mass is distributed over the d-dimensional Euclidean space. Local extremes of the density function express concentration of the probability mass. We are interested both in visualizing the locations of these local extremes, and also in visualizing the size of the probability mass associated with each local extreme. A volume plot visualizes the amount of probability mass associated with each local extreme. A barycenter plot visualizes the locations of these probability masses, drawing a "skeleton" of the function. The figures below show an equal mixture of 4 Gaussian densities in the 3D Euclidean space. Below we visualize a kernel estimate with 2000 observations from a mixture of 5 Gaussian densities in the 4D Euclidean space. Window a.1) shows the volume plot, a.2) zooms to the upper levels of the estimate, and b.1-4) show barycenter plots. Below we illustrate the definition of the level set tree. Shape treesBeyond the mode structure, level set tree based plots give only a limited insight into the shape of the density. We need additional visualization tools since also unimodal densities may have a large variety of shapes: the probability mass is distributed only in some regions of the huge multivariate Euclidean space. The visualization tools which we present are based on a shape tree of a set. We visualize a density by visualizing a series of level sets of the density. We form a series of shape trees from a series of level sets and apply graphical tools to visualize the shape trees. A shape tree may be formed from a connected set, where by a connected set we mean a set which cannot be expressed as a union of two sets which do not touch each other. A shape tree of a connected set is a tree which is formed by looking at the deviations of the set from the spheres centered at a given reference point, where the reference point is typically the location of the mode of the density, or the barycenter of the set. Level sets of a unimodal density are connected. When a density is multimodal some of its level sets, possibly at higher levels, are not connected, and in this case we should visualize separately each connected component, or restrict ourselves to the visualization of lower level sets. A shape tree of a star shaped set is basically a level set tree of its boundary function. A shape tree of a connected set is a tree which is formed by looking at the deviations of the set from the balls centered at a given reference point, where the reference point is typically the location of the mode of the density, or the barycenter of the set. The root of the tree is identified with the set itself. The non-root nodes correspond to the separated components of the intersection of the set with the complement of a ball centered at the reference point. We present two types of plots: the shape plots and the location plot. The shape plots visualize only the shape of the set and the location plot visualizes spatial information showing how the shape is located in the multivariate Euclidean space. The shape plots include the radius plot, the tail probability plot, the probability content plot, and the volume plot. These plots are plots of 1D functions which are obtained from the shape tree. Thus we define transforms of a multivariate set to 1D functions. For example, a radius content transform reveals basically the following features of the set.
Below are examples of 2D radius transforms. Tail treesWith tail trees we visualize the shape, the location, and the orientation of a multivariate point cloud. We interpret the point cloud as realizations of n random vectors having a common density function. We concentrate on the case where the point cloud is connected in the sense of single linkage clustering. The assumption of the connectedness of the point cloud may be interpreted as the connectedness of the support, or a low-level level set, of the density which has generated the observations. The visualizations are based on the concept of a tail tree of a set. A tail tree of a set of finite cardinality is a tree whose nodes are associated with the subsets of the set, so that the sets in the upper levels of the tree correspond to the extensions or tails of the set. The concept of a tail tree is closely related to the concept of a shape tree. A shape tree is defined for a connected set, when the connectedness was defined in terms of the topology of the Euclidean metric. We want to modify that concept in order to visualize sets of finite cardinality. A shape tree may be applied to visualize level sets of a density estimate, but with tail trees we may visualize the raw data, avoiding the density estimation step. Since density estimation is difficult in high dimensional cases we may be able to analyze higher dimensional data with tail trees than with shape trees. Below we visualize a data of size 1000 from a distribution with Gaussian copula whose shape parameter is r=0.5, and whose marginals are Student with degrees of freedom 3. Literature:Level set trees are described in the article "Visualization of multivariate density estimates with level set trees" . See Homepage of the article . Shape trees are described in the article "Visualization of multivariate density estimates with shape trees" . Some algorithms related to level set trees are described in the article "Algorithms for manipulation of level sets of nonparametric density estimates". The article describes algorithm DynaDecompose, but please note that algorithm LeafsFirst is now better supported. Tail trees are described in the article "Visualization of multivariate data with tail trees" . Some 2D shape isomorphic transforms (2D volume function and 2D probability content function) are described in the article "Visualization of the spread of multivariate distributions". Mode graphs and branching maps are described in the article "Visualization of scales of multivariate density estimates". Installation instructions:The programs are provided as an R-package. There are two ways to install the functions.
Documentation:Here is a listing of the functions, which the package provides. Functions to calculate a level set tree, shape tree or tail tree:
Visualization of scales of estimates:
Visualization of anisotropic spread (2D volume function and 2D probability content function):
Other utilities:
Other visualization tools:
Clustering:
Miscallenous:
In R use the command help(func) to get online manual for "func". Tutorial:# First load the library library(denpro) Short tutorial# level set tree dendat<-sim.data(n=200,type="mulmod") # data pcf<-pcf.kern(dendat,h=1,N=c(32,32)) # kernel estimate lst<-leafsfirst(pcf) # level set tree td<-treedisc(lst,pcf,ngrid=60) # pruned level set tree plotvolu(td) # volume plot plotbary(td) # barycenter plot # shape tree dendat<-sim.data(n=200,type="cross") # data pcf<-pcf.kern(dendat,h=1,N=c(32,32)) # kernel estimate st<-leafsfirst(pcf,propo=0.01) # shape tree, 1 per cent level set tdst<-treedisc(st,pcf,ngrid=60) # pruned shape tree plotvolu(tdst) # radius plot plotbary(tdst) # location plot # tail tree dendat<-sim.data(n=200,type="cross") # data tt<-leafsfirst(dendat=dendat,rho=0.65) # tail tree plotbary(tt) # tail tree plot plotvolu(tt) # tail frequency plot Level set treesdendat<-sim.data(n=200,type="mulmod") # data # calculate a kernel estimate h<-1 # smoothing parameter N<-c(8,8) # size of the grid pcf<-pcf.kern(dendat,h,N) # kernel estimate as a piecewise constant function # calculate a level set tree of the estimate lst<-leafsfirst(pcf) # plot a volume plot, with colors and labels plotvolu(lst, colo=TRUE, modelabel=TRUE, ptext=0.002) # plot a barycenter plot, for the second coordinate, # with labels, and return the locations of the modes plotbary(lst,coordi=2,modlabret=T,ptext=0.002) # prune the level set tree to contain only 10 levels td<-treedisc(lst,pcf,ngrid=10) plotvolu(td) # look at the contour and perspective plots of the estimates dp<-draw.pcf(pcf) contour(dp$x,dp$y,dp$z,xlab="coordinate 1", ylab="coordinate 2") persp(dp$x,dp$y,dp$z,phi=20,theta=50) Level set trees with data based interpolationThe function "leafsfirst" computes a level set tree from a piecewise constant function and the piecewise constant function is such that the pieces where the function is constant are rectangles. In high dimensional spaces regular equispaced partitions have a huge size and it might be difficult to find a good irregular partition of a reasonable size. Thus we have implemented a function "leafsfirst.intpol" that does not require an initial step of calculating a piecewise constant function but that requires only that the function is evaluated at any irregular collection of points. The basic examples where we can use this function are a density function estimate that is evaluated at the observations, and a regression function estimate that is evaluated at the observations of the x-variables. Function "leafsfirst.intpol" approximates level sets of the function with unions of balls, centered at the points where the function is evaluated. Thus we have to provide the radius of the balls (parameter rho). The choice of parameter rho affects the level set trees, because a too small parameter rho results to a level set tree with many spurious local maxima and a too large parameter rho results to a level set tree with only one local maxima. source("denpro.R") # simulate data "dendat" n<-200 dendat<-sim.data(n=n,type="mulmod") # kernel density estimate is evaluated at the observations f<-matrix(0,n,1) h<-0.4 # smoothing parameter for (i in 1:n){ arg<-dendat[i,] f[i]<-kernesti.dens(arg,dendat,h) } # calculate a level set tree with level sets that are # collections of balls of radius rho=0.6 rho<-0.6 lst<-leafsfirst.intpol(dendat, f, rho=rho) # plot a level set tree and a volume plot of the level set tree plottree(lst,modelabel=FALSE) plotvolu(lst,modelabel=FALSE) # improve the level set tree by calculating volumes of level # sets more accurately M<-100 # sample size of the Monte Carlo samples grid<-3 # the number of level sets whose volumes are calculated lst2<-leafsfirst.intpol.volu(lst, dendat, rho, M=M, grid=grid) plotvolu(lst2,modelabel=FALSE) # to make plotting faster, we can prune the level set ngrid<-40 lst.redu2<-treedisc.intpol(lst2, f, ngrid=ngrid) plotvolu(lst.redu2,modelabel=FALSE) Shape treesdendat<-sim.data(n=200,type="cross") # data pcf<-pcf.kern(dendat,h=1,N=c(32,32)) # kernel estimate # calculate a shape tree, when the reference point is the mode, # and the level set corresponds to the 10 per cent level st<-leafsfirst(pcf,propor=0.1) # prune the shape tree std<-treedisc(st,pcf,ngrid=60) # plot a volume plot, with colors and labels plotvolu(std, colo=TRUE, modelabel=TRUE, ptext=0.1, symbo="L") # plot a barycenter plot, for the second coordinate, # with labels, and return the locations of the modes plotbary(std, coordi=2, modlabret=T, ptext=0.1, symbo="L") # choose the reference point to be the barycenter bary<-st$bary st<-leafsfirst(pcf,propor=0.1,ref=bary) std<-treedisc(st,pcf,ngrid=60) plotvolu(std) Level set trees: Kernel estimate with the DynaDecompose algorithmdendat<-sim.data(n=200,type="mulmod") h<-1 N<-c(32,32) Q<-30 pd<-profkern(dendat,h,N,Q) plotvolu(pd) plotbary(pd,coordi=1) plotbary(pd,coordi=2,modlabret=T) modecent(pd) Tail trees# generate the data dendat<-sim.data(n=500,type="cross") # calculate a tail tree rho<-0.65 # resolution threshold tt<-leafsfirst(dendat=dendat,rho=rho) # tail tree plots plotbary(tt) plotbary(tt, coordi=2, modelabel=TRUE, modlabret=TRUE, ptext=0.2, symbo="T") # tail frequency plot plotvolu(tt,colo=TRUE) # with colors plotvolu(tt,ptext=0.2,modelabel=TRUE,symbo="T",colo=TRUE) # with mode labels Data structures of the packageIn order to help to apply the functions of the package to visualize some user defined functions we describe some basic objects of the package. Level set tree objectLevel set tree object represents a level set tree. This object is given as an input for the visualization functions "plotvolu" and "plotbary". Level set tree object is a list containing vectors "parent", "level", "volume" and matrix "center". Length of vectors is equal, let us call this length "nodenum". The matrix "center" is d*nodenum-matrix, where d is the dimension of the underlying function which is represented by the level set tree. Elements of the vectors (and columns of the matrix) correspond to the nodes of the level set tree, which in turn correspond to disconnected components of the level sets of the underlying function.
Shape tree objectShape tree object represents a shape tree. This object is visualized with the functions "plotvolu" and "plotbary". Shape tree object is similar to the level set tree object, but the list contains in addition the vector "refe" and the real value "maxdis".
Piecewise constant function objectThe piecewise constant function object represents a rectangularwise constant function. This object is given as an input to the function "leafsfirst" which computes the level set tree with the Leafsfirst algorithm. The piecewise constant function object is a list whose elements are vectors "value", "N", and "support", and matrices "down" and "high". The basic idea is that we have a fine regular partition and the function is piecewise constant on a non-regular partition, whose members are unions of the rectangles in the smallest resolution level. There are N[1]* … *N[d] small rectangles.
In package "delt" function "pcf.greedy.kernel" computes a piecewise constant function which uses a more adaptive partition than a partition which can be obtained as combining rectangles in a regular partition. In this case the piecewise constant function is a list with elements
Evaluation tree objectAn evaluation tree is a binary tree which stores the values of the function which have been evaluated on a grid (regular or irregular). A level set tree is a representation of the function, but how to calculate this representation when we know only how to evaluate the function at individual points? The package contains function "profeval" which computes a level set tree when the function is represented as an "evaluation tree". Note an alternative and computationally inefficient way to calculate a level set tree implemented by the function "profgene". An evaluation tree is a binary search tree, thus making the updating fast. It stores spatially close points so that they are close in the tree structure, making calculation of a level set tree feasible, since we may apply a bottom-up dynamic programming algorithm to find the disconnected components of the level set tree. The evaluation tree object is a list containing vectors and matrix "left", "right", "parent", "low" (matrix), "upp" (matrix), "split", "direc", "mean" "infopointer" (not implemented), "value", "down" (matrix), "high" (matrix), "support", "N", "nodefinder" (not implemented).
|