Estimation of Level Set Trees Using Adaptive Partitions

Introduction

This is the home page of the article "Estimation of Level Set Trees Using Adaptive Partitions", written by Lasse Holmström, Kyösti Karttunen, and Jussi Klemelä.

Abstract of the article

We present methods for the estimation of level sets, a level set tree, and a volume function of a multivariate density function. The methods are such that the calculation is feasible and statistically efficient in moderate dimensional cases and for large sample sizes. We apply kernel estimation together with an adaptive partition of the sample space. We illustrate how level set trees can be applied in cluster analysis and in flow cytometry.

Reference

The article is published in "http://link.springer.com/article/10.1007/s00180-016-0702-2"

@article{volume,
title = {Estimation of Level Set Trees Using Adaptive Partitions},
author = {Holmstr\"om, L. and Karttunen, K. and Klemel\"a, J.},
year = {2017},
journal = {Comput. Statist.},
volume = {32},
pages = {1139-1163},
}

Supplementary material

We provide supplementary material in volume-supplementary.pdf.

Software

The software is available as R-packages "denpro" and "delt".

Package "denpro" implements the calculation of level set trees for the regular grid kernel density estimator and package "delt" implements the calculation for the adaptive grid kernel density estimator, among other things.

Package "delt" implements a discretized kernel estimator using an adaptive partition in function "pcf.greedy.kernel". Function "pcf.greedy.kernel" has an option to use C++ code. This option can be used by calling the function with type="cpp".

In order to use the R-packages the following commands have to be issued:

library(denpro)
library(delt)

The use of C++ code requires the package "Rcpp", described in http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf. Mac installation is described in http://www.thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error. Windows installation requires that 1) R path does not include empty spaces, 2) Rtools corresponding to the R version is installed from http://cran.r-project.org/bin/windows/Rtools/, 3) the computer is booted.

In order to use C++ code the program densplitter2.cpp (by Sauli Herrala) is needed, and the following commands have to be issued:

library(Rcpp)
library(RcppArmadillo)
sourceCpp("densplitter2.cpp")

Note that there is a difference depending on whether the package "delt" is installed from CRAN or from the home page of the package.

The difference is due to the fact that function "pcf.greedy.kernel.R" of the home page version of the packgage refers to function "densplitter2", which contains the C++ code, but "densplitter2" is not part of the package, and according to the CRAN policy a function cannot be referred which is not a part of the package, or part of a required package. Thus, function "pcf.greedy.kernel.R" is different in the CRAN-version of the package and in the home page-version of the package.

Tutorial for the use of software

1. Kernel density estimation with adaptive and regular grids

# generate data
seed<-1
n<-1000
d<-2
l<-2
D<-4; c<-D/sqrt(d)
M<-matrix(0,l,d); M[2,]<-c
sig<-matrix(1,l,d)
p<-rep(1/l,l)
dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed)

# regular grid
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
N<-rep(20,d)
pcf<-pcf.kern(dendat,h,N=N)
lst<-leafsfirst(pcf)
plotvolu(lst)

# adaptive grid
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
minobs<-5
pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="greedy") 
# pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp")
lst<-leafsfirst.adagrid(pcf)
plotvolu(lst)

plotvolu(lst,cutlev=0.004)
plotbary(lst)

2. Colored scatter plots

2.1 Regular grid

# generate data
seed<-1
n<-500
d<-2
l<-3; D<-4; c<-D/sqrt(2)
M<-matrix(0,l,d); M[2,]<-c; M[3,]<--c
sig<-matrix(1,l,d)
p<-rep(1/l,l)
dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed)

# colored volume function 
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
N<-rep(20,d)
pcf<-pcf.kern(dendat,h,N)
lst<-leafsfirst(pcf)
plotvolu(lst,colo=TRUE)

# colored scatter plot
cd<-colors2data(dendat,pcf,lst)
plot(dendat,col=cd$datacolo)

# colored scatter plot: plot the observations so that the observations with
# the highest density value are plotted last
cd<-colors2data(dendat,pcf,lst)
plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord])

# partial clustering with a fixed level 
cl<-cluster.lst(dendat,h,N=N,labels="colors",type="grid",lambda=0.02)
plot(dendat,col=cl)

# complete clustering with a fixed level
cl<-cluster.lst(dendat,h,N=N,complete=TRUE,labels="colors",type="grid",lambda=0.02)
plot(dendat,col=cl)

# partial clustering with locally changing levels
# plottree(lst,colo=TRUE,info=seq(1:length(lst$parent)))
nodes<-findbnodes(lst,modenum=3)
plotvolu(lst,colo=TRUE,nodes=nodes)
#
cd<-colors2data(dendat,pcf,lst,nodes=nodes)
plot(dendat,col=cd$datacolo)

# complete clustering with locally changing levels
nodes<-findbnodes(lst,modenum=3)
cl<-cluster.lst(dendat,h,N,nodes=nodes,complete=TRUE,labels="colors")
plot(dendat,col=cl)

2.2 Adaptive grid

# generate data
seed<-1
n<-500
d<-2
l<-3; D<-4; c<-D/sqrt(2)
M<-matrix(0,l,d); M[2,]<-c; M[3,]<--c
sig<-matrix(1,l,d)
p<-rep(1/l,l)
dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed)

# colored volume function 
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
minobs<-3
pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="greedy")  #type="cpp")
lst<-leafsfirst.adagrid(pcf)
plotvolu(lst,colo=TRUE)

# colored scatter plot
cd<-colors2data(dendat,pcf,lst,type="ada")
plot(dendat,col=cd$datacolo)

# colored scatter plot: plot the observations so that the observations with
# the highest density value are plotted last
cd<-colors2data(dendat,pcf,lst,type="ada")
plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord])

# partial clustering with a fixed level 
cl<-cluster.lst(dendat,h,labels="colors",type="adap",lambda=0.021)
plot(dendat,col=cl)

# complete clustering with a fixed level
cl<-cluster.lst(dendat,h,complete=TRUE,labels="colors",type="adap",lambda=0.021)
plot(dendat,col=cl)

# partial clustering with locally changing levels
# plottree(lst,colo=TRUE,info=seq(1:length(lst$parent)))
lst2<-prunemodes(lst,modenum=3)
nodes<-findbnodes(lst2,modenum=3) 
plotvolu(lst2,colo=TRUE,nodes=nodes)
#
cd<-colors2data(dendat,pcf,lst2,nodes=nodes,type="ada")
plot(dendat,col=cd$datacolo)

# complete clustering with locally changing levels
nodes<-findbnodes(lst2,modenum=3)
cl<-cluster.lst(dendat,h,nodes=nodes,complete=TRUE,labels="colors",type="adap",minobs=minobs)
plot(dendat,col=cl)

Reproducing the Examples

Simplex in 7D

We draw a volume function from a density estimate when the data is generated from the distribution with the modes at corners of the 7D unit simplex. The distribution is the equal mixture of 8 normal distributions with the marginal variances 0.25.

# generate data
seed<-1
n<-10000
d<-7
l<-d+1
M<-matrix(0,l,d); for (i in 2:(d+1)) M[i,i-1]<-1
sig<-matrix(1/4,l,d)
p<-rep(1/l,l)
dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed)

# density esimation
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
minobs<-5
pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp")

# level set trees
lst<-leafsfirst.adagrid(pcf)
lst2<-prunemodes(lst,modenum=l)
cd<-colors2data(dendat,pcf,lst2,type="ada")

# volume function
plotvolu(lst2,cutlev=0.35,colo=TRUE)

# scatter plot
plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord],
xlab="",ylab="")

Pentahedron in 4D

We draw a volume function from a density estimate when the data is generated from the distribution with the modes at corners of a pentahedron in 4D The distribution is the equal mixture of 5 peaked distributions. Each peaked distribution is the equal mixture of the standard normal density and the normal density with marginal standard deviations 0.1.

# generate data
seed<-1
n<-10000
d<-4
l<-2*(d+1)
dist<-1
M<-matrix(0,l,d)
M[1,]<-dist*c(1/2, 0,0,0)
M[2,]<-dist*c(-1/2,0,0,0)
M[3,]<-dist*c(0,sqrt(3)/2,0,0)
M[4,]<-dist*c(0,1/(2*sqrt(3)),sqrt(2/3),0)
M[5,]<-dist*c(0,1/(2*sqrt(3)),1/(2*sqrt(6)),sqrt(15/24))
M[6,]<-dist*c(1/2, 0,0,0)
M[7,]<-dist*c(-1/2,0,0,0)
M[8,]<-dist*c(0,sqrt(3)/2,0,0)
M[9,]<-dist*c(0,1/(2*sqrt(3)),sqrt(2/3),0)
M[10,]<-dist*c(0,1/(2*sqrt(3)),1/(2*sqrt(6)),sqrt(15/24))
sig<-matrix(1,l,d)
sig[(d+2):(2*d+2),]<-0.1
p<-rep(1/l,l)
dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed,d=d)

# adaptive grid
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
minobs<-5
pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp")
lst<-leafsfirst.adagrid(pcf)
cd<-colors2data(dendat,pcf,lst,type="ada")

# volume function
plotvolu(lst,cutlev=0.2,colo=TRUE)

# scatter plot
plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord])

MISE Computations

We need the following functions.

ise<-function(n,d,seed=1,type="ada",N=NULL,lambda=NULL,minobs=1,distri="gauss")
{
set.seed(seed)
if (distri=="gauss") dendat<-matrix(rnorm(n*d),n,d)
else{
 l<-2
 m<-matrix(0,l,d)
 sig<-matrix(1,l,d)
 sig[2,]<-0.1
 p<-rep(1/l,l)
 dendat<-sim.data(type="mixt",n=n,M=m,sig=sig,p=p,seed=seed,d=d)
}
h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd)
if (type=="ada"){
   pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp")
   recs<-pcf$recs
}
else if (type=="dyadic"){
   pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="dyadic")
   recs<-pcf$recs
}
else{
   pcf<-pcf.kern(dendat,h,N)
   lkm<-length(pcf$value)
   recs<-matrix(0,lkm,2*d)
   step<-stepcalc(pcf$support,N)
   for (i in 1:lkm){
     for (j in 1:d){
        recs[i,2*j-1]<-pcf$down[i,j]*step[j]+pcf$support[2*j-1]
        recs[i,2*j]<-pcf$high[i,j]*step[j]+pcf$support[2*j-1]
     }
   }
}
lkm<-length(pcf$value)
intfhat2<-0
intfhatf<-0
for (i in 1:lkm){
    val<-pcf$value[i]
    rec<-recs[i,]
    if (!is.null(lambda)){
        mid<-matrix(0,d,1)
        for (j in 1:d){
            mid[j]<-(rec[2*j-1]+rec[2*j])/2
        }
        arvo<-prod(dnorm(mid))
        if (arvo>lambda){
           volu<-massone(rec)
           proba<-gaussproba(rec)
           intfhat2<-intfhat2+val^2*volu
           intfhatf<-intfhatf+val*proba
        }
    }
    else{
        volu<-massone(rec)
        if (distri=="gauss") proba<-gaussproba(rec) else proba<-gaussproba.mixt(rec,m,sig,p)
        intfhat2<-intfhat2+val^2*volu
        intfhatf<-intfhatf+val*proba
    }
}
ans<-intfhat2-2*intfhatf
return(ans)
}

gaussproba<-function(rec)
{
d<-length(rec)/2
a<-matrix(0,d,1)
b<-matrix(0,d,1)
for (i in 1:d){
    a[i]<-rec[2*i-1]
    b[i]<-rec[2*i]
}
vali<-matrix(0,2^d,d)
for (i in 1:2^d){
    vali[i,]<-digit(i,rep(2,d))
}
vali[2^d,]<-0
ans<-0
for (i in 1:2^d){
    chos<-vali[i,]
    u<-matrix(0,d,1)
    for (j in 1:d) if (chos[j]==0) u[j]<-a[j] else u[j]<-b[j]
    ans<-ans+(-1)^(sum(chos+1))*prod(pnorm(u))
}
return(ans)
}

gaussproba.mixt<-function(rec,m,sig,p)
{
d<-length(rec)/2
a<-matrix(0,d,1)
b<-matrix(0,d,1)
for (i in 1:d){
    a[i]<-rec[2*i-1]
    b[i]<-rec[2*i]
}
vali<-matrix(0,2^d,d)
for (i in 1:2^d){
    vali[i,]<-digit(i,rep(2,d))
}
vali[2^d,]<-0
ans<-0
for (i in 1:2^d){
    chos<-vali[i,]
    u<-matrix(0,d,1)
    for (j in 1:d) if (chos[j]==0) u[j]<-a[j] else u[j]<-b[j]
    Fu<-p[1]*prod(pnorm(u/sig[1,]))+p[2]*prod(pnorm(u/sig[2,]))
    ans<-ans+(-1)^(sum(chos+1))*Fu
}
return(ans)
}

digit<-function(luku,base){
# Gives representation of luku for system with base
# luku is a natural number >=0
# base is d-vector of integers >=2, d>=2, 
# base[d] tarvitaan vain tarkistamaan onko luku rajoissa
# Returns d-vector of integers.
# example: digit(52,c(10,10)), returns vector (2,5)
#
d<-length(base)
digi<-matrix(0,d,1)
jako<-matrix(0,d,1)
jako[d]<-base[1]
for (i in (d-1):1){
  jako[i]<-base[d-i+1]*jako[i+1]
}
vah<-0
for (i in 1:(d-1)){
  digi[i]<-floor((luku-vah)/jako[i+1]) #if digi[i]>base[i], then ERROR
  vah<-vah+digi[i]*jako[i+1]
}
digi[d]<-luku-vah
# annetaan vastaus kaanteisesti se 2354 annetaan c(4,5,3,2)
# talloin vastaavuus sailyy base:n kanssa 
apu<-digi[d:1]
return(apu)
}

stepcalc<-function(suppo,N)
{
d<-length(N)
step<-matrix(0,d,1)
for (i in 1:d){
    step[i]<-(suppo[2*i]-suppo[2*i-1])/N[i]
}
return(step)
}

Standard Normal Density

We compute the mean integrated squared error as a function of the sample size when the underlying distribution is standard normal.

Dimension 2

d<-2
neet<-seq(100,3500,500)
geet<-c(10,20,30)
miset11<-matrix(0,length(neet),1)
miset22<-matrix(0,length(neet),length(geet))
M<-100
for (mm in 1:M){
seed<-mm
for (i in 1:length(neet)){
    n<-neet[i]
    miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada")
    for (j in 1:length(geet)){
        g<-geet[j]
        N<-rep(g,d)
        miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N)
    }
}}
miset11<-miset11/M
miset22<-miset22/M

miset1<-miset11+(4*pi)^(-d/2)
miset2<-miset22+(4*pi)^(-d/2)
plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)),
log="y",xlab="",ylab="")
matplot(neet,miset2,add=TRUE,type="b",lty=1,
col=c("red","blue","green"),log="y")
mtext("sample size",side=1,line=2.4)
mtext("MISE",side=2,line=2.4)

Dimension 3

d<-3
neet<-seq(1000,10000,1000)
geet<-c(10,20)
miset11<-matrix(0,length(neet),1)
miset22<-matrix(0,length(neet),length(geet))
M<-100
for (mm in 1:M){
seed<-mm
for (i in 1:length(neet)){
    n<-neet[i]
    miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada")
    for (j in 1:length(geet)){
        g<-geet[j]
        N<-rep(g,d)
        miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N)
    }
}}
miset11<-miset11/M
miset22<-miset22/M

miset1<-miset11+(4*pi)^(-d/2)
miset2<-miset22+(4*pi)^(-d/2)
plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)),
log="y",xlab="",ylab="")
matplot(neet,miset2,add=TRUE,type="b",lty=1,
col=c("red","blue","green"),log="y")
mtext("sample size",side=1,line=2.4)
mtext("MISE",side=2,line=2.4)

Inhomogeneous Density

We compute the mean integrated squared error as a function of the sample size when the underlying distribution has an inhomogeneous density.

Dimension 2

distri<-"notgauss"
d<-2
neet<-seq(100,3500,500)
geet<-c(10,20,30)
miset11<-matrix(0,length(neet),1)
miset22<-matrix(0,length(neet),length(geet))
M<-100
for (mm in 1:M){
seed<-mm
for (i in 1:length(neet)){
    n<-neet[i]
    miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada",distri=distri)
    for (j in 1:length(geet)){
        g<-geet[j]
        N<-rep(g,d)
        miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N,distri=distri)
    }
}}
miset11<-miset11/M
miset22<-miset22/M

mo<-10000
sample<-matrix(rnorm(mo*d),mo,d)
monte<-mean(0.1^(-d)*dnorm(sample[,1]/0.1)*dnorm(sample[,2]/0.1))
int2int2d<-0.5^2*((4*pi)^(-d/2)+(4*pi)^(-d/2)/0.1^d+2*monte)

miset1<-miset11+int2int2d
miset2<-miset22+int2int2d
plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)),
log="y",xlab="",ylab="")
matplot(neet,miset2,add=TRUE,type="b",lty=1,
col=c("red","blue","green"),log="y")
mtext("sample size",side=1,line=2.4)
mtext("MISE",side=2,line=2.4)

Dimension 3

distri<-"notgauss"
d<-3
neet<-seq(1000,10000,1000)
geet<-c(10,20)
miset11<-matrix(0,length(neet),1)
miset22<-matrix(0,length(neet),length(geet))
M<-100
for (mm in 1:M){
seed<-mm
for (i in 1:length(neet)){
    n<-neet[i]
    miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada",distri=distri)
    for (j in 1:length(geet)){
        g<-geet[j]
        N<-rep(g,d)
        miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N,distri=distri)
    }
}}
miset11<-miset11/M
miset22<-miset22/M

mo<-10000
sample<-matrix(rnorm(mo*d),mo,d)
monte<-mean(0.1^(-d)*dnorm(sample[,1]/0.1)*dnorm(sample[,2]/0.1)*dnorm(sample[,3]/0.1))
int2int3d<-0.5^2*((4*pi)^(-d/2)+(4*pi)^(-d/2)/0.1^d+2*monte)

miset1<-miset11+int2int3d
miset2<-miset22+int2int3d
plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)),
log="y",xlab="",ylab="")
matplot(neet,miset2,add=TRUE,type="b",lty=1,
col=c("red","blue","green"),log="y")
mtext("sample size",side=1,line=2.4)
mtext("MISE",side=2,line=2.4)

Analysis of Cytometry Data

The code below is provided by Kyösti Karttunen.

Reading Data

The data may be read by the command

file="http://jklm.fi/art/volume/FCMdata.csv"
x<-read.csv(file=file)
dendat.pre1 <- as.matrix(x)[,2:7]

The data is obtained from R-package "flowMeans", and it can be obtained also by the following commands.

library("flowMeans")
data(x, package = "flowMeans") 
dendat.pre1 <- as.matrix(x)

Reproducing the Figures

varnames <- colnames(dendat.pre1)
dendat.pre2 <- apply(
  dendat.pre1, MARGIN = 2,
  FUN = function(x) (x - min(x))/diff(range(x))
)
edges.tf <- ((dendat.pre2 < 0.001) | (dendat.pre2 > 0.999))
edges.rows <- unique( which(edges.tf, arr.ind=T)[ , 1] )
dendat.pre3 <- dendat.pre2[-edges.rows, ]
dendat <- matrix(as.double(dendat.pre3), ncol=dim(dendat.pre3)[2])

n <- dim(dendat)[1]   # nbr of observations
d <- dim(dendat)[2]   # dimension
h <- (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) * apply(dendat,2,sd)
minobs = 12
pcf.gk <- pcf.greedy.kernel(dendat, h, minobs=minobs, type="cpp")
lst.gk.start <- leafsfirst.adagrid(pcf.gk)
lst.gk <- prunemodes(lst.gk.start, modenum = 6)
palettiC <- c(
  "red", "blue", "green", "orange", "navy", "darkgreen",
  "orchid", "aquamarine", "turquoise",
  "pink", "violet", "magenta", "chocolate", "cyan",
  rep( colors()[50:657], 400)
)
colors.gk <- colors2data(dendat, pcf.gk, lst.gk, type="ada")

plotvolu(lst.gk, colo=T,xaxt="s", yaxt="s",xlim = c(0.353, 0.365))

nplotsample <- n
plotorder  <- colors.gk$ord  # all
datacolors <- colors.gk$datacolo

nplotsample <- 2000
panel.upp <- function(x, y, ...) {
  n  <- length(x)
  ipoints <- sample(n, nplotsample)
  #points(x[ipoints], y[ipoints], pch="+", cex=1)  
  points(x, y, pch="+", cex=1)  
} #-----

# --- refine parameters for pairs
pairs.plotthiscase <- function(pairsvariables) {
  par(mar=c(2,1,1,1)+0.1)
  pairs(
    dendat[(plotorder), pairsvariables],
    labels = varnames[pairsvariables],
    xlim = c(0,1), ylim = c(0,1), cex.axis = 1.5,cex=1,
    col = datacolors[(plotorder)], pch="+", 
    gap = 0,
    upper.panel = panel.upp
  )
} #-----

pairsvars.article <- c(1,3,4,5)  
pairs.plotthiscase(pairsvars.article)

Function "pairs" may not work in Ubuntu, because fonts are missing. Ths following command and a reboot may fix the problem.

sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi

September 2016