Introduction
This is the home page of the article "Estimation of Level Set Trees Using Adaptive Partitions", written by Lasse Holmström, Kyösti Karttunen, and Jussi Klemelä.
Abstract of the article
We present methods for the estimation of level sets, a level set tree, and a volume function of a multivariate density function. The methods are such that the calculation is feasible and statistically efficient in moderate dimensional cases and for large sample sizes. We apply kernel estimation together with an adaptive partition of the sample space. We illustrate how level set trees can be applied in cluster analysis and in flow cytometry.
Reference
The article is published in "http://link.springer.com/article/10.1007/s00180-016-0702-2"
@article{volume, title = {Estimation of Level Set Trees Using Adaptive Partitions}, author = {Holmstr\"om, L. and Karttunen, K. and Klemel\"a, J.}, year = {2017}, journal = {Comput. Statist.}, volume = {32}, pages = {1139-1163}, }
Supplementary material
We provide supplementary material in volume-supplementary.pdf.
Software
The software is available as R-packages "denpro" and "delt".
Package "denpro" implements the calculation of level set trees for the regular grid kernel density estimator and package "delt" implements the calculation for the adaptive grid kernel density estimator, among other things.
Package "delt" implements a discretized kernel estimator using an adaptive partition in function "pcf.greedy.kernel". Function "pcf.greedy.kernel" has an option to use C++ code. This option can be used by calling the function with type="cpp".
In order to use the R-packages the following commands have to be issued:
library(denpro) library(delt)
The use of C++ code requires the package "Rcpp", described in http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf. Mac installation is described in http://www.thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error. Windows installation requires that 1) R path does not include empty spaces, 2) Rtools corresponding to the R version is installed from http://cran.r-project.org/bin/windows/Rtools/, 3) the computer is booted.
In order to use C++ code the program densplitter2.cpp (by Sauli Herrala) is needed, and the following commands have to be issued:
library(Rcpp) library(RcppArmadillo) sourceCpp("densplitter2.cpp")
Note that there is a difference depending on whether the package "delt" is installed from CRAN or from the home page of the package.
-
If package "delt" is installed from CRAN, then function
pcf.greedy.kernel.cpp.R
is needed and the following command has to be issued:
source("pcf.greedy.kernel.cpp.R")
- If package "delt" is installed from the home page of the package, then function "pcf.greedy.kernel.cpp.R" is not needed.
Tutorial for the use of software
1. Kernel density estimation with adaptive and regular grids
# generate data seed<-1 n<-1000 d<-2 l<-2 D<-4; c<-D/sqrt(d) M<-matrix(0,l,d); M[2,]<-c sig<-matrix(1,l,d) p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed) # regular grid h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) N<-rep(20,d) pcf<-pcf.kern(dendat,h,N=N) lst<-leafsfirst(pcf) plotvolu(lst) # adaptive grid h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) minobs<-5 pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="greedy") # pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp") lst<-leafsfirst.adagrid(pcf) plotvolu(lst) plotvolu(lst,cutlev=0.004) plotbary(lst)
2. Colored scatter plots
2.1 Regular grid
# generate data seed<-1 n<-500 d<-2 l<-3; D<-4; c<-D/sqrt(2) M<-matrix(0,l,d); M[2,]<-c; M[3,]<--c sig<-matrix(1,l,d) p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed) # colored volume function h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) N<-rep(20,d) pcf<-pcf.kern(dendat,h,N) lst<-leafsfirst(pcf) plotvolu(lst,colo=TRUE) # colored scatter plot cd<-colors2data(dendat,pcf,lst) plot(dendat,col=cd$datacolo) # colored scatter plot: plot the observations so that the observations with # the highest density value are plotted last cd<-colors2data(dendat,pcf,lst) plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord]) # partial clustering with a fixed level cl<-cluster.lst(dendat,h,N=N,labels="colors",type="grid",lambda=0.02) plot(dendat,col=cl) # complete clustering with a fixed level cl<-cluster.lst(dendat,h,N=N,complete=TRUE,labels="colors",type="grid",lambda=0.02) plot(dendat,col=cl) # partial clustering with locally changing levels # plottree(lst,colo=TRUE,info=seq(1:length(lst$parent))) nodes<-findbnodes(lst,modenum=3) plotvolu(lst,colo=TRUE,nodes=nodes) # cd<-colors2data(dendat,pcf,lst,nodes=nodes) plot(dendat,col=cd$datacolo) # complete clustering with locally changing levels nodes<-findbnodes(lst,modenum=3) cl<-cluster.lst(dendat,h,N,nodes=nodes,complete=TRUE,labels="colors") plot(dendat,col=cl)
2.2 Adaptive grid
# generate data seed<-1 n<-500 d<-2 l<-3; D<-4; c<-D/sqrt(2) M<-matrix(0,l,d); M[2,]<-c; M[3,]<--c sig<-matrix(1,l,d) p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed) # colored volume function h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) minobs<-3 pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="greedy") #type="cpp") lst<-leafsfirst.adagrid(pcf) plotvolu(lst,colo=TRUE) # colored scatter plot cd<-colors2data(dendat,pcf,lst,type="ada") plot(dendat,col=cd$datacolo) # colored scatter plot: plot the observations so that the observations with # the highest density value are plotted last cd<-colors2data(dendat,pcf,lst,type="ada") plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord]) # partial clustering with a fixed level cl<-cluster.lst(dendat,h,labels="colors",type="adap",lambda=0.021) plot(dendat,col=cl) # complete clustering with a fixed level cl<-cluster.lst(dendat,h,complete=TRUE,labels="colors",type="adap",lambda=0.021) plot(dendat,col=cl) # partial clustering with locally changing levels # plottree(lst,colo=TRUE,info=seq(1:length(lst$parent))) lst2<-prunemodes(lst,modenum=3) nodes<-findbnodes(lst2,modenum=3) plotvolu(lst2,colo=TRUE,nodes=nodes) # cd<-colors2data(dendat,pcf,lst2,nodes=nodes,type="ada") plot(dendat,col=cd$datacolo) # complete clustering with locally changing levels nodes<-findbnodes(lst2,modenum=3) cl<-cluster.lst(dendat,h,nodes=nodes,complete=TRUE,labels="colors",type="adap",minobs=minobs) plot(dendat,col=cl)
Reproducing the Examples
Simplex in 7D
We draw a volume function from a density estimate when the data is generated from the distribution with the modes at corners of the 7D unit simplex. The distribution is the equal mixture of 8 normal distributions with the marginal variances 0.25.
# generate data seed<-1 n<-10000 d<-7 l<-d+1 M<-matrix(0,l,d); for (i in 2:(d+1)) M[i,i-1]<-1 sig<-matrix(1/4,l,d) p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed) # density esimation h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) minobs<-5 pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp") # level set trees lst<-leafsfirst.adagrid(pcf) lst2<-prunemodes(lst,modenum=l) cd<-colors2data(dendat,pcf,lst2,type="ada") # volume function plotvolu(lst2,cutlev=0.35,colo=TRUE) # scatter plot plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord], xlab="",ylab="")
Pentahedron in 4D
We draw a volume function from a density estimate when the data is generated from the distribution with the modes at corners of a pentahedron in 4D The distribution is the equal mixture of 5 peaked distributions. Each peaked distribution is the equal mixture of the standard normal density and the normal density with marginal standard deviations 0.1.
# generate data seed<-1 n<-10000 d<-4 l<-2*(d+1) dist<-1 M<-matrix(0,l,d) M[1,]<-dist*c(1/2, 0,0,0) M[2,]<-dist*c(-1/2,0,0,0) M[3,]<-dist*c(0,sqrt(3)/2,0,0) M[4,]<-dist*c(0,1/(2*sqrt(3)),sqrt(2/3),0) M[5,]<-dist*c(0,1/(2*sqrt(3)),1/(2*sqrt(6)),sqrt(15/24)) M[6,]<-dist*c(1/2, 0,0,0) M[7,]<-dist*c(-1/2,0,0,0) M[8,]<-dist*c(0,sqrt(3)/2,0,0) M[9,]<-dist*c(0,1/(2*sqrt(3)),sqrt(2/3),0) M[10,]<-dist*c(0,1/(2*sqrt(3)),1/(2*sqrt(6)),sqrt(15/24)) sig<-matrix(1,l,d) sig[(d+2):(2*d+2),]<-0.1 p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=M,sig=sig,p=p,seed=seed,d=d) # adaptive grid h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) minobs<-5 pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp") lst<-leafsfirst.adagrid(pcf) cd<-colors2data(dendat,pcf,lst,type="ada") # volume function plotvolu(lst,cutlev=0.2,colo=TRUE) # scatter plot plot(dendat[cd$ord,1],dendat[cd$ord,2],col=cd$datacolo[cd$ord])
MISE Computations
We need the following functions.
ise<-function(n,d,seed=1,type="ada",N=NULL,lambda=NULL,minobs=1,distri="gauss") { set.seed(seed) if (distri=="gauss") dendat<-matrix(rnorm(n*d),n,d) else{ l<-2 m<-matrix(0,l,d) sig<-matrix(1,l,d) sig[2,]<-0.1 p<-rep(1/l,l) dendat<-sim.data(type="mixt",n=n,M=m,sig=sig,p=p,seed=seed,d=d) } h<-(4/(d+2))^(1/(d+4))*n^(-1/(d+4))*apply(dendat,2,sd) if (type=="ada"){ pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="cpp") recs<-pcf$recs } else if (type=="dyadic"){ pcf<-pcf.greedy.kernel(dendat,h,minobs=minobs,type="dyadic") recs<-pcf$recs } else{ pcf<-pcf.kern(dendat,h,N) lkm<-length(pcf$value) recs<-matrix(0,lkm,2*d) step<-stepcalc(pcf$support,N) for (i in 1:lkm){ for (j in 1:d){ recs[i,2*j-1]<-pcf$down[i,j]*step[j]+pcf$support[2*j-1] recs[i,2*j]<-pcf$high[i,j]*step[j]+pcf$support[2*j-1] } } } lkm<-length(pcf$value) intfhat2<-0 intfhatf<-0 for (i in 1:lkm){ val<-pcf$value[i] rec<-recs[i,] if (!is.null(lambda)){ mid<-matrix(0,d,1) for (j in 1:d){ mid[j]<-(rec[2*j-1]+rec[2*j])/2 } arvo<-prod(dnorm(mid)) if (arvo>lambda){ volu<-massone(rec) proba<-gaussproba(rec) intfhat2<-intfhat2+val^2*volu intfhatf<-intfhatf+val*proba } } else{ volu<-massone(rec) if (distri=="gauss") proba<-gaussproba(rec) else proba<-gaussproba.mixt(rec,m,sig,p) intfhat2<-intfhat2+val^2*volu intfhatf<-intfhatf+val*proba } } ans<-intfhat2-2*intfhatf return(ans) } gaussproba<-function(rec) { d<-length(rec)/2 a<-matrix(0,d,1) b<-matrix(0,d,1) for (i in 1:d){ a[i]<-rec[2*i-1] b[i]<-rec[2*i] } vali<-matrix(0,2^d,d) for (i in 1:2^d){ vali[i,]<-digit(i,rep(2,d)) } vali[2^d,]<-0 ans<-0 for (i in 1:2^d){ chos<-vali[i,] u<-matrix(0,d,1) for (j in 1:d) if (chos[j]==0) u[j]<-a[j] else u[j]<-b[j] ans<-ans+(-1)^(sum(chos+1))*prod(pnorm(u)) } return(ans) } gaussproba.mixt<-function(rec,m,sig,p) { d<-length(rec)/2 a<-matrix(0,d,1) b<-matrix(0,d,1) for (i in 1:d){ a[i]<-rec[2*i-1] b[i]<-rec[2*i] } vali<-matrix(0,2^d,d) for (i in 1:2^d){ vali[i,]<-digit(i,rep(2,d)) } vali[2^d,]<-0 ans<-0 for (i in 1:2^d){ chos<-vali[i,] u<-matrix(0,d,1) for (j in 1:d) if (chos[j]==0) u[j]<-a[j] else u[j]<-b[j] Fu<-p[1]*prod(pnorm(u/sig[1,]))+p[2]*prod(pnorm(u/sig[2,])) ans<-ans+(-1)^(sum(chos+1))*Fu } return(ans) } digit<-function(luku,base){ # Gives representation of luku for system with base # luku is a natural number >=0 # base is d-vector of integers >=2, d>=2, # base[d] tarvitaan vain tarkistamaan onko luku rajoissa # Returns d-vector of integers. # example: digit(52,c(10,10)), returns vector (2,5) # d<-length(base) digi<-matrix(0,d,1) jako<-matrix(0,d,1) jako[d]<-base[1] for (i in (d-1):1){ jako[i]<-base[d-i+1]*jako[i+1] } vah<-0 for (i in 1:(d-1)){ digi[i]<-floor((luku-vah)/jako[i+1]) #if digi[i]>base[i], then ERROR vah<-vah+digi[i]*jako[i+1] } digi[d]<-luku-vah # annetaan vastaus kaanteisesti se 2354 annetaan c(4,5,3,2) # talloin vastaavuus sailyy base:n kanssa apu<-digi[d:1] return(apu) } stepcalc<-function(suppo,N) { d<-length(N) step<-matrix(0,d,1) for (i in 1:d){ step[i]<-(suppo[2*i]-suppo[2*i-1])/N[i] } return(step) }
Standard Normal Density
We compute the mean integrated squared error as a function of the sample size when the underlying distribution is standard normal.
Dimension 2
d<-2 neet<-seq(100,3500,500) geet<-c(10,20,30) miset11<-matrix(0,length(neet),1) miset22<-matrix(0,length(neet),length(geet)) M<-100 for (mm in 1:M){ seed<-mm for (i in 1:length(neet)){ n<-neet[i] miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada") for (j in 1:length(geet)){ g<-geet[j] N<-rep(g,d) miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N) } }} miset11<-miset11/M miset22<-miset22/M miset1<-miset11+(4*pi)^(-d/2) miset2<-miset22+(4*pi)^(-d/2) plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)), log="y",xlab="",ylab="") matplot(neet,miset2,add=TRUE,type="b",lty=1, col=c("red","blue","green"),log="y") mtext("sample size",side=1,line=2.4) mtext("MISE",side=2,line=2.4)
Dimension 3
d<-3 neet<-seq(1000,10000,1000) geet<-c(10,20) miset11<-matrix(0,length(neet),1) miset22<-matrix(0,length(neet),length(geet)) M<-100 for (mm in 1:M){ seed<-mm for (i in 1:length(neet)){ n<-neet[i] miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada") for (j in 1:length(geet)){ g<-geet[j] N<-rep(g,d) miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N) } }} miset11<-miset11/M miset22<-miset22/M miset1<-miset11+(4*pi)^(-d/2) miset2<-miset22+(4*pi)^(-d/2) plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)), log="y",xlab="",ylab="") matplot(neet,miset2,add=TRUE,type="b",lty=1, col=c("red","blue","green"),log="y") mtext("sample size",side=1,line=2.4) mtext("MISE",side=2,line=2.4)
Inhomogeneous Density
We compute the mean integrated squared error as a function of the sample size when the underlying distribution has an inhomogeneous density.
Dimension 2
distri<-"notgauss" d<-2 neet<-seq(100,3500,500) geet<-c(10,20,30) miset11<-matrix(0,length(neet),1) miset22<-matrix(0,length(neet),length(geet)) M<-100 for (mm in 1:M){ seed<-mm for (i in 1:length(neet)){ n<-neet[i] miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada",distri=distri) for (j in 1:length(geet)){ g<-geet[j] N<-rep(g,d) miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N,distri=distri) } }} miset11<-miset11/M miset22<-miset22/M mo<-10000 sample<-matrix(rnorm(mo*d),mo,d) monte<-mean(0.1^(-d)*dnorm(sample[,1]/0.1)*dnorm(sample[,2]/0.1)) int2int2d<-0.5^2*((4*pi)^(-d/2)+(4*pi)^(-d/2)/0.1^d+2*monte) miset1<-miset11+int2int2d miset2<-miset22+int2int2d plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)), log="y",xlab="",ylab="") matplot(neet,miset2,add=TRUE,type="b",lty=1, col=c("red","blue","green"),log="y") mtext("sample size",side=1,line=2.4) mtext("MISE",side=2,line=2.4)
Dimension 3
distri<-"notgauss" d<-3 neet<-seq(1000,10000,1000) geet<-c(10,20) miset11<-matrix(0,length(neet),1) miset22<-matrix(0,length(neet),length(geet)) M<-100 for (mm in 1:M){ seed<-mm for (i in 1:length(neet)){ n<-neet[i] miset11[i]<-miset11[i]+ise(n,d,seed=seed,type="ada",distri=distri) for (j in 1:length(geet)){ g<-geet[j] N<-rep(g,d) miset22[i,j]<-miset22[i,j]+ise(n,d,seed=seed,type="notada",N=N,distri=distri) } }} miset11<-miset11/M miset22<-miset22/M mo<-10000 sample<-matrix(rnorm(mo*d),mo,d) monte<-mean(0.1^(-d)*dnorm(sample[,1]/0.1)*dnorm(sample[,2]/0.1)*dnorm(sample[,3]/0.1)) int2int3d<-0.5^2*((4*pi)^(-d/2)+(4*pi)^(-d/2)/0.1^d+2*monte) miset1<-miset11+int2int3d miset2<-miset22+int2int3d plot(neet,miset1,type="b",ylim=c(min(miset1,miset2),max(miset1,miset2)), log="y",xlab="",ylab="") matplot(neet,miset2,add=TRUE,type="b",lty=1, col=c("red","blue","green"),log="y") mtext("sample size",side=1,line=2.4) mtext("MISE",side=2,line=2.4)
Analysis of Cytometry Data
The code below is provided by Kyösti Karttunen.
Reading Data
The data may be read by the command
file="http://jklm.fi/art/volume/FCMdata.csv" x<-read.csv(file=file) dendat.pre1 <- as.matrix(x)[,2:7]
The data is obtained from R-package "flowMeans", and it can be obtained also by the following commands.
library("flowMeans") data(x, package = "flowMeans") dendat.pre1 <- as.matrix(x)
Reproducing the Figures
varnames <- colnames(dendat.pre1) dendat.pre2 <- apply( dendat.pre1, MARGIN = 2, FUN = function(x) (x - min(x))/diff(range(x)) ) edges.tf <- ((dendat.pre2 < 0.001) | (dendat.pre2 > 0.999)) edges.rows <- unique( which(edges.tf, arr.ind=T)[ , 1] ) dendat.pre3 <- dendat.pre2[-edges.rows, ] dendat <- matrix(as.double(dendat.pre3), ncol=dim(dendat.pre3)[2]) n <- dim(dendat)[1] # nbr of observations d <- dim(dendat)[2] # dimension h <- (4/(d+2))^(1/(d+4)) * n^(-1/(d+4)) * apply(dendat,2,sd) minobs = 12 pcf.gk <- pcf.greedy.kernel(dendat, h, minobs=minobs, type="cpp") lst.gk.start <- leafsfirst.adagrid(pcf.gk) lst.gk <- prunemodes(lst.gk.start, modenum = 6) palettiC <- c( "red", "blue", "green", "orange", "navy", "darkgreen", "orchid", "aquamarine", "turquoise", "pink", "violet", "magenta", "chocolate", "cyan", rep( colors()[50:657], 400) ) colors.gk <- colors2data(dendat, pcf.gk, lst.gk, type="ada") plotvolu(lst.gk, colo=T,xaxt="s", yaxt="s",xlim = c(0.353, 0.365)) nplotsample <- n plotorder <- colors.gk$ord # all datacolors <- colors.gk$datacolo nplotsample <- 2000 panel.upp <- function(x, y, ...) { n <- length(x) ipoints <- sample(n, nplotsample) #points(x[ipoints], y[ipoints], pch="+", cex=1) points(x, y, pch="+", cex=1) } #----- # --- refine parameters for pairs pairs.plotthiscase <- function(pairsvariables) { par(mar=c(2,1,1,1)+0.1) pairs( dendat[(plotorder), pairsvariables], labels = varnames[pairsvariables], xlim = c(0,1), ylim = c(0,1), cex.axis = 1.5,cex=1, col = datacolors[(plotorder)], pch="+", gap = 0, upper.panel = panel.upp ) } #----- pairsvars.article <- c(1,3,4,5) pairs.plotthiscase(pairsvars.article)
Function "pairs" may not work in Ubuntu, because fonts are missing. Ths following command and a reboot may fix the problem.
sudo apt-get install t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi