Title: | Estimation of Number of Clusters and Identification of Atypical Units |
---|---|
Description: | It is a package that helps to estimate the number of real clusters in data as well as to identify atypical units. The underlying methods are based on distances rather than on unit x variables. |
Authors: | Itziar Irigoien [aut, cre], Concepcion Arenas [aut] |
Maintainer: | Itziar Irigoien <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.4.2 |
Built: | 2024-11-15 03:35:08 UTC |
Source: | https://github.com/cran/ICGE |
The original authors compared pairs of snap-frozen and RNAlater preservative-suspended tissue from lymph node-negative breast tumors (B) and Dukes' B colon tumors (C). The actual data set, by de Souto et al. (2008), was built for the purpose of separating B from C.
data(chowdary)
Data frame with 183 rows and 104 columns.
Original source from ‘National Center for Biotechnology Information’ from the United States of America, query GSE3726.
de Souto MCP, Costa IG, de Araujo DSA, Ludermir TB, and Schliep A (2008). Clustering Cancer Gene Expression Data: a Comparative Study. BMC Bioinformatics, 8, 497–511.
Chowdary D, Lathrop J, Skelton J, Curtin K, Briggs T, Zhang Y, Yu J, Wang X, and Mazumder A (2006). Prognostic gene expression signatures can be measured in tissues collected in RNAlater preservative. Journal of Molecular Diagnostics, 8, 31–39.
data(chowdary)
tumor <- as.factor(as.matrix(chowdary[1,]))
x <- as.matrix(chowdary[-1,])
mode(x) <- "numeric"
s <- sample(row.names(x),1)
boxplot( x[s,] ~ tumor , ylab=s)
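dbhatta computes and returns the Bhattacharyya distance matrix between the rows of a data matrix. This distance is defined between two units $i = (p_{i1}, \dots, p_{im})$ and $j = (p_{j1}, \dots, p_{jm})$, the $p_{kl}$ being frequencies with $p_{kl} \geq 0$ and $p_{k1} + p_{k2} + \dots + p_{km} = 1$.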
dbhatta(x)
x |
a matrix containing, in its rows, the frequencies for each unit. Note: check that each row adds up to 1 |
A dist object with distance information.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV-EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Bhattacharyya, A. (1946). On a measure of divergence of two multinomial populations. Sankhya: The Indian Journal of Statistics, Series A. 14, 177-136.
dist, dmahal, dgower, dcor, dproc2
# 5 individuals represented by their relative frequencies of 4 characteristics (M1-M4):
f <- matrix(c(0.36, 0.21, 0.23, 0.20,
              0.66, 0.18, 0.11, 0.05,
              0.01, 0.24, 0.62, 0.13,
              0.43, 0.38, 0.08, 0.11,
              0.16, 0.07, 0.09, 0.68),
            byrow=TRUE, nrow=5, dimnames=list(1:5, paste("M", 1:4, sep="")))
# Bhattacharyya distances between pairs
d <- dbhatta(f)
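For readers who want to see the computation spelled out, the following is a minimal base-R sketch, assuming the arccos form of the Bhattacharyya distance, d(i, j) = arccos(sum_l sqrt(p_il * p_jl)). The helper name `bhatta_sketch` is hypothetical and is not part of ICGE; whether `dbhatta` uses exactly this form is an assumption.

```r
# Hypothetical helper (not from ICGE): pairwise Bhattacharyya angle between rows.
bhatta_sketch <- function(p) {
  p <- as.matrix(p)
  # Rows must be frequency vectors: non-negative and summing to 1
  stopifnot(all(p >= 0), all(abs(rowSums(p) - 1) < 1e-8))
  # Bhattacharyya coefficient for every pair of rows: sum_l sqrt(p_il * p_jl)
  bc <- sqrt(p) %*% t(sqrt(p))
  # Clamp to [0, 1] to guard against rounding before taking arccos
  bc <- pmin(pmax(bc, 0), 1)
  as.dist(acos(bc))
}

f <- matrix(c(0.36, 0.21, 0.23, 0.20,
              0.66, 0.18, 0.11, 0.05,
              0.01, 0.24, 0.62, 0.13),
            byrow = TRUE, nrow = 3)
d_sketch <- bhatta_sketch(f)
```

The result is a dist object, so it can be passed on to any function that accepts distance information.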
dcor computes and returns the correlation distance matrix between the rows of a data matrix. This distance is defined by $d = \sqrt{1 - r}$, where $r$ is the correlation coefficient between two rows.
dcor(x)
x |
a numeric matrix. |
A dist object with distance information.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV-EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Gower, J.C. (1985). Measures of similarity, dissimilarity and distance. In: Encyclopedia of Statistical Sciences, volume 5, 397–405. J. Wiley and Sons.
dist, dmahal, dgower, dbhatta, dproc2
# Generate 10 objects in dimension 8
n <- 10
mu <- sample(1:10, 8, replace=TRUE)
x <- matrix(rnorm(n*8, mean=mu, sd=1), nrow=n, byrow=TRUE)
# Correlation distances between pairs
d <- dcor(x)
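As an illustration, a correlation distance between rows can be computed in base R. This is a sketch assuming the common definition d = sqrt(1 - r), with r the Pearson correlation between two rows; the object names are hypothetical and the code is not taken from ICGE.

```r
set.seed(1)
x <- matrix(rnorm(10 * 8), nrow = 10)

# cor() works on columns, so transpose to correlate the rows of x
r <- cor(t(x))
# Clamp to [-1, 1] to guard against rounding before taking the square root
r <- pmin(pmax(r, -1), 1)
d_sketch <- as.dist(sqrt(1 - r))
```

Under this definition, perfectly correlated rows are at distance 0 and perfectly anti-correlated rows at distance sqrt(2).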
Assume that n units are divided into k groups C1,...,Ck. Function deltas computes and returns the distance between each pair of groups. It uses the distances between pairs of units.
deltas(d, pert = "onegroup")
d |
a distance matrix or a |
pert |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
A matrix containing the distances between each pair of groups.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
Cuadras, C.M., Fortiana, J. and Oliva, F. (1997). The proximity of an individual to a population with applications in discriminant analysis. Journal of Classification, 14, 117–136.
data(iris)
d <- dist(iris[,1:4])
deltas(d,iris[,5])
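The idea of a between-group distance built only from unit-to-unit distances can be sketched in base R. The code below assumes the squared between-group distance of Cuadras, Fortiana and Oliva (1997): the mean squared distance between units of the two groups, minus the geometric variability of each group. The helper names `geo_var` and `delta2_sketch` are hypothetical, and whether `deltas` returns exactly this (squared) quantity is an assumption.

```r
# Hypothetical helper: geometric variability from a matrix of squared distances
geo_var <- function(d2) sum(d2) / (2 * nrow(d2)^2)

# Hypothetical helper: squared distance between every pair of groups
delta2_sketch <- function(d, groups) {
  d2 <- as.matrix(d)^2
  groups <- as.factor(groups)
  lev <- levels(groups)
  k <- length(lev)
  out <- matrix(0, k, k, dimnames = list(lev, lev))
  for (a in seq_len(k)) for (b in seq_len(k)) {
    if (a == b) next
    ia <- groups == lev[a]
    ib <- groups == lev[b]
    # Mean squared distance across groups, corrected by each group's variability
    out[a, b] <- mean(d2[ia, ib]) -
      geo_var(d2[ia, ia, drop = FALSE]) - geo_var(d2[ib, ib, drop = FALSE])
  }
  out
}

d <- dist(iris[, 1:4])
delta2 <- delta2_sketch(d, iris[, 5])
```

With Euclidean distances, this quantity coincides with the squared distance between the group centroids, which gives a convenient check on the result.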
Data from a dermatology study provided by H.A. Guvenir (Dpt. Computer Engineering and Information Science, Bilkent University, Turkey). The data set contains 366 instances presenting 34 different clinical attributes (12 clinical features, such as age or family history, and 22 histopathological features obtained from a biopsy), and a class variable indicating the disease. There are 8 missing values. This data set has been used extensively for classification tasks.
data(dermatology)
Matrix with 366 rows.
Attribute information obtained from the UCI KDD data repository:
Clinical Attributes: (they take values 0, 1, 2, 3, unless otherwise indicated)
1: erythema; 2: scaling; 3: definite borders; 4: itching; 5: koebner phenomenon; 6: polygonal papules; 7: follicular papules; 8: oral mucosal involvement; 9: knee and elbow involvement; 10: scalp involvement; 11: family history, (0 or 1); 34: Age.
Histopathological Attributes: (they take values 0, 1, 2, 3)
12: melanin incontinence; 13: eosinophils in the infiltrate; 14: PNL infiltrate; 15: fibrosis of the papillary dermis; 16: exocytosis; 17: acanthosis; 18: hyperkeratosis; 19: parakeratosis; 20: clubbing of the rete ridges; 21: elongation of the rete ridges; 22: thinning of the suprapapillary epidermis; 23: spongiform pustule; 24: Munro microabscess; 25: focal hypergranulosis; 26: disappearance of the granular layer; 27: vacuolisation and damage of basal layer; 28: spongiosis; 29: saw-tooth appearance of retes; 30: follicular horn plug; 31: perifollicular parakeratosis; 32: inflammatory mononuclear infiltrate; 33: band-like infiltrate.
The considered diseases are: 1 - psoriasis, 2 - seborrheic dermatitis, 3 - lichen planus, 4 - pityriasis rosea, 5 - chronic dermatitis, 6 - pityriasis rubra pilaris.
The UCI KDD Archive.
Guvenir H, Demiroz G, Ilter N (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13, 147–165.
Irigoien I, Arenas C (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27, 2948–2973.
data(dermatology)
x <- dermatology[, 1:34]
group <- as.factor(dermatology[,35])
plot(group)
dgower computes and returns the Gower distance matrix for mixed variables.
dgower(x, type = list())
x |
data matrix. |
type |
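a list with components cuant, bin and nom, giving the column positions of the quantitative, binary and nominal variables, respectively. |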
The distance between a pair of objects i and j is derived from $s_{ij}$, Gower's similarity coefficient for mixed data. This function allows missing values (coded as NA) and therefore calculates distances based on Gower's weighted similarity coefficient.
A dist object with distance information.
The function daisy() in the cluster package can also compute the Gower distance for mixed variables. The difference is that daisy() calculates the distance as $d_{ij} = 1 - s_{ij}$, whereas dgower() applies a square-root transformation based on $1 - s_{ij}$.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857–871.
dist, dmahal, dbhatta, dcor, dproc2
# Generate 10 objects in dimension 6
# Quantitative variables
mu <- sample(1:10, 2, replace=TRUE)
xc <- matrix(rnorm(10*2, mean = mu, sd = 1), ncol=2, byrow=TRUE)
# Binary variables
xb <- cbind(rbinom(10, 1, 0.1), rbinom(10, 1, 0.5), rbinom(10, 1, 0.9))
# Nominal variables
xn <- matrix(sample(1:3, 10, replace=TRUE), ncol=1)
x <- cbind(xc, xb, xn)
# Distances
d <- dgower(x, type=list(cuant=1:2, bin=3:5, nom=6))
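To make the role of the three variable types concrete, here is a minimal base-R sketch of Gower's (1971) similarity coefficient for complete data. It assumes range-scaled quantitative variables, binary variables where joint absences (0-0) are excluded from the denominator, and nominal variables scored by simple matching. The helper `gower_sim_sketch` is hypothetical, handles no missing values, and is not the ICGE implementation.

```r
# Hypothetical helper (not from ICGE): Gower similarity for complete mixed data.
# quant, bin, nom are column positions of each variable type.
gower_sim_sketch <- function(x, quant, bin, nom) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Range of each quantitative variable (assumed non-zero)
  rng <- apply(x[, quant, drop = FALSE], 2, function(v) diff(range(v)))
  s <- diag(n)
  for (i in seq_len(n - 1)) for (j in (i + 1):n) {
    num_q <- sum(1 - abs(x[i, quant] - x[j, quant]) / rng)
    pos <- sum(x[i, bin] == 1 & x[j, bin] == 1)   # positive matches
    neg <- sum(x[i, bin] == 0 & x[j, bin] == 0)   # negative matches
    eq  <- sum(x[i, nom] == x[j, nom])            # nominal matches
    den <- length(quant) + (length(bin) - neg) + length(nom)
    s[i, j] <- s[j, i] <- (num_q + pos + eq) / den
  }
  s
}

x <- cbind(c(1.0, 2.5, 4.0, 3.0),   # quantitative
           c(10, 30, 20, 40),       # quantitative
           c(1, 0, 1, 0),           # binary
           c(0, 0, 1, 1),           # binary
           c(1, 2, 2, 3))           # nominal
s <- gower_sim_sketch(x, quant = 1:2, bin = 3:4, nom = 5)
# Dissimilarity in the 1 - s form
d_one_minus_s <- as.dist(1 - s)
```

The similarity matrix can then be turned into a distance with whichever transformation of 1 - s is required.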
dmahal computes and returns the Mahalanobis distance matrix between the rows of a data matrix.
dmahal(datos, S)
datos |
data matrix. |
S |
covariance matrix. |
A dist object with distance information.
The stats package provides the function mahalanobis(), which computes the Mahalanobis distance with respect to a given center. In contrast, dmahal() is designed to calculate the distance between each pair of units given a data matrix.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Everitt B. S. and Dunn G. (2001) Applied Multivariate Data Analysis. 2nd edition, Edward Arnold, London.
dist, dbhatta, dgower, dcor, dproc2
# Generate 10 objects in dimension 2
library(MASS)  # provides mvrnorm()
mu <- rep(0, 2)
Sigma <- matrix(c(10,3,3,2),2,2)
x <- mvrnorm(n=10, rep(0, 2), Sigma)
d <- dmahal(x, Sigma)
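The difference between a distance to a center and a pairwise distance matrix can be illustrated in base R. The sketch below whitens the data with the Cholesky factor of the covariance matrix and then takes ordinary Euclidean distances. The helper `mahal_sketch` is hypothetical and not the package's code; whether `dmahal` returns squared or unsquared distances is not stated here, and this sketch returns unsquared ones.

```r
# Hypothetical helper (not from ICGE): pairwise Mahalanobis distances between rows.
mahal_sketch <- function(x, S) {
  x <- as.matrix(x)
  # With S = t(R) %*% R (Cholesky), Euclidean distances between the rows of
  # x %*% solve(R) equal the Mahalanobis distances between the rows of x.
  R <- chol(S)
  dist(x %*% solve(R))
}

Sigma <- matrix(c(10, 3, 3, 2), 2, 2)
x <- matrix(c( 1,  2,
               0, -1,
               3,  0.5,
              -2,  1), ncol = 2, byrow = TRUE)
d_sketch <- mahal_sketch(x, Sigma)

# Cross-check one pair against stats::mahalanobis(), which returns the
# squared distance from a point to a center
sq_12 <- mahalanobis(x[1, ], center = x[2, ], cov = Sigma)
```

Here `as.matrix(d_sketch)[1, 2]^2` and `sq_12` agree up to rounding, since using one unit as the center reduces the pairwise case to the centered one.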
dproc2 computes and returns all the pairwise Procrustes distances between genes in a time course experiment, using their expression profiles.
dproc2(x, timepoints = NULL)
x |
a matrix containing, in its rows, the gene expression values at the T considered time points. |
timepoints |
a T-vector with the T observed time points. If |
Each row i of matrix x is arranged in a two column matrix Xi. In Xi, the first column contains the time points and the second column the observed gene expression values (xi1...).
A dist object with distance information.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Irigoien, I. , Vives, S. and Arenas, C. (2011). Microarray Time Course Experiments: Finding Profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(2), 464–475.
Gower, J. C. and Dijksterhuis, G. B. (2004) Procrustes Problems. Oxford University Press.
Sibson, R. (1978). Studies in the Robustness of Multidimensional Scaling: Procrustes statistic. Journal of the Royal Statistical Society, Series B, 40, 234–238.
dist, dmahal, dgower, dcor, dbhatta
# Given 10 hypothetical time course profiles
# over 6 time points at 1, 2, ..., 6 hours.
x <- matrix(c(0.38, 0.39, 0.38, 0.37, 0.385, 0.375,
              0.99, 1.19, 1.50, 1.83, 2.140, 2.770,
              0.38, 0.50, 0.71, 0.72, 0.980, 1.010,
              0.20, 0.40, 0.70, 1.06, 2.000, 2.500,
              0.90, 0.95, 0.97, 1.50, 2.500, 2.990,
              0.64, 2.61, 1.51, 1.34, 1.330, 1.140,
              0.71, 1.82, 2.28, 1.72, 1.490, 1.060,
              0.71, 1.82, 2.28, 1.99, 1.975, 1.965,
              0.49, 0.78, 1.00, 1.27, 0.590, 0.340,
              0.71, 1.00, 1.50, 1.75, 2.090, 1.380),
            nrow=10, byrow=TRUE)
# Graphical representation
matplot(t(x), type="b")
# Distance matrix between them
d <- dproc2(x)
Assume that n units are divided into k clusters C1,...,Ck, and consider a fixed unit x0. Function estW calculates the INCA statistic W(x0, C1,...,Ck) and the related U_i statistics.
estW(d, dx0, pert = "onegroup")
d |
a distance matrix or a |
dx0 |
an n-vector containing the distances d0j between x0 and unit j. |
pert |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
The function returns an object of class incaest, which is a list containing the following components:
Wvalue |
is the INCA statistic |
Uvalue |
is a vector containing the statistics |
For a correct geometrical interpretation it is convenient to verify whether the distance matrix d is Euclidean.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.
data(iris)
d <- dist(iris[,1:4])
# characteristics of a specific flower (likely group 1)
x0 <- c(5.3, 3.6, 1.1, 0.1)
# distances between flower x0 and the rest of flowers in iris
dx0 <- rep(0,150)
for (i in 1:150){
  dif <- x0-iris[i,1:4]
  dx0[i] <- sqrt(sum(dif*dif))
}
estW(d, dx0, iris[,5])
INCAindex helps to estimate the number of clusters in a dataset.
INCAindex(d, pert_clus)
d |
a distance matrix or a |
pert_clus |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
Returns an object of class incaix, which is a list containing the following components:
well_class |
a vector indicating the number of well classified units. |
Ni_cluster |
a vector indicating each cluster size. |
Total |
percentage of objects well classified in the partition defined by |
For a correct geometrical interpretation it is convenient to verify whether the distance matrix d is Euclidean. It admits the associated methods summary and plot. The first simply returns the percentage of well-classified units and the second offers a barchart with the percentages of well classified units for each group in the given partition.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.
# generate 3 clusters, each of them with 20 objects in dimension 5.
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)
# Euclidean distance between units.
d <- dist(x)
# given the right partition, calculate the percentage of well classified objects.
partition <- c(rep(1,20), rep(2,20), rep(3,20))
INCAindex(d, partition)
# In order to estimate the number of clusters in data, try several
# partitions and compare the results
library(cluster)
T <- rep(NA, 5)
for (l in 2:5){
  part <- pam(d,l)$clustering
  T[l] <- INCAindex(d,part)$Total
}
plot(T, type="b",xlab="Number of clusters", ylab="INCA", xlim=c(1.5, 5.5))
INCAnumclu helps to estimate the number of clusters in a dataset. The INCA index is calculated for different partitions with different numbers of clusters.
INCAnumclu(d, K, method = "pam", pert, L= NULL, noise=NULL)
d |
a distance matrix or a |
K |
the maximum number of cluster to be considered. For each k value ( k=2,..,K) a partition with k clusters is calculated. |
method |
character string defining the clustering method used to obtain the partitions. The hierarchical agglomerative clustering methods are performed via |
pert |
only useful when parameter |
L |
default value NULL, but when some units are considered by
the user as noise units, |
noise |
when |
Returns an object of class incanc, which is a numeric vector containing the INCA index associated with each of the k (k=2,...,K) partitions. When noise is not NULL, the function returns a list with the INCA index for each partition, calculated both without and with the noise units. The associated plot method draws the INCA index plot, both with and without noise.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
#------- Example 1 --------------------------------------
# generate 3 clusters, each of them with 20 objects in dimension 5.
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)
# calculate euclidean distance between them
d <- dist(x)
# calculate the INCA index associated to partitions with k=2, ..., k=5 clusters.
INCAnumclu(d, K=5)
out <- INCAnumclu(d, K=5)
plot(out)
#------- Example 1 cont. --------------------------------
# With hypothetical noise elements
noiseunits <- rep(FALSE, 60)
noiseunits[sample(1:60, 20)] <- TRUE
out <- INCAnumclu(d, K=5, L="custom", noise=noiseunits)
plot(out)
Assume that n units are divided into k groups C1,...,Ck. Function INCAtest performs the typicality INCA test. The null hypothesis that a new unit x0 is a typical unit with respect to a previously fixed partition is tested against the alternative hypothesis that the unit is atypical.
INCAtest(d, pert, d_test, np = 1000, alpha = 0.05, P = 1)
d |
a distance matrix or a |
pert |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
d_test |
an n-vector containing the distances from x0 to the other units. |
np |
sample size for the bootstrap procedure. |
alpha |
fixed level for the test. |
P |
Number of times the bootstrap procedure is repeated. |
A list with class "incat" containing the following components:
StatisticW0 |
value of the INCA statistic. |
ProjectionsU |
values of statistics measuring the projection from the specific object to each considered group. |
pvalues |
p-values obtained in the |
alpha |
specified value of the level of the test. |
Obtaining the distribution of the INCA statistic under the null hypothesis can take a long time. For a correct geometrical interpretation it is advisable to verify whether the distance matrix d is Euclidean.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV-EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
# generate 3 clusters, each of them with 20 objects in dimension 5.
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)
# Euclidean distance between units in matrix x.
d <- dist(x)
# given the right partition
partition <- c(rep(1,20), rep(2,20), rep(3,20))
# x0 contains a unit from one group, as for example group 1.
x0 <- matrix(rnorm(1*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
# distances between x0 and the other units.
dx0 <- rep(0,60)
for (i in 1:60){
  dif <- x0-x[i,]
  dx0[i] <- sqrt(sum(dif*dif))
}
INCAtest(d, partition, dx0, np=10)
# x0 contains a unit from a new group.
x0 <- matrix(rnorm(1*5, mean = sample(1:10, 5, replace=TRUE), sd = 1), ncol=5, byrow=TRUE)
# distances between x0 and the other units in matrix x.
dx0 <- rep(0,60)
for (i in 1:60){
  dif <- x0-x[i,]
  dx0[i] <- sqrt(sum(dif*dif))
}
INCAtest(d, partition, dx0, np=10)
This lymphography domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data. The data are available at the UCI KDD data repository (Hettich S and Bay SD, 1999).
The data set consists of 148 instances presenting 18 different mixed attributes (1 quantitative, 9 binary and 9 nominal), and a class variable indicating the diagnosis. There are no missing values.
data(lympha)
Data frame with 148 instances and 19 features.
Attribute information:
— NOTE: All attribute values in the database have been entered as numeric values corresponding to their index in the list of attribute values for that attribute domain as given below.
1. class: normal find, metastases, malign lymph, fibrosis
2. lymphatics: normal, arched, deformed, displaced
3. block of affere: no, yes
4. bl. of lymph. c: no, yes
5. bl. of lymph. s: no, yes
6. by pass: no, yes
7. extravasates: no, yes
8. regeneration of: no, yes
9. early uptake in: no, yes
10. lym.nodes dimin: 0-3
11. lym.nodes enlar: 1-4
12. changes in lym.: bean, oval, round
13. defect in node: no, lacunar, lac. marginal, lac. central
14. changes in node: no, lacunar, lac. margin, lac. central
15. changes in stru: no, grainy, drop-like, coarse, diluted, reticular, stripped, faint
16. special forms: no, chalices, vesicles
17. dislocation of: no, yes
18. exclusion of no: no, yes
19. no. of nodes in: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, >=70
The UCI KDD Archive.
Hettich S and Bay SD (1999). The UCI KDD Archive. Department of Information and Computer Science. University of California at Irvine, Irvine, USA.
data(lympha)
aux <- table(lympha[,1])
barplot(aux, names.arg=c("normal", "metastases", "malign lymph", "fibrosis"))
Assume that n units are divided into k groups C1,...,Ck. The function calculates the proximity function from a specific unit x0 to the groups Cj.
proxi(d, dx0, pert = "onegroup")
d |
a distance matrix or a |
dx0 |
an n-vector containing the distances from x0 to the other units. |
pert |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
k-vector containing the proximity function value from x0 to each group.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
Cuadras, C.M., Fortiana, J. and Oliva, F. (1997). The proximity of an individual to a population with applications in discriminant analysis. Journal of Classification, 14, 117–136.
data(iris)
d <- dist(iris[,1:4])
# x0 contains a unit from one group, as for example group 1.
x0 <- c(5.3, 3.6, 1.1, 0.1)
# distances between x0 and the other units.
dx0 <- rep(0,150)
for (i in 1:150){
  dif <- x0-iris[i,1:4]
  dx0[i] <- sqrt(sum(dif*dif))
}
proxi(d, dx0, iris[,5])
# x0 contains a unit from one group, as for example group 2.
x0 <- c(6.4, 3.0, 4.8, 1.3)
# distances between x0 and the other units.
dx0 <- rep(0,150)
for (i in 1:150){
  dif <- x0-iris[i,1:4]
  dx0[i] <- sqrt(sum(dif*dif))
}
proxi(d, dx0, iris[,5])
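The proximity of a unit to a group can be sketched directly from distances. The code below assumes the proximity function of Cuadras, Fortiana and Oliva (1997): the mean squared distance from x0 to the units of the group, minus the group's geometric variability. The helper `proxi_sketch` is hypothetical, and whether `proxi` returns exactly this (squared) quantity is an assumption.

```r
# Hypothetical helper (not from ICGE): proximity of x0 to each group.
proxi_sketch <- function(d, dx0, groups) {
  d2 <- as.matrix(d)^2
  groups <- as.factor(groups)
  sapply(levels(groups), function(g) {
    idx <- groups == g
    n_g <- sum(idx)
    v_g <- sum(d2[idx, idx]) / (2 * n_g^2)   # geometric variability of group g
    mean(dx0[idx]^2) - v_g
  })
}

X <- as.matrix(iris[, 1:4])
x0 <- c(5.3, 3.6, 1.1, 0.1)
# Euclidean distances from x0 to every row of X, without an explicit loop
dx0 <- sqrt(rowSums(sweep(X, 2, x0)^2))
phi2 <- proxi_sketch(dist(X), dx0, iris[, 5])
```

With Euclidean distances, each value equals the squared distance from x0 to the group centroid, so the smallest value points at the closest group.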
Synthetic time course data reporting 120 gene profiles along 6 time points, where the genes are drawn from 8 different populations.
data(SyntheticTimeCourse)
Data frame with 120 rows and 7 columns.
Attribute information: Column cl: the class that the gene belongs to. Columns t1 - t6: the gene's expression at the time points t1, ..., t6 considered.
data(SyntheticTimeCourse)
x <- SyntheticTimeCourse[, 2:7]
cl <- SyntheticTimeCourse[, 1]
par(mfrow=c(3,3))
for (g in 1:8){
  xx <- t(x[cl==g,] )
  yy <- matrix(c(1:6 ), nrow=6, ncol=15, byrow=FALSE)
  matplot(yy,xx, pch=21, type="b", axes=FALSE, ylim=c(0,3.5), xlim=c(0.5,6.5),
          xlab="", ylab="", col="black", main=paste("G",g))
  abline(h=0)
  abline(v=0.5)
  mtext("Time", side=1)
  mtext("Expression", side=2)
}
Assume that n units are divided into k groups C1,...,Ck. The function calculates the geometrical variability for each group in data.
vgeo(d, pert = "onegroup")
d |
a distance matrix or a |
pert |
an n-vector that indicates which group each unit belongs to. Note that the expected values of |
It is a matrix containing the geometric variability for each group.
Itziar Irigoien [email protected]; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV/EHU), Donostia, Spain.
Conchita Arenas [email protected]; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
Cuadras, C.M. (1992). Some examples of distance based discrimination. Biometrical Letters, 29(1), 3–20.
data(iris)
d <- dist(iris[,1:4])
vgeo(d,iris[,5])
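The geometric variability of a group can be written in a few lines of base R. This sketch assumes the usual definition, V(C) = (1 / (2 n^2)) * sum over all pairs (i, j) in C of d(i, j)^2, with n the group size. The helper `vgeo_sketch` is hypothetical and not the package's code; it returns a named vector rather than a matrix.

```r
# Hypothetical helper (not from ICGE): geometric variability of each group.
vgeo_sketch <- function(d, groups) {
  d2 <- as.matrix(d)^2
  groups <- as.factor(groups)
  sapply(levels(groups), function(g) {
    idx <- groups == g
    # Sum of squared distances over all ordered pairs within the group
    sum(d2[idx, idx]) / (2 * sum(idx)^2)
  })
}

v <- vgeo_sketch(dist(iris[, 1:4]), iris[, 5])
```

With Euclidean distances, this equals the sum of the per-variable variances of the group computed with denominator n, which is why it is read as a distance-based generalization of total variance.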