--- title: "latentcor" author: "Mingze Huang, Christian L. Müller, Irina Gaynanova" date: "`r Sys.Date()`" bibliography: latentcor.bib output: rmarkdown::html_vignette extra_dependencies: ["amsmath"] nocite: | @croux2013robust @filzmoser2021pcapp @liu2009nonparanormal @fox2019poly vignette: > %\VignetteIndexEntry{latentcor} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} options(tinytex.verbose = TRUE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(latentcor) ``` # Introduction R package `latentcor` utilizes the powerful semi-parametric latent Gaussian copula models to estimate latent correlations between mixed data types. The package allows to estimate correlations between any of continuous/binary/ternary/zero-inflated (truncated) variable types. The underlying implementation takes advantage of fast multi-linear interpolation scheme with a clever choice of grid points that give the package a small memory footprint, and allows to use the latent correlations with sub-sampling and bootstrapping. # Statement of need No R software package is currently available that allows accurate and fast correlation estimation from mixed variable data in a unifying manner. The R package `latentcor`, introduced here, thus represents the first stand-alone R package for computation of latent correlation that takes into account all variable types (continuous/binary/ordinal/zero-inflated), comes with an optimized memory footprint, and is computationally efficient, essentially making latent correlation estimation almost as fast as rank-based correlation estimation. # Getting started ## A simple example with two variables First, we will generate a pair of variables with different types using a sample size $n=100$ which will serve as example data. Here first variable will be ternary, and second variable will be continuous. ```{r data_generation} simdata = gen_data(n = 100, types = c("ter", "con")) ``` The output of `gen_data` is a list with 2 elements: ```{r data_output} names(simdata) ``` - `X`: a matrix ($100\times 2$), the first column is the ternary variable; the second column is the continuous variable. ```{r data_matrix} X = simdata$X head(X, n = 6L) ``` - `plotX`: NULL (`showplot = FALSE`, can be changed to display the plot of generated data in`gen_data` input). ```{r data_plot} simdata$plotX ``` Then we can estimate the latent correlation matrix based on these 2 variables using `latentcor` function. ```{r estimation} estimate = latentcor(X, types = c("ter", "con")) ``` The output of `latentcor` is a list with several elements: ```{r estimation_output} names(estimate) ``` - `zratios` is a list has the same length as the number of variables. Here the first element is a ($2\times1$) vector indicating the cumulative proportions for zeros and ones in the ternary variable (e.g. first element in vector is the proportion of zeros, second element in vector is the proportion of zeros and ones.) The second element of the list is NA for continuous variable. ```{r zratios} estimate$zratios ``` - `K`: Kendall $\tau$ ($\tau_{a}$) correlation matrix for these 2 variables. ```{r Kendall} estimate$K ``` - `Rpointwise`: matrix of pointwise estimated correlations. Due to pointwise estimation, `Rpointwise` is not guaranteed to be positive semi-definite ```{r latent_correlation_pointwise} estimate$Rpointwise ``` - `R`: estimated final latent correlation matrix, this matrix is guaranteed to be strictly positive definite (through `nearPD` projection and parameter `nu`, see Mathematical framework for estimation) if `use.nearPD = TRUE`. ```{r latent_correlation} estimate$R ``` - `plotR`: NULL by default as `showplot = FALSE` in `latentcor`. Otherwise displays a heatmap of latent correlation matrix. ```{r heatmap} estimate$plotR ``` ## Example with mtcars dataset We use the build-in dataset `mtcars`: ```{r mtcars} head(mtcars, n = 6L) ``` Let's take a look at the unique values for each variable to determine the corresponding data type. ```{r unique} apply(mtcars, 2, table) ``` Then we can estimate the latent correlation matrix for all variables of `mtcars` by using `latentcor` function. ```{r mtcars_estimation} estimate_mtcars = latentcor(mtcars, types = c("con", "ter", "con", "con", "con", "con", "con", "bin", "bin", "ter", "con")) ``` Note that the determination of variable types can also be done automatically by `latentcor` package using `get_types` function: ```{r mtcars_types, message = FALSE} estimate_mtcars = latentcor(mtcars, types = get_types(mtcars)) ``` This function is run automatically inside `latentcor` if the `types` are not supplied by the user, however the automatic determination of types takes extra time, so we recommend to specify `types` explicitly if they are known in advance. The output of `latentcor` for `mtcars`: ```{r mtcars_estimation_output} names(estimate_mtcars) ``` - `zratios`: zratios for corresponding variables in `mtcars`. ```{r mtcars_zratios} estimate_mtcars$zratios ``` - `K`: Kendall $\tau$ ($\tau_{a}$) correlation matrix for variables in `mtcars`. ```{r mtcars_Kendall} estimate_mtcars$K ``` - `Rpointwise`: matrix of pointwise estimated correlations for `mtcars`. ```{r mtcars_latent_correlation_pointwise} estimate_mtcars$Rpointwise ``` - `R`: estimated final latent correlation matrix for `mtcars`. ```{r mtcars_latent_correlation} estimate_mtcars$R ``` - `plotR`: NULL by default as `showplot = FALSE` in `latentcor`. Otherwise displays a heatmap of latent correlation matrix for `mtcars` (See [heatmap of latent correlation (approx) for mtcars](https://rpubs.com/mingzehuang/797937)). ```{r mtcars_heatmap} estimate_mtcars$plotR ``` ## Example using latentcor with subsampling While `latentcor` can determine the types of each variable automatically, it is recommended to call `get_types` first and then supply `types` explicitly to save the computation time, especially when using latentcor with sub-sampling (which we illustrate below). First, we will generate variables with different types using a sample size $n=100$ which will serve as an example data for subsampling. ```{r data_generation 2} simdata2 = gen_data(n = 100, types = c(rep("ter", 3), "con", rep("bin", 3))) ``` To use the data with subsampling, we recommend to first run `get_types` on the full data ```{r types subsampling} types = get_types(simdata2$X) types ``` Then, when doing subsampling, we recommend to explicitly supply identified types to `latentcor`. We illustrate using 10 subsamples, each of size 80. ```{r subsampling} start_time = proc.time() for (s in 1:10){ # Select a random subsample of size 80 subsample = sample(1:100, 80) # Estimate latent correlation on subsample specifying the types Rs = latentcor(simdata2$X[subsample, ], types = types) } proc.time() - start_time ``` Compared with ```{r subsampling 2} start_time = proc.time() for (s in 1:10){ # Select a random subsample of size 80 subsample = sample(1:100, 80) # Estimate latent correlation on subsample specifying the types Rs = latentcor(simdata2$X[subsample, ], types = get_types(simdata2$X)) } proc.time() - start_time ``` # References