kFoldMaskTensor
)In this vignette, we introduce a cross-validation approach using mask tensor (Fujita 2018; Owen 2009) to estimate the optimal rank parameter for matrix/tensor decomposition. Mask matrix/tensor has the same size of data matrix/tensor and contains only 0 or 1 elements, 0 means not observed value, and 1 means observed value. In this approach, only non-zero and non-NA elements are randomly specified as 0 and estimated by other elements reconstructed by the result of matrix/tensor decomposition. This can be considered a cross-validation approach because the elements specified as 1 are the training dataset and the elements specified as 0 are the test dataset.
Here, we use these packages.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Here, we use three types of demo data as follows:
To set mask matrix/tensor, here we use kFoldMaskTensor
,
which divides only the elements other than NA and 0 into k
mask matrices/tensors. In NMF
, the mask matrix can be
specified as M
and the rank parameter (J
) with
the smallest test error is considered the optimal rank. Here, three mask
matrices are prepared for each rank, and the average is used to estimate
the optimal rank.
out_NMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_NMF <- kFoldMaskTensor(data_matrix, k=3)
for(j in 1:3){
out_NMF[count, 3] <- rev(
NMF(data_matrix,
M = masks_NMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Looking at the average test error for each rank, the optimal rank appears to be around 8 to 10 (with some variation depending on random seeds).
ggplot(out_NMF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error")
## # A tibble: 10 × 2
## rank Avg
## <fct> <dbl>
## 1 1 4193.
## 2 2 3330.
## 3 3 3282.
## 4 4 1520.
## 5 5 1304.
## 6 6 1336.
## 7 7 1101.
## 8 8 1012.
## 9 9 1162.
## 10 10 988.
## # A tibble: 1 × 2
## rank Avg
## <fct> <dbl>
## 1 10 988.
Same as NMF
, mask matrix M
can be specified
in NMTF
, and the rank parameter (rank
) with
the smallest test error is considered the optimal rank. The following
code is time-consuming and should be executed in your own
environment.
out_NMTF <- expand.grid(replicate=1:3, rank2=1:10, rank1=1:10, value=0)
rank_NMTF <- paste0(out_NMTF$rank1, "-", out_NMTF$rank2)
out_NMTF <- cbind(out_NMTF, rank=factor(rank_NMTF, levels=unique(rank_NMTF)))
count <- 1
for(i in 1:10){
for(j in 1:10){
masks_NMTF <- kFoldMaskTensor(data_matrix, k=3)
for(k in 1:3){
out_NMTF[count, 4] <- rev(
NMTF(data_matrix,
M = masks_NMTF[[k]],
rank = c(i, j))$TestRecError)[1]
count <- count + 1
}
}
}
ggplot(out_NMTF, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Same as NMF
, mask matrices M
can be
specified in siNMF
, and the rank parameter (J
)
with the smallest test error is considered the optimal rank. The
following code is time-consuming and should be executed in your own
environment.
out_siNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_siNMF <- lapply(1:3, function(x){
list(
kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
})
for(j in 1:3){
out_siNMF[count, 3] <- rev(
siNMF(data_matrices,
M = masks_siNMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask matrices M
can be
specified in jNMF
, and the rank parameter (J
)
with the smallest test error is considered the optimal rank. The
following code is time-consuming and should be executed in your own
environment.
out_jNMF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_jNMF <- lapply(1:3, function(x){
list(
kFoldMaskTensor(data_matrices[[1]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[2]], k=3)[[x]],
kFoldMaskTensor(data_matrices[[3]], k=3)[[x]])
})
for(j in 1:3){
out_jNMF[count, 3] <- rev(
jNMF(data_matrices,
M = masks_jNMF[[j]],
J = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask tensor M
can be specified
in NTF
, and the rank parameter (rank
) with the
smallest test error is considered the optimal rank. The following code
is time-consuming and should be executed in your own environment.
out_NTF <- expand.grid(replicate=1:3, rank=factor(1:10), value=0)
count <- 1
for(i in 1:10){
masks_NTF <- kFoldMaskTensor(data_tensor, k=3)
for(j in 1:3){
out_NTF[count, 3] <- rev(
NTF(data_tensor,
M = masks_NTF[[j]],
rank = i)$TestRecError)[1]
count <- count + 1
}
}
Same as NMF
, mask tensor M
can be specified
in NTD
, and the rank parameter (rank
) with the
smallest test error is considered the optimal rank. The following code
is time-consuming and should be executed in your own environment.
out_NTD <- expand.grid(replicate=1:3, rank3=1:5, rank2=1:5, rank1=1:5, value=0)
rank_NTD <- paste0(out_NTD$rank1, "-", out_NTD$rank2,
"-", out_NTD$rank3)
out_NTD <- cbind(out_NTD, rank=factor(rank_NTD, levels=unique(rank_NTD)))
count <- 1
for(i in 1:5){
for(j in 1:5){
for(k in 1:5){
masks_NTD <- kFoldMaskTensor(data_tensor, k=3)
for(k in 1:3){
out_NTD[count, 5] <- rev(
NTD(data_tensor,
M = masks_NTD[[k]],
rank = c(i, j, k))$TestRecError)[1]
count <- count + 1
}
}
}
}
ggplot(out_NTD, aes(x=rank, y=value)) +
geom_point() +
stat_summary(fun = mean, geom = "point", shape=21, size=3, fill="blue") +
stat_summary(fun = mean, geom = "line", colour = "blue", aes(group=1)) +
xlab("Rank") +
ylab("Test Reconstruction Error") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_3.5.1 rTensor_1.4.8 nnTensor_1.3.0 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 tcltk_4.4.2
## [5] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3 grid_4.4.2
## [9] RColorBrewer_1.1-3 fastmap_1.2.0 maps_3.4.2.1 jsonlite_1.8.9
## [13] misc3d_0.9-1 spam_2.11-1 viridisLite_0.4.2 scales_1.3.0
## [17] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.5 munsell_0.5.1
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 tools_4.4.2
## [25] colorspace_2.1-1 buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
## [29] lifecycle_1.0.4 plot3D_1.4.1 MASS_7.3-64 pkgconfig_2.0.3
## [33] bslib_0.9.0 pillar_1.10.1 gtable_0.3.6 glue_1.8.0
## [37] Rcpp_1.0.14 fields_16.3 xfun_0.50 tibble_3.2.1
## [41] tidyselect_1.2.1 sys_3.4.3 knitr_1.49 farver_2.1.2
## [45] htmltools_0.5.8.1 maketools_1.3.1 labeling_0.4.3 tagcloud_0.6
## [49] dotCall64_1.2 compiler_4.4.2