# Introduction to the regfilter package

The regfilter package contains filtering techniques to remove noisy samples from regression datasets. It adapts a total of 14 classic and recent noise filters to regression problems, following the approach proposed in Martin et al. (2021).

## Installation

The regfilter package can be installed in R from CRAN servers using the command:

install.packages("regfilter")

This command installs all the dependencies of the package as well as all the regression algorithms necessary for the operation of the noise filters. In order to access all the functions of the package, it is necessary to use the R command:

library(regfilter)

## Documentation

All the information corresponding to each noise filter can be consulted from the CRAN website. Additionally, the help() command can be used. For example, in order to check the documentation of the regIPF noise filter, we can use:

help(regIPF)

## Usage of regressand noise filters

For processing noisy regression data, each noise filter in the regfilter package provides two standard ways of use:

• Default method. It receives a data frame with the input attributes in the x argument, whereas the output variable is received through the y argument (a double vector).
• Class formula method. This method allows passing the whole data frame (attributes and response variable) in the data argument. In addition, the attributes along with the output regressand must be indicated in the formula argument.

An example of how to use these two methods to filter the rock dataset with the regCNN noise filter is shown below:

data(rock)
head(rock)
#>   area    peri     shape perm
#> 1 4990 2791.90 0.0903296  6.3
#> 2 7002 3892.60 0.1486220  6.3
#> 3 7558 3930.66 0.1833120  6.3
#> 4 7352 3869.32 0.1170630  6.3
#> 5 7943 3948.54 0.1224170 17.1
#> 6 7979 4010.15 0.1670450 17.1
# Using the default method:
set.seed(9)
out.def <- regCNN(x = rock[,-ncol(rock)], y = rock[,ncol(rock)])
# Using the formula method:
set.seed(9)
out.frm <- regCNN(formula = perm ~ ., data = rock)
# Check the match of noisy indices:
all(out.def$idnoise == out.frm$idnoise)
#> [1] TRUE

Note that the $ operator is used to access the elements returned by the filter in the objects out.def and out.frm.

## Output values

All regression noise filters return an object of class rfdata. It is designed to unify the output value of the methods included in the regfilter package. The class rfdata is a list of elements with the most relevant information of the noise filtering process:

• xclean a data frame with the input attributes of clean samples (without errors).
• yclean a double vector with the output regressand of clean samples (without errors).
• numclean an integer with the number of clean samples.
• idclean an integer vector with the indices of clean samples.
• xnoise a data frame with the input attributes of noisy samples (with errors).
• ynoise a double vector with the output regressand of noisy samples (with errors).
• numnoise an integer with the number of noisy samples.
• idnoise an integer vector with the indices of noisy samples.
• filter the full name of the noise filter used.
• param a list of the argument values.
• call the function call.
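The elements listed above can be recombined for downstream modelling. The following is a minimal sketch, assuming the rock example and the regCNN call used earlier in this document; the name rock.clean and the choice of lm (from base R) as the downstream model are illustrative only:

```r
library(regfilter)

data(rock)
set.seed(9)
out <- regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])

# Reassemble the clean samples into a single data frame.
# The column name "perm" is chosen to match the original rock dataset.
rock.clean <- cbind(out$xclean, perm = out$yclean)

# The bookkeeping fields should agree with the sizes of the returned data.
stopifnot(nrow(rock.clean) == out$numclean)
stopifnot(out$numclean + out$numnoise == nrow(rock))

# Fit a linear model on the filtered data only.
fit <- lm(perm ~ ., data = rock.clean)
```

The same pattern applies to xnoise and ynoise if the discarded samples need to be inspected separately.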

As an example, the structure of the rfdata object returned using the regCNN noise filter is shown below:

str(out.def)
#> List of 11
#>  $ xclean  :'data.frame':    39 obs. of  3 variables:
#>   ..$ area : int [1:39] 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
#>   ..$ peri : num [1:39] 2792 3893 3931 3869 3949 ...
#>   ..$ shape: num [1:39] 0.0903 0.1486 0.1833 0.1171 0.1224 ...
#>  $ yclean  : num [1:39] 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...
#>  $ numclean: int 39
#>  $ idclean : num [1:39] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ xnoise  :'data.frame':    9 obs. of  3 variables:
#>   ..$ area : int [1:9] 3469 1468 3524 5267 5048 1016 5605 8793 5514
#>   ..$ peri : num [1:9] 1377 476 1189 1645 942 ...
#>   ..$ shape: num [1:9] 0.177 0.439 0.164 0.254 0.329 ...
#>  $ ynoise  : num [1:9] 100 100 100 100 1300 1300 1300 1300 580
#>  $ numnoise: int 9
#>  $ idnoise : int [1:9] 37 38 39 40 41 42 43 44 47
#>  $ filter  : chr "Condensed Nearest Neighbors"
#>  $ param   :List of 1
#>   ..$ t: num 0.2
#>  $ call    : language regCNN(x = rock[, -ncol(rock)], y = rock[, ncol(rock)])
#>  - attr(*, "class")= chr "rfdata"

In order to display the results of the class rfdata in a friendly way in the R console, two specific print and summary functions are implemented. The print function presents the basic information of the regressand noise filter:

print(out.def)
#>
#> ## Noise model:
#> Condensed Nearest Neighbors
#>
#> ## Parameters:
#> - t = 0.2
#>
#> ## Number of noisy and clean samples values:
#> - Noisy values: 9/48 (18.75%)
#> - Clean values: 39/48 (81.25%)

The information offered by print is as follows:

• The name of the noise filtering model.
• The parameters associated with the noise filtering model.
• The number of noisy and clean samples in the dataset.
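To illustrate how a class-specific print function of this kind can work in R, here is a minimal sketch using S3 dispatch. It is not the package's implementation: the class name rfdemo, the object demo, and the method body are hypothetical, and only the field names (filter, param, numnoise, numclean) and the values are taken from the output shown above.

```r
# Hypothetical stand-in for a filter result; only the fields used below are filled in.
demo <- structure(
  list(filter = "Condensed Nearest Neighbors",
       param = list(t = 0.2),
       numnoise = 9L,
       numclean = 39L),
  class = "rfdemo"  # hypothetical class name, not part of regfilter
)

# S3 print method: print(demo) is dispatched here through the class attribute.
print.rfdemo <- function(x, ...) {
  total <- x$numnoise + x$numclean
  cat("## Noise model:\n", x$filter, "\n\n", sep = "")
  cat("## Parameters:\n")
  for (nm in names(x$param)) {
    cat("- ", nm, " = ", format(x$param[[nm]]), "\n", sep = "")
  }
  cat("\n## Number of noisy and clean samples:\n")
  # Percentages are the counts divided by the total number of samples.
  cat(sprintf("- Noisy values: %d/%d (%.2f%%)\n",
              x$numnoise, total, 100 * x$numnoise / total))
  cat(sprintf("- Clean values: %d/%d (%.2f%%)\n",
              x$numclean, total, 100 * x$numclean / total))
  invisible(x)
}

print(demo)
```

Because dispatch is driven by the class attribute, typing the object name at the console triggers the same method as calling print explicitly.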

On the other hand, the summary function displays the information of the dataset processed with the noise filter along with other additional details. This function can be called by typing the following R command:

summary(out.frm, showid = TRUE)
#>
#> ########################################################
#>  Noise filtering process: Summary
#> ########################################################
#>
#> ## Original call:
#> regCNN(formula = perm ~ ., data = rock)
#>
#> ## Noise model:
#> Condensed Nearest Neighbors
#>
#> ## Parameters:
#> - t = 0.2
#>
#> ## Number of noisy and clean samples values:
#> - Noisy values: 9/48 (18.75%)
#> - Clean values: 39/48 (81.25%)
#>
#> ## Indices of noisy samples:
#> - Output class: 37, 38, 39, 40, 41, 42, 43, 44, 47

The information offered by this function is as follows:

• The function call.
• The name of the regressand noise filter.
• The parameters associated with the noise filter.
• The number of noisy and clean samples in the dataset.
• The indices of the noisy samples (if showid = TRUE).
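The indices reported by summary are also stored in the returned object, so they can be used to look up the flagged rows in the original data. A minimal sketch, assuming the rock example and the formula call used earlier in this document (the names rock.noisy and rock.kept are illustrative):

```r
library(regfilter)

data(rock)
set.seed(9)
out <- regCNN(formula = perm ~ ., data = rock)

# Rows of the original dataset that the filter flagged as noisy.
rock.noisy <- rock[out$idnoise, ]

# Rows that were kept. Indexing with idclean is used rather than -idnoise,
# because negative indexing with an empty vector would return no rows.
rock.kept <- rock[out$idclean, ]

# Sanity checks: sizes agree with the counts stored in the rfdata object.
stopifnot(nrow(rock.noisy) == out$numnoise)
stopifnot(nrow(rock.kept) == out$numclean)
```

Inspecting rock.noisy alongside the summary output is a quick way to judge whether the flagged samples look like genuine errors or legitimate extreme values.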