
Special Issue and Workshop

Special issue in International Journal of Approximate Reasoning (Elsevier, ISSN: 0888-613X) and a corresponding workshop on

HARNESSING THE INFORMATION CONTAINED IN LOW-QUALITY DATA SOURCES

Organized by Oviedo University¹ and European Centre for Soft Computing

Introduction

A differentiated treatment of stochastic and epistemic uncertainties is an important issue in many modeling and decision-making problems. Different frameworks for representing imprecise, uncertain or vague information have been proposed (possibility theory, fuzzy sets, imprecise probabilities, etc.). The design and validation of computational intelligence systems that discover knowledge on the basis of incomplete and imprecise information, including learning algorithms and classification and regression models, is deeply influenced by these representations and by the different interpretations of mathematical tools like fuzzy or random sets.

Aim of the Special Issue

The special issue aims to encourage discussion on those theoretical and methodological aspects that impact practical applications on modeling and knowledge discovery. Articles reporting on both theoretical and empirical research will be considered for inclusion, as well as survey and position papers suggesting new research directions and critiques of current trends.

Workshop Style

Selected authors of position papers will be invited as speakers in a two-day workshop that will take place on Wednesday, 16th May 2012 and Thursday, 17th May 2012 at the European Centre for Soft Computing, Mieres, Asturias (Spain).

Audience members are expected to put questions to the speakers. Each of these position papers will be accompanied by the most relevant discussions and a final rejoinder from the author.

Outcome of the workshop

- Collect basic papers by speakers
- Collect significant comments and rejoinders
- Collect replies to questions posed in the round tables
- Based on this material, write a joint position paper clarifying the various approaches to uncertainty in statistics and formalized canonical problems

Round Table: Uncertainty in Statistics

Chairman: Didier Dubois

Download outline (pdf)

Over the last 40 years, several formalisms have emerged for handling uncertainty and/or human-originated information that seem to compete with the probabilistic tradition:
- Fuzzy set theory (Zadeh)
- Random sets (Kendall, Matheron)
- Belief functions (Shafer, Smets)
- Possibility theory (Shackle, Zadeh)
- Imprecise probabilities (Dempster, Walley)
- etc.
All these formalisms put forward the use of sets as opposed to or as complementary to the use of probability distributions.
The aim of the round table is to better understand their impact on statistics and information processing.

Schedule

Download material (pdf)

Wednesday, 16th
- (09:00-09:30) Presentation, discussion of the work plan
- (09:40-11:00) Ana Colubi: Statistical methods for random fuzzy sets: Theory and applications (50’ + 30’ discussion)
- (11:00-11:15) Coffee break
- (11:15-12:35) Eyke Hüllermeier: Learning from Imprecise Data: On the Notion of Data Disambiguation (50’ + 30’ discussion)
- (12:40-13:30) Presentation of the topics of the roundtable. Proposal about issues to be discussed the following day.
- (13:30-15:00) Lunch
- (15:10-16:30) Didier Dubois: Statistical Reasoning with Set-Valued Information: Ontic vs. Epistemic Views (50’ + 30’ discussion)
- (16:30-16:45) Coffee break
- (16:45-18:15) Thierry Denoeux: Statistical inference from uncertain data in the belief function framework (50’ + 30’ discussion)
- (20:30-22:30) Dinner
Thursday, 17th
- (09:00-10:20) James Keller: Comparing Partitions from Clustering Algorithms (50’ + 30’ discussion)
- (10:20-10:35) Coffee break
- (10:35-11:55) Christian Borgelt: Approaches to Fault-Tolerant Item Set Mining (50’ + 30’ discussion)
- (11:55-12:10) Coffee break
- (12:10-13:30) Serafín Moral: Imprecise probability models for representing ignorance. Applications to learning credal networks (50’ + 30’ discussion)
- (13:30-15:00) Lunch
- (15:30-17:30) Roundtable
- (17:30-17:45) Closing. Discussion about the Special Issue

Plenary speakers

Ana Colubi
Title: Statistical methods for random fuzzy sets: Theory and applications.
Abstract: A random fuzzy set is a model to formalize the random generation of fuzzy data. Fuzzy data are often used to represent perceptions, ratings, subjective opinions, etc. In some cases, they represent an imprecise perception of a precise quantity, while at other times they represent an intrinsically non-precise characteristic. That is the case, for instance, of the expert assessment of the quality of any item. In this context, fuzzy data can be treated as elements of a conventional metric space. In any case, when the final aim is to obtain statistical conclusions which do not refer to any (possibly existing) underlying quantity but to the fuzzy data themselves, the available results in probability and statistics for metric spaces may be applied.
Some statistical tools based on a family of intuitive and operative L2-type metrics inspired by the mid-spread decomposition of intervals will be recalled. It will be shown that the rich theory of statistics for Hilbert spaces may occasionally be used. However, the lack of linearity of the space of fuzzy sets endowed with the usual arithmetic implies some difficulties. Specifically, inferences on the fuzzy mean, the Fréchet variance and regression problems will be discussed. Real-life examples will be used to illustrate the methods.
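As a rough illustration of what a mid-spread L2-type distance looks like, the following sketch compares two closed intervals through their midpoints and half-widths. It is not taken from the talk: the function names and the weight `theta` are assumptions made here for illustration, and the sketch covers plain intervals only (for fuzzy sets, a distance of this kind would typically be aggregated over the level sets).

```python
import math


def mid_spread(interval):
    """Return (midpoint, half-width) of a closed interval given as (lo, hi)."""
    lo, hi = interval
    if lo > hi:
        raise ValueError("interval bounds must satisfy lo <= hi")
    return (lo + hi) / 2.0, (hi - lo) / 2.0


def interval_distance(a, b, theta=1.0):
    """L2-type distance between two intervals.

    theta is a hypothetical non-negative weight (an assumption of this
    sketch) balancing differences in spread against differences in mid.
    """
    mid_a, spr_a = mid_spread(a)
    mid_b, spr_b = mid_spread(b)
    return math.sqrt((mid_a - mid_b) ** 2 + theta * (spr_a - spr_b) ** 2)


if __name__ == "__main__":
    # Mids are 2 and 4, spreads are 1 and 2.
    print(interval_distance((1.0, 3.0), (2.0, 6.0)))
    # With theta = 0 only the midpoints matter.
    print(interval_distance((1.0, 3.0), (2.0, 6.0), theta=0.0))
```

Writing each interval as a (mid, spread) pair is what makes Hilbert-space tools available, but the spread must stay non-negative, which is one way to see the lack of linearity mentioned in the abstract.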
Eyke Hüllermeier
Title: Learning from Imprecise Data: On the Notion of Data Disambiguation
Abstract: An increasing number of publications is currently devoted to the learning of models from imprecise data, such as interval data or, more generally, data modeled in terms of fuzzy subsets of an underlying reference space. Needless to say, this idea also requires the extension of corresponding learning algorithms. Unfortunately, this is often done without clarifying the actual meaning of an interval or fuzzy observation, and the interpretation of membership functions. Distinguishing between an “ontic” and an “epistemic” interpretation of (fuzzy) set-valued data, we argue that different interpretations call for different types of extensions of existing learning algorithms and methods for data analysis. Then, focusing on the epistemic view, we argue that, in model induction from imprecise data, one should try to find a model that “disambiguates” the data instead of reproducing it. More specifically, this leads to a learning procedure that performs model identification and data disambiguation simultaneously. This idea is illustrated by means of two concrete problems, namely regression analysis with fuzzy data and classifier learning from ambiguously labeled instances.
Didier Dubois
Title: Statistical Reasoning with Set-Valued Information: Ontic vs. Epistemic Views
Abstract: Sets, hence fuzzy sets, may have a conjunctive or a disjunctive reading. In the conjunctive reading, a (fuzzy) set represents an object of interest for which a (gradual rather than Boolean) composite description makes sense. In contrast, disjunctive (fuzzy) sets refer to the use of sets as a representation of incomplete knowledge. They do not model objects or quantities, but partial information about an underlying object or a precise quantity. In this case the fuzzy set captures uncertainty, and its membership function is a possibility distribution. We call epistemic such fuzzy sets, since they represent states of incomplete knowledge. Distinguishing between ontic and epistemic fuzzy sets is important in information-processing tasks because there is a risk of misusing basic notions and tools, such as distance between fuzzy sets, variance of a fuzzy random variable, fuzzy regression, etc. We discuss several examples where the ontic and epistemic points of view yield different approaches to these concepts.
Thierry Denoeux
Title: Statistical inference from uncertain data in the belief function framework
James Keller
Title: Comparing Partitions from Clustering Algorithms
Abstract: Many of us participate in clustering research as a means of exploration aimed at understanding the structure and organization of vague and imprecise data. Most papers focus on the creation of new approaches to perform clustering. But just how good are the results of clustering algorithms? There are several well-known measures of cluster validity that are routinely utilized. Most focus on balancing the criteria of compactness and separation. We present here a method for comparing crisp and soft partitions (i.e., probabilistic, fuzzy and possibilistic) to a known crisp reference partition. Many of the classical indices that have been used with outputs of crisp clustering algorithms are generalized so that they are applicable to candidate partitions of any type. In particular, focus is placed on generalizations of the Rand index. Additionally, we extend these partition comparison methods by (1) investigating the behavior of the soft Rand index for comparing non-crisp, specifically possibilistic, partitions and (2) demonstrating how the possibilistic Rand index and the visual assessment of (cluster) tendency (VAT) algorithm can be used to discover the number of actual clusters and coincident clusters for outputs from the possibilistic c-means (PCM) algorithm.
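For readers unfamiliar with the starting point of these generalizations, the classical (crisp) Rand index can be sketched as below. This is only the standard pair-counting definition for two hard partitions; the soft and possibilistic extensions discussed in the talk are not reproduced here, and the function name is chosen for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Classical Rand index between two crisp partitions of the same objects.

    Each partition is a sequence of cluster labels, one per object. The index
    is the fraction of object pairs on which the partitions agree: the pair is
    either grouped together in both or separated in both.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("at least two objects are required")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Same grouping under different label names: full agreement.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Crossed groupings: only 2 of the 6 pairs agree.
    print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index depends only on co-membership of pairs, not on label names, which is why a relabeled copy of a partition scores 1.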
Christian Borgelt
Title: Approaches to Fault-Tolerant Item Set Mining
Abstract: In standard frequent item set mining a transaction supports an item set only if all items in the set are present. However, in many cases the transaction data to analyze is imperfect: items that are actually contained in a transaction are not recorded as such. The reasons can be manifold, ranging from noise through measurement errors to an underlying feature of the observed process. In such a case full containment is too strict a requirement that can render it impossible to find certain relevant groups of items. By relaxing the support definition, allowing for some items of a given set to be missing from a transaction, this drawback can be amended. The resulting item sets have been called approximate, fault-tolerant or fuzzy item sets. In this talk I present two cost-based approaches and accompanying efficient algorithms to find such item sets. The first works by inserting missing items into transactions, penalizing the transaction weight in such a case, while the second computes and evaluates subset size occurrence distributions. I demonstrate the benefits of the algorithms by applying them to an artificial data set (as a proof of concept) and to a concept detection task on the 2008/2009 Wikipedia Selection for schools.
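The relaxed notion of support can be illustrated with a deliberately simple sketch: a transaction counts as supporting an item set if it misses at most a fixed number of its items. This is not either of the two cost-based algorithms presented in the talk; the function name and the `max_missing` parameter are assumptions introduced for illustration.

```python
def fault_tolerant_support(transactions, item_set, max_missing=1):
    """Count transactions that contain all but at most max_missing items.

    With max_missing = 0 this reduces to the standard support definition
    (full containment). max_missing is a hypothetical tolerance parameter.
    """
    items = frozenset(item_set)
    return sum(
        1 for transaction in transactions
        if len(items - set(transaction)) <= max_missing
    )


if __name__ == "__main__":
    transactions = [
        {"a", "b", "c"},
        {"a", "b"},        # misses "c"
        {"a"},             # misses "b" and "c"
        {"b", "c", "d"},   # misses "a"
    ]
    target = {"a", "b", "c"}
    print(fault_tolerant_support(transactions, target, max_missing=0))
    print(fault_tolerant_support(transactions, target, max_missing=1))
```

In this toy data the strict definition finds a single supporting transaction, while tolerating one missing item raises the count to three.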
Serafín Moral
Title: Imprecise probability models for representing ignorance. Applications to learning credal networks
Abstract: This paper will investigate suitable imprecise prior probability models for learning with credal networks. The best known imprecise model for inference about a multinomial distribution is the Imprecise Dirichlet Model (IDM), which assumes as prior information the set of all Dirichlet distributions with a fixed equivalent sample size S. However, the IDM has been shown to be too cautious in some situations and not useful for learning about independence relationships from data. To solve these problems, alternative models have been proposed, such as the imprecise sample size Dirichlet model (ISSDM). We will consider the application of the ISSDM to learn credal networks, both for determining the graphical structure and for estimating the parameters. Special emphasis will be given to the principles, assumptions, and justification of the model. An important aspect that will be studied is the distinction between global procedures (which divide the global sample size between the different conditional distributions) and local procedures (which assume a model for each conditional distribution), showing that this can be a source of the imprecision in the equivalent sample size. An algorithm will be given to learn the structure of a credal network, and the suitability of different propagation procedures will be discussed. Finally, we will run some experiments to show the behaviour of the ISSDM for classification problems and when learning from databases.
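To make the role of the equivalent sample size concrete, the sketch below computes the standard IDM lower and upper posterior probabilities for each category of a multinomial sample: with counts n_i summing to N and equivalent sample size s, the bounds are n_i / (N + s) and (n_i + s) / (N + s). The ISSDM and the credal network learning algorithm from the talk are not reproduced; the function name and the default `s=2.0` are assumptions made for illustration.

```python
def idm_bounds(counts, s=2.0):
    """Lower and upper IDM probabilities for each category.

    counts maps category -> observed count; s is the equivalent sample size
    (the default value is only an illustrative choice).
    """
    if s <= 0:
        raise ValueError("equivalent sample size s must be positive")
    total = sum(counts.values())
    return {
        category: (n / (total + s), (n + s) / (total + s))
        for category, n in counts.items()
    }


if __name__ == "__main__":
    for category, (lower, upper) in idm_bounds({"a": 3, "b": 1, "c": 0}).items():
        print(category, round(lower, 3), round(upper, 3))
```

Every interval has width s / (N + s), so the imprecision shrinks as data accumulate, and a category that was never observed still receives a non-zero upper probability.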

Scheduling

~~Open call for papers: February 10, 2011~~
~~Tentative submission (title and abstract): March 20, 2012~~
Workshop: May 16 - May 17, 2012
Paper submission: July 20, 2012
First revision: September 20, 2012
Updated versions: October 20, 2012
Second revision: November 20, 2012
Final version: December 20, 2012

Names and contact details of the guest editors

Inés Couso
Dept. Statistics, Operational Research and M. T.
University of Oviedo
Gijón E-33002, Spain
Tel: +34 985181906
email: couso@uniovi.es

Luciano Sánchez
Dept. Computer Science
University of Oviedo
Gijón E-33002, Spain
Tel: +34 985182130
Fax: +34 985181986
email: luciano@uniovi.es

¹Under Research Projects TIN2008-06681-C06-04: “Knowledge Discovery based on Evolutionary Learning: Current Trends and New Challenges (KEEL-CTNC) / Evolutionary Learning with Low Quality Data and Genetic Fuzzy Systems. Distributed High-Dimensional Data sets” and TIN2011-24302: “CI-LQD: Computational Intelligence Techniques for Modeling and Decision Making with Low Quality Data: Theoretical, Methodological and Practical Issues”