MLSP 2012 Competition: Amazon Data Science Competition

A data and signal analysis competition is being organized. Winners will present their work and receive their award during the Workshop.

Opening of the competition: January 20, 2012
Deadline of algorithm submission: Extended to May 21, 2012

Goal: Select or design a classifier (together with any pre-processing stages, including a feature extractor) that correctly classifies an (employee, resource) pair into one of two classes: access or non-access. A resource can be a computer, a database, a data portal, etc. The winner will be the submission that minimizes the number of manual access grant/removal operations on the future test data.

Eligibility: Anyone.

Registration: Registration is not required. However, if you wish to receive important updates on the competition by email, please send a request to the address provided on the web page.

Awards: Up to three of the newest Kindle Fire tablet computers from Amazon will be awarded as prizes. Amazon will support and invite the winners to present their results at Amazon's new SLU headquarters in Seattle, USA.

Data description

Full details, including the data set description, can be found here (PDF, 41 KB).

The objective of this competition is to build a model, learned from historical data, that determines an employee's access needs such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. The model takes an employee attribute record and a resource code, and returns true if the employee should be given access to this resource and false otherwise.
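As an illustration of the interface described above, here is a minimal sketch of a model that takes an attribute record and a resource code and returns true or false. The attribute names, the record layout, and the vote-counting rule are assumptions made for this example only; they are not part of the competition data or any reference solution.

```python
from collections import defaultdict

def train(history):
    """Count past outcomes per (attribute name, attribute value, resource).

    history: iterable of (attrs, resource_id, has_access), where attrs is a
    dict of attribute name -> value (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [non-access count, access count]
    for attrs, resource_id, has_access in history:
        for name, value in attrs.items():
            counts[(name, value, resource_id)][int(has_access)] += 1
    return counts

def should_have_access(model, attrs, resource_id):
    """Return True if past records sharing these attribute values mostly had access."""
    granted = denied = 0
    for name, value in attrs.items():
        d, g = model.get((name, value, resource_id), (0, 0))
        denied += d
        granted += g
    return granted > denied

history = [
    ({"dept": "sales", "title": "rep"}, "R1", True),
    ({"dept": "sales", "title": "mgr"}, "R1", True),
    ({"dept": "eng", "title": "dev"}, "R1", False),
]
model = train(history)
print(should_have_access(model, {"dept": "sales", "title": "dev"}, "R1"))
```

A real entry would replace the vote count with a trained classifier, but the input and output shapes would stay the same.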

The problem can be formulated as follows:

At time T, create a snapshot of STATUS(EMPLOYEE_ID, RESOURCE_ID), which is either 1 (access) or 0 (non-access). Build a system F, which models STATUS ~ {EMPLOYEE ATTRIBUTES, RESOURCE ATTRIBUTES}.

Therefore at time T, for each employee, we have an access profile PROFILE(EMPLOYEE_ID, T).
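The relationship between the STATUS snapshot and the per-employee PROFILE can be sketched as follows. This assumes the snapshot is held in memory as a mapping from (EMPLOYEE_ID, RESOURCE_ID) to 0 or 1; the function name and the identifiers used are hypothetical.

```python
def build_profiles(status):
    """Group a STATUS snapshot into one access profile per employee.

    status: dict mapping (employee_id, resource_id) -> 1 (access) or 0 (non-access).
    Returns a dict mapping employee_id -> set of resource_ids with access.
    """
    profiles = {}
    for (employee_id, resource_id), flag in status.items():
        # Make sure employees with no access at all still get an (empty) profile.
        profile = profiles.setdefault(employee_id, set())
        if flag == 1:
            profile.add(resource_id)
    return profiles

status_at_t = {("e1", "r1"): 1, ("e1", "r2"): 0, ("e2", "r2"): 1}
profiles_at_t = build_profiles(status_at_t)
print(profiles_at_t)
```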

The measure of success is to minimize the cost of add/remove actions for all employees for a given time period.
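One way to read this measure is to count, for every employee, the resources that would have to be granted or revoked by hand to turn the predicted profile into the actual one. The sketch below assumes unit cost for each add and each remove; the official cost definition is given in the data set description, so treat the weighting here as an assumption.

```python
def manual_operations(predicted, actual):
    """Count manual grants and revokes needed to correct predicted profiles.

    predicted, actual: dicts mapping employee_id -> set of resource_ids.
    Returns (adds, removes).
    """
    adds = removes = 0
    for employee_id in predicted.keys() | actual.keys():
        p = predicted.get(employee_id, set())
        a = actual.get(employee_id, set())
        adds += len(a - p)     # needed but not predicted: manual grant
        removes += len(p - a)  # predicted but not needed: manual revoke
    return adds, removes

predicted = {"e1": {"r1", "r2"}, "e2": set()}
actual = {"e1": {"r1"}, "e2": {"r3"}}
adds, removes = manual_operations(predicted, actual)
print(adds, removes)
```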

Data sets

This data is non-confidential and suitable for public consumption.

Data Set #1:

contains the set of attributes associated with each user; each column corresponds to a single attribute (the rows have a time dimension).

Data Set #2:

contains the access transaction history (the rows have a time dimension).

Data Set #3:

contains the user access snapshot at the beginning and the end of the transaction history (the rows have a time dimension: each is dated either 2010-11-01 or 2011-11-01).
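A plausible consistency check between Data Set #2 and Data Set #3 is to replay the transaction history on top of the starting snapshot and compare the result with the ending snapshot. The sketch below is only an illustration: the tuple layout, the "add"/"remove" action labels, and the ISO date strings are assumptions, since the actual file schema is defined in the data set description.

```python
def replay(start_snapshot, transactions):
    """Apply dated add/remove transactions to a starting access snapshot.

    start_snapshot: set of (employee_id, resource_id) pairs that have access.
    transactions: iterable of (date, employee_id, resource_id, action), where
    date is an ISO string and action is "add" or "remove" (hypothetical encoding).
    """
    snapshot = set(start_snapshot)
    # ISO dates sort correctly as strings; the stable sort keeps the input
    # order for transactions that share a date.
    for _date, employee_id, resource_id, action in sorted(transactions, key=lambda t: t[0]):
        pair = (employee_id, resource_id)
        if action == "add":
            snapshot.add(pair)
        elif action == "remove":
            snapshot.discard(pair)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return snapshot

start = {("e1", "r1")}
history = [
    ("2011-03-01", "e1", "r1", "remove"),
    ("2010-12-15", "e2", "r2", "add"),
]
end = replay(start, history)
print(end)
```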

For more information visit