Research Papers


Research #1: Best Paper Nominees

Tuesday, Nov. 4, 11:00 - 12:30 - Hotel RC - Mirabilis Room

Chair: Sarfraz Khurshid

1.     Naaliel Mendes, Joao Duraes and Henrique Madeira. Security Benchmarks for Web Serving Systems

2.     Hong Lu, Tao Yue, Shaukat Ali, Kunming Nie and Li Zhang. Zen-CC: An Automated and Incremental Conformance Checking Solution to Support Interactive Product Configuration

3.     James Walden, Jeffrey Stuckman and Riccardo Scandariato. Predicting Vulnerable Components: Software Metrics vs Text Mining


Research #2: Modeling

Tuesday, Nov. 4, 14:00 - 15:30 - Hotel RC - Aragonese+Catalana Room

Chair: Henrique Madeira

1.     Jesús Sánchez Cuadrado, Esther Guerra and Juan De Lara. Uncovering errors in ATL model transformations using static analysis and constraint solving

2.     Jinhee Park, Nakwon Lee and Jongmoon Baik. On the Long Term Predictive Capability of Data-driven Software Reliability Model: An Empirical Evaluation

3.     Raymond Devillers, Jean-Yves Didier, Hanna Klaudel and Johan Arcile. Deadlock and temporal properties analysis in mixed reality applications


Research #3: Program Logic

Tuesday, Nov. 4, 16:00 - 18:00 - Hotel RC - Aragonese+Catalana Room

Chair: Katinka Wolter

1.     Fangfang Zhang, Dinghao Wu, Peng Liu and Sencun Zhu. Program Logic Based Software Plagiarism Detection

2.     Bastian Zimmer, Christoph Dropmann and Jochen Ulrich Hänger. A systematic approach for interference analysis

3.     Ding Ye, Yu Su, Yulei Sui and Jingling Xue. WPBOUND: Enforcing Spatial Memory Safety Efficiently at Runtime with Weakest Preconditions

4.     Aleksandar Milenkoski, Bryan D. Payne, Nuno Antunes, Marco Vieira and Samuel Kounev. Experience Report: An Analysis of Hypercall Handler Vulnerabilities


Research #4: Fault Localization

Wednesday, Nov. 5, 11:00 - 12:30 - CC - Aula Magna

Chair: Michael Grottke

1.     Franz Wotawa and Birgit Hofer. Why does my spreadsheet compute wrong values?

2.     Hao Hu, Hongyu Zhang, Jifeng Xuan and Weigang Sun. Effective Bug Triage based on Historical Bug-Fix Information

3.     Wolfgang Högerle, Friedrich Steimann and Marcus Frenkel. More Debugging in Parallel


Research #5: Case Studies I

Wednesday, Nov. 5, 14:00 - 15:30 - CC - Aula Magna

Chair: Veena Mendiratta

1.     Keun Soo Yim. Norming to Performing: Failure Analysis and Deployment Automation of Big Data Software Developed by Highly Iterative Models

2.     Nuno Silva and Marco Vieira. Experience Report: Orthogonal Classification of Safety Critical Issues

3.     Xin Chen, Charng-Da Lu and Karthik Pattabiraman. Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study


Research #6: Data Analysis

Wednesday, Nov. 5, 16:00 - 17:30 - CC - Aula Magna

Chair: Ilir Gashi

1.     Catello Di Martino, Daniel Chen, Geetika Goel, Rajeshwari Ganesan, Zbigniew Kalbarczyk and Ravishankar Iyer. Analysis and Diagnosis of SLA Violations in a Production SaaS Cloud

2.     Rahul Gopinath, Carlos Jensen and Alex Groce. Mutations: How close are they to real faults?

3.     Ermira Daka and Gordon Fraser. A Survey on Unit Testing Practices and Problems


Research #7: Industrial Systems

Thursday, Nov. 6, 09:00 - 10:30 - CC - Aula Magna

Chair: Yvan Labiche

1.     Marcello Cinque, Domenico Cotroneo, Raffaele Della Corte and Antonio Pecchia. Assessing Direct Monitoring Techniques to Analyze Failures of Critical Industrial Systems

2.     Sagar Sen, Carlo Ieva, Arnab Sarkar, Atle Sander and Astrid Grime. Experience Report: Verifying Data Interaction Coverage to Improve Testing of Data-intensive Systems: The Norwegian Customs and Excise Case Study

3.     Normann Decker, Franziska Kühn, Martin Leucker and Daniel Thoma. Runtime Verification of Web Services for Interconnected Medical Devices


Research #8: Case Studies II

Thursday, Nov. 6, 11:00 - 12:30 - CC - Aula Magna

Chair: Brendan Murphy

1.     Jeehyun Hwang, Laurie Williams, Mladen Vouk and Da Young Lee. Access Control Policy Evolution: An Empirical Study

2.     Roland Mader, Rene Obendrauf, Philipp Prinz and Gerhard Grießnig. A Safety Engineering Tool Framework Supporting Error Model Creation and Visualization

3.     Davide Giacomo Cavezza, Roberto Pietrantuono, Javier Alonso, Stefano Russo and Kishor Trivedi. Reproducibility of environment-dependent software failures: An Experience Report


Research #9: Software Testing

Thursday, Nov. 6, 14:00 - 15:30 - CC - Aula Magna

Chair: Leonardo Mariani

1.     Jie Zhang, Muyao Zhu, Dan Hao and Lu Zhang. An Empirical Study on the Scalability of Selective Mutation Testing

2.     Nesa Asoudeh and Yvan Labiche. Multi-objective construction of an entire adequate test suite for an EFSM

3.     Kim Herzig. Using Pre-Release Test Failures to Build Early Post-Release Defect Prediction Models


Research #10: Applications of Machine Learning

Thursday, Nov. 6, 14:00 - 15:30 - Hotel RC - Aragonese Room

Chair: Roberto Natella

1.     Huihua Lu, Ekrem Kocaguneli and Bojan Cukic. Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction

2.     Sunint Khalsa and Yvan Labiche. An orchestrated survey of available algorithms and tools for Combinatorial Testing

3.     Tien-Duy B. Le, Ferdian Thung and David Lo. Predicting Effectiveness of IR-Based Bug Localization Techniques



Security Benchmarks for Web Serving Systems
The security of software-based systems is one of the most difficult issues when assessing the suitability of systems for most application scenarios. However, security is very hard to evaluate and quantify, and there are no standard methods to benchmark the security of software systems. This work proposes a novel methodology for benchmarking the security of software-based systems. This methodology uses the notion of risk in a quantifiable way and allows the comparison of functionally-equivalent systems (or different configurations of the same system) to enable users and system integrators to identify and select the most secure one. The benchmark methodology is based on both analytical and experimental steps and is applicable to any software system. The benchmark procedures and rules guide users on how to instantiate the methodology for specific scenarios and how to execute the benchmark. In this paper we also present an instantiation of the methodology for a case study of web-serving systems and show how to use the results to identify the most secure system under benchmark.

Zen-CC: An Automated and Incremental Conformance Checking Solution to Support Interactive Product Configuration
In the context of product line engineering (PLE), providing users with immediate feedback on the correctness of a manual configuration step has a practical impact on whether a tool-supported configuration process can be successfully adopted in practice. Model-based PLE has brought opportunities to enable automated product configuration and derivation for large-scale systems/software, in which models are used as the abstract specification of the commonalities and variabilities of the products of a product line. In our previous work, we proposed a UML-based variability modeling methodology and an interactive configuration process. Building on that work, in this paper we propose an automated and incremental conformance checking approach to ensure that the manual configuration of each variation point conforms to a set of pre-defined conformance rules specified in OCL. The proposed approach, called Zen-CC, is implemented as a component of our product configuration and derivation tool, named Zen-Configurator. The approach is evaluated with two real-world case studies, and the results show that the performance of Zen-CC is significantly better than that of a baseline algorithm checking all the conformance rules at each configuration step. Moreover, the performance of Zen-CC rarely varies during the configuration process, suggesting that our approach is scalable for configuring products with a large number of configuration points.
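The incremental idea — re-checking only the rules that mention the variation point just configured, rather than all rules at every step — can be sketched as follows. This is a toy illustration, not Zen-CC's implementation: the rules, variable names, and dependency sets are invented, and plain Python predicates stand in for OCL constraints.

```python
# Each rule: (name, variation points it reads, predicate over the config).
RULES = [
    ("engine_requires_cooling", {"engine", "cooling"},
     lambda c: c.get("engine") != "turbo" or c.get("cooling") == "liquid"),
    ("gps_requires_display", {"gps", "display"},
     lambda c: not c.get("gps") or c.get("display", "none") != "none"),
]

def affected_rules(changed_var):
    """Rules whose dependency set mentions the changed variation point."""
    return [r for r in RULES if changed_var in r[1]]

def check_step(config, changed_var):
    """Re-check only the affected rules; return the names of violated ones."""
    return [name for name, _, pred in affected_rules(changed_var)
            if not pred(config)]

config = {"engine": "turbo", "cooling": "air", "gps": True, "display": "lcd"}
violations = check_step(config, "cooling")   # only 1 of the 2 rules re-checked
```

The saving grows with the number of rules: each configuration step touches one variation point, so only the (typically small) slice of rules depending on it needs re-evaluation.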

Predicting Vulnerable Components: Software Metrics vs Text Mining
Building secure software is difficult, time-consuming, and expensive. Prediction models that identify vulnerability prone software components can be used to focus security efforts, thus helping to reduce the time and effort required to secure software. Several kinds of vulnerability prediction models have been proposed over the course of the past decade. However, these models were evaluated with differing methodologies and datasets, making it difficult to determine the relative strengths and weaknesses of different modeling techniques. In this paper, we provide a high-quality, public dataset, containing 223 vulnerabilities found in three web applications, to help address this issue. We used this dataset to compare vulnerability prediction models based on text mining with models using software metrics as predictors. We found that text mining models had higher recall than software metrics based models for all three applications.

Uncovering Errors in ATL Model Transformations Using Static Analysis and Constraint Solving
Model transformations play a prominent role in Model-Driven Engineering (MDE), where they are used to transform models between languages, to refactor and simulate models, or to generate code from models. However, while the reliability of any MDE process depends on the correctness of its transformations, methods that help detect errors in transformations and automate their verification are still needed. To improve this situation, we propose a method for the static analysis of one of the most widely used model transformation languages: ATL. The method proceeds in three steps. First, it infers typing information from the transformation and detects potential errors statically. Then, it generates OCL path conditions for the candidate errors, stating the requirements for a model to hit the problematic statements in the transformation. Last, it relies on constraint solving to generate a test model fragment, or witness, that exercises the transformation, making it execute the problematic statement. Our method is supported by a prototype tool that integrates a static analyzer, a testing tool and a constraint solver. We have used the tool to analyse medium- and large-size third-party ATL transformations, discovering a large number of errors.

On the Long-Term Predictive Capability of Data-Driven Software Reliability Model: An Empirical Evaluation
In recent years, data-driven software reliability models have been proposed to address problematic issues of existing software reliability growth models (i.e., unrealistic underlying assumptions and model selection problems). However, previous data-driven approaches mostly focused on sample fitting or next-step prediction without adequately evaluating their long-term predictive capability. This paper investigates three multi-step-ahead prediction strategies for data-driven software reliability models and compares their predictive performance on failure count data and time between failure data. Then, the model with the outstanding strategy for each data type is compared with conventional software reliability growth models. We found that the Recursive strategy gives better predictions for fault count data, while no strategy is superior to the others for time between failure data. Such a data-driven approach with the best input domain performed as well in long-term prediction as the best of the software reliability growth models. These results indicate the applicability of data-driven methods even in long-term prediction and help reliability practitioners identify an appropriate multi-step prediction strategy for software reliability.
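The Recursive multi-step strategy the paper evaluates applies a one-step model repeatedly, feeding its own predictions back as input. A minimal sketch, with an invented toy predictor and invented fault data standing in for a trained data-driven model:

```python
def one_step_model(history):
    # Toy one-step predictor: the next cumulative fault count grows by the
    # average of the last two increments (a stand-in for a trained model).
    d1 = history[-1] - history[-2]
    d2 = history[-2] - history[-3]
    return history[-1] + (d1 + d2) / 2

def recursive_forecast(history, horizon):
    """Predict `horizon` steps ahead by recursing on the model's own output."""
    h = list(history)
    for _ in range(horizon):
        h.append(one_step_model(h))   # prediction becomes input for next step
    return h[len(history):]

faults = [10, 14, 17, 19]             # cumulative fault counts per interval
preds = recursive_forecast(faults, 3)
```

The alternative ("Direct") strategy would instead train a separate model per horizon; the Recursive strategy reuses one model at the cost of compounding its own prediction errors, which is exactly why its long-term behaviour needs the empirical evaluation the paper provides.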

Deadlock and Temporal Properties Analysis in Mixed Reality Applications
Mixed reality systems overlay real data with virtual information in order to assist users in their current task; they are used in many fields (surgery, maintenance, entertainment). Such systems generally combine several hardware components operating at different time scales, and software that has to cope with these timing constraints. MIRELA, for Mixed Reality Language, is a framework aimed at modelling, analysing and implementing systems composed of sensors, processing units, shared memories and rendering loops, communicating in a well-defined manner and subject to timing constraints. The paper describes how harmful software behaviour, such as (global and local) deadlocks or starvation, which may result in hardware deterioration or revert the system's primary goal from user assistance to user impediment, may be detected. This also includes a study of temporal properties, resulting in a finer understanding of the software's timing behaviour in order to fix it if needed.
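At its simplest, the global deadlock detection described above can be pictured as cycle detection in a wait-for graph: component A waiting on component B is an edge A → B, and a cycle means no component in it can ever progress. This is only a conceptual sketch with invented component names; MIRELA analyses timed models, which is considerably richer than a plain graph.

```python
def has_deadlock(wait_for):
    """Detect a cycle in the wait-for graph via iterative DFS coloring."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in wait_for}
    for start in wait_for:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(wait_for.get(start, ())))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK          # fully explored
                stack.pop()
            elif color.get(nxt, WHITE) == GREY:
                return True                  # back edge = cycle = deadlock
            elif color.get(nxt, WHITE) == WHITE:
                color[nxt] = GREY
                stack.append((nxt, iter(wait_for.get(nxt, ()))))
    return False

ok = {"sensor": ["memory"], "memory": [], "renderer": ["memory"]}
bad = {"sensor": ["memory"], "memory": ["renderer"], "renderer": ["sensor"]}
```

Local deadlocks and starvation need timing information as well (a component may be live but never scheduled within its deadline), which is why the framework also studies temporal properties.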

Program Logic Based Software Plagiarism Detection
Software plagiarism, an act of illegally copying others' code, has become a serious concern for honest software companies and the open source community. In this paper, we propose LoPD, a program logic based approach to software plagiarism detection. Instead of directly comparing the similarity between two programs, LoPD searches for any dissimilarity between two programs by finding an input that will cause these two programs to behave differently, either with different output states or with semantically different execution paths. As long as we can find one dissimilarity, the programs are semantically different, but if we cannot find any dissimilarity, it is likely a plagiarism case. We leverage symbolic execution and weakest precondition reasoning to capture the semantics of execution paths and to find path dissimilarities. LoPD is more resilient to current automatic obfuscation techniques, compared to the existing detection mechanisms. In addition, since LoPD is a formal program semantics-based method, it can provide a guarantee of resilience against many known obfuscation attacks. Our evaluation results indicate that LoPD is both effective and efficient in detecting software plagiarism.
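LoPD's dissimilarity search relies on symbolic execution and weakest-precondition reasoning; as a toy analogue (the two programs are invented, and exhaustive enumeration stands in for a constraint solver), the key observation is that a single input on which the programs disagree already proves they are semantically different:

```python
def original(x):
    return abs(x)

def suspect(x):
    # A "plagiarised then edited" variant with a behavioural change at 0.
    return x if x > 0 else -x if x < 0 else 1

def find_dissimilarity(p, q, inputs):
    """Return an input on which the programs' outputs differ, else None."""
    for x in inputs:
        if p(x) != q(x):
            return x                 # one witness suffices: not a clone
    return None

witness = find_dissimilarity(original, suspect, range(-100, 101))
```

The asymmetry matters: finding a witness is conclusive, while failing to find one (as LoPD does path by path, with a solver rather than enumeration) only makes plagiarism likely.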

A Systematic Approach for Software Interference Analysis
Interferences are a common challenge in integrated systems. An interference is a failure propagation scenario in which a failure of one software component propagates to another software component via the platform's shared computational resources. To account for this, safety standards demand freedom from interference in order to control failure propagation between mixed-critical software components. However, the analysis of potential interferences for a given system is often performed ad-hoc, for example using lists of known issues. Arguing the sufficiency of the interference analysis is difficult using such an approach, especially when dealing with new technologies for which established lists do not exist yet. To this end, this paper presents an interference analysis method that allows for the systematic identification and specification of interferences.

WPBOUND: Enforcing Spatial Memory Safety Efficiently at Runtime with Weakest Preconditions
Spatial errors (e.g., buffer overflows) continue to be one of the dominant threats to software reliability and security in C/C++ programs. Presently, the software industry typically enforces spatial memory safety by instrumentation. Due to high overheads incurred in bounds checking at runtime, many program inputs cannot be exercised, causing some input-specific spatial errors to go undetected in today's commercial software. This paper introduces a new compile-time optimisation for reducing bounds checking overheads based on the notion of Weakest Precondition (WP). The basic idea is to guard a bounds check at a pointer dereference inside a loop, where the WP-based guard is hoisted outside the loop, so that its falsehood implies the absence of out-of-bounds errors at the dereference, thereby avoiding the corresponding bounds check inside the loop. This WP-based optimisation is applicable to any spatial-error detection approach (in software or hardware or both). To evaluate the effectiveness of our optimisation, we take SOFTBOUND, a compile-time tool with an open-source implementation in LLVM, as our baseline. SOFTBOUND adopts a pointer-based checking approach with disjoint metadata, making it a state-of-the-art tool in providing compatible and complete spatial safety for C. Our new tool, called WPBOUND, is a refined version of SOFTBOUND, also implemented in LLVM, by incorporating our WP-based optimisation. For a set of 12 SPEC C benchmarks evaluated, WPBOUND reduces the average (geometric mean) slowdown of SOFTBOUND from 71% to 45% (by a reduction of 37%), with small code size increases.
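The hoisting idea can be illustrated in a few lines (in Python, purely for exposition — WPBOUND itself operates on C programs inside LLVM): replace a per-iteration bounds check with one guard, hoisted outside the loop, whose falsehood implies that every access inside the loop is in bounds.

```python
class BoundsError(IndexError):
    pass

def checked_sum(buf, n):
    """Naive instrumented version: one bounds check per iteration."""
    total = 0
    for i in range(n):
        if not (0 <= i < len(buf)):       # per-access check, n times
            raise BoundsError(i)
        total += buf[i]
    return total

def wp_sum(buf, n):
    """WP-optimised version: a single hoisted guard replaces n checks."""
    if n > len(buf):                      # WP of "some access is out of bounds"
        raise BoundsError(n - 1)          # slow/failing path
    total = 0
    for i in range(n):                    # fast path: provably in bounds
        total += buf[i]
    return total

buf = [1, 2, 3, 4]
```

Both functions detect exactly the same errors; the optimised one simply pays the checking cost once per loop rather than once per dereference, which is where the reported slowdown reduction comes from.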

An Analysis of Hypercall Handler Vulnerabilities
Hypervisors are becoming increasingly ubiquitous with the growing proliferation of virtualized data centers. As a result, attackers are exploring vectors to attack hypervisors; an attack may be executed via several attack vectors, such as device drivers, virtual machine exit events, or hypercalls. Hypercalls enable intrusions into hypervisors through their hypercall interfaces. Despite their importance, there is very limited publicly available information on vulnerabilities of hypercall handlers and on attacks triggering them, which significantly hinders advances towards monitoring and securing these interfaces. In this paper, we characterize the hypercall attack surface based on an analysis of a set of vulnerabilities of hypercall handlers. We systematize and discuss the errors that caused the considered vulnerabilities, and the activities for executing attacks triggering them. We also demonstrate attacks triggering the considered vulnerabilities and analyze their effects. Finally, we suggest an action plan for improving the security of hypercall interfaces.

Why Does my Spreadsheet Compute Wrong Values?
Spreadsheets are by far the most used programs that are written by end-users. They often build the basis for decisions in companies and governmental organizations and therefore they have a high impact on our daily life. Ensuring correctness of spreadsheets is thus an important task. But what happens after detecting a faulty behavior? This question has not been sufficiently answered. Therefore, we focus on fault localization techniques for spreadsheets. In this paper, we introduce a novel dependency-based approach for model-based fault localization in spreadsheets. This approach improves diagnostic accuracy while keeping computation times short, thus making the automated fault localization more appropriate for practical applications. The presented approach allows for an acceptable fault localization time of less than a second, and reduces the number of computed root cause candidates by 15% on average, when compared with another dependency-based approach.
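The dependency-based starting point can be shown in miniature (cells and formulas are invented, and the paper's model-based approach is considerably more precise than this): only the cells in the transitive input cone of the cell observed to be wrong can be root-cause candidates.

```python
DEPS = {                 # cell -> cells its formula reads
    "D1": ["B1", "C1"],  # e.g. D1 = B1 + C1
    "C1": ["A1"],
    "B1": ["A1"],
    "E1": ["A1"],        # reads A1 but does not feed into D1
}

def candidates(wrong_cell):
    """Transitive inputs of the faulty cell, plus the cell itself."""
    seen, work = set(), [wrong_cell]
    while work:
        cell = work.pop()
        if cell in seen:
            continue
        seen.add(cell)
        work.extend(DEPS.get(cell, []))
    return seen

cone = candidates("D1")   # E1 is excluded: it cannot explain D1's value
```

Model-based diagnosis then prunes this cone further by reasoning about the cell formulas themselves, which is where the reported 15% average reduction in candidates comes from.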

Predicting Effectiveness of IR-Based Bug Localization Techniques
Recently, many information retrieval (IR) based bug localization approaches have been proposed in the literature. These approaches use information retrieval techniques to process a textual bug report and a collection of source code files to find buggy files. They output a ranked list of files sorted by their likelihood of containing the bug. Recent approaches can achieve reasonable accuracy; however, even a state-of-the-art bug localization tool outputs many ranked lists where buggy files appear very low in the lists. This potentially causes developers to distrust bug localization tools. Parnin and Orso recently conducted a user study and highlighted that developers do not find an automated debugging tool useful if they do not find the root cause of a bug early in a ranked list. To address this problem, we build an oracle that can automatically predict whether a ranked list produced by an IR-based bug localization tool is likely to be effective or not. We consider a ranked list to be effective if a buggy file appears in the top-N positions of the list. If a ranked list is unlikely to be effective, developers need not waste time checking the recommended files one by one; in such cases, it is better for developers to use traditional debugging methods or to request further information to localize bugs. To build this oracle, our approach extracts features that can be divided into four categories: score features, textual features, topic model features, and metadata features. We build a separate prediction model for each category, and combine them to create a composite prediction model which is used as the oracle. We name our proposed approach APRILE, which stands for Automated Prediction of IR-based Bug Localization's Effectiveness. We have evaluated APRILE by predicting the effectiveness of three state-of-the-art IR-based bug localization tools on more than three thousand bug reports from AspectJ, Eclipse, and SWT. APRILE achieves an average precision, recall, and F-measure of at least 70.36%, 66.94%, and 68.03%, respectively. Furthermore, APRILE outperforms a baseline approach by 84.48%, 17.74%, and 31.56% for the AspectJ, Eclipse, and SWT bug reports, respectively.
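The effectiveness label being predicted can be stated in a few lines (file names and N are invented for illustration; APRILE itself must predict this label without knowing the buggy file, using its score, textual, topic model, and metadata features):

```python
def is_effective(ranked_files, buggy_files, n=10):
    """True if any actually-buggy file appears in the top-n of the list."""
    return any(f in buggy_files for f in ranked_files[:n])

ranked = ["ui/View.java", "core/Parser.java", "net/Http.java"]
label = is_effective(ranked, {"core/Parser.java"}, n=2)
```

At evaluation time this ground-truth label (computable because the fixed files are known) is compared against the oracle's prediction, yielding the precision/recall figures above.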

More Debugging in Parallel
Programs may contain multiple faults, in which case their debugging can be parallelized. However, effective parallelization requires some guarantees that parallel debugging tasks do not address the same fault, an inherent problem of earlier, clustering-based approaches to parallel debugging. In this paper, we identify a number of fundamental trade-offs to be made when selecting algorithms for parallel debugging, and explore these trade-offs using one clustering algorithm and three algorithms from integer linear programming. Results of an evaluation involving a total of 75,000 faulty versions (with up to 32 injected faults) of 15 subject programs suggest that depending on the number of faults present and the trade-offs one is willing to accept, speed-ups much larger than previously reported can be achieved, even if all derived parallel debugging tasks are handled sequentially.

Failure Analysis and Deployment Automation of Big Data Software Developed by Highly Iterative Models
We observe many interesting failure characteristics in Big Data software developed and released using highly iterative development models (e.g., Agile). ~16% of failures occur due to faults in software deployments (e.g., packaging and pushing to production). Our analysis shows that many such production outages are at least partially due to human errors rooted in the high frequency and complexity of software deployments. ~51% of the observed human errors (e.g., transcription, education, and communication error types) are avoidable through automation. We thus develop a fault-tolerant automation framework that makes it efficient to automate end-to-end software deployment procedures. We apply the framework to two Big Data products. Our case studies show the complexity of the deployment procedures of multi-homed Big Data applications and help us to study the effectiveness of the validation and verification techniques for user-provided automation programs. We analyze the production failures of the two products again after the automation. Our experimental data shows how the automation and the associated procedure improvements reduce the deployment faults and overall failure rate, and improve the feature launch velocity. Automation facilitates more formal, procedure-driven software engineering practices which not only reduce manual work and human-caused, avoidable production outages, but also help engineers to better understand overall software engineering procedures, making them more auditable, predictable, reliable, and efficient. We also discuss two novel metrics for evaluating progress in mitigating human errors, and the conditions that indicate when to start the transition away from owner-driven deployment practices.

Orthogonal Classification of Safety Critical Issues
Techniques to classify defects have been used for decades, providing relevant information on how to improve systems. Such techniques rely heavily on human experience and have been generalized to cover different types of systems at different maturity levels. However, their application to the development and operation phases of safety-critical systems is not very common, or at least not publicly reported, and is not widely disseminated in the industrial and academic worlds. This practical experience report presents the results and conclusions from applying mature Orthogonal Defect Classification (ODC) to a large set of safety-critical issues. The work is based on the analysis of more than 240 real issues (defects) identified during all the lifecycle phases of 4 safety-critical systems in the aerospace and space domains. The outcomes reveal the challenges in properly classifying this specific type of issue with the broader ODC approach. The difficulties are identified and systematized, and specific improvements are proposed.

Reproducibility of Environment-Dependent Software Failures: An Experience Report
We investigate the dependence of software failure reproducibility on the environment in which the software is executed. The existence of such a dependence is ascertained in the literature, but so far it has not been fully characterized. In this paper we pinpoint some of the environmental components that can affect the reproducibility of a failure and show this influence through an experimental campaign conducted on the MySQL Server software system. The set of failures of interest is drawn from MySQL's failure report database, and an experiment is designed for each of these failures. The experiments expose the influence of disk usage and level of concurrency on MySQL failure reproducibility. Furthermore, the results show that high levels of these factors increase the probability of failure reproduction.

Analysis and Diagnosis of SLA Violations in a Production SaaS Cloud
This paper investigates SLA violations of a production SaaS platform by means of the joint use of field failure data analysis (FFDA) and fault injection. The objective of this study is to diagnose the causes of SLA violations, pinpoint critical failure modes under realistic error assumptions, and identify potential means to increase the user-perceived availability of the platform and the assurance of SLA requirements. We base our study on 283 days of logs obtained during the production time of the platform, while it was employed to process business data received from 42 customers in 22 countries. In this paper, we develop a set of tools that include i) an FFDA toolset used to analyze the data extracted from the platform and from the operating system event logs, and ii) a .NET/C++ injector able to automate the injection of specific runtime errors into the production code and the collection of results. Major findings include: i) 93% of all service level agreement (SLA) violations were due to system failures; ii) there were a few cases of bursts of SLA violations that could not be diagnosed from the logs and were revealed by the performed injections; and iii) the error injection revealed several error propagation paths leading to data corruptions that could not be detected from the analysis of failure data.

Mutations: How Close are they to Real Faults?
Mutation analysis is often used to compare the effectiveness of different test suites or testing techniques. One of the main assumptions underlying this technique is the Competent Programmer Hypothesis, which proposes that programs are very close to a correct version, or that the difference between current and correct code for each fault is very small. Researchers have assumed on the basis of the Competent Programmer Hypothesis that the faults produced by mutation analysis are similar to real faults. While there exists some evidence that supports this assumption, these studies are based on analysis of a limited and potentially non-representative set of programs and are hence not conclusive. In this paper, we separately investigate the characteristics of bug-fixes and other changes in a very large set of randomly selected projects using four different programming languages. Our analysis suggests that a typical fault involves about three to four tokens, and is seldom equivalent to any traditional mutation operator. We also find the most frequently occurring syntactical patterns, and identify the factors that affect the real bug-fix change distribution. Our analysis suggests that different languages have different distributions, which in turn suggests that operators optimal in one language may not be optimal for others. Moreover, our results suggest that mutation analysis stands in need of better empirical support of the connection between mutant detection and detection of actual program faults in a larger body of real programs.
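The token-level granularity discussed above can be made concrete with a rough stand-in (the tokeniser and the example fix are invented): comparing a realistic multi-token bug fix against a classic one-token mutant such as a relational-operator replacement.

```python
import difflib
import re

def tokens(src):
    """Crude C-like tokeniser: identifiers, numbers, single punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", src)

def tokens_changed(before, after):
    """Token-level edit size via difflib opcodes (non-equal spans only)."""
    sm = difflib.SequenceMatcher(a=tokens(before), b=tokens(after))
    return sum(max(i2 - i1, j2 - j1)
               for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")

real_fix = tokens_changed("if (n > limit) flush(buf);",
                          "if (n >= limit && buf) flush(buf);")
mutant = tokens_changed("if (n > limit) flush(buf);",
                        "if (n < limit) flush(buf);")
```

Here the realistic fix touches several tokens while the mutant touches exactly one, which is the kind of gap between typical real faults and single-operator mutants that the paper quantifies at scale.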

A Survey on Unit Testing Practices and Problems
Unit testing is a common practice where developers write test cases together with regular code. Automation frameworks such as JUnit for Java have popularised this approach, allowing frequent and automatic execution of unit test suites. Despite the appraisal unit testing receives in practice, software engineering researchers see potential for improvement and investigate advanced techniques such as automated unit test generation. To align such research with the needs of practitioners, we conducted a survey amongst 225 software developers, covering different programming languages and 29 countries, using a global online marketing research platform. The survey responses confirm that unit testing is an important factor in software development, and suggest that there is indeed potential and need for research on the automation of unit testing. The results help us to identify areas of importance on which further research will be necessary (e.g., maintenance of unit tests), and also provide insights into the suitability of online marketing research platforms for software engineering surveys.

Assessing Direct Monitoring Techniques to Analyze Failures of Critical Industrial Systems
The analysis of monitoring data is extremely valuable for critical computer systems. It makes it possible to gain insights into the failure behavior of a given system under real workload conditions, which is crucial to assure service continuity and downtime reduction. This paper proposes an experimental evaluation of different direct monitoring techniques, namely event logs, assertions, and source code instrumentation, that are widely used in the context of critical industrial systems. We inject 12,733 software faults into a real-world air traffic control (ATC) middleware system with the aim of analyzing the ability of the mentioned techniques to produce information in case of failures. Experimental results indicate that each technique is able to cover a limited number of failure manifestations. Moreover, we observe that the quality of the collected data for supporting failure diagnosis tasks varies strongly across the techniques considered in this study.

Verifying Data Interaction Coverage to Improve Testing of Data-Intensive Systems: The Norwegian Customs and Excise Case Study
Testing data-intensive systems is paramount to increase our reliance on information processed in e-governance, scientific/medical research, and social networks. A common practice in the industrial testing process is to use test databases copied from live production streams to test the functionality of complex database applications that manage the well-formedness of data and its adherence to business rules in these systems. This practice is often based on the assumption that the test database adequately covers realistic scenarios to test, hopefully, all functionality in these applications. There is a need to systematically evaluate this assumption. We present a tool-supported method to model realistic scenarios and verify whether copied test databases actually cover them and consequently facilitate adequate testing. We conceptualize realistic scenarios as data interactions between fields cross-cutting a complex database schema and model them as test cases in a classification tree model. We present a human-in-the-loop tool, DEPICT, that uses the classification tree model as input to (a) facilitate interactive selection of a connected subgraph from the often many possible paths of interactions between the tables specified in the model, (b) automatically generate SQL queries to create an inner join between the tables in the connected subgraph, and (c) extract records from the join and generate a visual report of satisfied and unsatisfied interactions, hence quantifying the test adequacy of the test database. We report our experience as a qualitative evaluation of the approach with a large industrial database from the Norwegian Customs and Excise information system TVINN, featuring large and complex databases with millions of records.
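Step (b) — turning a chosen connected chain of tables into an inner-join query — might look roughly like this. The table and column names are invented, and DEPICT's actual query generation is driven by the classification tree model rather than a hand-built chain:

```python
def inner_join_sql(chain):
    """chain: list of (table, column joining it to the previous table)."""
    (first, _), rest = chain[0], chain[1:]
    sql = f"SELECT * FROM {first}"
    prev = first
    for table, col in rest:
        # Join each table to its predecessor on the shared column.
        sql += f" INNER JOIN {table} ON {prev}.{col} = {table}.{col}"
        prev = table
    return sql

query = inner_join_sql([("declaration", None),
                        ("item", "declaration_id"),
                        ("tariff", "item_id")])
```

Running such a query against the copied test database and counting which field-value interactions appear in the result is what lets the tool report satisfied versus unsatisfied interactions, i.e. the adequacy measure in step (c).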

Runtime Verification of Web Services for Interconnected Medical Devices
This paper presents a framework to ensure the correctness of service-oriented architectures based on runtime verification techniques. Traditionally, the reliability of safety-critical systems is ensured by testing the complete system, including all subsystems. When such systems are designed as service-oriented architectures, and independently developed subsystems are composed into new systems at runtime, this approach is no longer viable. Instead, the presented framework uses runtime monitors synthesised from high-level specifications to enforce safety constraints. The framework has been designed for the interconnection of medical devices in the operating room. As a case study, the framework is applied to the interconnection of an ultrasound dissector and a microscope. Benchmarks show that the monitoring overhead is negligible in this setting.

Access Control Policy Evolution: An Empirical Study
Access control policies (ACPs) are necessary mechanisms for the protection of critical resources and applications. As the operational and security requirements of a system evolve, so do its access control policies. It is important to help policy authors manage access control policies effectively by providing insights into the historical trends and evolution patterns of those policies. We analyzed ACP evolution in three systems: the Security Enhanced Linux (SELinux) operating system, the Virtual Computing Laboratory (VCL) cloud, and a network intrusion detection application (Snort). We propose an approach that extracts evolution patterns based on the analysis of historical ACP change data. An evolution pattern is an abstraction of a change in the permissions/privileges assigned to a group or a user. We then developed a model of ACP evolution. We found eight frequently occurring evolution patterns across the three systems. In our context, this model can predict evolution patterns in ACPs with a precision of 50-80%, a recall of 70-90%, and an F-measure of 65-75%.

A Safety Engineering Tool Supporting Error Model Creation and Visualization
In this paper, we present a novel software tool called AVL Safety Extensions, which is part of a tool framework for model-based automotive safety engineering. The tool framework supports a tool-dependent methodology (TDM) that covers the left-hand V-model phases of ISO 26262-3 and ISO 26262-4 and requires the use of the System Safety Modeling Language (SSML). The AVL Safety Extensions support safety engineers in applying the TDM by helping them create consistent and complete work products and by simplifying and automating workflow steps. We present the AVL Safety Extensions in the context of the tool framework, the SSML language, and the TDM, focusing on their capabilities for error model creation and visualization in support of safety analysis techniques such as FTA (Fault Tree Analysis) and FMEA (Failure Modes and Effects Analysis). Moreover, we illustrate the applicability of the presented approach using an industrial case study of hybrid electric vehicle development.

Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study
In this paper, we analyze a workload trace from a Google cloud cluster and characterize the observed failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures and attempt to correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We also explore the potential for early failure prediction and anomaly detection for the jobs. Based on our results, we speculate that there are many opportunities to enhance the reliability of applications running in the cloud, such as proactive maintenance of nodes or limiting job resubmissions. We further find that the resource usage patterns of jobs can be leveraged by failure prediction techniques. Finally, we find that the termination statuses of jobs and tasks can be clustered into six dominant categories based on user profiles.

An Empirical Study on the Scalability of Selective Mutation Testing
Software testing plays an important role in ensuring software quality by running a program with test suites. Mutation testing is designed to evaluate whether a test suite is adequate in detecting faults. Due to the expensive cost of mutation testing, selective mutation testing was proposed to select a subset of mutants whose effectiveness is similar to that of the whole set of generated mutants. Although selective mutation testing has been widely investigated in recent years, it remains unclear whether it scales well to large programs. To study the scalability of selective mutation testing, we systematically explore how program size impacts selective mutation testing through four projects (12 versions altogether). Based on the empirical study, for programs smaller than 16 KLOC, selective mutation testing has surprisingly good scalability. In particular, for a program whose number of lines of executable code is E, the number of mutants used in selective mutation testing is proportional to E^c, where c is a constant between 0.05 and 0.25.
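The reported sublinear growth can be illustrated numerically. The proportionality constant k and the sample program sizes below are assumptions for illustration only, not data from the paper:

```python
# Hypothetical illustration of the E^c scaling: the selected-mutant count
# grows as k * E**c for executable-line count E, with c between 0.05 and
# 0.25 according to the study. k = 100 and the sample sizes are invented.

def selected_mutants(E, c, k=100):
    """Number of mutants selected for a program with E executable lines."""
    return k * E ** c

for E in (1_000, 10_000, 100_000):
    print(E, round(selected_mutants(E, c=0.15)))
```

With c = 0.15, growing the program from 1 KLOC to 100 KLOC (a 100x increase) only roughly doubles the mutant count, which is what makes the scalability result attractive.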

Multi-objective Construction of an Entire Adequate Test Suite for an EFSM
In this paper we propose a method and a tool to generate test suites from extended finite state machines, accounting for multiple (potentially conflicting) objectives. We aim at maximizing coverage and feasibility of a test suite while minimizing similarity between its test cases and minimizing overall cost. Therefore, we define a multi-objective genetic algorithm that searches for optimal test suites based on four objective functions. In doing so, we create an entire test suite at once as opposed to test cases one at a time. Our approach is evaluated on two different case studies, showing interesting initial results.

Using Pre-Release Test Failures to Build Early Post-Release Defect Prediction Models
Software quality is one of the most pressing concerns for nearly all software-developing companies. At the same time, software companies seek to shorten their release cycles to meet market demands while maintaining product quality. Identifying problematic code areas thus becomes more and more important. Defect prediction models have become popular in recent years, and many different code and process metrics have been studied. However, there has been minimal effort to relate test executions during development to defect likelihood. This is surprising, as test executions capture the stability and quality of a program during the development process. This paper presents an exploratory study investigating whether test execution metrics, e.g., test failure bursts, can serve as software quality indicators and be used to build pre- and post-release defect prediction models. We show that test metrics collected during Windows 8 development can be used to build pre- and post-release defect prediction models early in the development process of a software product. Test metrics outperform pre-release defect counts when predicting post-release defects.

Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction
Accurate detection of defects prior to product release helps software engineers focus verification activities on defect-prone modules, thus improving the effectiveness of software development. A common scenario is to use the defects from prior releases to build the prediction model for the upcoming release, typically through a supervised learning method. As software development is a dynamic process, fault characteristics in subsequent releases may vary. Therefore, supplementing the defect information from prior releases with limited information about defects detected early in the current release offers intuitive and practical benefits. We propose active learning as a way to automate the development of models that improve the performance of defect prediction between successive releases. Our results show that integrating active learning with uncertainty sampling consistently outperforms the corresponding supervised learning approach. We further improve the prediction performance with feature compression techniques, in which feature selection or dimensionality reduction is applied to defect data prior to active learning. We observe that dimensionality reduction techniques, particularly multidimensional scaling with random forest similarity, work better than feature selection due to their ability to identify and combine essential information in data set features. We demonstrate the improvements offered by this methodology through the prediction of defective modules in three successive versions of Eclipse.
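Uncertainty sampling, the query strategy named above, can be sketched in a few lines: given model-estimated defect probabilities for unlabeled modules, query the ones whose prediction is closest to 0.5. The probabilities below are invented for illustration:

```python
# Minimal sketch of uncertainty sampling for active learning (our
# simplification, not the paper's pipeline). The classifier's defect
# probabilities for the unlabeled pool are invented example values.

def most_uncertain(probs, k):
    """Indices of the k pool items whose probability is closest to 0.5,
    i.e. the items the current model is least sure about."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]

pool_probs = [0.9, 0.52, 0.1, 0.45, 0.7]   # hypothetical model outputs
print(most_uncertain(pool_probs, 2))        # indices 1 and 3 are queried
```

The queried modules are then labeled (inspected for defects) and added to the training set, after which the model is retrained and the cycle repeats.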

An Orchestrated Survey of Available Algorithms and Tools for Combinatorial Testing
For functional testing based on the input domain of a functionality, parameters and their values are identified and a test suite is generated using a criterion that exercises combinations of those parameters and values. Since software systems are large, with correspondingly large numbers of parameters and values, a technique based on combinatorics called Combinatorial Testing (CT) is used to automate the process of creating those combinations. CT is typically performed with the help of combinatorial objects called covering arrays. The goal of the present work is to determine the available algorithms/tools for generating a combinatorial test suite. We tried to be as complete as possible by using a precise protocol for selecting papers describing those algorithms/tools. The 75 algorithms/tools we identified are then categorized on the basis of different comparison criteria, including the test suite generation technique, the support for selection (combination) criteria, mixed covering arrays, the strength of coverage, and the support for constraints between parameters. The results can be of interest to researchers or software companies who are looking for a CT algorithm/tool suitable for their needs.
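The pairwise (strength-2) coverage property that covering arrays guarantee can be illustrated with a toy checker; the parameters and the four-row suite below are invented:

```python
# Toy pairwise-coverage check: a strength-2 covering array must exercise
# every value pair of every parameter pair. The parameter model and the
# candidate suite are invented examples, not from the surveyed tools.
from itertools import combinations, product

params = {"browser": ["ff", "chrome"], "os": ["linux", "win"], "db": ["pg", "mysql"]}

def is_pairwise_covering(suite, params):
    names = list(params)
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        needed = set(product(params[a], params[b]))     # all value pairs
        seen = {(row[i], row[j]) for row in suite}      # pairs the suite hits
        if needed - seen:
            return False
    return True

suite = [
    ("ff", "linux", "pg"), ("ff", "win", "mysql"),
    ("chrome", "linux", "mysql"), ("chrome", "win", "pg"),
]
print(is_pairwise_covering(suite, params))  # True: 4 tests cover all pairs
```

Note the economy this buys: exhaustive testing of this model needs 2x2x2 = 8 tests, while 4 rows already cover every pair, and the gap widens rapidly with more parameters.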

Effective Bug Triage Based on Historical Bug-Fix Information
For complex and popular software, project teams can receive a large number of bug reports. It is often tedious and costly to manually assign these bug reports to developers who have the expertise to fix the bugs. Many bug triage techniques have been proposed to automate this process. In this paper, we describe our study on applying conventional bug triage techniques to projects of different sizes. We find that the effectiveness of a bug triage technique largely depends on the size of the project team (measured in terms of the number of developers). Conventional bug triage methods become less effective as the number of developers increases. To further improve the effectiveness of bug triage for large projects, we propose a novel recommendation method called Bug Fixer, which recommends developers for a new bug report based on historical bug-fix information. Bug Fixer constructs a Developer-Component-Bug (DCB) network, which models the relationship between developers and source code components, as well as the relationship between the components and their associated bugs. A DCB network captures the knowledge of "who fixed what, where". For a new bug report, Bug Fixer uses the DCB network to recommend to the triager a list of suitable developers who could fix this bug. We evaluate Bug Fixer on three large-scale open source projects and two smaller industrial projects. The experimental results show that the proposed method outperforms existing methods for large projects and achieves comparable performance for small projects.
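The "who fixed what, where" idea behind a DCB network can be sketched as a simple fix-count ranking. This is a simplification with invented data, not the paper's implementation, which models the full developer-component-bug graph:

```python
# Simplified sketch of component-based developer recommendation (our
# illustration, not Bug Fixer itself): count past fixes per
# (developer, component) and rank developers for the component a new
# bug report is mapped to. The fix history is invented.
from collections import Counter, defaultdict

fix_history = [  # (developer, component) pairs from closed bug reports
    ("alice", "ui"), ("alice", "ui"), ("bob", "core"),
    ("bob", "ui"), ("carol", "core"), ("carol", "core"),
]

by_component = defaultdict(Counter)
for dev, comp in fix_history:
    by_component[comp][dev] += 1

def recommend(component, k=2):
    """Top-k developers who most often fixed bugs in this component."""
    return [dev for dev, _ in by_component[component].most_common(k)]

print(recommend("ui"))  # alice has 2 ui fixes, bob has 1
```

A real DCB network additionally propagates evidence through the component-bug links, so that developers can be recommended even for components they have not directly touched.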