The art and science of analyzing software data [electronic resource]
by Bird, Christian [editor]; Menzies, Tim [editor]; Zimmermann, Thomas, Ph.D. [editor].
Material type: Book
Publisher: Amsterdam : Morgan Kaufmann/Elsevier, 2015.
Description: 1 online resource (xxiii, 660 pages) : illustrations (some color).
ISBN: 9780124115439; 0124115438; 0124115195; 9780124115194.
Subject(s): Data mining | Computer programming -- Management | COMPUTERS -- Database Management -- Data Mining | Electronic books
Online resources: ScienceDirect
Note: Online resource; title from PDF title page (EBSCO, viewed September 9, 2015).
Includes bibliographical references and index.
This book provides valuable information on analysis techniques often used to derive insight from software data. It shares best practices in the field generated by leading data scientists, collected from their experience training software engineering students and practitioners to master data science. Topics include analysis of security data, code reviews, app stores, log files, and user telemetry; co-change, text, topic, and concept analyses; release planning; and generation of source code comments. It also includes stories from the trenches by expert data scientists, illustrating how to apply data analysis in industry and open source projects, present results to stakeholders, and drive decisions. -- Edited summary from book.
Ch. 1 Past, Present, and Future of Analyzing Software Data -- 1.1. Definitions -- 1.2. The Past: Origins -- 1.2.1. Generation 1: Preliminary Work -- 1.2.2. Generation 2: Academic Experiments -- 1.2.3. Generation 3: Industrial Experiments -- 1.2.4. Generation 4: Data Science Everywhere -- 1.3. Present Day -- 1.4. Conclusion -- Acknowledgments -- References -- ch. 2 Mining Patterns and Violations Using Concept Analysis -- 2.1. Introduction -- 2.1.1. Contributions -- 2.2. Patterns and Blocks -- 2.3. Computing All Blocks -- 2.3.1. Algorithm in a Nutshell -- 2.4. Mining Shopping Carts with Colibri -- 2.5. Violations -- 2.6. Finding Violations -- 2.7. Two Patterns or One Violation? -- 2.8. Performance -- 2.9. Encoding Order -- 2.10. Inlining -- 2.11. Related Work -- 2.11.1. Mining Patterns -- 2.11.2. Mining Violations -- 2.11.3. PR-Miner -- 2.12. Conclusions -- Acknowledgments -- References -- ch. 3 Analyzing Text in Software Projects -- 3.1. Introduction.
3.2. Textual Software Project Data and Retrieval -- 3.2.1. Textual Data -- 3.2.2. Text Retrieval -- 3.3. Manual Coding -- 3.3.1. Coding Process -- 3.3.2. Challenges -- 3.4. Automated Analysis -- 3.4.1. Topic Modeling -- 3.4.2. Part-of-Speech Tagging and Relationship Extraction -- 3.4.3. n-Grams -- 3.4.4. Clone Detection -- 3.4.5. Visualization -- 3.5. Two Industrial Studies -- 3.5.1. Naming the Pain in Requirements Engineering: A Requirements Engineering Survey -- 3.5.2. Clone Detection in Requirements Specifications -- 3.6. Summary -- References -- ch. 4 Synthesizing Knowledge from Software Development Artifacts -- 4.1. Problem Statement -- 4.2. Artifact Lifecycle Models -- 4.2.1. Example: Patch Lifecycle -- 4.2.2. Model Extraction -- 4.3. Code Review -- 4.3.1. Mozilla Project -- 4.3.2. WebKit Project -- 4.3.3. Blink Project -- 4.4. Lifecycle Analysis -- 4.4.1. Mozilla Firefox -- 4.4.2. WebKit -- 4.4.3. Blink -- 4.5. Other Applications -- 4.6. Conclusion -- References.
Ch. 5 A Practical Guide to Analyzing IDE Usage Data -- 5.1. Introduction -- 5.2. Usage Data Research Concepts -- 5.2.1. What is Usage Data and Why Should We Analyze it? -- 5.2.2. Selecting Relevant Data on the Basis of a Goal -- 5.2.3. Privacy Concerns -- 5.2.4. Study Scope -- 5.3. How to Collect Data -- 5.3.1. Eclipse Usage Data Collector -- 5.3.2. Mylyn and the Eclipse Mylyn Monitor -- 5.3.3. CodingSpectator -- 5.3.4. Build it Yourself for Visual Studio -- 5.4. How to Analyze Usage Data -- 5.4.1. Data Anonymity -- 5.4.2. Usage Data Format -- 5.4.3. Magnitude Analysis -- 5.4.4. Categorization Analysis -- 5.4.5. Sequence Analysis -- 5.4.6. State Model Analysis -- 5.4.7. The Critical Incident Technique -- 5.4.8. Including Data from Other Sources -- 5.5. Limits of What You Can Learn from Usage Data -- 5.6. Conclusion -- 5.7. Code Listings -- Acknowledgments -- References -- ch. 6 Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data -- 6.1. Introduction.
6.2. Applications of LDA in Software Analysis -- 6.3. How LDA Works -- 6.4. LDA Tutorial -- 6.4.1. Materials -- 6.4.2. Acquiring Software-Engineering Data -- 6.4.3. Text Analysis and Data Transformation -- 6.4.4. Applying LDA -- 6.4.5. LDA Output Summarization -- 6.5. Pitfalls and Threats to Validity -- 6.5.1. Criterion Validity -- 6.5.2. Construct Validity -- 6.5.3. Internal Validity -- 6.5.4. External Validity -- 6.5.5. Reliability -- 6.6. Conclusions -- References -- ch. 7 Tools and Techniques for Analyzing Product and Process Data -- 7.1. Introduction -- 7.2. A Rational Analysis Pipeline -- 7.2.1. Getting the Data -- 7.2.2. Selecting -- 7.2.3. Processing -- 7.2.4. Summarizing -- 7.2.5. Plumbing -- 7.3. Source Code Analysis -- 7.3.1. Heuristics -- 7.3.2. Lexical Analysis -- 7.3.3. Parsing and Semantic Analysis -- 7.3.4. Third-Party Tools -- 7.4. Compiled Code Analysis -- 7.4.1. Assembly Language -- 7.4.2. Machine Code -- 7.4.3. Dealing with Name Mangling -- 7.4.4. Byte Code.
7.4.5. Dynamic Linking -- 7.4.6. Libraries -- 7.5. Analysis of Configuration Management Data -- 7.5.1. Obtaining Repository Data -- 7.5.2. Analyzing Metadata -- 7.5.3. Analyzing Time Series Snapshots -- 7.5.4. Analyzing a Checked Out Repository -- 7.5.5. Combining Files with Metadata -- 7.5.6. Assembling Repositories -- 7.6. Data Visualization -- 7.6.1. Graphs -- 7.6.2. Declarative Diagrams -- 7.6.3. Charts -- 7.6.4. Maps -- 7.7. Concluding Remarks -- References -- ch. 8 Analyzing Security Data -- 8.1. Vulnerability -- 8.1.1. Exploits -- 8.2. Security Data "Gotchas" -- 8.2.1. Gotcha #1. Having Vulnerabilities is Normal -- 8.2.2. Gotcha #2. "More Vulnerabilities" Does not Always Mean "Less Secure" -- 8.2.3. Gotcha #3. Design-Level Flaws are not Usually Tracked -- 8.2.4. Gotcha #4. Security is Negatively Defined -- 8.3. Measuring Vulnerability Severity -- 8.3.1. CVSS Overview -- 8.3.2. Example CVSS Application -- 8.3.3. Criticisms of the CVSS.
8.4. Method of Collecting and Analyzing Vulnerability Data -- 8.4.1. Step 1. Trace Reported Vulnerabilities Back to Fixes -- 8.4.2. Step 2. Aggregate Source Control Logs -- 8.4.3. Step 3a. Determine Vulnerability Coverage -- 8.4.4. Step 3c. Classify According to Engineering Mistake -- 8.5. What Security Data has Told Us Thus Far -- 8.5.1. Vulnerabilities have Socio-Technical Elements -- 8.5.2. Vulnerabilities have Long, Complex Histories -- 8.6. Summary -- References -- ch. 9 A Mixed Methods Approach to Mining Code Review Data: Examples and a Study of Multicommit Reviews and Pull Requests -- 9.1. Introduction -- 9.2. Motivation for a Mixed Methods Approach -- 9.3. Review Process and Data -- 9.3.1. Software Inspection -- 9.3.2. OSS Code Review -- 9.3.3. Code Review at Microsoft -- 9.3.4. Google-Based Gerrit Code Review -- 9.3.5. GitHub Pull Requests -- 9.3.6. Data Measures and Attributes -- 9.4. Quantitative Replication Study: Code Review on Branches.
9.4.1. Research Question 1-Commits per Review -- 9.4.2. Research Question 2-Size of Commits -- 9.4.3. Research Question 3-Review Interval -- 9.4.4. Research Question 4-Reviewer Participation -- 9.4.5. Conclusion -- 9.5. Qualitative Approaches -- 9.5.1. Sampling Approaches -- 9.5.2. Data Collection -- 9.5.3. Qualitative Analysis of Microsoft Data -- 9.5.4. Applying Grounded Theory to Archival Data to Understand OSS Review -- 9.6. Triangulation -- 9.6.1. Using Surveys to Triangulate Qualitative Findings -- 9.6.2. How Multicommit Branches are Reviewed in Linux -- 9.6.3. Closed Coding: Branch or Revision on GitHub and Gerrit -- 9.6.4. Understanding Why Pull Requests are Rejected -- 9.7. Conclusion -- References -- ch. 10 Mining Android Apps for Anomalies -- 10.1. Introduction -- 10.2. Clustering Apps by Description -- 10.2.1. Collecting Applications -- 10.2.2. Preprocessing Descriptions with NLP -- 10.2.3. Identifying Topics with LDA -- 10.2.4. Clustering Apps with K-means.
10.2.5. Finding the Best Number of Clusters -- 10.2.6. Resulting App Clusters -- 10.3. Identifying Anomalies by APIs -- 10.3.1. Extracting API Usage -- 10.3.2. Sensitive and Rare APIs -- 10.3.3. Distance-Based Outlier Detection -- 10.3.4. CHABADA as a Malware Detector -- 10.4. Evaluation -- 10.4.1. RQ1: Anomaly Detection -- 10.4.2. RQ2: Feature Selection -- 10.4.3. RQ3: Malware Detection -- 10.4.4. Limitations and Threats to Validity -- 10.5. Related Work -- 10.5.1. Mining App Descriptions -- 10.5.2. Behavior/Description Mismatches -- 10.5.3. Detecting Malicious Apps -- 10.6. Conclusion and Future Work -- Acknowledgments -- References -- ch. 11 Change Coupling Between Software Artifacts: Learning from Past Changes -- 11.1. Introduction -- 11.2. Change Coupling -- 11.2.1. Why Do Artifacts Co-Change? -- 11.2.2. Benefits of Using Change Coupling -- 11.3. Change Coupling Identification Approaches -- 11.3.1. Raw Counting -- 11.3.2. Association Rules -- 11.3.3. Time-Series Analysis.
11.4. Challenges in Change Coupling Identification -- 11.4.1. Impact of Commit Practices -- 11.4.2. Practical Advice for Change Coupling Detection -- 11.4.3. Alternative Approaches -- 11.5. Change Coupling Applications -- 11.5.1. Change Prediction and Change Impact Analysis -- 11.5.2. Discovery of Design Flaws and Opportunities for Refactoring -- 11.5.3. Architecture Evaluation -- 11.5.4. Coordination Requirements and Socio-Technical Congruence -- 11.6. Conclusion -- References -- ch. 12 Applying Software Data Analysis in Industry Contexts: When Research Meets Reality -- 12.1. Introduction -- 12.2. Background -- 12.2.1. Fraunhofer's Experience in Software Measurement -- 12.2.2. Terminology -- 12.2.3. Empirical Methods -- 12.2.4. Applying Software Measurement in Practice-The General Approach -- 12.3. Six Key Issues when Implementing a Measurement Program in Industry -- 12.3.1. Stakeholders, Requirements, and Planning: The Groundwork for a Successful Measurement Program.
12.3.2. Gathering Measurements-How, When, and Who -- 12.3.3. All Data, No Information-When the Data is not What You Need or Expect -- 12.3.4. The Pivotal Role of Subject Matter Expertise -- 12.3.5. Responding to Changing Needs -- 12.3.6. Effective Ways to Communicate Analysis Results to the Consumers -- 12.4. Conclusions -- References -- ch. 13 Using Data to Make Decisions in Software Engineering: Providing a Method to our Madness -- 13.1. Introduction -- 13.2. Short History of Software Engineering Metrics -- 13.3. Establishing Clear Goals -- 13.3.1. Benchmarking -- 13.3.2. Product Goals -- 13.4. Review of Metrics -- 13.4.1. Contextual Metrics -- 13.4.2. Constraint Metrics -- 13.4.3. Development Metrics -- 13.5. Challenges with Data Analysis on Software Projects -- 13.5.1. Data Collection -- 13.5.2. Data Interpretation -- 13.6. Example of Changing Product Development Through the Use of Data -- 13.7. Driving Software Engineering Processes with Data -- References.
Ch. 14 Community Data for OSS Adoption Risk Management -- 14.1. Introduction -- 14.2. Background -- 14.2.1. Risk and Open Source Software Basic Concepts -- 14.2.2. Modeling and Analysis Techniques -- 14.3. An Approach to OSS Risk Adoption Management -- 14.4. OSS Communities Structure and Behavior Analysis: The XWiki Case -- 14.4.1. OSS Community Social Network Analysis -- 14.4.2. Statistical Analytics of Software Quality, OSS Communities' Behavior and OSS Projects -- 14.4.3. Risk Indicators Assessment via Bayesian Networks -- 14.4.4. OSS Ecosystems Modeling and Reasoning in i* -- 14.4.5. Integrating the Analysis for a Comprehensive Risk Assessment -- 14.5. A Risk Assessment Example: The Moodbile Case -- 14.6. Related Work -- 14.6.1. Data Analysis in OSS Communities -- 14.6.2. Risk Modeling and Analysis via Goal-oriented Techniques -- 14.7. Conclusions -- Acknowledgments -- References.
Ch. 15 Assessing the State of Software in a Large Enterprise: A 12-Year Retrospective -- 15.1. Introduction -- 15.2. Evolution of the Process and the Assessment -- 15.3. Impact Summary of the State of Avaya Software Report -- 15.4. Assessment Approach and Mechanisms -- 15.4.1. Evolution of the Approach Over Time -- 15.5. Data Sources -- 15.5.1. Data Accuracy -- 15.5.2. Types of Data Analyzed -- 15.6. Examples of Analyses -- 15.6.1. Demographic Analyses -- 15.6.2. Analysis of Predictability -- 15.6.3. Risky File Management -- 15.7. Software Practices -- 15.7.1. Original Seven Key Software Areas -- 15.7.2. Four Practices Tracked as Representative -- 15.7.3. Example Practice Area-Design Quality In -- 15.7.4. Example Individual Practice-Static Analysis -- 15.8. Assessment Follow-up: Recommendations and Impact -- 15.8.1. Example Recommendations -- 15.8.2. Deployment of Recommendations -- 15.9. Impact of the Assessments -- 15.9.1. Example: Automated Build Management.
15.9.2. Example: Deployment of Risky File Management -- 15.9.3. Improvement in Customer Quality Metric (CQM) -- 15.10. Conclusions -- 15.10.1. Impact of the Assessment Process -- 15.10.2. Factors Contributing to Success -- 15.10.3. Organizational Attributes -- 15.10.4. Selling the Assessment Process -- 15.10.5. Next Steps -- 15.11. Appendix -- 15.11.1. Example Questions Used for Input Sessions -- Acknowledgments -- References -- ch. 16 Lessons Learned from Software Analytics in Practice -- 16.1. Introduction -- 16.2. Problem Selection -- 16.3. Data Collection -- 16.3.1. Datasets -- 16.3.2. Data Extraction -- 16.4. Descriptive Analytics -- 16.4.1. Data Visualization -- 16.4.2. Reporting via Statistics -- 16.5. Predictive Analytics -- 16.5.1. A Predictive Model for all Conditions -- 16.5.2. Performance Evaluation -- 16.5.3. Prescriptive Analytics -- 16.6. Road Ahead -- References -- ch. 17 Code Comment Analysis for Improving Software Quality -- 17.1. Introduction.
17.1.1. Benefits of Studying and Analyzing Code Comments -- 17.1.2. Challenges of Studying and Analyzing Code Comments -- 17.1.3. Code Comment Analysis for Specification Mining and Bug Detection -- 17.2. Text Analytics: Techniques, Tools, and Measures -- 17.2.1. Natural Language Processing -- 17.2.2. Machine Learning -- 17.2.3. Analysis Tools -- 17.2.4. Evaluation Measures -- 17.3. Studies of Code Comments -- 17.3.1. Content of Code Comments -- 17.3.2. Common Topics of Code Comments -- 17.4. Automated Code Comment Analysis for Specification Mining and Bug Detection -- 17.4.1. What Should We Extract? -- 17.4.2. How Should We Extract Information? -- 17.4.3. Additional Reading -- 17.5. Studies and Analysis of API Documentation -- 17.5.1. Studies of API Documentation -- 17.5.2. Analysis of API Documentation -- 17.6. Future Directions and Challenges -- References -- ch. 18 Mining Software Logs for Goal-Driven Root Cause Analysis -- 18.1. Introduction.
18.2. Approaches to Root Cause Analysis -- 18.2.1. Rule-Based Approaches -- 18.2.2. Probabilistic Approaches -- 18.2.3. Model-Based Approaches -- 18.3. Root Cause Analysis Framework Overview -- 18.4. Modeling Diagnostics for Root Cause Analysis -- 18.4.1. Goal Models -- 18.4.2. Antigoal Models -- 18.4.3. Model Annotations -- 18.4.4. Loan Application Scenario -- 18.5. Log Reduction -- 18.5.1. Latent Semantic Indexing -- 18.5.2. Probabilistic Latent Semantic Indexing -- 18.6. Reasoning Techniques -- 18.6.1. Markov Logic Networks -- 18.7. Root Cause Analysis for Failures Induced by Internal Faults -- 18.7.1. Knowledge Representation -- 18.7.2. Diagnosis -- 18.8. Root Cause Analysis for Failures due to External Threats -- 18.8.1. Antigoal Model Rules -- 18.8.2. Inference -- 18.9. Experimental Evaluations -- 18.9.1. Detecting Root Causes due to Internal Faults -- 18.9.2. Detecting Root Causes due to External Actions -- 18.9.3. Performance Evaluation -- 18.10. Conclusions.
19.5.1. OTT Case Study-The Context and Content -- 19.5.2. Formalization of the Problem -- 19.5.3. The Case Study Process -- 19.5.4. Release Planning in the Presence of Advanced Feature Dependencies and Synergies -- 19.5.5. Real-Time What-to-Release Planning -- 19.5.6. Re-Planning Based on Crowd Clustering -- 19.5.7. Conclusions and Discussion of Results -- 19.6. Summary and Future Research -- 19.7. Appendix: Feature Dependency Constraints -- Acknowledgments -- References -- ch. 20 Boa: An Enabling Language and Infrastructure for Ultra-Large-Scale MSR Studies -- 20.1. Objectives -- 20.2. Getting Started with Boa -- 20.2.1. Boa's Architecture -- 20.2.2. Submitting a Task -- 20.2.3. Obtaining the Results -- 20.3. Boa's Syntax and Semantics -- 20.3.1. Basic and Compound Types -- 20.3.2. Output Aggregation -- 20.3.3. Expressing Loops with Quantifiers -- 20.3.4. User-Defined Functions -- 20.4. Mining Project and Repository Metadata.
20.4.1. Types for Mining Software Repositories -- 20.4.2. Example 1: Mining Top 10 Programming Languages -- 20.4.3. Intrinsic Functions -- 20.4.4. Example 2: Mining Revisions that Fix Bugs -- 20.4.5. Example 3: Computing Project Churn Rates -- 20.5. Mining Source Code with Visitors -- 20.5.1. Types for Mining Source Code -- 20.5.2. Intrinsic Functions -- 20.5.3. Visitor Syntax -- 20.5.4. Example 4: Mining AST Count -- 20.5.5. Custom Traversal Strategies -- 20.5.6. Example 5: Mining for Added Null Checks -- 20.5.7. Example 6: Finding Unreachable Code -- 20.6. Guidelines for Replicable Research -- 20.7. Conclusions -- 20.8. Practice Problems -- References -- ch. 21 Scalable Parallelization of Specification Mining Using Distributed Computing -- 21.1. Introduction -- 21.2. Background -- 21.2.1. Specification Mining Algorithms -- 21.2.2. Distributed Computing -- 21.3. Distributed Specification Mining -- 21.3.1. Principles -- 21.3.2. Algorithm-Specific Parallelization.
21.4. Implementation and Empirical Evaluation -- 21.4.1. Dataset and Experimental Settings -- 21.4.2. Research Questions and Results -- 21.4.3. Threats to Validity and Current Limitations -- 21.5. Related Work -- 21.5.1. Specification Mining and Its Applications -- 21.5.2. MapReduce in Software Engineering -- 21.5.3. Parallel Data Mining Algorithms -- 21.6. Conclusion and Future Work.