The Use of Text Retrieval and Natural Language Processing in Software Engineering

During software evolution, many related artifacts are created or modified. Some of these are composed of structured data (e.g., analysis data), some contain semi-structured information (e.g., source code), and many include unstructured information (e.g., natural language text). Software artifacts written in natural language (e.g., requirements and bug reports), together with the comments and identifiers in the source code, encode to a large degree the domain of the software, the developers’ knowledge about the system, design decisions, and developer information. Retrieving and analyzing the textual information in software is therefore important for supporting a variety of Software Engineering (SE) tasks.

Text Retrieval (TR) is a branch of Information Retrieval (IR) that leverages information stored primarily in the form of text. TR methods have proven to be suitable candidates for retrieving and analyzing textual data embedded in software or present in other sources. TR techniques treat text as a bag of words, ignoring word order and grammatical structure. Thus, they are often used in conjunction with Natural Language Processing (NLP) tools to analyze, for example, the structure of sentences and the meaning of words.
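
To make the bag-of-words idea concrete, here is a minimal sketch in Python. It is an illustration only, not a technique prescribed by the course: the helper names (split_identifier, bag_of_words), the sample bug report, and the choice to split camelCase and snake_case identifiers are all assumptions made for this example.

```python
import re
from collections import Counter

def split_identifier(token):
    """Split camelCase and snake_case identifiers into lowercase words.

    Identifier splitting is an assumption of this sketch; it is a common
    preprocessing step when source code is treated as text.
    """
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [p.lower() for p in parts.split()]

def bag_of_words(text):
    """Return term counts for a text; word order is discarded."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    words = [w for t in tokens for w in split_identifier(t)]
    return Counter(words)

# Hypothetical artifacts: a bug report sentence and a method name.
bug_report = "Crash when calling parseConfigFile on empty file"
method_name = "parse_config_file"

report_bag = bag_of_words(bug_report)
method_bag = bag_of_words(method_name)

# Terms shared by the two artifacts, regardless of where they appear.
shared = sorted(set(report_bag) & set(method_bag))
print(shared)
```

Because only term counts are kept, "open file" and "file open" produce identical bags. This loss of structure is one reason TR is commonly paired with NLP tools.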

The course starts by introducing background on TR and NLP techniques and tools. Next, we will review and discuss the application of TR and NLP to different SE tasks. The course focuses on research articles; no textbook is required.

Course Name: The Use of Text Retrieval and Natural Language Processing in Software Engineering
Course Number: Cpt S 595
Credits: 3
Semester: Fall 2016
Prerequisites: Graduate standing.
Course required/elective: elective.

Schedule: Wed 9:30am – 12:00pm
Location: Sloan 326
Course webpage: http://www.veneraarnaoudova.ca/cpt-s-595-tr-and-nlp-in-se-fall-2016/
Professors/Coordinators: Venera Arnaoudova.
Office: EME 127.
Office hours: By e-mail appointment.

Resources:

Recommended textbook(s):

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.
[2] Dan Jurafsky and James H. Martin, Speech and Language Processing, 2nd Ed., 2007.

Journal/conference papers:

[3] S. Keshav. 2007. How to read a paper. SIGCOMM Comput. Commun. Rev. 37, 3 (July 2007), 83-84.
[4] Mary Shaw. 2003. Writing good software engineering research papers: minitutorial. In Proceedings of the 25th International Conference on Software Engineering (ICSE ’03). IEEE Computer Society, 726-736.
[5] B.A. Kitchenham, S. Charters, Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report EBSE-2007-01, 2007.
[6] S. L. Abebe, S. Haiduc, P. Tonella, and A. Marcus. Lexicon bad smells in software. In Proceedings of the Working Conference on Reverse Engineering (WCRE), pages 95–99, 2009.
[7] V. Arnaoudova, M. Di Penta, and G. Antoniol. “Linguistic antipatterns: What they are and how developers perceive them”, Empirical Software Engineering (EMSE), pages 1–55, 2015.
[8] S. L. Abebe, P. Tonella, “Towards the Extraction of Domain Concepts from the Identifiers” in 18th Working Conference on Reverse Engineering (WCRE), 2011, pp. 77-86.
[9] Matthew J. Howard, Samir Gupta, Lori Pollock, and K. Vijay-Shanker. Automatically mining software-based, semantically-similar words from comment-code mappings. In Proceedings of the Working Conference on Mining Software Repositories (MSR), pages 377-386, 2013.
[10] Haiduc, S.; Aponte, J.; Moreno, L.; Marcus, A., “On the Use of Automated Text Summarization Techniques for Summarizing Source Code,” in Proceedings of the Working Conference on Reverse Engineering (WCRE), pp. 35-44, 2010.
[11] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, A. Marcus, and G. Canfora. “Automatic generation of release notes”. In Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2014, pp. 484–495.
[12] Linares-Vasquez, M.; Hossen, K.; Dang, H.; Kagdi, H.; Gethers, M.; Poshyvanyk, D. “Triaging Incoming Change Requests: Bug or Commit History, or Code Authorship?” in 28th IEEE International Conference on Software Maintenance (ICSM), 2012, pp. 451-460.
[13] Runeson, P., Alexandersson, M., and Nyholm, O., “Detection of Duplicate Defect Reports Using Natural Language Processing”, in Proceedings of the International Conference on Software Engineering (ICSE), 2007, pp. 499-510.
[14] Poshyvanyk, D., Gueheneuc, Y.-G., Marcus, A., Antoniol, G., and Rajlich, V., “Feature Location using Probabilistic Ranking of Methods based on Execution Scenarios and Information Retrieval”, IEEE Transactions on Software Engineering, vol. 33, no. 6, June 2007.
[15] Moreno, L.; Treadway, J. J.; Marcus, A. & Shen, W. “On the Use of Stack Traces to Improve Text Retrieval-Based Bug Localization” in IEEE International Conference on Software Maintenance and Evolution, IEEE, 2014, pp. 151-160.
[16] A. Marcus and J. Maletic, “Recovering documentation-to-source-code traceability links using latent semantic indexing,” in Proceedings of the 25th International Conference on Software Engineering, 2003. IEEE, May 2003, pp. 125–135.
[17] Panichella, A., McMillan, C., Moritz, E., Palmieri, D., Oliveto, R., Poshyvanyk, D., and De Lucia, A., “When and How Using Structural Information to Improve IR-Based Traceability Recovery”, in Proceedings of the European Conference on Software Maintenance and Reengineering (CSMR), 2013, pp. 199-208.
[18] Tian, K., Revelle, M., and Poshyvanyk, D., “Using Latent Dirichlet Allocation for Automatic Categorization of Software”, in Proceedings of the Working Conference on Mining Software Repositories (MSR), 2009, pp. 163-166.
[19] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In Proceedings of the International Conference on Software Engineering (ICSE), 2014, pp. 1025-1035.
[20] Antoniol, G., Hayes, J. H., Gueheneuc, Y.-G., and Di Penta, M., “Reuse or rewrite: Combining textual, static, and dynamic analyses to assess the cost of keeping a system up-to-date”, in Proceedings of the International Conference on Software Maintenance (ICSM), 2008, pp. 147-156.
[21] McMillan, C., Grechanik, M., Poshyvanyk, D., Fu, C., and Xie, Q., “Exemplar: A Source Code Search Engine For Finding Highly Relevant Applications”, IEEE Transactions on Software Engineering (TSE), 2012, 38, pp. 1069-1087.
[22] Poshyvanyk, D., Marcus, A., Ferenc, R., and Gyimóthy, T., “Using Information Retrieval based Coupling Measures for Impact Analysis”, Empirical Software Engineering, Vol. 14, No. 1, February 2009, pp. 5-32.
[23] Gethers, M., Kagdi, H., Dit, B., and Poshyvanyk, D., “An Adaptive Approach to Impact Analysis from Change Requests to Source Code”, in Proceedings of the International Conference on Automated Software Engineering (ASE), 2011, pp. 540-543.

Course description: Application of text retrieval and natural language processing techniques and tools to solve software engineering tasks.

Overview and Course Goals: This course provides basic background on Text Retrieval (TR) and Natural Language Processing (NLP) techniques and tools. It then reviews the application of TR and NLP for different Software Engineering tasks.

Course topics:

– Text Retrieval
– Natural Language Processing
– Software engineering tasks, e.g.,

  • Refactoring
  • Reverse engineering
  • (Re)documentation
  • Concept Location
  • Traceability
  • Software Categorization

Learning outcomes and evaluation:

Students who successfully complete the course will:

  1. Be able to summarize, critique, and present papers applying TR and NLP techniques in Software Engineering.
  2. Be able to perform a systematic literature survey.
  3. Be able to apply TR and NLP techniques to solve a research problem in a different domain.
  4. Improve their ability to critique and present a research paper.
  5. Write a research paper with clear motivation, comparison with existing work, methodology, etc.

Week-by-week schedule:

All submissions are due on the specified date by midnight Pacific Time.

Week | Date | Topics | Resources | Deadlines
1-2 | 08/24-31 | Syllabus. How to write, read, and present a research paper. Literature survey. | [3], [4], [5] |
1-2 | 08/24-31 | Introduction to the use of TR and NLP in Software Engineering. | |
3 | 09/07 | Preprocessing. | Ch. 2 [1] |
3 | 09/07 | Background on TR methods. | Ch. 1, 6, 18 [1] |
4 | 09/14 | Background on NLP methods. | Ch. 4, 5, 12, 19 [2] |
4 | 09/14 | Hands-on. | | Written project (research and implementation) proposal due by September 17 (updated deadline).
5 | 09/21 | Research project proposal presentations. | |
5 | 09/21 | Implementation project proposal presentations. | | Students presenting next week must send details on the paper by September 21.
6 | 09/28 | Refactoring. Identifying poor quality identifiers and naming inconsistencies. | [6], [7] |
6 | 09/28 | Paper presentations. | See the paper presentation schedule. | Students presenting next week must send details on the paper by September 28.
7 | 10/05 | Reverse Engineering. Building software ontologies. Identifying semantic relations between words. | [8], [9] |
7 | 10/05 | Paper presentations. | See the paper presentation schedule. | Students presenting next week must send details on the paper by October 5.
8 | 10/12 | (Re)documentation. Extracting a set of important keywords. Generating natural language sentences. | [10], [11] |
8 | 10/12 | Paper presentations. | See the paper presentation schedule. | Students presenting next week must send details on the paper by October 12.
9 | 10/19 | Bug triage and bug report analysis. | [12], [13] |
9 | 10/19 | Paper presentations. | TBA |
10 | 10/26 | Project presentations. | |
10 | 10/26 | | | Students presenting next week must send details on the paper by October 26.
11 | 11/02 | Concept location. | [14], [15] |
11 | 11/02 | Paper presentations. | TBA | Students presenting next week must send details on the paper by November 2.
12 | 11/09 | Traceability link recovery. | [16], [17] |
12 | 11/09 | Paper presentations. | TBA | Students presenting next week must send details on the paper by November 9.
13 | 11/16 | Software categorization. | [18], [19] |
13 | 11/16 | Paper presentations. | TBA | Students presenting next week must send details on the paper by November 16.
14 | 11/23 | No class: Thanksgiving vacation. | |
14 | 11/23 | | | Students presenting next week must send details on the paper by November 23.
15 | 11/30 | Software reuse. | [20], [21] |
15 | 11/30 | Paper presentations. | TBA |
16 | 12/07 | Research project presentations. | |
16 | 12/07 | Implementation project presentations. | |

Grading framework: Course grades are based on a research project totaling 40% of the final grade, paper presentations totaling 20% of the final grade, an implementation project totaling 25% of the final grade, and participation in class totaling 15% of the final grade.

The research project consists of applying a TR/NLP technique to solve a problem of your choice. This is an individual project that accounts for 40% of the final grade and will be evaluated as follows:
– 5% written project proposal. The submission must use the IEEE template for conference proceedings and must include at minimum an abstract, introduction, related work, and methodology sections.
– 5% proposal presentation (15 min, including questions).
– 25% final project submission. The submission must include a complete paper (a continuation of the project proposal) of 10 pages including references, the source code of the implementation, the data used to evaluate the approach, documentation explaining the artifacts, and a user manual.
– 5% final project presentation.

Paper presentations account for 20% of the final grade.

The implementation project consists of implementing an existing approach that uses TR or NLP for an SE task. This is an individual project that accounts for 25% of the final grade and will be evaluated as follows:
– 5% written proposal describing the steps and tools that will be used for the implementation.
– 5% proposal presentation (15 min, including questions).
– 15% source code and test cases.

Final grades will be awarded on the following scale:
Interval            Grade
[90,100]          A
[87,90)            A‐
[83,87)            B+
[80,83)            B
[77,80)            B‐
[73,77)            C+
[70,73)            C
[67,70)            C‐
[63,67)            D+
[60,63)            D
[0,60)             F

Course rules:

Unless posted otherwise, assignment documents shall be submitted electronically.

The late penalty is a flat 10% deduction per day. Late assignments may be turned in up to one week after the original due date, and advance notice of the late submission must be given to the instructor. No homework will be accepted after its due date without advance notice or special permission from the instructor.

Reasonable Accommodation:

Reasonable accommodations are available for students with a documented disability. If you have a disability and need accommodations to fully participate in this class, please either visit or call the Access Center (Washington Building 217; 509-335-3417) to schedule an appointment with an Access Advisor. All accommodations MUST be approved through the Access Center.

Academic Integrity:

I encourage you to work with classmates on assignments. However, each student must turn in original work. No copying will be accepted. Students who violate WSU’s Standards of Conduct for Students will receive an F as a final grade in this course, will not have the option to withdraw from the course, and will be reported to the Office of Student Conduct. Cheating is defined in the Standards of Conduct for Students, WAC 504-26-010 (3). It is strongly suggested that you read and understand these definitions. (Read more: http://apps.leg.wa.gov/wac/default.aspx?cite=504-26-010)

Safety:

Washington State University is committed to maintaining a safe environment for its faculty, staff, and students. Safety is the responsibility of every member of the campus community and individuals should know the appropriate actions to take when an emergency arises. In support of our commitment to the safety of the campus community the University has developed a Campus Safety Plan, http://safetyplan.wsu.edu. It is highly recommended that you visit this web site as well as the University emergency management web site at http://oem.wsu.edu/ to become familiar with the information provided.