Question Answering System
By Vincent Lin & Lester Lee
Our Question-Answer system reads in a list of questions and generates up to 10 answers for each question by pulling relevant information from the corresponding documents. It will output these answers to the file predictions.txt. To run our program, use the command
python qasystem.py [start question #] [end question #]
If no parameters are specified, it will try to answer every question from QID 0 to QID 300. If a QID does not appear in the question list, it simply skips that question.
Question Processing
First, we tokenize the question using NLTK’s sent_tokenize and word_tokenize. We then filter the stop words from the question using NLTK’s stopwords corpus to produce a tuple of two lists: the first list contains every word that isn’t a stopword, and the second contains every word that is. We keep track of the question’s stopwords because they help us decide what type of question it is later on. We also lowercase the stopwords so that they match the words in the question more reliably.
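A minimal sketch of this step for a single-sentence question (the function name process_question is ours, not necessarily the one used in qasystem.py):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english'))

def process_question(question):
    # Split a question into (non-stopwords, lowercased stopwords).
    tokens = word_tokenize(question)
    keywords = [t for t in tokens if t.lower() not in STOPWORDS]
    stops = [t.lower() for t in tokens if t.lower() in STOPWORDS]
    return keywords, stops

keywords, stops = process_question("Where is Bolivia located?")
# keywords -> ['Bolivia', 'located', '?']
# stops    -> ['where', 'is']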
We do not do any further preprocessing of the questions. We remove the stopwords because they are irrelevant to the query. In general, every tokenizer and named entity tagger has its own preprocessing pipeline. For example, when we passed lowercased text with the stopwords removed into the NLTK NER and the Stanford NER, “Bolivia” was tagged as a verb (VB). We believe this is because both use a syntactic parser for sentence processing.
Passage Retrieval
We use regular expressions to parse out all the text between <TEXT>...</TEXT> tags to retrieve the text from each of the top 50 documents. We then use NLTK’s sent_tokenize and word_tokenize to tokenize each document. Each document has a score (a float value), which we use to rank how relevant each document is.
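A minimal sketch of this extraction step (extract_sentences is a name we use here for illustration):

import re
from nltk.tokenize import sent_tokenize, word_tokenize

def extract_sentences(topdoc_text):
    # Grab everything between <TEXT> ... </TEXT> tags, then tokenize by sentence.
    blocks = re.findall(r'<TEXT>(.*?)</TEXT>', topdoc_text, re.DOTALL)
    sentences = []
    for block in blocks:
        for sent in sent_tokenize(block):
            sentences.append(word_tokenize(sent))
    return sentences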
For each sentence, we use frequency feature vectors to represent the question and the document, where each position in the vector holds the number of times a specific word shows up in the text. Since we use the non-stopwords from the question as the features, the question itself is represented as the vector <1, … , 1>. We think this is the most reasonable implementation of passage retrieval for our purposes: each of our passages is a sentence, and the number of times the question keywords appear in a passage is the most important signal. Our top passages generally get 4-5 keyword hits, which we hope filters out weaker passages that have lower scores but higher raw keyword frequency.
We do not use cosine similarity because it only takes into account the angle between two vectors, not their magnitudes. Instead, we use dot-product similarity to measure how similar a passage is to the question. This means the dot-product similarity between the question and the document is simply the sum of the frequencies of the question’s non-stopwords that show up in the document. However, since each document has a different score, and a higher score indicates that the document is more relevant, we multiply this sum by the score to obtain the document’s final similarity score. We might have experimented with different weights, such as multiplying by ½ × score or 2 × score, but due to time constraints we were not able to do so.
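A minimal sketch of this scoring scheme (the function and variable names here are ours; the actual code may differ):

from collections import Counter

def similarity(keywords, passage_tokens, doc_score):
    # Dot product of the question vector <1, ..., 1> with the passage's
    # keyword-frequency vector, weighted by the document's relevance score.
    counts = Counter(token.lower() for token in passage_tokens)
    hits = sum(counts[kw.lower()] for kw in keywords)
    return hits * doc_score

similarity(['capital', 'Bolivia'],
           ['La', 'Paz', 'is', 'the', 'administrative', 'capital', 'of', 'Bolivia'],
           0.8)
# -> 2 * 0.8 = 1.6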
Once we calculate the similarity for every document, we take the top 5 scores. We tried taking the top 1 and top 3 scores, but 5 gave us the best performance. Note that taking the top 5 scores can mean processing more than 5 documents when documents share a score. We split these documents into sentences, and for each tokenized sentence we use the Stanford Named Entity Recognizer to tag the named entities and NLTK’s pos_tag to get the POS tag for each token. We use the Stanford NER tags and NLTK POS tags during answer processing.
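A sketch of this tagging step; the Stanford NER model and jar paths are installation-specific placeholders, and tag_sentence is a name we use here for illustration:

from nltk import pos_tag
from nltk.tag import StanfordNERTagger

# Placeholder paths; these depend on where the Stanford NER package is installed.
ner_tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                               'stanford-ner.jar')

def tag_sentence(tokens):
    # Return the POS tags and named-entity tags for one tokenized sentence.
    return pos_tag(tokens), ner_tagger.tag(tokens)

pos_tags, ner_tags = tag_sentence(['George', 'Washington', 'lived', 'in', 'Virginia'])
# pos_tags -> [('George', 'NNP'), ('Washington', 'NNP'), ('lived', 'VBD'), ...]
# ner_tags -> [('George', 'PERSON'), ('Washington', 'PERSON'), ..., ('Virginia', 'LOCATION')]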
Answer Processing
We start with an empty list labeled answers. Given a list of POS-tagged tokens as well as the named entities, we look at the question tuple generated during question processing to figure out what type of question it is. We do this by checking which question word appears in the list of stopwords that were filtered out of the question. Before going through the heuristic list, we parse out the numeric entities, which our NLTK POS tagger marks as “CD”, usually followed by “NNS”. We do this because the Stanford NER does not cover named entities such as sizes, zip codes, and other numeric values, which are key to answering several questions.
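A minimal sketch of the numeric-entity pattern described above (extract_numeric_entities is a hypothetical name; the exact rule in qasystem.py may differ):

def extract_numeric_entities(pos_tagged):
    # Collect 'CD' tokens together with an immediately following noun,
    # e.g. ('200', 'CD'), ('miles', 'NNS') -> '200 miles'.
    entities = []
    i = 0
    while i < len(pos_tagged):
        word, tag = pos_tagged[i]
        if tag == 'CD':
            phrase = [word]
            if i + 1 < len(pos_tagged) and pos_tagged[i + 1][1] in ('NNS', 'NN'):
                phrase.append(pos_tagged[i + 1][0])
                i += 1
            entities.append(' '.join(phrase))
        i += 1
    return entities

extract_numeric_entities([('about', 'IN'), ('200', 'CD'), ('miles', 'NNS')])
# -> ['200 miles']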
Heuristic Rules:
After going through the named entities, we also parse out each noun phrase from the list of tokens using the POS tags, and append every noun phrase to answers. We do this because the answers to most questions involve a noun phrase. We then append every numeric entity to answers, since that is the only thing left in our list.
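One way to do this kind of POS-based noun phrase chunking is with NLTK’s RegexpParser; the grammar below is illustrative and not necessarily the one we used:

import nltk

# A simple noun-phrase grammar: optional determiner, any adjectives, then nouns.
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')

def extract_noun_phrases(pos_tagged):
    # Return the noun phrases found in a POS-tagged sentence as strings.
    tree = chunker.parse(pos_tagged)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == 'NP']

extract_noun_phrases([('The', 'DT'), ('red', 'JJ'), ('guitar', 'NN'),
                      ('sold', 'VBD'), ('quickly', 'RB')])
# -> ['The red guitar']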
Once answers has been populated with the relevant named entities and noun phrases, we sort it by frequency, since the same named entity or noun phrase may appear multiple times. Phrases with higher frequencies are given more weight, since they are likely to be relevant to the answer. We then return the top 10 elements in answers.
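A minimal sketch of this ranking step using collections.Counter (top_answers is a hypothetical name):

from collections import Counter

def top_answers(answers, n=10):
    # Rank candidate answers by how many times they were extracted; keep the top n.
    return [answer for answer, _ in Counter(answers).most_common(n)]

top_answers(['La Paz', 'Sucre', 'La Paz', 'Bolivia', 'La Paz'])
# -> ['La Paz', 'Sucre', 'Bolivia']  ('La Paz' ranked first)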
We also implemented a feature that takes into account the distances between the question keywords and the noun phrases and named entities during answer formation. However, it turned out that this barely improved the answers, in some cases made them worse, and increased the runtime of the program substantially, so we decided to leave it out of the final program. We think this is because we only use single sentences as passages, so a named entity already has to be fairly close to the keywords in order to be selected.
Competition improvements:
For the competition, we decided to increase the number of top-scoring documents we consider from 5 to 7. Furthermore, we went through the test questions in an attempt to formulate better heuristic rules. We added two rules:
Since it takes about an hour to run this code, we’ve included our prediction.txt.
Evaluation and Analysis of Results:
Passage Retrieval:
Top 5 scores MRR: 0.202774
Top 3 scores MRR: 0.192143
Top 1 score MRR: 0.142281
Passage retrieval performance changes with the number of top scores we keep, but we saw very little improvement from the top 3 scores to the top 5 scores. We think the top 3 documents already contain most of the relevant passages, so few new options are added to the answers list at top 5. However, the drop-off between top 1 and top 3 is too large to accept as a trade-off for runtime.
Our passage retrieval system is not necessarily the best, as it only takes into account the number of keyword hits each sentence makes. For certain question types, such as acronym questions, it would be extremely beneficial to use heuristic rules such as finding all the named entities that begin with, say, ACS. Furthermore, for some questions, the sentence in which a term is introduced might not be the sentence that answers the question, so finding a better way to chunk the passages might help. However, when we experimented with two-sentence segments, the results were worse, so we kept single sentences.
Note: these scores are meant to test passage retrieval only; we used a more primitive version of our heuristic system, without the competition improvements and without the numeric entity rule.
Answer Formation:
Best possible system: MRR: 0.208651
This is made with Top 5 scoring documents with Competition Improvements and Numeric Entity Improvements. Since we didn’t have enough time, we were only able to run this program once (it takes about 45 minutes per run).
Out of the questions we scored correctly, only 6 answers were noun phrases; the remaining 26 were named entities. Most of these Jeopardy-style questions require a named entity, so this is not surprising. We performed extremely well on Where, Who, and When questions. This makes sense, since they require simple named entities that we can easily parse out of the retrieved documents. We were also able to score some points with the numeric rules, which allowed us to answer some questions involving numbers. Overall, we think the heuristic rules we implemented performed extremely well. Outside of these categories, the generic “What” questions require further heuristic rules in order to perform well.
Question types that this system handles extremely poorly (essentially 0% accuracy) are acronym questions, chemical formula questions, synonym questions, definitional questions, questions that require a verb phrase answer, and questions whose answers are longer than a few noun phrases separated by prepositions. For acronym, chemical formula, and synonym questions, we need additional preprocessing and additional passage retrieval heuristics in order to find the specific noun phrases during answer formation. For definitional questions such as “What is X?”, questions that require a verb phrase answer, and questions with longer answers, the additional heuristic rules would require substantially more work and might hurt the accuracy of other question types.
Future Improvements and an 8-Hour Journey into PyLucene:
Passage Retrieval:
After achieving the MRR of 0.208651, we decided that the remaining time would be better spent implementing PyLucene for better passage retrieval, since our passage retrieval leaves much to be desired. When we measured it, we found that the rate of retrieving a correct passage is relatively low. Most notably, for the question “What is a Stratocaster?”, our passage retrieval system yielded no passages at all. We would have liked to use PyLucene to implement query expansion by stemming, a better weighting mechanism, and synonym query expansion.
Query Expansion: we want to expand the query to include stemmed and lowercased versions of the original question terms.[1]
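A minimal sketch of what this proposed expansion could look like with NLTK’s Porter stemmer (expand_query is a hypothetical name; we did not implement this):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def expand_query(keywords):
    # Add lowercased and stemmed variants of each original query term.
    expanded = set()
    for word in keywords:
        expanded.add(word)
        expanded.add(word.lower())
        expanded.add(stemmer.stem(word.lower()))
    return expanded

expand_query(['Williams', 'colors'])
# -> {'Williams', 'williams', 'william', 'colors', 'color'}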
Synonym Query Expansion: we would take the key query terms, identify synonyms commonly associated with them, and search for both the original query terms and the synonyms, giving synonym terms less weight than the original terms. These terms could also be taken into account later during answer formation. For example, for a question like “What baseball team does Pittsburgh have?”, we could add the synonym ‘organization’ for ‘team’, which should improve our accuracy in passage retrieval.
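One way this could be sketched with NLTK’s WordNet interface (synonym_terms is a hypothetical name; weighting is not shown, and we did not implement this):

from nltk.corpus import wordnet

def synonym_terms(word):
    # Collect WordNet lemma names that could serve as lower-weighted query terms.
    synonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace('_', ' ').lower())
    synonyms.discard(word.lower())
    return synonyms

synonym_terms('team')
# includes e.g. 'squad' and 'team up'; broader terms such as 'organization'
# would require walking synset.hypernyms() as well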
Distance from keywords: For certain noun phrases, we want to take into account the average distance between the question keywords and the candidate phrases in the retrieved passages.
Better weighting mechanism: Our current weighting mechanism takes into account only the frequency of relevant terms and the score of the document from which those terms were taken. In some cases it over-weights the document score and prevents the most relevant passages from being retrieved. We want to add a variety of other features, such as the average distances between named entities and keywords. The Lucene weighting mechanism would improve the performance of our system and let us experiment with different weighting schemes.
Chunking: We currently chunk passages by the sentence, but we could vary this window. For example, we can do 2 or 3 sentences, which may make the distance weighting more valuable.
Further Heuristics in Passage Retrieval: We only use heuristics in answer formation, so for some questions the relevant passages are never retrieved. The clearest examples are acronym questions such as “What does NASA stand for?” and questions like “What is the chemical formula for X?”. Currently, our success rate on these question types is 0%. Since we don’t explicitly look for passages that spell out NASA, one improvement would be heuristic rules such as: if the question is of the form “What does X stand for?”, search the passages for four-word noun phrases and named entities whose words begin with N, A, S, A in that order. Other heuristic rules of this kind would also help in determining the most relevant passages for each question.
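A minimal sketch of that proposed acronym heuristic (acronym_candidates is a hypothetical name; as a small variation on the rule above, this version skips short function words rather than requiring exactly four words):

def acronym_candidates(acronym, noun_phrases):
    # Keep phrases whose word initials spell out the acronym in order,
    # skipping short function words such as 'and', 'of', 'the', 'for'.
    skip = {'and', 'of', 'the', 'for'}
    matches = []
    for phrase in noun_phrases:
        words = [w for w in phrase.split() if w.lower() not in skip]
        initials = ''.join(w[0].upper() for w in words)
        if initials == acronym.upper():
            matches.append(phrase)
    return matches

acronym_candidates('NASA', ['National Aeronautics and Space Administration',
                            'the space agency'])
# -> ['National Aeronautics and Space Administration']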
Adventure in PyLucene:
Once we decided to implement PyLucene, we had to install both JCC and Apache Ant, which was a nightmare on a Mac. Furthermore, there is very little documentation available for PyLucene; we eventually had to work through the PyLucene sample code, which only became useful after several rounds of experimentation. PyLucene requires us to split each top-docs file into separate text files, one per passage; this generation alone took 40 minutes. We would then have to use the search function to go through each document and issue queries with the SearchFile script. At that point, we decided it was not worth the time to finish implementing this system, as the features we wanted PyLucene to provide required too much time investment.
For reference, we’ve included IndexFile.py and SearchFile.py in the directory. IndexFile.py takes a directory of top docs and indexes them so that SearchFile.py can perform specific searches on them.
Answer Retrieval:
For Answer Retrieval, we may consider two types of improvements, query reformulation and additional heuristic rules.
Query/Question Reformulation: Query reformulation maps the different phrasings of a question onto one standard form. For example, questions asking for the same thing can be worded very differently: “What is the color of Williams College?” and “Name the color of Williams College” are both asking for the same thing. Similarly, there are many other question patterns that we could reformulate into a standardized question.
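A small sketch of what pattern-based reformulation could look like; the patterns and the function name reformulate are hypothetical, since we did not implement this:

import re

# Hypothetical patterns that map imperative phrasings onto a standard
# "What is ..." form so that downstream heuristics see one question type.
REFORMULATIONS = [
    (re.compile(r'^Name (the .+)$', re.IGNORECASE), r'What is \1?'),
    (re.compile(r'^Tell me (the .+)$', re.IGNORECASE), r'What is \1?'),
]

def reformulate(question):
    for pattern, template in REFORMULATIONS:
        if pattern.match(question):
            return pattern.sub(template, question)
    return question

reformulate("Name the color of Williams College")
# -> 'What is the color of Williams College?'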
Additional Heuristic Rules: More linguistic analysis of certain question formats would allow us to come up with more hand-crafted rules and increase the MRR of our system.
[1] Note: a group of University of Washington graduate students found that query expansion added more noise to their passage retrieval system than it helped. This was done on TREC 2004 and 2005 data.
http://faculty.washington.edu/levow/courses/ling573_SPR2011/slides/D3_group_6.pdf
http://www2.lingfil.uu.se/SLTC2014/abstracts/sltc2014_submission_3.pdf