
8-10 February, 2013

Department of Linguistics, Aligarh Muslim University, Aligarh

 
 

NLP and Information Extraction

by Narayan Choudhary and Parth Pathak, ezDI, LLC.

The area of NLP requires a great deal of linguistic input, in terms of both analysis and logic, to produce better results. NLP has been applied to several tasks, including machine translation, information retrieval, speech synthesis, and grammar and spelling checking. The data NLP works on can be either speech or text. While NLP research on speech data is relatively new, work on text has a history of more than half a century. Despite this long history of NLP work on text, a great deal remains to be done.


The proposed workshop discusses text processing with regard to the extraction of information from text. Information extraction (IE) differs from information retrieval (IR): IR gathers documents from a source that are relevant to the information being searched for, while IE is the task of converting the raw source text into a structured data format from which the information can be easily accessed.
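To make this contrast concrete, the sketch below (in Python, and not ezNLP itself) illustrates the IE side: a hand-written pattern plus a small drug dictionary, both invented for the example, turn a free-text clinical sentence into a structured record, whereas an IR system would instead return whole documents matching a query.

```python
import re

# Tiny illustrative drug dictionary (hypothetical entries).
DRUG_DICTIONARY = {"metformin", "lisinopril", "aspirin"}

# Pattern for a drug mention followed by a dosage, e.g. "metformin 500 mg".
DOSAGE_PATTERN = re.compile(r"\b(?P<drug>\w+)\s+(?P<dose>\d+\s*mg)\b", re.IGNORECASE)

def extract_medications(text):
    """Return structured records for drug mentions found in free text."""
    records = []
    for match in DOSAGE_PATTERN.finditer(text):
        drug = match.group("drug").lower()
        if drug in DRUG_DICTIONARY:
            records.append({"drug": drug, "dose": match.group("dose")})
    return records

sentence = "The patient was started on metformin 500 mg twice daily."
print(extract_medications(sentence))
# [{'drug': 'metformin', 'dose': '500 mg'}]
```

The structured records printed at the end are the kind of output IE aims for; everything downstream (querying, aggregation, coding) works on that structure rather than on the raw text.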

In the proposed workshop we will present the state of the art in information extraction as applied to converting the natural-language text found in the medical records of US hospitals into structured data (RDF).
We use an in-house platform called ezNLP as the base for our work. ezNLP takes documents as input and converts them into RDF documents, an XML-based format that uses semantic-web techniques to encode the information in a way that machines can easily interpret and render on demand.
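To illustrate the target representation only (this is a sketch; the namespace, class, and property names are invented for the example and do not show ezNLP's actual output schema), the snippet below uses the Python rdflib library to encode one extracted medication fact as RDF triples and serialize it as RDF/XML.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical namespace for the example; not ezNLP's actual schema.
EX = Namespace("http://example.org/clinical/")

g = Graph()
g.bind("ex", EX)

# One extracted fact: patient 123 is on metformin, 500 mg.
medication = URIRef(EX["medication/1"])
g.add((medication, RDF.type, EX.MedicationMention))
g.add((medication, EX.patientId, Literal("123")))
g.add((medication, EX.drugName, Literal("metformin")))
g.add((medication, EX.dose, Literal("500 mg")))

# Serialize as RDF/XML, the XML-based serialization referred to above.
print(g.serialize(format="xml"))
```

Once facts are stored this way, they can be queried and linked across documents, which is what makes the downstream applications described later possible.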
ezNLP utilizes several modules that are based on NLP techniques. Some of these modules are as follows (a minimal pipeline sketch follows the list):

Sentence boundary detector
Tokenizer
Dependency parser
Phrasal chunker
Negation detector
Dictionary lookup annotator
Normalizer
Part-of-speech tagger
Drug mention annotator
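The sketch below strings together a few of these module types (sentence boundary detection, tokenization, part-of-speech tagging, dictionary lookup for drug mentions, and a crude negation check) to show the general shape of such a pipeline. It is not ezNLP's implementation: it assumes NLTK is installed with its tokenizer and tagger models downloaded, and the drug dictionary and negation cues are invented for the example.

```python
import nltk

# One-time setup (models assumed already downloaded):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

DRUG_DICTIONARY = {"metformin", "aspirin"}          # hypothetical dictionary lookup
NEGATION_CUES = {"no", "denies", "without", "not"}  # crude negation detector

def annotate(document):
    """Run a toy pipeline: sentences -> tokens -> PoS tags -> annotations."""
    annotations = []
    for sentence in nltk.sent_tokenize(document):      # sentence boundary detection
        tokens = nltk.word_tokenize(sentence)          # tokenization
        tagged = nltk.pos_tag(tokens)                  # part-of-speech tagging
        negated = any(tok.lower() in NEGATION_CUES for tok in tokens)
        for token, tag in tagged:
            if token.lower() in DRUG_DICTIONARY:       # dictionary lookup / drug mention
                annotations.append(
                    {"drug": token.lower(), "sentence": sentence, "negated": negated}
                )
    return annotations

text = "The patient denies taking aspirin. She continues metformin daily."
for ann in annotate(text):
    print(ann)
```

A real system would, of course, scope negation syntactically (for example via the dependency parser) rather than flagging a whole sentence, and would normalize mentions against a proper drug vocabulary.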

Many of these problems have been solved as we have worked our way through them. For example, with adaptation techniques we now have a better PoS tagger, and nominal phrase identification is about as good as it can be. But all these improvements require a lot of work, from corpus analysis to precisely defining what we want to extract from the raw text, and all of it has to be done with the end goal in mind. In our domain, the idea is to help doctors identify diseases easily and avoid mistakes when making decisions. The kind of tool we are developing will also help other health care stakeholders, such as hospital management, health agencies, and insurance companies; all of them will benefit from a better analysis of millions of medical documents. Health researchers would be helped in finding patterns between two or more events occurring at the same time, as witnessed through patients' medical records.
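As a toy illustration of that last point (the patient data and condition names below are invented for the example), the sketch counts how often pairs of conditions co-occur in the same patient's records, which is the simplest form of such pattern finding.

```python
from collections import Counter
from itertools import combinations

# Invented example data: conditions mentioned in each patient's records.
patient_conditions = {
    "patient-1": {"diabetes", "hypertension", "neuropathy"},
    "patient-2": {"diabetes", "hypertension"},
    "patient-3": {"asthma", "hypertension"},
}

# Count how often each pair of conditions co-occurs across patients.
pair_counts = Counter()
for conditions in patient_conditions.values():
    for pair in combinations(sorted(conditions), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
# ('diabetes', 'hypertension') 2, followed by the pairs seen once
```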

In this workshop, we will first give an overview of what is being done and how. The workshop will be held in two sessions. The first session will introduce the topic, discuss the benefits of what we are doing (i.e. the motivation), and touch upon the NLP side of it, where major linguistic input is needed. The second session will continue the linguistic focus of the first and give an overview of the computing techniques we use to solve some of our problems.


Copyright © 2012 SCONLI, Aligarh Muslim University, Aligarh