Jinru He  

Office: 1601 Willow Rd,
              Menlo Park, CA, 94025
Email: jhe (at) cis.poly.edu
Homepage: http://www.jinruhe.com


Web Exploration and Search Group
Department of Computer Science and Engineering
New York University, Polytechnic Institute


About me

I joined Facebook as a software engineer in Mar. 2012.
I am a Ph.D. student in the Department of Computer Science and Engineering at the Polytechnic Institute of New York University (NYU-POLY).
My advisor is Prof. Torsten Suel.
My current research interests include web search technology and text and index compression, specifically temporal indexing and compression.
Read more about my research.
During my undergraduate years, I spent a lot of time with colleagues building and maintaining a web site called Hustonline.
My resume (Nov 2011).

For more about me, check out my Facebook profile: Jinru He


Education

2007.9.-present PhD student, Department of Computer Science,
Polytechnic Institute of New York University
2003.9.-2007.6. Bachelor of Science, School of Software Engineering,
Huazhong University of Science and Technology, Wuhan, China.


Selected Publications

  • Optimizing Positional Index Structures for Versioned Document Collections,
    with Torsten Suel,
    35th Annual International ACM SIGIR Conference (SIGIR'12), Portland, OR, available soon

  • Text vs. Space: Efficient Geo-Search Query Processing,
    with Maria Christoforaki, C. Dimopoulos, Torsten Suel and Alex Markowetz,
    20th ACM Conference on Information and Knowledge Management (CIKM'11), Glasgow, Scotland, PDF, Code (Acceptance rate 15%).

  • Faster Temporal Range Queries over Versioned Text,
    with Torsten Suel,
    34th Annual International ACM SIGIR Conference (SIGIR'11), Beijing, China, PDF (Acceptance rate 19.8%).

  • Improved Index Compression Techniques for Versioned Document Collections,
    with Junyuan Zeng and Torsten Suel,
    19th ACM Conference on Information and Knowledge Management (CIKM'10), Toronto, Canada, Oct. 2010, PDF, PPT (Acceptance rate 13.4%).

  • Compact Full-Text Indexing of Versioned Document Collections,
    with Hao Yan and Torsten Suel,
    18th ACM Conference on Information and Knowledge Management (CIKM'09), Hong Kong, China, Nov. 2009, PDF (Acceptance rate 14.8%).

  • Using Graphics Processors for High Performance IR Query Processing,
    with Shuai Ding, Hao Yan and Torsten Suel,
    18th International World Wide Web Conference (WWW'09), Madrid, Spain, April 2009, PDF (Acceptance rate 11.8%).


Experience

06/2011-08/2011 Research intern
Microsoft Research/Bing
09/2007-present Research Assistant
Polytechnic Institute of NYU


Selected Research Projects

  • Efficient Geo-Search Query Processing

    A number of search services allow users to constrain text queries (e.g., photography classes) to a geographic location (e.g., Santa Monica). These include local search engines such as Google Local and mobile search services accessible from smartphones. This motivates the problem of how to efficiently execute search queries that contain a mix of textual and spatial constraints. In this project, we take a new look at this problem. Executing such queries requires a combination of textual (e.g., inverted lists) and spatial (e.g., R-trees, space-filling curves) data structures. We describe several existing and new algorithms that make different choices on this trade-off between text and space, and evaluate them on large data sets. Our results indicate that an efficient query processor needs to first get the textual aspects of the problem right. In fact, even a naive approach that applies spatial filtering at the end appears to outperform many previous schemes, while even better results are obtained by integrating some light-weight spatial structure into the inverted index design. (C++) code
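The naive "text first, spatial filter last" approach mentioned above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the released code: the names `Point`, `Box`, `intersectTerms`, and `geoSearch` are hypothetical, posting lists are plain sorted vectors of document IDs, each document has a single location, and the spatial constraint is a bounding box.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only.
struct Point { double lat, lon; };
struct Box { double minLat, minLon, maxLat, maxLon; };

using PostingList = std::vector<int>;  // sorted document IDs
using InvertedIndex = std::unordered_map<std::string, PostingList>;

// Conjunctive text query: intersect the posting lists of all terms.
PostingList intersectTerms(const InvertedIndex& index,
                           const std::vector<std::string>& terms) {
    PostingList result;
    bool first = true;
    for (const auto& term : terms) {
        auto it = index.find(term);
        if (it == index.end()) return {};  // a missing term empties the result
        if (first) { result = it->second; first = false; continue; }
        PostingList tmp;
        std::set_intersection(result.begin(), result.end(),
                              it->second.begin(), it->second.end(),
                              std::back_inserter(tmp));
        result.swap(tmp);
    }
    return result;
}

// Text first, space last: keep only text matches located inside the box.
PostingList geoSearch(const InvertedIndex& index,
                      const std::vector<Point>& locations,  // indexed by doc ID
                      const std::vector<std::string>& terms,
                      const Box& box) {
    assert(box.minLat <= box.maxLat && box.minLon <= box.maxLon);
    PostingList out;
    for (int doc : intersectTerms(index, terms)) {
        const Point& p = locations.at(doc);
        if (p.lat >= box.minLat && p.lat <= box.maxLat &&
            p.lon >= box.minLon && p.lon <= box.maxLon)
            out.push_back(doc);
    }
    return out;
}
```

The spatial test here touches only documents that already satisfy the text constraints, which is the point of the baseline; integrating spatial structure into the index itself, as the project does, is not shown.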

  • Faster Temporal Range Queries over Versioned Text

    Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index size and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes. (C++)
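As a rough illustration of the "simple approach" described above (partition by time range, then access only the relevant parts), here is a minimal sketch. It is not the scheme from the paper: `Posting`, `Partition`, and `temporalRangeQuery` are hypothetical names, the partitions are assumed to tile the timeline without gaps and to be sorted by time, and a version that spans a partition boundary is assumed to be replicated into every partition it overlaps. That replication is one reason naive partitioning can inflate index size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical structures for illustration only.
struct Posting {
    int doc;
    int version;
    std::int64_t begin, end;  // the version is valid in [begin, end)
};

struct Partition {
    std::int64_t begin, end;        // the partition covers [begin, end)
    std::vector<Posting> postings;  // versions (of one term) alive in this range
};

// Return the postings whose validity interval overlaps [qBegin, qEnd).
std::vector<Posting> temporalRangeQuery(const std::vector<Partition>& parts,
                                        std::int64_t qBegin, std::int64_t qEnd) {
    assert(qBegin < qEnd);
    std::vector<Posting> out;
    // Binary search for the first partition that ends after qBegin.
    auto it = std::upper_bound(
        parts.begin(), parts.end(), qBegin,
        [](std::int64_t t, const Partition& p) { return t < p.end; });
    // Scan only partitions that overlap the query range.
    for (; it != parts.end() && it->begin < qEnd; ++it) {
        for (const Posting& p : it->postings) {
            if (p.begin >= qEnd || p.end <= qBegin) continue;  // no overlap
            // Report a replicated posting once: only in the partition that
            // contains the start of its overlap with the query range.
            std::int64_t start = std::max(p.begin, qBegin);
            if (start >= it->begin && start < it->end) out.push_back(p);
        }
    }
    return out;
}
```

A query such as `temporalRangeQuery(parts, 8, 15)` reads only the partitions overlapping [8, 15) and skips the rest of the index.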

  • Compact Full-Text Indexing of Versioned Document Collections

    We study the problem of creating highly compressed full-text index structures for versioned document collections, that is, collections that contain multiple versions of each document. Important large-scale examples of such collections are Wikipedia and the web page archive maintained at the Internet Archive. We propose several new techniques for organizing and compressing inverted index structures for versioned document collections. We also perform a detailed experimental comparison of our new techniques and the existing techniques in the literature. Our results on an archive of the history of the English version of Wikipedia show very significant benefits over previous approaches. (C++)

  • Using Graphics Processors for High Performance IR Query Processing

    Using the parallel computing power available in a modern Graphics Processing Unit, I am developing and implementing algorithms for query processing in a search engine. In particular, this work focuses on efficient decompression of data encoded using Rice and PforDelta coding. Our work has shown that a GPU offers significant potential performance improvements when appropriate algorithms are used.
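For readers unfamiliar with Rice coding, the following is a minimal sequential sketch of encoding and decoding a sorted list of document IDs as gaps. It is a plain CPU illustration, not the GPU implementation from the project: the function names are hypothetical, the bit stream is a simple `std::vector<bool>`, the parameter `k` is fixed by the caller, and PforDelta and the parallel decompression strategy are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bits = std::vector<bool>;  // simple, unoptimized bit stream

// Rice code with parameter k: the quotient (value >> k) in unary
// (that many 1 bits followed by a 0), then the k low-order bits.
void riceEncode(std::uint32_t value, unsigned k, Bits& out) {
    assert(k < 32);
    for (std::uint32_t q = value >> k; q > 0; --q) out.push_back(true);
    out.push_back(false);
    for (unsigned i = k; i-- > 0;) out.push_back(((value >> i) & 1u) != 0);
}

// Decode one value starting at bit position pos; pos is advanced past it.
std::uint32_t riceDecode(const Bits& in, std::size_t& pos, unsigned k) {
    assert(k < 32);
    std::uint32_t q = 0;
    while (in.at(pos++)) ++q;  // unary part, stops after the terminating 0
    std::uint32_t r = 0;
    for (unsigned i = 0; i < k; ++i) r = (r << 1) | (in.at(pos++) ? 1u : 0u);
    return (q << k) | r;
}

// Encode a non-decreasing list of document IDs as Rice-coded gaps.
Bits encodeGaps(const std::vector<std::uint32_t>& docIds, unsigned k) {
    Bits out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docIds) {
        assert(id >= prev);
        riceEncode(id - prev, k, out);
        prev = id;
    }
    return out;
}

// Decode count gaps and rebuild the document IDs by prefix summation.
std::vector<std::uint32_t> decodeGaps(const Bits& bits, std::size_t count,
                                      unsigned k) {
    std::vector<std::uint32_t> ids;
    std::size_t pos = 0;
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < count; ++i) {
        prev += riceDecode(bits, pos, k);
        ids.push_back(prev);
    }
    return ids;
}
```

With `k = 2`, the value 9 is encoded as the bits 1 1 0 0 1: a quotient of 2 in unary followed by the two-bit remainder 01. The unary part and the final prefix sum are inherently sequential in this form, which is what makes efficient parallel decoding a non-trivial problem.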

  • Hustonline FTP Search Engine

    There are thousands of FTP sites full of useful files and data in CERNET (China Education and Research Network). The FTP Search Engine is one solution for finding specific files on FTP sites in CERNET. I designed the architecture of the FTP search engine and implemented the crawler, indexer, and data storage. To date, the FTP Search Engine has collected 6,000 FTP sites and stored nearly 25 million data items. The website can be reached at http://www.sohust.com. (C++, Windows)


Useful Tools

  • Fast geo query processing code used in our CIKM'11 paper: geonew.tar.gz
  • Index compression code for versioned document collections used in our CIKM'09 and CIKM'10 papers: version.zip
  • We built a small index compression toolkit that compresses inverted lists to a reasonably small size and decompresses them very fast; you can download it here: Polycomp.tar
  • Optimized Pfor-delta compression code created by Shuai Ding: OPT_PFD.zip

Labmates

Josh Attenberg (Etsy)
Maria Christoforaki
Constantinos Dimopoulos
Shuai Ding (Facebook)
Qingqing Gan (Microsoft)
Hao Yan (LinkedIn)
Jiangong Zhang (Amazon)

last updated 01/07/2012