Sample Projects

The following is a list of sample projects and tasks that Ross Research Computing (RC) helped our researchers with recently. The list is intended both to inform faculty about the quantitative research activities happening around the Ross School and to highlight the types of services available from Ross Research Computing.

  1. RC assisted Lindsey Gallo (Accounting) in computing intra-day CAPM betas for a large universe of firms. High-frequency Trade & Quote (TAQ) data were used to compute the betas; the large TAQ dataset (> 37 terabytes) was processed with SAS and SQL on the WRDS Cloud.
  2. RC assisted Kyle Handley (Business Economics) in reading scanned PDF tables into CSV using the Amazon Textract API, an intelligent OCR service in the AWS cloud. Thousands of pages of scanned PDF tables of agricultural tariff schedules were processed asynchronously.
  3. RC consulted Mana Heshmati (Strategy) on using the YouTube API to programmatically search for video interviews of top corporate executives. RC also advised her on using the Amazon Transcribe API to transcribe the audio content of the interviews. Amazon Transcribe is a speech-to-text service, built on deep-learning algorithms, that recognizes multiple speakers and their accents.
  4. RC assisted Mana Heshmati (Strategy) in pulling and merging tables from the Compustat and BoardEx databases available on the WRDS platform. She needed to merge various industrial classification codes from Compustat with BoardEx datasets; the Compustat-CRSP-BoardEx link table was used to join the two.
  5. RC assisted Anna Costello (Accounting) in parsing and converting unstructured XML data to structured CSV data. The Python xml.etree.ElementTree module was used to accomplish the task.
  6. RC assisted Mihir Mehta (Accounting) in extracting and text-analyzing about 2.3 million news articles (containing specific information) using the Microsoft Bing News API and the Python Newspaper library.
  7. RC assisted Lindsey Gallo (Accounting) in extracting analysts’ financial forecast data from image/scanned files using an OCR web-service API and Python. The scanned files were obtained from Wolters Kluwer.
  8. RC consulted Tom Lyon (Business Economics) in extracting text from a large corpus of scanned email conversations between private corporations and Environmental Protection Agency (EPA) officials. The text is being parsed to transform the data into a model-ready form.
  9. RC consulted Jordan Siegel (Strategy) in pulling firm-specific fundamental data, including company financials, industrial classification codes, and headquarters addresses, from the Orbis databases available through the WRDS platform. A SAS script was written on the WRDS Cloud to process the large Orbis data.
  10. RC assisted Cathy Shakespeare (Accounting) in extracting information on fair-value disclosures and loan footnotes for select companies from the Securities and Exchange Commission website (sec.gov). The Selenium WebDriver was used to navigate the website and extract the data.
  11. RC assisted Julia Lee (M&O) in extracting and parsing Instagram image data for the hashtag #Planet/Plastic using the Selenium WebDriver and the Python Beautiful Soup library. Data on author names and the numbers of comments, followers, likes, etc. were extracted.
  12. RC assisted Julia Lee (M&O) in visualizing data on environmentally friendly behaviors in the general population using the matplotlib and plotly libraries in Python.
  13. RC assisted Julia Lee (M&O) in analyzing picture/image data using the Microsoft Azure Computer Vision API. She wanted to extract tags and caption information for National Geographic magazine pictures available on Instagram.
  14. RC assisted Lindsey Gallo (Accounting) in extracting data from PDF tables using the Python Tabula library. She was also advised on how to extract data from image/scanned files using Tesseract OCR from Python.
  15. RC assisted Lindsey Gallo (Accounting) in automatically downloading and concatenating a large number of datafiles available through an API. The Python requests and pandas libraries were used to download and append the data.
  16. RC consulted Scott Masten (Business Economics) and his research assistant in extracting information from a large Web of Science publications dataset (~800 GB) available in XML format. Scott is analyzing trends in the number of publications across different U.S. schools and disciplines from 1945 to 2017. The Flux HPC cluster was used to process the data.
  17. RC assisted Raji Kunapuli (Strategy) in conducting text analysis of a corpus of analyst reports (in PDF format). Python was used to write algorithms that search for and count specific words/phrases across a series of reports.
  18. RC assisted Mihir Mehta (Accounting) in extracting monthly counts of a business phrase that has appeared in major business newspapers (on Factiva) since 1995.
  19. RC consulted Frank Li (Business Economics) in using the Python Selenium library to automate data collection from dynamic websites, which are difficult to scrape with conventional methods. Selenium is a browser-automation (WebDriver) tool that simulates human clicks in a web browser.
  20. RC assisted Cassandra Chambers (M&O) in matching, by company name, a set of firms with their counterparts in Compustat, after accounting for spaces, stop/redundant words, punctuation, letter case, suffixes, etc.
  21. RC consulted Mijeong Kwon (M&O) and Saerom Lee (Strategy) in performing parallel computing using the Python multiprocessing library and Stata-MP, in consultation with CSCAR.
  22. RC assisted Eun Woo (M&O) in conducting an "event study" that examines abnormal stock returns of the plaintiff and defendant firms involved in patent-infringement litigation. A Wharton Research Data Services (WRDS) SAS macro that uses CRSP data was customized to obtain abnormal returns for windows of -1 to +5 days around the event dates.
  23. RC consulted Harsh Ketkar (Strategy) in parsing XML files using the Python ElementTree library. Harsh wanted to parse about 20 GB of PubMed/MEDLINE citation data available in XML format.
  24. RC consulted Mijeong Kwon (M&O) on extracting Steam video-game review data using the Python Scrapy library. The extraction was complicated because the reviews on each page load dynamically as the user scrolls down, using an "infinite scroll" design.
  25. RC consulted Harsh Ketkar (Strategy) in downloading and analyzing big data (~6 terabytes) using HiveQL on the Flux Hadoop cluster. RC worked closely with ITS ARC staff on this project.
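
Several projects above (items 5 and 23) parse XML with Python's xml.etree.ElementTree module and flatten it to CSV. A minimal sketch of that pattern — the `<record>` schema here is made up for illustration; the actual Compustat and PubMed/MEDLINE layouts are far richer:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML; the real project data had a more complex schema.
XML_DATA = """
<records>
  <record><firm>Acme</firm><year>2018</year><value>1.5</value></record>
  <record><firm>Beta</firm><year>2019</year><value>2.0</value></record>
</records>
"""

def xml_to_rows(xml_text, fields):
    """Parse the XML and yield one dict per <record> element."""
    root = ET.fromstring(xml_text)
    for rec in root.iter("record"):
        yield {f: rec.findtext(f) for f in fields}

def rows_to_csv(rows, fields):
    """Write the row dicts as CSV text (swap io.StringIO for a real file)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

fields = ["firm", "year", "value"]
csv_text = rows_to_csv(xml_to_rows(XML_DATA, fields), fields)
print(csv_text)
```

For very large files like the 20 GB PubMed dump, `ET.iterparse` with element clearing is the usual substitute for `fromstring`, since it avoids holding the whole tree in memory.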
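Item 15 automates downloading many datafiles from an API and appending them with the requests and pandas libraries. A minimal sketch; the URL list and JSON payload shape are placeholders, not the actual API:

```python
import pandas as pd
import requests

def fetch_frame(url):
    """Download one JSON datafile and return it as a DataFrame (hypothetical endpoint)."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return pd.DataFrame(resp.json())

def concat_frames(frames):
    """Append the per-file frames into one table with a fresh index."""
    return pd.concat(frames, ignore_index=True)

# Usage (placeholder URLs, not the real API):
# urls = ["https://api.example.com/data?page=%d" % p for p in range(1, 4)]
# combined = concat_frames(fetch_frame(u) for u in urls)
```

Passing `ignore_index=True` keeps the combined table from inheriting the overlapping 0..n indexes of the individual downloads.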
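The text analysis in item 17 boils down to counting occurrences of target words/phrases in each report. A minimal sketch using the standard-library re module; the phrase list and sample text are illustrative, not the project's actual dictionary:

```python
import re
from collections import Counter

def count_phrases(text, phrases):
    """Count case-insensitive, whole-word occurrences of each phrase."""
    counts = Counter()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        counts[phrase] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Toy stand-in for the text extracted from one analyst report
report = "Revenue growth beat guidance. Management raised revenue growth targets."
print(count_phrases(report, ["revenue growth", "guidance"]))
```

The `\b` word boundaries keep "art" from matching inside "smart"; `re.escape` protects phrases that contain punctuation.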
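Items 19 and 24 both wrestle with pages that load content dynamically as the user scrolls. The standard Selenium workaround is to keep scrolling until the page height stops growing. A minimal sketch of that loop; browser setup is omitted, and the function works with any object exposing Selenium's `execute_script`:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll an infinite-scroll page until its height stops growing.

    `driver` is a Selenium WebDriver; the JavaScript snippets are plain DOM calls.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch of content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared; we have reached the bottom
        last_height = new_height
    return last_height

# Usage with a real browser (placeholder URL, not run here):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://example.com/reviews")
# scroll_to_bottom(driver)
```

The `max_rounds` cap guards against pages that genuinely never stop growing.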
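The firm-name matching in item 20 hinges on normalizing each name before comparison. A minimal sketch; the suffix/stop-word list is illustrative, and the project's actual cleaning rules were more extensive:

```python
import re

# Illustrative corporate suffixes and stop words; the real list was longer.
SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "the"}

def normalize(name):
    """Lower-case, strip punctuation, and drop common corporate suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def match_names(sample, universe):
    """Map each sample name to the universe name with the same normalized form."""
    index = {normalize(u): u for u in universe}
    return {s: index.get(normalize(s)) for s in sample}

matches = match_names(
    ["The Acme Corp.", "Widgets, Inc."],
    ["ACME CORP", "WIDGETS INC", "GLOBEX LLC"],
)
print(matches)
```

Exact matching on normalized forms is usually the first pass; fuzzy matching (edit distance or token-set similarity) handles the leftovers.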
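The Python side of item 21 follows the standard multiprocessing.Pool pattern: fan the work out across cores and collect results in input order. A minimal sketch with a toy CPU-bound function standing in for the researchers' actual computation:

```python
from multiprocessing import Pool

def simulate(seed):
    """Stand-in for a CPU-bound task; the real project ran heavier computations."""
    return seed * seed

def run_parallel(seeds, workers=4):
    """Distribute the tasks over worker processes; pool.map keeps input order."""
    with Pool(processes=workers) as pool:
        return pool.map(simulate, seeds)

if __name__ == "__main__":
    print(run_parallel(range(5)))  # each seed is squared in a separate worker
```

The worker function must be defined at module top level so it can be pickled and shipped to the child processes; the `__main__` guard keeps those children from re-running the script on platforms that spawn rather than fork.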