By LOUIS FLORES
Update 07 October 2016 13:45 ⎪ After the New York City Housing Authority, or NYCHA, produced over 400 GB of data files in a second response to a Freedom of Information Law request filed by Progress Queens, most of the data is now in a format that can be reviewed, Progress Queens can reveal.
For weeks after the data files had been received, the files were difficult to open because of their size. One file was over 200 GB, making it impossible to open with the computer hardware used by Progress Queens. However, Progress Queens received valuable technological assistance that eventually allowed the files to be opened.
The second FOIL response was produced by NYCHA in two groups of .TXT files: (i) extracts from an application named Maximo and (ii) extracts from an application named Siebel. The data files appeared to represent database dumps using names for known database management systems. Maximo is a real estate property database management system marketed by IBM, and Siebel is the maker of customer relationship management applications that is now a unit of Oracle.
Because the .TXT files appeared to represent database dumps, and because commas were used to separate values within them, it appeared possible to convert the data files into .CSV files. Files in .CSV format can be readily read by various applications to perform data analysis. However, some files in both the Maximo and the Siebel data contained rows with extra commas, which blocked attempts to read the files using Pandas, a data analysis library for the Python programming language.
For a period of time, the publisher of Progress Queens sought assistance from the data journalism teams of other news publications to open the data files and to possibly collaborate on reporting on them. The editor of a data journalism team at one news publication informed the publisher of Progress Queens that defects in the Siebel data files would make it almost impossible to read the data. However, Progress Queens received assistance from data experts, one of whom cleaned up the Siebel data for Progress Queens. Ultimately, the publisher of Progress Queens created a module to remove the corrupted rows of data from the Maximo files, making it possible to read the non-defective rows of data.
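The module written by the publisher has not been published, so the following is only a minimal sketch of how rows with extra commas could be set aside. It assumes that a defective row can be recognized because it has a different number of fields than the header row; the function name and the sample column names are hypothetical.

```python
import csv
import io


def split_clean_rows(lines):
    """Separate rows whose field count matches the header from rows that do not."""
    reader = csv.reader(lines)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        # A stray comma adds a field, so the row no longer lines up with the header.
        if len(row) == len(header):
            good.append(row)
        else:
            bad.append(row)
    return header, good, bad


# Hypothetical sample: the second data row contains one extra comma.
sample = io.StringIO(
    "id,status,note\n"
    "1,OPEN,leak\n"
    "2,CLOSED,paint, second coat\n"
    "3,OPEN,heat\n"
)
header, good, bad = split_clean_rows(sample)
print(len(good), "rows kept;", len(bad), "rows set aside")
```

Keeping the rejected rows in a separate list, rather than discarding them, would make it possible to count them and to inspect them later.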
According to the metadata for the files on the external hard drive provided by NYCHA to Progress Queens, the latest date by which the data files had been exported was 29 February 2016. Progress Queens filed its FOIL request on 21 March 2016. A first FOIL response was received by Progress Queens on 03 June 2016. A second FOIL response was received on 19 August 2016. For unknown reasons, NYCHA waited until making a second FOIL response before producing the Maximo and Siebel extracts, even though the extracts were in existence at the time that NYCHA made its first FOIL response.
Due to new information contained in the second FOIL response, Progress Queens will be updating reports it published based on the limited information contained in the first FOIL response.
The data files of the second FOIL response were received without any explanation of the numerous fields that are used by NYCHA for its two database systems. One data file was discovered to be missing, and NYCHA has yet to indicate that it will provide the missing file.
The files produced to Progress Queens were presumably also produced to the U.S. Attorney's Office for New York's southern district, which is reportedly investigating NYCHA for the physical condition standards of its public housing developments.
Even before Progress Queens received the second FOIL response, Progress Queens had predicted that NYCHA was using two systems to keep track of its maintenance records. It is not known why NYCHA keeps two sets of books. Numerous requests made by Progress Queens to interview NYCHA CEO Shola Olatoye have not been answered.
Initial inspection of the data files reveals multiple technical issues with the .TXT files. Aside from NYCHA's failure to address rows in the Siebel data that are corrupted by stray commas, the Maximo data was observed to be unwieldy. Over 300 different fields were discovered in each row of data in the main Maximo file. After the fields were reviewed by Progress Queens, it was revealed that approximately one third of the fields were either completely empty or nearly completely empty. To keep the review of the main Maximo file manageable, the publisher of Progress Queens will not be reviewing those empty or nearly empty fields. A list has been prepared of the remaining column names, which will be reviewed.
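One way that empty or nearly empty fields could be identified is sketched below. This is an illustration and not the procedure used by Progress Queens: the 99 percent cut-off for "nearly empty," the function name, and the sample column names are all assumptions.

```python
def sparse_columns(header, rows, threshold=0.99):
    """Return the names of columns whose share of blank values is at least threshold."""
    blanks = [0] * len(header)
    for row in rows:
        # Only look at as many fields as the header names.
        for i, value in enumerate(row[: len(header)]):
            if not value.strip():
                blanks[i] += 1
    total = len(rows)
    if total == 0:
        return []
    return [name for name, n in zip(header, blanks) if n / total >= threshold]


# Hypothetical sample: the third column is blank in every row.
sample_header = ["wonum", "status", "legacy_code"]
sample_rows = [
    ["1001", "OPEN", ""],
    ["1002", "", "   "],
    ["1003", "CLOSED", ""],
]
to_skip = sparse_columns(sample_header, sample_rows)
print(to_skip)
```

On a file too large for memory, the same counts could be accumulated while reading the rows one at a time instead of holding them in a list.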
The main Maximo data file also appeared to be the product of several reports having been appended to each other, because headers and hyphens used to separate header rows from text rows were repeated several times in the data of the main Maximo file. In the past, Progress Queens has raised questions about the quality of the data being produced by NYCHA.
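Repeated headers and hyphen separators of the kind described above could be filtered out along the following lines. The sketch assumes that every repeated header is identical to the first one and that a separator row contains nothing but hyphens, commas, and spaces; the function name and sample values are hypothetical.

```python
import re

# A separator row: at least one hyphen, and otherwise only hyphens, commas, spaces.
SEPARATOR = re.compile(r"^[\s,]*-[-\s,]*$")


def drop_repeated_headers(lines):
    """Yield the first header once, then only data rows."""
    header = None
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip() or SEPARATOR.match(text):
            continue  # blank line or hyphen separator
        if header is None:
            header = text
            yield text
        elif text != header:
            yield text  # a data row; repeated headers are skipped


# Hypothetical sample: two reports appended to each other.
sample_lines = [
    "WONUM,STATUS\n",
    "-----,------\n",
    "1001,OPEN\n",
    "WONUM,STATUS\n",
    "-----,------\n",
    "1002,CLOSE\n",
]
cleaned = list(drop_repeated_headers(sample_lines))
print(cleaned)
```

A legitimate data row consisting only of hyphens would also be dropped by this filter, which is a trade-off of the assumption stated above.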
According to information obtained by Progress Queens, the U.S. Attorney's Office uses Concordance and IPRO to manage large-scale, electronically-stored information. Concordance is an application that automatically performs optical character recognition of electronic files to speed up the legal review of documents, and IPRO is the maker of software that provides technological tools to conduct electronic discovery for legal professionals. It is not known how those applications would allow Federal prosecutors to conduct data analysis of millions of rows of data that represent NYCHA's property maintenance logs.
Multiple requests made by Progress Queens to the U.S. Attorney's Office to interview the Assistant U.S. Attorneys conducting the reported NYCHA investigation have been denied.
A note about the process to read and review NYCHA's second FOIL response
Progress Queens would not have been able to open the data files produced by NYCHA in its second FOIL response without valuable assistance provided by members of Beta NYC and Learn Python NYC, groups that advocate for open data and support independent study of the Python programming language, respectively.
The data produced by NYCHA were .TXT files that were inflated due to the fixed-width format given by the database management systems from which the files were exported. With assistance from data experts, Progress Queens was able to open the files after the extra white space had been stripped and the .TXT files had been converted into .CSV format.
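The exact processing performed by the data experts has not been described in detail. As a rough sketch, assuming that each value was padded with spaces to a fixed width and that commas separated the values, the padding could be stripped one row at a time so that a very large file never has to fit in memory. The function name and sample values are hypothetical.

```python
import csv
import io


def strip_padding(src, dst):
    """Copy rows from src to dst, trimming white space around every field."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:  # rows are streamed one at a time
        writer.writerow([field.strip() for field in row])
        count += 1
    return count


# Hypothetical sample: two rows padded to fixed widths.
padded = io.StringIO(
    "1001      ,OPEN      ,Leak in kitchen          \n"
    "1002      ,CLOSED    ,No heat                  \n"
)
trimmed = io.StringIO()
rows_written = strip_padding(padded, trimmed)
print(trimmed.getvalue())
```

With real files, `src` and `dst` would be file objects opened with `newline=""`, as the documentation for the Python `csv` module recommends.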
For example, the Maximo files comprised approximately 350 GB of multiple .TXT files, including what appeared to be one file that represented the main database dump for this section of files. That main file measured over 210 GB as a .TXT file. The processed files representing the Maximo section will be publicly posted online in due course.
Meanwhile, the Siebel files comprised six (6) .TXT files that collectively measured approximately 60 GB of data. After the extra blank space was eliminated, the six files were converted into .CSV files and concatenated into a single file that measured only 2.59 GB of data. The concatenated .CSV file has been posted online, making that information open data.
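Concatenation of this kind could be done as in the sketch below, which assumes that every .CSV file begins with the same header row and keeps that header only once. The function name and sample data are hypothetical, and this is not necessarily how the posted file was produced.

```python
import io


def concatenate_csv(sources, dst):
    """Write all sources to dst, keeping the shared header row only once."""
    header = None
    for src in sources:
        first = src.readline()
        if header is None:
            header = first
            dst.write(header)
        elif first != header:
            raise ValueError("files do not share the same header row")
        for line in src:  # copy the remaining rows unchanged
            dst.write(line)


# Hypothetical sample: two small extracts with the same header.
part_one = io.StringIO("id,status\n1,OPEN\n")
part_two = io.StringIO("id,status\n2,CLOSED\n")
combined = io.StringIO()
concatenate_csv([part_one, part_two], combined)
print(combined.getvalue())
```

Raising an error on a mismatched header is a deliberate choice in this sketch, because silently appending files with different columns would misalign the data.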