Class ParquetFileReader

    • Field Detail

      • PARQUET_READ_PARALLELISM

        public static String PARQUET_READ_PARALLELISM
      • f

        protected final org.apache.parquet.io.SeekableInputStream f
    • Constructor Detail

      • ParquetFileReader

        @Deprecated
        public ParquetFileReader​(org.apache.hadoop.conf.Configuration configuration,
                                 org.apache.hadoop.fs.Path filePath,
                                 List<BlockMetaData> blocks,
                                 List<org.apache.parquet.column.ColumnDescriptor> columns)
                          throws IOException
        Deprecated.
        will be removed in 2.0.0.
        Parameters:
        configuration - the Hadoop conf
        filePath - Path for the parquet file
        blocks - the blocks to read
        columns - the columns to read (their path)
        Throws:
        IOException - if the file cannot be opened
      • ParquetFileReader

        @Deprecated
        public ParquetFileReader​(org.apache.hadoop.conf.Configuration configuration,
                                 FileMetaData fileMetaData,
                                 org.apache.hadoop.fs.Path filePath,
                                 List<BlockMetaData> blocks,
                                 List<org.apache.parquet.column.ColumnDescriptor> columns)
                          throws IOException
        Deprecated.
        will be removed in 2.0.0.
        Parameters:
        configuration - the Hadoop conf
        fileMetaData - the file metadata for the Parquet file
        filePath - Path for the parquet file
        blocks - the blocks to read
        columns - the columns to read (their path)
        Throws:
        IOException - if the file cannot be opened
      • ParquetFileReader

        @Deprecated
        public ParquetFileReader​(org.apache.hadoop.conf.Configuration conf,
                                 org.apache.hadoop.fs.Path file,
                                 ParquetMetadata footer)
                          throws IOException
        Deprecated.
        will be removed in 2.0.0.
        Parameters:
        conf - the Hadoop Configuration
        file - Path to a parquet file
        footer - a ParquetMetadata footer already read from the file
        Throws:
        IOException - if the file cannot be opened
    • Method Detail

      • readAllFootersInParallelUsingSummaryFiles

        @Deprecated
        public static List<Footer> readAllFootersInParallelUsingSummaryFiles​(org.apache.hadoop.conf.Configuration configuration,
                                                                             List<org.apache.hadoop.fs.FileStatus> partFiles)
                                                                      throws IOException
        Deprecated.
        metadata files are not recommended and will be removed in 2.0.0
        For each file provided, checks whether a summary file exists. If a summary file is found, it is used; otherwise the file footer is used.
        Parameters:
        configuration - the Hadoop conf used to connect to the file system
        partFiles - the part files to read
        Returns:
        the footers for those files using the summary file if possible.
        Throws:
        IOException - if there is an exception while reading footers
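        The choice described above (summary file if present, file footer otherwise) can be sketched as follows. This is a minimal model, not the library's implementation: the class FooterSourceSketch, its Source enum, and the exists predicate are hypothetical, and the sketch assumes the summary file is named "_metadata" and sits in the same directory as the part files.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical stand-in: models only the "summary file, else file footer" choice. */
class FooterSourceSketch {
    /** Where the footer metadata for a part file would be taken from. */
    enum Source { SUMMARY_FILE, FILE_FOOTER }

    // Assumption: the summary file is named "_metadata" and sits next to the part files.
    static final String SUMMARY_FILE_NAME = "_metadata";

    /** Picks a source for each part file; exists abstracts the file system lookup. */
    static Map<Path, Source> chooseSources(List<Path> partFiles, Predicate<Path> exists) {
        Map<Path, Source> result = new LinkedHashMap<>();
        for (Path part : partFiles) {
            Path summary = part.resolveSibling(SUMMARY_FILE_NAME);
            result.put(part, exists.test(summary) ? Source.SUMMARY_FILE : Source.FILE_FOOTER);
        }
        return result;
    }

    public static void main(String[] args) {
        // Only directory /data/a has a summary file in this made-up layout.
        Set<Path> existing = Collections.singleton(Paths.get("/data/a/_metadata"));
        List<Path> parts = Arrays.asList(
                Paths.get("/data/a/part-0.parquet"),
                Paths.get("/data/b/part-0.parquet"));
        chooseSources(parts, existing::contains)
                .forEach((part, source) -> System.out.println(part + " -> " + source));
    }
}
```

        Passing the existence check as a predicate keeps the sketch independent of any particular file system.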
      • readAllFootersInParallelUsingSummaryFiles

        @Deprecated
        public static List<Footer> readAllFootersInParallelUsingSummaryFiles​(org.apache.hadoop.conf.Configuration configuration,
                                                                             Collection<org.apache.hadoop.fs.FileStatus> partFiles,
                                                                             boolean skipRowGroups)
                                                                      throws IOException
        Deprecated.
        metadata files are not recommended and will be removed in 2.0.0
        For each file provided, checks whether a summary file exists. If a summary file is found, it is used; otherwise the file footer is used.
        Parameters:
        configuration - the Hadoop conf used to connect to the file system
        partFiles - the part files to read
        skipRowGroups - whether to skip the row groups in the footers
        Returns:
        the footers for those files using the summary file if possible.
        Throws:
        IOException - if there is an exception while reading footers
      • readAllFootersInParallel

        @Deprecated
        public static List<Footer> readAllFootersInParallel​(org.apache.hadoop.conf.Configuration configuration,
                                                            List<org.apache.hadoop.fs.FileStatus> partFiles)
                                                     throws IOException
        Deprecated.
        metadata files are not recommended and will be removed in 2.0.0
        Parameters:
        configuration - the conf to access the File System
        partFiles - the files to read
        Returns:
        the footers
        Throws:
        IOException - if an exception was raised while reading footers
      • readAllFootersInParallel

        @Deprecated
        public static List<Footer> readAllFootersInParallel​(org.apache.hadoop.conf.Configuration configuration,
                                                            List<org.apache.hadoop.fs.FileStatus> partFiles,
                                                            boolean skipRowGroups)
                                                     throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Reads all the footers of the files provided (not using summary files).
        Parameters:
        configuration - the conf to access the File System
        partFiles - the files to read
        skipRowGroups - whether to skip the row group info
        Returns:
        the footers
        Throws:
        IOException - if there is an exception while reading footers
      • readAllFootersInParallel

        @Deprecated
        public static List<Footer> readAllFootersInParallel​(org.apache.hadoop.conf.Configuration configuration,
                                                            org.apache.hadoop.fs.FileStatus fileStatus,
                                                            boolean skipRowGroups)
                                                     throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Read the footers of all the files under that path (recursively) not using summary files.
        Parameters:
        configuration - a configuration
        fileStatus - a file status to recursively list
        skipRowGroups - whether to skip reading row group metadata
        Returns:
        a list of footers
        Throws:
        IOException - if an exception is thrown while reading the footers
      • readAllFootersInParallel

        @Deprecated
        public static List<Footer> readAllFootersInParallel​(org.apache.hadoop.conf.Configuration configuration,
                                                            org.apache.hadoop.fs.FileStatus fileStatus)
                                                     throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Read the footers of all the files under that path (recursively) not using summary files. Row groups are not skipped.
        Parameters:
        configuration - the configuration to access the FS
        fileStatus - the root dir
        Returns:
        all the footers
        Throws:
        IOException - if an exception is thrown while reading the footers
      • readFooters

        @Deprecated
        public static List<Footer> readFooters​(org.apache.hadoop.conf.Configuration configuration,
                                               org.apache.hadoop.fs.Path path)
                                        throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Parameters:
        configuration - a configuration
        path - a file path
        Returns:
        a list of footers
        Throws:
        IOException - if an exception is thrown while reading the footers
      • readFooters

        @Deprecated
        public static List<Footer> readFooters​(org.apache.hadoop.conf.Configuration configuration,
                                               org.apache.hadoop.fs.FileStatus pathStatus)
                                        throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        This always returns the row groups.
        Parameters:
        configuration - a configuration
        pathStatus - a file status to read footers from
        Returns:
        a list of footers
        Throws:
        IOException - if an exception is thrown while reading the footers
      • readFooters

        @Deprecated
        public static List<Footer> readFooters​(org.apache.hadoop.conf.Configuration configuration,
                                               org.apache.hadoop.fs.FileStatus pathStatus,
                                               boolean skipRowGroups)
                                        throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Read the footers of all the files under that path (recursively) using summary files if possible
        Parameters:
        configuration - the configuration to access the FS
        pathStatus - the root dir
        skipRowGroups - whether to skip reading row group metadata
        Returns:
        all the footers
        Throws:
        IOException - if an exception is thrown while reading the footers
      • readSummaryFile

        @Deprecated
        public static List<Footer> readSummaryFile​(org.apache.hadoop.conf.Configuration configuration,
                                                   org.apache.hadoop.fs.FileStatus summaryStatus)
                                            throws IOException
        Deprecated.
        metadata files are not recommended and will be removed in 2.0.0
        Specifically reads a given summary file
        Parameters:
        configuration - a configuration
        summaryStatus - file status for a summary file
        Returns:
        the metadata translated for each file
        Throws:
        IOException - if an exception is thrown while reading the summary file
      • readFooter

        @Deprecated
        public static final ParquetMetadata readFooter​(org.apache.hadoop.conf.Configuration configuration,
                                                       org.apache.hadoop.fs.Path file)
                                                throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Reads the metadata block in the footer of the file.
        Parameters:
        configuration - a configuration
        file - the Parquet file
        Returns:
        the metadata blocks in the footer
        Throws:
        IOException - if an error occurs while reading the file
      • readFooter

        @Deprecated
        public static ParquetMetadata readFooter​(org.apache.hadoop.conf.Configuration configuration,
                                                 org.apache.hadoop.fs.Path file,
                                                 ParquetMetadataConverter.MetadataFilter filter)
                                          throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Reads the metadata in the footer of the file, skipping row groups (or not) based on the provided filter.
        Parameters:
        configuration - a configuration
        file - the Parquet file
        filter - the filter to apply to row groups
        Returns:
        the metadata with row groups filtered.
        Throws:
        IOException - if an error occurs while reading the file
      • readFooter

        @Deprecated
        public static final ParquetMetadata readFooter​(org.apache.hadoop.conf.Configuration configuration,
                                                       org.apache.hadoop.fs.FileStatus file)
                                                throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile, ParquetReadOptions)
        Parameters:
        configuration - a configuration
        file - the Parquet file
        Returns:
        the metadata with row groups.
        Throws:
        IOException - if an error occurs while reading the file
      • open

        @Deprecated
        public static ParquetFileReader open​(org.apache.hadoop.conf.Configuration conf,
                                             org.apache.hadoop.fs.Path file)
                                      throws IOException
        Deprecated.
        will be removed in 2.0.0; use open(InputFile)
        Parameters:
        conf - a configuration
        file - a file path to open
        Returns:
        a parquet file reader
        Throws:
        IOException - if there is an error while opening the file
      • open

        @Deprecated
        public static ParquetFileReader open​(org.apache.hadoop.conf.Configuration conf,
                                             org.apache.hadoop.fs.Path file,
                                             ParquetMetadata footer)
                                      throws IOException
        Deprecated.
        will be removed in 2.0.0
        Parameters:
        conf - a configuration
        file - a file path to open
        footer - a footer for the file if already loaded
        Returns:
        a parquet file reader
        Throws:
        IOException - if there is an error while opening the file
      • open

        public static ParquetFileReader open​(org.apache.parquet.io.InputFile file)
                                      throws IOException
        Open a file.
        Parameters:
        file - an input file
        Returns:
        an open ParquetFileReader
        Throws:
        IOException - if there is an error while opening the file
      • open

        public static ParquetFileReader open​(org.apache.parquet.io.InputFile file,
                                             ParquetReadOptions options)
                                      throws IOException
        Open a file with options.
        Parameters:
        file - an input file
        options - parquet read options
        Returns:
        an open ParquetFileReader
        Throws:
        IOException - if there is an error while opening the file
      • getRecordCount

        public long getRecordCount()
      • getFilteredRecordCount

        public long getFilteredRecordCount()
      • getPath

        @Deprecated
        public org.apache.hadoop.fs.Path getPath()
        Deprecated.
        will be removed in 2.0.0; use getFile() instead
        Returns:
        the path for this file
      • getFile

        public String getFile()
      • setRequestedSchema

        public void setRequestedSchema​(org.apache.parquet.schema.MessageType projection)
      • readRowGroup

        public org.apache.parquet.column.page.PageReadStore readRowGroup​(int blockIndex)
                                                                  throws IOException
        Reads all the columns requested from the row group at the specified block.
        Parameters:
        blockIndex - the index of the requested block
        Returns:
        the PageReadStore which can provide PageReaders for each column.
        Throws:
        IOException - if an error occurs while reading
      • readNextRowGroup

        public org.apache.parquet.column.page.PageReadStore readNextRowGroup()
                                                                      throws IOException
        Reads all the columns requested from the row group at the current file position.
        Returns:
        the PageReadStore which can provide PageReaders for each column.
        Throws:
        IOException - if an error occurs while reading
      • readFilteredRowGroup

        public org.apache.parquet.column.page.PageReadStore readFilteredRowGroup​(int blockIndex)
                                                                          throws IOException
        Reads all the requested columns from the specified row group. It may skip specific pages based on the column indexes according to the actual filter. Because rows are not aligned across the pages of different columns, row synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
        Parameters:
        blockIndex - the index of the requested block
        Returns:
        the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block
        Throws:
        IOException - if an error occurs while reading
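        Why page skipping can require row synchronization is easier to see on a small model. The sketch below is not the Parquet API: RowSyncSketch, Page, and all page boundaries are made up for illustration, and the filter is assumed to select a single inclusive row range. Each column keeps only the pages that overlap the selected rows, but since page boundaries differ per column, the surviving pages cover different rows until the unselected rows are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of page skipping and row synchronization; not the Parquet API. */
class RowSyncSketch {
    /** A page covering the rows firstRow .. firstRow + rowCount - 1. */
    static class Page {
        final long firstRow;
        final long rowCount;

        Page(long firstRow, long rowCount) {
            this.firstRow = firstRow;
            this.rowCount = rowCount;
        }

        /** True if the page holds at least one row in [from, to] (inclusive). */
        boolean overlaps(long from, long to) {
            return firstRow <= to && firstRow + rowCount - 1 >= from;
        }
    }

    /** Page skipping: keeps only the pages that hold at least one selected row. */
    static List<Page> keepPages(List<Page> pages, long from, long to) {
        List<Page> kept = new ArrayList<>();
        for (Page page : pages) {
            if (page.overlaps(from, to)) {
                kept.add(page);
            }
        }
        return kept;
    }

    /** Row indexes a reader would see if it decoded every value of the kept pages. */
    static List<Long> rowsIn(List<Page> kept) {
        List<Long> rows = new ArrayList<>();
        for (Page page : kept) {
            for (long row = page.firstRow; row < page.firstRow + page.rowCount; row++) {
                rows.add(row);
            }
        }
        return rows;
    }

    /** Synchronization: drops rows outside [from, to] so every column yields the same rows. */
    static List<Long> synchronize(List<Long> rows, long from, long to) {
        List<Long> selected = new ArrayList<>();
        for (long row : rows) {
            if (row >= from && row <= to) {
                selected.add(row);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Two columns of the same 300-row row group with different page boundaries.
        List<Page> columnA = Arrays.asList(new Page(0, 100), new Page(100, 100), new Page(200, 100));
        List<Page> columnB = Arrays.asList(new Page(0, 150), new Page(150, 150));
        long from = 120;
        long to = 160; // rows selected by the (made-up) filter

        List<Long> rowsA = rowsIn(keepPages(columnA, from, to));
        List<Long> rowsB = rowsIn(keepPages(columnB, from, to));
        System.out.println("rows decoded: A=" + rowsA.size() + ", B=" + rowsB.size());
        System.out.println("aligned after synchronization: "
                + synchronize(rowsA, from, to).equals(synchronize(rowsB, from, to)));
    }
}
```

        In this model column A keeps one page (rows 100 to 199) while column B keeps both of its pages (rows 0 to 299), so the two columns line up only after the rows outside 120 to 160 are discarded.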
      • readNextFilteredRowGroup

        public org.apache.parquet.column.page.PageReadStore readNextFilteredRowGroup()
                                                                              throws IOException
        Reads all the requested columns from the row group at the current file position. It may skip specific pages based on the column indexes according to the actual filter. Because rows are not aligned across the pages of different columns, row synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
        Returns:
        the PageReadStore which can provide PageReaders for each column
        Throws:
        IOException - if an error occurs while reading
      • skipNextRowGroup

        public boolean skipNextRowGroup()
      • getNextDictionaryReader

        public org.apache.parquet.column.page.DictionaryPageReadStore getNextDictionaryReader()
        Returns a DictionaryPageReadStore for the row group that would be returned by calling readNextRowGroup() or skipped by calling skipNextRowGroup().
        Returns:
        a DictionaryPageReadStore for the next row group
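        One way to use this kind of look-ahead is to inspect the dictionary of the upcoming row group and then either read or skip it. The sketch below models that pattern with hypothetical stand-ins (RowGroupCursorSketch, RowGroup, nextDictionary, readNext, skipNext); none of them are Parquet classes. It assumes a single string column whose values are all covered by the dictionary, and it assumes the stand-in methods signal the end of the file with null or false.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a reader positioned before its next row group. */
class RowGroupCursorSketch {
    /** A row group reduced to one column: its dictionary and its values. */
    static class RowGroup {
        final Set<String> dictionary;
        final List<String> values;

        RowGroup(String... values) {
            this.values = Arrays.asList(values);
            this.dictionary = new HashSet<>(this.values);
        }
    }

    private final List<RowGroup> groups;
    private int next = 0; // index of the next row group
    int groupsRead = 0;   // how many row groups were actually read

    RowGroupCursorSketch(List<RowGroup> groups) {
        this.groups = groups;
    }

    /** Dictionary of the next row group, or null at the end; does not advance. */
    Set<String> nextDictionary() {
        return next < groups.size() ? groups.get(next).dictionary : null;
    }

    /** Returns the next row group and advances, or null at the end. */
    RowGroup readNext() {
        if (next >= groups.size()) {
            return null;
        }
        groupsRead++;
        return groups.get(next++);
    }

    /** Advances past the next row group without reading it; false at the end. */
    boolean skipNext() {
        if (next >= groups.size()) {
            return false;
        }
        next++;
        return true;
    }

    /** Counts occurrences of wanted, reading only row groups whose dictionary contains it. */
    static int count(RowGroupCursorSketch cursor, String wanted) {
        int count = 0;
        Set<String> dictionary;
        while ((dictionary = cursor.nextDictionary()) != null) {
            if (!dictionary.contains(wanted)) {
                cursor.skipNext(); // the value cannot occur in this row group
                continue;
            }
            for (String value : cursor.readNext().values) {
                if (value.equals(wanted)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RowGroupCursorSketch cursor = new RowGroupCursorSketch(Arrays.asList(
                new RowGroup("a", "b", "a"), new RowGroup("c", "c"), new RowGroup("a")));
        System.out.println("matches: " + count(cursor, "a"));
        System.out.println("row groups read: " + cursor.groupsRead);
    }
}
```

        The look-ahead call never advances the position in this model, so each iteration must end with exactly one read or skip.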
      • getDictionaryReader

        public org.apache.parquet.hadoop.DictionaryPageReader getDictionaryReader​(int blockIndex)
      • getDictionaryReader

        public org.apache.parquet.hadoop.DictionaryPageReader getDictionaryReader​(BlockMetaData block)
      • getBloomFilterDataReader

        public BloomFilterReader getBloomFilterDataReader​(int blockIndex)
      • readBloomFilter

        public org.apache.parquet.column.values.bloomfilter.BloomFilter readBloomFilter​(ColumnChunkMetaData meta)
                                                                                 throws IOException
        Reads Bloom filter data for the given column chunk.
        Parameters:
        meta - a column's ColumnChunkMetaData to read the Bloom filter from
        Returns:
        a BloomFilter object.
        Throws:
        IOException - if there is an error while reading the Bloom filter.
      • readColumnIndex

        @Private
        public org.apache.parquet.internal.column.columnindex.ColumnIndex readColumnIndex​(ColumnChunkMetaData column)
                                                                                   throws IOException
        Parameters:
        column - the column chunk which the column index is to be returned for
        Returns:
        the column index for the specified column chunk or null if there is no index
        Throws:
        IOException - if any I/O error occurs during reading the file
      • readOffsetIndex

        @Private
        public org.apache.parquet.internal.column.columnindex.OffsetIndex readOffsetIndex​(ColumnChunkMetaData column)
                                                                                   throws IOException
        Parameters:
        column - the column chunk which the offset index is to be returned for
        Returns:
        the offset index for the specified column chunk or null if there is no index
        Throws:
        IOException - if any I/O error occurs during reading the file