Package org.apache.parquet.hadoop
Class ParquetOutputFormat<T>
- java.lang.Object
-
- org.apache.hadoop.mapreduce.OutputFormat<K,V>
-
- org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
-
- org.apache.parquet.hadoop.ParquetOutputFormat<T>
-
- Type Parameters:
T - the type of the materialized records
- Direct Known Subclasses:
ExampleOutputFormat
public class ParquetOutputFormat<T> extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
OutputFormat to write to a Parquet file. It requires a WriteSupport to convert the actual records to the underlying format. It requires the schema of the incoming records (provided by the write support). It allows storing extra metadata in the footer (for example: for schema compatibility purposes when converting from a different schema language).

The format configuration settings in the job configuration:

# The block size is the size of a row group being buffered in memory
# this limits the memory usage when writing
# Larger values will improve the IO when reading but consume more memory when writing
parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024

# The page size is for compression. When reading, each page can be decompressed independently.
# A block is composed of pages. The page is the smallest unit that must be read fully to access a single record.
# If this value is too small, the compression will deteriorate
parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024

# There is one dictionary page per column per row group when dictionary encoding is used.
# The dictionary page size works like the page size but for dictionaries
parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024

# The compression algorithm used to compress pages
parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD. Default: UNCOMPRESSED. Supersedes mapred.output.compress*

# The write support class to convert the records written to the OutputFormat into the events accepted by the record consumer
# Usually provided by a specific ParquetOutputFormat subclass
parquet.write.support.class= # fully qualified name

# To enable/disable dictionary encoding
parquet.enable.dictionary=true # false to disable dictionary encoding

# To enable/disable summary metadata aggregation at the end of a MR job
# The default is true (enabled)
parquet.enable.summary-metadata=true # false to disable summary aggregation

# Maximum size (in bytes) allowed as padding to align row groups
# This is also the minimum size of a row group. Default: 8388608
parquet.writer.max-padding=8388608 # 8 MB
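As a sanity check, the default byte values quoted above are plain powers-of-two arithmetic (128 MB, 1 MB, and 8 MB respectively):

```java
public class ParquetDefaultSizes {
    public static void main(String[] args) {
        // Defaults quoted in the configuration block above
        System.out.println(128 * 1024 * 1024); // parquet.block.size -> 134217728
        System.out.println(1024 * 1024);       // parquet.page.size and parquet.dictionary.page.size -> 1048576
        System.out.println(8 * 1024 * 1024);   // parquet.writer.max-padding -> 8388608
    }
}
```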
If parquet.compression is not set, the following properties are checked (FileOutputFormat behavior). Note that we explicitly disallow custom codecs:

mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec # the codec must be one of Snappy, GZip or LZO

If none of those is set, the data is uncompressed.
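The configuration keys above can also be set programmatically through this class's static setters. A minimal driver sketch, assuming Hadoop and parquet-hadoop (including the bundled example GroupWriteSupport) are on the classpath; the output path and job name are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetJobSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "parquet-write");
        job.setOutputFormatClass(ParquetOutputFormat.class);

        // parquet.write.support.class: converts records into record-consumer events
        ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class);

        // parquet.compression, parquet.block.size, parquet.page.size, parquet.enable.dictionary
        ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
        ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024); // row group size in bytes
        ParquetOutputFormat.setPageSize(job, 1024 * 1024);        // page size in bytes
        ParquetOutputFormat.setEnableDictionary(job, true);

        FileOutputFormat.setOutputPath(job, new Path("/out/parquet"));
        return job;
    }
}
```

Note that GroupWriteSupport additionally needs the record schema registered in the configuration before the job runs; whichever WriteSupport you use must be able to provide the schema of the incoming records.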
-
-
Nested Class Summary
Nested Classes:
static class ParquetOutputFormat.JobSummaryLevel
-
Field Summary
Fields:
static String BLOCK_SIZE
static String BLOOM_FILTER_ENABLED
static String BLOOM_FILTER_EXPECTED_NDV
static String BLOOM_FILTER_FPP
static String BLOOM_FILTER_MAX_BYTES
static String COLUMN_INDEX_TRUNCATE_LENGTH
static String COMPRESSION
static String DICTIONARY_PAGE_SIZE
static String ENABLE_DICTIONARY
static String ENABLE_JOB_SUMMARY (Deprecated)
static String ESTIMATE_PAGE_SIZE_CHECK
static String JOB_SUMMARY_LEVEL (must be one of the values in ParquetOutputFormat.JobSummaryLevel, case insensitive)
static String MAX_PADDING_BYTES
static String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK
static String MEMORY_POOL_RATIO
static String MIN_MEMORY_ALLOCATION
static String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK
static String PAGE_ROW_COUNT_LIMIT
static String PAGE_SIZE
static String PAGE_WRITE_CHECKSUM_ENABLED
static String STATISTICS_TRUNCATE_LENGTH
static String VALIDATION
static String WRITE_SUPPORT_CLASS
static String WRITER_VERSION
-
Constructor Summary
Constructors:
ParquetOutputFormat() - used when directly using the output format and configuring the write support implementation using parquet.write.support.class
ParquetOutputFormat(S writeSupport) - constructor used when this OutputFormat is wrapped in another one (in Pig for example)
-
Method Summary
Methods:
static FileEncryptionProperties createEncryptionProperties(org.apache.hadoop.conf.Configuration fileHadoopConfig, org.apache.hadoop.fs.Path tempFilePath, WriteSupport.WriteContext fileWriteContext)
static int getBlockSize(org.apache.hadoop.conf.Configuration configuration) (Deprecated)
static int getBlockSize(org.apache.hadoop.mapreduce.JobContext jobContext)
static boolean getBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf)
static int getBloomFilterMaxBytes(org.apache.hadoop.conf.Configuration conf)
static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression(org.apache.hadoop.conf.Configuration configuration)
static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression(org.apache.hadoop.mapreduce.JobContext jobContext)
static int getDictionaryPageSize(org.apache.hadoop.conf.Configuration configuration)
static int getDictionaryPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
static boolean getEnableDictionary(org.apache.hadoop.conf.Configuration configuration)
static boolean getEnableDictionary(org.apache.hadoop.mapreduce.JobContext jobContext)
static boolean getEstimatePageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
static ParquetOutputFormat.JobSummaryLevel getJobSummaryLevel(org.apache.hadoop.conf.Configuration conf)
static long getLongBlockSize(org.apache.hadoop.conf.Configuration configuration)
static int getMaxRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
static MemoryManager getMemoryManager()
static int getMinRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context)
static int getPageSize(org.apache.hadoop.conf.Configuration configuration)
static int getPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
static boolean getPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, org.apache.parquet.hadoop.metadata.CompressionCodecName codec)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, org.apache.parquet.hadoop.metadata.CompressionCodecName codec, ParquetFileWriter.Mode mode)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode)
org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, ParquetFileWriter.Mode mode)
static boolean getValidation(org.apache.hadoop.conf.Configuration configuration)
static boolean getValidation(org.apache.hadoop.mapreduce.JobContext jobContext)
static org.apache.parquet.column.ParquetProperties.WriterVersion getWriterVersion(org.apache.hadoop.conf.Configuration configuration)
WriteSupport<T> getWriteSupport(org.apache.hadoop.conf.Configuration configuration)
static Class<?> getWriteSupportClass(org.apache.hadoop.conf.Configuration configuration)
static boolean isCompressionSet(org.apache.hadoop.conf.Configuration configuration)
static boolean isCompressionSet(org.apache.hadoop.mapreduce.JobContext jobContext)
static void setBlockSize(org.apache.hadoop.mapreduce.Job job, int blockSize)
static void setColumnIndexTruncateLength(org.apache.hadoop.conf.Configuration conf, int length)
static void setColumnIndexTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length)
static void setCompression(org.apache.hadoop.mapreduce.Job job, org.apache.parquet.hadoop.metadata.CompressionCodecName compression)
static void setDictionaryPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize)
static void setEnableDictionary(org.apache.hadoop.mapreduce.Job job, boolean enableDictionary)
static void setMaxPaddingSize(org.apache.hadoop.conf.Configuration conf, int maxPaddingSize)
static void setMaxPaddingSize(org.apache.hadoop.mapreduce.JobContext jobContext, int maxPaddingSize)
static void setPageRowCountLimit(org.apache.hadoop.conf.Configuration conf, int rowCount)
static void setPageRowCountLimit(org.apache.hadoop.mapreduce.JobContext jobContext, int rowCount)
static void setPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize)
static void setPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf, boolean val)
static void setPageWriteChecksumEnabled(org.apache.hadoop.mapreduce.JobContext jobContext, boolean val)
static void setStatisticsTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length)
static void setValidation(org.apache.hadoop.conf.Configuration configuration, boolean validating)
static void setValidation(org.apache.hadoop.mapreduce.JobContext jobContext, boolean validating)
static void setWriteSupportClass(org.apache.hadoop.mapred.JobConf job, Class<?> writeSupportClass)
static void setWriteSupportClass(org.apache.hadoop.mapreduce.Job job, Class<?> writeSupportClass)
Methods inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
checkOutputSpecs, getCompressOutput, getDefaultWorkFile, getOutputCompressorClass, getOutputName, getOutputPath, getPathForWorkFile, getUniqueFile, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputName, setOutputPath
-
-
-
-
Field Detail
-
ENABLE_JOB_SUMMARY
@Deprecated public static final String ENABLE_JOB_SUMMARY
Deprecated. An alias for JOB_SUMMARY_LEVEL, where true means ALL and false means NONE.
- See Also:
- Constant Field Values
-
JOB_SUMMARY_LEVEL
public static final String JOB_SUMMARY_LEVEL
Must be one of the values in ParquetOutputFormat.JobSummaryLevel (case insensitive).
- See Also:
- Constant Field Values
-
BLOCK_SIZE
public static final String BLOCK_SIZE
- See Also:
- Constant Field Values
-
PAGE_SIZE
public static final String PAGE_SIZE
- See Also:
- Constant Field Values
-
COMPRESSION
public static final String COMPRESSION
- See Also:
- Constant Field Values
-
WRITE_SUPPORT_CLASS
public static final String WRITE_SUPPORT_CLASS
- See Also:
- Constant Field Values
-
DICTIONARY_PAGE_SIZE
public static final String DICTIONARY_PAGE_SIZE
- See Also:
- Constant Field Values
-
ENABLE_DICTIONARY
public static final String ENABLE_DICTIONARY
- See Also:
- Constant Field Values
-
VALIDATION
public static final String VALIDATION
- See Also:
- Constant Field Values
-
WRITER_VERSION
public static final String WRITER_VERSION
- See Also:
- Constant Field Values
-
MEMORY_POOL_RATIO
public static final String MEMORY_POOL_RATIO
- See Also:
- Constant Field Values
-
MIN_MEMORY_ALLOCATION
public static final String MIN_MEMORY_ALLOCATION
- See Also:
- Constant Field Values
-
MAX_PADDING_BYTES
public static final String MAX_PADDING_BYTES
- See Also:
- Constant Field Values
-
MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK
- See Also:
- Constant Field Values
-
MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK
- See Also:
- Constant Field Values
-
ESTIMATE_PAGE_SIZE_CHECK
public static final String ESTIMATE_PAGE_SIZE_CHECK
- See Also:
- Constant Field Values
-
COLUMN_INDEX_TRUNCATE_LENGTH
public static final String COLUMN_INDEX_TRUNCATE_LENGTH
- See Also:
- Constant Field Values
-
STATISTICS_TRUNCATE_LENGTH
public static final String STATISTICS_TRUNCATE_LENGTH
- See Also:
- Constant Field Values
-
BLOOM_FILTER_ENABLED
public static final String BLOOM_FILTER_ENABLED
- See Also:
- Constant Field Values
-
BLOOM_FILTER_EXPECTED_NDV
public static final String BLOOM_FILTER_EXPECTED_NDV
- See Also:
- Constant Field Values
-
BLOOM_FILTER_MAX_BYTES
public static final String BLOOM_FILTER_MAX_BYTES
- See Also:
- Constant Field Values
-
BLOOM_FILTER_FPP
public static final String BLOOM_FILTER_FPP
- See Also:
- Constant Field Values
-
PAGE_ROW_COUNT_LIMIT
public static final String PAGE_ROW_COUNT_LIMIT
- See Also:
- Constant Field Values
-
PAGE_WRITE_CHECKSUM_ENABLED
public static final String PAGE_WRITE_CHECKSUM_ENABLED
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
ParquetOutputFormat
public ParquetOutputFormat(S writeSupport)
Constructor used when this OutputFormat is wrapped in another one (in Pig for example).
- Type Parameters:
S - the Java write support type
- Parameters:
writeSupport - the write support used to convert the incoming records
-
ParquetOutputFormat
public ParquetOutputFormat()
Used when directly using the output format and configuring the write support implementation using parquet.write.support.class.
- Type Parameters:
S - the Java write support type
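The two constructors correspond to the two usage styles described above. A wrapping framework can pass a WriteSupport instance directly rather than naming a class in the configuration; a minimal sketch using the bundled example GroupWriteSupport (assuming parquet-hadoop is on the classpath):

```java
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.example.GroupWriteSupport;

// Wrapping style (as Pig does): supply the write support instance directly
// instead of naming it via parquet.write.support.class.
ParquetOutputFormat<Group> format =
    new ParquetOutputFormat<>(new GroupWriteSupport());
```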
-
-
Method Detail
-
getJobSummaryLevel
public static ParquetOutputFormat.JobSummaryLevel getJobSummaryLevel(org.apache.hadoop.conf.Configuration conf)
-
setWriteSupportClass
public static void setWriteSupportClass(org.apache.hadoop.mapreduce.Job job, Class<?> writeSupportClass)
-
setWriteSupportClass
public static void setWriteSupportClass(org.apache.hadoop.mapred.JobConf job, Class<?> writeSupportClass)
-
getWriteSupportClass
public static Class<?> getWriteSupportClass(org.apache.hadoop.conf.Configuration configuration)
-
setBlockSize
public static void setBlockSize(org.apache.hadoop.mapreduce.Job job, int blockSize)
-
setPageSize
public static void setPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize)
-
setDictionaryPageSize
public static void setDictionaryPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize)
-
setCompression
public static void setCompression(org.apache.hadoop.mapreduce.Job job, org.apache.parquet.hadoop.metadata.CompressionCodecName compression)
-
setEnableDictionary
public static void setEnableDictionary(org.apache.hadoop.mapreduce.Job job, boolean enableDictionary)
-
getEnableDictionary
public static boolean getEnableDictionary(org.apache.hadoop.mapreduce.JobContext jobContext)
-
getBloomFilterMaxBytes
public static int getBloomFilterMaxBytes(org.apache.hadoop.conf.Configuration conf)
-
getBloomFilterEnabled
public static boolean getBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf)
-
getBlockSize
public static int getBlockSize(org.apache.hadoop.mapreduce.JobContext jobContext)
-
getPageSize
public static int getPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
-
getDictionaryPageSize
public static int getDictionaryPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
-
getCompression
public static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression(org.apache.hadoop.mapreduce.JobContext jobContext)
-
isCompressionSet
public static boolean isCompressionSet(org.apache.hadoop.mapreduce.JobContext jobContext)
-
setValidation
public static void setValidation(org.apache.hadoop.mapreduce.JobContext jobContext, boolean validating)
-
getValidation
public static boolean getValidation(org.apache.hadoop.mapreduce.JobContext jobContext)
-
getEnableDictionary
public static boolean getEnableDictionary(org.apache.hadoop.conf.Configuration configuration)
-
getMinRowCountForPageSizeCheck
public static int getMinRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
-
getMaxRowCountForPageSizeCheck
public static int getMaxRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
-
getEstimatePageSizeCheck
public static boolean getEstimatePageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
-
getBlockSize
@Deprecated public static int getBlockSize(org.apache.hadoop.conf.Configuration configuration)
Deprecated.
-
getLongBlockSize
public static long getLongBlockSize(org.apache.hadoop.conf.Configuration configuration)
-
getPageSize
public static int getPageSize(org.apache.hadoop.conf.Configuration configuration)
-
getDictionaryPageSize
public static int getDictionaryPageSize(org.apache.hadoop.conf.Configuration configuration)
-
getWriterVersion
public static org.apache.parquet.column.ParquetProperties.WriterVersion getWriterVersion(org.apache.hadoop.conf.Configuration configuration)
-
getCompression
public static org.apache.parquet.hadoop.metadata.CompressionCodecName getCompression(org.apache.hadoop.conf.Configuration configuration)
-
isCompressionSet
public static boolean isCompressionSet(org.apache.hadoop.conf.Configuration configuration)
-
setValidation
public static void setValidation(org.apache.hadoop.conf.Configuration configuration, boolean validating)
-
getValidation
public static boolean getValidation(org.apache.hadoop.conf.Configuration configuration)
-
setMaxPaddingSize
public static void setMaxPaddingSize(org.apache.hadoop.mapreduce.JobContext jobContext, int maxPaddingSize)
-
setMaxPaddingSize
public static void setMaxPaddingSize(org.apache.hadoop.conf.Configuration conf, int maxPaddingSize)
-
setColumnIndexTruncateLength
public static void setColumnIndexTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length)
-
setColumnIndexTruncateLength
public static void setColumnIndexTruncateLength(org.apache.hadoop.conf.Configuration conf, int length)
-
setStatisticsTruncateLength
public static void setStatisticsTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length)
-
setPageRowCountLimit
public static void setPageRowCountLimit(org.apache.hadoop.mapreduce.JobContext jobContext, int rowCount)
-
setPageRowCountLimit
public static void setPageRowCountLimit(org.apache.hadoop.conf.Configuration conf, int rowCount)
-
setPageWriteChecksumEnabled
public static void setPageWriteChecksumEnabled(org.apache.hadoop.mapreduce.JobContext jobContext, boolean val)
-
setPageWriteChecksumEnabled
public static void setPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf, boolean val)
-
getPageWriteChecksumEnabled
public static boolean getPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf)
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException
- Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
- Throws:
IOException, InterruptedException
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
- Throws:
IOException, InterruptedException
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file) throws IOException, InterruptedException
- Throws:
IOException, InterruptedException
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
- Throws:
IOException, InterruptedException
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, org.apache.parquet.hadoop.metadata.CompressionCodecName codec) throws IOException, InterruptedException
- Throws:
IOException, InterruptedException
-
getRecordWriter
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, org.apache.parquet.hadoop.metadata.CompressionCodecName codec, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
- Throws:
IOException, InterruptedException
-
getWriteSupport
public WriteSupport<T> getWriteSupport(org.apache.hadoop.conf.Configuration configuration)
- Parameters:
configuration - to find the configuration for the write support class
- Returns:
- the configured write support
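By default this instantiates the class named by parquet.write.support.class; a subclass can instead fix the write support, which is how wrappers such as ExampleOutputFormat specialize this class. A hypothetical sketch (assuming Hadoop and parquet-hadoop on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.example.GroupWriteSupport;

// Hypothetical subclass that hard-wires the write support instead of
// reading parquet.write.support.class from the configuration.
public class GroupParquetOutputFormat extends ParquetOutputFormat<Group> {
    @Override
    public WriteSupport<Group> getWriteSupport(Configuration configuration) {
        return new GroupWriteSupport();
    }
}
```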
-
getOutputCommitter
public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
- Overrides:
getOutputCommitter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
- Throws:
IOException
-
getMemoryManager
public static MemoryManager getMemoryManager()
-
createEncryptionProperties
public static FileEncryptionProperties createEncryptionProperties(org.apache.hadoop.conf.Configuration fileHadoopConfig, org.apache.hadoop.fs.Path tempFilePath, WriteSupport.WriteContext fileWriteContext)
-
-