External Data Sources

The external data sources framework is used to bring external data into Spark.

Two planning strategies are used: FileSourceStrategy and DataSourceStrategy.

DataSourceScanExec is the leaf SparkPlan node that scans data from the external system. In detail, it has two variants: BatchedDataSourceScanExec (batched mode) and RowDataSourceScanExec (row mode). The former is used for vectorized scans, which are supported by Parquet (ORC support is yet to be implemented).

For DataSourceStrategy, the RDD is embedded in the scan node at construction time and is built by relation.buildScan.
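To make the relation.buildScan path concrete, here is a minimal sketch of a relation that DataSourceStrategy could plan. It uses the public org.apache.spark.sql.sources API (BaseRelation, TableScan, RelationProvider). The names RangeRelation and DefaultSource's "n" option are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical relation that exposes the integers 0 until n as a one-column table.
class RangeRelation(override val sqlContext: SQLContext, n: Int)
    extends BaseRelation with TableScan {

  // Schema reported to Spark for this relation.
  override def schema: StructType =
    StructType(Seq(StructField("value", IntegerType, nullable = false)))

  // The RDD returned here is the one handed to the scan node.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}

// Entry point Spark looks up when the enclosing package name is passed to format(...).
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    // "n" is an illustrative option name, not a Spark built-in.
    new RangeRelation(sqlContext, parameters.getOrElse("n", "10").toInt)
}
```

Assuming these classes live in a package such as com.example.range (a made-up name) and are on the classpath, the relation could be loaded with spark.read.format("com.example.range").option("n", "5").load().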

For FileSourceStrategy, the RDD is constructed as a FileScanRDD.
