AWS Glue incremental reflection

The documentation says that for S3/HDFS, “The system can automatically identify new files in the directory”. Does this also apply when new files show up in an AWS Glue partition? If so, why does it ask which field I want to use to identify new records?

@parisholley Glue is treated as a table store, like Hive. The feature above applies only to filesystem sources such as S3.

So I have to use full refresh and re-download all the S3 data mapped by Hive each time?

@parisholley You can use incremental refresh by selecting a monotonically increasing column from the dataset discovered after adding the Glue source. See the attached screenshot.

sure but it isn’t clear what that actually means…

If I have a table partitioned by YYYY/MM/DD/HH, I could have files within that partition for every minute, second, etc.

a) If files are written into that partition out of order (minute 2 before minute 1), what is the behavior?
b) If files written to that partition are arbitrary (e.g. file 1, file 2) and each of those files could contain any time range within minutes 0-60 of that hour, what is the behavior?
c) If 99% of the data for hour 1 is written but, due to a delay or extended processing, the remaining hour 1 data is created (in new files) after hour 2 has already started, what is the behavior (i.e. files added to old partitions)?

@parisholley The selected column should be monotonically increasing.
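
To make the "monotonically increasing" requirement concrete, here is a minimal sketch. It assumes the refresh remembers the largest value of the chosen column it has seen and, on the next run, only reads rows above that value. This is a toy model for reasoning about cases a) to c), not Dremio's actual implementation; `IncrementalReader` and `event_minute` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalReader:
    """Toy high-water-mark refresh (an assumption, not Dremio's code)."""
    high_water_mark: int = -1
    materialized: list = field(default_factory=list)

    def refresh(self, source_rows):
        # Pick up only rows strictly above the largest value seen so far.
        new_rows = [r for r in source_rows
                    if r["event_minute"] > self.high_water_mark]
        if new_rows:
            self.materialized.extend(new_rows)
            self.high_water_mark = max(r["event_minute"] for r in new_rows)
        return new_rows


if __name__ == "__main__":
    reader = IncrementalReader()
    source = [{"event_minute": 1}, {"event_minute": 3}]
    reader.refresh(source)              # first refresh sees minutes 1 and 3

    source.append({"event_minute": 2})  # late row, below the mark of 3
    source.append({"event_minute": 4})  # genuinely new row
    print(reader.refresh(source))       # only the minute-4 row is returned
```

Under this assumption, a row that arrives late with a value below the current mark is never picked up, which is why the column needs to be monotonically increasing. If late or out-of-order writes are possible, a full refresh is the only option in this model that would see them.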