Loading CSV files
Hi,
I have data in CSV files. Partitioned inside a directory, one file per day, alive (constantly written to, insert at the end). I want to load them on-line (not once a day). Normally I would write peace of code, that remembers the last file and place, so it can easily go back and continue. But the Connect engine looks like a great ready solution for this kind of thing. I'm just not sure about:
1. I can't use indexes, because the file is changed all the time. Can I somehow make sure the data will be visible in a "connect table" always in the same order?
2. About "Multiple File Tables". Whats the order of data inside them?
3. What about simple file format change: add column at the end. Will data from older files be visible in "Multiple File Table" as null?
Answer
I am not sure I understand what you wish to do. A multiple CSV CONNECT tables can be defined on a set of csv files and will process them as if they were one table. There is no need for "loading" because CONNECT directly reads the source data. Of course it can also be used to load these data in another engine table by something such as:
insert into other_engine_table select * from connect_csv_table where some_clause;
However some points are:
1. CONNECT multiple tables are not indexable. They are read only either. This means that inserts must be made by an external application.
2. The ordering of data depends on the collation ordering of the file names. Reading them is sequential, therefore data are always retrieved in the same order.
3. Adding new columns makes older files wrong, not having the required number of field. However, this can be fixed using the table options MAXERR=n (n being large enough, for instance INT_MAX32) causing n wrong lines not to raise errors, and ACCEPT=1 causing these lines to be accepted with the missing columns set to NULL.
Query pruning can be done using a where clause on an added FILEID special column.
A much better option in your case would be to use a partitioned CONNECT CSV table. Updating and inserting could be done via MariaDB/CONNECT with automatic updating of eventual indexes. Repartition into files could be made automatic based on the day and query pruning would also be automatic.
The current version of CONNECT does not yet support partitioning but I am working on it and it should be supported in future versions.