I’ve been importing gtfs files into MySQL tables. There is a wide variation between data types for different agencies, as there is no standard for data types. For instance, GTFS trips.txt has a trip_id column, where a typical value is  “94108050”, but PSTA has “973379-JUN13-HQ-Weekday-02-0000100”!

I need to do some cross table queries using some of these values, so need to assign keys to them, and I also want it to be efficient. I had initially selected varchar(10) for trip_id, but this is not long enough to hold the PSTA trip_id, and a query that worked with other agencies failed with PSTA.

PhpMyAdmin has a selection Propose Table Structure, which invokes the following command:

SELECT * FROM `psta_trips` PROCEDURE ANALYSE ( );
Propose Table Structure

Using phpMyAdmin Propose Table Structure to discover PROCEDURE ANALYZE

This outputs the following fields:

Field_name,Min_value,Max_value,Min_length,Max_length,Empties_or_zeros,Nulls,Avg_value_or_avg_length,Std,Optimal_fieldtype

I can use the Max_value value to optimize the data (and key) for trip_id, as well as any other fields, after data insertion and before using the data.

Or, after looking at the Oracle MySQL documentation for Using PROCEDURE ANALYZE, it becomes clear that this procedure is made for for optimizing tables “after importing new data”. For a table with data that will not change, it should be safe to use the Optimal_fieldtype value, if invoked with parameters 1,1. For the above table, ANALYZE suggests VARCHAR(37).

Now, I can re-visit the PHP program that imports the gtfs data.  I’ll add a routine to run this query on the populated tables, and run an ALTER TABLE statement to clean them up.