*Let's hope our files will load quicker than it took to capture this picture… - Photo by Anders Jildén on Unsplash*

GCP recommends AVRO for fast ingestion times due to the way it compresses data, but who believes documentation these days? I wanted to see for myself which file format would come out on top.

I rarely ever use ORC, and I mostly use Python for my work, which means I have pandas at my disposal, which can easily write CSV and PARQUET files with compression. So the list of file types I decided to test is:

- CSV - comma-separated files with no compression at all.
- CSV.GZIP - same as above, but compressed with GZIP.
- PARQUET - a columnar storage format with snappy compression that's natively supported by pandas.
- AVRO - a binary format that GCP recommends for fast load times.

For each of the formats above, I ran 3 experiments, the first of which was importing a bunch of largish files (20 files of 50k rows each).

And to simulate a more real-life example, I created a few different data types for the columns to see how they are handled by BigQuery: floats, ints, date-times, long strings with high entropy, and short strings with very low entropy (50 columns of each data type, to be exact). The sketches below show roughly what this setup looks like in code.
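First, building the test frame. This is a minimal sketch with pandas and NumPy; the column names, the 36-character alphabet for the high-entropy strings, and the three-word vocabulary for the low-entropy ones are my own assumptions, not taken from the original post:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N_ROWS, N_COLS = 50_000, 50  # one file: 50k rows, 50 columns per data type

ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789")

def random_strings(n, length):
    """n random fixed-length strings -- high entropy."""
    chars = rng.choice(ALPHABET, size=(n, length))
    return ["".join(row) for row in chars]

data = {}
for i in range(N_COLS):
    data[f"float_{i}"] = rng.random(N_ROWS)                      # floats
    data[f"int_{i}"] = rng.integers(0, 1_000_000, size=N_ROWS)   # ints
    data[f"dt_{i}"] = pd.Timestamp("2021-01-01") + pd.to_timedelta(
        rng.integers(0, 365 * 24 * 3600, size=N_ROWS), unit="s"
    )                                                            # date-times
    data[f"long_str_{i}"] = random_strings(N_ROWS, 64)           # long, high entropy
    # short strings drawn from a tiny vocabulary -- very low entropy
    data[f"short_str_{i}"] = rng.choice(["yes", "no", "maybe"], size=N_ROWS)

df = pd.DataFrame(data)
```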
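Writing that frame in the formats under test is a one-liner per file in pandas, except for AVRO, which pandas cannot write natively; a third-party library such as fastavro or pandavro fills that gap. File names below are hypothetical:

```python
df.to_csv("sample.csv", index=False)                         # CSV, uncompressed
df.to_csv("sample.csv.gz", index=False, compression="gzip")  # CSV.GZIP
df.to_parquet("sample.parquet", compression="snappy")        # PARQUET (snappy is the default)
# AVRO has no built-in pandas writer; e.g. with pandavro:
# import pandavro as pdx; pdx.to_avro("sample.avro", df)
```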
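For the load step itself, a minimal sketch using the official google-cloud-bigquery client could look like the following. The bucket, dataset, and table names are placeholders, and the timing uses the load job's server-side start/end timestamps rather than wall-clock time:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Map each file under test to BigQuery's source-format constant.
FORMATS = {
    "sample.csv": bigquery.SourceFormat.CSV,
    "sample.csv.gz": bigquery.SourceFormat.CSV,  # gzipped CSV is handled transparently
    "sample.parquet": bigquery.SourceFormat.PARQUET,
    "sample.avro": bigquery.SourceFormat.AVRO,
}

def load_and_time(uri: str, source_format, table_id: str) -> float:
    """Run one load job and return its server-side duration in seconds."""
    job_config = bigquery.LoadJobConfig(
        source_format=source_format,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    if source_format == bigquery.SourceFormat.CSV:
        job_config.autodetect = True       # infer the schema for CSV
        job_config.skip_leading_rows = 1   # pandas writes a header row
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # block until the load finishes
    return (job.ended - job.started).total_seconds()

for name, fmt in FORMATS.items():
    secs = load_and_time(f"gs://my-bucket/{name}", fmt, "my-project.my_dataset.load_test")
    print(f"{name}: {secs:.1f}s")
```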