I occasionally need to grant a non-technical colleague the ability to input information into our data warehouse on an ad-hoc basis. For example, our customer service team at Milk Bar maintains a list of special wedding cake orders in Google Sheets that we need to collect data from for downstream calculations.

This is a tricky problem for a data engineer — my colleagues don’t have the technical skill to interact directly with our data stack, and I don’t want to have to support my own web form or similarly involved infrastructure to collect this information. What if we could sync their Google Sheet directly to a table in our data warehouse?

At first glance, this sounds like a terrible idea. Free data entry into a data warehouse table with no schema control is a data engineer’s nightmare. What’s to stop someone from accidentally deleting a column or entering data of the wrong type? How do I know that my colleagues are entering data according to my expectations? I can already feel your heart starting to race thinking about how many best practices this approach violates!

Stay with me. I’m here to convince you that Google Sheets, with the proper guardrails in place, is an excellent way for colleagues outside of an analytics team to input tabular data for downstream use, even in automated transformations!

We’re going to explore setting up a restricted, data-validated Google Sheet and discuss options for syncing that data to your data warehouse, testing it, and even sending transformed results back to Sheets for your colleagues to view or analyze.

Securing your Google Sheet

Permissions

We’ll start with the most obvious protection you should enforce on your Google Sheet data source: share permissions. Plainly and simply, the only people allowed to edit your sheet should be people you trust to edit it appropriately and have been trained to do so. Keep link sharing off unless you need it to be on, in which case, link permissions should be set to can view.

The next important step is often overlooked because the setting is hidden away in Google’s Advanced section of the share menu. Make sure you restrict editors from adding new people or changing access. You want to be sure that your colleagues won’t inadvertently add unapproved editors.

Protected sheets and ranges

In addition to workbook-level permissions like we set above, we can also set permissions for individual sheets and even specific cell ranges.

To set these permissions, navigate to the Data menu and select Protected sheets and ranges. You can restrict sheets and ranges so they can only be edited by certain people, or allow people to edit them only after acknowledging a warning about the consequences of their actions.

You might have some sheets that contain necessary information for formulas or calculations that don’t need to be edited by any of your users. If that’s the case, protect these sheets so only you can modify them.

Data validation

Thorough data validation is what makes Google Sheets a reliable option for direct sync to your data warehouse. With data validation, you can impose virtually any constraint you can think of on the desired range in your sheet.

You can get to Data Validation from the Data menu in Google Sheets. Data validation asks you for a cell range, a constraint, and an action to take if that constraint is not met.

I would recommend “Reject input” for the action your sheet should take when invalid data is entered. Google’s “Show warning” option displays a tiny orange triangle in the top right corner of the problematic cell, and it’s very easy for the user to miss. Rejecting everything that doesn’t validate ensures you’re only passing clean data downstream to any automated processes that depend on a consistent schema.

I’ll go through each constraint option provided by Google and discuss how we might use each one to qualify clean data and reject messy data.

  • List from a range. This criterion requires any entry in our cell range to match the value of a cell in a validation range. We often use it to force the user to choose from a list of predefined values or categories. As a handy bonus, you can add a drop-down list to each cell in your range so your user can pick a valid value directly. I like to keep a separate, fully protected sheet (only I can edit it) for my validation range.
  • List of items. Similar to List from a range, except you provide the valid values directly as a comma-separated list. This is great for validating entries against a short list of static values.
  • Number. This criterion is used to check that entries fall within a numeric range. We can prevent negative numbers or force entry to be less than a known maximum.
  • Text. We can test entries to see if they contain or do not contain certain sub-strings. Perhaps you want to make sure your users are entering a proper prefix when they copy over invoice numbers. There are also a couple of useful options at the bottom of the drop-down: is valid email and is valid URL. These are both common validation needs that would require regex or something similarly complex to confirm.
  • Date. We can validate that entries are valid dates and fall within a required range. Because of the multitude of date formats, dates are notoriously messy. This validation criterion is a vital if your users are entering dates anywhere in your sheet.
  • Custom formula is. This is the real game-changer for data validation. You can write any custom formula that returns a True or False value and Google will validate entries against it. You can implement some really complex validation logic this way. For example, use the REGEXPMATCH formula to confirm that entries match a more complicated set of string logic. Google Sheets also doesn’t allow you to apply multiple validation criteria to the same cell, so this option allows us to combine criteria into a single formula.
  • Checkbox. You can replace the cell(s) in your cell range with a checkbox that returns a custom value when it is checked and unchecked. True and False will probably account for 99% of the use cases here, but it’s nice to have the flexibility to represent something else if you want to.

If that doesn’t give you enough to test, you can write a custom Google Apps Script with the DataValidationBuilder class to implement your own rule.

Modifying data validation you’ve already put in place is a bit annoying. There’s no consolidated list of all the validation rules you’ve applied within a sheet. Instead, you have to select the cell or range of interest and return to the Data Validation menu to remove or modify criteria.

Once you’ve thoroughly validated the data in your Google Sheet, your sheet will be just as good as a custom web form, if not better. You can be very confident in the quality of your data.

Signaling to your users with conditional formatting

Sometimes rejecting a user’s input with data validation is too strict and makes for a bad user experience. You can use Google’s option to “Show warning,” but the orange warning triangle that gets added is too easy for a user to miss. I prefer to use conditional formatting to turn the whole cell in question an alarming shade of red or an approving shade of green.

Let’s say you want to warn users if they enter a value that is more than two standard deviations greater or less than the average. You could set up a formula with conditional formatting to turn the cell yellow as warning to the user that the value is an outlier and may have been entered incorrectly.

Sure, you could use a custom data validation formula to reject the value, but conditional formatting gives you shades of gray (literally!) to signal to your users beyond accepted or rejected. Occasionally that flexibility is nice to have. I’m not going to go into the specifics of setting up conditional formatting, but Google’s documentation on the feature is here if you’re not familiar with the concept.

Optional: Transform with separate IMPORTRANGE sheets

Instead of syncing directly from the user-accessed sheet to the data warehouse, I like to set up an intermediate import sheet that only I can edit. This intermediate sheet provides me with a bit more control. I configure this sheet to copy all the data from the user-accessed sheet using the IMPORTRANGE formula. This sheet is where I can make any necessary adjustments before I pass it to my data warehouse.

  • For example, I might only import a specific range of the user-accessed sheet, so errant data outside of that range doesn’t break anything downstream.
  • I could also union two user-accessed sheets (maybe there are multiple sheets, one for each year) into a single sheet with a vertical array.
  • I can even use the QUERY formula to filter the rows in the user-accessed sheet to remove blanks.

Alternative: Completely control data entry with a Google Form

If users don’t need to edit data after they enter it, you can set up an append-only system using a Google Form. Google Forms has a native integration with Google Sheets, which makes this setup dead simple for append-only needs. Google Forms will append a row for each response to your form in a timestamped Google Sheet. If you add questions, the sheet will expand to match your form.

Ensure clean data by choosing the appropriate question types for your form. To configure the Google Sheets integration, you’ll need to click the Google Sheets logo in the responses tab, which will generate the sheet.

Since you don’t have the ability to reject data outside of the form’s own validation, I would recommend setting up an intermediate importer sheet as discussed above. Perform any required filtering there to get your data in a clean state for load.

Syncing your Google Sheet to your data warehouse

Loading a Google Sheet into a data warehouse is a common problem, so there are a lot of potential solutions.

If you’re a Google BigQuery user, the easiest approach, by far, is to set up the Google Sheet as an external table. BigQuery’s documentation describes how to do this in more detail, but all you need to do is define a schema for the table and you can query the first tab of your sheet directly from BigQuery.

If you’re using Snowflake, Redshift, or another data warehouse, you can take your pick from these other options.

  • Most data loading services have an integration that handles loading Google Sheets. See Fivetran’s documentation and Alooma’s.
  • Stitch’s Google Sheets integration. Data loading service Stitch has a guide on how to set up some Apps Script code in Google Sheets that will send data to Stitch’s Import API. Stitch will then load that data into your data warehouse of choice. With this approach, you have to sync the data manually by clicking “sync to Stitch,” which may be a blessing or a curse depending on your needs. Note that you will have to sign up with Stitch if you’re not already a customer, but syncing a few sheets isn’t likely to exceed their 5 million rows/month free tier.
  • sheetload. This is a Python tool created by the data team at GitLab to load local CSVs and Google Sheets into a data warehouse. At the time of writing, the tool supports Postgres and Snowflake.
  • Ro’s Google Sheet integration. The data team at Ro open-sourced their custom approach to loading Google Sheets data into Snowflake. Note that their approach truncates and replaces the data table on every load, so it won’t preserve deleted or modified rows.
  • You could also look into rolling your own modular approach with Zapier, webhooks, and a serverless function like AWS Lambda. You could trigger your Zap on row creation in your sheet which would send the data, via a webhook, to a serverless function set up to receive the payload as an endpoint. This approach becomes more challenging if you need to capture updates and deletions.

Testing the data in your warehouse

Once the data is syncing to your warehouse, you’re more or less done! If you still have concerns about data quality or aren’t entirely satisfied with the data validation you set up, you can test your synced data warehouse table as a staging table. Once your data passes your SQL tests, copy it to production.

Testing data in the data warehouse is a larger topic, (a topic I’ve written about extensively here), but we use data build tool (dbt) at Milk Bar to achieve this. dbt manages our data warehouse and runs our SQL tests in our CI pipeline to check for any data issues that escaped validation.

Bonus Points for BigQuery: Sending data back to Google Sheets

If you’re a BigQuery user and your company uses GSuite, there’s one more capability that I have to mention. You can query BigQuery data into Google Sheets.

In the Data menu in Google Sheets, there’s an option for Data Connectors. Here you can connect to your BigQuery data warehouse and query tables with SQL to return live data into your Google Sheet. Perform any necessary downstream calculations or transformations in your warehouse, then return valuable information to your users at a cadence appropriate for your data stack.

This approach only supports up to 10,000 rows, but Google is testing an extension of this feature called Connected Sheets, which will allow you to return the entire table to Google Sheets, no matter how large it is. This feature is currently in beta, which you can apply for here.

I hope this guide has equipped you to set up your own Google Sheet data sources confidently and correctly. If you have concerns about or creative ideas for using Google Sheets as a data source, let me know in the comments.

Further reading

  1. Matthew Lincoln’s outstanding post, Best Practices for Using Google Sheets in Your Data Project