Have you ever accidentally pushed secret keys or huge files while using Git for version control? Did you know that removing those keys even 20 seconds after exposing them to the public might already be too late?
In this blog post, I would like to highlight the dangers of exposing confidential information and emphasize what can possibly go wrong. I then provide a few nice tricks that I also use, so that you don’t need to worry or be scared while using Git anymore.
We are human after all, so we all make mistakes; but it is also crucial to learn from those mistakes.
Dangers of pushing unwanted files and information
One type of unwanted content on Git is very large files. If you accidentally commit a large file to a repository, every pull and push will take longer, and you will even get an error if the file is larger than 100 MB.
Second, if you work in software, you have by now seen this advice many times: never push confidential information to a repository. Attackers with minimal resources can compromise many GitHub users by stealing leaked secrets and keys. Yet I see that people are still not careful enough about this, so I’d like to share a few stats.
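One way to catch such files before they are committed is to scan the working tree for anything above a size limit. The snippet below is only a minimal sketch: the function name find_large_files and its two arguments are my own illustration and not part of Git.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under DIR that are larger than LIMIT_MB.
# Usage: find_large_files DIR LIMIT_MB
find_large_files() {
    dir=$1
    limit_mb=$2
    # Convert the limit to bytes (1 MB is treated as 1024 * 1024 bytes here).
    limit_bytes=$((limit_mb * 1024 * 1024))
    # "-size +Nc" matches files strictly larger than N bytes;
    # anything inside a .git directory is skipped.
    find "$dir" -type f -size +"${limit_bytes}c" ! -path '*/.git/*'
}
```

Running find_large_files . 100 from the root of a repository prints every file that would exceed the 100 MB limit, so you can exclude it before staging.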
Can I just make another commit and remove it?
No. In fact, this is one of the most dangerous things that you can do. People tend to think that once they remove a file from a repository, it is no longer reachable. This is not correct, because preserving history is exactly what Git is for: it tracks every version of your files so that you can go back in time and revert changes. The removed file is still present in the earlier commits.
By making a commit to remove a file in the following way, you are just directing strangers on the Internet to where your secrets live.
$ git commit -m "Remove api key"
You can see how frequent this is with a single search: https://github.com/search?q=remove+api+key&type=commits. To make it clearer, at the time of writing this post on January 5, 2023, GitHub returned 1M+ commits for the search query “remove api key” and 735K+ commits for the query “remove password”.
For example, with ChatGPT getting popular and people trying to write Python scripts to play with it (🤍), I found countless OpenAI API keys living in random corners of GitHub (🥲).
Think about the possibilities!
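To see why a removal commit does not help, here is a small sketch that can be run in a throwaway directory. The file name config.env, the key value and the identity settings are made up for the demonstration.

```shell
#!/bin/sh
# Build a temporary repository, commit a fake secret, then "remove" it.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config user.name "Demo"

echo 'API_KEY=not-a-real-key' > "$repo/config.env"
git -C "$repo" add config.env
git -C "$repo" commit -q -m "Add config"

git -C "$repo" rm -q config.env
git -C "$repo" commit -q -m "Remove api key"

# The file is gone from the working tree, but the parent commit still holds it.
recovered=$(git -C "$repo" show HEAD~1:config.env)
echo "$recovered"
```

The last command prints the key again, even though the latest commit no longer contains the file. Anyone who can read the repository can do the same.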
What can I do then?
When we think about removing a commit from Git history, the first thing that comes to mind is to immediately move the tip of the branch to an older commit. This takes the branch back in time to when the key was not present in the repository.
$ git reset <SHA1>
$ git commit -am "message"
$ git push -f <remote-name> <branch-name>
But it has been a while… Is it too late?
Well, if the problem you have is related to large file sizes, you can always use git filter-branch to remove past information or files from your history. In addition to this, there is a much better and simpler approach that I like using a lot.
Meet BFG-Repo-Cleaner! – This is a tool written in Scala that removes large files (like the pre-trained models or large PDFs that you are not able to get rid of) or troublesome blobs (e.g. API keys, passwords, secrets) like git filter-branch does, but faster.
GitHub’s official documentation on removing sensitive data from a repository also recommends using BFG-Repo-Cleaner for purging a file.
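Edit: I have recently been informed that git filter-branch is no longer recommended; the Git documentation itself warns against it and points to alternatives such as git filter-repo. You can now use git filter-repo, or the aforementioned BFG tool directly, instead.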
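As a rough sketch of how these two tools are typically invoked, the functions below wrap the basic commands. The function names and the BFG_JAR variable (the path to the downloaded BFG jar) are my own placeholders, and both tools are assumed to be installed. The functions are only defined here, not called, because each of them rewrites history.

```shell
#!/bin/sh
# Remove a single path from every commit with git filter-repo.
# Meant to be run inside a fresh clone of the repository.
purge_path_with_filter_repo() {
    git filter-repo --invert-paths --path "$1"
}

# Remove every blob above a given size (e.g. 100M) with BFG.
# $1 is a mirror clone created with: git clone --mirror <url>
# $2 is the size threshold.
purge_large_blobs_with_bfg() {
    mirror_dir=$1
    size=$2
    java -jar "$BFG_JAR" --strip-blobs-bigger-than "$size" "$mirror_dir"
    # BFG rewrites the commits; expiring the reflog and running gc
    # is what actually drops the old blobs from the clone.
    git -C "$mirror_dir" reflog expire --expire=now --all
    git -C "$mirror_dir" gc --prune=now --aggressive
}
```

After either function has run, the rewritten history still has to be force-pushed to the remote, and every collaborator needs a fresh clone.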
Oh that’s a relief!
No, it is not. Of course, you can use this tool anytime to remove large files, etc. However, you should still be careful before pushing unwanted credentials to public repositories. If you have recently exposed a secret on GitHub, you should be really fast to take it back with the aforementioned tools.
A paper named “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories” (Meli, McNiece, and Reaves, NDSS 2019) presents the first comprehensive, longitudinal analysis of secret leakage on GitHub. There, the researchers evaluate two different approaches for mining secrets: one is able to discover 99% of newly committed files containing secrets in real time, whereas the other leverages a large snapshot covering 13% of all public repositories, some dating to GitHub’s creation.
- Do you think that most of the keys that were discovered are used for test purposes? Well, here is the scary news: The researchers in this work estimated that 89.10% of all discovered secrets are sensitive.
- Some trends:
  - The largest drop in secret presence is found to be in the first hour after discovery, with 6% of all detected secrets being removed.
  - Secrets that existed for longer than a day tended to stay long-term: at the end of the first day, over 12% of secrets were gone, while only 19% were gone after 16 days.
  - The rate at which secrets and files were removed dramatically outpaces the rate at which repos were removed: the users were not deleting their repos, but were simply creating new commits that removed the file or secret.
- Finally, the most important takeaway: The median time to discover a secret shared on GitHub was found to be ~20 seconds, ranging from half a second to over 4 minutes, regardless of the time of day at which the secret was pushed. So you have much less time than you think after your accidental push.
GitHub should have much stricter policies or checks for commits that might expose a secret. Or at least a warning for newly registered accounts, directing them to the respective documentation? I believe this is crucial, especially to welcome newcomers who are just starting their programming journey. Developers (especially juniors) should be aware of how to make source code public securely, and of the possible consequences of failing to do so.
How can I avoid accidental commits?
- Avoid catch-all commands like git add . or git commit -a. Instead, use git add filename. Staging files individually also makes it easier to track changes across commits. You can always use the staging options in the source control components of popular text and source code editors like Visual Studio Code.
- Always review your file changes. Use git diff --cached to inspect what is staged, and keep an eye on the changes in your working tree.
- There are also tools, such as git-secrets, that help you avoid committing your keys.
- You can also use pre-commit hooks.
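To make the last two tips concrete, here is a minimal sketch of a pre-commit check built on git diff --cached. The pattern list is only an illustration (an OpenAI-style sk- prefix, an AWS-style access key ID, and a private key header) and is far from complete; the function names are my own.

```shell
#!/bin/sh
# Illustrative patterns only; extend this for the services you actually use.
SECRET_PATTERN='(sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----)'

# Reads a unified diff on stdin and succeeds if any added line
# (a line starting with "+", excluding the "+++ " file header) matches.
added_lines_contain_secret() {
    grep '^+' | grep -v '^+++ ' | grep -Eq -e "$SECRET_PATTERN"
}

# Checks the staged changes; returns 1 if something looks like a secret.
check_staged_changes() {
    if git diff --cached -U0 | added_lines_contain_secret; then
        echo "Possible secret in staged changes; commit aborted." >&2
        return 1
    fi
    return 0
}
```

To use it as a hook, save the script as .git/hooks/pre-commit, add a final line check_staged_changes || exit 1, and make the file executable. A dedicated tool like git-secrets will cover many more cases than this sketch.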
During the writing of this post, none of the keys found public had been bombed 💣 (I mean scraped). Therefore, no owner of these repos with possibly leaked information had been harmed. I just would like to emphasize the dangers of sharing confidential data, so that people can start being more careful about what to share with random strangers and what not, both on software development platforms and on a much larger scale (such as social media platforms).
Let me know in the comments if you have other tips and tricks!
— Coding Woman
[2] ChatGPT is a language model (a fine-tuned version of the GPT-3.5 model) trained to produce text and optimized for dialogue using Reinforcement Learning from Human Feedback (RLHF): https://chat.openai.com/chat
[3] Removing sensitive data from a repository: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository
[4] Meli, Michael, Matthew R. McNiece, and Bradley Reaves. “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories.” NDSS, 2019.