Git – The Good, The Bad and The Ugly

January 6, 2023

Have you ever accidentally pushed secret keys or huge files while using Git for version control? Did you know that removing those keys even 20 seconds after exposing them to the public might already be too late?

In this blog post, I would like to highlight the dangers of exposing confidential information and show what can go wrong. I then share a few tricks that I use myself, so that you no longer need to worry or be scared while using Git.

We are human after all, so we all make mistakes; but it is also crucial to learn from those mistakes. 

Dangers of pushing unwanted files and information

One type of unwanted content in a Git repository is very large files. If you accidentally commit a large file, every pull and push will take longer, and GitHub will reject the push with an error if a file is larger than 100 MB.

Second, if you are already into software, you have seen this advice many times by now: never push confidential information to a repository. Attackers with minimal resources can compromise many GitHub users by stealing leaked secrets and keys. Yet I see that people are still not careful enough about this, so I'd like to share a few stats.
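
One way to catch oversized files before they are staged is to list them first. The helper below is only a sketch: the function name list_large_files and its two arguments are made up for this example, and the 100M default simply mirrors GitHub's per-file limit.

```shell
# Hypothetical helper (not a Git command): print files under a directory that
# are larger than a threshold, skipping Git's own .git directory.
list_large_files() {
  local dir="${1:-.}"        # directory to scan, defaults to the current one
  local limit="${2:-100M}"   # threshold in find(1) size syntax, e.g. 100M or 512k
  find "$dir" -type f -not -path '*/.git/*' -size "+${limit}"
}
```

Running list_large_files . 100M from the repository root prints every file that GitHub would refuse, which gives you a chance to move it elsewhere or add it to .gitignore before committing.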

Can I just make another commit and remove it?

No. In fact, this is one of the most dangerous things you can do. People tend to think that once they remove a file from a repository, it is no longer reachable. This is not correct, and it is exactly what Git is for: it tracks your file version history so that you can go back in time whenever you want to revert changes.
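
To make this concrete, here is a small throwaway experiment. It is a sketch that assumes git is installed; the file name config.env, the fake key and the commit messages are invented for the demo. It commits a fake key, removes it with a second commit, and then reads the key back from history.

```shell
set -eu

# Work in a scratch repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

# Commit a fake secret, then "remove" it with a second commit.
echo "API_KEY=not-a-real-key-123" > config.env
git add config.env
git commit --quiet -m "Add config"
git rm --quiet config.env
git commit --quiet -m "Remove api key"

# The file is gone from the working tree...
ls config.env 2>/dev/null || echo "config.env is gone from the working tree"

# ...but the previous commit still holds it, key included.
recovered="$(git show HEAD~1:config.env)"
echo "recovered from history: $recovered"

# Anyone can also search the history for commits that added or removed the string.
git log --oneline -S "not-a-real-key"
```

The removal commit changes nothing about the earlier commit, so the key stays readable for anyone who can see the history.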

By making a commit to remove a file in the following way, you are just directing strangers on the Internet to where your secrets live. 

$ git commit -m "Remove api key"

You can see how frequent this is with a single search[1]. To make it clearer: while I was writing this post on January 5, 2023, GitHub returned 1M+ commits for the search query “remove api key” and 735K+ commits for the query “remove password”.

For example, with ChatGPT[2] getting popular and people writing Python scripts to play with it (🤍), I found countless OpenAI API keys living in random corners of GitHub (🥲).

Think about the possibilities!

What can I do then?

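When we think about removing a commit from Git history, the first thing that comes to mind is to move the tip of the branch back to an older commit, from before the key was added to the repository. Note that a plain git reset leaves your working files as they are, so delete the key from them before you commit again, and that the force push rewrites the history on the remote.

$ git reset <SHA1>    # move the branch tip back; working files stay untouched
# remove the key from your working files here, or the next commit adds it back
$ git commit -am "message"
$ git push -f <remote-name> <branch-name>    # overwrite the remote branch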

But it has been a while… Is it too late?

Well, if your problem is related to large file sizes, you can always use git filter-branch to remove past information or files from your history. In addition, there is a much better and simpler approach that I like using a lot.

Meet BFG-Repo-Cleaner! – This is a tool written in Scala that removes large files (like the pre-trained models or large PDFs that you are not able to get rid of) or troublesome blobs (e.g. API keys, passwords, secrets) like git filter-branch does, but faster.

GitHub's official documentation also recommends[3] BFG-Repo-Cleaner for purging a file.

Edit: I have recently been informed that git filter-branch is deprecated. Use git filter-repo or the BFG tool mentioned above instead.
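
As a rough sketch of what a purge could look like with these tools: the wrapper function purge_path_from_history and the names secrets.env, bfg.jar and my-repo.git are placeholders of mine, not something the tools provide. It assumes git-filter-repo is installed and on your PATH, and that you work in a fresh clone.

```shell
# Sketch only. Assumes git-filter-repo is installed and on your PATH, and that
# you run it inside a fresh clone of the repository you want to clean.
purge_path_from_history() {
  path_to_purge="$1"   # path of the file to erase from every commit
  if ! command -v git-filter-repo >/dev/null 2>&1; then
    echo "git-filter-repo not found on PATH" >&2
    return 127
  fi
  # --invert-paths keeps everything except the given path, in every commit.
  git filter-repo --invert-paths --path "$path_to_purge"
}

# Rough BFG equivalents, run against a mirror clone (my-repo.git is a placeholder):
#   java -jar bfg.jar --delete-files secrets.env my-repo.git
#   java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
```

Calling purge_path_from_history secrets.env and then force-pushing rewrites every commit that touched the file. Anyone who already cloned or scraped the repository still has the old history.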

Oh that’s a relief!

No, it is not. Of course, you can use this tool anytime to remove large files, etc. However, you should still be careful before pushing unwanted credentials to public repositories. If you have recently exposed a secret on GitHub, you should be really fast to take it back with the aforementioned tools.

A paper named “How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories”[4] builds the first comprehensive, longitudinal analysis of secret leakage on GitHub. There, the researchers evaluate two different approaches for mining secrets: one is able to discover 99% of newly committed files containing secrets in real time, whereas the other leverages a large snapshot covering 13% of all public repositories, some dating to GitHub’s creation.

  • Do you think that most of the keys that were discovered are used for test purposes? Well, here is the scary news: The researchers in this work estimated that 89.10% of all discovered secrets are sensitive.
  • Some trends:
    1. The largest drop in secret presence is found to be in the first hour after discovery, with 6% of all detected secrets being removed. 
    2. Secrets that existed for a duration longer than a day tended to stay long-term — at the end of the first day, over 12% of secrets were gone, while only 19% were gone after 16 days.
    3. The rate at which secrets and files were removed dramatically outpaces the rate at which repos were removed: The users were not deleting their repos, but were simply creating new commits that removed the file or secret.
  • Finally, the most important takeaway: The median time to discover a secret shared on GitHub was found to be ~20 seconds, ranging from half a second to over 4 minutes, regardless of the time of day the secret was pushed. So after an accidental push, you have much less time than you think.

Final Words

GitHub should have much stricter policies or checks for commits that might expose a secret. Or at least a warning for newly registered accounts that directs them to the respective documentation? I believe this is crucial, especially to welcome newcomers who are just starting their programming journey. Developers (especially juniors) should know how to make source code public securely, and what consequences they might have to deal with if they fail to do so.

How can I avoid accidental commits?

  • Avoid catch-all commands like git add . or git commit -a. Instead, use git add filename. Staging files individually also makes it easier to track changes across commits. You can always use the staging options in the source control panels of popular text and source code editors like Visual Studio Code.
  • Always have a look at your file changes. Use git diff --cached to review what is staged, and keep an eye on the changes in your working tree.
  • There are other types of tools that help you avoid committing your keys like git-secrets.
  • You can also use pre-commit hooks.
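
As an illustration of the last point, here is a minimal sketch of a pre-commit hook that scans the staged changes for lines that look like secrets. The pattern list and the function name check_staged_for_secrets are assumptions of mine, far from exhaustive, and no substitute for a dedicated tool like git-secrets.

```shell
#!/bin/sh
# Sketch of a pre-commit hook. The pattern is illustrative, not exhaustive:
# it flags things like API_KEY=..., secret: ..., password = ..., token=...
SECRET_PATTERN='(api[_-]?key|secret|password|token)[[:space:]]*[:=]'

check_staged_for_secrets() {
  # Look only at lines being added in the staged diff (they start with "+"),
  # skipping the "+++ b/file" header lines.
  hits="$(git diff --cached --no-color -U0 \
    | grep -E '^\+' | grep -vE '^\+\+\+ ' \
    | grep -iE "$SECRET_PATTERN" || true)"
  if [ -n "$hits" ]; then
    echo "Possible secret in staged changes:" >&2
    echo "$hits" >&2
    return 1
  fi
  return 0
}

# Run the check only when Git invokes this file as the pre-commit hook.
case "$0" in
  *pre-commit) check_staged_for_secrets || exit 1 ;;
esac
```

Saved as .git/hooks/pre-commit and made executable, the script makes git commit abort whenever a staged line matches the pattern. Expect false positives; git commit --no-verify skips the hook when you are sure a match is harmless.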

Disclaimer

During the writing of this post, none of the publicly exposed keys I found were bombed 💣 (I mean scraped). Therefore, no owner of these repos with possibly leaked information was harmed. I just want to emphasize the dangers of sharing confidential data, so that people start being more careful about what to share with random strangers and what not, both on software development platforms and on a much larger scale (such as social media platforms).

Let me know in the comments if you have other tips and tricks!

— Coding Woman

References
↑1 https://github.com/search?q=remove+api+key&type=commits
↑2 ChatGPT is a language model, a fine-tuned version of GPT-3.5, trained to produce text and optimized for dialogue using Reinforcement Learning with Human Feedback (RLHF): https://chat.openai.com/chat
↑3 Removing sensitive data from a repository: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository
↑4 Meli, Michael, Matthew R. McNiece, and Bradley Reaves. “How bad can it git? characterizing secret leakage in public github repositories.” NDSS. 2019.


8 Comments


eternaloctober
January 6, 2023 at 2:13 AM

The Good, the Insane, and the Tugly



plg94
January 6, 2023 at 2:39 AM

use git filter-repo instead, filter-branch is deprecated
the 100MB limit (for one transaction(?)) applies only to Github, not Git in general. And while we’re at it, we should at least mention LFS, git-annex &co.
pushing secrets to a (public) git repo is bad, yes, but not fundamentally different than publishing them on your blog by accident or writing your password on facebook. Once published, all secrets should be regarded compromised, no matter how fast and thoroughly you can delete them. So this is not really Git’s fault.
I agree that bad Git tutorials who only teach add . and commit -am are partly to blame. Another suggestion: use commit -v, it lists all staged changes again in the editor below the commit message.



goranlepuz
January 6, 2023 at 7:34 AM

It is mostly not the fault of git that people put secrets in source control. People have been doing that before git existed.

The funny part, for me, is that the centralised nature of the service that is GitHub is making it easier to scrape secrets.

Finally,

How can I avoid accidental commits?

Is kinda crucial. And there, people need to be more careful, first and foremost. Tooling etc can only help so much. Probably something along the lines of “make a foolproof system and only a fool would want to use it” applies .



KieranDevvs
January 6, 2023 at 1:19 PM

No no no no no!!! Even if you manage to remove the git history of a pushed secret within 10 seconds, don’t assume no one has pulled it down, especially on a public repository.

Rotate the secret out. It's dead and should not be used!

You’re really going to risk your job or business on the assumption that you think you were fast enough? Even if your project isn’t commercial and it's just for hobby, it's a matter of good practice.



edgmnt_net
January 6, 2023 at 1:23 PM

Use .gitignore, use a build directory, don’t store credentials or any other random files under the local repo’s base directory. Not only because you could accidentally push them, but also because you could lose them, e.g. if you do git clean.



Chris "Not So" Short @ChrisShort@hachyderm.io
January 6, 2023 at 7:17 PM

Suggested Read: Git – The Good, The Bad and The Ugly codingwoman.com/git-the-good-t…



oaf357
January 6, 2023 at 7:24 PM

oaf357 bookmarked this Post on reddit.com.



Palma
January 8, 2023 at 3:17 AM

It’s very cool what you do






  • Hi there 🙌 — I'm Idil, a master's student in computer science 👩‍💻 interested in machine learning and computer vision. I write blog posts on a variety of topics including computer science, my recently read books, and self-development.



© Copyright Coding Woman | 2018 - 2023