Distinguishing Users in SageMaker Notebook When Pushing to GitHub

Hi Everyone,

I’m trying to commit and push code from a SageMaker Notebook Instance to a private GitHub repository following AWS’s official setup instructions (linked here). The setup works fine, but I encountered an issue: it does not distinguish between different users when committing to GitHub.

Our team shares the same SageMaker Notebook instance and collaborates on the same private GitHub repo. However, in GitHub, all commits appear under the same username, "EC2 Default User". We attempted to manually change the Git username using git config --global user.name "Alix", but this change applies to all users on the instance. As a result, if Alix updates the username, GitHub records all subsequent commits under "Alix", regardless of the actual contributor.

Additional Concern: GitHub Secret

I also noticed that the secret used to connect SageMaker and GitHub is a GitHub personal access token. Does this mean it is tied to an individual account? If so, what’s the best practice for selecting and managing a shared credential in a team environment while maintaining proper security and individual commit identities?

How can we ensure each user commits with their own credentials?

We want each team member to push commits under their own GitHub identity while continuing to use the shared SageMaker Notebook instance. Any guidance or best practices would be greatly appreciated!

Thanks!

2 Answers
Accepted Answer

In general I think it's pretty difficult to reconcile the ideas of 1) providing multiple people shared access to the same running server environment and kernels with open-ended Python & Linux tools and 2) maintaining clear and verifiable attribution of who did what and when...

Even in SageMaker AI Studio's "shared spaces", there are similar issues, and I think (could be missing something) they reflect broader challenges in the underlying Jupyter ecosystem rather than anything particularly SageMaker-specific.

My advice would probably be to take a step back from this way of working:

  • Prefer individual working spaces, probably using Studio to help manage the users and to benefit from the extra features vs plain notebook instances
  • Encourage users to collaborate via dedicated shared stores, e.g. git branches, MLflow experiment tracking servers, and data stores like S3
  • Optimize costs by enabling auto-shutdown and choosing small instance types for notebook environments. Use SageMaker Training Jobs for long-running or high-resource tasks, and benefit from shared visibility of their logs/metrics/etc across the team. You can even run scheduled notebook jobs to minimize the code changes required between local interactive work and batch jobs.

If your team gets into the habit of using SageMaker Jobs (which by default will automatically upload the submitted code to a new S3 location, and preserve logs, metrics, etc), then hopefully they'd feel comfortable interacting through these more governed central stores like git, the model registry, and the experiments, rather than needing to directly share the notebook environment. If they get used to running heavier tasks as on-demand compute jobs, and to expecting that the notebook itself will auto-shutdown if left idle for a while, it can also help reduce overall costs versus provisioning one large, always-on notebook instance.
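To make the "jobs instead of a shared notebook" idea a bit more concrete, here is a minimal sketch of a training job request that records who submitted it. It only builds the request dictionary for the low-level boto3 `create_training_job` call (not the higher-level SageMaker Python SDK, which is what handles the automatic code upload mentioned above). The `submitted-by` tag, the helper name, and all example values are assumptions for illustration, not an AWS convention.

```python
def training_job_request(job_name, user, image_uri, role_arn, output_s3_uri,
                         instance_type="ml.m5.large", max_runtime_s=3600):
    """Build a CreateTrainingJob request tagged with the submitting user.

    The "submitted-by" tag key is a hypothetical team convention.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # Attribution lives on the job itself, not on the notebook instance.
        "Tags": [{"Key": "submitted-by", "Value": user}],
    }


# Submitting would look roughly like this (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_training_job(**training_job_request(...))
```

Because the tag travels with the job, anyone on the team can later see who launched which run without needing access to that person's notebook environment.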


Edit to add: AFAIK the only mechanism git has for verifying individual commits is commit signing e.g. with GPG - which is not ideal as it can be a bit of a pain to set up for anybody not familiar... Mayyyybe you could:

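  • Set commit.gpgsign true so that git signs every commit by default (to actually enforce it you'd also need something server-side, e.g. a GitHub branch protection rule that requires signed commits)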
  • Make everybody use a password-protected GPG key (in which case even if they try to use somebody else's key it would fail because they wouldn't know the password)
  • Make everybody use a password-protected SSH key for git server authentication (this is just for actual push/pull itself - not commit level)

...Then theoretically I think you could establish traceability on both commits and pushes. But the user experience might not be nice, and git still wouldn't stop you from committing with name Dorkus Borkus or whoever was on the instance last - it's just that GitHub would know that the key that signed the commit wasn't that user but somebody else.
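As a rough illustration of the signing idea, here is a minimal Python sketch that builds a per-user, signed `git commit` invocation without touching the shared `--global` config. It relies on git's standard identity environment variables (`GIT_AUTHOR_NAME`, `GIT_AUTHOR_EMAIL`, `GIT_COMMITTER_NAME`, `GIT_COMMITTER_EMAIL`) and passes `user.signingkey` and `commit.gpgsign` via `-c`. The helper names, the example user, and the key ID are hypothetical, and it assumes each person's password-protected GPG key is already imported on the instance.

```python
import os
import subprocess


def identity_env(name, email, base_env=None):
    """Return a copy of the environment with git identity variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "GIT_AUTHOR_NAME": name,
        "GIT_AUTHOR_EMAIL": email,
        "GIT_COMMITTER_NAME": name,
        "GIT_COMMITTER_EMAIL": email,
    })
    return env


def signed_commit_argv(message, signing_key):
    """Build a git command that signs this one commit with the given key."""
    return [
        "git",
        "-c", f"user.signingkey={signing_key}",
        "-c", "commit.gpgsign=true",
        "commit", "-m", message,
    ]


def commit_as(name, email, signing_key, message, repo_dir="."):
    """Commit staged changes as one user; gpg prompts for the key passphrase."""
    subprocess.run(
        signed_commit_argv(message, signing_key),
        cwd=repo_dir,
        env=identity_env(name, email),
        check=True,
    )


# Example (hypothetical user and key ID):
#   commit_as("Alix", "alix@example.com", "ABCD1234", "Fix preprocessing step")
```

The identity only applies to that one process, so the next person on the instance does not inherit it. As noted above, nothing stops someone from typing another name here; it is the signature check on the GitHub side that exposes a mismatch.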
AWS EXPERT, answered 2 months ago
AWS EXPERT, reviewed 2 months ago

Hi @Alex_T,

Thank you very much for your help and explanation—it's much appreciated. It sounds like individual code spaces are the way to go, if I understand you correctly.

If that’s the case, I have a follow-up question: Does having individual spaces (like SageMaker Studio Private Space) mean that different users won’t be able to access each other’s spaces or code? We sometimes need code reviews or wish to reuse others’ scripts, which makes a shared space more convenient for us.

Could you suggest a best practice for code sharing or code review when using private spaces? Let’s say we have 5 to 10 users—if each user has their own Studio JupyterLab space, what’s the best way to share code for review and collaboration?

Thanks again for your insight!

answered 2 months ago
  • So in my opinionated-opinion the best practice is that this is exactly the challenge addressed by git/GitLab/GitHub/etc version control systems. Pushing to a branch gives a way of sharing "work in progress" code, and pull requests / merge requests give a framework for reviewing the merge of specific changes into the team's "main" shared working branch. Viewing notebook diffs requires proper tooling because nobody just wants to look at the raw JSON, but SageMaker JupyterLab should be capable of rendering them nicely and I know GitLab/GitHub have plugins available to help with PR/MR review experiences too.

    IMO the main difference between ML and normal software dev in this regard is that 1/ the specific version of the input dataset is also important, and 2/ the computation to re-create the calculation from scratch (e.g. train a model) might be slow/big. The solutions for this are 1/ to be clear about whether you're trying to share & review code or results generated by code or both, and 2/ to also be rigorous about how you version control and share datasets, for which git usually isn't the best tool but there are others out there like DVC in Open Source or SageMaker/DataZone/Glue's native capabilities.

    You can also explore tools like SageMaker Model Registry for sharing model versions with attached evaluation metrics, and SageMaker Managed MLflow for automatically tracking experiments in a shared UI the team can explore and analyze.
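The branch-based sharing described above could be sketched like this in Python, run from each user's own space. The `user/topic` branch naming scheme and the helper names are assumptions for illustration, `git switch` needs git 2.23 or newer, and the pull request itself would still be opened in the GitHub UI afterwards.

```python
import re
import subprocess


def wip_branch_name(user, topic):
    """Build a branch name like 'alix/feature-x' from a user and a topic."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"{user.lower()}/{slug}"


def share_wip_commands(user, topic, remote="origin"):
    """Return the git commands that create and publish a personal WIP branch."""
    branch = wip_branch_name(user, topic)
    return [
        ["git", "switch", "-c", branch],
        ["git", "push", "--set-upstream", remote, branch],
    ]


def run_all(commands, repo_dir="."):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)


# Example (hypothetical user and topic):
#   run_all(share_wip_commands("Alix", "Feature X: tuning"))
```

Prefixing the branch with the user's name is just one way to make it obvious whose work in progress a branch is; reviewers can then fetch it into their own space or comment on the pull request.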
