Lost changes in SageMaker Studio notebooks


I have repeatedly encountered an issue with recent changes being lost in a SageMaker Studio collaborative space. It occurs when the session expires and requires reconnection. Even though the notebook instance is running, the checkpoint files are missing and the notebook does not display at all. When I check the file via the terminal, an earlier state (as much as 11 hours old) is displayed. Any idea how to prevent this?

I am using VpcOnly mode with private subnets only. There are two active users but only one is currently using the collaborative space. The notebooks often run for a couple of hours.

1 Answer

It sounds (from your mention that the notebooks often run for a couple of hours) like you're running long-running, perhaps non-interactive, tasks on the notebook itself?

In general (not particularly relating to data loss), I'd suggest avoiding this if you can: SageMaker's containerized jobs (processing, training, batch transform) should be the preferred choice for running your serious jobs/experiments in a way that's:

  • Scalable - Supporting multi-instance distribution as well as selecting larger or smaller instance types
  • Automatically trackable - Job metadata & parameters are stored in the SageMaker APIs, the exact version of your code is automatically uploaded to S3 (assuming "script mode"), and logs+metrics are stored via CloudWatch
  • Repeatable - In addition to the metadata logging, each job initializes with a fresh, well-defined container image, without the risk that somebody installed libraries / changed something in the shared notebook space
  • Cost-effective - You can keep your notebook instance (which, being interactive, may often be idle) small, select the instance sizes needed for your jobs, and pay only for the time the jobs run

If you must, you could run your notebook as a job... But ideally I'd recommend factoring your code into .py scripts and using actual Processing/Training jobs for richer metadata history. There's an example here comparing a "local" (all-in-notebook) notebook to a "SageMaker" (code in scripts, notebook initiates the training job) one for model training. You could even use a SageMaker Pipeline to orchestrate multiple steps like pre-processing > training > evaluation.
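As a rough sketch of what "notebook initiates the training job" can look like, the snippet below builds the request for the boto3 SageMaker `create_training_job` call. The job name prefix, image URI, role ARN, S3 path, and instance type are all placeholders I've made up for illustration, and the helper function names are hypothetical. The higher-level SageMaker Python SDK wraps the same API and handles uploading your script for you, so treat this as an illustration of the idea rather than the recommended way to do it.

```python
import time


def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge", instance_count=1,
                               max_runtime_seconds=3600):
    """Build keyword arguments for boto3's SageMaker create_training_job call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container that runs your training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,                 # execution role the job assumes
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,   # sized for the job, not for the notebook
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


def submit_training_job(request):
    """Submit the job. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so building the request works without boto3
    return boto3.client("sagemaker").create_training_job(**request)


# Placeholder values only: replace every one of these with your own.
request = build_training_job_request(
    job_name="example-train-" + time.strftime("%Y%m%d-%H%M%S"),
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<your-image>:latest",
    role_arn="arn:aws:iam::<account>:role/<your-sagemaker-role>",
    output_s3_uri="s3://<your-bucket>/training-output/",
)
```

Once submitted, the job keeps running (and logging to CloudWatch) regardless of what happens to your Studio session, which is the main point here.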

Anyway, back to Studio data loss...

In my experience, disconnecting or timing out from Studio hasn't caused any interruption of actual kernel computation or loss of data written to disk... But I have seen that cell outputs (i.e. console logs) can get cut off: For example if I start a long-running cell that trains a model and then saves it, then close the browser and come back tomorrow, I'd expect the model to be trained and saved and to be able to resume the notebook from the next cell... But some training logs might be missing from the cell output.

I tentatively believe, based on this open Jupyter GitHub issue from 2015, that this issue isn't particular to SageMaker. It seems like Jupyter's launch of real-time collaboration was likely to improve the situation a bit, but not expected to fix it completely in the case that all users disconnect? For cell output, my suggested workaround would be to log important things to file in your long-running cells (or, better yet, use SageMaker jobs instead as mentioned above).
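For the log-to-file workaround, something like the following would do, using only Python's standard `logging` module. The log file location, logger name, and the `run_long_task` placeholder are my own assumptions for illustration; swap in your real training loop and a path on the Studio file system that you can reach from a terminal.

```python
import logging
import os
import tempfile


def get_file_logger(log_path, name="long_running_cell"):
    """Return a logger that appends to log_path, independent of cell output."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    target = os.path.abspath(log_path)
    # Re-running the cell must not attach a second handler for the same file.
    already_attached = any(
        isinstance(h, logging.FileHandler) and h.baseFilename == target
        for h in logger.handlers
    )
    if not already_attached:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def run_long_task(logger, n_steps=3):
    """Placeholder for the real long-running work (e.g. training epochs)."""
    for step in range(1, n_steps + 1):
        # ... real work for this step would go here ...
        logger.info("finished step %d of %d", step, n_steps)
    logger.info("task complete")


# Assumed location: the system temp directory. Use any persistent path you like.
log_path = os.path.join(tempfile.gettempdir(), "long_running_cell.log")
logger = get_file_logger(log_path)
run_long_task(logger)
```

Because each record is flushed to the file as it is written, you can `tail` the log from a Studio terminal after reconnecting, even if the cell output in the notebook got cut off.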

If you're instead struggling with losing unsaved work when re-authenticating, my suggestion would be:

  • When your session times out, leave that tab as it is. Do not exit, refresh, or close it: Jupyter is a hybrid IDE and some state is stored client-side.
  • In a new browser tab, re-launch Studio and log in. When you do, it might present you with a blank workspace.
  • Close the new tab and go back to the old one: With your auth cookie refreshed, your previous workspace should be there exactly as you left it including any unsaved files that didn't get synced to the server.

If your issue is something other than disappearing cell outputs or lost changes due to refreshing for re-authentication, then apologies, I haven't faced what you saw yet! Either way, if you are doing long-lived, compute-intensive work directly in the notebook, I'd recommend looking at processing/training jobs instead.

answered a year ago
