Data Analysis with AWS Transform for mainframe - Understand your application's data landscape
Learn how the Data Analysis capability in AWS Transform for mainframe surfaces data relationships across programs, JCL jobs, and datasets. This article covers Data Lineage and Data Dictionary using the CardDemo sample application, and shows how the downloadable artifacts can support impact assessment and migration planning.
Data Analysis with AWS Transform for mainframe
This article demonstrates how to use the Data Analysis capability in AWS Transform for mainframe to understand data relationships across programs, JCL jobs, and datasets, which is a critical step before making modernization decisions. Using the CardDemo sample application, we walk through how Data Lineage and Data Dictionary work together to surface insights that would otherwise require weeks of manual analysis.
Why Data Analysis matters in mainframe modernization
Mainframe applications carry decades of accumulated data complexity. COBOL programs reference datasets through logical names that vary from program to program. JCL jobs wire datasets through DD names that only make sense at runtime. Copybooks define data structures that often exist nowhere else in documentation. Understanding this data landscape, specifically which programs touch which datasets, how they access them, and what the data looks like, is foundational to any modernization effort.
Data Analysis helps teams:
- Perform complete impact analysis when migrating or replacing a dataset
- Accurately identify shared data dependencies between programs
- Preserve context on data structures during transformation mapping
The Data Analysis capability in AWS Transform automates this discovery. It produces two complementary outputs, each covered in detail in this article:
- Data Lineage: Traces the relationships between data sources, programs, and JCL jobs, covering the "where" and "how" of data usage
- Data Dictionary: Documents the structural metadata of data elements with business language descriptions, covering the "what" of data structure
Together, they provide an integrated view: lineage shows who uses the data and how, while the dictionary shows what the data looks like and what it means.
Prerequisites
Data Analysis requires code analysis as a prerequisite. When creating a job plan that includes data analysis, AWS Transform for mainframe automatically adds code analysis as a preceding step. To get started, provide the location of the application source code in an Amazon S3 bucket during the "Kick off modernization" step of the job plan.
Data Lineage: Tracing who uses what data and how
Data lineage tracks the flow of data across a mainframe application, including where it originates, which programs and JCL jobs interact with it, and how it's accessed. For the CardDemo application, the Data Lineage summary surfaces these relationships at a glance:
- 130 data sets across VSAM KSDS, VSAM PATH, GDG Base, and Non-VSAM types
- 2 Db2 tables (TRANSACTION_TYPE and TRANSACTION_TYPE_CATEGORY)
- 30 COBOL programs that access data sources
- 49 JCL jobs that manage data operations
- 104 total operations (reads, writes, updates, deletes, inserts, selects)
Some of these numbers are clickable drill-downs. The four views available, namely Data sets, Db2 tables, Programs, and JCLs, approach the same data relationships from different angles.
Data sets view
The Data sets view lists every dataset in the application with its usage count, type, and read/write breakdown. Expanding a dataset reveals every program that references it, along with the logical name each program uses, the number of operations, and whether the access is read, write, or both.
For example, expanding AWS.M2.CARDDEMO.ACCTDATA.VSAM.KSDS, the primary account data file, shows 11 COBOL programs referencing it through 5 different logical names: ACCTFILE-FILE, ACCOUNT-FILE, ACCOUNT-INPUT, ACCT-FILE, and ACCTDAT. Some programs only read from it (CBACT01C, CBEXPORT), while others perform read, update, and write operations (CBACT04C, CBTRN02C, COACTUPC).
This view also provides a "View data dictionary" button that navigates directly to the structural metadata for the associated copybook, bridging the gap between "who uses this data" and "what does this data look like."
Db2 tables view
For Db2 tables, the view shifts to database-specific operations: Select, Insert, Update, and Create. CardDemo has two Db2 tables:
- CARDDEMO.TRANSACTION_TYPE, accessed by 3 programs (COBTUPDT, COTRTLIC, COTRTUPC) with 10 total operations
- CARDDEMO.AUTHFRDS, a fraud detection table accessed only by COPAUS2C with 2 operations (Insert and Update)
The operation breakdown at the program level shows exactly which programs modify table data versus which only query it. This is useful context when planning migration sequencing or assessing change impact.
Programs view
The Programs view flips the perspective: instead of "which programs use this dataset," it shows "which datasets does this program use." Expanding a program like CBACT04C (interest calculation) reveals it touches 6 different data sources across VSAM KSDS, VSAM PATH, and GDG Base types, with copybook linkages (CVACT01Y, CVACT03Y, CVTRA01Y) visible alongside each dataset reference.
Use this view to understand the full data footprint of a specific program before making changes to it.
JCLs view
The JCLs view uses a split-pane layout: the left pane lists JCL jobs with their steps, and selecting a step shows its dataset references in the right pane with DD name, physical dataset name, type, and disposition.
Take TRANREPT.jcl as an example, a transaction reporting job with three steps:
- STEP05R runs SORT (system utility) against 2 datasets
- STEP05R.PRC001 runs IDCAMS (system utility) against 3 datasets
- STEP10R runs CBTRN03C (application program) against 6 datasets including CARDXREF, DATEPARM, TRANCATG, and TRANFILE
The disposition column (DISP) reveals dataset lifecycle intent: SHR means the dataset already exists and is opened for shared access, typically for reading, while NEW,CATLG,DELETE means the step creates a new dataset, catalogs it on normal completion, and deletes it if the step ends abnormally. The system program flag distinguishes utility steps from application logic, which is important when tracing which steps contain business-relevant data operations versus housekeeping.
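DISP values follow the standard JCL form (status, normal disposition, abnormal disposition). The following is a minimal sketch of how a script might interpret them when post-processing lineage data. The function name and the returned keys are illustrative, and the exact formatting of the DISP value in the UI or artifacts (for example, with or without parentheses) is an assumption.

```python
def parse_disp(disp: str) -> dict:
    """Split a JCL DISP value into status, normal, and abnormal dispositions."""
    # Accept both "NEW,CATLG,DELETE" and "(NEW,CATLG,DELETE)" (format is assumed).
    parts = [p.strip().upper() for p in disp.strip().strip("()").split(",")]
    parts += [""] * (3 - len(parts))  # pad missing subparameters
    status, normal, abnormal = parts[:3]
    status = status or "NEW"  # JCL treats an omitted status as NEW
    return {
        "status": status,
        "normal": normal,      # action on normal step completion
        "abnormal": abnormal,  # action on abnormal step termination
        "creates_dataset": status == "NEW",
        # NEW, OLD, and MOD request exclusive control; SHR allows sharing.
        "exclusive": status in {"NEW", "OLD", "MOD"},
    }
```

Filtering steps where creates_dataset is true gives a quick list of datasets a job produces, as opposed to those it only consumes.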
Three downloadable artifacts
Beyond the UI, Data Lineage produces three structured CSV artifacts. They are packaged together with the Data Dictionary artifacts in a single ZIP file, downloadable from S3 at:
s3://<<your_s3_bucket>>/transform-output/<<job-id>>/1/data_analysis/data_analysis_result_yyyymmdd_hhmmss.zip
Note: After downloading the ZIP file, verify that it contains 5 CSV files: three for data lineage and two for the data dictionary.
The three data lineage artifacts are:
- program_to_dsn.csv maps each COBOL program to the datasets it accesses (Program → DD Name → Physical Dataset), with logical names, copybook references, data source types, and access patterns (READ, WRITE, UPDATE, DELETE, INSERT, SELECT).
- jcl_to_dsn.csv maps each JCL job to datasets at the step level (JCL → Step → DD Name → Physical Dataset), with dispositions and a system program flag that distinguishes utility programs (IDCAMS, SORT, IEBGENER) from application programs.
- dsn_to_file.csv is the consolidated reverse lookup (Physical Dataset → all programs and JCLs). For each physical dataset, it lists every program and JCL that references it, combining both forward lookups into a single view.
Together, these artifacts enable tracing from the varied logical names programs use in code all the way to the actual physical dataset on disk.
Data Dictionary: Understanding what the data looks like
While lineage answers "who uses what data and how," the Data Dictionary answers "what does the data actually look like and what does it mean." It catalogs field-level metadata for both COBOL copybook structures and Db2 tables.
COBOL data structures
CardDemo contains 59 COBOL data structures documented across its copybooks. Select any structure to reveal a detailed, scrollable table with field-level metadata organized in two groups:
Structural metadata (left columns): Field name, field type (RECORD, GROUP, or FIELD), logical group, mainframe data type (e.g., X(12) for alphanumeric, S9(4) COMP for binary), generic data type, COBOL level number, and data length. This preserves the hierarchical nesting that COBOL uses, where Level 01 records contain Level 02 groups, which contain Level 03 fields.
Business context (right columns): Business definition, decimal positions, REDEFINES relationships, root record, field position (byte offset), and value clause.
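To illustrate how the mainframe data type column relates to the data length column, the sketch below estimates storage length from a type string such as S9(4) COMP, using the usual COBOL storage rules (one byte per position for DISPLAY, 2/4/8 bytes for binary, two digits per byte plus a sign nibble for packed decimal). The function names are illustrative, the assumption that the picture and usage are separated by a single space is based only on the examples in this article, and edited pictures and SIGN SEPARATE are not handled.

```python
import re

def picture_positions(picture: str) -> int:
    """Count character/digit positions in a PICTURE string such as 'S9(7)V99'."""
    total = 0
    # Match X, 9, or A, optionally followed by a repeat count in parentheses.
    for _symbol, repeat in re.findall(r"([X9A])(?:\((\d+)\))?", picture.upper()):
        total += int(repeat) if repeat else 1
    return total

def storage_bytes(mainframe_type: str) -> int:
    """Estimate storage length for a type string such as 'S9(4) COMP' or 'X(12)'."""
    picture, _, usage = mainframe_type.strip().partition(" ")
    usage = usage.strip().upper() or "DISPLAY"
    positions = picture_positions(picture)
    if usage in {"COMP", "COMP-4", "COMP-5", "BINARY"}:
        # Binary: halfword, fullword, or doubleword depending on digit count.
        return 2 if positions <= 4 else 4 if positions <= 9 else 8
    if usage in {"COMP-3", "PACKED-DECIMAL"}:
        # Packed decimal: two digits per byte, plus a trailing sign nibble.
        return positions // 2 + 1
    return positions  # DISPLAY: one byte per position
```

Comparing these estimates with the data length reported in the dictionary is one way to sanity-check a conversion script before applying it to real records.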
The business definition column is worth highlighting. Each field gets a plain-English description. For example, TRNNAMEI is described as "Transaction name input field, 4 character transaction identifier." These descriptions surface knowledge that is rarely documented elsewhere. They also serve as a foundation for natural language querying when building chat-based interfaces over mainframe data.
Other details worth noting:
- REDEFINES relationships are captured, showing where COBOL overlays one field definition on another's memory, a common pattern that's critical to understand during data mapping
- Value clauses indicate field initialization values, useful for understanding default states
- Mainframe data types like S9(4) COMP (binary) help identify fields that need special handling during transformation. Binary and packed-decimal types don't map directly to modern data types without conversion logic
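As an illustration of the conversion logic that binary and packed-decimal fields require, the following sketch decodes both from raw bytes. It assumes standard z/OS conventions (big-endian two's complement for COMP, sign in the final nibble for COMP-3); the function names are illustrative and are not part of AWS Transform.

```python
from decimal import Decimal

def decode_comp(raw: bytes) -> int:
    """Decode a signed binary (COMP) field: big-endian two's complement."""
    return int.from_bytes(raw, byteorder="big", signed=True)

def decode_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field with `scale` implied decimal places."""
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign_nibble = nibbles.pop()  # the last nibble carries the sign
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid packed-decimal digit")
    value = int("".join(map(str, nibbles)) or "0")
    if sign_nibble in (0x0B, 0x0D):  # B and D mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)
```

For example, the three bytes 0x12 0x34 0x5C with two implied decimal places decode to 123.45. The scale would come from the decimal positions column of the dictionary.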
Db2 tables
The Db2 tables view follows the same pattern. It shows CardDemo's TRANSACTION_TYPE and TRANSACTION_TYPE_CATEGORY tables with primary keys, Db2 data types, data lengths, and business definitions visible in a single view. Additional columns such as foreign key references, nullable indicators, decimal precision, and default values are also available in the UI and captured in the downloadable CSV artifact.
Two downloadable artifacts
The Data Dictionary also produces structured CSV artifacts:
- data_dictionary_cpy.csv provides field-level metadata for all COBOL copybook structures. Columns include field name, type, mainframe data type, generic data type, level, data length, business definition, REDEFINES, value clause, and more.
- data_dictionary_ddl.csv provides Db2 table and column metadata including data types, primary/foreign keys, nullable flags, default values, and business definitions.
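With all five artifact names known, checking that the downloaded ZIP contains them can be scripted. This is a minimal sketch using only the Python standard library; it assumes the CSV files may sit at the archive root or inside folders, which is why it compares base names only. The function name is illustrative.

```python
import zipfile
from pathlib import PurePosixPath

EXPECTED_CSVS = {
    "program_to_dsn.csv",
    "jcl_to_dsn.csv",
    "dsn_to_file.csv",
    "data_dictionary_cpy.csv",
    "data_dictionary_ddl.csv",
}

def missing_artifacts(zip_source) -> set:
    """Return the expected CSV names that are absent from the archive.

    `zip_source` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        # Ignore any folder prefix inside the archive (layout is assumed).
        present = {PurePosixPath(name).name for name in archive.namelist()}
    return EXPECTED_CSVS - present
```

An empty result means all five files are present; anything else names the files to look for before starting analysis.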
Navigating between lineage and dictionary
The Data Lineage and Data Dictionary views are cross-linked. From any dataset in lineage, choose "View data dictionary" to navigate to the copybook metadata; from the dictionary, choose "View data lineage" to show which programs and JCLs reference that structure.
Practical use cases
The UI views are valuable for exploration, but the real power of Data Analysis emerges when the downloadable artifacts are used to answer specific questions.
Impact analysis of a data source
When planning to migrate a specific dataset or Db2 table, the first question is: what depends on it? Using dsn_to_file.csv, look up any data source to get the complete list of programs and JCLs that reference it. The artifact also shows whether each program reads, writes, or updates that dataset, helping assess migration risk. A dataset accessed by multiple programs with write operations carries higher coordination effort than one that is read-only.
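The lookup described above can be sketched as a small script over dsn_to_file.csv. The column names used here (physical_dataset, program, access_pattern) and the way multiple operations are separated within a cell are assumptions, since the artifact's exact header is not reproduced in this article; adjust them to match the real file.

```python
import csv
import re
from collections import defaultdict

# Access patterns that change data, taken from the list of patterns above.
MODIFYING_OPS = {"WRITE", "UPDATE", "DELETE", "INSERT"}

def summarize_impact(csv_file, dataset_col="physical_dataset",
                     program_col="program", access_col="access_pattern"):
    """Group referencing programs per dataset into readers and writers."""
    impact = defaultdict(lambda: {"readers": set(), "writers": set()})
    for row in csv.DictReader(csv_file):
        # Tolerate any separator between operations (separator is assumed).
        ops = {op for op in re.split(r"[^A-Z]+", row[access_col].upper()) if op}
        role = "writers" if ops & MODIFYING_OPS else "readers"
        impact[row[dataset_col]][role].add(row[program_col])
    for roles in impact.values():
        roles["readers"] -= roles["writers"]  # a writer is not read-only
    return dict(impact)
```

Datasets with more than one entry under writers are the ones most likely to need coordinated migration, while those with only readers are candidates for earlier, lower-risk moves.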
For transitive dependencies, meaning programs that don't directly access the dataset but depend on programs that do, combine this with the AWS Transform dependency analysis artifact, a JSON document that traces the complete dependency chain between all components.
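Finding transitive dependents amounts to a graph traversal. The sketch below assumes the dependency JSON has already been reshaped into a plain mapping from each component to the set of components that depend on it; that shape and the function name are hypothetical, because the structure of the dependency artifact is not described here.

```python
from collections import deque

def transitive_dependents(direct_users, dependents_of):
    """Return components that reach any direct user through the dependency graph.

    direct_users:  programs that access the dataset directly (from lineage).
    dependents_of: mapping of component -> components that depend on it
                   (hypothetical shape derived from the dependency artifact).
    """
    seen = set(direct_users)
    queue = deque(direct_users)
    while queue:  # breadth-first walk up the dependency chain
        component = queue.popleft()
        for dependent in dependents_of.get(component, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - set(direct_users)
```

The result contains only the indirect dependents, so it can be reported alongside the direct list from dsn_to_file.csv without double counting.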
Tracing logical names to physical datasets
Different programs often use different logical names for the same physical dataset. The lineage artifacts resolve this by mapping Logical Name → DD Name → Physical Dataset Name.
Consider the account data file AWS.M2.CARDDEMO.ACCTDATA.VSAM.KSDS. Across CardDemo's programs, this single dataset is referenced by 5 different logical names across 10 programs.
Without resolving these names to the same physical dataset, any downstream analysis would treat ACCTFILE-FILE, ACCOUNT-FILE, ACCOUNT-INPUT, ACCOUNT-OUTPUT, and ACCTDAT as five separate files. That's a significant source of error when assessing data dependencies or planning migration scope.
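Resolving aliases this way is a simple grouping over program_to_dsn.csv. The following is a minimal sketch; the column names physical_dataset and logical_name are assumptions about the artifact's header and may need to be changed.

```python
import csv
from collections import defaultdict

def logical_names_by_dataset(csv_file, dataset_col="physical_dataset",
                             logical_col="logical_name"):
    """Collect every logical name used for each physical dataset."""
    names = defaultdict(set)
    for row in csv.DictReader(csv_file):
        names[row[dataset_col]].add(row[logical_col])
    # Sorted lists make the output stable and easy to review.
    return {dsn: sorted(aliases) for dsn, aliases in names.items()}
```

Any dataset whose list has more than one entry is a place where name-based analysis of the source code alone would undercount sharing.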
Combining lineage with dictionary
The artifacts become more powerful when combined. Lineage shows which programs share a dataset; the dictionary shows what that dataset contains, including field types like S9(4) COMP (binary, requiring conversion) and business definitions that clarify what each field represents.
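One way to combine the two is to link lineage rows to dictionary rows through the copybook name, then flag the fields that need conversion for each shared dataset. This sketch works on rows already loaded as dictionaries (for example with csv.DictReader). All column names (physical_dataset, copybook, mainframe_data_type, field_name) are assumptions, as is the presence of a copybook column in the dictionary artifact.

```python
from collections import defaultdict

def fields_needing_conversion(lineage_rows, dictionary_rows):
    """Map each physical dataset to the COMP-type fields of its copybooks."""
    copybooks = defaultdict(set)
    for row in lineage_rows:
        if row.get("copybook"):
            copybooks[row["physical_dataset"]].add(row["copybook"])

    flagged = defaultdict(list)
    for row in dictionary_rows:
        # Matches COMP, COMP-3, COMP-5, and similar usages.
        if "COMP" not in row["mainframe_data_type"].upper():
            continue
        for dataset, books in copybooks.items():
            if row["copybook"] in books:
                flagged[dataset].append(row["field_name"])
    return dict(flagged)
```

The output gives, per dataset, a short list of fields to review with the conversion logic in mind before mapping the record layout to a target schema.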
Key takeaways
Data Analysis in AWS Transform provides a structured, automated way to understand the data landscape of a mainframe application. Here's what to take away:
- The UI is an exploration layer; the artifacts are the foundation. The five downloadable CSV artifacts (three from lineage, two from dictionary) are structured, portable files that load into spreadsheets, scripts, databases, or custom analysis tools.
- Multiple logical names, one physical dataset. Programs reference the same dataset through different names. In CardDemo, a single account file goes by 5 different names across 10 programs. The lineage artifacts resolve these to physical dataset names, eliminating duplicate references and enabling accurate dependency mapping.
- Business definitions surface undocumented knowledge. The field descriptions in the data dictionary provide plain-English context for COBOL fields and Db2 columns, useful for migration planning, documentation, or building natural language interfaces over mainframe data.
- Data Analysis is an assessment accelerator, not a migration tool. It surfaces the data relationships and structures needed to make informed modernization decisions. For transitive dependencies across the full application, combine it with the AWS Transform dependency analysis artifact obtained through code analysis.
Conclusion
Data Analysis produces structured, downloadable CSV artifacts that go beyond what the UI shows. These artifacts load into spreadsheets, scripts, or databases to support impact analysis, data migration planning, and application decomposition. Start with Data Analysis early in the modernization journey. The data relationships it surfaces will inform every downstream decision, from which datasets to migrate first, to which programs share data and need to move together.
