Workflow: Scripts vs Notebooks + Style Guide (engineering standard)

1) The real problem: “it runs” is not enough

In an engineering setting, R is not there to “produce a pretty plot”; it is there to produce verifiable artifacts:

A rerun today and a rerun three months from now give the same result (or a different one for a clearly explained reason).
The path of the data and the logic is clear enough to debug when the numbers are wrong.
The structure is good enough for someone else to read, review, and keep developing.

This page settles three things: workflow (scripts vs notebooks), style guide, and reproducibility.

2) Workflow: scripts vs notebooks

2.1 Choose by deliverable

Deliverable	Use	Why
Automated pipeline (CI/cron) that produces output files	Script (`scripts/`)	Non-interactive; easy to parameterize, test, and log
Data exploration, hypothesis testing, quick EDA	Notebook/Report (Quarto/Rmd)	Fast iteration, narrative, inline visual output
Official report sent to stakeholders	Report (Quarto/Rmd)	Reproducible render; prose, code, and output stay together
Internal library (reuse, tests, versioning)	Package or `R/` functions	Clear API, unit tests, well-managed dependencies

Principle: interactive to explore, non-interactive to ship.

2.2 Rule of thumb (pragmatic)

Notebook/Report: at most 20–30% “data transformation logic”. The rest belongs in functions/modules so it can be reused and tested.
Script: has a clear “entrypoint”, takes parameters (input paths, seed, config), and produces structured output.
Anything likely to be rerun many times, or run on another machine: prefer a script.

2.3 A standard workflow (recommended)

✅ Implementation checklist

Step 0: Explore (notebook/report)

Quick EDA; identify input/output variables and the shape of the data
Lock down the data “contract”: which columns are required, their types, and the key

Step 1: Extract logic (functions)

Move data transformation functions into R/ or src/ (depending on layout)
Functions have explicit input/output and do not depend on global state

Step 2: Orchestrate (script)

The script calls functions, handles I/O, and writes output
Save session info + seed + metadata alongside the output

Step 3: Communicate (report)

The report only “reads output” and tells the story (plots/tables)
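The split between Step 1 and Step 2 can be sketched in a single listing. This is a minimal illustration, not a prescribed layout: `count_events_by_type()`, `run_pipeline()`, and the output file names are hypothetical, and in a real project the pure function would live under R/ while the orchestration would live under scripts/.

```r
# Step 1: pure logic. No file access, no global state (hypothetical function).
count_events_by_type <- function(events_df) {
  stopifnot(is.data.frame(events_df), "event_type" %in% names(events_df))
  counts <- table(events_df$event_type)
  data.frame(
    event_type = names(counts),
    n_events = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Step 2: orchestration. All I/O and side effects are kept here.
run_pipeline <- function(input_path, output_dir, seed = 20250101L) {
  set.seed(seed)
  events_df <- read.csv(input_path, stringsAsFactors = FALSE)
  counts_df <- count_events_by_type(events_df)

  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  output_path <- file.path(output_dir, "event_counts.csv")
  write.csv(counts_df, output_path, row.names = FALSE)

  # Keep the environment description next to the output it produced.
  writeLines(
    capture.output(sessionInfo()),
    file.path(output_dir, "sessionInfo.txt")
  )
  invisible(output_path)
}
```

A report (Step 3) would then read `event_counts.csv` from the output directory instead of recomputing it from raw data.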

3) Style guide: writing R to an engineering standard

3.1 Naming: reduce entropy, improve readability

Pick one style and keep it consistent.

Pragmatic recommendations:

snake_case for variables and functions: load_raw_events(), calculate_retention().
Variable names state what they hold: events_df, user_ids, retention_rate.
Booleans start with is_, has_, should_: is_valid_user, has_missing_values.
Functions use verb + noun: validate_schema(), build_features(), write_artifacts().
Constants are collected in one place and named clearly: MIN_ROWS, DEFAULT_SEED.

Avoid:

Overly generic names such as data, tmp, df1.
T/F instead of TRUE/FALSE (T and F are ordinary variables that can be reassigned; TRUE and FALSE are reserved words).
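The T/F pitfall is easy to demonstrate. The snippet below is a small illustration with made-up variable names (`flag_short`, `flag_full`); it shows that `T` can be silently rebound while `TRUE` cannot.

```r
T <- 0               # legal: T is an ordinary variable that defaults to TRUE
flag_short <- T      # picks up the shadowed value, not TRUE
flag_full <- TRUE    # reserved word, always the logical constant

# `TRUE <- 0` would fail with an "invalid (do_set) left-hand side" error.

rm(T)                # remove the shadowing binding; T resolves to TRUE again
```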

3.2 Function boundaries: the “right size” for testing

A good function usually has:

One main job (if you need “and” to describe it, it is doing more than one thing).
Explicit input (parameters) and explicit output (return value).
Side effects (reading/writing files, setting options) live in the orchestrator layer (the script), not deep inside the logic.

Minimal example:

# Fail fast when the input is not a data frame or lacks a required column.
validate_events <- function(events_df) {
  stopifnot(is.data.frame(events_df))
  required_cols <- c("user_id", "event_time", "event_type")
  missing_cols <- setdiff(required_cols, names(events_df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)  # nothing to print on success
}
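The validator above checks column presence only. A data contract as described in Step 0 also covers missing values and the key. The sketch below is one possible extension under stated assumptions: `find_contract_violations()` is a hypothetical name, and treating (user_id, event_time, event_type) as the key is an assumption made for illustration, not something the source defines.

```r
# Return a character vector describing contract violations (empty if none).
# Assumption: key_cols together identify one row.
find_contract_violations <- function(
    events_df,
    key_cols = c("user_id", "event_time", "event_type")) {
  stopifnot(is.data.frame(events_df), all(key_cols %in% names(events_df)))
  violations <- character(0)

  # NA policy: key columns must be fully populated.
  for (col in key_cols) {
    n_missing <- sum(is.na(events_df[[col]]))
    if (n_missing > 0) {
      violations <- c(
        violations,
        sprintf("%s: %d missing value(s)", col, n_missing)
      )
    }
  }

  # Key check: no two rows may share the same key.
  n_dupes <- sum(duplicated(events_df[key_cols]))
  if (n_dupes > 0) {
    violations <- c(violations, sprintf("key: %d duplicated row(s)", n_dupes))
  }
  violations
}
```

Returning the violations instead of stopping immediately lets the orchestrating script decide whether to abort or to log and continue.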

3.3 File layout: put the right thing in the right place

A minimal “ship-ready” layout:

R/: reusable functions (pure logic)
scripts/: entrypoints (I/O + orchestration)
reports/: Quarto/Rmd (narrative + visualization)
data/: split into raw/, processed/, external/ where applicable
tests/: unit tests for the important functions

Principles:

A report does not quietly read raw data and run deep transformations on its own; it reads from processed/ or from pipeline output.
A script does not contain a 300-line dplyr chain; it calls functions.

3.4 Lint mindset: lint is a guardrail, not a police officer

Lint does not think for you, but it helps to:

Catch silly mistakes early (unused objects, style drift, bug-prone patterns).
Reduce style arguments in code review.

Mindset:

Treat lint like a test: run it automatically, and fix failures before merging.
“One shared standard” matters more than “the perfect standard”.
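One way to run lint like a test is a small wrapper that counts findings so CI can fail on a non-zero count. The sketch assumes the lintr package, which the source does not name; `count_lints()` is a hypothetical helper, and the guard returns NA when lintr is not installed rather than pretending the code is clean.

```r
# Count lint findings under `path`; NA means lintr is unavailable.
# Assumption: the project uses the lintr package for linting.
count_lints <- function(path = "R") {
  if (!requireNamespace("lintr", quietly = TRUE)) {
    return(NA_integer_)
  }
  length(lintr::lint_dir(path))
}

# In a CI script (hypothetical scripts/lint.R) this could gate the merge:
# n_lints <- count_lints("R")
# if (is.na(n_lints) || n_lints > 0) quit(status = 1)
```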

4) Reproducibility: seeds, session info, deterministic outputs

4.1 Seeds: where to set the seed

Goal: same input + same environment → same output.

Principles:

Set the seed at the “entrypoint” (script) and pass the seed/parameters explicitly.
If several steps use randomness, collect the seeds in a config and log them.
When running in parallel, use an RNG strategy designed for it (set the seed once, then let the framework manage the streams).

Sample entrypoint:

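# Read the seed from the first CLI argument; fall back to a fixed default.
args <- commandArgs(trailingOnly = TRUE)
seed <- if (length(args) >= 1) as.integer(args[[1]]) else 20250101L
if (is.na(seed)) stop("Seed must be an integer, got: ", args[[1]], call. = FALSE)
set.seed(seed)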
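For the parallel case mentioned in the principles, one option in base R is the "L'Ecuyer-CMRG" generator, which supports independent streams derived from a single seed via `parallel::nextRNGStream()`. The sketch below derives one stream per task so that results do not depend on which worker runs which task; `make_streams()` and `draw_with_stream()` are hypothetical helpers written for this illustration. With a real cluster, `parallel::clusterSetRNGStream()` plays a similar role.

```r
# Derive one independent RNG stream per task from a single seed.
# Note: this resets the caller's RNG state as a side effect.
make_streams <- function(seed, n_tasks) {
  old_kind <- RNGkind("L'Ecuyer-CMRG")           # switch kind, keep the old one
  on.exit(do.call(RNGkind, as.list(old_kind)), add = TRUE)
  set.seed(seed)
  stream <- get(".Random.seed", envir = globalenv())
  streams <- vector("list", n_tasks)
  for (i in seq_len(n_tasks)) {
    streams[[i]] <- stream
    stream <- parallel::nextRNGStream(stream)    # jump to the next stream
  }
  streams
}

# Draw n normal variates using exactly the given stream.
draw_with_stream <- function(stream, n) {
  old_kind <- RNGkind()                          # query only, no change
  on.exit(do.call(RNGkind, as.list(old_kind)), add = TRUE)
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(n)
}
```

Task i always uses stream i, so rerunning with the same seed reproduces each task's draws regardless of scheduling order.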

4.2 Session info: record it to debug “works on my machine”

Always save alongside the artifact:

sessionInfo() (R version, platform, attached packages)
the dependency lockfile (for example renv.lock if you use renv)
run metadata (timestamp, seed, input hash/paths)

Example of saving session info to a file:

dir.create("artifacts", showWarnings = FALSE, recursive = TRUE)
writeLines(capture.output(sessionInfo()), "artifacts/sessionInfo.txt")
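The run metadata item in the list above (timestamp, seed, input hash/paths) can be written the same way. This is a minimal sketch using only base R and the bundled tools package; `write_run_metadata()`, the `run_metadata.csv` file name, and the column names are hypothetical choices, not part of the source.

```r
# Write one metadata row per input file next to the pipeline output.
write_run_metadata <- function(output_dir, seed, input_paths) {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  metadata <- data.frame(
    input_path = input_paths,
    input_md5 = unname(tools::md5sum(input_paths)),  # NA if a file is missing
    seed = seed,
    run_at_utc = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    r_version = R.version.string,
    stringsAsFactors = FALSE
  )
  out_path <- file.path(output_dir, "run_metadata.csv")
  write.csv(metadata, out_path, row.names = FALSE)
  invisible(out_path)
}
```

Hashing the inputs makes it possible to tell later whether a changed result came from changed data or from changed code.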

4.3 Deterministic outputs: avoid “diff noise”

Some very common sources of “output drift”:

Unstable row order (joins, group-by without a sort)
Locale/timezone changing how dates and times are formatted
Floating point and number formatting on read/write

Checklist for stable output:

✅ Implementation checklist

Step 0: Sort before writing

Fix the key, then use order() or sort by the important columns before write.csv()

Step 1: Normalize timezone/locale when formatting

Prefer UTC when writing output timestamps

Step 2: Normalize rounding/format for numbers

Pick one rounding rule and apply it consistently across the report
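The three checklist steps can be combined into one writer. The sketch below is an illustration under assumptions: `write_stable_csv()` is a hypothetical helper, the ISO 8601 timestamp format and the default of 6 digits are arbitrary choices, and `method = "radix"` is used so that string ordering follows the C locale rather than the machine's locale.

```r
# Write a CSV whose bytes do not depend on row order, locale, or timezone.
write_stable_csv <- function(df, path, key_cols, digits = 6) {
  stopifnot(is.data.frame(df), all(key_cols %in% names(df)))

  # Step 0: sort by key; radix ordering uses C-locale collation for strings.
  sort_keys <- unname(as.list(df[key_cols]))
  row_order <- do.call(order, c(sort_keys, list(method = "radix")))
  df <- df[row_order, , drop = FALSE]

  for (col in names(df)) {
    if (inherits(df[[col]], "POSIXct")) {
      # Step 1: format timestamps explicitly in UTC.
      df[[col]] <- format(df[[col]], "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
    } else if (is.numeric(df[[col]]) && is.double(df[[col]])) {
      # Step 2: one rounding rule for every floating-point column.
      df[[col]] <- round(df[[col]], digits)
    }
  }

  write.csv(df, path, row.names = FALSE)
  invisible(path)
}
```

Writing the same rows in a different input order then produces an identical file, which keeps version-control diffs limited to real changes.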

5) “Definition of Done” for an R0 pipeline

✅ Implementation checklist

Ship checklist

Runs with one command (script entrypoint), with no clicking in the IDE
Has a clearly defined output folder and does not overwrite files elsewhere
Has seed + sessionInfo + renv lockfile for reproduction
Has minimal validation (schema/key check/NA policy)
