R script 脚本化

R 脚本简介 Rscript

脚本化，就是把程序写成有输入和输出的独立程序文件。就像shell的 ls 命令, 比对软件 STAR 一样，用户只需要调用脚本名字、给出参数，程序就能判断参数是否合法，并返回结果。把程序像黑盒一样封装起来，方便用户使用，还能轻易整合到 snakemake 等流程中。

推荐使用朴素的写法，尽量只使用R核心API，依赖越少越稳定，以便其他用户复用。

推荐：
- 脚本名字以 xx.script.R 结尾。
- 开头部分写上脚本的目的，使用方法，测试环境，版本变化
- R脚本模板 参考： bioToolKit/R_scripts/single_script/template.script.R

# Aim: do monocle2 as a script, output to a dir
# how to use: $ Rscript this.script.R xx.Seurat.obj [outputRoot]
# Env: R 4.1.0
# version: 0.2-20240504
# version: 0.3-20240504 set parameters

R 脚本化相关技术

脚本内部把参数列表保存在数组 myArgs=commandArgs(TRUE) 中，下标0是该脚本文件名本身，下标1是第一个参数，以此类推。

获取的参数是字符串形式，要做算术运算需要主动转为数字形式，比如 as.numeric(str2)。

一般还需要验证参数，比如个数、类型是否正确，输出文件是否已经存在等。

接收参数

# 实例(Ubuntu 20.04 + R 4.1.1)
#例1: 接收参数，并对第2个参数加100后输出
$ cat t1.R 
#获取命令行参数
myArgs = commandArgs(TRUE)

#myArgs是所有参数的向量
print(myArgs) 
print(class(myArgs))
message("length:", length(myArgs), ", first:", myArgs[1] )

n1=as.numeric(myArgs[2])
print( n1+100 )

运行该脚本，并传入参数:

$ Rscript t1.R hello 50
[1] "hello" "50"   
[1] "character"
length:2, first:hello
[1] 150


# 如果传入参数个数不够，不报错，但是会计算异常，比如第二个转为数字返回NA。
$ Rscript t1.R hello
[1] "hello"
[1] "character"
length:1, first:hello
[1] NA

小结: 通过 Rscript 执行R脚本。R脚本内通过 arr = commandArgs(TRUE) 接收参数，接收到的是一个数组，第一个参数的下标是1；类型为字符串，可以按需要强转。

参数校验与主动报错

先检查参数个数，再检查参数类型，还要检查输出文件是否存在，如果存在是否覆盖？

主动报错，即在参数不合法时，主动抛出错误，终止程序运行，并给出错误提示或给出合理化建议。

如果参数个数不对，则报错：

$ cat t2.R 
myArgs = commandArgs(TRUE)

if(length(myArgs) < 2){
  stop("You must provide at least 2 parameters!")
}

print(myArgs) 
message("length:", length(myArgs), ", first:", myArgs[1] )


$ Rscript t2.R hello #一个参数报错
Error: You must provide at least 2 parameters!
Execution halted

$ Rscript t2.R hello world #两个参数正常执行
[1] "hello" "world"
length:2, first:hello

如果输入(file)文件不存在，则报错

# 获取数据文件的行列数
$ cat t3.R
myArgs = commandArgs(TRUE)

input=myArgs[1]

if(!file.exists(input)){
  stop("Input file not exists:", input)
}

dat=read.table(input)
message(input, ":", nrow(dat), " x ", ncol(dat))



# 测试 
# 写入一个文件
$ R -e "write.table(iris, 'iris.txt')" 

$ Rscript t3.R "iris.txt" #有文件时，返回行列数
iris.txt:150 x 5

$ Rscript t3.R "iris2.txt" #没有文件时，主动报错
Error: Input file not exists:iris2.txt
Execution halted

如果输出(dir)文件夹不存在，则报错

$ cat t3B.R
myArgs = commandArgs(TRUE)

outputRoot="./"
if(length(myArgs)>=2){
  outputRoot=myArgs[2] 
  # must end with /
  if(substring(outputRoot, nchar(outputRoot)) != "/"){
    outputRoot=paste0(outputRoot, "/")
  }
  # the dir must exist
  if(!dir.exists(outputRoot)){
    stop( sprintf("dir not exist, outputRoot=%s", outputRoot) )
  }
}
message( sprintf("settings: outputRoot=%s", outputRoot) )


# 测试: 第二个参数是输出文件夹，不存在则报错
$ Rscript t3B.R obj /data4/
Error: dir not exist, outputRoot=/data4/
Execution halted

$ Rscript t3B.R obj /home/
settings: outputRoot=/home/

stopifnot(): 脚本参数的校验，和函数参数的校验类似。

# stopifnot(): If any of the expressions (in ... or exprs) are not all TRUE, stop is called, producing an error message indicating the first expression which was not (all) true.

$ cat t4.R 
myArgs = commandArgs(TRUE)
stopifnot(length(myArgs)>=2)
message( paste(myArgs, collapse=", ") )


# 测试1: 参数个数至少2个
$ Rscript t4.R 1 23 4 5
1, 23, 4, 5

# 测试2: 参数个数不够2个则报错
$ Rscript t4.R 1
Error: length(myArgs) >= 2 is not TRUE
Execution halted

打印时间戳进度条

运行时间长的脚本需要进度条，以缓解用户的焦虑，告诉用户程序运行到哪了，方便用户估计还要等多久。一般有伴随着进度条直接写时分秒的，还有写已经运行时间的。

显示时分秒的

for(i in 1:10){
  Sys.sleep(runif(1)*0.5+0.5)
  message("[",Sys.time(), "]", i)
}

输出:
[2022-04-20 11:31:36]1
[2022-04-20 11:31:37]2
[2022-04-20 11:31:38]3
...

其他时间相关函数：

# 查询时区
> Sys.time() #现在的时间 [1] "2022-12-04 07:49:19 UTC"
> Sys.timezone() #查看当前时区 [1] "Etc/UTC"

# 支持的时区
> OlsonNames() #所有R支持的时区
> grep("Shanghai", OlsonNames(), value=T)
#[1] "Asia/Shanghai"

# 设置时区后，服务器R时间就和本地一致了
> Sys.setenv(TZ = "Asia/Shanghai")
> Sys.timezone() # 查看时区 "Asia/Shanghai"

> Sys.time() #现在的时间和windows右下角一致了
# [1] "2022-12-04 15:59:50 CST"

显示运行时间

start=as.numeric(Sys.time())
for(i in 1:100){
  if(i%%10==0){
    print(paste(i, '; Elapse', round(as.numeric(Sys.time())-start,2), 'seconds'))
  }
  
  # time consuming task
  Sys.sleep(0.05) #休眠0.05s
}
输出:
[1] "10 ; Elapse 0.48 seconds"
[1] "20 ; Elapse 1 seconds"
[1] "30 ; Elapse 1.5 seconds"
[1] "40 ; Elapse 2.02 seconds"
... 

记录总时间

start_time=Sys.time()
Sys.sleep(5)
a1=Sys.time()-start_time
round(as.numeric(a1),2) #5.1s

> message("Time:",round(as.numeric( Sys.time()-start_time ),2), "seconds")
Time:21.93seconds

如果没有文件夹，则创建文件夹

# create dir
# data/
# pdf/
if(!dir.exists( paste0(outputRoot,"data/") )){
  dir.create( paste0(outputRoot,"data/") )
}
if(!dir.exists( paste0(outputRoot,"pdf/") )){
  dir.create( paste0(outputRoot,"pdf/") )
}

system() 在R中调用系统命令

invokes the OS command specified by command.
Usage
system(command, intern = FALSE,
       ignore.stdout = FALSE, ignore.stderr = FALSE,
       wait = TRUE, input = NULL, show.output.on.console = TRUE,
       minimized = FALSE, invisible = TRUE)
默认等待该系统命令结束 wait=T，才执行下一行。

# R脚本
$ vim test.R
Args = commandArgs(T)

shell_cmd = paste0("grep -n CHR ", Args[1])
grep_out = system(shell_cmd, intern = TRUE) #调用shell的grep命令，需要返回值
# cat(grep_out)
print(grep_out)

文档内容
$ cat aa.txt 
this
is CHR
a
CHR2 book

调用结果
$ Rscript test.R aa.txt
# 2:is CHR 4:CHR2 book  ## R中使用cat的输出
[1] "2:is CHR"     "4:CHR2 book "

把系统命令 awk 封装到R函数中

# https://github.com/ElkonLab/scAPA/blob/master/scAPA/R/First_threesteps.R
#'  The function reads the output file generated by Homer findPeaks. It
#'  Renders it into a bed file, sort it, and then merge peaks that are
#'  less than 100 (nt) apart using bedtools merge.
#' @return Writes a bed file 'merge.peakfile.bed' with the merged peaks.
merge_peaks = function(bedtools.path, path, peaks.file){
	# home_dir = getwd()
	# on.exit(setwd(home_dir))
	# setwd(path)
	
	#R中执行 awk 
	readHomerfile.command = paste0('awk \'BEGIN{OFS = "\t"}{if(NR>34) ',
								  'print $2, $3, $4, "',
								',.",".",$5}\' ', peaks.file,
								' > ./notsorted_notmerged_peak.bed')
	system(command = readHomerfile.command, wait = T)  
	
	# R中调用 bedtools sort 
	# Sort the bed using bedtools Sort (required for bedtools merge)
	bedtools.sort.command = paste0(bedtools.path, "bedtools sort -i ./notsorted",
									 "_notmerged_peak.bed > ./notmerged_peak.bed")
	system(command = bedtools.sort.command, wait = T)
	system(command = "rm ./notsorted_notmerged_peak.bed", wait = T) # 删掉临时文件 rm
	
	
	# Merge using bedtools merge
	bedtools.merge.command = paste0(bedtools.path,
									 "mergeBed -d 100 -s -c 6 -o distinct -i ",
									 "./notmerged_peak.bed > ./merge.peakfile")
	system(command = bedtools.merge.command, wait = T)
	
	#Turning the output into a bed file
	render.tobed.command = paste0('awk \'BEGIN{OFS = "\t"}{print $1, $2, $3, ',
								   '".",".",$4}\' ./merge.peakfile > .',
								   '/merge.peakfile.bed')
	system(command = render.tobed.command, wait = T)
	system(command = "rm ./merge.peakfile", wait = T)
	system(command = "rm ./notmerged_peak.bed", wait = T)
}

Rmarkdown 自动化报告

帮助文件: 官方入门教程 (rmarkdown-cheatsheet), cheatsheet.pdf,

权威指南: R Markdown: The Definitive Guide

Rmd 可以把R代码(甚至bash/Python等)嵌入 markdown 中，通过渲染步骤，动态生成 html /pdf /ppt 等文档(甚至 notebook/slide/dashboard/website等)。方便分析、共享、重复。

Rmarkdown 延伸出了 bookdown 这个写书的R包。

Rmd基本流程

1. 新建Rmd文件。 Rstudio 菜单 File - New File - R Markdown，输入文件名后点击OK，右击保存为 viridis.Rmd(文件名任意，后缀是.Rmd即可)。

2. 写内容。支持自定义css文件。

---
title: "viridis"
author: "BioMooc"
date: '2023-01-19'
output: 
  html_document:
    toc: true
    css: static/css/weixin.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r echo=FALSE, eval=FALSE}
if(0){
  setwd("/data/wangjl/rmarkdown_demo/bookdown-demo-main/backup")
  # 渲染语句
  rmarkdown::render("viridis.Rmd")
}
```


# 1.生成颜色实例: 参数类似

```{r fig.width=5, fig.height=0.5}
library(viridis)

#c1=viridis(n=10, alpha = 1, begin = 0, end = 1, direction = 1); c1
c1=viridis(10, alpha = 1, begin = 0, end = 1, direction = 1, option = "A"); c1
# n 色彩个数
# alpha 不透明度，1表示完全不透明，0表示完全透明。
# begin 和 end 表示 hue 开始和结束的位置[0,1]
# direction 颜色方向，1从深到浅，-1反之。
# option 表示8种颜色方案，可以换前面的函数名，也可以换 option 种的字母A-H

par(mar=c(0,0,0,0))
barplot( rep(1, length(c1)), col=c1, 
         border=NA, yaxt='n', space=0, main="")
```

3. 渲染生成网页报告。在Rstudio中执行R代码: rmarkdown::render("viridis.Rmd");

4. 预览效果。找到生成的 viridis.html 文件，点击新窗口，不满意了修改，再渲染，再预览，直到满意。

5. 上传到 Rpubs.com 上。点击 viridis.Rmd 文件右上角的 “Publish” 图标，登录（如果没有注册，就注册一个），输入URL使用的文件名，确定上传。

预览地址: https://rpubs.com/dawnEve/viridis

Rmarkdown 语法简介

是否执行: Rmd 中的代码可以执行/不执行(eval=FALSE)，可以显示/不显示(echo = TRUE)。

传递参数: Rmd可以设置顶部变量；可以传入渲染时参数，会覆盖顶部变量。

绘图参数: 每个chunk都会恢复par()默认绘图参数。所以上一个代码块设置的边界，到下一个代码块就失效了，不用默认值就需要重新设置。

# 常见问题
(1) Rmd中设置不执行的R语句

仅仅是注释，不执行的R语句

```{r eval=FALSE}
install.packages("bookdown")
# or the development version
# devtools::install_github("rstudio/bookdown")

# how to render this file?
# rmarkdown::render('input.Rmd', params = list(state = state))
```




(2) 传递参数
# 基本用法
# R 中启动渲染
for (state in state.name) {
  rmarkdown::render('input.Rmd', params = list(state = state))
}


# Rmd中
---
title: "A report for `r params$state`"
output: html_document
params:
  state: Nebraska
  year: 2019
  midwest: true
---

注意: 每个参数的:后面必须有一个空格，否则报错。


The area of `r params$state` is
`r state.area[state.name == params$state]`
square miles.

优先级上，在 render() 函数中设置的优先级更高，会覆盖 Rmd中yaml的params中的设置。




(3) 修改工作目录
knitr::opts_knit$set(root.dir = "/data/jinwf/chengm/rawdata/neutrophils/scRNA/")




(4) 开始渲染
> rmarkdown::render(
  "./202202newTag/script/b03_flow_Seurat2scVelo.Rmd",
  params = list(
    seurat_path = "./202202newTag/solo/APC1.Seurat.Rds",
    outputRoot = "/data/jinwf/chengm/rawdata/neutrophils/scRNA/202202newTag/solo/APC1/",
    sampleName="APC1",
    keyword=""
    ),
    output_dir="./202202newTag/solo/APC1/", output_file="01.html")

参考资料

Rmarkdown 也支持其他语言 https://bookdown.org/yihui/rmarkdown/language-engines.html#python