写在前面

本文章为笔者原创，转载需要表明出处，联系作者：luckydreamcatcher@163.com | the.matrix.vvv@gmail.com

笔者最近在做golang重构旧php模块的事情，PHP模块峰值请求约1.5w QPS，是典型的高并发场景，重构过程中，代码中一些容易被开发者**”选择性忽略”的问题会被指数级放大，比如内存泄露、full GC等等，所以上线/放量**前必须进行压力测试。

为了更贴近生产环境，与QA同学合作，将重构后的golang模块部署到生产集群，选择一台20标准CPU核的实例，从服务列表中摘除，做压力测试。预期单实例的配置，可以扛住400QPS，在压力测试过程中发现，并发达到400QPS时，实例CPU使用率达到100%，成为性能瓶颈。

节约机器资源作为golang重构旧php的重要收益之一，为了达成此目标，笔者必须解决golang模块的cpu性能瓶颈，达到预期性能。

CPU性能分析

CPU性能分析，又称为CPU Profiling，下面介绍了三种笔者常用的性能分析手段：

go tool pprof命令行工具
go tool pprof可视化工具
FlameGraph火焰图

go tool pprof工具非常强大，性能分析不止这三中方式，可根据业务场景自由选择。实践中，笔者推荐使用go tool pprof与Flame Graph两种方式相结合。

为了在不影响阅读的前提下，保证服务安全，文章的敏感信息，笔者均用”xxx”进行了替换。

CPU Profiling原理

借助工具进行CPU Profiling之前，我们需要了解CPU Profiling的基本原理，这样才可以对数据做出更准确的判断。然而许许多多Google搜索到的技术博客，几乎千篇一律的都是在介绍golang pprof工具的使用。笔者在阅读golang runtime/pprof源码的基础上，借鉴了Linux perf工具的工作原理，说明一下。

Golang pprof默认会以100Hz（1秒100次）的频率，采集各个goroutine调用栈。假设函数foo在采样时，位于调用栈栈顶，则可以认为当前goroutine在执行foo函数，假如100次采样，foo函数30次位于调用栈栈顶，则可以认为foo函数执行消耗30%。了解了基本原理，下面我们便可以借助工具进行分析。

Golang模块开启Profiling

Golang官方提供强大的runtime/pprof包，用于Golang程序的Profiling。runtime/pprof包功能强大，但对于需长久运行的服务，不够方便。在生产环境中，建议开启http pprof，通过Web服务提供Profiling数据，方便直接使用浏览器查看或其它分析工具拉取数据进行进一步分析。

1
2
3

import (
  _ "net/http/pprof"
)

net/http/pprof包init初始化函数会在默认HTTP Server注册几个路由，将runtime/pprof的输出包装为http服务的响应，逻辑比较简单，可以参考阅读net/http/pprof包源码，此处不做赘述。

go tool pprof命令行工具

采用Golang自带的pprof命令行工具，进行CPU性能分析：

1	go tool pprof http://xxxx.baidu.com:2021/debug/pprof/profile?seconds=120

go tool pprof会将服务端http响应数据写入本地文件（本地文件默认存储/root/pprof目录下，输入go tool pprof 即可分析本地文件），运行2min之后，自动进入交互式命令行，使用top命令即可查看CPU耗时排行：

(pprof) top
Showing nodes accounting for 18.09s, 35.20% of 51.39s total
Dropped 856 nodes (cum <= 0.26s)
Showing top 10 nodes out of 246
      flat  flat%   sum%        cum   cum%
     3.20s  6.23%  6.23%      3.58s  6.97%  syscall.Syscall
     2.43s  4.73% 10.96%     10.65s 20.72%  runtime.mallocgc
     2.06s  4.01% 14.96%      2.10s  4.09%  encoding/json.stateInString
     2.02s  3.93% 18.89%      3.97s  7.73%  runtime.scanobject
     1.61s  3.13% 22.03%      4.76s  9.26%  encoding/json.checkValid
     1.43s  2.78% 24.81%      1.43s  2.78%  runtime.usleep
     1.39s  2.70% 27.52%      5.18s 10.08%  runtime.mapassign_faststr
     1.36s  2.65% 30.16%      1.67s  3.25%  encoding/json.unquoteBytes
     1.30s  2.53% 32.69%      1.45s  2.82%  net/url.unescape
     1.29s  2.51% 35.20%      1.86s  3.62%  encoding/json.(*decodeState).rescanLiteral

标注：

flat: 函数（不包含子函数）执行耗时；
flat%：函数执行耗时占抽样时间百分比；
sum%: 此行（包括）之前，flat%之和；
cum: 函数（包含调用的子函数）执行耗时；
cum%: 函数（包含调用的子函数）的执行耗时占抽样时间百分比；

top命令默认显示前10条数据，按照flat列降序排列。虽然定位了CPU耗时较高的函数，但是粒度较细，并不能直观反应产生性能瓶颈的代码。可以指定-cum参数，显示函数累加执行耗时排行，键入命令top20 -cum：

(pprof) top20 -cum
Showing nodes accounting for 0.61s, 1.19% of 51.39s total
Dropped 856 nodes (cum <= 0.26s)
Showing top 20 nodes out of 246
      flat  flat%   sum%        cum   cum%
     0.04s 0.078% 0.078%     41.26s 80.29%  net/http.(*conn).serve
     0.04s 0.078%  0.16%     38.25s 74.43%  github.com/gin-gonic/gin.(*Context).Next
     0.03s 0.058%  0.21%     38.24s 74.41%  icode.baidu.com/baidu/gdp/gdp.WebHandlerFunc.toGinHandlerFunc.func1
         0     0%  0.21%     38.17s 74.28%  github.com/gin-gonic/gin.(*Engine).handleHTTPRequest
         0     0%  0.21%     38.06s 74.06%  github.com/gin-gonic/gin.(*Engine).ServeHTTP
         0     0%  0.21%     37.96s 73.87%  net/http.serverHandler.ServeHTTP
     0.01s 0.019%  0.23%     36.86s 71.73%  github.com/gin-gonic/gin.RecoveryWithWriter.func1
         0     0%  0.23%     36.86s 71.73%  icode.baidu.com/baidu/gdp/gdp.ginHandler2WebHandler.func1
     0.07s  0.14%  0.37%     36.80s 71.61%  icode.baidu.com/baidu/gdp/gdp.ShoubaiTowerLogware
         0     0%  0.37%     32.46s 63.16%  icode.baidu.com/baidu/gdp/gdp.recovery
     0.01s 0.019%  0.39%     32.45s 63.14%  icode.baidu.com/baidu/xxxxxx/xxxxloc/middlewares.Recovery
     0.03s 0.058%  0.45%     32.41s 63.07%  icode.baidu.com/baidu/xxxxxx/xxxxloc/middlewares.PProfAuth
         0     0%  0.45%     31.95s 62.17%  icode.baidu.com/baidu/xxxxxx/xxxxloc/middlewares.ParseParams
     0.04s 0.078%  0.53%     30.18s 58.73%  icode.baidu.com/baidu/xxxxxx/xxxxloc/controllers.(*WeatherController).GetIndexWeather
     0.04s 0.078%   0.6%     27.44s 53.40%  icode.baidu.com/baidu/xxxxxx/xxxxloc/models/service/page/weather.(*WeatherIndexIphone).GetData
     0.01s 0.019%  0.62%     25.48s 49.58%  icode.baidu.com/baidu/xxxxxx/xxxxloc/models/service/data/weather.(*DataWeatherCommon).GetWeatherData
     0.02s 0.039%  0.66%     21.03s 40.92%  encoding/json.Unmarshal
         0     0%  0.66%     16.16s 31.45%  encoding/json.(*decodeState).unmarshal
     0.04s 0.078%  0.74%     16.16s 31.45%  encoding/json.(*decodeState).value
     0.23s  0.45%  1.19%     16.15s 31.43%  encoding/json.(*decodeState).object

由上可以观察到，系统执行流程大概包括了http框架、controller层、page层、data层，符合调用堆栈。encoding/json.Unmarshal函数累积执行耗时占总样本百分比为40.92%**，很明显不合理，是系统性能瓶颈**。go tool pprof命令行工具使用简单方便，无需要借助工具，但是表达不直观，我们可以借助下面提到的两种方式——以图或火焰图的形式。

go tool pprof可视化工具

go tool pprof命令行支持-png、-svg、-pdf等选项，输出png图片、svg图片、pdf文档。go tool pprof此功能依赖graphviz组件。

安装graphviz

graphviz组件依赖较多，建议解决各个linux发行版本的包管理器进行安装，源码安装参考官方：http://www.graphviz.org/download/source/。

# debian
apt-get install -y graphviz
# macos
brew install graphviz

导出图片

1	go tool pprof -png http://xxxx.baidu.com:2021/debug/pprof/profile?seconds=120 >> profile.png

火焰图

火焰图（FlameGraph）能直观的反映出系统的执行情况，是一种性能分析利器。Golang语言pprof工具暂不支持导出火焰图，需要安装第三方工具。笔者推荐使用由Uber开源的go-torch。

安装FlameGraph分析工具

安装go-torch：

1	go install github.com/uber/go-torch

安装go-torch依赖——FlameGraph脚本:

# 下载FlameGraph
wget https://github.com/brendangregg/FlameGraph/archive/master.zip
# 解压
unzip master.zip
# 移动至/opt目录
sudo mv FlameGraph-master /opt/FlameGraph
# 添加至系统Path中
echo 'export PATH=$PATH:/opt/FlameGraph' |sudo tee -a /etc/profile && source /etc/profile

go-torch工具安装成功，运行go torch --help查看帮助信息：

Usage:
  go-torch [options] [binary] <profile source>

pprof Options:
  -u, --url=         Base URL of your Go program (default: http://localhost:8080)
      --suffix=      URL path of pprof profile (default: /debug/pprof/profile)
  -b, --binaryinput= File path of previously saved binary profile. (binary profile is anything accepted by
                     https://golang.org/cmd/pprof)
      --binaryname=  File path of the binary that the binaryinput is for, used for pprof inputs
  -t, --seconds=     Number of seconds to profile for (default: 30)
      --pprofArgs=   Extra arguments for pprof

Output Options:
  -f, --file=        Output file name (must be .svg) (default: torch.svg)
  -p, --print        Print the generated svg to stdout instead of writing to file
  -r, --raw          Print the raw call graph output to stdout instead of creating a flame graph; use with Brendan Gregg's flame
                     graph perl script (see https://github.com/brendangregg/FlameGraph)
      --title=       Graph title to display in the output file (default: Flame Graph)
      --width=       Generated graph width (default: 1200)
      --hash         Colors are keyed by function name hash
      --colors=      set color palette. choices are: hot (default), mem, io, wakeup, chain, java, js, perl, red, green, blue,
                     aqua, yellow, purple, orange
      --cp           Use consistent palette (palette.map)
      --reverse      Generate stack-reversed flame graph
      --inverted     icicle graph

Help Options:
  -h, --help         Show this help message

通常情况下，我们只需要关注-u参数与-f参数即可，运行如下命令进行CPU采样，输出svg格式火焰图：

1	go-torch -u http://xxxx.baidu.com:2021/debug/pprof/profile?seconds=120

go-torch运行输出如下：

1 2	INFO[10:57:29] Run pprof command: go tool pprof -raw -seconds 30 http://xxxx.baidu.com:2021/debug/pprof/profile?seconds=120 INFO[10:58:00] Writing svg to torch.svg

FlameGrpah文件：torch.svg

FlameGraph-火焰图分析

许多人对火焰图的理解有歧义，有些似懂非懂，按照自己的主观意识去解读，导致陷入误区。要使用火焰图进行性能分析，首先需要明确火焰图x轴与y轴的确切含义。

y 轴表示调用栈，每一层都标识一个函数，调用栈越深，火焰就越高，顶部就是当前在执行的函数。

x 轴表示抽样数，如果一个函数在 x 轴占据的宽度越宽，表示它被抽到的次数多，执行的时间长（x 轴非时间轴，是所有的调用栈合并后，按函数字母顺序排列的）。因此，火焰图顶部，只要有”平顶”（plateaus），则表示该函数可能存在性能问题。

提示：svg格式，当移动鼠标至其中一栏时会显示”Tips”信息，包含采样数、占采样总数百分比等等信息，有关火焰图更详细的资料参考：cpu flame graph。

上图中，可以很明显观察到encoding/json.Unmarshal函数耗费了40%的CPU时间，是系统的性能瓶颈。定位了性能瓶颈之后，我们应当思考如何优化了。

性能优化

json反序列化成为系统性能瓶颈，可以说在情理之内，预期之外。业务角度，我们的Golang模块强依赖的下游服务，返回了一个大JSON（大约几十KB），且字段嵌套层级较深，json反序列化耗时是情理之内的，但是耗费了惊人的40%的CPU时间是预期之外的。

为了解决这个问题，有如下几条思路：

对下游返回JSON”瘦身”，未使用的字段不做解析；
使用LRU Cache，在内存中缓存已反序列化之后的Struct；
使用性能更高的开源json序列化方案；

如何进行性能调优，解决文章中的Case，笔者将会在新的文章中阐述思路，本文不做过多叙述。

联系作者

有更好的”性能调优”方式，也欢迎一块儿交流一下（邮箱：luckydreamcatcher@163.com，微信号：15210466756）；

参考资料

枫叶居

Golang性能分析