I AOF
II AOFRW

III Problems with AOFRW
1 Memory overhead
aof_pending_rewrite:0
aof_buffer_length:35500
aof_rewrite_buffer_length:34000
aof_pending_bio_fsync:0
3351:M 25 Jan 2022 09:55:39.655 * Background append only file rewriting started by pid 6817
3351:M 25 Jan 2022 09:57:51.864 * AOF rewrite child asks to stop sending diffs.
6817:C 25 Jan 2022 09:57:51.864 * Parent agreed to stop sending diffs. Finalizing AOF...
6817:C 25 Jan 2022 09:57:51.864 * Concatenating 2135.60 MB of AOF diff received from parent.
3351:M 25 Jan 2022 09:57:56.545 * Background AOF buffer size: 100 MB
2 CPU overhead
- During AOFRW, the main process spends CPU time writing data into aof_rewrite_buf, and uses the event loop to send the data in aof_rewrite_buf to the child process:
/* Append data to the AOF rewrite buffer, allocating new blocks if needed. */
void aofRewriteBufferAppend(unsigned char *s, unsigned long len) {
    // Other details omitted here...
    /* Install a file event to send data to the rewrite child if there is
     * not one already. */
    if (!server.aof_stop_sending_diff &&
        aeGetFileEvents(server.el, server.aof_pipe_write_data_to_child) == 0)
    {
        aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child,
            AE_WRITABLE, aofChildWriteDiffData, NULL);
    }
    // Other details omitted here...
}
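- In the late stage of the rewrite, the child process reads the incremental data sent by the main process from the pipe in a loop and appends it to the temporary AOF file: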
int rewriteAppendOnlyFile(char *filename) {
    // Other details omitted here...
    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while (mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    // Other details omitted here...
}
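- After the child process finishes the rewrite, the main process does the cleanup work in backgroundRewriteDoneHandler. One of its tasks is to write the data in aof_rewrite_buf that was not consumed during the rewrite into the temporary AOF file. If a lot of data is left in aof_rewrite_buf, this step also costs CPU time.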
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    // Other details omitted here...
    /* Flush the differences accumulated by the parent to the rewritten AOF. */
    if (aofRewriteBufferWrite(newfd) == -1) {
        serverLog(LL_WARNING,
            "Error trying to flush the parent diff to the rewritten AOF: %s", strerror(errno));
        close(newfd);
        goto cleanup;
    }
    // Other details omitted here...
}
3 Disk I/O overhead
4 Code complexity
/* AOF pipes used to communicate between parent and child during rewrite. */
int aof_pipe_write_data_to_child;
int aof_pipe_read_data_from_parent;
int aof_pipe_write_ack_to_parent;
int aof_pipe_read_ack_from_child;
int aof_pipe_write_ack_to_child;
int aof_pipe_read_ack_from_parent;
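IV MP-AOF implementation
1 Design overview
- BASE: the base AOF. It is usually produced by a child process through a rewrite, and there is at most one such file.
- INCR: an incremental AOF. It is usually created when an AOFRW starts, and there may be several such files.
- HISTORY: a historical AOF. It is derived from BASE and INCR AOFs: each time an AOFRW completes successfully, the BASE and INCR AOFs that existed before that AOFRW become HISTORY. Redis deletes HISTORY AOFs automatically.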
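The type transition described above can be sketched in a few lines of C. This is only an illustration: the one-character enum values follow the `type b` / `type i` / `type h` fields that appear in the manifest examples later in this article, while file_entry, aof_file_type_name and mark_rewritten_files_as_history are hypothetical names invented for the sketch, not Redis functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* AOF file types; the values mirror the manifest's `type` field. */
typedef enum {
    AOF_FILE_TYPE_BASE = 'b',
    AOF_FILE_TYPE_HIST = 'h',
    AOF_FILE_TYPE_INCR = 'i'
} aof_file_type;

/* Hypothetical record: one tracked file plus a flag telling whether it
 * was opened when the current AOFRW started. */
typedef struct {
    aof_file_type type;
    int opened_by_current_rewrite;
} file_entry;

/* Hypothetical helper: readable name for a file type. */
static const char *aof_file_type_name(aof_file_type t) {
    switch (t) {
    case AOF_FILE_TYPE_BASE: return "BASE";
    case AOF_FILE_TYPE_INCR: return "INCR";
    case AOF_FILE_TYPE_HIST: return "HISTORY";
    }
    return "UNKNOWN";
}

/* After a successful AOFRW, every BASE/INCR file that predates the
 * rewrite becomes HISTORY; the INCR opened for this rewrite stays INCR. */
static void mark_rewritten_files_as_history(file_entry *files, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!files[i].opened_by_current_rewrite)
            files[i].type = AOF_FILE_TYPE_HIST;
    }
}
```

Given an old BASE, an old INCR and the INCR opened for the current rewrite, the helper turns the first two into HISTORY and leaves the third untouched.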

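2 Key implementation details
Manifest
1) In-memory representation
- aofInfo: information about a single AOF file; currently it holds only the file name, the file sequence number and the file type
- base_aof_info: the BASE AOF information; this field is NULL when there is no BASE AOF
- incr_aof_list: holds the information of all INCR AOF files, ordered by the order in which the files were opened
- history_aof_list: holds the HISTORY AOF information; every element in history_aof_list was moved over from base_aof_info or incr_aof_list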
typedef struct {
    sds           file_name;  /* file name */
    long long     file_seq;   /* file sequence */
    aof_file_type file_type;  /* file type */
} aofInfo;
typedef struct {
    aofInfo     *base_aof_info;       /* BASE file information. NULL if there is no BASE file. */
    list        *incr_aof_list;       /* INCR AOFs list. We may have multiple INCR AOF when rewrite fails. */
    list        *history_aof_list;    /* HISTORY AOF list. When an AOFRW succeeds, the aofInfo contained in
                                         `base_aof_info` and `incr_aof_list` will be moved to this list. We
                                         will delete these AOF files when the AOFRW finishes. */
    long long   curr_base_file_seq;   /* The sequence number used by the current BASE file. */
    long long   curr_incr_file_seq;   /* The sequence number used by the current INCR file. */
    int         dirty;                /* 1 indicates that the aofManifest in memory is inconsistent with
                                         disk, we need to persist it immediately. */
} aofManifest;
struct redisServer {
    // Other details omitted here...
    aofManifest *aof_manifest;       /* Used to track AOFs. */
    // Other details omitted here...
};
2) On-disk representation
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

file appendonly.aof.1.base.rdb seq 1 type b newkey newvalue
file appendonly.aof.1.incr.aof type i seq 1
# this is annotations
seq 2 type i file appendonly.aof.2.incr.aof
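The manifest examples above suggest a line-oriented format: each line is a sequence of key/value pairs in any order, extra pairs such as `newkey newvalue` are tolerated, and lines starting with `#` are annotations. A minimal parsing sketch under those assumptions follows. The names manifest_entry and parse_manifest_line are hypothetical, and the sketch assumes file names contain no spaces; it is not the parser Redis itself uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of one manifest line. */
typedef struct {
    char file_name[256];
    long long file_seq;
    char file_type;            /* 'b', 'i' or 'h' */
} manifest_entry;

/* Returns 1 when an entry was parsed, 0 for a blank or annotation line,
 * and -1 when the line is malformed. */
static int parse_manifest_line(const char *line, manifest_entry *out) {
    char buf[1024];
    size_t len = strlen(line);
    if (len >= sizeof(buf)) return -1;
    memcpy(buf, line, len + 1);          /* strtok modifies its input */

    char *tokens[32];
    int ntok = 0;
    for (char *tok = strtok(buf, " \t\r\n"); tok != NULL;
         tok = strtok(NULL, " \t\r\n")) {
        if (ntok == 32) return -1;
        tokens[ntok++] = tok;
    }
    if (ntok == 0 || tokens[0][0] == '#') return 0;
    if (ntok % 2 != 0) return -1;        /* keys and values come in pairs */

    int have_file = 0, have_seq = 0, have_type = 0;
    for (int i = 0; i < ntok; i += 2) {
        const char *key = tokens[i], *val = tokens[i + 1];
        if (strcmp(key, "file") == 0) {
            if (strlen(val) >= sizeof(out->file_name)) return -1;
            strcpy(out->file_name, val);
            have_file = 1;
        } else if (strcmp(key, "seq") == 0) {
            out->file_seq = atoll(val);
            have_seq = 1;
        } else if (strcmp(key, "type") == 0) {
            out->file_type = val[0];
            have_type = 1;
        }
        /* Unknown keys are skipped, which keeps the format extensible. */
    }
    return (have_file && have_seq && have_type) ? 1 : -1;
}
```

Matching on keys rather than on positions is what lets a line such as `seq 2 type i file appendonly.aof.2.incr.aof` parse to the same entry as its reordered form.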
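File naming rules
- seq is the sequence number of the file. It increases monotonically starting from 1, and BASE and INCR files have independent sequence numbers
- type is the type of the AOF, indicating whether this AOF file is a BASE or an INCR
- format indicates the encoding used inside this AOF. Because Redis supports the RDB preamble mechanism, a BASE AOF may be encoded in either RDB format or AOF format:
appendonly.aof.1.base.rdb // RDB preamble enabled
appendonly.aof.1.base.aof // RDB preamble disabled
appendonly.aof.1.incr.aof
appendonly.aof.2.incr.aof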
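Judging from the examples above, a file name is assembled as basename.seq.type.format. The following sketch builds such names; make_aof_file_name and its parameters are hypothetical and only reproduce the pattern visible in the examples, they are not the helper Redis uses internally.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<basename>.<seq>.<type>.<format>" into buf.
 * Returns 0 on success, -1 if the name does not fit in buf. */
static int make_aof_file_name(char *buf, size_t buflen, const char *basename,
                              long long seq, int is_base, int use_rdb_preamble) {
    const char *type = is_base ? "base" : "incr";
    /* Only a BASE file can be RDB-encoded; INCR files are always AOF. */
    const char *format = (is_base && use_rdb_preamble) ? "rdb" : "aof";
    int n = snprintf(buf, buflen, "%s.%lld.%s.%s", basename, seq, type, format);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Calling it with basename "appendonly.aof", seq 1, is_base 1 and use_rdb_preamble 1 yields appendonly.aof.1.base.rdb, matching the first example above.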
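Compatibility with upgrades from older versions
- If the appenddirname directory does not exist
- Or the appenddirname directory exists, but it contains no corresponding manifest file
- If the appenddirname directory exists and contains a manifest file, the manifest holds only BASE AOF information, the name of that BASE AOF is the same as server.aof_filename, and no file named server.aof_filename exists in the appenddirname directory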
/* Load the AOF files according to the aofManifest pointed to by am. */
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted here...
    /* If the 'server.aof_filename' file exists in dir, we may be starting
     * from an old redis version. We will enter upgrade mode in three situations.
     *
     * 1. If the 'server.aof_dirname' directory does not exist
     * 2. If the 'server.aof_dirname' directory exists but the manifest file is missing
     * 3. If the 'server.aof_dirname' directory exists and the manifest file it contains
     *    has only one base AOF record, and the file name of this base AOF is 'server.aof_filename',
     *    and the 'server.aof_filename' file does not exist in the 'server.aof_dirname' directory
     * */
    if (fileExist(server.aof_filename)) {
        if (!dirExists(server.aof_dirname) ||
            (am->base_aof_info == NULL && listLength(am->incr_aof_list) == 0) ||
            (am->base_aof_info != NULL && listLength(am->incr_aof_list) == 0 &&
             !strcmp(am->base_aof_info->file_name, server.aof_filename) && !aofFileExist(server.aof_filename)))
        {
            aofUpgradePrepare(am);
        }
    }
    // Other details omitted here...
}
- Use server.aof_filename as the file name to construct the BASE AOF information
- Persist that BASE AOF information to the manifest file
- Use rename to move the old AOF file into the appenddirname directory
void aofUpgradePrepare(aofManifest *am) {
    // Other details omitted here...
    /* 1. Manually construct a BASE type aofInfo and add it to aofManifest. */
    if (am->base_aof_info) aofInfoFree(am->base_aof_info);
    aofInfo *ai = aofInfoCreate();
    ai->file_name = sdsnew(server.aof_filename);
    ai->file_seq = 1;
    ai->file_type = AOF_FILE_TYPE_BASE;
    am->base_aof_info = ai;
    am->curr_base_file_seq = 1;
    am->dirty = 1;
    /* 2. Persist the manifest file to AOF directory. */
    if (persistAofManifest(am) != C_OK) {
        exit(1);
    }
    /* 3. Move the old AOF file to AOF directory. */
    sds aof_filepath = makePath(server.aof_dirname, server.aof_filename);
    if (rename(server.aof_filename, aof_filepath) == -1) {
        sdsfree(aof_filepath);
        exit(1);
    }
    // Other details omitted here...
}
Multi-file loading and progress calculation
int loadAppendOnlyFiles(aofManifest *am) {
    // Other details omitted here...
    /* Here we calculate the total size of all BASE and INCR files in
     * advance, it will be set to `server.loading_total_bytes`. */
    total_size = getBaseAndIncrAppendOnlyFilesSize(am);
    startLoading(total_size, RDBFLAGS_AOF_PREAMBLE, 0);
    /* Load BASE AOF if needed. */
    if (am->base_aof_info) {
        aof_name = (char*)am->base_aof_info->file_name;
        updateLoadingFileName(aof_name);
        loadSingleAppendOnlyFile(aof_name);
    }
    /* Load INCR AOFs if needed. */
    if (listLength(am->incr_aof_list)) {
        listNode *ln;
        listIter li;
        listRewind(am->incr_aof_list, &li);
        while ((ln = listNext(&li)) != NULL) {
            aofInfo *ai = (aofInfo*)ln->value;
            aof_name = (char*)ai->file_name;
            updateLoadingFileName(aof_name);
            loadSingleAppendOnlyFile(aof_name);
        }
    }
    server.aof_current_size = total_size;
    server.aof_rewrite_base_size = server.aof_current_size;
    server.aof_fsync_offset = server.aof_current_size;
    stopLoading();
    // Other details omitted here...
}
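Because the total size of the BASE and INCR files is computed before loading starts, progress across several files can be expressed as a fraction of that single total. The sketch below illustrates the arithmetic only. total_aof_size and loading_percent are hypothetical helpers that take plain arrays of file sizes; Redis tracks loading progress through its own server fields rather than through functions like these.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the sizes of all files that will be loaded (BASE first, then
 * the INCR files in order). */
static long long total_aof_size(const long long *file_sizes, size_t n) {
    long long total = 0;
    for (size_t i = 0; i < n; i++) total += file_sizes[i];
    return total;
}

/* Overall progress in percent, given how many files are fully loaded
 * and the read offset inside the file currently being loaded. */
static int loading_percent(const long long *file_sizes, size_t n,
                           size_t files_done, long long offset_in_current) {
    long long total = total_aof_size(file_sizes, n);
    if (total == 0) return 100;          /* nothing to load */
    long long loaded = offset_in_current;
    for (size_t i = 0; i < files_done && i < n; i++) loaded += file_sizes[i];
    return (int)(loaded * 100 / total);
}
```

With files of 600, 300 and 100 bytes, finishing the first file and reading 150 bytes of the second gives 750 of 1000 bytes, that is 75 percent.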
AOFRW Crash Safety
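- The name of a BASE AOF contains the file sequence number, which guarantees that a newly created BASE AOF never conflicts with a previous BASE AOF;
- The rename of the AOF is performed first, and the manifest file is modified afterwards;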
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i

file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.1.base.rdb seq 1 type h
file appendonly.aof.1.incr.aof seq 1 type h
file appendonly.aof.2.incr.aof seq 2 type i
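- Before modifying the in-memory server.aof_manifest, first dup a temporary manifest structure; all subsequent modifications are made to this temporary manifest. The benefit is that if a later step fails, we can simply destroy the temporary manifest to roll back the whole operation, without polluting the global server.aof_manifest data structure;
- Get the new BASE AOF file name (denoted new_base_filename) from the temporary manifest, and mark the previous BASE AOF (if any) as HISTORY;
- Rename the temporary file temp-rewriteaof-bg-pid.aof produced by the child process to new_base_filename;
- Mark all of the previous INCR AOFs in the temporary manifest structure as HISTORY;
- Persist the information in the temporary manifest to disk (persistAofManifest guarantees internally that the modification of the manifest itself is atomic);
- If all of the steps above succeed, we can safely point the in-memory server.aof_manifest pointer at the temporary manifest structure (and free the previous manifest structure); at this point the whole change becomes visible to Redis;
- Clean up the HISTORY AOFs. This step is allowed to fail, because it cannot cause data consistency problems.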
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    snprintf(tmpfile, 256, "temp-rewriteaof-bg-%d.aof",
        (int)server.child_pid);
    /* 1. Dup a temporary aof_manifest for subsequent modifications. */
    temp_am = aofManifestDup(server.aof_manifest);
    /* 2. Get a new BASE file name and mark the previous (if we have)
     * as the HISTORY type. */
    new_base_filename = getNewBaseFileNameAndMarkPreAsHistory(temp_am);
    /* 3. Rename the temporary aof file to 'new_base_filename'. */
    if (rename(tmpfile, new_base_filename) == -1) {
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 4. Change the AOF file type in 'incr_aof_list' from AOF_FILE_TYPE_INCR
     * to AOF_FILE_TYPE_HIST, and move them to the 'history_aof_list'. */
    markRewrittenIncrAofAsHistory(temp_am);
    /* 5. Persist our modifications. */
    if (persistAofManifest(temp_am) == C_ERR) {
        bg_unlink(new_base_filename);
        aofManifestFree(temp_am);
        goto cleanup;
    }
    /* 6. We can safely let `server.aof_manifest` point to 'temp_am' and free the previous one. */
    aofManifestFreeAndUpdate(temp_am);
    /* 7. We don't care about the return value of `aofDelHistoryFiles`, because the history
     * deletion failure will not cause any problems. */
    aofDelHistoryFiles();
}
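The steps above rely on the manifest itself being replaced atomically. One common way to get that property on POSIX systems is to write the new content to a temporary file, fsync it, and then rename it over the old file. The sketch below shows that generic technique only; write_file_atomically and the ".tmp" suffix are assumptions made for illustration, and the article does not state that persistAofManifest is implemented exactly this way.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `content` so that a reader
 * (or a restart after a crash) sees either the old file or the new one,
 * never a half-written mix. Returns 0 on success, -1 on failure. */
static int write_file_atomically(const char *path, const char *content) {
    char tmp_path[512];
    int n = snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp_path)) return -1;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    size_t len = strlen(content), written = 0;
    while (written < len) {
        ssize_t w = write(fd, content + written, len - written);
        if (w <= 0) { close(fd); unlink(tmp_path); return -1; }
        written += (size_t)w;
    }
    /* Make sure the data is on disk before it becomes visible. */
    if (fsync(fd) == -1) { close(fd); unlink(tmp_path); return -1; }
    if (close(fd) == -1) { unlink(tmp_path); return -1; }

    /* rename() atomically replaces the destination on POSIX systems. */
    if (rename(tmp_path, path) == -1) { unlink(tmp_path); return -1; }
    return 0;
}
```

If the process dies before the rename, the old manifest is still intact and only a stray temporary file is left behind, which fits the rollback-friendly ordering of the steps listed above.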
Support for AOF truncate
if (ftruncate(server.aof_fd, server.aof_last_incr_size) == -1) {
    // Other details omitted here...
}
AOFRW rate limiting
if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size &&
    !aofRewriteLimited())
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        rewriteAppendOnlyFileBackground();
    }
}
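The growth check in the snippet above is plain integer arithmetic, and a worked example makes the threshold easier to reason about. The sketch below isolates it; aof_growth_percent and should_rewrite are hypothetical names, and the sketch deliberately leaves out the aof_state, active-child and aofRewriteLimited() conditions.

```c
#include <assert.h>

/* Growth of the AOF relative to its size after the last rewrite, in
 * percent. A base of 0 is treated as 1 to avoid dividing by zero. */
static long long aof_growth_percent(long long current_size, long long base_size) {
    long long base = base_size ? base_size : 1;
    return (current_size * 100 / base) - 100;
}

/* Size-related part of the trigger condition only. */
static int should_rewrite(long long current_size, long long base_size,
                          long long min_size, int rewrite_perc) {
    if (rewrite_perc == 0) return 0;            /* automatic rewrite disabled */
    if (current_size <= min_size) return 0;     /* file still too small */
    return aof_growth_percent(current_size, base_size) >= rewrite_perc;
}
```

For example, with a base size of 100 MB and a current size of 250 MB the growth is 250*100/100 - 100 = 150, so a threshold of 100 triggers a rewrite, while a current size of 150 MB (growth 50) does not.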
V Summary