Nội dung kịch bản¶
Đầu vào monitor:
- Nic của host (qua TICK, prometheus, zabbix,..)
- Kết nối đến máy ảo. (vd ping đến máy đó)
Kịch bản: xuất hiện các alarm
-
Alarm 1: nic host mất kết nối : nic_operstate {host= compute01, interface = ens8}= 0
-
Alarm 2: Máy ảo mất kết nối (có thể do vitrage tự deduce ra, hoặc monitor tool phát hiện)
Yêu cầu:
-
Collect được alarm khi xảy ra
-
Vitrage nhân định Alarm 1 => Alarm 2
-
Vitrage gọi đến mistral thực hiện seft healing
Thực hiện¶
Chuẩn bị monitor các thành phần:¶
Ở đây cần nic của host. Lựa chọn TICK lấy alarm.
- Đầu tiên cần thiết lập cho telegraf lấy được metric host nic down: ta tạo file
/usr/share/telegraf_check_nic.sh
#!/bin/sh
#nics=`find /sys/class/net ! -type d | xargs --max-args=2 realpath | awk -F\/ '/pci/{print $NF}'`
nics=`ip link | grep "up" | awk -F": " {'print $2'} | grep ^e`
hostname=`hostname`
for nic in $nics
do
operstate=`cat /sys/class/net/$nic/operstate`
if [ "$operstate" = "up" ]
then
field=1
else
field=0
fi
echo "nic,host=${hostname},interface=${nic} operstate=${field}"
done
sửa trong file config của telegraf:
/etc/telegraf/telegraf.conf
[[inputs.exec]]
## Commands array
commands = [
"/usr/share/telegraf_check_nic.sh",
]
timeout = "5s"
data_format = "influx"
Như vậy , telegraf sẽ liên tục gửi về cho influxdb metric định dạng:
- nic,host=compute01,interface=ens8 operstate= 0 nếu nic ens8 down
-
nic,host=compute01,interface=ens8 operstate= 1 nếu nic ens8 up
-
Tiếp theo cần thiết lập cảnh báo kapacitor: Tạo 1 tick script:
/etc/kapacitor/tick_script_nicdown.txt
dbrp "telegraf"."autogen"
var db = 'telegraf'
var rp = 'autogen'
var measurement = 'nic'
var groupBy = ['host', 'interface']
var whereFilter = lambda: TRUE
var name = 'nic down'
var idVar = name + '-{{.Group}}'
var message = ' {{ index .Tags "interface" }} down'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'
var crit = 1
var data = stream
|from()
.database(db)
.retentionPolicy(rp)
.measurement(measurement)
.groupBy(groupBy)
.where(whereFilter)
|eval(lambda: "operstate")
.as('value')
var trigger = data
|alert()
.crit(lambda: "value" < crit)
.message(message)
.id(idVar)
.idTag(idTag)
.levelTag(levelTag)
.messageField(messageField)
.durationField(durationField)
.stateChangesOnly()
.exec('/usr/bin/python', '/etc/kapacitor/kapacitor_vitrage.py', 'rabbit://openstack:Welcome123@192.168.77.51')
trigger
|httpOut('output')
Import tick_script trên cho kapacitor:
kapacitor define check_nic_down -tick /etc/kapacitor/tick_script_nicdown.txt
kapacitor endnable check_nic_down
- Thực hiện cấu hình cho kapacitor trả về alarm được cho vitrage: https://docs.openstack.org/vitrage/latest/contributor/kapacitor-datasource.html
- Kết quả sau khi đặt monitor nic xong: thử ip link down ens8:
Cấu hình vitrage¶
- Thêm kịch bản để khi có alarm nic của host down sẽ xác định cho host nào bị ảnh hưởng.
scenario2-1.yaml
metadata:
version: 2
name: kapacitor host nic alarm to nic alarm
type: standard
description: kapacitor host nic alarm to nic alarm
definitions:
entities:
- entity:
category: ALARM
name.regex: (.ens8) down
template_id: host_alarm
- entity:
category: RESOURCE
type: host.nic
template_id: nic
- entity:
category: RESOURCE
type: nova.host
template_id: host
relationships:
- relationship:
source: host_alarm
target: host
relationship_type: on
template_id : alarm_on_host
- relationship:
source: nic
target: host
relationship_type: attach
template_id : nic_attach_host
scenarios:
- scenario:
condition: alarm_on_host and nic_attach_host
actions:
- action:
action_type : raise_alarm
properties:
alarm_name: down
severity: critical
action_target:
target: nic
scenario2-2.yaml
metadata:
version: 2
name: nic down scenario
type: standard
description: nic down scenario
definitions:
entities:
- entity:
category: ALARM
name: down
template_id: nic_alarm
- entity:
category: RESOURCE
type: neutron.network
template_id: net
- entity:
category: RESOURCE
type: host.nic
template_id: nic
- entity:
category: RESOURCE
type: neutron.port
template_id: port
- entity:
category: RESOURCE
type: nova.instance
template_id: instance
- entity:
category: RESOURCE
type: nova.host
template_id: host
- entity:
category: ALARM
name: disconnect
template_id: instance_alarm
relationships:
- relationship:
source: nic_alarm
target: nic
relationship_type: on
template_id : alarm_on_nic
- relationship:
source: nic
target: net
relationship_type: provide
template_id : nic_provide_net
- relationship:
source: net
target: port
relationship_type: contains
template_id : net_contains_port
- relationship:
source: port
target: instance
relationship_type: attached
template_id : port_attached_instance
- relationship:
source: host
target: instance
relationship_type: contains
template_id : host_contains_instance
- relationship:
source: nic
target: host
relationship_type: attach
template_id : nic_attach_host
- relationship:
source: instance_alarm
target: instance
relationship_type: on
template_id : alarm_on_instance
scenarios:
- scenario:
condition: alarm_on_nic and nic_provide_net and net_contains_port and port_attached_instance and host_contains_instance and nic_attach_host
actions:
- action:
action_type : raise_alarm
properties:
alarm_name: disconnect
severity: critical
action_target:
target: instance
- action:
action_type : execute_mistral
properties:
workflow: instance_live_migrate
input:
instance_id: get_attr(instance, id)
- scenario:
condition: alarm_on_nic and host_contains_instance and nic_attach_host and alarm_on_instance
actions:
- action:
action_type : add_causal_relationship
action_target:
source: nic_alarm
target: instance_alarm
- Chạy lệnh:
$ vitrage template validate --type standard --path /etc/vitrage/templates/scenario2-1.yaml
$ vitrage template add --type standard --path /etc/vitrage/templates/scenario2-1.yaml
$ vitrage template validate --type standard --path /etc/vitrage/templates/scenario2-2.yaml
$ vitrage template add --type standard --path /etc/vitrage/templates/scenario2-2.yaml
Cấu hình mistral¶
Thêm workflow instance_live_migrate sau:
workflow_live_migrate.yaml
version: '2.0'
instance_live_migrate:
type: direct
input:
- instance_id
- notifi_to_email: admin-openstack@yopmail.com
tasks:
get_instance_status_before:
action: nova.servers_get server=<% $.instance_id %>
publish:
instance_name: <% task(get_instance_status_before).result.name %>
status_before: <% task(get_instance_status_before).result.status %>
host_before: <% task(get_instance_status_before).result["OS-EXT-SRV-ATTR:host"] %>
on-success: live_migrate_instance
on-error: send_error_email
live_migrate_instance:
action: nova.servers_live_migrate
input:
server: <% $.instance_id %>
block_migration: False
disk_over_commit: False
host:
retry:
count: 10
delay: 30
on-success: wait_for_instance_migrate
on-error: send_error_email
wait_for_instance_migrate:
action: nova.servers_get server=<% $.instance_id %>
retry:
count: 30
delay: 2
continue-on: <% task(wait_for_instance_migrate).result.status !="MIGRATING" %>
on-success: wait_instance_status_active
wait_instance_status_active:
action: nova.servers_get server=<% $.instance_id %>
retry:
delay: 10
count: 30
continue-on: <% task(wait_instance_status_active).result.status !="ACTIVE" %>
on-success: get_instance_status_after
get_instance_status_after:
action: nova.servers_get server=<% $.instance_id %>
publish:
status_after: <% task(get_instance_status_after).result.status %>
host_after: <% task(get_instance_status_after).result["OS-EXT-SRV-ATTR:host"] %>
on-complete: check_diffrent_host
check_diffrent_host:
action: std.noop
on-complete:
- send_error_email: <% $.status_before != $.status_after or $.host_before = $.host_after %>
send_error_email:
action: std.email
input:
to_addrs: [<% $.notifi_to_email %>]
subject: ERROR live migrate virtual machine
body: |
We try to migrate vm <% $.instance_id %> when host have problem
Please look at mistral workflow <% execution().id %> for more detail
from_addr: ngohoa211@gmail.com
smtp_server: smtp.gmail.com
smtp_password: secret
Kết quả¶
- Khi tắt card mạng, đầu tiên, sẽ có vitrage deduce alarm trên máy báo hiệu máy ảo mất kết nối

- Sau đó, hệ thống tự động chuyển máy ảo sang host khác

