快速注册

设计数据密集型应用读书笔记9

dewleaf 2020-06-07 13:22:10

这一章节关注分布式系统可能出现的问题，以及在这些不确定下如何设计一个系统。

在单机模式下，如果有内部故障出现，我们倾向于机器完全crash，而不是返回错误的结果。因为错误的结果更难去处理；在多机模式下，情况就完全不同了。在分布式系统中，系统的任何一部分都有可能以不可预测的方式出现问题。这也被称作partial failure，不可预测性是其的主要困难点。系统越大，partial failure的出现频率就越高，如果直接crash再retry，那么很可能系统就无法handle request了。基于网络的应用是在线的（online），常常要求在任何时刻都有很短的latency，因而遇到故障就暂停系统，service unavailable是不可行的。

超计算和云计算：超计算运用超级计算能力的计算机来解决大量计算问题，其运作类似单机的模式，系统任一部分出现问题，就让整个系统crash掉；云计算运用大量的一般计算能力的计算机，其运行方式类似分布式系统。

分布式系统可能出现的问题：

there is no shared memory，only message passing via an unreliable network with variable delays，and the systems may suffer from partial failures，unreliable clocks，and processing pauses.

A distributed system cannot exclusively rely on a single node，because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is voting among the nodes: decisions require some minimum number of votes from several nodes in order to reduce the dependence on any one particular node.

1. 网络不可靠：分布式系统需要node之间发送消息来通信。发送者无法判断包是否deliver，唯一的方式是接收者发送一条response message，然而这条消息也可能丢失或延迟送达。通常处理这一问题的方式是设置timeout：一段时间后发送者停止等待，假定response不会到达，declare the node is dead. 现有的技术并不能对网络的时延和可靠性做任何的保证，我们必须假定网络拥塞，queueing，unbounded delay都可能发生。因而，并没有一个correct value for timeout，需要通过试验决定。

network partition：当网络的一部分由于网络故障与网络的其他部门割裂，这部分网络被称为network partition。

TCP vs UDP：对于视频，网络通话（UDP），如果有一个包丢失，通常没有足够的时间去重传。这种情况下，应用可以用空白帧来替代丢失的包的time slot，这样重传被转移到human layer（能重说一遍吗？）

synchronous network vs asynchronous network：传统的电话就是同步网络，一定带宽的数据通路是提前预定的，在整个通信过程中线路都是专用的，而不会分配给其他的用户。因而即使数据可能需要通过若干个router，其并不需要排队，因而网络端到端的最大时延是固定可预测的，也称作bounded delay. 与此相比，tcp连接是异步网络，通路不会提前预定，而是尝试在最短的时间内传送数据，当tcp 连接是idle，它不会占据任何带宽。tcp连接更适合突发的流量（bursty traffic），虽然会导致特定情况下数据queueing，但是其最大化了线路的利用率。

2. 时钟不可靠：在分布式系统中，每个node的时钟都不一致。可以在某种程度下同步不同node的时钟，最常用的机制是network time protocol（NTP），其允许电脑的时钟根据一组servers所报告的时间来进行调整，这些server从一个更准确的时间源来获取时间，比如GPS receiver。

如果应用需要synchronized clock，那么必须认真地监控不同机器的clock offset，一旦任何node与其他nodes的时钟偏移过大，那么需要将其declare为dead并移出cluster。

时钟不可靠的影响：

Last write wins：这个算法在复制write到别的replica时，会根据当前node的时钟tag上一个时间戳。使用最后的write作为concurrent write的winner，但是如果node的时钟并不in sync呢？可能会导致任意数量的data被无声地丢掉，却没有任何error报告给应用。

snapshot isolation：这个处理并发写操作的方法，需要给每个transaction assign 一个 transaction id，且后发生的transaction有更大的id。当db分布到多个机器时，甚至可能在不同的data center，一个全局的，单调递增的transaction id生成必须所有node协调。

3. 进程暂停

对于基于leader的replication，一个partition有一个leader，只有leader可以接受写操作。一般，leader是通过向其他的nodes获取一个lease来实现的，类似于有timeout的lock。其request handling loop类似：

while (true) {

request = getIncomingRequest();

// Ensure that the lease always has at least 10 seconds remaining

if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {

lease = lease.renew();

}

if (lease.isValid()) {

process(request);

}

首先，这个实现需要synchronized clock；其次，进程很有可能在某个点（比如isvalid后，process前）被暂停很长的时间。当其resume时很有可能lease已经expire了，从而导致问题出现。

为何thread会被暂停如此之久？

GC；（一种解决方案是将GCpause 看作planned outage of a node，当一个node在收集垃圾时，让其他的node来handle requests）

VM be suspended；

OS context switch；OS page swapping to disk；low disk IO；

单机模式下有很多tools确保线程安全，mutex，semaphore，blocking queue等；然而在分布式系统中，由于没有shared memory，所有信息只能通过message于不可靠的网络上传送，上面这些tool则不再适用。

Fencing token可以用来解决上述问题：every time the lock server grants a lock or lease，it also returns a fencing token，which is a number that increases every time a lock is granted. We can then require every time a client sends a write request to the storage service，it must include its current fencing token.

A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.

在这些不确定下如何设计一个分布式系统呢？

在分布式系统中，我们可以state system model的假定，从而设计出算法在满足以上假设的情况下解决分布式问题的系统。

关于timing的假定有如下system model：

同步模型：assumes bounded network delay，bounded process pauses，bounded clock error

部分同步模型：assumes 大多数时间其表现的类似同步模型，然而有时会超出bounds for network delay，process pause，and clock drift.

异步模型：在这个模型下，算法不允许做任何有关timing 的假定。

关于node failure的模型：

crash-stop faults：在这个模型下，算法可以假定一个结点只能通过crash来fail。一旦fail，结点就不会再回来。

crash-recovery faults：在这个模型下，结点被假定crash后有可能recover，且其有稳定的存储，能在crash中保留。

byzantine faults：node可以做任何事，甚至欺骗其他的node。

算法的正确性：通过定义一些properties，我们可以确定算法的正确性。

来自豆瓣App

回应转发赞收藏

加载中...

私密豆列或部分疑似不友善内容不展示

<前页后页>

dewleaf (上海)

享受生活就现在让延迟享受都见鬼去吧

设计数据密集型应用读书笔记9

热门话题 · · · · · · ( 去话题广场 )