
Monday, April 13, 2009

A Real-Time Synchronization Scheme for Large Numbers of Small Files

Traditional file synchronization tools such as rsync (one-way) and unison (two-way) scan every file, compare the two sides, and transfer only the differences. Once the file count reaches the millions or tens of millions, scanning everything becomes very time-consuming, and since only a small fraction of the files is usually changing at any moment, it is also a very inefficient approach.

I recently read Amazon's Dynamo design paper. Each Dynamo node synchronizes its data using a Hash Tree, which combines the soft real-time behavior of log-based replication (MySQL, BDB, etc.) with the eventual-consistency guarantee of full comparison (rsync, unison, etc.). The basic idea of a Hash Tree is to arrange all data in a tree: a leaf's hash is the hash of its content, and an internal node's hash is the hash of its children's hashes. As soon as any node changes, the change in its hash propagates quickly up to the root. A system that wants to stay in sync only has to keep polling the root hash; once it changes, walking down the tree locates the changed content in O(log N) time, and it can be synchronized immediately.

A file system is naturally a tree, albeit not a balanced one. If file modification times are reliable, that is, they faithfully reflect changes, they can serve as the files' hash values. Moreover, file modifications normally happen in order: a file modified later has a larger mtime than one modified earlier. So we can define a directory's modification time as the maximum modification time within it, which gives us a Hash Tree: as soon as any file is modified, the new timestamp propagates all the way up to the root directory.

Ordinary file systems do not work this way: a directory's mtime only records when the directory's own structure last changed and does not cover its subdirectories, because updating every ancestor on every write would be prohibitively expensive.
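The scheme above can be sketched in a few lines. This is an illustrative model (an in-memory tree of mtimes, not a real file system): a directory's effective mtime is the maximum of its children's, so a change anywhere bubbles up to the root, and the changed files can be found by walking only the "dirty" path.

```python
# Minimal sketch of the mtime-as-hash idea: a directory's "hash" is the
# maximum mtime of its children, so a change anywhere bubbles up to the root.
# For clarity this recomputes subtree mtimes; a real system would cache them.

def dir_mtime(node):
    """Effective mtime of a node (file: its mtime; dir: max of children's)."""
    if isinstance(node, dict):                # directory: {name: subtree-or-mtime}
        return max(dir_mtime(child) for child in node.values())
    return node                               # file: its mtime timestamp

def find_changed(node, last_sync, path=""):
    """Walk only into subtrees whose effective mtime is newer than last_sync."""
    changed = []
    if isinstance(node, dict):
        for name, child in node.items():
            if dir_mtime(child) > last_sync:  # prune untouched subtrees
                changed.extend(find_changed(child, last_sync, path + "/" + name))
    elif node > last_sync:
        changed.append(path)
    return changed

tree = {"a": {"x.txt": 100, "y.txt": 250}, "b": {"z.txt": 120}}
print(dir_mtime(tree))          # 250: the newest mtime propagates to the root
print(find_changed(tree, 200))  # only /a/y.txt changed since sync point 200
```

Only the subtrees whose effective mtime exceeds the last sync point are descended into, which is what gives the O(log N)-style behavior described above.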

Monday, February 16, 2009

Large Website Architecture, Part 4: Many-to-Many Relationships and Concurrent Cache Design

Many-to-many relationships and multi-table query optimization

The previous article used the user table as an example to introduce basic data partitioning and configuration schemes. In the Web 2.0 era, however, simple indexed lists are no longer enough: many-to-many relationships are the most common kind of relationship, so this article examines the many-to-many data that pervades Web 2.0 applications and how to handle it concretely. Take a simple example: the friends feature is one of the most heavily used in 2.0 sites. Every user has many friends and is in turn a friend of many other users, so in the worst case the data volume is on the order of the square of the number of users. Likewise for article tags: each article can carry several tags and each tag can be attached to many articles, another combinatorial product that can push the row counts to astronomical numbers. We will not cover hardware, I/O, or clustering here; instead we approach the problem from the perspective of application development.

Let's first introduce a basic, workable scheme, and then extend it step by step to meet later, more specific needs.

There are three traditional ways to handle a many-to-many relationship: first, search; second, a separate index table that stores the corresponding pairs of IDs; third, what I will call two-level denormalized buffering (I don't know a standard name for this technique, so that will have to do).

The first approach involves large numbers of LIKE queries, so its performance is poor. Full-text indexing might solve that, but a third-party engine may not fit our needs, and we may not have the time or resources to build our own. With the second approach, the index table itself grows to a staggering number of rows; we have to maintain its hash distribution, run queries across tables and partitions, and enforce uniqueness of the data, so the processing is complex and the performance is predictably bad.

Now to the main topic: a simple example of the third approach, a concrete solution for many-to-many data. We will use the many-to-many relationship between tags and articles; by analogy you can think through other complex many-to-many relationships, such as groups and users, or albums and the users tagged in them. The scheme below may not be the best one, but in practice it has proven the most reasonable trade-off between time and development cost.

First, let's clarify the workflow. A traditional schema publishes a tagged blog post in three steps (or four, if you count checking whether each tag already exists): first, insert the article and obtain its ID; second, look up each tag, reusing its ID if it exists, otherwise inserting the new tag and taking its ID; third, insert the article ID and tag ID into the index table to establish the association. Putting an index on that index table is disastrous at scale: although it effectively speeds up queries, the publishing speed becomes unbearable.

Our approach is also a four-step process that refines how the many-to-many relationship is handled.
The two dominant uses of tags are listing the articles under a tag and showing an article's tags, so those are the cases we optimize for.
Step 1: data redundancy
Denormalization is an old story: add a TAG column to the article table and store the tags as [TagID,TagName]|[TagID,TagName]|[TagID,TagName]. Symmetrically, add an Article field to the TAG table holding [ArticleID,Title]|[ArticleID,Title]|[ArticleID,Title]. Adding an association is then just an APPEND. The structures of the ARTICLE and TAG tables themselves were covered in my previous article, and you can store even more fields if you need to.
Some will ask why TagName and ArticleTitle are stored redundantly. It is to avoid cross-table and INNER JOIN queries: IN queries and cross-table queries cause full-table scans, so we must find an effective replacement for them. We may get even more aggressive about redundancy later; more on that below.
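The packed TAG column described above might be handled like this. The [TagID,TagName]|… layout follows the text; the helper names are illustrative, not from any real schema.

```python
# Sketch of the denormalized TAG column: tags are stored inline on the article
# row as "[TagID,TagName]|[TagID,TagName]", appended on update, so displaying
# an article's tags needs no join and no IN query.

def append_tag(tag_column, tag_id, tag_name):
    """Append one [id,name] entry to the packed column string."""
    entry = "[%d,%s]" % (tag_id, tag_name)
    return entry if not tag_column else tag_column + "|" + entry

def parse_tags(tag_column):
    """Turn the packed column back into (id, name) pairs for display."""
    pairs = []
    for entry in tag_column.split("|"):
        tag_id, tag_name = entry.strip("[]").split(",", 1)
        pairs.append((int(tag_id), tag_name))
    return pairs

col = ""
col = append_tag(col, 7, "architecture")
col = append_tag(col, 12, "mysql")
print(col)               # [7,architecture]|[12,mysql]
print(parse_tags(col))   # [(7, 'architecture'), (12, 'mysql')]
```

The same two helpers work for the Article field on the TAG table, with (ArticleID, Title) in place of (TagID, TagName).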

Step 2: asynchronous storage.
Borrowing loosely from the singleton pattern, we handle the index between articles and tags in a single dedicated process, asynchronously.
To keep article publishing from blocking threads on checks against the TAG table, we adopt a deferred scheme: the server maintains one process dedicated to querying and indexing the tag and article fields, and when an article is published, the tag-synchronization work is handed off to that separate process (or server), which performs the indexing in the background.

Step 3: a second-level index.
For tags whose existence is checked frequently, and for hot tags, we can also maintain an effective in-memory index. For example, for the tag "疯狂代码" we index on its first character "疯": the array for "疯" stores the group of tags beginning with "疯", and the array for "傲" stores the tags beginning with "傲". At larger scale this can be extended to N levels of indexing. Build such in-memory indexes for the commonly used tags; you can picture it as an in-memory Suggest (like the suggestions Google shows as you search), ready to be used directly.
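The first-character in-memory index described above can be sketched as a one-level "suggest" structure; deeper prefixes would simply add more levels. The tag strings here are illustrative.

```python
from collections import defaultdict

# One-level in-memory tag index: bucket tags by their first character, so a
# suggest query only scans the (small) bucket for that character instead of
# the whole tag table.

class TagIndex:
    def __init__(self):
        self.buckets = defaultdict(list)   # first char -> tags starting with it

    def add(self, tag):
        self.buckets[tag[0]].append(tag)

    def suggest(self, prefix):
        # Narrow by first character, then filter the bucket by full prefix.
        return [t for t in self.buckets[prefix[0]] if t.startswith(prefix)]

idx = TagIndex()
for t in ["crazy code", "crazy coder", "cache", "sharding"]:
    idx.add(t)
print(idx.suggest("crazy"))   # ['crazy code', 'crazy coder']
```

An N-level version would key each level on one more character, exactly like the nested "疯"/"傲" arrays in the text.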
Step 4: handling cross-table queries
In many cases we cannot avoid multi-table queries, or IN and OR queries. Beyond the partitioned views encapsulated in the business layer, we can do better, because many of our queries are very frequent and very uniform (uniform here meaning hot, repeated queries), for example the multi-criteria searches common in an SNS (gender, interests, and so on), whose data may be spread across several table structures and would otherwise inevitably cause full-table scans.
The remedy is simple: merge the scattered, vertically partitioned tables back together on a separate read-only subscriber server, then apply suitable restructuring and indexing; you should see where this is going. Simple as it is, this approach also suits later horizontal scaling of the servers very well.

That is a simple architectural sketch for many-to-many relationships and multi-table queries. Some will surely ask whether the workload isn't too large: the string handling, and repeating this treatment for every many-to-many relationship.
OK, we can abstract it further. Let TableA stand for the Article table and TableT for the Tag table, and abstract the fields down to an ID plus a tag string; the same goes for the tag table. You should see what I am getting at.

Right: write a code generator that emits the code for each many-to-many relationship. It is easy to write; a few Appends will do it. To make it even more convenient, abstract it as a singleton, then bend the rules a little and make it a base class from which the other relationships inherit... The rest should be simple; I leave the concrete implementation to you.

Let the concurrency come harder: data handling in high-concurrency environments

For high-concurrency sites, SNS and especially web games are at once the easiest and the hardest cases. The easy part is whatever is purely database-driven, that is, the SELECTs; the hard part is not SELECT but UPDATE. Under heavy concurrency, UPDATEs frequently time out, and although we can clean up in a finally block, the frustrating reality is that the database connection pool still saturates and data still gets lost…

The situation above is one of the most common reasons web projects fail. As data balloons and concurrency grows geometrically, the constraints become I/O and the database itself: no matter which database you run, Oracle, MySQL, or SQL Server, it will time out, frequently, with frequent exceptions. This has to be anticipated early in the application's design. A good data caching strategy often decides success or failure, and the caching strategy is also the hardest part of a web project to test and the easiest to get wrong.

In large website architectures, the caching strategy is the most critical and central piece. Given its complexity, this section only briefly introduces a database caching scheme for high concurrency; later articles will cover common caching strategies in detail. The method is better called data buffering than caching: it updates data asynchronously, and depending on the load, buffering writes for even one second can already bring a very worthwhile improvement in capacity.

The principle is simple: concurrent updates are first buffered in an application-level pool and then flushed on a timer (note that in practice this must be combined with a concrete caching scheme; only the outline is given here).

The traditional UPDATE request flow is: request → application → database update.
The buffering and updating can be implemented independently in the data layer: an UPDATE first goes into the buffer pool and is written out to the database on a timer. Note that the buffer pool has a second job: for the interval between a value being buffered and being written to the database, it doubles as a global data cache, something like a temporary table. When the current request context needs up-to-the-moment values, it reads from the buffer pool first (there are many tricks here, such as making clever use of cookies and sessions, and checking boundary conditions).
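The buffer-and-flush scheme above might look like this minimal sketch. `flush_to_db` stands in for the real data layer, and the timer flush is triggered manually here for clarity; these names are illustrative.

```python
import threading

# Write-behind buffer: concurrent updates land in an in-process pool and are
# flushed periodically, so a burst of updates becomes one batched write. The
# pool also doubles as a read cache for values not yet written to the database.

class UpdateBuffer:
    def __init__(self, flush_to_db):
        self.pending = {}                  # key -> latest value
        self.lock = threading.Lock()
        self.flush_to_db = flush_to_db

    def update(self, key, value):
        with self.lock:
            self.pending[key] = value      # overwrite: only the newest value is kept

    def get(self, key):
        with self.lock:                    # serve in-flight values from the buffer
            return self.pending.get(key)

    def flush(self):                       # in production, run this on a timer
        with self.lock:
            batch, self.pending = self.pending, {}
        if batch:
            self.flush_to_db(batch)        # one batched write instead of N updates

written = []
buf = UpdateBuffer(written.append)
for i in range(1000):
    buf.update("hits:42", i)               # 1000 rapid-fire updates...
buf.flush()
print(written)                             # ...become one write: [{'hits:42': 999}]
```

Even a one-second flush interval collapses all updates to a hot row within that second into a single database write, which is exactly the load reduction the text describes.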
That is a brief introduction to caching based on buffered updates; the next article will cover the detailed caching machinery for concurrent updates.

Friday, October 10, 2008

YouTube Architecture

Update: YouTube: The Platform. YouTube adds a new rich set of APIs in order to become your video platform leader--all for free. Upload, edit, watch, search, and comment on video from your own site without visiting YouTube. Compose your site internally from APIs because you'll need to expose them later anyway.

YouTube grew incredibly fast, to over 100 million video views per day, with only a handful of people responsible for scaling the site. How did they manage to deliver all that video to all those users? And how have they evolved since being acquired by Google?

Information Sources

# Google Video

Platform

# Apache
# Python
# Linux (SuSe)
# MySQL
# psyco, a dynamic python->C compiler
# lighttpd for video instead of Apache

What's Inside?

The Stats

# Supports the delivery of over 100 million videos per day.
# Founded 2/2005
# 3/2006 30 million video views/day
# 7/2006 100 million video views/day
# 2 sysadmins, 2 scalability software architects
# 2 feature developers, 2 network engineers, 1 DBA

Recipe for handling rapid growth


while (true)
{
identify_and_fix_bottlenecks();
drink();
sleep();
notice_new_bottleneck();
}

This loop runs many times a day.

Web Servers

# NetScaler appliances are used for load balancing and caching static content.
# Run Apache with mod_fast_cgi.
# Requests are routed for handling by a Python application server.
# The application server talks to various databases and other information sources to get all the data and formats the HTML page.
# Can usually scale web tier by adding more machines.
# The Python web code is usually NOT the bottleneck, it spends most of its time blocked on RPCs.
# Python allows rapid flexible development and deployment. This is critical given the competition they face.
# Usually less than 100 ms page service times.
# Use psyco, a dynamic python->C compiler that uses a JIT compiler approach to optimize inner loops.
# For CPU-intensive activities like encryption, they use C extensions.
# Some pre-generated cached HTML for expensive to render blocks.
# Row level caching in the database.
# Fully formed Python objects are cached.
# Some data are calculated and sent to each application so the values are cached in local memory. This is an underused strategy. The fastest cache is in your application server and it doesn't take much time to send precalculated data to all your servers. Just have an agent that watches for changes, precalculates, and sends.
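The push-to-local-cache strategy in the last point might be sketched like this; the class and key names are illustrative, not YouTube's.

```python
# Sketch of the "precalculate and push" idea: an agent computes a value once
# and fans it out to each app server's in-process cache, so reads become local
# memory lookups with no RPC on the request path.

class LocalCache:
    def __init__(self):
        self.data = {}

    def receive(self, updates):        # called by the push agent
        self.data.update(updates)

    def get(self, key):
        return self.data.get(key)

def agent_push(servers, compute):
    """Precalculate once, then send the result to every app server."""
    precalculated = compute()
    for server in servers:
        server.receive(precalculated)

servers = [LocalCache() for _ in range(3)]
agent_push(servers, lambda: {"top_videos": ["v1", "v2"]})
print([s.get("top_videos") for s in servers])
```

In a real deployment the agent would watch for changes and push over the network; the point is that the fastest cache is the one inside the application server itself.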

Video Serving

# Costs include bandwidth, hardware, and power consumption.
# Each video hosted by a mini-cluster. Each video is served by more than one machine.
# Using a cluster means:
- More disks serving content which means more speed.
- Headroom. If a machine goes down others can take over.
- There are online backups.
# Servers use the lighttpd web server for video:
- Apache had too much overhead.
- Uses epoll to wait on multiple fds.
- Switched from single process to multiple process configuration to handle more connections.
# Most popular content is moved to a CDN (content delivery network):
- CDNs replicate content in multiple places. There's a better chance of content being closer to the user, with fewer hops, and content will run over a more friendly network.
- CDN machines mostly serve out of memory because the content is so popular there's little thrashing of content into and out of memory.
# Less popular content (1-20 views per day) uses YouTube servers in various colo sites.
- There's a long tail effect. A video may have a few plays, but lots of videos are being played. Random disk blocks are being accessed.
- Caching doesn't do a lot of good in this scenario, so spending money on more cache may not make sense. This is a very interesting point. If you have a long tail product caching won't always be your performance savior.
- Tune RAID controller and pay attention to other lower level issues to help.
- Tune memory on each machine so there's not too much and not too little.

Serving Video Key Points

# Keep it simple and cheap.
# Keep a simple network path. Not too many devices between content and users. Routers, switches, and other appliances may not be able to keep up with so much load.
# Use commodity hardware. The more expensive the hardware, the more expensive everything else gets too (support contracts). You are also less likely to find help on the net.
# Use simple common tools. They use most of the tools built into Linux and layer their own on top.
# Handle random seeks well (SATA, tweaks).

Serving Thumbnails

# Surprisingly difficult to do efficiently.
# There are about 4 thumbnails for each video, so there are a lot more thumbnails than videos.
# Thumbnails are hosted on just a few machines.
# Saw problems associated with serving a lot of small objects:
- Lots of disk seeks and problems with inode caches and page caches at OS level.
- Ran into per directory file limit. Ext3 in particular. Moved to a more hierarchical structure. Recent improvements in the 2.6 kernel may improve Ext3 large directory handling up to 100 times, yet storing lots of files in a file system is still not a good idea.
- A high number of requests/sec, as web pages can display 60 thumbnails per page.
- Under such high loads Apache performed badly.
- Used squid (reverse proxy) in front of Apache. This worked for a while, but as load increased performance eventually decreased. Went from 300 requests/second to 20.
- Tried lighttpd, but single-threaded it stalled, and multi-process mode ran into problems because each process kept a separate cache.
- With so many images, setting up a new machine took over 24 hours.
- Rebooting a machine took 6-10 hours for the cache to warm up enough to avoid going to disk.
# To solve all their problems they started using Google's BigTable, a distributed data store:
- Avoids small file problem because it clumps files together.
- Fast and fault tolerant. Assumes it's working on an unreliable network.
- Lower latency because it uses a distributed multilevel cache. This cache works across different colocation sites.
- For more information on BigTable take a look at Google Architecture, GoogleTalk Architecture, and BigTable.

Databases

# The Early Years
- Use MySQL to store meta data like users, tags, and descriptions.
- Served data off a monolithic RAID 10 Volume with 10 disks.
- Living off credit cards so they leased hardware. When they needed more hardware to handle load it took a few days to order and get delivered.
- They went through a common evolution: single server, went to a single master with multiple read slaves, then partitioned the database, and then settled on a sharding approach.
- Suffered from replica lag. The master is multi-threaded and runs on a large machine so it can handle a lot of work. Slaves are single threaded and usually run on lesser machines and replication is asynchronous, so the slaves can lag significantly behind the master.
- Updates cause cache misses, which go to disk, where slow I/O causes slow replication.
- With a replicated architecture you have to spend a lot of money for incremental bits of write performance.
- One of their solutions was to prioritize traffic by splitting the data into two clusters: a video watch pool and a general cluster. The idea is that people primarily want to watch video, so that function should get the most resources. The social networking features of YouTube are less important and can be routed to a less capable cluster.
# The later years:
- Went to database partitioning.
- Split into shards with users assigned to different shards.
- Spreads writes and reads.
- Much better cache locality which means less IO.
- Resulted in a 30% hardware reduction.
- Reduced replica lag to 0.
- Can now scale database almost arbitrarily.
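The user-to-shard assignment described above isn't specified in detail, so here is a hedged sketch using simple modulo routing on the user ID; the shard names are hypothetical.

```python
# Sketch of shard routing: every user is deterministically assigned to one
# shard, so all of that user's reads and writes hit the same database, which
# is what gives the cache locality and zero replica lag described above.

NUM_SHARDS = 8
SHARD_DSNS = ["db-shard-%d" % i for i in range(NUM_SHARDS)]  # hypothetical names

def shard_for(user_id):
    """Map a numeric user ID to its shard (simple modulo; real systems may
    use a lookup table so shards can be rebalanced)."""
    return SHARD_DSNS[user_id % NUM_SHARDS]

print(shard_for(42))                    # db-shard-2
print(shard_for(42) == shard_for(42))   # stable: a user always maps to one shard
```

A lookup-table variant (user → shard stored centrally) trades a little read latency for the ability to migrate individual users between shards.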

Data Center Strategy

# Used managed hosting providers at first. Living off credit cards, so it was the only way.
# Managed hosting can't scale with you. You can't control hardware or make favorable networking agreements.
# So they went to a colocation arrangement. Now they can customize everything and negotiate their own contracts.
# Use 5 or 6 data centers plus the CDN.
# Videos come out of any data center. Not closest match or anything. If a video is popular enough it will move into the CDN.
# Video is bandwidth dependent, not really latency dependent. It can come from any colo.
# For images latency matters, especially when you have 60 images on a page.
# Images are replicated to different data centers using BigTable. Code looks at different metrics to know which copy is closest.

Lessons Learned

# Stall for time. Creative and risky tricks can help you cope in the short term while you work out longer term solutions.

# Prioritize. Know what's essential to your service and prioritize your resources and efforts around those priorities.

# Pick your battles. Don't be afraid to outsource some essential services. YouTube uses a CDN to distribute their most popular content. Creating their own network would have taken too long and cost too much. You may have similar opportunities in your system. Take a look at Software as a Service for more ideas.

# Keep it simple! Simplicity allows you to rearchitect more quickly so you can respond to problems. It's true that nobody really knows what simplicity is, but if you aren't afraid to make changes then that's a good sign simplicity is happening.

# Shard. Sharding helps to isolate and constrain storage, CPU, memory, and IO. It's not just about getting more write performance.

# Constant iteration on bottlenecks:
- Software: DB, caching
- OS: disk I/O
- Hardware: memory, RAID

# You succeed as a team. Have a good cross discipline team that understands the whole system and what's underneath the system. People who can set up printers, machines, install networks, and so on. With a good team all things are possible.

Flickr Architecture

Update: Flickr hits 2 Billion photos served. That's a lot of hamburgers.

Flickr is both my favorite bird and the web's leading photo sharing site. Flickr has an amazing challenge, they must handle a vast sea of ever expanding new content, ever increasing legions of users, and a constant stream of new features, all while providing excellent performance. How do they do it?

Site: http://www.flickr.com/

Information Sources

# Flickr and PHP (an early document)
# Capacity Planning for LAMP
# Federation at Flickr: Doing Billions of Queries a Day by Dathan Pattishall.
# Building Scalable Web Sites by Cal Henderson from Flickr.
# Database War Stories #3: Flickr by Tim O'Reilly
# Cal Henderson's Talks. A lot of useful PowerPoint presentations.

Platform

# PHP
# MySQL
# Shards
# Memcached for a caching layer.
# Squid in reverse-proxy for html and images.
# Linux (RedHat)
# Smarty for templating
# Perl
# PEAR for XML and Email parsing
# ImageMagick, for image processing
# Java, for the node service
# Apache
# SystemImager for deployment
# Ganglia for distributed system monitoring
# Subcon stores essential system configuration files in a subversion repository for easy deployment to machines in a cluster.
# Cvsup for distributing and updating collections of files across a network.

The Stats

# More than 4 billion queries per day.
# ~35M photos in squid cache (total)
# ~2M photos in squid’s RAM
# ~470M photos, 4 or 5 sizes of each
# 38k req/sec to memcached (12M objects)
# 2 PB raw storage (consumed ~1.5TB on Sunday)
# Over 400,000 photos being added every day

The Architecture

# A pretty picture of Flickr's architecture can be found on this slide. A simple depiction is:
-- Pair of ServerIron's
---- Squid Caches
------ Net App's
---- PHP App Servers
------ Storage Manager
------ Master-master shards
------ Dual Tree Central Database
------ Memcached Cluster
------ Big Search Engine

- The Dual Tree structure is a custom set of changes to MySQL that allows scaling by incrementally adding masters without a ring architecture. This allows cheaper scaling because you need less hardware compared to master-master setups, which always require double the hardware.
- The central database includes data like the 'users' table, which includes primary user keys (a few different IDs) and a pointer to which shard a user's data can be found on.

# Use dedicated servers for static content.
# Talks about how to support Unicode.
# Use a shared-nothing architecture.
# Everything (except photos) is stored in the database.
# Statelessness means they can bounce people around servers and it's easier to make their APIs.
# Scaled at first by replication, but that only helps with reads.
# Create a search farm by replicating the portion of the database they want to search.
# Use horizontal scaling so they just need to add more machines.
# Handle pictures emailed from users by parsing each email as it's delivered in PHP. The email is parsed for any photos.
# Earlier they suffered from Master-Slave lag. Too much load and they had a single point of failure.
# They needed the ability to make live maintenance, repair data, and so forth, without taking the site down.
# Lots of excellent material on capacity planning. Take a look in the Information Sources for more details.
# Went to a federated approach so they can scale far into the future:
- Shards: My data is stored on my shard, but when I comment on someone else's blog, the record of that action is stored on their shard.
- Global Ring: It's like DNS: you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.
- PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)
# Shards:
- Slice of the main database
- Active master-master ring replication: MySQL 4.1 has a few drawbacks here, such as how commits are honored in master-master setups. Auto-increment IDs are managed automatically to keep the ring active-active.
- New accounts are assigned to shards by random number.
- Migration is done from time to time, so certain power users can be moved off. It needs to be balanced if a user has a lot of photos… 192,000 photos and 700,000 tags take about 3-4 minutes. Migration is done manually.
# Clicking a Favorite:
- Pulls the photo owner's account from cache to get the shard location (say, shard-5)
- Pulls my information from cache to get my shard location (say, shard-13)
- Starts a "distributed transaction" to answer the questions: Who favorited the photo? What are my favorites?
# Can ask the question from any shard and recover the data. It's absolutely redundant.
# To get rid of replication lag…
- on every page load, the user is assigned to a bucket
- if a host is down, go to the next host in the list; if all hosts are down, display an error page. They don't use persistent connections; they build a connection and tear it down, so every page load tests the connection.
# Every user's reads and writes are kept on one shard, so the notion of replication lag is gone.
# Each server in a shard runs at 50% load, so half the servers in each shard can be shut down: one server can carry the full load if another server in the shard is down or in maintenance mode. To upgrade, just shut down half the shard, upgrade that half, and repeat.
# During traffic spikes they break the 50% rule, though. They do something like 6,000-7,000 queries per second, while the design target is at most 4,000 queries per second to keep load at 50%.
# Average queries per page are 27-35 SQL statements. Favorites counts are real time. API access to the database is all real time. They achieved the real-time requirements without any disadvantages.
# Over 36,000 queries per second while running within the capacity threshold; bursts of traffic run at double that 36K qps.
# Each shard holds the data of 400K+ users.
- A lot of data is stored twice. For example, a comment is part of the relation between the commenter and the commentee. Where is the comment stored? Both places. Transactions are used to prevent the copies from going out of sync: open transaction 1 and write the commands; open transaction 2 and write the commands; commit the first transaction if all is well; commit the second if the first committed. But there is still a chance of failure if a box goes down during the first commit.
# Search:
- Two search back-ends: the shards (35k qps across a few shards) and Yahoo!'s (proprietary) web search
- An owner's single-tag search or a batch tag change (say, via Organizr) goes to the shards because of the real-time requirements; everything else goes to Yahoo!'s engine (probably about 90% behind the real-time goodness)
- Think of it as Lucene-like search
# Hardware:
- EMT64 with RHEL4, 16GB RAM
- 6-disk 15K RPM RAID-10.
- Data size is 12 TB of user metadata (this is not photos, just InnoDB ibdata files; the photos are a lot larger).
- 2U boxes. Each shard has ~120GB of data.
# Backup procedure:
- ibbackup runs on a cron job across the various shards at different times. Hot backup to a spare.
- Snapshots are taken every night across the entire cluster of databases.
- Writing or deleting several huge backup files at once to a replication filestore can wreck performance on that filestore for the next few hours as it replicates the backup files. Doing this to an in-production photo storage filer is a bad idea.
- However much it costs to keep multiple days of backups of all of your data, it's worth it. Staggered backups (something like 1-, 2-, 10- and 30-day backups) are good for when you discover something went wrong a few days later.
# Photos are stored on the filer. Upon upload, the photos are processed into the different sizes, and then the upload is complete. The metadata and pointers to the filers are stored in the database.
# Aggregating the data is very fast because it's a process per shard: stick it into a table, or recover the data from another copy on other users' shards.
# max_connections = 400 connections per shard, or 800 connections per server and shard. Plenty of capacity and connections. The thread cache is set to 45 because no more than 45 users tend to have simultaneous activity.
# Tags:
- Tags do not fit well with traditional normalized RDBMS schema design. Denormalization or heavy caching is the only way to generate a tag cloud in milliseconds for hundreds of millions of tags.
- Some of their data views are calculated offline by dedicated processing clusters that save the results into MySQL, because some relationships are so complicated to calculate that they would absorb all the database CPU cycles.
# Future Direction:
- Make it faster with real-time BCP, so all data centers can receive writes to the data layer (db, memcache, etc.) at the same time. Everything is active; nothing will ever be idle.
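The federated lookup described above (central 'users' table → shard pointer → connection) might be sketched like this; the table contents and DSN strings are illustrative, not Flickr's.

```python
# Sketch of federated shard lookup: the central database maps a user to a
# shard number, and every page view resolves the shard first, then connects.
# Plain dicts stand in for the central DB and the shard connection pool.

central_users = {"alice": 5, "bob": 13}                # user -> shard number
shard_pool = {5: "shard-5-dsn", 13: "shard-13-dsn"}    # hypothetical DSNs

def connect_for(username):
    """Resolve a user's shard via the central lookup, then pick the DSN."""
    shard_id = central_users[username]   # global-ring / central DB lookup
    return shard_pool[shard_id]          # then connect to that shard

# Favoriting a photo touches two shards: the owner's and the favoriter's.
print(connect_for("alice"), connect_for("bob"))
```

This is roughly the "10 lines of PHP" role mentioned in the text: resolve the shard, connect, and keep cross-shard writes (like favorites and comments) wrapped in paired transactions.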

Lessons Learned

# Think of your application as more than just a web application. You'll have REST APIs, SOAP APIs, RSS feeds, Atom feeds, etc.

# Go stateless. Statelessness makes for a simpler more robust system that can handle upgrades without flinching.

# Re-architecting your database sucks.

# Capacity plan. Bring capacity planning into the product discussion EARLY. Get buy-in from the $$$ people (and engineering management) that it’s something to watch.

# Start slow. Don’t buy too much equipment just because you’re scared/happy that your site will explode.

# Measure reality. Capacity planning math should be based on real things, not abstract ones.

# Build in logging and metrics. Usage stats are just as important as server stats. Build in custom metrics so you can compare real-world usage to server-based stats.

# Cache. Caching and RAM is the answer to everything.

# Abstract. Create clear levels of abstraction between database work, business logic, page logic, page mark-up and the presentation layer. This supports quick turn around iterative development.

# Layer. Layering allows developers to create page level logic which designers can use to build the user experience. Designers can ask for page logic as needed. It's a negotiation between the two parties.

# Release frequently. Even every 30 minutes.

# Forget about small efficiencies, about 97% of the time. Premature optimization is the root of all evil.

# Test in production. Build into the architecture mechanisms (config flags, load balancing, etc.) with which you can deploy new hardware easily into (and out of) production.

# Forget benchmarks. Benchmarks are fine for getting a general idea of capabilities, but not for planning. Artificial tests give artificial results, and the time is better used with testing for real.

# Find ceilings.
- What is the maximum something that every server can do ?
- How close are you to that maximum, and how is it trending ?
- MySQL (disk IO ?)
- SQUID (disk IO ? or CPU ?)
- memcached (CPU ? or network ?)

# Be sensitive to the usage patterns for your type of application.
- Do you have event related growth? For example: disaster, news event.
- Flickr gets 20-40% more uploads on first work day of the year than any previous peak the previous year.
- 40-50% more uploads on Sundays than the rest of the week, on average

# Be sensitive to the demands of exponential growth. More users means more content, more content means more connections, more connections mean more usage.

# Plan for peaks. Be able to handle peak loads up and down the stack.

Digg Architecture

Update 2: How Digg Works and How Digg Really Works (wear ear plugs). Brought to you straight from Digg's blog. A very succinct explanation of the major elements of the Digg architecture while tracing a request through the system. I've updated this profile with the new information.
Update: Digg now receives 230 million plus page views per month and 26 million unique visitors - traffic that necessitated major internal upgrades.

Traffic generated by Digg's over 22 million famously info-hungry users and 230 million page views can crash an unsuspecting website head-on into its CPU, memory, and bandwidth limits. How does Digg handle billions of requests a month?

Site: http://digg.com

Information Sources

# How Digg Works by Digg
# How Digg.com uses the LAMP stack to scale upward
# Digg PHP's Scalability and Performance

Platform

# MySQL
# Linux
# PHP
# Lucene
# Python
# APC PHP Accelerator
# MCache
# Gearman - job scheduling system
# MogileFS - open source distributed filesystem
# Apache
# Memcached

The Stats

# Started in late 2004 with a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine.
# Over 22 million users.
# 230 million plus page views per month
# 26 million unique visitors per month
# Several billion page views per month
# None of the scaling challenges faced had anything to do with PHP. The biggest issues faced were database related.
# Dozens of web servers.
# Dozens of DB servers.
# Six specialized graph database servers to run the Recommendation Engine.
# Six to ten machines that serve files from MogileFS.

What's Inside

# Specialized load balancer appliances monitor the application servers, handle failover, constantly adjust the cluster according to health, balance incoming requests, and cache JavaScript, CSS, and images. If you don't have fancy load balancers, take a look at Linux Virtual Server and Squid as replacements.
# Requests are passed to the Application Server cluster. Application servers consist of: Apache+PHP, Memcached, Gearman, and other daemons. They are responsible for coordinating access to different services (DB, MogileFS, etc.) and creating the response sent to the browser.
# Uses a MySQL master-slave setup.
- Four master databases are partitioned by functionality: promotion, profiles, comments, main. Many slave databases hang off each master.
- Writes go to the masters and reads go to the slaves.
- Transaction-heavy servers use the InnoDB storage engine.
- OLAP-heavy servers use the MyISAM storage engine.
- They did not notice a performance degradation moving from MySQL 4.1 to version 5.
- The schema is denormalized more than "your average database design."
- Sharding is used to break the database into several smaller ones.
# Digg's usage pattern makes it easier for them to scale. Most people just view the front page and leave. Thus 98% of Digg's database accesses are reads. With this balance of operations they don't have to worry about the complex work of architecting for writes, which makes it a lot easier for them to scale.
# They had problems with their storage system telling them writes were on disk when they really weren't. Controllers do this to improve their apparent performance, but it leaves a giant data-integrity hole in failure scenarios. This is really a pretty common problem and can be hard to fix, depending on your hardware setup.
# To lighten their database load they used the APC PHP accelerator and MCache.
# Memcached is used for caching, and memcached servers seem to be spread across their database and application servers. A specialized daemon monitors connections and kills connections that have been open too long.
# You can configure PHP not to parse and compile on each load by using a combination of Apache 2's worker threads, FastCGI, and a PHP accelerator. On a page's first load the PHP code is compiled, so any subsequent page loads are very fast.
# MogileFS, a distributed file system, serves story icons, user icons, and stores copies of each story’s source. A distributed file system spreads and replicates files across a lot of disks which supports fast and scalable file access.
# A specialized Recommendation Engine service was built to act as their distributed graph database. Relational databases are not well structured for generating recommendations so a separate service was created. LinkedIn did something similar for their graph.
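The master-slave split described above (writes to the functional master, reads spread across its slaves) can be sketched as follows; the partition names and hostnames are hypothetical.

```python
import random

# Sketch of functional partitioning plus read/write splitting: each functional
# area (comments, profiles, ...) has one master for writes and a pool of
# slaves for reads, matching Digg's ~98%-read workload.

MASTERS = {"comments": "comments-master", "profiles": "profiles-master"}
SLAVES = {
    "comments": ["comments-slave-1", "comments-slave-2"],
    "profiles": ["profiles-slave-1"],
}

def pick_server(partition, is_write):
    """Route a query: writes always hit the master, reads spread over slaves."""
    if is_write:
        return MASTERS[partition]
    return random.choice(SLAVES[partition])

print(pick_server("comments", is_write=True))    # comments-master
print(pick_server("comments", is_write=False))   # one of the comment slaves
```

One caveat the profile hints at: reads immediately after a write may see stale data on a lagging slave, so latency-sensitive read-after-write paths are often pinned to the master.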

Lessons Learned

# The number of machines isn't as important as what the pieces are and how they fit together.
# Don't treat the database as a hammer. Recommendations didn't fit well with the relational model, so they built a specialized service.
# Tune MySQL through your database engine selection. Use InnoDB when you need transactions and MyISAM when you don't. For example, tables that are transactional on the master can use MyISAM on read-only slaves.
# At some point in their growth curve they were unable to grow by adding RAM so had to grow through architecture.
# People often complain Digg is slow. This is perhaps due to their large JavaScript libraries rather than their backend architecture.
# One way they scale is by being careful about which applications they deploy on their system. They are careful not to release applications which use too much CPU. Clearly Digg has a pretty standard LAMP architecture, but I thought this was an interesting point. Engineers often have a bunch of cool features they want to release, but those features can kill an infrastructure if the infrastructure doesn't grow along with them. So push back until your system can handle the new features. This goes to capacity planning, something Flickr emphasizes in their scaling process.
# You have to wonder whether, by limiting new features to match their infrastructure, Digg might lose ground to other faster-moving social bookmarking services. Perhaps if the infrastructure were more easily scaled they could add features faster and compete better? On the other hand, just adding features because you can doesn't make a lot of sense either.
# The data layer is where most scaling and performance problems are found, and these are not language specific. You'll hit them using Java, PHP, Ruby, or your favorite language.

Google Architecture

Update: Greg Linden points to a new Google article MapReduce: simplified data processing on large clusters. Some interesting stats: 100k MapReduce jobs are executed each day; more than 20 petabytes of data are processed per day; more than 10k MapReduce programs have been implemented; machines are dual processor with gigabit ethernet and 4-8 GB of memory.

Google is the King of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet scale applications at an alarmingly high competition crushing rate. Their goal is always to build a higher performing higher scaling infrastructure to support their products. How do they do that?

Information Sources

# Video: Building Large Systems at Google
# Google Lab: The Google File System
# Google Lab: MapReduce: Simplified Data Processing on Large Clusters
# Google Lab: BigTable.
# Video: BigTable: A Distributed Structured Storage System.
# Google Lab: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
# How Google Works by David Carr in Baseline Magazine.
# Google Lab: Interpreting the Data: Parallel Analysis with Sawzall.
# Dare Obasanjo's Notes on the scalability conference.

Platform

# Linux
# A large diversity of languages: Python, Java, C++

What's Inside?

The Stats

# Estimated 450,000 low-cost commodity servers in 2006
# In 2005 Google indexed 8 billion web pages. By now, who knows?
# Currently there over 200 GFS clusters at Google. A cluster can have 1000 or even 5000 machines. Pools of tens of thousands of machines retrieve data from GFS clusters that run as large as 5 petabytes of storage. Aggregate read/write throughput can be as high as 40 gigabytes/second across the cluster.
# Currently there are 6000 MapReduce applications at Google and hundreds of new applications are being written each month.
# BigTable scales to store billions of URLs, hundreds of terabytes of satellite imagery, and preferences for hundreds of millions of users.

The Stack

Google visualizes their infrastructure as a three layer stack:

# Products: search, advertising, email, maps, video, chat, blogger
# Distributed Systems Infrastructure: GFS, MapReduce, and BigTable.
# Computing Platforms: a bunch of machines in a bunch of different data centers
# Make sure it's easy for folks in the company to deploy at a low cost.
# Look at price performance data on a per application basis. Spend more money on hardware to not lose log data, but spend less on other types of data. Having said that, they don't lose data.

Reliable Storage Mechanism with GFS (Google File System)

# Reliable scalable storage is a core need of any application. GFS is their core storage platform.
# Google File System - large distributed log structured file system in which they throw in a lot of data.
# Why build it instead of using something off the shelf? Because they control everything and it's the platform that distinguishes them from everyone else. They required:
- high reliability across data centers
- scalability to thousands of network nodes
- huge read/write bandwidth requirements
- support for large blocks of data which are gigabytes in size.
- efficient distribution of operations across nodes to reduce bottlenecks
# System has master and chunk servers.
- Master servers keep metadata on the various data files. Data are stored in the file system in 64MB chunks. Clients talk to the master servers to perform metadata operations on files and to locate the chunk server that contains the data they need on disk.
- Chunk servers store the actual data on disk. Each chunk is replicated across three different chunk servers to create redundancy in case of server crashes. Once directed by a master server, a client application retrieves files directly from chunk servers.
# A new application coming on line can use an existing GFS cluster or make its own. It would be interesting to understand the provisioning process they use across their data centers.
# Key is enough infrastructure to make sure people have choices for their application. GFS can be tuned to fit individual application needs.
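The master/chunk-server split described above can be sketched as a toy lookup flow: the client asks the master only for metadata (which chunk servers hold a given chunk), then reads the data directly from a chunk server. The 64MB chunk size and 3-way replication come from the notes above; the class and server names are invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS stores files as 64MB chunks

class ToyMaster:
    """Holds only metadata: which chunk servers own each chunk of a file."""
    def __init__(self):
        self.chunk_map = {}  # (filename, chunk_index) -> list of server ids

    def register(self, filename, chunk_index, servers):
        # Each chunk is replicated across three chunk servers.
        self.chunk_map[(filename, chunk_index)] = servers

    def locate(self, filename, offset):
        # Clients ask the master only where data lives, never for the data itself.
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_map[(filename, chunk_index)]

master = ToyMaster()
master.register("web.log", 0, ["cs1", "cs2", "cs3"])
master.register("web.log", 1, ["cs2", "cs4", "cs5"])

# An offset of 70MB falls in the second 64MB chunk (index 1).
idx, replicas = master.locate("web.log", 70 * 1024 * 1024)
print(idx, replicas)  # → 1 ['cs2', 'cs4', 'cs5']
```

The real system adds leases, re-replication on failure, and much more; the point here is only that the master stays out of the data path.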

Do Something With the Data Using MapReduce

# Now that you have a good storage system, how do you do anything with so much data? Let's say you have many TBs of data stored across 1000 machines. Databases don't scale to those levels, or don't scale cost effectively. That's where MapReduce comes in.
# MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
# Why use MapReduce?
- Nice way to partition tasks across lots of machines.
- Handle machine failure.
- Works across different application types, like search and ads. Almost every application has map reduce type operations. You can precompute useful data, find word counts, sort TBs of data, etc.
- Computation can automatically move closer to the IO source.
# The MapReduce system has three different types of servers.
- The Master server assigns user tasks to map and reduce servers. It also tracks the state of the tasks.
- The Map servers accept user input and perform map operations on it. The results are written to intermediate files.
- The Reduce servers accept intermediate files produced by the map servers and perform reduce operations on them.
# For example, you want to count the number of words in all web pages. You would feed all the pages stored on GFS into MapReduce. This would all be happening on 1000s of machines simultaneously and all the coordination, job scheduling, failure handling, and data transport would be done automatically.
- The steps look like: GFS -> Map -> Shuffle -> Reduction -> Store Results back into GFS.
- In MapReduce a map maps one view of data to another, producing a key value pair, which in our example is word and count.
- Shuffling aggregates key types.
- The reductions sums up all the key value pairs and produces the final answer.
# The Google indexing pipeline has about 20 different map reductions. A pipeline looks at data as a whole bunch of records and aggregates keys. A second map-reduce comes along, takes that result, and does something else. And so on.
# Programs can be very small. As little as 20 to 50 lines of code.
# One problem is stragglers. A straggler is a computation that is going slower than the others and holds up everyone. Stragglers may happen because of slow IO (say a bad controller) or from a temporary CPU spike. The solution is to run multiple copies of the same computation and, when one is done, kill all the rest.
# Data transferred between map and reduce servers is compressed. The idea is that because servers aren't CPU bound it makes sense to spend on data compression and decompression in order to save on bandwidth and I/O.
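The GFS -> Map -> Shuffle -> Reduction flow above can be shown in miniature with the word-count example from the notes. Function names are invented, and a real MapReduce runs this across thousands of machines rather than one process; only the shape of the computation is faithful.

```python
from collections import defaultdict

def map_phase(pages):
    # Map: emit one (word, 1) pair per word occurrence in each "page".
    for page in pages:
        for word in page.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values for each key to get the final counts.
    return {word: sum(counts) for word, counts in groups.items()}

pages = ["the cat sat", "the dog sat"]
result = reduce_phase(shuffle(map_phase(pages)))
print(result["the"], result["sat"])  # → 2 2
```

In the real system each phase's input and output live in GFS, and the run-time handles partitioning, scheduling, and failures automatically.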

Storing Structured Data in BigTable

# BigTable is a large scale, fault tolerant, self managing system that includes terabytes of memory and petabytes of storage. It can handle millions of reads/writes per second.
# BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database. It doesn't support joins or SQL type queries.
# It provides a lookup mechanism to access structured data by key. GFS stores opaque data, but many applications need data with structure.
# Commercial databases simply don't scale to this level, and they don't work across 1000s of machines.
# By controlling their own low level storage system Google gets more control and leverage to improve their system. For example, if they want features that make cross data center operations easier, they can build it in.
# Machines can be added and deleted while the system is running and the whole system just works.
# Each data item is stored in a cell which can be accessed using a row key, column key, or timestamp.
# Each row is stored in one or more tablets. A tablet is a sequence of 64KB blocks in a data format called SSTable.
# BigTable has three different types of servers:
- The Master servers assign tablets to tablet servers. They track where tablets are located and redistribute tasks as needed.
- The Tablet servers process read/write requests for tablets. They split tablets when they exceed size limits (usually 100MB - 200MB). When a tablet server fails, 100 tablet servers each pick up 1 new tablet and the system recovers.
- The Lock servers form a distributed lock service. Operations like opening a tablet for writing, Master arbitration, and access control checking require mutual exclusion.
# A locality group can be used to physically store related bits of data together for better locality of reference.
# Tablets are cached in RAM as much as possible.
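The cell model above, where each data item is addressed by a row key, column key, and timestamp, can be sketched as a toy in-memory table. The class, method names, and the sample row key are illustrative only; real BigTable stores versions newest-first in SSTables, not in Python lists.

```python
import bisect

class ToyTable:
    """Cells addressed by (row key, column key, timestamp); reads return
    the newest version at or before the requested timestamp."""
    def __init__(self):
        self.cells = {}  # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=float("inf")):
        versions = self.cells.get((row, column), [])
        # Find the newest version whose timestamp is <= the requested one.
        i = bisect.bisect_right(versions, (timestamp, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

t = ToyTable()
t.put("com.cnn.www", "contents", 1, "<html>v1")
t.put("com.cnn.www", "contents", 5, "<html>v2")
print(t.get("com.cnn.www", "contents"))     # → <html>v2
print(t.get("com.cnn.www", "contents", 3))  # → <html>v1
```

Timestamped versions are what let applications read a consistent snapshot while newer writes keep arriving.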

Hardware

# When you have a lot of machines how do you build them to be cost efficient and use power efficiently?
# Use ultra cheap commodity hardware and build software on top to handle their death.
# A 1,000-fold computer power increase can be had at 33 times lower cost if you use a failure-prone infrastructure rather than an infrastructure built on highly reliable components. You must build reliability on top of unreliability for this strategy to work.
# Linux, in-house rack design, PC class mother boards, low end storage.
# Price per wattage on performance basis isn't getting better. Have huge power and cooling issues.
# Use a mix of collocation and their own data centers.

Misc

# Push changes out quickly rather than wait for QA.
# Libraries are the predominant way of building programs.
# Some applications are provided as services, like crawling.
# An infrastructure handles versioning of applications so they can be released without a fear of breaking things.

Future Directions for Google

# Support geo-distributed clusters.
# Create a single global namespace for all data. Currently data is segregated by cluster.
# More and better automated migration of data and computation.
# Solve consistency issues that happen when you couple wide area replication with network partitioning (e.g. keeping services up even if a cluster goes offline for maintenance or due to some sort of outage).

Lessons Learned

# Infrastructure can be a competitive advantage. It certainly is for Google. They can roll out new internet services faster, cheaper, and at a scale few others can compete with. Many companies take a completely different approach, treating infrastructure as an expense. Each group will use completely different technologies and there will be little planning and commonality in how to build systems. Google thinks of themselves as a systems engineering company, which is a very refreshing way to look at building software.

# Spanning multiple data centers is still an unsolved problem. Most websites are in one and at most two data centers. How to fully distribute a website across a set of data centers is, shall we say, tricky.

# Take a look at Hadoop (product) if you don't have the time to rebuild all this infrastructure from scratch yourself. Hadoop is an open source implementation of many of the same ideas presented here.

# An under appreciated advantage of a platform approach is junior developers can quickly and confidently create robust applications on top of the platform. If every project needs to create the same distributed infrastructure wheel you'll run into difficulty because the people who know how to do this are relatively rare.

# Synergy isn't always crap. By making all parts of a system work together an improvement in one helps them all. Improve the file system and everyone benefits immediately and transparently. If every project uses a different file system then there's no continual incremental improvement across the entire stack.

# Build self-managing systems that work without having to take the system down. This allows you to more easily rebalance resources across servers, add more capacity dynamically, bring machines off line, and gracefully handle upgrades.

# Create a Darwinian infrastructure. Perform time consuming operation in parallel and take the winner.

# Don't ignore the Academy. Academia has a lot of good ideas that don't get translated into production environments. Most of what Google has done has prior art, just not prior large scale deployment.

# Consider compression. Compression is a good option when you have a lot of CPU to throw around and limited IO.

eBay Architecture

Update 2: EBay's Randy Shoup spills the secrets of how to service hundreds of millions of users and over two billion page views a day in Scalability Best Practices: Lessons from eBay on InfoQ. The practices: Partition by Function, Split Horizontally, Avoid Distributed Transactions, Decouple Functions Asynchronously, Move Processing To Asynchronous Flows, Virtualize At All Levels, Cache Appropriately.
Update: eBay Serves 5 Billion API Calls Each Month. Aren't we seeing more and more traffic driven by mashups composed on top of open APIs? APIs are no longer a bolt on, they are your application. Architecturally that argues for implementing your own application around the same APIs developers and users employ.

Who hasn't wondered how eBay does their business? As one of the largest most loaded websites in the world, it can't be easy. And the subtitle of the presentation hints at how creating such a monster system requires true engineering: Striking a balance between site stability, feature velocity, performance, and cost.

You may not be able to emulate how eBay scales their system, but the issues and possible solutions are worth learning from.

Site: http://ebay.com

Information Sources

# The eBay Architecture - Striking a balance between site stability, feature velocity, performance, and cost.
# Podcast: eBay’s Transactions on a Massive Scale
# Dan Pritchett on Architecture at eBay interview by InfoQ

Platform

# Java
# Oracle
# WebSphere, servlets
# Horizontal Scaling
# Sharding
# Mix of Windows and Unix

What's Inside?

This information was adapted from Johannes Ernst's Blog

The Stats

# On an average day, it runs through 26 billion SQL queries and keeps tabs on 100 million items available for purchase.
# 212 million registered users, 1 billion photos
# 1 billion page views a day, 105 million listings, 2 petabytes of data, 3 billion API calls a month
# Something like a factor of 35 in page views, e-mails sent, bandwidth from June 1999 to Q3/2006.
# 99.94% availability, measured as "all parts of site functional to everybody" vs. at least one part of a site not functional to some users somewhere
# The database is virtualized and spans 600 production instances residing in more than 100 server clusters.
# 15,000 application servers, all J2EE. About 100 groups of functionality aka "apps". Notion of a "pool": "all the machines that deal with selling"...

The Architecture

# Everything is planned with the question "what if load increases by 10x". Scaling only horizontal, not vertical: many parallel boxes.
# The architecture is strictly divided into layers: data tier, application tier, search, operations.
# Leverages MSXML framework for presentation layer (even in Java)
# Oracle databases, WebSphere Java (still 1.3.1)
# Split databases by primary access path, modulo on a key.
# Every database has at least 3 on-line databases. Distributed over 8 data centers
# Some database copies run 15 min behind, 4 hours behind
# Databases are segmented by function: user, item, account, feedback, transaction, over 70 in all.
# No stored procedures are used. There are some very simple triggers.
# Move CPU-intensive work out of the database layer to the application layer: referential integrity, joins, and sorting are done in the application layer! Reasoning: app servers are cheap, databases are the bottleneck.
# No client-side transactions. No distributed transactions.
# J2EE: use servlets, JDBC, connection pools (with rewrite). Not much else.
# No state information in application tier. Transient state maintained in cookie or scratch database.
# App servers do not talk to each other -- strict layering of architecture
# Search, in 2002: 9 hours to update the index running on largest Sun box available -- not keeping up.
# Average item on site changes its search data 5 times before it is sold (e.g. price), so real-time search results are extremely important.
# "Voyager": real-time feeder infrastructure built by eBay. Uses reliable multicast from the primary database to search nodes, an in-memory search index, horizontal segmentation, N slices, load-balanced over M instances, and query caching.
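The "split databases by primary access path, modulo on a key" point in the list above can be sketched as a trivial router. The shard count and key are invented; eBay's real routing is of course far more involved.

```python
NUM_SHARDS = 4  # eBay splits horizontally; the count here is arbitrary

def shard_for(user_id, num_shards=NUM_SHARDS):
    """Route a row to a shard by taking the key modulo the shard count."""
    return user_id % num_shards

# All rows for one user land on one shard, so the primary access path
# ("fetch everything for user X") never needs a cross-shard query.
print(shard_for(1234567))  # → 3
```

The trade-off is that queries off the primary access path (e.g. "all users who bid on item Y") must fan out to every shard, which is why eBay segments databases by function first.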

Lessons Learned

# Scale Out, Not Up
– Horizontal scaling at every tier.
– Functional decomposition.

# Prefer Asynchronous Integration
– Minimize availability coupling.
– Improve scaling options.

# Virtualize Components
– Reduce physical dependencies.
– Improve deployment flexibility.

# Design for Failure
– Automated failure detection and notification.
– “Limp mode” operation of business features.

# Move work out of the database into the applications because the database is the bottleneck. Ebay does this in the extreme. We see it in other architectures using caching and the file system, but eBay even does a lot of traditional database operations in applications (like joins).

# Use what you like and toss what you don't need. Ebay didn't feel compelled to use full blown J2EE stack. They liked Java and Servlets so that's all they used. You don't have to buy into any framework completely. Just use what works for you.

# Don't be afraid to build solutions that meet and evolve with your needs. Every off the shelf solution will fail you at some point. You have to go the rest of the way on your own.

# Operational controls become a larger and larger part of scalability as you grow. How do you upgrade, configure, and monitor thousands of machines while running a live system?

# Architectures evolve. You need to be able to change, refine, and develop your new system while keeping your existing site running. That's the primary challenge of any growing website.

# It's a mistake to worry too much about scalability from the start. Don't suffer from paralysis by analysis and worrying about traffic that may never come.

# It's also a mistake not to worry about scalability at all. You need to develop an organization capable of dealing with architecture evolution. Understand you are never done. Your system will always evolve and change. Build those expectations and capabilities into your business from the start. Don't let people and organizations be why your site fails. Many people will think the system should be perfect from the start. It doesn't work that way. A good system is developed over time in response to real issues and concerns. Expect change and adapt to change.

Large Website Architecture Series, Part 3: Optimized Design for Many-to-Many Relationships

The previous article, Part 2: An Overview of the Underlying Architecture, used the user data table as an example to introduce basic data-partitioning and configuration schemes. But in the Web 2.0 era, such simple list indexes are far from adequate in practice; many-to-many relationships are the most common kind of relationship. Here we examine the many-to-many relationships that pervade Web 2.0 data and their concrete behavior. A very simple example: in the 2.0 era the friends feature is among the most used. Each user has many friends and is in turn a friend of many others, so the data volume will be on the order of the square of the user count. Likewise with article tags: each article can have multiple tags, and each tag can in turn have multiple articles. That is again a cross product, and the data volume is again an astronomical figure.

There are two traditional approaches. One is to implement it via search; the other is to build a separate index table storing the corresponding IDs. The first performs poorly because it involves heavy LIKE queries. With the second, the index table's row count is itself at a staggering scale, you have to query across tables and partitions, and you still have to maintain data uniqueness, so the processing is quite complex and the performance speaks for itself.

Now to the point. Below is a concrete solution for many-to-many data, using the relationship between tags and articles as the example. You can extend the same reasoning to other complex many-to-many relationships, such as groups and users, or albums and the users tagged in them.

First, let's clarify the workflow, taking the second traditional approach as the baseline. In a conventional database design, publishing a post and inserting its tags takes three steps (or four, if you count checking whether each tag already exists). Step one: insert the article and get its ID. Step two: insert into the tag table while checking whether each tag exists; if it does, fetch its ID, otherwise insert the new tag and fetch the ID. Step three: insert the article ID and tag ID into the index table to establish the association. If you have built an index on that index table, it is disastrous, especially at large data volumes: it does speed up queries, but the publish speed may become unbearable.

Our approach is a further three-step treatment of the many-to-many relationship.
When using tags, the two most common operations are querying the articles under a tag and displaying an article's tags, so implementing those two cases is enough.
Step one: drop the index table.
Add a redundant TAG column to the article table, written as [TagID,TagName]|[TagID,TagName]|[TagID,TagName]. Symmetrically, add a redundant Article field to the TAG table with content like [ArticleID,Title]|[ArticleID,Title]|[ArticleID,Title]. When adding an entry we only need to APPEND to the string. For the structure of ARTICLE and TAG, see the introduction in my previous article; you can store even more fields as needed.
Why store TagName and ArticleTitle at all? To avoid cross-table queries and INNER JOIN queries. IN queries and cross-table queries cause full-table scans, so at execution time we must find an effective replacement for IN queries.
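A minimal sketch of the denormalized column format above ([TagID,TagName]|[TagID,TagName]|...): adding an association is a string APPEND, and reading it back needs no join. The helper names are invented for illustration.

```python
def append_entry(field, entry_id, name):
    """Append one "[id,name]" pair to a denormalized column value."""
    entry = "[%s,%s]" % (entry_id, name)
    return entry if not field else field + "|" + entry

def parse_entries(field):
    """Split a denormalized column value back into (id, name) pairs."""
    if not field:
        return []
    return [tuple(part.strip("[]").split(",", 1)) for part in field.split("|")]

tags = ""
tags = append_entry(tags, "1", "php")
tags = append_entry(tags, "2", "mysql")
print(tags)                 # → [1,php]|[2,mysql]
print(parse_entries(tags))  # → [('1', 'php'), ('2', 'mysql')]
```

Note this format assumes names contain no "|" or "]" characters; a production version would need escaping or a different delimiter.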

Step two: asynchronous loading.
In design-pattern terms we usually think of the singleton; here we use an unorthodox singleton: the indexing between articles and tags is handled by a dedicated process, asynchronously.
To avoid thread congestion caused by checking the TAG table at publish time, we adopt lazy loading. The server maintains a dedicated process for querying and indexing the tag and article fields; when publishing an article we hand the tag-synchronization step off to that separate program, which performs the indexing.
Step three: a cached tag index.
For frequently checked or popular tags we can also organize an effective index. For example, for the tags 疯狂代码 and 傲博知识库, we represent them with a tree. 疯狂代码 is indexed under 疯, which in program terms is 疯狂代码[0], and likewise 傲博知识库 under 傲博知识库[0]. The array for 疯 stores the group of tags beginning with 疯, and the array for 傲 stores the tags beginning with 傲. If the volume grows larger still, a second-level index can be added.

This touches another topic, word segmentation; the above is a simple segmentation scheme. If you have searched on Google you have surely noticed its Suggest feature, which works on the same principle. In the end, index the tags effectively and promote the popular ones into a global static variable, and we can bypass the database query entirely, a further evolution of the singleton process from step two.
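The first-character tag index described above might look like the following sketch. English tag names stand in for the Chinese examples, and a real system would add the second-level index mentioned above for larger volumes.

```python
from collections import defaultdict

# First-level index: bucket tags by their first character, as the article
# suggests for indexing 疯狂代码 under 疯. Tag names here are invented examples.
index = defaultdict(list)

def add_tag(tag):
    index[tag[0]].append(tag)

def suggest(prefix):
    # Only the bucket for the first character is scanned, not the whole table.
    return [t for t in index[prefix[0]] if t.startswith(prefix)]

for t in ["scala", "scaling", "sharding", "mysql"]:
    add_tag(t)
print(suggest("sca"))  # → ['scala', 'scaling']
```

Held as a global in memory, this answers suggest-style lookups without touching the database at all.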

The above is a simple architectural note on many-to-many relationships. Someone will surely ask: isn't the workload too big this way, with word segmentation and separate handling for every many-to-many relationship?
OK, we can abstract it one step further. Let TableA stand for the Article table and TableT for the Tag table; we abstract the fields away, down to an ID and a tag String, and likewise for the tag table. You should see what I mean by now.

Right: write a code generator that emits the code for each many-to-many relationship. It is easy to write, a few Appends will do. If you want it even more convenient, make the thing a singleton and abstract it out, then break a rule for once and turn it into a base class from which the other relationships inherit... The rest should be simple; I'll leave the concrete implementation for you to think through.

See the second article in the series for further optimization toward higher load capacity.

The next chapter continues with data partitioning and hashing.

Large Website Architecture Series, Part 2: An Overview of the Underlying Architecture

Regarding the Ajax-based attack introduced in Part 1 of this series, many people raised doubts, e.g. that it cannot cross domains, or that Ajax lightens the load. Ajax passes data through simple GET and POST; capture the traffic with an HTTP debugger, then apply the scheme below. I'll also write some sample attack code along the way. Such requests are actually easier to construct than traditional WebForms ones; the handling and packet-sending process is the same for WebForms and Ajax, and Ajax payloads are relatively small and a bit faster.
Combining SharpPcap and HttpWebRequest we construct a plausible, normal-looking IP packet. The real code is very long, so here it is in rough pseudocode:
request.CreateUrl(the Ajax handler page);
request.Method = "GetOrPost";
request.refere = "the referring page";
SharpPcap.SetLinkConnection(a forged IP address);
String content = request.GetResponseStream(); If a multi-threaded application uses this to batch-send packets against the target site (assume it runs DedeCMS), that is quite enough to bring the DedeCMS database down.

Now to the point:
For problem A raised last time, let's first walk through how a telecom carrier lays out its ADSL service. Visio isn't installed on this machine, so I'll describe the flow in plain text.
ADSL user A → enters username and password → remote connection to the account database (in Tianjin) → the account database connects to the billing database and returns a status → on success, connect to the PPPOE server and further to the billing database → authenticate the service and connect.

Nothing special here, but it is the same as QQ's messaging service: a unified user-verification server is used, and the user-verification database is read-only. What can we take away from this?

That was a simple example; now for the concrete architectural strategy. For problem A from the previous article, we again take the user database as the example for explanation and requirements.
First, estimate the required user volume. If we are building an academic community, the user count will not be large and we may not need to worry about this at all. For user-volume levels, let's tentatively define three tiers: millions (M), tens of millions (S), and hundreds of millions (Q); consider the common operations of login verification and queries; and elaborate on M and S.

As everyone knows, at this scale the issue with user-data load is not whether it is feasible, but how to maximize queries and updates while keeping the data synchronized across servers. I will not cover query optimization or indexing here, only the initial architectural schemes; wherever a scheme below involves full-table queries, partitioned views can be used, and you can search for the relevant material yourself.

For the M tier, an existing DBMS with a sensible layout can fully meet the need. The key problem to solve is actually I/O, and the treatment is relatively simple: spread the database's FILEs across disks (different physical drives, not partitions), and tune the number of drives and file partitions according to the size of the load.

For the S tier, the previous scheme no longer fully suffices, and we need to make small changes to the registration and insertion flow. The solution involves the concepts of data hashing and partitioned views; Google the specifics, I won't spell them out.
We commonly use one of three schemes. Scheme one is equal-capacity expansion: under registration control, cap each database at 5 million users; beyond that, overflow into a second database, and so on. This guarantees that the system can expand effectively, but not that the data is effectively indexed. Scheme two is a shared-partition index, similar in spirit to scheme one but a sensible optimization of it: store data partitioned by username. For example, create 26 databases and route by the first letter of the username; if the username is CrazyCoder, that user's data is stored in user table C, and at query time the username conveniently tells us where to look. Scheme two effectively solves the data-indexing problem. Scheme three is more of a model, combining schemes one and two: encode the user ID, not as an IDENTITY column but through a serialization scheme that stores the username in encoded form. If the username is CrazyCoder, the encoding algorithm digitizes it, storing C, R, A, ... as a numeric index, which is then stored partitioned. Numeric data can be queried, updated, and shared more efficiently in a database; combine this with schemes one and two and you have scheme three.
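Scheme two's username-based routing (CrazyCoder into table C) can be sketched as below. The fallback bucket for usernames that don't start with a letter is my own assumption, not part of the original scheme.

```python
import string

def user_shard(username):
    """Route a user's data to one of 26 databases by the first letter
    of the username, e.g. CrazyCoder -> shard 'C'."""
    first = username[0].upper()
    # Assumed fallback: anything non-alphabetic goes to a catch-all shard.
    return first if first in string.ascii_uppercase else "other"

print(user_shard("CrazyCoder"))  # → C
```

Because the shard is derived from the username itself, a login lookup knows which database to hit without consulting any directory.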


For the Q tier, the data volume is by now massive enough that any of these schemes gives you a headache, and it can no longer be handled with simple queries, though the S-tier treatment is the starting point. Here the scheme is to place data into temporary tables according to a user-activity weight combined with data volume. Barring unusual circumstances, the number of users logging in per day will not exceed ten million. What we need is a simple data-proxy program and a temporary user-verification database: a batch job runs once a day to extract the accounts of highly active users into the temporary database; queries hit the temporary database first and fall back to a full-database query on a miss. The threshold is estimated from the system's load, and the estimate differs from system to system.
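A minimal sketch of the Q-tier lookup above, with plain dicts standing in for the temporary hot-user database and the full store; the data and function names are invented.

```python
# Check the small table of recently active users first, and fall back to
# the full store only on a miss.
hot_users = {"alice": {"id": 1}}          # rebuilt nightly by the batch job
all_users = {"alice": {"id": 1}, "bob": {"id": 2}}

def lookup(username):
    record = hot_users.get(username)      # cheap: small, hot table
    if record is None:
        record = all_users.get(username)  # expensive: full/partitioned query
    return record

print(lookup("alice"), lookup("bob"))  # → {'id': 1} {'id': 2}
```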


The above briefly covered the M, S, and Q tiers. Next, a more advanced query scheme that sits on top of them: the data cache server, which you can also think of as a buffer server whose data is used read-only.

Concretely: because we are dealing with massive data, a DBMS's conventional caching no longer meets our requirements, so we need an effective caching scheme. The flow here is simply to put the most frequently and directly used data on the cache server, which periodically fetches updates from the primary server. That handles simple queries; we can go deeper and have the cache server do second-level caching: process the input once and store it in a LIST held in memory as a global variable for queries, while a HASHTABLE or arrays provide a (possibly multi-level) index over the data groups, distributing lookups across the variables and reading directly from memory.
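The second-level cache described above, an in-memory LIST with a HASHTABLE index over it, might be sketched like this; the refresh function stands in for the periodic pull from the primary server, and all data is invented.

```python
rows = []                 # the "LIST" of cached items, held in memory
index = {}                # hash index: lookup key -> position in rows

def refresh(snapshot):
    """Periodic pull from the primary: rebuild the list and index together."""
    rows.clear()
    index.clear()
    for key, value in snapshot:
        index[key] = len(rows)
        rows.append(value)

refresh([("a", {"hits": 10}), ("b", {"hits": 3})])
print(rows[index["b"]])  # → {'hits': 3}
```

Queries read straight from memory via the hash index and never touch the primary store between refreshes.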

In my experience, for items no larger than 10 KB, the sweet spot for each list is between 0 and 60,000 entries.

This has been a brief introduction to the basic DBMS architecture; there are many more details in the concrete handling, and this is only a rough outline. If you have questions, email me (Heroqst # Gmail.com), replacing the # with @.


This was only a quick sketch of the basic DBMS layout. The next chapter gives concrete configuration notes for the many-to-many relational databases we commonly encounter.

First, an outline of the problem: for articles and tags, each article can have multiple tags while each tag in turn has multiple articles, so the data volume is the article count times the tag count. How to handle this and index it effectively is what the next chapter, on the optimized design of many-to-many relationships, will cover.

Large Website Architecture Series, Part 1: Preface, and the Problems You Cannot Avoid

Preface: my machine broke down over the past two days and is out for repair, so I'm writing a series of articles on large website architecture, hoping to be of some help to webmaster friends who aspire to make something of themselves on the internet.

Note: "large website architecture" here covers only data-heavy sites with high interactivity. For reasons everyone knows, we will not discuss news sites or architectures achievable through static HTML alone; our examples are sites with high load, high data exchange, and high data mobility, such as Hainei, Kaixin001, and similar Web 2.0 architectures. We will not debate PHP versus JSP versus .NET here; we look at the problems from the architectural side. The implementation language is not the issue; a language's advantage lies in implementation, not in inherent superiority. Whichever language you choose, the architecture must be faced all the same.

Now to the point:
First, let's discuss the problems a large website must note and think through.
A. Handling massive data.
As everyone knows, for relatively small sites the data volume is not large; SELECT and UPDATE solve the problems we face, the load itself is light, and at most a few extra indexes will do. For a large site, the daily data volume may reach the millions. A badly designed many-to-many relationship causes no problems at all in the early stages, but as users grow, the data volume grows geometrically. At that point the cost of a single-table SELECT or UPDATE (never mind multi-table joins) becomes very high.
B. Handling data concurrency.
At times, every 2.0 CTO carries the same imperial sword: caching. But under high concurrency and heavy processing, caching itself becomes a big problem. The cache is globally shared across the application, yet when we modify it, if two or more requests demand updates to the cache at the same time, the application simply dies. What's needed is a good data-concurrency strategy along with a caching strategy.
There is also database deadlock. We may not feel it in ordinary times, but under high concurrency the probability of deadlock is very high, and disk caching is a big problem as well.
C. File storage.
For 2.0 sites that support file upload, while we celebrate ever-larger hard disks, what deserves more thought is how files should be stored and effectively indexed. A common scheme stores files by date and type. But when the file count is truly massive, if a single drive holds 500 GB of fragmented little files, disk I/O becomes a huge problem during maintenance and use; even if your bandwidth suffices, your disks may not keep up. If uploads are hitting it at the same time, the disk can easily be overwhelmed.
Perhaps RAID and dedicated storage servers solve the immediate problem, but there remains the question of geographic access: our servers may sit in Beijing, so how do we address access speed from Yunnan or Xinjiang? And if we go distributed, how should our file index and architecture be planned?
So we have to admit that file storage is not an easy problem at all.
D. Handling data relationships.
We can easily lay out a database in third normal form, filled with many-to-many relationships, and even replace IDENTITY COLUMNs with GUIDs. But in a 2.0 era saturated with many-to-many relationships, the third normal form is the first thing that should be abandoned. Multi-table join queries must be driven to a minimum.
E. Data indexing.
As everyone knows, indexes are the most convenient, cheapest, and easiest way to improve database query efficiency. But under heavy UPDATE load, the cost paid by update and delete becomes unimaginably high. I once hit a case where updating a clustered index took ten minutes to complete; for a website, that is essentially intolerable.
Indexing and updates are natural-born enemies. Problems A, D, and E are ones we cannot avoid when designing an architecture, and they may also be the ones that consume the most time.
F. Distributed processing.
Because 2.0 sites are highly interactive, a CDN achieves essentially nothing: content is updated in real time, which we handle in the usual way. To guarantee access speed from every region, we face an enormous problem: how to synchronize and update data effectively. Real-time communication among servers in different locations is another problem that must be considered.
G. Weighing the pros and cons of Ajax.
Ajax giveth and Ajax taketh away. Ajax has become the mainstream trend, and suddenly POST and GET over XMLHTTP feel so easy. The client GETs or POSTs data to the server, and the server returns the result: a perfectly normal Ajax request. But in handling Ajax, if we use a packet-capture tool, the returned data and its handling are laid out in plain view. For computation-heavy Ajax requests, we can build a packet generator and very easily take a web server down.
H. Analyzing data security.
Under HTTP, packets are transmitted in plaintext. You may say we can encrypt, but with problem G in mind, the encryption process itself may as well be plaintext (for example, with QQ it is easy to work out its encryption and write matching encryption and decryption routines). When your site's traffic is small, nobody cares about you; once your traffic rises, the so-called bots and mass-senders follow (as QQ's earliest mass messages showed). We may proudly say we can adopt higher-level checks or even HTTPS, but note: when you do, the price is massive database, I/O, and CPU cost, and against some mass-senders it is basically hopeless. I can already mass-post to Baidu Space and QQ Zone; try it yourself, it really is not hard.
I. Data synchronization and clustering.
When a single database server buckles under the load, it is time for database-level load balancing and clustering, and that may be the most vexing problem of all. Depending on the database design, data transferred over the network suffers latency that is frightening and unavoidable, so we need other means to maintain effective interaction during those few seconds, or even several minutes, of delay: data hashing, partitioning, content processing, and so on.
K. Data-sharing channels and the OpenAPI trend.
OpenAPI has become an unavoidable trend. From Google, Facebook, and MySpace to Hainei and Xiaonei, everyone is thinking about it. It retains users more effectively, stimulates more of their interest, and lets more people help you do the most effective development. An effective data-sharing platform, an open data platform, becomes an indispensable avenue, and guaranteeing data security and performance while the interfaces are open is yet another problem we must think through seriously.

Of course there is more to consider; I have written down only the problems most in need of attention. Additions are welcome.

The next article, An Overview of the Underlying Architecture, will propose concrete solutions and approaches for problem A.