About SSD
Dongjun Shin, Samsung Electronics
Outline
SSD primer
Optimal I/O for SSD
Benchmarking Linux FS on SSD
Case study: ext4, btrfs, xfs
Design consideration for SSD
What’s next?
New interfaces for SSD
Parallel processing of small I/O
SSD Primer (1/2)
Physical unit of flash memory
NAND page – unit for read & write
NAND block – unit for erase (a.k.a. erasable block)
Physical characteristics
Erase before rewrite
Sequential write within an erasable block
[Figure: the Flash Translation Layer maps the LBA space (visible to the OS) onto the flash memory space; NAND page = 2–4kB, NAND block = 64–128 NAND pages]
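The FTL is what hides erase-before-rewrite from the host: an overwrite goes to a fresh NAND page and the old copy is invalidated. A minimal page-mapping sketch of that idea (the geometry, names, and append-only write pointer are illustrative assumptions, not any vendor's actual FTL):

```c
#include <stdint.h>
#include <string.h>

#define PAGES_PER_BLOCK 64          /* illustrative geometry */
#define NUM_BLOCKS      1024
#define NUM_PAGES       (PAGES_PER_BLOCK * NUM_BLOCKS)

static uint32_t l2p[NUM_PAGES];     /* logical page -> physical page      */
static uint8_t  valid[NUM_PAGES];   /* does this physical page hold live data? */
static uint32_t next_free = 0;      /* append-only write pointer          */

/* Overwriting a logical page never rewrites it in place: the FTL programs
 * the next free physical page (sequentially within a block) and marks the
 * old copy invalid.  Erase happens later, one whole block at a time. */
static uint32_t ftl_write(uint32_t lpn)
{
    uint32_t old = l2p[lpn];
    if (old != UINT32_MAX)
        valid[old] = 0;             /* old copy becomes garbage         */
    uint32_t ppn = next_free++;     /* a real FTL picks a block/plane here */
    l2p[lpn] = ppn;
    valid[ppn] = 1;
    return ppn;                     /* physical page actually programmed */
}

void ftl_init(void)
{
    memset(valid, 0, sizeof(valid));
    memset(l2p, 0xff, sizeof(l2p)); /* UINT32_MAX = unmapped */
}
```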
SSD Primer (2/2)
Internal organization: 2-dimensional (N x M parallelism)
Similar to RAID-0 (stripe size = sector or NAND page)
Effective page & block size is multiplied by N x M (max)
[Figure: LBAs striped across N channels (Ch0–Ch3) and M ways per channel (Chip0–Chip3); SSD controller with host I/F (ex. SATA) running the FTL firmware. N-channel = striping, M-way = pipelining]
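A small sketch of how the effective units scale with N x M parallelism; the 4-channel x 4-way, 4kB-page, 64-page-block geometry is just the example from the figure above, not a claim about any particular drive:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative NAND geometry (see SSD Primer 1/2) */
    unsigned page_kb       = 4;     /* NAND page: 2-4kB          */
    unsigned pages_per_blk = 64;    /* NAND block: 64-128 pages  */

    /* Illustrative internal parallelism (see figure above) */
    unsigned channels = 4;          /* N: striping               */
    unsigned ways     = 4;          /* M: pipelining per channel */

    unsigned eff_page_kb  = page_kb * channels * ways;
    unsigned eff_block_kb = eff_page_kb * pages_per_blk;

    printf("effective page  = %u kB\n", eff_page_kb);   /* 64 kB   */
    printf("effective block = %u kB\n", eff_block_kb);  /* 4096 kB */
    return 0;
}
```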
Optimal I/O for SSD
Key points
Parallelism: the larger the I/O request, the better
Match with physical characteristics
• Alignment with NAND page or block size*
• Segmented sequential write (within an erasable block)
What about Linux?
HDD also favors large I/O: readahead, deferred & aggregated writes
Segmented FS layout is good if aligned with erasable block boundaries
Write optimization is FS dependent (ex. allocation policy)
* Usually the partition layout is not aligned (1st partition at LBA 63); see the alignment sketch after this slide
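A minimal userspace sketch of the alignment check implied by the footnote; the 512-byte sector and the 4MB effective erase block are assumed values for illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* Returns 1 if the partition start falls on an erase-block boundary. */
static int lba_aligned(uint64_t start_lba, unsigned sector_size,
                       unsigned erase_block_bytes)
{
    return (start_lba * sector_size) % erase_block_bytes == 0;
}

int main(void)
{
    unsigned sector = 512;                  /* assumed sector size     */
    unsigned eblk   = 4u * 1024 * 1024;     /* assumed effective block */

    /* Legacy DOS layout: 1st partition at LBA 63 -> misaligned. */
    printf("LBA 63:    %s\n", lba_aligned(63, sector, eblk) ? "aligned" : "misaligned");

    /* Test partition used later in this talk: LBA 16384 (8MB) -> aligned. */
    printf("LBA 16384: %s\n", lba_aligned(16384, sector, eblk) ? "aligned" : "misaligned");
    return 0;
}
```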
Test environment (1/2)
Hardware
Intel Core 2 Duo [email protected], 1GB RAM
Software
Fedora 7 (kernel 2.6.24), benchmark: postmark
Filesystems
No journaling: ext2
Journaling: ext3, ext4, reiserfs, xfs
• ext3, ext4: data=writeback,barrier=1[,extents]
• xfs: logbsize=128k
COW, log-structured: btrfs (latest unstable, 4k block), nilfs (testing-8)
SSD
Vendor M (32GB, SATA): read 100MB/s, write 80MB/s
Test partition starts at LBA 16384 (8MB, aligned)
Test environment (2/2)
Postmark workload
Ref: Evaluating Block-level Optimization through the I/O Path (USENIX 2007)

Workload | File size | # of files (working set) | # of transactions | Total app read/write
SS       | 9–15K     | 10,000                   | 100,000           | 630M/755M*
SL       | 9–15K     | 100,000                  | 100,000           | 600M/1.8G
LS       | 0.1–3M    | 1,000                    | 10,000            | 9.7G/12G
LL       | 0.1–3M    | 4,250                    | 10,000            | 9G/17G
* Mostly write-only
Benchmark results (1/2)
Small file size (SS, SL)
[Chart: transactions/sec (0–2500) for the SS and SL workloads across ext2, ext3, ext4, reiserfs, xfs, btrfs, nilfs]
Benchmark results (2/2)
Large file size (LS, LL)
[Chart: transactions/sec (0–30) for the LS and LL workloads across ext2, ext3, ext4, reiserfs, xfs, btrfs, nilfs]
I/O statistics (1/2)
Average size of I/O
[Chart: average I/O size in kB (0–140), read and write, per workload (SS, SL, LS, LL) and per filesystem (ext2, ext3, ext4, reiserfs, xfs, btrfs, nilfs)]
I/O statistics (2/2)
Segmented sequentiality of write I/O (segment: 1MB)
[Chart: write sequentiality per workload (SS, SL, LS, LL) and per filesystem (ext2, ext3, ext4, reiserfs, xfs, btrfs, nilfs); most bars on a 0–20% scale, some at 100%]
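The slide does not spell out how the metric is computed; one plausible reading, sketched below, is the fraction of write requests whose start sector directly continues the previous write within the same 1MB segment (the trace format and names are assumptions):

```c
#include <stdint.h>
#include <stddef.h>

#define SEG_BYTES   (1u << 20)              /* 1MB segment               */
#define SECTOR      512u
#define SEG_SECTORS (SEG_BYTES / SECTOR)
#define MAX_SEGS    (1u << 20)              /* enough for a 512GB device */

struct write_req { uint64_t sector; uint32_t nr_sectors; };

static uint64_t last_end[MAX_SEGS];         /* per-segment end of previous write; 0 = none */

/* Fraction of write requests that continue sequentially within their
 * own 1MB segment. */
double segmented_sequentiality(const struct write_req *trace, size_t n)
{
    size_t seq = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t seg = trace[i].sector / SEG_SECTORS;
        if (seg >= MAX_SEGS)
            continue;
        if (last_end[seg] != 0 && last_end[seg] == trace[i].sector)
            seq++;                           /* picks up where the last write in this segment ended */
        last_end[seg] = trace[i].sector + trace[i].nr_sectors;
    }
    return n ? (double)seq / (double)n : 0.0;
}
```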
Case study: ext4
Condition: data=ordered, allocation: default / no-reservation / old-alloc
Observations:
1. Almost no difference between allocation policies
2. Why is data=ordered better for SL?
[Chart: transactions/sec (0–1200) for SS and SL with ext4-wb, ext4-ord, ext4-nores, ext4-olda]
Case study: btrfs
Condition: block size 4k/16k, allocation: ssd mount option on/off
Observations:
1. 4k is better than 16k (sequentiality = 12% vs. 2%)
2. The ssd option is effective (10–40% improvement)
[Chart: transactions/sec (0–1800) for SS, SL, LS, LL with btrfs-4k, btrfs-16k, btrfs-ssd-4k]
Case study: xfs
Condition: mount with barrier on/off
Observation: large barrier overhead...
[Chart: transactions/sec (0–800) for SS, SL, LS, LL with xfs-bar, xfs-nobar]
Design consideration for SSD
Lessons from flash FS (ex. logfs)
Sequential writing at multiple logging points
Wandering tree
• Trade-off between sequentiality and amount of write
• Cf. space map (Sun ZFS)
Need to optimize garbage collection overhead (see the GC sketch after this slide)
• Either in the FS itself or in the FTL of the SSD
Next topic: end-to-end optimization
Exchange info with SSD (trim, SSD identification)
Make best use of parallelism
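To make the GC overhead concrete, here is a minimal greedy victim-selection sketch in the style common to FTLs and log-structured filesystems; the structures and the "copy live pages, then erase" cost model are generic illustrations, not logfs or any shipping FTL:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGES_PER_BLOCK 64

struct nand_block {
    uint8_t  valid[PAGES_PER_BLOCK];   /* 1 = page still holds live data */
    unsigned valid_count;
};

/* Greedy policy: reclaim the block with the fewest live pages, because
 * every live page must be copied elsewhere before the block can be
 * erased.  That copy traffic is the GC overhead the slide refers to. */
size_t pick_gc_victim(const struct nand_block *blk, size_t nblocks)
{
    size_t victim = 0;
    for (size_t i = 1; i < nblocks; i++)
        if (blk[i].valid_count < blk[victim].valid_count)
            victim = i;
    return victim;
}

/* Cost of collecting one block = live-page copies + one block erase. */
unsigned gc_cost(const struct nand_block *b)
{
    return b->valid_count + 1;
}
```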
New interfaces for SSD (t13.org)
Trim command
Let the device know which LBA range is not in use (see the userspace sketch after this list)
• This will be helpful for optimizing the FTL
Should be passed through the whole stack: FS → bio → SCSI → libata
• Passing a bio with no data
• What about I/O reordering & I/O queuing?
SSD identification (added to “ATA IDENTIFY”)
Report the size of the page and the erasable block
• Physical or effective?
Useful for FS and volume manager
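The trim proposal discussed above later became reachable from userspace via the BLKDISCARD block-device ioctl; a minimal sketch, assuming a kernel and device that support discard (the device path and range are placeholders):

```c
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* BLKDISCARD */

int main(void)
{
    /* Placeholder range: discard 1MB starting at byte offset 8MB. */
    uint64_t range[2] = { 8ull << 20, 1ull << 20 };   /* {offset, length} in bytes */

    int fd = open("/dev/sdX", O_WRONLY);              /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }

    /* Tells the device (via the block layer) that this LBA range holds no
     * useful data, so the FTL may reclaim it without copying it around. */
    if (ioctl(fd, BLKDISCARD, &range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}
```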
Parallel processing of small I/O
Make better use of I/O queuing (TCQ or NCQ)
Parallel processing of small I/O
Desktop environment? Barrier?
[Figure: four queued requests A, B, C, D mapping to channels Ch0, Ch1, Ch1, Ch3 of the SSD; without I/O queuing they are dispatched one at a time (4 steps, chips mostly idle), with I/O queuing independent channels are served in parallel (2 steps, chips kept busy)]
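One way to keep such a queue filled from userspace is asynchronous I/O. A minimal libaio sketch that submits several small reads at once so NCQ/TCQ can spread them across channels; the queue depth, offsets, and device path are arbitrary placeholders:

```c
/* Build with: gcc -o qread qread.c -laio   (libaio assumed installed) */
#define _GNU_SOURCE                 /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define QDEPTH 4                    /* arbitrary queue depth   */
#define IOSIZE 4096                 /* small, page-sized reads */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb iocbs[QDEPTH], *iocbp[QDEPTH];
    struct io_event events[QDEPTH];

    int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);   /* hypothetical device */
    if (fd < 0) { perror("open"); return 1; }
    if (io_setup(QDEPTH, &ctx) < 0) { perror("io_setup"); return 1; }

    for (int i = 0; i < QDEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, 4096, IOSIZE)) return 1;   /* O_DIRECT alignment */
        /* Scatter the reads so they are likely to hit different chips/channels. */
        io_prep_pread(&iocbs[i], fd, buf, IOSIZE, (long long)i * (8 << 20));
        iocbp[i] = &iocbs[i];
    }

    /* All QDEPTH requests reach the device together, letting NCQ/TCQ
     * service independent channels in parallel instead of one by one. */
    if (io_submit(ctx, QDEPTH, iocbp) != QDEPTH) { perror("io_submit"); return 1; }
    if (io_getevents(ctx, QDEPTH, QDEPTH, events, NULL) < 0) { perror("io_getevents"); return 1; }

    io_destroy(ctx);
    close(fd);
    return 0;
}
```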
Summary
Optimization for SSD
Alignment is important
Segmented sequentiality
Make better use of parallelism (either small or large)
• I/O barrier may stall the pipelined processing
What can you do?
File system: alignment, allocation policy, design (ex. COW)
Block layer: bio w/ hint, barrier, I/O queueing, scheduler(?)
Volume manager: alignment, allocation
Virtual memory: readahead
References
T13 spec for SSD
http://www.t13.org/documents/UploadedDocuments/docs2007/e07153r0Soil
http://www.t13.org/documents/UploadedDocuments/docs2007/e07154r0Not
Introduction to SSD and flash memory
http://download.microsoft.com/download/a/f/d/afdfd50d6eb9425e84e1b40
http://download.microsoft.com/download/d/f/6/df6accd54bf249848285f4f2
http://download.microsoft.com/download/a/f/d/afdfd50d6eb9425e84e1b40
FTL description & optimization
BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage (FAST ’08)
Appendix. I/O Pattern
SS workload – ext4, xfs
Appendix. I/O Pattern
SS workload – btrfs, nilfs
Appendix. I/O Pattern
SL workload – ext4, xfs
Appendix. I/O Pattern
SL workload – btrfs, nilfs
Appendix. I/O Pattern
LS workload – ext4, reiserfs, xfs
Appendix. I/O Pattern
LS workload – btrfs, nilfs
Appendix. I/O Pattern
LL workload – ext4, reiserfs, xfs
Appendix. I/O Pattern
LL workload – btrfs, nilfs