I have reviewed draft-dnoveck-nfsv4-internationalization.

In my opinion, this draft is extremely important to the Internet community and beyond, and should progress. This being an early review, perhaps I should stop there. However, there is an important, long-running, low-volume debate to finally settle here, and it has to be settled in the I18N community.

The architectures and realities of the relevant operating systems make it impracticable for us to put the onus for I18N on the filesystem _protocols_. No, that onus can _only_ live in the _filesystems_. I cannot stress this enough.

If you stop reading here, you can take just the above paragraph with you and consider it carefully. If you continue reading, please forgive me for the length of this post.

The document at hand is almost entirely dedicated to convincing the present audience of the above premise and fact. Most of the first ten pages are non-normative text, and when it gets to what happens in reality... it's essentially still informative rather than normative text. The I-D even modifies the meaning of RFC2119 so it can pretend to be normative while not really being normative, all so it can continue the fiction that I18N belongs in NFSv4 (and what about WebDAV? and SFTP? and ...?) and not in the filesystem.

These assertions may cause friction. Therefore I seek to convince you, as the author tries as well, but I want to go further: I want to stop pretending that the filesystem _protocol_ can be responsible for I18N.

Even if this viewpoint ends up on the rough side of consensus, the running code can. not. change. Anyone who wishes to argue that we can only target the protocols and not the filesystems needs to consider this fact. The architecture of that running code has been as it is for many decades -- almost as many decades as there has been an Internet community!

The author gets to the nub of it in section 3, which on pages 5 and 6 says (with marked elisions):

   During the period from the publication of RFC3010 [14] until now, two different perspectives with regard to internationalization have been held and represented, to varying degrees, in specifications for NFSv4 minor versions.

   o  The perspective held by NFSv4 implementers treated most aspects of internationalization as basically outside the scope of what NFSv4 client and server implementers could deal with. This was because the POSIX interface treated filenames as uninterpreted strings of bytes, ...

   o  Within the IETF in general and in the IESG, there was a feeling that new protocols, such as NFSv4, could not avoid dealing with internationalization issues, ...

It has now come time to finally settle this debate, these 'different perspectives'.

The essential detail that we cannot alter is the architecture of almost every general-purpose operating system, such as Unix, Unix-like derivatives (e.g., BSD and derivatives), Unix-like non-derivatives (e.g., Linux), and even Windows, as well as others. Specifically:

 - there is a pluggable filesystem API -- the virtual filesystem switch (VFS);

 - filesystem protocol clients are plugins for the VFS;

 - filesystem protocol servers operate above the VFS;

 - the VFS API, and the SPI that plugins implement, are in the main I18N-unaware -- they are just-use-8 (BSD, Linux, Unix) or just-use-UTF-16 (I believe Win32 also leaves I18N to the filesystems, though I may be wrong about this);

 - the VFS and below are utterly unaware of the locale or even codeset used by application clients of that API.

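The last two points can be illustrated with a minimal Python sketch of how such an I18N-unaware layer sees a filename. The function name `vfs_accepts_component` and the rule it encodes are a simplification for illustration, loosely modeled on Unix-like behavior and not taken from any particular kernel: a path component is just bytes, and only NUL and '/' (plus the empty name) are rejected.

```python
import unicodedata

def vfs_accepts_component(name: bytes) -> bool:
    """Hypothetical model of an I18N-unaware name check.

    Only two byte values are special: NUL (the C string terminator) and
    0x2F ('/', the path separator). Encoding and locale are never consulted.
    """
    return len(name) > 0 and b"\x00" not in name and b"/" not in name

# The same user-visible name, "é", in two Unicode normalization forms,
# plus a legacy (non-UTF-8) encoding of it.
nfc = unicodedata.normalize("NFC", "\u00e9").encode("utf-8")
nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")
latin1 = "\u00e9".encode("latin-1")

for candidate in (nfc, nfd, latin1):
    # Every candidate passes: this layer cannot tell them apart from any
    # other byte string, let alone recognize them as "the same name".
    print(candidate, vfs_accepts_component(candidate))

# The two Unicode forms are different byte strings, hence different names
# as far as a byte-comparing layer is concerned.
print(nfc == nfd)
```

Under this model all three byte strings are acceptable names and the two Unicode forms compare unequal, which is exactly the gap that any I18N handling has to fill somewhere.
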
Indeed, on Unix and Unix-like systems, the C library system call stubs, the system calls themselves, and the entirety of the VFS, treat filenames and paths as mostly-binary blobs with just two special byte values: NUL (because these are C strings) and 0x2F (ASCII '/', because it's the filesystem component separator, as there is no array-of-components representation of paths in the various system calls), and a few special names in ASCII (e.g., ".", "..").

The kernel side of all of this is even less aware of user-level locale selection (not. at. all.) than it is of user-level codeset selection (NUL and / being special and ASCII, so only ASCII and superset codesets need apply).

That this set of facts is common to such diverse operating systems should be indicative of how natural this architecture is. It's really quite standard to have pluggable interfaces for this sort of functionality, and it's not at all surprising that software architectures that evolved in the 1980s didn't account for I18N.

To be sure, there are special-purpose fileservers, and those might not have a VFS -- who knows what they do. But that hardly matters because it suffices that we have a decades-long history of VFS architectures in widespread present use. That is running code, much, much running code.

The fact that filesystem protocol servers operate _above_ the VFS essentially rules out implementation in, e.g., NFSv4 servers, of I18N behaviors such as:

 - normalize on CREATE

   Sure, NFSv4 servers could, but what about POSIX and WIN32 applications running on the same server? What about other filesystem protocol servers on the same system? They sure don't and won't, and we can't make them do it.

 - preserve form on CREATE and do form-insensitive matching on LOOKUP

   This could be implemented, but conflicts can't be avoided because... well, what about POSIX and WIN32 applications running on the same server? ... (Ditto.)

 - reject non-Unicode (non-UTF-8 in the case of NFSv4)

   Sure, NFSv4 servers could, but what about POSIX and WIN32 applications running on the same server? ... (Ditto.)

   Should NFSv4 servers filter out non-UTF-8 filenames in READDIR?

 - apply specific mappings in case-insensitive filesystems

   (Ditto.)

There's almost no major I18N best practice that an NFSv4 fileserver can reliably implement on a general-purpose operating system! Just about the only I18N best practice an NFSv4 fileserver can apply is to refuse to CREATE new non-UTF-8 filenames.

So why should we have an I18N burden on NFSv4 at all?

If the above is not enough to convince the reader, then what about the other Internet filesystem protocols, WebDAV and SFTP? If multiple Internet filesystem protocols can (and they do) co-exist on the same servers as NFSv4, sharing the same content, how can they have different I18N requirements and recommendations? The answer is obvious: they can't.

And what about non-Internet filesystem protocols, such as:

 - Lustre
 - OpenAFS
 - Auristor
 - CIFS/SMB
 - ...

that also co-exist with Internet filesystem protocols? Can't we advise their designers and implementors, and look to them to learn from their I18N choices? Well, we can't impose I18N requirements on them, no, except by proxy via the Internet filesystem protocols they also implement (or allow), but again, that just doesn't work.

And that brings up third-party implementations of Internet filesystem protocols on general-purpose operating systems. Those can't possibly force _our_ I18N values on the platform's native non-Internet filesystem protocols. E.g., an SFTP server on Windows co-existing with SMB.

What a mess, no?

But there is a saving grace. There is one unifying thread: the VFS architecture, that I18N-unaware layer above the actual filesystems. It turns out that this is the key to the puzzle.

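A toy model may make the layering argument concrete. In the Python sketch below, every name (`ByteKeyedFS`, `FormInsensitiveFS`, `nfs_create_normalizing`, `local_create`) is hypothetical and the dictionaries merely stand in for on-disk directories; nothing here reflects the code of a real NFSv4 server or filesystem. The first half shows a protocol server that normalizes on CREATE above a byte-comparing filesystem next to a local application that does not; the second half shows the same two writers when the comparison is done inside the filesystem instead.

```python
import unicodedata

class ByteKeyedFS:
    """Hypothetical I18N-unaware filesystem: names are compared as raw bytes."""
    def __init__(self):
        self.entries = {}

    def create(self, name: bytes) -> bool:
        if name in self.entries:
            return False              # stands in for EEXIST
        self.entries[name] = b""
        return True

class FormInsensitiveFS(ByteKeyedFS):
    """Hypothetical filesystem that preserves the form given on create but
    matches form-insensitively, here by indexing on an NFC key.
    Assumes names are valid UTF-8 (a 'forbid non-Unicode' policy)."""
    def __init__(self):
        super().__init__()
        self.index = {}               # NFC key -> name exactly as created

    def create(self, name: bytes) -> bool:
        key = unicodedata.normalize("NFC", name.decode("utf-8"))
        if key in self.index:
            return False              # an equivalent name already exists
        self.index[key] = name
        self.entries[name] = b""
        return True

def nfs_create_normalizing(fs, name: str) -> bool:
    """Stand-in for a protocol server above the VFS that normalizes to NFC."""
    return fs.create(unicodedata.normalize("NFC", name).encode("utf-8"))

def local_create(fs, name: str) -> bool:
    """Stand-in for a local application: bytes are passed through untouched."""
    return fs.create(name.encode("utf-8"))

decomposed = "re\u0301sume\u0301"     # "résumé" in decomposed form

# Normalization done only by the protocol server: the local writer bypasses
# it, and the directory ends up with two equivalent-looking names.
plain = ByteKeyedFS()
nfs_create_normalizing(plain, decomposed)
local_create(plain, decomposed)
print(len(plain.entries))

# Equivalence handled in the filesystem: every writer gets the same answer.
aware = FormInsensitiveFS()
first = nfs_create_normalizing(aware, decomposed)
second = local_create(aware, decomposed)
print(first, second, len(aware.entries))
```

In this model the byte-keyed directory ends up with two entries while the form-insensitive one refuses the second create, regardless of which path the request arrived by.
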
This blissful lack of awareness of I18N at the VFS layer means we can push I18N all the way down to the filesystem and get good results.

Some of us reached this conclusion almost twenty years ago, when ZFS implemented I18N in the filesystem. Even before that, engineers at Apple seem to have reached similar conclusions.

In fact, all the problems of filesystem I18N are relatively easy to address if we push them into the filesystem. Yes, different filesystem specifications and implementations may well make different I18N choices -- they already do anyway, and we can't exactly force them to change.

There are only a few I18N problems to address in the filesystem. I'll focus here only on filenames (and pathnames). We can describe them and specify solutions as a BCP or even Standard, and hopefully those filesystems that don't yet implement any of these I18N behaviors can get the hint and start doing so.

These problems are:

 - Unicode equivalence

   There are two approaches in the wild:

    - normalize on CREATE (and typically also LOOKUP)

      HFS+, for example, does this. HFS+ normalizes to something close to NFD, while input methods generally produce sequences closer to NFC, at least for Latin scripts. Other filesystems could well go for NFC, which serves to illustrate that there is a variety of I18N behavior in the wild.

    - form-preserving on CREATE, form-insensitive on LOOKUP

      ZFS, for example, does this. Again, diverse I18N behaviors in the wild.

   A third and unsatisfying approach is to do nothing. Naturally we would not endorse that approach -- we might not even mention it.

 - Case mappings

   These are only relevant to case-insensitive filesystems. It is not uncommon to have a single server sharing multiple different filesystems, some of which are case-sensitive and some of which are case-insensitive.

   Here the main problem is that there can be only a single set of mappings per filesystem, and this set of mappings may vary by locale.
   Ergo, each case-insensitive filesystem needs to specify a locale or default to a sensible one.

   Note that knowing the locale of user application processes does not help here, because it is just not possible to have different case mappings in the same case-insensitive filesystem for different users.

 - What to do about non-Unicode file names

   This is a matter of legacy. We, the IETF, can say that Internet filesystem protocol servers MUST NOT allow the creation of new such names, but forbidding such names in the results of listing directories is harder. We can even pretend legacy filesystem content does not exist. Still, there are only two sensible policies a filesystem might implement:

    - forbid non-Unicode;

    - allow non-Unicode, making no attempt to deal with equivalence.

A document that explains all of the above and correctly directs I18N requirements mainly at filesystems can be shorter than the document I just reviewed, and can avoid the uncomfortable attempt at providing alternate definitions of RFC2119 terms.

Let us do that. I volunteer to author or edit such a document if need be.

All that said, there is one way in which I18N does apply specifically to NFSv4: in non-filename Unicode strings, such as the name@domain representation of users and groups in access control lists (ACLs). Fortunately there is no controversy about that, or about the choices made in NFSv4 regarding those, and nothing more need be said.

Nico
--