Year 2000? Viruses? Protect Your Files!

By Bill Nicholls
October 21, 1999

With all the worry about year 2000 bugs, viruses and e-mail Trojan horses, many small and medium-size businesses are working hard to prevent these problems. This is a prudent business step, but systems will fail and people will delete the wrong files despite these precautions. In addition to prevention, let's look at a complementary approach -- make it easy to recover from problems that damage vital files. While this approach is not a panacea, it does provide file recovery in whatever depth you decide to implement.

The solution I propose is an automatic archive of important user files. This facility should be available all the time on your internal network, make it easy for users to recover files, and be immune to common Windows viruses and Y2K problems. It also should be inexpensive, reliable, and require minimal human support. Many people would call this a pipe dream, but in this instance they would be wrong. It is possible to achieve a working system that meets those requirements. Sounds unlikely? Read on.

Because the word backup carries a common computer meaning (copy all files to tape), I will use the word archive instead. The dictionary has this definition for the word 'archive':
  1. A place where public records and historical documents are kept.
  2. Public records, documents, etc., as kept in such a depository.

The word depository is defined as a place for keeping things safe. So an archive is a place to keep public records and documents safe. This is quite distinct from the computer industry use of the word backup.

Let me digress a moment on the reasons to create an archive. Most files are lost because of human error, not hardware or software problems. I include virus attacks in the human category because that is where they originate, helped on by software weaknesses. While we can make some hardware reliable to four nines (99.99 percent), we have yet to make people anywhere near that reliable. Since this is not likely to change soon, we should make it possible to recover lost files as quickly and easily as possible, regardless of the cause of the loss.

Every company, regardless of type or size, has its most important part go home at night. The second most important part is the knowledge and skill sets of the employees. This is mainly stored today in human memory and in files scattered around the network on various user systems. This critical data is rarely backed up unless the company has forced everyone to store all of their files on the server, which is backed up by the company. That approach puts a burden on the server, the network, and the employees, and is not usually enforced. Left to themselves, most users won't archive until they lose something important, and sometimes not even then. For a home user, this may be an acceptable risk. For a company, it is not.

Planning The Archive System

What I will describe here can be implemented on any standard network using TCP/IP or SMB (NetBIOS) as its transport method. The size of the network, measured by the number of systems attached, can be smaller than my six-system network, or as large as a company division. The principles are the same; what changes is the size and speed of the required hardware.

Let's start with identifying the assumptions I'll be making:

  1. Most of the user systems run one of the Microsoft Windows OSes
  2. The archive will not be stored on existing servers
  3. Server systems and other large databases are excluded (see note)

Note: There are more efficient ways to protect server-based data and databases. But critical server files such as password and configuration files can and should be archived.

For small networks, up to about 20 systems, a small archive system will do. How small? As little as a 486/DX2-66 with 16 Mbyte of memory and sufficient hard disk space, plus the OS requirements (see Selecting The OS). Most companies have retired older systems, some of them Pentium class. But CPU speed is not the critical item with this approach, nor is memory. The critical items are the network throughput and the total size of all the files to be archived.

When you plan the size of your archive system, first determine the total file size to be secured in the archive. This does not mean you need or want to archive all of the files on the network, only the critical ones. To make the total file size more manageable, you should exclude all files that can be reloaded from original media, such as OS and application installs. Keep the original media in a fireproof safe, or make copies and store them offsite. Even if you had to purchase all new media, this is not the most critical part of your information.

Archiving The Critical Files

In the broadest sense, the critical files are those that capture employee knowledge or customer information. Rather than burden each user with deciding what those files are, it is reasonable to secure all user-created files in the archive on a regular basis. Finding the user files can be done many ways, but a simple one is to identify which logical disks contain user files and which file extensions identify user files. Alternatively, all files can be selected and non-user files such as .EXE and .DLL can be excluded.

A full list of files can be easily generated from the OS/2 command line with the command below. On DOS, where DIR has no /F switch, DIR /B /S produces the equivalent listing:

DIR /F /S >list

This redirects a full listing of file names, including the path, to a file named 'list.' A simple program to process the list by extension can create a new list2 that only includes those critical user files. List2 can then be used to retrieve each file's size and calculate total file size. This should be done for each user system. Add all the user system totals together for a grand total of space required. This is how much space the first archive pass will need to store all critical files. Additional space will be needed for new and changed files that are created after the original archive. A simple way to estimate that would be to double the original grand total to allow for dynamic growth of files.
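The filtering and sizing steps just described might be sketched as follows. This is a minimal illustration, not the program the column refers to: the extension set, the function names and the sample figures are all assumptions made for the example, and the growth factor of two is the doubling rule of thumb suggested above.

```python
from pathlib import PureWindowsPath

# Hypothetical set of extensions treated as user files; adjust for your site.
USER_EXTENSIONS = {".doc", ".xls", ".txt", ".wpd"}


def filter_user_files(paths, extensions=USER_EXTENSIONS):
    """Reduce a full listing ('list') to the user files only ('list2')."""
    cleaned = (p.strip() for p in paths)
    return [p for p in cleaned
            if p and PureWindowsPath(p).suffix.lower() in extensions]


def estimate_archive_space(totals_by_system, growth_factor=2):
    """Add the per-system totals, then allow room for new and changed files."""
    grand_total = sum(totals_by_system.values())
    return grand_total * growth_factor
```

Given one total per user system in Mbytes, `estimate_archive_space({"PC1": 100, "PC2": 250})` yields 700, the 350 Mbyte grand total doubled. The per-file sizes themselves would come from the client systems, which this sketch leaves out.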

The second step is to determine how deep the archive will be, i.e. how many prior versions of a file will be stored. For a simple example, here is the plan for an archive that is one-week deep:

A month deep archive can be organized this way:

Protecting against problems that are discovered only after the archive has cycled requires defense in depth. This is done by moving the full archive on a weekly or monthly basis to a reliable medium and storing it away from the systems being protected, preferably offsite in a fireproof safe. Today we have choices of tape, writable and rewritable CD-ROM, magneto-optical drives such as the Fujitsu 640 Mbyte drive and, soon, DVD-RAM, which holds 2.6 Gbyte per platter. Magnetic tape is the most common and usually the medium is the least expensive, but any one of these will work for you.

The disk where the archives are stored should be dedicated to the archive. Each system on the net will have a directory with its system name. The process of performing a single archive is simple. For each networked system:

  1. Get the system's archive control file.
  2. From the archive, map the remote system's drive to a drive letter.
  3. Generate a list of files to be archived for that system (e.g. list2).
  4. Clear the archive pointed to by system/day.
  5. Invoke a program with the list of files.
  6. The program copies the files to the correct archive directory.
  7. Unmap the remote system.
Building The Archive Processes

The archive control file contains a system name on the first line. That is followed by one or more lines starting with a drive letter followed by the extensions to be processed for that drive. Drive letter/directory paths may be used to reduce the scope of the archive. Each time the archive is processed, a new list should be generated to capture new files. On the full archive all files are listed, but on the other runs, only the changed files marked by the archive flag in the client's directory should be listed.
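The column does not fix an exact syntax for the control file, so the parser below assumes one: the system name alone on the first line, then one line per drive or path, with whitespace-separated extensions after it. The function name and the sample layout are hypothetical.

```python
def parse_control_file(text):
    """Parse an archive control file in the assumed layout.

    Example of the assumed layout:
        ACCT1
        C: .doc .xls
        D:\\PROJECTS .txt

    Returns (system_name, [(drive_or_path, [extensions]), ...]).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("control file is empty")

    system_name = lines[0]
    rules = []
    for line in lines[1:]:
        # First field is the drive or path, the rest are extensions.
        location, *extensions = line.split()
        rules.append((location, [ext.lower() for ext in extensions]))
    return system_name, rules
```

Splitting on whitespace keeps the sketch short but means a path containing spaces would need quoting or a different separator.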

Archive flags may be reset daily or on some other periodic basis. Daily reset results in incremental archives which only copy files that changed that day. Longer periods between resets give you a cumulative archive in which each day's archive holds all the files changed so far that period. Weekly periods are often chosen to make it easier to find a needed file, at the cost of more storage space.
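The difference between the two reset periods can be shown with a small sketch. Python has no portable way to read the DOS archive flag, so this example swaps in a different technique: it compares each file's modification time against the moment of the last reset. The function name and the sample files are assumptions made for illustration.

```python
from datetime import datetime


def changed_since(files, last_reset):
    """Return the paths modified at or after last_reset.

    files is an iterable of (path, modified) pairs. Modification time is a
    stand-in here for the archive flag described in the text.
    """
    return [path for path, modified in files if modified >= last_reset]


# Hypothetical files changed on three successive days.
FILES = [
    ("plan.doc", datetime(1999, 10, 18, 9, 30)),
    ("budget.xls", datetime(1999, 10, 19, 14, 0)),
    ("memo.txt", datetime(1999, 10, 20, 11, 15)),
]

# Daily reset: the pass on the 20th sees only that day's change.
incremental = changed_since(FILES, datetime(1999, 10, 20))

# Weekly reset on the 18th: the same pass sees everything changed since then.
cumulative = changed_since(FILES, datetime(1999, 10, 18))
```

Here `incremental` holds one file and `cumulative` holds all three, which is the storage cost of the weekly period: the earlier changes are copied again on every later day.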

One very important requirement for safety of the archive is that all file systems in the archive be read-only from the client computers. This prevents runaway programs or viruses in the clients from damaging files in the archive. Similarly, the client file systems should be read-only from the archive.

The archive system can be easily driven by a few Perl or REXX scripts, or even a CMD or BAT file with some simple supporting programs. Step six can be done in a number of ways, from a simple COPY command to copy utilities to a compression program. Using compression lets more data be stored but increases the complexity of recovering a file.

Regardless of which method is used, you will need a user interface running on the archive system that makes file recovery easy and prevents access to other systems' archives. Using passwords for access control is problematic since many people have difficulty remembering passwords, especially those used infrequently. A better approach is another script or simple program which reads the remote system name and makes only that archive available for retrieval.
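One way such a retrieval program could confine a request to the caller's own archive is sketched below. How the remote system name is obtained depends on the network software and is not shown; the function name and the directory layout (one directory per system under the archive root) are assumptions for this example.

```python
from pathlib import Path


def resolve_in_own_archive(archive_root, system_name, requested):
    """Return the full path of a requested file inside system_name's archive.

    Raises PermissionError if the request would reach outside that
    system's directory, for example through '..' components.
    """
    base = (Path(archive_root) / system_name).resolve()
    candidate = (base / requested).resolve()

    # Accept only paths that sit at or below the system's own directory.
    if candidate != base and base not in candidate.parents:
        raise PermissionError(f"{requested!r} is outside the archive for {system_name}")
    return candidate
```

A request for `mon/report.doc` from system PC1 resolves inside PC1's directory, while `../PC2/mon/report.doc` is refused, so no password is needed to keep one system out of another's files.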

Configuring The Hardware

The hardware you will need depends on the operating system, total file size to be secured, the network throughput and whether compression will be used. Figuring total file size is a by-product of setting up the archive process. Until five years ago, saving more than one gigabyte of files would have been expensive. Today, fast 13 Gbyte IBM drives are available for under $150 on the Web. You can check http://www.tcwo.com for a good selection at competitive prices. At that price you should buy two and use them in a mirror or duplex configuration. Some OSes have mirroring built in; others need a software purchase. You always have the poor man's mirror -- a program that runs every N hours to copy the archive data from one disk to the other. Disk speed is not an issue since the network will limit total throughput.
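A poor man's mirror of the kind just mentioned could look something like this sketch. It performs a single pass; running it every N hours is left to whatever scheduler the chosen OS provides. The function name and the rule for deciding what to copy (missing, older, or different in size) are assumptions for illustration.

```python
import shutil
from pathlib import Path


def mirror_once(primary, secondary):
    """Copy files that are missing or out of date on the secondary disk.

    Returns the number of files copied in this pass.
    """
    primary, secondary = Path(primary), Path(secondary)
    copied = 0
    for src in primary.rglob("*"):
        if not src.is_file():
            continue
        dst = secondary / src.relative_to(primary)
        if dst.exists():
            s, d = src.stat(), dst.stat()
            if d.st_size == s.st_size and d.st_mtime >= s.st_mtime:
                continue  # the mirror copy is already current
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 keeps the timestamp for the next pass
        copied += 1
    return copied
```

This sketch never deletes anything from the second disk, so files cleared from the primary archive linger on the mirror until they are removed some other way. That costs space, though it also leaves an extra copy behind if the primary is damaged.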

Network throughput, not speed, is the critical bottleneck. A 10-Mbit Ethernet will run as fast as 1 Mbyte/sec with only two systems transferring files, but single transfers slow as other traffic interferes. This leads to a critical decision for choosing when to run the archive process. If the systems are left on 24 hours, then night is usually the lowest demand time. However, if some of the computers are turned off at night, they will miss the archive pass.

The solution for this is either to run the archive during the day and put up with the interference, or to enhance the archive process to keep track of systems that don't respond at night and archive the exceptions during the day. If the archive runs during the day, it may need to be throttled so it doesn't load the network and slow everyone else down. One easy way to do this is to pause between file copies (step six) for a short time, typically one to three seconds. This also slows the archive process, so you may need to adjust the delay to get all the files archived.
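The pause between copies can be sketched as below, along with the arithmetic for how much time the pauses add. Both function names are hypothetical, and the `sleep` parameter exists only so the delay can be replaced when trying the code out.

```python
import shutil
import time


def throttled_copy(pairs, delay_seconds=1.0, sleep=time.sleep):
    """Copy each (source, destination) pair, pausing between copies.

    Returns the number of files copied.
    """
    copied = 0
    for src, dst in pairs:
        if copied:
            sleep(delay_seconds)  # give other network traffic a turn
        shutil.copy2(src, dst)
        copied += 1
    return copied


def pause_overhead_hours(file_count, delay_seconds):
    """Hours added to a pass by the pauses alone (one pause between files)."""
    return max(file_count - 1, 0) * delay_seconds / 3600
```

With a two-second delay, a pass over 3,601 files spends two hours just waiting, which is why the delay may have to come down as the number of archived files grows.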

A reasonable system for a medium-size network (30-50 systems) could be run on a Pentium 100 with 32 Mbytes of memory and dual 4 or 8 Gbyte IDE drives. A single 10-Mbit Ethernet connection can easily transfer 50 Kbyte/sec or 180 Mbyte/hour during the day unless the net is overloaded. The one factor that does affect CPU load is the use of compression. If you use compression for the archive, plan to double the CPU speed to handle the extra load. Larger networks will need faster processors, more disk space and either a 100-Mbit Ethernet or dual 10-Mbit connections. Don't forget a UPS to protect the archive against power failures.
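The throughput figure above converts as 50 Kbyte/sec times 3,600 seconds, or 180,000 Kbyte per hour, which is 180 Mbyte counting 1,000 Kbyte to the Mbyte. The small sketch below repeats that arithmetic and turns it around to estimate how long a pass will take; the function names and the 900 Mbyte example are assumptions for illustration.

```python
def mbytes_per_hour(kbytes_per_sec):
    """Convert a sustained rate in Kbyte/sec to Mbyte/hour (1 Mbyte = 1000 Kbyte)."""
    return kbytes_per_sec * 3600 / 1000


def hours_to_transfer(total_mbytes, kbytes_per_sec):
    """Estimate the hours needed to move total_mbytes at the given rate."""
    return total_mbytes / mbytes_per_hour(kbytes_per_sec)
```

At the daytime rate of 50 Kbyte/sec, `hours_to_transfer(900, 50)` gives 5.0, so a 900 Mbyte full archive would take most of a working day, while the smaller daily passes of changed files fit far more comfortably.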

Connecting the archive system to the network is best done at a switch or router rather than a hub. On the hub it will interfere with all other traffic on the hub, but a switch or router will limit the interference to the net segment that is being archived. If you must run the archive during the day, and network performance is impacted too much, then a network upgrade is necessary. Many times this can be as simple as an inexpensive eight-port switch placed between the archive system and the hubs.

Selecting The OS

Although I have referred to the archive system as a 'server,' it does not require an OS with server functions because it only talks to one system at a time. Thus standard desktop systems will work fine. What is important is reliability, year 2000 compliance and compatibility with the rest of the network. Since it runs over the network, mixed system and file types are handled by the client system transparently to the archive. In any case, the cost of an OS for the archive server is tiny compared to the potential for loss of critical information.

Selecting the OS is simple but non-intuitive. If your site runs mostly or entirely Microsoft OSes, you must not use a Microsoft OS for the archive system. Why? Because you don't want the archive system to be vulnerable to the same viruses or Y2K bugs as the rest of the systems, and vice versa. Using the same OS makes the archive system vulnerable to exactly what you set it up to protect against. Even if you get everything else right, that choice will cancel much of the archive's security. If all you run is Unix systems, use a non-Unix OS for the archive.

Fortunately you have a lot of good choices, split between non-Unix and Unix type systems. In the former, OS/2 or NetWare could be your best choice because of stability and ease of interfacing. Warp 4 with FixPak 10 is year 2000 compliant and the standard install can handle NetBIOS and TCP/IP without problems. Warp is compatible with Windows security and can run a McAfee virus checker when not running the archive. A newer system, BeOS at release 4.5, may be an alternative if you have an experienced Be person. I don't have any knowledge of its stability in this application.

Unix type systems, especially FreeBSD and Linux, are possible choices, as is Solaris on Intel. But be warned: Setting up Unix systems to work with Windows systems is not a trivial exercise. Nothing less than an experienced person will make this work well in less than a few weeks. Even though Samba, the SMB network interface, has been extensively redone, it still has a long list of parameters to be set and its security interface differs from NT's.

Wrapping The Package

To summarize the necessary steps in setting up a reliable archive server:

  1. Archive only the important user files, not all of the files.
  2. Use high-quality disks and mirror or duplex them.
  3. Run the archive automatically every day or night.
  4. Set the archive disks to be read-only from the clients, and vice versa.
  5. Copy the archive periodically to a portable medium and store offsite for defense in depth.
  6. Select a different OS than the rest of your systems for the archive server.

Now more than ever, people expect continuous availability, and recreating files by hand is very expensive. It might make good sense to include your website in the daily archive, run when the Web server is at lowest demand. Outage of a website is an instant loss of credibility and sales. As I see it, an archive server is very cheap insurance against problems you can count on happening. After all, we all make misteaks.

All content on this site is Copyright 2001 by Bill Nicholls