ADFS Study Material Unit-4
US03CBCA03 (Advanced Data & File Structure)
Unit - IV
CHARUTAR VIDYA MANDAL’S SEMCOM, Vallabh Vidyanagar
Faculty Name: Ami D. Trivedi
Class: SYBCA
Subject: US03CBCA03 (Advanced Data & File Structure)
*UNIT – 4 (FILE ORGANIZATION)
**TERMINOLOGY, DEFINITIONS AND CONCEPTS IN FILE ORGANIZATION
*TERMINOLOGY& DEFINITIONS
1. Record / Group / Segment
Record is sometimes called group or segment.
Record is a collection of information about a particular entity.
OR
Record is a collection of related fields that can be treated as a unit.
For example: a record may consist of information about a student of a college,
passenger on air flight, book of library, item sold on shop etc.
2. Field / Item
Field of a record is a unit of meaningful information about an entity.
OR
Field is an elementary (basic or fundamental) data item characterized by its size and type.
For example: different fields / items of a
1) Student record – roll number, class, marks of different subjects etc.
2) Passenger record – name, address, seat number, menu restrictions etc.
3. File
A collection of records involving a set of entities with certain (some) common aspects, organized for a particular purpose, is called a file.
OR
Data is organized for storage in files. A file is a collection of similar, related records.
For example: Following forms a file.
1) Collection of student records of a particular college / class
2) Collection of all passenger records of a particular flight
4. Key
A record item that uniquely identifies a record in a file is called a key.
For example: Following will uniquely identify record.
1) Roll number in Student records of a class
2) Passenger seat number in Flight record
5. Database
• Items are composed (collected) to form records.
• Records are composed to form files.
• Files are composed to form a set of files.
If a set of files has a certain association / relationship between the records of the files, then such a set of files is known as a database.
For example: a student information file, attendance file and marks file are related to each other. They can be collected to form a database.
Items, records, files and databases are logical terms. There is no indication of how they appear on a physical storage device.
The term file has both a physical and a logical interpretation. Physically, a file is a collection of physical records, often residing contiguously in external memory. Logically, it is the view seen by the user.
*INTRODUCTION TO FILE ORGANIZATION
The storage representations and data manipulations discussed in data structures (before this topic) applied only to data entities which were assumed to reside in main memory.
Why can all information to be processed not reside in main memory? There are 2 reasons for this:
1. Some programs and data are very large, so they do not fit in RAM.
2. It is desirable (necessary) to store information for future use.
So, a large volume of data is generally stored in external memory in the form of special data-holding entities called files.
For example: University data, Airline or Railway reservation data
Files deal with storage and retrieval of data in a computer. Programming languages
provide statements and data structures which allow users to write programs to
access and use the data in the file.
External storage devices like Magnetic tapes, Magnetic drum, Hard disk etc. are
used to store the data.
Users of data will expect that
1. It is always reliable for processing.
2. It is secure.
3. It is stored in such a way that it is flexible enough for users to manipulate the data as per requirement.
FILE ORGANIZATION
Definition: File organization is a method of storing data records in a file, together with the implications (effects) this has on the way these records are accessed.
Following factors are considered while selecting a particular file organization for
users.
1. Ease of retrieval
2. Convenience for updating data
3. Cost of storage
4. Reliability
5. Security
6. Integrity
FILE OPERATIONS
File operations deal with movement of data in and out of files. Some basic file
operations that can be performed on any kind of file are as follows.
1. Creation of file
2. Reading of file
3. Updation of file
4. Insertions in the file
5. Deletion from the file
ALGORITHMIC NOTATION
We need an algorithmic notation to represent file processing. File processing is supported in most programming languages.
A file is given an identifier name. We denote this identifier in algorithm by the syntax
<file name>.
Open a file
The first requirement before processing a file is to allocate storage for a buffer.
We need to identify the operations that can be performed on the file. It is necessary
to know the use of a file - input, output or both (update).
This serves as protection. We do not wish to write on a file used for input.
These functions can be expressed in our algorithmic notation as follows:
1) Open <file name> for input
Input specifies that records may be read only.
2) Open <file name> for output
Output specifies that records may be written only.
3) Open <file name> for update
Update specifies that records may be read and rewritten or possibly written at
the end of file.
Note: It is not possible to rewrite records in a sequential file stored on tape.
Close a file
When the processing of a file is complete, the buffer space can be deallocated by
writing
Close <file name> file
The Open and Close statements also prevent more than one user from accessing the file simultaneously in time-sharing and multiprogramming environments.
A program cannot open a file which is currently open for writing by another program until that program closes the file.
There are two basic operations that can be specified for a sequential file: read and
write.
Read and Write operation
The object of a read statement or the source of a write statement should be the identifier of a variable or structure.
This identifier of a variable or structure corresponds to the records in the file. We will
denote this identifier by <record name>.
A read statement has the form: Read from <file name> file into <record name>
A write statement has the form: Write <record name> on <file name> file
Rewrite operation
A third operation that is applicable to sequential files stored on direct-access devices
is:
Rewrite <record name> on <file name> file
This operation can be used only when the file has been opened for update. It writes
a record in the location of the record which is most recently read.
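To make the notation concrete, here is a minimal Python sketch of these operations, assuming fixed-length records; the names open_file, read_record, write_record and rewrite_record are illustrative helpers, not part of any standard library.

RECORD_SIZE = 80  # assumed fixed record length in bytes

def open_file(name, use):
    # Open <file name> for input, output or update.
    modes = {"input": "rb", "output": "wb", "update": "r+b"}
    return open(name, modes[use])

def read_record(f):
    # Read from <file name> file into <record name>.
    return f.read(RECORD_SIZE)

def write_record(f, record):
    # Write <record name> on <file name> file.
    f.write(record)

def rewrite_record(f, record):
    # Rewrite <record name> on <file name> file: overwrite the record most
    # recently read. Valid only when the file is open for update on a
    # direct-access device.
    f.seek(-RECORD_SIZE, 1)  # back up over the record just read
    f.write(record)

Closing the file with f.close() releases the buffer space, as the Close statement does.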
COMMONLY USED FILE ORGANIZATION
Most operating systems provide a set of basic file organizations that are popular with the users of the system. The 3 most commonly used file organizations are:
1. Sequential files
2. Indexed sequential files
3. Direct / Random files
1. SEQUENTIAL FILES
STRUCTURE OF SEQUENTIAL FILES
• In a sequential file, records are stored one after the other on a storage device.
• The sequential file has been the most popular basic file structure used in the data-processing industry.
EXTERNAL STORAGE DEVICES
• All types of external storage devices (serial / direct access storage devices) support a sequential-file organization.
• Some devices can only support sequential files (due to their physical nature).
For example:
• Magnetic tape, paper tape readers, card readers, tape cassettes and line printers support only sequential access.
• Magnetic disks and drums provide both direct and sequential access to records.
HOW RECORDS OF A SEQUENTIAL FILE ARE STORED ON DISK
• A sequential file is physically stored on a drum or disk by storing the sequence of records in adjacent locations on a track.
• If the file is larger than the space available on a track, then records are stored on adjacent (next) tracks.
• If all tracks of one cylinder are filled with records then the next cylinder is used. If all cylinders of one storage device are full then another storage device can be used.
• Multiple devices are attached to a common control unit.
OPERATIONS ON SEQUENTIAL FILE
• Operations that can be performed on a sequential file may differ slightly, depending on the storage device used.
• For example, a file on magnetic tape can be either an input file or an output file, but not both at one time.
• A sequential file on disk can be used strictly for input, strictly for output, or for update.
• Update means that the record most recently read can be rewritten on the same file.
BLOCK and BUFFER concept for READ and WRITE
• It is often advantageous to group a number of logical records into a single physical record or block.
• Complete blocks are transferred between main memory and the external storage.
• Each time the READ instruction is executed, the next record from the MASTER sequential file is moved into the program area and assigned to the structure EMPLOYEE.
• When a read or write operation is executed for a particular storage device, a block of logical records is transferred.
• Here, there is a difference between a program’s read & write statements and the read & write commands issued for a particular device.
This difference is resolved by using a buffer between external storage and the data area of a program.
A buffer is a section of main memory which is equal in size to the maximum size of a block of logical records used by a program.
SINGLE BUFFERING
Operating systems use buffers for the “blocking” and “deblocking” of records.
e.g. MASTER sequential file is an input file.
Reading operation on MASTER file
Fig-1 Illustration of the reading of a block of records
• When the first READ statement is executed, a block of records is moved from external storage to a buffer.
• The first record in the block is then transferred to the program’s data area, as shown in Fig-1.
• For each following execution of a READ statement, the next record in the buffer is transferred to the data area.
• Every record in the buffer will be moved to the data area in response to READ statements.
• Only after that will the next READ statement cause another block to be transferred to the buffer from external storage.
• The new records in the buffer are moved to the data area, as described previously. This entire process is repeated for each block that is read.
Writing operation on MASTER file
• In the same fashion, WRITE statements cause the transfer of program data to the buffer.
• When the buffer becomes full (corresponding to a block of logical records), the block is written on the external storage device immediately after the preceding block of records.
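As an illustration of blocking and deblocking with a single buffer, here is a small Python sketch; the blocking factor of 4 and the 80-byte record size are assumed values.

BLOCKING_FACTOR = 4
RECORD_SIZE = 80
BLOCK_SIZE = BLOCKING_FACTOR * RECORD_SIZE

def records(f):
    # Deblocking: each device-level read fills the buffer with one block;
    # successive READ statements consume records from the buffer, and a new
    # block is fetched only when every record in it has been used.
    while True:
        buffer = f.read(BLOCK_SIZE)       # one block transfer per device read
        if not buffer:
            return
        for i in range(0, len(buffer), RECORD_SIZE):
            yield buffer[i:i + RECORD_SIZE]

Writing is symmetric: records are appended to the buffer, and the block is written out once the buffer holds BLOCKING_FACTOR records.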
MULTIPLE BUFFERING
• Multiple buffering makes use of a queue of buffers which are normally controlled by the operating system.
• The need for more than one buffer arises because of the delay to “read in” or “write out” the next block of records.
• A delay in the execution of a program occurs when a blocking factor of n and a single buffer are used.
• The delay occurs after every n executions of the READ or WRITE statement.
• It is wise to eliminate this delay by using multiple buffers if
1. the program is executing in an environment where the desired response time is small and
2. processor and input/output activity can be overlapped.
Working of multiple buffers
A circular queue of three buffers is shown in Fig-2 below.
Fig-2 Three buffer system
• When the first READ statement is executed, the three buffers A, B, and C are filled with three consecutive blocks, one block per buffer.
• All the records in buffer A will be processed. After that, the execution of a subsequent READ instruction will transfer the first record from buffer B to the program’s data area.
• Concurrently (in parallel), a read command will be issued by the operating system, and a block transfer from the sequential file on the external storage device to buffer A will be initiated.
• Subsequent executions of READ instructions will transfer records in sequence from buffer B and then buffer C.
• By the time the records of buffer C are being processed, buffer A contains a new buffer full of records and buffer B is being refilled.
• If the process of filling one buffer with new records is generally balanced with the process of reading records from the remaining buffers, then there will be only one read or write delay, namely, when the buffers are initially being filled.
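The following Python sketch imitates the three-buffer circular queue; in a real system the operating system refills an emptied buffer concurrently with processing, whereas here the refill is simply shown at the point where the read command would be issued.

from collections import deque

def buffered_records(f, block_size=240, record_size=80):
    buffers = deque(f.read(block_size) for _ in range(3))  # fill A, B and C
    while buffers:
        block = buffers.popleft()           # take the next buffer in the queue
        if not block:
            return                          # end of file reached
        buffers.append(f.read(block_size))  # refill this buffer's slot
        for i in range(0, len(block), record_size):
            yield block[i:i + record_size]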
FIXED LENGTH AND VARIABLE LENGTH RECORDS
• Blocks of logical records make up a sequential file. In some systems, such as an IBM mainframe system, a block can be either fixed length or variable length.
• A variable-length block contains variable-length records. Therefore, it is not known how many records fit in a block.
• So a maximum length is defined for the block. This maximum length is used to estimate the size of a buffer needed to hold the block. The data management facilities group as many records as possible into the block.
• Tape and disk files provide for variable-length records, either unblocked or blocked. The use of variable-length records may significantly reduce the amount of space required to store a file.
• However, beware of trivial (minor) applications in which variations in record size are small or the file itself is small, because the system generates overhead that may defeat any expected savings.
Variable Length Record Format
Fig-3 below shows the general record format for variable-length blocked and unblocked records.
Fig-3 Record format for variable length (a) unblocked and (b) blocked records
• A BL (block length) and an RL (record length) must be stored with each block and record, respectively.
• These lengths are needed when unblocking the records during a READ instruction.
• The maximum length of a block depends on the storage device used for the file.
• With magnetic tape, the length depends on the maximum space available for the buffer in main memory.
• With disk storage, blocks are generally limited in size to the capacity of a track.
• Using a sector-addressable device, a block corresponds to some maximum number of sectors.
Record format components
• Immediately preceding each variable-length record on tape or disk is a 4-byte record control word (RCW or RL) that supplies the length of the record.
• Immediately preceding each block is a 4-byte block control word (BCW or BL) that supplies the length of the block.
• As a consequence, both records and blocks may be variable length. You have to supply a maximum block size into which the system is to fit as many records as possible.
Unblocked Records
• Variable-length records that are unblocked contain a BCW and an RCW before each record (each record forms its own block). Here are three unblocked records:
|BCW|RCW|Record 1|•••|BCW|RCW|Record 2|•••|BCW|RCW|Record 3|
• Suppose that three records are to be stored as variable-length unblocked. Their lengths are 310, 260, and 280 bytes, respectively:
Field:    | BCW | RCW | record   | BCW | RCW | record   | BCW | RCW | record   |
Length:   |  4  |  4  |   310    |  4  |  4  |   260    |  4  |  4  |   280    |
Contents: | 318 | 314 | record 1 | 268 | 264 | record 2 | 288 | 284 | record 3 |

• The RCW contains the length of the record plus its own length of 4.
• The first record has a length of 310, so its RCW contains 314. The BCW contains the length of the RCW(s) plus its own length of 4.
• Since the RCW contains a length of 314, the BCW contains 318.
Blocked Records
• Variable-length records that are blocked contain a BCW before each block and an RCW before each record.
• The following shows a block of three records:
|BCW|RCW|Record 1|RCW|Record 2|RCW|Record 3|
• Suppose that the same three records with lengths of 310, 260, and 280 bytes are to be stored as variable-length blocked and are to fit into a maximum block size of 900 bytes:
Field:    | BCW | RCW | record   | RCW | record   | RCW | record   |
Length:   |  4  |  4  |   310    |  4  |   260    |  4  |   280    |
Contents: | 866 | 314 | record 1 | 264 | record 2 | 284 | record 3 |

The length of the block is the sum of one BCW, the RCWs, and the record lengths:
Block control word:       4 bytes
Record control words:  + 12 bytes
Record lengths:       + 850 bytes
Total length:           866 bytes
• The system stores as many records as possible in the block, up to (in this example) 900 bytes.
• Thus a block may contain any number of bytes up to 900, and both blocks and records are variable length.
• Your BLKSIZE entry tells the system the maximum block length.
For example, if the BLKSIZE entry in the preceding example specified 800, the system would fit only the first two records in the block, and the third record would begin the next block.
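The packing rules can be expressed as a short Python sketch; pack_blocks is a hypothetical helper, and the 4-byte big-endian control words are an assumed encoding for illustration.

def pack_blocks(records, blksize=900):
    blocks, current = [], b""
    for rec in records:
        entry = (len(rec) + 4).to_bytes(4, "big") + rec   # RCW + record
        if 4 + len(current) + len(entry) > blksize:       # block would overflow
            blocks.append((4 + len(current)).to_bytes(4, "big") + current)
            current = b""                                 # start a new block
        current += entry
    if current:                                           # flush the last block
        blocks.append((4 + len(current)).to_bytes(4, "big") + current)
    return blocks

# The three example records fit in one 4 + 314 + 264 + 284 = 866-byte block;
# with blksize=800 the third record would begin a new block.
blocks = pack_blocks([b"x" * 310, b"y" * 260, b"z" * 280])
assert len(blocks) == 1 and len(blocks[0]) == 866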
PROCESSING SEQUENTIAL FILES
Sequential files are most suitable for
1. Serial processing and
2. Sequential processing
1. Serial processing
• Serial processing means accessing records one after the other, according to the physical order in which they appear in the file.
• Consider a MASTER file of employee records ordered by employee surname, e.g. ADAMS is first, BAKER is second, ... and ZAPE is last.
Here, sequentially processing the file by surname is equivalent to serially processing the file.
• Most sequential files are ordered by a key or index item when the file is created, e.g. employee id, student roll number, or item number. The key or index item should be the item which is most often searched for when processing the file.
• To show the importance of key selection, assume the MASTER file of employees is ordered by social insurance number.
Suppose we want to find the records of a number of employees given only their names.
Suppose we want to find the employee record whose name is ADAMS. We need to serially process the file until a record with the name ADAMS is found.
Consider the processing of a second record, say for BAKER. The position of BAKER’s record has no relationship with the position of ADAMS’s record.
So we have no alternative: we need to start serially processing once again from the beginning of the MASTER file.
• Irrespective of the key or index item upon which the file is ordered, serial processing of the file is required on some occasions.
For example, if we want to add a pay increase of Rs. 500 for all employees, then we can serially process the file. Here, it doesn’t matter whether the file is ordered by name or by social insurance number.
2. Sequential processing
• Sequential processing is the accessing of records one after the other, in ascending order by a key or index item of the record.
• In sequential processing, transaction records are usually grouped together (i.e. batched) and are sorted according to the same index item as the records in the file.
Each successive record of the file is read, compared with an incoming transaction record and then processed.
It is processed in a manner that usually depends upon whether the value of the record’s index item is less than, equal to, or greater than the value of the index item of the transaction record.
• Sequential and serial processing are most effective when a high percentage of the records in a file must be processed.
Since every record in the file must be scanned, a relatively large number of transactions should be batched together for processing.
Addition of record
• If records are to be added to a file, it is necessary to create a new file unless the records are to be added to the end of the file.
Deletion of record
• Records can be deleted from a sequential file by tagging them as “deleted” during a file update.
• However, this procedure leads to files with embedded “dummy” records; storage is not used efficiently and processing time is increased.
• Usually, records to be deleted are physically removed by creating a new file. While creating a new file is sometimes necessary, it should be done as infrequently as possible.
ADVANTAGES OF SEQUENTIAL FILE ORGANIZATION
1. It is easy to implement.
2. It provides fast access to the next record.
3. Useful if a high percentage of the records in a file must be processed.
DISADVANTAGES OF SEQUENTIAL FILE ORGANIZATION
1. It is difficult to update a file if insertion of a new record requires movement of a large amount of the file’s data.
2. Random access is time consuming because we need to start processing from the beginning of the file.
2. INDEXED SEQUENTIAL FILES
STRUCTURE OF INDEXED SEQUENTIAL FILES
• An important aspect affecting the file structure is the type of physical medium on which the file resides.
• The capability of directly accessing a record is based on a key (or unique index). It can only be achieved if the external storage device used supports this type of access.
EXTERNAL STORAGE DEVICES
• Devices such as card readers and tape units allow the access of a particular record only after reading all the other records that physically appear before the desired record in the file.
So direct record access is impossible for these types of devices.
• The types of external storage devices that support both direct and sequential access are magnetic drums and disk units.
• The file-structure concepts relating to indexed sequential files are best illustrated when we consider a magnetic disk as the storage medium.
In fact, because of their low price / performance ratio and large total storage capacity, disks are generally chosen when using indexed sequential files.
2 TYPES OF INDEXED SEQUENTIAL FILE ORGANIZATIONS
1) used by IBM
2) used by CDC
1) IBM INDEXED SEQUENTIAL FILE
An IBM indexed sequential file consists of 3 separate areas:
1. Prime area
2. Index area
3. Overflow area
1. Prime area
• The prime area is an area into which data records are written when the file is first created.
• The file is created sequentially, that is, by writing records in the prime area in a sequence dictated by the lexical ordering of the keys of the records.
• The writing process starts at the second track of a particular cylinder, say the nth cylinder, of a disk.
• When this cylinder is filled, the writing process continues on the second track of the next ((n+1)th) cylinder. It continues in this fashion until the file’s creation is completed.
• If the newly created file is accessed sequentially according to the key item, the records are processed in the order in which they were written.
2. Index area
• The second important area of an indexed sequential file is the index area. It is created automatically by the data-management routines of the operating system.
• A number of index levels may be involved in an indexed sequential file:
1) Track index
2) Cylinder index
3) Master index
1) Track index
• The lowest level of index is the track index.
• It is always written on the first track (track 0) of the cylinders of the indexed sequential file.
• The track index contains two entries for each prime track of the cylinder – a normal entry and an overflow entry.
• The normal entry is composed of
 - the address of the prime track with which the entry is associated, and
 - the highest value of the keys for the records stored on the track.
• If there are no overflow records, the overflow entry is set equal to the normal entry.
Fig-4 Track index and prime area of an indexed sequential file
• Fig-4 illustrates the file structure of an indexed sequential file of customer records in which the key item is an account number.
• Only one cylinder is shown, with a prime area of m tracks.
2) Cylinder index
• The track index describes the storage of records on the tracks of a cylinder.
• In the same way, the cylinder index indicates how records are distributed over a number of cylinders.
• A cylinder index references a track index – one cylinder index entry per track index.
3) Master index
• A final level of indexing exists in this hierarchical indexing structure.
• A master index is used for an extremely large file where a search of the cylinder index is too time consuming.
• This index forms the root node of the tree of indices used in an indexed sequential file.
Relationships between the different levels of index
Fig-5 below explains the relationships between the different levels of index.
Fig-5 Relationships between the different levels of index
• This example assumes that there are 12 cylinders on which records are stored.
• Track 0 of every cylinder is the track index of that particular cylinder. It contains m entries because there are m prime tracks per cylinder.
• The cylinder index contains 12 entries – one per cylinder.
• The master index contains 3 entries – one per cylinder index.
Searching a record from an Indexed Sequential file
• Locating the record corresponding to the customer with account number 089725 involves a search of the master index to find the proper cylinder index with which the record is associated (e.g., cylinder index 1).
• Next, a search is made of the cylinder index to find the cylinder on which the record is located (e.g., cylinder 1).
• A search of the track index produces the track number on which the record resides (e.g., track m).
• Finally, a search of the track is required to locate the desired record.
• It should be noted that a master index is not always necessary, and it should only be requested for large files. When it is used, it should reside in main memory during all of the processing of an indexed sequential file.
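A simplified Python sketch of this hierarchical search follows; each index level is modeled as an ascending list of (highest key, child) pairs, an illustrative structure rather than IBM's actual on-disk format.

import bisect

def search_index(entries, key):
    # Return the child whose highest key is the first one >= key.
    i = bisect.bisect_left([k for k, _ in entries], key)
    return entries[i][1] if i < len(entries) else None

def locate(master_index, key):
    cylinder_index = search_index(master_index, key)  # master index level
    cylinder = search_index(cylinder_index, key)      # cylinder index level
    track = search_index(cylinder, key)               # track index level
    for record in track:                              # scan the prime track
        if record["key"] == key:
            return record
    return None

The sketch assumes the key lies within the file's key range; a fuller version would also follow the overflow entries described below.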
Adding a record to an Indexed Sequential file
• If records are added to a sequential file, a new sequential file must be created. We could use the same approach when handling additions for an indexed sequential file.
• However, it is possible to access records directly in an indexed sequential file.
• So this type of file is generally used in a more volatile, quick-response environment, i.e., an environment in which many additions and deletions arise from on-line transactions or small batches of transactions.
• Such deletions and additions must be immediately reflected in the file; one cannot wait until the end of the month.
3. Overflow area
• The problems of adding records are handled by creating an overflow area or areas, usually on the same device on which the file resides. There are two kinds:
1) Cylinder overflow area
2) Independent overflow area
1) Cylinder overflow area
• A cylinder overflow area is a number of dedicated tracks on a cylinder that contains a number of prime-area tracks.
Addition of a record
• If an overflow is created in the prime-area tracks of the cylinder through the addition of a record, then the overflow records are stored in the cylinder overflow area.
• An overflow record is identical to a prime record, except that a track / record address field is added to the end of the record.
This track / record address field contains a pointer to the overflow record with the next largest key value in the list of overflow records for a particular track.
There may be a number of cylinder overflow tracks, and the overflow records for each track are grouped together in a linked list.
The head of the linked list is given by the track / record address in the track index.
The final record in the linked list is specified by placing the number of the associated prime track in the track / record address field of the overflow record.
• The effect of overflow records on the structure of the file is illustrated in Fig-6 (a) and Fig-6 (b).
• We assume that one track contains only three records (an unrealistic example).
Addition of a record with key 026924
Fig-6 Effects of an overflow in an indexed sequential file
(a) The addition of a record with key 026924
• Initially, a customer record with account number 026924 is added to the file, as depicted in Fig. 6(a).
• The record with account number 028761 must be moved to the cylinder overflow area at track m + 1.
Track 2 is placed in the link field of overflow record 028761.
This is done to preserve the sequential ordering of records in track 2 of the prime area (Fig. 6(a)).
This change demands two other alterations to the file.
• First, the normal entry in the track index for track 2 must be changed from 028761 to 026924, because 026924 is now the highest key value for the track (Fig. 6(a)).
• Secondly,
 - the overflow entry is adjusted so that its first subentry contains the largest key value of any record for track 2 (i.e., the value 028761), and
 - the second subentry is set to the track / record address of the overflow record with the smallest key value for track 2 (Fig. 6(a)).
Addition of a record with key 021008
Fig-6 Effects of an overflow in an indexed sequential file
(b) The addition of a record with key 021008
• Records with keys 023612 and 024121 will be shifted one record position to the right to accommodate the record with key 021008 (Fig. 6(b)).
• The record with key value 026924 becomes an overflow record with the addition of the record with key value 021008.
Its track / record address field is set to point to the record with key value 028761 (Fig. 6(b)).
• The normal entry in the track index for track 2 must be changed from 026924 to 024121, because 024121 is now the highest key value for the track (Fig. 6(b)).
• The second subentry of the overflow entry for track 2 is set to point to the overflow record with the next largest key value in the list of overflow records for track 2 (Fig. 6(b)).
2) Independent overflow area
• As more and more records are added to the indexed sequential file, the cylinder overflow area becomes full.
• When this happens, further overflow records are transferred to an independent overflow area.
• This can be done if such an area is specified when the file is created. The independent overflow area resides on a cylinder or cylinders apart from any prime-area cylinder.
• Overflow records are linked together in the same manner as they are in the cylinder overflow area.
• Note, however, that for disks with movable heads, the use of independent overflow areas should be discouraged.
• This is because a significant number of seeks are generated when the access arm is moved between the prime and independent overflow areas.
Deletion of a record
• In IBM indexed sequential organization, deleted records are not physically removed from a file, but are just marked as deleted by placing ‘11111111’B (all ones) in the first byte of the record.
• If a new record is added later which has the same key as a record previously deleted, then the space occupied by the deleted record is recovered.
Reorganization of a file
• Records which are placed in an overflow area are never moved back into the prime area because of a deletion. Only by reorganizing the file can an overflow record be placed in the prime area.
• Reorganization is achieved by sequentially copying the records of the file into a temporary file and then recreating the file by sequentially copying the records back into the original file.
• Because the retrieval of overflow records can carry a large overhead, the amount of disorganization in an indexed sequential file should be monitored closely.
• A good rule of thumb when using a movable-head disk is to reorganize when records must be placed in the independent overflow area.
2) CDC INDEXED SEQUENTIAL FILE
The SCOPE monitor for the Control Data 6600 and CYBER series of machines provides an indexed sequential file that has a different structure from the IBM system.
A SCOPE Indexed Sequential (SIS) file is organized into
1. Data blocks and
2. Index blocks
Both blocks are handled as logical records which are allocated and transferred to
and from main storage under the guidance of the SCOPE monitor.
The user has no control over the physical placement of the blocks on the external
storage device.
This strict system control is a necessary requirement because SCOPE allows the
simultaneous sharing of disk files in a multiuser environment.
The user does have control of the size of the data and index blocks.
1. Data blocks
• A data block is composed of
 - data records,
 - keys with pointers to the data records within the data block, and
 - padding space into which overflow records are placed.
• A set of data blocks is shown in Fig-7.
• Note that the user may specify the size of the padding area as a fraction of the size of the complete block (i.e., .5 means that half of the data block is assigned as padding).
Fig-7 Index and Data blocks in a SCOPE indexed sequential file
2. Index blocks
• The index blocks form a tree-structured hierarchy of keys and pointers, like the index areas of the IBM system.
• An index block contains
 - pairs of keys and addresses, and
 - padding space for the addition of such pairs.
• A key/address pair is composed of the lowest key of a particular data block or a “lower level” index block, and the address of the data or index block in which this key resides.
• Fig-7 shows the relationship between index blocks and data blocks in a two-level index block file.
• Note that the user may select a padding factor for the index blocks. The user may also specify the number of index levels for the file when it is created.
• In Fig-7, the Master Index Block (MB) references three Subordinate Index Blocks (SIB1, SIB2, and SIB3).
• These index blocks point to data blocks (DB1, DB2, ..., DBm).
• Here we are allowing only three records in a data block (an unrealistic situation).
Addition of a record
• With the addition of two records with keys 010943 and 010000, we can illustrate how overflow records are handled in the SCOPE system.
• Fig-8 (a) shows the local effect the addition of record 010943 has on DB1.
• Fig-8 (b) and Fig-8 (c) show the more global effects the addition of record 010000 has on DB1, SIB1, and MB.
• A by-product of this addition (the record with key 010000) is the creation of a new data block, DBm+1, which would not have been needed if there had been enough space in DB1.
• This new block receives half the records of the overflowing block DB1.
Fig-8 The effects of the addition of (a) record with key 010943 and
(b) and (c) record with key 010000
Overflows in an index block
• As more data blocks are created, an index block can become full. Overflows in an index block are handled in the same manner as for data blocks.
• A new block is created and half the index records in the full index block are moved to the new index block.
• The reason for splitting an overflow block is to eliminate the problem of continually having to move overflow records from a full block into a separate overflow area (as is done with prime-area overflows in the IBM system).
• Of course, the “splitting” process requires more memory than a “record-at-a-time” overflow process, because padding space must be reserved.
Deletion of a record
• In the CDC SCOPE system, deleted records are “garbage collected”.
• That is, the holes left by deleted records are filled by the active records with higher key values in the block.
• So both the active record area and the padding area are always contiguous areas in a block.
PROCESSING INDEXED SEQUENTIAL FILES
• The organization of an indexed sequential file is much more complex than that of a sequential file.
• Because of this complexity, most operating systems provide access facilities or methods to handle the file structure changes that result from the insertion and deletion of records.
• The main advantage of an indexed sequential file is that records can be processed either sequentially or directly.
• The sequential processing of an indexed sequential file is logically identical to the sequential processing of a sequential file, i.e., records are processed in a sequence determined by the index item.
• The types of transactions that are performed are the reading, alteration, addition, and deletion of records.
These are accomplished at a user level with READ, WRITE, and REWRITE statements.
• The types of transactions and the operations used to effect these transaction types are usually the same for the sequential processing of sequential and indexed sequential files.
But the manner in which the records are accessed is substantially different, due to the differences in the file structures.
ADVANTAGES OF INDEXED SEQUENTIAL ORGANIZATION
1. The indexed sequential organization combines the best features of a sequential
organization with the features of random organization with the help of indexes.
2. It is of advantage for files, which need sequential organization for efficient
processing of some applications and which also need random access for
efficient immediate processing, file interrogation, etc.
DISADVANTAGES OF INDEXED SEQUENTIAL ORGANIZATION
1. Extra storage requirement to allow subsequent addition of records and for the
indexes.
2. Extra processing time required to use the indexes and the time for reorganizing
the file.
3. Another serious limitation of this organization concerns retrieval on multiple key fields. If retrieval of a record is based on a part of the record other than the single key field, then the entire file must be searched sequentially.
Despite these limitations, the indexed sequential organization is widely used in information systems.
3. DIRECT / RANDOM FILES
• Random or direct organization of a file implies that a particular record can be accessed directly, without scanning the other records that precede it in the file.
Example-1
• Consider a customer billing system to illustrate the type of file processing associated with a direct file.
• To accommodate any type of on-line processing, like checking the bill amount or verifying the list of goods purchased so far, individual customer records must be accessed directly.
• It is also desirable that records are ordered sequentially by account number. This is necessary to generate monthly customer bills based on receipts which are received in batches from points of sale.
• In the near future, the company may want to replace this system of filling out a purchase slip at the point of sale and then sending slips to the main office for computer processing.
• A simple but more expensive approach is: a purchase or return is posted against a customer account immediately via on-line terminals operated by point-of-sale clerks.
• The on-line terminals are remote from the computer, yet tied directly to it via communication lines.
• When purchases and returns are handled at the point of sale, it is no longer necessary to batch all the customer receipts and sequentially process them against the accounts via the merge update procedure.
• To accommodate this form of on-line processing, an individual customer’s record must be directly accessible.
Example-2
• In a railway reservation system, passengers demand reservations for any train, for any date and for any kind of seat (2 tier AC, 3 tier AC, Second Class etc.) in random order.
• Here, the train data are required to be accessed randomly.
DIRECT ACCESS
• If the need for sequential processing is eliminated, we can design a system that requires only the capability of direct access.
• There are a number of file structures that provide efficient direct access.
• Such efficiency of access is gained because we remove the requirement that the file must be organized so that it can be accessed both sequentially and directly.
STORAGE MEDIUM - An important aspect affecting the file structure
• An important aspect affecting the file structure is the type of physical medium on which the file resides.
• The capability of directly accessing a record based on a key or unique index can only be achieved if the external storage device used supports this type of access.
• This type of organization invariably requires record addresses, and therefore it is feasible only on direct access storage devices (DASDs), like disk and drum, and not on sequential access storage devices (SASDs), like magnetic tape or punched cards. This type of organization is needed in the case of on-line real-time information systems.
• The efficiency of random organization is influenced by the characteristics of the file storage device, and therefore consideration must be given to which device to use as the file storage medium.
• Storage of a random access file involves rotational movement; moving the storage medium past the read / write head requires time.
• The time to access a record can be divided into:
 - Seek Time: the time required to position a movable read/write head over the track containing the record to be used. If the read/write head is fixed, i.e. there is one read/write head for every track, then this time will be zero.
 - Rotational Time: the rotational delay in moving the storage medium under the read/write head.
• Because of these two times, in files stored on a DASD, while each record is accessible without reading other records, the time intervals required to access different records are not equal; they depend on the location of the read/write head and the rotational position of the surface of the DASD.
• The seek time can be especially significant and should be reduced to the minimum.
• Devices such as tape units allow the access of a particular record only after reading all the other records that physically appear before the desired record in the file. Hence, direct record access is impractical for these types of devices.
• For a magnetic disk, a particular block can be accessed by specifying its cylinder, track, and sector (i.e., its block address).
• Thus, to access a specific record, it is only necessary to obtain the address of the block containing the record, read the block, and search the block for the record.
• This can provide quick access to a record, provided that the address of its block can be obtained.
• Therefore, disks can be used for the storage of direct files. Disks are also selected for this purpose because of their low price–performance ratio and large total storage capacity.
STRUCTURE OF DIRECT FILES
How records are stored in direct file organization
• Records in a file are stored in blocks, which are called buckets in direct file organization.
• Each bucket contains n records, as opposed to just one record in a location of a hash table. The number of records in a bucket is called the bucket capacity.
• Bucket: a storage cell in which data may be accumulated. [A bucket is the largest quantity of data that can be retrieved in one access.]
• There can be b buckets capable of holding s records each, i.e. each bucket consists of s slots, each slot being large enough to hold 1 record.
Fig-9 5 buckets, 2 slots per bucket
• When variable-size records are present, the number of slots per bucket is only a rough indicator of the number of records a bucket can hold.
• The actual number will vary dynamically with the size of the records in a particular bucket.
• Basically, we can think of a bucket as a sector in a sector-addressable device or as a block in a block-addressable device.
• For a particular record to be accessed, the bucket in which the record resides must be located, the contents of the bucket brought into a buffer in memory, and then the desired record extracted from the buffer.
• Let us define an address space A of size m such that A = {C+1, C+2, ..., C+m}, where C is an integer constant. Then mb records can be accommodated by A (m bucket addresses, each bucket holding b records).
• Consider a key set S = {X1, X2, ..., Xn} which is a subset of the set K of possible keys, called the key space.
HASHING ALGORITHM
• In a direct (also called random access) file, a transformation or mapping is made from the key of a record to the address of the storage location at which that record is to reside in the file, i.e. the block on disk which contains the record.
• One mechanism used for generating this transformation is called a hashing algorithm.
• A hashing algorithm consists of two components:
1. Hashing function
It defines a mapping from the key space to the address space.
2. Collision-resolution technique
It resolves conflicts that arise when more than one record key is mapped to the same table location.
• The hashing algorithms used for direct files are very similar to those used for tables.
• One important difference: the time to access a record in a table in main memory is on the order of microseconds, whereas the time to access a record in external memory is on the order of milliseconds.
1. HASHING FUNCTION
• A hashing function transforms the key of the record into a direct address by applying a predetermined formula to it.
• The file management system involves an algorithm that takes the key field, performs some mathematical operations on it and produces the relative record number or the actual disk address to which that record will be assigned. This process is commonly referred to as HASHING.
• Hashing is also known as the key-to-address transformation method (randomizing).
• Since the set of addresses produced is much smaller than the set of possible keys, the addresses generated from distinct (different) keys by this method are not always distinct.
• When the formula uses a transformation on keys that produces a uniform spread of addresses across the available file addresses, the process is called hashing and the transformation itself is called a HASH.
• Ideally a hash must spread the calculated addresses uniformly throughout the available addresses, and it should not generate the same address for distinct keys. However, no method of hashing is known which is ideal in the above sense.
• A collision is said to have occurred when two keys are hashed to the same address; the two keys are called synonyms.
• The address calculated by hashing a key is called the home address of the record, and the record which is stored at the home address is called the home record.
• Why do all hashing methods fail to provide a really uniform distribution of keys over addresses?
This is because the keys themselves are not uniformly distributed over the entire range; there are distinct clusters and gaps, which occur from the way in which keys are assigned and modified over a period of time.
• A good feature of this randomizing method is that it minimizes the number of synonyms; one can do this by allotting more space for the file than is actually required to hold all the records.
• The term “Packing Factor” means the percentage of allotted locations that are actually used.
Desired properties of a hashing function
1. Speed
2. Uniform generation of addresses
3. Easily computable
4. Minimizes the number of collisions
Different types of hashing functions
1) Division / Remainder
2) Mid Square
3) Folding
4) Digit Analysis
and many more.
• All hashing algorithms work on numeric keys. Several such algorithms have been developed for hashing.
• The most frequently used randomizing method is the Division Remainder method.
Explanation of different hashing functions
1) Division / Remainder
• One of the first hashing functions and the most widely accepted method. This technique is very simple and often gives good results.
• The key is divided by a positive integer, preferably a prime number which is nearest to the number of addresses allotted to the file, and the remainder obtained from the division becomes the address for that key.
• It is defined as H(x) = x mod m + 1, where m is a prime number.
• The operator mod means the modulo operator. For example, if x = 35 and m = 11 then
H(35) = 35 mod 11 + 1 = 2 + 1 = 3
This method gives a hash value which belongs to the set {1, 2, ..., m}.
• Keys which are close to each other or clustered are mapped to unique addresses. For example, for m = 31, the keys 1000, 1001, ..., 1010 are mapped to the addresses 9, 10, ..., 19.
• This uniformity is a disadvantage if two or more clusters are mapped to the same addresses. For example, another cluster of keys 2300, 2301, ..., 2313 is mapped to the addresses 7, 8, ..., 20.
• The reason for this phenomenon is that some keys in the two clusters give the same remainder when m = 31.
• In general, it is uncommon for a number of keys to give the same remainder when m is a large prime number.
• In practice it has been found that odd divisors without factors less than 20 are also satisfactory.
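A minimal Python version of this function, checked against the worked examples above:

def h_division(key, m):
    # Division / remainder hashing: H(x) = x mod m + 1
    return key % m + 1

assert h_division(35, 11) == 3
assert [h_division(k, 31) for k in range(1000, 1011)] == list(range(9, 20))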
Advantages
1. Easy to apply
Disadvantages
1. If the divisor is not chosen carefully, many hashing collisions can occur and the method then gives poor results.
2. Large amount of unused space.
2) Mid Square
• Another hashing function that has been widely used in many applications is the “middle of square” function.
• In this method the key is multiplied by itself and the address is obtained by selecting an appropriate number of bits or digits from the middle of the square.
• The same positions in the square must be used for all the products.
• Usually the number of bits or digits chosen depends on the table size, and consequently the address can fit into one computer word of memory.
• E.g. consider the six-digit key 123456. Squaring this key gives 15241383936. If a 3-digit address is required, positions 5 to 7 can be chosen, giving the address 138.
• This method has certain drawbacks, but gives good results for certain key sets.
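A small Python sketch of the mid-square calculation, with digit positions counted from 1 at the left of the square as in the example:

def h_midsquare(key, start=5, end=7):
    digits = str(key * key)
    return int(digits[start - 1:end])   # digits in positions start..end

assert h_midsquare(123456) == 138      # 123456^2 = 15241383936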
3) Folding
• In this method the key is split (partitioned) into a number of parts. Each part has the same length as the required address, with the possible exception of the last part.
• The parts are then added together, ignoring the final carry, to form the address.
• For example, take the key 356942781. Its 3-digit address in the fold shifting method: 356, 942 and 781 are added to give 2079; dropping the carry gives 079.
• A variation of the basic method reverses the digits in the outermost parts. This variation is called the fold boundary method. In this example, 653, 942 and 187 are added to give 1782; dropping the carry gives 782.
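Both folding variants can be sketched in a few lines of Python, assuming 3-digit parts and a 3-digit address; the final carry is dropped by taking the sum modulo 1000.

def fold_shift(key, width=3):
    parts = [key[i:i + width] for i in range(0, len(key), width)]
    return sum(int(p) for p in parts) % 10 ** width

def fold_boundary(key, width=3):
    parts = [key[i:i + width] for i in range(0, len(key), width)]
    parts[0], parts[-1] = parts[0][::-1], parts[-1][::-1]  # reverse outer parts
    return sum(int(p) for p in parts) % 10 ** width

assert fold_shift("356942781") == 79       # 2079 with the carry dropped
assert fold_boundary("356942781") == 782   # 1782 with the carry dropped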
4) Digit Analysis
• This method forms addresses by selecting and shifting digits or bits of the original key. This function is distribution-dependent.
• For example, the key 7546123 is transformed to the address 2164 by selecting the digits in positions 3 to 6 and reversing their order.
• For a given key set, the same positions in the key and the same rearrangement pattern must be used consistently.
• Initially, an analysis of a sample of the key set is performed to determine which key positions should be used in forming an address.
• Digit positions having the most uniform distributions (i.e. the smallest peaks & valleys) are selected.
• As an example, consider the digit analysis of a sample key set.
A total of 5000 ten-digit keys are analyzed to determine which key positions should be used in forming addresses in the address space {0, 1, ..., 9999}.
Positions 2, 4, 5 & 9 have the most uniform distribution of digits, so they are selected.
For example, the key 1234567890 is transformed to the address 9542 by selecting the digits in positions 2, 4, 5 & 9 and reversing their order.
• This hashing transformation technique has been used in conjunction with static key sets (i.e. key sets that do not change over time).
• This method relies on the digits in some of the key positions being approximately equally distributed. If such is not the case, the method cannot be used with good results.
Disadvantage
This method requires extremely extensive manipulation of all the keys and is not practical for files that receive even a single new update record.
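A Python sketch of the transformation, assuming the selected positions (counted from 1) are concatenated and then reversed, as in the two worked examples above:

def h_digit_analysis(key, positions):
    selected = "".join(str(key)[p - 1] for p in positions)
    return int(selected[::-1])          # reverse the selected digits

assert h_digit_analysis(7546123, [3, 4, 5, 6]) == 2164
assert h_digit_analysis(1234567890, [2, 4, 5, 9]) == 9542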
2. COLLISION-RESOLUTION TECHNIQUE
• Hashing consists of two stages: transforming the key to spread the calculated addresses as uniformly as possible among the available addresses, and the overflow techniques for directing synonyms to other addresses.
• The second aspect of a hashing algorithm is the collision-resolution technique. In a direct file, the smallest addressable unit is the bucket, which may contain many records mapped to the same address.
• Hence, in a direct file with a given bucket capacity, a certain number of collisions are expected.
• When there are more colliding records for a given bucket than the bucket capacity, however, some method must be found for handling these overflow records.
• The term overflow-handling technique is therefore used in place of collision-resolution technique, the term commonly adopted for hash-table methods.
CLASSES OF COLLISION-RESOLUTION TECHNIQUES
There are 2 classes of collision-resolution techniques.
1. Open addressing and
2. Chaining
The same general classification can be applied to overflow-handling techniques.
1. OPEN ADDRESSING
i. Linear Probing
• When we use a bucket with a capacity greater than 1, we are in fact imposing a restricted linear-probe form of open addressing.
• When a record is added to a bucket that is not full, the new record is added at the next open location.
• Of course, the next open record location is in the same bucket and is already reserved for records that are mapped to that bucket address.
• We refer to the bucket referenced by the address calculation function of a record as the primary bucket for that record.
• If a record is not present in the primary bucket, it is located in an overflow bucket, or it is not in the file.
• Since the complete contents of a bucket are brought into main memory with one request, it is very beneficial if the desired record is located somewhere in the primary bucket.
• If the record is not in the primary bucket, a request must be made to bring in an overflow bucket, as determined by the overflow-handling method.
• As with standard hashing, it may be necessary to mark locations as previously occupied or never occupied, so that a search need not consider (further) overflow buckets if never-occupied locations exist.
• If a linear-probe open-addressing method of handling overflows is used, successive searches are made of the records in the remaining buckets in the file.
• The search is terminated successfully when the record is located.
• However, it is terminated unsuccessfully if a never-occupied record location is encountered in a bucket or if the search returns to the original bucket tested.
• But the file system should be designed so that the latter situation never happens, as all the disk reads needed to access all buckets would effectively stop the whole system for some extended period of time.
Example: Assume that insertions are performed in the following order:
NODE, STORAGE, AN, ADD, FUNCTION, B, BRAND and PARAMETER.
The name NODE is mapped into 1.
The name STORAGE is mapped into 2.
The names AN and ADD are mapped into 3.
The names FUNCTION, B, BRAND and PARAMETER are mapped into 9.

Position | Contents  | Number of probes
R1       | NODE      | 1
R2       | STORAGE   | 1
R3       | AN        | 1
R4       | ADD       | 2
R5       | PARAMETER | 8
R6       | Empty     |
R7       | Empty     |
R8       | Empty     |
R9       | FUNCTION  | 1
R10      | B         | 2
R11      | BRAND     | 3

• The first 3 keys are placed in a single probe each.
• ADD must go into position 4 instead of 3 because 3 is already occupied, so the number of probes is 2.
• FUNCTION is placed in position 9 in 1 probe.
• B and BRAND take 2 and 3 probes respectively.
• PARAMETER is placed into position 5 after 8 probes, because positions 9, 10, 11, 1, 2, 3 and 4 are occupied.
Drawback of Linear Probing
The trend is for long sequences of occupied positions (filled slots) to become longer.
This is known as Primary Clustering.
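A Python sketch that reproduces the probe counts of this example; the home table simply encodes the mappings stated above in place of a real hashing function.

home = {"NODE": 1, "STORAGE": 2, "AN": 3, "ADD": 3,
        "FUNCTION": 9, "B": 9, "BRAND": 9, "PARAMETER": 9}

def insert(table, key):
    pos, probes = home[key] - 1, 1          # positions are 1-based in the text
    while table[pos] is not None:           # probe successive slots, wrapping
        pos, probes = (pos + 1) % len(table), probes + 1
    table[pos] = key
    return probes

table = [None] * 11
probes = {k: insert(table, k) for k in home}  # insertion order as listed
assert probes["PARAMETER"] == 8 and table[4] == "PARAMETER"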
ii. Random Probing
• The effect of primary clustering can be reduced by selecting a different probing technique called random probing.
• This method generates a random sequence of positions rather than an ordered sequence as in the linear probing method.
• The random sequence generated must contain every position between 1 and m exactly once. A table is full when the first duplicate position is generated.
• An example of a random number generator is: y ← (y + c) mod m
where y is the initial position number of the random sequence, and c and m are integers that are relatively prime to each other (i.e. their greatest common divisor is 1).
For example, m = 11, c = 5 and starting with the initial value y = 2 generates the sequence 7, 1, 6, 0, 5, 10, 4, 9, 3, 8, and 2.
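The generator can be written directly in Python; it yields positions until the first duplicate appears.

def random_probe_sequence(y, c, m):
    start = y
    while True:
        y = (y + c) % m
        yield y
        if y == start:                  # first duplicate: the table is full
            return

assert list(random_probe_sequence(2, 5, 11)) == [7, 1, 6, 0, 5, 10, 4, 9, 3, 8, 2]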
Drawback of Random Probing
Clustering occurs when 2 keys are hashed to the same value. In such a case, the same sequence of positions is generated for both keys by the random probe method. This is known as Secondary Clustering.
iii. Double Hashing or Rehashing - Solution for Secondary Clustering

To solve the secondary clustering problem, a second hashing function is used.

For example, if H1 is the first hashing function, suppose
H1(x1) = H1(x2) = i for x1 ≠ x2, where i is the hash value.
That is, the two keys x1 and x2 have the same hash value.
Now, we use a second hashing function H2 such that H2(x1) ≠ H2(x2). That is,
the hash values of the two keys x1 and x2 under the second hashing function
are not the same.

This variation of open addressing is known as Double Hashing or Rehashing.
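
As a minimal sketch, the pair of functions below (Python; h1 and h2 are illustrative choices, and m is assumed prime so every position is reachable) shows how the second function supplies a key-dependent probe step:

    def h1(key, m):
        return key % m                   # first hashing function

    def h2(key, m):
        return 1 + key % (m - 2)         # second function: the probe step

    def probe_sequence(key, m):
        """Positions examined for key under double hashing."""
        pos, step = h1(key, m), h2(key, m)
        for _ in range(m):
            yield pos
            pos = (pos + step) % m       # step differs between colliding keys

Two keys that collide under h1 generally receive different steps from h2, so their probe sequences diverge and secondary clustering is avoided.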

However, neither a random probe form of open addressing nor double hashing is
a good overflow method to use for direct files.

In both methods, the sequences of overflow buckets examined do not exhibit the
property of physical adjacency.

Physical adjacency can be important, since records that are not physically
adjacent have a higher probability of requiring a seek on a movable-head
storage device. The extra seek time may be prohibitive.
2. CHAINING

Overflow records can also be chained from the primary area to a separate
overflow area.

An overflow record should reside in an overflow bucket that is in the same
seek area (e.g., the same cylinder) as the primary bucket for the record.

A simple strategy would be to set aside the last few buckets of a seek area
strictly for overflow records from the primary buckets in that area.

However, groups of prime-area buckets and overflow buckets will now be
interspersed (scattered) throughout the file space.

Thus the overflow buckets would break up the linear addressing scheme
required for direct addressing.

To keep the hashing function simple, an independent overflow area that is totally
separate from the prime area may be preferred.

Another possibility is for the overflow area to be organized as a sequence of
overflow buckets, each of which can hold several logical records.
i. Chaining with coalescing lists

The logic of creating and maintaining the file depends primarily on the
overflow records: where they are to be placed and how they can be located
quickly. A common and efficient method for this is the chaining method.

Here, overflow records are placed in available buckets in the prime area of the
direct file. Overflows are located using pointers from one bucket to another.

Therefore, when a key is hashed to a bucket, a search starts through a chain of
buckets until the required record is found or an empty storage location is found.

In this method, one record location in each bucket is used as a chaining
record to provide a link between the home bucket and an overflow bucket.
Overflow records are written in the next available bucket. The figure below
shows the chaining method.

The order in which the records are loaded is A1, B1, A2, D1, C1, A3, C2, A4
(overflow), B2, C3, B3 (overflow), A5 (overflow).
Bucket   Chaining record   Data records
A        B                 A1   A2   A3
B        D                 B1   A4   B2
C        (blank)           C1   C2   C3
D        (blank)           D1   B3   A5

While locating a record, the search always starts from the home bucket and
then continues through the buckets to which the chaining records point.

For example, suppose we intend to locate record A5.
 We search the home bucket A, where it is not available.
 The chaining record of bucket A points to bucket B, so we search bucket B
next. There too we do not find the record.
 The next bucket pointed to by the chaining record of bucket B is bucket D.
So we go to bucket D, where the record is located.

 If the chaining record is blank, it means that there are no more records
pertaining to that bucket.
For example, if we are looking for record C4, we start searching the home
bucket C, where it is not found.
Since the chaining record of bucket C is blank, there are no more C records.
Hence the record C4 is not in the file.
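
The search just described can be sketched in a few lines of Python. The dictionary layout and the names locate, records and chain are illustrative assumptions:

    def locate(buckets, home, key):
        """Follow chaining records from the home bucket until key is found."""
        b = home
        while b is not None:
            if key in buckets[b]["records"]:
                return b                     # record found in bucket b
            b = buckets[b]["chain"]          # follow the chaining record
        return None                          # blank chain: not in the file

    buckets = {
        "A": {"records": ["A1", "A2", "A3"], "chain": "B"},
        "B": {"records": ["B1", "A4", "B2"], "chain": "D"},
        "C": {"records": ["C1", "C2", "C3"], "chain": None},
        "D": {"records": ["D1", "B3", "A5"], "chain": None},
    }
    print(locate(buckets, "A", "A5"))   # searches A, B, then D -> "D"
    print(locate(buckets, "C", "C4"))   # blank chain at C -> None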
ii. Chaining with separate lists

One of the most popular methods of handling overflow records is called Chaining
with separate lists or separate chaining.

In this method the colliding records are chained into a special overflow area
which is separate from prime area.

Prime area contains the part of table into which records are initially hashed. A
separate linked list is maintained for each set of colliding records.

Therefore, a pointer field is required for each record in the primary and overflow
areas.

Assume that insertions are performed in following order:
NODE, STORAGE, AN, ADD, FUNCTION, B, BRAND and PARAMETER.

Colliding records in each linked list are not kept in alphabetical order.

When a new colliding record is entered in the overflow area, it is placed at the
front of those records in the appropriate linked list of the overflow area.
Primary area (each entry has Key, Data and Link fields):

Position   Key         Link
1          NODE        -
2          STORAGE     -
3          AN          → ADD
4          Empty
5          Empty
6          Empty
7          Empty
8          Empty
9          FUNCTION    → PARAMETER
10         Empty
11         Empty

Overflow area (each entry has Key, Data and Link fields):

Key          Link
ADD          -
B            -
BRAND        → B
PARAMETER    → BRAND
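
A minimal Python sketch of this front-insertion rule (the slot layout with key and link fields is an illustrative assumption) reproduces the chains shown above:

    def insert(prime, hash_value, key):
        slot = prime[hash_value]
        if slot["key"] is None:
            slot["key"] = key            # first record goes in the prime area
        else:                            # collision: front of the linked list
            slot["link"] = {"key": key, "next": slot["link"]}

    prime = {i: {"key": None, "link": None} for i in range(1, 12)}
    for h, k in [(1, "NODE"), (2, "STORAGE"), (3, "AN"), (3, "ADD"),
                 (9, "FUNCTION"), (9, "B"), (9, "BRAND"), (9, "PARAMETER")]:
        insert(prime, h, k)
    # Slot 9 holds FUNCTION; its overflow chain is PARAMETER -> BRAND -> B.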
More efficient representation

In this representation, all the records reside in the overflow area while the prime
area contains only pointers.

For a table with large records, this approach results in a heavily packed overflow
area. But the hash table can be made large without wasting much storage.

Assume that insertions are performed in following order:
NODE, STORAGE, AN, ADD, FUNCTION, B, BRAND and PARAMETER.
Position   Chain of records (all records reside in the overflow area)
1          → NODE
2          → STORAGE
3          → ADD → AN
4          Empty
5          Empty
6          Empty
7          Empty
8          Empty
9          → PARAMETER → BRAND → B → FUNCTION
10         Empty
11         Empty
OTHER WAYS OF ORGANIZING A DIRECT FILE

There are other ways of organizing a direct file that are less popular, but
nevertheless may be applicable in some situations.

Such techniques are:
1. Direct addressing
2. Cross reference table or Indexing
3. Indexing method involving binary tree
1. Direct addressing

If the number of records in the file is relatively small and the record size
is relatively large (i.e., only a few records per bucket are achievable), it
may be worthwhile to consider a direct addressing scheme.

If the size of K is equal to the number of record locations in A, and the keys
are consecutive, then a transformation can be defined that assigns each
possible key value to a specific record location in a specific bucket. This
type of one-to-one transformation is termed direct addressing.

Direct addressing is one of the methods used in random organization to
establish the relationship between the record key and its address.

In this method, the record key actually equals the record address. For this,
the records must contain some field that may be directly used both as the key
and as the address.

Consider an employee file where the key is the employee number. Employee
numbers range from 1 to 9999, so we must make provision for 9999 records in
the file.

Thus, if the key set is not compact, much storage space is wasted when this
system is used.

Whenever direct addressing is feasible, it is optimally efficient, because
only one storage access is required to obtain any record and no time is wasted
in scanning any additional file or index.

In most situations, however, S is a small subset of K, and direct addressing
results in very low utilization of direct-access storage. Instead, indirect
addressing is implemented.

That is, S is mapped into A with the distinct possibility that so many records
will be mapped to the same bucket that a bucket overflow results.

When this happens, a technique for handling bucket overflow must be used to
store any overflow records.
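
The essence of direct addressing, that the key alone computes the record's location with no search at all, can be sketched as follows (Python; the fixed record size and file layout are illustrative assumptions):

    RECORD_SIZE = 64                     # fixed-length records, in bytes

    def read_employee(f, employee_number):
        """One storage access: seek straight to the record for this key."""
        f.seek((employee_number - 1) * RECORD_SIZE)   # keys run 1 to 9999
        return f.read(RECORD_SIZE)

    # The file must reserve space for all 9999 possible keys, so a sparse
    # key set wastes storage, as noted above.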
2. Cross reference table or Indexing

Several methods for achieving direct-address translation involve the use of
cross-referencing or indexing.

A cross-reference table is simply a table of keys and addresses, where the
address is the external storage address of the record with the associated key.

Following table illustrates a cross-reference table of surnames and external
storage addresses for a sector-addressable device.
Cross-Reference Table

Surname     External Address (cylinder, track, sector)
Ashcroft    1,07,04
Barnsley    1,07,05
Bernard     1,08,15
Duke        1,06,09
Edder       1,07,00
Groff       1,06,10
Katz        1,06,12
Murray      1,08,13
Paulsen     1,06,00
Smith       1,06,03
Thomas      1,06,04
Tollard     1,06,11
Yu          1,07,01

Locating a record, given its key, is simply a matter of retrieving from the
table the external address associated with the key and then issuing an I/O
command that directly retrieves the bucket containing the desired record.

In most programming languages, however, the programmer must maintain the
cross-reference table.

Since the cross-reference table is essentially a dictionary, any of the standard
implementations of a dictionary can be used.

The table can be kept as an unordered list and can therefore accommodate
additions with ease. A linear search is then required to find an address.

Alternatively, the table can be implemented as a list ordered by the key set, and
a binary search can be employed to locate an address more rapidly.

However, record additions and deletions present problems because the table
must be maintained in sorted order.
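
A minimal sketch of the ordered-table variant, using binary search over a key-sorted list (Python; the sample entries are taken from the cross-reference table above):

    import bisect

    keys      = ["Ashcroft", "Barnsley", "Bernard", "Duke", "Edder"]
    addresses = [(1, 7, 4), (1, 7, 5), (1, 8, 15), (1, 6, 9), (1, 7, 0)]

    def lookup(key):
        """Binary-search the sorted key list for the external address."""
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return addresses[i]          # one external read then fetches it
        return None                      # key not in the table

    print(lookup("Duke"))                # -> (1, 6, 9)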
Advantage
Only one external read is required to access a record.
Disadvantage
Size of the file is restricted by the size of the table that can be stored in main
memory.
3. Indexing method involving binary tree
[Fig-10: Binary-tree cross-reference indexing scheme. Internal nodes hold
comparison keys; a search branches left when the search key K is less than the
node key (<) and right otherwise (≥). The key-address pairs sit in the leaves:
Ashcroft 1,07,04; Barnsley 1,07,05; Bernard 1,08,15; Duke 1,06,09;
Edder 1,07,00; Groff 1,06,10.]

Indexing methods involving binary trees, m-ary trees, and trie structures can also
be chosen to achieve direct addressing in direct files.

With tree-structured methods record additions and deletions can be handled
more effectively.

A binary tree indexing scheme for the cross-reference table is given in
Fig-10. Note that in this tree, key-address pairs are stored only in the leaves.

In fact, a key value need not be stored in a leaf, and the address in a leaf can be
the bucket for several records.
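
A minimal sketch of such a leaf-storing tree (Python; the node shapes and the particular comparison keys are illustrative assumptions):

    def find(node, key):
        """Branch left on '<', right on '>=', until a leaf is reached."""
        while "test" in node:                            # internal node
            node = node["lt"] if key < node["test"] else node["ge"]
        return node["addr"] if node["key"] == key else None

    tree = {"test": "D",
            "lt": {"test": "Barnsley",
                   "lt": {"key": "Ashcroft", "addr": (1, 7, 4)},
                   "ge": {"key": "Barnsley", "addr": (1, 7, 5)}},
            "ge": {"key": "Duke", "addr": (1, 6, 9)}}
    print(find(tree, "Duke"))                            # -> (1, 6, 9)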
PROCESSING DIRECT FILES

Processing of a direct file is dependent on how the key set for the records is
transformed into external device addresses.
Random access

Direct files are primarily processed directly.

That is, a key is mapped to an address, and depending on the nature of the
file transaction, a record is created, deleted, updated or accessed.

A record may be at that address, or possibly at some subsequent (successive)
address if a collision takes place. The subsequent address is determined by
the overflow-handling technique that is adopted.
Serial access

In some instances, it may be necessary to perform an identical transaction on all,
or nearly all, records in the file.

For example, in a credit card application, it may be desirable to print monthly
bills. The generation of monthly bills can be accomplished by accessing the
records in a physically sequential (i.e., serial) manner.

For most direct applications, serial access presents no problems. Access
commences at the physical beginning and terminates at the physical end of the
file.

If the file uses a separate overflow area, however, it may be difficult to access
this area in a serial fashion independent of the prime area.

A logically consistent yet potentially time-consuming method of serial access
of a separate overflow area (sketched below) is:
 to read all the records in the first prime-area bucket, followed by all the
overflow records for this bucket, and then
 to return to read the next prime-area bucket, followed by its overflow
records, and so on, until all records in the file have been accessed.
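
A minimal sketch of this traversal (Python; overflow_chain is a hypothetical helper that yields the overflow records chained from a given prime-area bucket):

    def serial_scan(prime_buckets, overflow_chain):
        """Yield every record: each prime bucket, then its overflows."""
        for bucket in prime_buckets:          # physical beginning to end
            for record in bucket:
                yield record                  # prime-area records first
            for record in overflow_chain(bucket):
                yield record                  # then this bucket's overflows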
Deletion of record

Many systems that support direct files simply mark deleted records and recover
the space occupied by these records only when a new record can be added to
the file at the marked location.

In the meantime, the space occupied by a deleted record may be needlessly
examined in searches of the bucket.

In addition, deleted records affect performance when probing or chaining through
a file in the search for an overflow record.

It is sometimes possible to remove deleted records logically, especially where
chaining is involved; however, this is rarely done.

Instead, it is the programmer’s responsibility to monitor the activity of the file and
reorganize it whenever performance degrades significantly.

Reorganization can be accomplished by reading the file serially and creating a
new direct file that involves only the active records of the old file.
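
A minimal sketch of such a reorganization (Python; serial_scan, is_deleted and insert are hypothetical helpers standing in for the routines above):

    def reorganize(old_file, new_file, serial_scan, is_deleted, insert):
        """Copy only the active records of the old file into a new one."""
        for record in serial_scan(old_file):
            if not is_deleted(record):
                insert(new_file, record)      # re-hash into the new file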
Summary
We complete this subsection by summarizing the important properties related to
direct files:
1. Direct access to records in a direct file is rapid, especially for files with low load
factors and few overflow records.
2. Because a certain portion of the file remains unused in order to prevent an
excessive number of overflow records, the space utilization for a direct file is
poor when compared with a sequential file.
3. The performance attained by using a direct file is very dependent upon the
key-to-address transformation algorithm adopted. The transformation used is
application dependent and is generally implemented and maintained through the
users’ programs.
4. Records can be accessed serially, but not sequentially, unless a separate
ordered list of keys is maintained.
ADVANTAGES OF RANDOM ORGANIZATION
1. Any record can be accessed with a single access, the address being
calculated by any of the three methods described earlier; the keys of other
records need not be examined, as they must be in sequential storage, when
trying to retrieve a particular record.
2. Individual records can be stored, retrieved and updated without affecting
other records in storage. This increases the efficiency of updating the file.
DISADVANTAGES OF RANDOM ORGANIZATION
1. All records should be of fixed length; variable-length records cannot be
handled without creating a trailer file. This requirement is sometimes imposed
by the way hardware addresses reference the storage media.
2. Using dictionary look-up, there is the extra storage space of the
dictionary and the extra processing required in searching it.
3. Using hashing, there is the extra operation of hashing to obtain the
address, the necessity for overflow addressing to handle synonyms, the unused
file locations resulting from the hashing, and the necessity to keep track of
unused file locations for use as overflow.
4. Using direct addressing, there are unused file locations for which records
do not exist.
5. In general, sequential processing cannot be performed easily. Also,
processing sequential files (on cards or tape) automatically provides a backup
file for reconstruction purposes as a by-product of the processing. In random
organization, special provisions must be made for backup and reconstruction,
e.g., periodically dumping the file on magnetic tape or cards.
Disclaimer
The study material is compiled by Ami D. Trivedi. The basic objective of this material
is to supplement teaching and discussion in the classroom in the subject. Students
are required to go for extra reading in the subject through library work.