Vidyalankar
B.E. Sem. VII [INFT]
Data Warehousing and Mining & Business Intelligence
Prelim Question Paper Solution
1. (a) BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm used to perform hierarchical clustering over particularly large data-sets. An
advantage of Birch is its ability to incrementally and dynamically cluster incoming, multidimensional metric data points in an attempt to produce the best quality clustering for a given set
of resources (memory and time constraints). In most cases, Birch only requires a single scan of
the database. In addition, Birch is recognized as the "first clustering algorithm proposed in the
database area to handle 'noise' (data points that are not part of the underlying pattern) effectively".
Previous clustering algorithms performed less effectively over very large databases and did not
adequately consider the case wherein a data-set was too large to fit in main memory. As a result,
there was a lot of overhead in maintaining high clustering quality while minimizing the cost of
additional I/O (input/output) operations. Furthermore, most of Birch's predecessors inspect all data
points (or all currently existing clusters) equally for each 'clustering decision' and do not perform
heuristic weighting based on the distance between these data points.
Advantages with BIRCH
It is local in that each clustering decision is made without scanning all data points and currently
existing clusters. It exploits the observation that data space is not usually uniformly occupied and
not every data point is equally important. It makes full use of available memory to derive the
finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does
not require the whole data set in advance.
BIRCH Clustering Algorithm
For this we first define the following concepts:
Clustering Feature : Given N d-dimensional data points in a cluster, {Xi}, the CF vector of the cluster is
defined as a triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the data
points.
CF tree : A CF tree is a height-balanced tree with two parameters: branching factor B and
threshold T. Each non-leaf node contains at most B entries of the form [CFi, childi], where childi is
a pointer to its ith child node and CFi is the subcluster represented by this child. A leaf node
contains at most L entries, each of the form [CFi]. It also has two pointers, prev and next, which
are used to chain all leaf nodes together. The tree size is a function of T: the larger T is, the
smaller the tree. We also require each node to fit in a page of size P. B and L are determined by
P, so P can be varied for performance tuning. The tree is a very compact representation of the dataset
because each entry in a leaf node is not a single data point but a subcluster.
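As a rough illustration of the additivity that makes the CF tree work, the sketch below (not the original BIRCH code; the names and the scalar treatment of SS are illustrative) keeps one CF triple per subcluster and derives the centroid and radius from it.

import numpy as np

class CF:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N = 1              # number of points in the subcluster
        self.LS = p.copy()      # linear sum of the points
        self.SS = float(p @ p)  # square sum (kept as a scalar sum of squared norms here)

    def merge(self, other):
        # CF vectors are additive: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # average distance of member points from the centroid, derived from (N, LS, SS) only
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - c @ c, 0.0))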
In the first step the algorithm scans all data and builds an initial in-memory CF tree using the
given amount of memory. In the second step it scans all the leaf entries in the initial CF tree to
rebuild a smaller CF tree, while removing outliers and grouping crowded subclusters into larger
ones. In step three an existing clustering algorithm is used to cluster all leaf entries: an
agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented
by their CF vectors. It also provides the flexibility of allowing the user to specify either the
desired number of clusters or the desired diameter threshold for clusters. After this step we obtain
a set of clusters that captures the major distribution patterns in the data. However, there might exist
minor and localized inaccuracies, which can be handled by an optional step 4. In step 4 we use the
centroids of the clusters produced in step 3 as seeds and redistribute the data points to their closest
seeds to obtain a new set of clusters. Step 4 also provides the option of discarding outliers:
a point that is too far from its closest seed can be treated as an outlier.
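For completeness, scikit-learn ships a BIRCH implementation whose parameters mirror the concepts above (threshold T, branching factor B, and an optional global clustering step); a minimal usage sketch on synthetic data:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three well-separated synthetic blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)   # builds the CF tree, then runs the global clustering step
print(len(model.subcluster_centers_), len(set(labels)))  # subclusters found, final clusters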
1. (b)
The five KDD Steps
The KDD Process stands for the Knowledge Discovery in Databases. According to Fayyad there
are five steps: Selection, Pre-processing, Transformation, Data Mining and Interpretation. These
five steps are passed through iteratively. Every step can be seen as a work-through phase. Such a
phase requires the supervision of a user and can lead to multiple results. The best of these results
is used for the next iteration, the others should be documented. In the following, the steps will be
briefly described.
1. In the Selection step the significant data gets selected or created. From here on, the KDD
process operates on the gathered target data. Only relevant information is selected, together with
metadata or data that represents background knowledge. Sometimes the combination of
data from different sources can be useful, but possible compatibility issues have to be
observed.
2. A good result after applying data mining depends on appropriate data preparation at the
beginning. Important elements of the provided data have to be detected and filtered out.
These tasks are handled in the Pre-processing phase. To detect knowledge effectively, the
main task is to pre-process the data properly, not merely to apply data mining tools.
The less noise contained in the data, the higher the efficiency of data mining. Elements of
pre-processing span the cleaning of wrong data, the treatment of missing values and the
creation of new attributes.
3. The data also needs to be transformed into a data-mining-capable format. The Transformation
phase may result in a number of different data formats, since different data mining
tools may require different formats. The data is also manually or automatically reduced. The
reduction can be made via lossless aggregation or a lossy selection of only the most
important elements. A representative selection can be used to draw conclusions about the entire
data set.
4. In the Data Mining phase, the data mining task is approached. Fayyad gives a classified
overview of existing data mining techniques and suggests which technique may
be used for which objectives, though most of these techniques have since been improved. The output of
this step is the detected patterns.
5. The interpretation of the detected patterns reveals whether or not they are interesting,
that is, whether they contain knowledge at all. This is why this step is also called evaluation.
The task is to represent the result in an appropriate way so it can be examined thoroughly. If
the located pattern is not interesting, the cause has to be found out. It will probably be
necessary to fall back on a previous step for another attempt.
The knowledge detected by the KDD process is usually used to support the decisions of the
management. Therefore it flows into a Decision Support System (DSS) or into marketing
automation for direct marketing purposes.
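A schematic sketch of the five steps chained as a pipeline is given below; every function name and the toy records are illustrative placeholders, not a real API.

def selection(sources):       # 1. pick the relevant target data
    return [r for r in sources if r.get("relevant")]

def preprocessing(records):   # 2. clean noise, handle missing values
    return [r for r in records if r.get("value") is not None]

def transformation(records):  # 3. bring the data into a mining-capable format
    return [[r["value"]] for r in records]

def data_mining(table):       # 4. apply a mining technique (here a trivial "pattern")
    return {"mean_value": sum(x[0] for x in table) / len(table)}

def interpretation(pattern):  # 5. evaluate whether the pattern is interesting
    return pattern if pattern["mean_value"] > 0 else None

raw = [{"relevant": True, "value": 4}, {"relevant": True, "value": None},
       {"relevant": False, "value": 7}, {"relevant": True, "value": 2}]
print(interpretation(data_mining(transformation(preprocessing(selection(raw))))))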
2. (a) Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text
analytics, refers to the process of deriving high-quality information from text. High-quality
information is typically derived through the devising of patterns and trends through means such
as statistical pattern learning. Text mining usually involves the process of structuring the input
text (usually parsing, along with the addition of some derived linguistic features and the removal
of others, and subsequent insertion into a database), deriving patterns within the structured data,
and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers
to some combination of relevance, novelty, and interestingness. Typical text mining tasks include
text categorization, text clustering, concept/entity extraction, production of granular taxonomies,
sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations
between named entities).
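A minimal sketch of the "structuring the input text" step described above, using scikit-learn's TfidfVectorizer on made-up documents to obtain a term-document matrix that downstream pattern discovery can work on:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The new phone has excellent battery life",
    "Battery life on this phone is poor",
    "Excellent camera, average battery",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)          # sparse documents x terms matrix
print(sorted(vec.vocabulary_))       # the derived vocabulary (the "structured" features)
print(X.shape)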
Applications
Recently, text mining has received attention in many areas.
Security applications
Many text mining software packages are marketed for security applications, especially the analysis of
plain text sources such as Internet news. Text mining is also involved in the study of text encryption.
Biomedical applications
A range of text mining applications in the biomedical literature has been described. One example
is PubGene that combines biomedical text mining with network visualization as an Internet
service. Another text mining example is GoPubMed. Semantic similarity has also been used by
text-mining systems, namely, GOAnnotator.
Software and applications
Text mining methods and software are also being researched and developed by major firms,
including IBM and Microsoft, to further automate the mining and analysis processes, and by
different firms working in the area of search and indexing in general as a way to improve their
results. Within the public sector, much effort has been concentrated on creating software for tracking
and monitoring terrorist activities.
Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to
disambiguate information and to provide readers with greater search experiences, which in turn
increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by
being able to share, associate and package news across properties, significantly increasing
opportunities to monetize content.
Marketing applications
Text mining is starting to be used in marketing as well, more specifically in analytical customer
relationship management.
Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review
is for a movie. Such an analysis may need a labeled data set or labeling of the affectivity of
words. A resource for affectivity of words has been made for WordNet.
Text has been used to detect emotions in the related area of affective computing. Text-based
approaches to affective computing have been used on multiple corpora such as student
evaluations, children's stories and news stories.
2. (b) Data Stream Mining is the process of extracting knowledge structures from continuous, rapid
data records. A data stream is an ordered sequence of instances that in many applications of data
stream mining can be read only once or a small number of times using limited computing and
storage capabilities. Examples of data streams include computer network traffic, phone
conversations, ATM transactions, web searches, and sensor data. Data stream mining can be
considered a subfield of data mining, machine learning, and knowledge discovery.
In many data stream mining applications, the goal is to predict the class or value of new instances
in the data stream given some knowledge about the class membership or values of previous
instances in the data stream. Machine learning techniques can be used to learn this prediction task
from labeled examples in an automated fashion. In many applications, the distribution underlying
the instances or the rules underlying their labeling may change over time, i.e. the goal of the
prediction, the class to be predicted or the target value to be predicted, may change over time.
This problem is referred to as concept drift.
The Hoeffding tree induction algorithm has proven to be one of the best methods for data stream
classification. The algorithm is realised in a system known as VFDT (Very Fast Decision Tree
learner) which encompasses a number of practical considerations. One of these is connected with
ties. Ties occur when two or more attributes have close split evaluation values. Instead of waiting
to see which attribute is superior, a potentially wasteful exercise, VFDT forces a split to be made
on one of the attributes as long as the difference between the split evaluation values is within user
specified bounds.
Hoeffding Tree Algorithm (1)
 Inputs: S is a sequence of examples, X is a set of discrete attributes, G(.) is a split evaluation
function, δ is one minus the desired probability of choosing the correct attribute at any given
node.
 Output: HT is a decision tree.
Hoeffding Tree Algorithm (2)
Procedure HoeffdingTree(S, X, G, δ)
  Let HT be a tree with a single leaf l1 (the root).
  For each class yk
    For each value xij of each attribute Xi ∈ X
      Let nijk(l1) = 0.
  For each example (x, yk) in S
    Sort (x, yk) into a leaf l using HT.
    For each xij in x such that Xi ∈ Xl
      Increment nijk(l).
    If the examples seen so far at l are not all of the same class, then
      Compute Gl(Xi) for each attribute Xi ∈ Xl using nijk(l).
      Let Xa be the attribute with the highest Gl.
      Let Xb be the attribute with the second-highest Gl.
      Compute ε using the Hoeffding bound.
      If Gl(Xa) − Gl(Xb) > ε, then
        Replace l by an internal node that splits on Xa.
        For each branch of the split
          Add a new leaf lm, and let Xm = X − {Xa}.
          For each class yk and each value xij of each attribute Xi ∈ Xm
            Let nijk(lm) = 0.
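The "Compute ε" step above uses the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split-evaluation function G (log2 of the number of classes for information gain) and n is the number of examples seen at the leaf. A small sketch:

import math

def hoeffding_bound(R, delta, n):
    # one-sided Hoeffding bound on the difference between the observed and true mean
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

# Example: two classes (R = 1 bit), delta = 1e-7, 1000 examples seen at the leaf.
eps = hoeffding_bound(R=1.0, delta=1e-7, n=1000)
print(round(eps, 4))  # split on Xa if G(Xa) - G(Xb) > eps (VFDT forces a split on near-ties)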
3. (a) Data mining algorithms embody techniques that have sometimes existed for many years, but have
only lately been applied as reliable and scalable tools that time and again outperform older
classical statistical methods. While data mining is still in its infancy, it is becoming increasingly
widespread. Before data mining develops into a conventional, mature and trusted discipline, many
still pending issues have to be addressed. Some of these issues are addressed below. Note that
these issues are not exclusive and are not ordered in any way.
Security and social issues: Security is an important issue with any data collection that is shared
and/or is intended to be used for strategic decision-making. In addition, when data is collected for
customer profiling, user behaviour understanding, correlating personal data with other
information, etc., large amounts of sensitive and private information about individuals or
companies is gathered and stored. This becomes controversial given the confidential nature of
some of this data and the potential illegal access to the information. Moreover, data mining could
disclose new implicit knowledge about individuals or groups that could be against privacy
policies, especially if there is potential dissemination of discovered information. Another issue
that arises from this concern is the appropriate use of data mining. Due to the value of data,
databases of all sorts of content are regularly sold, and because of the competitive advantage that
can be attained from implicit knowledge discovered, some important information could be
withheld, while other information could be widely distributed and used without control.
User interface issues: The knowledge discovered by data mining tools is useful as long as it is
interesting, and above all understandable by the user. Good data visualization eases the
interpretation of data mining results, as well as helps users better understand their needs. Many
data exploratory analysis tasks are significantly facilitated by the ability to see data in an
appropriate visual presentation. There are many visualization ideas and proposals for effective
data graphical presentation. However, there is still much research to accomplish in order to obtain
good visualization tools for large datasets that could be used to display and manipulate mined
knowledge. The major issues related to user interfaces and visualization are “screen real-estate”,
information rendering, and interaction. Interactivity with the data and data mining results is
crucial since it provides means for the user to focus and refine the mining tasks, as well as to
picture the discovered knowledge from different angles and at different conceptual levels.
Mining methodology issues: These issues pertain to the data mining approaches applied and
their limitations. Topics such as versatility of the mining approaches, the diversity of data
available, the dimensionality of the domain, the broad analysis needs (when known), the
assessment of the knowledge discovered, the exploitation of background knowledge and
metadata, the control and handling of noise in data, etc. are all examples that can dictate mining
methodology choices.
Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most
datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not
obscure, the analysis process and in many cases compromise the accuracy of the results. As a
consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen
as lost time, but data cleaning, as time consuming and frustrating as it may be, is one of the most
important phases in the knowledge discovery process. Data mining techniques should be able to
handle noise in data or incomplete information.
More than the size of data, the size of the search space is even more decisive for data mining
techniques. The size of the search space is often depending upon the number of dimensions in the
domain space. The search space usually grows exponentially when the number of dimensions
increases. This is known as the curse of dimensionality. This “curse” affects so badly the
performance of some data mining approaches that it is becoming one of the most urgent issues to
solve.
Performance issues: Many artificial intelligence and statistical methods exist for data analysis
and interpretation. However, these methods were often not designed for the very large data sets
data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability
and efficiency of the data mining methods when processing considerably large data. Algorithms
with exponential and even medium-order polynomial complexity cannot be of practical use for
data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for
mining instead of the whole dataset.
However, concerns such as completeness and choice of samples may arise. Other topics in the
issue of performance are incremental updating, and parallel programming. There is no doubt that
parallelism can help solve the size problem if the dataset can be subdivided and the results can be
merged later. Incremental updating is important for merging results from parallel mining, or
updating data mining results when new data becomes available without having to re-analyze the
complete dataset.
Data source issues: There are many issues related to the data sources, some are practical such as
the diversity of data types, while others are philosophical like the data glut problem. We certainly
have an excess of data since we already have more data than we can handle and we are still
collecting data at an even higher rate. If the spread of database management systems has helped
increase the gathering of information, the advent of data mining is certainly encouraging more
data harvesting. The current practice is to collect as much data as possible now and process it, or
try to process it, later. The concern is whether we are collecting the right data at the appropriate
amount, whether we know what we want to do with it, and whether we distinguish between what
data is important and what data is insignificant. Regarding the practical issues related to data
sources, there is the subject of heterogeneous databases and the focus on diverse complex data
types.
3. (b) In data warehousing, a fact table consists of the measurements, metrics or facts of a business
process. It is often located at the centre of a star schema or a snowflake schema, surrounded by
dimension tables.
Fact tables provide the (usually) additive values that act as independent variables by which
dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a
fact table represents the most atomic level by which the facts may be defined. The grain of a
SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in
this fact table is therefore uniquely defined by a day, product and store. Other dimensions might
be members of this fact table (such as location/region) but these add nothing to the uniqueness of
the fact records. These "affiliate dimensions" allow for additional slices of the independent facts
but generally provide insights at a higher level of aggregation (a region contains many stores).
If the business process is SALES, then the corresponding fact table will typically contain columns
representing both raw facts and aggregations in rows such as:
 $12,000, being "sales for New York store for 15-Jan-2005"
 $34,000, being "sales for Los Angeles store for 15-Jan-2005"
 $22,000, being "sales for New York store for 16-Jan-2005"
 $50,000, being "sales for Los Angeles store for 16-Jan-2005"
 $21,000, being "average daily sales for Los Angeles Store for Jan-2005"
 $65,000, being "average daily sales for Los Angeles Store for Feb-2005"
 $33,000, being "average daily sales for Los Angeles Store for year 2005"
"average daily sales" is a measurement which is stored in the fact table. The fact table also
contains foreign keys from the dimension tables, where time series (e.g. dates) and other
dimensions (e.g. store location, salesperson, product) are stored.
All foreign keys between fact and dimension tables should be surrogate keys, not reused keys
from operational data.
The centralized table in a star schema is called a fact table. A fact table typically has two types of
columns: those that contain facts and those that are foreign keys to dimension tables. The primary
key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables
contain the content of the data warehouse and store different types of measures: additive,
non-additive, and semi-additive measures.
Measure types
Additive - Measures that can be added across all dimensions.
Non Additive - Measures that cannot be added across any dimension.
Semi Additive - Measures that can be added across some dimensions and not across others.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables).
Special care must be taken when handling ratios and percentages. One good design rule is to never
store percentages or ratios in fact tables but to calculate these only in the data access tool. Thus,
store only the numerator and denominator in the fact table; these can be aggregated, and the
aggregated values can then be used for calculating the ratio or percentage in the data access
tool.
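A small pandas sketch of this design rule, with made-up column names, showing why aggregating the stored numerator and denominator gives the correct ratio while averaging pre-computed row-level ratios does not:

import pandas as pd

fact = pd.DataFrame({
    "store":       ["NY", "NY", "LA", "LA"],
    "returns":     [   5,    1,    20,    2],   # numerator
    "sales_units": [ 100,   10,  1000,   10],   # denominator
})

# Correct: aggregate numerator and denominator first, then divide in the access tool.
agg = fact.groupby("store")[["returns", "sales_units"]].sum()
print(agg["returns"] / agg["sales_units"])                          # NY 6/110, LA 22/1010

# Wrong: averaging pre-computed row-level ratios weights every row equally.
print((fact["returns"] / fact["sales_units"]).groupby(fact["store"]).mean())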
In the real world, it is possible to have a fact table that contains no measures or facts. These tables
are called "factless fact tables", or "junction tables".
The "Factless fact tables" can for example be used for modeling many-to-many relationships or
capture events.
Types of fact tables
There are basically three fundamental measurement events, which characterize all fact tables.[2]
Transactional
A transactional table is the most basic and fundamental. The grain associated with a transactional
fact table is usually specified as "one row per line in a transaction", e.g., every line on a receipt.
Typically a transactional fact table holds data of the most detailed level, causing it to have a great
number of dimensions associated with it.
Periodic snapshots
The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment
could be any defined period of time, e.g. a performance summary of a salesman over the previous
month. A periodic snapshot table is dependent on the transactional table, as it needs the detailed
data held in the transactional fact table in order to deliver the chosen performance output.
Accumulating snapshots
This type of fact table is used to show the activity of a process that has a well-defined beginning
and end, e.g., the processing of an order. An order moves through specific steps until it is fully
processed. As steps towards fulfilling the order are completed, the associated row in the fact table
is updated. An accumulating snapshot table often has multiple date columns, each representing a
milestone in the process. Therefore, it's important to have an entry in the associated date
dimension that represents an unknown date, as many of the milestone dates are unknown at the
time of the creation of the row.
4. (a) Numerosity Reduction
Sampling is a typical numerosity reduction technique. There are several ways to construct a
sample:
 Simple random sampling without replacement – performed by randomly choosing n1 data
points such that n1 < n. n is the number of data points in the original dataset D.
 Simple random sampling with replacement – we are selecting n1 < n data points, but draw
them one at a time (n1 times). In such a way, one data point can be drawn multiple times in
the same subsample.
 Cluster sample – examples in D are originally grouped into M disjoint clusters. Then a simple
random sample of m < M clusters can be drawn.
 Stratified sample – D is originally divided into disjoint parts called strata. Then, a stratified
sample of D is generated by obtaining a simple random sample at each stratum. This helps in
getting a representative sample, especially when the data is skewed (say, many more examples
of class 0 than of class 1). Stratified samples can be proportionate or disproportionate. (A short
code sketch of these schemes follows this list.)
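The sketch below illustrates the four schemes on synthetic data (the column names, sizes and seeds are arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
D = pd.DataFrame({
    "value":   rng.normal(size=1000),
    "cls":     rng.choice([0, 1], size=1000, p=[0.9, 0.1]),   # skewed class distribution
    "cluster": rng.integers(0, 20, size=1000),                 # M = 20 disjoint clusters
})

srswor = D.sample(n=100, replace=False, random_state=1)        # simple random sample without replacement
srswr  = D.sample(n=100, replace=True,  random_state=1)        # simple random sample with replacement

chosen = rng.choice(D["cluster"].unique(), size=5, replace=False)
cluster_sample = D[D["cluster"].isin(chosen)]                  # cluster sample: m = 5 of M = 20 clusters

# Proportionate stratified sample: 10% from each class stratum.
stratified = D.groupby("cls", group_keys=False).sample(frac=0.1, random_state=1)
print(stratified["cls"].value_counts())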
Numerosity Reduction
Data volume can be reduced by choosing alternative forms of data representation.
Parametric
 Regression (a model or function estimating the distribution instead of the data)
Nonparametric
 Histograms
 Clustering
 Sampling
Reduction with Histograms
A popular data reduction technique:
Divide data into buckets and store representation of buckets (sum, count, etc.)
 Equiwidth (histogram with bars having the same width)
 Equidepth (histogram with bars having the same height)
 VOptimal (histrogram with least variance  (countb *valueb)
 MaxDiff (bucket boundaries defined by user specified threshold)
Related to quantization problem.
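A minimal sketch contrasting equi-width and equi-depth bucketing, keeping only a count and a sum per bucket as the reduced representation (the data values are made up):

import pandas as pd

values = pd.Series([1, 1, 2, 2, 3, 5, 8, 13, 21, 34, 55, 89])

equiwidth = pd.cut(values, bins=4)     # buckets covering equal value ranges
equidepth = pd.qcut(values, q=4)       # buckets holding (roughly) equal numbers of values

# Store only a representation of each bucket (count, sum) instead of the raw data.
print(values.groupby(equiwidth, observed=True).agg(["count", "sum"]))
print(values.groupby(equidepth, observed=True).agg(["count", "sum"]))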
Reduction with Clustering :
Partition data into clusters based on “closeness” in space. Retain representatives of clusters
(centroids) and outliers. Effectiveness depends upon the distribution of the data. Hierarchical
clustering is possible (multiresolution).
Reduction with Sampling :
Allows a large data set to be represented by a much smaller random sample of the data (subset).
 How to select a random sample ?
 Will the patterns in the sample represent the patterns in the data?
 Simple random sample without replacement (SRSWOR)
 Simple random sampling with replacement (SRSWR)
 Cluster sample (SRSWOR or SRSWR from clusters)
 Stratified sample (stratum = group based on attribute value)
Random sampling can produce poor results; this remains an area of active research.
Discretization
Discretization is used to reduce the number of values for a given continuous attribute, by dividing
the range of the attribute into intervals. Interval labels are then used to replace actual data values.
Some data mining algorithms only accept categorical attributes and cannot handle a range of
continuous attribute values.
Discretization can reduce the data set, and can also be used to generate concept hierarchies
automatically.
4. (b) Tracking systems and hit counters are powerful tools to determine if your customers are finding
your site. However, they don’t help you determine the possibility of growth. That’s where a good
online business intelligence data service comes in.
Securing your company’s position is hard work but it’s only the first step. You still need to grow
even if it’s just to secure new customers. Business intelligence keeps you informed of your
market trends, alerts you to new avenues of generating revenue, and helps you determine how
your competition is doing. Without that knowledge you may suffer false growth or setbacks.
But then, you already know that. You’ve used various methods of business intelligence data
retrieval already to get where you are. You’ve sent people to your competition to see how they do
things differently, you’ve hired mystery shoppers to assess your company’s performance, and
you’ve read every trade magazine or business newspaper you can get your hands on to gather that
information. That’s a lot of man hours to spend on business intelligence data gathering and it’s of
only limited value.
Online business intelligence software for data mining takes advantage of web data mining and
data warehousing to help you gather your information in a timelier and more valuable manner.
The business intelligence software will search the trade magazines and newspapers relevant to
your business to provide the growth information you need. With web data mining it can help you
evaluate your performance in comparison to your competition.
Entering a new revenue market is always frightening but diversification is a key factor to
surviving difficult times. Business intelligence software for data mining provides predictive
analysis of various growth potentials according to the criteria you determine important. The
savings in man hours alone will pay for the software, but consider also how the predictive
analysis will help you avoid trying to enter a market that your business can’t compete in.
With the assistance of a business intelligence service you can face the most difficult of financial
times with more confidence. You can determine where to diversify and when because you’ll have
the intelligence to make smart choices. Best of all, your intelligence will be on your desktop in a
neat report not scattered in files and notes.
Being able to use the information you gather is at least as important as gathering it. Business
intelligence strategy should be used when thinking of how to apply the knowledge you’ve gained
to maximize the benefits.
5. (a) An Architecture for Data Mining
Vi
To best apply these advanced techniques, they must be fully integrated with a data warehouse as
well as flexible interactive business analysis tools. Many data mining tools currently operate
outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data.
Furthermore, when new insights require operational implementation, integration with the
warehouse simplifies the application of results from data mining. The resulting analytic data
warehouse can be applied to improve business processes throughout the organization, in areas
such as promotional campaign management, fraud detection, new product rollout, and so on.
Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Fig. 1 : Integrated Data Mining Architecture
The ideal starting point is a data warehouse containing a combination of internal data tracking all
customer contact coupled with external market data about competitor activity. Background
information on potential customers also provides an excellent basis for prospecting. This
warehouse can be implemented in a variety of relational database systems: Sybase, Oracle,
Redbrick, and so on, and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business
model to be applied when navigating the data warehouse. The multidimensional structures allow
the user to analyze the data as they want to view their business – summarizing by product line,
region, and other key perspectives of their business. The Data Mining Server must be integrated
with the data warehouse and the OLAP server to embed ROI-focused business analysis directly
into this infrastructure. An advanced, process-centric metadata template defines the data mining
objectives for specific business issues like campaign management, prospecting, and promotion
optimization. Integration with the data warehouse enables operational decisions to be directly
implemented and tracked. As the warehouse grows with new decisions and results, the
organization can continually mine the best practices and apply them to future decisions.
This design represents a fundamental shift from conventional decision support systems. Rather
than simply delivering data to the end user through query and reporting software, the Advanced
Analysis Server applies users’ business models directly to the warehouse and returns a proactive
analysis of the most relevant information. These results enhance the metadata in the OLAP Server
by providing a dynamic metadata layer that represents a distilled view of the data. Reporting,
visualization, and other analysis tools can then be applied to plan future actions and confirm the
impact of those plans.
5. (b) Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain
group of items, you are more (or less) likely to buy another group of items. For example, if you
are in an English pub and you buy a pint of beer and don't buy a bar meal, you are more likely to
buy crisps (US. chips) at the same time than somebody who didn't buy beer.
The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to
find relationships between purchases.
Typically the relationship will be in the form of a rule:
IF {beer, no bar meal} THEN {crisps}.
The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is true)
is referred to as the support for the rule. The conditional probability that a customer will purchase
crisps is referred to as the confidence.
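A toy sketch of the rule above, computed over an invented list of pub transactions and using the definitions given in this passage (support = probability the antecedent holds; confidence = conditional probability of the consequent):

transactions = [
    {"beer", "crisps"},
    {"beer", "bar meal"},
    {"beer", "crisps", "peanuts"},
    {"wine", "bar meal"},
    {"beer"},
]

has_antecedent = [("beer" in t) and ("bar meal" not in t) for t in transactions]
has_both       = [a and ("crisps" in t) for a, t in zip(has_antecedent, transactions)]

support    = sum(has_antecedent) / len(transactions)   # 3/5 = 0.60: antecedent holds
confidence = sum(has_both) / sum(has_antecedent)       # 2/3 ≈ 0.67: crisps given the antecedent
print(f"support={support:.2f}, confidence={confidence:.2f}")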
The algorithms for performing market basket analysis are fairly straightforward (Berry and
Linhoff is a reasonable introductory resource for this). The complexities mainly arise in
exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or
more line items), and dealing with the large amounts of transaction data that may be available.
A major difficulty is that a large number of the rules found may be trivial for anyone familiar
with the business. Although the volume of data has been reduced, we are still asking the user to
find a needle in a haystack. Requiring rules to have a high minimum support level and a high
confidence level risks missing any exploitable result we might have found. One partial solution to
this problem is differential market basket analysis, as described below.
How is it used?
In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what
a customer might have bought if the idea had occurred to them. (For some real insights into
consumer behavior, see Why We Buy: The Science of Shopping by Paco Underhill.)
As a first step, therefore, market basket analysis can be used in deciding the location and
promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls are
more likely to buy candy, then high-margin candy can be placed near to the Barbie doll display.
Customers who would have bought candy with their Barbie dolls had they thought of it will now
be suitably tempted.
But this is only the first level of analysis. Differential market basket analysis can find interesting
results and can also eliminate the problem of a potentially high volume of trivial results.
In differential analysis, we compare results between different stores, between customers in
different demographic groups, between different days of the week, different seasons of the year,
etc.
If we observe that a rule holds in one store, but not in any other (or does not hold in one store, but
holds in all others), then we know that there is something interesting about that store. Perhaps its
clientele are different, or perhaps it has organized its displays in a novel and more lucrative way.
Investigating such differences may yield useful insights which will improve company sales.
Other Application Areas
Although Market Basket Analysis conjures up pictures of shopping carts and supermarket
shoppers, it is important to realize that there are many other areas in which it can be applied.
These include:
 Analysis of credit card purchases.
 Analysis of telephone calling patterns.
 Identification of fraudulent medical insurance claims.
(Consider cases where common rules are broken).
 Analysis of telecom service purchases.
Note that despite the terminology, there is no requirement for all the items to be purchased at the
same time. The algorithms can be adapted to look at a sequence of purchases (or events) spread
out over time. A predictive market basket analysis can be used to identify sets of item purchases
(or events) that generally occur in sequence — something of interest to direct marketers,
criminologists and many others.
6. (a)  Let the minimum support count be 2 and the minimum confidence required be 70%.
 We have to first find out the frequent itemsets using the Apriori algorithm.
 Then, association rules will be generated using min. support and min. confidence.
Step 1 : Generating 1-itemset Frequent Pattern
Scan D for the count of each candidate to obtain C1, then compare each candidate's support count
with the minimum support count to obtain L1. Here every candidate satisfies minimum support, so
L1 contains the same entries as C1:

C1 = L1
Itemset    Sup. Count
{I1}       6
{I2}       7
{I3}       6
{I4}       2
{I5}       2
 The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum
support.
 In the first iteration of the algorithm, each item is a member of the set of candidates.
Step 2 : Generating 2-itemset Frequent Pattern
Generate the candidate 2-itemsets C2 from L1 (C2 = L1 Join L1) and scan D for the count of each
candidate:

C2
Itemset      Sup. Count
{I1, I2}     4
{I1, I3}     4
{I1, I4}     1
{I1, I5}     2
{I2, I3}     4
{I2, I4}     2
{I2, I5}     2
{I3, I4}     0
{I3, I5}     1
{I4, I5}     0

Compare each candidate's support count with the minimum support count to obtain L2:

L2
Itemset      Sup. Count
{I1, I2}     4
{I1, I3}     4
{I1, I5}     2
{I2, I3}     4
{I2, I4}     2
{I2, I5}     2

 To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
 Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is
accumulated (as shown in the C2 table above).
 The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in
C2 having minimum support.

Step 3 : Generating 3-itemset Frequent Pattern
Scan D for the count of each candidate and compare each candidate's support count with the
minimum support count:

C3
Itemset          Sup. Count
{I1, I2, I3}     2
{I1, I2, I5}     2

L3
Itemset          Sup. Count
{I1, I2, I3}     2
{I1, I2, I5}     2

 The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
 In order to find C3, we compute L2 Join L2.
 C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
 Now, the join step is complete and the Prune step will be used to reduce the size of C3. The Prune
step helps to avoid heavy computation due to large Ck.
 Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we
can determine that the four latter candidates cannot possibly be frequent. How?
 For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3.
 Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its
2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
 BUT, {I3, I5} is not a member of L2 and hence it is not frequent, violating the Apriori property.
Thus we will have to remove {I2, I3, I5} from C3.
 Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join
operation for pruning.
 Now, the transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support.
Step 4 : Generating 4-itemset Frequent Pattern
 The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join
results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not
frequent.
 Thus, C4 = { } (the empty set), and the algorithm terminates, having found all of the frequent
itemsets. This completes our Apriori algorithm.
 What’s Next?
These frequent itemsets will be used to generate strong association rules (where strong
association rules satisfy both minimum support & minimum confidence).
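The transaction database D itself is not reproduced in this solution; the counts above are consistent with the standard nine-transaction example from Han and Kamber, which is assumed in the sketch below. The join here is a simplified union-based join; after pruning it yields the same frequent itemsets Lk as the prefix-based join described in the steps above.

from itertools import combinations

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
min_sup = 2

def support_count(itemset):
    return sum(itemset <= t for t in D)      # number of transactions containing the itemset

# L1: frequent 1-itemsets
items = sorted({i for t in D for i in t})
Lk = [frozenset([i]) for i in items if support_count({i}) >= min_sup]
L = list(Lk)
k = 2
while Lk:
    # join step (simplified): unions of pairs of frequent (k-1)-itemsets that have size k
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # prune step: every (k-1)-subset must itself be frequent (Apriori property)
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = [c for c in candidates if support_count(c) >= min_sup]
    L.extend(Lk)
    k += 1

for s in L:
    print(sorted(s), support_count(s))       # reproduces L1, L2 and L3 above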
Step 5 : Generating Association Rules from Frequent Itemsets
Procedure :
 For each frequent itemset "I", generate all non-empty subsets of I.
 For every non-empty subset S of I, output the rule "S → (I − S)" if support_count(I) /
support_count(S) >= min_conf, where min_conf is the minimum confidence threshold.
Back To Example :
 We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2,
I5}, {I1, I2, I3}, {I1, I2, I5}}.
 Let's take I = {I1, I2, I5}.
 All its non-empty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
 Let the minimum confidence threshold be, say, 70%.
 The resulting association rules are shown below, each listed with its confidence.
 R1 : I1 ^ I2 → I5
   Confidence = SC{I1, I2, I5}/SC{I1, I2} = 2/4 = 50%
   R1 is rejected.
 R2 : I1 ^ I5 → I2
   Confidence = SC{I1, I2, I5}/SC{I1, I5} = 2/2 = 100%
   R2 is selected.
 R3 : I2 ^ I5 → I1
   Confidence = SC{I1, I2, I5}/SC{I2, I5} = 2/2 = 100%
   R3 is selected.
 R4 : I1 → I2 ^ I5
   Confidence = SC{I1, I2, I5}/SC{I1} = 2/6 = 33%
   R4 is rejected.
 R5 : I2 → I1 ^ I5
   Confidence = SC{I1, I2, I5}/SC{I2} = 2/7 = 29%
   R5 is rejected.
 R6 : I5 → I1 ^ I2
   Confidence = SC{I1, I2, I5}/SC{I5} = 2/2 = 100%
   R6 is selected.
In this way, we have found three strong association rules.
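Continuing the sketch above (and reusing D and support_count from it), the rule-generation step for I = {I1, I2, I5} can be expressed as:

from itertools import combinations

I = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

for r in range(1, len(I)):
    for S in combinations(sorted(I), r):
        S = frozenset(S)
        conf = support_count(I) / support_count(S)
        verdict = "selected" if conf >= min_conf else "rejected"
        print(f"{sorted(S)} -> {sorted(I - S)}: confidence = {conf:.0%} ({verdict})")
# prints the same six rules as above, with R2, R3 and R6 selected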
6. (b) (i) Support & Confidence :
In addition to support, there is another measure that expresses the degree of uncertainty about the
if-then rule. This is known as the confidence of the rule. This measure compares the
co-occurrence of the antecedent and consequent item sets in the database to the occurrence of the
antecedent item sets. Confidence is defined as the ratio of the number of transactions that include
all antecedent and consequent item sets (namely, the support) to the number of transactions that
include all the antecedent item sets:

Confidence = (no. of transactions with both antecedent and consequent item sets) /
             (no. of transactions with the antecedent item set)
For example, suppose that a supermarket database has 100,000 point-of-sale transactions. Of
these transactions, 2000 include both orange juice and (over-the-counter) flu medication, and
800 of these include soup purchases. The association rule "IF orange juice and flu medication are
purchased THEN soup is purchased on the same trip" has a support of 800 transactions
(alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2000).
To see the relationship between support and confidence, let us think about what each is measuring
(estimating). One way to think of support is that it is the (estimated) probability that a transaction
selected randomly from the database will contain all items in the antecedent and the consequent:
P(antecedent AND consequent).
In comparison, the confidence is the (estimated) conditional probability that a transaction selected
randomly will include all the items in the consequent given that the transaction includes all the
items in the antecedent:
P(antecedentANDconsequent)
P(consequent| antecedent) .
P(antecedent)
A high value of confidence suggests a strong association rule (in which we are highly confident).
However, this can be deceptive because if the antecedent and/or the consequent has a high level
of support, we can have a high value for confidence even when the antecedent and consequent
are independent! For example, if nearly all customers buy bananas and nearly all customers buy
ice cream, the confidence level will be high regardless of whether there is an association between
the items.
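One common way to expose this effect, not discussed in the text, is to compare the confidence with the consequent's own support (the lift of the rule); a tiny sketch with made-up probabilities:

p_bananas, p_icecream = 0.9, 0.9       # nearly all customers buy each item
p_both = p_bananas * p_icecream        # assume the two purchases are independent

confidence = p_both / p_bananas        # P(ice cream | bananas) = 0.9 -> looks like a "strong" rule
lift = confidence / p_icecream         # = 1.0 -> no real association beyond item popularity
print(confidence, lift)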
6. (b) (ii) Entropy and Gini Index :
There are a number of ways to measure impurity. The two most popular measures are the Gini
index and an entropy measure. We describe both next. Denote the m classes of the response
variable by k = 1, 2, …, m.
The Gini impurity index for a rectangle A is defined by

    I(A) = 1 − Σ_{k=1}^{m} p_k^2,

where p_k is the proportion of observations in rectangle A that belongs to class k. This measure
takes values between 0 (if all the observations belong to the same class) and (m − 1)/m (when all
m classes are equally represented). Figure 1 shows the values of the Gini index for a two-class
case as a function of p_k. It can be seen that the impurity measure is at its peak when p_k = 0.5 (i.e.,
when the rectangle contains 50% of each of the two classes).
A second impurity measure is the entropy measure. The entropy for a rectangle A is defined by

    entropy(A) = − Σ_{k=1}^{m} p_k log2(p_k)
[To compute log2(x) in Excel, use the function = log(x, 2).] This measure ranges between 0 (most
pure, all observations belong to the same class) and log2(m) (when all m classes are represented
equally). In the two-class case, the entropy measure is maximized (like the Gini index) at p_k = 0.5.
Fig. 1 : Values of the Gini index for a two-class case as a function of p_k
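A small sketch of both impurity measures for the two-class case, confirming the peak at p = 0.5 noted above:

import math

def gini(p):          # I(A) = 1 - sum of p_k^2, for classes with proportions p and 1-p
    return 1 - (p**2 + (1 - p)**2)

def entropy(p):       # -sum of p_k * log2(p_k), with the convention 0 * log2(0) = 0
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(gini(p), 3), round(entropy(p), 3))
# gini peaks at 0.5 with value 0.5 = (m-1)/m for m = 2; entropy peaks at 1 = log2(2)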
7. (a) Web content mining is related but different from data mining and text mining. It is related to data
mining because many data mining techniques can be applied in Web content mining. It is related
to text mining because much of the web contents are texts. However, it is also quite different
from data mining because Web data are mainly semi-structured and/or unstructured, while data
mining deals primarily with structured data. Web content mining is also different from text
mining because of the semi-structure nature of the Web, while text mining focuses on
unstructured texts. Web content mining thus requires creative applications of data mining and/or
text mining techniques and also its own unique approaches. In the past few years, there was a
rapid expansion of activities in the Web content mining area. This is not surprising because of the
phenomenal growth of the Web contents and significant economic benefit of such mining.
However, due to the heterogeneity and the lack of structure of Web data, automated discovery of
targeted or unexpected knowledge still presents many challenging research problems.
In this tutorial, we will examine the following important Web content mining problems and
discuss existing techniques for solving these problems. Some other emerging problems will also
be surveyed.
 Data/information extraction: Our focus will be on extraction of structured data from Web
pages, such as products and search results. Extracting such data allows one to provide
services. Two main types of techniques, machine learning and automatic extraction, are
covered.
 Web information integration and schema matching: Although the Web contains a huge
amount of data, each web site (or even page) represents similar information differently. How
to identify or match semantically similar data is a very important problem with many
practical applications. Some existing techniques and problems are examined.
 Opinion extraction from online sources: There are many online opinion sources, e.g.,
customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially
consumer opinions) is of great importance for marketing intelligence and product
benchmarking. We will introduce a few tasks and techniques to mine such sources.
 Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications.
However, generating them manually is very time consuming. A few existing methods that
exploit the information redundancy of the Web will be presented. The main application is to
synthesize and organize the pieces of information on the Web to give the user a coherent
picture of the topic domain.
 Segmenting Web pages and detecting noise: In many Web applications, one only wants the
main content of the Web page, without advertisements, navigation links or copyright notices.
Automatically segmenting a Web page to extract its main content is an interesting
problem. A number of interesting techniques have been proposed in the past few years.
Web Usage Mining
Web usage mining is the type of Web mining activity that involves the automatic discovery of
user access patterns from one or more Web servers. As more organizations rely on the Internet
and the World Wide Web to conduct business, the traditional strategies and techniques for market
analysis need to be revisited in this context. Organizations often generate and collect large
volumes of data in their daily operations. Most of this information is usually generated
automatically by Web servers and collected in server access logs. Other sources of user
information include referrer logs, which contain information about the referring pages for each
page reference, and user registration or survey data gathered via tools such as CGI scripts.
Analyzing such data can help these organizations to determine the lifetime value of customers,
cross-marketing strategies across products, and the effectiveness of promotional campaigns, among
other things. Analysis of server access logs and user registration data can also provide valuable
information on how to better structure a Web site in order to create a more effective presence for
1113/Engg/BE/Pre Pap/2013/INFT/Soln/DWM
15
Vidyalankar : B.E.  DWM
the organization. In organizations using intranet technologies, such analysis can shed light on
more effective management of workgroup communication and organizational infrastructure.
Finally, for organizations that sell advertising on the World Wide Web, analyzing user access
patterns helps in targeting ads to specific groups of users.
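As a concrete illustration of the raw material for Web usage mining, the sketch below parses one invented server-access-log line in Common Log Format into fields that access-pattern analysis could work with:

import re

line = '192.168.0.7 - - [10/Oct/2013:13:55:36 +0530] "GET /products/page.html HTTP/1.1" 200 2326'
pattern = re.compile(r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)')

m = pattern.match(line)
if m:
    host, timestamp, method, path, status, size = m.groups()
    # e.g. group requests by host/session and by path to discover access patterns
    print(host, timestamp, method, path, status, size)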
7. (b) k-means clustering is a data mining / machine learning algorithm used to cluster observations into
groups of related observations without any prior knowledge of those relationships. The k-means
algorithm is one of the simplest clustering techniques and it is commonly used in medical
imaging, biometrics and related fields.
The k-means Algorithm
The k-means algorithm is a simple iterative algorithm that gains its name from its method of
operation. The algorithm clusters observations into k groups, where k is provided as an input
parameter. It then assigns each observation to clusters based upon the observation’s proximity to
the mean of the cluster. The cluster’s mean is then recomputed and the process begins again.
Here's how the algorithm works (a short code sketch follows this list):
1. The algorithm arbitrarily selects k points as the initial cluster centers ("means").
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance
between each point and each cluster center.
3. Each cluster center is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently
depending upon the implementation, but it normally means that either no observations change
clusters when steps 2 and 3 are repeated or that the changes do not make a material difference
in the definition of the clusters.
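A compact NumPy sketch of these four steps on synthetic two-dimensional data (k = 3; the data and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ((0, 0), (4, 0), (2, 3))])
k = 3

centers = X[rng.choice(len(X), size=k, replace=False)]            # step 1: arbitrary initial means
for _ in range(100):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                                      # step 2: assign to closest center
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):                          # step 4: convergence check
        break
    centers = new_centers                                          # step 3: recompute the means
print(centers)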
Choosing the Number of Clusters
One of the main disadvantages to k-means is the fact that you must specify the number of clusters
as an input to the algorithm. As designed, the algorithm is not capable of determining the
appropriate number of clusters and depends upon the user to identify this in advance. For
example, if you had a group of people that were easily clustered based upon gender, calling the
k-means algorithm with k = 3 would force the people into three clusters, when k = 2 would provide a
more natural fit. Similarly, if a group of individuals were easily clustered based upon home state
and you called the k-means algorithm with k = 20, the results might be too generalized to be
effective.
For this reason, it’s often a good idea to experiment with different values of k to identify the
value that best suits your data. You also may wish to explore the use of other data mining
algorithms in your quest for machine-learned knowledge.