Homework 2: Bayesian Network Written Part

CS6957: Probabilistic Modeling
Sumedha Singla u0877456
In this part you will be analyzing risk factors for certain health problems (heart disease,
stroke, heart attack, diabetes). The data is from the 2011 Behavioral Risk Factor
Surveillance System (BRFSS) survey, which is run by the Centers for Disease Control
(CDC). The distilled data is in the spreadsheet RiskFactorData.csv.
Ques1: Create the following Bayesian network to analyze the survey results. You will
want to use the provided function createCPT.fromData.
a. What is the size (in terms of the number of probabilities needed) of this network?
Alternatively, what is the total number of probabilities needed to store the full
joint distribution?
Solution:
Size of the network: 502
Total number of probabilities needed for full joint distribution: 32768
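The two counts come from different formulas: a Bayesian network stores, for each node, one probability per combination of the node's own values and its parents' values, while the full joint stores one probability per combination of all variables. The sketch below illustrates this on a hypothetical three-node network; the node names, cardinalities, and edges are assumptions chosen for the example and are not the homework network.

```python
from math import prod

def network_size(cardinality, parents):
    """Total CPT entries: each node contributes its own cardinality
    times the product of its parents' cardinalities."""
    return sum(
        cardinality[node] * prod(cardinality[p] for p in parents.get(node, []))
        for node in cardinality
    )

def joint_size(cardinality):
    """Entries in the full joint table: product of all cardinalities."""
    return prod(cardinality.values())

# Hypothetical toy network (assumed for illustration): income -> smoke -> stroke
cardinality = {"income": 8, "smoke": 2, "stroke": 2}
parents = {"smoke": ["income"], "stroke": ["smoke"]}

print(network_size(cardinality, parents))  # 8 + 2*8 + 2*2 = 28
print(joint_size(cardinality))             # 8 * 2 * 2 = 32
```

The gap between the two numbers grows quickly with the number of variables, which is why the factored network is so much smaller than the full joint.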
Ques2: For each of the four health outcomes (diabetes, stroke, heart attack, angina),
answer the following by querying your network (using your infer function):
a. What is the probability of the outcome if I have bad habits (smoke and don't
exercise)? How about if I have good habits (don't smoke and do exercise)?
b. What is the probability of the outcome if I have poor health (high blood pressure,
high cholesterol, and overweight)? What if I have good health (low blood
pressure, low cholesterol, and normal weight)?
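Each query above conditions on the evidence and sums out the variables that are neither queried nor observed. As a rough illustration of that computation (not the homework's infer function), the sketch below uses a hypothetical chain smoke -> bp -> stroke with invented CPT values; the structure, names, and numbers are all assumptions made for the example.

```python
# Hypothetical CPTs for a toy chain smoke -> bp -> stroke (values invented).
p_bp_given_smoke = {
    "yes": {"high": 0.40, "low": 0.60},
    "no": {"high": 0.25, "low": 0.75},
}
p_stroke_given_bp = {
    "high": {"yes": 0.08, "no": 0.92},
    "low": {"yes": 0.02, "no": 0.98},
}

def p_stroke_given_smoke(smoke):
    """Sum out bp: P(stroke | smoke) = sum over bp of P(bp | smoke) * P(stroke | bp)."""
    result = {"yes": 0.0, "no": 0.0}
    for bp, p_bp in p_bp_given_smoke[smoke].items():
        for stroke, p_s in p_stroke_given_bp[bp].items():
            result[stroke] += p_bp * p_s
    return result

print(p_stroke_given_smoke("yes"))
print(p_stroke_given_smoke("no"))
```

In this toy chain smoking affects stroke only through blood pressure, so the two conditionals can differ only as much as the intermediate variable allows.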
Solution (a):
Diabetes
P(Diabetes | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Diabetes                     Bad Habit      Good Habit
1 (yes)                      0.150515956    0.127119324
2 (only during pregnancy)    0.008964854    0.008864952
3 (no)                       0.822423377    0.847693106
4 (pre-diabetic)             0.018095813    0.016322618

Stroke
P(Stroke | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Stroke                       Bad Habit      Good Habit
1 (yes)                      0.04926405     0.03611044
2 (no)                       0.95073595     0.96388956

Heart Attack
P(Attack | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Attack                       Bad Habit      Good Habit
1 (yes)                      0.07433041     0.05279765
2 (no)                       0.92566959     0.94720235

Angina
P(Angina | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Angina                       Bad Habit      Good Habit
1 (yes)                      0.08044778     0.05475537
2 (no)                       0.91955222     0.94524463

Solution (b):
Diabetes
P(Diabetes | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Diabetes                     Bad Health     Good Health
1 (yes)                      0.115422719    0.057709954
2 (only during pregnancy)    0.007661825    0.009543386
3 (no)                       0.860872761    0.922193878
4 (pre-diabetic)             0.016042695    0.010552782

Stroke
P(Stroke | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Stroke                       Bad Health     Good Health
1 (yes)                      0.08268577     0.01446014
2 (no)                       0.91731423     0.98553986

Heart Attack
P(Attack | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Attack                       Bad Health     Good Health
1 (yes)                      0.1407844      0.01616133
2 (no)                       0.8592156      0.98383867

Angina
P(Angina | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Angina                       Bad Health     Good Health
1 (yes)                      0.1616076      0.01332601
2 (no)                       0.8383924      0.98667399

Ques3: Evaluate the effect a person's income has on their probability of having one of the
four health outcomes (diabetes, stroke, heart attack, angina). For each of these four
outcomes, plot their probability given income status (your horizontal axis should be i = 1,
2, ..., 8; and your vertical axis should be P(y = 1 | income = i), where y is the outcome).
What can you conclude?
Solution:
Conclusion: From the above graph we can conclude that as income increases, the
probability of each of the four health outcomes (diabetes, stroke, heart attack,
angina) decreases. The absolute decreases in probability are small, but the relative
decreases are substantial.
For diabetes, as income rises from level 1 to level 8 the probability of having diabetes
decreases from 14.64462% to 12.33497%, an absolute decrease of about 2.3 percentage
points. Relative to the level 1 probability, this is a decrease of about 15.77%
(2.30965 / 14.64462). That is, a person at income level 8 is about 15.77% less likely to
have diabetes than a person at income level 1.
The other health outcomes show the same trend.
For angina, as income rises from level 1 to level 8 the probability of having angina
decreases from 7.936% to 5.09%, an absolute decrease of 2.84 percentage points. The
relative decrease is 35.86%, which is substantial.
For heart attack, as income rises from level 1 to level 8 the probability of having a
heart attack decreases from 7.36% to 4.94%, an absolute decrease of 2.42 percentage
points. The relative decrease is 32.88%.
For stroke, as income rises from level 1 to level 8 the probability of having a stroke
decreases from 4.96% to 3.36%, an absolute decrease of 1.6 percentage points. The
relative decrease is 32.25%.
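The distinction between absolute and relative decrease can be checked directly from the percentages stated above. The short sketch below recomputes both from the rounded level 1 and level 8 values quoted in the text (the helper name `decrease` is just for illustration), so small differences from figures based on unrounded probabilities are to be expected.

```python
def decrease(p_level1, p_level8):
    """Return (absolute decrease in percentage points, relative decrease in %)."""
    absolute = p_level1 - p_level8
    relative = 100 * absolute / p_level1
    return absolute, relative

# P(outcome = 1 | income = 1) and P(outcome = 1 | income = 8), in percent,
# as quoted in the conclusion above.
stated = {
    "diabetes": (14.64462, 12.33497),
    "angina": (7.936, 5.09),
    "heart attack": (7.36, 4.94),
    "stroke": (4.96, 3.36),
}

for outcome, (p1, p8) in stated.items():
    absolute, relative = decrease(p1, p8)
    print(f"{outcome}: {absolute:.2f} points absolute, {relative:.2f}% relative")
```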
Ques4: Create a second Bayesian network as above, but add edges from smoking to each
of the four outcomes and edges from exercise to each of the four outcomes. Now redo the
queries in Question 2. What was the effect, and do you think the assumptions of the first
graph were valid or not?
Solution:
Diabetes
P(Diabetes | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Diabetes                     Bad Habit      Good Habit
1 (yes)                      0.210944859    0.098552162
2 (only during pregnancy)    0.006915095    0.009884084
3 (no)                       0.760692694    0.877575578
4 (pre-diabetic)             0.021447352    0.013988176

From the above we can conclude that, on adding edges from smoking and exercise to
diabetes, the resulting probabilities differ. The probability of having diabetes with
bad habits increases, and the probability of having diabetes with good habits
decreases. Hence the first network's assumption (no direct edges from smoking and
exercise to diabetes) is not valid.

Stroke
P(Stroke | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Stroke                       Bad Habit      Good Habit
1 (yes)                      0.07803498     0.02431088
2 (no)                       0.92196502     0.97568912

From the above we can conclude that, on adding edges from smoking and exercise to
stroke, the probability of having a stroke increases given bad habits and decreases
given good habits. Hence our assumption was not valid.

Heart Attack
P(Attack | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Attack                       Bad Habit      Good Habit
1 (yes)                      0.1211659      0.03101531
2 (no)                       0.8788341      0.96898469

From the above we can conclude that, on adding edges from smoking and exercise to
heart attack, the probability of having a heart attack increases given bad habits and
decreases given good habits. Hence our assumption was not valid.

Angina
P(Angina | Exercise, Smoke)
Bad Habit: Exercise = 2 (No), Smoke = 1 (Yes)
Good Habit: Exercise = 1 (Yes), Smoke = 2 (No)

Angina                       Bad Habit      Good Habit
1 (yes)                      0.1190069      0.03680005
2 (no)                       0.8809931      0.96319995

From the above we can conclude that, on adding edges from smoking and exercise to
angina, the probability of having angina increases given bad habits and decreases
given good habits. Hence our assumption was not valid.
Solution (b):
Diabetes
P(Diabetes | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Diabetes                     Bad Health     Good Health
1 (yes)                      0.123480634    0.054172949
2 (only during pregnancy)    0.007460298    0.009731215
3 (no)                       0.852415963    0.925952333
4 (pre-diabetic)             0.016643105    0.010143502

From the above we can conclude that, on adding edges from smoking and exercise to
diabetes, the resulting probabilities differ. The probability of having diabetes with
bad health increases, and the probability of having diabetes with good health
decreases. Hence our assumption is not valid.

Stroke
P(Stroke | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Stroke                       Bad Health     Good Health
1 (yes)                      0.08425692     0.01399739
2 (no)                       0.91574308     0.98600261

From the above we can conclude that, on adding edges from smoking and exercise to
stroke, the resulting probabilities are almost the same. So our assumption is valid in
this case.

Heart Attack
P(Attack | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Attack                       Bad Health     Good Health
1 (yes)                      0.1421993      0.01546893
2 (no)                       0.8578007      0.98453107

On adding edges from smoking and exercise to heart attack, the resulting probabilities
are almost the same. When smoking and exercise are treated as direct causes of heart
attack (through the added edges), there is a slight increase in the probability of a
heart attack given bad health and a slight decrease given good health. Hence our
assumption is almost valid.

Angina
P(Angina | Bp, Bmi, Cholesterol)
Bad Health: Bp = 1 (Yes), Cholesterol = 1 (Yes), Bmi = 3 (overweight)
Good Health: Bp = 3 (No), Cholesterol = 2 (No), Bmi = 2 (normal)

Angina                       Bad Health     Good Health
1 (yes)                      0.1629716      0.01294426
2 (no)                       0.8370284      0.98705574

In the above case our assumption was valid, as there is not much difference in the
probabilities before and after adding edges from smoking and exercise to angina.
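How much the probabilities "differ" or stay "almost the same" can be made concrete by computing the change between the two networks. The sketch below does this for P(outcome = 1 (yes)) under the bad-habit and bad-health queries, using the values reported in the Question 2 and Question 4 tables; the dictionary layout and the helper name `relative_change` are choices made for this illustration.

```python
# P(outcome = 1 (yes)) as (first network, second network), copied from the tables.
bad_habits = {
    "diabetes": (0.150515956, 0.210944859),
    "stroke": (0.04926405, 0.07803498),
    "heart attack": (0.07433041, 0.1211659),
    "angina": (0.08044778, 0.1190069),
}
bad_health = {
    "diabetes": (0.115422719, 0.123480634),
    "stroke": (0.08268577, 0.08425692),
    "heart attack": (0.1407844, 0.1421993),
    "angina": (0.1616076, 0.1629716),
}

def relative_change(first, second):
    """Change between networks as a fraction of the first network's value."""
    return (second - first) / first

for label, table in (("bad habits", bad_habits), ("bad health", bad_health)):
    for outcome, (first, second) in table.items():
        print(f"{label:11s} {outcome:12s} {relative_change(first, second):+.1%}")
```

On these numbers the habit queries shift far more than the health queries, which matches the qualitative conclusions drawn above.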
Ques5: Make a third network, starting from the network in Question 4, but adding an
edge from diabetes to stroke. For both networks, evaluate the following probabilities:
P(stroke = 1(yes) | diabetes = 1(yes)) and P(stroke = 1(yes)| diabetes = 3 (no))
Again, what was the effect, and was the assumption about the interaction between
diabetes and stroke valid?
Solution:
Probabilities before adding the edge between diabetes and stroke:
P(Stroke | Diabetes)

Stroke       Diabetes = 1 (yes)    Diabetes = 3 (no)
1 (yes)      0.04416376            0.04047831
2 (no)       0.95583624            0.959521698

Probabilities after adding the edge between diabetes and stroke:
P(Stroke | Diabetes)

Stroke       Diabetes = 1 (yes)    Diabetes = 3 (no)
1 (yes)      0.07619782            0.03501533
2 (no)       0.92380218            0.96498467

From the above tables we can conclude that treating diabetes as one of the causes of
stroke in the Bayesian network (by adding the edge from diabetes to stroke) increases
the probability of stroke given diabetes: if a person has diabetes, it is more likely
that the person also has a stroke. P(stroke = 1 | diabetes = 1) rises from 0.0442 to
0.0762, an increase of about 73% relative to the value without the edge (the
difference is about 42% of the new value), while P(stroke = 1 | diabetes = 3) falls
from 0.0405 to 0.0350.
The absolute differences in probability are small, but the relative differences are
large. Hence the assumption that diabetes has no direct effect on stroke was invalid.
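One way to summarise the effect of the added edge is the ratio P(stroke = 1 | diabetes = 1) / P(stroke = 1 | diabetes = 3) in each network, together with the relative change in P(stroke = 1 | diabetes = 1). The sketch below computes both from the values in the two tables above; the helper names are hypothetical and the ratio is offered only as an illustrative summary, not as part of the assignment.

```python
# P(stroke = 1 (yes) | diabetes) from the tables above.
before = {"diabetes": 0.04416376, "no_diabetes": 0.04047831}
after = {"diabetes": 0.07619782, "no_diabetes": 0.03501533}

def risk_ratio(p):
    """Ratio of stroke probability with diabetes to stroke probability without."""
    return p["diabetes"] / p["no_diabetes"]

def relative_increase(old, new):
    """Increase as a fraction of the old value."""
    return (new - old) / old

print(f"risk ratio before edge: {risk_ratio(before):.3f}")
print(f"risk ratio after edge:  {risk_ratio(after):.3f}")
print(f"relative increase in P(stroke | diabetes): "
      f"{relative_increase(before['diabetes'], after['diabetes']):.1%}")
```

A ratio near 1 means the network barely distinguishes people with and without diabetes; a ratio well above 1 means the added edge lets the data express a strong association.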