Sofia University „St. Kliment Ohridski“
Faculty of Mathematics and Informatics
Department of Computing Systems
MASTER DEGREE THESIS
On topic
„Halogen: web-based application
for car selection”
Graduate: Vihren Kotsev Ganev
Specialty: E-business and e-governance
Faculty number: M24318
Supervisor:
Assoc. Prof. Kamen Spasov, Ph.D.
Co-advisor:
Dr.-Ing. Elior Vila
Sofia, 2015
Abstract:
People have been buying and driving their own cars for many years.
It all sounds perfect, but there is one general problem with cars: they cost money.
People, and Bulgarians in particular, do not understand this problem very well. When
they are about to buy a car, they choose it based on how much it costs today, how much
the initial repairs will cost and how it looks.
In fact, many more parameters should be taken into account.
There is one thing we can all be sure of: the best car for one person's needs is not
necessarily the best choice for another.
Halogen aims to point out the most suitable models for each individual user request,
filtering all models currently on the market and giving additional information on fuel
cost, depreciation cost, total cost and more.
Summary:
For many years now people have been buying and driving their own cars.
It all sounds wonderful, but there is one universal problem with cars: they cost
money.
People, and Bulgarians in particular, do not understand this problem very well, and when
they are about to buy a car they choose it depending on how much the car costs today,
how much the initial repairs will cost and whether the car looks good.
In fact, there are many more parameters that should be taken into account when choosing
a car.
There is one thing we can all be sure of. The best car for one person's needs is not
necessarily the best choice for another person.
Halogen aims to show the most suitable car models for each person individually,
filtering all models currently on the market and giving additional information on fuel
costs, depreciation, total costs and much more.
Declaration of authorship:
I, Vihren Ganev, hereby declare that the following is the result of my own work under
the supervision of Assoc. Prof. Kamen Spasov, Ph.D. and the co-advisory of Dr.-Ing. Elior Vila.
All sources are cited in the Reference section. All libraries and code snippets that are
someone else’s work are cited in the Reference section.
Table of Contents
1. Introduction
1.1. Relevance of the problem and motivation
1.2. Master degree thesis goal and tasks
1.3. Expected benefits from the implementation
1.4. Master degree thesis structure
2. Preview of the current ways people choose car to buy
2.1. Core definitions
2.2. Approaches and methods for solving the problems
2.3. Existing solutions
2.4. Conclusion
3. Technologies, platforms and methodologies used
3.1. Requirements for the tools, place and manner of use
3.2. Choice of tools
3.3. Conclusion
4. Analysis
4.1. Concept
4.2. Functional requirements
4.3. Non-functional requirements
4.4. Business processes
4.5. Conclusion
5. Design
5.1. Main architecture
5.2. Data model
5.3. Diagrams
5.4. User interface
5.5. Additional modules
5.6. Conclusion
6. Realization, testing, integration
6.1. Realization of the modules
6.1.1. Auto insurance pricing module
6.1.2. Citizen liability pricing module
6.1.3. Currency module
6.1.4. Euro NCAP module
6.1.5. Fuel module
6.1.6. Images module
6.1.7. Information module
6.1.8. Mobile module
6.1.9. Models module
6.1.10. Search module
6.1.11. Tasks module
6.1.12. Taxes module
6.1.13. Tires module
6.1.14. Vignette module
6.2. System integration
6.3. Testing
6.4. Experimental integration
7. Conclusion
7.1. Summary of the execution and requisition for original results
7.2. Guidelines for future development and improvement
1. Introduction
1.1. Relevance of the problem and motivation
Statistics show that more and more new cars are sold each year, both worldwide [1] and
in Bulgaria in particular [2]. The same is true for used cars. The trend suggests that
people will continue to buy and drive their own cars, and nowadays it is common for every
member of a family to own one.
The result is that a lot of money is spent on car-related costs, not only on fuel. A
well-known example is the yearly tax for owning a car: owning two cars means paying it
twice. Many other fees also have to be paid whether the car is driven or not.
Attracted by a low price and good looks, people forget to calculate how much the car is
really going to cost them.
When it comes to choosing a car, people often take a predefined list of about 20 of the
most popular models on the market and choose the one they like the most.
This is how choosing a car works today.
The topic is discussed every day in forums, TV shows and the press [3]. The ratings that
appear in these channels are based primarily on the current price of the car (the lower
the better), the fuel consumption and the engine power [4].
Unfortunately, such advice sometimes also favours engine displacement and power (the more
the better). Since the purpose of owning a car is generally to travel long distances more
easily or to carry a lot of luggage, this advice is groundless.
This master degree thesis is designed to work with the characteristics of cars as
automobiles rather than cars as toys. The system will also provide a list of information
rather than advice, so that people can apply their own preferences and choose for
themselves. The engine and some other characteristics of a car do not describe it as an
automobile and should not be part of the search criteria. A full list of what is and is
not part of the search criteria can be found in the following chapters.
Various parameters can move a car up or down in the list. It is important to understand
that the best car for one person’s needs may not be the best car for his friend, since
people’s needs and financial situations vary.
The motivation for choosing this topic is that the best cars are not always the most
popular ones. In fact, these two things have nothing in common, so why should one lead to
the other?
More importantly, choosing the best car for someone’s needs is a very personal decision
and as such should be made by the future owner himself. Who else can tell how safe the
future car should be?
1.2. Master degree thesis goal and tasks
The goal of the master degree thesis is to develop an autonomous web-based system that
points out the most suitable car models for anyone who wants to buy a car, based on his
individual needs and financial situation.
This is how choosing a car will work tomorrow.
Here is a full list of the initial tasks that have to be completed for the successful
realization of this master degree thesis.
• To find a database with real car prices.
The alternative is to use a depreciation formula and calculate prices with it, but a
database with real car prices for the local market is far more valuable.
• To find all the data for all car models that can be used for searching or should be
included in the calculation formula, and to automate its collection.
It would be easy if the system were just a filter for car ads: which car, how much
money, what type of engine and the like. Combining pure table information, such as car
width and length, with market information, such as the price, is more interesting from a
programming point of view and, more importantly, more useful from people’s point of
view.
Finding this table information may not be easy. The source has to be relatively
up-to-date with new models and has to include information such as tire size (used for
calculating the price of the tires) and other less common data that cannot be found in
all sources.
• To find prices for all variable costs of car models and to automate their collection.
There seems to be no place on the Internet where people report how much they pay for
repairs and how often they need them, together with the car model and all the additional
information the system needs.
Moreover, even if such information existed for one car model or for all models of one
manufacturer, there would still be models with no official information, which would put
them some steps behind their competitors for no good reason.
Formulas may therefore have to be derived from talks with friends, mechanics and
official representatives, so that the system can approximate by itself all the variable
repair costs of a car model.
• To find prices for all variable costs an owner pays other than repairs, and to
automate their gathering.
These are all kinds of taxes and fees:
o Citizen liability (third-party liability insurance): official average prices from a
large insurance broker that lists prices from several insurance companies.
o Auto insurance: usually this is not calculated but taken from the Schwackeliste.
Since only insurance companies have access to it, the prices here are a percentage
of the original price of the car.
o Road tax: a formula based on the current law in Bulgaria.
o Vignette: from the official page of the Road Infrastructure Agency, which is
responsible for the price of the vignettes in Bulgaria.
o Technical inspection: average prices for the country.
o Yearly servicing: since there is no official information on this point, the formula
for calculating yearly servicing should be based on the age of the car, its price and its
class. Cars from the higher classes usually have costly servicing.
This point also includes the price of the tires. Since there are hundreds of tire size
combinations, it would be best to have a formula that calculates the average tire price
based on real tires from a large Bulgarian online shop.
• To find safety data for car models.
Only one source can provide the needed information: Euro NCAP.
Euro NCAP is the standard for crash tests in Europe. It has separate ratings for
adults, children and pedestrians. A fourth rating covers the safety systems available on
board. In each of these categories the tested car scores between 0 and 100 %, and it
receives an overall rating of between 0 and 5 stars.
This is exactly what the system needs to provide to its users.
An integration with the Euro NCAP website, or a crawler for it, should be implemented.
• To find images for each car model and model year.
Since the look of a car changes with the model year, there should be as many images as
there are combinations of model and model year in the database. All of them have to be
downloaded and resized to proper dimensions.
• To build the system: frontend, backend and database. It has to be fast at calculating
a score for each car model and then sorting by it.
In the frontend, there should be an easy way of choosing values for the different
filter and search fields, using sliders and checkboxes. The results table in the
frontend should support searching and sorting by different columns.
The backend is responsible for all the important things:
o Collecting information from all sources and keeping it up-to-date in a database.
o Performing all calculations.
o Constructing all search queries.
o Deciding when and how to use caching.
o Some tests should be made to determine whether the system searches the database fast
enough. If the search is slow, an additional specialized search database should be used
for this part of the system.
o The database itself has to be relational, since the information is relational in
nature.
• To test the system with requests from real people with different needs and financial
situations.
Once the system is ready, it should be tested with requests from friends and other people
who are shown the available filtering and search fields. This would show what kind of
cars they prefer and whether the existing filtering and search options satisfy them.
To test whether people understand the idea behind the system, another test should be made
with more conceptual questions such as “How do you choose which car to buy?” and “Do
you agree with this system’s concept?”
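The yearly costs listed in the tasks above can be pictured as a simple sum. The following is a minimal Python sketch and not the thesis implementation: the class name `YearlyFixedCosts`, the `insurance_rate` parameter and all amounts are hypothetical placeholders, and the road tax and servicing values are assumed to have been computed elsewhere by their own formulas. The only rule taken from the text is that auto insurance is a percentage of the original price of the car.

```python
from dataclasses import dataclass


@dataclass
class YearlyFixedCosts:
    """Yearly costs an owner pays besides repairs (amounts in one currency)."""
    citizen_liability: float     # average broker price
    road_tax: float              # result of the legal formula, computed elsewhere
    vignette: float              # yearly vignette price
    technical_inspection: float  # country-wide average
    servicing: float             # yearly servicing estimate, tires included

    def total(self, original_price: float, insurance_rate: float) -> float:
        """Sum all components; auto insurance is a share of the original price."""
        auto_insurance = original_price * insurance_rate
        return (self.citizen_liability + auto_insurance + self.road_tax
                + self.vignette + self.technical_inspection + self.servicing)


# Placeholder figures chosen only to make the example run.
costs = YearlyFixedCosts(
    citizen_liability=200.0,
    road_tax=100.0,
    vignette=70.0,
    technical_inspection=40.0,
    servicing=500.0,
)
print(costs.total(original_price=30000.0, insurance_rate=0.05))
```

Keeping each component as a separate field means a single source (for example the vignette price) can be refreshed without touching the others.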
1.3. Expected benefits from the implementation
The implementation gives people an alternative way of choosing a car.
If choosing a car is seen as a two-step process, the first step is choosing the model
and the second is choosing the exemplar (the individual car). This is where Halogen (the
name of the system developed in this master degree thesis) finds its place: people are
prompted to take the first step, choosing the model, before choosing the car itself.
The system aims to replace what people do manually (searching the web, forums and
articles, asking friends and others) with a few clicks.
Once people find the website useful, they are expected to spread the word about it to
their friends and colleagues. A solution to a real problem that many people have is worth
sharing.
The expected benefit for the users is seeing exactly how much a car is going to cost
them specifically. The same cost is also shown as a price per month and a price per
kilometre. For example, if someone does not drive a lot and drives only in the city, it
may turn out that it is cheaper to use taxis than to own a car.
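As a minimal illustration of the per-month and per-kilometre figures, the Python sketch below spreads an assumed total cost of ownership over an ownership period. The function names, the ownership period and the taxi price per kilometre are hypothetical; the text does not state how Halogen computes these values.

```python
def cost_breakdown(total_cost, years_owned, km_per_year):
    """Return (cost per month, cost per kilometre) for a total cost of ownership."""
    if years_owned <= 0 or km_per_year <= 0:
        raise ValueError("years_owned and km_per_year must be positive")
    per_month = total_cost / (years_owned * 12)
    per_km = total_cost / (years_owned * km_per_year)
    return per_month, per_km


def cheaper_by_taxi(total_cost, years_owned, km_per_year, taxi_price_per_km):
    """True when paying a taxi for every kilometre costs less than owning the car."""
    _, per_km = cost_breakdown(total_cost, years_owned, km_per_year)
    return taxi_price_per_km < per_km


# Placeholder figures: a driver who covers few kilometres per year.
per_month, per_km = cost_breakdown(total_cost=18000.0, years_owned=5, km_per_year=3000)
print(per_month, per_km)
print(cheaper_by_taxi(18000.0, 5, 3000, taxi_price_per_km=0.9))
```

The fewer kilometres driven per year, the higher the cost per kilometre becomes, which is how the taxi comparison can tip against ownership.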
It is very unlikely that two requests will be the same, so giving a personal list of
results is very important. When presented with a well-informed choice, people would
choose what they see as best for their needs. Halogen may give unexpected results for
some of the people using the system.
Once there is a working system, the answer to how to make it better is simple: make it
profitable. Guidelines for future development and improvement, including how to make the
system profitable, can be found at the end of the master degree thesis.
1.4. Master degree thesis structure
The master degree thesis begins with an introduction describing the relevance of the
problem and the motivation for choosing this topic, followed by a description of the main
goal and tasks and the expected benefits of the implementation. The introduction aims to
make it clear that this is a serious existing problem that many people have and that this
is a possible solution. Interestingly, there is just one known solution to the same
problem; it comes from another country and was launched only a few months ago, just after
work on this master degree thesis started.
The second chapter is a preview of the current ways people choose a car to buy, including
the approaches, methods and standards and the existing solutions. The current ways are
described in detail, with real-world examples of their strengths and weaknesses. What
would change if people used Halogen is also discussed.
The third chapter covers the technologies, platforms and methodologies used in this
master degree thesis. It begins with the requirements for the tools, continues with their
types, place and manner of use, and ends with the choice of tools and a conclusion. The
chapter includes a brief description of all the chosen tools: languages, databases,
frameworks and others used for development. Some non-standard methodologies and decisions
get the attention they need here.
The fourth chapter is about concept and analysis. It describes the concept of the system
in detail, the functional and non-functional requirements and the business processes. The
concept is described in detail because at first people hardly understand when and why
they may use the system. There is no competing solution on the Bulgarian market yet, and
that is why people should understand very clearly what the system can help them with.
The fifth chapter is about system design. It describes the main architecture of the
system and the data model, including diagrams, gives some information about the user
interface and finishes with the additional modules developed. This is not a traditional
web-based system. It involves many interesting design decisions, be it the way it
collects data and keeps it up-to-date or the way it lists results for the end user. The
chapter also includes diagrams of the database, showing how the search works and why this
is the best decision.
The sixth chapter, on realization, testing and integration, describes how and why the
modules were written and how they are integrated into a working system, and gives brief
information about the testing and the experimental integration in the real world. This
chapter also covers how the system is developed so that it can be deployed to different
environments more easily and without problems. This is not seen very often and is
described in detail together with some benefits of developing a system this way.
The master degree thesis ends with a conclusion: a summary of the execution of the
initial tasks and requisition for original results, as well as guidelines for future
development and improvement. This chapter includes information on how to keep the system
on track and make it better and profitable.
2. Preview of the current ways people choose car to buy
2.1. Core definitions
System: An aggregation of tools and methods that is able to provide some information to
a person or another system.
Web browser: A program used to browse the Internet.
Web-based system: A system that can be opened from a web browser.
Autonomous web-based system: A web-based system that needs neither human nor non-human
interaction to do its everyday job without interruption, i.e. a system that “just works”.
Car (1): A vehicle categorized as a category B vehicle in European Union Directive
2006/126/EC.
Car (2): A wheeled, self-powered motor vehicle used for transportation.
Financial situation: An indicator of whether the person values the time he sets aside
for the car’s needs and, if so, at what price.
Schwackeliste: The list used by insurance companies to decide what insurance premium to
offer to the owner of a car. The list includes all car manufacturers, models,
modifications and extras. For each of them it states, depending on condition and age, the
value at which the car should be insured. This value is usually multiplied by a factor
between 0 and 1, and the resulting number is offered to the client as the insurance
premium for the car.
Euro NCAP: Euro NCAP stands for “European New Car Assessment Programme”. Euro NCAP
provides both drivers and the automotive industry with a realistic and independent
assessment of the safety of new cars. This is a well-known but very rarely used criterion
for buying a new car in Bulgaria. It is commonly used in western European countries.
Vignette: A tax paid for driving on extra-urban roads. It is paid weekly, monthly or
yearly.
Virtual machine: An isolated, fully functional computer environment with its own
operating system that can be edited, moved and managed directly from another operating
system, whether virtual or not. A virtual machine has its own virtual hard drive, which
holds the operating system and on which any software can be installed.
Caching: A mechanism that provides faster data access, most commonly by storing the data
in RAM (the cache). Without a cache, getting the needed data may require slow or
variable-speed operations. Usually, after the data is retrieved for the first time it is
saved in the cache, and every subsequent request for the same data is served from the
cache and returns immediately.
Halogen: The name of the system developed in this master degree thesis.
2.2. Approaches and methods for solving the problems
The problem is that people buy cars which are not the best for their needs. At the
moment, people do a manual job when choosing the best model for themselves: searching the
web, reading forums and articles, asking friends. All of this should be possible with a
few clicks.
People are smart, and they make their choice this way not because it has always been
done like this but because they do not know there is an alternative.
The suggested way of solving the problem and choosing the most suitable car for a
person’s needs is to proceed in steps:
2.2.1. Decide what the car should be like: should it be large, should it have a big
trunk, should it be safe, and how safe? Include as many specifications as you can think
of that describe the car as an automobile.
2.2.2. Find the values of these specifications for all cars on the market today and keep
only the cars that satisfy the requirements from the previous step, resulting in a list
of cars matching the criteria. This may take a lot of time and effort; often the search
cannot be completed in less than 24 hours.
2.2.3. Now that there is a list of cars matching the preferences, calculate how much
every aspect of each car would cost, including the fuel and depreciation costs, the time
cost and others. This adds about 10 columns to the initial table. Finding all this
information also takes a lot of time.
2.2.4. Do the math and sort the results by cost in ascending order. The car at the top
of the list is the most suitable one for this person’s needs.
Halogen replaces all steps from the second onwards, so the person only needs to decide
what kind of car would match his needs.
This method removes the human factor and, with it, the chance of errors caused by lack
of information and personal opinion.
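The steps above amount to a filter followed by a sort. Here is a minimal Python sketch under stated assumptions: the `CarModel` fields, the sample models and their costs are invented for illustration, and the cost columns of the third step are assumed to be already summed into `yearly_cost`.

```python
from dataclasses import dataclass


@dataclass
class CarModel:
    name: str
    seats: int
    trunk_litres: int
    safety_stars: int    # overall Euro NCAP rating, 0 to 5
    yearly_cost: float   # hypothetical sum of all cost columns


def shortlist(models, min_seats, min_trunk_litres, min_safety_stars):
    """Keep the models that meet the buyer's needs, cheapest first."""
    matching = [
        m for m in models
        if m.seats >= min_seats
        and m.trunk_litres >= min_trunk_litres
        and m.safety_stars >= min_safety_stars
    ]
    return sorted(matching, key=lambda m: m.yearly_cost)


# Invented sample data, not real car models or prices.
models = [
    CarModel("Model A", 5, 380, 5, 4200.0),
    CarModel("Model B", 5, 300, 4, 3100.0),
    CarModel("Model C", 7, 500, 5, 5200.0),
    CarModel("Model D", 5, 450, 5, 3900.0),
]
for m in shortlist(models, min_seats=5, min_trunk_litres=350, min_safety_stars=5):
    print(m.name, m.yearly_cost)
```

Note that the cheapest model overall ("Model B") is dropped by the filter: cost only ranks the cars that already satisfy the buyer's needs.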
Bulgaria is a small country, and the forums and topics found on the Internet are all
alike, so when people search for opinions on a car they all see the same opinions, which
do not change with time. A car is readily labelled on the market as either good or bad.
The real work to be done here is to convince people that this is how they should choose
a car. Convincing them is a slow process and has to be backed by results.
2.3. Existing solutions
There is just one known solution that lists car models matching the user’s criteria for
a car as an automobile.
The project originated in Germany and is named Motoragent (motoragent.de) [5]. Its
interface is available in German only and it targets the German market. Germany has one
of the largest car industries in the world. People there usually buy new cars and sell
them up to 5 years later, mainly because of increased taxation costs.
According to its website, Motoragent is developed in partnership with mobile.de [6],
autoscout24.de [7] and spritmonitor.de [8]. All these websites are well known in Germany,
and in Bulgaria too among people interested in the international car market.
Mobile.de is Germany’s leading marketplace for vehicle sales. The offers and prices of
cars in Motoragent come from the ads on mobile.de.
Autoscout24.de is one of the leading car portals in Europe. It is a platform for trading
cars similar to mobile.de and provides Motoragent with information about the car ads
published on the autoscout24.de website.
Spritmonitor.de provides fuel consumption and cost analysis reported by real car owners.
The values from spritmonitor.de form the consumption data on Motoragent.
Let us take a closer look at Motoragent’s functionality.
The website begins with a brief description of what Motoragent is and prompts the
visitor to continue to the main section.
The main section has several tabs, as follows:
2.3.1. The first tab is about the car the visitor currently drives. The manufacturer,
model and model year can be selected. This question, like all the following ones, can be
skipped. It does not change the results at the end and may be there just to collect user
information that is not directly connected to the purpose of the website.
2.3.2. The second tab lets the user choose the minimum number of seats the car should
have and also the minimum number of child seats. The interface here is a little
annoying, since the user is choosing a minimum number but the interface looks as if he
is choosing a maximum. The website does not have full information for all cars (which
can be seen at the end, when the results are shown), and if something is selected here or
in the next tabs, all cars for which that information is missing are removed from the
list.
2.3.3. The third question is about the number of kilometres the visitor travels each
year, answered with a slider between 0 and 50 000. A taxi driver, or anyone who simply
drives a lot, cannot use the website properly because he cannot select more than 50 000
kilometres per year. On the same page there is a second slider for whether the visitor
drives more in urban or extra-urban conditions. This slider is of limited use, since it
does not show percentages but only the position where the user puts it. Interestingly,
if the user just clicks on it, the results shown on the right get reshuffled without the
slider position changing.
2.3.4. Fourth question: does the visitor prefer sport or economy driving? Again there is
a slider with no clear definition of percentages or states. It is just sport on one side
and economy on the other.
2.3.5. The fifth question is about the trunk, but not about its size in real numbers.
Just like the previous question, it is a slider with “less” on one side and “more” on the
other. It is hard for people to judge how much “more” space they need when no numbers are
shown anywhere.
2.3.6. The sixth question is about the purchase price of the car. There is a slider for
choosing the minimum and maximum price the visitor wants to pay. It is doubtful that
anyone ever chooses a minimum amount of money to spend. The question itself is a good
one, but what if there is a very good car that costs just a few euro more than the limit
yet would save the visitor thousands of euro per year?
2.3.7. Seventh question: how old may the car be? Again a range slider, from 0 to 20
years, as if someone would not want to drive a new car.
2.3.8. Eighth question: the questionnaire is still far from finished. What is the maximum
number of kilometres the car may have been driven? It is now clear that Motoragent is
more of an exemplar (individual car) chooser than a model chooser. When choosing an
exemplar, it matters how much the car costs today and how many kilometres it has been
driven.
2.3.9. Ninth question: select specific car shapes. There is a list of eight car shapes
such as van, coupe and others, and the shapes the visitor does not like can be filtered
out. There is one peculiar behaviour: if “with sliding doors” is selected, only cars with
sliding doors remain in the results list, no matter which other options are selected. The
other options can be combined, returning all results that match any of the selected
shapes, but this does not apply to this option.
2.3.10. The tenth tab, named “About me”, asks for information about the person. The
first question is when the user wants to buy his next car. The maximum value that can be
selected is 24 months. Perhaps people in Germany buy a new car at least every two years,
which is nice, but this is far from true for people in Bulgaria. The second question
asks for gender and the third for age group, such as 17-19, 20-29 and others.
2.3.11. Finish! Here specific brands can be chosen so that only cars from those brands
are listed.
The good news is that the list of cars matching all our preferences is finally here. It
is great to see some images after answering about fifteen questions. A few seconds later,
however, some strange results are noticed.
All cars here have percentages. The first car is not at 100% and the next ones are even
lower. The question is what the 100% baseline is that no car has reached.
With the selected preferences, the list contains some cars more than once, for example
two Renault Clio from the same model years, one a hatchback and the other a station
wagon, although station wagons were not even selected on the ninth tab.
On the results page there are also options to compare cars and to filter by
transmission, weight and drive type. This is all that is offered here.
Motoragent has a very nice-looking interface, and with its partnership with all these
popular websites it will surely help people choose the best car for their needs.
Let us now turn to substitutes. Although they follow a different idea and cover only
part of the picture, websites for directly choosing a car exemplar, such as mobile.bg,
can be regarded as competitors or existing solutions. Users have to be convinced that
there is a reason to use this system, so it is also possible that they will decline and
continue to use their regular websites for choosing an exemplar directly.
As said before, the current ways people choose car models are asking friends and
searching the Internet for opinions. To be honest, nothing can beat a friend’s word, so
this step cannot be skipped anyway. Searching the Internet is a good thing if it is for
finding information, not advice. In recent years some official car manufacturer forums
and fan forums in Bulgaria were bought by a large media group, which turned them into a
place to earn money from, and nowadays these forums have nothing in common with what they
ought to be.
Experienced people report that advice from the Internet is the worst they have seen and
can be more harmful than anything else. Once they see real results from using the
proposed system, people are expected to gain trust in it and use it more often.
2.4. Conclusion
The current ways people choose a car to buy have their advantages as well as their
disadvantages. People are used to choosing cars by doing a lot of manual work themselves,
and it is time to end this status quo.
All the work a person has to do and all the research he has to make is automated within
the system proposed in this master degree thesis. People should just use this system and
choose the car that best matches their needs.
3. Technologies, platforms and methodologies used
3.1. Requirements for the tools, place and manner of use
The requirements for the tools are divided into hardware and software requirements.
The master degree thesis includes the proposed system as well as a configured virtual
machine with the required hardware specifications and all the software needed to run the
system already installed. By using a virtual machine, it is easier to run and modify the
system on different computers regardless of the host operating system and its settings.
We start with the software requirements. The system includes different modules, some of
which may be written in different programming languages.
The system itself is web-based and, therefore, the language it is written in has to be
well suited to the web. The chosen language should be easy to use with or without
templates. An MVC framework is preferable [9] [10]. The language and the framework should
both have large online communities so that help is available if needed. The language has
to be object-oriented.
For faster development, the programming language should be interpreted rather than
compiled by default.
The framework should be usable both from the web and from the command line, for running
cron jobs.
The framework should also provide easy ways to be installed, to rewrite URLs with
custom ones, to connect to a database, to manage configurations and to be extended.
For the frontend, a well-known JavaScript framework with a built-in selector engine
should be used. The frontend framework should also be extendable with third-party
libraries. A library for a table view with searching and sorting by custom values should
exist. The framework should support older browsers such as IE8.
The database has to be a well-known SQL database. It has to support functional
indexes.
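A functional index is built on the result of an expression rather than on a raw column, so that queries filtering on that expression can use the index. The sketch below uses SQLite through Python's standard `sqlite3` module purely as an assumption for illustration (indexes on expressions require SQLite 3.9 or later); the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO models (name) VALUES (?)",
    [("Clio",), ("GOLF",), ("golf",), ("Astra",)],
)

# The index stores lower(name), not name itself, which supports
# case-insensitive lookups written with the same expression.
conn.execute("CREATE INDEX idx_models_name_lower ON models (lower(name))")

rows = conn.execute(
    "SELECT name FROM models WHERE lower(name) = ? ORDER BY id", ("golf",)
).fetchall()
print(rows)
```

The query has to repeat the indexed expression exactly (`lower(name)`) for the database to consider the index.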
The hardware requirements are based on the virtual machine on which the project
is developed and on how it can be extended. Since the system can be extended linearly (it
scales well), the only requirement is a Linux-based OS.
The system needs a caching mechanism so that data requested more than once is returned
faster. The searches in the system are not based on any personal or confidential data, so the
cached results of one user's search can be served to subsequent users without any
issues. The caching should be memory-based.
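The caching requirement can be illustrated with a minimal sketch. It is not the system's actual code: the function names `search_cache_key` and `cached_search` and the key format are hypothetical, and a plain PHP array stands in for the memory-based cache.

```php
<?php
// Hypothetical sketch: a plain array stands in for a memory-based cache.

/**
 * Build a cache key that is identical for identical search parameters,
 * regardless of the order in which they were submitted.
 */
function search_cache_key(array $params): string
{
    ksort($params);
    return 'search_' . md5(json_encode($params));
}

/**
 * Return the cached result for $params, or run $query once and store it.
 */
function cached_search(array $params, callable $query, array &$cache)
{
    $key = search_cache_key($params);
    if (!array_key_exists($key, $cache)) {
        // Cache miss: run the expensive query and remember its result.
        $cache[$key] = $query($params);
    }
    return $cache[$key];
}
```

Because the key depends only on the search parameters and not on the user, a second visitor who submits the same search is served from the cache.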
Root access to the operating system is needed in order to install the tools, extensions,
modules, libraries and everything else the system needs to work properly.
All technologies and everything else related to this project have to be free to use for personal
and non-personal projects. Open source technologies are preferred.
3.2. Choice of tools
The first and most important choice here is the programming language. There are not many
languages to choose from: Perl, PHP, Python and Ruby. The requirements described above
lead to PHP as the first choice.
PHP is a server-side scripting language designed for web development. PHP code can be
simply mixed with HTML code, or it can be used in combination with various template
engines and web frameworks. PHP code is usually processed by a PHP interpreter, which is
usually implemented as a web server's native module or a Common Gateway Interface (CGI)
executable. After the PHP code is interpreted and executed, the web server sends the resulting
output to its client, usually as part of the generated web page. PHP is the most
popular language for web development. [11]
The Apache server serves PHP for the web and can be used with the modules the
website needs in order to work. [12]
Choosing a PHP framework is a tough decision. Plenty of new frameworks
come out every year, each claiming to be the fastest or the smartest. The full list of
compared frameworks is: CakePHP, CodeIgniter, FuelPHP, Kohana, Laravel, Phalcon,
Symfony, Yii and Zend Framework. Laravel and CodeIgniter are the two most popular
frameworks that match the requirements. Although it advertises itself as a new-generation
framework, Laravel lacks some features of the older frameworks that it should have kept.
Instead of making installation easy, it complicates it by requiring additional
tools and by lacking a default routing policy. In contrast, CodeIgniter is just “download and
run”. It works by default with no additional settings and has default policies for everything
that is needed. Less code makes the application easy to understand and maintain. [13]
CodeIgniter is the framework of choice for this project.
The frontend uses the jQuery JavaScript library, which itself uses Sizzle. Sizzle is the
most popular JavaScript selector library and is fast and reliable [14]. jQuery’s
community is growing rapidly and already ranks first among its competitors.
There is one very good table plug-in for jQuery called DataTables. DataTables is a
highly flexible tool, based upon the foundations of progressive enhancement which adds
advanced interaction controls to any HTML table. It supports all the needed functionalities
and more. [15]
There are many SQL databases, but most of them are ruled out by the requirement to
support functional indexes [16]. Statistics show that the fastest well-known database that
matches the requirements is PostgreSQL. PostgreSQL is also known as the world's most
advanced open source database and suits the current project well.
The virtual machine runs Debian 7 Wheezy, which is a Linux-based operating system
[17]. It has 2 CPUs, 1GB of RAM and 8GB of HDD assigned. These are the
recommended values; the whole system can run on far slower and cheaper hardware
(or a virtual machine) with 1 CPU, 256MB of RAM and 4GB of HDD. The system has been tested and
works normally with these minimum requirements on both a virtual machine and an old
PC from 2005. Special thanks to Philip Balinov, DevOps Engineer at Komfo, for
helping to install and configure the virtual machine.
Memcached [18] is a very good caching mechanism which stores data in RAM.
PHP and CodeIgniter work well with Memcached, and Memcached can be controlled directly
from PHP, changing limits and settings when necessary and accessing data fast.
Several PHP modules also need to be installed. In alphabetical order: php5-cli, php5-common, php5-curl, php5-dev, php5-gd, php5-mcrypt, php5-memcached, php5-pear, php5-pgsql and php5-xmlrpc.
The system also makes use of Zend Opcache, which improves performance by storing
precompiled bytecode in shared memory. [19]
All technologies (PHP, Apache, CodeIgniter, jQuery, DataTables, PostgreSQL, Debian,
Memcached and Zend Opcache) are free to use. Most of them are also open source.
3.3. Conclusion
Choosing the tools is not an easy task and should be taken very seriously. All
technologies should be chosen with care, with future ease of use and expansion
capabilities in mind.
Choosing PHP as the core language makes all requests to the system independent, so the
failure of one does not affect the others in any way. In this way, the system becomes more
stable.
At the end of this master's thesis, Halogen will be complex but very easy to run
thanks to the preconfigured virtual machine. All the administrator has to do is start the
machine and then enter the URL of the system in a browser.
It is that easy.
4. Analysis
4.1. Concept
This is a first-of-a-kind system being developed in Bulgaria. There is just one known
system which does a close job, developed and actively used in Germany.
The concept of the system is that people should own and drive cars that at the same time
match their needs and cost them the least amount of money. As simple as that.
The truth is that young people choose cars differently. Many of them have no financial
education and generally do not think about depreciation or taxes. Young people buy
cars for fun. There is no way to make these people think about depreciation, taxes, repairs or tire
prices. It’s all fun, right?
The other group consists of people who buy a car because they need one, and these are the people
who would find this system useful. These people know what a car is for and would buy
one with that in mind.
Of course, there can be hundreds of little specifics that decide whether a person buys a car or
not, like “the way the handle above the side windows closes when being released”, and even if
the developer of a system wanted to include them all, this borders on the impossible.
The most important filters for a car used for transportation are few:
4.1.1. Length and width – to have more space in the car or to park easily. The length of
the car is usually related to the car’s segment, while the width is related to the car’s class.
4.1.2. Number of seats and volume of the trunk – to drive the whole family and have
space for a lot of luggage. A person who intends to drive three little kids and an
adult at the same time might need a car with 7+ seats. The same
applies to the trunk. The more people there are in the car, the more space it
should have for luggage. If the car is for urban driving only, the trunk size
may not matter.
These numbers would give an idea of what kind of car the person is looking for.
In today’s world, there is one more thing that is extremely important and often neglected.
It is even more important for new car owners. It is called “Safety”. Halogen also gives the
user the opportunity to choose the level of safety of the car he wants to drive, if he cares about
safety at all.
With the filters done, the next step is sorting the results and calculating how
much money each car model would need over the whole period of driving.
This is the time to specify the number of kilometres of urban and extra-urban driving per
year, as well as the total number of years the person thinks he will drive the car.
The last tab is also something often neglected. Is owning a cheap car which
breaks down often the same as owning a more expensive one which does not? People who have bought a
car for fun will say that the expensive car will depreciate, so it is better to own the first
one.
The time it takes the owner to go and have his car fixed often costs more than the repair
itself. The calculation is easy. If my boss pays me 20 BGN per hour and it takes me 2 hours
to go for a 30 BGN repair, the repair costs me 70 BGN, not 30.
A lot of people do not count their personal time as a lost benefit, so the system has a
slider starting from 0 which represents how much money (in BGN) per hour the person's
own time spent on the car’s needs is worth to him.
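The repair example above is simple arithmetic and can be written as a one-line function. The helper name `real_repair_cost` is hypothetical and only illustrates the calculation.

```php
<?php
/**
 * Real cost of a repair: the bill plus the value of the owner's time.
 * Hypothetical helper; all amounts are in BGN.
 */
function real_repair_cost(float $repair_price, float $hours_spent, float $hourly_value): float
{
    return $repair_price + $hours_spent * $hourly_value;
}

echo real_repair_cost(30, 2, 20), "\n"; // 30 + 2 * 20 = 70
echo real_repair_cost(30, 2, 0), "\n";  // slider at 0: only the bill counts, 30
```

With the slider left at 0, the time spent has no influence and the cost is just the repair bill.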
This is all the information the system needs, and the next step is to find the best models
for the person.
The system search is based on 4 criteria (length, width, number of seats and trunk
size) plus 4 criteria for safety: adult, child and pedestrian protection, and safety
systems.
The system sorting is based on 3 predictions made by the person himself (distance travelled
per year in urban and extra-urban driving, and car usage in years) and 1 more number, the value of
his own time in BGN per hour.
It is surprising how much information one can get by moving a total of 12 sliders:
which car to buy, together with the predicted monthly and yearly
costs, the initial and end value of the car, the repair prices and more. Collecting this
information manually takes days, and depending on the length of the list in some cases even
weeks.
The concept is this: make it easy for the user, show the maximum amount of
confirmed information he is interested in, make close predictions for the rest and show
summarized detailed results, while letting him make the final decision himself.
4.2. Functional requirements
The functional requirements for the system are based on talks with potential users and
their requests. The main functional requirement is a way to filter and sort the
results for car models. To achieve this, the system has to have information on all car
models currently on the market.
The system has to provide an easy way for the user to select values for all filtering and
sorting fields.
The system has to show a table of results with different columns and an option to sort
by every meaningful column: car age, fuel cost for the whole period, depreciation price,
maintenance price, time spent, total cost.
This table should be aggregated and should not show the same car model many times.
A model should be shown at most 3 times, once for each matching engine
type. If the user wants to learn more about the same car model and engine type
combinations, he should be able to click on a link and see them all.
This is enforced because broad search parameters would return all model
years for a car, and if 4 cars are listed with 25 model years each, this is too much to scroll
through and gives no valuable information at all.
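The aggregation rule can be sketched as follows. This is only an illustration under stated assumptions: each result row is taken to be an array with the keys `car_maker`, `car_model`, `car_fuel` and `total_cost`, and the row kept for each engine type is assumed to be the cheapest one. The thesis does not state which row is kept, and the function name is hypothetical.

```php
<?php
/**
 * Keep one row per (maker, model, fuel type): the one with the lowest total cost.
 * Assumption: "cheapest row wins"; the real selection rule may differ.
 */
function aggregate_results(array $rows): array
{
    $best = [];
    foreach ($rows as $row) {
        // One bucket per car model and engine (fuel) type.
        $key = $row['car_maker'] . '|' . $row['car_model'] . '|' . $row['car_fuel'];
        if (!isset($best[$key]) || $row['total_cost'] < $best[$key]['total_cost']) {
            $best[$key] = $row;
        }
    }
    return array_values($best);
}
```

With three fuel types in the data, a model appears at most three times, no matter how many model years match the search.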
4.3. Non-functional requirements
Let us separate the non-functional requirements according to the FURPS+ model.
The FURPS+ model consists of functionality, usability, reliability, performance,
supportability and additional requirements.
Starting with the usability requirements, they are all based on the ways people use the
system and how they perceive it. It is important for the system to be easy to use, with no major
difficulties for the users. The interface should feel straightforward.
To keep its user base growing, the system needs both to attract new users and to
keep the old ones. This is why the website puts the stress on content and
functionality. Since there is not much text (5-10 lines) and it is mainly what
the user should read, it has to use a large font. As for the sliders for filtering
and sorting, they have to be relatively large and easy to move by clicking
on the slider or just dragging the handle.
The system should consist of a starting page with a short description of the system and a
header image, one filtering and sorting page with several tabs, and one last page for the results
with a table and additional information.
The last page, with the results table, should include images for all cars. It is much easier for
users to orient themselves if an image accompanies the car model.
The reliability requirements include information about the time the system is up
(uptime). It is important to have a quick way to recover after a full system collapse.
The system should be able to recover within 15 minutes after a full system collapse.
This master's thesis includes a preconfigured virtual machine, so a recovery
means just importing a database backup. With the database imported, the system can start
working like new. What needs to be done afterwards is to start a task
for collecting the car images that are missing. The system is designed in such a way that missing
images are not critical for its proper operation and information retrieval.
In terms of performance, response time is the most important topic. Other
important ones are time to restart and time to recover.
The system uses memcached caching and gzip compression in order to serve the
information with minimum latency and size.
The system and the database should be as fast as possible when retrieving information.
One query makes multiple calculations and database joins, and if the result set is big, the
query may execute slowly.
To be useful, a web-based system needs to have a response time of at most 0.25
seconds (250 milliseconds) [20]. Halogen has to be faster: in 99% of all cases it has to return
information from the database within 0.025 seconds, for a total page load time of 0.20 seconds,
which is 20% faster than the recommended maximum time.
The other 1% of the cases involve a broader search range that returns many results. Handling
lots of results is relatively slow, because they have to be processed and sorted at the end, which takes
time. These cases are not useful in general, but some people want to have all
available cars listed for a reason. Maybe they are looking for the cheapest car of all, no matter
what.
All times are for requests from localhost; additional latency may exist when
requesting over a slow or very distant network.
There are also a lot of cronjobs running independently. They take more time, but since
this has no effect on the end user they are not included in the above calculations.
Each cronjob has to run for less than 5 minutes, no matter the action.
The time to restart the machine has to be less than 30 seconds. This includes shutdown,
boot, kernel selection plus waiting, starting all services and successfully handling the first
request.
Depending on the hardware, the system can handle a different number of concurrent
connections. With the recommended hardware of 2 CPUs and 1GB of RAM, the system
is able to handle up to 5 concurrent connections, each of which returns a result
within the desired response time. If this setup is not enough, the system can easily scale to
multiple machines. The code can work with a remote database and the system can be load
balanced; neither requires additional modification. When tested with 100 concurrent
connections, Halogen works but returns the results after a long wait.
Supportability requirements include testing, compatibility, configuring and logging.
When a new module is added to the system, it is tested with unit tests, automated
tests and black box tests by users. When the unit and automated tests pass, the module is
deployed on the production server for a certain percentage of the visitors only. The
percentage is increased if everything seems right. In this way, the impact of a problem
is minimized, affecting only that small percentage of users. The percentage is
increased in steps as follows: 10%, 20%, 50%, and 100%.
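One common way to implement such a percentage-based release is to map every visitor to a stable bucket. The sketch below is an assumption about how this could be done, not a description of Halogen's code; the function name `in_rollout` and the use of a CRC32 hash of a visitor identifier are illustrative choices.

```php
<?php
/**
 * Decide whether a visitor sees the new module at the given rollout percentage.
 * The visitor is mapped to a stable bucket 0..99, so raising the percentage
 * only ever adds visitors and never removes ones who already had the module.
 */
function in_rollout(string $visitor_id, int $percentage): bool
{
    // Mask to a non-negative value so the result is the same on 32-bit builds.
    $bucket = (crc32($visitor_id) & 0x7fffffff) % 100;
    return $bucket < $percentage;
}

// The release steps named in the text.
foreach ([10, 20, 50, 100] as $step) {
    $enabled = in_rollout('visitor-42', $step);
}
```

Because the bucket is derived from the visitor identifier and not from a random number, the same visitor gets a consistent experience across requests.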
The new module should be compatible with the current state of the system: it should be
integrated into the system and be able to use the same database. The new module should be
developed as abstractly as possible so that it can be integrated with other future projects or be
open-sourced.
Halogen should be configurable in terms of limits, database settings, logging and error
reporting, routing and others. There should be a way to turn off a module so that all related
functionality is turned off, too. This should not affect the rest of the system in any way.
It is important for the operating system to have log rotation, backups and cronjobs. It
should also support traffic filtering in case serious issues are detected.
The system is compatible with external systems through their APIs. These systems are
mobile.bg, fuelo.net and ajax.googleapis.com.
The system also collects information from other websites which do
not have APIs. These websites are: api.bg, automedia.investor.bg, euroncap.com, bnb.bg and
sdi.bg.
There are no additional requirements for the system because the system is not developed
for a client.
4.4. Business processes
Design → Development → Testing → Deployment
4.4.1. Design
Graphical design. This process includes the whole concept and graphical design. The
designer makes the design intuitive and easy to use. Usually design is the leading factor
for customer satisfaction. In the proposed system design and functionality share the load
50/50: both are equally important for customer satisfaction and website success. The main
indicator for the success of the process is customer satisfaction.
Design covers the whole vision of the web system, which has to be intuitive and easy to use.
The main indicator for success here is how easily users find their way around the website’s
structure and whether they manage to use the website with ease.
The design process also includes choosing a colour scheme as well as coding all website
pages. It is important for the colours to match users’ expectations and not annoy them. Customer
satisfaction is again the main indicator for the success of this process.
Database design. The database design is part of the design. This process defines the
content and structure of the database. The main indicator for the success of the process is the
speed and efficiency of retrieving information from the database. This process is executed by
the developers.
First, the type of information to be stored in the database is defined, together with the
structure of the tables: types of data, keys and relations.
4.4.2. Development
This process includes the actual development of the website. The main indicator for success
is a working website with no bugs or system collapses.
All pages are developed within this process. All functionalities are also developed here,
including:
• Functionality for getting third-party liability insurance prices from the website sdi.bg, a
famous Bulgarian car insurance broker
• Functionality for getting the USD / BGN rate from bnb.bg, because some car
prices are listed in USD and should be converted
• Functionality for getting information from euroncap.com, which is the standard
for crash tests in Europe
• Functionality for synchronizing fuel prices from fuelo.net: prices of gasoline,
diesel, and LPG
• Functionality for downloading and resizing images from ajax.googleapis.com for
all cars and model years in the system
• Functionality for filling information holes. Sometimes the information for
a specific model year is missing. This information can be derived from
the previous and next years combined
• Functionality for getting car data from mobile.bg via their API
• Functionality for getting car model stock details from automedia.investor.bg,
useful for filtering in the search
• Functionality for calculating car tax based on the texts of Bulgarian law
• Functionality for calculating the tire price for each car, based on research on tire
prices from a large Bulgarian website for selling tires
• Functionality for getting the vignette price from the official website of the Road
Infrastructure Agency, api.bg
• Functionality for calculating the auto insurance premium, using a formula based on
research from the website sdi.bg
• Search functionality for filtering and sorting by user request data
• Functionality to run cronjobs
• Queue functionality for executing cronjob tasks from a table in the database
• Functionality for activating new “information” tables generated and filled by
scripts
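The list above mentions filling information holes from the previous and next model years. The thesis does not say how the two neighbouring values are combined, so the sketch below simply assumes their average; the function name `fill_missing_years` and the data layout (year => value, with null for a missing value) are hypothetical.

```php
<?php
/**
 * Fill missing (null) yearly values with the mean of the directly
 * preceding and following year. Years without both neighbours are
 * left untouched. Averaging is an assumption made for this sketch.
 *
 * @param array $by_year map of year => value or null
 */
function fill_missing_years(array $by_year): array
{
    ksort($by_year);
    $filled = $by_year;
    foreach ($by_year as $year => $value) {
        if ($value !== null) {
            continue; // real data is never overwritten
        }
        $prev = $by_year[$year - 1] ?? null;
        $next = $by_year[$year + 1] ?? null;
        if ($prev !== null && $next !== null) {
            $filled[$year] = ($prev + $next) / 2;
        }
    }
    return $filled;
}
```

Only original values are used as neighbours, so a derived value never feeds another derived value.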
4.4.3. Testing
After the complete development of a module, it is tested, usually with unit tests and, more
often, by releasing it incrementally to a small part of the visitors.
During and after the execution of the test scenarios, the actual results are checked against
the expected ones. All errors are saved in a bug tracking system. If errors have occurred not
because of bugs but because of other events, these errors are also recorded in a document and
all steps to prevent their future occurrence are taken.
Passing all tests means the system or module is ready to be deployed in the state in which it
passed the tests.
4.4.4. Deployment
The system can be deployed relatively easily.
One way to do this is to take the virtual machine hard drive, which is preinstalled with all
that is needed, and upload it directly with the source code in it.
If the hosting provider does not offer such functionality, the source code of the
system or module should be uploaded manually.
There is an existing git repository. A post-commit hook is attached on the client side, and after
each commit the database is exported to a .sql file and automatically added to the commit.
On the server there is a post-receive hook which is activated after each push
to the bare repository. If the push message contains the word “deploy”, the code is
automatically deployed after the push.
4.5. Conclusion
Analysis of the system shows that the initial target group for the system consists primarily of
people with families, stable incomes and serious jobs.
Although the system looks pretty simple, with just two meaningful pages (search criteria
and results), the development section is full of functionalities that have to be available
from the initial version of the system.
With so many functionalities the system is difficult to build, but once developed it
should fit well in the current market.
5. Design
5.1. Main architecture
The architecture of the application is shown in Fig. 1.
The website has one entry point which can be invoked
from two different places in two different ways – either by
opening the website or by running a cronjob.
The website uses the MVC (Model – View – Controller)
architecture pattern where:
• The models make database queries and return results
to the controllers
• Controllers make the connection between the models,
libraries, and views
• Views are used to generate the frontend code from
HTML pages and PHP variables
The website uses CodeIgniter – a powerful PHP
framework with a very small footprint, built for developers
who need a simple and elegant toolkit to create full-featured
web applications.
The entry point (index.php) is moved to a directory
/www which is on the same level as the other directories of
the project: /application, /database and /system. This
makes it impossible to access important
backend files directly via the web browser, because only the
content of the /www directory can be accessed directly
through HTTP.
Apart from index.php, the /www directory also contains
the directories /www/css and /www/js for storing all CSS and
JavaScript files respectively. Two more directories
are situated in /www. The first is /www/images, for storing
the images used in the website: the header, the arrows for the
results table and others. The second is /www/cars, in
which all car images are stored.
Fig. 1. Screenshot of the
tree hierarchy of the code
The first thing index.php does is read the
PROJECT_ENV environment variable, which is set on the
server, and import it as the ENVIRONMENT constant.
The constant is used in several places in the project. For
example, cronjobs normally cannot be run from the web browser, but if
the environment is “development” they can, for ease of use.
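The environment handling described above might look roughly like this. It is a sketch, not the project's index.php: the fallback value “production” and the helper `cronjob_allowed` are assumptions introduced for the example.

```php
<?php
// Import the server-side PROJECT_ENV variable as the ENVIRONMENT constant.
// Falling back to 'production' when the variable is unset is an assumption.
if (!defined('ENVIRONMENT')) {
    $env = getenv('PROJECT_ENV');
    define('ENVIRONMENT', $env !== false && $env !== '' ? $env : 'production');
}

/**
 * Cronjob controllers may always run from the command line; from the
 * browser they are reachable only in the development environment.
 * Hypothetical helper illustrating the rule from the text.
 */
function cronjob_allowed(bool $is_cli, string $environment): bool
{
    return $is_cli || $environment === 'development';
}

// php_sapi_name() returns 'cli' when PHP is started from the command line.
$allowed = cronjob_allowed(php_sapi_name() === 'cli', ENVIRONMENT);
```

Defaulting to the most restrictive environment means a misconfigured server does not accidentally expose cronjobs to the browser.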
Then the framework loads its core files automatically from /system and continues to the
/application directory. This is the directory where the developer does his work.
The /application directory consists of:
• /application/config – a directory with all custom configurations separated into files. The
most important ones here are:
o /application/config/autoload.php – specifies which modules, libraries and
others are autoloaded by default. In this case, these are the “database” and “session”
libraries.
o /application/config/config.php – various configurations that do not belong in any other
file
o /application/config/constants.php – defines constants visible in the whole project
o /application/config/database.php – defines the connection to a database. The
database configuration whose name matches the name of the environment is chosen
o /application/config/routes.php – defines custom routes that do not match the default
rule. Such a route is used for the tasks queue.
• /application/controllers – the controllers from the MVC pattern
• /application/core – a directory with commonly used files, like a controller inherited
by other controllers, a model inherited by other models and others
• /application/libraries – a directory holding all libraries used in the project. A library is a
set of methods which can be used in other projects and does not have
any insuperable dependencies on this project.
• /application/logs – a directory for logging errors from the code execution
• /application/models – the models from the MVC pattern
• /application/views – the views from the MVC pattern
The last top-level directory, /database, consists of two files which cannot be accessed
from the web directly:
• a shell script executed as a post-commit git hook, which connects to the database, exports
and gzips it, then copies the exported .sql file to the /database directory and deletes it
from the server. Lastly, it adds the file to the commit
• the database.sql file itself
The CodeIgniter framework offers an advanced model for loading files, which is one of
the reasons for choosing the framework. Unless told otherwise, the framework loads the
controller named by the first part of the URI and calls the method of this controller named by the
second part of the URI.
In this case, if the address the user opens is “http://halogen.bg/search/model”, the
framework loads the “search” controller and executes the “model” method.
The default controller name is defined in /application/config/routes.php with
$route['default_controller'], in this case “home”.
The default method name is “index”.
So when just http://halogen.bg/ is opened, the framework loads the “home”
controller and executes the “index” method.
The first and second URI segments are the controller and method names. Any third
and subsequent segments are passed as parameters to the corresponding method. This is very
easy to use and handle.
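The default routing convention can be illustrated with a small stand-alone function. This is a simplified model of the behaviour described above, not CodeIgniter's router code; the function name `resolve_route` and the shape of the returned array are hypothetical.

```php
<?php
/**
 * Split a request path into controller, method and parameters,
 * following the default "controller/method/param1/param2" convention.
 */
function resolve_route(string $path, string $default_controller = 'home'): array
{
    // Drop empty segments produced by leading, trailing or doubled slashes.
    $segments = array_values(array_filter(
        explode('/', $path),
        function ($segment) { return $segment !== ''; }
    ));

    return [
        'controller' => $segments[0] ?? $default_controller,
        'method'     => $segments[1] ?? 'index',
        'params'     => array_slice($segments, 2),
    ];
}

$route = resolve_route('/search/model'); // controller "search", method "model"
$home  = resolve_route('/');             // controller "home", method "index"
```

Custom routes defined in routes.php would be checked before this default rule applies; that step is left out of the sketch.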
Usually, what needs to be done to implement a functionality is to create a
controller, load a model and get the needed data, then load a view and pass it the data to show.
The controllers and models usually both extend CodeIgniter’s default ones. It is good practice,
followed here, to have one more level of abstraction. The project has this level, which
defines some useful methods and checks whether access to the requested
resource is allowed. In /application/core, there are the files MY_Controller.php and
MY_Model.php, which are both used instead of the default ones of CodeIgniter. One more
file here is MY_Cronjob_Controller, which is extended by the classes run from the
command-line interface (CLI).
All database requests use CodeIgniter’s Active Record class, which is not Active Record
in the strict sense of the term. It is a query builder that lets queries be run in an easy way
through a database object, no matter the database type.
Here is an example of running an SQL query using CodeIgniter’s Active Record class:
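// Builds and runs: SELECT * FROM tasks WHERE status = <value of $key>
$this->db->select('*');
$this->db->from('tasks');
$this->db->where('status', $key);
$query = $this->db->get();
// get_one() is a project-level helper, not a CodeIgniter method
$task = $this->get_one($query);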
There is a .htaccess file in the /www directory that helps make the URLs pretty by
routing all requests to the website through index.php. Once in index.php, the system
knows how to handle the request correctly.
The content of the .htaccess file can also be put in the virtual host configuration of the
Apache server (which is the right place for it), but for simplicity it is added to the project
directly in the .htaccess file:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* index.php/$0 [PT,L]
5.2. Data model
The data model changed several times during the development of the master's
thesis. Data transformations are commonly used in the project.
The database consists of the following tables:
• “ci_sessions” – a table for storing session data managed by the CodeIgniter
framework
• “euroncap” – a table for storing data from crawling the website euroncap.com. The
information here covers the car manufacturer, model, production year, link to the
page on Euro NCAP’s website and the four grades the car has received, each between 0
and 100. 0 means the information is missing
• “images” – a table for storing hashed image names associated with a specific car
model year. The table stores the car manufacturer, model and year, the URL address
from which the image was downloaded and the image key. The image key is the hashed
image name plus the file extension, and it is the name of the image file in the
/www/cars directory
• “information” – tables automatically generated by the system with a serial number at
the end of the name. These are the main tables where all the information ends up. The
information is merged into them via PHP scripts. Such a table includes all the information for
filtering and sorting. Its columns are:
o car manufacturer (“car_maker”), model (“car_model”) and years (“car_years”),
fuel type (“car_fuel”) – these need to be shown to the user, saying which car to
buy, at exactly what age and with what type of fuel
o price (“car_price”) – the price of the car is important and is used to calculate
the depreciation
o kilometres (“car_km”) – for statistical use only at this time. It is interesting to
see how cars between 5 and 20 years old all have the same travelled distance
o car width (“width”) and length (“length”) – used for filtering
o number of seats (“seats”) and trunk size (“trunk”) – again for filtering
o the four Euro NCAP grades (“euroncap_adult”, “euroncap_child”,
“euroncap_pedestrian”, “euroncap_safety”) – used for filtering
o fuel consumption urban (“fuel_city”) and extra-urban (“fuel_highway”) – these
two are used for sorting. The numbers are later multiplied by the price of the
corresponding fuel type
o initial money (“initial_costs”) – the initial price of the car plus all costs
(registration, repairs, taxes and others) before the person can drive his car freely
without worries
o yearly money (“yearly_costs”) – the cost of repairs, taxes and other expenses that
must be paid each year in order to keep driving the car in the same worry-free
way
o initial hours (“initial_hours”) – the number of hours that the actions covered by the
initial money column would take. This column matters only if the user has chosen a
value other than 0 on the own time cost slider
o yearly hours (“yearly_hours”) – the number of hours that the actions covered by the
yearly money column would take. Like the initial hours, this column matters
only if the own time cost slider is set to a value other than 0
o real (“data_is_real”) – a column showing whether the data in the row is real or
derived from real data (not real). The system only suggests cars whose data
is real. Rows that are not real are still used in the calculations
o key (“key”) – a hashed column that exists only to make joins faster. The key is
generated from car manufacturer + model + fuel type. When results are requested
from the system, the SQL query joins the table with itself on this column
• “mobile” – a table holding the information from mobile.bg: car manufacturer and
model, kilometres and production date, price, engine type, and the dates on which the
ad was published and last edited on mobile.bg
• “models” – a table holding the information for the car models and modifications from
automedia.investor.bg: car manufacturer and model, production date range, fuel
type, engine power (in horsepower), engine volume (in cm³), car width and length,
trunk size and number of seats, urban and extra-urban fuel consumption, tire size and
coupe type
• “settings” – a table holding general information for the whole system and values
used in the search sorting calculations, such as the fuel prices. The table
has the following columns: type of the setting, value and name
• “tasks” – a table acting as a queue for tasks that have been created and are waiting
for an available executor. The table has these columns: task – JSON-encoded data about
the task to be executed, including its parameters; status – shows
whether the task has been reserved for execution (a method for dealing with
concurrency); and type – a number between 1 and 10. Thanks to the “status” column,
when several executors of the same type start at the same time, each task is executed
exactly once
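The structure of a “tasks” row and the reservation rule can be sketched as follows. This is only a single-process model under stated assumptions: the array stands in for the database table, the status values 0 (waiting) and 1 (reserved), the JSON field names and the task method name are hypothetical, and in the real database the check and the update would have to happen in one atomic statement.

```php
<?php
// Hypothetical in-memory stand-in for the "tasks" table (not the thesis code).
// Assumed status values: 0 = waiting, 1 = reserved.

function make_task($method, array $params, $type)
{
    return [
        'task'   => json_encode(['method' => $method, 'params' => $params]),
        'status' => 0,
        'type'   => $type, // a number between 1 and 10
    ];
}

// Reserve the first waiting task of the given type and return its id, or NULL.
// Marking the row as reserved is what keeps a second executor from taking it.
function reserve_task(array &$tasks, $type)
{
    foreach ($tasks as $id => $row) {
        if ($row['type'] === $type && $row['status'] === 0) {
            $tasks[$id]['status'] = 1;
            return $id;
        }
    }
    return NULL;
}

$tasks = [
    make_task('collect_test', ['page' => 1], 4),
    make_task('collect_test', ['page' => 2], 4),
];

$first  = reserve_task($tasks, 4); // first executor gets task 0
$second = reserve_task($tasks, 4); // second executor gets task 1
$third  = reserve_task($tasks, 4); // nothing left to reserve
```

Each task id is handed out at most once, which is the property the “status” column provides.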
These are the tables; now let us look at the connections between them.
• “ci_sessions” is a framework-specific table, so it has nothing to do with the system’s
logic
• “euroncap”, “images”, “mobile” and “models” contain information coming from
external websites. They do not have any foreign keys or direct connection with other
tables
• “settings” belongs to the system itself and contains general information, so it also
has no external connection or foreign keys
• “tasks” exists only to act as a queue for the system. It is just a read
and write table and has no connections with anything else
• the “information” tables are the only tables built from the information
stored in other tables. Even so, the decision was to have no direct
connection to any other table:
o car manufacturer and model are written differently in all tables having these
columns. Mercedes, Mercedes-Benz, Mercedes AG and others should be the
same but are not. Many car manufacturer and model names differ slightly or more.
Also, the combination car manufacturer + model is not unique in the “models”
table, which makes it impossible for these columns to be foreign keys to a
specific table.
o “years” is a column showing the age of the car, not when it was produced. In all
other tables, year-like columns show the manufacturing date
o the four columns from the “euroncap” table are a good candidate, but this would
mean one more join in the query, and tests show it is faster to copy these four
columns together with their values. Denormalizing a bit helps a
lot
o none of the other columns can, or should, be used as a foreign key
An index on the “key” column in the “information” tables makes lookups fast. Every
search request joins the “information” table with itself twice using the “key” column.
There is one more index, combining several fields of the “information” tables. It
covers the columns that are searched together: “data_is_real”, “width”, “length”, “trunk”,
“seats”, “euroncap_adult”, “euroncap_child”, “euroncap_pedestrian” and “euroncap_safety”.
Searches use different combinations of fields and values, but because the query always
includes all of these fields, PostgreSQL can use the index. This is a great performance
boost, because running the search without any index is much slower.
In PostgreSQL terminology, every primary key uses its own sequence. A
sequence is an auto-incrementing number generator: every time the next number is
requested, the sequence is incremented and the new number is returned.
5.3. Diagrams
Since the project is very specific, merging many data sources without any foreign keys,
the database diagram looks as follows (Fig. 2).
Fig. 2. Database diagram
As can be seen, some columns exist in more than one table. They may or may not be the
same as those in the final table – the “information” table.
Here is one more diagram showing exactly the columns that exist in several tables.
Columns containing the same or related information are coloured the same way and
connected with arrows, as shown in Fig. 3.
Fig. 3. Database diagram showing columns containing the same or related information
The three tables on the left – “ci_sessions”, “settings” and “tasks” – are not connected in
any way to the core system.
The main table is connected to the other four tables and merges them. The data in the
“information” table comes from the other tables as follows:
• the columns “car_maker” and “car_model” go together. They can be found in all of
the other four tables. This is easily explained: these two columns describe a
car model, which is the basis of the system – recommending car models to buy
• the “car_years” column is similar to “car_maker” and “car_model”. It describes a
car model directly. Car manufacturers change the exterior and interior of a car
without changing the model name, which is why one combination of
“car_maker” and “car_model” can have multiple different “car_years” values. The
“car_years” column can be found, in a different format, in all the other tables
• “car_price” and “car_km” come from the “mobile” table. They are the
average values of the matching rows in the “mobile” table
• “car_fuel” comes from two places, in different formats – “mobile” and “models”.
Together with “car_maker” and “car_model”, “car_fuel” forms the hashed column
“key” in the “information” table. “car_fuel” is also important for showing the results
in the frontend, which are grouped by “car_maker”, “car_model” and “car_fuel”
• “width”, “length”, “seats”, “fuel_city”, “fuel_highway” and “trunk” are columns
describing the car for filtering (“width”, “length”, “seats”, “trunk”) and sorting
(“fuel_city”, “fuel_highway”), because it matters how many litres per 100 kilometres
the car needs. These six columns can be found only in the “models” table
• the four “euroncap” columns (“euroncap_adult”, “euroncap_child”,
“euroncap_pedestrian”, “euroncap_safety”) come from the “euroncap” table
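A possible way to build such a key is sketched below. The thesis does not show the hashing code, so everything here is an assumption: the function name is hypothetical, MD5 is chosen only because it yields 32 characters (the declared length of the “key” column), and the lower-casing and trimming are illustrative normalisation steps.

```php
<?php
// Hypothetical key builder, not the thesis implementation.
// Assumption: the key is an MD5 hash over normalised maker, model and fuel type.
function example_information_key($car_maker, $car_model, $car_fuel)
{
    $parts = [$car_maker, $car_model, $car_fuel];
    $normalised = array_map(function ($part) {
        return strtolower(trim($part));
    }, $parts);

    // A separator keeps "ab" + "c" from colliding with "a" + "bc"
    return md5(implode('|', $normalised));
}

$a = example_information_key('Opel', 'Corsa', 'Diesel');
$b = example_information_key(' opel', 'CORSA ', 'diesel');  // same car, written differently
$c = example_information_key('Opel', 'Corsa', 'Gasoline');  // different fuel type
```

Rows of the same model and fuel type get the same key at every age, which is what allows the table to be joined with itself on this single column.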
Apart from the database structure, it is interesting to show why the optimization of
combining all the search information in a single table was made – Fig. 4.
Fig. 4. PostgreSQL explanation graphic for a search query
Starting from the top left:
• First, an index scan on the index information_121_index_all (“…_index_all”) is
performed for the condition “data_is_real=1” plus all the search criteria – “width”,
“length”, “trunk”, “seats”, “euroncap_adult”, “euroncap_child”,
“euroncap_pedestrian” and “euroncap_safety”
• The index scan is followed by a nested loop, which means a join with data from the
same table. This join performs another index scan on the same table, this time using
the information_121_index (“…_index”) index. The search here is quite fast. A more
natural solution would be a specialized search database such as Elasticsearch, and this
was the initial idea. Later it turned out that PostgreSQL handles these
searches well and that adding one more database would not bring any performance gain
• Further to the right, one more nested loop is done with the same table. Like the
previous join, this one uses an index, so both the scan and the join are fast
• Next, a hash left join is performed against a hash built from a sequential scan of
the “images” table. Here the join looks up the car’s image from the same year.
The result of the query up to this point is an in-memory table with all
columns and data
• The HashAggregate is where more data is filtered out: it
executes the “HAVING” SQL clause. In PostgreSQL, HashAggregates are
considered one of the fastest parts of executing a query, so this step is also quite fast
• The last step is to sort the results. The sort is done in one loop
The query planner can decide to execute the query in different ways depending on the
provided filtering criteria. For example, if no filter criteria are provided, execution would
start from the third table (second join), then move to the first table and then to the second
(first join).
PostgreSQL’s planner chooses the execution plan it estimates to be the cheapest for the
given query.
Here is the query execution plan for this sample search query (Fig. 5):
Fig. 5 – PostgreSQL query plan for a search query
The query executes in 22 ms.
At the moment, the system works with more than 100 000 real car ads and can
recommend more than 3300 cars, which is a lot.
Almost 700 is the maximum number of results that can be shown in the results
table. The remaining 2600 cars are repetitions of the same combination of car manufacturer,
model and fuel type at a different age.
Given these numbers, no buyer could have considered all the possibilities alone.
With Halogen, the buyer is a few clicks away from advice that may save hundreds
or even thousands every year.
One more interesting test is the benchmark performed with the tool “ab” (ApacheBench).
The test sends 100 requests at different concurrency levels to the
search page on the recommended setup: 2 CPUs and 1 GB of RAM.
The numbers in the header row (1, 2, 3, 5 and 10) are the concurrency levels, while the
percentages on the left (50%, 75% and 99%) are the shares of requests. Each cell is the
time in milliseconds within which that share of the requests was served at that
concurrency level (Fig. 6).
As said before, a website is considered fast if it loads in less than 0.25 seconds =
250 milliseconds.
The results in the table are in milliseconds.
        1      2      3      5      10
50%     46     69     100    204    604
75%     48     76     113    239    945
99%     59     92     149    335    955
Fig. 6 – Benchmark test results: columns show the concurrency level while rows show the
percentage of requests served
The cells with values below 250 milliseconds (coloured in green) meet this target, while
the others exceed it. At a concurrency level of 5, 75% of the requests are still served
within 250 milliseconds.
5.4. User interface
The user interface includes just three different types of screens.
All pages start from a minimal design and follow the “keep it simple” principle.
All pages have the same structure with a header and a footer and
differ only in the content section.
The home page contains information about the system and a button to start using it.
The filter and sort page has tabs, each of which shows sliders and checkboxes related to
the name of the tab. At the bottom of each tab there are one or more buttons for navigating
to the next or previous tab or for seeing the results. All sliders are smart: when only one
of a slider’s values (“from” or “to”) matters, only that handle can be moved.
The results page prints the results in a table. When the page loads, the table is
processed by the DataTables plugin, which adds a filtering option and allows custom sorting.
The first design was made by the developer himself. To be successful, the website
needed a design made by a professional designer, and the current design is by Jordan Lipchev.
Here are three screenshots of the initial design (Fig. 7, Fig. 8 and Fig. 9) and three of the
current professional design (Fig. 10, Fig. 11 and Fig. 12):
Fig. 7 – Home page of Halogen (previous design)
Fig. 8 – Search page (previous design)
Fig. 9 – Results table (previous design)
Fig. 10 – Home page of Halogen (new design)
Fig. 11 - Search page (new design)
Fig. 12 – Results table (new design)
5.5. Additional modules
As mentioned before, the CodeIgniter framework is used in the backend.
An additional module for crawling websites – PHPQuery – is also used. It is a server-side,
chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery
JavaScript library. In other words, jQuery for PHP.
One more PHP library, named resize, is used for image resizing.
In the frontend, the following JavaScript libraries are used: jQuery, jQueryUI and the
DataTables plugin. DataTables is used for the results table, including its filtering and sorting.
Linux’s native cronjobs are used.
5.6. Conclusion
Despite the problems with data integrity and formatting, the result is worth it.
There are some interesting discoveries. Generally speaking, with no additional filter
criteria:
• If the person does not value his own time, it is better to buy an old car running on LPG
• If the person values his own time at 25 BGN per hour, it is better to buy a newer
car running on gasoline or diesel
• If the person values his time highly, at 60 BGN per hour (like a businessman), then no
matter how many kilometres per year he drives, he should buy a new car
• No matter the criteria, it is cheaper to drive a smaller car
• If the person drives a lot (more than 30 000 kilometres per year) and does not value
his time, it is cheaper to buy an old LPG car
• If the person drives a lot and values his time at 25 BGN per hour, it is slightly
better to own a new gasoline or diesel car than an old LPG car
• If the person does not drive much, just about 5 000 kilometres per year, urban only,
the cost is the same as using only taxis
Other interesting results can also be derived from the database, alone or together with
user requests.
6. Realization, testing, integration
6.1. Realization of the modules
All modules are written in PHP in the backend and JavaScript in the frontend.
6.1.1. Auto insurance pricing module
The insurance price of a car is usually taken from the Schwackeliste. Since only
insurance brokers have access to this catalog, car premiums were researched by sending
sample requests to the website of an insurance broker. Around 100 test requests were
made and the results were aggregated into an array of coefficients.
The price of the auto insurance is calculated by multiplying the price of the car
by the percentage from this array that corresponds to the car’s age.
The array starts with the value for a new car, followed by the value for a 1-year-old
car and so on (Fig. 13). The last value also applies to all older cars:
Fig. 13 – Auto insurance premium by car age. X-axis shows the age of the car, Y-axis shows
the percentage of the price of the car to pay as an insurance premium
The auto insurance pricing module is used when calculating how much a car costs
its owner per year. The auto insurance is paid every year and is not affected
by the kilometres driven. The premium, as a percentage of the car’s price, is shown
in the graph above.
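The lookup described above can be sketched as follows. The coefficients are placeholders chosen only to make the example readable, not the values from Fig. 13, and the function name is hypothetical.

```php
<?php
// Placeholder percentages by car age (index 0 = new car). NOT the thesis data.
$example_coefficients = [5.0, 4.5, 4.0, 3.5];

// Yearly premium = car price * percentage for the car's age.
function example_insurance_premium($car_price, $car_age, array $coefficients)
{
    // Cars older than the last entry reuse the last entry
    $index = min($car_age, count($coefficients) - 1);

    return round($car_price * $coefficients[$index] / 100, 2);
}

$new_car = example_insurance_premium(20000, 0, $example_coefficients);  // 5.0% of 20000
$old_car = example_insurance_premium(4000, 12, $example_coefficients);  // 3.5% of 4000
```

Because the premium depends only on price and age, it can be precomputed for each row of the “information” table.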
6.1.2. Citizen liability pricing module
Just like the auto insurance, the citizen liability insurance is not affected by the
kilometres driven and is paid every year.
The prices of the citizen liability insurance change month by month, and insurance brokers
differ in their criteria, so the prices used here are the average prices on the market.
To generate the prices, the data of an average (most common) driver is used for age,
city of residence and driving experience.
The prices are collected every month from sdi.bg, which is a large Bulgarian insurance
broker. What matters here is the volume of the engine.
To collect the prices, a cronjob is executed which inserts as many
tasks into the “tasks” table as there are different engine volumes in the database. The
volumes are also taken from the website of sdi.bg.
Cars with an engine volume larger than the last one pay what the last one pays.
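The rule for engine volumes can be sketched as a bracket lookup. The limits and prices below are placeholders, not values from sdi.bg, and treating each limit as an inclusive upper bound is an assumption.

```php
<?php
// Placeholder brackets: upper engine-volume limit in cm3 => yearly price in BGN.
// The real limits and prices come from sdi.bg and are not reproduced here.
$example_brackets = [1200 => 100.0, 1600 => 200.0, 2000 => 300.0];

function example_liability_price($engine_volume, array $brackets)
{
    ksort($brackets); // make sure the limits are checked in ascending order
    $price = NULL;
    foreach ($brackets as $limit => $bracket_price) {
        $price = $bracket_price;
        if ($engine_volume <= $limit) {
            return $price;
        }
    }
    // Larger than the last bracket: pay what the last bracket pays
    return $price;
}

$small = example_liability_price(1000, $example_brackets); // first bracket
$exact = example_liability_price(1600, $example_brackets); // second bracket
$large = example_liability_price(3500, $example_brackets); // beyond the last bracket
```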
This is the method for getting the average price for the citizen liability. It returns the
median of the collected values.
public function get_median($values)
{
    // An empty list has no median; return 0 instead of reading missing indexes
    if (empty($values))
    {
        return 0;
    }
    // The median is only meaningful on sorted values
    sort($values);
    $total_values = count($values);
    $middle = (int) floor($total_values / 2);
    if ($total_values % 2 === 1)
    {
        // Odd count: the middle element
        $median = $values[$middle];
    }
    else
    {
        // Even count: the mean of the two middle elements
        $median = ($values[$middle - 1] + $values[$middle]) / 2;
    }
    return round($median, 2);
}
Here is a graph of the citizen liability prices for the different engine volumes as of
the current date – Fig. 14.
Fig. 14 – Citizen liability price by engine volume. X-axis shows the engine volume while the
Y-axis shows the price in BGN for a year
6.1.3. Currency module
The currency module converts the prices of cars that come from mobile.bg in a
currency other than BGN into BGN.
It synchronizes the USD rate from the website of the Bulgarian National Bank –
bnb.bg – every day. A cronjob is used once again.
The other possible currency for a car price is the EUR, but since its rate to the BGN is
fixed, it is simply saved in the database as 1.95583, as stated on bnb.bg.
This is the formula for calculating the USD / BGN rate.
round((string) $row->RATE * (string) $row->RATIO, 2);
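A minimal sketch of the conversion itself, assuming the currency arrives as a three-letter code (an assumption, the thesis does not show the ad format). The EUR rate is the fixed value stated above; the USD rate used in the example is a placeholder, since the real one is synchronized daily.

```php
<?php
const EUR_TO_BGN = 1.95583; // fixed rate stated in the text

// Hypothetical helper: convert an ad price to BGN.
function example_price_in_bgn($price, $currency, $usd_rate)
{
    switch ($currency) {
        case 'BGN':
            return round($price, 2);
        case 'EUR':
            return round($price * EUR_TO_BGN, 2);
        case 'USD':
            return round($price * $usd_rate, 2);
    }
    throw new InvalidArgumentException("Unsupported currency: $currency");
}

$placeholder_usd_rate = 1.75; // illustrative value only, not a real quote

$from_eur = example_price_in_bgn(1000, 'EUR', $placeholder_usd_rate);
$from_usd = example_price_in_bgn(2000, 'USD', $placeholder_usd_rate);
$from_bgn = example_price_in_bgn(3000, 'BGN', $placeholder_usd_rate);
```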
6.1.4. Euro NCAP module
This module collects information from euroncap.com.
The module is implemented as a standalone library which crawls the Euro NCAP
website for all kinds of cars. After being crawled, each page is parsed in order to extract
the useful data from it. For the parsing itself, the PHPQuery library is used.
The result of using this library is an array containing all the car manufacturers, models,
links to the pages and grades for every category in which the car has been graded.
This module runs as a cronjob once per month. Crawling takes time, so the
synchronization needs to be done separately from the user requests.
After getting the results, the system inserts the new ones into the database, i.e.
synchronizes the database.
euroncap.com rates cars with stars that do not correspond directly to the numbers
from the test results. The system therefore uses the actual numbers
from the test results and converts them to percentages of the maximum possible
result.
Sometimes a car is not graded in all of the categories.
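The conversion to percentages can be sketched as below. The point values and maxima in the example are placeholders, the function name is hypothetical, and mapping a missing category to 0 is an assumption.

```php
<?php
// Hypothetical helper: convert raw test points to a percentage of the maximum.
function example_points_to_percent($points, $max_points)
{
    // Assumption: a category without a grade is stored as 0
    if ($points === NULL || $max_points <= 0) {
        return 0;
    }

    return (int) round($points / $max_points * 100);
}

// Placeholder numbers, not real Euro NCAP results
$adult = example_points_to_percent(27, 36);   // 27 of 36 points
$child = example_points_to_percent(NULL, 49); // category not graded
$full  = example_points_to_percent(36, 36);   // maximum result
```

Storing percentages instead of stars makes the four categories directly comparable in the search filters.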
When this task is started, the first thing it does is check for new tests. If there are no new
tests on the euroncap.com website, nothing happens. If new tests are available, each of
them is inserted into the “tasks” table and waits to be executed (for the test data to be
collected).
An interesting graph here is the average Euro NCAP percentage per year by
type of protection – Fig. 15.
Fig. 15 – Euro NCAP average protection percentages by year. X-axis is the year, Y-axis shows
the percentage
The percentage of cars by protection level is also an interesting topic (Fig. 16). Just
a few models top the chart with 100% results. Generally the results are good, but many
cars are still below a certain level of protection, say 70% or 50%.
Fig. 16 – Euro NCAP percentage of cars by protection level. X-axis is the range of protection
percentages, Y-axis represents the percentage of all cars with this protection level
6.1.5. Fuel module
Halogen collects the fuel prices several times per day through an API from fuelo.net.
Fuelo provides an endpoint which returns the average price across the gas stations in Bulgaria.
The system synchronizes the prices to the database, where they are used in the calculations.
Fuel prices are saved into the “settings” table.
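How a stored fuel price enters the cost calculation can be sketched as follows: the consumption per 100 km is multiplied by the distance driven and by the price per litre. The function name, the consumption figures and the price are illustrative assumptions.

```php
<?php
// Hypothetical helper: yearly fuel cost from urban and extra-urban driving.
// $fuel_city and $fuel_highway are in litres per 100 km.
function example_yearly_fuel_cost($km_city, $km_highway, $fuel_city, $fuel_highway, $price_per_litre)
{
    $litres = $km_city / 100 * $fuel_city + $km_highway / 100 * $fuel_highway;

    return round($litres * $price_per_litre, 2);
}

// 5 000 km urban at 8 l/100 km plus 10 000 km extra-urban at 5 l/100 km,
// with a placeholder price of 2.00 BGN per litre:
// (400 l + 500 l) * 2.00
$cost = example_yearly_fuel_cost(5000, 10000, 8.0, 5.0, 2.00);
```

Since the price comes from the “settings” table at request time, a fuel price change affects the sorting immediately without regenerating the “information” table.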
6.1.6. Images module
Halogen works both with and without car images. The user interface is very important,
however, and users want to see pictures. This is why an images module was developed; it
collects car images from Google, using the image search API at
ajax.googleapis.com.
There are almost 1000 combinations of car manufacturer and car model. But since a car’s
look changes over time, multiplying by production year gives more than 6000
combinations, which is about 7% of all cars.
Once per day a cronjob runs and collects images for the cars whose model is in the
database but whose image is missing.
The images module uses the images library, developed as a standalone library for
extracting images from Google’s images API.
The search keyword is formed by combining the car manufacturer, car model,
manufacturing year and the word “front”. The purpose is to get as many “good”
images as possible: a car list looks better when the pictures show the cars’ “smile”.
Right before insertion into the database, the image is downloaded, renamed and resized.
The download uses curl with a custom user agent header.
Renaming the image with a specially generated name helps when the picture needs to be
updated.
Resizing is an important step: the image is scaled to fit within at most 400 x 300 pixels,
so the resize is always proportional. It is implemented by an
external resize library that uses GD.
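The proportional fit into 400 x 300 pixels comes down to picking the smaller of the two scale factors. The sketch below shows only that arithmetic, not the resize library itself; the function name is hypothetical and never enlarging small images is an assumption.

```php
<?php
// Hypothetical helper: target size that fits within the box and keeps the aspect ratio.
function example_fit_within($width, $height, $max_width = 400, $max_height = 300)
{
    // The smaller factor guarantees that both sides fit; 1 means "do not upscale"
    $scale = min($max_width / $width, $max_height / $height, 1);

    return [(int) round($width * $scale), (int) round($height * $scale)];
}

$landscape = example_fit_within(1600, 900); // limited by the width
$portrait  = example_fit_within(600, 1200); // limited by the height
$small     = example_fit_within(200, 100);  // already fits, left unchanged
```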
The whole image processing is done by the “manage_image” method:
public function manage_image($car_maker, $car_model, $year, $image)
{
    $image_file_name = $this->information_library->generate_key($car_maker, $car_model, $year);
    $download_file_name = FCPATH . 'cars/' . $image_file_name;
    try
    {
        $image_info = $this->download_image($download_file_name, $image);
        $this->rename_image($download_file_name, $image_info);
        $this->resize_image($download_file_name, $image_info);
        $ext = $this->get_ext($image_info);
    }
    catch (Exception $ex)
    {
        return FALSE;
    }
    return $image_file_name . '.' . $ext;
}
6.1.7. Information module
The information module is the heart of the whole system. It generates the results
and populates them into the database.
A task that runs once per day manages the generation of the
“information” table.
First, a new table is created with all the needed columns, primary key and indexes.
Then a query collects those models from the “mobile” table which have at least 5
matches for the same car manufacturer, car model, age and fuel type. For each of these
models a task is inserted, and at the end one more task called “information_fix” is inserted.
When one of these model tasks runs, an insert is made first to ensure the model row
exists. Then the model’s variable information is selected: the average price and
mileage. After that, additional information about the model is collected, such as engine
power, manufacturing dates from and to, width, length and more. Several corrections are
also applied here: a correction of the fuel consumption for cars running on LPG, a
correction of the car manufacturer and model names, because they differ between tables,
and others.
Next, the Euro NCAP information is taken, if any. Then the hours and costs
calculations are done, which includes getting the prices of citizen liability,
fuel and others. At the end, an update is performed with all the information that has just
been calculated.
When all the model tasks have finished, the “information_fix” task is executed. This task
finds the holes in the newly populated table. An example of a hole is a missing year for a car
model.
Consider this example. A combination of car manufacturer and model exists for ages
from 1 to 10, except for the 7-year-olds. The task would calculate the average of
the 6 and 8-year-olds and insert it as the 7-year-old row, so that the model is complete.
With these holes fixed, the search works better, because the expenses can be calculated
for every year of a model without gaps.
It is worth noting that the system will not recommend any of these “fixed” cars.
Halogen only recommends cars for which the initial data is real.
To calculate the value that fills a hole, the “calculate_hole_value” method is used. It
receives the first and second prices, together with the part of the missing period to
generate a value for (the period may contain several consecutive missing years), and
returns the result of a linear formula:
protected function calculate_hole_value($first, $second, $period_part)
{
    // Linear interpolation between the two known values, counted down from the higher one
    $min = min($first, $second);
    $max = max($first, $second);
    $diff = $max - $min;
    return $max - $period_part * $diff;
}
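A worked example of the formula, under the assumption that the period part is the fraction i / (n + 1) for the i-th of n consecutive missing years. The thesis does not state how the fraction is computed, so that reading, the function name and the prices are illustrative only. Note that the formula counts down from the higher value, so it assumes the younger car is the more expensive one.

```php
<?php
// Hypothetical wrapper: fill $missing_years consecutive holes between two known prices.
function example_fill_holes($first_price, $second_price, $missing_years)
{
    $min = min($first_price, $second_price);
    $max = max($first_price, $second_price);

    $values = [];
    for ($i = 1; $i <= $missing_years; $i++) {
        // Assumed meaning of the period part: position within the gap
        $period_part = $i / ($missing_years + 1);
        $values[] = round($max - $period_part * ($max - $min), 2);
    }

    return $values;
}

// Ages 6 and 8 are known (10000 and 8000), age 7 is missing:
$one_hole = example_fill_holes(10000, 8000, 1);  // one value halfway between the two

// Ages 6 and 9 are known (9000 and 6000), ages 7 and 8 are missing:
$two_holes = example_fill_holes(9000, 6000, 2);  // two evenly spaced values
```

With a single missing year the result is the average of the two neighbours, which matches the 6 and 8-year-old example in the text.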
When all holes are fixed, again using tasks, one “activate” task is run.
The “activate” task activates the table into which all tasks so far have inserted and
updated information. The “activate” task is configured to keep a history of 3 “information”
tables; older ones are dropped. At the end, the whole cache is cleared, ready for new cached
data.
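Choosing which tables to drop can be sketched as below. The naming scheme with a numeric suffix is inferred from the index name information_121_index_all mentioned earlier; the function names and the reading that the 3 newest tables in total are kept are assumptions.

```php
<?php
// Hypothetical helper: numeric suffix of a table name such as "information_121".
function example_table_number($name)
{
    return (int) substr($name, strrpos($name, '_') + 1);
}

// Return the tables that fall outside the newest $keep tables.
function example_tables_to_drop(array $table_names, $keep = 3)
{
    usort($table_names, function ($a, $b) {
        return example_table_number($b) - example_table_number($a); // newest first
    });

    return array_slice($table_names, $keep);
}

$existing = ['information_119', 'information_121', 'information_118', 'information_120'];
$drop = example_tables_to_drop($existing); // only the oldest table remains to be dropped
```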
At this time, the new “information” table is ready to be used and all requests are
automatically redirected to it because it is the default table now.
The “information” tables have columns for car manufacturer and model, car age, car
kilometres and fuel type, width and length, number of seats and trunk size, fuel consumption
urban and extra-urban, Euro NCAP data for adults, children, pedestrians protection and safety
systems, initial and yearly costs and hours that have to be spent on the car, a column showing
if the data is real or result of fixing holes, key for easy joining.
Every “information” table is generated automatically from the code by running the
“create_table” method first and the “create_indexes” method second:
public function create_table()
{
    $this->load->dbforge();
    $this->dbforge->add_field("information_id bigint NOT NULL DEFAULT nextval('information_sequence')");
    $this->dbforge->add_field("car_maker character varying(63) NOT NULL");
    $this->dbforge->add_field("car_model character varying(63) NOT NULL");
    $this->dbforge->add_field("car_years integer NOT NULL");
    $this->dbforge->add_field("car_price decimal NOT NULL");
    $this->dbforge->add_field("car_km integer NOT NULL");
    $this->dbforge->add_field("car_fuel character varying(15) NOT NULL");
    $this->dbforge->add_field("width integer DEFAULT 0");
    $this->dbforge->add_field("length integer DEFAULT 0");
    $this->dbforge->add_field("seats integer DEFAULT 0");
    $this->dbforge->add_field("fuel_city decimal DEFAULT 0");
    $this->dbforge->add_field("fuel_highway decimal DEFAULT 0");
    $this->dbforge->add_field("trunk integer DEFAULT 0");
    $this->dbforge->add_field("euroncap_adult integer DEFAULT 0");
    $this->dbforge->add_field("euroncap_child integer DEFAULT 0");
    $this->dbforge->add_field("euroncap_pedestrian integer DEFAULT 0");
    $this->dbforge->add_field("euroncap_safety integer DEFAULT 0");
    $this->dbforge->add_field("initial_costs decimal DEFAULT 0");
    $this->dbforge->add_field("initial_hours integer DEFAULT 0");
    $this->dbforge->add_field("yearly_costs decimal DEFAULT 0");
    $this->dbforge->add_field("yearly_hours integer DEFAULT 0");
    $this->dbforge->add_field("data_is_real integer DEFAULT 1");
    $this->dbforge->add_field("key character(32) NOT NULL");
    $this->dbforge->add_key('information_id', TRUE);
    return $this->dbforge->create_table($this->get_next_table_name(), TRUE);
}
public function create_indexes()
{
    $table = $this->get_next_table_name();
    $index_query = "CREATE INDEX " . $table . "_index ON " . $table . " (key);";
    $index_all_query = "CREATE INDEX " . $table . "_index_all ON " . $table
        . " (data_is_real, width, length, trunk, seats, euroncap_adult, euroncap_child,"
        . " euroncap_pedestrian, euroncap_safety);";
    return $this->db->query($index_query) && $this->db->query($index_all_query);
}
Graphs make it easier to notice trends, in price or in anything else.
Let us first see the number of different models on the market for each year (Fig.
17) and then the average model age in years (Fig. 18):
Fig. 17 – Number of different models on the market by year
Fig. 18 – Average model age in years
Here are three graphics showing the prices of three of the most popular car models on the
Bulgarian market – VW Golf (Fig. 19), Opel Corsa (Fig. 20) and Ford Fiesta (Fig. 21).
Fig. 19 – VW Golf price over time by engine type. X-axis: age of the car, Y-axis: average cost
Fig. 20 – Opel Corsa price over time by engine type. X-axis: age of the car, Y-axis: average
cost
Fig. 21 – Ford Fiesta price over time by engine type. X-axis: age of the car, Y-axis: average
cost
Some interesting observations for the three models: gasoline models generally become
cheaper with time; diesel models are more expensive in the 3rd year than in the 2nd; no
matter the fuel type, all prices flatten and converge at the end; the first offers usually
appear in the 2nd year.
6.1.8. Mobile module
The integration with mobile.bg’s API is an interesting topic.
The API has just one endpoint (for ads export) and requests to it are limited to 15
calls per 15 minutes. On average this is 1 call per minute, so the natural answer
is a cronjob.
The API can neither export all ads at once nor perform complex queries.
The maximum lifetime of an ad in the database of mobile.bg is 49 days, unless it is
modified or extended.
The cronjob for collecting mobile.bg ads is executed every minute. It sends
a request to mobile.bg asking for the ads from a specific day in the past. The number of
days back is the minute of the current time modulo 30. At 10:50, for example, the script
asks for the ads from 20 days ago, because 50 % 30 is equal to 20.
Here is the code that generates the from and to dates for collecting car ads from mobile.bg:
// Minute of the current hour (0-59) modulo 30 gives the number of days to go back
$current_hour_minutes = (int) date("i") % 30;
$date_from = strtotime("-$current_hour_minutes days 00:00:00");
$date_to = strtotime("-$current_hour_minutes days 23:59:59");
The collected data is: mobile.bg id, car manufacturer and model, kilometres of the car,
production date, price, dates of publishing and last editing of the ad, and fuel type.
The connection to mobile.bg is made using a custom library, which sends the
request using curl. The response is JSON-encoded data, which is decoded and used
later.
When the ads are returned, their IDs are checked to determine which ads already
exist and which are new.
Before any other checks, a reliability check is performed. It checks whether
the price matches one of a set of predefined values such as 0, 11 or 111; these are fake
prices for crashed cars or cars sold for parts.
If the car has been a taxi or is under a lease, it is also excluded. There are not many
such cars, but their advertised prices are usually fake, or the description
says something more about the price.
Ads containing words hinting that the price is not real are also excluded.
This is the code that checks whether an ad is reliable. By default, an ad is reliable.
private function _check_ad_reliable($ad)
{
    $reliable = TRUE;
    // Placeholder prices used for crashed cars or cars sold for parts
    if (in_array((int) $ad['price'], [0, 11, 111]))
    {
        $reliable = FALSE;
    }
    // 'на части' means "for parts"
    elseif (isset($ad['condition']) && $this->_check_existence($ad['condition'], ['на части']))
    {
        $reliable = FALSE;
    }
    // Extras: taxi, crashed ('Катастрофирал'), leasing ('Лизинг'), for parts ('На части')
    elseif (isset($ad['extri']) && $this->_check_existence($ad['extri'],
        ['TAXI', 'Катастрофирал', 'Лизинг', 'На части']))
    {
        $reliable = FALSE;
    }
    // 'вноска' / 'вноски' mean "instalment(s)", also matched in Latin transliteration
    elseif (isset($ad['description']) && $this->_check_existence($ad['description'],
        ['вноска', 'вноски', 'vnoska', 'vnoski']))
    {
        $reliable = FALSE;
    }
    return $reliable;
}
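The “_check_existence” helper is not shown in the text. A plausible stand-in is a case-insensitive substring search over a list of words, sketched below. The function name, the assumption that the checked field is a plain string and the use of mb_stripos (with a byte-based fallback) are all assumptions rather than the thesis code.

```php
<?php
// Hypothetical stand-in for "_check_existence": TRUE if any needle occurs in the text.
function example_check_existence($haystack, array $needles)
{
    foreach ($needles as $needle) {
        // mb_stripos handles Cyrillic case-insensitively; stripos is an ASCII-only fallback
        $found = function_exists('mb_stripos')
            ? mb_stripos($haystack, $needle, 0, 'UTF-8')
            : stripos($haystack, $needle);
        if ($found !== FALSE) {
            return TRUE;
        }
    }

    return FALSE;
}

$was_taxi = example_check_existence('Former taxi, well kept', ['TAXI', 'Лизинг']);
$clean    = example_check_existence('Family car, one owner', ['TAXI', 'Лизинг']);
```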
If the ad already exists, a full check for updates is performed. When a change is found,
the ad is recorded for an update together with that specific change.
If the ad is new, it is simply added to the ads for insertion.
At the end, these arrays of ads to insert and update are sent to the database.
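Splitting a response into ads to insert and ads to update can be sketched as follows. The field names (“id”, “price”, “km”) and the function name are hypothetical, and the ads are assumed to be flat arrays of scalar values.

```php
<?php
// $existing: ads already stored, keyed by mobile.bg id.
// $fetched:  ads from the latest API response.
function example_split_ads(array $existing, array $fetched)
{
    $to_insert = [];
    $to_update = [];
    foreach ($fetched as $ad) {
        $id = $ad['id'];
        if (!isset($existing[$id])) {
            $to_insert[] = $ad; // new ad: insert as a whole
            continue;
        }
        // Known ad: keep only the fields whose values differ from the stored ones
        $changes = array_diff_assoc($ad, $existing[$id]);
        if ($changes) {
            $to_update[$id] = $changes;
        }
    }

    return [$to_insert, $to_update];
}

$existing = [101 => ['id' => 101, 'price' => 5000, 'km' => 150000]];
$fetched = [
    ['id' => 101, 'price' => 4800, 'km' => 150000], // price changed
    ['id' => 102, 'price' => 7000, 'km' => 90000],  // not seen before
];

list($to_insert, $to_update) = example_split_ads($existing, $fetched);
```

Sending the two arrays in bulk keeps the number of database round trips per cronjob run small.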
Each time Halogen requests data from mobile.bg, the response is about 7.5-8.0
MB of JSON-encoded data. All this information is parsed at once with no trouble.
6.1.9. Models module
The module is started by a cronjob once per month. It checks the website
automedia.investor.bg for new car data. If any new data is found, a task for collecting each
new model is inserted. The pages are parsed because the connection is one-way only.
The useful parsed information goes into the “models” table: catalog id, car
manufacturer, car model, manufacturing dates from and to, fuel
type, engine volume and power, car width and length, trunk size, tires, urban and
extra-urban fuel consumption, coupe type and number of seats.
6.1.10. Search module
The search module is the main module in Halogen. The module is system-dependent and cannot be separated as a library.
When a request comes into the search module, it first parses the parameters. The parsing is important because the search module is one of the most common sources of security holes in a system.
After the parameters are parsed into an array, a key is generated based on all of them. The key is hashed and identifies the combination of search parameters.
This is where the cache comes into play for the first time. Memcached is used for caching the searches. A quick check in the cache shows whether such a query has already been made. If so, the results are immediately returned from the cache (from RAM) to the script and then to the frontend.
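A key of this kind could be built as in the sketch below. The real make_key() method is not shown in the text, so the function name make_cache_key, the "search_" prefix and the choice of md5 over json_encode are assumptions made for illustration.

```php
<?php
// Hypothetical cache-key builder; not Halogen's actual make_key().
function make_cache_key(array $params, $prefix = 'search_')
{
    // Sort by parameter name so that the order of the GET parameters
    // does not produce different keys for the same search.
    ksort($params);
    // Hashing keeps the key short and free of whitespace and control
    // characters, which Memcached does not accept in keys.
    return $prefix . md5(json_encode($params));
}
```

Sorting before hashing is the important design choice here: two requests with the same filters in a different order map to the same cache entry. Nested parameter arrays would need a recursive sort, which this sketch does not do.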
If the desired data is not in the cache, the search has to be performed. The search needs to know the current fuel prices and which “information” table is the active one. These parameters are passed to the search method in the model.
The search is then performed with the parameters requested by the user and the result is returned as an array to the controller.
The search query itself is not a simple one. This is a sample search query:
SELECT
    i1.car_maker,
    i1.car_model,
    i1.car_years,
    i1.car_fuel,
    im.image_key AS car_image,
    ROUND(i1.car_price, 2) AS car_price_beginning,
    ROUND(LEAST(i1.car_price, i3.car_price), 2) AS car_price_end,
    ROUND(i1.car_price - LEAST(i1.car_price, i3.car_price), 2) AS depreciation,
    ROUND(i1.initial_costs + SUM(i2.yearly_costs), 2) AS running_cost,
    i1.initial_hours + SUM(i2.yearly_hours) AS running_hours,
    ROUND((5000 / 100 * i1.fuel_city + 10000 / 100 * i1.fuel_highway)
        * (CASE i1.car_fuel WHEN 'Бензин' THEN 2.06 WHEN 'Дизел' THEN 2.21 ELSE 0.85 END)
        * 5, 2) AS fuel_cost,
    ROUND(i1.car_price + i1.initial_costs + SUM(i2.yearly_costs)
        - LEAST(i1.car_price, i3.car_price)
        + (i1.initial_hours + SUM(i2.yearly_hours)) * 5
        + (5000 / 100 * i1.fuel_city + 10000 / 100 * i1.fuel_highway)
        * (CASE i1.car_fuel WHEN 'Бензин' THEN 2.06 WHEN 'Дизел' THEN 2.21 ELSE 0.85 END)
        * 5, 2) AS total_cost
FROM information_121 AS i1
JOIN information_121 AS i2 ON i1.key = i2.key
JOIN information_121 AS i3 ON i1.key = i3.key
LEFT JOIN images AS im ON i1.car_maker = im.car_maker
    AND i1.car_model = im.car_model
    AND EXTRACT(YEAR FROM current_date)::integer - 1 - i1.car_years = im.year
WHERE
    (i1.data_is_real = 1)
    AND (i1.car_years <= i2.car_years and i1.car_years + 5 > i2.car_years)
    AND (i1.car_years + 5 = i3.car_years)
    AND (i1.width >= 1600 AND i1.width <= 1900)
    AND (i1.length >= 3800 AND i1.length <= 4600)
    AND (i1.seats >= 4 AND i1.seats <= 5)
    AND (i1.trunk >= 250)
    AND (i1.euroncap_adult >= 75) AND (i1.euroncap_child >= 75)
    AND (i1.euroncap_pedestrian >= 50) AND (i1.euroncap_safety >= 50)
GROUP BY
    i1.information_id, i1.car_maker, i1.car_model, i1.car_years,
    im.image_key, i1.car_fuel, i1.car_price, i3.car_price,
    i1.initial_costs, i1.initial_hours, i1.fuel_city, i1.fuel_highway
HAVING COUNT(i2.information_id) = 5
    AND SUM(i2.data_is_real) >= (CASE COUNT(i2.information_id) WHEN 1 THEN 1 ELSE 2 END)
ORDER BY total_cost
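The fuel_cost column of the sample query can be read as a small formula: litres burnt per year in the city and outside it, multiplied by the fuel price and by the number of years of ownership. The sketch below restates it in PHP. The function name fuel_cost is hypothetical; the default values (5000 km urban and 10000 km extra-urban per year, 5 years) are the constants that appear in the sample query.

```php
<?php
// Fuel cost for the ownership period, mirroring the fuel_cost column of the sample query.
// Consumption is in litres per 100 km; the price is in BGN per litre.
function fuel_cost($fuel_city, $fuel_highway, $price_per_litre,
                   $km_city = 5000, $km_highway = 10000, $years = 5)
{
    // Litres per year = urban share + extra-urban share.
    $litres_per_year = $km_city / 100 * $fuel_city + $km_highway / 100 * $fuel_highway;
    return round($litres_per_year * $price_per_litre * $years, 2);
}
```

For a petrol car that uses 8 l/100 km in the city and 5 l/100 km outside it, this gives 900 litres per year and, at the sample price of 2.06 BGN per litre, 9270 BGN for five years.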
Here is a graph showing the execution time of this query against the database for a sample of 100 runs (Fig. 22).
Fig. 22 – Search query execution time in milliseconds. X-axis is the serial number of the test,
Y-axis is the query execution time in milliseconds
More parameters can be added, which makes the “WHERE” clause even larger and more complex.
Let us take a look at the slowest query that can be generated in the system. The query uses the full ranges for all sliders (the ones that return the most results) as well as all checkboxes. The query runs for 165 milliseconds. Its execution plan is different because the PostgreSQL planner estimates the cost of the candidate execution plans for each query and chooses the one it expects to be cheapest.
When the search query returns the results, they are filtered in the PHP code so that only one car remains for each combination of car manufacturer, car model and engine type. The reason this filtering is performed in the code is simple: it is faster. Doing the same in the query would make it considerably more complex and slower. SQL is not well suited to this type of logic, while a loop in the code handles it easily. The row that remains is the first one, which, given the ordering by total cost, is the best result for the combination.
After the filtering, some more information is added to each row: the matching percentage, the price per kilometre and the price per month. In theory, these could also be computed in the SQL query, but the matching percentage would be too slow to calculate there, so it is added after the results are ready. The other two fields could easily be included in the query, but since the loop in the code runs anyway for the matching percentage, it is easier to add them there, which keeps the SQL query as simple as possible.
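The filtering and supplementing step might look like the sketch below. It is not the actual manage_search_resuts() code: the function name, the use of car_fuel as a stand-in for the engine type, and the derivation of the two prices from 5 years of ownership at 15000 km per year (the figures used in the sample query) are assumptions. The matching percentage is left out because its formula is not given here.

```php
<?php
// Hypothetical post-processing of rows already sorted by total_cost (cheapest first).
function keep_best_per_combination(array $rows, $years = 5, $km_per_year = 15000)
{
    $best = [];
    foreach ($rows as $row) {
        $combo = $row['car_maker'] . '|' . $row['car_model'] . '|' . $row['car_fuel'];
        if (isset($best[$combo])) {
            continue; // a cheaper row for this combination was already kept
        }
        // Supplement the surviving row with the derived prices.
        $row['price_per_km'] = round($row['total_cost'] / ($years * $km_per_year), 2);
        $row['price_per_month'] = round($row['total_cost'] / ($years * 12), 2);
        $best[$combo] = $row;
    }
    return array_values($best);
}
```

Because the rows arrive sorted, a single pass with an associative array is enough, which is the kind of logic that is awkward to express in SQL.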
The results are now filtered and supplemented, which makes them complete.
The JSON encoded data array is then saved in the cache so that it can be retrieved quickly in the future. The cache time-to-live (TTL) is set to 24 hours, which is most probably enough since the “information” table is regenerated every day, and when this happens the whole cache becomes stale and is cleared.
public function index()
{
    $params = $this->parse_params(FALSE);
    $key = $this->make_key($params);
    $data = $this->cache->get($key);
    if ($data === FALSE)
    {
        $this->load->model('Fuel_model');
        $fuel_prices = $this->Fuel_model->get_prices();
        $this->load->model('Information_model');
        $table_name = $this->Information_model->get_table_name();
        $data = $this->Search_model->find($params,
            ['fuel' => $fuel_prices, 'table' => $table_name]);
        $this->manage_search_resuts($data, $params);
        $this->cache->save($key, json_encode($data), 24 * 60 * 60); // in seconds
    }
    else
    {
        $data = json_decode($data, TRUE);
    }
    $query = http_build_query($this->input->get());
    $this->load_view('search', ['models' => $data, 'params' => $params,
        'query' => $query, 'show_more' => TRUE]);
}
6.1.11. Tasks module
Halogen has lots of web pages that need to be crawled and lots of information to be generated. The task module executes a given number of tasks of a given type.
The different types are defined as constants: TYPE_1, TYPE_3, TYPE_5 and TYPE_8. Tasks of different types represent different queues. For each queue, there is a cronjob running a different number of tasks at once.
Tasks with TYPE_1 are the “information” tasks. They are small, quick to execute, and should be executed as soon as possible after they are generated and pushed to the table. This is the reason three workers (cronjobs) for TYPE_1 are started at once every minute, each with 500 tasks to execute. This means up to 1500 TYPE_1 tasks are executed every minute.
Tasks with TYPE_3 are the Euro NCAP and “images” tasks. These tasks are limited by the number of requests the corresponding website accepts. For example, if more than 5 requests to the euroncap.com website are made within a minute, the site goes down. The same applies to ajax.googleapis.com, except that the service just rejects the queries. Three tasks with TYPE_3 are executed every minute.
Tasks with TYPE_5 collect the civil liability insurance prices from the website sdi.bg. The limit here is 2 tasks per minute because otherwise the script gets rejected. The connection to the website is slow and the whole price collection takes up to 7 minutes.
Tasks with TYPE_8 collect information about the models from the website automedia.investor.bg. Just one TYPE_8 task is executed each minute. Tests show that increasing the number to more than 5 tasks per minute makes the website go down.
More task types can be added at any time.
The tasks controller is used to generate tasks. It is also the entry point for executing all tasks. Called with a type and a number of tasks, the endpoint keeps executing as long as tasks are available and the requested number of tasks has not been reached.
Each task is JSON decoded, a new controller is instantiated and the corresponding method is executed with the provided parameters. When it is done, the next task follows the same pattern.
The following code loads the corresponding controller for the task and executes the action, passing the task’s parameters.
$controller_name = $task_decoded['controller'];
$action_name = isset($task_decoded['action']) ? $task_decoded['action'] : 'index';
$params = isset($task_decoded['params']) ? $task_decoded['params'] : [];
include_once APPPATH . 'controllers/' . $controller_name . '.php';
$controller = new $controller_name();
call_user_func_array(array($controller, $action_name), $params);
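The loop around this dispatch code, which runs while tasks are available and the requested number has not been reached, can be sketched as follows. The function run_tasks and its two callbacks are hypothetical stand-ins: $reserve_next represents fetching the next task of the queue and $execute represents the controller dispatch shown above.

```php
<?php
// Hypothetical executor loop. $reserve_next returns the next task as a JSON
// string, or NULL when the queue for this type is empty.
// $execute runs one decoded task.
function run_tasks(callable $reserve_next, callable $execute, $limit)
{
    $done = 0;
    while ($done < $limit && ($task = $reserve_next()) !== NULL) {
        $decoded = json_decode($task, TRUE);
        if (is_array($decoded)) {
            try {
                $execute($decoded);
            } catch (Exception $e) {
                // A failing task must not stop the rest of the queue.
                error_log('Task failed: ' . $e->getMessage());
            }
        }
        $done++;
    }
    return $done;
}
```

The limit is checked before the next task is fetched, so a worker never takes a task it is not going to run.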
An interesting point is how concurrency is resolved. PostgreSQL handles many simultaneous connections well, and this is exactly why a concurrency problem arises.
The naive flow is to first select a task and then delete it from the database. These are two queries, and queries from other executors may run between them.
If three cronjobs executing tasks of the same type run together, there is a high chance that two or more of them take the same task and start executing it.
The solution is to generate a random string when selecting a task and to first update one free task, setting its status to that random string. By doing this, the task is reserved for this executor.
Reserving just one task still takes two queries: the first selects the task id and the second updates the task with this id. The trick is to run the select as a subquery of the update and to add the words “FOR UPDATE” at its end. This tells the database to lock the selected row, so that other queries trying to reserve the same row wait for the whole current statement to finish, not just the internal select.
UPDATE tasks SET status = 'na3h0fhwe80fhhuhufsesuhijw9f6a0fje' WHERE task_id = (
    SELECT task_id
    FROM tasks
    WHERE status IS NULL AND type = 1
    ORDER BY type DESC, task_id ASC
    LIMIT 1
    FOR UPDATE
)
Once the task has been reserved by this query, the executor can select and delete it by the random string, knowing that the task will be picked up by exactly one executor.
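Generating the random string and the reservation statement together could look like the sketch below. The helper name build_reservation and the named placeholders are assumptions; random_bytes() requires PHP 7 or later, so on PHP 5.4 a function such as openssl_random_pseudo_bytes() would have to be used instead.

```php
<?php
// Hypothetical helper: builds the reservation statement with a fresh random token.
// Returns the token together with a parameterized SQL string.
function build_reservation($type)
{
    $token = bin2hex(random_bytes(16)); // 32 hex characters, PHP 7+
    $sql = 'UPDATE tasks SET status = :token WHERE task_id = ('
         . ' SELECT task_id FROM tasks'
         . ' WHERE status IS NULL AND type = :type'
         . ' ORDER BY task_id ASC LIMIT 1 FOR UPDATE)';
    return [
        'token'  => $token,
        'sql'    => $sql,
        'params' => [':token' => $token, ':type' => (int) $type],
    ];
}
```

With PDO, the statement would be prepared and executed with the returned parameters. If the statement reports zero affected rows, no task was reserved, either because the queue is empty or because another executor took the candidate row first, and the executor can retry or stop. Otherwise the task is fetched and deleted by the token.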
All tasks should be written in such a way that even if a task fails for some reason, the system continues working without any issues. A failed task should also be able to be queued again by the same task generation script.
This is what the crontab looks like:
# cronjobs
0 0 * * * php /var/www/halogen/www/index.php fuel
0 0 * * * php /var/www/halogen/www/index.php currency
0 11 1 * * php /var/www/halogen/www/index.php vignette
* * * * * php /var/www/halogen/www/index.php mobile
# cronjobs that generate more cronjobs
0 4 * * * php /var/www/halogen/www/index.php task information
10 4 * * * php /var/www/halogen/www/index.php task images
0 23 1 * * php /var/www/halogen/www/index.php task citizen_liability
0 23 2 * * php /var/www/halogen/www/index.php task models
0 23 3 * * php /var/www/halogen/www/index.php task euroncap
# executors
* * * * * php /var/www/halogen/www/index.php task 8 1
* * * * * ...
* * * * * ...
* * * * * ...
* * * * * ...
As shown, the crontab is divided into regular cronjobs, cronjobs that generate tasks once per day or month, and executors that run every minute.
6.1.12. Taxes module
The tax module is a stand-alone library based on the Bulgarian law on vehicle taxes. The tax depends on the age of the car and on its engine power in kilowatts, which can easily be converted to horsepower. The module is used for calculating the costs on a yearly basis.
6.1.13. Tires module
This module is also implemented as a library.
Research was done on the prices of tires of different sizes, and a common formula was found for the rate at which tires get cheaper or more expensive with size.
There is a wide variety of tire sizes, and car owners know that tires need to be changed after a certain number of years, so this is a recurring cost based on the price of the tires.
If tires are changed every 5 years and one tire costs 125 BGN, then 4 summer and 4 winter tires cost 8 × 125 = 1000 BGN for 5 years, or 200 BGN per year for the tires only. It is not a big cost, but surely not a small one, and many people forget about it.
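The arithmetic above can be written as a one-line function. The name yearly_tire_cost and its parameters are hypothetical; the defaults of four tires per set and two sets (summer and winter) follow the example in the text.

```php
<?php
// Yearly tire cost, assuming one summer and one winter set of four tires each,
// all replaced after the same number of years.
function yearly_tire_cost($price_per_tire, $years_between_changes, $tires_per_set = 4, $sets = 2)
{
    return $price_per_tire * $tires_per_set * $sets / $years_between_changes;
}
```

For the example in the text, yearly_tire_cost(125, 5) evaluates to 125 × 4 × 2 / 5 = 200 BGN per year.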
There is a formula for calculating the price of tires. Tire prices are predictable in the general case: the smaller the tire, the cheaper it is.
There is a default tire size with a default price, and all other prices are derived from the default one.
The default size and price are: 135/X R15 for 120 BGN
The research shows that the tire’s height does not make any difference to the price, which is why it is not included in the formula.
For tires 145/X R16 the price is 170 BGN; for 185/X R17 it is 290 BGN.
6.1.14. Vignette module
Driving on Bulgarian extra-urban roads currently requires buying a vignette sticker. The sticker is most often bought every year. Its price is fixed by the Road Infrastructure Agency.
To have the price of the sticker in the system, a library for parsing the website of the agency was developed.
The price of the vignette is also saved in the “settings” table.
6.2. System integration
The following diagram shows how the system communicates with external systems.
The double-sided arrows show two-way communication, in other words the use of an API, while the one-sided arrows show one-way communication, that is, parsing information from web pages (Fig. 23).
Fig. 23 – System integration of Halogen. Left side – systems to connect with using an API.
Right side – systems to crawl
As seen in the diagram, only mobile.bg, fuelo.net, and ajax.googleapis.com provide two-way communication, while all the other sites require parsing.
There is one important rule for integrating with the other systems: the data received should not be modified. The exception is data for which all possible values are known and their number is finite. For example, the manufacturing date can be modified because all options are known and the modification is straightforward. The model name should not be modified because there are models with different model indexes that have to be collected as-is and processed later in a suitable way that may vary.
It is important to have the original data in the database.
Halogen connects to mobile.bg via their API, receiving arrays of ads with full information about each ad. New ads are inserted and changed ads are updated. This is the most interesting integration, in the true sense of the word. A special IP-restricted personal key is used on every request.
The data from mobile.bg is processed so that the rows inserted into the database contain the following columns. “mobile_id” is the ad id from mobile.bg’s API. “car_maker” and “car_model” are the names of the car manufacturer and model, inserted as-is. “km” is the mileage of the car in kilometres, as-is. “date_produced” is the date in the format YYYY-MM-01. It arrives as a month name in Bulgarian plus a year, which is unpleasant to parse, but it works well in the selected database format. “price” is the price of the car combined with the currency from another field; the price is converted to BGN and saved as such. “date_published” and “date_edited” both come from the API, as does “engine_type”. The exceptions are LPG engines, for which the extras are parsed in search of certain keywords.
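Parsing the “date_produced” value can be sketched as below. The exact text the API sends is not shown in this chapter, so the assumed input format (a Bulgarian month name followed by a four-digit year, for example "март 2008") and the function name parse_date_produced are hypothetical.

```php
<?php
// Hypothetical parser: "<Bulgarian month name> <year>" to "YYYY-MM-01".
// The exact input format of the API is an assumption.
function parse_date_produced($text)
{
    static $months = [
        'януари' => 1, 'февруари' => 2, 'март' => 3, 'април' => 4,
        'май' => 5, 'юни' => 6, 'юли' => 7, 'август' => 8,
        'септември' => 9, 'октомври' => 10, 'ноември' => 11, 'декември' => 12,
    ];
    if (!preg_match('/^\s*(\S+)\s+(\d{4})\b/u', $text, $m)) {
        return NULL;
    }
    // Month names may arrive capitalized; lowercase them when mbstring is available.
    $name = function_exists('mb_strtolower') ? mb_strtolower($m[1], 'UTF-8') : $m[1];
    if (!isset($months[$name])) {
        return NULL;
    }
    return sprintf('%04d-%02d-01', $m[2], $months[$name]);
}
```

Returning NULL for anything that does not match lets the caller decide whether to skip the ad or store it without a production date.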
Connecting to fuelo.net is done the same way, with the difference that the fuel prices are updated on every request. Again, a personal key is used for each request. The key is obtained by registering with the service.
Google provides a new API endpoint for image search. In order to use it, the developer has to download and include the whole package of Google PHP libraries for all their services, which is more than Halogen needs. Connecting to ajax.googleapis.com with a GET request is good enough and provides reliable results with just one line of code.
The integration with automedia.investor.bg is done by parsing pages from the website. The URL addresses have the structure: website / catalog / car manufacturer id / car model id / car modification id. Crawling all of this would be too much: the crawler would need to parse many additional pages listing all car manufacturers, then all models with pagination, then all modifications with pagination again. A workaround was found: changing only the modification id in the URL is enough, since the website does not seem to take the other ids into account when a modification id is present. This makes the whole crawling easier and faster.
When a new page on automedia.investor.bg is found, it is crawled and inserted into Halogen’s database.
Parsing api.bg is an easy task. The page always has the same static URL address.
Parsing euroncap.com is a tough task. The website consists of several categories and has a different structure for old and new models. These features make the website feel incomplete or only partly working, but it works pretty well once understood.
Parsing the currency rates from XML is the only way to get the BGN/USD rate from bnb.bg. Both BGN/USD and BGN/EUR are inserted and kept in synchronization with the “settings” table.
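With the rates available, converting an ad price to BGN is a single multiplication. The sketch below is an assumption about how this could be done, not Halogen's code: the function name price_to_bgn and the shape of the $rates array are hypothetical.

```php
<?php
// Hypothetical conversion of an ad price to BGN using rates kept in the "settings" table.
// $rates maps a currency code to the number of BGN paid for one unit of that currency.
function price_to_bgn($amount, $currency, array $rates)
{
    if ($currency === 'BGN') {
        return round($amount, 2);
    }
    if (!isset($rates[$currency])) {
        // Refuse to guess: an unknown currency would silently distort average prices.
        throw new InvalidArgumentException('No rate for currency: ' . $currency);
    }
    return round($amount * $rates[$currency], 2);
}
```

For example, with the fixed rate of 1.95583 BGN per EUR, an ad priced at 1000 EUR is stored as 1955.83 BGN. The USD rate changes daily and has to come from the synchronized value.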
sdi.bg is one of the largest Bulgarian websites and companies for car insurance. In order to get prices from the website, the user has to fill in a lot of information, most of which either does not influence the prices at all or influences the price of only one specific company. Halogen fills in values chosen to make the prices applicable to as many people as possible. For example, Halogen sets the driver experience to more than 10 years, the owner age to 35 years, the number of seats to 5, and so on.
The prices are then synchronized with the “settings” table and used in the next “information” table generation.
6.3. Testing
Apart from the testing of the non-functional requirements, testing is done by real people on an incremental basis, both for the whole system and for every newly developed module. The share of people is increased in steps as follows: 10%, 20%, 50%, and 100%.
Testing is also performed alongside development. Every developed functionality should be deployed with no known issues.
Everything a module should do is defined clearly before development begins. After a part of the module is ready, it is tested by the developer. When another part connected to the first is done, both are tested separately and together. At the end, the whole module with all of its branches is tested and deployed if everything looks right.
6.4. Experimental integration
Halogen is experimentally deployed and can be opened at http://halogen.bg/.
The hosting is a virtual private server (VPS) from superhosting.bg running Debian 7.1 Wheezy with 2 CPUs x 2.6 GHz and 2 GB RAM, with SSH key-based authentication. The server also hosts another project.
Superhosting does not support creating a machine from a custom virtual machine image, so all tools had to be installed separately.
The following is installed on the experimental integration machine:
• PHP version 5.4.36.0, the latest stable release available for Debian systems directly from the apt-get utility. The needed extensions are also installed in their latest versions
• Apache version 2.2.22, again the latest stable release for Debian
• PostgreSQL version 9.1, latest stable, configured to listen for connections from localhost only
• Memcached version 1.4.13, latest stable, with default settings (up to 64 MB RAM)
• Zend Opcache version 7.0.2, the latest stable, with the following configuration:
  o enabled for web and CLI
  o revalidate frequency of 60 seconds with validate timestamps (the Opcache checks for updated scripts by file timestamps every 60 seconds)
  o memory consumption up to 128 MB (the maximum size of shared memory the Opcache can use)
  o max accelerated files: 223 (the maximum number of keys the Opcache can save; this should be a prime number)
  o interned strings buffer: 8 MB (the amount of memory used to store interned strings)
• CodeIgniter version 2.2.0
• jQuery version 1.11.2 (latest stable from the 1.x branch), jQueryUI for jQuery 1.11.2
• DataTables version 1.10.4 for jQuery 1.11.2
• Cronjob settings identical to the ones from chapter 6.1.
The server stats of the experimental integration are shown in Fig. 24.
Fig. 24 – Experimental integration server stats by running the “top” command
7. Conclusion
7.1. Summary of the execution and claims for original results
The list of initial tasks includes:
7.1.1. To find a database with real car prices.
This task is accomplished by using the API of mobile.bg. Partnering with mobile.bg (Bulgaria’s largest website for car ads) is a great success for any project, even more so for a newly released one like Halogen.
7.1.2. To find all the needed data for all car models that can be used for searching or should be included in the calculation formula, and to automate its gathering.
This task includes the modules for collecting information from euroncap.com (used for filtering) and automedia.investor.bg (used for filtering and sorting). Both are implemented and the synchronization with them works well.
7.1.3. To find prices for all variable costs on car models and automate their gathering.
Although repair prices were not found anywhere on the Internet, talks with friends and mechanics helped in creating a formula for calculating the variable repair costs a car model needs.
The developed methods add these costs to the car’s price when the data is processed and added to the database.
7.1.4. To find prices for all variable costs an owner pays, other than repairs, and automate their gathering.
All costs were found on official websites and the data is used in the calculations.
7.1.5. To find data for safety information on car models.
Data from euroncap.com is used. All crash test results are collected and synchronized with the database of Halogen.
7.1.6. To find images for each car model and model year.
Images are collected from Google and resized to smaller dimensions, which are still large enough for the system to look good and be useful from the user’s perspective.
7.1.7. To build the system (frontend, backend, database) so that it is fast at calculating the score for each car model and sorting by it.
The whole system is built, working and experimentally deployed. In the frontend, the DataTables plugin works well with the expected functionality.
In the frontend, there should be an easy way of choosing values for the different filters and search fields with sliders and checkboxes. The table with results in the frontend should have the ability to search and sort on different columns.
Collecting information, calculations, search and caching are all implemented.
Some tests should be made on whether the system searches fast or slow in the database. If the search is slow, an additional specialized database for searching should be used for this part of the system.
7.1.8. To test the system with requests from real people with different needs and financial situations.
Tests were made with even more people than expected.
As expected, younger people prefer to choose a car based on different criteria, first of all engine power and volume. Older people understand the idea better and are more willing to trust the results.
A relatively small number of people admit they have never thought about the real criteria by which they buy a car. Most of the older people agree with Halogen’s concept of choosing a car based on costs.
In summary, the initial tasks are all accomplished and the result is the web-based application for car selection called Halogen.
No such system exists on the Bulgarian market yet, so the opportunity remains open. On the international market, no country besides Germany has a website offering car selection based on lifetime total costs.
Germany’s system (Motoragent) continues to be developed and changed, now giving somewhat more realistic results, but it is still far from the concept of Halogen.
Both systems, Halogen and Motoragent, have a lot to learn from one another.
7.2. Guidelines for future development and improvement
Halogen has the opportunity to be the first of its kind on the local market, and if it succeeds in attracting visitors, the concept of choosing a car based on lifetime total costs may be adopted widely.
Halogen works well at the moment, but there is a lot to be improved and developed.
One more integration with mobile.bg needs to be made for synchronizing deleted or expired ads. Otherwise, these ads remain in the database with the prices they were last recorded with. This distorts the average prices and should be fixed in the near future. With deleted and expired ads synchronized, the average prices shown for cars will always reflect the ads that are actually active.
At the moment, using some old prices is not a problem because the system is still young, so the prices do not differ that much.
A solution for the repair costs should be found. Ideally, this is where the business part comes in. The official representatives may want to provide a list of the usual repairs their cars need, together with prices, dates and kilometres. Of course, they would like to present themselves as the best car manufacturer in terms of costs and would provide a short list, which would place their cars at the top. This would cost them a certain amount of money on a monthly basis. Together with the list of repairs, a list of their new cars would be added.
Another part of the business is adding banners between the header and the results table, between the results table and the footer, and Google AdWords or similar ads on the right side of the results table. Full rebranding may be interesting for the official representatives and is an option.
Ideally, the banners at the top and bottom would be traded with mobile.bg: Halogen advertises their site while they advertise Halogen, or Halogen is included in their circle of websites and shows banners from their system for 50% of the cost. This is a matter for discussion.
The “information” tables may be generated several times per day, resulting in more accurate results throughout the day.
The price prediction algorithm can be changed, especially to work with a manufacturer’s new models. Currently, if a model is new and the person chooses 5-year ownership, the model does not appear in the suggestions because the end price is not known. An approximation of the depreciation and other costs can be made so that the model is included in the calculations.
All the approximations and calculations for time can be checked once again. This is also a point for discussion with the representatives.
Another source of information about the models should be found. The reason is that no new models are being added to the current source, nor to any of the sources checked initially.
A new task execution mechanism using a real queue should be implemented, for example with Python and RabbitMQ or another combination.
Using APIs helps a lot. If APIs became available for most of the external services, this would remove or at least minimize the errors caused by parsing data.
When a stable version of the 3.x branch of CodeIgniter is released, which may happen soon, migration may be considered.
Site visitors may want to see the current rating of the cars based on what they have already selected. This may be added the way Motoragent has done it, on the right side of the website, or as a bar above the tabs on this page.
Queries to the database can be optimized so that the SQL join to the images table uses one field only.
Also, when a filter has the form >= 0 OR = 0, the condition can be reduced to >= 0. This type of optimization can be made, although it would not affect the query’s performance.
