autoscout missing
Sofia University „St. Kliment Ohridski“
Faculty of Mathematics and Informatics
Department of Computing Systems

MASTER DEGREE THESIS
On topic „Halogen: web-based application for car selection”

Graduate: Vihren Kotsev Ganev
Specialty: E-business and e-governance
Faculty number: M24318
Supervisor: Assoc. Prof. Kamen Spasov, Ph.D.
Co-advisor: Dr. - Ing. Elior Vila
Sofia, 2015

Abstract: People have been buying and driving their own cars for years. It all sounds perfect, but there is one general problem with cars: they cost money. People, and Bulgarians in particular, do not understand this problem very well. When they are about to buy a car, they choose it according to how much it costs today, how much the initial repairs will cost, and how it looks. In fact, many more parameters should be taken into account. One thing is certain: the best car for one person's needs is not necessarily the best choice for another. Halogen aims to point out the most suitable models for each individual user request, filtering all models currently on the market and giving additional information on fuel cost, depreciation cost, total cost and much more.

Summary (translated from Bulgarian): For many years now people have been buying and driving their own cars. It all sounds wonderful, but there is one universal problem with cars: they cost money. People, and Bulgarians in particular, do not understand this problem very well, and when they are about to buy a car they choose it depending on how much the car costs today, how much the initial repairs will cost, and whether the car looks good. In fact, there are many more parameters that should be taken into account when choosing a car. There is one thing we can all be sure of. The best car for one person's needs is not necessarily the best choice for another person. Halogen aims to show the most suitable car models for each person individually, filtering all models currently on the market and giving additional information on fuel costs, depreciation, total costs and much more.

Declaration of authorship: I, Vihren Ganev, hereby declare that the following is the result of my own work under the supervision of Assoc. Prof. Kamen Spasov, Ph.D. and the co-advisory of Dr. - Ing. Elior Vila. All sources are cited in the Reference section. All libraries and code snippets that are someone else’s work are cited in the Reference section.

Table of Contents
1. Introduction
1.1. Relevance of the problem and motivation
1.2. Master degree thesis goal and tasks
1.3. Expected benefits from the implementation
1.4. Master degree thesis structure
2. Preview of the current ways people choose car to buy
2.1. Core definitions
2.2. Approaches and methods for solving the problems
2.3. Existing solutions
2.4. Conclusion
3. Technologies, platforms and methodologies used
3.1. Requirements for the tools, place and manner of use
3.2. Choice of tools
3.3. Conclusion
4. Analysis
4.1. Concept
4.2. Functional requirements
4.3. Non-functional requirements
4.4. Business processes
4.5. Conclusion
5. Design
5.1. Main architecture
5.2. Data model
5.3. Diagrams
5.4. User interface
5.5. Additional modules
5.6. Conclusion
6. Realization, testing, integration
6.1. Realization of the modules
6.1.1. Auto insurance pricing module
6.1.2. Citizen liability pricing module
6.1.3. Currency module
6.1.4. Euro NCAP module
6.1.5. Fuel module
6.1.6. Images module
6.1.7. Information module
6.1.8. Mobile module
6.1.9. Models module
6.1.10. Search module
6.1.11. Tasks module
6.1.12. Taxes module
6.1.13. Tires module
6.1.14. Vignette module
6.2. System integration
6.3. Testing
6.4. Experimental integration
7. Conclusion
7.1. Summary of the execution and requisition for original results
7.2. Guidelines for future development and improvement

1. Introduction
1.1. Relevance of the problem and motivation
Statistics show that more new cars are sold each year, both worldwide [1] and in Bulgaria in particular [2]. The same is true for used cars. The trend suggests that people will continue to buy and drive their own cars, and nowadays it is common for every member of a family to own one. As a result, a lot of money is spent on car-related expenses, not only on fuel. A well-known example is the yearly tax for owning a car: owning two cars means paying twice. Many such taxes have to be paid whether the car is driven or not. Attracted by a low price and good looks, people forget to calculate how much the car is really going to cost them.

When it comes to choosing a car, people often take a predefined list of about 20 of the most popular models on the market and pick the one they like most. This is how choosing a car works today. It is discussed every day in forums, TV shows and the press [3]. All ratings that appear in these channels are based primarily on the current price of the car (the lower the better), the fuel consumption and the engine power [4]. Sadly, the advice sometimes includes engine volume and power (the more the better). Since the purpose of owning a car is generally to travel long distances more easily or to carry a lot of luggage, such advice is groundless.

This master degree thesis is designed to work with the characteristics of cars as automobiles rather than cars as toys. The system gives a list of information rather than advice, so that people can apply their own preferences and choose for themselves. The engine and some other characteristics of a car do not describe it as an automobile and should not be part of the search criteria.
A full list of what is and is not part of the search criteria can be found in the following chapters. Various parameters can move a car up or down in the list. It is important to understand that the best car for one person's needs may not be the best car for a friend, since people’s needs and financial situations vary.

The motivation for choosing this topic is that the best cars are not always the most popular ones. The two properties are unrelated, so why should one lead to the other? More importantly, choosing the best car for someone’s needs is a personal decision and should be made by the future owner. Who else can tell how safe the future car should be?

1.2. Master degree thesis goal and tasks
The goal of the master degree thesis is to develop an autonomous web-based system that points out the most suitable car models for anyone who wants to buy a car, based on individual needs and financial situation. This is how choosing a car will work tomorrow.

The following initial tasks have to be completed for the successful realization of this master degree thesis.

To find a database with real car prices. The alternative is to use a depreciation formula and calculate the prices with it. But having a database with real car prices for the local market is priceless.

To find all the data on all car models that is needed for searching or for the calculation formula, and to automate its collection. It would be easy if the system were just a filter for car ads: what car, how much money, what type of engine and so on. Combining pure table information like car width and length with market information like the price is more interesting from a programming point of view and, more importantly, more useful from people’s point of view. Finding this table information may not be easy.
The source has to be relatively up-to-date with new models and has to include information like tire size (used for calculating the price of the tires) and other less common data that cannot be found in every source.

To find prices for all variable repair costs of car models and automate their collection. There seems to be no place on the Internet where people report how much they pay for repairs and how often, together with the car model and all the additional information the system needs. Moreover, even if such information existed for one car model or for all models of one manufacturer, there would still be models with no official information, and that would put them several steps behind their competitors for no obvious reason. Formulas may have to be derived from talks with friends, mechanics and official representatives, so that the system can approximate the variable repair costs of every car model by itself.

To find prices for all variable costs an owner pays other than repairs, and automate their gathering. These are all kinds of taxes and fees:
o Citizen liability (the mandatory third-party liability insurance): official average prices from a large insurance broker that lists prices from several insurance companies.
o Auto insurance: usually this is not calculated but taken from the Schwackeliste. Since only insurance companies have access to it, the prices here are a percentage of the original price of the car.
o Road tax: a formula based on current Bulgarian law.
o Vignette: from the official page of the Road Infrastructure Agency, which is responsible for the price of vignettes in Bulgaria.
o Technical inspection: average prices for the country.
o Yearly servicing: since there is no official information on this point, the formula for calculating the yearly service cost should be based on the age of the car, its price and its class. Cars from the higher classes usually have costly servicing. This point also includes the price of the tires.
Since there are hundreds of tire size combinations, it would be best to have a formula that calculates the average tire price based on real tires from a large Bulgarian online shop.

To find safety data for car models. Only one source can provide the needed information: Euro NCAP. Euro NCAP is the standard for crash tests in Europe. It has separate ratings for adults, children and pedestrians. A fourth rating covers the safety systems available on board. In each of these categories the tested car scores between 0 and 100 %, and it receives a total score of between 0 and 5 stars. This is exactly what the system needs to provide to its users. An integration with the Euro NCAP website, or a crawler for it, should be implemented.

To find images for each car model and model year. Bearing in mind that a car's look changes with the model year, there should be as many images as there are combinations of model and model year in the database. All of them have to be downloaded and resized to proper dimensions.

To build the system: frontend, backend and database. It has to be fast at calculating a score for each car model and sorting by it. In the frontend there should be an easy way of choosing values for the different filter and search fields, using sliders and checkboxes. The results table in the frontend should support searching and sorting on different columns. The backend is responsible for all the important things:
o Collecting information from all sources and keeping it up-to-date in a database.
o All calculations are made here.
o All search queries are constructed here.
o The backend decides when and how to use caching.
o Tests should be made on whether the system searches the database fast enough. If the search is slow, an additional specialized search database should be used for this part of the system.
o The database itself has to be relational, since the information is of this kind.
To test the system with requests from real people with different needs and financial situations. Once the system is ready, it should be tested with requests from friends and other people, given the available filtering and searching fields. This would show what kind of cars they prefer and whether the existing filtering and searching options satisfy them. To test whether people understand the idea behind the system, another test should be made with more conceptual questions like “How do you choose which car to buy?” and “Do you agree with this system’s concept?”

1.3. Expected benefits from the implementation
The current implementation gives people an alternative way of choosing a car. Choosing a car can be seen as a two-step process in which the first step is choosing the model and the second is choosing the specific vehicle. This is where Halogen (the name of the system developed in this master degree thesis) finds its place: people are prompted to take the first step, choosing the model, before choosing the car itself. The system aims to replace what people do manually (searching the web, forums and articles, asking friends and others) with a few clicks. Once people find the website useful, they are expected to spread the word to their friends and colleagues. A solution to a real problem that many people have is worth sharing.

The expected benefit for users is seeing exactly how much a car is going to cost them specifically. The same cost is also shown as a price per month and a price per kilometre. For example, if someone does not drive much and drives only in the city, it may turn out that using taxis is cheaper than owning a car. It is very unlikely that two requests will be the same, so giving a personal list of results is very important. When given a well-informed choice, people will choose what they see as best for their needs. Halogen may give unexpected results to some of the people using the system.
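The per-month and per-kilometre figures described in section 1.3 follow from simple arithmetic over the yearly cost items listed in section 1.2. The following is a minimal Python sketch under stated assumptions: the class `YearlyCosts`, the function `cost_breakdown` and every amount are hypothetical illustrations and are not taken from Halogen itself.

```python
from dataclasses import asdict, dataclass


@dataclass
class YearlyCosts:
    """Hypothetical yearly cost items for one car model, in one currency."""
    depreciation: float
    fuel: float
    citizen_liability: float
    auto_insurance: float
    road_tax: float
    vignette: float
    technical_inspection: float
    servicing_and_tires: float

    def total(self) -> float:
        # Sum every cost item declared on the dataclass.
        return sum(asdict(self).values())


def cost_breakdown(costs: YearlyCosts, km_per_year: float) -> dict:
    """Express the yearly total as a price per month and per kilometre."""
    if km_per_year <= 0:
        raise ValueError("km_per_year must be positive")
    total = costs.total()
    return {
        "per_year": total,
        "per_month": total / 12,
        "per_km": total / km_per_year,
    }


# Illustrative numbers only; they are assumptions, not measured data.
example = YearlyCosts(
    depreciation=1500, fuel=1200, citizen_liability=200, auto_insurance=300,
    road_tax=100, vignette=100, technical_inspection=50, servicing_and_tires=150,
)
breakdown = cost_breakdown(example, km_per_year=12000)
print(breakdown)
```

Comparing the `per_km` value with a local taxi fare per kilometre gives the kind of check described in the taxi example.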
Once the system is working, the answer to how to make it better is simple: make it profitable. Guidelines for future development and improvement, including how to make the system profitable, can be found at the end of the master degree thesis.

1.4. Master degree thesis structure
The master degree thesis begins with an introduction describing the relevance of the problem and the motivation for choosing this topic, followed by a description of the main goal and tasks and the expected benefits of the implementation. The introduction aims to make clear that this is a serious existing problem that many people have, and that this is a possible solution. Interestingly, there is just one known solution to the same problem; it comes from another country and was launched just a few months ago, shortly after work on this master degree thesis started.

The second chapter is a preview of the current ways people choose a car to buy, including the approaches, methods and standards, and the existing solutions. The current ways are described in detail, with real-world examples of their strengths and weaknesses. What would change if people used Halogen is also discussed.

The third chapter covers the technologies, platforms and methodologies used in this master degree thesis. It begins with the requirements for the tools, continues with their types, place and manner of use, and ends with the choice of tools and a conclusion. The chapter includes a brief description of all the chosen tools: languages, databases, frameworks and others used for development. Some non-standard methodologies and decisions get the needed attention here.

The fourth chapter is about concept and analysis. It describes the concept of the system in detail, the functional and non-functional requirements, and the business processes. The concept is described in detail because at first people hardly understand when and why they might use the system.
There is no competing solution on the Bulgarian market yet, which is why people should understand very clearly what the system can help them with.

The fifth chapter is about system design. It describes the main architecture of the system, the data model including diagrams, the user interface, and the additional modules developed. This is not a traditional web-based system. It involves many interesting design decisions, from the way it collects data and keeps it up-to-date to the way it lists results for the end user. Graphics about the database are also included, showing how the search works and why this is the best decision.

The sixth chapter, on realization, testing and integration, describes how and why the modules were written, how they are integrated into a working system, and briefly the testing and the experimental integration in the real world. The chapter also explains how the system is developed so that it can be deployed to different environments easily and without problems. This is not seen very often and is described in detail together with some benefits of developing a system this way.

The master degree thesis ends with a conclusion: a summary of the execution of the initial tasks and the claims for original results, as well as guidelines for future development and improvement. This chapter includes information on how to keep the system on track and make it better and profitable.

2. Preview of the current ways people choose car to buy
2.1. Core definitions
System: An aggregation of tools and methods that is able to provide some information to a person or to another system.
Web browser: A program used to surf the Internet.
Web-based system: A system that can be opened from a web browser.
Autonomous web-based system: A web-based system that needs neither human nor non-human interaction to do its everyday job without interruption, i.e. a system that “just works”.
Car (1): A transportation vehicle classified as a category B vehicle under European Union directive 2006/126/EC.
Car (2): A wheeled, self-powered motor vehicle used for transportation.
Financial situation: An indicator of whether the person values the time he sets aside for the car’s needs and, if yes, at what price.
Schwackeliste: The reference list that insurance companies use to decide what insurance premium to offer to the owner of a car. The list includes all car manufacturers, models, modifications and extras. For each of them it states the price at which the car should be insured, given its condition and age. That price is usually multiplied by a factor between 0 and 1, and the result is offered to the client as the insurance premium for the car.
Euro NCAP: Euro NCAP means “European New Car Assessment Programme”. Euro NCAP provides both drivers and the automotive industry with a realistic and independent assessment of the safety of new cars. This is a well-known but very rarely used criterion for buying a new car in Bulgaria. It is commonly used in western European countries.
Vignette: A tax paid for driving on extra-urban roads. It is paid weekly, monthly or yearly.
Virtual machine: An isolated, fully functional operating system that can be edited, moved and managed directly from another operating system, virtual or not. A virtual machine has its own hard drive, which holds the operating system and on which any software can be installed.
Caching: A mechanism that provides faster data access, most commonly by storing the data in RAM (the cache). Without a cache, getting the needed data may require slow or variable-speed operations. Usually the data is saved in the cache after it is retrieved for the first time, and every subsequent request for the same data is redirected to the cache and returns the data immediately.
Halogen: The name of the system developed in this master degree thesis.

2.2. Approaches and methods for solving the problems
The problem is that people buy cars that are not the best for their needs. At the moment, people do manual work when choosing the best model for themselves: searching the web, reading forums and articles, asking friends. All of this should be done with a few clicks. People are smart; they make their choice this way not because it has always been done so, but because they do not know there is an alternative. The suggested way of choosing the most suitable car for a person’s needs is to proceed in steps:
2.2.1. Decide whether the car should be large, whether it should have a big trunk, and whether and how safe it should be. Include as many specifications as possible that describe the car as an automobile.
2.2.2. Find the values of these specifications for all cars on the market today and keep only the cars that satisfy the previous step, resulting in a list of cars matching the criteria. This may cost a lot of time and effort. Often the searching cannot be done in less than 24 hours.
2.2.3. Now that there is a list of cars matching the preferences, calculate how much every aspect of each car would cost, including fuel, depreciation, time and other costs. This adds about 10 columns to the initial table. Finding all this information also takes a lot of time.
2.2.4. Do the math and sort the results by cost in ascending order. The first row is the most suitable car for this person’s needs.
Halogen replaces every step from 2.2.2 onwards, so the person only needs to decide what kind of car would match his needs. This method removes the human factor, and with it the chance of errors caused by lack of information or personal opinion.
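Steps 2.2.2 to 2.2.4 amount to a filter followed by a sort on cost. The sketch below illustrates that idea in Python; it is not Halogen's implementation. The record type `CarModel`, the function `shortlist`, the placeholder model names and all values are assumptions made for illustration, and the yearly cost from step 2.2.3 is assumed to be computed already.

```python
from typing import List, NamedTuple


class CarModel(NamedTuple):
    """Hypothetical record; Halogen's real data model is not shown here."""
    name: str
    seats: int
    trunk_litres: int
    ncap_stars: int
    yearly_cost: float  # total yearly cost, assumed to be computed already


def shortlist(models: List[CarModel], min_seats: int,
              min_trunk_litres: int, min_ncap_stars: int) -> List[CarModel]:
    # Step 2.2.2: keep only the models that satisfy every criterion.
    matching = [
        m for m in models
        if m.seats >= min_seats
        and m.trunk_litres >= min_trunk_litres
        and m.ncap_stars >= min_ncap_stars
    ]
    # Step 2.2.4: sort by cost in ascending order, cheapest to own first.
    return sorted(matching, key=lambda m: m.yearly_cost)


# Placeholder catalogue with made-up values.
catalogue = [
    CarModel("Model A", seats=5, trunk_litres=380, ncap_stars=5, yearly_cost=4200.0),
    CarModel("Model B", seats=5, trunk_litres=300, ncap_stars=4, yearly_cost=3100.0),
    CarModel("Model C", seats=7, trunk_litres=500, ncap_stars=5, yearly_cost=3900.0),
    CarModel("Model D", seats=4, trunk_litres=250, ncap_stars=3, yearly_cost=2500.0),
]
result = shortlist(catalogue, min_seats=5, min_trunk_litres=350, min_ncap_stars=5)
print([m.name for m in result])
```

Note that the cheapest model overall ("Model D") does not appear in the result because it fails the criteria; the filter is applied before cost is considered.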
Bulgaria is a small country, and the forums and topics on the Internet are all the same, so when people search for opinions on a car they all see the same opinions, which do not change with time. A car is classified as either good or bad on the market without much difficulty. The real work to be done here is convincing people that this is how they should choose a car. Convincing them is a slow process and has to be backed by results.

2.3. Existing solutions
There is just one known solution that lists car models matching the user’s criteria for a car as an automobile. The project was born in Germany and is named Motoragent (motoragent.de) [5]. Its interface is in German only and it targets the German market. Germany has one of the biggest car industries in the world. People there usually buy new cars and sell them up to 5 years later, mainly because of increased taxation costs.

As stated on its page, Motoragent is developed in partnership with mobile.de [6], autoscout24.de [7] and spritmonitor.de [8]. All these websites are well known in Germany, and in Bulgaria too among people interested in the worldwide car market.

Mobile.de is Germany’s leading marketplace for vehicle sales. The offers and prices of cars in Motoragent use the ads from mobile.de.

Autoscout24.de is one of the leading car portals in Europe. It is a platform for trading cars similar to mobile.de and also supplies Motoragent with information about the car ads published on the autoscout24.de website.

Spritmonitor.de provides fuel consumption and cost analysis for real people’s cars. The values from spritmonitor.de form the consumption data in Motoragent.

Let us take a deeper look into Motoragent’s functionality. The website begins with a brief description of what Motoragent is and prompts the visitor to continue to the main section. The main section has several tabs, as follows:
2.3.1. The first tab is about the car the visitor currently drives. The manufacturer, model and model year can be selected. This question and all following questions can also be skipped. The answer does not change the results at the end and may serve only to collect user information that is not directly connected to the purpose of the website.
2.3.2. The second tab lets the user choose the minimum number of seats the car should have, and the minimum number of seats for children. The interface here is a little annoying, since the user is choosing a minimum but the interface looks as if he is choosing a maximum. The website does not have full information on all cars (which can be seen at the end, when the results are shown), and if something is selected here or in the next tabs, all cars missing that information are removed from the list.
2.3.3. The third question is about the number of kilometres the visitor travels each year, set with a slider between 0 and 50 000. A taxi driver, or anyone who drives a lot, cannot use the website properly because more than 50 000 kilometres per year cannot be selected. On the same page there is a second slider for whether the visitor drives more urban or extra-urban. This slider is of limited use since it does not indicate percentages, only a position the user chooses. Interestingly, if the user just clicks on it, the results shown on the right get reshuffled without the slider position changing.
2.3.4. The fourth question is whether the visitor prefers sport or economy driving. Again there is a slider with no clear definition of percentages or states, just sport on one side and economy on the other.
2.3.5. The fifth question is about the trunk, but not about the trunk size in real numbers. As in the previous question, there is a slider with “less” on one side and “more” on the other.
As if people could determine how much “more” space they need when no numbers are shown anywhere.
2.3.6. The sixth question is about the purchase price of the car. There is a slider for choosing the minimum and maximum price the visitor wants to pay. It would be interesting to know whether anyone ever sets a minimum amount of money to spend. It is a good question overall, but what if there is a very good car that costs just a few euro more than the limit and would save the visitor thousands of euro per year?
2.3.7. The seventh question is how old the car can be. Again there is a range slider, from 0 to 20 years, as if someone would not want to drive a new car.
2.3.8. The eighth question, and the questionnaire is still far from finished: what is the maximum number of kilometres the car may have been driven? It is now clear that Motoragent is more a chooser of specific vehicles than a model chooser. When choosing a specific vehicle, it matters how much the car costs today and how many kilometres it has been driven.
2.3.9. The ninth question is to select specific car shapes. There is a list of eight car shapes like van, coupe and others, from which the shapes the visitor does not like can be filtered out. There is one interesting behaviour: if “with sliding doors” is selected, only the cars with sliding doors remain in the results list, no matter what other options are selected. The other options can be combined and return all results matching any of the selected shapes, but this does not apply to that option.
2.3.10. The tenth tab, named “About me”, asks for information about the person. The first question is when the user wants to buy his next car. The maximum value that can be selected is 24 months. Perhaps people in Germany buy a new car at least every two years, which is nice. This does not apply to people in Bulgaria, not even close. The second question is gender and the third is age group, such as 17-19, 20-29 and others.
2.3.11. Finish!
Here specific brands can be chosen so that only cars from those brands are listed. The good news is that the list of cars matching all our preferences is here. It is great to see some images after answering about fifteen questions. A few seconds later some strange results are noticed. All cars have percentages. The first car is not at 100% and the next ones are even lower. The question is what the 100% base is that no car has reached. With the selected preferences, the list contains some cars more than once, for example two Renault Clio from the same model years, one a hatchback and the other a station wagon, even though station wagons were not checked on the ninth tab. On the results page there are also options to compare cars and to filter by transmission, weight and drive type. This is all there is. Motoragent has a very nice-looking interface, and with its partnership with all these popular websites it will surely help people choose the best car for their needs.

Now let us continue with the substitutes. Although they follow a different idea and cover only part of the picture, websites for directly choosing a specific vehicle, like mobile.bg, can be regarded as competitors or existing solutions. Users need attention and have to be convinced that there is a reason to use this system; they may well refuse and continue to use their regular websites for directly choosing a specific vehicle.

As said before, the current ways people choose car models are asking friends and searching the Internet for opinions. To be honest, nothing can beat a friend's word, so this step cannot be skipped anyway. Searching the Internet is a good thing if it is for finding information, not advice. In recent years some official car manufacturer forums and fan forums in Bulgaria were bought by a large media group, which turned them into a place to earn money from, and nowadays these forums have nothing in common with what they ought to be.
Experienced people say that Internet advice is the worst they have seen and may be more catastrophic than anything else. Once people see the real results of using the proposed system, they are expected to start trusting it and to use it more often.

2.4. Conclusion
The current ways people choose a car to buy have their advantages as well as disadvantages. People have been choosing cars by doing a lot of manual work themselves, and it is time to end this status quo. All the work a person has to do and all the research he has to make is automated by the system proposed in this master degree thesis. People should just use this system and choose the car that best matches their needs.

3. Technologies, platforms and methodologies used
3.1. Requirements for the tools, place and manner of use
The requirements for the tools are divided into hardware and software requirements. The master degree thesis includes the proposed system as well as a configured virtual machine with defined hardware specifications and with all the software needed to run the system installed. Using a virtual machine makes it easier to run and modify the system on different computers, regardless of the host operating system and settings.

Starting with the software requirements. The system includes different modules, some of which may be written in different programming languages. The system itself is web-based, so the language it is written in has to handle the web well. The chosen language should be easy to use with or without templates. An MVC framework is the best choice [9] [10]. The language and the framework should both have large online communities, so that help is available if needed. The language has to be object-oriented. For faster development, the programming language should be interpreted rather than compiled by default. The framework should be usable both from the web and from the command line, for running cron jobs.
The framework should also make it easy to install it, rewrite URLs with custom ones, connect to a database, manage configurations, and extend it.

For the frontend, a well-known JavaScript framework with a built-in selector engine should be used. The frontend framework should also be extendable with third-party libraries. A library for a table view with searching and sorting by custom values should exist. The framework should support older browsers like IE8.

The database has to be a well-known SQL database and has to support functional indexes.

The hardware requirements are based on the virtual machine on which the project is developed and on how it can be extended. Since the system scales linearly, the only requirement is a Linux-based OS.

The system needs a caching mechanism so that data requested more than once is returned faster. The searches in the system are not based on any personal or confidential data, so the cached results of one user’s search can be reused for the next users without any issues. The caching should be memory-based.

Root access to the operating system is needed in order to install the tools, extensions, modules, libraries and everything else the system needs to work properly.

All technologies and everything related to this project have to be free to use for personal and non-personal projects. Open source technologies are preferred.

3.2. Choice of tools
The first and most important choice is the programming language. There are not many languages to choose between: Perl, PHP, Python and Ruby. The requirements described above lead to PHP in the first place. PHP is a server-side scripting language designed for web development. PHP code can be simply mixed with HTML code, or it can be used in combination with various template engines and web frameworks.
PHP code is usually processed by a PHP interpreter, which is usually implemented as a web server's native module or a Common Gateway Interface (CGI) executable. After the PHP code is interpreted and executed, the web server sends the resulting output to its client, usually as part of the generated web page. PHP is the most popular language for web development [11]. The Apache server serves PHP for the web and can be used with the modules needed for the website to work [12].

Choosing a PHP framework is a tough decision. Plenty of new frameworks appear every year, each claiming to be the fastest or the smartest. The full list of compared frameworks is: CakePHP, CodeIgniter, FuelPHP, Kohana, Laravel, Phalcon, Symfony, Yii and Zend Framework. Laravel and CodeIgniter are the two most popular frameworks matching the requirements. Although it advertises itself as a new-generation framework, Laravel lacks some features of the older frameworks that it should have kept. Instead of making installation easy, it makes it complex by requiring additional tools and by not providing a default routing policy. In contrast, CodeIgniter is just “download and run”. It works by default with no additional settings and has default policies for everything that is needed. Less code makes the application easy to understand and maintain [13]. CodeIgniter is the framework of choice for this project.

The frontend uses the jQuery JavaScript library, which itself uses Sizzle. Sizzle is the most popular, fast and correctly working JavaScript selector library [14]. jQuery’s community is still growing rapidly and is already the largest among its competitors. There is one very good table plug-in for jQuery called DataTables. DataTables is a highly flexible tool, based upon the foundations of progressive enhancement, which adds advanced interaction controls to any HTML table. It supports all the needed functionalities and more [15].

There are many SQL databases, but most of them drop out as soon as support for functional indexes is required [16]. According to statistics, the fastest well-known database that matches the requirements is PostgreSQL. PostgreSQL is also known as the world's most advanced open source database and suits the current project well.

The virtual machine runs Debian 7 Wheezy, which is a Linux-based operating system [17]. It has 2 CPUs, 1GB of RAM and 8GB of HDD assigned. These are the recommended values; the whole system can run on far slower and cheaper hardware (or a virtual machine) with 1 CPU, 256MB of RAM and 4GB of HDD. The system is tested and works normally with the listed minimum requirements on both a virtual machine and an old PC from 2005. Special thanks to Philip Balinov, DevOps Engineer at Komfo, for helping install and configure the virtual machine.

Memcached [18] is a very good caching mechanism which stores the data in RAM. PHP and CodeIgniter work well with Memcached. It can be controlled directly from PHP, which allows changing limits and settings when necessary and accessing data fast.

Several PHP modules also need to be installed. In alphabetical order: php-pear, php5-cli, php5-common, php5-curl, php5-dev, php5-gd, php5-mcrypt, php5-memcached, php5-pgsql and php5-xmlrpc. The system also makes use of Zend Opcache, which improves performance by storing precompiled bytecode in shared memory [19].

All technologies (PHP, Apache, CodeIgniter, jQuery, DataTables, PostgreSQL, Debian, Memcached and Zend Opcache) are free to use. Most of them are also open source.

3.3. Conclusion

The choice of tools is not an easy task and should be taken very seriously. All technologies should be chosen with care and with future ease of use and expansion in mind. Choosing PHP as the core language makes all requests to the system independent, so the failure of one request does not affect the others in any way.
In this respect, the system becomes more stable. At the end of this master degree thesis, Halogen will be complex but very easy to run thanks to the preconfigured virtual machine. All the administrator has to do is start the machine and type the URL address of the system into his own browser. It is that easy.

4. Analysis

4.1. Concept

This is a first-of-its-kind system developed in Bulgaria. Only one known system does a similar job; it is developed and actively used in Germany. The concept of the system is that people should own and drive cars that match their needs and, at the same time, cost them the least amount of money. As simple as that.

The truth is that young people choose cars differently. Many of them have no financial education. Generally they do not think about depreciation or taxes. Young people buy cars for fun. There is no way to make them think of depreciation, taxes, repairs or the price of tires. It’s all fun, right? The other group consists of people who buy a car because they need one, and these are the people who would find this system useful. They know what a car is for and buy one with that in mind. Of course, there can be hundreds of little specifics that decide whether a person buys a car or not, such as “the way the handle above the side windows closes when being released”, and even if the developer of a system wanted to include them all, this borders on the impossible.

The most important filters when using a car for transportation are not many:

4.1.1. Length and width – to have more space in the car or to park easily. The length of the car is usually related to the car’s segment, while the width is related to the car’s class.

4.1.2. Number of seats and volume of the trunk – to drive the whole family and have space for a lot of luggage. If the person intends to drive three little kids and an adult at the same time, he may need a car with 7+ seats. The same applies to the trunk.
The more people there are in the car, the more space it should have for luggage. If the car is for urban driving only, the trunk size may not matter at all. These numbers give an idea of what kind of car the person is looking for.

In today’s world, there is one more thing that is extremely important and often neglected. It is even more important for new car owners. It is called “Safety”. Halogen also gives the user the opportunity to choose the level of safety of the car he wants to drive, if he cares about safety at all.

After the filters comes the sorting of the results and the calculation of how much money each car model would need for the whole period of driving. This is where the user enters the number of kilometres of urban and extra-urban driving per year, as well as the total number of years he thinks he will drive the car.

The last tab also covers something often neglected. Is owning a cheap car which breaks down often the same as owning a more expensive one which does not? People who have bought a car for fun will say that the expensive car depreciates, so it is better to own the cheap one. However, the time it takes the owner to go and fix his car often costs more than the repair itself. The calculation is easy. If my boss pays me 20 BGN per hour and it takes me 2 hours to go for a 30 BGN repair, the repair costs me 70 BGN (30 BGN for the repair plus 2 × 20 BGN of lost time), not 30. A lot of people do not count their personal time as lost benefits, so the system has a slider starting from 0 that represents how much money (in BGN) per hour the person values his own time spent on the car’s needs.

This is all the information the system needs, and the next step is to find the best models for the person. The system search is based on these 4 criteria: length, width, number of seats and trunk size, plus 4 criteria for safety: protection of adults, children and pedestrians, and safety systems.
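The own-time calculation described above can be written down in a few lines. This is only a minimal sketch under stated assumptions: the function name effective_repair_cost and its parameters are hypothetical and do not come from the Halogen source code; the numbers are the ones used in the text.

```php
<?php
// Hypothetical helper (not from the Halogen code base): the effective cost of
// a repair once the owner's time is priced in.
// $hourlyValue corresponds to the "own time" slider in BGN per hour.
function effective_repair_cost($repairPrice, $hoursSpent, $hourlyValue)
{
    return $repairPrice + $hoursSpent * $hourlyValue;
}

// The example from the text: a 30 BGN repair that takes 2 hours at 20 BGN/hour.
echo effective_repair_cost(30, 2, 20), " BGN\n"; // 70 BGN

// Slider left at 0: personal time is not counted, only the repair price remains.
echo effective_repair_cost(30, 2, 0), " BGN\n";  // 30 BGN
```

With the slider at 0 the formula reduces to the plain repair price, which matches the behaviour the text describes for users who do not value their own time.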
The system sorting is based on 3 predictions made by the person himself (distance travelled per year in urban and in extra-urban driving, and car usage in years) and 1 more number: the value of his own time in BGN per hour. It is surprising how much information one can get about what car to buy by moving a total of 12 sliders, together with the predicted monthly and yearly costs, the initial and end value of the car, the repair prices and more. Collecting this information manually takes days and, depending on the length of the list, in some cases even weeks.

The concept is this: make it easy for the user, show the maximum amount of confirmed information he is interested in, make close predictions for the rest and show summarized, detailed results, while letting him take the final decision by himself.

4.2. Functional requirements

The functional requirements for the system are based on talks with potential users and on their requests. The system has to provide a way to filter and sort the results for car models. In order to achieve this, the system has to have information on all car models currently on the market. The system has to provide an easy way for the user to select values for all filtering and sorting fields. The system has to show a table of results with different columns and an option to sort by every meaningful column: car age, fuel cost for the whole period, depreciation price, maintenance price, time spent and total cost.

This table should be aggregated and should not show the same car model many times. A model should be shown a maximum of 3 times, once for each matching engine type. If the user wants to learn more about all combinations of the same car model and engine type, he should be able to click on a link and see them all.
This is enforced because broad search parameters would return all model years of a car, and if 4 cars are listed with 25 years each, this is too much to scroll through and gives no valuable information at all.

4.3. Non-functional requirements

Let us separate the non-functional requirements on the basis of the FURPS+ model. The FURPS+ model consists of functionality, usability, reliability, performance, supportability and additional requirements.

The usability requirements are all based on the ways people use the system and how they perceive it. It is important for the system to be easy to use, with no major difficulties for the users. The interface should feel straightforward. In order to keep its user base growing, the system needs both to attract new users and to keep the old ones. This is the reason for the website to put the stress on content and functionality. Since there is not much text (5-10 lines) and it is mainly what the user should read, it has to use a large font. The sliders for filtering and sorting have to be relatively large and easy to move, either by clicking on the slider or by dragging the handle. The system should consist of a starting page with a short description of the system and a header image, one filtering and sorting page with several tabs, and one last page for the results with a table and additional information. The last page, with the results table, should include images for all cars. It is much easier for the users to orient themselves if an image accompanies the car model.

The reliability requirements include information about the time the system is up (uptime). It is important to have a quick way to recover after a full system collapse. The system should be able to recover within 15 minutes after a full system collapse. This master degree thesis includes a preconfigured virtual machine, so a recovery only requires importing a database backup.
With the database imported, the system can start working like a brand new one, from the beginning. What needs to be done afterwards is to start a task for collecting the car images that are missing. The system is designed in such a way that missing images are not critical for its proper operation and for information retrieval.

In terms of performance, the response time is the most important topic. Other important ones are the time to restart and the time to recover. The system uses Memcached caching and gzip file compression and serves the information with minimum latency and size.

The system and the database should be as fast as possible when retrieving information. One query makes multiple calculations and database joins, and if the result set is big, the query may execute slowly. In order to be useful enough, a web-based system needs to have a response time of 0.25 seconds at most (250 milliseconds) [20]. Halogen has to be faster: in 99% of all cases it has to return information from the database within 0.025 seconds, for a total page load time of 0.20 seconds, which is 20% faster than the recommended maximum time. The other 1% of the cases are broader searches returning many results. Handling many results is relatively slow because they have to be processed and sorted at the end, which takes time. These cases are not useful in general, but some people want to have all available cars listed for a reason. Maybe they are looking for the cheapest car of all, no matter what. All times apply to requests from localhost; additional latency may exist when requesting over a slow or very distant network.

There are also a lot of cronjobs running independently. They take more time, but since this has no effect on the end user they are not included in the calculations above. Each cronjob has to run for less than 5 minutes, no matter the action. The time to restart the machine has to be less than 30 seconds.
This includes shutdown, boot, kernel selection plus waiting, starting all services and successfully handling the first request.

Depending on the hardware, the system can handle a different number of concurrent connections. With the recommended hardware of 2 CPUs and 1GB of RAM, the system is able to handle up to 5 concurrent connections, each of which returns a result within the desired response time. If this setup is not enough, the system can easily scale to multiple machines. The code can also work with a remote database and the system can be load balanced; neither requires additional modification. When tested with 100 concurrent connections, Halogen works but returns the results after a lot of waiting.

Supportability requirements include testing, compatibility, configuring and logging. When a new module is added to the system, it is tested with unit tests, automated tests and black box tests by users. When the unit and automated tests pass, the module is deployed on the production server for a certain percentage of the visitors only. The percentage is increased if everything seems right. In this way the impact of a problem is minimized, because it affects only that small percentage of users. The percentage is increased in steps as follows: 10%, 20%, 50% and 100%. The new module should be compatible with the current state of the system: it should be integrated into the system and be able to use the same database. The new module should be developed as abstractly as possible so that it can be integrated with other future projects or be open-sourced.

Halogen should be configurable in terms of limits, database settings, logging and error reporting, routing and others. There should be a way to turn off a module so that all related functionalities are turned off, too. This should not affect the rest of the system in any way. It is important for the operating system to have log rotation, backups and cronjobs.
It should also support traffic filtering in case serious issues are detected. The system is compatible with external systems by using their APIs. Such systems are mobile.bg, fuelo.net and ajax.googleapis.com. The system also connects to other websites which do not have APIs in order to collect information. These are: api.bg, automedia.investor.bg, euroncap.com, bnb.bg and sdi.bg.

There are no additional requirements to the system because the system is not developed for a client.

4.4. Business processes

The business processes are: Design, Development, Testing, Deployment.

4.4.1. Design

Graphical design. This process includes the whole concept and the graphical design. The designer makes the design intuitive and easy to use. Usually design is the leading factor for customer satisfaction. In the proposed system, design and functionality share the load 50/50: both are equally important for customer satisfaction and for the success of the website. The main indicator for the success of the process is customer satisfaction. Design covers the whole vision of the web system, which has to be intuitive and easy to use. The main indicator for success here is how easily users find their way around the website’s structure and whether they manage to use the website with ease. The design process also includes choosing a colour scheme as well as coding all website pages. It is important for the colours to match the users’ expectations and not to annoy them. Customer satisfaction is again the main indicator for the success of this process.

Database design. The database design is part of the design. This process defines the content and structure of the database. The main indicator for the success of the process is the speed and efficiency of retrieving information from the database. This process is executed by the developers. First, the type of information to be stored in the database is defined, together with the structure of the tables: data types, keys and relations.

4.4.2. Development

This process includes the actual development of the web site.
The main indicator for success is a working website with no bugs and no system collapses. All pages are developed within this process. All functionalities are also developed here, including:
• Functionality for getting citizen liability prices from the website sdi.bg, a well-known Bulgarian car insurance broker
• Functionality for getting the USD / BGN rate from bnb.bg, because some car prices are listed in USD and should be converted
• Functionality for getting information from euroncap.com, which is the standard for crash tests in Europe
• Functionality for synchronizing fuel prices from fuelo.net: prices of gasoline, diesel and LPG
• Functionality for downloading and resizing images from ajax.googleapis.com for all cars and model years in the system
• Functionality for custom fixing of information holes. Sometimes the information for a specific model year is missing. This information can be derived from the previous and next years combined
• Functionality for getting car data from mobile.bg via their API
• Functionality for getting car model stock details from automedia.investor.bg, useful for filtering in the search
• Functionality for calculating car tax based on the texts of Bulgarian law
• Functionality for calculating the price of tires for each car, based on research on tire prices from a large Bulgarian website for selling tires
• Functionality for getting the vignette price from the official website of the Road Infrastructure Agency, api.bg
• Functionality for calculating the auto insurance premium, with a formula based on research from the website of sdi.bg
• Search functionality for filtering and sorting by user request data
• Functionality to run cronjobs
• Queue functionality for executing cronjob tasks from a table in the database
• Functionality for activating new “information” tables generated and filled by scripts

4.4.3. Testing

Once a module is fully developed, it is tested, usually with unit tests and, more often, by being released incrementally to a small share of the visitors.
During and after the execution of the test scenarios, the actual results are checked against the expected ones. All errors are saved in a bug tracking system. If errors have occurred not because of bugs but because of other events, these errors are also recorded in a document and all steps to prevent their future occurrence are taken. Passing all tests means the system or module is ready to be deployed in exactly the state in which it passed the tests.

4.4.4. Deployment

The system can be deployed relatively easily. One way to do this is to take the hard drive of the virtual machine, which is preinstalled with everything needed, and upload it directly with the source code in it. If the hosting provider does not offer such functionality, the source code of the system or module should be uploaded manually. There is an existing git repository. A post-commit hook is attached on the client, and after each commit the database is exported to a .sql file and automatically added to the commit. On the server there is a post-receive hook which is activated after each push to the bare repository. If the push message contains the word “deploy”, the code is automatically deployed after the push.

4.5. Conclusion

The analysis of the system shows that the initial target group consists primarily of people with families, stable incomes and serious jobs. Although the system looks pretty simple, with just two meaningful pages (search criteria and results), the development section is full of functionalities that have to be available starting from the initial version of the system. The system is difficult to build with so many functionalities, but once developed it should fit perfectly in the current market.

5. Design

5.1. Main architecture

The architecture of the application is shown in Fig. 1. The website has one entry point which can be invoked from two different places in two different ways: either by opening the website or by running a cronjob.
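A single entry point has to tell the two invocation paths apart. The sketch below is an illustration only and is not taken from the project's index.php: the helper name may_run_cronjob and the fallback value "production" are assumptions. It relies on PHP's built-in PHP_SAPI constant to detect a command-line run and on the PROJECT_ENV environment variable that the thesis names, and it encodes the rule stated in this section that cronjobs may be started from a browser only in the "development" environment.

```php
<?php
// Hypothetical helper (not from the Halogen code base): a cronjob may run when
// the script is started from the command line, or from the web only while the
// environment is "development".
function may_run_cronjob($sapi, $environment)
{
    return $sapi === 'cli' || $environment === 'development';
}

// PROJECT_ENV is the variable named in the thesis; falling back to
// "production" when it is not set is an assumption made for this sketch.
$environment = getenv('PROJECT_ENV') !== false ? getenv('PROJECT_ENV') : 'production';

// PHP_SAPI is "cli" for command-line runs and e.g. "apache2handler" under Apache.
var_dump(may_run_cronjob(PHP_SAPI, $environment));
```

Keeping the decision in one small function makes it easy to test without a web server, since both inputs are passed in explicitly.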
The website uses the MVC (Model – View – Controller) architecture pattern where:
• Models make database queries and return results to the controllers
• Controllers make the connection between the models, libraries and views
• Views are used to generate the frontend code from HTML pages and PHP variables

The website uses CodeIgniter, a powerful PHP framework with a very small footprint, built for developers who need a simple and elegant toolkit to create full-featured web applications.

The entry point (index.php) is moved to a directory /www which is on the same level as the other directories of the project: /application, /database and /system. This makes it impossible to access important backend files directly via the web browser, because only the content of the /www directory can be accessed directly through HTTP. Apart from index.php, the /www directory also contains the directories for storing all CSS and JavaScript files, /www/css and /www/js respectively. Two more directories are situated in /www. The first is /www/images, for storing the images used in the website: the header, the arrows for the results table and others. The second is /www/cars, in which all car images are stored.

Fig. 1. Screenshot of the tree hierarchy of the code

The first thing index.php does is read the PROJECT_ENV environment variable, which is set on the server, and import it as the ENVIRONMENT constant. The constant is used in several places in the project. For example, cronjobs cannot be run from the web browser, but if the environment is “development” they can, for ease of use.

Then the framework loads its core files automatically from /system and continues to the /application directory. This is the directory where the developer does his job. The /application directory consists of:
• /application/config – a directory with all custom configurations separated in files.
The most important ones here are:
o /application/config/autoload.php – here we point out which modules, libraries and others are autoloaded by default. In this case, these are the “database” and “session” libraries.
o /application/config/config.php – lots of configurations that do not belong in any other file
o /application/config/constants.php – defines constants visible in the whole project
o /application/config/database.php – makes the connection to a database. It chooses the database configuration whose name matches the name of the environment
o /application/config/routes.php – defines custom routes that do not match the default rule. Such a route is used for the tasks queue.
• /application/controllers – a list of controllers from the MVC pattern
• /application/core – a directory with commonly used files, such as a controller inherited by other controllers, a model inherited by other models and others
• /application/libraries – a directory holding all libraries used in the project. A library is a set of methods that can be reused in other projects and has no hard dependencies on this project.
• /application/logs – a directory for logging the errors of the code execution
• /application/models – a list of models from the MVC pattern
• /application/views – a list of views from the MVC pattern

The last top-level directory, /database, consists of two files which cannot be accessed from the web directly:
• a shell script executed as a post-commit git hook, which connects to the database, exports and gzips it, then copies the exported .sql file to the /database directory and deletes it from the server. Lastly, it adds the file to the commit
• the database.sql file itself

The CodeIgniter framework offers an advanced model for loading files, which is one of the reasons for choosing the framework.
Unless told otherwise, the framework loads the controller named by the first part of the URI and calls the method of this controller named by the second part of the URI. In this case, if the address the user opens is “http://halogen.bg/search/model”, the framework opens the “search” controller and executes the “model” method. The default controller name is defined in /application/config/routes.php with $route['default_controller'], in this case “home”. The default action name is “index”. Knowing this, when just http://halogen.bg/ is opened, the framework opens the “home” controller and executes the “index” method. The first and second URI segments are the controller and method names. If a third and further segments are provided, they are passed as parameters to the corresponding method. This is very easy to use and handle.

Usually, what needs to be done for a functionality to become a reality is to create a controller, load a model and get the needed data, then load a view and pass it the data to show. Controllers and models usually extend CodeIgniter’s default base classes. It is good practice to add one more level of abstraction, and the project does so. This level defines some useful methods and checks whether access to the requested resource is allowed. In /application/core, there are the files MY_Controller.php and MY_Model.php, which are both used instead of CodeIgniter’s default ones. One more file here is MY_Cronjob_Controller, which is extended by the classes run from the command-line interface (CLI).

All database requests use CodeIgniter’s Active Record class, which is not Active Record in the real meaning of the words. It lets queries be run in an easy way through a database object, no matter the database type.
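To make the URI convention described above concrete, here is a minimal sketch of the mapping from a request path to a controller, a method and its parameters. It is a simplified illustration and not CodeIgniter's actual router: the function resolve_route is hypothetical, and custom routes from routes.php are ignored. The defaults "home" and "index" are the ones named in the text.

```php
<?php
// Hypothetical helper (not part of CodeIgniter): mimics the default convention
// "first segment = controller, second segment = method, rest = parameters".
function resolve_route($path, $defaultController = 'home', $defaultMethod = 'index')
{
    // Split "/search/model/7" into its non-empty segments.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));

    return array(
        'controller' => isset($segments[0]) ? $segments[0] : $defaultController,
        'method'     => isset($segments[1]) ? $segments[1] : $defaultMethod,
        'params'     => array_slice($segments, 2),
    );
}

// http://halogen.bg/search/model -> controller "search", method "model"
print_r(resolve_route('/search/model'));

// http://halogen.bg/ -> controller "home", method "index"
print_r(resolve_route('/'));
```

Any segments after the second one end up in 'params', which corresponds to the arguments the framework passes to the controller method.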
Here is an example of running an SQL query using CodeIgniter’s Active Record class:

    $this->db->select('*');
    $this->db->from('tasks');
    $this->db->where('status', $key);
    $query = $this->db->get();
    $task = $this->get_one($query);

There is a .htaccess file in the /www directory that helps make the URLs pretty by redirecting all requests to the website through index.php. Once in index.php, the system knows how to handle the request correctly. The content of the .htaccess file can also be put in the virtual host of the Apache server (which is the right place for it), but for simplicity it is added to the project directly in the .htaccess file:

    RewriteEngine on
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule .* index.php/$0 [PT,L]

5.2. Data model

The data model changed several times during the development of the master degree thesis. Data transformations are commonly used in the project. The database consists of the following tables:
• “ci_sessions” – a table for storing session data managed by the CodeIgniter framework
• “euroncap” – a table for storing data from crawling the website euroncap.com. The information here is about every car manufacturer, model, production year, link to the page on Euro NCAP’s website and the four grades the car has received, each between 0 and 100. 0 means the information is missing
• “images” – a table for storing hashed image names associated with a specific car model year. The table stores the car manufacturer, model and year, the URL address from which the image is downloaded and the image key. The image key is the hashed image name plus the file extension. The image key is the name of the image in the /www/cars directory
• “information” – tables automatically generated by the system with a serial number at the end. These are the main tables where all the information goes in the end. The information is merged into them via PHP scripts. Such a table includes all the information for filtering and sorting.
Its columns are:
o car manufacturer (“car_maker”), model (“car_model”), age in years (“car_years”) and fuel type (“car_fuel”) – these are shown to the user and say which car to buy, at exactly what age and with what type of fuel
o price (“car_price”) – the price of the car is important and is used further to generate the depreciation
o kilometres (“car_km”) – for statistical usage only at this time. It is interesting to see how cars between 5 and 20 years old all have the same travelled distance
o car width (“width”) and length (“length”) – used for filtering
o number of seats (“seats”) and trunk size (“trunk”) – again for filtering
o the four Euro NCAP grades (“euroncap_adult”, “euroncap_child”, “euroncap_pedestrian”, “euroncap_safety”) – used for filtering purposes
o fuel consumption urban (“fuel_city”) and extra-urban (“fuel_highway”) – these two are used for sorting. The numbers are later multiplied by the cost of the corresponding fuel type
o initial money (“initial_costs”) – the initial price of the car plus all costs (registration, repairs, taxes and others) before the person can drive his car freely without worries
o yearly money (“yearly_costs”) – the price for repairs, taxes and others that should be paid each year in order to keep driving the car in the same free and worry-free way
o initial hours (“initial_hours”) – the number of hours that the actions from the initial money column would take. This column matters if the person has chosen a number different from 0 on the own time cost slider
o yearly hours (“yearly_hours”) – the number of hours that the actions from the yearly money column would take the person. Like the initial hours, this column matters only if the own time cost slider has a value different from 0
o real (“data_is_real”) – a column showing whether the data in the row is real or is derived from real data (is not real). The system will only suggest cars if the data is real.
Data that is not real is still used to make calculations for future use
o key (“key”) – a hashed column that exists only to provide faster joins. The key is generated from car manufacturer + model + fuel type. When requesting results from the system, the SQL query joins the table with itself on this column
• “mobile” – a table holding the information from mobile.bg: car manufacturer and model, kilometres and production date, price, engine type, and both the date published and the date edited in mobile.bg
• “models” – a table holding the information for the car models and modifications from automedia.investor.bg: car manufacturer and model, production date range, fuel type, engine power (in horsepower), engine volume (in cm3), car width and length, trunk size and number of seats, fuel consumption urban and extra-urban, tire size, coupe type
• “settings” – a table holding more general information for the whole system or information for the search sorting calculations, such as the fuel price. The table has the following columns: type of the setting, value and name
• “tasks” – a table acting as a queue for the tasks that are populated and wait for an available executor. The table has these columns: task – JSON-encoded data about the task to be executed, with the parameters included; status – shows whether the task has been reserved for execution (a method for dealing with concurrency); and type – a number between 1 and 10. Thanks to the “status” column, when several same-type executors start at the same time, each task is executed exactly once

Here are the tables; now let us take a look at the connections between them.
• “ci_sessions” is a framework-specific table, so it has nothing to do with the system’s logic
• “euroncap”, “images”, “mobile” and “models” contain information coming from external websites.
They do not have any foreign keys or direct connection with other tables
• “settings” is related to the system itself and contains more general information, so it also has no external connection or foreign keys
• “tasks” exists with the only purpose of acting as a queue for the system. It is just a read and write table and has no connections with anything else
• the “information” tables are the only tables that are created from the information stored in other tables. Even so, the decision is not to have a direct connection to any other table:
  o car manufacturer and model are written differently in all tables having these columns. Mercedes, Mercedes-Benz, Mercedes AG and others should be the same but are not. Lots of car manufacturers and models differ slightly or more in their names. Also, the data in the “models” table does not have a unique combination of car manufacturer + model, and this makes it impossible for these columns to be foreign keys to a specific table
  o “years” is a column showing the age of the car, not when it was produced. In all other tables, year-like columns show the manufacturing date
  o the four columns from the “euroncap” table are a good candidate, but this would mean one more join in the query, and tests show it is better and faster to have these four columns imported with their values. Denormalizing a bit helps a lot
  o all other columns cannot be used as a foreign key, or are better not used as one

An index on the “key” column in the “information” tables makes searching fast. Every search request makes two joins to the same “information” table using the “key” column. There is one more index in the “information” tables, combining the columns that are searched together: “data_is_real”, “width”, “length”, “trunk”, “seats”, “euroncap_adult”, “euroncap_child”, “euroncap_pedestrian” and “euroncap_safety”.
Searches are performed on different combinations of fields and with different values, but because all of these fields are always included in the query, the index can actually be used. This is a great performance boost, because running the query without any index is far slower.

All primary keys use their own sequence, in the terminology of PostgreSQL. A sequence is an auto-incremented number. Every time the next number of the sequence is requested, the sequence is incremented and the new number is returned.

5.3. Diagrams

Since the project is very specific, merging lots of data sources and having no foreign keys, this is how the database diagram looks (Fig. 2).

Fig. 2. Database diagram

As seen, there are columns existing in more than one table. They may or may not be the same as those in the final table, the “information” table. Here is one more diagram showing exactly the columns that exist in several tables. Columns containing the same or related information are coloured the same way and connected with arrows, as shown in Fig. 3.

Fig. 3. Database diagram showing columns containing the same or related information

The three tables on the left (“ci_sessions”, “settings” and “tasks”) are not connected in any way to the core system. The main table is connected to the other four tables, merging them. The data in the “information” table comes from the other tables as follows:
• the columns “car_maker” and “car_model” go together. They can be found in all of the other four tables. This is easily explained because these two columns describe a car model, which is the base of the system: recommending car models to buy
• the “car_years” column is just like “car_maker” and “car_model”. It describes a “car_model” directly. Car manufacturers change the exterior and interior of the car but do not change the model name. This is the reason one combination of “car_maker” and “car_model” can have multiple different “car_years”.
The “car_years” column can be found, in a different format, in all the other tables
• “car_price” and “car_km” are columns coming from the “mobile” table. They are the average values of the matching rows in the “mobile” table
• “car_fuel” comes from two places in a different format: “mobile” and “models”. “car_fuel” together with “car_maker” and “car_model” forms the hashed column “key” in the “information” table. “car_fuel” is also important for showing the results in the frontend, which are grouped by “car_maker”, “car_model” and “car_fuel”
• “width”, “length”, “seats”, “fuel_city”, “fuel_highway” and “trunk” are columns describing the car for filtering (“width”, “length”, “seats”, “trunk”) and sorting (“fuel_city”, “fuel_highway”), because it matters how many litres per 100 kilometres the car needs. These six columns can be found in the “models” table only
• the four “euroncap” columns (“euroncap_adult”, “euroncap_child”, “euroncap_pedestrian”, “euroncap_safety”) come from the “euroncap” table

Apart from the database structure, it is interesting to show why the optimization of combining all the information in just one table for search is applied (Fig. 4).

Fig. 4. PostgreSQL explanation graphic for a search query

Starting from the top left:
• First, an index scan on the index information_121_index_all (“…_index_all”) is performed for the field “data_is_real=1” plus all the search criteria: “width”, “length”, “trunk”, “seats”, “euroncap_adult”, “euroncap_child”, “euroncap_pedestrian” and “euroncap_safety”
• The index scan is followed by a nested loop, which means a join with the data from the same table. This join makes another index scan on the same table, but using the information_121_index (“…_index”) index. The search here is pretty fast. A more natural solution is to use a specialized database for search like ElasticSearch and this was the initial idea.
Later it turned out that PostgreSQL deals well with these searches and using one more database would not result in any performance gain
• Again to the right, one more nested loop is done with the same table. Like the previous join, this one uses an index, and both the scan and the join are fast
• Next, a hash left join is performed with the result of hashing the “images” table, which is read with a sequential scan. Here the join looks for the car’s image from the same year. The result of the query up to this moment is a table generated in memory with all columns and data
• The HashAggregate is the place to filter some more data. The HashAggregate executes the “HAVING” SQL clause. In PostgreSQL, HashAggregates are considered one of the fastest parts of executing a query, so this step is quite fast again
• The last step is to sort the results. The sort is done in one loop

PostgreSQL can decide to execute the query in different ways depending on the provided filtering criteria. For example, if no filter criteria are provided, it would start from the third table (second join), then move to the first table and then to the second (first join). PostgreSQL picks the execution plan it estimates to be the cheapest. Here is the query execution plan for this sample search query (Fig. 5):

Fig. 5 – PostgreSQL query plan for a search query

The query executes in 22 ms. At the moment, the system works with more than 100 000 real car ads and can recommend more than 3300 cars. This is a lot. Almost 700 is the maximum number of results that can be shown at the end in the results table. The calculation shows that the remaining 2600 cars are repeating combinations of car manufacturer, model and fuel type but with a different age. Given this simple statistic, there is no way a buyer could have thought of all the possibilities. Now the buyer is a few clicks away from advice that may save a considerable amount of money, hundreds or even thousands, every year.
One more interesting test is the benchmark performed with the tool “ab” (Apache Bench). The test is performed with 100 requests at different concurrency levels to the search page, on the recommended setup: 2 CPUs and 1GB of RAM. The numbers in the head row (1, 2, 3, 5 and 10) show the concurrency level, while the percentages on the left (50%, 75% and 99%) show the percentage of requests. Each cell in Fig. 6 is the number of milliseconds within which the corresponding percentage of requests is served at the given concurrency level. As said before, a website is considered fast if it loads in less than 0.25 seconds = 250 milliseconds.

  Requests \ Concurrency      1      2      3      5     10
  50%                        46     69    100    204    604
  75%                        48     76    113    239    945
  99%                        59     92    149    335    955

Fig. 6 – Benchmark test results in milliseconds: columns show concurrent queries while rows show the percentage of served requests

The cells with values below 250 milliseconds are coloured green in the table, while the others are above this threshold. The 250 milliseconds target is achieved for 75% of the requests at a concurrency level of 5.

5.4. User interface

The user interface includes just three different types of screens. All pages aim to offer a minimal design as a beginning, and the philosophy is led by the “keep it simple” principles. All pages have the same structure with a header and footer and differ in the content section only.

The home page contains information about the system and a button to start using it.

The filter and sort page has tabs, each of which shows sliders and checkboxes related to the name of the tab. At the bottom of each tab, there are one or more buttons for navigating to the next or previous tab, or for seeing the results. All sliders are smart: when only one of the slider’s values (“from” or “to”) matters, only this handle can be moved.

The results page prints the results in a table. When the page loads, the table gets processed by the DataTables plugin, which adds a filtering option and allows custom sorting.
The first design was made by the developer himself. To be successful, the website needs a design made by a professional designer, and the current one was made by Jordan Lipchev.

Here are three screenshots of the initial design (Fig. 7, Fig. 8 and Fig. 9) and three of the current professional design (Fig. 10, Fig. 11 and Fig. 12):

Fig. 7 – Home page of Halogen (previous design)
Fig. 8 – Search page (previous design)
Fig. 9 – Results table (previous design)
Fig. 10 – Home page of Halogen (new design)
Fig. 11 – Search page (new design)
Fig. 12 – Results table (new design)

5.5. Additional modules

As mentioned before, the CodeIgniter framework is used in the backend. An additional module for crawling websites, PHPQuery, is also used. It is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library. In other words, jQuery for PHP. One more PHP library, named resize, is used for easy image resizing. In the frontend, the following JavaScript libraries are used: jQuery and jQueryUI, and the DataTables plugin. DataTables is used for the results table plus filtering and sorting. Linux’s native cronjobs are used.

5.6. Conclusion

Despite the problems with data integrity and formatting, the result is worth it. There are some interesting discoveries.
Generally speaking, with no additional filter criteria:
• If the person does not value his own time, it is better to buy an old car running on LPG
• If the person values his own time at 25 BGN per hour, then it is better to buy a newer car running on gasoline or diesel
• If the person highly values his time, at 60 BGN per hour (like a businessman), then no matter how many kilometres he drives per year, it is better to buy a new car
• No matter the criteria, it is cheaper to drive a smaller car
• If the person drives a lot (more than 30 000 kilometres per year) and does not value his time, it is cheaper to buy an old LPG car
• If the person drives a lot and values his time at 25 BGN per hour, then it is slightly better to own a new gasoline or diesel car than an old LPG car
• If the person does not drive much, just about 5 000 kilometres per year, urban only, then owning a car costs the same as using taxis only

Other interesting results can also be taken from the database or derived from user requests.

6. Realization, testing, integration

6.1. Realization of the modules

All modules are written in PHP in the backend and JavaScript in the frontend.

6.1.1. Auto insurance pricing module

The insurance price of a car is usually taken from the Schwackeliste. Since only insurance brokers have access to the catalog, research on car premiums was done, together with sample requests to the website of an insurance broker. Around 100 test requests were made and the results were aggregated into an array of coefficients. The price of the auto insurance is calculated by multiplying the price of the car by the percentage from this array, which is based on the car’s age. The array itself starts with the value for a new car, then the value for a 1-year-old car and so on (Fig. 13). The last value is also used for all older cars:

Fig. 13 – Auto insurance premium by car age.
X-axis shows the age of the car, Y-axis shows the percentage of the price of the car to pay as an insurance premium

The auto insurance pricing module is used in the calculation of what a car costs its owner per year. The auto insurance has to be paid every year, it is not affected by driven kilometres, and the premium is shown as a percentage in the graph above.

6.1.2. Citizen liability pricing module

Just like the auto insurance, the citizen liability is not affected by driven kilometres and is paid year after year. The prices of the citizen liability change month by month and insurance brokers differ in their criteria, so the prices used here are the average prices on the market. In order to generate the prices, the data of an average (most common) driver is used for age, city of residence and driving experience. The prices are collected every month from sdi.bg, which is a large Bulgarian insurance broker. What matters here is the volume of the engine. To collect the prices again, a cronjob is executed which inserts as many tasks into the “tasks” table as there are different engine volumes in the database. The volumes are also taken from the website of sdi.bg. Cars with an engine volume larger than the last one pay what the last one pays. The average price for the citizen liability is taken as the median of the collected prices, using this method:

    public function get_median($values)
    {
        // The median is only meaningful on ordered values
        sort($values);
        $total_values = count($values);
        if ($total_values === 0) {
            return 0;
        }
        if ($total_values % 2 === 1) {
            // Odd count: take the middle element
            $average_index = (int) floor($total_values / 2);
            $median = $values[$average_index];
        } else {
            // Even count: average the two middle elements
            $average_index1 = $total_values / 2 - 1;
            $average_index2 = $total_values / 2;
            $median = ($values[$average_index1] + $values[$average_index2]) / 2;
        }
        return round($median, 2);
    }

Here is a graphic of the citizen liability prices for the different engine volumes, current as of the time of writing (Fig. 14).

Fig. 14 – Citizen liability price by engine volume.
X-axis shows the engine volume while the Y-axis shows the price in BGN for a year

6.1.3. Currency module

The currency module converts to BGN the prices of the cars from mobile.bg that are given in another currency. It synchronizes the rate of the USD from the website of the Bulgarian National Bank, bnb.bg, every day. A cronjob is used once again. The other option for the car price is the EUR, but since its rate to the BGN is fixed, it is just saved in the database as 1.95583, as stated on bnb.bg. This is the formula for calculating the rate of USD / BGN:

    round((string) $row->RATE * (string) $row->RATIO, 2);

6.1.4. Euro NCAP module

This module collects information from euroncap.com. It is implemented as a standalone library which crawls the Euro NCAP website for all kinds of cars. After being crawled, each page gets parsed in order to extract the useful data from it. For the parsing itself, the PHPQuery library is used. The result of using this library is an array containing all the car manufacturers, models, links to the pages and grades for all categories in which the car has been graded. This module runs as a cronjob once per month. All the crawling takes time, so the synchronization needs to be done separately from the user requests. After getting the results, the system inserts the new ones into the database, that is, it synchronizes the database. The star ratings on euroncap.com do not map directly to the numbers from the test results, so the system uses the actual numbers from the test results and converts them to percentages, knowing the maximum possible results. Sometimes a car does not have grades in all of the categories.

When this task is started, the first thing to do is to check for new tests. If there are no new tests on the euroncap.com website then nothing happens. If new tests are available, each of them is inserted into the “tasks” table and waits to get executed (for the test data to be collected).
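The conversion of raw Euro NCAP test points into percentages mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `points_to_percentage`, the handling of a missing grade and the sample numbers are hypothetical and not taken from Halogen's code.

```php
<?php
// Hypothetical helper (not from the Halogen source): converts the raw
// points scored in one Euro NCAP category to a whole percentage.
// $points     - points the car scored in the category
// $max_points - maximum points achievable in that category
function points_to_percentage($points, $max_points)
{
    if ($max_points <= 0) {
        // Assumption: a category with no grade is stored as 0,
        // matching the DEFAULT 0 of the euroncap_* columns.
        return 0;
    }
    return (int) round($points / $max_points * 100);
}

// Illustrative numbers only: 27 points out of 36.
echo points_to_percentage(27, 36), "\n"; // prints 75
```

Storing whole percentages keeps the filter comparisons in the search query (for example `euroncap_adult >= 75`) simple integer comparisons.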
An interesting graphic here is the average Euro NCAP percentage on a yearly basis by type of protection (Fig. 15).

Fig. 15 – Euro NCAP average protection percentages by year. X-axis is the year, Y-axis shows the percentage

The percentage of cars by protection level is also an interesting topic (Fig. 16). There are just a few models topping the chart with 100% results. Generally the results are good, but still a lot of cars are below a certain level of protection, be it 70% or 50%.

Fig. 16 – Euro NCAP percentage of cars by protection level. X-axis is the range of protection percentages, Y-axis represents the percentage of all cars with this protection level

6.1.5. Fuel module

Halogen collects the fuel prices several times per day through an API from fuelo.net. Fuelo provides an endpoint which returns the average price from the gas stations in Bulgaria. The system synchronizes the prices to the database for use in the calculations. Fuel prices are saved into the “settings” table.

6.1.6. Images module

Halogen works well with and without car images. The user interface is very important and users want to see graphics. This is the reason an images module was developed, which collects car images from Google using the image search API at ajax.googleapis.com. There are almost 1000 combinations of car manufacturer plus car model. But since a car’s look changes over time, a multiplication with the production year gives more than 6000 combinations, which are about 7% of all cars. Once per day a cronjob runs and collects images for the cars whose model is in the database but whose image is missing. The images module uses the images library, developed as a standalone library for extracting images from Google’s images API.

The search keyword is formed by combining the car manufacturer, car model, manufacturing year and the word “front”. The purpose is to get as many “good” images as possible: a car list looks better when the pictures show the cars’ “smile”.
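A minimal sketch of how such a search keyword could be assembled is shown below. The helper name `build_image_keyword` is hypothetical and the sample year is illustrative; the thesis only states which parts are combined, not how the actual images library does it.

```php
<?php
// Hypothetical helper (not from the Halogen source): builds the image
// search keyword from manufacturer, model, manufacturing year and "front".
function build_image_keyword($car_maker, $car_model, $year)
{
    $parts = [trim($car_maker), trim($car_model), (string) $year, 'front'];
    // Drop empty parts so the keyword never contains double spaces.
    return implode(' ', array_filter($parts, 'strlen'));
}

echo build_image_keyword('Opel', 'Corsa', 2008), "\n"; // prints "Opel Corsa 2008 front"
```

Appending a fixed word such as “front” biases the results towards a consistent viewing angle, which is what makes the results list look uniform.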
Right before insertion into the database, the image is downloaded, renamed and resized. The download uses curl with a custom user agent header. Renaming the image with a specially generated name helps when the picture needs to be updated. Resizing is an important step: the image is scaled to fit within 400 x 300 pixels, so the resize is automatically proportional. The resize is implemented by an external resize library using GD. The whole image processing is done by the “manage_image” method:

    public function manage_image($car_maker, $car_model, $year, $image)
    {
        // The generated key doubles as the file name, so an update overwrites the old image
        $image_file_name = $this->information_library->generate_key($car_maker, $car_model, $year);
        $download_file_name = FCPATH . 'cars/' . $image_file_name;
        try {
            $image_info = $this->download_image($download_file_name, $image);
            $this->rename_image($download_file_name, $image_info);
            $this->resize_image($download_file_name, $image_info);
            $ext = $this->get_ext($image_info);
        } catch (Exception $ex) {
            // Any failed step means no image for this car
            return FALSE;
        }
        return $image_file_name . '.' . $ext;
    }

6.1.7. Information module

The information module is the heart of the whole system. Thanks to the information module, results are generated and populated into the database. A task that runs once per day manages the generation of the “information” table. First, a new table is generated with all the needed columns, primary key and indexes. Then a query collects those models from the “mobile” table which have at least 5 matches for the same car manufacturer, car model, age and fuel type. For each of these models a task is inserted, and at the end one more task is inserted, called “information_fix”.

When one of these model tasks runs, first an insert is made to ensure the model gets inserted, then the model’s variable information is selected, which is the average price and mileage. Then additional information about the model is collected, like the power of the engine, manufacturing dates from and to, width, length and many more.
Also, several corrections are executed here: a correction of the fuel consumption for cars running on LPG, a correction of the car manufacturer and model names because they differ between the tables, and more. After that the information from Euro NCAP is taken, if any. Later the hours and costs calculations are done, which include getting all the information for the prices of citizen liability, fuel and others. At the end, an update is performed with all the information that has just been calculated and generated.

When all the tasks for models have finished, the “information_fix” task is executed. This task finds the holes in the just populated table. An example of a hole is a missing year for a car model. Suppose the combination of car manufacturer, model and age exists for ages from 1 to 10, except for the 7-year-olds. This task would calculate the average of the 6 and 8-year-olds and insert it, so the whole model would be complete. With these holes fixed, a better search can be performed, because the expenses for the fixed entries can be calculated easily. It is worth noting that the system will not recommend any of these “fixed” cars. Halogen will only recommend cars for which the initial data is real.

To calculate the value to fill in the hole, the “calculate_hole_value” method is used. It receives the first and second prices, together with the part of the period to generate a value for (the period may have several missing consecutive years), and returns the result of a linear calculation formula:

    protected function calculate_hole_value($first, $second, $period_part)
    {
        $min = min($first, $second);
        $max = max($first, $second);
        $diff = $max - $min;
        // Move down from the higher price by the given part of the difference
        return $max - $period_part * $diff;
    }

When all holes are fixed, again using tasks, one “activate” task is run. The “activate” task activates the table into which all tasks until this time have inserted and updated information.
The “activate” task is configured to keep a history of 3 “information” tables. The older ones get dropped. At the end, the whole cache is cleaned, ready to handle new caching data.

At this time, the new “information” table is ready to be used and all requests are automatically redirected to it because it is the default table now.

The “information” tables have columns for car manufacturer and model, car age, car kilometres and fuel type, width and length, number of seats and trunk size, fuel consumption urban and extra-urban, Euro NCAP data for adult, child and pedestrian protection and safety systems, initial and yearly costs and hours that have to be spent on the car, a column showing if the data is real or the result of fixing holes, and a key for easy joining.

Every “information” table is generated automatically from the code by running the “create_table” method first and the “create_indexes” method second:

    public function create_table()
    {
        $this->load->dbforge();
        $this->dbforge->add_field("information_id bigint NOT NULL DEFAULT nextval('information_sequence')");
        $this->dbforge->add_field("car_maker character varying(63) NOT NULL");
        $this->dbforge->add_field("car_model character varying(63) NOT NULL");
        $this->dbforge->add_field("car_years integer NOT NULL");
        $this->dbforge->add_field("car_price decimal NOT NULL");
        $this->dbforge->add_field("car_km integer NOT NULL");
        $this->dbforge->add_field("car_fuel character varying(15) NOT NULL");
        $this->dbforge->add_field("width integer DEFAULT 0");
        $this->dbforge->add_field("length integer DEFAULT 0");
        $this->dbforge->add_field("seats integer DEFAULT 0");
        $this->dbforge->add_field("fuel_city decimal DEFAULT 0");
        $this->dbforge->add_field("fuel_highway decimal DEFAULT 0");
        $this->dbforge->add_field("trunk integer DEFAULT 0");
        $this->dbforge->add_field("euroncap_adult integer DEFAULT 0");
        $this->dbforge->add_field("euroncap_child integer DEFAULT 0");
        $this->dbforge->add_field("euroncap_pedestrian integer DEFAULT 0");
        $this->dbforge->add_field("euroncap_safety integer DEFAULT 0");
        $this->dbforge->add_field("initial_costs decimal DEFAULT 0");
        $this->dbforge->add_field("initial_hours integer DEFAULT 0");
        $this->dbforge->add_field("yearly_costs decimal DEFAULT 0");
        $this->dbforge->add_field("yearly_hours integer DEFAULT 0");
        $this->dbforge->add_field("data_is_real integer DEFAULT 1");
        $this->dbforge->add_field("key character(32) NOT NULL");
        $this->dbforge->add_key('information_id', TRUE);
        return $this->dbforge->create_table($this->get_next_table_name(), TRUE);
    }

    public function create_indexes()
    {
        $table_name = $this->get_next_table_name();
        // Index on the hashed key, used by the self-joins
        $index_query = "CREATE INDEX " . $table_name . "_index ON " . $table_name . " (key);";
        // Composite index on all columns that are filtered together
        $index_all_query = "CREATE INDEX " . $table_name . "_index_all ON " . $table_name
            . " (data_is_real, width, length, trunk, seats, euroncap_adult, euroncap_child, euroncap_pedestrian, euroncap_safety);";
        return $this->db->query($index_query) && $this->db->query($index_all_query);
    }

Graphics make it easier to notice trends, be it in price or something else. Let us first see the number of different models on the market for each year (Fig. 17) and then the average model age in years (Fig. 18):

Fig. 17 – Number of different models on the market by year
Fig. 18 – Average model age in years

Here are three graphics showing the prices of three of the most popular car models on the Bulgarian market: VW Golf (Fig. 19), Opel Corsa (Fig. 20) and Ford Fiesta (Fig. 21).

Fig. 19 – VW Golf price over time by engine type. X-axis: age of the car, Y-axis: average cost
Fig. 20 – Opel Corsa price over time by engine type. X-axis: age of the car, Y-axis: average cost
Fig. 21 – Ford Fiesta price over time by engine type.
X-axis: age of the car, Y-axis: average cost

Some interesting observations for the three models: gasoline models generally become cheaper with time; diesel models are more expensive in the 3rd year than in the 2nd; no matter the fuel type, all prices flatten and merge at the end; the first offers are usually from the 2nd year.

6.1.8. Mobile module

The integration with mobile.bg’s API is an interesting topic. The API has just one endpoint (for ads export) and the requests to it are limited to 15 calls per 15 minutes. On average this makes 1 call per minute, so the natural answer is a cronjob. The API can neither export all ads at once nor do anything complex. The maximum lifetime of an ad in the database of mobile.bg is 49 days, unless it is modified or extended.

The cronjob for collecting mobile.bg ads is executed every minute. It makes a request to mobile.bg asking for ads from a specific day in the past. The day is the minute of the current time modulo 30. If the time is 10:50, the script will ask for ads from 20 days ago, because 50 % 30 is equal to 20. Here is the code that generates the dates from and to for collecting car ads from mobile.bg:

    // Minute of the current hour (0-59) mapped to a day offset (0-29)
    $current_hour_minutes = (int) date("i") % 30;
    $date_from = strtotime("-$current_hour_minutes days 00:00:00");
    $date_to = strtotime("-$current_hour_minutes days 23:59:59");

The collected data is the mobile.bg id, car manufacturer and model, kilometres of the car, production date, price, dates of publishing and last editing of the ad, and fuel type. The connection to mobile.bg itself is made using a custom library. The library makes the request using curl. The request returns JSON encoded data, which gets decoded and used later.

When the ads are returned, their IDs are checked to find which of them already exist and which are new. Before any other checks, one for reliability is performed.
This one includes checking whether the price matches some predefined values like 0, 11 or 111. These are fake prices for crashed cars or ones being sold for parts. If the car has been a taxi or is on a lease, it is also excluded. These are not many cars, but the prices they are announced with are usually fake, or there is a section in the description saying something more about the price. Ads containing words hinting that the price is not real are also excluded. This is the code checking if an ad is reliable. By default, it is reliable.

    private function _check_ad_reliable($ad)
    {
        $reliable = TRUE;
        if (in_array((int) $ad['price'], [0, 11, 111])) {
            // Placeholder prices
            $reliable = FALSE;
        } elseif (isset($ad['condition']) && $this->_check_existence($ad['condition'], ['на части'])) {
            // 'на части' = for parts
            $reliable = FALSE;
        } elseif (isset($ad['extri']) && $this->_check_existence($ad['extri'], ['TAXI', 'Катастрофирал', 'Лизинг', 'На части'])) {
            // Taxi, crashed, leasing, for parts
            $reliable = FALSE;
        } elseif (isset($ad['description']) && $this->_check_existence($ad['description'], ['вноска', 'вноски', 'vnoska', 'vnoski'])) {
            // 'вноска' / 'вноски' = instalment(s), also in Latin transliteration
            $reliable = FALSE;
        }
        return $reliable;
    }

If the ad exists, a full check for updates is performed. When an update is found, the ad with this specific change is recorded for an update. If the ad is new, it is just added to the ads for insertion. At the end, these arrays of ads to insert and update are sent to the database.

Each time Halogen requests data from mobile.bg, the response is about 7.5-8.0 MB of JSON encoded data. All this information gets parsed at once with no trouble.

6.1.9. Models module

The module is started by a cronjob once per month. It checks the website automedia.investor.bg for new car data. If any new data is found, a task for collecting each new model is inserted. The pages are parsed, since the connection is one-way only. The parsed useful information goes into the “models” table.
This is the data for catalog id, car manufacturer, car model, manufacturing dates from and to, fuel type, engine volume and power, car width and length, trunk size, tires, fuel consumption urban and extra-urban, coupe type and number of seats.

6.1.10. Search module

The search module is the main module in Halogen. The module is system-dependent and cannot be separated as a library.

When a request comes into the search module, it first parses the parameters. The parsing is important because the search module of a system is one of the most common sources of security holes and issues. After the parameters are parsed into an array, a key is generated on the basis of all these parameters. The key is hashed and uniquely related to the combination of search parameters.

This is where the cache comes into play for the first time. Memcached is used for caching the searches. A quick check in the cache tells if such a query has already been made. If so, the results are immediately returned to the script from the cache (from RAM) and then to the frontend.

If the desired data is not in the cache, the search has to be performed. What is important for the search is to know the current fuel prices and which “information” table is the active one. These parameters are passed to the search method in the model.

The search is now performed with the parameters requested by the user and the result is returned as an array to the controller. The search query itself is not a simple one.
This is a sample search query:

    SELECT
        i1.car_maker, i1.car_model, i1.car_years, i1.car_fuel,
        im.image_key AS car_image,
        ROUND(i1.car_price, 2) AS car_price_beginning,
        ROUND(LEAST(i1.car_price, i3.car_price), 2) AS car_price_end,
        ROUND(i1.car_price - LEAST(i1.car_price, i3.car_price), 2) AS depreciation,
        ROUND(i1.initial_costs + SUM(i2.yearly_costs), 2) AS running_cost,
        i1.initial_hours + SUM(i2.yearly_hours) AS running_hours,
        -- 'Бензин' = petrol, 'Дизел' = diesel
        ROUND((5000 / 100 * i1.fuel_city + 10000 / 100 * i1.fuel_highway)
            * (CASE i1.car_fuel WHEN 'Бензин' THEN 2.06 WHEN 'Дизел' THEN 2.21 ELSE 0.85 END) * 5, 2) AS fuel_cost,
        ROUND(i1.car_price + i1.initial_costs + SUM(i2.yearly_costs)
            - LEAST(i1.car_price, i3.car_price)
            + (i1.initial_hours + SUM(i2.yearly_hours)) * 5
            + (5000 / 100 * i1.fuel_city + 10000 / 100 * i1.fuel_highway)
            * (CASE i1.car_fuel WHEN 'Бензин' THEN 2.06 WHEN 'Дизел' THEN 2.21 ELSE 0.85 END) * 5, 2) AS total_cost
    FROM information_121 AS i1
    JOIN information_121 AS i2 ON i1.key = i2.key
    JOIN information_121 AS i3 ON i1.key = i3.key
    LEFT JOIN images AS im
        ON i1.car_maker = im.car_maker
        AND i1.car_model = im.car_model
        AND EXTRACT(YEAR FROM current_date)::integer - 1 - i1.car_years = im.year
    WHERE (i1.data_is_real = 1)
        AND (i1.car_years <= i2.car_years AND i1.car_years + 5 > i2.car_years)
        AND (i1.car_years + 5 = i3.car_years)
        AND (i1.width >= 1600 AND i1.width <= 1900)
        AND (i1.length >= 3800 AND i1.length <= 4600)
        AND (i1.seats >= 4 AND i1.seats <= 5)
        AND (i1.trunk >= 250)
        AND (i1.euroncap_adult >= 75)
        AND (i1.euroncap_child >= 75)
        AND (i1.euroncap_pedestrian >= 50)
        AND (i1.euroncap_safety >= 50)
    GROUP BY i1.information_id, i1.car_maker, i1.car_model, i1.car_years, im.image_key,
        i1.car_fuel, i1.car_price, i3.car_price, i1.initial_costs, i1.initial_hours,
        i1.fuel_city, i1.fuel_highway
    HAVING COUNT(i2.information_id) = 5
        AND SUM(i2.data_is_real) >= (CASE COUNT(i2.information_id) WHEN 1 THEN 1 ELSE 2 END)
    ORDER BY total_cost

Here is a graph showing the execution time of this query against the database for a sample of 100 runs (Fig. 22).

Fig. 22 – Search query execution time in milliseconds. X-axis is the serial number of the test, Y-axis is the query execution time in milliseconds

More parameters can be added, which makes the “WHERE” clause even larger and more complex. Let us take a look at the slowest query that can be generated in the system. The query uses the full ranges for all sliders (the ones that show the most results) as well as all checkboxes. The query runs for 165 milliseconds. Its execution plan differs from the plan of the sample query because the PostgreSQL planner estimates the cost of alternative execution plans and chooses the one it expects to be fastest.

When the search query returns the results, they are filtered in the PHP code so that only one car remains for each combination of car manufacturer, car model and engine type. The reason this filtering is performed in the code is simple: it is faster. Doing the same in the query would make it really complex and slow. SQL is not made for this type of logic, while loops in the code are perfect for it. The result that remains is the first one, which is the best result for this combination.

After the filtering, some more information is added to each row: the match percentage, the price per kilometre and the price per month. In theory, these can also be included in the SQL query, but the match percentage would be too slow to compute there, so it is calculated after the results are ready. The other two fields can easily be included in the query, but since the loop in the code is needed for the match percentage anyway, it is easier to add them there and keep the SQL query as simple as possible.

The results are now filtered and supplemented, which makes them complete. The JSON-encoded data array is then saved in the cache so that it can be retrieved quickly when the same search is requested again.
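The post-query filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function name filter_and_supplement and the field names cost_per_km and cost_per_month are hypothetical, car_fuel stands in for the engine type, and the 5 years and 15 000 km per year are taken from the constants in the sample query (5000 urban plus 10000 extra-urban kilometres). The match percentage is left out because its formula is not given here.

```php
<?php
// Hypothetical sketch: keep the cheapest row per (maker, model, fuel) and add derived costs.
// Assumes $rows is already sorted by total_cost ascending, as the sample query's ORDER BY does.
function filter_and_supplement(array $rows, $years, $km_per_year)
{
    $best = [];
    foreach ($rows as $row) {
        $group = $row['car_maker'] . '|' . $row['car_model'] . '|' . $row['car_fuel'];
        if (isset($best[$group])) {
            continue; // a cheaper row for this combination was already kept
        }
        // Derived fields: total cost spread over the driven kilometres and the months owned.
        $row['cost_per_km'] = round($row['total_cost'] / ($years * $km_per_year), 2);
        $row['cost_per_month'] = round($row['total_cost'] / ($years * 12), 2);
        $best[$group] = $row;
    }
    return array_values($best);
}

// Made-up rows for the example.
$rows = [
    ['car_maker' => 'A', 'car_model' => 'X', 'car_fuel' => 'petrol', 'total_cost' => 15000.0],
    ['car_maker' => 'A', 'car_model' => 'X', 'car_fuel' => 'petrol', 'total_cost' => 18000.0],
    ['car_maker' => 'A', 'car_model' => 'X', 'car_fuel' => 'diesel', 'total_cost' => 21000.0],
];
$result = filter_and_supplement($rows, 5, 15000);
```

Because the rows arrive sorted by total cost, a single pass with an associative array is enough: the first row seen for each combination is the one to keep.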
The cache time-to-live (TTL) is set to 24 hours, which will most probably be enough since the “information” table is regenerated every day, and when this happens the whole cache becomes stale and is cleared.

    public function index()
    {
        $params = $this->parse_params(FALSE);
        $key = $this->make_key($params);

        // Try the cache first
        $data = $this->cache->get($key);
        if ($data === FALSE) {
            // Cache miss: load the current fuel prices and the active "information" table
            $this->load->model('Fuel_model');
            $fuel_prices = $this->Fuel_model->get_prices();
            $this->load->model('Information_model');
            $table_name = $this->Information_model->get_table_name();

            // Run the search, post-process the rows and cache the result
            $data = $this->Search_model->find($params, ['fuel' => $fuel_prices, 'table' => $table_name]);
            $this->manage_search_resuts($data, $params);
            $this->cache->save($key, json_encode($data), 24 * 60 * 60); // in seconds
        } else {
            $data = json_decode($data, TRUE);
        }

        $query = http_build_query($this->input->get());
        $this->load_view('search', ['models' => $data, 'params' => $params, 'query' => $query, 'show_more' => TRUE]);
    }

6.1.11. Tasks module

Halogen has lots of web pages that need to be crawled and lots of information to be generated. The tasks module executes a given number of tasks of a given type. The different types are defined as constants: TYPE_1, TYPE_3, TYPE_5 and TYPE_8. Tasks of different types represent different queues. For each queue, there is a cronjob running a different number of tasks at once.

Tasks with TYPE_1 are the “information” tasks. They are small and fast to execute, and once they are generated and pushed to the table they should be executed as quickly as possible. This is the reason three workers (cronjobs) for TYPE_1 are started at once every minute, each with 500 tasks to execute. This means 1500 tasks from TYPE_1 are executed every minute.

Tasks with TYPE_3 are the Euro NCAP and “images” tasks. TYPE_3 tasks are limited in the number of requests they may send to the corresponding website. For example, if more than 5 requests to the euroncap.com website are made within a minute, the site goes down.
The same holds for ajax.googleapis.com, except that the service just rejects the queries. Three tasks with TYPE_3 are executed every minute.

Tasks with TYPE_5 collect information about the citizen liability prices from the website of sdi.bg. The limit here is 2 tasks per minute because otherwise the script gets rejected. The connection to the website is slow and the whole price collection takes up to 7 minutes.

Tasks with TYPE_8 collect information for the models from the website of automedia.investor.bg. Just one task from TYPE_8 is executed each minute. Tests show that increasing the number to more than 5 tasks per minute makes the website go down.

More task types can be added at any time.

The tasks controller is used to generate tasks. It is also the entry point for executing all tasks. Called with a type and a number of tasks, the endpoint keeps executing while tasks are available and the requested number of tasks has not been reached. Each task is JSON decoded, a new controller is instantiated and the corresponding method is executed with the provided parameters. When it is done, the next task follows the same pattern.

This is how the corresponding controller for the task is loaded and the action is executed with the task’s parameters:

    // Controller, action and parameters come from the decoded task
    $controller_name = $task_decoded['controller'];
    $action_name = isset($task_decoded['action']) ? $task_decoded['action'] : 'index';
    $params = isset($task_decoded['params']) ? $task_decoded['params'] : [];

    // Load the controller class, instantiate it and call the action
    include_once APPPATH . 'controllers/' . $controller_name . '.php';
    $controller = new $controller_name();
    call_user_func_array(array($controller, $action_name), $params);

An interesting point is resolving concurrency. PostgreSQL serves several connections in parallel, and this is the reason a concurrency problem exists. When selecting a task, the flow is to first select the task and then delete it from the database.
These are two queries, and other queries may run between them. If three cronjobs executing tasks of the same type run together, there is a high chance that two or more of them take the same task and start executing it.

The solution is to generate a random string when selecting a task and to first update one free task by setting its status to that string. By doing this, the task is reserved for this executor. However, reserving just one task still requires two queries: the first selects the task id and the second updates the task with this id. The trick is to add the words “FOR UPDATE” at the end of the inner select query. This tells the database to lock the selected row, so a concurrent executor running the same statement has to wait until the whole current statement finishes, not just the inner select.

    UPDATE tasks
    SET status = 'na3h0fhwe80fhhuhufsesuhijw9f6a0fje'
    WHERE task_id = (
        SELECT task_id
        FROM tasks
        WHERE status IS NULL AND type = 1
        ORDER BY type DESC, task_id ASC
        LIMIT 1
        FOR UPDATE
    )

Once the task is reserved by this query, the executor can select and delete it, knowing for sure that the task will be taken by exactly one executor.

All tasks should be written in such a way that even if a task fails for some reason, the system continues working without any issues. The tasks should also be able to be queued once again by the same task generation script.
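The reserve, select and delete flow described above can be sketched as follows. This is an illustration only: the function name take_task and the task column are hypothetical, random_bytes assumes PHP 7 or later, and an in-memory SQLite database is used so the example runs without a server. SQLite does not support “FOR UPDATE”, so that clause is left out here; against PostgreSQL the inner select would keep it as in the query above.

```php
<?php
// Hypothetical sketch of the reserve-then-take flow on an in-memory SQLite table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, type INTEGER, status TEXT, task TEXT)');
$db->exec("INSERT INTO tasks (type, task) VALUES (1, 'first'), (1, 'second')");

function take_task(PDO $db, $type)
{
    // Random marker that identifies this executor's reservation.
    $token = bin2hex(random_bytes(16));

    // Step 1: reserve one free task by stamping it with the marker.
    $reserve = $db->prepare(
        'UPDATE tasks SET status = :token WHERE task_id = (
            SELECT task_id FROM tasks
            WHERE status IS NULL AND type = :type
            ORDER BY task_id ASC LIMIT 1
        )'
    );
    $reserve->bindValue(':token', $token, PDO::PARAM_STR);
    $reserve->bindValue(':type', $type, PDO::PARAM_INT);
    $reserve->execute();

    // Step 2: read back the task carrying our marker (none if the queue was empty).
    $select = $db->prepare('SELECT task_id, task FROM tasks WHERE status = :token');
    $select->execute([':token' => $token]);
    $task = $select->fetch(PDO::FETCH_ASSOC);
    if ($task === false) {
        return null;
    }

    // Step 3: delete the task so it is never executed twice.
    $delete = $db->prepare('DELETE FROM tasks WHERE task_id = :id');
    $delete->execute([':id' => $task['task_id']]);
    return $task;
}

$first = take_task($db, 1);
$second = take_task($db, 1);
$third = take_task($db, 1); // the queue is empty at this point
```

An executor that finds no free task simply gets null back and can stop, which matches the rule that the endpoint runs only while tasks are available.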
This is what the crontab looks like:

    # cronjobs
    0 0 * * * php /var/www/halogen/www/index.php fuel
    0 0 * * * php /var/www/halogen/www/index.php currency
    0 11 1 * * php /var/www/halogen/www/index.php vignette
    * * * * * php /var/www/halogen/www/index.php mobile

    # cronjobs that generate more cronjobs
    0 4 * * * php /var/www/halogen/www/index.php task information
    10 4 * * * php /var/www/halogen/www/index.php task images
    0 23 1 * * php /var/www/halogen/www/index.php task citizen_liability
    0 23 2 * * php /var/www/halogen/www/index.php task models
    0 23 3 * * php /var/www/halogen/www/index.php task euroncap

    # executors
    * * * * * php /var/www/halogen/www/index.php task 8 1
    * * * * * php /var/www/halogen/www/index.php task 5 2
    * * * * * php /var/www/halogen/www/index.php task 3 3
    * * * * * php /var/www/halogen/www/index.php task 1 500
    * * * * * php /var/www/halogen/www/index.php task 1 500
    * * * * * php /var/www/halogen/www/index.php task 1 500

As seen, the crontab is split into three groups: cronjobs that update data directly, cronjobs that generate tasks once per day or month, and executors which run every minute.

6.1.12. Taxes module

The taxes module is a stand-alone library based on the Bulgarian law for car taxes. The tax itself depends on the age of the car and on its power in kilowatts, which can easily be converted to horsepower. The module is used for calculating the costs on a yearly basis.

6.1.13. Tires module

This module is again realized as a library. Research was done on the prices of tires of different sizes. Then a common formula was found showing the rate at which tires get cheaper or more expensive. There is a wide variety of tire sizes, and car owners know that tires need to be changed after a certain number of years, so this is a repeating cost based on the price of the tires. If we say tires are changed every 5 years and one tire costs 125 BGN, this means 4 summer and 4 winter tires at 125 BGN each, or 1000 BGN over 5 years. That is 200 BGN per year for the tires only.
It is not a big cost, but surely not a small one either, and many people forget about it.

There is a formula for calculating the price of tires. Tire prices are predictable in the general case: the smaller, the cheaper. There is a default tire size with a price for it, and all other prices are related to the default one. The default size and price are:

    135/X R15 for 120 BGN

The research shows that the tire’s height does not make any difference in the price, and this is the reason it is not included in the formula. For tires 145/X R16 the price is 170 BGN; for 185/X R17 it is 290 BGN.

6.1.14. Vignette module

Currently, driving on the Bulgarian extra-urban roads requires buying a vignette sticker. The sticker is most often bought every year. Its price is fixed by the Road Infrastructure Agency. To have the price of the sticker added to the system, a library for parsing the website of the agency was developed. The price of the vignette is also saved in the “settings” table.

6.2. System integration

Fig. 23 shows how the system communicates with external systems. The double-sided arrows show two-way communication, in other words using an API, while the one-sided arrows show one-way communication, which means parsing information from the web pages.

Fig. 23 – System integration of Halogen. Left side – systems to connect with using an API. Right side – systems to crawl

As seen in the figure, only mobile.bg, fuelo.net, and ajax.googleapis.com provide two-way communication, while all the other sites require parsing.

There is one important rule for the integration with other systems: the system does not modify the data it receives. The exception is data where all options are known and their number is finite. For example, the manufacturing date can be modified because all options are known and the modification is straightforward.
The model name should not be modified because there are models with different model indexes that have to be collected as-is and later processed in a suitable way that may vary. It is important to have the original data in the database.

Halogen connects to mobile.bg via their API, receiving arrays of ads with full information about each ad. New ads are inserted and changed ads are updated. This is an integration in the true meaning of the word, and the most interesting one. A special IP-restricted personal key is used on every request.

The data from mobile.bg is processed so that the rows inserted into the database contain information in the following columns:

“mobile_id”: the ad id from mobile.bg’s API.
“car_maker” and “car_model”: the names of the car manufacturer and model, inserted as-is.
“km”: the kilometres of the car, as-is.
“date_produced”: the date in format YYYY-MM-01. The value comes as a month name in Bulgarian plus a year and is very unpleasant to parse, but it works well in the selected database format.
“price”: the price of the car. The currency comes from another field; the price is converted to BGN and saved as such.
“date_published” and “date_edited”: both come from the API.
“engine_type”: also comes from the API. The exceptions are the LPG engines, for which the extras are parsed, searching for some special words.

Connecting to fuelo.net is done the same way, with the difference that the fuel prices are updated every time. Again, a personal key is used for each request. The key is generated by registering in the system.

Google provides a new API endpoint for searching for images. In order to use this new API for image search, the developer has to download and include the whole package of Google PHP libraries for all their services, which is not what Halogen needs. Connecting to ajax.googleapis.com with a GET request is good enough and provides reliable results with just one line of code.
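The “date_produced” normalization mentioned above can be sketched as follows. This is a minimal illustration, not the thesis code: the function name parse_date_produced is hypothetical, the input is assumed to look like “<month name> <year>” (the exact format returned by the mobile.bg API is not shown here), and mb_strtolower requires the mbstring extension.

```php
<?php
// Hypothetical sketch: normalise "<Bulgarian month name> <year>" to YYYY-MM-01.
function parse_date_produced($text)
{
    $months = [
        'януари' => 1, 'февруари' => 2, 'март' => 3, 'април' => 4,
        'май' => 5, 'юни' => 6, 'юли' => 7, 'август' => 8,
        'септември' => 9, 'октомври' => 10, 'ноември' => 11, 'декември' => 12,
    ];
    // Expect a word of letters followed by a four-digit year; anything else is rejected.
    if (!preg_match('/^\s*(\p{L}+)\s+(\d{4})/u', $text, $m)) {
        return null;
    }
    // Month names may arrive capitalised, so compare in lower case.
    $name = mb_strtolower($m[1], 'UTF-8');
    if (!isset($months[$name])) {
        return null;
    }
    return sprintf('%04d-%02d-01', $m[2], $months[$name]);
}

$march = parse_date_produced('март 2008');
$december = parse_date_produced('Декември 2014 г.');
$unknown = parse_date_produced('unknown 2014');
```

Returning null for anything unrecognised keeps the rule stated above: only values whose full set of options is known are rewritten, and everything else can be left for later processing.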
The integration with automedia.investor.bg is done by parsing pages from the website. The URL addresses have the structure: website / catalog / car manufacturer id / car model id / car modification id. This is way too much to crawl, and the crawler would need to parse a lot of additional pages listing all car manufacturers, then all models with pagination, then all modifications with pagination again. A workaround was found: only the modification id in the URL needs to change, because the website seems to ignore the other ids when a modification id is present. This makes the whole crawling easier and faster. When a new page in automedia.investor.bg is found, it is crawled and inserted into Halogen’s database.

Parsing api.bg is an easy task. The page always has the same static URL address.

Parsing euroncap.com is a tough task. The website consists of several categories and also has a different structure for old and new models. These features make the website feel incomplete or partly working, but it works pretty well once understood.

Parsing the currency from XML is the only way to get the BGN/USD rate from bnb.bg. Both BGN/USD and BGN/EUR are inserted and kept in synchronization with the “settings” table.

The website sdi.bg belongs to one of the largest Bulgarian companies for car insurance and is one of the largest Bulgarian websites in this field. In order to get prices from the website, a person has to fill in lots of information which generally does not influence the prices, or influences the price of one specific company only. Halogen fills in the information so that the prices are as close as possible for as many people as possible. For example, Halogen sets the driver experience to more than 10 years, the owner age to 35 years, the number of seats to 5, and so on. The prices are then synchronized with the “settings” table and used later with the next “information” table generation.

6.3. Testing

Apart from the non-functional requirements for testing, the testing is done by real people on an incremental basis for both the whole system and every new module developed. The percentage of people is increased in steps as follows: 10%, 20%, 50%, and 100%.

Testing is also performed while the development is done. Every developed functionality should be deployed with no known issues. Everything a module should do is defined clearly before the beginning of the development. After a part of the module is ready, it is tested by the developer. When another part connected to the first is done, both are tested separately and together. At the end, the whole module with all of its branches is tested and deployed if everything looks right.

6.4. Experimental integration

Halogen is experimentally deployed and can be opened on http://halogen.bg/. The hosting is a virtual server (VPS) from superhosting.bg running Debian 7.1 Wheezy with 2 CPUs x 2.6 GHz and 2 GB RAM, with SSH key-based authentication. The server is also used for hosting another project. Superhosting does not support creating a machine from a custom virtual machine image, so all tools had to be installed separately.

What is installed on the experimental integration machine is:

PHP version 5.4.36.0, which is the latest stable release available for Debian systems directly from the apt-get utility. The needed extensions are also installed with their latest versions
Apache version 2.2.22 – again the latest stable release for Debian
PostgreSQL version 9.1 – latest stable.
Configured to listen for connections from localhost only
Memcached version 1.4.13 – latest stable with default settings, up to 64 MB RAM
Zend Opcache version 7.0.2 – the latest stable with the following configuration:
    o enabled for web and CLI
    o revalidate frequency 60 seconds with validate timestamps (the Opcache will check for updated scripts by file timestamps every 60 seconds)
    o memory consumption up to 128 MB (the maximum size of shared memory the Opcache can use)
    o max accelerated files (the maximum number of keys the Opcache can save, this should be a prime number) – 223
    o interned strings buffer – 8 MB (the amount of memory used to store interned strings)
CodeIgniter version 2.2.0
jQuery version 1.11.2 (latest stable from the 1.x branch), jQueryUI for jQuery 1.11.2
DataTables version 1.10.4 for jQuery 1.11.2
Cronjob settings – identical to the ones from chapter 6.1.

The server stats of the experimental integration are shown in Fig. 24.

Fig. 24 – Experimental integration server stats by running the “top” command

7. Conclusion

7.1. Summary of the execution and requisition for original results

The list of initial tasks includes:

7.1.1. To find a database with real car prices. This task is executed by using the API of mobile.bg. Partnering with mobile.bg (Bulgaria’s largest website for car ads) is a great success for every project, even more so for just released ones like Halogen.

7.1.2. To find all the needed data for all car models that can be used for searching or should be included in the calculation formula, and automate its gathering. This task includes the modules for collecting information from euroncap.com (used for filtering) and automedia.investor.bg (used for filtering and sorting). Both are implemented and the synchronization with them works well.

7.1.3. To find prices for all variable costs on car models and automate their gathering.
Although repair prices were not found anywhere on the Internet, talks with friends and mechanics helped create a formula for calculating the variable costs of the repairs a car model needs. The developed methods add these costs to the car’s price when the data is processed and added to the database.

7.1.4. To find prices for all variable costs an owner pays, other than the repairs, and automate their gathering. All costs were found on official websites and the data is used in the calculations.

7.1.5. To find data for security information on car models. Data from euroncap.com is used. All crash test results are collected and synchronized with the database of Halogen.

7.1.6. To find images for each car model and model year. Images are collected from Google and resized to smaller dimensions, which are large enough for the system to look good and be useful from the user’s perspective.

7.1.7. To build the system: frontend, backend, database. The system should be fast at calculating the score for each car model and then sorting by it. In the frontend, there should be an easy way of choosing values for the different filters and search fields with sliders and checkboxes, and the table with results should have the ability to search and sort on different columns. The whole system is built, working and experimentally deployed. In the frontend, the DataTables plugin works well with the expected functionality. Collecting information, calculations, search and caching are all implemented. Some tests should be made on whether the system searches fast or slow in the database. If the search is slow, an additional specialized database for searching should be used for this part of the system.

7.1.8. To test the system with requests from real people with different needs and financial situations. Tests were made with even more people than expected. As expected, the younger people prefer to choose a car based on different criteria, first of all the engine power and volume.
Older people understand the idea better and are more willing to believe the results. A relatively small number of people admit they have never thought about the real criteria on which they buy a car. Most of the older people agree with Halogen’s concept of buying a car based on its costs.

As a summary, the initial tasks are all executed and the result is the web-based application for car selection called Halogen. On the Bulgarian market no such system exists yet and the opportunity stays open. On the international market, no country besides Germany has a website offering functionality for choosing a car based on lifetime total costs. Germany’s system (Motoragent) continues to be developed and changed and now gives somewhat more realistic results, but it is still far from the concept of Halogen. Both systems, Halogen and Motoragent, have a lot to learn from one another.

7.2. Guidelines for future development and improvement

Halogen has the opportunity to be the first of its kind on the local market, and if it succeeds in attracting visitors, the concept of choosing a car based on lifetime total costs may be adopted widely.

Halogen works well at the current moment, but there is a lot to be improved and developed.

One more integration with mobile.bg needs to be made for synchronizing deleted or expired ads. Otherwise, these ads remain in the database with the prices they once had. This distorts the average prices and should be fixed in the near future. Synchronizing deleted and expired ads will result in showing real average prices for cars 100% of the time. At the moment, using several old prices is not a problem because the system is still young, so the prices do not differ that much.

A solution for the costs of repairs should be found. Ideally, this is where the business part comes in. The official representatives may want to give a list of the usual repairs their cars experience, together with prices, dates and kilometres.
Of course, they would like to present themselves as the best car manufacturer in terms of costs and would give a short list, which would put their cars at the top. This would cost them a certain amount of money on a monthly basis. Together with the list of repairs, a list of their new cars would be added.

Another part of the business is adding banners between the header and the results table, between the results table and the footer, and Google Adwords or similar on the right side of the results table. Full rebranding may be interesting for the official representatives and is an option. Ideally, the banners on the top and bottom would be traded with mobile.bg: Halogen advertises their site while they advertise Halogen, or Halogen is included in their circle of websites and shows banners from their system for 50% of the cost. This is a matter of discussion.

The generation of the “information” tables may be done several times per day, resulting in more accurate results throughout the day.

The price prediction algorithm can be changed, especially to work with new models of a manufacturer. Currently, if a model is new and the person chooses 5-year ownership, the model would not appear in the suggestions because the end price is not known. An approximation of the depreciation and other costs can be made so that this model can be included in the calculations.

All the approximations and calculations for time can be checked once again. This is also a point of discussion with the representatives.

Another source of information about the models should be found. The reason is that no new models are added to the current source, nor to any of the initially checked sources.

A new task execution mechanism using a real queue should be implemented. It could use another language such as Python together with RabbitMQ, or a similar combination.

Using APIs helps a lot.
If APIs become available for most of the external services, this will remove or at least minimize the errors the system makes when parsing data.

When a stable version of the 3.x branch of CodeIgniter is released, which may happen soon, migration may be considered.

Site visitors may want to see the current rating of the cars based on what they have already selected. This may be added the way Motoragent has done it, on the right side of the website, or as a bar above the tabs on this page.

Queries to the database can be optimized so that the SQL join to the images table uses one field only. Also, when filtering data for a value that is >= 0 OR = 0, the query may keep the >= 0 condition only. This type of optimization can be made, although it would not affect the query’s performance.