DOP8: Merging both data and analysis operators life
Transcription
DOP8: Merging both data and analysis operators life
DOP8: Merging both data and analysis operators life cycles for Technology Enhanced Learning Nadine Mandran, Michael Ortega, Vanda Luengo, Denis Bouhineau LIG, University of Grenoble 1. Domaine universitaire, BP 46 - 38402 Grenoble Cedex (France) [email protected] ABSTRACT This paper presents DOP8: a Data Mining Iterative Cycle that improves the classical data life cycle. While the latter only combines the data production and data analysis phases, DOP8 also integrates the analysis operators life cycle. In this cycle, data life cycle and operators life cycle processing meet in the data analysis step. This paper also presents a reification of DOP8 in a new computing platform: UnderTracks. The latter provides a flexibility on storing and sharing data, operators and analysis processes. Undertracks is compared with three types of platform ’Storage platform’, ’Analysis platform’ and ’Storage and Analysis platform’. Several real TEL analysis scenarios are present into the platform, (1) to test Undertracks flexibility on storing data and operators and (2) to test Undertracks flexibility on designing analysis processes. Categories and Subject Descriptors J1 [ADMINISTRATIVE DATA PROCESSING/Education], H.2.8 [DATA APPLICATIONS/Datamining, statistical databases], K3 [COMPUTER AND EDUCATION], D.2.9 [MANAGEMENT/Life cycle]. General Terms Design, Management, Human Factors, Experimentation Keywords process analysis, data life cycle, operators life cycle, flexibility, sharing, computing platform. 1. INTRODUCTION. Baker and Siemens [9] underline the importance of increasing the opportunities for collaborative research and sharing of research findings between both Educational Data Mining (EDM) and Learning Analytics Knowledge (LAK) communities. One center of interest for both communities is the production and the analysis of data. The two communities have to cope with the data volume increase, which is becoming an issue for data production. Moreover, the two communities produce data and operators in different ways: EDM produces analysis operators that focus on "automated adaptation, by computer with no human in the loop", while LAK focuses on "informing and empowering instructors and learners" [1]. Currently, data and operators are produced in separate ways, by different communities. The two data life cycles are then also separated. This separation increases the difficulty in sharing them, and in providing efficient analysis and relevant results. One solution to improve collaboration between both communities is to provide a structure for sharing data as well as operators and analysis processes. Combining production and data © 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. LAK '15, March 16 - 20, 2015, Poughkeepsie, NY, USA Copyright 2015 ACM 978-1-4503-3417-4/15/03…$15.00 http://dx.doi.org/10.1145/2723576.2723580. analysis in the same structure means taking into account the two data life cycles. Roughly speaking, the resulting data life cycle must consist of three major steps: the design of the study protocol, the data production and the data analysis In this paper, we address this issue by providing a new data mining iterative cycle: DOP8, which combines both data’ life cycle and operators’ life cycle. This new cycle is reified on a platform called UnderTracks (UT). The next section describes related work. The core of our contribution: respectively the DOP8 cycle and the UT platform are presented in section 3 and 4. In section 5, the UT platform is compared to five other platforms, and section 6 discusses about our proposition. 2. RELATED WORK. 2.1 Data processing schemes Many data processing are available, and their terminology and organization are specific to both the activity sector and the business process they are used for. We focus on three of them, two of which are designed for educational data mining and one which is designed for social sciences. The cycle for applying data mining in educational systems presented by Romero and Baker [8] proposes two main steps: a data production step, closely linked to educational systems, and a data mining step, which combines data mining operators for showing discovered knowledge to the academics responsible or the educator, and recommendations to the students. The processing proposed by Stamper et al. [10] consists of six steps: Data design, Data collection, Data Analysis, Publish results, Data archiving, Secondary Analysis (see Figure 1). The latter step clearly identifies the need for data reuse. However, we consider that secondary analysis is not a step in a data life cycle, but rather another goal. Moreover, Stamper et al. indicate a difficulty of reusing data especially if the metadata are not sufficient. In their cycle the Data Archiving takes place after the Data Analysis step. We consider that to ensure efficient data sharing and reuse metadata should be created at each step of the data life cycle. This is a mean of keeping all information about data process. Furthermore, in this cycle the pre-processing step recommended by Romero et al. [7] is not presented. Figure 1: Data life cycle described by Stamper et al [10] Regarding social sciences and humanities, the UK.DATA ARCHIVE proposes data processing in the form of a data “life” cycle [2]. This cycle includes six steps: Creating, Processing, Analyzing, Preserving, Giving access, Reusing. As with the Romero and Baker cycle [8], the idea of a cycle is interesting since analysis results can lead to other new issues. This cycle can be described from a higher level (lower granularity), by grouping steps: three steps correspond to the data process itself (creating, processing and analyzing), while the other three steps correspond to dissemination issues (preserving, accessing and reusing). It can also be described from a lower level (higher granularity). After this review, we identify two shortcomings: (1) The combination between the data processing and the operators processing is not planned, (2) The platforms are not flexible enough to integrate and disseminate data, operators and analysis processes. To address these shortcomings, we propose DOP8 and we design a structure that enhances the flexibility. Both points are described in next sections. 3. DOP8 Compared to the latter processing, the first two do not include or do not detail the pre-processing step, although this step is important for controlling and enriching the data. These three processing only focus on data, although the data analysis step also involves analysis operators. As these operators are usually developed through three main steps: Design, Development and Validation, we consider that operators have their own life cycle, not identified in these three schemes. 2.2 Educational Data and DataMining Platforms We classify existing datamining platforms into three categories: (1) ‘Storage platform’: built to store data and metadata. (2) ‘Analysis platform’: built to analyze data with statistics or data mining operators. (3) ‘Storage and Analysis platform’: Mixed platform allow data and operators to be combined. Verbert et al. analyse three platforms: dataTEL [12], DataShop [5] and Mulce [6]. While dataTEL and Mulce are ‘Storage platforms’, PSLC DataShop1 is a ‘Storage and Analysis platform’. They highlight the strong relationship between the reviewed platforms and the research questions. DataShop defines a specification for describing datasets that are derived from intelligent tutoring systems. This platform then strongly orients the research questions around “prediction of learner performance and discovering learner models”. In Mulce, the main research topic is “Enhancing social learning environments”. With these platforms it is then difficult to use the same dataset for other research questions. Tin Can [15], 'Storage platform', allows flexibility thanks to their basic format. This approach proposes collecting a Learning Record Store (LRS) in RDF Format. However, the analysis operators are not collected or diffused, and the research questions are not considered. Regarding the ‘Analysis platform‘, platforms like RapidMiner [13] or Orange [3] are dedicated platforms to specialists. EDM contains a wide range of specific data mining algorithms [8] that are not shared in an easy way in these platforms and they are not dedicated to TEL researches. DataShop, the only existing ‘Storage and Analysis platform’ proposes operators linked to a specific data type: ITS data. This platform is mainly dedicated to research works on students’ knowledge and their relation to other domains like collaboration. For other research questions, DataShop proposes two solutions: (1) a web-services approach for accessing data, and (2) a storage space to disseminate externally developed operators. In relation to the second point, these solutions are “external” solutions, i.e. it is not possible to associate both, data and external analysis operators in an easy way. 1 https://pslcdatashop.web.cmu.edu/ResearchGoals?typeId=all Figure 2: DOP8: Data life cycle and operators life cycle processing meet in the data analysis step. The name ‘DOP8’ means: 'Data, Operators and Processes’ combined into a double cycle. Two sub-cycles make up the DOP8: a cycle describes the data processing and one other describes the operators processing. Each cycle is split into several steps. A step is defined with a verb that describes the main actions of the step. At each step, metadata are generated and stored to describe the tasks and their results. The metadata can also include the balance sheet of the task, which describes the weakness and the strength of each task. Table 1 and 2 presents the goals and the expected results of both data cycle and operators cycle steps. Table 1. Goals and expected results for each step of the data cycle. Step Prepare Goals Design the study to address the research question. Collect Collect the raw data with or without computer artefacts. Store them with metadata. Validate the raw data to ensure the coherence and relevance of data. Store it with metadata. Convert validated data to enriched data. Store them with metadata. Validate Enrich Expected results Study protocol and description of raw data and metadata for storage Raw data, metadata and balance sheet of study. Validated data and metadata on validation Enriched data and metadata on enrichment Table 2. Goals and expected results for operators’ cycle. Step Design Develop Validate Goals Explore and specify the new operators for analysing data Develop the operators from specifications Validate and store the operators for dissemination Expected results Specifications Operational operators Validated operators and metadata on operators As presented in Figure 2, the DOP8 cycle proposes to intersect the two cycles at one step: ‘Analyse’. It combines data and operators for creating analysis processes, and each cycle benefits from execution and results of this step. The expected results can be: (1) interpretation of the results analysis in relation to the research questions, (2) processes analysis, (3) balance sheet of the study, and possibly new research questions, (4) balance sheet of the used operators, and if applicable improvement of them or creation of a new one. DOP8 provides a framework to guide the data analysis and operators development by considering and combining both data and operators life cycles. To instantiate this framework, we developed a platform, which allows users (1) to store data, operators and analysis processes consistently in the same environment, and (2) to combine the two cycles for providing flexible data analysis. This platform is described in the next section. 4. UNDERTRACKS (UT) Taking into account our work on mandatory metadata, only two tables are mandatory: “Description” and “Events”. Some data are mandatory for describing the features of study and events. In the Description table, the mandatory fields are: Fields Name of study Period of study Authors Countries of study Agent types Description The name of the study. E.g. a date, a year, a semester, … Names and emails Country in which the study is conducted Agents can be of different types (e.g. student, tutor, group, system, simulator). Numbers of agents The number of agents involved Domain topics E.g. Chemistry, Mathematics Production mode: Data are designed with a precise protocol with or without for addressing a research question or data study protocol are produced on the field by the students or teachers. In the Events table the mandatory fields are: UT is an instantiation of DOP8 and is dedicated to TEL researchers. It proposes two main pieces of software: UTP, dedicated to the Production of data and operators, and UTA, dedicated to the Analysis step. UTA allows visual construction of the analysis processes. UTP and UTA are independent, they interoperate. Fields Timestamp Agent Action 4.1 Definitions UTP provides two ways for storing raw data into its database: (1) online method, the TEL system is directly connected to UTP; and (2) offline method, the TEL system saves data into files that are downloaded into the UTP database afterwards. An experiment that sets up a DOP8, linked with one or more TEL systems, is called a “study”. A study involves raw data that is based on “event” logs: temporally located information. An event log usually contains the entity responsible for the log, called an “agent”, that could be a TEL user or the TEL system itself. The event log also contains an “action” To resume, an event log is a temporally located information that describes an action from or between agents. Each algorithm that could be applied on data is called an “operator”. An operator is an entity that takes input execution and the results and can provide output data. Operators can be chained from a raw data to a final result, and this “workflow” is called an analysis “process”. 4.2 Mandatory Metadata and Fields on UT Into a ‘Storage platform’, global descriptions of data at a high level are essential for reusing and sharing data. These descriptions are usually called ”metadata”. To create useful metadata for which the time for creation time is not time-consuming [4], we define a set of mandatory metadata and a set of mandatory fields (in the sense of a database table). The mandatory metadata describe the study (see 4.3). The mandatory fields are the minimum requirements for storing data, operators and processes into UT to ensure dissemination. Several interviews conducted with researchers and computer engineers, and based on TEL researchers’ expertise, have defined these sets. They have been tested with several studies. Consequently, UT is designed with a limited number of mandatory metadata and fields. 4.3 UT Production (UTP) UTP is dedicated to the data and operator production. It allows UT users to store data, operators and processes and to document them with metadata. For storing data, five tables are available: 1Description table: for storing the study description, 2- Events table: for storing the events, 3- Agents table: for describing the TEL agents, 4- Context table: for describing the context of the study, 5 - Actions table: for describing the actions. Description Date of the event or an ordered value. Agent that produces events (usually anonymized). The action produced by the corresponding agent. 4.4 Two ways to import data 4.4.1 Operators UTP provides an interface for storing operators. Each operator has two kinds of descriptions: 1. A technical description. This allows both the UTA interface and the UTA users to know how the operator works and how it has to be connected with data or other operators, 2. A usage description (operator documentation). This allows UTA users to know how and why to use it. In the technical description, all the fields are mandatory. They mainly describe (1) the input and output data format, and (2) the parameters that modify the operator’s behavior. In order to ensure that operators can be used on UT data, the technical description is closely linked to the data format previously described. In the current version of UT, the source code of an operator can be in Java, C++ or Python. In the ‘Operators table’ the mandatory fields are: Fields Name Category Description Owner Description Name of the operators Data management, Data mining, Visualization, Statistics, or Others. Description of the operator’s functionalities Names and mails of process’s owners 4.4.2 Processes Analysis processes are created from UTA (see 4.5). Two kinds of data are then stored: (1) The process itself: the process file consists of a list of operators and data names, with the description of their links, (2) A process description: created using four mandatory fields. In the ‘Processes table’ mandatory fields are: Fields Description Name Name of the process Description The description of the process goals Names and mails of process’s owners Whether or not the process is specific for the TEL researches/data. 4.5 UT Analysis (UTA) It is a Java application, executed on the client side. This application is connected to UTP, where it accesses to the current state of the data, operators and processes description bases. A UTA user can then graphically connect data and operators to create a visual workflow (see Figure 4). Once the user decides to “run” the workflow, UTP sends a textual description of the process to an engine on the UT server that executes the workflow, and stores the intermediate. Once the execution is complete, the user can consult the final and intermediate results. According to user changes on the workflow description, such as for instance the raw starting data or simply a parameter that influences the behavior of one operator’s algorithm, the workflow is respectively, completely or partially re-executed. As UT is designed to share and reuse data and processes, UTA provides a tool for describing and storing new processes into the UT database. Users can also download existing processes from UTP in order to execute or modify them. articles or any additional data describing a study, can be stored too. The large diversity of these studies illustrates the flexibility of the UT data structure. Except for the mandatory fields, UT users create their own structure, in terms of number and name of fields. The ability to store data from different research domains, different type of agents, and different sources of data is then possible in the same database. Moreover, UT currently stores datasets from 8 TEL systems developed in our team TEL systems, as well as from 2 external TEL systems: Tamago cours from Lyon research TEAM EducTice and Moodle. Data from 14 studies have been uploaded from text files and 6 systems are directly connected to UTP. Table 3. Major differences between DataShop and UnderTracks. Data Owners TEL_specific MULCE Tin Can Storing RapidMiner Orange Analysis DATAShop UnderTracks Operators 5. UnderTracks vs OTHER PLATFORMS Mixed To compare UnderTracks with the Datamining platforms presented in the related work, we use the DOP8 cycle. In Figure 3, we then reduce DOP8 to its silhouette, and we blacken the boxes, which symbolize the steps that the relevant platforms propose. Mulce and TinCan are specialized in collecting and storing the data. In Mulce the metadata descriptions are detailed, in particular for metadata about the study context. However, neither platform provides guidelines for the ‘Design’ step, in the sense that there is no online form to guide the design of a study. Neither platform integrates any step from the operators life cycle. RapidMiner and Orange are specialized in the operators storing and in data analysis. They provide a visual programming tool for processes. Neither platform integrates any step from the data life cycle. DataShop and UnderTracks store data and operators. Both platforms provide a space to analyze data. But, there are differences between the two platforms, the most important of which are presented in Table 3. Among the existing platforms, Undertracks is a mixed platform, such as Datashop. The major difference is the ability of UnderTracks to integrate data and operators with a degree of flexibility and to allow UT users to build their own analysis processes by combining data and operators. 6. UNDERTRACKS FLEXIBILITY 6.1 Diversity of data stored in UTP The UT data format is flexible enough to store datasets from different learning domains (already 5 domains: Biology, Maths, Physics, Medicine and Computer sciences) and different type of agents (already 4 agent types: students, Tutors, groups of students, system). UTP stores activity logs, but also heterogeneous data from specific systems (already 5 data types: Logs, Annotations, Scores, Eye tracker, Simulator). Additional files, such as papers, Process Figure 3: Black boxes represent the steps the platform proposes. DataShop UnderTracks Specific educational data, produced by students interactions especially with ITS systems All kind of TEL data, (e.g. experimental annotations, tracks, haptic and eye tracker) from several actors: students, teachers, and systems (e.g. simulator, ITS, LMS) Database generic operators (filter, select), and specific operators, such as pattern visualization showed in Figure 8 But also any kind of operators, Data Mining (e.g. Weka clustering), statistical (e.g. crosstab and chi square test) Database generic operators (filter, select), and specific operators (e.g. learning curve [11]) linked with data stored with DataShop data structure. Store external operators but cannot use them into the analysis part of the platform Use the pre-designed processes. Use of pre-designed processes and ability to design new ones by combining data, operators and existing processes Figure 4: Screenshot of a process created and displayed by UTA. The blue box is the starting data, while the green and the yellow boxes are algorithmic and visualization operators respectively. The numbers in the blacken box indicates the different steps in this analysis process. 6.2 Flexibility of the analysis processes UTA combines data and operators to design a process. A user can combine different operators in the same process, either once or several times. He can also combine different data tables, once or several times each tabl;e. UT is then flexible in that each element of the process can be replaced. This kind of flexibility can be equated with the reuse of operator and analysis processes on several datasets (see 2.2). For instance, we first tested the process on a dataset from the Copex-Chimie TEL system (agents are students, actions are chemical manipulations and lessons readings) and then on a dataset from the PSLC DataShop platform: Geometry Area 96 [14] The goal of this analysis process is to reveal the agent strategy by visualizing specific action sequences from each agent. The analysis is split into 4 steps: (1) Visualization of action to explore the raw data; (2) Computation of the action frequencies to select the relevant actions addressing the research question; (3) Creation of patterns with relevant actions, computation of frequencies, selection and renaming the relevant patterns ; (4) Visualization of the relevant patterns sequence for visually analyzing the agents’ strategies. This analysis process (see Figure 4) combines one event data table and seven operators from UnderTracks. Only five different operators are used: three visualization operators and two pure algorithmic operators. The visualization operators are used as leaves of the process workflow, but interactive visualization could allow these operators to be used also as roots. Today, as UnderTracks is used by an increasing number of researchers (in our team but also by one external team), its number of studies and operators grows fast. Given this situation, we are currently working on integrating quality indicators for data sharing. Another challenge is to increase the number of DOP8 steps integrated by UnderTracks, and more precisely the “prepare” step. One way to simplify and shorten this work is to guide the TEL researcher. Guiding could start when the researcher elaborates the study protocol, and could be designed in relation to the researchers' analysis practices, needs and expectations. To investigate this subject, an analysis is conducted with researchers and data analysis experts. Also, we are currently improving UTA by interoperating with Orange [3] an existing datamining platform. 8. AKNOWLEDGMENTS This research has been partially supported by HUBBLE ANR GRANT number ANR-14-CE24-0015-01, and MOCA ANR, UPMF-9522000392. 9. REFERENCES [1] Baker, R.S., Corbett, A.T., Koedinger, K.R. and Wagner, A.Z. 2004. Off-Task Behavior in the Cognitive Tutor Classroom: When Students “Game The System”. Proceedings of ACM CHI (2004), 383–390. [2] Bishop, L. 2011. UK Data Archive Resources for Studying Older People and Ageing. (2011). [3] Demsar, J., Curk, T. and Erjavec, A. 2013. Orange: Data Mining Toolbox in Python. Journal of Machine Learning Research. 14, (2013), 2349–2353. [4] Duval, E. 2001. Metadata standards: What, who & why. Journal of Universal Computer Science. 7, 7 (2001), 591–601. [5] Koedinger, K.R., Baker, Rsj., Cunningham, K., Skogsholm, A., Leber, B. and Stamper, J. 2010. A data repository for the EDM community: The PSLC DataShop. Handbook of educational data mining. 43, (2010). Figure 5: Visual results of the same analysis process applied to two different studies. The visual results of both analyses are presented in figure 5. Each line shows the action sequence of one student. Each color bar shows an action, where the colors distinguish the different types of actions. For each study, the first graph shows the action sequences, the second graph shows the sequence of the patterns built with relevant actions. They can compare the differences between student behaviors, in relation to the sequences of actions or patterns. This process was first co-designed for the CopexChimie TEL system, with TEL researchers and a statistician. The first designs and tests were conducted using Microsoft Excel, and took several months. The corresponding operators were developed and stored into UT. Once this integration was complete, it took only a few minutes to create the process with UTA. All the efforts made in constructing the process for the first dataset were quickly reused for the second dataset (about 20min). In this case, the research question was similar for both studies, meaning that the process can be reused on different data. Furthermore, for a different research question, the operators of this process can be reused separately, in another order or combined with other ones. 7. CONCLUSION DOP8 integrates both data and operators life cycles in the same double loop cycle. It has been implemented in a new platform: UnderTracks. Based on the integration of a wide panel of TEL studies in UT, we illustrated the positive properties offered: flexibility and sharing of data, operators and analysis processes. The flexibility of UT is also evaluated by using the same analysis process on two different datasets. [6] Reffay, C., Betbeder, M.-L. and Chanier, T. 2012. Multimodal learning and teaching corpora exchange: lessons learned in five years by the Mulce project. International Journal of Technology Enhanced Learning. 4, 1 (2012), 11–30. [7] Romero, C., Romero, J.R. and Ventura, S. 2014. A Survey on Pre-Processing Educational Data. Educational Data Mining. Springer. 29– 64. [8] Romero, C. and Ventura, S. 2007. Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications. 33, 1 (2007), 135–146. [9] Siemens, G. and Baker, R.S. 2012. Learning analytics and educational data mining: towards communication and collaboration. Proceedings of the 2nd international conference on learning analytics and knowledge (2012), 252–254. [10] Stamper, J.C., Koedinger, K.R., Baker, R.S.J. d, Skogsholm, A., Leber, B., Demi, S., Yu, S. and Spencer, D. 2011. Managing the Educational Dataset Lifecycle with DataShop. Artificial Intelligence in Education. G. Biswas, S. Bull, J. Kay, and A. Mitrovic, eds. Springer Berlin Heidelberg. 557–559. [11] Stamper, J. and Koedinger, K.R. 2011. Human-machine student model discovery and improvement using data. Proceedings of the 15th International Conference on Artificial Intelligence in Education. (2011). [12] Verbert, K., Drachsler, H., Manouselis, N., Wolpers, M., Vuorikari, R. and Duval, E. 2011. Dataset-driven research for improving recommender systems for learning. Proceedings of the 1st International Conference on Learning Analytics and Knowledge (2011), 44–53. [13] http://rapidminer.com/. [14] https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76. [15] http://tincanapi.com/overview/.