1. Sorting out the indicator system from top to bottom
1. Determine the target
This is the first question to ask yourself: what is the ultimate goal of all this effort on data analysis? If that is not clear, the data system is bound to lose its way.
Do you want to increase user activity, grow the user base, boost sales, or achieve some other goal? On reflection, the answer is often "all of them." Hoping for everything sounds harmless, but it expands the scope of the work without limit and makes it impossible to move forward. So start with the single goal/KPI you care about most.
So, what is our most important goal?
The answer differs by industry, stage, and role. For many company bosses, profit is the goal they care about most; for organizations or governments that do not sell products or services, it may be customer satisfaction; for trading platforms or early-stage e-commerce companies, profit is not the focus, and transaction volume matters most.
Once the top goal is set, can we then answer every question we care about? Not quite. The biggest misconception big data brings is that more data and more fields are always better. In practice, to solve a specific business problem we must carve a relevant subset out of the complete data set.
No single person, whether the boss or a senior manager, should track too many goals/KPIs at once; watching dozens of KPIs simultaneously is dizzying and time-consuming. Yet an enterprise genuinely has many important KPIs. What to do? Divide them among people: different roles work together, each role watches its own goals, and together the roles cover the company's complete set of goals/KPIs.
Suppose the boss's top goal is profit, and profit = revenue − cost. This goal can then be decomposed: the sales director owns revenue and the operations director owns cost. That does not mean the boss may never look at revenue; it means each person's routine focus is kept within a manageable range.
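The decomposition above is simple arithmetic, but it is worth seeing how each role's number rolls up into the boss's goal. A minimal sketch with hypothetical monthly figures (all numbers are illustrative, not from the source):

```python
# Hypothetical monthly figures (illustrative only).
revenue = 1_200_000.0    # owned by the sales director
cost = 900_000.0         # owned by the operations director

# The boss's top-level goal is the difference of the two owned metrics.
profit = revenue - cost
```

Each director can drill into their own number without the boss having to watch every sub-metric.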
2. Decompose the indicators
With the goal determined, the next step is to decompose it into related indicators.
Which indicators are needed to monitor or analyze the goal? Take profit: the related indicators are revenue and cost. Of course, that is too coarse, so consider what kinds of revenue and cost. For example, retail sales can be decomposed into foot traffic, store-entry rate, purchase rate, average spend per customer, and repurchase rate.
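The retail decomposition is a multiplicative funnel, which makes it easy to see which lever moves sales. A minimal sketch, with all rates and counts invented for illustration:

```python
# Hypothetical daily funnel for one store (all numbers illustrative).
foot_traffic = 1000      # people passing the store
entry_rate = 0.30        # share of passers-by who walk in
purchase_rate = 0.20     # share of visitors who buy
avg_spend = 50.0         # average spend per buyer

visitors = foot_traffic * entry_rate    # people entering the store
buyers = visitors * purchase_rate       # people who purchase
sales = buyers * avg_spend              # decomposed sales figure
```

Because sales is a product of the funnel terms, a 10% lift in any single rate lifts sales by the same 10%, which is why each indicator is worth tracking on its own.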
There are many ways to decompose, and they should follow the MECE principle (mutually exclusive, collectively exhaustive).
3. Refine the fields
For each indicator's calculation formula, work out which fields are involved, which tables in which databases they live in, whether the data needs cleaning, and what the cleaning rules are.
For example, the purchase rate is calculated as "number of buyers / number of people entering the store", and the number of buyers is obtained by counting distinct customer IDs. Which fields in which database tables these indicators map to must be sorted out, which requires the involvement of IT staff or database administrators.
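The "count distinct customer IDs" step matters: a repeat buyer must be counted once, not once per order. A small sketch over invented visit records (field names `customer_id` and `order_id` are assumptions for illustration):

```python
# Simulated raw rows: one record per store entry; order_id is None if no purchase.
visits = [
    {"customer_id": "C001", "order_id": "O1"},
    {"customer_id": "C002", "order_id": None},
    {"customer_id": "C001", "order_id": "O2"},  # repeat buyer: counted once
    {"customer_id": "C003", "order_id": None},
]

entered = {v["customer_id"] for v in visits}                  # distinct visitors
bought = {v["customer_id"] for v in visits if v["order_id"]}  # distinct buyers
purchase_rate = len(bought) / len(entered)
```

Here three distinct people entered and one distinct person bought, so the rate is 1/3, even though there were two purchase records.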
4. Non-functional requirements
After the third step above, the indicator system is essentially sorted out and could be implemented. But to make the final data system more complete, friendly, and usable, some non-functional requirements still need to be worked through.
UI: Display style may not seem to matter, but users deal with the data system every day, and an attractive, well-designed UI makes them far more willing to use it.
Page flow: which related indicators belong on the same report page, what the hierarchy between pages is, and how users navigate between them.
Permissions: who can see which data ranges, fields, and indicators. Unified permission control is needed to avoid data security problems.
ETL: what is the frequency and schedule of data synchronization from the data sources to the analysis system?
Integration: whether the system needs to integrate with other systems via interfaces or alert messages.
Performance: invisible and intangible, but it directly determines whether the system is usable. If, with large data volumes, results take minutes or even tens of minutes to appear, no one will be willing to use the system.
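Of the requirements above, the permission item is the one most often sketched as a rule table. A minimal illustration of row-level (data range) and column-level (field) filtering; the roles, regions, and field names are invented for this sketch, not taken from the source:

```python
# Hypothetical role permissions: which rows (regions) and columns (fields) each role sees.
PERMISSIONS = {
    "sales_director": {"regions": {"North", "South"}, "fields": {"region", "revenue"}},
    "analyst":        {"regions": {"North"},          "fields": {"region"}},
}

rows = [
    {"region": "North", "revenue": 100.0},
    {"region": "South", "revenue": 80.0},
    {"region": "West",  "revenue": 60.0},
]

def visible(role, data):
    """Apply row filtering by region and column filtering by field, per role."""
    p = PERMISSIONS[role]
    return [
        {k: v for k, v in r.items() if k in p["fields"]}
        for r in data
        if r["region"] in p["regions"]
    ]
```

For instance, `visible("analyst", rows)` returns only the North row and hides the revenue column entirely.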
5. System implementation
With the four items above completed, we have a requirements document/implementation plan for the data operation system, and implementation can begin. Workload and schedule are then estimated from the number of report pages and the complexity of data preparation.
2. Implementing the BI system from bottom to top
1. Connect the data
Build the system step by step according to the requirements document/implementation plan. Some enterprises call this system a big data platform, others a BI system. A big data platform has a broader scope, but for enterprise data operations, BI is always the core component.
Whether through in-house development or rapid implementation with a third-party tool such as Yonghong Technology's, the first step of system construction is to connect the various data sources and open the channels to them.
In enterprises the data environment is often heterogeneous: sources may include relational databases, Hadoop platforms, Excel files, log files, NoSQL databases, third-party APIs, and so on. Each source needs a fast and friendly way to connect.
In the end, all the tables and fields from every data source we need are visible in the system.
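Connecting heterogeneous sources just means opening a channel to each and exposing its tables and fields. A minimal sketch using two stand-ins from the Python standard library: an in-memory SQLite database for a relational source and an in-memory string for a CSV export (the table and column names are invented):

```python
import csv
import io
import sqlite3

# Relational source: in-memory SQLite standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('O1', 120.0), ('O2', 80.0)")

# File source: a CSV export, held in a string for this sketch.
csv_text = "customer_id,region\nC001,North\nC002,South\n"
customers = list(csv.DictReader(io.StringIO(csv_text)))

# With both channels open, the system can enumerate rows from either source.
order_rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
```

A real BI tool hides this plumbing behind connectors, but the idea is the same: each source type gets its own adapter, and downstream steps see uniform tables.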
2. Data processing
The data in these sources is almost always irregular to some degree: duplicate records, missing (null) values, obviously unreasonable outliers (for example, an order recorded as closed in 2020, a date that had not yet arrived), and the same entity appearing under multiple names in the system.
If this data is not processed (that is, cleaned), it will badly hurt the accuracy of the analysis, so some preprocessing is needed. This step is often the most time-consuming and tedious, but it is also crucial.
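The four problems just listed (duplicates, nulls, implausible values, inconsistent names) each map to a simple cleaning rule. A minimal sketch over invented records; the field names, the reference date, and the normalization rule are all assumptions for illustration:

```python
from datetime import date

# Invented raw records exhibiting the four problems described above.
raw = [
    {"customer": "ACME Ltd",  "amount": 100.0, "closed": date(2019, 5, 1)},
    {"customer": "ACME Ltd",  "amount": 100.0, "closed": date(2019, 5, 1)},  # duplicate
    {"customer": "acme ltd.", "amount": None,  "closed": date(2019, 6, 1)},  # missing amount
    {"customer": "Beta Inc",  "amount": 50.0,  "closed": date(2030, 1, 1)},  # implausible future date
]

TODAY = date(2020, 1, 1)  # reference date for the outlier rule in this sketch

def normalize(name):
    # Collapse naming variants ("ACME Ltd" vs "acme ltd.") onto one canonical key.
    return name.lower().rstrip(".").strip()

seen, clean = set(), []
for row in raw:
    key = (normalize(row["customer"]), row["amount"], row["closed"])
    if key in seen:
        continue              # rule 1: drop exact duplicates
    if row["amount"] is None:
        continue              # rule 2: drop rows missing a required field
    if row["closed"] > TODAY:
        continue              # rule 3: drop implausible dates
    seen.add(key)
    clean.append(row)
```

Real pipelines add many more rules, but each one is this shape: a predicate plus a decision to drop, fix, or flag the row.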
A note from the author: the issues in this step will be discussed further in a follow-up article on data governance methods.
3. Data modeling
After data processing, the next step is to do data modeling.
When modeling comes up, users without a technical background tend to be intimidated and assume it is hard to understand. What does modeling actually produce? Simply put, associating multiple tables together yields a data model.
For example, suppose a company wants to do performance analysis and needs indicators such as each employee's length of service, education, number of projects, project amounts, and project profit rate. Length of service and education sit in the personal-information table, project count and amount in the project table, and profit rate in the finance table. All three tables share the field "employee number", and relating the three tables through this field produces a data model.
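The three-table example above can be sketched as a join on the shared employee-number key. The tables are modeled here as dictionaries keyed by `emp_no`; all field names and values are invented for illustration:

```python
# Three source tables keyed by the shared field "emp_no" (illustrative data).
personal = {"E01": {"tenure_years": 5, "degree": "MSc"}}
projects = {"E01": {"project_count": 3, "project_amount": 90000.0}}
finance  = {"E01": {"profit_rate": 0.18}}

# Relating the tables on emp_no yields one wide record per employee:
# this merged structure is the "data model" in the sense used above.
model = {
    emp: {**personal[emp], **projects.get(emp, {}), **finance.get(emp, {})}
    for emp in personal
}
```

In a BI tool this is drawn visually by dragging a line between the shared fields, but the effect is exactly this join: every indicator for an employee becomes reachable from one record.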
4. Build data reports