November 16, 2017
Hongxia Yang, Director & Senior Staff Data Scientist, Alibaba Group
What is Data Science? And what is Data Science in a leading global company? Let me share a few thoughts based on my industry experience over the past few years and my current job at Alibaba. We have all heard of this company as an e-commerce giant, but Alibaba is no longer just an e-retail store: it has been building up its data ecosystem quickly (see Figure 1). Perhaps other global companies have similar systems. Alibaba is making the switch to being a “data company” because, in a nutshell, “Data is the blood of the new economy,” as Alibaba Group CEO Daniel Zhang explained recently. Alibaba collects massive amounts of data on its e-commerce marketplaces, through the mobile wallet Alipay, and from the digital entertainment sites and social media properties operating within its ecosystem. We use data to fuel our own business and the participants in our ecosystem, helping our partners do business more easily anywhere in this fast-changing digital world. Effectively mined and analyzed, data enables businesses, large or small, to better understand markets and consumer behavior, improve products and user experiences, and even anticipate the wants and needs of individual customers.
I am currently leading the algorithm team of the data enabling platform. Here are some of our activities that are driven by leading technologies, which our group must be familiar with:
- Real-time computing: Our 11.11 data screen, with latency < 5 s, QPS > 100 million/s, BPS > 100 GB/s;
- Off-line computing: Daily dashboard finished before 7 a.m., key data finished before 9 a.m.;
- On-Line Analytical Processing (OLAP): Only a few seconds to respond to an instant query over hundreds of millions of records;
- Data service for employees: ALIDATA is the unified data platform, serving more than 10k employees every week;
- Data service for sellers: Business Advisor, the unified data platform for sellers, provides one-stop data service;
- Data service for consumers: Data services exported to search and personalized recommendation to enhance the user experience.
To empower the different areas of the digital markets, we focus on developing algorithms in the following specific areas and implementing them on our algorithmic platform (Figure 2). This may be useful information for people training in statistics or data science.
- OneID: Internet user identity and tracking technology. Through extremely large-scale graphical model algorithms, and with cleansing and integration of multi-sourced heterogeneous data, we aim to achieve entity identification at the 100-billion level and link prediction of users’ global behavior characteristics.
- GProfile Factory: Global consumer-based profiles. We rely on the payment, location, audio-visual entertainment, credit and other data of users across Alibaba’s ecosystem to build a comprehensive global data collection and analysis system. Combining machine learning and deep learning frameworks enables us to build a set of automated label production systems that are successfully deployed in many business scenarios.
- OneGraph: Technology for distributed large-scale knowledge graph construction, computing and inference. We aim to build intelligent knowledge graphs and Artificial Intelligence (AI) related technologies to manage and operate big data. Through the construction of the business knowledge graph, we can also support merchants in innovative business applications and business decisions.
- One Prediction: Ensemble framework for time-series prediction. We support various cost and revenue prediction needs with the automatic prediction system.
- Fraud Detection: AI-supported intelligent fraud detection system. Successful deployments include mobile application channel promotion settlement, OneID data cleaning, traffic, Youku and other fraud detection business scenarios.
- Cloud Technology: Rapidly innovate with a best-of-breed platform and robust catalog of products. We aim to construct applications from bite-sized business logic billed to the nearest 100 milliseconds, which makes possible fully serverless models of computing where logic can be spun up on demand in response to events originating from anywhere.
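For readers training in statistics or data science, the identity-linking idea behind OneID can be illustrated with a toy example. The sketch below is not Alibaba's algorithm, which the text describes only as large-scale graphical models. It is a minimal union-find illustration that assumes we are given pairs of identifiers observed together. The function name `link_identities` and the identifier strings are hypothetical.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure used to merge identifiers into entities."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def link_identities(observations):
    """Group identifiers that are transitively linked by co-occurrence.

    `observations` is an iterable of (identifier, identifier) pairs,
    e.g. a cookie seen on a device. This input format is an assumption
    made for illustration only.
    """
    uf = UnionFind()
    for a, b in observations:
        uf.union(a, b)
    groups = defaultdict(set)
    for ident in list(uf.parent):
        groups[uf.find(ident)].add(ident)
    # Sorted output keeps the result deterministic.
    return sorted(sorted(g) for g in groups.values())


# Hypothetical identifiers from different data sources.
observations = [
    ("cookie:a1", "device:d9"),
    ("device:d9", "email:x@example.com"),
    ("cookie:b2", "device:d7"),
]
entities = link_identities(observations)
print(entities)
```

Here the first two observations share `device:d9`, so three identifiers collapse into one entity, while the third pair forms a separate one. A production system would have to weigh noisy or conflicting links rather than merge on every co-occurrence.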
Overall, equipped with advances in statistical, machine learning, data mining and deep learning models, we have laid the foundation to transform ourselves into a leading data commerce company that leverages big data and cloud computing to upgrade all of its businesses, ecosystem partners, and customers.
Lastly, I would like to share some of my thoughts on working in industry with really big data, where many of our usual (classroom) statistical modeling assumptions do not hold. Industry evolves very fast, and it requires us to deploy initial workable models immediately. Coding requirements are also a bit different from academia: languages and tools for large-scale data work, including Spark, Python and SQL, are must-learn. My team often has several projects going on simultaneously. My typical day is usually a mix of high-level modeling, system design and computer programming, meeting and interacting with my team members, preparing talks and presentations for my peers and senior management, email, and collaborative research both internally and externally. All of these increasingly require both technical and non-technical interaction.