Nowadays, more and more enterprises, universities and academic organizations are running data competitions of various kinds to identify outstanding talent in data science, to encourage participants to find breakthrough solutions for a particular data field or application scenario, and to leave valuable experience for future data researchers.
On GitHub, Smilexuhc has compiled the top solutions from data competitions, covering both pure data contests and contests in natural language processing (NLP). Readers interested in these events can browse this information-packed post together.
Pure data contest
Participants were asked to predict users' ad click probability with artificial intelligence techniques, based on massive advertising data from the iFLYTEK AI marketing cloud. The competition provides five types of data: basic advertising data, advertising material information, media information, user information and context information. There are 1,001,650 preliminary-round records and 1,998,350 semi-final records in total (the semi-final training data consists of the preliminary data and the semi-final data).
This competition takes Ali e-commerce advertising as its research object. Based on the massive real transaction data provided by the Taobao platform, participants are required to build a model that predicts users' purchase intention. The competition provides five types of data: basic data, advertised commodity information, user information, context information and store information. The preliminary-round data contains samples from several days. The data of the last day is used to evaluate the results and is not disclosed to the competitors, while the data of the remaining days is provided as training data.
Rank9 (Season 1): https://github.com/yuxiaowww/IJCAI-18-TIANCHI
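The evaluation setup described above for the Taobao competition (last day held out, earlier days used for training) can be imitated locally when validating a model. The following is only a minimal sketch under stated assumptions: the field name "day" and the toy records are hypothetical and do not come from the competition data.

```python
def split_last_day(rows, day_key="day"):
    """Split records into (train, holdout), where holdout is the latest day.

    `day_key` is a hypothetical field name; the real data may name or
    encode the date differently.
    """
    last_day = max(row[day_key] for row in rows)
    train = [row for row in rows if row[day_key] != last_day]
    holdout = [row for row in rows if row[day_key] == last_day]
    return train, holdout


# Toy records for illustration only.
rows = [
    {"day": 1, "user": "u1", "bought": 0},
    {"day": 2, "user": "u2", "bought": 1},
    {"day": 3, "user": "u1", "bought": 1},
    {"day": 3, "user": "u3", "bought": 0},
]
train, holdout = split_last_day(rows)
```

Holding out the most recent day, rather than a random sample, keeps the local validation closer to how the organizers score submissions.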
The problem for this algorithm contest comes from an advertising technology product based on real business scenarios.
Because business data must be kept secure, all data provided by the competition are desensitized. The data set is divided into a training set and a test set. The training set labels which users belong to the seed package and which do not (i.e. positive and negative samples). The test set checks whether a competitor's algorithm can accurately determine whether its users belong to the corresponding seed package; the seed packages corresponding to the training set and the test set are identical. The seed packages provided by the preliminary and semi-finals are different except for the order of magnitude.
Rank10 (preliminary contest): https://github.com/ShawnyXiao/2018-Tencent-Lookalike
The competition requires participants to predict which users will be active in the future, based on desensitized and sampled data. Teams need to design their own algorithms for data analysis and processing, and results are evaluated and ranked on online evaluation data according to the designated metric. The data provided are user behavior records after desensitization and sampling. Dates are numbered uniformly: the first day is 01, the second day is 02, and so on. All files are tab-separated.
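A file in the format described above (tab-separated fields, days numbered 01, 02, ...) can be read with Python's standard csv module. This is a rough sketch only: the column names "user_id" and "day", the absence of a header row, and the sample text are all assumptions made for illustration, not the actual competition schema.

```python
import csv
import io


def read_behavior_log(text, columns):
    """Parse tab-separated text into dicts and convert the day to an int.

    `columns` is a hypothetical list of column names supplied by the
    caller; it must include "day".
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = []
    for fields in reader:
        row = dict(zip(columns, fields))
        row["day"] = int(row["day"])  # "01" becomes 1, "02" becomes 2
        rows.append(row)
    return rows


# Made-up sample in the described layout.
sample = "u1\t01\nu2\t02\nu1\t03\n"
log = read_behavior_log(sample, ["user_id", "day"])
```

Converting the zero-padded day to an integer makes it straightforward to compute gaps between activity days or to window the data by date range.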
This competition gives participants the users who purchased the target commodity in the past three months, together with their browsing, purchasing and review records from the previous year. Participants design their own data processing and train models to predict which users are most likely to purchase the target commodity in the next month, and to predict their first purchase date within that period. The data mainly includes basic user information, basic SKU information, user behavior information, user order information and review information.
Based on real-time SCADA data from wind turbines, participants are required to build an early fault detection model for blade cracking using machine learning, deep learning, statistical analysis and other methods, so that blade cracking faults can be flagged early. The data set includes a training set and a test set: the training set has 40,000 samples covering 25 types of turbines, and the test set has 80,000 samples without turbine numbers.
Starting from the principles of photovoltaic power generation, contestants are required to examine the factors that affect photovoltaic output power, such as irradiance and the working temperature of the panels. A prediction model is built from real-time monitoring of the panels' operating state parameters and meteorological parameters to predict the instantaneous power generation of a photovoltaic plant. The predictions are then compared with the actual generation data from the plant's DCS system to verify the practical value of the model.
The competition provides 9,000 training samples and 8,000 test samples, including panel operating state parameters (backplane temperature, and the voltage and current of the photovoltaic array) and meteorological parameters (solar irradiance, ambient temperature and humidity, wind speed, wind direction, etc.).
Rank1: https://zhuanlan.zhihu.com/p/44755488?utm_source=qq (this solution can also be viewed on WeChat: "XGBoost + LightGBM + LSTM: a high-score model scheme in a machine learning contest")
8. AI Global Challenger Competition
This competition requires participants to build an accurate risk control model that predicts whether users will be overdue on repayment, based on the basic identity information, consumer behavior, bank repayment and other data of nearly 70,000 loan users provided by the immediate financial platform.
The competition requires participants to build an accurate risk control model that predicts whether users will be overdue on repayment, based on the basic identity information, consumer behavior and bank repayment data of nearly 70,000 loan users provided by Rong360 and the financial institutions on its platform.
The competition requires participants to predict whether users will redeem a coupon within 15 days of receiving it in July 2016, based on each user's real online and offline consumption behavior between January 1, 2016 and June 30, 2016. Submissions are evaluated by AUC: the AUC of each coupon is calculated separately, and the average over all coupons is the final score.
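The per-coupon averaging described above can be sketched in plain Python as follows. This is an illustration rather than the organizers' scoring code: the pairwise AUC definition with ties counted as 0.5 is a standard one, but skipping coupons whose samples all share one label (where AUC is undefined) is an assumption, and the toy data is made up.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Returns None when only one class
    is present, because AUC is undefined in that case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_coupon_auc(coupon_ids, labels, scores):
    """Average the AUC computed separately for each coupon.

    Assumption: coupons with a single class are skipped, not scored.
    """
    groups = defaultdict(lambda: ([], []))
    for c, y, s in zip(coupon_ids, labels, scores):
        groups[c][0].append(y)
        groups[c][1].append(s)
    per_coupon = [auc(ys, ss) for ys, ss in groups.values()]
    valid = [a for a in per_coupon if a is not None]
    return sum(valid) / len(valid) if valid else None


# Toy data: coupon "a" is ranked perfectly, "b" half right, "c" has one class.
coupon_ids = ["a", "a", "a", "b", "b", "b", "c"]
labels = [1, 0, 0, 1, 0, 1, 1]
scores = [0.9, 0.2, 0.4, 0.3, 0.6, 0.8, 0.5]
result = mean_coupon_auc(coupon_ids, labels, scores)  # (1.0 + 0.5) / 2 = 0.75
```

Because every coupon contributes equally to the average, a model that ranks well only on the most frequently issued coupons can still score poorly, which differs from computing a single AUC over all samples.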
The competition requires participants to forecast the prices of agricultural products in July based on price data from before June 2016. The preliminary round is based on price data from farm commodity markets in China, while the second round adds weather and other multi-source data.
The State Grid monitors users and their transformers for abnormalities, and field maintenance personnel conduct spot checks on users flagged as abnormal. The results of the checks are fed back, and any user found to be stealing electricity is reported. In this contest, participants are given the relevant data and the inspectors' results, and are required to build a detection model that identifies electricity theft by users.
In the preliminary round of this topic, participants are given millions of search terms from 20,000 users, together with a training set of their true gender, age and educational background obtained from a survey. Using machine learning and data mining techniques, they build classification algorithms that analyze the search keywords of another 20,000 people and predict their gender, age, educational background and other user attributes. In the semi-final, the training set and the test set were each extended to 100,000 users.
Precision marketing is a new direction in Internet marketing and advertising. In particular, when users are at specific locations and merchants, matching users based on their portraits and pushing relevant offers and advertising through different channels has become a new direction of development for many Internet and non-Internet enterprises. Taking one such marketing scenario as an example, contestants are required to complete user portrait description and merchant matching based on user location information, merchant classification and location information.
15. 2016 CCF-Human or Robots
In the first half of 2016, AdMaster's anti-cheating solution identified up to 28% of daily traffic on average as fake, i.e. non-human malicious traffic generated by bot simulation and blacklisted IPs. This contest requires participants to automatically detect such fake traffic from user behavior logs.
In this competition, participants are required to forecast the national and regional demand for each commodity over the next two weeks, based on data from a large number of buyers and sellers over the past year. Participants need to use data mining techniques to capture how commodity demand changes, predict future national and regional demand, and take into account the impact of future uncertainty on logistics costs, so as to achieve global optimization. The competition provides national and regional warehousing data for goods from October 10, 2014 to December 27, 2015.
Natural Language Processing (NLP)
The contest requires participants to analyze the internal structure and semantic information of text, based on a batch of long text data and category labels provided by Daguan Data, and to build a text classification model that classifies accurately using state-of-the-art NLP and artificial intelligence techniques. The data consists of two CSV files: a training data set and a test data set.
The contest requires participants to develop an algorithm that improves the recognition ability and service quality of intelligent customer service, based on real data from the intelligent customer service chatbot provided by Paipaidai (PPDai), with natural language processing and text mining as the main research focus.
This competition requires contestants to analyze the given real (desensitized) dialogue data between JingDong users and JingDong human customer service agents, and to build an end-to-end, task-oriented, multi-turn dialogue system whose answers meet users' needs.
This contest focuses on cross-language adaptation of short text matching. The source language is English and the target language is Spanish. Participants are required to build a cross-language short text matching model to improve the ability of intelligent customer service robots.
In addition, Smilexuhc also provides two experience-sharing articles. If you are interested, you can bookmark them and learn from those who came before.
Ask Me Anything session with a Kaggle Grandmaster Vladimir I. Iglovikov: https://pan.baidu.com/s/1XkFwko_YrI5TfjjIai7ONQ