Notice! This page is still in progress!

What’s the best strategy to win in PUBG?

Objective:

  • Predict

Data:

  • 65,000 games’ worth of anonymized player data

Directory Structure

.                         # root folder
├── data                  # folder which contains data sets
|  ├── train.csv          # train sample
|  └── test.csv           # test sample
├── pubg.ipynb            # main file
└── data_utils.py         # contains some utility functions

Let’s get started

PUBG Start

%matplotlib inline

import pandas as pd
import numpy as np

from data_utils import df_info

We start by importing the required packages and loading the training set. data_utils is a file I prepared in advance with some utility functions that are especially useful when working with tabular data. You can view the code of the file here.

Tip! The pd.read_csv() parameter nrows= limits the number of imported rows to the number given. This is practical for performance reasons when you don’t need the whole dataset, for example for some quick data analysis or for testing functions. If nrows=None, everything is loaded.


train = pd.read_csv("data/train.csv", nrows=None)
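
As a minimal sketch of the tip above, the snippet below reads only the first rows of a CSV. The inline CSV text and the names csv_text, sample and full are made up for illustration so the example runs without the competition files; only the column names are borrowed from the training set.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/train.csv so the example is self-contained.
csv_text = "Id,kills,winPlacePerc\n0,2,0.8571\n1,1,0.04\n2,0,0.5\n3,4,1.0\n"

# nrows=2 stops parsing after two data rows (the header line is not counted).
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)

# nrows=None (the default) loads every row.
full = pd.read_csv(io.StringIO(csv_text), nrows=None)

print(sample.shape, full.shape)
```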

In the following I will use some functions I prepared in advance. You can view the code for the functions here.

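import numpy as np
import pandas as pd
from IPython.display import display  # display() is not a builtin outside the notebook


def get_cats(df):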
    cats = []
    for col in df:
        if pd.api.types.is_string_dtype(df[col]):
            cats.append(len(df[col].unique()))
        else:
            cats.append(np.nan)
    return pd.DataFrame(cats, index=df.columns, columns=['categories'])


def df_info(df: pd.DataFrame, show_rows: int=2, horizontal: bool=True, percentiles: list=[.25, .5, .75], selected_cols: list=None, includes: str=None): 
    df = df.copy()
    if includes == 'objects':
        for col in df:
            if not pd.api.types.is_string_dtype(df[col]): 
                df.drop([col], axis=1, inplace=True)
    elif includes == 'numeric':
        for col in df:
            if not pd.api.types.is_numeric_dtype(df[col]): 
                df.drop([col], axis=1, inplace=True)
    if df.empty: raise ValueError(f'No "{includes}" type columns!')
    
    # in case you just want information on certain columns, specify them in selected_cols
    if selected_cols:
        df = df[selected_cols]
    
    # data types
    types = pd.DataFrame(df.dtypes, columns=["dtype"])
    
    # description of the dataframe (mean, median, std, min-max values)
    descr = df.describe(percentiles=percentiles).drop(["count"])
    
    # count missing values
    nans = pd.DataFrame(df.isnull().sum(), columns=["missings"])
    
    # count how many categories are contained in each string type column
    cats = get_cats(df)
    
    # show the first few rows depending on how many you want to show
    head = df.head(show_rows)
    
    # display it either vertically or horizontally
    if horizontal:
        info_df = pd.concat([types, nans, cats, descr.T, head.T], axis=1)
    else:
        info_df = pd.concat([types.T, nans.T, cats.T, descr, head], axis=0)
            
    # show all rows and columns, no matter how large the dataframe is
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        display(info_df)

df_info(train, horizontal=False, percentiles=[.5], includes="all")

                 Id      groupId  matchId   assists   boosts  damageDealt   DBNOs  headshotKills    heals  killPlace  killPoints    kills  killStreaks  longestKill  maxPlace  numGroups   revives  rideDistance  roadKills  swimDistance  teamKills  vehicleDestroys  walkDistance  weaponsAcquired  winPoints  winPlacePerc
dtype         int64        int64    int64     int64    int64      float64   int64          int64    int64      int64       int64    int64        int64      float64     int64      int64     int64       float64      int64       float64      int64            int64       float64            int64      int64       float64
missings          0            0        0         0        0            0       0              0        0          0           0        0            0            0         0          0         0             0          0             0          0                0             0                0          0             0
categories      NaN          NaN      NaN       NaN      NaN          NaN     NaN            NaN      NaN        NaN         NaN      NaN          NaN          NaN       NaN        NaN       NaN           NaN        NaN           NaN        NaN              NaN           NaN              NaN        NaN           NaN
mean          499.5  1.75983e+06    499.5     0.371    1.134      174.757   0.964          0.341    1.391     42.427     1116.12    1.322        0.699      25.3916    41.064     39.613     0.199       387.059      0.003       3.97792      0.014            0.006       1087.49            3.717    1506.99      0.486571
std         288.819       877624  288.819  0.817329  1.70908      230.343  1.7337       0.809552  2.77954    28.4318     150.125  2.17372     0.806877      51.0182   23.7698    23.1409  0.547447       1095.86  0.0948683       21.0137   0.133498        0.0772656       1142.88          2.97755    39.9435      0.316501
min               0           24        0         0        0            0       0              0        0          1         908        0            0            0         3          3         0             0          0             0          0                0             0                0       1349             0
50%           499.5  1.98371e+06    499.5         0        0          100       0              0        0         38      1057.5        1            1        1.406        29         28         0             0          0             0          0                0        581.55                3       1500        0.4792
max             999   2.7006e+06      999         7       10         2285      22              8       29         98        1792       26            4        415.4       100         99         5          8197          3         251.8          2                1          5176               37       1744             1
0                 0           24        0         0        5        247.3       2              0        4         17        1050        2            1        65.32        29         28         1         591.3          0             0          0                0         782.4                4       1458        0.8571
1                 1       440875        1         1        0        37.65       1              1        0         45        1072        1            1        13.55        26         23         0             0          0             0          0                0         119.6                3       1511          0.04
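
To make the layout of this summary easier to follow, here is a minimal sketch of how its building blocks stack up on a tiny frame. The frame toy and all of its values are hypothetical, the matchType column is invented purely to have one string column, and to_frame() is used here as a shorthand for wrapping a Series in a DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; matchType is an invented string column.
toy = pd.DataFrame({
    "kills": [2, 1, 0, 4],
    "walkDistance": [782.4, 119.6, np.nan, 50.0],
    "matchType": ["solo", "duo", "solo", "squad"],
})

types = toy.dtypes.to_frame("dtype")                     # dtype of each column
nans = toy.isnull().sum().to_frame("missings")           # NaN count per column
descr = toy.describe(percentiles=[.5]).drop(["count"])   # numeric columns only

# Stack the pieces row-wise so every original column stays a column
# (this mirrors the horizontal=False layout); non-numeric columns get NaN
# in the describe() rows.
summary = pd.concat([types.T, nans.T, descr], axis=0)
print(summary)
```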
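
def trn_val_split(df, size=0.8):
    # shuffle the row positions, then cut them at the requested fraction
    idxs = np.random.permutation(len(df))
    cut = int(len(df) * size)
    # select both parts by position so the split also works with a non-default index
    trn, val = df.iloc[idxs[:cut]], df.iloc[idxs[cut:]]
    return trn, val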
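
A quick sanity check for a positional split like this one: the two parts should be disjoint and together cover every row. The sketch below uses a hypothetical helper split_by_position and a toy frame, and it seeds NumPy's random generator (an addition that is not part of the original function) so the shuffle is repeatable.

```python
import numpy as np
import pandas as pd


def split_by_position(df, size=0.8, seed=0):
    # Hypothetical helper: shuffle row positions, then cut at the requested fraction.
    rng = np.random.default_rng(seed)
    idxs = rng.permutation(len(df))
    cut = int(len(df) * size)
    return df.iloc[idxs[:cut]], df.iloc[idxs[cut:]]


# Toy frame with a non-numeric index to show that the split is purely positional.
toy = pd.DataFrame({"kills": range(10)}, index=list("abcdefghij"))
trn, val = split_by_position(toy, size=0.8)

print(len(trn), len(val))                # sizes of the two parts
print(set(trn.index) & set(val.index))   # overlap between the parts
```

With size=0.8 and ten rows, the training part holds eight rows and the validation part two, and the overlap is empty.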