Notice! This page is still in progress!
What’s the best strategy to win in PUBG?
Objective:
- Predict
Data:
- 65,000 games’ worth of anonymized player data
Directory Structure
.                     # root folder
├── data              # folder which contains the data sets
│   ├── train.csv     # train sample
│   └── test.csv      # test sample
├── pubg.ipynb        # main file
└── data_utils.py     # contains some utility functions
Let's get started
%matplotlib inline
import pandas as pd
import numpy as np
from data_utils import df_info
We start by importing the required packages and loading the training set. data_utils is a file I prepared in advance; it contains utility functions that are especially useful when working with tabular data.
You can view the code of the file here.
Tip! The nrows= parameter of pd.read_csv() limits the number of imported rows to the number given. This is practical for performance reasons when you don't need the whole dataset, for example for some quick data analysis or for testing functions. With nrows=None everything is loaded.
train = pd.read_csv("data/train.csv", nrows=None)
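To see the effect of nrows without touching the real file, here is a minimal sketch that reads from an in-memory CSV. The CSV text and its values are made up for illustration; only the pd.read_csv() call itself mirrors the line above.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for data/train.csv
csv_text = "Id,kills,winPlacePerc\n0,2,0.86\n1,1,0.04\n2,0,0.50\n3,5,1.00\n"

# nrows=2 reads only the first two data rows (the header line is not counted)
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)

# nrows=None (the default) reads every row
full = pd.read_csv(io.StringIO(csv_text), nrows=None)

print(sample.shape)
print(full.shape)
```

Loading a small sample first is a cheap way to check column names and dtypes before committing to the full file.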
In the following I will use some functions I prepared in advance. You can view the code for the functions here.
def get_cats(df):
    cats = []
    for col in df:
        if pd.api.types.is_string_dtype(df[col]):
            cats.append(len(df[col].unique()))
        else:
            cats.append(np.nan)
    return pd.DataFrame(cats, index=df.columns, columns=['categories'])
# display() is built into notebooks; import it explicitly when this code lives in data_utils.py
from IPython.display import display

def df_info(df: pd.DataFrame, show_rows: int=2, horizontal: bool=True, percentiles: list=[.25, .5, .75], selected_cols: list=None, includes: str=None):
    df = df.copy()
    # optionally keep only string ('objects') or only numeric columns; any other value keeps all
    if includes == 'objects':
        for col in df:
            if not pd.api.types.is_string_dtype(df[col]):
                df.drop([col], axis=1, inplace=True)
    elif includes == 'numeric':
        for col in df:
            if not pd.api.types.is_numeric_dtype(df[col]):
                df.drop([col], axis=1, inplace=True)
    if df.empty: raise ValueError(f'No "{includes}" type columns!')
    # in case you just want information on certain columns, specify those columns in selected_cols
    if selected_cols:
        df = df[selected_cols]
    # data types
    types = pd.DataFrame(df.dtypes, columns=["dtype"])
    # description of the dataframe (mean, median, std, min-max values)
    descr = df.describe(percentiles=percentiles).drop(["count"])
    # count missing values
    nans = pd.DataFrame(df.isnull().sum(), columns=["missings"])
    # count how many categories are contained in each string type column
    cats = get_cats(df)
    # show the first few rows depending on how many you want to show
    head = df.head(show_rows)
    # display it either vertically or horizontally
    if horizontal:
        info_df = pd.concat([types, nans, cats, descr.T, head.T], axis=1)
    else:
        info_df = pd.concat([types.T, nans.T, cats.T, descr, head], axis=0)
    # show all rows and columns, no matter how large the dataframe is
    with pd.option_context("display.max_rows", None, "display.max_columns", None):
        display(info_df)
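The includes filter above drops unwanted columns one at a time. As a side note, pandas also ships DataFrame.select_dtypes(), which can do a similar selection in a single call. The sketch below is not part of data_utils; the toy frame and its matchType column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns and one string column
toy = pd.DataFrame({
    "kills": [2, 1, 0],
    "walkDistance": [782.4, 119.6, 0.0],
    "matchType": ["solo", "duo", "solo"],  # made-up string column
})

# Keep only numeric columns (roughly what includes='numeric' does)
numeric_only = toy.select_dtypes(include="number")

# Keep everything that is not numeric (here: the string column)
non_numeric = toy.select_dtypes(exclude="number")

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```

Note that exclude="number" keeps every non-numeric dtype, so it is slightly broader than the string check used in df_info.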
df_info(train, horizontal=False, percentiles=[.5], includes="all")
| | Id | groupId | matchId | assists | boosts | damageDealt | DBNOs | headshotKills | heals | killPlace | killPoints | kills | killStreaks | longestKill | maxPlace | numGroups | revives | rideDistance | roadKills | swimDistance | teamKills | vehicleDestroys | walkDistance | weaponsAcquired | winPoints | winPlacePerc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | int64 | int64 | int64 | int64 | int64 | float64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | float64 | int64 | int64 | int64 | float64 | int64 | float64 | int64 | int64 | float64 | int64 | int64 | float64 |
missings | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
categories | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
mean | 499.5 | 1.75983e+06 | 499.5 | 0.371 | 1.134 | 174.757 | 0.964 | 0.341 | 1.391 | 42.427 | 1116.12 | 1.322 | 0.699 | 25.3916 | 41.064 | 39.613 | 0.199 | 387.059 | 0.003 | 3.97792 | 0.014 | 0.006 | 1087.49 | 3.717 | 1506.99 | 0.486571 |
std | 288.819 | 877624 | 288.819 | 0.817329 | 1.70908 | 230.343 | 1.7337 | 0.809552 | 2.77954 | 28.4318 | 150.125 | 2.17372 | 0.806877 | 51.0182 | 23.7698 | 23.1409 | 0.547447 | 1095.86 | 0.0948683 | 21.0137 | 0.133498 | 0.0772656 | 1142.88 | 2.97755 | 39.9435 | 0.316501 |
min | 0 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 908 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1349 | 0 |
50% | 499.5 | 1.98371e+06 | 499.5 | 0 | 0 | 100 | 0 | 0 | 0 | 38 | 1057.5 | 1 | 1 | 1.406 | 29 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 581.55 | 3 | 1500 | 0.4792 |
max | 999 | 2.7006e+06 | 999 | 7 | 10 | 2285 | 22 | 8 | 29 | 98 | 1792 | 26 | 4 | 415.4 | 100 | 99 | 5 | 8197 | 3 | 251.8 | 2 | 1 | 5176 | 37 | 1744 | 1 |
0 | 0 | 24 | 0 | 0 | 5 | 247.3 | 2 | 0 | 4 | 17 | 1050 | 2 | 1 | 65.32 | 29 | 28 | 1 | 591.3 | 0 | 0 | 0 | 0 | 782.4 | 4 | 1458 | 0.8571 |
1 | 1 | 440875 | 1 | 1 | 0 | 37.65 | 1 | 1 | 0 | 45 | 1072 | 1 | 1 | 13.55 | 26 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 119.6 | 3 | 1511 | 0.04 |
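The categories row in the table above is NaN everywhere because every column in this dataset is numeric. To show what that row looks like when a string column is present, here is a small sketch that applies the same logic as get_cats to a toy frame. The matchType column is hypothetical and does not appear in the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one string column
toy = pd.DataFrame({
    "kills": [2, 1, 0],
    "matchType": ["solo", "duo", "solo"],  # made-up string column
})

# Same logic as get_cats: number of unique values for string columns, NaN otherwise
cats = pd.DataFrame(
    [
        len(toy[col].unique()) if pd.api.types.is_string_dtype(toy[col]) else np.nan
        for col in toy
    ],
    index=toy.columns,
    columns=["categories"],
)

print(cats)
```

Here kills gets NaN, while matchType gets a count of its distinct values.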
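def trn_val_split(df, size=0.8):
    # pick a random `size` fraction of the row positions for training
    idxs = np.random.permutation(len(df))[:int(len(df) * size)]
    # idxs are positions, so translate them to index labels before dropping
    trn, val = df.iloc[idxs], df.drop(df.index[idxs])
    return trn, val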
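Because the split above draws from NumPy's global random state, it gives a different result on every run. A possible variant with a fixed seed is sketched below. The function name trn_val_split_seeded and the seed parameter are my own additions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def trn_val_split_seeded(df, size=0.8, seed=0):
    """Hypothetical seeded variant of trn_val_split."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then cut at the requested fraction
    perm = rng.permutation(len(df))
    cut = int(len(df) * size)
    # Use positions on both sides, so non-integer index labels work too
    return df.iloc[perm[:cut]], df.iloc[perm[cut:]]

# Toy frame with string labels to show the split does not rely on a 0..n-1 index
toy = pd.DataFrame({"kills": range(10)}, index=list("abcdefghij"))
trn, val = trn_val_split_seeded(toy, size=0.8, seed=42)

print(len(trn), len(val))
```

Unlike the original, this variant also returns the validation rows in shuffled order, which should not matter for most models.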