SKLearn train_test_split jumbles indexes - unreproducible with dummy data

I have a setup where I read a csv into a dataframe, add a calculated column, then do a train_test_split. A dummy solution would be:

import random

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

asd = random.sample(range(1, 1456165166), 500)

index = list(range(500))

data = {
  "calories": asd,
  "lol": asd,
  "ix": index
}
df = pd.DataFrame(data)
df = df.set_index("ix")
df['target_cat'] = np.where(df['lol'] > 154683526, 0, 1)
X = df.loc[:, df.columns != "lol"]
y = df['lol']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
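For reference, the alignment the snippet above relies on can be checked programmatically rather than by eye. A minimal sketch (synthetic data and a fixed random_state are assumptions added for reproducibility):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy data above.
rng = np.random.default_rng(0)
df = pd.DataFrame({"calories": rng.integers(1, 10**9, 500)},
                  index=range(500))
df["target_cat"] = np.where(df["calories"] > 5 * 10**8, 0, 1)

X = df.loc[:, df.columns != "target_cat"]
y = df["target_cat"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# train_test_split shuffles rows but carries the pandas index along,
# so the (shuffled) indices of X_train and y_train stay identical.
print(X_train.index.equals(y_train.index))  # True
```

If this prints True, the features and labels are still paired row-for-row regardless of how shuffled the index looks.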

This works perfectly, and it keeps the indices in X_train and y_train consistent (screenshot showing the aligned indices omitted).

This would be the expected behavior. Now if I do the same with my own data read from a csv, all the indices get jumbled:

train = pd.read_csv(all_files[2], delimiter=",", index_col=None, header=0, encoding="UTF-8")
train = train.set_index("id")
train['target_cat'] = np.where(train['target_reg'] == 0, 0, 1)
X = train.loc[:, train.columns != "target_cat"]
y = train['target_cat']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
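The same check can be run on CSV-backed data. A self-contained sketch, with an in-memory synthetic CSV standing in for all_files[2] (the synthetic columns are an assumption, not the real file):

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV with an "id" column, mimicking the real file's shape.
csv = io.StringIO(
    "id,feature,target_reg\n"
    + "\n".join(f"{i},{i * 2},{i % 3}" for i in range(100)))

train = pd.read_csv(csv, delimiter=",", index_col=None, header=0)
train = train.set_index("id")
train["target_cat"] = np.where(train["target_reg"] == 0, 0, 1)

X = train.loc[:, train.columns != "target_cat"]
y = train["target_cat"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# If the indices had truly diverged, this would print False.
print(X_train.index.equals(y_train.index))  # True
```

Comparing the indices this way rules out a display artifact (e.g. viewing X_train and y_train from two separate, differently shuffled split calls) as the source of the apparent off-by-one.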

When I do this, the indices don't line up; they seem to be off by 1:

(screenshot showing the misaligned, off-by-one indices)

How on Earth can this happen? I've been sitting here baffled and have no idea what I'm missing...



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source