- Points close to the border in high dimensions:
- Think about picking a random point inside a unit square (2D). It is rare for this point to be very close to the edges of the square: the chance of landing within 0.001 of an edge is only about 0.4%.
- But in high dimensions, such as a 10,000-dimensional unit hypercube, almost every point you pick is very close to the border (the same chance exceeds 99.999999%). This is strange because it is not what you expect from lower dimensions.
- Distance between points:
- In a 2D unit square, the average distance between two random points is about 0.52.
- In 3D (a unit cube), the average distance is a little larger, about 0.66.
- But in very high dimensions, such as a 1,000,000-dimensional unit hypercube, the average distance between two random points becomes huge (about 408). This is surprising because both points are still inside the same unit hypercube.
- What this means for machine learning:
- When you have high-dimensional data, most points are far apart from each other. So the data becomes sparse, meaning the points are scattered thinly across the space.
- This makes it hard for a model to make good predictions, because a new data point will probably be far from any of the existing training points. As a result, the model may not perform well, and it can easily overfit, meaning the model is too specific to the training data and does not work well on new data.
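
The two geometric effects in the list above can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes points are drawn uniformly from a unit hypercube, and it assumes that "very close to the border" means within a margin of 0.001 of some face. The function names `border_probability`, `mean_pair_distance`, and `asymptotic_distance` are hypothetical, chosen only for illustration. The large-dimension approximation sqrt(d / 6) follows from the fact that the expected squared difference of two independent uniform coordinates is 1/6.

```python
import math

import numpy as np


def border_probability(dim, margin=0.001):
    """Exact probability that a uniform point in the unit hypercube lies
    within `margin` of at least one face (margin is an assumed threshold)."""
    # A point avoids the border only if every coordinate is in [margin, 1 - margin].
    return 1.0 - (1.0 - 2.0 * margin) ** dim


def mean_pair_distance(dim, n_pairs=1000, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between two
    uniform random points in the unit hypercube of dimension `dim`."""
    rng = np.random.default_rng(seed)
    total = 0.0
    # One pair at a time keeps memory low even for very large `dim`.
    for _ in range(n_pairs):
        a = rng.random(dim)
        b = rng.random(dim)
        total += float(np.linalg.norm(a - b))
    return total / n_pairs


def asymptotic_distance(dim):
    """Large-dimension approximation of the mean distance: sqrt(dim / 6)."""
    return math.sqrt(dim / 6.0)


if __name__ == "__main__":
    for d in (2, 10_000):
        print(f"dim={d}: P(within 0.001 of border) = {border_probability(d):.10f}")
    for d, pairs in ((2, 20_000), (3, 20_000), (1_000_000, 10)):
        estimate = mean_pair_distance(d, n_pairs=pairs)
        print(f"dim={d}: mean distance ~ {estimate:.3f} "
              f"(sqrt(d/6) = {asymptotic_distance(d):.3f})")
```

The Monte Carlo estimates are noisy for small `n_pairs`, so they should only be expected to land near the figures quoted above (about 0.52, 0.66, and 408). The sqrt(d / 6) approximation is only meaningful for large `dim`; in 2D and 3D it overestimates the true mean.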
In summary, high-dimensional spaces behave in ways that feel very different from low dimensions, and this can cause problems when training machine learning models with many features or dimensions.