Prof. Dr. Ingo Claßen
Visualization - Distributions - DSML

Attribution

The following slides are based on

  • Fundamentals of Data Visualization (link)
    and follow the
  • Attribution-NonCommercial-NoDerivatives 4.0 International License (link)

Histogram

Number of records per value bin

Bin size must be chosen appropriately

Here: number of Titanic passengers per age band, bin size 5 years
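As a minimal sketch of what "number of records per value bin" means, the snippet below counts values in equal-width bins. The function name `histogram_counts` and the age list are made up for illustration; the ages are not the Titanic data.

```python
def histogram_counts(values, bin_width):
    """Count records per bin; a bin starting at b covers [b, b + bin_width)."""
    first = int(min(values) // bin_width)
    last = int(max(values) // bin_width)
    # Create every bin up front so that empty bins show up with count 0.
    counts = {k * bin_width: 0 for k in range(first, last + 1)}
    for v in values:
        counts[int(v // bin_width) * bin_width] += 1
    return counts

# Hypothetical ages, NOT the real Titanic data
ages = [2, 4, 18, 19, 21, 22, 22, 24, 27, 31, 33, 38, 45, 47, 62]
bin_width = 5
counts = histogram_counts(ages, bin_width)

# Text rendering: one row per bin, one '#' per record
for left, n in counts.items():
    print(f"{left:>2}-{left + bin_width:<2} {'#' * n}")
```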

Different Bin Sizes

(a) 1 year: too small

(b) 3 years: ok

(c) 5 years: ok

(d) 15 years: too large
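The effect of the bin size can be sketched numerically: very small bins leave most bins empty, very large bins merge everything into a few bars. The helper `bin_summary` and the ages below are hypothetical and only serve as an illustration.

```python
def bin_summary(values, bin_width):
    """Return (number of bins, number of empty bins) for equal-width bins."""
    first = int(min(values) // bin_width)
    last = int(max(values) // bin_width)
    occupied = {int(v // bin_width) for v in values}
    n_bins = last - first + 1
    return n_bins, n_bins - len(occupied)

# Hypothetical ages, NOT the real Titanic data
ages = [2, 4, 18, 19, 21, 22, 22, 24, 27, 31, 33, 38, 45, 47, 62]

# Same bin sizes as in panels (a) to (d)
summary = {width: bin_summary(ages, width) for width in (1, 3, 5, 15)}
for width, (n_bins, n_empty) in summary.items():
    print(f"bin size {width:>2}: {n_bins:>2} bins, {n_empty:>2} empty")
```

With so few sample values the numbers are only indicative, but the trend is the same as in the panels: the smaller the bins, the noisier the picture.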

Density Plot

Used for continuous values

Works better when there are many data points

Sometimes a matter of taste which fits better

Kernel, bandwidth

(a) gaussian kernel, bandwidth = 0.5: too peaky

(b) gaussian kernel, bandwidth = 2: ok

(c) gaussian kernel, bandwidth = 5: ok

(d) rectangular kernel, bandwidth = 2: too steppy
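The roles of kernel and bandwidth can be sketched with a hand-written kernel density estimate, f(x) = 1/(n·h) · Σ K((x − x_i)/h). This is a simplified illustration, not the routine used to draw the panels: the function names and ages are made up, and for the rectangular kernel the bandwidth is taken as the half-width of the box, which may differ from the convention of a plotting library.

```python
import math

def gaussian(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def rectangular(u):
    """Box of height 0.5 on [-1, 1]."""
    return 0.5 if abs(u) <= 1 else 0.0

def kde(x, data, bandwidth, kernel):
    """Kernel density estimate at x: mean of scaled kernels centered on the data."""
    return sum(kernel((x - xi) / bandwidth) for xi in data) / (len(data) * bandwidth)

# Hypothetical ages, NOT the real Titanic data
ages = [2, 4, 18, 19, 21, 22, 22, 24, 27, 31, 33, 38, 45, 47, 62]
xs = [i / 10 for i in range(-200, 901)]  # evaluation grid from -20 to 90

def peak(bandwidth):
    """Highest point of the Gaussian density estimate on the grid."""
    return max(kde(x, ages, bandwidth, gaussian) for x in xs)

# A small bandwidth gives a peaky curve, a large one a smooth curve.
peak_small, peak_large = peak(0.5), peak(5)

# Both kernels give a curve whose area is (approximately) 1.
area_gauss = sum(kde(x, ages, 2, gaussian) for x in xs) * 0.1
area_rect = sum(kde(x, ages, 2, rectangular) for x in xs) * 0.1
print(peak_small, peak_large, area_gauss, area_rect)
```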

Boxplot

Simple and informative

Useful for many distributions in one diagram
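The numbers behind a boxplot can be sketched as follows. The sketch assumes the common convention that whiskers reach to the most extreme points within 1.5 times the interquartile range of the box; the slides do not state which convention is used. `boxplot_stats` and the ages are hypothetical.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, whisker ends and outliers, using the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical ages, NOT the real Titanic data
ages = [2, 4, 18, 19, 21, 22, 22, 24, 27, 31, 33, 38, 45, 47, 62]
stats = boxplot_stats(ages)
print(stats)
```

Because each distribution is reduced to these few numbers, many boxes fit side by side in one diagram.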

Violin Plot

Modern variant of boxplot

Sufficient data points needed

Two Distributions at once

Not clear if bars overlap

If bars are stacked, comparison of female passengers is impaired

If bars are not stacked but drawn with transparency, the meaning of the third color is not clear

A density plot may be better suited

Density line helps interpretation

Beware of scaling of y-axis

Separate plots even better

In this case age pyramid best
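An age pyramid places the two histograms back to back, so neither group hides the other. A rough text-only sketch, with a hypothetical helper `age_pyramid` and made-up ages (not the Titanic data):

```python
def age_pyramid(male_ages, female_ages, bin_width=10):
    """Return rows (bin start, male count, female count), oldest bin first."""
    top = int(max(male_ages + female_ages) // bin_width)
    rows = []
    for k in range(top, -1, -1):
        lo = k * bin_width
        m = sum(lo <= a < lo + bin_width for a in male_ages)
        f = sum(lo <= a < lo + bin_width for a in female_ages)
        rows.append((lo, m, f))
    return rows

# Hypothetical ages, NOT the real Titanic data
male = [4, 19, 21, 22, 24, 27, 31, 33, 45, 62]
female = [2, 18, 22, 25, 38, 47]
rows = age_pyramid(male, female)

# Male bars grow to the left, female bars to the right of the age axis
for lo, m, f in rows:
    print(f"{'#' * m:>10} | {lo:>2}-{lo + 9:<2} | {'#' * f}")
```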

Multiple Distributions at once

Density estimates of the butterfat percentage in the milk of four cattle breeds

Histograms are not suitable in this case

Many Distributions at once

Mean daily temperatures

Boxplots

Violin plots

Strip plots

Problem: overlapping points

Better: jittered strip plots

Points are randomly moved horizontally
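Jittering can be sketched in a few lines: each point keeps its value and only its horizontal position is shifted by a small random amount. The function name, the jitter width of 0.2 and the fixed seed are assumptions chosen for illustration.

```python
import random

def jitter(positions, width=0.2, seed=0):
    """Shift each horizontal position by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [x + rng.uniform(-width, width) for x in positions]

# Hypothetical example: six points in category 1, four in category 2
positions = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
jittered = jitter(positions)
print(jittered)
```

The width has to stay well below half the distance between neighboring categories, otherwise points drift into the wrong column.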

Sina plots, combination of violin plot and strip plot
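One way to picture the combination: the horizontal jitter of each point is limited by the estimated density at that point's value, so the point cloud traces the violin outline. The sketch below is a simplified illustration with hypothetical names, a Gaussian kernel and made-up values; actual plotting libraries may compute the offsets differently.

```python
import math
import random

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (len(data) * h * math.sqrt(2 * math.pi))

def sina_offsets(values, bandwidth, max_width=0.4, seed=0):
    """Random horizontal offsets, scaled by the relative density at each value."""
    rng = random.Random(seed)
    dens = [gaussian_kde(v, values, bandwidth) for v in values]
    peak = max(dens)
    return [rng.uniform(-1, 1) * max_width * d / peak for d in dens]

# Hypothetical measurements for a single category
values = [2, 4, 18, 19, 21, 22, 22, 24, 27, 31, 33, 38, 45, 47, 62]
offsets = sina_offsets(values, bandwidth=5)
print(offsets)
```

Points in sparse regions stay close to the center line, points in dense regions spread out.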

Ridgeline plots, half violin plots rotated by 90 degrees

Very many Distributions

Evolution of movie lengths over time