Exploratory Data Analysis

2024-01-24

7.5.1 A categorical and continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in shape. For example, let’s explore how the price of a diamond varies with its quality:

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

It’s hard to see the difference in distribution because the overall counts differ so much:

ggplot(diamonds) +
  geom_bar(mapping = aes(x = cut))

To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display density, which is the count standardised so that the area under each frequency polygon is one.

ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
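
To make the standardisation concrete, here is a minimal base R sketch of how a count becomes a density. The price vector and bin edges are made up for illustration (they are not taken from diamonds), and hist() is used only to bin the values; it is not meant to reproduce what ggplot2 does internally.

```r
# Hypothetical prices, invented for illustration only
prices <- c(300, 450, 700, 900, 1200, 1300, 1800, 2400)
binwidth <- 500
breaks <- seq(0, 2500, by = binwidth)

# Bin the values without drawing a plot
h <- hist(prices, breaks = breaks, plot = FALSE)

# Density = count / (total count * bin width)
dens <- h$counts / (sum(h$counts) * binwidth)

# Area under the density: height * width, summed over all bins
area <- sum(dens * binwidth)
```

Because each group is divided by its own total, a small group and a large group end up on the same scale, which is what makes their shapes comparable.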

There’s something rather surprising about this plot: it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret; there’s a lot going on in this plot.

Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.
Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.
A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.
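
The three components above can be computed by hand. The following is a minimal base R sketch on a small made-up vector (the values are hypothetical); it follows the 1.5 × IQR rule described in the text and uses quantile() with its default settings, which is an assumption about how the box edges are computed.

```r
# Hypothetical values, with one unusually large observation
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box: 25th percentile, median, 75th percentile
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Points more than 1.5 * IQR beyond either edge of the box are outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr
is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]

# Whiskers reach the farthest points that are not outliers
whiskers <- range(x[!is_outlier])
```

Here the box runs from 4 to 8 with the median at 6, the whiskers end at 2 and 9, and 30 is plotted individually as an outlier.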

Let’s take a look at the distribution of price by cut using geom_boxplot():

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.

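cut is an ordered factor: fair is worse than good, which is worse than very good, and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the reorder() function.

For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage (hwy) varies across classes: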

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

To make the trend easier to see, we can reorder class based on the median value of hwy:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
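
As a minimal sketch of what reorder() does to the category order, the example below uses a small made-up data frame rather than mpg (the class labels and hwy values are hypothetical). It only inspects the resulting factor levels and does not draw a plot.

```r
# Hypothetical data, invented for illustration only
df <- data.frame(
  class = c("suv", "suv", "compact", "compact", "pickup", "pickup"),
  hwy   = c(18, 20, 28, 30, 16, 17)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(df$class))

# reorder() sorts the levels by a summary of a second variable,
# here the median hwy within each class (ascending)
reordered <- reorder(df$class, df$hwy, FUN = median)
new_levels <- levels(reordered)
```

The alphabetical order compact, pickup, suv becomes pickup, suv, compact, because the group medians are 16.5, 19 and 29 respectively. A plot that maps the reordered factor to an axis draws the groups in that order.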

If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip().

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

7.5.1.1 Exercises
