• 2. Choose rows

Motivating scenario: you want to limit your analysis to rows with certain values.

Learning goals: By the end of this sub-chapter you should be able to

Use the filter function to choose the rows you want to work with.
Display care so that you do not get the wrong answer when filtering your data.

Remove rows with filter()

There are reasons to remove rows by their values. For example, we could remove plants from US and LB locations. We can achieve this with the filter() function as follows:

A visual representation of filtering data using `filter()` in `dplyr`. The top table contains two columns: `prop_hyb` and `petal_color`, listing hybrid proportions alongside flower color ("white" and "pink"). Below, an R code snippet applies `filter(petal_color == "pink")`, removing rows where `petal_color` is not "pink." The bottom table displays the filtered dataset, which includes only rows with pink flowers, while the white flower row has been excluded. — Figure 1: Using `filter()` to subset data based on a condition. The top table contains two columns: `prop_hyb` (proportion of hybrids) and `petal_color` (flower color), with values including both “white” and “pink” flowers. The function `filter(petal_color == "pink")` is applied to retain only rows where `petal_color` is “pink.” The resulting dataset, shown in the bottom table, excludes the “white” flower row and keeps only the observations where petal color is “pink.”

ril_data |> filter(location == "GC" | location == "SR"): To only retain samples from GC or (noted by |) SR. Recall that == asks the logical question, “Does the location equal SR?” So combined, the code reads “Retain only samples with location equal to SR or location equal to GC.”

OR, equivalently

ril_data |> filter(location != "US" & location != "LB"): To remove samples from US and (noted by &) LB. Recall that != asks the logical question, “Does the location not equal US?” Combined the code reads “Retain only samples with location not equal to US and with location not equal to LB.”

# rerunning the code summarizing visitation by 
# location and petal color and then ungrouping it
# but filtering out plants from locations US and LB
# note that the output no longer says `# Groups:   location [2]`
ril_data      |>
  filter(location != "US" & location != "LB")|>
  group_by(location, petal_color) |>
  summarize(avg_vistis = mean(mean_visits, na.rm = TRUE),
            cor_visits_petal_area =  cor(mean_visits ,petal_area_mm, 
                use = "pairwise.complete.obs"))|>
  ungroup()

`summarise()` has grouped output by 'location'. You can override using the
`.groups` argument.

# A tibble: 6 × 4
  location petal_color avg_vistis cor_visits_petal_area
  <chr>    <chr>            <dbl>                 <dbl>
1 GC       pink            0.193                 0.0899
2 GC       white           0.0104                0.0886
3 GC       <NA>            0.2                   0.443 
4 SR       pink            1.76                  0.387 
5 SR       white           0.733                 0.458 
6 SR       <NA>            1.04                  0.717

Warning… ! Remove one thing can change another:

Think hard about removing things (e.g. missing data), and if you do decide remove things, consider where in the pipeline you are doing so. Reoving one thing can change another. For example, compare:

ril_data |>
  filter(!is.na(petal_color))|>
  summarise(avg_visits = mean(mean_visits, na.rm = TRUE))

# A tibble: 1 × 1
  avg_visits
       <dbl>
1      0.705

and

ril_data |>
  summarise(avg_visits = mean(mean_visits, na.rm = TRUE))

# A tibble: 1 × 1
  avg_visits
       <dbl>
1      0.693

These answers differ because when we removed plants with no petal color information we also removed their pollinator visitation values.