| Title: | Flags Spatial Errors in Biological Collection Data Using Specialists' Information |
|---|---|
| Description: | Automatically flags common spatial errors in biological collection data using metadata and specialists' information. RuHere implements a workflow to manage occurrence data through six steps: dataset merging, metadata flagging, validation against expert-derived distribution maps, visualization of flagged records, and sampling bias exploration. It specifically integrates specialist-curated range information to identify geographic errors and introductions that often escape standard automated validation procedures. For details on the methodology, see: Trindade & Caron (2026) <doi:10.64898/2026.02.02.703373>. |
| Authors: | Weverton C. F. Trindade [aut, cre] (ORCID: <https://orcid.org/0000-0003-2045-4555>), Fernanda S. Caron [aut] (ORCID: <https://orcid.org/0000-0002-1884-6157>) |
| Maintainer: | Weverton C. F. Trindade <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.1 |
| Built: | 2026-06-10 06:40:47 UTC |
| Source: | https://github.com/wevertonbio/ruhere |
A data set of amphibian communities from the Atlantic Forests of South America sourced from Vancine et al. (2018).
    atlantic_amphibians
A data.table or data.frame with 8,254 rows and 3 columns:
Character. Scientific name of the virtual species.
Numeric. Georeferenced longitude in decimal degrees.
Numeric. Georeferenced latitude in decimal degrees.
Vancine et al. 2018. ATLANTIC AMPHIBIANS: a data set of amphibian communities from the Atlantic Forests of South America. Ecology, 99(7), 1692-1692. doi:10.1002/ecy.2392
inventory_completeness()
    # First rows
    head(atlantic_amphibians)
This function checks which datasets contain distributional information for a given set of species, based on expert-curated sources. It searches the selected datasets and reports whether each species has available distribution data.
    available_datasets(
      data_dir,
      species,
      datasets = "all",
      return_distribution = FALSE
    )
data_dir |
(character) directory path where the datasets were saved. See Details for more information. |
species |
(character) vector with the species names to be checked for the availability of distributional information. |
datasets |
(character) vector indicating which datasets to search.
Options are |
return_distribution |
(logical) whether to return the spatial objects
( |
The distribution datasets can be obtained using the functions
florabr_here(), wcvp_here(), bien_here(), and faunabr_here(),
which download and prepare the corresponding sources for use in RuHere.
If return_distribution = FALSE, a data.frame containing the species names
and the datasets where distributional information is available.
If return_distribution = TRUE, it also returns a list containing the
SpatVector objects representing the species ranges.
    # Set directory where datasets were saved
    # Here, we'll use the directory where the example datasets are stored
    datadir <- system.file("extdata", "datasets", package = "RuHere")

    # Check available datasets
    d <- available_datasets(data_dir = datadir,
                            species = c("Araucaria angustifolia",
                                        "Handroanthus serratifolius",
                                        "Cyanocorax caeruleus"))

    # Check available datasets and return distribution
    d2 <- available_datasets(data_dir = datadir,
                             species = c("Araucaria angustifolia",
                                         "Handroanthus serratifolius",
                                         "Cyanocorax caeruleus"),
                             return_distribution = TRUE)
This function downloads distribution information from the BIEN database,
required for filtering occurrence records using specialists' information via
the flag_bien() function.
    bien_here(
      data_dir,
      species,
      synonyms = NULL,
      overwrite = TRUE,
      progress_bar = FALSE,
      verbose = TRUE
    )
data_dir |
(character) directory to save the data downloaded from BIEN. |
species |
(character) a vector of species names for which to retrieve distribution information. |
synonyms |
(data.frame) an optional data.frame containing synonyms of
the target species. The first column must contain the target species names,
and the second column their corresponding synonyms. Default is NULL. |
overwrite |
(logical) whether to overwrite existing files. Default is TRUE. |
progress_bar |
(logical) whether to display a progress bar during processing.
If TRUE, the 'pbapply' package must be installed. Default is FALSE. |
verbose |
(logical) whether to display progress messages. Default is TRUE. |
This function uses the BIEN::BIEN_ranges_load_species() function to
retrieve polygons representing the distribution ranges of species available
in the BIEN database.
Because taxonomic information in BIEN may be outdated, you can optionally
provide a table of synonyms to broaden the search. The synonyms data.frame
should have the accepted species in the first column and their synonyms in
the second. See RuHere::synonys for an example.
A data frame indicating whether the polygon(s) representing the species range
are available in BIEN.
If the range is available, a GeoPackage file (.gpkg) is saved in
data_dir/bien. The file name corresponds to the species name, with an
underscore (“_”) replacing the space between the genus and the specific
epithet.
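The naming rule above can be sketched in base R. The helper `bien_range_path()` is hypothetical and not part of RuHere; it only mirrors the convention described here, so you can check whether a range has already been saved.

```r
# Hypothetical helper (not part of RuHere): build the path where a BIEN range
# is expected to be saved, following the naming convention described above.
bien_range_path <- function(data_dir, species) {
  # Replace the space between genus and specific epithet with an underscore
  file_name <- paste0(gsub(" ", "_", species, fixed = TRUE), ".gpkg")
  file.path(data_dir, "bien", file_name)
}

p <- bien_range_path(tempdir(), "Handroanthus serratifolius")
basename(p)
file.exists(p)  # TRUE only if bien_here() has already saved this range
```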
    # Define a directory to save the data
    data_dir <- tempdir() # Here, a temporary directory

    # Download species distribution information from BIEN
    bien_here(data_dir = data_dir, species = "Handroanthus serratifolius")
Combines multiple occurrence data frames (for example, from GBIF,
SpeciesLink, BIEN, or iDigBio) into a single standardized dataset. This is
particularly useful after using format_columns() to ensure column
compatibility across data sources.
    bind_here(..., fill = FALSE)
... |
(data.frame) two or more data frames with occurrence records to combine. |
fill |
(logical) whether to fill missing columns with NA. Default is FALSE. |
When fill = TRUE, columns not shared among the input data frames are added
and filled with NA, ensuring that all columns align before binding.
Internally, this function uses data.table::rbindlist() for efficient row
binding.
A data.frame containing all occurrence records combined.
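As a rough illustration of what `fill = TRUE` achieves, the sketch below aligns columns in base R before binding. It is not the package's implementation (which, as noted, relies on `data.table::rbindlist()`), and the helper name `bind_fill()` is an assumption made for this example.

```r
# Base R illustration of column alignment before row binding
# (hypothetical helper; RuHere itself uses data.table::rbindlist()).
bind_fill <- function(...) {
  dfs <- list(...)
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  aligned <- lapply(dfs, function(d) {
    missing_cols <- setdiff(all_cols, names(d))
    # Add columns absent from this data frame, filled with NA
    for (col in missing_cols) d[[col]] <- rep(NA, nrow(d))
    # Put columns in the same order everywhere
    d[, all_cols, drop = FALSE]
  })
  do.call(rbind, aligned)
}

a <- data.frame(species = "Araucaria angustifolia", year = 2001,
                stringsAsFactors = FALSE)
b <- data.frame(species = "Araucaria angustifolia", habitat = "forest",
                stringsAsFactors = FALSE)
combined <- bind_fill(a, b)
combined
```

Without this alignment step, binding `a` and `b` would fail because their columns differ, which is the situation `fill = FALSE` leaves to the user.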
    # Import and standardize GBIF
    data("occ_gbif", package = "RuHere") # Import example data
    gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")

    # Import and standardize SpeciesLink
    data("occ_splink", package = "RuHere") # Import example data
    splink_standardized <- format_columns(occ_splink, metadata = "specieslink")

    # Import and standardize BIEN
    data("occ_bien", package = "RuHere") # Import example data
    bien_standardized <- format_columns(occ_bien, metadata = "bien")

    # Import and standardize iDigBio
    data("occ_idig", package = "RuHere") # Import example data
    idig_standardized <- format_columns(occ_idig, metadata = "idigbio")

    # Merge all
    all_occ <- bind_here(gbif_standardized, splink_standardized,
                         bien_standardized, idig_standardized)
Checks whether each record falls within the country assigned in its metadata.
    check_countries(
      occ,
      long = "decimalLongitude",
      lat = "decimalLatitude",
      country_column,
      distance = 5,
      try_to_fix = FALSE,
      progress_bar = FALSE,
      verbose = TRUE
    )
occ |
(data.frame) a dataset with occurrence records, preferably with
country information standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) column name containing the country information. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the country assigned in the |
try_to_fix |
(logical) whether to check if coordinates are inverted or
transposed (see |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
The original occ data.frame with an additional column (correct_country)
indicating whether each record falls within the country specified in the
metadata (TRUE) or not (FALSE).
    # Load example data
    data("occurrences", package = "RuHere") # Import example data

    # Standardize country names
    occ_country <- standardize_countries(occ = occurrences,
                                         return_dictionary = FALSE)

    # Check whether records fall within assigned countries
    occ_country_checked <- check_countries(occ = occ_country,
                                           country_column = "country_suggested")
Checks whether each record falls within the state assigned in its metadata.
    check_states(
      occ,
      long = "decimalLongitude",
      lat = "decimalLatitude",
      state_column,
      distance = 5,
      try_to_fix = FALSE,
      progress_bar = FALSE,
      verbose = TRUE
    )
occ |
(data.frame) a dataset with occurrence records, preferably with
country information standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
state_column |
(character) column name containing the state information. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the state assigned in the |
try_to_fix |
(logical) whether to check if coordinates are inverted or
transposed (see |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
The original occ data.frame with an additional column (correct_state)
indicating whether each record falls within the state specified in the
metadata (TRUE) or not (FALSE).
    # Load example data
    data("occurrences", package = "RuHere") # Import example data

    # Subset occurrences for Araucaria angustifolia
    occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]

    # Standardize country names
    occ_country <- standardize_countries(occ = occ,
                                         return_dictionary = FALSE)

    # Standardize state names
    occ_state <- standardize_states(occ = occ_country,
                                    country_column = "country_suggested",
                                    return_dictionary = FALSE)

    # Check whether records fall within assigned states
    occ_state_checked <- check_states(occ = occ_state,
                                      state_column = "state_suggested")
country_dictionary provides a set of lookup tables used to standardize
country names and country codes in occurrence datasets.
The dictionary is built from rnaturalearthdata::map_units110
and consolidates a wide variety of country name variants (in several
languages and formats), as well as multiple coding systems, into a single
suggested standardized name.
This object is used internally by functions that clean or harmonize
country fields, ensuring that country names in occurrence datasets (e.g.,
"Brasil", "brasil", "BR", "BRA", "République Française") are all
mapped consistently to a single standardized form ("brazil", "france",
etc.).
    country_dictionary
A named list of two data frames:
country_name: A data frame with two columns:
  country_name: Character. Lowercased and accent-stripped country
  name variants (from multiple rnaturalearthdata fields such as
  name, name_long, abbrev, formal_en, and alternative names in
  several languages).
  country_suggested: Character. The standardized country name,
  derived from the name column of map_units110, also lowercased and
  accent-stripped.
country_code: A data frame with two columns:
  country_code: Character. Country codes from several systems, including ISO-2, ISO-3, FIPS, postal codes, and others, after filtering invalid or ambiguous codes.
  country_suggested: Character. The standardized country name corresponding to each code.
The dictionary is generated by:
extracting multiple name and code fields from
rnaturalearthdata::map_units110,
converting names to lowercase and removing accents,
converting codes to uppercase,
removing invalid or ambiguous codes (e.g., -99, "J", various
country mismatches),
and ensuring uniqueness across all entries.
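The lookup logic that such a dictionary enables can be sketched in base R. Everything below is an assumption made for illustration: `toy_dictionary` is a tiny stand-in with the same two-table layout, `normalize_name()` handles only a handful of accents, and the order of the lookups (names first, then codes) is a guess rather than the package's documented behavior.

```r
# Toy stand-in for country_dictionary (hypothetical, far smaller than the real one)
toy_dictionary <- list(
  country_name = data.frame(
    country_name      = c("brasil", "brazil", "republique francaise", "france"),
    country_suggested = c("brazil", "brazil", "france", "france"),
    stringsAsFactors  = FALSE
  ),
  country_code = data.frame(
    country_code      = c("BR", "BRA", "FR", "FRA"),
    country_suggested = c("brazil", "brazil", "france", "france"),
    stringsAsFactors  = FALSE
  )
)

# Lowercase and strip a few common accents (explicit mapping, for portability)
normalize_name <- function(x) {
  accented <- "\u00e1\u00e0\u00e2\u00e3\u00e9\u00e8\u00ea\u00ed\u00f3\u00f4\u00f5\u00fa\u00e7"
  plain    <- "aaaaeeeiooouc"
  chartr(accented, plain, tolower(x))
}

standardize_country <- function(x, dictionary) {
  by_name <- dictionary$country_name
  by_code <- dictionary$country_code
  # Names are matched after lowercasing and accent removal
  from_name <- by_name$country_suggested[match(normalize_name(x),
                                               by_name$country_name)]
  # Codes are matched after uppercasing
  from_code <- by_code$country_suggested[match(toupper(x),
                                               by_code$country_code)]
  ifelse(is.na(from_name), from_code, from_name)
}

raw <- c("Brasil", "brasil", "BR", "BRA",
         "R\u00e9publique Fran\u00e7aise", "Atlantis")
standardize_country(raw, toy_dictionary)
```

Entries that match neither table (here, "Atlantis") come back as `NA`, which makes unresolved country names easy to inspect.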
    data(country_dictionary)
    head(country_dictionary$country_name)
    head(country_dictionary$country_code)
Extracts the country for each occurrence record based on coordinates.
    country_from_coords(
      occ,
      long = "decimalLongitude",
      lat = "decimalLatitude",
      country_column = NULL,
      from = "all",
      output_column = "country_xy",
      append_source = FALSE
    )
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) the column name containing the country.
Only applicable if |
from |
(character) whether to extract the country for all records ('all') or only for records missing country information ('na_only'). If 'na_only', you must provide the name of the column with country information. Default is 'all'. |
output_column |
(character) column name created in |
append_source |
(logical) whether to create a new column in |
The countries are extracted from coordinates using a map retrieved from
rnaturalearthdata::map_units110.
The original occ data.frame with an additional column containing the
countries extracted from coordinates.
    # Import and standardize GBIF
    data("occ_gbif", package = "RuHere") # Import example data
    gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
    gbif_countries <- country_from_coords(occ = gbif_standardized)
This function creates a metadata template to be used in format_columns()
for formatting and standardizing column names and classes in occurrence
datasets.
All column names specified as arguments must be present in the occ dataset.
If you obtained data from GBIF, SpeciesLink, BIEN or iDigBio using the functions provided in the RuHere package, you do not need to use this function, as the package already includes metadata templates for these datasets.
    create_metadata(
      occ,
      scientificName,
      decimalLongitude,
      decimalLatitude,
      collectionCode = NA,
      catalogNumber = NA,
      coordinateUncertaintyInMeters = NA,
      elevation = NA,
      country = NA,
      stateProvince = NA,
      municipality = NA,
      locality = NA,
      year = NA,
      eventDate = NA,
      recordedBy = NA,
      identifiedBy = NA,
      basisOfRecord = NA,
      occurrenceRemarks = NA,
      habitat = NA,
      datasetName = NA,
      datasetKey = NA,
      key = NA
    )
occ |
(data.frame or data.table) a dataset with occurrence records to be standardized. |
scientificName |
(character) column name in |
decimalLongitude |
(character) column name in |
decimalLatitude |
(character) column name in |
collectionCode |
(character) an optional column name in |
catalogNumber |
(character) an optional column name in |
coordinateUncertaintyInMeters |
(character) an optional column name with the coordinate uncertainty in meters. |
elevation |
(character) an optional column name with the elevation information. |
country |
(character) an optional column name with the country of the record. |
stateProvince |
(character) an optional column name with the state or province of the record. |
municipality |
(character) an optional column name with the municipality of the record. |
locality |
(character) an optional column name with the locality description. |
year |
(character) an optional column name with the year when the occurrence was recorded. |
eventDate |
(character) an optional column name with the event date. |
recordedBy |
(character) an optional column name with the name of the collector or recorder. |
identifiedBy |
(character) an optional column name with the name of the identifier. |
basisOfRecord |
(character) an optional column name with the basis of record. |
occurrenceRemarks |
(character) an optional column name with remarks about the occurrence. |
habitat |
(character) an optional column name with the habitat description. |
datasetName |
(character) an optional column name with the dataset name. |
datasetKey |
(character) an optional column name with the dataset key. |
key |
(character) an optional column name with the unique occurrence identifier. |
A data.frame containing a metadata template that can be directly used in
the format_columns() function.
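Conceptually, a metadata template is a mapping from standardized field names to the column names used in the raw dataset. The base R sketch below shows how such a mapping could drive renaming. The structure of `column_map`, the helper `apply_column_map()`, the decision to keep unmapped fields as `NA`, and the coordinate values are all assumptions for illustration; the actual template returned by create_metadata() and the behavior of format_columns() may differ.

```r
# Hypothetical mapping: names are standardized fields, values are the
# corresponding columns in the raw dataset (NA means "not available").
column_map <- c(
  scientificName   = "actual_species_name",
  decimalLongitude = "longitude",
  decimalLatitude  = "latitude",
  elevation        = "altitude",
  habitat          = NA
)

# Toy raw record with illustrative (made-up) values
raw <- data.frame(
  actual_species_name = "Puma concolor",
  longitude = -49.27, latitude = -25.43, altitude = 930,
  stringsAsFactors = FALSE
)

apply_column_map <- function(occ, column_map) {
  provided <- column_map[!is.na(column_map)]
  # Every mapped column must be present in the dataset
  absent <- setdiff(provided, names(occ))
  if (length(absent) > 0) {
    stop("Columns not found in occ: ", paste(absent, collapse = ", "))
  }
  out <- occ[, provided, drop = FALSE]
  names(out) <- names(provided)
  # Fields without a source column are kept as NA (an assumption of this sketch)
  for (field in names(column_map)[is.na(column_map)]) {
    out[[field]] <- rep(NA, nrow(out))
  }
  out[, names(column_map), drop = FALSE]
}

standardized <- apply_column_map(raw, column_map)
names(standardized)
```

The early `stop()` mirrors the requirement that all column names passed as arguments must be present in the `occ` dataset.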
    # Load data example
    # Occurrences of Puma concolor from the atlanticr R package
    data("puma_atlanticr", package = "RuHere")

    # Create metadata to standardize the occurrences
    puma_metadata <- create_metadata(occ = puma_atlanticr,
                                     scientificName = "actual_species_name",
                                     decimalLongitude = "longitude",
                                     decimalLatitude = "latitude",
                                     elevation = "altitude",
                                     country = "country",
                                     stateProvince = "state",
                                     municipality = "municipality",
                                     locality = "study_location",
                                     year = "year_finish",
                                     habitat = "vegetation_type",
                                     datasetName = "reference")

    # Now, we can use this metadata to standardize the columns
    puma_occ <- format_columns(occ = puma_atlanticr,
                               metadata = puma_metadata,
                               binomial_from = "actual_species_name",
                               data_source = "atlanticr")
cultivated is a list of character vectors containing keywords used to
identify whether an occurrence record refers to cultivated or
non-cultivated individuals.
This object is used internally by flag_cultivated() to scan occurrence
fields (such as notes, habitat descriptions, or remarks) and classify
records as cultivated or not cultivated based on textual patterns.
The list combines terms from plantR (plantR:::cultivated and
plantR:::notCultivated) with additional multilingual variants commonly
found in herbarium metadata.
    cultivated
A named list with two elements:
cultivated: Character vector. Terms that indicate an individual is
cultivated. Imported from plantR:::cultivated.
not_cultivated: Character vector. Terms suggesting an individual is
not cultivated (e.g., “not cultivated”, “not planted”, “no plantada”,
“no cultivada”), including terms from plantR:::notCultivated.
These terms are matched case-insensitively after text cleaning (e.g., lowercasing and accent removal).
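A minimal sketch of this kind of keyword matching, in base R, is shown below. The term lists are a small hypothetical subset (only the negated terms come from the description above), accent removal is omitted, and the return convention (TRUE for records that look cultivated) is an assumption; flag_cultivated() may differ on all three points.

```r
# Toy keyword lists (hypothetical subset; the real `cultivated` object is larger)
toy_terms <- list(
  cultivated     = c("cultivated", "planted", "cultivada", "plantada"),
  not_cultivated = c("not cultivated", "not planted",
                     "no plantada", "no cultivada")
)

# TRUE for each text that contains at least one of the terms (literal matching)
has_any <- function(text, terms) {
  vapply(text, function(t) {
    any(vapply(terms, grepl, logical(1), x = t, fixed = TRUE))
  }, logical(1), USE.NAMES = FALSE)
}

flag_cultivated_sketch <- function(remarks, terms) {
  clean <- tolower(remarks)  # accent removal omitted in this sketch
  positive <- has_any(clean, terms$cultivated)
  negated  <- has_any(clean, terms$not_cultivated)
  # Negations take priority: "not cultivated" also contains "cultivated"
  positive & !negated
}

remarks <- c("Cultivated in the botanical garden",
             "Wild tree, NOT PLANTED",
             "Forest edge",
             NA)
flag_cultivated_sketch(remarks, toy_terms)
```

Checking the negated terms separately matters because every negated phrase contains a positive keyword as a substring.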
de Lima, Renato AF, et al. plantR: An R package and workflow for managing species records from biological collections. Methods in Ecology and Evolution, 14.2 (2023): 332-339.
flag_cultivated
    data(cultivated)
    cultivated$cultivated
    cultivated$not_cultivated
fake_data is a synthetic dataset created for testing functions that validate
and correct country- or state-level geographic coordinates.
Controlled coordinate errors were introduced (e.g., inverted signs, swapped values, combinations of swaps and inversions) to simulate common georeferencing mistakes.
This dataset is intended for automated testing of functions such as
check_countries() and check_states().
    fake_data
A data frame with the same structure as all_occ, containing
occurrence records with intentionally manipulated coordinates.
An additional column data_source = "fake_data" identifies these records.
The coordinate errors include:
Inverted longitude: multiplying longitude by -1.
Inverted latitude: multiplying latitude by -1.
Both coordinates inverted.
Swapped coordinates: (lon, lat) → (lat, lon).
Swapped + inverted in four combinations:
swapped only,
swapped + inverted longitude,
swapped + inverted latitude,
swapped + both inverted.
    data(fake_data)
This function downloads the Taxonomic Catalog of the Brazilian Fauna
database, which is required for filtering occurrence records using
specialists' information via the flag_faunabr() function.
    faunabr_here(
      data_dir,
      data_version = "latest",
      solve_discrepancy = TRUE,
      overwrite = TRUE,
      remove_files = TRUE,
      verbose = TRUE
    )
data_dir |
(character) a directory to save the data downloaded from Fauna do Brasil. |
data_version |
(character) version of the Fauna do Brasil database to download. Use "latest" to get the most recent version, which is updated frequently. Alternatively, specify an older version (e.g., data_version = "1.2"). Default value is "latest". |
solve_discrepancy |
(logical) whether to resolve inconsistencies between species and subspecies information. When set to TRUE (default), species information is updated based on unique data from subspecies. For example, if a subspecies occurs in a certain state, it implies that the species also occurs in that state. |
overwrite |
(logical) If TRUE, data is overwritten. Default is TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
A message indicating that the data were successfully saved in the directory
specified by data_dir.
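The idea behind `solve_discrepancy = TRUE` can be illustrated with a toy table in base R: states recorded for any subspecies are pooled into the parent species. The table layout, the taxon names, and the semicolon-separated `states` column are assumptions made for this sketch and do not reflect the actual structure of the downloaded database.

```r
# Toy distribution table (hypothetical names, layout, and states)
dist <- data.frame(
  species    = c("Genus alpha", "Genus alpha", "Genus alpha"),
  subspecies = c(NA, "Genus alpha alpha", "Genus alpha beta"),
  states     = c("PR", "PR;SC", "RS"),
  stringsAsFactors = FALSE
)

solve_discrepancy_sketch <- function(dist) {
  # Rows without a subspecies name hold the species-level information
  for (i in which(is.na(dist$subspecies))) {
    same_species <- dist$species == dist$species[i]
    # Pool states from the species row and all of its subspecies
    pooled <- unlist(strsplit(dist$states[same_species], ";", fixed = TRUE))
    dist$states[i] <- paste(sort(unique(pooled)), collapse = ";")
  }
  dist
}

solved <- solve_discrepancy_sketch(dist)
solved$states[1]
```

Only the species-level row changes; the subspecies rows keep their original states.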
    # Define a directory to save the data
    data_dir <- tempdir() # Here, a temporary directory

    # Download the latest version of the Taxonomic Catalog of the Brazilian Fauna
    faunabr_here(data_dir = data_dir)
This function identifies and corrects inverted and transposed coordinates based on country information.
    fix_countries(
      occ,
      long = "decimalLongitude",
      lat = "decimalLatitude",
      country_column,
      correct_country = "correct_country",
      distance = 5,
      progress_bar = FALSE,
      verbose = TRUE
    )
occ |
(data.frame) a dataset with occurrence records, preferably with
country information checked using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
country_column |
(character) name of the column containing the country information. |
correct_country |
(character) name of the column with logical value indicating whether each record falls within the country specified in the metadata. Default is 'correct_country'. See details. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the country assigned in the |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress.
Default is TRUE. |
The function checks and corrects coordinate errors in occurrence records
by testing whether each point falls within the expected country polygon
(from RuHere’s internal world map).
The input occurrence data must contain a column (specified in the
correct_country argument) with logical values indicating which records to
check and fix — only those marked as FALSE will be processed. This column
can be obtained by running the check_countries() function.
It runs a series of seven tests to detect common issues such as inverted signs or swapped latitude/longitude values. Inverted coordinates have their signs flipped (e.g., -45 instead of 45), placing the point in the opposite hemisphere, while swapped coordinates have latitude and longitude values exchanged (e.g., -47, -15 instead of -15, -47).
For each test, country borders are buffered by distance km to account for
minor positional errors.
The type of issue (or "correct") is recorded in a new column,
country_issues. Records that match their assigned country after any
correction are updated accordingly, while remaining mismatches are labeled
"incorrect".
This function can be used internally by check_countries() to automatically
identify and fix common coordinate errors.
The original occ data.frame with the coordinates in the long and lat
columns corrected, and an additional column (country_issues) indicating
whether the coordinates are:
correct: the record falls within the assigned country;
inverted: longitude and/or latitude have reversed signs;
swapped: longitude and latitude are transposed (i.e., each appears in the other's column);
incorrect: the record falls outside the assigned country and could not be corrected.
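The seven tests can be sketched in base R by generating every sign inversion, the plain swap, and their combinations, then keeping the first candidate that lands in the expected area. This is only an illustration under stated assumptions: a rough bounding box (`br`) stands in for the country polygon, the buffer uses a crude kilometer-to-degree conversion, the order of the tests and the descriptive issue labels are choices made for this sketch, and none of the helper names belong to RuHere.

```r
# Seven candidate corrections: sign inversions, a swap, and their combinations
candidates <- function(long, lat) {
  data.frame(
    issue = c("inverted longitude", "inverted latitude", "both inverted",
              "swapped", "swapped + inverted longitude",
              "swapped + inverted latitude", "swapped + both inverted"),
    long = c(-long,  long, -long, lat,  -lat,  lat,   -lat),
    lat  = c( lat,  -lat,  -lat,  long,  long, -long, -long),
    stringsAsFactors = FALSE
  )
}

# Point-in-box test with a tolerance (stand-in for a buffered polygon test)
inside_box <- function(long, lat, box, distance_km = 5) {
  tol <- distance_km / 111.32  # crude km-to-degree conversion
  long >= box[["xmin"]] - tol & long <= box[["xmax"]] + tol &
    lat >= box[["ymin"]] - tol & lat <= box[["ymax"]] + tol
}

fix_point <- function(long, lat, box, distance_km = 5) {
  if (inside_box(long, lat, box, distance_km)) {
    return(data.frame(long = long, lat = lat, issue = "correct",
                      stringsAsFactors = FALSE))
  }
  cand <- candidates(long, lat)
  ok <- inside_box(cand$long, cand$lat, box, distance_km)
  if (!any(ok)) {
    return(data.frame(long = long, lat = lat, issue = "incorrect",
                      stringsAsFactors = FALSE))
  }
  # Keep the first correction that places the point inside the box
  cand[which(ok)[1], c("long", "lat", "issue")]
}

# Rough bounding box for Brazil (approximate, for illustration only)
br <- c(xmin = -74, xmax = -34.8, ymin = -33.75, ymax = 5.27)

fix_point(-47, -15, br)   # already inside the box
fix_point(-15, -47, br)   # longitude and latitude swapped
fix_point(47, -15, br)    # longitude sign inverted
fix_point(100, 60, br)    # no candidate falls inside the box
```

A bounding box is far more permissive than a real border, so some points may satisfy more than one test; this sketch simply takes the first match.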
    # Load example data
    data("occurrences", package = "RuHere") # Import example data

    # Standardize country names
    occ_country <- standardize_countries(occ = occurrences,
                                         return_dictionary = FALSE)

    # Check whether records fall within the assigned countries
    occ_country_checked <- check_countries(occ = occ_country,
                                           country_column = "country_suggested")

    # Fix records with incorrect or misassigned countries
    occ_country_fixed <- fix_countries(occ = occ_country_checked,
                                       country_column = "country_suggested")
This function identifies and corrects inverted and transposed coordinates based on state information.
    fix_states(
      occ,
      long = "decimalLongitude",
      lat = "decimalLatitude",
      state_column,
      correct_state = "correct_state",
      distance = 5,
      progress_bar = FALSE,
      verbose = TRUE
    )
occ |
(data.frame) a dataset with occurrence records, preferably with
state information checked using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
state_column |
(character) name of the column containing the state information. |
correct_state |
(character) name of the column with logical value indicating whether each record falls within the state specified in the metadata. Default is 'correct_state'. See details. |
distance |
(numeric) maximum distance (in kilometers) a record can fall
outside the state assigned in the |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
FALSE. |
verbose |
(logical) whether to print messages about function progress. Default is TRUE. |
The function checks and corrects coordinate errors in occurrence records
by testing whether each point falls within the expected state polygon
(from RuHere’s internal world map).
The input occurrence data must contain a column (specified in the
correct_state argument) with logical values indicating which records to
check and fix — only those marked as FALSE will be processed. This column
can be obtained by running the check_states() function.
It runs a series of seven tests to detect common issues such as inverted signs or swapped latitude/longitude values. Inverted coordinates have their signs flipped (e.g., -45 instead of 45), placing the point in the opposite hemisphere, while swapped coordinates have latitude and longitude values exchanged (e.g., -47, -15 instead of -15, -47).
For each test, state borders are buffered by distance km to account for
minor positional errors.
The type of issue (or "correct") is recorded in a new column,
state_issues. Records that match their assigned state after any
correction are updated accordingly, while remaining mismatches are labeled
"incorrect".
This function can be used internally by check_states() to automatically
identify and fix common coordinate errors.
The original occ data.frame with the coordinates in the long and lat
columns corrected, and an additional column (state_issues) indicating
whether the coordinates are:
correct: the record falls within the assigned state;
inverted: longitude and/or latitude have reversed signs;
swapped: longitude and latitude are transposed (i.e., each appears in the other's column).
incorrect: the record falls outside the assigned state and could not be corrected.
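To make the inverted and swapped categories concrete, the sketch below enumerates candidate corrections for a single coordinate pair in base R. The helper `candidate_fixes()` and its issue labels are hypothetical and not part of RuHere; the way seven candidates are obtained here (three sign inversions, one swap, and three swaps combined with sign inversions) is an assumption, and the tests actually run by fix_states() may differ.

```r
# Hypothetical helper (not a RuHere function): list candidate corrections
# for one coordinate pair. The combination of tests is an assumption.
candidate_fixes <- function(long, lat) {
  data.frame(
    issue = c("inverted_long", "inverted_lat", "inverted_both",
              "swapped", "swapped_inverted_long", "swapped_inverted_lat",
              "swapped_inverted_both"),
    long = c(-long,  long, -long,  lat, -lat,   lat,  -lat),
    lat  = c( lat,  -lat,  -lat,  long, long, -long, -long)
  )
}

# A record stored as (long = -15, lat = -47) that should be (-47, -15)
cand <- candidate_fixes(long = -15, lat = -47)
cand[cand$issue == "swapped", ]
```

In a full workflow, each candidate would then be tested against the buffered state polygon, and the first one that falls inside it would be kept.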
# Load example data
data("occurrences", package = "RuHere")
# Subset records of Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Standardize country names
occ_country <- standardize_countries(occ = occ, return_dictionary = FALSE)
# Standardize state names
occ_state <- standardize_states(occ = occ_country,
                                country_column = "country_suggested",
                                return_dictionary = FALSE)
# Check whether records fall within the assigned states
occ_states_checked <- check_states(occ = occ_state,
                                   state_column = "state_suggested")
# Fix records with incorrect or misassigned states
occ_states_fixed <- fix_states(occ = occ_states_checked,
                               state_column = "state_suggested")
Flags (validates) occurrence records based on known distribution data
from the Botanical Information and Ecology Network (BIEN) data. This function
checks if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around the region. Records
are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the BIEN dataset.
flag_bien(
  data_dir,
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  buffer = 10,
  progress_bar = FALSE,
  verbose = TRUE
)
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 10 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) if |
A data.frame that is the original occ data frame
augmented with a new column named bien_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the BIEN
data. Records for species not found in the BIEN data will have
NA in the bien_flag column.
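Because the flag column can hold TRUE, FALSE, or NA, it helps to handle the three cases explicitly when filtering. The sketch below uses a made-up data frame that only mimics the structure described above; the species names and values are hypothetical.

```r
# Toy output mimicking the described structure (values are made up)
occ_bien <- data.frame(
  species = c("sp_a", "sp_a", "sp_b"),
  bien_flag = c(TRUE, FALSE, NA)
)

# Count valid, flagged, and unevaluated (NA) records
table(occ_bien$bien_flag, useNA = "ifany")

# Keep records that are valid or that could not be evaluated
occ_keep <- occ_bien[is.na(occ_bien$bien_flag) | occ_bien$bien_flag, ]
```

Whether unevaluated records should be kept or dropped is a decision for the analyst; the filter above keeps them.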
# Load example data
data("occurrences", package = "RuHere")
# Filter occurrences for golden trumpet tree
occ <- occurrences[occurrences$species == "Handroanthus serratifolius", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'bien_here()' beforehand to download the necessary data files
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using BIEN specialist information
occ_bien <- flag_bien(data_dir = dataset_dir, occ = occ)
flag_colors is a named character vector defining the default colors used to
plot occurrence records flagged with mapview_here().
flag_colors
A named character vector where:
Flag labels corresponding to categories generated by the
various flag_* and checking functions.
Hex color codes or standard R color names used for plotting.
mapview_here
data(flag_colors)
# View all flag categories and their colors
flag_colors
This function creates a new column representing the consensus across multiple flag columns. The consensus can be computed in two ways:
"all_true": A record is considered valid (TRUE) only if all
specified flags are valid (TRUE).
"any_true": A record is considered valid (TRUE) if at least one
specified flag is valid (TRUE).
flag_consensus(
  occ,
  flags,
  consensus_rule = "all_true",
  flag_name = "consensus_flag",
  remove_flag_columns = FALSE
)
occ |
(data.frame or data.table) a dataset with occurrence records that has been processed by two or more flagging functions. |
flags |
(character) a string vector with the names of the flags to be used in the consensus evaluation. See Details for the available options. |
consensus_rule |
(character) A string specifying how the consensus
should be computed. Options are |
flag_name |
(character) name of the column that will store the
consensus result. Default is |
remove_flag_columns |
(logical) whether to remove the original flag
columns specified in |
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, year, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
The original occ with an additional logical column defined by
flag_name, indicating the consensus result based on the selected
consensus_rule.
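The two consensus rules can be illustrated with a few lines of base R. This is a minimal sketch on a made-up table of flags; the column names are hypothetical, and ignoring NA values is an assumption, since the source does not say how flag_consensus() treats them.

```r
# Toy flag columns (hypothetical names and values)
flags <- data.frame(
  florabr_flag = c(TRUE, TRUE, FALSE, FALSE),
  wcvp_flag    = c(TRUE, FALSE, FALSE, NA)
)
m <- as.matrix(flags)

# "all_true": valid only if no flag is FALSE (NA ignored; assumption)
all_true <- rowSums(!m, na.rm = TRUE) == 0

# "any_true": valid if at least one flag is TRUE (NA ignored; assumption)
any_true <- rowSums(m, na.rm = TRUE) > 0

data.frame(flags, all_true, any_true)
```

The "all_true" rule is the stricter of the two: a single FALSE flag is enough to mark the record as invalid.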
# Load example data
data("occ_flagged", package = "RuHere")
# Get consensus using florabr, wcvp, and iucn flags
# Valid (TRUE) only when all flags are TRUE
occ_consensus_all <- flag_consensus(occ = occ_flagged,
                                    flags = c("florabr", "wcvp", "iucn"),
                                    consensus_rule = "all_true")
# Valid (TRUE) when at least one flag is TRUE
occ_consensus_any <- flag_consensus(occ = occ_flagged,
                                    flags = c("florabr", "wcvp", "iucn"),
                                    consensus_rule = "any_true")
This function identifies records of cultivated individuals based on record description.
flag_cultivated(
  occ,
  columns = c("occurrenceRemarks", "habitat", "locality"),
  cultivated_terms = NULL,
  not_cultivated_terms = NULL
)
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) vector of column names in |
cultivated_terms |
(character) optional vector of additional terms that
indicate a cultivated individual. Default is NULL, meaning it will use the
cultivated-related expressions available in |
not_cultivated_terms |
(character) optional vector of additional terms
that indicate a non-cultivated individual. Default is NULL, meaning it will
use the non cultivated-related expressions available in
|
A data.frame that is the original occ data frame augmented with
a new column named cultivated_flag. Records identified as cultivated
receive FALSE, while all other records receive TRUE.
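The term-based flagging can be sketched in base R as follows. The term lists and the toy records are hypothetical, and the matching strategy (case-insensitive substring search, with non-cultivated terms overriding cultivated ones) is an assumption rather than the package's documented behavior.

```r
# Toy records (made up)
occ <- data.frame(
  occurrenceRemarks = c("Cultivated in botanical garden", "Native forest", NA),
  locality = c("Campus", "Not cultivated, roadside", "Planted in a plaza")
)
occ[is.na(occ)] <- ""  # treat missing text as empty

cultivated_terms <- c("cultivated", "planted")  # hypothetical terms
not_cultivated_terms <- c("not cultivated")     # hypothetical terms

# One lower-case string per record, built from the selected columns
txt <- tolower(paste(occ$occurrenceRemarks, occ$locality))

# TRUE where any of the terms occurs in the text
has_term <- function(terms, x) {
  Reduce(`|`, lapply(terms, function(t) grepl(t, x, fixed = TRUE)))
}

is_cultivated <- has_term(cultivated_terms, txt) &
  !has_term(not_cultivated_terms, txt)

# Same convention as above: cultivated records receive FALSE
occ$cultivated_flag <- !is_cultivated
```

The second record shows why a list of non-cultivated expressions is useful: "Not cultivated" contains the word "cultivated" but should not be flagged.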
# Load example data
data("occurrences", package = "RuHere")
# Flag cultivated records
occ_cultivated <- flag_cultivated(occ = occurrences)
This function identifies duplicated records based on species name and coordinates, as well as user-defined additional columns or raster cells. Among duplicated records, the function keeps only one unflagged record, chosen according to a continuous variable (e.g., keeping the most recent), a categorical variable (e.g., prioritizing a specific data source), or randomly.
flag_duplicates(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  additional_groups = NULL,
  continuous_variable = NULL,
  decreasing = TRUE,
  categorical_variable = NULL,
  priority_categories = NULL,
  by_cell = FALSE,
  raster_variable = NULL
)
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
species |
(character) the name of the column containing species names. Default is "species". |
long |
(character) the name of the column containing longitude values.
Default is |
lat |
(character) the name of the column containing latitude values.
Default is |
additional_groups |
(character) optional vector of additional column
names to consider when identifying duplicates. For example, if |
continuous_variable |
(character) optional name of a numeric column used
to sort duplicated records and select one to remain unflagged. Default is
|
decreasing |
(logical) whether to sort records in decreasing order using
the |
categorical_variable |
(character) optional name of a
categorical column used to sort duplicated records and select one to remain
unflagged. If provided, the order of priority must be specified through
|
priority_categories |
(character) vector of categories, in the desired
order of priority, present in the column specified in |
by_cell |
(logical) whether to use raster cells instead of raw
coordinates to identify duplicates (i.e., all records inside the same raster
cell are treated as duplicates). If |
raster_variable |
(SpatRaster) a |
A data.frame that is the original occ data frame augmented with
a new column named duplicated_flag. Records identified as duplicated
receive FALSE, while all unique retained records receive TRUE.
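The idea of keeping one record per duplicate group according to a continuous variable can be sketched in base R. The data below are made up, and the approach (sort by year, then apply `duplicated()` to the species and coordinate columns) is only one possible way to obtain the behavior described above.

```r
# Toy records: rows 1 and 2 share species and coordinates
occ <- data.frame(
  species = c("sp_a", "sp_a", "sp_a", "sp_b"),
  decimalLongitude = c(-50, -50, -49, -50),
  decimalLatitude  = c(-25, -25, -25, -25),
  year = c(1990, 2015, 2001, 2010)
)
key_cols <- c("species", "decimalLongitude", "decimalLatitude")

# Sort so the most recent record comes first
ord <- order(occ$year, decreasing = TRUE)
sorted <- occ[ord, ]

# The first record of each group stays unflagged (TRUE)
sorted$duplicated_flag <- !duplicated(sorted[, key_cols])

# Restore the original row order
occ_dup <- sorted[order(ord), ]
occ_dup
```

Here the 1990 record is flagged (FALSE) because a more recent record exists at the same coordinates for the same species.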
# Load example data
data("occurrences", package = "RuHere")
# Duplicate some records as example
occurrences <- rbind(occurrences[1:1000, ], occurrences[1:100, ])
# Flag duplicates
occ_dup <- flag_duplicates(occ = occurrences)
sum(!occ_dup$duplicated_flag) # Number of duplicated records
This function evaluates multiple environmentally thinned datasets (produced using different numbers of bins) and selects the one that best balances low spatial autocorrelation and number of retained records.
For each number of bins provided in n_bins, the function computes Moran's I
for the selected environmental variables and summarizes autocorrelation using
a chosen statistic (mean, median, minimum, or maximum). The best thinning
level is then selected according to criteria described in Details.
flag_env_moran(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  env_layers,
  n_bins,
  distance = "haversine",
  moran_summary = "mean",
  min_records = 10,
  min_imoran = 0.1,
  prioritary_column = NULL,
  decreasing = TRUE,
  do_pca = FALSE,
  mask = NULL,
  pca_buffer = 1000,
  flag_for_NA = FALSE,
  return_all = FALSE,
  verbose = TRUE
)
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables for
splitting in |
n_bins |
(numeric) vector of number of bins into which each environmental variable will be divided (e.g., c(5, 10, 15, 20)). |
distance |
(character) distance metric used to compute the weight matrix
for Moran's I. One of |
moran_summary |
(character) summary statistic used to select the best
number of bins. One of |
min_records |
(numeric) minimum number of records required for a dataset
to be considered. Default: |
min_imoran |
(numeric) minimum Moran's I required to avoid selecting
datasets with extremely low spatial autocorrelation. Default: |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
do_pca |
(logical) whether environmental variables should be summarized
using PCA before computing Moran's I. Default: |
mask |
(SpatVector or SpatExtent) optional spatial object to mask the
|
pca_buffer |
(numeric) buffer width (km) used when PCA is computed from
the convex hull of records. Ignored if |
flag_for_NA |
(logical) whether to treat records falling in |
return_all |
(logical) whether to return the full list of all thinned
datasets. Default is |
verbose |
(logical) whether to print messages about the progress.
Default is |
This function is inspired by the approach used in Velazco et al. (2021), extending the procedure by allowing:
prioritization of records based on a user-defined variable (e.g., year)
optional PCA transformation of environmental layers
selection rules that prevent datasets with too few records or extremely low Moran's I from being chosen.
Procedure overview
For each bin number in n_bins, generate an environmentally thinned dataset
using the thin_env() function.
Extract environmental values for the retained records.
Compute Moran's I for each environmental variable.
Summarize autocorrelation per dataset (mean, median, min, or max).
Apply the selection criteria:
Keep only datasets with at least min_records records.
Keep only datasets with Moran's I greater than or equal to min_imoran.
Round Moran's I to two decimal places and select the dataset with the 25th lowest autocorrelation.
If more than one dataset is selected, choose the dataset retaining more records.
If still tied, choose the dataset with the largest number of bins.
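The filtering and tie-breaking steps above can be sketched on a summary table in base R. The table and its column names are hypothetical. As a stated simplification, the autocorrelation step keeps the datasets with the minimum rounded Moran's I rather than reproducing the percentile-based rule, so this is not the package's exact selection logic.

```r
# Hypothetical summary: one row per candidate thinned dataset
imoran <- data.frame(
  n_bins    = c(5, 10, 20, 30),
  n_records = c(8, 25, 40, 40),
  moran     = c(0.12, 0.148, 0.152, 0.05)
)
min_records <- 10
min_imoran <- 0.1

# Drop datasets with too few records or too low Moran's I
cand <- imoran[imoran$n_records >= min_records &
                 imoran$moran >= min_imoran, ]

# Round Moran's I to two decimal places
cand$moran <- round(cand$moran, 2)

# Simplification: keep the minimum rounded Moran's I (ties allowed)
cand <- cand[abs(cand$moran - min(cand$moran)) < 1e-9, ]

# Tie-break 1: most retained records
cand <- cand[cand$n_records == max(cand$n_records), ]

# Tie-break 2: largest number of bins
best <- cand[which.max(cand$n_bins), ]
best
```

Rounding before comparing is what creates ties in this toy table: 0.148 and 0.152 both become 0.15, so the number of retained records decides between them.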
Distance matrix for Moran's I Moran's I requires a weight matrix derived from pairwise distances among records. Two distance types are available:
"haversine": geographic distance computed with fields::rdist.earth()
(default; recommended for longitude/latitude coordinates)
"euclidean": Euclidean distance computed with stats::dist()
Environmental PCA (optional)
If do_pca = TRUE, the environmental layers are summarized using PCA before
Moran's I is computed.
If mask is provided, PCA is computed on masked layers.
Otherwise, a convex hull around the records is buffered by pca_buffer
kilometers to define the PCA area.
The axes that together explain more than 90% of the variation are selected.
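The axis-selection rule can be sketched with `stats::prcomp()` on a plain table of environmental values. The simulated variables below are made up, and working on a data frame instead of raster layers is a simplification of what the function does with a SpatRaster.

```r
set.seed(1)

# Simulated environmental values at 100 locations (made up)
env <- data.frame(bio1 = rnorm(100), bio12 = rnorm(100))
env$bio5  <- env$bio1  + rnorm(100, sd = 0.1)  # correlated with bio1
env$bio13 <- env$bio12 + rnorm(100, sd = 0.1)  # correlated with bio12

# PCA on centered and scaled variables
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the axes
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of axes that together explain more than 90%
n_axes <- which(cum_var > 0.9)[1]
scores <- pca$x[, seq_len(n_axes), drop = FALSE]
```

Moran's I would then be computed on the columns of `scores` instead of on the original variables.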
A list with:
occ: the selected thinned occurrence dataset with the column
thin_env_flag indicating whether each record is retained (TRUE) or flagged
as redundant (FALSE) in the environmental space.
imoran: a table summarizing Moran's I for each number of bins
n_bins: the number of bins that produced the selected dataset
moran_summary: the summary statistic used to select the dataset
all_thined: (optional) list of thinned datasets for all bin numbers.
Only returned if return_all was set to TRUE
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Select thinned occurrences
occ_env_moran <- flag_env_moran(occ = occ,
                                n_bins = c(5, 10, 20, 30, 40, 50),
                                env_layers = r)
# Selected number of bins
occ_env_moran$n_bins
# Number of flagged and unflagged records
sum(occ_env_moran$occ$thin_env_flag) # Retained
sum(!occ_env_moran$occ$thin_env_flag) # Flagged for thinning out
# Results of the spatial autocorrelation analysis
occ_env_moran$imoran
Flags (validates) occurrence records based on known distribution data
from the Catálogo Taxonômico da Fauna do Brasil (faunabr) data. This function
checks if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around Brazilian states,
or the entire country. Records are flagged as valid (TRUE) if they fall
within the specified range for the distribution information available in the
faunabr data.
flag_faunabr(
  data_dir,
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  origin = NULL,
  by_state = TRUE,
  buffer_state = 20,
  by_country = TRUE,
  buffer_country = 20,
  keep_columns = TRUE,
  spat_state = NULL,
  spat_country = NULL,
  progress_bar = FALSE,
  verbose = FALSE
)
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) filter the |
by_state |
(logical) if |
buffer_state |
(numeric) buffer distance (in kilometers) to be applied around the known state distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_country |
(logical) if |
buffer_country |
(numeric) buffer distance (in kilometers) to be applied around the country boundaries. Records within this distance are considered valid. Default is 20 km. |
keep_columns |
(logical) if |
spat_state |
(SpatVector) a SpatVector of the Brazilian states. By default, it uses the SpatVector provided by geobr::read_state(). It can be another SpatVector, but the structure must be identical to 'faunabr::states', with a column called "abbrev_state" identifying the state codes. |
spat_country |
(SpatVector) a SpatVector of the world countries. By default, it uses the SpatVector provided by rnaturalearth::ne_countries. It can be another SpatVector, but the structure must be identical to 'faunabr::world_fauna', with a column called "country_code" identifying the country codes. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) if |
A data.frame that is the original occ data frame
augmented with a new column named faunabr_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the faunabr
data. Records for species not found in the faunabr data will have
NA in the faunabr_flag column.
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Azure Jay
occ <- occurrences[occurrences$species == "Cyanocorax caeruleus", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'faunabr_here()' beforehand to download the necessary data
# files for your species
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using faunabr specialist information
occ_fauna <- flag_faunabr(data_dir = dataset_dir, occ = occ)
Flags (validates) occurrence records based on known distribution data
from the Flora e Funga do Brasil (florabr) data. This function checks if an
occurrence point for a given species falls within its documented distribution,
allowing for user-defined buffers around Brazilian states, biomes, or the
entire country. Records are flagged as valid (TRUE) if they fall within
the specified range for the distribution information available in the
florabr data.
flag_florabr(
  data_dir,
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  origin = NULL,
  by_state = TRUE,
  buffer_state = 20,
  by_biome = TRUE,
  buffer_biome = 20,
  by_endemism = TRUE,
  buffer_brazil = 20,
  state_vect = NULL,
  state_column = NULL,
  biome_vect = NULL,
  biome_column = NULL,
  br_vect = NULL,
  keep_columns = TRUE,
  progress_bar = FALSE,
  verbose = FALSE
)
data_dir |
(character) directory path where the |
occ |
(data.frame) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character or NULL) filter the |
by_state |
(logical) if |
buffer_state |
(numeric) buffer distance (in kilometers) to be applied around the known state distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_biome |
(logical) if |
buffer_biome |
(numeric) buffer distance (in kilometers) to be applied around the known biome distribution boundaries. Records within this distance are considered valid. Default is 20 km. |
by_endemism |
(logical) if |
buffer_brazil |
(numeric) buffer distance (in kilometers) to be applied around the entire Brazilian boundary. Default is 20 km. |
state_vect |
(SpatVector) an optional custom simple features
( |
state_column |
(character) the name of the column in |
biome_vect |
(SpatVector) an optional custom simple features ( |
biome_column |
(character) the name of the column in |
br_vect |
(SpatVector) an optional custom simple features ( |
keep_columns |
(logical) if |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) if |
A data.frame that is the original occ data frame
augmented with a new column named florabr_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the florabr
data. Records for species not found in the florabr data will have
NA in the florabr_flag column.
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Set folder where distributional datasets were saved
# Here, just a sample provided in the package
# You must run 'florabr_here()' beforehand to download the necessary data
# files for your species
dataset_dir <- system.file("extdata/datasets", package = "RuHere")
# Flag records using specialist information from Flora do Brasil
occ_flora <- flag_florabr(data_dir = dataset_dir, occ = occ)
This function identifies occurrence records that correspond to fossils, based on specific search terms found in selected columns.
flag_fossil(
  occ,
  columns = c("basisOfRecord", "occurrenceRemarks"),
  fossil_terms = NULL
)
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) vector of column names in |
fossil_terms |
(character) optional vector of additional terms that
indicate a fossil record (e.g., |
A data.frame that is the original occ data frame augmented with
a new column named fossil_flag. Records identified as fossils receive
FALSE, while all other records receive TRUE.
# Load example data
data("occurrences", package = "RuHere")
# Flag fossil records
occ_fossil <- flag_fossil(occ = occurrences)
This function evaluates multiple geographically thinned datasets (produced using different thinning distances) and selects the one that best balances low spatial autocorrelation and number of retained records.
For each thinning distance provided in d, the function computes Moran's I
for the selected environmental variables and summarizes autocorrelation using
a chosen statistic (mean, median, minimum, or maximum). The best thinning
level is then selected according to criteria described in Details.
flag_geo_moran(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  d,
  distance = "haversine",
  moran_summary = "mean",
  min_records = 10,
  min_imoran = 0.1,
  prioritary_column = NULL,
  decreasing = TRUE,
  env_layers,
  do_pca = FALSE,
  mask = NULL,
  pca_buffer = 1000,
  return_all = FALSE,
  verbose = TRUE
)
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
d |
(numeric) vector of thinning distances in kilometers (e.g., c(5, 10, 15, 20)). |
distance |
(character) distance metric used to compute the weight matrix
for Moran's I. One of |
moran_summary |
(character) summary statistic used to select the best
thinning distance. One of |
min_records |
(numeric) minimum number of records required for a dataset
to be considered. Default: |
min_imoran |
(numeric) minimum Moran's I required to avoid selecting
datasets with extremely low spatial autocorrelation. Default: |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
env_layers |
(SpatRaster) object containing environmental variables for computing Moran's I. |
do_pca |
(logical) whether environmental variables should be summarized
using PCA before computing Moran's I. Default: |
mask |
(SpatVector or SpatExtent) optional spatial object to mask the
|
pca_buffer |
(numeric) buffer width (km) used when PCA is computed from
the convex hull of records. Ignored if |
return_all |
(logical) whether to return the full list of all thinned
datasets. Default: |
verbose |
(logical) whether to print messages about the progress.
Default is |
This function is inspired by the approach used in Velazco et al. (2021), extending the procedure by allowing:
prioritization of records based on a user-defined variable (e.g., year)
optional PCA transformation of environmental layers
selection rules that prevent datasets with too few records or extremely low Moran's I from being chosen.
Procedure overview
For each distance in d, generate a spatially thinned dataset using
the thin_geo() function.
Extract environmental values for the retained records.
Compute Moran's I for each environmental variable.
Summarize autocorrelation per dataset (mean, median, min, or max).
Apply the selection criteria:
Keep only datasets with at least min_records records.
Keep only datasets with Moran's I higher than min_imoran.
Round Moran's I to two decimal places and select the dataset with the 25th lowest autocorrelation.
If more than one dataset is selected, choose the dataset retaining more records.
If still tied, choose the dataset with the smallest thinning distance.
Distance matrix for Moran's I Moran's I requires a weight matrix derived from pairwise distances among records. Two distance types are available:
"haversine": geographic distance computed with fields::rdist.earth()
(default; recommended for longitude/latitude coordinates)
"euclidean": Euclidean distance computed with stats::dist()
Environmental PCA (optional)
If do_pca = TRUE, the environmental layers are summarized using PCA before
Moran's I is computed.
If mask is provided, PCA is computed on masked layers.
Otherwise, a convex hull around the records is buffered by pca_buffer
kilometers to define the PCA area.
The axes that together explain more than 90% of the variation are selected.
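As a rough illustration of the autocorrelation measure used throughout this procedure, the sketch below computes Moran's I for one variable from a haversine distance matrix in base R. Both helpers are hypothetical: the haversine formula is written out by hand instead of calling fields::rdist.earth(), and the inverse-distance weights are an assumption, since the source does not state how distances are converted to weights.

```r
# Hypothetical helper: pairwise great-circle distances (km)
haversine_km <- function(long, lat, radius = 6371) {
  to_rad <- pi / 180
  n <- length(long)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    dlat <- (lat - lat[i]) * to_rad
    dlon <- (long - long[i]) * to_rad
    a <- sin(dlat / 2)^2 +
      cos(lat[i] * to_rad) * cos(lat * to_rad) * sin(dlon / 2)^2
    d[i, ] <- 2 * radius * asin(pmin(1, sqrt(a)))
  }
  d
}

# Hypothetical helper: Moran's I with inverse-distance weights (assumption)
morans_i <- function(x, d) {
  w <- 1 / d
  diag(w) <- 0
  w[!is.finite(w)] <- 0  # coincident points get zero weight
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Two spatial clusters with similar values inside each cluster (made up)
long <- c(-50, -49.9, -49.8, -45, -44.9)
lat  <- c(-25, -25.1, -25, -20, -20.1)
temp <- c(15, 15.2, 15.1, 22, 22.3)

d <- haversine_km(long, lat)
i_val <- morans_i(temp, d)
i_val
```

Nearby records with similar values push Moran's I above zero, which is the pattern that thinning is meant to reduce.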
A list with:
occ: the selected thinned occurrence dataset with the column
thin_geo_flag indicating whether each record is retained (TRUE) or flagged (FALSE).
imoran: a table summarizing Moran's I for each thinning distance
distance: the thinning distance that produced the selected dataset
moran_summary: the summary statistic used to select the dataset
all_thined: (optional) list of thinned datasets for all distances. Only
returned if return_all was set to TRUE
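To make the values in the imoran table easier to interpret, the sketch below computes Moran's I for one variable from a distance-based weight matrix. It is a hedged illustration, not the package's code: the inverse-distance weighting is an assumption, `moran_i()` is a hypothetical helper, and Euclidean distances from `stats::dist()` are used; for the "haversine" option, the distance matrix would instead hold great-circle distances.

```r
# Moran's I with inverse-distance weights (an assumed weighting scheme)
moran_i <- function(x, coords) {
  d <- as.matrix(stats::dist(coords))  # pairwise Euclidean distances
  w <- 1 / d                           # inverse-distance weights
  diag(w) <- 0                         # a record is not its own neighbour
  w[!is.finite(w)] <- 0                # co-located records get zero weight
  z <- x - mean(x)                     # deviations from the mean
  n <- length(x)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Six records along a transect with a smooth environmental gradient
coords <- cbind(x = 1:6, y = 0)
env_gradient <- 1:6
i_value <- moran_i(env_gradient, coords)
i_value
```

A smooth gradient such as this one gives a positive value, meaning that nearby records have similar environmental conditions; thinning at larger distances is expected to push the value toward zero.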
Velazco, S. J. E., Svenning, J. C., Ribeiro, B. R., & Laureto, L. M. O. (2021). On opportunities and threats to conserve the phylogenetic diversity of Neotropical palms. Diversity and Distributions, 27(3), 512–523. https://doi.org/10.1111/ddi.13215
    # Load example data
    data("occurrences", package = "RuHere")
    # Subset occurrences from Araucaria
    occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
    # Load example of raster variables
    data("worldclim", package = "RuHere")
    # Unwrap Packed raster
    r <- terra::unwrap(worldclim)
    # Select thinned occurrences
    occ_geo_moran <- flag_geo_moran(occ = occ, d = c(5, 10, 20, 30), env_layers = r)
    # Selected distance
    occ_geo_moran$distance
    # Number of flagged and unflagged records
    sum(occ_geo_moran$occ$thin_geo_flag) # Retained
    sum(!occ_geo_moran$occ$thin_geo_flag) # Flagged for thinning out
    # Results of the spatial autocorrelation analysis
    occ_geo_moran$imoran
This function identifies and flags occurrence records sourced from iNaturalist. It can flag all iNaturalist records or only those that do not have Research Grade status.
    flag_inaturalist(occ, columns = "datasetName", research_grade = FALSE)
occ |
(data.frame) a data frame containing the occurrence records to be
examined, preferably standardized using |
columns |
(character) column name in |
research_grade |
(logical) whether to flag all records from
iNaturalist, including those with Research Grade status. Default is |
According to iNaturalist, observations become Research Grade when:
the iNaturalist community agrees on species-level ID or lower, i.e. when more than 2/3 of identifiers agree on a taxon;
the community taxon and the observation taxon agree;
or the community agrees on an ID between family and species and votes that the community taxon is as good as it can be.
A data.frame that is the original occ data frame augmented with
a new column named inaturalist_flag. Flagged records receive
FALSE, while all other records receive TRUE.
    # Load example data
    data("occurrences", package = "RuHere")
    # Flag only iNaturalist records without Research Grade
    occ_inat <- flag_inaturalist(occ = occurrences, research_grade = FALSE)
    table(occ_inat$inaturalist_flag) # Number of records flagged (FALSE)
    # Flag all iNaturalist records (including Research Grade)
    occ_inat <- flag_inaturalist(occ = occurrences, research_grade = TRUE)
    table(occ_inat$inaturalist_flag) # Number of records flagged (FALSE)
Flags (validates) occurrence records based on known distribution data
from the International Union for Conservation of Nature (IUCN). This
function checks if an occurrence point for a given species falls within its
documented distribution, allowing for user-defined buffers around the region.
Records are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the IUCN dataset.
    flag_iucn(
      data_dir,
      occ,
      species = "species",
      long = "decimalLongitude",
      lat = "decimalLatitude",
      origin = "native",
      presence = "all",
      buffer = 20,
      progress_bar = FALSE,
      verbose = FALSE
    )
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) vector specifying which origin categories should
be considered as part of the species' range. Options are: |
presence |
(character) vector specifying which presence type should
be considered as part of the species' range. Options are: |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 20 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) if |
A data.frame that is the original occ data frame
augmented with a new column named iucn_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the IUCN
data. Records for species not found in the IUCN data will have
NA in the iucn_flag column.
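Because iucn_flag can take three states (TRUE, FALSE, or NA), downstream filtering needs to treat NA explicitly. The sketch below uses a made-up data frame in place of real output; the species names and values are hypothetical.

```r
# Hypothetical output: one record inside the range, one outside,
# and one for a species absent from the IUCN data
occ_iucn <- data.frame(
  species   = c("sp_a", "sp_a", "sp_b"),
  iucn_flag = c(TRUE, FALSE, NA)
)

# Count the three possible states, including NA
table(occ_iucn$iucn_flag, useNA = "ifany")

# Records confirmed inside the expected distribution
kept <- occ_iucn[!is.na(occ_iucn$iucn_flag) & occ_iucn$iucn_flag, ]

# Records that could not be checked because the species was not found
unchecked <- occ_iucn[is.na(occ_iucn$iucn_flag), ]
```

Subsetting with `occ_iucn[occ_iucn$iucn_flag, ]` alone would return rows filled with NA for the unchecked records, which is why the `is.na()` guard is included.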
    # Load example data
    data("occurrences", package = "RuHere")
    # Get only occurrences from Araucaria
    occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
    # Set folder where distributional datasets were saved
    # Here, just a sample provided in the package
    # You must run 'iucn_here()' beforehand to download the necessary data files
    dataset_dir <- system.file("extdata/datasets", package = "RuHere")
    # Flag records using IUCN specialist information
    occ_iucn <- flag_iucn(data_dir = dataset_dir, occ = occ)
A named character vector used to convert internal flag column names (produced by the package's flagging functions) into human-readable labels.
    flag_names
A named character vector of length 25.
The names correspond to the original flag codes (e.g., "correct_country",
"duplicated_flag", ".cen", "consensus_flag"), and the values are the
cleaned, human-readable labels (e.g., "Wrong country", "Duplicated",
"Country/Province centroid", "consensus").
This object is used internally by functions such as mapview_here() and
remove_flagged() to display more intuitive flag names to users.
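A named character vector of this kind works as a lookup table. The sketch below rebuilds only the four entries quoted above in a stand-in object called `flag_labels` (a hypothetical name; the real flag_names has 25 entries) and shows how flag column names can be translated into labels.

```r
# Stand-in with the four name/label pairs mentioned in the text
flag_labels <- c(
  "correct_country" = "Wrong country",
  "duplicated_flag" = "Duplicated",
  ".cen"            = "Country/Province centroid",
  "consensus_flag"  = "consensus"
)

# Flag columns found in a hypothetical occurrence data frame
flag_cols <- c("duplicated_flag", ".cen")

# Look up the human-readable label for each flag column
labels <- unname(flag_labels[flag_cols])
labels
```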
Flags (validates) occurrence records based on known distribution data
from the World Checklist of Vascular Plants (WCVP). This function checks
if an occurrence point for a given species falls within its documented
distribution, allowing for user-defined buffers around the region. Records
are flagged as valid (TRUE) if they fall inside the documented
distribution (plus optional buffer) for the species in the WCVP dataset.
    flag_wcvp(
      data_dir,
      occ,
      species = "species",
      long = "decimalLongitude",
      lat = "decimalLatitude",
      origin = "native",
      buffer = 20,
      progress_bar = FALSE,
      verbose = FALSE
    )
data_dir |
(character) Required directory path where the |
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
origin |
(character) vector specifying which origin categories should
be considered as part of the species' range. Options are: |
buffer |
(numeric) buffer distance (in kilometers) to be applied around the region of distribution. Default is 20 km. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) if |
A data.frame that is the original occ data frame
augmented with a new column named wcvp_flag. This column is
logical (TRUE/FALSE) indicating whether the record falls
within the expected distribution (plus buffer) based on the WCVP
data. Records for species not found in the WCVP data will have
NA in the wcvp_flag column.
    # Load example data
    data("occurrences", package = "RuHere")
    # Filter occurrences for Araucaria
    occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
    # Set folder where distributional datasets were saved
    # Here, just a sample provided in the package
    # You must run 'wcvp_here()' beforehand to download the necessary data files
    dataset_dir <- system.file("extdata/datasets", package = "RuHere")
    # Flag records using WCVP specialist information
    occ_wcvp <- flag_wcvp(data_dir = dataset_dir, occ = occ)
This function identifies occurrence records collected before or after user-specified years.
    flag_year(
      occ,
      year_column = "year",
      lower_limit = NULL,
      upper_limit = NULL,
      flag_NA = FALSE
    )
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
year_column |
(character) name of the column containing the year in which the occurrence was recorded. This column must be numeric. |
lower_limit |
(numeric) the minimum acceptable year. Records collected
before this value will be flagged. Default is |
upper_limit |
(numeric) the maximum acceptable year. Records collected
after this value will be flagged. Default is |
flag_NA |
(logical) whether to flag records with missing year
information. Default is |
A data.frame identical to occ but with an additional column named
year_flag. Records collected outside the year range specified are assigned
FALSE.
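The flagging rule can be sketched in base R on a made-up data frame. This is an illustration of the logic described above, not the function's source; in particular, it assumes that records with a missing year receive TRUE unless flag_NA is TRUE.

```r
# Hypothetical records, one of them without a collection year
occ <- data.frame(
  species = "Araucaria angustifolia",
  year    = c(1975, 1990, 2015, NA)
)
lower_limit <- 1980
upper_limit <- 2010
flag_NA     <- FALSE

# TRUE when the year lies within [lower_limit, upper_limit]
in_range <- occ$year >= lower_limit & occ$year <= upper_limit

# Assumption: missing years are flagged (FALSE) only when flag_NA is TRUE
in_range[is.na(occ$year)] <- !flag_NA

occ$year_flag <- in_range
occ
```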
    # Load example data
    data("occurrences", package = "RuHere")
    # Flag records collected before 1980 and after 2010
    occ_year <- flag_year(occ = occurrences, lower_limit = 1980, upper_limit = 2010)
This function downloads the Flora e Funga do Brasil database, which is
required for filtering occurrence records using specialists' information
via the flag_florabr() function.
    florabr_here(
      data_dir,
      data_version = "latest",
      solve_discrepancy = TRUE,
      overwrite = TRUE,
      remove_files = TRUE,
      verbose = TRUE
    )
data_dir |
(character) a directory to save the data downloaded from Flora e Funga do Brasil. |
data_version |
(character) version of the Flora e Funga do Brasil database to download. Use "latest" to get the most recent version, updated weekly. Alternatively, specify an older version (e.g., data_version="393.319"). Default value is "latest". |
solve_discrepancy |
(logical) whether to resolve discrepancies between species and subspecies/varieties information. When set to TRUE, species information is updated based on unique data from varieties and subspecies. For example, if a subspecies occurs in a certain biome, it implies that the species also occurs in that biome. Default is TRUE. |
overwrite |
(logical) if TRUE, data is overwritten. Default = TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
A message indicating that the data were successfully saved in the directory
specified by data_dir.
    # Define a directory to save the data
    data_dir <- tempdir() # Here, a temporary directory
    # Download the latest version of the Flora e Funga do Brasil database
    florabr_here(data_dir = data_dir)
Formats and standardizes the column names and data types of an occurrence dataset.
    format_columns(
      occ,
      metadata,
      extract_binomial = TRUE,
      binomial_from = NULL,
      include_subspecies = FALSE,
      include_variety = FALSE,
      check_numeric = TRUE,
      numeric_columns = NULL,
      check_encoding = TRUE,
      data_source = NULL,
      progress_bar = FALSE,
      verbose = FALSE
    )
occ |
(data.frame or data.table) a dataset with occurrence records,
preferably obtained from |
metadata |
(character or data.frame) if a character, one of 'gbif',
'specieslink', 'bien', or 'idigbio', specifying which metadata template to
use (the corresponding data frames are available in
|
extract_binomial |
(logical) whether to create a column with the binomial name of the species. If FALSE, it will create a column "species" with the exact name stored in the scientificName column. Default is TRUE. |
binomial_from |
(character) the column name in metadata from which to
extract the binomial name. Only applicable if |
include_subspecies |
(logical) whether to include subspecies in the
binomial name. Only applicable if |
include_variety |
(logical) whether to include variety in the binomial
name. Only applicable if |
check_numeric |
(logical) whether to check and coerce the columns
specified in |
numeric_columns |
(character) a vector of column names that must be
numeric. Default is NULL, meaning that if |
check_encoding |
(logical) whether to check and fix the encoding of columns that typically contain special characters (see Details). Default is TRUE. |
data_source |
(character) the source of the occurrence records. Default
is NULL, meaning it will use the same string provided in |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) whether to print messages about the progress. Default is FALSE. |
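As a rough illustration of what extracting a binomial name involves, the sketch below keeps the first two words of a scientific name and drops the authorship. The regular expression is an assumption made for this example; the package's own parsing rules, including how subspecies and varieties are handled, are not shown in this section.

```r
# Scientific names with and without authorship
scientific_names <- c(
  "Araucaria angustifolia (Bertol.) Kuntze",
  "Paubrasilia echinata (Lam.) Gagnon, H.C.Lima & G.P.Lewis",
  "Araucaria angustifolia"
)

# Keep the genus and the specific epithet (the first two words)
binomial <- sub("^([^ ]+ +[^ ]+).*$", "\\1", scientific_names)
binomial
```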
If a user-defined metadata data.frame is provided, it must include the following 21 columns: 'scientificName', 'collectionCode', 'catalogNumber', 'decimalLongitude', 'decimalLatitude', 'coordinateUncertaintyInMeters', 'elevation', 'country', 'stateProvince', 'municipality', 'locality', 'year', 'eventDate', 'recordedBy', 'identifiedBy', 'basisOfRecord', 'occurrenceRemarks', 'habitat', 'datasetName', 'datasetKey', and 'key'.
If check_encoding = TRUE, the function will inspect and, if necessary, fix
the encoding of these columns:
'collectionCode', 'catalogNumber', 'country', 'stateProvince',
'municipality', 'locality', 'eventDate', 'recordedBy', 'identifiedBy',
'basisOfRecord', and 'datasetName'.
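Encoding problems typically appear in fields with accented characters, such as locality or collector names. The sketch below shows one common repair in base R, converting Latin-1 bytes to UTF-8 with `iconv()`. It is an assumption that this resembles what check_encoding does; the package may detect and fix encodings differently.

```r
# Latin-1 bytes for a place name, stored without a declared encoding
raw_value <- "S\xe3o Paulo"

# The byte sequence is not valid UTF-8
validUTF8(raw_value)

# Convert from Latin-1 to UTF-8
fixed <- iconv(raw_value, from = "latin1", to = "UTF-8")
validUTF8(fixed)
```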
A data.frame with standardized column names and data types according to the specified metadata.
    # Example with GBIF
    data("occ_gbif", package = "RuHere") # Import data example
    gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
    # Example with SpeciesLink
    data("occ_splink", package = "RuHere") # Import data example
    splink_standardized <- format_columns(occ_splink, metadata = "specieslink")
    # Example with BIEN
    data("occ_bien", package = "RuHere") # Import data example
    bien_standardized <- format_columns(occ_bien, metadata = "bien")
    # Example with iDigBio
    data("occ_idig", package = "RuHere") # Import data example
    idig_standardized <- format_columns(occ_idig, metadata = "idigbio")
Wrapper function to access and download occurrence records from the Botanical Information and Ecology Network (BIEN) database. It provides a unified interface to query BIEN data by species, genus, family, or by geographic or political boundaries.
    get_bien(by = "species", cultivated = FALSE, new.world = NULL,
             all.taxonomy = FALSE, native.status = FALSE, natives.only = TRUE,
             observation.type = FALSE, political.boundaries = TRUE,
             collection.info = TRUE, only.geovalid = TRUE, min.lat = NULL,
             max.lat = NULL, min.long = NULL, max.long = NULL, species = NULL,
             genus = NULL, country = NULL, country.code = NULL, state = NULL,
             county = NULL, state.code = NULL, county.code = NULL,
             family = NULL, sf = NULL, dir, filename = "bien_output",
             file.format = "csv", compress = FALSE, save = FALSE,
             verbose = TRUE, ...)
by |
(character) type of query to perform ( |
cultivated |
(logical) whether to include cultivated records or exclude
them. Default is |
new.world |
(logical) if |
all.taxonomy |
(logical) if |
native.status |
(logical) if |
natives.only |
(logical) if |
observation.type |
(logical) if |
political.boundaries |
(logical) if |
collection.info |
(logical) if |
only.geovalid |
(logical) if |
min.lat |
(numeric) the minimum latitude (in decimal degrees) for a
bounding-box query when |
max.lat |
(numeric) the maximum latitude (in decimal degrees) for a
bounding-box query when |
min.long |
(numeric) the minimum longitude (in decimal degrees) for a
bounding-box query when |
max.long |
(numeric) the maximum longitude (in decimal degrees) for a
bounding-box query when |
species |
(character) species name(s) to query when |
genus |
(character) genus name(s) to query when |
country |
(character) country name when |
country.code |
(character) two-letter ISO country code corresponding
to |
state |
(character) state or province name when |
county |
(character) county or equivalent subdivision name
when |
state.code |
(character) state or province code corresponding
to |
county.code |
(character) county or equivalent subdivision code
corresponding to |
family |
(character) family name(s) to query when |
sf |
(object of class |
dir |
(character) directory path where the file will be saved.
Required if |
filename |
(character) name of the output file without extension.
Default is |
file.format |
(character) file format for saving output ( |
compress |
(logical) if |
save |
(logical) if |
verbose |
(logical) if |
... |
additional arguments passed to the underlying BIEN function. |
A data.frame containing BIEN occurrence records that match
the specified query. The structure and available columns depend on the chosen
by value and the corresponding BIEN function.
    # Example: download occurrence records for a single species
    res_test <- get_bien(
      by = "species",
      species = "Paubrasilia echinata",
      cultivated = TRUE,
      native.status = TRUE,
      observation.type = TRUE,
      only.geovalid = TRUE
    )
This function creates a multidimensional grid in environmental space by
splitting each environmental variable into n_bins equally sized intervals.
It then assigns each occurrence record to an environmental block (bin
combination) and identifies records that fall into the same block (i.e.,
records that are close to each other in environmental space).
The results can be visualized using the plot_env_bins() function.
    get_env_bins(
      occ,
      species = "species",
      long = "decimalLongitude",
      lat = "decimalLatitude",
      env_layers,
      n_bins = 5
    )
occ |
(data.frame or data.table) a data frame containing the occurrence records for a single species. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables. |
n_bins |
(numeric) number of bins into which each environmental variable will be divided. |
A list with:
data: a data frame including extracted environmental values, bin
indices, and a unique block_id for each record.
breaks: a named list of numeric vectors containing the break points
for each variable (used by plot_env_bins()).
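The binning idea can be sketched in base R on a small table of already extracted environmental values. The variable names, the values, and the block-identifier format (bin indices pasted with an underscore) are assumptions for illustration; the objects returned by get_env_bins() may be organized differently.

```r
# Hypothetical environmental values for five records
env <- data.frame(
  bio1  = c(10, 12, 18, 25, 11),
  bio12 = c(800, 820, 1500, 2200, 810)
)
n_bins <- 5

# Equally sized intervals between the minimum and maximum of each variable
breaks <- lapply(env, function(v) seq(min(v), max(v), length.out = n_bins + 1))

# Bin index (1 to n_bins) of every record for every variable
bins <- mapply(
  function(v, b) cut(v, breaks = b, include.lowest = TRUE, labels = FALSE),
  env, breaks
)

# A block is the combination of bin indices across variables
block_id <- apply(bins, 1, paste, collapse = "_")

# Records sharing a block are close to each other in environmental space
split(seq_len(nrow(env)), block_id)
```

Here the first, second, and fifth records fall in the same block, so they would be treated as environmentally redundant.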
    # Load example data
    data("occurrences", package = "RuHere")
    # Get only occurrences from Araucaria
    occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
    # Load example of raster variables
    data("worldclim", package = "RuHere")
    # Unwrap Packed raster
    r <- terra::unwrap(worldclim)
    # Get bins
    b <- get_env_bins(occ = occ, env_layers = r, n_bins = 5)
Downloads species occurrence records from the iDigBio (Integrated Digitized Biocollections) database with flexible taxonomic and geographic filtering options.
    get_idigbio(species = NULL, fields = "all", genus = NULL, family = NULL,
                order = NULL, phylum = NULL, kingdom = NULL, country = NULL,
                county = NULL, limit = NULL, offset = NULL, dir,
                filename = "idigbio_output", save = FALSE, compress = FALSE,
                file.format = "csv", verbose = TRUE, ...)
species |
(character) scientific name(s) of species to search for.
Default is |
fields |
(character) fields to retrieve from iDigBio. Default is |
genus |
(character) genus name for filtering results. Default is |
family |
(character) family name for filtering results. Default is |
order |
(character) order name for filtering results. Default is |
phylum |
(character) phylum name for filtering results. Default is |
kingdom |
(character) kingdom name for filtering results. Default is |
country |
(character) country name for geographic filtering. Default is |
county |
(character) county name for geographic filtering. Default is |
limit |
(numeric) maximum number of records to retrieve. Default is
|
offset |
(numeric) number of records to skip before starting retrieval.
Default is |
dir |
(character) directory path where the file will be saved.
Required if |
filename |
(character) name of the output file without extension.
Default is |
save |
(logical) if |
compress |
(logical) if |
file.format |
(character) file format for saving output ( |
verbose |
(logical) if |
... |
additional arguments passed to |
A data.frame containing occurrence records from iDigBio with the requested
fields.
    ## Search using a single name
    records_basic <- get_idigbio(species = "Arecaceae")
    ## Search using a vector of species names, limited to 100 records
    records_multiple <- get_idigbio(
      species = c("Araucaria angustifolia"),
      limit = 100)
    ## Save results as a compressed RDS file
    records_saved_rds <- get_idigbio(
      species = "Anacardiaceae",
      limit = 50,
      dir = tempdir(),
      filename = "anacardiaceae_records",
      save = TRUE,
      compress = TRUE,
      file.format = "rds")
Retrieves occurrence data from the speciesLink network using user-defined filters. The function allows querying by taxonomic, geographic, and collection-related parameters.
    get_specieslink(species = NULL, key = NULL, dir,
                    filename = "specieslink_output", save = FALSE,
                    basisOfRecord = NULL, family = NULL,
                    institutionCode = NULL, collectionID = NULL,
                    catalogNumber = NULL, kingdom = NULL, phylum = NULL,
                    class = NULL, order = NULL, genus = NULL,
                    specificEpithet = NULL, infraspecificEpithet = NULL,
                    collectionCode = NULL, identifiedBy = NULL,
                    yearIdentified = NULL, country = NULL,
                    stateProvince = NULL, county = NULL, typeStatus = NULL,
                    recordedBy = NULL, recordNumber = NULL,
                    yearCollected = NULL, locality = NULL,
                    occurrenceRemarks = NULL, barcode = NULL, bbox = NULL,
                    landuse_1 = NULL, landuse_year_1 = NULL,
                    landuse_2 = NULL, landuse_year_2 = NULL,
                    phonetic = FALSE, coordinates = NULL, scope = NULL,
                    synonyms = NULL, typus = FALSE, images = FALSE,
                    redlist = NULL, limit = NULL, file.format = "csv",
                    compress = FALSE, verbose = TRUE)
species |
(character) species name. Default is |
key |
(character) API key or authentication token if required. Default
is |
dir |
(character) directory where files will be saved (if |
filename |
(character) name of the output file without extension.
Default is |
save |
(logical) whether to save the results to file. Default is |
basisOfRecord |
(character) filter by basis of record. Default is |
family |
(character) family name. Default is |
institutionCode |
(character) code of the institution that holds the
specimen. Default is |
collectionID |
(character) unique identifier for the collection.
Default is |
catalogNumber |
(character) catalog number of the specimen or record.
Default is |
kingdom |
(character) kingdom name. Default is |
phylum |
(character) phylum name. Default is |
class |
(character) class name. Default is |
order |
(character) order name. Default is |
genus |
(character) genus name. Default is |
specificEpithet |
(character) specific epithet of the species. Default
is |
infraspecificEpithet |
(character) infraspecific epithet. Default
is |
collectionCode |
(character) code identifying the collection within an
institution. Default is |
identifiedBy |
(character) name of the person who identified the
specimen. Default is |
yearIdentified |
(numeric) year of identification. Default is |
country |
(character) country name. Default is |
stateProvince |
(character) state or province name. Default is |
county |
(character) county or municipality name. Default is |
typeStatus |
(character) type status. Default is |
recordedBy |
(character) collector name. Default is |
recordNumber |
(numeric) collector’s record number. Default is |
yearCollected |
(numeric) year of collection. Default is |
locality |
(character) locality description. Default is |
occurrenceRemarks |
(character) text field for remarks about the
occurrence. Default is |
barcode |
(character) barcode or unique specimen identifier. Default is
|
bbox |
(character) bounding box coordinates in the format
|
landuse_1 |
(character) land use category for the first year.
Default is |
landuse_year_1 |
(numeric) year corresponding to |
landuse_2 |
(character) land use category for the second year.
Default is |
landuse_year_2 |
(numeric) year corresponding to |
phonetic |
(logical) whether to use phonetic matching for taxon names.
Default is |
coordinates |
(character) whether to include only records with
geographic coordinates ( |
scope |
(character) scope of the query ( |
synonyms |
(character) whether to include synonyms of the specified
taxon ( |
typus |
(logical) whether to filter only type specimens. Default is
|
images |
(logical) whether to restrict to records with associated
images. Default is |
redlist |
(character) filter by IUCN Red List category. Default is
|
limit |
(numeric) maximum number of records to return. Default is
|
file.format |
(character) file format for saving output ( |
compress |
(logical) whether to compress the output file into |
verbose |
(logical) if |
The speciesLink API key can be set permanently using set_specieslink_credentials("your_api_key").
A data.frame containing the occurrence data fields returned
by speciesLink.
    ## Not run:
    # Retrieve records for Arecaceae in São Paulo
    res <- get_specieslink(
      family = "Arecaceae",
      country = "Brazil",
      stateProvince = "São Paulo",
      basisOfRecord = "PreservedSpecimen",
      limit = 10
    )
    # Save results as compressed CSV
    get_specieslink(
      family = "Arecaceae",
      country = "Brazil",
      save = TRUE,
      dir = tempdir(),
      filename = "arecaceae_sp",
      compress = TRUE
    )
    ## End(Not run)
This function creates a static map of occurrence records using ggplot2, highlighting which points were flagged by data-validation functions. This visualization helps users quickly inspect spatial patterns of flagged and unflagged records and diagnose potential data-quality issues.
The function can also be used to plot the heatmap generated by the
spatial_kde() function.
ggmap_here(
  occ,
  species = NULL,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  flags = "all",
  additional_flags = NULL,
  names_additional_flags = NULL,
  col_additional_flags = NULL,
  show_no_flagged = TRUE,
  col_points = NULL,
  size_points = 1,
  heatmap = NULL,
  low_color = "blue",
  mid_color = "yellow",
  high_color = "red",
  midpoint = 0.5,
  alpha_heatmap = 0.5,
  continent = NULL,
  continent_fill = "gray70",
  continent_linewidth = 0.3,
  continent_border = "white",
  ocean_fill = "aliceblue",
  extension = NULL,
  facet_wrap = FALSE,
  theme_plot = ggplot2::theme_minimal(),
  ...
)
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
species |
(character) name of the species to subset and plot. Default is
|
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
flags |
(character) the flags to be used for coloring the records. Use
|
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is |
names_additional_flags |
(character) an optional different name to the
flag provided in |
col_additional_flags |
(character) if |
show_no_flagged |
(logical) whether to display records that did not receive any flag. Default is TRUE. |
col_points |
(character) A named vector assigning colors to each
flag. If |
size_points |
(numeric) point size for plotting occurrences. Default is 1. |
heatmap |
(SpatRaster) an optional heatmap containing the estimated density
of occurrence records, typically generated by the |
low_color |
(character) color used for the lowest density values in the heatmap. Only applicable if a heatmap is provided. Default is "blue". |
mid_color |
(character) color used for the midpoint of the heatmap gradient. Default is "yellow". |
high_color |
(character) color used for the highest density values in the heatmap. Default is "red". |
midpoint |
(numeric) the central value of the heatmap gradient,
corresponding to |
alpha_heatmap |
(numeric) Alpha transparency applied to the heatmap layer, ranging from 0 (fully transparent) to 1 (fully opaque). Default is 0.5. |
continent |
(SpatVector) optional polygon layer representing continent
boundaries. If |
continent_fill |
(character) fill color for the continent polygons. Default is "gray70". |
continent_linewidth |
(numeric) line width for continent boundaries. Default is 0.3. |
continent_border |
(character) color of the continent polygon borders. Default is "white". |
ocean_fill |
(character) background color used to represent the ocean. Default is "aliceblue". |
extension |
(SpatExtent or numeric) optional map extent specified as a
|
facet_wrap |
(logical) whether to plot each flag in a separate panel
using |
theme_plot |
(theme) a |
... |
other arguments passed to |
This function expects an occurrence dataset that has already been processed
by one or more flagging routines from RuHere or related packages such as
CoordinateCleaner. Any logical column in occ can be used as a flag.
The following built-in flag names are recognized:
From RuHere:
correct_country, correct_state, cultivated, florabr, faunabr,
wcvp, iucn, bien, duplicated, thin_geo, thin_env, consensus
From CoordinateCleaner:
.val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf,
.inst, .aohi
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
If continent is not provided, the background map is a simplified world
polygon included with the package (a modified version of
rnaturalearthdata::map_units110). To inspect this object, run:
terra::unwrap(getExportedValue("RuHere", "world"))
When facet_wrap = TRUE, each flag is plotted in a separate panel,
allowing direct comparison among different types of data issues.
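Before mapping, it can help to check how many records each flag marks. The following base-R sketch is only an illustration and is not part of RuHere: the toy data frame and the helper `count_flagged()` are hypothetical, and it assumes the convention seen in the package examples and in CoordinateCleaner that `FALSE` marks a flagged record.

```r
# Toy occurrence table with two logical flag columns (hypothetical data)
occ_toy <- data.frame(
  species = c("sp1", "sp1", "sp2", "sp2"),
  decimalLongitude = c(-49.2, -48.5, -51.0, 10.3),
  decimalLatitude = c(-25.4, -27.6, -29.9, 50.1),
  correct_country = c(TRUE, TRUE, FALSE, FALSE),
  .sea = c(TRUE, TRUE, TRUE, NA)
)

# Count records marked as problematic (FALSE) in every logical column.
# Assumption: FALSE = flagged, TRUE = passed the check, NA = not evaluated.
count_flagged <- function(occ) {
  is_flag <- vapply(occ, is.logical, logical(1))
  vapply(occ[is_flag], function(x) sum(!x, na.rm = TRUE), integer(1))
}

flag_counts <- count_flagged(occ_toy)
flag_counts
```

Columns with a count of zero add no colored points to the map, so they can be left out of `flags` to keep the legend short.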
A ggplot object displaying flagged and optionally unflagged occurrence records.
# Load example data
data("occ_flagged", package = "RuHere")
# Visualize all flags with ggplot
ggmap_here(occ = occ_flagged)
# Visualize each flag in a separate panel
ggmap_here(occ = occ_flagged, facet_wrap = TRUE)
This function is the dedicated plotting tool for outputs from richness_here().
It automatically handles single-layer rasters (e.g., species richness) and
multi-layer rasters (e.g., multiple biological traits or flags), creating
a standardized visual using ggplot2.
ggrid_here(
  raster,
  low_color = "blue",
  mid_color = "yellow",
  high_color = "red",
  alpha = 0.8,
  continent = NULL,
  continent_fill = "gray70",
  continent_linewidth = 0.3,
  continent_border = "white",
  ocean_fill = "aliceblue",
  extension = NULL,
  theme_plot = ggplot2::theme_minimal(),
  ...
)
raster |
(SpatRaster) A raster object generated by |
low_color |
(character) color for the lowest values. Default is "blue". |
mid_color |
(character) color for the midpoint. Default is "yellow". |
high_color |
(character) color for the highest values. Default is "red". |
alpha |
(numeric) transparency of the grid (0-1). Default is 0.8. |
continent |
(SpatVector) optional polygon layer for boundaries. |
continent_fill |
(character) fill color for continents. Default is "gray70". |
continent_linewidth |
(numeric) line width for continent boundaries. Default is 0.3. |
continent_border |
(character) color of the continent polygon borders. Default is "white". |
ocean_fill |
(character) background color for the ocean. Default is "aliceblue". |
extension |
(SpatExtent or numeric) optional map extent. |
theme_plot |
(theme) a |
... |
other arguments passed to |
A ggplot object.
# Load example data
data("occ_flagged", package = "RuHere")
# Simple richness map
r_records <- richness_here(occ_flagged, summary = "records", res = 2)
ggrid_here(r_records)
# Density of specific flags
# Let's see where 'florabr' flags are concentrated
r_flags <- richness_here(occ_flagged, summary = "records",
                         field = "florabr_flag",
                         field_name = "Records flagged by florabr",
                         fun = function(x, ...) sum(!x, na.rm = TRUE),
                         res = 2)
ggrid_here(r_flags)
This function imports a dataset downloaded from GBIF using a request key
generated by the request_gbif() function. It optionally allows saving
the imported occurrences to disk in CSV or GZIP format.
import_gbif(
  request_key,
  write_file = FALSE,
  output_dir = NULL,
  file.format = "gz",
  select_columns = TRUE,
  columns_to_import = NULL,
  overwrite = FALSE,
  ...
)
request_key |
an object of class 'request_key' returned by the
|
write_file |
(logical) whether to save the downloaded occurrences to disk.
Default is FALSE. If TRUE, you must specify the |
output_dir |
(character) a directory to save the data downloaded from
GBIF. Only applicable if |
file.format |
(character) the format to save the file. Options available
are 'csv' (comma-separated values) and 'gz' (compressed GZIP). Only
applicable if |
select_columns |
(logical) whether to import only specific columns (TRUE) or all columns (FALSE) from the occurrence table. Default is TRUE. |
columns_to_import |
(character) vector of column names to import.
Default is NULL, meaning it will import the column names specified in
|
overwrite |
(logical) whether to overwrite the file in the 'output_dir' if it already exists. Default is FALSE. |
... |
other arguments passed to |
A data frame containing the GBIF occurrence records. If write_file = TRUE,
the function also saves the dataset to disk in the specified format.
This function requires an active internet connection.
## Not run:
# Prepare data to request GBIF download
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")
# Submit a request to download occurrences
gbif_requested <- request_gbif(gbif_info = gbif_prepared)
# Check progress
rgbif::occ_download_wait(gbif_requested)
# After the download succeeds, import the data
occ_gbif <- import_gbif(request_key = gbif_requested)
## End(Not run)
Estimates expected species richness, sample coverage (inventory completeness), and coverage deficit for spatial units based on the framework proposed by Chao & Jost (2012).
inventory_completeness(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  raster_base,
  minimum_species = 3,
  maximum_expected = "equal_obs",
  remove_NA = TRUE,
  fill_NA = TRUE,
  return = c("completeness", "deficit")
)
occ |
(data.frame or data.table) a data frame containing the occurrence records. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in occ that contains the species names. Default is "species". |
long |
(character) the name of the column in occ that contains the longitude values. Default is "decimalLongitude". |
lat |
(character) the name of the column in occ that contains the latitude values. Default is "decimalLatitude". |
raster_base |
(SpatRaster) a reference raster used to aggregate records into spatial units. |
minimum_species |
(numeric) the minimum number of species required in a cell to calculate completeness and deficit. If the number of observed species is lower than this threshold, the function sets completeness = 0 and deficit = 1. Default is 3. |
maximum_expected |
(numeric or character) The upper limit for the estimated species richness (s_exp). Options include:
This prevents mathematically inflated estimates in cells with extremely low sampling coverage. |
remove_NA |
(logical) whether to remove sampling units in raster_base where values are NA. |
fill_NA |
(logical) if TRUE (default), cells within the |
return |
(character) metrics to return. Available options are "n", "s_obs", "s_exp", "singletons", "doubletons", "completeness" and "deficit". See Details. |
The function calculates metrics based on the frequency of rare species
(singletons and doubletons) within each cell of the raster_base.
n: Total number of records.
s_obs: Observed species richness (number of sampled species).
s_exp: Estimated asymptotic species richness based on the Chao1 estimator.
singletons: Species represented by exactly one record.
doubletons: Species represented by exactly two records.
completeness: Sample coverage, representing the proportion of
the total individuals in occ that belong to the species in the sample.
deficit: Coverage deficit, which is the probability that the next sampled individual represents a previously unsampled species (1 - completeness).
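The metrics above can be sketched for a single cell in base R. This is a minimal illustration, not the package's implementation: the helper `cell_completeness()` and the toy counts are hypothetical, and it assumes the classic Chao1 form (with the bias-corrected variant when there are no doubletons) and the sample coverage estimator of Chao & Jost (2012).

```r
# Per-species record counts in one hypothetical cell
counts <- c(sp_a = 5, sp_b = 3, sp_c = 1, sp_d = 1, sp_e = 2, sp_f = 1)

cell_completeness <- function(counts, minimum_species = 3) {
  counts <- counts[counts > 0]
  n <- sum(counts)         # total number of records
  s_obs <- length(counts)  # observed richness
  f1 <- sum(counts == 1)   # singletons
  f2 <- sum(counts == 2)   # doubletons

  # Chao1 estimate of asymptotic richness (assumed form)
  s_exp <- if (f2 > 0) s_obs + f1^2 / (2 * f2) else s_obs + f1 * (f1 - 1) / 2

  # Sample coverage (Chao & Jost 2012); guard against a zero denominator
  denom <- (n - 1) * f1 + 2 * f2
  ratio <- if (denom > 0) (n - 1) * f1 / denom else 1
  completeness <- 1 - (f1 / n) * ratio

  # Cells with too few species get completeness = 0 and deficit = 1
  if (s_obs < minimum_species) completeness <- 0

  c(n = n, s_obs = s_obs, s_exp = s_exp, singletons = f1, doubletons = f2,
    completeness = completeness, deficit = 1 - completeness)
}

res <- cell_completeness(counts)
round(res, 3)
```

With three singletons and one doubleton among 13 records, the estimated richness exceeds the six observed species, which signals that the cell is probably undersampled.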
A SpatRaster object containing the spatialized metrics defined in
return.
Chao A, Jost L (2012) Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93:2533–2547. https://doi.org/10.1890/11-1952.1
# Load example of raster variables
data("worldclim", package = "RuHere")
r <- terra::unwrap(worldclim)
# Aggregate cells
r_base <- terra::aggregate(r, 5)
# Import data set of amphibian communities from the Atlantic Forest
data("atlantic_amphibians", package = "RuHere")
# Run analysis
res <- inventory_completeness(occ = atlantic_amphibians, raster_base = r_base)
terra::plot(res)
This function downloads information on species distributions from the IUCN
Red List, required for filtering occurrence records using specialists'
information via the flag_iucn() function.
iucn_here(
  data_dir,
  species,
  synonyms = NULL,
  iucn_credential = NULL,
  overwrite = FALSE,
  progress_bar = FALSE,
  verbose = FALSE,
  return_data = TRUE
)
data_dir |
(character) directory to save the data downloaded from IUCN. |
species |
(character) a vector of species names for which to retrieve distribution information. |
synonyms |
(data.frame) an optional data.frame containing synonyms of
the target species. The first column must contain the target species names,
and the second column their corresponding synonyms. Default is |
iucn_credential |
(character) your IUCN API key. Default is |
overwrite |
(logical) whether to overwrite existing files. Default is
|
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
verbose |
(logical) whether to display progress messages. Default is
|
return_data |
(logical) whether to return a data frame containing the
species distribution information downloaded from IUCN. Default is |
This function uses the rredlist::rl_species() function to retrieve
distribution data from the IUCN Red List. The data include information at
the country and regional levels, following the World Geographical Scheme for
Recording Plant Distributions (WGSRPD) — but applicable to both plants and
animals.
Unfortunately, the range polygons available at https://www.iucnredlist.org/resources/spatial-data-download cannot be accessed automatically.
Because taxonomic information in IUCN may be outdated, you can optionally
provide a table of synonyms to broaden the search. The synonyms data.frame
should have the accepted species in the first column and their synonyms in
the second. See RuHere::synonys for an example.
The function also downloads the WGSRPD map used to represent distribution regions.
A message indicating that the data were successfully saved in the directory
specified by data_dir.
If return_data = TRUE, the function additionally returns a data frame
containing the species distribution information retrieved from IUCN.
## Not run:
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download species distribution information from IUCN
iucn_here(data_dir = data_dir, species = "Araucaria angustifolia")
## End(Not run)
This function creates an interactive map of occurrence records using mapview, visually highlighting flags. This tool helps users explore which records were flagged by one or more validation functions and inspect them directly on the map.
map_here(
  occ,
  species = NULL,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  flags = "all",
  additional_flags = NULL,
  names_additional_flags = NULL,
  col_additional_flags = NULL,
  show_no_flagged = TRUE,
  cex = 6,
  lwd = 2,
  col_points = NULL,
  label = NULL,
  ...
)
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
species |
(character) name of the species to subset and plot. Default is
|
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
flags |
(character) the flags to be used for coloring the records. Use
|
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is |
names_additional_flags |
(character) an optional different name to the
flag provided in |
col_additional_flags |
(character) if |
show_no_flagged |
(logical) whether to display records that did not receive any flag. Default is TRUE. |
cex |
(numeric) point size for plotting occurrences. Default is 6. |
lwd |
(numeric) line width for point borders. Default is 2. |
col_points |
(character) A named vector assigning colors to each
flag. If |
label |
(character) column name in |
... |
additional arguments passed to |
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
These flags are typically generated by functions in the RuHere or
CoordinateCleaner workflow to identify potential data-quality issues in
occurrence records.
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
An interactive mapview object displaying flagged and optionally unflagged occurrence records.
# Load example data
data("occ_flagged", package = "RuHere")
# Visualize flags interactively
map_here(occ = occ_flagged, label = "record_id")
This function computes Moran's I autocorrelation coefficient for a numeric
vector x using a matrix of weights. The method follows Gittleman and Kot
(1990). This function is an implementation of ape::Moran.I(), but rewritten
in C++ to be substantially faster and more memory-efficient.
moranfast(
  x,
  weight,
  na_rm = TRUE,
  scaled = FALSE,
  alternative = c("two.sided")
)
x |
(numeric) A numeric vector (e.g., environmental values extracted from occurrence records). |
weight |
(matrix) A matrix of spatial weights (e.g., a distance or
inverse-distance matrix). The number of rows must be equal to the length of
|
na_rm |
(logical) whether to remove missing values from |
scaled |
(logical) whether to scale Moran's I so that it ranges between
–1 and +1. Default is |
alternative |
(character) The alternative hypothesis tested against
the null hypothesis of no autocorrelation. Must be one of |
A list with the following components:
observed – The observed Moran's I.
expected – The expected value of Moran's I under the null hypothesis.
sd – The standard deviation of Moran's I under the null hypothesis.
p.value – The p-value of the test based on the chosen alternative.
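To make the first two components concrete, here is a small base-R sketch of the observed and expected Moran's I. It is not the package's C++ code: the helper `moran_sketch()` is hypothetical, it assumes the weights are row-standardized before use (as in `ape::Moran.I()`, to the best of our understanding), and it leaves out the standard deviation and p-value.

```r
moran_sketch <- function(x, weight) {
  n <- length(x)
  # Row-standardize the weights (assumption); all-zero rows stay zero
  rs <- rowSums(weight)
  rs[rs == 0] <- 1
  w <- weight / rs
  # Deviations from the mean
  y <- x - mean(x)
  # Observed Moran's I: weighted cross-products over the total variance
  observed <- (n / sum(w)) * sum(w * outer(y, y)) / sum(y^2)
  # Expected value under the null hypothesis of no autocorrelation
  expected <- -1 / (n - 1)
  list(observed = observed, expected = expected)
}

# Four sites on a line, each connected to its immediate neighbours
x <- c(1, 2, 3, 4)
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1

m <- moran_sketch(x, w)
m
```

An observed value above the expected one, as in this monotonic toy series, indicates that neighbouring sites tend to hold similar values.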
Gittleman, J. L., & Kot, M. (1990). Adaptation: statistics and a null model for estimating phylogenetic effects. Systematic Zoology, 39(3), 227–241.
# Load example data
data("occurrences", package = "RuHere")
# Filter occurrences of Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Extract values for bio_1
bio_1 <- terra::extract(r$bio_1,
                        occ[, c("decimalLongitude", "decimalLatitude")],
                        ID = FALSE, xy = TRUE)
# Remove NAs
bio_1 <- na.omit(bio_1)
# Convert values to numeric
v <- as.numeric(bio_1$bio_1)
# Compute geographic distance matrix
d <- fields::rdist.earth(x1 = as.matrix(bio_1[, c("x", "y")]), miles = FALSE)
# Inverse-distance weights
d <- 1/d
# Fill diagonal with 0
diag(d) <- 0
# Replace infinite values (from zero distances) with 0
d[is.infinite(d)] <- 0
# Compute Moran's I
m <- moranfast(x = v, weight = d, scaled = TRUE)
# Print results
m
A cleaned dataset of occurrence records for Yellow Trumpet Tree
(Handroanthus serratifolius) retrieved from the BIEN database.
The raw data were downloaded using get_bien().
The dataset was subsequently processed with the package’s internal
flagging workflow (flag_duplicates() and remove_flagged()) to remove
duplicated records.
occ_bien
A data frame containing spatial coordinates, taxonomic information, and metadata returned by BIEN, after cleaning. Columns include (but may not be limited to):
scrubbed_species_binomial: Cleaned species name
longitude, latitude: Geographic coordinates
country, state_province, and other political boundary fields
get_bien()
# View dataset
head(occ_bien)
# Number of records
nrow(occ_bien)
A dataset containing the occurrence records of Araucaria angustifolia after applying several of the package’s flagging and data-quality assessment functions.
occ_flagged
A data frame where each row corresponds to a georeferenced occurrence of A. angustifolia.
occurrences,
standardize_countries(), standardize_states(),
flag_florabr(), flag_wcvp(), flag_iucn(),
flag_cultivated(), flag_inaturalist(),
flag_duplicates(), map_here()
# First rows
head(occ_flagged)
# Count flagged vs. unflagged records
table(occ_flagged$correct_country)
A cleaned dataset of occurrence records for Araucaria angustifolia (Parana pine) retrieved from GBIF.
Records were downloaded using the package’s GBIF workflow
(prepare_gbif_download(), request_gbif(), import_gbif()), and then
cleaned using the internal flagging workflow (duplicate detection and
removal).
occ_gbif
A data frame containing georeferenced GBIF occurrence records for A. angustifolia after all cleaning steps.
prepare_gbif_download(), request_gbif(), import_gbif(),
flag_duplicates(), remove_flagged()
# Preview dataset
head(occ_gbif)
# Number of cleaned records
nrow(occ_gbif)
A cleaned dataset of occurrence records for azure jay (Cyanocorax caeruleus)
retrieved from iDigBio using get_idigbio().
Records were cleaned using the package's internal duplicate-flagging workflow.
occ_idig
A data frame containing georeferenced iDigBio occurrence records for C. caeruleus after all cleaning steps.
get_idigbio(), flag_duplicates(), remove_flagged()
# First rows
head(occ_idig)
# Number of cleaned records
nrow(occ_idig)
A cleaned dataset of occurrence records for azure jay (Cyanocorax caeruleus)
retrieved from SpeciesLink using get_specieslink().
Records were cleaned using the package's internal duplicate-flagging workflow.
occ_splink
A data frame containing georeferenced SpeciesLink occurrence records for C. caeruleus after all cleaning steps.
get_specieslink(), flag_duplicates(), remove_flagged()
# First rows
head(occ_splink)
# Number of cleaned records
nrow(occ_splink)
A harmonized, multi-source occurrence dataset containing cleaned georeferenced records for three species:
Araucaria angustifolia (Parana pine)
Cyanocorax caeruleus (Azure jay)
Handroanthus serratifolius (Yellow trumpet tree)
Records were retrieved from GBIF, speciesLink, BIEN, and iDigBio, standardized through the package workflow, merged, and cleaned to remove duplicates.
occurrences
A data frame where each row represents a georeferenced occurrence record for one of the three species.
Columns correspond to the standardized output of
format_columns(), including:
species: Cleaned binomial species name
decimalLongitude, decimalLatitude: Coordinates
year: Year of collection/observation
Various taxonomic, temporal, locality, and metadata fields
Source identifiers added by format_columns() (e.g., data_source)
format_columns(), bind_here(), flag_duplicates(), remove_flagged()
# Show the first rows
head(occurrences)
# Number of occurrences per species
table(occurrences$species)
Visualize the output of get_env_bins() by plotting environmental blocks
(bins) along two selected environmental variables. Each block is shown as
a colored rectangle, and points falling inside the same rectangle share the
same block_id.
plot_env_bins(
  env_bins,
  x_var,
  y_var,
  alpha_blocks = 0.3,
  color_points = "black",
  size_points = 2,
  alpha_points = 0.5,
  stroke_points = 1,
  xlab = NULL,
  ylab = NULL,
  theme_plot = ggplot2::theme_minimal()
)
env_bins |
(list) output list from
|
x_var |
(character) name of the environmental variable used on the x-axis. |
y_var |
(character) name of the environmental variable used on the y-axis. |
alpha_blocks |
(numeric) transparency level of the block rectangles. Must be between 0 and 1. Default is 0.3. |
color_points |
(character) color of the points representing occurrence
records. Default is |
size_points |
(numeric) size of the points representing occurrence records. Default is 2. |
alpha_points |
(numeric) transparency level of the points. Must be between 0 and 1. Default is 0.5. |
stroke_points |
(numeric) size of the border of the points. Default is 1. |
xlab |
(character) label for the x-axis. Default is |
ylab |
(character) label for the y-axis. Default is |
theme_plot |
(theme) a |
A ggplot object showing the environmental blocks (colored rectangles) and the occurrence records in the selected environmental space.
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Get bins
b <- get_env_bins(occ = occ, env_layers = r, n_bins = 10)
# Plot
plot_env_bins(b, x_var = "bio_1", y_var = "bio_12",
              xlab = "Temperature", ylab = "Precipitation")
Prepare data to request GBIF download
prepare_gbif_download(
  species,
  rank = NULL,
  kingdom = NULL,
  phylum = NULL,
  class = NULL,
  order = NULL,
  family = NULL,
  genus = NULL,
  strict = FALSE,
  progress_bar = FALSE,
  ...
)
species |
(character) a vector of species name(s). |
rank |
(character) optional taxonomic rank (for example, 'species' or 'genus'). Default is NULL, meaning it will return species matched across all ranks. |
kingdom |
(character) optional taxonomic kingdom (for example, 'Plantae' or 'Animalia'). Default is NULL, meaning it will return species matched across all kingdoms. |
phylum |
(character) optional taxonomic phylum. Default is NULL, meaning it will return species matched across all phyla. |
class |
(character) optional taxonomic class. Default is NULL, meaning it will return species matched across all classes. |
order |
(character) optional taxonomic order. Default is NULL, meaning it will return species matched across all orders. |
family |
(character) optional taxonomic family. Default is NULL, meaning it will return species matched across all families. |
genus |
(character) optional taxonomic genus. Default is NULL, meaning it will return species matched across all genera. |
strict |
(logical) If TRUE, it (fuzzy) matches only the given name, but never a taxon in the upper classification. Default is FALSE. |
progress_bar |
(logical) whether to display a progress bar during
processing. If TRUE, the 'pbapply' package must be installed. Default is
|
... |
other parameters passed to |
A data.frame with species information, including the number of occurrences and other related details.
This function requires an active internet connection to access GBIF data.
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")
format_columns()
A named list of data frames containing metadata templates for the main biodiversity data providers supported by the package (GBIF, SpeciesLink, iDigBio, and BIEN).
These templates are used internally by format_columns() to harmonize
columns.
prepared_metadata
A named list of four data frames:
$gbif — template for GBIF dataset.
$specieslink — template for SpeciesLink dataset.
$idigbio — template for iDigBio dataset.
$bien — template for BIEN dataset.
Each element of prepared_metadata is a single-row data frame where:
column names correspond to the package’s standardized output fields
values in the row represent the original column names used by each data provider
These mappings allow format_columns() to:
rename fields (e.g., scientificname → scientificName)
identify which variables are missing or provider-specific
coerce classes consistently (e.g., dates, coordinates)
ensure compatibility when combining datasets from different sources
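The mapping idea can be illustrated with a small base-R sketch. The template, the raw column names, and the helper `harmonize_sketch()` below are hypothetical: they are not the package's actual templates and not the internals of format_columns(). The sketch only shows how a single-row template can drive renaming, class coercion, and detection of missing fields.

```r
# Hypothetical single-row template: column names are the standardized
# fields, values are the column names used by an imaginary provider.
template <- data.frame(
  scientificName   = "scientificname",
  decimalLongitude = "longitude",
  decimalLatitude  = "latitude",
  year             = "yearcollected",
  stringsAsFactors = FALSE
)

# Hypothetical raw records from that provider (coordinates stored as text)
raw <- data.frame(
  scientificname = "Puma concolor",
  longitude      = "-49.27",
  latitude       = "-25.43",
  stringsAsFactors = FALSE
)

harmonize_sketch <- function(raw, template) {
  # Named vector: standardized name -> provider column name
  provider_cols <- vapply(template, function(col) as.character(col[1]),
                          character(1))
  present <- provider_cols[provider_cols %in% names(raw)]
  out <- raw[, unname(present), drop = FALSE]
  names(out) <- names(present)                      # rename fields
  for (xy in intersect(c("decimalLongitude", "decimalLatitude"), names(out))) {
    out[[xy]] <- as.numeric(out[[xy]])              # coerce classes
  }
  # Standardized fields that the provider did not supply
  missing <- names(provider_cols)[!provider_cols %in% names(raw)]
  list(data = out, missing = missing)
}

res <- harmonize_sketch(raw, template)
names(res$data)
res$missing
```

Keeping the mapping in data rather than in code means a new provider only needs a new template row, which is the role create_metadata() plays for user-defined sources.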
format_columns()
# View template for GBIF records
prepared_metadata$gbif
A subset of Atlantic mammals records obtained from the
atlanticr::atlantic_mammals dataset, containing occurrences of
Puma concolor.
This dataset is provided as an example to illustrate how to create
user-defined metadata templates for occurrence records from external
sources using the package’s create_metadata() function.
puma_atlanticr
A data frame where each row represents a single occurrence record of
Puma concolor. Columns include species name, location, and other
relevant metadata fields provided by the atlantic_mammals dataset.
create_metadata(),
format_columns()
# Preview first rows
head(puma_atlanticr)

# Count occurrences per year
table(puma_atlanticr$year)
These functions move one column to a new position in a data frame,
either immediately after or before another column, while preserving
the order of all remaining columns. They are lightweight base-R utilities
equivalent to dplyr::relocate(), but without external dependencies.
relocate_after(df, col, after)
relocate_before(df, col, before)
df |
(data.frame) a data.frame whose columns will be reordered. |
col |
(character) the name of the column to move. |
after |
(character) for |
before |
(character) for |
A data.frame with columns reordered.
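As a rough illustration of the reordering logic, the sketch below moves one column directly after another using only base R. The helper name `move_after_sketch()` is hypothetical and this is not the package's implementation; it assumes unique column names.

```r
# Minimal sketch: move `col` so that it sits immediately after `after`,
# keeping every other column in its original relative order.
move_after_sketch <- function(df, col, after) {
  stopifnot(col %in% names(df), after %in% names(df), col != after)
  others <- setdiff(names(df), col)        # all columns except the moved one
  pos <- match(after, others)              # position of the anchor column
  new_order <- append(others, col, after = pos)
  df[, new_order, drop = FALSE]
}

df <- data.frame(a = 1, b = 2, c = 3, d = 4)
names(move_after_sketch(df, col = "d", after = "a"))
# "a" "d" "b" "c"
```

Moving a column before an anchor follows the same pattern with `after = pos - 1`.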
This function removes accents and replaces special characters from strings, returning a plain-text version suitable for data cleaning or standardization.
remove_accent(s)
s |
(character) a character vector containing the strings to process. |
A character vector without accents or special characters.
remove_accent(c("Colômbia", "São Paulo"))
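For readers curious how accent stripping can work, here is a minimal base-R sketch built on `chartr()`. It is an assumption-laden illustration, not the code behind remove_accent(): the helper `strip_accents_sketch()` is hypothetical and it covers only a handful of lowercase Latin letters. Unicode escapes are used so that the example does not depend on the encoding of the script file.

```r
# Accented lowercase letters (á à â ã ä é è ê ë í ì î ï ó ò ô õ ö ú ù û ü ç ñ)
accented <- paste0(
  "\u00e1\u00e0\u00e2\u00e3\u00e4",  # a variants
  "\u00e9\u00e8\u00ea\u00eb",        # e variants
  "\u00ed\u00ec\u00ee\u00ef",        # i variants
  "\u00f3\u00f2\u00f4\u00f5\u00f6",  # o variants
  "\u00fa\u00f9\u00fb\u00fc",        # u variants
  "\u00e7\u00f1"                     # c cedilla, n tilde
)
plain <- "aaaaaeeeeiiiiooooouuuucn"

# Hypothetical helper: one-to-one character translation
strip_accents_sketch <- function(s) chartr(accented, plain, s)

# "Colômbia" and "São Paulo", written with escapes
cleaned <- strip_accents_sketch(c("Col\u00f4mbia", "S\u00e3o Paulo"))
cleaned
# "Colombia" "Sao Paulo"
```

A one-to-one translation table is predictable across platforms, whereas transliteration through `iconv()` can give different results depending on the system.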
This function removes occurrence records flagged as invalid by one or more flagging functions. Additional manual control is available to force keeping or removing specific records, regardless of their flag values.
remove_flagged(
  occ,
  flags = "all",
  additional_flags = NULL,
  force_keep = NULL,
  force_remove = NULL,
  remove_NA = FALSE,
  column_id = "record_id",
  save_flagged = FALSE,
  output_dir = NULL,
  overwrite = FALSE,
  output_format = ".gz"
)
occ |
(data.frame or data.table) a dataset with occurrence records that has been processed by two or more flagging functions. See details. |
flags |
(character) a character vector with the names of the flag columns to be used for filtering records. See details for the available options. Default is "all". |
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is |
force_keep |
(character) an optional character vector with the IDs of
records that were flagged but should still be kept. Default is |
force_remove |
(character) an optional character vector with the IDs of
records that were not flagged but should still be removed. Default is |
remove_NA |
(logical) whether to remove records that have NA in the flags specified. Default is FALSE. |
column_id |
(character) the name of the column containing unique record
IDs. Required if |
save_flagged |
(logical) whether to save the flagged (removed) records.
If |
output_dir |
(character) path to an existing directory where removed
flagged records will be saved. Only used when |
overwrite |
(logical) whether to overwrite existing files in
|
output_format |
(character) output format for saving removed records.
Options are |
The following flags are available: correct_country, correct_state, cultivated, fossil, inaturalist, faunabr, florabr, wcvp, iucn, duplicated, thin_geo, thin_env, .val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf, .inst, and .aohi.
A data.frame containing only the valid (kept) records according to the
flags and additional criteria.
# Load example data
data("occ_flagged", package = "RuHere")

# Remove all flagged records
occ_valid <- remove_flagged(occ = occ_flagged)

# Remove flagged records and force removal of some unflagged records
to_remove <- c("gbif_5987", "specieslink_2301", "gbif_18761")
occ_valid2 <- remove_flagged(occ = occ_flagged, force_remove = to_remove)

# Remove flagged records but keep some flagged ones
to_keep <- c("gbif_14501", "gbif_12002", "gbif_5168")
occ_valid3 <- remove_flagged(occ = occ_flagged, force_keep = to_keep)
This function identifies and removes invalid geographic coordinates, including non-numeric values, NA or empty values, and coordinates outside the valid range for Earth (latitude > 90 or < -90, and longitude > 180 or < -180).
remove_invalid_coordinates(
  occ,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  return_invalid = TRUE,
  save_invalid = FALSE,
  output_dir = NULL,
  overwrite = FALSE,
  output_format = ".gz",
  verbose = FALSE
)
occ |
(data.frame or data.table) a dataset with occurrence records. |
long |
(character) column name in |
lat |
(character) column name in |
return_invalid |
(logical) whether to return a list containing the valid and invalid coordinates. Default is TRUE. |
save_invalid |
(logical) whether to save the invalid (removed) records.
If |
output_dir |
(character) path to an existing directory where records with
invalid coordinates will be saved. Only used when |
overwrite |
(logical) whether to overwrite existing files in
|
output_format |
(character) output format for saving removed records.
Options are |
verbose |
(logical) whether to print messages about function progress.
Default is |
If return_invalid = FALSE, returns the occurrence dataset containing only
valid coordinates.
If return_invalid = TRUE (default), returns a list with two elements:
valid – the dataset with valid coordinates.
invalid – the records with invalid coordinates that were removed.
# Create fake data example
occ <- data.frame("species" = "spp",
                  "decimalLongitude" = c(10, -190, 20, 50, NA),
                  "decimalLatitude" = c(20, 20, 240, 50, NA))

# Split valid and invalid coordinates
occ_valid <- remove_invalid_coordinates(occ)
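The validity rules listed in the description (numeric, non-missing, latitude within [-90, 90], longitude within [-180, 180]) can be sketched in base R as follows. The helper `is_valid_xy()` is hypothetical and only illustrates the checks; it is not the function's actual code.

```r
# Fake coordinates stored as text, including out-of-range, missing,
# and non-numeric values
long <- c("10", "-190", "20", "50", NA, "x")
lat  <- c("20", "20", "240", "50", NA, "5")

# Hypothetical helper: TRUE when both coordinates are usable
is_valid_xy <- function(long, lat) {
  x <- suppressWarnings(as.numeric(long))  # non-numeric text becomes NA
  y <- suppressWarnings(as.numeric(lat))
  !is.na(x) & !is.na(y) & abs(x) <= 180 & abs(y) <= 90
}

ok <- is_valid_xy(long, lat)
occ_demo <- data.frame(long = long, lat = lat, stringsAsFactors = FALSE)

# Same shape as the documented return value when return_invalid = TRUE
split_demo <- list(valid = occ_demo[ok, ], invalid = occ_demo[!ok, ])
which(ok)
# 1 4
```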
Submit a request to download occurrence data from GBIF.
request_gbif(gbif_info, hasCoordinate = TRUE, hasGeospatialIssue = FALSE,
             format = "DWCA", gbif_user = NULL, gbif_pwd = NULL,
             gbif_email = NULL, additional_predicates = NULL)
gbif_info |
an object of class 'gbif_info' returned by the
|
hasCoordinate |
(logical) whether to retrieve only records with coordinates. Default is TRUE. |
hasGeospatialIssue |
(logical) whether to retrieve records identified with geospatial issue. Default is FALSE. |
format |
(character) the download format. Options available are 'DWCA', 'SIMPLE_CSV', or 'SPECIES_LIST'. Default is 'DWCA'. |
gbif_user |
(character) user name within GBIF's website. Default is
NULL, meaning it will try to obtain this information from the R environment.
(check |
gbif_pwd |
(character) user password within GBIF's website. Default is NULL, meaning it will try to obtain this information from the R environment. |
gbif_email |
(character) user email within GBIF's website. Default is NULL, meaning it will try to obtain this information from the R environment. |
additional_predicates |
(character or occ_predicate) additional
supported predicates that can be combined to build more complex download requests. See
|
You can use the object returned by this function to check the download
request progress with rgbif::occ_download_wait().
A download request key returned by the GBIF API, which can be used to monitor or retrieve the download.
This function requires an active internet connection and valid GBIF credentials.
## Not run: 
# Prepare data to request GBIF download
gbif_prepared <- prepare_gbif_download(species = "Araucaria angustifolia")

# Submit a request to download occurrences
gbif_requested <- request_gbif(gbif_info = gbif_prepared)

# Check progress
rgbif::occ_download_wait(gbif_requested)

## End(Not run)
This function generates spatial grids (rasters) of species richness, record density, or summarized biological traits from occurrence data. It supports custom resolutions, masking, and automatic coordinate reprojection to match reference rasters.
richness_here(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  records = "record_id",
  raster_base = NULL,
  res = NULL,
  crs = "epsg:4326",
  mask = NULL,
  summary = "records",
  field = NULL,
  field_name = NULL,
  fun = mean,
  verbose = TRUE
)
occ |
(data.frame) a dataset containing occurrence records. Must include columns for species names and geographic coordinates. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
records |
(character) the name of the column in |
raster_base |
(SpatRaster) an optional reference raster. If provided,
the output will match its resolution, extent, and CRS. Default is |
res |
(numeric) the desired resolution (in decimal degrees if WGS84)
for the output grid. Only used if |
crs |
(character) the coordinate reference system of the raster.
(see ?terra::crs). Default is "epsg:4326". Only applicable if |
mask |
(SpatRaster or SpatVector) an optional layer to mask the
final output. Default is |
summary |
(character) the type of summary to calculate.
Either |
field |
(character or named vector) column in |
field_name |
(character) a custom name used to build the legend when
plotting the result with |
fun |
(function) the function to aggregate |
verbose |
(logical) whether to print messages about the progress.
Default is |
A SpatRaster object representing the calculated richness,
density, or trait summary.
# Load example data
data("occ_flagged", package = "RuHere")

# Mapping the density of records
r_density <- richness_here(occ_flagged, summary = "records", res = 0.5)
ggrid_here(r_density)

# We can also summarize key features:
# 1. Identifying problematic regions by summing error flags
# We create a variable to store the sum of logical flags (TRUE = 1, FALSE = 0)
total_flags <- occ_flagged$florabr_flag + occ_flagged$wcvp_flag +
  occ_flagged$iucn_flag + occ_flagged$cultivated_flag +
  occ_flagged$inaturalist_flag + occ_flagged$duplicated_flag
names(total_flags) <- occ_flagged$record_id
# Using summary = "records" with fun = mean to see the average
# accumulation of flags per cell
r_flags <- richness_here(occ_flagged, summary = "records",
                         field = total_flags,
                         field_name = "Number of flags",
                         fun = mean, res = 0.5)
ggrid_here(r_flags)

# 2. Or we can summarize organism traits spatially
# Simulating a trait (e.g., mass) for each unique record
spp <- unique(occ_flagged$record_id)
sim_mass <- setNames(runif(length(spp), 10, 50), spp)
r_trait <- richness_here(occ_flagged, summary = "records",
                         field = sim_mass, field_name = "Mass",
                         fun = mean, res = 0.5)
ggrid_here(r_trait)
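To make the gridding idea concrete without any spatial package, the following base-R sketch bins toy coordinates into square cells of a given resolution and counts records and unique species per cell. The data and the cell-labelling scheme are invented for illustration; richness_here() itself returns a SpatRaster and may compute cells differently.

```r
# Toy occurrences (hypothetical values)
occ_demo <- data.frame(
  species          = c("a", "a", "b", "c"),
  decimalLongitude = c(-49.9, -49.6, -49.7, -48.2),
  decimalLatitude  = c(-25.1, -25.4, -25.3, -25.2)
)
res <- 0.5  # cell size in decimal degrees

# Index of the cell containing each record (lower-left corner / res)
cell_x <- floor(occ_demo$decimalLongitude / res)
cell_y <- floor(occ_demo$decimalLatitude / res)
cell <- paste(cell_x, cell_y, sep = "_")

# Record density: number of records per cell
records_per_cell <- tapply(occ_demo$species, cell, length)

# Species richness: number of distinct species per cell
richness_per_cell <- tapply(occ_demo$species, cell,
                            function(s) length(unique(s)))

records_per_cell
richness_per_cell
```

Three records fall in the same cell here, but only two species do, which is the difference between a density summary and a richness summary.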
This function sets GBIF credentials (username, email and password) as environment variables in the R environment. These credentials are required to retrieve occurrence records from GBIF.
set_gbif_credentials(
  gbif_username,
  gbif_email,
  gbif_password,
  permanently = FALSE,
  overwrite = FALSE,
  open_Renviron = FALSE,
  verbose = TRUE
)
gbif_username |
(character) your GBIF username. |
gbif_email |
(character) your GBIF email address. |
gbif_password |
(character) your GBIF password. |
permanently |
(logical) whether to add the GBIF credentials permanently
to the R environment. Default is |
overwrite |
(logical) whether to overwrite GBIF credentials if they
already exist. Only applicable if permanently is set to |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credentials. Only applicable if permanently is set to |
verbose |
(logical) if |
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
## Not run: 
set_gbif_credentials(gbif_username = "my_username",
                     gbif_email = "[email protected]",
                     gbif_password = "my_password")

## End(Not run)
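For context, session-level credentials can also be set by hand with base R. The rgbif package reads the environment variables `GBIF_USER`, `GBIF_PWD`, and `GBIF_EMAIL`; that set_gbif_credentials() writes exactly these names is an assumption here, and the values below are placeholders.

```r
# Set placeholder credentials for the current R session only
Sys.setenv(
  GBIF_USER  = "my_username",
  GBIF_PWD   = "my_password",
  GBIF_EMAIL = "me@example.org"
)

# Check that all three variables are now defined (non-empty)
creds_set <- nzchar(Sys.getenv(c("GBIF_USER", "GBIF_PWD", "GBIF_EMAIL")))
creds_set

# Making them permanent would mean adding lines such as
#   GBIF_USER=my_username
# to the user's .Renviron file, which is read when R starts.
```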
This function sets the IUCN API key as an environment variable in the R environment. This key is required to obtain distributional data from IUCN.
set_iucn_credentials(
  iucn_key,
  permanently = FALSE,
  overwrite = FALSE,
  open_Renviron = FALSE,
  verbose = TRUE
)
iucn_key |
(character) your IUCN API key. See Details. |
permanently |
(logical) whether to add the IUCN API key
permanently to the R environment. Default is |
overwrite |
(logical) whether to overwrite IUCN credential if it
already exists. Only applicable if |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credential. Only applicable if |
verbose |
(logical) if |
To check your API key, visit: https://api.iucnredlist.org/users/edit.
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
## Not run: 
set_iucn_credentials(iucn_key = "my_key")

## End(Not run)
This function sets the SpeciesLink API key as an environment variable in the R environment. This API key is required to retrieve occurrence records from SpeciesLink.
set_specieslink_credentials(
  specieslink_key,
  permanently = FALSE,
  overwrite = FALSE,
  open_Renviron = FALSE,
  verbose = TRUE
)
specieslink_key |
(character) your SpeciesLink API key. |
permanently |
(logical) whether to add the SpeciesLink API key
permanently to the R environment. Default is |
overwrite |
(logical) whether to overwrite SpeciesLink credential if it
already exists. Only applicable if |
open_Renviron |
(logical) whether to open the .Renviron file after
saving the credential. Only applicable if |
verbose |
(logical) if |
To check your API key, visit: https://specieslink.net/aut/profile/apikeys.
If permanently and open_Renviron are set to TRUE, it opens the .Renviron
file. Otherwise, the credentials are saved silently.
## Not run: 
set_specieslink_credentials(specieslink_key = "my_key")

## End(Not run)
This function creates density heatmaps using kernel density estimation. The algorithm is inspired by the SpatialKDE R package and the "Heatmap" tool from QGIS. Each occurrence contributes to the density surface within a circular neighborhood defined by a specified radius.
spatial_kde(
  occ,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  radius = 0.2,
  resolution = NULL,
  buffer_extent = 500,
  crs = "epsg:4326",
  raster_ref = NULL,
  kernel = "quartic",
  scaled = TRUE,
  decay = 1,
  mask = NULL,
  zero_as_NA = FALSE,
  weights = NULL
)
occ |
(data.frame, data.table, or SpatVector) a data frame or SpatVector containing the occurrences. Must contain columns longitude and latitude. |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
radius |
(numeric) a positive numeric value specifying the smoothing radius for the kernel density estimate. This parameter determines the circular neighborhood around each point where that point will have an influence. See details. Default is 0.2. |
resolution |
(numeric) a positive numeric value specifying the resolution
(in degrees or meters, depending on the |
buffer_extent |
(numeric) width of the buffer (in kilometers) to draw around the occurrences to define the area for computing the heatmap. Default is 500. |
crs |
(character) the coordinate reference system of the raster heatmap
(see ?terra::crs). Default is "epsg:4326". Only applicable if |
raster_ref |
(SpatRaster) an optional raster to use as reference for resolution, CRS, and extent. Default is NULL. |
kernel |
(character) type of kernel to use. Available kernels are "uniform", "quartic", "triweight", "epanechnikov", or "triangular". Default is "quartic". |
scaled |
(logical) whether to scale output values to vary between 0 and
|
decay |
(numeric) decay parameter for "triangular" kernel. Only
applicable if |
mask |
(SpatRaster or SpatExtent) optional spatial object to define the
extent of the area for the heatmap. Default is NULL, in which case the
extent is derived from |
zero_as_NA |
(logical) whether to convert regions with value 0 to NA. Default is FALSE. |
weights |
(numeric) optional vector of weights for individual points.
Must be the same length as the number of occurrences in |
The radius parameter controls how far the influence of each observation
extends. Smaller values produce fine-grained peaks; larger values produce
smoother, more spread-out heatmaps. Units depend on the CRS: degrees for
geographic coordinates (default), meters for projected coordinates.
If raster_ref is not provided, the extent is calculated from the convex
hull of occ plus buffer_extent.
Kernels define the weight decay of points:
"uniform" = constant, "quartic"/"triweight"/"epanechnikov" = smooth, and
"triangular" = linear decay (using decay parameter).
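The role of the radius and of the kernel shape can be seen in a small base-R sketch. It uses the textbook unnormalized forms of four kernels, which may differ by a scaling constant from what spatial_kde() computes; the "triangular" kernel is left out because the exact way `decay` enters it is not stated here. The helper `kernel_weight()` and the toy points are hypothetical.

```r
# Weight of a point at distance d from a cell centre, for a given radius.
# Unnormalized textbook kernel shapes; zero beyond the radius.
kernel_weight <- function(d, radius,
                          kernel = c("uniform", "quartic",
                                     "triweight", "epanechnikov")) {
  kernel <- match.arg(kernel)
  u <- d / radius
  w <- switch(kernel,
              uniform      = rep(1, length(u)),
              quartic      = (1 - u^2)^2,
              triweight    = (1 - u^2)^3,
              epanechnikov = 1 - u^2)
  w[u > 1] <- 0  # points outside the neighbourhood do not contribute
  w
}

# Three toy points and one cell centre at the origin
pts <- cbind(x = c(0, 0.1, 1), y = c(0, 0, 1))
cell <- c(0, 0)
d <- sqrt((pts[, "x"] - cell[1])^2 + (pts[, "y"] - cell[2])^2)

# Density at the cell: sum of the weights of all points within the radius
density_at_cell <- sum(kernel_weight(d, radius = 0.2, kernel = "quartic"))
density_at_cell
```

With a radius of 0.2, the point at the centre contributes 1, the point 0.1 away contributes (1 - 0.25)^2 = 0.5625, and the distant point contributes nothing.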
A SpatRaster containing the kernel density values.
Hart, T., & Zandbergen, P. (2014). Kernel density estimation and hotspot mapping: Examining the influence of interpolation method, grid cell size, and radius on crime forecasting. Policing: An International Journal of Police Strategies & Management, 37(2), 305-323.
Nelson, T. A., & Boots, B. (2008). Detecting spatial hot spots in landscape ecology. Ecography, 31(5), 556-566.
Chainey, S., Tompson, L., & Uhlig, S. (2008). The utility of hotspot mapping for predicting spatial patterns of crime. Security journal, 21(1), 4-28.
Caha J (2023). SpatialKDE: Kernel Density Estimation for Spatial Data. https://jancaha.github.io/SpatialKDE/index.html.
# Load example data
data("occ_flagged", package = "RuHere")

# Remove flagged records
occ <- remove_flagged(occ_flagged)

# Generate heatmap
heatmap <- spatial_kde(occ = occ, resolution = 0.25, buffer_extent = 50,
                       radius = 2)

# Plot heatmap with terra
terra::plot(heatmap)

# Plot heatmap with ggplot
ggmap_here(occ = occ, heatmap = heatmap)
Convert a data.frame (or data.table) of occurrence records into a SpatVector object.
spatialize(
  occ,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  crs = "epsg:4326",
  force_numeric = TRUE
)
occ |
(data.frame or data.table) a data frame containing the occurrence records to be spatialized. Must contain columns for longitude and latitude. |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
crs |
(character) the coordinate reference system (see |
force_numeric |
(logical) whether to coerce the longitude and latitude
columns to numeric if they are not already. Default is |
A SpatVector object containing the spatialized occurrence records.
# Load example data
data("occurrences", package = "RuHere")

# Spatialize the occurrence records
pts <- spatialize(occurrences)

# Plot the resulting SpatVector
terra::plot(pts)
This function standardizes country names using both names and codes present in a specified column.
standardize_countries(
  occ,
  country_column = "country",
  max_distance = 0.1,
  user_dictionary = NULL,
  lookup_na_country = FALSE,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  return_dictionary = TRUE
)
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
country_column |
(character) the column name containing the country information. |
max_distance |
(numeric) maximum allowed distance (as a fraction) when
searching for suggestions for misspelled country names. Can be any value
between 0 and 1. Higher values return more suggestions. See |
user_dictionary |
(data.frame) optional data.frame with two columns:
'country_name' and 'country_suggested'. If provided, this dictionary will be
combined with the package’s default country dictionary
( |
lookup_na_country |
(logical) whether to extract the country from coordinates when the country column has missing values. If TRUE, longitude and latitude columns must be provided. Default is FALSE. |
long |
(character) column name with longitude. Only applicable if
|
lat |
(character) column name with latitude. Only applicable if
|
return_dictionary |
(logical) whether to return the dictionary of countries that were (fuzzy) matched. |
Country names are first standardized by exact matching against a list of
country names in several languages from rnaturalearthdata::map_units110.
Any unmatched names are then processed using a fuzzy matching algorithm to
find potential candidates for misspelled country names. If unmatched names
remain and lookup_na_country = TRUE, the country is extracted from
coordinates using a map retrieved from rnaturalearthdata::map_units110.
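The two-stage lookup (exact match first, fuzzy match for the remainder) can be sketched with base R. The sketch assumes that the fuzzy step behaves like `agrep()`, where `max.distance` given as a fraction is scaled by the pattern length; the tiny dictionary and the helper `standardize_sketch()` are hypothetical and much simpler than the package's multilingual dictionary.

```r
# Hypothetical helper: exact lookup, then approximate matching
standardize_sketch <- function(x, dictionary, max_distance = 0.1) {
  x_clean <- tolower(trimws(x))
  # Stage 1: exact matching against the dictionary
  out <- dictionary[match(x_clean, dictionary)]
  # Stage 2: fuzzy matching only for names still unmatched
  for (i in which(is.na(out) & !is.na(x_clean))) {
    candidates <- agrep(x_clean[i], dictionary,
                        max.distance = max_distance, value = TRUE)
    if (length(candidates) > 0) out[i] <- candidates[1]
  }
  out
}

dictionary <- c("brazil", "argentina", "peru")
suggested <- standardize_sketch(c("Brazil", "brasil", "atlantis", NA),
                                dictionary)
suggested
# "brazil" "brazil" NA NA
```

Running the exact stage first keeps correctly spelled names from being altered by the fuzzy stage, and names with no close candidate stay NA so that a coordinate-based lookup can still fill them.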
A list with two elements:
data |
The original |
dictionary |
If |
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") # Import example data
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")

# Import and standardize SpeciesLink
data("occ_splink", package = "RuHere") # Import example data
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")

# Import and standardize BIEN
data("occ_bien", package = "RuHere") # Import example data
bien_standardized <- format_columns(occ_bien, metadata = "bien")

# Import and standardize iDigBio
data("occ_idig", package = "RuHere") # Import example data
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")

# Merge all
all_occ <- bind_here(gbif_standardized, splink_standardized,
                     bien_standardized, idig_standardized)

# Standardize countries
occ_standardized <- standardize_countries(occ = all_occ)
This function standardizes state names using both names and codes present in a specified column.
standardize_states(
  occ,
  state_column = "stateProvince",
  country_column = "country_suggested",
  max_distance = 0.1,
  lookup_na_state = FALSE,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  return_dictionary = TRUE
)
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
state_column |
(character) the column name containing the state information. |
country_column |
(character) the column name containing the country information. |
max_distance |
(numeric) maximum allowed distance (as a fraction) when
searching for suggestions for misspelled state names. Can be any value
between 0 and 1. Higher values return more suggestions. See |
lookup_na_state |
(logical) whether to extract the state from coordinates when the state column has missing values. If TRUE, longitude and latitude columns must be provided. Default is FALSE. |
long |
(character) column name with longitude. Only applicable if
|
lat |
(character) column name with latitude. Only applicable if
|
return_dictionary |
(logical) whether to return the dictionary of states that were (fuzzy) matched. |
State names are first standardized by exact matching against a list of
state names in several languages from rnaturalearthdata::states50.
Any unmatched names are then processed using a fuzzy matching algorithm to
find potential candidates for misspelled state names. If unmatched names
remain and lookup_na_state = TRUE, the state is extracted from
coordinates using a map retrieved from rnaturalearthdata::states50.
A list with two elements:
data |
The original |
dictionary |
If |
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") # Import example data
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")

# Import and standardize SpeciesLink
data("occ_splink", package = "RuHere") # Import example data
splink_standardized <- format_columns(occ_splink, metadata = "specieslink")

# Import and standardize BIEN
data("occ_bien", package = "RuHere") # Import example data
bien_standardized <- format_columns(occ_bien, metadata = "bien")

# Import and standardize iDigBio
data("occ_idig", package = "RuHere") # Import example data
idig_standardized <- format_columns(occ_idig, metadata = "idigbio")

# Merge all
all_occ <- bind_here(gbif_standardized, splink_standardized,
                     bien_standardized, idig_standardized)

# Standardize countries
occ_standardized <- standardize_countries(occ = all_occ)

# Standardize states
occ_standardized2 <- standardize_states(occ = occ_standardized$occ)
A simplified PackedSpatVector containing state-level polygons (e.g.,
provinces, departments, regions) for countries worldwide. Names and parent
countries (geonunit) were cleaned (lowercase, accents removed).
states
A PackedSpatVector object with polygons of administrative divisions
and one attribute:
State/province/region name.
The dataset was generated from rnaturalearth::ne_states(). The following
processing steps were applied:
kept only administrative types: "Province", "State",
"Department", "Region", "Federal District";
selected only "name" and "geonunit" columns;
both fields were cleaned via tolower() and remove_accent();
records where state name = country name were removed;
geometries were simplified using terra::simplifyGeom(tolerance = 0.05);
wrapped with terra::wrap() for internal storage.
Natural Earth data, via rnaturalearth.
data(states)
states <- terra::unwrap(states)
terra::plot(states)
Provides lookup tables used to standardize subnational administrative units (states and provinces) in occurrence datasets.
Generated from rnaturalearth::ne_states(), it includes a wide range of
name variants (in multiple languages, transliterations, and common
abbreviations), as well as postal codes for each unit.
This dictionary allows consistent mapping of user-provided names such as
"são paulo", "sao paulo", "SP", "illinois", "ill.", "bayern",
"bavaria" to a single standardized state or province name.
states_dictionary
A named list with two data frames:
states_name: A data frame with columns:
Character. Name variants of states or provinces
from ne_states(), lowercased and accent-stripped.
Character. Standardized state/province name, also lowercased and accent-stripped.
Character. Country associated with the state/province, lowercased and accent-stripped.
states_code: A data frame with columns:
Character. Postal codes from ne_states(), cleaned
and converted to uppercase.
Character. Standardized state/province name corresponding to the code.
Character. Country associated with the code.
The dictionary is constructed by:
selecting administrative units of type "State" or "Province";
extracting multiple name fields, including alternative names and multilingual fields;
normalizing names to lowercase and removing accents;
normalizing codes to uppercase;
removing duplicates and ambiguous entries;
removing rows with missing names or codes.
data(states_dictionary)
head(states_dictionary$states_name)
head(states_dictionary$states_code)
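The two-table lookup described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the tables `name_dict` and `code_dict`, their column names, their rows, and the helper `standardize_state()` are all hypothetical stand-ins for the real dictionary, and accent stripping is omitted for brevity.

```r
# Toy lookup tables mimicking the name/code layout described above.
# Column names and rows are illustrative assumptions, not package data.
name_dict <- data.frame(
  variant  = c("sao paulo", "illinois", "ill.", "bayern", "bavaria"),
  standard = c("sao paulo", "illinois", "illinois", "bayern", "bayern"),
  stringsAsFactors = FALSE
)
code_dict <- data.frame(
  code     = c("SP", "IL", "BY"),
  standard = c("sao paulo", "illinois", "bayern"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: try the name table first (lowercased input),
# then fall back to the code table (uppercased input).
standardize_state <- function(x) {
  x <- trimws(x)
  by_name <- name_dict$standard[match(tolower(x), name_dict$variant)]
  by_code <- code_dict$standard[match(toupper(x), code_dict$code)]
  ifelse(is.na(by_name), by_code, by_name)
}

result <- standardize_state(c("Sao Paulo", "SP", "ill.", "Bavaria", "unknown place"))
print(result)
```

Inputs that match neither table come back as NA, which makes unresolved names easy to inspect before any downstream flagging.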
Extracts the state for each occurrence record based on coordinates.
states_from_coords(
  occ,
  long = "decimalLongitude",
  lat = "decimalLatitude",
  from = "all",
  state_column = "stateProvince",
  output_column = "state_xy",
  append_source = FALSE
)
occ |
(data.frame) a dataset with occurrence records, preferably
standardized using |
long |
(character) column name with longitude. Default is 'decimalLongitude'. |
lat |
(character) column name with latitude. Default is 'decimalLatitude'. |
from |
(character) whether to extract the state for all records ('all') or only for records missing state information ('na_only'). If 'na_only', you must provide the name of the column with state information. Default is 'all'. |
state_column |
(character) the column name containing the state. Only
applicable if |
output_column |
(character) column name created in |
append_source |
(logical) whether to create a new column in |
The states are extracted from coordinates using a map retrieved from
rnaturalearthdata::states50.
The original occ data.frame with an additional column containing the
states extracted from coordinates.
# Import and standardize GBIF
data("occ_gbif", package = "RuHere") # Import data example
gbif_standardized <- format_columns(occ_gbif, metadata = "gbif")
gbif_states <- states_from_coords(occ = gbif_standardized)
This function returns a data frame and a bar plot summarizing the number of records flagged by each flagging function.
summarize_flags(
  occ = NULL,
  flagged_dir = NULL,
  output_format = ".gz",
  flags = "all",
  additional_flags = NULL,
  names_additional_flags = NULL,
  plot = TRUE,
  show_unflagged = TRUE,
  occ_unflagged = NULL,
  fill = "#0072B2",
  sort = TRUE,
  decreasing = TRUE,
  add_n = TRUE,
  size_n = 3.5,
  theme_plot = ggplot2::theme_minimal(),
  ...
)
occ |
(data.frame or data.table) a dataset containing occurrence records that has been processed by one or more flagging functions. See Details for available flag types. |
flagged_dir |
(character) optional path to a directory containing files
with flagged records saved using the |
output_format |
(character) output format used to read the removed records.
Options are |
flags |
(character) the flags to be summarized. Use |
additional_flags |
(character) an optional named character vector with
the names of additional logical columns to be used as flags. Default is |
names_additional_flags |
(character) an optional different name to the
flag provided in |
plot |
(logical) whether to return a |
show_unflagged |
(logical) whether to include the number of unflagged
records in the plot. Default is |
occ_unflagged |
(data.frame or data.table) an optional dataset
containing unflagged occurrence records. Only applicable if |
fill |
(character) fill color for the bar plot. Default is |
sort |
(logical) whether to sort bars according to the number of records.
Default is |
decreasing |
(logical) whether to sort bars in decreasing order (flags
with more records appear at the top of the plot). Default is |
add_n |
(logical) whether to display the number of flagged records on
the bars. Default is |
size_n |
(numeric) size of the text showing the number of records. Only
used when |
theme_plot |
(theme) a |
... |
additional arguments passed to |
This function expects an occurrence dataset that has already been processed
by one or more flagging routines from RuHere or related packages such as
CoordinateCleaner. Any logical column in occ can be used as a flag.
The following built-in flag names are recognized:
From RuHere:
correct_country, correct_state, cultivated, florabr, faunabr,
wcvp, iucn, bien, duplicated, thin_geo, thin_env, consensus
From CoordinateCleaner:
.val, .equ, .zer, .cap, .cen, .sea, .urb, .otl, .gbf,
.inst, .aohi
Users may also supply additional logical columns using
additional_flags, optionally providing alternative display names
(names_additional_flags) and colors (col_additional_flags).
If plot = TRUE, a list with two elements:
A data frame summarizing the number of records per flag.
A ggplot2 object showing the summary as a bar plot.
If plot = FALSE, only the summary data frame is returned.
# Load example data
data("occ_flagged", package = "RuHere")
# Summarize flags
sum_flags <- summarize_flags(occ = occ_flagged)
# Plot
sum_flags$plot_summary
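The kind of summary this function produces can be approximated in base R for intuition. The sketch below is not the package's code: the toy data frame is invented, and it assumes the convention stated elsewhere in this manual for thin_env() and thin_geo(), where FALSE marks a flagged record and TRUE a retained one.

```r
# Toy occurrence table with three logical flag columns (invented values).
# Assumption: FALSE means the record was flagged by that test.
occ <- data.frame(
  species = rep("Araucaria angustifolia", 6),
  correct_country = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  cultivated = c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE),
  .sea = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE),
  check.names = FALSE
)

# Treat every logical column as a flag
flag_cols <- names(occ)[vapply(occ, is.logical, logical(1))]
flag_matrix <- as.matrix(occ[flag_cols])

# Number of flagged (FALSE) records per flag, sorted in decreasing order
n_flagged <- colSums(!flag_matrix)
summary_df <- data.frame(
  flag = names(n_flagged),
  n_flagged = unname(n_flagged),
  stringsAsFactors = FALSE
)
summary_df <- summary_df[order(summary_df$n_flagged, decreasing = TRUE), ]

# Records that pass every flag
n_unflagged <- sum(rowSums(!flag_matrix) == 0)

print(summary_df)
print(n_unflagged)
```

A record can fail more than one test, so the per-flag counts do not need to add up to the number of flagged records.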
Flags occurrence records for thinning by keeping only one record per species within the same environmental block/bin.
thin_env(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  env_layers,
  n_bins = 5,
  prioritary_column = NULL,
  decreasing = TRUE,
  flag_for_NA = FALSE
)
occ |
(data.frame or data.table) a data frame containing the occurrence records. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
env_layers |
(SpatRaster) object containing environmental variables. |
n_bins |
(numeric) number of bins into which each environmental variable will be divided. |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
flag_for_NA |
(logical) whether to treat records falling in |
This function uses get_env_bins() to create a multidimensional grid in
environmental space by splitting each environmental variable into n_bins
equally sized intervals. Records falling into the same environmental bin are
considered redundant; only one is kept (based on retention priority when
provided), and the remaining records are flagged.
The original occ data frame with two additional columns:
thin_env_flag: logical indicating whether each record is retained
(TRUE) or flagged as redundant (FALSE).
bin: environmental bin ID assigned to each record. Each component
of the ID corresponds to the bin of one environmental variable.
# Load example data
data("occurrences", package = "RuHere")
# Get only occurrences from Araucaria
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Load example of raster variables
data("worldclim", package = "RuHere")
# Unwrap Packed raster
r <- terra::unwrap(worldclim)
# Flag records that are close to each other in the environmental space
occ_env_thin <- thin_env(occ = occ, env_layers = r)
# Number of flagged (redundant) records
sum(!occ_env_thin$thin_env_flag)
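The binning idea described above can be sketched in base R without rasters. This is a simplified illustration under stated assumptions: the environmental values are invented, the helper `bin_one()` is hypothetical, and the bins span the range of the toy values, whereas get_env_bins() may define the ranges differently (for example, from the raster layers).

```r
# Invented environmental values already extracted at six record locations
env <- data.frame(
  temp = c(10, 11, 19, 20, 30, 29.5),
  prec = c(100, 120, 500, 510, 900, 880)
)
n_bins <- 5

# Hypothetical helper: split one variable into n equally sized intervals
bin_one <- function(v, n) {
  breaks <- seq(min(v), max(v), length.out = n + 1)
  as.integer(cut(v, breaks = breaks, include.lowest = TRUE))
}

# One bin index per variable, combined into a multidimensional bin ID
bins <- vapply(env, bin_one, integer(nrow(env)), n = n_bins)
bin_id <- apply(bins, 1, paste, collapse = "_")

# Keep the first record in each bin (TRUE); flag the rest as redundant (FALSE)
thin_env_flag <- !duplicated(bin_id)

print(data.frame(env, bin = bin_id, thin_env_flag))
```

To mimic a retention priority, the rows could be sorted by the priority column before calling duplicated(), so that the preferred record is the first one encountered in each bin.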
Marks occurrence records for thinning by keeping only one record per species within a radius of 'd' kilometers.
thin_geo(
  occ,
  species = "species",
  long = "decimalLongitude",
  lat = "decimalLatitude",
  d,
  prioritary_column = NULL,
  decreasing = TRUE,
  remove_invalid = TRUE,
  optimize_memory = FALSE,
  verbose = TRUE
)
occ |
(data.frame or data.table) a data frame containing the occurrence records to be flagged. Must contain columns for species, longitude, and latitude. |
species |
(character) the name of the column in |
long |
(character) the name of the column in |
lat |
(character) the name of the column in |
d |
(numeric) thinning distance in kilometers (e.g., 10 for 10km). |
prioritary_column |
(character) name of a numeric column in |
decreasing |
(logical) whether to sort records in decreasing order using
the |
remove_invalid |
(logical) whether to remove invalid coordinates.
Default is |
optimize_memory |
(logical) whether to compute the distance matrix using a C++ implementation that reduces memory usage at the cost of increased computation time. Recommended for large datasets (> 10,000 records). Default is FALSE. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
This function is similar to the thin() function from the spThin package,
but with an important difference: it allows specifying a priority order for
retaining records.
When a thinning distance is provided (e.g., 10 km), the function identifies
clusters of records within this distance. Within each cluster, it keeps the
record with the highest priority according to the column defined in
prioritary_column (for example, keeping the most recent record if
prioritary_column = "year"), and flags the remaining nearby records for
removal.
If prioritary_column is NULL, the priority follows the original order of
rows in the input occ data.frame.
The original occ data frame augmented with a new logical column named
thin_geo_flag. Records that are retained after thinning receive
TRUE, while records identified as too close to a higher-priority
record receive FALSE.
# Load example data
data("occurrences", package = "RuHere")
# Subset occurrences for Araucaria angustifolia
occ <- occurrences[occurrences$species == "Araucaria angustifolia", ]
# Thin records using a 10 km distance threshold
occ_thin <- thin_geo(occ = occ, d = 10)
sum(!occ_thin$thin_geo_flag) # Number of records flagged for removal
# Prioritizing more recent records within each cluster
occ_thin_recent <- thin_geo(occ = occ, d = 10, prioritary_column = "year")
sum(!occ_thin_recent$thin_geo_flag) # Number of records flagged for removal
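Priority-based thinning can be illustrated with a greedy pass in base R. This is only a sketch of the general idea and may differ from how thin_geo() identifies clusters internally: the helpers `haversine_km()` and `thin_sketch()`, the column names `lon` and `lat`, and the coordinates are all hypothetical.

```r
# Great-circle distance in km (haversine formula, Earth radius 6371 km)
haversine_km <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Greedy thinning: visit records in priority order and keep a record only
# if it is at least d km away from every record already kept
thin_sketch <- function(occ, d, priority = NULL, decreasing = TRUE) {
  ord <- if (is.null(priority)) {
    seq_len(nrow(occ)) # original row order
  } else {
    order(occ[[priority]], decreasing = decreasing)
  }
  keep <- logical(nrow(occ))
  for (i in ord) {
    kept <- which(keep)
    too_close <- length(kept) > 0 &&
      any(haversine_km(occ$lon[i], occ$lat[i],
                       occ$lon[kept], occ$lat[kept]) < d)
    keep[i] <- !too_close
  }
  keep
}

# Invented records: three within a few km of each other, one about 100 km away
occ <- data.frame(
  lon = c(-49.00, -49.05, -50.00, -49.02),
  lat = c(-25.00, -25.00, -25.00, -25.01),
  year = c(1990, 2015, 2001, 2020)
)

flag_default <- thin_sketch(occ, d = 10)
flag_recent <- thin_sketch(occ, d = 10, priority = "year")
print(flag_default)
print(flag_recent)
```

With the original row order, the first record of the tight group is retained; when prioritizing by year, the 2020 record is retained instead, while the distant record is kept in both cases.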
This function downloads the World Checklist of Vascular Plants database,
which is required for filtering occurrence records using specialists'
information via the flag_wcvp() function.
wcvp_here(
  data_dir,
  overwrite = TRUE,
  remove_files = TRUE,
  timeout = 300,
  verbose = TRUE
)
data_dir |
(character) a directory to save the data downloaded from WCVP. |
overwrite |
(logical) If TRUE, data is overwritten. Default is TRUE. |
remove_files |
(logical) whether to remove the downloaded files used in building the final dataset. Default is TRUE. |
timeout |
(numeric) maximum time (in seconds) allowed for downloading. Default is 300. Slower internet connections may require higher values. |
verbose |
(logical) whether to display messages during function execution. Set to TRUE to enable display, or FALSE to run silently. Default is TRUE. |
A message indicating that the data were successfully saved in the directory
specified by data_dir.
# Define a directory to save the data
data_dir <- tempdir() # Here, a temporary directory
# Download the WCVP database
wcvp_here(data_dir = data_dir)
A "PackedSpatVector" containing country polygons from Natural Earth,
processed and cleaned for use within the package. Country names were
converted to lowercase and had accents removed.
world
A PackedSpatVector object with country polygons and one attribute:
Country name.
The dataset is sourced from rnaturalearthdata::map_units110, then:
converted to a SpatVector using terra,
attribute "name" cleaned (tolower(), remove_accent()),
wrapped using terra::wrap() for robust internal storage.
Natural Earth data, via rnaturalearthdata.
data(world)
world <- terra::unwrap(world)
terra::plot(world)
A PackedSpatRaster containing three bioclimatic variables from
WorldClim, cropped to a region of interest in South America.
worldclim
A SpatRaster with 3 layers and the following characteristics:
Dimensions: 151 rows × 183 columns
Resolution: 0.08333333° × 0.08333333°
Extent: xmin = -57.08333, xmax = -41.83333, ymin = -32.08333, ymax = -19.5
CRS: WGS84 (EPSG:4326)
Mean Annual Temperature (°C × 10)
Temperature Annual Range (°C × 10)
Annual Precipitation (mm)
This raster corresponds to three standard bioclimatic variables from the WorldClim 2.1 dataset.
data(worldclim)
bioclim <- terra::unwrap(worldclim)
terra::plot(bioclim)
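As a quick consistency check, the reported dimensions follow from the extent and resolution listed above. The arithmetic below uses only those published values; the variable names are illustrative.

```r
# Values copied from the format description above
res_deg <- 0.08333333
ext <- c(xmin = -57.08333, xmax = -41.83333, ymin = -32.08333, ymax = -19.5)

# Number of cells = extent width (or height) divided by cell size
n_cols <- round((ext[["xmax"]] - ext[["xmin"]]) / res_deg)
n_rows <- round((ext[["ymax"]] - ext[["ymin"]]) / res_deg)

# Cell size expressed in arc-minutes
res_arcmin <- res_deg * 60

print(c(rows = n_rows, cols = n_cols))
print(round(res_arcmin))
```

The result matches the 151 rows × 183 columns stated in the format, and the cell size works out to about 5 arc-minutes.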