02. Anatomy of the dataset

By default, argopy returns data in “point-cloud” format: a single N_POINTS dimension holding every measurement, with each profile stored as a run of consecutive points. This is efficient, but it is not the classical Argo layout.

For science, reshape it to N_PROF × N_LEVELS:

N_PROF: number of profiles (= cycles).
N_LEVELS: pressure levels within each profile.

argopy does this with .argo.point2profile().
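
To make the reshape concrete, here is a toy sketch in plain NumPy. It is not argopy's actual implementation: it assumes each point carries a cycle number, and the `cycle` and `temp` arrays are made up for illustration. Shorter profiles are padded with NaN so that every row has the same length.

```python
import numpy as np

# Synthetic point cloud (hypothetical values): 3 profiles with 4, 2 and 3 levels.
cycle = np.array([1, 1, 1, 1, 2, 2, 3, 3, 3])
temp = np.array([20.0, 15.0, 10.0, 5.0, 21.0, 16.0, 19.0, 14.0, 9.0])

# One row per profile, one column per level of the longest profile.
profiles, counts = np.unique(cycle, return_counts=True)
n_prof, n_levels = len(profiles), counts.max()

temp_2d = np.full((n_prof, n_levels), np.nan)
for i, c in enumerate(profiles):
    values = temp[cycle == c]          # consecutive points of one profile
    temp_2d[i, :len(values)] = values  # remaining levels stay NaN

print(temp_2d.shape)
print(temp_2d)
```

In this sketch N_LEVELS is simply the length of the longest profile, and shorter profiles end in NaN.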

%run _style.py
from argopy import DataFetcher
import argopy

ds_point = DataFetcher(src='erddap', mode='expert').float(5905141).to_xarray()
print('point-cloud:', dict(ds_point.sizes))

point-cloud: {'N_POINTS': 176145}

ds = ds_point.argo.point2profile()
print('profile format:', dict(ds.sizes))
ds

profile format: {'N_PROF': 314, 'N_LEVELS': 562}

Key variables

Beyond the coordinates (LATITUDE, LONGITUDE, TIME), the dataset carries these variable families:

Family	Meaning
`TEMP`, `PSAL`, `PRES`	Raw variables (real-time).
`TEMP_ADJUSTED`, `PSAL_ADJUSTED`, `PRES_ADJUSTED`	Adjusted variables (delayed-mode calibration when available).
`*_QC`, `*_ADJUSTED_QC`	Quality flags (string, values ‘0’ to ‘9’).
`*_ADJUSTED_ERROR`	Estimated error of the adjusted value (delayed-mode only).
`DATA_MODE`	Per profile: ‘R’ (real-time), ‘A’ (real-time + adjustments), ‘D’ (delayed-mode).
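
As a rough illustration of the naming convention in the table, the sketch below sorts variable names into families by their suffix. The `family` function, the `PARAMS` tuple and the family labels are hypothetical helpers written for this example, not part of argopy, and the logic only covers names built on TEMP, PSAL or PRES.

```python
PARAMS = ('TEMP', 'PSAL', 'PRES')  # parameters covered by this sketch

def family(name):
    """Classify a variable name by the suffix conventions in the table above."""
    if name == 'DATA_MODE':
        return 'data mode'
    if not name.startswith(PARAMS):
        return 'other'
    if name.endswith('_QC'):        # matches both *_QC and *_ADJUSTED_QC
        return 'qc flag'
    if name.endswith('_ERROR'):
        return 'error'
    if name.endswith('_ADJUSTED'):
        return 'adjusted'
    return 'raw'

names = ['TEMP', 'PSAL_ADJUSTED', 'PRES_QC', 'TEMP_ADJUSTED_QC',
         'PSAL_ADJUSTED_ERROR', 'DATA_MODE', 'CYCLE_NUMBER']
for n in names:
    print(f'{n:22s} {family(n)}')
```

On the real dataset, `list(ds.data_vars)` gives the names to classify.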

Data modes: R / A / D

R (real-time): what the float transmits directly, with automatic QC. Available within hours.
A (adjusted): same as R with a preliminary automatic adjustment in *_ADJUSTED.
D (delayed-mode): a PI has reviewed the data, calibrated the sensor against nearby CTDs, and applied fine adjustments. Available 6 to 12 months later.

For rigorous science, always use *_ADJUSTED when DATA_MODE is ‘D’ or ‘A’, and use the raw variable only when it is ‘R’.

import numpy as np

modes = np.array([str(m) for m in ds.DATA_MODE.values])
unique, counts = np.unique(modes, return_counts=True)
for m, c in zip(unique, counts):
    print(f'  {m}: {c} profiles')

  D: 314 profiles
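
Every profile of this float is delayed-mode, so the raw-versus-adjusted rule never has to switch here. A toy NumPy sketch with mixed modes shows what the per-profile selection looks like; the arrays and the 0.5 offset are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: 3 profiles x 2 levels, one DATA_MODE per profile.
data_mode = np.array(['R', 'A', 'D'])
temp_raw = np.array([[10.0, 9.0], [11.0, 8.0], [12.0, 7.0]])
temp_adj = temp_raw + 0.5  # pretend adjustment, not a realistic correction

# Shape (3, 1) so the per-profile choice broadcasts across all levels.
use_raw = (data_mode == 'R')[:, None]
temp_best = np.where(use_raw, temp_raw, temp_adj)

print(temp_best)
```

Only the first row keeps its raw values; the 'A' and 'D' rows take the adjusted ones.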

QC flags

Each value (TEMP, PSAL, PRES) has an associated QC flag. Canonical table:

Flag	Meaning
`'0'`	No QC performed.
`'1'`	Good.
`'2'`	Probably good.
`'3'`	Probably bad. Use with caution.
`'4'`	Bad. Discard.
`'5'`	Changed (a value was corrected).
`'8'`	Estimated (interpolated).
`'9'`	Missing value.

Standard practice: keep flags ‘1’ and ‘2’ (good + probably good).
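
Before masking, it can help to see how the flags are distributed. The sketch below tallies a made-up list of flags using only the standard library; `QC_MEANING` and `GOOD` are hypothetical names following the Argo flag meanings, and the `flags` values are invented.

```python
from collections import Counter

# Hypothetical lookup of Argo QC flag meanings.
QC_MEANING = {
    '0': 'no QC performed', '1': 'good', '2': 'probably good',
    '3': 'probably bad', '4': 'bad', '5': 'changed',
    '8': 'estimated', '9': 'missing',
}
GOOD = {'1', '2'}  # flags kept under the standard practice above

flags = list('1112149')  # invented flags for seven measurements
counts = Counter(flags)

for flag in sorted(counts):
    print(f"'{flag}' ({QC_MEANING.get(flag, 'unknown')}): {counts[flag]}")

kept = sum(counts[f] for f in GOOD)
print(f'kept {kept} of {len(flags)} values')
```

For a real variable, the flattened flag array (for example `ds.TEMP_QC.values.ravel()`, cast to strings) could be tallied in the same way.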

We use the adjusted variable and mask out anything that isn’t good:

import xarray as xr

# Use the ADJUSTED variable where DATA_MODE != 'R', otherwise the raw one.
def merge_adjusted(ds, var):
    """Return var, taking ADJUSTED values where DATA_MODE != 'R'."""
    mode_is_R = (ds.DATA_MODE.astype(str) == 'R')
    return xr.where(mode_is_R, ds[var], ds[f'{var}_ADJUSTED'])

temp = merge_adjusted(ds, 'TEMP')
psal = merge_adjusted(ds, 'PSAL')
pres = merge_adjusted(ds, 'PRES')

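def mask_qc(var, qc, good=('1', '2')):
    """Return var with NaN wherever the QC flag is not in `good`."""
    qc_str = qc.astype(str)
    # Build a boolean mask that is True for every accepted flag.
    mask = xr.zeros_like(qc_str, dtype=bool)
    for g in good:
        mask = mask | (qc_str == g)
    return var.where(mask)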

# Pick the QC flags that match the variable chosen by merge_adjusted.
mode_is_R = (ds.DATA_MODE.astype(str) == 'R')
temp_qc = xr.where(mode_is_R, ds.TEMP_QC, ds.TEMP_ADJUSTED_QC)
psal_qc = xr.where(mode_is_R, ds.PSAL_QC, ds.PSAL_ADJUSTED_QC)
pres_qc = xr.where(mode_is_R, ds.PRES_QC, ds.PRES_ADJUSTED_QC)

temp_clean = mask_qc(temp, temp_qc)
psal_clean = mask_qc(psal, psal_qc)
pres_clean = mask_qc(pres, pres_qc)

print('temp shape:', temp_clean.shape)
print('% valid temperature:', f'{float(temp_clean.notnull().mean())*100:.1f}%')
print('% valid salinity:   ', f'{float(psal_clean.notnull().mean())*100:.1f}%')

temp shape: (314, 562)
% valid temperature: 99.6%
% valid salinity:    99.6%

This pair (merge_adjusted + mask_qc) is the pattern you’ll use all the time. Worth keeping handy.
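
One possible way to keep the pair handy is to fold both steps into a single function. This is a sketch rather than anything argopy provides: `best_clean` is a hypothetical name, the toy dataset is invented to exercise a mixed 'R'/'D' case, and `isin` stands in for the explicit loop in mask_qc.

```python
import numpy as np
import xarray as xr

def best_clean(ds, var, good=('1', '2')):
    """Pick raw or adjusted values per profile, then mask flags outside `good`."""
    is_raw = ds['DATA_MODE'].astype(str) == 'R'
    values = xr.where(is_raw, ds[var], ds[f'{var}_ADJUSTED'])
    flags = xr.where(is_raw, ds[f'{var}_QC'], ds[f'{var}_ADJUSTED_QC']).astype(str)
    return values.where(flags.isin(list(good)))

# Invented 2-profile x 2-level dataset: profile 0 is 'R', profile 1 is 'D'.
dims = ('N_PROF', 'N_LEVELS')
toy = xr.Dataset({
    'DATA_MODE': ('N_PROF', np.array(['R', 'D'])),
    'TEMP': (dims, np.array([[10.0, 9.0], [12.0, 7.0]])),
    'TEMP_ADJUSTED': (dims, np.array([[10.1, 9.1], [12.1, 7.1]])),
    'TEMP_QC': (dims, np.array([['1', '4'], ['1', '1']])),
    'TEMP_ADJUSTED_QC': (dims, np.array([['1', '1'], ['2', '3']])),
})

temp_toy = best_clean(toy, 'TEMP')
print(temp_toy.values)
```

Profile 0 keeps its raw values and loses the level flagged '4'; profile 1 takes the adjusted values and loses the level flagged '3'.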

To keep the code below clean, we wrap it all into a single “clean” Dataset:

ds_clean = xr.Dataset({
    'TEMP': temp_clean,
    'PSAL': psal_clean,
    'PRES': pres_clean,
}, coords={
    'LATITUDE': ds.LATITUDE,
    'LONGITUDE': ds.LONGITUDE,
    'TIME': ds.TIME,
    'CYCLE_NUMBER': ds.CYCLE_NUMBER,
})
ds_clean

Summary

argopy returns N_POINTS format by default. Use .argo.point2profile() to get N_PROF × N_LEVELS.
Decide per profile whether to use the raw variable or _ADJUSTED based on DATA_MODE.
Mask by QC: keep flags '1' and '2'.
Wrap this in helper functions (merge_adjusted, mask_qc) and build the clean dataset once.

Next: T/S profiles and T-S diagrams.