02. Anatomy of the dataset

Suyana

By default, argopy returns data in “point-cloud” format: a single N_POINTS dimension holding every measurement, where each profile is a run of consecutive points. This is efficient, but it is not the classical Argo layout.

For science, reshape it to N_PROF × N_LEVELS:

  • N_PROF: number of profiles (= cycles).

  • N_LEVELS: pressure levels within each profile.

argopy does this with .argo.point2profile().
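To make the reshape concrete before running it on real data, here is a minimal NumPy sketch of the idea on a toy point cloud. It is an illustration only, not argopy's implementation: the function name points_to_profiles is hypothetical, and both the grouping by cycle number and the sort by pressure within each profile are assumptions made for this example. Profiles shorter than the longest one are padded with NaN.

```python
import numpy as np

def points_to_profiles(cycle, pres, values):
    """Reshape point-cloud arrays into NaN-padded (N_PROF, N_LEVELS) grids.

    Toy illustration only; this is not how argopy implements point2profile.
    """
    cycle = np.asarray(cycle)
    pres = np.asarray(pres, dtype=float)
    values = np.asarray(values, dtype=float)

    # One row per distinct cycle; `inverse` maps each point to its row.
    cycles, inverse, counts = np.unique(
        cycle, return_inverse=True, return_counts=True
    )
    n_prof, n_levels = len(cycles), int(counts.max())

    # Start from all-NaN grids so short profiles stay padded.
    pres_2d = np.full((n_prof, n_levels), np.nan)
    vals_2d = np.full((n_prof, n_levels), np.nan)

    for row in range(n_prof):
        idx = np.flatnonzero(inverse == row)
        # Assumption: order levels by increasing pressure within a profile.
        idx = idx[np.argsort(pres[idx], kind="stable")]
        pres_2d[row, : len(idx)] = pres[idx]
        vals_2d[row, : len(idx)] = values[idx]
    return cycles, pres_2d, vals_2d

# Five points: three belong to cycle 1, two to cycle 2.
cycle = [1, 1, 1, 2, 2]
pres = [5.0, 10.0, 20.0, 5.0, 10.0]
temp = [18.2, 17.9, 15.1, 18.4, 18.0]

cycles, pres_2d, temp_2d = points_to_profiles(cycle, pres, temp)
print(temp_2d.shape)
```

The second profile has only two levels, so its third slot stays NaN. That padding is why N_LEVELS in the profile format equals the length of the longest profile.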

%run _style.py
from argopy import DataFetcher
import argopy

ds_point = DataFetcher(src='erddap', mode='expert').float(5905141).to_xarray()
print('point-cloud:', dict(ds_point.sizes))
point-cloud: {'N_POINTS': 176145}
ds = ds_point.argo.point2profile()
print('profile format:', dict(ds.sizes))
ds
profile format: {'N_PROF': 314, 'N_LEVELS': 562}

Key variables

Beyond coordinates (LATITUDE, LONGITUDE, TIME, PRES), the dataset carries:

Family | Meaning
TEMP, PSAL, PRES | Raw variables (real-time).
TEMP_ADJUSTED, PSAL_ADJUSTED, PRES_ADJUSTED | Adjusted variables (delayed-mode calibration when available).
*_QC, *_ADJUSTED_QC | Quality flags (values ‘0’ to ‘9’).
*_ERROR | Estimated error of the adjusted value (delayed-mode only).
DATA_MODE | Per profile: ‘R’ (real-time), ‘A’ (real-time + adjustments), ‘D’ (delayed-mode).

Data modes: R / A / D

  • R (real-time): what the float transmits directly, with automatic QC. Available within hours.

  • A (adjusted): same as R with a preliminary automatic adjustment in *_ADJUSTED.

  • D (delayed-mode): a principal investigator (PI) has reviewed the profile, calibrated the sensor against nearby CTDs, and applied fine adjustments. Typically available 6 to 12 months later.

For rigorous science, always use *_ADJUSTED when DATA_MODE is ‘D’ or ‘A’, and fall back to the raw variable only when it is ‘R’.
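As a small illustration of that rule, the sketch below applies it to a synthetic dataset with one profile per data mode. The dataset toy and the names is_raw and temp_best are made up for this example. It relies on xr.where broadcasting the per-profile DATA_MODE condition across the N_LEVELS dimension.

```python
import numpy as np
import xarray as xr

dims = ("N_PROF", "N_LEVELS")

# Synthetic data: 3 profiles x 2 levels, one profile for each data mode.
toy = xr.Dataset(
    {
        "TEMP": (dims, np.array([[10.0, 9.0], [11.0, 10.0], [12.0, 11.0]])),
        # The 'R' profile has no adjusted values, so they are NaN here.
        "TEMP_ADJUSTED": (
            dims,
            np.array([[np.nan, np.nan], [11.1, 10.1], [12.2, 11.2]]),
        ),
        "DATA_MODE": (("N_PROF",), np.array(["R", "A", "D"])),
    }
)

# Per-profile condition, shape (N_PROF,).
is_raw = toy["DATA_MODE"] == "R"

# Raw values for 'R' profiles, adjusted values for 'A' and 'D' profiles.
temp_best = xr.where(is_raw, toy["TEMP"], toy["TEMP_ADJUSTED"])
temp_best = temp_best.transpose("N_PROF", "N_LEVELS")
print(temp_best.values)
```

The first row keeps the raw values, while the second and third rows take the adjusted ones. Always taking TEMP_ADJUSTED instead would leave the 'R' profile entirely NaN.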

import numpy as np

modes = np.array([str(m) for m in ds.DATA_MODE.values])
unique, counts = np.unique(modes, return_counts=True)
for m, c in zip(unique, counts):
    print(f'  {m}: {c} profiles')
  D: 314 profiles

QC flags

Each value (TEMP, PSAL, PRES) has an associated QC flag. Canonical table:

Flag | Meaning
'0' | No QC performed.
'1' | Good.
'2' | Probably good.
'3' | Probably bad. Use with caution.
'4' | Bad. Discard.
'5' | Changed (a value was corrected).
'8' | Estimated (interpolated).
'9' | Missing value.

Standard practice: keep flags ‘1’ and ‘2’ (good + probably good).
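A minimal sketch of that filter on synthetic data, assuming the flags are stored as one-character strings as in the table above. The arrays temp and temp_qc here are toy values, not the float's data. DataArray.isin builds the boolean mask, and .where sets everything else to NaN.

```python
import numpy as np
import xarray as xr

dims = ("N_PROF", "N_LEVELS")

# Toy temperatures and their QC flags (one flag per value).
temp = xr.DataArray([[10.0, 9.5, 9.0], [11.0, 10.5, 10.0]], dims=dims)
temp_qc = xr.DataArray([["1", "2", "3"], ["4", "1", "9"]], dims=dims)

# True where the flag is 'good' or 'probably good'.
good = temp_qc.isin(["1", "2"])

# Values with any other flag become NaN; the array shape is unchanged.
temp_good = temp.where(good)

print(int(good.sum()), "of", temp.size, "values kept")
```

Masking with NaN rather than dropping values keeps the N_PROF × N_LEVELS shape intact, so the cleaned array still lines up with pressure and the other variables.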

We take the adjusted variable wherever DATA_MODE is not ‘R’ and mask out anything that isn’t good:

import xarray as xr

# use ADJUSTED if DATA_MODE != 'R', otherwise the raw variable
def merge_adjusted(ds, var):
    """Return var with ADJUSTED values where DATA_MODE != 'R'."""
    mode_is_R = (ds.DATA_MODE.astype(str) == 'R')
    return xr.where(mode_is_R, ds[var], ds[f'{var}_ADJUSTED'])

temp = merge_adjusted(ds, 'TEMP')
psal = merge_adjusted(ds, 'PSAL')
pres = merge_adjusted(ds, 'PRES')

def mask_qc(var, qc, good=('1', '2')):
    """Return var with NaN wherever the QC flag is not in good."""
    qc_str = qc.astype(str)
    mask = xr.zeros_like(qc_str, dtype=bool)
    for g in good:
        mask = mask | (qc_str == g)
    return var.where(mask)

# pick the QC flags that match the values chosen by merge_adjusted
mode_is_R = (ds.DATA_MODE.astype(str) == 'R')
temp_qc = xr.where(mode_is_R, ds.TEMP_QC, ds.TEMP_ADJUSTED_QC)
psal_qc = xr.where(mode_is_R, ds.PSAL_QC, ds.PSAL_ADJUSTED_QC)
pres_qc = xr.where(mode_is_R, ds.PRES_QC, ds.PRES_ADJUSTED_QC)

temp_clean = mask_qc(temp, temp_qc)
psal_clean = mask_qc(psal, psal_qc)
pres_clean = mask_qc(pres, pres_qc)

print('temp shape:', temp_clean.shape)
print('% valid temperature:', f'{float(temp_clean.notnull().mean())*100:.1f}%')
print('% valid salinity:   ', f'{float(psal_clean.notnull().mean())*100:.1f}%')
temp shape: (314, 562)
% valid temperature: 99.6%
% valid salinity:    99.6%

This pair (merge_adjusted + mask_qc) is the pattern you’ll use all the time. Worth keeping handy.

To keep the code below clean, we wrap it all into a single “clean” Dataset:

ds_clean = xr.Dataset({
    'TEMP': temp_clean,
    'PSAL': psal_clean,
    'PRES': pres_clean,
}, coords={
    'LATITUDE': ds.LATITUDE,
    'LONGITUDE': ds.LONGITUDE,
    'TIME': ds.TIME,
    'CYCLE_NUMBER': ds.CYCLE_NUMBER,
})
ds_clean

Summary

  • argopy returns N_POINTS format by default. Use .argo.point2profile() to get N_PROF × N_LEVELS.

  • Decide per profile whether to use the raw variable or _ADJUSTED based on DATA_MODE.

  • Mask by QC: keep flags '1' and '2'.

  • Wrap this in helper functions (merge_adjusted, mask_qc) and save the clean dataset once.

Next: T/S profiles and T-S diagrams.