Expanding HadISD: quality-controlled, sub-daily station data from 1931

HadISD is a sub-daily, station-based, quality-controlled dataset designed to study past extremes of temperature, pressure and humidity and allow comparisons to future projections. Herein we describe the first major update to the HadISD dataset. The temporal coverage of the dataset has been extended to 1931 to present, doubling the time range over which data are provided. Improvements made to the station selection and merging procedures result in 7677 stations being provided in version 2.0.0.2015p of this dataset. The selection of stations to merge together making composites has also been improved 5 and made more robust. The underlying structure of the quality control procedure is the same as for HadISD.1.0.x, but a number of improvements have been implemented in individual tests. Also, more detailed quality control tests for wind speed and direction have been added. The data will be made available as netCDF files at www.metoffice.gov.uk/hadobs/hadisd and updated annually.

. The process used for the station selection and merging in HadISD.2.0.0 120 observations. Finally, there have to be 15 years worth of months which have the equivalent amount of data for observations every six hours. Stations which pass these three checks, with no requirement on continuity, are retained. This results in 8127 stations being taken forward for further processing. The methodology of this updated station selection procedure is shown in In HadISD.1.0.x, 934 of the final set of stations are composites, using its static list of station matches. Therefore it is likely that a number of stations within the 8127 taken forward for HadISD.2.0.0 are non-unique, and so could be merged together. Also, there will be stations in the full ISD catalogue which could supplement the data within these 8127 candidates and so improve the temporal coverage.
To avoid merging stations which are not suitable, we need a simple, yet robust method of selecting stations to merge. We 10 follow a method which is similar to the International Surface Temperature Initiative (ISTI, Rennie et al., 2014). The ISTI methodology maps separations (distance and height) into decaying exponential probability curves. These probabilities are combined with a station-name similarity algorithm and a threshold set above which stations are merged.
In HadISD.1.0.x a hierarchical scoring system was adopted along with a detailed, manual comparison of the temperature anomalies from the ISD-Lite database (ftp.ncdc.noaa.gov/pub/data/noaa/isd-lite). Merged stations could also be composited out of many short records, which on their own would have had insufficient data to be included in the final station list.
For HadISD.2.0.0, our selection of merging candidates is based only on the latitude, longitude, elevation and station name.
The Euclidean distance between the two stations is calculated using the latitude and longitude. Using an exponential decay 5 with an e-folding distance of 25 km, a likelihood of similarity is derived from the station separation. A similar calculation is performed for the elevation, but using an e-folding distance of 100 m. The latitudes and longitudes of the stations are only stored to a precision of a single decimal place, which in the worst case can result in an accuracy of only ∼5 km. In cases where distinct stations are close (urban stations, for example) but have very similar names, then this may result in an erroneous merge.
The station names are compared using the Jaccard Index (Jaccard, 1901) as in the ISTI merging algorithm. This allows for 10 slight differences in spelling between station names rather than requiring an identical match, arising for example from different spellings of the same station in countries with multiple languages or non-Roman alphabets.
If the product of these three probabilities is greater than 0.5, then the stations are deemed similar enough to merge. Using the horizontal and vertical separations and the station name ensures that large differences in any one of these three measures will preclude merging. A reverse check is performed to ensure that a secondary station is not merged into two primary stations; 15 only the primary station with the highest likelihood of a match is used.
Merging stations within the list of candidate stations results in a final list of 7677 stations, of which 1993 contain data from other station IDs which are in the full ISD archive. The increase in the data coverage by including stations from the full ISD holdings can be seen in Fig. 2. These secondary stations have no requirements on record length or median reporting interval.
When the raw ISD data files are converted to NetCDF prior to processing, the primary stations are read in first, and then all 20 secondaries are read in to fill in any gaps. The focus of HadISD at the moment is on temperature and dewpoint data and so observations are overwritten if those from a secondary station have both temperature and dewpoint in preference to the primary with only one of the two. If only one observation is available out of all stations, then temperature is preferred over dewpoint.
Finally, observations closer to the top of the hour are preferred, but at lower importance than the temperature and dewpoint selection.

25
There are few stations prior to 1931 in the ISD archive, as shown in Fig. 2, hence our decision to only extend the dataset back to 1931. This figure also shows why HadISD.1.0.x was chosen to start in 1973. However, by checking in the full ISD catalogue for stations to merge with, the coverage has been improved back to 1950, as well as smaller improvements at other times. The distribution of stations in the first, last and every 20th year is shown in the Appendix (Fig. 9). Although in the very early years, only small parts of the globe have any coverage, the increase in station numbers from the late 1940s to 1960 results of the final set of candidate stations and mergers are available on the HadISD website at www.metoffice.gov.uk/hadobs/hadisd.
At each annual update we will also make available lists of stations that are newly included into HadISD.2.0.x and those that are no longer included (for example, they have been merged or removed from the ISD). These additional lists should enable users to work with a dynamic station list more easily.

5
Since the release of HadISD.1.0.0 a number of issues have come to light about countries for which specific problems affect the data held in ISD. For two of these, Germany and Canada, we have been able to carry out some extra processing to increase the quality of the station records

Germany
The stations in Germany have station identifying numbers in the ISD that start with 09 and 10. However, it is the remaining 10 4 digits of the ID number that uniquely identify the station within Germany (Andreas Becker, personal communication). Therefore, we have been able to explicitly merge the 09 stations into the 10 stations. We still perform the merging checks 5 Geosci. Instrum. Method. Data Syst. Discuss., doi:10.5194/gi-2016-9, 2016 Manuscript under review for journal Geosci. Instrum. Method. Data Syst. Published: 12 September 2016 c Author(s) 2016. CC-BY 3.0 License.  Geosci. Instrum. Method. Data Syst. Discuss., doi:10.5194/gi-2016-9, 2016 Manuscript under review for journal Geosci. Instrum. Method. Data Syst. Published: 12 September 2016 c Author(s) 2016. CC-BY 3.0 License. outlined above to ensure that no spurious mergers are performed. This results in 44 stations being merged together prior to the station selection criteria being applied.

Canada
Only 1000 WMO numbers have been assigned for use in Canada, and as a result, many have been re-used when old stations have closed, and new ones opened. In some cases, this has resulted in apparent station moves in the ISD record. Using a 5 list kindly supplied by Environment Canada (Lee Cudlip, personal communication) we have been able to assess some of the Canadian stations in the ISD record. The list contained information for 994 stations which could be categorised as follows (the number of stations in each is given in parentheses): -Single -stations which appeared in the list only once (529).
-On/Off -stations which had an "active" and "inactive" status indicating the start and end dates of operation (47) 10 -Good Station Moves -stations which showed a change in location, with dates showing the end of reporting at the previous location, and the start in the new location (216).
-Overlap Moves -similarly to good station moves, but the start of reporting in the new location occurs before the end of reporting at the old (15).
-Possible Homogeneity issues -multiple dates at a single location, perhaps indicating changes in instrumentation (92).

15
-Questionable Moves -location changes with no dates given showing the end at one or the beginning at another location (33).
-Dates -cases where "active" and "inactive" statuses occurred at the same time, so the final status could not be determined (49).
-Other -more complex sets of start and end dates that could not be categorised easily (13).

20
In the ISD, there are more than 1000 stations listed as being in Canada. We selected those which were likely to correspond to the WMO stations (those which have ISD IDs that match 71xxx0-99999). This resulted in 934 stations which we could compare to the Environment Canada list.
Stations which appeared in the Single, On/Off and Possible Homogeneity issues categories were retained in the candidate station list (668). Those from the Overlap moves, Questionable Moves, Dates and Other were rejected from the station list 25 (110).
The 216 stations in the Good Moves list were processed further. Using the station details in the ISD list, the period of time when the station was in this location as determined from the Environment Canada list was extracted. Usually this was the most recent location. The start and end times of the station were adjusted as appropriate to ensure that only the period in the location 7 Geosci. Instrum. Method. Data Syst. Discuss., doi:10.5194/gi-2016-9, 2016 Manuscript under review for journal Geosci. Instrum. Method. Data Syst. Published: 12 September 2016 c Author(s) 2016. CC-BY 3.0 License. as given in the full ISD station list was used when further selecting stations. In many cases this resulted in the station not being selected for inclusion with HadISD.
Of the 934 Canadian stations we were able to assess, 798 were kept for processing by further selection criteria: in 32 the station names were sufficiently different to reduce the probability of a good merger below the threshold and 104 were rejected.
There are other stations which are located in Canada (which do not match the pseudo-WMO IDs used by ISD) which we could 5 not process. These, along with the 30 which were not in the Environment Canada list, were retained in the station selection procedure as we have no information indicating that there are problems with them.

Updating the Quality Control Tests
As part of this update we took the chance to re-write the quality control software from IDL into Python as this language is becoming more commonly used and is also Open Source. All the code used to create HadISD.2.0.0 is written in Python (and 10 run using version 2.7.6), and will be made available alongside the dataset at www.metoffice/gov/uk/hadobs/hadisd/.
We attempted to match the performance and outputs of the tests between the two languages. In some cases we were able to correct bugs present in the IDL, and some tests could be written to result in bit-wise reproducibility. However for others, this was not possible, primarily those where curve-fitting was used to determine critical values. We have also used this opportunity to improve the functionality of some of the tests. We outline the changes made and the tests where differences exist between 15 the two code versions in the Appendix, but the quality control checks where more substantive changes have been made are detailed below.

Distributional Gap
The distributional gap check in HadISD.1.0.x had two parts. The first part worked on a monthly level, comparing the anomalies of the monthly median values. By standardising against the inter-quartile range (IQR), and comparing step-wise from the 20 middle of the distribution outwards, asymmetries were identified and flagged if severe enough. For more details see Dunn et al. (2012).

The second part of the distributional gap test compares all observations within a calendar month (over all years). A histogram
is created from all observations within a calendar month (e.g. all Januaries), and a gaussian distribution fitted. Threshold values are determined by using the positions where this fitted frequency falls below y = 0.1 and rounding outwards to the next integer-25 plus-one. Going outwards from the center, the distribution is scanned for gaps which occur beyond this threshold value, and any observations occurring beyond the gap are flagged.
In a number of cases it has come to light that a simple gaussian distribution is not a good fit to the bulk of the observations, resulting in thresholds that are too high. We therefore have increased the complexity of the fitted gaussian distribution by allowing for non-zero skew and kurtosis by using a Gauss-Hermite series 1 . The updated thresholds (as calculated when the 30 fitted curve drops below y = 0.1) then occur closer to the bulk of the distribution then when using a plain gaussian curve.  Table 4).
The updated version of this test allows these thresholds to be calculated dynamically. Firstly the length of each string of repeated values is obtained, and then distribution of these strings lengths is analysed. An inverse decay curve is fitted to  is modified by finding the next empty bin to ensure the entire main distribution is retained (see Figure 5). However, if this dynamically calculated threshold is larger than what was used in HadISD.1.0.x, then the old value from Dunn et al. (2012) Table 4 is retained. This ensures that this test can only result in more stringent removal of repeated streaks rather than fewer.

Spike
In HadISD.1.0.0 the spike check compared the first-difference values between observations to critical values to determined 5 whether a spike had occurred or not. These critical values were determined from the IQR of the first-differences. Similarly to the updated repeated streak check, in HadISD.2.0.0 the updated critical values are calculated from the distribution of firstdifference magnitudes. This distribution is again fitted with an inverse decay curve to obtain a first guess at the critical values, which is then modified by finding the next empty bin. This threshold is used if it is smaller than that obtained from the IQR of the first differences.  Table 1. Logical Wind Checks used in HadISD.2.0.0, adapted from (DeGaetano, 1997) This test has also been made symmetric, so that the jump down out of the spike has to be greater than the critical value (as opposed to half the critical value as used in HadISD.1.0.x).

Unusual Variance Check
This test identifies whole months where the within-month variance of normalised anomalies is sufficiently greater than the median variance over the full station series for that month. It includes an algorithm to identify periods in the sea-level pressure 5 which are likely to be the result of intense (usually tropical) storms (low pressure systems), and hence need to be retained.
Locations which experience tropical storms usually have very uniform pressure values. Therefore, the extreme low pressure values occuring during the passage of a storm increases their monthly variance and so could result in erroneous flagging.
In HadISD.1.0.x the minimum pressure and the maximum wind speed within a calendar month were assessed for contemporaneity and that they were at least 4.5 median absolute deviations (MAD) from the median value. Now, all time periods within 10 a month where both the wind speed and SLP exceed 4 MAD from the median are used when checking for storm signals in case two storms occur within the same calendar month.

Winds
The level of quality control applied to the wind speed and direction observations in HadISD.1.0.x was not as high as for temperature, dewpoint temperature and sea-level pressure. Therefore, in HadISD.2.0.0 we have added in a set of logical checks 15 for wind speed and direction as well as testing for the year-to-year consistency of the wind rose for the station.
By general convention, if the wind speed is 0 m s −1 (calm) then the direction should be recorded at 0 • . Conversely, a non-zero northerly wind speed should be recorded with a direction of 360 • . In ISD, the wind direction has been recorded as missing for calm periods, and so we use these logical checks to set the wind direction to 0 • when the speed has been recorded as 0 m s −1 .

20
The logical checks used in HadISD.2.0.0 are based on those outlined in Table 2 of DeGaetano (1997) and are summarised in Table 1. This results in the removal of negative wind speeds, and also directions outside of the range 0 → 360 • inclusive.
Observations with a non-zero wind speed but a direction of 0 • and calm periods with a direction = 0 • are also removed.
To quality control the distribution of the wind speed and direction, we use the method outlined in Lucio-Eceiza et al. (2015) to assess rotations between wind roses. Their work focuses on the homogeneity of the wind record, with the aim to adjust 25 erroneous years. In this instance we just remove years where the wind rose is very different to all others. To perform this assessment of the wind rose, we calculate the root-mean-square error (rmse) for each annual wind rose when compared to that calculated for the entire record. These rmse values are fitted with a Rician distribution (appropriate for rmse values). As in the distributional gap check, we use the location where this fitted frequency curve falls below y = 0.01 as the threshold, and search outwards for the first empty bin which is used as the final threshold. Any years where the rmse is larger than this are flagged. This test does flag whole years at a time, but will highlight and remove those years where 5 the distribution of wind directions is radically different to the average, identifying possible undocumented station moves or changes in instrumentation.
In HadISD.2.0.0, wind speeds are now also checked for unusual variance, as well as the odd cluster, streak and record checks which were processed in HadISD.1.0.x. In all these cases the wind direction is now also flagged synergistically.

10
In HadISD.1.0.x the closest 10 stations within 50 m elevation and 300 km distance were selected as neighbours, and where possible, these would be evenly distributed around four quadrants (northeast, southeast, southwest and northwest).
By increasing the span of the dataset, the selection of neighbours needed to be improved. If the selection method of HadISD.1.0.x had been retained, then it is likely that during the early record, stations would be compared to neighbours that have no data during that time. The new procedure is as follows.

15
The closest 20 neighbours within the limits of 500 m elevation and 300 km distance are obtained for each station. For each of these neighbours, the data overlap with the target is calculated. Also, correlation between the neighbour and target is obtained after removing the annual and diurnal cycles. These cycles are removed by first calculating the daily long-term mean, and subtracting that from the data. Then the relative means for each of the 24 hours are calculated over all days, and also removed.
Therefore anomalous hours and days will stand out. The linear combination of the correlation coefficient and overlap fraction 20 is used to rank the neighbours, and up to the best ten neighbours are chosen, requiring that at least two occur within each quadrant if possible. For the 356 stations where fewer than three neighbours were found, then this test was not run.
Using these updated neighbouring stations, the remainder of the test is very similar as for HadISD.1.0.x. The stationneighbour difference series are calculated. However, the inter-quartile range of the difference series is calculated for each calendar month separately, rather than for the entire record. The variations in the station climatology over the annual cycle 25 may result in inter-station differences that are on average larger in some months than others. Any observation associated with a difference exceeding 5IQR of the whole difference series is flagged.
During the neighbour checks, some of the intra-station checks are un-done, as documented for HadISD.1.0.x in Dunn et al. (2012). Although this is retained for the odd cluster, climatological, gap and dew-point depression checks, it is no longer performed on the spike check, as a visual inspection showed that the flags on many true spikes were being removed.  The summary of the fraction of observations removed for each of the three main variables are shown in Fig. 6. The values for each variable and test are shown in Table 3. As in HadISD.1.0.x, the majority of stations have very low flagging rates, with less than 1 per cent of observations removed. There are some regional and country-scale patterns that emerge in the flagging rates. For temperature the large regions which have the highest proportion of flagged observations are eastern and northern 5 North America and western and central Europe. On average the removal rates are higher for the dewpoint temperature than for temperature, but with similar regions showing higher than average removal rates. The majority of stations have comparatively few sea-level pressure observations removed, but the cluster of Mexican stations is still present, but now joined by Japan and We also show in Fig. 8 the number of break points detected in the temperature record for each station as well as the distribution of stations with record length and number of breaks (equivalent plots for dewpoint, sea-level pressure and wind speeds are show in the Appendix, Figs 10 and 11). Not only the length of record and quality of the station data, but also the number and size of inhomogeneities are important when assessing stations that are suitable for climate monitoring. Therefore 25 we do not perform a selection on these lines as the requirements for this will differ between applications. We encourage users to make their own assessment as which stations are suitable for their particular investigation.

Data Provision
HadISD.2.0.0 is provided as Network Common Data Format version 4 files (NetCDF4) at www.metoffice.gov.uk/hadobs/ hadisd/. We have moved from NetCDF3 files as used in HadISD.1.0.x to NetCDF4. This format allows for internal compression, 30 and so results in smaller file sizes on disc, which will hopefully make them easier to process and download. The inventory files, log-files of the processing and also summary plots will also be made available alongside the updated data files. A list of  the fields available in each NetCDF file are given in Table 4. Of note is that the wind gust, past significant weather and the precipitation variables have not been quality controlled.
The versioning scheme will be the same as for HadISD.1.0.x, with annual updates occurring at the beginning of each calendar year. To ensure that as much data from the previous year is included in the updates, these are carried out in a two stage process.
A preliminary dataset will be released early in the year (for example v2.0.1.2016p in January 2017) with a final version (e.g. 5 v2.0.1.2016f) a few months later to ensure that late-arriving data are included.
Updates to the dataset will be made public on the Twitter handle @MetOfficeHadOBS and also at the blog http://hadisd. blogspot.co.uk/. We encourage users of this dataset to contact the authors if they find any issues, for example observations which they believe have been erroneously flagged.  land-surface humidity. In HadISD.2.0.0 we also release data files containing sub-daily humidity and heat-health measures.
These are calculated directly from the sub-daily observations of temperature, dewpoint temperature and pressure.
The formulae we use are the same as in HadISDH (see Willett et al., 2014 for full details) but we give the method here with the specific formulae in Table 5. Firstly the sub-daily sea-level pressure values provided in HadISD are converted to station-level pressure using the formula from List (1963). This is different to HadISDH, where the climatological monthly 5 mean sea-level pressure values from the 20th Century Reanalysis V2 (Compo et al., 2011) were used. Also, this means that if there are no pressure values in the HadISD, then no humidity or heat stress variables have been calculated.
The temperature, dewpoint temperature and station pressure are then used to calculate the vapour pressure with respect to water. This is used to calculate the wet-bulb temperature. If this wet-bulb temperature is below 0 • C then the process is repeated using the formulae with respect to ice. The resulting vapour pressure values are used to obtain the specific and relative   On top of this, these humidity values are used to derive a number of heat-stress metrics on an hourly basis. These are outlined in Table 6. These will allow the study of individual heat wave events not only through meteorological variables but also those which capture the impact on human heat-health.
The cleaned HadISD.2.0.0 data are used to derive these variables. However, neither of these two sets of variables have undergone further quality control or homogenisation processes. Therefore they will inherit any remaining data issues present 5 within the input variables drawn from HadISD. However the homogeneity information from the temperatures and dewpoint temperatures will be suitable to select stations with few and small inhomogeneities.

Summary
We present the first major update to the sub-daily station-based HadISD dataset where the temporal coverage has been extended back to 1931. As part of this the station selection and merging algorithms have been updated, and will be run as part of the 10 annual update cycle. HadISD.2.0.0.2015p contains 7677 stations of which 1993 are composites resulting from the merging procedure. The quality control tests have been adjusted to account for the increased length of record, but also improved to take advantage of our increased knowledge of the dataset and the extremes within it. More detailed quality control tests have been applied to the wind speed and direction observations. The temperature and dewpoint observations have been used to create sub-daily humidity and heat-stress datasets. All data files and supplementary material will be made available at 15 www.metoffice.gov.uk/hadobs/hadisd.  List (1963) Temperature T , station height Z in metres Table 5 (1994) Heat Index . 87 − T f 5 HI = 0.5(T f + 61 + 1.2(T f − 68) + 0.094RH) Rothfusz (1990) Where T f is the temperature in Fahrenheit.  Jensen, M. E., Burman, R. D., and Allen, R. G.: Evapotranspiration and irrigation water requirements, ASCE, 1990.  Table 3 but in per cent. As a result of rounding, rows may not sum to exactly 100.0 per cent.

Test
Variable Stations with detection rate band (% of total original observations)