Blue Carbon Sample Size Analysis

A blue carbon project analysis for monitoring campaigns that calculates the number of sample plots needed, based on real field datasets
Statistics
R
Mangroves
Author
Affiliation

Javier Patrón

Published

June 28, 2024

Introduction

In this R Markdown document we calculate sample sizes for three example data sets and analyze correlation matrices (referred to below as confusion matrices) to understand which field parameters matter most when calculating the total carbon per tree per hectare. This step is important because it identifies the variables to use when stratifying the sample plots, where the goal is to make the project area within each stratum as homogeneous as possible. Additionally, to understand our sample size options, we create a demo data frame with randomly generated samples and calculate the sample size needed depending on the variance.

The steps followed are:

  1. Load the libraries
  2. Read the data. Datasets:
    • Site A: Abu Ali, Saudi Arabia
    • Site B: Delta Blue Carbon, Sindh, Pakistan
    • Site C: Sindh, Pakistan (Pakistan Institute)
  3. Clean and tidy the data into the same format
  4. Bind the data sets into a single data frame
  5. Create a demo data set to understand how the sample size is influenced by the variation and standard deviation of the data
  6. Calculate the different sample sizes with the proposed equations from the SOP
  7. Calculate the different sample sizes with the real data sets
  8. Build the confusion matrices to understand the relationship between parameters and strengthen the SOP for calculating sample sizes

Here is our demo data set summary:

Summary Demo Data Set
site_name      tree_count  mean_carbon_kg  sd_carbon_kg  std_error
Very Steady    151         0.1481133       0.0709162     0.0057711
Steady         173         0.2263267       0.1571232     0.0119459
Normal         178         0.4256436       0.2501144     0.0187469
Variant        190         0.6394418       0.4990083     0.0362019
Very Variant   208         1.7013919       1.2109702     0.0839657
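
A summary with these columns can be produced along the following lines. This is a minimal sketch, not the original code: the simulated carbon values, the `rlnorm()` parameters, and the object names `strata`, `demo`, and `demo_summary` are assumptions made for illustration. Only the column names and the tree counts come from the table above. The standard error is computed as sd / sqrt(n), which is consistent with the tabulated values (for example, 0.0709162 / sqrt(151) ≈ 0.0057711).

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical strata: tree counts from the table, spreads chosen arbitrarily
strata <- data.frame(
  site_name  = c("Very Steady", "Steady", "Normal", "Variant", "Very Variant"),
  tree_count = c(151, 173, 178, 190, 208),
  sdlog      = c(0.3, 0.5, 0.6, 0.7, 0.8)
)

# Simulate one carbon value (kg) per tree in each stratum
demo <- do.call(rbind, lapply(seq_len(nrow(strata)), function(i) {
  data.frame(
    site_name = strata$site_name[i],
    carbon_kg = rlnorm(strata$tree_count[i], meanlog = -1, sdlog = strata$sdlog[i])
  )
}))

# Per-stratum summary with the same columns as the table
demo_summary <- do.call(rbind, lapply(split(demo, demo$site_name), function(d) {
  data.frame(
    site_name      = d$site_name[1],
    tree_count     = nrow(d),
    mean_carbon_kg = mean(d$carbon_kg),
    sd_carbon_kg   = sd(d$carbon_kg),
    std_error      = sd(d$carbon_kg) / sqrt(nrow(d))
  )
}))

print(demo_summary)
```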

Step 6. Calculate the different sample size with the proposed equations from the SOP.

a. Power Calculation

In this case, we used the pwr.t.test() function from the pwr package in R. This function calculates the sample size needed to reach a desired power at a given significance level (here, power = 0.95 and significance level = 0.05).

The effect size for each stratum was determined by dividing a desired mean (0.4 kg per sample) by the standard deviation within that stratum, so strata with a larger standard deviation have a smaller effect size and require more samples.

Results: In “lightblue”, the power calculation gave us 2.3 as the smallest sample size, for the stratum with the lowest variance, and 239.2 as the largest, for the stratum with the highest variance.

Strata summary: Power Calculation Sample
site_name tree_count mean_carbon_kg sd_carbon_kg std_error power_calc
Very Steady 151 0.1481133 0.0709162 0.0057711 2.3
Steady 173 0.2263267 0.1571232 0.0119459 5.2
Normal 178 0.4256436 0.2501144 0.0187469 11.2
Variant 190 0.6394418 0.4990083 0.0362019 41.4
Very Variant 208 1.7013919 1.2109702 0.0839657 239.2
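
The power_calc column can be approximated with base R. The sketch below uses stats::power.t.test() as a stand-in for pwr::pwr.t.test() so that it runs without extra packages. The two-sample, two-sided test type is an assumption (it is the default in both functions), and the object names are illustrative. Under those assumptions the results should land close to the tabulated values.

```r
# Standard deviations per stratum, copied from the table above
sd_carbon_kg <- c(
  "Very Steady"  = 0.0709162,
  "Steady"       = 0.1571232,
  "Normal"       = 0.2501144,
  "Variant"      = 0.4990083,
  "Very Variant" = 1.2109702
)

target_mean <- 0.4  # desired mean (kg per sample), as stated in the text

# Effect size d = target_mean / sd; power.t.test() takes delta and sd separately.
# Assumption: default two-sample, two-sided test.
power_calc <- sapply(sd_carbon_kg, function(s) {
  power.t.test(delta = target_mean, sd = s, sig.level = 0.05, power = 0.95)$n
})

print(round(power_calc, 1))
```

Because the effect size shrinks as the standard deviation grows, the required sample size rises steeply across the strata.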

The pwr.t.test() function works well when you have a desired mean for each stratum, and it readily determines adequate sample sizes that account for the variability within each stratum. To visualize the results better, here is the graph for the Variant stratum.

Power Calculation Graph

b. Central Limit Theorem

Now we will calculate the sample size using the Central Limit Theorem.

\[n = \left(\frac{Z \cdot sd}{E}\right)^2\]

where:

  • \(n\) = sample size

  • \(Z\) = Z-score (e.g., 1.96 for 95% confidence)

  • \(sd\) = Standard Deviation of your population

  • \(E\) = Margin of error you are willing to accept in your estimate

Using the Central Limit Theorem, we calculate the minimum sample size to ensure our margin of error and confidence interval stay within VM0033 requirements:

  • For a 90% Confidence Interval: 90% confidence that our carbon stock estimate is no more than 20% off the true value.

  • For a 95% Confidence Interval: 95% confidence that our carbon stock estimate is no more than 30% off the true value.

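Results: In “lightgreen”, the CLT results gave us 9.7852599 for the lowest-variance stratum (Very Steady) and 21.62359 for the highest-variance stratum (Very Variant). Note that the largest value, 25.99459, occurs in the Variant stratum, so the CLT sample size does not increase monotonically with the standard deviation.

Demo strata summary: CLT Sample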
site_name tree_count mean_carbon_kg sd_carbon_kg std_error power_calc CLT
Very Steady 151 0.1481133 0.0709162 0.0057711 2.3 9.78526
Steady 173 0.2263267 0.1571232 0.0119459 5.2 20.57209
Normal 178 0.4256436 0.2501144 0.0187469 11.2 14.73855
Variant 190 0.6394418 0.4990083 0.0362019 41.4 25.99459
Very Variant 208 1.7013919 1.2109702 0.0839657 239.2 21.62359
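
The CLT column can be reproduced from the means and standard deviations in the table. The settings below are inferred from the tabulated values rather than stated in the source: Z = 1.96 and a margin of error E equal to 30% of each stratum mean (the 95% / 30% option listed above). The function name `clt_n` and the `cv` column are illustrative.

```r
# Means and standard deviations per stratum, copied from the table above
demo <- data.frame(
  site_name      = c("Very Steady", "Steady", "Normal", "Variant", "Very Variant"),
  mean_carbon_kg = c(0.1481133, 0.2263267, 0.4256436, 0.6394418, 1.7013919),
  sd_carbon_kg   = c(0.0709162, 0.1571232, 0.2501144, 0.4990083, 1.2109702)
)

# n = (Z * sd / E)^2
clt_n <- function(sd, E, z = 1.96) {
  (z * sd / E)^2
}

# Assumption: E is 30% of the stratum mean (inferred from the table values)
demo$CLT <- clt_n(demo$sd_carbon_kg, E = 0.30 * demo$mean_carbon_kg)

# With E proportional to the mean, n depends only on the coefficient of variation
demo$cv <- demo$sd_carbon_kg / demo$mean_carbon_kg

print(demo)
```

With E tied to the mean, the sample size is driven by sd / mean rather than by sd alone, which is why the Variant stratum ends up with a larger value than the Very Variant stratum.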

c. A/R Methodological Tool.

\[n = \frac{N \cdot t^2 \cdot \left(\sum_i w_i \cdot s_i\right)^2}{N \cdot E^2 + t^2 \cdot \sum_i w_i \cdot s_i^2}\]

n = Number of sample plots required for estimation of biomass stocks within the project boundary; dimensionless.

N = Total number of possible sample plots within the project boundary (the sampling space or population); dimensionless. (plot_count)

t = Two-sided Student's t-value, at infinite degrees of freedom, for the required confidence level; dimensionless. (From the table: 90% = 1.645)

w_i = Relative weight of the area of stratum i (i.e. the area of stratum i divided by the project area). (154/ plot_count * 154)

s_i = Estimated standard deviation of biomass stock in stratum i. (SD Carbon Biomass)

E = Acceptable margin of error, calculated by multiplying the mean biomass stock by the desired precision, i.e. mean biomass stock * 0.1 (for 10% precision) or 0.2 (for 20% precision).

Results: In “lightpink”, the A/R Methodological tool gave us 0.7823963 as the smallest sample size, for the stratum with the lowest variance, and 168.9942103 as the largest, for the stratum with the highest variance.

INCOMPLETE: The E factor is having an effect on the final A/R Tool results that needs to be reviewed.

Demo Strata summary: A/R Methodological Tool
site_name tree_count stratum_size_ha mean_carbon_kg sd_carbon_kg std_error power_calc CLT ar_tool
Very Steady 151 100 0.1481133 0.0709162 0.0057711 2.3 9.78526 0.7823963
Steady 173 100 0.2263267 0.1571232 0.0119459 5.2 20.57209 3.8227536
Normal 178 100 0.4256436 0.2501144 0.0187469 11.2 14.73855 9.5999852
Variant 190 100 0.6394418 0.4990083 0.0362019 41.4 25.99459 36.6006678
Very Variant 208 100 1.7013919 1.2109702 0.0839657 239.2 21.62359 168.9942103
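
The A/R tool formula can be written as a small function. This is a sketch under stated assumptions, not the code behind the ar_tool column: the function name `ar_tool_n`, the two-stratum example, and all input values are hypothetical, and the sketch does not attempt to reproduce the tabulated results, since the E used there is flagged above as still under review.

```r
# n = N * t^2 * (sum(w * s))^2 / (N * E^2 + t^2 * sum(w * s^2))
ar_tool_n <- function(N, w, s, E, t_value = 1.645) {
  stopifnot(length(w) == length(s), abs(sum(w) - 1) < 1e-8)
  numerator   <- N * t_value^2 * sum(w * s)^2
  denominator <- N * E^2 + t_value^2 * sum(w * s^2)
  numerator / denominator
}

# Hypothetical project: 100 ha sampled with 154 m2 plots
N <- 100 * 10000 / 154

# Hypothetical strata: area weights, standard deviations, and margin of error
n_plots <- ar_tool_n(N, w = c(0.6, 0.4), s = c(0.25, 0.50), E = 0.05)

print(ceiling(n_plots))  # round up to whole plots
```

For very large N the result approaches t^2 * (sum(w * s))^2 / E^2, so the finite-population term only matters when the project area is small relative to the plot size.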

d. CIFOR

The CIFOR document by Kauffman and Donato (2012) on mangrove forest research, published by the Center for International Forestry Research (CIFOR), outlines a formula for calculating sample sizes.

\[n = \left(\frac{t \cdot s}{E}\right)^2\]

Where:

  • n = the number of sample plots

  • t = the t-distribution value, usually 2 for 95% confidence

  • s = the expected standard deviation from prior data

  • E = the acceptable margin of error.

Results: In “gold”, the CIFOR formula gave us 22.0168348 for the lowest-variance stratum (Very Steady) and 48.6530775 for the highest-variance stratum (Very Variant). The largest value, 58.48783, occurs in the Variant stratum.

Demo Strata summary: CIFOR Sample
site_name tree_count mean_carbon_kg sd_carbon_kg std_error power_calc CLT ar_tool CIFOR
Very Steady 151 0.1481133 0.0709162 0.0057711 2.3 9.78526 0.7823963 22.01683
Steady 173 0.2263267 0.1571232 0.0119459 5.2 20.57209 3.8227536 46.28720
Normal 178 0.4256436 0.2501144 0.0187469 11.2 14.73855 9.5999852 33.16173
Variant 190 0.6394418 0.4990083 0.0362019 41.4 25.99459 36.6006678 58.48783
Very Variant 208 1.7013919 1.2109702 0.0839657 239.2 21.62359 168.9942103 48.65308
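
In every row of this table the CIFOR value is 2.25 times the CLT value (for example, 22.01683 / 9.78526 = 2.25). One reading consistent with this, offered as an inference from the table rather than something stated in the source, is that both columns used the same multiplier (t = Z = 1.96) and differed only in the margin of error: 20% of the stratum mean \(\bar{x}\) for CIFOR versus 30% for CLT.

\[\frac{n_{CIFOR}}{n_{CLT}} = \left(\frac{t}{Z}\right)^2 \left(\frac{E_{CLT}}{E_{CIFOR}}\right)^2 = \left(\frac{1.96}{1.96}\right)^2 \left(\frac{0.30\,\bar{x}}{0.20\,\bar{x}}\right)^2 = 2.25\]

Because one column is a constant multiple of the other, the two methods rank the strata identically and have the same Pearson correlation with the standard deviation.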

e. Winrock

Now we will add the results from the Winrock tool manually for a good comparison of all five methods.

Results: In “purple”, the Winrock tool gave us 5.99 as the smallest sample size, for the stratum with the lowest variance, and 106.6 as the largest, for the stratum with the highest variance.

Table 1. Demo dataset summary per stratum
site_name tree_count mean_carbon_kg sd_carbon_kg std_error power_calc CLT ar_tool CIFOR Winrock
Very Steady 151 0.1481133 0.0709162 0.0057711 2.3 9.78526 0.7823963 22.01683 5.99
Steady 173 0.2263267 0.1571232 0.0119459 5.2 20.57209 3.8227536 46.28720 13.70
Normal 178 0.4256436 0.2501144 0.0187469 11.2 14.73855 9.5999852 33.16173 22.96
Variant 190 0.6394418 0.4990083 0.0362019 41.4 25.99459 36.6006678 58.48783 36.75
Very Variant 208 1.7013919 1.2109702 0.0839657 239.2 21.62359 168.9942103 48.65308 106.60

Final sample size and correlations per method

The table below presents two critical metrics for evaluating our methodologies:

  1. The total sample size.

  2. The correlation with standard deviation (SD).

Overall, this table highlights the impact of variability on each method and shows each method's total sample size, which is useful for the cost-benefit analysis. The correlation with SD shows how strongly the variance within the strata drives each method's outcome: a higher correlation means that changes in SD translate more directly into changes in the calculated sample size. In this case, the Winrock method shows the highest correlation with SD, followed by the A/R Tool and the Power Calculation. This indicates their sensitivity to the variance in the data, which can be useful for stakeholder decisions.

Table: Total Sample Size and Correlation per Method
method total_sample_size Correlation_with_SD
CLT 92.71408 0.5307088
Winrock 186.00000 0.9966763
CIFOR 208.60668 0.5307088
ar_tool 219.80001 0.9868234
power_calc 299.30000 0.9787978
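
Both metrics can be recomputed from Table 1. The sketch below copies the tabulated values into a data frame (the object names `table1` and `summary_tbl` are illustrative) and uses the Pearson correlation, which is assumed to be the coefficient behind the Correlation_with_SD column.

```r
# Values copied from Table 1 (demo dataset summary per stratum)
table1 <- data.frame(
  site_name    = c("Very Steady", "Steady", "Normal", "Variant", "Very Variant"),
  sd_carbon_kg = c(0.0709162, 0.1571232, 0.2501144, 0.4990083, 1.2109702),
  power_calc   = c(2.3, 5.2, 11.2, 41.4, 239.2),
  CLT          = c(9.78526, 20.57209, 14.73855, 25.99459, 21.62359),
  ar_tool      = c(0.7823963, 3.8227536, 9.5999852, 36.6006678, 168.9942103),
  CIFOR        = c(22.01683, 46.28720, 33.16173, 58.48783, 48.65308),
  Winrock      = c(5.99, 13.70, 22.96, 36.75, 106.60)
)

methods <- c("power_calc", "CLT", "ar_tool", "CIFOR", "Winrock")

# Total sample size and Pearson correlation with SD for each method
summary_tbl <- data.frame(
  method              = methods,
  total_sample_size   = unname(colSums(table1[methods])),
  correlation_with_sd = unname(sapply(table1[methods], cor, y = table1$sd_carbon_kg))
)

# Sort by total sample size, smallest first
summary_tbl <- summary_tbl[order(summary_tbl$total_sample_size), ]
print(summary_tbl, row.names = FALSE)
```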

Step 7. Calculate the different sample sizes with the real data sets.

Now, we will calculate the sample sizes again, but this time using the actual data collected from the three mangrove restoration sites.

Methodology results for site B

Table 4. Real Data Set Sample Sizes
site year plot_size_m2 plot_count sd_carbon_kg power_calc CLT ar_tool CIFOR Winrock
B 2020 153.938 5 0.8938315 192.1 313.98551 0.0001628 706.4674 32.76
B 2019 153.938 7 1.8395299 51.2 82.42220 0.0014350 185.4499 97.37
B 2017 153.938 11 2.1047953 44.5 71.47415 0.0008316 160.8168 73.94
B 2018 153.938 7 2.5186589 62.8 101.56436 0.0012075 228.5198 89.27
B 2016 153.938 13 4.5645982 42.4 68.08314 0.0124360 153.1871 282.53
B 2015 153.938 7 6.5105214 74.9 121.38922 0.0069008 273.1257 213.14

Step 8. Perform the confusion matrix to understand the relationship between parameters and strengthen the SOP for calculating sample sizes

For this last step, we create three confusion matrices (strictly speaking, correlation matrices) to analyze the field sampling data from the mangrove restoration projects. This analysis helps us explore the relationships between the parameters involved.

For each data set, we keep the relevant columns and compute the correlation between each pair of parameters. A value close to 1 indicates a strong positive relationship: as one variable increases, the other tends to increase as well. A value close to -1 indicates a strong inverse relationship: as one variable increases, the other tends to decrease. Values near 0 indicate little or no linear relationship.

To complement this analysis, we run t-tests to check the statistical significance of the observed correlations. This step is essential because even a high correlation may not be statistically significant if, for example, the data form a dispersed cloud of points, which could mislead our interpretation of the relationships within our environmental sampling data.

The table below describes the symbol used for each significance level, based on the p-value.

Significance Levels Description
P.Value        Symbol   Description
< 0.001        ***      Highly significant
0.001 - 0.01   **       Very significant
0.01 - 0.05    *        Significant
0.05 - 0.1     .        Marginally significant
> 0.1          ---      Not significant
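
The correlation matrix and its significance symbols can be sketched as follows. Since the field data are not included here, the example uses R's built-in `trees` data set (Girth, Height, Volume) as a stand-in. The helper `p_to_symbol` and the object names are illustrative, and `cor.test()` is assumed to be the t-test referred to above.

```r
# Stand-in data: built-in 'trees' data set, not the mangrove field data
dat  <- trees
vars <- names(dat)
k    <- length(vars)

# Pairwise Pearson correlations
r_mat <- cor(dat)

# Pairwise p-values from the t-test on each correlation
p_mat <- matrix(NA_real_, k, k, dimnames = list(vars, vars))
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    p <- cor.test(dat[[i]], dat[[j]])$p.value
    p_mat[i, j] <- p
    p_mat[j, i] <- p
  }
}

# Map p-values to the symbols in the significance table
p_to_symbol <- function(p) {
  as.character(cut(
    p,
    breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
    labels = c("***", "**", "*", ".", "---"),
    include.lowest = TRUE
  ))
}

sig_mat <- matrix(p_to_symbol(p_mat), k, k, dimnames = list(vars, vars))

print(round(r_mat, 2))
print(sig_mat)
```

The diagonal of `sig_mat` is left as NA, matching the layout of the significance matrices shown for each site.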

Confusion Matrix - Site A: Abu Ali, Saudi Arabia

P-Value Results - Site A: Abu Ali, Saudi Arabia

P-Value Significance Matrix. SITE A
plantation_year plot_size_m2 height_cm crown_size_m total_tree_kg_c total_tree_mg_c_ha
plantation_year NA *** *** *** *** ***
plot_size_m2 *** NA *** *** *** ***
height_cm *** *** NA *** *** ***
crown_size_m *** *** *** NA *** ***
total_tree_kg_c *** *** *** *** NA ***
total_tree_mg_c_ha *** *** *** *** *** NA

Confusion Matrix - Site B: Delta Blue Carbon, Sindh, Pakistan

P-Value Results - Site B: Delta Blue Carbon, Sindh, Pakistan

P-Value Significance Matrix. SITE B
year plot height_cm crown_dia_m total_c_kg total_c_t_ha
year NA *** *** *** *** ***
plot *** NA *** *** *** ***
height_cm *** *** NA *** *** ***
crown_dia_m *** *** *** NA *** ***
total_c_kg *** *** *** *** NA ***
total_c_t_ha *** *** *** *** *** NA

Confusion Matrix - Site C: Pakistan Institute

P-Value Results - Site C: Pakistan Institute

P-Value Significance Matrix. SITE C
year_of_plantation x_coordinate y_coordinate species dbh height_m crown_dia_m total_c total_c_t_ha_34
year_of_plantation NA * . *** *** *** *** *** ***
x_coordinate * NA *** *** *** *** *** *** ***
y_coordinate . *** NA *** *** *** *** *** ***
species *** *** *** NA *** *** *** *** ***
dbh *** *** *** *** NA *** *** *** ***
height_m *** *** *** *** *** NA *** *** ***
crown_dia_m *** *** *** *** *** *** NA *** ***
total_c *** *** *** *** *** *** *** NA ***
total_c_t_ha_34 *** *** *** *** *** *** *** *** NA

Conclusions

  • The demo data frame helped me understand how the sample size varies with the variance in each stratum. The relationship isn’t linear, and the sample sizes are highly sensitive to the standard deviation (SD) of each stratum. This raises a follow-up question: how can we create effective strata that lead to an optimal sample size, saving cost and effort?

  • The Power Calculation was the most responsive to variance, offering reasonable (conservative) sample sizes for achieving the desired power.

  • The Central Limit Theorem provides a simpler, big-picture view and tends to give smaller sample sizes than the power calculation.

  • The A/R Methodological tool (VM0033) gave us good results within the expected limits for the demo sample data set, but it requires complete data for the weighting parameter (w_i): size per stratum, total project area, and sample plot size.

  • The CIFOR tool is the simplest method, but it gave us the highest variance in sample sizes. I recommend it when the data needed for the other methods are missing.

  • For the real data sets, we got reasonable sample sizes from the power calculation and CLT. Unfortunately, the A/R tool did not work as expected because of the stratum-size weighting.

  • The Winrock tool is very easy to use. It needs the size and weighting data that the A/R methodology specifically asks for. It is a conservative tool.

  • Confusion matrices are valuable for guiding field data collection because they highlight the important variables.

    • A high correlation was observed between carbon stock and crown size in all three matrices.

    • For Site A, plot size was inversely related to total tree carbon per hectare. This may indicate a bias in the carbon calculation.

    • In the third matrix (Site C), the diameter at breast height (DBH) showed the highest correlation with carbon stock. Interestingly, the X coordinate was more strongly correlated with carbon stock than the Y coordinate.

Key Takeaways

  1. Understanding the correlation between stratification and variance for sample size is key. Larger areas with greater variance will need more samples, highlighting the importance of having the correct stratification to create relatively homogeneous areas, minimizing the variance from the beginning of the project.

  2. Running correlation matrices and t-tests helps identify the key parameters to consider in the stratification. In this case, we saw that it is essential to consider clusters based on crown size, plot size, height, and diameter at breast height (DBH).

  3. For Site A, there was an inverse relationship between plot size and total tree carbon per hectare, indicating that plot size influences the calculation. It is therefore useful to measure plots of different sizes, run this analysis, and feed the results into the project's continuous improvement plan.

  4. An interesting point from the A/R Tool is that in large strata, once you approach the sample size cap, expanding the area further only slightly increases the sample size. For example, expanding from 1,000 ha to 100,000 ha has a smaller percentage effect on sample size than increasing from 100 ha to 500 ha. This needs to be considered in the stratification process.

  5. Using the other methods is useful when:

    1. You are going to a new site and have only a small data set: you can calculate the sample size with CIFOR or CLT and kick off the initial analysis for a monitoring campaign.

    2. You need to support stakeholder decisions: comparing the sample sizes across methods against the theoretical analysis can make those decisions easier and safer.

    3. You are developing a more detailed monitoring campaign plan based on the initial data available: carbon projects commonly lack some data fields, so having several methods gives you different options.
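
The saturation effect described in takeaway 4 can be illustrated with the single-stratum form of the A/R formula, where w = 1 and n = N * t^2 * s^2 / (N * E^2 + t^2 * s^2). All inputs below are hypothetical (a 154 m2 plot, s = 0.5, E = 0.064) and the function name is illustrative. The point is the shape of the curve, not the specific numbers.

```r
# Single-stratum A/R formula (w = 1); N is derived from area and plot size
ar_single_stratum_n <- function(area_ha, s, E, plot_m2 = 154, t_value = 1.645) {
  N <- area_ha * 10000 / plot_m2  # number of possible plots in the stratum
  (N * t_value^2 * s^2) / (N * E^2 + t_value^2 * s^2)
}

# Hypothetical stratum areas in hectares
areas <- c(100, 500, 1000, 100000)
n <- ar_single_stratum_n(areas, s = 0.5, E = 0.064)

# Upper bound reached as the area grows without limit
n_cap <- 1.645^2 * 0.5^2 / 0.064^2

# Percentage change in sample size for the two expansions named in the text
pct_100_to_500     <- 100 * (n[2] / n[1] - 1)
pct_1000_to_100000 <- 100 * (n[4] / n[3] - 1)

print(data.frame(area_ha = areas, n = round(n, 2)))
print(c(pct_100_to_500 = pct_100_to_500, pct_1000_to_100000 = pct_1000_to_100000))
```

Under these inputs the sample size keeps rising with area but stays below `n_cap`, and the hundredfold expansion from 1,000 ha changes it by a smaller percentage than the fivefold expansion from 100 ha.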

Citation

BibTeX citation:
@online{patrón2024,
  author = {Patrón, Javier},
  title = {Blue {Carbon} {Sample} {Size} {Analysis}},
  date = {2024-06-28},
  url = {https://github.com/javipatron},
  langid = {en}
}
For attribution, please cite this work as:
Patrón, Javier. 2024. “Blue Carbon Sample Size Analysis.” June 28, 2024. https://github.com/javipatron.