2. Input Data Processing

Prepare data for the optimisation

Data

The pseudomised data is available in the public_data folder as an example. These are NOT the original values from KBS database.

Three datasets are relevant:

  • contrib_per_unit.xlsx
  • constraints.xlsx
  • bounds.xlsx

1. Contribution per unit food

This dataframe stores data for contribution per unit food on

  • macro nutrients
  • micro nutrients
  • environmental impact
  • others
library(readxl)
library(data.table)

cpu <- read_excel("data/contrib_per_unit.xlsx")
head(cpu)
# A tibble: 6 × 55
  Foodgroup  Means_pr10MJ `Energy (MJ)` Protein1 Protein2 Available carbohydra…¹
  <chr>             <dbl>         <dbl>    <dbl>    <dbl>                  <dbl>
1 Bread, fi…        85.3         0.0109   0.0828   0.0828                  0.460
2 Bread, co…       116.          0.0101   0.0940   0.0940                  0.413
3 Flour, gr…         3.78        0.014    0.117    0.117                   0.671
4 Rice               9.87        0.0135   0.0670   0.0670                  0.704
5 Pasta             11.1         0.0137   0.109    0.109                   0.624
6 Breakfast…        17.2         0.0157   0.109    0.109                   0.608
# ℹ abbreviated name: ¹​`Available carbohydrates1`
# ℹ 49 more variables: `Available carbohydrates2` <dbl>, `Added Sugar` <dbl>,
#   `Dietary fiber` <dbl>, Fat1 <dbl>, Fat2 <dbl>,
#   `Sum saturated fatty acids` <dbl>, `Sum trans fatty acids` <dbl>,
#   `Sum n-3 fatty acids` <dbl>, `Sum ALA` <dbl>,
#   `Sum monounsaturated fatty acids1` <dbl>,
#   `Sum monounsaturated fatty acids2` <dbl>, …
colnames(cpu)
 [1] "Foodgroup"                        "Means_pr10MJ"                    
 [3] "Energy (MJ)"                      "Protein1"                        
 [5] "Protein2"                         "Available carbohydrates1"        
 [7] "Available carbohydrates2"         "Added Sugar"                     
 [9] "Dietary fiber"                    "Fat1"                            
[11] "Fat2"                             "Sum saturated fatty acids"       
[13] "Sum trans fatty acids"            "Sum n-3 fatty acids"             
[15] "Sum ALA"                          "Sum monounsaturated fatty acids1"
[17] "Sum monounsaturated fatty acids2" "Sum polyunsaturated fatty acids1"
[19] "Sum polyunsaturated fatty acids2" "Vitamin A"                       
[21] "Vitamin E"                        "Thiamin (Vitamin B1)"            
[23] "Riboflavin (Vitamin B2)"          "Niacin equivalent"               
[25] "Vitamin B6"                       "Folate"                          
[27] "Vitamin B12"                      "Vitamin C"                       
[29] "Vitamin D"                        "Sodium"                          
[31] "Potassium"                        "Calcium"                         
[33] "Magnesium"                        "Phosphorus"                      
[35] "Iron"                             "Zinc"                            
[37] "Iodine"                           "Selenium"                        
[39] "Copper"                           "Alcohol"                         
[41] "GHGE"                             "FE"                              
[43] "ME"                               "ACID"                            
[45] "WU"                               "LU"                              
[47] "Whole grains"                     "Fruit"                           
[49] "Vegetables"                       "Dairy1"                          
[51] "Dairy2"                           "Fish1"                           
[53] "Fish2"                            "Red meat"                        
[55] "White meat"                      
Note

The 2nd column (Means_pr10MJ) is the average food intake, which is same as mean in bounds data (3rd data file). It is not necessary to be in this table as it is not the per unit contribution. We will drop it in the subsequent data processing.

2. Constraints

Total diet constraint limits for each outcome of interest. The names need to match the column names in contrib_per_unit (check it in the next section)

  • Dir: direction: equal, greater or less than
  • rhs: right hand side values
constraints <- read_excel("data/constraints.xlsx") 
head(constraints)
# A tibble: 6 × 3
  tag_outcome             Dir     rhs
  <chr>                   <chr> <dbl>
1 Energy (MJ)             E      10.5
2 Protein, g, lower       G      62.9
3 Protein, g, upper       L     126. 
4 Carbohydrates, g, lower G     283. 
5 Carbohydrates, g, upper L     377. 
6 Added sugar, g          L      62.9

3. Bounds

Intake information on each food groups. Food names need to match the food names in contrib_per_unit.

  • mean: average intake based on dietary survey. Served as baseline
  • lower_bound, upper_bound: lower and upper permissible optimised intake.
intake <- read_excel("data/bounds.xlsx") 
head(intake)
# A tibble: 6 × 4
  Foodgroup           mean lower_bound upper_bound
  <chr>              <dbl>       <dbl>       <dbl>
1 Bread fine         85.3        8.53        245. 
2 Bread coarse      116.        11.6         289. 
3 Flours grains       3.78       0.378        20.6
4 Rice                9.87       0.987        53.1
5 Pasta              11.1        1.11         61.6
6 Breakfast cereals  17.2        1.72         99.1

Data check

Objectives
  • Check whether the food names and outcome (nutrient, envimpact) match across different tables, so the multiplication later is correct

Check food names

# cpu$Foodgroup
# intake$Foodgroup

# put the food groups from two tables together
food_list <- cbind(cpu$Foodgroup, intake$Foodgroup)
head(food_list)
     [,1]                [,2]               
[1,] "Bread, fine"       "Bread fine"       
[2,] "Bread, coarse"     "Bread coarse"     
[3,] "Flour, grains"     "Flours grains"    
[4,] "Rice"              "Rice"             
[5,] "Pasta"             "Pasta"            
[6,] "Breakfast cereals" "Breakfast cereals"

It can be helpful to check in detail whether they match. In this case, it is because of the commas.

# check whether the values match
food_list[, 1] == food_list[, 2]
# if they do not match, can inspect in more details
food_list[food_list[, 1] != food_list[, 2], ]
1
this line puts all the indices where the 1st and 2nd column don’t match as the row index for food_list dataframe, and prints out all the rows corresponding to those indices
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
[13] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[49]  TRUE FALSE  TRUE  TRUE FALSE
      [,1]                           [,2]                         
 [1,] "Bread, fine"                  "Bread fine"                 
 [2,] "Bread, coarse"                "Bread coarse"               
 [3,] "Flour, grains"                "Flours grains"              
 [4,] "Cakes, cookies, crackers"     "Cakes cookies crackers"     
 [5,] "Potatoes, raw, boiled, baked" "Potatoes raw boiled"        
 [6,] "Potatoes, fried"              "Potatoes fried"             
 [7,] "Vegetables, dark green"       "Vegetables dark green"      
 [8,] "Vegetables, red orange"       "Vegetables red orange"      
 [9,] "Vegetables, other"            "Vegetables other"           
[10,] "Vegetables, salad"            "Vegetables salad"           
[11,] "Pomme, stone"                 "Pomme stone"                
[12,] "Fruit, other"                 "Fruit other"                
[13,] "Meat, dairy substitutes"      "Meat dairy substitutes"     
[14,] "Fish, lean"                   "Fish lean"                  
[15,] "Fish, fatty"                  "Fish fatty"                 
[16,] "Shellfish, other"             "Shellfish other"            
[17,] "Milk, low fat"                "Milk low fat"               
[18,] "Milk, high fat"               "Milk high fat"              
[19,] "Dairy, fermented"             "Dairy fermented"            
[20,] "Dairy, other"                 "Dairy other"                
[21,] "Cheese, fresh"                "Cheese fresh"               
[22,] "Cheese, brown"                "Cheese brown"               
[23,] "Cheese, other"                "Cheese other"               
[24,] "Fats, plant-based"            "Fats plant based"           
[25,] "Fats, animal-based"           "Fats animal based"          
[26,] "Coffee, tea"                  "Coffee tea"                 
[27,] "Soft drinks, sugar-sweetened" "Soft drinks sugar sweetened"
[28,] "Soft drinks, sugar-free"      "Soft drinks sugar free"     
[29,] "Sugar, sweets"                "Sugar sweets"               
[30,] "Sauces, miscellaneous"        "Sauces miscellaneous"       

Check tags

The names of outcomes of interest might be coded differently, so it might be necessary to recode the names.

Code
cpu <- dplyr::select(cpu ,
                     "Foodgroup",
                     "Energy (MJ)" = "Energy (MJ)",
                     "Protein, g, lower" = "Protein1", 
                     "Protein, g, upper" = "Protein2", 
                     "Carbohydrates, g, lower"="Available carbohydrates1", 
                     "Carbohydrates, g, upper"="Available carbohydrates2",
                     "Added sugar, g"="Added Sugar",
                     "Dietary fiber, g"= "Dietary fiber", 
                     "Fat, g, lower"= "Fat1",  
                     "Fat, g, upper"= "Fat2",
                     "Saturated fatty acids, g"="Sum saturated fatty acids",
                     "Trans fatty acids, g"="Sum trans fatty acids", #Added
                     "n-3 fatty acids, g"="Sum n-3 fatty acids",
                     "ALA, g" = "Sum ALA",
                     "MUFA, g, lower" = "Sum monounsaturated fatty acids1",
                     "MUFA, g, upper" = "Sum monounsaturated fatty acids2",
                     "PUFA, g, lower" = "Sum polyunsaturated fatty acids1",
                     "PUFA, g, upper" = "Sum polyunsaturated fatty acids2",
                     "Vitamin A, RE µg"= "Vitamin A",
                     "Vitamin E, alfa-TE"= "Vitamin E",
                     "Thiamin (Vitamin B1), mg"= "Thiamin (Vitamin B1)",
                     "Riboflavin (Vitamin B2), mg"= "Riboflavin (Vitamin B2)",
                     "Niacin, NE"="Niacin equivalent",
                     "Vitamin B6, mg"="Vitamin B6",
                     "Folate, µg"="Folate",
                     "Vitamin B12, µg"="Vitamin B12",
                     "Vitamin C, mg"="Vitamin C",
                     "Vitamin D, µg" = "Vitamin D",
                     "Sodium, mg"="Sodium",
                     "Potassium, mg"="Potassium",
                     "Calcium, mg"="Calcium",
                     "Magnesium, mg"= "Magnesium",
                     "Phosphorus, mg"="Phosphorus",
                     "Iron, mg"="Iron",
                     "Zinc, mg"="Zinc",
                     "Iodine, µg"="Iodine",
                     "Selenium, µg"="Selenium",
                     "Copper, µg"="Copper",
                     "Alcohol, g" = "Alcohol",
                     "GHGE, kg CO2-eq" = "GHGE",
                     "FE, g P-eq" = "FE",
                     "ME, g N-eq" = "ME",
                     "TA, g SO2-eq" = "ACID",
                     "WU, m2" = "WU",
                     "LU, m3a" = "LU",
                     "Whole grains, g" = "Whole grains",
                     "Total fruit, g" = "Fruit",
                     "Total vegetables, g" = "Vegetables",
                     "Total dairy, minimum, g" = "Dairy1",
                     "Total dairy, maximum, g" = "Dairy2",
                     "Total fish, minimum, g" = "Fish1",
                     "Total fish, maximum, g" = "Fish2",
                     "Total red meat, maximum, g" = "Red meat",
                     "Total white meat, maximum, g" = "White meat" 
)
1
data name
2
the column name that already exists in the data
3
if wish to change the name, new_name = existing_name

Afterwards, compare with another table.

# extract the tags of interest from two data
tags1 <- colnames(cpu)[-1]
tags2 <- constraints$tag_outcome

# put together for easy comparison
tags_list <- cbind(tags1, tags2)

# check whether the values match
tags_list[, 1] == tags_list[, 2]
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[25] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[37] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE  TRUE  TRUE  TRUE
# if they do not match, can inspect in more details
tags_list[tags_list[, 1] != tags_list[, 2], ]  
      tags1              tags2             
 [1,] "Vitamin A, RE µg" "Vitamin A, RE μg"
 [2,] "Folate, µg"       "Folate, μg"      
 [3,] "Vitamin B12, µg"  "Vitamin B12, μg" 
 [4,] "Vitamin D, µg"    "Vitamin D, μg"   
 [5,] "Iodine, µg"       "Iodine, μg"      
 [6,] "Selenium, µg"     "Selenium, μg"    
 [7,] "Copper, µg"       "Copper, μg"      
 [8,] "GHGE, kg CO2-eq"  "GHGE, kg COe-eq" 
 [9,] "FE, g P-eq"       "FE, kg P-eq"     
[10,] "ME, g N-eq"       "ME, kg N-eq"     
[11,] "TA, g SO2-eq"     "TA, kg SO2-eq"   
Note

Pay attention to the unmatched ones. In some cases (row 9, 10, 11) it is indeed unmatched (g vs kg); in other cases where special character is involved (such as mu), human will see the same characters but R “sees” differently due to encoding. User should decide for themselves whether they are truly different.

Data process

Objectives

Process the data so that they are in the correct format for the optimisation step

Data Information In Rcplex (processing required)
contrib_per_unit Contribution per food unit Amat
constraints Constraint limits and directions bvec, sense
bounds Intake for baseline, lower and upper bounds Qmat, cvec, lb, ub

Amat

Amat comes from contrib_per_unit (here it is called cpu). The necessary processing is to remove the first column (Food names), and transpose so that rows become columns.

The dimension for Amat is \(n_{outcomes} \times n_{foods}\), these two do not need to be equal. In this example they are equal simply because we happen to have 53 foods and 53 outcomes.

Amat <- as.matrix(t(cpu[, -c(1)])) # remove the Foodgroup, then transpose
dim(Amat)
[1] 53 53

bvec and sense

bvec and sense are values from the constraints data. Each pair of values should correspond to one outcome.

head(constraints)
# A tibble: 6 × 3
  tag_outcome             Dir     rhs
  <chr>                   <chr> <dbl>
1 Energy (MJ)             E      10.5
2 Protein, g, lower       G      62.9
3 Protein, g, upper       L     126. 
4 Carbohydrates, g, lower G     283. 
5 Carbohydrates, g, upper L     377. 
6 Added sugar, g          L      62.9
bvec <- constraints$rhs
sense <- constraints$Dir   #Rcplex uses E, L, and R instead of ==, <= and >= 

lb, ub

lb, ub are values from bounds (here called intake).

head(intake, 3)
# A tibble: 3 × 4
  Foodgroup       mean lower_bound upper_bound
  <chr>          <dbl>       <dbl>       <dbl>
1 Bread fine     85.3        8.53        245. 
2 Bread coarse  116.        11.6         289. 
3 Flours grains   3.78       0.378        20.6
ub <- as.numeric(intake$upper_bound)
lb <- as.numeric(intake$lower_bound)

Qmat, cvec

Qmat and cvec are derived from the bound data. They are based on the current average intake (baseline).

obs <- intake$mean
obssq <- 1/obs^2
n_food <- length(obs)

# create diagonal matrix, where each diagonal value is obssq
Qmat <- 2*as.matrix(diag(obssq, n_food, n_food)) 

# cvec
cvec <- -(2/obs)

Put together as a list

For convenience, these values can be saved in an R object (a list) and used directly in the optimisation program.

d <- list(cvec = cvec, 
          Amat = Amat, 
          bvec = bvec, 
          Qmat = Qmat, 
          lb = lb, 
          ub = ub, 
          sense = sense)

# to save: execute the following line
saveRDS(d, file = 'd.RData')
# to read: execute the following line
d <- readRDS('d.RData')

Once this is done, you can refer to the next chapter: Run optimisation.