A hint on how to read World Bank data into R

May 19, 2018
R Import



World Bank is an ideal place for us to access global development data. We can always search suitable open data by relevant keyword and download it in any format we prefer (CSV, XML, EXCEL). However, if you are the guy who prefers to wrangle data with R, then there is another option for you to deal with it instead of download open data from World Bank into your pc and import into R manually by using read.csv( ), read.xlsx( ) functions.





Installation


To download and install ‘WDI’ package from CRAN, use install.packages( ) function.



install.packages('WDI')
# Install R package, 'WDI'.






Search for data


By using WDIsearch( ) function from ‘WDI’ package, we are allowed to search interested datasets from World Bank web API by inserting relevant keywords. But before we make use of this function, do remember to load the package in our R console after the installation of ‘WDI’ package.



library('WDI')
# Load R package.



WDIsearch('keyword')
# Insert relevant keyword inside single quote.



Let’s say we’re looking for global greenhouse gas emissions data from World Bank for our assignment purpose. We can start to search the relevant data by key in:

  • greenhouse
  • green

inside the single quote.



WDIsearch('greenhouse')
# 'greenhouse' keyword

WDIsearch('green')
# 'green' keyword



Please be noted that the keyword we insert into the function inside single quote(not wrong for using a double quote, " “) should not consist of white space. For an example,

  • greenhouse gas emissions
  • green house gas emissions


If you wish to do these, use ‘.*’ for each white space between the words. For an instance,

  • greenhouse.*gass.*emissions
  • green.*house.*gass.*emissions



WDIsearch('greenhouse.\*gass.\*emissions')
# 'greenhouse.\*gass.\*emissions' keyword

WDIsearch('green.\*house.\*gass.\*emissions')
# 'green.\*house.\*gass.\*emissions' keyword



All of them produce the same output which is a matrix which consists of a list of indicators and name of datasets.



     indicator          
[1,] "EN.ATM.GHGT.ZG"   
[2,] "EN.ATM.GHGT.KT.CE"
[3,] "EN.ATM.GHGO.ZG"   
[4,] "EN.ATM.GHGO.KT.CE"
     name                                                                                       
[1,] "Total greenhouse gas emissions (% change from 1990)"                                      
[2,] "Total greenhouse gas emissions (kt of CO2 equivalent)"                                    
[3,] "Other greenhouse gas emissions (% change from 1990)"                                      
[4,] "Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)"






Convert class of searched output(optional)


It seems like the output produced above is a little bit messy. So, let’s convert the class of the matrix into another data type, data frame. Before that, we have to store the output produced by a named variable.



data= WDIsearch('greenhouse')
# Store in 'data'



Use as.data.frame( ) function to convert any data type into data frame.



as.data.frame(data)
# Convert class



Or perform these two steps in one line.



data= as.data.frame(WDIsearch('greenhouse'))



Both of the above ways produce the same output in data frame format.



          indicator
1    EN.ATM.GHGT.ZG
2 EN.ATM.GHGT.KT.CE
3    EN.ATM.GHGO.ZG
4 EN.ATM.GHGO.KT.CE
                                                                                       name
1                                       Total greenhouse gas emissions (% change from 1990)
2                                     Total greenhouse gas emissions (kt of CO2 equivalent)
3                                       Other greenhouse gas emissions (% change from 1990)
4 Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)



Because of limited space here, the ‘name’ variable is shifted a little bit downward, but if you follow the steps in R console, this should not be an issue.






Download and insert data into R


If you have read my first post on this blog, you will know how complicated I wrangled the downloaded data from World Bank web API in R during analytics competition. However, one of the wonderful features for this R package is WDI( ) function allows us to make use of downloaded data from World Bank in tidy format directly (one variable in one column and every observation have its row).



variable_name= WDI(indicator='keyword', country= c('country_code1','country_code2','country_code3'),start= start_year, end= end_year)



Let’s talk back about global greenhouse gas emissions. We have obtained the list of greenhouse gas emissions data which available on World Bank.



indicator name
EN.ATM.GHGT.ZG Total greenhouse gas emissions (% change from 1990)
EN.ATM.GHGT.KT.CE Total greenhouse gas emissions (kt of CO2 equivalent)
EN.ATM.GHGO.ZG Other greenhouse gas emissions (% change from 1990)
EN.ATM.GHGO.KT.CE Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)



Indicator for each of dataset is important as it is used as a keyword which can be included in WDI( ) function.

We have a scenario here. Let’s say we want to download the total greenhouse gas emissions in kilotonnes of carbon dioxide equivalent for Malaysia, Singapore, Canada from the year 2000 to 2004, then we need to include the indicator for this dataset, country code, start year and end year inside the WDI( ) function.


  • indicator= EN.ATM.GHGT.KT.CE (from searched result)
  • country code= Malaysia= MY, Singapore= SG, Canada= CA
  • start year= 2000
  • end year= 2004



ghge= WDI(indicator='EN.ATM.GHGT.KT.CE', country= c('MY','SG','CA'),start= 2000,end= 2004)
# Download data and store into 'ghge'.



Now, we have our dataset in tidy format. If you notice, you will found that total greenhouse gas emissions in kilotonnes in different years for these three countries are combined under a named ‘year’ column.



   iso2c   country EN.ATM.GHGT.KT.CE year
1     CA    Canada        1040136.77 2004
2     CA    Canada         951373.04 2003
3     CA    Canada         908686.15 2002
4     CA    Canada         718534.46 2001
5     CA    Canada         760643.00 2000
6     MY  Malaysia         216283.73 2004
7     MY  Malaysia         195386.72 2003
8     MY  Malaysia         199983.48 2002
9     MY  Malaysia         176214.37 2001
10    MY  Malaysia         171393.52 2000
11    SG Singapore          44836.63 2004
12    SG Singapore          43165.65 2003
13    SG Singapore                NA 2002
14    SG Singapore          49851.52 2001
15    SG Singapore                NA 2000






Renaming columns of data


To change the name for each column of data, use colnames( ) function.



columnnames=c('Country Code','Country','Total Greenhouse Gass Emissions in kt','Year')
# Renaming columns of data and store in 'columnnames'

colnames(ghge)= columnnames
# change columns name of 'ghge' with the new names we set



   Country Code   Country Total Greenhouse Gass Emissions in kt Year
1            CA    Canada                            1040136.77 2004
2            CA    Canada                             951373.04 2003
3            CA    Canada                             908686.15 2002
4            CA    Canada                             718534.46 2001
5            CA    Canada                             760643.00 2000
6            MY  Malaysia                             216283.73 2004
7            MY  Malaysia                             195386.72 2003
8            MY  Malaysia                             199983.48 2002
9            MY  Malaysia                             176214.37 2001
10           MY  Malaysia                             171393.52 2000
11           SG Singapore                              44836.63 2004
12           SG Singapore                              43165.65 2003
13           SG Singapore                                    NA 2002
14           SG Singapore                              49851.52 2001
15           SG Singapore                                    NA 2000






Resources


For more details and information, refer documentation for WDI






R vs Excel: lookup and text functions

June 29, 2018
R Excel

Create correlation network plot in a quick way

June 8, 2018
R Visualization

Correlation and Covariance

June 2, 2018
R Regression
comments powered by Disqus