R vs Excel: lookup and text functions

June 29, 2018
R Excel

Background


Last week Saturday, I attended an Excel workshop which is organised by my course society at my campus. On that day, I have learnt some basic techniques and concepts in Microsoft Excel. Overall, the learning topic of the day can be categorized into three sub-sections which are lookup functions, logical functions and text functions. It’s quite useful, direct and this inspires me to write this post to compare some of the lookup and text functions in Microsoft Excel with R. For the rest of programming languages, it will be updated in following posts soon and my comparison here is more on how do we actually make use of those functions in Excel & R so that we can achieve the desired result and solve our problems.







Lookup function in R


So, what actually a lookup function is?

In short, lookup function is a designed function and used when we want to fill in data in the corresponding row based on the column of data searched by the function.

Let say you have a catalogue on hand like:


Goods Price
Apple 0.99
Lemon 1.20
Milk 5.50
Tomato 0.89
Sugar 4.50
Salt 5.65
Oil 10.80



And now, imagine your customer needs only apple, oil and milk and he or she wants to know the price of each item.


Goods Price
Apple
Oil
Milk



Yeah, you can just search the price of the goods and fill in the blank accordingly since the catalogue consists only a few of items but what if your catalogue consists of 100 or even 1000 of goods?

In this time, we can make use of lookup functions in R to complete our task.






How


Method 1:


merge function is an inner_join which joins two data frames together based on the matched key. It returns us only the matched item with its value in alphabetical order. While outer_join like left_join and right_join returns us the matched item with its value and unmatched item with ‘NA’ value. The usage of merge function is indicated as below:


merge(dataframe1, dataframe2, by="columnname")
# specify a key in 'by' arugument



Method 2:


match function tells us the matched item for two vectors on which position but not the value of it. This function is another way to solve the issue faced above in few lines of code.






Example 1


For an example, I have a list of students’ score for a mid-term test:


Score
# display students' score in output


Student Score
Kalvin 78.0
David 56.0
Michael 45.5
May 89.0
June 34.0
Taylor 67.0
John 88.0
Massive 56.0
Steven 58.0


Now, I have some of them in another list in random order:


Score1
# display few selected students in output



Student Score
Steven
John
Kalvin
June
May
Massive








merge_data= merge(Score, Score1[,1], by="Student")
# merge two data frames by 'Student'



merge_data                     # display result
  Student Score
1    John    88
2    June    34
3  Kalvin    78
4 Massive    56
5     May    89
6  Steven    58


Above indicated that two data frames are joined together based on the key, “Student”. In other words, merge function fill in the score of few selected students in a new table based on the original data we have. Noted that, a data frame is a data structure which consists of rows and columns and just like a table. Next, if you notice carefully, you will found that all the students’ name are arranged in alphabetical order and this is what I mentioned previously.






Example2:


Next, suppose I have a group of people with different age in a data frame named people_age.


      Name Age
1      Abu  34
2      Ali  56
3    Ahmad  21
4 Mohammad  23
5    Muthu  15



and another data frame named people_without_age.


       Name
1       Abu
2     Muthu
3  Mohammad
4  Mohammad
5  Mohammad
6       Abu
7       Abu
8     Ahmad
9  Mohammad
10    Muthu
11      Ali
12 Mohammad
13    Muthu
14      Abu
15 Mohammad
16    Muthu
17    Ahmad
18    Ahmad
19    Ahmad
20      Abu



  1. To use match function to fill in age for 20 people in the second table, first, you will need to open a column for the table.

    • people_without_age$Age


  1. After that, match the name of people from second data frame named people_without_age with the first one, people_age which has the age information.

    • match( people_without_age$name, people_age$name )


  1. Normally, if we want to have the values of a particular column in data frames, we use syntax like:


data$column_name[]
# data is your name of data frame
# '$' specifies which column from data frame


From this example, we are going to take all the values for different age of people.



people_age$Age[]
[1] 34 56 21 23 15



  1. Combine all above in a line of code. Interpret as we take the values of Age from first data frame for only the matched items based on the Name variable in two data frames. After that, store in a new opened column named Age inside the data frame named people_without_age.


people_without_age$Age= people_age$Age[match(people_without_age$Name, people_age$Name)]


Or make use of ‘with’ function to make our life easier.


people_without_age$Age= with(people_age, Age[match(people_without_age$Name, Name)])



Result:


people_without_age
       Name Age
1       Abu  34
2     Muthu  15
3  Mohammad  23
4  Mohammad  23
5  Mohammad  23
6       Abu  34
7       Abu  34
8     Ahmad  21
9  Mohammad  23
10    Muthu  15
11      Ali  56
12 Mohammad  23
13    Muthu  15
14      Abu  34
15 Mohammad  23
16    Muthu  15
17    Ahmad  21
18    Ahmad  21
19    Ahmad  21
20      Abu  34





Click here to jump to Lookup function in Excel.







Text Functions in R


Okay, text functions. Suppose you have two strings, now you wish to combine them into one word, or extract some of words from the term, we use text functions to do it.






Concatenate


To combine multiple strings into one in R, use paste( ) funcion.


paste(string1, string2, sep="??")
# 'sep' argument specifies two strings separate by what character or symbol you set.






Example


My first name is Teoh and my last name is Kai Wen. To combine these two strings by empty space, I will do:



paste("Teoh", "Kai Wen", sep=" ")
[1] "Teoh Kai Wen"
# left a space in 'sep' argument give you an empty space between 2 strings.






Substr and Substring


You might be interested in extracting some characters which start from left, right or middle of a word in text analysis. So, the substr function in the base library of R is an option for you to print the number of characters start from left to right in the console.

There is another option for you to perform the same task and get the job done. stringR is an additionally R package you can download and install in Rstudio.

You might curious if the base library of R offers substr function which can get your job done, why do we still need to download and install additional R package? It’s ok, I answer these later.






Example 1



substr( )


substr(x, start, stop)


substr( ) is a function for us to extract some characters from a string. For an instance, my name is Teoh Kai Wen and I want to extract my first name from my full name.



substr("Teoh Kai Wen", 1, 4)
[1] "Teoh"
# start at 1, end at 4 


What if I want to extract my middle name from full name?



substr("Teoh Kai Wen", 6, 9)
[1] "Kai "
# start at 6, because empty space do occupy space





substring( )


substring( ) function is slightly different from substr( ) function. Recall substr, we specify the start and stop number for a string. But for substring,


substring(x, start, last)
# by default, the number of last character to be returned is 1 million.


For an example, I want to print my name start from middle name to last name (Kai Wen) from full name (Teoh Kai Wen), I do:



substring("Teoh Kai Wen", 5)
[1] " Kai Wen"
# start from 5 


And the returned result will be started from the starting number I specify until the end of string (by deafult until 1000000)






Example2


str_sub( )


By default, substr( ) from the base library of R print the characters started from left to right. But what if we want to print the characters start from right to left of a string?

Yes, stringR is the answer on why do we need to install additional R package and it allows us to print the number of characters start from right to left of a string in the console. Of course, this is not the only extension function that it has, you can have more details in its official documentation.



I can use traditional substr to print my last name by specifying the start and the end number in the function.



substr("Teoh Kai Wen",10, 13 )
[1] "Wen"


But it’s much more convenient for us to code in this way.



library(stringr)
# load R package

str_sub("Teoh Kai Wen", -3, -1)
[1] "Wen"
# -ve indicate start from right to left
# print the first to third character from right 






Substitution


The following next section is fun, interesting and I would like to talk more about it.

If you are an R user, you knew that sometimes import your dataset for an example, in Excel format(.xlsx) into R will results some annoying characters appear on the header of a data frame.





This is a dataset about products sales (in unit) for the year from 2016 to 2018. But when I save this in .xlsx format and import this dataset into Rstudio, my output shown in Rstudio is looked like:.


   Product X2016 X2017 X2018
1 Keyboard   400   200   540
2    Mouse   300   400   600
3  Monitor   303   630   270



If you notice carefully, you will found that my header is slightly different from the dataset in Excel (.xlsx) format. To change and remove extra annoying characters, ‘X’ in every header of dataset in R, we make use of gsub or sub function.






gsub( ) and sub( )


A gsub function and sub function are the same except gsub function replace all the occurence of match characters you specify to the character you set. However, a sub function makes change only the first occurence of match character to the character you set.


gsub(pattern_to_replace, replacement, data)
# replace all match characters

sub(pattern_to_replace, replacement, data)
# replace only the first occurence of match character


It’s ok you confuse what I’m talking right now, but you will understand soon.






Example 1


Consider the previous dataset which I modify a little bit (all the years put in a column named Year).



   Product  Year Unit_Sold
1 Keyboard X2016       400
2    Mouse X2016       300
3  Monitor X2016       303
4 Keyboard X2017       200
5    Mouse X2017       400
6  Monitor X2017       630
7 Keyboard X2018       540
8    Mouse X2018       600
9  Monitor X2018       270



To replace all ‘X’ into empty spacev(remove ‘x’ character), do:



data1$Year= gsub("X"," ", data1$Year)
# replace all 'X' with white space

data1
   Product  Year Unit_Sold
1 Keyboard  2016       400
2    Mouse  2016       300
3  Monitor  2016       303
4 Keyboard  2017       200
5    Mouse  2017       400
6  Monitor  2017       630
7 Keyboard  2018       540
8    Mouse  2018       600
9  Monitor  2018       270
# display output






Example 2

Then What about a sub function?


Consider we have a string here: “can you put some candy inside the can?”



sent=c("can you put some candy inside the can?")
# store a string inside 'sent'

sub("c", "D", sent)
[1] "Dan you put some candy inside the can?"
# substitute the first encountered 'c' into D


And again, if we use gsub,



sent=c("can you put some candy inside the can?")
# store a string inside 'sent'

gsub("c", "D", sent)
[1] "Dan you put some Dandy inside the Dan?"
# substitute the all 'c' into D




Noted that both gsub and sub function are case sensitive, for an example, e\(\ne\) E. And both of these functions are useful especially when we perform web scraping to get some data from web pages as you need to remove unnecessary, insignificant symbols before to do your further analysis in R. It takes time to learn and explore as sometimes we are not only deal with the extra characters like the previous example. But also some sensitive symbols like parentheses, white space and combine all of them is much more complicated in replacing of them. For an example, you can try to replace the string (I) (2015) into 2015 by using the ways I showed above.





Click here to jump to Text functions in Excel.







Lookup functions in Excel


Compare to R, operations in Excel is much more simple and convenient. Recall the example of mid-term score in the previous section.


In Excel, there is a lookup function called VLOOKUP. The purpose of using this function is to search for the data we have and fill in the values (in this case refer to mid-term score) for the chosen keyword (in this case refer to student).





To fill in the mid-term score of a selected student in a new column, we do:





To fill in the mid-term score of all selected students, we drag the Score column until row number 7. But note that, we have to lock the data (table) from A2 to B10 because when we drag the column down, the position of data it sets in every row will be changed accordingly. To lock the table, we highlight the position of table (in this case is from A2 to B10) and add dollar sign, ‘$’ or just simply press the shotcut key, f4.





Drag along the column, you get this:









Text functions in Excel






Concatenate


To join two strings together, Excel came with the function CONCATENATE. And the syntax of this function is shown as below:




Both of the strings in A2 and B2 are joined together with a space between them in forming a full name. Besides white space, " “, you can also set the other type of symbols and characters in between two strings or just leave it blank in VLOOKUP function.





To join multiple strings together, we can just insert the strings accordingly in the function.

For an example:

=VLOOKUP(text1, text2, text3…)


And now when I drag down the column, all the first name are joined together with last name with a space between them.








Left, Right and Mid function


To print the characters from a string or sentence started from left to right or right to left, we make use of these functions.



The syntax of left , right and mid function are shown as below:



Left function

Left function



Let say we’re interested in printing the first name of ‘John Teoh’ which is John, what we can do is:







For the right function, it prints the characters of a string started from right to left.



And the syntax of this function is:



Right function

Right function



Let say we want to print the four characters started from right to left for the name Vincent:







For mid funtion, we can define the starting and ending number in printing the characters of a string.



The syntax of this function is slightly different from the left and right function and is shown as below:



Mid function

Mid function



For an instance, we can print the characters of a string started from the middle to somewhere we specify:








Substitution


Nothing special and I did talk a lot about it in previous section. In Excel, the same output can be produced just like in R.



The syntax of this function can be wrote in this way:


=SUBSTITUTE(x,“pattern”,“replace”)


If you want to substitute the only first occurence (by default, all matched pattern are removed) just like a sub() do:


=SUBSTITUTE(x, "pattern", "replace", 1)
# remove the first encountered pattern



I provide an example in Excel. Consider we have a sentence,





All the ‘a’ characters are replaced with ‘x’.

If we want to replace only the first ‘a’ character with ‘x’, try:







Note that in R, single quote, ‘’ are same with double quote, " “. However, in Excel, you only can insert double quote,” " for arguments but not single quote,’ ‘. Because I’m the one who prefer single quote more than double quote, that’ why I mention here =.= .







Summary








Create correlation network plot in a quick way

June 8, 2018
R Visualization

Correlation and Covariance

June 2, 2018
R Regression

A hint on how to read World Bank data into R

May 19, 2018
R Import
comments powered by Disqus