Content
Last week Saturday, I attended an Excel workshop which is organised by my course society at my campus. On that day, I have learnt some basic techniques and concepts in Microsoft Excel. Overall, the learning topic of the day can be categorized into three sub-sections which are lookup functions, logical functions and text functions. It’s quite useful, direct and this inspires me to write this post to compare some of the lookup and text functions in Microsoft Excel with R. For the rest of programming languages, it will be updated in following posts soon and my comparison here is more on how do we actually make use of those functions in Excel & R so that we can achieve the desired result and solve our problems.
So, what actually a lookup function is?
In short, lookup function is a designed function and used when we want to fill in data in the corresponding row based on the column of data searched by the function.
Let say you have a catalogue on hand like:
Goods | Price |
---|---|
Apple | 0.99 |
Lemon | 1.20 |
Milk | 5.50 |
Tomato | 0.89 |
Sugar | 4.50 |
Salt | 5.65 |
Oil | 10.80 |
And now, imagine your customer needs only apple, oil and milk and he or she wants to know the price of each item.
Goods | Price |
---|---|
Apple | |
Oil | |
Milk |
Yeah, you can just search the price of the goods and fill in the blank accordingly since the catalogue consists only a few of items but what if your catalogue consists of 100 or even 1000 of goods?
In this time, we can make use of lookup functions in R to complete our task.
merge function is an inner_join which joins two data frames together based on the matched key. It returns us only the matched item with its value in alphabetical order. While outer_join like left_join and right_join returns us the matched item with its value and unmatched item with ‘NA’ value. The usage of merge function is indicated as below:
merge(dataframe1, dataframe2, by="columnname")
# specify a key in 'by' arugument
match function tells us the matched item for two vectors on which position but not the value of it. This function is another way to solve the issue faced above in few lines of code.
For an example, I have a list of students’ score for a mid-term test:
Score
# display students' score in output
Student | Score |
---|---|
Kalvin | 78.0 |
David | 56.0 |
Michael | 45.5 |
May | 89.0 |
June | 34.0 |
Taylor | 67.0 |
John | 88.0 |
Massive | 56.0 |
Steven | 58.0 |
Now, I have some of them in another list in random order:
Score1
# display few selected students in output
Student | Score |
---|---|
Steven | |
John | |
Kalvin | |
June | |
May | |
Massive |
merge_data= merge(Score, Score1[,1], by="Student")
# merge two data frames by 'Student'
merge_data # display result
Student Score
1 John 88
2 June 34
3 Kalvin 78
4 Massive 56
5 May 89
6 Steven 58
Above indicated that two data frames are joined together based on the key, “Student”. In other words, merge function fill in the score of few selected students in a new table based on the original data we have. Noted that, a data frame is a data structure which consists of rows and columns and just like a table. Next, if you notice carefully, you will found that all the students’ name are arranged in alphabetical order and this is what I mentioned previously.
Next, suppose I have a group of people with different age in a data frame named people_age.
Name Age
1 Abu 34
2 Ali 56
3 Ahmad 21
4 Mohammad 23
5 Muthu 15
and another data frame named people_without_age.
Name
1 Abu
2 Muthu
3 Mohammad
4 Mohammad
5 Mohammad
6 Abu
7 Abu
8 Ahmad
9 Mohammad
10 Muthu
11 Ali
12 Mohammad
13 Muthu
14 Abu
15 Mohammad
16 Muthu
17 Ahmad
18 Ahmad
19 Ahmad
20 Abu
To use match function to fill in age for 20 people in the second table, first, you will need to open a column for the table.
After that, match the name of people from second data frame named people_without_age with the first one, people_age which has the age information.
data$column_name[]
# data is your name of data frame
# '$' specifies which column from data frame
From this example, we are going to take all the values for different age of people.
people_age$Age[]
[1] 34 56 21 23 15
people_without_age$Age= people_age$Age[match(people_without_age$Name, people_age$Name)]
Or make use of ‘with’ function to make our life easier.
people_without_age$Age= with(people_age, Age[match(people_without_age$Name, Name)])
Result:
people_without_age
Name Age
1 Abu 34
2 Muthu 15
3 Mohammad 23
4 Mohammad 23
5 Mohammad 23
6 Abu 34
7 Abu 34
8 Ahmad 21
9 Mohammad 23
10 Muthu 15
11 Ali 56
12 Mohammad 23
13 Muthu 15
14 Abu 34
15 Mohammad 23
16 Muthu 15
17 Ahmad 21
18 Ahmad 21
19 Ahmad 21
20 Abu 34
Click here to jump to Lookup function in Excel.
Okay, text functions. Suppose you have two strings, now you wish to combine them into one word, or extract some of words from the term, we use text functions to do it.
To combine multiple strings into one in R, use paste( ) funcion.
paste(string1, string2, sep="??")
# 'sep' argument specifies two strings separate by what character or symbol you set.
My first name is Teoh and my last name is Kai Wen. To combine these two strings by empty space, I will do:
paste("Teoh", "Kai Wen", sep=" ")
[1] "Teoh Kai Wen"
# left a space in 'sep' argument give you an empty space between 2 strings.
You might be interested in extracting some characters which start from left, right or middle of a word in text analysis. So, the substr function in the base library of R is an option for you to print the number of characters start from left to right in the console.
There is another option for you to perform the same task and get the job done. stringR is an additionally R package you can download and install in Rstudio.
You might curious if the base library of R offers substr function which can get your job done, why do we still need to download and install additional R package? It’s ok, I answer these later.
substr(x, start, stop)
substr( ) is a function for us to extract some characters from a string. For an instance, my name is Teoh Kai Wen and I want to extract my first name from my full name.
substr("Teoh Kai Wen", 1, 4)
[1] "Teoh"
# start at 1, end at 4
What if I want to extract my middle name from full name?
substr("Teoh Kai Wen", 6, 9)
[1] "Kai "
# start at 6, because empty space do occupy space
substring( ) function is slightly different from substr( ) function. Recall substr, we specify the start and stop number for a string. But for substring,
substring(x, start, last)
# by default, the number of last character to be returned is 1 million.
For an example, I want to print my name start from middle name to last name (Kai Wen) from full name (Teoh Kai Wen), I do:
substring("Teoh Kai Wen", 5)
[1] " Kai Wen"
# start from 5
And the returned result will be started from the starting number I specify until the end of string (by deafult until 1000000)
By default, substr( ) from the base library of R print the characters started from left to right. But what if we want to print the characters start from right to left of a string?
Yes, stringR is the answer on why do we need to install additional R package and it allows us to print the number of characters start from right to left of a string in the console. Of course, this is not the only extension function that it has, you can have more details in its official documentation.
I can use traditional substr to print my last name by specifying the start and the end number in the function.
substr("Teoh Kai Wen",10, 13 )
[1] "Wen"
But it’s much more convenient for us to code in this way.
library(stringr)
# load R package
str_sub("Teoh Kai Wen", -3, -1)
[1] "Wen"
# -ve indicate start from right to left
# print the first to third character from right
The following next section is fun, interesting and I would like to talk more about it.
If you are an R user, you knew that sometimes import your dataset for an example, in Excel format(.xlsx) into R will results some annoying characters appear on the header of a data frame.
This is a dataset about products sales (in unit) for the year from 2016 to 2018. But when I save this in .xlsx format and import this dataset into Rstudio, my output shown in Rstudio is looked like:.
Product X2016 X2017 X2018
1 Keyboard 400 200 540
2 Mouse 300 400 600
3 Monitor 303 630 270
If you notice carefully, you will found that my header is slightly different from the dataset in Excel (.xlsx) format. To change and remove extra annoying characters, ‘X’ in every header of dataset in R, we make use of gsub or sub function.
A gsub function and sub function are the same except gsub function replace all the occurence of match characters you specify to the character you set. However, a sub function makes change only the first occurence of match character to the character you set.
gsub(pattern_to_replace, replacement, data)
# replace all match characters
sub(pattern_to_replace, replacement, data)
# replace only the first occurence of match character
It’s ok you confuse what I’m talking right now, but you will understand soon.
Consider the previous dataset which I modify a little bit (all the years put in a column named Year).
Product Year Unit_Sold
1 Keyboard X2016 400
2 Mouse X2016 300
3 Monitor X2016 303
4 Keyboard X2017 200
5 Mouse X2017 400
6 Monitor X2017 630
7 Keyboard X2018 540
8 Mouse X2018 600
9 Monitor X2018 270
To replace all ‘X’ into empty spacev(remove ‘x’ character), do:
data1$Year= gsub("X"," ", data1$Year)
# replace all 'X' with white space
data1
Product Year Unit_Sold
1 Keyboard 2016 400
2 Mouse 2016 300
3 Monitor 2016 303
4 Keyboard 2017 200
5 Mouse 2017 400
6 Monitor 2017 630
7 Keyboard 2018 540
8 Mouse 2018 600
9 Monitor 2018 270
# display output
Then What about a sub function?
Consider we have a string here: “can you put some candy inside the can?”
sent=c("can you put some candy inside the can?")
# store a string inside 'sent'
sub("c", "D", sent)
[1] "Dan you put some candy inside the can?"
# substitute the first encountered 'c' into D
And again, if we use gsub,
sent=c("can you put some candy inside the can?")
# store a string inside 'sent'
gsub("c", "D", sent)
[1] "Dan you put some Dandy inside the Dan?"
# substitute the all 'c' into D
Noted that both gsub and sub function are case sensitive, for an example, e\(\ne\) E. And both of these functions are useful especially when we perform web scraping to get some data from web pages as you need to remove unnecessary, insignificant symbols before to do your further analysis in R. It takes time to learn and explore as sometimes we are not only deal with the extra characters like the previous example. But also some sensitive symbols like parentheses, white space and combine all of them is much more complicated in replacing of them. For an example, you can try to replace the string (I) (2015) into 2015 by using the ways I showed above.
Click here to jump to Text functions in Excel.
Compare to R, operations in Excel is much more simple and convenient. Recall the example of mid-term score in the previous section.
In Excel, there is a lookup function called VLOOKUP. The purpose of using this function is to search for the data we have and fill in the values (in this case refer to mid-term score) for the chosen keyword (in this case refer to student).
To fill in the mid-term score of a selected student in a new column, we do:
To fill in the mid-term score of all selected students, we drag the Score column until row number 7. But note that, we have to lock the data (table) from A2 to B10 because when we drag the column down, the position of data it sets in every row will be changed accordingly. To lock the table, we highlight the position of table (in this case is from A2 to B10) and add dollar sign, ‘$’ or just simply press the shotcut key, f4.
Drag along the column, you get this:
To join two strings together, Excel came with the function CONCATENATE. And the syntax of this function is shown as below:
Both of the strings in A2 and B2 are joined together with a space between them in forming a full name. Besides white space, " “, you can also set the other type of symbols and characters in between two strings or just leave it blank in VLOOKUP function.
To join multiple strings together, we can just insert the strings accordingly in the function.
For an example:
=VLOOKUP(text1, text2, text3…)
And now when I drag down the column, all the first name are joined together with last name with a space between them.
To print the characters from a string or sentence started from left to right or right to left, we make use of these functions.
The syntax of left , right and mid function are shown as below:
Let say we’re interested in printing the first name of ‘John Teoh’ which is John, what we can do is:
For the right function, it prints the characters of a string started from right to left.
And the syntax of this function is:
Let say we want to print the four characters started from right to left for the name Vincent:
For mid funtion, we can define the starting and ending number in printing the characters of a string.
The syntax of this function is slightly different from the left and right function and is shown as below:
For an instance, we can print the characters of a string started from the middle to somewhere we specify:
Nothing special and I did talk a lot about it in previous section. In Excel, the same output can be produced just like in R.
The syntax of this function can be wrote in this way:
=SUBSTITUTE(x,“pattern”,“replace”)
If you want to substitute the only first occurence (by default, all matched pattern are removed) just like a sub() do:
=SUBSTITUTE(x, "pattern", "replace", 1)
# remove the first encountered pattern
I provide an example in Excel. Consider we have a sentence,
All the ‘a’ characters are replaced with ‘x’.
If we want to replace only the first ‘a’ character with ‘x’, try:
Note that in R, single quote, ‘’ are same with double quote, " “. However, in Excel, you only can insert double quote,” " for arguments but not single quote,’ ‘. Because I’m the one who prefer single quote more than double quote, that’ why I mention here =.= .