Galaxy of Thoughts

Mental journeys shuttling between utopia, dystopia and bordering paranoia.

15 May 2020



The main purpose of this particular article, if I may call it so, is a tweet that popped into my feed a few weeks ago about how the largest mobile service provider in Kenya sequentially labels its innovative mobile money receipts. The receipts usually come in SMS (Short Message Service) format: for every transaction you make, you receive an SMS with the receipt details, including the amount sent, the name and mobile number of the recipient, and the balance of your account.

According to the tweet, the initial two letters change every year, with tweeps even affirming the same. Coming from my various amusements in life, where data is king, I am interestingly among the very few people with backed-up MPesa SMSs dating back to 2012. I used to back up my SMS after a certain period, not because I believed in backing up; on the contrary, it was because I had my first Android "flagship" phone, pet-named "Ideos", manufactured by Huawei, model U8150 IDEOS. This gadget packed 512MB storage, 256MB RAM, a 3.2MP back camera and, wait for it, ran Android version 2.2 Froyo; we are now on version 10.0.

Now starts the fun part. Using my somewhat modest experience in data analysis, I embarked on the journey of retrieving all my old backed-up SMS. The best part is that they are saved as XML, thanks to the app I downloaded back in 2012, SMS Backup & Restore. Interestingly, it still runs on current versions of Android; talk of backward compatibility with native apps. Back then the logo was blue, not the current green.

Intended approach

Fair warning: I will be going full-on technical, so do bear with me if my lingo is not straightforward; the plus is that you should still be able to follow. FYI, you can skip the next parts and go straight to the conclusion.

After merging all the XML files and filtering out the MPesa messages, I proceeded to use regular expressions (regex) to pick the receipt number and the date the SMS was received. The receipt is in text format and includes other details such as the number you are sending to, the name of the recipient, the amount being sent or received, your MPesa account balance and additional promotional items.


Loading the libraries I will be using.

#loading necessary libraries

library("stringr") #String manipulation
library("tidyr") #Tidying data
library("dplyr") #data manipulation 
library("openxlsx") #Reading excel
library("xml2") #Reading xml and manipulating xml docs
library("purrr") #Functional mapping across vectors
library("lubridate") #date manipulation
library("ggplot2") #Grammar of graphics. 

I normally find myself loading both tidyr and dplyr because I end up using both freely. Truth be told, I end up forgetting which functions come from which package.

Reading XML

As mentioned, the backups of the texts are in XML format: 24 XML files running from October 2012 to Jan 2020, which we shall explore later.

I did some preprocessing when loading the texts; I will show you the steps I used but won't avail the data, since it's a dump of my personal messages.

All the XML files are in one folder, so we load the file names from the folder using list.files with the argument pattern = "\\.xml$" to ensure we only pick .xml files, and full.names = TRUE so the paths can later be passed straight to read_xml.

list_files <- list.files("../SMSBackupRestore/", pattern = "\\.xml$", full.names = TRUE)

Now that we have the list of file names, we need to load them into R. The challenge: reading each file and then merging them all into one, whilst ensuring that you optimize your memory usage.

I created a function func_xml_work to read and manipulate the XML code. A snippet of how the XML initially looks is below.

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<!--File Created By SMS Backup & Restore v10.06.102 on 16/01/2020 18:39:33-->
<smses count="2" backup_set="f24cc90b" backup_date="1579189173334">
  <sms protocol="0" address="MPESA" date="1506773148699" type="1" subject="null" body="LIU5GRAAHF confirmed.You bought KshXXX of airtime on 30/9/17 at 3:05 PM.New M-PESA balance is KshXX. Transaction cost, Ksh0.00." toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="1506773145000" sub_id="-1" readable_date="30 Sep 2017 15:05:48" contact_name="(Unknown)" />
  <sms protocol="0" address="MPESA" date="1506855198658" type="1" subject="null" body="LJ14HATBG6 Confirmed. KshXXX sent to XXXX on 1/10/17 at 1:53 PM. New M-PESA balance is KshXXX. Transaction cost, KshXXX." toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" date_sent="1506855181000" sub_id="-1" readable_date="1 Oct 2017 13:53:18" contact_name="(Unknown)" />
</smses>
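Before the function itself, a minimal sketch of how xml2 walks such a tree; the snippet below is a toy document with made-up attribute values, not the actual backup.

```r
library(xml2)

# A tiny stand-in for one backup file (fabricated values)
toy_xml <- paste0(
  '<smses count="1">',
  '<sms address="MPESA" body="AB12CD34EF Confirmed." ',
  'readable_date="1 Oct 2017 13:53:18" />',
  '</smses>'
)

doc   <- read_xml(toy_xml)
kids  <- xml_children(doc)   # one node per <sms> element
attrs <- xml_attrs(kids)     # list of named character vectors, one per node
attrs[[1]][["body"]]         # "AB12CD34EF Confirmed."
```

Each `<sms>` node's attributes come back as a named character vector, which is exactly what the function below stacks into a data frame.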

My initial implementation was filled with for-loops and was very slow, so I switched to using xml2. I am still cranky when using it, but fumbling around I was able to hook up the function.

  • Reading the XML using read_xml, kind of obvious.
  • Dropping down to the child nodes with xml_children; XML takes a similar approach to HTML, with a tree structure of sorts. To save myself the trouble I used xml_children to read them all at once, but if you know the name of the key you can use xml_child(doc, "keyname") to directly access a single key.
  • Picking the attributes from the child nodes with xml_attrs.
  • Since picking attributes creates a list, we convert it to a data frame and then transpose with t.
  • Notice I created an object len_size to count how many attributes each child has; the length with the highest count becomes the number of columns to keep.
func_xml_work <- function(i){
  xm_file <- xml2::read_xml(list_files[i])
  xm_children <- xml_children(xm_file)
  xm_list <- xml_attrs(xm_children)
  # tally attribute counts; the most common count is the number of columns
  len_size <- table(unlist(lapply(xm_list, length)))
  n_cols <- as.numeric(names(len_size)[which.max(len_size)])
  # transpose so each <sms> node becomes a row, then convert to a data frame
  dt <- as.data.frame(t(sapply(xm_list, function(x) x[1:n_cols])),
                      stringsAsFactors = FALSE)
  dt
}

Gymnastics! We now have a function that will read an XML file and convert it to a data frame. Next we need to combine it with the other XML files into one; that is where the party begins. You can opt either to:

  1. Read all the files into R, then merge them within R. The disadvantage: it is not prudent in terms of memory optimization.
  2. Read the files and merge immediately, merging continuously as you load. This helps with memory optimization.

I opted for the second option; the implementation is harder but the reward is better memory optimization. A second function, func_merge_dt, merges the XML files on the go as they are loaded into R, a simple procedure using bind_rows.

# map_dfr row-binds each file's data frame as it is read,
# so we never hold all the raw XML objects in memory at once
func_merge_dt <- function(len){
  dt_temp <- func_xml_work(len)
  bind_rows(dt_temp)
}

dt_comb <- map_dfr(1:length(list_files), func_merge_dt)

Data munging

Finally we have all the XMLs loaded and saved as one data frame. Let the games begin. We concentrate on the topic of the day, MPESA receipts. The good thing is there is a variable address which makes this easier: filter the data-set by MPESA.

I need to anonymise the receipts, leaving them with no personally identifying information: turn all names, amounts and phone numbers into XXX.

My preference is the stringr suite of functions because they make it easy to manipulate strings, and the function names are even easier to remember. I tend to use a lot of regex for the data extraction.

dt_mpesa <- dt_comb %>% 
  filter(address == "MPESA")

dt_mpesa <- dt_mpesa %>% 
  distinct(body, .keep_all = TRUE)

##Anonymising the data ----

dt_mpesa$body <- dt_mpesa$body %>% 
  str_replace("sent to ([:graph:]|[:space:]){0,} [:digit:]{0,} ", "sent to XXXX ") %>% 
  str_replace_all("Ksh([:digit:]|[:punctuation:]){0,}","KshXX.XX") %>% 
  str_replace("paid to ([:graph:]|[:space:]|[:digit:]){0,} on ", "paid to XXXX on ") %>% 
  str_replace_all("from\\s([:graph:]|[:space:]){0,}\\son", "from XXXX on") %>% 
  str_replace_all("from([:graph:]|[:space:]){0,}\\son", "from XXXX on") %>% 
  str_replace_all("kutoka\\s([:graph:]|[:space:]){0,}\\sMnamo  ", "kutoka XXXX Mnamo  ") %>% 
  str_replace_all("from\\s([:graph:]|[:space:]){0,}\\sNew\\sM", "from XXXX New M") %>% 
  str_replace("airtime for ([:digit:]){0,} on ", "airtime for XXXX on ") %>% 
  str_replace_all("cash\\sto\\s([:graph:]|[:space:]){0,}\\sNew\\sM", "cash to XXXX New M")
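As a sanity check, here is a miniature run of the first two substitutions above on a fabricated receipt body; the receipt number, name, phone number and amounts are all invented, and I write the punctuation class as [:punct:] here.

```r
library(stringr)

# A made-up receipt body, purely for illustration
fake_body <- paste(
  "AB12CD34EF Confirmed. Ksh500.00 sent to JANE DOE 0722000000",
  "on 1/10/17 at 1:53 PM. New M-PESA balance is Ksh1200.00."
)

anonymised <- fake_body %>%
  str_replace("sent to ([:graph:]|[:space:]){0,} [:digit:]{0,} ", "sent to XXXX ") %>%
  str_replace_all("Ksh([:digit:]|[:punct:]){0,}", "KshXX.XX")

anonymised
```

The recipient name and phone number collapse to XXXX and every amount to KshXX.XX, while the receipt number, date and time survive untouched.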

Exploratory data analysis

First we load the anonymized data, which I have made public. Then we get our hands dirty with the grease.

Reduce the variables to just what we need

dt_mpesa <- read.xlsx("mpesa_trans.xlsx")
#dt_mpesa <- read.xlsx("../Rtest/Mpesa/mpesa_trans.xlsx")


dt_mpesa <- dt_mpesa %>% 
  select(address, date, body, readable_date)

names(dt_mpesa)

## [1] "address"       "date"          "body"          "readable_date"

Check 5 random cases/rows in the data-set to get an idea of what we are working with. I will convert the data frame to a tibble; tibbles have a nice format when shown in a markdown document.

dt_mpesa <- as_tibble(dt_mpesa)
sample_n(dt_mpesa, 5)
address date body readable_date
MPESA 1557385899042 NE96SC32DI Confirmed.You have received KshXX.XX from XXXX on 9/5/19 at 10:09 AM New M-PESA balance is KshXX.XX Buy goods with M-PESA. 9 May 2019 10:11:39
MPESA 1375436232156 DU02LP277 confirmed. You bought KshXX.XX of airtime on 2/8/13 at 12:36 PM New M-PESA balance is KshXX.XXSafaricom only calls you from 0722000000 Aug 2, 2013 12:37:12 PM
MPESA 1408895242014 FO09EZ563 Confirmed. KshXX.XX sent to XXXX on 24/8/14 at 6:32 PM New M-PESA balance is KshXX.XXPIN YAKO SIRI YAKO Aug 24, 2014 6:47:22 PM
MPESA 1387262022488 EJ46EC505 Confirmed. on 17/12/13 at 9:33 AM Withdraw KshXX.XX from 287136 - Wabcom Technologies Tuskys supermarkets Imara New M-PESA balance is KshXX.XXUpdate menu & activate Mshwari Dec 17, 2013 9:33:42 AM
MPESA 1414510337925 FW69GJ095 confirmed. You bought KshXX.XX of airtime for XXXX on 28/10/14 at 6:11 PM New M-PESA balance is KshXX.XX Oct 28, 2014 6:32:17 PM

Some data mining is needed to extract values from the string in body. We just need the right tool for this, and for me it is stringr. What we are aiming for:

  1. Pick date
  2. Pick receipt number

The stringr function str_extract will be our favorite out-of-the-box tool for extracting the values. After that, we shall filter the data down to only actual receipt numbers.

dt_mpesa["receipt_no"] <- dt_mpesa$body %>% str_extract("\\w+")
dt_mpesa["receipt_dig2"] <- dt_mpesa$receipt_no %>% str_extract("[A-Z]{0,2}")
dt_mpesa["act_date"] <- dt_mpesa$body %>% str_extract("[0-9]{1,2}\\/[0-9]{1,2}\\/[0-9]{1,2}")
dt_mpesa["year"] <- dt_mpesa$readable_date %>% str_extract("[0-9][0-9][0-9][0-9]") 

dt_mpesa$act_date <- dt_mpesa$act_date %>% dmy()
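As a quick illustration of what these extractions return, here they are run on a made-up body string rather than the real data:

```r
library(stringr)
library(lubridate)

# Fabricated receipt body for illustration only
fake_body <- "AB12CD34EF Confirmed. KshXX.XX sent to XXXX on 1/10/17 at 1:53 PM."

receipt_no   <- str_extract(fake_body, "\\w+")          # first word = receipt number
receipt_dig2 <- str_extract(receipt_no, "[A-Z]{0,2}")   # leading letter pair
act_date     <- dmy(str_extract(fake_body, "[0-9]{1,2}\\/[0-9]{1,2}\\/[0-9]{1,2}"))

receipt_no    # "AB12CD34EF"
receipt_dig2  # "AB"
act_date      # 2017-10-01
```

Note that \\w+ grabs the first word of the body, which is why rows starting with words like "Failed" or "You" have to be filtered out afterwards.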

dt_mpesa <- dt_mpesa %>% 
  filter(!(receipt_no %in% c("Failed","Transaction","You","Your","To","BRIAN","An")))

dt_mpesa <- dt_mpesa[nchar(dt_mpesa$receipt_dig2) == 2,]
dt_mpesa["receipt_dig1"] <- substr(dt_mpesa$receipt_dig2,1,1)

sample_n(dt_mpesa, 5)
address date body readable_date receipt_no receipt_dig2 act_date year receipt_dig1
MPESA 1552479093999 NCD4HB4W1E confirmed.You bought KshXX.XX of airtime for XXXX on 13/3/19 at 3:11 PM.New balance is KshXX.XX Transaction cost, KshXX.XX 13 Mar 2019 15:11:33 NCD4HB4W1E NC 2019-03-13 2019 N
MPESA 1412513605551 FT56UF181 Confirmed. You have transferred KshXX.XX from XXXX on 5/10/14 at 3:37 PM. M-Shwari balance is KshXX.XX M-PESA balance is KshXX.XX Oct 5, 2014 3:53:25 PM FT56UF181 FT 2014-10-05 2014 F
MPESA 1575654263655 NL65HF7NIF Confirmed. KshXX.XX sent to XXXX on 6/12/19 at 8:44 PM New M-PESA balance is KshXX.XX Transaction cost, KshXX.XX 6 Dec 2019 20:44:23 NL65HF7NIF NL 2019-12-06 2019 N
MPESA 1423048568124 GK61TK815 Confirmed. You have received KshXX.XX from XXXX on 4/2/15 at 1:59 PM New M-PESA balance is KshXX.XXPay Bills via M-PESA Feb 4, 2015 2:16:08 PM GK61TK815 GK 2015-02-04 2015 G
MPESA 1377091072627 DW15MP435 confirmed. You bought KshXX.XX of airtime on 21/8/13 at 4:17 PM New M-PESA balance is KshXX.XXSafaricom only calls you from 0722000000 Aug 21, 2013 4:17:52 PM DW15MP435 DW 2013-08-21 2013 D

Now we have our data arranged in the formats we would like. Let us see what we can deduce from the data.

count(dt_mpesa, year)
year n
2012 59
2013 371
2014 559
2015 193
2017 128
2018 340
2019 409
2020 7
ggplot(dt_mpesa, aes(x = year)) +  
  geom_bar(colour = "green") +
    geom_text(aes(label = ..count..), stat = "count",
            vjust = 1.5, colour = "white") + 
  ylab("Number of transactions") + 
  ggtitle("Frequency of Transactions", "Oct 2012 - Jan 2020")

The first letter is the year. Let us test that.

dt_mpesa %>% 
  group_by(year) %>% 
  count(first_letter = receipt_dig1)
year first_letter n
2012 C 59
2013 C 4
2013 D 234
2013 E 133
2014 E 118
2014 F 372
2014 G 69
2015 G 154
2015 J 39
2017 L 128
2018 M 340
2019 N 409
2020 O 7
dt_mpesa <- dt_mpesa %>% ungroup()

ggplot(dt_mpesa, aes(x = year, y = receipt_dig1)) +
  geom_segment(aes(yend = receipt_dig1), xend = 0, colour = "grey50") + 
  geom_point(colour = "green", size = 5) + 
  theme_bw() + 
  theme(
    panel.grid.major.y = element_blank()
  ) + 
  ylab("First Letter in Receipt Number") + 
  ggtitle("Year vs First Letter","Which first letters are in every year")

The second letter stands for the month. Is that the case? Let us see.

dt_mpesa %>% 
  group_by(year,month = month(act_date)) %>% 
  count(two_digit = receipt_dig2) %>% 
  as_tibble() %>% 
  sample_n(5)
year month two_digit n
2013 10 EB 6
2013 7 DQ 5
2018 8 MH 29
2014 10 FS 1
2013 12 EK 6
dt_mpesa <- dt_mpesa %>% ungroup()

Work in progress