Recently I had to analyse a file with 3.5 million rows of data. This becomes challenging because a file that size cannot be opened in Excel or even Notepad. The first task was to check that we had the right amount of data: we knew we should have 3455385 rows. Before writing some C# utilities or importing it into SQL I decided to try R.
R is a statistics language and framework that is very popular in academia and over the last 7 years has increasingly been used in industry. At Build last year it was announced that R would become part of SQL Server 2016, and R is also being used for Machine Learning in Azure. So it looks worth a closer look.
So the first task was to determine the total number of rows. This can be done as below:
> frame2 <- read.csv("C:\\DAT203xLabfiles\\Jba_FL_USD.txt", header = TRUE, stringsAsFactors = FALSE)
We load a data frame using read.csv and assign it to the variable frame2. This takes a couple of minutes. A data frame is like a table; internally it is a list of vectors, and each vector can have a different data type. The stringsAsFactors = FALSE stops the default behaviour of reading strings as a Factor (an enumeration in C#). The total number of rows can then be found using nrow.
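As a tiny, self-contained illustration (using a small made-up data frame rather than the real 3.5M-row file), nrow simply counts the rows:

```r
# Small stand-in for frame2; the real one comes from read.csv above
frame2_demo <- data.frame(Country = c("ENG", "WAL", "XXX"),
                          TIV1    = c(100, 200, 300),
                          stringsAsFactors = FALSE)

# nrow() returns the number of rows in the data frame; on the real
# file this should match the expected 3455385 if nothing was lost
nrow(frame2_demo)  # 3
```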
To see the data we can use the head function to get the first 5 rows of data. If frame2 were a SQL table, this would be select top 5 * from frame2.
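A minimal sketch with a made-up frame (the column names here are assumptions, not taken from the real file):

```r
# head() returns the first n rows of a data frame (n defaults to 6)
frame2_demo <- data.frame(Country = c("ENG", "WAL", "SCO", "NIR", "XXX", "ENG", "WAL"),
                          TIV1    = 1:7,
                          stringsAsFactors = FALSE)

head(frame2_demo, 5)  # roughly: select top 5 * from frame2_demo
```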
In the case of 3.5 million rows the file was still small enough to fit into memory. Apparently there are ways of handling truly huge files, aka big data. In any case my data is small enough to be handled in memory, so I don't have to worry about that yet.
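For files that genuinely do not fit in memory, one common approach is to process the file in chunks. Here is a hypothetical sketch using read.csv's skip and nrows arguments (count_rows_chunked is my own helper, not a standard function; packages such as data.table also offer much faster readers):

```r
# Count the rows of a large CSV without loading it all at once.
# chunk_size rows are read per pass; skip advances past rows already seen.
count_rows_chunked <- function(path, chunk_size = 500000) {
  total <- 0
  skip  <- 1  # skip the header line on the first pass
  repeat {
    chunk <- tryCatch(
      read.csv(path, header = FALSE, skip = skip, nrows = chunk_size,
               stringsAsFactors = FALSE),
      error = function(e) NULL)  # read.csv errors once no lines remain
    if (is.null(chunk) || nrow(chunk) == 0) break
    total <- total + nrow(chunk)
    skip  <- skip + nrow(chunk)
  }
  total
}
```

This would let you verify the 3455385 count even on a machine with little RAM, at the cost of re-scanning the file from the start for each chunk (read.csv's skip is not free).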
Now we would like to query this data to learn more about its data quality.
We can see how many different countries are involved by using unique:
> unique(frame2$Country)
 "ENG" "WAL" "SCO" "NIR" "XXX"
The corresponding SQL would be select distinct Country from frame2.
So there are some lines with unknown countries.
> dataUK <- frame2[which(frame2$Country!="XXX"),]
> nrow(frame2) - nrow(dataUK)
Next we notice that not all rows have Lat/Long data:
> dataUK.LatLong <- dataUK[which(dataUK$Lat!=0),]
> nrow(dataUK) - nrow(dataUK.LatLong)
We can then total the sum insured (TIV1) over the remaining rows:
> sumtiv <- sum(dataUK.LatLong$TIV1)
There is a lot more data analysis to be done, such as looking at distributions of sums insured for different lines of business. Some data has also been geocoded, which can be analysed at various levels.
I would like to visualize this data. Here is a way of doing this using Google Earth KMZ files. Here's my first attempt at visualizing 3.5M locations; not surprisingly, it did not work.
> library(sp)      # coordinates(), CRS(), proj4string()
> library(plotKML) # plotKML()
> colors <- colorRampPalette(c("white", "lavenderblush", "pink", "yellow", "orange", "red", "brown", "forestgreen", "green", "cyan", "blue", "black"))(12)
> SIplot <- data.frame(latitude = dataUK.LatLong$Lat, longitude = dataUK.LatLong$Long, SI = dataUK.LatLong$TIV1)
> coordinates(SIplot) <- ~longitude + latitude
> crs.geo <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
> proj4string(SIplot) <- crs.geo
> plotKML(SIplot, colour_scale = colors, altitude = 50, points_names = "")
Plotting the first variable on the list
KML file opened for writing...
Writing to KML...
Error: C stack usage 19923204 is too close to the limit
In addition: Warning messages:
Lost warning messages
It makes more sense to look at the top 100 locations. To do this we first need to sort the data: http://stackoverflow.com/questions/1296646/how-to-sort-a-dataframe-by-columns
> # rebuild SIplot as a plain data frame so it can be sorted and converted again
> SIplot <- data.frame(latitude = dataUK.LatLong$Lat, longitude = dataUK.LatLong$Long, SI = dataUK.LatLong$TIV1)
> sorted <- SIplot[order(-SIplot$SI, SIplot$longitude, SIplot$latitude), ]
> sorted100 <- head(sorted, 100)
colors<-colorRampPalette(c("white",'lavenderblush',"pink",'yellow','orange',"red", "brown",'forestgreen','green','cyan', "blue", "black"))(12)
#SIplot<-data.frame(latitude=dataUK.LatLong$Lat, longitude=dataUK.LatLong$Long, SI=dataUK.LatLong$TIV1)
coordinates(sorted100) <- ~longitude + latitude
crs.geo <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
proj4string(sorted100) <- crs.geo
plotKML(sorted100, colour_scale=colors, altitude=50, points_names="")
R has a lot of powerful libraries for plotting spatial data and shape files. I am impressed by how much can be done with only a few lines of code using a free library.