Wednesday, January 14, 2009

Plotting Histograms and boxplot with R

Plotting huge files in R can be a cake walk. Here is how you can proceed:

Suppose say you have a huge file with 2 million entries in [TAB] format.

Then go ahead and read the file in the following format:

>data <- read.table(file="fileName",head=TRUE,sep=",")

# read.table is a file reading function
# Note here, if you have no header then say head=FALSE and if you data is separated by
# \t then say sep="\t"

#You may like to see summary of your data by saying

> summary(data)
V1 V2
HPAA-aaa57a02.b1: 1 Min. :0.000000
HPAA-aaa57a02.g1: 1 1st Qu.:0.009985
HPAA-aaa57a03.b1: 1 Median :0.020565
HPAA-aaa57a03.g1: 1 Mean :0.021674
HPAA-aaa57a04.b1: 1 3rd Qu.:0.033898
HPAA-aaa57a04.g1: 1 Max. :0.065789
(Other) :1140845

# Pretty Cool..

If you have not written down the column names in the data file, don't worry R provides one for you.

>names(data)
>[1] "V1" "V2"

# So, V1 and V2 are your column names.
# To Plot a Histogram simply say
>hist(data$V2)

# If you did not select any output device by default it will plot as a Rplots.ps file
# on your working directory

# You can however copy this to a file of your interest

>dev.copy(postscript,"myfile.ps",height=7,width=7)

# Here Height and width are in inches
# Sometimes it does not allow you to do a dev.copy when your dev.list() is NULL.
# You can populate it by first plotting something or by doing a hist(data$V2)
# or just a plot(data$V2)
# Then do
>hist(data$V2)

# You can check how many devices are attached to your screen by doing a

> dev.list()
postscript postscript postscript
2 3 4
# You can switch off your current device by doing a
>dev.off()
postscript
2
> dev.list()
postscript postscript
2 3

# I Plotted my histogram using the following command:

5
> hist(data$V2,breaks=100,main="Trace File Blast hit Frequency plot, cutoff=30"
,xlab="Y=M/(R+L - 2O)",col="red",las=1,freq=TRUE)
> lines(density(data$V2))

This will save the file in the current file "myfile.ps". Every subsequent call to a plot function will keep plotting in different pages of the same file.

Alternatively a truehist() can be called to plot histograms. But before that do a
>require(MASS)
> truehist(data$V2,nbins=100,main="Trace File Blast hit Frequency plot, cutoff=
30",xlab="Y=M/(R+L - 2O)",col="blue",las=1,prob=TRUE)



Making a boxplot:

Plotting a  boxplot is equally easier using R. Follow the following commands:

data <- read.table(file="bgcorrect.out",head=FALSE,sep="\t")
summary(data)

#boxplot(data$vals,main='normalized data after synthetic normalization', ylab='G
ene Expression values');
boxplot(log2(data))
help(boxplot)



SAVING OUTPUT AS TEXT FILE

#In case you are just interested to save the histogram output file into a text file

> h <- hist(data$V2,breaks=100,plot=FALSE) # Plot= FALSE will not plot, default true
> save(h,ascii=TRUE,file="data.txt") # Will save data in ascii format, default=FALSE

# If you want to read the content of h
>h
# You will get the following
# $breaks -> break points
# $count -> frequency
# $density -> plot density
# $intensities -> plot intensity
# $mids -> mid points of data

So, just printing a single variable will be
>h$mids # Will print the mid point..

No comments: