Heatmap creation using ggplot for large genomic dataset

Dear StackOverflow community,

I have a very large data set, an extract of which looks like this:

                     AC010327.1   AC010368.1  AC010525.2
TGYR                     0          0          0.984
BHT                      0.1        0          0
THY_RHE                  0          0.0002     0
FJU_WJNKO                0          0          0
PAED_DISE                0.342      0          0
DID PID                  0          0.3821     0

Each column is a gene, and there are 30,000 columns. There are 9 rows in total, each a code for a disease type. The figures are the outcome of a statistical test, between 0 and 1, that has been run for that disease against that gene.

I would like to present this mass of data in an easy to view form and thought a heatmap would be most suitable.

Using:

x <- as.matrix(data)        # heatmap() needs a numeric matrix, not a data frame
heatmap(x, scale = "none")  # plot the raw values without rescaling

Gets me a pretty ugly block of data.

I have been trying ggplot2 with geom_tile but keep getting error messages. I am slightly unsure what the "aes" mapping should be, as I haven't named my rows or columns.

I can provide more information if needed, but would be grateful for some guidance.

Many thanks

Update 13/2/18

Using solution below, is there a way of weighting it in preference to results greater than 0?



Solution 1:[1]

We can convert the data frame from wide format to long format, and then use geom_tile.

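library(tidyverse)

# Move the row names into a column, then reshape to one row per Disease/Gene pair
dat2 <- dat %>%
  rownames_to_column(var = "Disease") %>%
  gather(Gene, Value, -Disease)

# One tile per Disease/Gene pair, coloured by the test value
ggplot(dat2, aes(x = Gene, y = Disease, fill = Value)) +
  geom_tile() +
  scale_fill_viridis_c()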


DATA

dat <- read.table(text = "                     'AC010327.1'   'AC010368.1'  'AC010525.2'
TGYR                     0          0          0.984
                  BHT                      0.1        0          0
                  THY_RHE                  0          0.0002     0
                  FJU_WJNKO                0          0          0
                  PAED_DISE                0.342      0          0
                  'DID PID'                  0          0.3821     0",
                  header = TRUE, stringsAsFactors = FALSE)
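On the update in the question (giving more visual weight to results greater than 0), one possible reading is sketched below: drop genes that are zero for every disease, and colour the tiles by a square-root transformed value so that small non-zero results stand apart from 0. This is only an illustration under assumptions: the matrix `m` re-enters the extract from the question, the column `HYPOTHETICAL_ZERO` is made up to show the filter doing something, and the square-root transform is one choice among several.

```r
# Extract from the question as a numeric matrix (filled column by column)
m <- matrix(c(0, 0.1, 0,      0, 0.342, 0,
              0, 0,   0.0002, 0, 0,     0.3821,
              0.984, 0, 0,    0, 0,     0),
            nrow = 6,
            dimnames = list(c("TGYR", "BHT", "THY_RHE", "FJU_WJNKO",
                              "PAED_DISE", "DID PID"),
                            c("AC010327.1", "AC010368.1", "AC010525.2")))

# Made-up all-zero gene, added only to demonstrate the filter (not real data)
m <- cbind(m, HYPOTHETICAL_ZERO = 0)

# Keep only genes with at least one result greater than 0
keep <- colSums(m > 0) > 0
m_nz <- m[, keep, drop = FALSE]

# Wide to long in base R: one row per Disease/Gene pair
long <- as.data.frame(as.table(m_nz), responseName = "Value",
                      stringsAsFactors = FALSE)
names(long)[1:2] <- c("Disease", "Gene")

# Square root stretches small values: 0.0002 becomes about 0.014, 0 stays 0
long$Value_sqrt <- sqrt(long$Value)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = Gene, y = Disease, fill = Value_sqrt)) +
    geom_tile() +
    scale_fill_viridis_c(name = "sqrt(Value)")
  print(p)
}
```

With 30,000 genes, filtering out the all-zero columns first may also shrink the plot considerably, depending on how sparse the real data is.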

Solution 2:[2]

When you are looking at how values vary across two categorical variables, as in your data, geom_tile is a good choice for a medium-sized dataset.

But when your dataset is so large that it can't be read in geom_tile, it's better to use d3heatmap.

Here is an example with a larger dataset, similar in shape to yours, which you can also try.

library(d3heatmap)
url <- "http://datasets.flowingdata.com/ppg2008.csv"
nba_players <- read.csv(url, row.names = 1)
d3heatmap(nba_players, scale = "column")

The result can be opened in a web browser and explored interactively. An example result can be seen at this site: Output

Check this site for more information

Notes

  1. The dataset should be numeric; d3heatmap won't accept negative values or character data.

  2. To avoid this problem, you can convert each row or column to percentage shares.
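The percentage-share idea in note 2 could look like the following in base R. This is a minimal sketch on a small made-up matrix (the names `m`, `row_share` and `col_share` are illustrative only), using `prop.table()` to divide each value by its row or column total.

```r
# Small made-up numeric matrix, for illustration only
m <- matrix(c(2, 4, 4,
              10, 30, 60),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("row_a", "row_b"), c("x", "y", "z")))

# Share of each value within its row, as a percentage (each row sums to 100)
row_share <- prop.table(m, margin = 1) * 100

# Share of each value within its column, as a percentage (each column sums to 100)
col_share <- prop.table(m, margin = 2) * 100

print(row_share)
print(col_share)
```

Note that a row or column consisting entirely of zeros has a total of 0, so its shares come out as NaN; such rows or columns would need to be dropped or handled separately before plotting.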

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 www
Solution 2 marc_s