Specifying bin range values for continuous data in R - r

I have a set of transaction values whose range is 0-15,000 USD. I've plotted a histogram specifying breaks of $250 bin values, which is helpful. What I would like to do is go back into the data frame and create my own bin values within it. The bins would specify the range that the transactions fall into, such as: 0-250, 251-499, 500-749, 750...by 250 all the way up to 15,000.
I looked at this nifty post, Generate bins from a data frame, regarding 'cut' and 'findInterval', but they aren't really meeting my expectations. I get nasty factor labels that look okay for low bin ranges, but once I get above $x,000 I get e-notation values (1.27e+04, 1.3e+04).
What I'd like is:
Tran ID Amount Bin
135 $249.22 0-250
138 $1,022.01 1000-1249
155 $10,350.11 10,249-10,500
Is this possible with 'cut' or 'findInterval' or is there a better implementation?

cut is the way to go for this problem. If you do not like the output with the brackets, you can use some data manipulation to get it to look the way you'd like.
bins <- seq(0, 15000, by=250)
Amount2 <- as.numeric(gsub("\\$|,", "", df$Amount))
labels <- gsub("(?<!^)(\\d{3})$", ",\\1", bins, perl=T)
rangelabels <- paste(head(labels,-1), tail(labels,-1), sep="-")
df$Bin <- cut(Amount2, bins, rangelabels)
We first create a sequence of break points from 0 to 15,000 in steps of 250. Next we strip the dollar signs and commas from the Amount column, convert it to numeric, and save the result in the variable Amount2. We then format the break points as labels by inserting a comma before the last three digits of any value of 1,000 or more. These labels are used for the final Bin column.
The variable rangelabels joins each break point to the next one with a hyphen. The main function is next, cut(Amount2, bins, rangelabels). The first argument, Amount2, is the numeric vector being cut. The second argument, bins, supplies the breaks for the intervals. The last argument, rangelabels, is the vector of names for the output. Note that by default cut uses intervals that are open on the left and closed on the right, so an amount of exactly 250 falls in 0-250, and an amount of exactly 0 becomes NA unless include.lowest=TRUE is added. The result is:
df
TranID Amount Bin
1 135 $249.22 0-250
2 138 $1,022.01 1,000-1,250
3 155 $10,350.11 10,250-10,500
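As a self-contained variation on the answer above, the labels can also be built with format() instead of a regular expression. This is only a sketch: the sample data frame is made up to mirror the three rows in the question, and the names amount_num, lab and range_labels are illustrative.

```r
# Hypothetical sample data mirroring the three rows in the question
df <- data.frame(TranID = c(135, 138, 155),
                 Amount = c("$249.22", "$1,022.01", "$10,350.11"),
                 stringsAsFactors = FALSE)

# Strip "$" and "," so the amounts can be treated as numbers
amount_num <- as.numeric(gsub("[$,]", "", df$Amount))

# Break points every 250 from 0 to 15,000
bins <- seq(0, 15000, by = 250)

# big.mark inserts the thousands separator; scientific = FALSE
# guards against labels such as 1e+04
lab <- format(bins, big.mark = ",", scientific = FALSE, trim = TRUE)
range_labels <- paste(head(lab, -1), tail(lab, -1), sep = "-")

# include.lowest = TRUE places an amount of exactly 0 in the first bin
df$Bin <- cut(amount_num, breaks = bins, labels = range_labels,
              include.lowest = TRUE)
df
```

The Bin column is still a factor; wrap it in as.character() if plain strings are preferred.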

Related

How to split a heat map or a draw a boundary with specific range?

I have an expression matrix containing three groups. I need to draw or split the heat map at specific ranges of columns.
Total number of columns: 151, where the 1st column is gene ids
Group1: 2:40
Group2: 41:80
Group3: 81:151
I searched for splitting the heatmap and I got some hits like this.
But they are based on specific clusters.
I need to give my range as (2:40, 41:80, 81:151) for splitting or making boundary for the heatmap
Something like this
library(pheatmap)
mat = cbind(genes=1:100,
matrix(rnorm(150*100,mean = rep(1:3,c(39*100,40*100,71*100))),ncol=150))
colnames(mat)[2:ncol(mat)] = paste0("col",1:150)
You need to know how many columns are in each group. From what you provided, I counted this:
Group1: 39 Group2: 40 Group3: 71
So you need to make a data.frame whose row names match the column names of your matrix, and tell it which column is in Group1, Group2, etc.
DF = data.frame(Groups=rep(c("Group1","Group2","Group3"),c(39,40,71)))
rownames(DF) = colnames(mat)[2:ncol(mat)]
Then we plot. mat[,-1] means excluding the first column. You need to specify where to insert the gaps, and for your example they are after columns 39 and 79 because we excluded the first column:
pheatmap(mat[,-1],cluster_cols=FALSE,
annotation_col=DF,gaps_col = cumsum(c(39,40)))
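If the group boundaries are only known as column ranges, the group sizes and gap positions can be derived rather than counted by hand. A minimal base R sketch, assuming the ranges from the question; group_ranges, group_sizes, groups and gaps are illustrative names:

```r
# Column ranges for each group, as given in the question
# (column 1 holds the gene ids)
group_ranges <- list(Group1 = 2:40, Group2 = 41:80, Group3 = 81:151)

# Number of columns in each group
group_sizes <- lengths(group_ranges)

# One group label per data column, usable for the annotation data frame
groups <- rep(names(group_sizes), times = group_sizes)

# Gap positions, counted after dropping the id column; the last
# cumulative sum is the right edge of the plot, so it is dropped
gaps <- head(cumsum(group_sizes), -1)
gaps
```

Here groups could be passed as the Groups column of the annotation data frame and gaps as gaps_col.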

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k + lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for(filePos in seq_along(fileNames)) {
# read in files **fill in <filePath>**
temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
# count the number of rows with missing values,
# ** fill in <fieldName#> with strings of variable names **
# (the 1 tells apply to work row by row)
missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)], 1,
function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
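If everything is already in one data frame with an ID column, as the question describes, a loop over files may not be needed. The following is a sketch under that assumption; the data frame and its column names A, B and C are made up for illustration.

```r
# Hypothetical data: an ID column plus three measurement columns
Data <- data.frame(ID = c(1, 1, 2, 2, 3),
                   A  = c(NA, 2, 3, NA, 5),
                   B  = c(1, NA, 3, NA, 5),
                   C  = c(1, 2, 3, 4, 5))

# TRUE for every row with at least one NA outside the ID column
measure_cols <- setdiff(names(Data), "ID")
row_has_na <- !complete.cases(Data[, measure_cols])

# Sum the flags within each ID
na_counts <- aggregate(list(NA_Count = row_has_na),
                       by = list(ID = Data$ID), FUN = sum)
na_counts
```

This counts rows that contain any NA. To count individual NA cells instead, rowSums(is.na(Data[, measure_cols])) could replace row_has_na.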

Drawing lines between values on plot in R

I have a dataframe "df" that has 120 rows and 2 columns containing numbers as shown...
V1 V2
10001 177417
227418 267719
317720 471368
I want to be able to lay these along the X-axis of a plot with a line connecting the values from V1 to V2 in each row.
One option would be to use seq(V1,V2) for each row and then concatenate to create a full series. However, with the amount of data involved, the object size runs to >10GB and is therefore not a viable option. The Y-axis position here is not important.
Any ideas?
First create a plot object, then enter the rest of the rows using the segments function:
plot(x=c(1,1), y=df[1,], xlim = c(1,nrow(df)), ylim=range(df), type='l')
segments(x0=2:nrow(df), x1=2:nrow(df), y0=df[-1,1], y1=df[-1,2])
Here is how it looks on a random cumulative set:
df <- apply(as.data.frame(cbind(rnorm(1000),rnorm(1000))),2,cumsum)
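The code above draws one vertical segment per row. If the intervals should instead lie along the X-axis, as the question describes, segments can be given the V1 and V2 values as x coordinates with a constant y. A minimal sketch using only the three rows shown in the question:

```r
# The three rows shown in the question
df <- data.frame(V1 = c(10001, 227418, 317720),
                 V2 = c(177417, 267719, 471368))

# Empty plot spanning the full range of both columns
x_range <- range(c(df$V1, df$V2))
plot(NA, xlim = x_range, ylim = c(0, 2),
     xlab = "Position", ylab = "", yaxt = "n")

# One horizontal segment per row, from V1 to V2, all at y = 1
segments(x0 = df$V1, y0 = 1, x1 = df$V2, y1 = 1, lwd = 3)
```

Only the two endpoints of each row are stored, so the memory cost stays proportional to the number of rows.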

r- hist.default, 'x' must be numeric

Just picking up R and I have the following question:
Say I have the following data.frame:
v1 v2 v3
3 16 a
44 457 d
5 23 d
34 122 c
12 222 a
...and so on
I would like to create a histogram or barchart for this in R, but instead of having the x-axis be one of the numeric values, I would like a count by v3. (2 a, 1 c, 2 d...etc.)
If I do hist(dataFrame$v3), I get the error that 'x' must be numeric.
Why can't it count the instances of each different string like it can for the other columns?
What would be the simplest code for this?
OK. First of all, you should know exactly what a histogram is. It is not a plot of counts. It is a visualization for continuous variables that estimates the underlying probability density function. So do not try to use hist on categorical data. (That's why hist tells you that the value you pass must be numeric.)
If you just want counts of discrete values, that's just a basic bar plot. You can calculate counts of values in R for discrete data using table and then plot that with the basic barplot() command.
barplot(table(dataFrame$v3))
If you want to require a minimum number of observations, try
tbl<-table(dataFrame$v3)
atleast <- function(i) {function(x) x>=i}
barplot(Filter(atleast(10), tbl))
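For reference, here is the same idea run on the five rows shown in the question. The data frame is rebuilt by hand, and the threshold of 2 is chosen only so that the filter has a visible effect on such a small sample.

```r
# The five rows shown in the question
dataFrame <- data.frame(v1 = c(3, 44, 5, 34, 12),
                        v2 = c(16, 457, 23, 122, 222),
                        v3 = c("a", "d", "d", "c", "a"))

# Count how often each value of v3 occurs
counts <- table(dataFrame$v3)
counts

# One bar per category
barplot(counts, xlab = "v3", ylab = "Count")

# Keep only the categories seen at least twice
barplot(counts[counts >= 2], xlab = "v3", ylab = "Count")
```

Indexing the table with a logical condition is a shorter alternative to the Filter call above.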

how to plot histogram with the first column value as x and the second column as y value?

I have the following structure in a file:
2014 50
2012 60
2016 80
I wish to plot a histogram using the first column as x values, and the second column as y values.
I tried:
data<-read.table("fileinput.txt", header = T)
hist(data[,2])
but it only gives me two bars.
If I want to write this into an R script, how do I save the image to a location on the server?
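Use barplot rather than hist, since you already have the heights of the bars. Assuming dt is your data frame with columns year and obs:
barplot(dt$obs, names.arg=dt$year)
The second positional argument of barplot is width, so the years have to be passed by name as names.arg. Have a look at the other arguments as well.
To save the image from a script, open a file device with jpeg before plotting and close it with dev.off() afterwards.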
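Putting the pieces together, a script might look like the sketch below. The data frame is typed in from the question, the column names year and obs are assumptions, and the output path under tempdir() is only a placeholder for the real server location. jpeg() also needs an R build with JPEG support, which some headless servers lack.

```r
# Values from the question; the column names are assumed
dt <- data.frame(year = c(2014, 2012, 2016), obs = c(50, 60, 80))

# Sort by year so the bars appear in chronological order
dt <- dt[order(dt$year), ]

# Placeholder output location; replace with the real path
out_file <- file.path(tempdir(), "barplot.jpg")

# Open the file device, draw the plot, then close the device
jpeg(out_file, width = 800, height = 600)
barplot(dt$obs, names.arg = dt$year, xlab = "Year", ylab = "Value")
dev.off()
```

The file is only written completely once dev.off() has been called.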
