Avoid first column of data frame - r

I have the following data frame and I want to assign each value to a bin:
title
1 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^HTC EVO 4G 1GB White Sprint Smartphone.jpg
2 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^NEW 4 0 Android 2 3 Unlocked Quad Bands GPS Bluetooth Wifi Smart Cell phone G10.jpg
3 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^Motorola Droid X2 Verizon BAD ESN GOOD Condition 100 Functional.jpg
4 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^UNLOCKED Huawei Ideos S7 Tablet Smartphone.jpg
5 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^Apple iPhone 4 16GB Black AT&T Smartphone MC318LLA .jpg
6 MotorolaROID RAZR MAXX 4G Android Phone Black 32GBVerizon Wireless.jpg^Apple iPhone 4 16GB Black Factory Unlocked Smartphone.jpg
column1 column2 column3 column4 column5 column6 column7
1 0.978 0.635 0.973 0.7619048 0.6383881 0.8339921 0.06666667
2 0.343 0.702 0.990 0.2623762 0.6150583 0.9285714 0.04166667
3 0.984 0.675 0.712 0.7056277 0.6770944 0.5612648 0.00000000
4 0.798 0.648 0.931 0.4090909 0.5864263 0.8571429 0.00000000
5 0.898 0.709 0.993 0.5000000 0.6951220 0.9328063 0.05882353
6 0.898 0.709 0.993 0.5000000 0.6951220 0.9328063 0.06250000
When I tried to run the following line, I got the error "Error in cut.default(newX[, i], ...) : 'x' must be numeric". I know this is because my first column is the title column. How can I run this while ignoring the first column?
df_bin <- apply(df, 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels=0:7)

Apply over all but the first column by excluding it with a negative index:
df_bin <- apply(df[,-1], 2, cut, c(-Inf, seq(0.5, 1, 0.1), Inf), labels=0:7)
The key here is df[,-1] versus your df.
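Two small caveats for the next step: apply() returns a character matrix rather than a data frame, and with breaks of c(-Inf, seq(0.5, 1, 0.1), Inf) there are 8 break points and therefore only 7 intervals, so cut() needs 7 labels. A minimal sketch that keeps the title column and returns a data frame (assuming labels 0:6 are acceptable):
# Sketch: bin every column except the first, keeping the titles
breaks <- c(-Inf, seq(0.5, 1, 0.1), Inf)   # 8 break points -> 7 intervals
df_bin <- df
df_bin[-1] <- lapply(df[-1], cut, breaks = breaks, labels = 0:6)
head(df_bin)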

Join tables based on an exact match on one column and fuzzy matches on other columns (Excel)

I have two tables where I want to match age and height to the percentile they fall within (according to WHO guidelines). So if the ages in table_percentile and table_height match, find the percentile column in table_percentile that the height from table_height falls in. The percentile columns (P3, P15, P50, P85, P97) contain heights in cm. I think I may need to create a min and max column for each percentile so there is a set range for the heights to fall between. So if the age is 0 days and the height 49 cm, the percentile would be P15, as it is > 47.217 and < 49.148.
NB: I have tried this in R with fuzzyjoin, however R keeps crashing, so I am trying to see whether Excel handles the processing differently. The dataset is almost 300,000 observations. I have cut it down by gender and age but it's still crashing.
table_percentile
Age P3 P15 P50 P85 P97
1 0 45.644 47.217 49.148 51.078 52.651
2 1 45.808 47.383 49.317 51.250 52.825
3 2 45.971 47.549 49.485 51.422 53.000
4 3 46.134 47.714 49.654 51.594 53.175
5 4 46.297 47.880 49.823 51.766 53.349
6 5 46.461 48.046 49.992 51.938 53.524
7 6 46.624 48.212 50.161 52.110 53.698
table_height
Age Height
1 0 49.0
2 1 50.4
3 2 48.8
4 2 51.5
5 4 52.0
6 6 46.8
7 6 49.0
Output that I'd like to get
Age Height Percentile
1 0 49.0 P15
2 1 50.4 P50
3 2 48.8 P15
4 2 51.5 P85
5 4 52.0 P85
6 6 46.8 P3
7 6 49.0 P15
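Since the question mentions trying this in R first, here is a minimal sketch that avoids fuzzyjoin entirely: an exact merge on Age plus findInterval() per row (assuming the two tables are data frames named table_percentile and table_height as printed above; heights below P3 are mapped to P3 here):
# Sketch: exact join on Age, then locate the percentile band with findInterval()
merged <- merge(table_height, table_percentile, by = "Age")
pct_cols <- c("P3", "P15", "P50", "P85", "P97")
merged$Percentile <- apply(merged, 1, function(r) {
  band <- findInterval(as.numeric(r["Height"]), as.numeric(r[pct_cols]))  # 0 = below P3
  pct_cols[max(band, 1)]
})
merged[, c("Age", "Height", "Percentile")]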

Struggling to create a box plot, histogram, and qqplot in R [closed]

I am a very new R user, and I am trying to use R to create a box plot for prices at Target vs. at Walmart. I also want to create 2 histograms for the prices at each store, as well as qqplots. I keep getting various errors, including "Error in hist.default(mydata) : 'x' must be numeric" from hist(mydata) and "Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator" from boxplot(mydata). I have correctly uploaded my csv file and I will attach my data for clarity. I have also added a direct copy and paste of some of my code. I have tried using hist(mydata), boxplot(mydata), and qqplot(mydata) as well, all of which returned the "x must be numeric" error. I'm sorry if any of this is dumb; I am extremely new to R, not to mention extremely bad at it. Thank you all for your help!
#[Workspace loaded from ~/.RData]
mydata <- read.csv(file.choose(), header = T) names(mydata)
#Error: unexpected symbol in " mydata <- read.csv(file.choose(), header = T) names"
mydata <- read.csv(file.choose(), header = T)
names(mydata)
#[1] "Product" "Walmart" "Target"
mydata
Product
1 Sara lee artesano bread
2 Store brand dozen large eggs
3 Store brand 2% milk 1 gallon (128 fl oz)
4 12.4 oz cheez its
5 Ritz cracker fresh stacks 8ct, 11.8 oz
6 Sabra classic hummus 10 oz
7 Oreo chocolate sandwich cookies 14.3 oz
8 Motts applesauce 6 ct/4oz cups
9 Bananas (each)
10 Hass Avocado (each)
11 Chips ahoy original family size, 18.2 oz
12 Lays potato chips party size, 13 oz
13 Amy’s frozen mexican casserole, 9.5 oz
14 Jack’s frozen pizza original thin crust, 13.8 oz
15 Store brand sweet cream unsalted butter, 4 count, 16 oz
16 Sour cream and onion pringles, 5.5 oz
17 Philadelphia original cream cheese spread, 8 oz
18 Daisy sour cream, regular, 16 oz:
19 Kraft singles, 24 ct/16 oz:
20 Doritos nacho cheese, party size, 14.5 oz
21 Tyson Fun Chicken nuggets, 1.81 lb (29 oz), frozen
22 Kraft mac n cheese original, 7.25 oz
23 appleapple gogo squeeze, 12ct, 3.2 oz each
24 Yoplait original french vanilla yogurt, 6oz
25 Essentia bottled water, 1 liter
26 Premium oyster crackers, 9oz
27 Aunt Jemima buttermilk pancake miz, 32 oz
28 Eggo frozen homestyle waffles, 10ct/12.3 oz
29 Kellogg's Froot Loops, 10.1 oz
30 Tostitos scoops tortilla chips, 10 oz
Walmart Target
1 2.98 2.99
2 1.93 1.99
3 2.92 2.99
4 3.14 3.19
5 3.28 3.29
6 3.68 3.69
7 3.48 3.39
8 2.26 2.29
9 0.17 0.25
10 1.18 1.19
11 3.98 4.49
12 4.48 4.79
13 4.58 4.59
14 3.42 3.59
15 3.18 2.99
16 1.78 1.79
17 3.24 3.39
18 1.94 2.29
19 4.18 4.39
20 4.48 4.79
21 6.42 6.69
22 1.00 0.99
23 5.98 6.49
24 0.56 0.69
25 1.88 1.99
26 3.12 2.99
27 2.64 2.79
28 2.63 2.69
29 2.98 2.99
30 3.48 3.99
hist(mydata)
#Error in hist.default(mydata) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
df
x
1 E
2 B
3 A
4 B
5 E
6 B
7 A
8 A
9 C
10 E
11 A
12 B
13 A
14 B
15 C
16 D
17 C
18 E
19 A
20 D
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
hist(df$x)
#Error in hist.default(df$x) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
barplot(table(df$x))
boxplot(mydata)
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
qqplot("Walmart")
#Error in sort(y) : argument "y" is missing, with no default
qqplot(mydata)
#Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
# undefined columns selected
#In addition: Warning message:
#In xtfrm.data.frame(x) : cannot xtfrm data frames
There seems to be a problem with the data you uploaded, but no matter... I will just create data resembling your problem and show you how to do it with some simple code (some may offer alternatives like ggplot, but I think my example uses shorter code and is more intuitive).
First, we can load ggpubr for plotting functions:
# Load ggpubr for plotting functions:
library(ggpubr)
Then we can create a new data frame, first with the prices and store names, then combining them into a data frame we can use:
# Create price values and store values:
prices.1 <- c(1,2,3,4,5,3)
prices.2 <- c(8,6,4,2,0,1)
store <- c("walmart",
"walmart",
"walmart",
"target",
"target",
"target")
# Create dataframe for these values:
store.data <- data.frame(prices.1,
prices.2,
store)
Now we can plug our data into all of these plots in nearly the same way each time. The first part of the code is the plot function name, the data argument is our stored data frame, and the x and y values are the variables we want to plot:
# Scatterplot:
ggscatter(data = store.data,
x="prices.1",
y="prices.2")
# Boxplot:
ggboxplot(data = store.data,
x="store",
y="prices.1")
# Histogram:
gghistogram(data = store.data,
x="prices.1")
# QQ Plot:
ggqqplot(data = store.data,
x="prices.1")
There are simpler alternatives, such as base R functions like this, but I find they are much harder to customize compared to ggpubr and ggplot:
plot(x,y)
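In case a base-R version is useful for comparison, here is a rough sketch of the same three plot types on the simulated store.data (on your own file you would pass the numeric columns instead, e.g. hist(mydata$Walmart) or boxplot(mydata[, -1]), since the Product column is what triggers the 'x' must be numeric error):
# Base-R sketch using the simulated store.data from above
boxplot(prices.1 ~ store, data = store.data,
        main = "Prices by store", xlab = "Store", ylab = "Price")
hist(store.data$prices.1, main = "Histogram of prices.1", xlab = "Price")
qqnorm(store.data$prices.1)
qqline(store.data$prices.1)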
Of course, you can really customize the ggpubr and ggplot output to look much better, but that's up to you and what you want to learn:
ggboxplot(data = store.data,
x="store",
y="prices.1",
fill = "store",
title = "Prices of Merchandise by Store",
caption = "*Data obtained from Stack Overflow",
palette = "jco",
legend = "none",
xlab ="Store Name",
ylab = "Prices of Merchandise",
ggtheme = theme_pubclean())
Hope that's helpful. Let me know if you have questions!

geom_smooth(method=loess) is not working - argument trace.hat is missing

I wrote this code in the spring and for some reason it is not working anymore; R doesn't draw the plots. When I try to use grid.arrange to draw them, I get the following warning message:
Warning messages:
1: Computation failed in stat_smooth():
argument "trace.hat" is missing, with no default
Here's a piece of the code that used to work flawlessly:
sc.occpt.ptre<-ggplot(subset(sc, species %in% c("Ptre")), aes(x=sizeclass, y=occpt)) +
geom_smooth(method=loess, color="black", size=0.5)+
coord_cartesian(ylim=c(0,2.5))+
theme_pub2()+
theme(axis.title.y = element_blank())+
theme(axis.title.x = element_blank())+
scale_x_continuous(breaks = seq(10, 45, by = 5))+
annotate("text",-Inf,Inf,hjust=0,vjust=2,label="Populus tremula",family="sans", size=1.7)+
annotate("text",-Inf,Inf,hjust=-0.07,vjust=4,label="Spearman's rho = 0.25, p = 0.007", family="sans",fontface="italic", size=1.7)
and I use this code to draw the plots:
grid.arrange(sc.occpt.ptre, sc.occpt.bsp, sc.occpt.blt, sc.occpt.pabi, sc.occpt.psyl, sc.occpt.all, ncol=2, left = textGrob("Occurrences per CWD item", rot=90))
I can't figure out how I should supply the trace.hat argument in my code, nor why I suddenly need it in the first place. Thank you in advance!
Here is head(sc); sizeclass and occpt are both numeric.
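For what it's worth, a commonly suggested fix for this particular warning (an assumption here, not verified against this exact data) is to pass the smoothing method by name, as the string "loess", rather than as the bare loess function object:
# Hedged sketch: quote the method name so ggplot2 resolves loess itself
library(ggplot2)
sc.occpt.ptre <- ggplot(subset(sc, species %in% c("Ptre")), aes(x = sizeclass, y = occpt)) +
  geom_smooth(method = "loess", color = "black", size = 0.5)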
h sppt occpt countl sizeclass sites2 sites x_co y_co spc area reg freg no type species decay dia empty
1 0.6931472 0.66666667 0.6666667 3 10 AIMA AIMALA1 350855.3 6799399 2 0.4521 EH 2a 1 2 Blt 2 11 0
2 0.4505612 0.25000000 0.7500000 8 10 AIMA AIMALA2 350408.1 6800231 2 0.9955 EH 2a 1 3 Blt 3 12 0
3 0.6365142 0.33333333 0.5000000 6 10 AIMA AIMALA3 350478.3 6799771 2 0.5283 EH 2a 5 3 Blt 2 12 0
4 1.0549202 0.37500000 0.6250000 8 10 AIMA AIMALA4 350480.3 6799868 3 0.7139 EH 2a 6 3 Blt 1 11 0
5 1.0114043 0.10000000 0.2000000 30 10 HAIL HAILUO1 395750.1 7219345 3 0.2405 OP 3a 7 8 Blt 5 10 1
6 1.0889000 0.08888889 0.2222222 45 10 HAIL HAILUO2 387392.3 7217148 4 0.1562 OP 3a 6 8 Blt 4 12 1

Want to expand a large bipartite network plot to avoid overlapping vertices

I was plotting a bipartite graph using the igraph package with R. There are about 10,000 edges, and I want to expand the width of the whole plot so that the state vertices do not overlap.
My data looks like this:
> test2
user_id state meanlat meanlon countUS countS degState
<chr> <chr> <dbl> <dbl> <int> <int> <int>
1 -_1ctLaz3jhPYc12hKXsEQ NC 35.19401 -80.83235 909 3 18487
2 -_1ctLaz3jhPYc12hKXsEQ NV 36.11559 -115.18042 29 3 37884
3 -_1ctLaz3jhPYc12hKXsEQ SC 35.05108 -80.96166 4 3 665
4 -0wUMy3vgInUD4S6KJInnw IL 40.11227 -88.22955 2 3 1478
5 -0wUMy3vgInUD4S6KJInnw NV 36.11559 -115.18042 23 3 37884
6 -0wUMy3vgInUD4S6KJInnw WI 43.08051 -89.39835 20 3 3963
Below is my code for creating and configuring the graph.
g2 <- graph_from_data_frame(test2,directed = F)
V(g2)$type <- ifelse(names(V(g2)) %in% UserStateR$user_id, 'user', 'state')
V(g2)$label <- ifelse(V(g2)$type == 'user', " ", paste(names(V(g2)),"\n",as.character(test2$degState),sep=""))
V(g2)$size <- ifelse(V(g2)$type == 'user', 3, 20)
V(g2)$color <- ifelse(V(g2)$type == 'user', 'wheat', 'salmon')
V(g2)$type <- ifelse(names(V(g2)) %in% UserStateR$user_id, T, F )
E(g2)$color <- heat.colors(8)[test2$countS]
plot(g2,layout=layout.bipartite(g2, types = names(V(g2)) %in% UserStateR$state, hgap = 50, vgap = 50))
As you can see, I have tried changing the hgap and vgap arguments, but apparently they don't have any effect. I have also tried the asp argument, but that is not what I want.
I know this might be too late for #floatsd, but I was struggling with this today and had a really hard time finding an answer, so this might help others out.
First, in general, there is an argument to plot.igraph called asp that very simply controls how rectangular your plot is. Simply do
l=layout.bipartite(CCM_net)
plot(CCM_net, layout=l, asp=0.65)
for a wide plot. asp smaller than 1 gives you a wide plot, asp larger than 1 a tall plot.
However, this might still not give you the layout you want. The bipartite command basically generates a matrix with coordinates for your vertices, and I actually don't understand yet how it comes up with the x-coordinates, so I ended up changing them myself.
Below is the example (I am assuming you know how to turn your data into data frames with the edge list and edge/vertex attributes for building graphs, so I am skipping that).
My data is CCM_data_sign and looks like this:
from to value
2 EVI MAXT 0.67
4 EVI MINT 0.81
5 EVI P 0.70
7 EVI SM 0.79
8 EVI AMO 0.86
11 MAXT EVI 0.81
18 MAXT AMO 0.84
21 MEANT EVI 0.88
28 MEANT AMO 0.83
29 MEANT PDO 0.71
31 MINT EVI 0.96
39 MINT PDO 0.78
40 MINT MEI 0.66
41 P EVI 0.91
49 P PDO 0.77
50 P MEI 0.71
51 PET EVI 0.90
58 PET AMO 0.89
59 PET PDO 0.70
61 SM EVI 0.94
68 SM AMO 0.90
69 SM PDO 0.81
70 SM MEI 0.73
74 AMO MINT 0.93
76 AMO PET 0.66
79 AMO PDO 0.71
80 AMO MEI 0.83
90 PDO MEI 0.82
The graph I generated for plotting is called CCM_net.
First, a bipartite plot without any layout adjustments:
V(CCM_net)$size<-30
l=layout.bipartite(CCM_net)
plot(CCM_net,
layout=l,
edge.arrow.size=1,
edge.arrow.width=2,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=2,
vertex.label.dist=c(3,3,3,3,3,3,3,3,3,3,3),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.lty=1)
This gives you the following
If I use asp I get the following
plot(CCM_net,
layout=l,
edge.arrow.size=1,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=2,
vertex.label.dist=c(3,3,3,3,3,3,3,3,3,3,3),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.arrow.width=2,
edge.lty=1,
asp=0.6) # controls how rectangular the plot is. <1 = wide, >1 = tall
dev.off()
This is looking better, but still not really what I want - see how some vertices are closer to each other than others?
So eventually I took the following approach. Setting the coordinates as bipartite looks like this
coords <- layout_as_bipartite(CCM_net)
coords
[,1] [,2]
[1,] 3.0 0
[2,] 0.0 1
[3,] 2.0 1
[4,] 3.5 1
[5,] 6.0 1
[6,] 1.0 1
[7,] 5.0 1
[8,] 7.0 1
[9,] 1.0 0
[10,] 4.5 0
[11,] 5.5 0
This matrix shows the x coordinates of your vertices in the first column and the y coordinates in the second column, ordered according to your list of names. My list of names is:
id name
1 EVI EVI
2 MAXT MAXT
3 MEANT MEANT
4 MINT MINT
5 P P
6 PET PET
7 SM SM
8 SR SR
9 AMO AMO
10 PDO PDO
11 MEI MEI
In my graph, EVI, AMO, PDO and MEI are on the bottom, but note their x coordinates: 3.0, 1.0, 4.5 and 5.5. I haven't figured out yet how the code comes up with those values, but I don't like them, so I simply changed them.
coords[,1]=c(2,0,4,8,12,16,20,24,9,16,24)
Now, with the new coordinates (and asp), the plotting code and the resulting output become:
plot(CCM_net,
layout=coords,
edge.arrow.size=1,
vertex.label.family="Helvetica",
vertex.label.color="black",
vertex.label.cex=1,
vertex.label.dist=c(4,4,4,4,4,4,4,4,4,4,4),
vertex.label.degree=c(pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,-pi/2,pi/2,pi/2,pi/2), #0 is right, “pi” is left, “pi/2” is below, and “-pi/2” is above
edge.arrow.width=2,
edge.lty=1,
asp=0.6) # controls how rectangular the plot is. <1 = wide, >1 = tall
Now the vertices are nicely spaced in a rectangular plot!
Note - I also decreased the size of the vertices, the size of the labels and their positioning, for better readability.
You could also output to PDF and then zoom in.
Alternatively, use the rgexf package to output a gexf file and then visualize it in Gephi.
I think Gephi is a good tool for network visualization.
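A quick sketch of the PDF route (the file name and canvas size are only placeholders): a large canvas plus smaller labels gives the vertices room, and a PDF can be zoomed without losing quality.
# Sketch: render to a large PDF canvas so the vertices have room
pdf("bipartite_network.pdf", width = 40, height = 20)   # inches
plot(g2, layout = layout.bipartite(g2), vertex.label.cex = 0.5)
dev.off()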

Creating heat map with R from a square matrix

I have a gzip compressed file file.gz with 4,726,276 lines where the first and last five lines look like this:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
CAN -1 CAN 1 OT 0 1.0000 0.0000 0.0000 0.0000 -1 0.745118 0.1111 1.5526
CAN -1 CAN 2 OT 0 0.8761 0.1239 0.0000 0.0619 -1 0.752607 0.0648 1.4615
CAN -1 CAN 3 OT 0 0.8810 0.1190 0.0000 0.0595 -1 0.753934 0.3058 1.7941
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
WAN 2 WAN 4 OT 0 0.8410 0.0000 0.1590 0.1590 -1 0.787251 0.0840 1.5000
WAN 2 WAN 5 OT 0 0.8606 0.0000 0.1394 0.1394 -1 0.784882 0.7671 2.3571
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 5 OT 0 0.7960 0.0364 0.1676 0.1858 -1 0.795924 0.5000 2.0000
WAN 4 WAN 5 OT 0 0.8227 0.0090 0.1683 0.1728 -1 0.793460 0.5577 2.0645
The x value is columns 1 and 2 combined (FID1 + IID1), the y value is columns 3 and 4 (FID2 + IID2), and the z value is column 10 (PI_HAT). Values along the diagonal are not present in the input file; they should preferably be 1, but 0 is also fine.
How can I create a heat map from such data?
Here is a simple example for a 3x3 matrix:
FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3
This is a ggplot2 approach. 4.5m rows shouldn't be a problem in R.
df <- read.table(text='FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
CAN -1 CAN 1 OT 0 1.0000 0.0000 0.0000 0.0000 -1 0.745118 0.1111 1.5526
CAN -1 CAN 2 OT 0 0.8761 0.1239 0.0000 0.0619 -1 0.752607 0.0648 1.4615
CAN -1 CAN 3 OT 0 0.8810 0.1190 0.0000 0.0595 -1 0.753934 0.3058 1.7941
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 2 WAN 4 OT 0 0.8410 0.0000 0.1590 0.1590 -1 0.787251 0.0840 1.5000
WAN 2 WAN 5 OT 0 0.8606 0.0000 0.1394 0.1394 -1 0.784882 0.7671 2.3571
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 5 OT 0 0.7960 0.0364 0.1676 0.1858 -1 0.795924 0.5000 2.0000
WAN 4 WAN 5 OT 0 0.8227 0.0090 0.1683 0.1728 -1 0.793460 0.5577 2.0645', header=T)
I duplicated a few rows of your data so that some x/y combinations repeat and the heatmap is more meaningful; there was no overlap previously:
#create your variables by merging columns 1+2 and 3+4
a <- mapply(paste,df[[1]], df[[2]])
b <- mapply(paste,df[[3]], df[[4]])
#combine in a data.frame
df2 <- data.frame(a,b)
library(dplyr)
#aggregate because you will need aggregated rows for this to work
#this should only take a few seconds for 4.5m rows
df3 <-
df2 %>%
group_by(a,b) %>%
summarize(total=n())
#plot with ggplot2
library(ggplot2)
ggplot(df3, aes(x=a,y=b,fill=total)) + geom_tile()
Output:
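The question asks for PI_HAT (column 10) as the z value rather than a count, so here is a small variation of the same approach (assuming at most one row per x/y pair; otherwise the mean is taken):
# Sketch: fill the tiles with PI_HAT instead of a count
library(dplyr)
library(ggplot2)
df4 <- df %>%
  mutate(a = paste(FID1, IID1), b = paste(FID2, IID2)) %>%
  group_by(a, b) %>%
  summarize(pi_hat = mean(PI_HAT), .groups = "drop")
ggplot(df4, aes(x = a, y = b, fill = pi_hat)) + geom_tile()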
Your question seems to have 2 parts:
How to handle the data in R (in this case coming from a gzip compressed archive)
How to make a heatmap
At first blush, it appeared that you were implying that the size of the data was large -- there are many resources on how to use Big Data in R (here's one) -- however based on the comments I take it that the data size is actually not an issue. If it were then your options would depend in part on your hardware resources as well as your willingness to sample data (which I highly recommend) rather than use every single one of your 5 million rows.
The Central Limit Theorem is your friend.
You can read in gzip data like this:
data <- read.table(gzfile("file.gz"),header=T, sep="\t", stringsAsFactors=F)
Since you did not provide your compressed archive, I've copied your sample data and read it from my clipboard in the code below. I'll show you how to construct a heatmap from this data; for importing from gzip and handling Big Data check out the link provided above.
require(stats)
require(fields)
require(akima)
a <- read.table(con <- file("clipboard"), header = T)
a$x1 <- as.numeric(a[,1])
a$x2 <- as.numeric(a[,2])
a$y1 <- as.numeric(a[,3])
a$y2 <- as.numeric(a[,4])
x <- as.matrix(cbind(a$x1, a$x2))
y <- as.matrix(cbind(a$y1, a$y2))
z <- as.matrix(a[, 10])
s = smooth.2d(z, x=cbind(x,y), theta=0.5)
image.plot(s)
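If you would rather skip the smoothing and plot the raw pair matrix, here is one more sketch (the xid/yid names are mine; absent pairs, including the diagonal, simply become 0):
# Sketch: cross-tabulate PI_HAT into a pair matrix and plot it unsmoothed
a$xid <- paste(a$FID1, a$IID1)
a$yid <- paste(a$FID2, a$IID2)
m <- xtabs(PI_HAT ~ xid + yid, data = a)
heatmap(unclass(m), Rowv = NA, Colv = NA, scale = "none")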
