how to resolve the warning in R [closed] - r

The data I am using looks as shown below; it has 50,000 instances and 32 variables.
Missing values are present in many variables.
Sorry, I was unable to post the entire data.
I used
library(zoo)
d$V5 <- na.locf(d$V5)
and I then checked the Gini value, which gave the output below
Gini(d$V5)
[1] NA
Warning messages:
1: In sum(x * 1:n) : Integer overflow - use sum(as.numeric(.))
2: In n * sum(x) : NAs produced by integer overflow
But d$V5 corresponds to age, which is a number.
The aim was to find the Gini value and information gain and to plot a decision tree; because of the missing values, the decision tree has only one split.
Hence, filling in the missing values was necessary.
Data:
1 022 F O 044 0 N 31 12 00P 0012 Y Y N Y 0048 731 0.000000 Y N 0 VERA LUCIA N N 300.000000 0000 00 N 0
2 015 F S 018 0 Y 31 20 00 P 0216 Y Y Y Y 0012 853 0.000000 Y N 0 SARA FELIPE N N 300.000000 0000 00 N 0
3 024 F C 022 0 Y 31 08 00 P 0048 Y N Y Y 0012 040 0.000000 Y N 0 HELENA DOMINGOS SOGRA N N 229.000000 0000 00 N 0
4 012 F C 047 0 N 31 25 00 P 0180 Y Y N Y 0024 035 0.000000 Y N 0 JACI VALERIA ALEXANDRA TRAJANO N N 304.000000 0000 00 N 0
5 016 F S 028 0 Y 31 25 00 O 0012 Y Y Y Y 0012 024 0.000000 Y N 0 MARCIA CRISTINA ZANELLA SANDRO L P MARTINS N N 250.000000 0000 00 N 0
.....
49998 023 F S 023 0 Y 31 28 00 P 0264 Y Y Y Y 0012 991 0.000000 Y N 0 NOVINA GLAUCIA N N 240.000000 0000 00 N 1
49999 009 F C 038 0 Y 5 28 00 P 0048 Y Y Y Y 0204 040 0.000000 Y N 0 LILIANE FIGUEIREDO MIRNA CARVALHO NASCIMENTO N N 616.000000 0000 00 N 0
50000 022 M S 029 0 Y 31 23 00 P 0048 Y Y N Y 0036 026 0.000000 Y N 0 TITO MARTINS N N 341.000000 0000 00 N 0

The error you're getting has nothing to do with missing values (which may or may not present a problem of their own). It can easily be reproduced by doing:
sum(1:100000)
#[1] NA
#Warning message:
#In sum(1:1e+05) : integer overflow - use sum(as.numeric(.))
And can also be avoided by converting to doubles:
sum(as.numeric(1:100000))
#[1] 5000050000
So do
d$V5 = as.numeric(d$V5)
and take it from there.
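To see how the overflow reaches a Gini calculation, here is a minimal sketch. gini_sketch() is a hypothetical helper written for illustration; it only mirrors the sum(x * 1:n) and n * sum(x) terms named in the warnings and is not the code of any packaged Gini() function. The simulated ages are made up.

```r
# Hypothetical helper: a textbook Gini formula on sorted values.
# It is an assumption for illustration, not the packaged Gini().
gini_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  2 * sum(x * 1:n) / (n * sum(x)) - 1 - 1 / n
}

set.seed(1)
# Simulated integer ages standing in for d$V5 (50000 rows, as in the question)
age_int <- sample(18:80, 50000, replace = TRUE)

# Integer input: both sums exceed the integer range, so the result is NA
g_int <- suppressWarnings(gini_sketch(age_int))

# Double input: the same formula stays within range
g_dbl <- gini_sketch(as.numeric(age_int))

print(g_int)
print(g_dbl)
```

The only difference between the two calls is the storage type of the input, which is why converting the column once with as.numeric() is enough.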

Related

R group data into equal groups with a metric variable

I'm struggling to get a well-performing script for this problem: I have a table with the columns score, x and y. I want to sort the table by score and then build groups based on the x value. Each group should have an equal sum (not count) of x. x is a metric variable in the dataset and represents the historic turnover of a customer.
score x y
0.436024136 3 435
0.282303336 46 56
0.532358015 24 34
0.644236597 0 2
0.99623626 0 4
0.557673456 56 46
0.08898779 0 7
0.702941303 453 2
0.415717835 23 1
0.017497461 234 3
0.426239166 23 59
0.638896238 234 86
0.629610596 26 68
0.073107526 0 35
0.85741877 0 977
0.468612039 0 324
0.740704267 23 56
0.720147257 0 68
0.965212467 23 0
A good way to do this is to add a group variable to the data frame with cumsum. You can then easily sum within the groups, e.g. with subset.
data.frame$group <- cumsum(as.numeric(data.frame$x)) %/% (ceiling(sum(data.frame$x) / 3)) + 1
Remarks:
as.numeric() keeps cumsum from overflowing the integer range in big data frames
%/% is integer division, so the result is a whole number
the '+ 1' just makes the groups start at 1 instead of 0
Thank you, @Ronak Shah!
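The steps above (sort by score, then cut the running total of x into equal-sum bands) can be sketched end to end with the sample table from the question. The name df, the ascending sort order and the choice of three groups are assumptions for illustration.

```r
# Sample table from the question
df <- data.frame(
  score = c(0.436024136, 0.282303336, 0.532358015, 0.644236597, 0.99623626,
            0.557673456, 0.08898779, 0.702941303, 0.415717835, 0.017497461,
            0.426239166, 0.638896238, 0.629610596, 0.073107526, 0.85741877,
            0.468612039, 0.740704267, 0.720147257, 0.965212467),
  x = c(3, 46, 24, 0, 0, 56, 0, 453, 23, 234, 23, 234, 26, 0, 0, 0, 23, 0, 23),
  y = c(435, 56, 34, 2, 4, 46, 7, 2, 1, 3, 59, 86, 68, 35, 977, 324, 56, 68, 0)
)

# Sort by score first (ascending here; use decreasing = TRUE for the reverse)
df <- df[order(df$score), ]

# Cut the running total of x into bands of roughly sum(x) / n_groups
n_groups <- 3
band <- ceiling(sum(as.numeric(df$x)) / n_groups)
df$group <- cumsum(as.numeric(df$x)) %/% band + 1

# Sum of x per group
print(tapply(df$x, df$group, sum))
```

Because single rows can carry large x values, the group sums are only approximately equal. If sum(x) is an exact multiple of n_groups, the last row lands in group n_groups + 1 and may need to be folded back, for example with pmin(df$group, n_groups).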

Smoothing Lines in ggplot between all data point

I have a data.frame similar to this example
SqMt <- "Sex Sq..Meters PDXTotalFreqStpy
1 M 129 22
2 M 129 0
3 M 129 1
4 F 129 35
5 F 129 42
6 F 129 5
7 M 557 20
8 M 557 0
9 M 557 15
10 F 557 39
11 F 557 0
12 F 557 0
13 M 1208 33
14 M 1208 26
15 M 1208 3
16 F 1208 7
17 F 1208 0
18 F 1208 8
19 M 604 68
20 M 604 0
21 M 604 0
22 F 604 0
23 F 604 0
24 F 604 0"
Data <- read.table(text=SqMt, header = TRUE)
I want to show the average PDXTotalFreqStpy for each Sq..Meters organized by Sex. This is what I use:
library(ggplot2)
ggplot(Data, aes(x=Sq..Meters, y=PDXTotalFreqStpy)) + stat_summary(fun.y="mean", geom="line", aes(group=Sex,color=Sex))
How do I get these lines smoothed out so that they are not jagged but instead nice and curvy, passing through all the data points? I have seen suggestions to use splines, but I have not gotten those to work.
See if this works for you:
library(dplyr)
# increase n if the result is not smooth enough
# (for this example, n = 50 looks sufficient to me)
n = 50
# manipulate data to calculate the mean for each sex at each x-value
# before passing the result to ggplot()
Data %>%
  group_by(Sex, x = Sq..Meters) %>%
  summarise(y = mean(PDXTotalFreqStpy)) %>%
  ungroup() %>%
  ggplot(aes(x, y, color = Sex)) +
  # optional: show point locations for reference
  geom_point() +
  # optional: show original lines for reference
  geom_line(linetype = "dashed", alpha = 0.5) +
  # further data manipulation to calculate values for smoothed spline
  geom_line(data = . %>%
              group_by(Sex) %>%
              summarise(x1 = list(spline(x, y, n)[["x"]]),
                        y1 = list(spline(x, y, n)[["y"]])) %>%
              # name the list columns explicitly; newer tidyr warns otherwise
              tidyr::unnest(cols = c(x1, y1)),
            aes(x = x1, y = y1))
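The smoothing itself comes from base R's spline(), which may be easier to follow in isolation. The sketch below uses the per-size means for Sex == "M" worked out by hand from the example data; the variable names are assumptions for illustration.

```r
# Mean PDXTotalFreqStpy for Sex == "M" at each Sq..Meters value,
# computed by hand from the example data (sorted by x)
x <- c(129, 557, 604, 1208)
y_m <- c(23, 35, 68, 62) / 3

# Interpolate n points with a cubic spline through the four means
n <- 50
s <- spline(x, y_m, n = n)

# s is a list with components x and y, each of length n
str(s)

# Base graphics check: the curve passes through the original points
plot(x, y_m, pch = 19, xlab = "Sq..Meters", ylab = "mean PDXTotalFreqStpy")
lines(s)
```

Note that an interpolating spline through few points can overshoot between them, so the curve may dip below or rise above the range of the actual means.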

R: sampling from a dataset based on a certain distribution centered around points in a different dataset

I am trying to sample rows from a set of points, df_map, in X-Y-Z space according to the distribution of the points on the X-Y plane. The mean and standard deviation of the distribution is in another dataset, df_pts.
My data looks like this
> df_map
X Y Z
A 6 0 103
B -4 2 102
C -2 15 112
D 13 6 105
E 1 -3 117
F 5 16 105
G 10 5 103
H 14 -7 119
I 8 14 107
J -8 -4 100
> df_pts
x y accuracy
a 5 18 -0.8464018
b 3 2 0.5695678
c -18 14 -0.4711559
d 11 13 -0.7306417
e -3 -10 2.1887011
f -9 -11 2.1523923
g 5 1 -0.9612284
h 12 -19 -0.4750582
i -16 20 -1.4554292
j 0 -8 3.4028887
I want to iterate through the rows in df_pts and choose one row from df_map according to Gaussian distribution of distances from the (df_pts[i, x], df_pts[i, y]) with the 2d standard deviation being df_pts[i, accuracy]. In other words, at each i = 1:10, I want to take a sample from df_map according to normal distribution with mean df_pts[i, x]^2 + df_pts[i, y]^2 and 2d sd df_pts[i, accuracy].
I'd appreciate any suggestions for an efficient and sophisticated way of doing this. I'm relatively new to R and come from a C background, so my way of coding tasks like this involves too many basic loops and step-by-step calculations with basic operations, which makes the code extremely slow.
I apologize in advance if the question is too trivial or is not well-framed.
Easy-to-use data:
df_map <- data.frame(x = c(6,-4,-2,13,1,5,10,14,8,-8),
y= c(0,2,15,6,-3,16,5,-7,14,-4),
z= c(103,102,112,105,117,105,103,119,107,100))
df_pts <- data.frame(x = c(5,3,-18,11,-3,-9,5,12,-16,0),
y= c(18,2,14,13,-10,-11,1,-19,20,-8),
accuracy = c(-0.8464018, 0.5695678,-0.4711559,-0.7306417, 2.1887011, 2.1523923,-0.9612284,-0.4750582,-1.4554292,3.4028887))
What I think you are looking for is a nearest neighbour search. I have struggled A LOT with this in the past but here is the code I came up with:
library("FNN")
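findNeighbour <- function(index){
  # query point: x and y of one row of df_pts
  first = df_pts[index,1:2]
  # nearest row of df_map in the x-y plane
  hit = get.knnx(df_map[c("x","y")], first, k =1 )
  hit_index = hit[[1]]
  hit_result = df_map[hit_index,]
  result = append(df_pts[index,], hit_result)
}
# iterate over the rows of df_pts (the query points), not df_map
t <- do.call(rbind, lapply(1:nrow(df_pts),findNeighbour))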
which results in:
x y accuracy x.1 y.1 z
1 5 18 -0.8464018 5 16 105
2 3 2 0.5695678 6 0 103
3 -18 14 -0.4711559 -2 15 112
4 11 13 -0.7306417 8 14 107
5 -3 -10 2.1887011 -8 -4 100
6 -9 -11 2.1523923 -8 -4 100
7 5 1 -0.9612284 6 0 103
8 12 -19 -0.4750582 14 -7 119
9 -16 20 -1.4554292 -2 15 112
10 0 -8 3.4028887 1 -3 117
As you can see some data is matched multiple times in this example, so depending on your goal you might want to throw these out or do a bidirectional search.
I hope this is what you are looking for
Thank you for the suggestion.
I ended up doing the following
df_map <- data.frame(X = c(6,-4,-2,13,1,5,10,14,8,-8),
Y= c(0,2,15,6,-3,16,5,-7,14,-4),
Z= c(103,102,112,105,117,105,103,119,107,100))
df_pts <- data.frame(x = c(5,3,-18,11,-3,-9,5,12,-16,0),
y= c(18,2,14,13,-10,-11,1,-19,20,-8),
accuracy = c(-0.8464018, 0.5695678,-0.4711559,-0.7306417, 2.1887011, 2.1523923,-0.9612284,-0.4750582,-1.4554292,3.4028887))
library(dplyr)  # provides sample_n()
map.point2map <- function(map_in, pt_in) {
  # distances from the query point to every row of map_in
  dists <- dist(rbind(cbind(x = pt_in['x'],
                            y = pt_in['y']),
                      cbind(x = map_in$X,
                            y = map_in$Y)))[1:dim(map_in)[1]]
  mu <- mean(dists)
  stddev <- abs(as.numeric(pt_in['accuracy']))
  # draw one map row, weighted by a normal density over the distances
  return(sample_n(tbl = map_in[, c('X', 'Y')],
                  size = 1,
                  replace = TRUE,
                  weight = dnorm(dists, mean = mu, sd = stddev)))
}
mapped <- apply(df_pts,
                1,
                function(x) map.point2map(map_in = df_map,
                                          pt_in = x))
and mapped is a list of 10 points sampled from df_map as desired.
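For comparison, the same idea can be sketched in base R only. This variant is an assumption-laden alternative, not the code above: it centers the normal density at distance 0 (so closer map points are more likely) rather than at mean(dists), and it rescales the log-weights so that a small standard deviation does not underflow every weight to zero. The name sample_near is hypothetical.

```r
df_map <- data.frame(X = c(6, -4, -2, 13, 1, 5, 10, 14, 8, -8),
                     Y = c(0, 2, 15, 6, -3, 16, 5, -7, 14, -4),
                     Z = c(103, 102, 112, 105, 117, 105, 103, 119, 107, 100))
df_pts <- data.frame(x = c(5, 3, -18, 11, -3, -9, 5, 12, -16, 0),
                     y = c(18, 2, 14, 13, -10, -11, 1, -19, 20, -8),
                     accuracy = c(-0.8464018, 0.5695678, -0.4711559, -0.7306417,
                                  2.1887011, 2.1523923, -0.9612284, -0.4750582,
                                  -1.4554292, 3.4028887))

# Hypothetical helper: draw one map row with Gaussian distance weights
sample_near <- function(pt_x, pt_y, sd, map) {
  d <- sqrt((map$X - pt_x)^2 + (map$Y - pt_y)^2)
  # log-density at each distance, centered at 0 (assumption)
  logw <- dnorm(d, mean = 0, sd = abs(sd), log = TRUE)
  # subtract the maximum so the largest weight is exactly 1
  w <- exp(logw - max(logw))
  map[sample.int(nrow(map), 1, prob = w), ]
}

set.seed(42)
mapped <- do.call(rbind,
                  Map(sample_near, df_pts$x, df_pts$y, df_pts$accuracy,
                      MoreArgs = list(map = df_map)))
print(mapped)
```

With standard deviations this small relative to the distances, the draw is close to a nearest-neighbour pick; larger values of accuracy spread the choice over more map points.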

r - How to plot alphabets?

I want a scatter plot which looks like letters of the alphabet. How can I do this with a program? I can just enter co-ordinates and make the plot look like an 'A' or 'S' or whatever. But can it be done in an easier manner?
The pch argument of plot will take arguments that can be used to represent these values. From ?points, values 32-127 are the ASCII character set.
With a little messing around, values 65:90 correspond to capital letters, and values 97:122 correspond to lower case letters.
To illustrate this, try
plot(1:10, 1:10, type="p", pch=97:107)
for example.
Here is a plot of the whole Latin alphabet
# blank canvas
plot(1:30, 1:30, type="n")
# upper case
points(1:26, 1:26, pch=65:90)
# lower case
points(1:26, 4:29, pch=97:122)
You could even build a mapping between these values for easier reference.
myRefUpper <- setNames(65:90, LETTERS)
myRefUpper
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
myRefLower <- setNames(97:122, letters)
myRefLower
a b c d e f g h i j k l m n o p q r s t u v w x y z
97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
This way, you could refer to specific letters by name. For example, try
plot(1:10, 1:10, type="p", pch=c(myRefLower[c("a", "t", "q")], myRefUpper[LETTERS[10:16]]))
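As a small sketch of the same mapping without typing the codes by hand, utf8ToInt() converts a string to its character codes, which can then be passed to pch. The word and the variable names are assumptions for illustration.

```r
# Convert a word to its ASCII codes
word <- "Hello"
codes <- utf8ToInt(word)
print(codes)

# The codes agree with the lookup-table approach
myRefUpper <- setNames(65:90, LETTERS)
myRefLower <- setNames(97:122, letters)
print(myRefUpper["H"])
print(myRefLower[c("e", "l", "o")])

# Plot each letter of the word as its own point symbol
plot(seq_along(codes), rep(1, length(codes)), type = "p", pch = codes,
     xlab = "", ylab = "")
```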
There is now an R package on GitHub that provides coordinates for the Hershey fonts that Ben Bolker mentioned: hershey.
For example, we can get coordinates for the start and end of each stroke (line) in the letter A, for the Roman Simplex font (a simple font using minimal straight lines to create letters):
library(hershey)
coord <- subset(hershey, font == 'rowmans' & char == 'A')
coord
#> x y left right width stroke idx glyph font ascii char
#> 93723 0 12 -9 9 18 0 1 34 rowmans 65 A
#> 93724 -8 -9 -9 9 18 0 2 34 rowmans 65 A
#> 93725 0 12 -9 9 18 1 3 34 rowmans 65 A
#> 93726 8 -9 -9 9 18 1 4 34 rowmans 65 A
#> 93727 -5 -2 -9 9 18 2 5 34 rowmans 65 A
#> 93728 5 -2 -9 9 18 2 6 34 rowmans 65 A
We can use the base approx function to interpolate between the start and end points for each stroke, then plot the result, using the graphical parameter pty to set a square aspect ratio:
op <- par(pty = "s")
plot(coord[, 1:2], type = "n")
for (i in unique(coord$stroke)){
points(approx(subset(coord, stroke == i)))
}
To reset the default graphical parameters:
par(op)
For ggplot2 you can do the interpolation first as below:
library(dplyr)
library(ggplot2)
coord2 <- coord %>%
group_by(stroke) %>%
do(as_tibble(approx(.)))
ggplot(coord2, aes(x, y, group = stroke)) +
geom_point() +
coord_equal() +
theme_minimal()
Created on 2022-01-05 by the reprex package (v2.0.1)
Edit
approx won't work for letters with strokes in which the same x value occurs with different y values, e.g. vertical strokes or strokes that bend back on themselves. For this we can define our own linear interpolation function:
interp <- function(coord, eps = 0.5) {
y <- coord$y
x <- coord$x
n <- length(x)
x2 <- (x[-1] - x[-n])/eps
y2 <- (y[-1] - y[-n])/eps
p <- pmax(abs(x2), abs(y2))
id <- sequence(p)
list(x = c(x[1], rep(x[-n], p) + rep((x[-1] - x[-n])/p, p)*id),
y = c(y[1], rep(y[-n], p) + rep((y[-1] - y[-n])/p, p)*id))
}
library(hershey)
coord <- subset(hershey, font == 'rowmans' & char == 'C')
op <- par(pty = "s")
plot(coord[, 1:2], type = "n")
for (i in unique(coord$stroke)){
points(interp(subset(coord, stroke == i)))
}
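To illustrate why approx struggles with such strokes, here is a minimal sketch using a made-up vertical stroke (the coordinates are hypothetical, not taken from the hershey data). It also shows the underlying idea of interpolating along the segment with a parameter t instead of along x.

```r
# A made-up vertical stroke: both points share the same x value
vertical <- data.frame(x = c(0, 0), y = c(-9, 12))

# approx() collapses the tied x values and is left with a single point,
# so it cannot interpolate; capture the error message instead of stopping
res <- tryCatch(
  suppressWarnings(approx(vertical$x, vertical$y)),
  error = function(e) conditionMessage(e)
)
print(res)

# Parametric alternative: move from the start point to the end point
# in equal steps of t, which works for any direction
t <- seq(0, 1, length.out = 22)
pts <- data.frame(
  x = vertical$x[1] + t * diff(vertical$x),
  y = vertical$y[1] + t * diff(vertical$y)
)
print(head(pts))
```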

Getting a stacked area plot in R

This question is a continuation of the previous question I asked.
Now I have a case where there is also a category column with Prop. So, the dataset becomes like
Hour Category Prop2
00 A 25
00 B 59
00 A 55
00 C 5
00 B 50
...
01 C 56
01 B 45
01 A 56
01 B 35
...
23 D 58
23 A 52
23 B 50
23 B 35
23 B 15
In this case I need to make a stacked area plot in R with the percentages of these different categories for each hour. So, the result will look like this:
A B C D
00 20% 30% 35% 15%
01 25% 10% 40% 25%
02 20% 40% 10% 30%
.
.
.
20
21
22 25% 10% 30% 35%
23 35% 20% 20% 25%
So now I would get the share of each Category in each hour and then plot this as a stacked area plot, where the x-axis is the hour, the y-axis is the percentage of Prop2 for each category, and the categories are shown in different colours.
You can use the ggplot2 package from Hadley Wickham for that.
R> library(ggplot2)
An example data set :
R> d <- data.frame(t=rep(0:23,each=4),var=rep(LETTERS[1:4],4),val=round(runif(4*24,0,50)))
R> head(d,10)
t var val
1 0 A 1
2 0 B 45
3 0 C 6
4 0 D 14
5 1 A 35
6 1 B 21
7 1 C 13
8 1 D 22
9 2 A 20
10 2 B 44
And then you can use ggplot with geom_area :
R> ggplot(d, aes(x=t,y=val,group=var,fill=var)) + geom_area(position="fill")
You can use stackpoly from the plotrix package:
library(plotrix)
#create proportions table
pdat <- prop.table(xtabs(Prop2~Hour+Category,Dat),margin=1)
#draw chart
stackpoly(pdat,stack=TRUE,xaxlab=rownames(pdat))
#add legend
legend(1,colnames(pdat),bg="#ffffff55",fill=rainbow(dim(pdat)[2]))
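The proportions step can be checked on its own before plotting. The sketch below builds a made-up Dat with the same column names as the question (the values are simulated, not the asker's data) and confirms that each hour's shares add up to 1.

```r
set.seed(1)
# Simulated stand-in for the question's data: 8 records per hour
Dat <- data.frame(
  Hour = sprintf("%02d", rep(0:23, each = 8)),
  Category = sample(LETTERS[1:4], 24 * 8, replace = TRUE),
  Prop2 = sample(1:60, 24 * 8, replace = TRUE)
)

# Sum Prop2 per Hour and Category, then divide each row by its total
pdat <- prop.table(xtabs(Prop2 ~ Hour + Category, Dat), margin = 1)

# Shares as percentages for the first few hours
print(round(100 * head(pdat), 1))

# Each hour's shares should sum to 1
print(range(rowSums(pdat)))
```

margin = 1 normalises within rows (hours); margin = 2 would instead give each category's distribution over the hours.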
If you want to remove the borders, you can use scale_x_discrete and coord_cartesian this way:
p <- ggplot(d, aes(x=Date,y=Volume,group=Platform,fill=Platform)) + geom_area(position="fill")
base_size <- 9
p + theme_bw(base_size=base_size) + scale_x_discrete(expand = c(0, 0)) + coord_cartesian(ylim=c(0,1))
