I'm trying to replace NA values in a column in a data frame with the value from another column in the same row. Instead of replacing the values the entire column seems to be deleted.
fDF is a data frame where some values are NA. When column 1 has an NA value I want to replace it with the value in column 2.
fDF[columns[1]] = if(is.na(fDF[columns[1]]) == TRUE &
is.na(fDF[columns[2]]) == FALSE) fDF[columns[2]]
I'm not sure what I'm doing wrong here.
Thanks
You can adapt the following code to your data:
> ddf
xx yy zz
1 1 10 11.88
2 2 9 NA
3 3 11 12.20
4 4 9 12.48
5 5 7 NA
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 NA
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 NA
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 NA
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
>
> idx = is.na(ddf$zz)
> idx
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[22] FALSE FALSE
>
> ddf$zz[idx]=ddf$yy[idx]
>
> ddf
xx yy zz
1 1 10 11.88
2 2 9 9.00
3 3 11 12.20
4 4 9 12.48
5 5 7 7.00
6 6 6 13.28
7 7 9 13.80
8 8 8 14.40
9 9 5 5.00
10 10 4 15.84
11 11 6 16.68
12 12 6 17.60
13 13 5 18.60
14 14 4 19.68
15 15 6 6.00
16 16 8 22.08
17 17 4 23.40
18 18 6 24.80
19 19 8 8.00
20 20 11 27.84
21 21 8 29.48
22 22 10 31.20
23 23 9 33.00
>
You want an ifelse() expression:
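fDF[[columns[1]]] <- ifelse(is.na(fDF[[columns[1]]]), fDF[[columns[2]]], fDF[[columns[1]]])
rather than assigning the result of an if statement to a column. An if statement tests a single condition, so it cannot decide row by row. Note the double brackets: fDF[[columns[1]]] extracts the column as a plain vector, which is what ifelse() expects, whereas fDF[columns[1]] returns a one-column data frame.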
[EDIT only for David Arenburg: if that wasn't already explicit enough, in R if statements are not vectorized, hence can only handle scalar expressions, hence they're not what the OP needed. I had already tagged the question 'vectorization' yesterday and the OP is free to go read about vectorization in R in any of the thousands of good writeups and tutorials out there.]
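For anyone who wants to try this on a small example, here is a minimal sketch. The toy values are made up for illustration; only the names fDF and columns come from the question. It uses [[ ]] so that each column is handled as a plain vector.

```r
# Toy data: the names mirror the question, the values are invented.
fDF <- data.frame(a = c(1, NA, 3, NA), b = c(10, 20, NA, 40))
columns <- c("a", "b")

# ifelse() is vectorized: it picks, row by row, the value from column 2
# where column 1 is NA and keeps column 1 otherwise.
fDF[[columns[1]]] <- ifelse(is.na(fDF[[columns[1]]]),
                            fDF[[columns[2]]],
                            fDF[[columns[1]]])
print(fDF)
```

If column 2 is also NA in a given row, the result simply stays NA for that row.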
In my example, I want to use the following code:
# Classification dataset
library(dplyr)
nest <- c(1,3,4,7,12,13,21,25,26,28)
finder_max <- c(9,50,25,50,25,50,9,9,9,3)
max_TA <- c(7.4,29.4,17.0,33.1,16.2,34.4,4.3,3.52,7.47,1.4)
ds.class <- data.frame(nest,finder_max,max_TA)
ds.class$ClassType <- ifelse(ds.class$finder_max==3,"Class_1_3",
ifelse(ds.class$finder_max==9,"Class_3_9",
ifelse(ds.class$finder_max==25,"Class_9_25",
ifelse(ds.class$finder_max==50,"Class_25_50","Class_50_51"))))
ds.class
# nest finder_max max_TA ClassType
# 1 1 9 7.40 Class_3_9
# 2 3 50 29.40 Class_25_50
# 3 4 25 17.00 Class_9_25
# 4 7 50 33.10 Class_25_50
# 5 12 25 16.20 Class_9_25
# 6 13 50 34.40 Class_25_50
# 7 21 9 4.30 Class_3_9
# 8 25 9 3.52 Class_3_9
# 9 26 9 7.47 Class_3_9
# 10 28 3 1.40 Class_1_3
# Custom ordering vector
custom.vec <- c("Class_0_1","Class_1_3","Class_3_9",
"Class_9_25","Class_25_50","Class_50")
# Original dataset
my.ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/test_ants.csv")
my.ds$ClassType <- cut(my.ds$AT,breaks=c(-Inf,1,2.9,8.9,24.9,49.9,Inf),
right=FALSE,labels=c("Class_0_1","Class_1_3","Class_3_9",
"Class_9_25","Class_25_50","Class_50"))
str(my.ds)
# 'data.frame': 55 obs. of 4 variables:
# $ days : int 0 47 76 0 47 76 118 160 193 227 ...
# $ nest : int 2 2 2 3 3 3 3 3 3 3 ...
# $ AT : num 10.92 22.86 23.24 0.14 0.48 ...
# $ ClassType: Factor w/ 6 levels "Class_0_1","Class_1_3",..: 4 4 4 1 1 1 1 1 1 1 ...
I'd like to remove the rows in my.ds whose ClassType equals the ClassType found in ds.class for the same nest. I also need to remove the classes that rank higher than that ClassType in my custom ordering (custom.vec). Example: if nest 3 has ClassType Class_25_50 in ds.class, I need to remove the rows with this ClassType and any higher class ("Class_50"), if they exist, for nest 3 in my.ds.
The expected output, new.my.ds, is:
new.my.ds
# days nest AT ClassType
# 1 0 2 10.9200 Class_9_25
# 2 47 2 22.8600 Class_9_25
# 3 76 2 23.2400 Class_9_25
# 4 0 3 0.1400 Class_0_1
# 5 47 3 0.4800 Class_0_1
# 6 76 3 0.8300 Class_0_1
# 7 118 3 0.8300 Class_0_1
# 8 160 3 0.9400 Class_0_1
# 9 193 3 0.9400 Class_0_1
# 10 227 3 0.9400 Class_0_1
# 11 262 3 0.9400 Class_0_1
# 12 306 3 0.9400 Class_0_1
# 13 355 3 11.9300 Class_9_25
# 14 396 3 12.8100 Class_9_25
# 16 0 4 1.0000 Class_1_3
# 17 76 4 1.5600 Class_1_3
# 18 160 4 2.8800 Class_1_3
# 19 193 4 2.8800 Class_1_3
# 20 227 4 2.8800 Class_1_3
# 21 262 4 2.8800 Class_1_3
# 22 306 4 2.8800 Class_1_3
# 24 0 7 11.7100 Class_9_25
# 25 47 7 24.7900 Class_9_25
#...
# 55 349 1067 0.9600 Class_0_1
Any help would be appreciated.
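One way to read the rule above is "for each nest, keep only the rows whose class ranks strictly below that nest's class in ds.class". The following is a minimal base R sketch of that reading, not a tested solution for the real file. The two toy data frames are made up, and rank_of is a hypothetical helper. Nests that do not appear in ds.class are kept unchanged, which is an assumption based on nest 2 in the expected output.

```r
custom.vec <- c("Class_0_1", "Class_1_3", "Class_3_9",
                "Class_9_25", "Class_25_50", "Class_50")

# Toy stand-ins for ds.class and my.ds (invented rows).
ds.class <- data.frame(nest = c(3, 4),
                       ClassType = c("Class_25_50", "Class_9_25"))
my.ds <- data.frame(
  nest = c(2, 3, 3, 3, 3, 4, 4),
  ClassType = c("Class_9_25", "Class_0_1", "Class_9_25", "Class_25_50",
                "Class_50", "Class_1_3", "Class_9_25"))

# Hypothetical helper: position of each class in the custom ordering.
rank_of <- function(x) match(as.character(x), custom.vec)

# Cutoff rank per row of my.ds, looked up by nest (NA if the nest is absent).
cutoff <- rank_of(ds.class$ClassType)[match(my.ds$nest, ds.class$nest)]

# Keep rows with no cutoff, or whose class ranks strictly below the cutoff.
keep <- is.na(cutoff) | rank_of(my.ds$ClassType) < cutoff
new.my.ds <- my.ds[keep, ]
print(new.my.ds)
```

Any label that is missing from custom.vec (for example "Class_50_51") gets rank NA in this sketch, so such labels would need to be added to the ordering first.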
I guess something similar has been asked before, but I could only find answers for Python and SQL. Please let me know in the comments if this has also been asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In case you do not get the same data frame, see the comment by @Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value for n positions and subtracting from it the average value of the next n positions, let's say n=5 for now.
What I tried
I currently use the method below, but it takes a huge amount of time on a bigger data frame, so I wonder whether there is a faster way.
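library(dplyr)

# Mean of the 5 rows starting at pos minus the mean of the following 5 rows
calc <- function( pos ) {
  this.five <- df %>% slice(pos:(pos+4))
  next.five <- df %>% slice((pos+5):(pos+9))
  differ = mean(this.five$value)- mean(next.five$value)
  data.frame(dif= differ)
}
# Apply calc once per position
df %>%
  group_by(position) %>%
  do(calc(.$position))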
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
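If data.table is not an option, the same left-aligned rolling-mean difference can be sketched in base R with cumulative sums. This is only an illustration under that assumption, not the method from the answer above. The helper name window_mean_diff is hypothetical, and the values are copied from the data frame shown in the question.

```r
# Hypothetical helper: mean of the n values starting at each position
# minus the mean of the n values starting n positions later.
window_mean_diff <- function(x, n = 5) {
  len <- length(x)
  cs <- c(0, cumsum(x))            # cs[i + 1] is the sum of x[1:i]
  starts <- seq_len(len)
  ends <- starts + n - 1
  roll <- rep(NA_real_, len)
  ok <- ends <= len                # only complete windows get a mean
  roll[ok] <- (cs[ends[ok] + 1] - cs[starts[ok]]) / n
  # Indexing past the end of roll yields NA, so incomplete pairs stay NA.
  roll - roll[starts + n]
}

value <- c(27, 37, 57, 89, 20, 86, 97, 62, 58, 6,
           19, 16, 61, 34, 67, 43, 88, 83, 32, 63)
res <- window_mean_diff(value, n = 5)
print(res)
```

Like the data.table version, this returns NA wherever one of the two windows is incomplete, whereas the slice-based code in the question averages whatever rows are left.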
> structure(dat_de$total_all)
[1] 11 11 9 6 9 15 10 6 11 10 10 9 7 13 7 5 5 8 10 14 9 10 13 6 10 11 12 22 11 1 7 9 12 7 7 11 9 7 15 10 6 10
[43] 8 10 9 8 14 5 10 12 14 9 10 18 8 8 15
> structure(dat_en$total_all)
[1] 25 10 12 17 10 11 11 9 9 25 14 10 13 22 13 10 11 15 20 11 9 15 9 14 10 19 10 9 8 14 4 18 16 7 10 13 9 11 12
This is my variable "total_all" in the German and English versions of the data.
I want to put the results of the describe function (see below) of these two variables in a presentable table. Preferably one table for both variables, if that is possible.
> describe(dat_de$total_all)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 57 9.81 3.45 10 9.62 2.97 1 22 21 0.73 1.81 0.46
> describe(dat_en$total_all)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 39 12.69 4.69 11 12.24 2.97 4 25 21 1.01 0.61 0.75
I'm grateful for your help :)
I'm not quite sure which library the describe function is in [edit: looks like you're using the one from the psych package] but you can make a simple, nice looking table using the kable function from knitr:
library(knitr)
library(psych)
de_dat_descr <- data.frame(describe(dat_de$total_all), row.names = "de_dat_descr")
en_dat_descr <- data.frame(describe(dat_en$total_all), row.names = "en_dat_descr")
dat.df <- t(rbind.data.frame(de_dat_descr, en_dat_descr))
kable(dat.df)
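If psych or knitr is not installed, a rough base R stand-in is sketched below. It covers only a handful of the statistics that describe() reports. The helper name summarise_vec is hypothetical, and the two vectors are copied from the question.

```r
total_de <- c(11, 11, 9, 6, 9, 15, 10, 6, 11, 10, 10, 9, 7, 13, 7, 5, 5, 8,
              10, 14, 9, 10, 13, 6, 10, 11, 12, 22, 11, 1, 7, 9, 12, 7, 7,
              11, 9, 7, 15, 10, 6, 10, 8, 10, 9, 8, 14, 5, 10, 12, 14, 9,
              10, 18, 8, 8, 15)
total_en <- c(25, 10, 12, 17, 10, 11, 11, 9, 9, 25, 14, 10, 13, 22, 13, 10,
              11, 15, 20, 11, 9, 15, 9, 14, 10, 19, 10, 9, 8, 14, 4, 18, 16,
              7, 10, 13, 9, 11, 12)

# Hypothetical helper: a few summary statistics as a named vector.
summarise_vec <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# One column per variable, one row per statistic, rounded for display.
tab <- round(sapply(list(de = total_de, en = total_en), summarise_vec), 2)
print(tab)
```

The resulting matrix can be passed to kable() in the same way as dat.df above.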
I want to conditionally create a new variable that equals an existing one. My data looks like this:
id id2
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14
12.12 12 15
12.2 NA 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20
How can I create a new variable id3 that equals id2 when id is missing? When id is not missing, id3 should be missing.
id id2 id3
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14 14
12.12 12 15
12.2 NA 16 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20
Thanks!!
Assuming that dat is your data frame, you can do the following based on ifelse in base R.
dat$id3 <- with(dat, ifelse(is.na(id), id2, NA))
Or
dat2 <- transform(dat, id3 = ifelse(is.na(id), id2, NA))
DATA
dat <- read.table(text = " id id2
1.1 1 1
1.2 2 2
1.3 3 3
1.4 4 4
1.5 NA 5
5.5 5 6
5.6 6 7
5.7 7 8
5.8 8 9
5.51 NA 10
9.9 9 11
9.10 10 12
9.11 11 13
9.4 NA 14
12.12 12 15
12.2 NA 16
13.13 13 17
13.14 14 18
13.15 15 19
13.16 16 20",
header = TRUE)
I have time series data with N/As. The data are to end up in an animated scatterplot.
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200
For a smooth animation, the data will be supplemented by calculated, additional values/rows. For the X values this is simply arithmetical. No problem so far.
Week X Y
1 1 105
2
2 3 110
4
3 5 N/A
6
4 7 130
8
9
10
11
12
13
14
8 15 160
16
17
18
19
20
21
22
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
The Y values should be interpolated, with the additional requirement that interpolation should only happen between two consecutive values and not between values that have an N/A between them.
Week X Value
1 1 105
2 interpolated value
2 3 110
4
3 5 N/A
6
4 7 130
8 interpolated value
9 interpolated value
10 interpolated value
11 interpolated value
12 interpolated value
13 interpolated value
14 interpolated value
8 15 160
16 interpolated value
17 interpolated value
18 interpolated value
19 interpolated value
20 interpolated value
21 interpolated value
22 interpolated value
12 23 180
24
25
26
27
28
29
16 30 N/A
31
32
33
34
35
36
20 37 200
I have already experimented with approx, converted the "original" N/A to placeholder values and tried the zoo package with na.approx etc., but I can't work out how to express the correct condition for this kind of "conditional approximation" or "conditional gap filling". Any hint is welcome and very much appreciated.
Thanks in advance
Replace the NAs with Inf, interpolate and then revert infinite values to NA.
library(zoo)
DF2 <- DF
DF2$Y[is.na(DF2$Y)] <- Inf
w <- merge(DF2, data.frame(Week = min(DF2$Week):max(DF2$Week)), by = 1, all.y = TRUE)
w$Value <- na.approx(w$Y)
w$Value[!is.finite(w$Value)] <- NA
This gives the following, where Week has been expanded to all weeks, Y shows the original NAs as Inf and the inserted rows as NA, and Value is the interpolated Y.
> w
Week X Y Value
1 1 1 105 105.0
2 2 3 110 110.0
3 3 5 Inf NA
4 4 7 130 130.0
5 5 NA NA 137.5
6 6 NA NA 145.0
7 7 NA NA 152.5
8 8 15 160 160.0
9 9 NA NA 165.0
10 10 NA NA 170.0
11 11 NA NA 175.0
12 12 23 180 180.0
13 13 NA NA NA
14 14 NA NA NA
15 15 NA NA NA
16 16 30 Inf NA
17 17 NA NA NA
18 18 NA NA NA
19 19 NA NA NA
20 20 37 200 200.0
Note: Input DF in reproducible form:
Lines <- "
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200"
DF <- read.table(text = Lines, header = TRUE, na.strings = "N/A")
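The same Inf trick can also be sketched without zoo, using approx() from base R instead of na.approx(). This is a swapped-in technique shown under that assumption, not the code from the answer above. The data frame repeats the Week and Y columns from the question, and the names weeks and value are made up.

```r
DF <- data.frame(Week = c(1, 2, 3, 4, 8, 12, 16, 20),
                 Y = c(105, 110, NA, 130, 160, 180, NA, 200))

# Mark the original NAs as Inf so they "poison" any interval they border.
y <- DF$Y
y[is.na(y)] <- Inf

# Linear interpolation at every week between the first and the last one.
weeks <- seq(min(DF$Week), max(DF$Week))
value <- approx(DF$Week, y, xout = weeks)$y

# Anything that touched an Inf comes back as Inf or NaN; turn it into NA.
value[!is.finite(value)] <- NA

print(data.frame(Week = weeks, Value = value))
```

Weeks 5 to 7 and 9 to 11 are interpolated because they lie between two observed values, while weeks 13 to 19 stay NA because they border the missing value at week 16.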