I have a number of columns in a data frame that represent replicates of an experimental result.
Example here
1a 2a 3a 4a 5a
1 154 152 154 156 NA
2 154 154 154 NA NA
3 154 154 154 154 NA
4 154 154 154 154 NA
5 154 NA 154 154 NA
6 NA NA NA 154 NA
7 154 154 NA 154 NA
8 154 154 NA 154 NA
9 154 NA 154 150 NA
10 149 149 NA 149 149
What I would like is to create another column holding the value that occurs at least twice (>= 2) across the other columns in each row.
1a 2a 3a 4a 5a score
1 154 152 154 156 NA 154
2 154 154 154 NA NA 154
3 154 154 154 154 NA 154
4 154 154 154 154 NA 154
5 154 NA 154 154 NA 154
6 NA NA NA 154 NA NA
7 154 154 NA 154 NA 154
8 154 154 NA 154 NA 154
9 154 NA 154 150 NA 154
10 149 149 NA 149 149 149
EDIT: Modified example above to demonstrate.
flodel's answer of using the mode was initially successful; however, it would use a value even if it occurred only once. I would like the result to be either NA or a character string (whichever is easier) when no value occurs at least twice in a row.
You are not looking for the median but the mode, which is easy enough to define yourself:
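Mode <- function(x, min.freq = 1L) {
  f <- table(x)            # frequency of each non-NA value
  k <- f[f >= min.freq]    # keep values seen at least min.freq times
  # return the most frequent qualifying value, or NA if none qualifies
  if (length(k) > 0L) as.numeric(names(k)[which.max(k)]) else NA
}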
test$score <- apply(test2, 1, Mode, min.freq = 2L)
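As a minimal sketch of how the threshold behaves, the same idea can be run on a tiny made-up data frame. The names `mode_min_freq` and `reps` are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: mode of x, or NA when no value occurs min.freq times.
mode_min_freq <- function(x, min.freq = 2L) {
  f <- table(x)              # table() ignores NA by default
  k <- f[f >= min.freq]      # values that occur often enough
  if (length(k) > 0L) as.numeric(names(k)[which.max(k)]) else NA_real_
}

# Made-up replicate columns (assumption: all columns are numeric).
reps <- data.frame(
  r1 = c(154, NA, 149),
  r2 = c(152, NA, 149),
  r3 = c(154, NA, NA),
  r4 = c(156, 154, 149)
)

# One score per row: 154 occurs twice in row 1, nothing repeats in row 2.
reps$score <- apply(reps, 1, mode_min_freq, min.freq = 2L)
print(reps)
```

A row whose only value appears once gets NA, which is the behaviour the edit to the question asks for.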
data <- fread("V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
116 116 116 116 102 96 NA 106 116 NA 144
122 122 114 114 114 114 NA 121 111 98 108
118 88 78 78 77 72 96 NA 95 NA NA
118 118 77 NA 86 139 127 NA 103 93 84
150 150 154 154 121 121 114 111 NA NA NA
NA NA NA NA NA NA NA NA NA NA 141
174 174 174 125 118 117 116 139 116 102 104
183 183 183 175 175 176 NA 139 123 140 141
134 140 106 174 162 162 169 140 127 112 NA
178 178 178 NA NA 116 95 95 125 115 103")
I am trying to sum, for each row, only the elements that meet a condition (< 90), like this:
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 88+78+78+77+72
4: 118 118 77 NA 86 139 127 NA 103 93 84 77+86+84
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
The raw data is large (over 10,000 rows), so I would prefer to avoid a for loop.
Please use data.table.
Here's a simple way in base R:
data$sum <- rowSums(data * (data < 90), na.rm = TRUE)
In data.table, you can do:
data[ , sum := rowSums(data * (data < 90), na.rm = TRUE)]
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 393
4: 118 118 77 NA 86 139 127 NA 103 93 84 247
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
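The trick above works because a logical mask is coerced to 0/1 when multiplied, and NA cells stay NA until `na.rm = TRUE` drops them. A minimal base R sketch on made-up numbers (the names `scores` and `mask` are illustrative assumptions):

```r
# Made-up data with NAs (assumption: every column is numeric).
scores <- data.frame(
  V0 = c(116, 118, NA),
  V1 = c(88, 77, 141),
  V2 = c(NA, 84, 72)
)

# TRUE where the value is below 90, FALSE otherwise, NA where the value is NA.
mask <- scores < 90

# Values failing the condition are multiplied by 0; NA products are skipped.
scores$sum <- rowSums(scores * mask, na.rm = TRUE)
print(scores)
```

Computing the mask before adding the `sum` column keeps the new column out of its own calculation.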
library(dplyr)
df <- data.frame(x=c(1,100,5),y=c(200,5,15), z = c(1,2,NA))
df["sum"] <- df %>%
apply(2, function(x) ifelse(x < 90,x,0)) %>%
rowSums(na.rm = TRUE)
df
I have a table like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 NA 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 NA NA 114 NA 121 111 98 108 6
3: 10316168 88 78 NA 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 NA 121 121 114 111 NA NA NA 7
6: 10619463 42 NA NA NA NA NA NA NA NA 141 9
7: 10631362 174 NA 125 118 117 116 139 116 NA 104 10
8: 10725490 49 NA 175 NA 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 NA 169 140 127 112 NA 6
10: 10832134 10 178 NA NA 116 95 95 125 115 103 3
I am trying to fill the NAs with the value from the previous column
(if V2 is NA, fill it with the V1 value),
subject to the limit column (if limit is 3, fill NAs only up to V3 and leave the remaining NAs as they are).
What I am trying to get looks like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 10832134 10 178 178 NA 116 95 95 125 115 103 3
The actual data is pretty big, so it would be nice to solve this with data.table,
but other solutions (dplyr, tidyr, or anything else) are fine too.
Using data.table's set() function:
Code
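col <- paste0("V", 1:10)
# Walk the columns left to right so each fill can reuse the value
# already filled into the previous column.
for (i in 2:length(col)) {
  # rows where column i is NA and the row's limit still allows filling
  rows <- which(is.na(dt[[col[i]]]) & dt[["limit"]] >= i)
  set(
    x = dt,
    i = rows,
    j = col[i],
    value = dt[[col[i-1]]][rows]  # take the value from the previous column
  )
}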
Results
dt
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 10832134 10 178 178 NA 116 95 95 125 115 103 3
Data
dt <- fread(" ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
10167638 89 NA 116 102 96 NA 106 116 NA 144 3
10298462 74 114 NA NA 114 NA 121 111 98 108 6
10316168 88 78 NA 77 72 96 NA 95 NA NA 4
10423491 118 77 NA 86 139 127 NA 103 93 84 2
10497492 12 154 NA 121 121 114 111 NA NA NA 7
10619463 42 NA NA NA NA NA NA NA NA 141 9
10631362 174 NA 125 118 117 116 139 116 NA 104 10
10725490 49 NA 175 NA 176 NA 139 123 140 141 5
10767348 140 106 174 162 NA 169 140 127 112 NA 6
10832134 10 178 NA NA 116 95 95 125 115 103 3")
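For readers without data.table, the same column-by-column idea can be sketched in base R on a small made-up matrix. This is only an illustration under the assumption that the value columns are numeric and ordered left to right; `vals` and `limit` are hypothetical names.

```r
# Made-up values: one row per ID, one column per V1..V4.
vals <- rbind(
  c(89, NA, 116, NA),
  c(42, NA, NA, NA),
  c(118, 77, NA, 86)
)
limit <- c(3, 3, 2)   # last column position that may be filled, per row

# Left to right: a cell filled at step j - 1 can feed the fill at step j,
# so runs of consecutive NAs are carried forward until the limit is reached.
for (j in 2:ncol(vals)) {
  fill <- is.na(vals[, j]) & limit >= j
  vals[fill, j] <- vals[fill, j - 1]
}
print(vals)
```

Cells beyond a row's limit are never touched, so the trailing NA in the first two rows and the NA in the third row survive.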
You can try a tidyverse approach:
library(tidyverse)
dt %>%
  gather(k, v, -ID, -limit) %>% # reshape from wide to long
  mutate(k = factor(k, levels = unique(k))) %>% # keep column order for spreading in the last step
  group_by(ID) %>%
  mutate(gr = ifelse(is.na(v), 1:n(), 0)) %>% # column position of each NA, 0 otherwise
  fill(v) %>% # fill NAs with the previous value
  mutate(v = ifelse(limit >= gr, v, NA)) %>% # set back to NA where the position exceeds the limit
  select(-gr) %>%
  spread(k, v) # reshape back to wide
# A tibble: 10 x 12
# Groups: ID [10]
ID limit V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 10167638 3 89 89 116 102 96 NA 106 116 NA 144
2 10298462 6 74 114 114 114 114 114 121 111 98 108
3 10316168 4 88 78 78 77 72 96 NA 95 NA NA
4 10423491 2 118 77 NA 86 139 127 NA 103 93 84
5 10497492 7 12 154 154 121 121 114 111 NA NA NA
6 10619463 9 42 42 42 42 42 42 42 42 42 141
7 10631362 10 174 174 125 118 117 116 139 116 116 104
8 10725490 5 49 49 175 175 176 NA 139 123 140 141
9 10767348 6 140 106 174 162 162 169 140 127 112 NA
10 10832134 3 10 178 178 NA 116 95 95 125 115 103
I am struggling to figure out how to use tryCatch() to throw an error. I have read several blog posts, Hadley's write-up in Advanced R, and several SO posts, but for some reason it just hasn't sunk in yet. My dummy example is this: when a vector has a length that is less than 160, stop executing the function and instead give the user an error message. Pretty simple stuff, but apparently not for me. I feel like this function should do exactly that:
dummy_fun <- function(x) {
tryCatch(length(x) < 160 ,
error = function(e) {
print("An error message")
}
)
return(x*2)
}
But when I run the function, the error is not caught:
>dummy_fun(airquality$Ozone)
[1] 82 72 24 36 NA 56 46 38 16 NA 14 32 22 28 36 28 68 12 60 22 2 22 8 64 NA NA NA 46 90 230 74
[32] NA NA NA NA NA NA 58 NA 142 78 NA NA 46 NA NA 42 74 40 24 26 NA NA NA NA NA NA NA NA NA NA 270
[63] 98 64 NA 128 80 154 194 194 170 NA 20 54 NA 14 96 70 122 158 126 32 NA NA 160 216 40 104 164 100 128 118 78
[94] 18 32 156 70 132 244 178 220 NA NA 88 56 130 NA 44 118 46 62 88 42 18 NA 90 336 146 NA 152 236 168 170 192
[125] 156 146 182 94 64 40 46 42 48 88 42 56 18 26 92 36 26 48 32 26 46 72 14 28 60 NA 28 36 40
Even though length is clearly less than 160.
>length(airquality$Ozone) < 160
[1] TRUE
If I use stop or stopifnot, it stops the code but then automatically opens up the debugging window (at least in RStudio) whereas I'd just like an error, telling the user that there is an error:
dummy_fun2 <- function(x) {
stop(length(x) < 160 ,"An error message")
return(x*2)
}
dummy_fun2(airquality$Ozone)
And stopifnot:
dummy_fun3 <- function(x) {
stopifnot(length(x) < 160 ,"An error message")
return(x*2)
}
dummy_fun3(airquality$Ozone)
So, I am curious if anyone has any idea what I am doing wrong here. I'm sure I'll get this labelled as a duplicate post but I truly am lost with this.
dummy_fun <- function(x) {
if (length(x) < 160) stop("x is not big enough")
return(x*2)
}
dummy_fun(airquality$Ozone)
dummy_fun(rep(1, 50))
dummy_fun(rep(1, 500))
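To connect this with tryCatch(): a handler only runs when a condition is actually signalled, and `length(x) < 160` just returns TRUE without signalling anything. A minimal sketch, where `double_if_long` and `min_len` are hypothetical names chosen for illustration:

```r
# Hypothetical function: signal an error with stop() when x is too short.
double_if_long <- function(x, min_len = 160L) {
  if (length(x) < min_len) {
    # call. = FALSE keeps the function call out of the error message
    stop("x must have at least ", min_len, " elements, got ", length(x),
         call. = FALSE)
  }
  x * 2
}

# The caller decides what to do with the error; here it is reported
# as a message and NULL is returned instead.
result <- tryCatch(
  double_if_long(rep(1, 50)),
  error = function(e) {
    message("Caught: ", conditionMessage(e))
    NULL
  }
)
```

In this split, stop() raises the error inside the function and tryCatch() is used at the call site only if the caller wants to recover from it.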
Is this what you are looking for?
onError <- function(){
print("An error message")
}
dummy_fun <- function(x) {
tryCatch(length(x) < 160 , finally = onError())
return(x*2)
}
dummy_fun(airquality$Ozone)
[1] "An error message"
[1] 82 72 24 36 NA 56 46 38 16 NA 14 32 22 28 36 28 68 12 60 22 2 22 8 64 NA NA NA 46 90 230 74 NA NA NA
[35] NA NA NA 58 NA 142 78 NA NA 46 NA NA 42 74 40 24 26 NA NA NA NA NA NA NA NA NA NA 270 98 64 NA 128 80 154
[69] 194 194 170 NA 20 54 NA 14 96 70 122 158 126 32 NA NA 160 216 40 104 164 100 128 118 78 18 32 156 70 132 244 178 220 NA
[103] NA 88 56 130 NA 44 118 46 62 88 42 18 NA 90 336 146 NA 152 236 168 170 192 156 146 182 94 64 40 46 42 48 88 42 56
[137] 18 26 92 36 26 48 32 26 46 72 14 28 60 NA 28 36 40
I would like to remove all rows with 3 or more NA values. Below is an example of my data set (the actual data set has 95,000 rows):
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
So that it looks like this
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
495 8 134 NA NA 102
984 12 NA 123 110 97
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
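df1 <- read.table(con <- file("clipboard"), header = TRUE)  # read the table copied above
cnt_na <- apply(df1, 1, function(z) sum(is.na(z)))  # number of NAs in each row
df1[cnt_na < 3, ]  # keep rows with fewer than 3 NAs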
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
1 106 6 236 132 123 132
3 495 8 134 NA NA 102
4 984 12 NA 123 110 97
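With 95,000 rows, a vectorised count may be preferable to apply(). The sketch below uses rowSums(is.na(...)) on a data frame rebuilt by hand from the example; the name `trees` is an assumption made for illustration.

```r
# The example data, typed in directly instead of read from the clipboard.
trees <- data.frame(
  Plot_ID  = c(106, 204, 495, 984),
  Tree_ID  = c(6, 5, 8, 12),
  Dbh13    = c(236, NA, 134, NA),
  Dbh08    = c(132, NA, NA, 123),
  Dbh03    = c(123, NA, NA, 110),
  Dbh93_94 = c(132, 142, 102, 97)
)

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row.
na_count <- rowSums(is.na(trees))

# Keep only the rows with fewer than 3 missing values.
kept <- trees[na_count < 3, ]
print(kept)
```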
I have a dataframe in which a few rows have identical row names. For each such pair, I want to replace the NAs in the second row with the non-NA values from the identically named row immediately before it. If a value already exists in the second row, it should not be affected.
Please see below:
df:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 NA NA NA NA
20040131 NA 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040228 NA NA NA NA NA
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 NA NA
20040331 NA NA NA 175 180
I want the resulting data frame to be:
df_new:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 175 180
I have tried the following for loop, but results are not as desired:
for (i in 2:nrow(df)) {
if(all(is.na(df[i, ]))){ df[i, ] = fill[(i-1), ]}
out[i, ]<- df[i-1,ncol]
}
Please guide me in this regard.
Thanks
Saba
Here is an option using data.table. We place the datasets in a list and combine them into a single data.table with rbindlist. Then, grouped by 'date', we loop through the columns (lapply(.SD, ..)) and keep the non-NA elements.
library(data.table)
unique(rbindlist(list(df1, df2))[,lapply(.SD, function(x)
if(all(is.na(x))) x else x[!is.na(x)]) , date])
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180
Since the OP mentioned using a for loop and which, another data.table option that uses both, together with set, would be
setDT(df1)
dfN <- setDT(df2)[df1, on = "date"]
for(j in 2:ncol(df1)){
set(df1, i = which(is.na(df1[[j]])), j = j,
value = dfN[[j]][is.na(df1[[j]])])
}
df1
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180
An alternate solution using data.table:
library(data.table)
setDT(df)
df[, lapply(.SD, mean, na.rm = TRUE), by = date]
## date X11A X11A.1 X21B X3CC X3CC.1
##1: 20040101 100 150 NaN NaN 140
##2: 20040115 200 NaN 200 NaN NaN
##3: 20040131 170 165 180 190 190
##4: 20040205 NaN NaN NaN NaN NaN
##5: 20040228 140 145 165 150 155
##6: 20040301 150 155 170 150 160
##7: 20040315 NaN NaN 180 190 200
##8: 20040331 NaN 145 160 175 180
Assumption: when several rows share a date, each column holds at most one distinct non-NA value among them. Columns that are NA in every row of a group come back as NaN rather than NA, because the mean of an empty set is NaN.
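If NaN is unwanted, or if the one-value-per-group assumption might not hold, taking the first non-NA value per group is one alternative. The following is a base R sketch on made-up data; `first_non_na` and `dup` are hypothetical names, and aggregate() stands in for the data.table grouping used above.

```r
# Hypothetical helper: first non-NA element, or NA if there is none.
first_non_na <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok) > 0L) ok[1L] else NA_real_
}

# Made-up data: date 20040131 appears twice with complementary NAs.
dup <- data.frame(
  date = c(20040131, 20040131, 20040205),
  a    = c(170, NA, NA),
  b    = c(NA, 165, NA)
)

# One row per date; each column collapses to its first non-NA value.
collapsed <- aggregate(dup[-1], by = list(date = dup$date), FUN = first_non_na)
print(collapsed)
```

Unlike the mean, this keeps all-NA groups as NA and does not average values when a group happens to contain two different numbers.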