Remove rows with 3 or more NA values - R [duplicate]

I would like to remove all rows with 3 or more NA values - below is an example of my data set (but the actual data set has 95,000 rows)
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
So that it looks like this
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
495 8 134 NA NA 102
984 12 NA 123 110 97

Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
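# Copy the table above to the clipboard first, then read it in
df1 <- read.table(file("clipboard"), header = TRUE)
# Count the NA values in each row
cnt_na <- apply(df1, 1, function(z) sum(is.na(z)))
# Keep only the rows with fewer than 3 NAs
df1[cnt_na < 3, ]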
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
1 106 6 236 132 123 132
3 495 8 134 NA NA 102
4 984 12 NA 123 110 97
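With 95,000 rows, a vectorised count may be preferable to calling a function once per row. Here is a minimal sketch using rowSums(is.na(...)), assuming the data frame is small enough to hold its logical NA mask in memory; the name max_na is only illustrative and the toy data is copied from the question.

```r
# Toy data mirroring the question's layout (values copied from the example)
df1 <- data.frame(
  Plot_ID  = c(106, 204, 495, 984),
  Tree_ID  = c(6, 5, 8, 12),
  Dbh13    = c(236, NA, 134, NA),
  Dbh08    = c(132, NA, NA, 123),
  Dbh03    = c(123, NA, NA, 110),
  Dbh93_94 = c(132, 142, 102, 97)
)

max_na <- 3  # hypothetical name: rows with this many NAs (or more) are dropped

# is.na() gives a logical matrix; rowSums() counts the TRUEs per row
kept <- df1[rowSums(is.na(df1)) < max_na, ]
print(kept)
```

The row for Plot_ID 204 has exactly three NAs, so it is the only one removed.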


Pivot / Reshape data [closed]

My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into converting from wide to long format, but it doesn't seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option
u <- cbind(
  data[1],
  do.call(
    rbind,
    lapply(
      split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
      setNames,
      c("Value", "Time")
    )
  )
)
out <- `row.names<-`(
  subset(
    x <- u[order(u$pid), ],
    complete.cases(x)
  ), NULL
)
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
  pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
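For comparison, base R's reshape() can also stack paired measurement/time columns. The following is only a sketch on a small made-up data frame (wide, row_id and grp are names introduced here for illustration, not from the question). Because pid is not unique, a separate row identifier is used as idvar.

```r
# Made-up wide data with two measurement/time pairs per row
wide <- data.frame(
  pid          = c(1, 2, 2),
  measurement1 = c(10, 20, NA),
  Tdays1       = c(100, 200, NA),
  measurement2 = c(11, NA, 31),
  Tdays2       = c(101, NA, 301)
)
wide$row_id <- seq_len(nrow(wide))  # pid repeats, so identify rows separately

long <- reshape(
  wide,
  direction = "long",
  varying   = list(c("measurement1", "measurement2"),
                   c("Tdays1", "Tdays2")),
  v.names   = c("Value", "Time"),
  timevar   = "grp",
  times     = 1:2,
  idvar     = "row_id"
)

# Drop the all-NA pairs and the helper columns, then sort by pid
out <- long[complete.cases(long[c("Value", "Time")]), c("pid", "Value", "Time")]
out <- out[order(out$pid), ]
rownames(out) <- NULL
print(out)
```

The varying list has one element per output column, each naming the wide columns that feed it in the same order.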

fill NA with previous column and specific condition with data.table in R

I have a table like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 NA 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 NA NA 114 NA 121 111 98 108 6
3: 10316168 88 78 NA 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 NA 121 121 114 111 NA NA NA 7
6: 10619463 42 NA NA NA NA NA NA NA NA 141 9
7: 10631362 174 NA 125 118 117 116 139 116 NA 104 10
8: 10725490 49 NA 175 NA 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 NA 169 140 127 112 NA 6
10: 10832134 10 178 NA NA 116 95 95 125 115 103 3
I am trying to fill these NAs with the value from the previous column
(if V2 is NA, fill it with the V1 value),
subject to a condition given by limit (if limit is 3, only fill NAs up to V3 and leave any later NAs as they are).
So the result I am after looks like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 10832134 10 178 178 NA 116 95 95 125 115 103 3
The actual data is pretty big, so it would be nice to solve this problem with data.table,
but other solutions, such as dplyr or tidyr, are also fine.
Using data.table's set() function:
Code
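library(data.table)
col <- paste0("V", 1:10)
# Work left to right so a value filled into V(i-1) can carry on into V(i)
for (i in 2:length(col)) {
  # rows where V(i) is NA and limit still allows filling column i
  rows <- which(is.na(dt[[col[i]]]) & dt[["limit"]] >= i)
  set(
    x = dt,
    i = rows,
    j = col[i],
    value = dt[[col[i-1]]][rows]
  )
}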
Results
dt
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 110832134 10 178 178 NA 116 95 95 125 115 103 3
Data
dt <- fread(" ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
10167638 89 NA 116 102 96 NA 106 116 NA 144 3
10298462 74 114 NA NA 114 NA 121 111 98 108 6
10316168 88 78 NA 77 72 96 NA 95 NA NA 4
10423491 118 77 NA 86 139 127 NA 103 93 84 2
10497492 12 154 NA 121 121 114 111 NA NA NA 7
10619463 42 NA NA NA NA NA NA NA NA 141 9
10631362 174 NA 125 118 117 116 139 116 NA 104 10
10725490 49 NA 175 NA 176 NA 139 123 140 141 5
10767348 140 106 174 162 NA 169 140 127 112 NA 6
110832134 10 178 NA NA 116 95 95 125 115 103 3")
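The same left-to-right logic can be sketched without data.table on a plain numeric matrix. This is only an illustration under the assumption that the V columns are in order and limit is a per-row vector; fill_left_until is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: fill NA in column j from column j - 1,
# but only for rows whose limit is at least j
fill_left_until <- function(m, limit) {
  for (j in seq_len(ncol(m))[-1]) {
    hit <- is.na(m[, j]) & limit >= j
    m[hit, j] <- m[hit, j - 1]  # column j - 1 may itself have just been filled
  }
  m
}

m <- rbind(c(89, NA, 116, NA),
           c(42, NA, NA,  NA))
res <- fill_left_until(m, limit = c(3, 3))
print(res)
```

Because each column is processed after the one to its left, a run of consecutive NAs is filled in a cascade, and filling stops at the column given by limit.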
You can try a tidyverse approach
library(tidyverse)
dt %>%
  gather(k, v, -ID, -limit) %>% # reshape from wide to long
  mutate(k = factor(k, levels = unique(k))) %>% # keep the column order for the final spread
  group_by(ID) %>%
  mutate(gr = ifelse(is.na(v), 1:n(), 0)) %>% # record the column position of each NA
  fill(v) %>% # carry the previous value forward
  mutate(v = ifelse(limit >= gr, v, NA)) %>% # set back to NA where the position exceeds limit
  select(-gr) %>%
  spread(k, v) # reshape back to wide
# A tibble: 10 x 12
# Groups: ID [10]
ID limit V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 10167638 3 89 89 116 102 96 NA 106 116 NA 144
2 10298462 6 74 114 114 114 114 114 121 111 98 108
3 10316168 4 88 78 78 77 72 96 NA 95 NA NA
4 10423491 2 118 77 NA 86 139 127 NA 103 93 84
5 10497492 7 12 154 154 121 121 114 111 NA NA NA
6 10619463 9 42 42 42 42 42 42 42 42 42 141
7 10631362 10 174 174 125 118 117 116 139 116 116 104
8 10725490 5 49 49 175 175 176 NA 139 123 140 141
9 10767348 6 140 106 174 162 162 169 140 127 112 NA
10 110832134 3 10 178 178 NA 116 95 95 125 115 103

Modified: Replacing values of rows with identical rownames in a dataframe

I have a dataframe in which a few rows have identical row names. I want to replace the NAs in every second row with the non-NA values of the identically named row immediately before it. But if a value already exists in the second row, it should not be affected.
Please see below:
df:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 NA NA NA NA
20040131 NA 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040228 NA NA NA NA NA
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 NA NA
20040331 NA NA NA 175 180
I want the resulting data frame to be:
df_new:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 175 180
I have tried the following for loop, but the results are not as desired:
for (i in 2:nrow(df)) {
if(all(is.na(df[i, ]))){ df[i, ] = fill[(i-1), ]}
out[i, ]<- df[i-1,ncol]
}
Please guide me in this regard.
Thanks
Saba
Here is an option using data.table. We place the datasets in a list, combine them into a single data.table with rbindlist, group by 'date', loop through the columns (lapply(.SD, ...)) and keep the non-NA elements.
library(data.table)
unique(rbindlist(list(df1, df2))[, lapply(.SD, function(x)
  if (all(is.na(x))) x else x[!is.na(x)]), date])
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180
As the OP mentioned using a for loop and which, another data.table option that uses both of them together with set would be
setDT(df1)
dfN <- setDT(df2)[df1, on = "date"]
for (j in 2:ncol(df1)) {
  set(df1, i = which(is.na(df1[[j]])), j = j,
      value = dfN[[j]][is.na(df1[[j]])])
}
df1
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180
An alternate solution using data.table:
library(data.table)
setDT(df)
df[, lapply(.SD, mean, na.rm = TRUE), by = date]
## date X11A X11A.1 X21B X3CC X3CC.1
##1: 20040101 100 150 NaN NaN 140
##2: 20040115 200 NaN 200 NaN NaN
##3: 20040131 170 165 180 190 190
##4: 20040205 NaN NaN NaN NaN NaN
##5: 20040228 140 145 165 150 155
##6: 20040301 150 155 170 150 160
##7: 20040315 NaN NaN 180 190 200
##8: 20040331 NaN 145 160 175 180
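Assumption: when several rows occur for a single date, each column holds at most one distinct non-NA value across those rows; otherwise the mean would blend different values. Columns with no non-NA value for a date come out as NaN rather than NA, as the output shows.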
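If averaging is not wanted, one way to collapse duplicated dates without producing NaN is to take the first non-NA value per group. This is a base R sketch on made-up data (first_non_na and the columns a and b are illustrative names); it relies on aggregate() with na.action = na.pass so that rows containing NAs are not dropped before grouping.

```r
# Hypothetical helper: first non-NA value, or NA of the same type if there is none
first_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0L) keep[1] else x[1]
}

df <- data.frame(
  date = c(20040101, 20040131, 20040131, 20040205),
  a    = c(100, 170, NA, NA),
  b    = c(NA, NA, 165, NA)
)

# na.pass keeps rows with NAs so the helper can see them
out <- aggregate(. ~ date, data = df, FUN = first_non_na, na.action = na.pass)
print(out)
```

Unlike the mean, this keeps the first value when a group holds several different non-NA values, and it leaves all-NA groups as NA.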

Function or other basic script that compares values on two variables in a dataframe using an id variable located in both

Let's say you have two data frames, both of which contain some, but not all of the same records. Where they are the same records, the id variable in both data frames matches. There is a particular variable in each data frame that needs to be checked for consistency across the data frames, and any discrepancies need to be printed:
d1 <- ## first dataframe
d2 <- ## second dataframe
colnames(d1) #column headings for dataframe 1
[1] "id" "variable1" "variable2" "variable3"
colnames(d2) #column headings for dataframe 2 are identical
[1] "id" "variable1" "variable2" "variable3"
length(d1$id) #there are 200 records in dataframe 1
[1] 200
length(d2$id) #there are not the same number in dataframe 2
[1] 150
##Some function that takes d1$id, matches with d2$id, then compares the values of the matched, returning any discrepancies
I constructed an elaborate loop for this, but feel as though this is not the right way of going about it. Surely there is some better way than this for-if-for-if-if statement.
for (i in seq_along(d1$id)) { ## sets up counter for loop
  if (d1$id[i] %in% d2$id) { ## searches for, compares and saves a common id and variable
    index <- d1$id[i]
    variable_d1 <- d1$variable1[i]
    for (p in seq_along(d2$id)) { ## sets up inner counter
      if (d2$id[p] == index) { ## saves the corresponding value in the second dataframe
        variable_d2 <- d2$variable1[p]
        if (variable_d2 != variable_d1) { ## prints if they are not equal
          print(index)
        }
      }
    }
  }
}
Here's a solution, using random input data with a 50% chance that a given cell will be discrepant between d1 and d2:
set.seed(1);
d1 <- data.frame(id=sample(300,200),variable1=sample(2,200,replace=T),variable2=sample(2,200,replace=T),variable3=sample(2,200,replace=T));
d2 <- data.frame(id=sample(300,150),variable1=sample(2,150,replace=T),variable2=sample(2,150,replace=T),variable3=sample(2,150,replace=T));
head(d1);
## id variable1 variable2 variable3
## 1 80 1 2 2
## 2 112 1 1 2
## 3 171 2 2 1
## 4 270 1 2 2
## 5 60 1 2 2
## 6 266 2 2 2
head(d2);
## id variable1 variable2 variable3
## 1 258 1 2 1
## 2 11 1 1 1
## 3 290 2 1 2
## 4 222 2 1 2
## 5 81 2 1 1
## 6 200 1 2 1
com <- intersect(d1$id,d2$id); ## derive common id values
d1com <- match(com,d1$id); ## find indexes of d1 that correspond to common id values, in order of com
d2com <- match(com,d2$id); ## find indexes of d2 that correspond to common id values, in order of com
v1diff <- com[d1$variable1[d1com]!=d2$variable1[d2com]]; ## get ids of variable1 discrepancies
v1diff;
## [1] 60 278 18 219 290 35 107 4 237 131 50 210 29 168 6 174 61 127 99 220 247 244 157 51 84 122 196 125 265 115 186 139 3 132 223 211 268 102 155 207 238 41 199 200 231 236 172 275 250 176 248 255 222 59 100 33 124
v2diff <- com[d1$variable2[d1com]!=d2$variable2[d2com]]; ## get ids of variable2 discrepancies
v2diff;
## [1] 112 60 18 198 219 290 131 50 210 29 168 258 215 291 127 161 99 220 110 293 87 164 84 122 196 125 186 139 81 132 82 89 223 268 98 14 155 241 207 231 172 62 275 176 248 255 59 298 100 12 156
v3diff <- com[d1$variable3[d1com]!=d2$variable3[d2com]]; ## get ids of variable3 discrepancies
v3diff;
## [1] 278 219 290 35 4 237 131 168 202 174 215 220 247 244 261 293 164 13 294 84 196 125 265 115 186 81 3 89 223 211 268 98 14 155 241 207 38 191 200 276 250 45 269 255 298 100 12 156 124
Here's a proof that all variable1 values for ids in v1diff are really discrepant between d1 and d2:
d1$variable1[match(v1diff,d1$id)]; d2$variable1[match(v1diff,d2$id)];
## [1] 1 2 2 1 1 2 2 1 1 1 2 2 2 2 1 2 2 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 2 1 1 2 1 1 2 1 2 1 2 2 1 2 2 1 1
## [1] 2 1 1 2 2 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 2 1 2 1 1 2 1 1 2 2
Here's a proof that all variable1 values for ids not in v1diff are not discrepant between d1 and d2:
with(subset(d1,id%in%com&!id%in%v1diff),variable1[order(id)]); with(subset(d2,id%in%com&!id%in%v1diff),variable1[order(id)]);
## [1] 1 1 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 1 1 1
## [1] 1 1 2 1 1 1 2 2 1 2 2 1 2 2 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 1 1 1
Here, I wrapped this solution in a function which returns the vectors of discrepant id values in a list, with each component named for the variable it represents:
compare <- function(d1,d2,cols=setdiff(intersect(colnames(d1),colnames(d2)),'id')) {
com <- intersect(d1$id,d2$id);
d1com <- match(com,d1$id);
d2com <- match(com,d2$id);
setNames(lapply(cols,function(col) com[d1[[col]][d1com]!=d2[[col]][d2com]]),cols);
};
compare(d1,d2);
## $variable1
## [1] 60 278 18 219 290 35 107 4 237 131 50 210 29 168 6 174 61 127 99 220 247 244 157 51 84 122 196 125 265 115 186 139 3 132 223 211 268 102 155 207 238 41 199 200 231 236 172 275 250 176 248 255 222 59 100 33 124
##
## $variable2
## [1] 112 60 18 198 219 290 131 50 210 29 168 258 215 291 127 161 99 220 110 293 87 164 84 122 196 125 186 139 81 132 82 89 223 268 98 14 155 241 207 231 172 62 275 176 248 255 59 298 100 12 156
##
## $variable3
## [1] 278 219 290 35 4 237 131 168 202 174 215 220 247 244 261 293 164 13 294 84 196 125 265 115 186 81 3 89 223 211 268 98 14 155 241 207 38 191 200 276 250 45 269 255 298 100 12 156 124
Here is an approach using merge.
First, merge the dataframes, keeping all columns.
x <- merge(d1, d2, by="id")
Then, find all rows which do not match:
x[x$variable1.x != x$variable1.y | x$variable2.x != x$variable2.y |
x$variable3.x != x$variable3.y, ]
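A small worked example of the merge approach may help. The data below is made up, and the suffixes argument is used only to make the origin of each column obvious; which() is added as a precaution so that NA comparisons do not produce NA rows.

```r
# Made-up frames sharing some ids
d1 <- data.frame(id = c(1, 2, 3), variable1 = c(1, 2, 1))
d2 <- data.frame(id = c(2, 3, 4), variable1 = c(2, 2, 1))

# Inner join on id: only ids present in both frames survive
x <- merge(d1, d2, by = "id", suffixes = c(".d1", ".d2"))

# Rows where the two versions of variable1 disagree
mismatch <- x[which(x$variable1.d1 != x$variable1.d2), ]
print(mismatch)
```

Only id 3 differs here: ids 1 and 4 are dropped by the join, and id 2 has the same value in both frames.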

create new column based on most frequent value of previous columns R

I have a number of columns in a data frame that represent replicates of an experimental result.
Example here
1a 2a 3a 4a 5a
1 154 152 154 156 NA
2 154 154 154 NA NA
3 154 154 154 154 NA
4 154 154 154 154 NA
5 154 NA 154 154 NA
6 NA NA NA 154 NA
7 154 154 NA 154 NA
8 154 154 NA 154 NA
9 154 NA 154 150 NA
10 149 149 NA 149 149
What I would like is to create another column holding the value that occurs at least twice (>= 2) across the other columns in each row.
1a 2a 3a 4a 5a score
1 154 152 154 156 NA 154
2 154 154 154 NA NA 154
3 154 154 154 154 NA 154
4 154 154 154 154 NA 154
5 154 NA 154 154 NA 154
6 NA NA NA 154 NA NA
7 154 154 NA 154 NA 154
8 154 154 NA 154 NA 154
9 154 NA 154 150 NA 154
10 149 149 NA 149 149 149
EDIT: Modified example above to demonstrate.
flodel's answer using the mode was initially successful; however, it would use a value even if it occurred only once. I would like the result to be either NA or a character string (whichever is easier) if no value occurs at least twice in a row.
You are not looking for the median but the mode, which is easy enough to define yourself:
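Mode <- function(x, min.freq = 1L) {
  f <- table(x)          # frequency of each non-NA value
  k <- f[f >= min.freq]  # keep values that occur at least min.freq times
  if (length(k) > 0L) as.numeric(names(k)[which.max(k)]) else NA
}
# test2 is assumed to hold only the replicate columns of test
test$score <- apply(test2, 1, Mode, min.freq = 2L)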
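To see the minimum-frequency rule in action, here is a self-contained sketch on a three-row toy frame. The function name mode_min and the column names are made up for this illustration; the logic follows the same table()-based idea.

```r
# Hypothetical variant: most frequent value, or NA if nothing occurs min.freq times
mode_min <- function(x, min.freq = 2L) {
  f <- table(x)  # table() ignores NA by default
  if (length(f) == 0L || max(f) < min.freq) return(NA_real_)
  as.numeric(names(f)[which.max(f)])
}

test <- data.frame(
  a = c(154, NA, 149),
  b = c(152, NA, 149),
  c = c(154, 154, NA)
)

# Apply row-wise before the score column exists, so it is not counted itself
test$score <- apply(test, 1, mode_min)
print(test)
```

The second row contains 154 only once, so its score is NA, while the other two rows have a value that appears twice.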
