Modified: Replacing values of rows with identical row names in a data frame - r

I have a data frame in which a few rows share identical row names. I want to replace the NAs in every second row with the non-NA values from the immediately preceding row that has the same name. If the second row already holds a value, that value should not be affected.
Please see below:
df:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 NA NA NA NA
20040131 NA 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040228 NA NA NA NA NA
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 NA NA
20040331 NA NA NA 175 180
I want the resulting data frame to be:
df_new:
date 1 1 2 3 3
20040101 100 150 NA NA 140
20040115 200 NA 200 NA NA
20040131 170 165 180 190 190
20040205 NA NA NA NA NA
20040228 140 145 165 150 155
20040301 150 155 170 150 160
20040315 NA NA 180 190 200
20040331 NA 145 160 175 180
I have tried the following for loop, but results are not as desired:
for (i in 2:nrow(df)) {
if(all(is.na(df[i, ]))){ df[i, ] = fill[(i-1), ]}
out[i, ]<- df[i-1,ncol]
}
Please guide me in this regard.
Thanks
Saba

Here is an option using data.table. We place the datasets in a list, combine them into a single data.table with rbindlist, and then, grouped by 'date', loop through the columns (lapply(.SD, ...)) and keep the non-NA elements.
library(data.table)
unique(rbindlist(list(df1, df2))[,lapply(.SD, function(x)
if(all(is.na(x))) x else x[!is.na(x)]) , date])
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180
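If the duplicated dates live in a single table, as in the question, the same grouping idea can be sketched as follows. This is only an illustration: the columns a and b and the helper first_non_na are made-up names, and it assumes that keeping the first non-NA value per date is acceptable when a date has more than one.

```r
library(data.table)

# Hypothetical sample with duplicated dates (a and b are made-up column names)
df <- data.table(
  date = c(20040131, 20040131, 20040205, 20040331, 20040331),
  a    = c(170, NA, NA, NA, NA),
  b    = c(NA, 165, NA, 145, NA)
)

# First non-NA value of a vector; indexing an empty vector with [1L] yields NA
first_non_na <- function(x) x[!is.na(x)][1L]

# One row per date, each column collapsed to its first non-NA value
df_new <- df[, lapply(.SD, first_non_na), by = date]
df_new
```

Because the helper always returns exactly one value, each date ends up as a single row and no call to unique() is needed.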
As the OP mentioned using a for loop and which, another data.table option that uses both, together with set, would be
setDT(df1)
dfN <- setDT(df2)[df1, on = "date"]
for(j in 2:ncol(df1)){
set(df1, i = which(is.na(df1[[j]])), j = j,
value = dfN[[j]][is.na(df1[[j]])])
}
df1
# date X11A X11A.1 X21B X3CC X3CC.1
#1: 20040101 100 150 NA NA 140
#2: 20040115 200 NA 200 NA NA
#3: 20040131 170 165 180 190 190
#4: 20040205 NA NA NA NA NA
#5: 20040228 140 145 165 150 155
#6: 20040301 150 155 170 150 160
#7: 20040315 NA NA 180 190 200
#8: 20040331 NA 145 160 175 180

An alternate solution using data.table:
library(data.table)
setDT(df)
df[, lapply(.SD, mean, na.rm = TRUE), by = date]
## date X11A X11A.1 X21B X3CC X3CC.1
##1: 20040101 100 150 NaN NaN 140
##2: 20040115 200 NaN 200 NaN NaN
##3: 20040131 170 165 180 190 190
##4: 20040205 NaN NaN NaN NaN NaN
##5: 20040228 140 145 165 150 155
##6: 20040301 150 155 170 150 160
##7: 20040315 NaN NaN 180 190 200
##8: 20040331 NaN 145 160 175 180
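Assumption: when several rows share a single date, each column has at most one non-NA value among them. If a column had several different values for the same date, they would be averaged. Groups with no non-NA value come out as NaN rather than NA, as the output above shows.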
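The two caveats of the mean-based approach can be checked on plain vectors. The snippet below is a small base R sketch with made-up numbers, not part of the original answer.

```r
# A group with no non-NA value: the mean of zero values is NaN, not NA
all_na <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# A group with two different values: they are silently averaged
conflict <- mean(c(100, 200, NA), na.rm = TRUE)

# NaN can be turned back into NA afterwards if that matters downstream
res <- c(all_na, conflict)
res[is.nan(res)] <- NA
res
```

If conflicting values for one date are possible, a function that picks a single value (for example the first non-NA one) may be a safer choice than mean.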

Related

How to drop last value in multiple columns of different lengths in R (based on group)?

I have a dataset where there is an ID column and 10 columns of data. Each ID has a different number of (non-NA) entries for each column, and I want to remove the first and last entry for each ID for each column. For the first entries, this was not a problem because every ID has their first entries on their respective first row, and thus the following code works:
data <- grouped_df(data, vars=c("ID"), drop = group_by_drop_default(data))
data <- data %>% slice(-c(1))
However, for the last entries, as they cannot be identified by a particular row for each ID, I'm at a loss. Thanks to this answer I found that by making the data frame into a data.table and using sapply, I can copy the very last non-NA entries into a new list, but I don't know how to get that for each ID (group), and I don't know how to remove them from the dataset (or put NAs in their place).
So basically, from the following dataset:
1 2 3 4 5 6 7 8 9 10 ID
1: 748 726 743 731 696 786 710 732 802 784 A
2: 707 724 760 728 730 798 668 696 492 341 A
3: NA 743 754 704 729 26 NA 675 NA NA A
4: NA 740 754 691 708 NA NA 79 NA NA A
5: NA 739 87 69 463 NA NA NA NA NA A
6: NA 594 NA NA NA NA NA NA NA NA A
7: 950 814 878 792 743 796 774 700 826 827 B
8: 402 772 789 823 773 732 796 664 857 889 B
9: NA 819 812 746 744 706 824 656 760 834 B
10: NA 3 694 782 702 750 771 677 798 759 B
11: NA NA NA 650 512 835 29 123 303 240 B
12: NA NA NA NA NA 226 NA NA NA NA B
I'd like to remove/replace the last non-NA entry for each column for each ID, resulting in
1 2 3 4 5 6 7 8 9 10 ID
1: 748 726 743 731 696 786 710 732 802 784 A
2: NA 724 760 728 730 798 NA 696 NA NA A
3: NA 743 754 704 729 NA NA 675 NA NA A
4: NA 740 754 691 708 NA NA NA NA NA A
5: NA 739 NA NA NA NA NA NA NA NA A
6: NA NA NA NA NA NA NA NA NA NA A
7: 950 814 878 792 743 796 774 700 826 827 B
8: NA 772 789 823 773 732 796 664 857 889 B
9: NA 819 812 746 744 706 824 656 760 834 B
10: NA NA NA 782 702 750 771 677 798 759 B
11: NA NA NA NA NA 835 NA NA NA NA B
12: NA NA NA NA NA NA NA NA NA NA B
I would really appreciate any help I could get with this! I'm fairly new at R and cannot wrap my head around how to do it.
library(data.table)
# Sample data
DT <- fread(" 1 2 3 4 5 6 7 8 9 10 ID
748 726 743 731 696 786 710 732 802 784 A
707 724 760 728 730 798 668 696 492 341 A
NA 743 754 704 729 26 NA 675 NA NA A
NA 740 754 691 708 NA NA 79 NA NA A
NA 739 87 69 463 NA NA NA NA NA A
NA 594 NA NA NA NA NA NA NA NA A
950 814 878 792 743 796 774 700 826 827 B
402 772 789 823 773 732 796 664 857 889 B
NA 819 812 746 744 706 824 656 760 834 B
NA 3 694 782 702 750 771 677 798 759 B
NA NA NA 650 512 835 29 123 303 240 B
NA NA NA NA NA 226 NA NA NA NA B", header = TRUE)
# Add row id
DT[, rowid := .I]
# Melt to long, drop NA-values
DT.melt <- melt(DT, id.vars = c("ID", "rowid"), na.rm = TRUE)
# Set last entry of each group to NA
DT.melt[DT.melt[, .I[.N], by = .(ID, variable)][['V1']], value := NA]
# Cast to wide again, fill missing with NA
dcast(DT.melt, ID + rowid ~ variable, value.var = "value", fill = NA)
# ID rowid 1 2 3 4 5 6 7 8 9 10
# 1: A 1 748 726 743 731 696 786 710 732 802 784
# 2: A 2 NA 724 760 728 730 798 NA 696 NA NA
# 3: A 3 NA 743 754 704 729 NA NA 675 NA NA
# 4: A 4 NA 740 754 691 708 NA NA NA NA NA
# 5: A 5 NA 739 NA NA NA NA NA NA NA NA
# 6: A 6 NA NA NA NA NA NA NA NA NA NA
# 7: B 7 950 814 878 792 743 796 774 700 826 827
# 8: B 8 NA 772 789 823 773 732 796 664 857 889
# 9: B 9 NA 819 812 746 744 706 824 656 760 834
#10: B 10 NA NA NA 782 702 750 771 677 798 759
#11: B 11 NA NA NA NA NA 835 NA NA NA NA
#12: B 12 NA NA NA NA NA NA NA NA NA NA
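The same result can probably also be reached without reshaping, by blanking the last non-NA value of each column within each ID. The following is a base R sketch under that assumption; drop_last, dat, v1 and v2 are hypothetical names, and the data is a cut-down version of the example.

```r
# Set the last non-NA element of a vector to NA (no-op if everything is NA)
drop_last <- function(x) {
  idx <- which(!is.na(x))
  if (length(idx) > 0L) x[idx[length(idx)]] <- NA
  x
}

# Hypothetical cut-down data: two groups, two value columns
dat <- data.frame(
  ID = c("A", "A", "A", "B", "B", "B"),
  v1 = c(748, 707, NA, 950, 402, NA),
  v2 = c(726, 724, 743, 814, NA, NA)
)

# ave() applies drop_last within each ID and keeps the original row order
cols <- c("v1", "v2")
dat[cols] <- lapply(dat[cols], function(col) ave(col, dat$ID, FUN = drop_last))
dat
```

This keeps the table in wide form throughout, which may be convenient when the row order has to be preserved.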

pick element with condition and sum by row in r data.table

library(data.table)
data <- fread("V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
116 116 116 116 102 96 NA 106 116 NA 144
122 122 114 114 114 114 NA 121 111 98 108
118 88 78 78 77 72 96 NA 95 NA NA
118 118 77 NA 86 139 127 NA 103 93 84
150 150 154 154 121 121 114 111 NA NA NA
NA NA NA NA NA NA NA NA NA NA 141
174 174 174 125 118 117 116 139 116 102 104
183 183 183 175 175 176 NA 139 123 140 141
134 140 106 174 162 162 169 140 127 112 NA
178 178 178 NA NA 116 95 95 125 115 103")
I want to sum the elements in each row that satisfy a condition (< 90), like this:
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 88+78+78+77+72
4: 118 118 77 NA 86 139 127 NA 103 93 84 77+86+84
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
The raw data is large (over 10,000 rows), so I would rather avoid a for loop.
Please use data.table.
Here's a simple way in base R:
data$sum <- rowSums(data * (data < 90), na.rm = TRUE)
In data.table, you can do:
data[ , sum := rowSums(data * (data < 90), na.rm = TRUE)]
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 393
4: 118 118 77 NA 86 139 127 NA 103 93 84 247
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
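Why the multiplication works can be seen on a tiny matrix. This is a base R sketch with two made-up rows: the comparison yields a TRUE/FALSE/NA mask, multiplying by it zeroes out the values that fail the condition, and na.rm = TRUE drops the cells that were NA.

```r
# Two hypothetical rows with a few NAs
m <- matrix(c(118, 88, 78, NA,
              116, 96, NA, 144), nrow = 2, byrow = TRUE)

# Logical mask: TRUE where the value is below 90, NA where the value is NA
mask <- m < 90

# TRUE counts as 1 and FALSE as 0, so only values below 90 survive
total <- rowSums(m * mask, na.rm = TRUE)

# Equivalent formulation: replace everything that should not count with 0
total2 <- rowSums(replace(m, is.na(m) | m >= 90, 0))
total
```

Both versions are vectorized over the whole table, so no explicit loop over rows is involved.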
library(dplyr)
df <- data.frame(x=c(1,100,5),y=c(200,5,15), z = c(1,2,NA))
df["sum"] <- df %>%
apply(2, function(x) ifelse(x < 90,x,0)) %>%
rowSums(na.rm = TRUE)
df

Remove rows with 3 or more NA values - R [duplicate]

I would like to remove all rows with 3 or more NA values - below is an example of my data set (but the actual data set has 95,000 rows)
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
So that it looks like this
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
495 8 134 NA NA 102
984 12 NA 123 110 97
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
106 6 236 132 123 132
204 5 NA NA NA 142
495 8 134 NA NA 102
984 12 NA 123 110 97
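# Read the table above from the clipboard
df1 <- read.table(file("clipboard"), header = TRUE)
# Count the NAs in each row
cnt_na <- apply(df1, 1, function(z) sum(is.na(z)))
# Keep only the rows with fewer than 3 NAs
df1[cnt_na < 3, ]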
Plot_ID Tree_ID Dbh13 Dbh08 Dbh03 Dbh93_94
1 106 6 236 132 123 132
3 495 8 134 NA NA 102
4 984 12 NA 123 110 97
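With 95,000 rows, a vectorized count of the NAs may be preferable to apply. The sketch below rebuilds the sample data from the question as a data frame and uses rowSums(is.na(...)); the name kept is made up for the illustration.

```r
# Sample data from the question
df1 <- data.frame(
  Plot_ID  = c(106, 204, 495, 984),
  Tree_ID  = c(6, 5, 8, 12),
  Dbh13    = c(236, NA, 134, NA),
  Dbh08    = c(132, NA, NA, 123),
  Dbh03    = c(123, NA, NA, 110),
  Dbh93_94 = c(132, 142, 102, 97)
)

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row
kept <- df1[rowSums(is.na(df1)) < 3, ]
kept
```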

How to only read real values in a set

This might be a simple question, but I haven't figured it out.
I am writing a simple loop.
v6<-c()
> aa
[1] 1 4 8 9 10 12 15 16 17 18 19 20 21 25 29 30 38
[18] 39 46 47 48 49 52 53 54 60 65 69 73 75 81 82 83 85
[35] 86 87 90 91 92 94 96 97 98 99 100 101 104 105 106 110 112
[52] 113 114 116 117 118 119 122 125 126 128 129
for (i in aa){
v6[i]<-sum(as.numeric(Sep1$Units[Sep1$ID==i]))
}
> v6
[1] 3800 NA NA 2600 NA NA NA 7700 13500 11900 NA
[12] 15600 NA NA 2000 17700 9600 11600 3400 11200 6600 NA
[23] NA NA 6000 NA NA NA 8800 2400 NA NA NA
[34] NA NA NA NA 2600 4500 NA NA NA NA NA
[45] NA 23400 36000 4000 5100 NA NA 9200 5400 7000 NA
[56] NA NA NA NA 5000 NA NA NA NA 60000 NA
[67] NA NA 7200 NA NA NA 20000 NA 39600 NA NA
[78] NA NA NA 23600 1600 10600 NA 39000 1000 6200 NA
[89] NA 3000 100 1400 NA 12800 NA 5100 2000 32000 7000
[100] 10900 4800 NA NA 3200 14600 24000 NA NA NA 16200
[111] NA 5000 28800 16800 NA 2600 40000 800 8400 NA NA
[122] 18000 NA NA 24800 13600 NA 4600 11700
I realized R has read 1 through 129 instead of just reading "1, 4, 8, ...". I know I can use na.omit(v6) to remove all the NA values, but I am wondering whether there is a way to make R read just the values in "aa" instead of going through 1 to 129?
I don't know if I have explained my question well. Thanks
In R, a for loop like this one can usually be replaced by a vectorized call.
You would need to provide test data for me to confirm that this works, but I believe the following statement will do what you want without a for loop:
v6 <- sapply(aa, function(i) sum(as.numeric(Sep1$Units[Sep1$ID == i])))
This computes one sum per element of aa, in the same order, so v6 has length(aa) entries and no NA gaps. By contrast, an assignment of the form "v6[i] <-" or "v6[aa] <-" writes to the positions of v6 given by the values of aa and leaves every other position up to max(aa) as NA, which is why the loop produced a vector with 129 entries.
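To make the difference between indexing by value and indexing by position concrete, here is a small sketch. Sep1 is not shown in the question, so the data frame below is a made-up stand-in with the same column names, and aa is shortened to three IDs.

```r
# Hypothetical stand-in for Sep1 (Units stored as text, as as.numeric() suggests)
Sep1 <- data.frame(
  ID    = c(1, 1, 4, 8, 8, 8),
  Units = c("100", "200", "50", "10", "20", "30")
)
aa <- c(1, 4, 8)

# Loop over positions in aa, not over its values, so v6 has no gaps
v6 <- numeric(length(aa))
for (k in seq_along(aa)) {
  v6[k] <- sum(as.numeric(Sep1$Units[Sep1$ID == aa[k]]))
}

# Loop-free alternative: one sum per ID, named by ID
v6_t <- tapply(as.numeric(Sep1$Units), Sep1$ID, sum)
v6
```

The tapply result is named by ID, so v6_t[as.character(aa)] picks out the sums for the IDs of interest.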

create new column based on most frequent value of previous columns R

I have a number of columns in a data frame that represent replicates of an experimental result.
Example here
1a 2a 3a 4a 5a
1 154 152 154 156 NA
2 154 154 154 NA NA
3 154 154 154 154 NA
4 154 154 154 154 NA
5 154 NA 154 154 NA
6 NA NA NA 154 NA
7 154 154 NA 154 NA
8 154 154 NA 154 NA
9 154 NA 154 150 NA
10 149 149 NA 149 149
What I would like is to create another column containing the value that occurs at least twice (>= 2) across the other columns in each row.
1a 2a 3a 4a 5a score
1 154 152 154 156 NA 154
2 154 154 154 NA NA 154
3 154 154 154 154 NA 154
4 154 154 154 154 NA 154
5 154 NA 154 154 NA 154
6 NA NA NA 154 NA NA
7 154 154 NA 154 NA 154
8 154 154 NA 154 NA 154
9 154 NA 154 150 NA 154
10 149 149 NA 149 149 149
EDIT: Modified example above to demonstrate.
flodel's answer using the mode was initially successful; however, it returns a value even if that value occurs only once. I would like the result to be either NA or a character string (whichever is easier) if no value occurs at least twice in a row.
You are not looking for the median but the mode, which is easy enough to define yourself:
Mode <- function(x, min.freq = 1L) {
# Frequency table of the non-NA values
f <- table(x)
# Keep only the values that occur at least min.freq times
k <- f[f >= min.freq]
if (length(k) > 0L) as.numeric(names(k)[which.max(k)]) else NA
}
test$score <- apply(test2, 1, Mode, min.freq = 2L)
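A self-contained check of this idea on four rows taken from the example (rows 1, 6, 9 and 10) might look like the sketch below. The column names r1 to r5 are made up, since names such as 1a are not syntactic in R, and the function is restated so the snippet runs on its own.

```r
# Most frequent value, or NA if no value occurs at least min.freq times
Mode <- function(x, min.freq = 1L) {
  f <- table(x)            # table() ignores NA by default
  k <- f[f >= min.freq]    # values that are frequent enough
  if (length(k) > 0L) as.numeric(names(k)[which.max(k)]) else NA_real_
}

# Rows 1, 6, 9 and 10 of the example, with hypothetical column names
test <- data.frame(
  r1 = c(154, NA, 154, 149),
  r2 = c(152, NA, NA, 149),
  r3 = c(154, NA, 154, NA),
  r4 = c(156, 154, 150, 149),
  r5 = c(NA, NA, NA, 149)
)

# Row-wise mode, requiring at least two occurrences
score <- apply(test, 1, Mode, min.freq = 2L)
score
```

When two values are tied for the highest count, which.max returns the first one, so the smaller value wins in that case.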
