pick element with condition and sum by row in r data.table


data <- fread("V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
116 116 116 116 102 96 NA 106 116 NA 144
122 122 114 114 114 114 NA 121 111 98 108
118 88 78 78 77 72 96 NA 95 NA NA
118 118 77 NA 86 139 127 NA 103 93 84
150 150 154 154 121 121 114 111 NA NA NA
NA NA NA NA NA NA NA NA NA NA 141
174 174 174 125 118 117 116 139 116 102 104
183 183 183 175 175 176 NA 139 123 140 141
134 140 106 174 162 162 169 140 127 112 NA
178 178 178 NA NA 116 95 95 125 115 103")
I am trying to sum, for each row, only the elements that satisfy a condition (< 90), like this:
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 88+78+78+77+72
4: 118 118 77 NA 86 139 127 NA 103 93 84 77+86+84
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
The raw data is large (over 10,000 rows), so I would rather avoid a for loop.
Please use data.table.

Here's a simple way in base R:
data$sum <- rowSums(data * (data < 90), na.rm = TRUE)
In data.table, you can do:
data[ , sum := rowSums(data * (data < 90), na.rm = TRUE)]
V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 sum
1: 116 116 116 116 102 96 NA 106 116 NA 144 0
2: 122 122 114 114 114 114 NA 121 111 98 108 0
3: 118 88 78 78 77 72 96 NA 95 NA NA 393
4: 118 118 77 NA 86 139 127 NA 103 93 84 247
5: 150 150 154 154 121 121 114 111 NA NA NA 0
6: NA NA NA NA NA NA NA NA NA NA 141 0
7: 174 174 174 125 118 117 116 139 116 102 104 0
8: 183 183 183 175 175 176 NA 139 123 140 141 0
9: 134 140 106 174 162 162 169 140 127 112 NA 0
10: 178 178 178 NA NA 116 95 95 125 115 103 0
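A variant of the data.table one-liner above, as a minimal sketch: it restricts the calculation to named value columns through `.SDcols`, so an ID column or a previously added `sum` column is not swept into the row sums. The small table and the `value_cols` vector are assumptions made for illustration, not the asker's data.

```r
library(data.table)

# Small illustrative table (assumed data, same shape as the question's)
dt <- data.table(
  V0 = c(116, 118, 118, NA),
  V1 = c(116,  88, 118, NA),
  V2 = c(116,  78,  77, 141)
)

# Columns that take part in the conditional sum (assumed names)
value_cols <- c("V0", "V1", "V2")

dt[, sum := {
  m <- as.matrix(.SD)                   # numeric matrix of the value columns
  rowSums(m * (m < 90), na.rm = TRUE)   # values >= 90 contribute 0, NAs are dropped
}, .SDcols = value_cols]

print(dt)
```

Multiplying by the logical matrix `m < 90` zeroes out every value that fails the condition, which is the same trick the answer uses; only the column selection differs.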

library(dplyr)
df <- data.frame(x=c(1,100,5),y=c(200,5,15), z = c(1,2,NA))
df["sum"] <- df %>%
apply(2, function(x) ifelse(x < 90,x,0)) %>%
rowSums(na.rm = TRUE)
df

Related

Why no changes in for loop

My data looks like below.
ID Group timing glucose_level
<chr> <dbl> <int> <dbl>
1 black 7 0 0 136
2 black 1 0 0 116
3 blue 20 0 0 144
4 green 18 0 0 114
5 red 4 0 0 126
6 red 5 0 0 80
7 green 17 0 0 111
8 green 3 0 0 109
9 red 20 0 0 96
10 black 39 0 0 140
There are some missing values in the glucose level column.
Below is part of the glucose level data:
[697] 128 157 132 142 141 128 97 120 123 131 132 126 140 103 147 181 217 257 218 234 240 281 273 224 210 227 NA NA 245
[726] 230 252 270 238 134 173 193 151 128 180 218 218 190 225 214 186 140 237 239 279 246 244 146 196 157 178 140 127 187
[755] 206 177 220 179 167 127 219 223 241 162 235 140 187 154 172 116 139 194 173 150 187 131 176 114 154 180 223 150 219
[784] 130 169 104 136 132 121 175 169 128 110 101 100 92 122 196 203 96 143 129 NA 72 141 143 129 149 132 107 94 76
[813] 80 95 63 198 181 86 122
I want to use a loop to replace the missing values.
Here is my code:
for(i in 1:length(data)){
if(is.na(data[i,'glucose_level'])){
if(data[i,'Group']==0){
data[i,'glucose_level']=162.7059
}else if(data[i,'Group']==1){
data[i,'glucose_level']= 163.1415
}else{
data[i,'glucose_level']= 165.9106
}
}
}
I printed data$glucose_level and found that it still contains missing values. Why did my data not change?

You can use nested ifelse or case_when and check for conditions and assign values accordingly.
library(dplyr)
data <- data %>%
mutate(glucose_level = case_when(!is.na(glucose_level) ~ glucose_level,
Group == 0 ~ 162.7059,
Group == 1 ~ 163.1415,
TRUE ~ 165.9106))

We can use fcase from data.table
library(data.table)
# fcase() takes condition/value pairs; the fallback value goes in `default`
setDT(data)[, glucose_level := fcase(!is.na(glucose_level), glucose_level,
Group == 0, 162.7059,
Group == 1, 163.1415,
default = 165.9106)]
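Neither answer says why the original loop left NAs behind. A likely cause, assuming `data` is a data frame or tibble, is that `length(data)` returns the number of columns rather than the number of rows, so `1:length(data)` only visits the first few rows. The sketch below uses a small made-up data frame and keeps the asker's loop, changing only the range to `seq_len(nrow(data))`.

```r
# Made-up data for illustration: two columns, six rows
data <- data.frame(
  Group = c(0, 1, 2, 0, 1, 2),
  glucose_level = c(136, NA, 144, NA, 126, NA)
)

n_cols <- length(data)  # counts columns
n_rows <- nrow(data)    # counts rows

# Iterate over rows, not columns
for (i in seq_len(n_rows)) {
  if (is.na(data[i, "glucose_level"])) {
    if (data[i, "Group"] == 0) {
      data[i, "glucose_level"] <- 162.7059
    } else if (data[i, "Group"] == 1) {
      data[i, "glucose_level"] <- 163.1415
    } else {
      data[i, "glucose_level"] <- 165.9106
    }
  }
}

print(data)
```

Here `n_cols` is 2 and `n_rows` is 6, so a loop over `1:length(data)` would have stopped after the second row and left the NAs in rows 4 and 6 untouched.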

Calculating new column as mean of selected columns in R data frame

I have a large (ish) data frame and I want to use dplyr mutate function (or suitable alternative) to calculate the mean of selected columns.
For example, suppose I had a data frame as follows:
colnames(dall)
[1] "Code" "LA.Name" "LA_Name" "Jan.20" "Feb.20" "Mar.20" "Apr.20" "May.20" "Jun.20"
[10] "Jul.20" "Aug.20" "Sep.20" "Oct.20" "Nov.20" "Dec.20" "Jan.19" "Feb.19" "Mar.19"
[19] "Apr.19" "May.19" "Jun.19" "Jul.19" "Aug.19" "Sep.19" "Oct.19" "Nov.19" "Dec.19"
[28] "Jan.18" "Feb.18" "Mar.18" "Apr.18" "May.18" "Jun.18" "Jul.18" "Aug.18" "Sep.18"
[37] "Oct.18" "Nov.18" "Dec.18" "Jan.17" "Feb.17" "Mar.17" "Apr.17" "May.17" "Jun.17"
[46] "Jul.17" "Aug.17" "Sep.17" "Oct.17" "Nov.17" "Dec.17" "Jan.16" "Feb.16" "Mar.16"
[55] "Apr.16" "May.16" "Jun.16" "Jul.16" "Aug.16" "Sep.16" "Oct.16" "Nov.16" "Dec.16"
[64] "Jan.15" "Feb.15" "Mar.15" "Apr.15" "May.15" "Jun.15" "Jul.15" "Aug.15" "Sep.15"
[73] "Oct.15" "Nov.15" "Dec.15"
I'm trying to create a new column with the mean of January data from 2015 to 2019.
Have tried several methods. Latest as follows:
mutate(dall, mJan15to19 = mean(Jan.15,Jan.16,Jan.17,Jan.18,Jan.19))
I get the following back:
Error in mean.default(Jan.15, Jan.16, Jan.17, Jan.18, Jan.19) :
'trim' must be numeric of length one
In addition: Warning message:
In if (na.rm) x <- x[!is.na(x)] :
the condition has length > 1 and only the first element will be used
The cells I'm trying to average are numeric.
Can you help?
UPDATE:
Tried:
head(dall) %>% mutate(new = rowMeans(select(., Jan.15:Jan.19)))
Returned the following:
head(dall) %>% mutate(new = rowMeans(select(., Jan.15:Jan.19)))
Code LA.Name LA_Name Jan.20 Feb.20 Mar.20 Apr.20 May.20 Jun.20
1 E06000001 Hartlepool Hartlepool 108 76 89 NA NA NA
2 E06000002 Middlesbrough Middlesbrough 178 98 135 NA NA NA
3 E06000003 Redcar and Cleveland Redcar and Cleveland 150 148 126 NA NA NA
4 E06000004 Stockton-on-Tees Stockton-on-Tees 202 124 175 NA NA NA
5 E06000005 Darlington Darlington 137 90 116 NA NA NA
6 E06000006 Halton Halton 141 101 115 NA NA NA
Jul.20 Aug.20 Sep.20 Oct.20 Nov.20 Dec.20 Jan.19 Feb.19 Mar.19 Apr.19 May.19 Jun.19 Jul.19 Aug.19
1 NA NA NA NA NA NA 92 87 68 81 108 77 97 73
2 NA NA NA NA NA NA 144 116 126 113 123 100 113 118
3 NA NA NA NA NA NA 146 152 133 135 114 101 140 116
4 NA NA NA NA NA NA 192 166 160 133 157 126 136 149
5 NA NA NA NA NA NA 138 110 104 84 115 75 86 104
6 NA NA NA NA NA NA 114 95 83 92 97 88 98 83
Sep.19 Oct.19 Nov.19 Dec.19 Jan.18 Feb.18 Mar.18 Apr.18 May.18 Jun.18 Jul.18 Aug.18 Sep.18 Oct.18
1 69 87 85 99 126 89 97 97 77 65 64 61 76 71
2 117 127 119 121 204 117 112 132 129 106 96 115 103 111
3 108 139 134 145 225 152 135 114 122 116 113 108 113 154
4 136 177 159 173 256 171 189 142 146 149 142 144 128 179
5 77 95 96 119 127 125 98 98 104 76 77 84 79 109
6 91 106 102 121 170 106 114 93 102 93 83 111 91 93
Nov.18 Dec.18 Jan.17 Feb.17 Mar.17 Apr.17 May.17 Jun.17 Jul.17 Aug.17 Sep.17 Oct.17 Nov.17 Dec.17
1 94 97 116 83 101 76 85 86 52 80 85 88 98 94
2 108 121 151 137 131 111 112 114 127 112 113 120 150 151
3 113 129 171 126 158 104 120 134 122 119 107 145 126 134
4 152 174 177 166 176 129 157 148 141 148 168 143 142 186
5 84 100 103 110 105 88 101 89 73 92 87 96 102 86
6 115 96 117 95 115 94 99 105 93 110 110 86 89 84
Jan.16 Feb.16 Mar.16 Apr.16 May.16 Jun.16 Jul.16 Aug.16 Sep.16 Oct.16 Nov.16 Dec.16 Jan.15 Feb.15
1 79 97 90 92 82 87 75 74 74 79 68 93 138 99
2 116 143 138 131 139 95 107 107 102 121 125 142 166 144
3 129 132 147 141 137 137 115 108 115 127 135 124 179 144
4 159 176 171 191 146 169 160 128 161 143 159 161 263 169
5 105 113 85 92 87 92 74 78 91 85 88 86 149 78
6 113 98 108 117 90 99 92 107 101 93 123 111 162 105
Mar.15 Apr.15 May.15 Jun.15 Jul.15 Aug.15 Sep.15 Oct.15 Nov.15 Dec.15 new
1 109 69 82 85 71 65 74 82 81 112 85.89796
2 130 116 127 124 119 104 107 95 115 101 123.51020
3 129 142 136 125 114 108 120 117 108 140 131.61224
4 155 163 127 129 142 101 161 148 140 180 161.30612
5 105 102 78 90 112 91 83 109 97 96 96.34694
6 100 102 99 90 90 81 102 98 86 107 103.02041
I have a new column, but the calculation is incorrect. I want an average of all of the 'Jan' columns except for 'Jan.20'

Since you want a row-wise mean, this will work:
dall$mJan15to19 = rowMeans(dall[,c("Jan.15","Jan.16","Jan.17","Jan.18","Jan.19")])
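Both failed attempts in the question can be explained. `mean(a, b, c)` averages only its first argument; the second and third positional arguments are taken as `trim` and `na.rm`, which is what the error message complains about. `select(., Jan.15:Jan.19)` is a positional range, so with this column order it picks up every column between `Jan.19` and `Jan.15` (49 columns), not just the five January ones. The sketch below selects the January columns by name pattern instead; the two rows reuse values from the output in the question, and the regular expression is an assumption about the naming scheme.

```r
# Two rows taken from the question's output, with a few non-target columns kept
dall <- data.frame(
  Code   = c("E06000001", "E06000002"),
  Jan.20 = c(108, 178),
  Jan.19 = c(92, 144),
  Feb.19 = c(87, 116),
  Jan.18 = c(126, 204),
  Jan.17 = c(116, 151),
  Jan.16 = c(79, 116),
  Jan.15 = c(138, 166)
)

# Names matching Jan.15 through Jan.19 only (Jan.20 and Feb.19 are excluded)
jan_cols <- grep("^Jan\\.1[5-9]$", names(dall), value = TRUE)

dall$mJan15to19 <- rowMeans(dall[, jan_cols])

print(dall[, c("Code", "mJan15to19")])
```

For these two rows the result is 110.2 and 156.2. Adding `na.rm = TRUE` to `rowMeans()` would skip missing months instead of returning NA for the row.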

fill NA with previous column and specific condition with data.table in R

I have a table like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 NA 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 NA NA 114 NA 121 111 98 108 6
3: 10316168 88 78 NA 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 NA 121 121 114 111 NA NA NA 7
6: 10619463 42 NA NA NA NA NA NA NA NA 141 9
7: 10631362 174 NA 125 118 117 116 139 116 NA 104 10
8: 10725490 49 NA 175 NA 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 NA 169 140 127 112 NA 6
10: 10832134 10 178 NA NA 116 95 95 125 115 103 3
I am trying to fill these NAs with the value from the previous column
(if V2 is NA, fill it with the V1 value),
subject to the limit column (if limit is 3, fill NAs only up to V3 and leave the later NAs as they are).
So what I am trying to get is this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 10832134 10 178 178 NA 116 95 95 125 115 103 3
The actual data is pretty big, so it would be nice to solve this with data.table,
but other solutions (dplyr, tidyr, or anything else) are fine too.

Using data.table's set() function:
Code
col <- paste0("V", 1:10)
for (i in 2:length(col)) {
rows <- which(is.na(dt[[col[i]]]) & dt[["limit"]] >= i)
set(
x = dt,
i = rows,
j = col[i],
value = dt[[col[i-1]]][rows]
)
}
Results
dt
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
1: 10167638 89 89 116 102 96 NA 106 116 NA 144 3
2: 10298462 74 114 114 114 114 114 121 111 98 108 6
3: 10316168 88 78 78 77 72 96 NA 95 NA NA 4
4: 10423491 118 77 NA 86 139 127 NA 103 93 84 2
5: 10497492 12 154 154 121 121 114 111 NA NA NA 7
6: 10619463 42 42 42 42 42 42 42 42 42 141 9
7: 10631362 174 174 125 118 117 116 139 116 116 104 10
8: 10725490 49 49 175 175 176 NA 139 123 140 141 5
9: 10767348 140 106 174 162 162 169 140 127 112 NA 6
10: 110832134 10 178 178 NA 116 95 95 125 115 103 3
Data
dt <- fread(" ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 limit
10167638 89 NA 116 102 96 NA 106 116 NA 144 3
10298462 74 114 NA NA 114 NA 121 111 98 108 6
10316168 88 78 NA 77 72 96 NA 95 NA NA 4
10423491 118 77 NA 86 139 127 NA 103 93 84 2
10497492 12 154 NA 121 121 114 111 NA NA NA 7
10619463 42 NA NA NA NA NA NA NA NA 141 9
10631362 174 NA 125 118 117 116 139 116 NA 104 10
10725490 49 NA 175 NA 176 NA 139 123 140 141 5
10767348 140 106 174 162 NA 169 140 127 112 NA 6
110832134 10 178 NA NA 116 95 95 125 115 103 3")

You can try a tidyverse approach:
library(tidyverse)
dt %>%
gather(k, v, -ID, -limit) %>% # reshape from wide to long
mutate(k = factor(k, levels = unique(k))) %>% # keep the column order for spreading in the last step
group_by(ID) %>%
mutate(gr=ifelse(is.na(v), 1:n(), 0)) %>% # record the positions of the NAs
fill(v) %>% # fill NAs with the previous value
mutate(v = ifelse(limit >= gr, v, NA)) %>% # set filled values beyond the limit back to NA
select(-gr) %>%
spread(k, v) # reshape back to wide
# A tibble: 10 x 12
# Groups: ID [10]
ID limit V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 10167638 3 89 89 116 102 96 NA 106 116 NA 144
2 10298462 6 74 114 114 114 114 114 121 111 98 108
3 10316168 4 88 78 78 77 72 96 NA 95 NA NA
4 10423491 2 118 77 NA 86 139 127 NA 103 93 84
5 10497492 7 12 154 154 121 121 114 111 NA NA NA
6 10619463 9 42 42 42 42 42 42 42 42 42 141
7 10631362 10 174 174 125 118 117 116 139 116 116 104
8 10725490 5 49 49 175 175 176 NA 139 123 140 141
9 10767348 6 140 106 174 162 162 169 140 127 112 NA
10 110832134 3 10 178 178 NA 116 95 95 125 115 103

R: number in a txt file split up by line

I have a problem reading a .txt file into R.
The data is something like this:
68 89 103 1
37 8 103 9
78 93 8 12
3 50
I used readLines() in R and came up with a list of lines. But when I compare it to the raw data, I find that, for example, the last "1" in the first line is not really 1: it should be connected to the start of the second line, which makes the number 137 instead of 1 and 37. I think the numbers are separated by " ". If I use readLines(), I have to split up the lines manually. How can I read the file correctly?
Also, the number 9 is not connected to 78, since there is a space at the beginning of line 3. The number 12 is connected with 3 to form 123, since there is no space before 3.
Thanks. I don't even know how to search for this problem on Google, or how to express it.
182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63
102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91
1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1
63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1
37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134
134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9
1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123
137 161 179 182 140 152 182 182 81 63 88 134 84 134 182
7 11 9 2 9 4 6 7 6 1 13 2 1 10 4 5 11 11 9 12 1 3 1 3 3
Basically, what I am doing now is:
For example, the vector:
ind <- c(7, 11, 9, 2 ,9 ,4 ,6, 7, 6 ,1, 13, 2 ,1 ,10 ,4 ,5 ,11 ,11, 9 ,12, 1, 3 ,1, 3 ,3)
indicates that the block of numbers above should be split up according to the lengths specified by the vector. I know I can split up a vector by
split(vector, rep(1:length(ind), ind))
However, the problem is that I can't read the block of numbers correctly.

This is based on the condition you described, as I read it: if a line starts with a space after you read the file with readLines, the last number in the previous line is joined with the first number of that line.
Using your second example (I didn't understand the ind though)
lines1 <- readLines(n=10)
182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63
102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91
1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1
63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1
37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134
134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9
1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123
137 161 179 182 140 152 182 182 81 63 88 134 84 134 182
lines2 <- lines1[lines1!=''] #remove blank lines
indx <- grep("^ ", lines2) #create a numeric index for lines that start with a space
indx1 <- indx-1 #index that is one above the previous `indx`
lines2[indx1] <- paste0(lines2[indx1], gsub("^\\s+", "", lines2[indx])) #paste the lines together using the two indexes
lines3 <- lines2[-indx] #remove the lines that belong to the first index
lines3
#[1] "182 63 68 152 130 134 145 152 98 152 182 88 95 105 130 137 167 152 81 71 84 126 134 152 116 130 91 63 68 84 95 152 105 152 63102 152 63 77 112 140 77 119 152 161 167 105 112 145 161 182 152 81 95 84 91 102 108 130 134 91"
#[2] "1 2 1 4 3 6 1 1 5 2 1 5 2 3 4 5 5 1 2 6 1"
#[3] "63 102 119 161 161 172 179 88 91 95 105 112 119 119 137 145 167 172 91 98 108 112 134 137 161 161 179 71 174 95 105 134 134 1"
#[4] "37 140 145 150 150 68 68 130 137 77 95 112 137 161 174 81 84 126 134 161 161 174 68 77 98 102 102 102 112 88 88 91 98 112 134134 137 137 140 140 152 152 77 179 112 71 71 74 77 112 116 116 140 140 167 77 95 126 150 88 126 130 130 134 63 74 84 84 88 9"
#[5] "1 95 108 134 137 179 81 88 105 116 123 140 145 152 161 161 179 88 95 112 119 126 126 150 157 179 68 68 84 102 105 119 123 123137 161 179 182 140 152 182 182 81 63 88 134 84 134 182"
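The question words the rule the other way round from the code above: a line that starts with a space begins a new number, while a line with no leading space continues the last number of the previous line (1 and 37 become 137, 12 and 3 become 123). A minimal sketch of that reading follows. `join_wrapped_numbers` is a hypothetical helper name, the input is the question's first example with the leading space on its third line restored as an assumption, and lines are assumed to carry no trailing whitespace.

```r
# Hypothetical helper: rebuild numbers that were split across line breaks.
join_wrapped_numbers <- function(lines) {
  txt <- paste(lines, collapse = "\n")
  # A newline directly followed by a non-whitespace character cuts a number
  # in two, so delete it to glue the halves together.
  txt <- gsub("\n(?=\\S)", "", txt, perl = TRUE)
  # Any remaining newline separates whole numbers: turn it into a space.
  txt <- gsub("\n", " ", txt, fixed = TRUE)
  scan(text = txt, quiet = TRUE)
}

lines <- c("68 89 103 1", "37 8 103 9", " 78 93 8 12", "3 50")
nums <- join_wrapped_numbers(lines)
print(nums)
```

With this input the result is 68 89 103 137 8 103 9 78 93 8 123 50. For a real file, `lines` would come from `readLines()` on the file path, and the resulting vector can then be passed to `split()` with the `ind` vector as described in the question.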

create new column based on most frequent value of previous columns R

I have a number of columns in a data frame that represent replicates of an experimental result.
Example here
1a 2a 3a 4a 5a
1 154 152 154 156 NA
2 154 154 154 NA NA
3 154 154 154 154 NA
4 154 154 154 154 NA
5 154 NA 154 154 NA
6 NA NA NA 154 NA
7 154 154 NA 154 NA
8 154 154 NA 154 NA
9 154 NA 154 150 NA
10 149 149 NA 149 149
What I would like is to create another column containing, for each row, the value that occurs at least twice (>= 2) across the other columns.
1a 2a 3a 4a 5a score
1 154 152 154 156 NA 154
2 154 154 154 NA NA 154
3 154 154 154 154 NA 154
4 154 154 154 154 NA 154
5 154 NA 154 154 NA 154
6 NA NA NA 154 NA NA
7 154 154 NA 154 NA 154
8 154 154 NA 154 NA 154
9 154 NA 154 150 NA 154
10 149 149 NA 149 149 149
EDIT: Modified example above to demonstrate.
flodel's answer using the mode was initially successful; however, it would use a value even if it only occurred once. I would like the result to be either NA or a character string (whichever is easier) if no value occurs at least twice in a row.

You are not looking for the median but the mode, which is easy enough to define yourself:
Mode <- function(x, min.freq = 1L) {
f <- table(x)
k <- f[f >= min.freq]
if (length(k) > 0L) as.numeric(names(f)[which.max(f)]) else NA
}
test$score <- apply(test2, 1, Mode, min.freq = 2L)
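To see the `min.freq` idea on the question's own table, here is a self-contained sketch. `row_mode` is a hypothetical stand-in for the `Mode` function above, and the column names `a1` to `a5` are assumed replacements for `1a` to `5a`, which are not syntactic R names.

```r
# Hypothetical helper: most frequent non-NA value in x,
# or NA when no value occurs at least min_freq times.
row_mode <- function(x, min_freq = 2L) {
  counts <- table(x)  # table() ignores NA by default
  if (length(counts) == 0L || max(counts) < min_freq) {
    return(NA_real_)
  }
  as.numeric(names(counts)[which.max(counts)])
}

# The example table from the question
test <- data.frame(
  a1 = c(154, 154, 154, 154, 154, NA, 154, 154, 154, 149),
  a2 = c(152, 154, 154, 154, NA, NA, 154, 154, NA, 149),
  a3 = c(154, 154, 154, 154, 154, NA, NA, NA, 154, NA),
  a4 = c(156, NA, 154, 154, 154, 154, 154, 154, 150, 149),
  a5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 149)
)

# Compute the score before attaching it, so it is not part of its own input
score <- apply(test, 1, row_mode, min_freq = 2L)
test$score <- score

print(test)
```

Row 6 has a single 154, so its score is NA; every other row gets 154 except row 10, which gets 149. When two values tie for the highest count, `which.max()` returns the first one in `table()` order, which is the smallest value.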
