Calculate mean across rows with NA values in R

I have a really simple R question but I can't seem to find an adequate solution. Let's say we have the following data frame:
groupid<-rep(1:5, each=3)
names<-rep(c("Bill", "Jim", "Sarah", "Mike", "Jennifer"),3)
test1<-rep(c(90, 70, 90, NA, 100),3)
test2<-rep(c(80, NA, 92, 80, 65), 3)
testscores<-data.frame(groupid, names, test1, test2)
groupid names test1 test2
1 1 Bill 90 80
2 1 Jim 70 NA
3 1 Sarah 90 92
4 2 Mike NA 80
5 2 Jennifer 100 65
6 2 Bill 90 80
7 3 Jim 70 NA
8 3 Sarah 90 92
9 3 Mike NA 80
10 4 Jennifer 100 65
11 4 Bill 90 80
12 4 Jim 70 NA
13 5 Sarah 90 92
14 5 Mike NA 80
15 5 Jennifer 100 65
We are interested in getting the mean across rows (adding an extra column to the data frame) for each test, ignoring the NA values. For example, 'Jim' would have value of 70 for his average and 'Mike' would have a value of 80. All the others would be averaged normally.
I tried using transform from the plyr package but it did not appear to accommodate the NA issue.

testscores$testMean <- rowMeans(testscores[,3:4], na.rm=TRUE)
> testscores
groupid names test1 test2 testMean
1 1 Bill 90 80 85.0
2 1 Jim 70 NA 70.0
3 1 Sarah 90 92 91.0
4 2 Mike NA 80 80.0
5 2 Jennifer 100 65 82.5
6 2 Bill 90 80 85.0
7 3 Jim 70 NA 70.0
8 3 Sarah 90 92 91.0
9 3 Mike NA 80 80.0
10 4 Jennifer 100 65 82.5
11 4 Bill 90 80 85.0
12 4 Jim 70 NA 70.0
13 5 Sarah 90 92 91.0
14 5 Mike NA 80 80.0
15 5 Jennifer 100 65 82.5
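One edge case worth knowing, shown as a minimal sketch: when every score in a row is NA, rowMeans(..., na.rm=TRUE) returns NaN for that row rather than NA. The `scores` data frame below is a made-up stand-in, not the data from the question.

```r
# Hypothetical data: the last row has no scores at all
scores <- data.frame(test1 = c(90, 70, NA, NA),
                     test2 = c(80, NA, 80, NA))

# Row-wise mean over the two test columns, skipping NA values
scores$testMean <- rowMeans(scores[, c("test1", "test2")], na.rm = TRUE)

# A row with nothing to average comes back as NaN; turn it into NA if preferred
scores$testMean[is.nan(scores$testMean)] <- NA
```

Whether NaN or NA is the better marker for "no scores" depends on how the column is used later; the last line is optional.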

You can also select the columns by name:
testscores <- structure(list(groupid = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L, 5L, 5L, 5L), names = structure(c(1L, 3L, 5L, 4L,
2L, 1L, 3L, 5L, 4L, 2L, 1L, 3L, 5L, 4L, 2L), .Label = c("Bill",
"Jennifer", "Jim", "Mike", "Sarah"), class = "factor"), test1 = c(90,
70, 90, NA, 100, 90, 70, 90, NA, 100, 90, 70, 90, NA, 100), test2 = c(80,
NA, 92, 80, 65, 80, NA, 92, 80, 65, 80, NA, 92, 80, 65)), .Names = c("groupid",
"names", "test1", "test2"), row.names = c(NA, -15L), class = "data.frame")
testscores$meanTest=rowMeans(testscores[,c("test1", "test2")], na.rm=TRUE)
# groupid names test1 test2 meanTest
#1 1 Bill 90 80 85.0
#2 1 Jim 70 NA 70.0
#3 1 Sarah 90 92 91.0
#4 2 Mike NA 80 80.0
#5 2 Jennifer 100 65 82.5
#6 2 Bill 90 80 85.0
#7 3 Jim 70 NA 70.0
#8 3 Sarah 90 92 91.0
#9 3 Mike NA 80 80.0
#10 4 Jennifer 100 65 82.5
#11 4 Bill 90 80 85.0
#12 4 Jim 70 NA 70.0
#13 5 Sarah 90 92 91.0
#14 5 Mike NA 80 80.0
#15 5 Jennifer 100 65 82.5

Related

Based on a dataframe add values to a different dataframe when values match up [duplicate]

This is hard to explain but basically I have a very simple dataframe with Counties and Cases
dat <- "County Cases
1 Borden 5
2 Bosque 3
3 Bowue 1"
and I have a large dataframe from TEX <- map_data('county', 'texas').
> head(TEX)
long lat group order region subregion
1 -95.75271 31.53560 1 1 texas anderson
2 -95.76989 31.55852 1 2 texas anderson
3 -95.76416 31.58143 1 3 texas anderson
4 -95.72979 31.58143 1 4 texas anderson
5 -95.74698 31.61008 1 5 texas anderson
6 -95.72405 31.63873 1 6 texas anderson
What I want to do is check every row and if the subregion is in the dataframe dat then add the corresponding number of cases to a new column in TEX called "cases" or add 0 if not.
For example
> head(TEX)
long lat group order region subregion cases
1 -95.75271 31.53560 1 1 texas anderson 0
2 -95.76989 31.55852 1 2 texas anderson 0
3 -95.76416 31.58143 1 3 texas anderson 0
4 -95.72979 31.58143 1 4 texas anderson 0
5 -95.74698 31.61008 1 5 texas Borden 5
6 -95.72405 31.63873 1 6 texas Bosque 3
I tried doing it with this bit of code
for (val in counties$counties) {
for (vall in TEX$subregion) {
if (val == vall) TEX$cases = counties$cases
}
}
but I get this error
Error in `$<-.data.frame`(`*tmp*`, "cases", value = c(5L, 3L, 2L, 1L, :
replacement has 10 rows, data has 4488
My end goal here is to be able to create a choropleth of Texas counties that have COVID cases based on my growing list of Counties and Cases. If you have a better method of doing this, then I would be open to that!
Regards!
UPDATE: Ian's solution worked great but it is causing a problem with ggplot and mapping. If I take a section of the dataframe TEX before merge it looks like this
76 -96.81268 28.28693 4 76 texas aransas
77 -96.80695 28.25828 4 77 texas aransas
78 -96.82414 28.21817 4 78 texas aransas
79 -96.87570 28.19525 4 79 texas aransas
80 -96.91009 28.16660 4 80 texas aransas
81 -96.94446 28.14942 4 81 texas aransas
82 -96.94446 28.18379 4 82 texas aransas
83 -96.92727 28.24109 4 83 texas aransas
84 -96.92154 28.26974 4 84 texas aransas
85 -96.94446 28.27547 4 85 texas aransas
86 -96.99030 28.25255 4 86 texas aransas
87 -96.98457 28.23536 4 87 texas aransas
88 -96.97311 28.21817 4 88 texas aransas
89 -96.96165 28.19525 4 89 texas aransas
90 -96.97311 28.17233 4 90 texas aransas
91 -97.00175 28.15515 4 91 texas aransas
92 -97.03613 28.15515 4 92 texas aransas
93 -97.04186 28.17233 4 93 texas aransas
94 -97.03613 28.20098 4 94 texas aransas
95 -97.05905 28.21817 4 95 texas aransas
96 -97.07624 28.20671 4 96 texas aransas
97 -97.11062 28.21817 4 97 texas aransas
98 -97.12780 28.23536 4 98 texas aransas
99 -97.12780 28.25255 4 99 texas aransas
100 -97.11062 28.26401 4 100 texas aransas
101 -97.01894 28.27547 4 101 texas aransas
102 -96.80122 28.31557 4 102 texas aransas
and after plotting
ggplot(TEX, aes(long,lat, group = group)) + geom_polygon(aes(fill = subregion),color = "black") + theme(legend.position = "none") + coord_quickmap()
Looks great! Now when I execute the merge function TEX gets rearranged
72 aransas -97.00175 28.15515 4 91 texas 1
73 aransas -97.04186 28.17233 4 93 texas 1
74 aransas -96.80695 28.25828 4 77 texas 1
75 aransas -96.80122 28.31557 4 102 texas 1
76 aransas -97.03613 28.15515 4 92 texas 1
77 aransas -96.81268 28.28693 4 76 texas 1
78 aransas -97.12780 28.25255 4 99 texas 1
79 aransas -97.11062 28.26401 4 100 texas 1
80 aransas -96.97311 28.17233 4 90 texas 1
81 aransas -97.12780 28.23536 4 98 texas 1
82 aransas -97.07624 28.20671 4 96 texas 1
83 aransas -96.94446 28.27547 4 85 texas 1
84 aransas -97.01894 28.27547 4 101 texas 1
85 aransas -96.96165 28.19525 4 89 texas 1
86 aransas -97.11062 28.21817 4 97 texas 1
87 aransas -96.87570 28.19525 4 79 texas 1
88 aransas -97.03613 28.20098 4 94 texas 1
89 aransas -97.05905 28.21817 4 95 texas 1
90 aransas -96.97311 28.21817 4 88 texas 1
91 aransas -96.92154 28.26974 4 84 texas 1
92 aransas -96.99030 28.25255 4 86 texas 1
93 aransas -96.98457 28.23536 4 87 texas 1
94 aransas -96.82414 28.21817 4 78 texas 1
95 aransas -96.80122 28.31557 4 75 texas 1
96 aransas -96.94446 28.14942 4 81 texas 1
97 aransas -96.91009 28.16660 4 80 texas 1
98 aransas -96.92727 28.24109 4 83 texas 1
99 aransas -96.94446 28.18379 4 82 texas 1
and now the map looks like this...
What can I do to preserve the original order of TEX? Or wait, maybe I just need to sort by order...
UPDATE#2
TEX <- TEX[order(TEX$order),]
solved the problem. I am curious why merge changed the order like that
We can use merge from base R.
result <- merge(TEX,dat,by.x="subregion",by.y="County",all.x=TRUE)
result
subregion long lat group order region Cases
1 anderson -95.75271 31.53560 1 1 texas NA
2 anderson -95.76989 31.55852 1 2 texas NA
3 anderson -95.76416 31.58143 1 3 texas NA
4 anderson -95.72979 31.58143 1 4 texas NA
5 anderson -95.74698 31.61008 1 5 texas NA
6 anderson -95.72405 31.63873 1 6 texas NA
7 Borden -95.74698 31.61008 1 5 texas 5
8 Bosque -95.72405 31.63873 1 6 texas 3
Then we can replace the NAs with 0.
result$Cases[is.na(result$Cases)] <- 0
result
subregion long lat group order region Cases
1 anderson -95.75271 31.53560 1 1 texas 0
2 anderson -95.76989 31.55852 1 2 texas 0
3 anderson -95.76416 31.58143 1 3 texas 0
4 anderson -95.72979 31.58143 1 4 texas 0
5 anderson -95.74698 31.61008 1 5 texas 0
6 anderson -95.72405 31.63873 1 6 texas 0
7 Borden -95.74698 31.61008 1 5 texas 5
8 Bosque -95.72405 31.63873 1 6 texas 3
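On the follow-up about row order: merge() sorts its result on the key columns by default (sort = TRUE), and its documentation describes the row order with sort = FALSE as unspecified, so the polygon vertices can end up out of sequence until TEX is re-sorted by order. A possible alternative, sketched here with made-up coordinates, is a lookup with match(), which leaves the rows of TEX where they are. The tolower() call is an assumption: it presumes the map data uses lower-case county names while the case list is capitalised.

```r
# Toy stand-ins for the two data frames (coordinates are illustrative only)
TEX <- data.frame(
  long = c(-95.75, -95.76, -101.43, -97.63),
  lat = c(31.53, 31.55, 32.74, 31.90),
  order = 1:4,
  subregion = c("anderson", "anderson", "borden", "bosque"),
  stringsAsFactors = FALSE
)
dat <- data.frame(County = c("Borden", "Bosque", "Bowue"),
                  Cases = c(5L, 3L, 1L),
                  stringsAsFactors = FALSE)

# For each TEX row, find the position of its county in dat (NA if absent)
idx <- match(TEX$subregion, tolower(dat$County))

# Look up the case counts and treat unmatched counties as 0
TEX$cases <- dat$Cases[idx]
TEX$cases[is.na(TEX$cases)] <- 0L
```

Because no rows are added, dropped, or moved, the data frame can go straight to geom_polygon() without re-sorting.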
Data
TEX <- structure(list(long = c(-95.75271, -95.76989, -95.76416, -95.72979,
-95.74698, -95.72405, -95.74698, -95.72405), lat = c(31.5356,
31.55852, 31.58143, 31.58143, 31.61008, 31.63873, 31.61008, 31.63873
), group = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), order = c(1L, 2L,
3L, 4L, 5L, 6L, 5L, 6L), region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "texas", class = "factor"), subregion = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 3L), .Label = c("anderson", "Borden",
"Bosque"), class = "factor")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
dat <- structure(list(County = structure(1:3, .Label = c("Borden", "Bosque",
"Bowue"), class = "factor"), Cases = c(5L, 3L, 1L)), class = "data.frame", row.names = c("1",
"2", "3"))

R - How to calculate value differences between dates with heterogeneous number of rows

My data look like the following example.
# A tibble: 18 x 4
DATE AUTHOR PRODUCT SALES
<dttm> <chr> <chr> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100
3 2019-11-27 James A 80
4 2019-11-28 James A 100
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75
7 2019-11-28 Frank B 65
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75
10 2019-11-28 Frank A 65
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80
13 2019-11-27 Mary A 95
14 2019-11-28 Mary A 110
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80
17 2019-11-27 Mary B 95
18 2019-11-28 Mary B 110
I would like to add a "DIFF" column containing the day-over-day difference in SALES, grouped by AUTHOR. My issues here are the following:
I have a different number of rows for every AUTHOR.
The same DATE can be repeated for an AUTHOR to report different information (in this example, PRODUCT), but the value of SALES will always remain the same, since it depends only on the DATE and the AUTHOR.
I have to keep every row in the dataset because every row contains specific information, so I cannot just drop the rows where DATE is duplicated.
Ideally I would implement the whole thing with a loop function in my script.
My desired outcome would be:
# A tibble: 18 x 5
DATE AUTHOR PRODUCT SALES DIFF
<dttm> <chr> <chr> <dbl> <dbl>
1 2019-11-27 James B 80
2 2019-11-28 James B 100 20
3 2019-11-27 James A 80
4 2019-11-28 James A 100 20
5 2019-11-26 Frank B 70
6 2019-11-27 Frank B 75 5
7 2019-11-28 Frank B 65 -10
8 2019-11-26 Frank A 70
9 2019-11-27 Frank A 75 5
10 2019-11-28 Frank A 65 -10
11 2019-11-25 Mary A 100
12 2019-11-26 Mary A 80 -20
13 2019-11-27 Mary A 95 15
14 2019-11-28 Mary A 110 15
15 2019-11-25 Mary B 100
16 2019-11-26 Mary B 80 -20
17 2019-11-27 Mary B 95 15
18 2019-11-28 Mary B 110 15
I tried different things with dplyr and mutate but nothing seemed to work. Does anyone have suggestions?
Thank you!
You could use lag to subtract the previous value within each group:
library(dplyr)
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = SALES - lag(SALES))
# DATE AUTHOR PRODUCT SALES diff
# <fct> <fct> <fct> <int> <int>
# 1 2019-11-27 James B 80 NA
# 2 2019-11-28 James B 100 20
# 3 2019-11-27 James A 80 NA
# 4 2019-11-28 James A 100 20
# 5 2019-11-26 Frank B 70 NA
# 6 2019-11-27 Frank B 75 5
# 7 2019-11-28 Frank B 65 -10
# 8 2019-11-26 Frank A 70 NA
# 9 2019-11-27 Frank A 75 5
#10 2019-11-28 Frank A 65 -10
#11 2019-11-25 Mary A 100 NA
#12 2019-11-26 Mary A 80 -20
#13 2019-11-27 Mary A 95 15
#14 2019-11-28 Mary A 110 15
#15 2019-11-25 Mary B 100 NA
#16 2019-11-26 Mary B 80 -20
#17 2019-11-27 Mary B 95 15
#18 2019-11-28 Mary B 110 15
Or using diff
df %>% group_by(AUTHOR, PRODUCT) %>% mutate(diff = c(NA, diff(SALES)))
data
df <- structure(list(DATE = structure(c(3L, 4L, 3L, 4L, 2L, 3L, 4L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("2019-11-25",
"2019-11-26", "2019-11-27", "2019-11-28"), class = "factor"),
AUTHOR = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Frank",
"James", "Mary"), class = "factor"), PRODUCT = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("A", "B"), class = "factor"), SALES = c(80L,
100L, 80L, 100L, 70L, 75L, 65L, 70L, 75L, 65L, 100L, 80L,
95L, 110L, 100L, 80L, 95L, 110L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18"))
We can use shift from data.table
library(data.table)
setDT(df)[, diff := SALES - shift(SALES), .(AUTHOR, PRODUCT)][]
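All of these take the difference against the previous row, so they rely on the rows already being in date order within each group. A base R sketch that sorts first and then uses ave() might look like this; the small `sales` data frame is invented for illustration and is not the asker's data.

```r
# Hypothetical data, deliberately not in date order
sales <- data.frame(
  DATE = as.Date(c("2019-11-28", "2019-11-27", "2019-11-26",
                   "2019-11-27", "2019-11-28")),
  AUTHOR = c("James", "James", "Frank", "Frank", "Frank"),
  PRODUCT = "A",
  SALES = c(100, 80, 70, 75, 65)
)

# Sort so that "previous row" means "previous day" within each group
sales <- sales[order(sales$AUTHOR, sales$PRODUCT, sales$DATE), ]

# Within each AUTHOR/PRODUCT group: NA for the first day, then successive differences
sales$DIFF <- ave(sales$SALES, sales$AUTHOR, sales$PRODUCT,
                  FUN = function(x) c(NA, diff(x)))
```

ave() returns a vector of the same length as its input, so every original row is kept.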

Merge lines with same ID and take average value

From the table below I need to combine the lines by calculating the average value for those lines with same ID (column 2).
I was thinking of using a plyr function, something like this?
ddply(df, summarize, value = average(ID))
df:
miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G 134R 141G 141R 167G 167R 185G 185R
1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703 442 715 309 546 113 358 309
2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13 3 6 3 6 4 7 5
3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1 1 3 2 2 0 2 1
4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4 1 3 0 0 2 2 4
5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32 21 23 25 10 14 32 55
Summary of original data:
> str(ClusterMatrix)
'data.frame': 113 obs. of 98 variables:
$ miRNA: Factor w/ 202 levels "hsa-miR-106a",..: 1 3 4 6 8 8 14 15 15 16 ...
$ ID : Factor w/ 27 levels "ID1","ID10","ID11",..: 25 12 12 12 21 21 12 21 21 6 ...
$ 100G : Factor w/ 308 levels "-0.307749042739963",..: 279 11 3 3 101 42 139 158 215 222 ...
$ 100R : Factor w/ 316 levels "-0.138028803567403",..: 207 7 8 8 18 42 128 183 232 209 ...
$ 106G : Factor w/ 260 levels "-0.103556709881933",..: 171 4 1 3 7 258 95 110 149 162 ...
$ 106R : Factor w/ 300 levels "-0.141810346640204",..: 141 4 6 2 108 41 146 196 244 267 ...
$ 122G : Factor w/ 336 levels "-0.0409548922061764",..: 237 12 4 6 103 47 148 203 257 264 ...
$ 122R : Factor w/ 316 levels "-0.135708706475279",..: 177 1 8 6 36 44 131 192 239 244 ...
$ 124G : Factor w/ 267 levels "-0.348439853247856",..: 210 5 2 3 7 50 126 138 188 249 ...
$ 124R : Factor w/ 303 levels "-0.176414190219115",..: 193 3 7 3 21 52 167 200 238 239 ...
$ 126G : Factor w/ 307 levels "-0.227658806811544",..: 122 88 5 76 169 61 240 220 281 265 ...
$ 126R : Factor w/ 249 levels "-0.271925865853123",..: 119 1 2 3 11 247 78 110 151 193 ...
$ 134G : Factor w/ 344 levels "-0.106333543799583",..: 304 14 8 5 33 48 150 196 248 231 ...
$ 134R : Factor w/ 300 levels "-0.0997616469801097",..: 183 5 7 7 22 298 113 159 213 221 ...
$ 141G : Factor w/ 335 levels "-0.134429748398679",..: 253 7 3 3 24 29 142 137 223 302 ...
$ 141R : Factor w/ 314 levels "-0.143299688877927",..: 210 4 5 7 98 54 154 199 255 251 ...
$ 167G : Factor w/ 306 levels "-0.211181452126958",..: 222 7 4 6 11 292 91 101 175 226 ...
$ 167R : Factor w/ 282 levels "-0.0490740880560127",..: 130 2 6 4 15 282 110 146 196 197 ...
$ 185G : Factor w/ 317 levels "-0.0567841338235346",..: 218 2 7 7 33 34 130 194 227 259 ...
We can use dplyr. We group by 'ID', use mutate_each to create columns that show the mean value of '100G' to '185R'. We select the columns in mutate_each by using regex patterns in matches. Then cbind (bind_cols) the original dataset with the mutated columns, and convert to data.frame if needed. We can also change the column names of the mean columns.
library(dplyr)
out <- df1 %>%
group_by(ID) %>%
mutate_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+')) %>%
setNames(., c(names(.)[1:2], paste0('Mean_', names(.)[3:ncol(.)]))) %>%
as.data.frame()
out1 <- bind_cols(df1, out[-(1:2)])
out1
# miRNA ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G
#1 hsa-miR-106a ID7 1585 423 180 113 598 266 227 242 70 106 2703
#2 hsa-miR-1185-1 ID2 10 1 3 3 11 8 4 4 28 2 13
#3 hsa-miR-1185-2 ID2 2 0 2 1 5 1 1 0 4 1 1
#4 hsa-miR-1197 ID2 2 0 0 5 3 3 0 4 16 0 4
#5 hsa-miR-127 ID3 29 17 6 55 40 35 6 20 171 10 32
# 134R 141G 141R 167G 167R 185G 185R Mean_100G Mean_100R Mean_106G
#1 442 715 309 546 113 358 309 1585.000000 423.0000000 180.000000
#2 3 6 3 6 4 7 5 4.666667 0.3333333 1.666667
#3 1 3 2 2 0 2 1 4.666667 0.3333333 1.666667
#4 1 3 0 0 2 2 4 4.666667 0.3333333 1.666667
#5 21 23 25 10 14 32 55 29.000000 17.0000000 6.000000
# Mean_106R Mean_122G Mean_122R Mean_124G Mean_124R Mean_126G Mean_126R
#1 113 598.000000 266 227.000000 242.000000 70 106
#2 3 6.333333 4 1.666667 2.666667 16 1
#3 3 6.333333 4 1.666667 2.666667 16 1
#4 3 6.333333 4 1.666667 2.666667 16 1
#5 55 40.000000 35 6.000000 20.000000 171 10
# Mean_134G Mean_134R Mean_141G Mean_141R Mean_167G Mean_167R Mean_185G
#1 2703 442.000000 715 309.000000 546.000000 113 358.000000
#2 6 1.666667 4 1.666667 2.666667 2 3.666667
#3 6 1.666667 4 1.666667 2.666667 2 3.666667
#4 6 1.666667 4 1.666667 2.666667 2 3.666667
#5 32 21.000000 23 25.000000 10.000000 14 32.000000
# Mean_185R
#1 309.000000
#2 3.333333
#3 3.333333
#4 3.333333
#5 55.000000
EDIT: If we need a single row mean for each 'ID', we can use summarise_each
df1 %>%
group_by(ID) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+'))
EDIT2: Based on the OP's update, the columns of the original dataset ('ClusterMatrix') are all of class factor. We need to convert them to numeric before taking the mean. There are two ways to convert a factor to numeric: 1) as.numeric(as.character(x)), which may be a bit slower, and 2) as.numeric(levels(x))[x], which is faster. Here I am using the first method as it may be clearer.
ClusterMatrix %>%
group_by(ID) %>%
summarise_each(funs(mean= mean(as.numeric(as.character(.)),
na.rm=TRUE)), matches('^\\d+'))
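To illustrate why the conversion goes through the labels, here is a small sketch with a made-up factor `f`: calling as.numeric() directly on a factor returns the internal integer codes, not the numbers shown in the labels.

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "2", "2", "-0.5"))

codes <- as.numeric(f)                    # internal level codes, NOT the values
via_char <- as.numeric(as.character(f))   # option 1: convert the labels of every element
via_levels <- as.numeric(levels(f))[f]    # option 2: convert each level once, then index
```

Both `via_char` and `via_levels` hold the original numbers; option 2 does less work when there are many rows but few distinct levels.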
data
df1 <- structure(list(miRNA = c("hsa-miR-106a", "hsa-miR-1185-1",
"hsa-miR-1185-2",
"hsa-miR-1197", "hsa-miR-127"), ID = c("ID7", "ID2", "ID2", "ID2",
"ID3"), `100G` = c(1585L, 10L, 2L, 2L, 29L), `100R` = c(423L,
1L, 0L, 0L, 17L), `106G` = c(180L, 3L, 2L, 0L, 6L), `106R` = c(113L,
3L, 1L, 5L, 55L), `122G` = c(598L, 11L, 5L, 3L, 40L), `122R` = c(266L,
8L, 1L, 3L, 35L), `124G` = c(227L, 4L, 1L, 0L, 6L), `124R` = c(242L,
4L, 0L, 4L, 20L), `126G` = c(70L, 28L, 4L, 16L, 171L), `126R` = c(106L,
2L, 1L, 0L, 10L), `134G` = c(2703L, 13L, 1L, 4L, 32L), `134R` = c(442L,
3L, 1L, 1L, 21L), `141G` = c(715L, 6L, 3L, 3L, 23L), `141R` = c(309L,
3L, 2L, 0L, 25L), `167G` = c(546L, 6L, 2L, 0L, 10L), `167R` = c(113L,
4L, 0L, 2L, 14L), `185G` = c(358L, 7L, 2L, 2L, 32L), `185R` = c(309L,
5L, 1L, 4L, 55L)), .Names = c("miRNA", "ID", "100G", "100R",
"106G", "106R", "122G", "122R", "124G", "124R", "126G", "126R",
"134G", "134R", "141G", "141R", "167G", "167R", "185G", "185R"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))
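Note that mutate_each(), summarise_each() and funs() have since been deprecated in dplyr in favour of across(). As a version-independent alternative, the one-row-per-ID means can also be sketched in base R with aggregate(). The `df1` below is a cut-down stand-in with only two of the numeric columns, and the names `num_cols` and `means` are just illustrative.

```r
# Cut-down stand-in for df1 (two measurement columns only)
df1 <- data.frame(
  miRNA = c("hsa-miR-106a", "hsa-miR-1185-1", "hsa-miR-1185-2",
            "hsa-miR-1197", "hsa-miR-127"),
  ID = c("ID7", "ID2", "ID2", "ID2", "ID3"),
  `100G` = c(1585, 10, 2, 2, 29),
  `100R` = c(423, 1, 0, 0, 17),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Columns whose names start with a digit are the measurements
num_cols <- grep("^[0-9]+", names(df1), value = TRUE)

# One row per ID, each measurement column replaced by its group mean
means <- aggregate(df1[num_cols], by = list(ID = df1$ID),
                   FUN = mean, na.rm = TRUE)
```

The result is ordered by ID, and the miRNA column is dropped because there is no single name for a merged row.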

Merging output in R

max=aggregate(cbind(a$VALUE,Date=a$DATE) ~ format(a$DATE, "%m") + cut(a$CLASS, breaks=c(0,2,4,6,8,10,12,14)) , data = a, max)[-1]
max$DATE=as.Date(max$DATE, origin = "1970-01-01")
Sample Data :
DATE GRADE VALUE
2008-09-01 1 20
2008-09-02 2 30
2008-09-03 3 50
.
.
2008-09-30 2 75
.
.
2008-10-01 1 95
.
.
2008-11-01 4 90
.
.
2008-12-01 1 70
2008-12-02 2 40
2008-12-28 4 30
2008-12-29 1 40
2008-12-31 3 50
My expected output for the first month only, according to the above table, is:
DATE GRADE VALUE
2008-09-30 (0,2] 75
2008-09-02 (2,4] 50
Output in my real data :
format(DATE, "%m")
1 09
2 10
3 11
4 12
5 09
6 10
7 11
cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14)) value
1 (0,2] 0.30844444
2 (0,2] 1.00000000
3 (0,2] 1.00000000
4 (0,2] 0.73333333
5 (2,4] 0.16983488
6 (2,4] 0.09368000
7 (2,4] 0.10589335
Date
1 2008-09-30
2 2008-10-31
3 2008-11-28
4 2008-12-31
5 2008-09-30
6 2008-10-31
7 2008-11-28
The output does not match the sample data, as the real data is too big to post. The logic is simple: grades run from 1 to 10, and I want the highest value per month within each grade group, e.g. the highest value for each of (0,2], (2,4], etc.
I used aggregate with the function max, grouping by two columns, Date and Grade. When I run the code and display the value of max, I get three tables as output, one after the other. I want to plot this output but I am not able to because of this. How can I merge all of this output?
Try:
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), GRADE=cut(GRADE, breaks=seq(0,14,by=2))) %>%
summarise_each(funs(max))
# MONTH GRADE DATE VALUE
#1 09 (0,2] 2008-09-30 75
#2 09 (2,4] 2008-09-03 50
#3 10 (0,2] 2008-10-01 95
#4 11 (2,4] 2008-11-01 90
#5 12 (0,2] 2008-12-29 70
#6 12 (2,4] 2008-12-31 50
Or using data.table
library(data.table)
setDT(a)[, list(DATE=max(DATE), VALUE=max(VALUE)),
by= list(MONTH=format(DATE, "%m"),
GRADE=cut(GRADE, breaks=seq(0,14, by=2)))]
# MONTH GRADE DATE VALUE
#1: 09 (0,2] 2008-09-30 75
#2: 09 (2,4] 2008-09-03 50
#3: 10 (0,2] 2008-10-01 95
#4: 11 (2,4] 2008-11-01 90
#5: 12 (0,2] 2008-12-29 70
#6: 12 (2,4] 2008-12-31 50
Or using aggregate
res <- transform(with(a,
aggregate(cbind(VALUE, DATE),
list(MONTH=format(DATE, "%m") ,GRADE=cut(GRADE, breaks=seq(0,14, by=2))), max)),
DATE=as.Date(DATE, origin="1970-01-01"))
res[order(res$MONTH),]
# MONTH GRADE VALUE DATE
#1 09 (0,2] 75 2008-09-30
#4 09 (2,4] 50 2008-09-03
#2 10 (0,2] 95 2008-10-01
#5 11 (2,4] 90 2008-11-01
#3 12 (0,2] 70 2008-12-29
#6 12 (2,4] 50 2008-12-31
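One caveat with taking max of DATE and max of VALUE separately: the two maxima need not come from the same row. In the output above, month 12 / (0,2] pairs 2008-12-29 with 70, although in the sample data the value 70 was recorded on 2008-12-01. If the date is meant to be the day on which the maximum occurred (an assumption about the intent), a base R sketch that keeps the whole row could look like this. The data below are the December rows of the sample; `key`, `is_max` and `top` are illustrative names.

```r
# December rows from the sample data
a <- data.frame(
  DATE = as.Date(c("2008-12-01", "2008-12-02", "2008-12-28",
                   "2008-12-29", "2008-12-31")),
  GRADE = c(1L, 2L, 4L, 1L, 3L),
  VALUE = c(70L, 40L, 30L, 40L, 50L)
)

a$MONTH <- format(a$DATE, "%m")
a$GRADE_BIN <- cut(a$GRADE, breaks = seq(0, 14, by = 2))

# One grouping key per MONTH x GRADE_BIN combination that actually occurs
key <- paste(a$MONTH, a$GRADE_BIN)

# Keep the row(s) whose VALUE equals the maximum of their group
is_max <- a$VALUE == ave(a$VALUE, key, FUN = max)
top <- a[is_max, c("MONTH", "GRADE_BIN", "DATE", "VALUE")]
```

If several rows tie for the maximum in a group, all of them are kept.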
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244), class = "Date"),
GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c(NA, -11L), class = "data.frame")
Update
If you want to include YEAR also in the grouping
library(dplyr)
a %>%
group_by(MONTH=format(DATE, "%m"), YEAR=format(DATE, "%Y"), GRADE=cut(GRADE, breaks=seq(0,14, by=2)))%>%
summarise_each(funs(max))
# MONTH YEAR GRADE DATE VALUE
#1 09 2008 (0,2] 2008-09-30 75
#2 09 2008 (2,4] 2008-09-03 50
#3 09 2009 (0,2] 2009-09-30 75
#4 09 2009 (2,4] 2009-09-03 50
#5 10 2008 (0,2] 2008-10-01 95
#6 10 2009 (0,2] 2009-10-01 95
#7 11 2008 (2,4] 2008-11-01 90
#8 11 2009 (2,4] 2009-11-01 90
#9 12 2008 (0,2] 2008-12-29 70
#10 12 2008 (2,4] 2008-12-31 50
#11 12 2009 (0,2] 2009-12-29 70
#12 12 2009 (2,4] 2009-12-31 50
data
a <- structure(list(DATE = structure(c(14123, 14124, 14125, 14152,
14153, 14184, 14214, 14215, 14241, 14242, 14244, 14488, 14489,
14490, 14517, 14518, 14549, 14579, 14580, 14606, 14607, 14609
), class = "Date"), GRADE = c(1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L,
4L, 1L, 3L, 1L, 2L, 3L, 2L, 1L, 4L, 1L, 2L, 4L, 1L, 3L), VALUE = c(20L,
30L, 50L, 75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L, 20L, 30L, 50L,
75L, 95L, 90L, 70L, 40L, 30L, 40L, 50L)), .Names = c("DATE",
"GRADE", "VALUE"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12", "21", "31", "41", "51", "61",
"71", "81", "91", "101", "111"), class = "data.frame")
The following code using base R may be helpful (using the 'a' dataframe from akrun's answer):
# split each date into its year, month and day parts and keep the month
xx = strsplit(as.character(a$DATE), '-')
a$month = sapply(xx, '[', 2)
gradeCats = cut(a$GRADE, breaks = c(0, 2, 4, 6, 8, 10, 12, 14))
aggregate(VALUE~month+gradeCats, data= a, max)
month gradeCats VALUE
1 09 (0,2] 75
2 10 (0,2] 95
3 12 (0,2] 70
4 09 (2,4] 50
5 11 (2,4] 90
6 12 (2,4] 50

aggregate data in columns with duplicate id in R

I have a df like this:
> dat
gen  M1   M1   M1   M1   M2   M2   M2
G1   150       142  130  105  96
G2   150  145  142  130       96   89
G3   150  145       130  105  96
G4        145  142  130  105       89
G5   150       142  130  105  96
G6        145  142  130       96   89
G7   150       142       105  96
G8   150  145       130  105       89
G9   150  145  142            96   89
Here, several columns share the same id (M1, M2) in the header. I would like to aggregate them like this:
>dat1
gen  M1   M1   M1   M1   agg              M2   M2   M2   agg
G1   150       142  130  150/142/130      105  96        105/96
G2   150  145  142  130  150/145/142/130       96   89   96/89
G3   150  145       130  150/145/130      105  96        105/96
G4        145  142  130  145/142/130      105       89   105/89
G5   150       142  130  150/142/130      105  96        105/96
G6        145  142  130  145/142/130           96   89   96/89
G7   150       142       150/142          105  96        105/96
G8   150  145       130  150/145/130      105       89   105/89
G9   150  145  142       150/145/142           96   89   96/89
Here, each agg column combines all the values from the columns that share the same name in the header row.
I would like to create a new column at the end of each set of duplicate columns and aggregate the values into it.
How can I do this in R? I am very confused.
EDIT:
dput(dat)
structure(list(V1 = structure(c(10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L), .Label = c("G1", "G2", "G3", "G4", "G5", "G6", "G7",
"G8", "G9", "gen"), class = "factor"), V2 = structure(c(2L, 1L,
1L, 1L, NA, 1L, NA, 1L, 1L, 1L), .Label = c("150", "M1"), class = "factor"),
V3 = structure(c(2L, NA, 1L, 1L, 1L, NA, 1L, NA, 1L, 1L), .Label = c("145",
"M1"), class = "factor"), V4 = structure(c(2L, 1L, 1L, NA,
1L, 1L, 1L, 1L, NA, 1L), .Label = c("142", "M1"), class = "factor"),
V5 = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, NA), .Label = c("130",
"M1"), class = "factor"), V6 = structure(c(2L, 1L, NA, 1L,
1L, 1L, NA, 1L, 1L, NA), .Label = c("105", "M2"), class = "factor"),
V7 = structure(c(2L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("96",
"M2"), class = "factor"), V8 = structure(c(2L, NA, 1L, NA,
1L, NA, 1L, NA, 1L, 1L), .Label = c("89", "M2"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), class = "data.frame", row.names = c(NA,
-10L))
This works if the missing values are blanks:
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[nchar(x)>0],collapse="/"))
dat <- dat[,c(1:5,9,6:8,10)]
dat
# gen M1 M1.1 M1.2 M1.3 agg1 M2 M2.1 M2.2 agg2
# 1 G1 150 142 130 150/142/130 105 96 105/96
# 2 G2 150 145 142 130 150/145/142/130 96 89 96/89
# 3 G3 150 145 130 150/145/130 105 96 105/96
# 4 G4 145 142 130 145/142/130 105 89 105/89
# ...
This works if the missing values are NA
dat$agg1 <- apply(dat[,2:5],1,function(x)paste(x[!is.na(x)],collapse="/"))
dat$agg2 <- apply(dat[,6:8],1,function(x)paste(x[!is.na(x)],collapse="/"))
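The hard-coded column positions can also be replaced by a lookup on the column-name prefix. The sketch below assumes the header was read in so that the duplicates became M1, M1.1, ... and M2, M2.1, ..., and that missing values are NA. The helper `collapse_group` and the three-row `dat` are illustrative, not part of the original answer.

```r
# Three rows (G1, G2, G7) of the question's data, with de-duplicated column names
dat <- data.frame(
  gen = c("G1", "G2", "G7"),
  M1 = c(150, 150, 150), M1.1 = c(NA, 145, NA),
  M1.2 = c(142, 142, 142), M1.3 = c(130, 130, NA),
  M2 = c(105, NA, 105), M2.1 = c(96, 96, 96), M2.2 = c(NA, 89, NA)
)

# Paste together the non-missing values of all columns whose name starts with `prefix`
collapse_group <- function(df, prefix) {
  cols <- grep(paste0("^", prefix), names(df), value = TRUE)
  apply(df[cols], 1, function(x) paste(x[!is.na(x)], collapse = "/"))
}

dat$agg1 <- collapse_group(dat, "M1")
dat$agg2 <- collapse_group(dat, "M2")
```

This way the same helper works however many duplicate columns each marker has.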
To aggregate them into a character vector, you can use paste():
x=data.frame(x1=1:10,x2=1:10,x1=11:20)
#now notice that r created my x object with three columns x1,x2 and x1.1
xnew=cbind(x,agg=paste(x$x1,x$x2,x$x1.1,sep="/"))
I am not sure if this is what you want to do because I am a bit confused about the structure of your data.
Here is my script. I know some of you can make it simpler and more elegant!
I transposed my df (a simple example) and read it in as a table.
> dat<-read.table("dat.txt", header=T, sep="\t", na.strings="")
> dat
gen A B C D
1 M1 1 NA 3 NA
2 M1 NA 6 NA 3
3 M1 4 8 NA NA
4 M1 NA NA 6 3
5 M2 8 NA 6 NA
6 M2 NA 2 NA 6
7 M3 3 8 NA 2
8 M3 8 9 5 NA
9 M4 3 7 8 5
10 M4 5 NA 3 2
> final<-NULL
> for(i in 1:4){
+ mar<-as.character(dat[1,1])
+ dat1<-dat[dat[,1]%in% c(mar),]
+ dat <- dat[!dat[,1]%in% c(mar),]
+ dat2 <- apply(dat1,2,function(x)paste(x[!is.na(x)],collapse="/"))
+ dat2$gen<-mar
+ dat3<-rbind(dat1,dat2)
+ final<-rbind(final, dat3)
+ }
Warning messages:
1: In dat2$gen <- mar : Coercing LHS to a list
2: In dat2$gen <- mar : Coercing LHS to a list
3: In dat2$gen <- mar : Coercing LHS to a list
4: In dat2$gen <- mar : Coercing LHS to a list
> final
gen A B C D
1 M1 1 <NA> 3 <NA>
2 M1 <NA> 6 <NA> 3
3 M1 4 8 <NA> <NA>
4 M1 <NA> <NA> 6 3
5 M1 1/ 4 6/ 8 3/ 6 3/ 3
51 M2 8 <NA> 6 <NA>
6 M2 <NA> 2 <NA> 6
31 M2 8 2 6 6
7 M3 3 8 <NA> 2
8 M3 8 9 5 <NA>
32 M3 3/8 8/9 5 2
9 M4 3 7 8 5
10 M4 5 <NA> 3 2
33 M4 3/5 7 8/3 5/2
