creating a new column to indicate variable number of each unique animal - r

I need to reshape my current data by adding a new first column, "tnum", to indicate the trait/variable number, and a last column, "tval", to hold each trait value.
My current data file (9,000 animals) is similar to this format:
anim <- c(201,202,203,204,205)
bwt <- c(1.2,1.0,0.9,1.1,1.5)
leng <- c(14,21,18,16,19)
temp <- c(33,34,39,38,37)
mydf <- data.frame(anim,bwt,leng,temp)
anim bwt leng temp
1 201 1.2 14 33
2 202 1.0 21 34
3 203 0.9 18 39
4 204 1.1 16 38
5 205 1.5 19 37
Trait 1 = bwt, trait 2 = leng, and trait 3 = temp. This is what I am looking for:
tnum anim tval
1 201 1.2
2 201 14
3 201 33
1 202 1.0
2 202 21
3 202 34
1 203 0.9
2 203 18
3 203 39
1 204 1.1
2 204 16
3 204 38
1 205 1.5
2 205 19
3 205 37
Any help would be appreciated.
Baz

library("reshape")
m <- melt(mydf, id.vars="anim")
m
anim variable value
1 201 bwt 1.2
2 202 bwt 1.0
3 203 bwt 0.9
4 204 bwt 1.1
5 205 bwt 1.5
6 201 leng 14.0
7 202 leng 21.0
8 203 leng 18.0
9 204 leng 16.0
10 205 leng 19.0
11 201 temp 33.0
12 202 temp 34.0
13 203 temp 39.0
14 204 temp 38.0
15 205 temp 37.0
Please format your code better next time; it's simple to do.
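The melt() output above labels traits by name and is ordered by trait rather than by animal. A minimal base R sketch of one way to get the numeric tnum column in the requested order, assuming the trait numbering simply follows the column order bwt, leng, temp (the names traits and long are illustrative and not part of the original answer):

```r
# Sample data from the question
anim <- c(201, 202, 203, 204, 205)
bwt  <- c(1.2, 1.0, 0.9, 1.1, 1.5)
leng <- c(14, 21, 18, 16, 19)
temp <- c(33, 34, 39, 38, 37)
mydf <- data.frame(anim, bwt, leng, temp)

# Assumption: trait numbers follow this column order
traits <- c("bwt", "leng", "temp")

long <- data.frame(
  # 1, 2, 3 repeated once per animal
  tnum = rep(seq_along(traits), times = nrow(mydf)),
  # each animal id repeated once per trait
  anim = rep(mydf$anim, each = length(traits)),
  # transpose so that values are read animal by animal, trait by trait
  tval = as.vector(t(as.matrix(mydf[traits])))
)
head(long, 6)
```

Because the code depends only on the column names listed in traits, it should apply unchanged to the full 9,000-animal file.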

Related

R- Subtract values of all rows in a group from previous row in different group and filter out rows

In R say I had the data frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
3 6 140 90
4 6 148.5 97
4 6 142 93
5 6 147 96
5 6 138 92
5 6 135 90
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
3 7 119 185
4 7 117 183
4 7 123 199
5 7 115 190
5 7 124 202
5 7 118 192
6 7 124.5 199
I want to output the object which is the closest in the previous frame based on the (x,y) coordinates and filter out the other objects. I want to find the difference in the x and y between all the objects in a given frame and the single object in the previous frame and keep the closest object while removing the rest. The object that is kept would then serve as reference for the next frame. The frames with only one object would be left as is. The output should be one object per frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
4 6 148.5 97
5 6 147 96
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
4 7 123 199
5 7 124 202
6 7 124.5 199
This is a cumulative operation, so it'll take an iterative approach. Here's a simple function to do one operation, assuming it's for only one object.
fun <- function(Z, fr) {
prevZ <- head(subset(Z, frame == (fr-1)), 1)
thisZ <- subset(Z, frame == fr)
if (nrow(prevZ) < 1 || nrow(thisZ) < 2) return(Z)
ind <- which.min( abs(thisZ$x - prevZ$x) + abs(thisZ$y - prevZ$y) )
rbind(subset(Z, frame != fr), thisZ[ind,])
}
fun(subset(dat, object == 6), 3)
# frame object x y
# 1 1 6 150.0 100
# 2 2 6 149.0 99
# 5 4 6 148.5 97
# 6 4 6 142.0 93
# 7 5 6 147.0 96
# 8 5 6 138.0 92
# 9 5 6 135.0 90
# 10 6 6 146.5 99
# 3 3 6 148.0 98
(The row order is not maintained, but it can easily be sorted back into place as needed.)
Now we can Reduce this for each object within the data.
out <- do.call(rbind,
lapply(split(dat, dat$object),
function(X) Reduce(fun, seq(min(X$frame)+1, max(X$frame)), init=X)))
out <- out[order(out$object, out$frame),]
out
# frame object x y
# 6.1 1 6 150.0 100
# 6.2 2 6 149.0 99
# 6.3 3 6 148.0 98
# 6.5 4 6 148.5 97
# 6.7 5 6 147.0 96
# 6.10 6 6 146.5 99
# 7.11 1 7 125.0 200
# 7.12 2 7 126.0 197
# 7.13 3 7 127.0 202
# 7.16 4 7 123.0 199
# 7.18 5 7 124.0 202
# 7.20 6 7 124.5 199
We can create a for loop that applies the criteria to a single object, and then use group_by %>% summarize to apply it to every object:
library(dplyr)
keep_closest_frame = function(data) {
frames = split(data, data$frame)  # split on the function argument, not an undefined `dd`
for(i in seq_along(frames)) {
if(nrow(frames[[i]]) != 1 && i == 1) {
stop("First frame must have exactly 1 row")
}
if(nrow(frames[[i]]) == 1) next
dists = with(frames[[i]], abs(x - frames[[i - 1]][["x"]]) + abs(y - frames[[i - 1]][["y"]]))
frames[[i]] = frames[[i]][which.min(dists), ]
}
bind_rows(frames)
}
data %>%
group_by(object) %>%
summarize(keep_closest_frame(across()))
# # A tibble: 12 × 4
# # Groups: object [2]
# object frame x y
# <int> <int> <dbl> <int>
# 1 6 1 150 100
# 2 6 2 149 99
# 3 6 3 148 98
# 4 6 4 148. 97
# 5 6 5 147 96
# 6 6 6 146. 99
# 7 7 1 125 200
# 8 7 2 126 197
# 9 7 3 127 202
# 10 7 4 123 199
# 11 7 5 124 202
# 12 7 6 124. 199
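Both answers above measure closeness as the sum of absolute differences in x and y. If straight-line distance is what is wanted, the distance line can be swapped for a Euclidean one. A small sketch, assuming a hypothetical helper named closest_index (not part of either answer), using object 6 in frames 2 and 3 from the question:

```r
# Hypothetical helper: index of the candidate row nearest to the reference row,
# using Euclidean distance instead of |dx| + |dy|
closest_index <- function(cand, ref) {
  which.min(sqrt((cand$x - ref$x)^2 + (cand$y - ref$y)^2))
}

ref  <- data.frame(x = 149, y = 99)                  # object 6, frame 2
cand <- data.frame(x = c(148, 140), y = c(98, 90))   # object 6, frame 3

cand[closest_index(cand, ref), ]
```

For the sample data the two distance measures pick the same rows, but they can disagree when candidates differ mostly along one axis.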

Add new column to state data frame based on other column data [duplicate]

I am attempting to add a new column to the state sample data frame in R. I want this column to group the state IDs into broader categories (1-4). My code is close to what I am looking for, but I am not getting it quite right. I know I could enter each state ID line by line, but is there a quicker way? Thank you!
library(tidyverse)
#Add column to denote each state
States=state.x77
States=data.frame(States)
States <- tibble::rowid_to_column(States, "ID")
States
#Create new variable for state buckets
States <- States %>%
mutate(WAGE_BUCKET=case_when(ID <= c(1,12) ~ '1',
ID <= c(13,24) ~ '2',
ID <= c(25,37) ~ '3',
ID <= c(38,50) ~ '4',
TRUE ~ 'NA'))
View(States) #It is not grouping the states in the way I want/I am still getting some NA values but unsure why!
You can use cut or findInterval if all of your groups use contiguous ID values. With findInterval, the break points are the lower bound of each group:
findInterval(States$ID, c(1, 13, 25, 38))
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
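The answer mentions cut but only shows findInterval. A minimal sketch of the cut version, assuming the groups are 1-12, 13-24, 25-37 and 38-50 as in the question, and using a plain ID vector in place of States$ID:

```r
ID <- 1:50  # stands in for States$ID

# cut() uses right-closed intervals by default: (0,12], (12,24], (24,37], (37,50]
bucket <- cut(ID,
              breaks = c(0, 12, 24, 37, 50),
              labels = c("1", "2", "3", "4"))

table(bucket)
```

The result is a factor; wrap it in as.character() if a character column is preferred.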
If you want to make it a bit more verbose, you can use dplyr::between in your case_when:
States %>%
mutate(
WAGE_BUCKET = case_when(
between(ID, 1, 12) ~ "1",
between(ID, 13, 24) ~ "2",
between(ID, 25, 37) ~ "3",
between(ID, 38, 50) ~ "4",
TRUE ~ NA_character_)
)
# ID Population Income Illiteracy Life Exp Murder HS Grad Frost Area WAGE_BUCKET
# 1 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 1
# 2 2 365 6315 1.5 69.31 11.3 66.7 152 566432 1
# 3 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 1
# 4 4 2110 3378 1.9 70.66 10.1 39.9 65 51945 1
# 5 5 21198 5114 1.1 71.71 10.3 62.6 20 156361 1
# 6 6 2541 4884 0.7 72.06 6.8 63.9 166 103766 1
# 7 7 3100 5348 1.1 72.48 3.1 56.0 139 4862 1
# 8 8 579 4809 0.9 70.06 6.2 54.6 103 1982 1
# 9 9 8277 4815 1.3 70.66 10.7 52.6 11 54090 1
# 10 10 4931 4091 2.0 68.54 13.9 40.6 60 58073 1
# 11 11 868 4963 1.9 73.60 6.2 61.9 0 6425 1
# 12 12 813 4119 0.6 71.87 5.3 59.5 126 82677 1
# 13 13 11197 5107 0.9 70.14 10.3 52.6 127 55748 2
# 14 14 5313 4458 0.7 70.88 7.1 52.9 122 36097 2
# 15 15 2861 4628 0.5 72.56 2.3 59.0 140 55941 2
# 16 16 2280 4669 0.6 72.58 4.5 59.9 114 81787 2
# 17 17 3387 3712 1.6 70.10 10.6 38.5 95 39650 2
# 18 18 3806 3545 2.8 68.76 13.2 42.2 12 44930 2
# 19 19 1058 3694 0.7 70.39 2.7 54.7 161 30920 2
# 20 20 4122 5299 0.9 70.22 8.5 52.3 101 9891 2
# 21 21 5814 4755 1.1 71.83 3.3 58.5 103 7826 2
# 22 22 9111 4751 0.9 70.63 11.1 52.8 125 56817 2
# 23 23 3921 4675 0.6 72.96 2.3 57.6 160 79289 2
# 24 24 2341 3098 2.4 68.09 12.5 41.0 50 47296 2
# 25 25 4767 4254 0.8 70.69 9.3 48.8 108 68995 3
# 26 26 746 4347 0.6 70.56 5.0 59.2 155 145587 3
# 27 27 1544 4508 0.6 72.60 2.9 59.3 139 76483 3
# 28 28 590 5149 0.5 69.03 11.5 65.2 188 109889 3
# 29 29 812 4281 0.7 71.23 3.3 57.6 174 9027 3
# 30 30 7333 5237 1.1 70.93 5.2 52.5 115 7521 3
# 31 31 1144 3601 2.2 70.32 9.7 55.2 120 121412 3
# 32 32 18076 4903 1.4 70.55 10.9 52.7 82 47831 3
# 33 33 5441 3875 1.8 69.21 11.1 38.5 80 48798 3
# 34 34 637 5087 0.8 72.78 1.4 50.3 186 69273 3
# 35 35 10735 4561 0.8 70.82 7.4 53.2 124 40975 3
# 36 36 2715 3983 1.1 71.42 6.4 51.6 82 68782 3
# 37 37 2284 4660 0.6 72.13 4.2 60.0 44 96184 3
# 38 38 11860 4449 1.0 70.43 6.1 50.2 126 44966 4
# 39 39 931 4558 1.3 71.90 2.4 46.4 127 1049 4
# 40 40 2816 3635 2.3 67.96 11.6 37.8 65 30225 4
# 41 41 681 4167 0.5 72.08 1.7 53.3 172 75955 4
# 42 42 4173 3821 1.7 70.11 11.0 41.8 70 41328 4
# 43 43 12237 4188 2.2 70.90 12.2 47.4 35 262134 4
# 44 44 1203 4022 0.6 72.90 4.5 67.3 137 82096 4
# 45 45 472 3907 0.6 71.64 5.5 57.1 168 9267 4
# 46 46 4981 4701 1.4 70.08 9.5 47.8 85 39780 4
# 47 47 3559 4864 0.6 71.72 4.3 63.5 32 66570 4
# 48 48 1799 3617 1.4 69.48 6.7 41.6 100 24070 4
# 49 49 4589 4468 0.7 72.48 3.0 54.5 149 54464 4
# 50 50 376 4566 0.6 70.29 6.9 62.9 173 97203 4
c(1,12) is a vector of length 2, so ID <= c(1,12) recycles it along ID and compares element by element instead of testing a range. We could use between:
library(dplyr)
States <- States %>%
mutate(WAGE_BUCKET=case_when(between(ID, 1, 12) ~ '1',
between(ID, 13,24) ~ '2',
between(ID, 25,37) ~ '3',
between(ID, 38,50) ~ '4',
TRUE ~ NA_character_))
Or another option is to use & with >= and <=
States %>%
mutate(WAGE_BUCKET=case_when(ID >= 1 & ID <= 12 ~ '1',
ID >= 13 & ID <= 24 ~ '2',
ID >= 25 & ID <= 37 ~ '3',
ID >= 38 & ID <= 50 ~ '4',
TRUE ~ NA_character_))
Or maybe the OP meant to use %in%, which needs the full range (1:12) rather than just its endpoints (c(1,12))
States %>%
mutate(WAGE_BUCKET=case_when(ID %in% 1:12 ~ '1',
ID %in% 13:24 ~ '2',
ID %in% 25:37 ~ '3',
ID %in% 38:50 ~ '4',
TRUE ~ NA_character_))
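To see why the comparison in the question misbehaves, it can help to run it on a short vector. This is only an illustration of R's recycling rule, with a six-element ID standing in for the real column:

```r
ID <- 1:6

# c(1, 12) is recycled to 1, 12, 1, 12, 1, 12, so this tests
# 1 <= 1, 2 <= 12, 3 <= 1, 4 <= 12, 5 <= 1, 6 <= 12
res <- ID <= c(1, 12)
res
# [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
```

Odd positions are compared with the first value and even positions with the second, which is why some rows fall through every condition and end up in the fallback branch.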

Conditional subtracting numbers

I have a data frame like this
test <- data.frame(gr=rep(letters[1:2],each=6),No=c(100:105,200:205))
gr No
1 a 100
2 a 101
3 a 102
4 a 103
5 a 104
6 a 105
7 b 200
8 b 201
9 b 202
10 b 203
11 b 204
12 b 205
In the No column the numbers increase within each gr. I need to add 100 to group a and 50 to group b, and then have the values decrease consecutively within each group.
I would like a new column that holds this decreasing series. So I tried
decrese_func <- function(No,gr){
if(any(gr=="a")){
No+100
}
else
No+50
}
test%>%
group_by(gr)%>%
mutate(new_column=decrese_func(No,gr))
# A tibble: 12 x 3
# Groups: gr [2]
gr No new_column
<fct> <int> <dbl>
1 a 100 200
2 a 101 201
3 a 102 202
4 a 103 203
5 a 104 204
6 a 105 205
7 b 200 250
8 b 201 251
9 b 202 252
10 b 203 253
11 b 204 254
12 b 205 255
but what I need is like this
gr No new_column
<fct> <int> <dbl>
1 a 100 200
2 a 101 199
3 a 102 198
4 a 103 197
5 a 104 196
6 a 105 195
7 b 200 250
8 b 201 249
9 b 202 248
10 b 203 247
11 b 204 246
12 b 205 245
I cannot figure out how to get the consecutive decrease.
Thx.
Not the most elegant answer, but in the meantime, this may work:
library(dplyr)
test %>%
mutate(A = case_when(gr == "a" ~ 100,
gr == "b" ~ 50,
TRUE ~ NA_real_)) %>%
group_by(gr) %>%
mutate(B = (1:NROW(gr) - 1) * 2,
New_Column = No + A - B)
# A tibble: 12 x 5
# Groups: gr [2]
gr No A B New_Column
<fct> <int> <dbl> <dbl> <dbl>
1 a 100 100 0 200
2 a 101 100 2 199
3 a 102 100 4 198
4 a 103 100 6 197
5 a 104 100 8 196
6 a 105 100 10 195
7 b 200 50 0 250
8 b 201 50 2 249
9 b 202 50 4 248
10 b 203 50 6 247
11 b 204 50 8 246
12 b 205 50 10 245
Add select(gr, No, New_Column) at the end of the chain to get gr, No and New_Column only. I left the other columns just to show what's going on.
And if you want to wrap it into a function you could do something like:
desc_func <- function(group_var, condition, if_true_add, if_false_add, to_number) {
ifelse(
group_var == condition,
to_number + if_true_add - (1:NROW(group_var) - 1) * 2,
to_number + if_false_add - (1:NROW(group_var) - 1) * 2)
}
test %>%
group_by(gr) %>%
mutate(test_var = desc_func(gr, "a", 100, 50, No))
# A tibble: 12 x 3
# Groups: gr [2]
gr No test_var
<fct> <int> <dbl>
1 a 100 200
2 a 101 199
3 a 102 198
4 a 103 197
5 a 104 196
6 a 105 195
7 b 200 250
8 b 201 249
9 b 202 248
10 b 203 247
11 b 204 246
12 b 205 245
Here is a way to do this in base R
test$New <- with(test, No + c(100, 50)[cumsum(!duplicated(gr))] - 2*(No %% 100))
test$New
#[1] 200 199 198 197 196 195 250 249 248 247 246 245
Or a slight variation with match
with(test, No + c(100, 50)[match(gr, unique(gr))] - 2*(No %% 100))
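The two base R one-liners above lean on No %% 100 happening to give the position within each group for this particular data. A sketch that does not depend on that coincidence, assuming the goal is "first value of the group plus the group's offset, then minus 1 per row" (the names offset, idx and first_no are illustrative):

```r
test <- data.frame(gr = rep(letters[1:2], each = 6),
                   No = c(100:105, 200:205))

# Assumption: per-group amounts to add, as described in the question
offset <- c(a = 100, b = 50)

# 1..n position of each row within its group
idx <- ave(seq_along(test$gr), test$gr, FUN = seq_along)
# first No value of each group, repeated along the group
first_no <- ave(test$No, test$gr, FUN = function(v) v[1])

test$new_column <- first_no + unname(offset[as.character(test$gr)]) - (idx - 1)
test$new_column
```

This gives the same series as the desired output in the question, and keeps working if the No values do not start at a multiple of 100.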

How to flatten out nested list into one list more efficiently instead of using unlist method?

I have a nested list that contains a set of data.frame objects, and I want to flatten it out. I tried the most common approach, unlist, but it does not flatten my list properly and the output is not well represented. How can I do this more efficiently? Does anyone know a trick for this operation? Thanks.
example:
mylist <- list(pass=list(Alpha.df1_yes=airquality[2:4,], Alpha.df2_yes=airquality[3:6,],Alpha.df3_yes=airquality[2:5,],Alpha.df4_yes=airquality[7:9,]),
fail=list(Alpha.df1_no=airquality[5:7,], Alpha.df2_no=airquality[8:10,], Alpha.df3_no=airquality[13:16,],Alpha.df4_no=airquality[11:13,]))
I tried it like this; it runs, but the output is not properly arranged.
res <- lapply(mylist, unlist)
After flattening, I would like to merge them without duplication:
out <- lapply(res, rbind.data.frame)
My desired output:
mylist[[1]]$pass:
Ozone Solar.R Wind Temp Month Day
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
How can I get this sort of flattened output in a cleaner form? Can anyone suggest a way of doing this in R? Thanks a lot.
Using lapply and duplicated:
res <- lapply(mylist, function(i){
x <- do.call(rbind, i)
x <- x[ !duplicated(x), ]  # assign the result, otherwise the duplicates are kept
rownames(x) <- NULL
x
})
res$pass
# Ozone Solar.R Wind Temp Month Day
# 1 36 118 8.0 72 5 2
# 2 12 149 12.6 74 5 3
# 3 18 313 11.5 62 5 4
# 4 NA NA 14.3 56 5 5
# 5 28 NA 14.9 66 5 6
# 6 23 299 8.6 65 5 7
# 7 19 99 13.8 59 5 8
# 8 8 19 20.1 61 5 9
The above still returns a list. If we want to keep everything in one data frame with no lists, then:
res <- do.call(rbind, unlist(mylist, recursive = FALSE))
res <- res[!duplicated(res), ]
res
# Ozone Solar.R Wind Temp Month Day
# pass.Alpha.df1_yes.2 36 118 8.0 72 5 2
# pass.Alpha.df1_yes.3 12 149 12.6 74 5 3
# pass.Alpha.df1_yes.4 18 313 11.5 62 5 4
# pass.Alpha.df2_yes.5 NA NA 14.3 56 5 5
# pass.Alpha.df2_yes.6 28 NA 14.9 66 5 6
# pass.Alpha.df4_yes.7 23 299 8.6 65 5 7
# pass.Alpha.df4_yes.8 19 99 13.8 59 5 8
# pass.Alpha.df4_yes.9 8 19 20.1 61 5 9
# fail.Alpha.df2_no.10 NA 194 8.6 69 5 10
# fail.Alpha.df3_no.13 11 290 9.2 66 5 13
# fail.Alpha.df3_no.14 14 274 10.9 68 5 14
# fail.Alpha.df3_no.15 18 65 13.2 58 5 15
# fail.Alpha.df3_no.16 14 334 11.5 64 5 16
# fail.Alpha.df4_no.11 7 NA 6.9 74 5 11
# fail.Alpha.df4_no.12 16 256 9.7 69 5 12
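The difference between the two unlist calls is worth checking directly: with recursive = FALSE only the outer level is removed, so the data frames survive intact, whereas plain unlist also breaks each data frame into its atomic values. A reduced sketch, using a smaller list than the one in the question (the element names a, b and c are placeholders):

```r
# Smaller stand-in for mylist, built from the same built-in dataset
small <- list(pass = list(a = airquality[2:4, ], b = airquality[3:6, ]),
              fail = list(c = airquality[5:7, ]))

# Remove only the outer level; outer and inner names are joined with "."
flat <- unlist(small, recursive = FALSE)
names(flat)
# [1] "pass.a" "pass.b" "fail.c"

# Every element is still a data frame, so rbind works on the result
all(vapply(flat, is.data.frame, logical(1)))
```

Those joined names are also where the row-name prefixes in the combined output come from.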

row average of columns that match string

I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) will return the indices of all column names that end with R. When we use it with sapply we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols). That binds the first two columns of df to the two new columns of mean values. Wrapping it with setNames(.., c(names(df)[1:2], ...)) sets the column names to match your desired output.
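The same idea can be tried on a cut-down copy of the table. This sketch uses only two rows and four count columns from the question, and a slightly stricter pattern (digits followed by R or G) as a precaution against matching unrelated columns; that pattern and the names r_cols, g_cols and out are assumptions for illustration:

```r
# Two rows and four count columns taken from the question's table;
# check.names = FALSE keeps headers such as "26G" as they are
df <- data.frame(
  Rfam    = c("5_8S_rRNA", "ACA64"),
  Classes = c("rRNA", "Other"),
  `26G` = c(63, 7),  `26R` = c(39, 8),
  `35G` = c(8, 19),  `35R` = c(27, 2),
  check.names = FALSE
)

# Column positions whose names are digits followed by R (or G)
r_cols <- grep("^[0-9]+R$", names(df))
g_cols <- grep("^[0-9]+G$", names(df))

out <- data.frame(df[1:2],
                  avg.rowR = unname(rowMeans(df[r_cols])),
                  avg.rowG = unname(rowMeans(df[g_cols])))
out
```

If the file is read with the default check.names = TRUE, the headers become X26G, X26R and so on, and the pattern would need adjusting accordingly.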
