Create new column using condition on another existing column - R

I have data like this
Time chamber
9 1
10 2
11 3
12 4
13 5
14 6
15 7
16 8
17 9
18 10
19 11
20 12
21 1
22 2
23 3
24 4
I want to create a new column using conditions on another existing column (chamber).
It should look something like this
Time chamber treatment
9 1 c2t2
10 2 c2t2
11 3 c0t0r
12 4 c2t2r
13 5 c2t2r
14 6 c0t0
15 7 c0t0r
16 8 c0t0r
17 9 c2t2
18 10 c2t2r
19 11 c0t0
20 12 c0t0
21 1 c2t2
22 2 c2t2
23 3 c0t0r
24 4 c2t2r
For chambers 1, 2, 9: the treatment is c2t2.
For chambers 3, 7, 8: the treatment is c0t0r.
For chambers 4, 5, 10: the treatment is c2t2r.
For chambers 6, 11, 12: the treatment is c0t0.
I have also made a lookup table, but I don't know how to use it:
lookup_table <- data.frame(
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"),
  new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R", "C2T2R", "C0T0",
              "C0T0R", "C0T0R", "C2T2", "C2T2R", "C0T0", "C0T0"),
  stringsAsFactors = FALSE)

Assuming "dt" is your dataframe name, then you can use dplyr with case_when
library(tidyverse)
dt %>%
  mutate(newcol = case_when(chamber %in% c(1, 2, 9) ~ "c2t2",
                            chamber %in% c(3, 7, 8) ~ "c0t0r",
                            chamber %in% c(4, 5, 10) ~ "c2t2r",
                            chamber %in% c(6, 11, 12) ~ "c0t0"))
Output:
Time chamber newcol
1 9 1 c2t2
2 10 2 c2t2
3 11 3 c0t0r
4 12 4 c2t2r
5 13 5 c2t2r
6 14 6 c0t0
7 15 7 c0t0r
8 16 8 c0t0r
9 17 9 c2t2
10 18 10 c2t2r
11 19 11 c0t0
12 20 12 c0t0
13 21 1 c2t2
14 22 2 c2t2
15 23 3 c0t0r
16 24 4 c2t2r
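Note that case_when() returns NA for any chamber value that matches none of the conditions. If you want to make that explicit (or supply a different default), a catch-all condition can be added; a minimal sketch building on the code above:
dt %>%
  mutate(newcol = case_when(chamber %in% c(1, 2, 9) ~ "c2t2",
                            chamber %in% c(3, 7, 8) ~ "c0t0r",
                            chamber %in% c(4, 5, 10) ~ "c2t2r",
                            chamber %in% c(6, 11, 12) ~ "c0t0",
                            TRUE ~ NA_character_))  # anything else becomes NA explicitly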

You can merge your df with the lookup_table. In my experience, merge() is the command I like to use for combining data.frames, though note that there are many other ways and specialised packages for the same purpose.
You need to specify which column to use as the 'matching column' and also that you want to keep all records in df:
merge(df, lookup_table, all.x = TRUE, by.x = "chamber", by.y = "row.names")
Data:
df <- structure(list(Time = 9:24, chamber = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L)),
.Names = c("Time", "chamber"), class = "data.frame",
row.names = c(NA, -16L))
lookup_table <- structure(list(new_col = c("C2T2", "C2T2", "C0T0R", "C2T2R",
"C2T2R", "C0T0", "C0T0R", "C0T0R",
"C2T2", "C2T2R", "C0T0", "C0T0")),
.Names = "new_col",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"), class = "data.frame")

Related

Can't remove NAs

I am trying to delete rows in my dataset that contain NAs, but none of the functions work. What could be the reason?
Here is a sample of my code:
Site_cov<- read.csv("site_cov.csv")
colnames(Site_cov)<- c("Point", "Basal", "Short.Saps", "Tall.Saps")
head(Site_cov)
Point Basal Short.Saps Tall.Saps
1 DEL001 Na 2 0
2 DEL002 Na 1 6
3 DEL003 Na 0 5
4 DEL004 10 21 22
Here, I thought that upper- and lower-case NAs could be a problem, and this is what I ran:
Site_cov$Basal<-toupper(Site_cov$Basal)
Site_cov$Short.Saps<-toupper(Site_cov$Short.Saps)
Site_cov$Tall.Saps<-toupper(Site_cov$Tall.Saps)
Then I try to delete the NAs:
Site_cov_NA <- Site_cov[complete.cases(Site_cov[ , c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
But the NAs are still there:
head(Site_cov_NA)
Point Basal Short.Saps Tall.Saps
1 DEL001 NA 2 0
2 DEL002 NA 1 6
3 DEL003 NA 0 5
4 DEL004 10 21 22
5 DEL005 60 8 17
6 DEL006 80 17 13
Obviously you have 'Na' strings that are fake NAs. Replace them with real ones, and then your code should work.
dat <- replace(dat, dat == 'Na', NA)
dat[complete.cases(dat[, c("Point", "Basal", "Short.Saps", "Tall.Saps")]), ]
# Point Basal Short.Saps Tall.Saps
# 4 DEL004 10 21 22
Data:
dat <- structure(list(Point = c("DEL001", "DEL002", "DEL003", "DEL004"
), Basal = c("Na", "Na", "Na", "10"), Short.Saps = c(2L, 1L,
0L, 21L), Tall.Saps = c(0L, 6L, 5L, 22L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Try the complete.cases() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/complete.cases.html)
try <- data.frame(a = c(1, 3, NA, NA), b = c(3, 5, 2, 3))
try1 <- try[complete.cases(try), ]
try1

Calculate value using row with pattern

I have these inputs:
df
# A tibble: 53 x 2
Task Frame
<chr> <int>
1 S101-10061 6
2 S101-10061-74716 16
3 S101-10065 18
4 S101-10065-104934 16
5 S101-10071 32
6 S101-10071-104898 74
7 S101-10072 8
8 S101-10072-79124 58
9 S101-10074 38
10 S101-10075 82
As you can see, the first 10 characters of "Task" are sometimes the same. I need to detect when tasks match on those first 10 characters, for example task 1 (S101-10061) and task 2 (S101-10061-74716), and when they do, take the absolute difference of their Frame values, in this example 16 - 6 = 10. So I expect something like:
Task Frame Diff
<chr> <int> <int>
1 S101-10061 6 6
2 S101-10061-74716 16
3 S101-10065 18 2
4 S101-10065-104934 16
5 S101-10071 32 24
6 S101-10071-104898 74
7 S101-10072 8 50
8 S101-10072-79124 58
9 S101-10074 38
10 S101-10075 82
I tried:
df %>%
  mutate(Diff = accumulate(Frame[1:n()], function(x, y) abs(x - y)))
But it doesn't help. How do I compare rows by this pattern? Any ideas?
Here is a dplyr solution.
library(dplyr)
df %>%
  mutate(Task = substr(Task, 1, 10)) %>%
  group_by(Task) %>%
  mutate(Diff = abs(Frame - lead(Frame)))
Data:
df <-
structure(list(Task = c("S101-10061", "S101-10061-74716", "S101-10065",
"S101-10065-104934", "S101-10071", "S101-10071-104898", "S101-10072",
"S101-10072-79124", "S101-10074", "S101-10075"), Frame = c(6L,
16L, 18L, 16L, 32L, 74L, 8L, 58L, 38L, 82L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Dividing grouped data by group means - R

I have data split up into two categories; my data frame z looks like this:
Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X
I'd like to divide each value of Tracer by the group mean depending on which group it belongs to (e.g. all values of Tracer belonging to time=0 and treatment=S are divided by their mean).
The procedure would be something like this:
1: Find the category means:
aggmeanz <- aggregate(z$Tracer, list(time=z$time, treatment=z$treatment), FUN=mean)
2: Divide z$Tracer by the correct aggmeanz value.
Data:
structure(list(Tracer = c(15L, 20L, 25L, 4L, 55L, 16L, 15L, 20L
), time = c(0L, 0L, 0L, 0L, 15L, 15L, 15L, 15L), treatment = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("S", "X"), class = "factor")), .Names = c("Tracer",
"time", "treatment"), class = "data.frame", row.names = c(NA,
-8L))
Alternatively, here is a dplyr solution:
library(dplyr)
group_by(z,time,treatment) %>%
mutate(pmean=Tracer/mean(Tracer))
Output:
Tracer time treatment pmean
(int) (int) (fctr) (dbl)
1 15 0 S 0.8571429
2 20 0 S 1.1428571
3 25 0 X 1.7241379
4 4 0 X 0.2758621
5 55 15 S 1.5492958
6 16 15 S 0.4507042
7 15 15 X 0.8571429
8 20 15 X 1.1428571
Data:
z <- read.table(text="Tracer time treatment
15 0 S
20 0 S
25 0 X
04 0 X
55 15 S
16 15 S
15 15 X
20 15 X",head=TRUE)
Is it ok to use non-base tools? With data.table installed and loaded:
z <- data.table(z)
z[, scaledTracer := Tracer/mean(Tracer), by = c("time","treatment")]
This computes means for each unique combination of time and treatment (which appear to be groups of 2 rows in your data) and scales the Tracer values in each group by the appropriate mean.
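For completeness, a self-contained sketch of the same idea using setDT(), which converts z by reference instead of copying (assuming z as in the question):
library(data.table)
setDT(z)                                                   # convert z to a data.table in place
z[, scaledTracer := Tracer/mean(Tracer), by = .(time, treatment)]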
It's not the prettiest but:
groupmeans = aggregate(z$Tracer, by = list(z$time, z$treatment), FUN = mean)
Group.1 Group.2 x
0 S 17.5
15 S 35.5
0 X 14.5
15 X 17.5
names(groupmeans) = c("time", "treatment", "groupmean")
z = merge(z, groupmeans, by = c("time", "treatment"))
z$tracer_div = z$Tracer/z$groupmean
time treatment Tracer groupmean tracer_div
0 S 15 17.5 0.8571429
0 S 20 17.5 1.1428571
0 X 25 14.5 1.7241379
0 X 4 14.5 0.2758621
15 S 55 35.5 1.5492958
15 S 16 35.5 0.4507042
15 X 15 17.5 0.8571429
15 X 20 17.5 1.1428571
You could assign the result back to z$Tracer instead if you didn't want to create a whole new column. It can be nice to keep every step, though, in case you want to use it in another calculation or plot later.
A base R solution:
do.call(c, lapply(split(z[1], z[, -1]), FUN = function(x) x[[1]]/mean(x[[1]])))
# 0.S1 0.S2 15.S1 15.S2 0.X1 0.X2 15.X1 15.X2
#0.8571429 1.1428571 1.5492958 0.4507042 1.7241379 0.2758621 0.8571429 1.1428571
Split into time/treatment groups first, then divide each group by its mean, and finally glue everything back together with c().
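Another compact base R option is ave(), which returns each group's mean aligned with the original rows; a sketch, assuming z as above (pmean mirrors the column name in the dplyr answer):
# ave() computes mean(Tracer) within each time/treatment group, row-aligned with z
z$pmean <- z$Tracer / ave(z$Tracer, z$time, z$treatment, FUN = mean)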

Sort data by factor and output into a matrix (or df) - R

I have looked through other posts and I think I have an idea of what I could do, but I want to be clear!
I have a very large data frame that contains 4 variables and a number of rows.
Chain ResId ResNum Energy
1 C O17 500 -37.03670
2 A ARG 8 -0.84560
3 A LEU 24 -0.56739
4 A ASP 25 -0.98583
5 B ARG 8 -0.64880
6 B LEU 24 -0.58380
7 B ASP 25 -0.85930
Each row contains Chain (A, B, or C), ResId, ResNum, and Energy. I would like to sort this data so that all of the energy values belonging to a specific ResId and ResNum are clustered together across chains. By cluster I mean that all of the values for "ARG 8" are grouped, or all of the rows containing "ARG 8" are grouped; I don't know which is more efficient. Ideally, I would like the output for each residue to be
ARG 8
0.000
0.000
0.000
where the "0.000" are the energy values for ARG 8 or O17 and so on.
Sorry for the header breaks, I wanted the data to be clean, but I can't insert images.
data
structure(list(Chain = structure(c(3L, 1L, 1L, 1L, 2L, 2L, 2L
), .Label = c("A", "B", "C"), class = "factor"), ResId = structure(c(4L,
1L, 3L, 2L, 1L, 3L, 2L), .Label = c("ARG", "ASP", "LEU", "O17"
), class = "factor"), ResNum = c(500L, 8L, 24L, 25L, 8L, 24L,
25L), Energy = c(-37.0367, -0.8456, -0.56739, -0.98583, -0.6488,
-0.5838, -0.8593)), .Names = c("Chain", "ResId", "ResNum", "Energy"
), class = "data.frame", row.names = c(NA, -7L))
If you want to convert to wide format
library(reshape2)
dcast(df, ResId+ResNum~paste0('Energy.',Chain), value.var='Energy')
# ResId ResNum Energy.A Energy.B Energy.C
#1 ARG 8 -0.84560 -0.6488 NA
#2 ASP 25 -0.98583 -0.8593 NA
#3 LEU 24 -0.56739 -0.5838 NA
#4 O17 500 NA NA -37.0367
After your edit, the output you are most likely looking for is:
library(reshape2)
dcast(df, ResId~Chain, value.var= 'Energy')
ResId A B C
1 ARG -0.84560 -0.6488 NA
2 ASP -0.98583 -0.8593 NA
3 LEU -0.56739 -0.5838 NA
4 O17 NA NA -37.0367
This will put the values together. You can further specify based on your desired output.
df[order(df$ResId), ]
Chain ResId ResNum Energy
2 A ARG 8 -0.84560
5 B ARG 8 -0.64880
4 A ASP 25 -0.98583
7 B ASP 25 -0.85930
3 A LEU 24 -0.56739
6 B LEU 24 -0.58380
1 C O17 500 -37.03670
#With dplyr
library(dplyr)
df %>%
arrange(ResId)
Chain ResId ResNum Energy
1 A ARG 8 -0.84560
2 B ARG 8 -0.64880
3 A ASP 25 -0.98583
4 B ASP 25 -0.85930
5 A LEU 24 -0.56739
6 B LEU 24 -0.58380
7 C O17 500 -37.03670
Data
df <- read.table(text = '
Chain ResId ResNum Energy
C O17 500 -37.0367
A ARG 8 -0.8456
A LEU 24 -0.56739
A ASP 25 -0.98583
B ARG 8 -0.6488
B LEU 24 -0.5838
B ASP 25 -0.8593', header=T)
Try this:
df <- df[order(df$Chain, df$ResId, df$ResNum),]
where df is the name of your dataframe. This should order it for you.
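As a more current alternative to reshape2::dcast(), tidyr's pivot_wider() produces the same wide layout; a sketch, assuming the df defined above:
library(tidyr)
# one row per ResId/ResNum, one Energy column per Chain level
pivot_wider(df, id_cols = c(ResId, ResNum), names_from = Chain, values_from = Energy)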

R - ddply summarise using nlevels() does not work

When using the plyr package to summarise my data, it seems impossible to use the nlevels() function.
The structure of my data set is as follows:
>aer <- read.xlsx("XXXX.xlsx", sheetIndex=1)
>aer$ID <- as.factor(aer$ID)
>aer$description <- as.factor(aer$description)
>head(aer)
ID SOC start end days count severity relation
1 1 410 2015-04-21 2015-04-28 7 1 1 3
2 1 500 2015-01-30 2015-05-04 94 1 1 3
3 1 600 2014-11-25 2014-11-29 4 1 1 3
4 1 600 2015-01-02 2015-01-07 5 1 1 3
5 1 600 2015-01-26 2015-03-02 35 1 1 3
6 1 600 2015-04-14 2015-04-17 3 1 1 3
> dput(head(aer,4))
structure(list(ID = structure(c(1L, 1L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "12", "13", "14",
"15"), class = "factor"), SOC = c(410, 500, 600, 600),
start = structure(c(16546, 16465, 16399, 16437), class = "Date"),
end = structure(c(16553, 16559, 16403, 16442), class = "Date"),
days = c(7, 94, 4, 5), count = c(1, 1, 1, 1), severity = c(1,
1, 1), relation = c(3, 3, 3, 3)), .Names = c("ID", "SOC",
"start", "end", "days", "count", "severity", "relation"
), row.names = c(NA, 4L), class = "data.frame")
What I would like to know is how many levels of the "ID" variable exist in each of the subsets created when dividing the data set by the variable "SOC". I want to summarise this information together with some other variables in a new data set. Therefore, I would like to use the plyr package like so:
summaer2 <- ddply(aer, c("SOC"), summarise,
participants = nlevels(ID),
events = sum(count),
min_duration = min(days),
max_duration = max(days),
max_severity = max(severity))
This returns the following error:
Error in Summary.factor(c(4L, 5L, 11L, 11L, 14L, 14L), na.rm = FALSE) :
‘max’ not meaningful for factors
Could someone give me advice on how to reach my goal? Or what I'm doing wrong?
Many thanks in advance!
Update:
Substituting nlevels(ID) with length(unique(ID)) seems to give me the desired output:
> head(summaer2)
SOC participants events min_duration max_duration max_severity
1 100 4 7 1 62 2
2 410 9 16 1 41 2
3 431 2 2 109 132 1
4 500 5 9 23 125 2
5 600 8 19 1 35 1
6 1040 1 1 98 98 2
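For reference, a dplyr equivalent of the working summary; a sketch, assuming the same aer data frame (n_distinct() plays the role of length(unique(...))):
library(dplyr)
summaer2 <- aer %>%
  group_by(SOC) %>%
  summarise(participants = n_distinct(ID),   # distinct IDs per SOC
            events = sum(count),
            min_duration = min(days),
            max_duration = max(days),
            max_severity = max(severity))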
