I have this table right now that looks like this:
Worker | Score
A | 10
A | 20
A | 0
A | 0
A | 0
B | 2
B | 4
B | 0
B | 6
Some of my scores are unavailable, and I've filled those in with 0. Is there a way in R to replace those 0 values with the mean of that particular worker's other scores? The final table should look like this:
Worker | Score
A | 10
A | 20
A | 15 (mean of other scores)
A | 15 (mean of other scores)
A | 15 (mean of other scores)
B | 2
B | 4
B | 4 (mean of other scores)
B | 6
Right now I am thinking of looping through the rows, but I have hundreds of thousands of entries, which would make that very slow and inefficient.
Use ave to find the averages for each Worker and then use replace to substitute the relevant values
replace(x = df$Score, list = df$Score == 0, values =
ave(df$Score, df$Worker, FUN = function(x) sum(x, na.rm = TRUE)/sum(x!=0))[df$Score == 0])
#[1] 10 20 15 15 15 2 4 4 6
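As a cross-check of the same idea, here is a minimal base R sketch that first marks the zeros as NA and then takes per-worker means with ave. It assumes, as in the question, that every 0 stands for a missing score and that each worker has at least one non-zero score. The names score_na, group_mean and filled are only illustrative.

```r
# Same values as in the question
df <- data.frame(Worker = rep(c("A", "B"), times = c(5, 4)),
                 Score  = c(10, 20, 0, 0, 0, 2, 4, 0, 6))

# Treat 0 as missing (assumption taken from the question)
score_na <- replace(df$Score, df$Score == 0, NA)

# Per-worker mean of the available scores, repeated on every row of the group
group_mean <- ave(score_na, df$Worker, FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed scores and fill the gaps with the group mean
filled <- ifelse(is.na(score_na), group_mean, score_na)
filled
```

Going through NA keeps a genuine score of 0 distinguishable from a missing one, should the data ever contain both.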
DATA
df = structure(list(Worker = c("A", "A", "A", "A", "A", "B", "B",
"B", "B"), Score = c(10L, 20L, 0L, 0L, 0L, 2L, 4L, 0L, 6L)), .Names = c("Worker",
"Score"), class = "data.frame", row.names = c(NA, -9L))
One option is na.aggregate from the zoo package. Replace the 0 values in 'Score' with NA, then, grouped by 'Worker', apply na.aggregate on 'Score' to replace each NA with the group mean of 'Score'.
library(data.table)
library(zoo)
setDT(df1)[Score ==0, Score := NA ][, .(Score = na.aggregate(Score)), by = Worker]
# Worker Score
#1: A 10
#2: A 20
#3: A 15
#4: A 15
#5: A 15
#6: B 2
#7: B 4
#8: B 4
#9: B 6
Or it can be made more compact by
setDT(df1)[, .(Score = na.aggregate(Score*NA^!Score)), Worker]
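The `Score*NA^!Score` part of the compact form can look cryptic, so here is a small base R sketch of what it does on its own. The vector and the names is_zero, mask and masked are made up for illustration.

```r
Score <- c(10L, 20L, 0L, 2L)

# !Score is TRUE where Score is 0 and FALSE everywhere else
is_zero <- !Score

# NA^TRUE is NA (NA^1) and NA^FALSE is 1 (NA^0), giving a mask of NA and 1
mask <- NA^is_zero

# Multiplying by the mask turns zeros into NA and leaves other scores unchanged
masked <- Score * mask
masked
```

This is only a shorthand for replacing 0 with NA before na.aggregate is applied; the more explicit `Score == 0` form does the same job.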
data
df1 <- structure(list(Worker = c("A", "A", "A", "A", "A", "B", "B",
"B", "B"), Score = c(10L, 20L, 0L, 0L, 0L, 2L, 4L, 0L, 6L)),
.Names = c("Worker",
"Score"), class = "data.frame", row.names = c(NA, -9L))
Here is another solution with data.table
library("data.table")
df1 <- data.table(Worker = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
Score = c(10L, 20L, 0L, 0L, 0L, 2L, 4L, 0L, 6L))
m <- df1[Score!=0, mean(Score), Worker]
m[df1, on="Worker"][, `:=`(Score=ifelse(Score==0, V1, Score), V1=NULL)][]
I have a dataframe (df) that looks similar to this:
person outcome
a 1
a 1
a 0
a 0
a 0
b 1
b 0
b 1
c 1
c 1
c 0
c 0
c 0
For persons whose last observation is a 0, I would like to remove the trailing 0s plus the last 1, so that the final df looks like this:
person outcome
a 1
b 1
b 0
b 1
c 1
The last three 0s and the last 1 were removed for a and c, but b was left alone because its last observation was a 1. Is there a way to do this, or does it have to be done by hand?
Maybe this helps:
library(dplyr)
df %>%
  group_by(person) %>%
  mutate(new = cumsum(outcome)) %>%
  filter(if (last(outcome) == 0) new < max(new) else TRUE) %>%
  ungroup() %>%
  select(-new)
-output
# A tibble: 5 × 2
person outcome
<chr> <int>
1 a 1
2 b 1
3 b 0
4 b 1
5 c 1
data
df <- structure(list(person = c("a", "a", "a", "a", "a", "b", "b",
"b", "c", "c", "c", "c", "c"), outcome = c(1L, 1L, 0L, 0L, 0L,
1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-13L))
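To make the cumsum logic easier to follow, here is a rough base R sketch of the same filter using ave. It is not the answer's code, only an equivalent under the assumption that outcome contains just 0s and 1s; the names keep and result are illustrative.

```r
df <- data.frame(
  person  = rep(c("a", "b", "c"), times = c(5, 3, 5)),
  outcome = c(1, 1, 0, 0, 0,  1, 0, 1,  1, 1, 0, 0, 0)
)

# For each person, flag the rows to keep (ave returns 0/1 here, hence == 1)
keep <- ave(df$outcome, df$person, FUN = function(x) {
  running <- cumsum(x)             # number of 1s seen so far
  if (x[length(x)] == 0) {
    running < max(running)         # drops the last 1 and everything after it
  } else {
    rep(TRUE, length(x))           # last observation is 1: keep the whole group
  }
}) == 1

result <- df[keep, ]
result
```

The running count reaches its maximum exactly at the last 1, so `running < max(running)` is FALSE from that row onwards, which is why the trailing 0s go with it.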
I have three columns: MPH, Threshold and Car. I'd like to write some code that returns, for each Car, the MPH at the first instance where Threshold is 1.
MPH Threshold Car
30 0 A
31 0 A
32 1 A
33 1 A
34 1 A
35 1 A
30 0 B
31 0 B
32 0 B
33 0 B
34 1 B
35 1 B
Desired Output:
Value Car
32 A
34 B
Assuming you'll always have at least one value where Threshold = 1 for each Car, we can do
library(dplyr)
df %>%
group_by(Car) %>%
slice(which.max(Threshold == 1)) %>%
select(-Threshold)
# MPH Car
# <int> <fct>
#1 32 A
#2 34 B
Or using base R ave
df[with(df, ave(Threshold == 1, Car, FUN = function(x)
seq_along(x) == which.max(x))), ]
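The assumption stated above matters because which.max returns 1 when no element is TRUE, so a Car that never reaches the threshold would silently contribute its first row. A small base R sketch of the difference, with a hypothetical helper first_true and made-up data in which Car B never reaches the threshold:

```r
flag_none <- c(FALSE, FALSE, FALSE)
idx_which_max <- which.max(flag_none)   # 1, although nothing is TRUE
idx_which     <- which(flag_none)[1]    # NA, which is easier to detect

# Hypothetical helper: marks the first TRUE in a group, or nothing at all
first_true <- function(flag) {
  idx <- which(flag)[1]
  !is.na(idx) & seq_along(flag) == idx
}

df <- data.frame(MPH = c(30, 31, 32, 30, 31, 32),
                 Threshold = c(0, 1, 1, 0, 0, 0),
                 Car = rep(c("A", "B"), each = 3))

keep <- ave(df$Threshold == 1, df$Car, FUN = first_true)
res <- df[keep, c("MPH", "Car")]
res
```

With this guard, a Car without any Threshold of 1 simply drops out of the result instead of returning a misleading row.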
You can also do
library(dplyr)
df %>%
filter(Threshold == 1) %>%
subset(!duplicated(Car))
# The same idea with data.table
library(data.table)
dt <- data.table(df)
dt[Threshold == 1, ][!duplicated(Car),]
An option with data.table
library(data.table)
i1 <- setDT(df)[, .I[which(Threshold == 1)[1]], Car]$V1
df[i1, .(Value = MPH, Car)]
# Value Car
#1: 32 A
#2: 34 B
data
df <- structure(list(MPH = c(30L, 31L, 32L, 33L, 34L, 35L, 30L, 31L,
32L, 33L, 34L, 35L), Threshold = c(0L, 0L, 1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 1L, 1L), Car = c("A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-12L))
I have two different data.tables. I need to merge them and, for each value of X, keep the maximum value of each column. Examples of the two tables are given as input below, followed by the expected output.
Input
Table 1
X A B
A 3
B 4 6
C 5
D 9 12
Table 2
X A B
A 1 5
B 6 8
C 7 14
D 5
E 1 1
F 2 3
G 5 6
Expected Output:
X A B
A 3 5
B 6 8
C 7 14
D 9 12
E 1 1
F 2 3
G 5 6
We can rbind the two datasets and do a group by max
library(data.table)
rbindlist(list(tbl1, tbl2))[, lapply(.SD, max, na.rm = TRUE), X]
# X A B
#1: A 3 5
#2: B 6 8
#3: C 7 14
#4: D 9 12
#5: E 1 1
#6: F 2 3
#7: G 5 6
If we are using base R, then use aggregate after rbinding the datasets
aggregate(.~ X, rbind(tbl1, tbl2), max, na.rm = TRUE, na.action = NULL)
NOTE: Assume that the 'A', 'B' columns are numeric and blanks are NA
data
tbl1 <- structure(list(X = c("A", "B", "C", "D"), A = c(3L, 4L, 5L, 9L
), B = c(NA, 6L, NA, 12L)), .Names = c("X", "A", "B"), class = "data.frame",
row.names = c(NA, -4L))
tbl2 <- structure(list(X = c("A", "B", "C", "D", "E", "F", "G"), A = c(1L,
6L, 7L, 5L, 1L, 2L, 5L), B = c(5L, 8L, 14L, NA, 1L, 3L, 6L)), .Names = c("X",
"A", "B"), class = "data.frame",
row.names = c(NA, -7L))
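One caveat with `max(..., na.rm = TRUE)`: if a key has only NA in a column across both tables, the result is -Inf with a warning. The sketch below shows this and a possible guard; safe_max is a hypothetical helper, and the two small tables (including the extra key "H") are made up for illustration.

```r
vals <- c(NA_integer_, NA_integer_)
all_na_max <- suppressWarnings(max(vals, na.rm = TRUE))   # -Inf

# Hypothetical helper: return NA instead of -Inf when nothing is available
safe_max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)

tbl1 <- data.frame(X = c("A", "H"), A = c(3L, 2L),
                   B = c(NA_integer_, NA_integer_))
tbl2 <- data.frame(X = c("A", "E"), A = c(1L, 1L), B = c(5L, 1L))

# Same rbind-then-aggregate pattern as above, with the guarded maximum
out <- aggregate(. ~ X, rbind(tbl1, tbl2), safe_max, na.action = NULL)
out
```

In the example data from the question every key has at least one value per column, so the plain `max` version is sufficient there.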
For a sample dataframe:
df1 <- structure(list(i.d = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), group = c(1L,
1L, 2L, 1L, 3L, 3L, 2L, 2L, 1L), cat = c(0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, NA)), .Names = c("i.d", "group", "cat"), class = "data.frame", row.names = c(NA,
-9L))
I wish to add an additional column to my dataframe ("pc.cat") which records the percentage of 1s in column cat by the group variable.
For example, there are four values in group 1 (i.d's a, b, d and i). Value 'i' is NA so this can be ignored for now. Only one of the three remaining values is 1, so the percentage would read 33.33 (to 2 dp). This value will be populated into column 'pc.cat' for all the rows in group 1 (even the NA rows). The process would then be repeated for the other groups (2 and 3).
If anyone could help me with the code for this I would greatly appreciate it.
This can be accomplished with the ave function:
df1$pc.cat <- ave(df1$cat, df1$group, FUN=function(x) 100*mean(na.omit(x)))
df1
# i.d group cat pc.cat
# 1 a 1 0 33.33333
# 2 b 1 0 33.33333
# 3 c 2 1 66.66667
# 4 d 1 1 33.33333
# 5 e 3 0 0.00000
# 6 f 3 0 0.00000
# 7 g 2 1 66.66667
# 8 h 2 0 66.66667
# 9 i 1 NA 33.33333
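Since the question asks for the value to 2 dp, here is a small variation of the ave call with rounding added. It relies on the mean of a 0/1 vector being the proportion of 1s; the helper name pct_ones is only illustrative, and the data frame is rebuilt with plain vectors for brevity.

```r
df1 <- data.frame(i.d   = letters[1:9],
                  group = c(1, 1, 2, 1, 3, 3, 2, 2, 1),
                  cat   = c(0, 0, 1, 1, 0, 0, 1, 0, NA))

# Percentage of 1s among the non-NA values, rounded to 2 decimal places
pct_ones <- function(x) round(100 * mean(x, na.rm = TRUE), 2)

# ave repeats the group value on every row, including rows where cat is NA
df1$pc.cat <- ave(df1$cat, df1$group, FUN = pct_ones)
df1
```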
library(data.table)
setDT(df1)
# Percentage of 1s per group, ignoring NA rows (returns one row per group)
df1[!is.na(cat), .(pc.cat = 100 * mean(cat)), by = group]
With data.table:
library(data.table)
DT <- data.table(df1)
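# Divide by the number of non-NA values so that NA rows are ignored, as the question asks
DT[, list(pc.cat = 100 * sum(cat, na.rm = TRUE) / sum(!is.na(cat))), by = "group"]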
I have a matrix (similar to a wig file) like this:
Position reference A C G T N sum(total read counts)
68773265 A 1 0 0 0 0 1
68773266 C 0 1 0 1 0 2
68773267 C 0 1 1 2 0 4
To get the variant (non-reference) allele ratio,
I want to compute this for each position: (sum - count of the reference base) / sum * 100
Position reference frequency(%) sum(total read counts)
68773265 A 0 1
68773266 C 50 2
68773267 C 75 4
Please give me some advice on this problem. Thanks in advance!!
Take the subset of count column names ("nm1") and match the "reference" column against "nm1" to get a column index; cbind this with 1:nrow(df1) to create a row/column index matrix. Get the rowSums of the "nm1" columns ("Sum1") and use this to create "frequencyPercent" based on the formula in the post.
nm1 <- c('A', 'C', 'G', 'T') # this could include `N` also
indx <- cbind(1:nrow(df1), match(df1$reference, nm1))
Sum1 <- rowSums(df1[nm1])
data.frame(df1[1:2], frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1,
SumTotalCounts=df1[,ncol(df1)])
Or use transform on the original dataset
transform(df1, frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1,
check.names=FALSE)[c(1:2,8:9)]
# Position reference sum(total read counts) frequencyPercent
#1 68773265 A 1 0
#2 68773266 C 2 50
#3 68773267 C 4 75
data
df1 <- structure(list(Position = 68773265:68773267, reference = c("A",
"C", "C"), A = c(1L, 0L, 0L), C = c(0L, 1L, 1L), G = c(0L, 0L,
1L), T = 0:2, N = c(0L, 0L, 0L), `sum(total read counts)` = c(1L,
2L, 4L)), .Names = c("Position", "reference", "A", "C", "G",
"T", "N", "sum(total read counts)"), class = "data.frame",
row.names = c(NA, -3L))
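The key step above is indexing with a two-column matrix of (row, column) pairs, which picks one cell per row. A stripped-down sketch on a plain matrix, with the N column left out for brevity and the names counts, ref_count and variant_pct chosen only for illustration:

```r
counts <- matrix(c(1, 0, 0, 0,
                   0, 1, 0, 1,
                   0, 1, 1, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("A", "C", "G", "T")))
reference <- c("A", "C", "C")

# One (row, column) pair per position: row i and the column of its reference base
indx <- cbind(seq_len(nrow(counts)), match(reference, colnames(counts)))

ref_count   <- counts[indx]        # count of the reference base at each position
total       <- rowSums(counts)     # total read count at each position
variant_pct <- 100 * (total - ref_count) / total
variant_pct
```

A position with a total of 0 would give NaN here, so such rows may need separate handling in real data.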