I want to create a new column, "non_coded", from three existing columns: allele_2, allele_1 and A1.
The conditions I want satisfied are:
if allele_2 == A1 then non_coded = allele_1
if allele_2 != A1 then non_coded = allele_2
Thanks in advance,
Rad
OK This is what the data looks like:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000
The code I tried:
Hy_MVA$non_coded <- ifelse(Hy_MVA$allele_2 == Hy_MVA$A1, Hy_MVA$allele_1, Hy_MVA$allele_2)
result:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P non_coded
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090 3
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000 3
What I want:
SNPID chrom STRAND IMPUTED allele_2 allele_1 MAF CALL_RATE HET_RATE
1 rs1000000 12 + Y A G 0.12160 1.00000 0.2146
2 rs10000009 4 + Y G A 0.07888 0.99762 0.1386
HWP RSQ PHYS_POS A1 M1_FRQ M1_INFO M1_BETA M1_SE M1_P non_coded
1 1.0000 0.9817 125456933 A 0.1173 0.9452 -0.0113 0.0528 0.83090 G
2 0.1164 0.8354 71083542 A 0.9048 0.9017 -0.0097 0.0593 0.87000 G
As Chase said, use ifelse(). I guess the code then becomes:
non_coded <- ifelse(allele_2 == A1, allele_1, allele_2)
Edit
After seeing the updated question, it makes sense that you get numbers: allele_1 and allele_2 are factors, and ifelse() returns their underlying integer codes. Wrapping them in as.character() should fix this:
A1 <- c("A","A","B")
allele_1 <- as.factor(c("A","C","C"))
allele_2 <- as.factor(c("A","B","B"))
non_coded <- ifelse(allele_2 == A1, as.character(allele_1), as.character(allele_2))
non_coded
[1] "A" "B" "C"
Since you want non_coded to be one of two values:
Hy_MVA$non_coded <- Hy_MVA$allele_2
Hy_MVA$non_coded[Hy_MVA$allele_2 == Hy_MVA$A1] <- Hy_MVA$allele_1[Hy_MVA$allele_2 == Hy_MVA$A1]
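That replaces values with allele_1 values only in the rows where allele_2 == A1. It sounds as though your problem is ifelse converting a factor to its numeric codes. If allele_1 and allele_2 are factors with different levels, convert them with as.character() before assigning, otherwise values that are not existing levels become NA.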
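To make the factor issue concrete, here is a minimal sketch on a made-up two-row data frame (toy is a hypothetical stand-in for Hy_MVA, with A1 assumed to be character). It shows ifelse() returning integer codes and two ways to get letters back:

```r
# Hypothetical stand-in for Hy_MVA: two factor columns and a character A1
toy <- data.frame(
  allele_2 = factor(c("A", "G")),
  allele_1 = factor(c("G", "A")),
  A1       = c("A", "A"),
  stringsAsFactors = FALSE
)

same <- toy$allele_2 == toy$A1   # TRUE where allele_2 is the coded allele

# ifelse() drops the factor attributes, so only the integer codes survive
codes <- ifelse(same, toy$allele_1, toy$allele_2)

# Option 1: convert to character inside ifelse()
fixed <- ifelse(same, as.character(toy$allele_1), as.character(toy$allele_2))

# Option 2: index assignment on character vectors
non_coded <- as.character(toy$allele_2)
non_coded[same] <- as.character(toy$allele_1)[same]

toy$non_coded <- non_coded
```

Both options should give the same character column; which one you use is mostly a matter of taste.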
I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.
ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")
df <- data.frame(ID,locus,Acceptable,Unacceptable)
I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.
I already have a poorly optimized method to find the duplicates between the two columns:
df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))
#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub("  "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)
(If anyone knows a faster method to return these duplicates, please let me know!)
I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.
Any suggestions?
A similar approach using setdiff:
lA <- strsplit(df$Acceptable, " ")
lU <- strsplit(df$Unacceptable, " ")
df$Unique_values <- vapply(seq_len(nrow(df)), function(i) paste0(setdiff(lU[[i]], lA[[i]]), collapse = " "), character(1))
df
#> ID locus Acceptable Unacceptable Unique_values
#> 1 1 A K44 Z33 G49 K44 Z33 G48 G48
#> 2 1 B K72 QR123 K72 QR123 B22 B22
#> 3 1 C B92 N12 PQ16 G99 B92 N12 PQ16 G99
#> 4 2 A V3 V3 N9 Q7 N9 Q7
#> 5 2 B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29
#> 6 2 C B8 B8
#> 7 3 A K44 Z33 K72 K44 Z33
#> 8 3 B B92 PQ14
#> 9 3 C J22 M43 VC78 J22 M43 VC78
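As for the faster way to get the duplicates that the question asks about, the same split-and-compare pattern should work with intersect() in place of setdiff(). A minimal sketch on the first three example rows only (the mapply form is my own choice, not part of the answer above):

```r
# Subset of the question's data, kept as character columns
df <- data.frame(
  Acceptable   = c("K44 Z33 G49", "K72 QR123", "V3"),
  Unacceptable = c("K44 Z33 G48", "K72 QR123 B22", "V3 N9 Q7"),
  stringsAsFactors = FALSE
)

lA <- strsplit(df$Acceptable, " ", fixed = TRUE)
lU <- strsplit(df$Unacceptable, " ", fixed = TRUE)

# Values only in Unacceptable (row by row)
df$Unique_values <- mapply(
  function(u, a) paste(setdiff(u, a), collapse = " "),
  lU, lA, USE.NAMES = FALSE
)

# Values present in both columns (row by row)
df$Duplicate_values <- mapply(
  function(u, a) paste(intersect(u, a), collapse = " "),
  lU, lA, USE.NAMES = FALSE
)
```

This avoids the paste/duplicated/gsub clean-up entirely, since each row is compared as two token vectors.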
I have a dataframe in R
> df
Dataset a1 b2 c3
1 past 0.0029 0.00250 0.0011
2 present 0.0035 0.00078 -0.0018
3 future1 0.0020 0.02100 0.0200
4 future2 0.0390 0.04000 0.0460
How can I multiply the columns of a1,b2 and c3 with value of 4193215.58165948 ?
You can use @Otto Kässi's approach and use cbind to bind the result with your Dataset column.
df <- cbind(Dataset = df$Dataset, df[,2:4] * 4193215.58165948)
You can do
df[,2:4] <- df[,2:4] * 4193215.58165948
Or a more explicit (and safe) approach:
v <- c("a1", "b2", "c3")
df[,v] <- df[,v] * 4193215.58165948
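If you would rather not hard-code column positions or names, one option is to pick the numeric columns programmatically. A small sketch using the data from the question (the names k, num and scaled are just illustrative):

```r
df <- data.frame(
  Dataset = c("past", "present", "future1", "future2"),
  a1 = c(0.0029, 0.0035, 0.0020, 0.0390),
  b2 = c(0.00250, 0.00078, 0.02100, 0.04000),
  c3 = c(0.0011, -0.0018, 0.0200, 0.0460),
  stringsAsFactors = FALSE
)

k <- 4193215.58165948

# Logical vector marking which columns are numeric
num <- vapply(df, is.numeric, logical(1))

# Multiply only those columns; Dataset is left untouched
scaled <- df
scaled[num] <- df[num] * k
```

This assumes every numeric column should be scaled, which may not hold if the data frame also carries numeric IDs.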
I have a data frame df that has an ID-column (called SNP) in which some IDs occur more than once.
CHR SNP A1 A2 MAF NCHROBS
1: 1 1:197 C T 0.3148 314
2: 1 1:205 G C 0.2058 314
3: 1 1:206 A C 0.0000 314
4: 1 1:219 C G 0.8472 314
5: 1 1:223 A C 0.7265 314
6: 1 1:224 G T 0.3295 314
7: 1 1:197 C T 0.3148 314
8: 1 1:205 G C 0.0000 314
9: 1 1:206 A C 0.0000 314
10: 1 1:219 C G 0.0000 314
11: 1 1:223 A C 0.0000 314
12: 1 1:224 G T 0.0000 314
13: 1 1:197 C T 0.4753 314
14: 1 1:205 G C 0.1964 314
15: 1 1:206 A C 0.0000 314
16: 1 1:219 C G 0.6594 314
17: 1 1:223 A C 0.8946 314
18: 1 1:224 G T 0.2437 314
I would like to calculate the mean and standard deviation (SD) from the values in the MAF-column that share the same ID.
df <-
list.files(pattern = "*.csv") %>%
map_df(~fread(.))
colMeans(df, rows=df$SNP == "1:197", cols=df$MAF)
Why is it not possible to specify values based on conditions with colMeans?
Since you have a data.table,
df[, .(mu = mean(MAF), sigma = sd(MAF)), by = .(SNP) ]
# SNP mu sigma
# 1: 1:197 0.3683000 0.09266472
# 2: 1:205 0.1340667 0.11620023
# 3: 1:206 0.0000000 0.00000000
# 4: 1:219 0.5022000 0.44493914
# 5: 1:223 0.5403667 0.47545926
# 6: 1:224 0.1910667 0.17093936
If you prefer base (despite using data.table), then
aggregate(df$MAF, list(df$SNP), function(a) c(mu = mean(a), sigma = sd(a)))
# Group.1 x.mu x.sigma
# 1 1:197 0.36830000 0.09266472
# 2 1:205 0.13406667 0.11620023
# 3 1:206 0.00000000 0.00000000
# 4 1:219 0.50220000 0.44493914
# 5 1:223 0.54036667 0.47545926
# 6 1:224 0.19106667 0.17093936
Using dplyr
library(dplyr)
df %>%
group_by(SNP) %>%
summarise(mean = mean(MAF),
sd = sd(MAF))
Gives us:
SNP mean sd
<chr> <dbl> <dbl>
1 1:197 0.368 0.0927
2 1:205 0.134 0.116
3 1:206 0 0
4 1:219 0.502 0.445
5 1:223 0.540 0.475
6 1:224 0.191 0.171
To answer your question as to why colMeans is not working:
If you look at the documentation of colMeans using ?colMeans, you will see that you are passing argument names that do not exist. The usage is colMeans(x, na.rm=FALSE, dims=1); there are no arguments named rows or cols, so when you run your code you get the unused arguments error.
As to whether it is possible to pass conditions to colMeans: you have to apply them to df first, i.e. pass a subset of df as follows:
colMeans(df[df$SNP == "1:197", "MAF", drop=FALSE], na.rm=FALSE, dims=1)
Note that drop=FALSE is important here because you are subsetting a single column. In that case the [ operator simplifies the result and converts the data frame to a numeric vector, whereas drop=FALSE preserves the dimensions of the original data frame.
If a numeric vector is passed to colMeans you will get an error, because colMeans requires x to have at least two dimensions.
As to the other question of how to calculate the mean per group, the other answers in this thread show several good approaches; any of them works, so you just have to choose one.
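A quick way to see the drop behaviour for yourself, sketched on a plain data.frame built from the 1:197 and 1:205 rows of the question (d is a made-up name, and a plain data.frame is an assumption, since the question's object is a data.table):

```r
d <- data.frame(
  SNP = c("1:197", "1:205", "1:197", "1:205", "1:197", "1:205"),
  MAF = c(0.3148, 0.2058, 0.3148, 0.0000, 0.4753, 0.1964),
  stringsAsFactors = FALSE
)

# With drop = FALSE the subset is still a one-column data frame
sub_df <- d[d$SNP == "1:197", "MAF", drop = FALSE]
m <- colMeans(sub_df)

# Without it the subset collapses to a plain numeric vector,
# and colMeans() refuses anything with fewer than two dimensions
vec <- d[d$SNP == "1:197", "MAF"]
err <- tryCatch(colMeans(vec), error = function(e) conditionMessage(e))
```

Here m holds the mean for 1:197 and err holds the error message from the second call.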
You could use the tapply() function, with the MAF values as the first argument and SNP as the grouping variable:
mean.CHR=tapply(df$MAF,df$SNP,mean)
sd.CHR=tapply(df$MAF,df$SNP,sd)
I have two data.frame x1 & x2. I want to remove rows from x2 if there is a common gene found in x1 and x2
x1 <- chr start end Genes
1 8401 8410 Mndal,Mnda,Ifi203,Ifi202b
2 8001 8020 Cyb5r1,Adipor1,Klhl12
3 4001 4020 Alyref2,Itln1,Cd244
x2 <- chr start end Genes
1 8861 8868 Olfr1193
1 8405 8420 Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
x2 <- chr start end Genes
1 8861 8868 Olfr1193
2 8501 8520 Chia,Chi3l3,Chi3l4
3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
You could try
x2[mapply(function(x,y) !any(x %in% y),
strsplit(x1$Genes, ','), strsplit(x2$Genes, ',')),]
# chr start end Genes
#2 2 8501 8520 Chia,Chi3l3,Chi3l4
#3 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
Or replace !any(x %in% y) with length(intersect(x,y))==0.
NOTE: If the "Genes" column is a "factor", convert it to "character" first, as strsplit cannot take the 'factor' class, i.e. strsplit(as.character(x1$Genes), ',')
Update
Based on the new dataset for 'x2', we can merge the two datasets by the 'chr' column to get 'xNew', then strsplit its 'Genes.x' and 'Genes.y' columns. A logical index marks the rows where any element of 'Genes.x' occurs in 'Genes.y', and its negation is used to subset the 'x2' dataset:
xNew <- merge(x1, x2[,c(1,4)], by='chr')
indx <- mapply(function(x,y) any(x %in% y),
strsplit(xNew$Genes.x, ','), strsplit(xNew$Genes.y, ','))
x2[!indx,]
# chr start end Genes
#1 1 8861 8868 Olfr1193
#3 2 8501 8520 Chia,Chi3l3,Chi3l4
#4 3 4321 4670 Tdpoz4,Tdpoz3,Tdpoz5
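If the chromosome does not need to match, a different and simpler variant is to pool every gene in x1 and drop any x2 row that shares one. This is not the merge approach above but a sketch of an alternative, using only the chr and Genes columns from the question:

```r
x1 <- data.frame(
  chr = 1:3,
  Genes = c("Mndal,Mnda,Ifi203,Ifi202b",
            "Cyb5r1,Adipor1,Klhl12",
            "Alyref2,Itln1,Cd244"),
  stringsAsFactors = FALSE
)
x2 <- data.frame(
  chr = c(1, 1, 2, 3),
  Genes = c("Olfr1193",
            "Mrgprx3-ps,Mrgpra1,Mrgpra2a,Mndal,Mrgpra2b",
            "Chia,Chi3l3,Chi3l4",
            "Tdpoz4,Tdpoz3,Tdpoz5"),
  stringsAsFactors = FALSE
)

# All genes that appear anywhere in x1
x1_genes <- unique(unlist(strsplit(x1$Genes, ",", fixed = TRUE)))

# TRUE for x2 rows that contain at least one of those genes
shared <- vapply(
  strsplit(x2$Genes, ",", fixed = TRUE),
  function(g) any(g %in% x1_genes),
  logical(1)
)

x2_kept <- x2[!shared, ]
```

Because it ignores 'chr', this also removes rows whose shared gene sits on a different chromosome in x1, so use the merge version if that distinction matters.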
I have the following data frame (this is only the head of the data frame). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition" only for observations that have z.score (in absolute values) smaller than 1.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above for b1 each of the subjects had observations that met the subset demands.
However, it can be that for a certain subject I won't have observations after I subset. In this case I want to get NA in b2 for that subject in the specific condition in which he doesn't have observations that meet the subset demands. Does anyone have an idea for a way to do that?
Any help will be greatly appreciated.
Best,
Ayala
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
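For comparison, the question also mentions tapply, and a base R version along those lines might look like the sketch below. It uses the same bb data and relies on tapply filling empty subject/condition cells with NA as long as both grouping variables are factors before subsetting (kept and wide are illustrative names):

```r
bb <- data.frame(
  subject   = c(99, 99, 99, 99, 99, 11, 11, 11),
  rt        = c(100, 150, 2, 4, 10, 15, 1, 2),
  ac        = rep(1, 8),
  condition = c("A", "A", "B", "D", "C", "C", "D", "D"),
  z.score   = c(0.2, 0.3, 0.2, 0.3, 0.2, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Convert to factors BEFORE subsetting so that all levels are remembered
bb$subject   <- factor(bb$subject)
bb$condition <- factor(bb$condition)

kept <- bb[abs(bb$z.score) < 1, ]

# Empty subject x condition cells are filled with NA
wide <- tapply(
  kept$rt,
  list(subject = kept$subject, condition = kept$condition),
  mean
)
```

The result is a matrix rather than a data frame, with subjects as row names, so it may need as.data.frame() afterwards depending on what you do next.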
Is it the following you are after? I created a similar dataset "bb"
library("plyr") ###needed for . function below
library("reshape2") ###needed for dcast
bb<- data.frame(subject=c(99,99,99,99,99,11,11,11),rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8) ,condition=c("A","A","B","D","C","C","D","D"), z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0