How to convert only SOME positive numbers to negative numbers (conditional recoding)? - r

I am looking for a convenient way to convert positive values (proportions) into negative values of the same variable, depending on the value of another variable.
This is what the data structure looks like:
id Item Var1 Freq
1 P1 0 0.043
2 P2 1 0.078
3 P3 2 0.454
4 P4 3 0.543
5 T1 0 0.001
6 T2 1 0
7 T3 2 0.045
8 T4 3 0.321
9 A1 0 0.671
...
More precisely, I would like to put the numbers for Freq into the negative if Var1 <= 1 (e.g. -0.043).
This is what I tried:
for(i in 1: 180) {
if (mydata$Var1 <= "1") (mydata$Freq*(-1))}
OR
mydata$Freq[mydata$Var1 <= "1"] = -abs(mydata$Freq)}
In both cases, the negative sign is rightly set but the numbers are altered as well.
Any help is highly appreciated. THANKS!

new.Freq <- with(mydata, ifelse(Var1 <= 1, -Freq, Freq))

Try:
index <- mydata$Var1 <= 1
mydata$Freq[index] = -abs(mydata$Freq[index])
There are two errors in your attempted code:
You did a character comparison by writing x <= "1" - this should be a numeric comparison, i.e. x <= 1
Although you are replacing a subset of your vector, the replacement on the right-hand side is not restricted to the same subset, so values from the wrong rows are assigned
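To see why the numbers changed, here is a minimal sketch using the nine rows shown in the question (the vector names wrong and right are illustrative only). Assigning the full-length -abs(mydata$Freq) to a shorter subset makes R take replacement values from the start of the vector rather than from the matching rows:

```r
# The nine rows shown in the question
mydata <- data.frame(
  Var1 = c(0, 1, 2, 3, 0, 1, 2, 3, 0),
  Freq = c(0.043, 0.078, 0.454, 0.543, 0.001, 0, 0.045, 0.321, 0.671)
)
index <- mydata$Var1 <= 1   # numeric comparison, TRUE for rows 1, 2, 5, 6, 9

# Mismatched subset: 5 slots on the left, 9 values on the right.
# R warns and fills the slots with the first 5 values of the right-hand side.
wrong <- mydata$Freq
suppressWarnings(wrong[index] <- -abs(mydata$Freq))

# Matching subset on both sides: each selected value is negated in place.
right <- mydata$Freq
right[index] <- -abs(mydata$Freq[index])

wrong[5]   # -0.454, taken from row 3 instead of row 5
right[5]   # -0.001
```

The character comparison happens to give the expected rows here because Var1 only holds single digits; with multi-digit values, a string comparison such as 10 <= "2" would no longer behave like the numeric one.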

ifelse() can also be used to combine two variables when one of them holds negative values that you want to retain.
Similarly, you can use it to convert values to negative by putting - in front of the variable (as mentioned above), e.g. -Freq.
mydata$new_Freq <- with(mydata, ifelse(Low_Freq < 0, Low_Freq, Freq))
id Item Var1 Freq Low_Freq
1 P1 0 1.043 -0.063
2 P2 1 1.078 -0.077
3 P3 2 2.401 -0.068
4 P4 3 3.543 -0.323
5 T1 0 1.001 1.333
6 T2 1 1.778 1.887
7 T3 2 2.045 1.011
8 T4 3 3.321 1.000
9 A1 0 4.671 2.303
# Output would be:
id Item Var1 Freq Low_Freq new_Freq
1 P1 0 1.043 -0.063 -0.063
2 P2 1 1.078 -0.077 -0.077
3 P3 2 2.401 -0.068 -0.068
4 P4 3 3.543 -0.323 -0.323
5 T1 0 1.001 1.333 1.001
6 T2 1 1.778 1.887 1.778
7 T3 2 2.045 1.011 2.045
8 T4 3 3.321 1.000 3.321
9 A1 0 4.671 2.303 4.671

Related

R filter not retaining values less than -10

I am using R to keep rows whose values satisfy a condition. It works, but the result does not include rows where the log10 column is below -10. Below is part of the table. I want to select all rows whose log10 value is between -8 and -14.
My table:
chr rs ps af beta p_wald log10
1 5 S5_10683198 10683198 0.025 0.5628516 9.422555e-15 -14.0258313188689
2 8 S8_361882 361882 0.025 0.5295581 6.825981e-13 -12.1658349249069
3 8 S8_7385592 7385592 0.021 0.5421677 5.944847e-12 -11.2258593181539
4 2 S2_276875 276875 0.025 0.4899961 1.342672e-11 -10.8720300677393
5 3 S3_7418268 7418268 0.021 0.4906429 1.711510e-11 -10.7666205590256
6 2 S2_14021380 14021380 0.025 0.5080194 2.511021e-11 -10.6001496552098
7 3 S3_13987777 13987777 0.021 0.4595375 2.664140e-11 -10.5744429568178
30 8 S8_7395237 7395237 0.021 0.4186731 3.995514e-09 -8.39842734325747
31 6 S6_7387034 7387034 0.028 0.4266190 4.957138e-09 -8.30476899075709
32 5 S5_11495292 11495292 0.028 0.4575658 5.080677e-09 -8.29407841413843
33 1 S1_15059335 15059335 0.025 0.4183669 5.106630e-09 -8.29186560773669
19430 7 S7_14557672 14557672 0.037 -0.1892856 0.005395347 -2.26798061857347
19431 7 S7_2217818 2217818 0.055 -0.1286663 0.005396288 -2.26790488007629
19432 8 S8_3013554 3013554 0.030 0.1430241 0.005396304 -2.26790359239457
19433 4 S4_1154225 1154225 0.045 -0.1572871 0.005396518 -2.2678863700186
19434 1 S1_21402062 21402062 0.074 0.1159478 0.005396604 -2.26787944906923
19435 2 S2_6176209 6176209 0.030 0.1522105 0.005396680 -2.26787333297322
I am using this R code
maf2 <- maf1[ which(maf1$log10 >= -8.00), ]
my results
chr rs ps af beta p_wald log10
30 8 S8_7395237 7395237 0.021 0.4186731 3.995514e-09 -8.39842734325747
31 6 S6_7387034 7387034 0.028 0.4266190 4.957138e-09 -8.30476899075709
32 5 S5_11495292 11495292 0.028 0.4575658 5.080677e-09 -8.29407841413843
33 1 S1_15059335 15059335 0.025 0.4183669 5.106630e-09 -8.29186560773669
The result skips the first 7 rows, whose log10 values are below -10.
What should I change in the code?
Thanks,
Vinod
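One possible explanation, offered as an assumption rather than a diagnosis: the long decimal strings suggest the log10 column may have been read as character, in which case >= compares text and only values starting with "-8" or "-9" pass. A minimal sketch under that assumption, using four rows from the table above, converts the column to numeric and then applies both bounds:

```r
# Four rows from the question; log10 is deliberately stored as character
# to mimic the assumed import problem.
maf1 <- data.frame(
  rs = c("S8_361882", "S2_276875", "S8_7395237", "S7_14557672"),
  log10 = c("-12.1658349249069", "-10.8720300677393",
            "-8.39842734325747", "-2.26798061857347"),
  stringsAsFactors = FALSE
)

# Convert to numeric so that comparisons are numeric, not text-based
maf1$log10 <- as.numeric(maf1$log10)

# Keep rows with log10 between -14 and -8 (both bounds inclusive)
maf2 <- maf1[maf1$log10 >= -14 & maf1$log10 <= -8, ]
maf2
```

Running str(maf1) on the real data shows whether the column is character or numeric. Note that the first row of the table in the question, about -14.03, lies just below -14, so the lower bound would need to be widened to keep it.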

Make a matrix of 2 rows into a row and a column in R

I'm using R
I have a CSV file from single-cell data like this, where each value in the 'cluster' column is repeated for every gene in that cluster.
dput(markers)
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
APOC1 0 1.696639642 0.939 0.394 0 0 APOC1
APOE 0 1.487160872 0.958 0.475 0 0 APOE
GPNMB 9.30E-269 1.31714457 0.745 0.301 2.49E-264 0 GPNMB
FTL 2.24E-230 0.766844152 1 0.977 6.00E-226 0 FTL
PSAP 2.27E-225 0.98726538 0.925 0.685 6.07E-221 0 PSAP
CTSB 4.84E-211 0.925031015 0.902 0.606 1.29E-206 0 CTSB
CTSS 1.37E-197 0.898457063 0.869 0.609 3.67E-193 0 CTSS
CSTB 8.05E-191 0.853658991 0.918 0.732 2.15E-186 0 CSTB
CTSD 1.23E-187 1.08931251 0.787 0.443 3.30E-183 0 CTSD
IGKC 0 1.560337702 0.998 0.237 0 1 IGKC
IGLC2 0 1.546344857 0.997 0.152 0 1 IGLC2
IGLC3 0 1.342649567 0.967 0.073 0 1 IGLC3
C11orf96 0 1.245172517 0.99 0.253 0 1 C11orf96
COL3A1 0 1.212528128 1 0.343 0 1 COL3A1
LUM 0 1.202452925 0.971 0.143 0 1 LUM
IGHG4 0 0.977399051 0.876 0.092 0 1 IGHG4
HSPG2 0 0.957478533 0.883 0.148 0 1 HSPG2
NNMT 0 0.952577589 0.945 0.213 0 1 NNMT
IGHG1 0 0.913733424 0.861 0.07 0 1 IGHG1
COL6A31 0 1.847828827 0.907 0.192 0 2 COL6A3
PDGFRA 5.38E-292 0.849349193 0.503 0.052 1.44E-287 2 PDGFRA
COL5A21 2.67E-280 1.400314195 0.649 0.105 7.14E-276 2 COL5A2
CALD1 1.11E-275 1.292924443 0.771 0.155 2.98E-271 2 CALD1
CCDC80 1.73E-271 1.168549626 0.706 0.123 4.64E-267 2 CCDC80
COL1A21 1.66E-268 2.004626869 0.966 0.326 4.45E-264 2 COL1A2
DCN1 1.47E-253 1.540631398 0.886 0.254 3.93E-249 2 DCN
COL3A11 3.88E-253 2.216642854 0.955 0.353 1.04E-248 2 COL3A1
FBN1 6.40E-251 0.949521182 0.525 0.07 1.71E-246 2 FBN1
I want to transform my matrix so that each column name is a unique cluster name and each column holds all the genes from that cluster (see the desired output below). How should I write the code?
dput(markers)
0 1 2
APOC1 IGKC COL6A3
APOE IGLC2 PDGFRA
GPNMB IGLC3 COL5A2
FTL C11orf96 CALD1
PSAP COL3A1 CCDC80
CTSB LUM COL1A2
CTSS IGHG4 DCN
CSTB HSPG2 COL3A1
CTSD NNMT FBN1
I tried this and the result file has no values.
markers = read.csv("./markers.csv", row.names=1, stringsAsFactors=FALSE)
z1 = matrix("", ncol = length(unique(markers$cluster)))
colnames(z1) = unique(markers$cluster)
for (i in 1:nrow(z1)){
for (j in 1:ncol(z1)){
genes1 = as.character(markers$gene)[markers$cluster == rownames(z1)[i]]
z1[i,0] = paste(genes1, collapse=" ")
}
}
write.csv(z1, "test.csv")
This may accomplish what you want, but first we need a reproducible example:
set.seed(42)
cluster <- c(rep(0, 8), rep(1, 10), rep(2, 12))
gene <- replicate(30, paste0(sample(LETTERS, 4), collapse=""))
markers <- data.frame(cluster, gene, stringsAsFactors=FALSE)
This data frame only contains the two columns you are interested in. We need to split the genes by cluster:
markers.split <- split(markers$gene, markers$cluster)
Print this out. It is a list containing 3 character vectors, one for 0, 1, and 2. The problem with the table format you want is that tables and matrices have to have the same number of rows in each column. We have to pad the vectors so they are all as long as the longest one (12 in this case):
rows <- max(sapply(markers.split, length))
markers.sp <- lapply(markers.split, function(x) c(x, rep("", rows - length(x))))
markers.df <- do.call(data.frame, list(markers.sp, stringsAsFactors=FALSE))
markers.df
# X0 X1 X2
# 1 QEAJ ZHDX TIKC
# 2 DRQO VRME PEXN
# 3 XGDE DBXR EVBR
# 4 NTRO CXWQ XQRE
# 5 CIDE URFX NHWY
# 6 METB BTCV UDYG
# 7 HCAJ UBWF JRMU
# 8 XKOV ZJHE VSPZ
# 9 AQGD QLIU
# 10 MJIL KYPH
# 11 WFAM
# 12 NEIW
R automatically adds "X" to any column name that starts with a number.
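If the original cluster labels are preferred as column names, one option is to switch off the name check when building the data frame. This is a minimal sketch with a few gene names from the question; the object name padded is illustrative only:

```r
# Two clusters of unequal size, named by cluster label
markers.split <- list("0" = c("APOC1", "APOE"),
                      "1" = c("IGKC", "IGLC2", "IGLC3"))

# Pad the shorter vectors with "" so every column has the same length
rows <- max(lengths(markers.split))
padded <- lapply(markers.split, function(x) c(x, rep("", rows - length(x))))

# check.names = FALSE keeps "0" and "1" instead of "X0" and "X1"
markers.df <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
names(markers.df)
```

Columns with such names have to be accessed with backticks or quotes, for example markers.df[["0"]]. Writing the result with write.csv(markers.df, "test.csv", row.names = FALSE) should keep the labels as the header row.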

Observation matching between groups

My original dataset has more than 20,000 rows. A condensed version looks something like this:
Row x y z Group Survive
1 0.0680 0.8701 0.0619 1 78.43507
2 0.9984 0.0016 0.0000 1 89.55533
3 0.4146 0.5787 0.0068 1 85.35468
4 0.3910 0.6016 0.0074 2 67.49987
5 0.3902 0.6023 0.0075 2 81.87669
6 0.0621 0.8701 0.0678 2 27.26777
7 0.6532 0.3442 0.0026 3 53.03938
8 0.6508 0.3466 0.0026 3 62.32931
9 0.9977 0.0023 0.0000 3 97.00324
My goal is to create a column called Match1 as shown below
Row x y z Group Survive Match1
1 0.0680 0.8701 0.0619 1 78.43507 g1r1-g2r3
2 0.9984 0.0016 0.0000 1 89.55533 g1r2-g2r1
3 0.4146 0.5787 0.0068 1 85.35468 g1r3-g2r2
1 0.3910 0.6016 0.0074 2 67.49987 g1r2-g2r1
2 0.3902 0.6023 0.0075 2 81.87669 g1r3-g2r2
3 0.0621 0.8701 0.0678 2 27.26777 g1r1-g2r3
1 0.6532 0.3442 0.0026 3 53.03938 NA
2 0.6508 0.3466 0.0026 3 62.32931 NA
3 0.9977 0.0023 0.0000 3 97.00324 NA
The logic behind the values g1r1-g2r3, g1r2-g2r1, g1r3-g2r2 is as follows
1st step, a distance matrix is generated between rows in Group1 and Group2 based on Mahalanobis or simple distance method , sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2)
0.4235 = sqrt{ (0.3910-0.0680)^2 + (0.6016-0.8701)^2 + (0.0074-0.0619)^2}
0.4225 = sqrt{ (0.3902-0.0680)^2 + (0.6023-0.8701)^2 + (0.0075-0.0619)^2}
0.0083 = sqrt{ (0.0621-0.0680)^2 + (0.8701-0.8701)^2 + (0.0678-0.0619)^2}
0.8538 = sqrt{ (0.3910-0.9984)^2 + (0.6016-0.0016)^2 + (0.0074-0.0000)^2}
0.8549 = sqrt{ (0.3902-0.9984)^2 + (0.6023-0.0016)^2 + (0.0075-0.0000)^2}
1.2789 = sqrt{ (0.0621-0.9984)^2 + (0.8701-0.0016)^2 + (0.0678-0.0000)^2}
0.0329 = sqrt{ (0.3910-0.4146)^2 + (0.6016-0.5787)^2 + (0.0074-0.0068)^2}
Group1 vs Group2
g2r1 g2r2 g2r3
g1r1 0.4235 0.4225 0.0083
g1r2 0.8538 0.8549 1.2789
g1r3 0.0329 0.0340 0.4614
2nd step, find the minimum or smallest distance in each row.
g2r1 g2r2 g2r3
g1r1 0.4235 0.4225 **0.0083**
g1r2 **0.8538** 0.8549 1.2789
g1r3 0.0329* **0.0340** 0.4614
The column Match1 takes the value g1r1-g2r3 because Row1-Group1 and Row3-Group2 give the smallest distance, 0.0083. Similarly, it takes g1r2-g2r1 because Row2-Group1 and Row1-Group2 give the smallest value, 0.8538. Although 0.0329 is the smallest value in the last row of the distance matrix, we skip it: choosing 0.0329 would pair Row3-Group1 with Row1-Group2, and Row1-Group2 is already paired with Row2-Group1. We therefore choose the next smallest value, 0.0340, which results in g1r3-g2r2.
3rd step, calculate average survival based on matched observations in Step2.
((78.43507 - 27.26777) + (89.55533 - 67.49987) + (85.35468 - 81.87669)) / 3 = 25.56692
I am not sure how to string these steps together programmatically. I would appreciate any suggestions or help putting all these pieces together efficiently.
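The three steps can be sketched in base R as follows. This is only an illustration for Group 1 versus Group 2 with the simple (Euclidean) distance; it assumes both groups have the same number of rows and that matching proceeds in Group 1 row order, as in the example above. All object names (g1, g2, surv1, surv2, match_idx) are made up for the sketch:

```r
# x, y, z for Group 1 and Group 2, plus their Survive values
g1 <- matrix(c(0.0680, 0.8701, 0.0619,
               0.9984, 0.0016, 0.0000,
               0.4146, 0.5787, 0.0068), ncol = 3, byrow = TRUE)
g2 <- matrix(c(0.3910, 0.6016, 0.0074,
               0.3902, 0.6023, 0.0075,
               0.0621, 0.8701, 0.0678), ncol = 3, byrow = TRUE)
surv1 <- c(78.43507, 89.55533, 85.35468)
surv2 <- c(67.49987, 81.87669, 27.26777)

# Step 1: distance between every Group 1 row (rows) and Group 2 row (columns)
d <- outer(seq_len(nrow(g1)), seq_len(nrow(g2)),
           Vectorize(function(i, j) sqrt(sum((g1[i, ] - g2[j, ])^2))))

# Step 2: for each Group 1 row in turn, take the closest Group 2 row
# that has not been matched yet
match_idx <- integer(nrow(g1))
available <- rep(TRUE, nrow(g2))
for (i in seq_len(nrow(g1))) {
  candidates <- ifelse(available, d[i, ], Inf)  # hide rows already used
  match_idx[i] <- which.min(candidates)
  available[match_idx[i]] <- FALSE
}
match1 <- paste0("g1r", seq_len(nrow(g1)), "-g2r", match_idx)

# Step 3: average survival difference over the matched pairs
avg_diff <- mean(surv1 - surv2[match_idx])
```

Here match1 holds the labels for the Group 1 rows. With more than 20000 rows the full distance matrix may be too large to hold in memory, so the same loop would probably have to compute distances one Group 1 row at a time.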

Conditional filtering and summarizing in R

I have recently transitioned from STATA + Excel to R. So, I would appreciate if someone could help me in writing efficient code. I have tried my best to research the answer before posting on SO.
Here's what my data looks like:
mydata<-data.frame(sassign$buyer,sassign$purch,sassign$total_)
str(mydata)
'data.frame': 50000 obs. of 3 variables:
$ sassign.buyer : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 2 1 ...
$ sassign.purch : num 10 3 2 1 1 1 1 11 11 1 ...
$ sassign.total_: num 357 138 172 272 149 113 15 238 418 123 ...
head(mydata)
sassign.buyer sassign.purch sassign.total_
1 no 10 357
2 no 3 138
3 no 2 172
4 no 1 272
5 no 1 149
6 yes 1 113
My objective is to find average number of buyers with # of purchases > 1.
So, here's what I did:
Method 1: Long method
library(psych)
check<-as.numeric(mydata$sassign.buyer)-1
myd<-cbind(mydata,check)
abcd<-psych::describe(myd[myd$sassign.purch>1,])
abcd$mean[4]
The output I got is:0.1031536697, which is correct.
#Sathish: Here's how check looks like:
head(check)
0 0 0 0 0 1
This did solve my purpose.
Pros of this method: It's easy and typically a beginner level.
Cons: Too many-- I need an extra variable (check). Plus, I don't like this method--it's too clunky.
Side Question : I realized that by default, functions don't show higher precision although options (digits=10) is set. For instance, here's what I got from running :
psych::describe(myd[myd$sassign.purch>1,])
vars n mean sd median trimmed mad min max range skew
sassign.buyer* 1 34880 1.10 0.30 1 1.00 0.00 1 2 1 2.61
sassign.purch 2 34880 5.14 3.48 4 4.73 2.97 2 12 10 0.65
sassign.total_ 3 34880 227.40 101.12 228 226.13 112.68 30 479 449 0.09
check 4 34880 0.10 0.30 0 0.00 0.00 0 1 1 2.61
kurtosis se
sassign.buyer* 4.81 0.00
sassign.purch -1.05 0.02
sassign.total_ -0.72 0.54
check 4.81 0.00
It's only when I ran
abcd$mean[4]
I got 0.1031536697
Method 2: Using dplyr
I tried pipes and function call, but I finally gave up.
Method 2 | Try1:
psych::describe(dplyr::filter(mydata,mydata$sassign.purch>1)[,dplyr::mutate(as.numeric(mydata$sassign.buyer)-1)])
Output:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Method 2 | Try2: Using pipes:
mydata %>% mutate(newcol = as.numeric(sassign.buyer)-1) %>% dplyr::filter(sassign.purch>1) %>% summarise(meanpurch = mean(newcol))
This did work, and I got meanpurch= 0.1031537. However, I am still not sure about Try 1.
Any thoughts why this isn't working?
Data:
> dt
# sassign.buyer sassign.purch sassign.total_
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
# 4 no 1 272
# 5 no 1 149
# 6 yes 1 113
Number of Buyers with purchases greater than 1
library(dplyr)
dt %>%
group_by(sassign.buyer) %>%
filter(sassign.purch > 1)
#
# Source: local data frame [3 x 3]
# Groups: sassign.buyer [1]
#
# sassign.buyer sassign.purch sassign.total_
# (chr) (int) (int)
# 1 no 10 357
# 2 no 3 138
# 3 no 2 172
Average number of buyers with purchases greater than 1
dt %>%
group_by(sassign.buyer) %>%
filter(sassign.purch > 1) %>%
summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))
# Source: local data frame [1 x 2]
#
# sassign.buyer avg_no_buyers_gt_1
# (chr) (dbl)
# 1 no 0.5
If no grouping of buyers is required,
dt %>%
filter(sassign.purch > 1) %>%
summarise(avg_no_buyers_gt_1 = length(sassign.buyer)/ nrow(dt))
# avg_no_buyers_gt_1
# 1 0.5
Finding the proportion of cases that suit a condition is easy to do with mean(). Here's a blog post explaining it: https://drsimonj.svbtle.com/proportionsfrequencies-with-mean-and-booleans, and here's a simple example:
buyer <- c("yes", "yes", "no", "no")
mean(buyer == "yes")
#> [1] 0.5
So in your case, you can do mean(d$sassign.buyer[d$sassign.purch > 1] == "yes"). Here's a worked example:
d <- data.frame(
sassign.buyer = factor(c("yes", "yes", "no", "no")),
sassign.purch = c(1, 10, 0, 200)
)
mean(d$sassign.buyer[d$sassign.purch > 1] == "yes")
#> [1] 0.5
This gets all cases where d$sassign.purch is greater than 1, and then computes the proportion (using mean()) of these cases in which d$sassign.buyer is equal to "yes".
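As a quick check, the extra check variable from Method 1 and the logical comparison give the same number. The sketch below reuses the small example data frame d; the names keep, via_check and via_logical are illustrative only:

```r
d <- data.frame(
  sassign.buyer = factor(c("yes", "yes", "no", "no")),
  sassign.purch = c(1, 10, 0, 200)
)

keep <- d$sassign.purch > 1   # rows with more than one purchase

# Method 1 style: factor levels are sorted ("no" = 1, "yes" = 2),
# so subtracting 1 gives a 0/1 indicator for "yes"
check <- as.numeric(d$sassign.buyer) - 1
via_check <- mean(check[keep])

# Logical style: TRUE/FALSE are averaged as 1/0, no helper column needed
via_logical <- mean(d$sassign.buyer[keep] == "yes")

c(via_check, via_logical)
```

The logical version does not depend on the order of the factor levels, which makes it a little more robust than subtracting 1 from the integer codes.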

counting function on data frame in R

I have the following data frame:
> Mice
Blood States Minute
1 0.875 X0 0.8352569
2 0.875 A2 0.7551901
3 0.625 X0 1.4508139
4 0.625 A1 0.7876343
5 0.375 X0 1.1345252
6 0.125 X0 0.8699363
7 0.375 X0 0.9378742
8 1.125 H1 0.9769522
9 0.625 X0 0.4716321
10 0.875 H1 0.9935999
11 0.625 X0 1.0025917
12 0.375 A1 1.0703999
13 0.375 X0 1.3044854
14 0.875 H1 0.6720436
15 0.875 A1 1.0431863
So every mouse has some value of drugs in their "Blood", and their "State" is checked. This is just a piece of my data frame, but the mice can be in 4 different states. "Minute" is whenever something occurs to the mice, does not matter what.
For every value of "Blood", the mice can be in either of the 4 different states, and I want to count how many observations I have in each category.
The count() function with both columns Blood and States did not work because "States" is a factor column.
To operate on factor levels, you can use tapply or by. If you have discrete scale for Mice$Blood, convert it to a factor as well:
> by(Mice$States, as.factor(Mice$Blood), function(x) summary(factor(x)))
as.factor(Mice$Blood): 0.125
X0
1
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.375
A1 X0
1 3
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.625
A1 X0
1 3
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 0.875
A1 A2 H1 X0
1 1 2 1
------------------------------------------------------------------------------------------------
as.factor(Mice$Blood): 1.125
H1
1
The returned object is a list, so you can capture it in a variable and use it for your purposes.
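As an alternative sketch, base R's table() cross-tabulates the two columns directly and works with factor columns. The data frame below is rebuilt from the 15 rows shown in the question (the Minute column is omitted), and the name counts is illustrative only:

```r
Mice <- data.frame(
  Blood = c(0.875, 0.875, 0.625, 0.625, 0.375, 0.125, 0.375, 1.125,
            0.625, 0.875, 0.625, 0.375, 0.375, 0.875, 0.875),
  States = factor(c("X0", "A2", "X0", "A1", "X0", "X0", "X0", "H1",
                    "X0", "H1", "X0", "A1", "X0", "H1", "A1"))
)

# One row per Blood value, one column per state, cells are counts
counts <- table(Blood = Mice$Blood, States = Mice$States)
counts

# Long format: one row per Blood/States combination with its frequency
as.data.frame(counts)
```

Unlike the by() output, the table also lists combinations that never occur, with a count of 0.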
