Why can't I change the contents of a column in R?

> data$Accepted.Final.round
  [1] NA NA NA NA NA NA NA NA  1 NA NA NA NA  1  1  1  1  0  1  0  0  1  1  1  1  1 NA  1  1  1  1
 [32] NA  1  1  0  1  1  1  1  1  1 NA  1  1  0  1  1  0  1  1  1  1  1  1  1  1 NA  1 NA NA NA NA
 [63] NA NA NA NA NA NA NA NA NA NA NA NA  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [94] NA NA NA NA NA NA NA NA NA NA NA NA NA  1 NA NA NA NA NA
I have a dataset column consisting of NA, 1, and 0. However, when I try
data$Accepted.Final.round[data$Accepted.Final.round == NA] <- 0
or
ifelse(data$Accepted.Final.round == 1, 1, 0)
to replace NA with 0, neither line works.
Can anyone think of a way to fix this?

Use is.na() to determine whether a value is NA. NA is contagious: operations involving NA usually return NA. That includes checking for equality with ==, i.e. x == NA will always return NA, never TRUE or FALSE.
x <- c(2, NA, 2)
x[is.na(x)] <- 0

The second attempt from the OP was pretty close:
data$Accepted.Final.round <- ifelse(is.na(data$Accepted.Final.round),
                                    0, data$Accepted.Final.round)
The documentation for ifelse explains:
Usage:
ifelse(test, yes, no)
yes will be evaluated if and only if any element of test is true, and
analogously for no.
Missing values (i.e. NA) in test give missing (NA) values in the result.
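A minimal sketch of both fixes, using a small stand-in vector (x here is hypothetical, not the OP's data):

```r
x <- c(1, NA, 0)

# comparing with == propagates the NA: where the test is NA, the result is NA
ifelse(x == 1, 1, 0)
# [1]  1 NA  0

# testing with is.na() yields TRUE/FALSE, so the replacement actually happens
ifelse(is.na(x), 0, x)
# [1] 1 0 0

# or replace in place via logical indexing
x[is.na(x)] <- 0
x
# [1] 1 0 0
```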

Related

Getting uncommon elements from two columns from two files

I have two files of T cell blood cell sequences, one for normal and one for cancer. The cancer file looks like this:
> head(cancer[1:2,])
cloneId cloneCount cloneFraction targetSequences
1 0 64 0.02273535 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2 1 64 0.02273535 TGTCAACACAGTTACTCTATTCCGTGGACGTTC
targetQualities allVHitsWithScore
1 EEEEEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGLV1-51*00(117.6)
2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGKV1-39*00(152),IGKV1D-39*00(152)
allDHitsWithScore allJHitsWithScore allCHitsWithScore
1 IGLJ2*00(42.3),IGLJ3*00(42.3) IGLC3*00(118),IGLC2*00(117.3)
2 IGKJ1*00(65.4) IGKC*00(75)
allVAlignments allDAlignments
1 421|446|473|0|25|SG425CSA427T|93.0
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0
allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0 ; NA NA
2 19|30|58|22|33||55.0 NA NA
nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC 36 NA NA NA
2 TGTCAACACAGTTACTCTATTCCGTGGACGTTC 45 NA NA NA
aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3 aaSeqCDR3 aaSeqFR4
1 NA NA NA NA CASWDSSLKIVLF NA
2 NA NA NA NA CQHSYSIPWTF NA
refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2 :::::::::0:-9:15:::::22:1:33:::
> names(cancer)
[1] "cloneId" "cloneCount" "cloneFraction" "targetSequences"
[5] "targetQualities" "allVHitsWithScore" "allDHitsWithScore" "allJHitsWithScore"
[9] "allCHitsWithScore" "allVAlignments" "allDAlignments" "allJAlignments"
[13] "allCAlignments" "nSeqFR1" "minQualFR1" "nSeqCDR1"
[17] "minQualCDR1" "nSeqFR2" "minQualFR2" "nSeqCDR2"
[21] "minQualCDR2" "nSeqFR3" "minQualFR3" "nSeqCDR3"
[25] "minQualCDR3" "nSeqFR4" "minQualFR4" "aaSeqFR1"
[29] "aaSeqCDR1" "aaSeqFR2" "aaSeqCDR2" "aaSeqFR3"
[33] "aaSeqCDR3" "aaSeqFR4" "refPoints"
>
And for normal
> head(normal[1:2,])
cloneId cloneCount cloneFraction targetSequences
1 0 100 0.03745318 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC
2 1 53 0.01985019 TGTCAACACAGTTACTCTATTCCGTGGACGTTC
targetQualities allVHitsWithScore
1 EEEENNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN IGLV1-51*00(115.8)
2 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNEEEE IGKV1-39*00(124.4),IGKV1D-39*00(124.4)
allDHitsWithScore allJHitsWithScore allCHitsWithScore
1 IGLJ2*00(44.8),IGLJ3*00(44.8) IGLC2*00(103.3),IGLC3*00(103.3)
2 IGKJ1*00(61.2) IGKC*00(114.2)
allVAlignments allDAlignments
1 421|446|473|0|25|SG425CSA427T|93.0
2 427|442|471|0|15|SG435C|59.0;349|364|395|0|15|SG357C|59.0
allJAlignments allCAlignments nSeqFR1 minQualFR1
1 27|30|58|36|39||15.0;27|30|58|36|39||15.0 ; NA NA
2 19|30|58|22|33||55.0 NA NA
nSeqCDR1 minQualCDR1 nSeqFR2 minQualFR2 nSeqCDR2 minQualCDR2 nSeqFR3 minQualFR3
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
nSeqCDR3 minQualCDR3 nSeqFR4 minQualFR4 aaSeqFR1
1 TGCGCATCATGGGATAGCAGCCTGAAAATTGTCCTTTTC 36 NA NA NA
2 TGTCAACACAGTTACTCTATTCCGTGGACGTTC 36 NA NA NA
aaSeqCDR1 aaSeqFR2 aaSeqCDR2 aaSeqFR3 aaSeqCDR3 aaSeqFR4
1 NA NA NA NA CASWDSSLKIVLF NA
2 NA NA NA NA CQHSYSIPWTF NA
refPoints
1 :::::::::0:-7:25:::::36:-7:39:::
2 :::::::::0:-9:15:::::22:1:33:::
>
How can I subset the cancer file for the elements in the aaSeqCDR3 and nSeqCDR3 columns that are not shared with the normal file?
I mean I want the rows of the cancer file whose values in these two columns are unique to it and do not occur in the normal file.
If we want to subset based on elements that are not present in 'normal', use anti_join
library(dplyr)
anti_join(cancer, normal[c("aaSeqCDR3", "nSeqCDR3")],
          by = c("aaSeqCDR3", "nSeqCDR3"))
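If dplyr is not available, the same anti-join can be sketched in base R by pasting the two key columns together. The toy data frames below are hypothetical stand-ins for the real cancer/normal files, and the "|" separator is an assumption (pick any separator that cannot occur in the sequences):

```r
# hypothetical toy frames standing in for the real cancer/normal data
cancer <- data.frame(aaSeqCDR3 = c("CASW", "CQHS"), nSeqCDR3 = c("TGC", "TGT"))
normal <- data.frame(aaSeqCDR3 = "CASW",            nSeqCDR3 = "TGC")

# build one composite key per row from the two columns
key_cancer <- paste(cancer$aaSeqCDR3, cancer$nSeqCDR3, sep = "|")
key_normal <- paste(normal$aaSeqCDR3, normal$nSeqCDR3, sep = "|")

# keep only cancer rows whose (aaSeqCDR3, nSeqCDR3) pair is absent from normal
cancer[!key_cancer %in% key_normal, ]
#   aaSeqCDR3 nSeqCDR3
# 2      CQHS      TGT
```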

Measure impact of store renovation on sales in R

I have a dataset that contains the sales of stores for the last years, together with the year when the shop was renovated last. My goal is to measure if the renovation had an impact on sales post-reopening, and how this impact evolved over the 4 years after the re-opening.
My challenge is that the general trend in the dataset shows that all stores are losing about 2% of revenue per year. I therefore need to take that into account as well when measuring my effect.
My initial idea was to create dummies for each possible year of renovation, but this won't work given that I only have data for 35 shops. I therefore tried to create a variable counting the number of years since renovation, but I think I'm missing something:
library(data.table)
year_start = 2013
year_stop = 2017
n_years = year_stop - year_start+1
seed_sales = 100
year_decrease = 0.02
n_shops = 35
shops = paste0("Shop",seq(1,n_shops))
dt_sales <- data.table(
  Shop = sort(rep(shops, n_years)),
  Year = rep(seq(year_start, year_stop), length(shops)),
  Year_Renovation = round(rbinom(n_shops * n_years, 1, 0.3) * runif(1, year_start - 10, year_stop))
)
dt_sales[, Sales := 100 - (Year - year_start) * year_decrease * rnorm(n_shops * n_years, 1) -
           ifelse(Year_Renovation == 1,
                  ifelse(Year - Year_Renovation < 2, 10, 0) * rnorm(n_shops * n_years) +
                    ifelse(Year - Year_Renovation > 2, 10 * Year - Year_Renovation, 0) * rnorm(n_shops * n_years),
                  0)]
## Current thinking
dt_sales[, Is_renovated := ifelse(Year_Renovation == 0,0,1)]
dt_sales[Is_renovated==1 & Year-Year_Renovation>=0, Years_since_rennovation := Year-Year_Renovation]
lm = glm(Sales ~ Year + Is_renovated:Years_since_rennovation, data = dt_sales,
         family = gaussian(), na.action = na.omit)
summary(lm)
Output is:
(Intercept) 137.855325 9.679754 14.242 < 2e-16 ***
Year -0.018807 0.004803 -3.915 0.000279 ***
Years_since_rennovation NA NA NA NA
The yearly decline is captured, but the effect of renovation is apparently wrapped into the intercept, which goes up to 137 instead of 100 as I set it.
Where am I going wrong?
Thanks!
Stefano
What follows is an answer to your R question. If you have any questions about whether this is the proper modelling strategy, I would head to Cross Validated.
There are two problems. First, dt_sales$Years_since_rennovation is almost all NA:
dt_sales$Years_since_rennovation
[1] NA NA NA NA 2 NA NA NA 1 2 NA NA NA 1 2 NA NA NA 1 NA NA NA 0 1 NA
[26] NA NA 0 NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA 1 NA NA NA 0 1 NA
[51] NA NA NA NA NA NA NA NA NA NA NA NA 0 1 NA NA NA NA NA NA NA NA 0 NA NA
[76] NA NA NA 1 NA NA NA NA 1 NA NA NA 0 1 2 NA NA 0 NA NA NA NA NA NA NA
[101] NA NA 0 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
[126] NA NA NA 1 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA 0 NA 2
[151] NA NA NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA
Therefore, you see in the summary() output
(144 observations deleted due to missingness)
Then, if we examine dt_sales$Year and dt_sales$Years_since_rennovation for the remaining observations, we see there is perfect multicollinearity:
dt_sales$Year[!is.na(dt_sales$Years_since_rennovation)] - 2015
# [1] 2 1 2 1 2 1 0 1 0 0 1 0 1 0 1 0 1 1 0 1 2 0 0 1 1 1 2 0 0 2 0
dt_sales$Years_since_rennovation[!is.na(dt_sales$Years_since_rennovation)]
# [1] 2 1 2 1 2 1 0 1 0 0 1 0 1 0 1 0 1 1 0 1 2 0 0 1 1 1 2 0 0 2 0
This makes it impossible for R to estimate both coefficients. So, R estimates the first coefficient and drops the second variable. If you don't want R to do that without throwing an error, pass singular.ok = FALSE (see help("glm")):
lm = glm(Sales ~ Year + Is_renovated:Years_since_rennovation, data=dt_sales,
family = gaussian(), na.action = na.omit, singular.ok = FALSE)
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, :
singular fit encountered
As a side note, I'd avoid naming objects lm as that's also the name of the basic OLS function.
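To see the aliasing mechanism in isolation, here is a small hypothetical example unrelated to the renovation data: when one predictor is an exact linear function of another, glm() estimates the first and reports NA for the aliased one, exactly as in the summary output above.

```r
set.seed(42)
x <- rnorm(20)
y <- 3 * x + rnorm(20)

# I(2 * x) is perfectly collinear with x, so its coefficient comes back NA
fit <- glm(y ~ x + I(2 * x), family = gaussian())
coef(fit)["I(2 * x)"]
# I(2 * x)
#       NA
```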

Set consequent non na values to NA

Set every non-NA value that has a non-NA value to its left to NA.
Data
a <- c(3,2,3,NA,NA,1,NA,NA,2,1,4,NA)
[1] 3 2 3 NA NA 1 NA NA 2 1 4 NA
Desired Output
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
My working but ugly solution:
library(magrittr)  # needed for %>%
IND <- !is.na(a) & (data.table::rleidv(!is.na(a)) %>% duplicated)
a[IND] <- NA
a
There's gotta be a better solution ...
Alternatively,
a[-1][diff(!is.na(a)) == 0] <- NA; a
# [1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
OK for brevity...
a[!is.na(dplyr::lag(a))]<-NA
a
[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
You can use a simple ifelse statement where you add your vector to a lagged copy of itself. If the sum is NA (i.e. the left neighbour is missing), the value stays the same; otherwise it becomes NA, i.e.
ifelse(is.na(a + dplyr::lag(a)), a, NA)
#[1] 3 NA NA NA NA 1 NA NA 2 NA NA NA
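The same lag trick also works without dplyr, by shifting the vector manually with c(NA, head(a, -1)):

```r
a <- c(3, 2, 3, NA, NA, 1, NA, NA, 2, 1, 4, NA)

# positions whose left neighbour is non-NA get blanked out
a[!is.na(c(NA, head(a, -1)))] <- NA
a
# [1]  3 NA NA NA NA  1 NA NA  2 NA NA NA
```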

How to find the number of discordant and concordant pairs in R?

I am trying to find the number of discordant and concordant pairs in a clinical trial, and have come across the 'asbio' library which provides the function ConDis.matrix. (http://artax.karlin.mff.cuni.cz/r-help/library/asbio/html/ConDis.matrix.html)
The dataset they give as an example is:
crab<-data.frame(gill.wt=c(159,179,100,45,384,230,100,320,80,220,320,210),
body.wt=c(14.4,15.2,11.3,2.5,22.7,14.9,1.41,15.81,4.19,15.39,17.25,9.52))
attach(crab)
crabm<-ConDis.matrix(gill.wt,body.wt)
crabm
Which gives a result that looks like:
1 2 3 4 5 6 7 8 9 10 11 12
1 NA NA NA NA NA NA NA NA NA NA NA NA
2 1 NA NA NA NA NA NA NA NA NA NA NA
3 1 1 NA NA NA NA NA NA NA NA NA NA
4 1 1 1 NA NA NA NA NA NA NA NA NA
5 1 1 1 1 NA NA NA NA NA NA NA NA
6 1 -1 1 1 1 NA NA NA NA NA NA NA
7 1 1 0 -1 1 1 NA NA NA NA NA NA
8 1 1 1 1 1 1 1 NA NA NA NA NA
9 1 1 1 1 1 1 -1 1 NA NA NA NA
10 1 1 1 1 1 -1 1 1 1 NA NA NA
11 1 1 1 1 1 1 1 0 1 1 NA NA
12 -1 -1 -1 1 1 1 1 1 1 1 1 NA
The solution I can think of is adding up the 1s and -1s (for concordant and discordant pairs) respectively, but I don't know how to count values in a matrix. Alternatively, if someone has a better way of counting concordant/discordant pairs, I would love to know.
You can count the values in the matrix directly:
sum(crabm == 1, na.rm = TRUE)
[1] 57
sum(crabm == -1, na.rm = TRUE)
[1] 7
You could try (C...concordant, D...discordant pairs):
library(DescTools)
tab <- table(crab$gill.wt, crab$body.wt)
ConDisPairs(tab)[c("C","D")]
$C
[1] 57
$D
[1] 7
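Without extra packages, the counts can also be sketched directly with combn(): for each pair of observations, the product of the signs of the differences is +1 for a concordant pair, -1 for a discordant pair, and 0 for a tie.

```r
# the crab data from the question
x <- c(159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210)          # gill.wt
y <- c(14.4, 15.2, 11.3, 2.5, 22.7, 14.9, 1.41, 15.81, 4.19, 15.39, 17.25, 9.52)  # body.wt

idx <- combn(length(x), 2)  # all pairs i < j
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])

sum(s > 0)  # concordant: 57
sum(s < 0)  # discordant: 7
```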

R - Matching rows and colums of matrices with different length

My problem at the moment is the following. I have a directed 1-mode edgelist representing pairs of actors participating in joint projects in a certain year, which might look like:
projektleader projectpartner year
A B 2005
A C 2000
B A 2002
... ... ...
Now I need only a subset for one particular year. Not all actors are active in every year, so the dimensions of the subsets differ. For a subsequent network analysis I need a weighted and directed adjacency matrix, so I use the network package to create it: I first load the subset as a network object and then transform it into an adjacency matrix.
grants_00 <- subset(grants, year_grant == 2000,
                    select = c(projectpartner, projectleader))
nw_00 <- network(grants_00, matrix.type = "edgelist", directed = TRUE)
grants_00.adj <- as.matrix(nw_00, matrix.type = "adjacency")
The resulting matrix looks somewhat like
A B C E ...
A 0 1 1 0
B 1 0 0 0
...
So far so good. My problem is now: For the further analysis I am planning to do I need an adjacency Matrix for every year with the same dimension and order. That means that all actors from the initial dataset have to be the row and column names of the matrix for the corresponding years, but the matrix should only contain observed pairs for this certain year. I hope my problem is clear. I appreciate any kind of constructive solutions.
My current idea is the following: I create a matrix from the initial dataset and one from the reduced dataset. Then I set all values of the full matrix to zero. Then I somehow match it against the reduced matrix and fill in the right values in the right rows and columns. Unfortunately I have no clue how this might be done.
Has anybody an idea how to solve this problem?
Unfortunately, your question is not entirely clear, so I will try to answer what I understand it to be:
Given a big and a small matrix, find the locations where they match.
I regenerate your data
library(network)
N <- 20
grants <- data.frame(
  projectleader  = sample(x = LETTERS[1:20], size = N, replace = TRUE),
  projectpartner = sample(x = LETTERS[1:20], size = N, replace = TRUE),
  year_grant     = sample(x = 0:5, size = N, replace = TRUE) + 2000
)
head(grants)
projectleader projectpartner year_grant
1 D K 2002
2 M M 2001
3 K L 2005
4 N Q 2002
5 G D 2003
6 I B 2004
Function to create the small matrix
adjency <- function(year) {
  grants_00 <- subset(grants, year_grant == year,
                      select = c(projectpartner, projectleader))
  nw_00 <- network(grants_00, matrix.type = "edgelist", directed = TRUE)
  grants_00.adj <- as.matrix(nw_00, matrix.type = "adjacency")
  as.data.frame(grants_00.adj)
}
use plyr to get a list for every year
library(plyr)
years <- unique(grants$year_grant)
years <- years[order(years)]
bigMatrix <- llply(as.list(years), .fun = adjency)
Create full matrix (The answer)
# create an empty matrix filled with NA
population <- union(grants$projectpartner, grants$projectleader)
population_size <- length(population)
full_matrix <- matrix(rep(NA, population_size * population_size),
                      nrow = population_size)
rownames(full_matrix) <- colnames(full_matrix) <- population
find the location where they match
frn <- as.matrix(bigMatrix[[1]])
tmp <- match(rownames(frn), rownames(full_matrix))
tmp2 <- match(colnames(frn), colnames(full_matrix))
# do a merge
full_matrix[tmp,tmp2] <- frn
head(bigMatrix[[1]])
D I J K O Q S
D 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0
J 1 0 0 0 0 0 0
K 0 0 0 0 0 0 0
O 0 0 0 1 0 0 0
Q 0 1 0 0 0 0 0
the full matrix
K M L Q D B E J C S O F G N I A H
K 0 NA NA 0 0 NA NA 0 NA 0 0 NA NA NA 0 NA NA
M NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
L NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Q 0 NA NA 0 0 NA NA 0 NA 0 0 NA NA NA 1 NA NA
D 0 NA NA 0 0 NA NA 0 NA 0 0 NA NA NA 0 NA NA
B NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
E NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
J 0 NA NA 0 1 NA NA 0 NA 0 0 NA NA NA 0 NA NA
C NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
S 0 NA NA 1 0 NA NA 0 NA 0 0 NA NA NA 0 NA NA
O 1 NA NA 0 0 NA NA 0 NA 0 0 NA NA NA 0 NA NA
F NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
N NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
I 0 NA NA 0 0 NA NA 0 NA 0 0 NA NA NA 0 NA NA
A NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
H NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
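As a side note, the match()/merge step can be sketched more directly with character indexing, since R lets you subset a named matrix by row and column names. The small matrices below are hypothetical stand-ins for the grants data:

```r
# full matrix over the whole population, initialised to NA
population <- LETTERS[1:5]
full_matrix <- matrix(NA, 5, 5, dimnames = list(population, population))

# a small "yearly" adjacency matrix covering only some actors
frn <- matrix(c(0, 1, 0, 0), nrow = 2,
              dimnames = list(c("B", "D"), c("B", "D")))

# rows and columns are located by name; no match() needed
full_matrix[rownames(frn), colnames(frn)] <- frn
full_matrix["D", "B"]
# [1] 1
```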
