How to change an NA value in a specific row in R?

I am very new to R and still learning. My data is Titanic.csv, which has 891 observations and 13 variables. I would like to change the NA value in column 12 (column name "Embarked") to "S" for observation 62 (PassengerId 62) and to "C" for observation 830.
I found similar postings, but they didn't give me what I need.
How to replace certain values in a specific rows and columns with NA in R?
How to change NA value in a specific row and column?
My assignment asks me to use the function below.
boat<-within(boat,Embarked[is.na(Embarked)]<-"your choice here")
If I do this
boat<-within(boat,Embarked[is.na(Embarked)]<- "S")
or put "C" where it says "your choice here", it replaces both observations with either "S" or "C".
Below is the example of the Titanic.csv file.
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Owen male 22 1 0 A/5 1717.25 S
2 1 1 Cumings,John female 38 1 0 PC 9971.28 C85 C
17 0 3 Rice, Eugene male 2 4 1 382 29.125 Q
18 1 2 Williams,Charles male 0 0 2443 13 S
60 0 3 Goodwin, William male 11 5 2 CA 21 46.9 S
61 0 3 Sirayanian, Orsen male 22 0 0 2669 7.2292 C
62 1 1 Icard, Amelie female 38 0 0 11357 80 B28 NA
63 0 1 Harris, Henry male 45 1 0 36973 83.475 C83 S
My apologies if the sample dataframe is somewhat condensed.

# df is your data frame; the first index is the row (e.g. 62), the second is the column (e.g. 12)
df[62, 12]
# Now assign "S" with the `<-` operator
df[62, 12] <- "S"
# and check that NA has changed to S
df[62, 12]
#Embarked
#<chr>
# 1 S
# Same for row 830
df[830, 12] <- "C"
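Since the assignment asks for within(), the same fix can be sketched by adding a second condition so that only one passenger is touched at a time. This is a minimal sketch on a made-up four-row stand-in for the Titanic data (the data frame below is hypothetical, and it assumes Embarked is a character column rather than a factor):

```r
# Hypothetical miniature of the Titanic data with two missing Embarked values
boat <- data.frame(
  PassengerId = c(61, 62, 63, 830),
  Embarked    = c("C", NA, "S", NA),
  stringsAsFactors = FALSE
)

# Restrict the replacement to one passenger by combining is.na() with the ID
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 62] <- "S")
boat <- within(boat, Embarked[is.na(Embarked) & PassengerId == 830] <- "C")

boat$Embarked
# "C" "S" "S" "C"
```

Selecting by PassengerId instead of by row position keeps working even if the rows are later sorted or filtered.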

Related

Get the average of the values of one column for the values in another

I was not sure how to ask this question. I am trying to find the average tone when an initiative is mentioned, and additionally when a topic and a goal (or achievement) are mentioned. My data frame (df) has many mentions of 70 initiatives: it has 500+ rows of data, but only 70 distinct initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find the mean Tone when an Initiative is mentioned, as well as the mean Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The codes for Tone are: positive (1), neutral (2), negative (3), and both positive and negative (4). Goals and Achievements are coded yes (1) and no (2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
with this output:
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
Note that for Initiative, the value 0 means "other initiative".
I've also tried this code:
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with this output:
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances I do not get an average for Tone but get NAs instead.
I have removed the NAs from the column "Tone" and have also tried removing all the other missing values in the df (I deleted only about 30 values).
I have also recoded the values for Tone:
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than I think, but I am stuck and have no idea how to proceed.
I'd be super grateful for better code to do this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you want the average tone for Initiative == 1. You could try the following:
tabmean %>% filter(Initiative == 1) %>% summarise(avg_tone = mean(Tone, na.rm = TRUE))
Note that (1) you have to add na.rm = TRUE to the mean() call if you have missing values in the column that you are summarizing, otherwise it will only produce NAs, and (2) the column must be of type numeric. Your output shows Tone as <chr>, and mean() returns NA for character input. You can check the types with str(tabmean) and convert Tone with tabmean <- tabmean %>% mutate(Tone = as.numeric(Tone)).
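To get one average per initiative rather than for a single initiative, the conversion and the grouping can be combined. The following is a minimal sketch in base R on a made-up five-row stand-in for tabmean (the values are hypothetical). It assumes Tone is stored as character, as the <chr> column type in the question's output suggests. Tone is deliberately left out of the grouping variables, because grouping by Tone itself puts only one Tone value in each group.

```r
# Hypothetical miniature of tabmean; Tone is character, as in the question
tabmean <- data.frame(
  Initiative = c("52", "52", "19", "19", "0"),
  Tone       = c("2", "1", "1", NA, "2"),
  stringsAsFactors = FALSE
)

# mean() on a character column returns NA, so convert to numeric first
tabmean$Tone <- as.numeric(tabmean$Tone)

# One mean per Initiative; the formula interface drops rows with NA by default
avg_tone <- aggregate(Tone ~ Initiative, data = tabmean, FUN = mean)
avg_tone
```

For the combined question, more grouping columns can be added on the right-hand side of the formula, for example Tone ~ Initiative + Topic + Goals.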

Converting categorical columns to numerical values

I want to convert categorical columns in the dataset to be numerical values (1,2,3, etc).
How can I do this in R?
## Load vcd package
library(vcd)
## Load Arthritis dataset (data frame)
data(Arthritis)
Arthritis <- Arthritis[,2:5]
head(Arthritis)
Treatment Sex Age Improved
1 Treated Male 27 Some
2 Treated Male 29 None
3 Treated Male 30 None
4 Treated Male 32 Marked
5 Treated Male 46 Marked
6 Treated Male 58 Marked
Resulting dataset would look like this:
Treatment Sex Age Improved
[1,] 1 1 27 1
[2,] 1 1 29 0
[3,] 1 1 30 0
[4,] 1 1 32 2
[5,] 1 1 46 2
[6,] 1 1 58 2
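If the number of variables is large, you may consider automating the conversion:
Arthritis2 <- sapply(Arthritis, unclass)
Edit: to make the codes start at 0 (note that this also subtracts 1 from numeric columns such as Age):
Arthritis2 <- sapply(Arthritis, unclass) - 1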
Solution using named list and match function:
scores <- list("0" = "None", "1" = "Some", "2" = "Marked" )
Arthritis$Scores <- names(scores)[match(Arthritis$Improved, scores)]
head(Arthritis)
Sex Age Improved Scores
1 Male 27 Some 1
2 Male 29 None 0
3 Male 30 None 0
4 Male 32 Marked 2
5 Male 46 Marked 2
6 Male 58 Marked 2
If you don't want to keep Improved column, simply do this instead:
Arthritis$Improved <- names(scores)[match(Arthritis$Improved, scores)]
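If only the factor columns should be recoded and numeric columns such as Age left alone, one option is to convert the factor columns selectively. This is a minimal sketch on a small made-up data frame (arth is hypothetical, not the vcd data set). The level order passed to factor() is an assumption that decides which category gets which code.

```r
# Hypothetical stand-in for the Arthritis data
arth <- data.frame(
  Sex      = factor(c("Male", "Female", "Male")),
  Age      = c(27, 29, 30),
  Improved = factor(c("Some", "None", "Marked"),
                    levels = c("None", "Some", "Marked"))
)

# Find the factor columns and replace each with its zero-based level code
arth_num <- arth
is_fac <- vapply(arth_num, is.factor, logical(1))
arth_num[is_fac] <- lapply(arth_num[is_fac], function(col) as.integer(col) - 1L)

arth_num
```

Here Improved becomes 1, 0, 2 because its levels were given as None, Some, Marked, while Age keeps its original values.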

`mstate`: prepare "long" format data into "mstate" format data

The typical preparation steps for mstate involve converting "wide" format data (1x row per 'patient') into "multi-state" format data (multiple rows per 'patient' for each possible transition in the multi-state model).
For example, data in wide format:
library(mstate)
data(ebmt4)
ebmt <- ebmt4
> head(ebmt)
id rec rec.s ae ae.s recae recae.s rel rel.s srv srv.s year agecl proph match
1 1 22 1 995 0 995 0 995 0 995 0 1995-1998 20-40 no no gender mismatch
2 2 29 1 12 1 29 1 422 1 579 1 1995-1998 20-40 no no gender mismatch
3 3 1264 0 27 1 1264 0 1264 0 1264 0 1995-1998 20-40 no no gender mismatch
4 4 50 1 42 1 50 1 84 1 117 1 1995-1998 20-40 no gender mismatch
5 5 22 1 1133 0 1133 0 114 1 1133 0 1995-1998 >40 no gender mismatch
6 6 33 1 27 1 33 1 1427 0 1427 0 1995-1998 20-40 no no gender mismatch
Is converted to multi-state format:
tmat <- transMat(x = list(c(2, 3, 5, 6), c(4, 5, 6), c(4, 5, 6), c(5, 6), c(), c()), names = c("Tx", "Rec", "AE", "Rec+AE", "Rel", "Death"))
msebmt <- msprep(data = ebmt, trans = tmat, time = c(NA, "rec", "ae", "recae", "rel", "srv"), status = c(NA, "rec.s", "ae.s", "recae.s", "rel.s", "srv.s"), keep = c("match", "proph", "year", "agecl"))
> head(msebmt)
An object of class 'msdata'
Data:
id from to trans Tstart Tstop time status match proph year agecl
1 1 1 2 1 0 22 22 1 no gender mismatch no 1995-1998 20-40
2 1 1 3 2 0 22 22 0 no gender mismatch no 1995-1998 20-40
3 1 1 5 3 0 22 22 0 no gender mismatch no 1995-1998 20-40
4 1 1 6 4 0 22 22 0 no gender mismatch no 1995-1998 20-40
5 1 2 4 5 22 995 973 0 no gender mismatch no 1995-1998 20-40
6 1 2 5 6 22 995 973 0 no gender mismatch no 1995-1998 20-40
But what if my original dataset has time-varying covariates (i.e. long format) and I want to format the data into multi-state mode? All of the tutorials I have found online are only for converting initially wide data to multi-state data (not initially long data); for example the mstate package vignette.
So, let's say I have the below data df, where id is for a 'patient', (start,stop] tell us the time periods, state is the state the patient is in at the end of the time period, and tv.cov is their time-varying covariate (assumed constant over the time period). Note that only patient id=5 has 3x entries and that person's tv.cov changes.
id start stop state tv.cov
1 0 1 1 1
2 0 4 1 2
3 0 7 1 1
4 0 10 1 5
5 0 6 1 4
5 6 10 2 10
5 10 15 3 12
Assuming the basic "illness-death" transition model:
tmat <- mstate::trans.illdeath(names = c("healthy", "sick", "death"))
> tmat
to
from healthy sick death
healthy NA 1 2
sick NA NA 3
death NA NA NA
How can I prep df into multi-state format?
As a hack, should I set up the data in "wide" format, convert it into "multi-state" format using msprep, and then join onto it another data frame containing the time-varying covariates for each patient at each time interval?

Removing certain values from a data frame

I know there are already some threads like this, but I could not find any solutions.
I have a dataframe that looks like this:
Name Age Sex Survived
1 Allison 0.17 female 1
2 Leah 0.33 female 0
3 David 0.8 male 1
4 Daniel 0.83 male 1
5 Alex 0.83 male 1
6 Jay 0.92 male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
I want to remove ages that are below 1. I want the data to look like this:
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Or to just remove the rows with ages < 1 altogether.
Following other solutions I tried this but it didn't work
mydata[mydata$Age<"1"&&mydata$Age>"0"] <- NA
Here are three ways to remove the rows (using >= so that an age of exactly 1 is kept):
mydata[mydata$Age >= 1, ]
subset(mydata, Age >= 1)
filter(mydata, Age >= 1) # requires dplyr
Here is how to make them NA:
mydata$Age[mydata$Age < 1] <- NA
Your issue is that you are comparing against "1" as a character string (in quotes). Comparisons on character strings are lexicographic rather than numeric, so be careful. In addition, && is not vectorized; use & for element-wise conditions. Also make sure your Age column is numeric. If it might be a factor, the safe way to convert it is
mydata$Age <- as.numeric(as.character(mydata$Age))
so you don't accidentally get the factor's internal level codes instead of the ages.
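The difference between the two conversions can be seen in a short sketch. The vector below is made up for illustration and assumes Age was read in as a factor:

```r
# Hypothetical Age column that was read in as a factor
age <- factor(c("0.17", "16", "8"))

# Converting the factor directly returns the internal level codes
as.numeric(age)
# 1 2 3

# Going through character first recovers the actual ages
ages <- as.numeric(as.character(age))
ages
# 0.17 16.00 8.00

# With a numeric vector, the comparison and replacement behave as expected
ages[ages < 1] <- NA
ages
# NA 16 8
```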
edit
put the wrong signs. fixed now
> mydata[mydata$Age<1, "Age"] <- NA
> mydata
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Update
If Age is a factor, you can use
mydata[as.numeric(as.character(mydata$Age))<1, "Age"] <- NA

R Frequency Tables: prop.table does not work if all data points within variable share the outcome?

Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables showing the frequencies of those who drink wine, beer and water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array! Why?
Why does it make a difference if all data points within a variable have all the same outcome? Also, what can I do to circumvent this problem? Thanks guys!
The function prop.table passes its input to margin.table and sweep, both of which expect an array as first argument. Since your second con is a list and not an array, prop.table fails.
Why is your second con a list? Because the table for the column Water has just one entry, while the tables for the other columns have two. When the results have different lengths, apply can't simplify them to an array and gives you a list.
In the example you gave us, a safer way is to work with lapply instead, which always returns a list of results:
con <- lapply(df[, 2:4], table)
con_P <- lapply(con, function(x) x/sum(x))
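If a matrix is preferred over a list, an alternative sketch is to fix the set of possible outcomes before tabulating, so that every column produces a table of the same length. This assumes the only possible values are 0 and 1. The data frame below reproduces just the three drink columns of the second example.

```r
# The three drink columns from the second example (Water is always 1)
df <- data.frame(
  Wine  = c(0, 1, 0, 1, 1, 1, 0, 0, 0, 0),
  Beer  = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  Water = rep(1, 10)
)

# Declaring both levels up front makes table() report a zero count
# for outcomes that never occur, so every result has length 2
con <- sapply(df, function(x) table(factor(x, levels = c(0, 1))))

# con is now a 2 x 3 matrix, so prop.table works column-wise
con_P <- prop.table(con, 2)
con_P
```

The Water column then shows a proportion of 0 for outcome 0 and 1 for outcome 1 instead of causing an error.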
