matching id inside data frame - r

I made this simple data frame to make my question more clear:
id = c(11, 12, 13, 14, 15)
referenceperson = c("yes", "no", "yes", "no", "yes")
smoke = c(3, 4, 3, NA, 2)
spouseid = c(12, 11, NA, 15, 14)
dataframe = data.frame(id, referenceperson , smoke, spouseid)
I would like to get the amount of smoking of the spouse of a reference person only; in this example, the value 4 for the first observation.
I'm lost here and thanks for any help

Using only the values in your dataframe object, I will step through it and present a compact method of getting the single value you ask for, and then all the values:
> dataframe[ match(dataframe$spouseid[1], dataframe$id) , 'smoke']
[1] 4
That was the method of getting the index of the spouse of the person in the first row and using it to get the 'smoke' value in the referenced row. The next line demonstrates that match will get you all such indices and will return an NA where they don't exist.
> match(dataframe$spouseid, dataframe$id)
[1] 2 1 NA 5 4
In R using NA as an index into a dataframe will return an NA, rather than a null value. This preserves sequence information. Therefore, you can get all the smoking values of spouses with this:
> dataframe[ match(dataframe$spouseid, dataframe$id) , 'smoke']
[1] 4 3 NA 2 NA
And then assign those values to a column in the dataframe.
> dataframe$smk_stat_spouse <-
dataframe[ match(dataframe$spouseid, dataframe$id) , 'smoke']
> dataframe
  id referenceperson smoke spouseid smk_stat_spouse
1 11             yes     3       12               4
2 12              no     4       11               3
3 13             yes     3       NA              NA
4 14              no    NA       15               2
5 15             yes     2       14              NA

I believe I found a solution, although it is very messy (I'm new to r)
df1 <- cbind(id, referenceperson)
df1 <- as.data.frame(df1)
df2 <- cbind(spouseid, smoke)
df2 <- as.data.frame(df2)
matched <- df2$smoke[match(df1$id, df2$spouseid) ]
refp <- ifelse(referenceperson=="yes", 1, referenceperson)
refp <- ifelse(refp=="no", NA, refp)
refp <- as.numeric(refp)
refp*matched
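A more compact version of the same idea, as a sketch that assumes the dataframe object from the question: look up each spouse's smoke value with match, then blank it out for everyone who is not a reference person. The column name spouse_smoke_ref is made up for illustration.

```r
dataframe <- data.frame(
  id = c(11, 12, 13, 14, 15),
  referenceperson = c("yes", "no", "yes", "no", "yes"),
  smoke = c(3, 4, 3, NA, 2),
  spouseid = c(12, 11, NA, 15, 14)
)

# smoke value of each person's spouse (NA when there is no spouse)
spouse_smoke <- dataframe$smoke[match(dataframe$spouseid, dataframe$id)]

# keep it only for reference persons; spouse_smoke_ref is a hypothetical name
dataframe$spouse_smoke_ref <- ifelse(dataframe$referenceperson == "yes",
                                     spouse_smoke, NA)
dataframe$spouse_smoke_ref
```

Only the first row ends up with a value here: the third reference person has no spouse, and the fifth one's spouse has a missing smoke value.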

Related

Perform calculations on row depending on individual cells [duplicate]

I have a data frame in R that looks like
1 3 NULL,
2 NULL 5,
NULL NULL 9
I want to iterate through each row and add the two numbers that are present. If there aren't two numbers present I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop. Sorry, I'm not sure how to format the matrix above.
for(i in 1:nrow(df))
Data:
df <- data.frame(
v1 = c(1, 2, NA),
v2 = c(3, NA, NA),
v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = TRUE)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
If you do need a for loop:
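df$sum <- NA_real_ # create the column first
for(i in seq_len(nrow(df))){
  # sum only the original columns so the sum column itself is not added in
  df$sum[i] <- rowSums(df[i, c("v1", "v2", "v3")], na.rm = TRUE)
}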
If you have something with NULL you can make it a data.frame, but that will make the columns with NULL a character vector. You have to convert those to numeric, which will then introduce NA for NULL.
rowSums will then create the sum you want.
df <- read.table(text=
"
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
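", header = TRUE, stringsAsFactors = FALSE) # keep the "NULL" columns as character, not factor
# make columns numeric; this changes "NULL" to NA (with a coercion warning)
df <- data.frame(lapply(df, as.numeric))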
cbind(df, sum=rowSums(df, na.rm = T))
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9
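The question also asks for an error when a row does not contain two numbers, which rowSums alone does not give you. One way to sketch that check, assuming the NA-based df from the answer above (sum_pairs is a hypothetical helper name, not an existing function):

```r
df <- data.frame(
  v1 = c(1, 2, NA),
  v2 = c(3, NA, NA),
  v3 = c(NA, 5, 9)
)

# hypothetical helper: sum each row, but stop if any row
# does not hold exactly two non-missing values
sum_pairs <- function(d) {
  n_present <- rowSums(!is.na(d))
  bad <- which(n_present != 2)
  if (length(bad) > 0) {
    stop("rows without exactly two values: ", paste(bad, collapse = ", "))
  }
  rowSums(d, na.rm = TRUE)
}

ok <- sum_pairs(df[1:2, ])                # rows 1 and 2 each have two numbers
bad <- try(sum_pairs(df), silent = TRUE)  # row 3 has only one, so this errors
```

Counting the non-missing cells up front keeps the check vectorised, so no explicit loop over rows is needed.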

recoding integers in a vector so they register as NA instead?

I hope this isn't a silly question, but I am REALLY struggling to recode a variable in R so that certain values register as NA instead of the placeholder integers that got read in. Respondents who did not answer the question for that column were originally coded as -88, -89 and -99 instead of NA, and I only know how to remove them completely from that column.
I want to keep that row and just have those inputs registered as missing. Recode doesn't seem to work b/c NA isn't a value.
Thanks!
Maybe you can try replace
v <- replace(v,v%in%c(-88,-89,-99),NA)
such that
> v
[1] 1 2 NA NA -1 NA NA
Dummy Data
v <- c(1,2,-88,-89,-1,-99,-89)
You can use the %in% operator to find all positions in a vector which match with another vector, and then set them to NA as follows:
dat = data.frame(V1 = c(10, 20, 30, -88, -89, -99))
dat$V1[dat$V1 %in% c(-88, -89, -99)] = NA
dat
V1
1 10
2 20
3 30
4 NA
5 NA
6 NA
Here's one way to do it, which will replace all values of -88, -89 and -99 in your data:
for (i in c(-88, -89, -99)){
data.df[data.df == i] <- NA
}
If you need to just replace in one column (e.g. column 'x'):
for (i in c(-88, -89, -99)){
data.df$x[data.df$x == i] <- NA
}
The most appropriate answer depends on the specifics of your data. If you have a numeric variable and all valid values are non-negative, this would work:
somedata <-
tibble::tribble(
~v1, ~v2,
1, 2,
3, 4,
-88, 5,
6, -89,
-99, 1
)
library(tidyverse)
somedata %>%
mutate(v1 = ifelse(v1 < 0, NA, v1))
# A tibble: 5 x 2
v1 v2
<dbl> <dbl>
1 1 2
2 3 4
3 NA 5
4 6 -89
5 NA 1
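If some negative values are legitimate (like the -1 in the dummy vector earlier), testing v1 < 0 would wipe those out as well. A base R sketch that matches only the explicit codes and applies to every column at once; the name missing_codes and the extra -1 in v2 are made up for illustration:

```r
missing_codes <- c(-88, -89, -99)  # hypothetical name for the placeholder codes

somedata <- data.frame(
  v1 = c(1, 3, -88, 6, -99),
  v2 = c(2, 4, 5, -89, -1)         # -1 is assumed to be a real value here
)

# replace only the listed codes, column by column;
# somedata[] keeps the data frame structure
somedata[] <- lapply(somedata, function(col) {
  replace(col, col %in% missing_codes, NA)
})
somedata
```

The -1 in v2 survives, while -88, -89 and -99 become NA in both columns.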
Thanks so much again to everyone for your help!
I first converted the variable to numeric, then this seemed to work for me:
anesCSV$clinton.withNA <- replace(anesCSV$clintonthermo_numeric,anesCSV$clintonthermo_numeric%in%c(-88,-89,-99),NA)
As someone initially suggested:
v <- replace(v,v%in%c(-88,-89,-99),NA)
I did create a new variable to store the results personally!

Automate replacement of missing data on a sequence of variables using mutate_all

I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values on the variables ending in "x1" with their corresponding values on the variables ending in "x2". That is, if ax1 is missing it is replaced by ax2, any missingness on bx1 is replaced by bx2, and so on. Since there are many more variables than in the scenario presented here, I am looking for a way to automate this process. I have tried the following code
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace the NA values in the x1 columns with the corresponding values in the x2 columns using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
I modified the data a bit so that the NAs are actual NA values and not the string "NA", which was turning the columns into factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))
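The Map and map2 versions pair columns by position, so they rely on the x1 and x2 columns appearing in the same order. A sketch of a variant that pairs them by name instead, assuming every "...x1" column has a partner with the same prefix ending in "x2":

```r
df <- data.frame(id = c(1:5),
                 ax1 = c(1, NA, 8, NA, 17),
                 bx1 = c(2, 7, NA, 11, 12),
                 ax2 = c(2, 1, 8, 15, 17),
                 bx2 = c(2, 6, 4, 13, 11))

for (x1 in grep("x1$", names(df), value = TRUE)) {
  x2 <- sub("x1$", "x2", x1)        # partner column name, e.g. ax1 -> ax2
  fill <- is.na(df[[x1]])           # rows where the x1 value is missing
  df[[x1]][fill] <- df[[x2]][fill]  # copy the x2 value into those rows
}
df
```

Deriving the partner name with sub means the result does not change if the columns are reordered.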

R: compare multiple columns pairs and place value on new corresponding variable

I am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
library(tidyverse) # provides dplyr, tidyr (gather/spread) and stringr (str_extract)
mydata %>%
select(id,matches("^pair_q")) %>% # keeps only left part of pairs
gather(k,v1,-id) %>% # transforms into tuples (id,variable name,variable value)
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df1 # converts variable name into variable number
mydata %>%
select(id,matches("^pair_\\d")) %>% # same on right part of pairs
gather(k,v2,-id) %>%
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df2
inner_join(df1,df2,by=c("id","n")) %>%
mutate(w=case_when(is.na(v1) ~ v2, # builds new variable value
is.na(v2) ~ v1, # from your rules
v1==v2 ~ v1,
TRUE ~999),
k=paste0("newvar_q",n)) %>% # builds new variable name from variable number
select(id,k,w) %>% # keeps only useful columns
spread(k,w) %>% # switches back from tuple view to wide view
inner_join(mydata,by="id") # and merges the new variables to the original data
#   id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
#1 100         1       999          4       1            NA       1             2       NA             4
#2 101         2         1        999      NA             2       1            NA        2             3
#3 102         1        NA        999       1             1      NA            NA        4             1
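The same three rules can also be sketched in base R without reshaping. This assumes the naming pattern from the toy data ("pair_qX" next to a zero-padded "pair_XX_v_rde"); combine_pair is a hypothetical helper name.

```r
mydata <- data.frame(
  id = c(100, 101, 102),
  pair_q1 = c(1, NA, 1),   pair_01_v_rde = c(NA, 2, 1),
  pair_q2 = c(1, 1, NA),   pair_02_v_rde = c(2, NA, NA),
  pair_q50 = c(NA, 2, 4),  pair_50_v_rde = c(4, 3, 1)
)

# hypothetical helper implementing the rules from the question
combine_pair <- function(a, b) {
  out <- ifelse(is.na(a), b, a)               # take whichever side is present
  conflict <- !is.na(a) & !is.na(b) & a != b  # both present but different
  out[conflict] <- 999
  out
}

for (n in c(1L, 2L, 50L)) {                   # would be 1:50 on the full data
  left  <- sprintf("pair_q%d", n)
  right <- sprintf("pair_%02d_v_rde", n)
  mydata[[sprintf("newvar_q%d", n)]] <- combine_pair(mydata[[left]], mydata[[right]])
}
mydata[, c("id", "newvar_q1", "newvar_q2", "newvar_q50")]
```

Because new columns are appended at the end, this keeps the original column order from the question.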

Sort vector keeping NAs position in R

Problem 1 (solved)
How can I sort vector DoB:
DoB <- c(NA, 9, NA, 2, 1, NA)
while keeping the NAs in the same position?
I would like to get:
> DoB
[1] NA 1 NA 2 9 NA
I have tried this (borrowing from this answer)
NAs_index <- which(is.na(DoB))
DoB <- sort(DoB, na.last = NA)
for(i in 0:(length(NAs_index)-1))
DoB <- append(DoB, NA, after=(NAs_index[i+1]+i))
but
> DoB
[1] 1 NA 2 9 NA NA
Answer is
DoB[!is.na(DoB)] <- sort(DoB)
Thanks to @BigDataScientist and @akrun
Now, Problem 2
Say, I have a vector id
id <- 1:6
That I would also like to sort by the same principle, so that the values of id are ordered according to order(DoB), but keeping the NAs fixed in the same position?:
> id
[1] 1 5 3 4 2 6
You could do:
DoB[!is.na(DoB)] <- sort(DoB)
Edit: Concerning the follow up question in the comments:
You can use order() for that and take care of the NAs with the na.last parameter:
data <- data.frame(DoB = c(NA, 9, NA, 2, 1, NA), id = 1:6)
data$id[!is.na(data$DoB)] <- data$id[order(data$DoB, na.last = NA)] # indexing id also works when id is not 1:n
data$DoB[!is.na(data$DoB)] <- sort(data$DoB)
We create a logical index and then do the sort
i1 <- is.na(DoB)
DoB[!i1] <- sort(DoB[!i1])
DoB
#[1] NA 1 NA 2 9 NA
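For Problem 2, the same logical index can carry a second vector along. A sketch with made-up id values that are not simply 1:6, to show that the ids themselves are moved and not just row positions:

```r
DoB <- c(NA, 9, NA, 2, 1, NA)
id  <- c(101, 102, 103, 104, 105, 106)  # hypothetical ids for illustration

present <- !is.na(DoB)             # positions that take part in the sort
ord <- order(DoB, na.last = NA)    # ordering of the non-NA values (NAs dropped)

id[present]  <- id[ord]            # reorder ids alongside DoB
DoB[present] <- DoB[ord]           # same result as sort(DoB) here

DoB
id
```

Both right-hand sides are evaluated before the assignment, so reusing ord for id and DoB keeps the two vectors in step.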
