Using R to combine information into one row [duplicate] - r

I have a table with 5 columns (ID, var, state, loc and position). The var column contains a description of a certain variant, e.g. var1. Within the table there are multiple rows for var1, but they differ in state and position. What I want to do is make a new table where each var appears only once and its positions are split into two columns based on state.
For example, say I have four var1 rows: two with the state H and two with the state h. In the new table I need the columns to be ID, var, loc, position if H, and position if h, such that all the information for var1 is in one row. I need to be able to do this for every variant in my original data set.
Current data example
structure(list(ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L,
NA, NA, NA, NA), var = c("var1", "var1", "var1", "var1", "var2",
"var2", NA, NA, NA, NA), state = c("H", "H", "h", "h", "H", "h",
NA, NA, NA, NA), loc = c(4L, 4L, 4L, 4L, 12L, 12L, NA, NA, NA,
NA), position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L, NA,
NA, NA, NA)), row.names = c("1", "2", "3", "4", "5", "6", "NA",
"NA.1", "NA.2", "NA.3"), class = "data.frame")
wanted format
structure(list(V1 = c("ID", "1234", "5678", NA, NA, NA, NA, NA,
NA, NA), V2 = c("var1", "var1", "var2", NA, NA, NA, NA, NA, NA,
NA), V3 = c("loc", "4", "12", NA, NA, NA, NA, NA, NA, NA), V4 = c("state H",
"6000 6002", "3002", NA, NA, NA, NA, NA, NA, NA), V5 = c("state h",
"6004 6006", "3004", NA, NA, NA, NA, NA, NA, NA)), row.names = c("1",
"2", "3", "NA", "NA.1", "NA.2", "NA.3", "NA.4", "NA.5", "NA.6"
), class = "data.frame")
Any guidance would be appreciated.

The answer to your question likely revolves around tidyr::pivot_wider.
I changed the example data (dropping the all-NA rows) because I believe yours was inconsistent.
Data
df<-structure(list(ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L
), var = c("var1", "var1", "var1", "var1", "var2", "var2"), state = c("H",
"H", "h", "h", "H", "h"), loc = c(4L, 4L, 4L, 4L, 12L, 12L),
position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
df
ID var state loc position
1 1234 var1 H 4 6000
2 1234 var1 H 4 6002
3 1234 var1 h 4 6004
4 1234 var1 h 4 6006
5 5678 var2 H 12 3002
6 5678 var2 h 12 3004
Answer
library(tidyr)
df %>% pivot_wider(names_from = state,
values_from = position,
values_fn = toString)
# A tibble: 2 × 5
ID var loc H h
<int> <chr> <int> <chr> <chr>
1 1234 var1 4 6000, 6002 6004, 6006
2 5678 var2 12 3002 3004
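For readers who would rather avoid the tidyverse, roughly the same reshaping can be sketched in base R. This is only an illustration under stated assumptions, not part of the original answer: it rebuilds the six-row df shown above, and the names collapsed and wide, as well as the single-space separator (chosen to mimic the "wanted format"), are choices made here.

```r
# Same six-row example data as in the answer above
df <- data.frame(
  ID = c(1234L, 1234L, 1234L, 1234L, 5678L, 5678L),
  var = c("var1", "var1", "var1", "var1", "var2", "var2"),
  state = c("H", "H", "h", "h", "H", "h"),
  loc = c(4L, 4L, 4L, 4L, 12L, 12L),
  position = c(6000L, 6002L, 6004L, 6006L, 3002L, 3004L)
)

# Step 1: one row per ID/var/loc/state, positions pasted into one string
collapsed <- aggregate(position ~ ID + var + loc + state, data = df,
                       FUN = paste, collapse = " ")

# Step 2: spread the two states into separate columns
# (reshape names them position.H and position.h)
wide <- reshape(collapsed, idvar = c("ID", "var", "loc"),
                timevar = "state", direction = "wide")
wide
```

The resulting columns are called position.H and position.h rather than H and h; rename them afterwards if the shorter names are preferred.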

Related

R: Pivot dataframe longer with column names to a single column and rows to different columns [duplicate]

The output of an sapply call returns class: [1] "matrix" "array" which I have never seen before. Transforming this to a dataframe gives a 'wide' format of what I want, though. For example:
df<-structure(list(`04232395` = c("22.8", "0", NA, NA), `04232200` = c("39.4",
"2024", "47.55", "1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA"
), `04232185` = c(NA, "0", NA, NA), `04232165` = c(NA, "0", NA,
NA), `01515600` = c("0.93", "0", NA, NA), `01515580` = c("4.77",
"0", NA, NA), `01515600` = c("0.93", "0", NA, NA), `01515580` = c("4.77",
"0", NA, NA), `01515720` = c("4.44", "0", NA, NA), `01515790` = c("1.32",
"0", NA, NA), `01515600` = c("0.93", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `01515790` = c("1.32", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `01515790` = c("1.32", "0", NA, NA), `01515600` = c("0.93",
"0", NA, NA), `01515580` = c("4.77", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `04244903` = c("8.02", "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233700` = c("40.3",
"822", "2.25", NA), `0423368620` = c("29.7", "713", "1.95", NA
), `04233676` = c("20.7", "0", NA, NA), `04233633` = c("40.2",
"0", NA, NA), `04233635` = c(NA, "0", NA, NA), `04233255` = c("86.7",
"3903", "10.69", NA), `04234000` = c("126", "35786", "98.04",
NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `01513990` = c("17", "0", NA, NA), `01513910` = c("15.8",
"0", NA, NA), `01514000` = c("185", "19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513990` = c("17", "0", NA, NA), `01514000` = c("185", "19523",
"92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513840` = c("8.59", "791", "2.16", NA), `01513831` = c("4216",
"4877", "13.36", NA), `01513990` = c("17", "0", NA, NA), `01513830` = c("20.7",
"0", NA, NA), `01514880` = c("46.5", "0", NA, NA), `01514000` = c("185",
"19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513840` = c("8.59", "791", "2.16", NA), `01513831` = c("4216",
"4877", "13.36", NA), `01513830` = c("20.7", "0", NA, NA), `01514000` = c("185",
"19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513990` = c("17", "0", NA, NA), `01513990` = c("17", "0",
NA, NA), `01513910` = c("15.8", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `01513930` = c(NA, "0",
NA, NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42", "0",
NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233255` = c("86.7",
"3903", "10.69", NA), `04234000` = c("126", "35786", "98.04",
NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42", "0",
NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233275` = c(NA,
"0", NA, NA), `0423368620` = c("29.7", "713", "1.95", NA), `04233676` = c("20.7",
"0", NA, NA), `01513930` = c(NA, "0", NA, NA), `04233678` = c("2.73",
"456", "1.25", NA), `04233286` = c("27", "7390", "20.24", NA),
`04233700` = c("40.3", "822", "2.25", NA), `04233275` = c(NA,
"0", NA, NA), `0423368620` = c("29.7", "713", "1.95", NA),
`04233676` = c("20.7", "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA),
`04233286` = c("27", "7390", "20.24", NA), `04233275` = c(NA,
"0", NA, NA), `01513930` = c(NA, "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA),
`04233700` = c("40.3", "822", "2.25", NA), `0423368620` = c("29.7",
"713", "1.95", NA), `04233676` = c("20.7", "0", NA, NA),
`04233678` = c("2.73", "456", "1.25", NA), `04233700` = c("40.3",
"822", "2.25", NA), `0423368620` = c("29.7", "713", "1.95",
NA), `04233676` = c("20.7", "0", NA, NA), `04233633` = c("40.2",
"0", NA, NA), `04233635` = c(NA, "0", NA, NA), `04233678` = c("2.73",
"456", "1.25", NA), `04233700` = c("40.3", "822", "2.25",
NA), `0423368620` = c("29.7", "713", "1.95", NA), `04233676` = c("20.7",
"0", NA, NA), `04233633` = c("40.2", "0", NA, NA), `04233635` = c(NA,
"0", NA, NA), `04244985` = c("55.5", "0", NA, NA), `04244990` = c("18.8",
"0", NA, NA), `04244903` = c("8.02", "0", NA, NA), `04244000` = c("66.3",
"9626", "72.4", "1968-09-29 to 2014-09-30 , NA"), `04243400` = c(NA,
"0", NA, NA), `04243390` = c("24.9", "0", NA, NA), `01338800` = c("43.6",
"0", NA, NA), `04243400` = c(NA, "0", NA, NA), `04243390` = c("24.9",
"0", NA, NA), `01338000` = c("144", "3042", "8.33", NA),
`01339060` = c("59.8", "3056", "8.37", NA), `01338800` = c("43.6",
"0", NA, NA), `01503970` = c(NA, "0", NA, NA), `01503980` = c("24.3",
"0", NA, NA), `04243390` = c("24.9", "0", NA, NA), `01503970` = c(NA,
"0", NA, NA), `01503980` = c("24.3", "0", NA, NA)), row.names = c(NA,
-4L), class = "data.frame")
I would like the column names of this dataframe to be in a single column and the 4 rows of each column in the old dataframe to become 4 individual columns in the new one. I tried:
df_new<-data.frame(site = colnames(df), DA = df[1,], Q = df[2,], POR = df[3,], Q_gaps = df[4,])
but this is not right.
I have also tried
df_new<-df %>%
pivot_longer(everything())%>%
group_by(name)%>%
mutate(DA = slice(value,1), Q = slice(value,2), POR = slice(value,3), Q_gaps = slice(value,4))
but this resulted in an error.
Searching for this question does not return what I am looking for, so this may be duplicated elsewhere, I just don't know how to ask it.
EDIT:
this worked and gives the desired output, but any more elegant solution?
df_new<-data.frame(site = colnames(df), DA = as.numeric(df[1,]), Q = as.numeric(df[2,]), POR = as.numeric(df[3,]), Q_gaps = as.character(df[4,]))
Edited for the new data (adds setNames/make.unique up front).
library(dplyr)
library(tidyr)
setNames(df, make.unique(names(df))) %>%
mutate(nm = c("DA", "Q", "POR", "Q_gaps")) %>%
pivot_longer(-nm, names_to = "site") %>%
pivot_wider(id_cols = site, names_from = "nm", values_from = "value") %>%
mutate(across(-site, ~ type.convert(., as.is=TRUE)), site = sub("\\.[0-9]+$", "", site))
# # A tibble: 132 × 5
# site DA Q POR Q_gaps
# <chr> <dbl> <int> <dbl> <chr>
# 1 04232395 22.8 0 NA NA
# 2 04232200 39.4 2024 47.6 1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA
# 3 04232185 NA 0 NA NA
# 4 04232165 NA 0 NA NA
# 5 01515600 0.93 0 NA NA
# 6 01515580 4.77 0 NA NA
# 7 01515600 0.93 0 NA NA
# 8 01515580 4.77 0 NA NA
# 9 01515720 4.44 0 NA NA
# 10 01515790 1.32 0 NA NA
# # … with 122 more rows
# # ℹ Use `print(n = ...)` to see more rows
The site = sub(...) step removes the .1 (etc.) suffixes that make.unique adds to duplicated column names; this should be safe as long as your real site strings never end in a period followed by one or more digits.
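To make the suffix handling concrete, here is a small base R sketch of the make.unique / sub round trip. The vector sites and the string "0423.5" are made-up values for illustration; only make.unique and sub come from the answer.

```r
# Hypothetical site IDs with duplicates, as in the question's column names
sites <- c("01515600", "01515580", "01515600", "01515600")

# make.unique appends .1, .2, ... to the second and later duplicates
unique_sites <- make.unique(sites)
unique_sites

# Stripping a trailing ".<digits>" recovers the original IDs
restored <- sub("\\.[0-9]+$", "", unique_sites)
identical(restored, sites)

# Caveat: an ID that genuinely ends in ".<digits>" would be truncated too
risky <- sub("\\.[0-9]+$", "", "0423.5")
risky
```

If real site names could end in a period plus digits, one option would be to keep the original names in a separate vector before calling make.unique, rather than stripping the suffix afterwards.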
Or we can use akrun's suggestion:
as.data.frame(t(setNames(df, make.unique(names(df))))) %>%
setNames(c("DA","Q","POR","Q_gaps")) %>%
tibble::rownames_to_column("site") %>%
mutate(
across(-site, ~ type.convert(., as.is = TRUE)),
site = sub("\\.[0-9]+$", "", site)
) %>%
head()
# site DA Q POR Q_gaps
# 1 04232395 22.80 0 NA <NA>
# 2 04232200 39.40 2024 47.55 1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA
# 3 04232185 NA 0 NA <NA>
# 4 04232165 NA 0 NA <NA>
# 5 01515600 0.93 0 NA <NA>
# 6 01515580 4.77 0 NA <NA>

How to create subsets of a dataframe based on columns using a for loop in R

I have a dataframe which looks like this:
id age1 sex1 age2 sex2 age3 sex3 age4 sex4
1 5 20 <NA> NA <NA> NA <NA> 27 Female
2 25 NA <NA> NA <NA> NA <NA> 35 Female
3 65 NA <NA> NA <NA> NA <NA> NA <NA>
this is the code for the data:
temp <- structure(list(id = c(5L, 25L, 65L, 25L, 65L, 5L, 5L, 85L, 285L,
541L), age1 = c(20L, NA, NA, NA, NA, NA, NA, NA, NA, NA), sex1 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent", "Male",
"Female"), class = "factor"), age2 = c(NA, NA, NA, NA, 31L,
NA, NA, NA, NA, NA), sex2 = structure(c(NA, NA, NA, NA, 7L,
NA, NA, NA, NA, NA), .Label = c("missing", "inapplicable", "refusal",
"don't know", "inconsistent", "Male", "Female"), class = "factor"),
age3 = c(NA, NA, NA, NA, 32L, NA, NA, NA, 25L, 23L), sex3 = structure(c(NA,
NA, NA, NA, 7L, NA, NA, NA, 6L, 7L), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent",
"Male", "Female"), class = "factor"), age4 = c(27L, 35L,
NA, NA, 33L, NA, 24L, NA, 26L, NA), sex4 = structure(c(7L,
7L, NA, NA, 7L, NA, 7L, NA, 6L, NA), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent",
"Male", "Female"), class = "factor")), row.names = c(NA,
10L), class = "data.frame")
I would like to know how to make multiple subsets of the data based on the columns.
I know I could do this by using the codes:
Subset1<- temp[,1:3]
Subset2<-temp[,c(1,4:5)]
Subset3<- temp[,c(1,6:7)]
But there must be a more concise way to do this. I've tried a for loop, but I'm new to R and don't know how to do this while keeping the names of the new subsets consistent.
We can use split.default to split data based on number in the column names and append the first column in each list.
new_list <- lapply(split.default(temp[-1], gsub("\\D", "", names(temp)[-1])),
function(x) cbind(temp[1], x))
new_list
#$`1`
# id age1 sex1
#1 5 20 <NA>
#2 25 NA <NA>
#3 65 NA <NA>
#4 25 NA <NA>
#5 65 NA <NA>
#6 5 NA <NA>
#7 5 NA <NA>
#8 85 NA <NA>
#9 285 NA <NA>
#10 541 NA <NA>
#$`2`
# id age2 sex2
#1 5 NA <NA>
#...
This returns a list of dataframes. If you want the data in separate dataframes, we can do:
names(new_list) <- paste0('Subset', seq_along(new_list))
list2env(new_list, .GlobalEnv)
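As a rough illustration of how the grouping key drives split.default, the sketch below runs the same idea on a cut-down data frame. The names temp_small, key and subsets are hypothetical, and the list is deliberately kept as a list instead of being copied into the global environment.

```r
# Cut-down stand-in for temp: one id column plus two age/sex pairs
temp_small <- data.frame(
  id = c(5L, 25L),
  age1 = c(20L, NA), sex1 = c(NA, "Female"),
  age2 = c(NA, 31L), sex2 = c(NA, "Male")
)

# Strip every non-digit from the column names to get the group key
key <- gsub("\\D", "", names(temp_small)[-1])
key

# Split the columns by key, then put id back in front of each piece
subsets <- lapply(split.default(temp_small[-1], key),
                  function(x) cbind(temp_small[1], x))

# Name each piece after its own suffix so Subset2 always holds age2/sex2
names(subsets) <- paste0("Subset", names(subsets))
names(subsets$Subset2)
```

Naming by the extracted suffix rather than by position keeps the label tied to the column number; note that the split is ordered by the key as text, so with ten or more groups "10" would sort before "2".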
Here is another base R solution
ind <- 1:4
list2env(setNames(lapply(ind, function(k) subset(temp,select = c(1,2*k+(0:1)))),
paste0("Subset",ind)),
envir = .GlobalEnv)
where subset + lapply was used

How can I extract rows from above and below a specific row in an R dataframe?

Currently I'm working with some Fastq sequencing data. I have a dataframe with three columns and hundreds of rows. The first column contains the raw sequencing reads and the others contain information about those reads. I want to return a row with the string "FALSE" in the 3rd column, plus the row directly above this, and two rows directly below it. I think it is similar to grep -A -B in shell.
I've looked around and my question is very similar to this one:
Returning above and below rows of specific rows in r dataframe
However, the answers here are based on row-names and not strings within the rows. My row names are just numbers in numerical order.
Fastq Output BARCODE Dulplicated
1 ReadName1 NA NA
2 ReadSeq1 TGTG TTAT FALSE
3 + NA NA
4 Ascii_score1 NA NA
5 ReadName2 NA NA
6 ReadSeq2 TGCT TTAT FALSE
7 + NA NA
8 Ascii_score2 NA NA
9 ReadName3 NA NA
10 ReadSeq3 TGCT TTAT TRUE
11 + NA NA
12 Ascii_score3 NA NA
If the Dulplicated column has character values, you can do
inds <- which(df$Dulplicated == "FALSE")
df[sort(unique(c(inds, inds - 1, inds + 1, inds + 2))), ]
# FastqOutput BARCODE Dulplicated
#1 ReadName1 <NA> NA
#2 ReadSeq1 TGTGTTAT FALSE
#3 + <NA> NA
#4 Ascii_score1 <NA> NA
#5 ReadName2 <NA> NA
#6 ReadSeq2 TGCTTTAT FALSE
#7 + <NA> NA
#8 Ascii_score2 <NA> NA
Or similarly using dplyr::slice
library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))
data
df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L,
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1",
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3",
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE =
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT",
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA,
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
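One detail the index arithmetic above glosses over is what happens when a match sits in the first or last rows: an index of 0 is silently dropped, while an index past the end yields an all-NA row. The following base R sketch, using a made-up data frame df_small and a hypothetical offsets vector, clips the window to the valid row range.

```r
# Hypothetical data: matches in the very first and very last row
df_small <- data.frame(
  read = paste0("r", 1:6),
  dup = c(FALSE, NA, NA, NA, NA, FALSE)
)

# Rows where dup is FALSE (which() drops the NAs)
hits <- which(!df_small$dup)

# One row above through two rows below each hit
offsets <- -1:2
idx <- sort(unique(as.vector(outer(hits, offsets, "+"))))

# Keep only indices that actually exist in the data frame
idx <- idx[idx >= 1 & idx <= nrow(df_small)]

df_small[idx, ]
```

Changing offsets is enough to widen or narrow the window, much like adjusting the -A and -B values of grep.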
We can use data.table
library(data.table)
setDT(df)[df[, {i1 <- .I[which(!as.logical(Dulplicated))]
sort(unique(rep(i1, each = 4L) + (-1:2))) }]]
# FastqOutput BARCODE Dulplicated
#1: ReadName1 <NA> NA
#2: ReadSeq1 TGTGTTAT FALSE
#3: + <NA> NA
#4: Ascii_score1 <NA> NA
#5: ReadName2 <NA> NA
#6: ReadSeq2 TGCTTTAT FALSE
#7: + <NA> NA
#8: Ascii_score2 <NA> NA
Or it can be written more compactly
setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -1:2))]]

Igraph Identifying nodes based on centrality scores

I'm running the igraph package for some network analysis on this example dataset
structure(list(Column1 = c(NA, NA, NA, NA), Column2 = c(NA,
NA, NA, NA), Column3 = c(NA, NA, NA, NA), Column4 = c(NA, NA,
NA, NA), Column5 = structure(c(2L, 1L, 4L, 3L), .Label = c("Eric ",
"Jim", "Matt", "Tim"), class = "factor"), Column6 = c(NA, NA,
NA, NA), Column7 = structure(c(1L, 3L, 2L, 3L), .Label = c("Eric",
"Erica", "Mary "), class = "factor"), Column8 = structure(c(3L,
2L, 1L, 3L), .Label = c("Beth", "Loranda", "Matt"), class = "factor"),
Column9 = structure(c(2L, 3L, 1L, 3L), .Label = c("Courtney ",
"Heather ", "Patrick"), class = "factor"), Column10 = structure(4:1, .Label = c("Beth",
"Heather", "John", "Loranda "), class = "factor"), Column11 = c(NA,
NA, NA, NA), Column12 = c(NA, NA, NA, NA), Column13 = c(NA,
NA, NA, NA), Column14 = c(NA, NA, NA, NA), Column15 = c(NA,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))
Here is the edgelist for anyone who wants to skip the step of finding that
structure(c("Jim", "Eric ", "Tim", "Matt", "Jim", "Eric ", "Tim",
"Matt", "Jim", "Eric ", "Tim", "Matt", "Jim", "Eric ", "Tim",
"Matt", "Eric", "Mary ", "Erica", "Mary ", "Matt", "Loranda",
"Beth", "Matt", "Heather ", "Patrick", "Courtney ", "Patrick",
"Loranda ", "John", "Heather", "Beth"), .Dim = c(16L, 2L), .Dimnames = list(
NULL, c("Column5", "value")))
I'm trying to calculate centrality for each of the nodes in the network using this code (mat is my edgelist matrix)
g1=graph_from_edgelist(mat)
degree.cent <- centr_degree(g1, mode = "all")
degree.cent
My output is something like this
> degree.cent
$`res`
[1] 4 1 4 2 4 1 6 1 2 1 2 1 1 1 1
$centralization
[1] 0.1479592
$theoretical_max
[1] 392
I know degree.cent$res holds my centrality scores, but what isn't clear to me is which nodes are actually receiving each score. I looked up a tutorial here, but all it says is the first score is "node 1". There's no indication of what node 1 is or an easy way to identify it.
Firstly, you are getting incorrect results because some of the names contain trailing spaces ("Eric ", "Mary ", "Heather ", ...), so the same person is counted as two different vertices. So, let
mat <- gsub(" ", "", mat)
g1 <- graph_from_edgelist(mat)
degree.cent <- centr_degree(g1, mode = "all")
Now we may extract the corresponding names of vertices and combine them with your result:
setNames(degree.cent$res, V(g1)$name)
# Jim Eric Mary Tim Erica Matt Loranda Beth Heather
# 4 5 2 4 1 6 2 2 2
# Patrick Courtney John
# 2 1 1
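As a cross-check that does not rely on igraph, the degrees can also be tallied straight from the edge list, since with mode = "all" a vertex's degree is just the number of times its name appears in either column. The sketch below rebuilds the edge list from the question; the names raw, clean and deg are introduced here for illustration, and trimws is used in place of the gsub call only because it strips leading and trailing whitespace without touching any interior spaces.

```r
# Edge list from the question, trailing spaces included
raw <- cbind(
  rep(c("Jim", "Eric ", "Tim", "Matt"), times = 4),
  c("Eric", "Mary ", "Erica", "Mary ", "Matt", "Loranda", "Beth", "Matt",
    "Heather ", "Patrick", "Courtney ", "Patrick",
    "Loranda ", "John", "Heather", "Beth")
)

# Untrimmed names produce spurious extra vertices
length(unique(c(raw)))

# Trim whitespace, then count appearances per name
clean <- trimws(raw)
deg <- table(c(clean))
deg
```

The Matt-Matt self-loop contributes two appearances and therefore counts twice, which is consistent with the degree of 6 reported for Matt above.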

Recode into new variable conditional on values in two other variables

I would like to be able to create a new variable based on specific values in two existing variables. My dataframe looks like:
structure(list(id = structure(c(1L, 2L, 3L, NA, NA, NA), .Label = c("blue",
"red", "yellow"), class = "factor"), value = c(-4.3, -2.5, -3.6,
NA, NA, NA)), .Names = c("id", "value"), row.names = c(NA, -6L
), class = "data.frame")
I would like to create a new column that contains only those values that pertain to blue (e.g., -4.3). All other values would result in NA, like so:
structure(list(id = structure(c(1L, 2L, 3L, NA, NA, NA), .Label = c("blue",
"red", "yellow"), class = "factor"), value = c(-4.3, -2.5, -3.6,
NA, NA, NA), newvalue = c(-4.3, NA, NA, NA, NA, NA)), .Names = c("id",
"value", "newvalue"), row.names = c(NA, -6L), class = "data.frame")
I tried the following:
b1 <- dat$id=="blue"
dat$newvalue <- dat$value[b1]
But that filled every cell in the new column with the same value (-4.3).
Due to the presence of NAs, it is tricky to assign values directly using indexing. We can use replace instead, replacing the value for any non-"blue" id with NA.
dat$newvalue <- replace(dat$value, dat$id != "blue", NA)
dat
# id value newvalue
#1 blue -4.3 -4.3
#2 red -2.5 NA
#3 yellow -3.6 NA
#4 <NA> NA NA
#5 <NA> NA NA
#6 <NA> NA NA
The equivalent ifelse statement would be:
dat$newvalue <- ifelse(dat$id != "blue", NA, dat$value)
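A variant that sidesteps the NA comparisons altogether is to test with %in%, which returns FALSE rather than NA when id is missing. This is only a sketch under the assumption that rows with a missing id should end up as NA in the new column; it rebuilds dat from the question so that it runs on its own.

```r
# Data from the question: three colours followed by three all-NA rows
dat <- data.frame(
  id = factor(c("blue", "red", "yellow", NA, NA, NA)),
  value = c(-4.3, -2.5, -3.6, NA, NA, NA)
)

# == propagates NA, %in% does not
dat$id == "blue"
dat$id %in% "blue"

# Keep value where id is blue, NA everywhere else (including NA ids)
dat$newvalue <- ifelse(dat$id %in% "blue", dat$value, NA)
dat
```

The difference from the != version only matters when a row has a missing id but a non-missing value: replace would leave that value in place, whereas the %in% test turns it into NA.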
