An sapply call returns an object of class "matrix" "array", which I have never seen before. Converting it to a data frame gives a 'wide' version of what I want, though. For example:
df<-structure(list(`04232395` = c("22.8", "0", NA, NA), `04232200` = c("39.4",
"2024", "47.55", "1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA"
), `04232185` = c(NA, "0", NA, NA), `04232165` = c(NA, "0", NA,
NA), `01515600` = c("0.93", "0", NA, NA), `01515580` = c("4.77",
"0", NA, NA), `01515600` = c("0.93", "0", NA, NA), `01515580` = c("4.77",
"0", NA, NA), `01515720` = c("4.44", "0", NA, NA), `01515790` = c("1.32",
"0", NA, NA), `01515600` = c("0.93", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `01515790` = c("1.32", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `01515790` = c("1.32", "0", NA, NA), `01515600` = c("0.93",
"0", NA, NA), `01515580` = c("4.77", "0", NA, NA), `01515720` = c("4.44",
"0", NA, NA), `04244903` = c("8.02", "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233700` = c("40.3",
"822", "2.25", NA), `0423368620` = c("29.7", "713", "1.95", NA
), `04233676` = c("20.7", "0", NA, NA), `04233633` = c("40.2",
"0", NA, NA), `04233635` = c(NA, "0", NA, NA), `04233255` = c("86.7",
"3903", "10.69", NA), `04234000` = c("126", "35786", "98.04",
NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `01513990` = c("17", "0", NA, NA), `01513910` = c("15.8",
"0", NA, NA), `01514000` = c("185", "19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513990` = c("17", "0", NA, NA), `01514000` = c("185", "19523",
"92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513840` = c("8.59", "791", "2.16", NA), `01513831` = c("4216",
"4877", "13.36", NA), `01513990` = c("17", "0", NA, NA), `01513830` = c("20.7",
"0", NA, NA), `01514880` = c("46.5", "0", NA, NA), `01514000` = c("185",
"19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513840` = c("8.59", "791", "2.16", NA), `01513831` = c("4216",
"4877", "13.36", NA), `01513830` = c("20.7", "0", NA, NA), `01514000` = c("185",
"19523", "92.41", "1978-11-07 to 2017-09-30 , 1978-10-10 to 1978-10-14 , NA"
), `01513990` = c("17", "0", NA, NA), `01513990` = c("17", "0",
NA, NA), `01513910` = c("15.8", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `01513930` = c(NA, "0",
NA, NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42", "0",
NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233255` = c("86.7",
"3903", "10.69", NA), `04234000` = c("126", "35786", "98.04",
NA), `04233310` = c("42", "0", NA, NA), `04233300` = c("39",
"10197", "27.93", NA), `04233286` = c("27", "7390", "20.24",
NA), `04233275` = c(NA, "0", NA, NA), `04233310` = c("42", "0",
NA, NA), `04233300` = c("39", "10197", "27.93", NA), `04233286` = c("27",
"7390", "20.24", NA), `04233275` = c(NA, "0", NA, NA), `04233275` = c(NA,
"0", NA, NA), `0423368620` = c("29.7", "713", "1.95", NA), `04233676` = c("20.7",
"0", NA, NA), `01513930` = c(NA, "0", NA, NA), `04233678` = c("2.73",
"456", "1.25", NA), `04233286` = c("27", "7390", "20.24", NA),
`04233700` = c("40.3", "822", "2.25", NA), `04233275` = c(NA,
"0", NA, NA), `0423368620` = c("29.7", "713", "1.95", NA),
`04233676` = c("20.7", "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA),
`04233286` = c("27", "7390", "20.24", NA), `04233275` = c(NA,
"0", NA, NA), `01513930` = c(NA, "0", NA, NA), `04233000` = c("35.2",
"27605", "75.63", NA), `04233255` = c("86.7", "3903", "10.69",
NA), `04234000` = c("126", "35786", "98.04", NA), `04233310` = c("42",
"0", NA, NA), `04233300` = c("39", "10197", "27.93", NA),
`04233700` = c("40.3", "822", "2.25", NA), `0423368620` = c("29.7",
"713", "1.95", NA), `04233676` = c("20.7", "0", NA, NA),
`04233678` = c("2.73", "456", "1.25", NA), `04233700` = c("40.3",
"822", "2.25", NA), `0423368620` = c("29.7", "713", "1.95",
NA), `04233676` = c("20.7", "0", NA, NA), `04233633` = c("40.2",
"0", NA, NA), `04233635` = c(NA, "0", NA, NA), `04233678` = c("2.73",
"456", "1.25", NA), `04233700` = c("40.3", "822", "2.25",
NA), `0423368620` = c("29.7", "713", "1.95", NA), `04233676` = c("20.7",
"0", NA, NA), `04233633` = c("40.2", "0", NA, NA), `04233635` = c(NA,
"0", NA, NA), `04244985` = c("55.5", "0", NA, NA), `04244990` = c("18.8",
"0", NA, NA), `04244903` = c("8.02", "0", NA, NA), `04244000` = c("66.3",
"9626", "72.4", "1968-09-29 to 2014-09-30 , NA"), `04243400` = c(NA,
"0", NA, NA), `04243390` = c("24.9", "0", NA, NA), `01338800` = c("43.6",
"0", NA, NA), `04243400` = c(NA, "0", NA, NA), `04243390` = c("24.9",
"0", NA, NA), `01338000` = c("144", "3042", "8.33", NA),
`01339060` = c("59.8", "3056", "8.37", NA), `01338800` = c("43.6",
"0", NA, NA), `01503970` = c(NA, "0", NA, NA), `01503980` = c("24.3",
"0", NA, NA), `04243390` = c("24.9", "0", NA, NA), `01503970` = c(NA,
"0", NA, NA), `01503980` = c("24.3", "0", NA, NA)), row.names = c(NA,
-4L), class = "data.frame")
I would like the column names of this data frame to end up in a single column (one row per site), and the 4 rows of each old column to become 4 individual columns in the new data frame. I tried:
df_new<-data.frame(site = colnames(df), DA = df[1,], Q = df[2,], POR = df[3,], Q_gaps = df[4,])
but this is not right.
I have also tried
df_new<-df %>%
pivot_longer(everything())%>%
group_by(name)%>%
mutate(DA = slice(value,1), Q = slice(value,2), POR = slice(value,3), Q_gaps = slice(value,4))
but this resulted in an error.
Searching for this question does not return what I am looking for, so this may be duplicated elsewhere, I just don't know how to ask it.
EDIT:
This works and gives the desired output, but is there a more elegant solution?
df_new<-data.frame(site = colnames(df), DA = as.numeric(df[1,]), Q = as.numeric(df[2,]), POR = as.numeric(df[3,]), Q_gaps = as.character(df[4,]))
Edited for the new data (adds setNames/make.unique up front).
library(dplyr)
library(tidyr)
setNames(df, make.unique(names(df))) %>%
mutate(nm = c("DA", "Q", "POR", "Q_gaps")) %>%
pivot_longer(-nm, names_to = "site") %>%
pivot_wider(id_cols = site, names_from = "nm", values_from = "value") %>%
mutate(across(-site, ~ type.convert(., as.is=TRUE)), site = sub("\\.[0-9]+$", "", site))
# # A tibble: 132 × 5
# site DA Q POR Q_gaps
# <chr> <dbl> <int> <dbl> <chr>
# 1 04232395 22.8 0 NA NA
# 2 04232200 39.4 2024 47.6 1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA
# 3 04232185 NA 0 NA NA
# 4 04232165 NA 0 NA NA
# 5 01515600 0.93 0 NA NA
# 6 01515580 4.77 0 NA NA
# 7 01515600 0.93 0 NA NA
# 8 01515580 4.77 0 NA NA
# 9 01515720 4.44 0 NA NA
# 10 01515790 1.32 0 NA NA
# # … with 122 more rows
# # ℹ Use `print(n = ...)` to see more rows
The site = sub(..) step removes the .1, .2, etc. suffixes that make.unique appends to duplicated column names; this should be safe as long as your real site strings never end in a period followed by one or more digits.
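To see what that round trip does in isolation, here is a small base R illustration. The vector `nms` is a hypothetical stand-in for names(df), with values borrowed from the data above:

```r
# Hypothetical site IDs with duplicates, standing in for names(df)
nms <- c("01515600", "01515580", "01515600", "01515600")

# make.unique() appends .1, .2, ... to the second and later copies
u <- make.unique(nms)
u
# [1] "01515600"   "01515580"   "01515600.1" "01515600.2"

# Stripping a trailing ".<digits>" recovers the original IDs
restored <- sub("\\.[0-9]+$", "", u)
identical(restored, nms)
# [1] TRUE
```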
Or we can use akrun's suggestion:
as.data.frame(t(setNames(df, make.unique(names(df))))) %>%
setNames(c("DA","Q","POR","Q_gaps")) %>%
tibble::rownames_to_column("site") %>%
mutate(
across(-site, ~ type.convert(., as.is = TRUE)),
site = sub("\\.[0-9]+$", "", site)
) %>%
head()
# site DA Q POR Q_gaps
# 1 04232395 22.80 0 NA <NA>
# 2 04232200 39.40 2024 47.55 1977-09-30 to 2019-03-27 , 1976-11-30 to 1977-03-01 , 1975-11-30 to 1976-03-01 , NA
# 3 04232185 NA 0 NA <NA>
# 4 04232165 NA 0 NA <NA>
# 5 01515600 0.93 0 NA <NA>
# 6 01515580 4.77 0 NA <NA>
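If the rename-then-strip round trip feels roundabout, a base R sketch along these lines may avoid it by pulling the site names off the transposed matrix before building the data frame. The names `toy`, `m`, `sites` and `out` are hypothetical, and the toy data only mimics the shape of df (duplicate column names, four character rows):

```r
# Toy data (hypothetical): one column per site, four rows, one duplicated name
toy <- data.frame(
  `04232395` = c("22.8", "0", NA, NA),
  `04232200` = c("39.4", "2024", "47.55", "1977-09-30 to 2019-03-27"),
  `04232395` = c("22.8", "0", NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Transpose to a character matrix: one row per site
m <- t(as.matrix(toy))

# Keep the (possibly duplicated) site names aside and drop them from the matrix,
# so no unique row names are ever required
sites <- rownames(m)
rownames(m) <- NULL

out <- as.data.frame(m, stringsAsFactors = FALSE)
names(out) <- c("DA", "Q", "POR", "Q_gaps")

# Let R guess a sensible type for each column, keeping strings as character
out[] <- lapply(out, type.convert, as.is = TRUE)
out <- data.frame(site = sites, out, stringsAsFactors = FALSE)

str(out)
```

Because the site names never become row names, duplicates need no special handling here; whether that is preferable to the pipelines above is a matter of taste.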
I have a dataframe which looks like this:
id age1 sex1 age2 sex2 age3 sex3 age4 sex4
1 5 20 <NA> NA <NA> NA <NA> 27 Female
2 25 NA <NA> NA <NA> NA <NA> 35 Female
3 65 NA <NA> NA <NA> NA <NA> NA <NA>
this is the code for the data:
temp <- structure(list(id = c(5L, 25L, 65L, 25L, 65L, 5L, 5L, 85L, 285L,
541L), age1 = c(20L, NA, NA, NA, NA, NA, NA, NA, NA, NA), sex1 = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent", "Male",
"Female"), class = "factor"), age2 = c(NA, NA, NA, NA, 31L,
NA, NA, NA, NA, NA), sex2 = structure(c(NA, NA, NA, NA, 7L,
NA, NA, NA, NA, NA), .Label = c("missing", "inapplicable", "refusal",
"don't know", "inconsistent", "Male", "Female"), class = "factor"),
age3 = c(NA, NA, NA, NA, 32L, NA, NA, NA, 25L, 23L), sex3 = structure(c(NA,
NA, NA, NA, 7L, NA, NA, NA, 6L, 7L), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent",
"Male", "Female"), class = "factor"), age4 = c(27L, 35L,
NA, NA, 33L, NA, 24L, NA, 26L, NA), sex4 = structure(c(7L,
7L, NA, NA, 7L, NA, 7L, NA, 6L, NA), .Label = c("missing",
"inapplicable", "refusal", "don't know", "inconsistent",
"Male", "Female"), class = "factor")), row.names = c(NA,
10L), class = "data.frame")
I would like to know how to make multiple subsets of the data based on the columns.
I know I could do this by using the codes:
Subset1<- temp[,1:3]
Subset2<-temp[,c(1,4:5)]
Subset3<- temp[,c(1,6:7)]
But there must be a more concise way to do this. I've tried a for loop, but I'm new to R and don't know how to do this, including how to keep the names of the new subsets consistent.
We can use split.default to split the columns by the number in their names, and then prepend the first column (id) to each piece of the resulting list.
new_list <- lapply(split.default(temp[-1], gsub("\\D", "", names(temp)[-1])),
function(x) cbind(temp[1], x))
new_list
#$`1`
# id age1 sex1
#1 5 20 <NA>
#2 25 NA <NA>
#3 65 NA <NA>
#4 25 NA <NA>
#5 65 NA <NA>
#6 5 NA <NA>
#7 5 NA <NA>
#8 85 NA <NA>
#9 285 NA <NA>
#10 541 NA <NA>
#$`2`
# id age2 sex2
#1 5 NA <NA>
#...
This returns a list of data frames. If you want them as separate data frames in the global environment, we can do:
names(new_list) <- paste0('Subset', seq_along(new_list))
list2env(new_list, .GlobalEnv)
Here is another base R solution
ind <- 1:4
list2env(setNames(lapply(ind, function(k) subset(temp,select = c(1,2*k+(0:1)))),
paste0("Subset",ind)),
envir = .GlobalEnv)
which combines subset and lapply; for each k, 2*k + (0:1) selects the k-th age/sex column pair.
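As a self-contained illustration of the split-by-suffix idea, here is a sketch on made-up data. `toy`, `suffix` and `subsets` are hypothetical names. It labels each piece by the digits found in the column names rather than by position, which may be safer if the suffixes are not simply 1, 2, 3, ...:

```r
# Toy data (hypothetical), mimicking the id / ageN / sexN layout of the question
toy <- data.frame(
  id   = c(5L, 25L, 65L),
  age1 = c(20L, NA, NA),  sex1 = c(NA, NA, "Male"),
  age2 = c(NA, 31L, NA),  sex2 = c(NA, "Female", NA),
  stringsAsFactors = FALSE
)

# Digits in each non-id column name: "1" "1" "2" "2"
suffix <- gsub("\\D", "", names(toy)[-1])

# Split the columns by suffix and put id in front of every piece
subsets <- lapply(split.default(toy[-1], suffix), function(x) cbind(toy[1], x))

# Name each piece after its own suffix
names(subsets) <- paste0("Subset", names(subsets))

names(subsets)
names(subsets$Subset2)
```

Keeping the pieces in a named list (subsets$Subset2, subsets[["Subset1"]]) is often easier to work with than exporting them with list2env, since the list can still be looped over with lapply.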
Currently I'm working with some Fastq sequencing data. I have a dataframe with three columns and hundreds of rows. The first column contains the raw sequencing reads and the others contain information about those reads. I want to return a row with the string "FALSE" in the 3rd column, plus the row directly above this, and two rows directly below it. I think it is similar to grep -A -B in shell.
I've looked around and my question is very similar to this one:
Returning above and below rows of specific rows in r dataframe
However, the answers here are based on row-names and not strings within the rows. My row names are just numbers in numerical order.
Fastq Output BARCODE Dulplicated
1 ReadName1 NA NA
2 ReadSeq1 TGTG TTAT FALSE
3 + NA NA
4 Ascii_score1 NA NA
5 ReadName2 NA NA
6 ReadSeq2 TGCT TTAT FALSE
7 + NA NA
8 Ascii_score2 NA NA
9 ReadName3 NA NA
10 ReadSeq3 TGCT TTAT TRUE
11 + NA NA
12 Ascii_score3 NA NA
If the Dulplicated column holds character values (the comparison below also works when it is logical), you can do
inds <- which(df$Dulplicated == "FALSE")
df[sort(unique(c(inds, inds - 1, inds + 1, inds + 2))), ]
# FastqOutput BARCODE Dulplicated
#1 ReadName1 <NA> NA
#2 ReadSeq1 TGTGTTAT FALSE
#3 + <NA> NA
#4 Ascii_score1 <NA> NA
#5 ReadName2 <NA> NA
#6 ReadSeq2 TGCTTTAT FALSE
#7 + <NA> NA
#8 Ascii_score2 <NA> NA
Or similarly using dplyr::slice
library(dplyr)
df %>% slice(sort(unique(c(inds, inds - 1, inds + 1, inds + 2))))
data
df <- structure(list(FastqOutput = structure(c(5L, 8L, 1L, 2L, 6L,
9L, 1L, 3L, 7L, 10L, 1L, 4L), .Label = c("+", "Ascii_score1",
"Ascii_score2", "Ascii_score3", "ReadName1", "ReadName2", "ReadName3",
"ReadSeq1", "ReadSeq2", "ReadSeq3"), class = "factor"), BARCODE =
structure(c(NA, 2L, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA), .Label = c("TGCTTTAT",
"TGTGTTAT"), class = "factor"), Dulplicated = c(NA, FALSE, NA,
NA, NA, FALSE, NA, NA, NA, TRUE, NA, NA)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
We can use data.table
library(data.table)
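setDT(df)[df[, {i1 <- .I[which(!as.logical(Dulplicated))]
# one row above through two rows below each hit, clipped to valid row numbers
i2 <- sort(unique(rep(i1, each = 4L) + (-1:2)))
i2[i2 >= 1L & i2 <= .N] }]]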
# FastqOutput BARCODE Dulplicated
#1: ReadName1 <NA> NA
#2: ReadSeq1 TGTGTTAT FALSE
#3: + <NA> NA
#4: Ascii_score1 <NA> NA
#5: ReadName2 <NA> NA
#6: ReadSeq2 TGCTTTAT FALSE
#7: + <NA> NA
#8: Ascii_score2 <NA> NA
Or it can be written more compactly (n = -1:2 covers one row above through two rows below each hit):
setDT(df)[df[, Reduce(`|`, shift(!as.logical(Dulplicated), n = -1:2))]]
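The index arithmetic used in these answers can run past the first or last row when a hit sits near the edge of the data. A minimal base R sketch of a reusable helper that clips to valid rows is shown below. `context_rows` and its `before`/`after` arguments are hypothetical names chosen to echo grep -B/-A, and `toy` is made-up data:

```r
# Return the rows where `hit` is TRUE, plus `before` rows above and `after`
# rows below each of them (similar in spirit to `grep -B before -A after`).
context_rows <- function(data, hit, before = 1L, after = 2L) {
  idx <- which(hit)                                  # which() ignores NA
  if (length(idx) == 0L) return(data[0, , drop = FALSE])
  keep <- unlist(lapply(idx, function(i) seq(i - before, i + after)))
  keep <- sort(unique(keep[keep >= 1L & keep <= nrow(data)]))  # clip to valid rows
  data[keep, , drop = FALSE]
}

# Toy data (hypothetical): one FALSE flag on row 2, one TRUE flag on row 7
toy <- data.frame(
  read = paste0("r", 1:8),
  dup  = c(NA, FALSE, NA, NA, NA, NA, TRUE, NA),
  stringsAsFactors = FALSE
)

res <- context_rows(toy, !toy$dup)
res$read
# [1] "r1" "r2" "r3" "r4"
```

With the question's data, the equivalent call would presumably be context_rows(df, !as.logical(df$Dulplicated)).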
I'm running the igraph package for some network analysis on this example dataset
structure(list(ï..Column1 = c(NA, NA, NA, NA), Column2 = c(NA,
NA, NA, NA), Column3 = c(NA, NA, NA, NA), Column4 = c(NA, NA,
NA, NA), Column5 = structure(c(2L, 1L, 4L, 3L), .Label = c("Eric ",
"Jim", "Matt", "Tim"), class = "factor"), Column6 = c(NA, NA,
NA, NA), Column7 = structure(c(1L, 3L, 2L, 3L), .Label = c("Eric",
"Erica", "Mary "), class = "factor"), Column8 = structure(c(3L,
2L, 1L, 3L), .Label = c("Beth", "Loranda", "Matt"), class = "factor"),
Column9 = structure(c(2L, 3L, 1L, 3L), .Label = c("Courtney ",
"Heather ", "Patrick"), class = "factor"), Column10 = structure(4:1, .Label = c("Beth",
"Heather", "John", "Loranda "), class = "factor"), Column11 = c(NA,
NA, NA, NA), Column12 = c(NA, NA, NA, NA), Column13 = c(NA,
NA, NA, NA), Column14 = c(NA, NA, NA, NA), Column15 = c(NA,
NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))
Here is the edgelist for anyone who wants to skip the step of finding that
structure(c("Jim", "Eric ", "Tim", "Matt", "Jim", "Eric ", "Tim",
"Matt", "Jim", "Eric ", "Tim", "Matt", "Jim", "Eric ", "Tim",
"Matt", "Eric", "Mary ", "Erica", "Mary ", "Matt", "Loranda",
"Beth", "Matt", "Heather ", "Patrick", "Courtney ", "Patrick",
"Loranda ", "John", "Heather", "Beth"), .Dim = c(16L, 2L), .Dimnames = list(
NULL, c("Column5", "value")))
I'm trying to calculate centrality for each of the nodes in the network using this code (mat is my edgelist matrix)
g1=graph_from_edgelist(mat)
degree.cent <- centr_degree(g1, mode = "all")
degree.cent
My output is something like this
> degree.cent
$`res`
[1] 4 1 4 2 4 1 6 1 2 1 2 1 1 1 1
$centralization
[1] 0.1479592
$theoretical_max
[1] 392
I know degree.cent$res holds my centrality scores, but it isn't clear to me which node receives which score. I looked up a tutorial here, but all it says is that the first score is "node 1". There's no indication of what node 1 is, or an easy way to identify it.
Firstly, you are getting incorrect results because some of the names contain trailing spaces ("Eric ", "Mary ", "Heather ", ...), so the same person is treated as two different vertices. So, let
mat <- gsub(" ", "", mat)
g1 <- graph_from_edgelist(mat)
degree.cent <- centr_degree(g1, mode = "all")
Now we may extract the corresponding names of vertices and combine them with your result:
setNames(degree.cent$res, V(g1)$name)
# Jim Eric Mary Tim Erica Matt Loranda Beth Heather
# 4 5 2 4 1 6 2 2 2
# Patrick Courtney John
# 2 1 1
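As a possible shortcut, igraph's degree() returns a vector that is already named by vertex when the graph has a name attribute, so the scores and labels stay together. The sketch below uses a small made-up edgelist (`el`, `g` and `deg` are hypothetical names) and trimws() to strip the stray spaces:

```r
library(igraph)

# Toy edgelist (hypothetical); some names carry trailing spaces on purpose
el <- matrix(c("Jim",  "Eric ",
               "Eric", "Mary ",
               "Jim",  "Matt",
               "Matt", "Mary",
               "Jim",  "Mary"),
             ncol = 2, byrow = TRUE)

# Strip leading/trailing whitespace; el[] <- keeps the matrix dimensions
el[] <- trimws(el)

g <- graph_from_edgelist(el)

# Per-vertex degree, returned as a named numeric vector
deg <- degree(g, mode = "all")
sort(deg, decreasing = TRUE)
```

Note that centr_degree() is mainly about the graph-level centralization score; if only the per-node values are needed, degree() alone may be enough.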