I have a dataframe that is structured like below, where A/B/C/D are different treatment methods:
input <- read.table(text="
filename wavelength A B C D
file1 w1 NA NA 1 2
file1 w2 NA NA 3 2
file1 w3 NA NA 6 2
file2 w1 3 4 NA NA
file2 w2 4 8 NA NA
file2 w3 6 1 NA NA", header=TRUE)
And I would like for it to be transposed so that wavelength is the header and treatments are rows with the filenames duplicated each time:
desired <- read.table(text="
filename Method w1 w2 w3
file1 C 1 3 6
file1 D 2 2 2
file2 A 3 4 6
file2 B 4 8 1", header=TRUE)
I've tried melt/cast from reshape2, melt from the data.table package, gather/spread, t - everything I can think of. The actual data frame in the end will be about 500 rows by 3500 columns - so I would prefer not to call out any specific column or method names. My issue seems mainly to be that I can't call all method columns under one value and use it to melt:
colMethods <- myData[, 2:length(myData)]
A lot of times I don't get an error, but the dataframe R returns is just a list of wavelengths and a column that says 'wavelength'. How would any of you approach this? Thanks!
You can try this:
library(tidyverse)
#Data
df <- structure(list(filename = c("file1", "file1", "file1", "file2",
"file2", "file2"), wavelength = c("w1", "w2", "w3", "w1", "w2",
"w3"), A = c(NA, NA, NA, 3L, 4L, 6L), B = c(NA, NA, NA, 4L, 8L,
1L), C = c(1L, 3L, 6L, NA, NA, NA), D = c(2L, 2L, 2L, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -6L))
Code:
df %>% pivot_longer(cols = -c(1,2)) %>% filter(!is.na(value)) %>%
pivot_wider(names_from = wavelength,values_from = value)
Output:
# A tibble: 4 x 5
filename name w1 w2 w3
<chr> <chr> <int> <int> <int>
1 file1 C 1 3 6
2 file1 D 2 2 2
3 file2 A 3 4 6
4 file2 B 4 8 1
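If the reshaped column should be called Method, as in the desired output, pivot_longer can name it and drop the NA rows in one step. This is only a minimal sketch: it assumes the question's input data frame and that every column other than filename and wavelength is a treatment, so no method names need to be spelled out.

```r
library(tidyr)

# Sample data copied from the question
input <- read.table(text = "
filename wavelength A B C D
file1 w1 NA NA 1 2
file1 w2 NA NA 3 2
file1 w3 NA NA 6 2
file2 w1 3 4 NA NA
file2 w2 4 8 NA NA
file2 w3 6 1 NA NA", header = TRUE)

# Stack every treatment column (everything except the two id columns),
# dropping the NA cells while reshaping
long <- pivot_longer(input,
                     cols = -c(filename, wavelength),
                     names_to = "Method",
                     values_to = "value",
                     values_drop_na = TRUE)

# Spread the wavelengths back out into columns
wide <- pivot_wider(long, names_from = wavelength, values_from = value)
wide
```

Because the treatment columns are selected by exclusion, the same call should carry over to the full-size data without listing any of the method columns.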
Here is a data.table alternative using melt and dcast:
library(data.table)
dcast(melt(setDT(input), id.vars = 1:2, na.rm = TRUE),
variable+filename~wavelength, value.var = 'value')
# variable filename w1 w2 w3
#1: A file2 3 4 6
#2: B file2 4 8 1
#3: C file1 1 3 6
#4: D file1 2 2 2
Help with applying functions over a list of data frames.
I don't often work with lists or functions, so after 3 hours of searching and testing I need some assistance.
I have a list of 2 data frames as follows (real list has 40+):
df1 <- structure(list(ID = 1:4,
Period = c("C_2021", "C_2021", "C_2021", "C_2021"),
subjects = c(2044L, 2044L, 2058L, 2059L),
Q_1_A = c(1L, 1L, 4L, 6L),
Q_1_B = c(6L, 1L, 6L, NA),
col3 = c(4L, 6L, 5L, 2L),
col4 = c(3L, 5L, 4L, 4L)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(ID = 1:4,
Period = c("C_2022", "C_2022", "C_2022", "C_2022"),
subjects = c(2058L, 2058L, 2065L, 2066L),
Q_1_A = c(2L, 5L, 5L, 6L),
Q_1_B = c(6L, 1L, 4L, NA),
col3 = c(NA, 6L, 5L, 3L),
col4 = c(3L, 6L, 5L, 5L)),
class = "data.frame", row.names = c(NA, -4L))
The structure of the datasets is as follows:
df1
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2021 2044 1 6 4 3
2 2 C_2021 2044 1 1 6 5
3 3 C_2021 2058 4 6 5 4
4 4 C_2021 2059 6 NA 2 4
df2
ID Period subjects Q_1_A Q_1_B col3 col4
1 1 C_2022 2058 2 6 NA 3
2 2 C_2022 2058 5 1 6 6
3 3 C_2022 2065 5 4 5 5
4 4 C_2022 2066 6 NA 3 5
The list of df's
dflist <- list(df1, df2)
I would like to do 2 things:
1. Conditional removal of string before 2nd underscore
I would like to remove characters before the 2nd underscore only in columns beginning with "Q". Column "Q_1_A" would become "A". The code should only impact columns starting with "Q".
Note: The ifelse is important - in the real data there are other columns with 2 underscores that cannot be modified, and the columns in data frames may be in different orders so it needs to be done by column name.
# doesn't work (can't seem to get purrr working either)
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(starts_with(names(x), "Q"), sub("^[^_]*_", "", names(x)), .x)
x})
2. Once column names are updated, remove columns present on a list.
Note: In the real data there are a lot of columns in each df, it's much easier to list the columns to keep rather than remove.
List of columns to keep below
The list is structured assuming the renaming above has been completed.
col_keep <- c("ID", "Period", "subjects", "A", "B")
#doesnt work
dflist <- lapply(dflist, function(x) {
x[(names(x) %in% col_keep)]
x})
**UPDATE** I think actually the following will work
dflist <- lapply(dflist, function(x)
{x <- x %>% select(any_of(col_keep))})
#is the best way to do it?
Help would be greatly appreciated.
For the first requirement, apply this:
dflist <- lapply(dflist, function(x) {
names(x) <- ifelse(startsWith(names(x), "Q"),
gsub("[Q_0-9]+", "" , names(x)), names(x))
x})
and for the second:
col_keep <- c("ID", "Period", "subjects", "A", "B")
dflist <- lapply(dflist, function(x) subset(x , select = col_keep))
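One caveat with the character-class pattern: it removes every Q, underscore and digit anywhere in a matching name, not just the part before the second underscore. The sketch below compares it with an anchored pattern on a few made-up column names (the names in nms are hypothetical, chosen only to show the difference).

```r
# Hypothetical column names, including one whose suffix itself contains Q and digits
nms <- c("ID", "Q_1_A", "Q_2_Q10", "col_3_x")
is_q <- startsWith(nms, "Q")

# Character class: strips all Q, "_" and digit characters from Q-columns
loose <- ifelse(is_q, gsub("[Q_0-9]+", "", nms), nms)

# Anchored: strips only the text up to and including the second underscore
anchored <- ifelse(is_q, sub("^([^_]*_){2}", "", nms), nms)

loose     # "ID" "A" ""    "col_3_x"
anchored  # "ID" "A" "Q10" "col_3_x"
```

For names like Q_1_A both give the same result, so the choice only matters if the part after the second underscore can itself contain Q, digits or underscores.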
In base R:
lapply(dflist, \(x)setNames(x, sub('^Q([^_]*_){2}', '', names(x)))[col_keep])
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
In tidyverse:
library(tidyverse)
dflist %>%
map(~rename_with(.,~str_remove(.,'([^_]+_){2}'), starts_with('Q'))%>%
select(all_of(col_keep)))
[[1]]
ID Period subjects A B
1 1 C_2021 2044 1 6
2 2 C_2021 2044 1 1
3 3 C_2021 2058 4 6
4 4 C_2021 2059 6 NA
[[2]]
ID Period subjects A B
1 1 C_2022 2058 2 6
2 2 C_2022 2058 5 1
3 3 C_2022 2065 5 4
4 4 C_2022 2066 6 NA
Another solution using base R:
# wrap up code for ease of reading
validate_names <- function(df) {
setNames(df, ifelse(grepl("^Q", names(df)),
gsub("[Q_0-9]", "", names(df)), names(df)))
}
# lapply to transform list, then subset with character vector
lapply(dflist, validate_names) |>
lapply(`[`, col_keep)
I have a large dataframe with approximately this pattern:
Person Rate Street a    b    c    d    e    f
A      2    XYZ    1    NULL 3    4    5    NULL
A      2    XYZ    NULL 2    NULL NULL NULL NULL
A      3    XYZ    NULL NULL NULL NULL NULL 6
B      2    DEF    NULL NULL NULL NULL 5    NULL
B      2    DEF    NULL 2    3    NULL NULL 6
C      1    DEF    1    2    3    4    5    6
a, b, c, d, e, f represent about 600 columns.
I am trying to collapse the data so that each person becomes one row: columns a-f are combined into a single row using sum, and any conflicting rate or street information goes into a new column. So the data should look something like this:
Person Rate Rate 2 Street a    b c d    e f
A      2    3      XYZ    1    2 3 4    5 6
B      2           DEF    NULL 2 3 NULL 5 6
C      1           DEF    1    2 3 4    5 6
I keep trying to make this work with aggregate and summarize but I'm not sure that's the right approach.
Thank you very much for your help!
First we pivot all the unique rates per person and street.
library(reshape2)
tmp1=dcast(unique(df[,c("Person","Rate","Street")]),Person+Street~Rate,value.var="Rate")
colnames(tmp1)[-c(1:2)]=paste("Rate",colnames(tmp1)[-c(1:2)])
Then we aggregate and sum by person and street. Columns 4 to 9 are "a" to "f" here; change the range to match your data.
tmp2=aggregate(df[,4:9],list(Person=df$Person,Street=df$Street),function(x){
ifelse(all(is.na(x)),NA,sum(x,na.rm=T))
})
And finally merge the two.
merge(tmp1,tmp2,by=c("Person","Street"))
Person Street Rate 1 Rate 2 Rate 3 a b c d e f
1 A XYZ NA 2 3 1 2 3 4 5 6
2 B DEF NA 2 NA NA 2 3 NA 5 6
3 C DEF 1 NA NA 1 2 3 4 5 6
Perhaps you can do this as a two-step process:
library(dplyr)
library(tidyr)
#sum columns a-f
table1 <- df %>%
group_by(Person) %>%
summarise(across(a:f, ~sum(.x, na.rm = TRUE)))
#Remove duplicated values and get the data in separate columns
#for Rate and Street columns.
table2 <- df %>%
group_by(Person) %>%
mutate(across(c(Rate, Street), ~replace(., duplicated(.), NA))) %>%
select(Person, Rate, Street) %>%
filter(if_any(c(Rate, Street), ~!is.na(.))) %>%
mutate(col = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col, values_from = c(Rate, Street)) %>%
select(where(~any(!is.na(.))))
#Join the two data to get final result
inner_join(table1, table2, by = 'Person')
# Person a b c d e f Rate_1 Rate_2 Street_1
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 A 1 2 3 4 5 6 2 3 XYZ
#2 B 0 2 3 0 5 6 2 NA DEF
#3 C 1 2 3 4 5 6 1 NA DEF
data
It is helpful and easier to help when you share data in a reproducible format which can be copied directly. I have used the below data for the answer.
df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")
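Note that sum(..., na.rm = TRUE) returns 0 for a group with no observed values, which is why B shows 0 in columns a and d above, whereas the question's desired output keeps those cells empty. A minimal sketch of one way to keep NA instead, using a small helper (sum_or_na is a name made up here, not part of dplyr):

```r
library(dplyr)

df <- structure(list(Person = c("A", "A", "A", "B", "B", "C"), Rate = c(2L,
2L, 3L, 2L, 2L, 1L), Street = c("XYZ", "XYZ", "XYZ", "DEF", "DEF",
"DEF"), a = c(1L, NA, NA, NA, NA, 1L), b = c(NA, 2L, NA, NA,
2L, 2L), c = c(3L, NA, NA, NA, 3L, 3L), d = c(4L, NA, NA, NA,
NA, 4L), e = c(5L, NA, NA, 5L, NA, 5L), f = c(NA, NA, 6L, NA,
6L, 6L)), row.names = c(NA, -6L), class = "data.frame")

# Hypothetical helper: NA when a group has no observed values, otherwise the sum
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

table1_na <- df %>%
  group_by(Person) %>%
  summarise(across(a:f, sum_or_na))
table1_na
```

The result can be joined to table2 in the same way as table1.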
Below is the output of one of the columns in a data frame I created by importing a weekly summary.csv.
The values are unique codes, and each code should be exactly 4 digits long, e.g. 8400, 9070, etc.
When the summary document is produced, all the codes are run together without commas or any other separator,
like below:
1 84709070
2 75508470
3 8400
3 750084009100
Is there a way I can turn the above into 4 new columns that split each number into 4-digit chunks, starting from the first digit? For example, the output for the fourth row would look like:
tariff1, tariff2, tariff3, tariff4
7500 8400 9100 none
I managed to create an abomination in Excel, but it hardly works at the best of times and I'd prefer to use R for everything. We are getting about 30k of these entries a week, so this would really streamline our processes!
You can use tidyr::separate mentioning the positions where you want to split in sep.
tidyr::separate(df, V2, paste0('col', 1:4), sep = seq(4, 12, 4), convert = TRUE)
# V1 col1 col2 col3 col4
#1 1 8470 9070 NA NA
#2 2 7550 8470 NA NA
#3 3 8400 NA NA NA
#4 3 7500 8400 9100 NA
seq generates the sequence of positions.
seq(4, 12, 4)
#[1] 4 8 12
data
df <- structure(list(V1 = c(1L, 2L, 3L, 3L), V2 = c(84709070, 75508470,
8400, 750084009100)), class = "data.frame", row.names = c(NA, -4L))
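If the maximum number of codes per entry is not known in advance, the split positions can be derived from the longest string instead of being hard-coded. The following is a base R sketch under two assumptions: the codes are held as character strings, and every code is exactly 4 digits wide. The names codes, split_codes and tariffs are made up for illustration.

```r
# Sample values from the question, stored as character
codes <- c("84709070", "75508470", "8400", "750084009100")

width  <- 4
n_max  <- ceiling(max(nchar(codes)) / width)       # most codes in any entry
starts <- seq(1, by = width, length.out = n_max)   # start position of each chunk

# Cut one string into n_max fixed-width chunks; missing chunks become NA
split_codes <- function(x) {
  parts <- substring(x, starts, starts + width - 1)
  parts[parts == ""] <- NA
  as.integer(parts)
}

# One row per entry, one column per chunk
tariffs <- do.call(rbind, lapply(codes, split_codes))
colnames(tariffs) <- paste0("tariff", seq_len(n_max))
tariffs
```

With the sample data this yields three tariff columns, because the longest entry holds three codes. Converting to integer drops any leading zeros, so the as.integer call should be removed if codes such as 0840 can occur.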
Here is a base R option, which defines a function f to split the numbers
f <- function(x) t(`length<-`(as.numeric(sapply(seq(1,nchar(x),by = 4), function(k) substr(x,k,k+3))),4))
dfout <- cbind(df, data.frame(t(Vectorize(f)(df$V2))))
such that
  V1           V2   X1   X2   X3 X4
1  1     84709070 8470 9070   NA NA
2  2     75508470 7550 8470   NA NA
3  3         8400 8400   NA   NA NA
4  3 750084009100 7500 8400 9100 NA
Data
> dput(df)
structure(list(V1 = c(1L, 2L, 3L, 3L), V2 = c(84709070, 75508470,
8400, 750084009100)), class = "data.frame", row.names = c(NA,
-4L))
An option with strsplit from base R
lst1 <- strsplit(as.character(df$V2), "(?<=....)", perl = TRUE)
df[paste0('col', 1:4)] <- do.call(rbind, lapply(lst1,
`length<-`, max(lengths(lst1))+1))
df <- type.convert(df, as.is = TRUE)
-output
df
# V1 V2 col1 col2 col3 col4
#1 1 84709070 8470 9070 NA NA
#2 2 75508470 7550 8470 NA NA
#3 3 8400 8400 NA NA NA
#4 3 750084009100 7500 8400 9100 NA
Or using read.fwf from base R
df[paste0('col', 1:4)] <- read.fwf(file = textConnection(as.character(df$V2)),
widths = c(4, 4, 4, 4))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 3L), V2 = c(84709070, 75508470,
8400, 750084009100)), class = "data.frame", row.names = c(NA,
-4L))
I am having an issue with rearranging some data.
The original data is:
structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a",
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b",
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c",
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"),
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L,
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA,
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3",
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3",
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA,
-3L))
# id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1 1 a b c <NA> <NA> 3 2 3 NA NA
# 2 2 a <NA> <NA> <NA> <NA> 2 NA NA NA NA
# 3 3 b c d e f 1 2 3 4 5
I would like to get to this:
structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L),
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L),
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d",
"e", "f"), class = "data.frame", row.names = c(NA, -3L))
# id a b c d e f
# 1 1 3 2 3 NA NA NA
# 2 2 2 NA NA NA NA NA
# 3 3 NA 1 2 3 4 5
Basically, a to f represents arteries and the numerical values represent the corresponding diameter. Each row represents a patient.
Is there a neat way to sort this dataframe out?
Modern tidyr makes the solution even more succinct via the pivot_ functions:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-id, names_pattern = '(artery|diameter)\\.(\\d+)', names_to = c('.value', NA)) %>%
filter(!is.na(artery)) %>%
pivot_wider(names_from = artery, values_from = diameter)
id a b c d e f
<int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Here is the older solution, which uses the superseded gather and spread functions:
library(dplyr)
library(tidyr)
new.df <- gather(df, variable, value, artery.1:diameter.5) %>%
separate(variable, c('variable', 'num')) %>%
spread(variable, value) %>%
subset(!is.na(artery)) %>%
mutate(diameter = as.numeric(diameter)) %>%
select(-num) %>%
spread(artery, diameter)
Output:
id a b c d e f
1 1 3 2 3 NA NA NA
2 2 2 NA NA NA NA NA
3 3 NA 1 2 3 4 5
Or using melt/dcast combination with data.table while selecting variables using regex in the patterns function
library(data.table) #v>=1.9.6
dcast(melt(setDT(df),
id = "id",
measure = patterns("artery", "diameter")),
id ~ value1,
sum,
value.var = "value2",
subset = .(!is.na(value2)),
fill = NA)
# id a b c d e f
# 1: 1 3 2 3 NA NA NA
# 2: 2 2 NA NA NA NA NA
# 3: 3 NA 1 2 3 4 5
As you can see, both melt and dcast are very flexible and you can use regex, specify a subset, pass multiple functions and specify how you want to fill missing values.
You can use xtabs with reshape from base R. Use the latter to transform the data to long format and the former to cross-tabulate the diameters by id and artery. Note that combinations that do not occur show up as 0 rather than NA:
xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))
# artery
#id a b c d e f
# 1 3 2 3 0 0 0
# 2 2 0 0 0 0 0
# 3 0 1 2 3 4 5
This can be done with two reshape() calls. First, we can longify both artery and diameter on id, then widen with artery as the time variable. To prevent a column of NAs, we also must subset out rows with NA values for artery in the intermediate frame.
reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
## id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1 1 3 2 3 NA NA NA
## 2.1 2 2 NA NA NA NA NA
## 3.1 3 NA 1 2 3 4 5
The diameter. prefixes can be removed afterward, if desired. However, an advantage of this solution is that it would be capable of preserving multiple column sets, whereas the xtabs() solution cannot. The prefixes would be essential to distinguish the column sets in that case.
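As a sketch of that last step, the same two-stage reshape can be written with the arguments spelled out and the prefixes stripped with sub() at the end. This version names the time variable explicitly (slot is a made-up name) rather than using timevar=NULL, and it rebuilds the sample data with character columns instead of factors.

```r
# Sample data from the question, with arteries as character columns
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Step 1: stack the artery/diameter pairs into long format
long <- reshape(df, direction = "long",
                varying = list(paste0("artery.", 1:5), paste0("diameter.", 1:5)),
                v.names = c("artery", "diameter"),
                timevar = "slot", idvar = "id")

# Drop the empty slots and the helper column
long <- long[!is.na(long$artery), c("id", "artery", "diameter")]

# Step 2: widen again, one column per artery
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "artery")

# Strip the "diameter." prefixes
names(wide) <- sub("^diameter\\.", "", names(wide))
wide
```

If several measurement sets were being carried through, the prefixes would need to stay so the column sets remain distinguishable.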
I have inherited a data set that has 23 attributes measured for each of 13 names (between-subjects: each participant only rated one name on all of these attributes). Right now it's structured such that the attributes are the fastest-moving factor, followed by the name. So the data look like this:
Sub# N1-item1 N1-item2 N1-item3 […] N2-item1 N2-item2 N2-item3
1 3 5 3 NA NA NA
2 NA NA NA 1 5 3
3 3 5 3 NA NA NA
4 NA NA NA 2 2 1
It needs to be restructured such that it's collapsed over name and all of the item1 entries are in the same column (subjects don't matter for this purpose), as below (bearing in mind that there are 23 items, not 3, and 13 names, not 2):
Name item1 item2 item3
N1 3 5 3
N2 1 5 3
I can do this with loops, but I'd rather do it in a manner more natural to R, which I'm guessing would be one of the apply family of functions, but I can't quite wrap my head around it. What is the smart way to do this?
Here's an answer using dplyr and tidyr:
library(dplyr)#loads libraries
library(tidyr)
dat %>% #name of your dataframe
gather(key, val, -Sub) %>% #gathers to long data, with id as Sub
filter(!is.na(val)) %>% #removes rows with NA for the value
separate(key, c("Name", "item")) %>% #split the column key into Name and item
spread(item, val) #spreads the data into wide format, with item as the columns
Sub Name item1 item2 item3
1 1 N1 3 5 3
2 2 N2 1 5 3
3 3 N1 3 5 3
4 4 N2 2 2 1
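With current tidyr, the gather/separate/spread chain can likely be collapsed into a single pivot_longer call using the special ".value" name. This is a sketch rather than a drop-in replacement: it assumes the columns are named as in the question (N1-item1, N1-item2, ...) and that the subject column is called Sub.

```r
library(tidyr)

# Sample data with the question's Name-item column naming
dat <- data.frame(
  Sub = 1:4,
  `N1-item1` = c(3L, NA, 3L, NA),
  `N1-item2` = c(5L, NA, 5L, NA),
  `N1-item3` = c(3L, NA, 3L, NA),
  `N2-item1` = c(NA, 1L, NA, 2L),
  `N2-item2` = c(NA, 5L, NA, 2L),
  `N2-item3` = c(NA, 3L, NA, 1L),
  check.names = FALSE   # keep the hyphens in the column names
)

# The part before "-" becomes the Name column; the part after it names
# the output columns (".value"). All-NA rows are dropped.
out <- pivot_longer(dat,
                    cols = -Sub,
                    names_to = c("Name", ".value"),
                    names_sep = "-",
                    values_drop_na = TRUE)
out
```

Dropping the Sub column afterwards gives the Name/item layout asked for in the question.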
Spin the column names around to be itemX-NY and then let reshape sort it out:
names(dat)[-1] <- gsub("(^.+?)-(.+?$)", "\\2-\\1", names(dat)[-1])
na.omit(reshape(dat, direction="long", idvar="Sub", varying=-1, sep="-"))
# Sub time item1 item2 item3
#1.N1 1 N1 3 5 3
#3.N1 3 N1 3 5 3
#2.N2 2 N2 1 5 3
#4.N2 4 N2 2 2 1
Where the data was:
dat <- structure(list(Sub = 1:4, `item1-N1` = c(3L, NA, 3L, NA), `item2-N1` = c(5L,
NA, 5L, NA), `item3-N1` = c(3L, NA, 3L, NA), `item1-N2` = c(NA,
1L, NA, 2L), `item2-N2` = c(NA, 5L, NA, 2L), `item3-N2` = c(NA,
3L, NA, 1L)), .Names = c("Sub", "item1-N1", "item2-N1", "item3-N1",
"item1-N2", "item2-N2", "item3-N2"), row.names = c(NA, -4L), class = "data.frame