Suppose we have this data frame:
avg_1 avg_2 avg_3 avg_4
132 123 23 214
DF DM RF RM
How can I convert this in R so that the output is a new data frame that looks like:
avg key
132 DF
123 DM
23 RF
214 RM
I have tried using pivot_longer from tidyverse, but the trouble is that I'm also trying to rename the columns to avg and key. Can anyone help?
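For reference, here is a minimal construction of the input above (a sketch; since a data-frame column can hold only one type, both rows end up stored as character):
# hypothetical reconstruction of the posted data frame
df <- data.frame(avg_1 = c("132", "DF"), avg_2 = c("123", "DM"),
                 avg_3 = c("23", "RF"), avg_4 = c("214", "RM"))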
In base R I would try:
setNames(data.frame(t(df), row.names = NULL), c("avg", "key"))
Output
avg key
1 132 DF
2 123 DM
3 23 RF
4 214 RM
Does this work?
library(dplyr)
library(purrr)
library(tibble)
t(df) %>% as_tibble() %>% set_names(c('avg','key')) %>% type.convert(as.is = TRUE)
# A tibble: 4 x 2
avg key
<int> <chr>
1 132 DF
2 123 DM
3 23 RF
4 214 RM
And here is a solution with base R methods (note that t() returns a matrix, so we wrap it back into a data frame before renaming):
x <- data.frame(t(your.data.frame), row.names = NULL)
names(x) <- c("avg", "key")
Note that you might also want to change the data types to numeric and factor, if they are something different, e.g.
x$avg <- as.numeric(x$avg)
x$key <- as.factor(x$key)
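Alternatively, type.convert() can infer the type of every column in one call (a sketch of the same cleanup):
# convert all columns in one pass; as.is = TRUE keeps strings as character, not factor
x <- type.convert(x, as.is = TRUE)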
Related
I am looking for a dplyr way to break variable into multiple columns according to dictionary:
vardic <- data.frame(varname=c('a','b','c','d'),
length=c(2,6,3,1) ) %>%
mutate(end=cumsum(length),start=end-length+1)
d <- data.frame(orig_string=c('11333333444A',
'22444444111C',
'55666666000B'))
The desired output is:
d2 <- data.frame(a=c(11,22,55), b=c(333333,444444,666666), c=c('444','111','000'), d=c('A','C','B'))
This has to be done using only dplyr commands, because it will be implemented via arrow on a larger-than-memory dataset (asked in this other question).
UPDATE (responding to comments): functions outside dplyr could be used, as long as they are supported by arrow. arrow's list of R/dplyr supported functions describes what has been implemented so far. Hopefully this pseudocode illustrates the pipeline:
library(tidyverse)
library(arrow)
d %>% write_dataset('myfile',format='parquet')
'myfile' %>% open_dataset %>%
sequence_of_arrowsupported_commands_to_split_columns
Update 2: added columns indicating start and end positions to vardic.
Update 3: made the arrow pipeline above more reproducible, then tested #akrun's solution. But separate is not supported by arrow.
OP here. Thanks for all the support. The other answers are great and work well with purrr + dplyr. However, the way they loop/map through the variables has not been implemented in arrow yet. One solution is to keep the loop outside and have the arrow command repeated for each variable:
For instance, a hard-coded sequence would be:
ds <- 'file' %>% open_dataset
ds <- ds %>% mutate(a=str_sub(orig_string,1,2))
ds <- ds %>% mutate(b=str_sub(orig_string,3,8))
...
ds %>% collect
Now reimplement this as a function + a loop:
extract_var_arrow <- function(ds, var){
  # look up this variable's start/end positions in vardic (a plain data frame)
  s <- vardic$start[vardic$varname == var]
  e <- vardic$end[vardic$varname == var]
  ds %>% mutate("{var}" := str_sub(orig_string, s, e))
}
for(v in vardic$varname){
ds <- ds %>% extract_var_arrow(v)
}
Note that, until it sees a collect() statement, arrow is just compiling a query. So the above loop is equivalent to:
# ds <- ds %>% extract_var_arrow('a')
# %>% extract_var_arrow('b')
# %>% extract_var_arrow('c')
# %>% extract_var_arrow('d')
Finally, we can collect:
ds %>% select(-orig_string) %>% collect
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
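As a side note, the explicit for loop can also be written with purrr::reduce(), which threads ds through extract_var_arrow() one variable at a time and builds the same chained query (a sketch, assuming the function defined above):
library(purrr)
# reduce() computes extract_var_arrow(ds, 'a'), feeds the result in for 'b', and so on
ds <- reduce(vardic$varname, extract_var_arrow, .init = ds)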
Like others, I'm not sure what exactly you mean by 'only' dplyr. If by that you mean only tidyverse, here's a solution that relies on dplyr, tidyr and stringr. str_extract_all() pulls out each run of a repeated character ((.)\\1+) plus the single trailing capital letter ([A-Z]$), and unnest_wider() spreads the resulting list-column into separate columns:
library(dplyr)
library(stringr)
library(tidyr)
d2 <- d %>%
mutate(orig_string = str_extract_all(orig_string, "(.)\\1+(?!\\1)|[A-Z]$")) %>%
unnest_wider(orig_string)
names(d2) <- vardic$varname
# A tibble: 3 × 4
a b c d
<chr> <chr> <chr> <chr>
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
EDIT:
Here's a fully automated tidyverse solution:
library(tidyr)
d %>%
separate(orig_string,
into = vardic$varname,
sep = head(cumsum(vardic$length), -1))  # 4 names need only 3 split positions
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
Base R solution:
# instantiate d2 with nrow(d) rows and 0 columns
d2 <- d
d2$orig_string <- NULL
for (i in seq(to = nrow(vardic))) {
d2[[vardic$varname[[i]]]] <- substr(
d$orig_string,
vardic$start[[i]],
vardic$end[[i]]
)
}
d2
a b c d
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
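The same loop can also be collapsed into a single Map() call over the start/end columns (a sketch, using the same vardic and d as above):
# build one character vector per variable, then assemble them into a data frame
cols <- Map(function(s, e) substr(d$orig_string, s, e), vardic$start, vardic$end)
d2 <- as.data.frame(setNames(cols, vardic$varname))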
If you can use other tidyverse packages, here's a solution using purrr::pmap_dfc():
library(dplyr)
library(purrr)
library(stringr)
pmap_dfc(vardic, \(varname, start, end, ...) tibble(
!!varname := str_sub(d$orig_string, start = start, end = end)
))
# A tibble: 3 × 4
a b c d
<chr> <chr> <chr> <chr>
1 11 333333 444 A
2 22 444444 111 C
3 55 666666 000 B
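In purrr >= 1.0.0, pmap_dfc() is superseded; the same result can be written with pmap() plus list_cbind() (a sketch):
# one single-column tibble per variable, bound together column-wise
pmap(vardic, \(varname, start, end, ...) tibble(
  !!varname := str_sub(d$orig_string, start = start, end = end)
)) %>% list_cbind()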
My function is going to take data and a column name from the user.
The column name must be passed bare, e.g. column or `my column` (i.e., no quotes; backticks only if the column name has multiple words).
At one part of my function I need to check for missing values in that column and replace them with "missed".
myfun <- function(data, column){
  library(tidyverse)
  data <- as_tibble(data)
  data %>%
    mutate(across(.cols = {{column}}, .fns = as.character)) %>%
    replace_na(list({{column}} = "missed"))  # this is the part that fails
}
new_df <- airquality %>%
myfun(Ozone)
I used {{}} before and had no issues passing column names, as in the airquality example above. Now, at one part of my actual function, I need to replace NAs, and the format above does not work with replace_na().
Any idea how to fix this?
I think you need to use replace_na() within mutate(). If you use the same strategy as for converting your columns to character, this seems to solve the problem:
myfun <- function(data, column){
library(tidyverse)
data <- as_tibble(data)
data %>%
mutate(across(.cols = {{column}}, .fns = ~ replace_na(as.character(.x), "missed")))
}
new_df <- airquality %>%
myfun(Ozone)
The output:
> head(new_df)
# A tibble: 6 × 6
Ozone Solar.R Wind Temp Month Day
<chr> <int> <dbl> <int> <int> <int>
1 41 190 7.4 67 5 1
2 36 118 8 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 missed NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
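If you would rather keep replace_na() at the top level instead of inside across(), the column can be injected on both sides with {{ }} and := (a sketch; myfun2 is just an illustrative name):
library(dplyr)
library(tidyr)
myfun2 <- function(data, column){
  # {{ column }} := assigns back to the same column the caller passed in
  data %>%
    mutate({{ column }} := replace_na(as.character({{ column }}), "missed"))
}
new_df <- airquality %>% myfun2(Ozone)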
I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I've got; I've been trying to tweak it, but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
one <- maxsbp[[i]]
index <- which(one$sbp == max(one$sbp))
select <- one[index,]
r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep anything at or over that threshold. Then group by id, arrange in descending order, and keep the top row.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
library(dplyr)
my.df %>%
  filter(sbp < 300) %>%   # retain only values below 300
  group_by(id) %>%        # group by id
  arrange(desc(sbp)) %>%  # arrange sbp in descending order
  top_n(1, sbp)           # retain the top value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
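Note that top_n() is superseded in dplyr >= 1.0.0; slice_max() expresses the same idea without the arrange() step (a sketch):
my.df %>%
  filter(sbp < 300) %>%                       # drop values of 300 or more
  group_by(id) %>%
  slice_max(sbp, n = 1, with_ties = FALSE)    # keep the single largest sbp per id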
In R, you very rarely need explicit for loops for tasks like this.
There are functions available that help you perform such grouped operations.
For example, in base R you can use subset and ave :
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x <= 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr, whose syntax is a little easier to understand.
library(dplyr)
df %>%
group_by(id) %>%
filter(sbp == max(sbp[sbp <= 300], na.rm = TRUE))
slice_head() can also be used.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this (note that the filter must come before slice_head(); otherwise an id whose largest sbp is 300 or more, like 13480, would be dropped entirely):
my.df %>%
  filter(sbp < 300) %>%    # drop out-of-range values before picking the max
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups:   id, sex [3]
     id sex     sbp
  <dbl> <chr> <dbl>
1 13480 M       124
2 13520 M       124
3 13580 M       124
I have a situation where I have data distributed between two data frames: I need to subset the data from one of them first, and then conduct a t-test between that subset and the (entire) data from the other data frame.
I attempted to use %>% and group_by() to select the data I want, and then I tried to invoke the t-test as shown below.
library(dplyr)
a <- c("AA","AA","AA","AB","AB","AB")
b <- c(1,2,3,1,2,3)
c <- c(12,34,56,78,90,12)
cols1 <- c("SampID", "Reps", "Vals")
df1 <- data.frame(a,b,c)
colnames(df1) <- cols1
df1
SampID Reps Vals
1 AA 1 12
2 AA 2 34
3 AA 3 56
4 AB 1 78
5 AB 2 90
6 AB 3 12
e <- c(1,2,3,4,5,6,7,8,9)
f <- c(11,22,33,44,55,66,77,88,99)
cols2 <- c("CtrlReps","CtrlVals")
df2 <- data.frame(e,f)
colnames(df2) <- cols2
df2
CtrlReps CtrlVals
1 1 11
2 2 22
3 3 33
4 4 44
5 5 55
6 6 66
7 7 77
8 8 88
9 9 99
df1 %>%
group_by(SampID) %>%
t.test(Vals, df2$CtrlVals, var.equal = FALSE)
This, however, returns an error:
Error in match.arg(alternative) :
'arg' must be NULL or a character vector
I also tried using do(), but that returns an error as well:
outputs <- df1 %>%
group_by(SampID) %>%
do(tpvals = t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE)) %>%
summarise(SampID, pvals = tpvals$p.value)
Error in t.test(Vals, df2$CtrlVals, data = ., paired = FALSE, var.equal = FALSE) :
object 'Vals' not found
I am new to R and have exhausted my Google-fu, so I have no idea what is happening. To the best of my knowledge the two errors are unrelated, but resolving either one would give me a way out; I just don't know how. I am also sure that resolving this problem would immediately land me in the next one (the one this post actually addresses).
Your inputs/guidance/help would be much appreciated!
Your attempt with do was close, it can be fixed by doing:
outputs <- df1 %>%
group_by(SampID) %>%
do(tpvals = t.test(.$Vals, df2$CtrlVals,
paired = FALSE, var.equal = FALSE)) %>%
summarise(SampID, pvals = tpvals$p.value)
You need .$Vals to get at the Vals column within do(); it doesn't work quite the same way as mutate(). The data argument of t.test() also isn't useful here: you don't have both variables in the same data frame, so you can't put them both in a formula.
Result:
> outputs
# A tibble: 2 x 2
SampID pvals
<fct> <dbl>
1 AA 0.253
2 AB 0.862
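Since do() is superseded in current dplyr, the same per-group test can also be written with summarise() alone (a sketch that produces the same p-values):
df1 %>%
  group_by(SampID) %>%
  summarise(pvals = t.test(Vals, df2$CtrlVals, var.equal = FALSE)$p.value)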
Please see the attached image for the best way I can describe my question.
I promise I did attempt to research this first, and I saw a few answers that came close, but many of them required listing off each variable (in this image, each encounter #), and my data has approximately 15 million rows, with about 10,000 different encounter #'s.
I would appreciate any assistance!
As an alternative, you can also use the data.table package. Especially on large datasets, data.table will give you an enormous performance boost. Applied to the data as used by #r2evans:
library(data.table)
setDT(df)[, .(n_uniq_enc = uniqueN(encounter)), by = patient]
this will lead to the following result:
patient n_uniq_enc
1: 123 5
2: 456 5
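uniqueN() is data.table's optimised counterpart of length(unique(...)). If you also want the result sorted by patient, keyby does the grouping and ordering in one go (a sketch):
setDT(df)[, .(n_uniq_enc = uniqueN(encounter)), keyby = patient]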
Lacking a reproducible example, here's some sample data:
set.seed(42)
df <- data.frame(patient = sample(c(123,456), size=30, replace=TRUE), encounter=sample(c(12,34,56,78,90), size=30, replace=TRUE))
head(df)
# patient encounter
# 1 456 78
# 2 456 90
# 3 123 34
# 4 456 78
# 5 456 12
# 6 456 90
Base R:
aggregate(x = df$encounter, by = list(patient = df$patient),
FUN = function(a) length(unique(a)))
# patient x
# 1 123 5
# 2 456 5
or (by #20100721's suggestion):
aggregate(encounter~.,FUN = function(t) length(unique(t)),data = df)
Using dplyr:
library(dplyr)
group_by(df, patient) %>%
summarize(numencounters = length(unique(encounter)))
# # A tibble: 2 x 2
# patient numencounters
# <dbl> <int>
# 1 123 5
# 2 456 5
Update: #2100721 informed me of n_distinct(), effectively the same as length(unique(...)):
group_by(df, patient) %>%
summarize(numencounters = n_distinct(encounter))
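An equivalent dplyr spelling deduplicates the patient/encounter pairs first and then counts (a sketch):
df %>%
  distinct(patient, encounter) %>%           # one row per unique pair
  count(patient, name = "numencounters")     # rows per patient = unique encounters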