Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 2 years ago.
Creating factors in a dataset with NAs works for individual columns, but I need to do the same across many more columns (all starting with 'impact.') and have hit a problem inside dplyr's mutate(across()).
What am I doing wrong?
Reprex below
library(tibble) # tribble() is provided by the tibble package
library(dplyr)
df <- tribble(~id, ~tumour, ~impact.chemo, ~impact.radio,
1,'lung',NA,1,
2,'lung',1,NA,
3,'lung',2,3,
4,'meso',3,4,
5,'lung',4,5)
# Factor labels
trt_labels <- c('Planned', 'Modified', 'Interrupted', 'Deferred', "Omitted")
# Factor levels should map to labels as follows, retaining NAs where present:
data.frame(level = 1:5,
label = trt_labels)
# Create factor works for individual columns
factor(df$impact.chemo, levels = 1:5, labels = trt_labels)
factor(df$impact.radio, levels = 1:5, labels = trt_labels)
# But fails inside mutate(across)
df %>%
mutate(across(.cols = starts_with('impact'), ~factor(levels = 1:5, labels = trt_labels)))
Just making @27ϕ9's comment an answer: the purrr-style lambda you passed to across() is incorrect because it never hands the data to factor(). factor() needs its first argument, the object to convert, which here is each data frame column selected by across().
To fix your issue, insert .x inside the lambda. .x stands for the current column, so ~factor(.x, ...) is none other than shorthand for function(x) factor(x, ...) - see this page for more info about purrr-style lambda functions.
df %>%
mutate(across(.cols = starts_with('impact'), ~factor(.x, levels = 1:5, labels = trt_labels)))
# A tibble: 5 x 4
#      id tumour impact.chemo impact.radio
#   <dbl> <chr>  <fct>        <fct>
# 1     1 lung   NA           Planned
# 2     2 lung   Planned      NA
# 3     3 lung   Modified     Interrupted
# 4     4 meso   Interrupted  Deferred
# 5     5 lung   Deferred     Omitted
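For readers who find the ~ shorthand opaque, here is a minimal sketch of the same fix written with an explicit anonymous function. It rebuilds the question's data so that it runs on its own; the name out is only illustrative and not part of the original post.

```r
library(dplyr)
library(tibble)

# Same data and labels as in the question
df <- tribble(~id, ~tumour, ~impact.chemo, ~impact.radio,
              1, 'lung', NA, 1,
              2, 'lung', 1, NA,
              3, 'lung', 2, 3,
              4, 'meso', 3, 4,
              5, 'lung', 4, 5)
trt_labels <- c('Planned', 'Modified', 'Interrupted', 'Deferred', 'Omitted')

# across() calls the function once per selected column, passing the column as x
out <- df %>%
  mutate(across(starts_with('impact'),
                function(x) factor(x, levels = 1:5, labels = trt_labels)))
```

Spelling out function(x) makes it visible that the column has to be passed to factor() explicitly, which is exactly what the original lambda was missing.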
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 1 year ago.
This is probably a very basic question, but I'm just starting out using R and hope someone can help.
I've imported some data into R and created an object containing just the data I'm working on first:
Each of the values is from a scale of 1 to 10.
What I want to produce is a chart showing the mean of each column, something like this (which I did in Excel):
I'm sure this is possible, but I'm going round in circles figuring it out! Ignoring the vertical line (at maximum value) and standard deviations for now, though ultimately I'd like to have them included. Thank you!
set.seed(42)
dat <- setNames(data.frame(replicate(4, sample(10, 50, replace=TRUE))), c("2000", "2400", "2800", "3200"))
head(dat)
# 2000 2400 2800 3200
# 1 1 6 5 1
# 2 5 6 9 1
# 3 1 2 10 5
# 4 9 4 8 3
# 5 10 3 7 10
# 6 4 6 6 1
library(dplyr)
library(tidyr) # pivot_longer
library(ggplot2)
dat %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarize(value = mean(value), .groups = "drop") %>%
mutate(name = as.integer(name)) %>%
ggplot(aes(name, value)) + geom_line()
It seems that you have encoded a numerical value in the column names, which is not a good idea, because it is a violation of the first normal form. I would therefore suggest transposing the data and storing that value in the first column.
With your current data structure, you must first extract the numbers from the column names with
x <- as.numeric(names(dat))
Then you can compute all column means with
y <- colMeans(dat)
And then you can plot it
plot(x, y, type="l")
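The question also mentions wanting standard deviations eventually. A rough base R sketch of one way to add them, assuming the same simulated dat as in the other answer; the name s and the error-bar styling are my own choices, not something from the original posts:

```r
# Simulated stand-in for the OP's data (same construction as above)
set.seed(42)
dat <- setNames(data.frame(replicate(4, sample(10, 50, replace = TRUE))),
                c("2000", "2400", "2800", "3200"))

x <- as.numeric(names(dat))  # numeric values encoded in the column names
y <- colMeans(dat)           # mean of each column
s <- sapply(dat, sd)         # standard deviation of each column

# Widen the y axis so the error bars fit, then draw mean +/- one SD
plot(x, y, type = "l", ylim = range(y - s, y + s))
arrows(x, y - s, x, y + s, angle = 90, code = 3, length = 0.05)
```

arrows() with angle = 90 and code = 3 is a common base graphics idiom for drawing flat-capped error bars.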
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 1 year ago.
I am trying to access a column in my data frame using the dataframe$column syntax, but it returns NULL. What am I doing wrong? Please help.
As you can see from the output, you don't have a column called Ozone; the column, and the only one, you have is called V1. You will have to split the data in V1 into columns. This can be done using tidyr's separate, like so:
Data:
df <- data.frame(
V1 = c("Ozone,Solar.R,Wind,Temp,Month,Day",
"41,190,7.4,67,5,1")
)
First, get your column names:
col_names <- unlist(strsplit(df$V1[1], ","))
The column names are now stored in a vector:
col_names
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
Now transform df:
library(dplyr)
library(tidyr)
df %>%
# first rename the col to be transformed:
rename("Ozone,Solar.R,Wind,Temp,Month,Day" = V1) %>%
# remove the first row, which is now redundant:
slice(2:nrow(.)) %>%
# separate into columns using the `col_names`:
separate(1, into = col_names, sep = ",")
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
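As a possible base R alternative (a sketch, not the approach from the answer): since V1 already holds comma-separated text with the header in its first element, read.csv() can parse it directly through its text argument. The name parsed is illustrative, and as.character() is only there as a guard in case V1 was stored as a factor.

```r
df <- data.frame(
  V1 = c("Ozone,Solar.R,Wind,Temp,Month,Day",
         "41,190,7.4,67,5,1")
)

# Treat each element of V1 as one line of a CSV file;
# the first line supplies the column names
parsed <- read.csv(text = as.character(df$V1))
parsed
```

Unlike separate(), which leaves every new column as character, read.csv() also converts the values to numeric types.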
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 1 year ago.
Here's my problem: dplyr's group_by doesn't work on one data.frame, but it works well on another. The problematic data frame is imported from an SPSS file with the foreign package. When I execute this:
d_summarised <- d %>%
group_by(group) %>%
summarise(Sex = (sum(d$GENRE == "F", na.rm = TRUE))/sum(!is.na(d$GENRE))) %>%
select(Sex, group)
The result is calculated on the whole sample rather than by group, so the value is the same for every group, which is not what I expected.
# A tibble: 6 x 2
  group    Sex
* <fct>  <dbl>
1 group1 0.626
2 group2 0.626
3 group3 0.626
4 group4 0.626
5 group5 0.626
6 NA     0.626
But at the same time, in the same session, with the same packages loaded, this works:
dat <- data.frame(x=c(1,2,3,3,2,1), y=c(15,24,54,65,82,65))
dat %>%
group_by(x) %>%
summarise(mean(y))
Here's the result:
# A tibble: 3 x 2
      x `mean(y)`
* <dbl>     <dbl>
1     1      40
2     2      53
3     3      59.5
plyr is not loaded, only dplyr. How is that possible?
The issue is that d$ breaks the grouping: d$GENRE pulls the entire column from d instead of the rows belonging to the current group. Use the bare column names instead and it should work.
library(dplyr)
d %>%
group_by(group) %>%
summarise(Sex = (sum(GENRE == "F", na.rm = TRUE))/sum(!is.na(GENRE))) %>%
select(Sex, group)
NOTE: when we use d$GENRE, it selects the whole column of the dataset instead of limiting it to the elements within each group.
In the second case, the OP applied mean directly to y rather than to dat$y, so it was evaluated per group. In other words, the difference is not the data structure (data.frame vs tibble) but the extraction of the whole column with $.
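A small self-contained sketch of the difference, using made-up toy data in place of the OP's SPSS import (the values and the names wrong and right are hypothetical). mean() of the logical vector is used here as a shorter equivalent of the sum()/sum() ratio in the question:

```r
library(dplyr)

# Hypothetical toy data: g1 is half "F", g2 is all "F"
d <- data.frame(
  group = c("g1", "g1", "g2", "g2"),
  GENRE = c("F", "M", "F", "F")
)

# d$GENRE reaches outside the grouped data: each group gets the overall share
wrong <- d %>%
  group_by(group) %>%
  summarise(Sex = mean(d$GENRE == "F", na.rm = TRUE))

# Bare GENRE is evaluated inside each group
right <- d %>%
  group_by(group) %>%
  summarise(Sex = mean(GENRE == "F", na.rm = TRUE))
```

Here wrong$Sex is 0.75 for both groups (the whole-sample proportion), while right$Sex is 0.5 for g1 and 1 for g2.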
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Goal: to get new row names when using base R order() function (as is done with dplyr::arrange()). The rownames/index output for the base R call is 3, 1, 2 as seen below whereas the output for arrange() is 1, 2, 3 (seen below). How can I get 1, 2, 3 using base R order()?
Reprex:
library(dplyr)
df <- data.frame(
company = c("A", "B", "C"),
sales = c(100, 200, 50)
)
# base R:
df[order(df$sales),]
# dplyr:
arrange(df, sales)
# Base R output:
## company sales
## 3 C 50
## 1 A 100
## 2 B 200
# dplyr output:
## company sales
## 1 C 50
## 2 A 100
## 3 B 200
If your goal is for the row numbers after using arrange() to match what you get from order(), then do the following (a few extra dplyr and tibble steps).
library(dplyr)
library(tibble)
df %>%
rownames_to_column() %>%
arrange(sales) %>%
column_to_rownames("rowname")
company sales
3 C 50
1 A 100
2 B 200
If your goal is to get the same row names that result from arrange(), you can assign the row names after using order().
df_new <- df[order(df$sales),]
rownames(df_new) <- 1:nrow(df_new)
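A slightly shorter variant, sketched here on the same example data: assigning NULL to the row names resets them to the default 1..n sequence, so the number of rows does not need to be spelled out.

```r
df <- data.frame(
  company = c("A", "B", "C"),
  sales = c(100, 200, 50)
)

df_new <- df[order(df$sales), ]
rownames(df_new) <- NULL  # reset to automatic row names 1, 2, 3
df_new
```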
It may be good practice to create an ID column instead of relying on row names. Numbered IDs usually correspond to the original row order, but of course you can also create them after your ordering operation.
df_new <- df[order(df$sales),]
df_new$id <- 1:nrow(df_new)
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 5 years ago.
I have a data frame in R that has two columns, one with last names, the other with the frequency of each last name. I would like to randomly select last names based on the frequency values (each between 0 and 1).
So far I have tried using the sample function, but it doesn't allow for specific frequencies for each value. Not sure if this is possible :/
df1 <- data.frame(names = c("John","Mary"),freq=c(0.2,0.8))
df1
# names freq
# 1 John 0.2
# 2 Mary 0.8
set.seed(1)
sample100 <- sample(
x = df1$names,
size = 100,
replace=TRUE,
prob=df1$freq)
table(sample100)
# sample100
# John Mary
# 17 83