I am trying to create a stacked bar chart showing the percentage frequency of occurrences by group:
library(dplyr)
library(ggplot2)
brfss_2013 %>%
group_by(incomeLev, mentalHealth) %>%
summarise(count_mentalHealth=n()) %>%
group_by(incomeLev) %>%
mutate(count_inc=sum(count_mentalHealth)) %>%
mutate(percent=count_mentalHealth / count_inc * 100) %>%
ungroup() %>%
ggplot(aes(x=forcats::fct_explicit_na(incomeLev),
y=count_mentalHealth,
group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
However, this is the traceback I receive:
1. dplyr::group_by(., incomeLev, mentalHealth)
8. plyr::summarise(., count_mentalHealth = n())
9. [ base::eval(...) ] with 1 more call
11. dplyr::n()
12. dplyr:::from_context("..group_size")
13. `%||%`(...)
In addition: Warning message:
Factor `incomeLev` contains implicit NA, consider using `forcats::fct_explicit_na`
>
Here is a sample of my data
brfss_2013 <- structure(list(incomeLev = structure(c(2L, 3L, 3L, 2L, 2L, 3L,
NA, 2L, 3L, 1L, 3L, NA), .Label = c("$25,000-$35,000", "$50,000-$75,000",
"Over $75,000"), class = "factor"), mentalHealth = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("Excellent",
"Ok", "Very Bad"), class = "factor")), row.names = c(NA, -12L
), class = "data.frame")
Update:
Output of str(brfss_2013):
'data.frame': 491775 obs. of 9 variables:
$ mentalHealth: Factor w/ 5 levels "Excellent","Good",..: 5 1 1 1 1 1 3 1 1 1 ...
$ pa1min_ : int 947 110 316 35 429 120 280 30 240 260 ...
$ bmiLev : Factor w/ 6 levels "Underweight",..: 5 1 3 2 5 5 2 3 4 3 ...
$ X_drnkmo4 : int 2 0 80 16 20 0 1 2 4 0 ...
$ X_frutsum : num 413 20 46 49 7 157 150 67 100 58 ...
$ X_vegesum : num 53 148 191 136 243 143 216 360 172 114 ...
$ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
$ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
$ incomeLev : Factor w/ 4 levels "$25,000-$35,000",..: 2 4 4 2 2 4 NA 2 4 1 ...
First of all, your code works incredibly well when you transform everything into character. So you could just do
brfss_2013[c("incomeLev", "mentalHealth")] <-
lapply(brfss_2013[c("incomeLev", "mentalHealth")], as.character)
and then just run your code as you figured it out.
But, let's do it with factors (don't run the lapply(.) line in this case!).
You want a "missing" category, which you can obtain by adding a new level "missing" for the NAs.
levels(brfss_2013$incomeLev) <- c(levels(brfss_2013$incomeLev), "missing")
brfss_2013$incomeLev[is.na(brfss_2013$incomeLev)] <- "missing"
Then, your aggregation (in a base R way).
b1 <- with(brfss_2013, aggregate(list(count_mentalHealth=incomeLev),
by=list(mentalHealth=mentalHealth, incomeLev=incomeLev),
length))
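# Name incomeLev explicitly: with "~ ." the full 9-column data would group by every other column
b2 <- aggregate(mentalHealth ~ incomeLev, brfss_2013, length)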
names(b2)[2] <- "count_inc"
brfss_2013.agg <- merge(b1, b2)
rm(b1, b2) # just to clean up
Add the "percent" column (multiplied by 100, because the label format used in the plot appends a literal % sign).
brfss_2013.agg$percent <- with(brfss_2013.agg, count_mentalHealth / count_inc * 100)
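As a cross-check on the aggregate/merge steps, the same within-group percentages can be sketched with table() and prop.table(). The toy vectors below are hypothetical values, not the BRFSS sample, and addNA() is used here as an alternative way of giving NA its own category.

```r
# Toy data (hypothetical values, not the BRFSS sample)
inc <- factor(c("low", "low", "high", "high", "high", NA))
mh  <- factor(c("Good", "Bad", "Good", "Good", "Bad", "Good"))

# addNA() turns NA into its own level, so missing incomes get a row
tab <- table(incomeLev = addNA(inc), mentalHealth = mh)

# margin = 1 divides each cell by its row total
pct <- prop.table(tab, margin = 1) * 100
round(pct, 1)
```

Each row of pct sums to 100, which is a quick sanity check for the percent column computed above.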
Plot.
library(ggplot2)
ggplot(brfss_2013.agg, aes(x=incomeLev, y=count_mentalHealth, group=mentalHealth)) +
geom_bar(aes(fill=mentalHealth), stat="identity") +
geom_text(aes(label=sprintf("%0.1f%%", percent)),
position=position_stack(vjust=0.5))
Result
So your code actually works fine for me. It looks like it might be an issue with package versions because it seems odd that you're using the plyr summarise function.
However, here's a slightly more concise way to create that graph (and hopefully this is helpful for whatever you want to add to this plot)
brfss_2013 %>%
# Add count of income levels first (note this only adds a variable)
add_count(incomeLev) %>%
rename(count_inc = n) %>%
# Count observations per group (this transforms data)
count(incomeLev, mentalHealth, count_inc) %>%
rename(count_mentalHealth = n) %>%
mutate(percent= count_mentalHealth / count_inc) %>%
ggplot(aes(x= incomeLev,
y= count_mentalHealth,
# Technically you don't need this group here but groups can be handy
group= mentalHealth)) +
geom_bar(aes(fill=mentalHealth),
stat="identity")+
# Using the scales package does the percent formatting for you;
# position_stack() places each label inside its own bar segment
geom_text(aes(label = scales::percent(percent)),
position = position_stack(vjust = 0.5)) +
theme_minimal()
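The traceback in the question shows plyr::summarise being called. That usually happens when plyr is attached after dplyr and masks its verbs. A minimal sketch of one way around it, assuming dplyr 1.0 or later, is to call the dplyr functions with an explicit namespace. The toy data frame is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the question
toy <- data.frame(incomeLev    = c("low", "low", "high"),
                  mentalHealth = c("Good", "Bad", "Good"))

# The dplyr:: prefix makes sure plyr::summarise cannot be picked up by accident
counts <- toy %>%
  dplyr::group_by(incomeLev, mentalHealth) %>%
  dplyr::summarise(count_mentalHealth = dplyr::n(), .groups = "drop")
counts
```

Loading plyr before dplyr, or not attaching plyr at all, should have the same effect.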
My dataset is a breakdown of respondents and the contacts they have had within a given time period, along with the age bracket of each, something similar to:
participant participant_age contact contact_age
1 18-30 1 18-30
1 18-30 2 30-40
2 30-40 1 18-30
3 18-30 1 18-30
3 18-30 2 50-60
My aim is to calculate the mean number of contacts each age group of participant has had with each age bracket of contact. Something similar to:
age_bracket 18-30 30-40 40-50
18-30 1 3 2
30-40 1.5 4 2
40-50 3 4 1
I have been attempting to use the group_by and spread functions available in dplyr. The closest I have come is using
data%>%
group_by(participant_age, contact_age) %>%
tally() %>%
spread(key = participant_age, value = n)
But this produces the total number (n) of each contact, rather than the mean number of contacts per age bracket.
In base R use tapply.
t(with(dat, tapply(contact, list(contact_age, participant_age), mean)))
# 18-30 30-40 50-60
# 18-30 1 2 2
# 30-40 1 NA NA
Data:
dat <- structure(list(participant = c(1L, 1L, 2L, 3L, 3L), participant_age = c("18-30",
"18-30", "30-40", "18-30", "18-30"), contact = c(1L, 2L, 1L,
1L, 2L), contact_age = c("18-30", "30-40", "18-30", "18-30",
"50-60")), class = "data.frame", row.names = c(NA, -5L))
If I understood your aim correctly, you were pretty close to the right solution:
data %>%
group_by(participant_age, contact_age) %>%
summarise(mean = mean(contact), .groups = "drop") %>%
spread(key = participant_age, value = mean)
You can use pivot_wider and pass the function to apply in values_fn (id_cols keeps one row per participant age bracket):
tidyr::pivot_wider(dat, id_cols = participant_age, names_from = contact_age, values_from = contact, values_fn = mean)
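The answers above average the contact column itself. If the goal is instead read as "how many contacts in each bracket does a participant have on average", one possible base R sketch is to count contacts per participant first (zeros included) and then average those counts. That reading of the question is an assumption, and the names counts, ages and mean_contacts are introduced only for illustration.

```r
dat <- data.frame(
  participant     = c(1, 1, 2, 3, 3),
  participant_age = c("18-30", "18-30", "30-40", "18-30", "18-30"),
  contact         = c(1, 2, 1, 1, 2),
  contact_age     = c("18-30", "30-40", "18-30", "18-30", "50-60")
)

# Contacts per participant and contact age bracket; table() keeps the zero counts
counts <- as.data.frame(table(participant = dat$participant,
                              contact_age = dat$contact_age),
                        stringsAsFactors = FALSE)

# Attach each participant's own age bracket
ages <- unique(dat[c("participant", "participant_age")])
counts$participant_age <-
  ages$participant_age[match(counts$participant, ages$participant)]

# Average the per-participant counts within each pair of brackets
mean_contacts <- with(counts,
                      tapply(Freq, list(participant_age, contact_age), mean))
mean_contacts
```

With the five sample rows, participants aged 18-30 average 0.5 contacts in the 30-40 bracket, because only one of the two has such a contact.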
id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You want to do something for each id group, hence the group_by. You need a new column, hence the mutate. The new variable is the old dv scaled by the value dv takes at baseline, which is the expression inside the mutate. The final ungroup removes the grouping again.
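The same idea can be sketched in base R without grouping, by building a lookup of baseline values keyed by id. This assumes exactly one "baseline" row per id; the name baseline is introduced here for illustration.

```r
df <- data.frame(
  id        = rep(1:3, each = 3),
  timepoint = rep(c("baseline", "1min", "2min"), times = 3),
  dv        = c(100, 105, 90, 70, 100, 80, 80, 80, 90)
)

# One baseline value per id, named by id so it can be looked up by name
is_base  <- df$timepoint == "baseline"
baseline <- setNames(df$dv[is_base], df$id[is_base])

# Scale every row by the baseline of its own id
df$pct.chg <- unname(df$dv / baseline[as.character(df$id)] * 100)
df
```

Unlike the position-based solutions, this does not depend on the baseline row coming first within each id.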
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
library(dplyr)
library(stringr) # for str_remove()
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
ungroup() %>%
select(-clean_timepoint)
In base R you can do this (assuming exactly three rows per id, with the baseline row first):
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution (assuming "baseline" always appears as the first record per group):
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.chg <- x$dv / x$dv[1] * 100; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
I want to plot temperature over time at 2 sites. I have temperature readings every 10 minutes from February to April, and I need daily cycles of hourly temperature averages to plot.
I calculated the mean temperature for each hour of each day and tried to create a plot with geom_point and geom_line in different ways.
data <- read.xlsx("temperatura.xlsx", 1)
data <- data %>% mutate(month = as.factor(month), day = as.factor(day), h = as.factor(h), min = as.factor(min))
head (data)
month day h min t.site1 t.site2
2 1 0 0 15.485 16.773
2 1 0 10 15.509 16.773
2 1 0 20 15.557 16.773
2 1 0 30 15.557 16.773
2 1 0 40 15.605 16.773
2 1 0 50 15.605 16.773
str(data)
'data.frame': 12816 obs. of 6 variables:
$ month : Factor w/ 3 levels "2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ h : Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ min : Factor w/ 6 levels "0","10","20",..: 1 2 3 4 5 6 1 2 3 4 ...
$ t.site1: num 15.5 15.5 15.6 15.6 15.6 ...
$ t.site2: num 16.8 16.8 16.8 16.8 16.8 ...
hour <- group_by(data, month, day, h)
mean.h.site1 <- summarize(hour, mean.h.site1 = mean(t.site1))
t1 <- ggplot (data = mean.h.site1, aes(x=h, y=mean.h.site1)) +
geom_line()
t2 <- ggplot(data = mean.h.site1, aes(x=h, y=mean.h.site1, group = month))+
geom_line() +
geom_point()
t3 <- ggplot (data = mean.h.site1, aes(x=day, y=mean.h.site1, group=1))+
geom_point()
I expected a plot of temperature variability over time for each site, but the actual output shows temperature variability within each day.
It's interesting that your data is showing month, day and hour as factor. Is it possible that there are some character values somewhere in that column when you read the data? It's very unusual to see numbers stored as factor in that fashion.
I'll do 4 things:
Convert factors to numbers
Convert numbers to dates
Convert a wide table to a long one, and finally
plot the temps against a real date
# Load packages and data
library(data.table) # for overall fast data processing
library(lubridate) # for dates wrangling
library(ggplot2) # plotting
dt <- fread("month day h min t.site1 t.site2
2 1 0 0 15.485 16.773
2 1 0 10 15.509 16.773
2 1 0 20 15.557 16.773
2 1 0 30 15.557 16.773
2 1 0 40 15.605 16.773
2 1 0 50 15.605 16.773")
# Convert factors to numbers (I actually didn't run this because I just created the data.table, but it seems you'll need to do it):
dt[, names(dt)[1:4] := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 1:4]
# Create proper dates. We'll consider all dates occurring in 2019.
dt[, date := ymd_hm(paste0("2019/", month, "/", day, " ", h, ":", min))]
# convert wide data to long one
dt2 <- melt(dt[, .(date, t.site1, t.site2)], id.vars = "date")
# plot the data
ggplot(dt2, aes(x = date, y = value, color = variable))+geom_point()+geom_path()
You could paste the time columns together and convert them as.POSIXct.
As @PavoDive already pointed out, we'll need numeric time columns. Check the code that produced the data, or transform to numeric with d[1:4] <- Map(function(x) as.numeric(as.character(x)), d[1:4]).
Now paste the columns together row by row with apply, convert with as.POSIXct, and cbind the result to the remaining columns. The sprintf call first zero-pads every value to two digits so that the pasted string matches the format.
d2 <- cbind(time=as.POSIXct(apply(sapply(d[1:4], sprintf, fmt="%02d"), 1, paste, collapse=""),
format="%m%d%H%M"),
d[5:6])
Plots nicely, here in base R:
with(d2, plot(time, t.site1, ylim=c(15, 17), xaxt="n",
xlab="time", ylab="value", type="b", col="red",
main="Time series"))
with(d2, lines(time, t.site2, type="b", col="green"))
mtext(strftime(d2$time, "%H:%M"), 1, 1, at=d2$time) # strftime gives the desired formatting
legend("bottomright", names(d2)[2:3], col=c("red", "green"), lty=rep(1, 2))
Data
d <- structure(list(month = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2", class = "factor"),
day = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"),
h = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "0", class = "factor"),
min = structure(1:6, .Label = c("0", "10", "20", "30", "40",
"50"), class = "factor"), t.site1 = c(15.485, 15.509, 15.557,
15.557, 15.605, 15.605), t.site2 = c(16.773, 16.773, 16.773,
16.773, 16.773, 16.773)), row.names = c(NA, -6L), class = "data.frame")
I'm assuming that you needed the actual output showing temperature variability by hour for each day in the same plot?
EDITED:
I have updated the code to generate a day's worth of data and to produce the chart.
library(tidyverse)
library(lubridate)
df <- tibble(month = rep(2, 144), # tibble() replaces the deprecated data_frame()
day = rep(1, 144),
h = rep(0:23, each = 6), # 24 hours x 6 readings = 144 rows
min = rep((0:5)*10,24),
t.site1 = rnorm(n = 144, mean = 15.501, sd = 0.552),
t.site2 = rnorm(n = 144, mean = 16.501, sd = 0.532))
df %>%
group_by(month, day, h) %>%
summarise(mean_t_site1 = mean(t.site1), mean_t_site2 = mean(t.site2)) %>%
mutate(date = ymd_h(paste0("2019-",month,"-",day," ",h))) %>%
ungroup() %>%
select(mean_t_site1:date) %>%
gather(key = "site", value = "mean_temperature", -date) %>%
ggplot(aes(x = date, y = mean_temperature, colour = site)) +
geom_line()
Could you verify if this is the output you need?
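If "daily cycles of hourly averages" is read as one mean per hour of day, pooled over all days, a base R sketch could look like the following. That reading is an assumption, and the simulated temps data frame only mimics the column layout of the question.

```r
set.seed(1)

# Simulated readings every 10 minutes for 3 days (hypothetical values)
temps <- expand.grid(min = seq(0, 50, by = 10), h = 0:23, day = 1:3, month = 2)
temps$t.site1 <- 15 + rnorm(nrow(temps))
temps$t.site2 <- 16 + rnorm(nrow(temps))

# One mean per hour of day and site, averaged over all days and minutes
cycle <- aggregate(cbind(t.site1, t.site2) ~ h, data = temps, FUN = mean)

# Both sites on one plot, hour of day on the x axis
matplot(cycle$h, as.matrix(cycle[, c("t.site1", "t.site2")]),
        type = "l", lty = 1, col = c("red", "green"),
        xlab = "hour of day", ylab = "mean temperature")
legend("bottomright", c("t.site1", "t.site2"),
       col = c("red", "green"), lty = 1)
```

Adding month (or day) to the right-hand side of the formula would give one cycle per month (or per day) instead of a single pooled cycle.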
I have two data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. They are matched by a primary key, 'pid'. dfnew is a subset of dfold: every variable in dfnew is also in dfold, but with updated, imputed values (no NAs anymore). dfold has additional variables that I will need in the analysis phase. I would like to merge the two data frames into dfmerge so that the common variables are updated from dfnew while the other pre-existing variables in dfold are retained. I have tried merge(), match(), dplyr and sqldf, but I end up either with a dfmerge containing only the 75 updated variables (left join) or with a dfmerge of 250 variables (old variables with NAs and new variables without them side by side). The only way I found (here) is an elegant but fairly long (10 rows) loop that removes the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one exists?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario involves 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, restrict cols to the shared names:
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
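If the aim is to overwrite every common column with the value from dfnew wherever a pid matches (not only where dfold has NA), a match()-based sketch could look like this. The toy data are hypothetical, the names idx, hit and dfmerge are introduced for illustration, and it assumes pid is unique in dfnew.

```r
# Toy data (hypothetical): dfnew covers only pids 1 to 3
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(NA, 27, 30, 38, 40),
  Salary    = c(72000, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
dfnew <- data.frame(Age = c(44, 27, 30),
                    Salary = c(72000, 48000, 54000),
                    pid = c(1, 2, 3))

# Columns present in both data frames, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfnew that corresponds to each row of dfold (NA when there is none)
idx <- match(dfold$pid, dfnew$pid)
hit <- !is.na(idx)

# Copy the updated values; rows without a match keep their old values
dfmerge <- dfold
dfmerge[hit, cols] <- dfnew[idx[hit], cols]
dfmerge
```

dfmerge keeps all five columns and rows of dfold, and row order does not matter because the alignment is done by pid.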
I have already gone through several links, such as "How to convert a factor to an integer\numeric without a loss of information?",
but could not solve the problem.
I have a data frame
SYMBOL PVALUE1 PVALUE2
1 10-Mar 0.813027629406118 0.78820189558684
2 10-Sep 0.00167287722066533 0.00167287722066533
3 11-Mar 0.21179810441316 0.464576340307205
4 11-Sep 0.00221961024320294 0.00221961024320294
5 12-Sep 0.934667427815304 0.986884425214009
6 15-Sep 0.00167287722066533 0.00167287722066533
7 1-Dec 0.464576340307205 0.0911572830792113
8 1-Mar 0.00818426308604705 0.0252302356363697
9 1-Sep 0.60516237199519 0.570568468332992
10 2-Mar 0.0103975819620539 0.00382292568622066
11 2-Sep 0.00167287722066533 0.00167287722066533
When I try str():
str(df)
'data.frame': 20305 obs. of 3 variables:
$ SYMBOL : Factor w/ 21050 levels "","10-Mar","10-Sep",..: 2 3 4 5 6 7 8 9 10 11 ...
$ PVALUE1: Factor w/ 209 levels "0","0.000109570493049298",..: 169 22 110 24 181 22 139 39 149 44 ...
$ PVALUE2: Factor w/ 216 levels "0","0.000109570493049298",..: 172 20 141 23 201 20 90 61 150 29 ...
I try mode()
sapply(df,mode)
SYMBOL PVALUE1 PVALUE2
"numeric" "numeric" "numeric"
When I try to assign values to the two numeric columns (2 and 3) based on the condition below,
df$Score <- rowSums(ifelse(df[,-1]==0, 0,
ifelse(df[, -1]<= 0.05, 2, ifelse(df[,-1]>= 0.065,-2,1))))
I get Warning messages:
1: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
2: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
3: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
4: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
and the output comes like this:
SYMBOL PVALUE1 PVALUE2 Score
1 10-Mar 0.813027629406118 0.78820189558684 NA
2 10-Sep 0.00167287722066533 0.00167287722066533 NA
3 11-Mar 0.21179810441316 0.464576340307205 NA
4 11-Sep 0.00221961024320294 0.00221961024320294 NA
5 12-Sep 0.934667427815304 0.986884425214009 NA
6 15-Sep 0.00167287722066533 0.00167287722066533 NA
If the factor is already numeric, why does the code above not work and give NA instead? How should I proceed?
Edit dput()
structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
I tried this also:
indx <- sapply(df, is.factor)
df[indx] <- lapply(df[indx], function(x) as.numeric(levels(x))[x])
indx returns
SYMBOL PVALUE1 PVALUE2
TRUE TRUE TRUE
Warning message:
In FUN(X[[3L]], ...) : NAs introduced by coercion
Using your dput data, this works just fine:
df = structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
df$PVALUE1 = as.numeric(as.character(df$PVALUE1))
df$PVALUE2 = as.numeric(as.character(df$PVALUE2))
df
# SYMBOL PVALUE1 PVALUE2
# 1 10-Mar 0.813027629 0.788201896
# 2 10-Sep 0.001672877 0.001672877
# 3 11-Mar 0.211798104 0.464576340
# 4 11-Sep 0.002219610 0.002219610
# 5 12-Sep 0.934667428 0.986884425
# 6 15-Sep 0.001672877 0.001672877
sapply(df, class)
# SYMBOL PVALUE1 PVALUE2
# "factor" "numeric" "numeric"
If you have issues doing this to your whole data frame, it's possible you have some irregular rows. However, I also looked at the CSV you provided in the comments, and it looks just fine.
Also note that this is one of several equivalent solutions in the duplicate question that you linked.
To convert all but the first column, you could do
df[, 2:ncol(df)] = lapply(df[, -1], function(x) as.numeric(as.character(x)))
Note that you don't want to convert date columns or SYMBOL columns this way as they aren't numeric.
Similarly, to convert columns named, say PVALUE1 to PVALUE47, you could construct the column names and then convert them:
col_to_convert = paste0("PVALUE", 1:47)
df[, col_to_convert] = lapply(df[, col_to_convert], function(x) as.numeric(as.character(x)))
In general, best practice is to not have these columns as factors in the first place. However you get this data into R probably has a way to specify column classes, e.g., colClasses in read.table, read.csv, etc.
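As a rough illustration of the colClasses route, the sketch below reads two rows of the sample from an inline string (a stand-in for the real file, whose name is not known here) and then applies the scoring expression from the question. The names csv_text, pv and m are introduced for illustration.

```r
# Inline stand-in for the real CSV file
csv_text <- "SYMBOL,PVALUE1,PVALUE2
10-Mar,0.813027629406118,0.78820189558684
10-Sep,0.00167287722066533,0.00167287722066533"

# Declare the column types up front so no factors are created
pv <- read.csv(text = csv_text,
               colClasses = c("character", "numeric", "numeric"))

# With numeric columns the comparisons from the question work as intended
m <- as.matrix(pv[, -1])
pv$Score <- rowSums(ifelse(m == 0, 0,
                    ifelse(m <= 0.05, 2,
                    ifelse(m >= 0.065, -2, 1))))
pv
```

The first row scores -4 (both p-values are at least 0.065) and the second scores 4 (both are at most 0.05).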
An option using data.table
library(data.table)
setDT(df)[, 2:3 := lapply(.SD, function(x)
as.numeric(levels(x))[x]), .SDcols=2:3]
Or, for a slightly faster version, use set
indx <- which(sapply(df, is.factor) & grepl('PVALUE', names(df)))
setDT(df)
for(j in indx){
set(df, i=NULL, j=j, value= as.numeric(levels(df[[j]]))[df[[j]]])
}
I guess you got the warning because the 'indx' you created also included the first column (it is a factor too), which is non-numeric. Converting non-numeric factor levels to numeric coerces those elements to NA.
According to ?factor
To transform a factor ‘f’ to approximately its
original numeric values, ‘as.numeric(levels(f))[f]’ is recommended
and slightly more efficient than ‘as.numeric(as.character(f))’.
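A small sketch of why the distinction matters, and of why mode() reported "numeric" in the question: a factor is stored as integer codes, so mode() describes the codes, not the labels. The vector f is a hypothetical example.

```r
f <- factor(c("0.5", "0.01", "0.5"))

mode(f)        # "numeric": describes the underlying integer codes
is.numeric(f)  # FALSE: a factor is not treated as a numeric vector

codes  <- as.numeric(f)             # 2 1 2, positions in levels(f)
values <- as.numeric(levels(f))[f]  # 0.50 0.01 0.50, the original numbers
same   <- as.numeric(as.character(f))

codes
values
```

Calling as.numeric() directly on a factor therefore returns the codes without any warning, which is the classic silent mistake.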