I am trying to convert every value of a data.frame column to a factor so that I can use the values as the groups in a boxplot. However, with both as.factor() and factor(), every value becomes <NA>. There are 5 different cell types in the column (CD8, CD4, Bcell, Mono, Gran), and all of them turn to NA.
Confusingly, when I apply the function to just one row of the column, it works perfectly fine. The data frame is very large (over 3 million rows); could this be the cause of the issue?
Code :
> head(BP)
Methylation Cell_Type
1 0.03219298 CD8
2 0.11684228 CD8
3 0.04214158 CD8
4 0.26700497 CD8
5 0.34251732 CD8
6 0.34231208 CD8
> BP$Cell_Type <- as.factor(BP$Cell_Type)
> head(BP)
Methylation Cell_Type
1 0.03219298 <NA>
2 0.11684228 <NA>
3 0.04214158 <NA>
4 0.26700497 <NA>
5 0.34251732 <NA>
6 0.34231208 <NA>
Unsure why this is happening - any advice would be greatly appreciated!
Thanks
Output of dput(head(BP))
> dput(head(BP))
structure(list(Methylation = c(0.0321929818018839,
0.116842281589967,
0.0421415803696093, 0.267004971824527, 0.342517319094108,
0.342312083101948
), Cell_Type = structure(list(Cell_Type = structure(c(3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Bcell", "CD4", "CD8", "Gran", "Mono"
), class = "factor")), row.names = c(NA, 6L), class =
"data.frame")), row.names = c(NA,
6L), class = "data.frame")
Maybe make sure Cell_Type is a character first?
BP <- tibble::tribble(
~Methylation, ~Cell_Type,
0.03219298, "CD8",
0.11684228, "CD8",
0.04214158, "CD8",
0.26700497, "CD8",
0.34251732, "CD8",
0.34231208, "CD8")
BP$Cell_Type <- as.factor(BP$Cell_Type)
print(BP)
Methylation Cell_Type
<dbl> <fct>
1 0.0322 CD8
2 0.117 CD8
3 0.0421 CD8
4 0.267 CD8
5 0.343 CD8
6 0.342 CD8
Or simply
BP$Cell_Type <- as.factor(as.character(BP$Cell_Type))
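The dput output in the question suggests a different cause: Cell_Type is not a plain vector but a one-column data.frame nested inside BP, and the factor is stored inside that inner data.frame. Below is a minimal sketch of how one might confirm and undo the nesting, assuming the real data has the same shape as the six rows posted. The names `inner` and `was_nested` are illustrative only.

```r
# Rebuild the shape shown in the dput output: Cell_Type is a one-column
# data.frame nested inside BP, not a plain vector.
inner <- data.frame(
  Cell_Type = factor(rep("CD8", 6),
                     levels = c("Bcell", "CD4", "CD8", "Gran", "Mono"))
)
BP <- data.frame(Methylation = c(0.03219298, 0.11684228, 0.04214158,
                                 0.26700497, 0.34251732, 0.34231208))
BP$Cell_Type <- inner   # assigning a data.frame to a column nests it

# Diagnostic: TRUE means the column is itself a data.frame.
was_nested <- is.data.frame(BP$Cell_Type)

# Replace the nested data.frame with the factor it contains.
BP$Cell_Type <- BP$Cell_Type$Cell_Type
```

After this step str(BP) should report Cell_Type as a factor with 5 levels, and no further call to as.factor() is needed.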
I ran the following Indicator Species Analysis (indval) code from the labdsv package in R on a data frame called "data", where species abundances are columns and sites are rows, as below:
Site Species X Species Y Species Z etc
1 10 3 5
2 5 15 220
3 0 1 0
4 21 100 3
A separate file holds the corresponding group for each site, either group 1 or group 2 (I called this spe.grp), as follows:
Groups
1
2
1
2
I removed categorical variables so that spe.only has only the species data
spe.only <- data[,2:1521]
I then removed species which do not occur in any sample
spe.only[, (!apply(spe.only==0,2,all))]
I then ran Indicator species based on Groups (1) or (2)
(iva <- indval(spe.only, spe.grp$Groups))
But I get
"Error in indval.default(spe.only, spe.grp$Status) : All species
must occur in at least one plot"
How do I resolve this error so that I can run indval correctly?
The step
spe.only[, (!apply(spe.only==0,2,all))]
was not assigned back to the original object. Without the assignment, the result is only printed to the console and spe.only itself is left unchanged:
spe.only <- spe.only[, (!apply(spe.only==0,2,all))]
Now do the indval
> library(labdsv)
> indval(spe.only, spe.grp$Groups)
$relfrq
1 2
SpeciesX 0.5 1
SpeciesY 1.0 1
SpeciesZ 0.5 1
$relabu
1 2
SpeciesX 0.27777778 0.7222222
SpeciesY 0.03361345 0.9663866
SpeciesZ 0.02192982 0.9780702
$indval
1 2
SpeciesX 0.13888889 0.7222222
SpeciesY 0.03361345 0.9663866
SpeciesZ 0.01096491 0.9780702
$maxcls
SpeciesX SpeciesY SpeciesZ
2 2 2
$indcls
SpeciesX SpeciesY SpeciesZ
0.7222222 0.9663866 0.9780702
$pval
SpeciesX SpeciesY SpeciesZ
0.678 0.319 0.671
The error is reproducible on the original 'spe.only' object
> indval(spe.only, spe.grp$Groups)
Error in indval.default(spe.only, spe.grp$Groups) :
All species must occur in at least one plot
data
spe.only <- structure(list(SpeciesX = c(10L, 5L, 0L, 21L), SpeciesY = c(3L,
15L, 1L, 100L), SpeciesZ = c(5L, 220L, 0L, 3L), SpeciesD = c(0,
0, 0, 0)), row.names = c(NA, -4L), class = "data.frame")
spe.grp <- structure(list(Groups = c(1, 2, 1, 2)),
class = "data.frame", row.names = c(NA,
-4L))
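An equivalent way to drop the empty species, sketched here in base R only, is to count the non-zero entries per column with colSums and keep the columns where that count is positive. It assumes the abundance table contains no NA values. The name `occurs` is illustrative.

```r
# Same toy data as above: SpeciesD never occurs in any plot.
spe.only <- data.frame(
  SpeciesX = c(10L, 5L, 0L, 21L),
  SpeciesY = c(3L, 15L, 1L, 100L),
  SpeciesZ = c(5L, 220L, 0L, 3L),
  SpeciesD = c(0, 0, 0, 0)
)

# TRUE for every species recorded in at least one plot.
occurs <- colSums(spe.only != 0) > 0

# Assign the result back; drop = FALSE keeps a data.frame even if only
# one species remains.
spe.only <- spe.only[, occurs, drop = FALSE]
```

The important part is the assignment on the last line; the filtered table can then be passed to indval.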
I have one dataframe looking as follows:
Date Element Problem Losses
1 2020-09-29 54 Energy loss NA
2 2020-09-30 54 Fault NA
3 2020-09-30 54 Energy loss NA
4 2020-09-29 40 Cooling NA
5 2020-09-29 50 Voltage NA
I would like to insert certain values in the Losses column whenever the problem column has the substring "Energy".
The values I need to insert are in another dataframe, looking like this:
Date Element Losses
1 2020-09-29 54 13.24
2 2020-09-30 54 12.16
This is just an example, as the actual data frames I'm using are pretty big, so I'd like to do this with some type of merge on the Date and Element columns instead of looping through both data frames.
EDIT:
I've tried merging by the Element column, so that I first get the Losses repeatedly for all the corresponding elements, and then setting the rows that don't contain my desired substring back to NaN.
My problem here is that merging by Element deletes all my other rows, leaving only the following:
Date Element Problem Losses
1 2020-09-29 54 Energy loss 13.24
2 2020-09-30 54 Fault NA
3 2020-09-30 54 Energy loss 12.16
Base R solution:
transform(df, Losses = insert_df$Losses[match(paste0(Date, Element, grepl("Energy", Problem)),
paste0(insert_df$Date, insert_df$Element, "TRUE"))])
Data:
df <- structure(list(Date = structure(c(18534, 18535, 18535, 18534,
18534), class = "Date"), Element = c(54L, 54L, 54L, 40L, 50L),
Problem = c("Energy loss", "Fault", "Energy loss", "Cooling",
"Voltage"), Losses = c(NA, NA, NA, NA, NA)), row.names = c(NA,
-5L), class = "data.frame")
insert_df <- structure(list(Date = structure(18534:18535, class = c("IDate",
"Date")), Element = c(54L, 54L), Losses = c(13.24, 12.16)), class = "data.frame", row.names = c(NA,
-2L))
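Since the question asks for a merge, here is a sketch of the same result with merge() in base R. It assumes the all-NA Losses column is dropped from df first (here df is simply built without it) and that Date is stored as a plain Date in both data frames. Note that merge() sorts the rows by the key columns, so the row order differs from the input. The name `out` is illustrative.

```r
# df without its all-NA Losses column, so the merged Losses is not suffixed.
df <- data.frame(
  Date = as.Date(c("2020-09-29", "2020-09-30", "2020-09-30",
                   "2020-09-29", "2020-09-29")),
  Element = c(54L, 54L, 54L, 40L, 50L),
  Problem = c("Energy loss", "Fault", "Energy loss", "Cooling", "Voltage"),
  stringsAsFactors = FALSE
)
insert_df <- data.frame(
  Date = as.Date(c("2020-09-29", "2020-09-30")),
  Element = c(54L, 54L),
  Losses = c(13.24, 12.16)
)

# all.x = TRUE keeps every row of df, including those with no match.
out <- merge(df, insert_df, by = c("Date", "Element"), all.x = TRUE)

# Blank out Losses wherever the problem is not an "Energy" one.
out$Losses[!grepl("Energy", out$Problem)] <- NA
```

Merging on both Date and Element, together with all.x = TRUE, is what keeps the Cooling and Voltage rows that a merge on Element alone would drop.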
Fish have been caught using different fishing methods.
I would like to merge rows based on Species (that is, rows for the same fish species). If a species was caught by both the Bottom fishing and Trolling methods, the two rows should collapse into one, with the Method value changed to Both.
For example, Caranx ignobilis will have a new Method value of Both. The Bait, Released and Kept columns should also have their values on the same row.
Species Method Bait Released Kept
4 Caranx ignobilis Both NA 1 1
It seems so simple yet I have been scratching my head for hours and toying around with case_when as part of the tidyverse package.
The tibble is a result of previously sub-setting data using group_by and pivot_wider.
This is what the sample looks like:
# A tibble: 10 x 5
# Groups: Species [9]
Species Method Bait Released Kept
<chr> <fct> <int> <int> <int>
1 Aethaloperca rogaa Bottom fishing NA NA 2
2 Aprion virescens Bottom fishing NA NA 1
3 Balistidae spp. Bottom fishing NA NA 1
4 Caranx ignobilis Trolling NA NA 1
5 Caranx ignobilis Bottom fishing NA 1 NA
6 Epinephelus fasciatus Bottom fishing NA 3 NA
7 Epinephelus multinotatus Bottom fishing NA NA 5
8 Other species Bottom fishing NA 1 NA
9 Thunnus albacares Trolling NA NA 1
10 Variola louti Bottom fishing NA NA 1
Data:
fish_catch <- structure(list(Species = c("Aethaloperca rogaa", "Aprion virescens","Balistidae spp.", "Caranx ignobilis", "Caranx ignobilis", "Epinephelus fasciatus","Epinephelus multinotatus", "Other species", "Thunnus albacares","Variola louti"),
Method = structure(c(1L, 1L, 1L, 2L, 1L, 1L,1L, 1L, 2L, 1L), .Label = c("Bottom fishing", "Trolling"), class = "factor"),Bait = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_),
Released = c(NA, NA, NA, NA, 1L, 3L, NA, 1L,NA, NA),
Kept = c(2L, 1L, 1L, 1L, NA, NA, 5L, NA, 1L, 1L)), class = c("grouped_df","tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(Species = c("Aethaloperca rogaa", "Aprion virescens",
"Balistidae spp.","Caranx ignobilis", "Epinephelus fasciatus", "Epinephelus multinotatus","Other species", "Thunnus albacares", "Variola louti"), .rows = list(1L, 2L, 3L, 4:5, 6L, 7L, 8L, 9L, 10L)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"), .drop = FALSE))
The route I was going down but then I realised it's not incorporating Species or the other columns
mutate(Method = case_when(Method == "Bottom fishing" & Method == "Trolling" ~ "Both",
Method == "Bottom fishing" ~ "Bottom fishing",
Method == "Trolling" ~ "Trolling", TRUE ~ as.character(MethodCaught)))
Here is one approach using tidyverse. You can group_by(Species) and set Method to "Both" if both Bottom fishing and Trolling are included in Method within that Species. Then afterwards, you can group_by both Species and Method, and use fill to replace NA with known values. In the end, use slice to keep one row for each Species/Method. This assumes you would have otherwise 1 row for each Species/Method - please let me know if this is not the case.
library(tidyverse)
fish_catch %>%
group_by(Species) %>%
mutate(Method = ifelse(all(c("Bottom fishing", "Trolling") %in% Method), "Both", as.character(Method))) %>%
group_by(Species, Method) %>%
fill(c(Bait, Released, Kept), .direction = "updown") %>%
slice(1)
Output
# A tibble: 9 x 5
# Groups: Species, Method [9]
Species Method Bait Released Kept
<chr> <chr> <int> <int> <int>
1 Aethaloperca rogaa Bottom fishing NA NA 2
2 Aprion virescens Bottom fishing NA NA 1
3 Balistidae spp. Bottom fishing NA NA 1
4 Caranx ignobilis Both NA 1 1
5 Epinephelus fasciatus Bottom fishing NA 3 NA
6 Epinephelus multinotatus Bottom fishing NA NA 5
7 Other species Bottom fishing NA 1 NA
8 Thunnus albacares Trolling NA NA 1
9 Variola louti Bottom fishing NA NA 1
This should get you started. You can add the other columns to the summarize function.
library(tidyverse)
fish_catch %>% select(-Bait, -Released, -Kept) %>%
group_by(Species) %>%
summarize(Method = paste0(Method, collapse = "")) %>%
mutate(Method = fct_recode(Method, "both" = "TrollingBottom fishing"))
# A tibble: 9 x 2
Species Method
<chr> <fct>
1 Aethaloperca rogaa Bottom fishing
2 Aprion virescens Bottom fishing
3 Balistidae spp. Bottom fishing
4 Caranx ignobilis both
5 Epinephelus fasciatus Bottom fishing
6 Epinephelus multinotatus Bottom fishing
7 Other species Bottom fishing
8 Thunnus albacares Trolling
9 Variola louti Bottom fishing
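For comparison, here is a base R sketch that collapses all columns in one pass, without any tidyverse dependency. It assumes each count column has at most one known value per species and keeps the first non-NA value. The data frame `fish` is a three-row excerpt of the sample, and the helper names `first_known` and `collapse_one` are hypothetical.

```r
# Three-row excerpt of the sample data, as a plain data.frame.
fish <- data.frame(
  Species = c("Caranx ignobilis", "Caranx ignobilis", "Thunnus albacares"),
  Method = factor(c("Trolling", "Bottom fishing", "Trolling")),
  Bait = c(NA_integer_, NA_integer_, NA_integer_),
  Released = c(NA, 1L, NA),
  Kept = c(1L, NA, 1L),
  stringsAsFactors = FALSE
)

# First non-NA value of a vector, or NA if every value is missing.
first_known <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA_integer_
}

# Collapse all rows of one species into a single row.
collapse_one <- function(g) {
  methods <- unique(as.character(g$Method))
  both <- all(c("Bottom fishing", "Trolling") %in% methods)
  data.frame(
    Species = g$Species[1],
    Method = if (both) "Both" else methods[1],
    Bait = first_known(g$Bait),
    Released = first_known(g$Released),
    Kept = first_known(g$Kept),
    stringsAsFactors = FALSE
  )
}

collapsed <- do.call(rbind, lapply(split(fish, fish$Species), collapse_one))
rownames(collapsed) <- NULL
```

If a species could have several known values in the same column, summing them with sum(v, na.rm = TRUE) may be a better choice than taking the first one.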
I was wondering if someone here can help me with a lapply question.
Every month, data are extracted and the data frames are named according to the date extracted (01-08-2019,01-09-2019,01-10-2019 etc). The contents of each data frame are similar to the example below:
01-09-2019
ID DOB
3 01-07-2019
5 01-06-2019
7 01-05-2019
8 01-09-2019
01-10-2019
ID DOB
2 01-10-2019
5 01-06-2019
8 01-09-2019
9 01-02-2019
As the months roll on, there are more data sets being downloaded.
I am wanting to calculate the ages of people in each of the data sets based on the date the data was extracted - so in essence, the age would be the date difference between the data frame name and the DOB variable.
01-09-2019
ID DOB AGE(months)
3 01-07-2019 2
5 01-06-2019 3
7 01-05-2019 4
8 01-09-2019 0
01-10-2019
ID DOB AGE(months)
2 01-10-2019 0
5 01-06-2019 4
8 01-09-2019 1
9 01-02-2019 8
I was thinking of putting all of the data frames together in a list (as there are a lot) and then using lapply to calculate age across all data frames. How do I go about calculating the difference between a data frame name and a column?
If I may suggest a slightly different approach: it might make more sense to combine your list into a single data frame before calculating the ages. Suppose your data looks something like this, i.e. a list of data frames where the list element names are the dates of access:
$`01-09-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 3 2019-07-01
2 5 2019-06-01
3 7 2019-05-01
4 8 2019-09-01
$`01-10-2019`
# A tibble: 4 x 2
ID DOB
<dbl> <date>
1 2 2019-10-01
2 5 2019-06-01
3 8 2019-09-01
4 9 2019-02-01
You can call bind_rows first with parameter .id = "date_extracted" to turn your list into a data frame, and then calculate age in months.
library(tidyverse)
library(lubridate)
tib <- bind_rows(tib_list, .id = "date_extracted") %>%
mutate(date_extracted = dmy(date_extracted),
DOB = dmy(DOB),
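# include the year difference so ages spanning a new year come out right
age_months = 12 * (year(date_extracted) - year(DOB)) + month(date_extracted) - month(DOB)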
)
#### OUTPUT ####
# A tibble: 8 x 4
date_extracted ID DOB age_months
<date> <dbl> <date> <dbl>
1 2019-09-01 3 2019-07-01 2
2 2019-09-01 5 2019-06-01 3
3 2019-09-01 7 2019-05-01 4
4 2019-09-01 8 2019-09-01 0
5 2019-10-01 2 2019-10-01 0
6 2019-10-01 5 2019-06-01 4
7 2019-10-01 8 2019-09-01 1
8 2019-10-01 9 2019-02-01 8
This can be solved with lapply as well, but we can also use Map to iterate over the list and its names after putting all the data frames in a list. In base R,
Map(function(x, y) {
# DOB and the list names are day-month-year strings, so as.Date needs an explicit format
x$DOB <- as.Date(x$DOB, format = "%d-%m-%Y")
transform(x, age = as.integer(format(as.Date(y, format = "%d-%m-%Y"), "%m")) -
as.integer(format(x$DOB, "%m")))
}, list_df, names(list_df))
#$`01-09-2019`
# ID DOB age
#1 3 2019-07-01 2
#2 5 2019-06-01 3
#3 7 2019-05-01 4
#4 8 2019-09-01 0
#$`01-10-2019`
# ID DOB age
#1 2 2019-10-01 0
#2 5 2019-06-01 4
#3 8 2019-09-01 1
#4 9 2019-02-01 8
We can also do the same in tidyverse
library(dplyr)
library(lubridate)
# parse the day-month-year strings explicitly before taking the month
purrr::imap(list_df, ~ .x %>% mutate(age = month(dmy(.y)) - month(dmy(as.character(DOB)))))
data
list_df <- list(`01-09-2019` = structure(list(ID = c(3L, 5L, 7L, 8L),
DOB = structure(c(3L, 2L, 1L, 4L), .Label = c("01-05-2019", "01-06-2019",
"01-07-2019", "01-09-2019"), class = "factor")), class = "data.frame",
row.names = c(NA, -4L)), `01-10-2019` = structure(list(ID = c(2L, 5L, 8L, 9L),
DOB = structure(c(4L, 2L, 3L, 1L), .Label = c("01-02-2019",
"01-06-2019", "01-09-2019", "01-10-2019"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L)))
It's bad practice to use dates and numbers as data frame names; consider prefixing the date with an "x", as shown below in this base R solution:
df_list <- list(x01_09_2019 = `01-09-2019`, x01_10_2019 = `01-10-2019`)
df_list <- mapply(cbind, "report_date" = names(df_list), df_list, SIMPLIFY = FALSE)
df_list <- lapply(df_list, function(x){
x$report_date <- as.Date(gsub("_", "-", gsub("x", "", x$report_date)), "%d-%m-%Y")
x$Age <- x$report_date - x$DOB
return(x)
}
)
Data:
`01-09-2019` <- structure(list(ID = c(3, 5, 7, 8),
DOB = structure(c(18078, 18048, 18017, 18140), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
`01-10-2019` <- structure(list(ID = c(2, 5, 8, 9),
DOB = structure(c(18170, 18048, 18140, 17928), class = "Date")),
class = "data.frame", row.names = c(NA, -4L))
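Subtracting calendar month numbers alone gives the wrong age once the DOB and the extraction date fall in different years. Below is a sketch of a year-aware helper in base R, applied across a named list with Map. It assumes the list names and DOB values are day-month-year strings, and it counts calendar-month boundaries only, ignoring the day of the month. The names `months_between` and `aged` are hypothetical.

```r
# Whole calendar months from `from` to `to`, including the year difference.
months_between <- function(from, to) {
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  12L * (to$year - from$year) + (to$mon - from$mon)
}

# Toy list named by extraction date; one DOB lies in the previous year.
list_df <- list(
  `01-09-2019` = data.frame(ID = c(3L, 5L),
                            DOB = c("01-07-2019", "01-11-2018"),
                            stringsAsFactors = FALSE),
  `01-10-2019` = data.frame(ID = c(2L, 9L),
                            DOB = c("01-10-2019", "01-02-2019"),
                            stringsAsFactors = FALSE)
)

# Map passes each data frame together with its name (the extraction date).
aged <- Map(function(d, extracted) {
  d$DOB <- as.Date(d$DOB, format = "%d-%m-%Y")
  d$age_months <- months_between(d$DOB,
                                 as.Date(extracted, format = "%d-%m-%Y"))
  d
}, list_df, names(list_df))
```

Map keeps the list names, so aged[["01-09-2019"]] holds the first data frame with its new age_months column.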
I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 data frames are matched by a primary key ('pid'). dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). dfold also has more variables, which I will need in the analysis phase. I would like to merge the 2 data frames into dfmerge so that the common variables are updated from dfnew to dfold while the other pre-existing variables in dfold are retained. I have tried merge(), match(), dplyr, and sqldf, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs coexisting with new variables without them). The only way I found (here) is an elegant but fairly long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
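Alright, this solution assumes that you don't really need a merge but only want to update NA values in dfold with the imputed values in dfnew. It also relies on both data frames holding the same pids in the same row order, as in the example below.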
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
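When dfnew covers only some of the pids, or lists them in a different order, a lookup with match() is one way to align the rows first. The sketch below overwrites the shared columns wherever a pid is found in the new data, rather than only filling NAs. It assumes the shared columns are numeric and that pid is unique in the new data. The function name `update_from` and the toy data frames `old` and `new` are hypothetical.

```r
# Overwrite the shared columns of `old` with values from `new`, matched on key.
update_from <- function(old, new, key = "pid") {
  cols <- setdiff(intersect(names(new), names(old)), key)
  idx <- match(old[[key]], new[[key]])  # row of `new` for each row of `old`
  hit <- !is.na(idx)                    # rows of `old` that have a match
  for (col in cols) {
    old[[col]][hit] <- new[[col]][idx[hit]]
  }
  old
}

# Toy data: `new` covers only pids 1 and 2, in a different order.
old <- data.frame(
  Country = c("France", "Spain", "Germany", "Spain"),
  Age = c(NA, 27, 30, 38),
  Salary = c(72000, NA, 54000, NA),
  pid = 1:4,
  stringsAsFactors = FALSE
)
new <- data.frame(Age = c(27, 44), Salary = c(48000, 72000), pid = c(2, 1))

res <- update_from(old, new)
```

Rows whose pid does not appear in the new data are left untouched, so the Salary for pid 4 stays NA in this example, and columns that exist only in the old data, such as Country, are kept as they are.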