I have one dataframe looking as follows:
        Date Element     Problem Losses
1 2020-09-29      54 Energy loss     NA
2 2020-09-30      54       Fault     NA
3 2020-09-30      54 Energy loss     NA
4 2020-09-29      40     Cooling     NA
5 2020-09-29      50     Voltage     NA
I would like to insert certain values in the Losses column whenever the Problem column contains the substring "Energy".
The values I need to insert are in another dataframe, looking like this:
        Date Element Losses
1 2020-09-29      54  13.24
2 2020-09-30      54  12.16
This is just an example; the actual dataframes I'm using are pretty big, so I'd like to do this with some type of merge on the Date and Element columns instead of looping through both dataframes.
EDIT:
I've tried merging by the Element column, so that I first get the Losses repeated for all the corresponding elements, and then setting the rows that don't contain my desired substring back to NA.
My problem here is that merging by Element deletes all my other rows, leaving only the following:
        Date Element     Problem Losses
1 2020-09-29      54 Energy loss  13.24
2 2020-09-30      54       Fault     NA
3 2020-09-30      54 Energy loss  12.16
Base R solution:
transform(df, Losses = insert_df$Losses[match(paste0(Date, Element, grepl("Energy", Problem)),
paste0(insert_df$Date, insert_df$Element, "TRUE"))])
Data:
df <- structure(list(Date = structure(c(18534, 18535, 18535, 18534,
18534), class = "Date"), Element = c(54L, 54L, 54L, 40L, 50L),
Problem = c("Energy loss", "Fault", "Energy loss", "Cooling",
"Voltage"), Losses = c(NA, NA, NA, NA, NA)), row.names = c(NA,
-5L), class = "data.frame")
insert_df <- structure(list(Date = structure(18534:18535, class = c("IDate",
"Date")), Element = c(54L, 54L), Losses = c(13.24, 12.16)), class = "data.frame", row.names = c(NA,
-2L))
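Since the question asks for a merge on Date and Element, here is a minimal sketch of that route, assuming the Losses column in df is empty to begin with (so it is left out of the rebuilt data below) and that each Date/Element pair appears at most once in insert_df. The name merged is only illustrative. Note that merge() sorts the result by the key columns, so the original row order is not preserved.

```r
# Example data rebuilt by hand from the question (Losses omitted from df on purpose)
df <- data.frame(
  Date    = as.Date(c("2020-09-29", "2020-09-30", "2020-09-30",
                      "2020-09-29", "2020-09-29")),
  Element = c(54L, 54L, 54L, 40L, 50L),
  Problem = c("Energy loss", "Fault", "Energy loss", "Cooling", "Voltage"),
  stringsAsFactors = FALSE
)
insert_df <- data.frame(
  Date    = as.Date(c("2020-09-29", "2020-09-30")),
  Element = c(54L, 54L),
  Losses  = c(13.24, 12.16)
)

# Left join: all.x = TRUE keeps every row of df, matched or not
merged <- merge(df, insert_df, by = c("Date", "Element"), all.x = TRUE)

# Blank out Losses on rows whose Problem does not mention "Energy"
merged$Losses[!grepl("Energy", merged$Problem)] <- NA
merged
```

Using all.x = TRUE is what prevents the unmatched rows (Elements 40 and 50) from being dropped.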
I have a dataframe with a column called 's_nummer'. This column is sometimes NA and in that case, I would like to add a number myself that can range from 700001 to 800000. So in this case, row numbers 3 and 4 do not contain a value in the s_nummer column and I would like to add the values 700001 to row 3 and 700002 to row 4.
dput:
structure(list(s_nummer = c(599999, 599999, NA, NA), eerste_voornaam = c("Debbie",
"Debbie", "Debbie", "Debbie"), tussenvoegsel = c(NA, NA, NA,
NA), geslachtsnaam = c("Oomen", "Oomen", "Oomen", "Oomen")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Hope you can help!
Thanks in advance
You can use which with is.na to get the rows with NA in x$s_nummer and overwrite them with 700000 + seq_along(i).
i <- which(is.na(x$s_nummer))
x$s_nummer[i] <- 700000 + seq_along(i)
# s_nummer eerste_voornaam tussenvoegsel geslachtsnaam
#1 599999 Debbie NA Oomen
#2 599999 Debbie NA Oomen
#3 700001 Debbie NA Oomen
#4 700002 Debbie NA Oomen
I am trying to convert every value of a data.frame column into a factor so that I can use the values as the "groups" in a boxplot. However, with both as.factor() and factor(), every value turns into <NA>. There are 5 different cell types in the column (CD8, CD4, Bcell, Mono, Gran), and all of them turn to NA.
Confusingly, when I apply the function to just one row of the column, it works perfectly fine. The dataframe is very large (over 3 million rows); could this be the cause of the issue?
Code :
> head(BP)
Methylation Cell_Type
1 0.03219298 CD8
2 0.11684228 CD8
3 0.04214158 CD8
4 0.26700497 CD8
5 0.34251732 CD8
6 0.34231208 CD8
> BP$Cell_Type <- as.factor(BP$Cell_Type)
> head(BP)
Methylation Cell_Type
1 0.03219298 <NA>
2 0.11684228 <NA>
3 0.04214158 <NA>
4 0.26700497 <NA>
5 0.34251732 <NA>
6 0.34231208 <NA>
Unsure why this is happening - any advice would be greatly appreciated!
Thanks
Output of dput(head(BP)):
> dput(head(BP))
structure(list(Methylation = c(0.0321929818018839,
0.116842281589967,
0.0421415803696093, 0.267004971824527, 0.342517319094108,
0.342312083101948
), Cell_Type = structure(list(Cell_Type = structure(c(3L, 3L,
3L, 3L, 3L, 3L), .Label = c("Bcell", "CD4", "CD8", "Gran", "Mono"
), class = "factor")), row.names = c(NA, 6L), class =
"data.frame")), row.names = c(NA,
6L), class = "data.frame")
Maybe make sure Cell_Type is a character first?
BP <- tibble::tribble(
~Methylation, ~Cell_Type,
0.03219298, "CD8",
0.11684228, "CD8",
0.04214158, "CD8",
0.26700497, "CD8",
0.34251732, "CD8",
0.34231208, "CD8")
BP$Cell_Type <- as.factor(BP$Cell_Type)
print(BP)
Methylation Cell_Type
<dbl> <fct>
1 0.0322 CD8
2 0.117 CD8
3 0.0421 CD8
4 0.267 CD8
5 0.343 CD8
6 0.342 CD8
Or simply
BP$Cell_Type <- as.factor(as.character(BP$Cell_Type))
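Note that the dput output in the question suggests Cell_Type is not a plain vector but a one-column data frame stored inside BP, in which case going through as.character() may not help either. A minimal sketch, assuming that structure (the data below is rebuilt by hand from the dput, and was_nested is just an illustrative name): pull the inner factor out and store it as the column.

```r
# Rebuild the nested structure shown in the dput: Cell_Type is itself a data frame
BP <- data.frame(
  Methylation = c(0.0321929818018839, 0.116842281589967, 0.0421415803696093,
                  0.267004971824527, 0.342517319094108, 0.342312083101948)
)
BP$Cell_Type <- data.frame(
  Cell_Type = factor(rep("CD8", 6),
                     levels = c("Bcell", "CD4", "CD8", "Gran", "Mono"))
)

# Check whether the column is a data frame rather than a vector
was_nested <- is.data.frame(BP$Cell_Type)

# Replace the nested data frame with the factor it contains
if (was_nested) {
  BP$Cell_Type <- BP$Cell_Type$Cell_Type
}

str(BP)
```

If the inner column is already a factor, as it appears to be here, no further conversion should be needed before plotting.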
I am very new to R programming and am trying to determine the number of days apportioned per month between two dates.
I have a dataset that has the following structure:
from_date
to_date
quantity
Example data:
2019-06-15 2019-09-10 55
2019-07-11 2019-10-05 17
I would like to call a function that returns a dataset/vector? that holds 3 values as there will be a maximum difference between from_date and to_date of 3 months.
I have tried using lubridate::floor_date() to work backward from the to_date
Not sure if you are looking for some result like below:
df$quantity <- with(df,as.Date(to_date)-as.Date(from_date))
or
df$quantity <- apply(df, 1, function(v) diff(as.Date(v)))
yielding
> df
from_date to_date quantity
1 2019-06-15 2019-09-10 87
2 2019-07-11 2019-10-05 86
Data
df <- structure(list(from_date = structure(1:2, .Label = c("2019-06-15",
"2019-07-11"), class = "factor"), to_date = structure(1:2, .Label = c("2019-09-10",
"2019-10-05"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
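If what is needed is the number of days falling in each calendar month (rather than the total difference), here is a minimal base R sketch. The helper name days_per_month is hypothetical, and it assumes both end dates are counted, so the per-month counts sum to one more than the plain date difference.

```r
# Count the days of an interval that fall in each calendar month.
# Assumption: from_date and to_date are both included in the count.
days_per_month <- function(from_date, to_date) {
  days <- seq(as.Date(from_date), as.Date(to_date), by = "day")
  table(format(days, "%Y-%m"))
}

res <- days_per_month("2019-06-15", "2019-09-10")
res
```

The result is a named table with one entry per month touched by the interval, so it can be longer or shorter than 3 values depending on the dates.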
I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time, dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew --> dfold while retaining the pre-existing variables in dfold. I have tried merge(), match(), and the dplyr and sqldf packages, but I either obtain a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 rows) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the dataframes:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
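The comparison dfnew$pid == dfold$pid above relies on both dataframes having the same rows in the same order. A minimal sketch for the case where that does not hold, assuming pid is unique in dfnew: look up each dfold row in dfnew with match() and fill only the NA cells. The example data is made up for illustration (dfnew here covers just two pids, in a different order), and idx and fill are illustrative names.

```r
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(NA, 27, 30, 38, 40),
  Salary    = c(72000, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
# dfnew deliberately has fewer rows and a different order than dfold
dfnew <- data.frame(Age = c(40, 44), Salary = c(70000, 72000), pid = c(5, 1))

# Shared columns other than the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfnew that corresponds to each row of dfold (NA if there is none)
idx <- match(dfold$pid, dfnew$pid)

for (col in cols) {
  # Fill only cells that are NA in dfold and have a matching row in dfnew
  fill <- is.na(dfold[[col]]) & !is.na(idx)
  dfold[[col]][fill] <- dfnew[[col]][idx[fill]]
}
dfold
```

All other columns of dfold (Country, Purchased) are left untouched, so no *.x / *.y cleanup is needed afterwards.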
I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID  V1   V2   V3  time
 1  35  100  5.2  2015-07-03 07:49
 2  25  111  6.2  2015-04-01 11:52
 3  41  120   NA  2015-04-01 14:17
 1  25   NA   NA  2015-07-03 07:51
 2  NA  122  6.2  2015-04-01 11:50
 3  40  110  4.1  2015-04-01 14:25
I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).
The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time
tapply(Data, ID, min(time)
but get an error saying
"Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol.
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID  V1   V2   V3
 1  35  100  5.2
 2  25  122  6.2
 3  41  120  4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
cbind(2,25,111,6.2,"2015-04-01 11:52"),
cbind(3,41,120,4.2,"2015-04-01 14:17"),
cbind(1,25,NA,NA,"2015-07-03 07:51"),
cbind(2,NA,122,6.2,"2015-04-01 11:50"),
cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
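If the goal is instead the earliest non-missing value of each variable per ID (so that ID 2 gets V1 = 25 from the later row), here is a minimal base R sketch. It assumes the data as shown in the question's table (V3 is NA in the first row for ID 3), uses UTC as an arbitrary time zone, and first_non_na and first_obs are illustrative names.

```r
Data <- data.frame(
  ID   = c(1L, 2L, 3L, 1L, 2L, 3L),
  V1   = c(35L, 25L, 41L, 25L, NA, 40L),
  V2   = c(100L, 111L, 120L, NA, 122L, 110L),
  V3   = c(5.2, 6.2, NA, NA, 6.2, 4.1),
  time = as.POSIXct(c("2015-07-03 07:49", "2015-04-01 11:52",
                      "2015-04-01 14:17", "2015-07-03 07:51",
                      "2015-04-01 11:50", "2015-04-01 14:25"), tz = "UTC")
)

# Sort by time so that "first" means "earliest" within each ID
Data <- Data[order(Data$time), ]

# First non-missing value of a vector (NA if every value is missing)
first_non_na <- function(x) x[!is.na(x)][1]

# Apply it to each variable separately, grouped by ID
first_obs <- aggregate(Data[c("V1", "V2", "V3")],
                       by = list(ID = Data$ID),
                       FUN = first_non_na)
first_obs
```

Because each column is reduced independently, the values in one output row can come from different timestamps, which is why the time column is not carried over.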