I have the following Df in R.
DF<-
Month Count Group
Dec-20 12 A2
Feb-21 30 R5
Mar-21 43 R5
Jan-21 90 B1
Total 175 -
The months are jumbled in the above data frame; I need to sort them in descending order.
Required Df<-
Month Count Group
Mar-21 43 R5
Feb-21 30 R5
Jan-21 90 B1
Dec-20 12 A2
Total 175 -
You can use zoo::as.yearmon to convert the character values of Month to the yearmon class, which can be sorted. The "Total" row cannot be parsed as a month, so it becomes NA and arrange places it last.
library(dplyr)
df %>% arrange(desc(zoo::as.yearmon(Month, '%b-%y')))
# Month Count Group
#1 Mar-21 43 R5
#2 Feb-21 30 R5
#3 Jan-21 90 B1
#4 Dec-20 12 A2
#5 Total 175 -
In base R, create a date object and then use order.
df[order(as.Date(paste0(df$Month, '-1'), '%b-%y-%d'), decreasing = TRUE), ]
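Both approaches above rely on %b, which parses month abbreviations according to the current locale. As a hedged alternative, here is a minimal locale-independent sketch in base R. The helper names mon, yr and key are illustrative and not part of the original answer, and the sketch assumes every real month label has the form Mon-yy with an English abbreviation.

```r
# Same data as in the question, rebuilt so the block is self-contained
df <- data.frame(
  Month = c("Dec-20", "Feb-21", "Mar-21", "Jan-21", "Total"),
  Count = c(12L, 30L, 43L, 90L, 175L),
  Group = c("A2", "R5", "R5", "B1", "-"),
  stringsAsFactors = FALSE
)

# Month number from the built-in English abbreviations; NA for "Total"
mon <- match(substr(df$Month, 1, 3), month.abb)
# Two-digit year; "Total" cannot be converted and becomes NA
yr <- suppressWarnings(as.integer(substr(df$Month, 5, 6)))

# A single sortable key: months counted from year 0
key <- yr * 12L + mon

# Descending order, with the unparseable "Total" row kept at the bottom
sorted <- df[order(key, decreasing = TRUE, na.last = TRUE), ]
sorted$Month
```

Because the key is NA for any label that is not a month, rows such as "Total" stay at the end regardless of the sort direction.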
data
df <- structure(list(Month = c("Dec-20", "Feb-21", "Mar-21", "Jan-21",
"Total"), Count = c(12L, 30L, 43L, 90L, 175L), Group = c("A2",
"R5", "R5", "B1", "-")), class = "data.frame", row.names = c(NA, -5L))
I have two datasets, and I need to merge them by the ID value. The problems are:
The ID value can be repeated across the same dataset (no other unique value is available).
The two datasets do not have the same number of rows or columns.
Example:
df1
ID Gender
99 Male
85 Female
7 Male
df2
ID Body_Temperature Body_Temperature_date_time
99 36 1/1/2020 12:00 am
99 38 2/1/2020 10:30 am
99 37 1/1/2020 06:41 am
52 38 1/2/2020 11:00 am
11 39 4/5/2020 09:09 pm
7 35 9/8/2020 02:30 am
How can I turn these two datasets into one single dataset in a way that allows me to apply some machine learning models on it later on?
Depending on your expected results, if you want to return all rows from both dataframes, you can use full_join from dplyr:
library(dplyr)
full_join(df2, df1, by = "ID")
Or with base R:
merge(x=df2,y=df1,by="ID",all=TRUE)
Output
ID Body_Temperature Body_Temperature_date_time Gender
1 99 36 1/1/2020 12:00 am Male
2 99 38 2/1/2020 10:30 am Male
3 99 37 1/1/2020 06:41 am Male
4 52 38 1/2/2020 11:00 am <NA>
5 11 39 4/5/2020 09:09 pm <NA>
6 7 35 9/8/2020 02:30 am Male
7 85 NA <NA> Female
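Which join is appropriate depends on which rows should survive. As a rough sketch in base R (the date-time column is left out for brevity, and the object names full, left and inner are illustrative), the three common choices can be compared by their row counts:

```r
df1 <- data.frame(ID = c(99L, 85L, 7L),
                  Gender = c("Male", "Female", "Male"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c(99L, 99L, 99L, 52L, 11L, 7L),
                  Body_Temperature = c(36L, 38L, 37L, 38L, 39L, 35L),
                  stringsAsFactors = FALSE)

# Full join: every ID from either table, NA where there is no match
full <- merge(df2, df1, by = "ID", all = TRUE)
# Left join on df2: every measurement is kept, Gender may be NA
left <- merge(df2, df1, by = "ID", all.x = TRUE)
# Inner join: only IDs present in both tables
inner <- merge(df2, df1, by = "ID")

c(full = nrow(full), left = nrow(left), inner = nrow(inner))
```

Repeated IDs are not a problem for any of these: each measurement row for ID 99 simply picks up the single matching Gender value.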
If you have more than 2 dataframes to combine, which only overlap with the ID column, then you can use reduce on a dataframe list (so put all the dataframes that you want to combine into a list):
library(tidyverse)
df_list <- list(df1, df2)
multi_full <- reduce(df_list, function(x, y, ...)
full_join(x, y, by = "ID", ...))
Or Reduce with base R:
df_list <- list(df1, df2)
multi_full <- Reduce(function(x, y, ...)
merge(x, y, by = "ID", all = TRUE, ...), df_list)
Data
df1 <- structure(list(ID = c(99L, 85L, 7L), Gender = c("Male", "Female",
"Male")), class = "data.frame", row.names = c(NA, -3L))
df2 <- structure(list(ID = c(99L, 99L, 99L, 52L, 11L, 7L), Body_Temperature = c(36L,
38L, 37L, 38L, 39L, 35L), Body_Temperature_date_time = c("1/1/2020 12:00 am",
"2/1/2020 10:30 am", "1/1/2020 06:41 am", "1/2/2020 11:00 am",
"4/5/2020 09:09 pm", "9/8/2020 02:30 am")), class = "data.frame", row.names = c(NA,
-6L))
This is my dataframe x
ID Name Initials AGE
123 Mike NA 18
124 John NA 20
125 Lily NA 21
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
129 Oscar NA 32
I also have a numeric vector y (num [1:3]) of IDs that I want to remove from data frame x:
>print(y)
[1] 124 125 129
My goal is to remove all the IDs in y from data frame x.
This is my desired output
ID Name Initials AGE
123 Mike NA 18
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
I'm using the dplyr package and trying this, but it's not working:
FinalData <- x %>%
select(everything()) %>%
filter(ID != c(y))
Can anyone tell me what needs to be corrected?
We can use %in% and negate it with !, because y has more than one element: ID != y compares element by element with y recycled, so it does not test membership. The select step is not needed, because everything() simply selects all the columns.
library(dplyr)
x %>%
filter(!ID %in% y)
# ID Name Initials AGE
#1 123 Mike NA 18
#2 126 Jasper NA 24
#3 127 Toby NA 27
#4 128 Will NA 19
Or another option is anti_join
x %>%
anti_join(tibble(ID = y), by = "ID")
In base R, subset can be used
subset(x, !ID %in% y)
data
y <- c(124, 125, 129)
x <- structure(list(ID = 123:129, Name = c("Mike", "John", "Lily",
"Jasper", "Toby", "Will", "Oscar"), Initials = c(NA, NA, NA,
NA, NA, NA, NA), AGE = c(18L, 20L, 21L, 24L, 27L, 19L, 32L)),
class = "data.frame", row.names = c(NA,
-7L))
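To see why the original filter(ID != c(y)) removed nothing, it may help to look at the comparison on its own. This is a small base R sketch using the same IDs; the names recycled and kept are illustrative.

```r
x <- data.frame(ID = 123:129,
                Name = c("Mike", "John", "Lily", "Jasper", "Toby", "Will", "Oscar"),
                stringsAsFactors = FALSE)
y <- c(124, 125, 129)

# != compares position by position, recycling y as 124, 125, 129, 124, ...
# R warns because 7 is not a multiple of 3; the warning is silenced here
recycled <- suppressWarnings(x$ID != y)
all(recycled)  # every position happens to differ, so no row is dropped

# %in% tests membership in y instead, which is what the question needs
kept <- x[!x$ID %in% y, ]
kept$ID
```

With these particular values no ID ever lines up with the recycled element of y, so the != version keeps all seven rows.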
Sample data:
sampleData
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
.........
I want to extract the records that satisfy the condition $ozone > 31.
Here is the code:
data <- sampleData[sampleData$ozone > 31]
And get the error below:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
How should I correct it? Thanks!
R is case sensitive, so your ozone has to match the name in your data.frame. Also to subset a data.frame, you need two indices (row and column) separated by a comma. If there is nothing after the comma, it means that you are selecting all the columns:
sampleData[sampleData$Ozone > 31,]
Other methods to subset a data.frame:
subset(sampleData, Ozone > 31)
or with dplyr:
library(dplyr)
sampleData %>%
filter(Ozone > 31)
Result:
Ozone Solar.R Wind Temp Month Day sampleData.Ozone
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
Data:
sampleData = structure(list(Ozone = c(41L, 36L, 12L), Solar.R = c(190L, 118L,
149L), Wind = c(7.4, 8, 12.6), Temp = c(67L, 72L, 74L), Month = c(5L,
5L, 5L), Day = 1:3, sampleData.Ozone = c(41L, 36L, 12L)), .Names = c("Ozone",
"Solar.R", "Wind", "Temp", "Month", "Day", "sampleData.Ozone"
), class = "data.frame", row.names = c("1", "2", "3"))
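One difference between these methods shows up when the column contains missing values, which the sample above does not. The following sketch uses a small made-up data frame d (hypothetical values, not from the question) to illustrate it:

```r
# Hypothetical data with one missing Ozone reading
d <- data.frame(Ozone = c(41L, NA, 12L, 36L),
                Temp = c(67L, 72L, 74L, 62L))

# Logical indexing keeps a row of NAs wherever the condition is NA
by_bracket <- d[d$Ozone > 31, ]
# which() keeps only positions where the condition is TRUE
by_which <- d[which(d$Ozone > 31), ]
# subset() also drops rows where the condition is NA
by_subset <- subset(d, Ozone > 31)

c(bracket = nrow(by_bracket), which = nrow(by_which), subset = nrow(by_subset))
```

If the real data has gaps, wrapping the condition in which() or using subset() avoids the extra all-NA rows.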
I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID V1 V2 V3 time
1 35 100 5.2 2015-07-03 07:49
2 25 111 6.2 2015-04-01 11:52
3 41 120 NA 2015-04-01 14:17
1 25 NA NA 2015-07-03 07:51
2 NA 122 6.2 2015-04-01 11:50
3 40 110 4.1 2015-04-01 14:25
I would like to extract the earliest (first) observation for each variable independently based on the time column, for each unique ID. i.e. I would like to combine multiple rows of the same ID together so that I have one row of the first observation for each variable (time variable will not be equal for all).
The min() function will return the earliest time for a set of observations, but the problem is I need to do this for each variable. To do this I have tried using the tapply function with minimum time
tapply(Data, ID, min(time)
but get an error saying
"Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol.
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID V1 V2 V3
1 35 100 5.2
2 25 122 6.2
3 41 120 4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
Data <- rbind(cbind(1,35,100,5.2,"2015-07-03 07:49"),
cbind(2,25,111,6.2,"2015-04-01 11:52"),
cbind(3,41,120,4.2,"2015-04-01 14:17"),
cbind(1,25,NA,NA,"2015-07-03 07:51"),
cbind(2,NA,122,6.2,"2015-04-01 11:50"),
cbind(3,40,110,4.1,"2015-04-01 14:25"))
colnames(Data) <- c("ID","V1","V2","V3","time")
Data <- data.table(Data)
class(Data[,time])
Data[,time:=as.POSIXct(time)]
minTime.Data <- Data[,lapply(.SD, function(x) x[time==min(time)]),by=ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
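If the goal is instead the reading the question describes (the earliest non-missing value of each variable per ID), one possible base R sketch is shown below. It is an alternative to the data.table code above, not a correction of it. The helper first_non_na is a hypothetical name, and the values follow the table printed in the question, where V3 is NA in the first row for ID 3.

```r
Data <- data.frame(
  ID = c(1L, 2L, 3L, 1L, 2L, 3L),
  V1 = c(35L, 25L, 41L, 25L, NA, 40L),
  V2 = c(100L, 111L, 120L, NA, 122L, 110L),
  V3 = c(5.2, 6.2, NA, NA, 6.2, 4.1),
  time = as.POSIXct(c("2015-07-03 07:49", "2015-04-01 11:52",
                      "2015-04-01 14:17", "2015-07-03 07:51",
                      "2015-04-01 11:50", "2015-04-01 14:25"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC")
)

# Hypothetical helper: first value that is not NA (NA if all are missing)
first_non_na <- function(v) v[!is.na(v)][1]

# Sort by ID and time so "first" means "earliest"
Data <- Data[order(Data$ID, Data$time), ]

# Apply the helper to each variable separately within each ID
result <- aggregate(Data[c("V1", "V2", "V3")],
                    by = list(ID = Data$ID),
                    FUN = first_non_na)
result
```

Because each column is reduced independently, the values in one output row may come from different time stamps, which is why the time column is not carried over.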
I've got a data frame that looks like something along these lines:
Day Salesperson Value
==== ============ =====
Monday John 40
Monday Sarah 50
Tuesday John 60
Tuesday Sarah 30
Wednesday John 50
Wednesday Sarah 40
I want to divide the value for each salesperson by the number of times that each day of the week has occurred. So: there have been 3 Mondays, 3 Tuesdays, and 2 Wednesdays. I don't have this information digitally, but can create a vector along the lines of
c(3, 3, 2)
How can I conditionally divide the Value column based on the number of times each day occurs?
I've found an inelegant solution, which entails copying the Day column to a temp column, replacing each of the names of the week in the new column with the number of times each day occurs using
df$temp <- sub("Monday", 3, df$temp)
but doing this seems kinda clunky. Is there a neat way to do this?
Suppose your auxiliary data is in another data.frame:
Day N_Day
1 Monday 3
2 Tuesday 3
3 Wednesday 2
The simplest way would be to merge:
DF_new <- merge(DF, DF2, by="Day")
DF_new$newcol <- DF_new$Value / DF_new$N_Day
which gives
Day Salesperson Value N_Day newcol
1 Monday John 40 3 13.33333
2 Monday Sarah 50 3 16.66667
3 Tuesday John 60 3 20.00000
4 Tuesday Sarah 30 3 10.00000
5 Wednesday John 50 2 25.00000
6 Wednesday Sarah 40 2 20.00000
The mergeless shortcut is
DF$newcol <- DF$Value / DF2$N_Day[match(DF$Day, DF2$Day)]
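If the counts only exist as a vector like the c(3, 3, 2) mentioned in the question, a named vector can serve as the lookup table directly. This is a minimal sketch; n_day is an illustrative name and the data frame is rebuilt here with character columns so the block runs on its own.

```r
DF <- data.frame(
  Day = rep(c("Monday", "Tuesday", "Wednesday"), each = 2),
  Salesperson = rep(c("John", "Sarah"), times = 3),
  Value = c(40, 50, 60, 30, 50, 40),
  stringsAsFactors = FALSE
)

# Counts keyed by day name (the values come from the question)
n_day <- c(Monday = 3, Tuesday = 3, Wednesday = 2)

# Index the named vector by day; as.character() guards against factor columns
DF$newcol <- DF$Value / unname(n_day[as.character(DF$Day)])
DF
```

The as.character() call matters when Day is a factor, since indexing by a factor would use its integer codes rather than its labels.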
Data:
DF <- structure(list(Day = structure(c(1L, 1L, 2L, 2L, 3L, 3L), .Label =
c("Monday",
"Tuesday", "Wednesday"), class = "factor"), Salesperson = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("John", "Sarah"), class = "factor"),
Value = c(40L, 50L, 60L, 30L, 50L, 40L)), .Names = c("Day",
"Salesperson", "Value"), class = "data.frame", row.names = c(NA,
-6L))
DF2 <- structure(list(Day = structure(1:3, .Label = c("Monday", "Tuesday",
"Wednesday"), class = "factor"), N_Day = c(3, 3, 2)), .Names = c("Day",
"N_Day"), row.names = c(NA, -3L), class = "data.frame")
You can use the dplyr library to join your data frame with the frequency of each day.
library(dplyr)
df <- data.frame(
Day=c("Monday","Monday","Tuesday","Tuesday","Wednesday","Wednesday"),
Salesperson=c("John","Sarah","John","Sarah","John","Sarah"),
Value=c(40,50,60,30,50,40), stringsAsFactors=FALSE)
aux <- data.frame(
Day=c("Monday","Tuesday","Wednesday"),
n=c(3,3,2), stringsAsFactors=FALSE
)
output <- df %>% left_join(aux, by="Day") %>% mutate(Value2=Value/n)
To build this auxiliary table from the number of times each day appears in your original data, instead of entering it manually, you could use:
aux <- df %>% group_by(Day) %>% summarise(n=n())
Re-running the join with this count-based aux gives:
> output
Day Salesperson Value n Value2
1 Monday John 40 2 20
2 Monday Sarah 50 2 25
3 Tuesday John 60 2 30
4 Tuesday Sarah 30 2 15
5 Wednesday John 50 2 25
6 Wednesday Sarah 40 2 20
If you want to overwrite the existing Value column, use mutate(Value=Value/n); to remove the additional column, add select(-n):
output <- df %>% left_join(aux, by="Day") %>% mutate(Value=Value/n) %>% select(-n)