How to number repeated values in a column in R?

I have a big dataset where some names are repeated, like below.
Name Year Value
AH 2013 1800
AH 2014 2400
AH 2015 2300
BC 2013 1900
BC 2014 1600
KP 2013 3600
DN 2013 2800
I'd like to know how to create a column for numbering repeated names sequentially.
Name Year Value Number
AH 2013 1800 1
AH 2014 2400 2
AH 2015 2300 3
BC 2013 1900 1
BC 2014 1600 2
KP 2013 3600 1
DN 2013 2800 1
I found a previous question with the following code:
library(data.table)
library(dplyr)
weighted_df %>%
  mutate(Number = rowid(Name))
but I keep getting "Error in is.data.frame(.data) : object 'weighted_df' not found". I'm quite new to R, so I'm unsure what else I can try, or whether I'm using the wrong libraries for the function.
Any help is greatly appreciated! Thanks y'all!
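A minimal sketch of one way to do this, assuming your data frame is called df (a hypothetical name). The error message suggests that weighted_df was simply the data frame name used in the other question and that no object with that name exists in your session, so the same pipe should work once you substitute your own data frame. The base R version below uses ave() so that it needs no packages:

```r
# Example data copied from the question; `df` is a hypothetical name.
df <- data.frame(
  Name  = c("AH", "AH", "AH", "BC", "BC", "KP", "DN"),
  Year  = c(2013, 2014, 2015, 2013, 2014, 2013, 2013),
  Value = c(1800, 2400, 2300, 1900, 1600, 3600, 2800)
)

# Within each Name, number the rows 1, 2, 3, ... in their current order.
df$Number <- ave(seq_along(df$Name), df$Name, FUN = seq_along)

print(df)
```

With data.table and dplyr loaded, df %>% mutate(Number = rowid(Name)) should produce the same column.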

Related

How do I calculate days since value exceeded in R?

I'm working with daily discharge data over 30 years. Discharge is measured in cfs, and my dataset looks like this:
date ddmm year cfs
1/04/1986 1-Apr 1986 2560
2/04/1986 2-Apr 1986 3100
3/04/1986 3-Apr 1986 2780
4/04/1986 4-Apr 1986 2640
...
17/01/1987 17-Jan 1987 1130
18/01/1987 18-Jan 1987 1190
19/01/1987 19-Jan 1987 1100
20/01/1987 20-Jan 1987 864
21/01/1987 21-Jan 1987 895
22/01/1987 22-Jan 1987 962
23/01/1987 23-Jan 1987 998
24/01/1987 24-Jan 1987 1140
I'm trying to calculate the number of days preceding each date that the discharge exceeds 1000 cfs and put it in a new column ("DaysGreater1000") that will be used in a subsequent analysis.
In this example, DaysGreater1000 would be 0 for all of the dates in April 1986. DaysGreater1000 would be 1 on 20 Jan, 2 on 21 Jan, 3 on 22 Jan, etc.
Do I first need to create a column (event) of binary data for when the threshold is exceeded? I have been reading several old questions and it looks like I need to use ifelse but I can't figure out how to make a new column of data and then how to make the next step to calculate the number of preceding days.
Here are the questions that I have been examining:
Calculate days since last event in R
Calculate elapsed time since last event
... And this is the code that looks promising, but I can't quite put it all together!
df %>%
  mutate(event = as.logical(event),
         last_event = if_else(event, true = date, false = NA_integer_)) %>%
  fill(last_event) %>%
  mutate(event_age = date - last_event)
summary(df)
I'm sorry if I'm not being very eloquent! I'm feeling a bit rusty as I haven't used R in a while.
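A rough sketch of one possible approach, assuming DaysGreater1000 means "number of consecutive days at or below 1000 cfs since the discharge last exceeded it" (which matches the 0, 1, 2, 3 pattern described above) and assuming there is one row per day with no gaps. The data frame name flow is hypothetical, and only the January 1987 rows from the question are used:

```r
# January 1987 rows from the question; `flow` is a hypothetical name.
flow <- data.frame(
  date = as.Date(c("1987-01-17", "1987-01-18", "1987-01-19", "1987-01-20",
                   "1987-01-21", "1987-01-22", "1987-01-23", "1987-01-24")),
  cfs  = c(1130, 1190, 1100, 864, 895, 962, 998, 1140)
)

# TRUE on days when the discharge does not exceed the threshold.
below <- flow$cfs <= 1000

# A new run starts on every day that exceeds the threshold.
run_id <- cumsum(!below)

# Within each run, count the below-threshold days seen so far.
flow$DaysGreater1000 <- ave(as.integer(below), run_id, FUN = cumsum)

print(flow)
```

No separate ifelse step is needed here, because the logical vector below already plays the role of the binary event column.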

Rearranging data columns in R

I have an Excel file that contains two columns: Car_Model_Year and Cost.
Car_Model_Year Cost
2018 25000
2010 9000
2005 13000
2002 35000
1995 8000
I want to sort my data as follows:
Car_Model_Year Cost
1995 8000
2002 35000
2005 13000
2010 9000
2018 25000
So now, the Car_Model_Year are sorted in ascending order. I wrote the following R code, but I don't know how to rearrange the values of the variable Cost accordingly.
my_data <- read.csv2("data.csv")
my_data <- sort(my_data$Car_Model_Year, decreasing = FALSE)
Any help will be very appreciated!
Are you looking for this?
sorted_df <- df[order(df$Car_Model_Year, df$Cost),]
print(sorted_df)
# A tibble: 5 x 2
Car_Model_Year Cost
<dbl> <dbl>
1 1995 8000
2 2002 35000
3 2005 13000
4 2010 9000
5 2018 25000
Note that you can put a minus sign in front of a numeric column to sort it in descending order (no sign means ascending):
# Sort by Car_Model_Year (descending) and Cost (ascending)
sorted_df <- df[order(-df$Car_Model_Year, df$Cost),]
Does the below approach work? To sort by two or more columns, you just add them to the order() - i.e. order(var1, var2,...)
my_data <- data.frame(Car_Model_Year=c(2018,2010,2005,2002,1995),
Cost=c(25000,9000,13000,35000,8000))
sorted <- my_data[order(my_data$Car_Model_Year, my_data$Cost),]
> print(sorted)
Car_Model_Year Cost
5 1995 8000
4 2002 35000
3 2005 13000
2 2010 9000
1 2018 25000
dplyr::arrange() makes it easy:
library(dplyr)
my_data %>% arrange(Car_Model_Year, Cost)
To sort Cost in descending order within each year instead:
my_data %>% arrange(Car_Model_Year, desc(Cost))

How to create a numerical variable from Quarterly time column - R

I have a dataset which currently looks like this:
Time Var1 Var2
2013 Q4 123 756
2013 Q4 657 987
2014 Q1 746 756
2014 Q1 66 999
2014 Q2 774 542
And I need to convert this categorical 'Time' variable into a numerical variable, something which may look potentially like this:
Time Var1 Var2 n.Time
2013 Q4 123 756 1
2013 Q4 657 987 1
2014 Q1 746 756 2
2014 Q1 66 999 2
2014 Q2 774 542 3
Or something similar which gives the 'Time' column a numerical value which is proportional.
I have tried:
df$n.Time <- as.yearqtr(df$Time)
But this just gives the same output as the 'Time' column instead of making it numerical.
Any help would be greatly appreciated
Would something like this work?
df$n.Time <- as.numeric(as.factor(df$Time))
I think you are looking to split the quarter part (from "Q" onward) off the Time column and then convert it to a value of its own.
df$n.Time <- as.factor(substr(as.character(df$Time),
                              gregexpr("Q",df$Time),nchar(as.character(df$Time))))
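If the numbering needs to stay proportional to elapsed time (so that a skipped quarter leaves a gap), one option is to build a quarter index from the year and quarter number directly. This is only a sketch, assuming every Time value has the form "YYYY Qn" as in the example:

```r
# Example data copied from the question.
df <- data.frame(
  Time = c("2013 Q4", "2013 Q4", "2014 Q1", "2014 Q1", "2014 Q2"),
  Var1 = c(123, 657, 746, 66, 774),
  Var2 = c(756, 987, 756, 999, 542)
)

# Split "2013 Q4" into its year (2013) and quarter number (4).
yr  <- as.integer(substr(as.character(df$Time), 1, 4))
qtr <- as.integer(sub(".*Q", "", as.character(df$Time)))

# Count quarters on a single scale, then shift so the earliest quarter is 1.
quarter_index <- yr * 4 + qtr
df$n.Time <- quarter_index - min(quarter_index) + 1

print(df)
```

Unlike as.numeric(as.factor(df$Time)), which only ranks the distinct labels, this keeps the spacing between quarters even when some quarters are missing from the data.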

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame of hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-12-31"),
                by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation. I have real NAs because it is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put this into a longer script that loops through all my stations and all the years for each station, and produces a wind rose whenever this criterion is met for that year and station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one is quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
  group_by(Month) %>%
  summarise(No.Obs=sum(!is.na(WS)),
            Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
  mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
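To turn any of these rates into the yes/no test asked for in the question, the share of valid values per month can be compared against 0.75. The following base R sketch rebuilds a smaller version of the dummy data; the seed and the UTC time zone are assumptions added only to make the example reproducible, and it assumes every hour has a row (if whole hours can be missing, the Max.Obs approach above is the safer denominator):

```r
set.seed(1)  # hypothetical seed, only for reproducibility

# Hourly time stamps for 2012, as in the question (UTC is an assumption).
dateTime <- seq(as.POSIXct("2012-01-01", tz = "UTC"),
                as.POSIXct("2012-12-31", tz = "UTC"),
                by = 60 * 60)
WS <- sample(0:20, length(dateTime), replace = TRUE)
WS[WS > 15] <- NA
df <- data.frame(dateTime, WS)

# Month label for each row, e.g. "2012-01".
month <- format(df$dateTime, "%Y-%m")

# Share of non-missing WS values in each month (mean of a logical vector).
valid_share <- tapply(!is.na(df$WS), month, mean)

# TRUE for months with at least 75% valid observations.
enough_data <- valid_share >= 0.75

print(enough_data)
```

Inside a loop over stations and years, enough_data could then be used as the condition that decides whether a wind rose is drawn.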

Got an error using ifelse inside mutate inside the for loop

I have a list of 244 data frames, each of which looks like the following.
The name of the list is datas.
datas[[1]]
year sal
2000 10000
2000 15000
2005 10000
2005 9000
2005 12000
2010 15000
2010 12000
2010 20000
2013 25000
2013 15000
2015 20000
I would like to make a new column called fix.sal by multiplying sal by a different factor for each year: 2 for 2000, 1.8 for 2005, 1.5 for 2010, 1.2 for 2013, and 1 for 2015. So the result should look like this:
Year sal fix.sal
2000 10000 20000
2000 15000 30000
2005 10000 18000
2005 9000 16200
2005 12000 21600
2010 15000 22500
2010 12000 18000
2010 20000 30000
2013 25000 30000
2013 15000 18000
2015 20000 20000
I managed to do this using ifelse inside mutate from the dplyr package.
library(dplyr)
datas[[1]]<-mutate(datas[[1]], fix.sal=
  ifelse(datas[[1]]$Year==2000,datas[[1]]$sal*2,
  ifelse(datas[[1]]$Year==2005,datas[[1]]$sal*1.8,
  ifelse(datas[[1]]$Year==2010,datas[[1]]$sal*1.5,
  ifelse(datas[[1]]$Year==2013,datas[[1]]$sal*1.2,
  datas[[1]]$sal*1)))))
But I have to apply this operation to all 244 data frames in the list datas.
So I tried to do it with a for loop like this:
for(i in 1:244){
  datas[[i]]<-mutate(datas[[i]], fix.sal=
    ifelse(datas[[i]]$Year==2000,datas[[i]]$sal*2,
    ifelse(datas[[i]]$Year==2005,datas[[i]]$sal*1.8,
    ifelse(datas[[i]]$Year==2010,datas[[i]]$sal*1.5,
    ifelse(datas[[i]]$Year==2013,datas[[i]]$sal*1.2,
    datas[[i]]$sal*1)))))
}
Then I got this error:
Error: invalid subscript type 'integer'
How can I solve this...?
Any comments will be greatly appreciated! :)
Please don't force yourself to use ifelse for this. Instead, create a vector with your multipliers, then use the year to select from the vector. The vector will look something like this:
multiplier <-
c("2005" = 1.2
, "2006" = 1.05
, "2007" = 0.9)
Use whatever multiplier applies to each year in your data. Then, here is some sample data (all the same, but that doesn't matter):
datas <-
lapply(1:3, function(idx){
data.frame(
Year = 2005:2007
, sal = c(10, 20, 30)
)
})
Finally, we can then use lapply to loop through the list more efficiently. Each time through, it uses the Year to pick a value from the multipliers vector (note the use of as.character, otherwise it will pick, e.g., the 2005th entry, instead of the one named "2005").
lapply(datas, function(x){
mutate(x, fix.sal = sal*multiplier[as.character(Year)])
})
returns:
[[1]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[2]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[3]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
For more compact code, you can use:
lapply(datas, mutate, fix.sal = sal*multiplier[as.character(Year)])
but that makes it slightly less clear to me what is happening.
Here's a simple solution using ifelse and lapply:
# Creating the list
df <- data.frame(year=c(rep(2000,2),rep(2005,3),rep(2010,3),rep(2013,2),2015),
sal=c(10000,15000,10000,9000,12000,15000,12000,20000,25000,15000,20000))
datas <- list(df,df)
# Applying the function with ifelse; x is the current data frame in the list
lapply(datas,function(x){
  outp <- ifelse(x$year==2000,x$sal*2,
          ifelse(x$year==2005,x$sal*1.8,
          ifelse(x$year==2010,x$sal*1.5,
          ifelse(x$year==2013,x$sal*1.2,x$sal*1))))
  return(outp)
})
You'll get a vector of adjusted salaries for each data frame inside the list.
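To keep the result as a fix.sal column in each data frame, the named-vector lookup can be combined with the multipliers from the question. This is a sketch in base R, assuming the column is named year (lower case, as in the datas[[1]] printout) and that every year in the data has an entry in the lookup vector; a year without an entry would give NA rather than falling back to a multiplier of 1:

```r
# Multipliers from the question, keyed by year as character names.
multiplier <- c("2000" = 2, "2005" = 1.8, "2010" = 1.5, "2013" = 1.2, "2015" = 1)

# Example list of two identical data frames built from the question's data.
df <- data.frame(
  year = c(rep(2000, 2), rep(2005, 3), rep(2010, 3), rep(2013, 2), 2015),
  sal  = c(10000, 15000, 10000, 9000, 12000, 15000, 12000, 20000,
           25000, 15000, 20000)
)
datas <- list(df, df)

# Add fix.sal to every data frame; as.character() makes the lookup go by name.
datas <- lapply(datas, function(x) {
  x$fix.sal <- x$sal * unname(multiplier[as.character(x$year)])
  x
})

print(datas[[1]])
```

Because lapply returns the whole list, there is no need to hard-code 244 or to index with i.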
