Get cell value if SUM exceeds a limit using R

I have a huge data frame with details of each employee's working hours per day. For example:
STAFF_ID DATE MONTH HOURS_WORKED
345 4-May-15 May 5
678 4-May-15 May 2
965 4-May-15 May 4
248 4-May-15 May 6
345 5-May-15 May 7
678 6-May-15 May 3
678 7-May-15 May 3
965 8-May-15 May 5
345 7-Jun-15 June 1
678 8-Jun-15 June 2
965 8-Jun-15 June 4
248 8-Jun-15 June 6
345 8-Jun-15 June 3
678 9-Jun-15 June 2
678 10-Jun-15 June 3
965 11-Jun-15 June 4
965 12-Jun-15 June 3
What I want to find out is whether any employee works more than 7 hours in a month and, if so:
On which date does the employee's running total for the month first exceed 7 hours, and
by how much does the total exceed the maximum on that date?
Expected results:
STAFF_ID DATE MONTH HOURS_WORKED LATEST_DATE HOURS_EXCEED
345 4-May-15 May 5 5-May-15 5
678 4-May-15 May 2 7-May-15 1
965 4-May-15 May 4 8-May-15 2
248 4-May-15 May 6 NA NA
345 5-May-15 May 7 5-May-15 5
678 6-May-15 May 3 7-May-15 1
678 7-May-15 May 3 7-May-15 1
965 8-May-15 May 5 8-May-15 2
345 7-Jun-15 June 1 NA NA
678 8-Jun-15 June 2 NA NA
965 8-Jun-15 June 4 11-Jun-15 1
248 8-Jun-15 June 6 NA NA
345 8-Jun-15 June 3 NA NA
678 9-Jun-15 June 2 NA NA
678 10-Jun-15 June 3 NA NA
965 11-Jun-15 June 4 11-Jun-15 1
965 12-Jun-15 June 3 11-Jun-15 1
I have asked the same question before, but there I asked for Excel solutions. However, as mentioned, the data file is really huge, so I would prefer to solve this in R.
Thank you!

Updated
Using data.table with a custom function, assuming DATE is of class Date and the rows are in date order within each employee and month (cumsum() relies on that order).
library(data.table)
# Calculate cumulative sum of hours worked per month per group
setDT(df)[,total_hours := cumsum(HOURS_WORKED),by = c("STAFF_ID", "MONTH")]
# Define a custom function that returns the first element of x for which z > 7 (NA if there is none)
over.seven <- function(x,z) {
y <- x[(z>7)][1]
return(y)
}
# Add desired columns
df[,`:=`(LATEST_DATE = over.seven(DATE,total_hours),
HOURS_EXCEED = over.seven(total_hours - 7,total_hours)),
by = c("STAFF_ID", "MONTH")]
> df
# STAFF_ID DATE MONTH HOURS_WORKED total_hours LATEST_DATE HOURS_EXCEED
# 1: 345 2015-05-04 May 5 5 2015-05-05 5
# 2: 678 2015-05-04 May 2 2 2015-05-07 1
# 3: 965 2015-05-04 May 4 4 2015-05-08 2
# 4: 248 2015-05-04 May 6 6 <NA> NA
# 5: 345 2015-05-05 May 7 12 2015-05-05 5
# 6: 678 2015-05-06 May 3 5 2015-05-07 1
# 7: 678 2015-05-07 May 3 8 2015-05-07 1
# 8: 965 2015-05-08 May 5 9 2015-05-08 2
# 9: 345 2015-06-07 June 1 1 <NA> NA
#10: 678 2015-06-08 June 2 2 <NA> NA
#11: 965 2015-06-08 June 4 4 2015-06-11 1
#12: 248 2015-06-08 June 6 6 <NA> NA
#13: 345 2015-06-08 June 3 4 <NA> NA
#14: 678 2015-06-09 June 2 4 <NA> NA
#15: 678 2015-06-10 June 3 7 <NA> NA
#16: 965 2015-06-11 June 4 8 2015-06-11 1
#17: 965 2015-06-12 June 3 11 2015-06-11 1
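For readers without data.table, the same idea can be sketched in base R. This is only an illustration under stated assumptions, not the answerer's method: it uses a subset of the question's rows, writes the dates in ISO format so that no locale-dependent month parsing is needed, and the names limit, grp, first_over and idx are made up for the example. Note that it sorts the rows by employee and date, so the row order differs from the original.

```r
# Subset of the question's data, with dates rewritten in ISO format (assumption)
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
STAFF_ID DATE MONTH HOURS_WORKED
345 2015-05-04 May 5
678 2015-05-04 May 2
965 2015-05-04 May 4
248 2015-05-04 May 6
345 2015-05-05 May 7
678 2015-05-06 May 3
678 2015-05-07 May 3
965 2015-05-08 May 5
965 2015-06-08 June 4
965 2015-06-11 June 4
965 2015-06-12 June 3
")
df$DATE <- as.Date(df$DATE)
limit <- 7  # monthly hour limit from the question

# Sort so that the running total is accumulated in date order
df <- df[order(df$STAFF_ID, df$DATE), ]

# One group per employee and month
grp <- interaction(df$STAFF_ID, df$MONTH, drop = TRUE)

# Running total of hours within each group
df$total_hours <- ave(df$HOURS_WORKED, grp, FUN = cumsum)

# Row index of the first day in each group whose running total exceeds the limit
# (NA when the group never exceeds it)
first_over <- tapply(seq_len(nrow(df)), grp,
                     function(i) i[df$total_hours[i] > limit][1])
idx <- as.integer(first_over[as.character(grp)])

# Copy that day's date and excess onto every row of the group
df$LATEST_DATE  <- df$DATE[idx]
df$HOURS_EXCEED <- df$total_hours[idx] - limit
print(df)
```

Groups that never pass the limit keep NA in both new columns, matching the expected results in the question.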

Related

Importing Data in R

I want to import data into R, but I am getting a few errors. I downloaded my .csv file to my computer and set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"). Then I ran read.data <- read.csv("file1.csv"), but the console returned an error like this:
read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do about this? I also tried reading the data from a web link, but ran into another problem.
I wrote this:
install.packages("XML")
install.packages("RCurl")
to load the packages, run the following command:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console gave me this error:
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you could help me with this.
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
# convert years 2017 - 2020 to character because pivot_longer()
# requires all columns to be of same data type
mutate_at(3:6,as.character) %>%
pivot_longer(-c(Classification,Jurisdiction),
names_to="Year",values_to="Rank") %>%
# convert Rank and Year to numeric values (introducing NA values)
mutate_at(c("Rank","Year"),as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
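Separately from the HTML route, the first error in the question may come from the file name reaching read.csv() without quotes, as in the console line quoted there; R then looks for an object named file1.csv instead of a file. That is only a guess from the text shown. The sketch below writes a small throwaway CSV to a temporary folder (the data and the path are made up for the example) and reads it back with the path passed as a quoted string.

```r
# Create a small CSV in a temporary folder so the example is self-contained
csv_path <- file.path(tempdir(), "file1.csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), csv_path, row.names = FALSE)

# Confirm the file is where we think it is before reading it
stopifnot(file.exists(csv_path))

# The file name must be a character string (quoted, or held in a variable)
read.data <- read.csv(csv_path)
print(read.data)
```

With setwd() pointing at the folder that holds the file, read.csv("file1.csv") with the quotes works the same way.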

Pivot/Reshape data in R [duplicate]

Thank you all for your answers. I thought I was smarter than I am and hoped I would understand them. I think I also messed up the presentation of my data, so I have edited my post to show my sample data better. Sorry for the inconvenience; I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about converting wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
Setting names_to = ".value" tells pivot_longer() to build the new column names from the part of each original column name that names_pattern captures. The names_pattern argument takes a regular expression with capture groups. In this case, here is the breakdown:
(.+) # capture group: whatever it matches becomes the new column name (the ".value")
[0-9] # one trailing digit: matched, but left out of the new column name.
# If the suffix can have several digits, use a pattern such as "(\\D+)[0-9]+";
# "(.+)[0-9]+" is not enough, because the greedy (.+) swallows all but the last digit.
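A quick way to see what such a pattern captures is to try it on the column names with base R's sub(). This only mimics the capture step and is not how tidyr is implemented; the names in cols are made-up examples, including two with two-digit suffixes.

```r
# Hypothetical column names, two of them with a two-digit suffix
cols <- c("T1", "measurement1", "T10", "measurement10")

# Pattern from the answer: everything up to one trailing digit
single_digit <- sub("^(.+)[0-9]$", "\\1", cols)

# Variant for multi-digit suffixes: non-digits, then one or more digits
multi_digit <- sub("^(\\D+)[0-9]+$", "\\1", cols)

print(single_digit)  # "T" "measurement" "T1" "measurement1"
print(multi_digit)   # "T" "measurement" "T"  "measurement"
```

The single-digit pattern leaves a stray digit in the captured name once a suffix reaches 10, which is why the second form is safer for wider tables.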
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together but leaves us without pid, which we can add by using rep to replicate the original pid column 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not the order you expected, so if that is a concern you can sort the data frame by pid and keep only the columns you need:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
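The result above still contains rows in which both Time and Value are NA (row 37, for example), whereas the desired output in the question lists only real measurements. A minimal sketch of dropping them with complete.cases() follows; it uses a cut-down two-measurement version of the data, so the table here is an assumption made for brevity.

```r
# Cut-down version of the question's data: two measurement/Tdays pairs only
data <- read.table(header = TRUE, text = "
pid measurement1 Tdays1 measurement2 Tdays2
1 1356 1435 1483 1405
3 1590 185 NA NA
")

# Stack each set of columns and keep only the stacked values
result <- data.frame(
  pid   = rep(data$pid, 2),
  Time  = stack(data, select = c(Tdays1, Tdays2))$values,
  Value = stack(data, select = c(measurement1, measurement2))$values
)

# Sort by pid, then drop rows where Time or Value is missing
result <- result[order(result$pid), ]
result <- result[complete.cases(result), ]
print(result)
```

complete.cases() drops a row if any column is NA, so a row with a Time but no Value would also be removed; filter on specific columns instead if such rows should stay.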
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
# rbind() matches data frame columns by name, so align the names first
names(tobind) = names(finalDf)
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)

R Time Series, date isn't being read properly

I have this data that I want to plot as a time series.
Date Units.Sold
1 Jan-16 588
2 Feb-16 448
3 Mar-16 490
4 Apr-16 512
5 May-16 528
6 Jun-16 432
7 Jul-16 470
8 Aug-16 446
9 Sep-16 465
10 Oct-16 388
11 Nov-16 429
12 Dec-16 414
However, when I use ts(datasetName), I get this:
Time Series:
Start = 1
End = 12
Frequency = 1
Date Units.Sold
1 5 588
2 4 448
3 8 490
4 1 512
5 9 528
6 7 432
7 6 470
8 2 446
9 12 465
10 11 388
11 10 429
12 3 414
As you can see, the dates are in the wrong order. I want January to correspond with 1, February with 2, and so on. Can anybody help?
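The numbers you see are the integer codes of a factor, whose levels are sorted alphabetically (Apr = 1, Aug = 2, and so on). You need to convert your column named 'Date' to a Date-class object first. You can use as.Date for that, but you'll need to add a year first.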
your_year <- 2018
df$Date <- as.Date(paste0(df$Date, '-', your_year), format = '%b-%d-%Y')
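If the "16" in labels such as Jan-16 is a two-digit year (2016) rather than a day of the month, which the question does not state, a monthly series can be built from the values alone. The sketch below assumes that reading and assumes the rows are already in calendar order; it also stores Date as a factor to reproduce the codes shown in the question. The names date_codes and units_ts are made up for the example.

```r
# Data from the question; Date is stored as a factor to reproduce the symptom
df <- data.frame(
  Date = c("Jan-16", "Feb-16", "Mar-16", "Apr-16", "May-16", "Jun-16",
           "Jul-16", "Aug-16", "Sep-16", "Oct-16", "Nov-16", "Dec-16"),
  Units.Sold = c(588, 448, 490, 512, 528, 432, 470, 446, 465, 388, 429, 414),
  stringsAsFactors = TRUE
)

# What ts(df) shows in the Date column: factor codes in alphabetical level order
date_codes <- as.integer(df$Date)
print(date_codes)

# Monthly series starting January 2016, built from the values only
# (assumes the rows are in calendar order and "16" means the year 2016)
units_ts <- ts(df$Units.Sold, start = c(2016, 1), frequency = 12)
print(units_ts)
```

Setting start and frequency gives the series its calendar positions, so the text labels never have to be parsed.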

Extracting the data based on the values

I have a data frame like the one below. A row has Error = 1 if its DOB is wrong, and it is followed by a row for the same record with the corrected DOB and no error. I want to extract only the rows that have Error = 1 and have not been corrected. Could anyone help me with this?
ID Date1 Date2 DOB Code Error
381 2002-10-01 2015-10-01 1967-01-22 4 1
381 2002-10-01 2015-10-01 1967-01-20 4 NA
381 2011-10-01 2015-10-01 1969-05-13 11 1
381 2011-10-01 2015-10-01 1968-05-13 11 NA
837 2005-12-07 2015-12-07 1987-11-19 8 1
837 2005-12-08 2015-12-08 1989-12-07 8 1
837 2001-04-15 2015-04-15 1984-08-11 18 1
840 2001-04-23 2015-04-23 1999-03-14 18 NA
The expected output is:
ID Date1 Date2 DOB Code Error
837 2005-12-07 2015-12-07 1987-11-19 8 1
837 2005-12-08 2015-12-08 1989-12-07 8 1
837 2001-04-15 2015-04-15 1984-08-11 18 1
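One possible reading of the requirement, sketched in base R: treat two rows as the same record when ID, Date1, Date2 and Code all match, and keep the Error = 1 rows whose record has no row with a missing Error. The choice of those four columns as the record key is an assumption inferred from the sample, and the names key, corrected_keys and uncorrected are made up for the example.

```r
# Sample data from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
ID Date1 Date2 DOB Code Error
381 2002-10-01 2015-10-01 1967-01-22 4 1
381 2002-10-01 2015-10-01 1967-01-20 4 NA
381 2011-10-01 2015-10-01 1969-05-13 11 1
381 2011-10-01 2015-10-01 1968-05-13 11 NA
837 2005-12-07 2015-12-07 1987-11-19 8 1
837 2005-12-08 2015-12-08 1989-12-07 8 1
837 2001-04-15 2015-04-15 1984-08-11 18 1
840 2001-04-23 2015-04-23 1999-03-14 18 NA
")

# Assumed record key: every column except DOB and Error
key <- paste(df$ID, df$Date1, df$Date2, df$Code)

# Records that have a corrected (Error is NA) row
corrected_keys <- unique(key[is.na(df$Error)])

# Rows flagged with Error = 1 whose record was never corrected
uncorrected <- df[!is.na(df$Error) & df$Error == 1 & !(key %in% corrected_keys), ]
print(uncorrected)
```

On the sample this keeps the three rows for ID 837, as in the expected output; a different definition of "same record" would need a different key.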

Get row number data frame R

I have a dataset like this
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12
What I would like to do is set the year and month and get the corresponding row number, like
MONTH <- 12
YEAR <- 1850
ROWNUMBER = 1
Many thanks
A simple which call would be enough, e.g.:
df <- read.table(textConnection("
epoch epochIndex year month
1 335 1 1850 12
2 639 2 1851 10
3 670 3 1851 11
4 366 4 1851 1
5 517 5 1851 6
6 547 6 1851 7
7 578 7 1851 8
8 1005 8 1852 10
9 1036 9 1852 11
10 1066 10 1852 12"), header=TRUE)
which(df$year == 1850 & df$month == 12)
# [1] 1
which(df$year == 1852 & df$month == 12)
# [1] 10
Sorry I found the answer
TIMEC <- which(df$year==YEAR & df$month==MONTH)
