Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I have 8 variables per company, with a total of 25 companies. However, I don't need to make any distinction between these companies. In the example below, I need AH and JUMBO in one column, likewise AHQ1 and JUMBOQ1, and likewise both Q2 columns. That way I end up with 3 columns instead of 6, each holding twice as many observations. The column names can stay AH, AHQ1, and AHQ2.
Thanks in advance for any tips!!
Example of data:
df <- data.frame("ID" = c(1,1,2,2,2,2), "Year" = c(2012, 2015,2012,2013,2015,2016),
"AH" = c(1, NA, 1,1,1,1), "AHQ1" = c(8, NA,7,8,9,10),
"AHQ2" = c(10,NA,7,8,5,2),"JUMBO" = c(NA,NA,1,1,1,NA),
"JUMBOQ1" = c(NA,NA,8,9,7,NA), "JUMBOQ2"= c(NA,NA,10,9,7,NA))
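# Pair the ID/Year columns with the three JUMBO columns
temp <- cbind(df[1:2], df[6:8])
# Reuse the AH column names so the two blocks line up
names(temp) <- names(df[1:5])
# Stack the AH block on top of the JUMBO block
df2 <- rbind(df[1:5], temp)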
> df2
ID Year AH AHQ1 AHQ2
1 1 2012 1 8 10
2 1 2015 NA NA NA
3 2 2012 1 7 7
4 2 2013 1 8 8
5 2 2015 1 9 5
6 2 2016 1 10 2
7 1 2012 NA NA NA
8 1 2015 NA NA NA
9 2 2012 1 8 10
10 2 2013 1 9 9
11 2 2015 1 7 7
12 2 2016 NA NA NA
Is this what you are looking for?
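With 25 companies, binding each block by hand gets tedious. Below is a minimal sketch of the same idea in a loop, assuming the value columns come in consecutive blocks of equal width and in the same order for every company. `stack_blocks` is a hypothetical helper written for this illustration, not a function from any package.

```r
df <- data.frame("ID" = c(1, 1, 2, 2, 2, 2), "Year" = c(2012, 2015, 2012, 2013, 2015, 2016),
                 "AH" = c(1, NA, 1, 1, 1, 1), "AHQ1" = c(8, NA, 7, 8, 9, 10),
                 "AHQ2" = c(10, NA, 7, 8, 5, 2), "JUMBO" = c(NA, NA, 1, 1, 1, NA),
                 "JUMBOQ1" = c(NA, NA, 8, 9, 7, NA), "JUMBOQ2" = c(NA, NA, 10, 9, 7, NA))

# Hypothetical helper: stack consecutive blocks of `width` value columns,
# keeping the id columns and reusing the names of the first block.
stack_blocks <- function(data, id_cols, width) {
  value_cols <- setdiff(seq_along(data), id_cols)
  starts <- seq(1, length(value_cols), by = width)
  target_names <- names(data)[value_cols[seq_len(width)]]
  blocks <- lapply(starts, function(s) {
    block <- data[c(id_cols, value_cols[s:(s + width - 1)])]
    names(block) <- c(names(data)[id_cols], target_names)
    block
  })
  do.call(rbind, blocks)
}

stacked <- stack_blocks(df, id_cols = 1:2, width = 3)
stacked
```

For the example this gives the same 12 rows as df2; for the real data the width would presumably be 8 (one block per company).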
This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 5 years ago.
Here is a simplified version of what my data set looks like:
> df
ID total_sleep sleep_end_date
1 1 9 2017-09-03
2 1 8 2017-09-04
3 1 7 2017-09-05
4 1 10 2017-09-06
5 1 11 2017-09-07
6 2 5 2017-09-03
7 2 12 2017-09-04
8 2 4 2017-09-05
9 2 3 2017-09-06
10 2 6 2017-09-07
Where total_sleep is expressed in hours.
What I am trying to find is the absolute difference in hours of sleep between every two consecutive dates, for a given user ID. The desired output should look something like this:
> df_answer
ID total_sleep sleep_end_date diff_hours_of_sleep
1 1 9 2017-09-03 NA
2 1 8 2017-09-04 1
3 1 7 2017-09-05 1
4 1 10 2017-09-06 3
5 1 11 2017-09-07 1
6 2 5 2017-09-03 NA
7 2 12 2017-09-04 7
8 2 4 2017-09-05 8
9 2 3 2017-09-06 1
10 2 6 2017-09-08 NA
NA appears in rows 1 and 6 because it doesn't have any data concerning the day before.
Most importantly, NA appears in row 10 because I don't have any data concerning the previous day (2017-09-07). And this has been the trickiest part to code for me.
I've googled (meaning: "stackoverflowed") this and tried to find a solution using the "data wrangling cheatsheet" for dplyr, but I haven't been able to find a function that does what I want while taking both variables into account: the date and the user ID.
I am a beginner in R, so I might indeed be missing something simple. Any input or suggestion would be very welcome!
## Order data.frame by IDs, then by increasing sleep_end_dates (if not already sorted)
df <- df[order(df$ID, df$sleep_end_date),]
## Calculate difference in total_sleep with previous entry
df$diff_hours_of_sleep <- c(NA,abs(diff(df$total_sleep)))
## If previous ID is not equal, replace diff_hours_of_sleep with NA
ind <- c(NA, diff(df$ID))
df$diff_hours_of_sleep[ind != 0] <- NA
## And if previous day wasn't yesterday, replace diff_hours_of_sleep with NA (sleep_end_date must be of class Date)
day_ind <- c(NA, diff(df$sleep_end_date))
df$diff_hours_of_sleep[day_ind != 1] <- NA
Maybe the following will do it.
df <- lapply(split(df, df$ID), function(x){
  y <- ifelse(diff(x$sleep_end_date) == 1, abs(diff(x$total_sleep)), NA)
  x$diff_hours_of_sleep <- c(NA, y)
  x
})
df <- do.call(rbind, df)
df
Here is a solution using data.table -
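library(data.table)
# Assumes the ID column is named 'id' (lower case), as in the output below.
# Note: delta is signed (previous minus current) and date gaps are not checked.
dt1 <- data.table(df, key = c('id', 'sleep_end_date'))
merge(
  dt1[, .(id, total_sleep, sleep_end_date, i = .I - 1)],
  dt1[, .(id, total_sleep, i = .I)],
  by = c('id', 'i'), all.x = TRUE
)[, .(id, sleep_end_date, total_sleep.x, delta = total_sleep.y - total_sleep.x)]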
id sleep_end_date total_sleep.x delta
1: 1 2017-09-03 9 NA
2: 1 2017-09-04 8 1
3: 1 2017-09-05 7 1
4: 1 2017-09-06 10 -3
5: 1 2017-09-07 11 -1
6: 2 2017-09-03 5 NA
7: 2 2017-09-04 12 -7
8: 2 2017-09-05 4 8
9: 2 2017-09-06 3 1
10: 2 2017-09-07 6 -3
I'm not sure how the performance compares to the pure data.frame approach, but it does appear to scale well; with the input extended to 20,000 rows, this took under one second on my system.
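Since the question mentions dplyr, here is a minimal sketch of a grouped version using `lag()`. It assumes dplyr is installed and that sleep_end_date is of class Date; the last date is set to 2017-09-08, as in the desired output, so that the gap case is exercised. `gap_days` is a temporary helper column introduced only for this illustration.

```r
library(dplyr)

df <- data.frame(
  ID = rep(1:2, each = 5),
  total_sleep = c(9, 8, 7, 10, 11, 5, 12, 4, 3, 6),
  sleep_end_date = as.Date(c("2017-09-03", "2017-09-04", "2017-09-05",
                             "2017-09-06", "2017-09-07", "2017-09-03",
                             "2017-09-04", "2017-09-05", "2017-09-06",
                             "2017-09-08"))  # assumed gap on the last row
)

df_answer <- df %>%
  arrange(ID, sleep_end_date) %>%
  group_by(ID) %>%
  mutate(
    # days since the previous record of the same ID (NA for the first record)
    gap_days = as.numeric(sleep_end_date - lag(sleep_end_date)),
    # keep the difference only when the previous record is exactly one day earlier
    diff_hours_of_sleep = if_else(gap_days == 1,
                                  abs(total_sleep - lag(total_sleep)),
                                  NA_real_)
  ) %>%
  ungroup() %>%
  select(-gap_days)

df_answer
```

Because `lag()` is evaluated within each ID group, the first row of every ID is NA without any extra bookkeeping.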
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
How can I plot this in R, grouped by Department and Year, with Time on the x-axis and Counts on the y-axis? Each group should be connected by a line and drawn in its own color.
Department Year Counts Time
1 CPD 2011 24 0
2 CPD 2011 28 1
3 CPD 2011 31 2
4 APD 2012 20 0
5 APD 2012 25 2
6 APD 2012 21 3
7 CPD 2012 30 2
8 CPD 2012 26 3
9 CPD 2012 11 5
Do you mean something like this...
library(ggplot2)
df$depYr <- paste(df$Department,df$Year,sep="_") #set a combined dept_year variable
ggplot(df,aes(x=Time,y=Counts,colour=depYr,group=depYr))+geom_line()
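A self-contained variant of the same plot, as a sketch: it rebuilds the example data and uses `interaction()` inside `aes()` instead of adding a helper column. The points and the legend title are additions for readability, not something the question asked for.

```r
library(ggplot2)

df <- data.frame(
  Department = c("CPD", "CPD", "CPD", "APD", "APD", "APD", "CPD", "CPD", "CPD"),
  Year = c(2011, 2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012),
  Counts = c(24, 28, 31, 20, 25, 21, 30, 26, 11),
  Time = c(0, 1, 2, 0, 2, 3, 2, 3, 5)
)

# One line (and colour) per Department/Year combination
p <- ggplot(df, aes(x = Time, y = Counts,
                    colour = interaction(Department, Year, sep = "_"))) +
  geom_line() +
  geom_point() +
  labs(colour = "Department_Year")
p
```

Mapping colour to a discrete variable also sets the grouping, so each Department/Year combination gets its own line.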
This question already has answers here:
Find consecutive values in vector in R [duplicate]
(2 answers)
Closed 6 years ago.
I am fairly new to the art of programming (loops etc.), and I would be grateful for an opinion on whether my approach is fine or would definitely need to be optimized if it were used on a much bigger sample.
Currently I have approximately 20,000 observations, and one of the columns is the receipt ID. I would like to assign each row to a group of IDs that ascend in steps of one (n, n+1, n+2, ...). Whenever this rule is broken, a new group should start and continue until the rule is broken again.
To illustrate, let's say I have this table (an important note: IDs are not necessarily unique and can repeat, like ID 10 in my example):
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable
ID
1
2
3
4
6
7
8
10
10
11
17
18
19
200
201
202
2010
2011
2013
The result of my grouping should be following:
ID GROUP
1 1
2 1
3 1
4 1
6 2
7 2
8 2
10 3
10 3
11 3
17 4
18 4
19 4
200 5
201 5
202 5
2010 6
2011 6
2013 7
I used dplyr to sort the IDs in ascending order. Then I created the variable MyTable$GROUP, which I simply filled with 1's.
MyTable$GROUP <- rep(1, length(MyTable$ID))
for (i in 2:length(MyTable$ID)) {
  # same group if the ID repeats or increases by exactly 1
  if (MyTable$ID[i] == MyTable$ID[i-1] + 1 || MyTable$ID[i] == MyTable$ID[i-1]) {
    MyTable$GROUP[i] <- MyTable$GROUP[i-1]
  } else {
    MyTable$GROUP[i] <- MyTable$GROUP[i-1] + 1
  }
}
This code worked for me and I got the results fairly easily. However, I wonder if in eyes of more experienced programmers, this piece of code would be considered as "bad", "average", "good" or whatever rating you come up with.
EDIT: I am sure this topic has been covered already, and I am not arguing against that. The main difference is that I would like to focus on optimization here and see whether my approach meets common standards.
Thanks!
To make a long story short:
MyTable$Group <- cumsum(c(1, diff(MyTable$ID) != 1))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 11 3
#10 12 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
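diff() gives the differences between consecutive elements of MyTable$ID; every difference that is not 1 marks a "break", i.e. the start of a new group. cumsum() then adds up these breaks, so the running total is the group number. If you do not know cumsum, type ?cumsum. (The output above was produced with non-repeating IDs: rows 8 to 10 are 10, 11, 12.)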
That's all!
UPDATE: with repeating IDs, you can use this:
MyTable <- data.frame(ID = c(1,2,3,4,6,7,8,10,10,11,17,18,19,200,201,202,2010,2011,2013))
MyTable$Group <- cumsum(c(1, !(diff(MyTable$ID) %in% c(0, 1))))
# ID Group
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 6 2
#6 7 2
#7 8 2
#8 10 3
#9 10 3
#10 11 3
#11 17 4
#12 18 4
#13 19 4
#14 200 5
#15 201 5
#16 202 5
#17 2010 6
#18 2011 6
#19 2013 7
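To see why the one-liner works, the intermediate steps can be unpacked on a shortened version of the ID vector. This is only an illustrative sketch; the variable names `steps` and `is_break` are made up for this example.

```r
ids <- c(1, 2, 3, 4, 6, 7, 8, 10, 10, 11)

# Differences between consecutive IDs
steps <- diff(ids)
# 1 1 1 2 1 1 2 0 1

# A new group starts wherever the step is neither 0 (repeat) nor 1 (next ID)
is_break <- !(steps %in% c(0, 1))

# The first row opens group 1; every break increases the group number by one
group <- cumsum(c(1, is_break))
group
# 1 1 1 1 2 2 2 3 3 3
```

The whole computation is vectorised, so no explicit loop over the rows is needed.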
I have several data frames with the following structure:
january        february       march          april
Id  A  B       Id  A  B       Id  A  B       Id  A  B
 1  4  4        1  2  3        3  9  7        1  4  3
 2  3  5        2  2  7        2  2  4        4  6  2
 3  6  8        4  9  9                       2  3  5
 4  7  8
I would like to combine them into one single data frame that contains NA for the missing IDs and their corresponding attributes. The result might look like:
Id janA janB febA febB marA marB aprA aprB
1 4 4 2 3 NA NA 4 3
2 3 5 2 7 2 4 3 5
3 6 8 NA NA 9 7 NA NA
4 7 8 9 9 NA NA 6 2
Given some data:
ID<-c(1,2,3,4)
A<-c(4,3,6,7)
B<-c(4,5,8,8)
jan<-data.frame(ID,A,B)
ID<-c(1,2,4)
A<-c(2,2,9)
B<-c(3,7,9)
feb<-data.frame(ID,A,B)
ID<-c(3,2)
A<-c(9,2)
B<-c(7,4)
mar<-data.frame(ID,A,B)
ID<-c(1,4,2)
A<-c(4,6,3)
B<-c(3,2,5)
apr<-data.frame(ID,A,B)
What I have tried:
test <- rbind(jan, feb,mar,apr)
test <- rbind.fill(jan, feb, mar,apr)
You can use merge within Reduce.
First, let's prepare a list with the data and change the column names to janA, janB, febA, ...
list_df <- list(
  jan = jan,
  feb = feb,
  mar = mar,
  apr = apr
)
list_df <- lapply(names(list_df), function(name_month){
  df_month <- list_df[[name_month]]
  names(df_month)[-1] <- paste0(name_month, names(df_month)[-1])
  df_month
})
Reduce will merge all of them.
Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), list_df)
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I have a large data frame and part of it look like this:
id carbon nitrogen sulfer
1 1 NA NA NA
2 1 NA NA NA
3 1 5 6 78
4 2 NA NA NA
5 2 NA NA NA
6 2 8 8 67
7 3 NA NA NA
8 3 NA NA NA
9 3 NA NA NA
10 3 7 9 55
Is there a way to fill the NA values in each column (for example carbon) with the single non-missing value available for the same id? For example, for id == 1 the column carbon has the value 5, so the NAs for that id need to be replaced with 5.
I have nearly 200 columns to fill like this.
Any help to automate this is much appreciated.
You haven't really clarified the issues pointed out by @flodel in his comments. Anyway, given the pattern in your example data, i.e. all values of carbon, nitrogen and sulfer are missing except the last within each id, you can try this:
library(zoo)
na.locf(df, na.rm = FALSE, fromLast = TRUE)
# id carbon nitrogen sulfer
# 1 1 5 6 78
# 2 1 5 6 78
# 3 1 5 6 78
# 4 2 8 8 67
# 5 2 8 8 67
# 6 2 8 8 67
# 7 3 7 9 55
# 8 3 7 9 55
# 9 3 7 9 55
# 10 3 7 9 55
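Try this; I hope it is what you need. If not, update the question and I will update the answer. By the way, this has been asked a few times already.
myFunc <- function(value) {
  if (value[1] == 1) {
    value[2] <- 5
    value[3] <- 5
    value[4] <- 5
  }
  # put your other conditions here
  value  # return the modified row
}
# df is the data frame from the question; apply() returns the rows as columns,
# so transpose the result back (it is a matrix, not a data frame)
t(apply(df, 1, myFunc))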
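For many columns, the fill can also be done per id without hard-coding any values. Below is a minimal base R sketch, assuming each id has at most one distinct non-missing value per column (as in the example data). `fill_group` is a hypothetical helper written for this illustration.

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  carbon = c(NA, NA, 5, NA, NA, 8, NA, NA, NA, 7),
  nitrogen = c(NA, NA, 6, NA, NA, 8, NA, NA, NA, 9),
  sulfer = c(NA, NA, 78, NA, NA, 67, NA, NA, NA, 55)
)

# Hypothetical helper: replace the NAs of one group with its first known value;
# groups without any known value are left untouched
fill_group <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
}

# Apply the helper to every column except id, separately for each id
value_cols <- setdiff(names(df), "id")
df[value_cols] <- lapply(df[value_cols], function(col) {
  ave(col, df$id, FUN = fill_group)
})
df
```

Unlike a plain `na.locf()` over the whole data frame, this never carries a value across the boundary between two ids, and it works for any number of columns.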