How to update and replace part of old data - r

I want to merge the df OldData and NewData.
In this case, Nov-2015 and Dec 2015 are present in both df.
Since NewData is the most accurate update available, I want to update the value of Nov-2015 and Dec 2015 using the value in df NewData and of course adding the records of Jan-2016 and Feb-2016 as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks for #akrun, the problem is solved, and the following code works smoothly!!
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Update: Now, let's upgrade our problem little bit.
suppose our OldData and NewData have another column called "Type".
How do we merge/update it this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
The JoinData will suppose to update all value from NewData ass following:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from the NewData)
12 2015-12 C 45 (Updated the value by NewData)
13 2015-12 D 22 (Updated the value by NewData)
14 2016-01 A 32 (newly added from NewData)
15 2016-02 A 11 (newly added from NewData)
Thanks for #akrun: I have got the solution here for the second question as well.
Thanks for the help for everyone here!
Here is the answer:
d1 <- merge(OldData, NewData, by = c("Month","Type"), all = TRUE);d2 <- transform(d1, Value.x= ifelse(!is.na(Value.y), Value.y, Value.x))[-4];d2[!duplicated(d2[1:2], fromLast=TRUE),]

Here is an option using data.table (similar approach as #thelatemail mentioned in the comments)
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Or
rbindlist(list(OldData, NewData))[,if(.N >1) .SD[.N] else .SD, Month]

Related

Slide along data frame rows and compare rows with next rows

I guess something similar should have been asked before, however I could only find an answer for python and SQL. So please notify me in the comments when this was also asked for R!
Data
Let's say we have a dataframe like this:
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
# In cause you do not get the same dataframe see the comment by #Ian Campbell - thanks!
position value
1 1 27
2 2 37
3 3 57
4 4 89
5 5 20
6 6 86
7 7 97
8 8 62
9 9 58
10 10 6
11 11 19
12 12 16
13 13 61
14 14 34
15 15 67
16 16 43
17 17 88
18 18 83
19 19 32
20 20 63
Goal
I'm interested in calculating the average value for n positions and subtract this from the average value of the next n positions, let's say n=5 for now.
What I tried
I now used this method, however when I apply this to a bigger dataframe it takes a huge amount of time, and hence wonder if there is a faster method for this.
calc <- function( pos ) {
this.five <- df %>% slice(pos:(pos+4))
next.five <- df %>% slice((pos+5):(pos+9))
differ = mean(this.five$value)- mean(next.five$value)
data.frame(dif= differ)
}
df %>%
group_by(position) %>%
do(calc(.$position))
That produces the following table:
position dif
<int> <dbl>
1 1 -15.8
2 2 9.40
3 3 37.6
4 4 38.8
5 5 37.4
6 6 22.4
7 7 4.20
8 8 -26.4
9 9 -31
10 10 -35.4
11 11 -22.4
12 12 -22.3
13 13 -0.733
14 14 15.5
15 15 -0.400
16 16 NaN
17 17 NaN
18 18 NaN
19 19 NaN
20 20 NaN
I suspect a data.table approach may be faster.
library(data.table)
setDT(df)
df[,c("roll.position","rollmean") := lapply(.SD,frollmean,n=5,fill=NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[,.(position,value,rollmean,result)]
# position value rollmean result
# 1: 1 27 46.0 -15.8
# 2: 2 37 57.8 9.4
# 3: 3 57 69.8 37.6
# 4: 4 89 70.8 38.8
# 5: 5 20 64.6 37.4
# 6: 6 86 61.8 22.4
# 7: 7 97 48.4 4.2
# 8: 8 62 32.2 -26.4
# 9: 9 58 32.0 -31.0
#10: 10 6 27.2 -35.4
#11: 11 19 39.4 -22.4
#12: 12 16 44.2 NA
#13: 13 61 58.6 NA
#14: 14 34 63.0 NA
#15: 15 67 62.6 NA
#16: 16 43 61.8 NA
#17: 17 88 NA NA
#18: 18 83 NA NA
#19: 19 32 NA NA
#20: 20 63 NA NA
Data
RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame( position = 1:20,value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")

Adding a column according to norm data in R

I have a longitudinal dataset in the long form with the length of around 2800, with around 400 participants in total. Here's a sample of my data.
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 30 2 65 16
#8 1003 2 30 2 66 16
#9 1003 3 29 2 67 16
#10 1003 4 28 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
I want to create a new column "cutoff" with values "Normal" or "Impaired" because my outcome data, "score" has a cutoff score indicating impairment according to norm. The norm consists of different -1SD measures(the cutoff point) according to Sex, Edu(year of education), and Age.
Below is what I'm currently doing, checking an excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.
data$cutoff <- ifelse(data$sex==1 & data$age<70
& data$edu<3
& data$score<19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex==2 & data$age<70
& data$edu<3
& data$score<18.39, "Impaired", "Normal")
Additionally, I am wondering if I can import an excel file stating the norm, and create a column according to the values in it.
The excel file has a structure as shown below.
# Sex Male Female
#60-69 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
#Age Number 22 51 119 72 130 138 106 51
# Mean 24.45 26.6 27.06 27.83 23.31 25.86 27.26 28.09
# SD 3.03 1.89 1.8 1.53 3.28 2.55 1.85 1.44
# -1.5SD' 19.92 23.27 23.76 24.8 18.53 21.81 23.91 25.15
#70-79 Edu(yr) 0-3 4-6 7-12 13>= 0-3 4-6 7-12 13>=
....
I have created new columns "agecat" and "educat," allocating each ID into a group of age and education used in the norm. Now I want to make use of these columns, matching it with rows and columns of the excel file above. One of the motivations is to create a code that can be used for further research using the test scores of my data.
I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:
merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)
Then you can compare the columns directly.
If I understand correctly, the OP wants to mark a certain type of outliers in his dataset. So, there are two tasks here:
Compute the statistics mean(score), sd(score), and cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category edcat.
Find all rows where score is lower than the cutoff value for the particular group.
As already mentioned by hannes101, the second step can be implemented by a non-equi join.
library(data.table)
# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score),
m1.5SD = mean(score) - 1.5 * sd(score)),
by = .(sex, agecat, educat)]
# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]
mydata
ID wave score sex age edu agecat educat cutoff
1: 1001 1 28 1 69 12 [60,70) [7,13) Normal
2: 1001 2 27 1 70 12 [70,80) [7,13) Normal
3: 1001 3 28 1 71 12 [70,80) [7,13) Normal
4: 1001 4 26 1 72 12 [70,80) [7,13) Normal
5: 1002 1 30 2 78 9 [70,80) [7,13) Normal
6: 1002 3 30 2 80 9 [80,90) [7,13) Normal
7: 1003 1 33 2 65 16 [60,70) [13,Inf) Normal
8: 1003 2 32 2 66 16 [60,70) [13,Inf) Normal
9: 1003 3 31 2 67 16 [60,70) [13,Inf) Normal
10: 1003 4 24 2 68 16 [60,70) [13,Inf) Impaired
11: 1004 1 22 2 85 4 [80,90) [4,7) Normal
12: 1005 1 20 2 60 9 [60,70) [7,13) Normal
13: 1005 2 18 1 61 9 [60,70) [7,13) Normal
14: 1006 1 22 1 74 9 [70,80) [7,13) Normal
15: 1006 2 23 1 75 9 [70,80) [7,13) Normal
16: 1006 3 25 1 76 9 [70,80) [7,13) Normal
17: 1006 4 19 1 77 9 [70,80) [7,13) Normal
18: 1007 1 33 2 65 16 [60,70) [13,Inf) Normal
19: 1007 2 32 2 66 16 [60,70) [13,Inf) Normal
20: 1007 3 31 2 67 16 [60,70) [13,Inf) Normal
21: 1007 4 31 2 68 16 [60,70) [13,Inf) Normal
ID wave score sex age edu agecat educat cutoff
In this made-up example there is only one row which meets the "Impaired" conditions.
Likewise, the statistics is rather sparsely populated:
cutoffs
sex agecat educat N Mean SD m1.5SD
1: 1 [60,70) [7,13) 2 23.00000 7.071068 12.39340
2: 1 [70,80) [7,13) 7 24.28571 3.147183 19.56494
3: 2 [70,80) [7,13) 1 30.00000 NA NA
4: 2 [80,90) [7,13) 1 30.00000 NA NA
5: 2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6: 2 [80,90) [4,7) 1 22.00000 NA NA
7: 2 [60,70) [7,13) 1 20.00000 NA NA
Data
OP's sample dataset has been modified in one group for demonstration.
library(data.table)
mydata <- fread("
# ID wave score sex age edu
#1 1001 1 28 1 69 12
#2 1001 2 27 1 70 12
#3 1001 3 28 1 71 12
#4 1001 4 26 1 72 12
#5 1002 1 30 2 78 9
#6 1002 3 30 2 80 9
#7 1003 1 33 2 65 16
#8 1003 2 32 2 66 16
#9 1003 3 31 2 67 16
#10 1003 4 24 2 68 16
#11 1004 1 22 2 85 4
#12 1005 1 20 2 60 9
#13 1005 2 18 1 61 9
#14 1006 1 22 1 74 9
#15 1006 2 23 1 75 9
#16 1006 3 25 1 76 9
#17 1006 4 19 1 77 9
#18 1007 1 33 2 65 16
#19 1007 2 32 2 66 16
#20 1007 3 31 2 67 16
#21 1007 4 31 2 68 16
", drop = 1L)

Optimizing preprocessing data frame in R

I have the following data frame with the name dataValues:
dates hours
1 2015-10-12 1
5 2015-10-12 5
9 2015-10-12 9
11 2015-10-12 11
14 2015-10-12 14
15 2015-10-12 15
17 2015-10-12 17
19 2015-10-12 19
22 2015-10-12 22
23 2015-10-12 23
24 2015-10-12 24
27 2015-10-13 3
29 2015-10-13 5
33 2015-10-13 9
36 2015-10-13 12
37 2015-10-13 13
38 2015-10-13 14
40 2015-10-13 16
42 2015-10-13 18
44 2015-10-13 20
45 2015-10-13 21
46 2015-10-13 22
47 2015-10-13 23
49 2015-10-14 1
54 2015-10-14 6
56 2015-10-14 8
59 2015-10-14 11
60 2015-10-14 12
61 2015-10-14 13
63 2015-10-14 15
64 2015-10-14 16
66 2015-10-14 18
69 2015-10-14 21
71 2015-10-14 23
72 2015-10-14 24
I have preprocessed this data frame to get all hours on a certain day, which is variable totallist and has output:
[[1]]
[1] 1 5 9 11 14 15 17 19 22 23 24
[[2]]
[1] 3 5 9 12 13 14 16 18 20 21 22 23
[[3]]
[1] 1 6 8 11 12 13 15 16 18 21 23 24
The code I used for this is the following:
uniqueDates <- unique(dataValues$dates)
totallist <- {}
for(date in uniqueDates){
templist <- {}
for(i in 1:length(dataValues$dates)){
if(dataValues$dates[i]==date){
newlist <- append(templist,dataValues$hours[i])
}
}
totallist <- append(totallist,list(templist))
}
For the example in this question (with 3 days) it works fine and the result is what I want, but if I use this on a large dataset (which has about 260 days), it takes about 6 to 7 minutes to finish.
My question is if there is an optimized way to do what I want?
Try any of these:
# 1
with(unique(dataValues), split(hours, dates))
# 1a - variation of last solution
with(dataValues, lapply(split(hours, dates), unique))
# 2
unstack(unique(dataValues), hours ~ dates)
# 2a - variation of last solution
lapply(unstack(dataValues, hours ~ dates), unique)
Note that if the data values are known to be unique already, as is the case in the sample data shown in the question, then unique(dataValues) in #1 and #2 could be replaced with just dataValues.
I believe you would be better by using the tapply function. I've created a simpler dataframe just to show what it is doing:
df <- data.frame(dates=rep(c("2015-01-02","2015-01-03","2015-01-04"),10),hours=trunc(runif(30,1,10)))
tapply(df$hours,df$dates,unique)
Output:
$`2015-01-02`
[1] 2 8 6 1 5
$`2015-01-03`
[1] 7 5 2 3
$`2015-01-04`
[1] 1 2 6 5 8 4 9

how to fix "undefined columns selected" for network meta-analysis in R?

I am conducting a network meta-analysis on R with two packages, gemtc and rjags. However, when I type
Model <- mtc.model (network, linearmodel=’fixed’).
R always returns “
Error in [.data.frame(data, sel1 | sel2, columns, drop = FALSE) :
undefined columns selected In addition: Warning messages: 1: In
mtc.model(network, linearModel = "fixed") : Likelihood can not be
inferred. Defaulting to normal. 2: In mtc.model(network, linearModel =
"fixed") : Link can not be inferred. Defaulting to identity “
How to fix this problem? Thanks!
I am attaching my codes and data here:
SAE <- read.csv(file.choose(),head=T, sep=",")
head(SAE)
network <- mtc.network(data.ab=SAE)
summary(network)
plot(network)
model.fe <- mtc.model (network, linearModel="fixed")
plot(model.fe)
summary(model.fe)
cat(model.fe$code)
model.fe$data
# run this model
result.fe <- mtc.run(model.fe, n.adapt=0, n.iter=50)
plot(result.fe)
gelman.diag(result.fe)
result.fe <- mtc.run(model.fe, n.adapt=1000, n.iter=5000)
plot(result.fe)
gelman.diag(result.fe)
following is my data: SAE
study treatment responder sample.size
1 1 3 0 76
2 1 30 2 72
3 2 3 99 1389
4 2 23 132 1383
5 3 1 6 352
6 3 30 2 178
7 4 2 6 106
8 4 30 3 95
9 5 3 49 393
10 5 25 18 198
11 6 1 20 65
12 6 22 10 26
13 7 1 1 76
14 7 30 3 76
15 8 3 7 441
16 8 26 1 220
17 9 2 1 47
18 9 30 0 41
19 10 3 10 156
20 10 30 9 150
21 11 1 4 85
22 11 25 5 85
23 11 30 4 84
24 12 3 6 152
25 12 30 5 160
26 13 18 4 158
27 13 21 8 158
28 14 1 3 110
29 14 30 2 111
30 15 3 3 83
31 15 30 1 92
32 16 1 3 124
33 16 22 6 123
34 16 30 4 125
35 17 3 236 1553
36 17 23 254 1546
37 18 6 5 398
38 18 7 6 403
39 19 1 64 588
40 19 22 73 584
How about reading the manual ?mtc.model. It clearly states the following:
Required columns [responders, sampleSize]
So your responder variable should be responders and your sample.size variable should be sampleSize.
Next, your plot(network) should help you determine that some comparisons can not be made. In your data, there are 2 subgroups of trials that were compared. Treatment 18 and 21 were not compared with any of the others. Therefore you can only do a meta-analysis of 21 and 18 or a network meta-analysis of the rest.
network <- mtc.network(data.ab=SAE[!SAE$treatment %in% c(21, 18), ])
model.fe <- mtc.model(network, linearModel="fixed")

Transforming long format data to short format by segmenting dates that include redundant observations

I have a data set that is long format and includes exact date/time measurements of 3 scores on a single test administered between 3 and 5 times per year.
ID Date Fl Er Cmp
1 9/24/2010 11:38 15 2 17
1 1/11/2011 11:53 39 11 25
1 1/15/2011 11:36 39 11 39
1 3/7/2011 11:28 95 58 2
2 10/4/2010 14:35 35 9 6
2 1/7/2011 13:11 32 7 8
2 3/7/2011 13:11 79 42 30
3 10/12/2011 13:22 17 3 18
3 1/19/2012 14:14 45 15 36
3 5/8/2012 11:55 29 6 11
3 6/8/2012 11:55 74 37 7
4 9/14/2012 9:15 62 28 18
4 1/24/2013 9:51 82 45 9
4 5/21/2013 14:04 135 87 17
5 9/12/2011 11:30 98 61 18
5 9/15/2011 13:23 55 22 9
5 11/15/2011 11:34 98 61 17
5 1/9/2012 11:32 55 22 17
5 4/20/2012 11:30 23 4 17
I need to transform this data to short format with time bands based on month (i.e. Fall=August-October; Winter=January-February; Spring=March-May). Some bands will include more than one observation per participant, and as such, will need a "spill over" band. An example transformation for the Fl scores below.
ID Fall1Fl Fall2Fl Winter1Fl Winter2Fl Spring1Fl Spring2Fl
1 15 NA 39 39 95 NA
2 35 NA 32 NA 79 NA
3 17 NA 45 NA 28 74
4 62 NA 82 NA 135 NA
5 98 55 55 NA 23 NA
Notice that dates which are "redundant" (i.e. more than 1 Aug-Oct observation) spill over into Fall2fl column. Dates that occur outside of the desired bands (i.e. November, December, June, July) should be deleted. The final data set should have additional columns that include Fl Er and Cmp.
Any help would be appreciated!
(Link to .csv file with long data http://mentor.coe.uh.edu/Data_Example_Long.csv )
This seems to do what you are looking for, but doesn't exactly match your desired output. I haven't looked at your sample data to see whether the problem lies with your sample desired output or the transformations I've done, but you should be able to follow along with the code to see how the transformations were made.
## Convert dates to actual date formats
mydf$Date <- strptime(gsub("/", "-", mydf$Date), format="%m-%d-%Y %H:%M")
## Factor the months so we can get the "seasons" that you want
Months <- factor(month(mydf$Date), levels=1:12)
levels(Months) <- list(Fall = c(8:10),
Winter = c(1:2),
Spring = c(3:5),
Other = c(6, 7, 11, 12))
mydf$Seasons <- Months
## Drop the "Other" seasons
mydf <- mydf[!mydf$Seasons == "Other", ]
## Add a "Year" column
mydf$Year <- year(mydf$Date)
## Add a "Times" column
mydf$Times <- as.numeric(ave(as.character(mydf$Seasons),
mydf$ID, mydf$Year, FUN = seq_along))
## Load "reshape2" and use `dcast` on just one variable.
## Repeat for other variables by changing the "value.var"
dcast(mydf, ID ~ Seasons + Times, value.var="Fluency")
# ID Fall_1 Fall_2 Winter_1 Winter_2 Spring_2 Spring_3
# 1 1 15 NA 39 39 NA 95
# 2 2 35 NA 32 NA 79 NA
# 3 3 17 NA 45 NA 29 NA
# 4 4 62 NA 82 NA 135 NA
# 5 5 98 55 55 NA 23 NA

Resources