I am trying to calculate a correlation for my data frame df3, which looks like this:
group a b
1 01_01-102_PRT 0.5857299 1.0915944
2 01_1014_EMH -0.8875033 0.9982261
3 02_02-012_ABT 1.5402289 1.0095046
4 02_02-028B_TMA -0.2635421 0.9533909
5 02_097A_KMG 0.1529145 1.0452099
6 02_116_DMC 0.7375643 0.9927591
My code:
require(plyr)
func <- function(df3) {
  return(data.frame(COR = cor(df3$a, df3$b)))
}
too <- ddply(df3, .(group), func)
My output
group COR
1 01_01-102_PRT NA
2 01_1014_EMH NA
3 02_02-012_ABT NA
4 02_02-028B_TMA NA
5 02_097A_KMG NA
....
I have also tried other ways given here https://stats.stackexchange.com/questions/4040/r-compute-correlation-by-group but I always get NAs.
Help please
Thanks
Jason
It appears that each group consists of exactly one row, and therefore of a single a and a single b value. You cannot calculate a correlation when there is no variation in the data. Hence, you need at least two rows per group, with at least two different values for both a and b.
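You can check this quickly - a minimal sketch, assuming df3 as shown above:
table(df3$group)         # each group appears exactly once
cor(df3$a[1], df3$b[1])  # cor() of two single values is NA (zero variance), with a warning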
I have a large data set, and some cases are missing a variable here and there, but I have some cases where participants answered no questions at all, or only 1 out of 150 questions. Is there a way to get rid of cases missing more than x variables, but keep cases with only a few missing values? For example:
case k1 k2 k3
1 na 2 3
2 3 1 5
3 1 na 2
4 na na na
So in this case, I want a formula that would remove case 4 only. Any ideas?
Try this example, where your matrix is named yourMatrix and you allow fewer than 2 missing values per case.
# Number of missing values at which to start removing cases
nMissing <- 2
foo <- apply(yourMatrix, 1, function(x) sum(is.na(x)))  # NA count per row
yourMatrix[foo < nMissing, ]  # keep only rows below the threshold
So this is what worked best for me.
MyDataset2 <- MYDataset
nMissing <- 23
foo <- rowSums(is.na(MyDataset2))          # NA count per row
MyDataset2 <- MyDataset2[foo < nMissing, ] # keep cases with fewer than nMissing NAs
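Applied to the small example above, both versions keep cases 1-3 and drop case 4 - a quick sketch, with the "na" entries read in as NA:
yourMatrix <- rbind(c(NA, 2, 3),
                    c(3, 1, 5),
                    c(1, NA, 2),
                    c(NA, NA, NA))
nMissing <- 2
yourMatrix[rowSums(is.na(yourMatrix)) < nMissing, ]  # rows 1-3 survive, row 4 is dropped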
I have a longitudinal dataset with many missing values that I would like automatically imputed in R based on the 'last observed value' carried forward, and the 'next observed value' carried backwards. Similar questions have been asked previously, but I would like to add specific conditions for imputation based on the length of the gaps.
The following data frame (wide format) demonstrates the issue:
miss.df <- data.frame(id = c('A','B','C','D','E'),
w1 = c(1,1,2,2,1),
w2 = c(1,NA,NA,2,NA),
w3 = c(NA,NA,NA,NA,2),
w4 = c(1,NA,NA,NA,NA),
w5 = c(1,2,NA,1,3),
w6 = c(1,2,1,NA,NA))
Like so:
id w1 w2 w3 w4 w5 w6
1 A 1 1 NA 1 1 1
2 B 1 NA NA NA 2 2
3 C 2 NA NA NA NA 1
4 D 2 2 NA NA 1 NA
5 E 1 NA 2 NA 3 NA
Please note that the data is in wide format, so w1 is the first wave, etc. The first wave is complete with no missings. The values are the numeric values for a categorical variable (political party preference). There is no order to the categories. This data frame therefore consists of information on only one variable, on five individuals across six waves.
The conditions I would like are as follows:
If the gap consists of only one missing, carry last observed value forward, including cases where the gap is in the final wave.
If the gap is an even number of missings (id = C, for instance), then carry forward and carry back so that the values 'meet in the middle'. As such, it is assumed that the individual transitioned (i.e. changed category) half-way through.
If the gap is an odd number of missings (id = B, for instance), then carry forward and carry back to meet in the middle, as point 2, but the exact middle value is imputed as the carry forward value.
If one was to run a loop with the above conditions, the data frame would look like this:
id w1 w2 w3 w4 w5 w6
1 A 1 1 1 1 1 1
2 B 1 1 1 2 2 2
3 C 2 2 2 1 1 1
4 D 2 2 2 1 1 1
5 E 1 1 2 2 3 3
Thanks in advance.
Hmm. Tricky. And I don't know of any useful R generic for filling in NAs. In the end I thought the easiest way was a good old for loop. The logic is to fill in one from the left, then one from the right, and to repeat this until everything is filled in. Not very R at all - it could practically be C code - but should be fine unless you have a zillion rows.
fill_in_old_skool <- function(r) {
  while (anyNA(r)) {
    # forward sweep: fill each NA with the value its left neighbour held
    # *before* this sweep, so each sweep advances the fill by one position
    for (idx in seq_along(r)) {
      val <- r[idx]
      if (is.na(r[idx]) && idx > 1) r[idx] <- lastval
      lastval <- val
    }
    # backward sweep: the same, from the right
    for (idx in rev(seq_along(r))) {
      val <- r[idx]
      if (is.na(r[idx]) && idx < length(r)) r[idx] <- lastval
      lastval <- val
    }
    # because the forward sweep runs first each time, the middle value of an
    # odd-length gap ends up with the carried-forward value, as required
  }
  r
}
miss.df[,-1] <- t(apply(miss.df[,-1], 1, fill_in_old_skool))
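Applying this to miss.df reproduces the desired output above:
miss.df
#   id w1 w2 w3 w4 w5 w6
# 1  A  1  1  1  1  1  1
# 2  B  1  1  1  2  2  2
# 3  C  2  2  2  1  1  1
# 4  D  2  2  2  1  1  1
# 5  E  1  1  2  2  3  3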
The imputeTS package has a function that is very similar to what you want to do.
The function is called na_ma(x, k = 2, weighting = "simple").
Missing Value Imputation by Weighted Moving Average
Basically what it does for you is:
If you give it a time series x, it takes a moving average over the k values on either side of each NA and uses that average as the imputed value.
Not exactly what you described, but I think it might resemble the idea behind your proposed procedure.
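For example - a minimal sketch, assuming the imputeTS package is installed (note the imputed values are averages, so they need not be valid category codes):
library(imputeTS)
na_ma(c(1, NA, NA, NA, 2, 2), k = 2, weighting = "simple")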
Please have a close look at the example data sets and desired outcome to see the purpose of this question. It is not a data set merging solution that I am looking for, so I could find the answer neither here: How to join (merge) data frames (inner, outer, left, right)?, nor here: Use apply() to assign value to new column. Rather, it is about assigning values under new column names when a condition is met.
Here a reproducible illustration of what I would like to do:
Email <- as.factor(c("1#1.com", "2#2.com", "3#3.com","4#4.com", "5#5.com"))
dataset1 <- data.frame(Email)
Code <- as.factor(c("Z001", "Z002", "Z003","Z004","Z005"))
Email <- as.factor(c("x#x.com", "2#2.com", "y#y.com", "1#1.com","z#z.com"))
dataset2 <- data.frame(Code, Email)
This results in the following example datasets:
Email
1 1#1.com
2 2#2.com
3 3#3.com
4 4#4.com
5 5#5.com
Code Email
1 Z001 x#x.com
2 Z002 2#2.com
3 Z003 y#y.com
4 Z004 1#1.com
5 Z005 z#z.com
Desired output:
Email Z002 Z004
1 1#1.com NA 1
2 2#2.com 1 NA
3 3#3.com NA NA
4 4#4.com NA NA
5 5#5.com NA NA
So I would like to write a loop that checks whether an Email from dataset2 occurs in dataset1 and, if it does, adds the Code associated with that Email in dataset2 as a new column name in dataset1, with 1 as the cell value for the matching observation. My attempt below and the example of the desired output should clarify the question.
My own attempt to fix it (I know it is wrong, but shows my intention):
for (i in 1:nrow(dataset2)) {
  if (dataset2$Email[i] %in% dataset1$Email)
    dataset1[, dataset2$Code[i]] <- dataset2$Code[i]
  dataset1[, dataset2$Code[i]][i] <- 1
}
Would be great if anyone could help me out.
Your dataset2 is in "long" format - changing the Code column into multiple columns is changing it to "wide" format. So in addition to the join, we also need to convert from long to wide - this R-FAQ is a good read on that. Combining these two operations, we do this:
dat = merge(dataset1, dataset2, all.x = T) ## left join
dat$value = 1 ## add the value we want in the result
## convert long to wide
result = reshape2::dcast(dat, Email ~ Code, value.var = "value", drop = T)
result["NA"] = NULL ## remove the NA column that is added
result
# Email Z002 Z004
# 1 1#1.com NA 1
# 2 2#2.com 1 NA
# 3 3#3.com NA NA
# 4 4#4.com NA NA
# 5 5#5.com NA NA
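For completeness, the loop from your attempt can also be made to work - a sketch along the same lines, in base R:
for (i in seq_len(nrow(dataset2))) {
  if (dataset2$Email[i] %in% dataset1$Email) {
    code <- as.character(dataset2$Code[i])
    dataset1[[code]] <- NA  # create the new column, all NA
    dataset1[[code]][dataset1$Email == dataset2$Email[i]] <- 1  # mark the match
  }
}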
Thank you for all your help! I am working with time-series data and am trying to identify how many periods ago the minimum of a rolling window occurred, using the rollapply function in R. To clarify, here is some code:
# Sample Data
library(xts)  # provides as.xts(); also loads zoo, which provides rollapply()
dates <- c("2014-01-01","2014-01-02","2014-01-03","2014-01-04","2014-01-05",
"2014-01-06","2014-01-07","2014-01-08","2014-01-09","2014-01-10")
data <- c(20,12,31,26,22,22,31,10,22,23)
xts.object <- as.xts(data,as.Date(dates))
# Apply 4-Day Min
rollMin <- rollapply(xts.object,4,min)
xts.object2 <- cbind(xts.object,rollMin)
# Desired Output
desiredOutput <- c(NA,NA,NA,3,4,1,2,1,2,3)
xts.object3 <- cbind(xts.object2,desiredOutput)
colnames(xts.object3) <- c("data","rollMin","desiredOutput")
The first 3 observations of desiredOutput are NA because the window size for rollapply is set to 4. On the 4th observation the minimum was 12, and that had been the minimum for 3 days, so desiredOutput shows 3 on 2014-01-04.
Thanks, again!
You can use rollapply here as well. which.min returns the index of the first occurrence of the minimal value within the window. To get the number of days, subtract that index from the window size and add one (because indices in R start at 1).
rollapply(xts.object, 4, function(x) NROW(x) - which.min(x) + 1)
# [,1]
#2014-01-01 NA
#2014-01-02 NA
#2014-01-03 NA
#2014-01-04 3
#2014-01-05 4
#2014-01-06 2
#2014-01-07 3
#2014-01-08 1
#2014-01-09 2
#2014-01-10 3
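Note that which.min picks the first occurrence of the minimum, which is why 2014-01-06 and 2014-01-07 differ from your desiredOutput (the minimum 22 appears twice in those windows). If ties should instead be counted from the most recent occurrence, a sketch:
rollapply(xts.object, 4, function(x) NROW(x) - max(which(x == min(x))) + 1)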
I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column, where the subsets are based on a specified granularity. For example, for the following data frame df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 per subset. The results should be in two new columns.
      A B    expected results:  newA  newB
0.22096 1                       0     1.142857
0.33489 1                       1     2
0.33655 1                       2     4
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5
This is a simple example; I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B for each subset.
I tried the following to subset, but couldn't figure out how to collect the results and build the final data frame:
increment <- 1
mx <- max(df$A)
i <- 0
newDF <- data.frame()
while (i < mx) {
  tmp <- subset(df, A > i & A < (i + increment))
  # ... somehow compute mean(tmp$B) and append it to newDF ...
  i <- i + increment
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval for the subset selection (in your example a simple ceiling of each A value would be sufficient too, but if your increment is different from 1 you need findInterval) and tapply to calculate the mean:
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
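To get the result in the two requested columns, the tapply output can be wrapped into a data frame - a sketch:
res <- tapply(df$B, df$subset, mean)
data.frame(newA = as.numeric(names(res)), newB = as.numeric(res))
#   newA     newB
# 1    0 1.142857
# 2    1 2.000000
# 3    2 4.000000
Since the increment here is 1, tapply(df$B, ceiling(df$A), mean) gives the same means, just with group labels 1, 2, 3 instead of 0, 1, 2.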