I would like to return values with matching conditions in another column based on a cut score criterion. If a cut score is not available in the variable, I would like to grab the closest larger value. Here is a snapshot of the dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores.a <- c(512,531,541,555,562,565,570,572,573,588)
scores.b <- c(12,13,14,15,16,17,18,19,20,21)
data <- data.frame(ids, scores.a, scores.b)
> data
ids scores.a scores.b
1 1 512 12
2 2 531 13
3 3 541 14
4 4 555 15
5 5 562 16
6 6 565 17
7 7 570 18
8 8 572 19
9 9 573 20
10 10 588 21
cuts <- c(531, 560, 571)
I would like to grab the scores.b value corresponding to the first cut score (531), which is 13. Then, grab the scores.b value corresponding to the second cut score (560); it is not in scores.a, so I would like to use the closest larger scores.a value, 562, whose corresponding value is 16. Lastly, for the third cut score (571), I would like to get 19, which is the value corresponding to the closest larger value (572).
Here is what I would like to get.
scores.b
cut.1 13
cut.2 16
cut.3 19
Any thoughts?
Thanks
We can use a rolling join
library(data.table)
setDT(data)[data.table(cuts = cuts), .(ids = ids, cuts, scores.b),
on = .(scores.a = cuts), roll = -Inf]
# ids cuts scores.b
#1: 2 531 13
#2: 5 560 16
#3: 8 571 19
Or, another option is findInterval from base R, after negating the values and reversing their order:
with(data, scores.b[rev(nrow(data) + 1 - findInterval(rev(-cuts), rev(-scores.a)))])
#[1] 13 16 19
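The same closest-larger lookup can also be sketched in plain base R without the negate-and-reverse trick, by taking, for each cut, the first score at or above it (this relies on scores.a being sorted ascending):

```r
scores.a <- c(512, 531, 541, 555, 562, 565, 570, 572, 573, 588)
scores.b <- 12:21
cuts <- c(531, 560, 571)

# index of the first score at or above each cut (scores.a is sorted ascending)
idx <- sapply(cuts, function(ct) which(scores.a >= ct)[1])
scores.b[idx]
#[1] 13 16 19
```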
This doesn't remove the other columns, which makes it easier to verify that the results are correct:
df1 <- data[match(seq_along(cuts), findInterval(data$scores.a, cuts)), ]
rownames(df1) <- paste("cuts", seq_along(cuts), sep = ".")
> df1
ids scores.a scores.b
cuts.1 2 531 13
cuts.2 5 562 16
cuts.3 8 572 19
Related
I have a twin dataset in which there is one column called wpsum and another column, family-id, which is the same for corresponding twin pairs.
wpsum family-id
twin 1 14 220
twin 2 18 220
I want to calculate the correlation between the wpsum values of those with the same family-id. There are also some single family-ids, where one twin did not take part in the re-survey. family-id is a character.
There's no correlation between wpsum of those with the same family-id, as you put it, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment), but you can get the difference in wpsum scores within the groups. Maybe that's what you meant by correlation. Here's how to get those (I changed and expanded your example):
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
family_id = c("220","220","221","221","222","222","223"))
dat
wpsum family_id
1 14 220
2 18 220
3 20 221
4 5 221
5 10 222
6 NA 222
7 1 223
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA
You can make a data.frame with this new variable of differences like so:
diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
diffs family_id
1 4 220
2 15 221
3 NA 222
4 NA 223
Note that neither missing values nor missing observations are a (coding) problem here - they just result in missing differences without error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
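For that more general case, one sketch (my own, not from the answer above) is to take the within-group range, max minus min, which equals the absolute pairwise difference when there are exactly two observations and still works for larger groups:

```r
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220", "220", "221", "221", "222", "222", "223"))

# max - min within each family; NA when a group has fewer than 2 rows
# or contains a missing score
rng <- tapply(dat$wpsum, dat$family_id, function(x) {
  if (length(x) < 2) NA_real_ else max(x) - min(x)
})
rng
# 220 221 222 223
#   4  15  NA  NA
```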
I am using a large dataset that contains multiple variables with similar information. The variables range from PR1 through PR25, and each contains information regarding a procedure code. In short, the data frame looks like this:
Obs PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
Where PR1 through PR25 values are factors.
I am looking for a way to make a table of information across all of these variables: for instance, a table showing the total count of the value "527" across PR1:PR25. I would like to do this for multiple values of interest, e.g.:
PR Tot
#222 3
#341 3
#527 2
#569 3
#1600 1
#1660 1
However, I only want to retrieve the frequency for a very specific set of values, such as 527 or 1600.
I have initially tried using a simple function like length(which(PR1=="527")), which works but is tedious.
I used the method suggested by Soren using:
library(plyr)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c("527", "5251", "5252", "5253", "5259",
"526", "521", "529", "8512", "8521", "344", "854", "8523", "8541", "8546",
"8542", "8547" , "8544", "8545", "8543", "639",
"064","065","063","0650","0651", "0652", "062", "066", "4040", "4041",
"4042", "0721", "0712","0701", "0702", "070", "0741", "435","436", "4399",
"439", "438", "437", "4381", "4391", "4342", "5122", "5121", "5124", "5123",
"518", "519", "503", "5022", "5012")),]
And got the following output (abbreviated):
codes count
92 062 5
95 064 8
96 0650 2
769 526 8
770 527 8
However, I had a feeling that was incorrect. When I checked it against the output from sapply(df, function(PR1) length(which(PR1 == "527")))
I get the following:
PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 ...
1152 36 6 1 2 1 1 1
Which is the correct number of "527" cases in the dataframe. Any suggestions why the first method is giving incorrect sums of factor levels?
Thanks for any help, and let me know if I can provide more info
You can use the sapply() or lapply() function to get the count of some value over all columns.
Create data frame df
df <- data.frame(A = 1:4, B = c(4,4,4,4), C = c(2,3,4,4), D = 9:12)
df
# A B C D
# 1 1 4 2 9
# 2 2 4 3 10
# 3 3 4 4 11
# 4 4 4 4 12
Frequency of value "4" in each column A, B, C, and D using sapply() function
sapply(df, function(x) length(which(x == 4)))
A B C D
1 4 2 0
Frequency of value "4" in each column A, B, C, and D using lapply() function
lapply(df, function(x) length(which(x == 4)))
# $A
# [1] 1
# $B
# [1] 4
# $C
# [1] 2
# $D
# [1] 0
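As a small extension (not part of the answer above), the logical comparison also works on the whole data frame at once, so a grand total across all columns, or totals for several values of interest, can be sketched as:

```r
df <- data.frame(A = 1:4, B = c(4, 4, 4, 4), C = c(2, 3, 4, 4), D = 9:12)

# grand total of 4s anywhere in the data frame
sum(df == 4, na.rm = TRUE)
# [1] 7

# totals for several values of interest at once
sapply(c(4, 9), function(v) sum(df == v, na.rm = TRUE))
# [1] 7 1
```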
The following takes your example and returns an output that may be generalized across all 25 columns. The "plyr" library is used to create the aggregated counts.
Scripted as follows:
library(plyr)
df <- data.frame(PR1=c("527","1600","341","222","569"),PR2=c("1422","527","222","569","341"),PR3=c("222","569","341","1422","1660"),stringsAsFactors = T)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c('527','222')),]
Explained as follows:
Create the data frame as specified above. As the OP noted that the values are factors, stringsAsFactors is set to TRUE.
df <- data.frame(
PR1=c("527","1600","341","222","569"),
PR2=c("1422","527","222","569","341"),
PR3=c("222","569","341","1422","1660"),
stringsAsFactors = T)
Reviewing results of df
df
PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
As the OP asks to combine all the codes across PR1:PR25, these are unified into a single list by using lapply() to loop across all the columns. However, since these are factors, and the interest seems to be in the level value of the factor rather than its underlying numeric representation, lapply(df, levels) returns those values. To merge PR1:PR25 into a single vector it's simply unlist(), and since the column names are seemingly not useful in this case, use.names is set to FALSE. Finally, a data.frame is created with a single column called codes, which is later fed into the ddply() function to get the counts.
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
all_codes
codes
1 1600
2 222
3 341
4 527
5 569
6 1422
7 222
8 341
9 527
10 569
11 1422
12 1660
13 222
14 341
15 569
Using ddply() to split the data.frame on the codes value and then take the length() of each vector returned by the split:
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result
Reviewing the result gives the PR1:PR25 aggregated count of all the level values of each factor in the original data.frame
codes count
1 1422 2
2 1600 1
3 1660 1
4 222 3
5 341 3
6 527 2
7 569 3
And since we're only interested in specific values (527 was given in the OP, but here two values of interest, 527 and 222, are exemplified):
result[which(result$codes %in% c('527','222')),]
codes count
4 222 3
6 527 2
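One caveat worth noting (my addition): because unlist(lapply(df, levels)) lists each level once per column it appears in, this approach counts the number of columns in which a code occurs as a level, not the number of times it occurs in the data. The two only coincide when a code appears at most once per column, as in this small example, which is likely why the counts disagreed with sapply() on the full dataset. A sketch that counts actual occurrences instead:

```r
df <- data.frame(PR1 = c("527", "1600", "341", "222", "569"),
                 PR2 = c("1422", "527", "222", "569", "341"),
                 PR3 = c("222", "569", "341", "1422", "1660"),
                 stringsAsFactors = TRUE)

# tally the cell values themselves, not the factor levels
tab <- table(unlist(lapply(df, as.character)))
tab[c("527", "222")]
# 527 222
#   2   3
```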
I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and a y column which, combined, can make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original data frame with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried for some time to come up with a vectorised solution using indexing, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids, but these two columns are non-unique.
One solution would be to treat them as strings or tuples, combine them into one column as a coordinate pair, and then use %in%.
But I was curious whether there is a solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
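Since the question specifically asks about indexing with boolean vectors, here is a base R sketch along those lines: paste the two coordinate columns into a composite key (safe here because x and y are plain numbers), match original against update, and use the resulting boolean vector to index the assignment:

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# position of each original (x, y) pair in update, or NA if absent
idx <- match(paste(original$x, original$y), paste(update$x, update$y))
hit <- !is.na(idx)    # boolean vector over the rows of original
original$value[hit] <- update$value[idx[hit]]
original$value
# [1] 120  51 122 123 124 125 126  54 128 129
```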
The match function is used to generate indices. It needs a nomatch argument to prevent NA on the left-hand side of the [<- assignment. I don't think it is as transparent as a merge followed by replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)] ,
"value"] <-
update[ which( match(update$x, original$x) == match(update$y, original$y)),
"value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
1 8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial), ] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),y=sort(x = unique(y),decreasing = F))
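If you do want the data frame layout from the question despite the experiments having different numbers of participants, one sketch (my own; the names are illustrative) is to pad the shorter columns with NA:

```r
x <- factor(c(2, 5, 4, 3, 6, 1, 4, 5, 6, 3, 2, 3))
y <- factor(c(2, 3, 4, 2, 4, 1, 4, 5, 5, 3, 2, 3))

# unique SIDs per experiment, sorted numerically rather than as factor levels
ids <- lapply(list(Exp1 = x, Exp2 = y),
              function(v) sort(as.numeric(as.character(unique(v)))))

# pad every column with NA up to the longest one, then assemble
n <- max(lengths(ids))
padded <- data.frame(lapply(ids, function(v) c(v, rep(NA, n - length(v)))))
padded
#   Exp1 Exp2
# 1    1    1
# 2    2    2
# 3    3    3
# 4    4    4
# 5    5    5
# 6    6   NA
```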
I have a dataset that has several hundred variables with hundreds of observations. Each observation has a unique identifier, and is associated with one of approximately 50 groups. It looks like so (the variables I'm not concerned about have been ignored below):
ID Group Score
1 10 400
2 11 473
3 12 293
4 13 382
5 14 283
6 11 348
7 11 645
8 13 423
9 10 434
10 10 124
etc.
I would like to calculate an adjusted mean for each observation that needs to use the N-count for each Group, the sum of Scores for that Group, as well as the means for the Scores of each group. (So, in the example above, the N-count for Group 11 is three, the sum is 1466, and the mean is 488.67, and I would use these numbers only on IDs 2, 6, and 7).
I've been fiddling with plyr, and am able to extract the n-counts and means as follows (accounting for missing Scores and Group values):
new_data <- ddply(main_data, "Group", summarize, N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE))
I'm stuck, though, on how to get the sum of the scores for a particular group, and then how to calculate the adjusted means either within the main_data set or a new dataset. Any help would be appreciated.
Here is the plyr way.
ddply(main_data, .(Group), summarize, N = sum(!is.na(Score)), mean = mean(Score, na.rm = TRUE), total = sum(Score))
Group N mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
Check out the dplyr package.
main_data %>% group_by(Group) %>% summarize(n = n(), mean = mean(Score, na.rm=TRUE), total = sum(Score))
Source: local data frame [5 x 4]
Group n mean total
1 10 3 319.3333 958
2 11 3 488.6667 1466
3 12 1 293.0000 293
4 13 2 402.5000 805
5 14 1 283.0000 283
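Both summaries above collapse to one row per Group; to attach the group statistics to every observation (so IDs 2, 6, and 7 each carry Group 11's N, total, and mean for the adjustment), you can use mutate() instead of summarize(), or in base R ave(), as in this sketch. The adjusted-mean formula itself isn't specified in the question, so only its ingredients are computed:

```r
main_data <- data.frame(ID = 1:10,
                        Group = c(10, 11, 12, 13, 14, 11, 11, 13, 10, 10),
                        Score = c(400, 473, 293, 382, 283, 348, 645, 423, 434, 124))

# per-row group statistics via ave(); each row receives its group's value
main_data$N        <- ave(main_data$Score, main_data$Group,
                          FUN = function(x) sum(!is.na(x)))
main_data$total    <- ave(main_data$Score, main_data$Group,
                          FUN = function(x) sum(x, na.rm = TRUE))
main_data$grp_mean <- ave(main_data$Score, main_data$Group,
                          FUN = function(x) mean(x, na.rm = TRUE))

main_data[main_data$Group == 11, ]
#   ID Group Score N total grp_mean
# 2  2    11   473 3  1466 488.6667
# 6  6    11   348 3  1466 488.6667
# 7  7    11   645 3  1466 488.6667
```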