Merge two dataframes by the closest value in R

I have two dataframes that I want to merge by the closest value in one column. The first dataframe (DF1) consists of individuals and their estimated individual risk ("risk"):
DF1<- data.frame(ID = c(1, 2, 3), risk = c(22, 40, 20))
ID risk
1 22
2 40
3 20
The second dataframe (DF2) consists of population by age groups ("population_age") and the normal risks within each age group ("population_normal_risk"):
DF2<- data.frame(population_age = c("30-34","35-39","40-44"), population_normal_risk = c(15, 30, 45))
population_age population_normal_risk
30-34 15
35-39 30
40-44 45
What I want is to add a new column to DF1 showing the population age group ("population_age") whose risk value ("population_normal_risk") is closest to the estimated risk of each individual ("risk").
The expected result would be:
ID risk population_age_group
1 22 30-34
2 40 40-44
3 20 30-34
Thanks in advance!

We can use findInterval.
First we need to calculate our break points at the halfway points between the population risk values (this assumes population_normal_risk is sorted in increasing order and that risks are non-negative):
breaks <- c(0, DF2$population_normal_risk + c(diff(DF2$population_normal_risk) / 2, Inf))
Then use findInterval to detect which bin each risk falls into:
matches <- findInterval(DF1$risk, breaks)
Finally, write the matches in:
DF1$population_age <- DF2$population_age[matches]
Giving us:
DF1
ID risk population_age
1 1 22 30-34
2 2 40 40-44
3 3 20 30-34
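The midpoint idea above can be wrapped in a small helper that also copes with reference values that are not sorted. This is only a sketch: nearest_index is a hypothetical name, not a base R function, and it assumes finite numeric inputs.

```r
# Hypothetical helper: for each value in x, return the position in ref
# of the closest reference value (ties at a midpoint go to the larger value).
nearest_index <- function(x, ref) {
  ord <- order(ref)                      # positions of ref in increasing order
  sorted <- ref[ord]
  # break points halfway between neighbouring reference values
  breaks <- c(-Inf, sorted[-1] - diff(sorted) / 2, Inf)
  ord[findInterval(x, breaks)]           # map bins back to original positions
}

DF1 <- data.frame(ID = c(1, 2, 3), risk = c(22, 40, 20))
DF2 <- data.frame(population_age = c("30-34", "35-39", "40-44"),
                  population_normal_risk = c(15, 30, 45))

idx <- nearest_index(DF1$risk, DF2$population_normal_risk)
DF1$population_age <- as.character(DF2$population_age)[idx]
DF1
```

Using -Inf as the lowest break, instead of 0, means negative values are also assigned to the first group.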

We can try the code below using outer + max.col
transform(
DF1,
population_age = DF2[max.col(-abs(outer(risk, DF2$population_normal_risk, `-`))), "population_age"]
)
which gives
ID risk population_age
1 1 22 30-34
2 2 40 40-44
3 3 20 30-34
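To see what the one-liner does, it may help to build the distance matrix explicitly. The sketch below uses the question's data. One assumption added here: ties.method = "first" is passed to max.col so that ties are resolved deterministically (the default is "random").

```r
DF1 <- data.frame(ID = c(1, 2, 3), risk = c(22, 40, 20))
DF2 <- data.frame(population_age = c("30-34", "35-39", "40-44"),
                  population_normal_risk = c(15, 30, 45))

# Absolute differences: one row per individual, one column per age group
d <- abs(outer(DF1$risk, DF2$population_normal_risk, `-`))
d
#      [,1] [,2] [,3]
# [1,]    7    8   23
# [2,]   25   10    5
# [3,]    5   10   25

# The column with the largest negated distance is the closest age group
idx <- max.col(-d, ties.method = "first")
DF1$population_age <- as.character(DF2$population_age)[idx]
```

The matrix has nrow(DF1) * nrow(DF2) entries, so this approach suits small to moderately sized inputs.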

Related

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
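The attempt in the question's edit returns small integers because as.numeric() applied directly to a factor returns the internal level codes, not the numbers shown as labels. A minimal sketch with made-up IDs in the style of the question:

```r
# Participant IDs stored as a factor (illustrative values)
sid <- factor(c("5402", "25403", "25485", "5402"))

# Levels are sorted as strings, so "5402" is the third level
levels(sid)
# "25403" "25485" "5402"

# Direct conversion gives level codes, not the IDs
codes <- as.numeric(sid)
# 3 1 2 3

# Going through character first recovers the actual numbers
ids <- as.numeric(as.character(sid))
ids_sorted <- sort(unique(ids))
# 5402 25403 25485
```

The same conversion can be applied to each experiment's SID column before sorting.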
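Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below. Note that sorting a factor follows the order of its levels, so this gives numeric order only when the levels themselves are in numeric order, as they are here:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = FALSE),exp2=sort(x = unique(y),decreasing = FALSE))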

Replace value in a column based on a Frequency Count using R

I have a dataset with multiple columns. Many of these columns are factors with over 32 levels, so to run a Random Forest (for example), I want to replace values in the column based on their frequency count.
One of the columns reads like this:
$ country
: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...
What I would like to do is only retain the top N (where N is a value between 5 and 20) countries, and replace the remaining values with "Other".
I know how to calculate the frequency of the values using the table function, but I can't seem to find a solution for replacing values on the basis of such a rule. How can this be done?
Some example data:
set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1 2 3 4 5
# 4 26 30 13 27
Replace all the levels other than the top 3 (levels 2, 3 and 5) with "Other". With five levels, the two least frequent are the ones whose frequency rank is below 3:
levels(x)[rank(table(x)) < 3] <- "Other"
table(x)
#Other 2 3 5
# 17 26 30 27
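For an arbitrary N, the same idea can be written as a small function. This is a sketch only: lump_top_n is a hypothetical helper name, and ties at the cut-off are resolved by whichever level sort() happens to place first.

```r
# Hypothetical helper: keep the n most frequent levels of f,
# merge all remaining levels into a single level named `other`.
lump_top_n <- function(f, n, other = "Other") {
  f <- factor(f)
  counts <- table(f)
  # names of the n most frequent levels
  keep <- names(sort(counts, decreasing = TRUE))[seq_len(min(n, length(counts)))]
  # renaming several levels to the same label merges them
  levels(f)[!(levels(f) %in% keep)] <- other
  f
}

x <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d"))
y <- lump_top_n(x, 2)
table(y)
# y
#     a     b Other
#     5     3     3
```

Applied to the question, something like lump_top_n(df$country, 10) would keep the ten most frequent countries, where df is assumed to be the asker's data frame.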

Random sample a percentage of rows without repetition in R

I have population data with age and gender characteristics, and I'm trying to populate another column with employment type based on other data I have. I've used 'sample' to select a sample of the population who work part time, and I will then add this data as a new column, but I have yet to figure out how to ensure those selected are not reselected in the next sample for a different employment type.
At the moment I have the following which is for 23% of Male in a certain age group:
PT=my.df[sample(which(my.df$Age=="15" & my.df$Gender=="Male"), round(0.23*length (which(my.df$Age=="15" & my.df$Gender=="Male")))),]
And an example of my output looks like this:
Edinburgh.ID Age Gender
2445 2445 15 Male
2477 2477 15 Male
2469 2469 15 Male
2485 2485 15 Male
2487 2487 15 Male
2483 2483 15 Male
I now want to select the next x% from the same age and gender group who have a different employment type. If I just change the 0.23 to another percentage, in some cases the same IDs come out again, but I want each ID to appear in only one sample.
The dplyr package can randomly sample a fraction of the rows, with or without replacement.
library('dplyr')
sample_frac(df, size = percentage, replace = FALSE)
then you can adjust your constraints on age and gender accordingly.
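One way to guarantee that successive samples never overlap, sketched here in base R, is to shuffle the group's row indices once and then take consecutive slices of the shuffled vector. The data, the 40% share and the employment labels below are assumptions made up for illustration; only the 23% comes from the question.

```r
set.seed(42)

# Illustrative population: 100 males and 100 females aged "15"
my.df <- data.frame(ID = 1:200,
                    Age = "15",
                    Gender = rep(c("Male", "Female"), each = 100),
                    stringsAsFactors = FALSE)

# Rows belonging to the group of interest
idx <- which(my.df$Age == "15" & my.df$Gender == "Male")

# Shuffle once; disjoint slices of the shuffled vector cannot share a row
shuffled <- idx[sample.int(length(idx))]

n_pt <- round(0.23 * length(idx))   # 23% part time (from the question)
n_ft <- round(0.40 * length(idx))   # 40% full time (assumed share)

pt_rows <- shuffled[seq_len(n_pt)]
ft_rows <- shuffled[n_pt + seq_len(n_ft)]

my.df$Employment <- NA_character_
my.df$Employment[pt_rows] <- "Part time"
my.df$Employment[ft_rows] <- "Full time"
```

Indexing with sample.int(length(idx)) rather than calling sample(idx) avoids the surprising behaviour of sample() when idx has length one.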
You could define a data.frame describing the employment statistics for a given group and sample from it. Here is an approach in base R.
# Generate some data
N = 1000
my.df <- data.frame(Age = rep("15", N),
Gender = sample(c("Male", "Female"), N, TRUE),
Activity = rep("", N),
stringsAsFactors = FALSE)
head(my.df)
# Age Gender Activity
# 1 15 Female
# 2 15 Male
# 3 15 Male
# 4 15 Female
# 5 15 Male
# 6 15 Female
# employment statistics for the group age = "15" and gender = "Male"
employment <- data.frame(activity = letters[1:5],
prob = c(0.1, 0.1, 0.2, 0.5, 0.1),
stringsAsFactors = FALSE)
employment
# activity prob
# 1 a 0.1
# 2 b 0.1
# 3 c 0.2
# 4 d 0.5
# 5 e 0.1
# Assign activities
set.seed(35)
id <- which(my.df$Age == "15" & my.df$Gender == "Male")
my.df[id, "Activity"] <- sample(employment$activity, length(id),
replace = TRUE, prob = employment$prob)
table(my.df[my.df$Gender=="Male", "Activity"])/length(id)
# a b c d e
# 0.1135903 0.1054767 0.1805274 0.4665314 0.1338742

Calculate gender percentage from grouped data frame in R

I have a fairly large data frame that includes information on individuals divided into treatment groups. I am trying to generate variable means and gender percentages per group. I was able to calculate the means but I am not sure how to get the gender percentages.
Below, I generated a small replica of what my data looks like:
library(plyr)
#create variables and data frame
sampleid<-seq(1:100)
gender = rep(c("female","male"),c(50,50))
score <- rnorm(100)
age<-sample(25:35,100,replace=TRUE)
treatment <- rep(seq(1:5), each=4)
d <- data.frame(sampleid,gender,age,score, treatment)
>head(d)
sampleid gender age score treatment
1 1 female 34 1.6917201 1
2 2 female 26 -1.6189545 1
3 3 female 28 1.2867895 1
4 4 female 34 -0.5027578 1
5 5 female 29 -1.3652895 2
6 6 female 26 -2.4430843 2
I obtain the mean of each numeric column by:
groupstat<-ddply(d, .(treatment),numcolwise(mean))
which gives:
treatment sampleid age score
1 1 42.5 29.15 0.142078574
2 2 46.5 29.50 -0.261492514
3 3 50.5 30.50 -0.188393235
4 4 54.5 30.45 0.003526078
5 5 58.5 30.55 0.062996737
However I also need an additional column "Percent Female", which should give me the percentage of females within each treatment group 1:5.
Can someone help me in how to add this?
Try this out
groupstat<-ddply(d, .(treatment),summarise,
meansc= mean(score),
meanage= mean(age),
meanID= mean(sampleid),
nfem= length(gender[gender=="female"]), # number females per treatment group
nmale= length(gender[gender=="male"]), # number of males per treatment group
percentfem= nfem/(nfem+nmale)) # proportion of females by treatment group (multiply by 100 for a percentage)
I would first split the data into treatment groups (split(d, f = d$treatment)) and then calculate the share of females in each group (function(x) sum(x$gender == "female")/length(x$gender)):
sapply(split(d, f = d$treatment), function(x) sum(x$gender == "female")/length(x$gender))
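Because the mean of a logical vector is the proportion of TRUE values, the same result can also be obtained with tapply. The sketch below rebuilds only the gender and treatment columns from the question, writing out the recycling of treatment explicitly, and reports percentages rather than proportions.

```r
gender <- rep(c("female", "male"), c(50, 50))
# pattern 1,1,1,1,2,2,2,2,...,5,5,5,5 repeated to 100 rows
treatment <- rep(rep(1:5, each = 4), length.out = 100)
d <- data.frame(gender, treatment)

# mean of a logical vector = share of TRUE values within each group
pct_female <- 100 * tapply(d$gender == "female", d$treatment, mean)
pct_female
#  1  2  3  4  5
# 60 60 50 40 40
```

The result is a named vector, so it can be attached to a summary table by matching on the treatment names.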

Apply LR models to another dataframe

I searched SO, but I could not seem to find the right code that is applicable to my question. It is similar to this question: Linear Regression calculation several times in one dataframe
I got a dataframe of LR coefficients following Andrie's code:
Cddply <- ddply(test, .(sumtest), function(test)coef(lm(Area~Conc, data=test)))
sumtest (Intercept) Conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617
My question is how to apply each of these LR models (1-10) to specific row intervals in another dataframe in order to get x, the independent variable, into a third column. For example, I would like to apply sumtest1 to Samples 6:29, sumtest2 to Samples 35:50, sumtest3 to Samples 56:79, etc., in alternating intervals of 24 and 16 samples. The sample numbers repeat after 200, so sumtest9 will be for Samples 6:29 again.
Sample Area
6 236211
7 724919
8 1259814
9 1574722
10 268836
11 863818
12 1261768
13 1591845
14 220322
15 608396
16 980182
17 1415859
18 276276
19 724532
20 1130024
21 1147840
22 252051
23 544870
24 832512
25 899457
26 285093
27 4291007
28 825922
29 865491
35 246707
36 538092
37 767269
38 852410
39 269152
40 971471
41 1573989
42 1897208
43 261321
44 481486
45 598617
46 769240
47 229695
48 782691
49 1380597
50 1725419
The resulting dataframe would look like this:
Sample Area Calc
6 236211 407.5312917
7 724919 985.1525288
8 1259814 1617.363812
9 1574722 1989.564693
10 268836 446.0919309
...
35 246707 365.2452551
36 538092 724.3591324
37 767269 1006.805521
38 852410 1111.736505
39 269152 392.9073207
Thank you for your assistance.
Is this what you want? I made up a slightly larger dummy data set of 'area' to make it easier to see how the code worked when I tried it out.
# create 400 rows of area data
set.seed(123)
df <- data.frame(area = round(rnorm(400, mean = 1000000, sd = 100000)))
# "sample numbers repeats after 200" -> add a sample nr 1-200, 1-200
df$sample_nr <- 1:200
# create a factor which cuts the vector of sample_nr into pieces of length 16, 24, 16, 24...
# repeated until the pieces add up to a total length of 200,
# i.e. 5 repeats of (16, 24)
grp <- cut(df$sample_nr, breaks = c(-Inf, cumsum(rep(c(16, 24), 5))))
# add a numeric version of the chunks to data frame
# this number indicates the model from which coefficients will be used
# row 1-16 (16 rows): model 1; row 17-40 (24 rows): model 2;
# row 41-56 (16 rows): model 3; and so on.
df$mod <- as.numeric(grp)
# read coefficients
coefs <- read.table(text = "intercept beta_conc
1 -108589.2726 846.0713372
2 -49653.18701 811.3982918
3 -102598.6252 832.6419926
4 -72607.4017 727.0765558
5 54224.28878 391.256075
6 -42357.45407 357.0845661
7 -34171.92228 367.3962888
8 -9332.569856 289.8631555
9 -7376.448899 335.7047756
10 -37704.92277 359.1457617", header = TRUE)
# add model number
coefs$mod <- as.numeric(rownames(coefs)) # numeric, so it has the same type as df$mod
head(df)
head(coefs)
# join area data and coefficients by model number
# (use 'join' instead of merge to avoid sorting)
library(plyr)
df2 <- join(df, coefs)
# calculate conc from area and model coefficients
# area = intercept + beta_conc * conc
# conc = (area - intercept) / beta_conc
df2$conc <- (df2$area - df2$intercept) / df2$beta_conc
head(df2, 41)
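The inversion step can be checked against the numbers in the question with base R alone. The sketch below uses the first two models and four of the samples from the question. The rule that assigns samples to models is a simplified assumption made for this illustration (samples up to 29 use model 1, the rest use model 2).

```r
# First two models from the question
coefs <- data.frame(mod = 1:2,
                    intercept = c(-108589.2726, -49653.18701),
                    beta_conc = c(846.0713372, 811.3982918))

# A few samples from the question
samples <- data.frame(Sample = c(6, 7, 35, 36),
                      Area = c(236211, 724919, 246707, 538092))

# Assumed mapping for this illustration: samples 6-29 -> model 1, 35-50 -> model 2
samples$mod <- ifelse(samples$Sample <= 29, 1L, 2L)

# Look up each row's coefficients and invert area = intercept + beta_conc * conc
i <- match(samples$mod, coefs$mod)
samples$Calc <- (samples$Area - coefs$intercept[i]) / coefs$beta_conc[i]
round(samples$Calc, 2)
```

The computed values agree with the Calc column the asker expects for these samples (about 407.53, 985.15, 365.25 and 724.36).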
