I would like to recode a numerical variable based on a cut score criterion. If a cut score is not present in the variable, I would like to recode the closest smaller value as that cut score. Here is a snapshot of the dataset:
ids <- c(1,2,3,4,5,6,7,8,9,10)
scores <- c(512,531,541,555,562,565,570,572,573,588)
data <- data.frame(ids, scores)
> data
ids scores
1 1 512
2 2 531
3 3 541
4 4 555
5 5 562
6 6 565
7 7 570
8 8 572
9 9 573
10 10 588
cuts <- c(531, 560, 575)
The first cut score (531) is in the dataset. So it will stay the same as 531. However, 560 and 575 were not available. I would like to recode the closest smaller value (555) to the second cut score as 560 in the new column, and for the third cut score, I'd like to recode 573 as 575.
Here is what I would like to get.
ids scores rescored
1 1 512 512
2 2 531 531
3 3 541 541
4 4 555 560
5 5 562 562
6 6 565 565
7 7 570 570
8 8 572 572
9 9 573 575
10 10 588 588
Any thoughts?
Thanks
One option is to find the indices with findInterval, take the pmax of the 'scores' at those indices and the 'cuts', and update the 'rescored' column at those indices:
i1 <- with(data, findInterval(cuts, scores))
data$rescored <- data$scores
data$rescored[i1] <- with(data, pmax(scores[i1], cuts))
data
# ids scores rescored
#1 1 512 512
#2 2 531 531
#3 3 541 541
#4 4 555 560
#5 5 562 562
#6 6 565 565
#7 7 570 570
#8 8 572 572
#9 9 573 575
#10 10 588 588
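To make the indexing step above easier to follow, here is a minimal sketch that prints the intermediate values. It assumes that scores is sorted in non-decreasing order (findInterval requires this) and that every cut is at least as large as the smallest score; otherwise findInterval returns 0 for that cut and the assignment would need extra handling.

```r
scores <- c(512, 531, 541, 555, 562, 565, 570, 572, 573, 588)
cuts <- c(531, 560, 575)

# For each cut, findInterval() gives the position of the largest score <= cut.
i1 <- findInterval(cuts, scores)
i1          # 2 4 9
scores[i1]  # 531 555 573

# Replace only those positions; pmax() leaves 531 unchanged because it
# already equals its cut, and lifts 555 and 573 up to 560 and 575.
rescored <- scores
rescored[i1] <- pmax(scores[i1], cuts)
rescored
```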
I have a large data-set consisting of a header and a series of values in that column. I want to detect the presence and number of duplicates of these values within the whole dataset.
1 2 3 4 5 6 7
734 456 346 545 874 734 455
734 783 482 545 456 948 483
So for example, it would detect 734 3 times, 456 twice etc.
I've tried using the duplicated function in R, but it seems to work only on rows as a whole or columns as a whole. Using
duplicated(df)
doesn't pick up any duplicates, though I know there are two duplicates in the first row.
So I'm asking how to detect duplicates both within and between columns/rows.
Cheers
You can use table() and data.frame() to count how often each value occurs:
data.frame(table(v))
such that
v Freq
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 346 1
9 455 1
10 456 2
11 482 1
12 483 1
13 545 2
14 734 3
15 783 1
16 874 1
17 948 1
DATA
v <- c(1, 2, 3, 4, 5, 6, 7, 734, 456, 346, 545, 874, 734, 455, 734,
783, 482, 545, 456, 948, 483)
You can transform it to a vector and then use table() as follows:
library(data.table)
library(dplyr)
df<-fread("734 456 346 545 874 734 455
734 783 482 545 456 948 483")
df%>%unlist()%>%table()
# 346 455 456 482 483 545 734 783 874 948
# 1 1 2 1 1 2 3 1 1 1
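If only the duplicated values are of interest, the table can be filtered to counts above 1. This is a small base-R sketch; the column names V1 to V7 are assumed here for illustration and are not from the question.

```r
# The two rows of values from the question, typed in column by column.
df <- data.frame(
  V1 = c(734, 734), V2 = c(456, 783), V3 = c(346, 482), V4 = c(545, 545),
  V5 = c(874, 456), V6 = c(734, 948), V7 = c(455, 483)
)

# Flatten all columns into one vector, count each value,
# then keep only the values that occur more than once.
counts <- table(unlist(df))
dups <- counts[counts > 1]
dups
```

Because unlist() ignores the row and column layout, this counts duplicates both within and between columns.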
I have this data that I want to plot as a time series.
Date Units.Sold
1 Jan-16 588
2 Feb-16 448
3 Mar-16 490
4 Apr-16 512
5 May-16 528
6 Jun-16 432
7 Jul-16 470
8 Aug-16 446
9 Sep-16 465
10 Oct-16 388
11 Nov-16 429
12 Dec-16 414
However, when I use ts(datasetName), I get this:
Time Series:
Start = 1
End = 12
Frequency = 1
Date Units.Sold
1 5 588
2 4 448
3 8 490
4 1 512
5 9 528
6 7 432
7 6 470
8 2 446
9 12 465
10 11 388
11 10 429
12 3 414
As you can see, the dates are in the wrong order. I want January to correspond with 1, February with 2, and so on. Can anybody help?
You need to convert your column named 'Date' to a Date-class object first. You can use as.Date for that, but you'll need to add a year first.
your_year <- 2018
df$Date <- as.Date(paste0(df$Date, '-', your_year), format = '%b-%d-%Y')
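The scrambled numbers in the question are consistent with the Date column having been read as a factor, whose levels sort alphabetically (Apr-16 becomes 1, Aug-16 becomes 2, and so on). As an alternative sketch, and assuming that the "16" in "Jan-16" is a two-digit year (2016) rather than a day, which the question does not confirm, the time index can be given to ts() directly instead of being parsed from the Date column:

```r
# Units sold, in the order they appear in the question (Jan to Dec).
units_sold <- c(588, 448, 490, 512, 528, 432, 470, 446, 465, 388, 429, 414)

# Monthly series starting in January 2016 (the year is an assumption).
sales_ts <- ts(units_sold, start = c(2016, 1), frequency = 12)
sales_ts
```

This avoids parsing month abbreviations, which depends on the locale.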
I am having difficulty ordering my table because the column to order by (coltoorder) contains duplicates. This is a small part of my table. The desired order is custom: roughly speaking, it follows the order of the first column, except for the first value (887).
text<-"col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896"
mytable<-read.table(text=text, header = T)
mytable
desired order
myindex<-c(887,895,888,1018,896) # equivalent to
myindex2<-c(887,887,895,895,888,888,1018,1018,896,896)
some failed attempts
try1<-mytable[match(myindex, mytable$coltoorder),]
try2<-mytable[match(myindex2, mytable$coltoorder),]
try3<-mytable[mytable$coltoorder %in% myindex,]
try3<-mytable[myindex %in% mytable$coltoorder,]
try4<-mytable[myindex2 %in% mytable$coltoorder,]
rownames(mytable) <- mytable$coltoorder # error
It seems like coltoorder should be treated categorically, not numerically. All factors have an order of their levels, so we'll convert to a factor where the levels are ordered according to myindex. Then this ordering is "baked in" to the column and we can use order normally on it.
mytable$coltoorder = factor(mytable$coltoorder, levels = myindex)
mytable[order(mytable$coltoorder), ]
# col1 col2 col3 coltoorder
# 8 895 2 1374 887
# 1 888 2 14 887
# 131 1018 3 1065 895
# 9 896 2 307 895
# 2 889 2 4 888
# 4 891 2 8 888
# 168 1055 2 971 1018
# 134 1021 2 87 1018
# 39 926 3 241 896
# 10 897 2 64 896
Do be careful - this column is now a factor not a numeric. If you want to recover the numeric values from a factor, you need to convert via character: original_values = as.numeric(as.character(mytable$coltoorder)).
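The caveat above can be checked on a tiny example. This sketch uses a short made-up factor f with the same levels as myindex:

```r
# A small factor whose levels follow the custom order.
f <- factor(c(887, 895, 888), levels = c(887, 895, 888, 1018, 896))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)                    # 1 2 3

# Going through character recovers the original numbers.
values <- as.numeric(as.character(f))     # 887 895 888
```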
Your data sample suggests that your desired sort order is equivalent to the first appearance in column coltoorder.
If this is true, the function fct_inorder() from Hadley Wickham's forcats package may be particularly helpful here:
mytable$coltoorder <- forcats::fct_inorder(as.character(mytable$coltoorder))
mytable[order(mytable$coltoorder), ]
col1 col2 col3 coltoorder
1 895 2 1374 887
2 888 2 14 887
3 1018 3 1065 895
4 896 2 307 895
5 889 2 4 888
6 891 2 8 888
7 1055 2 971 1018
9 1021 2 87 1018
8 926 3 241 896
10 897 2 64 896
fct_inorder() reorders factor levels by first appearance, so there is no need to create a separate myindex vector.
However, the caveats from Gregor's answer apply as well.
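If the column should stay numeric, one alternative sketch (not taken from the answers above) is to order by the position of each value in myindex using match(), without converting to a factor:

```r
mytable <- read.table(text = "col1 col2 col3 coltoorder
895 2 1374 887
888 2 14 887
1018 3 1065 895
896 2 307 895
889 2 4 888
891 2 8 888
1055 2 971 1018
926 3 241 896
1021 2 87 1018
897 2 64 896", header = TRUE)

myindex <- c(887, 895, 888, 1018, 896)

# match() gives each row's rank in the custom order; order() sorts by that
# rank and keeps the original row order within ties.
sorted <- mytable[order(match(mytable$coltoorder, myindex)), ]
sorted
```

Values of coltoorder that are missing from myindex get a rank of NA and are placed last by order().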
I was searching for an answer to my specific problem, but I didn't find one. I found this: Add column to Data Frame based on values of other columns, but it wasn't exactly what I need in my specific case.
I'm really a beginner in R, so I hope someone can help me or has a good hint for me.
Here is an example of what my data frame looks like:
ID answer 1.partnerID
125 3 715
235 4 845
370 7 985
560 1 950
715 5 235
950 5 560
845 6 370
985 6 125
I will try to describe what I want to do with an example:
The first row holds the data of the person with the ID 125. The first partner of this person is the person with ID 715. I want to create a new column containing the answer of each person's partner. It should look like this:
ID answer 1.partnerID 1.partneranswer
125 3 715 5
235 4 845 6
370 7 985 6
560 1 950 5
715 5 235 4
950 5 560 1
845 6 370 7
985 6 125 3
So R should take the value of the column 1.partnerID, which in this case is "715", and search for the row where "715" is the value in the column ID (no ID occurs more than once).
From this specific row R should take the value from the column answer (in this example that's the "5") and put it into the new column "1.partneranswer", but in the row of person 125.
I hope someone can understand what I want to do ...
My problem is that I can imagine how to write this for each row by hand, but I think there must be an easier way to do it for all rows at once (especially because my original data.frame has 5 partners per person and there is more than one column from which values should be transferred, so writing it by hand for every single row would take many hours).
I hope someone can help.
Thank you!
One solution is to use apply as follows:
df$partneranswer <- apply(df, 1, function(x) df$answer[df$ID == x[3]])
Output will be as desired above. There may be a loop-less approach.
EDIT: Adding a loop-less (vectorized answer) using match:
df$partneranswer <- df$answer[match(df$X1.partnerID, df$ID)]
df
ID answer X1.partnerID partneranswer
1 125 3 715 5
2 235 4 845 6
3 370 7 985 6
4 560 1 950 5
5 715 5 235 4
6 950 5 560 1
7 845 6 370 7
8 985 6 125 3
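Since the question mentions several partner columns per person, the match() lookup above can be repeated in a short loop. This is only a sketch: the column names partner1ID and partner2ID are hypothetical, and the values in partner2ID are invented purely for illustration.

```r
df <- data.frame(
  ID         = c(125, 235, 370, 560, 715, 950, 845, 985),
  answer     = c(3, 4, 7, 1, 5, 5, 6, 6),
  partner1ID = c(715, 845, 985, 950, 235, 560, 370, 125),
  partner2ID = c(235, 370, 125, 715, 950, 845, 985, 560)  # invented values
)

partner_cols <- c("partner1ID", "partner2ID")

for (col in partner_cols) {
  # "partner1ID" -> "partner1answer", and so on.
  new_col <- sub("ID$", "answer", col)
  # Look up each partner's row by ID and copy that row's answer.
  df[[new_col]] <- df$answer[match(df[[col]], df$ID)]
}
df
```

A partner ID that does not occur in ID yields NA rather than an error, which makes missing partners easy to spot.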
Update: This can be done with a self join. The first two columns define a mapping from ID to answer. To find the answers for the partner IDs, you can merge the data frame with itself, keying the first data frame on partnerID and the second on ID. Note that merge sorts the result by the key, so the rows come back in a different order.
Suppose df is (with the column names fixed a little):
df
# ID answer partnerID
#1 125 3 715
#2 235 4 845
#3 370 7 985
#4 560 1 950
#5 715 5 235
#6 950 5 560
#7 845 6 370
#8 985 6 125
merge(df, df[c('ID', 'answer')], by.x = "partnerID", by.y = "ID")
# partnerID ID answer.x answer.y
#1 125 985 6 3
#2 235 715 5 4
#3 370 845 6 7
#4 560 950 5 1
#5 715 125 3 5
#6 845 235 4 6
#7 950 560 1 5
#8 985 370 7 6
Old answer:
If ID and partnerID are mapped to each other one-to-one, you can try:
df$partneranswer <- with(df, answer[sapply(X1.partnerID, function(partnerID) which(ID == partnerID))])
df
# ID answer X1.partnerID partneranswer
#1 125 3 715 5
#2 235 4 845 6
#3 370 7 985 6
#4 560 1 950 5
#5 715 5 235 4
#6 950 5 560 1
#7 845 6 370 7
#8 985 6 125 3
Task:
I have to check whether the values in my data vector are above a given threshold.
If I find 5 or more consecutive values greater than the threshold, I keep these values as they are.
If a run of values above the threshold is shorter than 5 (not 5 consecutive values), I replace these values with NAs.
The sample data and required output are shown below. In this example the threshold is 1000, X is the input data variable, and Y is the desired output: X with every above-threshold run of fewer than 5 values replaced by NA.
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R, but I am unable to get the desired output:
for (i in 1:nrow(X)) # X is my data vector
{counter=0
if (X[i]>10000)
{
for (j in i:(i+4))
{
if (X[j]>10000)
{counter=counter+1}
}
ifelse (counter < 5, NA, X[j])
}
X[i]<- NA
}
X
I am sure that the above code contains some errors. I need help in the form of new code, a modification of this code, or any package in R.
Here is an approach using dplyr. It uses a cumulative sum of abs(diff(x > 1000)) to give each run of consecutive values on the same side of the threshold its own group number.
library(dplyr)
df <- data.frame(x)  # x is the input data vector from the question
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>% mutate(group = cumsum(c(0, abs(diff(x>1000))))) %>%
group_by(group) %>%
mutate(count = n()) %>%
ungroup() %>%
mutate(y = ifelse(x <= 1000 | count >= 5, x, NA))  # keep runs of 5 or more
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
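The same run-length idea can be sketched in base R with rle(). The names threshold and min_run are introduced here for illustration, and the sketch assumes that a run of exactly 5 values above the threshold should be kept.

```r
x <- c(580, 457, 980, 1250, 3600, 598, 1200, 1345, 9658, 1253,
       4500, 1150, 596, 594, 550, 1450, 320, 1780, 592, 590)
threshold <- 1000
min_run <- 5

# TRUE where a value exceeds the threshold.
above <- x > threshold

# Length of the run each element belongs to, repeated per element.
runs <- rle(above)
run_len <- rep(runs$lengths, runs$lengths)

# Blank out values that are above the threshold but in a run that is too short.
y <- ifelse(above & run_len < min_run, NA, x)
y
```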
Another approach:
Y<-rep(NA,nrow(X))
for (i in 1:nrow(X)) {
if (X[i,1]<1000) {Y[i]<-X[i,1]} else if (sum(X[i:min((i+4),nrow(X)),1]>1000)>=5) {
Y[i:min((i+4),nrow(X))]<-X[i:min((i+4),nrow(X)),1]}
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a data frame named X.
It creates Y filled with NA and only changes the values where the criteria are met.