Find lowest value in three columns in R [duplicate]

I have a dataframe with 400 people, each of whom has three predicted values (so 400 rows, 3 columns). I need a function that writes the lowest of these three values into a new variable, so that every person has their best prediction in a fourth column. I can't find a way to do this, so I would be very thankful for your help!

Imagine you had 3 columns named Score1, Score2, and Score3. You might use apply as follows:
data$MinScore <- apply(data[,c("Score1","Score2","Score3")],1,min)
head(data)
Person Score1 Score2 Score3 MinScore
1 Person1 11 90 73 11
2 Person2 60 85 76 60
3 Person3 20 16 36 16
4 Person4 95 87 66 66
5 Person5 99 81 20 20
6 Person6 42 79 80 42
Sample Data
data <- data.frame(Person = paste0("Person", 1:400),
                   Score1 = sample(1:100, 400, replace = TRUE),
                   Score2 = sample(1:100, 400, replace = TRUE),
                   Score3 = sample(1:100, 400, replace = TRUE))
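As an aside, base R's pmin() is a vectorized alternative to the apply() call and tends to be faster on large data frames:
# row-wise minimum without apply(); add na.rm = TRUE if the scores can contain NAs
data$MinScore <- pmin(data$Score1, data$Score2, data$Score3)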

Related

How to subset columns based on the value in another column in R

I'm looking to subset multiple columns based on a value (a year) recorded elsewhere in the data. I have several columns of measurements, and another column containing a year. My data looks something like this:
Individual  Age 2010  weight 2010  Age 2011  Weight 2011  Age 2012  Weight 2012  Age 2013  Weight 2013  Year
A           53        50           85        100          82        102          56        90           2013
B           22        NA           23        75           NA        68           25        60           2013
C           33        65           34        64           35        70           NA        75           2010
D           NA        70           28        NA           29        78           30        55           2012
E           NA        NA           64        90           NA        NA           NA        NA           2011
I want to create a new column that reflects the data that the 'Year' column highlights: for example, taking 'Individual' A's values from 2013 and 'Individual' C's from 2010.
My end goal is to have a table that looks like:
Individual  Age  Weight
A           56   90
B           25   60
C           33   65
D           29   78
E           64   90
Is there any way to subset the years based on the years chosen in the final column?
I made a subset of your data and came up with the following (could be more elegant but this works):
library(dplyr)
library(stringr)
Individual <- c("A","B","C","D","E")
Age2010 <- c(53,22,33,NA,NA)
`weight 2010` <- c(50,NA,65,70,NA)
Age2011 <- c(85,23,34,28,64)
Weight2011 <- c(100,75,64,NA,90)
# use data.frame() rather than as.data.frame(cbind(...)), which would coerce every column to character
df <- data.frame(Individual, Age2010, `weight 2010` = `weight 2010`, Age2011, Weight2011, check.names = FALSE)
colnames(df) <- str_replace_all(colnames(df), " ", "") # remove spaces from column names
# create a data frame for each year (this could also be done programmatically)
df2010 <- df %>% select(Individual, contains("2010")) %>% mutate(year = 2010) %>% rename(weight = weight2010, age = Age2010)
df2011 <- df %>% select(Individual, contains("2011")) %>% mutate(year = 2011) %>% rename(weight = Weight2011, age = Age2011)
final <- bind_rows(df2010, df2011)
Of course, you can extend this for the remaining years in your dataset. You will then have a year variable to perform your analyses.
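For a more direct route to the desired Age/Weight table, here is a tidyr sketch, assuming the full data with its Year column and column names standardized to the spaceless form (e.g. Age2013, Weight2013):
library(dplyr)
library(tidyr)
# split names like "Age2013" into a value label ("Age") and a year ("2013"),
# then keep only the rows whose year matches each individual's Year column
df %>%
  pivot_longer(cols = -c(Individual, Year),
               names_to = c(".value", "year"),
               names_pattern = "([A-Za-z]+)(\\d{4})") %>%
  filter(year == Year) %>%
  select(Individual, Age, Weight)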

How to subset your dataframe to only keep the first duplicate? [duplicate]

I have a dataframe with multiple variables, and I am interested in how to subset it so that it only includes the first duplicate.
>head(occurrence)
userId occurrence profile.birthday profile.gender postDate count
1 100469891698 6 47 Female 583 days 0
2 100469891698 6 47 Female 55 days 0
3 100469891698 6 47 Female 481 days 0
4 100469891698 6 47 Female 583 days 0
5 100469891698 6 47 Female 583 days 0
6 100469891698 6 47 Female 583 days 0
Here you can see the dataframe. The 'occurrence' column counts how many times the same userId has occurred. I have tried the following code to remove duplicates:
occurrence <- occurrence[!duplicated(occurrence$userId),]
However, this removes "random" duplicates. I want to keep the oldest record per userId, as given by postDate. So, for example, the kept row for this userId should look something like this:
userId occurrence profile.birthday profile.gender postDate count
1 100469891698 6 47 Female 583 days 0
Thank you for your help!
Did you try ordering first, like this:
occurrence <- occurrence[order(occurrence$userId, occurrence$postDate, decreasing=TRUE),]
occurrenceClean <- occurrence[!duplicated(occurrence$userId),]
occurrenceClean
You could use dplyr for this: after filtering on the max postDate, use distinct() to remove all fully duplicated rows. Of course, if the rows with the max postDate differ in other columns, you will get all of those records.
library(dplyr)
occurrence <- occurrence %>%
  group_by(userId) %>%
  filter(postDate == max(postDate)) %>%
  distinct()
occurrence
# A tibble: 1 x 6
# Groups: userId [1]
userId occurrence profile.birthday profile.gender postDate count
<dbl> <int> <int> <chr> <chr> <int>
1 100469891698 6 47 Female 583 days 0
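With a recent dplyr (version 1.0 or later), slice_max() offers a more direct route to the same single row per user. A sketch; note that postDate would need to be numeric or a Date for the ordering to be meaningful:
library(dplyr)
occurrence %>%
  group_by(userId) %>%
  slice_max(postDate, n = 1, with_ties = FALSE) %>%  # keep one row with the largest postDate
  ungroup()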

R: how to avoid "for" loops when going through a data frame

Let me give a brief example. I have a data frame, data1:
name<-c("John","John","Mike","Amy".....)
nationality<-c("Canada","America","Spain","Japan".....)
data1<-data.frame(name,nationality....)
This means the people are from different countries; each person is uniquely identified by name and country, with no repeats.
The second data frame is:
name2<-c("John","John","Mike","John",......)
nationality2<-c("Canada","Canada","Canada".....)
score<-c(87,67,98,78,56......)
data2<-data.frame(name2,nationality2,score)
Every person is guaranteed to have 5 rows in data2, i.e. 5 scores, but they are in random order.
What I want is every person's 5 scores; I don't care what the name is or where the person is from.
The final data frame I want is:
score1 score2 score3 score4 score5
1 89 89 87 78 90
2 ...
3 ...
Every row represents one person's 5 scores, but I don't care who it is.
My data is so large that I cannot use a for loop.
What can I do?
Although there is an already accepted answer which uses base R, I would like to suggest a solution which uses the convenient dcast() function for reshaping from long to wide form, instead of using tapply() and repeated calls to rbind():
library(data.table) # CRAN version 1.10.4 used
dcast(setDT(data2)[setDT(data1), on = c(name2 = "name", nationality2 = "nationality")],
      name2 + nationality2 ~ paste0("score", rowid(rleid(name2, nationality2))),
      value.var = "score")
returns
name2 nationality2 score1 score2 score3 score4 score5
1: Amy Canada 93 91 73 8 79
2: John America 3 77 69 89 31
3: Mike Canada 76 92 46 47 75
It seems to me that's what you're asking:
data1 <- data.frame(name = c("John","Mike","Amy"),
                    nationality = c("America","Canada","Canada"))
data2 <- data.frame(name2 = rep(c("John","Mike","Amy","Jack","John"), each = 5),
                    score = sample(100, 25),
                    nationality2 = rep(c("America","Canada","Canada","Canada","Canada"), each = 5))
data3 <- merge(data2,data1,by.x=c("name2","nationality2"),by.y=c("name","nationality"))
data3$name_country <- paste(data3$name2,data3$nationality2)
all_scores_list <- tapply(data3$score,data3$name_country,c)
as.data.frame(do.call(rbind,all_scores_list))
# V1 V2 V3 V4 V5
# Amy Canada 57 69 90 81 50
# John America 4 92 75 15 2
# Mike Canada 25 86 51 20 12
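For comparison, a tidyverse sketch of the same reshape, assuming the data1/data2 objects defined above: number each person's scores within their group, then spread them into columns with pivot_wider():
library(dplyr)
library(tidyr)
merge(data2, data1, by.x = c("name2","nationality2"), by.y = c("name","nationality")) %>%
  group_by(name2, nationality2) %>%
  mutate(score_id = paste0("score", row_number())) %>%  # score1 ... score5 within each person
  ungroup() %>%
  pivot_wider(names_from = score_id, values_from = score)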

How to refer to the grouped data frame itself in the function in ddply()

ddply() makes it possible to apply a function to a data frame grouped by certain variables, but how do you refer to the grouped data frame itself as the argument of that function?
Take min() as an EXAMPLE:
What I have:
> BodyWeight
Treatment day1 day2 day3
1 a 32 33 36
2 a 35 35 26
3 a 33 38 46
4 b 23 24 25
5 b 22 16 34
6 b 36 35 37
7 c 45 45 39
8 c 29 26 12
9 c 43 27 36
What I want:
Treatment min
1 a 26
2 b 16
3 c 12
What I did and what I got:
> ddply(BodyWeight, .(Treatment), summarize, min= min(BodyWeight[,-1]))
Treatment min
1 a 12
2 b 12
3 c 12
The min() is just an example; general solutions are desired.
What you want is to summarize by Treatment across all of the day columns. The issue is that your days are spread over multiple columns. You need to convert your data from the wide format it's in (multiple columns) into a long format (key-value pairs).
library(tidyr)
library(plyr)
bw_long <- gather(BodyWeight, day, value, day1:day3)
ddply(bw_long, .(Treatment), summarize, min = min(value))
p.s. Check out the successor to plyr, dplyr
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(BodyWeight)), group by 'Treatment', unlist the Subset of Data.table (.SD) and get the min value.
library(data.table)
setDT(BodyWeight)[, .(min = min(unlist(.SD))) , by = Treatment]
# Treatment min
#1: a 26
#2: b 16
#3: c 12
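For comparison, a dplyr/tidyr sketch that yields the same per-Treatment minima:
library(dplyr)
library(tidyr)
BodyWeight %>%
  pivot_longer(day1:day3, names_to = "day", values_to = "value") %>%
  group_by(Treatment) %>%
  summarize(min = min(value))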

Exclude intervals that overlap between two data frame's (by range of two column values)

This is almost an extension of a previous question I asked, but I've run into a new problem I haven't found a solution for.
Here is the original question and answer: Find matching intervals in data frame by range of two column values
(this found overlapping intervals that were common among different names within same data frame)
I now want to find a way to exclude rows in DF1 when their intervals overlap with intervals in a new data frame, DF2.
Using the same DF1 :
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID1
ADAM 2 A 384 407 23 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
ADAM 6 C 522 599 77 ID1
This continues for 18 different names and two ID groups.
I now have a second data frame with intervals that I wish to exclude from the above data frame.
Here is an example of DF2:
Name Event Order Sequence start_event end_event duration Group
GAP1 1 A 55 121 66 ID1
GAP2 2 A 394 419 25 ID1
GAP3 3 C 502 635 133 ID1
That is, I am hoping to find any interval for each "Name" in DF1 that is in the same "Sequence" and overlaps at any point with an interval in DF2 (whether it begins before the start event, or begins midway and ends after the end event). I would like to iterate through each distinct "Name" in DF1. Also, the sequence matters: I would only like to compare sequence A with sequence A, then sequence B with sequence B, and finally sequence C with sequence C.
Desired Result (showing just the first name):
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
Last time the answer was resolved in part with foverlaps, but I am still not familiar enough with it to solve this problem, assuming that's the best way to answer it.
Thanks!
This piece of code should work for you:
library(data.table)
Dt1 <- data.table(a = 1:1000, b = 1:1000 + 100)
Dt2 <- data.table(a = 100:200, b = 100:200 + 10)
# identify the integer positions covered by any Dt2 interval
badSeq <- unique(unlist(lapply(1:nrow(Dt2), function(i) Dt2[i, a:b])))
# keep the rows of Dt1 whose interval lies entirely outside those positions
correctPos <- sapply(1:nrow(Dt1),
                     function(i) all(!Dt1[i, a:b %in% badSeq]))
Dt1[correctPos, ]
I have done it with data.tables rather than data.frames; I like them better and they can be faster. But you can apply the same ideas to a data.frame.
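Since the question mentions foverlaps(), here is a sketch of an overlap anti-join with it, assuming DF1 and DF2 as shown above with numeric start_event/end_event columns (add Group to the keys if it should also match):
library(data.table)
setDT(DF1)
setDT(DF2)
# foverlaps() needs the lookup table keyed, with the interval columns as the last two key columns
setkey(DF2, Sequence, start_event, end_event)
# find the DF1 rows whose interval overlaps any DF2 interval in the same Sequence
hits <- foverlaps(DF1, DF2,
                  by.x = c("Sequence", "start_event", "end_event"),
                  type = "any", which = TRUE, nomatch = NULL)
# keep only the DF1 rows with no overlap
DF1_clean <- DF1[setdiff(seq_len(nrow(DF1)), unique(hits$xid))]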
