Create a group variable in a data frame based on the first character of a string variable in R

I have the following data frame:
sub1=c("2021","2121","M123","M143")
x1=c(10,5,6,7)
x2=c(11,12,34,56)
data=data.frame(sub1,x1,x2)
I need to create a group variable for this data frame such that if sub1 starts with the number 2 it belongs to one group, and if sub1 starts with the letter M it belongs to a second group.
My desired output should look like this:
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
Can anyone suggest a function I can use for this? I tried the grep function as follows, but I didn't get the desired result:
data$sub1[grep("^[2].*", data$sub1)]
Thank you

What about this:
data$group <- ifelse(substr(data$sub1,1,1)==2,1,2)
data
sub1 x1 x2 group
1 2021 10 11 1
2 2121 5 12 1
3 M123 6 34 2
4 M143 7 56 2
In case there could be starting characters other than 2 or M:
ifelse(substr(data$sub1,1,1)==2,1,ifelse(substr(data$sub1,1,1)=='M',2,'Missing'))
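Note that the nested ifelse mixes numeric results (1, 2) with the character 'Missing', so the whole vector is silently coerced to character. A minimal sketch of the same logic with dplyr::case_when, which keeps a single type (assuming dplyr is installed and sub1 is a character column, the default in R >= 4.0):
library(dplyr)
data$group <- case_when(
  startsWith(data$sub1, "2") ~ 1,
  startsWith(data$sub1, "M") ~ 2,
  TRUE ~ NA_real_  # fall-through for any other starting character
)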

Another way using substring and indexing to assign groups.
data$group <- (substr(data$sub1, 1, 1) == "M") + 1
data
# sub1 x1 x2 group
#1 2021 10 11 1
#2 2121 5 12 1
#3 M123 6 34 2
#4 M143 7 56 2
Or extract the first character using regex
sub("(.).*", "\\1", data$sub1)
#[1] "2" "2" "M" "M"
and then use the same method to assign groups
(sub("(.).*", "\\1", data$sub1) == "M") + 1
#[1] 1 1 2 2

You can also do:
as.integer(!grepl("^2", data$sub1)) + 1
[1] 1 1 2 2
grepl("^2", data$sub1) is TRUE where sub1 starts with 2; negating and converting to integer gives 0 for those elements and 1 for the rest, and adding 1 yields groups 1 and 2.

Related

Adding new columns to dataframe with suffix

I want to subtract one column from another and create a new one named with the corresponding suffix from the first column. I have approximately 50 columns.
I can do it "manually" as follows...
df$new1 <- df$col_a1 - df$col_b1
df$new2 <- df$col_a2 - df$col_b2
What is the easiest way to create a loop that does the job for me?
We can use grep to identify the columns whose names contain "col_a" and "col_b" and subtract them directly.
a_cols <- grep("col_a", names(df))
b_cols <- grep("col_b", names(df))
df[paste0("new", seq_along(a_cols))] <- df[a_cols] - df[b_cols]
df
# col_a1 col_a2 col_b1 col_b2 new1 new2
#1 10 15 1 5 9 10
#2 9 14 2 6 7 8
#3 8 13 3 7 5 6
#4 7 12 4 8 3 4
#5 6 11 5 9 1 2
#6 5 10 6 10 -1 0
data
Tested on this data
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)
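Note that the grep approach pairs the a- and b-columns by position, so it assumes they come back in matching suffix order. If that isn't guaranteed, the suffixes can be matched explicitly; a minimal sketch, assuming the names follow the col_a<N>/col_b<N> pattern:
# Pair columns by their numeric suffix rather than by position
a_cols <- sort(grep("^col_a", names(df), value = TRUE))
suffixes <- sub("^col_a", "", a_cols)
df[paste0("new", suffixes)] <- df[a_cols] - df[paste0("col_b", suffixes)]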

Subset specific row and last row from data frame

I have a data frame which contains data relating to scores of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID: cumsum(rle(Data$ID)$lengths) returns the cumulative run lengths, i.e. the row index of the last entry in each run of consecutive IDs (this assumes the rows for each ID are contiguous, as in the example). Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
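For example, to restore ID/Time order after the rbind (a sketch, using the Time column from the question's data):
Data2 <- Data2[order(Data2$ID, Data2$Time), ]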
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
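For the unique-row variant mentioned above, wrapping the subset in unique should be enough (a sketch):
unique(df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1])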
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
   FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
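As a side note, top_n is superseded in dplyr 1.0+; slice_max expresses the same "last row by Time" step (a sketch, assuming a recent dplyr):
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)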

Perform operations on a data frame based on a factor

I'm having a hard time describing this, so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr, I have a data frame that is the result of a group_by and summarize, and I want to do some further manipulation on it by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Here I inserted the fractions in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table to get the total for each run of df:
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
First you want to merge the total values into your df:
df2 <- merge(df, total, by = "run")
Then you can call mutate (the %<>% compound-assignment pipe comes from magrittr):
library(dplyr)
library(magrittr)
df2 %<>% mutate(percent = sum / total)
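The same can also be written as a single dplyr pipeline (a sketch; left_join keeps every row of df and pulls in the matching total by run):
library(dplyr)
df %>%
  left_join(total, by = "run") %>%
  mutate(percent = sum / total)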
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766

Finding "outliers" in a group

I am working with hospital discharge data. All hospitalizations (cases) with the same Patient_ID are supposed to be of the same person. However, I discovered that there are Pat_IDs with different ages and both sexes.
Imagine I have a data set like this:
Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)
Case_ID Pat_ID Sex Age
1 1 1 33
2 1 1 33
3 1 1 33
4 1 1 76
5 2 2 19
6 2 2 19
7 2 1 49
8 3 1 15
It was relatively easy to identify Pat_IDs whose cases differ from each other. I found these IDs by calculating an average for age and/or sex (coded as 1 and 2) with the help of the function aggregate and then calculating the difference between the average and each age or sex. I would like to automatically identify and remove cases where age or sex deviates from the majority of the cases for a patient ID. In my example I would like to remove cases 4 and 7.
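A minimal sketch of that identification step, assuming "differ" simply means that Sex or Age is not constant within a Pat_ID:
chk <- aggregate(cbind(Sex, Age) ~ Pat_ID, data = Pat_File,
                 FUN = function(v) length(unique(v)) > 1)
chk$Pat_ID[chk$Sex | chk$Age]
# [1] "1" "2"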
You could try
library(data.table)
Using Mode from "Is there a built-in function for finding the mode?":
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 5 2 19
#5: 2 6 2 19
#6: 3 8 1 15
Testing other cases,
Pat_File$Sex[6] <- 1
Pat_File$Age[4] <- 16
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 6 1 19
#5: 3 8 1 15
This method works, I believe, though I doubt it's the quickest or most efficient way.
Essentially I split the dataframe by your grouping variable. Then I found the 'mode' for the variables you're concerned about. Then we filtered those observations that didn't contain all of the modes. We then stuck everything back together:
library(dplyr) # I used dplyr to 'filter' though you could do it another way
temp <- split(Pat_File, Pat_File$Pat_ID)
Mode.Sex <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Sex)); names(temp1)[temp1 == max(temp1)]})
Mode.Age <- lapply(temp, function(x) { temp1 <- table(as.vector(x$Age)); names(temp1)[temp1 == max(temp1)]})
temp.f <- NULL
for(i in 1:length(temp)){
  temp.f[[i]] <- temp[[i]] %>% filter(Sex == Mode.Sex[[i]] & Age == Mode.Age[[i]])
}
do.call("rbind", temp.f)
# Case_ID Pat_ID Sex Age
#1 1 1 1 33
#2 2 1 1 33
#3 3 1 1 33
#4 5 2 2 19
#5 6 2 2 19
#6 8 3 1 15
Here is another approach using the sqldf package:
1) Create new dataframe (called data_groups) with unique groups based on Pat_ID, Sex, and Age
2) For each unique group, check Pat_ID against every other group and if the Pat_ID of one group matches another group, select the group with lower count and store in new vector (low_counts)
3) Take the new dataframe (data_groups) and remove the entries recorded in the new vector (low_counts)
4) Recombine with Pat_File
Here is the code:
library(sqldf)
# Create new dataframe with unique groups based on Pat_ID, Sex, and Age
data_groups <- sqldf("SELECT *, COUNT(*) FROM Pat_File GROUP BY Pat_ID, Sex, Age")
# Create New Vector to Store Pat_IDs with Sex and Age that differ from mode
low_counts <- vector()
# Unique groups
data_groups
for(i in 1:nrow(data_groups)){
  for(j in 1:nrow(data_groups)){
    if(i < j){
      k <- length(low_counts) + 1
      result <- data_groups[i,2] == data_groups[j,2]
      if(is.na(result)){ result <- FALSE }
      if(result == TRUE){
        if(data_groups[i,5] < data_groups[j,5]){
          low_counts[k] <- data_groups[i,1]
        } else {
          low_counts[k] <- data_groups[j,1]
        }
      }
    }
  }
}
low_counts <- as.data.frame(low_counts)
# Take out lower counts
data_groups <- sqldf("SELECT * FROM data_groups WHERE Case_ID NOT IN (SELECT * FROM low_counts)")
Pat_File <- sqldf("SELECT Pat_File.Case_ID, Pat_File.Pat_ID, Pat_File.Sex, Pat_File.Age FROM data_groups, Pat_File WHERE data_groups.Pat_ID=Pat_File.Pat_ID AND data_groups.Sex=Pat_File.Sex AND data_groups.Age=Pat_File.Age ORDER BY Pat_File.Case_ID")
Pat_File
This provides the following results:
Case_ID Pat_ID Sex Age
1 1 1 1 33
2 2 1 1 33
3 3 1 1 33
4 5 2 2 19
5 6 2 2 19
6 8 3 1 15

How to replace rows in a data frame

Hello, I have a table with 5 columns. One of the columns, X, is:
x <- c(1,1,1,1,1,1,2,2,2,3)
How can I change the order of the numbers in vector X? For example, put the 3s in first place, the 1s in second place, and the 2s in third place. The output should be in a format like:
x <- c(3,1,1,1,1,1,1,2,2,2)
And I need to move not only the values in column X but the entire rows that belong to each value of X.
To clarify the question:
X(old version) -> X(new version)
1 2
2 3
3 1
So, If X=1 make it X=2
If X=2 make it X=3
If X=3 make it X=1
And if, for example, we change X=1 to X=2, all the rows that had X=1 should move along with it.
I have two vectors:
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
The desired output:
x z
1 30
2 10
2 10
2 10
2 10
2 10
2 10
3 20
3 20
3 20
You could do
x1 <- c(2,3,1)[x]
x[order(x1)]
# [1] 3 1 1 1 1 1 1 2 2 2
or
x[order(chartr(old="123",new="231",x))]
#[1] 3 1 1 1 1 1 1 2 2 2
Update
If you have many columns.
x <- c(1,1,1,1,1,1,2,2,2,3)
z <- c(10,10,10,10,10,10,20,20,20,30)
set.seed(14)
y <- matrix(sample(25,10*3,replace=TRUE),ncol=3)
m1 <- as.data.frame(cbind(x,z,y))
x1 <- c(2,3,1)[m1$x]
x1
# [1] 2 2 2 2 2 2 3 3 3 1
res <- cbind(x=c(2,3,1)[m1$x[order(x1)]],subset(m1[order(x1),], select=-x))
res
# x z V3 V4 V5
#10 1 30 10 15 2
#1 2 10 7 23 9
#2 2 10 16 5 11
#3 2 10 24 12 16
#4 2 10 14 22 18
#5 2 10 25 22 19
#6 2 10 13 19 16
#7 3 20 24 9 10
#8 3 20 11 17 14
#9 3 20 13 22 18
If I'm understanding correctly, it sounds as though you want to define your own order for sorting something. Is that right? Two ways you could do that:
Option #1: Make another column in your data.frame and assign values in the order you'd like. If you wanted the threes to come first, the ones to come second and the twos to come third, you'd do this:
Data$y <- rep(NA, nrow(Data))
Data$y[Data$x == 3] <- 1
Data$y[Data$x == 1] <- 2
Data$y[Data$x == 2] <- 3
Then you can sort on y and your data.frame will have the order you want.
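For example (a sketch):
Data <- Data[order(Data$y), ]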
Option #2: If the numbers you list in x are levels in a factor, you could do this using plyr:
library(plyr)
Data$x <- revalue(Data$x, c("3" = "1", "1" = "2", "2" = "3"))
Personally, I think that the 2nd option would be rather confusing, but if you are using "1", "2", and "3" to refer to levels in a factor, that is one quick way to change things.
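If you'd rather avoid plyr, dplyr::recode does the same renaming (a sketch; recode is superseded in recent dplyr releases but still available, and works on character or factor input):
library(dplyr)
Data$x <- recode(Data$x, "3" = "1", "1" = "2", "2" = "3")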
