Add identifier column in R

For example, my data looks like this:
Number Value
1 3
2 4
3 6
4 7
I want to add a third column as an identifier column based on Value. If the value is > 5, the row belongs to group 1, otherwise to group 2. The result should look something like this:
Number Value Group
1 3 2
2 4 2
3 6 1
4 7 1
Thanks for your help!

You can add a new column to a data frame:
df$Group <- ifelse(df$Value > 5, 1, 2)
I recommend reading ?data.frame and ?ifelse, as well as other data frame operations like
?transform
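As a small illustration of the ?transform route mentioned above, here is a minimal sketch. It assumes the data frame is called df and has the Number and Value columns from the question; it is one possible way to write it, not the only one.

```r
# Rebuild the example data from the question
df <- data.frame(Number = 1:4, Value = c(3, 4, 6, 7))

# transform() evaluates the expression inside df, so Value can be used directly
df <- transform(df, Group = ifelse(Value > 5, 1, 2))
df
```

The result should be the same as with the df$Group assignment; transform() mainly saves repeating df$ when several columns are involved.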

Related

If I want to sort a column by size in RStudio, how do I make sure that the associated values of the rows sort with the column?

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. Now I need to sort one column by size, but I want the remaining columns to sort with it, so that one column is in increasing order and the other columns still contain the values of the right persons (so that one row still contains data from one and the same person).
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
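These are the column names of my data.frame, and I want to sort it by the column called "avg".
First of all, please always provide us with a reproducible example such as the one below. Indexing the rows of a data frame with order() reorders whole rows, so every column stays aligned with the sorted column. The minus sign below sorts avg in decreasing order; use order(BAPlotDET$avg) without it for increasing order.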
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1
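Because every column in the example above holds the same numbers, it does not really show that rows stay together. A minimal sketch with distinct values follows; the data frame people and its columns id, avg and diff are made up for illustration.

```r
# Hypothetical data: one row per person
people <- data.frame(
  id   = c("p1", "p2", "p3", "p4"),
  avg  = c(2.5, 0.7, 3.1, 1.9),
  diff = c(0.2, -0.4, 0.1, 0.3),
  stringsAsFactors = FALSE
)

# order() returns the row positions that sort avg in increasing order;
# using them as the row index moves each whole row, not just one column
sorted <- people[order(people$avg), ]
sorted
```

Each id should still sit next to its own avg and diff after sorting.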

Removing rows from a dataset based on conditional statement across factors

I am struggling to figure out how to remove rows from a dataset based on conditions across multiple factors in a large dataset. Here is some example data to illustrate the problem I am having with a smaller data frame:
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(cbind(Code, Value))
data$Value <- (as.numeric(data$Value))
data
Code Value
1 A 1
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I want to remove values where the Code is A and the Value is < 2 from the dataset. I understand the logic of how to select for values where Code is A and Values <2, but I can't figure out how to remove these values from the dataset without also removing all values of A that are > 2, while maintaining values of the other codes that are less than 2.
#Easy to select for values of A less than 2
data2<- subset(data, (Code == "A" & Value < 2))
data2
Code Value
1 A 1
#But I want to remove values of A less than 2 without also removing values of A that are greater than 2:
data1<- subset(data, (Code != "A" & Value > 2))
data1
Code Value
3 C 3
4 D 4
### just using Value > 2 does not allow me to include values that are less than 2 for the other Codes (B,C,D):
data2<- subset(data, Value > 2)
data2
Code Value
3 C 3
4 D 4
7 A 3
8 A 4
My ideal dataset would look like this:
data
Code Value
2 B 2
3 C 3
4 D 4
5 C 1
6 D 2
7 A 3
8 A 4
I have tried different iterations of filter(), subset(), and select() but I can't figure out the correct conditional statement that allows me to remove the desired combination of levels of multiple factors. Any suggestions would be greatly appreciated.
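One way to read the requirement is "keep every row except those where Code is A and Value is below 2", which amounts to negating the selection condition that already works. A minimal sketch under that reading, with the example data rebuilt directly as a data frame and a hypothetical result name kept:

```r
data <- data.frame(
  Code  = c("A", "B", "C", "D", "C", "D", "A", "A"),
  Value = c(1, 2, 3, 4, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Negate the whole condition: drop rows that are both Code A and Value < 2
kept <- subset(data, !(Code == "A" & Value < 2))

# Equivalent form: subset(data, Code != "A" | Value >= 2)
kept
```

The attempt with Code != "A" & Value > 2 is stricter than intended because & requires both conditions, so it also drops every A row and every small value of the other codes.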

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This continues for multiple numbers. I want to take this dataframe and make a new dataframe that puts two of the rows side by side, but nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4), so that each combination of years appears once within each type and number.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then remove rows that have the same Year. I added an index to prevent redundant matches (1 with 2 and 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
# index.x < index.y keeps each pair once, with the earlier row on the left
out <- out[which(out$index.x<out$index.y & out$Year.x!=out$Year.y),]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in seq_len(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    # keep the pair only if Number and Type match and Year differs
    if (df[i, "Year"] != df[j, "Year"] && df[i, "Number"] == df[j, "Number"] && df[i, "Type"] == df[j, "Type"]) {
      out2 <- rbind(out2, cbind(df[i, ], df[j, ]))
    }
  }
}
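A third possibility, sketched here only as an illustration: split the data by Number and Type and use combn() to enumerate each pair of rows once within a group. The helper pair_rows and the ".2" suffix are made up for this sketch, and it assumes each Year appears at most once per Number/Type group.

```r
df <- data.frame(Number = rep(1, 8),
                 Year   = rep(1:4, 2),
                 Type   = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Hypothetical helper: all unordered pairs of rows within one group
pair_rows <- function(d) {
  if (nrow(d) < 2) return(NULL)        # nothing to pair in a one-row group
  idx <- combn(nrow(d), 2)             # 2 x choose(n, 2) matrix of row positions
  left  <- d[idx[1, ], ]
  right <- d[idx[2, ], ]
  names(right) <- paste0(names(right), ".2")
  cbind(left, right)
}

groups <- split(df, list(df$Number, df$Type), drop = TRUE)
out3 <- do.call(rbind, lapply(groups, pair_rows))
rownames(out3) <- NULL
out3
```

For the example data this should give six pairs per Type, with the earlier row on the left as in the desired output.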

From a set of pairs, find all subsets s.t. no pair in the subset shares any element with a pair not in the subset

I have a set of pairs. Each pair is represented as [i,1:2]. That is, the ith pair consists of the numbers in the first and second columns of the ith row.
I need to sort these pairs into distinct groups, s.t. no element of any pair in the jth group appears in any other group. For example:
EXAMPLE 1: DATA
> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3, 1)
>
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
>
> dat
col1 col2
1 3 6
2 4 7
3 6 3
4 7 4
5 10 3
6 8 1
For all pairs, it doesn't matter if the number is in column 1 or column 2, the pairs should be sorted into groups s.t. every number in every pair in every group exists only in one group. So the solved example would look like this.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 3
Rows 1, 3, and 5 are grouped together because 1 and 3 contain the same numbers and 5 shares the number 3, so it must be grouped with them. 2 and 4 share the same distinct numbers so they are grouped together and 6 has unique numbers so it is left alone.
If we change the data slightly, note the following.
EXAMPLE 2: NEW DATA
Note what happens when we add a row that shares an element with row 6 and row 5.
col1 col2 groups
1 3 6 1
2 4 7 2
3 6 3 1
4 7 4 2
5 10 3 1
6 8 1 1
7 1 10 1
The 10 in the 7th row connects it to the first group because it shares an element with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row would now be in group 1 as well.
PROBLEM
Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.
To draw on the old answer at "identify groups of linked episodes which chain together", which assigns a group to each individual value, you could try this to assign a group to each linked pair:
library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1=V(g)$name,group=components(g)$membership)
merge(dat,links,by="col1",all.x=TRUE,sort=FALSE)
# col1 col2 group
#1 3 6 1
#2 4 7 2
#3 6 3 1
#4 7 4 2
#5 10 3 1
#6 8 1 3
Your elements can be regarded as vertices in an undirected graph, and your pairs can be regarded as edges, and then (assuming that you want to find groups of minimal size -- if you don't, then e.g. the entire set of pairs could be labelled "Group 1") the groups you're looking for are the connected components in this graph. They can all be found in linear time with a depth-first or breadth-first search.
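To make the connected-components idea concrete without igraph, here is a rough base R sketch using a breadth-first search. The function name group_pairs is made up for this example, and the code is written for clarity rather than speed (the queue handling is not strictly linear time).

```r
# Hypothetical helper: label each pair (row) with its connected component.
# pairs is a two-column matrix or data frame, one pair per row.
group_pairs <- function(pairs) {
  a <- as.character(pairs[, 1])
  b <- as.character(pairs[, 2])
  nodes <- unique(c(a, b))
  # adjacency list: for every node, the nodes it is paired with
  adj <- split(c(b, a), factor(c(a, b), levels = nodes))
  comp <- setNames(integer(length(nodes)), nodes)   # 0 means not visited yet
  next_id <- 0L
  for (start in nodes) {
    if (comp[[start]] != 0L) next
    next_id <- next_id + 1L
    comp[[start]] <- next_id
    queue <- start
    while (length(queue) > 0) {
      v <- queue[1]
      queue <- queue[-1]
      for (w in adj[[v]]) {
        if (comp[[w]] == 0L) {
          comp[[w]] <- next_id
          queue <- c(queue, w)
        }
      }
    }
  }
  unname(comp[a])   # a pair belongs to the component of its first element
}

dat <- cbind(col1 = c(3, 4, 6, 7, 10, 8), col2 = c(6, 7, 3, 4, 3, 1))
cbind(dat, groups = group_pairs(dat))
```

Because the search starts from scratch on the full set of pairs, adding a row such as (1, 10) simply merges the affected groups on the next run; no earlier assignment has to be patched by hand.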

Pull coefficients from a data frame based on information in another data frame

Right now I have two data frames in R. One contains some data that looks like this:
> data
p a i
1 1 1 2.2561469
2 5 2 0.2316390
3 2 3 0.4867456
4 3 1 0.1511705
5 4 2 0.8838884
And one that contains coefficients that looks like this:
> coef
3 2 1
1 29420.50 31029.75 29941.96
2 26915.00 27881.00 27050.00
3 27756.00 28904.00 28699.40
4 28345.33 29802.33 28377.56
5 28217.00 29409.00 28738.67
These data frames are connected as each value in data$a corresponds to a column name in coef and data$p corresponds to row names in coef.
I need to multiply these coefficients by the values in data$i, matching the row and column names in coef to data$p and data$a respectively.
In other words, for each row in data, I need to use data$a and data$p for each row to pull a specific number from coef that will be multiplied by the value of data$i for that row to create a new vector in data that looks something like this:
> data
p a i z
1 1 1 2.2561469 67553
2 5 2 0.2316390 6812
3 2 3 0.4867456 .
4 3 1 0.1511705 .
5 4 2 0.8838884 .
I was thinking I should create factors in my coef data frame based on the row and column names but am unsure of where to go from there.
Thanks in advance,
Ian
If you order the columns of your coef data.frame by name, you can just index it by position as though the column names weren't there.
coef <- coef[,order(names(coef))]
Then apply a function to each row:
myfun <- function(x) {
x[3]*coef[x[1], x[2]]
}
data$z <- apply(data, 1, myfun)
> data
p a i z
1 1 1 2.2561469 67553.460
2 5 2 0.2316390 6812.271
3 2 3 0.4867456 13100.758
4 3 1 0.1511705 4338.503
5 4 2 0.8838884 26341.934
>
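An alternative sketch that matches on the row and column names instead of relying on column order: index a matrix with a two-column character matrix, which looks up each (row name, column name) pair. The data frames are rebuilt from the question here, and the intermediate name idx is made up for illustration.

```r
data <- data.frame(p = c(1, 5, 2, 3, 4),
                   a = c(1, 2, 3, 1, 2),
                   i = c(2.2561469, 0.2316390, 0.4867456, 0.1511705, 0.8838884))

coef <- data.frame(`3` = c(29420.50, 26915.00, 27756.00, 28345.33, 28217.00),
                   `2` = c(31029.75, 27881.00, 28904.00, 29802.33, 29409.00),
                   `1` = c(29941.96, 27050.00, 28699.40, 28377.56, 28738.67),
                   check.names = FALSE)

# Each row of idx is a (row name, column name) pair to look up in coef
idx <- cbind(as.character(data$p), as.character(data$a))
data$z <- data$i * as.matrix(coef)[idx]
data
```

This avoids reordering coef and also avoids apply(), which converts the data frame to a matrix first and would turn everything into character if data had any non-numeric column.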

Resources