Creating a new variable in a data frame and changing its values in one step [duplicate] - r

This question already has answers here:
Convert continuous numeric values to discrete categories defined by intervals
(2 answers)
Closed 5 years ago.
I have a column that is part of a data frame, df. It is full of integers. Let's say it is the number of houses sold in a day by a realty company. Let's call it df$houses. I want to make a second column called df$quant where the number of houses is categorized, with 0 being 0-2 houses sold in a day, 1 being 3-5 houses, 2 being 6-9 houses and 3 being 10 or more houses. I could do this in two steps.
1) Create the new column df$quant from df$houses:
df$quant <- df$houses
2) Change the values of df$quant:
df$quant[which(df$quant <= 2)] <- 0
etc.
I would like to do this in one step though, making the new variable and filling it with the proper values. Mostly, so I don't have to worry about getting the order of the lines of code in the second step right. It would be more robust.
Could this be done with an if statement?
Thanks a lot.

I would do something like this, using cut():
x <- 1:11
df <- data.frame(x)
# Open-ended breaks (-Inf, Inf) keep this working when max(x) is 9 or less
myFunction <- function(x) as.integer(cut(x, c(-Inf, 2, 5, 9, Inf))) - 1
df$new <- myFunction(df$x)
df
    x new
1   1   0
2   2   0
3   3   1
4   4   1
5   5   1
6   6   2
7   7   2
8   8   2
9   9   2
10 10   3
11 11   3
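An alternative sketch, not part of the answer above: base R's findInterval() returns a 0-based interval index directly, so no subtraction is needed. The column names houses and quant follow the question; the sample values are made up for illustration, and the cut points assume the top category means 10 or more houses.

```r
# Made-up sample data for illustration
df <- data.frame(houses = c(0, 2, 3, 5, 6, 9, 10, 25))

# The vector holds the lower bounds of categories 1, 2 and 3;
# anything below 3 falls into category 0
df$quant <- findInterval(df$houses, c(3, 6, 10))
df
```

Because the new column is created and filled in a single assignment, the order of the category rules no longer matters.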

Related

Subsetting rows of data frame by character patterns (grepl) in a for loop [duplicate]

This question already has answers here:
Subset rows in a data frame based on a vector of values
(4 answers)
Closed 2 years ago.
I am attempting to subset a data frame by removing rows containing certain character patterns, which are stored in a vector. My issue is that only the last pattern of the vector is removed from my data frame. How can I make my loop work iteratively, so that all patterns stored in the vector are removed from my data frame?
Mock input:
df<-data.frame(organism=c("human_longname","cat_longname","bird_longname","virus_longname","bat_longname","pangolian_longname"),size=c(6,4,2,1,3,5))
df
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
4 virus_longname 1
5 bat_longname 3
6 pangolian_longname 5
used code and output:
vectors<-c("bat","virus","pangolian")
for(i in vectors){df_1<-df[!grepl(i,df$organism),]}
df_1
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
4 virus_longname 1
5 bat_longname 3
Expected output
df_1
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
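You can try this. Note that %in% only matches whole values, so it works when organism holds the bare names (as in the output below) rather than longer strings such as human_longname: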
df[!df$organism %in% c("bat","virus","pangolian"),]
organism size
1 human 6
2 cat 4
3 bird 2
Update: based on the new data, here is an approach using grepl(). Collapsing the patterns into a single regular expression avoids the loop:
#Vectors
vectors<-c("bat","virus","pangolian")
#Format
vectors2 <- paste0(vectors,collapse = '|')
#Avoid loop
df[!grepl(pattern = vectors2,df$organism),]
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
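Also, just out of curiosity, here is a (probably not optimal) loop that does the same task by building an index and then creating a new data frame:
#Create index
index <- c()
#Loop over rows, keeping those that match none of the patterns
for(i in seq_len(nrow(df)))
{
  if(!grepl(vectors2,df$organism[i]))
  {
    index <- c(index,i)
  }
}
#Subset once, after the loop has finished
ndf <- df[index,]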
ndf
organism size
1 human_longname 6
2 cat_longname 4
3 bird_longname 2
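As for why the loop in the question removes only the last pattern: every iteration filters the original df again and overwrites df_1, so only the final pass survives. A minimal sketch of a loop that filters cumulatively, using the question's data (fixed = TRUE is an added assumption that the patterns are plain substrings rather than regular expressions):

```r
df <- data.frame(
  organism = c("human_longname", "cat_longname", "bird_longname",
               "virus_longname", "bat_longname", "pangolian_longname"),
  size = c(6, 4, 2, 1, 3, 5)
)
vectors <- c("bat", "virus", "pangolian")

# Start from the full data and keep narrowing the same object
df_1 <- df
for (i in vectors) {
  # Filter df_1 (not df), so earlier removals are preserved
  df_1 <- df_1[!grepl(i, df_1$organism, fixed = TRUE), ]
}
df_1
```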

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This continues for multiple numbers. I want to take this dataframe and make a new dataframe that has two of the rows together, but it would be nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4) where each combination of years is paired within types and numbers.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then drop the pairs that share the same Year. I added an index so that each pair appears only once (1 with 2, but not 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
# index.x < index.y keeps the earlier row on the left, as in the example output
out <- out[out$index.x<out$index.y & out$Year.x!=out$Year.y,]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in 1:(nrow(df)-1)){
  for (j in (i+1):nrow(df)){
    if(df[i,"Year"]!=df[j,"Year"] && df[i,"Number"]==df[j,"Number"] && df[i,"Type"]==df[j,"Type"]){
      out2 <- rbind(out2,cbind(df[i,],df[j,]))
    }
  }
}
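A third possibility, sketched here as an assumption rather than as part of the answer above, is to split the data by Number and Type and let combn() enumerate each pair of rows exactly once. The helper name pair_rows and the ".2" suffix are made up for illustration, and the sketch assumes each Year appears only once per Number/Type group.

```r
df <- data.frame(Number = rep(1, 8),
                 Year   = rep(1:4, 2),
                 Type   = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Hypothetical helper: pair every row of one group with every later row
pair_rows <- function(g) {
  if (nrow(g) < 2) return(NULL)
  idx <- combn(nrow(g), 2)      # each column is one pair of row positions
  left  <- g[idx[1, ], ]
  right <- g[idx[2, ], ]
  rownames(left)  <- NULL
  rownames(right) <- NULL
  names(right) <- paste0(names(right), ".2")  # avoid duplicate column names
  cbind(left, right)
}

groups <- split(df, list(df$Number, df$Type), drop = TRUE)
out3 <- do.call(rbind, lapply(groups, pair_rows))
rownames(out3) <- NULL
out3
```

With four years per group this gives six pairs per group, in the same order as the example output in the question.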

Apply a maximum value to whole group [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 6 years ago.
I have a df like this:
Id count
1 0
1 5
1 7
2 5
2 10
3 2
3 5
3 4
and I want to get the maximum count and apply that to the whole "group" based on ID, like this:
Id count max_count
1 0 7
1 5 7
1 7 7
2 5 10
2 10 10
3 2 5
3 5 5
3 4 5
I've tried pmax, slice etc. I'm generally having trouble working with data that is in interval-specific form; if you could direct me to tools well-suited to that type of data, would really appreciate it!
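Figured it out with help from Gavin Simpson here: Aggregate a dataframe on a given column and display another column
maxcount <- aggregate(count ~ Id, data = df, FUN = max)
names(maxcount)[2] <- "max_count" # rename, otherwise merge() also joins on count and drops rows
new_df <- merge(df, maxcount, by = "Id")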
Better way:
df$max_count <- with(df, ave(count, Id, FUN = max))
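A quick check of the ave() approach on the question's data (a sketch; the data frame is retyped from the table in the question):

```r
df <- data.frame(Id    = c(1, 1, 1, 2, 2, 3, 3, 3),
                 count = c(0, 5, 7, 5, 10, 2, 5, 4))

# ave() applies FUN within each Id group and returns a vector
# of the same length as count, so it can be assigned directly
df$max_count <- with(df, ave(count, Id, FUN = max))
df
```

Unlike the aggregate() plus merge() route, this keeps the original row order and needs no intermediate data frame.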

Data Summary in R: Using count() and finding an average numeric value [duplicate]

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 6 years ago.
I am working on a directed graph and need some advice on generating a particular edge attribute.
I need to use both the count of interactions as well as another quality of the interaction (the average length of text used within interactions between the same unique from/to pair) in my visualization.
I am struggling to figure out how to create this output in a clean, scalable way. Below is my current input, solution, and output. I have also included an ideal output along with some things I have tried.
Input
network <- read.table(text = "
Actor Receiver Length
1 1 4
1 2 20
1 3 9
1 3 100
1 3 15
2 3 38
3 1 25
3 1 17",
sep = "", header = TRUE)
I am currently using dplyr to get a count of how many times each pair appears to achieve the output below.
I use the following command:
EDGE <- dplyr::count(network, Actor, Receiver )
names(EDGE) <- c("from","to","count")
To achieve my current output:
From To Count
1 1 1
1 2 1
1 3 3
2 3 1
3 1 2
Ideally, however, I would like to know the average length for each pair as well, and end up with something like this:
From To Count AverageLength
1 1 1 4
1 2 1 20
1 3 3 41
2 3 1 38
3 1 2 21
Is there any way I can do this without creating a host of new data frames and then grafting them back onto the output? I am mostly having issues trying to summarize and count at the same time. My stupid solution has been to simply add "Length" as an argument to the count function, but this does not produce anything useful. I could also see that it may be useful to combine actor-receiver and then use the summary function to create something to graft onto the frame as a result of the count. In the interest of scaling, however, I would like to figure out if there is a simple and clear way of doing this.
Thank you very much for any assistance with this issue.
A naive solution would be to use cbind() to join the two outputs together. Here is some example code:
Actor <- c(rep(1, 5), 2, 3, 3)
Receiver <- c(1, 2, rep(3, 4), 1, 1)
Length <- c(4, 20, 9, 100, 15, 38, 25, 17)
x <- data.frame("Actor" = Actor,
"Receiver" = Receiver,
"Length" = Length)
library(plyr)
EDGE <- cbind(ddply(x,.(Actor, Receiver), nrow), # This part replaces dplyr::count
ddply(x,.(Actor, Receiver), summarize, mean(Length))[ , 3]) # This part computes the mean length per pair
names(EDGE) <- c("From", "To", "Count", "AverageLength")
EDGE # Gives the expected results
  From To Count AverageLength
1    1  1     1       4.00000
2    1  2     1      20.00000
3    1  3     3      41.33333
4    2  3     1      38.00000
5    3  1     2      21.00000
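The count and the mean can also be computed in one pass with base R's aggregate(), by letting the summary function return both values. This is a sketch under the assumption that a base R solution is acceptable; the name EDGE2 is made up to avoid clashing with the code above.

```r
x <- data.frame(Actor    = c(1, 1, 1, 1, 1, 2, 3, 3),
                Receiver = c(1, 2, 3, 3, 3, 3, 1, 1),
                Length   = c(4, 20, 9, 100, 15, 38, 25, 17))

# The function returns two values per group, so aggregate() stores
# them as a two-column matrix inside the Length column
agg <- aggregate(Length ~ Actor + Receiver, data = x,
                 FUN = function(v) c(Count = length(v), AverageLength = mean(v)))

# Flatten the matrix column into ordinary columns
EDGE2 <- do.call(data.frame, agg)
names(EDGE2) <- c("From", "To", "Count", "AverageLength")

# aggregate() orders by Receiver first; reorder to match the desired output
EDGE2 <- EDGE2[order(EDGE2$From, EDGE2$To), ]
rownames(EDGE2) <- NULL
EDGE2
```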

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know if the last group of rows in the dataset has fewer than x observations.
For example:
Say I use a dataset with 10 observations and 2 variables, and I want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will be tagged with consecutive group numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when the number of rows is divided by the group size (0 means the last group is full):
nrow(df) %% 3 #change the divisor to your group size
Assuming your data is df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
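Both parts of the question can be handled together. The sketch below retypes the ten rows from the question; group_size, sizes and last_is_short are made-up names for illustration.

```r
df <- data.frame(speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
                 dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17))
group_size <- 3

# Row positions 1..n divided by the group size and rounded up
# give 1,1,1,2,2,2,... without needing to trim the result
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Count rows per group and check whether the final group is incomplete
sizes <- table(df$newcol)
last_is_short <- sizes[[length(sizes)]] < group_size
last_is_short
```

Here the fourth group contains a single row, so last_is_short is TRUE.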
