If statements and logical operators in R

I have a dataframe with a Money column and an AgeGroup column.
The Money column contains NAs and the AgeGroup column has values ranging from 1 to 5.
What I want to do is find the sum of the Money column for the rows where the AgeGroup column equals a certain value, say 5 for this example.
I have been attempting to use an if statement, but I am getting the warning "the condition has length > 1 and only the first element will be used".
if(df$AgeGroup == 5)
SumOfMoney <- sum(df$Money)
My problem is I don't know how to turn "if" into "when": I want to sum the Money column over the rows that have an AgeGroup value of 5, or 3, or whatever I choose.
I believe I have the condition correct; do I add a second if statement when calculating the sum?
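For what it's worth, the warning appears because df$AgeGroup == 5 returns one TRUE/FALSE per row, while if expects a single logical value. The vectorized way to express "when" is to subset with the condition; a minimal sketch, assuming the column names above:
# Keep the Money values from rows where AgeGroup is 5, then sum them;
# na.rm = TRUE skips the NAs in the Money column.
SumOfMoney <- sum(df$Money[df$AgeGroup == 5], na.rm = TRUE)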

I would use data.table for this 'by-group' operation.
library(data.table)
setDT(df)[, .(sm = sum(Money, na.rm = TRUE)), by = AgeGroup]
This computes the sum of Money by group. To filter the result for a particular group value:
setDT(df)[, .(sm = sum(Money, na.rm = TRUE)), by = AgeGroup][AgeGroup == 4]

Try:
library(dplyr)
df %>%
  group_by(AgeGroup) %>%
  summarise(Money = sum(Money, na.rm = TRUE))
Which gives:
#Source: local data frame [5 x 2]
#
# AgeGroup Money
#1 1 1033
#2 2 793
#3 3 224
#4 4 133
#5 5 103
If you want to subset for a specific AgeGroup you could add:
... %>% filter(AgeGroup == 5)
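Putting the pieces together, the whole pipeline for a single group might look like this (a sketch using the same column names):
library(dplyr)
df %>%
  group_by(AgeGroup) %>%
  summarise(Money = sum(Money, na.rm = TRUE)) %>%  # total Money per age group
  filter(AgeGroup == 5)                            # keep only the group of interest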

Try:
set.seed(7)
df <- data.frame(AgeGroup = sample(1:5, 10, T), Money = sample(100:500, 10))
df[1,2] <- NA
AgeGroup Money
1 5 NA
2 2 192
3 1 408
4 1 138
5 2 280
6 4 133
7 2 321
8 5 103
9 1 487
10 3 224
with(df, tapply(Money, AgeGroup, FUN = sum, na.rm = TRUE))
1 2 3 4 5
1033 793 224 133 103
If you would just like the sum of one group at a time, try:
sum(df[df$AgeGroup == 5, "Money"], na.rm = TRUE)
[1] 103

I think the following function should do the trick.
> AGE <- c(1,2,3,2,5,5)
> MONEY <- c(100,200,300,400,200,100)
> dat <- data.frame(cbind(AGE,MONEY))
> dat
AGE MONEY
1 1 100
2 2 200
3 3 300
4 2 400
5 5 200
6 5 100
> getSumOfGroup <- function(df, group){
+   return(sum(df[df$AGE == group, "MONEY"]))
+ }
> getSumOfGroup(dat, 5)
[1] 300

Creating a variable based on other factors using R

My data looks like this:
hh_id  indl  ind_salary  hh_income
    1     1         200
    1     2         450
    1     3           0
    2     4        1232
    2     5         423
Individuals with the same hh_id live in the same household, so they share the same household income: the variable hh_income should equal the sum of the salaries of all persons with the same hh_id.
So my data would look like:
hh_id  indl  ind_salary  hh_income
    1     1         200        650
    1     2         450        650
    1     3           0        650
    2     4        1232       1655
    2     5         423       1655
Any ideas, please?
Using dplyr:
data %>% group_by(hh_id) %>% mutate(hh_income = sum(ind_salary))
You can use the base R function ave to generate the sum of ind_salary grouped by hh_id; it returns a vector of the same length as ind_salary:
> df$hh_income <- ave(df$ind_salary, df$hh_id, FUN=sum)
> df
hh_id indl ind_salary hh_income
1 1 1 200 650
2 1 2 450 650
3 1 3 0 650
4 2 4 1232 1655
5 2 5 423 1655
Using only base R:
hh_id <- c(1, 1 ,1, 2, 2)
indl <- c(1, 2, 3, 4, 5)
ind_salary <- c(200, 450, 0, 1232, 423)
hh_df <- data.frame(hh_id, indl, ind_salary)
hh_income <- tapply(hh_df$ind_salary, hh_df$hh_id, sum)
hh_income <- as.data.frame(hh_income)
hh_income$hh_id <- rownames(hh_income)
hh_df <- merge(hh_df, hh_income, by = 'hh_id')
View(hh_df)
Just to add more explanation to KacZdr's answer, which would have helped me immensely as a beginner. This version is also more in line with standard tidyverse pipe style.
new_data <- data %>%                   # create a new dataset so you don't alter the original
  group_by(hh_id) %>%                  # group by the variable whose repeated values define each household
  mutate(hh_income = sum(ind_salary))  # add a column holding the sum of ind_salary within each hh_id

How to determine when a change in value occurs in R

I am following this example from Stack Overflow: Identifying where value changes in R data.frame column
There are two columns, ind and value. How do I identify the ind at which value increases by 100?
For example,
Value increases by 100 at ind = 4.
df <- data.frame(ind = 1:10,
                 value = as.character(c(100,100,100,200,200,200,300,300,400,400)),
                 stringsAsFactors = FALSE)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400
I tried this but it doesn't work:
miss <- function(x) ifelse(is.finite(x),x,NA)
value_xx =miss(min(df$ind[df$value[1:length(df$value)] >= 100], Inf, na.rm=TRUE))
Like this:
df$ind[c(FALSE, diff(as.numeric(df$value)) == 100)]
You can use diff to get the difference between consecutive values and find the indices where the difference is at least 100. Add 1 to those indices, since diff returns a vector that is one element shorter than the original. Because value is stored as character here, convert it with as.numeric first:
df$ind[which(diff(as.numeric(df$value)) >= 100) + 1]
#[1] 4 7 9
In dplyr, you can use lag to get the previous value (again converting value to numeric):
library(dplyr)
df %>% filter(as.numeric(value) - lag(as.numeric(value)) >= 100)
# ind value
#1 4 200
#2 7 300
#3 9 400

Subset specific row and last row from data frame

I have a data frame which contains data relating to the score of different events; there can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows, depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
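To see why this grabs the last row per ID, it helps to look at the intermediate values (a quick check on the data above):
r <- rle(Data$ID)  # runs of consecutive identical IDs
r$lengths          # 6 7 3: six rows of ID 1, seven of ID 2, three of ID 3
cumsum(r$lengths)  # 6 13 16: the row number of the last row for each ID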
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
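To see the row numbers that the inner grouped expression produces before they are used as a subset, you can run it on its own; the V1 column holds the selected indices per group (a quick check on the data above):
df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]
#    ID V1
# 1:  1  5
# 2:  1  6
# 3:  2 11
# 4:  2 12
# 5:  2 13
# 6:  3 16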
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
                  FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
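Note that ave returns a vector of the same type as its input, so the logical result comes back as 0/1 and needs as.logical. For reference, the intermediate vector on the data above looks like this:
ave(df$Score, df$ID, FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))
# 0 0 0 0 1 1 0 0 0 0 1 1 1 0 0 1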
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Delete following observations when goal has been reached

Given the dataframe:
df = data.frame(
  ID = c(1,1,1,1,2,3,3),
  Start = c(0,8,150,200,6,7,60),
  Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the remaining observations for that ID. (For ID 1, Start[3] - Stop[2] = 150 - 60 = 90 >= 28, so everything after the second row is dropped.)
In this example, the output should be:
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up having to set the NAs to a value that is easy to identify later, and used the following code:
df$Stop[is.na(df$Stop)] <- 10000
df$diff <- df$Start - c(0, head(df$Stop, -1))
space <- with(df, unique(ID[diff < 28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[, {
  w <- which(shift(Start, type = "lead") - Stop >= 28)
  if (length(w)) .SD[seq(w[1])] else .SD
}, by = ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
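For instance, to inspect what .SD holds for each group, a quick sketch (df is already a data.table here after the setDT above):
df[, {print(.SD); NULL}, by = ID]  # prints each ID's rows; the NULL keeps the result empty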
Create a diff column:
df$diff <- df$Start - c(0, head(df$Stop, -1))
Subset on the basis of this column:
df[df$diff < 28, ]
PS: I have converted the NA to 0; you would have to handle that anyway.
p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
This assumes you want to keep the entries where the next entry's Start is higher than the given entry's Stop by 28 or more.
The result is:
> p
[1] 2 3
> df[p,]
  ID Start Stop
2  1     8   60
3  1   150  170
Start in row 3 (i + 1 = 3) is higher than Stop in row 2 (i = 2) by 90.
Or, if by "until" you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking: you have to be very careful how you word your condition. If you want to keep all the cases where the difference between the next Start and the current Stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT those where the difference is 28 or more, then you should use
p <- which(df$Start[2:nrow(df)] - df$Stop[1:(nrow(df)-1)] >= 28)
rp <- which(!is.element(1:nrow(df), p))
df <- df[rp,]
as this will also keep the rows where the difference is unknown (NA).

How to bin ordered data by percentile for each id in R dataframe

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins: the 1st bin should be their fastest 20 percent of RTs, the 2nd bin their next fastest 20 percent, and so on. Each bin should have the same number of trials in it (unless the total number of trials doesn't split evenly).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin.
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, and assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and would greatly appreciate any help in streamlining this process. Thanks.
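For reference, the split/quantile idea described above can be written without an explicit loop. A rough sketch using ave and a hypothetical helper assign_bins (column names as in the question):
# For each person, compute their own quintile break points and map each RT
# to the bin (1-5) it falls into.
assign_bins <- function(rt) {
  breaks <- quantile(rt, probs = seq(0, 1, 0.2))
  findInterval(rt, breaks, rightmost.closed = TRUE)
}
df$Bin <- ave(df$RT, df$id, FUN = assign_bins)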
The answer @Chase gave splits the range into 5 groups of equal length (difference of endpoints). What you seem to want is quintiles (5 groups with an equal number in each group). For that, you need the cut2 function in Hmisc:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want:
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
With the same number in each hists bin for each id:
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
library(plyr)
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass labels = FALSE to cut if you want simple integer codes returned instead of the interval labels.
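For example, the same call with integer bin codes:
ddply(dat, "id", transform, hists = cut(value, breaks = 5, labels = FALSE))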
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], each = 20))
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep(unlist(tapply(df$rt, df$id, quantile)), each = 4)
You'll note that the quantile command can be set to use any quantiles. The default is quintiles, but if you want deciles then use
quantile(x, seq(0, 1, 0.1))
in the call above.
The answer above is a bit fragile: it requires an equal number of RTs per id, and I didn't tell you how to get to the magic number 4 (here it is the 20 RTs per id divided by the 5 values quantile returns). But it will run very fast on a large dataset. If you want a more robust solution, use cut2 from Hmisc:
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply(unique(df$id), function(x) cut2(df$rt[df$id == x], g = 5)))
This is much more robust than the first solution, but it isn't as fast. For small datasets you won't notice the difference.
