Creating a variable based on other factors using R [duplicate]

My data looks like this:
hh_id  indl  ind_salary  hh_income
    1     1         200
    1     2         450
    1     3           0
    2     4        1232
    2     5         423
Individuals with the same hh_id live in the same household, so they share the same household income. The variable hh_income should therefore equal the sum of the salaries of all persons with the same hh_id.
So my data would look like this:
hh_id  indl  ind_salary  hh_income
    1     1         200        650
    1     2         450        650
    1     3           0        650
    2     4        1232       1655
    2     5         423       1655
Any ideas, please?

Using dplyr:
library(dplyr)

data %>%
  group_by(hh_id) %>%
  mutate(hh_income = sum(ind_salary)) %>%
  ungroup()  # drop the grouping once the column is added

You can use the base R function ave to compute the sum of ind_salary grouped by hh_id; it returns a vector the same length as ind_salary:
> df$hh_income <- ave(df$ind_salary, df$hh_id, FUN=sum)
> df
hh_id indl ind_salary hh_income
1 1 1 200 650
2 1 2 450 650
3 1 3 0 650
4 2 4 1232 1655
5 2 5 423 1655

Using only base R:
# build the example data
hh_id <- c(1, 1, 1, 2, 2)
indl <- c(1, 2, 3, 4, 5)
ind_salary <- c(200, 450, 0, 1232, 423)
hh_df <- data.frame(hh_id, indl, ind_salary)

# sum salaries within each household
hh_income <- tapply(hh_df$ind_salary, hh_df$hh_id, sum)
hh_income <- as.data.frame(hh_income)
hh_income$hh_id <- rownames(hh_income)

# join the household totals back onto the individual-level rows
hh_df <- merge(hh_df, hh_income, by = 'hh_id')
View(hh_df)

Just to add more explanation to KacZdr's answer, which would have helped me immensely as a beginner. This also follows standard tidyverse pipe formatting:
new_data <- data %>%                 # create a new dataset so you don't alter the original
  group_by(hh_id) %>%                # group by the variable whose repeated values define the households
  mutate(income = sum(ind_salary))   # add a column "income" filled with the sum of ind_salary for each hh_id; this is what you called hh_income in your table

Related

How do I sum a column of values consecutively row by row within certain column IDs?

This is a pretty complicated one, sorry ahead of time!
I am attempting to cumulatively sum the values of one column (CURRENT_FIX_DURATION) into a new column, but only within specified groups of rows (defined by TRIAL_INDEX, 1-160, nested within ID, 75 separate IDs).
Here is part of my data frame:
  ID       TRIAL_INDEX CURRENT_FIX_DURATION CURRENT_FIX_INDEX CURRENT_FIX_INTEREST_AREA_INDEX
1 bb10jml2           1                  462                 1                               5
2 bb10jml2           1                  166                 2                               3
3 bb10jml2           1                   60                 3                               .
4 bb10jml2           1                  118                 4                               4
5 bb10jml2           1                   60                 5                               .
There are 160 trials for each ID, and 75 separate IDs, with a varying number of values to sum in the CURRENT_FIX_DURATION column.
I want to add up the values of CURRENT_FIX_DURATION as a running total, with the summing stopping at the end of a trial and restarting for the next trial.
Here's a sample output of what I would want to achieve:
CURRENT_FIX_DURATION
462
628
688
806
866
I would want this running total to continue until it reached TRIAL_INDEX 2, and then start over, so the first value of the new trial is not summed with the previous TRIAL_INDEX's CURRENT_FIX_DURATION values.
Is this possible to achieve? I thought of using for loops, but I'm not sure where to begin within a data frame.
In general, the difficulty is compounded by the fact that the number of values to be added for each subject/trial is completely variable.
Should I convert this to long format and try with ddply?
Let me know what you think or if you would like more information!
Thank you for your time!
Here is a solution within the tidyverse using map2 from the purrr package.
library(tidyverse)

mydata <- tibble(id = rep("a", 5), trial_index = rep(1, 5),
                 current_fix_duration = c(462, 166, 60, 118, 60),
                 current_fix_index = 1:5)

newdata <- mydata %>%
  group_by(id) %>%
  mutate(current_fix_duration2 = map2_dbl(trial_index, current_fix_index,
                                          ~ sum(current_fix_duration[.x:.y]))) %>%
  as.data.frame()
# A tibble: 5 x 5
# Groups: id [1]
id trial_index current_fix_duration current_fix_index current_fix_duration2
<chr> <dbl> <dbl> <int> <dbl>
1 a 1 462 1 462
2 a 1 166 2 628
3 a 1 60 3 688
4 a 1 118 4 806
5 a 1 60 5 866
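Since this is just a running total that restarts with each trial, a simpler equivalent (my suggestion, not part of the original answer) is cumsum() grouped by both id and trial_index:
library(dplyr)

newdata <- mydata %>%
  group_by(id, trial_index) %>%                                 # restart the total at each new trial
  mutate(current_fix_duration2 = cumsum(current_fix_duration)) %>%
  ungroup()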

Aggregate/collapse data frame

Is there an "all-in-one" convenience function in R that can collapse/aggregate a data frame to resolve the many-to-many problem? The motivation is to reduce many-to-many relationships so that two or more tables can be joined on some primary key (a column with unique identifier values). To elucidate, consider a data frame like:
set.seed(1)  # for reproducibility
df <- data.frame(id = sort(rep(seq(1, 3), 4)),  # primary key
                 geo_loc = state.abb[sample(seq(1, length(state.name)),  # state abbreviations
                                            size = length(sort(rep(seq(1, 3), 4))),
                                            replace = TRUE)],
                 revenue = c(sample(seq(0, 50), size = 3), sample(seq(101, 200), size = 3),
                             sample(seq(201, 300), size = 4), sample(seq(301, 1000), size = 2)),
                 prod_id = sample(LETTERS[seq(1, 4)], size = 12, replace = TRUE),
                 quant = c(sample(seq(0, 5), size = 4), sample(seq(3, 8), size = 4),
                           sample(seq(6, 11), size = 2), sample(seq(9, 14), size = 2)))
df
id geo_loc revenue prod_id quant
1 1 MN 47 D 0
2 1 MA 29 B 3
3 1 SD 50 B 4
4 1 NM 174 A 1
5 2 NC 136 D 6
6 2 LA 143 B 5
7 2 IN 215 C 8
8 2 WY 202 A 4
9 3 NY 271 A 10
10 3 HI 211 C 9
11 3 CT 613 C 10
12 3 MS 748 A 14
Does a function already exist that will collapse this table so there is only one row per unique id? It would have to convert the geo_loc and prod_id columns to k - 1 dummy columns each (for k levels). It would also be nice if such a function could automatically bin revenue into a number of blocks, perhaps based on quantiles.
Only aggregate when you have a proper grouping variable. It would be more logical to aggregate by prod_id, for example.
To perform these data tidying and aggregating operations I would personally recommend spread() and gather() from the tidyr package and summarise() and group_by() from the dplyr package.
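As an illustrative sketch (the choice of summary statistics here is my assumption; the question doesn't specify them), collapsing df to one row per id with dplyr could look like:
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(total_revenue = sum(revenue),          # collapse revenue by summing
            total_quant   = sum(quant),            # collapse quantity by summing
            n_products    = n_distinct(prod_id))   # number of distinct products per id
For the dummy-column part specifically, base R's model.matrix(~ geo_loc + prod_id, df) produces the k - 1 indicator columns per factor.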

If Statements and logical operators in R

I have a dataframe with a Money column and an Age Group column.
The Money column has NAs and the Age Group column has values that range from 1 to 5.
What I want to do is find the sum of the Money column when the AgeGroup column equals a certain value. Say 5 for this example.
I have been attempting to use an if statement, but I am getting the warning "the condition has length > 1 and only the first element will be used".
if (df$AgeGroup == 5)
  SumOfMoney <- sum(df$Money)
My problem is I don't know how to turn "if" into "when". I want to sum the Money column when those rows that have an AgeGroup value of 5, or 3, or whatever I choose.
I believe I have the condition correct, do I add a second if statement when calculating the sum?
I would use data.table for this 'by-group' operation.
library(data.table)
setDT(df)[, list(sm = sum(Money, na.rm = TRUE)), AgeGroup]
This will compute the sum of Money by group. To filter the result for a particular group value:
setDT(df)[, list(sm = sum(Money, na.rm = TRUE)), AgeGroup][AgeGroup == 4]
Try:
library(dplyr)
df %>%
  group_by(AgeGroup) %>%
  summarise(Money = sum(Money, na.rm = TRUE))
Which gives:
#Source: local data frame [5 x 2]
#
# AgeGroup Money
#1 1 1033
#2 2 793
#3 3 224
#4 4 133
#5 5 103
If you want to subset for a specific AgeGroup you could add:
... %>% filter(AgeGroup == 5)
Try:
set.seed(7)
df <- data.frame(AgeGroup = sample(1:5, 10, replace = TRUE), Money = sample(100:500, 10))
df[1,2] <- NA
AgeGroup Money
1 5 NA
2 2 192
3 1 408
4 1 138
5 2 280
6 4 133
7 2 321
8 5 103
9 1 487
10 3 224
with(df, tapply(Money, AgeGroup, FUN = sum, na.rm = TRUE))
1 2 3 4 5
1033 793 224 133 103
If you would like to just have the sum of one group at a time try:
sum(df[df$AgeGroup == 5, "Money"], na.rm = TRUE)
[1] 103
I think the following function should do the trick.
> AGE <- c(1,2,3,2,5,5)
> MONEY <- c(100,200,300,400,200,100)
> dat <- data.frame(AGE, MONEY)
> dat
AGE MONEY
1 1 100
2 2 200
3 3 300
4 2 400
5 5 200
6 5 100
> getSumOfGroup <- function(df, group){
+   return(sum(df[df$AGE == group, "MONEY"], na.rm = TRUE))
+ }
> getSumOfGroup(dat, 5)
[1] 300

How do I remove rows from a data.frame where two specific columns have missing values?

Say I write the following code to produce a dataframe:
name <- c("Joe","John","Susie","Mack","Mo","Curly","Jim")
age <- c(1,2,3,NaN,4,5,NaN)
DOB <- c(10000, 12000, 16000, NaN, 18000, 20000, 22000)
DOB <- as.Date(DOB, origin = "1960-01-01")
trt <- c(0, 1, 1, 2, 2, 1, 1)
df <- data.frame(name, age, DOB, trt)
that looks like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
4 Mack NaN <NA> 2
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
How would I be able to remove rows where both age and DOB have missing values for the row? For example, I'd like a new dataframe (df2) to look like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
I've tried the following code, but it deleted too many rows:
df2 <- df[!(is.na(df$age)) & !(is.na(df$DOB)), ]
In SAS, I would just write
WHERE missing(age) ge 1 AND missing(DOB) ge 1 in a DATA step, but obviously R has different syntax.
Thanks in advance!
If you want to remove only those rows where both of the two columns (age and DOB) are NA (i.e. the row has 2 NAs across those columns), you can do for example:
df[!is.na(df$age) | !is.na(df$DOB),]
which keeps a row when at least one of the two columns is not NA, or
df[rowSums(is.na(df[2:3])) < 2L,]
which means that the sum of NAs in columns 2 and 3 should be less than 2 (hence, 1 or 0) or very similar:
df[rowSums(is.na(df[c("age", "DOB")])) < 2L,]
And of course there's other options, like what #rawr provided in the comments.
And to better understand the subsetting, check this:
rowSums(is.na(df[2:3]))
#[1] 0 0 0 2 0 0 1
rowSums(is.na(df[2:3])) < 2L
#[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE
You were pretty close
df[!(is.na(df$age) & is.na(df$DOB)), ]
or
df[!is.na(df$age) | !is.na(df$DOB), ]
Maybe this could be easier:
require(tidyverse)
df <- drop_na(df, c("age", "DOB"))
Note, however, that drop_na() drops rows with an NA in any of the listed columns, so this also removes Jim's row; it does not reproduce the requested output, which keeps rows where only one of the two columns is missing.
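If you prefer the pipe style, a dplyr equivalent of the accepted logic (a sketch of my own, not from the original answers) would be:
library(dplyr)

df2 <- df %>%
  filter(!(is.na(age) & is.na(DOB)))  # drop a row only when both columns are missing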

How to bin ordered data by percentile for each id in an R dataframe

I have a dataframe that contains 70-80 rows of ordered response time (RT) data for each of 228 people, each with a unique id (not everyone has the same number of rows). I want to bin each person's RTs into 5 bins. I want the 1st bin to be their fastest 20 percent of RTs, the 2nd bin to be their next fastest 20 percent, and so on. Each bin should contain the same number of trials (unless the total number of trials doesn't divide evenly).
My current dataframe looks like this:
id RT
7000 225
7000 250
7000 253
7001 189
7001 201
7001 225
I'd like my new dataframe to look like this:
id RT Bin
7000 225 1
7000 250 1
After getting my data to look like this, I will aggregate by id and bin.
The only way I can think of to do this is to split the data into a list (using the split command), loop through each person, use the quantile command to get break points for the different bins, and assign a bin value (1-5) to every response time. This feels very convoluted (and would be difficult for me). I'm in a bit of a jam and I would greatly appreciate any help in streamlining this process. Thanks.
The answer #Chase gave splits the range into 5 groups of equal length (equal difference between endpoints). What you seem to want is quintiles (5 groups with an equal number of observations in each). For that, you need the cut2 function from the Hmisc package:
library("plyr")
library("Hmisc")
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
tmp <- ddply(dat, "id", transform, hists = as.numeric(cut2(value, g = 5)))
tmp now has what you want
> tmp
id value hists
1 1 0.19016791 3
2 1 0.27795226 4
3 1 0.74350982 5
4 1 0.43459571 4
5 1 -2.72263322 1
....
95 10 -0.10111905 3
96 10 -0.28251991 2
97 10 -0.19308950 2
98 10 0.32827137 4
99 10 -0.01993215 4
100 10 -1.04100991 1
with the same number in each hists bin for each id:
> table(tmp$id, tmp$hists)
1 2 3 4 5
1 2 2 2 2 2
2 2 2 2 2 2
3 2 2 2 2 2
4 2 2 2 2 2
5 2 2 2 2 2
6 2 2 2 2 2
7 2 2 2 2 2
8 2 2 2 2 2
9 2 2 2 2 2
10 2 2 2 2 2
Here's a reproducible example using package plyr and the cut function:
dat <- data.frame(id = rep(1:10, each = 10), value = rnorm(100))
ddply(dat, "id", transform, hists = cut(value, breaks = 5))
id value hists
1 1 -1.82080027 (-1.94,-1.41]
2 1 0.11035796 (-0.36,0.166]
3 1 -0.57487134 (-0.886,-0.36]
4 1 -0.99455189 (-1.41,-0.886]
....
96 10 -0.03376074 (-0.233,0.386]
97 10 -0.71879488 (-0.853,-0.233]
98 10 -0.17533570 (-0.233,0.386]
99 10 -1.07668282 (-1.47,-0.853]
100 10 -1.45170078 (-1.47,-0.853]
Pass in labels = FALSE to cut if you want simple integer values returned instead of the bins.
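For example (the same call, returning integer bin codes instead of interval labels):
ddply(dat, "id", transform, hists = cut(value, breaks = 5, labels = FALSE))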
Here's an answer in plain old R.
#make up some data
df <- data.frame(rt = rnorm(60), id = rep(letters[1:3], each = 20))
#and this is all there is to it
df <- df[order(df$id, df$rt),]
df$bin <- rep(unlist(tapply(df$rt, df$id, quantile)), each = 4)
You'll note that the quantile command can be set to use any quantiles. The default is quartile break points (0%, 25%, 50%, 75%, 100%), but if you want deciles then use
quantile(x, seq(0, 1, 0.1))
in the function above.
The answer above is a bit fragile. It requires an equal number of RTs per id, and I didn't tell you how to get to the magic number 4. It will, however, run very fast on a large dataset. If you want a more robust solution (this one uses Hmisc rather than pure base R):
library('Hmisc')
df <- df[order(df$id),]
df$bin <- unlist(lapply(unique(df$id), function(x) cut2(df$rt[df$id == x], g = 5)))
This is much more robust than the first solution but it isn't as fast. For small datasets you won't notice.
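For completeness, a modern tidyverse idiom (my addition, not from the original answers) is dplyr::ntile(), which assigns equal-sized bins directly within each group:
library(dplyr)

# sample data matching the question's layout
df <- data.frame(id = rep(c(7000, 7001), each = 3),
                 RT = c(225, 250, 253, 189, 201, 225))

df %>%
  group_by(id) %>%
  mutate(Bin = ntile(RT, 5)) %>%  # 1 = fastest 20% of RTs, 5 = slowest
  ungroup()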
