Fill in Column Based on Other Rows (R)

I am looking for a way to fill in a column in R based on values in a different column. Below is what my data looks like.
year action player  end
2001      1   Mike 2003
2002      0   Mike   NA
2003      0   Mike   NA
2004      0   Mike   NA
2001      0   Alan   NA
2002      0   Alan   NA
2003      1   Alan 2004
2004      0   Alan   NA
I would like to either change the "action" column or create a new column so that it is 1 for every year from the "year" of the action through the "end" year. Below is what it would look like:
year action player  end
2001      1   Mike 2003
2002      1   Mike   NA
2003      1   Mike   NA
2004      0   Mike   NA
2001      0   Alan   NA
2002      0   Alan   NA
2003      1   Alan 2004
2004      1   Alan   NA
I have tried to do this with the following loop:
i <- 0
z <- 0
for (i in 1:nrow(df)){
  i <- z + i + 1
  if (df[i, 2] == 0) {}
  else {df[i, 5] = (df[i, 4] - df[i, 1])}
  z <- df[i, 5]
  for (z in i:nrow(df)){df[i, 2] = 1}
}
Here, my i value skyrockets and breaks the loop, and I am not sure why that is occurring. I'd be interested either in a fix for my approach or in a smarter way to do this.

There's no need for explicit loops here. (As an aside, your i skyrockets because the inner for loop leaves z equal to nrow(df), so the next iteration's i <- z + i + 1 jumps past the end of the data frame.)
First group your data frame by player. Then flag the rows where the cumulative sum (cumsum) of action is greater than 0 and the year is less than or equal to the group's end year. If a row meets both conditions, set action to 1, otherwise to 0.
Using the dplyr package you could achieve this in a couple of lines:
library(dplyr)
df %>%
  group_by(player) %>%
  mutate(action = as.numeric(cumsum(action) > 0 & year <= na.omit(end)[1]))
#> # A tibble: 8 x 4
#> # Groups: player [2]
#> year action player end
#> <int> <dbl> <chr> <int>
#> 1 2001 1 Mike 2003
#> 2 2002 1 Mike NA
#> 3 2003 1 Mike NA
#> 4 2004 0 Mike NA
#> 5 2001 0 Alan NA
#> 6 2002 0 Alan NA
#> 7 2003 1 Alan 2004
#> 8 2004 1 Alan NA
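For comparison, the same logic works in base R with ave(); a minimal sketch, assuming df is the data frame shown above:
# ave() hands FUN the row indices of each player group and writes the
# results back in the original row order
df$action <- ave(seq_len(nrow(df)), df$player, FUN = function(i) {
  started <- cumsum(df$action[i]) > 0   # TRUE from the first action == 1 onward
  stop_yr <- na.omit(df$end[i])[1]      # the group's single end year (NA if none)
  as.numeric(started & df$year[i] <= stop_yr)
})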

Related

Create dummy variables for every unique value in a column based on a condition from a second column in R

I have a dataframe that looks like this, but with many more rows and columns:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
> df
country year leader natural.death gdp.growth.rate
1 Australia 1945 David 0 1
2 Australia 1946 NA NA 4
3 Australia 1947 NA NA 3
4 Angola 1945 NA NA 5
5 Angola 1946 Henry 1 6
6 Angola 1947 NA NA 1
7 US 1945 Tom 1 5
8 US 1946 NA NA 7
9 US 1947 Chris 0 9
I am trying to add x new columns, where x is the number of unique leaders (column leader) who died of natural causes (natural.death == 1). For this df, I would expect to get 2 new columns, for Henry and Tom, with values 0,0,0,0,1,0,0,0,0 and 0,0,0,0,0,0,1,0,0 respectively. I would prefer the two new columns to be called id1 and id2, following the order in which the deaths appear in natural.death. I need to create 69 new columns, as there are 69 leaders who died, so I am looking for a non-manual method to do this.
I have already tried loops, if, for, unique, mtabulate, dcast, and dummies, but unfortunately I could not get anything to work.
I am hoping to get:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9),
+ id1=c(0,0,0,0,1,0,0,0,0),id2=c(0,0,0,0,0,0,1,0,0))
> df
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
Here is a crude way to do this
df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
tmp = which(df$natural.death == 1)  # row indices of deaths
lng = length(tmp)                   # number of deaths
# create a matrix of zeros with lng columns and append it to df
df = cbind(df, data.frame(matrix(0, nrow = nrow(df), ncol = lng)))
# rename the newly added columns
colnames(df)[(ncol(df) - lng + 1):ncol(df)] = paste0("id", 1:lng)
for (i in 1:lng) {                  # loop over the new columns
  df[tmp[i], paste0("id", i)] = 1   # mark the i-th death row in column id<i>
}
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
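The loop can also be replaced by matrix indexing, since each death marks exactly one (row, column) cell; a sketch under the same assumptions:
idx <- which(df$natural.death == 1)                 # rows where a death occurs
m <- matrix(0, nrow = nrow(df), ncol = length(idx),
            dimnames = list(NULL, paste0("id", seq_along(idx))))
m[cbind(idx, seq_along(idx))] <- 1                  # set one cell per death
df <- cbind(df, m)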
And an approach with the tidyverse:
library(tidyverse)
df %>%
  mutate(id = ifelse(natural.death == 1, 1, 0),
         id = ifelse(is.na(id), 0, id),
         tmp = cumsum(id)) %>%
  pivot_wider(names_prefix = "id",
              names_from = tmp,
              values_from = id,
              values_fill = list(id = 0)) %>%
  select(-id0)
country year leader natural.death gdp.growth.rate id1 id2
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0

error spreading data set in R

I have a long data set broken down by geographical location and year, with about 5 variables of interest (see structure below). Every time I try to convert it to wide form, I get an error about duplicated rows, so the conversion fails.
df
Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2
I want to convert it into something like this
yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
2001
2002
2003
2004
Looking for something like this...?
> reshape(df, v.names= c("Obs1", "Obs2"), idvar="Yr", timevar ="Geo", direction="wide")
Yr Obs1.Dist1 Obs2.Dist1 Obs1.Dist2 Obs2.Dist2
1 2001 1 3 1.0 3.0
2 2002 2 5 0.9 5.0
3 2003 4 2 6.0 8.0
4 2004 2 1 2.0 0.2
Here is a solution using tidyr. Because spread works with one key-value pair, you need to first gather the Obs and unite the dist with it so that you have one key-value pair to work with. I also set the column names to be lower case as shown in the requested output.
library(tidyverse)
tbl <- read_table2(
"Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2"
)
tbl %>%
  gather("obsnum", "obs", Obs1, Obs2) %>%
  unite(colname, Geo, obsnum, sep = "") %>%
  spread(colname, obs) %>%
  `colnames<-`(str_to_lower(colnames(.)))
#> # A tibble: 4 x 5
#> yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2001 1. 3. 1.00 3.00
#> 2 2002 2. 5. 0.900 5.00
#> 3 2003 4. 2. 6.00 8.00
#> 4 2004 2. 1. 2.00 0.200
Created on 2018-04-19 by the reprex package (v0.2.0).
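gather() and spread() have since been superseded in tidyr; the same reshape could be sketched with pivot_wider() (assuming tidyr >= 1.0; the column order comes out slightly different):
tbl %>%
  pivot_wider(names_from = Geo, values_from = c(Obs1, Obs2),
              names_glue = "{tolower(Geo)}{tolower(.value)}") %>%
  rename(yr = Yr)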

How to find observations whose dummy variable changes from 1 to 0 (and not vice versa) in a df in r

I have a survey of n individuals; each individual appears more than once in the survey (a panel). I have a variable pens, a dummy that takes the value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id identifies the individual, y.b is the year of birth, and pens is the complementary-pension dummy.
I want to know whether there are individuals who invested in a complementary pension form in year t but no longer held it in year t+2 (the survey is conducted every two years). That way I can see how many people had a complementary pension form but gave it up before retirement (for example, for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
and it does give me the individuals whose pens variable changed over time (the command checks whether a variable is constant in time). But it flags individuals whose pens variable changed from 0 (didn't have a complementary pension) in year t to 1 in year t+2 as well as vice versa, whereas I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df, I get that for both id 1 and id 2 the variable x is 0 (the pens variable isn't constant), but I need a way to get just id 2 (whose pens variable changed from 1 to 0):
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the ids for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always 2 rows then we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
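The same check translates directly to dplyr; a sketch, assuming the d.d frame above:
library(dplyr)
d.d %>%
  arrange(id, year) %>%   # diff() needs the rows in time order within each id
  group_by(id) %>%
  mutate(x = as.numeric(any(diff(pens) < 0))) %>%
  ungroup()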

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, then filtering by each type to create a separate data frame of summed values per transaction type, and finally merging those data frames into the summarized df. I attempted to summarize it using ddply, but it aggregated too much and I couldn't keep the household-level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym <- zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>%
  group_by(ym, type) %>%
  summarise(mean_value = mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
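(Here dfr stands for the summarised frame produced by the pipe above.) With tidyr, the same wide step could be sketched as:
library(tidyr)
df %>%
  group_by(ym, type) %>%
  summarise(mean_value = mean(value)) %>%
  pivot_wider(names_from = type, values_from = mean_value)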
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
  mutate(date = lubridate::month(date)) %>%
  complete(household, date = 1:12) %>%
  spread(type, value) %>%
  group_by(household, date) %>%
  mutate(Total = sum(energy, income, water, na.rm = TRUE)) %>%
  select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was to the value column: instead of 1:9, I used seq(100, 900, 100).
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data). The Ctry column gives the country names. If, in any column (for example Carx), the number of NAs is larger than 3, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
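The counting can also be done once per country with tapply(), rather than once per row with ave(); a minimal sketch, assuming mydf as above:
na_counts <- tapply(is.na(mydf$Carx), mydf$Ctry, sum)   # NAs per country
mydf[mydf$Ctry %in% names(na_counts)[na_counts <= 3], ]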
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            sum(is.na(x$Carx)) <= 3 &&
              sum(is.na(x$Barx)) <= 3 &&
              sum(is.na(x$Tarx)) <= 3
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
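If there are many columns to check, the explicit conditions can be collapsed with colSums(); a sketch, assuming the columns to check are the ones named in cols:
cols <- c("Carx", "Barx", "Tarx")   # columns whose NA counts matter
res <- by(data = DF[cols], INDICES = DF$Ctry,
          FUN = function(x) all(colSums(is.na(x)) <= 3))
DF[DF$Ctry %in% names(res)[unlist(res)], ]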
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it's perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
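If the data really is huge, a data.table sketch of the same group-wise filter (assuming df as above) may scale better:
library(data.table)
dt <- as.data.table(df)
dt[, if (sum(is.na(Carx)) <= 3) .SD, by = Ctry]   # keeps whole groups that pass the test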
