I have a dataframe that looks kind of like this with many more rows and columns:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
> df
country year leader natural.death gdp.growth.rate
1 Australia 1945 David 0 1
2 Australia 1946 NA NA 4
3 Australia 1947 NA NA 3
4 Angola 1945 NA NA 5
5 Angola 1946 Henry 1 6
6 Angola 1947 NA NA 1
7 US 1945 Tom 1 5
8 US 1946 NA NA 7
9 US 1947 Chris 0 9
I am trying to add x new columns, where x is the number of unique leaders (column leader) who satisfy the condition of having died (natural.death==1). For this df, I would expect to get 2 new columns, for Henry and Tom, with values of 0,0,0,0,1,0,0,0,0 and 0,0,0,0,0,0,1,0,0, respectively. Preferably the two new columns would be called id1 and id2, following the order in which the deaths appear in natural.death. I need to create 69 new columns, as there are 69 leaders who died, so I am looking for a non-manual method to deal with this.
I have already tried loops, if, for, unique, mtabulate, dcast, and dummies, but unfortunately I could not get anything to work.
I am hoping to get:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9),
+ id1=c(0,0,0,0,1,0,0,0,0),id2=c(0,0,0,0,0,0,1,0,0))
> df
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
Here is a crude way to do this
df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
tmp=which(df$natural.death==1) #index of deaths
lng=length(tmp) #number of deaths
#create matrix with zeros and lng columns, append to df
df=cbind(df,data.frame(matrix(0,nrow=nrow(df),ncol=lng)))
#change the newly added column names
colnames(df)[(ncol(df)-lng+1):ncol(df)]=paste0("id",1:lng)
for (i in 1:lng) { #loop over new columns
df[tmp[i],paste0("id",i)]=1 #at index i of death and column id+i set df to 1
}
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
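The loop above can also be written without a loop by using matrix indexing. This is only a sketch under the same assumptions as the answer above (one new column per row where natural.death is 1, in row order); the names died and ids are made up for illustration. Using seq_along() instead of 1:lng also keeps it from misbehaving when there are no deaths at all.

```r
df <- data.frame(
  country = rep(c("Australia", "Angola", "US"), each = 3),
  year = rep(c("1945", "1946", "1947"), times = 3),
  leader = c("David", "NA", "NA", "NA", "Henry", "NA", "Tom", "NA", "Chris"),
  natural.death = c(0, NA, NA, NA, 1, NA, 1, NA, 0),
  gdp.growth.rate = c(1, 4, 3, 5, 6, 1, 5, 7, 9)
)

# Row indices of the deaths; which() silently drops the NA comparisons
died <- which(df$natural.death == 1)

# Zero matrix with one column per death, named id1, id2, ...
ids <- matrix(0, nrow = nrow(df), ncol = length(died),
              dimnames = list(NULL, paste0("id", seq_along(died))))

# Each (row, column) pair in the two-column index matrix marks one cell to set to 1
ids[cbind(died, seq_along(died))] <- 1

df <- cbind(df, ids)
```

With 69 deaths this gives id1 to id69 in one step, since the number of columns comes from length(died).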
And an approach with tidyverse.
library(tidyverse)
df %>%
mutate(id = ifelse(natural.death == 1, 1, 0),
id = ifelse(is.na(id), 0, id),
tmp = cumsum(id)) %>%
pivot_wider(names_prefix = "id",
names_from = tmp,
values_from = id,
values_fill = list(id = 0)) %>%
select(-id0)
country year leader natural.death gdp.growth.rate id1 id2
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
I have a long data set that is broken down by geographical location and year, with about 5 variables of interest (see structure below). Every time I try to convert it to wide form, I get told that there is duplication, so it can't be done.
df
Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2
I want to convert it into something like this
yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
2001
2002
2003
2004
Looking for something like this...?
> reshape(df, v.names= c("Obs1", "Obs2"), idvar="Yr", timevar ="Geo", direction="wide")
Yr Obs1.Dist1 Obs2.Dist1 Obs1.Dist2 Obs2.Dist2
1 2001 1 3 1.0 3.0
2 2002 2 5 0.9 5.0
3 2003 4 2 6.0 8.0
4 2004 2 1 2.0 0.2
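To get from the reshape() names (Obs1.Dist1 and so on) to the names asked for in the question (dist1obs1 and so on), something like the following might do. It is a sketch that rebuilds the example data by hand and assumes every value column is named like Obs1, Obs2, ...; the regular expression would need adjusting for other names.

```r
df <- data.frame(
  Yr = rep(2001:2004, times = 2),
  Geo = rep(c("Dist1", "Dist2"), each = 4),
  Obs1 = c(1, 2, 4, 2, 1, 0.9, 6, 2),
  Obs2 = c(3, 5, 2, 1, 3, 5, 8, 0.2)
)

wide <- reshape(df, v.names = c("Obs1", "Obs2"), idvar = "Yr",
                timevar = "Geo", direction = "wide")

# Swap "Obs1.Dist1" into "Dist1Obs1", then lower-case everything;
# "Yr" does not match the pattern and is only lower-cased
names(wide) <- tolower(sub("^(Obs[0-9]+)\\.(.*)$", "\\2\\1", names(wide)))
```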
Here is a solution using tidyr. Because spread works with one key-value pair, you need to first gather the Obs and unite the dist with it so that you have one key-value pair to work with. I also set the column names to be lower case as shown in the requested output.
library(tidyverse)
tbl <- read_table2(
"Yr Geo Obs1 Obs2
2001 Dist1 1 3
2002 Dist1 2 5
2003 Dist1 4 2
2004 Dist1 2 1
2001 Dist2 1 3
2002 Dist2 .9 5
2003 Dist2 6 8
2004 Dist2 2 .2"
)
tbl %>%
gather("obsnum", "obs", Obs1, Obs2) %>%
unite(colname, Geo, obsnum, sep = "") %>%
spread(colname, obs) %>%
`colnames<-`(str_to_lower(colnames(.)))
#> # A tibble: 4 x 5
#> yr dist1obs1 dist1obs2 dist2obs1 dist2obs2
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2001 1. 3. 1.00 3.00
#> 2 2002 2. 5. 0.900 5.00
#> 3 2003 4. 2. 6.00 8.00
#> 4 2004 2. 1. 2.00 0.200
Created on 2018-04-19 by the reprex package (v0.2.0).
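In newer versions of tidyr (1.0.0 and later, as far as I know), gather() and spread() are superseded by pivot_longer() and pivot_wider(), and pivot_wider() can take several value columns at once. A rough equivalent of the pipeline above, assuming that version of tidyr is installed and with the data rebuilt by hand, could look like this:

```r
library(tidyr)

tbl <- data.frame(
  Yr = rep(2001:2004, times = 2),
  Geo = rep(c("Dist1", "Dist2"), each = 4),
  Obs1 = c(1, 2, 4, 2, 1, 0.9, 6, 2),
  Obs2 = c(3, 5, 2, 1, 3, 5, 8, 0.2)
)

# names_glue builds each new column name from the Geo value and the
# name of the value column (.value), e.g. "Dist1Obs1"
wide <- pivot_wider(tbl, names_from = Geo, values_from = c(Obs1, Obs2),
                    names_glue = "{Geo}{.value}")

names(wide) <- tolower(names(wide))
```

The columns may come out in a different order than with spread(), so select by name rather than by position.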
I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know if there are individuals who invested in a complementary pension form in year t but did not hold it in year t+2 (the survey is conducted every two years). In this way I want to know how many people had a complementary pension form but gave it up before retirement (for example, for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
and what I actually get is the individuals whose pens variable changed over time (the command checks whether a variable is constant over time). So I find individuals whose pens variable changed from 0 (no complementary pension) in year t to 1 in year t+2 and vice versa, but I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always 2 rows then we could omit the any, simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)
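Note that the code above marks the ids with a decline as 1, while the desired output in the question has x = 0 for them. If the coding from the question is wanted, flipping the flag is enough. The following is a sketch using only the year, id and pens columns; it sorts by id and year first, on the assumption that the rows are not guaranteed to be in time order within each id. The name gave_up is made up for illustration.

```r
d.d <- data.frame(
  year = c(2002, 2002, 2004, 2004, 2006, 2008),
  id   = c(1, 2, 1, 2, 3, 3),
  pens = c(0, 1, 1, 0, 1, 1)
)

# diff() compares consecutive waves, so put the rows in time order within id
d.d <- d.d[order(d.d$id, d.d$year), ]

# 1 for every row of an id that ever goes from pens = 1 to pens = 0
gave_up <- ave(d.d$pens, d.d$id, FUN = function(p) any(diff(p) < 0))

# The question wants x = 0 for those ids and x = 1 for everyone else
d.d$x <- 1 - gave_up

# Back to the row order used in the question
d.d <- d.d[order(d.d$year, d.d$id), ]

# Number of individuals who gave up the pension at some point
n_gave_up <- length(unique(d.d$id[d.d$x == 0]))
```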
I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add date and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
# dfr is assumed to hold the summarised result of the pipeline above
reshape2::dcast(dfr, ym ~ type, value.var = "mean_value")
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
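For what it's worth, the same kind of table (12 rows per household, one summed column per type) can also be sketched in base R without extra packages. This assumes the dates and values printed in the question, and that empty months should show 0 rather than NA; the names long and wide are made up for illustration.

```r
df <- data.frame(
  household = rep(c("household1", "household2", "household3"), each = 3),
  date = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                   "1999-02-02", "1999-08-20", "1999-02-19",
                   "1999-07-01", "1999-10-13", "1999-01-01")),
  value = seq(100, 900, 100),
  type = rep(c("income", "water", "energy"), times = 3)
)

# Month as a factor with all 12 levels, so months without data are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# xtabs() sums value over every household/month/type combination (0 if none)
long <- as.data.frame(xtabs(value ~ household + month + type, data = df))

# One row per household and month, one column per type
wide <- reshape(long, idvar = c("household", "month"),
                timevar = "type", direction = "wide")
names(wide) <- sub("^Freq\\.", "", names(wide))
wide <- wide[order(wide$household, wide$month), ]
```

If the data span more than one year, a year column would have to be added to the formula and to idvar, otherwise the same month of different years is summed together.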
I have a data frame (panel data): the Ctry column indicates the name of the countries in my data frame. If the number of NAs in any column (for example, Carx) is larger than 3 for a country, I want to drop that country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B in my data frame. I have a data frame like this (This is for illustration, my data frame is actually very huge):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs in each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data=DF$Carx,INDICES=DF$Ctry,FUN=function(x)sum(is.na(x)))
validCtry <-names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT :
if you have more columns to check, you could adapt the previous code as follows:
res <- by(data=DF,INDICES=DF$Ctry,
FUN=function(x){
return(sum(is.na(x$Carx)) <= 3 &&
sum(is.na(x$Barx)) <= 3 &&
sum(is.na(x$Tarx)) <= 3)
})
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
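If there are many columns to check, spelling out each one in FUN gets tedious. One possible variation is to loop over a vector of column names and count the NAs per country with ave(). This is only a sketch: Barx here is a hypothetical second column with made-up values, and cols, na_counts and keep are names chosen for illustration.

```r
DF <- data.frame(
  Ctry = rep(c("A", "B", "C"), each = 6),
  year = rep(2000:2005, times = 3),
  Carx = c(23, 18, 20, NA, 24, 18,
           NA, NA, NA, NA, 18, 16,
           NA, NA, 24, 21, NA, 24),
  # Hypothetical second column, not part of the original question
  Barx = c(1, 2, 3, 4, 5, 6,
           1, 2, 3, 4, 5, 6,
           NA, NA, NA, NA, 5, 6)
)

cols <- c("Carx", "Barx")  # columns to check

# One column per checked variable: NA count of that variable
# within the row's country, repeated on every row of the country
na_counts <- sapply(cols, function(cl) {
  ave(as.numeric(is.na(DF[[cl]])), DF$Ctry, FUN = sum)
})

# Keep a row only if none of the checked columns exceeds 3 NAs for its country
keep <- rowSums(na_counts > 3) == 0
result <- DF[keep, ]
```

With this made-up data, B is dropped because of Carx and C because of Barx, so only country A remains.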
Since you mention that your data is "very huge" (whatever that means exactly), you could try a solution with dplyr and see if it is perhaps faster than the solutions in base R. If the other solutions are fast enough, just ignore this one.
library(dplyr)
# %.% was removed from dplyr; %>% is the current pipe operator
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)