How do I merge two dataframes with conflicting values - r

I'm sorry if this is a duplicate question, but I have looked around at similar problems and haven't been able to find a real solution. Anyway, here goes:
I've read a .csv file into a table. There I'm dealing with 3 columns:
"ID"(author's ID), "num_pub"(number of articles published), and "year"(spans from 1930 to 2017).
I would like to get a final table where I would have "num_pub" for each "year", for every "ID". So rows would be "ID"s, columns would be "year"s, and underneath each year there would be the corresponding "num_pub" or 0 value if the author hasn't published anything.
I have tried creating two new tables and merging them in a few different ways described here but to no avail.
So first I read my file into a table:
tab<-read.table("mytable.csv",sep=",",head=T,colClasses=c("character","numeric","factor"))
head(tab,10)
ID num_pub year
1 00002 1 1977
2 00002 2 1978
3 00002 1 1983
4 00002 4 1984
5 00002 3 1990
6 00002 1 1994
7 00002 2 1996
8 00004 3 1957
9 00004 1 1958
10 00004 1 1959
With that, I was then able to create a table where for each "ID", there was every single "year", and if the author published in that year, the value was 1, otherwise it was 0:
a<-table(tab[,1], tab[,3])
Calling head(a,1) returns the following table: pic
I would like to know how to achieve the desired result I described above. Namely, having a table where rows would be populated with "ID"s, columns would be populated with "year"s (from 1930 to 2017), and underneath each year, there would be an actual "num_pub" value or a 0 value. The structure of the table would be just like the one shown in the pic
Thank you for your time and help. I'm very new to R, and kind of stuck in the mud with this.
Edit: the reshape approach as explained here does not solve my problem. I need zeros in place of "NA"s, and I want my year to start with 1930 instead of the first year that the author has published.

using reshape2 & dcast one can change to a wide format and then pipe through to replace NAs with 0s.
library(reshape2)
library(dplyr)
dcast(tab, ID~year, value.var = "num_pub") %>%
replace(is.na(.), 0)
ID 1957 1958 1959 1977 1978 1983 1984 1990 1994 1996
1 00002 0 0 0 1 2 1 4 3 1 2
2 00004 3 1 1 0 0 0 0 0 0 0

You can use complete to fill in the zeros for non available data, and then spread to turn your column of years into multiple columns (both from the tidyr package):
library(tidyr)
df_complete <-
complete(df, ID, year, fill = list(num_pub = 0))
spread(df_complete, key = year, value = num_pub)
# A tibble: 2 x 11
ID `1957` `1958` `1959` `1977` `1978` `1983` `1984` `1990` `1994` `1996`
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 00002 0 0 0 1 2 1 4 3 1 2
2 00004 3 1 1 0 0 0 0 0 0 0
Data:
df <-
data.frame(ID = c("00002", "00002", "00002", "00002", "00002", "00002", "00002", "00004", "00004", "00004"),
num_pub = c(1, 2, 1, 4, 3, 1, 2, 3, 1, 1),
year = c(1977, 1978, 1983, 1984, 1990, 1994, 1996, 1957, 1958, 1959))

In base R this might be handled with a merge operation followed by some coercion to 0/1 by way of negating is.na and using as.numeric. (Admittedly the complete function appears easier.
temp <- merge(expand.grid(ID=sprintf("%05d", 2:4),year=1930:2018), tab, all=T)
str(temp)
#--------
'data.frame': 267 obs. of 3 variables:
$ ID : Factor w/ 3 levels "00002","00003",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 ...
$ num_pub: num NA NA NA NA NA NA NA NA NA NA ...
temp$any_pub <- as.numeric(!is.na(temp$num_pub))
head(temp)
ID year num_pub any_pub
1 00002 1930 NA 0
2 00002 1931 NA 0
3 00002 1932 NA 0
4 00002 1933 NA 0
5 00002 1934 NA 0
6 00002 1935 NA 0
tapply(temp$any_pub, temp$ID,sum)
#
00002 00003 00004
7 0 3

Related

Create dummy variables for every unique value in a column based on a condition from a second column in R

I have a dataframe that looks kind of like this with many more rows and columns:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
> df
country year leader natural.death gdp.growth.rate
1 Australia 1945 David 0 1
2 Australia 1946 NA NA 4
3 Australia 1947 NA NA 3
4 Angola 1945 NA NA 5
5 Angola 1946 Henry 1 6
6 Angola 1947 NA NA 1
7 US 1945 Tom 1 5
8 US 1946 NA NA 7
9 US 1947 Chris 0 9
I am trying to add x number of new columns, where x corresponds to the number of unique leaders (column leader) satisfying the condition of leader being dead (natural.death==1). In the case of this df, I would expect to get 2 new columns for Henry and Tom, with values of 0,0,0,0,1,0,0,0,0 and 0,0,0,0,0,0,1,0,0, respectively. I would preferably have the two new columns called id1 and id2 according to the order of data presented in natural.death. I need to create 69 new columns as there 69 leaders who died, so I am looking for a non-manual method to deal with this.
I already tried loops, if, for, unique, mtabulate, dcast, dummies but I could not get anything work unfortunately.
I am hoping to get:
> df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9),
+ id1=c(0,0,0,0,1,0,0,0,0),id2=c(0,0,0,0,0,0,1,0,0))
> df
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
Here is a crude way to do this
df <- data.frame(country = c ("Australia","Australia","Australia","Angola","Angola","Angola","US","US","US"), year=c("1945","1946","1947"), leader = c("David", "NA", "NA", "NA","Henry","NA","Tom","NA","Chris"), natural.death = c(0,NA,NA,NA,1,NA,1,NA,0),gdp.growth.rate=c(1,4,3,5,6,1,5,7,9))
tmp=which(df$natural.death==1) #index of deaths
lng=length(tmp) #number of deaths
#create matrix with zeros and lng columns, append to df
df=cbind(df,data.frame(matrix(0,nrow=nrow(df),ncol=lng)))
#change the newly added column names
colnames(df)[(ncol(df)-lng+1):ncol(df)]=paste0("id",1:lng)
for (i in 1:lng) { #loop over new columns
df[tmp[i],paste0("id",i)]=1 #at index i of death and column id+i set df to 1
}
country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0
And an approach with tidyverse.
library(tidyverse)
df %>%
mutate(id = ifelse(natural.death == 1, 1, 0),
id = ifelse(is.na(id), 0, id),
tmp = cumsum(id)) %>%
pivot_wider(names_prefix = "id",
names_from = tmp,
values_from = id,
values_fill = list(id = 0)) %>%
select(-id0)
country year leader natural.death gdp.growth.rate id1 id2
<fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Australia 1945 David 0 1 0 0
2 Australia 1946 NA NA 4 0 0
3 Australia 1947 NA NA 3 0 0
4 Angola 1945 NA NA 5 0 0
5 Angola 1946 Henry 1 6 1 0
6 Angola 1947 NA NA 1 0 0
7 US 1945 Tom 1 5 0 1
8 US 1946 NA NA 7 0 0
9 US 1947 Chris 0 9 0 0

Spread valued column into binary 'time series' in R

I'm attempting to spread a valued column first into a set of binary columns and then gather them again in a 'time series' format.
By way of example, consider locations that have been conquered at certain times, with data that looks like this:
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
locationID conquered_in
1 1 1931
2 2 1932
3 3 1929
I'm attempting to reshape the data to look like this:
df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3), year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932), conquered = c(0,0,1,1,0,0,0,0,1,1,1,1))
locationID year conquered
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 0
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
My original strategy was to spread on conquered and then attempt a gather. This answer seemed close, but I can't seem to get it right with fill, since I'm trying to populate the later years with 1's also.
You can use complete() to expand the data frame and then use cumsum() when conquered equals 1 to fill the grouped data downwards.
library(tidyr)
library(dplyr)
df1 %>%
mutate(conquered = 1) %>%
complete(locationID, conquered_in = seq(min(conquered_in), max(conquered_in)), fill = list(conquered = 0)) %>%
group_by(locationID) %>%
mutate(conquered = cumsum(conquered == 1))
# A tibble: 12 x 3
# Groups: locationID [3]
locationID conquered_in conquered
<dbl> <dbl> <int>
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
Using complete from tidyr would be better choice. Though we need to aware that the conquered year may not fully cover all the year from beginning to end of the war.
library(dplyr)
library(tidyr)
library(magrittr)
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
# A data frame full of all year you want to cover
df2 <- data.frame(year=seq(1929, 1940, by=1))
# Create a data frame full of combination of year and location + conquered data
df3 <- full_join(df2, df1, by=c("year"="conquered_in")) %>%
mutate(conquered=if_else(!is.na(locationID), 1, 0)) %>%
complete(year, locationID) %>%
arrange(locationID) %>%
filter(!is.na(locationID))
# calculate conquered depend on the first year it get conquered - using group by location
df3 %<>%
group_by(locationID) %>%
# year 2000 in the min just for case if you have location that never conquered
mutate(conquered=if_else(year>=min(2000, year[conquered==1], na.rm=T), 1, 0)) %>%
ungroup()
df3 %>% filter(year<=1932)
# A tibble: 12 x 3
year locationID conquered
<dbl> <dbl> <dbl>
1 1929 1 0
2 1930 1 0
3 1931 1 1
4 1932 1 1
5 1929 2 0
6 1930 2 0
7 1931 2 0
8 1932 2 1
9 1929 3 1
10 1930 3 1
11 1931 3 1
12 1932 3 1

How to find observations whose dummy variable changes from 1 to 0 (and not viceversa) in a df in r

I have a survey composed of n individuals; each individual is present more than one time in the survey (panel). I have a variable pens, which is a dummy that takes value 1 if the individual invests in a complementary pension form. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know if there are individuals that invested in a complementary pension form in year t but didn't hold the complementary pension form in year t+2 (the survey is conducted every two years). In this way I want to know how many person had a complementary pension form but released it before pension or gave up (for example for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
and actually I have the individuals whose pens variable had changed during time (the command check if a variable is constant in time). For this reason I find individuals whose pens variable changed from 0 (didn't have complementary pension) in year t to 1 in year t+2 and viceversa; but I am interested in individuals whose pens variable was 1 (had a complementary pensione) in year t and 0 in year t+2.
If I use this command with the df I get that for id 1 and 2 the variable x is 0 (pens variable isn't constant), but I'd need to find a way to get just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
which(df$x=="0")
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of semplicity I omitted other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the id's for which there is a decline in pens and 0 otherwise.
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
giving:
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even even if there are more than 2 rows per id but if we knew there were always 2 rows then we could omit the any simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add date and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.

How to create timeseries by grouping entries in R?

I want to create a time series from 01/01/2004 until 31/12/2010 of daily mortality data in R. The raw data that I have now (.csv file), has as columns day - month - year and every row is a death case. So if the mortality on a certain day is for example equal to four, there are four rows with that date. If there is no death case reported on a specific day, that day is omitted in the dataset.
What I need is a time-series with 2557 rows (from 01/01/2004 until 31/12/2010) wherein the total number of death cases per day is listed. If there is no death case on a certain day, I still need that day to be in the list with a "0" assigned to it.
Does anyone know how to do this?
Thanks,
Gosia
Example of the raw data:
day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004
What I need:
day month year deaths
1 1 2004 1
2 1 2004 0
3 1 2004 3
4 1 2004 0
5 1 2004 0
6 1 2004 1
df <- read.table(text="day month year
1 1 2004
3 1 2004
3 1 2004
3 1 2004
6 1 2004
7 1 2004",header=TRUE)
#transform to dates
dates <- as.Date(with(df,paste(year,month,day,sep="-")))
#contingency table
tab <- as.data.frame(table(dates))
names(tab)[2] <- "deaths"
tab$dates <- as.Date(tab$dates)
#sequence of dates
res <- data.frame(dates=seq(from=min(dates),to=max(dates),by="1 day"))
#merge
res <- merge(res,tab,by="dates",all.x=TRUE)
res[is.na(res$deaths),"deaths"] <- 0
res
# dates deaths
#1 2004-01-01 1
#2 2004-01-02 0
#3 2004-01-03 3
#4 2004-01-04 0
#5 2004-01-05 0
#6 2004-01-06 1
#7 2004-01-07 1

Resources