So I have a few really big data sets that I want to merge in the manner that I indicated below. And I tried to do it like this:
setwd(..)
myfiles = list.files(pattern="*.dta")
dflist <- lapply(myfiles, read.dta13)
merged.data.frame = Reduce(function(...) merge(..., all=T), dflist)
However, I keep running out of memory...
Is there another way I can accomplish this without running out of memory?
What the dataframes look like, and what I want the merged product to be:
Dataframe 1
Country Year Sector ID Var 1
Austria 2001 Construction lp 22
Austria 2001 Construction fp 23
Austria 2001 Manufact lp 12
Austria 2001 Manufact fp 43
Austria 2002 Construction lp 55
Austria 2002 Construction fp 34
Austria 2002 Manufact lp 16
Austria 2002 Manufact fp 76
Dataframe 2
Country Year Sector Type Var1 Var2
Austria 2001 Construction A 23 5
Austria 2001 Construction B 34 5
Austria 2001 Manufact A 98 4
Austria 2001 Manufact B 48 3
Austria 2002 Construction A 43 9
Austria 2002 Construction B 23 7
Austria 2002 Manufact A 65 6
Austria 2002 Manufact B 45 6
Dataframe 3
Country Year Sector Var3
Austria 2001 Construction 123
Austria 2001 Acco 345
Austria 2001 Manufact 234
Austria 2001 Prod 466
Austria 2002 Construction 785
Austria 2002 Acco 789
Austria 2002 Manufact 678
Austria 2002 Prod 899
Merged:
Country Year Sector ID Type Var1 Var2 Var3
Austria 2001 Construction lp NA 22 NA NA
Austria 2001 Construction fp NA 23 NA NA
Austria 2001 Construction NA A 23 5 NA
Austria 2001 Construction NA B 34 5 NA
Austria 2001 Construction NA NA NA NA 123
Austria 2001 Manufact lp NA 12 NA NA
Austria 2001 Manufact fp NA 43 NA NA
Austria 2001 Manufact NA A 98 4 NA
Austria 2001 Manufact NA B 48 3 NA
Austria 2001 Manufact NA NA NA NA 234
Austria 2001 Acco NA NA NA NA 345
Austria 2001 Prod NA NA NA NA 466
.... 2002 ....
and so on.
Related
I have a data set that contains a column with country names, it looks like this:
> xx
# A tibble: 139 × 5
Country `2012_value` `2013_value` `2014_value` `2015_value`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Albania NA NA NA 35
2 Algeria 1 1 1 NA
3 Andorra NA NA NA 50
4 Antigua & Barbuda NA NA NA 98
5 Argentina NA NA NA 65
6 Armenia NA 44 46 46.5
7 Ascension NA NA NA 100
8 Austria NA NA NA 98
9 Azerbaijan NA NA 49 50
10 Bahamas NA NA NA 95
I would like to change the names of the countries according to a table I have:
> print(itu_emi_countries, n = 30)
# A tibble: 215 × 2
`ITU Name` `EMI Name`
<chr> <chr>
1 Afghanistan Afghanistan
2 Albania Albania
3 Algeria Algeria
4 American Samoa American Samoa
5 Andorra Andorra
6 Angola Angola
7 Antigua & Barbuda Antigua
8 Argentina Argentina
9 Armenia Armenia
10 Aruba Aruba
11 Australia Australia
12 Austria Austria
13 Azerbaijan Azerbaijan
14 Bahamas Bahamas
15 Bahrain Bahrain
16 Bangladesh Bangladesh
17 Barbados Barbados
18 Belarus Belarus
19 Belgium Belgium
20 Belize Belize
21 Benin Benin
22 Bermuda Bermuda
23 Bhutan Bhutan
24 Bolivia Bolivia
25 Bosnia and Herzegovina Bosnia-Herzegovina
26 Botswana Botswana
27 Brazil Brazil
28 Brunei Darussalam Brunei
29 Bulgaria Bulgaria
30 Burkina Faso Burkina Faso
# … with 185 more rows
The country names are written as in the first column, and I want to change them to the second column. How can I do this?
A simple rename and left_join does the trick:
library(tidyverse)
itu_emi_countries <- itu_emi_countries %>%
rename(Country = `ITU Name`)
left_join(xx, itu_emi_countries, by = "Country")
With dplyr, you could use recode() and pass a named vector indicating the relations between the old and new names.
library(dplyr)
xx %>%
mutate(Country = recode(Country, !!!tibble::deframe(itu_emi_countries)))
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
I am trying to replicate the same column values for the next 2 cells in the column using R.
I have a data-frame of the following form:
Time World Cate Data
1994 Africa A 12
1994 B 17
1994 C 22
1994 Asia A 55
1994 B 10
1994 C 58
1995 Africa A 62
1995 B 87
1995 C 12
1995 Asia A 59
1995 B 12
1995 C 38
and I want to convert it to the following form:
Time World Cate Data
1994 Africa A 12
1994 Africa B 17
1994 Africa C 22
1994 Asia A 55
1994 Asia B 10
1994 Asia C 58
1995 Africa A 62
1995 Africa B 87
1995 Africa C 12
1995 Asia A 59
1995 Asia B 12
1995 Asia C 38
Use fill from the tidyr package:
If your dataframe is called dat, then
dat <- tidyr::fill(dat, World)
Using na.locf function from library(zoo)
library(zoo)
na.locf(df)
Time World Cate Data
1 1994 Africa A 12
2 1994 Africa B 17
3 1994 Africa C 22
4 1994 Asia A 55
5 1994 Asia B 10
6 1994 Asia C 58
7 1995 Africa A 62
8 1995 Africa B 87
9 1995 Africa C 12
10 1995 Asia A 59
11 1995 Asia B 12
12 1995 Asia C 38
Code
dummy$World <- rep(dummy$World[(1:floor(dim(dummy)[1]/5))*5-4],each = 5)
dummy
I have two data frames. The first one looks like
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
The second one contains all the countries that are in the first data frame plus a few more countries for year 2018. It looks likes this
Country Year production
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I would like to merge the two data frames, and the final table should look like this:
Country Year production
Germany 1996 11
France 1996 12
Greece 1996 15
UK 1996 17
USA 1996 24
Austria 1996 NA
Japan 1996 NA
Germany 2018 27
France 2018 29
Greece 2018 44
UK 2018 46
USA 2018 99
Austria 2018 56
Japan 2018 66
I've tried several functions including full_join, merge, and rbind but they didn't work. Does anybody have any ideas?
With dplyr and tidyr, you may use:
bind_rows(df1, df2) %>%
complete(Country, Year)
Country Year production
<chr> <int> <int>
1 Austria 1996 NA
2 Austria 2018 56
3 France 1996 12
4 France 2018 29
5 Germany 1996 11
6 Germany 2018 27
7 Greece 1996 15
8 Greece 2018 44
9 Japan 1996 NA
10 Japan 2018 66
11 UK 1996 17
12 UK 2018 46
13 USA 1996 24
14 USA 2018 99
Consider base R with expand.grid and merge (and avoid any dependencies should you be a package author):
# BUILD DF OF ALL POSSIBLE COMBINATIONS OF COUNTRY AND YEAR
all_country_years <- expand.grid(Country=unique(c(df_96$Country, df_18$Country)),
Year=c(1996, 2018))
# MERGE (LEFT JOIN)
final_df <- merge(all_country_years, rbind(df_96, df_18), by=c("Country", "Year"),
all.x=TRUE)
# ORDER DATA AND RESET ROW NAMES
final_df <- data.frame(with(final_df, final_df[order(Year, Country),]),
row.names = NULL)
final_df
# Country Year production
# 1 Germany 1996 11
# 2 France 1996 12
# 3 Greece 1996 15
# 4 UK 1996 17
# 5 USA 1996 24
# 6 Austria 1996 NA
# 7 Japan 1996 NA
# 8 Germany 2018 27
# 9 France 2018 29
# 10 Greece 2018 44
# 11 UK 2018 46
# 12 USA 2018 99
# 13 Austria 2018 56
# 14 Japan 2018 66
Demo
This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 3 years ago.
I have a dataframe like this:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA NA
145 4 2005 NA NA NA
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA NA
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
fcmstat: type of first marital status change
secmstat: type of second marital status change
first marital status, for ID 4(19), fsmstat was changed in 2003(2002) and second marital status secmstat was changed in 2006(2004). So, for ID 4, in 2004 and 2005 marital status was same as fcmstat of 2003 and for ID 19, 2003's mstat should be same as fcmstat of 2002.
I want to fill in t he last column as follows:
ID year fcmstat secmstat mstat
138 4 1998 NA NA 1
139 4 1999 NA NA 1
140 4 2000 NA NA 1
141 4 2001 NA NA 1
142 4 2002 NA NA 1
143 4 2003 2 NA 2
144 4 2004 NA NA 2
145 4 2005 NA NA 2
146 4 2006 NA 3 3
147 4 2007 NA NA NA
375 19 2001 NA NA 2
376 19 2002 6 NA 6
377 19 2003 NA NA 6
378 19 2004 NA 5 5
379 19 2005 NA NA NA
380 19 2006 NA NA 1
Also, before any first change, the mstatshould be same as before. Consider the following case.
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA NA
1177 61 1984 NA NA NA
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
the first change was in 1985. So, the missing mstat in 1984 and 1983 should be same as mstat of 1982. SO for this case, my desired output is:
ID year fcmstat secmstat mstat
1171 61 1978 NA NA 0
1172 61 1979 NA NA 0
1173 61 1980 NA NA 0
1174 61 1981 NA NA 0
1175 61 1982 NA NA 0
1176 61 1983 NA NA 0
1177 61 1984 NA NA 0
1178 61 1985 1 NA 1
1179 61 1986 NA NA 1
1180 61 1987 NA NA 1
As suggested by Schilker the code df$mstat_updated<-na.locf(df$mstat) gives the following:
ID year fcmstat secmstat mstat mstat_updated
138 4 1998 NA NA 1 1
139 4 1999 NA NA 1 1
140 4 2000 NA NA 1 1
141 4 2001 NA NA 1 1
142 4 2002 NA NA 1 1
143 4 2003 2 NA 2 2
144 4 2004 NA NA NA 2
145 4 2005 NA NA NA 2
146 4 2006 NA 3 3 3
147 4 2007 NA NA NA 3
148 4 2008 NA NA NA 3
However, I do want to fill in mstat for 2004 and 2005 but not in 2007 and 2008. I want to fill in NA's only between first marstat change, fcmstat and second marstat, secmstat change.
As I mentioned in my comment this a duplicate of here
library(zoo)
df<-data.frame(ID=c('4','4','4','4'),
year=c(2003,2004,2005,2006),
mstat=c(2,NA,NA,3))
df$mstat<-na.locf(df$mstat)
I want to subtract the year in which the respondents were born (variables containing yrbrn) by the variable for year of the interview (inwyys) and save the results as new variables in the data frame.
Head of the data frame:
inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA
Can someone help me with that?
Thank you very much!
This can be done by sub-setting (x[,-1]..take everything but not the first column, x[,1]..take the first column) your data and make the subtraction. With cbind you can bind the new result to the original data.
cbind(x, x[,-1] - x[,1])
# inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8 yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
#1 2012 1949 1955 NA NA NA NA NA -63 -57 NA NA NA NA NA
#2 2012 1983 1951 1956 1989 1995 2003 2005 -29 -61 -56 -23 -17 -9 -7
#3 2012 1946 1946 1978 NA NA NA NA -66 -66 -34 NA NA NA NA
#4 2013 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#5 2013 1953 1959 1980 1985 1991 2008 2011 -60 -54 -33 -28 -22 -5 -2
#6 2013 1938 NA NA NA NA NA NA -75 NA NA NA NA NA NA
Data:
x <- read.table(header=TRUE, text=" inwyys yrbrn2 yrbrn3 yrbrn4 yrbrn5 yrbrn6 yrbrn7 yrbrn8
1 2012 1949 1955 NA NA NA NA NA
2 2012 1983 1951 1956 1989 1995 2003 2005
3 2012 1946 1946 1978 NA NA NA NA
4 2013 NA NA NA NA NA NA NA
5 2013 1953 1959 1980 1985 1991 2008 2011
6 2013 1938 NA NA NA NA NA NA")
I believe the following is what you are looking for
data$newvar1<-data$yrbrn2-data$inwyys
But replace "data" with the name of your data set. If you want to do it for each yrbrn column, just change "newvar1" to "newvar2" etc so you do not override your previous calculations