Creating new variables using different datasets - r

I'm trying to merge two datasets that have different structures. The first dataset contains indicators on a country level, per country and age class and looks like this:
country age_class nb_birth nb_singleton nb_twin prob_twin nb_under_5_dead nb_under_5_dead_singleton
1 AO7 15-19 28 28 0 0.000000000 2 2
2 AO7 20-24 1133 1123 10 0.008826125 107 101
3 AO7 25-29 3338 3256 82 0.024565608 327 302
4 AO7 30-34 4152 4059 93 0.022398844 425 402
5 AO7 35-39 4934 4784 150 0.030401297 545 509
6 AO7 40-44 4840 4647 193 0.039876033 726 660
The second dataset contains indicators at the individual level, one row per mother, and looks like this:
caseid weight country age age_class region region_type education wealth parity
1 00010001 02 1.086089 AO7 38 35-39 benguela rural no education poorest 7
2 00010002 02 1.086089 AO7 40 40-44 benguela rural no education poorest 6
3 00010002 03 1.086089 AO7 16 15-19 benguela rural no education poorest 1
4 00010003 02 1.086089 AO7 43 40-44 benguela rural primary poorest 8
5 00010004 02 1.086089 AO7 25 25-29 benguela rural no education poorest 6
6 00010006 01 1.086089 AO7 26 25-29 benguela rural no education poorest 4
As you can see, the variables "country" and "age class" are shared. What I'm trying to do is assign indicators from the country dataset to every row in the mother dataset. I would like to end up with something like this:
caseid weight country age age_class [...] nb_births nb_singleton nb_twin prob_twin
1 00010001 02 1.086089 AO7 38 35-39 [...] 4934 4784 150 0.0304
2 00010002 02 1.086089 AO7 40 40-44 [...] 4840 4647 193 0.0399
3 00010002 03 1.086089 AO7 16 15-19 [...] 28 28 0 0.0000
In the end, all mothers from the same country and age class will have the same values for variables from the country dataset.
I am working with the dplyr package and have played with the join functions and mutate function. But I can't figure it out.
Do you have any idea how to solve this issue?

If I understand your question correctly, the only thing you need is a dplyr::left_join. Here df1 is the country dataset and df2 is the mothers dataset.
left_join(df2, df1)
## Joining, by = c("country", "age_class")
## caseid weight country age age_class [...] nb_singleton nb_twin prob_twin nb_under_5_dead nb_under_5_dead_singleton
## 1 00010001 02 1.086089 AO7 38 35-39 [...] 4784 150 0.03040130 545 509
## 2 00010002 02 1.086089 AO7 40 40-44 [...] 4647 193 0.03987603 726 660
## 3 00010002 03 1.086089 AO7 16 15-19 [...] 28 0 0.00000000 2 2
## 4 00010003 02 1.086089 AO7 43 40-44 [...] 4647 193 0.03987603 726 660
## 5 00010004 02 1.086089 AO7 25 25-29 [...] 3256 82 0.02456561 327 302
## 6 00010006 01 1.086089 AO7 26 25-29 [...] 3256 82 0.02456561 327 302
dplyr::left_join keeps all the rows from the first data.frame and adds matching rows from the second data.frame.
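A minimal sketch of the same join on toy data, with the key columns spelled out through by instead of relying on the automatic match. The data frames country_df and mother_df are made-up stand-ins for illustration, not the asker's data.

```r
library(dplyr)

# Hypothetical stand-in for the country-level dataset
country_df <- data.frame(
  country   = c("AO7", "AO7"),
  age_class = c("15-19", "35-39"),
  nb_birth  = c(28, 4934),
  prob_twin = c(0, 0.0304)
)

# Hypothetical stand-in for the mother-level dataset
mother_df <- data.frame(
  caseid    = c("a", "b", "c"),
  country   = c("AO7", "AO7", "AO7"),
  age_class = c("35-39", "15-19", "35-39")
)

# Keep every mother row and attach the indicators of her country/age class
joined <- left_join(mother_df, country_df, by = c("country", "age_class"))
```

Naming the keys explicitly avoids the "Joining, by = ..." message and guards against an accidental join on some other column that happens to share a name.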

Related

Adding a column to a data frame with two different variables

I am sure this is a super easy answer but I am struggling with how to add a column with two different variables to my dataframe. Currently, this is what it looks like
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
Then I add a column with periods 1-5, repeated until the end of the data frame, using this code:
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
I also need to add sex (F, M). Rows 1-15 are female and rows 16-30 are male, so overall it should look like this:
vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F
We can use rep with the each argument, which repeats every element of the vector that many times before moving on to the next element:
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))
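To make the difference between the two calls concrete, here is a small base R sketch. The data frame toy is only an illustration and is not the SurJagPred object from the question.

```r
# times: repeat the whole vector; each: repeat every element in place
primary <- rep(1:5, times = 6)          # 1 2 3 4 5 1 2 3 4 5 ...
sex     <- rep(c("F", "M"), each = 15)  # fifteen "F" followed by fifteen "M"

# Both vectors have length 30, so they line up with a 30-row data frame
toy <- data.frame(Primary = primary, Sex = sex)
```

Either vector must end up with exactly as many elements as the data frame has rows, otherwise mutate() will raise an error.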

sum rows with a similar name in r

I have this table that has rows with similar or equal names, and I need to sum these rows into a single record. How can I do it using R? I tried a for loop, but it doesn't work.
for (i in length(df1$variante)) {
+ if (df1$variante == "INTERNACIONAL" | df1$variante == " INTERNAC" ){
+ rbind(df1$variante)
+ }
+
+
+ }
Warning message:
In if (df1$variante == "INTERNACIONAL" | df1$variante == " INTERNAC") { :
the condition has length > 1 and only the first element will be used
df_variante
x freq
1 259
2 INTERNACIONAL 844
3 GOLD 2164
4 GOLD EXECUTIVOS 2
5 GRAFITE 109
6 GRAFITE INTERNACIONA 231
7 INFINITE 546
8 INFINITE EXECUTIVOS 2
9 INTERNACIONAL 4660
10 NACIONAL 8390
11 NANQUIM 12
12 NANQUIM INTERNACIONA 57
13 PLATINUM 1407
14 AZUL 9
15 AZUL CARD 112
16 BLACK 775
17 GOLD 2872
18 IN INTERNAC 1
19 IN NACIONAL 7
20 INTERNACIONAL 6678
21 MC ELECTRONIC GOV BAHIA 5
22 NACIONAL 9692
23 PLATINUM 1383
24 TIGRE NACIONAL 207
25 TURISMO GOLD 337
26 TURISMO INTERNACION 528
27 TURISMO NACIONAL 841
28 TURISMO PLATINUM 90
29 TURISMO GOLD 322
30 TURISMO INTER 531
31 AMAZONIA GOLD 5
32 AMAZONIA INTERNACIONAL 14
33 AMAZONIA NACIONAL 19
34 EPIDEMIA CORINTHIANA PLATINUM 4
35 JCB UNICO 203
36 UNIVERSITARIO INTERNACION 92
37 UNIVERSITARIO INTER 262
If you want to sum freq over the rows where x contains, for example, 'INTER', you could do:
sum(df_variante$freq[grep(df_variante$x,pattern='INTER')])
Here you are just using grep to search the x column for the pattern 'INTER'
I noticed there are some rows with something like 'IN NACIONAL'.
In this case you could do:
idxs=unique(c(grep(df_variante$x,pattern='INTER'),grep(df_variante$x,pattern='NACIONAL'),....)) #do this for all patterns of interest
df_sub=df_variante[idxs,]
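BTW:
The for loop in your question is not working because for (i in length(x)) runs only once, with i equal to the length of x, and the if() condition is evaluated on the whole column, which is what the warning is about. To loop through the rows you want:
for(i in 1:nrow(df_variante)){....}
But that is only if you still want to do it that way.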
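A small sketch of the pattern-based sum on a toy subset of the table. It uses grepl(), which returns one TRUE/FALSE per row, as a swapped-in alternative to grep(); the five rows are copied from the question purely for illustration.

```r
# Toy subset of the table from the question
df_variante <- data.frame(
  x    = c("INTERNACIONAL", "GOLD", "IN INTERNAC", "TURISMO INTER", "NACIONAL"),
  freq = c(844, 2164, 1, 531, 8390)
)

# One logical value per row: TRUE where the name contains "INTER"
is_inter <- grepl("INTER", df_variante$x, fixed = TRUE)

# Sum the frequencies of the matching rows only
total_inter <- sum(df_variante$freq[is_inter])
```

Note that a plain search for 'NACIONAL' would also hit 'INTERNACIONAL', so the patterns need some care if the groups must not overlap.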

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood everything except pen_rate, which appears to be cum_chart divided by total. I downloaded the original data set, called it foo, and did the following. To create the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical result to a number (1 = open, 0 = closed). This should mean less typing. Since the dates are stored as character, I extracted the year using substr(); if you have a Date object, you need to do something else. Once you have the year, you group the data by it and count how many schools there are of each type using count(). This part is the equivalent of your aggregate() code. Then I converted the output to wide format with spread() and did the rest of the calculation as in your code. The final output differs from what you have in your question, but it was identical to the one I obtained by running your code. I hope this helps.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
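For the case where OpenDate is already a Date object, the same pipeline can be sketched as follows. This is only an illustration on made-up rows: the schools table is hypothetical, format() replaces substr() for the year, and pivot_wider() is swapped in for spread() (it assumes a reasonably recent tidyr). As in the code above, a closed school is subtracted in the year it opened.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per school
schools <- tibble(
  OpenDate   = as.Date(c("2000-09-01", "2000-09-01", "2000-09-01",
                         "2001-09-01", "2001-09-01",
                         "2002-09-01", "2002-09-01")),
  ClosedDate = as.Date(c(NA, NA, NA, NA, "2005-06-30", "2006-06-30", NA)),
  Charter    = c("Y", "N", "N", "Y", "N", "Y", "N")
)

result <- schools %>%
  # group is e.g. "Y_1" (open charter) or "N_0" (closed public school)
  mutate(group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
         year  = format(OpenDate, "%Y")) %>%
  count(year, group) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
  mutate(net_chart = Y_1 - Y_0,
         net_pub   = N_1 - N_0,
         cum_chart = cumsum(net_chart),
         cum_pub   = cumsum(net_pub),
         total     = cum_chart + cum_pub,
         pen_rate  = cum_chart / total)
```

The mutate() step assumes all four group columns exist after reshaping; on a subset where, say, no charter school has closed, Y_0 would be missing and would have to be added first.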

Equivalent of index - match in Excel to return greater than the lookup value

In R I need to perform a similar function to index-match in Excel which returns the value just greater than the look up value.
Data Set A
Country GNI2009
Ukraine 6604
Egypt 5937
Morocco 5307
Philippines 4707
Indonesia 4148
India 3677
Viet Nam 3180
Pakistan 2760
Nigeria 2699
Data Set B
GNI2004 s1 s2 s3 s4
6649 295 33 59 3
6021 260 30 50 3
5418 226 27 42 2
4846 193 23 35 2
4311 162 20 29 2
3813 134 16 23 1
3356 109 13 19 1
2976 89 10 15 1
2578 68 7 11 0
2248 51 5 8 0
2199 48 5 8 0
At the 2009 level GNI for each country (data set A) I would like to find out which GNI2004 is just greater than or equal to GNI2009 and then return the corresponding sales values (s1,s2...) at that row (data set B). I would like to repeat this for each and every Country-gni row for 2009 in table A.
For example: Nigeria with a GNI2009 of 2699 in data set A would return:
GNI2004 s1 s2 s3 s4
2976 89 10 15 1
In Excel I guess this would be something like INDEX and MATCH, where the match condition would be match(lookup value, lookup array, -1).
You could try data.table's rolling join, which is designed to achieve just that:
library(data.table) # V1.9.6+
indx <- setDT(DataB)[setDT(DataA), roll = -Inf, on = c(GNI2004 = "GNI2009"), which = TRUE]
DataA[, names(DataB) := DataB[indx]]
DataA
# Country GNI2009 GNI2004 s1 s2 s3 s4
# 1: Ukraine 6604 6649 295 33 59 3
# 2: Egypt 5937 6021 260 30 50 3
# 3: Morocco 5307 5418 226 27 42 2
# 4: Philippines 4707 4846 193 23 35 2
# 5: Indonesia 4148 4311 162 20 29 2
# 6: India 3677 3813 134 16 23 1
# 7: Viet Nam 3180 3356 109 13 19 1
# 8: Pakistan 2760 2976 89 10 15 1
# 9: Nigeria 2699 2976 89 10 15 1
The idea here is, for each value of GNI2009, to find the closest equal or larger value in GNI2004, get its row index, and subset DataB by it. Then we update DataA with the result.
See here for more information.
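For comparison, the same lookup can be sketched in base R without data.table. This is a plain loop over the lookup values rather than a rolling join, and the two small data frames are cut-down copies of the tables in the question.

```r
# Cut-down copies of the two tables from the question
DataA <- data.frame(Country = c("Pakistan", "Nigeria"),
                    GNI2009 = c(2760, 2699))
DataB <- data.frame(GNI2004 = c(2976, 2578, 2248),
                    s1      = c(89, 68, 51))

# For each GNI2009, the row of DataB holding the smallest GNI2004
# that is greater than or equal to it (NA when there is none)
idx <- sapply(DataA$GNI2009, function(g) {
  ok <- which(DataB$GNI2004 >= g)
  if (length(ok) == 0) NA_integer_ else ok[which.min(DataB$GNI2004[ok])]
})

# Attach the matched rows of DataB to DataA
result <- cbind(DataA, DataB[idx, ])
```

This does not require DataB to be sorted, but it scans DataB once per row of DataA, so the rolling join is likely the better choice for large tables.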

Create a column according to the levels of a vector

I have a data frame with a column (species) presenting 153 levels of a factor
> out80[1:10,1:3]
Species Plots100 Plots80
1 02 901 2091
2 03 921 2094
3 04 29 60
4 05 1255 2145
5 06 563 850
6 07 38 53
7 08S 102 144
8 09 897 1734
9 10 503 1084
10 11 134 334
What I would like to do is look up each level of the factor in a column (code) of another data frame (species.tab2) and simply create another column in out80 with the name associated with that level, taken from the column French_name:
> head(species.tab2[,1:3])
var code French_name
1 ESPAR 2 CHENE PEDONCULE
2 ESPAR 3 CHENE SESSILE
3 ESPAR 3 CHENE SESSILE
4 ESPAR 3 CHENE SESSILE
5 ESPAR 4 CHENE ROUGE
6 ESPAR 5 CHENE PUBESCENT
I have tried doing it with ifelse or with a loop but I can't get it to work.
So the result would be something like this:
Species Plots100 Plots80 Name
1 02 901 2091 CHENE PEDONCULE
2 03 921 2094 CHENE SESSILE
etc...
EDIT: Here are the levels:
> out80$Species
[1] 02 03 04 05 06 07 08S 09 10 11 12P 12V 13B 13C 13G 14 15P 15S 16
[20] 17C 17F 17O 18C 18D 18M 19 20G 20P 20X 21C 21M 21O 22C 22G 22M 22S 23A 23AB
[39] 23AF 23AM 23C 23F 23PA 23PC 23PD 23PF 23PM 23SO 23SS 24 25B 25C 25FD 25FR 25M 25R 25V
[58] 26E 26OC 27C 27N 28 29AF 29AI 29CM 29EN 29LI 29MA 29MI 31 32 33B 33G 33N 34 36
[77] 37 38AL 38AU 39 40 41 42 49AA 49AE 49AM 49BO 49BS 49C 49CA 49CS 49EA 49EV 49FL 49IA
[96] 49LN 49MB 49PC 49PL 49PM 49PS 49PT 49RA 49RC 49RP 49RT 49SN 49TF 49TG 51 52 53CA 53CO 53S
[115] 54 55 56 57A 57B 58 59 61 62 63 64 65 66 67 68CC 68CE 68CJ 68CL 68CM
[134] 68EO 68PC 68PM 68SC 68SV 68TG 68TH 69 69JC 69JO 70SB 70SC 70SE 71 72V 73 74H 74J 76
[153] 77
> species.tab2$code
[1] 2 3 3 3 4 5 5 5 6 6 6 7 08S 9 10 10 11 12P 12V
[20] 12V 13B 13C 13G 14 14 14 15P 15S 15S 16 17C 17F 17O 17O 18C 18C 18D 18D
[39] 18M 19 19 20G 20P 20X 21C 21M 21O 22C 22G 22G 22M 22S 23A 23A 23AB 23AF 23AM
[58] 23C 23F 23PA 23PA 23PC 23PD 23PF 23PM 23SO 24 25B 25C 25D 25E3 25FR 25M 25R 25V 26E
[77] 26E 26OC 27C 27N 28 29AI 29CM 29EN 29MA 29MI 29LI 31 32 33B 33G 33N 34 36 37
[96] 38AU 38AL 39 40 41 42 49AA 49AE 49AM 49BO 49BO 49BS 49C 49CA 49CS 49EA 49EV 49FL 49IA
[115] 49LN 49MB 49PC 49PL 49PM 49PS 49PT 49RA 49RC 49RP 49RT 49SN 49TF 49TG 51 52 53CA 53CO 53S
[134] 54 55 56 57A 57B 58 59 61 62 63 64 65 66 67 68CC 68CJ 68CL 68CM 68EO
[153] 68PC 68PM 68SC 68SV 68TG 68TH 69 69JC 69JO 70SB 70SC 70SE 71 72V 73 74H 74J 76 77
There are some repetitions in code because, for the same code, 2 or 3 different French names exist. For these I just want one of the names; it doesn't matter which one.
Thank you for your help.
Using merge, after creating a new column code in out80:
out80$code <- gsub('^0|S$','',out80$Species)
merge(out80,species.tab2)
code Species Plots100 Plots80 var French_name
1 2 02 901 2091 ESPAR CHENE PEDONCULE
2 3 03 921 2094 ESPAR CHENE SESSILE
3 3 03 921 2094 ESPAR CHENE SESSILE
4 3 03 921 2094 ESPAR CHENE SESSILE
5 4 04 29 60 ESPAR CHENE ROUGE
6 5 05 1255 2145 ESPAR CHENE PUBESCENT
EDIT
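code and Species don't match for levels 01, 02, ..., so I create a new column to match them. To strip the leading zero only from purely two-digit codes (leaving codes such as 08S untouched), use: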
gsub('^0([0-9])$','\\1',out80$Species)
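Since the question asks for exactly one name per code, a lookup with match() is one possible alternative to merge(), because match() returns only the first hit. The sketch below uses toy versions of the two tables; the name given for code 08S is a placeholder, not the real French name.

```r
# Toy versions of the two tables; the name for code 08S is a placeholder
out80 <- data.frame(
  Species  = c("02", "03", "08S"),
  Plots100 = c(901, 921, 102),
  stringsAsFactors = FALSE
)
species.tab2 <- data.frame(
  code        = c("2", "3", "3", "08S"),
  French_name = c("CHENE PEDONCULE", "CHENE SESSILE",
                  "CHENE SESSILE", "PLACEHOLDER NAME"),
  stringsAsFactors = FALSE
)

# Drop the leading zero only from purely two-digit codes ("02" -> "2")
key <- sub("^0([0-9])$", "\\1", out80$Species)

# match() gives the position of the first hit, so duplicated codes
# in species.tab2 do not create duplicated rows in out80
out80$Name <- species.tab2$French_name[match(key, species.tab2$code)]
```

Unlike merge(), this keeps out80 at its original number of rows and leaves NA in Name for any species without a matching code.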
A data.table solution:
require(data.table)
dt1 <- data.table(out80)
# positive look ahead
# match 0's at beginning followed by numbers
# if found, replace all beginning 0's with ""
dt1[, key := sub("^[0]+(?=[0-9]+$)", "", Species, perl=T)]
setkey(dt1, "key")
dt2 <- data.table(species.tab2)
dt2[, code := as.character(code)]
dt2[, key := sub("^[0]+(?=[0-9]+$)", "", code, perl=T)]
setkey(dt2, "key")
merge(dt1, dt2)
# key Species Plots100 Plots80 var code French_name
# 1: 2 02 901 2091 ESPAR 2 CHENE_PEDONCULE
# 2: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 3: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 4: 3 03 921 2094 ESPAR 3 CHENE_SESSILE
# 5: 4 04 29 60 ESPAR 4 CHENE_ROUGE
# 6: 5 05 1255 2145 ESPAR 5 CHENE_PUBESCENT
