Automating finding and converting values in R

I have a sample dataset with 45 rows, shown below.
itemid title release_date
16 573 Body Snatchers 1993
17 670 Body Snatchers 1993
41 1645 Butcher Boy, The 1998
42 1650 Butcher Boy, The 1998
1 218 Cape Fear 1991
18 673 Cape Fear 1962
27 1234 Chairman of the Board 1998
43 1654 Chairman of the Board 1998
2 246 Chasing Amy 1997
5 268 Chasing Amy 1997
11 309 Deceiver 1997
37 1606 Deceiver 1997
28 1256 Designated Mourner, The 1997
29 1257 Designated Mourner, The 1997
12 329 Desperate Measures 1998
13 348 Desperate Measures 1998
9 304 Fly Away Home 1996
15 500 Fly Away Home 1996
26 1175 Hugo Pool 1997
39 1617 Hugo Pool 1997
31 1395 Hurricane Streets 1998
38 1607 Hurricane Streets 1998
10 305 Ice Storm, The 1997
21 865 Ice Storm, The 1997
4 266 Kull the Conqueror 1997
19 680 Kull the Conqueror 1997
22 876 Money Talks 1997
24 881 Money Talks 1997
35 1477 Nightwatch 1997
40 1625 Nightwatch 1997
6 274 Sabrina 1995
14 486 Sabrina 1954
33 1442 Scarlet Letter, The 1995
36 1542 Scarlet Letter, The 1926
3 251 Shall We Dance? 1996
30 1286 Shall We Dance? 1937
32 1429 Sliding Doors 1998
45 1680 Sliding Doors 1998
20 711 Substance of Fire, The 1996
44 1658 Substance of Fire, The 1996
23 878 That Darn Cat! 1997
25 1003 That Darn Cat! 1997
34 1444 That Darn Cat! 1965
7 297 Ulee's Gold 1997
8 303 Ulee's Gold 1997
What I am trying to do is convert the itemid based on the movie title whenever the release date is also the same. For example, the movie 'Ulee's Gold' has two itemids, 297 and 303. I want to automate checking the release date: if it matches, the second itemid for that movie should be replaced with the first. For now I have done this manually, by extracting the itemids into two vectors x and y and then replacing them in a loop. I would like to know if there is a better way, because while this sample has only 18 movies with multiple ids, the full dataset has a few hundred, and finding and processing them manually would be very time-consuming.
I am providing the code that I have used to get this task done.
x <- c(670, 1650, 1654, 268, 1606, 1257, 348, 500, 1617, 1607, 865, 680, 881, 1625, 1680, 1658, 1003, 303)
y <- c(573, 1645, 1234, 246, 309, 1256, 329, 304, 1175, 1395, 305, 266, 876, 1477, 1429, 711, 878, 297)
for (i in 1:18) {
  df$itemid[df$itemid == x[i]] <- y[i]  # match on the itemid value, not the row position
}
Is there a better way to get this done?

I think you can do it in dplyr straightforwardly:
Using your comment above, a brief example:
itemid <- c(878,1003,1444,297,303)
title <- c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2))
year <- c(1997,1997,1965,1997,1997)
temp <- data.frame(itemid,title,year)
temp
library(dplyr)
temp %>% group_by(title,year) %>% mutate(itemid1 = min(itemid))
(I renamed 'release_date' to 'year' for brevity.) This groups the rows by title and year, finds the minimum itemid within each group, and the mutate creates a new variable holding that lowest itemid.
which gives:
# itemid title year itemid1
#1 878 That Darn Cat! 1997 878
#2 1003 That Darn Cat! 1997 878
#3 1444 That Darn Cat! 1965 1444
#4 297 Ulee's Gold 1997 297
#5 303 Ulee's Gold 1997 297
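To apply this to the full dataset without a helper column, itemid can be overwritten directly inside mutate(). A minimal sketch, using a small stand-in for the 45-row data frame (df and the column names follow the question):

```r
library(dplyr)

# Small stand-in for the 45-row data frame in the question.
df <- data.frame(
  itemid = c(878, 1003, 1444, 297, 303),
  title = c(rep("That Darn Cat!", 3), rep("Ulee's Gold", 2)),
  release_date = c(1997, 1997, 1965, 1997, 1997)
)

# Within each (title, release_date) group, replace every itemid with the
# smallest one, collapsing duplicate ids to a single id.
df <- df %>%
  group_by(title, release_date) %>%
  mutate(itemid = min(itemid)) %>%
  ungroup()
```

Rows sharing a title and release date now share one itemid, while titles with different release dates (like the 1965 That Darn Cat!) are left alone. This matches the manual mapping in the question, where each replacement id is the smaller of the pair.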

Related

Create a new variable in data frame that contains the sum of the values of all other groups

I have data similar to this
example_data <- data.frame(
  company = c(rep("A", 6), rep("B", 6), rep("C", 6)),
  year = rep(c(rep(2019, 3), rep(2020, 3)), 3),
  country = rep(c("Australia", "Tanzania", "Nepal"), 3),
  sales = sample(1000:2000, 18),
  employees = sample(100:200, 18),
  profit = sample(500:1000, 18)
)
which when printed out looks like this:
> example_data
company year country sales employees profit
1 A 2019 Australia 1815 138 986
2 A 2019 Tanzania 1183 126 907
3 A 2019 Nepal 1159 155 939
4 A 2020 Australia 1873 183 866
5 A 2020 Tanzania 1858 198 579
6 A 2020 Nepal 1841 184 601
7 B 2019 Australia 1989 160 595
8 B 2019 Tanzania 1162 151 520
9 B 2019 Nepal 1470 187 670
10 B 2020 Australia 1013 128 945
11 B 2020 Tanzania 1718 123 886
12 B 2020 Nepal 1135 149 778
13 C 2019 Australia 1846 188 755
14 C 2019 Tanzania 1445 194 916
15 C 2019 Nepal 1029 145 903
16 C 2020 Australia 1737 161 578
17 C 2020 Tanzania 1489 141 859
18 C 2020 Nepal 1350 167 536
The unit of observation for the three variables of interest sales, employees, profit is a unique combination of company, year, and country.
What I need is a column in the data frame for each of these three variables, named other_sales, other_employees, and other_profit. (In my actual data, I have not three but closer to 40 such variables of interest.) These should be the sum over the other companies in that year and country for that variable. So, for instance, example_data$other_sales[1] should be the sum of the two values 1989 and 1846, which are the sales for company B and company C, respectively, in that year and country.
I am familiar with dplyr::group_by() and dplyr::mutate(), but I struggle to come up with a way to solve this problem. What I would like to do is something like this:
library(dplyr)
example_data %>%
  group_by(company, year, country) %>%
  mutate(other_sales = sum(
    example_data %>% filter(company != "this") %>% .$sales
  ))
# "this" should be the value of 'company' in the current group
Obviously, this code doesn't work. Even if it did, it would not accomplish the goal of creating these other_* variables automatically for every specified column in the data frame. I've thought about creating a complicated for loop, but I figured before I go down that most likely wrong route, it's better to ask here. Finally, while it would be possible to construct a solution based purely on column indices (i.e., for example_data[1,7] calculate the sum of [7,4] and [13,4]), this would not work in my real data because the number of observations per company can differ.
EDIT: small correction in the code
--- SOLUTION ---
Based on the comment under this question, I was able to figure out a solution that solves both issues in the question:
example_data %>%
  group_by(year, country) %>%
  mutate(across(sales:profit, .names = "other_{.col}", function(x) sum(x) - x))
I think this will solve your problem
example_data %>%
group_by(country,year) %>%
mutate(other_sales = sum(sales)- sales)
To generalise it for all variables, i.e. sales, profit and employees:
(arrange is not necessary, but helps when checking.)
library(tidyverse)
set.seed(123)
example_data <- data.frame(
  company = c(rep("A", 6), rep("B", 6), rep("C", 6)),
  year = rep(c(rep(2019, 3), rep(2020, 3)), 3),
  country = rep(c("Australia", "Tanzania", "Nepal"), 3),
  sales = sample(1000:2000, 18),
  employees = sample(100:200, 18),
  profit = sample(500:1000, 18)
)
example_data |>
  arrange(country, year, company) |> # Optional
  group_by(country, year) |>
  mutate(across(sales:profit, ~ sum(.) - ., .names = "other_{.col}"))
#> # A tibble: 18 × 9
#> # Groups: country, year [6]
#> company year country sales employees profit other_sales other_em…¹ other…²
#> <chr> <dbl> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2019 Australia 1414 190 989 3190 302 1515
#> 2 B 2019 Australia 1817 125 522 2787 367 1982
#> 3 C 2019 Australia 1373 177 993 3231 315 1511
#> 4 A 2020 Australia 1525 108 892 2830 372 1524
#> 5 B 2020 Australia 1228 197 808 3127 283 1608
#> 6 C 2020 Australia 1602 175 716 2753 305 1700
#> 7 A 2019 Nepal 1178 191 762 2899 283 1608
#> 8 B 2019 Nepal 1298 141 943 2779 333 1427
#> 9 C 2019 Nepal 1601 142 665 2476 332 1705
#> 10 A 2020 Nepal 1937 171 829 2721 266 1967
#> 11 B 2020 Nepal 1013 135 991 3645 302 1805
#> 12 C 2020 Nepal 1708 131 976 2950 306 1820
#> 13 A 2019 Tanzania 1462 156 608 2781 286 1633
#> 14 B 2019 Tanzania 1117 106 910 3126 336 1331
#> 15 C 2019 Tanzania 1664 180 723 2579 262 1518
#> 16 A 2020 Tanzania 1194 192 924 3010 296 1423
#> 17 B 2020 Tanzania 1243 182 634 2961 306 1713
#> 18 C 2020 Tanzania 1767 114 789 2437 374 1558
#> # … with abbreviated variable names ¹​other_employees, ²​other_profit
Created on 2022-12-08 with reprex v2.0.2
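For completeness, the same sum-minus-self trick also works in base R with ave(), no packages required. A sketch, rebuilding a small version of example_data:

```r
# Recreate example_data in base R (same shape as in the question).
set.seed(123)
example_data <- data.frame(
  company = rep(c("A", "B", "C"), each = 6),
  year = rep(rep(c(2019, 2020), each = 3), 3),
  country = rep(c("Australia", "Tanzania", "Nepal"), 6),
  sales = sample(1000:2000, 18),
  employees = sample(100:200, 18),
  profit = sample(500:1000, 18)
)

# ave() returns the group sum for every row; subtracting the row's own
# value leaves the sum over the *other* companies in that year/country.
for (v in c("sales", "employees", "profit")) {
  example_data[[paste0("other_", v)]] <-
    ave(example_data[[v]], example_data$year, example_data$country, FUN = sum) -
    example_data[[v]]
}
```

The loop over column names plays the role of across(), so this scales to the ~40 variables mentioned in the question by extending the character vector.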

R: Substituting missing values (NAs) with two different values

I might be overcomplicating things - would love to know if there is an easier way to solve this. I have a data frame (df) with 5654 observations - 1332 are foreign-born and 4322 are Canada-born subjects.
The variable df$YR_IMM captures: "In what year did you come to live in Canada?"
See the following distribution of observations per immigration year table(df$YR_IMM) :
1920 1926 1928 1930 1939 1942 1944 1946 1947 1948 1949 1950 1951 1952 1953 1954
2 1 1 2 1 2 1 1 1 9 5 1 7 13 3 5
1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
10 5 8 6 6 1 5 1 6 3 7 16 18 12 15 13
1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986
10 17 8 18 25 16 15 12 16 27 13 16 11 9 17 16
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
24 21 31 36 26 30 26 24 22 30 29 26 47 52 53 28 9
Naturally these are only foreign-born individuals (mean = 1985); however, 348 foreign-born subjects are missing. There are a total of 4670 NAs, which also include the Canada-born subjects.
How can I code these df$YR_IMM NAs in such a way that
348 (NA) --> 1985
4322(NA) --> 100
Additionally, the birth status is given by df$Brthcoun, with 0 = "born in Canada" and 1 = "born outside of Canada".
Hope this makes sense - thank you!
EDIT: This was the solution ->
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
Try the below code:
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 0] <- 100
df$YR_IMM[is.na(df$YR_IMM) & df$Brthcoun == 1] <- 1985
I hope this helps!
Something like this should also work:
df$YR_IMM <- ifelse(is.na(df$YR_IMM),
                    ifelse(df$Brthcoun == 0, 100, 1985),
                    df$YR_IMM)
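If dplyr is available, the two rules can also be written as a single case_when(), which keeps non-missing years untouched by construction. A sketch on a small made-up df with the same column names:

```r
library(dplyr)

# Toy version of df: two Canada-born NAs, one foreign-born NA, one known year.
df <- data.frame(
  YR_IMM = c(NA, NA, NA, 1990),
  Brthcoun = c(0, 0, 1, 1)
)

# case_when evaluates the conditions in order; rows matching none of the
# first two rules fall through to TRUE and keep their original value.
df <- df %>%
  mutate(YR_IMM = case_when(
    is.na(YR_IMM) & Brthcoun == 0 ~ 100,
    is.na(YR_IMM) & Brthcoun == 1 ~ 1985,
    TRUE ~ YR_IMM
  ))
```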

Parallelize for loops that subset panel data by industry-year

I want to carry out an estimation procedure that uses data on all firms in a given sector, for a rolling window of 5 years.
I can do it easily in a loop, but since the estimation procedure takes quite a while, I would like to parallelize it. Is there any way to do this?
My data looks like this:
sale_log cogs_log ppegt_log m_naics4 naics_2 gvkey year
1 3.9070198 2.5146032 3.192821715 9.290151e-02 72 1001 1983
2 4.1028774 2.7375141 3.517861329 1.067687e-01 72 1001 1984
3 4.5909863 3.2106595 3.975112703 2.511660e-01 72 1001 1985
4 3.2560391 2.7867256 -0.763368555 1.351031e-02 44 1003 1982
5 3.2966287 2.8088799 -0.305698649 1.151525e-02 44 1003 1983
6 3.2636907 2.8330357 0.154036559 8.699394e-03 44 1003 1984
7 3.7916480 3.2346849 0.887916936 1.351803e-02 44 1003 1985
8 4.1778028 3.5364473 1.177985972 1.761273e-02 44 1003 1986
9 4.1819066 3.7297111 1.393016951 1.686331e-02 44 1003 1987
10 4.0174411 3.6050022 1.479584215 1.601205e-02 44 1003 1988
11 3.4466429 2.9633579 1.312863013 8.888067e-03 44 1003 1989
12 3.0667367 2.6128805 0.909779173 2.102674e-02 42 1004 1965
13 3.2362968 2.8140391 1.430690273 2.050934e-02 42 1004 1966
14 3.1981990 2.8822097 1.721614365 1.702929e-02 42 1004 1967
15 3.9265031 3.6159280 2.399823853 2.559074e-02 42 1004 1968
16 4.3343438 4.0116068 2.592692585 3.649313e-02 42 1004 1969
17 4.5869564 4.3059855 2.772196529 4.743631e-02 42 1004 1970
18 4.7015486 4.3995561 2.875267240 5.155589e-02 42 1004 1971
19 5.0564414 4.7539697 3.218686385 6.863808e-02 42 1004 1972
20 5.4323873 5.1711531 3.350849771 8.272720e-02 42 1004 1973
21 5.2979696 5.0033437 3.383504340 6.726429e-02 42 1004 1974
22 5.3958779 5.1475985 3.475121024 1.534230e-01 42 1004 1975
23 5.5442635 5.3195666 3.517557041 1.674937e-01 42 1004 1976
24 5.6260795 5.3909462 3.694842501 1.711362e-01 42 1004 1977
25 5.8039766 5.5455887 3.895724689 1.836405e-01 42 1004 1978
26 5.8198831 5.5665980 3.960153940 1.700499e-01 42 1004 1979
27 5.7474447 5.4697019 3.943733263 1.520660e-01 42 1004 1980
where gvkey is the firm id and naics are the industry codes.
The code I wrote:
library(dplyr)   # for select() and between()
library(prodest) # for prodestOP()

theta <- matrix(NA, 60, 23)
count <- 1
temp <- dat %>% select(
  sale_log, cogs_log, ppegt_log,
  m_naics4, naics_2, gvkey, year
)
for (i in 1960:2019) { # 5-year rolling sector-year specific production functions
  sub <- temp[between(temp$year, i - 5, i), ] # subset 5 years
  jcount <- 1
  for (j in sort(unique(sub$naics_2))) { # loop over sectors
    temp2 <- sub[sub$naics_2 == j, ]
    mdl <- prodestOP(
      Y = temp2$sale_log, fX = temp2$cogs_log, sX = temp2$ppegt_log,
      pX = temp2$cogs_log, cX = temp2$m_naics4, idvar = temp2$gvkey,
      timevar = temp2$year
    )
    theta[count, jcount] <- mdl@Model$FSbetas[2]
    jcount <- jcount + 1
  }
  count <- count + 1
}
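Since the (year, sector) estimations are independent of one another, one common pattern is to flatten the two nested loops into a single list of tasks and hand that list to the parallel package. The sketch below is an assumption-laden skeleton: estimate_theta() is a placeholder for the real prodestOP() call (which needs the prodest package and the actual panel), and the toy data merely mimics the shape of dat. The parallel structure is the point.

```r
library(parallel)

# Toy panel standing in for the real firm data.
set.seed(1)
dat <- expand.grid(year = 1960:1970, naics_2 = c(42, 44, 72), gvkey = 1:5)
dat$sale_log <- rnorm(nrow(dat))

# Placeholder for the real estimation (prodestOP in the question).
estimate_theta <- function(sub) mean(sub$sale_log)

# One row per (year, sector) combination: the whole double loop becomes
# a flat list of independent tasks that can be split across workers.
tasks <- expand.grid(i = 1965:1970, j = c(42, 44, 72))

cl <- makeCluster(2)  # socket cluster: portable, works on Windows too
clusterExport(cl, c("dat", "tasks", "estimate_theta"))
results <- parLapply(cl, seq_len(nrow(tasks)), function(k) {
  i <- tasks$i[k]; j <- tasks$j[k]
  sub <- dat[dat$year >= i - 5 & dat$year <= i & dat$naics_2 == j, ]
  data.frame(year = i, naics_2 = j, theta = estimate_theta(sub))
})
stopCluster(cl)

# Long format (year, sector, theta) instead of the original matrix,
# which is easier to merge back onto the panel afterwards.
theta_long <- do.call(rbind, results)
```

On Linux or macOS, parallel::mclapply() over the same task list is a lighter-weight alternative that avoids the clusterExport step.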

meta regression and bubble plot with metafor package in R

I am working on a meta-regression on the association of year and medication prevalence with 'metafor' package.
The model I used is 'rma.glmm' for mixed-effect model with logit transformed from 'metafor' package.
My R script is below:
dat<-escalc(xi=A, ni=Sample, measure="PLO")
print(dat)
model_A<-rma.glmm(xi=A, ni=Sample, measure="PLO", mods=~year)
print(model_A)
I did get significant results, so I wanted a bubble plot for this model. But I found there is no way to produce a bubble plot directly from the 'rma.glmm' fit, so I tried something alternative:
wi<-1/dat$vi
plot(year, transf.ilogit(dat$yi), cex=wi)
Apparently I got some 'crazy' results. My questions are:
1. How can I weight the points in the bubble plot by study sample size? The points should be proportional to study weight. Here I used wi <- 1/dat$vi, where vi is the sampling variance from escalc(), but it doesn't seem right.
2. Is my model correct for investigating the association between year and medication prevalence? With the 'rma' model I got totally different results.
3. Is there an alternative way to produce the bubble plot? I also tried:
percentage<-A/Sample
plot(year, percentage)
The database is below:
study year Sample A
study 1 2007 414 364
study 2 2010 142 99
study 3 1999 15 0
study 4 2000 17 0
study 5 2001 20 0
study 6 2002 22 5
study 7 2003 21 6
study 8 2004 24 7
study 9 1999 203 82
study 10 2009 647 436
study 11 2009 200 169
study 12 2010 156 128
study 13 2009 10753 6374
study 14 2007 143 109
study 15 2001 247 36
study 16 2004 318 184
study 17 2012 611 565
study 18 2013 180 167
study 19 2006 344 337
study 20 2007 209 103
study 21 2013 470 354
study 22 2010 180 146
study 23 2005 522 302
study 24 2000 62 30
study 25 2001 79 39
study 26 2002 85 43
study 27 2011 548 307
study 28 2009 218 216
study 29 2006 2901 2332
study 30 2008 464 259
study 31 2010 650 393
study 32 2008 2514 704
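On the bubble-size issue (question 1), one simple fix is to scale point sizes by the square root of sample size, so that bubble area tracks n, rather than by raw inverse variance, which can explode for the largest studies. A base-R sketch using the first eight studies from the table above (the scaling factor of 3 is an arbitrary choice for readability):

```r
# First eight studies from the table above.
year <- c(2007, 2010, 1999, 2000, 2001, 2002, 2003, 2004)
Sample <- c(414, 142, 15, 17, 20, 22, 21, 24)
A <- c(364, 99, 0, 0, 0, 5, 6, 7)

percentage <- A / Sample                       # observed prevalence
cex_scaled <- 3 * sqrt(Sample / max(Sample))   # largest study gets cex = 3

plot(year, percentage, cex = cex_scaled,
     xlab = "Year", ylab = "Prevalence")
```

For an 'rma' (rather than 'rma.glmm') fit, metafor's regplot() draws a weighted bubble plot of the moderator against the effect sizes directly, which may be worth comparing against this manual version.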

selecting observations based on a condition depended on a grouped variable

I have a question that I am hoping someone can help me answer. I have a dataset ordered by parasites and year that looks something like this (the actual dataset is much larger):
parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157
what I would like to do is, for every year, I want to select the 2 samples with the highest number of parasites, to give me an output that looks like this:
parasites year samples
1000 2000 11
910 2000 22
999 2002 64
910 2002 75
890 2004 29
810 2004 10
876 2005 120
750 2005 12
I am new to programming as a whole and still trying to find my way around R. Can someone please explain how I would go about this? Thanks so much.
How about with data.table:
parasites<-read.table(header=T,text="parasites year samples
1000 2000 11
910 2000 22
878 2000 13
999 2002 64
910 2002 75
710 2002 16
890 2004 29
810 2004 10
789 2004 9
876 2005 120
750 2005 12
624 2005 157")
EDIT - sorry, sorted by parasites, not samples
require(data.table)
data.table(parasites)[,.SD[order(-parasites)][1:2],by="year"]
Note .SD is the sub-table for each year value as set in by=
year parasites samples
1: 2000 1000 11
2: 2000 910 22
3: 2002 999 64
4: 2002 910 75
5: 2004 890 29
6: 2004 810 10
7: 2005 876 120
8: 2005 750 12
Here is a base-R solution (if you need it):
data <- data.frame(parasites = c(1000,910,878,999,910,710,890,810,789,876,750,624),
                   year = c(2000,2000,2000,2002,2002,2002,2004,2004,2004,2005,2005,2005),
                   samples = c(11,22,13,64,75,16,29,10,9,120,12,157))
data <- data[order(data$year, data$parasites),]   # sort by parasites within year
data_list <- lapply(unique(data$year), function(x) tail(data[data$year == x,], n = 2))
final_data <- do.call(rbind, data_list)
Hope that helps!
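For the record, the same selection is also a one-liner in dplyr (version 1.0 or later) with slice_max():

```r
library(dplyr)

data <- data.frame(
  parasites = c(1000, 910, 878, 999, 910, 710, 890, 810, 789, 876, 750, 624),
  year = c(2000, 2000, 2000, 2002, 2002, 2002, 2004, 2004, 2004, 2005, 2005, 2005),
  samples = c(11, 22, 13, 64, 75, 16, 29, 10, 9, 120, 12, 157)
)

# Within each year, keep the two rows with the largest parasite counts.
top2 <- data %>%
  group_by(year) %>%
  slice_max(parasites, n = 2, with_ties = FALSE) %>%
  ungroup()
```

with_ties = FALSE guarantees exactly two rows per year even when counts are tied; drop it if ties should all be kept.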
