Convert values using a conversion table in R

I am currently running statistical models on ACT and SAT scores. To help clean my data, I want to convert the ACT scores into their SAT equivalents. I found the following table online:
ACT SAT
<dbl> <dbl>
1 36 1590
2 35 1540
3 34 1500
4 33 1460
5 32 1430
6 31 1400
7 30 1370
8 29 1340
9 28 1310
10 27 1280
I want to replace the column ACT_Composite with the number in the SAT column of the conversion table. For instance, if one row displays an ACT_Composite score of 35, I want to input 1540.
If anyone has ideas on how to accomplish this, I would greatly appreciate it.

In base R you can use merge directly:
#Reading score table
df <- read.table(header = TRUE, text ="ACT SAT
36 1590
35 1540
34 1500
33 1460
32 1430
31 1400
30 1370
29 1340
28 1310
27 1280")
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.frame with 50 sample scores
df1 <- data.frame(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1 31 1400
2 31 1400
3 31 1400
4 31 1400
5 31 1400
6 36 1590
In data.table you can use merge:
library(data.table)
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.table with 50 sample scores
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
Alternatively, in data.table you can also try:
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
setDT(df) # convert the look-up table df into a data.table
result <- df[df1, on = c(ACT = "ACT_Composite")]
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
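If you only need to overwrite the ACT_Composite column with its SAT equivalent (rather than keep both columns), a plain match() lookup is a minimal alternative; this is a sketch assuming the df and df1 objects created above.
# Look up each ACT_Composite score in the conversion table and
# replace it with the corresponding SAT value
df1$ACT_Composite <- df$SAT[match(df1$ACT_Composite, df$ACT)]
head(df1)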

Related

Efficient data.table method to generate additional rows given random numbers

I have a large data.table for which I want to generate a random number (using two columns) and perform a calculation. Then I want to repeat this step 1,000 times. I am looking for a way to do this efficiently without a loop.
Example data:
> dt <- data.table(Group=c(rep("A",3),rep("B",3)),
Year=rep(2020:2022,2),
N=c(300,350,400,123,175,156),
Count=c(25,30,35,3,6,8),
Pop=c(1234,1543,1754,2500,2600,2400))
> dt
Group Year N Count Pop
1: A 2020 300 25 1234
2: A 2021 350 30 1543
3: A 2022 400 35 1754
4: B 2020 123 3 2500
5: B 2021 175 6 2600
6: B 2022 156 8 2400
> dt[, rate := rpois(.N, lambda=Count)/Pop*100000]
> dt[, value := N*(rate/100000)]
> dt
Group Year N Count Pop rate value
1: A 2020 300 25 1234 1944.8947 5.8346840
2: A 2021 350 30 1543 2009.0732 7.0317563
3: A 2022 400 35 1754 1938.4265 7.7537058
4: B 2020 123 3 2500 120.0000 0.1476000
5: B 2021 175 6 2600 115.3846 0.2019231
6: B 2022 156 8 2400 416.6667 0.6500000
I want to be able to do this calculation for value 1,000 times, and keep all instances (with an indicator column for 1-1,000 indicating which run) without using a loop. Any suggestions?
Maybe you can try replicate as below:
n <- 1000
dt[, paste0(c("rate", "value"), rep(1:n, each = 2)) :=
     replicate(n, list(u <- rpois(.N, lambda = Count) / Pop * 100000,
                       N * (u / 100000)))]
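Since the question also asks for an indicator column marking which of the 1,000 runs each row belongs to, here is a rough long-format sketch (assuming dt as defined above): it stacks n copies of dt, labels each copy with a run column, and draws all Poisson samples in one vectorised call, so there is still no explicit loop.
n <- 1000
# Stack n copies of dt and label each copy with a run indicator
long <- dt[rep(seq_len(nrow(dt)), times = n)]
long[, run := rep(seq_len(n), each = nrow(dt))]
# One vectorised draw over all rows, then the derived value
long[, rate := rpois(.N, lambda = Count) / Pop * 100000]
long[, value := N * (rate / 100000)]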

Using str_split to fill rows down data frame with number ranges and multiple numbers

I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
The following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split on the ,. After that, we can loop through FAOCODE using map and ~eval(parse(text = .x)) to evaluate the number ranges. Finally, we can use unnest to expand the data frame.
library(tidyverse)
dat2 <- dat %>%
  separate_rows(FAOCODE, sep = ",") %>%
  mutate(FAOCODE = map(FAOCODE, ~eval(parse(text = .x)))) %>%
  unnest(cols = FAOCODE)
dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
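For comparison, roughly the same expansion can be done in base R; this is a sketch assuming dat as defined above: split each FAOCODE string on commas, evaluate every piece so that ranges like 260:310 expand, and repeat each crop name once per resulting code.
# Split each FAOCODE string on "," and expand ranges such as "260:310"
codes <- lapply(strsplit(dat$FAOCODE, ","),
                function(x) unlist(lapply(x, function(p) eval(parse(text = p)))))
# Repeat each crop name once per code and rebuild the data frame
dat_long <- data.frame(SPAM_full_name = rep(dat$SPAM_full_name, lengths(codes)),
                       FAOCODE = unlist(codes))
head(dat_long)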

Match and replace value using 2 Data Frames (R)

I have 2 data frames. I need to match details$Name with info$Name and replace the corresponding values in details$Salary. The data frame details should retain all of its rows and there should be no NAs: if a match is found, replace the value; if not, leave it as it is.
details <- data.frame(Name = c("Aks","Bob","Caty","David","Enya","Fredrick","Gaby","Hema","Isac","Jaby","Katy"),
                      Age = c(12,22,33,43,24,67,41,19,25,24,32),
                      Gender = c("f","m","m","f","m","f","m","f","m","m","m"),
                      Salary = c(1500,2000,3.6,8500,1.2,1400,2300,2.5,5.2,2000,1265))
info <- data.frame(Name = c("caty","Enya","Dadi","Enta","Billu","Viku","situ","Hema","Ignu","Isac"),
                   income = c(2500,5600,3200,1522,2421,3121,4122,5211,1000,3500))
Expected result:
Name Age Gender Salary
Aks 12 f 1500
Bob 22 m 2000
Caty 33 m 2500
David 43 f 8500
Enya 24 m 5600
Fredrick 67 f 1400
Gaby 41 m 2300
Hema 19 f 5211
Isac 25 m 3500
Jaby 24 m 2000
Katy 32 m 1265
None of the following gives the expected result:
dplyr::left_join(details, info, by = "Name")
dplyr::right_join(details, info, by = "Name")
dplyr::inner_join(details, info, by = "Name") # for other match-and-replace tasks this works fine, but not here
dplyr::full_join(details, info, by = "Name")
All of them produce NAs. I also tried the match function, but it did not give the desired result either. Any help would be highly appreciated.
You have Name in different letter cases in the two data frames, so we first need to bring them to the same case, then do a left_join and use coalesce to select the first non-NA value between income and Salary.
library(dplyr)
details %>%
  mutate(Name = stringr::str_to_title(Name)) %>%
  left_join(info %>% mutate(Name = stringr::str_to_title(Name)), by = "Name") %>%
  mutate(Salary = coalesce(income, Salary)) %>%
  select(names(details))
# Name Age Gender Salary
#1 Aks 12 f 1500
#2 Bob 22 m 2000
#3 Caty 33 m 2500
#4 David 43 f 8500
#5 Enya 24 m 5600
#6 Fredrick 67 f 1400
#7 Gaby 41 m 2300
#8 Hema 19 f 5211
#9 Isac 25 m 3500
#10 Jaby 24 m 2000
#11 Katy 32 m 1265
A base R solution:
matches <- match(tolower(details$Name), tolower(info$Name))
found <- !is.na(matches)
details$Salary[found] <- info$income[matches[found]]
#Result
Name Age Gender Salary
1 Aks 12 f 1500
2 Bob 22 m 2000
3 Caty 33 m 2500
4 David 43 f 8500
5 Enya 24 m 5600
6 Fredrick 67 f 1400
7 Gaby 41 m 2300
8 Hema 19 f 5211
9 Isac 25 m 3500
10 Jaby 24 m 2000
11 Katy 32 m 1265
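The same match-and-replace can also be written as a data.table update join; this is a sketch assuming the details and info data frames defined above. The names are lower-cased into a temporary key on both sides, Salary is overwritten by reference only where a match exists, and unmatched rows keep their original values.
library(data.table)
setDT(details); setDT(info)
# Case-insensitive key on both sides
details[, key := tolower(Name)]
info[, key := tolower(Name)]
# Update join: only matching rows get income copied into Salary
details[info, Salary := i.income, on = "key"]
details[, key := NULL]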

Panel Data - sum by group and create new variable

I know there are already a lot of questions posed on "sum by group"; however, they do not solve my problem. Here it is:
df1 is my simplified data set
> df1 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
df2 is the desired result (see var2):
> df2 = data.table( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910","0910","0911","0913", "0914", "0910","0910","0911","1014","1012","1011","1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301),
var2= c(130,130,700,700,35,35,350,350,132,132,702,702) )
So I would like to calculate the sums of var1 grouped by ID and the first two digits of category.
That is, if the first two digits of category are 09 (or 10, and so on), assign to var2 the sum over the group formed by ID and those first two digits. Rows with the same ID and the same leading digits should then get the same sum.
I tried to achieve that with
> df1$var2 = rep(NA, rep(length(df1$ID)))
df1$var2 = ifelse(substr(df1$category,1,2)=="09", by(df1[Year==2009,]$var1, df1[Year==2009,]$ID,sum), df1$var2)
df1$Var2 = ifelse(substr(df1$category,1,2)=="10", by(df1[Year==2010,]$var1, df1[Year==2010,]$ID,sum), df1$var1)
But here the sums are not assigned to the correct item.
Could somebody help me out?
df1 = data.frame( Year = c(2009,2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010),
ID = c(1621, 1621, 1628,1628,3101, 3101,3105,3105,1621, 1621, 1628,1628 ),
category= c("0910",NA,"0911","0913", "0914", "0910","0910",NA,"1014","1012",NA,"1013"),
var1 = c(60,70, 400,300,15,20, 200,150,61,71,401,301) )
I added NA values to the OP's original dataframe to reflect the full specification they described.
df1$category_sub = substr(df1$category, 1, 2)
df1_aggre = aggregate(var1 ~ ID + category_sub, data = df1, sum)
names(df1_aggre)[3] = "var2"
df2 = merge(df1, df1_aggre, all=TRUE)
df2[order(df2$Year),]
Result:
> df2[order(df2$Year),]
ID category_sub Year category var1 var2
1 1621 09 2009 0910 60 60
4 1621 <NA> 2009 <NA> 70 NA
5 1628 09 2009 0911 400 700
6 1628 09 2009 0913 300 700
9 3101 09 2009 0914 15 35
10 3101 09 2009 0910 20 35
11 3105 09 2009 0910 200 200
12 3105 <NA> 2009 <NA> 150 NA
2 1621 10 2010 1014 61 132
3 1621 10 2010 1012 71 132
7 1628 10 2010 1013 301 301
8 1628 <NA> 2010 <NA> 401 NA
I first extracted the first two digits from category and grouped var1 by ID and category_sub. I then renamed var1 to var2 and merged df1 and df1_aggre by ID and category_sub with the all=TRUE option, which specifies a full outer join. The resulting dataframe was unsorted, so I sorted df2 by Year to get the desired result.
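Since the question's data already is a data.table, the grouped sum can also be added in place with a single grouped assignment; this is a sketch assuming df1 as the data.table defined at the top of the question (before the NA rows were added).
library(data.table)
# Sum var1 within each ID / first-two-digits-of-category group
df1[, var2 := sum(var1), by = .(ID, substr(category, 1, 2))]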

Performing a row by row chisq test on a data frame and capturing the result as a tibble

I have a data frame similar to this:
df1 <- data.frame(c(31,3447,12,1966,39,3275),
c(20,3460,10,1968,30,3284),
c(334,3146,212,1766,338,2976),
c(36,3442,35,1943,47,3267),
c(81,3399,71,1907,112,3202),
c(22,3458,22,1956,42,3272))
colnames(df1) <- c("Site1.C1","Site1.C2","Site2.C1","Site2.C2","Site3.C1","Site3.C2")
df1
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2
1 31 20 334 36 81 22
2 3447 3460 3146 3442 3399 3458
3 12 10 212 35 71 22
4 1966 1968 1766 1943 1907 1956
5 39 30 338 47 112 42
6 3275 3284 2976 3267 3202 3272
I am converting each row into a table and then performing a chisq test.
In order to get specific values from the chisq result (p value, parameter, statistic, expected, etc.), I'm having to repeat the chisq test several times over (in a very ugly and cumbersome way), using the following code:
df2 <- df1 %>% rowwise() %>% mutate(P=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$p.value,
df=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$parameter,
Site1.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[1,1],
Site1.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[1,2],
Site2.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[2,1],
Site2.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[2,2],
Site3.c1.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[3,1],
Site3.c2.exp=chisq.test(rbind(c(Site1.C1,Site1.C2),c(Site2.C1,Site2.C2),c(Site3.C1,Site3.C2)))$expected[3,2])
as.data.frame(df2)
Site1.C1 Site1.C2 Site2.C1 Site2.C2 Site3.C1 Site3.C2 P df Site1.c1.exp Site1.c2.exp Site2.c1.exp Site2.c2.exp Site3.c1.exp Site3.c2.exp
1 31 20 334 36 81 22 2.513166e-08 2 43.40840 7.591603 314.9237 55.07634 87.66794 15.33206
2 3447 3460 3146 3442 3399 3458 2.760225e-02 2 3391.05464 3515.945362 3234.4387 3353.56132 3366.50668 3490.49332
3 12 10 212 35 71 22 4.743725e-04 2 17.92818 4.071823 201.2845 45.71547 75.78729 17.21271
4 1966 1968 1766 1943 1907 1956 1.026376e-01 2 1928.02242 2005.977577 1817.7517 1891.24831 1893.22588 1969.77412
5 39 30 338 47 112 42 2.632225e-10 2 55.49507 13.504934 309.6464 75.35362 123.85855 30.14145
6 3275 3284 2976 3267 3202 3272 2.686389e-02 2 3216.55048 3342.449523 3061.5833 3181.41674 3174.86626 3299.13374
Is there a more elegant way to do chisq test just once and capture the result as a tibble in the same row and then extract values on a need-to basis into additional columns?
My data frame has over a million of rows and some additional variables not used with the Chisq test.
Thank you.
With input from @akrun, I was able to get the desired result using the following code:
df2 <- df1 %>%
  rowwise() %>%
  mutate(result = list(chisq.test(rbind(c(Site1.C1, Site1.C2),
                                        c(Site2.C1, Site2.C2),
                                        c(Site3.C1, Site3.C2)))))
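From that list column, individual pieces can then be pulled out only where needed; a rough sketch, assuming df2 as created above with each result entry holding one chisq.test ("htest") object:
library(dplyr)
library(purrr)
df3 <- df2 %>%
  ungroup() %>%                                  # drop the rowwise grouping
  mutate(P  = map_dbl(result, "p.value"),        # p value of each test
         df = map_dbl(result, "parameter"),      # degrees of freedom
         Site1.c1.exp = map_dbl(result, ~ .x$expected[1, 1]))  # one expected count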
