R - boxplotting specific variables against one another

I am using the dative data frame within R, and I am trying to plot LengthOfRecipient against Modality for only the rows where PronomOfRec == 'nonpronominal'. I gathered the LengthOfRecipient values for those rows:
library('languageR')
lor.np = dative[dative$PronomOfRec == 'nonpronominal',]$LengthOfRecipient
I have tried nesting this subset function, and applied vectors, but I cannot figure out a way to then access the Modality column for only the items in lor.np and store it in mod.np, so that I can plot and analyze the data with:
boxplot(lor.np, mod.np)
I'm very new to R and the syntax is extremely confusing. Any help would be very appreciated. Thanks in advance!

It might be easier to select all the columns you want at once and then use the formula feature in boxplot rather than using vectors:
library('languageR')
lor.np <- dative[dative$PronomOfRec == 'nonpronominal',
c('LengthOfRecipient','Modality')]
head(lor.np)
# LengthOfRecipient Modality
# 2 2 written
# 3 1 written
# 5 2 written
# 6 2 written
# 7 2 written
# 11 2 written
## but you don't even need to select the columns:
lor.np <- dative[dative$PronomOfRec == 'nonpronominal', ]
boxplot(LengthOfRecipient ~ Modality, lor.np)
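After looking at the data, you don't need droplevels here, but this is an example of when it may be useful. Subsetting to one Modality leaves 'spoken' behind as an empty factor level, which still gets a slot in the plot:
dat1 <- dative[dative$Modality == 'written', ]
boxplot(LengthOfRecipient ~ Modality, dat1)
## droplevels removes the unused level, so only 'written' is drawn:
boxplot(LengthOfRecipient ~ Modality, droplevels(dat1))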
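If you do want the two separate vectors from the question, the same logical index can be reused for both columns. Below is a minimal sketch on a made-up toy data frame (the values are invented; only the column names mirror dative). Note that boxplot(lor.np, mod.np) would treat the two vectors as two separate samples, so the formula form is still the one to use. plot = FALSE is used only so the grouping can be inspected without drawing.

```r
# Toy stand-in for dative (hypothetical values, same column names)
toy <- data.frame(
  PronomOfRec = c("nonpronominal", "pronominal", "nonpronominal",
                  "nonpronominal", "pronominal", "nonpronominal"),
  Modality = factor(c("written", "spoken", "spoken",
                      "written", "written", "spoken")),
  LengthOfRecipient = c(2, 1, 3, 2, 1, 4)
)

# One logical index, reused so both vectors refer to the same rows
keep <- toy$PronomOfRec == "nonpronominal"
lor.np <- toy$LengthOfRecipient[keep]
mod.np <- toy$Modality[keep]

# Formula interface: one box per level of mod.np
stats <- boxplot(lor.np ~ mod.np, plot = FALSE)
stats$names  # group labels
stats$n      # observations per group
```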

How do I use dataframe names as inputs for var in for loop (R language)?

In R, I defined the following function:
race_ethn_tab <- function(x) {
x %>%
group_by(RAC1P) %>%
tally(wt = PWGTP) %>%
print(n = 15) }
The function simply generates a weighted tally for a given dataset, for example, race_ethn_tab(ca_pop_2000) generates a simple 9 x 2 table:
1 Race 1 22322824
2 Race 2 2144044
3 Race 3 228817
4 Race 4 1827
5 Race 5 98823
6 Race 6 3722624
7 Race 7 116176
8 Race 8 3183821
9 Race 9 1268095
I have to do this for several distinct datasets (approx. 10), and it's easier for me to keep the dfs distinct rather than bind them and create a year variable. So, I am trying to use either a for loop or purrr::map() to iterate through my list of dfs.
Here is what I tried:
dfs_test <- as.list(as_tibble(ca_pop_2000),
as_tibble(ca_pop_2001),
as_tibble(ca_pop_2002),
as_tibble(ca_pop_2003),
as_tibble(ca_pop_2004))
# Attempt 1: Using for loop
for (i in dfs_test) {
race_ethn_tab(i)
}
# Attempt 2: Using purrr::map
race_ethn_outs <- map(dfs_test, race_ethn_tab)
Both attempts are telling me that group_by can't be applied to a factor object, but I can't figure out why the elements in dfs_test are being registered as factors given that I am forcing them into the tibble class. Would appreciate any tips based on my approach or alternative approaches that could make sense here.
This comment from @RonakShah was exactly what was needed:
Your code should work if you use list instead of as.list. See the output of as.list(as_tibble(mtcars), as_tibble(iris)) vs list(as_tibble(mtcars), as_tibble(iris)) – Ronak Shah Oct 2 at 0:23
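To see why the elements ended up as factors: as.list converts only its first argument, so a data frame becomes a list of its columns and the remaining arguments are silently ignored. A small sketch with built-in data frames (plain data frames instead of tibbles, to keep it base R):

```r
# as.list() converts only its first argument: one list element per column
cols <- as.list(mtcars, iris)   # iris is silently ignored
length(cols)                    # number of columns in mtcars
class(cols[[1]])                # a plain column vector, not a data frame

# list() keeps each argument as one element
dfs <- list(mtcars, iris)
length(dfs)
class(dfs[[1]])
```

So in the question, each element passed to race_ethn_tab was a single column of ca_pop_2000, which is why group_by complained about a factor.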
We can use mget to return a list of datasets, then loop over the list and apply the function
dfs_test <- mget(paste0("ca_pop_", 2000:2004))
It can also be made more general if we use ls:
dfs_test <- mget(ls(pattern = '^ca_pop_\\d{4}$'))
map(dfs_test, race_ethn_tab)
This would make it easier if there are 100s of objects already created in the global environment instead of doing
list(ca_pop_2000, ca_pop_2001, .., ca_pop_2020)
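A self-contained sketch of the mget approach, using two tiny made-up data frames and a base R stand-in for the weighted tally (df_2000, df_2001 and weighted_tab are hypothetical names, not from the question):

```r
# Two toy data frames standing in for the yearly datasets
df_2000 <- data.frame(g = c("a", "b", "a"), w = c(1, 2, 3))
df_2001 <- data.frame(g = c("a", "b", "b"), w = c(4, 5, 6))

# Base R stand-in for the weighted tally: sum of weights per group
weighted_tab <- function(d) tapply(d$w, d$g, sum)

# mget() looks the objects up by name and returns a named list
dfs <- mget(paste0("df_", 2000:2001))
res <- lapply(dfs, weighted_tab)
names(res)
res$df_2001
```

A nice side effect is that the list is named after the objects, so each result stays labelled with its year.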

R: "Adding" 2 variables (columns) to create an aggregate variable (column)?

This may be a strange request - I hope I am wording it correctly:
I have a dataset (df) and three variables (BELONG_1, GRPOR_14, ETHNIC10) that I want to "add" so as to get an aggregate variable (PsychIntegration) that can be run in a regression analysis - e.g. controlling for other variables such as gender, age etc.
BELONG_1: 1=Do not belong - 10=Do belong
GRPOR_14: 1=Not proud - 10=Very proud
ETHNIC10: 1=Not important - 4=Very important
The Cronbach Alpha for these three variables is 0.62
ID BELONG_1 GRPOR_14 ETHNIC10 PsychIntegration
1 10 8 4 ??
2 3 4 2 ??
3 7 10 3 ??
4 1 1 1 ??
How exactly do I "add" (?) these variables to get PsychIntegration?
I hope that makes sense - thanks again!
Try using the package dplyr (or the tidyverse). Assuming your data are stored in a data.frame named df, use
library(dplyr)
df <- df %>%
mutate(PsychIntegration = rowSums(select(., -ID)))
if you want to sum every column except ID within each row. Use
df <- df %>%
mutate(PsychIntegration = BELONG_1 + GRPOR_14 + ETHNIC10)
if you just want to sum those three columns.
Using just base R one possibility is
df$PsychIntegration <- df$BELONG_1 + df$GRPOR_14 + df$ETHNIC10
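For reference, here is the base R version applied to the four example rows from the question, with rowSums restricted to the three named columns so any other columns in df are left out of the total:

```r
# The four example rows from the question
df <- data.frame(
  ID = 1:4,
  BELONG_1 = c(10, 3, 7, 1),
  GRPOR_14 = c(8, 4, 10, 1),
  ETHNIC10 = c(4, 2, 3, 1)
)

# Row-wise sum over the three items only
items <- c("BELONG_1", "GRPOR_14", "ETHNIC10")
df$PsychIntegration <- rowSums(df[, items])
df$PsychIntegration
```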

How to write a SAS macro in R?

I want to test the SAS code my team has produced in R to compare the estimates that we get from each but being new to R am not having much luck. In SAS we have written 3 macros to produce three separate estimates (HFS010, HFS011, HFS012), an example of one given here;
%macro HFS010 (peninc_var, pengn_var, pentax_var, pentype_var, HFS010_x_var);
do i = 1 to dim(pentypex);
if &pentype_var = 1 and &pengn_var = 1 then &HFS010_x_var = &peninc_var;
else if &pentype_var = 1 and &pengn_var = 2 then &HFS010_x_var = &peninc_var + &pentax_var;
end;
%mend HFS010;
Basically the idea is that each macro produces an estimate for gross pension income (so where applicable adds tax deducted from pensions on to pension income value). There are three macros as we want separate estimates for cases where pentype = 1 (HFS010), pentype = 2 (HFS011) and pentype = 3 to 7 (HFS012) and the survey accepts up to 16 entries for pensions.
To attempt to produce an equivalent of the above code in R, I wrote the following;
for(i in 1:16) {
pens_data[[paste0("HFS010_",i)]] <- case_when(
pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 1 ~ pens_data[[paste0("peninc",i)]],
pens_data[[paste0("pentype",i)]] == 1 & pens_data[[paste0("pengn",i)]] == 2 ~ pens_data[[paste0("peninc",i)]] + pens_data[[paste0("pentax",i)]],
TRUE ~ 0)
}
This code does not produce errors but upon inspecting the estimates, there were some cases that should have estimates that were left blank.
Does anyone know of a way to write a macro in R? I thought of writing a function potentially for each of HFS010, HFS011, HFS012 but being new to R am not sure how to go about this.
If anyone has any suggestions as to why my R code isn't producing the correct estimates, or how they would write the equivalent of a SAS macro in R it would be greatly appreciated! I have tried to use defmacro but could not get this to work without errors.
Thanks so much!
Ashlee
There are many ways to write this in R. But first a couple of comments:
R is vectorised, so we should operate on whole vectors wherever possible. This is much faster and avoids a slow for loop that relies on side effects.
To help others give you an answer, please provide a reproducible example that covers both use cases.
For example:
set.seed(1)
dx <- data.frame(
peninc_var=sample(c(1,3),5,TRUE),
pengn_var=sample(c(1,2),5,TRUE),
pentax_var=1:5)
Here is an option in base R. I am creating the new variable HFS010_x_var using ifelse:
dx$HFS010_x_var <-
with(dx,{
## I am adding a last NO condition here to assign missing NA
ifelse(peninc_var==1 & pengn_var==1,peninc_var,
ifelse(peninc_var==1 & pengn_var==2,peninc_var + pentax_var,NA))
})
peninc_var pengn_var pentax_var HFS010_x_var
1: 1 2 1 2
2: 1 2 2 3
3: 3 2 3 NA
4: 3 2 4 NA
5: 1 1 5 1
Another option (with more syntactic sugar) is to use data.table:
library(data.table)
setDT(dx)
dx[peninc_var==1 & pengn_var==1,HFS010_x_var := peninc_var]
dx[peninc_var==1 & pengn_var==2,HFS010_x_var := peninc_var+pentax_var]
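On the question of writing a function instead of a macro: one possible sketch is a vectorised function that takes the four columns as arguments, with a short loop supplying the numbered column names. It tests pentype as in the SAS macro. The function name hfs010, the toy values and the use of only two entries instead of 16 are assumptions for illustration, and rows matching neither condition are set to NA here rather than 0.

```r
# Vectorised equivalent of the HFS010 macro body (hypothetical function name)
hfs010 <- function(peninc, pengn, pentax, pentype) {
  ifelse(pentype == 1 & pengn == 1, peninc,
         ifelse(pentype == 1 & pengn == 2, peninc + pentax, NA_real_))
}

# Toy data with two pension entries per person (made-up values)
pens <- data.frame(
  pentype1 = c(1, 1, 2), pengn1 = c(1, 2, 1),
  peninc1 = c(100, 200, 300), pentax1 = c(10, 20, 30),
  pentype2 = c(1, 3, 1), pengn2 = c(2, 1, 1),
  peninc2 = c(50, 60, 70), pentax2 = c(5, 6, 7)
)

# The loop only builds column names; the arithmetic itself is vectorised
for (i in 1:2) {
  pens[[paste0("HFS010_", i)]] <- hfs010(
    peninc  = pens[[paste0("peninc", i)]],
    pengn   = pens[[paste0("pengn", i)]],
    pentax  = pens[[paste0("pentax", i)]],
    pentype = pens[[paste0("pentype", i)]]
  )
}
pens[, c("HFS010_1", "HFS010_2")]
```

HFS011 and HFS012 could follow the same pattern with a different pentype test. Keep in mind that an NA in any input column propagates to the result, which may be worth checking when estimates come out blank.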

How to run a for-loop through a string vector of a data frame in R?

I'm trying to do something very simple: to run a loop through a vector of names and use those names in my code.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
My real dataset has 14 countries and 75 time periods. I would like to find a function which for example loops through the countries, then subsets them so I have the single datasets such as:
data_AT <- subset(Data, (Data$geo=="AT"))
data_BE <- subset(Data, (Data$geo=="BE"))
but with a loop and ideally with a solution I can apply to other functions as well :-)
In my mind, this should look something like this:
codes <- unique(Data$geo)
for (i in 1:length(codes))
{k <- codes[i]
data_(k) <- subset(Data, (Data$geo==k))}
however subset doesn't work like this, neither do other functions. I think my problem is that I don't know how to address the respective name which "k" has taken (e.g. "AT") as part of my code. If at all possible, I would very much appreciate an answer with a general solution of how I can run a function through a vector containing text and use each element of that vector in my code. Maybe in the direction of the apply functions? Though I'm not getting very far with that either...
Any help would be very much appreciated!
I'm using loops for similar purposes too. Maybe it's not the fastest way, but at least I understand it -- for example, when saving plots for different subsets.
There is no need to loop over the length of the vector; you can loop over the vector itself. To turn a string into a variable name, you can use assign.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
codes <- sort(unique(Data$geo))
for (k in codes) {
name<-paste("data", k, sep="_")
assign(name, subset(Data, (Data$geo==k)))
}
BTW, filter from package dplyr can be faster than subset on large data!
In R, you would typically do this with a list of data.frames instead of several separate data.frames:
lst <- split(Data, Data$geo)
lst
#$AT
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
#
#$BE
# geo time value
#4 BE 1990Q1 4
#5 BE 1990Q2 5
#6 BE 1990Q3 6
Now you can access each element (which is a data.frame) by typing:
lst[["AT"]]
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
If you have a vector of country names for which you want to add +1 to the value column, you can do it like this:
cntrs <- c("BE", "AT")
lst[cntrs] <- lapply(lst[cntrs], function(x) {x$value <- x$value + 1; return(x)} )
#$BE
# geo time value
#4 BE 1990Q1 5
#5 BE 1990Q2 6
#6 BE 1990Q3 7
#
#$AT
# geo time value
#1 AT 1990Q1 2
#2 AT 1990Q2 3
#3 AT 1990Q3 4
Edit: if you really want to stick with a for loop, I recommend not to split the data into several separate data.frames but to run the loop on the whole data set like this for example:
cntrs <- "BE"
for(i in cntrs){
Data$value[Data$geo == i] <- Data$value[Data$geo == i] + 1
}
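For the more general request (run a function over a vector of text and use each element in the code), sapply or lapply over the codes is a common pattern. A minimal sketch on the example data; the per-country sum is just an arbitrary placeholder for whatever you want to compute:

```r
geo <- c(rep("AT", 3), rep("BE", 3))
time <- rep(c("1990Q1", "1990Q2", "1990Q3"), 2)
value <- 1:6
Data <- data.frame(geo, time, value)

# as.character() so the result is named even if geo is stored as a factor
codes <- as.character(unique(Data$geo))

# k takes each country code in turn; the result is named by code
totals <- sapply(codes, function(k) sum(Data$value[Data$geo == k]))
totals
```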

How to write in a membership vector to use in modularity() function with R

I have a membership vector created with another program, and I am stuck on how to read it into R so that I can use igraph's modularity function to calculate the modularity of this community division.
Can someone help me with how to read the vector into R so that modularity(g, membership) can run?
I tried membership <- read.table(file), but the result could not be used with modularity(g, membership).
Thanks,
Song
read.table creates a data frame, so you need to convert that to a simple numeric vector. Alternatively, you can use scan(). You might need to adjust the following to your data format.
library(igraph)
G <- graph.full(3) + graph.ring(3) + graph.full(3)
contents <- '1 1 1 2 2 2 3 3 3'
memb <- scan(textConnection(contents))
# Read 9 items
modularity(G, memb)
# [1] 0.6666667
Instead of the textConnection(), just put your file name there.
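If you would rather keep read.table, the conversion to a plain vector depends on how the file is laid out. A short sketch for the two most likely layouts, using text = in place of a real file (the layouts themselves are assumptions about your data):

```r
# Layout 1: one membership id per line -> take the single column
tab_col <- read.table(text = "1\n1\n1\n2\n2\n2\n3\n3\n3")
memb_col <- tab_col[[1]]

# Layout 2: all ids on one line -> flatten the single row
tab_row <- read.table(text = "1 1 1 2 2 2 3 3 3")
memb_row <- unname(unlist(tab_row))

memb_col
identical(memb_col, memb_row)
```

Either vector can then be passed as the membership argument to modularity().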