set missing values for multiple labelled variables - r

How do I set missing values for multiple labelled vectors in a data frame? I am working with a survey dataset imported from SPSS. I am dealing with about 20 different variables, all with the same missing values, so I would like to find a way to use lapply() to make this work, but I can't.
I actually can do this with base R via as.numeric() and then recode() but I'm intrigued by the possibilities of haven and the labelled class so I'd like to find a way to do this all in Hadley's tidyverse
Roughly, the variables of interest look like this. I am sorry if this is a basic question, but I find the help documentation associated with the haven and labelled packages very unhelpful.
library(haven)
library(labelled)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
lapply(v3, val_labels)
lapply(v3, function(x) set_na_values(x, c(5,6)))

Ok, I think I now understand what you are trying to do...
i.e. mark the labels and the values as NA without removing the underlying imported data...
See the addendum for a more detailed example that uses a public data file and harnesses dplyr to update multiple columns and labels...
Proposed Solution
df <- data_frame(s1 = c(1,2,2,2,5,6), s2 = c(1,2,2,2,5,6)) %>%
set_value_labels(s1 = c(agree=1, disagree=2, dk=5, refused=6),
s2 = c(agree=1, disagree=2, dk = tagged_na("5"), refused = tagged_na("6"))) %>%
set_na_values(s2 = c(5,6))
val_labels(df)
is.na(df$s1)
is.na(df$s2)
df
Solution Result:
> library(haven)
> library(labelled)
> library(dplyr)
> df <- data_frame(s1 = c(1,2,2,2,5,6), s2 = c(1,2,2,2,5,6)) %>%
+ set_value_labels(s1 = c(agree=1, disagree=2, dk=5, refused=6),
+ s2 = c(agree=1, disagree=2, dk = tagged_na("5"), refused = tagged_na("6"))) %>%
+ set_na_values(s2 = c(5,6))
> val_labels(df)
$s1
agree disagree dk refused
1 2 5 6
$s2
agree disagree dk refused
1 2 NA NA
> is.na(df$s1)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> is.na(df$s2)
[1] FALSE FALSE FALSE FALSE TRUE TRUE
> df
# A tibble: 6 × 2
s1 s2
<dbl+lbl> <dbl+lbl>
1 1 1
2 2 2
3 2 2
4 2 2
5 5 5
6 6 6
Now we can manipulate the data
mean(df$s1, na.rm = TRUE)
mean(df$s2, na.rm = TRUE)
> mean(df$s1, na.rm = TRUE)
[1] 3
> mean(df$s2, na.rm = TRUE)
[1] 1.75
Use the labelled package to remove labels and replace them with R NA
If you wish to strip the labels and replace them with regular R NA values, you can use remove_labels(x, user_na_to_na = TRUE)
Example:
df <- remove_labels(df, user_na_to_na = TRUE)
df
Result:
> df <- remove_labels(df, user_na_to_na = TRUE)
> df
# A tibble: 6 × 2
s1 s2
<dbl> <dbl>
1 1 1
2 2 2
3 2 2
4 2 2
5 5 NA
6 6 NA
--
Explanation / Overview of SPSS Format:
IBM SPSS (The application) can import and export data in many formats and in non-rectangular configurations; however, the data set is always translated to an SPSS rectangular data file, known as a system file (using the extension *.sav). Metadata (information about the data) such as variable formats, missing values, and variable and value labels are stored with the dataset.
Value Labels
Base R has one data type that effectively maintains a mapping between integers and character labels: the factor. This, however, is not the primary use of factors: they are instead designed to automatically generate useful contrasts for linear models. Factors differ from the labelled values provided by the other tools in important ways:
SPSS and SAS can label numeric and character values, not just integer values.
Missing Values
All three tools (SPSS, SAS, Stata) provide a global “system missing value” which is displayed as .. This is roughly equivalent to R’s NA, although neither Stata nor SAS propagate missingness in numeric comparisons: SAS treats the missing value as the smallest possible number (i.e. -inf), and Stata treats it as the largest possible number (i.e. inf).
Each tool also provides a mechanism for recording multiple types of missingness:
Stata has “extended” missing values, .A through .Z.
SAS has “special” missing values, .A through .Z plus ._.
SPSS has per-column “user” missing values. Each column can declare up to three distinct values or a range of values (plus one distinct value) that should be treated as missing.
User Defined Missing Values
SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.
x1 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_values = 99)
x2 <- labelled_spss(c(1:10, 99), c(Missing = 99), na_range = c(90, Inf))
x1
#> <Labelled SPSS double>
#> [1] 1 2 3 4 5 6 7 8 9 10 99
#> Missing values: 99
#>
#> Labels:
#> value label
#> 99 Missing
x2
#> <Labelled SPSS double>
#> [1] 1 2 3 4 5 6 7 8 9 10 99
#> Missing range: [90, Inf]
#>
#> Labels:
#> value label
#> 99 Missing
Tagged missing values
To support Stata’s extended and SAS’s special missing values, haven implements a tagged NA. It does this by taking advantage of the internal structure of a floating point NA. That allows these values to behave identically to NA in regular R operations, while still preserving the value of the tag.
The R interface for creating tagged NAs is a little clunky because generally they’ll be created for you by haven. But you can create your own with tagged_na():
Important:
Note these tagged NAs behave identically to regular NAs, even when printing. To see their tags, use print_tagged_na():
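For instance, a short sketch (assuming haven is installed):

```r
library(haven)

# Tagged NAs look like plain NA unless you explicitly ask for the tag
x <- c(1, tagged_na("a"), tagged_na("z"), NA)

is.na(x)            # FALSE TRUE TRUE TRUE - every tagged NA is a real NA
print_tagged_na(x)  # reveals the tags: 1 NA(a) NA(z) NA
na_tag(x)           # extracts the tag characters
```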
Thus:
library(haven)
library(labelled)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=tagged_na("5"), refused= tagged_na("6")))
v3<-data.frame(v1 = v1, v2 = v2)
v3
lapply(v3, val_labels)
> v3
  v1 v2
1  1  1
2  2  2
3  2  2
4  2  2
5  5  5
6  6  6
> lapply(v3, val_labels)
$v1
agree disagree dk refused
1 2 5 6
$v2
agree disagree dk refused
1 2 NA NA
Word of caution:
SPSS’s user-defined values work differently to SAS and Stata. Each column can have either up to three distinct values that are considered as missing, or a range. Haven provides labelled_spss() as a subclass of labelled() to model these additional user-defined missings.
I hope the above helps
Take care
T.
References:
https://cran.r-project.org/web/packages/haven/haven.pdf
https://cran.r-project.org/web/packages/haven/vignettes/semantics.html
https://www.spss-tutorials.com/spss-missing-values-tutorial/
Addendum Example using Public Data...
SPSS Missing Values Example using an SPSS Data file {hospital.sav}
Firstly, let's highlight the distinction:
System missing values are values that are completely absent from the data.
User missing values are values that are present in the data but must be excluded from calculations.
SPSS View of Data...
Let's review the image and the data... The SPSS variable view shows that each row has a Label [Column 5]; we note that rows 10 through 14 have specific values attributed to them [1..6] [Column 6] that have name attributes, and that no values have been specified as Missing [Column 7].
Now let's look at the SPSS data view:
Here we can see that there is missing data (see the highlighted "."s). The key point is that we have missing data, but currently no "user missing values".
Now let's turn to R, and load the data into R
hospital_url <- "https://www.spss-tutorials.com/downloads/hospital.sav"
hospital <- read_sav(hospital_url,
user_na = FALSE)
head(hospital,5)
# We're interested in columns 10 through 14...
head(hospital[10:14],5)
Result
> hospital_url <- "https://www.spss-tutorials.com/downloads/hospital.sav"
> hospital <- read_sav(hospital_url,
+ user_na = FALSE)
> head(hospital,5)
# A tibble: 5 × 14
visit_id patient_id first_name surname_prefix last_name gender entry_date entry_time
<dbl> <dbl> <chr> <chr> <chr> <dbl+lbl> <date> <time>
1 32943 23176 JEFFREY DIJKSTRA 1 2013-01-08 16:56:10
2 32944 20754 MARK VAN DER BERG 1 2013-02-01 14:24:45
3 32945 25419 WILLEM VERMEULEN 1 2013-02-02 10:01:43
4 32946 21139 LINDA JANSSEN 0 2013-02-10 10:24:39
5 32947 25419 WILLEM VERMEULEN 1 2013-02-10 18:05:59
# ... with 6 more variables: exit_moment <dttm>, doctor_rating <dbl+lbl>, nurse_rating <dbl+lbl>,
# room_rating <dbl+lbl>, food_rating <dbl+lbl>, facilities_rating <dbl+lbl>
Columns 10 through 14 contain Values
1="Very Dissatisfied"
2="Dissatisfied"
3="Neutral"
4="Satisfied"
5="Very Satisfied"
6="Not applicable or don't want to answer"
thus:
> head(hospital[10:14],5)
# A tibble: 5 × 5
doctor_rating nurse_rating room_rating food_rating facilities_rating
<dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl>
1 5 5 4 2 3
2 4 5 4 3 3
3 5 6 4 5 4
4 4 5 5 4 4
5 5 5 6 6 6
SPSS Value Labels
> lapply(hospital[10], val_labels)
$doctor_rating
Very dissatisfied Dissatisfied
1 2
Neutral Satisfied
3 4
Very satisfied Not applicable or don't want to answer
5 6
OK, note that the output above confirms we have imported the value labels.
Remove Non-Applicable data from the survey data
Our goal is now to remove the "Not applicable or don't want to answer" data entries by setting them to be "user NA values", i.e. SPSS missing values.
Solution - Step 1 - A Single Column
We wish to set the missing value attribute across multiple columns in the data... Let's first do this for one column...
Note we use add_value_labels, not set_value_labels, as we wish to append a new label rather than completely overwrite the existing labels...
d <- hospital
mean(d$doctor_rating, na.rm = TRUE)
d <- hospital %>%
add_value_labels( doctor_rating = c( "Not applicable or don't want to answer"
= tagged_na("6") )) %>%
set_na_values(doctor_rating = 6)
val_labels(d$doctor_rating)
mean(d$doctor_rating, na.rm = TRUE)
> d <- hospital
> mean(d$doctor_rating, na.rm = TRUE)
[1] 4.322368
> d <- hospital %>%
+ add_value_labels( doctor_rating = c( "Not applicable or don't want to answer"
+ = tagged_na("6") )) %>%
+ set_na_values(doctor_rating = 6)
> val_labels(d$doctor_rating)
Very dissatisfied Dissatisfied
1 2
Neutral Satisfied
3 4
Very satisfied Not applicable or don't want to answer
5 6
Not applicable or don't want to answer
NA
> mean(d$doctor_rating, na.rm = TRUE)
[1] 4.097015
Solution - Step 2 - Now apply to multiple columns...
mean(hospital$nurse_rating)
mean(hospital$nurse_rating, na.rm = TRUE)
d <- hospital %>%
add_value_labels( doctor_rating = c( "Not applicable or don't want to answer"
= tagged_na("6") )) %>%
set_na_values(doctor_rating = 6) %>%
add_value_labels( nurse_rating = c( "Not applicable or don't want to answer"
= tagged_na("6") )) %>%
set_na_values(nurse_rating = 6)
mean(d$nurse_rating, na.rm = TRUE)
Result
Note that nurse_rating contains NaN values as well as tagged NA values.
The first mean() call returns NaN; the second succeeds but still includes the "Not applicable..." responses. Once those responses are marked as missing, they are excluded from the mean...
> mean(hospital$nurse_rating)
[1] NaN
> mean(hospital$nurse_rating, na.rm = TRUE)
[1] 4.471429
> d <- hospital %>%
+ add_value_labels( doctor_rating = c( "Not applicable or don't want to answer"
+ = tagged_na("6") )) %>%
+ set_na_values(doctor_rating = 6) %>%
+ add_value_labels( nurse_rating = c( "Not applicable or don't want to answer"
+ = tagged_na("6") )) %>%
+ set_na_values(nurse_rating = 6)
> mean(d$nurse_rating, na.rm = TRUE)
[1] 4.341085
Convert tagged NA to R NA
Here we take the above tagged NA and convert to R NA values.
d <- d %>% remove_labels(user_na_to_na = TRUE)

Not quite sure if this is what you are looking for:
v1 <- labelled(c(1, 2, 2, 2, 5, 6), c(agree = 1, disagree = 2, dk = 5, refused = 6))
v2 <- labelled(c(1, 2, 2, 2, 5, 6), c(agree = 1, disagree = 2, dk = 5, refused = 6))
v3 <- data_frame(v1 = v1, v2 = v2)
lapply(names(v3), FUN = function(x) {
na_values(v3[[x]]) <<- 5:6
})
lapply(v3, na_values)
The last line returning
$v1
[1] 5 6
$v2
[1] 5 6
Verify missing values:
is.na(v3$v1)
[1] FALSE FALSE FALSE FALSE TRUE TRUE

Defining SPSS-style user-defined missing values
Main functions
The two main functions in the labelled package for manipulating SPSS-style user-defined missing values are na_values and na_range.
library(labelled)
v1 <-c(1,2,2,2,5,6)
val_labels(v1) <- c(agree=1, disagree=2, dk=5, refused=6)
na_values(v1) <- 5:6
v1
<Labelled SPSS double>
[1] 1 2 2 2 5 6
Missing values: 5, 6
Labels:
value label
1 agree
2 disagree
5 dk
6 refused
set_* functions
The set_* functions in labelled are intended to be used with magrittr / dplyr.
library(dplyr)
d <- tibble(v1 = c(1, 2, 2, 2, 5, 6), v2 = c(1:3, 1:3))
d <- d %>%
set_value_labels(v1 = c(agree=1, disagree=2, dk=5, refused=6)) %>%
set_na_values(v1 = 5:6)
d$v1
<Labelled SPSS double>
[1] 1 2 2 2 5 6
Missing values: 5, 6
Labels:
value label
1 agree
2 disagree
5 dk
6 refused
What are user-defined missing values?
User-defined missing values are purely metadata. They do not change the values in a vector. This is simply a way to tell the user that these values could/should, in some contexts, be considered as missing values. It means that if you compute something (e.g. a mean) from your vector, these values will still be taken into account.
mean(v1)
[1] 3
You can easily convert user-defined missing values to proper NA with user_na_to_na.
mean(user_na_to_na(v1), na.rm = TRUE)
[1] 1.75
Very few functions take this metadata into account. See, for example, the freq function from the questionr package.
library(questionr)
freq(v1)
n % val%
[1] agree 1 16.7 25
[2] disagree 3 50.0 75
[5] dk 1 16.7 NA
[6] refused 1 16.7 NA
NA 0 0.0 NA
What is the difference with tagged NAs ?
The purpose of tagged NAs, introduced by haven, is to reproduce the way Stata manages missing values. All tagged NAs are internally considered as NA by R.
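A short sketch of the contrast (assuming haven and labelled are installed):

```r
library(haven)
library(labelled)

# A tagged NA is a genuine NA to R...
tagged <- c(1, 2, tagged_na("a"))
is.na(tagged)               # FALSE FALSE TRUE
mean(tagged, na.rm = TRUE)  # 1.5 - the tagged value is excluded

# ...whereas an SPSS user-defined missing is an ordinary number plus metadata
spss <- labelled_spss(c(1, 2, 9), c(Missing = 9), na_values = 9)
mean(spss)                                # 4 - the 9 still counts
mean(user_na_to_na(spss), na.rm = TRUE)   # 1.5 after conversion to real NA
```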

Is this correct?
#Using replace to substitute 5 and 6 in v3 with NA
data.frame(lapply(v3, function(a) replace(x = a, list = a %in% c(5,6), values = NA)))
# v1 v2
#1 1 1
#2 2 2
#3 2 2
#4 2 2
#5 NA NA
#6 NA NA
I know labelled_spss allows you to specify na_range or even a vector of na_values
#DATA
v11 = labelled_spss(x = c(1,2,2,2,5,6),
labels = c(agree=1, disagree=2, dk=5, refused=6),
na_range = 5:6)
#Check if v11 has NA values
is.na(v11)
#[1] FALSE FALSE FALSE FALSE TRUE TRUE
v22 = labelled_spss(x = c(1,2,2,2,5,6),
labels = c(agree=1, disagree=2, dk=5, refused=6),
na_range = 5:6)
#Put v11 and v22 in a list
v33 = list(v11, v22)
#Use replace like above
data.frame(lapply(X = v33, FUN = function(a) replace(x = a, list = is.na(a), values = NA)))
# x x.1
#1 1 1
#2 2 2
#3 2 2
#4 2 2
#5 NA NA
#6 NA NA

The first argument to set_na_values is a data frame, not a vector/column, which is why your lapply command doesn't work. You could build a list of the arguments for set_na_values for an arbitrary number of columns in your data frame and then call it with do.call as below...
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
na_values(v3)
args <- c(list(.data = v3), setNames(lapply(names(v3), function(x) c(5,6)), names(v3)))
v3 <- do.call(set_na_values, args)
na_values(v3)
Update: You can also use the assignment form of the na_values function within an lapply statement, since it accepts a vector as its first argument instead of a data frame like set_na_values...
library(haven)
library(labelled)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
lapply(v3, val_labels)
na_values(v3)
v3[] <- lapply(v3, function(x) `na_values<-`(x, c(5,6)))
na_values(v3)
or even use the normal version of na_values in the lapply command, just making sure to return the 'fixed' vector...
library(haven)
library(labelled)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
lapply(v3, val_labels)
na_values(v3)
v3[] <- lapply(v3, function(x) { na_values(x) <- c(5,6); x } )
na_values(v3)
and that idea can be used inside of a dplyr chain as well, either applying to all variables, or applying to whatever columns are selected using dplyr's selection tools...
library(haven)
library(labelled)
library(dplyr)
v1<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v2<-labelled(c(1,2,2,2,5,6), c(agree=1, disagree=2, dk=5, refused=6))
v3<-data.frame(v1=v1, v2=v2)
lapply(v3, val_labels)
na_values(v3)
v4 <- v3 %>% mutate_all(funs(`na_values<-`(., c(5,6))))
na_values(v4)
v5 <- v3 %>% mutate_at(vars(v1), funs(`na_values<-`(., c(5,6))))
na_values(v5)

You could use a very simple base R solution:
v3[v3 == 5 ] <- NA
v3[v3 == 6 ] <- NA
But if you're looking for a really fast solution, you can use a data.table approach.
library(data.table)
setDT(v3)
for(j in seq_along(v3)) {
set(v3, i=which(v3[[j]] %in% c(5,6)), j=j, value=NA)
}

Related

Merging words in a wordcloud made in R

I have created a word cloud with the following frequency of terms:
interesting interesting 21
economics economics 12
learning learning 9
learn learn 6
taxes taxes 6
debating debating 6
everything everything 6
know know 6
tax tax 3
meaning meaning 3
I want to add the 6 counts for "learn" into the overall count for "learning" so that the frequency becomes 15, and I only have "learning" in my word cloud. I also want to do the same for "taxes" and "tax".
This is the code I used to generate the wordcloud.
dataset <- read.csv("~/filepath.csv")
corpus <- Corpus(VectorSource(dataset$comment))
clean_corpus <- tm_map(corpus, removeWords, stopwords('english'))
wordcloud(clean_corpus, scale=c(5,0.5), max.words=100, random.order = FALSE, rot.per=0.35, colors=my_palette)
I have tried using the SnowballC package, but this was the outcome:
> library(SnowballC)
> clean_set <- tm_map(clean_corpus, stemDocument)
> dtm <- TermDocumentMatrix(clean_set)
> m <- as.matrix(dtm)
> v <- sort(rowSums(m), decreasing = TRUE)
> d <- data.frame(word = names(v), freq=v)
> head(d, 10)
This gives me the output below (economics has become econom, debating has become debat, everything has become everyth), which is obviously not ideal. I only have an issue with learn/learning and tax/taxes, so would it be possible to manually merge just those two sets of words?
interest interest 21
learn learn 18
econom econom 12
tax tax 9
debat debat 6
everyth everyth 6
know know 6
mean mean 3
understand understand 3
group group 3
I have also tried clean_corpus_2 <- tm_map(clean_corpus, content_transformer(gsub), pattern = "taxes", replacement = "tax", fixed = TRUE) which changed nothing in the output.
I'm using the tidyverse packages, particularly dplyr as that's what I'm comfortable with, but I'm sure this is doable with base R or any number of other approaches.
library(tidyverse)
First I mock up some data as I don't have yours to test on:
testdata <- tribble(
~ID, ~comment,
1, "learn",
2, "learning",
3, "learned",
4, "tax",
5, "taxes",
6, "panoply"
)
Next is the approach that explicitly lists the options:
testdata1 <- testdata %>% mutate(
newcol = case_when(
comment %in% c("learn", "learning", "learned") ~ "learn",
comment %in% c("tax", "taxes") ~ "tax",
TRUE ~ as.character(comment)
)
)
In this code, %>% is a pipe, and mutate() adds a new column based on what follows. newcol is the name of the new column, and its contents are decided by the case_when() construct, which tests each option in turn until it finds something returning TRUE - that's why the last option (the default "don't change" case) is listed as TRUE ~ as.character(comment).
After that, the pattern-matching (grepl) approach:
testdata2 <- testdata %>% mutate(
newcol = case_when(
grepl(comment, pattern = "learn") ~ "learn",
grepl(comment, pattern = "tax") ~ "tax",
TRUE ~ as.character(comment)
)
)
Yielding:
> testdata1
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
> testdata2
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
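To rebuild a frequency table from the merged labels, you can collapse duplicates with count() (a sketch using a mock of the merged column above; `freq` is my column name, and with your real corpus you would count term occurrences instead):

```r
library(dplyr)

# Mock of the merged newcol produced by case_when() above
testdata1 <- data.frame(newcol = c("learn", "learn", "learn",
                                   "tax", "tax", "panoply"))

# Collapse duplicates into one row per word with a total count
freqs <- testdata1 %>%
  count(newcol, name = "freq") %>%
  arrange(desc(freq))
```

These two columns can then feed wordcloud(words = freqs$newcol, freq = freqs$freq, ...) directly, instead of passing the corpus itself.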

Adding a column to a df based on comparison with a list through strsplit() in R

I've been working on something for a while now and still haven't figured out how to get it to work in my preferred way. Hoping someone can help me:
I have a dataframe containing lots of data (5000+ obs) about city budgets; therefore, one of the variable names is obviously 'city'. I have a separate list of 40 cities that I want to attach to this dataframe, essentially conditionally checking, for each city name in the df, whether it's also on the separate list (and if so, code it 1; or else 0). I made an example below with a smaller dataset:
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7,1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
Note: You may ask: why do the second part like that? My dataset is much larger and I wanted to find a way to make the process more automatic, so e.g. I wouldn't have to manually delete all the commas and separate the city names from one another. Now that I have df and seperate_list, I want to combine them in df by adding a third column that specifies whether (1) or not (0) each city is in the separate list. I've tried using a for loop and also lapply, but with no luck, since I'm not very skilled with either of those yet.
I would appreciate a hint, so I can sort the rest out myself!
library(tidyverse)
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7,1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
df %>%
mutate(
in_list = city %in% seperate_list
) %>%
as_tibble()
#> # A tibble: 40 x 3
#> city school in_list
#> <chr> <int> <lgl>
#> 1 city_a 1 TRUE
#> 2 city_a 2 TRUE
#> 3 city_a 3 TRUE
#> 4 city_a 4 TRUE
#> 5 city_a 5 TRUE
#> 6 city_a 6 TRUE
#> 7 city_a 7 TRUE
#> 8 city_a 8 TRUE
#> 9 city_b 1 TRUE
#> 10 city_b 2 TRUE
#> # … with 30 more rows
Created on 2021-09-09 by the reprex package (v2.0.1)
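The in_list column above is logical; since the question asks for 1/0 coding, wrapping the test in as.integer() finishes the job (a minimal sketch):

```r
# Minimal illustration with a two-row data frame
df <- data.frame(city = c("city_a", "city_c"))
seperate_list <- c("city_a", "city_b")

# %in% returns TRUE/FALSE; as.integer() converts that to 1/0
df$in_list <- as.integer(df$city %in% seperate_list)
df$in_list  # 1 0
```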
I think you might also look into joining tables and making the list of interest a column of another table. This is what databases and relational algebra are made for.

How can I add more columns to a data frame with a for loop

I am a beginner in R. I need to translate some EViews code to R. In EViews there is loop code that adds 10 or more columns/variables, each computed with some function of the data.
Here is an EViews example that estimates deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv,
deflator_gov=nominal_gov/real_gov,
deflator_exp=nominal_exp/real_exp)
print(df)
Please help me do this in R with a loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have two sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps; there may be a better way, though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
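Since this answer was written, tidyr (>= 1.0) can do both gathers in one step with pivot_longer(), using the special ".value" sentinel to split a name like nominal_gdp into a value column ("nominal") and a key ("gdp"). A sketch on mock data of the same shape:

```r
library(dplyr)
library(tidyr)

# Mock data with the same column layout as the question
df <- data.frame(obs_id = 1:4,
                 nominal_gdp = rnorm(4), nominal_inv = rnorm(4),
                 real_gdp    = rnorm(4), real_inv    = rnorm(4))

# One pivot replaces the two gathers and the join
df2 <- df %>%
  pivot_longer(-obs_id,
               names_to  = c(".value", "variable"),
               names_sep = "_") %>%
  mutate(deflator = nominal / real * 100)
```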
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
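A small extension of the idea above: compute the ratios and give the result columns deflator_* names before binding them back on (a sketch; the deflator_* naming is my own choice, not from the question):

```r
# Mock data with the same column layout as the question
df <- data.frame(nominal_gdp = rnorm(4), nominal_inv = rnorm(4),
                 real_gdp    = rnorm(4), real_inv    = rnorm(4))

# Match nominal_* and real_* columns by sorted name, divide elementwise,
# then rename the results deflator_* and bind them back on
nom <- sort(grep("^nominal", names(df), value = TRUE))
rea <- sort(grep("^real",    names(df), value = TRUE))
deflators <- df[nom] / df[rea] * 100
names(deflators) <- sub("^nominal", "deflator", nom)
df <- cbind(df, deflators)
```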
We can loop over column names using purrr::map_dfc, then apply a custom function over the selected columns (i.e. the columns that match the current name from nms).
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list, not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: Learn more about quo_name, !!, and ensym, which are tools for programming with dplyr, here

programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
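Since that answer was written, tidyr (>= 1.0) gained pivot_wider(), which handles this spread directly; a sketch on a cut-down mock of the tidy result:

```r
library(dplyr)
library(tidyr)

# Cut-down mock of tidy_result from above
tidy_result <- tibble(group      = rep(c("A", "B"), each = 2),
                      education  = rep(c("no.highschool", "high.school"), 2),
                      lessthan_x = c(0, 20, 0, 10))

# Build the lessthan_* column names, then spread to wide format
wide <- tidy_result %>%
  mutate(key = paste("lessthan", education, sep = "_")) %>%
  select(group, key, lessthan_x) %>%
  pivot_wider(names_from = key, values_from = lessthan_x)
```

Joining wide back onto the original wide-format df by group then reproduces the desired.df layout.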

R- Trimming a string in a dataframe after a particular pattern

I am having trouble figuring out how to trim the end off of a string in a data frame.
I want to trim each name down to a "base" name - #s and letters, a period, then a number - dropping anything after that. My goal is to trim everything in my dataframe to this "base" name, then sum the values that share the same "base". I was thinking it would be possible to trim, then merge and sum the values.
i.e.
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7.1 2
B0228.7.2 12
B0350.2h.1 30
B0350.2h.2 2
B0350.2i 15
2RSSE.1a 3
2RSSE.1b 10
R02F11.11 4
to
Gene_name Values
B0222.5 4
B0222.6 16
B0228.7 14
B0350.2 47
2RSSE.1 13
R02F11.11 4
Thank you for any help!
Here is a solution using the dplyr and stringr packages. You first create a column with your extracted base pattern, and then use the group_by and summarise functions from dplyr to get the sum of values for each name:
library(dplyr)
library(stringr)
df2 = df %>% mutate(Gene_name = str_extract(Gene_name,"[[:alnum:]]+\\.\\d+")) %>%
group_by(Gene_name) %>% summarise(Values = sum(Values))
Gene_name Values
<chr> <int>
1 2RSSE.1 13
2 B0222.5 4
3 B0222.6 16
4 B0228.7 14
5 B0350.2 47
6 R02F11.11 4
As someone has also suggested, I would get gene names first, and then search for them in the original data.frame
df <- data.frame(Gene_name = c("B0222.5", "B0222.6", "B0228.7.1", "B0228.7.2", "B0350.2h.1", "B0350.2h.2", "B0350.2i", "2RSSE.1a", "2RSSE.1b", "R02F11.11"),
Values = c(4, 16, 2, 12, 30, 2, 15, 3, 10, 4),
stringsAsFactors = F)
pat <- "(^[[:alnum:]]+\\.[[:digit:]]*)"
cap.pos <- regexpr(pat, df$Gene_name)
cap.gene <- unique(substr(df$Gene_name, cap.pos, (cap.pos + attributes(cap.pos)$match.length - 1)))
do.call(rbind, lapply(cap.gene, (function(nm){
sumval <- sum(df[grepl(nm, df$Gene_name, fixed = T),]$Values, na.rm = T)
data.frame(Gene_name = nm, Value = sumval)
})))
The result tracks with your request
Gene_name Value
1 B0222.5 4
2 B0222.6 16
3 B0228.7 14
4 B0350.2 47
5 2RSSE.1 13
6 R02F11.11 4
You can also create Gene_name as a factor and change the levels.
# coerce the vector to a factor
Gene_name <- as.factor(Gene_name)
# view the levels
levels(Gene_name)
# to make B0228.7.1 into B0228.7
levels(Gene_name)[ *index for B0228.7.1* ] <- "B0228.7"
You can repeat this for the levels that need to change; rows with the same level will then be treated as the same category, and their values can be summed together.
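A runnable sketch of the level-collapsing idea, on a small illustrative vector rather than the OP's full data:

```r
# Two rows share the base B0228.7; one is a different gene
Gene_name <- factor(c("B0228.7.1", "B0228.7.2", "B0222.5"))
Values    <- c(2, 12, 4)

# Assigning the same label to several levels merges them into one level
levels(Gene_name)[levels(Gene_name) %in% c("B0228.7.1", "B0228.7.2")] <- "B0228.7"

# Rows with the same (collapsed) level now sum together
tapply(Values, Gene_name, sum)  # B0222.5 = 4, B0228.7 = 14
```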
