I'm working with survey data containing value labels. The haven package allows one to import data with value label attributes. Sometimes these value labels need to be edited in routine ways.
The example I'm giving here is very simple, but I'm looking for a solution that can be applied to similar problems across large data.frames.
d <- dput(structure(list(var1 = structure(c(1, 2, NA, NA, 3, NA, 1, 1), labels = structure(c(1,
2, 3, 8, 9), .Names = c("Protection of environment should be given priority",
"Economic growth should be given priority", "[DON'T READ] Both equally",
"[DON'T READ] Don't Know", "[DON'T READ] Refused")), class = "labelled")), .Names = "var1", row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")))
d$var1
<Labelled double>
[1] 1 2 NA NA 3 NA 1 1
Labels:
value label
1 Protection of environment should be given priority
2 Economic growth should be given priority
3 [DON'T READ] Both equally
8 [DON'T READ] Don't Know
9 [DON'T READ] Refused
If a value label begins with "[DON'T READ]" I want to remove "[DON'T READ]" from the beginning of the label and add "(VOL)" at the end. So, "[DON'T READ] Both equally" would now read "Both equally (VOL)."
Of course, it's straightforward to edit this individual variable with a function from haven's associated labelled package. But I want to apply this solution across all the variables in a data.frame.
library(labelled)
val_labels(d$var1) <- c("Protection of environment should be given priority" = 1,
"Economic growth should be given priority" = 2,
"Both equally (VOL)" = 3,
"Don't Know (VOL)" = 8,
"Refused (VOL)" = 9)
How can I achieve the result of the function directly above in a way that can be applied to every variable in a data.frame?
The solution must work regardless of the specific value. (In this instance it is values 3,8, & 9 that need alteration, but this is not necessarily the case).
There are a few ways to do this. You could use lapply() or (if you want a one(ish)-liner) you could use any of the scoped variants of mutate():
1) Using lapply()
This method loops over all columns with gsub() to remove the part you do not want and adds the " (VOL)" to the end of the string. Of course you could use this with a subset as well!
d[] <- lapply(d, function(x) {
labels <- attributes(x)$labels
names(labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(labels))
attributes(x)$labels <- labels
x
})
d$var1
[1] 1 2 NA NA 3 NA 1 1
attr(,"labels")
Protection of environment should be given priority Economic growth should be given priority
1 2
Both equally (VOL) Don't Know (VOL)
3 8
Refused (VOL)
9
attr(,"class")
[1] "labelled"
2) Using mutate_all()
Using the same logic (with the same result) you could change the name of the labels in a tidier way:
library(dplyr) # mutate_all() and %>%
library(purrr) # map()
d %>%
mutate_all(~{names(attributes(.)$labels) <- gsub("\\[DON'T READ\\]\\s*(.*)", "\\1 (VOL)", names(attributes(.)$labels));.}) %>%
map(attributes) # just to check on the result
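If the data.frame mixes labelled and unlabelled columns, it may be safer to skip columns without a labels attribute and to anchor the pattern at the start of the label. A minimal base R sketch under those assumptions; relabel_vol is a hypothetical helper name, not a function from haven or labelled, and the small data frame is made up for illustration:

```r
# Hypothetical helper: rewrite value labels on any vector that carries a
# "labels" attribute; vectors without one are returned unchanged.
relabel_vol <- function(x) {
  labs <- attr(x, "labels")
  if (is.null(labs)) return(x)
  # "^" anchors the match, so only labels that begin with the tag are edited
  names(labs) <- sub("^\\[DON'T READ\\]\\s*(.*)$", "\\1 (VOL)", names(labs))
  attr(x, "labels") <- labs
  x
}

# Made-up example data: one plain column, one column with value labels
d <- data.frame(id = 1:3)
d$var1 <- structure(
  c(1, 2, 3),
  labels = c("Economic growth should be given priority" = 2,
             "[DON'T READ] Both equally" = 3,
             "[DON'T READ] Refused" = 9)
)

d[] <- lapply(d, relabel_vol)
names(attr(d$var1, "labels"))
```

Because the helper only touches the names of the labels attribute, the underlying values (3, 8, 9 or anything else) do not matter.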
I have a problem.
I have the following data frame.
1        2
NA       100
1.00499  NA
1.00813  NA
0.99203  NA
It has two columns. In the second column, apart from the starting value, there are only NAs. I want to fill the first NA of column 2 by multiplying the 1st value of column 2 by the 2nd value of column 1 (100 * 1.00499). The 3rd value of column 2 should be the product of the newly created 2nd value in column 2 and the 3rd value in column 1, and so on, so that in the end all NAs are replaced by values.
These two sources have helped me understand how to refer to different rows. But in both cases a new column is created. I don't want that. I want to fill the already existing column 2.
Use a value from the previous row in an R data.table calculation
https://statisticsglobe.com/use-previous-row-of-data-table-in-r
Can anyone help me?
Thanks so much in advance.
Sample code
library(quantmod)
data.N225<-getSymbols("^N225",from="1965-01-01", to="2022-03-30", auto.assign=FALSE, src='yahoo')
data.N225[c(1:3, nrow(data.N225)),]
data.N225<- na.omit(data.N225)
N225 <- data.N225[,6]
N225$DiskreteRendite= Delt(N225$N225.Adjusted)
N225[c(1:3,nrow(N225)),]
options(digits=5)
N225.diskret <- N225[,3]
N225.diskret[c(1:3,nrow(N225.diskret)),]
N225$diskretplus1 <- N225$DiskreteRendite+1
N225[c(1:3,nrow(N225)),]
library(dplyr)
N225$normiert <-"Value"
N225$normiert[1,] <-100
N225[c(1:3,nrow(N225)),]
N225.new <- N225[,4:5]
N225.new[c(1:3,nrow(N225.new)),]
Here is the code to create the data frame in R studio.
a <- c(NA, 1.0050,1.0081, 1.0095, 1.0016,0.9947)
b <- c(100, NA, NA, NA, NA, NA)
c<- data.frame(ONE = a, TWO=b)
You could use cumprod() for the cumulative product (df is the data frame from the question, see the dput output below):
transform(
df,
TWO = cumprod(c(na.omit(TWO),na.omit(ONE)))
)
which yields
ONE TWO
1 NA 100.0000
2 1.0050 100.5000
3 1.0081 101.3140
4 1.0095 102.2765
5 1.0016 102.4402
6 0.9947 101.8972
data
> dput(df)
structure(list(ONE = c(NA, 1.005, 1.0081, 1.0095, 1.0016, 0.9947
), TWO = c(100, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
What about (gasp) a for loop?
(I'll use dat instead of c for your dataframe to avoid confusion with function c()).
for (row in 2:nrow(dat)) {
if (!is.na(dat$TWO[row-1])) {
dat$TWO[row] <- dat$ONE[row] * dat$TWO[row-1]
}
}
This means:
For each row from the second to the end, if the TWO in the previous row is not a missing value, calculate the TWO in this row by multiplying ONE in the current row and TWO from the previous row.
Output:
#> ONE TWO
#> 1 NA 100.0000
#> 2 1.0050 100.5000
#> 3 1.0081 101.3140
#> 4 1.0095 102.2765
#> 5 1.0016 102.4402
#> 6 0.9947 101.8972
Created on 2022-04-28 by the reprex package (v2.0.1)
I'd love to read a dplyr solution!
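A possible dplyr version, as a rough sketch rather than a definitive answer. It assumes dplyr is installed, that only the first value of ONE is missing, and that the starting value sits in the first row of TWO; dat is the same data frame as in the loop above:

```r
library(dplyr)

dat <- data.frame(
  ONE = c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947),
  TWO = c(100, NA, NA, NA, NA, NA)
)

# Treat the missing first factor as 1, then scale the running product
# by the starting value in the first row of TWO.
dat <- dat %>%
  mutate(TWO = first(TWO) * cumprod(coalesce(ONE, 1)))

dat
```

This overwrites the existing TWO column instead of creating a new one. If ONE had missing values further down, coalesce(ONE, 1) would silently carry the previous value forward, which may or may not be what you want.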
Hi I have a similar question to this (Filter one column by matching to another column)
For background, I'm trying to match up a code for a book name with the place where it is being used. I figured out how to use filter and grepl to narrow down the book name, but now I need to filter on whether the location names match. I can't share the data since it's private. It's a similar example to the one above, except I'm first filtering on what the animal name starts with, so imagine this:
df <- data.frame(pair = c(1, 1, 2, 2, 3, 3,4,4,4),
animal = rep(c("Elephant", "Giraffe", "Antelope"), 6),
value = seq(1, 12, 2),
drop = c("savannah", "savannah", "jungle", "jungle", "zoo", "unknown", "unknown", "zoo", "my house"))
zoo_animals <- filter(df, grepl("Gir|Ele", animal))
What I'm not sure how to do is build on that to see whether the location column matches between entries. Is it just & location = location?
What I want is to find out whether there is both an elephant and a giraffe from the zoo. What about the savannah? From the data I made, it appears the only match is savannah, so it would print those data points, giving a df like this:
pair  animal    value  drop
1     Elephant  7      savannah
1     Giraffe   3      savannah
1     Giraffe   3      savannah
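One way to express "keep a location only if both animals occur there" is a grouped filter. A minimal sketch, assuming dplyr is available; the small data frame is made up for illustration and is not the data from the question:

```r
library(dplyr)

# Made-up example data
df <- data.frame(
  animal = c("Elephant", "Giraffe", "Elephant", "Giraffe", "Antelope", "Elephant"),
  drop   = c("savannah", "savannah", "jungle", "zoo", "zoo", "unknown"),
  stringsAsFactors = FALSE
)

both <- df %>%
  filter(grepl("^(Gir|Ele)", animal)) %>%                     # narrow down by name first
  group_by(drop) %>%                                          # one group per location
  filter(all(c("Elephant", "Giraffe") %in% animal)) %>%       # keep groups containing both
  ungroup()

both
```

Comparing a column with itself (location == location) is always TRUE, which is why the check has to be made per group instead.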
I'm trying to use choroplethr to make a map at the county level. Currently, I have 3 categorical integers (1, 2, 3) in my csv under the column value which vary depending on each county. The region column contains county fips.
I want to display the following values as the respective label and color (value = label = color):
0 = "None" = "white", 1 = "MD" = "#64acbe", 2 = "DO" = "#c85a5a", 3 = "Both" = "#574249"
I've tried several combinations of scale_fill_brewer without the results I'm looking for. Any assistance would be great. Here's code that simulates the data I'm using:
library(choroplethr)
library(ggplot2)
library(choroplethrMaps)
Res <- data.frame(
region = c(45001, 22001, 51001, 16001, 19001, 21001, 29001, 40001, 8001, 19003, 16003, 17001, 18001, 28001, 38001, 31001, 39001, 42001, 53001, 55001, 50001, 72001, 72003, 72005, 72007, 72009, 45003, 27001),
value = c(0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3),
stringsAsFactors = FALSE)
county_choropleth(Res,
title = "All United States Medical Residencies",
legend = "Types of Medical Residencies"
)
Thank you for using Choroplethr.
I think that there are a few issues here. The first one I'd like to address is that your value column contains numeric data. This by itself is not a problem. But because you are actually using it to code categorical data (i.e. "MD", "DO", etc.), it becomes one. Therefore my first task will be to change the data type from numeric to character:
> class(Res$value)
[1] "numeric"
> Res$value = as.character(Res$value)
> class(Res$value)
[1] "character"
Now I will replace the "numbers" with the category names that you want:
> Res[Res=="0"] = "None"
> Res[Res=="1"] = "MD"
> Res[Res=="2"] = "DO"
> Res[Res=="3"] = "Both"
> head(Res)
region value
1 45001 None
2 22001 MD
3 51001 DO
4 16001 Both
5 19001 None
6 21001 MD
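As an aside, the same recoding can be written in one step with factor(), which only touches the value column and turns any unexpected code into NA. This is a sketch with a shortened, made-up version of the data; I am assuming the character version is what gets passed on to choroplethr, as in the steps above:

```r
# Shortened example data: one county per code
Res <- data.frame(
  region = c(45001, 22001, 51001, 16001),
  value  = c(0, 1, 2, 3)
)

# Map codes 0..3 to their category names in a fixed order
Res$value <- factor(Res$value,
                    levels = 0:3,
                    labels = c("None", "MD", "DO", "Both"))

# Convert to plain character to match the approach above
Res$value <- as.character(Res$value)
Res
```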
Now for the second issue. You said that you were trying to use scale_fill_brewer. That function is for using the Brewer scales. But you don't want those here. You say that you have your own scale. So you want to use scale_fill_manual.
county_choropleth(Res) +
scale_fill_manual(values=c("None" = "#ffffffff",
"MD" = "#64acbe",
"DO" = "#c85a5a",
"Both" = "#574249"),
name="Types of Medical Residencies")
Note: What choroplethr calls the "legend" is really the name of the legend, and that name is a property of the ggplot2 scale. So if you are using your own scale, you cannot use choroplethr's legend parameter any more.
Of course, now we have a new problem: Alaska and Hawaii are all black. I actually forgot about this issue (it's been a while since I worked on choroplethr). The reason this happens is very technical, and perhaps more detailed than you care for, but I will mention it here for completeness: choroplethr uses ggplot2 annotations to render AK and HI in the proper place. The choroplethr + ggplot_scale paradigm does not work here for AK and HI because ggplot2 does not propagate additional layers / scales to annotations. To get around this we must use the object-oriented features of choroplethr:
c = CountyChoropleth$new(Res)
c$title = "All United States Medical Residencies"
c$ggplot_scale = scale_fill_manual(values=c("None" = "#ffffffff", "MD" = "#64acbe", "DO" = "#c85a5a", "Both" = "#574249"),
name="Types of Medical Residencies")
c$render()
I have a data frame with categorical values that were entered manually and there are several mistakes. Someone cleaned up the bad data and I loaded that into R and merged that with the rest of my data. Everything so far is good.
As an example, let's say this is the data I have with the original (mix of good and bad data) in the "Value" column and the corrections of the bad data in the "Value_Clean" column. Obviously this is a small example but my actual data frame has dozens of corrections of different values and several thousand rows.
test <- data.frame(ID = c(1, 2, 3)
, Value = c("Discuss plan", "Discuss plan", "Discuss paln")
, Value_Clean = c(NA, NA, "Discuss plan"))
I would like to create a new column called "Value_Final" that has "Discuss plan" for IDs 1, 2, and 3.
It seems pretty straightforward that I should be able to accomplish this with an ifelse:
test$Value_Final <- ifelse(is.na(test$Value_Clean), test$Value, test$Value_Clean)
However, when I do that I get the following:
> test
ID Value Value_Clean Value_Final
1 1 Discuss plan <NA> 2
2 2 Discuss plan <NA> 2
3 3 Discuss paln Discuss plan 1
What the hell? I feel like I've done similar things with ifelse in R without running into this issue, so what is going on?
Thanks!
This is a case of a factor being coerced to its integer storage values. It can be corrected with stringsAsFactors = FALSE when creating the data.frame:
test <- data.frame(ID = c(1, 2, 3)
, Value = c("Discuss plan", "Discuss plan", "Discuss paln")
, Value_Clean = c(NA, NA, "Discuss plan"), stringsAsFactors = FALSE)
ifelse(is.na(test$Value_Clean), test$Value, test$Value_Clean)
#[1] "Discuss plan" "Discuss plan" "Discuss plan"
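To make the coercion visible, here is a small sketch that forces factor columns with stringsAsFactors = TRUE (that was the default before R 4.0.0, so newer versions need it set explicitly to reproduce the problem). The variable names codes and fixed are only for illustration:

```r
test <- data.frame(ID = c(1, 2, 3),
                   Value = c("Discuss plan", "Discuss plan", "Discuss paln"),
                   Value_Clean = c(NA, NA, "Discuss plan"),
                   stringsAsFactors = TRUE)

# Levels are sorted alphabetically, so "Discuss paln" is 1 and "Discuss plan" is 2
levels(test$Value)

# ifelse() drops the factor attributes and keeps only the integer codes
codes <- ifelse(is.na(test$Value_Clean), test$Value, test$Value_Clean)

# Converting to character first keeps the text
fixed <- ifelse(is.na(test$Value_Clean),
                as.character(test$Value),
                as.character(test$Value_Clean))
```

The third code is 1 because Value_Clean has a single level, so "Discuss plan" is stored as 1 there, while in Value it is stored as 2.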
Or if the data is already created, then can convert to character with as.character
test[1:2] <- lapply(test[1:2], as.character)
Or do this within the ifelse
ifelse(is.na(test$Value_Clean), as.character(test$Value),
as.character(test$Value_Clean))
The dplyr version of ifelse doesn't have this issue
ifelse(is.na(test$Value_Clean), test$Value, test$Value_Clean)
# [1] 2 2 1
dplyr::if_else(is.na(test$Value_Clean), test$Value, test$Value_Clean)
# [1] Discuss plan Discuss plan Discuss plan
# Levels: Discuss paln Discuss plan
FYI for this particular example you might use coalesce instead
dplyr::coalesce(test$Value_Clean, test$Value)
# [1] Discuss plan Discuss plan Discuss plan
# Levels: Discuss plan
You could try dplyr and tibbles as an alternative:
library(dplyr)
tibble(ID = c(1, 2, 3)
, Value = c("Discuss plan", "Discuss plan", "Discuss paln")
, Value_Clean = c(NA, NA, "Discuss plan")) %>%
mutate(Value_Final = ifelse(is.na(Value_Clean), Value, Value_Clean))
Tibbles don't convert character columns to factors by default, which comes in handy in many cases.
Edit:
use as_tibble(dataframe) to convert an existing dataframe to a tibble
I am working with a set of excel spreadsheets which has column names which are dates.
After reading in the data with readxl::read_xlsx(), these column names become excel index dates (i.e. an integer representing days elapsed from 1899-12-30)
Is it possible to use dplyr::rename_if() or similar to rename all column names that are currently integers? I have written a function rename_func that I would like to apply to all such columns.
df %>% rename_if(is.numeric, rename_func) is not suitable as is.numeric is applied to the data in the column not the column name itself. I have also tried:
is.name.numeric <- function(x) is.numeric(names(x))
df %>% rename_if(is.name.numeric, rename_func)
which does not work and does not change any names (i.e. is.name.numeric returns FALSE for all cols)
edit: here is a dummy version of my data
df_badnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), `38718` = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
`38749` = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
`38777` = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
`38808` = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
and I would like:
df_goodnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), Jan2006 = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
Feb2006 = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
Mar2006 = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
Apr2006 = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
I understand that it is best practice to create a date column and change the shape of this df, but I need to join a few spreadsheets first and having integer column names causes a lot of problems. I currently have a work around but the crux of my question (apply a rename_if predicate to a name, rather than a column) is still interesting.
Although the names look numeric, they are not:
class(names(df_badnames))
#[1] "character"
so they would not be caught by is.numeric or similar other functions.
One way to do this is find out which names can be coerced to numeric and then convert them into the date format of our choice
cols <- suppressWarnings(as.numeric(names(df_badnames))) # non-numeric names become NA
names(df_badnames)[!is.na(cols)] <- format(as.Date(cols[!is.na(cols)],
origin = "1899-12-30"), "%b%Y")
df_badnames
# Level Title Jan2006 Feb2006 Mar2006 Apr2006
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 AUSTRALIAN TOTAL 213778. 216274. 218564. 220065.
#2 2 MANAGERS 20998. 21316. 21672. 22012.
#3 3 Chief Executives, Managing Directors & Legisla… 502. 498. 494. 489.
#4 3 Farmers and Farm Managers 121. 119. 118. 116.
#5 3 Hospitality, Retail and Service Managers 4403. 4469. 4541. 4609.
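On the crux of the question (a predicate on the name rather than on the column): one option is to select columns by a name pattern. A sketch, assuming dplyr 1.0.0 or later, where rename_with() is available; excel_to_label is a hypothetical stand-in for rename_func, and the data frame is a cut-down version of df_badnames:

```r
library(dplyr)

# Hypothetical helper: Excel serial day number (as text) -> "Jan2006" style label.
# month.abb is used instead of format(..., "%b") so the result does not depend on locale.
excel_to_label <- function(nm) {
  d <- as.Date(as.numeric(nm), origin = "1899-12-30")
  paste0(month.abb[as.integer(format(d, "%m"))], format(d, "%Y"))
}

df_small <- data.frame(Level = c(1, 2),
                       `38718` = c(213777.89, 20997.52),
                       `38749` = c(216274.12, 21316.05),
                       check.names = FALSE)

# matches() tests the column names, so only all-digit names are renamed
df_renamed <- df_small %>%
  rename_with(excel_to_label, .cols = matches("^[0-9]+$"))

names(df_renamed)
```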