I have a data frame with about 50 pairs of overlapping columns that I need to combine. Below is a snippet of what the data frame looks like (it is about 150 rows long and several hundred columns wide).
ID PAI_Q1.y PAI_Q1.x
540 0 NA
680 1 NA
240 NA 2
330 NA 3
For a single column, the following code works perfectly:
qualtrics <- qualtrics %>%
mutate(PAI_Q1 = ifelse(is.na(PAI_Q1.y), PAI_Q1.x, PAI_Q1.y))
However, I'm having trouble writing this into a loop or a function across all of the columns that need to be converted (i.e., PAI_Q2, PAI_Q3, etc.). Below are the two attempts I've made so far. Does anyone have suggestions for tweaks (or know of an existing function) that would let me do this basic task iteratively?
Attempt #1
mutate_col <- function(data, string, string.x, string.y){
data <- data %>% mutate(string = ifelse(is.na(string.y), string.x, string.y))
}
Error: Problem with `mutate()` column `string`.
ℹ string = ifelse(is.na(string.y), string.x, string.y).
x object 'PAI_Q1.y' not found
Attempt #2
for (i in 1:colnames(df)){
if(names(i) %in% list_of_cols){ #list of columns that must be combined
y <- paste(i, ".y", sep = "")
x <- paste(i, ".x", sep = "")
df <- df %>% mutate(i = ifelse(is.na(y), x, y))
}
}
ID PAI_Q1.y PAI_Q1.x i.1
540 0 NA PAI.Q1.y
680 1 NA PAI.Q1.y
240 NA 2 PAI.Q1.y
330 NA 3 PAI.Q1.y
You can use tidyr to change the data from wide to long, strip the .x/.y suffix from the column names, and then change back to wide format.
library(stringr)
library(tidyr)
library(dplyr)
qualtrics <- qualtrics %>%
tidyr::pivot_longer(
!ID,
names_to = "question",
values_to = "value",
values_drop_na = TRUE
) %>%
dplyr::mutate(question = stringr::str_extract(question, "[^.]+")) %>%
tidyr::pivot_wider(names_from = question, values_from = value)
Output
# A tibble: 4 × 3
ID PAI_Q1 PAI_Q2
<dbl> <dbl> <dbl>
1 540 0 0
2 680 1 1
3 240 2 2
4 330 3 3
Data
qualtrics <-
structure(
list(
ID = c(540, 680, 240, 330),
PAI_Q1.y = c(0, 1,
NA, NA),
PAI_Q1.x = c(NA, NA, 2, 3),
PAI_Q2.y = c(0, 1, NA, NA),
PAI_Q2.x = c(NA, NA, 2, 3)
),
class = "data.frame",
row.names = c(NA,-4L)
)
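A more direct alternative can be sketched with dplyr::coalesce(), which returns the first non-missing value across its arguments. The helper name combine_pair and the stems vector are illustrative assumptions, not part of the original answer. The sketch also shows why Attempt #1 failed: inside mutate(), a bare name such as string is treated as a literal column name, so column names held in strings have to go through .data[[ ]] on the right-hand side and !! with := on the left.

```r
library(dplyr)

qualtrics <- data.frame(
  ID = c(540, 680, 240, 330),
  PAI_Q1.y = c(0, 1, NA, NA),
  PAI_Q1.x = c(NA, NA, 2, 3),
  PAI_Q2.y = c(0, 1, NA, NA),
  PAI_Q2.x = c(NA, NA, 2, 3)
)

# Hypothetical helper: merge one <name>.y / <name>.x pair into <name>,
# preferring .y and falling back to .x, then drop the two source columns.
combine_pair <- function(data, name) {
  x <- paste0(name, ".x")
  y <- paste0(name, ".y")
  data %>%
    mutate(!!name := coalesce(.data[[y]], .data[[x]])) %>%
    select(-all_of(c(x, y)))
}

# Base names of every column that ends in .x or .y
stems <- unique(sub("\\.[xy]$", "", grep("\\.[xy]$", names(qualtrics), value = TRUE)))

# Apply the helper once per base name, feeding each result into the next call
combined <- Reduce(combine_pair, stems, qualtrics)
```

Columns that do not end in .x or .y are left untouched, so this should also work when only some of the several hundred columns are paired.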
Related
I was working on something I thought would be simple, but maybe today my brain isn't working. My data is like this:
df <- tibble(metric = c('income', 'income_low', 'income_upp', 'n', 'n_low', 'n_upp'),
value = c(120, 100, 140, 10, 8, 12))
metric value
income 120
income_low 100
income_upp 140
n 10
n_low 8
n_upp 12
And I want to pivot_wider so it looks like this:
metric value value_low value_upp
income 120 100 140
n 10 8 12
I'm having trouble separating the metrics, because pivot_wider as is produces a data frame that's too wide:
df %>% pivot_wider(names_from = 'metric', values_from = value)
How can I achieve this or should I pivot longer after the pivot wider?
Thanks!
I think if you convert metric into a column with "value", "value_upp" and "value_low" values, you can pivot_wider:
df %>%
mutate(param = case_when(str_detect(metric, "upp") ~ "value_upp",
str_detect(metric, "low") ~ "value_low",
TRUE ~ "value"),
metric = str_remove(metric, "_low|_upp")) %>%
pivot_wider(names_from = param, values_from = value)
I like to use separate() when I have text in a column like this. This function lets you split a column into multiple columns when the values contain a separator.
In particular in this example we would want to use the arguments sep="_" and into = c("metric", "state") to convert into columns with those names.
Then mutate() and pivot_wider() can be used as you had previously specified.
library(tidyverse)
df <- tribble(~metric, ~value,
"income", 120,
"income_low", 100,
"income_upp", 140,
"n", 10,
"n_low", 8,
"n_upp", 12)
df |>
separate(metric, sep = "_", into = c("metric", "state")) |>
mutate(state = ifelse(is.na(state), "value", state)) |>
pivot_wider(id_cols = metric, names_from = state, values_from = value, names_sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 4].
#> # A tibble: 2 × 4
#> metric value low upp
#> <chr> <dbl> <dbl> <dbl>
#> 1 income 120 100 140
#> 2 n 10 8 12
Created on 2022-12-21 with reprex v2.0.2
Note you can use the argument names_glue or names_prefix in pivot_wider() to add the "value" as a prefix to the column names.
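One way to get the value_-prefixed names from the question is sketched below. It builds the final column names before pivoting rather than using names_glue or names_prefix, and it assumes that metric names contain at most one underscore. Passing fill = "right" to separate() also silences the "Expected 2 pieces" warning.

```r
library(dplyr)
library(tidyr)

df <- tribble(~metric,      ~value,
              "income",     120,
              "income_low", 100,
              "income_upp", 140,
              "n",          10,
              "n_low",      8,
              "n_upp",      12)

wide <- df %>%
  # fill = "right" pads the missing second piece with NA without warning
  separate(metric, into = c("metric", "state"), sep = "_", fill = "right") %>%
  # rows without a suffix become "value"; the others become "value_<suffix>"
  mutate(state = if_else(is.na(state), "value", paste0("value_", state))) %>%
  pivot_wider(names_from = state, values_from = value)
```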
A data.table approach (if you can live with the trailing underscore after value_):
library(data.table)
setDT(df)
# create some new columns based on metric
df[, c("first", "second") := tstrsplit(metric, "_")]
# metric value first second
# 1: income 120 income <NA>
# 2: income_low 100 income low
# 3: income_upp 140 income upp
# 4: n 10 n <NA>
# 5: n_low 8 n low
# 6: n_upp 12 n upp
# replace NA with ""
df[is.na(df)] <- ""
# now cast to wide, creating colnames on the fly
dcast(df, first ~ paste0("value_", second), value.var = "value")
# first value_ value_low value_upp
# 1: income 120 100 140
# 2: n 10 8 12
I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual categories.
I have tried using ifelse, but it only shows me the values of IPV_YES.
Dataset I have
My desired outcome
my answer
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
group_by(ID) %>%
rowwise() %>%
transmute(IPV = sum(c_across(everything()), na.rm = T))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, give the value in df$IPV_YES; otherwise give the value in df$IPV_NO from the same row.
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below
replace(df, df == "", NA) %>%
mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
select(ID, IPV) %>%
type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
I'm scraping car information from a website, and the data I get is inconsistent and not very clean. I'm trying to clean and arrange this data into a data frame.
For example:
dd <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel", "width"), value = 1:6, model = "a", stringsAsFactors = F)
dd
measure value model
1 wheel 1 a
2 wheel 2 a
3 length 3 a
4 width 4 a
5 wheel 5 a
6 width 6 a
In this example, I have 3 values of wheel and 2 of width. In my real data, it's not always the same measure that is repeated; a measure may or may not be repeated, and it could be repeated more than once.
I need to reshape this table to have one row per model; however, I don't want to aggregate the values that share a measure. Precisely, I want the table to become:
model length wheel wheel1 wheel2 width width1
1 a 3 1 2 5 4 6
This was obtained using dcast on manually modified data:
library(reshape2)
res <- data.frame(measure = c("wheel", "wheel1", "length", "width", "wheel2", "width1"), value = 1:6, model = "a", stringsAsFactors = F)
dcast(res, model ~ measure)
I need either a way to modify dcast so it doesn't aggregate the measure or automatically modify dd so it becomes res.
I've tried something ugly and not exactly what I needed:
dd[duplicated(dd$measure), "measure"] <- paste0(dd[duplicated(dd$measure), "measure"] , 1:3)
dd
measure value model
1 wheel 1 a
2 wheel1 2 a
3 length 3 a
4 width 4 a
5 wheel2 5 a
6 width3 6 a
This code doesn't work because width gets the index 3 rather than 1. It also doesn't adapt to another table such as:
dd2 <- data.frame(measure = c("wheel", "wheel", "length", "width", "wheel"), value = 1:5, model = "a", stringsAsFactors = F)
dd2[duplicated(dd2$measure), "measure"] <- paste0(dd2[duplicated(dd2$measure), "measure"] , 1:3)
Error in `[<-.data.frame`(`*tmp*`, duplicated(dd2$measure), "measure", :
replacement has 3 rows, data has 2
Anyway, how could I modify my variable measure dynamically so all the words are unique?
You can use dplyr::mutate as below:
dd <- dd %>%
group_by(model, measure) %>%
mutate(measure2 = paste0(measure, ifelse(row_number() > 1, row_number() - 1, ""))) %>%
ungroup() %>%
mutate(measure = measure2) %>%
select(measure, model, value)
dd
# A tibble: 6 x 3
measure model value
<chr> <chr> <int>
1 wheel a 1
2 wheel1 a 2
3 length a 3
4 width a 4
5 wheel2 a 5
6 width1 a 6
A different tidyverse possibility could be:
dd %>%
arrange(model, measure) %>%
group_by(model, measure) %>%
mutate(var = paste(measure, seq_along(measure), sep = "_")) %>%
ungroup() %>%
select(-measure) %>%
spread(var, value)
model length_1 wheel_1 wheel_2 wheel_3 width_1 width_2
<chr> <int> <int> <int> <int> <int> <int>
1 a 3 1 2 5 4 6
make.unique does just that:
dd$measure <- make.unique(dd$measure,sep = "")
dd
# measure value model
# 1 wheel 1 a
# 2 wheel1 2 a
# 3 length 3 a
# 4 width 4 a
# 5 wheel2 5 a
# 6 width1 6 a
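When the table holds more than one model, the numbering presumably needs to restart for each model. A minimal base R sketch, assuming a second model "b" that is not in the original data, applies make.unique() within each model via ave() and then reshapes to one row per model:

```r
# Example data extended with a hypothetical second model "b"
dd <- data.frame(
  measure = c("wheel", "wheel", "length", "width", "wheel", "width", "wheel", "wheel"),
  value = 1:8,
  model = c(rep("a", 6), "b", "b"),
  stringsAsFactors = FALSE
)

# Make measure names unique separately within each model
dd$measure <- ave(dd$measure, dd$model, FUN = function(m) make.unique(m, sep = ""))

# One row per model; measures a model does not have are filled with NA
wide <- reshape(dd, direction = "wide", idvar = "model", timevar = "measure")
```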
You also could renumber the values with a sapply
sapply(unique(dd$measure), function(x) {
z <- dd$measure[dd$measure %in% x]
if (length(z) > 1)
dd$measure[dd$measure %in% x] <<- paste0(z, ".", seq(length(z)))
})
and use reshape after.
reshape(dd, direction="wide", timevar="measure", idvar="model")
# model value.wheel.1 value.wheel.2 value.length value.width.1 value.wheel.3 value.width.2
# 1 a 1 2 3 4 5 6
Data
dd <- structure(list(measure = c("wheel", "wheel", "length", "width", "wheel", "width"),
value = 1:6, model = c("a", "a", "a", "a", "a", "a")),
class = "data.frame", row.names = c(NA, -6L))
Say I have 900 dataframes at hand, and I wanted to get something similar to a frequency distribution based off of another column for each "type".
Sample code:
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
This is similar to what I currently have just working with one dataframe:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
In the case that for any row, Sepal.Length = 0, then I'm not going to count it (count column will be = 0 for that row value).
But I'm going to have 900 dataframes that I'll be running this code on, so I was thinking about running it through a loop.
Ideally, if two dataframes are inputted, and both have Sepal.Length values >0 for row 1, then I want the count to be 2 for row 1 / type 1. Is there a better way to approach this? And if I do go for the looping option then is there a way to combine all the dataframes to tell R that row 1 / type 1 has multiple > 0 values?
For your iris example, what it sounds like you want is:
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
group_by(df2, type) %>%
transmute(has_sepal = sum(Sepal.Length > 0))
# A tibble: 7 x 2
# Groups: type [7]
# type has_sepal
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 1
# 7 7 1
To do this over 900 data frames, wrap the code in a function. If you only need it to work on data shaped like the iris example, hard-code the column names. Someone who is familiar with writing functions using tidyverse evaluation could write a more general version for you, but that's still on my to-do list.
# Template (not runnable as is): replace <var1> and <var2> by hand
f_fill_in_blank_first <- function(tib){
# hard code the var1 and var2
group_by(tib, <var1>) %>%
transmute(var1_not_zero = sum(<var2> != 0))
}
# The template filled in for the iris example
f_iris <- function(tib){
group_by(tib, type) %>%
transmute(var1_not_zero = sum(Sepal.Length != 0))
}
Depending on the structure of your 900 data frames, you could use vapply to put the whole thing into an array, then collapse one of the dimensions with apply and sum. (Edit: not with this function as written; refactor it so that it returns a named atomic vector if you want to vapply it.)
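The idea of stacking everything and then collapsing it can also be sketched without vapply. This assumes every data frame has the same type and Sepal.Length columns. The second data frame df3 and the list name frames are made up for illustration.

```r
library(dplyr)

# Two stand-in data frames; in df3 the first type has a zero Sepal.Length
df2 <- as_tibble(iris) %>% slice(1:7) %>% mutate(type = 1:7)
df3 <- df2 %>% mutate(Sepal.Length = if_else(type == 1L, 0, Sepal.Length))
frames <- list(df2, df3)

# Stack all data frames, then count per type how many have a positive value
counts <- bind_rows(frames, .id = "source") %>%
  group_by(type) %>%
  summarise(Count = sum(Sepal.Length > 0),
            Percentage = Count / n())
```

Here Percentage is the share of data frames with a positive value for that type, which is one possible reading of the question.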
If you want to keep your code:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
You can wrap it into a function (add_a_count):
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- df1 %>%
mutate(type = seq_len(nrow(df1)))
add_a_count = function(df)
{
counted_df = df %>%
select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0),
Percentage = Count/7)
return(counted_df)
}
I generate 100 duplicates of the test df2 with the following function:
duplicate_df = function(df, no_duplicates)
{
tmp_df_list = list()
for(i in c(1:no_duplicates))
{
print(paste0("Duplicate ", i, " generated."))
tmp_df_list[[i]] = df
}
return(tmp_df_list)
}
data_frames_list = duplicate_df(df = df2, no_duplicates = 100)
And use it with lapply:
counted_data_frames = lapply(data_frames_list, add_a_count)
The list counted_data_frames can relatively easily be manipulated (You can use another apply function if you want a non-list output). This might not be the fastest way to do it, but it's straightforward.
EDIT
You can get your Counts columns via looping over the list of data frames. A new data frame counts_data_frame contains all counts with every column being counts of one original data frame:
counts_data_frame = data.frame(type = seq(from = 1, to = nrow(df2)))
for(i in c(1:length(counted_data_frames)))
{
counts_data_frame = cbind(counts_data_frame, as.vector(counted_data_frames[[i]]["Count"]))
}
When looping over the rows of this new data frame, you can sum up your counts and get a vector of counts for plotting:
counts_summarised = vector(length = nrow(counts_data_frame))
for(i in c(1:nrow(counts_data_frame)))
{
counts_summarised[i] = sum(counts_data_frame[i, 2:ncol(counts_data_frame)])
}
plot(counts_summarised, ylab = "Counts", xlab = "Type")
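The two loops can probably be condensed: collect every Count column into a matrix and sum across its rows. A minimal sketch, using a small stand-in for counted_data_frames (the three data frames are invented for illustration):

```r
# Stand-in for counted_data_frames: three data frames with a Count column
counted_data_frames <- list(
  data.frame(type = 1:3, Count = c(1, 0, 1)),
  data.frame(type = 1:3, Count = c(1, 1, 1)),
  data.frame(type = 1:3, Count = c(0, 0, 1))
)

# One column per data frame, one row per type
count_matrix <- vapply(counted_data_frames, function(d) d$Count, numeric(3))

# Total count per type across all data frames
counts_summarised <- rowSums(count_matrix)
```

The numeric(3) template assumes every data frame has exactly three rows; with real data it would be numeric(nrow(df2)).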
In this solution, I will show you how to:
import all CSV files into separate data frames in a list, assuming that they all have the same column name for the variable you are interested in and that the files are in one folder (preferably your working directory);
count the number of zero and nonzero measurements and their proportions;
convert the list into a data frame.
Specifically, I use lapply() to loop through the data frames, enframe() to convert the list to a data frame, unnest() to unnest the value column, and spread() to spread pct by type.
Let's first create some data to work with.
library(tidyverse)
# create a list
datlist <- list()
# this list will contain ten data frames with
# a sample with up to 8 0's and 20 random uniforms as observations
for (i in seq_len(10)){
datlist[[i]] = data.frame(x = sample(c(sample(c(0,1,2,3,4), 8, replace = T), runif(20,0,10))))
}
# name each element of the list datlist
name_element <- LETTERS[1:10]
datlist <- set_names(datlist, name_element)
# save each file separately
mapply(write.csv, datlist, file=paste0(names(datlist), '.csv'), row.names = FALSE)
The following will import your data into R and store them as data.frames in a list.
# import all csv files in the folder into separate data frames in the temp list
temp <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
myfiles <- lapply(temp, read.csv)
The following will calculate the percentages by type if we assume that each file contains the same variables.
# Calculate the frequency and relative distributions
lapply(myfiles,
function(varname) mutate(varname, type = if_else(x == 0, 0, 1)) %>%
group_by(type) %>% summarise(n = n()) %>%
mutate(pct = n / sum(n))
) %>%
enframe() %>% # convert the list into a data.frame
unnest(value) %>% # unnest the values
spread(type, pct) # spread the values by type
# A tibble: 17 x 4
name n `0` `1`
<int> <int> <dbl> <dbl>
1 1 3 0.107 NA
2 1 25 NA 0.893
3 2 28 NA 1.00
4 3 1 0.0357 NA
5 3 27 NA 0.964
6 4 2 0.0714 NA
7 4 26 NA 0.929
8 5 28 NA 1.00
9 6 28 NA 1.00
10 7 2 0.0714 NA
11 7 26 NA 0.929
12 8 3 0.107 NA
13 8 25 NA 0.893
14 9 1 0.0357 NA
15 9 27 NA 0.964
16 10 1 0.0357 NA
17 10 27 NA 0.964
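The output above has two rows per file, padded with NA, because the n column is still present when the values are spread. A sketch that yields one row per file, assuming the same single column x in every data frame, drops n before widening and fills absent categories with 0. The two data frames here are small made-up stand-ins for the imported files, and pivot_wider() is used in place of spread().

```r
library(dplyr)
library(tidyr)

# Stand-ins for the imported files
myfiles <- list(
  A = data.frame(x = c(0, 1.5, 2, 0)),
  B = data.frame(x = c(3, 4, 5, 6))
)

result <- bind_rows(myfiles, .id = "name") %>%
  mutate(type = if_else(x == 0, "zero", "nonzero")) %>%
  count(name, type) %>%                 # rows per file and type
  group_by(name) %>%
  mutate(pct = n / sum(n)) %>%          # proportion within each file
  ungroup() %>%
  select(-n) %>%                        # drop n so each file collapses to one row
  pivot_wider(names_from = type, values_from = pct, values_fill = 0)
```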
I have merged data downloaded from different sources. The data is annual (one observation per year), but the dates are not consistently "anchored", e.g. I have "1992-12-31" or "1993-01-01". What is the recommended way to handle this sort of data? How best to merge certain rows within a data.frame, based on a criterion of 'closeness' in the dates?
There are existing questions and answers about merging rows within a data frame, which could be applied to my situation with some adaptation, but my question here is specific to dates and the problem of handling the coexistence of "1992-12-31" and "1993-01-01" in annual data. The data I have comes from institutions like the OECD, IMF, and World Bank. Perhaps a clever package already knows the standard conventions of these institutions?
I am interested in both efficiency and readability of the code. I am also very much open to a data.table solution. Related question/answer not specifically about dates: how do I replace numeric codes with value labels from a lookup table?
Input:
df <- structure(list(year = c("1992-12-31", "1993-01-01", "1993-12-31", "1994-01-01"), x = c(NA, 1, NA, 4), y = c(2, NA, 3, NA)), .Names = c("year", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
df
## year x y
##1 1992-12-31 NA 2
##2 1993-01-01 1 NA
##3 1993-12-31 NA 3
##4 1994-01-01 4 NA
Desired Output:
df2
## year x y
##1 1993-01-01 1 2
##2 1994-01-01 4 3
(assuming a mapping like this "1992-12-31" = "1993-01-01")
A solution:
key <- c("1992-12-31" = "1993-01-01",
"1993-12-31" = "1994-01-01")
matched <- match(df$year, names(key))
df$year <- ifelse(is.na(matched),
df$year, key[matched])
df
## year x y
##1 1993-01-01 NA 2
##2 1993-01-01 1 NA
##3 1994-01-01 NA 3
##4 1994-01-01 4 NA
df <- aggregate(x = df[c("x","y")],
by = list(year = df$year), mean, na.rm = TRUE)
df
## year x y
##1 1993-01-01 1 2
##2 1994-01-01 4 3
But I'm eager to learn if there is a cleverer way.
Side remark: I do realize that my existing dataset is already amenable to plotting, e.g. with base R or with ggplot2(Hadley Wickham):
plot(df$x, df$y)
library(ggplot2)
ggplot(df, aes(x = year)) + geom_point(aes(y = x)) + geom_point(aes(y = y))
One solution using library dplyr is to assign ids to groups of dates that belong together and then summarize based on those groups:
library(dplyr)
df %>%
arrange(year) %>%
mutate(id = cumsum(as.numeric(difftime(lead(year, default = max(year)), year, units = 'days')) == 1)) %>%
group_by(id) %>%
summarise(year = max(year), x = x[2], y = y[1]) %>%
select(-id)
Output is as follows:
Source: local data frame [2 x 3]
year x y
(chr) (dbl) (dbl)
1 1993-01-01 1 2
2 1994-01-01 4 3
Maybe add one day for all dates, then round the dates to YYYYMM, then summarise.
library(lubridate)
library(dplyr)
#add one day then group
df %>%
mutate(year = ymd(year),
YYYYMM = format(year + 1, "%Y%m")) %>%
group_by(YYYYMM) %>%
summarise(x = sum(x, na.rm = TRUE),
y = sum(y, na.rm = TRUE))
#output
# YYYYMM x y
# (chr) (dbl) (dbl)
# 1 199301 1 2
# 2 199401 4 3
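The "closeness" criterion can also be expressed as snapping each date to the nearest 1 January. A base R sketch, assuming every date lies within about half a year of the anchor it belongs to; the helper first_non_na and the column yr are illustrative names:

```r
df <- data.frame(
  year = c("1992-12-31", "1993-01-01", "1993-12-31", "1994-01-01"),
  x = c(NA, 1, NA, 4),
  y = c(2, NA, 3, NA)
)

# Shift each date by roughly half a year and keep the year part, so that
# 31 December and the following 1 January land in the same year
df$yr <- as.integer(format(as.Date(df$year) + 182, "%Y"))

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else NA
}

merged <- aggregate(df[c("x", "y")], by = list(year = df$yr), FUN = first_non_na)
```

Unlike mean or sum, first_non_na makes it explicit that each group is expected to hold one real value per column, with the rest missing.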