I want to split a column into multiple binary dummy columns. My data frame, df:
id size age
1 6 10
2 7 11
3 8 10
At the moment I have this code, using the qdapTools and caret packages:
df <- cbind(df [1:3],mtabulate(strsplit(as.character(df$age), ':')))
My question: how can I give a title to these dummy columns, so I get this:
id size age_10 age_11
1 6 1 0
2 7 0 1
3 8 1 0
You can try dummy.data.frame from the dummies package.
library(dummies)
library(dplyr)
df %>%
  dummy.data.frame(names = "age", sep = "_")
Output is:
id size age_10 age_11
1 1 6 1 0
2 2 7 0 1
3 3 8 1 0
Sample data:
df <- structure(list(id = 1:3, size = 6:8, age = c(10L, 11L, 10L)), .Names = c("id",
"size", "age"), class = "data.frame", row.names = c(NA, -3L))
Update:
For the error you are getting on your actual data (quoted here), you can use the code that follows:
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
library(dummies)
library(dplyr)
df %>%
  data.frame() %>%
  dummy.data.frame(names = "Verkoopkanaal_groepering", sep = "_")
To rename by index: colnames(df)[4:5] <- c("age_10", "age_11")
To rename by existing column name: colnames(df)[colnames(df) == "INSERT_COL_NAME"] <- "NEW_COL_NAME"
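As a sketch of an alternative that needs no extra package (this is not from the answers above): base R's model.matrix can build the indicator columns, and the column names can then be set explicitly. The names dummies and result are only illustrative.

```r
# Sample data from the question
df <- data.frame(id = 1:3, size = 6:8, age = c(10L, 11L, 10L))

# One indicator column per distinct age; "- 1" drops the intercept
# so that every level gets its own 0/1 column
dummies <- model.matrix(~ factor(age) - 1, data = df)

# Replace the default names ("factor(age)10", ...) with age_<level>
colnames(dummies) <- paste0("age_", levels(factor(df$age)))

result <- cbind(df[c("id", "size")], dummies)
result
```

The indicator columns come back as numeric 0/1 values; wrap them in as.integer() if integer columns are needed.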
I have the following data frame:
library(dplyr)
old_data = data.frame(id = c(1,2,3), var1 = c(11,12,13))
> old_data
id var1
1 1 11
2 2 12
3 3 13
I want to replace the values in the 2nd row of "old_data" with the data in "new_data" (i.e. update the rows in "old_data" whose id matches an id in "new_data"):
new_data = data.frame(id = c(4,2,5), var1 = c(11,15,13))
> new_data
id var1
1 4 11
2 2 15
3 5 13
Using the answer found here (Update rows of data frame in R), I tried to do this with the "dplyr" library:
update = old_data %>%
  rows_update(new_data, by = "id")
But this gave me the following error:
Error: Attempting to update missing rows.
Run `rlang::last_error()` to see where the error occurred.
This is what I am trying to get:
id var1
1 1 11
2 2 15
3 3 13
Can someone please tell me what I am doing wrong?
Thanks!
A little bit messy but this works (on this sample data at least)
old_data %>%
  left_join(new_data, by = "id") %>%
  mutate(var1 = if_else(!is.na(var1.y), var1.y, var1.x)) %>%
  select(id, var1)
# id var1
#1 1 11
#2 2 15
#3 3 13
A base R approach using match -
inds <- match(old_data$id, new_data$id)
old_data$var1[!is.na(inds)] <- na.omit(new_data$var1[inds])
old_data
# id var1
#1 1 11
#2 2 15
#3 3 13
A data.table approach (with turning the data table back into a dataframe):
library(data.table)
as.data.frame(setDT(old_data)[new_data, var1 := .(i.var1), on = "id"])
Output
id var1
1 1 11
2 2 15
3 3 13
An alternative tidyverse option using rows_update: filter new_data so that it only contains ids that appear in old_data, then update those values as you had tried. rows_update raises the error above when new_data contains an id that is missing from old_data.
library(tidyverse)
old_data %>%
  rows_update(., new_data %>% filter(id %in% old_data$id), by = "id")
Data
old_data <-
structure(list(id = c(1, 2, 3), var1 = c(11, 12, 13)),
class = "data.frame",
row.names = c(NA,-3L))
new_data <-
structure(list(id = c(4, 2, 5), var1 = c(11, 15, 13)),
class = "data.frame",
row.names = c(NA,-3L))
We can use dplyr::rows_update if we first use a semi_join on new_data to filter only those ids that are included in old_data.
library(dplyr)
old_data %>%
  rows_update(new_data %>%
                semi_join(old_data, by = "id"),
              by = "id")
#> id var1
#> 1 1 11
#> 2 2 15
#> 3 3 13
Created on 2021-12-29 by the reprex package (v0.3.0)
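For newer dplyr versions there may be a shorter route than pre-filtering new_data. The sketch below assumes dplyr 1.1.0 or later, where rows_update gained an unmatched argument; it will not work with the older version the question was written against. The name updated is only illustrative.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which added `unmatched`

old_data <- data.frame(id = c(1, 2, 3), var1 = c(11, 12, 13))
new_data <- data.frame(id = c(4, 2, 5), var1 = c(11, 15, 13))

# Rows of new_data whose id is absent from old_data (4 and 5) are skipped
# instead of triggering "Attempting to update missing rows"
updated <- rows_update(old_data, new_data, by = "id", unmatched = "ignore")
updated
```

Only the row with id 2 is changed; old_data keeps its original three rows.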
I am new to R and am trying to summarize a data frame with multiple functions. I would like the results to appear in the same column, rather than in separate columns for each function. For example, my data set looks something like this:
data =
A B
----
1 2
2 2
3 2
4 2
When I call summarize_all(data, c(min, max)), the data frame becomes
a_fn1 b_fn1 a_fn2 b_fn2
1 2 4 2
How can I make it so that the result of the summarize_all becomes this:
A B
----
1 2
4 2
Thanks
Does this work:
library(dplyr)
bind_rows(apply(data,2,min),apply(data,2,max))
# A tibble: 2 x 2
A B
<dbl> <dbl>
1 1 2
2 4 2
Here is an option with transpose
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = everything()) %>%
  group_by(name) %>%
  summarise(min = min(value), max = max(value)) %>%
  data.table::transpose(., make.names = 'name')
A B
1 1 2
2 4 2
data
df1 <- structure(list(A = 1:4, B = c(2L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-4L))
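For the specific case of min and max, a base R sketch (not taken from the answers above) can lean on range(), which returns both values at once. The name summary_df and the row labels are only illustrative.

```r
df1 <- data.frame(A = 1:4, B = c(2L, 2L, 2L, 2L))

# range() returns c(min, max) for each column; sapply() binds the
# per-column results into a 2-row matrix with one column per variable
summary_df <- as.data.frame(sapply(df1, range))
rownames(summary_df) <- c("min", "max")
summary_df
```

This only covers min and max; for other function pairs, the bind_rows or pivot_longer answers generalise more directly.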
I have multiple columns that I need to merge and return a contingency table counting each number.
Example of an ordinal data set:
df <- data.frame(ab = c(1,2,3,4,5),
ba = c(1,3,3,3,5))
>ab ba
1 1
2 3
3 3
4 3
5 5
I would like to be able to return a contingency table showing like this:
>1 2 3 4 5
2 1 4 1 2
I've attempted examples featured on here for similar issues, but I get the sums returned instead of counts:
library(plyr)
colSums(rbind.fill(data.frame(t(unclass(df$ab))), data.frame(t(unclass(df$ba)))),
        na.rm = TRUE)
Any help is greatly appreciated
We unlist the data.frame into a vector and apply table in base R
table(unlist(df))
# 1 2 3 4 5
# 2 1 4 1 2
Or with tidyverse, by reshaping the data into 'long' format with pivot_longer and get the count
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  count(value)
data
df <- structure(list(ab = 1:5, ba = c(1L, 3L, 3L, 3L, 5L)),
class = "data.frame", row.names = c(NA,
-5L))
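One caveat with table(unlist(df)) is that a value nobody chose simply does not appear in the result. As a sketch, assuming the ordinal scale is known in advance (scale_levels is a hypothetical name, and the extra level 6 is invented for illustration), converting to a factor first keeps zero counts:

```r
df <- data.frame(ab = 1:5, ba = c(1L, 3L, 3L, 3L, 5L))

# Hypothetical: the scale allows a 6 even though it never occurs in the data
scale_levels <- 1:6

# Pool all columns into one vector, then count against the full scale
counts <- table(factor(unlist(df), levels = scale_levels))
counts
```

The counts for 1 to 5 match the answer above, and 6 is reported with a count of 0 rather than being dropped.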
I have (simplified) a data frame with 71 columns and N rows. What I want is a frequency table of the values in the first column for each of the other columns (all of which are 0/1 dummies). A simplified version with only 4 columns looks like this:
df <- data.frame(sample(1:8,20,replace=T),sample(0:1,20,replace = T),sample(0:1,20,replace = T),sample(0:1,20,replace = T))
I have tried this for loop with dplyr (where x is the first column, with the 8 different values). It works for the first 10 or 11 columns without problems, but after that it only generates NAs and returns the error shown below the code:
freq_df <- data.frame(matrix(NA, nrow=8, ncol=71))
for (i in 2:71) {
  freq_df[, i] <- df %>%
    filter(df[i] == 1) %>%
    count(x) %>%
    select(n)
}
in `[<-.data.frame`(`*tmp*`, , i, value = list(n = c(3L, 5L, 8L, :
replacement element 1 has 7 rows, need 8
Does anyone know why R returns this error? Thank you for your help!
Your error occurs because not every first-column value necessarily occurs in rows where another column is 1. You have 8 unique values in the first column, but perhaps only 7 remain after you filter on the 11th column == 1. The results can therefore have different lengths, and a 7-row result cannot be assigned to an 8-row column.
Try this instead, I think it's what you're trying to do. (If not, please clarify your goal by showing the expected output.)
names(df) = paste0("V", 1:4)
df %>%
  group_by(V1) %>%
  summarize(across(everything(), sum, .names = "{.col}_count"))
# V1 V2_count V3_count V4_count
# <int> <int> <int> <int>
# 1 1 1 0 1
# 2 2 2 1 2
# 3 3 3 3 2
# 4 4 0 0 0
# 5 5 0 0 0
# 6 6 3 1 2
# 7 7 3 1 1
# 8 8 1 1 0
In base R, we can do
names(df) <- paste0("V", 1:4)
out <- aggregate(.~ V1, df, sum, na.rm = TRUE)
names(out)[-1] <- paste0(names(out)[-1], "_count")
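The length mismatch described in the first answer can be reproduced on a tiny example. The sketch below uses a made-up data frame (toy, with columns x and flag, is not the asker's data) in which the value 3 never occurs together with flag == 1:

```r
# Hypothetical toy data: value 3 never has flag == 1
toy <- data.frame(x = c(1, 1, 2, 3), flag = c(1, 0, 1, 0))

# Counting after filtering: value 3 disappears, so only 2 groups remain
short <- table(toy$x[toy$flag == 1])

# Declaring all levels before filtering keeps a zero for value 3
x_factor <- factor(toy$x, levels = sort(unique(toy$x)))
full <- table(x_factor[toy$flag == 1])

# Summing the 0/1 flag per group gives the same full-length result
sums <- tapply(toy$flag, toy$x, sum)

length(short)  # 2
length(full)   # 3
```

Because the flags are 0/1, summing them per group counts the ones and always returns one entry per group, which is why the group_by/summarize and aggregate answers avoid the error.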
I have a data frame with object names and a list of statistical moments for that object, like this:
Object Mean IQR Skew
x 1 1 1
y 2 2 2
z 3 3 3
What I want is to create, for each row, columns holding the statistical moments with the object name as a prefix. Like so:
xMean xIQR xSkew yMean yIQR ySkew zMean zIQR zSkew
1 1 1 2 2 2 3 3 3
In essence, I need to collapse the data frame into a single row that lists all statistical moments on one line; I'll have many rows like the final one, but a finite set of columns.
You could do:
df1$id <- 1
reshape(df1, idvar="id", timevar="Object", direction="wide")[-1]
# Mean.x IQR.x Skew.x Mean.y IQR.y Skew.y Mean.z IQR.z Skew.z
#1 1 1 1 2 2 2 3 3 3
Or using dcast, melt from reshape2
library(reshape2)
dcast(melt(df1, id.var=c('id', 'Object')), id~..., value.var='value')[-1]
# x_Mean x_IQR x_Skew y_Mean y_IQR y_Skew z_Mean z_IQR z_Skew
#1 1 1 1 2 2 2 3 3 3
Or using dplyr and tidyr
library(dplyr)
library(tidyr)
df1 %>%
  gather(Var, Val, Mean:Skew) %>%
  unite(VarNew, Object, Var, sep = "") %>%
  spread(VarNew, Val) %>%
  select(-id)
# xIQR xMean xSkew yIQR yMean ySkew zIQR zMean zSkew
#1 1 1 1 2 2 2 3 3 3
data
df1 <- structure(list(Object = c("x", "y", "z"), Mean = 1:3, IQR = 1:3,
Skew = 1:3), .Names = c("Object", "Mean", "IQR", "Skew"), class = "data.frame", row.names = c(NA,
-3L))
Or maybe something like
setNames(unlist(data.frame(t(df[-1]))), paste0(rep(df[, 1], each = ncol(df) - 1), names(df[, -1])))
# xMean xIQR xSkew yMean yIQR ySkew zMean zIQR zSkew
# 1 1 1 2 2 2 3 3 3
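gather and spread are superseded in current tidyr, so here is a sketch of the same collapse with pivot_wider (not from the answers above). It assumes tidyr 1.0.0 or later and does not need the helper id column; the name wide is only illustrative.

```r
library(tidyr)  # assumption: tidyr >= 1.0.0, which provides pivot_wider

df1 <- data.frame(Object = c("x", "y", "z"),
                  Mean = 1:3, IQR = 1:3, Skew = 1:3)

# Spread every moment across the objects; names_glue builds names such as
# "xMean" from the Object value and the name of the value column
wide <- pivot_wider(df1,
                    names_from = Object,
                    values_from = c(Mean, IQR, Skew),
                    names_glue = "{Object}{.value}")
wide
```

The result is a single row with nine columns. The column order may differ from the order requested in the question, so reorder with select() if that matters.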