I have a data frame with 71 columns and N rows. What I want is a frequency table of the values in the first column based on all the other columns (all other columns are 0/1 dummies). Simplified to only 4 columns, the data would look like this:
df <- data.frame(sample(1:8,20,replace=T),sample(0:1,20,replace = T),sample(0:1,20,replace = T),sample(0:1,20,replace = T))
I have tried this for loop with dplyr (where x is the first column with the 8 different values). It works without problems for the first 10 or 11 columns, but after that it only generates NAs and returns the error shown below the code:
freq_df <- data.frame(matrix(NA, nrow=8, ncol=71))
for (i in 2:71){
freq_df[,i] <- df %>%
filter(df[i]==1) %>%
count(x) %>%
select(n)
}
Error in `[<-.data.frame`(`*tmp*`, , i, value = list(n = c(3L, 5L, 8L, :
replacement element 1 has 7 rows, need 8
Does anyone know why R returns this error? Thank you for your help!
The error occurs because not every first-column value appears among the rows where a given dummy column equals 1. You have 8 unique values in the first column, but perhaps only 7 remain when you filter on the 11th column == 1. The results therefore have different lengths, and a 7-row result cannot be assigned to an 8-row column.
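To make the length mismatch concrete, here is a minimal base R sketch. The toy data frame and its column names (x, d) are made up for illustration and are not the asker's data. It also shows that fixing the factor levels keeps the zero counts, so every result has the same length.

```r
# Hypothetical toy data: x has 3 distinct values, but x == 3 never occurs with d == 1
toy <- data.frame(x = c(1, 1, 2, 3), d = c(1, 0, 1, 0))

all_counts      <- table(toy$x)              # one entry per distinct value of x
filtered_counts <- table(toy$x[toy$d == 1])  # the value 3 has disappeared

length(all_counts)       # number of distinct values overall
length(filtered_counts)  # fewer entries after filtering

# Declaring the levels up front keeps a zero count for the missing value
fixed_counts <- table(factor(toy$x[toy$d == 1], levels = sort(unique(toy$x))))
fixed_counts
```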
Try this instead, I think it's what you're trying to do. (If not, please clarify your goal by showing the expected output.)
library(dplyr)
names(df) <- paste0("V", 1:4)
df %>%
group_by(V1) %>%
summarize(across(everything(), sum, .names = "{.col}_count"))
# V1 V2_count V3_count V4_count
# <int> <int> <int> <int>
# 1 1 1 0 1
# 2 2 2 1 2
# 3 3 3 3 2
# 4 4 0 0 0
# 5 5 0 0 0
# 6 6 3 1 2
# 7 7 3 1 1
# 8 8 1 1 0
In base R, we can do
names(df) <- paste0("V", 1:4)
out <- aggregate(.~ V1, df, sum, na.rm = TRUE)
names(out)[-1] <- paste0(names(out)[-1], "_count")
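Another base R option, offered only as a sketch and not as part of the answers above, is rowsum(), which sums every column within each group without using a formula. The seed and the three-column data frame are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(V1 = sample(1:8, 20, replace = TRUE),
                 V2 = sample(0:1, 20, replace = TRUE),
                 V3 = sample(0:1, 20, replace = TRUE))

# Sum each dummy column within every value of V1; row names hold the V1 values
counts <- rowsum(df[-1], group = df$V1)
names(counts) <- paste0(names(counts), "_count")

# Turn the row names back into a regular column
counts <- cbind(V1 = as.integer(rownames(counts)), counts)
rownames(counts) <- NULL
counts
```

Like the aggregate() version, this only returns rows for values that actually occur in V1.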
I have a list of data frames and want to append a new column to each, but I keep getting various error messages. Can anybody explain why the code below doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column with a sequence going from 1:length(data$data).
##dataset
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
###map + rowid_to_column
attempt1<- data%>%
map(.,rowid_to_column(.,var = "hour"))
##mutate
attempt2<-data %>%
map(., mutate("Hours" = 1:6))
###add column
attempt3<- data%>%
map(.$data,add_column(.data,hours = 1:6))
newcolumn<- 1:6
###lapply
attempt4<- lapply(data,cbind(data$data,newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
hour <- seq_len(nrow(x))
cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: Updated after realising the first version iterated over columns rather than rows. This approach also works if the data frames have different numbers of rows, which the methods that hard-code the vector as 1:6 do not.
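As a small illustration of that last point, here is a sketch with two made-up data frames of different lengths (the list and its names a and b are hypothetical). Because seq_len(nrow(x)) adapts to each element, a hard-coded 1:6 is not needed.

```r
# Hypothetical list of data frames with 3 and 5 rows
frames <- list(a = data.frame(Day = 1:3),
               b = data.frame(Day = 1:5))

frames <- lapply(frames, function(x) {
  x$hour <- seq_len(nrow(x))  # 1..nrow(x), whatever the size of x
  x
})

frames$a
frames$b
```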
A data.table approach:
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
A purrr approach:
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
group_split(Location) %>%
purrr::map_dfr(~.x %>% mutate(Hours = c(1:6)))
If you want to keep your approach and preserve the same data structure, here is another way using purrr (you need to ungroup first, otherwise it will not work because of the rowwise grouping created by nest_by):
data %>% ungroup() %>%
mutate_at("data", .f = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )
I have a large csv dataset with more than 45k rows and 19 different variables. I'd like to filter it by a specific variable (V4) so that each filtered group starts with 0 and then the next 0 will mark the start of a new group/dataframe/datatable, while keeping all other variables inside this new table as well. I need those separate groups to further analyse each case of data.
I tried:
filtered_data <- my_data %>%
group_by("V4") %>%
filter("V4" == 0 & "V4" !=0)
View(filtered_data)
The first "V4" == 0 seems to work, but I'm struggling with how to define the end of each filtered data frame, e.g. how to filter from 0 to 3, then from 0 to 5, etc.
How can I determine the length of each case? Is there a logical operator that saves each group before V4 turns 0 again? Or would it be better to create a loop?
Example of my_data:
V1 V2 V3 V4 . . . V19
1 0
2 1
3 2
4 3
5 0
6 1
7 2
8 3
9 4
10 5
11 0
...
45k
Here is a way to group your rows with basic arithmetic.
I create the groups using a cumulative sum of an indicator variable (V4 is 0 or not) and split the data.frame into single dataframes using group_split.
# example data 12000 rows in total, 4000 groups of 3 rows
df <- data.frame(V1 = 1:12000,
V2 = sample(LETTERS, 12000, replace = T),
V4 = rep(0:2, 4000))
df <- df %>%
mutate(Groups = ifelse(V4 == 0, 1, 0),
Groups = cumsum(Groups)) %>%
group_split(Groups)
So the first group/dataframe is
> df[[1]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 1 L 0 1
2 2 L 1 1
3 3 Y 2 1
the second
> df[[2]]
# A tibble: 3 x 4
V1 V2 V4 Groups
<int> <chr> <int> <dbl>
1 4 Z 0 2
2 5 N 1 2
3 6 Y 2 2
and so on.
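The same idea can be sketched in base R with split(), which also answers the question about the length of each case. The small my_data below mirrors the V4 pattern from the question (0 to 3, 0 to 5, then a new 0) and is otherwise made up.

```r
# Hypothetical data following the V4 pattern shown in the question
my_data <- data.frame(V1 = 1:11,
                      V4 = c(0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 0))

# Every 0 in V4 starts a new case, so the running count of zeros is the group id
group_id <- cumsum(my_data$V4 == 0)

# One data frame per case, with all other columns kept
cases <- split(my_data, group_id)

# Number of rows in each case
case_lengths <- sapply(cases, nrow)
case_lengths
```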
If you want to save each data.frame separately, you could use something like this:
# new environment that holds all data.frames
dfEnv <- new.env()
df %>%
mutate(Groups = ifelse(V4 == 0, 1, 0),
Groups = cumsum(Groups)) %>%
group_by(Groups) %>%
do({
# save every group inside the new environment as a single data.frame
dfEnv[[paste0("Group_", unique(.$Groups))]] <- .
})
Now you have dfEnv$Group_1, dfEnv$Group_2, ... and so on.
Inside do() you could also use saveRDS or write.csv to save the data to disk.
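A minimal sketch of the write-to-disk variant, using base R split() and a loop in place of do(). The output folder under tempdir(), the file names, and the small my_data are assumptions chosen for illustration.

```r
# Hypothetical data: V4 restarts at 0 at the beginning of every case
my_data <- data.frame(V1 = 1:7, V4 = c(0, 1, 2, 0, 1, 0, 1))
cases <- split(my_data, cumsum(my_data$V4 == 0))

# Assumed output location; replace with a real folder as needed
out_dir <- file.path(tempdir(), "groups")
dir.create(out_dir, showWarnings = FALSE)

# One CSV per case: Group_1.csv, Group_2.csv, ...
paths <- file.path(out_dir, paste0("Group_", names(cases), ".csv"))
for (i in seq_along(cases)) {
  write.csv(cases[[i]], paths[i], row.names = FALSE)
}
```

saveRDS() could be swapped in for write.csv() if the column types need to be preserved exactly.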
I have a dataframe with the following structure
test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))
Now I want to create a data frame from this which contains a named column for each of the unique values in the test data frame. A unique value is a value that ends with the ';' character and starts after a space (the space itself is not part of the value). Then, for each row, I wish to fill the dummy columns with either a 1 or a 0, as shown below:
data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,0), e = c(0,1))
a ff cc rr e
1 1 1 1 1 0
2 1 0 1 1 1
I tried creating a df using for loops and the unique values in the column, but it's getting too messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried a mutate_all() call with grep(), but this did not work.
I'd use cSplit from the splitstackshape package and mtabulate from the qdapTools package to get this as a one-liner, i.e.
library(splitstackshape)
library(qdapTools)
mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
# a cc ff rr e
#V1 1 1 1 1 0
#V2 1 1 0 1 1
It can also be done entirely with splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments,
cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)
Here's a possible data.table implementation. First we split the rows into columns, melt them into a single column, and then spread it wide while counting the events for each row:
library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
# variable a cc e ff rr
# 1: 1 1 1 0 1 1
# 2: 2 1 1 1 0 1
Here's a base R approach:
x <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x)) # get unique elements
x <- lapply(x, factor, levels = lvl) # convert to factor
t(sapply(x, table)) # count elements and transpose
# a ff cc rr e
#[1,] 1 1 1 1 0
#[2,] 1 0 1 1 1
We can do this with tidyverse
library(tidyverse)
rownames_to_column(test, 'grp') %>%
separate_rows(col) %>%
filter(col!="") %>%
count( grp, col) %>%
spread(col, n, fill = 0) %>%
ungroup() %>%
select(-grp)
# A tibble: 2 × 5
# a cc e ff rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 0 1 1
#2 1 1 1 0 1
Here is a base R solution. First remove the spaces, then collect all the unique values in cols. Split each row of the data frame and check which of the values in cols are present in it. This gives a logical matrix, which can easily be converted to numeric.
test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';')))
yy=strsplit(as.character(test$col), split = ';')
z=as.data.frame(do.call(rbind, lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))
Another approach with tidytext and tidyverse
library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
unnest_tokens(word, col) %>%
rownames_to_column(var="row") %>%
mutate(row = floor(parse_number(row)),
val = 1) %>%
spread(word, val, fill = 0) %>%
select(-row)
df
# a cc e ff rr
#1 1 1 0 1 1
#2 1 1 1 0 1
Another simple solution without any extra packages:
x = c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))
I have a data frame in R from which I want to remove rows that match certain conditions. How can I do it?
I have tried using dplyr and ifelse, but my code does not give the right answer:
check8 <- distinct(df5,prod,.keep_all = TRUE)
This does not work! It returns the entire data set.
Input is:
check1 <- data.frame(ID = c(1,1,2,2,2,3,4),
prod = c("R","T","R","T",NA,"T","R"),
bad = c(0,0,0,1,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 1 T 0
# 3 2 R 0
# 4 2 T 1
# 5 2 <NA> 0
# 6 3 T 1
# 7 4 R 0
Output expected:
data.frame(ID = c(1,2,3,4),
prod = c("R","R","T","R"),
bad = c(0,0,1,0))
# ID prod bad
# 1 1 R 0
# 2 2 R 0
# 3 3 T 1
# 4 4 R 0
I want the output to work as follows: for IDs that have several rows (different prod values or NA), keep only the rows with prod R; if an ID has only one row, keep that row regardless of its prod.
Using dplyr we can use filter to select rows where prod == "R" or if there is only one row in the group, select that row.
library(dplyr)
check1 %>%
group_by(ID) %>%
filter(prod == "R" | n() == 1)
# ID prod bad
# <dbl> <fct> <dbl>
#1 1 R 0
#2 2 R 0
#3 3 T 1
#4 4 R 0
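For comparison, the same rule can be sketched in base R. This is an illustrative equivalent, not part of the answer above, and the helper names group_size and keep are made up. The NA in prod is handled explicitly so that it is never selected by the prod == "R" test.

```r
check1 <- data.frame(ID = c(1, 1, 2, 2, 2, 3, 4),
                     prod = c("R", "T", "R", "T", NA, "T", "R"),
                     bad = c(0, 0, 0, 1, 0, 1, 0))

# Number of rows sharing each ID, repeated for every row
group_size <- ave(check1$ID, check1$ID, FUN = length)

# Keep rows with prod "R" (NA counts as not "R") or rows that are alone in their ID
keep <- (!is.na(check1$prod) & check1$prod == "R") | group_size == 1
result <- check1[keep, ]
result
```

With either version, an ID that has several rows but no "R" among them would be dropped entirely, which may or may not be what is wanted.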
Here is a solution using anti_join:
library(dplyr)
check1 <- data.frame(ID = c(1,1,2,2,2,3,4), prod = c("R","T","R","T",NA,"T","R"), bad = c(0,0,0,1,0,1,0))
# First part: select all the IDs which contain 'R' as prod
p1 <- check1 %>%
group_by(ID) %>%
filter(prod == 'R')
# Second part: using anti_join, get all the rows from check1 whose ID has no
# matching row in p1
p2 <- anti_join(check1, p1, by = 'ID')
solution <- bind_rows(
p1,
p2
) %>%
arrange(ID)
I have been playing with dplyr::mutate_at to create new variables by applying the same function to some of the columns. When I name my function in the .funs argument, the mutate call creates new columns with a suffix instead of replacing the existing ones, which is a cool option that I discovered in this thread.
df = data.frame(var1=1:2, var2=4:5, other=9)
df %>% mutate_at(vars(contains("var")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other var1_sqrt var2_sqrt
#### 1 1 4 9 1.000000 2.000000
#### 2 2 5 9 1.414214 2.236068
However, I noticed that when the vars argument used to point my columns returns only one column instead of several, the resulting new column drops the initial name: it gets named sqrt instead of other_sqrt here:
df %>% mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt))
#### var1 var2 other sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
I would like to understand why this behaviour happens, and how to avoid it because I don't know in advance how many columns the contains() will return.
EDIT:
The newly created columns must inherit the original name of the original columns, plus the suffix 'sqrt' at the end.
Thanks
Here is another idea. We can add setNames(sub("^sqrt$", "other_sqrt", names(.))) after the mutate_at call. The idea is to replace the column name sqrt with other_sqrt. The pattern ^sqrt$ only matches the derived column sqrt, which appears when exactly one column matches other, as demonstrated in Example 1. If more than one column matches other, as in Example 2, setNames does not change any column names.
library(dplyr)
# Example 1
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
# Example 2
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)
df2 %>%
mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
setNames(sub("^sqrt$", "other_sqrt", names(.)))
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
Or we can design a function to check how many columns contain the string other before manipulating the data frame.
mutate_sqrt <- function(df, string){
string_col <- grep(string, names(df), value = TRUE)
df2 <- df %>% mutate_at(vars(contains(string)), funs("sqrt" = sqrt(.)))
if (length(string_col) == 1){
df2 <- df2 %>% setNames(sub("^sqrt$", paste(string_col, "sqrt", sep = "_"), names(.)))
}
return(df2)
}
mutate_sqrt(df, "other")
# var1 var2 other other_sqrt
# 1 1 4 9 3
# 2 2 5 9 3
mutate_sqrt(df2, "other")
# var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1 1 4 9 16 3 4
# 2 2 5 9 16 3 4
I just figured out a (not so clean) way to do it:
I add an extra dummy variable to the dataset, with a name that ensures it will be selected so that we never fall into the 1-variable case. After the calculation I remove the 2 dummy columns (the dummy itself and its sqrt), like this:
df %>% mutate(other_fake=NA) %>%
mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt)) %>%
select(-contains("other_fake"))
#### var1 var2 other other_sqrt
#### 1 1 4 9 3
#### 2 2 5 9 3
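A dependency-free way to sidestep the naming issue is to build the new names explicitly, so one match and several matches are treated the same. The sketch below uses base R only; the helper name add_sqrt is hypothetical.

```r
# Hypothetical helper: add "<name>_sqrt" for every column whose name contains `string`
add_sqrt <- function(df, string) {
  cols <- grep(string, names(df), value = TRUE, fixed = TRUE)
  # lapply returns one element per matched column, even when there is only one
  df[paste0(cols, "_sqrt")] <- lapply(df[cols], sqrt)
  df
}

df  <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)

one_match <- add_sqrt(df, "other")
two_match <- add_sqrt(df2, "other")
names(one_match)
names(two_match)
```

In dplyr 1.0.0 and later, mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt")) follows the same principle, since the .names template is applied regardless of how many columns are selected.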