I have created a dataframe that contains ids and stringvalues:
mycols <- c('id','2')
ids <- c(1,1,2,3)
stringvalues <- c('a','a','b','c')
mydf <- data.frame(ids, stringvalues)
mydf contains:
ids stringvalues
1 1 a
2 1 a
3 2 b
4 3 c
I'm attempting to produce a new dataframe that contains the id and
corresponding counts for each string:
id, a , b , c
1 , 2 , 0 , 0
2 , 0 , 1 , 0
3 , 0 , 0 , 1
I tried building this with multiple summarise() calls:
g1 <- group_by(mydf , ids)
s1 <- summarise(g1 , a = count('a'))
s2 <- summarise(g1 , b = count('b'))
s3 <- summarise(g1 , c = count('c'))
But this returns the error: Evaluation error: no applicable method for 'groups' applied to an object of class "character".
How can I create new columns that count the number of occurrences of each string value?
Does doing a dplyr::count followed by tidyr::spread work for you? (I'm only posting this because you mentioned you wanted to create a dataframe of this sort - otherwise it's much simpler to use table(mydf), as the other comments/answers suggest.)
library(dplyr)
library(tidyr)
mydf %>% count(ids, stringvalues) %>% spread(stringvalues, n, fill = 0)
#> # A tibble: 3 x 4
#> ids a b c
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 0
#> 2 2 0 1 0
#> 3 3 0 0 1
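As a side note: in more recent versions of tidyr, spread() has been superseded by pivot_wider(), which can fill the missing combinations in the same call (a sketch, assuming tidyr >= 1.0.0):

mydf %>%
  count(ids, stringvalues) %>%
  pivot_wider(names_from = stringvalues, values_from = n, values_fill = 0)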
You can use count directly. First,
count(mydf, ids, stringvalues)
gives
# A tibble: 3 x 3
ids stringvalues n
<dbl> <fctr> <int>
1 1 a 2
2 2 b 1
3 3 c 1
then reshape,
count(mydf, ids, stringvalues) %>% tidyr::spread(stringvalues, n)
gives
# A tibble: 3 x 4
ids a b c
* <dbl> <int> <int> <int>
1 1 2 NA NA
2 2 NA 1 NA
3 3 NA NA 1
then replace the NAs with something like res[is.na(res)] <- 0, where res is the object constructed above.
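Putting the two steps together, a complete sketch:

res <- count(mydf, ids, stringvalues) %>% tidyr::spread(stringvalues, n)
res[is.na(res)] <- 0   # replace the NAs left by spread()
res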
Here's a base-R solution:
data.frame(cbind(table(mydf)))
Output option 1 (row # = ID):
a b c
1 2 0 0
2 0 1 0
3 0 0 1
Output option 2 (with ID as column):
data.frame(cbind(id=unique(mydf$ids),table(mydf)))
id a b c
1 1 2 0 0
2 2 0 1 0
3 3 0 0 1
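A small variation on option 1, if you would rather skip the cbind() step: as.data.frame.matrix() converts the contingency table directly and gives the same output:

as.data.frame.matrix(table(mydf))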
Related
I simplified the dataset to demonstrate what I want to do. I'm not used to dealing with multiple columns. Here is a simple dataset:
data <- data.frame(id = c(1,1,1,2,2,2,2),
                   title_1 = c(65,58,47,NA,25,27,43),
                   title_2 = c(NA,NA,32,35,12,NA,1))
In my actual dataset there are many more columns, but for now I've just named them as above. My goal is to change the values of title_1 and title_2 by the following rule: if there is a number, change it to 1; if there is an NA value, change it to 0. But in my actual dataset there are hundreds of columns named title_1, title_2, ... , title_100, ... , so I can't type out all the column names. So even for my simple data, I want code that doesn't reference the column names explicitly. My expected output is
data <- data.frame(id = c(1,1,1,2,2,2,2),
                   title_1 = c(1,1,1,0,1,1,1),
                   title_2 = c(0,0,1,1,1,0,1))
With dplyr we can use tidyselect syntax inside across() to select all variables starting with "title_" and then apply a function to every selected column:
data <- data.frame(id = c(1,1,1,2,2,2,2),
                   title_1 = c(65,58,47,NA,25,27,43),
                   title_2 = c(NA,NA,32,35,12,NA,1))
library(dplyr)
data %>%
mutate(across(starts_with("title_"), ~ ifelse(is.na(.x), 0, 1)))
#> id title_1 title_2
#> 1 1 1 0
#> 2 1 1 0
#> 3 1 1 1
#> 4 2 0 1
#> 5 2 1 1
#> 6 2 1 0
#> 7 2 1 1
In base R we would use grepl to select the column names, then assign those columns new values with lapply:
data <- data.frame(id = c(1,1,1,2,2,2,2),
                   title_1 = c(65,58,47,NA,25,27,43),
                   title_2 = c(NA,NA,32,35,12,NA,1))
mycols <- grepl("^title_", names(data))
data[mycols] <- lapply(data[mycols], \(x) ifelse(is.na(x), 0, 1))
data
#> id title_1 title_2
#> 1 1 1 0
#> 2 1 1 0
#> 3 1 1 1
#> 4 2 0 1
#> 5 2 1 1
#> 6 2 1 0
#> 7 2 1 1
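As an aside (my own variation, not required): the lapply()/ifelse() step can be replaced by a fully vectorised idiom, since is.na() on a data frame returns a logical matrix and unary + coerces logicals to 0/1:

# same result as the lapply() call above
data[mycols] <- +(!is.na(data[mycols]))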
Finally, we would select the columns similarly with data.table, but here we'd rather have the actual names, via grep(value = TRUE):
mycols <- grep("^title_", names(data), value = TRUE)
library(data.table)
data_tb <- as.data.table(data)
data_tb[,
get("mycols") := lapply(.SD, \(x) ifelse(is.na(x), 0, 1)),
.SDcols = mycols]
data_tb
#> id title_1 title_2
#> 1: 1 1 0
#> 2: 1 1 0
#> 3: 1 1 1
#> 4: 2 0 1
#> 5: 2 1 1
#> 6: 2 1 0
#> 7: 2 1 1
Created on 2022-07-26 by the reprex package (v2.0.1)
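A note on the get("mycols") line: a more common data.table idiom is to wrap the left-hand side in parentheses, which forces it to be evaluated as a character vector of column names:

# equivalent to the get("mycols") version above
data_tb[, (mycols) := lapply(.SD, \(x) ifelse(is.na(x), 0, 1)),
        .SDcols = mycols]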
While an ifelse statement as presented by @TimTeaFan is the best solution, here is an alternative approach using across twice:
library(dplyr)
library(tidyr)
data %>%
  mutate(across(-id, ~ . - . + 1),        # x - x + 1 is 1 for numbers and NA for NA
         across(-id, ~ replace_na(., 0)))
id title_1 title_2
1 1 1 0
2 1 1 0
3 1 1 1
4 2 0 1
5 2 1 1
6 2 1 0
7 2 1 1
I have a dataframe like this:
dd <- data.frame(col1 = c(1,0,1), col2 = c(1,1,1), col3 = c(1,0,0), col4 = c(1,0,1))
And I would like the pairwise sums of the columns, like:
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
2 2 2 2 2 2
1 0 0 1 1 0
2 1 2 1 2 1
I didn't find any function that does that. Please help me.
One base R option might be combn + rowSums. combn() treats a data frame as a list of columns, so each two-column combination is passed to rowSums(), while a second combn() call over the names builds the matching labels:
setNames(
as.data.frame(combn(dd, 2, rowSums)),
combn(names(dd), 2, paste0, collapse = "+")
)
which gives
col1+col2 col1+col3 col1+col4 col2+col3 col2+col4 col3+col4
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Data
dd<-data.frame(col1=c(1,0,1),col2=c(1,1,1),col3=c(1,0,0),col4=c(1,0,1))
One dplyr and purrr possibility could be:
map_dfc(.x = combn(names(dd), 2, simplify = FALSE),
~ dd %>%
rowwise() %>%
transmute(!!paste(.x, collapse = "+") := sum(c_across(all_of(.x)))))
`col1+col2` `col1+col3` `col1+col4` `col2+col3` `col2+col4` `col3+col4`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 2 2 2
2 1 0 0 1 1 0
3 2 1 2 1 2 1
Uglier, and slower than the base R above:
do.call("cbind", setNames(Map(function(i){y <- dd[,i] + dd[,-c(1:i)]},
seq_along(dd)[1:ncol(dd)-1]), names(dd)[1:(ncol(dd)-1)]))
Suppose I have the following data:
A <- c(4,4,4,4,4)
B <- c(1,2,3,4,4)
C <- c(1,2,4,4,4)
D <- c(3,2,4,1,4)
filt <- c(1,1,10,8,10)
data <- as.data.frame(rbind(A,B,C,D,filt))
data <- t(data)
data <- as.data.frame(data)
> data
A B C D filt
V1 4 1 1 3 1
V2 4 2 2 2 1
V3 4 3 4 4 10
V4 4 4 4 1 8
V5 4 4 4 4 10
I want to get counts of the occurrences of 1, 2, 3, and 4 for each variable, after filtering. In my attempt to achieve this below, I get Error: length(rows) == 1 is not TRUE.
data %>%
dplyr::filter(filt ==1) %>%
plyr::summarize(A_count = count(A),
B_count = count(B))
I think I get the error because some of my columns do not contain all of the values 1-4. Is there a way to specify what it should look for and give 0 values if a value is not found? I'm not sure how to do this, if it's possible, or whether there is a different workaround.
Any help is VERY appreciated!!!
This was a bit of a weird one; I didn't use classical plyr, but I think this is roughly what you're looking for. I removed the filtering column, filt, so that it doesn't get counted as well:
library(dplyr)
data %>%
filter(filt == 1) %>%
select(-filt) %>%
purrr::map_df(function(a_column){
purrr::map_int(1:4, function(num) sum(a_column == num))
})
# A tibble: 4 x 4
A B C D
<int> <int> <int> <int>
1 0 1 1 0
2 0 1 1 1
3 0 0 0 1
4 2 0 0 0
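If you would rather stay in base R, here is a sketch of the same idea: fixing the factor levels to 1:4 makes table() report a 0 for values that never occur, which covers the "give 0 values if not found" requirement:

sapply(subset(data, filt == 1, select = -filt),
       function(x) table(factor(x, levels = 1:4)))
#   A B C D
# 1 0 1 1 0
# 2 0 1 1 1
# 3 0 0 0 1
# 4 2 0 0 0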
Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value being counted):
EDIT: to explain the transformation: x is what I'm grouping by. For each group I want to count how many times 0, 1, and 2 appear in the other columns (y holds the value being counted). For the first row of the transformed dataframe, for example, we count how often the rows with x = 1 contain the value 0: once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here, as the default aggregation function is length. If you don't pass one explicitly, you will get a warning about it:
Aggregation function missing: defaulting to length
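You can avoid the warning by supplying the aggregation function explicitly (a sketch):

dt.new <- dcast(melt(setDT(df), id.vars = "x"), x + value ~ variable,
                fun.aggregate = length)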
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from #Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information about how often certain values occur in columns a to c. After that, you reshape the dataset into your required format using spread.
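For completeness: in current tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(), so the same pipeline could be sketched as:

res = df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)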
I am trying to split one column of a data frame into multiple columns, with the values from the original column becoming the new column names. Then, if there was an occurrence for that respective column in the original, the new column should get a 1, or a 0 if there is no match. I realize this is not the best way to explain it, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and I would like to expand it to wide format, something like the following, with 1s and 0s (or TRUE and FALSE):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr's separate function and reshape2's cast function, but I seem to be getting hung up on producing the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
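If you want the bare A-D column names from your desired output, one option (a sketch) is to strip the prefix that cSplit_e adds:

out <- cSplit_e(data = df, split.col = "Location", sep = "/",
                type = "character", drop = TRUE, fill = 0)
names(out) <- sub("^Location_", "", names(out))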
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)  # template is one logical per row
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
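As a compact variant of the last three steps (a sketch, still using u from above; unary + coerces the logical matrix to 0/1 integers):

cbind(df["subject"], +sapply(u, grepl, x = df$Location))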