I have following dataframe in r
Company Education Health
A NA 1
A 1 2
A 1 NA
I want the count of levels in each columns(1,2,NA) in a following format
Company Education_1 Education_NA Health_1 Health_2 Health_NA
A 2 1 1 1 1
How can I do it in R?
You can do the following:
library(tidyverse)
df %>%
gather(k, v, -Company) %>%
unite(tmp, k, v, sep = "_") %>%
count(Company, tmp) %>%
spread(tmp, n)
## A tibble: 1 x 6
# Company Education_1 Education_NA Health_1 Health_2 Health_NA
# <fct> <int> <int> <int> <int> <int>
#1 A 2 1 1 1 1
Sample data
df <- read.table(text =
" Company Education Health
A NA 1
A 1 2
A 1 NA ", header = T)
Using DF in the Note at the end where we have added a company B as well and using the reshape2 package it can be done in one recast call. The id.var and fun arguments can be omitted and the same answer will be given but it will produce a message saying it used those defaults.
library(reshape2)
recast(DF, Company ~ variable + value,
id.var = "Company", fun = length)
giving this data frame:
Company Education_1 Education_NA Health_1 Health_2 Health_NA
1 A 2 1 1 1 1
2 B 2 1 1 1 1
Note
Lines <- " Company Education Health
1 A NA 1
2 A 1 2
3 A 1 NA
4 B NA 1
5 B 1 2
6 B 1 NA"
DF <- read.table(text = Lines)
In plyr you can use a hack with ddply by transposing tables to get what appear to be new columns:
x <- data.frame(Company="A",Education=c(NA,1,1),Health=c(1,2,NA))
library(plyr)
ddply(x,.(Company),plyr::summarise,
Education=t(table(addNA(Education))),
Health=t(table(addNA(Health)))
)
Company Education.1 Education.NA Health.1 Health.2 Health.NA
1 A 2 1 1 1 1
However, they are not really columns, but table elements in the data.frame.
You can use a do.call(data.frame,y) construct to make them proper data frame columns, but you need more than one row for it to work.
Related
I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they 2 or 3 observations (so patients that have complete data for 0 or only 1 time points should be thrown out. So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because of they were missing the outcome variable in 2 or 3 of the time points.
We can create a logical expression on the sum of non-NA elements as a logical vector, grouped by 'patientid' to filter patientid's having more than one non-NA 'outcome'
library(dplyr)
Data %>%
group_by(patientid) %>%
filter(sum(!is.na(outcome)) > 1) %>%
ungroup
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thank #akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(observation = sum(outcome, na.rm = TRUE)) %>% # create new variable (observation) and count the observation per patient
filter(observation >=2) %>%
ungroup
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <dbl>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
I have a dataset where the first line is the header, the second line is some explanatory data, and then rows 3 on are numbers. Because when I read in the data with this second explanatory row, the classes are automatically converted to factors (or I could put stringsasfactors=F).
What I would like to do is remove the second row, and have a function that goes through all columns and detects if they're just numbers and change the class type to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
if all the numbers are after the second line, then we can do so
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
depending on the task can be done this way
df <- tibble(A = c("col 1",1,2,3,4,5),
B = c("col 2",1,2,3,4,5),
C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
or so
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
In base R, we can just do
df[-1] <- lapply(df[-1], as.numeric)
Given the input and code below, using dplyr and groups, how can I produce the results shown in the output? I know how to sum columns in groups using dplyr, but in this case I need to count how many of each non-numeric grade occurred in each class.
**INPUT**
Class Student Grade
1 Jack C
1 Mary B
1 Mo B
1 Jane A
1 Tom C
2 Don C
2 Betsy B
2 Sue C
2 Tayna B
2 Kim C
**CODE**
# Create the dataframe
Class <- c(1,1,1,1,1,2,2,2,2,2)
Name <- c("Jack", "Mary", "Mo", "Jane", "Tom", "Don", "Betsy", "Sue", "Tayna", "Kim")
Grade <- c("C","B","B","A","C","C","B","C","B","C")
StudentGrades <- data.frame(Class, Name, Grade)
**OUTPUT**
Class Grade-A Grade-B Grade-C
1 1 2 2
2 0 2 3
We can use count to get the frequency count and then with pivot_wider change from 'long' to 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
StudentGrades %>%
count(Class, Grade = str_c('Grade_', Grade)) %>%
pivot_wider(names_from = Grade, values_from = n, values_fill = list(n = 0))
# A tibble: 2 x 4
# Class Grade_A Grade_B Grade_C
# <dbl> <int> <int> <int>
#1 1 1 2 2
#2 2 0 2 3
Or in base R
table(StudentGrades[c('Class', 'Grade')])
Here is a base R solution, where table() + split() are used
dfout <- do.call(rbind,lapply(split(StudentGrades,StudentGrades$Class),
function(v) c(unique(v[1]),table(v$Grade))))
such that
> dfout
Class A B C
1 1 1 2 2
2 2 0 2 3
I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to #PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(data_frame(Country = unique(.$Country),
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq, to calculate year sequences. Then constructs a data.frame from the named list that is returned, merges this onto the original which adds the desired rows, and finally fills in missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
I have a format to separate where I will have this data:
df = data.frame(id=c(1,2),name=c('A~B~C','A~B~D'),value=c('1~2~3','1~~2'))
id name value
1 A~B~C 1~2~3
2 A~B~D 1~~2
which is expected to have the following output where the column name is the original column name followed by the text in the name column:
id value_A value_B value_C value_D
1 1 2 3
2 1 2
I manage to achieve the splitting for the by using many nested for loops to process on my data row by row. It works on small sample data but once the data gets huge, the time is an issue.
Also,there could be more than 1 value columns, but they all should map into the same name column.
Example output:
id value_A value_B value_C value1_A value1_B value1_C
1 1 2 3 1 2 3
2 1 2 3 1 2 3
You can try dplyr:
library(tidyverse)
df %>%
separate_rows(name, value, sep = "~") %>%
spread(name, value)
id A B C D
1 1 1 2 3 <NA>
2 2 1 <NA> 2
Instead of NA you can fill empty cells by anything you specify within fill = ""
Or baseR and reshape2:
a <- strsplit(as.character(df$name), "~")
b <- strsplit(as.character(df$value), "~")
df2 <- do.call(rbind.data.frame, Map(cbind, df$id, a, b))
library(reshape2)
dcast(df2, V1~V2, value.var = "V3")
A B C D
1 1 2 3 <NA>
2 1 <NA> 2
Here is an option using cSplit/dcast. Split the rows into 'long' format with cSplit and dcast it to 'wide' format
library(splitstackshape)
dcast(cSplit(df, c('name','value'), '~', 'long')[!is.na(value)], id ~ paste0('value_', name))
# id value_A value_B value_C value_D
#1: 1 1 2 3 NA
#2: 2 1 NA NA 2