Create a new column by aggregating multiple columns in R - r

Background
I have a dataset, df, in which I would like to combine several columns into a new one. I need to multiply the Type, Span and Population columns and store the result in a new Output column.
ID Status Type Span State Population
A Yes 2 70% Ga 10000
Desired output
ID Status Type Span State Population Output
A Yes 2 70% Ga 10000 14000
dput
structure(list(ID = structure(1L, .Label = "A ", class = "factor"),
Status = structure(1L, .Label = "Yes", class = "factor"),
Type = 2L, Span = structure(1L, .Label = "70%", class = "factor"),
State = structure(1L, .Label = "Ga", class = "factor"), Population = 10000L), class = "data.frame",
row.names = c(NA,
-1L))
This is what I have tried
df %>%
  mutate(Output = Type * Span * Population)

Here we create a new column from the values of several other columns. mutate can take the Span percentage of Population and multiply it by 'Type'. Note that 'Span' is not numeric because it contains a % sign (in the dput it is a factor), so we extract the numeric part with parse_number, divide it by 100, and then multiply by 'Population' and 'Type'.
library(dplyr)
df %>%
  mutate(Output = Type * Population * readr::parse_number(as.character(Span))/100)
# ID Status Type Span State Population Output
#1 A Yes 2 70% Ga 10000 14000
If the columns 'Type' and 'Population' are not numeric (for example, if they are of factor class), it is better to convert them first with as.numeric(as.character(df$Type)), and likewise for 'Population'. Another option is type.convert(df, as.is = TRUE), after which you can work with the converted dataset.
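To illustrate the conversion point above, here is a minimal sketch on a made-up one-row data frame in which Type and Population are stored as factors. The name df_fac and its contents are assumptions for illustration only, not the asker's data. Calling as.numeric() directly on a factor returns the internal level code, which is why the as.character() step matters.

```r
# Hypothetical data: the numeric-looking columns are stored as factors
df_fac <- data.frame(
  Type = factor("2"),
  Span = factor("70%"),
  Population = factor("10000")
)

# as.numeric() on a factor gives the level code, not the printed value
code_only  <- as.numeric(df_fac$Population)
real_value <- as.numeric(as.character(df_fac$Population))

# type.convert() re-guesses each column's type; Span stays character because of "%"
df_conv <- type.convert(df_fac, as.is = TRUE)

# Strip the percent sign, turn the percentage into a fraction, then multiply
span_fraction <- as.numeric(sub("%", "", df_conv$Span, fixed = TRUE)) / 100
df_conv$Output <- df_conv$Type * df_conv$Population * span_fraction
```

Here code_only is the level code while real_value is the number that is actually printed in the data, so only the second form is safe to use in arithmetic.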

In base R, we can remove the '%' sign using sub, convert the result to numeric, and multiply the values:
df$output <- with(df, Type * as.numeric(sub('%', '', Span)) * Population/100)
df
# ID Status Type Span State Population output
#1 A Yes 2 70% Ga 10000 14000

Related

Why am I having Issues with Separating Rows in a Dataframe?

I'm having an issue with separating rows in a dataframe that I'm working in.
In my dataframe, there is a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indexes, indicating which rows have the same data. For example, indices 2:3 means that rows 2 to 3 have the same data.
Here is the code that I am working with.
offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like
This is what it looks like after I try to separate the rows
My code works fine for indices such as 2:3, where the difference is 1. However, for indices like 7:10, it separates the rows as 7 and 10 instead of 7, 8, 9, 10, which is how I want it to be done. How can I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")
This should work. I expect it will work for further rows too, since I tested for ranges greater than two in officialIndices.
First I extracted the start and end rows, and used their difference to determine how many rows are needed. Then tidyr::uncount() will add that many copies.
library(dplyr); library(tidyr)
data_sep <- data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":") %>%
  # Use 1 row, and more if "end" is defined and larger than "start"
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
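The start/end idea above repeats each row the right number of times but leaves every copy with the same start and end. As a hedged sketch in base R, the following also fills in the individual index for each copy. It assumes, as the separate(sep = ":") call does, that the ranges are stored as strings such as "7:10"; the data frame offices_demo is made up for illustration.

```r
# Hypothetical input: ranges stored as "start:end" strings, single values as "n"
offices_demo <- data.frame(
  position = c("Senate", "Governor", "Council"),
  officialIndices = c("2:3", "4", "7:10"),
  stringsAsFactors = FALSE
)

# Split each entry on ":" and build the full integer sequence start..end
parts <- strsplit(offices_demo$officialIndices, ":", fixed = TRUE)
index_list <- lapply(parts, function(p) {
  p <- as.integer(p)
  seq(p[1], p[length(p)])
})

# Repeat each row once per index, then write the individual indices back
expanded <- offices_demo[rep(seq_len(nrow(offices_demo)), lengths(index_list)), , drop = FALSE]
expanded$officialIndices <- unlist(index_list)
rownames(expanded) <- NULL
```

With this input, the "7:10" row becomes four rows carrying 7, 8, 9 and 10. If officialIndices is a true list column of integer vectors instead, the strsplit step would not apply and the list could be used as index_list directly.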

Calculating percent of categorical responses (with grouping) in R

I have the following dataframe:
IV Device1 Device2 Device3
Color Same Same Missing
Color Different Same Missing
Color Same Unique Missing
Shape Same Missing Same
Shape Different Same Different
Explanation: each IV (Independent Variable) is composed of several measurements (the ‘Color’ section is composed of 3 different measurements, while 'Shape' is composed of 2).
Each data point has one of 4 possible categorical values: Same/Different/Unique/Missing. 'Missing' means that there is no value for that measurement in the case of that device, while the other 3 values represent the existing result for that measurement.
Question: I want to calculate for each device the percent of times that it has a Same/Different/Unique value (thus generating 3 different percentages), out of the total number of values for that IV (not including cases where there is a ‘Missing’ value).
For example, device 2 would have the following percentages:
Color- 67% same, 0% different, 33% unique.
Shape- 100% same, 0% different, 0% unique.
Thank you!
This is not a tidy solution, but you can use it until someone else posts a better one:
# Replace all "Missing" with NAs
df[df == "Missing"] <- NA
# Create factor levels
df[,-1] <- lapply(df[,-1], function(x) {
  factor(x, levels = c('Same', 'Different', 'Unique'))
})
# Custom function to calculate percent of categorical responses
custom <- function(x) {
  y <- length(na.omit(x))
  if(y > 0)
    return(round((table(x)/y)*100))
  else
    return(rep(0, 3))
}
library(purrr)
# Split the dataframe on IV, remove the IV column and apply the custom function
Final <- df %>% split(df$IV) %>%
  map(., function(x) {
    x <- x[, -1]
    t(sapply(x, custom))
  })
Output
Final is a list of two matrices, one per IV:
$Color
Same Different Unique
Device1 67 33 0
Device2 67 0 33
Device3 0 0 0
$Shape
Same Different Unique
Device1 50 50 0
Device2 100 0 0
Device3 50 50 0
Data
structure(list(IV = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("Color",
"Shape"), class = "factor"), Device1 = structure(c(1L, 2L, 1L,
1L, 2L), .Label = c("Same", "Different", "Unique"), class = "factor"),
Device2 = structure(c(1L, 1L, 3L, NA, 1L), .Label = c("Same",
"Different", "Unique"), class = "factor"), Device3 = structure(c(NA,
NA, NA, 1L, 2L), .Label = c("Same", "Different", "Unique"
), class = "factor")), .Names = c("IV", "Device1", "Device2",
"Device3"), row.names = c(NA, -5L), class = "data.frame")
Quick and dirty: first, replace 'Missing' with NA using your preferred method (sed, Excel, etc.); then you can use table on each of the columns to get the summary statistics:
myStats <- function(x){
  table(factor(x, levels = c('Same', 'Different', 'Unique')))/sum(table(x))
}
# Drop the IV column so that only the device columns are summarised
apply(yourData[, -1], 2, myStats)
This returns, for each device, the proportion of each response over all rows. It does not split the result by IV.
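The table() idea can be combined with split() to get the per-IV percentages the question asks for. The following is a minimal base R sketch, not either answerer's code: the helper pct and the object names are made up, and returning NA when a device has no non-missing values in a group is a choice made here (the first answer returns 0 in that case).

```r
# The example data from the question, with "Missing" left as a plain string
responses <- data.frame(
  IV      = c("Color", "Color", "Color", "Shape", "Shape"),
  Device1 = c("Same", "Different", "Same", "Same", "Different"),
  Device2 = c("Same", "Same", "Unique", "Missing", "Same"),
  Device3 = c("Missing", "Missing", "Missing", "Same", "Different"),
  stringsAsFactors = FALSE
)

response_levels <- c("Same", "Different", "Unique")

# Hypothetical helper: percent of each level among the non-missing values
pct <- function(x) {
  # Values outside response_levels (here "Missing") become NA and are not counted
  counts <- table(factor(x, levels = response_levels))
  if (sum(counts) == 0) {
    return(setNames(rep(NA_real_, length(response_levels)), response_levels))
  }
  setNames(as.numeric(round(100 * counts / sum(counts))), response_levels)
}

# Split the device columns by IV and summarise every device within each group
result <- lapply(split(responses[-1], responses$IV),
                 function(g) t(sapply(g, pct)))
```

result is a list with one matrix per IV, devices in rows and response levels in columns, so result$Color["Device2", ] holds the Device2 percentages for Color.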

lookup data in a datatable and add it to a new column

I have two data tables as shown below:
bigrams
w1w2           freq  w1          w2
common names   1     common      names
department of  4     department  of
family name    6     family      name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1 freq
common 2
department 3
family 4
name 5
names 1
of 9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
w1w2           freq  w1          w2     w1freq  w2freq
common names   1     common      names  2       1
department of  4     department  of     3       9
family name    6     family      name   4       5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1 but when I try to do the same for w2 the i.freq column is updated to reflect the freq of w2.
How can I get freq for both w1 and w2 in separate columns?
Note: I have already seen solutions to data.table Lookup value and translate and Modify column of a data.table based on another column and add the new column
You can do two joins; since v1.9.6 of data.table, the on= argument lets you join on differently named columns.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9
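If the goal is the exact column names from the desired output (w1freq and w2freq) rather than i.freq and i.freq.1, one option is an update join that adds each column by reference. This is a sketch of that alternative, not the answer's own code; it rebuilds the two tables from the question so that it runs on its own.

```r
library(data.table)

# The two tables from the question
bigrams <- data.table(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name")
)
unigrams <- data.table(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L)
)

# Update join: for rows of bigrams matching unigrams on w1, copy unigrams' freq
bigrams[unigrams, on = "w1", w1freq := i.freq]

# Same lookup, but matching bigrams$w2 against unigrams$w1
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]
```

Because := modifies bigrams in place, the original row order is kept and no keys need to be set.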
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
  rename(w1w2_string = w1w2,
         w1w2_freq = freq) %>%
  gather(order, string,
         w1, w2) %>%
  left_join(unigrams %>%
              rename(string = w1)) %>%
  gather(type, value,
         string, freq) %>%
  unite(order_type, order, type) %>%
  spread(order_type, value)
Edit: Explanation
The first observation you can make is that bigrams contains in fact information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram. Then we can merge in the other unigram data. Now note that your unigram has two different pieces of information per row: the frequency for the unigram, and the text of the unigram. Convert to long form again so that the unit of analysis is a piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.
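The reshaping described here can also be written with pivot_longer() and pivot_wider(), which tidyr (1.0.0 and later) provides as the successors of gather() and spread(). The following is a hedged sketch of the same long-then-wide idea, not the answerer's code; the column names slot, word and word_freq are made up for illustration.

```r
library(dplyr)
library(tidyr)

# The two tables from the question, as plain data frames
bigrams_df <- data.frame(
  w1w2 = c("common names", "department of", "family name"),
  freq = c(1L, 4L, 6L),
  w1   = c("common", "department", "family"),
  w2   = c("names", "of", "name"),
  stringsAsFactors = FALSE
)
unigrams_df <- data.frame(
  w1   = c("common", "department", "family", "name", "names", "of"),
  freq = c(2L, 3L, 4L, 5L, 1L, 9L),
  stringsAsFactors = FALSE
)

wide <- bigrams_df %>%
  # Long form: one row per unigram position (w1 or w2) within each bigram
  pivot_longer(c(w1, w2), names_to = "slot", values_to = "word") %>%
  # Attach the unigram frequency for each word
  left_join(rename(unigrams_df, word = w1, word_freq = freq), by = "word") %>%
  # Back to wide form: one word column and one frequency column per slot
  pivot_wider(names_from = slot, values_from = c(word, word_freq))
```

The result has one row per bigram with columns word_w1, word_w2, word_freq_w1 and word_freq_w2. Unlike the gather()-based version, the frequencies stay integer because they are never stacked into the same column as the strings.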

How to match column values in two dataframes and make rownames with the matching corresponding column values

I have this dataframe called mydf. I want to match its current column against the key.genomloc column of another dataframe called secondf, extract the corresponding key.wesmut.genom values, and use them as the row names, as shown in the result.
This is what I have tried, but does not work as desired:
current <- secondf[,"key.genomloc"]
replacement <- secondf[,"key.wesmut.genom"]
v <- mydf[,"current"] %in% current
w <- current %in% mydf[,"current"]
rownames(mydf)<-mydf[,"current"]
rownames(mydf)[v] <- replacement[w]
Data:
mydf <-structure(list(current = structure(c(5L, 2L), .Label = c("chr1:115256529:T:C",
"chr1:115256530:G:T", "chr1:115258744:C:A", "chr1:115258744:C:T",
"chr1:115258747:C:T", "chr11:32417945:T:C", "chr12:25398284:C:A",
"chr12:25398284:C:T", "chr13:28592640:A:C", "chr13:28592641:T:A",
"chr13:28592642:C:A", "chr13:28592642:C:G", "chr15:90631838:C:T",
"chr15:90631934:C:T", "chr2:209113112:C:T", "chr2:209113113:G:A",
"chr2:209113113:G:C", "chr2:209113113:G:T", "chr2:25457242:C:T",
"chr2:25457243:G:A", "chr2:25457243:G:T", "chr4:55599320:G:T"
), class = "factor"), `index` = c(1451738, 1451718)), .Names = c("current",
"index"), row.names = 1:2, class = "data.frame")
secondf<-structure(c("WES:FLT3:p.D835H", "WES:FLT3:p.D835N", "WES:FLT3:p.D835Y",
"WES:FLT3:p.D835A", "WES:FLT3:p.D835V", "chr1:115256530:G:T",
"chr13:28592642:C:T", "chr13:28592642:C:A", "chr1:115258747:C:T",
"chr13:28592641:T:A"), .Dim = c(5L, 2L), .Dimnames = list(NULL,
c("key.wesmut.genom", "key.genomloc")))
Result
rowname current index
WES:FLT3:p.D835A chr1:115258747:C:T 1451738
WES:FLT3:p.D835H chr1:115256530:G:T 1451718
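We can use match to look up each value of current in the second column of secondf and pull the corresponding entry from the first column: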
mydf$rowname <- secondf[,1][match(mydf$current,secondf[,2])]
mydf[c(3,1:2)]
# rowname current index
#1 WES:FLT3:p.D835A chr1:115258747:C:T 1451738
#2 WES:FLT3:p.D835H chr1:115256530:G:T 1451718
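The answer above stores the matched values in a regular column. If actual row names are wanted, as the question describes, the same match() lookup can feed rownames(). The sketch below uses a cut-down copy of the question's data plus one extra unmatched row; the names mydf_demo and secondf_demo and the fallback to the original value are assumptions made for illustration.

```r
# Cut-down version of the question's data, plus one value with no match
mydf_demo <- data.frame(
  current = c("chr1:115258747:C:T", "chr1:115256530:G:T", "chr2:25457242:C:T"),
  index   = c(1451738, 1451718, 1451700),
  stringsAsFactors = FALSE
)
secondf_demo <- cbind(
  key.wesmut.genom = c("WES:FLT3:p.D835H", "WES:FLT3:p.D835A"),
  key.genomloc     = c("chr1:115256530:G:T", "chr1:115258747:C:T")
)

# Position of each 'current' value in the lookup column (NA if absent)
pos <- match(mydf_demo$current, secondf_demo[, "key.genomloc"])

# Take the replacement label; keep the original value where nothing matched
new_names <- secondf_demo[pos, "key.wesmut.genom"]
new_names[is.na(pos)] <- mydf_demo$current[is.na(pos)]

# Row names must be unique and non-missing, which the fallback helps to ensure
rownames(mydf_demo) <- new_names
```

If two rows of mydf mapped to the same label, rownames<- would fail because of duplicates, so keeping the label in an ordinary column, as in the answer, is the more forgiving option.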

Returning first row of group

I have a dataframe consisting of an ID that is the same for each element in a group, two datetimes, and the time interval between them. One of the datetimes is my relevant time marker. I would like to get a subset of the dataframe that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame according to 1. ID and 2. relevant datetime. However, I wasn't able to return the first entry for each new group.
I then have been looking at the aggregate() as well as ddply() function but I could not find an option in both that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given the fact that the dataframe is sorted in a way that the first row of each new group is the row I am looking for, it would suffice to just return a subset with each row that has a different ID than the one before (which is the start-row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
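As a self-contained illustration of these two steps, here is a sketch on a few rows taken from the example data, deliberately shuffled so that the sort matters. The name visits is made up, and the Start values are converted with an explicit UTC time zone, which is an assumption (the original data uses the local time zone).

```r
# A few rows from the example data, in shuffled order
visits <- data.frame(
  ID       = c(1454L, 1727L, 1322L, 1454L, 1727L),
  Start    = as.POSIXct(c(1357369644, 1358789411, 1357221074, 1357038060, 1358691675),
                        origin = "1970-01-01", tz = "UTC"),
  Interval = c(3246.15, 12375.2, 2403.97, 1206.17, 2704.3)
)

# Step 1: sort by ID, then by the relevant datetime
ordered_visits <- visits[order(visits$ID, visits$Start), ]

# Step 2: duplicated() flags every repeat of an ID, so negating it keeps
# only the first (earliest) row of each ID, with all columns untouched
first_visits <- ordered_visits[!duplicated(ordered_visits$ID), ]
```

first_visits has one row per ID, and its Interval values are the ones that belong to the earliest Start of each group rather than an aggregate.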
As you don't provide any data, here is an example using base R with a sample data frame :
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT : As Ananda suggests in his comment, the following call to aggregate is better :
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply :
library(plyr)
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
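This can also be achieved with dplyr, using group_by and the slice family of functions. slice_head returns the first row of each group as the data is currently ordered, so sort by Start first:
library(dplyr)
data %>%
  arrange(Start) %>%
  group_by(ID) %>%
  slice_head(n = 1)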
