I have a dataset that contains the ID numbers of the cats at a centre and their ages. The dataset looks like this:
ID Number   Animal Type   Age
121012      Cat            0.002
128129      Cat            1.000
429202      Cat            0.920
238232      Cat           15.000
132265      Cat            0.050
234235      Cat            9.000
682892      Cat           16.000
A kitten has an age value below 1; in other words, a kitten's age is a fraction rather than a whole number.
Adult cats, meanwhile, have an age of 1 or more, recorded as a whole number.
I need to split, or better yet group, the kitten population from the adult population, but I have no idea how.
(I'm still pretty new to this, only had it for 4 weeks, so sorry if I sound like a noob.)
Many thanks to anyone who can help!
In addition to the above answer, here are two more methods:
Method 1
df_kitten <- subset(df, Age <1)
df_adult <- subset(df, Age >= 1)
Method 2
df_kitten <- df[df$Age < 1,]
df_adult <- df[df$Age >= 1,]
Thanks
Balaji
If you don't want to split your data, you can use dplyr::group_by to add a grouping structure to your data.frame.
library(tidyverse)
df %>%
  mutate(isKitten = Age < 1) %>%
  group_by(isKitten)
Any further data manipulations will then be performed on the group level.
For example, you can calculate the mean age per group:
df %>%
mutate(isKitten = Age < 1) %>%
group_by(isKitten) %>%
summarise(meanAge = mean(Age))
## A tibble: 2 x 2
# isKitten meanAge
# <lgl> <dbl>
#1 FALSE 10.2
#2 TRUE 0.324
df_split = split(df, df$Age < 1)
Or you might want to create a column that says if the cat is kitten or adult:
df$type_of_cat <- ifelse(df$Age < 1, "Kitten", "Adult")
df_split = split(df, df$type_of_cat)
I am assuming your table only contains cats.
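To make the labelling and splitting approach concrete, here is a minimal base R sketch. The data frame is re-typed from the question, and the column names ID and AnimalType are assumptions, since the original header contains spaces.

```r
# Data re-typed from the question; ID and AnimalType are assumed column names.
df <- data.frame(
  ID         = c(121012, 128129, 429202, 238232, 132265, 234235, 682892),
  AnimalType = "Cat",
  Age        = c(0.002, 1, 0.92, 15, 0.05, 9, 16)
)

# Label each cat, then split the data frame into one piece per label.
df$type_of_cat <- ifelse(df$Age < 1, "Kitten", "Adult")
df_split <- split(df, df$type_of_cat)

# Number of cats per label.
counts <- table(df$type_of_cat)
```

df_split is a named list, so the two populations are available as df_split$Kitten and df_split$Adult.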
I would like to create a column in my data frame that gives the percentage of each category. The total (100%) would be the sum of the column Score.
My data looks like
Client Score
<chr> <int>
1 RP 125
2 DM 30
Expected
Client Score %
<chr> <int> <dbl>
1 RP 125 80.6
2 DM 30 19.4
Thanks!
Note that special characters in column names are best avoided; a name like % has to be wrapped in backticks every time it is used.
library(dplyr)
df %>%
mutate(`%` = round(Score/sum(Score, na.rm = TRUE)*100, 1))
Client Score %
1 RP 125 80.6
2 DM 30 19.4
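As a sketch of the point about column names, the same calculation can store the result under a syntactic name. The name pct is an assumption chosen for illustration, and base R is used here only to keep the example free of dependencies.

```r
# Data re-typed from the question.
df <- data.frame(Client = c("RP", "DM"), Score = c(125L, 30L))

# Share of the total score, as a percentage rounded to one decimal place.
# 'pct' is a hypothetical column name that needs no backticks or quotes.
df$pct <- round(df$Score / sum(df$Score, na.rm = TRUE) * 100, 1)
```

The column can then be referred to as df$pct in later code.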
Probably the best way is to use dplyr. I recreated your data below and used the mutate function to create a new column on the dataframe.
#Creation of data
Client <- c("RP","DM")
Score <- c(125,30)
DF <- data.frame(Client,Score)
DF
#install.packages("dplyr") #Remove first # and install if library doesn't load
library(dplyr) #If this doesn't run, install library using code above.
#Shows new column
DF %>%
mutate("%" = round((Score/sum(Score))*100,1))
#Overwrites dataframe with new column added
DF %>%
mutate("%" = round((Score/sum(Score))*100,1)) -> DF
Using base R functions the same goal can be achieved.
X <- round((DF$Score/sum(DF$Score))*100,1) #Creation of percentage
DF$"%" <- X #Storage of X as % to dataframe
DF #Check to see it exists
In base R, we may use proportions() (available in R 4.0.1 and later)
df[["%"]] <- round(proportions(df$Score) * 100, 1)
-output
> df
Client Score %
1 RP 125 80.6
2 DM 30 19.4
I have a dataframe with character and numeric data. I would like to use dplyr to create a summary grouped by time points and trials generating the following:
averages
standard deviations
variation
ratio between time points
(etc etc)
I feel like all of this could be done in the dplyr pipe, but I am struggling to make a ratio of averages between time points within trials.
I fully admit that I may be carrying around a hammer looking for nails, so please feel free to recommend solutions that use other packages or functions, but ideally I'd like simple, straightforward code for ease of use by multiple collaborators.
library(dplyr)
# creating an example DF
num <- runif(100, 50, 3200)
smpl <- 1:100
df <- data.frame( num, smpl)
df$time <- "time1"
df$time[seq(2,100,2)] <- "time2"
df$trial <- "a"
df$trial[26:50] <- "b"
df$trial[51:75] <- "c"
df$trial[75:100] <- "d"
# using the magic of pipelines to calculate useful things
df1 <- df %>%
group_by(time, trial) %>%
summarise(avg = mean(num),
var = var(num),
stdev = sd(num))
I'd love to get [the ratio time2/time1 of the avg for each trial] included in this block above, but I don't know how to call "avg" specifically by "time1" vs "time2" within the pipe.
From here on, nothing does quite what I'm hoping for...
df1 <- df1[with(df1,order(trial,time)),]
# this better resembles my actual DF structure,
# so reordering it will make some of my next attempts to solve this make more sense
I tried to use the fact that 'every other line' is different (this is not ideal because each df will have a different number of rows, so I will either introduce NAs or have to constantly change these numbers, or write a function to change them):
tm2 <- data.frame(x=df1$avg[seq(2,4,2)])
tm1 <- data.frame(x=df1$avg[seq(1,3,2)])
so minimally, this is the ratio I'd like included in the df, but tied to the avg & trial columns:
tm2/tm1
It doesn't matter to me 'which' time row this ratio ends up in, so long as it is consistent across all the trials (so if a column of ratios has "blank" for every "time1" and "value" for every "time2", that's fine).
# I added in a separate column to allow 'match' later
tm1$time <- "time1"
tm2$time <- "time1" # to keep them all 'in row'
df1$avg_tm1 <- tm1$x[match(df1$time, tm1$time)]
df1$avg_tm2 <- tm2$x[match(df1$time, tm2$time)]
but this fails to match by 'trial' as well, since that info is lost in the new tm1 df; this really makes me think it should all be done in dplyr the first time...
Then I tried to create a new column in the tm2 df with the ratio
tm2$ratio <-tm2$x/tm1$x
and add in the ratio values only if the avg matches
df1$ratio <- tm2$ratio[match(tm2$x, df1$avg)]
This might work, but when I extract the avg values, it rounds, so the numbers do not match exactly. I'm also cautious about this because if I process ridiculous amounts of data, there's a higher and higher chance that two random averages will be similar enough to misplace these ratios.
I tried several other things that completely failed, so let's pretend that something worked and entered the ratio into the df1 as separate columns
Then any further calculations or annotations are straightforward:
df2 <- df1 %>%
mutate(ratio = avg_tm2/avg_tm1,
lost = 1- ratio,
word = paste0(round(lost*100),"%"))
But I am still stuck on 'how' to call specific cells inside the pipe or which other tools/packages to use to calculate deltas or ratios between cells in the same column.
Thanks in advance
We could group by 'trial' and mutate to create the 'ratio' column
df1 %>%
group_by(trial) %>%
mutate(ratio = last(avg)/first(avg))
# A tibble: 8 x 6
# Groups: trial [4]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 time1 a 1815. 715630. 846. 0.795
#2 time1 b 2012. 1299823. 1140. 0.686
#3 time1 c 1505. 878168. 937. 1.09
#4 time1 d 1387. 902364. 950. 1.17
#5 time2 a 1444. 998943. 999. 0.795
#6 time2 b 1380. 720135. 849. 0.686
#7 time2 c 1641. 1205778. 1098. 1.09
#8 time2 d 1619. 582418. 763. 1.17
NOTE: We used set.seed(2) for creating the dataset
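The last()/first() approach relies on the rows being ordered so that time1 comes before time2 within each trial. A hedged variant that selects each average by its time label instead might look like the sketch below; the small df1 is a stand-in built from rounded averages like those reported in this thread, not the asker's real data.

```r
library(dplyr)

# Stand-in for df1 (assumed values, rounded averages per time and trial).
df1 <- data.frame(
  time  = rep(c("time1", "time2"), each = 4),
  trial = rep(c("a", "b", "c", "d"), times = 2),
  avg   = c(1815.203, 2012.436, 1505.474, 1386.876,
            1443.731, 1379.981, 1641.439, 1619.341)
)

df_ratio <- df1 %>%
  group_by(trial) %>%
  # pick each average by its time label, not by its position in the group
  mutate(ratio = avg[time == "time2"] / avg[time == "time1"]) %>%
  ungroup()
```

This assumes exactly one time1 row and one time2 row per trial; with missing or duplicated time points the subsetting would return zero or several values and the mutate() call would fail or recycle.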
Work out a separate data.frame:
library(dplyr)
library(tidyr) # provides spread()
set.seed(2)
# your code above to generate df1
df2 <- select(df1, time, trial, avg) %>%
spread(time, avg) %>%
mutate(ratio = time2/time1)
df2
# # A tibble: 4 × 4
# trial time1 time2 ratio
# <chr> <dbl> <dbl> <dbl>
# 1 a 1815.203 1443.731 0.7953555
# 2 b 2012.436 1379.981 0.6857266
# 3 c 1505.474 1641.439 1.0903135
# 4 d 1386.876 1619.341 1.1676176
and now you can merge the relevant column onto the original frame:
left_join(df1, select(df2, trial, ratio), by="trial")
# Source: local data frame [8 x 6]
# Groups: time [?]
# time trial avg var stdev ratio
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 time1 a 1815.203 715630.4 845.9494 0.7953555
# 2 time1 b 2012.436 1299823.3 1140.0979 0.6857266
# 3 time1 c 1505.474 878168.3 937.1063 1.0903135
# 4 time1 d 1386.876 902363.7 949.9282 1.1676176
# 5 time2 a 1443.731 998943.3 999.4715 0.7953555
# 6 time2 b 1379.981 720134.6 848.6074 0.6857266
# 7 time2 c 1641.439 1205778.0 1098.0792 1.0903135
# 8 time2 d 1619.341 582417.5 763.1629 1.1676176
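In current versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshape-then-join idea with pivot_wider() follows; the df1 here is a stand-in with assumed values matching the averages printed above.

```r
library(dplyr)
library(tidyr)

# Stand-in for df1 (assumed values, one average per time and trial).
df1 <- data.frame(
  time  = rep(c("time1", "time2"), each = 4),
  trial = rep(c("a", "b", "c", "d"), times = 2),
  avg   = c(1815.203, 2012.436, 1505.474, 1386.876,
            1443.731, 1379.981, 1641.439, 1619.341)
)

ratios <- df1 %>%
  ungroup() %>%                       # drops any grouping left over from summarise()
  select(trial, time, avg) %>%
  pivot_wider(names_from = time, values_from = avg) %>%  # one row per trial
  mutate(ratio = time2 / time1)

# Attach the per-trial ratio to every row of the original frame.
df1_with_ratio <- left_join(df1, select(ratios, trial, ratio), by = "trial")
```

The wide table also makes it easy to add other between-time quantities, such as a difference time2 - time1, in the same mutate() call.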
I have a large (~200k rows) dataframe that is structured like this:
df <-
data.frame(c(1,1,1,1,1), c('blue','blue','blue','blue','blue'), c('m','m','m','m','m'), c(2016,2016,2016,2016,2016),c(3,4,5,6,7), c(10,20,30,40,50))
colnames(df) <- c('id', 'color', 'size', 'year', 'week','revenue')
Let's say it is currently week 7, and I want to compare the trailing 4 week average of revenue to the current week's revenue. What I would like to do is create a new column for that average when all of the identifiers match.
df_new <-
data.frame(1, 'blue', 'm', 2016,7,50, 25 )
colnames(df_new) <- c('id', 'color', 'size', 'year', 'week','revenue', 't4ave')
How can I accomplish this efficiently? Thank you for the help
Good question. for loops are pretty inefficient in R, but since you do have to check the conditions of prior entries, this is the only solution I can think of (mind you, I'm only an intermediate at R myself):
# assumes rows are sorted by week within each id/color/size/year combination
keys <- c("id", "color", "size", "year")
df$t4ave <- NA_real_
df$diff <- NA_real_
for (i in seq_len(nrow(df)))
{
  if (i <= 4) next # fewer than 4 prior entries
  prev <- (i - 4):(i - 1)
  # condition for all identifiers to match up across the 4 prior entries
  # (week is left out: it necessarily differs from row to row)
  same <- all(sapply(keys, function(k) all(df[[k]][prev] == df[[k]][i])))
  if (same)
  {
    # avg of last 4 entries' revenues
    df$t4ave[i] <- mean(df$revenue[prev])
    # difference between this entry and the last 4's average
    df$diff[i] <- df$revenue[i] - df$t4ave[i]
  }
}
This code will probably take forever, but it should work. If this is a one time thing for when the code needs to run, then it should be okay. Otherwise, hopefully others will be able to advise.
A solution using dplyr and zoo. The idea is to group by the variables that are the same, such as id, color, size, and year. After that, use rollmean to calculate the rolling mean of revenue. Use na.pad = TRUE and align = "right" so that each window ends at the current week and incomplete windows become NA. Finally, use lag to shift the results down by one row, so that each week is paired with the average of the four preceding weeks.
library(dplyr)
library(zoo)
df2 <- df %>%
group_by(id, color, size, year) %>%
mutate(t4ave = rollmean(revenue, 4, na.pad = TRUE, align = "right")) %>%
mutate(t4ave = lag(t4ave))
df2
# A tibble: 5 x 7
# Groups: id, color, size, year [1]
id color size year week revenue t4ave
<dbl> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 1 blue m 2016 3 10 NA
2 1 blue m 2016 4 20 NA
3 1 blue m 2016 5 30 NA
4 1 blue m 2016 6 40 NA
5 1 blue m 2016 7 50 25
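If adding zoo is not an option, the trailing average can also be sketched in base R with cumsum() and ave(). The helper trailing_mean() is a hypothetical name introduced here, and the sketch assumes rows are sorted by week within each group and that revenue has no missing values.

```r
# Data re-typed from the question.
df <- data.frame(id = 1, color = "blue", size = "m", year = 2016,
                 week = 3:7, revenue = c(10, 20, 30, 40, 50))

# Hypothetical helper: mean of the k values preceding each position,
# NA where fewer than k earlier values exist.
trailing_mean <- function(x, k = 4) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n > k) {
    cs <- cumsum(c(0, x))        # cs[j + 1] is the sum of x[1..j]
    idx <- (k + 1):n
    out[idx] <- (cs[idx] - cs[idx - k]) / k
  }
  out
}

# Apply the helper within each id/color/size/year combination.
df$t4ave <- ave(df$revenue, df$id, df$color, df$size, df$year,
                FUN = trailing_mean)
```

Because the running sums are computed once per group, this avoids looping over rows, which may matter at around 200k rows.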
I am having trouble using combinations of ddply and merge to aggregate some variables. The data frame that I am using is really large, so I am putting an example below:
data_sample <- cbind.data.frame(c(123,123,123,321,321,134,145,000),
c('j', 'f','j','f','f','o','j','f'),
c(seq(110,180, by = 10)))
colnames(data_sample) <- c('Person','Expense_Type','Expense_Value')
I want to calculate, for each person, the percentage of the value of expense of type j on the person's total expense.
library(plyr)
data_sample2 <- ddply(data_sample, c('Person'), transform, total = sum(Expense_Value))
data_sample2 <- ddply(data_sample2, c('Person','Expense_Type'), transform, empresa = sum(Expense_Value))
This is what I've done to list total expenses by type, but the problem is that not all individuals have expenses of type j, so their percentage should be 0, and I don't know how to keep only one line per person with the percentage of total expenses of type j.
I might have not made myself clear.
Thank you!
We can use the by function:
by(data_sample, data_sample$Person, FUN = function(dat){
sum(dat[dat$Expense_Type == 'j',]$Expense_Value) / sum(dat$Expense_Value)
})
We could also make use of the dplyr package:
library(dplyr)
data_sample %>%
group_by(Person) %>%
summarise(Percent_J = sum(ifelse(Expense_Type == 'j', Expense_Value, 0)) / sum(Expense_Value))
# A tibble: 5 × 2
Person Percent_J
<dbl> <dbl>
1 0 0.0000000
2 123 0.6666667
3 134 0.0000000
4 145 1.0000000
5 321 0.0000000
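Note that the values above are proportions between 0 and 1. A base R sketch that returns one row per person with an actual percentage is shown below; the data frame is re-typed from the question, and the output column name Percent_J is reused from the dplyr answer for illustration.

```r
# Data re-typed from the question.
data_sample <- data.frame(
  Person        = c(123, 123, 123, 321, 321, 134, 145, 0),
  Expense_Type  = c("j", "f", "j", "f", "f", "o", "j", "f"),
  Expense_Value = seq(110, 180, by = 10)
)

# Keep the value only for type-j expenses, 0 otherwise, so that people
# without any type-j expense end up with a share of 0 instead of dropping out.
j_value <- ifelse(data_sample$Expense_Type == "j", data_sample$Expense_Value, 0)

j_total   <- tapply(j_value, data_sample$Person, sum)
all_total <- tapply(data_sample$Expense_Value, data_sample$Person, sum)

result <- data.frame(
  Person    = as.numeric(names(j_total)),
  Percent_J = as.vector(j_total / all_total) * 100
)
```

tapply() sorts the groups by Person, so both totals line up row for row.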
I am trying to calculate the families sizes from a data frame, which also contains two types of events : family members who died, and those who left the family. I would like to take into account these two parameters in order to compute the actual family size.
Here is a reproducible example of my problem, with 3 families only:
family <- factor(rep(c("001","002","003"), c(10,8,15)), levels=c("001","002","003"), labels=c("001","002","003"), ordered=TRUE)
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left) ; DF
I could count N = total family members (in each family) in a second dataframe DF2, by simply using table()
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N" ; DF2
family N
1 001 10
2 002 8
3 003 15
But I cannot find a proper way to get the actual number of people (for example, by creating a new variable N2 in DF2), calculated by subtracting from N the number of members who died or left the family. I suppose I have to relate the two data frames DF and DF2 in some way. I have looked for related questions on this site but could not find the right answer...
If anyone has a good idea, it would be great !
Thank you in advance..
Deni
Logic: first group_by(family), then calculate two numbers: (i) the total number of observations in each group, and (ii) that total minus the sum of dead and left.
In dplyr, n() gives the total number of observations in each group.
In data.table, .N does the same job.
library(dplyr)
DF %>% group_by(family) %>% summarise( total = n(), current = n()-sum(dead,left, na.rm = TRUE))
# family total current
# (fctr) (int) (dbl)
#1 001 10 6
#2 002 8 4
#3 003 15 7
library(data.table)
# setDT() converts a data.frame to a data.table by reference; if DF is already a data.table, just use DF.
setDT(DF)[, .(total = .N, current = .N - sum(dead, left, na.rm = TRUE)), by = family]
# family total current
#1: 001 10 6
#2: 002 8 4
#3: 003 15 7
Here is a base R option
do.call(data.frame, aggregate(dl~family, transform(DF, dl = dead + left),
FUN = function(x) c(total=length(x), current=length(x) - sum(x))))
Or a modified version is
transform(aggregate(. ~ family, transform(DF, total = 1,
current = dead + left)[c(1,4:5)], FUN = sum), current = total - current)
# family total current
#1 001 10 6
#2 002 8 4
#3 003 15 7
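Since the question asks for a new variable N2 in DF2, here is a minimal base R sketch that relates the two data frames directly. It counts members who neither died nor left, which would also avoid double counting if someone were flagged in both columns; that possibility is an assumption, as it does not occur in the example data.

```r
# Data re-typed from the question.
family <- factor(rep(c("001", "002", "003"), c(10, 8, 15)))
dead <- c(0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0)
left <- c(0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0)
DF <- data.frame(family, dead, left)

# Total members per family, as in the question.
DF2 <- with(DF, data.frame(table(family)))
colnames(DF2)[2] <- "N"

# Members still present: neither dead nor left, counted per family.
present <- with(DF, tapply(dead == 0 & left == 0, family, sum))

# Look the counts up by family label so the rows of DF2 line up.
DF2$N2 <- as.numeric(present[as.character(DF2$family)])
```

Indexing present by the family label, rather than relying on row order, keeps the result correct even if DF2 is later reordered.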
I finally found another solution that works fine (from another post) and computes everything from the original DF table. It uses the ddply function from plyr:
library(plyr)
DF <- ddply(DF,.(family),transform,total=length(family))
DF <- ddply(DF,.(family),transform,actual=length(family)-sum(dead=="1")-sum(left=="1"))
DF
Thanks a lot to everyone who helped ! Deni