I have a CSV like this, saved as an object in R named df1.
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
I want to:
Count the non-zero values in each column for the year 2014. For example, for column X the count is 2 (not 3, because I want 2014 data only). For column Y the count is 4. For column Z the count is 2.
Sum all the counts for each column
This is what I tried:
count_total <- sum(df1$x != 0 &
df1$y != 0 &
df1&z != 0 &
df1$Year == 2014)
count_total
I want the output to simply be 1 (i.e. only the 2nd row in df1 has no 0's).
However, this does not align with my COUNTIFS in Excel. In Excel, it's like this:
=COUNTIFS('df1'!$A$2:$A$859,"<>0",'df1'!$B$2:$B$859,"<>0",
'df1'!$C$2:$C$859,"<>0",'df1'!$D$2:$D$859,2014)
Wondering if I mistyped something in R? I'm a dplyr user but can't find anything particularly useful on Google.
Thank you very much!
One way is to use rowSums on a subset of the data:
sum(rowSums(subset(df1, Year == 2014) == 0) == 0)
#[1] 1
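For comparison, the condition-by-condition version from the question can be written so that it mirrors the COUNTIFS formula. This is only a minimal sketch, assuming the columns are named exactly X, Y, Z and Year as in the sample. R is case-sensitive, so df1$x does not refer to column X, and the third condition needs $ rather than &.

```r
# Sample data from the question
df1 <- data.frame(
  X = c(0, 3, 5, 0, 2),
  Y = c(2, 1, 4, 3, 1),
  Z = c(4, 3, 0, 0, 0),
  Year = c(2014, 2014, 2014, 2014, 2015)
)

# Count rows where X, Y and Z are all non-zero and Year is 2014
count_total <- with(df1, sum(X != 0 & Y != 0 & Z != 0 & Year == 2014))
count_total
# [1] 1
```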
You can do this with aggregate then colSums to get the totals by column.
agg <- aggregate(. ~ Year, df1, function(x) sum(x != 0))
agg
# Year X Y Z
#1 2014 2 4 2
#2 2015 1 1 0
colSums(agg[-1])
#X Y Z
#3 5 2
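If only the 2014 counts are wanted, as in the question, the same idea can be sketched without aggregate by subsetting first. This assumes the sample columns X, Y, Z and Year; counts_2014 is just an illustrative name.

```r
# Sample data from the question
df1 <- data.frame(
  X = c(0, 3, 5, 0, 2),
  Y = c(2, 1, 4, 3, 1),
  Z = c(4, 3, 0, 0, 0),
  Year = c(2014, 2014, 2014, 2014, 2015)
)

# Keep 2014 rows, drop the Year column, then count non-zero cells per column
counts_2014 <- colSums(subset(df1, Year == 2014, select = -Year) != 0)
counts_2014
# X Y Z
# 2 4 2

# Total of the per-column counts
sum(counts_2014)
# [1] 8
```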
Data.
df1 <- read.table(text = "
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
",header = TRUE)
dplyr approach:
library(dplyr)
df1 %>%
group_by(Year) %>%
summarise_at(vars(X:Z), function (x) sum(x != 0))
Output:
# A tibble: 2 x 4
# Year X Y Z
# <int> <int> <int> <int>
# 1 2014 2 4 2
# 2 2015 1 1 0
Alternative using summaryBy.
library(doBy)
summaryBy(list(c('X','Y','Z'), c('Year')), df1, FUN= function(x) sum(x!=0), keep.names=T)
Year X Y Z
1 2014 2 4 2
2 2015 1 1 0
When needed use colSums as explained before.
How can I run a loop over multiple columns changing consecutive values to true values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
if(df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
df[row,1:2] = df[row,1:2] - df[row - 1,1:2]
}
}
But the code changed it line by line and did not give the correct values for each bin.
If you still want to use a for loop, the following works. It is simple, but you first have to copy the relevant columns, because each output value is the difference between two rows of the original data. Here DF holds that copy and is created outside the loop so its values stay intact; otherwise each iteration would read values already overwritten by the previous iteration and the final output would be wrong:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed the changed row is used in calculating row i+1 rather than the original row i. To fix that run the loop in reverse order. That is use nrow(df):2 in the for statement. Alternately try one of these which do not use any loops and also have the advantage of not overwriting the input -- something which makes the code easier to debug.
1) Base R Use ave to perform Diff by group where Diff uses diff to actually perform the differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
Time = ave(Time, Subject_ID, FUN = Diff),
Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the above except we use lag:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(Time = Time - lag(Time, default = 0),
Value = Value - lag(Value, default = 0)) %>%
ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
group_by(Subject_ID) %>%
mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
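The reverse-order fix mentioned at the start of this answer can be sketched as follows. It is the loop from the question with only the iteration order changed, so each row is differenced against a predecessor that has not been modified yet. It assumes df has at least two rows, since nrow(df):2 would count upwards otherwise.

```r
df <- data.frame(
  Time = c(1, 3, 7, 8),
  Value = c(6, 10, 18, 20),
  Bin = 1:4,
  Subject_ID = c(1, 1, 1, 1)
)

# Walk from the last row up to row 2 so that row - 1 still holds original values
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}
df
#   Time Value Bin Subject_ID
# 1    1     6   1          1
# 2    2     4   2          1
# 3    4     8   3          1
# 4    1     2   4          1
```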
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
# Time Value Bin Subject_ID
#1 1 6 1 1
#2 2 4 2 1
#3 4 8 3 1
#4 1 2 4 1
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
dplyr one liner
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1
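Note that the two one-liners above difference the whole column and do not restart at a new Subject_ID. A hedged base R sketch that does restart per subject is shown here; the second subject in the data and the helper name first_diff are made up for illustration.

```r
# Illustrative data with two subjects (the second one is invented)
df <- data.frame(
  Time = c(1, 3, 7, 2, 5),
  Value = c(6, 10, 18, 4, 9),
  Bin = c(1, 2, 3, 1, 2),
  Subject_ID = c(1, 1, 1, 2, 2)
)

# Keep the first value of each group, then successive differences
first_diff <- function(x) c(x[1], diff(x))

cols <- c("Time", "Value")
out <- df
# ave() applies first_diff within each Subject_ID and keeps the row order
out[cols] <- lapply(df[cols], function(x) ave(x, df$Subject_ID, FUN = first_diff))
out
```

Writing to out rather than df keeps the input intact, which makes the result easier to check against the original.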
I am trying to expand on the answer to a solved problem, "Take Sum of a Variable if Combination of Values in Two Other Columns are Unique",
but because I am new to Stack Overflow I can't comment directly on that post, so here is my problem:
I have a dataset like the following but with about 100 columns of binary data as shown in "ani1" and "bni2" columns.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season, so that I get one row of totals (for column 3 and every column after it) for each unique combination of location and season.
The problem is not all the columns have a 1 value for every combination of location and season and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(everything(), ~sum(.x, na.rm = TRUE))) %>% # everything() skips the grouping columns; starts_with("ani") would drop bni2
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons ani1 bni2
<chr> <int> <int> <int>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
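For comparison, the loop in the question can be replaced by a single base R aggregate call. This is a sketch on the question's sample data; the `.` on the left-hand side stands for every column not named on the right, so the roughly 100 differently named columns do not need to be listed. The formula interface drops rows containing NA by default, which may or may not be what is wanted on real data.

```r
Locations <- c("A", "A", "A", "A", "B", "B", "C", "C", "D", "D", "D")
seasons <- c("2", "2", "3", "4", "2", "3", "1", "2", "2", "4", "4")
ani1 <- c(1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0)
bni2 <- c(0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1)
df <- data.frame(Locations, seasons, ani1, bni2)

# Sum every remaining column within each Locations/seasons combination
totals <- aggregate(. ~ Locations + seasons, data = df, FUN = sum)

# aggregate() returns rows ordered by seasons first; reorder by Locations, then seasons
totals <- totals[order(totals$Locations, totals$seasons), ]
totals
```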
Ciao,
I posted a similar inquiry before, but what I need has changed. Sorry about that; I work for a school district and they need different information!
Here is my replicating example.
a=c(1,2,3,4,5,6)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,b,c,d,e,f,g,h,i)
names(mydata) = c("id","test1","year1","test2","year2","test3","year3","anytest","year")
The nuts and bolts is to find the first '1' in test1 and test2 and test3 and then add to column value in year1 or year2 or year3 based on where the first '1' is found. I am aiming to search through each row and find the first test column that is equal to 1. The new column I am aiming to create is "anytest." This column is 1 if test1 or test2 or test3 equals to 1. If none of them do then it equals to 0. This ignores NA values..if test1 and test2 are NA but test3 equals to 0 then anytest equals to 0. Now I have made progress I think using this code:
anytestTRY = ifelse(rowSums(mydata[,c("test1","test2","test3")] == 1, na.rm=TRUE) > 0, 1, 0)
But now I am at a crossroads, because I also want to search each row for the first of test1, test2 or test3 that equals 1 and report the year for that test. So if test1 equals 0, test2 is NA and test3 equals 1, I want the value of year3 to go into date. Finally, if test1, test2 and test3 are all 0 or NA, or some combination of the two, then date should be the last year, which here is 2014.
We can use rowSums from base R to create the anytest column
i1 <- grep('^test', names(mydata)) # test1..test3 only; a plain 'test' would also match "anytest"
NA^(rowSums(is.na(mydata[i1])) == 3) * (rowSums(mydata[i1] == 1, na.rm = TRUE) !=0)
#[1] 1 1 0 1 0 NA
If we also need a 'date' column, use max.col to get the column index of the first maximum 'test' value in each row, then extract the matching 'year' by indexing with the row index and column index bound together with cbind
i2 <- grep('year', names(mydata))
m1 <- replace(mydata[i1], is.na(mydata[i1]), 0)
i3 <- !rowSums(m1 == 1)
date <- rep(NA, nrow(mydata))
date[!i3] <- mydata[i2][!i3,][cbind(seq_len(sum(!i3)), max.col(m1[!i3,], 'first'))]
date[i3] <- do.call(pmax, mydata[i2][i3,])
date
#[1] 2010 2012 2014 2012 2014 2014
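The max.col step is easier to follow on a tiny example. The sketch below uses made-up matrices named tests and years rather than mydata. It shows why rows without any 1 have to be handled separately: max.col with ties.method = "first" still returns column 1 for an all-zero row.

```r
# Made-up test results (NA already replaced by 0) and matching years
tests <- matrix(c(1, 1, 1,
                  0, 1, 0,
                  0, 0, 0), nrow = 3, byrow = TRUE)
years <- matrix(rep(c(2010, 2012, 2014), each = 3), nrow = 3)

# Column of the first maximum in each row; the all-zero row also yields 1
first_col <- max.col(tests, ties.method = "first")
first_col
# [1] 1 2 1

# Rows that really contain a 1
has_one <- rowSums(tests == 1) > 0

# Pick the year at (row, first_col) where a 1 exists, else the latest year in the row
date <- ifelse(has_one,
               years[cbind(seq_len(nrow(tests)), first_col)],
               apply(years, 1, max))
date
# [1] 2010 2012 2014
```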
a=c(1,2,3,4,5,6)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,b,c,d,e,f,g)
names(mydata) = c("id","test1","year1","test2","year2","test3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ .$year1, # get year based on first test that is 1
.$test2==1 ~ .$year2,
.$test3==1 ~ .$year3,
TRUE ~ max(c(mydata$year1, mydata$year2, mydata$year3)))), # if no test is 1 get the maximum year in the original dataset
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1, na.rm = T) > 0 ~ "1", # create anytest column
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest()
which returns:
# # A tibble: 6 x 9
# id date anytest test1 year1 test2 year2 test3 year3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2010 1 1 2010 1 2012 1 2014
# 2 2 2012 1 0 2010 1 2012 0 2014
# 3 3 2014 0 NA 2010 0 2012 0 2014
# 4 4 2012 1 NA 2010 1 2012 0 2014
# 5 5 2014 0 0 2010 0 2012 0 2014
# 6 6 2014 NA NA 2010 NA 2012 NA 2014
How can I obtain new vectors of cumulative sums from the following data?
A:
year month day x y
.
.
.
2000 10 20 10 0
2000 10 21 2 0
2000 10 22 5 1
2000 10 23 9 0
.
.
.
Wherever “y” > 0, how can I sum the “x” values over the following ranges to build these vectors:
B: the sum of x from the date where y>0 back through the two days before it (5+2+10=17), for every date where y>0.
C: the sum of x over the 10 days preceding those two days, i.e. from 2000-10-10 to 2000-10-20, for every date where y>0. In this case y>0 on 2000-10-22, so the sum covers the 10 days before the two days before that event.
To solve your first question, here is one idea. We can create a column called Flag that marks the rows meeting the condition. In the following code, A2 is the intermediate data frame with the Flag column. Flag == 1 marks each row where y > 0 together with the two rows before it.
library(dplyr)
library(tidyr)
A2 <- A %>%
mutate(Flag1 = ifelse(y > 0, 1, NA)) %>%
mutate(Flag2 = lead(Flag1, 2)) %>%
fill(Flag1, .direction = "up") %>%
fill(Flag2, .direction = "down") %>%
mutate(Flag = Flag1 + Flag2 - 1)
A2
# year month day x y Flag1 Flag2 Flag
# 1 2000 10 20 10 0 1 1 1
# 2 2000 10 21 2 0 1 1 1
# 3 2000 10 22 5 1 1 1 1
# 4 2000 10 23 9 0 NA 1 NA
After that, we can filter for Flag == 1 and calculate the sum using summarise. A3 is the final output.
A3 <- A2 %>%
filter(Flag == 1) %>%
summarise(x_sum = sum(x))
A3
# x_sum
# 1 17
As for your second question, you did not provide a good example dataset and, at least to me, it is not clear what you want, so I will not try to answer it right now. If you can provide a proper update, I may think about it.
DATA
A <- read.table(text = "year month day x y
2000 10 20 10 0
2000 10 21 2 0
2000 10 22 5 1
2000 10 23 9 0",
header = TRUE, stringsAsFactors = FALSE)
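The same total for B can also be sketched in base R without helper columns, returning one sum per event. This assumes the rows are consecutive days with no gaps, so "two days before" means "two rows before". window_sum is a hypothetical helper written for this illustration, not an existing function.

```r
A <- data.frame(
  year = 2000, month = 10, day = 20:23,
  x = c(10, 2, 5, 9),
  y = c(0, 0, 1, 0)
)

# Hypothetical helper: sum x over rows (i - from) to (i - to), clipped to the data
window_sum <- function(x, i, from, to) {
  rows <- (i - from):(i - to)
  rows <- rows[rows >= 1 & rows <= length(x)]
  sum(x[rows])
}

# Row numbers where an event (y > 0) occurs
events <- which(A$y > 0)

# B: the event row plus the two rows before it
B_sum <- vapply(events, function(i) window_sum(A$x, i, from = 2, to = 0), numeric(1))
B_sum
# [1] 17
```

Under the same consecutive-days assumption, one possible reading of C would be window_sum(A$x, i, from = 12, to = 3), but that depends on how the 10-day range is meant to be counted.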
I need to fill $Year with missing values of the sequence by the factor of $Country. The $Count column can just be padded out with 0's.
Country Year Count
A 1 1
A 2 1
A 4 2
B 1 1
B 3 1
So I end up with
Country Year Count
A 1 1
A 2 1
A 3 0
A 4 2
B 1 1
B 2 0
B 3 1
Hope that's clear guys, thanks in advance!
This is a dplyr/tidyr solution using complete and full_seq:
library(dplyr)
library(tidyr)
df %>% group_by(Country) %>% complete(Year=full_seq(Year,1),fill=list(Count=0))
Country Year Count
<chr> <dbl> <dbl>
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
library(data.table)
# d is your original data.frame
setDT(d)
foo <- d[, .(Year = min(Year):max(Year)), Country]
res <- merge(d, foo, all.y = TRUE)[is.na(Count), Count := 0]
Similar to @PoGibas' answer:
library(data.table)
# set default values
def = list(Count = 0L)
# create table with all levels
fullDT = setkey(DT[, .(Year = seq(min(Year), max(Year))), by=Country])
# initialize to defaults
fullDT[, names(def) := def ]
# overwrite from data
fullDT[DT, names(def) := mget(sprintf("i.%s", names(def))) ]
which gives
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 0
4: A 4 2
5: B 1 1
6: B 2 0
7: B 3 1
This generalizes to having more columns (besides Count). I guess similar functionality exists in the "tidyverse", with a name like "expand" or "complete".
Another base R idea can be to split on Country, use setdiff to find the missing values from the seq(max(Year)), and rbind them to original data frame. Use do.call to rbind the list back to a data frame, i.e.
d1 <- do.call(rbind, c(lapply(split(df, df$Country), function(i){
x <- rbind(i, data.frame(Country = i$Country[1],
Year = setdiff(seq(max(i$Year)), i$Year),
Count = 0));
x[with(x, order(Year)),]}), make.row.names = FALSE))
which gives,
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1
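Two details of the split approach are worth checking on real data: seq(max(Year)) always starts at 1, and building the filler data frame fails when a country has no missing years, because the column lengths then differ. The variant below is a sketch that sidesteps both by using seq(min, max) and rep(); country C and its values are invented for illustration.

```r
df <- data.frame(
  Country = c("A", "A", "A", "B", "B", "C", "C"),
  Year = c(1, 2, 4, 1, 3, 5, 6),
  Count = c(1, 1, 2, 1, 1, 1, 1)
)

filled <- do.call(rbind, c(lapply(split(df, df$Country), function(g) {
  # Years absent between this country's first and last year
  missing_years <- setdiff(seq(min(g$Year), max(g$Year)), g$Year)
  n <- length(missing_years)
  # rep() keeps all columns the same length, even when n is 0
  filler <- data.frame(Country = rep(g$Country[1], n),
                       Year = missing_years,
                       Count = rep(0, n))
  out <- rbind(g, filler)
  out[order(out$Year), ]
}), make.row.names = FALSE))
filled
```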
> setkey(DT,Country,Year)
> DT[setkey(DT[, .(min(Year):max(Year)), by = Country], Country, V1)]
Country Year Count
1: A 1 1
2: A 2 1
3: A 3 NA
4: A 4 2
5: B 1 1
6: B 2 NA
7: B 3 1
Another dplyr and tidyr solution.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(Country) %>%
do(tibble(Country = unique(.$Country), # tibble() replaces the deprecated data_frame()
Year = full_seq(.$Year, 1))) %>%
full_join(dt, by = c("Country", "Year")) %>%
replace_na(list(Count = 0))
Here is an approach in base R that uses tapply, do.call, range, and seq to calculate the year sequences. It then constructs a data.frame from the named list that is returned, merges this onto the original (which adds the desired rows), and finally fills in the missing values.
# get named list with year sequences
temp <- tapply(dat$Year, dat$Country, function(x) do.call(seq, as.list(range(x))))
# construct data.frame
mydf <- data.frame(Year=unlist(temp), Country=rep(names(temp), lengths(temp)))
# merge onto original
mydf <- merge(dat, mydf, all=TRUE)
# fill in missing values
mydf[is.na(mydf)] <- 0
This returns
mydf
Country Year Count
1 A 1 1
2 A 2 1
3 A 3 0
4 A 4 2
5 B 1 1
6 B 2 0
7 B 3 1