I have a data frame in which the value of b ranges from 1 to 31, and alpha_1, alpha_2 and alpha_3 can only take the values 0 and 1. For each b value I have 1000 observations, so 31000 observations in total. I want to group the entire dataset by b and count the values in the alpha columns ONLY when the value is 1. The end result would have 31 rows (the unique b values 1:31) together with the count of 1s in each alpha column.
How do I do this in R? I have tried pipe chains in dplyr and nothing seems to work.
We can use
library(dplyr)
df1 %>%
  group_by(b) %>%
  summarise_at(vars(starts_with("alpha")), sum)
Because the alpha columns only contain 0 and 1, the per-group sum is exactly the count of 1s.
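In dplyr 1.0.0 and later, across() supersedes summarise_at(); a minimal sketch on made-up data matching the description (the name df1 and the rbinom fill are assumptions for illustration):
library(dplyr)
# hypothetical data shaped like the question: 31 b values x 1000 rows each
df1 <- data.frame(b = rep(1:31, each = 1000),
                  alpha_1 = rbinom(31000, 1, 0.5),
                  alpha_2 = rbinom(31000, 1, 0.5),
                  alpha_3 = rbinom(31000, 1, 0.5))
df1 %>%
  group_by(b) %>%
  summarise(across(starts_with("alpha"), sum)) # sum of a 0/1 column = count of 1s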
I need to run a loop for my dataset, ds. The dimensions of ds are 4000 x 11. Every country of the world is represented, and each country has data from 1970 to 1999.
The data set has missing data among 8 of its columns. I need to run a loop that calculates how much data is missing PER year; the year is in ds$year.
I am pretty sure the years (1970, 1971, 1972, ...) are numeric values.
This is my current code:
missingds <- c()
for (i in seq_along(ds)) {
  # proportion of NAs in column i (nrow(ds) == 4000)
  missingds[names(ds)[i]] <- sum(is.na(ds[[i]])) / nrow(ds)
}
This gives me the proportion of missing data per variable of ds. I just cannot figure out how to get it to report the proportion across all the variables per year.
I do have an indicator variable ds$missing which reports 1 if there is an NA in any of the columns of that row and 0 if not.
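Given that indicator, the per-year share of incomplete rows is a one-liner in base R; a sketch, assuming ds$year holds the numeric years:
# proportion of rows per year that have at least one NA
tapply(ds$missing, ds$year, mean)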
To count the number of NA values in each column using dplyr you can do:
library(dplyr)
result <- data %>%
  group_by(Year) %>%
  summarise(across(gdp_growth:polity, ~sum(is.na(.))))
In base R you can use aggregate. Note that the formula interface drops incomplete rows by default, which would make every count 0, so na.action = na.pass is needed:
aggregate(cbind(gdp_growth, gdp_per_capita, inf_mort, pop_density, polity) ~ year,
          data, function(x) sum(is.na(x)), na.action = na.pass)
Replace sum with mean if you want the proportion of NA values in each year.
Using data.table:
library(data.table)
setDT(data)[, lapply(.SD, function(x) sum(is.na(x))),
            by = Year, .SDcols = gdp_growth:polity]
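The sum-to-mean swap works here too; a sketch for per-year proportions in data.table:
setDT(data)[, lapply(.SD, function(x) mean(is.na(x))),
            by = Year, .SDcols = gdp_growth:polity]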
I have followed this example (Remove last N rows in data frame with the arbitrary number of rows), but it deletes only the last 50 rows of the whole data frame rather than the last 50 rows of every study site within it. I have a really big data set with multiple study sites; within each study site there are multiple depths, and for each depth a concentration of nutrients.
I want to just delete the last 50 rows of depth for each station.
E.g.
station 1 has 250 depths
station 2 has 1000 depths
station 3 has 150 depths
but keep all the other data consistent.
This just seems to remove the last 50 rows from the whole data frame rather than the last 50 from every station...
df <- df[-seq(nrow(df), nrow(df) - 50), ]
What should I do to add more variables (study site) to filter by?
A potential base R solution would be:
# example data: three stations with 250, 1000 and 150 depth rows
d <- data.frame(station = rep(paste("station", 1:3), c(250, 1000, 150)),
                depth = rnorm(250 + 1000 + 150, 100, 10))
# running row counter within each station
d$grp_counter <- do.call("c", lapply(tapply(d$depth, d$station, length), seq_len))
# station sizes, repeated once per row
d$grp_length <- rep(tapply(d$depth, d$station, length), tapply(d$depth, d$station, length))
# keep all but the last 50 rows of each station
d <- d[d$grp_counter <= (d$grp_length - 50), ]
d
# OR w/o auxiliary vars: subset(d, select = -c(grp_counter, grp_length))
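A quick check that the right number of rows survived (250, 1000 and 150 minus 50 each):
table(d$station)
# station 1 station 2 station 3
#       200       950       100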
We can use the slice function from the dplyr package:
library(dplyr)
df2 <- df %>%
  group_by(Col1) %>%
  slice(1:(n() - 4))
It first groups by the category column and, provided the rows are arranged in the proper order, removes the last n rows (here 4) from the data frame for each category.
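In dplyr 1.0.0 and later, a negative n in slice_head() does the same thing more directly; a sketch for dropping the last 50 rows per group:
df %>%
  group_by(Col1) %>%
  slice_head(n = -50) # negative n keeps all but the last 50 rows of each group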
I have the following data frame in R (the actual data frame is millions of rows with thousands of unique Column A values):
Row  Column A  Column B
1    130077    65
2    130077    65
3    130077    65
4    200040    10
5    200040    10
How can I add up Column B values grouped by Column A values without counting duplicated Column A rows twice? The correct output would be:
130077 65
200040 10
........
I have tried using filter and group_by with no success: the final output does sum values by Column A, but it includes the duplicated values.
An option is to get the distinct rows, then group by 'ColumnA' and take the sum of 'ColumnB':
library(dplyr)
df1 %>%
  distinct(ColumnA, ColumnB) %>% # the example gives the expected output here
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))
Or in base R with unique and aggregate:
aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)
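A reproducible sketch using just the five rows shown above (the space-free column names ColumnA/ColumnB are an assumption):
df1 <- data.frame(ColumnA = c(130077, 130077, 130077, 200040, 200040),
                  ColumnB = c(65, 65, 65, 10, 10))
aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)
#   ColumnA ColumnB
# 1  130077      65
# 2  200040      10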
Similar to dplyr: put count occurrences into new variable, Putting rowwise counts of value occurrences into new variables, how to do that in R with dplyr?, Count observations of distinct values per group and add a new column of counts for each value, or Create columns from factors and count.
I am looking for a way to summarize observations under a certain grouping variable and attach the result to my dataframe (preferably with dplyr, but any solution is appreciated).
Here is the structure of my data:
id   team  teamsize  apr16  may16
123  A     13        0      1
124  B     8         1      1
125  A     13        1      0
126  A     13        0      1
I would like R to group my data by team, count the number of 1s in apr16 (so just add up the 1s for each team), and attach a new variable with the result to my data frame. In a next step, I would like to use this result (it represents the number of users of a certain policy) to calculate the share of users for each team (that's what I need teamsize for), but I haven't gotten there yet.
I have tried several ways to just calculate the number of users, but it did not work out.
DF <- DF %>%
  group_by(team, apr16) %>%
  mutate(users_0416 = n()) %>%
  ungroup()
DF <- DF %>%
  group_by(team, apr16) %>%
  mutate(users_0416 = sum()) %>%
  ungroup()
I feel like this is a really easy thing to do, but I am just not moving on and am really looking forward to any help.
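For reference, the usual pattern is to group by team alone and sum the 0/1 column; a sketch, assuming apr16 is numeric (the names users_0416 and share_0416 follow the question's attempts):
library(dplyr)
DF <- DF %>%
  group_by(team) %>%
  mutate(users_0416 = sum(apr16),                # count of 1s per team
         share_0416 = users_0416 / teamsize) %>% # share of users in the team
  ungroup()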
Is there a dplyr function that can filter a table once the running sum of a column hits a certain value? E.g. if df has 10 rows and the sum of column1 reaches 5 by row 6, rows 7-10 are filtered out?
Do you mean something like this?
library(dplyr)
df <- data.frame(a = 1:10)
df %>%
  filter(cumsum(a) < 8)
# a
#1 1
#2 2
#3 3
Explanation: cumsum is your friend here; in the example above you keep the rows where the cumulative sum of a is less than 8.
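If the row that first reaches the threshold should be kept too (as the "gets to 5 by row 6" wording suggests), a sketch using a lagged running sum: a row survives as long as the total before it is still under the limit.
df %>%
  filter(lag(cumsum(a), default = 0) < 5)
# keeps rows 1-3; row 3 is where the running total first reaches 5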