Replace NA in one row of data frame with values from other - r

I want to replace the NA values in one row with the values from another row. Example data:
group <-c('A','A_old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
The desired result is:
group <-c('A','A_old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
The original table is much larger, so referring to each element one by one and assigning a value is not feasible.
Thanks!
For the sake of the argument below: I need to refer to the rows by their group name, because the original table is big and I cannot work with only two rows. For example, in the table below I would like to fill the NA values in row 1 (group == "A") with the values from row 5 (group == "E"). The data are:
group <-c('A','B','C','D','E','F','G')
year1<- c(NA,'100',NA,'200','300',NA,NA)
year2<- c(NA,'100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
So I want to get:
group <-c('A','B','C','D','E','F','G')
year1<- c('300','100',NA,'200','300',NA,NA)
year2<- c('300','100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)

Other than using fill or na.locf, you could do:
datac %>%
group_by(grp = gsub("_.*", "", group)) %>%
mutate_at(vars(contains("year")),
funs(.[!is.na(.)])) %>%
ungroup() %>% select(-grp)
Output:
# A tibble: 2 x 4
group year1 year2 year3
<chr> <chr> <chr> <chr>
1 A 20 40 20
2 A_old 20 40 230
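In current dplyr releases, mutate_at() and funs() are superseded. The same idea can be sketched with across() and coalesce(), assuming dplyr 1.0 or later is installed. The name filled is only illustrative, and the sketch assumes each group has at most one distinct non-NA value to borrow per column (it takes the first one it finds).

```r
library(dplyr)

datac <- tibble(
  group = c("A", "A_old"),
  year1 = c(NA, "20"),
  year2 = c(NA, "40"),
  year3 = c("20", "230")
)

filled <- datac %>%
  # "A" and "A_old" share the key "A" once the suffix is stripped
  group_by(grp = sub("_.*", "", group)) %>%
  # keep existing values; fill NA with the first non-NA value in the group
  mutate(across(starts_with("year"), ~ coalesce(.x, .x[!is.na(.x)][1]))) %>%
  ungroup() %>%
  select(-grp)

filled
```

Unlike the funs(.[!is.na(.)]) version, this leaves non-missing cells untouched even when a group holds several different non-NA values in one column.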
For your second example, you could do:
data %>%
mutate_at(
vars(contains("year")),
funs(
case_when(
group == "A" & is.na(.) ~ .[group == "E"],
TRUE ~ .)
)
)
Output:
group year1 year2 year3
1 A 300 300 20
2 B 100 100 100
3 C <NA> <NA> 10
4 D 200 200 200
5 E 300 300 300
6 F <NA> 50 150
7 G <NA> 40 230
You can also add other conditions to case_when.
For instance, if you'd additionally like to replace C years with what is there for group D, you would add:
data %>%
mutate_at(
vars(contains("year")),
funs(
case_when(
group == "A" & is.na(.) ~ .[group == "E"],
group == "C" & is.na(.) ~ .[group == "D"],
TRUE ~ .)
)
)
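If the list of target/donor pairs grows, writing one case_when() line per pair gets tedious. A minimal base R sketch of the same replacement driven by a lookup vector follows. The names donors, year_cols and filled are made up for illustration, and the sketch assumes every group name occurs exactly once.

```r
data <- data.frame(
  group = c("A", "B", "C", "D", "E", "F", "G"),
  year1 = c(NA, "100", NA, "200", "300", NA, NA),
  year2 = c(NA, "100", NA, "200", "300", "50", "40"),
  year3 = c("20", "100", "10", "200", "300", "150", "230"),
  stringsAsFactors = FALSE
)

# hypothetical lookup: names are the rows to fill, values are the donor rows
donors <- c(A = "E", C = "D")
year_cols <- grep("^year", names(data), value = TRUE)

filled <- data
for (target in names(donors)) {
  t_row <- match(target, data$group)
  d_row <- match(donors[[target]], data$group)
  # only the columns that are NA in the target row are overwritten
  gaps <- year_cols[is.na(unlist(data[t_row, year_cols]))]
  if (length(gaps) > 0) {
    filled[t_row, gaps] <- data[d_row, gaps]
  }
}

filled
```

Donor values are read from the untouched data, so one replacement cannot feed into another.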

After a very long evening and a headache from R, I managed to get this:
rm(list = ls())
group <-c('A','A old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
group <-c('A','A old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
datac$group <- gsub(' ', '--', datac$group)
datact = t(datac)
colnames(datact) = datact[1, ]
datact = datact[-1, ]
datact[,"A"] <- ifelse(!is.na(datact[,"A"]), datact[,"A"] , datact[,"A--old"])
datactt=t(datact)
group = rownames(datactt)
datactt<-cbind(datactt, group)
rownames(datactt) <- c()
datactt <- as.data.frame(datactt)
sapply(datactt, class)
datactt <- data.frame(lapply(datactt, as.character), stringsAsFactors=FALSE)
datactt$group <- gsub('--', ' ', datactt$group)
Here datactt is (hopefully) the same as the finaldatac that I wanted. I am sure this can't be the best solution, and it is obviously not the prettiest. If anybody has something similar but shorter or more efficient, please post it; I would appreciate the answer.

Related

Groupby and filter output is producing different results

Can someone help me understand what the grouping is doing here, please?
Why do these two pipelines produce different grouped outputs? The first returns the 'A' rows of every ID that has more than one row in total, even when the other rows belong to a different Acronym, while the second returns only the IDs that have more than one 'A' row.
Sample Data:
df <- data.frame(ID = c(1,1,3,4,5,6,6),
Acronym = c('A','B','A','A','B','A','A')
)
df %>%
group_by(ID) %>%
filter(Acronym == 'A',n() > 1)
df %>% filter(Acronym == 'A') %>%
group_by(ID) %>%
filter(n() > 1)
In the first example, rows with Acronym != "A" are still in the data frame when n() is evaluated, so they contribute to the row count n().
In the second example, these rows are removed before grouping, and don't contribute to the row count from n().
If we want the first case to return only 'ID' 6, use sum to get the count of 'A' values in Acronym
library(dplyr)
df %>%
group_by(ID) %>%
filter(sum(Acronym == 'A') > 1)
As mentioned in the other post, it is just that n() is based on the whole group count and not on the number of 'A's. If we are unsure about the filter, create a column with mutate and check the output
df %>%
group_by(ID) %>%
mutate(ind = Acronym == 'A' & n() > 1)
# A tibble: 7 × 3
# Groups: ID [5]
ID Acronym ind
<dbl> <chr> <lgl>
1 1 A TRUE
2 1 B FALSE
3 3 A FALSE
4 4 A FALSE
5 5 B FALSE
6 6 A TRUE
7 6 A TRUE
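The two counts being compared can also be computed side by side without dplyr. This is only a sketch in base R; group_size, a_count, first_result and second_result are illustrative names, not part of the answers above.

```r
df <- data.frame(ID = c(1, 1, 3, 4, 5, 6, 6),
                 Acronym = c("A", "B", "A", "A", "B", "A", "A"))

# rows per ID, regardless of Acronym (what n() sees after group_by(ID))
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)
# rows per ID that are 'A' (what n() sees if the filter on 'A' runs first)
a_count <- ave(as.integer(df$Acronym == "A"), df$ID, FUN = sum)

first_result  <- df[df$Acronym == "A" & group_size > 1, ]
second_result <- df[df$Acronym == "A" & a_count > 1, ]

first_result
second_result
```

ID 1 appears in first_result because its group has two rows (one 'A' and one 'B'), but it drops out of second_result because only one of those rows is an 'A'.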

Matching and returning values based on condition or ID

This seems like it should be fairly easy, but I'm having trouble with it.
Example: I have a dataframe with two columns IDs and perc_change. I want to know which unique IDs have had more than 30% change.
IDs <- c(1,1,2,1,1,2,2,2,3,2,3,4,5,6,3)
perc_change <- c(50,40,60,70,80,30,20,40,23,25,10,30,12,7,70)
df <- data.frame(IDs, perc_change)
So far:
if (df$perc_change > 30) {
unique(df$IDs)
} else {
}
This obviously doesn't work because it returns all unique IDs. Should I be finding the index and then matching it or something?
Thanks in advance!
To also get the values for each ID, we could do:
library(dplyr)
df %>%
group_by(IDs) %>%
filter(perc_change > 30) %>%
mutate(values = paste0(perc_change, collapse = ","), .keep="unused") %>%
distinct(IDs, .keep_all = TRUE)
Output:
IDs values
<dbl> <chr>
1 1 50,40,70,80
2 2 60,40
3 3 70
Just use [ to subset and take the unique - i.e. no need for if/else conditions
with(df, unique(IDs[perc_change > 30]))
[1] 1 2 3
We can group, filter and count using dplyr
library(dplyr)
df %>%
group_by(IDs) %>%
filter(perc_change > 30) %>%
count(IDs)
# A tibble: 3 x 2
# Groups: IDs [3]
IDs n
<dbl> <int>
1 1 4
2 2 2
3 3 1
unique(df[df$perc_change > 30,"IDs"])
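As for why the if attempt in the question fails: if() expects a single TRUE or FALSE, whereas df$perc_change > 30 is a logical vector with one element per row (since R 4.2.0 a condition longer than one is an error). The subsetting answers use that vector as a row mask instead. A small sketch of the intermediate steps, with over_30 and ids_over as illustrative names:

```r
IDs <- c(1, 1, 2, 1, 1, 2, 2, 2, 3, 2, 3, 4, 5, 6, 3)
perc_change <- c(50, 40, 60, 70, 80, 30, 20, 40, 23, 25, 10, 30, 12, 7, 70)
df <- data.frame(IDs, perc_change)

# one TRUE/FALSE per row, not a single condition
over_30 <- df$perc_change > 30
length(over_30)

# keep the IDs of the rows where the mask is TRUE, then de-duplicate
ids_over <- unique(df$IDs[over_30])
ids_over
```

Note that ID 4 is not returned: its only value is exactly 30, and the comparison is strict.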

Combine attributes from two columns and sum the values from duplicate rows

This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: the value 50 in the second row corresponds to the value broad in the first row. Similarly, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In short, the values for name c are related to the values for name a.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, also "c_" can be added to the names with paste.
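The renaming and the "c_" prefix mentioned here could look roughly like the following. It is a base R sketch that reuses the cumsum() pairing idea but swaps dcast() for merge(), so it is not the reshape2 answer itself. The names a_rows, c_rows, paired and out are illustrative.

```r
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                  name = c("a", "c", "a", "c", "a", "c", "a", "c"),
                  value = c("broad", 50, "mangrove", 50, "mangrove", 50, "coniferous", 50))

# every "a" row starts a new pair; the following "c" row gets the same id
df1$grp <- cumsum(df1$name == "a")

a_rows <- df1[df1$name == "a", ]
c_rows <- df1[df1$name == "c", ]
paired <- merge(a_rows, c_rows, by = c("ID", "grp"), suffixes = c("_a", "_c"))
paired$value_c <- as.numeric(paired$value_c)

# sum the numeric values per ID and category, then build the final names
out <- aggregate(value_c ~ ID + value_a, paired, sum)
out <- data.frame(ID = out$ID,
                  name = paste0("c_", out$value_a),
                  value = out$value_c)
out <- out[order(out$ID, out$name), ]
out
```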
Using tidyverse:
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
The main problem with your data frame is that a single column contains both names and values, and that is the first thing you should fix. My advice is to always transform the original data frame into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage the full power of the tidyverse, data.table, or your framework of choice.
Note that the temporary variable value_a could be included in the pipeline directly; I have kept it separate for clarity. The main idea is to separate values and species into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.
Might not be the most elegant, but it works:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is something improved that is actually human-readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
An additional option:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(id_cols = ID, names_from = name, values_from = value, values_fn = list) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = T), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values. In this data the count is 4 for age group 1-2. Similarly, I want to group ages 3-4 and count the values. Here the count for age group 3-4 is 6.
How can I group ages and aggregate the values that correspond to them?
I know this way:
data.frame(df %>% group_by(df$Age) %>% tally())
But that aggregates the values for each individual Age.
I want the values aggregated over groups of several ages, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second, a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = F)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
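The interval labels (0,2] and (2,4] can be swapped for readable names through the labels argument of cut(). A dependency-free sketch follows; the label strings "1-2" and "3-4" and the name counts are my own choice for illustration.

```r
df <- data.frame(age = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                 Name = letters[1:10],
                 stringsAsFactors = FALSE)

# breaks define the intervals (0,2] and (2,4]; labels name them in order
df$grp <- cut(df$age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# count the rows in each age group
counts <- table(df$grp)
as.data.frame(counts)
```

Ages outside the breaks would become NA in grp and be dropped by table(), so the breaks should cover the full age range.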

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset that looks like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the employment values of the rows with naics '423620' and '423720'. (So there are two conditions: 1. the same county code, 2. those two naics codes.) The sum should go into the first row ('423620'), and the second row ('423720') should be removed.
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
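The same relabel-then-sum idea can be sketched in base R with ave() and aggregate(). This is one possible translation rather than the answer's own code, and merge_codes and result are illustrative names. It assumes naics is stored as character.

```r
data <- data.frame(
  county = c("1001", "1001", "1001", "1202", "1202", "1303", "1303"),
  naics = c("423620", "423630", "423720", "423620", "423720", "423550", "423720"),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

# relabel 423720 as 423620, but only when a county has both codes
merge_codes <- function(v) {
  if (all(c("423620", "423720") %in% v)) v[v == "423720"] <- "423620"
  v
}

# apply the helper within each county, then sum employment per county and code
data$naics <- ave(data$naics, data$county, FUN = merge_codes)
result <- aggregate(employment ~ county + naics, data, sum)
result <- result[order(result$county, result$naics), ]
result
```

County 1303 keeps its 423720 row because it has no 423620 row to merge into, which matches the expected data2 in the question.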
