R Data wrangling - r

I am new to R. I have a CSV file that contains these values:
A, , ,
,B, ,
, ,C1,
, , ,D1
, , ,D2
, ,C2,
, , ,D3
, , ,D4
Loading the data into a data frame:
dat = read.csv("~/RData/test.csv", header = FALSE)
dat
#   V1 V2 V3 V4
# 1  A
# 2     B
# 3       C1
# 4          D1
# 5          D2
# 6       C2
# 7          D3
# 8          D4
I need to wrangle this into a data frame of this form:
A,B,C1,D1
A,B,C1,D2
A,B,C2,D3
A,B,C2,D4
Thanks in advance!

Here is a solution using dplyr and tidyr. It follows the link in Gregor's comment, but instead of the zoo package it uses fill from tidyr, na.omit from base R, and distinct from dplyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
  fill(everything(), .direction = "down") %>%
  na.omit() %>%
  distinct(V4, .keep_all = TRUE)
dt2
V1 V2 V3 V4
1 A B C1 D1
2 A B C1 D2
3 A B C2 D3
4 A B C2 D4
DATA
dt <- read.table(text = "V1 V2 V3 V4
1 A NA NA NA
2 NA B NA NA
3 NA NA C1 NA
4 NA NA NA D1
5 NA NA NA D2
6 NA NA C2 NA
7 NA NA NA D3
8 NA NA NA D4",
header = TRUE, stringsAsFactors = FALSE)

By using zoo
library(zoo)
df[df==' '] <- NA
df[1:3] <- lapply(df[1:3], na.locf0, fromLast = FALSE)
df <- df[!is.na(df$V4),]
df
giving:
V1 V2 V3 V4
4 A B C1 D1
5 A B C1 D2
7 A B C2 D3
8 A B C2 D4
Alternatively, with magrittr the same steps can be written as a pipeline:
library(magrittr)
library(zoo)
df %>%
  replace(. == ' ', NA) %>%
  replace(1:3, lapply(.[1:3], na.locf0, fromLast = FALSE)) %>%
  subset(!is.na(V4))
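For comparison, the same fill-down-then-filter idea can be sketched in base R without zoo or tidyr. This is only an illustration: fill_down is a hypothetical helper, not part of any package, and the data frame is rebuilt by hand with blanks already converted to NA.

```r
# Hypothetical helper (not from the answers above): carry the last
# non-NA value forward, leaving any leading NAs as NA.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # idx 0 (leading NAs) maps to NA
}

dat <- data.frame(
  V1 = c("A", NA, NA, NA, NA, NA, NA, NA),
  V2 = c(NA, "B", NA, NA, NA, NA, NA, NA),
  V3 = c(NA, NA, "C1", NA, NA, "C2", NA, NA),
  V4 = c(NA, NA, NA, "D1", "D2", NA, "D3", "D4"),
  stringsAsFactors = FALSE
)

dat[1:3] <- lapply(dat[1:3], fill_down)   # fill only the parent columns
res <- dat[!is.na(dat$V4), ]              # keep only the leaf rows
rownames(res) <- NULL
res
```

Only V1 to V3 are filled; filling V4 as well would leave a stale D value on the C2 row, which is why the answers above either skip V4 or deduplicate on it.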

Related

how to change row values based on information from another dataframe in R

I have the original df:
A <- c("A1", "A2", "A3", "A4")
B <- c(1,0,1,NA)
C <- c(0,1,0,NA)
D <- c(NA, 1, 0, NA)
df <- data.frame(A, B, C, D)
And my second df2:
A <- c("A2", "A3")
df2 <- data.frame(A)
I would like to modify df so that the result, df_modified, looks like this:
A B C D
A1 1 0 NA
A2 NA NA NA
A3 NA NA NA
A4 NA NA NA
My current code, which fills every row with NA, is:
df_modified <- df %>% mutate(B = case_when(df$A == df2$A ~ NA),
                             C = case_when(df$A == df2$A ~ NA),
                             D = case_when(df$A == df2$A ~ NA))
How can I do this correctly?
In base R this is easier: use the logical vector as the row index, drop the first column with -1 as the column index, and assign NA to the selected elements.
df[df$A %in% df2$A, -1] <- NA
-output
> df
A B C D
1 A1 1 0 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA
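The attempt in the question most likely returns NA everywhere because == compares the two vectors position by position, recycling the shorter one, whereas %in% tests membership. A minimal sketch of the difference, using plain vectors with the same values as df$A and df2$A (the names a_all and a_sel are made up for illustration):

```r
a_all <- c("A1", "A2", "A3", "A4")   # same values as df$A
a_sel <- c("A2", "A3")               # same values as df2$A

# Position-wise comparison; a_sel is recycled to A2, A3, A2, A3.
eq <- a_all == a_sel
eq

# Membership test: is each element of a_all found anywhere in a_sel?
member <- a_all %in% a_sel
member
```

With == no position happens to line up, so every case_when condition is FALSE and case_when falls back to NA for every row.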
Or, to use the tidyverse, use across:
library(dplyr)
df %>%
  mutate(across(where(is.numeric), ~ case_when(!A %in% df2$A ~ .)))
-output
A B C D
1 A1 1 0 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA
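Here is an alternative dplyr approach. Note that it does not reproduce the requested output exactly: for A1 it returns C = 1 rather than 0, because it tests whether the first and last values in each group are equal instead of keeping the original value.
bind_rows(df, df2) %>%
  group_by(A) %>%
  mutate(across(c(B,C,D), ~first(.)==last(.))*1) %>%
  distinct()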
A B C D
<chr> <dbl> <dbl> <dbl>
1 A1 1 1 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA

Replace NA in row with value in adjacent row (not only one row)

input data
V1 V2 V3 V4 V5 #header
a b c d e #full data 1
NA f g NA NA
NA NA NA i NA
NA j NA NA k
a1 b1 c1 d1 e1 #full data 2
NA NA f1 g1 NA
Expected output
V1 V2 V3 V4 V5
a bfj cg di ek
a1 b1 c1f1 d1g1 e1
This link is useful:
Replace NA in row with value in adjacent row "ROW" not column
I used a lot of for loops, and my code is very messy.
Here's an approach using dplyr. First, I identify the rows with no NAs by counting the NAs in each row. Then I use the cumulative count of those complete rows to define groups. Within each group, I paste all the rows' values (excluding NAs) together column by column.
library(dplyr)
df1 %>%
  rowwise() %>% mutate(full = sum(is.na(c_across()))) %>% ungroup() %>%
  group_by(group = cumsum(full == 0)) %>%
  summarize(across(.fns = ~paste0(na.omit(.x), collapse = ""))) %>%
  select(-group, -full)
# A tibble: 2 × 5
V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr>
1 a bfj cg di e
2 a1 b1 c1f1 d1g1 e1
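The same grouping idea can be sketched in base R. This is an illustration under assumptions: df1 is rebuilt by hand, with k placed in V5 as the expected output implies, and collapse_non_na is a hypothetical helper name.

```r
df1 <- data.frame(
  V1 = c("a", NA, NA, NA, "a1", NA),
  V2 = c("b", "f", NA, "j", "b1", NA),
  V3 = c("c", "g", NA, NA, "c1", "f1"),
  V4 = c("d", NA, "i", NA, "d1", "g1"),
  V5 = c("e", NA, NA, "k", "e1", NA),
  stringsAsFactors = FALSE
)

# A complete row starts a new group; following rows join that group.
grp <- cumsum(complete.cases(df1))

# Hypothetical helper: paste the non-NA values of one group together.
collapse_non_na <- function(x) paste(x[!is.na(x)], collapse = "")

# Collapse every column within each group.
res <- as.data.frame(
  lapply(df1, function(col) {
    vapply(split(col, grp), collapse_non_na, character(1))
  }),
  stringsAsFactors = FALSE
)
res
```

This relies on every group starting with a fully populated row, which is the same assumption the dplyr version makes.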

Replace NA in row with value in adjacent row "ROW" not column [duplicate]

Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
df <- data.frame(V1, V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data.
In the real data, the rows where V1 is NA do not follow a regular pattern.
This is too difficult for me.
You could use zoo::na.locf for this. It carries the most recent non-NA value forward to fill the NA values that follow it:
library(dplyr)
library(zoo)
df %>%
  mutate(V1 = zoo::na.locf(V1)) %>%
  group_by(V1) %>%
  summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
A base R option using na.omit + cumsum + aggregate
aggregate(
V2 ~ .,
transform(
df,
V1 = na.omit(V1)[cumsum(!is.na(V1))]
), c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
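To see how the na.omit + cumsum indexing works, it may help to look at the intermediate vectors. The sketch below also swaps c for paste so that V2 becomes a plain space-separated string, as in the expected output, rather than a list column; the names idx, filled and res are illustrative only.

```r
V1 <- c("c1", "c2", NA, NA, "c3", NA, "c4")
V2 <- c("a", "b", "c", "d", "e", "f", "g")

# Running count of non-NA entries: each NA keeps the count of the
# value before it, so it points at that value.
idx <- cumsum(!is.na(V1))
idx

# Index into the non-NA values to carry them forward.
filled <- as.vector(na.omit(V1)[idx])
filled

# Collapse V2 within each filled V1 value into one string.
res <- aggregate(V2 ~ V1, data.frame(V1 = filled, V2 = V2),
                 paste, collapse = " ")
res
```

The indexing assumes the first element of V1 is not NA; a leading NA would give an index of 0 and silently shorten the result.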
You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
  fill(V1) %>%
  group_by(V1) %>%
  summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g

In R is there a way to recode the columns from one data frame with values from another data frame?

I am still relatively new to working in R and I am not sure how to approach this problem. Any help or advice is greatly appreciated!
The problem is that I am working with two data frames and I need to recode the first data frame with values from the second. The first data frame (df1) contains the responses to a survey, and the other data frame (df2) is the data dictionary for df1.
The data looks like this:
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b1","b2","b3","c1","c2","c3"))
So far I can manually recode df1 to get the expected output by doing this:
df1 <- within(df1,{
a[a==1] <- "a1"
a[a==2] <- "a2"
a[a==3] <- "a3"
b[b==4] <- "b4"
b[b==5] <- "b5"
b[b==6] <- "b6"
c[c==7] <- "c7"
c[c==8] <- "c8"
c[c==9] <- "c9"
})
However, my real dataset has about 42 columns that need to be recoded, and that method is rather time-intensive. Is there another way in R to recode the values in df1 with the values in df2?
Thanks!
You just need to reshape the data a bit.
library(data.table)
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b4","b5","b6","c7","c8","c9"),stringsAsFactors = FALSE)
setDT(df1)
setDT(df2)
df1[,ID:=.I]
ldf1 <- melt(df1,measure.vars = c("a","b","c"),variable.name = "columnIndicator",value.name = "df1_value")
ldf1[df2,"new_value":=i.new_value,on=.(columnIndicator,df1_value)]
ldf1
#> ID columnIndicator df1_value new_value
#> 1: 1 a 1 a1
#> 2: 2 a 2 a2
#> 3: 3 a 3 a3
#> 4: 1 b 4 b4
#> 5: 2 b 5 b5
#> 6: 3 b 6 b6
#> 7: 1 c 7 c7
#> 8: 2 c 8 c8
#> 9: 3 c 9 c9
dcast(ldf1,ID~columnIndicator,value.var = "new_value")
#> ID a b c
#> 1: 1 a1 b4 c7
#> 2: 2 a2 b5 c8
#> 3: 3 a3 b6 c9
Created on 2020-04-18 by the reprex package (v0.3.0)
In base R, we can unlist df1, match it against df1_value, and take the corresponding new_value. This ignores columnIndicator, so it assumes each df1_value is unique across all columns.
df1[] <- df2$new_value[match(unlist(df1), df2$df1_value)]
df1
# a b c
#1 a1 b1 c1
#2 a2 b2 c2
#3 a3 b3 c3
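If the same code could appear in more than one column, the lookup would have to respect columnIndicator as well. A minimal base R sketch of that variant, using the data from the question (the names recoded and dict are made up for illustration):

```r
df1 <- data.frame(a = c(1, 2, 3),
                  b = c(4, 5, 6),
                  c = c(7, 8, 9))
df2 <- data.frame(
  columnIndicator = rep(c("a", "b", "c"), each = 3),
  df1_value = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  new_value = c("a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

recoded <- df1
for (col in names(df1)) {
  # Restrict the dictionary to the rows that describe this column.
  dict <- df2[df2$columnIndicator == col, ]
  # Look up each value; values missing from the dictionary become NA.
  recoded[[col]] <- dict$new_value[match(df1[[col]], dict$df1_value)]
}
recoded
```

Writing into a copy keeps the original numeric df1 available, which makes it easier to check for values the dictionary did not cover.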
Is this what you are looking for?
library(dplyr)
library(tidyr)  # gather() comes from tidyr
df3 <- df1 %>% gather(key = "key", value = "value")
df3 %>% inner_join(df2, by = c("key" = "columnIndicator", "value" = "df1_value"))
Output
key value new_value
1 a 1 a1
2 a 2 a2
3 a 3 a3
4 b 4 b1
5 b 5 b2
6 b 6 b3
7 c 7 c1
8 c 8 c2
9 c 9 c3

How to get the values with maximum occurrence across range of columns based on another column

I have a data frame with 29 rows and 26 columns and a lot of NAs. The data looks roughly like this (I am working in RStudio):
df <-
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d f d d na na na f
a1 b2 d d d f na f na na
a1 b3 d f d f na na d d
a2 c1 f f d na na d d f
a2 c2 f d d f na f na na
a2 c3 d f d f na na f d
Here we have columns V1-V10. a1 and a2 are the two distinct values in column V1.
In column V2, b1-b3 are the distinct values that belong to a1, and c1-c3 belong to a2.
Columns V3-V10 hold the values for each row belonging to a1 or a2.
The result I want is:
NewV1 max.occurrence(V3-V10)
a1 d
a2 f
To summarize, I want the value that occurs most often across columns V3-V10 for each value of V1. Note: NAs should be excluded.
Another possibility, using the data.table package:
library(data.table)
melt(setDT(df),
     id = 1:2,
     na.rm = TRUE)[, .N, by = .(V1, value)
                   ][order(-N), .(max.occ = value[1]), by = V1]
which gives:
V1 max.occ
1: a1 d
2: a2 f
A similar logic with the tidyverse-packages:
library(dplyr)
library(tidyr)
df %>%
  gather(k, v, V3:V10, na.rm = TRUE) %>%
  group_by(V1, v) %>%
  tally() %>%
  arrange(-n) %>%
  slice(1) %>%
  select(V1, max.occ = v)
If you like dplyr, this would work (gather comes from tidyr):
library(dplyr)
library(tidyr)
df %>%
  gather("key", "value", V3:V10) %>%
  group_by(V1) %>%
  dplyr::summarise(max.occurence = names(which.max(table(value))))
This gives:
# A tibble: 2 x 2
V1 max.occurence
<fct> <chr>
1 a1 d
2 a2 f
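The same count-then-pick-the-maximum logic can also be sketched in base R. This assumes the na entries in the sample are missing values, so they are read as NA via na.strings; most_common is a hypothetical helper name.

```r
df <- read.table(text = "V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d f d d na na na f
a1 b2 d d d f na f na na
a1 b3 d f d f na na d d
a2 c1 f f d na na d d f
a2 c2 f d d f na f na na
a2 c3 d f d f na na f d",
header = TRUE, na.strings = "na", stringsAsFactors = FALSE)

# Hypothetical helper: most frequent value; table() drops NA by default.
most_common <- function(x) names(which.max(table(x)))

# Split columns V3-V10 by V1, pool each block's values, then count.
res <- sapply(split(df[3:10], df$V1),
              function(block) most_common(unlist(block)))
res
```

which.max returns the first maximum, so ties are broken by the sorted order of the values rather than by any property of the data.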
