R Data wrangling


I am new to R. I have a CSV file that contains these values:
A, , ,
,B, ,
, ,C1,
, , ,D1
, , ,D2
, ,C2,
, , ,D3
, , ,D4
Loading the data into a data frame:
dat = read.csv("~/RData/test.csv", header = FALSE)
dat
# V1 V2 V3 V4
# 1 A
# 2 B
# 3 C1
# 4 D1
# 5 D2
# 6 C2
# 7 D3
# 8 D4
I need to wrangle this into a data frame of this form:
A,B,C1,D1
A,B,C1,D2
A,B,C2,D3
A,B,C2,D4
Thanks in advance!

Here is a solution using dplyr and tidyr. It follows the link in Gregor's comments, but instead of the zoo package it uses fill() from tidyr, na.omit() from base R, and distinct() from dplyr. Because fill() also carries D2 down into the C2 row, distinct() on V4 is what removes that extra row.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
fill(everything(), .direction = "down") %>%
na.omit() %>%
distinct(V4, .keep_all = TRUE)
dt2
V1 V2 V3 V4
1 A B C1 D1
2 A B C1 D2
3 A B C2 D3
4 A B C2 D4
DATA
dt <- read.table(text = "V1 V2 V3 V4
1 A NA NA NA
2 NA B NA NA
3 NA NA C1 NA
4 NA NA NA D1
5 NA NA NA D2
6 NA NA C2 NA
7 NA NA NA D3
8 NA NA NA D4",
header = TRUE, stringsAsFactors = FALSE)
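The DATA block above already has NA in the blank cells, while the CSV in the question has empty or whitespace-only fields. A minimal sketch of one way to get NA at read time, assuming the file content is exactly the eight lines shown in the question (the `csv_text` string is a stand-in for the real file):

```r
# Stand-in for the file from the question (assumption: blanks are "" or " ").
csv_text <- "A, , ,\n,B, ,\n, ,C1,\n, , ,D1\n, , ,D2\n, ,C2,\n, , ,D3\n, , ,D4"

# strip.white trims the whitespace-only fields to "", and na.strings then
# turns those empty fields into NA.
dat <- read.csv(text = csv_text, header = FALSE,
                strip.white = TRUE, na.strings = c("", "NA"),
                stringsAsFactors = FALSE)
dat
```

With the blanks read as NA, the fill-based approaches can be applied directly without a separate replacement step.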

Using zoo:
library(zoo)
df[df==' '] <- NA
df[1:3] <- lapply(df[1:3], na.locf0, fromLast = FALSE)
df <- df[!is.na(df$V4),]
df
giving:
V1 V2 V3 V4
4 A B C1 D1
5 A B C1 D2
7 A B C2 D3
8 A B C2 D4
Or, with magrittr, the same code can be written as a pipeline:
library(magrittr)
library(zoo)
df %>%
replace(. == ' ', NA) %>%
replace(1:3, lapply(.[1:3], na.locf0, fromLast = FALSE)) %>%
subset(!is.na(V4))
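If zoo is not available, the same "last observation carried forward" step can be sketched in base R. The helper name `fill_down` is hypothetical (it is not part of any package), and the data frame is rebuilt here with NA in the blank cells so the block runs on its own:

```r
# Hypothetical helper: carry the last non-NA value forward.
fill_down <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))  # index of last non-NA so far
  idx[idx == 0] <- NA                                # leading NAs stay NA
  x[idx]
}

df <- data.frame(
  V1 = c("A", NA, NA, NA, NA, NA, NA, NA),
  V2 = c(NA, "B", NA, NA, NA, NA, NA, NA),
  V3 = c(NA, NA, "C1", NA, NA, "C2", NA, NA),
  V4 = c(NA, NA, NA, "D1", "D2", NA, "D3", "D4"),
  stringsAsFactors = FALSE
)

df[1:3] <- lapply(df[1:3], fill_down)  # fill only the parent columns
res <- df[!is.na(df$V4), ]             # keep the rows that carry a leaf value
res
```

As in the zoo version, V4 is deliberately left unfilled so that it can be used to pick out the leaf rows.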

Related

how to change row values based on information from another dataframe in R

I have the original df:
A <- c("A1", "A2", "A3", "A4")
B <- c(1,0,1,NA)
C <- c(0,1,0,NA)
D <- c(NA, 1, 0, NA)
df <- data.frame(A, B, C, D)
And my second df2:
A <- c("A2", "A3")
df2 <- data.frame(A)
I would like to produce a modified data frame, df_modified, that looks like this:
A B C D
A1 1 0 NA
A2 NA NA NA
A3 NA NA NA
A4 NA NA NA
My current code, which fills every row with NA, is:
df_modified <- df %>% mutate(B = case_when(df$A == df2$A ~ NA),
C = case_when(df$A == df2$A ~ NA),
D = case_when(df$A == df2$A ~ NA))
How can I do this correctly?

In base R, this is easier: use the logical vector as the row index, select every column except the first (-1), and assign NA to those elements.
df[df$A %in% df2$A, -1] <- NA
-output
> df
A B C D
1 A1 1 0 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA
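To see why the original attempt filled every row with NA, it helps to compare `==` with `%in%` on the vectors from the question. This is a small sketch using plain vectors (`lookup` is a stand-in name for df2$A): `==` recycles the shorter vector and compares pairwise, and a case_when() call with no fallback condition returns NA for every row that matches nothing.

```r
A <- c("A1", "A2", "A3", "A4")
B <- c(1, 0, 1, NA)
lookup <- c("A2", "A3")   # stand-in for df2$A

eq_match <- A == lookup    # pairwise with recycling: FALSE FALSE FALSE FALSE
in_match <- A %in% lookup  # membership test:         FALSE TRUE TRUE FALSE

# Keep the original value unless the row is listed in lookup.
B_new <- ifelse(in_match, NA, B)
B_new
```

The base R one-liner above does the same thing for all value columns at once by using the `%in%` result as a row index.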
Or if we want to use tidyverse, use across
library(dplyr)
df %>%
mutate(across(where(is.numeric), ~ case_when(!A %in% df2$A ~ .)))
-output
A B C D
1 A1 1 0 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA

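Here is an alternative dplyr approach. Note that it does not reproduce the requested output exactly: the comparison first(.) == last(.) yields 1 for any non-NA value in an unmatched row, so C for A1 comes out as 1 rather than the original 0: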
bind_rows(df, df2) %>%
group_by(A) %>%
mutate(across(c(B,C,D), ~first(.)==last(.))*1) %>%
distinct()
A B C D
<chr> <dbl> <dbl> <dbl>
1 A1 1 1 NA
2 A2 NA NA NA
3 A3 NA NA NA
4 A4 NA NA NA

Replace NA in row with value in adjacent row (not only one row)

input data
V1 V2 V3 V4 V5 #header
a b c d e #full data 1
NA f g NA NA
NA NA NA i NA
NA j NA NA NA k
a1 b1 c1 d1 e1 #full data 2
NA NA f1 g1 NA
Expected output
V1 V2 V3 V4 V5
a bfj cg di ek
a1 b1 c1f1 d1g1 e1
A useful related link:
Replace NA in row with value in adjacent row "ROW" not column
I used a lot of for loops, and my code is very messy.

Here's an approach using dplyr. First, I identify the rows with no NAs. Then I use the cumulative count of those rows to define groups. Within each group, I paste all the rows' values (excluding NAs) together.
library(dplyr)
df1 %>%
rowwise() %>% mutate(full = sum(is.na(c_across()))) %>% ungroup() %>%
group_by(group = cumsum(full == 0)) %>%
summarize(across(.fns = ~paste0(na.omit(.x), collapse = ""))) %>%
select(-group, -full)
# A tibble: 2 × 5
V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr>
1 a bfj cg di e
2 a1 b1 c1f1 d1g1 e1
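The same grouping idea can be sketched in base R. This assumes the fourth input row is `NA j NA NA k` (the row in the question lists six values, so one NA is taken to be extra), and `collapse_non_na` is a hypothetical helper name:

```r
df1 <- data.frame(
  V1 = c("a", NA, NA, NA, "a1", NA),
  V2 = c("b", "f", NA, "j", "b1", NA),
  V3 = c("c", "g", NA, NA, "c1", "f1"),
  V4 = c("d", NA, "i", NA, "d1", "g1"),
  V5 = c("e", NA, NA, "k", "e1", NA),   # assumption: "k" belongs to V5
  stringsAsFactors = FALSE
)

# A new group starts at every row that has no NA at all.
group <- cumsum(complete.cases(df1))

# Hypothetical helper: paste the non-NA values of one column together.
collapse_non_na <- function(x) paste(x[!is.na(x)], collapse = "")

res <- do.call(rbind, lapply(split(df1, group), function(g) {
  as.data.frame(lapply(g, collapse_non_na), stringsAsFactors = FALSE)
}))
res
```

Under that assumption about the fourth row, V5 in the first group becomes "ek", as in the expected output.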

Replace NA in row with value in adjacent row "ROW" not column [duplicate]

Raw data:
V1 V2
1 c1 a
2 c2 b
3 <NA> c
4 <NA> d
5 c3 e
6 <NA> f
7 c4 g
Reproducible Sample Data
V1 = c('c1','c2',NA,NA,'c3',NA,'c4')
V2 = c('a','b','c','d','e','f','g')
df <- data.frame(V1, V2)
Expected output
V1_after V2_after
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g
V1_after <- c('c1','c2','c3','c4')
V2_after <- c('a',paste('b','c','d'),paste('e','f'),'g')
data.frame(V1_after,V2_after)
This is sample data.
In the real data, the rows where V1 is NA do not follow a regular pattern.
This is too difficult for me.

You could make use of zoo::na.locf for this. It takes the most recent non-NA value and uses it to fill the NA values that follow:
library(dplyr)
library(zoo)
df %>%
mutate(V1 = zoo::na.locf(V1)) %>%
group_by(V1) %>%
summarise(V2 = paste0(V2, collapse = " "))
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 c1 a
2 c2 b c d
3 c3 e f
4 c4 g

A base R option using na.omit + cumsum + aggregate
aggregate(
V2 ~ .,
transform(
df,
V1 = na.omit(V1)[cumsum(!is.na(V1))]
), c
)
gives
V1 V2
1 c1 a
2 c2 b, c, d
3 c3 e, f
4 c4 g
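The indexing trick inside transform() is easier to follow when it is split into steps. In this sketch, `non_na` holds the same values as na.omit(V1), and the intermediate names are only illustrative. It assumes the first element of V1 is not NA, since a leading NA would produce index 0.

```r
V1 <- c("c1", "c2", NA, NA, "c3", NA, "c4")

non_na <- V1[!is.na(V1)]      # "c1" "c2" "c3" "c4"
idx    <- cumsum(!is.na(V1))  # 1 2 2 2 3 3 4: counts the non-NA values seen so far
filled <- non_na[idx]         # each NA picks up the value that precedes it
filled
```

Each position in `idx` points at the most recent non-NA entry, which is why indexing with it fills the gaps.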

You can fill the NA with the previous non-NA values and summarise the data.
library(dplyr)
library(tidyr)
df %>%
fill(V1) %>%
group_by(V1) %>%
summarise(V2 = paste(V2, collapse = ' '))
# V1 V2
# <chr> <chr>
#1 c1 a
#2 c2 b c d
#3 c3 e f
#4 c4 g

In R is there a way to recode the columns from one data frame with values from another data frame?

I am still relatively new to working in R and I am not sure how to approach this problem. Any help or advice is greatly appreciated!!!
The problem I have is that I am working with two data frames and I need to recode the first data frame with values from the second. The first data frame (df1) contains the data from the respondents to a survey and the other data frame (df2) is the data dictionary for df1.
The data looks like this:
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b1","b2","b3","c1","c2","c3"))
So far I can manually recode df1 to get the expected output by doing this:
df1 <- within(df1,{
a[a==1] <- "a1"
a[a==2] <- "a2"
a[a==3] <- "a3"
b[b==4] <- "b4"
b[b==5] <- "b5"
b[b==6] <- "b6"
c[c==7] <- "c7"
c[c==8] <- "c8"
c[c==9] <- "c9"
})
However my real dataset has about 42 columns that need to be recoded and that method is a little time intensive. Is there another way in R for me to recode the values in df1 with the values in df2?
Thanks!

You just need to reshape the data a bit.
library(data.table)
df1 <- data.frame(a = c(1,2,3),
b = c(4,5,6),
c = c(7,8,9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c" ),
df1_value = c(1,2,3,4,5,6,7,8,9),
new_value = c("a1","a2","a3","b4","b5","b6","c7","c8","c9"),stringsAsFactors = FALSE)
setDT(df1)
setDT(df2)
df1[,ID:=.I]
ldf1 <- melt(df1,measure.vars = c("a","b","c"),variable.name = "columnIndicator",value.name = "df1_value")
ldf1[df2,"new_value":=i.new_value,on=.(columnIndicator,df1_value)]
ldf1
#> ID columnIndicator df1_value new_value
#> 1: 1 a 1 a1
#> 2: 2 a 2 a2
#> 3: 3 a 3 a3
#> 4: 1 b 4 b4
#> 5: 2 b 5 b5
#> 6: 3 b 6 b6
#> 7: 1 c 7 c7
#> 8: 2 c 8 c8
#> 9: 3 c 9 c9
dcast(ldf1,ID~columnIndicator,value.var = "new_value")
#> ID a b c
#> 1: 1 a1 b4 c7
#> 2: 2 a2 b5 c8
#> 3: 3 a3 b6 c9
Created on 2020-04-18 by the reprex package (v0.3.0)

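In base R, we can unlist df1, match it against df1_value, and take the corresponding new_value. Note that this ignores columnIndicator, so it relies on each value appearing in only one column.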
df1[] <- df2$new_value[match(unlist(df1), df2$df1_value)]
df1
# a b c
#1 a1 b1 c1
#2 a2 b2 c2
#3 a3 b3 c3
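When the same code can appear in more than one column, the lookup has to respect columnIndicator. A minimal base R sketch of that variant, using the data from the question and a copy named `recoded` (a name introduced here for illustration):

```r
df1 <- data.frame(a = c(1, 2, 3),
                  b = c(4, 5, 6),
                  c = c(7, 8, 9))
df2 <- data.frame(columnIndicator = c("a","a","a","b","b","b","c","c","c"),
                  df1_value = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                  new_value = c("a1","a2","a3","b1","b2","b3","c1","c2","c3"),
                  stringsAsFactors = FALSE)

recoded <- df1
for (col in names(df1)) {
  # Restrict the dictionary to the rows that describe this column.
  dict <- df2[df2$columnIndicator == col, ]
  recoded[[col]] <- dict$new_value[match(df1[[col]], dict$df1_value)]
}
recoded
```

Values that have no entry in the dictionary for their column come out as NA, which makes gaps in the dictionary easy to spot.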

Is this what you are looking for?
library(dplyr)
library(tidyr) # gather() comes from tidyr
df3 <- df1 %>% gather(key = "key", value = "value")
df3 %>% inner_join(df2, by = c("key" = "columnIndicator", "value" = "df1_value"))
Output
key value new_value
1 a 1 a1
2 a 2 a2
3 a 3 a3
4 b 4 b1
5 b 5 b2
6 b 6 b3
7 c 7 c1
8 c 8 c2
9 c 9 c3

How to get the values with maximum occurrence across range of columns based on another column

I have a data frame with 29 rows and 26 columns and a lot of NAs. The data looks roughly like this (I am working in RStudio):
df <-
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
a1 b1 d f d d na na na f
a1 b2 d d d f na f na na
a1 b3 d f d f na na d d
a2 c1 f f d na na d d f
a2 c2 f d d f na f na na
a2 c3 d f d f na na f d
Here we have columns V1-V10. a1 and a2 are the two distinct values in column V1.
In column V2, b1-b3 are distinct values related to a1, and c1-c3 are related to a2.
Columns V3-V10 hold the values in each row related to a1 and a2.
The result I want is:
NewV1 max.occurrence(V3-V10)
a1 d
a2 f
To summarize, I want the value with the maximum occurrence across columns V3-V10 for each value of V1. Note: NAs are to be excluded.

Another possibility using the data.table package:
library(data.table)
melt(setDT(df),
id = 1:2,
na.rm = TRUE)[, .N, by = .(V1, value)
][order(-N), .(max.occ = value[1]), by = V1]
which gives:
V1 max.occ
1: a1 d
2: a2 f
A similar logic with the tidyverse-packages:
library(dplyr)
library(tidyr)
df %>%
gather(k, v, V3:V10, na.rm = TRUE) %>%
group_by(V1, v) %>%
tally() %>%
arrange(-n) %>%
slice(1) %>%
select(V1, max.occ = v)

If you like dplyr, this would work (gather() comes from tidyr):
library(dplyr)
library(tidyr)
df %>%
gather("key", "value", V3:V10) %>%
group_by(V1) %>%
dplyr::summarise(max.occurence = names(which.max(table(value))))
This gives:
# A tibble: 2 x 2
V1 max.occurence
<fct> <chr>
1 a1 d
2 a2 f
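For comparison, here is a base R sketch of the same count-and-pick-the-largest logic. The data frame is rebuilt from the table in the question with the `na` entries taken to be real NA values (an assumption), and V2 is left out because it is not used:

```r
df <- data.frame(
  V1  = c("a1", "a1", "a1", "a2", "a2", "a2"),
  V3  = c("d", "d", "d", "f", "f", "d"),
  V4  = c("f", "d", "f", "f", "d", "f"),
  V5  = c("d", "d", "d", "d", "d", "d"),
  V6  = c("d", "f", "f", NA, "f", "f"),
  V7  = rep(NA_character_, 6),
  V8  = c(NA, "f", NA, "d", "f", NA),
  V9  = c(NA, NA, "d", "d", NA, "f"),
  V10 = c("f", NA, "d", "f", NA, "d"),
  stringsAsFactors = FALSE
)

value_cols <- paste0("V", 3:10)

# For each V1 group, pool all values, count them, and keep the most frequent.
most_common <- sapply(split(df[value_cols], df$V1), function(block) {
  counts <- table(unlist(block))   # table() drops NA by default
  names(counts)[which.max(counts)]
})
most_common
```

which.max() returns the first maximum, so ties are resolved in favour of the value that sorts first.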
