R: Shift some rows by one column across a table

I have data frame X that looks like this. It has 4 columns and 5 rows.
name age gender class
A 12 M C1
B 10 F C2
C M C1 N/A
D F C2 N/A
E F C1 N/A
I would like to shift all data from column 2 (age) onward, for row 3 and below, one column to the right so that gender and class align, leaving the wrongly filled age cells blank. My resulting set should look like:
name age gender class
A 12 M C1
B 10 F C2
C N/A M C1
D N/A F C2
E N/A F C1
Please note: this is a situation from a very large dataset with 4 mil records and 52 cols.
Any help will be much appreciated. Thanks in advance!

Like this:
nc <- ncol(dfr)
dfr[-(1:2), 3:nc] <- dfr[-(1:2), 2:(nc-1)]
dfr[-(1:2), 2] <- NA
The negative indices in the rows mean 'everything but rows 1 and 2'.
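If the misaligned rows are not a known contiguous block, the same shift can be driven by a logical row index instead of fixed row numbers. A minimal sketch, assuming (my assumption, not something stated in the question) that a misaligned row can be recognised by a non-numeric value in age:

```r
# Toy data matching the question; age is character because of the stray values.
dfr <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  age    = c("12", "10", "M", "F", "F"),
  gender = c("M", "F", "C1", "C2", "C1"),
  class  = c("C1", "C2", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: a row is misaligned when age is present but not a number.
bad <- !is.na(dfr$age) & is.na(suppressWarnings(as.numeric(dfr$age)))

nc <- ncol(dfr)
dfr[bad, 3:nc] <- dfr[bad, 2:(nc - 1)]  # shift columns 2..(nc-1) one to the right
dfr[bad, 2] <- NA                       # blank the wrongly filled age cells
dfr$age <- as.numeric(dfr$age)          # age is now clean and can be numeric
```

The assignment is vectorised over rows, so no explicit loop is needed.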

> df <- data.frame("name" = LETTERS[1:5],
+ "age" = c(12, 10, "M","F","F"),
+ "gender" = c("M", "F", "C1", "C2", "C1"),
+ "class" = c("C1", "C2", NA,NA,NA))
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C M C1 <NA>
4 D F C2 <NA>
5 E F C1 <NA>
> df[3:nrow(df),3:ncol(df)] <- df[3:nrow(df),2:(ncol(df)-1)]
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C M M C1
4 D F F C2
5 E F F C1
> df[3:nrow(df),2] <- NA
> df
name age gender class
1 A 12 M C1
2 B 10 F C2
3 C <NA> M C1
4 D <NA> F C2
5 E <NA> F C1

Related

Incrementing grouped identifiers

I have example data as follows:
library(data.table)
dat <- fread("Survey Variable_codes_2022
D D1
A A1
B B1
B B3
B B2
E E1
B NA
E NA")
For the two rows that have Variable_codes_2022==NA, I would like to increment the variable code so that it becomes:
dat <- fread("Survey Variable_codes_2022
D D1
A A1
B B1
B B3
B B2
E E1
B B4
E E2")
Because the column Variable_codes_2022 is a string variable, the numbers are not in numerical order.
I have no idea where to start and I was wondering if someone could help me on the right track.
We could do it this way:
grouping
arranging and
mutate.
To keep the original order we could first create an id and then rearrange:
library(dplyr)
dat %>%
group_by(Survey) %>%
arrange(.by_group = TRUE) %>%
mutate(Variable_codes_2022 = paste0(Survey, row_number()))
Survey Variable_codes_2022
<chr> <chr>
1 A A1
2 B B1
3 B B2
4 B B3
5 B B4
6 D D1
7 E E1
8 E E2
data.table option using rleid like this:
library(data.table)
dat[, Variable_codes_2022 := paste0(Survey, rleid(Variable_codes_2022)), by = Survey]
dat
#> Survey Variable_codes_2022
#> 1: D D1
#> 2: A A1
#> 3: B B1
#> 4: B B2
#> 5: B B3
#> 6: E E1
#> 7: B B4
#> 8: E E2
Created on 2022-12-01 with reprex v2.0.2
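Note that both outputs above renumber the existing codes (the row that held B3 now holds B2). If the existing codes have to stay untouched and only the NA rows should be filled, something like the following base R sketch might do. It assumes that every code is a letter prefix followed by a number and that each Survey group has at least one existing code; both are my assumptions about the data.

```r
dat <- data.frame(
  Survey = c("D", "A", "B", "B", "B", "E", "B", "E"),
  Variable_codes_2022 = c("D1", "A1", "B1", "B3", "B2", "E1", NA, NA),
  stringsAsFactors = FALSE
)

# Numeric part of each code (NA where the code is missing).
num <- as.numeric(sub("^[A-Za-z]+", "", dat$Variable_codes_2022))

# Highest existing number per Survey, repeated on every row of that Survey.
# A group with no existing code at all would give -Inf here.
max_by_group <- ave(num, dat$Survey, FUN = function(x) max(x, na.rm = TRUE))

# Running count of missing codes within each Survey: 1, 2, ... for the NA rows.
miss <- is.na(dat$Variable_codes_2022)
offset <- ave(as.numeric(miss), dat$Survey, FUN = cumsum)

# Fill only the missing rows.
dat$Variable_codes_2022[miss] <-
  paste0(dat$Survey[miss], max_by_group[miss] + offset[miss])
```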
dat <-
structure(list(survey = c("D", "A", "B", "B", "B", "E", "B",
"E", "B"), var_code = c("D1", "A1", "B1", "B3", "B2", "E1", NA,
NA, NA)), row.names = c(NA, -9L), class = "data.frame")
library(dplyr)
library(stringr)
dat %>%
group_by(survey) %>%
mutate(
aux1 = as.numeric(stringr::str_remove(var_code,survey)),
aux2 = cumsum(is.na(var_code)),
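# only fill the missing codes; existing codes are kept as they are
var_code = if_else(is.na(var_code), paste0(survey, max(aux1, na.rm = TRUE) + aux2), var_code)
) %>%
ungroup() %>%
select(-aux1,-aux2)
# A tibble: 9 x 2
survey var_code
<chr> <chr>
1 D D1
2 A A1
3 B B1
4 B B3
5 B B2
6 E E1
7 B B4
8 E E2
9 B B5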
This solution uses rowid.
I added an extra element to the sample so it can be tested against multiple missing values.
library(data.table)
#> Warning: package 'data.table' was built under R version 4.2.2
dat <- fread("Survey Variable_codes_2022
D D1
A A1
B B1
B B3
B B2
E E1
B NA
E NA
E NA")
dat[, n := as.numeric(substr(
Variable_codes_2022, nchar(Survey)+1, nchar(Variable_codes_2022)))]
dat[is.na(n),
Variable_codes_2022 := paste0(Survey, rowid(Survey) +
dat[.SD[,.(Survey)], .(m=max(n, na.rm=T)), on = "Survey", by=.EACHI ][,m])]
dat
#> Survey Variable_codes_2022 n
#> 1: D D1 1
#> 2: A A1 1
#> 3: B B1 1
#> 4: B B3 3
#> 5: B B2 2
#> 6: E E1 1
#> 7: B B4 NA
#> 8: E E2 NA
#> 9: E E3 NA

Combine multiple dataframe of different columns [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
Can we combine the rows of multiple data frames that have different columns? Example below:
> asd1 <- data.frame(a = c("a","b"), b = c("fd", "fg"))
> asd1
a b
1 a fd
2 b fg
> asd2 <- data.frame(a = c("a","b"), e = c("fd", "fg"), c = c("gfd","asd"))
> asd2
a e c
1 a fd gfd
2 b fg asd
Newdf <- rbind(asd1, asd2)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
Right now there is an error because the columns differ.
Expected output
newdf
data a b e c
asd1 a fd NA NA
asd1 b fg NA NA
asd2 a NA fd gfd
asd2 b NA fg asd
Is the above output possible?
I would suggest bind_rows() from dplyr:
library(dplyr)
#Data 1
asd1 <- data.frame(a = c("a","b"), b = c("fd", "fg"))
#Data 2
asd2 <- data.frame(a = c("a","b"), e = c("fd", "fg"), c = c("gfd","asd"))
#Bind
df <- bind_rows(asd1,asd2)
Output:
a b e c
1 a fd <NA> <NA>
2 b fg <NA> <NA>
3 a <NA> fd gfd
4 b <NA> fg asd
library(dplyr)
bind_rows(asd1, asd2, .id = "data")
# data a b e c
# 1 1 a fd <NA> <NA>
# 2 1 b fg <NA> <NA>
# 3 2 a <NA> fd gfd
# 4 2 b <NA> fg asd
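The expected output in the question labels each row with the name of its source data frame rather than 1 and 2. A base R sketch of that, without dplyr, could look like this. The helper name rbind_fill is hypothetical (it is not a function from base R or dplyr); it pads each data frame with NA columns and then calls rbind.

```r
asd1 <- data.frame(a = c("a", "b"), b = c("fd", "fg"),
                   stringsAsFactors = FALSE)
asd2 <- data.frame(a = c("a", "b"), e = c("fd", "fg"), c = c("gfd", "asd"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: takes a *named* list of data frames.
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(names(dfs), function(nm) {
    d <- dfs[[nm]]
    # Add every column this data frame lacks, filled with NA.
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    # Same column order everywhere, plus a label column with the list name.
    data.frame(data = nm, d[all_cols], stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  out
}

newdf <- rbind_fill(list(asd1 = asd1, asd2 = asd2))
```

If dplyr is available, bind_rows() also accepts a named list, in which case the .id column carries the list names instead of 1 and 2.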

Calculate direct dependencies among values of a dataframe in R

A data frame is given and the objective is to calculate the direct dependency value between two columns of the data frame.
c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10
What we are looking for is the direct dependency relation; for example, for a and b this value is ab - ba = 20.
The final result should be like this:
c1 c2 N DepValue
a b 30 ab - ba = 20
a c 5 ac - ca = 0
a d 10 ad - 0 = 10
c a 5 ca - ac = 0
b a 10 ba - ab = 20
Thank you for your help.
D <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"c1 c2 N
a b 30
a c 5
a d 10
c a 5
b a 10")
N12 <- D$N
names(N12) <- paste0(D$c1, D$c2)
N21 <- N12[paste0(D$c2, D$c1)]
D$depValue <- D$N - ifelse(is.na(N21), 0, N21)
result:
> D
c1 c2 N depValue
1 a b 30 20
2 a c 5 0
3 a d 10 10
4 c a 5 0
5 b a 10 -20
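The lookup above keys each row by the pasted pair c1c2 and then looks up the reversed pair. A variant of the same idea using match() is sketched below. The "|" separator is my addition: it is there so that longer labels cannot collide once pasted together (for example "ab" + "c" versus "a" + "bc"), and it assumes "|" never occurs in c1 or c2. The question's expected table shows 20 for the b, a row, so the absolute value is added as a separate column in case that is what is wanted.

```r
D <- data.frame(
  c1 = c("a", "a", "a", "c", "b"),
  c2 = c("b", "c", "d", "a", "a"),
  N  = c(30, 5, 10, 5, 10),
  stringsAsFactors = FALSE
)

key     <- paste(D$c1, D$c2, sep = "|")  # this row's pair
rev_key <- paste(D$c2, D$c1, sep = "|")  # the opposite direction

# N of the reversed pair, or 0 when the reversed pair does not exist.
rev_N <- D$N[match(rev_key, key)]
rev_N[is.na(rev_N)] <- 0

D$depValue    <- D$N - rev_N       # signed difference
D$absDepValue <- abs(D$depValue)   # unsigned, if direction does not matter
```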
One option is to create groups from the pmin and pmax values of c1 and c2 and take the difference between the two N values in each group. This returns NA for groups with only one value; we can replace those NAs with the first value in the group. (Here df is the data frame from the question.)
library(dplyr)
df %>%
group_by(group1 = pmin(c1, c2), group2 = pmax(c1, c2)) %>%
mutate(dep = N[1] - N[2],
dep = replace(dep, is.na(dep), N[1])) %>%
ungroup() %>%
select(-group1, -group2)
# c1 c2 N dep
# <chr> <chr> <int> <int>
#1 a b 30 20
#2 a c 5 0
#3 a d 10 10
#4 c a 5 0
#5 b a 10 20
An idea via base R is to sort columns c1 and c2, split based on those values and subtract N, i.e.
i1 <- paste(pmin(df$c1, df$c2), pmax(df$c1, df$c2))
i1
#[1] "a b" "a c" "a d" "a c" "a b"
do.call(rbind, lapply(split(df, i1), function(i) {i['DepValue'] <- Reduce(`-`, i$N); i}))
# c1 c2 N DepValue
#a b.1 a b 30 20
#a b.5 b a 10 20
#a c.2 a c 5 0
#a c.4 c a 5 0
#a d a d 10 10

data.table shift right all cell values by number of na within each row [R]

How do I shift the cells in a data table TO THE RIGHT by the number of NA in each row in R?
Example Data:
data <- data.table(c1=c("a","e","h","j"),
c2=c("b","f","i",NA),
c3=c("c","g",NA,NA),
c4=c("d",NA,NA,NA), stringsAsFactors = F)
c1 c2 c3 c4
1 a b c d
2 e f g <NA>
3 h i <NA> <NA>
4 j <NA> <NA> <NA>
Desired Data from example:
data.desired <- data.table(
c1=c("a",NA,NA,NA),
c2=c("b","e",NA,NA),
c3=c("c","f","h",NA),
c4=c("d","g","i","j"), stringsAsFactors = F)
c1 c2 c3 c4
1 a b c d
2 <NA> e f g
3 <NA> <NA> h i
4 <NA> <NA> <NA> j
Here's one attempt using matrix indexing and a counter of NA values by row:
#convert back to a data.frame to take advantage of matrix indexing
setDF(data)
arr <- which(!is.na(data), arr.ind=TRUE)
arr[,"col"] <- arr[,"col"] + rowSums(is.na(data))[arr[,"row"]]
out <- data
out[] <- NA
out[arr] <- data[!is.na(data)]
out
# c1 c2 c3 c4
#1 a b c d
#2 <NA> e f g
#3 <NA> <NA> h i
#4 <NA> <NA> <NA> j
#convert to data.table if necessary
setDT(out)
This option is pretty quick and from a brief test churns through 4 columns / 2 million rows in about 3-4 seconds.
We can use
data.table(t(apply(data, 1, function(x){ c(rep(NA, sum(is.na(x))), x[!is.na(x)])})))
# V1 V2 V3 V4
# 1: a b c d
# 2: <NA> e f g
# 3: <NA> <NA> h i
# 4: <NA> <NA> <NA> j
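The apply() version loses the original column names (the result has V1 to V4), and because apply() goes through a matrix, every column comes back with the same type (character in this example). A base R sketch that keeps the names, with shift_right as a hypothetical helper name:

```r
data <- data.frame(
  c1 = c("a", "e", "h", "j"),
  c2 = c("b", "f", "i", NA),
  c3 = c("c", "g", NA, NA),
  c4 = c("d", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move the non-NA values of one row to the right end.
shift_right <- function(x) c(rep(NA, sum(is.na(x))), x[!is.na(x)])

# apply() returns one column per input row, hence the transpose.
m <- t(apply(data, 1, shift_right))

out <- as.data.frame(m, stringsAsFactors = FALSE)
names(out) <- names(data)  # restore c1..c4
```

If a data.table is needed afterwards, setDT(out) can be applied as in the first answer.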

Arranging the column values in R based on similar pairs present in data

The script below builds a data frame of three columns. I want to take one pair of values (a1, a2) at a time. If the pair occurs more than once in the table, I want the corresponding a3 values arranged in ascending order across those rows. For illustration, the first (a1, a2) pair is ("A", "D"), which occurs again at the 4th and 7th positions, so the a3 values at positions 1, 4 and 7 should be sorted in ascending order, and similarly for every other pair. Please avoid loops and ifs, as they may slow down the process. I tried using "arrange", but it did not help. Thanks, and please suggest.
a1 = c("A","B","C","A","B","C","A")
a2 = c("D","E","F","D","F","E","D")
a3 = c(20,40,50,5,15,35,10)
a123= data.frame(a1,a2,a3)
View(a123)
Expected Outcome
a1 = c("A","B","C","A","B","C","A")
a2 = c("D","E","F","D","F","E","D")
a3 = c(5,40,50,10,15,35,20)
a123 = data.frame(a1,a2,a3)
We can group the data by a1 and a2, and then use mutate and sort to rearrange the numbers in a3. a123_r is the final output.
library(dplyr)
a123_r <- a123 %>%
group_by(a1, a2) %>%
mutate(a3 = sort(a3)) %>%
ungroup()
a123_r
# # A tibble: 7 x 3
# a1 a2 a3
# <fctr> <fctr> <dbl>
# 1 A D 5.00
# 2 B E 40.0
# 3 C F 50.0
# 4 A D 10.0
# 5 B F 15.0
# 6 C E 35.0
# 7 A D 20.0
I would just paste them into another column to create a key.
a4 = paste(a1,a2)
a123 = cbind(a123,a4)
a123[order(a123$a4,a123$a3),]
# a1 a2 a3 a4
#4 A D 5 A D
#7 A D 10 A D
#1 A D 20 A D
#2 B E 40 B E
#5 B F 15 B F
#6 C E 35 C E
#3 C F 50 C F
# or save the new order
a123 = a123[order(a123$a4,a123$a3),]
For the sake of completeness, here is also a data.table solution which updates only column a3 by reference, i.e., without copying the whole data object a123:
library(data.table)
setDT(a123)[, a3 := sort(a3), by = .(a1, a2)][]
a1 a2 a3
1: A D 5
2: B E 40
3: C F 50
4: A D 10
5: B F 15
6: C E 35
7: A D 20
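For anyone without dplyr or data.table, the same "sort a3 within each (a1, a2) pair, keep the rows where they are" idea can be sketched in base R with ave():

```r
a1 <- c("A", "B", "C", "A", "B", "C", "A")
a2 <- c("D", "E", "F", "D", "F", "E", "D")
a3 <- c(20, 40, 50, 5, 15, 35, 10)
a123 <- data.frame(a1, a2, a3, stringsAsFactors = FALSE)

# ave() applies FUN to a3 within each (a1, a2) combination and writes the
# result back into the original row positions of that combination.
a123$a3 <- ave(a123$a3, a123$a1, a123$a2, FUN = sort)
```

This should give the a3 column from the expected outcome: 5, 40, 50, 10, 15, 35, 20.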

Resources