I have a large database (90,000 * 1500) organized by child observations, where each row also includes the mom's info. I want to reorganize the database according to the mom's data.
The problem is that each kid appears only once in the DB, but a mom may appear up to 10 times.
In addition, I want the number of rows to equal the number of different mothers (approx. 40,000), with a bit of data for each of her children (between 0 and 10).
For example, the DB I have and the DB I want to create:
You could use reshape
library(data.table)
df = data.frame(
'c' = c('c1', 'c2', 'c3', 'c4', 'c5'),
'id_num' = seq(1,5),
'age' = c(12, 15, 5, 8, 19),
'mom'= c(1,3,1,2,3)
)
df
c id_num age mom
1 c1 1 12 1
2 c2 2 15 3
3 c3 3 5 1
4 c4 4 8 2
5 c5 5 19 3
df = setDT(df)[order(mom)]
df[, id_child := seq(.N), mom]
reshape(df, idvar = "mom", timevar = "id_child", direction = "wide")
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1: 1 c1 1 12 c3 3 5
2: 2 c4 4 8 <NA> NA NA
3: 3 c2 2 15 c5 5 19
Here is a solution similar to @Metariat's, but with base R, where ave() is used
df$seq <- with(df,ave(id_num,mom,FUN = seq_along))
dfout <- reshape(df, idvar = "mom", timevar = "seq", direction = "wide")
such that
> dfout
mom c.1 id_num.1 age.1 c.2 id_num.2 age.2
1 1 c1 1 12 c3 3 5
2 3 c2 2 15 c5 5 19
4 2 c4 4 8 <NA> NA NA
EDIT:
If you have a very big data frame, you can try a divide-and-conquer approach to see if it works
library(plyr)
dfs <- split(df,df$mom)
lst <- lapply(dfs, function(x) {
x <- within(x,seqnum <- ave(id_num,mom,FUN = seq_along))
reshape(x, idvar = "mom", timevar = "seqnum", direction = "wide")
}
)
dfout <- rbind.fill(lst)
You can do this using the dplyr package, with group_by.
library(dplyr)
data <- group_by(data, mom)
The data frame is now grouped by mom, so each mom's children are handled together. You can then sort within each group as follows.
data <- arrange(data, id_num, .by_group = TRUE)
To keep only children aged between 0 and 10:
filter(data, age >= 0, age <= 10)
I am quite new to R, and I do not know how to create variables in a loop. I have a dataset where each observation is uniquely defined by an id and a type. My goal is to create several datasets from the starting one, each keeping the id, the type, and one specific variable, and to rename the variable type as type_variable. Please see below a reproducible example of my dataset:
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
# id type a b c d
# 1 1 b1 -0.74733339 -1.1121249 -0.2005649 1.70320036
# 2 1 b2 -0.87290362 -0.1221949 -2.7723691 1.04158671
# 3 1 c1 -0.00878965 -0.7592988 -0.5108226 2.10755315
# 4 1 c2 0.87295622 -0.5885439 0.2606365 -0.87080649
# 5 2 b1 -0.74536372 0.1377794 -0.1382621 0.01743011
# 6 2 b2 -0.01570109 -0.3058672 -0.3146880 -0.43594081
# 7 2 c1 -0.28966205 -0.2045772 -1.1776759 -2.24223369
# 8 2 c2 -0.63680969 2.3815740 0.4462243 -0.05397941
This is how I have tried to do it, but unfortunately it does not work.
varlist <- list("a", "b", "c", "d")
for (i in 1:4) {
tmp <- dt_type %>% rename(paste("type", varlist[[i]], sep=="_") = type) %>%
arrange(id, varlist[[i]], desc(paste("type", varlist[[i]], sep=="_"))) %>%
distinct(id, varlist[[i]], .keep_all = T)
assign(paste("dt_type_", varlist[[i]]), tmp)
}
I am used to using loops in other programming languages, but if there are better ways to reach the result I want, please let me know.
Sorry for not posting the expected output, here it is:
dt_type_a
# id type value
# 1 1 b1 -1.5023199
# 2 1 b2 -0.3653626
# 3 1 c1 1.2842098
# 4 1 c2 0.2732327
# 5 2 b1 -0.7581897
# 6 2 b2 1.1627059
# 7 2 c1 -1.6644546
# 8 2 c2 1.2916819
dt_type_b
# id type value
# 1 1 b1 -0.19573684
# 2 1 b2 -1.35095843
# 3 1 c1 0.69342205
# 4 1 c2 0.47689611
# 5 2 b1 0.67058845
# 6 2 b2 0.21992074
# 7 2 c1 -0.02046201
# 8 2 c2 0.19686712
Thanks,
Vincenzo
Hum, I would just go from wide to long but since you're asking to create variables dynamically:
library(data.table)
dt_type <- data.frame(id = c(1,1,1,1,2,2,2,2),
type= c("b1", "b2","c1", "c2","b1", "b2","c1", "c2"),
a=rnorm(8), b=rnorm(8),c=rnorm(8),d=rnorm(8))
setDT(dt_type)
dt_long <- melt(dt_type, id.vars = c("id", "type"))
varnames <- unique(dt_long$variable)
for (var in varnames) {
assign(paste0("dt_type_", var), dt_long[variable == var, .(id, type, value)])
}
hope it helps...
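If one named list is acceptable instead of separate objects created with assign(), a base R sketch along these lines might be enough. The names vars and dt_list are made up for illustration, and set.seed() is only there to make the random example reproducible.

```r
# Reproducible version of the example data (seed chosen arbitrarily)
set.seed(1)
dt_type <- data.frame(id = rep(1:2, each = 4),
                      type = rep(c("b1", "b2", "c1", "c2"), times = 2),
                      a = rnorm(8), b = rnorm(8), c = rnorm(8), d = rnorm(8))

vars <- c("a", "b", "c", "d")

# One data frame per variable, each with columns id, type, value
dt_list <- lapply(vars, function(v) {
  out <- dt_type[, c("id", "type", v)]
  names(out)[3] <- "value"
  out
})
names(dt_list) <- paste0("dt_type_", vars)

# Access a single piece by name
dt_list$dt_type_a
```

Keeping the pieces in a list means they can be processed later with lapply() again, rather than looking up objects by pasted names.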
I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA
We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx). Read the 'Number' column after converting it to character (as.character), specify the 'widths' as 1 replicated mx times, and assign the output to new 'Digit' columns in the data
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))
Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
df,
with(
df,
type.convert(
`colnames<-`(do.call(
rbind,
lapply(
strsplit(as.character(Number), ""),
`length<-`, max(nchar(Number))
)
), paste0("Digit", seq(max(nchar(Number))))),
as.is = TRUE
)
)
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA
Updated
I realized I could use max to get the character limit in each row and pass it to my map2 function, which saves a few lines of code. Thanks to dear @ThomasIsCoding for the accident that led to this inspiration.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric, na.rm = TRUE))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)
Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
I want to reshape the following data frame
x <- structure(
list(name = c("HT", "AT", "HG", "AG"),
conv = c(2L, 2L, 3L, 4L)),
.Names = c("name", "conv"), row.names = 1:4, class = "data.frame")
> x
name conv
1 HT 2
2 AT 2
3 HG 3
4 AG 4
into
conv x.1 x.2
1 2 HT AT
2 3 HG NA
3 4 AG NA
In the final data frame there should be a row for every distinct value of
conv, and as many x.? columns as there are rows in the original data
frame for that particular value of conv, filling with NAs when necessary. I don't care about the column names.
I tried reshape but I can't get it to work, because it seems that it needs
a third column that I don't have:
> reshape(x, idvar='conv', direction='wide')
Error in `[.data.frame`(data, , timevar) : undefined columns selected
Using data.table v1.9.5:
require(data.table)
dcast(setDT(x), conv ~ paste0("x.", x[, seq_len(.N), by=conv]$V1), value.var="name")
# conv x.1 x.2
# 1: 2 HT AT
# 2: 3 HG NA
# 3: 4 AG NA
You can install it by following the instructions here.
Something you can try:
xmax <- max(table(x$conv))
xsplit <- split(x, x$conv)
xsplit <- sapply(xsplit, function(tab){c(tab$name, rep(NA, xmax-length(tab$name)))})
x2 <- data.frame(conv=x$conv[!duplicated(x$conv)], t(xsplit), stringsAsFactors=F)
colnames(x2)[-1]<-paste("x",1:xmax,sep=".")
x2
# conv x.1 x.2
#2 2 HT AT
#3 3 HG <NA>
#4 4 AG <NA>
NB: with reshape, you can do what's below but I don't think that's what you want. There may be some parameters to set so you get what you want but I'm really not a reshape expert :-(
reshape(data=x, v.names="name", timevar="name", idvar="conv", direction="wide")
# conv name.HT name.AT name.HG name.AG
#1 2 HT AT <NA> <NA>
#3 3 <NA> <NA> HG <NA>
#4 4 <NA> <NA> <NA> AG
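For what it's worth, the "third column" that reshape is missing can be created first as a within-group counter, in the same way ave() is commonly used for this. This is only a sketch under that assumption; the column name k and the result name wide are made up for illustration.

```r
x <- data.frame(name = c("HT", "AT", "HG", "AG"),
                conv = c(2L, 2L, 3L, 4L),
                stringsAsFactors = FALSE)

# Running counter within each value of conv: 1, 2, 1, 1
x$k <- ave(x$conv, x$conv, FUN = seq_along)

# The counter serves as the timevar that reshape needs
wide <- reshape(x, idvar = "conv", timevar = "k", direction = "wide")
wide
```

The result has one row per distinct conv and columns name.1 and name.2, with NA where a conv value has fewer rows.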
Considering the following data frame:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df
var1 var2
1 1 5
2 2 6
3 3 7
4 4 8
5 5 1
I'd like to remove all rows whose values are flipped across the two columns. In this case, it would be row 1 and row 5 as the values 1 and 5 in row 1 are flipped to 5 and 1 in row 5. These two rows should be removed.
I hope it is clear what I am asking for :-)
Kind regards!
Perhaps something like this could work too:
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df[!do.call(paste, df) %in% do.call(paste, rev(df)), ]
var1 var2
2 2 6
3 3 7
4 4 8
I'd have to test it on a few more test cases though, but the general idea is to use rev to reverse the order of the columns in "df" and paste them together and compare that with the pasted columns from "df".
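A possible variant of the same idea builds an order-independent key with pmin() and pmax(), so a row and its flipped twin get the same key. This is a sketch, not a tested general solution; it assumes numeric columns, and note that it also drops exact duplicate rows, since those share a key too. The names key, keep and res are made up for illustration.

```r
df <- data.frame(var1 = 1:5, var2 = c(5, 6, 7, 8, 1))

# Same key for (1, 5) and (5, 1): smaller value first, larger value second
key <- paste(pmin(df$var1, df$var2), pmax(df$var1, df$var2))

# Keep only rows whose key occurs exactly once
keep <- !(duplicated(key) | duplicated(key, fromLast = TRUE))
res <- df[keep, ]
res
```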
Here's a simple but not especially elegant way: make a reversed data frame with a flag, and then merge it on to df:
# Make a reversed dataset
fd <- data.frame(var1 = df$var2, var2 = df$var1, flag = TRUE)
# Merge it onto your original df, then drop the matched rows and the flag var
df.sub <- subset(merge(x = df, y = fd, by = c("var1", "var2"), all.x = TRUE),
subset = is.na(flag),
select = c("var1", "var2"))
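Using a bit of maths - two rows are the same up to a permutation if their sum and the absolute value of their difference are the same, because the larger value is (sum + |difference|) / 2 and the smaller is (sum - |difference|) / 2: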
df[with(df, !duplicated(data.frame(var1 + var2, abs(var1 - var2)), fromLast = TRUE)),]
# var1 var2
#1 1 5
#2 2 6
#3 3 7
#4 4 8
edit: I should've read the question more carefully. To remove both duplicates, follow Ananda's suggestion:
df.ind = with(df, data.frame(var1 + var2, abs(var1 - var2)))
df[!duplicated(df.ind) & !duplicated(df.ind, fromLast = TRUE),]
# var1 var2
#2 2 6
#3 3 7
#4 4 8
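If creating a copy doesn't cause memory issues, then this works for the example as well (note that it only checks whether a row's var2 value occurs anywhere in var1, not whether the whole pair is flipped):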
df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
df2 <- data.frame(var12 = 1:5, var22 = c(5,6,7,8,1))
df3 <- merge(df,df2, by.x = 'var2', by.y = 'var12', all.x = TRUE)
df3 <- subset(
df3,
is.na(var22),
select = c('var1','var2')
)
Output:
> df3
var1 var2
3 2 6
4 3 7
5 4 8
I tried merging df with df but that gives a warning about the column var2 being duplicated. Anybody know what to do?
If you can assume there are no duplicates in the data frame, here's a one-line answer, though still not too concise (rbindlist comes from data.table):
library(data.table)
df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df) + 1:nrow(df)],]
## var1 var2
## 2 2 6
## 3 3 7
## 4 4 8
rbindlist is necessary here because rbind(df,df[,2:1]) will match by column name rather than index, so the other option is something like rbind(df,setnames(df[,2:1],names(df))). If you want to keep duplicates from the original, this gets even more unpleasant:
> df <- data.frame(var1 = 1:5, var2 = c(5,6,7,8,1))
> df<-rbind(df,c(2,6))
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)],]
var1 var2
2 2 6
3 3 7
4 4 8
> df[!duplicated(rbindlist(list(df,df[,2:1])))[nrow(df)+1:nrow(df)] | duplicated(df),]
var1 var2
2 2 6
3 3 7
4 4 8
6 2 6
This is a basic problem in data analysis which Stata deals with in one step.
Create a wide data frame with time invariant data (x0) and time varying data for years 2000 and 2005 (x1,x2):
d1 <- data.frame(subject = c("id1", "id2"),
x0 = c("male", "female"),
x1_2000 = 1:2,
x1_2005 = 5:6,
x2_2000 = 1:2,
x2_2005 = 5:6
)
s.t.
subject x0 x1_2000 x1_2005 x2_2000 x2_2005
1 id1 male 1 5 1 5
2 id2 female 2 6 2 6
I want to shape it like a panel so data looks like this:
subject x0 time x1 x2
1 id1 male 2000 1 1
2 id2 female 2000 2 2
3 id1 male 2005 5 5
4 id2 female 2005 6 6
I can do this with reshape s.t.
d2 <-reshape(d1,
idvar="subject",
varying=list(c("x1_2000","x1_2005"),
c("x2_2000","x2_2005")),
v.names=c("x1","x2"),
times = c(2000,2005),
direction = "long",
sep= "_")
My main concern is that when you have dozens of variables the above command gets very long. In Stata one would simply type:
reshape long x1 x2, i(subject) j(year)
Is there such a simple solution in R?
reshape can guess many of its arguments. In this case it's sufficient to specify the following. No packages are used.
reshape(d1, dir = "long", varying = 3:6, sep = "_")
giving:
subject x0 time x1 x2 id
1.2000 id1 male 2000 1 1 1
2.2000 id2 female 2000 2 2 2
1.2005 id1 male 2005 5 5 1
2.2005 id2 female 2005 6 6 2
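With dozens of variables, hard-coding the positions 3:6 may become fragile. One possible sketch selects the time-varying columns by name pattern instead; the regular expression assumes every time-varying column ends in an underscore followed by a four-digit year, and the names varying_cols and d_long are made up for illustration.

```r
d1 <- data.frame(subject = c("id1", "id2"),
                 x0 = c("male", "female"),
                 x1_2000 = 1:2, x1_2005 = 5:6,
                 x2_2000 = 1:2, x2_2005 = 5:6)

# Pick every column named like <stub>_<4-digit year>
varying_cols <- grep("_[0-9]{4}$", names(d1), value = TRUE)

# Using the existing subject column as idvar avoids a generated id column
d_long <- reshape(d1, direction = "long", varying = varying_cols,
                  sep = "_", idvar = "subject")
rownames(d_long) <- NULL
d_long
```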
Here is a brief example using the reshape2 package:
library(reshape2)
library(stringr)
# it is always useful to start with melt
d2 <- melt(d1, id=c("subject", "x0"))
# redefine the time and x1, x2, ... separately
d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
variable = str_replace(variable, "_.*$", ""))
# finally, cast as you want
d3 <- dcast(d2, subject+x0+time~variable)
Now you don't even need to specify x1 and x2.
This code still works as the number of variables increases:
> d1 <- data.frame(subject = c("id1", "id2"), x0 = c("male", "female"),
+ x1_2000 = 1:2,
+ x1_2005 = 5:6,
+ x2_2000 = 1:2,
+ x2_2005 = 5:6,
+ x3_2000 = 1:2,
+ x3_2005 = 5:6,
+ x4_2000 = 1:2,
+ x4_2005 = 5:6
+ )
>
> d2 <- melt(d1, id=c("subject", "x0"))
> d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
+ variable = str_replace(variable, "_.*$", ""))
>
> d3 <- dcast(d2, subject+x0+time~variable)
>
> d3
subject x0 time x1 x2 x3 x4
1 id1 male 2000 1 1 1 1
2 id1 male 2005 5 5 5 5
3 id2 female 2000 2 2 2 2
4 id2 female 2005 6 6 6 6