R - splitting column by legend from the other one - r

I've got a data.frame like the one below:
ID age legend location
1 83 country;province;city X;A;J
2 15 country;city X;K
3 2 country;province;city Y;B;I
4 12 country;city X;L
5 2 country;city Y;J
6 2 country;province;city Y;A;M
7 18 country;province;city X;B;J
8 85 country;province;city X;A;I
To describe it: the third column (legend) describes the values in the fourth column (location). The order of the entries in each row of legend matches the order of the values in location.
As a result, I need to obtain a data.frame like the one below:
ID age country province city
1 83 X A J
2 15 X <NA> K
3 2 Y B I
4 12 X <NA> L
5 2 Y <NA> J
6 2 Y A M
7 18 X B J
8 85 X A I
That is, I need to take the entries of the legend column, use them as the names of new columns, and fill those columns with the corresponding values from the location column. I cannot simply split the columns on ; because the number of entries differs from row to row. Any suggestions?

Using DF, shown reproducibly in the Note at the end, use separate_rows and then spread the data out from long to wide. If the order of the columns does not matter, the select line can be omitted.
library(dplyr)
library(tidyr)
DF %>%
separate_rows(legend, location) %>%
spread(legend, location) %>%
select(ID, age, country, province, city) # optional
giving:
ID age country province city
1 1 83 X A J
2 2 15 X <NA> K
3 3 2 Y B I
4 4 12 X <NA> L
5 5 2 Y <NA> J
6 6 2 Y A M
7 7 18 X B J
8 8 85 X A I
Note
Lines <- "
ID age legend location
1 83 country;province;city X;A;J
2 15 country;city X;K
3 2 country;province;city Y;B;I
4 12 country;city X;L
5 2 country;city Y;J
6 2 country;province;city Y;A;M
7 18 country;province;city X;B;J
8 85 country;province;city X;A;I"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
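For readers who would rather not depend on tidyr, the same reshaping can be sketched in base R. This is only an illustration, assuming that legend and location always hold the same number of ;-separated entries within a row. The names keys, vals, cols, wide and res are made up for this example, and only the first three rows of DF are reproduced.

```r
# First three rows of the example data (subset chosen for brevity)
DF <- data.frame(
  ID = 1:3,
  age = c(83, 15, 2),
  legend = c("country;province;city", "country;city", "country;province;city"),
  location = c("X;A;J", "X;K", "Y;B;I"),
  stringsAsFactors = FALSE
)

keys <- strsplit(DF$legend, ";", fixed = TRUE)    # column names per row
vals <- strsplit(DF$location, ";", fixed = TRUE)  # matching values per row
cols <- unique(unlist(keys))                      # "country" "province" "city"

# For each row, name the values by their legend and look up every column;
# columns missing from a row's legend come back as NA.
wide <- t(mapply(function(k, v) setNames(v, k)[cols], keys, vals))
colnames(wide) <- cols

res <- cbind(DF[c("ID", "age")], as.data.frame(wide, stringsAsFactors = FALSE))
res$province  # "A" NA "B"
```

Indexing a named vector by a name it does not contain returns NA, which is what fills the province column for rows without a province entry.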

Related

Gathering columns from wide to long by id [duplicate]

I've got a data frame like this:
set.seed(100)
drugs <- data.frame(id = 1:5,
drug_1 = letters[1:5], drug_dos_1 = sample(100,5),
drug_2 = letters[3:7], drug_dos_2 = sample(100,5)
)
id drug_1 drug_dos_1 drug_2 drug_dos_2
1 a 31 c 49
2 b 26 d 81
3 c 55 e 37
4 d 6 f 54
5 e 45 g 17
I'd like to transform this messy table into a tidy table with all drugs of an id in one column and the corresponding drug dosages in one column. The table should look like this in the end:
id drug dosage
1 a 31
1 c 49
2 b 26
2 d 81
etc
I guess this could be achieved by using a reshaping function that transforms my data from wide to long format, but I didn't manage to.
One option is melt from data.table, which can take multiple patterns in the measure argument:
library(data.table)
melt(setDT(drugs), measure = patterns('^drug_\\d+$', 'dos'),
value.name = c('drug', 'dosage'))[, variable := NULL][order(id)]
# id drug dosage
#1: 1 a 31
#2: 1 c 49
#3: 2 b 26
#4: 2 d 81
#5: 3 c 55
#6: 3 e 37
#7: 4 d 6
#8: 4 f 54
#9: 5 e 45
#10: 5 g 17
Here, 'drug' is common to all the column names, so we need a pattern that matches only the drug columns. One way is to anchor at the start of the string (^), then match the 'drug' substring, an underscore (_), and one or more digits (\\d+) up to the end of the string ($). For the dosage columns, the substring 'dos' is enough to match the column names that contain it.
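To see which columns each pattern picks up, the two regular expressions can be tried against the column names directly. This is a small base R sketch; the vector cols simply copies the column names from the question.

```r
# Column names of the drugs data frame from the question
cols <- c("id", "drug_1", "drug_dos_1", "drug_2", "drug_dos_2")

# Anchored pattern: 'drug_' followed only by digits, so drug_dos_* is excluded
drug_cols <- grep("^drug_\\d+$", cols, value = TRUE)
drug_cols  # "drug_1" "drug_2"

# Plain substring: any name containing 'dos'
dose_cols <- grep("dos", cols, value = TRUE)
dose_cols  # "drug_dos_1" "drug_dos_2"
```

An unanchored 'drug' pattern would match all four measurement columns, which is why the first pattern needs both ^ and $.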
library(dplyr)
library(tidyr) # gather() and spread() come from tidyr
drugs %>% gather(key,val,-id) %>% mutate(key=gsub('_\\d','',key)) %>% # replace _1 and _2 at the end with nothing
mutate(key=gsub('drug_','',key)) %>% group_by(key) %>% # strip drug_ from the start of drug_dos and group by key
mutate(row=row_number()) %>% spread(key,val) %>%
select(id,drug,dos,-row)
# A tibble: 10 x 3
id drug dos
<int> <chr> <chr>
1 1 a 31
2 1 c 49
3 2 b 26
4 2 d 81
5 3 c 55
6 3 e 37
7 4 d 6
8 4 f 54
9 5 e 45
10 5 g 17
Warning message:
attributes are not identical across measure variables;
they will be dropped
# This warning is generated because drug (chr) and dose (num) are merged into one column (val)
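Because gather puts drug names and doses into one val column, the dose ends up as character in the result above. As a hedged alternative, base R's reshape can stack the two sets of columns separately so each keeps its own type. The sketch below hard-codes the dosage values printed in the question instead of calling sample, and the column name slot is an invented label for the 1/2 suffix.

```r
# Data from the question, with the printed dosages typed in directly
drugs <- data.frame(
  id = 1:5,
  drug_1 = letters[1:5], drug_dos_1 = c(31, 26, 55, 6, 45),
  drug_2 = letters[3:7], drug_dos_2 = c(49, 81, 37, 54, 17),
  stringsAsFactors = FALSE
)

# Each element of 'varying' is one set of wide columns that becomes
# a single long column named by the matching entry of 'v.names'.
long <- reshape(
  drugs, direction = "long", idvar = "id",
  varying = list(c("drug_1", "drug_2"), c("drug_dos_1", "drug_dos_2")),
  v.names = c("drug", "dosage"),
  timevar = "slot"
)

# Sort by id, then drop the helper column
long <- long[order(long$id, long$slot), c("id", "drug", "dosage")]
```

Here dosage stays numeric and drug stays character, so no conversion is needed afterwards.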

understanding apply and outer function in R

Suppose I have data that looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other: if an ID changes its value of A over the period given by B (which runs from 1 to 4), its rows go into data frame K, and if it doesn't, they go into data frame L.
So for this data set, K will look like
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
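In terms of nested loops and if/else statements, it could be solved roughly like this:
K <- L <- df[0, ] # empty data frames with the same columns as df
for (i in unique(df$ID)) {
grp <- df[df$ID == i, ] # rows of one ID, assumed ordered by B
m <- 0 # number of times A changes between consecutive rows
for (j in seq_len(nrow(grp) - 1)) {
if (grp$A[j] != grp$A[j + 1]) m <- m + 1
}
if (m == 0) L <- rbind(L, grp) else K <- rbind(K, grp)
}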
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Could someone help me understand how they can be used in a situation like this?
So basically you don't need a loop with conditions here. All you need to do is check, for each ID (each cycle of B), whether A has any variance, and convert the result to a logical with !. This requires converting A to a numeric value; I'm assuming it's a factor in your real data set (if it's not a factor, you can use FUN = function(x) length(unique(x)) within ave instead). Then split accordingly. With base R we can use ave for such a task, for example
indx <- !with(df, ave(as.numeric(A), ID , FUN = var))
Or (if A is a character rather than a factor)
indx <- with(df, ave(A, ID , FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
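What makes ave convenient here is that it returns a vector as long as its input, with each group's result repeated for every row of that group. The sketch below is one hedged way to make this visible without depending on whether A is a factor or a character: it runs ave over row indices instead of over A itself. The helper names n_unique, unchanged and parts are invented for the example.

```r
# Example data from the question, with A stored as character
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = FALSE
)

# ave() over row indices: FUN sees the indices of one ID at a time and
# its single result is recycled to every row of that ID.
n_unique <- ave(seq_along(df$A), df$ID,
                FUN = function(i) length(unique(df$A[i])))
n_unique  # 3 3 3 3 2 2 2 2 1 1 1 1

unchanged <- n_unique == 1L   # TRUE for IDs whose A never changes
parts <- split(df, unchanged) # list with elements "FALSE" and "TRUE"
```

parts[["TRUE"]] then holds the rows of ID 3, and parts[["FALSE"]] the rows of IDs 1 and 2.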
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID] # as above, assumes A is a factor
split(df, df$indx)
Or dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(indx = !var(as.numeric(A))) %>% # as above, assumes A is a factor
split(., .$indx) # split() needs the column itself, not a bare name
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all of df$A grouped by df$ID, and apply the function to df$A within each group (i.e. the x in the embedded function): if the number of unique values is more than 1, it's "K", otherwise it's "L".
We can do this using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1 and create a column 'ind' based on that. We can then split the dataset based on that 'ind' column.
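The tapply result has one label per ID, so it still has to be mapped back onto the rows before the data can be split. A minimal sketch of that last step, assuming A is a character column, could look like this (grp and parts are names made up for the example):

```r
# Example data from the question
df <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = FALSE
)

# One label per ID: "K" if A takes more than one value, "L" otherwise
grp <- tapply(df$A, df$ID, function(x) if (length(unique(x)) > 1) "K" else "L")

# Look up each row's label by its ID (the names of grp are the IDs)
parts <- split(df, grp[as.character(df$ID)])
K <- parts$K
L <- parts$L
```

Indexing grp by as.character(df$ID) expands the three labels to twelve, one per row, which is the form split expects.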
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
group_by(ID) %>%
mutate(ind= n_distinct(A)>1)
split(df2, df2$ind)
Or a base R option with table. We get the table of the first two columns of 'df1', i.e. 'ID' and 'A'. By double negating (!!) the output, the '0' counts become 'FALSE' and all other frequencies become 'TRUE'. Taking the rowSums then counts the distinct 'A' values per 'ID', and comparing that with 1 gives 'indx'. We match the 'ID' column in 'df1' with the names of 'indx', use that to replace each 'ID' with TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
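The table-based one-liner is easier to follow when the intermediate objects are inspected one at a time. The following sketch just unrolls it on the example data; tab and present are names introduced here for illustration.

```r
# Example data from the question
df1 <- data.frame(
  ID = rep(1:3, each = 4),
  A  = c("X", "X", "Z", "Y",  "Y", "X", "X", "Y",  "Y", "Y", "Y", "Y"),
  B  = rep(1:4, times = 3),
  C  = c(10, 10, 15, 12,  15, 13, 13, 13,  16, 18, 19, 10),
  stringsAsFactors = FALSE
)

tab <- table(df1[1:2])        # counts of each A value (columns) per ID (rows)
present <- !!tab              # TRUE where a count is non-zero, FALSE where it is 0
indx <- rowSums(present) > 1  # TRUE if an ID has more than one distinct A
indx
#     1     2     3
#  TRUE  TRUE FALSE
```

The names of indx are the IDs as character strings, which is why the original code uses match(df1$ID, names(indx)) to expand it to one value per row.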
If we need the individual datasets in the global environment, we can change the names of the list elements to the object names we want and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir=.GlobalEnv)

Concatenate data frames together based on similar column values

Specifically, say I had three data frames d1, d2, d3:
d1:
X Y Z value
1 0 20 135 43
2 0 4 105 50
3 5 18 20 10
...
d2:
X Y Z value
1 0 20 135 15
2 0 4 105 14
3 2 9 12 16
...
d3:
X Y Z value
1 0 20 135 29
2 2 9 14 16
...
I want to be able to combine these data frames such that each row of the combined data frame consists of three values, based on all unique X, Y, Z combinations. If such an X, Y, Z combination does not exist in one of the original data frames then I just want it to have a value of null (or some arbitrarily low number if that isn't possible). So I'd want an output of:
dfinal:
X Y Z value1 value2 value3
1 0 20 135 43 15 29
2 0 4 105 50 14 null
3 5 18 20 10 null null
4 2 9 12 null 16 null
5 2 9 14 null null 16
...
Is there any efficient way of doing this? I've tried doing this instead using data.table which seemed more suited for this but have yet to figure out how.
?merge
Should do the trick?
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
So:
merge(d1,d2, by=c("X","Y","Z"))
You can also include all=TRUE to keep the rows that appear in only one of the data frames. The missing data will be NA:
merge(d1,d2, by=c("X","Y","Z"), all=TRUE)
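Since the question has three data frames, the two-way merge has to be repeated. One way to sketch this is to rename each value column first and then fold merge over a list with Reduce. The data below copies only the rows visible in the question (the rows hidden behind "..." are left out), and dfs and dfinal are names chosen for this example.

```r
# Visible rows of d1, d2 and d3 from the question
d1 <- data.frame(X = c(0, 0, 5), Y = c(20, 4, 18), Z = c(135, 105, 20), value = c(43, 50, 10))
d2 <- data.frame(X = c(0, 0, 2), Y = c(20, 4, 9),  Z = c(135, 105, 12), value = c(15, 14, 16))
d3 <- data.frame(X = c(0, 2),    Y = c(20, 9),     Z = c(135, 14),      value = c(29, 16))

dfs <- list(d1, d2, d3)

# Give each value column a distinct name: value1, value2, value3
for (i in seq_along(dfs)) names(dfs[[i]])[4] <- paste0("value", i)

# Full outer join of all three on X, Y, Z; missing combinations become NA
dfinal <- Reduce(function(a, b) merge(a, b, by = c("X", "Y", "Z"), all = TRUE), dfs)
```

Renaming before merging avoids the value.x / value.y suffixes that merge would otherwise add, and the same call works unchanged for a longer list of data frames.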
Take a look at dplyr and its join methods. I wrote a small example:
library(dplyr)
library(data.table)
d1 <- data.table(X = c(1,2,3), Y = c(2,3,4), Z = c(8,3,9), value = c(22,3,44))
d2 <- data.table(X = c(1,4,3), Y = c(2,6,4), Z = c(8,9,9), value = c(44,22,11))
d2 <- rename(d2, value2 = value)
full_join(d1,d2)
output:
X Y Z value value2
1 1 2 8 22 44
2 2 3 3 3 NA
3 3 4 9 44 11
4 4 6 9 NA 22

How to merge tables and fill the empty cells in the mean time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How can I merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming you have NA on every position in the first table where you want to use the second table's age numbers you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
Results in what you're after (although unordered and I assume you just forgot ID 5)
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data frames and keep their columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=TRUE)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[names(res) == 'AGE.x'] <- 'AGE' # rename merged column AGE.x to AGE (exact match)
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
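The one-liner above only works when the matching rows of y happen to come in the same order as the missing rows of x. A slightly more defensive base R sketch, which looks up each ID explicitly with match, might be (idx and fill are names made up here):

```r
x <- data.frame(ID = c(1, 2, 3, 4), AGE = c(20, NA, 40, NA))
y <- data.frame(ID = c(2, 4, 5), AGE = c(25, 45, 60))

idx  <- match(x$ID, y$ID)            # row of y for each ID in x, NA if absent
fill <- is.na(x$AGE) & !is.na(idx)   # missing in x AND available in y

x$AGE[fill] <- y$AGE[idx[fill]]      # fill only those positions
x$AGE  # 20 25 40 45
```

IDs that exist only in y (here ID 5) are ignored, and a missing AGE with no match in y simply stays NA.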

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
For example, if I wanted to run this function on column w and append the output column to data frame m, then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
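For the single-step version hinted at above, one possibility is to assign the list returned by lapply straight into new columns of the data frame. The sketch below uses the data frame m from the question; vars is a made-up name for the columns to transform and would need to be extended for the larger data set.

```r
# Data frame from the question
m <- data.frame(
  id = 1:6,
  w = c(2, 18, 1, 52, 5, 3),
  y = c(5, 5, 25, 25, 5, 3),
  z = c(8, 98, 5, 8, 4, 5)
)

# Rank-based normal transformation from the question
transCols <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))

vars <- c("w", "y", "z")  # columns to transform

# Create w_n, y_n, z_n in one assignment
m[paste0(vars, "_n")] <- lapply(m[vars], transCols)

names(m)  # "id" "w" "y" "z" "w_n" "y_n" "z_n"
```

To transform every column except id, vars could be replaced by setdiff(names(m), "id").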
