Splitting multiple columns in R

I have following dataframe:
olddf <- structure(list(test = structure(1:6, .Label = c("test1", "test2",
"test3", "test4", "test5", "test6"), class = "factor"), month0_gp1 = c("163±28",
"133±20", "177±29", "153±30", "161±31", "159±23"), month0_gp2 = c("122±17",
"167±20", "146±26", "150±27", "148±33", "161±37"), month1_gp1 = c("157±32",
"152±37", "151±24", "143±25", "144±29", "126±30"), month1_gp2 = c("181±14",
"133±34", "152±38", "144±30", "148±20", "137±19"), month3_gp1 = c("139±38",
"161±39", "166±38", "162±39", "151±38", "155±38"), month3_gp2 = c("151±40",
"161±33", "137±25", "161±31", "168±30", "147±34")), .Names = c("test",
"month0_gp1", "month0_gp2", "month1_gp1", "month1_gp2", "month3_gp1",
"month3_gp2"), row.names = c(NA, 6L), class = "data.frame")
test month0_gp1 month0_gp2 month1_gp1 month1_gp2 month3_gp1 month3_gp2
1 test1 163±28 122±17 157±32 181±14 139±38 151±40
2 test2 133±20 167±20 152±37 133±34 161±39 161±33
3 test3 177±29 146±26 151±24 152±38 166±38 137±25
4 test4 153±30 150±27 143±25 144±30 162±39 161±31
5 test5 161±31 148±33 144±29 148±20 151±38 168±30
6 test6 159±23 161±37 126±30 137±19 155±38 147±34
I have to split each of columns 2:7 into two (one for the mean and the other for the sd):
test month0_gp1_mean month0_gp1_sd month0_gp2_mean month0_gp2_sd month1_gp1_mean month1_gp1_sd ....
I checked earlier posts and used the do.call(rbind, ...) method:
mydf <- data.frame(do.call(rbind, strsplit(olddf$month0_gp1,'±')))
mydf
X1 X2
1 163 28
2 133 20
3 177 29
4 153 30
5 161 31
6 159 23
But this works for only one column at a time. How can I modify it to loop over columns 2:7 and combine the results into one new dataframe? Thanks for your help.
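One way to extend that strsplit idea to all of columns 2:7 is to wrap it in lapply over the column names and bind the pieces back together; a minimal base R sketch, assuming every cell uses the ± separator:
# Helper (illustrative name): split one ±-separated column into mean and sd
split_one <- function(nm) {
  m <- do.call(rbind, strsplit(olddf[[nm]], "±", fixed = TRUE))
  out <- data.frame(as.numeric(m[, 1]), as.numeric(m[, 2]))
  names(out) <- paste(nm, c("mean", "sd"), sep = "_")
  out
}
newdf <- cbind(olddf["test"], do.call(cbind, lapply(names(olddf)[-1], split_one)))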

First, get my cSplit function from this GitHub Gist.
Second, split it up:
cSplit(olddf, 2:ncol(olddf), sep = "±")
# test 2_1 2_2 3_1 3_2 4_1 4_2 5_1 5_2 6_1 6_2 7_1 7_2
# 1: test1 163 28 122 17 157 32 181 14 139 38 151 40
# 2: test2 133 20 167 20 152 37 133 34 161 39 161 33
# 3: test3 177 29 146 26 151 24 152 38 166 38 137 25
# 4: test4 153 30 150 27 143 25 144 30 162 39 161 31
# 5: test5 161 31 148 33 144 29 148 20 151 38 168 30
# 6: test6 159 23 161 37 126 30 137 19 155 38 147 34
If you want to do the column renaming in the same step, try:
Nam <- names(olddf)[2:ncol(olddf)]
setnames(
  cSplit(olddf, 2:ncol(olddf), sep = "±"),
  c("test", paste(rep(Nam, each = 2), c("mean", "sd"), sep = "_")))[]
Another option would be to look at dplyr + tidyr.
Here's the best I could come up with, but I'm not sure if this is the correct way to do this with these tools....
olddf %>%
  gather(GM, value, -test) %>%         # Makes the data somewhat long
  separate(value, c("MEAN", "SD")) %>% # Splits the "value" column. We're wide again
  gather(MSD, value, -test, -GM) %>%   # Makes the data long again
  unite(var, GM, MSD) %>%              # Combines the GM and MSD columns
  spread(var, value)                   # Goes from long back to wide
This is sort of the equivalent of melting the data once, using colsplit on the resulting "value" column, melting the data again, and using dcast to get the wide format.
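For reference, a rough reshape2 translation of that description (melt, then colsplit on the value column, melt again, then dcast) could look like this sketch:
library(reshape2)
long <- melt(olddf, id.vars = "test", variable.name = "GM")
long <- cbind(long[c("test", "GM")], colsplit(long$value, "±", c("MEAN", "SD")))
long <- melt(long, id.vars = c("test", "GM"), variable.name = "MSD")
dcast(long, test ~ GM + MSD)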

Here's a qdap approach:
library(qdap)
for(i in seq(2, 13, by = 2)){
  olddf <- colsplit2df(olddf, i,
                       paste0(names(olddf)[i], "_", c("mean", "sd")), sep = "±")
}
olddf[,-1] <- lapply(olddf[,-1], as.numeric)
olddf
I looked at Ananda's splitstackshape package first, as I figured there would be an easy way to do this there, but I couldn't find one.
I'm not sure whether you need the last line converting the columns to numeric, but I assumed you would.

Related

How to use extract on multiple columns and name output columns based on input column names

I have a data frame of blood pressure data of the following form:
bpdata <- data.frame(bp1 = c("120/89", "110/70", "121/78"), bp2 = c("130/69", "120/90", "125/72"), bp3 = c("115/90", "112/71", "135/80"))
I would like to use the following extract command, but globally, i.e. on all bp\d columns
extract(bp1, c("systolic_1","diastolic_1"),"(\\d+)/(\\d+)")
How can I capture the digit in the column selection and use it in the column output names? I can hack around this by creating a list of column names and then using one of the apply family, but it seems to me there ought to be a more elegant way to do this.
Any suggestions?
We could use read.csv on multiple columns in a loop (Map) with sep = "/" and cbind the list elements at the end with do.call
do.call(cbind, Map(function(x, y) read.csv(text = x, sep = "/", header = FALSE,
                                           col.names = paste0(c('systolic', 'diastolic'), y)),
                   unname(bpdata), seq_along(bpdata)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or without a loop, paste the columns to a single string for each row and then use read.csv/read.table
read.csv(text = do.call(paste, c(bpdata, sep = "/")),
         sep = "/", header = FALSE,
         col.names = paste0(c('systolic', 'diastolic'),
                            rep(seq_along(bpdata), each = 2)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or, using the tidyverse, a similar option is to unite the columns into a single one with /, then use either extract or separate to split that column back into multiple columns:
library(dplyr)
library(tidyr)
library(stringr)
bpdata %>%
  unite(bpcols, everything(), sep = "/") %>%
  separate(bpcols, into = str_c(c('systolic', 'diastolic'),
                                rep(seq_along(bpdata), each = 2)), convert = TRUE)
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
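With a recent tidyr (1.1 or later, if I remember correctly), you can also capture the digit from the column name directly with pivot_longer()/pivot_wider(); a sketch that produces the systolic_1/diastolic_1 style names asked for:
library(dplyr)
library(tidyr)
bpdata %>%
  mutate(row = row_number()) %>%                     # keep track of the original rows
  pivot_longer(-row, names_to = "bp", names_pattern = "bp(\\d+)") %>%
  separate(value, into = c("systolic", "diastolic"), sep = "/", convert = TRUE) %>%
  pivot_wider(names_from = bp, values_from = c(systolic, diastolic),
              names_glue = "{.value}_{bp}") %>%
  select(-row)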

How to obtain conditioned results from an R dataframe

This is my first message here. I'm trying to solve an R exercise from an edX R course, and I'm stuck in it. It would be great if somebody could help me solve it. Here are the dataframe and question given:
> students
height shoesize gender population
1 181 44 male kuopio
2 160 38 female kuopio
3 174 42 female kuopio
4 170 43 male kuopio
5 172 43 male kuopio
6 165 39 female kuopio
7 161 38 female kuopio
8 167 38 female tampere
9 164 39 female tampere
10 166 38 female tampere
11 162 37 female tampere
12 158 36 female tampere
13 175 42 male tampere
14 181 44 male tampere
15 180 43 male tampere
16 177 43 male tampere
17 173 41 male tampere
Given the dataframe above, create two subsets with students whose height is equal to or below the median height (call it students.short) and students whose height is strictly above the median height (call it students.tall). What is the mean shoesize for each of the above 2 subsets by population?
I've been able to create the two subsets students.tall and students.short (both currently display the results as TRUE/FALSE), but I don't know how to obtain the means by population. The data should be displayed like this:
kuopio tampere
students.short xxxx xxxx
students.tall xxxx xxxx
Many thanks if you can give me a hand!
We can split by a logical vector based on the median height
# // median height
medHeight <- median(students$height, na.rm = TRUE)
# // split the data into a list of data.frames using the 'medHeight'
lst1 <- with(students, split(students, height > medHeight))
Then loop over the list and use aggregate from base R:
lapply(lst1, function(dat) aggregate(shoesize ~ population,
                                     data = dat, FUN = mean, na.rm = TRUE))
However, we don't need to create two separate datasets or a list. It can be done by grouping by both 'population' and a 'grp' variable created from the logical vector:
library(dplyr)
students %>%
  group_by(grp = height > medHeight, population) %>%
  summarise(shoesize = mean(shoesize))
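If you want the output arranged exactly as in the question (rows for short/tall, columns for the populations), the grouped summary can be reshaped with pivot_wider(); a sketch:
library(tidyr)
students %>%
  group_by(grp = ifelse(height > medHeight, "students.tall", "students.short"),
           population) %>%
  summarise(shoesize = mean(shoesize)) %>%
  ungroup() %>%
  pivot_wider(names_from = population, values_from = shoesize)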
You can try this:
#Code
students.short <- students[students$height<=median(students$height),]
students.tall <- students[students$height>median(students$height),]
#Mean
mean(students.short$shoesize)
mean(students.tall$shoesize)
Output:
[1] 38.44444
[1] 42.75
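To break those means down by population, as the exercise asks, you can apply tapply() to each subset, for example:
tapply(students.short$shoesize, students.short$population, mean)
# kuopio: 39.5, tampere: 37.6
tapply(students.tall$shoesize, students.tall$population, mean)
# kuopio: 43, tampere: 42.6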
You can use pivot_wider() from tidyr and set the values_fn argument to mean. (The examples below summarise height; use values_from = shoesize if you want the mean shoe size the exercise asks for.)
library(dplyr)
library(tidyr)
df %>%
  mutate(grp = if_else(height > median(height), "students.tall", "students.short")) %>%
  pivot_wider(id_cols = grp, names_from = population, values_from = height, values_fn = mean)
# # A tibble: 2 x 3
# grp kuopio tampere
# <chr> <dbl> <dbl>
# 1 students.tall 176. 177.
# 2 students.short 164 163.
As a base R approach, you can try xtabs(), which returns a table object.
xtabs(height ~ grp + population,
      aggregate(height ~ grp + population, FUN = mean,
                transform(df, grp = ifelse(height > median(height), "students.tall", "students.short"))))
# population
# grp kuopio tampere
# students.short 164.0000 163.4000
# students.tall 175.6667 177.2000
Note: To convert a table object into data.frame, you can use as.data.frame.matrix().

str_match based on vector with count issue

I haven't got a reprex, but my data are stored in a CSV file:
https://transcode.geo.data.gouv.fr/services/5e2a1fbefa4268bc25628f27/feature-types/drac:site?format=CSV&projection=WGS84
library(readr)
bzh_sites <- read_csv("site.csv")
I want to count rows based on matching characters in the NATURE column.
pattern<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())
gives me:
NATURE n
1 allée 176
2 aqueduc 73
3 architecture 68
4 atelier 200
AND another test with the same data (NATURE)
pattern <- c("allée|aqueduc|architecture|atelier")
test2 <- bzh_sites %>%
  drop_na(NATURE) %>%
  group_by(NATURE = str_match(NATURE, pattern)) %>%
  summarise(n = n())
gives me:
NATURE n
1 allée 178
2 aqueduc 74
3 architecture 79
4 atelier 248
I have no idea where the difference in counts comes from.
I tried to find out where the discrepancy is for the first group, i.e. "allée". This is what I found:
library(stringr)
pattern1<-c("allée|aqueduc|architecture|atelier|bas|carrière|caveau|chapelle|château|chemin|cimetière|coffre|dépôt|dolmen|eau|église|enceinte|enclos|éperon|espace|exploitation|fanum|ferme|funéraire|groupe|habitat|maison|manoir|menhir|monastère|motte|nécropole|occupation|organisation|parcellaire|pêcherie|prieuré|production|rue|sépulture|stèle|thermes|traitement|tumulus|villa")
#Get indices where 'allée' is found using pattern1
ind1 <- which(str_match(bzh_sites$NATURE, pattern1 )[, 1] == 'allée')
pattern2 <- c("allée|aqueduc|architecture|atelier")
#Get indices where 'allée' is found using pattern2
ind2 <- which(str_match(bzh_sites$NATURE, pattern2)[, 1] == 'allée')
#Indices which are present in ind2 but absent in ind1
setdiff(ind2, ind1)
#[1] 3093 10400
#Get corresponding text
temp <- bzh_sites$NATURE[setdiff(ind2, ind1)]
temp
#[1] "dolmen allée couverte" "coffre funéraire allée couverte"
What happens when we use pattern1 and pattern2 on temp?
str_match(temp, pattern1)
# [,1]
#[1,] "dolmen"
#[2,] "coffre"
str_match(temp, pattern2)
# [,1]
#[1,] "allée"
#[2,] "allée"
As we can see, with pattern1 these values get classified into another group ("dolmen", "coffre") because that alternative occurs earlier in the string, hence the mismatch.
A similar explanation can be given for mismatches in other groups.
str_match only returns the first match; to get all the matches of the pattern we can use str_match_all:
table(unlist(str_match_all(bzh_sites$NATURE, pattern1)))
# allée aqueduc architecture atelier bas
# 178 76 79 252 62
# carrière caveau chapelle château chemin
# 46 35 226 205 350
# cimetière coffre dépôt dolmen eau
# 275 155 450 542 114
# église enceinte enclos éperon espace
# 360 655 338 114 102
#exploitation fanum ferme funéraire groupe
# 1856 38 196 1256 295
# habitat maison manoir menhir monastère
# 1154 65 161 1036 31
# motte nécropole occupation organisation parcellaire
# 566 312 5152 50 492
# pêcherie prieuré production rue sépulture
# 69 66 334 44 152
# stèle thermes traitement tumulus villa
# 651 50 119 1232 225
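If the goal is to count how many rows mention each term, so that a value like "dolmen allée couverte" is counted once for "dolmen" and once for "allée", a str_detect() sketch over the individual terms may be closer to what you are after:
library(stringr)
terms <- c("allée", "aqueduc", "architecture", "atelier")  # extend with the full list as needed
sapply(terms, function(p) sum(str_detect(bzh_sites$NATURE, fixed(p)), na.rm = TRUE))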

use dplyr mutate() in programming

I am trying to use a variable as the name of a new column created with mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  df %>% mutate(name = ifelse(x < 50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help, but I'm struggling to use it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
  df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
11 60 -1.01408 big
12 70 0.87534 big
13 55 0.85325 big
14 38 1.70972 small
15 6 0.74836 small
16 23 -0.08528 small
17 27 2.02613 small
18 76 -0.45648 big
19 97 1.20124 big
20 99 -0.34930 big
21 74 1.77341 big
22 72 -0.32862 big
23 64 -0.07994 big
24 53 -0.40116 big
25 16 -0.70226 small
26 8 0.78965 small
27 34 0.01871 small
28 24 1.95154 small
29 82 -0.70616 big
30 77 -0.40387 big
31 43 -0.88383 small
32 88 -0.21862 big
33 45 0.53409 small
34 29 -2.29234 small
35 54 1.00730 big
36 22 -0.62636 small
37 100 0.75193 big
38 52 -0.41389 big
39 36 0.19817 small
40 89 -0.49224 big
41 81 -1.51998 big
42 18 0.57047 small
43 78 -0.44445 big
44 49 -0.08845 small
45 20 0.14014 small
46 32 0.48094 small
47 1 -0.12224 small
48 66 0.48769 big
49 11 -0.49005 small
50 87 -0.25517 big
Following the dplyr programming vignette, define your function as follows:
new <- function(name)
{
  nn <- enquo(name) %>% quo_name()
  df %>% mutate(!!nn := ifelse(x < 50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
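With more recent dplyr/rlang versions the quoting helpers are often unnecessary when the new name arrives as a string; unquoting the string directly on the left-hand side of := works too. A sketch, assuming dplyr 0.7 or later:
library(dplyr)
new <- function(name) {
  # `name` is a string such as "newVar"; !! splices it in as the column name
  df %>% mutate(!!name := ifelse(x < 50, "small", "big"))
}
new("newVar")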
Base R solution
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  df[, name] = 's'
  df[, name][df$x > 50] = 'b'
  return(df)
}
I am using dplyr 0.5, so I just combine base R with mutate:
new <- function(Name){
  df = mutate(df, ifelse(x < 50, "small", "big"))
  names(df)[3] = Name
  return(df)
}
new("newVar")

Add new columns to a data.table containing many variables

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date(Sys.Date(), length = 3, by = "day")),
                Stock = rep(LETTERS[1:2], each = 3),
                x1 = sample(1:50, 6),
                x2 = sample(1:50, 6),
                x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables, and it also triggers an error with scale (though not with sum), since scale returns a one-column matrix rather than a plain vector.
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think with a small modification to your last code you can easily do both for as many variables as you want:
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
For simple functions (that don't need special treatment the way scale does) you could easily do something like:
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
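If you also want scale in that loop, one option (sketched here) is to keep the functions in a named list and wrap scale so it returns a plain vector rather than a one-column matrix:
fun_list <- list(min = min, max = max, mean = mean, sum = sum,
                 scale = function(x) as.numeric(scale(x)))
for(i in names(fun_list)){
  d[, paste0(vars, "_", i) := lapply(.SD, fun_list[[i]]), .SDcols = vars, by = Stock]
}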
Another variation using data.table
vars <- c("x2", "x3")
d[, paste0(rep(vars, each = 2), "_", c("scale", "sum")) := do.call(`cbind`,
       lapply(.SD, function(x) list(scale(x)[, 1], sum(x)))), .SDcols = vars, by = Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from @Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[, (cols) := unlist(lapply(.SD, function(x) list(scale(x)[, 1L], sum(x))),
                     rec = FALSE), by = Stock, .SDcols = vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). Then, from dplyr you could use the function mutate_at as I do in this example (with the first data set you showed in your question):
library(dplyr)
dt %>%
  group_by(Stock) %>%
  mutate_at(vars(x2, x3), funs(sum, scale))
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated, which will create more columns for you. Note that mutate_at applies the functions to the columns you select inside vars() and skips the grouping variable (Stock) by default. You can either list the columns you want to transform (which I did in this example) or deselect columns with a minus inside vars().
EDIT: replaced mutate_each above with mutate_at, as mutate_each will be deprecated in the near future.
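For what it's worth, with current dplyr (1.0 or later) the same idea would be written with across(); a sketch, again wrapping scale in as.numeric to avoid one-column matrix columns:
library(dplyr)
dt %>%
  group_by(Stock) %>%
  mutate(across(c(x2, x3),
                list(sum = sum, scale = ~ as.numeric(scale(.x))),
                .names = "{.col}_{.fn}")) %>%
  ungroup()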
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
  c(cols.all) := unlist(lapply(funs, Curry(lapply, X = .SD)), rec = FALSE),
  .SDcols = cols,
  by = Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
