gsub inside function with mutate called in dplyr chain gives error

I have the following example data (the real data contains other columns with both numeric and character variables):
structure(list(AM = structure(1:20, .Label = c("AMP_R", "AZI_R",
"CHL_R", "CIP_R", "COL_R", "ERY_R", "ETP_R", "F.C_R", "FEP_R",
"FOT_R", "FOX_R", "GEN_R", "IMI_R", "MERO_R", "NAL_R", "STR_R",
"SULFA_R", "T.C_R", "TAZ_R", "TET_R"), class = "factor")), .Names = "AM", row.names = c(NA,
-20L), class = "data.frame")
I tried to create a function that detects whether a column in a data frame contains values ending in "_R". If it does, the function should remove this ending and then rename the values to their full names, according to a conversion table. If the "_R" ending is not present, it should just convert the names directly.
I have tried the following on the first part of the function:
library(dplyr)
convert_AM_names <- function(data, col) {
data %>%
mutate(col = gsub("(.*?)_R", "\\1", col))
}
I want to use it in a dplyr chain, like this:
AM <- AM %>%
rowwise() %>%
convert_AM_names(., AM)
However, when I do this, it gives the error "Error in mutate_impl(.data, dots): Column "col" must be length 1, not 20"
I saw that similar issues have been addressed here at SO, but for most of them the solution was to use rowwise(), which doesn't seem to work here. Any suggestions?

You can use an anchor for your regular expression that only matches when the _R is right at the end:
convert_AM_names <- function(col) {
gsub("(.*)_R$", "\\1", col)
}
library(dplyr)
df %>%
mutate(AM = convert_AM_names(AM))
Or directly - without the overhead of convert_AM_names():
df %>%
mutate(AM = gsub("(.*)_R$", "\\1", AM))
Both will yield:
AM
1 AMP
2 AZI
3 CHL
4 CIP
5 COL
6 ERY
7 ETP
8 F.C
9 FEP
10 FOT
11 FOX
12 GEN
13 IMI
14 MERO
15 NAL
16 STR
17 SULFA
18 T.C
19 TAZ
20 TET

You can use mutate_at() which allows you to select a column and apply a function to it.
AM %>%
mutate_at(.vars = "AM",
.funs = gsub,
pattern = "(.*?)_R",
replacement = "\\1")
If you wanted, you could also rewrite your function:
convert_AM_names <- function(col) {
gsub("(.*?)_R", "\\1", col)
}
And use it in mutate_at():
AM %>%
mutate_at(.vars = "AM",
.funs = convert_AM_names)
In both cases, the result looks like this:
AM
1 AMP
2 AZI
3 CHL
4 CIP
5 COL
6 ERY
7 ETP
8 F.C
9 FEP
10 FOT
11 FOX
12 GEN
13 IMI
14 MERO
15 NAL
16 STR
17 SULFA
18 T.C
19 TAZ
20 TET
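If you want to keep the original signature, where the function takes the data frame and an unquoted column name, here is a minimal sketch. It assumes a dplyr/rlang version recent enough to support the `{{ }}` operator, and the data frame name `am_df` is made up for this example. The original version likely failed because `col` inside mutate() is looked up as an ordinary variable rather than as the column you passed in. rowwise() is not needed, because sub() and gsub() are already vectorised.

```r
library(dplyr)

# Small stand-in for the example data (am_df is a hypothetical name)
am_df <- data.frame(AM = factor(c("AMP_R", "AZI_R", "F.C_R", "MERO_R")))

convert_AM_names <- function(data, col) {
  data %>%
    # {{ col }} forwards the unquoted column name into mutate();
    # as.character() makes the factor handling explicit
    mutate({{ col }} := sub("_R$", "", as.character({{ col }})))
}

result <- am_df %>% convert_AM_names(AM)
print(result)
```

The renaming step from the conversion table could then be added inside the same function, after the suffix has been removed.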

Related

Use part of row data for new columns in R

I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but I can only get the subject column to say "D:Mouse_2174". With an additional line I can get it down to "D:Mous2174".
Is there a way to eliminate all chars before _ and after / to obtain mouse ID?
For session number, I'm not quite as sure what to do with multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn    id    session
D:..  2174  9
D:..  2181  33
D:..  2183  107
D:..  2185  87
Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)

This is certainly not an elegant solution, and it only works if your ID and session are always numbers.
library(dplyr)
library(tidyr)
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
# Extract all numeric values from the string
mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn))) %>%
# Separate them
separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
# Extract them individually
mutate(id = as.numeric(regmatches(id, gregexpr("[[:digit:]]+", id))),
session = as.numeric(regmatches(session, gregexpr("[[:digit:]]+", session)))) %>%
select(-idk)
Output:
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87

Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
# separate on / into 4 new columns
separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
# remove unwanted columns
select(-item2, -item4) %>%
# separate again on _ into 2 new columns
separate(item1, sep = "_", into = c("prefix", "id")) %>%
# retain and rename desired columns
select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
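For comparison, here is a base R sketch that pulls both values out with a single regular expression and two capture groups. It assumes every path has the shape prefix_ID/folder/SESSION/file, as in the sample data; the `pattern` variable is just a name chosen for this example.

```r
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1"),
  stringsAsFactors = FALSE
)

# Group 1: digits after "_" and before the first "/"
# Group 2: digits that make up the third path component
pattern <- "^.*_([0-9]+)/[^/]+/([0-9]+)/.*$"

df$id      <- as.integer(sub(pattern, "\\1", df$rn))
df$session <- as.integer(sub(pattern, "\\2", df$rn))
print(df)
```

Because sub() is vectorised, this should also scale to a very large data frame without any row-wise loop.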

How to do rowsums on a select set of columns containing a string and a number in R?

I have a list of column names that look like this...
colnames(dat)
1 subject
2 e.type
3 group
4 boxnum
5 edate
6 file.name
7 fr
8 active
9 inactive
10 reward
11 latency.to.first.active
12 latency.to.first.inactive
13 act0.600
14 act600.1200
15 act1200.1800
16 act1800.2400
17 act2400.3000
18 act3000.3600
19 inact0.600
20 inact600.1200
21 inact1200.1800
22 inact1800.2400
23 inact2400.3000
24 inact3000.3600
25 rew0.600
26 rew600.1200
27 rew1200.1800
28 rew1800.2400
29 rew2400.3000
30 rew3000.3600
I want to get the row sums for the act#, inact#, and rew# columns.
This works...
for (row in 1:nrow(dat)) {
dat[row, "active"] = rowSums(dat[row,c(13:18)])
dat[row, "inactive"] = rowSums(dat[row,c(19:24)])
dat[row, "reward"] = rowSums(dat[row,c(25:30)])
}
But I don't want to hard-code it, since the number of columns in the three sections may change. How can I do this without hard-coding the column indexes?
Also, when I tried searching for the columns named "act", the search also included the "active" column.

sub_dat <- dat[, 13:30]
result <- sapply(split.default(sub_dat, substr(names(sub_dat), 1, 3)), rowSums)
dat[, c('active', 'inactive', 'reward')] <- result

Easy-peasy with select & matches from the tidyverse. The patterns are anchored with ^ so that "act[0-9]" does not also match the inact columns.
library(tidyverse)
dat %>%
mutate(
sum_act = rowSums(select(., matches("^act[0-9]"))),
sum_inact = rowSums(select(., matches("^inact[0-9]"))),
sum_rew = rowSums(select(., matches("^rew[0-9]")))
)
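The same idea works in base R. The sketch below uses a small made-up data frame and a hypothetical helper, `sum_by_prefix()`, that selects columns by an anchored regular expression: the prefix must be at the start of the name and be followed directly by a digit. That keeps "act" from matching "active" or the "inact" columns.

```r
# Toy data with the same naming scheme as in the question
dat <- data.frame(
  active = 0, inactive = 0, reward = 0,
  act0.600 = c(1, 2),   act600.1200 = c(3, 4),
  inact0.600 = c(5, 6), inact600.1200 = c(7, 8),
  rew0.600 = c(1, 1),   rew600.1200 = c(2, 2)
)

# Hypothetical helper: row sums over all columns named <prefix><digit>...
sum_by_prefix <- function(data, prefix) {
  cols <- grep(paste0("^", prefix, "[0-9]"), names(data), value = TRUE)
  rowSums(data[, cols, drop = FALSE])
}

dat$active   <- sum_by_prefix(dat, "act")
dat$inactive <- sum_by_prefix(dat, "inact")
dat$reward   <- sum_by_prefix(dat, "rew")
print(dat[, c("active", "inactive", "reward")])
```

No column indexes are hard-coded, so adding or removing time bins should not require any change to the code.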

I made an example how it could be done:
t <- data.frame(c(1,2,3),c("a","b","c"))
colnames(t) <- c("num","char")
#with function append() you make a list of rows that fulfill your logical argument
whichRows <- append(which(t$char == "a"),which(t$char == "b"))
sum(t$num[whichRows])
or if I misunderstood you and you want to sum for every column separately then:
sum(t$num[which(t$char == "a")])
sum(t$num[which(t$char == "b")])

Export file and create two headers based on one column

I'm not even sure how to approach this situation, I'm probably blocked. I have a wide dataframe, something like this
Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16
I would like to export the data to a csv file with the following format
Date   Amy       John
       X    Y    X     Y
March  14   15   10.5  14.5
April  10   11   15    16
April 10 11 15 16
My first question is what the best approach is to achieve this. Should I separate Amy_X into Amy and X, then create a repeated vector of names (Amy, Amy, John, John) and use that as another header? What's the best solution for this scenario?

The question says to output the file to csv but the output shown is not comma-separated values (csv). We show both.
Using input data frame DF defined reproducibly in the Note at the end, create a data frame from the headers and use separate_rows on it and rbind that to DF. Then do any remaining fix ups. Write it out without the row and column names and without quotes. Replace stdout() with your file name.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
names %>%
as.list %>%
as.data.frame %>%
separate_rows(everything()) %>%
setNames(names(DF)) %>%
rbind(DF)
DF2[2, 1] <- DF2[1, duplicated(unlist(DF2[1, ]))] <- ""
output <- capture.output(prmatrix(DF2, quote = FALSE,
rowlab = rep("", nrow(DF2)), collab = rep("", ncol(DF2))))[-1]
writeLines(output, stdout())
giving the following which reproduces the output shown in the question:
Date   Amy       John
       X    Y    X     Y
March  14   15   10.5  14.5
April  10   11   15    16
If you really did want csv then use this instead of the writeLines and statement prior to it above:
write.table(DF2, stdout(), sep = ",", quote = FALSE, row.names = FALSE,
col.names = FALSE)
giving:
Date,Amy,,John,
,X,Y,X,Y
March,14,15,10.5,14.5
April,10,11,15,16
Note
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
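As an alternative for the csv case, the two header rows can be built directly from the column names in base R. This is only a sketch: it assumes each name contains at most one underscore, and the variable names (`parts`, `top`, `second`, `body`, `csv_lines`) are chosen for this example.

```r
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)

# Split "Amy_X" into c("Amy", "X"); "Date" stays as a single piece
parts  <- strsplit(names(DF), "_")
top    <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, function(p) if (length(p) > 1) p[2] else "", character(1))

# Show each person's name only once in the first header row
top[duplicated(top)] <- ""

# Convert every column to character so numbers are written as they print
body <- do.call(cbind, lapply(DF, as.character))

csv_lines <- apply(rbind(top, second, body), 1, paste, collapse = ",")
writeLines(csv_lines)  # pass a file name as the second argument to write a file
```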

How can i add more columns in dataframe by for loop

I am a beginner in R and need to port some EViews code to R. In EViews, a loop adds 10 or more columns (variables) to the data by applying a formula.
Here is example EViews code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv,
deflator_gov=nominal_gov/real_gov,
deflator_exp=nominal_exp/real_exp)
print(df)
Please help me do this in R with a loop.

The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have 2 sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps, there may be a better way though.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
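With a recent tidyr (1.0.0 or later, which is an assumption about your setup), the reshaping can probably be done in one step with pivot_longer() and the special ".value" name. The sketch below uses small made-up numbers instead of rnorm() so the result is easy to check, and it computes the deflator as nominal / real * 100, following the formula in the question.

```r
library(dplyr)
library(tidyr)

# Made-up values, chosen so the deflators are easy to verify by hand
df <- data.frame(
  nominal_gdp = c(110, 120), nominal_inv = c(50, 60),
  real_gdp    = c(100, 100), real_inv    = c(40, 50)
)

long <- df %>%
  mutate(obs_id = row_number()) %>%
  # "nominal_gdp" is split at "_": the first part becomes a column name
  # (.value), the second part goes into the "variable" column
  pivot_longer(-obs_id,
               names_to = c(".value", "variable"),
               names_sep = "_") %>%
  mutate(deflator = nominal / real * 100)

print(long)
```

This yields one row per observation and variable, with nominal, real and deflator columns, without a separate join.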

We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
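If you prefer something that reads like the EViews loop, a plain for loop over the variable suffixes also works. This is a sketch with made-up numbers; it assumes the columns follow the nominal_<x> / real_<x> naming from the question.

```r
# Made-up values so the result is easy to verify
df <- data.frame(
  nominal_gdp = c(110, 120), nominal_inv = c(50, 60),
  real_gdp    = c(100, 100), real_inv    = c(40, 50)
)

# One pass per suffix, building the column names with paste0()
for (x in c("gdp", "inv")) {
  df[[paste0("deflator_", x)]] <-
    df[[paste0("nominal_", x)]] / df[[paste0("real_", x)]] * 100
}
print(df)
```

Using df[[name]] with a constructed string is the base R counterpart of the {%x} substitution in EViews.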

We can loop over column names using purrr::map_dfc then apply a custom function over the selected columns (i.e. the columns that matched the current name from nms)
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: quo_name, !!, and ensym are tools for programming with dplyr.

R - Removing the same name in two columns of a data frame

I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove repeated records.
When a creature appears in the name column, I need to keep that row in the data frame but remove the row where that creature's name is the spouse name. I have the following data sample:
   name                       spouse
15 Finarfin                   Eärwen
6  Tar-Vanimeldë              Herucalmo
17 Faramir                    Éowyn
8  Tar-Meneldur               Almarian
14 Finduilas of Dol Amroth    Denethor II
12 Finwë                      Míriel Serindë then ,Indis
9  Tar-Ancalimë               Hallacar
7  Tar-Míriel                 Ar-Pharazôn
5  Tarannon Falastur          Berúthiel
21 Rufus Burrows              Asphodel Brandybuck
2  Angrod                     Eldalótë
4  Ar-Gimilzôr                Inzilbêth
19 Lobelia Sackville-Baggins  Otho Sackville-Baggins
25 Mrs. Proudfoot             Odo Proudfoot
22 Rudigar Bolger             Belba Baggins
24 Odo Proudfoot              Mrs. Proudfoot
3  Ar-Pharazôn                Tar-Míriel
13 Fingolfin                  Anairë
18 Silmariën                  Elatan
23 Rowan Greenhand            Belba Baggins
20 Rían                       Huor
1  Adanel                     Belemir
16 Fastolph Bolger            Pansy Baggins
10 Morwen Steelsheen          Thengel
11 Tar-Aldarion               Erendis
25 Belemir                    Adanel
For example, I ran the code and in line 1 it caught name Adanel and got Belemir as its spouse, so I need to keep line 1, but remove line 25, because with that I will avoid duplicated data.
I have tried this following code:
interacialMariage <-data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove the rows whose name already appears as a spouse in another row?
P.S.: The comparison needs to be case-insensitive (Belemir == belemir) so that I don't run into problems in the future.
Thanks!

You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
function(i) paste(sort(c(trimws(tolower(data$name[i])),
trimws(tolower(data$spouse[i])))),
collapse=" "))
irM <- data[!duplicated(sorted),]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
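The same key can also be built without looping over rows, since pmin() and pmax() compare character vectors element-wise. The sketch below uses a four-row sample (with one name deliberately in lower case and with a trailing space) to show the idea; the variable names `a`, `b` and `key` are chosen for this example.

```r
data <- data.frame(
  name   = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "belemir "),
  spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel"),
  stringsAsFactors = FALSE
)

# Normalise case and whitespace before comparing
a <- trimws(tolower(data$name))
b <- trimws(tolower(data$spouse))

# Order each pair alphabetically so (x, y) and (y, x) give the same key
key <- paste(pmin(a, b), pmax(a, b), sep = "|")

irM <- data[!duplicated(key), ]
print(irM)
```

duplicated() marks only the later occurrence, so the first row of each couple is kept.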

My attempt with tidyverse:
library(tidyverse)
dat %>%
mutate(id = 1:n()) %>% # add id to label the pairs
gather('key', 'name', -id) %>% # transform: key (name | spouse), name, id
group_by(name) %>% # group by unique name to find duplicated
top_n(-1, wt = id) %>% # if name > 1, take row with the lower id
spread(key, name) %>% # spread data to original format
select(-id) # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = F
)
