Split column in dataframe in R at a '%' character

I have not been able to adapt the existing solutions on this forum for splitting a dataframe column in two.
I have this dataframe (which, funnily enough, has already been split once to look like this) that I need to split at the % symbol.
The end result I would like is for the percentages (32%, 35%, 54%, ... 55%) to be deleted, leaving just two columns of data. The data is scraped from a website.
Thank you

df$Long<-sapply(strsplit(as.character(df$Long), split= "\\%"),'[',2)
Long Short
1 239 497
2 142 269
3 216 186
4 96 52
5 93 184
6 160 142
7 96 79

This should do it:
df$Long <- paste0(unlist(lapply(strsplit(as.character(df$Long), "%"), `[[`, 1)), "%")
It splits each string in the Long column at "%" and takes the first element of each split. Then it appends "%" to each element of the resulting vector.

I would use tidyr and dplyr for this:
library(dplyr)
library(tidyr)
data.frame(Long = c("32% 239", "35% 142", "54% 216"), Short = c(497,269,186), stringsAsFactors = FALSE) %>%
  separate(Long, c("Long_percent","Long_2"), sep = " ") %>%
  select(-Long_percent)
Or you can also use a regex, which could be useful if the data is not so well formed:
data.frame(Long = c("32% 239", "35% 142", "54% 216"), Short = c(497,269,186), stringsAsFactors = FALSE) %>%
  mutate(Long = gsub("[0-9%]+ ", "", Long, perl = TRUE))
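For comparison, the same result can be sketched in base R without extra packages. This assumes the column looks like the sample data used above ("32% 239") and that everything up to and including the first "%" should be dropped; the regular expression is an illustrative choice, not something taken from the original posts.

```r
# Sample data in the same shape as the example above
df <- data.frame(Long = c("32% 239", "35% 142", "54% 216"),
                 Short = c(497, 269, 186),
                 stringsAsFactors = FALSE)

# Drop everything up to and including the first "%", plus any whitespace
# that follows it, then convert what is left to numeric
df$Long <- as.numeric(sub("^[^%]*%[[:space:]]*", "", df$Long))

print(df)
```

Converting to numeric at the end is optional; leave it out if the column should stay as text.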

Related

Use part of row data for new columns in R

I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but can only get the subject column to say "D:Mouse_2174". After adding an additional line I can get it down to "D:Mous2174".
Is there a way to eliminate all chars before _ and after / to obtain mouse ID?
For session number, I'm not quite as sure what to do with multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn    id    session
D:..  2174  9
D:..  2181  33
D:..  2183  107
D:..  2185  87
Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)

This certainly isn't an elegant solution, and it only works if your ID and session are always numbers.
library(dplyr)
library(tidyr)
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
  # Extract all numeric values from the string
  mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn))) %>%
  # Separate them
  separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
  # Extract them individually
  mutate(id = as.numeric(regmatches(id, gregexpr("[[:digit:]]+", id))),
         session = as.numeric(regmatches(session, gregexpr("[[:digit:]]+", session)))) %>%
  select(-idk)
Output:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87

Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
# separate on / into 4 new columns
separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
# remove unwanted columns
select(-item2, -item4) %>%
# separate again on _ into 2 new columns
separate(item1, sep = "_", into = c("prefix", "id")) %>%
# retain and rename desired columns
select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
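A more compact alternative, sketched here in base R, is to pull both numbers out with a single regular expression and capture groups. It assumes every path has the form prefix_ID/folder/SESSION/file, as in the sample data from the question; the pattern itself is an illustration, not part of the original answers.

```r
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1"),
  stringsAsFactors = FALSE
)

# Group 1: digits after the first underscore (mouse ID)
# Group 2: digits forming the third path component (session number)
pattern <- "^[^_]*_([0-9]+)/[^/]*/([0-9]+)/.*$"

df$id <- as.integer(sub(pattern, "\\1", df$rn))
df$session <- as.integer(sub(pattern, "\\2", df$rn))

print(df[, c("id", "session")])
```

Because sub returns the input unchanged when the pattern does not match, a path in a different format would end up as NA (with a warning) after as.integer, which makes malformed rows easy to spot.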

Separate character variable into two columns

I have scraped some data from a URL to analyse cycling results. Unfortunately the name column consists of the rider's name and the team name in one field. I would like to separate the two. Here's the code (the last part doesn't work):
library(rvest)
library(tidyr)
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = FALSE)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!

I think the request is slightly misformulated: what you want is to remove team from name.
This is how I would do it:
results_stradebianchi_2020 %>%
  mutate(name = stringr::str_remove(name, team))
Use this in place of your line with separate.
separate is not a good fit here because there is no clearly defined separator character.
I would also advise removing the leading blanks from name with stringr::str_trim(name).

You could do this in base R with gsub, replacing the pattern taken from the team column with "" (i.e. nothing) in the name column. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to strip the surrounding whitespace, setting whitespace="[\\h\\v]" so that it matches more kinds of horizontal and vertical space.
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
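Both answers treat the team name as a regular expression, so a team name containing characters such as "." or "(" could match more or less than intended. The sketch below shows a literal (fixed) match in base R instead. The two sample rows are hypothetical: they assume the scraped name field is simply the rider name followed directly by the team name.

```r
# Hypothetical sample rows; the real scraped values may differ
results <- data.frame(
  name = c(" van Aert WoutTeam Jumbo-Visma", " Formolo DavideUAE-Team Emirates"),
  team = c("Team Jumbo-Visma", "UAE-Team Emirates"),
  stringsAsFactors = FALSE
)

# Remove the team name from each rider name, row by row.
# fixed = TRUE makes sub() treat the team name as plain text, not a regex.
results$name <- trimws(mapply(
  function(rider, team) sub(team, "", rider, fixed = TRUE),
  results$name, results$team,
  USE.NAMES = FALSE
))

print(results)
```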

Remove dots from data column

I'm a beginner with R and with working with strings.
I've been trying to remove periods from my data but I can't find a solution.
This is the data I'm working on in a dataframe df:
df <- read.table(text = " n mesAno receita
97 1/2009 3.812.819.062,06
98 2/2009 4.039.362.599,36
99 3/2009 3.652.885.587,18
100 4/2009 3.460.247.960,02
101 5/2009 3.465.677.403,12
102 6/2009 3.131.903.622,55
103 7/2009 3.204.983.361,46
104 8/2009 3.811.786.009,24
105 9/2009 3.180.864.095,05
106 10/2009 3.352.535.553,88
107 11/2009 5.214.148.756,95
108 12/2009 4.491.795.201,50
109 1/2010 4.333.557.619,30
110 2/2010 4.808.488.277,86
111 3/2010 4.039.347.179,81
112 4/2010 3.867.676.530,69
113 5/2010 6.356.164.873,94
114 6/2010 3.961.793.391,19
115 7/2010 3797656130.81
116 8/2010 4709949715.37
117 9/2010 4047436592.12
118 10/2010 3923484635.28
119 11/2010 4821729985.03
120 12/2010 5024757038.22",
header = TRUE,
stringsAsFactors = TRUE)
My objective is to convert the receita column to numeric, as it is being stored as a factor. But applying conversion functions like as.numeric(as.factor(x)) does not work in rows 97:114 (it coerces to NAs).
I suppose this is because of the periods separating billions/millions/thousands in this column.
The conversion functions mentioned only work if I have something like 3812819062.06, as in rows 115:120.
I tried mutating the dataset, adding another column and modelling.
I don't really know if what I'm doing is right, but I also tried extracting the anomalous numbers to a variable and applying sub/gsub to them, without success.
Is there some straightforward way of doing this, that is, of instructing it to remove the first 2 occurrences of '.' and then replace the comma with a '.'?
I'm fairly confident that the function I need is gsub, but I'm having a hard time finding the correct usage. Any help will be appreciated.
Edit: My approach using dplyr::mutate(). Ugly but works.
df <- df %>%
mutate(receita_temp = receita) %>%
mutate(dot_count = str_count(receita, '\\.')) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\.', '', as.factor(receita_temp)),
gsub('\\,', '.',as.factor(receita_temp))
)) %>%
mutate(receita_temp = ifelse(dot_count == 3,
gsub('\\,', '.',as.factor(receita_temp)),
receita_temp)) %>%
select(-c(dot_count, receita)) %>%
rename(., receita = receita_temp)

I'm using regex and some stringr functions to remove all the periods except those followed by two digits and the end of the string. That way, periods denoting separation like in 3.811.786.009,24 are removed, but periods denoting the start of a decimal like in 4821729985.03 are not. Using str_remove_all rather than str_remove lets me not have to worry about removing the matches repeatedly or about how well it will scale. Then replace the remaining commas with periods, and make it numeric.
library(tidyverse)
df2 <- df %>%
mutate(receita = str_remove_all(receita, "\\.(?!\\d{2,}$)") %>%
str_replace_all(",", ".") %>%
as.numeric())
print(head(df2), digits = 12)
#> n mesAno receita
#> 1 97 1/2009 3812819062.06
#> 2 98 2/2009 4039362599.36
#> 3 99 3/2009 3652885587.18
#> 4 100 4/2009 3460247960.02
#> 5 101 5/2009 3465677403.12
#> 6 102 6/2009 3131903622.55
Created on 2018-09-04 by the reprex package (v0.2.0).

You can use the following.
First, create a function that will be used for the replacement:
repl = function(x)setNames(c("","."),c(".",","))[x]
This function takes either "." or "," and returns "" or "." respectively.
Now use this function to do the replacement:
stringr::str_replace_all(as.character(df[,3]), "[.](?!\\d+$)|,", repl)
[1] "3812819062.06" "4039362599.36" "3652885587.18" "3460247960.02" "3465677403.12" "3131903622.55"
[7] "3204983361.46" "3811786009.24" "3180864095.05" "3352535553.88" "5214148756.95" "4491795201.50"
[13] "4333557619.30" "4808488277.86" "4039347179.81" "3867676530.69" "6356164873.94" "3961793391.19"
[19] "3797656130.81" "4709949715.37" "4047436592.12" "3923484635.28" "4821729985.03" "5024757038.22"
Of course you can do the rest, i.e. calling as.numeric() etc.
To do this in base R:
sub(',','.',gsub('[.](?!\\d+$)','',as.character(df[,3]),perl=TRUE))
Or, if you know the exact number of . and , in your data, you could do:
a = as.character(df[,3])
regmatches(a,gregexpr('[.](?!\\d+$)|,',a,perl = TRUE)) = list(c("","","","."))
a

df$num <- as.numeric(sapply(as.character(df$receita), function(x) gsub("\\,","\\.",ifelse(grepl("\\,", x), gsub("\\.","",x),x))))
should do the trick.
First, the function looks for values containing ",", removes the "." characters from those values, and finally converts every "," into ".", so that the result can be converted to numeric without problems.
Use print(df$num, digits = 12) to see your data with 2 decimals.
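The same idea can be written as a small vectorized helper, which avoids calling gsub once per element. This is only a sketch: the name to_number is made up for illustration, and it assumes that any value containing a comma uses "." purely as a thousands separator.

```r
# Hypothetical helper: convert mixed-format number strings to numeric
to_number <- function(x) {
  x <- as.character(x)                       # works for factors too
  has_comma <- grepl(",", x, fixed = TRUE)   # values like 3.812.819.062,06
  # In comma-decimal values the periods are thousands separators: drop them
  x[has_comma] <- gsub(".", "", x[has_comma], fixed = TRUE)
  # The comma is the decimal mark: turn it into a period
  x <- sub(",", ".", x, fixed = TRUE)
  as.numeric(x)
}

# Three values taken from the question's data, stored as a factor
receita <- factor(c("3.812.819.062,06", "4.491.795.201,50", "3797656130.81"))
print(to_number(receita), digits = 12)
```

With the question's data this would be applied as df$receita <- to_number(df$receita).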

R: split a list of value ranges (in text) into two lists

I have a column of temperature ranges (or single temperature if only one reading was recorded) like this:
"117-118"
"117-118"
"117-122"
"122-128"
"123"
"118-124"
"118-124"
"118-124"
"123-128"
"91-101"
...
In R, how can I split this column into two columns (i.e., low temperature and high temperature columns)?

input <- c("117-118",
"117-118",
"117-122",
"122-128",
"123",
"118-124",
"118-124",
"118-124",
"123-128",
"91-101")
Using read.table with a custom separator:
newData <- read.table(text = input, sep = "-", fill = TRUE)
Note the blanks will be filled in with NA. If you want the mean, you could do:
means <- rowMeans(newData, na.rm = TRUE)
You could also use strsplit and sapply as you mention in your comment like this:
newData <- t(sapply(strsplit(input, "-"), function(x) as.numeric(c(x, x)[1:2])))
# If you want NAs as before, change from c(x, x) to c(x, NA)

Thanks to some useful tips given by @akrun and others, I found a way to get this to work:
temp <- c("117-118", "117-118", "117-122", "122-128" ,"123", "118-124", "118-124", "118-124", "123-128", "91-101")
low <- as.numeric(lapply(strsplit(temp,"-"), function(x) x[1]))
high <- as.numeric(lapply(strsplit(temp,"-"), function(x) x[2]))
This gives results as:
> low
[1] 117 117 117 122 123 118 118 118 123 91
> high
[1] 118 118 122 128 NA 124 124 124 128 101
For a single value, that value can be assigned to both vectors as follows (it may not be the best way, but it works):
high[is.na(high)] <- low[is.na(high)]
Which leads to this result:
> high
[1] 118 118 122 128 123 124 124 124 128 101
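The two steps can also be combined so the string is only split once and the single-value case is handled directly. This is a minimal sketch on a shortened version of the input; taking the last piece of each split as the high value is an assumption that holds when every entry is either "low-high" or a single number.

```r
temp <- c("117-118", "123", "91-101")

# Split each entry once; single readings yield a one-element vector
parts <- strsplit(temp, "-", fixed = TRUE)

# First piece is the low value, last piece is the high value
# (for a single reading the first and last piece are the same)
low  <- as.numeric(vapply(parts, function(p) p[1], character(1)))
high <- as.numeric(vapply(parts, function(p) p[length(p)], character(1)))

ranges <- data.frame(low = low, high = high)
print(ranges)
```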

Convert comma separated string to numeric columns

I have a dataset with several columns, one of which is a column for reaction times. These reaction times are comma separated to denote the reaction times (of the same participant) for the different trials.
For example: row 1 (i.e.: the data from participant 1) has the following under the column "reaction times"
reaction_times
2000,1450,1800,2200
Hence these are the reaction times of participant 1 for trials 1,2,3,4.
I now want to create a new data set in which the reaction times for these trials all form individual columns. This way I can calculate the mean reaction time for each trial.
trial 1 trial 2 trial 3 trial 4
participant 1: 2000 1450 1800 2200
I tried the colsplit from the reshape2 package but that doesn't seem to split my data into new columns (perhaps because my data is all in 1 cell).
Any suggestions?

I think you are looking for the strsplit() function;
a = "2000,1450,1800,2200"
strsplit(a, ",")
[[1]]
[1] "2000" "1450" "1800" "2200"
Notice that strsplit returns a list, in this case with only one element. This is because strsplit takes vectors as input. Therefore, you can also pass a long vector of your single-cell strings to the function and get back a list with one split element per string. In a more relevant example this looks like:
# Create some example data
dat = data.frame(reaction_time =
apply(matrix(round(runif(100, 1, 2000)),
25, 4), 1, paste, collapse = ","),
stringsAsFactors=FALSE)
splitdat = do.call("rbind", strsplit(dat$reaction_time, ","))
splitdat = data.frame(apply(splitdat, 2, as.numeric))
names(splitdat) = paste("trial", 1:4, sep = "")
head(splitdat)
trial1 trial2 trial3 trial4
1 597 1071 1430 997
2 614 322 1242 1140
3 1522 1679 51 1120
4 225 1988 1938 1068
5 621 623 1174 55
6 1918 1828 136 1816
and finally, to calculate the mean per person:
apply(splitdat, 1, mean)
[1] 1187.50 361.25 963.75 1017.00 916.25 1409.50 730.00 1310.75 1133.75
[10] 851.25 914.75 881.25 889.00 1014.75 676.75 850.50 805.00 1460.00
[19] 901.00 1443.50 507.25 691.50 1090.00 833.25 669.25
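Since the question asks for the mean reaction time of each trial rather than of each participant, the column means are what is needed. A small sketch follows: the first row is the example from the question, and the second row is made up purely for illustration.

```r
# First string is from the question; the second is a made-up participant
reaction_times <- c("2000,1450,1800,2200",
                    "1500,1550,1700,2000")

# Split each string on commas and stack the numeric vectors into a matrix
trials <- do.call(rbind, lapply(strsplit(reaction_times, ",", fixed = TRUE),
                                as.numeric))
colnames(trials) <- paste0("trial", seq_len(ncol(trials)))

# Mean per trial (columns); rowMeans(trials) would give the mean per participant
trial_means <- colMeans(trials)
print(trial_means)
```

This relies on every participant having the same number of trials; rbind would recycle values with a warning if the lengths differed.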

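A nifty, if rather heavy-handed, way is to use read.csv in conjunction with textConnection. Assuming your data is in a data frame, df (header = FALSE stops the first row from being consumed as column names):
x <- read.csv(textConnection(as.character(df[["reaction times"]])), header = FALSE)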

Old question, but I came across it from another recent question (which seems unrelated).
Both existing answers are appropriate, but I wanted to share an answer related to a package I have created called "splitstackshape" that is fast and has straightforward syntax.
Here's some sample data:
set.seed(1)
dat = data.frame(
reaction_time = apply(matrix(round(
runif(24, 1, 2000)), 6, 4), 1, paste, collapse = ","))
This is the splitting:
library(splitstackshape)
cSplit(dat, "reaction_time", ",")
# reaction_time_1 reaction_time_2 reaction_time_3 reaction_time_4
# 1: 532 1889 1374 761
# 2: 745 1322 769 1555
# 3: 1146 1259 1540 1869
# 4: 1817 125 996 425
# 5: 404 413 1436 1304
# 6: 1797 354 1984 252
And, optionally, if you need to take the rowMeans:
rowMeans(cSplit(dat, "reaction_time", ","))
# [1] 1139.00 1097.75 1453.50 840.75 889.25 1096.75

Another option using dplyr and tidyr (plus stringr for str_split) with Paul Hiemstra's example data is:
library(dplyr)
library(tidyr)
library(stringr)
# create example data
data = data.frame(reaction_time =
apply(matrix(round(runif(100, 1, 2000)),
25, 4), 1, paste, collapse = ","),
stringsAsFactors=FALSE)
head(data)
# clean data
data2 <- data %>% mutate(split_reaction_time = str_split(as.character(reaction_time), ",")) %>% unnest(split_reaction_time)
data2$col_names <- c("trial1", "trial2", "trial3", "trial4")
data2 <- data2 %>% spread(key = col_names, value = split_reaction_time) %>% select(-reaction_time)
head(data2)
