Replace String In Specific Column (R)

I have a dataset of approximately 2 million rows and 45 columns. I would like to replace a list of values in one specific column within this dataset.
I have tried gsub but it is proving to take a prohibitive length of time. I need to perform 16 replacements.
To give you an example of what I've done:
setwd("C:/RStudio")
dat2 <- read.csv("2016 new.csv", stringsAsFactors=FALSE)
dat3 <- read.csv("2017 new.csv", stringsAsFactors=FALSE)
dat4 <- read.csv("2018 new.csv", stringsAsFactors=FALSE)
myfulldata <- rbind(dat2, dat3)
myfulldata <- rbind(myfulldata, dat4)
myfulldata <- myfulldata[, -c(1,5,10,11,12,13,15,20,21,22,41,42,43,44,48,50,51,52,59,61,62,64,65,66,67,68,69,70,71,72)]
gc()
myfulldata[is.na(myfulldata)] <- ""
gc()
myfulldata <- gsub("Text Being Replaced","CS1",myfulldata, fixed=TRUE)
I've bound several files together and then removed the columns I don't need. The last line is where the string replacement begins. I only want to replace values in one specific column. With this in mind, can I use something other than gsub (or whatever works best) so that I'm only replacing values in column number 36, named Waypoint?
Many thanks,
Eoghan

Answer, with credit going out to phiver:
set.seed(123)
# data simulation
n = 10 #2e6
m = 45 #45
myfulldata <- as.data.frame(matrix(paste0("Text", 1:(n * m)), ncol = m), stringsAsFactors = FALSE)
names(myfulldata)[36] <- "Waypoint"
myfulldata$Waypoint[sample(seq.int(nrow(myfulldata)), 5)] <- "Text Being Replaced"
# data replacement (only in the Waypoint column)
myfulldata$Waypoint <- gsub("Text Being Replaced", "CS1", myfulldata$Waypoint, fixed = TRUE)
myfulldata$Waypoint
# [1] "Text351" "Text352" "CS1"     "CS1"     "Text355" "CS1"     "CS1"     "CS1"
# [9] "Text359" "Text360"
myfulldata
Output:
V33 V34 V35 Waypoint V37 V38
1 Text321 Text331 Text341 Text351 Text361 Text371
2 Text322 Text332 Text342 Text352 Text362 Text372
3 Text323 Text333 Text343 CS1 Text363 Text373
4 Text324 Text334 Text344 CS1 Text364 Text374
5 Text325 Text335 Text345 Text355 Text365 Text375
6 Text326 Text336 Text346 CS1 Text366 Text376
7 Text327 Text337 Text347 CS1 Text367 Text377
8 Text328 Text338 Text348 CS1 Text368 Text378
9 Text329 Text339 Text349 Text359 Text369 Text379
10 Text330 Text340 Text350 Text360 Text370 Text380
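As a side note (not part of the original answer): with 16 different strings to replace, a named lookup vector applied to the single column avoids writing 16 separate gsub() calls. This is only a sketch; the mappings below are placeholders, and it assumes each string to be replaced matches the whole cell value, as in the simulation above. If the targets are only substrings of a cell, looping gsub(..., fixed = TRUE) over the patterns, still restricted to myfulldata$Waypoint, is the safer route.
# sketch: replace many exact values in one column via a named lookup vector
# (the two mappings are placeholders for the 16 real replacements)
lookup <- c("Text Being Replaced"   = "CS1",
            "Other Text To Replace" = "CS2")
hit <- myfulldata$Waypoint %in% names(lookup)
myfulldata$Waypoint[hit] <- lookup[myfulldata$Waypoint[hit]]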

Related

Remove a part of a row name in a list of dataframes

I have two lists of dataframes. One list of dataframes is structured as follows:
data1
Label Pred n
1 Mito-0001_Series007_blue.tif Pear 10
2 Mito-0001_Series007_blue.tif Orange 223
3 Mito-0001_Series007_blue.tif Apple 890
4 Mito-0001_Series007_blue.tif Peach 34
This repeats with different numbers, e.g.
Label Pred n
1 Mito-0002_Series007_blue.tif Pear 90
2 Mito-0002_Series007_blue.tif Orange 127
3 Mito-0002_Series007_blue.tif Apple 76
4 Mito-0002_Series007_blue.tif Peach 344
The second list of dataframes is structured like this:
data2
Slice Area
Mask of Mask-0001Series007_blue-1.tif. 789.21
etc
Question
I want to make the row names match up by:
a) removing the "Mito-" from data1
b) removing the "Mask of Mask-" from data2
c) removing the "-1" towards the end of data2
keeping in mind that this is a list of dataframes.
So far:
I have used the information from the post named "How can I remove certain part of row names in data frame"
How can I remove certain part of row names in data frame
They suggest using
data2$Slice <- sub("Mask of Mask-", "", data2$Slice)
which obviously isn't working for the list of dataframes. It returns an empty character vector:
character(0)
Thanks in advance, I have been amazed at how great people are at answering questions on this site :)
First, we could define a function f that applies gsub with a regex that fits all cases.
f <- \(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
Explanation:
.* any character, repeated any number of times (greedy)
\\d{4} four digits
_? an underscore, if present
Series the literal text "Series"
\\d{3} three digits
_blue the literal text "_blue"
(...) capture group (groups are numbered internally)
\\. a literal period (it needs to be escaped, otherwise it means "any character")
(\\.tif)?\\.? an optional ".tif" followed by an optional trailing period
\\1\\2 in the replacement: the contents of capture groups 1 and 2
Test the regex
## test it
(x <- c(names(data1), data1[[1]]$Label, data2$Slice))
# [1] "Mito-0001_Series007_blue" "Mito-0002_Series007_blue"
# [3] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [5] "Mito-0001_Series007_blue.tif" "Mito-0001_Series007_blue.tif"
# [7] "Mask of Mask-0001Series007_blue-1.tif."
f(x)
# [1] "0001_Series007_blue" "0002_Series007_blue" "0001_Series007_blue" "0001_Series007_blue"
# [5] "0001_Series007_blue" "0001_Series007_blue" "0001Series007_blue"
Seems to work, so we can apply it.
names(data1) <- f(names(data1))
data1 <- lapply(data1, \(x) {x$Label <- f(x$Label); x})
data2$Slice <- f(data2$Slice)
data1
# $`0001_Series007_blue`
# Label Pred n
# 1 0001_Series007_blue Pear 10
# 2 0001_Series007_blue Orange 223
# 3 0001_Series007_blue Apple 890
# 4 0001_Series007_blue Peach 34
#
# $`0002_Series007_blue`
# Label Pred n
# 1 0002_Series007_blue Pear 90
# 2 0002_Series007_blue Orange 127
# 3 0002_Series007_blue Apple 76
# 4 0002_Series007_blue Peach 344
data2
# Slice Area
# 1 0001Series007_blue 789.21
Data:
data1 <- list(`Mito-0001_Series007_blue` = structure(list(Label = c("Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif", "Mito-0001_Series007_blue.tif",
"Mito-0001_Series007_blue.tif"), Pred = c("Pear", "Orange", "Apple",
"Peach"), n = c(10L, 223L, 890L, 34L)), class = "data.frame", row.names = c("1",
"2", "3", "4")), `Mito-0002_Series007_blue` = structure(list(
Label = c("Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif",
"Mito-0002_Series007_blue.tif", "Mito-0002_Series007_blue.tif"
), Pred = c("Pear", "Orange", "Apple", "Peach"), n = c(90L,
127L, 76L, 344L)), class = "data.frame", row.names = c("1",
"2", "3", "4")))
data2 <- structure(list(Slice = "Mask of Mask-0001Series007_blue-1.tif.",
Area = 789.21), class = "data.frame", row.names = c(NA, -1L
))
Using the given info
The answer by @jay.sf was really helpful, but it only worked for data1, not for data2. To ensure it also got applied to data2, I added some extra lines of code:
#Old code
f <-function(x) gsub('.*(\\d{4}_?Series\\d{3}_blue).*(\\.tif)?\\.?', '\\1\\2', x)
#I added the [[1]] after data2 as well
(x <- c(names(data1), data1[[1]]$Label, data2[[1]]$Slice))
f(x)
names(data1) <- f(names(data1))
data1 <- lapply(data1, function(x) {x$Label <- f(x$Label); x})
# This line of code was causing problems, so I removed it
# data2$Slice <- f(data2$Slice)
#And added the following to apply it to data 2
names(data2) <- f(names(data2))
data2 <- lapply(data2, function(x) {x$Slice <- f(x$Slice); x})
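As an aside (a sketch, not from either answer above): if, as in this follow-up, both data1 and data2 are lists of data frames, a small helper that cleans one named column in every data frame keeps the two blocks symmetric. The name clean_col is hypothetical; f is the function defined above.
# sketch: apply f() to one named column of every data frame in a list
clean_col <- function(lst, col) {
  lapply(lst, function(d) {
    d[[col]] <- f(d[[col]])
    d
  })
}
names(data1) <- f(names(data1))
names(data2) <- f(names(data2))
data1 <- clean_col(data1, "Label")
data2 <- clean_col(data2, "Slice")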

Replacing integers in a dataframe column that's a list of integer vectors (not just single integers) with character strings in R

I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on stackoverflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.
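A small defensive variant, offered only as a sketch and not taken from the answers above: the direct-indexing approaches assume the player_ids happen to equal the row positions (1..7 in order). If the ids were arbitrary, looking them up with match() keeps the relist() idea working, using the question's starting_dataframe and example_concord:
# sketch: look the ids up via match() so they need not equal row positions
idx <- match(unlist(starting_dataframe$player_indices), example_concord$player_ids)
starting_dataframe$player_indices <- relist(as.character(example_concord$last_names)[idx],
                                            starting_dataframe$player_indices)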

Extracting Column data from .csv and turning every 10 consecutive rows into corresponding columns

Below is the code I am trying to implement. I want to extract these values, 10 consecutive rows at a time, and turn them into corresponding columns.
This is how data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x != limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}
Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this, though. You can simply remove the empty lines, split the vector into chunks of size 10 (as explained here), and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
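(A further shortcut, offered only as a sketch building on the vec object above: since the cleaned vector's length is an exact multiple of 10, matrix() fills column-wise and gives the same result in one line.)
# sketch: reshape the cleaned vector directly; assumes length(vec) is a multiple of 10
as.data.frame(matrix(vec, nrow = 10), stringsAsFactors = FALSE)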
We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
From the linked screenshot (p1.png), the 'V1' column can also serve as the column names for the values in 'V2'. If that is the case, we can 'transpose' 'res1' except for the first column, and set the column names of the output to the first column of 'res1' (setNames(...)).
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1 != ''][, N := paste0('Col', 1:.N), V1],
            V1 ~ N, value.var = 'V2')
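For completeness, a tidyverse counterpart of the same long-to-wide reshape could look roughly like the sketch below (assuming the dplyr and tidyr packages; this is not part of the original answer).
# sketch: long-to-wide with dplyr/tidyr instead of data.table
library(dplyr)
library(tidyr)
myData %>%
  filter(V1 != "") %>%
  group_by(V1) %>%
  mutate(N = paste0("Col", row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = N, values_from = V2)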
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")

Read.table into R

I want to read a text file into R, but I have a problem: the row names and the first column of numbers are run together in the first field.
Data text file
revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855
R Code:
data.predicted_values = read.table("predicted_values.txt", sep=",")
Output:
V1 V2 V3 V4 V5 V6
1 revenues 4118000000.0 4315000000 4512000000 4709000000 4906000000 5103000000
2 cost_of_revenue-1595852945.4985902 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
3 gross_profit 2522147054.5014095 2663170808 2806054293 2950797511 3097400462 3245863145
How can I split the first column into two parts? I want V1 to be revenues, cost_of_revenue, gross_profit and V2 to be 4118000000.0, -1595852945.4985902, 2522147054.5014095, and so on.
This is along the same lines of thinking as @DWin's, but accounts for the negative values in the second row.
TEXT <- readLines("predicted_values.txt")
A <- gregexpr("[A-Za-z_]+", TEXT)
B <- read.table(text = regmatches(TEXT, A, invert = TRUE)[[1]], sep = ",")
C <- cbind(FirstCol = regmatches(TEXT, A)[[1]], B)
C
# FirstCol V1 V2 V3 V4 V5 V6
# 1 revenues 4118000000 4315000000 4512000000 4709000000 4906000000 5103000000
# 2 cost_of_revenue -1595852945 -1651829192 -1705945707 -1758202489 -1808599538 -1857136855
# 3 gross_profit 2522147055 2663170808 2806054293 2950797511 3097400462 3245863145
Since there are no commas between the row names and the first values, you need to add them back in:
txt <- "revenues 4118000000.0, 4315000000.0, 4512000000.0, 4709000000.0, 4906000000.0, 5103000000.0
cost_of_revenue-1595852945.4985902, -1651829192.2662954, -1705945706.6237037, -1758202488.5708148, -1808599538.1076286, -1857136855.234145
gross_profit 2522147054.5014095, 2663170807.7337046, 2806054293.376296, 2950797511.429185, 3097400461.892371, 3245863144.765855"
Lines <- readLines( textConnection(txt) )
# replace textConnection(.) with `file = "predicted_values.txt"`
res <- read.csv( text=sub( "(^[[:alpha:][:punct:]]+)(\\s|-)" ,
"\\1,", Lines) ,
header=FALSE, row.names=1 )
res
The decimal fractions may not print but they are there.
You want the row.names argument of read.table. Then you can simply transpose your data:
data.predicted_values = read.table("predicted_values.txt", sep=",", row.names=1)
data.predicted_values <- t(data.predicted_values)
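(Another base-R option, offered only as a sketch and not taken from the answers above, is strcapture(), which splits the leading field name from the numbers in one pass.)
# sketch: split the leading name from the comma-separated numbers with strcapture()
TEXT  <- readLines("predicted_values.txt")
parts <- strcapture("^([A-Za-z_]+)(.*)$", TEXT,
                    proto = data.frame(name = character(), rest = character()))
res   <- cbind(FirstCol = parts$name,
               read.table(text = parts$rest, sep = ","))
res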

Merging Two Headings Into One

Very simple question. I am using an Excel sheet that has two rows for the column headings; how can I convert these two heading rows into one? Further, these headings don't start at the top of the sheet.
Thus, I have DF1
Temp Press Reagent Yield A Conversion etc
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
and I want,
Temp degC Press bar Reagent /g Yield A % Conversion etc
1 2 3 4 5
6 7 8 9 10
Using colnames(DF1) returns the upper names, but getting the second line to merge with the upper one keeps eluding me.
Using your data, modified to quote text fields that contain the separator (get whatever tool you used to generate the file to quote text fields for you!)
txt <- "Temp Press Reagent 'Yield A' 'Conversion etc'
degC bar /g % %
1 2 3 4 5
6 7 8 9 10
"
The snippet of code below reads the file in two steps.
First we read the data; skip = 2 means skip the first 2 lines.
Next we read the file again, but only the first two lines; this output is then further processed by sapply(), where we paste(x, collapse = " ") the strings in the columns of the labs data frame. The results are assigned to the names of dat.
Here is the code:
dat <- read.table(text = txt, skip = 2)
labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
names(dat) <- sapply(labs, paste, collapse = " ")
dat
names(dat)
The code, when run, produces:
> dat <- read.table(text = txt, skip = 2)
> labs <- read.table(text = txt, nrows = 2, stringsAsFactors = FALSE)
> names(dat) <- sapply(labs, paste, collapse = " ")
>
> dat
Temp degC Press bar Reagent /g Yield A % Conversion etc %
1 1 2 3 4 5
2 6 7 8 9 10
> names(dat)
[1] "Temp degC" "Press bar" "Reagent /g"
[4] "Yield A %" "Conversion etc %"
In your case, you'll want to modify the read.table() calls to point at the file on your file system, so use file = "foo.txt" in place of text = txt in the code chunk, where "foo.txt" is the name of your file.
Also, if these headings don't start at the top of the file, then increase skip to 2+n where n is the number of lines before the two header rows. You'll also need to add skip = n to the second read.table() call which generates labs, where n is again the number of lines before the header lines.
This should work. You only need to set stringsAsFactors=FALSE when reading the data.
data <- structure(list(Temp = c("degC", "1", "6"), Press = c("bar", "2",
"7"), Reagent = c("/g", "3", "8"), Yield.A = c("%", "4", "9"),
Conversion = c("%", "5", "10")), .Names = c("Temp", "Press",
"Reagent", "Yield.A", "Conversion"), class = "data.frame", row.names = c(NA,
-3L)) # Your data
colnames(data) <- paste(colnames(data), data[1, ]) # Set new names
data <- data[-1, ] # Remove first line
data <- data.frame(apply(data, 2, as.numeric)) # Correct the classes (works only if all columns are numeric)
Just load your file with read.table(file, header = FALSE, stringsAsFactors = FALSE). Then you can use grep() to find the position where the header rows appear.
df <- data.frame(V1=c(sample(10), "Temp", "degC"),
V2=c(sample(10), "Press", "bar"),
V3 = c(sample(10), "Reagent", "/g"),
V4 = c(sample(10), "Yield_A", "%"),
V5 = c(sample(10), "Conversion", "%"),
stringsAsFactors=F)
idx <- unique(c(grep("Temp", df$V1), grep("degC", df$V1)))
df2 <- df[-(idx), ]
names(df2) <- sapply(df[idx, ], function(x) paste(x, collapse=" "))
Here, if you want, you can then convert all the columns to numeric as follows:
df2 <- as.data.frame(sapply(df2, as.numeric))
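Since the data actually starts life in an Excel sheet, the same two-step idea can also be applied directly with the readxl package. This is only a rough sketch; "sheet.xlsx" is a placeholder file name and n must be adjusted to however many rows sit above the header block.
# sketch: read an Excel sheet whose two header rows start n rows down
library(readxl)
n   <- 0                                    # rows above the two header rows (adjust)
hdr <- read_excel("sheet.xlsx", skip = n, n_max = 2, col_names = FALSE)
dat <- read_excel("sheet.xlsx", skip = n + 2, col_names = FALSE)
names(dat) <- paste(unlist(hdr[1, ]), unlist(hdr[2, ]))
dat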
