Delete column and next column based on text

I have data like the mydata data frame below, and I want to delete the column that contains "rico" along with every column after it, so that only the columns before it remain.
This is what I tried, but it doesn't work:
mydata = data.frame(
X1 = c("john", "max", "jay", "douglas"),
X2 = c("alexia", "miguel", "vince", "gary"),
X3 = c("peter", "rico", "joe", "jenny"),
X4 = c("marc", "kelly", "max", "jones")
)
mydata[,grepl("rico", names(mydata))]
Some help would be appreciated

You can subset mydata with a range ending one before the position where grep hits rico.
mydata[1:(grep("rico", mydata)-1)]
#mydata[1:(grep("rico", mydata)[1]-1)] #Alternative when there are more hits
# X1 X2
#1 john alexia
#2 max miguel
#3 jay vince
#4 douglas gary
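One caveat with a 1:(n - 1) range: if "rico" sits in the first column, 1:0 evaluates to c(1, 0), which keeps X1 instead of dropping everything. The following is a minimal sketch of a more defensive variant. The helper name keep_before is made up for illustration, and it assumes exact cell matches rather than the partial matching grep does.

```r
# Hypothetical helper (not from the original answer): keep only the columns
# that come before the first column containing `pattern` as an exact cell value.
keep_before <- function(df, pattern) {
  # TRUE for each column that has at least one cell equal to `pattern`
  hit <- vapply(df, function(col) any(col == pattern, na.rm = TRUE), logical(1))
  if (!any(hit)) return(df)   # no match: leave the data untouched
  first <- which(hit)[1]      # position of the first matching column
  df[seq_len(first - 1)]      # seq_len(0) selects zero columns
}

mydata <- data.frame(
  X1 = c("john", "max", "jay", "douglas"),
  X2 = c("alexia", "miguel", "vince", "gary"),
  X3 = c("peter", "rico", "joe", "jenny"),
  X4 = c("marc", "kelly", "max", "jones"),
  stringsAsFactors = FALSE
)

result <- keep_before(mydata, "rico")
names(result)
```

Using seq_len instead of 1:(first - 1) is the design choice that matters here, because seq_len(0) is an empty index while 1:0 is not.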

You can use colSums -
mydata[cumsum(colSums(mydata == 'rico') > 0) == 0]
# X1 X2
#1 john alexia
#2 max miguel
#3 jay vince
#4 douglas gary
colSums counts how many times 'rico' appears in each column, and comparing that count with > 0 gives a logical vector marking the columns that contain it. cumsum over that vector stays at 0 until the first such column, so == 0 selects only the columns before the first occurrence of the word.
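To make the one-liner easier to follow, here is the same computation split into its intermediate steps. The variable names has_rico, running and keep are only for illustration.

```r
mydata <- data.frame(
  X1 = c("john", "max", "jay", "douglas"),
  X2 = c("alexia", "miguel", "vince", "gary"),
  X3 = c("peter", "rico", "joe", "jenny"),
  X4 = c("marc", "kelly", "max", "jones"),
  stringsAsFactors = FALSE
)

# Step 1: does each column contain "rico" at least once?
has_rico <- colSums(mydata == "rico") > 0

# Step 2: running count of matching columns seen so far (0 until the first hit)
running <- cumsum(has_rico)

# Step 3: keep only the columns where no match has been seen yet
keep <- running == 0

result <- mydata[keep]
names(result)
```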

Related

How to rename observations with semi-consistent format?

I'm working with the following data frame:
Team Direction Side
Joe HB-L L
Eric HB-R R
Tim FB-L R
Mike HB L
I would like to eliminate the "HB" or "FB" preceding the "L" or "R" in the 'Direction' column. I would also like to eliminate the observations for which there is no "L" or "R" in the 'Direction' column. I would like it to look like this:
Team Direction Side
Joe L L
Eric R R
Tim L R
Then, I want to add a column that indicates whether the 'Direction' and 'Side' columns are the same. If yes, I would like it to read "NEAR"; if not, I would like it to read "FAR".
Team Direction Side Relation
Joe L L NEAR
Eric R R NEAR
Tim L R FAR

We can first filter the rows that have "-" in Direction, remove everything up to and including the "-", and then use if_else to assign 'NEAR' or 'FAR' to Relation.
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Direction, '-')) %>%
mutate(Direction = str_remove(Direction, '.*-'),
Relation = if_else(Direction == Side, 'NEAR', 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR

We can do this in base R: subset the rows based on the presence of - in 'Direction', then use sub to remove the substring up to and including the -
df1 <- subset(df1, grepl('-', Direction))
df1$Direction <- sub(".*-", "", df1$Direction)
-output
df1
# Team Direction Side
#1 Joe L L
#2 Eric R R
#3 Tim L R
Then, we can use == to create a logical condition and index into c('FAR', 'NEAR') with it to fill in the values
df1$Relation <- with(df1, c('FAR', 'NEAR')[(Direction == Side) + 1])
-output
df1
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
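The indexing idiom c('FAR', 'NEAR')[(Direction == Side) + 1] can look cryptic, so here is a small sketch of what happens at each step, using plain vectors with the values from the filtered data. The variable names are only for illustration.

```r
direction <- c("L", "R", "L")
side      <- c("L", "R", "R")

# Comparison gives a logical vector
same <- direction == side

# Adding 1 coerces FALSE/TRUE to 0/1, so the result is 1 (different) or 2 (same)
idx <- same + 1

# Position 1 of the lookup vector is 'FAR', position 2 is 'NEAR'
relation <- c("FAR", "NEAR")[idx]
relation
```

If either column contained NA, the comparison and therefore the resulting label would also be NA.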
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
filter(grepl('-', Direction)) %>%
mutate(Direction = str_replace(Direction, '.*-', ''),
Relation = case_when(Direction == Side ~ 'NEAR', TRUE ~ 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
Or using data.table
library(data.table)
setDT(df1)[grepl('-', Direction)][, Direction := trimws(Direction,
whitespace = '.*-')][, Relation := fifelse(Direction == Side, 'NEAR', 'FAR')][]
# Team Direction Side Relation
#1: Joe L L NEAR
#2: Eric R R NEAR
#3: Tim L R FAR
data
df1 <- structure(list(Team = c("Joe", "Eric", "Tim", "Mike"), Direction = c("HB-L",
"HB-R", "FB-L", "HB"), Side = c("L", "R", "R", "L")),
class = "data.frame", row.names = c(NA,
-4L))

Creating new column based on values in preceding column

I would like to add a new column to a data.frame that converts from the numeric value in the first column to the corresponding string (if any) from a subsequent matching column i.e. the column name partially matches this value in the first column.
In this example, I wish to add a value for 'Highest_Earner', which depends on the value in the Earner_Number column:
> df1 <- data.frame("Earner_Number" = c(1, 2, 1, 5),
"Earner5" = c("Max", "Alex", "Ben", "Mark"),
"Earner1" = c("John", "Dora", "Micelle", "Josh"))
> df1
Earner_Number Earner5 Earner1
1 1 Max John
2 2 Alex Dora
3 1 Ben Micelle
4 5 Mark Josh
The result should be:
> df1
Earner_Number Earner5 Earner1 Highest_Earner
1 1 Max John John
2 2 Alex Dora Neither
3 1 Ben Micelle Micelle
4 5 Mark Josh Mark
I have tried cutting the data.frame into various smaller pieces, but was wondering if someone had a somewhat cleaner method?

# Convert the name columns to character so the nested ifelse returns names, not factor codes.
df1$Earner5 <- as.character(df1$Earner5)
df1$Earner1 <- as.character(df1$Earner1)
# Use a nested ifelse to build the new column.
df1$Highest_Earner <- ifelse(df1$Earner_Number == 5, df1$Earner5,
ifelse(df1$Earner_Number == 1, df1$Earner1, "Neither"))

dplyr approach
library(tidyverse)
df <- tibble("Earner_Number" = c(1,2,1,5), "Earner5" = c('Max', 'Alex','Ben','Mark'), "Earner1" = c("John","Dora","Micelle",'Josh'))
df %>%
mutate(Highest_Earner = case_when(Earner_Number == 1 ~ Earner1,
Earner_Number == 5 ~ Earner5,
TRUE ~ 'Neither'))
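Both answers hard-code the values 1 and 5. Since the question describes the column name as partially matching the number, a sketch of a more general base R lookup might build the column name from Earner_Number and pick the cell by matrix indexing. This assumes the columns are always named "Earner" followed by the number; col_idx and picked are illustrative names.

```r
df1 <- data.frame(Earner_Number = c(1, 2, 1, 5),
                  Earner5 = c("Max", "Alex", "Ben", "Mark"),
                  Earner1 = c("John", "Dora", "Micelle", "Josh"),
                  stringsAsFactors = FALSE)

# Column position of "Earner<n>" for each row; NA when no such column exists
col_idx <- match(paste0("Earner", df1$Earner_Number), names(df1))

# Pick one cell per row via a two-column (row, column) index matrix;
# rows whose column index is NA yield NA
picked <- as.matrix(df1)[cbind(seq_len(nrow(df1)), col_idx)]

# Fall back to "Neither" where there was no matching column
df1$Highest_Earner <- ifelse(is.na(col_idx), "Neither", picked)
df1$Highest_Earner
```

With this approach, adding an Earner2 column would be picked up without changing the code.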

R: How to pick common words or same numbers in 2 columns from a very large table with a fast way?

I have a very large table (1,000,000 X 20) to process and need to do it in a fast way.
For example, There are 2 columns X2 and X3 in my table:
X1 X2 X3
c1 1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
c2 2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
c3 3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
Now I would like to create 2 new columns which contain
1) the common words or the same numbers
For example: [1] "100020003001" "100020003002"
2) the count of the common words or the same numbers
For example: [1] 2
I have tried the method from the below thread, however, the processing time is slow since I did it with for loop:
Count common words in two strings
library(stringi)
Reduce(`intersect`,stri_extract_all_regex(vec1,"\\w+"))
Thanks for the help!
I am really struggling here...

We can split the 'X2' and 'X3' columns on ", ", get the intersect of the corresponding list elements with map2, and use lengths to count the number of elements in each list element
library(tidyverse)
df1 %>%
mutate(common_words = map2(strsplit(X2, ", "),
strsplit(X3, ", "),
intersect),
count = lengths(common_words))
# X1 X2 X3
#1 1 100020003001, 100020003002, 100020003003 100020003001, 100020003002, 100020003004
#2 2 100020003001, 100020004002, 100020004003 100020003001, 100020004007, 100020004009
#3 3 100050006003, 100050006001, 100050006001 100050006011, 100050006013, 100050006021
# common_words count
#1 100020003001, 100020003002 2
#2 100020003001 1
#3 0
Or using base R
df1$common_words <- Map(intersect, strsplit(df1$X2, ", "), strsplit(df1$X3, ", "))
df1$count <- lengths(df1$common_words)
data
df1 <- structure(list(X1 = 1:3, X2 = c("100020003001, 100020003002, 100020003003",
"100020003001, 100020004002, 100020004003",
"100050006003, 100050006001, 100050006001"),
X3 = c("100020003001, 100020003002, 100020003004",
"100020003001, 100020004007, 100020004009",
"100050006011, 100050006013, 100050006021")), class = "data.frame",
row.names = c("c1", "c2", "c3"))
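To see what the split-and-intersect step does, here is a small sketch on a single pair of strings, followed by the same idea applied across two vectors. Passing fixed = TRUE to strsplit treats ", " as a literal separator rather than a regular expression, which may help a little on a large table; that tweak is an assumption of this sketch, not part of the answer above.

```r
x2 <- c("100020003001, 100020003002, 100020003003",
        "100050006003, 100050006001, 100050006001")
x3 <- c("100020003001, 100020003002, 100020003004",
        "100050006011, 100050006013, 100050006021")

# One pair: split each string into its pieces, then keep the shared ones
a <- strsplit(x2[1], ", ", fixed = TRUE)[[1]]
b <- strsplit(x3[1], ", ", fixed = TRUE)[[1]]
common_first <- intersect(a, b)

# All pairs at once: Map walks both lists in parallel
common_all <- Map(intersect,
                  strsplit(x2, ", ", fixed = TRUE),
                  strsplit(x3, ", ", fixed = TRUE))
counts <- lengths(common_all)
counts
```

Note that intersect also removes duplicates, so a value repeated within one string is counted once.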

Match part of a string in a dataframe and replace it by entry of another dataframe

I'm fairly new to R and I'm running into the following problem.
Let's say I have the following data frames:
sale_df <- data.frame("Cheese" = c("cheese-01", "cheese-02", "cheese-03"), "Number_of_sales" = c(4, 8, 23))
id_df <- data.frame("ID" = c(1, 2, 3), "Name" = c("Leerdammer", "Gouda", "Mozerella"))
What I want to do is match the numbers of the first column of id_df to the numbers in the string of the first column of sale_df.
Then I want to replace the value in sale_df by the value in the second column of id_df, i.e. I want cheese-01 to become "Leerdammer".
Does anyone have any idea how I could solve this?

With tidyverse:
library(dplyr)
library(stringr)
sale_df %>%
mutate(ID = as.numeric(str_extract(Cheese, "(?<=cheese-).*"))) %>%
inner_join(id_df, by = "ID")
# Cheese Number_of_sales ID Name
#1 cheese-01 4 1 Leerdammer
#2 cheese-02 8 2 Gouda
#3 cheese-03 23 3 Mozerella

Assuming that all entries for Cheese in sale_df will start with cheese-, here is a simple solution.
sale_df$CheeseID <- as.numeric(substring(sale_df$Cheese, 8))
merge(sale_df, id_df, by.x = "CheeseID", by.y = "ID", all.x = TRUE)

sale_df$Number_of_sales <- id_df$Name[match(as.numeric(gsub("\\D", "", sale_df$Cheese)), id_df$ID)]
> sale_df
Cheese Number_of_sales
1 cheese-01 Leerdammer
2 cheese-02 Gouda
3 cheese-03 Mozerella
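If the goal is to overwrite the Cheese column itself, as the question describes, a minimal sketch with match could look like the following. It assumes every value has the form "cheese-<number>"; ids and pos are illustrative names, and unmatched IDs keep their original label.

```r
sale_df <- data.frame(Cheese = c("cheese-01", "cheese-02", "cheese-03"),
                      Number_of_sales = c(4, 8, 23),
                      stringsAsFactors = FALSE)
id_df <- data.frame(ID = c(1, 2, 3),
                    Name = c("Leerdammer", "Gouda", "Mozerella"),
                    stringsAsFactors = FALSE)

# Strip the "cheese-" prefix and convert the remainder to a number
ids <- as.numeric(sub("^cheese-", "", sale_df$Cheese))

# For each sale row, find the position of its ID in id_df (NA if absent)
pos <- match(ids, id_df$ID)

# Replace the label with the looked-up name, keeping the original when unmatched
sale_df$Cheese <- ifelse(is.na(pos), sale_df$Cheese, id_df$Name[pos])
sale_df
```

The argument order matters: match(ids, id_df$ID) returns, for each sale row, where its ID sits in id_df, so it still works when the two data frames differ in order or length.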

Extracting Column data from .csv and turning every 10 consecutive rows into corresponding columns

Below is the code I am trying to implement. I want to extract each run of 10 consecutive row values and turn it into a corresponding column.
This is what the data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x != limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}

Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this though. You can simply remove the empty lines, split the vector in chunks of size 10 (as explained here) and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
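Another way to sketch the same reshaping, assuming the number of non-empty values is an exact multiple of 10, is to let matrix() do the chunking: it fills column by column, so each run of 10 values becomes one column. The names wide and col1, col2 are illustrative.

```r
# Same toy data as above: ten "a", one empty separator, ten "b"
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol = 1),
                        stringsAsFactors = FALSE)

# Drop the empty separator rows
vec <- myData$V1[nchar(myData$V1) > 0]

# matrix() fills column-wise, so values 1-10 form column 1, 11-20 column 2, ...
wide <- as.data.frame(matrix(vec, nrow = 10), stringsAsFactors = FALSE)
names(wide) <- paste0("col", seq_len(ncol(wide)))
dim(wide)
```

If the length is not a multiple of 10, matrix() recycles values with a warning, so the split-based version is the safer choice for ragged input.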

We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
From the p1.png, the 'V1' column can also be the column names for the values in 'V2'. If that is the case, we can 'transpose' the 'res1' except the first column and change the column names of the output with the first column of 'res1' (setNames(...))
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1!=''][, N:= paste0('Col', 1:.N) ,V1], V1~N,
value.var='V2')
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")
