Change value column a if column b contains conditional string - r

This issue is giving me a lot of trouble, even though it should be easy to fix. I have a dataset with the columns id and poster. I want to change the poster's value if the id value contains a certain string. See data below:
test_df
id poster
143537222999_2054 Kevin
143115551234_2049 Dave
14334_5334 Eric
1456322_4334 Mandy
143115551234_445633 Patrick
143115551234_4321 Lars
143537222999_56743 Iris
I would like to get
test_df
id poster
143537222999_2054 User
143115551234_2049 User
14334_5334 Eric
1456322_4334 Mandy
143115551234_445633 User
143115551234_4321 User
143537222999_56743 User
Both the columns are characters. I would like to change the poster's value to "User" if id value contains "143537222999", OR "143115551234". I have tried the following codes:
Match within/which
test_df <- within(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
This code gave me no errors, but it didn't change any of the values in the poster column. When I replace within for which, I get the error:
test_df <- which(test_df, poster[match('143115551234', test_df$id) | match('143537222999', test_df$id)] <- 'User')
Error in which(test_df, poster[match("143115551234", test_df$id) | :
argument to 'which' is not logical
Match different variant
test_df <- test_df[match(id, test_df, "143115551234") | match(id, test_df, "143537222999"), test_df$poster] <- 'User'
This code gives me the error:
Error in `[<-.data.frame`(`*tmp*`, match(id, test_df, "143115551234") | :
missing values are not allowed in subscripted assignments of data frames
In addition: Warning messages:
1: In match(id, test_df, "143115551234") :
NAs introduced by coercion to integer range
2: In match(id, test_df, "143537222999") :
NAs introduced by coercion to integer range
After looking up this error I found out that integers in R are 32-bit and that the maximum value of an integer is 2147483647. I'm not sure why I'm getting this error, because R states that my columns are character.
> lapply(test_df, class)
$poster
[1] "character"
$id
[1] "character"
Grepl
test_df[grepl("143115551234", id | "143537222999", id), poster := "User"]
This code raises the error:
Error in `:=`(poster, "User") : could not find function ":="
I'm not sure what the best way to fix this is. I have tried multiple variations and keep running into different errors.
I have also tried answers from several earlier questions on here, but I still can't get rid of the errors.

Use grepl with ifelse:
df$poster <- ifelse(grepl("143537222999|143115551234", df$id), "User", df$poster)

You may try this using grepl.
df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
grepl() returns TRUE for every row whose id matches the pattern, and the poster value in those rows is replaced by "User":
> df[grepl('143115551234|143537222999', df$id),"poster"] <- "User"
> df
id poster
1 143537222999_2054 User
2 143115551234_2049 User
3 14334_5334 Eric
4 1456322_4334 Mandy
5 143115551234_445633 User
6 143115551234_4321 User
7 143537222999_56743 User
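For context on why the match() attempts in the question changed nothing: match() compares whole strings, so it only finds an id that is exactly equal to "143115551234", whereas grepl() tests whether the pattern occurs anywhere inside the string. A minimal base R sketch, with the question's data typed in by hand (the name targets is made up for illustration):

```r
# Rebuild the example data from the question
test_df <- data.frame(
  id = c("143537222999_2054", "143115551234_2049", "14334_5334",
         "1456322_4334", "143115551234_445633", "143115551234_4321",
         "143537222999_56743"),
  poster = c("Kevin", "Dave", "Eric", "Mandy", "Patrick", "Lars", "Iris"),
  stringsAsFactors = FALSE
)

# match() looks for exact equality, so a partial id is never found
match("143115551234", test_df$id)   # NA

# grepl() returns TRUE wherever the pattern occurs inside the string
targets <- c("143537222999", "143115551234")   # hypothetical helper name
hit <- grepl(paste(targets, collapse = "|"), test_df$id)

# Logical indexing changes only the matching rows
test_df$poster[hit] <- "User"
test_df$poster
```

If the number must appear at the very start of the id, startsWith(test_df$id, "143115551234") would be a stricter test than grepl().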

Related

Find differences between 2 dataframes with different lengths

I have two dataframes with each two columns c("price", "size") with different lengths.
Each price must be linked to its size. These are two lists of trade orders. I have to find the differences between the two dataframes, knowing that each one can contain orders that the other doesn't have. I would like one output with the differences, or two outputs, it doesn't matter. But I need the row numbers in the output so I can find where the differences are in the series.
Here is sample data :
> out
price size
1: 36024.86 0.01431022
2: 36272.00 0.00138692
3: 36272.00 0.00277305
4: 36292.57 0.05420000
5: 36292.07 0.00403948
---
923598: 35053.89 0.30904890
923599: 35072.76 0.00232000
923600: 35065.60 0.00273000
923601: 35049.36 0.01760000
923602: 35037.23 0.00100000
>bit
price size
1: 37279.89 0.01340020
2: 37250.84 0.00930000
3: 37250.32 0.44284049
4: 37240.00 0.00056491
5: 37215.03 0.99891906
---
923806: 35053.89 0.30904890
923807: 35072.76 0.00232000
923808: 35065.60 0.00273000
923809: 35049.36 0.01760000
923810: 35037.23 0.00100000
For example, I need to know if the first row of the database out is in the database bit.
I've tried several functions. First comparedf():
summary(comparedf(bit, out, by = c("price","size")))
but I've got error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, :
I've tried compare_df() :
compareout=compare_df(out,bit,c("price","size"))
But I know the results are wrong: I get only 23 results, and I know there are at least 200 differences.
I've also tried the match() and which() functions, but they don't give the results I'm looking for.
If you know any other methods, I'd be glad to try them.
Perhaps you could just do inner_join on out and bit by price and size? But first make id variable for both data.frame's
library(dplyr)
out$id <- 1:nrow(out)
bit$id <- 1:nrow(bit)
joined <- inner_join(bit, out, by = c("price", "size"))
Now we can check which id from out and bit are not present in joined table:
id_from_bit_not_included_in_out <- bit$id[!bit$id %in% joined$id.x]
id_from_out_not_included_in_bit <- out$id[!out$id %in% joined$id.y]
And these ids are the rows not included in out or bit, i.e. variable id_from_bit_not_included_in_out contains rows present in bit, but not in out and variable id_from_out_not_included_in_bit contains rows present in out, but not in bit
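The same idea can be sketched without a join, in base R only: build one text key per row and test membership with %in%. This is a rough sketch under stated assumptions: the helper make_key is hypothetical, the small tables below are stand-ins for the real data, and the key format assumes prices have 2 decimals and sizes have 8, as in the sample. Note that %in% ignores how many times a row occurs, so a duplicated order counts as present if it appears at least once in the other table.

```r
# Small stand-in tables; the real ones have about 920k rows each
out <- data.frame(price = c(36024.86, 36272.00, 35053.89, 35072.76),
                  size  = c(0.01431022, 0.00138692, 0.30904890, 0.00232000))
bit <- data.frame(price = c(37279.89, 35053.89, 35072.76),
                  size  = c(0.01340020, 0.30904890, 0.00232000))

# Hypothetical helper: one fixed-format text key per row, so that equal
# price/size pairs always produce the same key in both tables
make_key <- function(d) sprintf("%.2f_%.8f", d$price, d$size)

# Row numbers of orders that have no counterpart in the other table
only_in_out <- which(!make_key(out) %in% make_key(bit))
only_in_bit <- which(!make_key(bit) %in% make_key(out))

only_in_out   # rows 1 and 2 of out
only_in_bit   # row 1 of bit
```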
First attempt here. It will be difficult to do a very clean job with this data tho.
The data I used:
out <- read.table(text = "price size
36024.86 0.01431022
36272.00 0.00138692
36272.00 0.00277305
36292.57 0.05420000
36292.07 0.00403948
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
bit <- read.table(text = "price size
37279.89 0.01340020
37250.84 0.00930000
37250.32 0.44284049
37240.00 0.00056491
37215.03 0.99891906
37240.00 0.00056491
37215.03 0.99891906
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
Assuming purely that row 1 of out should match with row 1 of bit a simple solution could be:
df <- cbind(distinct(out), distinct(bit))
names(df) <- make.unique(names(df))
However judging from the data you have provided I am not sure if this is the way to go (big differences in the first few rows) so maybe try sorting the data first?:
df <- cbind(distinct(out[order(out$price, out$size),]), distinct(bit[order(bit$price, bit$size),]))
names(df) <- make.unique(names(df))

R: Replace all Values that are not equal to a set of values

All.
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 thru Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues with and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing... if it weren't for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out and copy/pasted the text 'Supported Customer' over what remained.
I've tried replace and str_replace_all using grepl and paste/paste0, but to no avail. My current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer, paste0("[^", paste(out, collapse =
"|"), "]"), "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all customers that don't match the list of out customers. Something like this:
in <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. With this many patterns, paste no longer collapses them using the "|" operator. Instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData,matrix(runif(5000*30,0,100),ncol=30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
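The gsub() call above keeps the customer number and only changes the prefix. If every supported customer should instead carry the single label 'Supported Customer', as the question describes, a plain %in% test with logical indexing may be enough. The sketch below uses base R and invented toy data (the amount column is an assumption, not part of the original table). It also avoids the regex problem from the question: "[^...]" is a character class that negates single characters, not a way to negate a list of alternatives.

```r
# Toy data standing in for the 1.3M-row table
orderData <- data.frame(
  customer = c("Customer1", "Customer123", "Customer54", "Customer124", "Customer1"),
  amount   = c(10, 20, 30, 40, 50),   # hypothetical extra column
  stringsAsFactors = FALSE
)
out <- c("Customer123", "Customer124")   # customers with issues

# %in% compares whole strings, so no regular expression is needed
supported <- !(orderData$customer %in% out)

# Overwrite only the rows that are not in the problem list
orderData$customer[supported] <- "Supported Customer"
orderData$customer
```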

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-PPS$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
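The same error message also appears when you negate a character column, because unary minus is not defined for strings. Note that rev() is not a reliable workaround: it reverses the positions of the elements rather than the sort direction, so it only looks right when the column happens to be sorted already. Wrapping the column in xtfrm(), which converts it to sortable numbers, lets you keep the minus sign:
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(-xtfrm(column_a), column_b)),]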
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
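In the last sort, emps$EV doesn't exist. emps was built with aggregate(EM~player, ...), so it only has the columns player and EM; emps$EV is therefore NULL, and negating NULL raises the error.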
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)
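A small self-contained sketch of what goes wrong, using made-up numbers rather than the basketball data: a column that is missing from a data frame comes back as NULL, and unary minus on NULL raises exactly this error. Passing decreasing = TRUE to order() is an alternative that avoids the minus sign altogether.

```r
# Toy aggregate with the same column names as emps in the question
emps <- data.frame(player = c("A", "B", "C"),
                   EM = c(0.10, -0.05, 0.30),
                   stringsAsFactors = FALSE)

# There is no EV column, so $ silently returns NULL
is.null(emps$EV)

# Negating NULL fails; tryCatch() captures the message instead of stopping
res <- tryCatch(-emps$EV, error = function(e) conditionMessage(e))
res

# Sorting in descending order on the column that does exist
sort3 <- emps[order(emps$EM, decreasing = TRUE), ]
sort3$player
```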

Does R's mlogit deal with multi-sized set of unique alternatives?

Customers are faced with series of choices of unique alternatives. For example:
Customer 1 chooses between alternatives A1,A2,A3,A4;
Customer 1 chooses between alternatives B1,B2;
Customer 2 chooses between alternatives C1,C2,C3;
Customer 2 chooses between alternatives D1,D2,D3
etc
Can any R package deal with it?
For example.csv
CHOICE,CUSTOMER,CHID,ALT,DIST,INCOME
1,1,1,1,100,70
0,1,1,2,372,70
0,1,1,3,417,70
0,1,1,4,180,70
1,1,2,1,68,70
0,1,2,2,354,70
0,2,3,1,68,100
0,2,3,2,354,100
1,2,3,3,399,100
1,2,4,1,180,100
0,2,4,2,100,100
0,2,4,3,80,100
mlogit failed
> library(mlogit)
> ExData <- read.csv("C:/Users/John/Desktop/R/example.csv")
> MLData <- mlogit.data(ExData, shape="long", choice="CHOICE", chid.var="CHID", id.var="CUSTOMER", alt.var="ALT")
> res <- mlogit(CHOICE~DIST|INCOME, data=MLData, shape="long", alt.var="ALT")
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(chid, alt, sep = ".")) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘2.1’, ‘2.2’, ‘3.3’
while for example1.csv where the number of alternatives is constant
CHOICE,CUSTOMER,CHID,ALT,DIST,INCOME
1,1,1,1,100,70
0,1,1,2,372,70
0,1,1,3,417,70
0,1,2,1,180,70
1,1,2,2,68,70
0,1,2,3,354,70
0,2,3,1,68,100
0,2,3,2,354,100
1,2,3,3,399,100
1,2,4,1,180,100
0,2,4,2,100,100
0,2,4,3,80,100
everything is fine.

replacing a value in column X based on columns Y with R

I've gone through several answers and tried the following, but each attempt yields either an error or an unwanted result.
Here's the data:
Network Campaign
Moburst_Chartboost Test Campaign
Moburst_Chartboost Test Campaign
Moburst_Appnext unknown
Moburst_Appnext 1065
I'd like to replace "Test Campaign" with "1055" whenever "Network" == "Moburst_Chartboost". I realize this should be very simple, but I tried these:
dataset = read.csv('C:/Users/User/Downloads/example.csv')
for( i in 1:nrow(dataset)){
if(dataset$Network == 'Moburst_Chartboost') dataset$Campaign <- '1055'
}
This yields warnings:
Warning messages:
1: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
2: In if (dataset$Network == "Moburst_Chartboost") dataset$Campaign <- "1055" :
the condition has length > 1 and only the first element will be used
etc.
then i tried:
within(dataset, {
dataset$Campaign <- ifelse(dataset$Network == 'Moburst_Chartboost', '1055', dataset$Campaign)
})
This turned ALL 4 values in column "Campaign" into "1055", overwriting what was there even when the condition isn't met.
also this:
dataset$Campaign[which(dataset$Network == 'Moburst_Chartboost')] <- 1055
yields this warning, and replaced the values in the first two rows of "Campaign" with NA:
Warning message:
In `[<-.factor`(`*tmp*`, which(dataset$Network == "Moburst_Chartboost"), :
invalid factor level, NA generated
scratching my head here. new to R but this shouldn't be so hard :(
In your first attempt, you loop over the rows but never use the index i, so each if() tests the whole Network column at once and the assignment overwrites the whole Campaign column.
In your second, you end up assigning the value "1055" to all of the 2nd column.
The way to think about it is as an if else: if the condition in col 1 is met, col 2 is changed, otherwise it remains the same.
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
"Moburst_Appnext", "Moburst_Appnext"),
Campaign = c("Test Campaign", "Test Campaign",
"unknown", "1065"),
stringsAsFactors = FALSE)
dataset$Campaign <- ifelse(dataset$Network == "Moburst_Chartboost",
"1055",
dataset$Campaign)
head(dataset)
Network Campaign
1 Moburst_Chartboost 1055
2 Moburst_Chartboost 1055
3 Moburst_Appnext unknown
4 Moburst_Appnext 1065
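You may also try dataset$Campaign[dataset$Campaign=="Test Campaign"]<-1055 to avoid the use of loops and ifelse statements. Note that this replaces every "Test Campaign" regardless of the Network value, and that it assumes Campaign is a character column rather than a factor.
Where dataset is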
dataset <- data.frame(Network = c("Moburst_Chartboost", "Moburst_Chartboost",
"Moburst_Appnext", "Moburst_Appnext"),
Campaign = c("Test Campaign", "Test Campaign",
"unknown", 1065))
Try the following
dataset = read.csv('C:/Users/User/Downloads/example.csv', stringsAsFactors = F)
for( i in 1:nrow(dataset)){
if(dataset$Network[i] == 'Moburst_Chartboost') dataset$Campaign[i] <- '1055'
}
It seems you forgot the index variable. Without [i] you work on the whole column of the data frame, which results in the warning you mentioned.
Note that I added stringsAsFactors = F to the read.csv() call to make sure the strings are read as strings and not as factors. With factors, the assignment would result in a warning like this
In `[<-.factor`(`*tmp*`, i, value = c(NA, 2L, 3L, 1L)) :
invalid factor level, NA generated
Alternatively you can do the following without using a for loop:
idx <- which(dataset$Network == 'Moburst_Chartboost')
dataset$Campaign[idx] <- '1055'
Here, idx is a vector containing the positions where Network has the value 'Moburst_Chartboost'
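To tie this back to the "invalid factor level" warning from the question: if Campaign is already a factor, converting it to character before assigning is one way around it. The sketch below is base R only and recreates the factor column by hand, since the original CSV isn't available. It also checks both conditions from the question, the network and the "Test Campaign" placeholder, which is an assumption about the intended behavior.

```r
# Recreate the situation from the question: Campaign stored as a factor
dataset <- data.frame(
  Network  = c("Moburst_Chartboost", "Moburst_Chartboost",
               "Moburst_Appnext", "Moburst_Appnext"),
  Campaign = factor(c("Test Campaign", "Test Campaign", "unknown", "1065")),
  stringsAsFactors = FALSE
)

# "1055" is not an existing level, so convert the factor to plain strings first
dataset$Campaign <- as.character(dataset$Campaign)

# Require both conditions: the right network and the placeholder campaign
idx <- dataset$Network == "Moburst_Chartboost" &
       dataset$Campaign == "Test Campaign"
dataset$Campaign[idx] <- "1055"
dataset$Campaign
```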
Thank you for the help! Not elegant, but since this lingered with me when going to sleep last night, I decided to try to bludgeon this with some ugly code, and it worked too, just as a workaround... I separated the data into two data frames, replaced all values in one and then bound them back together...
# subsetting only chartboost
chartboost <- subset(dataset, dataset$Network=='Moburst_Chartboost')
# replace all values in Campaign
chartboost$Campaign <-sub("^.*", "1055",chartboost$Campaign)
#subsetting only "not chartboost"
notChartboost <-subset(dataset, dataset$Network!='Moburst_Chartboost')
# binding back to single dataframe
newSet <- rbind(chartboost, notChartboost)
Ugly as a duckling but worked :)
