I have a dataframe 'likes' that looks like this:
uid Likes
123 Harry Potter
123 Fitness
123 Muesli
123 Fanta
123 Nokia
321 Harry Potter
321 Muesli
455 Harry Potter
455 Muesli
699 Muesli
123 Belgium
Furthermore, I have a vector of strings, for example: WhatLikes <- c("Harry Potter","Muesli")
I want a vector of the uids that 'like' Harry Potter OR Muesli. Note that the real WhatLikes is much longer than this example.
The solution should thus be a vector that contains 123, 321, 455, 699.
Help me out! Thanks!
We can use %in% to get a logical index of the elements of 'Likes' that are found in 'WhatLikes', use it to pick the corresponding 'uid' values from the dataset (called df1 here), and apply unique to remove the duplicate uids.
unique(df1$uid[df1$Likes %in% WhatLikes])
#[1] 123 321 455 699
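As a self-contained sketch of the same idea, the snippet below rebuilds the example data under the question's name likes (the answer calls it df1) and stores the result in matching_uids, a name chosen here only for illustration. It assumes Likes is a plain character column.

```r
# Rebuild the example data from the question
likes <- data.frame(
  uid = c(123, 123, 123, 123, 123, 321, 321, 455, 455, 699, 123),
  Likes = c("Harry Potter", "Fitness", "Muesli", "Fanta", "Nokia",
            "Harry Potter", "Muesli", "Harry Potter", "Muesli", "Muesli",
            "Belgium"),
  stringsAsFactors = FALSE
)
WhatLikes <- c("Harry Potter", "Muesli")

# TRUE for every row whose Likes value appears in WhatLikes
keep <- likes$Likes %in% WhatLikes

# uids of the matching rows, with duplicates dropped
matching_uids <- unique(likes$uid[keep])
print(matching_uids)
# [1] 123 321 455 699
```

Because %in% is vectorised over both sides, this should scale to a much longer WhatLikes without any loop.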
Here I have the following dataframe df in R.
kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685
I want to replace the incorrectly written values in industry (Apparel, APPEARELS, apparels) with Apparels.
I tried creating a vector of the variants and running it through a loop.
l<-c('Apparel ','APPEARELS','apparels')
for(i in range(1:3)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
It is not working: only one element changes.
But when I run the statement individually, it does not raise an error and it works.
df$industry<-gsub(pattern=l[1],"Apparels",df$industry)
But this is a large dataset, so I need this to work in R. Please help.
Use sub without a loop, joining the patterns with | :
l <- c("Apparel" , "APPEARELS", "apparels")
# Using OPs data
sub(paste(l, collapse = "|"), "Apparels", df$industry)
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT" "IT"
I'm using sub instead of gsub as there's only one occurrence of pattern in a string (at least in example).
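One caveat with an unanchored alternation: a value that is already spelled "Apparels" also matches the "Apparel" branch and would become "Apparelss". A minimal sketch of one way to guard against that is to anchor the pattern so only whole values match. The vector industry and the extra, already-correct "Apparels" entry are assumptions added for illustration and are not part of the OP's data.

```r
# Example values; the last one is already correct (illustrative addition)
industry <- c("Apparel", "APPEARELS", "apparels", "Airlines", "IT", "Apparels")
l <- c("Apparel", "APPEARELS", "apparels")

# ^( ... )$ forces each alternative to match the whole string
pattern <- paste0("^(", paste(l, collapse = "|"), ")$")
cleaned <- sub(pattern, "Apparels", industry)
print(cleaned)
# [1] "Apparels" "Apparels" "Apparels" "Airlines" "IT"       "Apparels"

# Equivalent without regular expressions: exact matching via %in%
cleaned2 <- industry
cleaned2[cleaned2 %in% l] <- "Apparels"
```

The %in% variant avoids regex metacharacter issues entirely, at the cost of matching exact values only.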
While range returns a sequence in Python, it returns the minimum and maximum of a vector in R:
range(1:3)
# [1] 1 3
Instead, you could use 1:3 or seq(1,3) or seq_along(l), which all return
# [1] 1 2 3
Also note the difference between 'Apparel' and 'Apparel '.
So
df<-read.table(header=T, text="kyid industry amount
112 Apparel 345436
234 APPEARELS 234567
213 apparels 345678
345 Airlines 235678
123 IT 456789
124 IT 897685")
l<-c('Apparel','APPEARELS','apparels')
for(i in seq_along(l)){
df$industry<-gsub(pattern=l[i],"Apparels",df$industry)
}
df
# kyid industry amount
# 1 112 Apparels 345436
# 2 234 Apparels 234567
# 3 213 Apparels 345678
# 4 345 Airlines 235678
# 5 123 IT 456789
# 6 124 IT 897685
Here is my data frame "crime":
District Premise Weapon
313 99 99
316 NA
314 20 99
312 13 40
312 9 99
I have a separate list of what all the codes mean. For example, 99 in Premise means "residence" and 20 means "Street"; 99 in Weapon means "hands" and 40 means "blunt object".
In another post on Stack Overflow, I found the following code, which works for my purpose:
crime$Premise[crime$Premise == 13] <- "House"
This worked, but I realized I have 30 different codes in Premise and Weapon. There has to be a more efficient way than copying and pasting the line above many times and replacing the integer and the string each time.
Note: 99 means one thing under Premise and something else under Weapon.
What is the best way to write this so that I can replace all the numbers with their corresponding labels? Thank you in advance!
If you don't want to build an index vector, you can use recode from the dplyr package:
library(dplyr)
x <- data.frame(district=c(313,316,314), premise=c(99,NA,20), Weapon=c(99,"",99))
x
#   district premise Weapon
# 1      313      99     99
# 2      316      NA
# 3      314      20     99
# Map each numeric code to its label; NA stays NA
x$premise <- recode(x$premise, "99"="residence", "20"="Street")
x
#   district   premise Weapon
# 1      313 residence     99
# 2      316      <NA>
# 3      314    Street     99
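For comparison, here is a rough base R sketch of the index-vector approach mentioned above: a named character vector per column acts as the lookup table. The names premise_codes and weapon_codes are made up for illustration, the blank Weapon entry is assumed to be NA, and only the code meanings given in the question are included, so any other code (such as Premise 9) comes out as NA.

```r
# Example data from the question (blank Weapon value assumed to be NA)
crime <- data.frame(
  District = c(313, 316, 314, 312, 312),
  Premise  = c(99, NA, 20, 13, 9),
  Weapon   = c(99, NA, 99, 40, 99)
)

# One lookup table per column, since 99 means different things in each
premise_codes <- c("99" = "residence", "20" = "Street", "13" = "House")
weapon_codes  <- c("99" = "hands", "40" = "blunt object")

# Index the lookup by the code converted to character; unknown codes give NA
crime$Premise <- unname(premise_codes[as.character(crime$Premise)])
crime$Weapon  <- unname(weapon_codes[as.character(crime$Weapon)])
print(crime)
```

With 30 codes per column, the lookup vectors could also be built from the separate code list instead of being typed by hand.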
I have a data frame (df) in R that has 200 columns and 18 rows. The column names are names of people, and the row names are years (formatted "X2015", "X2016", etc.). The values in the data frame are the number of grades received by the individual in a given year. For example:
Jen Fred Alex John
X2010 55 265 436 409
X2011 54 261 456 417
X2012 54 263 494 415
X2013 52 253 526 419
X2014 52 250 556 426
I am trying to determine what years Alex received more than 500 grades.
So far, I have tried the following, none of which have worked:
subset(df,select="Alex", df$Alex>500)
df[df$Alex>500,]
Along with many variations of these. Any help or suggestions would be appreciated!
Use:
row.names(df)[df$Alex > 500]
or
row.names(df[df$Alex > 500,])
Both will return the row names that fulfill the condition that Alex got more than 500.
If you want to obtain only the number of the years (without the "X") do:
Xyear <- row.names(df[df$Alex > 500,])
year <- substring(Xyear,2)
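Put together as a runnable sketch on the example data, this might look as follows. The names over_500 and years are illustrative, and wrapping the condition in which() is an added assumption so that any NA values in the column are skipped rather than returned as NA row names.

```r
# Rebuild the example data with the years as row names
df <- data.frame(
  Jen  = c(55, 54, 54, 52, 52),
  Fred = c(265, 261, 263, 253, 250),
  Alex = c(436, 456, 494, 526, 556),
  John = c(409, 417, 415, 419, 426),
  row.names = c("X2010", "X2011", "X2012", "X2013", "X2014")
)

# Row names where Alex has more than 500; which() drops NA comparisons
over_500 <- row.names(df)[which(df$Alex > 500)]
print(over_500)
# [1] "X2013" "X2014"

# Strip the leading "X" and convert to integer years
years <- as.integer(substring(over_500, 2))
print(years)
# [1] 2013 2014
```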
I have the following lines:
123 abcd 456 xyz
123 abcd 678 xyz
234 egfs 434 ert
345 fggfgf 456 455 rty
234 egfs 422 ert 33
Here, lines are considered duplicates if their first field is the same. In the example above, 123 appears as the first field of 2 lines, so those lines are duplicates (even though they differ in a field in the middle). Similarly, the lines starting with 234 are duplicates.
I need to remove these duplicate lines.
Since they aren't 100% duplicates, sort u doesn't work. Does anyone know how I can delete these duplicate lines?
This is a very easy task for awk, so I would do it with awk. In Vim, you can do:
% !awk '\!a[$1]++'
which gives:
123 abcd 456 xyz
234 egfs 434 ert
345 fggfgf 456 455 rty
If you do it in the shell, you don't have to escape the !:
awk '!a[$1]++' file
g/\%(^\1\>.*$\n\)\@<=\(\k\+\).*$/d
This is easy with my PatternsOnText plugin. It lets you specify a pattern that is ignored for the duplicate check; in your case, that would be everything after the first (space-delimited) field:
%DeleteDuplicateLinesIgnoring / .*/
I'm trying to create a new vector in R using an 'if' function to pull out only certain values for the new array. Basically, I want to segregate data by day of the week for each of several cities. How do I use the apply function to get only, say, Tuesdays in a new array for each city? Thanks
It sounds as though you don't want if or apply at all. The solution is simpler:
Suppose that your data frame is data. Then subset(data, Weekday == 3) should work.
You don't want to use the R if. Instead, use the subsetting function [ :
dat <- read.table(text=" Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header=TRUE)
dat[ dat$Weekday==3, ]
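To get only the city columns for the chosen day, or one table per weekday, a sketch along these lines may help. It assumes, as in the answers above, that Weekday code 3 marks the day of interest; the names cities, tuesdays and by_weekday are illustrative.

```r
dat <- read.table(text = " Date Weekday Holiday Atlanta Chicago Houston Tulsa
1 1/1/2008 3 1 313 313 361 123
2 1/2/2008 4 0 735 979 986 310
3 1/3/2008 5 0 690 904 950 286
4 1/4/2008 6 0 610 734 822 281
5 1/5/2008 7 0 482 633 622 211
6 1/6/2008 1 0 349 421 402 109", header = TRUE)

cities <- c("Atlanta", "Chicago", "Houston", "Tulsa")

# Rows for weekday 3, restricted to the city columns
tuesdays <- dat[dat$Weekday == 3, cities]
print(tuesdays)

# One data frame per weekday code, e.g. by_weekday[["3"]]
by_weekday <- split(dat[cities], dat$Weekday)
print(names(by_weekday))
```

split avoids repeating the subsetting call once per weekday when every day is needed.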