I would like to know how I can score my dataframe based on values found with grep().
Say I got a DF Containing this:
age=c("France","Mars","Jupitor","Moon","Sun","Afrika","Texas","Michigan","Washington","Kiev","Amsterdam","Norway")
height=c("Paris","Planet","Planet","COLD","HOT!","LIONS","Austin","Lansing","WashingtonDC","Ukrain","Holland","Oslo")
village=data.frame(age=age,height=height)
and I use grep('Moon',village$age, ignore.case=TRUE) to search which row it is on.
How can you add a column in front of age, to score it with in example, the number 1,
if I use grep('FRANCE',village$age, ignore.case=TRUE) to score it with the number 2?
You didn't specify what the non-found "scores" should be, so the following just uses NA's:
age <- c("France","Mars","Jupitor","Moon","Sun","Afrika",
"Texas","Michigan","Washington","Kiev","Amsterdam","Norway")
height <- c("Paris","Planet","Planet","COLD","HOT!","LIONS",
"Austin","Lansing","WashingtonDC","Ukrain","Holland","Oslo")
village <- data.frame(score=NA, age=age, height=height)
print(village)
## score age height
## 1 NA France Paris
## 2 NA Mars Planet
## 3 NA Jupitor Planet
## 4 NA Moon COLD
## 5 NA Sun HOT!
## 6 NA Afrika LIONS
## 7 NA Texas Austin
## 8 NA Michigan Lansing
## 9 NA Washington WashingtonDC
## 10 NA Kiev Ukrain
## 11 NA Amsterdam Holland
## 12 NA Norway Oslo
village[grep('moon', village$age, ignore.case=TRUE),]$score <- 1
village[grep('france', village$age, ignore.case=TRUE),]$score <- 2
print(village)
## score age height
## 1 2 France Paris
## 2 NA Mars Planet
## 3 NA Jupitor Planet
## 4 1 Moon COLD
## 5 NA Sun HOT!
## 6 NA Afrika LIONS
## 7 NA Texas Austin
## 8 NA Michigan Lansing
## 9 NA Washington WashingtonDC
## 10 NA Kiev Ukrain
## 11 NA Amsterdam Holland
## 12 NA Norway Oslo
Related
I have the following data which is of a panel structure. I need to normalize each cell so that the observation for a country is divided by total number of observations for that country divided by total number of observations in the panel structure (here 10 - in my data 1100). Also I have showcased three countries (AL, UK, FR) but I have 92 in total so I need some general formula (mutate: by = country?).
This is my data
df1 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2,3,2,3,2,3,2,NA,1,2,1,2,1,2,1,2,1,2,NA,NA,NA,NA,NA,NA,NA,NA,4,NA))
df1
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 2
4 AL 3
5 AL 2
6 AL 3
7 AL 2
8 AL 3
9 AL 2
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 4
30 FR NA
Now, what I want is to divide each cell with number of observations available for each country / total obs like so,
df2 <- data_frame(Country =
c("AL","AL","AL","AL","AL","AL","AL","AL","AL","AL",
"UK","UK","UK","UK","UK","UK","UK","UK","UK","UK",
"FR","FR","FR","FR","FR","FR","FR","FR","FR","FR"),
Obs = c(NA,NA,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,3*7/10,2*7/10,
NA,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,2*10/10,1*10/10,
2*10/10,1*10/10,2*10/10,NA,NA,NA,NA,NA,NA,NA,NA,4*1/10,NA))
df2
Country Obs
<chr> <dbl>
1 AL NA
2 AL NA
3 AL 1.4
4 AL 3.7
5 AL 2.7
6 AL 3.7
7 AL 2.7
8 AL 3.7
9 AL 2.7
10 AL NA
11 UK 1
12 UK 2
13 UK 1
14 UK 2
15 UK 1
16 UK 2
17 UK 1
18 UK 2
19 UK 1
20 UK 2
21 FR NA
22 FR NA
23 FR NA
24 FR NA
25 FR NA
26 FR NA
27 FR NA
28 FR NA
29 FR 0.4
30 FR NA
I am interested in solving the problem obviously BUT I would really really appreciate it if you could show me how to do this for multiple columns as my original data needs this same operation done for many columns where the country tickers (AL, UK, FR in example) remains the same.
You can do :
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * sum(!is.na(Obs))/n()) %>%
ungroup
# Country Obs
# <chr> <dbl>
# 1 AL NA
# 2 AL NA
# 3 AL 1.4
# 4 AL 2.1
# 5 AL 1.4
# 6 AL 2.1
# 7 AL 1.4
# 8 AL 2.1
# 9 AL 1.4
#10 AL NA
# … with 20 more rows
sum(!is.na(Obs)) counts number of non-NA values in the Country whereas n() gives the number of rows for the Country.
For multiple columns -
df1 %>%
group_by(Country) %>%
mutate(across(col1:col4, ~. * sum(!is.na(.))/n())) %>%
ungroup
This will be applied to col1 to col4 in your dataframe.
Using data.table
library(data.table)
setDT(df1)[, Obs := Obs * mean(!is.na(Obs)), County]
Or using dplyr
library(dplyr)
df1 %>%
group_by(Country) %>%
mutate(Obs = Obs * mean(!is.na(Obs)))
I want to find duplicates horizontally and keeping the uniques. Please help me with this.
I am sharing a sample dataset. Hope this helps.
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
TableTest <- data.frame(X,Y,Z,S)
TableTest
Input dataset
X Y Z S
1 1 India India India
2 2 India India France
3 3 Philippines Netherlands Netherlands
4 4 Netherlands France France
5 5 France France India
Expected Output
X Y Z S
1 1 India NA NA
2 2 India France NA
3 3 Philippines Netherlands NA
4 4 Netherlands France NA
5 5 France India NA
Please help.
TableTest[,-1] <- as.data.frame(t(apply(TableTest[,-1], 1, function(a) { a <- replace(a, duplicated(a), NA_character_); a[ order(is.na(a)) ]; })))
TableTest
# X Y Z S
# 1 1 India <NA> <NA>
# 2 2 India France <NA>
# 3 3 Philippines Netherlands <NA>
# 4 4 Netherlands France <NA>
# 5 5 France India <NA>
Another base R option
TableTest[-1] <- do.call(rbind,lapply(apply(TableTest[-1],1,unique),`length<-`,ncol(TableTest)-1))
or a simpler version (thanks for advice by #Onyambu in the comments)
TableTest[-1] <- t(apply(TableTest[-1], 1, function(x)`length<-`(unique(x),ncol(TableTest[-1]))))
which gives
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
My solution:
TableTest[2:4] <- as.data.frame(t(apply(TableTest[2:4], 1, function(x) {
xo <- ifelse(!duplicated(x), x, NA_character_)
if (any(is.na(xo))) xo <- xo[!is.na(xo)]
length(xo) <- ncol(TableTest) - 1
xo
})))
Output
> TableTest
X Y Z S
1 1 India <NA> <NA>
2 2 India France <NA>
3 3 Philippines Netherlands <NA>
4 4 Netherlands France <NA>
5 5 France India <NA>
I don't think you can do it by only using data.frames, because you're moving values across columns. But here's one way to do it using matrices:
X <- c(1,2,3,4,5)
Y <- c("India","India","Philippines","Netherlands","France")
Z <- c("India","India","Netherlands","France","France")
S <- c("India","France","Netherlands","France","India")
output <- apply(cbind(Y,Z,S), 1, function(row) {
rm_dup <- unique(row)
return(c(rm_dup, rep(NA_character_,
3 - length(rm_dup))))
})
t(output)
[,1] [,2] [,3]
[1,] "India" NA NA
[2,] "India" "France" NA
[3,] "Philippines" "Netherlands" NA
[4,] "Netherlands" "France" NA
[5,] "France" "India" NA
I have this data frame that goes something similar to the following.
Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA
and so on.
The data frame like this repeats with different titles, and has a number appear in 'Number' in 2004 and 'Allocation' in 2001.
How would I go about changing the data so it turns into something a single row of the data frame
Code Title Number Allocation
1000 Jack 113 6
1001 Dave 101 19
This also works:
library(dplyr)
df %>%
select(-Year) %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct()
or:
df %>%
group_by(Code, Title) %>%
mutate_all(funs(sort(.))) %>%
distinct(Code, Title, Number, Allocation)
Result:
# A tibble: 2 x 4
# Groups: Code, Title [2]
Code Title Number Allocation
<int> <fctr> <int> <int>
1 1000 Jack 113 6
2 1001 Dave 101 19
Data:
df = read.table(text=" Code Title Year Number Allocation
1000 Jack 2001 NA 6
1000 Jack 2002 NA NA
1000 Jack 2003 NA NA
1000 Jack 2004 113 NA
1000 Jack 2005 NA NA
1001 Dave 2001 NA 19
1001 Dave 2002 NA NA
1001 Dave 2003 NA NA
1001 Dave 2004 101 NA
1001 Dave 2005 NA NA", header = TRUE)
My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is to calculate the growth rate from the first quarter each year to the first quarter the second year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A, in the USA, it would be (24/20)-1=0.2
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But didn't have the r-skills to get it to work if the lag is more then one time-unit. Any suggestions?
ADDITION
So what i want is the growth-rate, that is (24/20)-1=0.2 in the example below. Not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (quarter / 5-period-lagged quarter) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
df3 = copy(df)
df3$Quarter = df3$Quarter - 4
df = merge(df,df3,c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
>
I am a beginner with R so i apologise in advance if the question was asked elsewhere. Here is my issue:
I have two data frames, df1 and df2, with different number of rows and columns. The two frames have only one variable (column) in common called "customer_no". I want the merged frame to match records based on "customer_no" and by rows in df2 only.Both data.frames have multiple rows for each customer_no.
I tried the following:
merged.df <- (df1, df2, by="customer_no",all.y=TRUE)
The problem is that this assigns values of df1 to df2 where instead it should be empty. My questions are:
1) How can I tell the command to leave the unmatched columns empty?
2) How can I see from the merged file which row came from which df? I guess if I resolve the above question this should be easy to see by the empty columns.
I am missing something in my command but don't know what. If the question has been answered somewhere else, would you be still kind enough to rephrase it in English here for an R beginner?
Thanks!
Data example:
df1:
customer_no country year
10 UK 2001
10 UK 2002
10 UK 2003
20 US 2007
30 AU 2006
df2:
customer_no income
10 700
10 800
10 900
30 1000
Merged file should look like this:
merged.df:
customer_no income country year
10 UK 2001
10 UK 2002
10 UK 2003
10 700
10 800
10 900
30 AU 2006
30 1000
So:
It puts the columns all together, it adds the values of df2 right after the last one of df1 based on same customer_no and matches only customer_no from df2 (merged.df does not have customer_no 20). Also, it leaves empty all the other cells.
In STATA I use append but not sure in R...perhaps join?
THANKS!!
Try:
df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")
res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
# customer_no country year income
#1 10 UK 2001 NA
#2 10 UK 2002 NA
#3 10 UK 2003 NA
#4 10 <NA> NA 700
#5 10 <NA> NA 800
#6 10 <NA> NA 900
#8 30 AU 2006 NA
#9 30 <NA> NA 1000
If you want to change NA to '',
res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
Or, use rbindlist from data.table (Using the original datasets)
library(data.table)
indx <- df1$customer_no %in% df2$customer_no
rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]
# customer_no country year income
#1: 10 UK 2001 NA
#2: 10 UK 2002 NA
#3: 10 UK 2003 NA
#4: 10 NA NA 700
#5: 10 NA NA 800
#6: 10 NA NA 900
#7: 30 AU 2006 NA
#8: 30 NA NA 1000
You could also use the smartbind function from the gtools package.
require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
# customer_no country year income
# 1:1 10 UK 2001 NA
# 1:2 10 UK 2002 NA
# 1:3 10 UK 2003 NA
# 2:1 10 <NA> NA 700
# 2:2 10 <NA> NA 800
# 2:3 10 <NA> NA 900
# 1:4 30 AU 2006 NA
# 2:4 30 <NA> NA 1000
Try:
df1$income = df2$country = df2$year = NA
rbind(df1, df2)
customer_no country year income
1 10 UK 2001 NA
2 10 UK 2002 NA
3 10 UK 2003 NA
4 20 US 2007 NA
5 30 AU 2006 NA
6 10 <NA> NA 700
7 10 <NA> NA 800
8 10 <NA> NA 900
9 30 <NA> NA 1000