Merging 2 files without a common key - r

I have 2 files. One is a time_file which has 3000 rows and the other is userid file which has 2000 rows. I want to merge the two, so that each row (ID) in the userid file is paired with the full data from each row of the time_file.
Rows 1-3000 would show the first userid with each of the dates.
Rows 3001-6000 would show the 2nd userid with each of the dates, and so on.
Thanks in advance!
Time file
mo day year date
11 1 2015 11/1/2015
11 2 2015 11/2/2015
11 3 2015 11/3/2015
11 4 2015 11/4/2015
11 5 2015 11/5/2015
.
.
userid file
userid
154
155
157
158
159
160
.
.
Ideal format(what I want)
mo day year date userid
11 1 2015 11/1/2015 154
11 2 2015 11/2/2015 154
11 3 2015 11/3/2015 154
11 4 2015 11/4/2015 154
11 5 2015 11/5/2015 154
.
.
3 28 2017 3/28/2017 154
3 29 2017 3/29/2017 154
3 30 2017 3/30/2017 154
3 31 2017 3/31/2017 154
11 1 2015 11/1/2015 155
11 2 2015 11/2/2015 155
11 3 2015 11/3/2015 155
11 4 2015 11/4/2015 155
11 5 2015 11/5/2015 155
11 6 2015 11/6/2015 155

The easiest solution in R I can think of, assuming your time data is in a data frame (date_df) and your user data is in a vector (user):
final_df <- cbind(date_df, "userid" = rep(user, each = nrow(date_df)))
This repeats each userid once per row of date_df (3000 times here), then binds the resulting userid column to the date data frame, whose rows are recycled to match.
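If both inputs are data frames, another option is a cross join with base R's merge(): with by = NULL it returns the Cartesian product of the two, so the row count never has to be typed in. A minimal sketch on toy data follows; time_df and user_df are illustrative names and values, not the asker's actual files.

```r
# Toy stand-ins for the two files (names and values are illustrative)
time_df <- data.frame(
  mo   = c(11, 11, 11),
  day  = 1:3,
  year = 2015,
  date = c("11/1/2015", "11/2/2015", "11/3/2015")
)
user_df <- data.frame(userid = c(154, 155))

# by = NULL asks merge() for the Cartesian product of the two data frames
final_df <- merge(time_df, user_df, by = NULL)
print(final_df)
```

On this toy input every date row is paired with every userid, giving 3 x 2 = 6 rows. If a particular row order matters, sorting explicitly with order() afterwards is safer than relying on the order merge() happens to return.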

In SPSS you can use the cartesian product function for this:
First this recreates your example data:
data list free/mo day year (3f4) date (a12).
begin data.
11 1 2015 11/1/2015
11 2 2015 11/2/2015
11 3 2015 11/3/2015
11 4 2015 11/4/2015
11 5 2015 11/5/2015
end data.
DATASET NAME time_file.
data list free/ userid.
begin data.
154,155,157,158,159,160
end data.
DATASET NAME userid.
This will now combine the two tables like you requested:
STATS CARTPROD VAR1=userid INPUT2=time_file VAR2=mo day year date
/SAVE OUTFILE="path\your combined data.sav".


Importing Data in R

I want to import data into R, but I am getting a few errors. I downloaded my .csv file to my computer, set the working directory with setwd("C:/Users/intellipaat/Desktop/BLOG/files"), and then ran read.data <- read.csv("file1.csv"), but the console returns an error like this:
read.data<-read.csv(file1.csv)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
'file1.csv' object not found
What should I do for this? I tried the internet link route, but again I encountered a problem.
I wrote like this:
install.packages("XML")
install.packages("RCurl")
to load the packages, run the following command:
library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
tabs <- getURL(url)
and the console wrote me this error;
Error in function (type, msg, asError = TRUE) :
error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
I would be glad if you help me in this regard...
The Ease of Doing Business rankings table on Wikipedia is an HTML table, not a comma separated values file.
Loading the HTML table into an R data frame can be handled in a relatively straightforward manner with the rvest package. Instead of downloading the HTML file we can read it directly into R with read_html(), and then use html_table() to extract the tabular data into a data frame.
library(rvest)
wiki_url <- "https://en.wikipedia.org/wiki/Ease_of_doing_business_index#Ranking"
aPage <- read_html(wiki_url)
aTable <- html_table(aPage)[[2]] # second node is table of rankings
head(aTable)
...and the first few rows of output:
> head(aTable)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
1 Very Easy New Zealand 1 1 1 1 2 2 3 3 3
2 Very Easy Singapore 2 2 2 2 1 1 1 1 1
3 Very Easy Hong Kong 3 4 5 4 5 3 2 2 2
4 Very Easy Denmark 4 3 3 3 3 4 5 5 5
5 Very Easy South Korea 5 5 4 5 4 5 7 8 8
6 Very Easy United States 6 8 6 8 7 7 4 4 4
2011 2010 2009 2008 2007 2006
1 3 2 2 2 2 1
2 1 1 1 1 1 2
3 2 3 4 4 5 7
4 6 6 5 5 7 8
5 16 19 23 30 23 27
6 5 4 3 3 3 3
>
Next, we confirm that the last countries were read correctly: Libya, Yemen, Venezuela, Eritrea, and Somalia.
> tail(aTable,n=5)
Classification Jurisdiction 2020 2019 2018 2017 2016 2015 2014 2013 2012
186 Below Average Libya 186 186 185 188 188 188 187 N/A N/A
187 Below Average Yemen 187 187 186 179 170 137 133 118 99
188 Below Average Venezuela 188 188 188 187 186 182 181 180 177
189 Below Average Eritrea 189 189 189 189 189 189 184 182 180
190 Below Average Somalia 190 190 190 190 N/A N/A N/A N/A N/A
2011 2010 2009 2008 2007 2006
186 N/A N/A N/A N/A N/A N/A
187 105 99 98 113 98 90
188 172 177 174 172 164 120
189 180 175 173 171 170 137
190 N/A N/A N/A N/A N/A N/A
Finally, we use tidyr and dplyr to convert the data to narrow format tidy data for subsequent analysis.
library(dplyr)
library(tidyr)
aTable %>%
  # convert years 2017 - 2020 to character because pivot_longer()
  # requires all columns to be of same data type
  mutate_at(3:6,as.character) %>%
  pivot_longer(-c(Classification,Jurisdiction),
               names_to="Year",values_to="Rank") %>%
  # convert Rank and Year to numeric values (introducing NA values)
  mutate_at(c("Rank","Year"),as.numeric) -> rankings
head(rankings)
...and the output:
> head(rankings)
# A tibble: 6 x 4
Classification Jurisdiction Year Rank
<chr> <chr> <dbl> <dbl>
1 Very Easy New Zealand 2020 1
2 Very Easy New Zealand 2019 1
3 Very Easy New Zealand 2018 1
4 Very Easy New Zealand 2017 1
5 Very Easy New Zealand 2016 2
6 Very Easy New Zealand 2015 2
>
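As for the first error in the question, the console echo shows read.csv(file1.csv) without quotes, and in that form R looks for an object named file1.csv instead of a file on disk. A minimal sketch of the quoted form is below; it writes its own throwaway CSV into a temporary directory, so the path and the contents are assumptions made purely for illustration.

```r
# Write a small throwaway CSV so the example is self-contained
csv_path <- file.path(tempdir(), "file1.csv")
write.csv(data.frame(id = 1:3, value = c("a", "b", "c")),
          csv_path, row.names = FALSE)

# The file name must be a character string; checking that the file exists
# first separates path problems from parsing problems
stopifnot(file.exists(csv_path))
read.data <- read.csv(csv_path)
print(read.data)
```

With setwd() pointing at the right folder, read.csv("file1.csv") with the quotes in place should behave the same way.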

Running regression many times of panel data [closed]

I want to run a regression on my panel data, which is in the following format: column 1 has the year, column 2 has the company name, and column 3 has the Equity variable.
Year company name EQUITY
2006 A 12
2007 A 13
2008 A 23
2009 A 24
2010 A 13
2011 A 14
2012 A 12
2013 A 14
2014 A 14
2015 A 15
2006 B 221
2007 B 242
2008 B 262
2009 B 250
2010 B 400
2011 B 411
2012 B 420
2013 B 420
2014 B 422
2015 B 450
I have 10 years of data for 200 companies. I want to regress the log of equity of each company on time (the 10 years). I want only the slope coefficient.
I want my output like this:
Column 1-years column 2-company name column 3- beta values
Year company name slope(beta) p-value
2006 A beta value (assumed)
2007 A "
2008 A "
2009 A
2010 A
2011 A
2012 A "
2013 A
2014 A "
2015 A "
I mean the slope coefficient of each company.
Can't see what you've tried so far so here's a solution to get you up and running. The final output you sketch out doesn't really make sense since you have a slope for each company - not for each company for each year.
Here's a base R version for running the regressions. by is used to split the data and then lm for the estimation. Note that the +0 in the formula fits each line without an intercept.
res <- by(indata, indata$company, FUN=function(x) { coef(lm(log(EQUITY) ~ Year+0, data=x))} )
This results in the following output of the slopes and the output can be used for plotting or listing
> res
indata$company: A
[1] 0.001344837
-------------------------------------------------------
indata$company: B
[1] 0.002896053
Update
If you want to add the slopes to the dataset for each year, you can add
indata$slope <- res[indata$company]
which gives
> indata
Year company EQUITY slope
1 2006 A 12 0.001344837
2 2007 A 13 0.001344837
3 2008 A 23 0.001344837
4 2009 A 24 0.001344837
5 2010 A 13 0.001344837
6 2011 A 14 0.001344837
7 2012 A 12 0.001344837
8 2013 A 14 0.001344837
9 2014 A 14 0.001344837
10 2015 A 15 0.001344837
11 2006 B 221 0.002896053
12 2007 B 242 0.002896053
13 2008 B 262 0.002896053
14 2009 B 250 0.002896053
15 2010 B 400 0.002896053
16 2011 B 411 0.002896053
17 2012 B 420 0.002896053
18 2013 B 420 0.002896053
19 2014 B 422 0.002896053
20 2015 B 450 0.002896053
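Because Year+0 drops the intercept, the coefficients above are slopes of lines forced through the origin. If the goal is the usual trend of log equity over time, plus the p-value the question asks for, a variant with an intercept can be sketched as follows. It uses the sample rows from the question; fit_one, slopes and merged are hypothetical names introduced here for illustration.

```r
# Sample panel taken from the question (two companies, ten years each)
indata <- data.frame(
  Year    = rep(2006:2015, times = 2),
  company = rep(c("A", "B"), each = 10),
  EQUITY  = c(12, 13, 23, 24, 13, 14, 12, 14, 14, 15,
              221, 242, 262, 250, 400, 411, 420, 420, 422, 450)
)

# Fit log(EQUITY) ~ Year with an intercept for one company and
# return the slope on Year together with its p-value
fit_one <- function(d) {
  est <- summary(lm(log(EQUITY) ~ Year, data = d))$coefficients
  data.frame(company = d$company[1],
             slope   = est["Year", "Estimate"],
             p_value = est["Year", "Pr(>|t|)"])
}

# One row per company, then attach the results to every yearly row
slopes <- do.call(rbind, lapply(split(indata, indata$company), fit_one))
merged <- merge(indata, slopes, by = "company")
print(slopes)
```

The slope is then the estimated change in log equity per year for each company, repeated on each of that company's rows in merged.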

How can I have a predictive Termination & Active model?

I am a newbie in R. I have a dataset indexed by year and month. Active is the number of stores in the enterprise network. Termination is the number of stores that left the network. With these two variables, I can calculate the turnover: Termination / (Active + Termination) / (number of days in the month). Example: Jan. 2013, Active = 5936, Termination = 100, Turnover = 1,75%
My question is: with the dataset below, how can I project the Active and Termination numbers until 12-2015?
Is it possible to have a view of the scenario?
Dataset:
Year Month Active Termination To (%)
2013 1 5936 100 1,75%
2013 2 6182 190 3,21%
2013 3 6501 117 1,91%
2013 4 6675 92 1,43%
2013 5 6749 111 1,67%
2013 6 6719 145 2,20%
2013 7 6814 121 1,83%
2013 8 6854 90 1,34%
2013 9 6972 99 1,45%
2013 10 7320 99 1,42%
2013 11 7606 98 1,33%
2013 12 7976 155 1,99%
2014 1 7934 87 1,11%
2014 2 8079 127 1,61%
2014 3 8198 125 1,56%
2014 4 8135 154 1,91%
2014 5 8113 136 1,70%
2014 6 8095 173 2,17%
2014 7 8131 220 2,76%
2014 8 7950 135 1,72%
2014 9 7978 108 1,38%
2014 10 8117 199 2,51%
2014 11 8269 117 1,45%
2014 12 8471 177 2,11%
2015 1 8472 132 1,59%
2015 2 8591 117 1,39%
2015 3 8691 161 1,90%
2015 4 8647 126 1,48%
2015 5 8623 123 1,45%
2015 6 8739 177 2,07%
2015 7 8740 218 2,55%
2015 8 8548 35 0,41%

diff operation within a group, after a dplyr::group_by()

Let's say I have this data.frame (with 3 variables)
ID Period Score
123 2013 146
123 2014 133
23 2013 150
456 2013 205
456 2014 219
456 2015 140
78 2012 192
78 2013 199
78 2014 133
78 2015 170
Using dplyr I can group them by ID and keep only the IDs that appear more than once:
data <- data %>% group_by(ID) %>% filter(n() > 1)
Now, what I would like to achieve is to add a column that is:
Difference = Score of Period P - Score of Period P-1
to get something like this:
ID Period Score Difference
123 2013 146
123 2014 133 -13
456 2013 205
456 2014 219 14
456 2015 140 -79
78 2012 192
78 2013 199 7
78 2014 133 -66
78 2015 170 37
It is rather trivial to do this in a spreadsheet, but I have no idea on how I can achieve this in R.
Thanks for any help or guidance.
Here is another solution using lag. Depending on the use case it might be more convenient than diff, because the NAs clearly show that a particular value did not have a predecessor, whereas a 0 using diff might be the result of a) a missing predecessor or b) the subtraction between two periods.
data %>% group_by(ID) %>% filter(n() > 1) %>%
mutate(
Difference = Score - lag(Score)
)
# ID Period Score Difference
# 1 123 2013 146 NA
# 2 123 2014 133 -13
# 3 456 2013 205 NA
# 4 456 2014 219 14
# 5 456 2015 140 -79
# 6 78 2012 192 NA
# 7 78 2013 199 7
# 8 78 2014 133 -66
# 9 78 2015 170 37
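For comparison, the same column can be sketched in base R with diff(), padding each group with a leading NA so the result lines up with the rows. This assumes the rows are already sorted by Period within each ID, as in the question; counts and multi are names introduced for the illustration.

```r
# Sample data from the question
data <- data.frame(
  ID     = c(123, 123, 23, 456, 456, 456, 78, 78, 78, 78),
  Period = c(2013, 2014, 2013, 2013, 2014, 2015, 2012, 2013, 2014, 2015),
  Score  = c(146, 133, 150, 205, 219, 140, 192, 199, 133, 170)
)

# Keep only the IDs that occur more than once
counts <- ave(seq_along(data$ID), data$ID, FUN = length)
multi  <- data[counts > 1, ]

# Within each ID, diff() gives Score minus the previous Score;
# the leading NA marks the first period, which has no predecessor
multi$Difference <- ave(multi$Score, multi$ID,
                        FUN = function(s) c(NA, diff(s)))
print(multi)
```

The Difference column comes out as NA, -13, NA, 14, -79, NA, 7, -66, 37, matching the lag-based output above.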

Barplot beside issue

I have managed to aggregate some data into the following:
Month Year Number
1 1 2011 3885
2 2 2011 3713
3 3 2011 6189
4 4 2011 3812
5 5 2011 916
6 6 2011 3813
7 7 2011 1324
8 8 2011 1905
9 9 2011 5078
10 10 2011 1587
11 11 2011 3739
12 12 2011 3560
13 1 2012 1790
14 2 2012 1489
15 3 2012 1907
16 4 2012 1615
I am trying to create a barplot where the bars for the months are next to each other, so for the above example January through April will have two bars (one for 2011 and one for 2012) and the remaining months will only have one bar representing 2011.
I know I have to use beside=T, but I guess I need to create some sort of matrix in order to get the barplot to display properly. I am having an issue figuring out what that step is. I have a feeling it may involve matrix but for some reason I am completely stumped to what seems like a very simple solution.
Also, I have this data: y=c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec') which I would like to use in my names.arg. When I try to use it with the above data it tells me undefined columns selected which I am taking to mean that I need 16 variables in y. How can I fix this?
To use barplot you need to rearrange your data:
dat <- read.table(text = " Month Year Number
1 1 2011 3885
2 2 2011 3713
3 3 2011 6189
4 4 2011 3812
5 5 2011 916
6 6 2011 3813
7 7 2011 1324
8 8 2011 1905
9 9 2011 5078
10 10 2011 1587
11 11 2011 3739
12 12 2011 3560
13 1 2012 1790
14 2 2012 1489
15 3 2012 1907
16 4 2012 1615",sep = "",header = TRUE)
y <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
barplot(rbind(dat$Number[1:12],c(dat$Number[13:16],rep(NA,8))),
beside = TRUE,names.arg = y)
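The hand-built rbind() depends on knowing which rows belong to which year. A more general way to get the same years-by-months matrix is tapply(), sketched below under the assumption that each Year/Month pair occurs at most once; heights is a name introduced here.

```r
# Same data as above, built directly as a data frame
dat <- data.frame(
  Month  = c(1:12, 1:4),
  Year   = rep(c(2011, 2012), times = c(12, 4)),
  Number = c(3885, 3713, 6189, 3812, 916, 3813, 1324, 1905,
             5078, 1587, 3739, 3560, 1790, 1489, 1907, 1615)
)
y <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Rows = years, columns = months; Year/Month pairs with no data become NA
heights <- tapply(dat$Number, list(dat$Year, dat$Month), sum)

# One group of bars per month, one bar per year within each group
barplot(heights, beside = TRUE, names.arg = y,
        legend.text = rownames(heights))
```

Since heights has 12 columns, the 12-element y fits names.arg directly, which also avoids the mismatch described in the question.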
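Or you can use ggplot2 with the data pretty much as is:
library(ggplot2)
dat$Year <- factor(dat$Year)
dat$Month <- factor(dat$Month)
# stat = "identity" plots the Number values themselves rather than row counts
ggplot(dat,aes(x = Month,y = Number,fill = Year)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_x_discrete(labels = y)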
