Convert specific values into columns in R

I have a data table that looks like this:
ID time somevalues change
001 12:33 13 NA
002 12:34 27 speed: 34
003 12:35 45 width: 127
004 12:36 41 NA
005 12:37 44 height: 19.2
006 12:35 45 NA
007 12:36 49 speed: 35
008 12:37 44 speed: 27
009 12:38 45 NA
010 12:39 44 NA
011 12:40 44 height: 18, speed: 28
012 12:41 40 NA
013 12:42 44 height: 18.1
014 12:43 55 width: 128.1
015 12:44 41 NA
... ... ... ...
The table contains various measurements from a sensor. Some measurements were only entered when they changed, and all of these change entries were written into the same column. What I need is a data table that looks like this:
ID time somevalues speed height width
001 12:33 13 34 19.1 128
002 12:34 27 34 19.1 128
003 12:35 45 34 19.1 127
004 12:36 41 34 19.1 127
005 12:37 44 34 19.2 127
006 12:35 45 34 19.2 127
007 12:36 49 35 19.2 127
008 12:37 44 27 19.2 127
009 12:38 45 27 19.2 127
010 12:39 44 27 19.2 127
011 12:40 44 28 18 127
012 12:41 40 28 18 127
013 12:42 44 28 18.1 127
014 12:43 55 28 18.1 128.1
015 12:44 41 28 18.1 128.1
... ... ... ... ... ...
I need the data in this format to analyze and visualize it.
Is there a way to do that in R without using multiple if statements?

Does this work for you?
library(dplyr)
# create data - had to remove the spaces in 'change' to read the table, but that shouldn't make a difference.
data_temp = read.table(text = "
ID time somevalues change
001 12:33 13 NA
002 12:34 27 speed:34
003 12:35 45 width:127
004 12:36 41 NA
005 12:37 44 height:19.2
006 12:35 45 NA
007 12:36 49 speed:35
008 12:37 44 speed:27
009 12:38 45 NA
010 12:39 44 NA
011 12:40 44 height:18,speed:28
012 12:41 40 speed:29,width:120.1
013 12:42 44 height:18.1,speed:30,width:50
014 12:43 55 width:128.1
015 12:44 41 NA"
, header = T, stringsAsFactors = F)
data_wanted = select(data_temp, ID, time, somevalues)
speed = which(grepl("speed:", data_temp$change)) # rows that contain a speed update
speed_string = gsub(".*speed:", "", data_temp$change[speed]) # remove everything before the speed value
speed_string = gsub(",.*", "", speed_string) # remove everything after the speed value
# set the speed variable via a loop
# speed contains the positions of the rows with speed-update information,
# so from row 1 to speed[1]-1 we don't know anything about speed yet and it stays NA;
# from position speed[1] to speed[2]-1 it takes the value of speed_string[1], and so on.
data_wanted$speed = NA
for(i in 1:length(speed))
{
  current = speed[i] # position of the speed-update information
  till_next = ifelse(i < length(speed), speed[i+1]-1, NROW(data_wanted)) # until the row before the next speed update, or the end of the data frame
  data_wanted$speed[current:till_next] = as.numeric(speed_string[i]) # set the value for this stretch of rows
}
data_wanted
cbind(data_wanted, data_temp$change)
# ID time somevalues speed data_temp$change
# 1 1 12:33 13 NA <NA>
# 2 2 12:34 27 34 speed:34
# 3 3 12:35 45 34 width:127
# 4 4 12:36 41 34 <NA>
# 5 5 12:37 44 34 height:19.2
# 6 6 12:35 45 34 <NA>
# 7 7 12:36 49 35 speed:35
# 8 8 12:37 44 27 speed:27
# 9 9 12:38 45 27 <NA>
# 10 10 12:39 44 27 <NA>
# 11 11 12:40 44 28 height:18,speed:28
# 12 12 12:41 40 29 speed:29,width:120.1
# 13 13 12:42 44 30 height:18.1,speed:30,width:50
# 14 14 12:43 55 30 width:128.1
# 15 15 12:44 41 30 <NA>
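For all three measurements at once, a tidyverse reshape avoids writing the loop once per variable. A minimal sketch, assuming the same data_temp as above (separate_rows, separate, pivot_wider and fill are all from tidyr):
library(dplyr)
library(tidyr)
data_temp |>
  separate_rows(change, sep = ",") |>                      # one row per key:value pair
  separate(change, into = c("key", "value"), sep = ":") |> # split "speed:34" into key and value
  mutate(value = as.numeric(value)) |>
  pivot_wider(names_from = key, values_from = value) |>    # one column per measurement
  select(-any_of("NA")) |>                                 # drop the column produced by rows with no change entry
  fill(speed, height, width, .direction = "down")          # carry the last known value forward
Note that rows before the first update of a measurement stay NA under this approach; the desired output in the question implies known starting values (34, 19.1, 128), which would have to be supplied separately.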

Related

Converting month column table to chronological order in R

I have a table of the following format:
[screenshot: the initial table, one row per year with a column for each month]
And I'm seeking an output resembling the following:
Date           Value
January 1659   Value 1
February 1659  Value 2
March 1659     Value 3
April 1659     Value 4
and so on (numerical representations of the month and year are perfectly fine also).
I've attempted using merge operations but I'm thinking there must be an easier way (possibly using packages). I've found somewhat similar questions asked but none obviously applicable yet.
You can use pivot_longer and unite, both from the tidyr package:
library(tidyr)
pivot_longer(df, -Year) |>
  unite(date, name, Year, sep = " ")
#> # A tibble: 120 x 2
#> date value
#> <chr> <int>
#> 1 Jan 1659 68
#> 2 Feb 1659 97
#> 3 Mar 1659 89
#> 4 Apr 1659 74
#> 5 May 1659 44
#> 6 Jun 1659 2
#> 7 Jul 1659 81
#> 8 Aug 1659 22
#> 9 Sep 1659 87
#> 10 Oct 1659 1
#> # ... with 110 more rows
Data used
set.seed(1)
df <- cbind(1659:1668, replicate(12, sample(99, 10))) |>
as.data.frame() |>
setNames(c("Year", month.abb))
df
#> Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
#> 1 1659 68 97 89 74 44 2 81 22 87 1 76 43
#> 2 1660 39 85 37 42 25 45 13 93 83 43 39 1
#> 3 1661 1 21 34 38 70 18 40 28 90 59 24 29
#> 4 1662 34 54 99 20 39 22 89 48 48 26 53 78
#> 5 1663 87 74 44 28 51 78 48 33 64 15 92 22
#> 6 1664 43 7 79 96 42 65 96 45 94 58 86 70
#> 7 1665 14 73 33 44 6 70 23 21 60 29 40 28
#> 8 1666 82 79 84 87 24 87 84 31 51 24 83 37
#> 9 1667 59 98 35 70 32 93 29 17 34 42 90 61
#> 10 1668 51 37 70 40 14 75 98 73 10 48 35 46
Created on 2022-11-29 with reprex v2.0.2
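Since the question asks for chronological order, note that the united strings sort alphabetically, not by date. A small follow-up sketch (my addition, assuming an English locale so that %b matches the month abbreviations): prepend a day of month and parse each string into a proper Date, which then sorts correctly.
library(dplyr)
pivot_longer(df, -Year) |>
  unite(date, name, Year, sep = " ") |>
  mutate(date = as.Date(paste("01", date), format = "%d %b %Y")) |> # "01 Jan 1659" -> Date
  arrange(date)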

R Points in Polygon

Was wondering if you could help me with the following. I am trying to count the number of points that fall within each US state polygon. There are 52 states total. The point data and the polygon data are both in the same projection.
I can run the function:
over(Transformed.States, clip.points)
Which returns:
0 1 2 3 4 5 6 7 8 9 10
4718 NA 488 2688 4454 3762 2041 NA 5 NA 3620
11 12 13 14 15 16 17 18 19 20 21
412 3042 2028 3390 2755 4250 3275 2484 466 4255 1
22 23 24 25 26 27 28 29 30 31 32
3238 744 4125 2926 927 495 3541 4640 3039 895 620
33 34 35 36 37 38 39 40 41 42 43
4069 4671 3801 1012 4023 626 1158 4627 217 13 4055
44 45 46 47 48 49 50 51
573 3456 NA 4670 4505 903 4172 4641
However, I want each polygon to be given a value based on the number of points inside it, which can then be plotted, e.g.:
plot(points.in.state)
What would be the best way to go about this, so that I still have the polygon data but with the new point-in-polygon counts attached?
The end goal of this is to make a graduated symbol map for each state based on the values for points in each state.
Thanks!
Jim
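The question has no recorded answer here; a minimal sketch of one common sp-based approach, using the object names from the question (whether clip.points is a SpatialPoints or SpatialPointsDataFrame is an assumption; geometry() handles both):
library(sp)
# For each state polygon, over(..., returnList = TRUE) gives the indices of
# the points falling inside it; lengths() turns that list into per-polygon counts.
hits <- over(Transformed.States, geometry(clip.points), returnList = TRUE)
Transformed.States$npoints <- lengths(hits)
# A quick choropleth of the counts; a graduated symbol map could instead
# scale symbol sizes at the polygon centroids by npoints.
spplot(Transformed.States, "npoints")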

System exactly singular with pgmm (package plm)

I am trying to run a pgmm regression (Arellano Bond estimator) following the example online with the EmplUK dataset.
My dataset is unbalanced, with some missing values (which I also tried removing, without any difference). This is pasted from the R data frame.
row.names ID Year p I
1 23 1 1992 NA NA
2 22 1 1993 17.01 NA
3 21 1 1994 15.86 NA
4 20 1 1995 17.02 7.512347
5 19 1 1996 20.64 7.685104
6 18 1 1997 19.11 12.730282
7 17 1 1998 12.76 12.633871
8 16 1 1999 17.90 7.416381
9 15 1 2000 28.66 6.396114
10 14 1 2001 24.46 9.213729
11 13 1 2002 24.99 20.117159
12 12 1 2003 28.85 11.117816
13 11 1 2004 38.26 11.242638
14 10 1 2005 54.57 13.015168
15 9 1 2006 65.16 18.507212
16 8 1 2007 72.44 18.875281
17 7 1 2008 96.94 24.459170
18 6 1 2009 61.74 21.332035
19 5 1 2010 79.61 17.119038
20 4 1 2011 111.26 16.941914
21 3 1 2012 111.63 19.964875
22 2 1 2013 108.56 28.863894
23 1 1 2014 99.03 15.182615
24 45 2 1993 17.01 NA
25 44 2 1994 15.86 NA
26 43 2 1995 17.02 NA
27 42 2 1996 20.64 NA
28 41 2 1997 19.11 NA
29 40 2 1998 12.76 NA
30 39 2 1999 17.90 11.428262
31 38 2 2000 28.66 20.232613
32 37 2 2001 24.46 25.811754
33 36 2 2002 24.99 18.959958
34 35 2 2003 28.85 20.767074
35 34 2 2004 38.26 29.260406
36 33 2 2005 54.57 25.837434
37 32 2 2006 65.16 32.675618
38 31 2 2007 72.44 48.415190
39 30 2 2008 96.94 42.444435
40 29 2 2009 61.74 40.047462
41 28 2 2010 79.61 49.090816
42 27 2 2011 111.26 53.828050
43 26 2 2012 111.63 61.684020
44 25 2 2013 108.56 68.394140
45 24 2 2014 99.03 55.738584
46 76 3 1984 NA NA
47 75 3 1985 NA NA
48 74 3 1986 NA NA
49 73 3 1987 18.53 NA
50 72 3 1988 14.91 NA
51 71 3 1989 18.23 NA
52 70 3 1990 23.76 17.046268
53 69 3 1991 20.04 30.191128
54 68 3 1992 19.32 30.414108
55 67 3 1993 17.01 27.916000
56 66 3 1994 15.86 26.437651
57 65 3 1995 17.02 25.895513
58 64 3 1996 20.64 26.791996
59 63 3 1997 19.11 30.074375
60 62 3 1998 12.76 42.636103
61 61 3 1999 17.90 46.862510
62 60 3 2000 28.66 30.154079
63 59 3 2001 24.46 30.297644
64 58 3 2002 24.99 34.851205
65 57 3 2003 28.85 38.854943
66 56 3 2004 38.26 37.542447
67 55 3 2005 54.57 38.456399
68 54 3 2006 65.16 43.465535
69 53 3 2007 72.44 41.749414
70 52 3 2008 96.94 48.371262
71 51 3 2009 61.74 54.914470
72 50 3 2010 79.61 65.444964
73 49 3 2011 111.26 76.888119
74 48 3 2012 111.63 81.833602
75 47 3 2013 108.56 83.800483
76 46 3 2014 99.03 79.713947
My code is the following:
data <- plm.data(Autoregression,index=c("ID","Year"))
Panel <- subset(data, !is.na(I) )
Are <- pgmm(I ~ p + lag(I, 0:1) | lag(I, 2:99),
            data = Panel, effect = "twoways", model = "onestep")
I have also tried many other versions, including every possible number of lags, shorter or longer. I suppose the problem is related to the lag function inside pgmm, which for some reason does not create the lags and simply pastes the variable again and again, obviously making the matrix singular. I have also tried creating proper lags in Excel and then importing the text file, using the lagged variables from Excel instead of the lag function. Unfortunately, I am not sure about the pgmm syntax, and again it didn't work.
The error is the following :
Error in solve.default(crossprod(WX, t(crossprod(WX, A1)))) :
Lapack routine dgesv: system is exactly singular: U[3,3] = 0
Additionally: Warning message:
In pgmm(I ~ lag(I, 1) + p | lag(I, 2:10), Panel, effect = "twoways", :
the first-step matrix is singular, a general inverse is used
Can you please help me?
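No answer was recorded for this question. One hedged way to test the hypothesis about lag() is to build a pdata.frame (the current replacement for the deprecated plm.data) and check whether plm's lag method actually shifts the series within each ID:
library(plm)
pPanel <- pdata.frame(Panel, index = c("ID", "Year"))
# If lag() is dispatching correctly, the second column should be the first
# shifted by one year within each ID, with NA at each ID's first observation.
head(cbind(I = pPanel$I, lag_I = lag(pPanel$I, 1)), 10)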

R, correlation in p-values

Quite new to R and spending a lot of time solving issues...
I have a big table (named mydata) containing more than 14k columns. This is a short view...
Latitude comp48109 comp48326 comp48827 comp49708 comp48407 comp48912
59.8 21 29 129 440 23 13
59.8 18 23 32 129 19 34
59.8 19 27 63 178 23 27
53.1 21 28 0 0 26 10
53.1 15 21 129 423 25 36
53.1 18 44 44 192 26 42
48.7 14 32 0 0 17 42
48.7 11 26 0 0 20 33
48.7 24 37 0 0 26 20
43.6 34 40 1 3 23 4
43.6 19 28 0 1 26 33
43.6 19 35 0 0 14 3
41.4 22 67 253 1322 15 4
41.4 44 39 0 0 11 14
41.4 24 41 63 174 12 4
39.5 21 45 102 291 12 17
39.5 17 26 69 300 16 79
39.5 13 46 151 526 14 14
Although I managed to get the correlation scores for the first column ("Latitude") against all the others with
corrScores <- cor(Latitude, mydata[2:14429])
I need to get a list of the p-values by applying the function cor.test(x, y, ...)$p.value.
How can I do that without getting the error 'x' and 'y' must have the same length?
You can use sapply:
sapply(mydata[-1], function(y) cor.test(mydata$Latitude, y)$p.value)
# comp48109 comp48326 comp48827 comp49708 comp48407 comp48912
# 0.331584624 0.020971913 0.663194866 0.544407919 0.005375973 0.656831836
Here, mydata[-1] means: All columns of mydata except the first one.
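A small variation on the same idea (my addition, not part of the original answer): collect the correlation estimate and the p-value together in one pass, so the scores and p-values stay aligned by column.
res <- t(sapply(mydata[-1], function(y) {
  ct <- cor.test(mydata$Latitude, y)
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
}))
head(res) # one row per comp* column: its correlation with Latitude and the p-value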

Generating Stacked bar plots

I have a dataframe with 3 columns
$x -- at http://pastebin.com/SGrRUJcA
$y -- at http://pastebin.com/fhn7A1rj
$z -- at http://pastebin.com/VmVvdHEE
that I wish to use to generate a stacked barplot. All of these columns hold integer data. The stacked barplot should have the levels along the x-axis and the data for each level along the y-axis. The stacks should then correspond to each of $x, $y and $z.
UPDATE: I now have the following:
library(ggplot2)
counted <- data.frame(table(myDf$x), variable = 'x')
counted <- rbind(counted, data.frame(table(myDf$y), variable = 'y'))
counted <- rbind(counted, data.frame(table(myDf$z), variable = 'z'))
counted <- counted[counted$Var1 != 0, ] # to get rid of the 0th level??
stackedBp <- ggplot(counted, aes(x = Var1, y = Freq, fill = variable))
stackedBp <- stackedBp + geom_bar(stat = 'identity') + scale_x_discrete('Levels') + scale_y_continuous('Frequency')
stackedBp
which generates:
[plot: stacked bar chart of the counts, with the x-axis levels out of order]
Two issues remain:
the x-axis labeling is not correct. For some reason it goes 46, 47, 53, 54, 38, 40, ... How can I order it naturally?
I also wish to remove the 0th label.
I've tried using +scale_x_discrete(breaks = 0:50, labels = 1:50) but this doesn't work.
NB. axis labeling issue: Dataframe column appears incorrectly sorted
Not completely sure what you're wanting to see... but reading ?barplot says the first argument, height, must be a vector or matrix. So to fix your initial error:
myDf <- data.frame(x=sample(1:10,100,replace=T),y=sample(11:20,100,replace=T),z=1:10)
barplot(as.matrix(myDf))
If you provide a reproducible example and a more specific description of your desired output you can get a better answer.
Or if I were to guess wildly (and use ggplot)...
myDf <- data.frame(x=sample(1:10,100,replace=T),y=sample(11:20,100,replace=T),z=1:10)
myDf.counted<- data.frame(table(myDf$x),variable='x')
myDf.counted <- rbind(myDf.counted,data.frame(table(myDf$y),variable='y'))
myDf.counted <- rbind(myDf.counted,data.frame(table(myDf$z),variable='z'))
ggplot(myDf.counted,aes(x=Var1,y=Freq,fill=variable))+geom_bar(stat='identity')
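As for the ordering issue in the question's update (which applies equally to myDf.counted here), a hedged sketch, not from the original answers: data.frame(table(...)) stores the tabulated values in Var1 as a factor, and rbind-ing tables with different level sets leaves the combined levels out of numeric order. Re-leveling Var1 numerically before plotting makes the x-axis follow natural order; applied after the Var1 != 0 filter, it also drops the leftover "0" level.
myDf.counted$Var1 <- factor(myDf.counted$Var1,
                            levels = sort(unique(as.numeric(as.character(myDf.counted$Var1)))))
ggplot(myDf.counted, aes(x = Var1, y = Freq, fill = variable)) +
  geom_bar(stat = 'identity')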
I'm surprised that didn't blow up in your face. Cross-classifying the joint occurrence of three different vectors, each of length 35204, would often consume many gigabytes of RAM (and would possibly create lots of useless 0's, as you found). Maybe you wanted to examine instead the results of sapply(myDf, table)? This creates three separate tables of counts.
It's a rather irregular result and would need further work to get it into matrix form, but you might want to consider using densityplot to display the comparative distributions, which I think is your goal.
$x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
126 711 1059 2079 3070 2716 2745 3329 2916 2671 2349 2457 2055 1303 892 692
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
559 799 482 299 289 236 156 145 100 95 121 133 60 34 37 13
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
15 12 56 10 4 7 2 14 13 28 30 20 16 62 74 58
49 50
40 15
$y
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3069 32 1422 1376 1780 1556 1937 1844 1967 1699 1910 1924 1047 894 975 865
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
635 1002 710 908 979 848 678 908 696 491 417 412 499 411 421 217
32 33 34 35 36 37 39 42 46 47 53 54
265 182 121 47 38 11 2 2 1 1 1 4
$z
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
31 202 368 655 825 1246 900 1136 1098 1570 1613 1144 1107 1037 1239 1372
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1306 1085 843 867 813 1057 1213 1020 1210 939 725 644 617 602 739 584
32 33 34 35 36 37 38 39 40 41 42 43
650 733 756 681 684 657 544 416 220 48 7 1
The density plot is really simple to create in lattice:
library(lattice)
densityplot(~ x + y + z, data = myDf)
